<a href="https://colab.research.google.com/github/rcdbe/sma-online/blob/master/day-3/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

*Social Media Analytics Workshop - Telkom University*


---



# Text Classification

Text classification is the process of assigning tags or categories to text according to its content. It’s one of the fundamental tasks in Natural Language Processing (NLP) with broad applications such as sentiment analysis, topic labeling, spam detection, and intent detection.

In [0]:
# Import Library
import nltk

## Sentiment Analysis

Sentiment analysis classifies the polarity of a text (positive, negative, or neutral) using text analysis techniques.

### Predefined Model (English Only)

In [0]:
# Install Library
! pip install vaderSentiment

In [0]:
# Import Module
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [0]:
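# Create Sentiment Analysis Function
analyzer = SentimentIntensityAnalyzer()

def sentiment_analyzer_scores(sentence):
    # polarity_scores returns a dict with 'neg', 'neu', 'pos' and 'compound' scores
    score = analyzer.polarity_scores(sentence)
    # Print the sentence, left-aligned and padded with '-' to 40 characters, then the scores
    print("{:-<40} {}".format(sentence, score))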

In [0]:
# Input English Text
text_en = 'The death toll from the coronavirus has reached 28 in South Korea with 600 newly confirmed cases, raising the national tally to 4,812 cases, the South Korean Centers for Disease Control and Prevention (KCDC) said in a news release Tuesday.'
text_en

In [0]:
# Detect the Text Sentiment
sentiment_analyzer_scores(text_en)
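
The dictionary printed by the cell above holds `neg`, `neu`, `pos`, and `compound` scores. To reduce it to a single label, the VADER documentation suggests thresholds of ±0.05 on the `compound` score. The following is a minimal sketch of that rule; `label_from_scores` is a hypothetical helper name, and the example dictionaries are made-up values, not real VADER output.

```python
def label_from_scores(scores, threshold=0.05):
    """Map a VADER-style score dict to 'positive', 'negative' or 'neutral'."""
    compound = scores["compound"]
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Made-up score dictionaries, for illustration only
examples = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.72},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.61},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
]
for scores in examples:
    print(scores["compound"], label_from_scores(scores))
```

With the real analyzer, this could be called as `label_from_scores(analyzer.polarity_scores(text_en))`.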

### Train New Model

In [0]:
# Import Library
import pandas as pd 

In [0]:
# Import Modules
from sklearn.feature_extraction.text import CountVectorizer # to create Bag of words
from sklearn.model_selection import train_test_split  # for splitting data
from sklearn.naive_bayes import GaussianNB # to build classifier model
from sklearn.preprocessing import LabelEncoder # to convert classes to number 
from sklearn.metrics import accuracy_score # to calculate accuracy

In [0]:
# Import Train Data
df_grab = pd.read_csv('https://raw.githubusercontent.com/rcdbe/sma-online/master/day-3/Source/grab-tweet.csv', sep = ';')
df_grab.head()

In [0]:
# Count the Sentiment
df_grab.sentiment.value_counts()

In [0]:
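# Feature Extraction (Bag of Words)
count_vector = CountVectorizer(max_features = 1500)  # keep only the 1,500 most frequent words
grab_feature = count_vector.fit_transform(df_grab['text']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
grab_feature_matrix = pd.DataFrame(grab_feature, columns=count_vector.get_feature_names_out())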
grab_feature_matrix.head()
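
To make the feature matrix above less of a black box, here is a minimal pure-Python sketch of what a bag-of-words count matrix contains. It only approximates `CountVectorizer`: it lowercases the text and keeps tokens of two or more word characters, but it does not apply `max_features`. The function name `bag_of_words` and the toy sentences are hypothetical and are not taken from the Grab dataset.

```python
import re
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count matrix) for a list of strings."""
    # Lowercase and keep tokens of two or more word characters
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    # Alphabetically sorted vocabulary: one column per distinct word
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One row per document, holding the count of each vocabulary word
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

# Toy corpus (hypothetical)
docs = ["grab is fast", "grab is late late"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)  # ['fast', 'grab', 'is', 'late']
print(matrix)      # [[1, 1, 1, 0], [0, 1, 1, 2]]
```

Each row is one document and each column is one word, which is the same layout as `grab_feature_matrix`.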

In [0]:
# Encode Target
encoder = LabelEncoder()
grab_label = encoder.fit_transform(df_grab['sentiment'])
grab_label
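
The encoding step above maps each class name to an integer, and the same mapping is reused later to turn predictions back into class names. A minimal sketch of the idea, assuming (as `LabelEncoder` does) that classes are numbered in sorted order; `fit_label_encoder` and the sample labels are hypothetical.

```python
def fit_label_encoder(labels):
    """Return (sorted class list, class -> integer mapping)."""
    classes = sorted(set(labels))
    mapping = {label: index for index, label in enumerate(classes)}
    return classes, mapping

# Hypothetical labels, for illustration only
labels = ["positive", "negative", "neutral", "positive"]
classes, mapping = fit_label_encoder(labels)

encoded = [mapping[label] for label in labels]  # class names -> integers
decoded = [classes[index] for index in encoded]  # integers -> class names

print(classes)  # ['negative', 'neutral', 'positive']
print(encoded)  # [2, 0, 1, 2]
print(decoded == labels)  # True
```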

In [0]:
# Set Training and Testing Data (70:30)
feature_train, feature_test, target_train, target_test = train_test_split(grab_feature, grab_label, shuffle = True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)
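
The split above shuffles the rows and then holds out 30% of them for testing. The sketch below shows the same idea with the standard library only. It is not the algorithm `train_test_split` uses internally, so the rows it selects will differ; `simple_split` and the toy data are hypothetical.

```python
import random

def simple_split(features, targets, test_size=0.3, seed=1):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)  # fixed seed makes the split repeatable
    n_test = int(round(test_size * len(indices)))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [targets[i] for i in train_idx]
    y_test = [targets[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data (hypothetical): 10 rows, each with one feature
features = [[i] for i in range(10)]
targets = [i % 2 for i in range(10)]
x_train, x_test, y_train, y_test = simple_split(features, targets)
print(len(x_train), len(x_test))  # 7 3
```

Fixing the seed plays the same role as `random_state=1`: rerunning the cell gives the same split.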

In [0]:
# Train Naive Bayes Model
nb = GaussianNB().fit(feature_train, target_train)

# Predict on Test Data
target_predicted = nb.predict(feature_test)
target_predicted

In [0]:
print('Test model accuracy: ',accuracy_score(target_test, target_predicted))
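
The accuracy reported above is the fraction of test rows whose predicted class equals the true class. A minimal sketch of that calculation; the `accuracy` function and the label lists are hypothetical examples, not results from the Grab data.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Hypothetical encoded labels: 4 of the 5 predictions are correct
y_true = [0, 1, 2, 2, 1]
y_pred = [0, 1, 1, 2, 1]
print(accuracy(y_true, y_pred))  # 0.8
```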

In [0]:
# Input New Statement
new_statement = ['saya suka grab']  # Indonesian for "I like Grab"

# Extract Features
new_statement_features = count_vector.transform(new_statement).toarray()

# Decode the Predicted Class Back to Its Label
predict_sentiment = encoder.inverse_transform(nb.predict(new_statement_features))
print(new_statement[0], 'sentiment: ', predict_sentiment[0])

## Hate Speech Detection

Hate speech detection classifies text as hate speech or non-hate speech. Some methods also identify the characteristic being targeted (i.e., the type of hate, such as race or religion).

In [0]:
# Import Library
import pandas as pd 

In [0]:
# Import Modules
from sklearn.feature_extraction.text import CountVectorizer # to create Bag of words
from sklearn.model_selection import train_test_split  # for splitting data
from sklearn.naive_bayes import GaussianNB # to build classifier model
from sklearn.preprocessing import LabelEncoder # to convert classes to number 
from sklearn.metrics import accuracy_score # to calculate accuracy

In [0]:
# Import Train Data
df_hs = pd.read_csv('https://raw.githubusercontent.com/rcdbe/sma-online/master/day-3/Source/data_hs.csv', sep = ";")
df_hs

In [0]:
# Count the Labels
df_hs.label.value_counts()

In [0]:
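# Feature Extraction (Bag of Words)
count_vector = CountVectorizer(max_features = 1500)  # keep only the 1,500 most frequent words
hs_feature = count_vector.fit_transform(df_hs['text']).toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available since 1.0
hs_feature_matrix = pd.DataFrame(hs_feature, columns=count_vector.get_feature_names_out())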
hs_feature_matrix.head()

In [0]:
# Encode Target
encoder = LabelEncoder()
hs_label = encoder.fit_transform(df_hs['label'])
hs_label

In [0]:
# Set Training and Testing Data (70:30)
feature_train, feature_test, target_train, target_test = train_test_split(hs_feature, hs_label, shuffle = True, test_size=0.3, random_state=1)

# Show the Training and Testing Data
print(feature_train.shape)
print(feature_test.shape)
print(target_train.shape)
print(target_test.shape)

In [0]:
# Train Naive Bayes Model
nb = GaussianNB().fit(feature_train, target_train)

# Predict on Test Data
target_predicted = nb.predict(feature_test)
target_predicted

In [0]:
print('Test model accuracy: ',accuracy_score(target_test, target_predicted))
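
If the label counts shown earlier are uneven, overall accuracy can look high even when the rarer class is often missed. One way to check is to compute recall per class, that is, the fraction of each true class that was predicted correctly. The sketch below uses made-up labels, and `per_class_recall` is a hypothetical helper, not part of scikit-learn.

```python
from collections import Counter

def per_class_recall(y_true, y_pred):
    """For each true class, the fraction of its rows that were predicted correctly."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / total[label] for label in sorted(total)}

# Made-up encoded labels: class 0 is common, class 1 is rare
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(per_class_recall(y_true, y_pred))  # {0: 1.0, 1: 0.5}
```

Here 5 of 6 predictions are correct overall, yet only half of the rare class is recovered. With the notebook's variables, the call would be `per_class_recall(list(target_test), list(target_predicted))`.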

In [0]:
# Input New Statement
new_statement = ['Ku Cinta Dia']  # Indonesian for "I love him/her"

# Extract Features
new_statement_features = count_vector.transform(new_statement).toarray()

# Decode the Predicted Class Back to Its Label
predict_label = encoder.inverse_transform(nb.predict(new_statement_features))
print(new_statement[0], 'label: ', predict_label[0])