# Text Classification

Text classification for sentiment analysis is one of the most common tasks in natural language processing. It consists of computationally analysing text messages and telling whether the underlying sentiment is positive, negative or neutral. Let's create a simple text classifier using tweets from the first 2016 GOP Presidential Debate, available online on [Kaggle](https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis/data).

In [None]:
import pandas as pd # import pandas
tweets_data_df = pd.read_csv("https://raw.githubusercontent.com/emmanueliarussi/DataScienceCapstone/master/2_TextClassification/data/Sentiment.csv") # load data
tweets_data_df = tweets_data_df.reset_index()

## Exploratory Data Analysis
Let's take a look at the content of this dataset.

In [None]:
pd.set_option("display.max_rows", 5)  # only display up to 5 rows when printing dataframes (reduce visual clutter)
tweets_data_df

How are the sentiment labels distributed? How much can we trust those labels?

In [None]:
import seaborn as sns                  # import seaborn for visualization
import matplotlib.pyplot as plt

# bar chart to show the number of tweets tagged as Positive/Neutral/Negative
f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(x="sentiment", order=["Negative","Neutral","Positive"], data=tweets_data_df, palette="PiYG", ax=ax1)
ax1.set_title("Sentiment Label Counts")
ax1.set_xlabel("Sentiment")
ax1.set_ylabel("Count")

# scatter plot showing the label confidence per tweet,
# useful for discarding unreliable tweets
sns.scatterplot(x="id", y="sentiment_confidence", hue="sentiment_confidence", palette="viridis",
                sizes=(5,), linewidth=0, alpha=0.7,legend=False, data=tweets_data_df, ax=ax2)
ax2.set_title("Tweet Label Confidence")
ax2.set_xlabel("Tweet ID")
ax2.set_ylabel("Confidence")

## Data Filtering
Based on the figures above, we filter the data used to build our classifier and keep only the `text` and `sentiment` columns. Additionally, we remove rows with `sentiment_confidence` of 0.5 or below. Since we will be performing binary classification, we remove `Neutral` tweets and replace the `Negative` and `Positive` strings with integer labels. Finally, a large imbalance in the number of labeled samples per category can severely affect the performance of a classifier. Therefore, we randomly remove negative tweets to match the number of positive samples in our data set.

In [None]:
import numpy as np

# keeping only rows with sentiment_confidence >0.5
tweets_data_df = tweets_data_df[tweets_data_df["sentiment_confidence"]>0.5] 

# remove random negative tweets
total_negative = tweets_data_df["sentiment"].value_counts()["Negative"] # count positive and negative rows
total_positive = tweets_data_df["sentiment"].value_counts()["Positive"]
to_remove = np.random.choice(tweets_data_df[tweets_data_df['sentiment']=="Negative"].index, # select random rows
                             size=total_negative-total_positive,
                             replace=False)
tweets_data_df = tweets_data_df.drop(to_remove) # remove

# keeping only the necessary columns [['text','sentiment']]
tweets_data_df = tweets_data_df[["text","sentiment"]]
# remove neutral tweets (copy to avoid chained-assignment issues on a slice)
tweets_data_df = tweets_data_df[tweets_data_df["sentiment"]!="Neutral"].copy()
# replace sentiment string labels with integers
tweets_data_df["sentiment"] = tweets_data_df["sentiment"].map({"Negative": 0, "Positive": 1})
tweets_data_df
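
The balancing step can also be looked at in isolation. The following is a minimal sketch of random undersampling on a toy list of labels, using only the standard library; the helper name `undersample` and the toy data are illustrative assumptions and are not part of the notebook.

```python
import random

def undersample(labels, seed=0):
    """Return sorted indices that keep every class at the size of the smallest one."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = min(len(indices) for indices in by_class.values())
    kept = []
    for indices in by_class.values():
        # draw `target` indices per class without replacement
        kept.extend(rng.sample(indices, target))
    return sorted(kept)

toy_labels = ["Negative"] * 6 + ["Positive"] * 2   # hypothetical, imbalanced labels
kept = undersample(toy_labels)
balanced = [toy_labels[i] for i in kept]
print(balanced.count("Negative"), balanced.count("Positive"))  # 2 2
```

Undersampling discards data from the majority class, so it is a reasonable choice mainly when the minority class is still large enough to train on.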

Let's check the resulting label distribution using a bar chart.

In [None]:
# bar chart to show the number of tweets labeled 0 (Negative) and 1 (Positive)
f, (ax1) = plt.subplots(1, 1, figsize=(5, 5))
sns.countplot(x="sentiment", order=[0,1], data=tweets_data_df, palette="PiYG", ax=ax1)
ax1.set_title("Sentiment Label Counts (after filtering)")
ax1.set_xlabel("Sentiment")
ax1.set_ylabel("Count")

Let's now prepare the tweet text so that our classifier can interpret it. Using tools from [NLTK](https://www.nltk.org/), we first remove *stop words*: words that carry too little meaning to be useful to our classifier. These words are usually also filtered out of search queries because they return a vast amount of unnecessary information (e.g. the, is, at, which, on, for, this). We also remove usernames, the `#` symbol of hashtags, and URLs. NLTK also provides a tweet *tokenizer* that splits tweet strings into lists of words. Additionally, we employ a *stemmer* to remove morphological affixes from words, leaving only the word stem (e.g. "running" -> "run", "generously" -> "gener"; note that a stem is not always a dictionary word). Finally, we also remove common emojis and punctuation.

In [None]:
from nltk.corpus   import stopwords 
from nltk.tokenize import TweetTokenizer
from nltk.stem     import PorterStemmer
import nltk
import string
import re

# Set of emojis to remove from tweets
emojis = set([ ':-)', ':)', ';)', ':o)', ':]', ':3', ':c)', ':>', '=]', '8)', '=)', ':}',
               ':^)', ':-D', ':D', '8-D', '8D', 'x-D', 'xD', 'X-D', 'XD', '=-D', '=D',
               '=-3', '=3', ':-))', ":'-)", ":')", ':*', ':^*', '>:P', ':-P', ':P', 'X-P',
               'x-p', 'xp', 'XP', ':-p', ':p', '=p', ':-b', ':b', '>:)', '>;)', '>:-)','<3',
               ':L', ':-/', '>:/', ':S', '>:[', ':@', ':-(', ':[', ':-||', '=L', ':<',
               ':-[', ':-<', '=\\', '=/', '>:(', ':(', '>.<', ":'-(", ":'(", ':\\', ':-c',
               ':c', ':{', '>:\\', ';(', '🇺🇸'])
 
# English stop words set
nltk.download('stopwords')
stopwords_english  = set(stopwords.words("english"))
    
# Tokenizer 
tokenizer    = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)

def clean_tweet(tweet):
    
    # regular expression to remove "RT" from tweets
    tweet = re.sub(r'^RT[\s]+', '', tweet)
 
    # regular expression to remove URLs (stops at the next whitespace so the rest of the tweet is kept)
    tweet = re.sub(r'https?://\S+', '', tweet)
    
    # regular expression to remove "#" from tweets
    tweet = re.sub(r'#', '', tweet)
    
    # "@" is left in place on purpose: the tokenizer (strip_handles=True)
    # needs it to recognise and drop whole @username handles
 
    # tokenize
    tweet_tokens = tokenizer.tokenize(tweet)
    
    # PorterStemmer
    stemmer = PorterStemmer()
 
    tweet_clean = []    
    for word in tweet_tokens:
        if (word not in stopwords_english and     # no stopwords
            word not in emojis and                # no emojis
            word not in string.punctuation):      # no punctuation
                stem_word = stemmer.stem(word)    # stemming word
                tweet_clean.append(stem_word)
 
    return tweet_clean

# get filtered clean words for each tweet
tweets_data_df["text"] = tweets_data_df["text"].apply(clean_tweet)
tweets_data_df
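
To see what the regular-expression part of the cleaning does before any tokenizing or stemming, here is a small standard-library sketch. The helper `strip_tweet_markup`, the sample tweet, and the URL pattern that stops at the next whitespace are assumptions made for illustration; handles are deliberately left in place, since removing them is the tokenizer's job.

```python
import re

def strip_tweet_markup(tweet):
    """Illustrative helper (not part of the notebook): regex-only cleanup."""
    tweet = re.sub(r'^RT\s+', '', tweet)        # drop a leading retweet marker
    tweet = re.sub(r'https?://\S+', '', tweet)  # drop URLs up to the next whitespace
    tweet = re.sub(r'#', '', tweet)             # keep the hashtag word, drop the symbol
    return " ".join(tweet.split())              # collapse leftover whitespace

sample = "RT @user: Great #GOPDebate tonight! https://t.co/abc123 more text"
print(strip_tweet_markup(sample))
# @user: Great GOPDebate tonight! more text
```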

## Building a sentiment classifier
First, we identify and create a dictionary of unique words using the [`CountVectorizer`](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class from scikit-learn. In order to train a classifier, we need to create a vector identifying each tweet; this is called a feature vector. The vectorizer helps us translate each word into a unique integer code, extracting the *vocabulary* of our dataset.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

phrases    = [' '.join(x) for x in tweets_data_df["text"].values ] # All the phrases in our tweets
vectorizer = CountVectorizer() 
vectorizer = vectorizer.fit(phrases)
print("Vocabulary size:{}".format(len(vectorizer.vocabulary_)))
print("Vocabulary content:{}".format(vectorizer.vocabulary_))

Using scikit-learn's [`train_test_split`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) function, we now split the dataset into a training set and a testing set. The test set holds about 10% of the filtered dataset.

In [None]:
# function for splitting data to train and test sets
from sklearn.model_selection import train_test_split            

# Splitting the dataset into train and test set
train, test = train_test_split(tweets_data_df,test_size = 0.1)

In [None]:
train

In [None]:
test

Now, we need to transform the tokens in each partition into feature vectors. We will use our vectorizer to transform each tweet into a [*bag of words*](https://en.wikipedia.org/wiki/Bag-of-words_model).

In [None]:
X_train              = [' '.join(x) for x in train['text'].values ] 
X_train_bag_of_words = vectorizer.transform(X_train)
y_train              = train['sentiment'].values.astype(int)

X_test              = [ ' '.join(x) for x in test['text'].values ] 
X_test_bag_of_words = vectorizer.transform(X_test)
y_test              = test['sentiment'].values.astype(int)

Printing instances from the training dataset after the *bag of words* transformation, we see pairs like `(0, 628)	1`. The first number identifies the phrase index (the tweet), the second number is the word code assigned by our vectorizer, and the last number is the count of occurrences of that word.

In [None]:
# Printing bag-of-words representation
print(X_train_bag_of_words[0])
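
To make the `(row, word code)  count` notation concrete, the sketch below builds the same kind of triples by hand for two toy phrases, using only the standard library. The function `bag_of_words` and the toy phrases are illustrative assumptions; word codes are assigned here in alphabetical order of the distinct words, and the real vectorizer's tokenization rules are not reproduced.

```python
from collections import Counter

def bag_of_words(phrases):
    """Build a vocabulary and per-phrase (row, word_code, count) triples."""
    distinct_words = sorted({word for phrase in phrases for word in phrase.split()})
    vocabulary = {word: code for code, word in enumerate(distinct_words)}
    triples = []
    for row, phrase in enumerate(phrases):
        counts = Counter(phrase.split())
        # emit one triple per distinct word in the phrase, ordered by word code
        for word in sorted(counts, key=vocabulary.get):
            triples.append((row, vocabulary[word], counts[word]))
    return vocabulary, triples

toy_phrases = ["great debat great night", "bad debat"]  # hypothetical stemmed tweets
vocabulary, triples = bag_of_words(toy_phrases)
print(vocabulary)  # {'bad': 0, 'debat': 1, 'great': 2, 'night': 3}
for row, code, count in triples:
    print((row, code), count)
```

Only words that actually occur in a phrase produce a triple, which is why this representation is stored as a sparse matrix.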

Finally, we fit a [Random Forest Classifier](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html) model.

In [None]:
from sklearn.ensemble import RandomForestClassifier
gnb = RandomForestClassifier(class_weight="balanced")
gnb.fit(X_train_bag_of_words.toarray(), y_train)

Let's compute standard classification metrics.

In [None]:
from sklearn.metrics import confusion_matrix, balanced_accuracy_score, accuracy_score, classification_report

y_pred_test  = gnb.predict(X_test_bag_of_words.toarray())

print(" <================= Testing dataset metrics =================> ")
print("Number of mislabeled test points out of a total {} points : {}".format(X_test_bag_of_words.toarray().shape[0],(y_test != y_pred_test).sum()))
print("Confusion matrix:\n",confusion_matrix(y_test, y_pred_test),"\n")
print("Balanced accuracy score:",balanced_accuracy_score(y_test, y_pred_test))
print("Full Report:\n",classification_report(y_test, y_pred_test))
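
As a rough guide to reading these numbers, the sketch below recomputes a binary confusion matrix, plain accuracy and balanced accuracy by hand on toy labels. The function `binary_metrics` and the toy predictions are assumptions for illustration; the matrix uses rows for true labels and columns for predicted labels, and the code assumes both classes appear in `y_true`.

```python
def binary_metrics(y_true, y_pred):
    """Confusion matrix, accuracy and balanced accuracy for labels 0 and 1."""
    tn = fp = fn = tp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 0 and pred == 0:
            tn += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tp += 1
    recall_negative = tn / (tn + fp)  # fraction of true 0s predicted as 0
    recall_positive = tp / (tp + fn)  # fraction of true 1s predicted as 1
    return {
        "confusion": [[tn, fp], [fn, tp]],
        "accuracy": (tn + tp) / len(y_true),
        "balanced_accuracy": (recall_negative + recall_positive) / 2,
    }

toy_true = [0, 0, 0, 0, 1, 1]  # hypothetical ground truth
toy_pred = [0, 0, 0, 1, 1, 0]  # hypothetical predictions
metrics = binary_metrics(toy_true, toy_pred)
print(metrics["confusion"])          # [[3, 1], [1, 1]]
print(metrics["balanced_accuracy"])  # 0.625
```

Balanced accuracy averages the recall of each class, so on skewed test sets it is less flattering than plain accuracy to a model that favours the majority class.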