# SISU Digital Humanities: Textual and Language Analysis on Social Media<br />
### Session 5: Classification and Sentiment Analysis
Created by Tom van Nuenen (tom.van_nuenen@kcl.ac.uk) <br />


#  Classification

This notebook introduces some common methods for **classification**.

First, we’ll be using a Naive Bayes Classifier to differentiate between different "classes" of data. We will look at some tools for sentiment analysis. Finally, we'll combine these two methods, so we can see if we can predict positive and negative posts or comments in a Reddit dataset.

After working through today's notebook, you will:

1. understand how to use a **Naive Bayes classifier** using social data;
2. understand how to use a pre-trained sentiment analysis tool using **NLTK VADER**;
3. understand how to train a Naive Bayes classifier for a **supervised class of sentiments** (i.e., creating and training your own sentiment analysis tool).

Let's get started.


## Introduction to Naive Bayes classifiers

A Naive Bayes classifier is a machine learning algorithm that uses **Bayes’ Theorem** to predict the class that a sample belongs to, given a number of features that describe that sample. It is based on the concept of conditional probability: “What is the probability of X, given Y?” In our case, that becomes: "What is the probability of a post belonging to a category, given its word frequencies?" The math behind Bayes' Theorem is simple and intuitive. For more info, check out http://www.dealingdata.net/2016/07/24/PoGo-Series-NaiveBayesClassifier/.
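To make the conditional-probability question concrete, here is a small worked example of Bayes' Theorem. All numbers are invented for illustration: we assume that 30% of posts are about space, and that the word "orbit" shows up in 20% of space posts but in only 1% of all other posts.

```python
# Worked example of Bayes' Theorem with invented numbers (illustration only).
p_space = 0.30                 # prior: P(space)
p_orbit_given_space = 0.20     # likelihood: P("orbit" | space)
p_orbit_given_other = 0.01     # likelihood: P("orbit" | not space)

# Total probability of seeing the word "orbit" in any post
p_orbit = p_orbit_given_space * p_space + p_orbit_given_other * (1 - p_space)

# Posterior: P(space | "orbit") = P("orbit" | space) * P(space) / P("orbit")
p_space_given_orbit = p_orbit_given_space * p_space / p_orbit
print(round(p_space_given_orbit, 3))
```

Seeing a single telling word moves the probability of "space" from the 30% prior to roughly 90%. A Naive Bayes classifier repeats this kind of update for every word in a post.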

Naive Bayes classifiers are often considered the baseline for classification tasks. They have worked quite well in many real-world situations, famously document classification and spam filtering. Mathematically speaking, they are very straightforward, so you can use them when you have limited resources in terms of CPU and memory. Further, when training time is a crucial factor, a Naive Bayes classifier is convenient, as it can be trained very quickly.

One peculiar aspect of Naive Bayes classifiers is that they assume that the features you use are *conditionally independent*: given the class, knowledge of the outcome of one feature does not grant us knowledge of the outcome of any other feature. When dealing with text, this means that we’re not looking at entire sentences, but rather at individual words. So for our purposes, “this was a fun party” is the same as “a fun party this was” and “party fun a was this”. This is a pretty naive assumption (separate features are often correlated), hence the name Naive Bayes classifier. Yet it turns out the algorithm often performs just as well as much more complex machine learning models.
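A quick way to see what "ignoring word order" means is to count words with Python's `collections.Counter`. This is only a sketch of the bag-of-words idea; the helper name `bag_of_words` is made up for this example.

```python
from collections import Counter

def bag_of_words(sentence):
    """Lowercase, split on whitespace, and count each word (word order is discarded)."""
    return Counter(sentence.lower().split())

a = bag_of_words("this was a fun party")
b = bag_of_words("a fun party this was")
c = bag_of_words("party fun a was this")

# All three sentences contain the same words the same number of times
print(a == b == c)
```

To the classifier, these three sentences are indistinguishable, because all it ever sees are the counts.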

In [None]:
# sklearn
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
from sklearn.feature_extraction import text
stop = text.ENGLISH_STOP_WORDS

# NLTK 
import nltk
from nltk import FreqDist
from nltk.classify import apply_features
from nltk.corpus import stopwords
from nltk.classify import accuracy
from nltk import NaiveBayesClassifier
from nltk.metrics import precision as prec
from nltk.metrics import recall as rec
from nltk.metrics import f_measure as fmeas
nltk.download('stopwords')
nltk.download('punkt') 

# pickle for saving
import pickle

# SpaCy 
import spacy
!spacy download en_core_web_sm
import en_core_web_sm
nlp = en_core_web_sm.load(disable=['parser', 'ner'])

# VADER - installing using magic command `!pip install vaderSentiment` (command line script)
!pip install vaderSentiment
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Gensim's preprocessor
from gensim.utils import simple_preprocess

# General data science
import numpy as np
import re
import pandas as pd
import collections
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns 
%matplotlib inline
from IPython.display import clear_output

## Test using 20 Newsgroups

The 20 newsgroups dataset consists of around 18000 newsgroups posts on 20 topics. It's helpfully split in two subsets: one for training (i.e. development), and one for testing (i.e. performance evaluation). 

We will use a Naive Bayes classifier to **predict the topics of the posts in our test set**, based on the **word frequencies** in the posts in our training set. 

First, let's get the data. The `.target_names` attribute lists the topics that the newsgroups posts have been classified in.

In [None]:
data = fetch_20newsgroups()

In [None]:
data.target_names

Let's just get four of those to do our test with. We create a **training set** and a **test set**. The training set is, as the name implies, to train our classifier on. The test set we use to evaluate the performance of that classifier.

In [None]:
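categories = ['talk.religion.misc', 'soc.religion.christian',
              'sci.space', 'comp.graphics']
train = fetch_20newsgroups(subset='train', categories=categories)
# the test subset is used below to evaluate the classifier
test = fetch_20newsgroups(subset='test', categories=categories)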


What does the data look like?


In [None]:
test.data[5]

To convert the content of each string into a vector of numbers, we will use scikit-learn's `TfidfVectorizer` class again.

We're using sklearn's `MultinomialNB` as our classifier. **Multinomial Naive Bayes** is a specialized version of Naive Bayes that is specifically designed for text documents. Whereas *Simple Naive Bayes* would model a document as the presence and absence of particular words, *Multinomial Naive Bayes* explicitly models all the word counts and adjusts the underlying calculations. For more on the difference between these classifiers, see:
http://blog.datumbox.com/machine-learning-tutorial-the-naive-bayes-text-classifier/ 

Scikit-learn allows us to create a **pipeline** using `make_pipeline()`, which attaches this vectorizer to a multinomial Naive Bayes classifier (meaning we don't have to call these objects separately).

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

### Fitting the data

With this pipeline, we can apply the model to the training data, and then predict labels for the test data:

In [None]:
model.fit(train.data, train.target)
labels = model.predict(test.data)

In [None]:
labels[:5]

Now that we have predicted the labels for the test data, we can evaluate them to learn about the performance of the estimator. For example, here is the confusion matrix between the true and predicted labels for the test data:

In [None]:
mat = confusion_matrix(test.target, labels)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');

Evidently, even this very simple classifier based on tf-idf scores can successfully separate posts about space from posts about computers. It gets confused, however, between talk about religion and talk about Christianity – an expected area of confusion.
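If you want to see what a confusion matrix actually counts, here is a minimal from-scratch sketch. It is not how scikit-learn implements `confusion_matrix`; the helper names and the toy labels are invented for illustration.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def precision_recall(counts, label):
    """Precision and recall for one label, computed from the pair counts."""
    tp = counts[(label, label)]
    predicted = sum(n for (t, p), n in counts.items() if p == label)
    actual = sum(n for (t, p), n in counts.items() if t == label)
    precision = tp / predicted if predicted else 0.0
    recall = tp / actual if actual else 0.0
    return precision, recall

# Toy labels, invented for illustration
y_true = ["space", "space", "space", "religion", "religion", "graphics"]
y_pred = ["space", "space", "religion", "religion", "space", "graphics"]

counts = confusion_counts(y_true, y_pred)
print(counts[("space", "space")])         # space posts that were predicted as space
print(precision_recall(counts, "space"))  # (precision, recall) for the space class
```

Each cell of the heatmap above corresponds to one such (true, predicted) pair; the diagonal holds the correct predictions.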


### Prediction
With this classifier, we can now determine the category for any string, using the `predict()` method of this pipeline. Let's write a utility function that will return the prediction for a single string:

In [None]:
def predict_category(s, train=train, model=model):
    pred = model.predict([s])
    return train.target_names[pred[0]]

In [None]:
predict_category('Scientists discover new moon around Saturn')

It works!

**Remember:** this is nothing more than a simple probability model for the (weighted) frequency of each word in the string! Nevertheless, the result is striking. Even a very naive algorithm, when used carefully and trained on a large set of high-dimensional data, can be surprisingly effective.
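To show how little machinery is involved, here is a tiny multinomial Naive Bayes written from scratch. It is a teaching sketch under simplifying assumptions (raw word counts instead of tf-idf weights, whitespace tokenization, add-one smoothing), not a reimplementation of scikit-learn's `MultinomialNB`. The toy documents and function names are invented.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (text, label). Returns class counts, per-class word counts, vocabulary."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_counts, word_counts, vocab

def predict_nb(text, class_counts, word_counts, vocab):
    """Pick the class with the highest log prior plus summed log word likelihoods."""
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # add-one (Laplace) smoothing avoids log(0) for unseen word/class pairs
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

toy_docs = [
    ("the rocket reached orbit", "space"),
    ("nasa launched a new satellite", "space"),
    ("the church held a service", "religion"),
    ("they prayed in the church", "religion"),
]
model_parts = train_nb(toy_docs)
print(predict_nb("a satellite in orbit", *model_parts))
```

Working in log space keeps the product of many small probabilities from underflowing to zero, which matters once documents get longer than a few words.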

## Classification on social media

We will now use the same technique on an online community. We'll be using a community in which members can vote whether an original poster [OP], in some social interaction, behaved "assholeish". We want to explore whether we can predict this classification, based on the words that members use.

*Note: This is an example of a hypothesis that will fail. I include it because failure is an important part of the job of the data scientist. You have to get used to trying out things that will fail.*

### Preprocessing

We'll start by cleaning up our data a bit. Let's load it up.

In [None]:
# load into df
df = pd.read_csv("data/amita-submissions.csv", lineterminator='\n', encoding="utf8")
len(df)

In [None]:
# clean up empty entries
df = df.drop(['augmented_at', 'augmented_count', 'distinguish'], axis=1)
df = df[~df['selftext'].isin(['[removed]', '[deleted]' ])].dropna(subset=['selftext'])
len(df)

**Note:** if working with this entire dataset is still taking too long, consider using a slice.

In [None]:
df.head()

We can clean up some of the text in our DataFrame using a function which we `apply()` to the selftext column.

In [None]:
def clean_text(text):
    # Replace tabs and newlines with spaces so adjacent words are not glued together
    no_tabs = text.replace('\t', ' ').replace('\n', ' ')
    # Replace every character except A-Z, a-z and the dot with a space
    alphas_only = re.sub(r"[^a-zA-Z.]", " ", no_tabs)
    # Collapse runs of spaces into a single space
    multi_spaces = re.sub(r" +", " ", alphas_only)
    # Strip leading and trailing spaces
    return multi_spaces.strip()

df["selftext_clean"] = df["selftext"].apply(clean_text)

### A classifier for assholes?

Now that we've seen how classification works, let's write a classifier for this subreddit. A typical use of this would be, for instance, to classify which particular subreddit a post belongs to. Or to classify posts that should be categorized as "NSFW".

But you can also use classification to test out certain hypotheses: for instance, can we predict whether a post will be classified as "ITA" (*Is The Asshole*), given its textual features?

Luckily, the members of r/amitheasshole have done some work for us by labeling the posts. We can use these labels to train our classifier. Let's have a look at the `flair_css_class` column in our DataFrame.

In [None]:
df['flair_css_class'].value_counts()

Looks like the `flair_css_class` column contains lots of tags for asshole and non-asshole posts. Let's use this column for now. We'll remove anything that's *not* classified as `not` or `ass`.

In [None]:
df = df[~df['flair_css_class'].isin(['shitpost', '1' ])].dropna(subset=['flair_css_class'])
len(df)

The next issue is that we have an unbalanced classification: "Not the Asshole" (`not`) posts occur more than twice as often as "Asshole" (`ass`) posts. We can either **upsample** our "Asshole" category (by adding duplicates) or **downsample** our "Not the A-hole" category (by removing entries). Ideally you try both, as what works best differs from case to case, but let's downsample for now.

First, we want our DataFrame sorted. Then we create a boolean mask for the `not` rows, use `np.where` to get the positions of these rows, and `drop` the first half of them:

In [None]:
# Sorting the values
df.sort_values('flair_css_class', inplace=True)

# creating a mask for the condition
mask = (df.flair_css_class == "not")

# find out which indexes this condition refers to using np.where()
idx, = np.where(mask)

# divide by 2 and drop the indices
df.drop(df.index[idx[:len(idx)//2]], inplace=True)

Let's see how many entries for our classes we have now.


In [None]:
df['flair_css_class'].value_counts()

Looks more balanced now!
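Note that the cell above always drops the *first* half of the sorted `not` rows. A common alternative is to drop rows at random, so the rows you keep are not tied to the original order. Below is a minimal pure-Python sketch of random downsampling; the function name and toy rows are invented for this example.

```python
import random

def downsample(rows, label_of, majority_label, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    majority = [r for r in rows if label_of(r) == majority_label]
    minority = [r for r in rows if label_of(r) != majority_label]
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return minority + kept

# Toy data, invented for illustration: (post id, label)
toy_rows = [(i, "not") for i in range(8)] + [(i, "ass") for i in range(8, 11)]
balanced = downsample(toy_rows, label_of=lambda r: r[1], majority_label="not")
print(len(balanced))  # three "ass" rows plus three sampled "not" rows
```

In pandas, something like `df.groupby('flair_css_class').sample(n=..., random_state=...)` should achieve a similar effect in recent versions, though I have not run it against this dataset.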

Next, let's work on our input features. We'll start by lemmatizing and POS-filtering our texts. We'll output strings for each post which we can then load into our vectorizer.

In [None]:
stop = set(stopwords.words('english') + ['’', '“', '”', 'nbsp', 'http', 'edit'])

def lemmatization(texts, allowed_postags=('NOUN', 'VERB', 'ADJ', 'ADV')):
    """Lemmatize each text, keeping only non-stopword tokens with an allowed POS tag.

    POS tags: https://spacy.io/api/annotation
    """
    out = []
    for text in texts:
        doc = nlp(text)
        # compare the token's text (not the Token object itself) against the stopword set
        stop_out = [token for token in doc if token.text.lower() not in stop]
        out.append(' '.join(token.lemma_ for token in stop_out if token.pos_ in allowed_postags))
    return out

In [None]:
lemmas = lemmatization(df.selftext_clean)

In [None]:
lemmas[:10]

Lemmatizing takes a long time, so we don't want to do it twice. We'll save the lemmatized list in a pickle, in case we need it later. Note that you can download files from your Colab workspace by clicking the folder icon to the left and clicking "download".

In [None]:
# pickling
with open("lemmas.lem", "wb") as cp: 
    pickle.dump(lemmas, cp)

In [None]:
# unpickling
with open("lemmas.lem", "rb") as cp: 
    lemmas = pickle.load(cp)

### Commence training

Now for the classifier! We'll first create our training and test set, using the lemmas as `X` and `flair_css_class` as `y`. So we're predicting `y` given `X`.

In [None]:
X = lemmas
y = df['flair_css_class']

Next, we split our data into training and test sets again. We can easily do this using scikit-learn's `train_test_split()` function. It takes a parameter denoting the size of the test set (20% of the total set, in this case).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

Why are we splitting into training and testing sets before vectorizing?

If we vectorize *before* we train/test split, our doc-term matrix would contain every single feature (word) in the test and training sets. What we want is to simulate the real world, where our classifier needs to encounter words it has not seen before. This allows us to evaluate it better.

So here's what we're doing: 

- We create a `TfidfVectorizer` instance
- `vect.fit_transform(X_train)` learns the tf-idf vocabulary of the training data, and uses the fitted vocabulary to build a document-term matrix from the training data;
- `vect.transform(X_test)` uses the fitted vocabulary to build a document-term matrix from the testing data (and ignores tokens it hasn't seen before).
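The fit-on-train, transform-on-test idea can be sketched without scikit-learn. The example below uses plain word counts rather than tf-idf weights to keep it short, and all names and sentences are invented for illustration.

```python
from collections import Counter

def fit_vocabulary(train_docs):
    """Learn a word -> column index mapping from the training documents only."""
    words = sorted({w for doc in train_docs for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}

def to_count_vector(doc, vocabulary):
    """Turn one document into a count vector; words outside the vocabulary are ignored."""
    vector = [0] * len(vocabulary)
    for word, n in Counter(doc.lower().split()).items():
        if word in vocabulary:
            vector[vocabulary[word]] = n
    return vector

toy_train = ["my roommate ate my food", "my sister borrowed my car"]
toy_test = "my neighbour ate my cake"

vocabulary = fit_vocabulary(toy_train)
print(sorted(vocabulary))
print(to_count_vector(toy_test, vocabulary))  # "neighbour" and "cake" are dropped
```

Because the vocabulary is fixed after fitting, the test vector has exactly as many columns as the training matrix, which is what the classifier needs.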

In [None]:
vect = TfidfVectorizer(max_df=1.0, max_features=None, min_df=1) # we can add ngram_range=(1, 2) to include N-grams

# learn training data vocabulary, then use it to create a document-term matrix
X_train_dtm = vect.fit_transform(X_train)

# transform testing data (using fitted vocabulary) into a document-term matrix
X_test_dtm = vect.transform(X_test)
X_test_dtm

Time to create the classifier:

In [None]:
nb = MultinomialNB()
%time nb.fit(X_train_dtm, y_train)

That was quick! Let's see what it yields.

In [None]:
# Make class predictions for X_test_dtm
y_pred_class = nb.predict(X_test_dtm)

# Calculate accuracy of class predictions
from sklearn import metrics
metrics.accuracy_score(y_test, y_pred_class)

In [None]:
categories = ['ass', 'not']
mat = confusion_matrix(y_test, y_pred_class)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=categories, yticklabels=categories)
plt.xlabel('true label')
plt.ylabel('predicted label')

Well, that doesn't work very well!

### OPTIONAL: Classification using logistic regression

Our predictions are not particularly accurate. This is no surprise, as it would be quite a feat to predict this class based on word usage alone! If you wanted to pursue this hypothesis further, you could try to take more features for X into account.

Just for fun, let's try comparing this result to **logistic regression**, which is also often used for classification problems. Logistic regression tries to find the optimal decision boundary that best separates classes.

The difference between Naive Bayes and Logistic regression is that the first is a generative model, and the second a discriminative model. What does this mean?
- **Generative model**: Naive Bayes models the joint distribution of the feature X and target y, and then *predicts* the posterior probability given as P(y|X)
- **Discriminative model**: Logistic regression *directly models* the posterior probability of P(y|X) by learning the input to output mapping and minimizing the error.

Unlike Naive Bayes, logistic regression typically works reasonably well even when some of the features (in our case, words) are correlated. This is why it can be a good idea to try both when dealing with texts (because certain words *do* tend to appear in each other's vicinity!).

In [None]:
# instantiate a logistic regression model
logreg = LogisticRegression()

In [None]:
# train the model using X_train_dtm
%time logreg.fit(X_train_dtm, y_train)

In [None]:
# make class predictions for X_test_dtm
y_pred_class = logreg.predict(X_test_dtm)

In [None]:
# calculate predicted probabilities for X_test_dtm
y_pred_prob = logreg.predict_proba(X_test_dtm)[:, 1]
y_pred_prob

In [None]:
# calculate accuracy
metrics.accuracy_score(y_test, y_pred_class)

Still no good – but you get the idea.

### Improving the model

If you wanted to improve upon this model, you could try to extract other features from the text (for instance, based on the presence of certain words, sentence length, etc.). You could also start looking into other features from the metadata (e.g. the score a post received) to improve the prediction, though this is beyond the scope of this notebook.

What you need to know is that this process is called **feature engineering**: taking whatever information you have about your problem and turning it into numbers that you can use to build your feature matrix.

Plugging other features into a classifier could potentially make it more accurate. For now, we can say that we cannot really tell the label based only on the words that people use in their posts.

### Finding the most distinctive terms

Before we move on, let's explore why we would want an accurate model in the first place! For one, this would allow us to calculate the approximate "assholishness" of each token.

In [None]:
# store the vocabulary of X_train
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
X_train_tokens = vect.get_feature_names_out()
len(X_train_tokens)

In [None]:
# examine some tokens
print(X_train_tokens[100:150])

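Naive Bayes adds up how much each token contributes to each class. We can access that array through the `.feature_count_` attribute. (Because we fed the classifier tf-idf weights rather than raw counts, the values are summed weights and need not be whole numbers.)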


In [None]:
# note: in scikit-learn, a trailing underscore marks an attribute that was learned during fit()
nb.feature_count_ 

What are we seeing here? The rows represent our classes (asshole / not an asshole), the columns represent our tokens (our total vocabulary).


In [None]:
# rows represent classes, columns represent tokens
nb.feature_count_.shape

In [None]:
# number of times each token appears across all ASS posts
ass_token_count = nb.feature_count_[0, :]
ass_token_count

In [None]:
# number of times each token appears across all NOT posts
not_token_count = nb.feature_count_[1, :]
not_token_count

We can now create a DataFrame of tokens with their separate ASS and NOT counts.

In [None]:
tokens = pd.DataFrame({'token':X_train_tokens, 'ass':ass_token_count, 'not':not_token_count}).set_index('token')
tokens[110:120] # just a random slice

Using `.class_count_`, we can count the number of observations in each class.

In [None]:
# Naive Bayes counts the number of observations in each class
nb.class_count_

In [None]:
# add 1 to 'ass' and 'not' counts to avoid dividing by 0
tokens['ass'] = tokens['ass'] + 1
tokens['not'] = tokens['not'] + 1
tokens.sample(5, random_state=6)

In [None]:
# convert the 'ass' and 'not' counts into frequencies
tokens['ass'] = tokens['ass'] / nb.class_count_[0]
tokens['not'] = tokens['not'] / nb.class_count_[1]
tokens.sample(5, random_state=6)

In [None]:
# calculate the ratio of ass-to-not for each token
tokens['ass_ratio'] = tokens['ass'] / tokens['not']
tokens.sample(5, random_state=6)

Finally, we can examine the DataFrame sorted by ass_ratio. These are the words, according to our classifier, that are the most typical of "assholeish" posts. A well-working classifier would yield interesting info here on how telling each word is for being considered an asshole by the community. But then again, creating such a classifier is not a trivial matter.  


In [None]:
# note: use sort() instead of sort_values() for pandas 0.16.2 and earlier
tokens.sort_values('ass_ratio', ascending=False)[:10]

In [None]:
# look up the ass_ratio for a given token
tokens.loc['me', 'ass_ratio']

Looks like the word 'me' occurs 3 times as often in posts by people who are classified as assholes (according to our very imprecise classifier at least). With a precise classifier, this would be a very interesting result!
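The steps above (add one, divide by the class size, take the ratio) boil down to a single small formula per token. Here it is as a standalone function with invented numbers, so you can check the arithmetic by hand; `class_ratio` is a hypothetical helper, not part of scikit-learn.

```python
def class_ratio(count_a, count_b, n_docs_a, n_docs_b):
    """Ratio of smoothed per-class frequencies for one token (class a relative to class b)."""
    freq_a = (count_a + 1) / n_docs_a  # add 1 so a zero count never causes division by zero
    freq_b = (count_b + 1) / n_docs_b
    return freq_a / freq_b

# Invented numbers: a token seen 29 times in 100 'ass' posts and 9 times in 100 'not' posts
print(class_ratio(29, 9, 100, 100))

# A token never seen in either class gets a neutral ratio
print(class_ratio(0, 0, 50, 50))
```

A ratio above 1 means the token is relatively more frequent in the first class, and a ratio below 1 means the opposite.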

### Training for more classes

We've trained for 2 classes now (asshole and non-asshole), but the posts in this subreddit (as well as many others) actually have more labels, called **flair**. We can find them here in the `flair_text` column.

In [None]:
df.flair_text.unique()[:100]

Quite a lot of categories, turns out. Let's see what the most-frequent categories are, as well as their counts:

In [None]:
df['flair_text'].value_counts()

There are a bunch of categories which we could concatenate in order to work with more data. This might make our classifier more accurate! For now, let's move on.

# Automated sentiment analysis using NLTK VADER

The second half of this notebook is about sentiment analysis (sometimes known as "opinion mining"). The most common use of sentiment analysis is to classify a text into a class (called "sentiment classification"). We'll get to that further on. First, let's see how sentiment analysis itself works.

The simplest option is to use pre-built libraries for sentiment analysis. TextBlob and NLTK VADER are two examples of such libraries. VADER (Valence Aware Dictionary and sEntiment Reasoner) is specifically built for social media texts, and takes multiple text features into account, such as:

- **Punctuation** (e.g. an exclamation mark "!" increases the magnitude of intensity)
- **Capitalization** (e.g. "The food here is GREAT" is more intense than "The food here is great")
- **Degree modifiers** (e.g. "The service here is extremely good" is more intense than "The service here is good")
- **Conjunctions** signal a shift in sentiment polarity, with the sentiment of the text following the conjunction being dominant (e.g. “The food here is great, but the service is horrible”)
- **Preceding Tri-grams** of a sentiment-laden lexical feature catch 90% of cases where negation flips the polarity of the text (e.g. "The food here isn’t really all that great”)
- **Emoji, emoticons & slang** are parsed very well by VADER.

In [None]:
# instantiate Class
analyser = SentimentIntensityAnalyzer()

In [None]:
# a small test
analyser.polarity_scores("""This is really dumb. I don't want to use this stupid 
                          program. But I slept really well.""")

VADER spits out several metrics we can use. The **Positive**, **Negative** and **Neutral** scores represent the proportion of text that falls in these categories. This means our sentence was rated as 35% Positive, 50% Neutral and 15% Negative. All these add up to 1.

The **Compound** score is a metric that calculates the sum of all the lexicon ratings which have been normalized between -1 (most extreme negative) and +1 (most extreme positive). The compound score turns out to be 0.61: a pretty positive sentiment overall. Let's use that score for now (but be aware that you can use these `neg`, `neu` and `pos` scores to your advantage to train a more precise classifier!)
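How does a sum of word ratings end up between -1 and +1? To my understanding, the open-source `vaderSentiment` code squashes the summed score with `x / sqrt(x² + alpha)` using `alpha = 15`. Treat both the formula and the constant as an assumption to check against the library source; the sketch below only illustrates the shape of such a normalization.

```python
import math

def normalize(score, alpha=15):
    """Squash an unbounded summed valence score into the open interval (-1, 1)."""
    return score / math.sqrt(score * score + alpha)

# Larger raw sums move towards -1 or +1 but never reach them
for raw in (-8.0, -1.0, 0.0, 1.0, 8.0):
    print(raw, round(normalize(raw), 3))
```

The practical consequence is that long, strongly worded texts saturate near the extremes, so compound scores are not directly proportional to the amount of emotional language.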

Now let's write a function that takes a text in, and spits out one of 3 tags based on the compound score, as well as that compound score.

In [None]:
def sentiment_analyzer_scores(text):
    sentiment_dict = analyser.polarity_scores(text)
    # decide sentiment as positive, negative and neutral 
    if sentiment_dict['compound'] >= 0.01 : 
        tag = "pos"
    elif sentiment_dict['compound'] <= - 0.01 : 
        tag = "neg"
    else : 
        tag = "neut" 
    return [tag, sentiment_dict['compound']]

Let's append the output to 2 new columns in our dataframe. `.apply()` allows us to apply a function along an axis of the DataFrame. Also note the use of `zip()` here, allowing us to map the 2 returned values of our function to the tuple of variables we assign them to. Finally, note the `*` to unpack these containers.

In [None]:
df["sent"], df["sent_compound"] = zip(*df.selftext_clean.apply(sentiment_analyzer_scores))
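If the `zip(*...)` idiom looks cryptic, here it is in isolation on a plain list. The function `tag_and_score` is a made-up stand-in for any function that returns two values per input.

```python
def tag_and_score(n):
    """Stand-in for a function that returns two values per input (hypothetical)."""
    return ["pos" if n > 0 else "neg", n / 10]

results = [tag_and_score(n) for n in (3, -5, 7)]  # a list of [tag, score] pairs
tag_col, score_col = zip(*results)                # transpose the pairs into two tuples
print(tag_col)
print(score_col)
```

The `*` feeds each pair to `zip()` as a separate argument, and `zip()` then groups all first elements together and all second elements together.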

Let's have a look:

In [None]:
df.sent.value_counts()

In [None]:
df.hist(column="sent_compound",bins=50)

Let's see if these scores make sense. We'll print out the comment with the lowest score (note: we could also use `.idxmax()` to find the comment with the highest score).

In [None]:
min_id = df.sent_compound.idxmin()
df.selftext_clean[min_id]

That does look like a comment with lots of negative language.

### Optional assignment: working with sentiment scores

You could use these sentiment scores for all kinds of purposes. 

First, you could look at comments with a high or low sentiment, and see what kinds of topics or concerns are discussed. 

You could train a Naive Bayes classifier (using the code we used above) to see whether sentiments in the posts tell you anything about what `flair_text` class that post falls under. You could also run a **regression analysis** to see if a post's sentiment score (X) is a predictor for something like the user score it received (y). Please see the additional notebook I will share, "5-X Regression", if you're interested in that.

You could also combine it with your **topic modeling** output, if that output is interesting. You could create a column in your DataFrame that tells you the most-associated topic with each post/comment (using the notebook from week 3), then see if certain topics relate to certain sentiments.

## Training a sentiment analyzer using Naive Bayes

The problem with unsupervised approaches such as the one above is that they typically have poor performance compared to a supervised approach, where we create the labels ourselves.

Let's implement a Naive Bayes Classifier, based on our own sentiment labels. These labels will be based on **whether a commenter evaluates a post positively or negatively**. This requires a training set of samples which have known features and known classes. We can manually label a subset of our comments as positive or negative to form our training set. We will also use a different feature extraction technique this time (rather than using tfidf counts).

So we need to:

1. Create our own sentiment labels for part of our dataset, based on clear rules;
2. Apply feature extraction techniques;
3. Train a classifier for those labels, which we'll base on feature extraction.

In [None]:
# assumes `drive` is an authenticated PyDrive GoogleDrive object created earlier (Colab only)
downloaded = drive.CreateFile({'id':"1-LYy48UhoADWLlZKteH9zQT2myxQLiDC"})   
downloaded.GetContentFile('amita-comments_small.csv')

In [None]:
# load into df
df_com = pd.read_csv("data/amita-comments.csv", lineterminator='\n')

In [None]:
# Get rid of empty values and reset index
df_com = df_com[~df_com['body'].isin(['[removed]', '[deleted]' ])].dropna(subset=['body']).reset_index(drop=True)
len(df_com)

Let's start labeling ourselves. If you want to be able to classify with some accuracy, you should do **at least a few thousand (!) posts or comments**. This'll take a couple of hours. But you'll have a labeled dataset you can use for all kinds of things!

In [None]:
# copy the slice so we can add a column later without a SettingWithCopyWarning
df_com_labeled = df_com.iloc[1:1000].copy()

In [None]:
df_com_labeled[:5]

Now, we'll loop over the rows, display the comment for each row, and ask the user to label it.

In [None]:
# Classifications:
# p: overall positive evaluation of OP
# n: overall negative evaluation of OP
# x: cannot determine evaluation of OP, or comment is made by OP

for index, row in df_com_labeled.iterrows():
    print(row.body)
    df_com_labeled.loc[index, 'sentiment'] = input()
    clear_output()

In [None]:
df_com_labeled[:5]

In [None]:
# save your labels to csv
df_com_labeled.to_csv('df_com_Sentiment_Labeled.csv')

(I've cheated here and randomly assigned some tags to the corpus in the cell below, just so you can see how this works)

In [None]:
tags = ['p', 'n', 'x']
df_com_labeled["sentiment"] = np.random.choice(tags, len(df_com_labeled), p=[0.3, 0.6, 0.1])

In [None]:
df_com_labeled = df_com_labeled.reset_index(drop=True)

### Train, test, and cross-validation sets

When building our sentiment analyzer, we need to create a test set again, in order to prevent overestimating its performance and overconstraining it. We will use the test set to evaluate the performance of our sentiment analyzer once it is complete.

When designing our sentiment analyzer, we’ll have a number of choices to make along the way (which were made for us when using VADER). How many features should we use? Should we include the capitalization of words as a feature? All of these **hyperparameters** will affect the performance of our sentiment analyzer. To maximize performance, it's best to evaluate our classifier under different hyperparameter settings. We could do this with the training set itself, but then we run into the same risk of overconstraining the analyzer that we discussed earlier. Using the test set would mean we'd be tuning our classifier to the test set, defeating the purpose of setting that data aside in the first place.

So we'll set another portion of the original training set aside: the **cross-validation** set. We can use it to evaluate the performance of our sentiment analyzer as we tune our hyperparameters. This means the final evaluation of our analyzer’s performance will be a true representation of how it will perform on unlabeled comments from our full collection.

First, let's have a look at how much we've done.

In [None]:
pos_comments = [(df_com_labeled.loc[row,'body'],'positive') for row in range(len(df_com_labeled)) if 
              df_com_labeled.loc[row,'sentiment'] == 'p']

neg_comments = [(df_com_labeled.loc[row,'body'],'negative') for row in range(len(df_com_labeled)) if 
              df_com_labeled.loc[row,'sentiment'] == 'n']

print('Number of comments labeled positive: %d' % len(pos_comments))
print('Number of comments labeled negative: %d' % len(neg_comments))

Note that you see a class imbalance here: we have way more negative than positive labels. This presents a new challenge for our sentiment analyzer. If the probability of seeing positive comments is small, Bayes' Theorem is likely to predict that the probability of any comment being positive is small **regardless of the evidence**. So we need to up- or downsample again.

The choice of upsampling, downsampling, or randomly sampling (producing the true positive to negative ratio) is a **hyperparameter** of our sentiment analyzer. To know which method will work best, we ought to try all of them and see which yields the best performance on our cross-validation set.

For now, we will put half of the positive and negative comments into the **training set** while downsampling negative comments at a one-to-one ratio. Half of the remaining comments will go into the **cross-validation set**, and the final quarter of comments into the **test set**.
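To see how much the class prior can outweigh the evidence, here is a two-class Bayes calculation with invented numbers. The same evidence, which is three times as likely under "positive" as under "negative", is evaluated once with balanced classes and once with rare positives.

```python
def posterior_positive(prior_pos, lik_pos, lik_neg):
    """P(positive | evidence) for a two-class problem via Bayes' Theorem."""
    evidence = lik_pos * prior_pos + lik_neg * (1 - prior_pos)
    return lik_pos * prior_pos / evidence

# Invented likelihoods: the evidence is three times as likely under 'positive'
lik_pos, lik_neg = 0.03, 0.01

print(round(posterior_positive(0.5, lik_pos, lik_neg), 3))  # balanced classes
print(round(posterior_positive(0.1, lik_pos, lik_neg), 3))  # positives are rare
```

With balanced classes the comment comes out as probably positive; with a 10% prior, the very same evidence leaves "negative" as the more likely class. Resampling the training set changes that prior.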

In [None]:
# Half of the positive comments go in training
# Downsampling the negative comments at 1 pos:1 neg

len_train = int(round(len(pos_comments)/2)*2) # Define total len of train set based on len of pos_comments
train_comments = pos_comments[:int(len_train/2)] + neg_comments[:int(len_train/2)]

# Half of the remaining half go in cv
cv_neg_cutoff = int((len_train/2) + round((len(neg_comments) - len_train/2)/2))
cv_pos_cutoff = int((len_train/2) + round((len(pos_comments) - len_train/2)/2))
cv_comments = neg_comments[int(len_train/2):cv_neg_cutoff] + pos_comments[int(len_train/2):cv_pos_cutoff]  

# Rest go into testing
test_comments = neg_comments[cv_neg_cutoff:] + pos_comments[cv_pos_cutoff:]  

In [None]:
## Test
train_comments[7]

(Note that you can do this kind of splitting through scikit-learn's `train_test_split()` and `KFold()` methods, making it much easier!)

## Feature extraction design

Now it's time to define our sentiment analyzer’s features. We'll do so with boolean `contains(word)` statements. For instance, the feature `contains(disagree)` will likely be `True` for a negative comment, and `False` for a positive comment. We’ll need to construct a list of useful words for this purpose.

### Bigrams
Note that this is harder than it might seem! If we were to train a model that believes the word “like” indicates a positive comment, we would misclassify a comment such as “I do not like this.” It’s clear that the classifier’s performance would increase if it could recognize that the word “not” is negating the word “like.” One way to address negation markers is to include **bigrams** in our list of features. Doing so allows our classifier to recognize the collocation of words such as “(not, like)” in addition to the typical unigram features we would extract.
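
A minimal sketch of the idea, using plain Python rather than NLTK's collocation tools: `unigrams_and_bigrams()` is a hypothetical helper for illustration that splits on whitespace and pairs up adjacent words.

```python
def unigrams_and_bigrams(text):
    """Return the words of a text followed by all pairs of adjacent words."""
    tokens = text.lower().split()
    bigrams = list(zip(tokens, tokens[1:]))  # pair each word with its successor
    return tokens + bigrams

features = unigrams_and_bigrams("I do not like this")
print(features)
```

The resulting feature list contains both the unigram `'like'` and the bigram `('not', 'like')`, so a classifier can learn that the pair points in the opposite direction from the single word.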

### Preventing overfit
This word list shouldn’t include every word we come across in our collection of comments: using too many uncommon words will cause our sentiment analyzer to *overfit* the training sample. This usually happens when the model is too complex (that is, when it uses too many features compared to the number of observations). Such a model will be very accurate on the training data but not accurate on unseen or new data.

We will avoid this issue by requiring that each word **appear a certain number of times** before we consider it as a feature. 

Let's write a function that applies these feature extraction techniques to our comments. First, we extract a list of unigrams and bigrams, lowercasing the text and stripping punctuation and accents with gensim's `simple_preprocess()`.

In [None]:
from nltk.metrics import BigramAssocMeasures
from nltk.collocations import BigramCollocationFinder
from gensim.utils import simple_preprocess

def filter_comments(comments):
    filtered_comments = []
    for text, sentiment in comments:
        # Tokenize with gensim: lowercases the text and strips punctuation
        # and accents (note that this does not lemmatize)
        words_filtered = simple_preprocess(str(text), deacc=True)
        # Identify up to 200 top bigrams using the chi-squared measure of association
        bigram_finder = BigramCollocationFinder.from_words(words_filtered)
        bigrams = bigram_finder.nbest(BigramAssocMeasures.chi_sq, 200)
        # Store the unigrams and bigrams together with the comment's label
        filtered_comments.append((words_filtered + bigrams, sentiment))
    return filtered_comments

The result of the filter is a **list of tuples**, one for each comment. The first entry in each tuple is a list of unigrams and bigrams for that comment, and the second entry in the tuple is the manually labeled sentiment of that comment.

In [None]:
train_tuples = filter_comments(train_comments)
cv_tuples = filter_comments(cv_comments)
test_tuples = filter_comments(test_comments)    

In [None]:
train_tuples[0]

Remember that we wanted to get rid of the features that do not appear frequently? We'll now remove all unigrams or bigrams that appear fewer than 3 times.

In [None]:
from nltk import FreqDist

def get_word_features(comments, min_freq):
    word_list = []
    for ngrams, _sentiment in comments:
        word_list.extend(ngrams)
    # Count the frequency of each unigram and bigram
    freqs = FreqDist(word_list)
    # Keep only features appearing at least min_freq times, most frequent first
    word_features = [ngram for ngram, count in freqs.most_common()
                     if count >= min_freq]
    # Return list of features
    return word_features

word_features = get_word_features(train_tuples, 3)

Let's see how many features we have found!

In [None]:
len(word_features)

Now that we have a final list of word features, we can finally evaluate the boolean `contains(word_feature)` statements for each comment. To do so, we first define an `extract_features()` function that takes a filtered comment as an input, then determines if each word feature exists in the filtered comment’s list of unigrams and bigrams.

In [None]:
def extract_features(filtered_comment):
    filtered_comment_words = set(filtered_comment)
    features = {}
    for word in word_features:        
        # Check if word feature is present in filtered_comment
        features['contains(%s)' % str(word)] = (word in filtered_comment_words)
    return features

The NLTK `classify` package provides a convenient `apply_features()` function to apply our `extract_features()` function to each comment in our datasets:

In [None]:
from nltk.classify.util import apply_features

training_set = apply_features(extract_features, train_tuples)
cv_set = apply_features(extract_features, cv_tuples)
test_set = apply_features(extract_features, test_tuples)

The final result is an NLTK `LazyMap` object that contains a tuple for each comment. The first entry in the tuple is a dictionary of boolean `contains(word)` features, one for each of the feature words we selected, while the second entry in the tuple is the manually labeled sentiment of the comment.

In [None]:
training_set[0][0]

## Training the classifier

It's time to train our Naive Bayes classifier. Let's use NLTK's implementation this time around.

Note that we're using a Naive Bayes classifier over **binary features** this time: that is, we use a bunch of boolean `contains()` variables that tell our classifier whether or not some feature is present in a comment. The equivalent in sklearn for this is the `BernoulliNB` classifier.

In [None]:
from nltk.classify import NaiveBayesClassifier

classifier = NaiveBayesClassifier.train(training_set)
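
For comparison, here is a minimal sketch of the sklearn route mentioned above. It assumes feature dictionaries in the same `(features, label)` shape we built for NLTK; the toy data and feature names are hypothetical. `DictVectorizer` turns each dictionary into a row of 0/1 values, which `BernoulliNB` can then be trained on.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.naive_bayes import BernoulliNB

# Hypothetical toy data in the (feature_dict, label) shape used by NLTK
toy_training_set = [
    ({"contains(love)": True,  "contains(hate)": False}, "positive"),
    ({"contains(love)": True,  "contains(hate)": False}, "positive"),
    ({"contains(love)": False, "contains(hate)": True},  "negative"),
    ({"contains(love)": False, "contains(hate)": True},  "negative"),
]
toy_features, toy_sentiments = zip(*toy_training_set)

# Convert the boolean feature dictionaries into a 0/1 matrix
vectorizer = DictVectorizer(sparse=False)
X = vectorizer.fit_transform(toy_features)

# Train a Bernoulli Naive Bayes model on the binary features
model = BernoulliNB().fit(X, toy_sentiments)

new_comment = {"contains(love)": True, "contains(hate)": False}
prediction = model.predict(vectorizer.transform([new_comment]))
print(prediction[0])
```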

### Evaluation

To accurately evaluate the performance of our classifier, accuracy alone is not enough. For each class, we also have to look at two things: **precision** and **recall**.

- **Precision:** the number of true positives (items correctly labeled as belonging to the class) divided by the sum of true positives and false positives (items incorrectly labeled as belonging to the class).
- **Recall:** the number of true positives divided by the total number of elements that actually belong to the class (i.e., the sum of true positives and false negatives).

We can average the precision and recall metrics using their harmonic mean, producing a quantity known as the F1 score. It summarizes both metrics in a single number, and is only high when precision and recall are both high. See https://en.wikipedia.org/wiki/F1_score for more.
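
These definitions can be worked through by hand on a small example. The sketch below uses hypothetical sets of comment indices, and `precision_recall_f1()` is a helper written for illustration. It returns 0.0 for empty sets, which is a simplifying choice made here.

```python
def precision_recall_f1(reference, predicted):
    """Compute precision, recall and F1 for one class.

    reference: set of item indices that actually belong to the class
    predicted: set of item indices the classifier assigned to the class
    """
    true_positives = len(reference & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical example: comments 0-3 are truly positive,
# while the classifier labeled comments 2, 3 and 4 as positive
actual_positive = {0, 1, 2, 3}
predicted_positive = {2, 3, 4}

p, r, f1 = precision_recall_f1(actual_positive, predicted_positive)
print('Precision: %.3f, Recall: %.3f, F1: %.3f' % (p, r, f1))
```

Here two of the three predicted positives are correct (precision 2/3), and two of the four actual positives were found (recall 1/2), giving an F1 score of 4/7.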

We'll use a function that evaluates our classifier, based on the cross-validation set we created earlier. We'll calculate F1, precision and recall scores.

In [None]:
import collections
from nltk.classify.util import accuracy
from nltk.metrics.scores import f_measure as fmeas, precision as prec, recall as rec

def eval_classifier(data_set):
    # Overall accuracy: the fraction of comments that were labeled correctly
    print('Accuracy:', accuracy(classifier, data_set))

    # Map each label to the set of indices of the comments carrying that label
    ref_set = collections.defaultdict(set)  # actual (manually assigned) labels
    obs_set = collections.defaultdict(set)  # labels predicted by the classifier

    # Loop over each comment in the data set
    for i, (feats, label) in enumerate(data_set):
        # Classify the comment by feeding the classifier the comment's features
        observed = classifier.classify(feats)
        # Add the current comment to the "reference" set under the actual class
        ref_set[label].add(i)
        # Add the current comment to the "observation" set under the predicted class
        obs_set[observed].add(i)

    # Calculate F score, precision, and recall for every label in the data.
    # Using the labels found in the data avoids hard-coding the label names.
    labels = sorted(ref_set)
    for name, metric in [('F-measure', fmeas), ('Precision', prec), ('Recall', rec)]:
        for label in labels:
            print('%s [%s]:' % (name, label), metric(ref_set[label], obs_set[label]))

In [None]:
eval_classifier(cv_set) 

At this point, we have to see if we are happy with the outcome of our performance metrics. 
If not, we need to change the hyperparameters of our classifier and re-evaluate its performance on our cross-validation set. As noted, we may want to consider limiting our classifier to **high-information features**. We can check what these features are using NLTK’s `show_most_informative_features()` method.

In [None]:
# This method prints its own output and returns None, so no print() is needed
classifier.show_most_informative_features(5)

We can use these features in the *feature extraction design* step above to see if doing so improves our model (which it won't necessarily do). **The goal is, through tweaking our hyperparameters like this, to improve the accuracy of our model.** 

### Evaluating on the test set

It's only after tweaking our model for a while that we should crack open our test set and see how well our classifier has done!

In [None]:
eval_classifier(test_set)

## Conclusion: Data analysis and hermeneutics

That was a lot! Again, don't worry if you didn't understand every part of this notebook. Its main purpose is for you to get a first look into the many ways in which data scientists can classify their data.

There's a lot you can do with the methods we've applied here. For instance, you can use classification to try and find binaries in your dataset. You could tag a corpus manually, based on the presence of some binary category X or Y that you deem important. You could also create a set of categories based on, say, particular word frequencies. Then you can create a classifier to see if you can successfully predict these categories in other texts.

You can use classification, feature extraction and sentiment analysis to explore your own data. For instance:
- Using sentiment scores to trace posts or topics of interest;
- Using feature extraction to create new classes; 
- Training a classifier based on your own classes (e.g. the presence of a specific topic you're interested in).