# Yet Another Twitter Sentiment Analysis

I finished an 11-part series of blog posts on Twitter sentiment analysis not long ago. Why do the sentiment analysis again? I wanted to take it further and run sentiment analysis on real, retrieved tweets. There were also two limitations in my previous sentiment analysis project.

1. The project stopped at the final trained model and lacked any application of the model to retrieved tweets
2. The model was trained on only the positive and negative classes, so it cannot predict a neutral class

Regarding the neutral class, it might be possible to set threshold values on the final output probability and map it to one of three classes (negative, neutral, positive), but I wanted to train a model on training data that actually has three sentiment classes: negative, neutral, positive.
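For reference, the thresholding idea could look roughly like the sketch below. It assumes a binary classifier that outputs the probability of the positive class. The function name `to_three_classes` and the cut-offs 0.4 and 0.6 are illustrative assumptions, not values used anywhere in this project.

```python
def to_three_classes(p_positive, lower=0.4, upper=0.6):
    """Map a binary model's positive-class probability to one of three labels.

    lower and upper are hypothetical cut-offs chosen only for illustration.
    """
    if p_positive < lower:
        return "negative"
    if p_positive > upper:
        return "positive"
    return "neutral"


labels = [to_three_classes(p) for p in (0.05, 0.5, 0.93)]
print(labels)
```

The weak point of this approach is that the cut-offs have to be picked by hand, which is one reason to prefer training on data with a real neutral label.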

Since I already wrote quite a lengthy series on NLP and sentiment analysis, I won't explain in detail any concept that was covered in my previous posts. Also, the main data visualisation will be done with the retrieved tweets, so I won't go through extensive data visualisation of the data I use for training and testing the model.

## Data

In order to train my sentiment classifier, I need a dataset which meets the conditions below.

- Preferably tweet text data with annotated sentiment labels
- with 3 sentiment classes: negative, neutral, positive
- big enough to train a model

While googling to find a good data source, I learned about a renowned NLP competition called SemEval.
"SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems, organized under the umbrella of SIGLEX, the Special Interest Group on the Lexicon of the Association for Computational Linguistics."
http://alt.qcri.org/semeval2017/
You might have already heard of this if you're interested in NLP. Highly-skilled teams from all around the world compete on a couple of tasks such as "semantic textual similarity", "multilingual semantic word similarity", etc.
One of the competition tasks is Twitter sentiment analysis. It has a couple of subtasks, but what I want to focus on is "Subtask A. : Message Polarity Classification: Given a message, classify whether the message is of positive, negative, or neutral sentiment".
http://alt.qcri.org/semeval2017/task4/

Luckily the dataset they provide for the competition is available to download.
http://alt.qcri.org/semeval2017/task4/index.php?id=data-and-tools
The training data consists of SemEval's previous training and test data. What's even better is they provide test data, and all the teams who participated in the competition are scored with the same test data. This means I can compare my model performance with 2017 participants in SemEval.

I first downloaded the full training data for SemEval 2017 Task 4. http://alt.qcri.org/semeval2017/task4/index.php?id=download-the-full-training-data-for-semeval-2017-task-4

There are 11 txt files in total, spanning from SemEval 2013 to SemEval 2016. While trying to read the files into a Pandas dataframe, I found that two files cannot be loaded properly as TSV files. It seems some entries are not properly tab-separated, so they end up as chunks of 10 or more tweets stuck together. I could have tried retrieving them with the tweet IDs provided, but I decided to ignore these two files for now and make up a training set with only 9 txt files.
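One way to spot such broken entries before handing a file to Pandas is to count the tab-separated fields on each line. The sketch below is only an illustration: the helper name `find_malformed_lines` is hypothetical, the sample lines are made up, and the assumption that a healthy line has 3 or 4 fields is inferred from the four column names used when loading the files.

```python
def find_malformed_lines(lines, allowed_field_counts=(3, 4)):
    """Return (line_number, field_count) for lines with an unexpected number of fields."""
    malformed = []
    for number, line in enumerate(lines, start=1):
        # Strip only the newline so a trailing tab still counts as an empty field
        field_count = len(line.rstrip("\n").split("\t"))
        if field_count not in allowed_field_counts:
            malformed.append((number, field_count))
    return malformed


# Made-up sample lines: the third one has several tweets stuck together
sample = [
    "1\tpositive\tgreat match tonight\n",
    "2\tneutral\tkick-off is at 8pm\t\n",
    "3\tnegative\tawful game 4\tneutral\tsecond tweet stuck on 5\tpositive\tthird one\n",
]
print(find_malformed_lines(sample))
```

In practice the same function could be fed the lines of each txt file to decide which files are safe to load.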

In [2]:
import pandas as pd  
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [3]:
import glob
path = 'Subtask_A/'
# Collect every SemEval training file for Subtask A
all_files = glob.glob(path + "twitter*.txt")
list_ = []
for file_ in all_files:
    # Files are tab-separated with no header row; the last column ('to_delete') is dropped
    df = pd.read_csv(file_,index_col=None, sep='\t', header=None, names=['id','sentiment','text','to_delete'])
    list_.append(df.iloc[:,:-1])
df = pd.concat(list_)


The dataset looks fairly simple with individual tweet ID, sentiment label, and tweet text.

In [4]:
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.tail()

Unnamed: 0,id,sentiment,text
41700,639016598477651968,neutral,@YouAreMyArsenal Wouldn't surprise me if we en...
41701,640276909633486849,neutral,Rib injury for Zlatan against Russia is a big ...
41702,640296841725235200,neutral,Noooooo! I was hoping to see Zlatan being Zlat...
41703,641017384908779520,neutral,@Fronsoir Zlatan has never done it on a wet Tu...
41704,641395811474128896,neutral,@ZIatanVines how many goals Zlatan intends to...


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 41705 entries, 0 to 41704
Data columns (total 3 columns):
id           41705 non-null int64
sentiment    41705 non-null object
text         41705 non-null object
dtypes: int64(1), object(2)
memory usage: 977.5+ KB


There are 41,705 tweets in total. As another sanity check, let's take a look at how many words there are in each tweet.

In [6]:
df['token_length'] = [len(x.split(" ")) for x in df.text]
max(df.token_length)

53

In [7]:
df.loc[df.token_length.idxmax(),'text']

'NO. 108 UN JORIT MAR DALA ZAYUN                                  (Watchman on the Walls of Zion) Doh is C  1. Un... http://t.co/gNu3AzZ6qH'

OK, the token length looks fine, and the tweet for maximum token length seems like a properly parsed tweet. Let's take a look at the class distribution of the data.

In [8]:
df.sentiment.value_counts()

neutral     19466
positive    15754
negative     6485
Name: sentiment, dtype: int64

The data is not well balanced: the negative class has the fewest entries with 6,485, and the neutral class has the most with 19,466. I want to rebalance the data so that I have a balanced dataset, at least for training. I will deal with this after I define the cleaning function.

## Data Cleaning

The data cleaning process is similar to my previous project, but this time I added a long list of contractions to expand most contracted forms to their original forms, such as "don't" to "do not". Also, instead of regex, I used spaCy to parse the documents and filter out numbers, URLs, punctuation, etc. Below are the steps I took to clean the tweets.

1. Decoding: unicode_escape for extra "\" before unicode character, then unidecode
2. Apostrophe handling: people use two different characters for contractions, "’" (apostrophe) and "'" (single quote). If both are in use, it is difficult to detect contractions and map them to the right expanded form, so every "’" (apostrophe) is changed to "'" (single quote)
3. Contraction check: check if there's any contracted form, and replace it with its original form
4. Parsing: done with spaCy
5. Filtering punctuation, white space, numbers and URLs using spaCy methods, while keeping the text content of hashtags intact
6. @mention removal
7. Lemmatizing: each token is lemmatized using the spaCy attribute '.lemma_'. Pronouns are kept as they are, since the spaCy lemmatizer transforms every pronoun to "-PRON-"
8. Special character removal
9. Single-character token removal
10. Spell correction: a simple spell correction dealing with repeated characters such as "sooooo goooood". If the same character is repeated more than two times, the repetition is shortened to two. For example, "sooooo goooood" will be transformed to "soo good". This is not a perfect solution, since "soo" is still not a correct spelling after the correction, but at least it helps to reduce the feature space by mapping "sooo", "soooo" and "sooooo" to the same word "soo"
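To make steps 2, 3 and 10 concrete, here is a small standalone sketch that needs only the standard library. The helper names and the two-entry `mini_contractions` dictionary are stand-ins invented for this illustration; the real cleaner uses the full contraction mapping and spaCy.

```python
import re

# Tiny stand-in for the full contraction mapping (illustrative subset only)
mini_contractions = {"don't": "do not", "it's": "it is"}


def normalise_apostrophe(text):
    # Step 2: map the typographic apostrophe to a plain single quote
    return text.replace("\u2019", "'")


def expand_contractions(text, mapping):
    # Step 3: replace space-separated tokens that appear in the mapping
    return " ".join(mapping.get(token, token) for token in text.split(" "))


def squeeze_repeats(text):
    # Step 10: shorten any run of the same character to exactly two
    return re.sub(r"(.)\1+", r"\1\1", text)


raw = "it\u2019s sooooo goooood, don\u2019t stop"
step2 = normalise_apostrophe(raw)
step3 = expand_contractions(step2, mini_contractions)
step10 = squeeze_repeats(step3)
print(step10)
```

Note that the contraction lookup is an exact match on each token, so a token with punctuation attached (for example "don't,") would slip through this simple version.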

In [9]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", 
                   "can't've": "cannot have", "'cause": "because", "could've": "could have", 
                   "couldn't": "could not", "couldn't've": "could not have","didn't": "did not", 
                   "doesn't": "does not", "don't": "do not", "hadn't": "had not", 
                   "hadn't've": "had not have", "hasn't": "has not", "haven't": "have not", 
                   "he'd": "he would", "he'd've": "he would have", "he'll": "he will", 
                   "he'll've": "he will have", "he's": "he is", "how'd": "how did", 
                   "how'd'y": "how do you", "how'll": "how will", "how's": "how is", 
                   "I'd": "I would", "I'd've": "I would have", "I'll": "I will", 
                   "I'll've": "I will have","I'm": "I am", "I've": "I have", 
                   "i'd": "i would", "i'd've": "i would have", "i'll": "i will", 
                   "i'll've": "i will have","i'm": "i am", "i've": "i have", 
                   "isn't": "is not", "it'd": "it would", "it'd've": "it would have", 
                   "it'll": "it will", "it'll've": "it will have","it's": "it is", 
                   "let's": "let us", "ma'am": "madam", "mayn't": "may not", 
                   "might've": "might have","mightn't": "might not","mightn't've": "might not have", 
                   "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", 
                   "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", 
                   "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not",
                   "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", 
                   "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", 
                   "she's": "she is", "should've": "should have", "shouldn't": "should not", 
                   "shouldn't've": "should not have", "so've": "so have","so's": "so as", 
                   "this's": "this is",
                   "that'd": "that would", "that'd've": "that would have","that's": "that is", 
                   "there'd": "there would", "there'd've": "there would have","there's": "there is", 
                       "here's": "here is",
                   "they'd": "they would", "they'd've": "they would have", "they'll": "they will", 
                   "they'll've": "they will have", "they're": "they are", "they've": "they have", 
                   "to've": "to have", "wasn't": "was not", "we'd": "we would", 
                   "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", 
                   "we're": "we are", "we've": "we have", "weren't": "were not", 
                   "what'll": "what will", "what'll've": "what will have", "what're": "what are", 
                   "what's": "what is", "what've": "what have", "when's": "when is", 
                   "when've": "when have", "where'd": "where did", "where's": "where is", 
                   "where've": "where have", "who'll": "who will", "who'll've": "who will have", 
                   "who's": "who is", "who've": "who have", "why's": "why is", 
                   "why've": "why have", "will've": "will have", "won't": "will not", 
                   "won't've": "will not have", "would've": "would have", "wouldn't": "would not", 
                   "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would",
                   "y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",
                   "you'd": "you would", "you'd've": "you would have", "you'll": "you will", 
                   "you'll've": "you will have", "you're": "you are", "you've": "you have" } 

In [14]:
import codecs
import unidecode
import re
import spacy
nlp = spacy.load('en')

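def spacy_cleaner(text):
    # Step 1: undo stray backslash escapes, then transliterate to ASCII
    try:
        decoded = unidecode.unidecode(codecs.decode(text, 'unicode_escape'))
    except Exception:
        decoded = unidecode.unidecode(text)
    # Step 2: normalise the typographic apostrophe to a single quote
    apostrophe_handled = re.sub("’", "'", decoded)
    # Step 3: expand contractions token by token
    expanded = ' '.join([contraction_mapping[t] if t in contraction_mapping else t for t in apostrophe_handled.split(" ")])
    # Step 4: parse with spaCy
    parsed = nlp(expanded)
    final_tokens = []
    for t in parsed:
        # Steps 5-6: skip punctuation, whitespace, numbers, URLs and @mentions
        if t.is_punct or t.is_space or t.like_num or t.like_url or str(t).startswith('@'):
            pass
        else:
            # Step 7: keep pronouns as written, lemmatize everything else
            if t.lemma_ == '-PRON-':
                final_tokens.append(str(t))
            else:
                # Steps 8-9: strip non-letters and drop single-character tokens
                sc_removed = re.sub("[^a-zA-Z]", '', str(t.lemma_))
                if len(sc_removed) > 1:
                    final_tokens.append(sc_removed)
    joined = ' '.join(final_tokens)
    # Step 10: squeeze any run of a repeated character down to two
    spell_corrected = re.sub(r'(.)\1+', r'\1\1', joined)
    return spell_corrected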
        

OK now let's see how this custom cleaner works with tweets.

In [15]:
pd.set_option('display.max_colwidth', -1)

In [16]:
df.text[:10]

0    Picturehouse's, Pink Floyd's, 'Roger Waters: The Walll - opening 29 Sept is now making waves. Watch the trailer on Rolling Stone - look...  
1    Order Go Set a Watchman in store or through our website before Tuesday and get it half price! #GSAW @GSAWatchmanBook https://t.co/KET6EGD1an
2    If these runway renovations at the airport prevent me from seeing Taylor Swift on Monday, Bad Blood will have a new meaning.                
3    If you could ask an onstage interview question at Miss USA tomorrow, what would it be?                                                      
4    A portion of book sales from our Harper Lee/Go Set a Watchman release party on Mon. 7/13 will support @CAP_Tulsa and the great work they do.
5    Excited to read "Go Set a Watchman" on Tuesday.  But can it possibly live up to "To Kill a Mockingbird?"  Any opinions?                     
6    Watching Miss USA tomorrow JUST to see @TravisGarland perform, I'm obsessed with his voice                             

In [17]:
[spacy_cleaner(t) for t in df.text[:10]]

['picturehouse pink floyd roger waters the wall open sept be now make wave watch the trailer on rolling stone look',
 'order go set watchman in store or through our website before tuesday and get it half price gsaw',
 'if these runway renovation at the airport prevent me from see taylor swift on monday bad blood will have new meaning',
 'if you could ask an onstage interview question at miss usa tomorrow what would it be',
 'portion of book sale from our harper lee go set watchman release party on mon will support and the great work they do',
 'excited to read go set watchman on tuesday but can it possibly live up to to kill mockingbird any opinion',
 'watch miss usa tomorrow just to see perform I be obsess with his voice',
 'tune in for the miss usa pageant on reelzchannel on sunday july at etp pt contestant from all',
 'call for reservation for lunch or dinner tomorrow yep sunday happy to accommodate guest in town for the miss usa pageant',
 'miss universe org prez tell me trump will

It looks like it's doing what I intended it to do. I'll clean the "text" column and create a new column called "clean_text".

In [18]:
df['clean_text'] = [spacy_cleaner(t) for t in df.text]

  if __name__ == '__main__':


By running the cleaning function, I can see it encountered some "invalid escape sequence" warnings. Let's see what these are.

In [19]:
for i,t in enumerate(df.text):
    if '\\m' in t:
        print(i,t)
        

2064 @TheMetalCore Can't wait for them both! September is equally impressive with FFDP, Iron Maiden, Slayer, BMTH, and Parkway Drive! \m/
5486 Tomorrow is a big Metal day.  At 8:00am the new Iron Maiden tune premiers and, of course, new albums to purchase.  I love Heavy Metal! \m/
6009 Iron Maiden released their new video on the 14th of august \m/
6234 Check out the new Iron Maiden Video from their forthcoming album Book of Souls out September 4th! \m/ Listen to... http://t.co/pVDxR2iyQX
9836 Another great preview from the new IRON MAIDEN album THE BOOK OF SOULS \m/ OUT SEPTEMBER 4th http://t.co/sxrPyjyhMH
11816 Let's hope it's not!! Look for "The Book of Souls" from Iron Maiden This Friday September 4th!! \m/\m/ Listen to... http://t.co/R39jz4rmER
12008 @TonyBasilio Iron Maiden reference... Now you're talking! New album #BookOfSouls drops Friday #uptheirons  Woooooo! \m/ \m/
12136 I may be 30 years too late to say this but The Trooper by Iron Maiden is fucking great, For some reason I

The tweets that contain '\m' actually contain the emoticon '\m/'. I didn't know about this until I googled it. Apparently '\m/' stands for the horn sign you make with your hand. This hand sign is popular in metal music.
https://www.urbandictionary.com/define.php?term=%5Cm%2F
Anyway, this is just a warning and it is not an error. Let's see how the cleaner deals with this.

In [20]:
df.text[2064]

"@TheMetalCore Can't wait for them both! September is equally impressive with FFDP, Iron Maiden, Slayer, BMTH, and Parkway Drive! \\m/"

In [21]:
spacy_cleaner(df.text[2064])

  if __name__ == '__main__':


'can not wait for them both september be equally impressive with ffdp iron maiden slayer bmth and parkway drive'

Again, it seems to be doing what I intended it to do. So far so good.

# Imbalanced Learning

"The class imbalance problem typically occurs when, in a classification problem, there are many more instances of some classes than others. In such cases, standard classifiers tend to be overwhelmed by the large classes and ignore the small ones."
https://www3.nd.edu/~dial/publications/chawla2004editorial.pdf

As I have already noticed, the training data is not balanced: the 'neutral' class has about 3 times as much data as the 'negative' class, and the 'positive' class has around 2.4 times as much. I will try fitting a model with three different versions of the data (oversampled, downsampled, original) to see how different sampling techniques affect the learning of a classifier.
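These ratios follow directly from the class counts printed earlier. A quick check, using only those three numbers:

```python
# Class counts taken from the value_counts() output above
class_counts = {"neutral": 19466, "positive": 15754, "negative": 6485}

# Express every class as a multiple of the smallest class
smallest = min(class_counts.values())
ratios = {label: round(count / smallest, 2) for label, count in class_counts.items()}
print(ratios)
```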

The simple default classifier I'll use to compare performances of different datasets will be the logistic regression. From my previous sentiment analysis project, I learned that Tf-Idf with Logistic Regression is a pretty powerful combination. Before I apply any other more complex models such as ANN, CNN, RNN etc, the performances with logistic regression will hopefully give me a good idea of which data sampling methods I should choose.

In terms of validation, I will use K-fold cross-validation. In my previous project, I split the data into three sets (training, validation, test); all the parameter tuning was done with the reserved validation set, and the model was finally applied to the test set. Considering that I had more than 1 million entries for training, this kind of validation-set approach was acceptable.
But this time, the data I have is much smaller (around 40,000 tweets), and by holding out a validation set we might leave out interesting information about the data.

## Original Imbalanced Data

In [22]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

tvec = TfidfVectorizer(stop_words=None, max_features=100000, ngram_range=(1, 3))
lr = LogisticRegression()

In [23]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

def lr_cv(splits, X, Y, pipeline, average_method):
    
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        lr_fit = pipeline.fit(X[train], Y[train])
        prediction = lr_fit.predict(X[test])
        scores = lr_fit.score(X[test],Y[test])
        
        accuracy.append(scores * 100)
        precision.append(precision_score(Y[test], prediction, average=average_method)*100)
        print('              negative    neutral     positive')
        print('precision:',precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=average_method)*100)
        print('recall:   ',recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=average_method)*100)
        print('f1 score: ',f1_score(Y[test], prediction, average=None))
        print('-'*50)

    print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
    print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
    print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
    print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))

In [24]:
from sklearn.pipeline import Pipeline

original_pipeline = Pipeline([
    ('vectorizer', tvec),
    ('classifier', lr)
])

In [25]:
lr_cv(5, df.clean_text, df.sentiment, original_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.63502455  0.6404264   0.72134595]
recall:    [ 0.29915189  0.80225989  0.65312599]
f1 score:  [ 0.4067086   0.7122663   0.68554297]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.6486014   0.64023012  0.7088215 ]
recall:    [ 0.28604472  0.80041099  0.65280863]
f1 score:  [ 0.39700375  0.71141553  0.67966298]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.61661342  0.63999173  0.70892671]
recall:    [ 0.29760987  0.7950167   0.64773088]
f1 score:  [ 0.40145606  0.70913048  0.67694859]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.64332248  0.63920279  0.70034965]
recall:    [ 0.30454896  0.79912664  0.63567122]
f1 score:  [ 0.41339613  0.71027397  0.66644485]
--------------------------------------------------
              negati

With the data as it is, without any resampling, we can see that the precision is higher than the recall. If we take a closer look at the result from each fold, we can also see that the recall for the negative class is quite low at around 28~30%, while the precision for the negative class is as high as 61~65%. This means the classifier is very picky and does not think many things are negative. Of all the text it classifies as negative, 61~65% really is negative. However, it also misses a lot of the actual negative class, because it is so picky. We have a low recall, but a comparatively high precision. The intuition behind this explanation of precision and recall is taken from a Medium blog post by Andreas Klintberg.
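The "picky classifier" pattern is easy to reproduce with a few counts. The numbers in the sketch below are made up to land close to the negative-class figures above; they are not taken from the actual confusion matrix.

```python
def precision_recall(true_positive, false_positive, false_negative):
    # Precision: how many of the predicted negatives really are negative
    precision = true_positive / (true_positive + false_positive)
    # Recall: how many of the real negatives were actually found
    recall = true_positive / (true_positive + false_negative)
    return precision, recall


# Made-up counts: out of 100 truly negative tweets the classifier finds only 30,
# and it wrongly labels 17 other tweets as negative
precision, recall = precision_recall(true_positive=30, false_positive=17, false_negative=70)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

Because the classifier rarely says "negative", few of its negative predictions are wrong (decent precision), but most real negatives are never flagged (low recall).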
https://medium.com/@klintcho/explaining-precision-and-recall-c770eb9c69e9

## Oversampling

There is a very useful Python package called "imbalanced-learn", which helps you deal with class imbalance issues. It is compatible with scikit-learn and easy to use.
https://github.com/scikit-learn-contrib/imbalanced-learn

Within imbalanced-learn, there are different techniques you can use for oversampling. I will use the two below.

1. RandomOverSampler
2. SMOTE (Synthetic Minority Over-Sampling Technique)

There is one more point to consider if you are cross-validating with oversampled data. Oversampling the minority class can result in overfitting if we oversample before cross-validating. Why is that so? Because by oversampling before the cross-validation split, you leak information from the validation data into your training set. As they say, "What has been seen, cannot be unseen."

If you want a more detailed explanation, I recommend this YouTube video: "Machine Learning - Over-& Undersampling - Python/ Scikit/ Scikit-Imblearn" https://youtu.be/DQC_YE3I5ig

Luckily, the cross-validation function I defined above as "lr_cv()" fits the pipeline only on the training split after the cross-validation split, so it does not leak any information from the validation set to the model.
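The leak is easy to see with a toy example. The sketch below uses plain Python and a hypothetical minority class of 20 samples identified by unique ids. It oversamples by simple repetition and splits without shuffling, purely to keep the result deterministic.

```python
# Hypothetical minority class: 20 samples identified by unique ids
minority_ids = list(range(20))

# Wrong order: oversample first (each sample repeated 3 times), then split
oversampled = minority_ids * 3
test_wrong, train_wrong = oversampled[:12], oversampled[12:]
train_wrong_set = set(train_wrong)
leaked = [i for i in test_wrong if i in train_wrong_set]

# Right order: split first, then oversample only the training part
test_right, train_right = minority_ids[:4], minority_ids[4:]
train_right = train_right * 3
train_right_set = set(train_right)
overlap = [i for i in test_right if i in train_right_set]

print(len(leaked), "test samples have copies in training when oversampling first")
print(len(overlap), "test samples have copies in training when splitting first")
```

When copies of a test sample sit in the training set, the model is effectively tested on data it has already seen, so the scores look better than they should.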

### RandomOverSampler

Random over-sampling is simply a process of repeating some samples of the minority class to balance the number of samples between classes in the dataset.

In [26]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler

ROS_pipeline = make_pipeline(tvec, RandomOverSampler(random_state=777),lr)
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),lr)
ADASYN_pipeline = make_pipeline(tvec, ADASYN(ratio='minority',random_state=777),lr)

Before we fit each pipeline, let's see what the RandomOverSampler does. To make it easier to see, I defined some toy text data below, along with the target sentiment value for each text.

In [108]:
sent1 = "I love dogs"
sent2 = "I don't like dogs"
sent3 = "I adore cats"
sent4 = "I hate spiders"
sent5 = "I like dogs"
testing_text = pd.Series([sent1, sent2, sent3, sent4, sent5])
testing_target = pd.Series([1,0,1,0,1])

My toy data has 5 entries in total, and the target sentiments are three positives and two negatives. In order to be balanced, this toy data needs one more entry of negative class.

One thing to note is that an oversampler cannot handle raw text data. The text has to be transformed into a feature space for the oversampler to work. I'll first fit a TfidfVectorizer, and oversample using the Tf-Idf representation of the texts.

In [109]:
tv = TfidfVectorizer(stop_words=None, max_features=100000)
testing_tfidf = tv.fit_transform(testing_text)

In [110]:
ros = RandomOverSampler(random_state=777)
X_ROS, y_ROS = ros.fit_sample(testing_tfidf, testing_target)

In [111]:
pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0


In [112]:
pd.DataFrame(X_ROS.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0
5,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107


By running RandomOverSampler, we now have one more entry at the end. The last entry added by RandomOverSampler is exactly the same as the fourth one (index number 3) from the top. RandomOverSampler simply repeats some entries of the minority class to balance the data. If we look at the target sentiments after RandomOverSampler, we can see that the classes are now perfectly balanced, thanks to one more entry of the negative class.

In [113]:
y_ROS

array([1, 0, 1, 0, 1, 0])

In [27]:
lr_cv(5, df.clean_text, df.sentiment, ROS_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.48110547  0.70700816  0.71385942]
recall:    [ 0.65767155  0.64509502  0.68327515]
f1 score:  [ 0.55570033  0.67463408  0.69823253]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.48863636  0.70156028  0.70451571]
recall:    [ 0.66306862  0.63524274  0.68327515]
f1 score:  [ 0.56264311  0.66675654  0.69373288]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.47527004  0.69661675  0.71034946]
recall:    [ 0.64456438  0.64526072  0.67089813]
f1 score:  [ 0.54712042  0.66995599  0.69006039]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.49520045  0.70120753  0.7072117 ]
recall:    [ 0.67617579  0.64140765  0.67534116]
f1 score:  [ 0.57170795  0.66997585  0.69090909]
--------------------------------------------------
              negati

Compared to the model built with the original imbalanced data, the model now behaves in the opposite way. The precision for the negative class is around 47~49%, but the recall is much higher at 64~67%. Now we have a situation of high recall and low precision. This means the classifier thinks a lot of things are negative, including a lot of non-negative texts. So from our data we got many texts classified as negative: many of them were actually negative, but a lot of them were non-negative.

Still, compare the weaker metric in each case: without resampling, the recall for the negative class was as low as 28~30%, whereas with oversampling the precision for the negative class, now the weaker of the two, is more robust at around 47~49%.

Another way to look at it is the F1 score, which is the harmonic mean of precision and recall. The original imbalanced data gave 66.51% accuracy and a 60.01% F1 score. With oversampling, we get a slightly lower accuracy of 65.95%, but a much higher F1 score of 64.18%.
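As a check on the definition, the F1 score can be recomputed by hand from the first-fold numbers printed earlier for the original data. The sketch also shows what the 'macro' average passed to lr_cv() does: it takes the unweighted mean of the per-class scores. The helper name `f1` is my own.

```python
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Negative-class precision and recall from the first fold on the original data
negative_f1 = f1(0.63502455, 0.29915189)
print(round(negative_f1, 4))

# Macro F1 for that fold: unweighted mean of the three per-class F1 scores
per_class_f1 = [0.4067086, 0.7122663, 0.68554297]
macro_f1 = sum(per_class_f1) / len(per_class_f1)
print(round(macro_f1, 4))
```

Because the harmonic mean is pulled towards the smaller of the two values, the low negative-class recall drags the F1 score down even though the precision is decent.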

### SMOTE (Synthetic Minority Over-Sampling Technique)

SMOTE is an over-sampling approach in which the minority class is over-sampled by creating
“synthetic” examples rather than by over-sampling with replacement.

According to the original research paper "SMOTE: Synthetic Minority Over-sampling Technique" (Chawla et al., 2002),
"synthetic samples are generated in the following way: Take the difference between the feature vector (sample)
under consideration and its nearest neighbour. Multiply this difference by a random number
between 0 and 1, and add it to the feature vector under consideration. This causes the
selection of a random point along the line segment between two specific features. This
approach effectively forces the decision region of the minority class to become more general."
https://www.jair.org/media/953/live-953-2037-jair.pdf
What this means is that when SMOTE creates a new synthetic sample, it picks one minority sample as a starting point and looks at its k nearest neighbours. Then it creates a new point in feature space that lies somewhere between the original sample and one of its neighbours.

Once you see the example with the toy data, it will become clearer.

In [115]:
smt = SMOTE(random_state=777, k_neighbors=1)
X_SMOTE, y_SMOTE = smt.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_SMOTE.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0
5,0.0,0.0,0.298794,0.446153,0.249998,0.359954,0.0,0.249998


The last entry is the data created by SMOTE. To make it easier to see, let's see only the negative class.

In [116]:
pd.DataFrame(X_SMOTE.todense()[y_SMOTE == 0], columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
2,0.0,0.0,0.298794,0.446153,0.249998,0.359954,0.0,0.249998


The top two entries are original data, and the one at the bottom is synthetic data. You can see it didn't just repeat original data. Instead, the Tf-Idf values are created by taking random values between the two original entries. As you can see, if the Tf-Idf values of both original entries are 0, then the synthetic data also has 0 for those features, such as "adore", "cats" and "love", because if two values are the same there are no values between them. I specifically set k_neighbors to 1 for this toy data: since there are only two entries of the negative class, if SMOTE chooses one to copy, only one other negative entry is left as a neighbour.
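The synthetic row can be reproduced by hand with the interpolation formula from the paper. In the sketch below, the two original rows are copied from the table above, and the random multiplier (called `gap` here, my own name) is inferred from the "hate" column of the synthetic row rather than taken from SMOTE itself.

```python
# Tf-Idf rows of the two original negative samples, in the column order
# adore, cats, dogs, don, hate, like, love, spiders
sample = [0.0, 0.0, 0.462208, 0.690159, 0.0, 0.556816, 0.0, 0.0]
neighbour = [0.0, 0.0, 0.0, 0.0, 0.707107, 0.0, 0.0, 0.707107]


def interpolate(sample, neighbour, gap):
    # New point = sample + gap * (neighbour - sample), with 0 <= gap <= 1
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]


# The multiplier is inferred from the synthetic "hate" value: 0.249998 / 0.707107
gap = 0.249998 / 0.707107
synthetic = interpolate(sample, neighbour, gap)
print([round(v, 6) for v in synthetic])
```

With a multiplier of roughly 0.35, every column matches the synthetic row in the table up to rounding, and the columns that are 0 in both original rows stay at 0.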

Now let's fit the SMOTE pipeline to see how it affects performance.

In [28]:
lr_cv(5, df.clean_text, df.sentiment, SMOTE_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.48556582  0.69869842  0.71257086]
recall:    [ 0.64841943  0.64791988  0.6781974 ]
f1 score:  [ 0.55529878  0.67235177  0.69495935]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.50204559  0.70444444  0.70825083]
recall:    [ 0.66229761  0.65142564  0.68105363]
f1 score:  [ 0.57114362  0.67689844  0.69438602]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.47701149  0.69521967  0.70925553]
recall:    [ 0.63993832  0.64628821  0.67121549]
f1 score:  [ 0.54659203  0.66986155  0.6897114 ]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.49440189  0.69782189  0.70401061]
recall:    [ 0.64687741  0.65014128  0.67407172]
f1 score:  [ 0.56045424  0.6731383   0.68871595]
--------------------------------------------------
              negati

SMOTE seems to give slightly higher accuracy and F1 score than random oversampling. With the results so far, SMOTE looks preferable to both the original imbalanced data and random oversampling.

## Downsampling

How about downsampling? Whereas oversampling adds data to the minority class, downsampling reduces the data of the majority class so that the classes are balanced.

### RandomUnderSampler

In [40]:
from imblearn.under_sampling import NearMiss, RandomUnderSampler

# Each pipeline: Tf-Idf vectorizer -> undersampler -> logistic regression
RUS_pipeline = make_pipeline(tvec, RandomUnderSampler(random_state=777), lr)
NM1_pipeline = make_pipeline(tvec, NearMiss(ratio='not minority', random_state=777, version=1), lr)
NM2_pipeline = make_pipeline(tvec, NearMiss(ratio='not minority', random_state=777, version=2), lr)
NM3_pipeline = make_pipeline(tvec, NearMiss(ratio=nm3_dict, random_state=777, version=3, n_neighbors_ver3=4), lr)

Again, before we run the pipeline, let's apply this to the toy data to see what it does.

In [117]:
rus = RandomUnderSampler(random_state=777)
X_RUS, y_RUS = rus.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_RUS.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
2,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
3,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0


In [118]:
pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0


Compared with the original imbalanced data, the downsampled data has one fewer entry: the last entry of the original data, which belongs to the positive class, is gone. RandomUnderSampler reduces the majority class by randomly removing entries from it.
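To make the idea concrete, here is a minimal NumPy sketch of random undersampling. It is not the RandomUnderSampler code itself: the helper `random_undersample` is hypothetical, and the label encoding (0 for negative, 1 for positive) is an assumption about the toy target.

```python
import numpy as np


def random_undersample(y, seed=0):
    """Return sorted row indices that keep every class at the minority class size."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    kept = []
    for c in classes:
        idx = np.flatnonzero(y == c)
        if len(idx) > n_min:
            # Randomly drop the surplus entries of a larger class
            idx = rng.choice(idx, size=n_min, replace=False)
        kept.append(idx)
    return np.sort(np.concatenate(kept))


# Assumed toy labels: 0 = negative, 1 = positive
toy_target = np.array([1, 0, 1, 0, 1])
kept_idx = random_undersample(toy_target, seed=777)
print(kept_idx, toy_target[kept_idx])
```

Both negative entries always survive, and one of the three positive entries is dropped at random. Which one goes depends on the seed, so it need not match the output above.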

In [33]:
lr_cv(5, df.clean_text, df.sentiment, RUS_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.41158668  0.69661458  0.68481569]
recall:    [ 0.73400154  0.54956343  0.64265313]
f1 score:  [ 0.52742382  0.61441286  0.66306483]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.42551293  0.69652534  0.68636057]
recall:    [ 0.73554356  0.56126381  0.645192  ]
f1 score:  [ 0.53913535  0.62162162  0.66513987]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.41437177  0.70032468  0.69186244]
recall:    [ 0.74248265  0.55407141  0.64487464]
f1 score:  [ 0.53189727  0.61867202  0.66754271]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.4228039   0.69333333  0.69424583]
recall:    [ 0.73477255  0.56100694  0.64709616]
f1 score:  [ 0.53675021  0.62019026  0.66984231]
--------------------------------------------------
              negati

Now both the accuracy and the F1 score have dropped significantly. The pattern of low precision and high recall for the negative class is the same as with the oversampled data; only the overall performance is lower.

### NearMiss

According to the documentation of "imbalanced-learn",
"NearMiss adds some heuristic rules to select samples. NearMiss implements 3 different types of heuristic which can be selected with the parameter version. NearMiss heuristic rules are based on nearest neighbors algorithm."
http://contrib.scikit-learn.org/imbalanced-learn/stable/under_sampling.html#controlled-under-sampling

There is also a good paper on resampling techniques. "Survey of resampling techniques for improving classification performance in unbalanced datasets" (Ajinkya More, 2016)
https://arxiv.org/pdf/1608.06048.pdf

I borrowed the explanation of three different versions of NearMiss from More's paper.

#### NearMiss-1

In NearMiss-1, the majority class points that are retained are those whose mean distance to the k nearest minority class points is lowest. This means it keeps the majority class points that are most similar to the minority class.

In [119]:
nm = NearMiss(ratio='not minority',random_state=777, version=1, n_neighbors=1)
X_nm, y_nm = nm.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_nm.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
2,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0
3,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0


In [120]:
pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0


We can see that NearMiss-1 has eliminated the entry for the text "I adore cats". This makes sense because the words "adore" and "cats" appear only in this entry, which makes it the most different from the minority class in terms of Tf-Idf representation in feature space.
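The NearMiss-1 rule can be checked by hand on the toy data with plain NumPy. This is a sketch of the rule as described in More's paper, not the imbalanced-learn code: the function `nearmiss1_keep` is hypothetical, Euclidean distance is assumed, and the labels (0 for negative, 1 for positive) are an assumed encoding of the toy target.

```python
import numpy as np

# Tf-Idf rows of the toy data, copied from the table above.
# columns: adore, cats, dogs, don, hate, like, love, spiders
X = np.array([
    [0.0, 0.0, 0.556451, 0.0, 0.0, 0.0, 0.830881, 0.0],
    [0.0, 0.0, 0.462208, 0.690159, 0.0, 0.556816, 0.0, 0.0],
    [0.707107, 0.707107, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.707107, 0.0, 0.0, 0.707107],
    [0.0, 0.0, 0.638711, 0.0, 0.0, 0.769447, 0.0, 0.0],
])
y = np.array([1, 0, 1, 0, 1])  # assumed encoding: 0 = negative, 1 = positive


def nearmiss1_keep(X, y, minority=0, majority=1, k=1):
    """Indices of majority rows with the lowest mean distance to their k nearest minority rows."""
    X_min = X[y == minority]
    maj_idx = np.flatnonzero(y == majority)
    # Euclidean distance from every majority row to every minority row
    d = np.linalg.norm(X[maj_idx][:, None, :] - X_min[None, :, :], axis=2)
    mean_d = np.sort(d, axis=1)[:, :k].mean(axis=1)
    n_keep = len(X_min)  # balance the classes
    return maj_idx[np.argsort(mean_d, kind="stable")[:n_keep]]


kept = nearmiss1_keep(X, y)
print(kept)
```

With k=1 the sketch keeps rows 4 and 0 and drops row 2, the "I adore cats" entry, because that row shares no token with either negative entry.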

In [102]:
lr_cv(5, df.clean_text, df.sentiment, NM1_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.395671    0.69915254  0.6578125 ]
recall:    [ 0.70470316  0.50847458  0.66804189]
f1 score:  [ 0.50679235  0.58876004  0.66288773]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.39861051  0.69971264  0.65304241]
recall:    [ 0.7077872   0.50038531  0.67438908]
f1 score:  [ 0.51        0.58349558  0.66354411]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.39939157  0.71075581  0.6560219 ]
recall:    [ 0.70855821  0.50244028  0.68454459]
f1 score:  [ 0.51083936  0.58871332  0.66997981]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.4         0.70036364  0.65333734]
recall:    [ 0.69853508  0.49473414  0.68962234]
f1 score:  [ 0.50870298  0.5798585   0.67098966]
--------------------------------------------------
              negati

It seems like both the accuracy and the F1 score are worse than with random undersampling.

#### NearMiss-2

In contrast to NearMiss-1, NearMiss-2 keeps those points from the majority class whose mean distance to the k farthest points in the minority class is lowest. In other words, it will keep the majority class points that are most different from the minority class.

In [121]:
nm = NearMiss(ratio='not minority',random_state=777, version=2, n_neighbors=1)
X_nm, y_nm = nm.fit_sample(testing_tfidf, testing_target)
pd.DataFrame(X_nm.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
1,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
2,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
3,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0


In [122]:
pd.DataFrame(testing_tfidf.todense(), columns=tv.get_feature_names())

Unnamed: 0,adore,cats,dogs,don,hate,like,love,spiders
0,0.0,0.0,0.556451,0.0,0.0,0.0,0.830881,0.0
1,0.0,0.0,0.462208,0.690159,0.0,0.556816,0.0,0.0
2,0.707107,0.707107,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.707107,0.0,0.0,0.707107
4,0.0,0.0,0.638711,0.0,0.0,0.769447,0.0,0.0


Now we can see that NearMiss-2 has eliminated the entry for the text "I like dogs", which again makes sense because we also have a negative entry "I don't like dogs". The two entries are in different classes but they share the same two tokens, "like" and "dogs".
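The selection rule quoted from More's paper (lowest mean distance to the k farthest minority points) can be sketched in NumPy as follows. This is an illustration only, not the imbalanced-learn code: the function `nearmiss2_keep` is hypothetical, Euclidean distance is assumed, and the 2-D points are made up so that the distances are easy to check by hand.

```python
import numpy as np

# Hypothetical 2-D points, chosen so the distances are easy to verify
X_min = np.array([[0.0, 0.0], [4.0, 0.0]])                           # minority class
X_maj = np.array([[2.0, 0.0], [0.0, 1.0], [5.0, 0.0], [2.0, 1.0]])   # majority class


def nearmiss2_keep(X_maj, X_min, n_keep, k=1):
    """Indices of majority rows with the lowest mean distance to their k farthest minority rows."""
    # Euclidean distance from every majority row to every minority row
    d = np.linalg.norm(X_maj[:, None, :] - X_min[None, :, :], axis=2)
    farthest_k = np.sort(d, axis=1)[:, -k:]  # the k largest distances per row
    mean_d = farthest_k.mean(axis=1)
    return np.argsort(mean_d, kind="stable")[:n_keep]


kept = nearmiss2_keep(X_maj, X_min, n_keep=len(X_min), k=1)
print(kept)
```

With k=1 the sketch keeps the two majority points that lie between the two minority points (indices 0 and 3), since even their farthest minority point is not far away.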

In [123]:
lr_cv(5, df.clean_text, df.sentiment, NM2_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.30916717  0.70748299  0.65716225]
recall:    [ 0.79568234  0.37390858  0.61440812]
f1 score:  [ 0.44530744  0.48924731  0.63506643]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.36363636  0.70122832  0.67787115]
recall:    [ 0.7617579   0.49858721  0.61440812]
f1 score:  [ 0.49227703  0.58279538  0.64458132]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.35180364  0.68459838  0.672226  ]
recall:    [ 0.72937548  0.47726689  0.62678515]
f1 score:  [ 0.47466131  0.56243378  0.64871079]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.34680382  0.67931281  0.68176991]
recall:    [ 0.72783346  0.48754174  0.61123453]
f1 score:  [ 0.4697686   0.56766861  0.64457831]
--------------------------------------------------
              negati

Both accuracy and F1 score are even lower than with NearMiss-1. We can also see that all the metrics fluctuate quite a lot from fold to fold.

#### NearMiss-3

The final NearMiss variant, NearMiss-3, selects the k nearest neighbours in the majority class for every point in the minority class. In this case, the undersampling ratio is directly controlled by k. For example, if we set k to 4, then NearMiss-3 will choose the 4 nearest neighbours of every minority class entry.

This means we can end up with either more or fewer majority class samples than minority class samples, depending on the number of neighbours we set. For example, with my dataset, if I run NearMiss-3 with the default n_neighbors_ver3 of 3, it complains, and the number of neutral entries (the majority class in my dataset) ends up smaller than the number of negative entries (the minority class in my dataset). So I explicitly set n_neighbors_ver3 to 4, so that I have at least as many majority class entries as minority class entries.

One thing I'm not completely sure about is what kind of filtering it applies when the data selected with the n_neighbors_ver3 parameter amounts to more than the minority class. As you will see below, after applying NearMiss-3 the dataset is perfectly balanced. However, if the algorithm simply chose the nearest neighbours according to the n_neighbors_ver3 parameter, I doubt it would end up with exactly the same number of entries for each class.
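As far as I can tell from the imbalanced-learn user guide, NearMiss-3 works in two steps: it first shortlists the nearest majority neighbours of every minority entry, and then keeps, from that shortlist, the entries whose average distance to their nearest minority neighbours is largest. That second step would explain the perfectly balanced result. The sketch below follows this reading; it is not the library's code, the function `nearmiss3_sketch` and its parameters `m` and `k` are hypothetical names, and the 2-D points are made up.

```python
import numpy as np


def nearmiss3_sketch(X_maj, X_min, n_keep, m=3, k=3):
    """Two-step selection: shortlist m nearest majority rows per minority row,
    then keep the n_keep shortlisted rows farthest (on average) from their
    k nearest minority rows."""
    # d[i, j] = Euclidean distance from minority row i to majority row j
    d = np.linalg.norm(X_min[:, None, :] - X_maj[None, :, :], axis=2)
    # Step 1: union of the m nearest majority rows of every minority row
    candidates = np.unique(np.argsort(d, axis=1)[:, :m])
    # Step 2: mean distance of each candidate to its k nearest minority rows
    d_cand = d[:, candidates].T
    k = min(k, X_min.shape[0])
    mean_d = np.sort(d_cand, axis=1)[:, :k].mean(axis=1)
    order = np.argsort(-mean_d, kind="stable")[:n_keep]
    return np.sort(candidates[order])


# Hypothetical points: two minority rows and six majority rows
X_min = np.array([[0.0, 0.0], [10.0, 0.0]])
X_maj = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0],
                  [9.0, 0.0], [8.0, 0.0], [50.0, 50.0]])

kept = nearmiss3_sketch(X_maj, X_min, n_keep=len(X_min), m=2, k=1)
print(kept)
```

Here step 1 shortlists four majority rows (two per minority row), and step 2 trims the shortlist down to two, so the classes end up balanced even though the shortlist was larger than the minority class.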

In [138]:
lr_cv(5, df.clean_text, df.sentiment, NM3_pipeline, 'macro')

              negative    neutral     positive
precision: [ 0.41911765  0.69426121  0.66879387]
recall:    [ 0.70316114  0.54057524  0.66518566]
f1 score:  [ 0.52519436  0.60785446  0.66698488]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.44198895  0.69118148  0.6331673 ]
recall:    [ 0.67848882  0.52144875  0.68581403]
f1 score:  [ 0.53527981  0.59443631  0.65843998]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.43694141  0.69058143  0.63880685]
recall:    [ 0.67848882  0.52170563  0.68644875]
f1 score:  [ 0.53156146  0.59438104  0.66177145]
--------------------------------------------------
              negative    neutral     positive
precision: [ 0.43076923  0.68832848  0.66226233]
recall:    [ 0.71241326  0.54687901  0.65217391]
f1 score:  [ 0.53689715  0.60950472  0.65717941]
--------------------------------------------------
              negati

NearMiss-3 produced the most robust result within the NearMiss family, but it is still slightly lower than RandomUnderSampler.

In [137]:
from collections import Counter

nm3 = NearMiss(ratio='not minority',random_state=777, version=3, n_neighbors_ver3=4)
tvec = TfidfVectorizer(stop_words=None, max_features=100000, ngram_range=(1, 3))
df_tfidf = tvec.fit_transform(df.clean_text)
X_res, y_res = nm3.fit_sample(df_tfidf, df.sentiment)
print('Distribution before NearMiss-3: {}'.format(Counter(df.sentiment)))
print('Distribution after NearMiss-3: {}'.format(Counter(y_res)))

Distribution before NearMiss-3: Counter({'neutral': 19466, 'positive': 15754, 'negative': 6485})
Distribution after NearMiss-3: Counter({'negative': 6485, 'neutral': 6485, 'positive': 6485})


## Result

**5-fold cross validation result**
*(classifier used for validation: logistic regression with default setting)*

|            | accuracy | macro average precision | macro average recall | macro average f1 score     |
|------------|---------|--------|---------|------------------|
| original imbalanced data      |  66.51%(±0.36) | 66.41%(±0.58) |  58.33%(±0.48) | 60.01%(±0.55) |
| -oversampling- | | | | |
|  RandomOverSampler      |  65.95%(±0.31) | 63.29%(±0.32) |  66.09%(±0.40) | 64.18%(±0.34) |
|  SMOTE       |  66.03%(±0.36) | 63.34%(±0.40) |  65.94%(±0.47) | 64.21%(±0.43) |
| -downsampling- | | | | |
| RandomUnderSampler |  61.80%(±0.25) | 60.17%(±0.21) |  64.62%(±0.21) | 60.63%(±0.26) |
| NearMiss-1 |  60.11%(±0.24) | 58.64%(±0.28) |  62.99%(±0.32) | 58.51%(±0.25) |
| NearMiss-2 |  57.11%(±2.24) | 57.32%(±1.05) |  61.937%(±1.22) | 56.08%(±2.08) |
| NearMiss-3 |  61.03%(±0.20) | 59.12%(±0.23) |  63.20%(±0.39) | 59.83%(±0.21) |
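For reference, a macro average is the unweighted mean of the per-class scores, so the minority class counts as much as the majority class. The short sketch below assumes that the 'macro' argument passed to lr_cv means this usual definition, and uses the per-class numbers of the first SMOTE fold printed earlier. It covers one fold only, so it will not match the 5-fold averages in the table exactly.

```python
import numpy as np

# Per-class scores of the first SMOTE fold, copied from the output earlier in the post.
# order: negative, neutral, positive
precision = np.array([0.48556582, 0.69869842, 0.71257086])
recall = np.array([0.64841943, 0.64791988, 0.6781974])
f1 = np.array([0.55529878, 0.67235177, 0.69495935])

# Macro average: every class gets the same weight, regardless of its size
macro_precision = precision.mean()
macro_recall = recall.mean()
macro_f1 = f1.mean()

print("macro precision: {:.4f}".format(macro_precision))
print("macro recall:    {:.4f}".format(macro_recall))
print("macro f1 score:  {:.4f}".format(macro_f1))
```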

Based on the above result, the sampling technique I'll be using for the next post will be SMOTE. In the next post, I will try different classifiers with SMOTE oversampled data.