# Twitter Sentiment analysis | SMOTE for oversampling

Twitter is an American microblogging and social networking service on which users post and interact with messages known as "tweets". People from all over the world express their emotions through tweets, and in this notebook we will see how we can extract these emotions from the text of the tweets.

The problem with tweets is that they are not written formally, so before feeding them to a model we need to do a lot of pre-processing to reduce them to a meaningful set of words.

Here I am going to use a dataset that contains reviews of airline services. We will briefly analyse the dataset to get some insights. So let us start.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

First, let us import the libraries we need up front. We will import more libraries later as required.

In [None]:
import warnings
warnings.filterwarnings("ignore") #To ignore warnings to get clean output
import pandas as pd
import matplotlib.pyplot as plt

First we load the dataset, which is in CSV format, using the read_csv function of [pandas](https://www.mygreatlearning.com/blog/python-pandas-tutorial/) and store it in a dataframe. Then we print the first 5 rows using the head() method.

In [None]:
data=pd.read_csv("/kaggle/input/twitter-airline-sentiment/Tweets.csv")
print(len(data))#total number of entries
data.head()


Next we check the CSV data for null values. Quite a few columns have a lot of null values, so it is better to drop them. We also drop columns that have no bearing on sentiment, such as the tweet id. For simplicity, we keep only the sentiment, airline and tweet text columns, which are the ones this notebook needs.

In [None]:
data.isnull().sum()

In [None]:
data.drop(columns=['negativereason','negativereason_confidence',
                  'airline_sentiment_gold','tweet_coord','negativereason_gold','tweet_location','user_timezone'
                  ,'tweet_id','retweet_count','name','tweet_created','airline_sentiment_confidence'],inplace=True)

To check the maximum length of our tweets (in space-separated tokens), we can run the following lines. This is not important for sentiment analysis, so you can skip it.

In [None]:
data['token_length'] = [len(x.split(" ")) for x in data.text]
max(data.token_length)

Next, we check the number of data points in each class. Clearly this is an imbalanced dataset, which is why we will use SMOTE later to see how it affects the results.

In [None]:
data.airline_sentiment.value_counts()

Let us see which airlines we have reviews for. We are going to use this information later.

In [None]:
data['airline'].unique()

Now let us count the positive, negative and neutral reviews for each airline and plot a bar graph to give us some insights.

In [None]:
count_neg={}
count_pos={}
count_neu={}
for i in data['airline'].unique():
    x=len(data.loc[(data['airline']==i) & (data['airline_sentiment']=='negative')])
    count_neg.update({i:x})
for i in data['airline'].unique():
    x=len(data.loc[(data['airline']==i) & (data['airline_sentiment']=='positive')])
    count_pos.update({i:x})
for i in data['airline'].unique():
    x=len(data.loc[(data['airline']==i) & (data['airline_sentiment']=='neutral')])
    count_neu.update({i:x})

In [None]:
print(count_neg)
print(count_pos)
print(count_neu)
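
The three loops above build one dictionary per sentiment. As a hedged alternative, the same per-airline counts can usually be obtained in one call with `pd.crosstab`. The sketch below runs on a small made-up frame (the `toy` data is hypothetical, not taken from Tweets.csv); only the column names mirror the notebook.

```python
import pandas as pd

# Hypothetical toy data standing in for the `airline` and `airline_sentiment` columns.
toy = pd.DataFrame({
    "airline": ["United", "United", "Delta", "Delta", "Delta"],
    "airline_sentiment": ["negative", "positive", "negative", "neutral", "negative"],
})

# Rows are airlines, columns are sentiment classes, cells are tweet counts.
counts = pd.crosstab(toy["airline"], toy["airline_sentiment"])
print(counts)
```

On the real data, `pd.crosstab(data['airline'], data['airline_sentiment'])` should give a table that can also be plotted directly with `.plot(kind='bar')`.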

In [None]:
import numpy as np
# set width of bar
barWidth = 0.25
plt.figure(figsize=(20,10))
 
# set height of bar
# convert the dict views to lists so matplotlib receives plain sequences
bars1 = list(count_neg.values())
bars2 = list(count_pos.values())
bars3 = list(count_neu.values())
 
# Set position of bar on X axis
r1 = np.arange(len(bars1))
r2 = [x + barWidth for x in r1]
r3 = [x + barWidth for x in r2]
 
# Make the plot
plt.bar(r1, bars1, color='#7f6d5f', width=barWidth, edgecolor='white', label='neg')
plt.bar(r2, bars2, color='#557f2d', width=barWidth, edgecolor='white', label='pos')
plt.bar(r3, bars3, color='#2d7f5e', width=barWidth, edgecolor='white', label='neu')
 
# Add xticks on the middle of the group bars
plt.xlabel('Airlines', fontweight='bold',fontsize=18)
plt.xticks([r + barWidth for r in range(len(bars1))], [i for i in data['airline'].unique()],fontsize=16)
                                                       
 
# Create legend & Show graphic
plt.legend(fontsize="x-large")
plt.show()



From the plot above, we see that United, US Airways and American have more negative reviews than the other airlines. How can we use this information to improve our model?
If you have looked at the tweets, you might have noticed that each review mentions the airline, so the names of these airlines show up more often in negative reviews. The model may pick up on this, and airline names in the text are likely to affect its predictions. We want the model to extract sentiment from the meaningful part of the text rather than from airline names, so we are going to remove them in the pre-processing stage.
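
As a minimal sketch of that idea, the snippet below strips `@handle` mentions from a tweet with a regular expression. The helper name `strip_mentions` and the sample tweet are made up for illustration, and the pattern `@\w+` is an assumption that is slightly broader than the one used in the notebook's own cleaning function.

```python
import re

# Matches @handles such as @united or @VirginAmerica (letters, digits, underscore).
MENTION = re.compile(r"@\w+")

def strip_mentions(tweet):
    """Remove @handles and collapse the leftover whitespace."""
    without = MENTION.sub("", tweet)
    return " ".join(without.split())

cleaned = strip_mentions("@united thanks for   the delay, @AmericanAir was worse")
print(cleaned)
```

Note that this only removes airline names written as mentions; a name typed as a plain word would survive.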

Now we use [logistic regression](https://www.mygreatlearning.com/blog/logistic-regression-with-examples-in-python-and-r) along with a [tf-idf vectorizer](https://www.mygreatlearning.com/blog/bag-of-words/) to extract sentiment from text. Tf-idf turns the text into features, while logistic regression is a simple classifier.
We are going to train the model on the imbalanced dataset as well as on a balanced one and see how the result changes.
We use stratified k-fold cross-validation to train the model and report macro-averaged metrics as the final result.
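
To make the tf-idf idea concrete, here is a small hand-rolled sketch on a hypothetical three-tweet corpus. It uses the textbook weighting (term frequency times log of N over document frequency). This is an assumption made for readability: scikit-learn's `TfidfVectorizer` applies a smoothed idf and L2 normalisation by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical three-tweet corpus, already lowercased and split on spaces.
docs = [
    "flight was late".split(),
    "flight was great".split(),
    "great crew".split(),
]

def tf_idf(doc, corpus):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    n_docs = len(corpus)
    counts = Counter(doc)
    weights = {}
    for term, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for d in corpus if term in d)
        weights[term] = tf * math.log(n_docs / df)
    return weights

weights = tf_idf(docs[0], docs)
print(weights)
```

In the first tweet, "late" appears in only one document and so gets a higher weight than "flight", which appears in two.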

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

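tvec = TfidfVectorizer(stop_words=None, max_features=100000, ngram_range=(1, 3))
lr = LogisticRegression()

# Baseline pipeline without oversampling: tf-idf features followed by logistic regression.
# The name original_pipeline is the one passed to lr_cv later in the notebook.
from sklearn.pipeline import Pipeline
original_pipeline = Pipeline([('vectorizer', tvec), ('classifier', lr)])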

from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, f1_score

def lr_cv(splits, X, Y, pipeline, average_method):
    
    kfold = StratifiedKFold(n_splits=splits, shuffle=True, random_state=777)
    accuracy = []
    precision = []
    recall = []
    f1 = []
    for train, test in kfold.split(X, Y):
        lr_fit = pipeline.fit(X[train], Y[train])
        prediction = lr_fit.predict(X[test])
        scores = lr_fit.score(X[test],Y[test])
        
        accuracy.append(scores * 100)
        precision.append(precision_score(Y[test], prediction, average=average_method)*100)
        print('              negative    neutral     positive')
        print('precision:',precision_score(Y[test], prediction, average=None))
        recall.append(recall_score(Y[test], prediction, average=average_method)*100)
        print('recall:   ',recall_score(Y[test], prediction, average=None))
        f1.append(f1_score(Y[test], prediction, average=average_method)*100)
        print('f1 score: ',f1_score(Y[test], prediction, average=None))
        print('-'*50)

    print("accuracy: %.2f%% (+/- %.2f%%)" % (np.mean(accuracy), np.std(accuracy)))
    print("precision: %.2f%% (+/- %.2f%%)" % (np.mean(precision), np.std(precision)))
    print("recall: %.2f%% (+/- %.2f%%)" % (np.mean(recall), np.std(recall)))
    print("f1 score: %.2f%% (+/- %.2f%%)" % (np.mean(f1), np.std(f1)))
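
The function above reports both accuracy and macro-averaged scores. The following sketch, on made-up labels, shows why the two can disagree on imbalanced data: the macro average gives every class equal weight, so a class the model never predicts pulls it down. The helper `per_class_precision` is hypothetical, and it assumes a class with no predictions scores 0.

```python
def per_class_precision(y_true, y_pred, labels):
    """Precision per label: correct predictions of the label / all predictions of it."""
    result = {}
    for label in labels:
        predicted = [t for t, p in zip(y_true, y_pred) if p == label]
        correct = sum(1 for t in predicted if t == label)
        # Assumption: a label that is never predicted gets precision 0.
        result[label] = correct / len(predicted) if predicted else 0.0
    return result

# Hypothetical labels in which 'neg' dominates, as in the airline data.
y_true = ["neg", "neg", "neg", "neg", "pos", "neu"]
y_pred = ["neg", "neg", "neg", "neg", "neg", "neu"]

scores = per_class_precision(y_true, y_pred, ["neg", "neu", "pos"])
macro = sum(scores.values()) / len(scores)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(scores, macro, accuracy)
```

Here accuracy is 5/6 while macro precision is only 0.6, because the single positive tweet is misclassified.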

I have used NLTK to preprocess the tweets. Special characters, RTs, mentions (@airline) and hashtags (#airline) are removed from the text. I then apply preprocessing steps such as tokenization, stemming and lemmatization to bring all the words to their base form. If you are not familiar with them, go through [this article](https://www.mygreatlearning.com/blog/nltk-tutorial-with-python/).

In [None]:
from nltk.tokenize import TweetTokenizer
import re
from nltk.corpus import stopwords
from nltk.stem import LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
import nltk

def clean_tweet(tweet):
    # Strip @mentions, URLs, #hashtags, $tickers, the "RT" marker and digits.
    return re.sub(r"(@[A-Za-z0-9]+)|(http\S+)|(#[A-Za-z0-9]+)|(\$[A-Za-z0-9]+)|(RT)|([0-9]+)","",tweet)
def remove_special_chars(tweets):  # strips leftover punctuation and symbols from a pandas Series
    for remove in map(lambda r: re.compile(re.escape(r)), [",", ":", "\"", "=", "&", ";", "%", "$",
                                                                     "@", "%", "^", "*", "(", ")", "{", "}",
                                                                     "[", "]", "|", "/", "\\", ">", "<", "-",
                                                                     "!", "?", ".", "'",
                                                                     "--", "---", "#"]):
        tweets.replace(remove, "", inplace=True)
    return tweets
lem=WordNetLemmatizer()
tkn=TweetTokenizer()
ps=LancasterStemmer()
pd.options.display.max_colwidth=1000

def filter_tweet(tweet):
    filtered=[]
    for w in tweet:
        if w.lower() not in stopwords.words('english'):
            filtered.append(w)
    return filtered
def get_pos(word):
    tag=nltk.pos_tag([word])[0][1][0]
    if tag =='J':
        return wordnet.ADJ
    elif tag =='V':
        return wordnet.VERB
    elif tag =='N':
        return wordnet.NOUN
    elif tag =='R':
        return wordnet.ADV
    else:
        return wordnet.NOUN

In [None]:
data['cleantweet']=data['text'].apply(lambda row: clean_tweet(row))
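# assign the result back so the cleaned column is stored regardless of pandas' copy semantics
data['cleantweet']=remove_special_chars(data['cleantweet'])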
data.head()

In [None]:
data['tokenized_text'] = data.apply(lambda row : tkn.tokenize(row['cleantweet']), axis=1)


data['filteredsent'] = data['tokenized_text']#.apply(lambda row : filter_tweet(row))


data['Lemmatized']=data.apply(lambda row :[lem.lemmatize(i,pos=get_pos(i)) for i in row['filteredsent']],axis=1)


data['stemwords'] = data.apply(lambda row : [ps.stem(i) for i in row['filteredsent']],axis=1)




Here I go with the base forms produced by lemmatization and then join the tokens to form a sentence. You can use the stemmed version of the words too.

In [None]:
#The final sentence is built from the lemmatized words. It can be switched to the stemmed words.
#That is entirely up to the user. This sentence will be the input to sklearn's feature extractor.
data['prtext']=data['Lemmatized'] 


data['prtext']=data['prtext'].apply(lambda row : ' '.join(row))


Next we print the last few tweets to see how preprocessing changed them.

In [None]:
data.tail()

Here we use our function to train the model and display the results, such as recall, precision, accuracy and f1 score. If you are not familiar with them, check out [this article](https://www.mygreatlearning.com/blog/confusion-matrix-an-overview-with-python-and-r).

In [None]:
lr_cv(5, data.prtext, data.airline_sentiment, original_pipeline, 'macro')

SMOTE: Since our dataset is not balanced, we are going to use SMOTE to oversample our data. Here I am going to briefly explain how SMOTE works.
SMOTE (Synthetic Minority Over-sampling Technique) is an over-sampling approach in which the minority class is over-sampled by creating “synthetic” examples rather than by over-sampling with replacement (duplicating the data).


According to the original research paper “SMOTE: Synthetic Minority Over-sampling Technique” (Chawla et al., 2002), synthetic samples are generated in the following way:
1. Take the difference between the feature vector (sample) under consideration and its nearest neighbour.
2. Multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration.

This causes the selection of a random point along the line segment between two specific features. This approach effectively forces the decision region of the minority class to become more general.
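
The two steps above can be sketched in a few lines of plain Python. This is only an illustration of the interpolation step under stated assumptions: the vectors are made up, the helper name `smote_point` is hypothetical, and the neighbour is given directly rather than found by a nearest-neighbour search as a full SMOTE implementation would do.

```python
import random

def smote_point(sample, neighbour, rng):
    """Return a synthetic point on the segment between sample and neighbour."""
    gap = rng.random()  # random number in [0, 1)
    # Step 1: difference (n - s); step 2: scale it by gap and add it to the sample.
    return [s + gap * (n - s) for s, n in zip(sample, neighbour)]

rng = random.Random(777)   # seeded only so the sketch is repeatable
sample = [1.0, 2.0]        # hypothetical minority-class feature vector
neighbour = [3.0, 6.0]     # hypothetical nearest minority-class neighbour
synthetic = smote_point(sample, neighbour, rng)
print(synthetic)
```

Every generated point lies between the two original vectors, which is what distinguishes SMOTE from simply duplicating minority samples.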

We are going to use the imblearn library to implement SMOTE.

In [None]:
from imblearn.pipeline import make_pipeline
from imblearn.over_sampling import SMOTE
SMOTE_pipeline = make_pipeline(tvec, SMOTE(random_state=777),lr)

In [None]:
# use the preprocessed text (prtext), as in the baseline run, so the two results are comparable
lr_cv(5, data.prtext, data.airline_sentiment, SMOTE_pipeline, 'macro')


From the above results, we see that we get better accuracy and f1 score with SMOTE. This suggests that the model is less biased towards any particular class (neg, pos or neu). Thus we conclude that SMOTE did improve the model and is an effective way to deal with imbalanced datasets.