In this part of the exercise we have to use the training, development and test examples, in order to create and evaluate a supervised classifier for tweets. 

All the necessary imports

In [19]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import glob
from bs4 import BeautifulSoup
import re
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import WordPunctTokenizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.grid_search import GridSearchCV

Read all the files (.tsv) inside a file called train and concat them to pandas dataframe called train and also define the columns

In [2]:
cols = ['id','sentiment','text']
path =r'/home/mscuser/Desktop/language_proc_exercise_2/train/' # use your path
allFiles = glob.glob(path + "/*.tsv")
train = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,header=None, names=cols, sep = '\t')
    list_.append(df)
train = pd.concat(list_)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 16045 entries, 0 to 9664
Data columns (total 3 columns):
id           16045 non-null int64
sentiment    16045 non-null object
text         16045 non-null object
dtypes: int64(1), object(2)
memory usage: 501.4+ KB


Read the file twitter-2016test-A.tsv inside a file called test and creates pandas dataframe called test                          
Read all the files (.tsv) inside a file called dev-test and concat them to pandas dataframe called dev-test

In [3]:
test = pd.read_csv("/home/mscuser/Desktop/language_proc_exercise_2/test/twitter-2016test-A.tsv", sep='\t', header=None, names=cols)
path =r'/home/mscuser/Desktop/language_proc_exercise_2/dev-test' # use your path
allFiles = glob.glob(path + "/*.tsv")
devTest = pd.DataFrame()
list_ = []
for file_ in allFiles:
    df = pd.read_csv(file_,header=None, names=cols, sep = '\t')
    list_.append(df)
devTest = pd.concat(list_)
devTest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3947 entries, 0 to 1946
Data columns (total 3 columns):
id           3947 non-null int64
sentiment    3947 non-null object
text         3947 non-null object
dtypes: int64(1), object(2)
memory usage: 123.3+ KB


Erase all the instances where the twitter text is 'Not Available' and drop the column which contains the id of the twitter comment for each one of the pandas dataframes created above.

In [4]:
devTest = devTest.drop(['id'], axis=1)
devTest = devTest[devTest.text != 'Not Available']
len(devTest)
devTest.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3083 entries, 1 to 1946
Data columns (total 2 columns):
sentiment    3083 non-null object
text         3083 non-null object
dtypes: object(2)
memory usage: 72.3+ KB


In [5]:
test = test.drop(['id'], axis=1)
test = test[test.text != 'Not Available']
len(test)
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 15437 entries, 0 to 20341
Data columns (total 2 columns):
sentiment    15437 non-null object
text         15437 non-null object
dtypes: object(2)
memory usage: 361.8+ KB


In [6]:
train = train.drop(['id'], axis=1)
train = train[train.text != 'Not Available']
len(train)
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 11918 entries, 0 to 9663
Data columns (total 2 columns):
sentiment    11918 non-null object
text         11918 non-null object
dtypes: object(2)
memory usage: 279.3+ KB


Check each one by printing the number of neutral, positive and negative tweets.

In [7]:
print train.sentiment.value_counts()
print devTest.sentiment.value_counts()
print test.sentiment.value_counts()

neutral     5143
positive    5117
negative    1658
Name: sentiment, dtype: int64
positive    1422
neutral     1109
negative     552
Name: sentiment, dtype: int64
neutral     7727
positive    5439
negative    2271
Name: sentiment, dtype: int64


The code below transform each one of the phrases to seperated words. This is usefull because we want to recognise the word not as it is a very important word for sentiment analysis

In [8]:
negations_dic = {"isn't":"is not", "aren't":"are not", "wasn't":"was not", "weren't":"were not",
                "haven't":"have not","hasn't":"has not","hadn't":"had not","won't":"will not",
                "wouldn't":"would not", "don't":"do not", "doesn't":"does not","didn't":"did not",
                "can't":"can not","couldn't":"could not","shouldn't":"should not","mightn't":"might not",
                "mustn't":"must not"}
neg_pattern = re.compile(r'\b(' + '|'.join(negations_dic.keys()) + r')\b')

The function below is responsible for data cleaning. We use BeautifulSoup to create soup objects and we use regular exceptions to remove punctuations, emoticons and references which begin with @. We have to point out that we decided not to exclude stopwords because there are many words that are really important for the sentiment analysis, such as no, not, nor, against etc. An other choice whould be to exclude these words from erasing but the choice we made was to keep all stopwords. 

In [9]:
def cleaner(text):
    soup = BeautifulSoup(text, 'lxml')
    souped = soup.get_text()
    try:
        clean = souped.decode("utf-8-sig").replace(u"\ufffd", "?")
    except:
        clean = souped
    stripped = re.sub(r'|'.join((r'@[A-Za-z0-9_]+', r'https?://[^ ]+')), '', clean)
    stripped = re.sub(r'www.[^ ]+', '', stripped)
    lower_case = stripped.lower()
    neg_handled = neg_pattern.sub(lambda x: negations_dic[x.group()], lower_case)
    letters_only = re.sub("[^a-zA-Z]", " ", neg_handled)
    words = [x for x  in WordPunctTokenizer().tokenize(letters_only) if len(x) > 1]
    return (" ".join(words)).strip()

In the code below we clean each one of the train, test and devtest. Also, we keep a list of sentiments for each one of them

In [20]:
clean_train = []
for t in train.text:
    clean_train.append(cleaner(t))
len(clean_train)

clean_devTest = []
for t in devTest.text:
    clean_devTest.append(cleaner(t))

clean_test = []
for t in test.text:
    clean_test.append(cleaner(t))

train_sentiment = []
for t in train.sentiment:
    train_sentiment.append(t)

devTest_sentiment = []
for t in devTest.sentiment:
    devTest_sentiment.append(t)

test_sentiment = []
for t in test.sentiment:
    test_sentiment.append(t)

We use CountVectorizer to create a vector for each one of the tweets in the clean_train, clean_test, clean_devTest dataframes

In [11]:
countVectorizer = CountVectorizer()
Vtrain = countVectorizer.fit_transform(clean_train)
Vtest = countVectorizer.transform(clean_test)
VdevTest = countVectorizer.transform(clean_devTest)

In the code below we train different models with the train set and we use the dev set for prediction in order to choose the best algorithm for this dataset. To do so we print the accuracy_score for each one.

In [12]:
model = MultinomialNB()
model.fit(Vtrain, train_sentiment)
Prediction = model.predict(VdevTest)
print accuracy_score(Prediction, devTest_sentiment)

0.519299383717


In [13]:
model2 = RandomForestClassifier()
model2.fit(Vtrain, train_sentiment)
Prediction = model2.predict(VdevTest)
print accuracy_score(Prediction, devTest_sentiment)

0.467077521894


In [14]:
model3 = LogisticRegression()
model3.fit(Vtrain, train_sentiment)
Prediction = model3.predict(VdevTest)
print accuracy_score(Prediction, devTest_sentiment)

0.490431397989


In [15]:
model4 = SVC()
model4.fit(Vtrain, train_sentiment)
Prediction = model4.predict(VdevTest)
print accuracy_score(Prediction, devTest_sentiment)

0.359714563737


In [16]:
model5 = DecisionTreeClassifier()
model5.fit(Vtrain, train_sentiment)
Prediction = model5.predict(VdevTest)
print accuracy_score(Prediction, devTest_sentiment)

0.450210833604


By observing the scores above, we are going to use MultinomialNB. It is important to point out that in general, MultinomialNB works better with countVectorizer according to documentation. In the section above, we use gridSearch in order to find the best parameters for our case. In order to perform that, we use the devTest 

In [17]:
params = {"alpha": [0.01, 0.1, 1.0, 10], "fit_prior": ['True', 'False']} 
grid = GridSearchCV(model, params) 
grid.fit(VdevTest, (np.array(devTest_sentiment)).ravel()) 
print grid.best_params_  

{'alpha': 1.0, 'fit_prior': 'True'}


According the results of gridSearch we are going to use the default values for the parameters alpha and fit_prior. Finally we use our model to make predictions for the test and print the accuracy. We can see that the accuracy has been increased

In [18]:
model = MultinomialNB(alpha = 1.0, fit_prior = 'True')
model.fit(Vtrain, train_sentiment)
Prediction = model.predict(Vtest)
print accuracy_score(Prediction, test_sentiment)

0.562155859299


In this pragraph we are going to explain all the choices made for each one of the steps of the performed analysis. To begin with, we have to clarify that we performed sentiments analysis using supervised machine learning algorithm. Before that we clean our tweets by erasing empty tweets, punctuations, emoticons, and @references. Also we use CountVectorizer to vectorize the tweets. With the procedure above, we reduce the number of features. 

As for the algorithms, we tried MultinomialNB, LogisticRegression, DecisionTree, SVM, RandomForest using train set to train the model and devtest to make the predictions and we chose MultinomialNB because of the accuracy. Then, we used gridSearch and devTest in order to find the best hyperparameters for our model. Finally we trained again our model with the right hyperparameters and made our prediction using test set. 