# Sentiment Analysis

In Natural Language Processing, Sentiment Analysis refers to methods that systematically extract, classify and predict the polarity (positive or negative) of text documents. Let's implement a **Support Vector Machine** that learns from a dataset of ~100k positive and negative observations. 

In [1]:
import numpy as np
import pandas as pd

import re
from nltk.corpus import stopwords

# Load data into a pandas dataframe

In [2]:
data = pd.read_csv('sentiment.csv', encoding='latin1')

## Let's check out the first 5 rows

In [3]:
data.head()

| | ItemID | Sentiment | SentimentText |
|---|---|---|---|
| 0 | 1 | 0 | is so sad for my APL frie... |
| 1 | 2 | 0 | I missed the New Moon trail... |
| 2 | 3 | 1 | omg its already 7:30 :O |
| 3 | 4 | 0 | .. Omgaga. Im sooo im gunna CRy. I'... |
| 4 | 5 | 0 | i think mi bf is cheating on me!!! ... |


## Print the number of unique labels and the length of the dataset

In [4]:
print(len(data.Sentiment.unique()))
print(len(data))

2
99989


In [5]:
print(data.groupby('Sentiment').size())

Sentiment
0    43532
1    56457
dtype: int64


**Label 0 marks negative tweets and label 1 marks positive tweets.** We have noticeably more positive than negative examples (56,457 vs. 43,532). Imbalanced data can be a problem, especially with small datasets and multi-class classification. There are many ways to deal with imbalanced classes, e.g. collecting more data, upsampling, data augmentation, re-balancing with weights, etc. In this case we have close to 100,000 observations and a binary classification task, so let's move on for now and see how it goes.
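
As a rough illustration of the "re-balancing with weights" option, the sketch below computes per-class weights from the class counts printed above, using the `n_samples / (n_classes * count)` heuristic that scikit-learn documents for its "balanced" mode. The variable names are made up for this example, and the notebook itself does not apply these weights.

```python
# Class counts taken from the groupby output above.
counts = {0: 43532, 1: 56457}

n_samples = sum(counts.values())
n_classes = len(counts)

# "Balanced" heuristic: weight_c = n_samples / (n_classes * count_c).
# The rarer class (0, negative) gets a weight above 1, the more frequent one below 1.
class_weight = {label: n_samples / (n_classes * count)
                for label, count in counts.items()}

for label, weight in class_weight.items():
    print(label, round(weight, 3))
```

The weights come out at roughly 1.15 for the negative class and 0.89 for the positive class. A dictionary like this could be passed as the `class_weight` argument of the classifier if the imbalance turns out to hurt performance.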

# Pre-processing 


Get the label and text columns of our dataframe and transform them to arrays

In [6]:
labels = data.Sentiment.values
text = data.SentimentText.values

In [7]:
labels

array([0, 0, 1, ..., 0, 1, 1])

In [8]:
text

array(['                     is so sad for my APL friend.............',
       '                   I missed the New Moon trailer...',
       '              omg its already 7:30 :O', ...,
       '@CuPcAkE_2120 ya i thought so ',
       "@Cupcake_Dollie Yes. Yes. I'm glad you had more fun with me. ",
       '@cupcake_kayla haha yes you do '], dtype=object)

Looking at the text, it's very messy. Even at first glance we see a lot of problems: runs of whitespace (which make tokenization harder), a lot of usernames, URLs, etc. We need to clean up the text before we feed it to our machine learning model. We want to reduce noise as much as possible, make the tokenizer's job easier, and reduce the number of features, e.g. by removing stopwords (very frequent words that carry little meaning on their own).

### (a rough) Clean up

In [9]:
texts = []
stop_words = set(stopwords.words('english')) # build the stopword set once instead of once per word

for raw_tweet in text:
    tweet = re.sub(r'@([A-Za-z0-9_]+)', ' ', raw_tweet) # replace usernames with a whitespace
    tweet = re.sub(r'http([:/A-Za-z0-9_.]+)', ' ', tweet) # replace urls with a whitespace
    tweet = re.sub(r'www([:/A-Za-z0-9_.]+)', ' ', tweet) # replace urls with a whitespace
    tweet = re.sub(r'[^a-zA-Z0-9.:_!?\-\(\)]', ' ', tweet) # replace everything that is not in the selection with a whitespace
    tweet = re.sub(r'(\s+)', ' ', tweet).lower().split() # collapse whitespace runs, lowercase everything, split into tokens
    tweet = [word for word in tweet if word not in stop_words] # remove stopwords
    tweet = ' '.join(tweet)
    texts.append(tweet)

In [10]:
texts

['sad apl friend.............',
 'missed new moon trailer...',
 'omg already 7:30 :o',
 '.. omgaga. im sooo im gunna cry. dentist since 11.. suposed 2 get crown put (30mins)...',
 'think mi bf cheating me!!! t_t',
 'worry much?',
 'juuuuuuuuuuuuuuuuussssst chillin!!',
 'sunny work tomorrow :- tv tonight',
 'handed uniform today . miss already',
 'hmmmm.... wonder number -)',
 'must think positive..',
 'thanks haters face day! 112-102',
 'weekend sucked far',
 'jb isnt showing australia more!',
 'ok thats win.',
 'lt -------- way feel right now...',
 'awhhe man.... completely useless rt now. funny twitter.',
 'feeling strangely fine. gonna go listen semisonic celebrate',
 'huge roll thunder now...so scary!!!!',
 'cut beard off. growing well year. gonna start over. happy meantime.',
 'sad iran.',
 'wompppp wompp',
 'one see cause one else following pretty awesome',
 'lt ---sad level 3. writing massive blog tweet myspace comp shut down. lost lays fetal position',
 '... headed hospitol : p

## X (documents) and y (labels). We're predicting ŷ.

In [11]:
X = texts
y = labels

### First things first. We are using the scikit-learn CountVectorizer to map each token to a column index and count its occurrences per document

In [12]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 5000) # choose the max number of features carefully
X = cv.fit_transform(X).toarray()
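
To make the idea behind the vectorizer concrete, here is a minimal pure-Python sketch of a bag-of-words matrix on two toy documents. It is a simplification, not scikit-learn's implementation: it splits on whitespace only and ignores options such as `max_features`. The toy documents and variable names are invented for the example.

```python
from collections import Counter

toy_docs = ["good good day", "bad day"]

# Vocabulary: every distinct token, sorted alphabetically and mapped to a column index.
tokens = sorted({tok for doc in toy_docs for tok in doc.split()})
vocabulary = {tok: idx for idx, tok in enumerate(tokens)}

# Document-term matrix: one row per document, one column per vocabulary entry.
matrix = []
for doc in toy_docs:
    counts = Counter(doc.split())
    matrix.append([counts.get(tok, 0) for tok in tokens])

print(vocabulary)  # {'bad': 0, 'day': 1, 'good': 2}
print(matrix)      # [[0, 1, 2], [1, 1, 0]]
```

Each row is a document and each column a token, so the first document has two occurrences of "good" and one of "day".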

### Let's count the occurrences and show a list of the top 20 tokens in our dataset

In [13]:
occ = np.asarray(X.sum(axis=0)).ravel().tolist()
counts_df = pd.DataFrame({'term': cv.get_feature_names_out(), 'occurrences': occ}) # on scikit-learn < 1.0 use cv.get_feature_names()
counts_df.sort_values(by='occurrences', ascending=False).head(20)

| | occurrences | term |
|---|---|---|
| 3497 | 9153 | quot |
| 1894 | 5923 | good |
| 2576 | 5538 | like |
| 2626 | 5400 | lol |
| 1846 | 5310 | get |
| 2468 | 4609 | know |
| 2657 | 4515 | love |
| 4355 | 4116 | thanks |
| 3108 | 3744 | one |
| 1881 | 3641 | go |


Okay. We see some positive sentiment-bearing words that are very high in frequency. Another important hint is 'quot'. From just skimming the first few rows of our dataset we couldn't see it, but the text contains a lot of HTML entities such as `&quot;`. Our clean-up replaces the `&` and `;` with whitespace and leaves the bare token 'quot' behind. These entities should be handled in the pre-processing phase or, even better, prevented from getting into the data in the first place. For now let's move on and see how the model performs when we skip that cleaning step.
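
One possible way to handle this, sketched here under the assumption that the stray tokens really do come from HTML entities, is to decode the entities with the standard library's `html.unescape` before running the regular expressions. The sample string is invented for the example; the two regular expressions are the same ones used in the clean-up loop.

```python
import html
import re

raw = '&quot;so happy&quot; &amp; relaxed &lt;3'

# Decode entities first, so '&quot;' becomes '"' instead of leaving 'quot' behind.
decoded = html.unescape(raw)

# Then apply the same character filter and whitespace collapsing as in the clean-up loop.
cleaned = re.sub(r'[^a-zA-Z0-9.:_!?\-\(\)]', ' ', decoded)
cleaned = re.sub(r'\s+', ' ', cleaned).strip().lower()

print(decoded)  # "so happy" & relaxed <3
print(cleaned)  # so happy relaxed 3
```

Without the decoding step the same string would yield the tokens 'quot', 'amp' and 'lt', which matches the 'quot' and 'lt' debris seen in the outputs above.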

## TF-IDF

In information retrieval, **TF-IDF**, or **term frequency–inverse document frequency**, is a numerical statistic that reflects how relevant a word is to a document in a collection or corpus. The TF-IDF value increases proportionally to the number of times a word appears in the document and is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear more frequently in general. (source: Wikipedia)
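
The definition can be checked on a tiny example. The sketch below uses the textbook formula `tf * log(N / df)` on three invented documents. scikit-learn's `TfidfTransformer` defaults to a smoothed IDF and L2-normalizes each row, so its numbers will differ from these; the helper name `tf_idf` is hypothetical.

```python
import math

toy_docs = [["good", "day"], ["good", "movie"], ["bad", "movie"]]
n_docs = len(toy_docs)

def tf_idf(term, doc):
    """Textbook TF-IDF: relative term frequency times log inverse document frequency."""
    tf = doc.count(term) / len(doc)                   # share of the document taken up by the term
    df = sum(1 for d in toy_docs if term in d)        # number of documents containing the term
    return tf * math.log(n_docs / df)

# "day" occurs in one document, "good" in two, so "day" is the more distinctive term.
print(tf_idf("day", toy_docs[0]))
print(tf_idf("good", toy_docs[0]))
```

Both terms appear once in the first document, yet "day" scores higher because it is rarer across the corpus.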

In [14]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer()
X = tfidf.fit_transform(X).toarray()

## train-test-split

We will split X and y into train and test sets. For this we will use the `train_test_split` function from scikit-learn, with a training size of 80 % and a test size of 20 %. You can also hold out an additional dev set, but for now let's keep it simple.
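
For intuition, a shuffled 80/20 split can be sketched in plain Python as follows. This is only an illustration of the idea, not how scikit-learn implements it; the helper `shuffle_split` and the toy data are hypothetical.

```python
import random

def shuffle_split(X, y, test_size=0.2, seed=0):
    """Shuffle the row indices with a fixed seed, then cut off the first test_size share as the test set."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

toy_docs = [f"tweet {i}" for i in range(10)]
toy_labels = [i % 2 for i in range(10)]

X_tr, X_te, y_tr, y_te = shuffle_split(toy_docs, toy_labels)
print(len(X_tr), len(X_te))  # 8 2
```

Documents and labels are indexed with the same shuffled indices, so every document keeps its own label after the split.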

In [15]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0) # shuffle is by default true

# Support Vector Machines

Support Vector Machines **(SVMs)** are supervised machine learning models used for classification, regression and outlier detection.

The advantages of support vector machines are:

* Effective in high dimensional spaces.
* Still effective in cases where number of dimensions is greater than the number of samples.
* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.
* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.

The disadvantages of support vector machines include:

* If the number of features is much greater than the number of samples, avoiding over-fitting through the choice of kernel function and regularization term is crucial.
* SVMs do not directly provide probability estimates; these are calculated using an expensive five-fold cross-validation.

In addition to performing linear classification, SVMs can efficiently perform non-linear classification using the **kernel trick**, which implicitly maps the inputs into a high-dimensional feature space.
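
As a small illustration, the RBF kernel computes a similarity `exp(-gamma * ||x - z||^2)` directly from two input vectors, without ever building the high-dimensional feature space explicitly. The sketch below is a plain-Python version for two short vectors; the function name and the toy vectors are made up for the example.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance) for two vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = [0.0, 0.0]
z = [1.0, 1.0]
gamma = 1 / len(x)  # the 1/n_features rule, here 0.5

print(rbf_kernel(x, x, gamma))  # 1.0, identical points are maximally similar
print(rbf_kernel(x, z, gamma))  # exp(-1), similarity decays with distance
```

A larger `gamma` makes the similarity drop off faster, so each training point influences a smaller neighbourhood.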

There are many parameters, some of which the model is very sensitive to. Some of the most important SVM parameters:

* **C :** float, optional (default=1.0). Penalty parameter C of the error term.
* **kernel :** string, optional (default=’rbf’). Specifies the kernel type to be used in the algorithm. 
    It must be one of ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’ or a callable. If none is given, ‘rbf’ will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
* **gamma :** float, optional (default=’auto’). Kernel coefficient for ‘rbf’, ‘poly’ and ‘sigmoid’. If gamma is ‘auto’ then 1/n_features will be used instead.
* **class_weight :** {dict, ‘balanced’}, optional. Set the parameter C of class i to class_weight[i]*C for SVC. 
    If not given, all classes are supposed to have weight one. The “balanced” mode uses the values of y to automatically adjust weights inversely proportional to class frequencies in the input data as n_samples / (n_classes * np.bincount(y))
* **decision_function_shape :** ‘ovo’, ‘ovr’, default=’ovr’. Whether to return a one-vs-rest (‘ovr’) decision function of shape (n_samples, n_classes) as all other classifiers, or the original one-vs-one (‘ovo’) decision function of libsvm which has shape (n_samples, n_classes * (n_classes - 1) / 2).
* **random_state :** int, RandomState instance or None, optional (default=None). The seed of the pseudo random number generator to use when shuffling the data. If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.

(source: scikit-learn)


# SVM

In [16]:
from sklearn import svm

clf = svm.SVC()
clf.fit(X_train, y_train)  

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

## Predicting the result

In [17]:
y_pred = clf.predict(X_test)

## Let's get the F1 score. 
F1 = 2 * (precision * recall) / (precision + recall)
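
To see how the formula behaves, here is a small hand computation on invented labels and predictions, with class 1 treated as the positive class. The lists are toy data, not taken from the tweet dataset.

```python
y_true = [1, 1, 1, 0, 0, 1]
y_hat  = [1, 0, 1, 1, 0, 1]

pairs = list(zip(y_true, y_hat))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)  # share of predicted positives that are correct
recall = tp / (tp + fn)     # share of actual positives that were found
f1 = 2 * (precision * recall) / (precision + recall)

print(precision, recall, f1)  # 0.75 0.75 0.75
```

Because F1 is the harmonic mean of precision and recall, it only gets high when both are high.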

In [19]:
from sklearn.metrics import f1_score
f1_score(y_test, y_pred)

0.7186518869737937

The results are not impressive, especially compared to the Kaggle leaderboard, where top results are around 0.79. But it's a good start and it gives us a baseline to improve upon.

# TODO

* Handle imbalanced classes
* Cleaning should be done more thoroughly
* GridSearch for finding the optimal parameters

## Grid search and training the classifier

Let's use **GridSearchCV** to **identify the best values for the parameters C and gamma, and to choose between the sigmoid and the rbf kernel.** We then train our SVM classifier, print the best estimator and finally make the predictions on the test data. All supervised estimators in scikit-learn implement a **fit(X, y)** method to fit the model and a **predict(X)** method that, given unlabeled observations X, returns the predicted labels y.

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Create a dictionary with a range of parameter values for C, gamma and the kernel
parameters = {'C': [0.01, 0.1, 1], # default: 1.0
          'gamma': [0.001, 0.01, 0.1], # default: auto
          'kernel':['sigmoid','rbf'] }

svc = SVC()
clf = GridSearchCV(svc, parameters)

# Fit one model per parameter combination and cross-validation fold, then refit the best combination on the full training set
clf.fit(X_train, y_train)

# Print the best estimator with its parameters
print(clf.best_estimator_)
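
To get a feel for the cost of this search, the grid can be enumerated by hand. The sketch below expands the same parameter dictionary into its individual combinations; the fold count of 5 is an assumption, since the actual number depends on the `cv` argument and on the default of the installed scikit-learn version.

```python
from itertools import product

parameters = {'C': [0.01, 0.1, 1],
              'gamma': [0.001, 0.01, 0.1],
              'kernel': ['sigmoid', 'rbf']}

# Every combination of one value per parameter is one candidate model.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

n_folds = 5  # assumption: depends on the cv argument / library default
print(len(candidates))            # 18
print(len(candidates) * n_folds)  # 90 model fits before the final refit
```

With roughly 80,000 training tweets, each of these fits is slow for a kernel SVM, so it may be worth searching on a subsample first.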