# Sentiment Analysis Using Logistic Regression



In [3]:
import pandas as pd

# Load the data into DataFrames
train = pd.read_csv('train.csv', encoding='latin-1')
test = pd.read_csv('test.csv', encoding='latin-1')

train.head(10)

Unnamed: 0,ItemID,Sentiment,SentimentText
0,1,0,is so sad for my APL frie...
1,2,0,I missed the New Moon trail...
2,3,1,omg its already 7:30 :O
3,4,0,.. Omgaga. Im sooo im gunna CRy. I'...
4,5,0,i think mi bf is cheating on me!!! ...
5,6,0,or i just worry too much?
6,7,1,Juuuuuuuuuuuuuuuuussssst Chillin!!
7,8,0,Sunny Again Work Tomorrow :-| ...
8,9,1,handed in my uniform today . i miss you ...
9,10,1,hmmmm.... i wonder how she my number @-)


This dataset has three fields: ItemID, Sentiment and SentimentText.

The structure of the text varies from tweet to tweet: they differ in length and contain letters, numbers, unusual characters and so on.

It is also important to note that many words are not spelled correctly, for example "Juuuuuuuuuuuuuuuuussssst" instead of "just" or "gunna" instead of "going to".

This makes it hard to measure how positive or negative the words within the corpus of tweets are. If they were all correct dictionary words, we could use a sentiment lexicon to score them. However, because of the nature of social media language, we cannot do that.
So we need a way of scoring the words such that words that appear in positive tweets get a higher score than those that appear in negative tweets.

But first... how do we represent the tweets as vectors that we can feed to our algorithm?

## Bag of Words

One thing we could do to represent the tweets as equal-sized vectors of numbers is the following:

* Create a list (vocabulary) with all the unique words in the whole corpus of tweets.
* Construct a feature vector for each tweet that contains the counts of how often each word occurs in that tweet.

Note that since the unique words in each tweet represent only a small subset of all the words in the bag-of-words vocabulary, the feature vectors will mostly consist of zeros.

Let's construct the bag of words. We will work with a smaller example for illustrative purposes, and at the end we will work with our real data.

In [5]:
from sklearn.feature_extraction.text import CountVectorizer

twits = [
    'This is Amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

count = CountVectorizer()
bag = count.fit_transform(twits)

count.vocabulary_

{'this': 13,
 'is': 7,
 'amazing': 2,
 'ml': 9,
 'the': 12,
 'best': 3,
 'yes': 15,
 'it': 8,
 'am': 1,
 'not': 10,
 'sure': 11,
 'about': 0,
 'how': 6,
 'going': 5,
 'to': 14,
 'end': 4}

In [6]:
bag.toarray()

array([[0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
       [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0]], dtype=int64)

Each index position in a feature vector corresponds to the integer stored for a word in the vectorizer's `vocabulary_` dictionary, which maps each word to its column index in the bag. For example, 'is' has index 7, so column 7 holds the count of 'is' in each tweet: 1, 2 and 1.

These values in the feature vectors are also called the __raw term frequencies__ tf(t,d) - the number of times a term t occurs in a document d.
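
To make the mapping between words and column indices concrete, here is a minimal sketch that rebuilds the same vocabulary and count vectors by hand. It assumes the tokenization that CountVectorizer applies by default (lowercasing and keeping only tokens of two or more word characters, which is why "I" is dropped); the helper names `tokenize` and `vectorize` are illustrative only and not part of scikit-learn.

```python
import re

twits = [
    'This is Amazing!',
    'ML is the best, yes it is',
    'I am not sure about how this is going to end...'
]

def tokenize(text):
    # Lowercase and keep tokens of at least two word characters
    # (an approximation of CountVectorizer's default behaviour).
    return re.findall(r"\b\w\w+\b", text.lower())

# Step 1: vocabulary of unique words, sorted so indices are alphabetical.
words = sorted({w for t in twits for w in tokenize(t)})
vocabulary = {w: i for i, w in enumerate(words)}

def vectorize(text):
    # Step 2: count how often each vocabulary word occurs in the text.
    vector = [0] * len(vocabulary)
    for w in tokenize(text):
        vector[vocabulary[w]] += 1
    return vector

vectors = [vectorize(t) for t in twits]

print(vocabulary['is'])   # column index of 'is'
print(vectors[1])         # counts for the second tweet
```

On this small example the indices and counts should agree with `count.vocabulary_` and `bag.toarray()` above.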

__How relevant are words? TF vs. IDF__

<b>Term Frequency: </b> The number of times a term occurs in a document is called its term frequency. <br>
<b>Inverse Document Frequency: </b> A factor that diminishes the weight of terms that occur very frequently across the document set and increases the weight of terms that occur rarely.

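Words that appear in almost every tweet carry little information about sentiment, yet raw counts give them the most weight. To solve this problem we will use Term Frequency-Inverse Document Frequency (tf-idf), which reduces a word's score the more frequently it appears across all tweets. It is calculated as: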

$$\text{tf-idf}(t,d) = \text{tf}(t,d) \times \text{idf}(t,d)$$

Here tf(t,d) is the raw term frequency described above, and idf(t,d) is the inverse document frequency, which can be calculated as follows:

$$\text{idf}(t,d) = \log{\frac{n_d}{1 + \text{df}(d,t)}}$$

where $n_d$ is the total number of documents and df(d,t) is the number of documents in which term t appears.

scikit-learn will do all these calculations for us (by default it uses a slightly different, smoothed version of the idf formula):

In [13]:
import numpy as np

from sklearn.feature_extraction.text import TfidfTransformer

tfidf = TfidfTransformer(use_idf=True,
                         norm="l2",
                         smooth_idf=True)

np.set_printoptions(precision=2)

tfidf.fit_transform(bag).toarray()

array([[0.  , 0.  , 0.72, 0.  , 0.  , 0.  , 0.  , 0.43, 0.  , 0.  , 0.  ,
        0.  , 0.  , 0.55, 0.  , 0.  ],
       [0.  , 0.  , 0.  , 0.4 , 0.  , 0.  , 0.  , 0.47, 0.4 , 0.4 , 0.  ,
        0.  , 0.4 , 0.  , 0.  , 0.4 ],
       [0.33, 0.33, 0.  , 0.  , 0.33, 0.33, 0.33, 0.2 , 0.  , 0.  , 0.33,
        0.33, 0.  , 0.25, 0.33, 0.  ]])

We are using the norm="l2" parameter in the TfidfTransformer. This is an important one: it rescales each tweet's tf-idf vector to unit Euclidean length, so that long and short tweets end up on the same scale, which works better with Logistic Regression.
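
As a sanity check, the numbers above can be reproduced by hand. This is a sketch under the assumption that, with smooth_idf=True, scikit-learn computes idf(t) = ln((1 + n_d) / (1 + df(t))) + 1 and then divides each row by its L2 norm. The function name `tfidf_rows` is illustrative, and the count matrix is copied from the bag.toarray() output shown earlier.

```python
import math

# Raw counts copied from the bag.toarray() output.
bag = [
    [0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 2, 1, 1, 0, 0, 1, 0, 0, 1],
    [1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0],
]

def tfidf_rows(counts):
    n_docs = len(counts)
    n_terms = len(counts[0])
    # df[j]: number of documents that contain term j
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    # Smoothed idf (assumed to match smooth_idf=True)
    idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]
    result = []
    for row in counts:
        # Weight each raw count by its idf, then scale the row to unit length
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted))
        result.append([v / norm for v in weighted])
    return result

rows = tfidf_rows(bag)
print([round(v, 2) for v in rows[0]])
```

For the first tweet this gives roughly 0.72 for 'amazing', 0.43 for 'is' and 0.55 for 'this', in line with the first row of the TfidfTransformer output: 'is' appears in every document, so it receives the lowest weight.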

## Data Clean Up 

### Removing stop words



In [15]:
from collections import Counter

vocab = Counter()
for twit in train.SentimentText:
    for word in twit.split(' '):
        vocab[word] += 1

vocab.most_common(20)

[('', 123916),
 ('I', 32879),
 ('to', 28810),
 ('the', 28087),
 ('a', 21321),
 ('you', 21180),
 ('i', 15995),
 ('and', 14565),
 ('it', 12818),
 ('my', 12385),
 ('for', 12149),
 ('in', 11199),
 ('is', 11185),
 ('of', 10326),
 ('that', 9181),
 ('on', 9020),
 ('have', 8991),
 ('me', 8255),
 ('so', 7612),
 ('but', 7220)]
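
The empty string at the top of the list is a side effect of `split(' ')`, which emits an empty token for every extra space. A minimal sketch of the difference, using a made-up sample sentence rather than the real data:

```python
from collections import Counter

sample = "so  sad   today"  # made-up text with repeated spaces

# split(' ') treats every single space as a separator, so runs of
# spaces produce empty-string tokens.
by_space = Counter(sample.split(' '))

# split() with no argument splits on any run of whitespace and
# never yields empty strings.
by_whitespace = Counter(sample.split())

print(by_space[''])        # number of empty tokens from split(' ')
print(by_whitespace[''])   # 0: no empty tokens from split()
```

Switching the loop above to `twit.split()` would be one way to keep the empty token out of the vocabulary.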

In [16]:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure, show, output_file
from bokeh.io import output_notebook
output_notebook()

In [17]:
import math

def plot_distribution(vocabulary):
    # Histogram of the log of each word's count
    log_counts = [math.log(count) for _, count in vocabulary.most_common()]
    hist, edges = np.histogram(log_counts, density=True, bins=500)

    p = figure(tools="pan,wheel_zoom,reset,save",
               toolbar_location="above",
               title="Word distribution across all tweets")
    p.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:], line_color="#555555")
    show(p)

plot_distribution(vocab)

In [18]:
import nltk

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/rohith/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [19]:
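from nltk.corpus import stopwords
# Use a set for fast membership tests
stop = set(stopwords.words('english'))

# Keep only the words that are not stop words (case-sensitive match)
vocab_reduced = Counter()
for w, c in vocab.items():
    if w not in stop:
        vocab_reduced[w] = c

vocab_reduced.most_common(20)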

[('', 123916),
 ('I', 32879),
 ("I'm", 6416),
 ('like', 5086),
 ('-', 4922),
 ('get', 4864),
 ('u', 4194),
 ('good', 3953),
 ('love', 3494),
 ('know', 3472),
 ('go', 2990),
 ('see', 2868),
 ('one', 2787),
 ('got', 2774),
 ('think', 2613),
 ('&amp;', 2556),
 ('lol', 2419),
 ('going', 2396),
 ('really', 2287),
 ('im', 2200)]
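
Note that 'I' is still near the top: NLTK's English stop word list is lowercase, and the membership test above is case-sensitive. One possible fix, sketched here with a small hand-written stop list standing in for stopwords.words('english') and a made-up vocabulary, is to lowercase each word before the lookup:

```python
from collections import Counter

# Hypothetical stand-ins for the NLTK list and the real vocabulary.
stop = {'i', 'to', 'the', 'a'}
vocab = Counter({'I': 5, 'i': 3, 'to': 4, 'love': 2, 'Love': 1})

vocab_reduced = Counter()
for w, c in vocab.items():
    # Compare in lowercase so 'I' is filtered like 'i';
    # this also merges case variants such as 'Love' and 'love'.
    if w.lower() not in stop:
        vocab_reduced[w.lower()] += c

print(vocab_reduced.most_common())
```

Whether merging case variants is desirable depends on the task, since capitalization can itself carry sentiment in tweets (for example "CRy").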

In [20]:
plot_distribution(vocab_reduced)