# Sentiment Analysis

## 1. Data Source

The data comes from the Multi-Domain Sentiment Dataset: https://www.cs.jhu.edu/~mdredze/datasets/sentiment/index2.html

## 2. Data Loading

In [1]:
import nltk
import numpy as np
from nltk.stem import WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from bs4 import BeautifulSoup

In [2]:
print("Initial NLTK data path: {0}".format(nltk.data.path))
      
nltk_data_path = '/media/raul/Data/nltk_data'
nltk.data.path.append(nltk_data_path)

print("Final NLTK data path: {0}".format(nltk.data.path))

Initial NLTK data path: ['/home/raul/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data']
Final NLTK data path: ['/home/raul/nltk_data', '/usr/share/nltk_data', '/usr/local/share/nltk_data', '/usr/lib/nltk_data', '/usr/local/lib/nltk_data', '/media/raul/Data/nltk_data']


We now load the stop words used in this exercise.

In [3]:
path_stopwords = '/home/raul/Documents/udemy/datas_cience_nlp/stopwords.txt'
# Read one stop word per line; the context manager closes the file afterwards
with open(path_stopwords) as f:
    STOP_WORDS = set(w.rstrip() for w in f)

We initialize a WordNet lemmatizer.

In [4]:
wordnet_lemmatizer = WordNetLemmatizer()

Let's check how the lemmatizer works:

In [5]:
for word in ["hello", "programmer", "computers", "crying"]:
    print("Original: {0}; Lemmatized: {1}".format(word, wordnet_lemmatizer.lemmatize(word)))

Original: hello; Lemmatized: hello
Original: programmer; Lemmatized: programmer
Original: computers; Lemmatized: computer
Original: crying; Lemmatized: cry


Now we load the raw data: positive and negative reviews of electronics products.

In [6]:
pos_reviews_path = '/home/raul/Documents/udemy/datas_cience_nlp/sentiment_analysis/sorted_data_acl/electronics/positive.review'
neg_reviews_path = '/home/raul/Documents/udemy/datas_cience_nlp/sentiment_analysis/sorted_data_acl/electronics/negative.review'

# Parse each file and keep only the <review_text> elements
with open(pos_reviews_path) as f:
    positive_reviews = BeautifulSoup(f.read(), "lxml").find_all('review_text')
with open(neg_reviews_path) as f:
    negative_reviews = BeautifulSoup(f.read(), "lxml").find_all('review_text')

Let's check the first 3 positive reviews:

In [7]:
print(type(positive_reviews))
print(positive_reviews[:3])

<class 'bs4.element.ResultSet'>
[<review_text>
I purchased this unit due to frequent blackouts in my area and 2 power supplies going bad.  It will run my cable modem, router, PC, and LCD monitor for 5 minutes.  This is more than enough time to save work and shut down.   Equally important, I know that my electronics are receiving clean power.

I feel that this investment is minor compared to the loss of valuable data or the failure of equipment due to a power spike or an irregular power supply.

As always, Amazon had it to me in &lt;2 business days
</review_text>, <review_text>
I ordered 3 APC Back-UPS ES 500s on the recommendation of an employee of mine who used to work at APC. I've had them for about a month now without any problems. They've functioned properly through a few unexpected power interruptions. I'll gladly order more if the need arises.

Pros:
 - Large plug spacing, good for power adapters
 - Simple design
 - Long cord

Cons:
 - No line conditioning (usually an expensive o

## 3. Data Transformation

In this section, we vectorize the text data. First, we shuffle the positive reviews and keep only as many of them as there are negative reviews, so that both classes are balanced (there are more positive reviews than negative ones).

In [8]:
np.random.shuffle(positive_reviews)
positive_reviews = positive_reviews[:len(negative_reviews)]
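
The shuffle above is unseeded, so every run keeps a different subset of positive reviews. As a minimal sketch of a reproducible variant, the block below balances two toy lists with a seeded NumPy generator. The lists `positive` and `negative` and the seed value are hypothetical stand-ins, not the notebook's data.

```python
import numpy as np

# Hypothetical stand-ins for the parsed review lists
positive = ["pos_{0}".format(i) for i in range(10)]
negative = ["neg_{0}".format(i) for i in range(6)]

# Seeded generator: the seed value is arbitrary, it only makes the run repeatable
rng = np.random.default_rng(42)

# Draw a random ordering of the positive indices and keep as many as there are negatives
order = rng.permutation(len(positive))
positive_balanced = [positive[i] for i in order[:len(negative)]]

print(len(positive_balanced), len(negative))
```

Sampling through an index permutation leaves the original list untouched, which can help when the full set is needed again later.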

Now we define the `my_tokenizer` function, which tokenizes an input string.

In [9]:
def my_tokenizer(s):
    """
    Tokenize an input string
    
    :param s: input string which represents several lines / paragraph
    :return: list containing all the tokens
    """
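    s = s.lower()  # normalize case
    tokens = nltk.tokenize.word_tokenize(s)  # split the text into word tokens
    tokens = [t for t in tokens if len(t) > 2]  # keep only tokens longer than 2 characters
    tokens = [wordnet_lemmatizer.lemmatize(t) for t in tokens]  # reduce each token to its lemma
    tokens = [t for t in tokens if t not in STOP_WORDS]  # drop stop words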
    return tokens
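
To make the filtering steps easier to follow without the NLTK data, here is a simplified, standard-library-only sketch of the same pipeline. It is not equivalent to `my_tokenizer`: a regular expression stands in for `word_tokenize`, lemmatization is skipped, and `TOY_STOP_WORDS` is a hypothetical three-word list.

```python
import re

# Hypothetical, tiny stop word list used only for this illustration
TOY_STOP_WORDS = {"the", "and", "this"}

def simple_tokenizer(s, stop_words=TOY_STOP_WORDS):
    """Lowercase, split on letters/apostrophes, drop short tokens and stop words."""
    s = s.lower()
    tokens = re.findall(r"[a-z']+", s)          # crude stand-in for word_tokenize
    tokens = [t for t in tokens if len(t) > 2]  # same length filter as my_tokenizer
    tokens = [t for t in tokens if t not in stop_words]
    return tokens

tokens = simple_tokenizer("This UPS is great and the cord is long")
print(tokens)  # ['ups', 'great', 'cord', 'long']
```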

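Next, we tokenize every review and store the resulting token lists. At the same time we build the feature map `word_index_map`, which assigns a unique column index to each distinct token. Both the positive and the negative reviews must be processed so that the map covers the whole vocabulary.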

In [10]:
word_index_map = {}
current_index = 0

positive_tokenized = []
negative_tokenized = []

In [11]:
for review in positive_reviews:
    tokens = my_tokenizer(review.text)
    positive_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
            
for review in negative_reviews:
    tokens = my_tokenizer(review.text)
    negative_tokenized.append(tokens)
    for token in tokens:
        if token not in word_index_map:
            word_index_map[token] = current_index
            current_index += 1
    

In [12]:
print("Length of word_index_map: {0}".format(len(word_index_map)))

Length of word_index_map: 11091
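
The two loops above differ only in the list they read and the list they fill. One possible way to avoid the duplication, sketched here with a hypothetical helper `build_word_index_map` and toy token lists, is to build the map in a single function that uses the current size of the dictionary as the next free index.

```python
def build_word_index_map(tokenized_docs):
    """Assign a unique, consecutive index to every distinct token."""
    word_index_map = {}
    for tokens in tokenized_docs:
        for token in tokens:
            if token not in word_index_map:
                # The dictionary size is always the next unused index
                word_index_map[token] = len(word_index_map)
    return word_index_map

# Toy token lists standing in for positive_tokenized + negative_tokenized
docs = [["good", "price"], ["bad", "price", "return"]]
vocab = build_word_index_map(docs)
print(vocab)  # {'good': 0, 'price': 1, 'bad': 2, 'return': 3}
```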


We define the `tokens_to_vector` function, which turns the tokens of one observation into a vector of word proportions and stores the label in the last position.

In [13]:
def tokens_to_vector(tokens, label, word_index_map):
    """
    Returns the vectorized format of an observation,
    including features and label

    :param tokens: tokens corresponding to the observation
    :param label: label of the observation
    :param word_index_map: dictionary which contains the strings-to-index 
    map
    :return: numpy array containing the vectorized representation
    of the observation (including features and label)
    """
    
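    x = np.zeros(len(word_index_map)+1)  # last position is reserved for the label
    for t in tokens:
        i = word_index_map[t]
        x[i] += 1  # raw count of token t
    x = x/x.sum()  # normalize counts to proportions (assumes at least one token)
    x[-1] = label
    return x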
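
A small worked example may clarify the normalization: for the tokens `["good", "price", "good"]` the counts are 2 and 1, so the proportions are 2/3 and 1/3. The sketch below also guards against a review whose tokens were all filtered out, in which case dividing by a sum of zero would produce NaN values. The helper `tokens_to_proportions` and the map `toy_map` are hypothetical and cover only the feature part, without the label.

```python
import numpy as np

def tokens_to_proportions(tokens, word_index_map):
    """Return the word-proportion features for one observation (no label)."""
    x = np.zeros(len(word_index_map))
    for t in tokens:
        x[word_index_map[t]] += 1
    total = x.sum()
    if total > 0:  # leave an all-zero vector untouched instead of dividing by zero
        x = x / total
    return x

# Toy vocabulary, made up for this illustration
toy_map = {"good": 0, "price": 1, "bad": 2, "return": 3}

features = tokens_to_proportions(["good", "price", "good"], toy_map)
empty = tokens_to_proportions([], toy_map)
print(features)
print(empty)
```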

Now we build the final dataset, which contains all the observations in their vectorized format, and save it in the variable `data`.

In [14]:
N = len(positive_tokenized) + len(negative_tokenized)

data = np.zeros((N, len(word_index_map) + 1))
i = 0

for tokens in positive_tokenized:
    xy = tokens_to_vector(tokens, 1, word_index_map)
    data[i,:] = xy
    i += 1
    
for tokens in negative_tokenized:
    xy = tokens_to_vector(tokens, 0, word_index_map)
    data[i,:] = xy
    i += 1

Finally, we shuffle the rows and split the data into a training set and a test set, holding out the last 100 observations for testing.

In [15]:
np.random.shuffle(data)

X = data[:, :-1]
Y = data[:, -1]
Xtrain = X[:-100, ]
Xtest = X[-100:, ]
Ytrain = Y[:-100, ]
Ytest = Y[-100:, ]
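
Because the label is stored as the last column, shuffling whole rows keeps each feature vector paired with its label. The sketch below repeats the split on a small random matrix with a seeded generator so the result is repeatable. The matrix `toy_data`, the seed, and the test size `n_test` are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# 20 toy observations: 3 feature columns plus a 0/1 label column
toy_data = np.hstack([rng.random((20, 3)), rng.integers(0, 2, size=(20, 1))])

# Shuffles rows in place, so features and label of a row move together
rng.shuffle(toy_data)

n_test = 5  # hypothetical hold-out size
X, Y = toy_data[:, :-1], toy_data[:, -1]
X_train, X_test = X[:-n_test], X[-n_test:]
Y_train, Y_test = Y[:-n_test], Y[-n_test:]

print(X_train.shape, X_test.shape)  # (15, 3) (5, 3)
```

scikit-learn's `train_test_split` from `sklearn.model_selection` offers a similar split, including stratification by label, if a ready-made utility is preferred.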

## 4. Training a Logistic Regression Model

In [16]:
model = LogisticRegression()
model.fit(Xtrain, Ytrain)
print("Accuracy: {0}".format(model.score(Xtest, Ytest)))

Accuracy: 0.68
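
As a reminder of what a binary logistic regression computes, the sketch below turns a weighted sum of the features into a probability with the sigmoid function and predicts the positive class when that probability exceeds 0.5. The weights, bias, and input vector are made up for the illustration and are not taken from the fitted model.

```python
import numpy as np

def predict_proba_positive(x, weights, bias):
    """Probability of the positive class: sigmoid of the linear score."""
    z = np.dot(weights, x) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical two-word vocabulary: one word with a positive weight, one with a negative weight
weights = np.array([2.0, -1.0])
bias = 0.0
x = np.array([0.5, 0.5])  # both words make up half of the review

p = predict_proba_positive(x, weights, bias)
predicted_label = int(p > 0.5)
print(p, predicted_label)
```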


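Finally, we inspect the learned coefficients to see which words carry the most weight. A positive weight pushes the prediction toward the positive class (label 1) and a negative weight toward the negative class (label 0). We print only the words whose absolute weight exceeds the threshold. Tokens such as `doe` and `ha` in the output are `does` and `has` after noun lemmatization.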

In [17]:
threshold = 0.5
for word, index in word_index_map.items():
    weight = model.coef_[0][index]
    if abs(weight) > threshold:
        print(word, weight)

item -0.951794126417
poor -0.756664646438
look 0.558818507452
money -1.0673566248
junk -0.559994220637
unit -0.723025994661
returned -0.78973086928
perfect 0.956468267869
sound 1.10385206621
've 0.798391784631
company -0.551715300813
excellent 1.32142262447
then -1.14804101983
cable 0.822756873881
week -0.730898444164
value 0.560089392122
memory 0.980963258223
quality 1.58839854932
picture 0.572354555612
refund -0.625908994638
bit 0.663082241948
expected 0.550221338526
tried -0.799445271553
try -0.679510192992
speaker 0.83212845677
warranty -0.613773953173
happy 0.607481542225
space 0.577802709344
love 1.17363432721
radio -0.561903719949
little 1.01766258733
doe -1.14714117698
stopped -0.543420542325
month -0.847680312154
comfortable 0.637361759049
card -0.517436835623
you 1.16336809261
pretty 0.810205838129
support -0.822755883536
easy 1.67880350844
price 2.7144396971
return -1.18455962988
bad -0.757009821991
home 0.504365586163
ha 0.796669805859
laptop 0.614367283195
n't -2.050116540
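
The printed list follows dictionary order, which makes the strongest words hard to spot. A possible refinement is to sort the pairs by weight, as in this sketch, which uses a handful of (word, weight) pairs copied from the output above and rounded to two decimals. The dictionary `weights` is a hypothetical stand-in for the pairs built from `model.coef_` and `word_index_map`.

```python
# A few pairs copied from the output above, rounded to two decimals
weights = {
    "price": 2.71,
    "easy": 1.68,
    "quality": 1.59,
    "return": -1.18,
    "then": -1.15,
    "money": -1.07,
}

# Sort from the most positive to the most negative weight
ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

most_positive = ranked[:3]
most_negative = ranked[-3:][::-1]  # strongest negative first

print(most_positive)  # [('price', 2.71), ('easy', 1.68), ('quality', 1.59)]
print(most_negative)  # [('return', -1.18), ('then', -1.15), ('money', -1.07)]
```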