# L10 - Sentiment Analysis

## Author: Rodolfo Lerma

This assignment builds a sentiment analysis classifier for a collection of tweets.
The data is in the file "twitter_data.csv", which contains 160,000 tweets, each with a sentiment score. The input feature is the tweet text, and the score is either 4 (positive) or 0 (negative).

Assignment Instructions
1. Complete all questions below.
2. Comment on the applicability of the model on future tweets.  

In [10]:
#Imports
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
from nltk.stem.wordnet import WordNetLemmatizer

In [2]:
#Read files
url = "twitter_data.csv"
df = pd.read_csv(url, sep=",")
df.columns = ["sentiment_label","tweet_text"]
    
print(df.head())

   sentiment_label                                         tweet_text
0                4  @elephantbird Hey dear, Happy Friday to You  A...
1                4  Ughhh layin downnnn    Waiting for zeina to co...
2                0  @greeniebach I reckon he'll play, even if he's...
3                0              @vaLewee I know!  Saw it on the news!
4                0  very sad that http://www.fabchannel.com/ has c...


In [3]:
# Sentiment is coded as 4 or 0. Recode 4 as 1 so that labels are 1 (positive) or 0 (negative).
df.sentiment_label = df.sentiment_label.replace(4, 1)

# Check the Data frame again
print(df.head())

   sentiment_label                                         tweet_text
0                1  @elephantbird Hey dear, Happy Friday to You  A...
1                1  Ughhh layin downnnn    Waiting for zeina to co...
2                0  @greeniebach I reckon he'll play, even if he's...
3                0              @vaLewee I know!  Saw it on the news!
4                0  very sad that http://www.fabchannel.com/ has c...


In [4]:
print(df.sentiment_label.unique())

[1 0]


Q1: Generate word cloud for positive sentiment.

In [5]:
print('\n\n Count of positives: {}'.format(np.sum(df['sentiment_label'])))



 Count of positives: 80000


Q2: Generate word cloud for negative sentiment.

In [6]:
print('\n\n Count of negative: {}'.format(len(df['sentiment_label']) - np.sum(df['sentiment_label'])))



 Count of negative: 80000
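
Q1 and Q2 ask for word clouds, while the cells above only report class counts. A word cloud is drawn from per-class word frequencies, so a minimal sketch of that step might look like the following. The helper name `class_word_frequencies` and the sample tweets are made up for illustration and are not taken from twitter_data.csv.

```python
from collections import Counter

def class_word_frequencies(texts, labels, target_label):
    """Count word occurrences over the texts whose label equals target_label."""
    counts = Counter()
    for text, label in zip(texts, labels):
        if label == target_label:
            counts.update(text.lower().split())
    return counts

# Made-up sample rows, for illustration only (hypothetical data)
sample_texts = ["happy friday everyone", "so sad today", "happy happy day"]
sample_labels = [1, 0, 1]

positive_freqs = class_word_frequencies(sample_texts, sample_labels, 1)
negative_freqs = class_word_frequencies(sample_texts, sample_labels, 0)

print(positive_freqs.most_common(2))
print(negative_freqs.most_common(2))
```

On the real data, the same function could be called with `df['clean_tweet']` and `df['sentiment_label']`. The resulting frequency dictionaries could then be handed to a plotting tool, for example `WordCloud().generate_from_frequencies(...)` from the third-party `wordcloud` package, assuming that package is installed.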


Q3: Split data into 70% for training and 30% for testing.

In [7]:
tweet_data = df.values.tolist()

In [8]:
# Apply the requested cleaning steps to a piece of text, in the order given
import string
import re
from nltk.corpus import stopwords  # required by the 'remove_stopwords' step
def preprocess(text, list_of_steps):
    
    for step in list_of_steps:
        if step == 'remove_non_ascii':
            text = ''.join([x for x in text if ord(x) < 128])
        elif step == 'lowercase':
            text = text.lower()
        elif step == 'remove_punctuation':
            punct_exclude = set(string.punctuation)
            text = ''.join(char for char in text if char not in punct_exclude)
        elif step == 'remove_numbers':
            # Raw string avoids an invalid-escape warning for \d
            text = re.sub(r"\d+", "", text)
        elif step == 'strip_whitespace':
            text = ' '.join(text.split())
        elif step == 'remove_stopwords':
            stops = stopwords.words('english')
            word_list = text.split(' ')
            text_words = [word for word in word_list if word not in stops]
            text = ' '.join(text_words)
        elif step == 'stem_words':
            # Despite the step name, this lemmatizes with WordNet rather than stemming
            lmtzr = WordNetLemmatizer()
            word_list = text.split(' ')
            stemmed_words = [lmtzr.lemmatize(word) for word in word_list]
            text = ' '.join(stemmed_words)
    return text

In [11]:
# Clean tweets
#steps = ['lowercase', 'remove_punctuation', 'remove_numbers', 'strip_whitespace']
steps = ['lowercase', 'remove_punctuation', 'remove_numbers', 'strip_whitespace', 'stem_words']

df['clean_tweet'] = df['tweet_text'].map(lambda s: preprocess(s, steps))

In [None]:
# # Create a document storage matrix
# clean_texts = df['clean_tweet']
# docs = {}
# labels = []
# for ix, row in enumerate(clean_texts):
#     # Store the sentiment
#     labels = tweet_data[ix][0]
#     docs[ix] = row.split(' ')

In [None]:
# # We want to keep track of how many unique words there are:
# num_nonzero = 0
# vocab = set()

# for word_list in docs.values():
#     unique_terms = set(word_list)    # all unique terms of this tweet
#     vocab.update(unique_terms)       # set union: add unique terms of this tweet
#     num_nonzero += len(unique_terms) # add count of unique terms in this tweet

# doc_key_list = list(docs.keys())

In [None]:
# # Need to convert everything to a numpy array:
# doc_key_list = np.array(doc_key_list)
# vocab = np.array(list(vocab))

In [None]:
# # We should keep track of how the vocab/term indices map to the matrix so that we can look them up later.
# vocab_sorter = np.argsort(vocab)

In [None]:
# # Initialize our sparse matrix:
# num_docs = len(doc_key_list)
# vocab_size = len(vocab)
# # A COO matrix is just a tuple of data, row indices, and column indices. Everything else is assumed to be zero.
# data = np.empty(num_nonzero, dtype=np.intc)     # all non-zero
# rows = np.empty(num_nonzero, dtype=np.intc)     # row index
# cols = np.empty(num_nonzero, dtype=np.intc)     # column index

In [None]:
# ix = 0
# # go through all documents with their terms
# print('Computing full term-document matrix (sparse), please wait!')
# for doc_key, terms in docs.items():
#     # find indices to insert-into such that, if the corresponding elements were
#     # inserted before the indices, the order would be preserved
#     term_indices = vocab_sorter[np.searchsorted(vocab, terms, sorter=vocab_sorter)]

#     # count the unique terms of the document and get their vocabulary indices
#     uniq_indices, counts = np.unique(term_indices, return_counts=True)
#     n_vals = len(uniq_indices)  # = number of unique terms
#     ix_end = ix + n_vals # Add count to index.

#     data[ix:ix_end] = counts                  # save the counts (term frequencies)
#     cols[ix:ix_end] = uniq_indices            # save the column index: index in 
#     doc_ix = np.where(doc_key_list == doc_key)   # get the document index for the document name
#     rows[ix:ix_end] = np.repeat(doc_ix, n_vals)  # save it as repeated value

#     ix = ix_end  # resume with next document -> will add future data on the end.

In [None]:
# # Create the sparse matrix!
# doc_term_mat = coo_matrix((data, (rows, cols)), shape=(num_docs, vocab_size), dtype=np.intc)

In [None]:
# # Let's check to make sure!
# vocab_list = list(vocab)
# word_of_interest = 'math'
# vocab_interesting_ix = list(vocab).index(word_of_interest)
# print('vocab index of {} : {}'.format(word_of_interest, vocab_interesting_ix))
# # Find which tweets contain word:
# doc_ix_with_word = []
# for ix, row in enumerate(tweet_data): # Note on this line later.
#     if word_of_interest in row[1]:
#         doc_ix_with_word.append(ix)

In [12]:
df.head()

   sentiment_label                                         tweet_text                                        clean_tweet
0                1   @elephantbird Hey dear, Happy Friday to You A...  elephantbird hey dear happy friday to you alre...
1                1     Ughhh layin downnnn Waiting for zeina to co...  ughhh layin downnnn waiting for zeina to cook ...
2                0  @greeniebach I reckon he'll play, even if he's...  greeniebach i reckon hell play even if he not ...
3                0               @vaLewee I know! Saw it on the news!                  valewee i know saw it on the news
4                0  very sad that http://www.fabchannel.com/ has c...  very sad that httpwwwfabchannelcom ha closed d...
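
The cleaned text in row 4 shows a side effect of removing punctuation first: the URL collapses into a single token (`httpwwwfabchannelcom`), and @-handles such as `valewee` survive as ordinary words. One possible remedy, sketched below, is to drop URLs and mentions before the other steps run. The helper `strip_urls_and_mentions` and both regular expressions are assumptions for illustration and are not part of the notebook's `preprocess` function.

```python
import re

# Hypothetical patterns: a loose match for URLs and for Twitter-style handles
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
MENTION_PATTERN = re.compile(r"@\w+")

def strip_urls_and_mentions(text):
    """Replace URLs and @mentions with spaces, then collapse whitespace."""
    text = URL_PATTERN.sub(" ", text)
    text = MENTION_PATTERN.sub(" ", text)
    return " ".join(text.split())

# Sample strings adapted from the tweets shown above
print(strip_urls_and_mentions("very sad that http://www.fabchannel.com/ has closed"))
print(strip_urls_and_mentions("@vaLewee I know!  Saw it on the news!"))
```

This could be wired in as an extra step name at the front of the `steps` list, so that the later punctuation and number removal only ever sees ordinary words.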


In [13]:
# Declare the TF-IDF vectorizer.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, max_features=6228, stop_words='english')

# Fit the vectorizer over the full dataset.
# Note: fitting before the train/test split lets the vocabulary and IDF weights
# see the test tweets; fitting on the training split only gives a cleaner test estimate.
clean_texts = df['clean_tweet']
tf_idf_tweets = vectorizer.fit_transform(clean_texts)
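
To make the `sublinear_tf=True` setting concrete, here is a minimal pure-Python sketch of the per-term weight, assuming scikit-learn's default smoothed IDF: the raw count `tf` is replaced by `1 + ln(tf)`, and the IDF is `ln((1 + n) / (1 + df)) + 1`. The function name `sublinear_tfidf` and the example counts are hypothetical, and the sketch leaves out the row-wise L2 normalization that `TfidfVectorizer` applies by default.

```python
import math

def sublinear_tfidf(term_count, n_docs, doc_freq):
    """Unnormalized TF-IDF weight with sublinear tf and smoothed idf."""
    if term_count == 0:
        return 0.0
    tf = 1.0 + math.log(term_count)                      # dampens repeated terms
    idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1.0  # smoothed inverse document frequency
    return tf * idf

# Made-up counts: a rare term, a common term, and a rare term repeated 10 times
rare = sublinear_tfidf(term_count=1, n_docs=1000, doc_freq=10)
common = sublinear_tfidf(term_count=1, n_docs=1000, doc_freq=400)
repeated = sublinear_tfidf(term_count=10, n_docs=1000, doc_freq=10)

print(rare, common, repeated)
```

A term appearing ten times in a tweet gets only about 3.3 times the weight of a single occurrence, which limits the influence of repeated words.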

In [14]:
print('Splitting into train-test. Please wait!')
from sklearn.model_selection import train_test_split

y_targets = np.array([y[0] for y in tweet_data])
X_train, X_test, y_train, y_test = train_test_split(tf_idf_tweets, y_targets,test_size=0.30,random_state=42)

Splitting into train-test. Please wait!


Q4: Build a classifier that classifies the sentiment of a sentence.

In [15]:
print('Starting a standard Logistic Model training!')
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
lr.fit(X_train, y_train)

Starting a standard Logistic Model training!


LogisticRegression()

In [19]:
print('Starting training regularized logistic regression')
from sklearn.linear_model import SGDClassifier
# Note: scikit-learn 1.1 renamed loss='log' to loss='log_loss'; the old name was removed in 1.3.
lr_reg = SGDClassifier(loss='log', penalty='elasticnet', alpha=0.0001, l1_ratio=0.15)
lr_reg.fit(X_train, y_train)

Starting training regularized logistic regression


SGDClassifier(loss='log', penalty='elasticnet')

Q5: What is the accuracy of your model when applied to testing data?

In [22]:
## Compute results on the train and test set
train_probs = lr.predict_proba(X_train)
train_results = np.argmax(train_probs, axis=1)

test_probs = lr.predict_proba(X_test)
test_results = np.argmax(test_probs, axis=1)

# Compute accuracies
train_logical_correct = [pred == actual for pred, actual in zip(train_results, y_train)]
train_acc = np.mean(train_logical_correct)

test_logical_correct = [pred == actual for pred, actual in zip(test_results, y_test)]
test_acc = np.mean(test_logical_correct)

print('Train accuracy: {}'.format(train_acc))
print('Test accuracy: {}'.format(test_acc))

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Remember:
# Precision is the proportion of correct predictions among all examples predicted as a given class
# Recall (sensitivity) is the proportion of correct predictions among all actual examples of a given class
# F1 is the harmonic mean of precision and recall
# Support is the count of actual cases of each class
# Each of the following is a pair of numbers in sorted label order: the first is for class 0 (negative), the second for class 1 (positive)
precision, recall, f1, support = precision_recall_fscore_support(y_test, test_results)

# Get the parts of the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, test_results).ravel()

# Print results
print(confusion_matrix(y_test, test_results))
print('='*35)
print('             Class 0   -   Class 1')
print('Precision: {}'.format(precision))
print('Recall   : {}'.format(recall))
print('F1       : {}'.format(f1))
print('Support  : {}'.format(support))

Train accuracy: 0.7798482142857143
Test accuracy: 0.7548125
[[17505  6539]
 [ 5230 18726]]
===================================
             Class 0   -   Class 1
Precision: [0.76995821 0.74118346]
Recall   : [0.72804026 0.78168309]
F1       : [0.74841275 0.76089474]
Support  : [24044 23956]
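
As a cross-check, the reported figures can be recomputed by hand from the confusion matrix above. The sketch below assumes scikit-learn's layout, with rows as actual classes and columns as predicted classes, both in label order 0, 1. The variable names are chosen here for illustration.

```python
# Confusion matrix of the plain logistic regression on the test set
# rows = actual class (0, 1), columns = predicted class (0, 1)
tn, fp = 17505, 6539
fn, tp = 5230, 18726

total = tn + fp + fn + tp
accuracy = (tn + tp) / total

# Class 0 (negative) metrics
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)

# Class 1 (positive) metrics
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

print('Accuracy          : {:.7f}'.format(accuracy))
print('Precision (0, 1)  : {:.8f} {:.8f}'.format(precision_neg, precision_pos))
print('Recall    (0, 1)  : {:.8f} {:.8f}'.format(recall_neg, recall_pos))
print('F1 for class 1    : {:.8f}'.format(f1_pos))
```

The first value in each printed pair corresponds to class 0, matching the sorted label order that `precision_recall_fscore_support` uses for its arrays.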


In [21]:
## Compute results on the train and test set
train_probs = lr_reg.predict_proba(X_train)
train_results = np.argmax(train_probs, axis=1)

test_probs = lr_reg.predict_proba(X_test)
test_results = np.argmax(test_probs, axis=1)

# Compute accuracies
train_logical_correct = [pred == actual for pred, actual in zip(train_results, y_train)]
train_acc = np.mean(train_logical_correct)

test_logical_correct = [pred == actual for pred, actual in zip(test_results, y_test)]
test_acc = np.mean(test_logical_correct)

print('Train accuracy: {}'.format(train_acc))
print('Test accuracy: {}'.format(test_acc))

from sklearn.metrics import precision_recall_fscore_support, confusion_matrix

# Remember:
# Precision is the proportion of correct predictions among all examples predicted as a given class
# Recall (sensitivity) is the proportion of correct predictions among all actual examples of a given class
# F1 is the harmonic mean of precision and recall
# Support is the count of actual cases of each class
# Each of the following is a pair of numbers in sorted label order: the first is for class 0 (negative), the second for class 1 (positive)
precision, recall, f1, support = precision_recall_fscore_support(y_test, test_results)

# Get the parts of the confusion matrix
tn, fp, fn, tp = confusion_matrix(y_test, test_results).ravel()

# Print results
print(confusion_matrix(y_test, test_results))
print('='*35)
print('             Class 0   -   Class 1')
print('Precision: {}'.format(precision))
print('Recall   : {}'.format(recall))
print('F1       : {}'.format(f1))
print('Support  : {}'.format(support))

Train accuracy: 0.7516517857142857
Test accuracy: 0.7437708333333334
[[16852  7192]
 [ 5107 18849]]
===================================
             Class 0   -   Class 1
Precision: [0.76743021 0.72382013]
Recall   : [0.70088172 0.7868175 ]
F1       : [0.73264787 0.75400524]
Support  : [24044 23956]


Q6: What conclusions can you draw from the model?

Q7: Is it better to have a model per source?