# Sentiment Analysis
# Srivalli Abbriano

The dataset was downloaded from the following site: 
https://blog.cambridgespark.com/50-free-machine-learning-datasets-sentiment-analysis-b9388f79c124  
Go, A., Bhayani, R. and Huang, L., 2009. Twitter sentiment classification using distant supervision. CS224N Project Report, Stanford, 1(12) 
This dataset is freely available and does not require registration.
The data comprise six fields: polarity, tweet ID, date, query, username, and the text of the tweet. In the raw file the polarity is coded as 0 for negative and 4 for positive; the code below remaps 4 to 1, so that 0 means negative and 1 means positive. The dataset was split into training and test portions (note that the code below passes test_size = 0.7, which holds out 70% of the rows for testing and keeps 30% for training). Classifying tweets by sentiment can be challenging because Twitter is an informal platform, so a large number of training examples was used in this project. The final objective was to understand how well a classification algorithm can decipher text.  

The website mentioned above provided two files, one with training data and one with test data. The test file could not be read because of a technical problem, so this project uses only the training file and splits it into training and test portions. 

In [1]:
import os
import pandas as pd
import numpy as np
import nltk

## Downloaded an archive
df = pd.read_csv('Sentiment140.csv',  header=None, encoding='utf-8')

df.head()

Unnamed: 0,0,1,2,3,4,5
0,0,1467810369,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,_TheSpecialOne_,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,1467810672,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,scotthamilton,is upset that he can't update his Facebook by ...
2,0,1467810917,Mon Apr 06 22:19:53 PDT 2009,NO_QUERY,mattycus,@Kenichan I dived many times for the ball. Man...
3,0,1467811184,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,ElleCTF,my whole body feels itchy and like its on fire
4,0,1467811193,Mon Apr 06 22:19:57 PDT 2009,NO_QUERY,Karoli,"@nationwideclass no, it's not behaving at all...."
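
The six fields described in the introduction can be labeled when the file is read. The sketch below is illustrative only: the column names are assumptions chosen to match that description, and a made-up two-row sample stands in for Sentiment140.csv.

```python
import io
import pandas as pd

# Hypothetical column names matching the six fields described in the introduction
columns = ['polarity', 'tweet_id', 'date', 'query', 'username', 'text']

# Made-up two-row sample standing in for Sentiment140.csv
sample_csv = io.StringIO(
    '0,1,Mon Apr 06 22:19:45 PDT 2009,NO_QUERY,user_a,"first example tweet"\n'
    '4,2,Mon Apr 06 22:19:49 PDT 2009,NO_QUERY,user_b,"second example tweet"\n'
)

# header=None because the file has no header row; names= supplies the labels
named = pd.read_csv(sample_csv, header=None, names=columns)
print(named[['polarity', 'text']])
```

With named columns, the later column selection could be written as named[['polarity', 'text']] instead of dropping columns by position.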


In [2]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /home/jovyan/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [3]:
##drop unnecessary columns for sentiment analysis
dftweet = df.drop(df.columns[[1, 2, 3, 4]], axis=1)

##change column names
dftweet.columns = ['sentiment', 'tweet']

In [4]:
##Check classes for sentiment column
dftweet.sentiment.unique()

array([0, 4])

In [5]:
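##recode the positive label from 4 to 1 (assignment avoids relying on in-place replace on a column)
dftweet["sentiment"] = dftweet["sentiment"].replace({4: 1})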
dftweet.head()

Unnamed: 0,sentiment,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


In [6]:
##Remove any retweets from the dataset that could affect classification accuracy
dftweets = dftweet[~dftweet.tweet.str.startswith('RT')]

In [7]:
##check for null values
dftweets.isnull().sum()

sentiment    0
tweet        0
dtype: int64

In [8]:
dftweets.head()

Unnamed: 0,sentiment,tweet
0,0,"@switchfoot http://twitpic.com/2y1zl - Awww, t..."
1,0,is upset that he can't update his Facebook by ...
2,0,@Kenichan I dived many times for the ball. Man...
3,0,my whole body feels itchy and like its on fire
4,0,"@nationwideclass no, it's not behaving at all...."


Check if the given dataset is balanced

In [9]:
total = sum(dftweets['sentiment'].value_counts())
neg = dftweets['sentiment'].value_counts()[0]
pos = dftweets['sentiment'].value_counts()[1]

print("negative sentiments:", round(((neg/total)*100),2),"%")
print("positive sentiments:", round(((pos/total)*100),2),"%")

negative sentiments: 76.29 %
positive sentiments: 23.71 %


In [10]:
X = dftweets['tweet']
y = dftweets['sentiment']

In [100]:
from sklearn.model_selection import train_test_split 
from sklearn.utils import resample

## note: test_size=0.7 holds out 70% of the rows for testing, so 30% are used for training
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.7, random_state=0, stratify=y)

In [101]:
X1 = pd.concat([X_train, y_train], axis=1)
X1.head()

Unnamed: 0,tweet,sentiment
130570,@happysahmom I'm leaving for Portland Friday @...,0
525711,Anyone else having gmail issues? Damn page wo...,0
112175,"srs I miss @xoxo_laura @MoondanceMandy, @xoxoJ...",0
545968,"@sdtips - Hmmm, that's interesting. Had breaky...",0
532854,i just wished he'd've. oh,0


In [102]:
#https://www.kaggle.com/tboyle10/methods-for-dealing-with-imbalanced-data
# separate minority and majority classes
negative = X1[X1.sentiment==0]
positive = X1[X1.sentiment==1]

# upsample minority
pos_upsampled = resample(positive,
                         replace=True, # sample with replacement
                         n_samples=len(negative), # match number in majority class
                         random_state=27) # reproducible results

# combine majority and upsampled minority
upsampled = pd.concat([negative, pos_upsampled])
upsampled = upsampled.reset_index(drop=True)

# check new class counts
upsampled.sentiment.value_counts()

1    239999
0    239999
Name: sentiment, dtype: int64

In [103]:
X_train1 = upsampled.tweet
y_train1 = upsampled.sentiment

In [16]:
!pip install Unidecode



In [17]:
!pip install contractions



In [18]:
import re 
import unidecode
import contractions
from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer(text):
    ##remove mentions and URLs (tokens starting with @, http(s):// or www)
    text = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", text)
    ##remove any remaining URLs
    text = re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', text)
    ##remove email addresses
    text = re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', text) 
    ##remove digits from text
    text = re.sub(r'\d+', '', text)
    ##expand contractions
    text = contractions.fix(text)
    ##transliterate accented characters to ASCII
    text = unidecode.unidecode(text)
    ##remove special characters
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)      
    ##collect emoticons to append after cleaning; the characters : ; = were
    ##already removed above, so this list is always empty here
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)(;D)', text)
    ##lowercase and replace runs of non-word characters with a space
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    return text.split()

def tokenizer_porter(text):
    ##same cleaning as tokenizer(), followed by Porter stemming of each token
    return [porter.stem(word) for word in tokenizer(text)]
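
To see the effect of the first and last cleaning steps in isolation, the sketch below applies the mention/URL pattern and the non-word split to a made-up tweet. It uses only the standard re module and leaves out the contractions, unidecode and stemming steps, so it is a simplified illustration rather than a replacement for tokenizer().

```python
import re

def simple_tokens(text):
    # Drop tokens that start with @, http(s):// or www (same pattern as above)
    text = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", text)
    # Lowercase, turn runs of non-word characters into spaces, then split
    return re.sub(r"\W+", " ", text.lower()).split()

# Hypothetical tweet, not taken from the dataset
sample = "@some_user http://example.com/abc - What a great day!!"
tokens = simple_tokens(sample)
print(tokens)
```

Because the www alternative is not anchored to the start of a word, the pattern can also remove the tail of ordinary words that contain "www", which is worth keeping in mind when inspecting tokenized tweets.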


In [19]:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)


In [21]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)
stop = stopwords.words('english')

param_grid = [{'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams only, or unigrams plus bigrams
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__alpha': [1e-05, 1e-06]},
              {'vect__ngram_range': [(1, 1), (1, 2)],  # unigrams only, or unigrams plus bigrams
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],   # raw term frequencies instead of tf-idf
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__alpha': [1e-05, 1e-06]},
              ]

## loss='log' selects logistic regression; scikit-learn 1.1 and later name this loss 'log_loss'
sgd_tfidf = Pipeline([('vect', tfidf),
                      ('clf', SGDClassifier(loss='log', random_state=1))])

gs_sgd_tfidf = GridSearchCV(sgd_tfidf, param_grid,
                            scoring='accuracy',
                            cv=5,
                            verbose=2,
                            n_jobs=-1)

In [22]:
gs_sgd_tfidf.fit(X_train1, y_train1)

Fitting 5 folds for each of 64 candidates, totalling 320 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 18.2min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 108.1min
[Parallel(n_jobs=-1)]: Done 320 out of 320 | elapsed: 241.8min finished


GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        TfidfVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.float64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=False,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        n

In [23]:
print('Best parameter set: %s ' % gs_sgd_tfidf.best_params_)
print('CV Accuracy: %.3f' % gs_sgd_tfidf.best_score_)

Best parameter set: {'clf__alpha': 1e-06, 'clf__penalty': 'l2', 'vect__ngram_range': (1, 2), 'vect__norm': None, 'vect__stop_words': None, 'vect__tokenizer': <function tokenizer at 0x7f33ea074d40>, 'vect__use_idf': False} 
CV Accuracy: 0.919


In [24]:
clf = gs_sgd_tfidf.best_estimator_
print('Test Accuracy: %.3f' % clf.score(X_test, y_test))

Test Accuracy: 0.840


Several hyperparameters were searched. The first is vect__ngram_range, which lets the grid search compare unigram-only features with unigram-plus-bigram features and choose the better setting. From the scikit-learn documentation: "ngram_range: tuple (min_n, max_n), default=(1, 1). The lower and upper boundary of the range of n-values for different word n-grams or char n-grams to be extracted." The search found that the (1, 2) range, which adds bigrams, performed better than unigrams alone.
Another hyperparameter is the regularization penalty of the SGD classifier, l1 or l2. For this dataset l2 regularization performed better, as opposed to the l1 penalty that was used in the book [2].
The last hyperparameter is alpha, whose best value was found to be 1e-06; the book instead used logistic regression and tried different values of the C parameter. 
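
To make the ngram_range setting concrete, here is a plain-Python sketch of which word n-grams each range produces. It is an illustration under simple assumptions (whitespace-separated tokens, a hypothetical helper name), not scikit-learn's implementation.

```python
def word_ngrams(tokens, ngram_range=(1, 1)):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = ["not", "a", "good", "day"]
unigrams = word_ngrams(tokens, (1, 1))
uni_and_bigrams = word_ngrams(tokens, (1, 2))
print(unigrams)
print(uni_and_bigrams)
```

With (1, 2) the features include pairs such as "not a" and "good day" in addition to the single words, which gives the classifier some local word-order information.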

Out-of-core learning

In [104]:
from sklearn.feature_extraction.text import HashingVectorizer

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         ngram_range=(1,2),
                         stop_words=None,
                         tokenizer=tokenizer)

In [105]:
from sklearn.linear_model import SGDClassifier

def stream_docs(df):
    for index, row in df.iterrows():
        text, label = row['tweet'], row['sentiment']
        yield text, label
        
def get_minibatch(doc_stream, size):
    docs, y = [], []
    try:
        for _ in range(size):
            text, label = next(doc_stream)
            docs.append(text)
            y.append(label)
    except StopIteration:
        return None, None
    return docs, y

clf = SGDClassifier(loss='log', random_state=1, penalty='l2', alpha=1e-06) 

doc_stream = stream_docs(upsampled)

In [106]:
testCase = pd.concat([X_test, y_test], axis=1)
doc_stream1 = stream_docs(testCase)

In [107]:
import re
import pyprind
pbar = pyprind.ProgBar(45)

classes = np.array([0, 1])
for _ in range(45):
    X_train, y_train = get_minibatch(doc_stream, size=1000)
    if not X_train:
        break
    X_train = vect.transform(X_train)
    clf.partial_fit(X_train, y_train, classes=classes)
    pbar.update()

0% [##############################] 100% | ETA: 00:00:00
Total time elapsed: 00:00:13


In [108]:
X_test, y_test = get_minibatch(doc_stream1, size=5000)
X_test = vect.transform(X_test)
print('Accuracy: %.3f' % clf.score(X_test, y_test))

Accuracy: 0.759
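
One detail of the training loop above deserves attention. The upsampled frame was built by concatenating all negative rows followed by all positive rows, and the loop reads only the first 45 x 1000 = 45,000 rows, fewer than the 239,999 negative rows at the top of the frame. The classifier therefore sees only negative tweets in that loop, which may be why the accuracy of 0.759 is close to the negative share of the data (76.29%). A minimal sketch of one possible remedy, shuffling the rows before streaming, on a toy frame with hypothetical contents:

```python
import pandas as pd

# Toy stand-in for the upsampled frame: all negatives first, then all positives
negative = pd.DataFrame({'tweet': ['n1', 'n2', 'n3', 'n4'], 'sentiment': 0})
positive = pd.DataFrame({'tweet': ['p1', 'p2', 'p3', 'p4'], 'sentiment': 1})
ordered = pd.concat([negative, positive]).reset_index(drop=True)

# The first rows streamed from the ordered frame contain a single class
first_rows = ordered['sentiment'].head(4).tolist()
print(first_rows)

# Shuffle all rows once before streaming (random_state is an arbitrary choice)
shuffled = ordered.sample(frac=1, random_state=1).reset_index(drop=True)
print(shuffled['sentiment'].tolist())
```

In the notebook this would correspond to shuffling upsampled before passing it to stream_docs; the recorded outputs above were produced without that step.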


In [109]:
clf = clf.partial_fit(X_test, y_test)
clf

SGDClassifier(alpha=1e-06, average=False, class_weight=None,
              early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
              l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=1000,
              n_iter_no_change=5, n_jobs=None, penalty='l2', power_t=0.5,
              random_state=1, shuffle=True, tol=0.001, validation_fraction=0.1,
              verbose=0, warm_start=False)

The partial_fit() method trains the classifier incrementally on small batches of data, updating the weights after each batch instead of updating them from the errors accumulated over all training examples. It is suitable for online learning, in which the model is trained on streaming data, and is therefore generally used when the dataset is relatively large. The objective of online learning is to update the model quickly as new data arrive. [1], [2]

[1] https://medium.com/value-stream-design/online-machine-learning-515556ff72c5
[2] Raschka, S. et al., Python Machine Learning.
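
The idea of incremental updates can be sketched without scikit-learn. The code below is a plain-Python logistic regression that keeps its weights between calls and adjusts them one mini-batch at a time; the function names, learning rate and toy data stream are all assumptions made for illustration, and this is not how SGDClassifier is implemented internally.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partial_fit_sketch(weights, bias, batch, lr=0.1):
    """Adjust existing weights using one mini-batch, one example at a time."""
    for features, label in batch:
        z = bias + sum(w * x for w, x in zip(weights, features))
        error = label - sigmoid(z)  # gradient of the log-likelihood w.r.t. z
        weights = [w + lr * error * x for w, x in zip(weights, features)]
        bias += lr * error
    return weights, bias

def stream_batches(n_batches):
    """Hypothetical data stream: feature 0 marks positive rows, feature 1 negative rows."""
    for _ in range(n_batches):
        yield [([1.0, 0.0], 1), ([0.0, 1.0], 0)]

# The model state persists across batches; no batch is ever revisited
weights, bias = [0.0, 0.0], 0.0
for batch in stream_batches(50):
    weights, bias = partial_fit_sketch(weights, bias, batch)

p_pos = sigmoid(bias + weights[0])  # predicted probability for a positive-style row
p_neg = sigmoid(bias + weights[1])  # predicted probability for a negative-style row
print(round(p_pos, 2), round(p_neg, 2))
```

Only the current batch and the weight vector need to be in memory at any time, which is what makes this style of training practical for data that arrive as a stream.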

In [110]:
import pickle
import os
from nltk.corpus import stopwords
stop = stopwords.words('english')
dest = os.path.join('website', 'pkl_objects')
os.makedirs(dest, exist_ok=True)

## use context managers so the pickle files are closed after writing
with open(os.path.join(dest, 'stopwords.pkl'), 'wb') as f:
    pickle.dump(stop, f, protocol=4)
with open(os.path.join(dest, 'tweet_classifier.pkl'), 'wb') as f:
    pickle.dump(clf, f, protocol=4)


In [111]:
%%writefile website/vector.py 
from sklearn.feature_extraction.text import HashingVectorizer
import unidecode
import contractions
from nltk.stem import PorterStemmer
from nltk.corpus import stopwords
import pickle
import re
import os

porter=PorterStemmer()

cur_dir = os.path.dirname(__file__)
stop = pickle.load(open(
                os.path.join(cur_dir, 
                'pkl_objects', 
                'stopwords.pkl'), 'rb'))

def tokenizer(text):
    ##remove mentions and URLs (tokens starting with @, http(s):// or www)
    text = re.sub(r"(?:\@|http?\://|https?\://|www)\S+", "", text)
    ##remove any remaining URLs
    text = re.sub(r'(http|ftp|https)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', text)
    ##remove email addresses
    text = re.sub(r'([a-zA-Z0-9+._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9_-]+)', '', text) 
    ##remove digits from text
    text = re.sub(r'\d+', '', text)
    ##expand contractions
    text = contractions.fix(text)
    ##transliterate accented characters to ASCII
    text = unidecode.unidecode(text)
    ##remove special characters
    text = re.sub(r"[-()\"#/@;:<>{}`+=~|.!?,]", "", text)      
    ##collect emoticons to append after cleaning; the characters : ; = were
    ##already removed above, so this list is always empty here
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)(;D)', text)
    ##lowercase and replace runs of non-word characters with a space
    text = (re.sub(r'[\W]+', ' ', text.lower()) +
            ' '.join(emoticons).replace('-', ''))
    ##drop stop words (the tokenizer used for training above does not apply this filter)
    tokenized = [w for w in text.split() if w not in stop]
    return tokenized

vect = HashingVectorizer(decode_error='ignore',
                         n_features=2**21,
                         preprocessor=None,
                         ngram_range=(1,2),
                         stop_words=None,
                         tokenizer=tokenizer)

Overwriting website/vector.py


In [112]:
import os
os.chdir('website')

In [113]:
import pickle
import re
from vector import vect

clf = pickle.load(open(os.path.join('pkl_objects', 'tweet_classifier.pkl'), 'rb'))

In [118]:
import numpy as np
import contractions

label = {0:'negative', 1:'positive'}

example = ["#gonzpiration yeah! bravooo! (World Record Attempt in Paris live &gt; http://ustre.am/2X3V)"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 93.47%


In [123]:
label = {0:'negative', 1:'positive'}

example = ["Morning twitters enjoy your day 2day keep God first n please try n make I to church on time  I'm up in NJ @ 8am service crazy tired  ..."]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 55.62%


In [116]:
label = {0:'negative', 1:'positive'}

example = ["good morning all! Waiting for the coffee, doing some last minute things before our Open House today. What a beautiful sunny morning! "]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: positive
Probability: 100.00%


In [117]:
label = {0:'negative', 1:'positive'}

example = ["Something looks right"]
X = vect.transform(example)
print('Prediction: %s\nProbability: %.2f%%' %\
      (label[clf.predict(X)[0]], 
       np.max(clf.predict_proba(X))*100))

Prediction: negative
Probability: 81.23%


In [121]:
import os
os.getcwd()

'/home/jovyan/projects/project02/website'

In [122]:
import sqlite3
import os

conn = sqlite3.connect('tweets.sqlite')
c = conn.cursor()

c.execute('DROP TABLE IF EXISTS tweet_db')
c.execute('CREATE TABLE tweet_db (tweet TEXT, sentiment INTEGER, date TEXT)')


conn.commit()
conn.close()
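
The table created above can then receive one row per classified tweet. The website's own code is not shown here, so the following is only a sketch of how such an insert might look: it uses an in-memory database so that tweets.sqlite is left untouched, and the example tweet and label are made up.

```python
import sqlite3

# In-memory database with the same schema as tweet_db above
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute('CREATE TABLE tweet_db (tweet TEXT, sentiment INTEGER, date TEXT)')

# Parameterized insert; DATETIME('now') stores the current UTC timestamp
c.execute("INSERT INTO tweet_db (tweet, sentiment, date) "
          "VALUES (?, ?, DATETIME('now'))",
          ('an example tweet', 1))
conn.commit()

# Read the row back to confirm it was stored
rows = c.execute('SELECT tweet, sentiment FROM tweet_db').fetchall()
print(rows)
conn.close()
```

Passing the values as parameters rather than formatting them into the SQL string keeps user-submitted text from being interpreted as SQL.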

Website:
http://shreevs.pythonanywhere.com/