# Sentiment Analysis using Naive Bayes

In this assignment, we will label tweets with one of three sentiments (positive, neutral and negative) using a Naive Bayes classifier. Naive Bayes is a very simple approach to this problem, but it can give surprisingly good accuracy.
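
As a rough illustration of the idea, the sketch below implements a tiny multinomial Naive Bayes classifier in plain Python. The training sentences, the `fit` and `predict` helpers and the class names are made up for this example and are not part of the assignment; `alpha` plays the role of the additive smoothing parameter that is tuned later as `clf__alpha`.

```python
import math
from collections import Counter

# Tiny hypothetical training set of (tokens, label) pairs; not taken from tweets.csv.
train = [
    (["good", "great", "good"], "pos"),
    (["bad", "awful"], "neg"),
    (["good", "bad"], "neg"),
]

def fit(samples, alpha=1.0):
    """Estimate log priors and additively smoothed log likelihoods per class."""
    vocab = {w for tokens, _ in samples for w in tokens}
    labels = Counter(label for _, label in samples)
    word_counts = {label: Counter() for label in labels}
    for tokens, label in samples:
        word_counts[label].update(tokens)
    log_prior = {c: math.log(n / len(samples)) for c, n in labels.items()}
    log_like = {}
    for c, counts in word_counts.items():
        # alpha is added to every word count so unseen words never get probability 0
        total = sum(counts.values()) + alpha * len(vocab)
        log_like[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_like

def predict(tokens, log_prior, log_like):
    """Return the class with the highest log posterior; out-of-vocabulary words are skipped."""
    scores = {
        c: log_prior[c] + sum(log_like[c][w] for w in tokens if w in log_like[c])
        for c in log_prior
    }
    return max(scores, key=scores.get)

log_prior, log_like = fit(train)
print(predict(["good", "great"], log_prior, log_like))
print(predict(["awful", "bad"], log_prior, log_like))
```

The "naive" part is the sum over words: each word contributes to the score independently of the others, given the class.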

**Fill in the Blanks**

## Importing required libraries

In [1]:
import pandas as pd
import re
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

## Reading dataset

In [2]:
data=pd.read_csv('tweets.csv')
data.drop(data.columns[0],axis=1,inplace=True)
data.head()

| | tweets | labels |
|---|---|---|
| 0 | Obama has called the GOP budget social Darwini... | 1 |
| 1 | In his teen years, Obama has been known to use... | 0 |
| 2 | IPA Congratulates President Barack Obama for L... | 0 |
| 3 | RT @Professor_Why: #WhatsRomneyHiding - his co... | 0 |
| 4 | RT @wardollarshome: Obama has approved more ta... | 1 |


## Text processing for the tweets

In [3]:
import nltk 
nltk.download('stopwords')
nltk.download('punkt')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [4]:
from nltk.tokenize import word_tokenize
from string import punctuation
from nltk.corpus import stopwords

# named stop_words so the imported nltk `stopwords` module is not shadowed;
# the placeholders AT_USER and URL are filtered out along with stop words and punctuation
stop_words = set(stopwords.words('english') + list(punctuation) + ['AT_USER', 'URL'])

def processTweet(tweet):
    # tweet is the raw text to preprocess
    # convert the tweet to lower case
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)  # replace URLs with the placeholder URL
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)  # replace usernames with the placeholder AT_USER
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)  # remove the # in #hashtag

    # use word_tokenize imported above to tokenize the tweet
    tweet = word_tokenize(tweet)
    # drop stop words, punctuation and the two placeholders
    return [word for word in tweet if word not in stop_words]
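
To see what the three regular expressions do on their own, here is a minimal sketch that applies them to a made-up tweet. The `normalize` helper and the sample text are hypothetical, and a plain whitespace split stands in for `word_tokenize` so that the snippet runs without the NLTK data.

```python
import re

def normalize(tweet):
    """Apply the same lowercasing and regex substitutions as processTweet."""
    tweet = tweet.lower()
    tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', 'URL', tweet)
    tweet = re.sub(r'@[^\s]+', 'AT_USER', tweet)
    tweet = re.sub(r'#([^\s]+)', r'\1', tweet)
    return tweet

# Hypothetical tweet, not taken from the dataset.
sample = "RT @SomeUser: Check https://example.com #Election2012 now"
normalized = normalize(sample)
print(normalized)

# Whitespace split as a stand-in for word_tokenize; placeholders are then dropped.
kept = [w for w in normalized.split() if w not in {"AT_USER", "URL"}]
print(kept)
```

Because `@[^\s]+` matches everything up to the next whitespace, the colon after the username is replaced together with it. The placeholders are uppercase while the rest of the text has already been lowercased, so they cannot collide with ordinary words.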

In [5]:
data['tweets']

0       Obama has called the GOP budget social Darwini...
1       In his teen years, Obama has been known to use...
2       IPA Congratulates President Barack Obama for L...
3       RT @Professor_Why: #WhatsRomneyHiding - his co...
4       RT @wardollarshome: Obama has approved more ta...
                              ...                        
1375    @liberalminds Its trending idiot.. Did you loo...
1376    RT @AstoldByBass: #KimKardashiansNextBoyfriend...
1377    RT @GatorNation41: gas was $1.92 when Obama to...
1378    @xShwag haha i know im just so smart, i mean y...
1379    #OBAMA:  DICTATOR IN TRAINING.  If he passes t...
Name: tweets, Length: 1380, dtype: object

In [6]:
data['tweets'].shape

(1380,)

In [7]:
data.shape

(1380, 2)

In [8]:
data = data.dropna()

In [9]:
data.shape

(1375, 2)

## Process all tweets

In [10]:
processed = []

for tweet in data['tweets']:
    # process each tweet using the processTweet function above - store in variable 'cleaned'
    cleaned = processTweet(tweet)
    # re-join the tokens so that CountVectorizer receives plain strings
    processed.append(' '.join(cleaned))

In [11]:
data['processed'] = processed

## Create pipeline and define parameters for GridSearch

In [12]:
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}
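
Each key in `tuned_parameters` follows the `<step name>__<parameter>` convention, so `clf__alpha` refers to the `alpha` parameter of the pipeline step named `clf`. A grid search tries every combination of the listed values. The sketch below enumerates those combinations with `itertools.product` to show where the candidate count comes from; it is only an illustration, and scikit-learn may order the candidates differently.

```python
from itertools import product

tuned_parameters = {
    'vect__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'tfidf__use_idf': (True, False),
    'tfidf__norm': ('l1', 'l2'),
    'clf__alpha': [1, 1e-1, 1e-2]
}

# One dict per combination of parameter values.
names = list(tuned_parameters)
candidates = [dict(zip(names, values))
              for values in product(*tuned_parameters.values())]

n_folds = 10  # matches the 10-fold cross-validation used in this notebook
print(len(candidates))            # 3 * 2 * 2 * 3 combinations
print(len(candidates) * n_folds)  # one fit per candidate per fold
print(candidates[0])
```

With 36 candidates and 10 folds, the search performs 360 fits, so adding values to the grid increases the running time multiplicatively.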

## Split data into test and train

In [13]:
# hold out 20% of the rows as a test set (test_size=0.2); random_state fixes the shuffle
train, test = train_test_split(data, test_size=0.2, random_state=1)
x_train = train['processed'].values
x_test = test['processed'].values
y_train = train['labels']
y_test = test['labels']
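
The split above is purely random, so the class proportions in the test set can drift from those in the full dataset. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split`. The sketch below tries it on made-up labels with a similar skew; the `labels` and `texts` lists are hypothetical.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Hypothetical labels with a skew similar to this dataset (not the real data).
labels = [0] * 70 + [1] * 25 + [2] * 5
texts = [f"tweet {i}" for i in range(len(labels))]

# stratify=labels keeps the class proportions the same in both parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.2, random_state=1, stratify=labels
)

test_counts = Counter(y_te)
print(test_counts)
```

With stratification, the 20 test rows contain the three classes in the same 70/25/5 proportion as the full list.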

## Perform classification (using GridSearch)

In [14]:
# run GridSearchCV with 10-fold cross-validation using the pipeline and tuned_parameters defined above
clf = GridSearchCV(text_clf,
                   param_grid=tuned_parameters,
                   cv=10,
                   verbose=1,
                   n_jobs=-1)
clf.fit(x_train, y_train)

Fitting 10 folds for each of 36 candidates, totalling 360 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done 104 tasks      | elapsed:    4.6s
[Parallel(n_jobs=-1)]: Done 360 out of 360 | elapsed:   12.0s finished


GridSearchCV(cv=10, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('vect',
                                        CountVectorizer(analyzer='word',
                                                        binary=False,
                                                        decode_error='strict',
                                                        dtype=<class 'numpy.int64'>,
                                                        encoding='utf-8',
                                                        input='content',
                                                        lowercase=True,
                                                        max_df=1.0,
                                                        max_features=None,
                                                        min_df=1,
                                                        ngram_range=(1, 1),
                                                        pre

In [15]:
print(clf.best_params_)
print(clf.best_score_)

{'clf__alpha': 0.1, 'tfidf__norm': 'l2', 'tfidf__use_idf': True, 'vect__ngram_range': (1, 2)}
0.8409090909090908
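
The selected `vect__ngram_range` of `(1, 2)` means that both single words and pairs of adjacent words are used as features. The sketch below shows the idea in plain Python; the `ngrams` helper is hypothetical and uses a whitespace split, so it only approximates what `CountVectorizer` does with its own tokenizer.

```python
def ngrams(tokens, ngram_range=(1, 2)):
    """Return all n-grams of the token list for n in the inclusive range."""
    low, high = ngram_range
    features = []
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

# A processed tweet of the kind produced earlier in this notebook.
tokens = "rt obama sign jobsact law".split()
features = ngrams(tokens)
print(features)
```

Five tokens yield five unigrams and four bigrams, so bigrams such as `obama sign` become features alongside the individual words.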


In [16]:
x_test

array(["n't obama run initial bill",
       "truth obama budget trillions debt high gas price inflation weak dollar jobs ca n't run record",
       "wonder 'd win footrace barack obama mitt romney ... stereotypes tell bam ... things n't reliable",
       "rt whatsromneyhiding birth certificate nope 's obama",
       "kimkardashiansnextboyfriend obama lol michelle ai n't goin 4 dat",
       "rt american kid `` 're uk ohhh cool tea queen '' british kid `` like go mcdonalds obama",
       "rt romney two harvard degrees blasts obama spending `` much time harvard '' icymi",
       'rt whatsromneyhiding person refuses let obama clear',
       "rt american kid `` 're uk ohhh cool tea queen '' british kid `` like go mcdonalds obama",
       'rt probably important thing world hope security theatre would diminish least slightly obama rule',
       'rt latest ndaa lawsuit prettiest website',
       'rt obama sign jobsact law designed make crowdfunding legal option startups',
       "rt mccaskill 

In [17]:
y_pred = clf.predict(x_test)
y_pred

array([0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0,
       1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 2, 1, 0, 0, 0,
       0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 2, 0,
       0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 2, 1, 0, 0, 0,
       1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 0, 1, 2, 0, 1, 0, 1])

## Classification report 

In [18]:
# classification_report returns a formatted string, not a confusion matrix
report = classification_report(y_test, y_pred)
print(report)

              precision    recall  f1-score   support

           0       0.86      0.94      0.90       182
           1       0.81      0.73      0.77        78
           2       0.83      0.33      0.48        15

    accuracy                           0.85       275
   macro avg       0.84      0.67      0.71       275
weighted avg       0.85      0.85      0.84       275
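
The two average rows differ in how they weight the classes. As a worked example, the sketch below recomputes the recall averages from the per-class values printed above. It uses the rounded two-decimal figures, so the same recomputation for the precision or F1 columns can differ from the printed averages in the last digit.

```python
# Per-class support and recall copied from the report above.
support = {0: 182, 1: 78, 2: 15}
recall = {0: 0.94, 1: 0.73, 2: 0.33}

total = sum(support.values())

# Macro average: every class counts equally, regardless of its size.
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support.
weighted_recall = sum(recall[c] * support[c] for c in support) / total

# Approximate number of correctly classified tweets per class.
correct = {c: round(recall[c] * support[c]) for c in support}

print(round(macro_recall, 2), round(weighted_recall, 2))
print(correct, sum(correct.values()), total)
```

The macro average is pulled down by class 2, where only about 5 of the 15 test tweets are recovered, while the weighted average stays close to the overall accuracy because class 0 dominates the test set.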



## Important: class imbalance

In [19]:
counts = data.labels.value_counts()
print(counts)

0    942
1    352
2     81
Name: labels, dtype: int64


We can see above that the class distribution is highly imbalanced: class 0 has 942 tweets, while class 2 has only 81. The classifier therefore sees very few examples of the minority classes, which is consistent with the low recall (0.33) for class 2 in the classification report. As a learning exercise, try using [SMOTE](https://imbalanced-learn.readthedocs.io/en/stable/api.html) to oversample the minority classes, then retrain the Naive Bayes classifier and compare its performance.
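
To make the idea behind SMOTE concrete, here is a simplified plain-Python sketch of its core step: a synthetic sample is created by interpolating between a minority-class point and one of its nearest minority-class neighbours. The `smote_like` function and the toy 2-D points are hypothetical and this is not the imbalanced-learn implementation; for the exercise itself, the library linked above should be used on the vectorized tweets.

```python
import random

# Class counts printed above: how many synthetic rows would each
# minority class need in order to match the majority class?
counts = {0: 942, 1: 352, 2: 81}
majority = max(counts.values())
needed = {label: majority - n for label, n in counts.items() if n < majority}
print(needed)

def smote_like(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating toward a near neighbour."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # the k nearest other minority points, by squared Euclidean distance
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

# Toy 2-D minority samples (hypothetical; real inputs would be feature vectors).
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority samples, so oversampling adds variety without simply duplicating rows. When comparing results, oversample only the training data so that the test set keeps its original class distribution.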