## Sentiment analysis

The objective of the second problem is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we will classify whether the sentiment the user directs at a particular mobile device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NA's while reading the dataset

In [17]:
import pandas as pd
import numpy as np
import spacy
import nltk
from nltk.tokenize.toktok import ToktokTokenizer
from nltk.corpus import stopwords
import re
from bs4 import BeautifulSoup
import unicodedata

In [37]:
df=pd.read_csv("tweets.csv",encoding='unicode_escape')

In [38]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion
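
The heading for this step asks for missing values to be dropped, while the cell above reads the file as is. Below is a minimal sketch of one way to do it, assuming that only rows with a missing `tweet_text` should be discarded. The three-row inline CSV is made-up sample data standing in for `tweets.csv`.

```python
import io

import pandas as pd

# Made-up stand-in for tweets.csv; the second row has no tweet text
sample_csv = io.StringIO(
    "tweet_text,emotion\n"
    "Love my new phone,Positive emotion\n"
    ",Negative emotion\n"
    "Battery died again,Negative emotion\n"
)

# Drop rows whose tweet text is missing right after reading
sample_df = pd.read_csv(sample_csv).dropna(subset=["tweet_text"])
print(len(sample_df))
```

Calling `dropna()` without `subset` would discard every row that has a missing value in any column, which may remove more rows than intended.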


### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [19]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\kalya\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.


True

In [20]:
tokenizer = ToktokTokenizer()
stopword_list = stopwords.words('english')
stopword_list.remove('no')
stopword_list.remove('not')
# Load the small English pipeline (tagger, parser and NER are enabled by default)
nlp = spacy.load('en_core_web_sm')

In [39]:
def preprocess(text, is_lower_case=False, remove_digits=False):
    """Clean one tweet: drop stopwords and special characters, lemmatize, strip HTML and accents."""
    try:
        #remove_stopwords
        tokens = tokenizer.tokenize(text)
        tokens = [token.strip() for token in tokens]
        if is_lower_case:
            filtered_tokens = [token for token in tokens if token not in stopword_list]
        else:
            filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
        filtered_text = ' '.join(filtered_tokens)

        #remove_special_characters
        # Note: the range A-z also covers [ \ ] ^ _ and `, so underscores (e.g. rise_austin) are kept
        pattern = r'[^a-zA-z0-9\s]' if not remove_digits else r'[^a-zA-z\s]'
        text = re.sub(pattern, '', filtered_text)

        #lemmatize_text ('-PRON-' is the pronoun lemma in spaCy 2.x; keep the original word instead)
        text = nlp(text)
        text = ' '.join([word.lemma_ if word.lemma_ != '-PRON-' else word.text for word in text])

        #strip_html_tags
        text = BeautifulSoup(text, "html.parser").get_text()

        #remove_accented_chars
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
        return text

    except Exception:
        # Input that cannot be processed (for example a missing tweet) becomes an empty string
        return ""

In [41]:
df['text'] = [preprocess(text) for text in df.tweet_text]

In [42]:
df.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,wesley83 3 g iPhone 3 hrs tweet rise_austin ...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,jessedee Know fludapp Awesome ipadiphone app...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,swonderlin not wait iPad 2 also sale SXSW
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,sxsw hope year festival crashy year iPho...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,sxtxstate great stuff Fri SXSW Marissa Mayer...
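
The token `rise_austin` in the first row keeps its underscore because the character class in `preprocess` uses the range `A-z`, which in ASCII also spans `[`, `\`, `]`, `^`, `_` and the backtick. The sketch below compares that pattern with the stricter `A-Z` variant on a made-up string. The stricter pattern is shown only as an alternative and is not what the notebook ran.

```python
import re

sample = "rise_austin dead! #SXSW [update]"  # made-up example text

# Character class as written in preprocess: A-z spans 'A'..'z' in ASCII
as_written = re.sub(r'[^a-zA-z0-9\s]', '', sample)
# Stricter variant that keeps letters, digits and whitespace only
strict = re.sub(r'[^a-zA-Z0-9\s]', '', sample)

print(as_written)
print(strict)
```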


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [43]:
df["is_there_an_emotion_directed_at_a_brand_or_product"].unique()

array(['Negative emotion', 'Positive emotion',
       'No emotion toward brand or product', "I can't tell"], dtype=object)

In [48]:
df_emotion=df[df["is_there_an_emotion_directed_at_a_brand_or_product"].isin(['Negative emotion', 'Positive emotion'])]

### 4. Represent text as numerical data using `CountVectorizer` and get the document-term matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [65]:
from sklearn.feature_extraction.text import CountVectorizer

In [66]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None,
                       max_features=None)

In [67]:
df_emotion_features = vect.fit_transform(df_emotion["text"])

# NumPy arrays are easy to work with, so convert the sparse result to a dense array
df_emotion_features = df_emotion_features.toarray()

In [68]:
df_emotion_features.shape

(3548, 5455)
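
Each row of this matrix is one tweet and each column is one vocabulary term, with the cell holding how often the term occurs in the tweet. A minimal sketch on two made-up documents (the names `toy_docs`, `toy_vect` and `toy_dtm` are illustrative only):

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["not wait ipad", "ipad sale ipad"]  # made-up documents

toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# Terms listed in column order (the learned vocabulary is sorted alphabetically)
print(sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get))
# One row per document, one column per term
print(toy_dtm.toarray())
```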

### 5. Find number of different words in vocabulary

In [69]:
vect.vocabulary_

{'wesley83': 5244,
 'iphone': 2581,
 'hrs': 2363,
 'tweet': 4969,
 'rise_austin': 4032,
 'dead': 1293,
 'nee': 3240,
 'upgrade': 5065,
 'plugin': 3642,
 'station': 4475,
 'sxsw': 4617,
 'jessedee': 2645,
 'know': 2735,
 'fludapp': 1859,
 'awesome': 501,
 'ipadiphone': 2570,
 'app': 374,
 'likely': 2845,
 'appreciate': 398,
 'design': 1344,
 'also': 298,
 'give': 2052,
 'free': 1923,
 'ts': 4949,
 'swonderlin': 4608,
 'not': 3310,
 'wait': 5180,
 'ipad': 2562,
 'sale': 4080,
 'hope': 2340,
 'year': 5405,
 'festival': 1801,
 'crashy': 1189,
 'sxtxstate': 4654,
 'great': 2133,
 'stuff': 4536,
 'fri': 1929,
 'marissa': 2999,
 'mayer': 3031,
 'google': 2089,
 'tim': 4832,
 'reilly': 3937,
 'tech': 4712,
 'booksconference': 692,
 'amp': 321,
 'matt': 3023,
 'mullenweg': 3195,
 'wordpress': 5341,
 'start': 4468,
 'ctia': 1226,
 'around': 418,
 'corner': 1146,
 'googleio': 2103,
 'hop': 2339,
 'skip': 4297,
 'jump': 2681,
 'good': 2081,
 'time': 4834,
 'android': 332,
 'fan': 1753,
 'beautiful

#### Tip: To see all available attributes and methods of an object, use `dir()`
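
The dictionary printed above maps each term to its column index, so its length is the number of distinct words. It should equal the second dimension of the matrix shape reported earlier (5455). A small sketch on made-up documents, which also applies `dir`:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["great stuff", "great app", "crashy app"]  # made-up documents
toy_vect = CountVectorizer()
toy_dtm = toy_vect.fit_transform(toy_docs)

# One dictionary entry per distinct term, one matrix column per entry
vocab_size = len(toy_vect.vocabulary_)
print(vocab_size, toy_dtm.shape[1])

# Public attributes and methods of the fitted vectorizer
public_names = [name for name in dir(toy_vect) if not name.startswith('_')]
print(public_names[:5])
```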

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [71]:
df_emotion["is_there_an_emotion_directed_at_a_brand_or_product"].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64

### 7. Change the labels for Positive and Negative emotions to 1 and 0 respectively and store them in a new column named `label` in the same dataframe

Hint: use map on that column and give labels

In [73]:
# assign returns a new dataframe, so no value is set on a slice of df
df_emotion = df_emotion.assign(label=np.where(df_emotion["is_there_an_emotion_directed_at_a_brand_or_product"]=='Positive emotion',1,0))


In [74]:
df_emotion.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text,label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,wesley83 3 g iPhone 3 hrs tweet rise_austin ...,0
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,jessedee Know fludapp Awesome ipadiphone app...,1
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,swonderlin not wait iPad 2 also sale SXSW,1
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,sxsw hope year festival crashy year iPho...,0
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,sxtxstate great stuff Fri SXSW Marissa Mayer...,1
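
The hint for this step suggests `map`, while the cell uses `np.where`. For two classes the two give the same labels. A minimal sketch of the `map` route on a made-up three-row frame (the column name `emotion` is shortened for illustration):

```python
import pandas as pd

toy = pd.DataFrame({"emotion": ["Negative emotion", "Positive emotion", "Positive emotion"]})

# map looks each value up in the dictionary; values missing from it would become NaN
toy["label"] = toy["emotion"].map({"Positive emotion": 1, "Negative emotion": 0})
print(toy["label"].tolist())
```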


### 8. Define the feature set (independent variable or X) from the `text` column and `label` as the target (or dependent variable), and divide into train and test datasets

In [120]:
X=df_emotion_features
Y=df_emotion['label']

In [121]:
from sklearn.model_selection import train_test_split
# Split data into train and test sets (70:30)
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.30, random_state=1)
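
Positive tweets outnumber negative ones roughly five to one (2978 vs 570), so a plain random split can leave the two sets with slightly different class ratios. Below is a hedged sketch of the `stratify` option of `train_test_split` on made-up labels. The notebook itself does not use it.

```python
from sklearn.model_selection import train_test_split

# Made-up data: 80 positive and 20 negative labels
toy_X = list(range(100))
toy_y = [1] * 80 + [0] * 20

# stratify keeps the class ratio the same in the train and test parts
_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.30, random_state=1, stratify=toy_y
)
print(sum(toy_y_test), len(toy_y_test))
```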

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression to predict the sentiment of the given text and report their accuracy scores

In [131]:
from sklearn.linear_model import LogisticRegression as lr
from sklearn.naive_bayes import GaussianNB as nb
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics

In [123]:
lr_model = lr(solver='lbfgs' , max_iter=5000 , multi_class='multinomial')
lr_model.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=5000, multi_class='multinomial',
          n_jobs=None, penalty='l2', random_state=None, solver='lbfgs',
          tol=0.0001, verbose=0, warm_start=False)

In [124]:
lr_model.score(x_test , y_test)  

0.8769953051643192

In [125]:
y_pred_lr=lr_model.predict(x_test)

In [126]:
print(metrics.classification_report(y_test, y_pred_lr))
print(metrics.confusion_matrix(y_test, y_pred_lr))

              precision    recall  f1-score   support

           0       0.69      0.41      0.52       170
           1       0.90      0.97      0.93       895

   micro avg       0.88      0.88      0.88      1065
   macro avg       0.79      0.69      0.72      1065
weighted avg       0.86      0.88      0.86      1065

[[ 70 100]
 [ 31 864]]
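
The figures in the report can be recomputed from the confusion matrix, which also gives a majority-class baseline to judge the accuracy against. Here is a short sketch in plain Python using the four counts printed above (rows are true classes, columns are predicted classes, as in scikit-learn):

```python
# Confusion matrix printed above: rows = true class (0, 1), columns = predicted class
tn, fp = 70, 100
fn, tp = 31, 864
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
recall_negative = tn / (tn + fp)       # share of negative tweets that were found
precision_negative = tn / (tn + fn)    # share of negative predictions that are right
baseline = (fn + tp) / total           # accuracy of always predicting class 1

print(round(accuracy, 3), round(recall_negative, 2),
      round(precision_negative, 2), round(baseline, 3))
```

The baseline of about 0.84 means the logistic regression's 0.877 is a modest gain, and the recall of 0.41 on negative tweets shows where most of the errors occur.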


In [127]:
nb_model = nb()
nb_model.fit(x_train, y_train)

GaussianNB(priors=None, var_smoothing=1e-09)

In [128]:
nb_model.score(x_test , y_test)  

0.7295774647887324

In [129]:
y_pred_nb=nb_model.predict(x_test)

In [130]:
print(metrics.classification_report(y_test, y_pred_nb))
print(metrics.confusion_matrix(y_test, y_pred_nb))


### 10. Create a function called `tokenize_test` that takes a count vectorizer object as input and prints the accuracy for x (text) and y (labels)

In [132]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None,
                       max_features=None)

In [133]:
x_train, x_test, y_train, y_test = train_test_split(df_emotion['text'], df_emotion['label'], test_size=0.30, random_state=1)

In [134]:
def tokenize_test(vect):
    x_train_dtm = vect.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vect.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [135]:
tokenize_test(vect)

Features:  4509
Accuracy:  0.8741784037558685
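
`tokenize_test` reads `x_train`, `x_test`, `y_train` and `y_test` from the notebook's global scope. A hedged variant that takes the data as arguments and returns the result is sketched below. The name `tokenize_score` and the tiny dataset are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics


def tokenize_score(vect, x_train, x_test, y_train, y_test):
    """Fit the vectorizer on the training text only and return (n_features, accuracy)."""
    x_train_dtm = vect.fit_transform(x_train)
    x_test_dtm = vect.transform(x_test)  # reuse the vocabulary learned on the training text
    model = MultinomialNB().fit(x_train_dtm, y_train)
    accuracy = metrics.accuracy_score(y_test, model.predict(x_test_dtm))
    return x_train_dtm.shape[1], accuracy


# Made-up toy data
train_text = ["great app", "love ipad", "crashy app", "dead battery"]
train_label = [1, 1, 0, 0]
test_text = ["great ipad", "crashy battery"]
test_label = [1, 0]

n_features, acc = tokenize_score(CountVectorizer(), train_text, test_text,
                                 train_label, test_label)
print(n_features, acc)
```

Returning the values instead of printing them makes it easier to collect the results of several vectorizer settings in one table.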


### 11. Create a count vectorizer with ngram_range = (1, 2) and pass it to the `tokenize_test` function to print the accuracy score

In [103]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None,
                       max_features=None,
                       ngram_range=(1, 2))


In [104]:
tokenize_test(vect)

Features:  19942
Accuracy:  0.8769953051643192
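
With `ngram_range=(1,2)` the vocabulary holds every single word plus every pair of adjacent words, which is why the feature count is so much larger than with single words alone. A small sketch on one made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_vect = CountVectorizer(ngram_range=(1, 2))
toy_vect.fit(["not wait ipad"])  # made-up sentence

# Unigrams and bigrams learned from the sentence
toy_terms = sorted(toy_vect.vocabulary_)
print(toy_terms)
```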


### 12. Create a count vectorizer with stop_words = 'english' and pass it to the `tokenize_test` function to print the accuracy score

In [110]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words='english',
                       max_features=None)

In [111]:
tokenize_test(vect)

Features:  4338
Accuracy:  0.8685446009389671
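
One thing worth checking here: the preprocessing step deliberately kept `no` and `not`, but scikit-learn's built-in English stop word list contains both, so `stop_words='english'` removes them again. Whether this explains the slightly lower accuracy is not established by the run above. A quick check on a made-up sentence:

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Both negation words are part of the built-in list
print('not' in ENGLISH_STOP_WORDS, 'no' in ENGLISH_STOP_WORDS)

toy_vect = CountVectorizer(stop_words='english')
toy_vect.fit(["not wait ipad sale"])  # made-up sentence
toy_terms = sorted(toy_vect.vocabulary_)
print(toy_terms)
```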


### 13. Create a count vectorizer with stop_words = 'english' and max_features = 300 and pass it to the `tokenize_test` function to print the accuracy score

In [112]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words='english',
                       max_features=300)

In [113]:
tokenize_test(vect)

Features:  300
Accuracy:  0.8169014084507042
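
`max_features` keeps only the terms with the highest total count across the corpus and discards the rest. A minimal sketch on two made-up documents, where `ipad` occurs four times, `sale` twice and `wait` once:

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["ipad ipad ipad sale", "ipad sale wait"]  # made-up documents

# Keep only the two most frequent terms
toy_vect = CountVectorizer(max_features=2)
toy_vect.fit(toy_docs)
toy_terms = sorted(toy_vect.vocabulary_)
print(toy_terms)
```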


### 14. Create a count vectorizer with ngram_range = (1, 2) and max_features = 15000 and pass it to the `tokenize_test` function to print the accuracy score

In [114]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None,
                       max_features=15000,
                       ngram_range=(1, 2))

In [115]:
tokenize_test(vect)

Features:  15000
Accuracy:  0.8788732394366198


### 15. Create a count vectorizer with ngram_range = (1, 2) that keeps only terms appearing in at least 2 documents (min_df = 2) and pass it to the `tokenize_test` function to print the accuracy score

In [116]:
vect = CountVectorizer(analyzer="word",
                       tokenizer=None,
                       preprocessor=None,
                       stop_words=None,
                       max_features=None,
                       ngram_range=(1, 2),
                       min_df=2)

In [117]:
tokenize_test(vect)

Features:  5869
Accuracy:  0.863849765258216
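
Note that `min_df` counts documents, not occurrences: a term repeated within a single tweet still has a document frequency of 1. A small sketch on made-up documents (single words only, to keep the output short):

```python
from sklearn.feature_extraction.text import CountVectorizer

# 'crashy' occurs twice but in one document only; 'app' and 'great' occur in two documents
toy_docs = ["crashy crashy app", "app great", "great ipad"]

toy_vect = CountVectorizer(min_df=2)
toy_vect.fit(toy_docs)
toy_terms = sorted(toy_vect.vocabulary_)
print(toy_terms)
```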
