## Sentiment analysis

The goal is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of each tweet, we classify whether the sentiment directed at a particular device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NAs while reading the dataset

In [1323]:
import pandas as pd

In [1324]:
tweets = pd.read_csv("tweets.csv", encoding="Unicode_escape")

In [1325]:
tweets.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion


In [1326]:
tweets.shape

(9092, 3)

In [1327]:
df_tweets = tweets.dropna(subset = ["tweet_text"])

In [1328]:
df_tweets.shape

(9092, 3)

In [1329]:
df_tweets.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion


### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [1330]:
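def preprocess(text):
    # Return the text if it is pure ASCII, otherwise an empty string.
    # str has no .decode() in Python 3, so round-trip through bytes instead.
    try:
        return text.encode('ascii').decode('ascii')
    except (UnicodeError, AttributeError):
        return ""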

In [1331]:
import re
import unicodedata

import nltk
from nltk.tokenize.toktok import ToktokTokenizer

# Fetch only the stop-word corpus instead of opening the interactive downloader.
nltk.download('stopwords')

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

def remove_special_characters(text, remove_digits=False):
    # Keep letters, whitespace and (optionally) digits; drop everything else.
    # The range must be A-Z: A-z would also admit [, \, ], ^, _ and `.
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    return re.sub(pattern, '', text)

def to_lower_case(text):
    return text.lower()

def remove_accented_chars(text):
    # Decompose accented characters, then drop the non-ASCII combining marks.
    return unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')

def remove_stopwords(text, is_lower_case=False):
    tokens = [token.strip() for token in tokenizer.tokenize(text)]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list]
    return ' '.join(filtered_tokens)


In [1332]:
# Each assignment reads from `tweet_text`, not from the previous result,
# so `text` ends up holding only the output of the last step (stop-word removal).
df_tweets['text'] = [to_lower_case(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_special_characters(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_accented_chars(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_stopwords(text) for text in df_tweets.tweet_text]

In [1333]:
df_tweets.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,.@wesley83 3G iPhone. 3 hrs tweeting #RISE_Aus...
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,@jessedee Know @fludapp ? Awesome iPad/iPhone ...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,@swonderlin wait #iPad 2 also. sale #SXSW .
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,@sxsw hope year ' festival ' crashy year ' iPh...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,@sxtxstate great stuff Fri #SXSW : Marissa May...
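
The `text` column shown above still contains capitals, `@`, `#` and punctuation, because the cleaning steps were not chained. A minimal sketch of a chained version follows. It is an illustration only: `clean_tweet` and the tiny `STOPWORDS` set are hypothetical names introduced here (the notebook itself uses the NLTK stop list), and the second sentence of the sample tweet is made up.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word set used only for this illustration.
STOPWORDS = {"i", "have", "a", "the", "is", "for", "about"}

def clean_tweet(text):
    """Apply each cleaning step to the output of the previous one."""
    text = text.lower()
    # Decompose accented characters and drop the non-ASCII remainder.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Keep only lowercase letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text)
    # Drop stop words and collapse repeated whitespace.
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)

print(clean_tweet(".@wesley83 I have a 3G iPhone. Café is great!"))
```

On the dataframe this could be applied with `df_tweets["tweet_text"].apply(clean_tweet)`. The feature counts and scores reported in the rest of this notebook come from the original cell, so they would not carry over unchanged.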


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [1334]:
emotion_colname = "is_there_an_emotion_directed_at_a_brand_or_product"

In [1335]:
df_tweets1 = df_tweets[df_tweets[emotion_colname].isin(["Negative emotion","Positive emotion"])]

In [1336]:
df_tweets1.shape

(3548, 4)

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [1337]:
from sklearn.feature_extraction.text import CountVectorizer

In [1338]:
# create the transform
vectorizer = CountVectorizer()

In [1339]:
vectorizer.fit(df_tweets1["text"])

CountVectorizer()

In [1340]:
# summarize
print(vectorizer.vocabulary_)



In [1341]:
print("Vector Type: ", type(vectorizer))

Vector Type:  <class 'sklearn.feature_extraction.text.CountVectorizer'>


In [1342]:
vector = vectorizer.transform(df_tweets1["text"])

In [1343]:
print(type(vector))
print(vector.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
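
To make the document-term matrix less abstract, here is a plain-Python sketch of the counting that `CountVectorizer` performs with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). It is a simplified stand-in, not the library's implementation, and the three documents are made up.

```python
import re
from collections import Counter

# Made-up mini corpus, used only for illustration.
docs = ["Love my new iPad", "my iPhone battery died", "love love the iPad app"]

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Vocabulary: every distinct token, sorted, mapped to a column index.
vocab = {term: i for i, term in enumerate(sorted({t for d in docs for t in tokenize(d)}))}

def to_counts(doc):
    # One row of the matrix: how often each vocabulary term occurs in the document.
    counts = Counter(tokenize(doc))
    return [counts.get(term, 0) for term in vocab]

matrix = [to_counts(d) for d in docs]
print(vocab)
for row in matrix:
    print(row)
```

Each row corresponds to one document and each column to one vocabulary term. Most entries are zero, which is why the real matrix is stored in sparse form.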


### 5. Find number of different words in vocabulary

In [1344]:
# Size of the learned vocabulary; get_feature_names() was removed in scikit-learn 1.2
len(vectorizer.vocabulary_)

5956

#### Tip: To see all available functions for an Object use dir

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [1345]:
df_tweets1[emotion_colname].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
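
The classes are clearly imbalanced, so it helps to know what accuracy a trivial model would reach. The sketch below computes the accuracy of always predicting the majority class, using the two counts printed above. The variable names are introduced here for illustration.

```python
# Class counts copied from the value_counts() output above.
counts = {"Positive emotion": 2978, "Negative emotion": 570}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Any accuracy score on this data is best read against this baseline of roughly 0.84 rather than against 0.5.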

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [1346]:
# .assign returns a new DataFrame, which avoids pandas' SettingWithCopyWarning
df_tweets1 = df_tweets1.assign(Label=df_tweets1[emotion_colname].map({'Positive emotion': 1, 'Negative emotion': 0}))


In [1347]:
df_tweets1.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text,Label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,.@wesley83 3G iPhone. 3 hrs tweeting #RISE_Aus...,0
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,@jessedee Know @fludapp ? Awesome iPad/iPhone ...,1


### 8. Define the feature set (independent variable, X) as the `text` column and `Label` as the target (dependent variable), then split into train and test sets

In [1348]:
df_features = vector.toarray()
df_target = df_tweets1["Label"]

In [1349]:
from sklearn.model_selection import train_test_split
test_size = 0.40  # 60:40 train/test split
seed = 7  # random seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=test_size, random_state=seed)
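
A quick check of what `test_size = 0.40` means for the 3548 labelled rows. This sketch assumes the test set size is rounded up, which is consistent with the support of 1420 shown in the classification reports further down. `split_sizes` is a hypothetical helper, not part of scikit-learn.

```python
import math

def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size, rounding the test set up."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

print(split_sizes(3548, 0.40))
```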

### 9. Predicting the sentiment


#### Use Naive Bayes and Logistic Regression, and compare their accuracy scores, to predict the sentiment of the given text

In [1350]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

In [1351]:
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_model_score = lr_model.score(X_test, y_test)  # accuracy on the test set
print("Logistic Regression: Accuracy Score\n" , lr_model_score)

# Gaussian Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_model_score = nb_model.score(X_test, y_test)
print("Naive Bayes: Accuracy Score\n" , nb_model_score)

# Random Forest
rf_model = RandomForestClassifier(n_estimators=75, criterion='entropy', random_state=0)
rf_model.fit(X_train, y_train)
rf_model_score = rf_model.score(X_test, y_test)
print("randomforest: Accuracy Score\n" , rf_model_score)

# SVM with a linear kernel
model = SVC(kernel='linear', random_state=0)
model.fit(X_train, y_train)
model_score = model.score(X_test, y_test)
print("svm: Accuracy Score\n" , model_score)

# k-Nearest Neighbours (Minkowski metric with p=2, i.e. Euclidean distance)
n_model = KNeighborsClassifier(n_neighbors=5, metric='minkowski', p=2)
n_model.fit(X_train, y_train)
n_model_score = n_model.score(X_test, y_test)
print("kneighbors: Accuracy Score\n" , n_model_score)

# Decision Tree
dt_model = DecisionTreeClassifier(criterion='entropy', random_state=0)
dt_model.fit(X_train, y_train)
dt_model_score = dt_model.score(X_test, y_test)
print("decisiontree: Accuracy Score\n" , dt_model_score)


Logistic Regression: Accuracy Score
 0.8718309859154929
Naive Bayes: Accuracy Score
 0.7711267605633803
randomforest: Accuracy Score
 0.8626760563380281
svm: Accuracy Score
 0.8704225352112676
kneighbors: Accuracy Score
 0.8535211267605634
decisiontree: Accuracy Score
 0.8507042253521127


In [1352]:
y_predict = lr_model.predict(X_test)
cr = metrics.classification_report(y_test,y_predict)
print("Logistic Regression: Classification Report: \n\n", cr)

Logistic Regression: Classification Report: 

               precision    recall  f1-score   support

           0       0.74      0.32      0.45       230
           1       0.88      0.98      0.93      1190

    accuracy                           0.87      1420
   macro avg       0.81      0.65      0.69      1420
weighted avg       0.86      0.87      0.85      1420
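
The f1-score column and the two averages can be reproduced from the precision, recall and support values in the report. The sketch below works from the rounded numbers printed above, so it is a consistency check rather than an exact recomputation. All names are introduced here for illustration.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, support) per class, copied from the report above.
report = {0: (0.74, 0.32, 230), 1: (0.88, 0.98, 1190)}

f1_scores = {label: f1(p, r) for label, (p, r, _) in report.items()}
total_support = sum(s for _, _, s in report.values())

# Macro average treats both classes equally; weighted average weights by support.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)
weighted_f1 = sum(f1_scores[label] * s for label, (_, _, s) in report.items()) / total_support

print({label: round(v, 2) for label, v in f1_scores.items()})
print(round(macro_f1, 2), round(weighted_f1, 2))
```

The low recall for class 0 (negative tweets) pulls the macro average well below the accuracy, which is what one would expect with this class imbalance.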



In [1353]:
y_predict = nb_model.predict(X_test)
cr = metrics.classification_report(y_test,y_predict)
print("Naive Bayes: Classification Report: \n\n", cr)

Naive Bayes: Classification Report: 

               precision    recall  f1-score   support

           0       0.34      0.45      0.39       230
           1       0.89      0.83      0.86      1190

    accuracy                           0.77      1420
   macro avg       0.62      0.64      0.62      1420
weighted avg       0.80      0.77      0.78      1420



### 10. Create a function called `tokenize_predict` that takes a CountVectorizer object as input and prints the accuracy for x (text) and y (labels)

In [1354]:
df_features = df_tweets1["text"]
df_target = df_tweets1["Label"]

In [1355]:
from sklearn.model_selection import train_test_split
test_size = 0.40 # taking 60:40 training and test set
seed = 7  # random seed for reproducibility

X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=test_size, random_state=seed)

In [1356]:
from sklearn.naive_bayes import MultinomialNB
from sklearn import metrics
def tokenize_predict2(vectorizer, x_train, y_train, x_test, y_test):
    x_train_dtm = vectorizer.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vectorizer.transform(x_test)
    nb = MultinomialNB()
    nb.fit(x_train_dtm, y_train)
    y_pred_class = nb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [1357]:
tokenize_predict2(vectorizer, X_train, y_train, X_test, y_test)

Features:  4518
Accuracy:  0.8605633802816901
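
The feature count drops from 5956 (vectorizer fitted on all rows) to 4518 because the vocabulary is now learned from the training split only. The plain-Python sketch below illustrates that fit/transform split: words that appear only in the test documents get no column. It is a simplified stand-in for `fit_transform` and `transform`, with made-up documents.

```python
import re

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# Made-up training and test documents, for illustration only.
train_docs = ["great ipad app", "iphone battery died"]
test_docs = ["great android battery"]

# "fit": learn the vocabulary from the training documents only.
vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})

def transform(doc):
    # "transform": count known terms; tokens outside the vocabulary are ignored.
    tokens = tokenize(doc)
    return [tokens.count(term) for term in vocab]

print(vocab)
print(transform(test_docs[0]))
```

Here "android" is silently dropped from the test document because it never occurred in training.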


In [1358]:
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics
def tokenize_predict(vectorizer, x_train, y_train, x_test, y_test):
    x_train_dtm = vectorizer.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vectorizer.transform(x_test)
    cb = ComplementNB()
    cb.fit(x_train_dtm, y_train)
    y_pred_class = cb.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [1359]:
tokenize_predict(vectorizer, X_train, y_train, X_test, y_test)

Features:  4518
Accuracy:  0.8605633802816901


In [1360]:
from sklearn.naive_bayes import ComplementNB
from sklearn import metrics
from sklearn.model_selection import GridSearchCV
def tokenize_predict4(vectorizer, x_train, y_train, x_test, y_test):
    x_train_dtm = vectorizer.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vectorizer.transform(x_test)
    params = {'alpha': [0.01, 0.1, 0.5, 1.0, 10.0]}
    cb_grid = GridSearchCV(ComplementNB(),param_grid=params, n_jobs=-1, cv=7, verbose=5)
    cb_grid.fit(x_train_dtm, y_train)
    y_pred_class = cb_grid.predict(x_test_dtm)
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [1361]:
tokenize_predict4(vectorizer, X_train, y_train, X_test, y_test)

Features:  4518
Fitting 7 folds for each of 5 candidates, totalling 35 fits
Accuracy:  0.8387323943661972


In [1362]:
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics
def tokenize_predict3(vectorizer, x_train, y_train, x_test, y_test):
    x_train_dtm = vectorizer.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vectorizer.transform(x_test)
    gb = GaussianNB()
    gb.fit(x_train_dtm.toarray(), y_train)
    y_pred_class = gb.predict(x_test_dtm.toarray())
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [1363]:
tokenize_predict3(vectorizer, X_train, y_train, X_test, y_test)

Features:  4518
Accuracy:  0.7711267605633803


In [1364]:
from sklearn.naive_bayes import BernoulliNB
from sklearn import metrics
def tokenize_predict5(vectorizer, x_train, y_train, x_test, y_test):
    x_train_dtm = vectorizer.fit_transform(x_train)
    print('Features: ', x_train_dtm.shape[1])
    x_test_dtm = vectorizer.transform(x_test)
    bb = BernoulliNB()
    bb.fit(x_train_dtm.toarray(), y_train)
    y_pred_class = bb.predict(x_test_dtm.toarray())
    print('Accuracy: ', metrics.accuracy_score(y_test, y_pred_class))

In [1365]:
tokenize_predict5(vectorizer, X_train, y_train, X_test, y_test)

Features:  4518
Accuracy:  0.8492957746478873


### 11. Create a count vectorizer with `ngram_range=(1, 2)` and pass it to `tokenize_predict` to print the accuracy score

In [1373]:
vectorizer = CountVectorizer(ngram_range=(1,2),stop_words={'english'})
tokenize_predict(vectorizer,X_train, y_train, X_test, y_test)

Features:  18632
Accuracy:  0.8725352112676056
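
Adding bigrams is what raises the feature count from 4518 to 18632. A minimal sketch of how unigrams and bigrams are generated from a token list follows. `ngrams` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out

print(ngrams(["great", "ipad", "app"]))        # unigrams and bigrams
print(ngrams(["great", "ipad", "app"], 2, 2))  # bigrams only
```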


### 12. Create a count vectorizer with `stop_words='english'` and pass it to `tokenize_predict` to print the accuracy score

In [1374]:
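# {'english'} is a one-element custom stop list (only the token "english" is removed);
# the built-in English list is selected with the string stop_words='english'.
vectorizer = CountVectorizer(stop_words={'english'})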
tokenize_predict(vectorizer,X_train, y_train, X_test, y_test)

Features:  4518
Accuracy:  0.8605633802816901
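
The feature count of 4518 matches the default vectorizer, which suggests that no stop words were actually removed. A likely explanation, stated here as an assumption about how the argument is interpreted: a collection such as `{'english'}` is read as the literal words to drop, whereas the string `'english'` requests the built-in list. The plain-Python sketch below shows the difference in effect. `ENGLISH_STOPWORDS` is a hypothetical, much shortened stand-in for a real stop list.

```python
# Hypothetical stand-in for a real English stop list (a real one is far longer).
ENGLISH_STOPWORDS = {"the", "is", "on", "my", "for", "and", "to"}

def drop_stopwords(tokens, stop_words):
    """Remove every token that appears in stop_words."""
    return [tok for tok in tokens if tok not in stop_words]

tokens = ["the", "battery", "on", "my", "iphone", "is", "great"]

print(drop_stopwords(tokens, {"english"}))        # custom list holding one word
print(drop_stopwords(tokens, ENGLISH_STOPWORDS))  # an actual stop list
```

With the one-word list the tokens pass through unchanged, so the vocabulary size stays the same.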


### 13. Create a count vectorizer with `stop_words='english'` and `max_features=300` and pass it to `tokenize_predict` to print the accuracy score

In [1368]:
# Keeps the 3000 most frequent terms; scored with MultinomialNB via tokenize_predict2
vectorizer = CountVectorizer(stop_words={'english'}, max_features = 3000)
tokenize_predict2(vectorizer,X_train, y_train, X_test, y_test)

Features:  3000
Accuracy:  0.8612676056338028
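
`max_features` caps the vocabulary at the terms with the highest total frequency across the corpus. The sketch below illustrates that selection in plain Python on made-up, already tokenized documents. `top_terms` is a hypothetical helper, and ties at the cut-off are not handled in any particular way.

```python
from collections import Counter

# Made-up tokenized documents, for illustration only.
docs = [["great", "ipad", "app"], ["ipad", "battery"], ["great", "ipad"], ["great", "app", "ipad"]]

def top_terms(docs, max_features):
    """Keep the max_features terms with the highest total count across all documents."""
    totals = Counter(tok for doc in docs for tok in doc)
    return sorted(term for term, _ in totals.most_common(max_features))

print(top_terms(docs, 2))
print(top_terms(docs, 3))
```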


### 14. Create a count vectorizer with `ngram_range=(1, 2)` and `max_features=15000` and pass it to `tokenize_predict` to print the accuracy score

In [1369]:
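# ngram_range=(2, 2) builds bigrams only, capped at the 3000 most frequent; scored with MultinomialNB
vectorizer = CountVectorizer(stop_words={'english'},ngram_range=(2,2),max_features =3000)
tokenize_predict2(vectorizer,X_train, y_train, X_test, y_test)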

Features:  3000
Accuracy:  0.8535211267605634


### 15. Create a count vectorizer with `ngram_range=(1, 2)` that keeps only terms appearing in at least 2 documents (`min_df=2`) and pass it to `tokenize_predict` to print the accuracy score

In [1370]:
# No min_df is set here, so every unigram and bigram seen in training is kept
vectorizer = CountVectorizer(stop_words={'english'},ngram_range=(1,2))
tokenize_predict(vectorizer,X_train, y_train, X_test, y_test)

Features:  18632
Accuracy:  0.8725352112676056
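
An integer `min_df` filters on document frequency: a term is kept only if it occurs in at least that many documents, regardless of how often it repeats inside one document. The plain-Python sketch below shows this on made-up tokenized documents. `filter_min_df` is a hypothetical helper for illustration.

```python
from collections import Counter

# Made-up tokenized documents, for illustration only.
docs = [["wow", "wow", "ipad"], ["ipad", "battery"], ["great", "app"], ["great", "battery"]]

def filter_min_df(docs, min_df):
    """Keep terms that occur in at least min_df documents."""
    # set(doc) counts each term at most once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    return sorted(term for term, df in doc_freq.items() if df >= min_df)

print(filter_min_df(docs, 2))
```

"wow" occurs twice but only in one document, so it is dropped at `min_df=2`. With unigrams and bigrams combined, such a filter would typically remove many of the rare bigrams.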


In [1371]:
tokenize_predict4(vectorizer, X_train, y_train, X_test, y_test)

Features:  18632
Fitting 7 folds for each of 5 candidates, totalling 35 fits
Accuracy:  0.8401408450704225
