## Sentiment analysis

The objective is to perform sentiment analysis on tweets that users posted about various mobile devices.
Based on the text of a tweet, we classify whether the user's sentiment toward a particular device is positive or negative.

### 1. Read the dataset (tweets.csv) and drop the NAs while reading it

In [1]:
import pandas as pd

In [2]:
tweets = pd.read_csv("tweets.csv", encoding="Unicode_escape")

In [3]:
tweets.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion


In [4]:
tweets.shape

(9092, 3)

In [5]:
# .copy() makes df_tweets an independent frame, so adding columns later is safe
df_tweets = tweets.dropna(subset=["tweet_text"]).copy()

In [6]:
df_tweets.shape

(9092, 3)
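
The heading asks for the NAs to be dropped while reading. One way to do that is to chain `dropna` onto `read_csv`. A minimal sketch, using a small in-memory CSV with made-up rows in place of `tweets.csv` (the column names here are shortened for illustration):

```python
import io
import pandas as pd

# Made-up stand-in for tweets.csv: the second row has no tweet text.
csv_data = io.StringIO(
    "tweet_text,emotion\n"
    "I love my phone,Positive emotion\n"
    ",Negative emotion\n"
    "Battery died again,Negative emotion\n"
)

# Read and drop rows with a missing tweet_text in one expression;
# reset_index gives the surviving rows a clean 0..n-1 index.
demo = (
    pd.read_csv(csv_data)
    .dropna(subset=["tweet_text"])
    .reset_index(drop=True)
)
print(demo.shape)
```

The same chain should work with the real file by passing `"tweets.csv"` and the `encoding` argument used above.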

In [7]:
df_tweets.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion


### 2. Preprocess the text and add the preprocessed text in a column with name `text` in the dataframe.

In [8]:
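def preprocess(text):
    # Return the text if it is pure ASCII, otherwise an empty string.
    # str has no .decode() in Python 3, so encode first to test for ASCII.
    try:
        return text.encode('ascii').decode('ascii')
    except (UnicodeEncodeError, AttributeError):
        return ""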

In [9]:
from nltk.tokenize.toktok import ToktokTokenizer
import nltk
import re
import unicodedata
from nltk.stem.porter import PorterStemmer

tokenizer = ToktokTokenizer()
stopword_list = nltk.corpus.stopwords.words('english')

def remove_special_characters(text, remove_digits=False):
    # A-Z (not A-z): the range A-z would also keep [ \ ] ^ _ and the backtick
    pattern = r'[^a-zA-Z0-9\s]' if not remove_digits else r'[^a-zA-Z\s]'
    text = re.sub(pattern, '', text)
    return text

def to_lower_case(text):
    return text.lower()

def remove_accented_chars(text):
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('utf-8', 'ignore')
    return text

def remove_stopwords(text, is_lower_case=False):
    tokens = tokenizer.tokenize(text)
    tokens = [token.strip() for token in tokens]
    if is_lower_case:
        filtered_tokens = [token for token in tokens if token not in stopword_list and '@' not in token]
    else:
        filtered_tokens = [token for token in tokens if token.lower() not in stopword_list and '@' not in token]
    filtered_text = ' '.join(filtered_tokens)    
    return filtered_text
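
The character class in `remove_special_characters` is easy to get subtly wrong: a range written `A-z` instead of `A-Z` also matches `[`, `\`, `]`, `^`, `_` and the backtick, because those characters sit between `Z` and `a` in ASCII. A short check on a made-up string:

```python
import re

wide = r'[^a-zA-z0-9\s]'    # A-z also covers [ \ ] ^ _ `
strict = r'[^a-zA-Z0-9\s]'  # letters, digits and whitespace only

sample = "great_app [beta] ^_^ #SXSW"

wide_result = re.sub(wide, '', sample)      # only '#' is removed
strict_result = re.sub(strict, '', sample)  # all punctuation is removed

print(repr(wide_result))
print(repr(strict_result))
```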


In [10]:
# Each step below reads from `tweet_text`, not from the previous result,
# so only the last assignment (stop-word removal) ends up in `text`.
df_tweets['text'] = [to_lower_case(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_special_characters(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_accented_chars(text) for text in df_tweets.tweet_text]
df_tweets['text'] = [remove_stopwords(text) for text in df_tweets.tweet_text]

In [11]:
df_tweets.head()

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"3G iPhone. 3 hrs tweeting #RISE_Austin , dead ..."
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,Know ? Awesome iPad/iPhone app ' likely apprec...
2,@swonderlin Can not wait for #iPad 2 also. The...,iPad,Positive emotion,wait #iPad 2 also. sale #SXSW .
3,@sxsw I hope this year's festival isn't as cra...,iPad or iPhone App,Negative emotion,hope year ' festival ' crashy year ' iPhone ap...
4,@sxtxstate great stuff on Fri #SXSW: Marissa M...,Google,Positive emotion,great stuff Fri #SXSW : Marissa Mayer ( Google...
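
In cell `In [10]` every step reads from `tweet_text`, so the steps do not build on one another, which is why the `text` column above still contains capitals and punctuation. If the intent is to apply the steps in sequence, each one has to consume the previous result. A minimal standard-library sketch of that chaining follows; the function name `clean` and the tiny stop-word set are illustrative assumptions, not the NLTK list used above.

```python
import re
import unicodedata

# Illustrative stop-word set; the notebook uses NLTK's English list instead.
STOPWORDS = {"i", "have", "a", "the", "my", "is", "at"}

def clean(text):
    """Apply the cleaning steps in sequence, each on the previous result."""
    text = text.lower()
    # Strip accents: decompose, then drop the non-ASCII combining marks.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop @mentions before punctuation is removed, while '@' is still visible.
    text = re.sub(r"@\w+", "", text)
    # Keep only letters, digits and whitespace.
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)
    # Remove stop words; split() also collapses repeated whitespace.
    return " ".join(tok for tok in text.split() if tok not in STOPWORDS)

print(clean(".@wesley83 I have a 3G iPhone at the café!"))
```

On the dataframe this could be applied as `df_tweets['tweet_text'].apply(clean)`. The ordering matters: mentions are removed before punctuation, because once `@` has been stripped a mention can no longer be told apart from an ordinary word. Running the full chain would likely change the vocabulary size and scores recorded in the rest of this notebook.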


### 3. Consider only rows having Positive emotion and Negative emotion and remove other rows from the dataframe.

In [12]:
emotion_colname = "is_there_an_emotion_directed_at_a_brand_or_product"

In [13]:
df_tweets1 = df_tweets[df_tweets[emotion_colname].isin(["Negative emotion","Positive emotion"])]

In [14]:
df_tweets1.shape

(3548, 4)

### 4. Represent text as numerical data using `CountVectorizer` and get the document term frequency matrix

#### Use `vect` as the variable name for initialising CountVectorizer.

In [15]:
from sklearn.feature_extraction.text import CountVectorizer

In [16]:
# create the transform
vectorizer = CountVectorizer()

In [17]:
vectorizer.fit(df_tweets1["text"])

CountVectorizer()

In [18]:
# summarize
print(vectorizer.vocabulary_)



In [19]:
vector = vectorizer.transform(df_tweets1["text"])

In [20]:
print(type(vector))
print(vector.toarray())

<class 'scipy.sparse.csr.csr_matrix'>
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]]
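
To make the document-term matrix concrete, here is a small pure-Python sketch of what `CountVectorizer` computes with its default settings (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary). The two documents are made up for illustration.

```python
import re
from collections import Counter

docs = ["iPhone app is great", "Great iPad, great app"]

# Default CountVectorizer tokenization: lowercase, then words of 2+ characters.
token_pattern = re.compile(r"(?u)\b\w\w+\b")
tokenized = [token_pattern.findall(doc.lower()) for doc in docs]

# Vocabulary: every distinct token, sorted, mapped to a column index.
distinct_terms = sorted({tok for toks in tokenized for tok in toks})
vocabulary = {term: idx for idx, term in enumerate(distinct_terms)}

# One row per document, one column per term, cell = count of term in document.
matrix = []
for toks in tokenized:
    counts = Counter(toks)
    matrix.append([counts.get(term, 0) for term in distinct_terms])

print(vocabulary)
print(matrix)
```

Most cells are zero even in this toy case, which is why scikit-learn returns a sparse matrix and why `toarray()` on the real data prints almost nothing but zeros.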


### 5. Find number of different words in vocabulary

In [21]:
# vocabulary_ works across scikit-learn versions; get_feature_names() was removed in 1.2
len(vectorizer.vocabulary_)

5933

#### Tip: To see all available attributes and methods of an object, use `dir`
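
A short illustration of the tip, filtering out the underscore-prefixed names so that only the public API is listed. A plain string is used here, but the same call should work on `vectorizer` or any other object:

```python
# List the public attributes and methods of any object with dir().
obj = "tweet"
public_names = [name for name in dir(obj) if not name.startswith("_")]
print(public_names[:5])
```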

### 6. Find out how many Positive and Negative emotions are there.

Hint: Use value_counts on that column

In [22]:
df_tweets1[emotion_colname].value_counts()

Positive emotion    2978
Negative emotion     570
Name: is_there_an_emotion_directed_at_a_brand_or_product, dtype: int64
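
The counts show a strong class imbalance. Passing `normalize=True` to `value_counts` expresses this as proportions. A minimal sketch that rebuilds a label column with the same counts as the output above, rather than reloading the data:

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above.
labels = pd.Series(["Positive emotion"] * 2978 + ["Negative emotion"] * 570)

# normalize=True returns each class's share of the rows instead of raw counts.
shares = labels.value_counts(normalize=True)
print(shares.round(3))
```

Roughly five out of six tweets are positive, which is worth keeping in mind when reading accuracy figures later on.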

### 7. Change the labels for Positive and Negative emotions as 1 and 0 respectively and store in a different column in the same dataframe named 'Label'

Hint: use map on that column and give labels

In [23]:
# assign() returns a new dataframe, which avoids pandas' SettingWithCopyWarning
# that a direct column assignment on the filtered slice would raise
df_tweets1 = df_tweets1.assign(Label=df_tweets1[emotion_colname].map({'Positive emotion': 1, 'Negative emotion': 0}))


In [24]:
df_tweets1.head(2)

Unnamed: 0,tweet_text,emotion_in_tweet_is_directed_at,is_there_an_emotion_directed_at_a_brand_or_product,text,Label
0,.@wesley83 I have a 3G iPhone. After 3 hrs twe...,iPhone,Negative emotion,"3G iPhone. 3 hrs tweeting #RISE_Austin , dead ...",0
1,@jessedee Know about @fludapp ? Awesome iPad/i...,iPad or iPhone App,Positive emotion,Know ? Awesome iPad/iPhone app ' likely apprec...,1


### 8. Define the feature set (independent variable, X) from the vectorized `text` column and use `Label` as the target (dependent variable), then divide into train and test datasets

In [25]:
df_features = vector.toarray()
df_target = df_tweets1["Label"]

In [26]:
from sklearn.model_selection import train_test_split
test_size = 0.30  # 70:30 split between training and test sets
seed = 7  # random seed for repeatability
X_train, X_test, y_train, y_test = train_test_split(df_features, df_target, test_size=test_size, random_state=seed)
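
As a quick sanity check on the split sizes: with a fractional `test_size`, scikit-learn rounds the test share up, which here gives 1065 test rows and matches the support total in the classification reports. A small arithmetic sketch:

```python
import math

n_samples = 3548   # rows left after keeping positive/negative tweets
test_size = 0.30

# The test share is rounded up; the remainder goes to training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)
```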

### 9. Predicting the sentiment

In [27]:
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn import metrics

In [28]:
# Logistic Regression
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
lr_model_score = lr_model.score(X_test, y_test) # get the accuracy score for testing samples
print("Logistic Regression: Accuracy Score\n" , lr_model_score)

# Naive Bayes
nb_model = GaussianNB()
nb_model.fit(X_train, y_train)
nb_model_score = nb_model.score(X_test, y_test) # get the accuracy score for testing samples
print("Naive Bayes: Accuracy Score\n" , nb_model_score)


Logistic Regression: Accuracy Score
 0.8704225352112676
Naive Bayes: Accuracy Score
 0.7887323943661971


In [29]:
y_predict = lr_model.predict(X_test)
cr = metrics.classification_report(y_test,y_predict)
print("Logistic Regression: Classification Report: \n\n", cr)

Logistic Regression: Classification Report: 

               precision    recall  f1-score   support

           0       0.72      0.32      0.44       172
           1       0.88      0.98      0.93       893

    accuracy                           0.87      1065
   macro avg       0.80      0.65      0.69      1065
weighted avg       0.86      0.87      0.85      1065



In [30]:
y_predict = nb_model.predict(X_test)
cr = metrics.classification_report(y_test,y_predict)
print("Naive Bayes: Classification Report: \n\n", cr)

Naive Bayes: Classification Report: 

               precision    recall  f1-score   support

           0       0.38      0.49      0.43       172
           1       0.90      0.85      0.87       893

    accuracy                           0.79      1065
   macro avg       0.64      0.67      0.65      1065
weighted avg       0.81      0.79      0.80      1065



In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
parameters = {'kernel': ['rbf','sigmoid','linear'],
              'C': [0.1,1,10,100],
              'gamma': [0.01, 0.1, 1],
             }

svc = SVC()
model = GridSearchCV(svc, param_grid=parameters, cv=2)
model.fit(X_train, y_train)

In [None]:
model.best_params_
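
Before running the grid search it helps to estimate its cost, since each parameter combination is fitted once per fold. A small sketch that counts the fits for the grid above, assuming `GridSearchCV`'s default of refitting the best model once at the end:

```python
from itertools import product

parameters = {
    'kernel': ['rbf', 'sigmoid', 'linear'],
    'C': [0.1, 1, 10, 100],
    'gamma': [0.01, 0.1, 1],
}
cv_folds = 2

# Every combination of kernel, C and gamma is one candidate model.
combinations = list(product(*parameters.values()))

# Each candidate is fitted once per fold, plus one final refit on all training data.
n_fits = len(combinations) * cv_folds + 1
print(len(combinations), n_fits)
```

Since `gamma` has no effect with the linear kernel, 8 of the 36 candidates are redundant. Passing `param_grid` as a list of dictionaries, one per kernel, is one way to avoid that.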

In [31]:
from sklearn.svm import SVC
svc = SVC(kernel='rbf', C=10, gamma=0.1)
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)
print(metrics.classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.81      0.30      0.43       172
           1       0.88      0.99      0.93       893

    accuracy                           0.88      1065
   macro avg       0.84      0.64      0.68      1065
weighted avg       0.87      0.88      0.85      1065
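
Because the test set is imbalanced (172 negative versus 893 positive tweets), accuracy is easier to interpret against a majority-class baseline. A small sketch using only the figures printed above:

```python
support = {"negative": 172, "positive": 893}
accuracy = {
    "Logistic Regression": 0.8704225352112676,
    "Naive Bayes": 0.7887323943661971,
    "SVC (rbf, C=10, gamma=0.1)": 0.88,  # rounded, as printed in the report
}

# Accuracy obtained by always predicting the majority (positive) class.
baseline = max(support.values()) / sum(support.values())
print(f"always-positive baseline: {baseline:.3f}")

# Margin of each model over (or under) that baseline.
for name, acc in accuracy.items():
    print(f"{name}: {acc - baseline:+.3f} vs baseline")
```

Logistic regression and the SVC beat the baseline by only a few points, and Naive Bayes falls below it, even though Naive Bayes has the highest recall on the negative class (0.49). For this data, the per-class precision and recall in the reports are more informative than accuracy alone.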

