# Project: Climate Change - Sentiment Classifier

## Topics discussed:
- Developed a sentiment classifier from manually labelled text to predict the sentiment of new sentences
- Tried the classification models listed below:
  - K-NN
  - Logistic Regression
  - Multinomial NB
  - Linear Support Vector Machine
  - Kernelized Support Vector Machines
  - Neural Network
  - Neural Network (tuned parameters)
  - XG Boost Classifier (implemented here with scikit-learn's GradientBoostingClassifier)
  - Decision Tree
  - Random Forest
- Compared all models
- Identified the best model and tested it on example sentences

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import tree
from sklearn.ensemble import RandomForestClassifier

### Reading in the combined labelled rows (2,272 in total) with the target variable

In [2]:
# Sentiment file
sentiment_file = "_sentiments_target.csv"
df = pd.read_csv(sentiment_file, sep="\t")

### Checking the sample entries

In [3]:
df.head()

Unnamed: 0,sentence,sentiment
0,It also increases carbon dioxide emissions whi...,neutral
1,We can already see this happening.,negative
2,The ecological disaster is a consequence of no...,positive
3,We may be dealing with an issue with a level o...,negative
4,Preventable chronic diseases are Australias le...,negative


### Checking the label distribution 
#### About half of the sentences (1,148 of 2,272) are labelled neutral

In [4]:
df.sentiment.value_counts()

neutral     1148
negative     662
positive     462
Name: sentiment, dtype: int64

In [5]:
vectorizer = TfidfVectorizer(use_idf=True, norm="l2", stop_words="english", max_df=0.7)
X = vectorizer.fit_transform(df.sentence)
y = df.sentiment

In [6]:
X.shape, y.shape

((2272, 7011), (2272,))
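The vectorizer above is fitted on every sentence before the data is split, so the vocabulary and IDF weights also see the rows that later become test data. One common way to avoid this is to put the vectorizer and the classifier in a single pipeline, so each cross-validation fold refits the vectorizer on its own training part only. The following is a minimal sketch under stated assumptions: the eight sentences and their labels are made up for illustration, and Logistic Regression is just a stand-in classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus (not taken from the labelled file).
toy_texts = [
    "emissions are rising quickly",
    "floods destroyed the coastal town",
    "drought ruined the harvest",
    "storms damaged many homes",
    "solar power created new jobs",
    "wind farms lowered energy bills",
    "forests are recovering well",
    "clean air improved public health",
]
toy_labels = ["negative"] * 4 + ["positive"] * 4

# The pipeline refits TF-IDF inside every fold, so held-out text never
# influences the vocabulary or the IDF weights.
pipe = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    LogisticRegression(max_iter=1000),
)

# Raw strings go in; vectorization happens inside the pipeline.
toy_scores = cross_val_score(pipe, toy_texts, toy_labels, cv=2)
print(toy_scores, toy_scores.mean())
```

With a pipeline like this, the raw `df.sentence` column would be split first and passed to `cross_val_score` directly, instead of the precomputed matrix `X`.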

### Splitting data into test and training sets

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)
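Because the classes are imbalanced, a plain random split can shift the class proportions between the training and test sets. The sketch below shows the `stratify` option of `train_test_split`, which keeps the proportions close to those of the full data. It rebuilds a label list from the class counts printed earlier (1148 / 662 / 462) instead of reading the file, and the index list is only a placeholder for the real features.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Labels rebuilt from the value counts shown above; order does not matter here.
labels = ["neutral"] * 1148 + ["negative"] * 662 + ["positive"] * 462
row_ids = list(range(len(labels)))  # placeholder for the feature rows

ids_train, ids_test, lab_train, lab_test = train_test_split(
    row_ids,
    labels,
    test_size=0.20,
    random_state=0,
    stratify=labels,  # keep class proportions the same in both splits
)

# Share of each class in the stratified test split.
test_share = {
    label: count / len(lab_test) for label, count in Counter(lab_test).items()
}
print(len(lab_test), test_share)
```

In the notebook the same effect would come from passing `stratify=y` to the existing call; the scores reported below were produced without it.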

### User defined function to retrieve model score

In [8]:
def get_score(model,X_train, y_train, cv):
    scores = cross_val_score(model, X_train, y_train, cv=cv)
    return scores 

### User defined function to return best parameter with max score

In [9]:
def print_compare_score(val,scores,score_max,param,param_best):
    print("{} = {}: {}\n{:.3f}, {:.3f}\n".format(val,param, scores, scores.mean(), scores.std()))
    if scores.mean() > score_max:
        score_max = scores.mean()
        param_best = param 
    return(score_max,param_best)

### User defined function to fit test data to learned model

In [10]:
def train_test(X_train, X_test, y_train, y_test, classifier):
    classifier.fit(X_train, y_train)
    pred = classifier.predict(X_test)
    
    print("Train score: {:.2f}".format(classifier.score(X_train, y_train)))
    print("Test score: {:.2f}\n".format(classifier.score(X_test, y_test)))
    print("Classification report:\n{}".format(classification_report(y_test, pred, zero_division=0)))
    print(confusion_matrix(y_test,pred))
    
    return classifier

## K-NN

#### Finding the Best value of K

In [11]:
score_max = 0                      # score_max temporarily stores the best mean CV score seen so far
param_best = 0
for param in [10,40,50,70,100]:
    model = KNeighborsClassifier(n_neighbors=param)
    scores = get_score(model,X_train, y_train, 5)
    (score_max,param_best) = print_compare_score('k',scores,score_max,param,param_best)        
print("Highest score : {:.3f} when k = {}".format(score_max, param_best))

k = 10: [0.49725275 0.48901099 0.51239669 0.46831956 0.47933884]
0.489, 0.015

k = 40: [0.51098901 0.50824176 0.52892562 0.49586777 0.49586777]
0.508, 0.012

k = 50: [0.50549451 0.50824176 0.52892562 0.50137741 0.51790634]
0.512, 0.010

k = 70: [0.52197802 0.51648352 0.52066116 0.51790634 0.50688705]
0.517, 0.005

k = 100: [0.51098901 0.5        0.51790634 0.50137741 0.51790634]
0.510, 0.008

Highest score : 0.517 when k = 70
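The manual loop above can also be expressed with scikit-learn's `GridSearchCV`, which runs the same cross-validated comparison and records the best parameter. This is a sketch only: it uses a synthetic dataset from `make_classification` as a stand-in for the TF-IDF matrix, so its scores say nothing about the sentiment data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic three-class data, used only so the example runs on its own.
X_demo, y_demo = make_classification(
    n_samples=300,
    n_features=20,
    n_informative=5,
    n_classes=3,
    random_state=0,
)

# Same candidate values of k as in the loop above, with 5-fold CV.
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [10, 40, 50, 70, 100]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print("Best k:", search.best_params_["n_neighbors"])
print("Best mean CV accuracy: {:.3f}".format(search.best_score_))
```

After fitting, `search.best_estimator_` is already refitted on all the data passed to `fit`, so it can be scored on the test set directly.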


Run the Model to see the Test and Train Score with k = 70

In [12]:
print("k = {}".format(param_best))
knn = KNeighborsClassifier(n_neighbors=param_best)
knn = train_test(X_train, X_test, y_train, y_test, knn)

k = 70
Train score: 0.53
Test score: 0.50

Classification report:
              precision    recall  f1-score   support

    negative       0.53      0.11      0.18       150
     neutral       0.49      0.95      0.65       220
    positive       0.50      0.01      0.02        85

    accuracy                           0.50       455
   macro avg       0.51      0.36      0.28       455
weighted avg       0.51      0.50      0.38       455

[[ 16 133   1]
 [ 11 209   0]
 [  3  81   1]]
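The numbers in the classification report can be recomputed from the confusion matrix, which helps when reading the later reports. In the matrix, rows are the true classes and columns the predicted classes, in the order negative, neutral, positive. The sketch below uses the k-NN matrix printed above; the variable names are illustrative.

```python
import numpy as np

# k-NN confusion matrix from the output above
# (rows: true class, columns: predicted class).
cm = np.array([
    [16, 133, 1],   # true negative
    [11, 209, 0],   # true neutral
    [3, 81, 1],     # true positive
])

true_pos = np.diag(cm)                   # correctly classified per class
recall = true_pos / cm.sum(axis=1)       # share of each true class recovered
precision = true_pos / cm.sum(axis=0)    # share of each predicted class that is right
f1 = 2 * precision * recall / (precision + recall)
accuracy = true_pos.sum() / cm.sum()

print("recall   :", np.round(recall, 2))
print("precision:", np.round(precision, 2))
print("f1       :", np.round(f1, 2))
print("macro f1 : {:.2f}".format(f1.mean()))
print("accuracy : {:.2f}".format(accuracy))
```

The recall row makes the pattern visible: the model recovers 95% of the neutral sentences but only about 11% of the negative and 1% of the positive ones, which is why accuracy (0.50) and macro F1 (0.28) differ so much.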


In [13]:
summary_test = {}
summary_train = {}
summary_test["k-NNs Test"] = round(knn.score(X_test, y_test), 3)
summary_train["k-NNs Train"] = round(knn.score(X_train, y_train), 3)

## Logistic Regression

In [14]:
lr = LogisticRegression()

In [15]:
scores = get_score(lr,X_train, y_train, 5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

[0.52747253 0.51098901 0.5261708  0.51515152 0.51515152]
0.519, 0.007


Run the Model to see the Test and Train Score

In [16]:
lr = train_test(X_train, X_test, y_train, y_test, lr)

Train score: 0.83
Test score: 0.48

Classification report:
              precision    recall  f1-score   support

    negative       0.48      0.13      0.21       150
     neutral       0.49      0.90      0.64       220
    positive       0.14      0.01      0.02        85

    accuracy                           0.48       455
   macro avg       0.37      0.35      0.29       455
weighted avg       0.42      0.48      0.38       455

[[ 20 127   3]
 [ 18 199   3]
 [  4  80   1]]


In [17]:
summary_test["Logistic Regression Test"] = round(lr.score(X_test, y_test), 3)
summary_train["Logistic Regression Train"] = round(lr.score(X_train, y_train), 3)

## Multinomial NB

In [18]:
mnb = MultinomialNB()

In [19]:
scores = get_score(mnb,X_train, y_train, 5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

[0.51648352 0.52472527 0.51790634 0.51515152 0.51790634]
0.518, 0.003


Run the Model to see the Test and Train Score

In [20]:
mnb = train_test(X_train, X_test, y_train, y_test, mnb)

Train score: 0.71
Test score: 0.49

Classification report:
              precision    recall  f1-score   support

    negative       0.55      0.04      0.07       150
     neutral       0.49      0.99      0.66       220
    positive       0.00      0.00      0.00        85

    accuracy                           0.49       455
   macro avg       0.35      0.34      0.24       455
weighted avg       0.42      0.49      0.34       455

[[  6 143   1]
 [  2 218   0]
 [  3  82   0]]


In [21]:
summary_test["Multinomial Naive Bayes Test"] = round(mnb.score(X_test, y_test), 3)
summary_train["Multinomial Naive Bayes Train"] = round(mnb.score(X_train, y_train), 3)

## Modeling with Linear Support Vector Machines (SVMs)

#### Finding the Best value of C

In [22]:
score_max = 0
param_best = 0
for param in [0.01, 0.03, 0.1, 0.3, 0.4, 0.5, 1, 3, 10]:
    model = LinearSVC(C=param)
    scores = get_score(model,X_train, y_train, 5)
    (score_max,param_best) = print_compare_score('C',scores,score_max,param,param_best) 
        
print("Highest score : {:.3f} when C = {}".format(score_max, param_best))

C = 0.01: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 0.03: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 0.1: [0.52197802 0.50824176 0.50964187 0.51239669 0.51790634]
0.514, 0.005

C = 0.3: [0.51923077 0.50274725 0.53443526 0.52066116 0.49862259]
0.515, 0.013

C = 0.4: [0.51648352 0.48351648 0.53719008 0.50964187 0.49862259]
0.509, 0.018

C = 0.5: [0.51648352 0.48626374 0.53168044 0.49862259 0.49586777]
0.506, 0.016

C = 1: [0.48901099 0.46428571 0.50137741 0.47658402 0.49862259]
0.486, 0.014

C = 3: [0.47527473 0.43956044 0.49586777 0.45730028 0.49035813]
0.472, 0.021

C = 10: [0.46153846 0.4478022  0.49035813 0.44352617 0.48209366]
0.465, 0.018

Highest score : 0.515 when C = 0.3


### Run the Model to see the Test and Train Score with C = 0.3

In [23]:
print("C = {}".format(param_best))
svm = LinearSVC(C=param_best)
svm = train_test(X_train, X_test, y_train, y_test, svm)

C = 0.3
Train score: 0.96
Test score: 0.49

Classification report:
              precision    recall  f1-score   support

    negative       0.49      0.22      0.30       150
     neutral       0.50      0.85      0.63       220
    positive       0.14      0.02      0.04        85

    accuracy                           0.49       455
   macro avg       0.38      0.36      0.32       455
weighted avg       0.43      0.49      0.41       455

[[ 33 112   5]
 [ 27 186   7]
 [  7  76   2]]


In [24]:
summary_test["Linear SVMs Test"] = round(svm.score(X_test, y_test), 3)
summary_train["Linear SVMs Train"] = round(svm.score(X_train, y_train), 3)

## Modeling with Kernelized Support Vector Machines (KSVMs)

#### Finding the Best value of C

In [25]:
score_max = 0
param_best = 0
for param in [0.01, 0.03, 0.1, 0.3, 1, 3, 10]:
    model = SVC(C=param, kernel="rbf", gamma="scale")
    scores = get_score(model,X_train, y_train, 5)
    (score_max,param_best) = print_compare_score('C',scores,score_max,param,param_best) 
        
print("Highest score : {:.3f} when C = {}".format(score_max, param_best))


C = 0.01: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 0.03: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 0.1: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 0.3: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

C = 1: [0.51373626 0.51373626 0.51515152 0.50964187 0.50688705]
0.512, 0.003

C = 3: [0.52747253 0.5        0.52892562 0.51790634 0.49586777]
0.514, 0.014

C = 10: [0.52747253 0.5        0.52892562 0.51790634 0.49586777]
0.514, 0.014

Highest score : 0.514 when C = 3


#### Run the Model to see the Test and Train Score with C = 3

In [26]:
print("C = {}".format(param_best))
ksvm = SVC(C = param_best, kernel = "rbf", gamma = "scale")
ksvm = train_test(X_train, X_test, y_train, y_test, ksvm)

C = 3
Train score: 1.00
Test score: 0.48

Classification report:
              precision    recall  f1-score   support

    negative       0.44      0.16      0.24       150
     neutral       0.49      0.87      0.63       220
    positive       0.09      0.01      0.02        85

    accuracy                           0.48       455
   macro avg       0.34      0.35      0.30       455
weighted avg       0.40      0.48      0.39       455

[[ 24 121   5]
 [ 23 192   5]
 [  7  77   1]]


In [27]:
summary_test["Kernelized SVMs Test"] = round(ksvm.score(X_test, y_test), 3)
summary_train["Kernelized SVMs Train"] = round(ksvm.score(X_train, y_train), 3)

## Modeling with Neural Networks

#### Finding the best hidden_layer_sizes

In [28]:
score_max = 0
param_best = 0
for param in [10, 30, 100]:
    model = MLPClassifier(hidden_layer_sizes=(param, ), max_iter=2000, activation="relu", random_state=0)
    scores = get_score(model,X_train, y_train, 5)
    (score_max,param_best) = print_compare_score('hidden_layer_size',scores,score_max,param,param_best) 
        
print("Highest score : {:.3f} when hidden_layer_sizes = {}".format(score_max, param_best))

hidden_layer_size = 10: [0.47252747 0.46153846 0.46280992 0.44077135 0.4738292 ]
0.462, 0.012

hidden_layer_size = 30: [0.45604396 0.46153846 0.4738292  0.42975207 0.4600551 ]
0.456, 0.015

hidden_layer_size = 100: [0.46978022 0.46153846 0.46556474 0.4214876  0.4600551 ]
0.456, 0.017

Highest score : 0.462 when hidden_layer_sizes = 10


Run the Model to see the Test and Train Score with hidden_layer_sizes = 10

In [29]:
print("hidden_layer_size = {}".format(param_best))
mlp = MLPClassifier(hidden_layer_sizes=(param_best, ), max_iter=2000, random_state=0) # default activation is 'relu'
mlp = train_test(X_train, X_test, y_train, y_test, mlp)

hidden_layer_size = 10
Train score: 1.00
Test score: 0.48

Classification report:
              precision    recall  f1-score   support

    negative       0.48      0.39      0.43       150
     neutral       0.53      0.66      0.58       220
    positive       0.26      0.18      0.21        85

    accuracy                           0.48       455
   macro avg       0.42      0.41      0.41       455
weighted avg       0.46      0.48      0.47       455

[[ 59  76  15]
 [ 48 145  27]
 [ 15  55  15]]


In [30]:
summary_test["Neural Networks Test"] = round(mlp.score(X_test, y_test), 3)
summary_train["Neural Networks Train"] = round(mlp.score(X_train, y_train), 3)

## NN with a different solver (lbfgs), which can perform better on small datasets

In [31]:
mlp_2 = MLPClassifier(hidden_layer_sizes=(param_best, ), max_iter=2000, solver = 'lbfgs', random_state=0) # default activation is 'relu' and default solver is adam
scores = get_score(mlp_2,X_train, y_train, 5)
score_max = scores.mean()
print("Score : {:.3f} when hidden_layer_sizes = {}\n".format(score_max, param_best))
mlp_2 = train_test(X_train, X_test, y_train, y_test, mlp_2)

Score : 0.476 when hidden_layer_sizes = 10

Train score: 1.00
Test score: 0.52

Classification report:
              precision    recall  f1-score   support

    negative       0.51      0.39      0.45       150
     neutral       0.55      0.71      0.62       220
    positive       0.33      0.22      0.27        85

    accuracy                           0.52       455
   macro avg       0.47      0.44      0.45       455
weighted avg       0.50      0.52      0.50       455

[[ 59  72  19]
 [ 44 157  19]
 [ 12  54  19]]


In [32]:
summary_test["Neural Networks Test - Solver lbfgs "] = round(mlp_2.score(X_test, y_test), 3)
summary_train["Neural Networks Train - Solver lbfgs"] = round(mlp_2.score(X_train, y_train), 3)

In [33]:
mlp_3 = MLPClassifier(hidden_layer_sizes=(param_best, ), max_iter=2000, activation = 'tanh', solver = 'sgd', random_state=0) 
# default activation is 'relu' and default solver is 'adam'
scores = get_score(mlp_3,X_train, y_train, 5)
score_max = scores.mean()
print("Score : {:.3f} when hidden_layer_sizes = {}\n".format(score_max, param_best))
mlp_3 = train_test(X_train, X_test, y_train, y_test, mlp_3)

Score : 0.511 when hidden_layer_sizes = 10

Train score: 0.51
Test score: 0.48

Classification report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       150
     neutral       0.48      1.00      0.65       220
    positive       0.00      0.00      0.00        85

    accuracy                           0.48       455
   macro avg       0.16      0.33      0.22       455
weighted avg       0.23      0.48      0.32       455

[[  0 150   0]
 [  0 220   0]
 [  0  85   0]]


In [34]:
summary_test["Neural Networks - tanh activation Test"] = round(mlp_3.score(X_test, y_test), 3)
summary_train["Neural Networks - tanh activation Train"] = round(mlp_3.score(X_train, y_train), 3)

#### Among the Neural Networks (all with hidden_layer_sizes = 10), the tanh activation with the sgd solver gave the highest cross-validation score (0.511), but on the test set it predicts neutral for every sentence

## XG boost classifier (scikit-learn GradientBoostingClassifier)

In [35]:
clf = GradientBoostingClassifier(n_estimators=300, learning_rate=0.1, max_depth=20, subsample = 0.5, random_state=0)
scores = get_score(clf,X_train, y_train, 5)
score_max = scores.mean()
print("Score : {:.3f}\n".format(score_max))
clf = train_test(X_train, X_test, y_train, y_test, clf)

Score : 0.493

Train score: 1.00
Test score: 0.49

Classification report:
              precision    recall  f1-score   support

    negative       0.45      0.28      0.34       150
     neutral       0.54      0.73      0.62       220
    positive       0.30      0.22      0.26        85

    accuracy                           0.49       455
   macro avg       0.43      0.41      0.41       455
weighted avg       0.46      0.49      0.46       455

[[ 42  85  23]
 [ 37 161  22]
 [ 15  51  19]]


In [36]:
summary_test["XG Boost Test"] = round(clf.score(X_test, y_test), 3)
summary_train["XG Boost Train"] = round(clf.score(X_train, y_train), 3)

## Decision Tree Classifier

In [37]:
dt = tree.DecisionTreeClassifier()

In [38]:
scores = get_score(dt,X_train, y_train, 5)
print("{}\n{:.3f}, {:.3f}".format(scores, scores.mean(), scores.std()))

[0.46978022 0.47802198 0.4600551  0.42699725 0.49035813]
0.465, 0.021


Run the Model to see the Test and Train Score

In [39]:
dt = train_test(X_train, X_test, y_train, y_test, dt)

Train score: 1.00
Test score: 0.45

Classification report:
              precision    recall  f1-score   support

    negative       0.42      0.25      0.31       150
     neutral       0.53      0.66      0.59       220
    positive       0.25      0.27      0.26        85

    accuracy                           0.45       455
   macro avg       0.40      0.39      0.39       455
weighted avg       0.44      0.45      0.43       455

[[ 37  83  30]
 [ 37 145  38]
 [ 15  47  23]]


In [40]:
summary_test["Decision Tree Test"] = round(dt.score(X_test, y_test), 3)
summary_train["Decision Tree Train"] = round(dt.score(X_train, y_train), 3)

## Random Forest

### Trying different max_depth 

In [41]:
score_max = 0
param_best = 0
for param in [10,18, 20, 30, 40]:
    model = RandomForestClassifier(max_depth=param)
    scores = get_score(model,X_train, y_train, 5)
    (score_max,param_best) = print_compare_score('max_depth',scores,score_max,param,param_best) 

print("Highest score : {:.3f} when max_depth = {}".format(score_max, param_best))

max_depth = 10: [0.51098901 0.51098901 0.51239669 0.50964187 0.50964187]
0.511, 0.001

max_depth = 18: [0.51098901 0.51098901 0.51239669 0.50964187 0.51515152]
0.512, 0.002

max_depth = 20: [0.51098901 0.51098901 0.51515152 0.50964187 0.51239669]
0.512, 0.002

max_depth = 30: [0.51648352 0.51098901 0.51239669 0.50688705 0.51239669]
0.512, 0.003

max_depth = 40: [0.51648352 0.50549451 0.51515152 0.50413223 0.51515152]
0.511, 0.005

Highest score : 0.512 when max_depth = 18


Run the Model to see the Test and Train Score with max_depth = 18

In [42]:
print("max_depth = {}".format(param_best))
rf = RandomForestClassifier(max_depth=param_best)
rf = train_test(X_train, X_test, y_train, y_test, rf)

max_depth = 18
Train score: 0.53
Test score: 0.48

Classification report:
              precision    recall  f1-score   support

    negative       0.00      0.00      0.00       150
     neutral       0.48      1.00      0.65       220
    positive       0.00      0.00      0.00        85

    accuracy                           0.48       455
   macro avg       0.16      0.33      0.22       455
weighted avg       0.23      0.48      0.32       455

[[  0 150   0]
 [  0 220   0]
 [  0  85   0]]


In [43]:
summary_test["Random Forest Test"] = round(rf.score(X_test, y_test), 3)
summary_train["Random Forest Train"] = round(rf.score(X_train, y_train), 3)

## The Neural Network with the lbfgs solver has the best performance

In [44]:
summary_test

{'k-NNs Test': 0.497,
 'Logistic Regression Test': 0.484,
 'Multinomial Naive Bayes Test': 0.492,
 'Linear SVMs Test': 0.486,
 'Kernelized SVMs Test': 0.477,
 'Neural Networks Test': 0.481,
 'Neural Networks Test - Solver lbfgs ': 0.516,
 'Neural Networks - tanh activation Test': 0.484,
 'XG Boost Test': 0.488,
 'Decision Tree Test': 0.451,
 'Random Forest Test': 0.484}
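To put these scores in context, they can be compared with a majority-class baseline: the test set contains 220 neutral sentences out of 455, so always predicting neutral already scores about 0.484. The sketch below copies the dictionary from the output above and ranks the models by their margin over that baseline; `baseline`, `above_baseline` and `best_name` are illustrative names.

```python
# Test accuracies copied from the summary_test output above.
summary_test = {
    'k-NNs Test': 0.497,
    'Logistic Regression Test': 0.484,
    'Multinomial Naive Bayes Test': 0.492,
    'Linear SVMs Test': 0.486,
    'Kernelized SVMs Test': 0.477,
    'Neural Networks Test': 0.481,
    'Neural Networks Test - Solver lbfgs ': 0.516,
    'Neural Networks - tanh activation Test': 0.484,
    'XG Boost Test': 0.488,
    'Decision Tree Test': 0.451,
    'Random Forest Test': 0.484,
}

# Accuracy of always predicting the majority class (220 neutral of 455),
# rounded like the scores in the dictionary.
baseline = round(220 / 455, 3)

# Models that score strictly above the baseline.
above_baseline = {
    name: score for name, score in summary_test.items() if score > baseline
}
best_name = max(summary_test, key=summary_test.get)

# Print the models from best to worst with their margin over the baseline.
for name, score in sorted(summary_test.items(), key=lambda kv: kv[1], reverse=True):
    print("{:40s} {:.3f}  {:+.3f}".format(name.strip(), score, score - baseline))
```

Read this way, five of the eleven models beat the baseline, and the lbfgs neural network leads it by about 0.03.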

## New sentences

In [45]:
# grab some sentences from the larger file, not the 2000 that we already classified
text1 = "This is amazing, climate change initiatives have created so many jobs!" # Positive
text2 = "I hate the bad idea of hotter temperatures and the horrible fact that ice caps are melting" # Negative
text3 = "Ice caps are melting faster each year" # Neutral
text4 = "Climate change is fake news, this is the coldest winter ever" # Negative
text5 = "Hubspot makes my day a lot easier, this makes me super happy! :)" # Positive
text6 = "Your customer service is a nightmare! Totally useless!!"   # Negative
text7 = "The older interface was much simpler" # Negative
text8 = "Awful experience. I would never buy this product again!" # Negative
text9 = "I don't think there is anything I really dislike about the product"  # Neutral
text10 = "I love how Zapier takes different apps and ties them together, perfect idea!" # Positive
text11 = "I still need to further test Zapier to say if its useful for me or not"  # Neutral
text12 = "Zapier is sooooo confusing to me" # Negative

In [46]:
new_texts = [text1, text2, text3, text4,text5,text6,text7,text8,text9,text10,text11,text12]
X_new = vectorizer.transform(new_texts)

In [47]:
knn.predict(X_new)

array(['neutral', 'neutral', 'neutral', 'neutral', 'neutral', 'neutral',
       'neutral', 'neutral', 'negative', 'neutral', 'neutral', 'neutral'],
      dtype=object)

In [48]:
mlp.predict(X_new)

array(['positive', 'negative', 'neutral', 'neutral', 'neutral',
       'positive', 'neutral', 'negative', 'negative', 'neutral',
       'neutral', 'neutral'], dtype='<U8')

In [49]:
mlp_2.predict(X_new)

array(['positive', 'neutral', 'neutral', 'neutral', 'negative', 'neutral',
       'neutral', 'negative', 'negative', 'neutral', 'neutral', 'neutral'],
      dtype='<U8')
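The predictions above can be checked against the labels noted in the comments next to each example sentence. The sketch below copies those intended labels and the three prediction arrays by hand and counts the matches; it assumes the comments reflect the intended sentiment, and twelve sentences are far too few to rank the models.

```python
# Intended labels, taken from the comments next to text1..text12.
expected = [
    "positive", "negative", "neutral", "negative", "positive", "negative",
    "negative", "negative", "neutral", "positive", "neutral", "negative",
]

# Predictions copied from the three outputs above.
predictions = {
    "knn": ["neutral"] * 8 + ["negative"] + ["neutral"] * 3,
    "mlp": [
        "positive", "negative", "neutral", "neutral", "neutral", "positive",
        "neutral", "negative", "negative", "neutral", "neutral", "neutral",
    ],
    "mlp_2": [
        "positive", "neutral", "neutral", "neutral", "negative", "neutral",
        "neutral", "negative", "negative", "neutral", "neutral", "neutral",
    ],
}

# Number of sentences where the prediction equals the intended label.
matches = {
    name: sum(p == e for p, e in zip(preds, expected))
    for name, preds in predictions.items()
}

for name, hits in matches.items():
    print("{}: {} of {} correct".format(name, hits, len(expected)))
```

All three models fall back to neutral for most of these sentences, which is consistent with the high neutral recall in the classification reports.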

## Conclusion

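Across all the models, the Neural Network with the lbfgs solver gave the highest test score, 0.516. This is plausible: scikit-learn's documentation notes that the 'lbfgs' solver can converge faster and perform better on small datasets, and this dataset has only 2,272 sentences. The next-highest test scores came from k-NN (0.497) and Multinomial Naive Bayes (0.492), followed by XG boost (0.488) and Linear SVMs (0.486). All of these are close to the 0.484 reached by predicting neutral for every sentence, as the Random Forest does.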