# **The ninth in-class-exercise (20 points in total, 11/11/2020)**

The purpose of the exercise is to practice different machine learning algorithms for text classification as well as the performance evaluation. In addition, you are requried to conduct *10 fold cross validation (https://scikit-learn.org/stable/modules/cross_validation.html)* in the training. 

The dataset can be download from here: https://github.com/unt-iialab/INFO5731_FALL2020/blob/master/In_class_exercise/exercise09_datacollection.zip. The dataset contains two files train data and test data for sentiment analysis in IMDB review, it has two categories: 1 represents positive and 0 represents negative. You need to split the training data into training and validate data (80% for training and 20% for validation, https://towardsdatascience.com/train-test-split-and-cross-validation-in-python-80b61beca4b6) and perform 10 fold cross validation while training the classifier. The final trained model was final evaluated on the test data. 

Algorithms:

(1) MultinominalNB

(2) SVM 

(3) KNN 

(4) Decision tree

(5) Random Forest

(6) XGBoost

Evaluation measurement:

(1) Accuracy

(2) Recall

(3) Precison 

(4) F-1 score

In [28]:
# PREPARING THE TRAIN, VALIDATE AND TEST DATASETS

# Opening the stsa-train text file as a pandas dataframe.
import pandas as pd
stsa_train = pd.read_fwf(r"C:/Users/Raheyma Arshad/Desktop/stsa-train.txt", header = None)
del stsa_train[2]
stsa_train = stsa_train.rename(columns={0: "Sentiment", 1: "Text"})

# Splitting the stsa_train dataframe into training and validation datasets.
import sklearn
from sklearn.model_selection import train_test_split
x_train, x_validate, y_train, y_validate = sklearn.model_selection.train_test_split(stsa_train['Text'], stsa_train['Sentiment'], train_size=0.8, test_size=0.2)

# Converting the training data x(text) and y(sentiments) values into numpy arrays.
import numpy
x_train = x_train.to_numpy()
y_train = y_train.to_numpy()

# Opening the stsa-test text file as a pandas dataframe for Evaluation Measurement
test = pd.read_fwf(r"C:/Users/Raheyma Arshad/Desktop/stsa-test.txt", header = None)
del test[2]
del test[3]
test = test.rename(columns={0: "Sentiment", 1: "Text"})

# Setting the number of K-folds to 10.
from sklearn.model_selection import KFold
kf = KFold(n_splits=10)

In [31]:
# ALGORITHM: (1) MultinominalNB

from sklearn.feature_extraction.text import TfidfTransformer, CountVectorizer
from sklearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', MultinomialNB())])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# (1) Accuracy
print('\nThe accuracy of MultinomialNB is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of MultinomialNB is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of MultinomialNB is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of MultinomialNB is:', f1_score(test['Sentiment'], pred_test, average='macro'))

The predictions for validation dataset are:
       Actual  Predicted
4579       0          1
1041       0          0
3139       1          0
5654       1          1
4173       1          1

The accuracy of MultinomialNB is: 80.34047226798462
The recall of MultinomialNB is: 0.88998899889989
The precision of MultinomialNB is: 0.8034047226798462
The f1-score of MultinomialNB is: 0.8019699782747107


In [32]:
# ALGORITHM: (2) SVM

from sklearn.svm import LinearSVC
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', LinearSVC())])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])

# (1) Accuracy
print('\nThe accuracy of SVM is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of SVM is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of SVM is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of SVM is:', f1_score(test['Sentiment'], pred_test, average='macro'))

The predictions for validation dataset are:
       Actual  Predicted
4579       0          1
1041       0          0
3139       1          0
5654       1          1
4173       1          1

The accuracy of SVM is: 79.07742998352553
The recall of SVM is: 0.801980198019802
The precision of SVM is: 0.7907742998352554
The f1-score of SVM is: 0.790753855048546


In [33]:
# ALGORITHM: (3) KNN

from sklearn.neighbors import KNeighborsClassifier
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', KNeighborsClassifier())])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])

# (1) Accuracy
print('\nThe accuracy of KNN is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of KNN is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of KNN is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of KNN is:', f1_score(test['Sentiment'], pred_test, average='macro'))

The predictions for validation dataset are:
       Actual  Predicted
4579       0          1
1041       0          0
3139       1          0
5654       1          0
4173       1          1

The accuracy of KNN is: 72.15815485996706
The recall of KNN is: 0.7601760176017601
The precision of KNN is: 0.7215815485996705
The f1-score of KNN is: 0.7211927703457464


In [34]:
# ALGORITHM: (4) Decision Tree

from sklearn import tree
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', tree.DecisionTreeClassifier())])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])

# (1) Accuracy
print('\nThe accuracy of Decision Tree is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of Decision Tree is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of Decision Tree is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of Decision Tree is:', f1_score(test['Sentiment'], pred_test, average='macro'))

The predictions for validation dataset are:
       Actual  Predicted
4579       0          1
1041       0          0
3139       1          0
5654       1          0
4173       1          1

The accuracy of Decision Tree is: 60.57111477210324
The recall of Decision Tree is: 0.6545654565456546
The precision of Decision Tree is: 0.6057111477210324
The f1-score of Decision Tree is: 0.604809108252994


In [35]:
# ALGORITHM: (5) Random Forest

from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', RandomForestClassifier(n_estimators=100))])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])

# (1) Accuracy
print('\nThe accuracy of Random Forest is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of Random Forest is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of Random Forest is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of Random Forest is:', f1_score(test['Sentiment'], pred_test, average='macro'))

The predictions for validation dataset are:
       Actual  Predicted
4579       0          1
1041       0          0
3139       1          0
5654       1          0
4173       1          0

The accuracy of Random Forest is: 72.04832509610104
The recall of Random Forest is: 0.7832783278327833
The precision of Random Forest is: 0.7204832509610104
The f1-score of Random Forest is: 0.7194218732452308


In [36]:
# ALGORITHM: (6) XGBoost

from sklearn.ensemble import GradientBoostingClassifier
pipeline = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', GradientBoostingClassifier(n_estimators=50,verbose=2))])

for train_index, test_index in kf.split(x_train, y_train):
    x_train_k, y_train_k = x_train[train_index], y_train[train_index]
    x_test_k, y_test_k = x_train[test_index], y_train[test_index]
    
    algorithm = pipeline.fit(x_train_k, y_train_k)

pred_validate = algorithm.predict(x_validate)
validation = {'Actual': y_validate, 'Predicted': pred_validate}
validation_df = pd.DataFrame(validation, columns = ['Actual', 'Predicted'])
print('The predictions for validation dataset are:\n', validation_df.head())

# EVALUATION MEASUREMENT:

pred_test = algorithm.predict(test['Text'])

# (1) Accuracy
print('\nThe accuracy of XGBoost is:', (accuracy_score(test['Sentiment'], pred_test)*100))
# (2) Recall
print('The recall of XGBoost is:', recall_score(test['Sentiment'], pred_test))
# (3) Precison 
print('The precision of XGBoost is:', accuracy_score(test['Sentiment'], pred_test))
# (4) F-1 score
print('The f1-score of XGBoost is:', f1_score(test['Sentiment'], pred_test, average='macro'))

      Iter       Train Loss   Remaining Time 
         1           1.3754            8.07s
         2           1.3680            6.29s
         3           1.3615            5.35s
         4           1.3558            4.66s
         5           1.3507            4.12s
         6           1.3461            3.90s
         7           1.3409            3.63s
         8           1.3369            3.49s
         9           1.3330            3.32s
        10           1.3287            3.23s
        11           1.3256            3.08s
        12           1.3219            2.99s
        13           1.3190            2.94s
        14           1.3160            2.79s
        15           1.3123            2.70s
        16           1.3092            2.65s
        17           1.3060            2.55s
        18           1.3030            2.44s
        19           1.3005            2.37s
        20           1.2976            2.26s
        21           1.2948            2.17s
        2

        30           1.2736            1.26s
        31           1.2705            1.20s
        32           1.2688            1.13s
        33           1.2664            1.06s
        34           1.2639            1.00s
        35           1.2617            0.93s
        36           1.2599            0.87s
        37           1.2576            0.81s
        38           1.2554            0.75s
        39           1.2531            0.68s
        40           1.2513            0.62s
        41           1.2494            0.56s
        42           1.2468            0.50s
        43           1.2446            0.43s
        44           1.2425            0.37s
        45           1.2405            0.31s
        46           1.2383            0.25s
        47           1.2356            0.19s
        48           1.2335            0.13s
        49           1.2316            0.06s
        50           1.2294            0.00s
      Iter       Train Loss   Remaining Time 
         

         8           1.3383            2.47s
         9           1.3344            2.38s
        10           1.3307            2.34s
        11           1.3267            2.26s
        12           1.3230            2.19s
        13           1.3201            2.10s
        14           1.3171            2.03s
        15           1.3139            1.97s
        16           1.3107            1.90s
        17           1.3076            1.86s
        18           1.3048            1.82s
        19           1.3019            1.75s
        20           1.2994            1.69s
        21           1.2967            1.63s
        22           1.2933            1.58s
        23           1.2905            1.51s
        24           1.2874            1.45s
        25           1.2846            1.41s
        26           1.2824            1.35s
        27           1.2800            1.30s
        28           1.2776            1.25s
        29           1.2745            1.19s
        30