## IERG4080 Assignment 1
**Topic: Text Classification and Telegram Bot**<br>

In this assignment, you will build a text classifier of movie reviews, and then deploy the text classifier as a chatbot on Telegram to allow other people to use it. The task you will be working on is a binary classification problem: given a movie review, determine if it is positive or negative.

### Task 1: Text Classification

**Data Preparation**<br>
1. Prepare a full dataset by combining all the training and test data found in the downloaded raw data (DO NOT use the preprocessed data)
2. Randomly split the full dataset into a training set with 70% of the data, and a test set with 30% of the data (hence, you will have 35,000 reviews for training, and 15,000 reviews for testing)
3. Check that the ratio of positive to negative reviews is roughly 1:1 in both the training and test set

In [2]:
import os
import sys

# directories of the original (raw) dataset
neg_dir = ['aclImdb/test/neg/','aclImdb/train/neg/']
pos_dir = ['aclImdb/test/pos/','aclImdb/train/pos/']

# lists that hold the review texts
docs_neg, docs_pos = [], []

# read every file in the given folders and append its text to docs
def read_file(dire, docs):
    for folder in dire:
        for file in os.listdir(folder):
            with open(os.path.join(folder, file), 'r', encoding='UTF-8') as f:
                docs.append(f.read())

# read the data; report the error and stop if anything goes wrong
try:
    read_file(neg_dir, docs_neg)
    read_file(pos_dir, docs_pos)
    print(len(docs_neg), len(docs_pos))  # check that all reviews were loaded
except Exception as e:
    print('fail to open data!', e)
    sys.exit(1)

25000 25000


In [3]:
#import necessary modules
import pandas as pd
from sklearn.model_selection import train_test_split

# label encoding: positive = 0, negative = 1
pos_df = pd.DataFrame({'label':0,'review':docs_pos})
neg_df = pd.DataFrame({'label':1,'review':docs_neg})

df = pd.concat([neg_df,pos_df])

# uncomment the next line to save the labelled dataset if needed
#df.to_csv('labelled_full_dataset.csv',index=False)

X, y = df['review'].tolist(), df['label'].tolist()

# Split into training and test sets, holding out 30% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=100)

print('data counts in each set:',
      len(X_train),len(X_test),len(y_train),len(y_test))
print('='*50)
# label 1 = negative, so sum(y)/len(y) is the fraction of negative reviews;
# 0.5 in both sets corresponds to a 1:1 positive-to-negative ratio
print('ratio of positive to negative reviews in train and test sets:',sum(y_train)/len(y_train),sum(y_test)/len(y_test))

data counts in each set: 35000 15000 35000 15000
ratio of positive to negative reviews in train and test sets: 0.5 0.5


**Using a Naive Bayes Classifier**

1. Build a pipeline using scikit-learn’s CountVectorizer to vectorize the input data and train a naive Bayes classifier using the training data
2. Compute the following metrics on the test set
    1. accuracy
    2. precision and recall of both positive and negative reviews (**reported together in the classification report**)
3. Repeat the above but use the TfidfVectorizer instead
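
To make the difference between the two vectorizers in the steps above concrete, here is a minimal sketch on a toy corpus. The three documents are hypothetical and are not taken from the IMDb data; the point is only that both vectorizers build the same vocabulary, while CountVectorizer stores raw counts and TfidfVectorizer stores re-weighted, L2-normalised values.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical toy corpus, used only for illustration
toy_docs = ["good movie", "bad movie", "good good acting"]

count_vec = CountVectorizer()
counts = count_vec.fit_transform(toy_docs)   # sparse matrix of raw term counts

tfidf_vec = TfidfVectorizer()
tfidf = tfidf_vec.fit_transform(toy_docs)    # same shape, TF-IDF weighted rows

# vocabulary_ maps each token to its column index (columns are in sorted order)
vocab = sorted(count_vec.vocabulary_)
print(vocab)                      # ['acting', 'bad', 'good', 'movie']
print(counts.toarray())           # one row per document, one column per token
print(tfidf.toarray().round(2))   # each row has unit L2 norm by default
```

In the count matrix, "good good acting" gets a 2 in the "good" column; in the TF-IDF matrix the same entry is scaled down because "good" also appears in another document.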

In [4]:
#import necessary modules
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score,classification_report

# define a model evaluation function for repeated use
def model_score(model):
    model.fit(X_train, y_train) # fit model
    y_pred = model.predict(X_test) # make predictions
    print("Accuracy : {:.4f}".format(accuracy_score(y_test, y_pred)))
    print(classification_report(y_test, y_pred))
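
The per-class rows of the classification report are easy to misread in this notebook, because label 0 is the positive class and label 1 is the negative class. The following sketch, on hypothetical toy labels, shows one way to get per-class precision and recall with explicit names; `precision_recall_fscore_support` and the `target_names` parameter of `classification_report` are standard scikit-learn features, while the toy labels and the display names are assumptions made for illustration.

```python
from sklearn.metrics import classification_report, precision_recall_fscore_support

# Hypothetical toy labels using the notebook's encoding: 0 = positive, 1 = negative
y_true = [0, 0, 0, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 0]

# labels=[0, 1] fixes the order of the returned arrays
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_hat, labels=[0, 1]
)
for name, p, r in zip(["positive (0)", "negative (1)"], precision, recall):
    print(f"{name}: precision={p:.2f} recall={r:.2f}")

# The same numbers, with readable row names in the report
print(classification_report(y_true, y_hat, target_names=["positive", "negative"]))
```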

Train and evaluate each model as below

In [5]:
clf1 = Pipeline([('vec', CountVectorizer()), ('nb', MultinomialNB())]) 
model_score(clf1)

Accuracy : 0.8424
              precision    recall  f1-score   support

           0       0.87      0.81      0.84      7500
           1       0.82      0.87      0.85      7500

   micro avg       0.84      0.84      0.84     15000
   macro avg       0.84      0.84      0.84     15000
weighted avg       0.84      0.84      0.84     15000



Repeat the above but use the TfidfVectorizer instead

In [5]:
clf2 = Pipeline([('vec', TfidfVectorizer()), ('nb', MultinomialNB())])
model_score(clf2)

Accuracy : 0.8589
              precision    recall  f1-score   support

           0       0.88      0.83      0.86      7500
           1       0.84      0.89      0.86      7500

   micro avg       0.86      0.86      0.86     15000
   macro avg       0.86      0.86      0.86     15000
weighted avg       0.86      0.86      0.86     15000



**Using a Logistic Regression Classifier**

Repeat the above experiments but use the LogisticRegression class in scikit-learn instead of a naive Bayes model

In [6]:
clf3 = Pipeline([('vec', CountVectorizer()), ('LR', LogisticRegression())])
model_score(clf3)



Accuracy : 0.8845
              precision    recall  f1-score   support

           0       0.88      0.89      0.89      7500
           1       0.89      0.88      0.88      7500

   micro avg       0.88      0.88      0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



Repeat the above but use the TfidfVectorizer instead

In [7]:
clf4 = Pipeline([('vec', TfidfVectorizer()), ('LR', LogisticRegression())])
model_score(clf4)

Accuracy : 0.8948
              precision    recall  f1-score   support

           0       0.89      0.91      0.90      7500
           1       0.90      0.88      0.89      7500

   micro avg       0.89      0.89      0.89     15000
   macro avg       0.90      0.89      0.89     15000
weighted avg       0.90      0.89      0.89     15000



**Adding Bi-grams**

By default, the vectorizers only extract unigrams as features. Repeat all the experiments above, this time adding bigram features.
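
As a small illustration of what `ngram_range=(1,2)` adds, the sketch below fits two vectorizers on a single hypothetical sentence and lists the resulting features. Note that the default tokenizer drops one-character tokens such as "a", so the bigram built across it is "not good".

```python
from sklearn.feature_extraction.text import CountVectorizer

sentence = ["not a good movie"]   # hypothetical example text

unigram_vec = CountVectorizer()                    # default ngram_range=(1, 1)
bigram_vec = CountVectorizer(ngram_range=(1, 2))   # unigrams plus bigrams

unigram_vec.fit(sentence)
bigram_vec.fit(sentence)

unigrams = sorted(unigram_vec.vocabulary_)
uni_and_bi = sorted(bigram_vec.vocabulary_)
print(unigrams)     # ['good', 'movie', 'not']
print(uni_and_bi)   # ['good', 'good movie', 'movie', 'not', 'not good']
```

Bigrams such as "not good" let a linear model give a negated phrase its own weight, at the cost of a much larger vocabulary.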

In [8]:
clf5 = Pipeline([('vec', CountVectorizer(ngram_range=(1,2))), ('nb', MultinomialNB())])
model_score(clf5)

Accuracy : 0.8765
              precision    recall  f1-score   support

           0       0.89      0.86      0.87      7500
           1       0.86      0.90      0.88      7500

   micro avg       0.88      0.88      0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



In [9]:
clf6 = Pipeline([('vec', TfidfVectorizer(ngram_range=(1,2))), ('nb', MultinomialNB())])
model_score(clf6)

Accuracy : 0.8803
              precision    recall  f1-score   support

           0       0.90      0.85      0.88      7500
           1       0.86      0.91      0.88      7500

   micro avg       0.88      0.88      0.88     15000
   macro avg       0.88      0.88      0.88     15000
weighted avg       0.88      0.88      0.88     15000



In [10]:
clf7 = Pipeline([('vec', CountVectorizer(ngram_range=(1,2))), ('LR', LogisticRegression())])
model_score(clf7)



Accuracy : 0.9074
              precision    recall  f1-score   support

           0       0.90      0.91      0.91      7500
           1       0.91      0.90      0.91      7500

   micro avg       0.91      0.91      0.91     15000
   macro avg       0.91      0.91      0.91     15000
weighted avg       0.91      0.91      0.91     15000



In [11]:
clf8 = Pipeline([('vec', TfidfVectorizer(ngram_range=(1,2))), ('LR', LogisticRegression())])
model_score(clf8)

Accuracy : 0.8948
              precision    recall  f1-score   support

           0       0.89      0.90      0.90      7500
           1       0.90      0.89      0.89      7500

   micro avg       0.89      0.89      0.89     15000
   macro avg       0.89      0.89      0.89     15000
weighted avg       0.89      0.89      0.89     15000



**Using fastText** 
1. Instead of using scikit-learn, use Facebook’s fastText library (use the Python API)
2. You can use the default values of the parameters when training a model
3. Train a fastText model using the same training set and compute metrics on the same test set as above.

Please refer to the separate notebook (ipynb file) **classifier-fasttext**.<br>

Prepare the **train.txt, test.txt, y_test.txt** datasets for fastText

In [11]:
# one training example per line: "__label__<label> <review text>"
with open("train.txt", 'w', encoding="utf-8") as output_file:
    for label, review in zip(y_train, X_train):
        output_file.write("__label__" + str(label) + " " + str(review) + "\n")

In [12]:
# test reviews only, one per line, without labels
with open("test.txt", 'w', encoding="utf-8") as output_file:
    for review in X_test:
        output_file.write(str(review) + "\n")

In [13]:
# true test labels in fastText's "__label__<label>" format, one per line
with open("y_test.txt", 'w', encoding="utf-8") as output_file:
    for label in y_test:
        output_file.write("__label__" + str(label) + "\n")
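
The files written above follow fastText's supervised input format, one `__label__<label> <text>` example per line. The sketch below is one possible way to train and score a model with the fastText Python API; it is not the content of the **classifier-fasttext** notebook. The helper names `to_fasttext_line` and `train_and_score` are hypothetical, and `train_and_score` assumes that fastText is installed and that train.txt already exists.

```python
def to_fasttext_line(label, text):
    """Format one example as '__label__<label> <text>' on a single line."""
    # collapse newlines, tabs and runs of spaces so each example stays on one line
    single_line = " ".join(str(text).split())
    return f"__label__{label} {single_line}"


def train_and_score(train_path, texts, labels):
    """Train with default parameters and return accuracy on (texts, labels)."""
    import fasttext  # imported lazily so the helper above works without fastText

    model = fasttext.train_supervised(input=train_path)
    correct = 0
    for text, label in zip(texts, labels):
        # predict() expects a single line of text without newline characters
        predicted, _prob = model.predict(" ".join(str(text).split()))
        correct += predicted[0] == f"__label__{label}"
    return correct / len(texts)


print(to_fasttext_line(1, "Bad plot.\nWorse acting."))
# Usage (assumes train.txt exists and fastText is installed):
# print(train_and_score("train.txt", X_test, y_test))
```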

**Model Persistence**
1. Compare the accuracy scores of all the scikit-learn models you have built 
2. Which model (e.g. which vectorizer + which classification model + with/without bigram) has the highest score?<br>
    Answer: model **clf7 (CountVectorizer with bigrams + LogisticRegression)** has the highest accuracy: 0.9074
3. Save that model as a file named model.pkl
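
The comparison in steps 1 and 2 can also be done programmatically rather than by reading eight cell outputs. The following is a minimal sketch under stated assumptions: `compare_models` is a hypothetical helper, and the toy data only demonstrates the mechanics (it is scored on its own training examples, so its accuracies mean nothing). With the real data it would be called with `X_train, y_train, X_test, y_test`.

```python
from itertools import product

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def compare_models(X_tr, y_tr, X_te, y_te):
    """Fit every vectorizer/classifier/n-gram combination; return {name: accuracy}."""
    scores = {}
    for vec_cls, clf_cls, ngram in product(
        (CountVectorizer, TfidfVectorizer),
        (MultinomialNB, LogisticRegression),
        ((1, 1), (1, 2)),
    ):
        name = f"{vec_cls.__name__} + {clf_cls.__name__} + ngram{ngram}"
        pipe = Pipeline([("vec", vec_cls(ngram_range=ngram)), ("clf", clf_cls())])
        pipe.fit(X_tr, y_tr)
        scores[name] = accuracy_score(y_te, pipe.predict(X_te))
    return scores


# Hypothetical toy data; labels follow the notebook: 0 = positive, 1 = negative
toy_X = ["great film", "loved it", "wonderful acting",
         "terrible film", "hated it", "awful acting"]
toy_y = [0, 0, 0, 1, 1, 1]

scores = compare_models(toy_X, toy_y, toy_X, toy_y)
best = max(scores, key=scores.get)   # name of the highest-scoring combination
for name, acc in scores.items():
    print(f"{acc:.4f}  {name}")
print("best:", best)
```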

In [14]:
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone package
joblib.dump(clf7, "model.pkl")

['model.pkl']
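
To use the saved model elsewhere, for example in the chatbot, it has to be loaded back and called on raw text. The sketch below shows the round trip on a small stand-in pipeline written to a temporary directory, assuming the standalone `joblib` package is available; the toy training texts are hypothetical. With the real file, `joblib.load("model.pkl")` would take the place of the temporary path.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for clf7 (hypothetical data; 0 = positive, 1 = negative)
toy_clf = Pipeline([
    ("vec", CountVectorizer(ngram_range=(1, 2))),
    ("LR", LogisticRegression()),
])
toy_clf.fit(["great film", "loved it", "terrible film", "hated it"], [0, 0, 1, 1])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")
    joblib.dump(toy_clf, path)       # the whole pipeline, vectorizer included
    restored = joblib.load(path)

# The restored pipeline accepts raw strings, just like the original
sample = ["loved the film", "hated the film"]
same = list(restored.predict(sample)) == list(toy_clf.predict(sample))
print(same)   # True
```

Because the pipeline is pickled, the file should only be loaded from a trusted source and with a scikit-learn version compatible with the one that saved it.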

## End of Assignment 1