# Coursework 1: Train a Sentiment Analysis Classifier
In this course work, you are asked to train a sentiment analysis classifier for movie reviews. The sample code below builds a simple classifier that uses tf-idf to vectorize text and a logistic regression model to make predictions.

In [1]:
# load data and take a quick look
import pandas as pd
raw_data = pd.read_csv('coursework1_train.csv')
raw_data.head()

Unnamed: 0.1,Unnamed: 0,text,sentiment
0,0,Enjoy the opening credits. They're the best th...,neg
1,1,"Well, the Sci-Fi channel keeps churning these ...",neg
2,2,It takes guts to make a movie on Gandhi in Ind...,pos
3,3,The Nest is really just another 'nature run am...,neg
4,4,Waco: Rules of Engagement does a very good job...,pos


In [2]:
# check the size of the data and its class distribution
all_text = raw_data['text'].tolist()
all_lables = raw_data['sentiment'].tolist()

print('entry num', len(all_text))
print('num of pos entries', len([l for l in all_lables if l=='pos']))
print('num of neg entries', len([l for l in all_lables if l=='neg']))

entry num 40000
num of pos entries 20000
num of neg entries 20000


In [3]:
# text cleaning and preprocessing:
# This sample code does not perform any text normalization/pre-processing
# Feel free to apply any pre-processing steps you find appropriate

In [4]:
# data split. 
# Feel free to use differnt raios or strategies to split the data.
train_text = all_text[:35000]
train_labels = all_lables[:35000]
test_text = all_text[35000:]
test_labels = all_lables[35000:]

In [5]:
# training: tf-idf + logistic regression
# you should explore different representations and algorithms.
from sklearn.feature_extraction.text import TfidfVectorizer
max_feature_num = 1000
train_vectorizer = TfidfVectorizer(max_features=max_feature_num)
train_vecs = train_vectorizer.fit_transform(train_text)
test_vecs = TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_).fit_transform(test_text)

# train model
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression().fit(train_vecs, train_labels)

# test model
test_pred = clf.predict(test_vecs)
from sklearn.metrics import precision_recall_fscore_support,accuracy_score
acc = accuracy_score(test_labels, test_pred)
pre, rec, f1, _ = precision_recall_fscore_support(test_labels, test_pred, average='macro')
print('acc', acc)
print('precision', pre)
print('rec', rec)
print('f1', f1)

acc 0.8616
precision 0.8616043455692743
rec 0.8615913854621674
f1 0.8615956596398864


## SAVE YOUR TRAINED MODEL
After you have found the best model, save your trained model and other necessary components (e.g. vocabulary, vectorizer) to a file. We will load your model from the saved file and apply your trained model on some held-out test data. **At submission time, you should submit the saved model file and we will NOT re-run your code to train your model; instead, we will directly use your trained model to run test (see notebook *cw1-test.ipynb*)**. 

Below is a sample code for saving the model (and other necessary components) obtained above, using the *pickle* package in Python. *You should adjust the code to save all the necessary components for re-running your model!*

In [6]:
import pickle

# save model and other necessary modules
all_info_want_to_save = {
    'model': clf,
    'vectorizer': TfidfVectorizer(max_features=max_feature_num,vocabulary=train_vectorizer.vocabulary_)
}

with open("sample_trained_model.pickle","wb") as save_path:
    pickle.dump(all_info_want_to_save, save_path)

In *cw1-test.ipynb*, we provide a sample code to illustrate how to re-load your saved model and apply it to some test data. 