<h1><center>HW1 Movie Review Sentiment Analysis</center></h1>
<h2><center>Due: March 6th, 2020, 23:59</center></h2>

In this homework you will build several models that classify movie reviews as having
positive or negative sentiment.

This homework must be done individually, without collaborating with others.

You are given the following files:
- `hw01.ipynb`: Notebook file with starter code
- `train.txt`: Training set to train your model
- `test.txt`: Test set to report your model’s performance
- `sample_prediction.csv`: Sample showing the format your prediction file should follow
- `utils/`: folder containing all utility code for the series of homeworks

Remember to leverage the code in `utils/` so you don't have to build everything from
scratch.
- `load_data(filename)`: loads the input data and returns a `sklearn.utils.Bunch` object. For basic
usage of the Bunch object, please refer to the sklearn documentation.
- `save_prediction(arr, filename)`: saves your predictions in the format required by the
course
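
As a quick orientation, the sketch below builds a small `Bunch` by hand to show how its fields are read. The `text` and `target` field names match how the starter code uses the object returned by `load_data`; the two reviews and their labels are made up for illustration and are not taken from `train.txt`.

```python
from sklearn.utils import Bunch

# Hand-built stand-in for what load_data returns (contents are illustrative only).
data = Bunch(
    text=["a great movie", "an awful movie"],
    target=[1, 0],
)

# Fields can be read as attributes or as dictionary keys.
X, y = data.text, data.target
first_review = data["text"][0]
print(len(X), y, first_review)
```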

<h3> 0. Install Anaconda </h3>

If you do not yet have Python and Jupyter Notebook on your laptop, use this link
(https://www.anaconda.com/) to install Anaconda. Anaconda is a distribution that provides a
one-stop solution for everything you need in a Python development environment. The site contains
installation instructions for Windows, Mac, and Linux; choose the one that suits your operating
system.

*Tip: You may want to install Jupyter Extensions (link: https://github.com/ipython-contrib/jupyter_contrib_nbextensions) and turn on extensions such as `ExecuteTime` and
`Table of Contents (2)`. You may find them helpful while working on the homework. However, they
are not required.*

<h3> 1. Feature Dictionary Vectorization </h3> 

A step quite specific to NLP is engineering the raw text input into numerical features. You will
eventually implement several featurizers, just like `dumb_featurize`, that distinguish your
models from others. However, we are not there yet. For convenience, we allow you to
represent the features using a dictionary (look at the `dumb_featurize` code), so each data
point is translated into a dictionary. Later, however, this list of dictionaries has to be
translated into a homogeneous data structure. Therefore, you first need to implement the
`pipeline` method in the `SentimentClassifier` class. Look at the code comments for more
details. To check that your implementation works, we also provide some test code in the cell
below the class definition.
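
To make the dictionary-to-vector idea concrete, here is a minimal sketch in plain Python. It is not the expected solution: it returns dense lists rather than a sparse matrix, it ignores `min_feature_ct`, and the helper names `build_vocab` and `to_vectors` are hypothetical. It only illustrates that the vocabulary is fixed from the training data and that features not in the vocabulary are skipped afterwards.

```python
def build_vocab(dict_rows):
    """Assign a column index to every feature name, in order of first appearance."""
    vocab = {}
    for row in dict_rows:
        for name in row:
            if name not in vocab:
                vocab[name] = len(vocab)
    return vocab


def to_vectors(dict_rows, vocab):
    """Turn feature dictionaries into equal-length lists; unknown names are skipped."""
    vectors = []
    for row in dict_rows:
        vec = [0] * len(vocab)
        for name, value in row.items():
            idx = vocab.get(name)
            if idx is not None:
                vec[idx] = value
        vectors.append(vec)
    return vectors


train = [{"fea1": 1, "fea2": 2}, {"fea2": 2, "fea3": 3}]
vocab = build_vocab(train)                 # built once, from training data only
print(to_vectors(train, vocab))            # [[1, 2, 0], [0, 2, 3]]
print(to_vectors([{"fea4": 5}], vocab))    # [[0, 0, 0]]
```

With a real vocabulary of thousands of features, most entries in each row are zero, which is why the starter code imports a sparse matrix type.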

<h3> 2. Better featurizers </h3>

Once you have finished the first step, you can run the model using the provided featurizer. See
that the performance is nearly as good as a donkey's? No surprise! `dumb_featurize` should
have been named `really_dumb_featurize`. Now it is your turn to implement better
featurizers. For this homework, you need to implement at least 3 distinct featurizers.
Describe your features briefly in the write-up and include the accuracy of each model. No idea
at all? Look at your lecture notes for inspiration. Still no clue? Why not start with bag of
words?
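
As one possible starting point beyond single words, the sketch below counts both words and adjacent word pairs, so that a phrase like "not good" becomes its own feature. The function name `unigram_bigram_featurize` is hypothetical, and whitespace tokenization is an assumption; treat it as an illustration of the dictionary format a featurizer returns, not as a recommended feature set.

```python
from collections import Counter


def unigram_bigram_featurize(text):
    """Count lowercase words and adjacent word pairs (hypothetical example featurizer)."""
    tokens = text.lower().split()
    feats = Counter(tokens)
    # Pair each token with its successor; the space in the key cannot
    # collide with a single token, since tokens never contain whitespace.
    for left, right in zip(tokens, tokens[1:]):
        feats[left + " " + right] += 1
    return dict(feats)


example_feats = unigram_bigram_featurize("Not good , not good at all")
print(example_feats)
```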

*Note: Model performance is important, but it's not the only thing we care about. Your work will
also be rewarded for its creativity.*

<h3> 3. Optional: Try different learning methods </h3> 

All the work you have done so far is related to feature engineering, and the featurized data
is used to train a Logistic Regression model. Try different learning methods to train the model
and see whether the performance changes. Discuss your findings in the
write-up.
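
A rough sketch of such a comparison is shown below, assuming scikit-learn's `MultinomialNB` and `LinearSVC` as the alternatives (any other classifier with `fit` and `score` would do). The tiny count matrix is made up, and the scores are training accuracy on that toy data only. In the starter class, the learner is created inside `fit`, so that is where a different model would be swapped in. Note that `MultinomialNB` requires non-negative feature values.

```python
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy count features; the columns could stand for "love", "hate", "movie".
X = csr_matrix([[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]])
y = [1, 1, 0, 0]

candidates = {
    "logistic_regression": LogisticRegression(C=1.0),
    "naive_bayes": MultinomialNB(),
    "linear_svm": LinearSVC(C=1.0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X, y)
    scores[name] = model.score(X, y)  # training accuracy on the toy data only
    print(name, scores[name])
```

For a meaningful comparison, score each model on the held-out dev split rather than on the data it was trained on.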

<h3> 4. Deliverables (zip them all) </h3>

- A PDF version of your final notebook
- `prediction.csv`: the predictions for `test.txt` generated by the best model you trained
(Be careful: the model that does best on your training set might not be the best model for
the test set).
- `HW1_writeup.pdf`: summarize the methods you used and report their performance.
If you worked on the optional task, add the discussion. Add a short essay
discussing the biggest challenges you encountered during this assignment and
what you have learned.

(**You are encouraged to add the write-up to your notebook
using markdown/HTML, just like how these notes are prepared**)


# =============== Coding Starts Here ===================

In [None]:
%load_ext autoreload
%autoreload 2
%matplotlib inline
import os
import sys
import pandas as pd
import numpy as np

# add utils folder to path
p = os.path.dirname(os.getcwd())
if p not in sys.path:
    sys.path = [p] + sys.path
    
from utils.hw1 import load_data, save_prediction

## Featurizer

In [None]:
"""
!! Do not modify !!
"""
def dumb_featurize(text):
    feats = {}
    words = text.split(" ")

    for word in words:
        if word == "love" or word == "like" or word == "best":
            feats["contains_positive_word"] = 1
        if word == "hate" or word == "dislike" or word == "worst" or word == "awful":
            feats["contains_negative_word"] = 1

    return feats

In [None]:
def better_featurize(text):
    """
    !! Do not work on this yet, work on the model and come back later !!
    
    Write your own code below
    """
    raise NotImplementedError

## Model Class

In [None]:
from collections import Counter
from scipy.sparse import dok_matrix
from sklearn.linear_model import LogisticRegression

class SentimentClassifier:
    def __init__(self, feature_method=dumb_featurize, min_feature_ct=1, L2_reg=1.0):
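        """
        :param feature_method: featurize function
        :param min_feature_ct: int, ignore the features that appear fewer than this many times to avoid overfitting
        :param L2_reg: float, passed to LogisticRegression as C (inverse of regularization strength; smaller values regularize more)
        """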
        self.feature_vocab = {}
        self.feature_method = feature_method
        self.min_feature_ct = min_feature_ct
        self.L2_reg = L2_reg

    def featurize(self, X):
        """
        Featurize input text

        :param X: list of texts
        :return: list of feature dictionaries, one per text
        """
        featurized_data = []
        for text in X:
            feats = self.feature_method(text)
            featurized_data.append(feats)
        return featurized_data

    def pipeline(self, X, training=False):
        """
        Data processing pipeline to translate raw data input into sparse vectors
        :param X: featurized input
        :return X2: 2d sparse vectors
        
        Implement the pipeline method that translates the dictionary-like feature vectors into
        homogeneous numerical vectors, for example:
        [{"fea1": 1, "fea2": 2}, 
         {"fea2": 2, "fea3": 3}] 
         --> 
         [[1, 2, 0], 
          [0, 2, 3]]
          
        Hints:
        1. How can you know the length of the feature vector?
        2. When should you use sparse matrix?
        3. Have you handled unseen features properly?
        4. Should you treat training and testing data differently?
        """
        # Have to build feature_vocab during training
        if training:
            raise NotImplementedError
         
        # Translate raw texts into vectors
        raise NotImplementedError

        return X2

    def fit(self, X, y):
        X = self.pipeline(self.featurize(X), training=True)

        D, F = X.shape
        self.model = LogisticRegression(C=self.L2_reg)
        self.model.fit(X, y)

        return self

    def predict(self, X):
        X = self.pipeline(self.featurize(X))
        return self.model.predict(X)

    def score(self, X, y):
        X = self.pipeline(self.featurize(X))
        return self.model.score(X, y)

    # Write learned parameters to file
    def save_weights(self, filename='weights.csv'):
        weights = [["__intercept__", self.model.intercept_[0]]]
        for feat, idx in self.feature_vocab.items():
            weights.append([feat, self.model.coef_[0][idx]])
        
        weights = pd.DataFrame(weights)
        weights.to_csv(filename, header=False, index=False)
        
        return weights

In [None]:
"""
Run this to test your model implementation
"""

cls = SentimentClassifier()
X_train = [{"fea1": 1, "fea2": 2}, {"fea2": 2, "fea3": 3}]

X = cls.pipeline(X_train, True)
assert X.shape[0] == 2 and X.shape[1] >= 3, "Fail to vectorize training features"

X_test = [{"fea1": 1, "fea2": 2}, {"fea2": 2, "fea3": 3}]
X = cls.pipeline(X_test)
assert X.shape[0] == 2 and X.shape[1] >= 3, "Fail to vectorize testing features"

X_test = [{"fea1": 1, "fea2": 2}, {"fea2": 2, "fea4": 3}]
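try:
    X = cls.pipeline(X_test)
    assert X.shape[0] == 2 and X.shape[1] >= 3
except Exception as err:
    # Keep the original error attached so the real cause is visible in the traceback
    raise Exception("Fail to treat un-seen features") from err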
    
print("Success!!")

## Run your model

In [None]:
"""
Run this cell to test your model
"""
from sklearn.model_selection import train_test_split

data = load_data("train.txt")
X, y = data.text, data.target
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3)
cls = SentimentClassifier(feature_method=dumb_featurize)
cls = cls.fit(X_train, y_train)
print("Training set accuracy: ", cls.score(X_train, y_train))
print("Dev set accuracy: ", cls.score(X_dev, y_dev))

In [None]:
"""
Run this cell to save weights and the prediction
"""
weights = cls.save_weights()

X_test = load_data("test.txt").text
save_prediction(cls.predict(X_test), "prediction.csv")

## Example of a better featurizer

In [None]:
def bag_of_words(text):
    word_bag = Counter(text.lower().split(" "))

    # do stuff here

    return word_bag

from sklearn.model_selection import train_test_split

data = load_data("train.txt")
X, y = data.text, data.target
X_train, X_dev, y_train, y_dev = train_test_split(X, y, test_size=0.3)
cls = SentimentClassifier(feature_method=bag_of_words, min_feature_ct=10)
cls = cls.fit(X_train, y_train)
print("Training set accuracy: ", cls.score(X_train, y_train))
print("Dev set accuracy: ", cls.score(X_dev, y_dev))
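
The `min_feature_ct=10` argument above is described in the class docstring as dropping features that appear fewer than that many times. One possible reading, counting how many training examples each feature occurs in, is sketched below. Whether to count examples or total occurrences is an assumption left to your implementation, and `frequent_features` is a hypothetical helper, not part of the starter code.

```python
from collections import Counter


def frequent_features(dict_rows, min_count):
    """Return feature names present in at least min_count rows (hypothetical helper)."""
    doc_freq = Counter()
    for row in dict_rows:
        doc_freq.update(row.keys())  # each row contributes at most 1 per feature
    return {name for name, count in doc_freq.items() if count >= min_count}


rows = [{"good": 2, "movie": 1}, {"bad": 1, "movie": 1}, {"good": 1}]
kept = frequent_features(rows, min_count=2)
print(sorted(kept))  # ['good', 'movie']
```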