# Feature Engineering Homework 
***
**Name**: $<$Sayali Sonawane$>$ 

**Kaggle Username**: $<$sayalisonawane0$>$
***

This assignment is due on Moodle by **5pm on Friday February 23rd**. Additionally, you must make at least one submission to the **Kaggle** competition before it closes at **4:59pm on Friday February 23rd**. Submit only this Jupyter notebook to Moodle. Do not compress it using tar, rar, zip, etc. Your solutions to analysis questions should be done in Markdown directly below the associated question.  Remember that you are encouraged to discuss the problems with your instructors and classmates, but **you must write all code and solutions on your own**.  For a refresher on the course **Collaboration Policy** click [here](https://github.com/chrisketelsen/CSCI5622-Machine-Learning/blob/master/resources/syllabus.md#collaboration-policy)



## Overview 
***

When people are discussing popular media, there’s a concept of spoilers. That is, critical information about the plot of a TV show, book, or movie that “ruins” the experience for people who haven’t read / seen it yet.

The goal of this assignment is to do text classification on forum posts from the website [tvtropes.org](http://tvtropes.org/), to predict whether a post is a spoiler or not. We'll be using the logistic regression classifier provided by sklearn.

Unlike previous assignments, the code provided with this assignment has all of the functionality required. Your job is to make the functionality better by improving the features the code uses for text classification.

**NOTE**: Because the goal of this assignment is feature engineering, not classification algorithms, you may not change the underlying algorithm or its parameters.

This assignment is structured in a way that approximates how classification works in the real world: Features are typically underspecified (or not specified at all). You, the data digger, have to articulate the features you need. You then compete against others to provide useful predictions.

It may seem straightforward, but do not start this at the last minute. There are often many things that go wrong in testing out features, and you'll want to make sure your features work well once you've found them.


## Kaggle In-Class Competition 
***

In addition to turning in this notebook on Moodle, you'll also need to submit your predictions on Kaggle, an online tournament site for machine learning competitions. The competition page can be found here:  

[https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018](https://www.kaggle.com/c/feature-engineering-csci-5622-spring-2018)

Additionally, a private invite link for the competition has been posted to Piazza. 

The starter code below has a `model_predict` method which produces a two column CSV file that is correctly formatted for Kaggle (predictions.csv). It should have the example Id as the first column and the prediction (`True` or `False`) as the second column. If you change this format your submissions will be scored as zero accuracy on Kaggle. 

**Note**: You may only submit **THREE** predictions to Kaggle per day.  Instead of using the public leaderboard as your sole evaluation process, it is highly recommended that you perform local evaluation using a validation set or cross-validation. 
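
As a rough illustration of the local evaluation recommended above, the following sketch runs cross-validation on a small bag-of-words pipeline. The sentences and labels are made-up stand-ins (an assumption, not the competition data), and the fold count is arbitrary.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy stand-in data (hypothetical); the real input would be the 'sentence'
# and 'spoiler' columns of train.csv.
sentences = [
    "he dies at the end", "she kills the villain", "the hero dies in the finale",
    "it is revealed he is the killer", "the twist reveals her death", "he dies saving them",
    "this show is very funny", "the music is great", "a running gag in the series",
    "the cast is hilarious", "often played for laughs", "the episode is funny",
]
labels = [1] * 6 + [0] * 6

# Keeping the vectorizer inside the pipeline means it is refit on the
# training part of every fold, so no vocabulary leaks from the held-out part.
pipeline = Pipeline([
    ("bow", CountVectorizer()),
    ("logreg", LogisticRegression()),
])

# For a classifier and an integer cv, scikit-learn uses stratified folds,
# so every fold contains both classes.
scores = cross_val_score(pipeline, sentences, labels, cv=3)
print("fold accuracies:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

Comparing the mean across folds before and after adding a feature gives a local signal that does not consume one of the three daily Kaggle submissions.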

In [41]:
import matplotlib.pylab as plt
import nltk
import numpy as np
import pandas as pd
from nltk.tokenize import RegexpTokenizer
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from nltk.corpus import stopwords
from sklearn import preprocessing
from sklearn.pipeline import FeatureUnion
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from scipy.sparse import csr_matrix
import copy
%matplotlib inline 

### [25 points] Problem 1: Feature Engineering 
***

The `FeatEngr` class is where the magic happens.  In its current form it will read in the training data and vectorize it using simple Bag-of-Words.  It then trains a model and makes predictions.  
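
To make the Bag-of-Words idea concrete, here is a minimal pure-Python sketch of counting unigrams and bigrams. The helper `bag_of_ngrams` is a hypothetical name used only for illustration; it mimics the lowercasing and default token pattern of `CountVectorizer` but is not its implementation.

```python
import re
from collections import Counter

# Same shape as CountVectorizer's default token pattern: runs of 2+ word characters.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def bag_of_ngrams(text, ngram_range=(1, 2)):
    """Count word n-grams in `text` after lowercasing (illustrative helper)."""
    tokens = TOKEN_RE.findall(text.lower())
    lo, hi = ngram_range
    counts = Counter()
    for n in range(lo, hi + 1):
        # Slide a window of n tokens across the sentence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

counts = bag_of_ngrams("He dies, and then she dies too.")
print(counts["dies"])      # unigram count
print(counts["she dies"])  # bigram count
```

Each distinct n-gram becomes one column of the feature matrix, and the count is the value in that column for the given sentence.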

25 points of your grade will be generated from your performance on the classification competition on Kaggle. The performance will be evaluated on accuracy on the held-out test set. Half of the test set is used to evaluate accuracy on the public leaderboard.  The other half of the test set is used to evaluate accuracy on the private leaderboard (which you will not be able to see until the close of the competition). 

You should be able to significantly improve on the baseline system (i.e. the predictions made by the starter code we've provided) as reported by the Kaggle system.  Additionally, the top **THREE** students from the **PRIVATE** leaderboard at the end of the contest will receive 5 extra credit points towards their Problem 1 score.


In [35]:
class TropeTransformer(BaseEstimator, TransformerMixin):
    """
    TropeTransformer() counts how many times each example's trope occurs in the data. It is added because if a
    certain trope is discussed many times, there is a higher chance that the 'sentence' is a spoiler.
    """
    def __init__(self):
        pass

    def fit(self, examples, y=None):
        # return self and nothing else
        return self

    def transform(self, examples):
        # X = list(examples["trope"])
        X_count = np.array([examples.count(i) for i in examples])
        return csr_matrix(X_count).transpose()

class PageTransformer(BaseEstimator, TransformerMixin):
    """
    PageTransformer() applies the same idea as TropeTransformer() to the 'page' column.
    """
    def __init__(self):
        pass

    def fit(self, examples, y=None):
        # return self and nothing else
        return self

    def transform(self, examples):
        # X = list(examples["page"])
        X_count = np.array([examples.count(i) for i in examples]).reshape(-1,1)
        return csr_matrix(X_count)

class SenCount(BaseEstimator, TransformerMixin):
    """
    SenCount() measures the length (in characters) of each sentence. A longer 'sentence' is more likely to be
    a spoiler, because people usually need more words to describe a scene or the plot.
    """
    def __init__(self):
        pass

    def fit(self, examples, y=None):
        # return self and nothing else
        return self

    def transform(self, examples):
        # X = list(examples["sentence"])
        X_count = np.array([len(i) for i in examples]).reshape(-1,1)
        return csr_matrix(X_count)
    
class Tagging(BaseEstimator, TransformerMixin):
    """
    Creates a part-of-speech tag for every word.
    Boolean feature. Turns on when a proper noun (NNP/NNPS) or an adverb (RB*) is immediately followed by a verb (VB*).
    This indicates that some action is performed, which might be a spoiler.
    """
    def __init__(self):
        pass

    def fit(self, examples, y=None):
        # return self and nothing else
        return self

    def transform(self, examples):
        # One binary value per example: 1 if a proper noun or an adverb is
        # immediately followed by a verb anywhere in the sentence, else 0.
        X_out = [0] * len(examples)
        for idx, text in enumerate(examples):
            if not text:
                continue  # empty sentence: the feature stays 0
            tagged = nltk.pos_tag(nltk.word_tokenize(text))
            prev = None
            for _, tag in tagged:
                if prev is not None and 'VB' in tag and (prev in ('NNP', 'NNPS') or 'RB' in prev):
                    X_out[idx] = 1
                    break  # one match is enough to turn the feature on
                prev = tag
        return csr_matrix(X_out).transpose()

    
class ItemSelector(BaseEstimator, TransformerMixin):
    """

    For data grouped by feature, select the subset of data at the provided key.

    ItemSelector is not designed to handle data grouped by sample (e.g. a
    list of dicts).  If your data is structured this way, consider a
    transformer along the lines of `sklearn.feature_extraction.DictVectorizer`.

    Parameters
    ----------
    key : hashable, required
        The key corresponding to the desired value in a mappable.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, x, y=None):
        return self

    def transform(self, data_dict):
        lst = list(data_dict[self.key])
        return lst
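
The trope and page transformers above compute, for every row, how many rows share the same value by calling `list.count` once per row, which is quadratic in the number of rows. The sketch below shows one possible linear-time equivalent using `collections.Counter`. The helper name `value_frequencies` and the trope names are hypothetical.

```python
from collections import Counter

def value_frequencies(values):
    """For each item, return how many times its value occurs in `values`."""
    counts = Counter(values)            # one pass to count every distinct value
    return [counts[v] for v in values]  # one dictionary lookup per row

# Hypothetical trope names, for illustration only.
tropes = ["ChekhovsGun", "RedHerring", "ChekhovsGun", "PlotTwist", "ChekhovsGun"]
print(value_frequencies(tropes))

# Same result as the quadratic formulation used in the transformers.
assert value_frequencies(tropes) == [tropes.count(t) for t in tropes]
```

Either way, the counts are computed from whichever batch is passed to `transform`, so test-time counts reflect frequencies in the test file rather than in the training file.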



In [38]:
class FeatEngr:
    def __init__(self):
        
        from sklearn.feature_extraction.text import CountVectorizer
        
        self.vectorizer = CountVectorizer(ngram_range=(1,2))

    def build_train_features(self, examples):
        """
        Method to take in training text features and do further feature engineering.
        Most of the work in this homework will go here, or in similar functions.

        A FeatureUnion (self.fu) unites the bag-of-words features with the custom
        features created in the classes above. It is used to build the train and test features.
        The same union wrapped in a Pipeline with the classifier (self.pipeline) is
        used only for cross-validation, for analysis purposes.

        :param examples: currently just a dataframe 
        """
        
        self.pipeline = Pipeline([
            ('union', FeatureUnion(
                transformer_list=[
                    ('body_bow', Pipeline([
                        ('selector', ItemSelector(key='sentence')),
                        ('cv', self.vectorizer),
                    ])),

                    ('body', Pipeline([
                        ('selector', ItemSelector(key='trope')),
                        ('tro', TropeTransformer()),
                    ])),

                    ('bod', Pipeline([
                        ('selector', ItemSelector(key='page')),
                        ('pg', PageTransformer()),
                    ])),

#                     ('bod1', Pipeline([
#                         ('selector', ItemSelector(key='sentence')),
#                         ('pg1', SenCount()),
#                     ])),
                    ('bod2', Pipeline([
                        ('selector', ItemSelector(key='sentence')),
                        ('pg2', Tagging()),
                    ])),

                ])),
            ('logreg', self.logreg),

        ])

        self.fu = FeatureUnion(
                    transformer_list=[
                        ('sen_vec', Pipeline([
                            ('selector', ItemSelector(key='sentence')),
                            ('cv', self.vectorizer),
                        ])),

                        ('trope_vec', Pipeline([
                            ('selector', ItemSelector(key='trope')),
                            ('tro', TropeTransformer()),
                        ])),

                        ('page_vec', Pipeline([
                            ('selector', ItemSelector(key='page')),
                            ('pg', PageTransformer()),
                        ])),

#                         ('bod1', Pipeline([
#                             ('selector', ItemSelector(key='sentence')),
#                             ('pg1', SenCount()),
#                         ])),
                        
                        ('bod2', Pipeline([
                            ('selector', ItemSelector(key='sentence')),
                            ('pg2', Tagging()),
                        ])),


            ])

        return self.fu.fit_transform(examples)

    def get_test_features(self, examples):
        """
        Method to take in test text features and transform the same way as train features 
        :param examples: currently just a dataframe  
        """
        return self.fu.transform(examples)

    def show_top10(self):
        """
        prints the top 10 features for the positive class and the 
        top 10 features for the negative class. 
        """
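        # Names must cover every column of the feature matrix: the n-gram vocabulary first,
        # then the three single-column features in FeatureUnion order. The bracketed names
        # are placeholders. (get_feature_names() was removed in scikit-learn 1.2 in favor
        # of get_feature_names_out().)
        feature_names = np.asarray(list(self.vectorizer.get_feature_names()) +
                                   ["<trope_count>", "<page_count>", "<pos_tag_flag>"])
        order = np.argsort(self.logreg.coef_[0])
        top10 = order[-10:]
        bottom10 = order[:10]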
        print("Pos: %s" % " ".join(feature_names[top10]))
        print("Neg: %s" % " ".join(feature_names[bottom10]))
                
    def train_model(self, random_state=1234):
        """
        Method to read in training data from file, and 
        train Logistic Regression classifier. 
        
        :param random_state: seed for random number generator 
        """
        
        from sklearn.linear_model import LogisticRegression 
        self.logreg = LogisticRegression(random_state=random_state)
        # load data 
        self.dfTrain = pd.read_csv("../data/spoilers/train.csv")
        self.trueTrain = copy.deepcopy(self.dfTrain)
        # Preprocess the 'sentence' column: strip digits, tokenize on word
        # characters, and drop English stop words (case-sensitive match).
        tokenizer = RegexpTokenizer(r'\w+')
        stop_words = set(stopwords.words('english'))  # built once, not once per token
        list_examples = []

        for ex in self.dfTrain['sentence']:
            # removing digits
            ex = ''.join(ch for ch in ex if not ch.isdigit())
            # tokenizing
            tokens = tokenizer.tokenize(ex)
            # removing stop words
            filtered_words = [token for token in tokens if token not in stop_words]
            # stemming with nltk.stem.SnowballStemmer('english') was tried and then disabled
            list_examples.append(" ".join(filtered_words))

        # assign once, after the loop, instead of rebuilding the column on every iteration
        self.dfTrain['sentence'] = list_examples
        
        # get training features and labels 
        self.X_train = self.build_train_features(self.dfTrain)
        self.y_train = np.array(self.dfTrain["spoiler"], dtype=int)
        
        # train logistic regression model.  !!You MAY NOT CHANGE THIS!! 
        
        self.logreg.fit(self.X_train, self.y_train)
        
    def model_predict(self):
        """
        Method to read in test data from file, make predictions
        using trained model, and dump results to file 
        """
        
        # read in test data 
        dfTest  = pd.read_csv("../data/spoilers/test.csv")
        
        # featurize test data 
        self.X_test = self.get_test_features(dfTest)
        
        # make predictions on test data 
        pred = self.logreg.predict(self.X_test)
        
        # dump predictions to file for submission to Kaggle  
        pd.DataFrame({"spoiler": np.array(pred, dtype=bool)}).to_csv("prediction.csv", index=True, index_label="Id")
        
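    def model_accuracy(self):
        """
        Prints 5-fold cross-validation accuracy for the full pipeline, followed
        by up to 10 misclassified training examples for error analysis.
        """

        # cross validation (random_state is omitted: it has no effect when shuffle=False
        # and newer scikit-learn versions reject the combination)
        kfold = KFold(n_splits=5, shuffle=False)
        yExp = np.array(self.dfTrain['spoiler'])
        results = cross_val_score(self.pipeline, self.dfTrain, yExp, cv=kfold)
        print("\n Cross Validation: " + str(results))

        # misclassified examples
        # NOTE: cross_val_score fits clones, so self.pipeline itself is never fitted here.
        # predict() relies on the vectorizer and logreg already fitted in train_model(),
        # which means these are predictions on the data the model was trained on.
        yCal = np.array(self.pipeline.predict(self.dfTrain))
        count = 10
        print("\n \n" + "Misclassified examples: \n")
        for i in range(len(yExp)):
            if yExp[i] != yCal[i] and count > 0:
                print("Example: " + str(self.trueTrain['sentence'][i]) + " \n" +
                      "Expected: " + str(yExp[i]) + ". But got: " + str(yCal[i]) + ".\n")
                count -= 1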
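
In the class above, `train_model` preprocesses the training sentences, while `model_predict` passes the test sentences to the vectorizer unchanged. If the same preprocessing is wanted on both splits, one option is to factor it into a single function. The sketch below assumes a small hard-coded stop-word set in place of the NLTK list, and `preprocess` is a hypothetical helper name.

```python
import re

# Small illustrative stop-word set (an assumption); the notebook uses
# nltk.corpus.stopwords.words('english') instead.
STOP_WORDS = frozenset({"the", "a", "an", "of", "in", "is", "was", "by", "he", "she"})

def preprocess(sentence, stop_words=STOP_WORDS):
    """Drop digits, tokenize on word characters, and remove stop words."""
    no_digits = "".join(ch for ch in sentence if not ch.isdigit())
    tokens = re.findall(r"\w+", no_digits)
    # The match is case-sensitive, as in the notebook's loop.
    return " ".join(tok for tok in tokens if tok not in stop_words)

train_sentences = ["In the end, she was killed by Son Hyuk.", "That's 2 for 5."]
test_sentences = ["He was 1 of the 3 winners."]

# The same function is applied to both splits so the vectorizer sees
# consistently processed text.
print([preprocess(s) for s in train_sentences])
print([preprocess(s) for s in test_sentences])
```

Whether this helps accuracy would need to be checked with cross-validation; the point is only that train and test text go through identical steps.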

In [39]:
# Instantiate the FeatEngr class 
feat = FeatEngr()

# Train your Logistic Regression classifier 
feat.train_model(random_state=1230)

# Shows the top 10 features for each class 
feat.show_top10()

# Analyze the training features
feat.model_accuracy()

# Make prediction on test data and produce Kaggle submission file 
feat.model_predict()

Pos: shoots cece regina ending ultimately turns destiny revealed kills dies
Neg: often cory meant tim buffy hilarious one episode fi johnny drew

 Cross Validation: [ 0.67000835  0.69632414  0.660401    0.63993317  0.66959064]

 
Misclassified examples: 

Example: Thankfully, it doesn't take. 
Expected: True. But got: 0.

Example: One of Vivien's twins is stillborn. 
Expected: False. But got: 1.

Example: That's 2 for 5. 
Expected: True. But got: 0.

Example: Including  his hand chair. 
Expected: False. But got: 1.

Example: In the end, she was killed by Son Hyuk after he left Athena. 
Expected: False. But got: 1.

Example: You will give up. 
Expected: True. But got: 0.

Example: Oh, and  aardvark . 
Expected: True. But got: 0.

Example: Maybe she's  bisexual? 
Expected: True. But got: 0.

Example: " They really aren't. 
Expected: False. But got: 1.

Example: You know the one(s). 
Expected: True. But got: 0.



### [25 points] Problem 2: Motivation and Analysis 
***

The job of the written portion of the homework is to convince the grader that:

- Your new features work
- You understand what the new features are doing
- You had a clear methodology for incorporating the new features

Make sure that you have examples and quantitative evidence that your features are working well. Be sure to explain how you used the data (e.g., did you have a validation set? did you do cross-validation?) and how you inspected the results. In addition, it is very important that you show some kind of an **error analysis** throughout your process.  That is, you should demonstrate that you've looked at misclassified examples and put thought into how you can craft new features to improve your model. 
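
One possible way to organize the error analysis described above is to separate misclassified examples into false positives and false negatives, since the two groups often suggest different features. The function name, sentences, labels, and predictions below are hypothetical and for illustration only.

```python
def split_errors(sentences, y_true, y_pred, limit=5):
    """Group misclassified sentences into false positives and false negatives."""
    false_pos, false_neg = [], []
    for text, expected, predicted in zip(sentences, y_true, y_pred):
        if bool(expected) == bool(predicted):
            continue  # correctly classified
        bucket = false_pos if predicted else false_neg
        if len(bucket) < limit:
            bucket.append(text)
    return false_pos, false_neg

# Hypothetical labels and predictions, for illustration only.
sentences = ["He dies at the end.", "The show is funny.", "You will give up.", "They really aren't."]
y_true = [True, False, True, False]
y_pred = [1, 0, 0, 1]

fp, fn = split_errors(sentences, y_true, y_pred)
print("False positives:", fp)
print("False negatives:", fn)
```

Reading each group side by side makes it easier to see, for example, whether short generic sentences dominate the false negatives.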

A sure way of getting a low grade is simply listing what you tried and reporting the Kaggle score for each. You are expected to pay more attention to what is going on with the data and take a data-driven approach to feature engineering.
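
For quantitative evidence, it can help to summarize each experiment's cross-validation folds by their mean and spread rather than by a single best fold. The sketch below uses the fold accuracies reported in the analysis cell that follows; the experiment labels are shorthand chosen here.

```python
from statistics import mean, stdev

# Fold accuracies copied from the experiments reported in the analysis cell.
cv_scores = {
    "CountVectorizer only": [0.62489557, 0.61278195, 0.61027569, 0.59482038, 0.63617377],
    "+ trope and page counts": [0.6679198, 0.67418546, 0.65747703, 0.63032581, 0.66165414],
    "+ sentence length": [0.66875522, 0.67418546, 0.65789474, 0.62823726, 0.66207185],
    "+ POS tagging (no length)": [0.67251462, 0.69423559, 0.660401, 0.63909774, 0.66833751],
}

for name, scores in cv_scores.items():
    print("%-27s mean=%.4f  std=%.4f" % (name, mean(scores), stdev(scores)))
```

A change in the mean that is small relative to the spread across folds is weak evidence that a feature helps.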

In [None]:
"""
ANALYSIS

Evaluation uses 5-fold cross-validation (KFold with n_splits=5) on the training data. 

############ Preprocessing #############
Preprocessing was applied to the 'sentence' data before passing it to the classifier. 
Digits are removed. 
Stop words are removed. 

### CountVectorizer ###
Initially I used only CountVectorizer on the preprocessed 'sentence' data. The result was as follows. 

Pos: discov freya devil harvey morgana regina sebastian olivia moriarti destini
Neg: cori johnni tim drew meant abc cur terror often bruce
Cross Validation: [ 0.62489557  0.61278195  0.61027569  0.59482038  0.63617377]

Misclassified examples:
Example: Beauregard used a reversed version of Dr. 
Expected: True. But got: 0.
Example: Cheryl, Andy, and Dana make Jim think his dead friend came back to life, 
because Jim stole back his playing card that said friend stole the day before he died. 
Expected: False. But got: 1.

In the 1st example, none of the words carries much weight in the model because the user described the scene with
common words. That is why it was classified as False instead of True.

In the 2nd example, the word 'died' is used, which is a very important feature. That is why it was misclassified as True.
The actual label is False, maybe because the story the user told was false.

### Trope and Page ###
The 'trope' and 'page' columns are used for further analysis. My intuition was that if the same trope or page is
discussed many times, then posts about it are more likely to be spoilers.

Pos: regina gaz declar cece harvey discov sebastian moriarti devil destini
Neg: cori fi johnni abc meant tim drew paint terror remark

Cross Validation: [ 0.6679198   0.67418546  0.65747703  0.63032581  0.66165414]

Misclassified examples:
Example: Too bad for them a woman ends up the winner. 
Expected: False. But got: 1.
Example: She easily could've avoided it though... 
Expected: True. But got: 0.

The misclassifications have the same causes as above, which is expected because these features add no information 
about the 'sentence' text itself. Accuracy nevertheless increased (mean CV accuracy from about 61.6% to about 65.8%).

### 'sentence' count ###
Next I thought that longer sentences are more likely to be spoilers. So I added a feature which counts the length
of each sentence. 

Pos: regina gaz declar harvey cece discov sebastian moriarti devil destini
Neg: cori fi johnni meant abc tim drew paint terror remark

Cross Validation: [ 0.66875522  0.67418546  0.65789474  0.62823726  0.66207185]

This did not meaningfully change the top-10 features or the accuracy (a bad feature), so it was removed afterwards.

### Tagging ###
Next, I tried the nltk.pos_tag function to create a part-of-speech tag for every word in 'sentence'. This is a boolean 
feature: it turns on when a proper noun is immediately followed by a verb, or when an adverb is immediately followed 
by a verb. The idea behind it was that "moriarty jumped" has a noun and a verb in it. When such bigrams occur, 
the chances of a spoiler are high. 

Pos: blows ending sebastian moriarty morgana cece devil regina dies destiny
Neg: cory fi tim closed johnny disney meant often drew buffy

Cross Validation: [ 0.67251462  0.69423559  0.660401    0.63909774  0.66833751]
 
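Compared with the trope and page run, CV accuracy increased in every fold, by about 0.9 percentage points on average 
and by 2 points in the best fold. The top features have also changed. Misclassified examples: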
Example: She easily could've avoided it though... 
Expected: True. But got: 0.
Example: Too bad for them a woman ends up the winner. 
Expected: False. But got: 1.

The 1st example is a good example of implied meaning. What the user meant was that she did not avoid it, 
but this is only implied, not stated. The classifier couldn't identify that. 
In the 2nd example, the statement seems like a spoiler. But it might not be one, because the story might be wrong
or it was already known to everyone, so it is no longer a spoiler. 


################################################
Overall, mean CV accuracy grew from about 61.6% (baseline) to about 66.7%; the best single fold reached 69.4%. 
"""

### Hints 
***

- Don't use all the data until you're ready. 

- Examine the features that are being used.

- Do error analyses.

- If you have questions that aren’t answered in this list, feel free to ask them on Piazza.

### FAQs 
***

> Can I heavily modify the FeatEngr class? 

Totally.  This was just a starting point.  The only thing you cannot modify is the LogisticRegression classifier.  

> Can I look at TV Tropes?

Yes, in order to gain insight about the data. However, your feature extraction cannot use any additional data (beyond what I've given you) from the TV Tropes webpage.

> Can I use IMDB, Wikipedia, or a dictionary?

Yes, but you are not required to. So long as your features are fully automated, they can use any dataset other than TV Tropes. Be careful, however, that your dataset does not somehow include TV Tropes (e.g. using all webpages indexed by Google will likely include TV Tropes).

> Can I combine features?

Yes, and you probably should. This will likely be quite effective.

> Can I use Mechanical Turk?

That is not fully automatic, so no. You should be able to run your feature extraction without any human intervention. If you want to collect data from Mechanical Turk to train a classifier that you can then use to generate your features, that is fine. (But that’s way too much work for this assignment.)

> Can I use a Neural Network to automatically generate derived features? 

No. This assignment is about your ability to extract meaningful features from the data using your own experimentation and experience.

> What sort of improvement is “good” or “enough”?

If you have 10-15% improvement over the baseline (on the Public Leaderboard) with your features, that’s more than sufficient. If you fail to get that improvement but have tried reasonable features, that satisfies the requirements of the assignment. However, the extra credit for “winning” the class competition depends on the performance of other students.

> Where do I start?  

It might be a good idea to look at the in-class notebook associated with the Feature Engineering lecture where we did similar experiments. 


> Can I use late days on this assignment? 

You can use late days for the write-up submission, but the Kaggle competition closes at **4:59pm on Friday February 23rd**

> Why does it say that the competition ends at 11:59pm when the assignment says 4:59pm? 

The end time/date are in UTC.  11:59pm UTC is equivalent to 4:59pm MST.  Kaggle In-Class does not allow us to change this. 