### Assignment 3 &ndash; Text Classification

This assignment is about text classification.   We are going to predict the category of a product based on information we have about the product from its product record and its reviews.

You are given training data in the form of product and review information for 1500 products. Each product is labeled as belonging to one of three categories: *Books*, *Movies & TV*, or *Music*.

There are two parts to the assignment:  delivering the classifier, and documenting the research that went into choosing the best model.

The labeled training set consists of one products file and one reviews file, in the same format as the last assignment.  
In the products file, all products are from one of the three categories above, and in the reviews file, every review is associated with one of the products in the products file. You will also notice that all the fields in the products file have been deleted except for the ASIN, the price, and the category.

You will build a classifier that will take as input the names of two files containing labeled test data, and will produce classifications for those records. A big part of the assignment is exploring, evaluating, and documenting the choices you make as you build your classifier.



#### Parts to the Assignment

In the first part of the assignment you will write code to train your model and to preprocess test data. After that, you will answer some questions about the experiments you conducted and the decisions you made in building your "best" model.

-------------------------------------------------

#### The Model and its Evaluation

(Just so you don't get confused, although these function definitions appear before the analysis, this model represents your best model which you developed as part of the analysis that you will provide below.)

You will write two functions that will allow evaluation of your model, *build_model* and *prepare_data*.

The first reads training data, prepares the data (extracts fields from the files, builds the response variable, vectorizes the X, and possibly reduces the feature set), then trains the model on that data (using the fit method).  The model returned should be ready to make predictions -- i.e. you have already fitted the model.

The second performs only the preprocessing steps and produces an X matrix. We didn't see this separation between preparing the data and fitting the model in class, but it is necessary in practice. Typically you would build and optimize a model, then ship it into production. In production, an application receives new input data and calls the model to get a prediction. But before the model can be called, the new data must be preprocessed in exactly the same way as the training data. That's what the *prepare_data* function does.

Hint:  in training the model you will have built a vectorizer, and you need to use that same vectorizer in preparing the data.  It is OK to use a global variable to store that vectorizer that will be set by *build_model* and then read by *prepare_data*.
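
The pattern in the hint can be sketched roughly as follows. This is only an illustration under stated assumptions: `BagOfWords` is a tiny hypothetical stand-in for whatever library vectorizer you actually use, the helper names `fit_vectorizer` and `vectorize_new` are made up, and the texts are in-memory strings rather than the contents of the assignment files.

```python
from collections import Counter

# Module-level slot: written on the training path, read on the test path.
_vectorizer = None


class BagOfWords:
    """Hypothetical stand-in for a library vectorizer (fit learns a vocabulary)."""

    def fit(self, texts):
        vocab = sorted({tok for text in texts for tok in text.lower().split()})
        self.vocabulary_ = {tok: i for i, tok in enumerate(vocab)}
        return self

    def transform(self, texts):
        rows = []
        for text in texts:
            counts = Counter(text.lower().split())
            row = [0] * len(self.vocabulary_)
            for tok, n in counts.items():
                idx = self.vocabulary_.get(tok)
                if idx is not None:  # tokens never seen in training are dropped
                    row[idx] = n
            rows.append(row)
        return rows


def fit_vectorizer(train_texts):
    """Training path (build_model would do this): fit once and remember it."""
    global _vectorizer
    _vectorizer = BagOfWords().fit(train_texts)
    return _vectorizer.transform(train_texts)


def vectorize_new(texts):
    """Test path (prepare_data would do this): reuse the fitted vectorizer."""
    if _vectorizer is None:
        raise RuntimeError("train the model first so the vectorizer is fitted")
    return _vectorizer.transform(texts)


X_train = fit_vectorizer(["great novel", "great album"])
X_new = vectorize_new(["great great film"])
```

Because the test path only transforms and never refits, `X_new` has the same columns as `X_train`, which is what the fitted model expects.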

One last requirement. You will need to do a numeric recoding of the *y* variable, i.e. the product category. So that I can test your model, you must code your response variable as follows:

* Books category is coded as **0**
* Movies & TV category is coded as **1**
* Music category is coded as **2**
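
One possible way to apply this coding is a fixed lookup table. The sketch below assumes the category strings in the products file are spelled exactly as in the list above; the names `CATEGORY_CODES` and `encode_categories` are illustrative, not required.

```python
# Required coding of the response variable (from the list above).
CATEGORY_CODES = {"Books": 0, "Movies & TV": 1, "Music": 2}


def encode_categories(categories):
    """Map category names to integer codes, failing loudly on unknown names."""
    try:
        return [CATEGORY_CODES[c] for c in categories]
    except KeyError as err:
        raise ValueError(f"unknown category: {err.args[0]!r}") from None


y = encode_categories(["Music", "Books", "Movies & TV"])
```

A fixed table is safer here than an automatic label encoder, whose codes depend on the order in which it happens to see the categories.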


To evaluate your model, I will first call your *build_model* function to train it.  I will use the same files provided for you in the repository.   After training the model, I will call prepare_data using a different set of test data, then evaluate the model by calling its predict method on the X matrix returned by prepare_data.
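
The evaluation procedure described above might look roughly like this. It is a sketch, not the actual grading script: the test file names, the `evaluate` helper, and the way the true labels `y_test` are obtained are all assumptions, and the stand-in model and lambdas exist only to make the example runnable.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on an empty test set")
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def evaluate(build_model, prepare_data, train_files, test_files, y_test):
    """Train on train_files, preprocess test_files, then score the predictions."""
    model = build_model(*train_files)    # returned model is already fitted
    X_test = prepare_data(*test_files)   # same preprocessing as in training
    return accuracy(y_test, list(model.predict(X_test)))


class AlwaysBooks:
    """Stand-in model for the demo: predicts class 0 for every row."""

    def predict(self, X):
        return [0 for _ in X]


score = evaluate(
    build_model=lambda products, reviews: AlwaysBooks(),
    prepare_data=lambda products, reviews: [[0], [0], [0], [0]],
    train_files=("products.txt", "reviews.txt"),
    test_files=("test_products.txt", "test_reviews.txt"),  # hypothetical names
    y_test=[0, 1, 0, 2],
)
```

Running the same check yourself on data held out from training is a reasonable way to make sure both functions work together before you submit.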

```python
# Returns a model -- an object that at least implements a predict method.
# The two parameters are names of files containing labeled training data
# The model returned should already be trained on (fitted to) the data in those two files

def build_model(product_file_name="products.txt", review_file_name="reviews.txt"):
    # Your code here
    return aModel

# Returns an X matrix which is prepared using the same preprocessing
# steps used to build and train the model returned by build_model above.
# Notice that although the product category might be in the product file, you cannot use it in
# this function.  In fact the only reason the product file is relevant at all is that you might
# have used the product *price* attribute in training your model, and if so you will have to read it
# from the product file.

def prepare_data(product_file_name, review_file_name):
    # Your code here
    return X
```

-----------------------------------------------------------------------------
### Documenting your Decisions

From this point down there should be no code cells.  Your analysis should have supporting data (tables, graphics), but please render it in markdown, otherwise it will go away when I test your model.

#### Evaluation and Analysis

In answering these questions, please be sure to show your work, for example the output of commands you used to gather data supporting your decisions. For each of the cells below containing a question, please leave the question header and text in the notebook you submit. Put your answer in markdown in the same cell, and add additional markdown cells below the question/answer cell for supporting data (output, tables, graphics).


----------------------------
#### Model Quality

What do you expect the accuracy of your model to be on a new set of product records it has not seen before?

------------------------------------------------
#### Input Fields

What input fields from the product and review records did you include in training your model (prior to any feature selection)?  How did you decide which fields to use and which to omit?

-------------------------------------------
#### Preprocessing

What preprocessing steps did you use?  At minimum you must evaluate stemming, tokenizing, and stop word removal.  How did you decide which steps improved the model and which did not?
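
One way to compare these steps is to make each one switchable and measure accuracy with it on and off. The sketch below is illustrative only: the stop word set is a made-up handful of words, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer.

```python
import re

# Illustrative stop words only; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "this", "to", "were"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]


def crude_stem(token):
    """Strip a few common suffixes; NOT a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, use_stop_words=True, use_stemming=True):
    """Tokenize, then optionally drop stop words and stem what remains."""
    tokens = tokenize(text)
    if use_stop_words:
        tokens = remove_stop_words(tokens)
    if use_stemming:
        tokens = [crude_stem(t) for t in tokens]
    return tokens


sample = "The actors were singing in this movie"
all_steps = preprocess(sample)
tokens_only = preprocess(sample, use_stop_words=False, use_stemming=False)
```

Keeping the switches as function arguments makes it straightforward to train one model per combination and tabulate the resulting accuracies.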

(Your answer here.)

---------------------------------------------
#### Vectorization

What technique did you use to turn the input features into a feature vector?  How did you make that choice?

(Your answer here.)

---------------------------------------------
####  Feature Selection

It is important to examine and understand the model features, both to convince yourself that the important features are plausible and to consider removing the unimportant ones. What features did you use in the model, and how did you make that choice? List some of the most important features. Are you convinced that they are accurate exemplars of the class, or might they be artifacts of the training set? List some of the least important features -- do they suggest ways to cut down the model size without significantly affecting accuracy?

In class we looked at three ways of estimating the impact of a term on a model:

* Frequency-based selection
* Mutual information
* Feature log probabilities, i.e. $\log P(f_i \mid C)$

How different are the three measures, i.e. do they all tend to rank the same variables as significant and insignificant?  Which did you use in your model, and why?
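
To make the comparison concrete, here is a small sketch of all three measures on toy data. It assumes documents are already tokenized, treats mutual information as $I(U; C)$ between term presence $U$ and class $C$, and computes the log probability with add-one smoothing over token counts as in multinomial naive Bayes; the version used in class may differ in these details, and the four documents are made up.

```python
import math
from collections import Counter


def document_frequency(docs, term):
    """Frequency-based score: number of documents containing the term."""
    return sum(1 for tokens in docs if term in tokens)


def mutual_information(docs, labels, term):
    """I(U; C) in bits, where U is term presence and C is the class label."""
    n = len(docs)
    joint = Counter((term in tokens, label) for tokens, label in zip(docs, labels))
    n_u, n_c = Counter(), Counter()
    for (u, c), count in joint.items():
        n_u[u] += count
        n_c[c] += count
    mi = 0.0
    for (u, c), count in joint.items():  # empty cells contribute nothing
        mi += (count / n) * math.log2(count * n / (n_u[u] * n_c[c]))
    return mi


def feature_log_prob(docs, labels, term, target, vocab_size, alpha=1.0):
    """Smoothed log P(term | class) from token counts (assumed NB-style estimate)."""
    term_count = total = 0
    for tokens, label in zip(docs, labels):
        if label == target:
            term_count += tokens.count(term)
            total += len(tokens)
    return math.log((term_count + alpha) / (total + alpha * vocab_size))


# Toy data: two "Books" (0) and two "Music" (2) documents.
docs = [["plot", "chapter"], ["chapter", "author"], ["song", "album"], ["album", "plot"]]
labels = [0, 0, 2, 2]

mi_chapter = mutual_information(docs, labels, "chapter")
mi_plot = mutual_information(docs, labels, "plot")
```

In this toy set "chapter" and "plot" have the same document frequency, yet only "chapter" carries information about the class, which is one way the measures can disagree.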

(Your answer here.)

-------------------------------------------------
#### Algorithm

Which classification algorithm did you use? What alternatives did you explore, and how did you make the choice? What hyperparameter optimization did you perform?

(Your answer here.)

----------------------------------------
#### Understanding Misclassifications

Even though misclassifications are inevitable, it is important to understand *why* your algorithm makes errors, and whether it is making "understandable" errors. Choose several examples of misclassification and informally explain why you believe the classifier made the wrong choice. Is there anything you could do in terms of feature engineering to fix some of the misclassifications?
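
A simple starting point for finding examples is to list every wrong prediction next to its true label and to count which class pairs get confused most often. The sketch below uses made-up ASINs and predictions; the helper names are illustrative.

```python
from collections import Counter


def misclassified(ids, y_true, y_pred):
    """Return (id, true, predicted) for every wrong prediction."""
    return [(i, t, p) for i, t, p in zip(ids, y_true, y_pred) if t != p]


def confusion_counts(y_true, y_pred):
    """Count (true, predicted) pairs; pairs with true != predicted are errors."""
    return Counter(zip(y_true, y_pred))


# Made-up identifiers and labels, using the 0/1/2 category coding.
asins = ["A1", "A2", "A3", "A4"]
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 1]

errors = misclassified(asins, y_true, y_pred)
counts = confusion_counts(y_true, y_pred)
```

With the ASINs of the errors in hand, you can go back to the review text for those products and look for the terms that pushed the classifier toward the wrong class.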

(Your answer here.)