# Project for Data Science for Business course: Amazon product reviews

## Introduction

Product reviews on Amazon are used to support customer purchasing decisions. Reviews and five-star ratings represent a reviewer's opinion about a product. Such ratings are subjective: 5 stars represent the best experience with a product and 1 star represents the worst.
  

In this project you will be provided with datasets of product reviews from the Amazon online shop*. The data is provided in JSON format. Each item looks like the following:

    {"category": "book", "reviewerID": "A1F6404F1VG29J", "asin": "B000F83SZQ", "reviewerName": "Avidreader", "helpful": [0, 0], "reviewText": "I enjoy vintage books and movies so I enjoyed reading this book.  The plot was unusual.  Don't think killing someone in self-defense but leaving the scene and the body without notifying the police or hitting someone in the jaw to knock them out would wash today.Still it was a good read for me.", "overall": 5.0, "summary": "Nice vintage story", "unixReviewTime": 1399248000, "reviewTime": "05 5, 2014"}

Field Definitions:
* category: category of the product
* reviewerID: id of the reviewer
* asin: Amazon Standard Identification Number, which is unique for each product
* reviewerName: name of the reviewer
* helpful: [x, y], where x is the number of helpful ticks and y is the total number of ticks
* reviewText: text of the review
* overall: overall rating of the product, one of the values 1, 2, 3, 4, 5
* summary: a summary of reviewer's opinion about the product
* unixReviewTime: time of writing the review in [UNIX format](https://en.wikipedia.org/wiki/Unix_time)
* reviewTime: time of writing the review in human-readable format (e.g. "05 5, 2014")
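
As a minimal sketch of how these fields can be read, the snippet below parses an abbreviated copy of the sample record and derives two values from it. The helper names `helpful_ratio` and `review_date` are hypothetical and only serve as illustration.

```python
import json
from datetime import datetime, timezone

# Abbreviated copy of the sample record shown above (reviewText shortened).
raw = (
    '{"category": "book", "reviewerID": "A1F6404F1VG29J", '
    '"asin": "B000F83SZQ", "helpful": [0, 0], '
    '"reviewText": "I enjoy vintage books and movies so I enjoyed reading this book.", '
    '"overall": 5.0, "summary": "Nice vintage story", '
    '"unixReviewTime": 1399248000}'
)
review = json.loads(raw)


def helpful_ratio(review):
    """Share of helpful ticks, or None when the review received no ticks."""
    helpful, total = review["helpful"]
    return helpful / total if total else None


def review_date(review):
    """Convert unixReviewTime (seconds since 1970-01-01 UTC) to a date."""
    timestamp = review["unixReviewTime"]
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).date()


print(helpful_ratio(review))  # None, because helpful is [0, 0]
print(review_date(review))    # 2014-05-05
```

Returning `None` for reviews without ticks avoids a division by zero and keeps "no votes" distinct from "zero helpful votes".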

## Step 1
Predict the category of each product, based on the **`reviewText`** field. Treat the problem as multiclass classification (24 product categories).

    f(reviewText) -> category
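
Before any model can learn `f`, the text has to be turned into features. The following is only one possible starting point, a plain bag-of-words count. The regular expression and the function names `tokenize` and `bag_of_words` are assumptions made for illustration, not a prescribed method.

```python
import re
from collections import Counter

# Assumption: lowercase words made of letters and apostrophes are the tokens.
TOKEN_RE = re.compile(r"[a-z']+")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return TOKEN_RE.findall(text.lower())


def bag_of_words(text):
    """Count how often each token occurs in one review."""
    return Counter(tokenize(text))


features = bag_of_words("Good book, good plot.")
print(features)
```

Counts like these can later be weighted or filtered during feature selection.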

## Step 2
  
Predict whether a product review has a 5-star rating or not (yes/no), based on the `reviewText` field. The dataset for this step is filtered down to the `Digital_Music` category. Frame the problem as binary classification (rating of 5: 1, all other ratings: 0).

    f(reviewText) -> overall (5 or not)


## Step 3

Predict the actual number of stars, based on the `reviewText` field. Again, the dataset for this step is filtered down to the `Digital_Music` category. Frame the problem as multiclass classification (5 classes labeled from 1 to 5).

    f(reviewText) -> overall
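
Steps 2 and 3 derive their targets from the same `overall` field. A minimal sketch of both label mappings follows. The function names are hypothetical, and it assumes `overall` is stored as a float such as `5.0`, as in the sample record.

```python
def binary_label(overall):
    """Step 2 target: 1 for a 5-star review, 0 for any other rating."""
    return 1 if int(overall) == 5 else 0


def star_label(overall):
    """Step 3 target: the rating as an integer class from 1 to 5."""
    return int(overall)


ratings = [5.0, 4.0, 1.0, 5.0]
print([binary_label(r) for r in ratings])  # [1, 0, 0, 1]
print([star_label(r) for r in ratings])    # [5, 4, 1, 5]
```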


## Data

You can find the following datasets under the project directory in the course git repository:

* For step 1: *amazon_step1.json.gz*
* For step 2: *amazon_step23.json.gz*
* For step 3: *amazon_step23.json.gz*
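
A minimal sketch for loading these files is shown below. It assumes each file holds one JSON object per line (JSON Lines), which this document does not state, so check the actual layout first. The function name `read_reviews` is hypothetical.

```python
import gzip
import json


def read_reviews(path):
    """Yield one review dict per non-empty line of a gzipped JSON Lines file.

    Assumption: one JSON object per line. Adjust if the file is a single array.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Under that assumption, `list(read_reviews("amazon_step1.json.gz"))` would load the step 1 data into memory, while iterating over the generator keeps memory use low.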

To complete your project, fill in the label column of the following **unseen** datasets using your final predictive models:

* For step 1: *amazon_step1_unseen.csv.gz* (label column to be filled with the original product category names)
* For step 2: *amazon_step2_unseen.csv.gz* (label column to be filled with 0 and 1)
* For step 3: *amazon_step3_unseen.csv.gz* (label column to be filled with 1, 2, 3, 4, 5)
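
Filling in a label column could look like the sketch below. The column names `reviewText` and `label` and the helper `fill_labels` are assumptions, since the exact layout of the unseen files is not described here. The example uses in-memory text instead of the real files.

```python
import csv
import io


def fill_labels(in_file, out_file, predictions, label_column="label"):
    """Copy a CSV and write one prediction per row into label_column."""
    reader = csv.DictReader(in_file)
    rows = list(reader)
    if len(rows) != len(predictions):
        raise ValueError("need exactly one prediction per row")
    fieldnames = list(reader.fieldnames or [])
    if label_column not in fieldnames:
        fieldnames.append(label_column)
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for row, prediction in zip(rows, predictions):
        row[label_column] = prediction
        writer.writerow(row)


# Hypothetical two-row file with an empty label column.
source = io.StringIO("reviewText,label\ngreat album,\nnot for me,\n")
target = io.StringIO()
fill_labels(source, target, [1, 0])
print(target.getvalue())
```

For the real files, the same function can be given handles opened with `gzip.open(path, "rt", newline="")` for reading and `gzip.open(path, "wt", newline="")` for writing.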

## Requirements

We expect your solution for each step to contain the following:

* data preprocessing and feature extraction (can be shared across different steps)
* feature selection (if needed)
* train, tune and test at least three models, one in each of the following categories:
    * a parametric model (e.g. linear models and SVM)
    * a similarity-based model (e.g. kNN-based models)
    * an information-based model (e.g. tree-based models)
* model comparison and a reasoned argument for the best model (don't forget to include a baseline model)
* predict the labels of the unseen test dataset using your best obtained model
* discussion of possible additional tasks that could boost the performance
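
One common choice of baseline, shown here only as a sketch, is to always predict the most frequent class in the training data. The names `majority_baseline` and `accuracy` are hypothetical, and accuracy is just one possible evaluation metric.

```python
from collections import Counter


def majority_baseline(train_labels):
    """Return a predictor that always outputs the most frequent training label."""
    most_common_label, _ = Counter(train_labels).most_common(1)[0]

    def predict(texts):
        return [most_common_label] * len(texts)

    return predict


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


train_labels = [5, 5, 4, 1, 5]
test_texts = ["a", "b", "c", "d"]
test_labels = [5, 3, 5, 4]

predict = majority_baseline(train_labels)
print(accuracy(test_labels, predict(test_texts)))  # 0.5
```

Any tuned model should clearly beat this score on the same test split. Otherwise it has not learned anything useful from the text.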

## Deliverables

* Deliver a Jupyter notebook with an explanation of your methods, code and results. Divide your notebook into clearly separated parts: the common pre-processing first, followed by your solution to each step.
    

* In each step, fill in the "unseen" datasets with your best predictive model and commit the resulting files.  
    

* Submit your final notebook and files to the team's git repository, following the naming conventions and deadline mentioned in the course syllabus. Each team should create a **new** git repository to submit all project materials.
   

* Invite both the course professor (github id: '*KenYounge*') and postdoc (github id: '*omidsh*') to your project git repository.  
    

* Note in the syllabus that you will also present your solution in the final session of the course. **More details to come.**

## Tips

* Presenting your solutions as a coherent story is extremely important!
  

* Document all of your assumptions (e.g. evaluation metric, resampling strategy, ...).  


* Make sure your code will run and results are reproducible (fix random seeds, etc.).  


* Comment your blocks of code (and lines of code if needed) and anything in your story/logic that might not be obvious by looking at your code.


* To speed up experimentation, you might use a small sample of the original dataset for your initial coding. Also try to use all available cores for computation by setting `n_jobs = -1` where an estimator supports it (in scikit-learn, for example, `n_jobs = 1` uses a single core and `-1` uses all of them).


* Where possible and sensible, try to take advantage of other fields (e.g. the `summary` field) to improve the performance of your models.


* In steps 2 and 3, you try to predict the 'sentiment' of the text. Limiting your dictionary to subjective words (words that carry some sort of sentiment) may be beneficial. You can find lists for this purpose online. An example of such a list is available [here](http://ptrckprry.com/course/ssd/data/positive-words.txt) for positive words and [here](http://ptrckprry.com/course/ssd/data/negative-words.txt) for negative words. Note that the presence of positive or negative words does not necessarily mean that a sentence has a positive or negative tone, since other factors such as negation words can affect the sentiment.


* Try to be creative and use all available resources to improve your predictions, but don't forget that your line of reasoning and its write-up in each step are equally important.


* Your final grade is based on the whole process of doing the project and not just based on your results on the unseen data. 
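
The sentiment-word tip and its caveat about negation can be sketched as follows. The three word sets are tiny illustrative stand-ins for the linked word lists, and the 3-token negation window is an arbitrary assumption. The `NOT_` prefix is one simple way of keeping negated words apart from plain ones.

```python
import re

# Tiny illustrative stand-ins for the much longer online word lists.
POSITIVE = {"good", "great", "love"}
NEGATIVE = {"bad", "boring", "awful"}
NEGATIONS = {"not", "no", "never"}
NEGATION_WINDOW = 3  # assumption: a negation affects the next 3 tokens


def sentiment_tokens(text):
    """Keep only lexicon words, prefixing those that follow a negation."""
    tokens = re.findall(r"[a-z']+", text.lower())
    kept = []
    window = 0  # how many upcoming tokens are still under a negation
    for token in tokens:
        if token in NEGATIONS or token.endswith("n't"):
            window = NEGATION_WINDOW
            continue
        if token in POSITIVE or token in NEGATIVE:
            kept.append("NOT_" + token if window > 0 else token)
            window = 0
        else:
            window = max(window - 1, 0)
    return kept


print(sentiment_tokens("This is not a very good album, but I love the cover"))
# ['NOT_good', 'love']
```

This sketch ignores punctuation and sentence boundaries, so a negation can leak into the following clause. Whether that matters is worth checking on real reviews.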

## Grading

Grading of the project (apart from the presentation) is based on the following components:

* 25 %  ___ Documentation in your notebook
* 15 %  ___ Code quality / comments
* 15 %  ___ Pre-processing
* 20 %  ___ Step 1
* 15 %  ___ Step 2
* 10 %  ___ Step 3

#### * Dataset Citation: 

R. He, J. McAuley. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. WWW, 2016.