# Classification Assignment

In this assignment you're going to use scikit-learn, and optionally any other modules we have used so far, for document classification. The data we're going to work on is Yelp restaurant reviews (more info [here](https://www.yelp.com/dataset_challenge)). There are many more dimensions to this set (user info, business info, etc.), but we're only going to focus on the review part. It's structured like this:

```json
{
    'type': 'review',
    'business_id': (encrypted business id),
    'user_id': (encrypted user id),
    'stars': (star rating, rounded to half-stars),
    'text': (review text),
    'date': (date, formatted like '2012-03-14'),
    'votes': {(vote type): (count)},
}

```

The dataset is very big (2.3 GB), so make sure that you select only as much as your computer can process. You can do this by changing the `max_reviews` variable below. Keep in mind, however, that the fewer reviews you use, the less there is to learn from, and the worse your classifier is likely to generalize to new data.

## The Assignment

Using a Jupyter Notebook to document classification processes is becoming popular in research. Not only does it contain a more verbose explanation of your thinking process than a paper, it also shows the code you're using and allows for direct reproduction. You're going to walk through all steps below, change the code to make a working classification document, and improve performance on this task. While doing so, you can use the *report here* boxes to describe your process and the **motivation** for your choices.

## What to Submit

Save your version of the .ipynb and submit it on Blackboard.

## 1 - Loading Data

Select the correct field from the JSON file you are provided to load the text (our data) and stars (our labels).

In [16]:
import json

max_reviews = 10000
X, y = [], []

with open('./data/yelp_academic_dataset_review.json') as f:
    for i, jsf in enumerate(f):
        review = json.loads(jsf)  # each line of the file is one JSON-encoded review
        
        text = ''  # your answer here
        stars = '' # your answer here
        
        X.append(text)
        y.append(stars)
        
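        # note: i starts at 0 and this check runs after appending,
        # so the loop keeps max_reviews + 1 reviews
        if i == max_reviews:
            break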
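
The file holds one JSON object per line, so it can be streamed without loading all 2.3 GB into memory. The sketch below is only an illustration: it assumes a small in-memory sample in place of the real file (the three records and their values are made up), and shows how `itertools.islice` caps the number of lines read at exactly `max_reviews` and how to inspect which fields a record offers.

```python
import io
import json
from itertools import islice

# Hypothetical stand-in for the real file: one JSON object per line.
sample = io.StringIO(
    '{"type": "review", "stars": 4, "text": "Great tacos.", "date": "2012-03-14"}\n'
    '{"type": "review", "stars": 2, "text": "Slow service.", "date": "2013-07-01"}\n'
    '{"type": "review", "stars": 5, "text": "Will return.", "date": "2014-01-20"}\n'
)

max_reviews = 2
# islice stops after max_reviews lines, so the rest of the file is never parsed.
records = [json.loads(line) for line in islice(sample, max_reviews)]

print(len(records))               # number of records actually read
print(sorted(records[0].keys()))  # fields available on each record
```

With the real data, `sample` would be the file handle returned by `open(...)`.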

*Report here.*

## 2 - Preprocessing

This is the part where we do manual cleaning or other preprocessing steps. Add your own steps to the preprocessing loop below:

In [17]:
for i, x in enumerate(X):
    X[i] = x.lower()
    # do some other things to document x
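
One way to extend the loop is to collect the cleaning steps in a single function. The sketch below is only an illustration: the function name `clean` and the particular steps (stripping URLs, collapsing whitespace) are assumptions, not a required part of the assignment, and the example document is made up.

```python
import re

URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")

def clean(doc):
    """Lowercase a review, drop URLs, and collapse runs of whitespace."""
    doc = doc.lower()
    doc = URL_RE.sub(" ", doc)    # replace each URL with a single space
    doc = SPACE_RE.sub(" ", doc)  # newlines, tabs, repeated spaces -> one space
    return doc.strip()

docs = ["Great  FOOD!\nSee http://example.com for the menu."]
docs = [clean(d) for d in docs]
print(docs)
```

Whether a given step helps is an empirical question, so compare cross-validation scores with and without it.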

*Report here.*

## 3 - Interpretation

Finding out things about your data helps you understand the performance of your classifier. Do that here:

In [18]:
from collections import Counter

Counter(y)

Counter({1: 1263, 2: 1030, 3: 1525, 4: 2687, 5: 3496})
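
The counts above show that the star ratings are unevenly distributed. As a short sketch, the counts can be turned into proportions and a majority-class baseline, that is, the accuracy of always predicting the most frequent label (the variable names are illustrative):

```python
from collections import Counter

# Label counts copied from the output above.
counts = Counter({1: 1263, 2: 1030, 3: 1525, 4: 2687, 5: 3496})
total = sum(counts.values())

# Share of each star rating in the data.
for label in sorted(counts):
    print(label, "%0.3f" % (counts[label] / total))

# Accuracy obtained by always predicting the most frequent label.
majority_label, majority_count = counts.most_common(1)[0]
baseline = majority_count / total
print("Majority baseline: %0.2f" % baseline)
```

For these counts the baseline comes out at about 0.35, which is a useful reference point when judging any classifier's accuracy.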

*Report here.*

## 4 - Train & Test

We make sure we evaluate properly by splitting our data. You can change the `train_size` parameter here.

In [19]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.10, random_state=42)
print("Training instances:", len(X_train))
print("Test instances:", len(X_test))

Training instances: 1000
Test instances: 9001
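
Because the labels are unbalanced, one option worth considering is a stratified split, which keeps the label proportions the same in both parts. The sketch below uses the `stratify` argument of `train_test_split` on made-up toy data; it is an illustration, not a required change.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data, made up for illustration: 8 documents of class 0 and 4 of class 1.
X_toy = ["doc %d" % i for i in range(12)]
y_toy = [0] * 8 + [1] * 4

# stratify=y_toy preserves the 2:1 class ratio in both halves.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, train_size=0.5, stratify=y_toy, random_state=42
)
print(Counter(y_tr), Counter(y_te))
```

In the notebook, the same argument would be `stratify=y`.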


*Report here.*

## 5 - Features & Classification

After that, we are ready for classification. You are provided with a standard pipeline. Expand on its functionality: try changing the vectorizer and/or its parameters, add classifiers, and change their parameters. Make sure you document ALL the things you tried and how they affected cross-validation performance. You're also allowed to change the number of folds (`cv`).

In [20]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', KNeighborsClassifier(n_neighbors=3)),
])
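
Swapping a step only requires replacing the corresponding entry in the pipeline. As a sketch, the variant below uses TF-IDF weighting and logistic regression; these are choices made for illustration, not a recommended solution, and the name `alt_pipeline` and the toy documents are made up.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

alt_pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)),
    ('clf', LogisticRegression(max_iter=1000)),
])

# Parameters of any step can also be changed afterwards via <step>__<param>.
alt_pipeline.set_params(vect__min_df=1, clf__C=10.0)

# Toy training data, made up for illustration.
toy_docs = ["great food", "great service", "awful food", "awful service"]
toy_labels = [5, 5, 1, 1]
alt_pipeline.fit(toy_docs, toy_labels)
print(alt_pipeline.predict(["great place"]))
```

The step names (`vect`, `clf`) are the same as in the provided pipeline, so the cross-validation cell works unchanged with either one.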

In [22]:
from sklearn.model_selection import cross_val_score

scores = cross_val_score(pipeline, X_train, y_train, cv=5, n_jobs=-1)
print("Accuracy on Cross-Validation: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

Accuracy on Cross-Validation: 0.31 (+/- 0.05)
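
Trying parameter combinations by hand quickly becomes tedious. A minimal sketch, assuming you want to search a small grid automatically, uses `GridSearchCV` with the same `<step>__<param>` naming; the grid values and the toy corpus are made up for illustration.

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Toy corpus, made up for illustration; use X_train / y_train in the notebook.
toy_docs = ["great food", "lovely place", "great service", "tasty and great",
            "awful food", "terrible place", "awful service", "rude and awful"]
toy_labels = [5, 5, 5, 5, 1, 1, 1, 1]

grid_pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('clf', KNeighborsClassifier()),
])

# Every combination of these values is cross-validated.
param_grid = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__n_neighbors': [1, 3],
}

search = GridSearchCV(grid_pipeline, param_grid, cv=2)
search.fit(toy_docs, toy_labels)
print(search.best_params_)
print("Best cross-validation accuracy: %0.2f" % search.best_score_)
```

The full table of results is available in `search.cv_results_`, which is convenient for documenting what you tried.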


*Report here.*

## 6 - Evaluation 

After you're done with a certain set-up above, you are allowed to test on your test set. Do that here. Report the performance, and whether you went back to another classifier to increase it.

In [23]:
pipeline.fit(X_train, y_train)  # fitting = training
score = pipeline.score(X_test, y_test)  # a single accuracy value, so there is no spread to report
print("Accuracy on Test: %0.2f" % score)

Accuracy on Test: 0.35
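
A single accuracy figure hides which star ratings get confused with each other. As a sketch, a confusion matrix breaks the errors down per class; the gold labels and predictions below are made up for illustration, and in the notebook `y_pred` would come from `pipeline.predict(X_test)`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up gold labels and predictions, for illustration only.
y_true = [1, 1, 3, 3, 5, 5, 5, 5]
y_pred = [1, 3, 3, 5, 5, 5, 5, 1]

acc = accuracy_score(y_true, y_pred)
print("Accuracy: %0.3f" % acc)

# Rows are true labels, columns are predicted labels, both ordered as in `labels`.
cm = confusion_matrix(y_true, y_pred, labels=[1, 3, 5])
print(cm)
```

Off-diagonal cells show the mistakes, for example how often a 3-star review was predicted as 5 stars.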


*Report here.*

## 7 - Extra

Use LIME or graphs to reflect on the performance of your classifier. You need to figure out the code yourself (from the slide examples).

In [24]:
# your code here

*Report here.*