# Yelp Dataset Challenge

![Yelp Data Challenge](https://s3-media3.fl.yelpcdn.com/assets/srv0/engineering_pages/6d323fc75cb1/assets/img/dataset/960x225_dataset@2x.png)

## Natural Language Processing

In [None]:
import pandas as pd

df = pd.read_csv('last_2_years_restaurant_reviews.csv')

In [None]:
df.head()

In [None]:
df.info()

## Define features

In [None]:
# Take the values of the column that contains review text data, save to a variable named "documents"
documents = df['text']

# Inspect the documents: check the size and peek at the first few elements of the pandas Series
print(documents.shape)
print('1:')
print(documents[0])
print('2:')
print(documents[1])
print('3:')
print(documents[2])

## Define target

* The target must be a categorical variable.
* Candidate target columns: `avg_stars`, `cool`, `funny`, `stars`, `useful`
  * Only `avg_stars` and `stars` turn out to be suitable; the others take too many distinct values.

In [None]:
# Check the unique values
for col in ['avg_stars', 'cool', 'funny', 'stars', 'useful']:
    print(col + ':')
    print(df[col].unique())

### 5 stars and non-5 stars rating

In [None]:
# Make a column and take the values, save to a variable named "target"
df['five stars'] = (df['stars'] > 4)
target = df['five stars']

#### Summary statistic

In [None]:
target.describe()

In [None]:
target.mean(), target.std(), target.shape

## Train and test splitting

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(documents, target, test_size=0.4, random_state=42)

## NLP representation

1. Create `TfidfVectorizer`, and name it `vectorizer`
2. Train the model with your training data
3. Use the trained model to transform your test data
4. Get the vocab of your tfidf

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words = 'english', max_features=5000)
vec_train = vectorizer.fit_transform(X_train)
vec_test = vectorizer.transform(X_test)
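# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is its replacement (available since 1.0)
vocab = vectorizer.get_feature_names_out()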

In [None]:
print(vocab)

In [None]:
vec_train_array = vec_train.toarray()
vec_test_array = vec_test.toarray()

In [None]:
print(documents.shape[0] * 0.6, documents.shape[0] * 0.4)
print(X_train.shape, X_test.shape)
print(vec_train.shape, vec_test.shape, vec_train_array.shape, vec_test_array.shape)

`vec_train_array` and `vec_test_array` are dense copies of the sparse TF-IDF matrices, so most of their elements are 0.
They are hard to read in matrix form, so there is no need to print and check them.
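
To make "mostly zeros" concrete, here is a minimal sketch that measures the density of a sparse matrix. The tiny matrix and the name `toy_matrix` are made up for illustration; the same `nnz` and `shape` attributes should be available on `vec_train`, assuming it is a SciPy sparse matrix as `TfidfVectorizer` normally returns.

```python
import numpy as np
from scipy import sparse

# Toy stand-in for a TF-IDF matrix: 3 documents x 6 vocabulary terms (made-up values)
dense = np.array([
    [0.0, 0.7, 0.0, 0.0, 0.7, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.9, 0.0, 0.4],
])
toy_matrix = sparse.csr_matrix(dense)

# nnz counts the stored non-zero entries; density is their share of all cells
n_cells = toy_matrix.shape[0] * toy_matrix.shape[1]
density = toy_matrix.nnz / n_cells
print('shape:', toy_matrix.shape)
print('non-zero entries:', toy_matrix.nnz)
print('density: {:.2%}'.format(density))
```

Checking the density this way avoids calling `toarray()`, which allocates every zero explicitly and can be expensive for a matrix with hundreds of thousands of rows.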

In [None]:
# print(vec_train_array) # dense array, mostly zeros

In [None]:
# print(vec_test_array) # dense array, mostly zeros

## Similar review search engine

In [None]:
import numpy as np

def get_top_values(lst, n, labels):
    '''
    Input: list, integer, list
    Output: list
    
    Given a list of values, find the indices with the highest n values.
    Return the labels for each of these indices
    
    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ['cat', 'dog', 'mouse', 'pig', 'rabbit']
    output: ['cat', 'pig']
    '''
    # np.argsort() sorts in ascending order by default, so reverse it with [::-1]
    return [labels[i] for i in np.argsort(lst)[::-1][:n]]

In [None]:
def get_bottom_values(lst, n, labels):
    '''
    Input: list, integer, list
    Output: list
    
    Given a list of values, find the indices with the lowest n values.
    Return the labels for each of these indices.
    
    e.g.
    lst = [7, 3, 2, 4, 1]
    n = 2
    labels = ['cat', 'dog', 'mouse', 'pig', 'rabbit']
    output: ['mouse', 'rabbit']
    '''
    return [labels[i] for i in np.argsort(lst)[:n]]

Pick a random review from the test sample, compare it with the training samples using cosine similarity, and select the five most similar training reviews.
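
The idea can be sketched on a made-up mini corpus before running it on the full data. The documents, the query, and the `toy_*` names below are assumptions for illustration only; they are not taken from the Yelp reviews.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Made-up mini corpus standing in for the training reviews
toy_train = [
    'great pizza and friendly service',
    'the pizza was cold and the service was slow',
    'lovely sushi bar with fresh fish',
    'terrible parking but great burgers',
]
toy_query = 'friendly service and great pizza'

# Fit on the "training" documents only, then reuse the same vocabulary for the query
toy_vectorizer = TfidfVectorizer()
toy_train_vecs = toy_vectorizer.fit_transform(toy_train)
toy_query_vec = toy_vectorizer.transform([toy_query])

# One row of scores: the query against every training document
toy_scores = cosine_similarity(toy_query_vec, toy_train_vecs)[0]

# argsort is ascending, so reverse it and keep the first k indices
k = 2
top_idx = np.argsort(toy_scores)[::-1][:k]
for rank, i in enumerate(top_idx, start=1):
    print(rank, round(float(toy_scores[i]), 3), toy_train[i])
```

The query contains the same words as the first toy document in a different order, so that document ranks first: TF-IDF is a bag-of-words representation and ignores word order.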

In [None]:
# cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

# Draw an arbitrary review from the test (unseen in training) documents
# random_number = np.random.randint(0, documents.shape[0]) # for verification only
random_number = np.random.randint(0, vec_test.shape[0])
# random_review = documents[random_number]
random_review = X_test.iloc[random_number] # positional lookup; X_test keeps the shuffled original index, so X_test[random_number] would look up by label
# print(vec_test.shape[0], X_test.shape)
print(random_number)
print(random_review)

# Transform the drawn review(s) to vector(s)
vec_random_review = vectorizer.transform([random_review])
print(vec_random_review.shape)
print(vec_random_review.toarray()) # dense view of a sparse vector (mostly zeros)

# Calculate the similarity score(s) between vector(s) and training vectors
similarity_scores = cosine_similarity(vec_random_review, vec_train) # vec_random_review: 1 x 5000, vec_train: 238822 x 5000
print(similarity_scores.shape)
print(similarity_scores) # cosine similarity between vec_random_review and every row of vec_train; vec_train has 238822 rows, so there are 238822 values

In [None]:
# Let's find top 5 similar reviews
n = 5
top_reviews = get_top_values(similarity_scores[0], n, X_train.values) # compare against the training samples
print(len(top_reviews))
print(top_reviews)

In [None]:
# print('Our search query:')
# print(random_review) # To be added

In [None]:
print('Most %s similar reviews:' % n)
# print()  # To be added
for index, review in enumerate(top_reviews):
    print(str(index) + ':\t' + review + '\n')

#### Q: Does the result make sense to you?

A: Yes.

## Classifying positive and negative review

### Naive-Bayes Classifier

In [None]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB()
nb.fit(vec_train, y_train)

# Get score for training set and test set
nb_score_train = nb.score(vec_train, y_train)
nb_score_test = nb.score(vec_test, y_test)

print('train score: ', nb_score_train)
print('test score: ', nb_score_test)

### Logistic Regression Classifier

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
lr.fit(vec_train, y_train)

# Get score for training set and test set
lr_score_train = lr.score(vec_train, y_train)
lr_score_test = lr.score(vec_test, y_test)

print('train score: ', lr_score_train)
print('test score: ', lr_score_test)

print(lr.coef_.shape)
print(lr.coef_)

#### Q: What are the key features (words) that make the positive prediction?

In [None]:
# Let's find it out by ranking
n = 20
features_with_positive_predictions = get_top_values(lr.coef_[0], n, vocab)
print(features_with_positive_predictions)

#### Q: What are the key features (words) that make the negative prediction?

In [None]:
# Let's find it out by ranking
n = 20
features_with_negative_predictions = get_bottom_values(lr.coef_[0], n, vocab)
print(features_with_negative_predictions)

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(max_depth = None, n_estimators = 5, min_samples_leaf = 10)
rf.fit(vec_train, y_train)

# Get score for training set and test set
rf_score_train = rf.score(vec_train, y_train)
rf_score_test = rf.score(vec_test, y_test)

print('train score: ', rf_score_train)
print('test score: ', rf_score_test)

print(rf.feature_importances_.shape)
print(rf.feature_importances_)

#### Q: What do you see from the training score and the test score?

A: The training score is higher than the test score, which suggests that the model is overfitting the training data.
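
One way to probe this is to watch how the gap between training and test accuracy changes as the trees are regularized. The sketch below uses synthetic data from `make_classification` rather than the Yelp reviews, and the `min_samples_leaf` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to illustrate the comparison (not the Yelp reviews)
X_toy, y_toy = make_classification(n_samples=1000, n_features=20, random_state=42)
Xtr, Xte, ytr, yte = train_test_split(X_toy, y_toy, test_size=0.4, random_state=42)

gaps = {}
for leaf in [1, 10, 50]:
    toy_rf = RandomForestClassifier(n_estimators=5, min_samples_leaf=leaf, random_state=42)
    toy_rf.fit(Xtr, ytr)
    train_acc = toy_rf.score(Xtr, ytr)
    test_acc = toy_rf.score(Xte, yte)
    # A large positive gap means the model fits the training set better than unseen data
    gaps[leaf] = train_acc - test_acc
    print('min_samples_leaf={:>2}  train={:.3f}  test={:.3f}  gap={:.3f}'.format(
        leaf, train_acc, test_acc, gaps[leaf]))
```

A shrinking gap at larger `min_samples_leaf` values would be consistent with reduced overfitting, although the exact numbers depend on the data and the random seed.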

#### Q: Can you tell what features (words) are important by inspecting the RFC model?

In [None]:
n = 20
top_features_importances = get_top_values(rf.feature_importances_, n, vocab)
print(top_features_importances)

Compare the top 10 words in the Logistic Regression and Random Forest models. Each cell gives the word's rank in that model; `x` means the word is not in that model's top 10.

|words     |Logistic Regression|Random Forest|
|:--------:|:-----------------:|:-----------:|
|amazing   |1                  |1            |
|best      |2                  |3            |
|incredible|3                  |x            |
|thank     |4                  |x            |
|awesome   |5                  |x            |
|delicious |6                  |7            |
|perfection|7                  |x            |
|highly    |8                  |x            |
|phenomenal|9                  |x            |
|perfect   |10                 |x            |
|didn      |x                  |2            |
|great     |x                  |4            |
|worst     |x                  |5            |
|love      |x                  |6            |
|bad       |x                  |8            |
|said      |x                  |9            |
|definitely|x                  |10           |
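
The overlap between the two rankings can also be computed rather than read off by eye. This is a small sketch; the two lists are copied from the table above in rank order, and the variable names are chosen here for illustration.

```python
# Top-10 lists copied from the comparison table, in rank order
lr_top10 = ['amazing', 'best', 'incredible', 'thank', 'awesome',
            'delicious', 'perfection', 'highly', 'phenomenal', 'perfect']
rf_top10 = ['amazing', 'didn', 'best', 'great', 'worst',
            'love', 'delicious', 'bad', 'said', 'definitely']

# Preserve rank order instead of using sets, so the output stays readable
shared = [w for w in lr_top10 if w in rf_top10]
only_lr = [w for w in lr_top10 if w not in rf_top10]
only_rf = [w for w in rf_top10 if w not in lr_top10]

print('shared:', shared)
print('only Logistic Regression:', only_lr)
print('only Random Forest:', only_rf)
```

Only three words appear in both lists. One likely reason is that `feature_importances_` is unsigned, so the Random Forest list mixes words tied to either class (`worst`, `bad`), while the largest Logistic Regression coefficients point only toward the positive class.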

## Extra Credit

### 1. Use cross validation to evaluate classifiers

[sklearn cross validation](https://scikit-learn.org/stable/modules/cross_validation.html)

In [None]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(lr, vec_train, y_train, cv=5, scoring='accuracy')
print(scores)
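
The per-fold scores are usually easier to compare across models when summarized as a mean and a standard deviation. The sketch below shows this on synthetic data from `make_classification`, not on the review vectors; the name `cv_scores_toy` is made up for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only to illustrate the summary (not the Yelp reviews)
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)

# One accuracy value per fold
cv_scores_toy = cross_val_score(LogisticRegression(max_iter=1000), X_toy, y_toy,
                                cv=5, scoring='accuracy')
print(cv_scores_toy)
print('accuracy: {:.3f} +/- {:.3f}'.format(cv_scores_toy.mean(), cv_scores_toy.std()))
```

The same two-line summary should work on the `scores` array from the cell above.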

### 2. Use grid search to find the best-performing classifier

[sklearn grid search tutorial (with cross validation)](https://scikit-learn.org/stable/modules/grid_search.html#grid-search)

[sklearn grid search documentation (with cross validation)](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV)

[Model evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter)

[classification_report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html)

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = {'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10, 100]}
scorings = ['accuracy']

for score in scorings:
    print("Tuning hyper-parameters for {}".format(score))
    # liblinear supports both l1 and l2 penalties; the default lbfgs solver does not support l1.
    # return_train_score=True is required for the *_train_score entries read below.
    clf = GridSearchCV(LogisticRegression(solver='liblinear'), param_grid, cv=5,
                       scoring=score, return_train_score=True)
    clf.fit(vec_train, y_train)

    print("Best parameters set found on development set:\n")
    print(clf.best_params_)
    print('\n')

    cv_result = clf.cv_results_
    means_train = cv_result['mean_train_score']
    stds_train = cv_result['std_train_score']
    means_test = cv_result['mean_test_score']
    stds_test = cv_result['std_test_score']

    # Use a separate loop variable so the parameter grid is not overwritten
    for candidate, mean_train, std_train, mean_test, std_test in zip(cv_result['params'], means_train, stds_train, means_test, stds_test):
        print('For params:', candidate)
        print('Train: {:.3f} +/- {:.3f}'.format(mean_train, std_train))
        print('Test: {:.3f} +/- {:.3f}'.format(mean_test, std_test))

    print('Classification report:')
    print('The model is trained on the full development set.\n'
          'The scores are computed on the full evaluation set.')
    y_pred = clf.predict(vec_test)
    print(classification_report(y_test, y_pred))