## References:
* Data Set:
    * https://www.kaggle.com/jvanelteren/boardgamegeek-reviews
* Existing work references:
    * https://www.kaggle.com/jvanelteren/exploring-the-13m-reviews-bgg-dataset  
* Code References:  
    * pre-built sentiment analyzers:
        * https://github.com/cjhutto/vaderSentiment  
        * https://textblob.readthedocs.io/en/dev/api_reference.html#module-textblob.classifiers  
    * variety of NaiveBayes Classifiers:
        * https://scikit-learn.org/stable/modules/classes.html#module-sklearn.naive_bayes
        * https://datascience.stackexchange.com/questions/53100/training-textblob-with-16k-rows-of-labeled-data-wont-work-only-few-are-working
    * feature extraction:
        * https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

# PACKAGES UNAVAILABLE HERE, USE JUPYTER NOTEBOOK ON LOCAL MACHINE
## Vader Sentiment doesn't load properly via pip install and isn't pre-installed here.

# Term Project
Joshua Tran, 1001296598, CSE-5334
## Objective:
The objective of this project is to build a classifier that reads a given review and predicts its rating. The reviews come from the 15-million-review dataset listed in the references.
To do this, I implement a Naive Bayes classifier alongside some pre-built sentiment analyzers to predict review ratings. Before getting to the models, a few notes on how the data was prepared.

### pre-processing the data:
The dataset is large, and some of it needs pre-processing to make training easier and to remove unhelpful information. To simplify processing, all ratings are rounded to integers from 0 to 10, for 11 possible classes. Keeping the floating-point ratings left thousands of distinct class values, which seemed overcomplicated and undesirable.  
Additionally, I removed reviews that have no comment, because a review without text gives us nothing to predict from.
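
As a rough illustration of the two cleanup steps described above, here is a minimal sketch on a made-up table. The `toy` frame and its values are assumptions for demonstration only; the real data is loaded and cleaned later in the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the review table (values are made up for illustration).
toy = pd.DataFrame({
    "rating": [7.8, 6.2, 9.0, 3.4],
    "comment": ["Great filler game", np.nan, "A classic", "Too long for what it is"],
})

# Collapse the fractional ratings onto whole-number classes.
toy["rating"] = toy["rating"].round().astype(int)

# Drop reviews that carry no text to learn from.
toy = toy.dropna(subset=["comment"])

print(toy)
```

The cast to `int` is optional; it only makes the class labels print without a trailing `.0`.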

## Libraries

In [None]:
# base computation libraries
import numpy as np
import pandas as pd
from os import path
import matplotlib.pyplot as plt

# data splitting
from sklearn.model_selection import train_test_split

# feature extraction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# sentiment analyzers
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from textblob import TextBlob

# naive bayes classifiers
from sklearn.naive_bayes import MultinomialNB
from textblob.classifiers import NaiveBayesClassifier

# classifier saving
import pickle

# needed for textblob custom:
# import nltk
#nltk.download('punkt')

### Grabbing the data
We load the data from the raw file, then round the ratings to whole-number values.

In [None]:
# path to reviews
ReviewPath = "../input/boardgamegeek-reviews/bgg-15m-reviews.csv"
df = pd.read_csv(ReviewPath).iloc[:, 1:]
df['rating']=df['rating'].round()

## Examining the given data:
The data has the following columns:
* A user
* A rating
* A comment
* An ID
* A name of a game

Here we can also see that a lot of reviews have "NaN" as their comment value, which won't help us.

In [None]:
df

So let's remove the rows with "NaN" comments.

In [None]:
df = df.dropna(subset = ['comment'])
df

## Details of the data:
The next two code blocks show what the data looks like after this cleanup. First, the count has decreased drastically, to just below 3 million usable reviews, and the ratings average around 7.  
The chart below also shows that the reviews are concentrated heavily in the 6 to 10 range.

In [None]:
df.describe()

In [None]:
df['rating'].hist(bins=10)
plt.xlabel('rating of review')
plt.ylabel('number of reviews')
plt.show()
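
To read the same distribution as numbers rather than bars, a per-rating count can help. This is a minimal sketch on made-up ratings; `toy_ratings` is a hypothetical stand-in for `df['rating']`.

```python
import pandas as pd

# Made-up ratings standing in for df['rating'].
toy_ratings = pd.Series([8, 7, 7, 10, 6, 8, 8, 3], name="rating")

# Count how many reviews fall on each rating, ordered by rating value.
counts = toy_ratings.value_counts().sort_index()

# Express each count as a share of all reviews.
share = (counts / counts.sum()).round(3)

print(pd.DataFrame({"count": counts, "share": share}))
```

Because the ratings are whole numbers, counting them directly avoids any question of where the histogram bin edges fall.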

## Taking a look at sentiment analysis: "Vader" Sentiment
Now that we've taken a look at the data, let's take a closer look at some sentiment analyzers.
First we have "Vader" sentiment. It takes a string and scores how positive, negative, and neutral it is. Here we initialize the analyzer, pass it a sample comment from our data (the row at position 1), and look at the scores.

In [None]:
analyzer = SentimentIntensityAnalyzer()
print(df.iloc[1]['comment'])
sentiment_polarity_score_example = analyzer.polarity_scores(df.iloc[1]['comment'])
print("result of sentiment polarity scores:\n",sentiment_polarity_score_example)
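
The scores come back as a dictionary with the keys `neg`, `neu`, `pos`, and `compound`. The VADER README suggests reading the compound score with cutoffs of +0.05 and -0.05; the sketch below applies that convention. The helper name `label_from_compound` and the example scores are assumptions for illustration, not part of the library or of this project's pipeline.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score onto a coarse sentiment label.

    The +/-0.05 cutoffs follow the convention suggested in the VADER README.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Made-up scores in the same shape that polarity_scores returns.
example_scores = {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6}
example_label = label_from_compound(example_scores["compound"])
print(example_label)
```

This notebook does not use these labels; it feeds the raw scores to a classifier instead, since the goal is an 11-class rating rather than a three-way sentiment.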

## Applying sentiment analysis to our data set.
Below is a function that applies the Vader sentiment analyzer to the data set. The idea is to use the scores it produces as numerical features for a Naive Bayes classifier, which then predicts the rating of a review.  
Note that this processing takes a while, so I've added a check for whether the processed sentiment data already exists on disk; if it does, we just load it.

In [None]:
def features(x):
    result = analyzer.polarity_scores(x)
    return pd.Series([result['neg'], result['neu'], result['pos'], result['compound']])

In [None]:
polarity_score_path = 'C:\\Users\\j0sh7\\OneDrive\\Desktop\\Fall 2020\\Data mining\\term proj\\vader-processed.csv'
# load the previously processed sentiment scores if they exist
if path.exists(polarity_score_path):
    vader_processed_df = pd.read_csv(polarity_score_path)
# otherwise compute the scores (uncomment the to_csv line to export them for next time)
else:
    df[['neg polarity','neu polarity','pos polarity','compound polarity']] = df['comment'].apply(features)
    #df.to_csv(polarity_score_path,index=False)
    vader_processed_df = df
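
The load-or-compute pattern used here can be written as a small reusable helper. This is a sketch under assumptions: `load_or_compute` and `expensive_step` are hypothetical names, the data is made up, and unlike the cell above this version always writes the cache file after computing.

```python
import os
import tempfile

import pandas as pd


def load_or_compute(cache_path, compute_fn):
    """Return the cached DataFrame if present, otherwise compute and cache it."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)
    result = compute_fn()
    result.to_csv(cache_path, index=False)
    return result


def expensive_step():
    # Stand-in for the slow sentiment scoring over every comment.
    return pd.DataFrame({"comment": ["fun", "dull"], "pos polarity": [0.5, 0.0]})


with tempfile.TemporaryDirectory() as tmp:
    cache_file = os.path.join(tmp, "processed.csv")
    first = load_or_compute(cache_file, expensive_step)   # computes and writes
    second = load_or_compute(cache_file, expensive_step)  # reads the cached file
    cache_was_written = os.path.exists(cache_file)

print(second)
```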

## Taking a look at the Vader processed data

In [None]:
vader_processed_df

In [None]:
vader_processed_df.describe()

Here is a helper function that computes the accuracy of a set of predictions.

In [None]:
def findAccuracy(actual, predictions):
    # fraction of predictions that exactly match the actual ratings
    return np.mean(np.asarray(actual) == np.asarray(predictions))
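
Exact-match accuracy treats a prediction that is off by one the same as one that is off by seven. Since the later sections also look at how far off the predictions are, here is a small sketch of both views on made-up values (the arrays below are illustrative, not project results).

```python
import numpy as np

# Made-up actual and predicted ratings.
actual = np.array([8, 7, 6, 10, 8])
predicted = np.array([8, 8, 6, 7, 6])

# Share of predictions that match the actual rating exactly.
exact = np.mean(actual == predicted)

# How far each prediction is from the actual rating.
abs_diff = np.abs(actual - predicted)

# Share of predictions that are off by at most one point.
within_one = np.mean(abs_diff <= 1)

# Average distance between predicted and actual rating.
mean_abs_diff = abs_diff.mean()

print(exact, within_one, mean_abs_diff)
```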

## Splitting the data
Now that we've applied sentiment analysis to generate features for classifying a review, we need to separate our data into a training set and a testing set. Below we put 80% of the reviews in the training set and the remaining 20% in the testing set.

In [None]:
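# features: the three non-negative polarity scores
# (the compound score is not used; it can be negative, which MultinomialNB does not accept)
X_train, X_test, y_train, y_test = train_test_split(
    vader_processed_df[['neg polarity','neu polarity','pos polarity']].to_numpy(),
    vader_processed_df['rating'].values,
    test_size=0.2)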

## Using the Naive Bayes classifier
Here we use our Naive Bayes classifier with a range of values for alpha, the Laplace smoothing parameter. However, as the results below show, this hyperparameter has no effect on this dataset, and the accuracy stays the same.

In [None]:
smoothingHyperParameter = [10.0, 5.0, 2.0, 1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001]
for hyperParam in smoothingHyperParameter:
    MultinomialNB_clf = MultinomialNB(alpha=hyperParam)
    MultinomialNB_clf.fit(X_train, y_train)
    result = MultinomialNB_clf.predict(X_test)
    print(f"result of hyper parameter: {hyperParam} = {findAccuracy(y_test, result)}")
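
One plausible explanation for the flat results is that, with only three features and a very large number of rows, the per-class feature totals dwarf any alpha in this range. The sketch below illustrates that on synthetic data; the Dirichlet parameters, sample sizes, and the two class labels are all assumptions chosen for the demonstration, not values taken from the review data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
n_per_class = 50_000

# Synthetic stand-ins for (neg, neu, pos) rows: non-negative and summing to 1.
X = np.vstack([
    rng.dirichlet([2.0, 6.0, 2.0], size=n_per_class),
    rng.dirichlet([1.0, 6.0, 3.0], size=n_per_class),
])
y = np.repeat([6, 8], n_per_class)

# Fit the same model with heavy and with almost no smoothing.
strong = MultinomialNB(alpha=10.0).fit(X, y)
weak = MultinomialNB(alpha=1e-5).fit(X, y)

# Largest change in any learned log-probability between the two fits.
max_shift = np.abs(strong.feature_log_prob_ - weak.feature_log_prob_).max()

# Share of rows on which the two fits predict the same class.
agreement = np.mean(strong.predict(X) == weak.predict(X))

print(max_shift, agreement)
```

Smoothing matters most when some feature has a count near zero in some class, which is common with sparse word counts but not with three dense polarity scores.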

## Taking a Closer look at the results
If we take a closer look, we can see that this implementation predicts only the values 6 and 8. Additionally, the differences between the actual and predicted values are mostly between 0 and 3, with a few thousand between 4 and 7.

In [None]:
MultinomialNB_clf = MultinomialNB()
MultinomialNB_clf.fit(X_train, y_train)
result = MultinomialNB_clf.predict(X_test)
print(f"result: {np.unique(result)}")
print(f"result: {findAccuracy(y_test, result)}")
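
Alongside the list of distinct predicted values, it can be useful to see how often each one is predicted. A minimal sketch, with `result_toy` as a made-up stand-in for the `result` array:

```python
import numpy as np

# Made-up predictions standing in for `result`.
result_toy = np.array([8, 6, 8, 8, 6, 8])

# Distinct predicted ratings together with how often each occurs.
labels, counts = np.unique(result_toy, return_counts=True)
prediction_counts = dict(zip(labels.tolist(), counts.tolist()))

print(prediction_counts)
```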


In [None]:
diff = np.abs(result - y_test)
diff_df = pd.DataFrame(data=diff, columns=["difference"])
diff_df['difference'].hist(bins=10)
plt.xlabel('difference between values')
plt.ylabel('number of differences')
plt.show()

In [None]:
diff_df.describe()

## Using TextBlob sentiment analyzer and Custom Classifier
Next, let's try something new. The approach is similar to the previous one, but uses TextBlob instead. Interestingly, TextBlob lets us train a custom classifier to predict sentiment, rather than only scoring text as positive or negative.  
However, this training and classification process is EXTREMELY SLOW. So for this experiment we cut the data down to around 10 thousand data points, split into training and test sets in the same proportions as before.  
This classifier runs a feature extractor over the given records and then applies a Naive Bayes classifier to the result. However, it needs an initial classification call before it starts producing predictions.  
Additionally, it wraps an nltk classifier, which makes it slower than the scikit-learn classifiers.

In [None]:
Xdrop, X, yDrop, y = train_test_split(df[['comment','rating']],df['rating'], test_size=(10000 / df.shape[0]))
X_train_TB, X_test_TB, y_train_TB, y_test_TB = train_test_split(X, y, test_size=.2,random_state=42)
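
An alternative way to draw a fixed-size subsample is `DataFrame.sample`. This is a sketch on a made-up frame, with sizes scaled down to 10 rows; it is not the method used in the cell above, and the `toy` data is purely illustrative.

```python
import pandas as pd

# Made-up reviews: 100 rows with ratings cycling through 0..10.
toy = pd.DataFrame({
    "comment": [f"review {i}" for i in range(100)],
    "rating": [i % 11 for i in range(100)],
})

# Draw a fixed number of rows; random_state makes the draw repeatable.
subset = toy.sample(n=10, random_state=42)

# Split the subsample 80/20 into train and test parts.
train_part = subset.sample(frac=0.8, random_state=42)
test_part = subset.drop(train_part.index)

print(len(subset), len(train_part), len(test_part))
```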

In [None]:
X_train_TB

In [None]:
X_test_TB

## Formatting the Data for TextBlob
The classifier needs our records stored as a list of (comment, rating) entries:

In [None]:
train_records = list(X_train_TB.to_records(index=False))
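
TextBlob's classifiers take training data as a list of `(text, label)` pairs, so the column order matters: the comment has to come first. The sketch below shows what the conversion produces on a made-up two-row frame, plus an equivalent `zip`-based form; `toy`, `records`, and `pairs` are illustrative names.

```python
import pandas as pd

# Made-up frame with the same column order as X_train_TB: comment first, then rating.
toy = pd.DataFrame({"comment": ["great game", "too long"], "rating": [8.0, 4.0]})

# Each record behaves like a (comment, rating) tuple.
records = list(toy.to_records(index=False))

# The same pairs built explicitly from the two columns.
pairs = list(zip(toy["comment"], toy["rating"]))

print(records[0][0], records[0][1])
print(pairs)
```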

Here we can see that this classifier takes a long while to train, even though we cut the data down to 10 thousand records.

In [None]:
%time cl = NaiveBayesClassifier(train_records)
%time initializer = cl.classify('this is a initializer, for classification')
print(initializer)

In [None]:
test_records = list(X_test_TB.to_records(index=False))

In [None]:
results = []
for item in test_records:
    results.append(cl.classify(item[0]))
results_TB = np.array(results)

## Looking at the results of this classifier
Looking at this classifier, we can see that the exact-match accuracy has decreased, likely because of the smaller data size. On closer inspection, though, the differences between the predicted and actual values have shrunk, so the predictions land slightly closer to the true ratings even though fewer of them match exactly.

In [None]:
print(findAccuracy(y_test_TB, results))

In [None]:
# absolute difference between predicted and actual ratings, kept as a Series so it can be plotted and described
diff_TB = pd.Series(np.abs(results_TB - y_test_TB.to_numpy()), name="difference")
diff_TB.hist(bins=10)
plt.xlabel('difference between values')
plt.ylabel('number of differences')
plt.show()

In [None]:
diff_TB.describe()

## Mixing the findings from Vader, TextBlob, and vanilla Sci-kit
After testing each method used here, I found that scikit-learn's tools are incredibly fast and optimized, so in practice I'd like to build the approach on them. Additionally, we found that smoothing didn't have much of an effect on the results.  
From the Vader experiment, I concluded that it may be best to drop the sentiment scores from training and instead use the words themselves as features, as TextBlob does, given the improvement in its results. However, TextBlob's problem with processing large amounts of data comes from its feature extraction method. So to speed this up we'll use essentially the same word-based idea but port it to scikit-learn.  
Here we split our data into train and test sets as usual, then create a pipeline with a vectorized feature extractor for each review and train a Naive Bayes classifier on it.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(df[['comment','rating']],df['rating'], test_size=0.2)

## Taking a look at the results
We can see that both the precision and the accuracy have increased, with the differences grouped more tightly around zero.

In [None]:
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
%time model.fit(X_train['comment'], X_train['rating'])
%time results_model = model.predict(X_test['comment'])
print(findAccuracy(y_test, results_model))
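
To make the pipeline's behavior concrete, here is a minimal sketch on four made-up comments. The comments, ratings, and names (`toy_model` and so on) are assumptions for illustration; the point is only that the pipeline accepts raw strings, since the vectorizer turns them into TF-IDF features before the classifier sees them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up comments with made-up ratings.
toy_comments = [
    "great fun game",
    "love this great game",
    "boring bad game",
    "awful boring game",
]
toy_ratings = [8, 8, 2, 2]

# The vectorizer and the classifier are fitted together as one model.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(toy_comments, toy_ratings)

# Prediction takes raw text; the pipeline applies the fitted vectorizer first.
toy_prediction = toy_model.predict(["great fun"])
print(toy_prediction)
```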

In [None]:
diff = np.abs(results_model - y_test)
diff.hist(bins=10)
plt.xlabel('difference between values')
plt.ylabel('number of differences')
plt.show()

In [None]:
diff.describe()

## Packaging up the best model from our implementation

In [None]:
#filename = 'finalized_model.sav'
#pickle.dump(model, open(filename, 'wb'))
#loaded_model = pickle.load(open(filename, 'rb'))

In [None]:
#%time results_load = loaded_model.predict(X_test['comment'])
#print(findAccuracy(y_test, results_load))
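
For reference, a save-and-reload round trip with `pickle` might look like the sketch below. It uses a tiny made-up pipeline and a temporary directory rather than the trained model and a real path, so the data, names, and file location are all assumptions.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny made-up model standing in for the trained pipeline.
toy_model = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_model.fit(["great fun game", "boring bad game"], [8, 2])

samples = ["great game", "bad game"]

with tempfile.TemporaryDirectory() as tmp:
    filename = os.path.join(tmp, "finalized_model.sav")

    # Save the fitted pipeline; the with-blocks make sure the file is closed.
    with open(filename, "wb") as f:
        pickle.dump(toy_model, f)

    # Load it back as a separate object.
    with open(filename, "rb") as f:
        loaded_model = pickle.load(f)

# The reloaded model should predict exactly what the original does.
same_predictions = list(loaded_model.predict(samples)) == list(toy_model.predict(samples))
print(same_predictions)
```

A pickled scikit-learn model is generally only safe to load with the same library version that saved it, and pickle files from untrusted sources should not be loaded at all.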