<p style="font-family: Arial; font-size:2.75em;color:purple; font-style:bold"><br>
Predicting User Average Rating using Board Games Dataset
</p>
<table>
<col width="550">
<col width="450">
<tr>
<p style="font-family: Arial; font-size:2.75em;">
<td><img src="https://www.theatkinson.co.uk/website/wp-content/uploads/2019/05/Board-games-900x600.jpg" align="middle" style="width:550px;height:360px;"/></td>
<td>

For this study, I am using a board games data set from <a href="https://www.kaggle.com">Kaggle</a>. The <a href="https://www.kaggle.com/gutsyrobot/games-data/data">data set</a> was originally scraped from <a href="https://boardgamegeek.com/">BoardGameGeek</a> (BGG), a board game review site. 
    
The data contains the average rating of each board game along with metadata such as year of publication, playing time and minimum player age.
    
<br>
I was excited to explore this real data set from BGG which involves a large community of board game hobbyists like me. The data set can be analysed for top games in terms of user comments, ratings or ease of understanding. 
<br>

Finally, the goal of the project is to build a model to predict average user ratings and study important features for the model. 
<br>
    
This report is written with the following audiences in mind:
* **Players**: To check whether a board game has been rated well before making a buying decision.

* **Marketing Analysts**: To predict the average rating for a game before introducing it to the marketplace. If the rating is poor, to focus on which features can improve it.

* **Creators**: To analyse from historical data what kinds of games have been rated high or low and which features are important.


<br>
<br>
</td>
    </p>
</tr>

</table>


## IMPORT LIBRARIES

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import sklearn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn import model_selection
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from math import sqrt
sns.set(color_codes=True)

## READ CSV

In [None]:
games_df = pd.read_csv('../input/games-data/games.csv')

Let us now explore our data.

In [None]:
games_df.head()


In [None]:
games_df.shape

**81312 records across 20 features**

In [None]:
games_df.columns

## DATA ATTRIBUTES

The data description is as follows

* **id** - ID of the game. This field should be unique, which is verified in subsequent cells.
* **type** - Type of Board games
* **name** - Name of the game
* **yearpublished** - The year a game was published. (float format)
* **minplayers** - Minimum number of players that can play the game. 
* **maxplayers** - Maximum number of players that can play the game.
* **playingtime** - average playing time to finish the game.
* **minplaytime** -  minimum playing time to finish the game.
* **maxplaytime** - maximum playing time to finish the game.
* **minage** - minimum age of a player 
* **users_rated** - number of users that rated the game.
* **average_rating** - average rating of the game. This will be the target variable for my study.
* **bayes_average_rating** - BoardGameGeek replaces the plain average with a Bayesian average. In Bayesian statistics we start from a value based on prior assumptions and update this prior as evidence comes in. More can be read at https://www.evanmiller.org/bayesian-average-ratings.html
* **total_owners** - Total number of people who own the game.
* **total_traders** - Total number of traders selling the game on the marketplace.
* **total_wanters** - Current number of BGG members who are willing to purchase this game on the marketplace.
* **total_wishers** - Current number of users who have added the game to their wish list.
* **total_comments** - Total number of comments received from users for a game.
* **total_weights** - Total weight assigned by all users when rating how difficult a game is to understand.
* **average_weight** - A community rating for how difficult a game is to understand. Weight is scored on a scale from 0.0 to 5.0, with 5 being a heavy game that is difficult to comprehend.


In [None]:
games_df.describe().transpose()

**Preliminary analysis shows:**
* We can see 0's as minimum values for many columns: users_rated, average_rating, minplayers, maxplayers, minage etc. 
* average_rating is on a scale of 0 to 10.  
* yearpublished has negative values
* minage ranges from 0 to 120 
* average_weight is on a scale of 0 to 5

Let us see the proportion of boardgame and boardgame expansion categories.


In [None]:
games_df['type'].value_counts()

## DATA CLEANING

**Check for nulls**

In [None]:
games_df.isnull().any()

In [None]:
games_df.isnull().any(axis=1).any()


In [None]:
games_df.isnull().sum()

In [None]:
with_null=games_df.shape[0]
with_null

In [None]:
games_df.dropna(inplace=True)  # drop every row that contains at least one NaN
games_df.isnull().any().any()

In [None]:
games_df.shape

In [None]:
without_null=games_df.shape[0]
without_null
print("%d removed after dropping nulls"%(with_null-without_null))

**Check for duplicates**

In [None]:
games_df.duplicated().value_counts()

In [None]:
games_df['id'].nunique() 
#Business sense dictates us to look at unique ID values rather than names which might have clerical errors!

In [None]:
games_df.drop_duplicates(subset ="id", inplace = True)
games_df.shape

In [None]:
no_dups=games_df.shape[0]
no_dups
print("%d duplicates removed from data set"%(without_null-no_dups))

## EDA and VISUALIZATION
**Analysis of years in which games were published.**
<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">
The data set has negative years and year = 0. The data type of the year column is float.

In [None]:
games_df['yearpublished'].dtype

In [None]:
games_df['yearpublished'] = games_df['yearpublished'].astype(int)

In [None]:
games_df['yearpublished'].min()#-ve dates

In [None]:
games_df[games_df['yearpublished']>0].yearpublished.min()

In [None]:
games_df[games_df['yearpublished']<=0].yearpublished.count()

In [None]:
games_df[games_df['yearpublished']==0].yearpublished.count()

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('year published vs average rating',fontsize=10)
axis.set_xlabel('year published',fontsize=10)
axis.set_ylabel('average rating',fontsize=10)

axis.scatter(games_df['yearpublished'],games_df['average_rating'])
plt.show()



Very few games were published during the BC era. The majority of the games were published in later years. The descriptive statistics show that 75% of the board games were published after 1900.
There does not seem to be any relation between year published and average rating.

In [None]:
#plt.figure(figsize=(5,5))
plt.hist(games_df['yearpublished'])

plt.xlabel('year of publishing the game')
plt.ylabel('# of Board games')
plt.title('Histogram')

plt.grid(True)

plt.show()

For now, I only want to look at the distribution of yearpublished.
<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">
How to work with negative years and yearpublished=0?
    
Initially, I removed the records where yearpublished was negative or 0.
There are 7690 such records.

But on closer inspection of the data and some history, I found that negative years represent years BC (Before Christ).
The minimum value is -3500, which is 3500 BC.
The corresponding game is "Senet", an ancient game (https://en.wikipedia.org/wiki/Senet)
that can be traced to the BC period.
Hence, I will retain the negative values.

A year of 0 is not valid, however, because 1 BC is followed directly by 1 AD. The games where yearpublished is 0 are games like carrom or unpublished prototypes. It looks like either BGG could not trace the history or the data was not available. 

I will remove the records where yearpublished is 0 LATER. There are **7668** such records.
Years like 220 and 500 are 220 AD and 500 AD.
We are currently in 2020 CE or 2020 AD.

**Challenge with data type of year**
As the data uses negative values for BC years, the date conversion pd.to_datetime() did not work for me and threw out-of-bounds errors. I would like to explore this further, but for now I will leave the data type of this field as int
and move on to checking the histogram.
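
One way to keep the year as an int and still present it readably is a small helper that labels each value with its era. This is only a sketch: `format_year` is a hypothetical helper name (not part of pandas), and treating 0 as "unknown" is the assumption made in this report rather than something the data set documents.

```python
def format_year(year):
    """Label an integer year with its era; 0 is treated as unknown (assumption)."""
    if year == 0:
        return "unknown"
    if year < 0:
        # Negative values are read as years BC, e.g. -3500 is 3500 BC
        return f"{-year} BC"
    return f"{year} AD"

labels = [format_year(y) for y in (-3500, 0, 220, 2020)]
print(labels)
```

If useful, it could be applied to the column with `games_df['yearpublished'].map(format_year)` for display purposes only, leaving the int column untouched for plotting.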



Let us check which game is the oldest 

In [None]:
games_df[games_df['yearpublished']==-3500]['name']#-ve dates
#Senet is the oldest game available in our data set

**How many records have a value of 0 in yearpublished?**

In [None]:
games_df2=games_df[games_df['yearpublished']!=0]
print(games_df2.shape)
print(games_df.shape[0]-games_df2.shape[0])

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">ANALYSIS OF average_rating and users_rated

Let us plot histogram for average rating.

In [None]:
plt.figure(figsize=(5,5))
plt.hist(games_df["average_rating"])

plt.xlabel('Average Rating')
plt.ylabel('# of Board games')
plt.title('Histogram')

plt.grid(True)

plt.show()

Many records have an average rating of 0.
I will check whether any user has rated these records.

In [None]:
print(games_df['average_rating'].min())
print(games_df['users_rated'].min())

In [None]:
games_df[games_df['average_rating'] == 0]['users_rated'].describe()


In [None]:
#Rating cannot be 0, Users have not rated the game.
print(games_df[(games_df['average_rating']==0)].id.count())
print(games_df[(games_df['users_rated']==0)].id.count())

**24355** board games have no user ratings, which is why these records have an average_rating of 0.

In [None]:
games_df[(games_df['average_rating']!=0.0)].average_rating.sort_values().head(5)
#0 rating rows should be dropped from the analysis,as 0 rating means game was not rated or the data is not available

**Remove records with 0 average rating**

In [None]:
games_df3 = games_df[games_df["average_rating"]==0]
games_df3.shape

In [None]:
# Keep only rows with average_rating > 0 and users_rated > 0; a rated game cannot have an average below 1.
games_df = games_df[games_df["average_rating"]>0]
print(games_df.shape)
games_df = games_df[games_df["users_rated"] > 0]
print(games_df.shape)


Plot the histogram for average rating again

In [None]:
plt.figure(figsize=(5,5))
plt.hist(games_df["average_rating"])

plt.xlabel('Average Rating')
plt.ylabel('# of Board games')
plt.title('Histogram')

plt.grid(True)

plt.show()


Check describe function again

In [None]:
games_df.describe().transpose()

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">Analysis of no of players and playing time- Can these columns take value 0?

From the describe() output, I saw that the min value for fields like minplayers, maxplayers, playingtime, minplaytime, maxplaytime and minage is 0. I want to analyse how many records we have with such data.

I plan to remove the records that meet any of these conditions:

* either minplayers or maxplayers = 0 
* any of the 3 fields playingtime, minplaytime, maxplaytime = 0
* minage = 0, which might mean the game suits children under 1 year old. But would anyone expect an infant of a few months to play a board game? Hence, I plan to remove these as well

In [None]:
games_df[(games_df['maxplayers'] == 0) | (games_df['minplayers'] == 0)].id.count()
#3229 records

In [None]:
games_df[(games_df['playingtime'] == 0) | (games_df['minplaytime'] == 0) | (games_df['maxplaytime'] == 0)].id.count()
#9712 records

In [None]:
games_df[(games_df['minage'] == 0) ].id.count()
#12319 records

In [None]:
games_df.shape
#55064-3229-12319-9712

In [None]:
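# Keep only rows where every player, age and playtime field is positive.
# Named valid_mask to avoid shadowing the built-in filter().
valid_mask=(games_df['maxplayers'] > 0) & (games_df['minplayers'] > 0) & (games_df['minage'] > 0) & \
           (games_df['playingtime'] > 0) & (games_df['minplaytime'] > 0) & (games_df['maxplaytime'] > 0) 

games_df = games_df[valid_mask]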
games_df.shape

**Analysis of years in which games were published.**
Coming back to this field, let us see if earlier cleaning has changed our data.

The data set has negative years and year = 0. The data type of the year column is now int.

In [None]:
games_df['yearpublished'].min()#-ve dates 
#minimum value is same

In [None]:
games_df[games_df['yearpublished']>0].yearpublished.min()
#Change in the min year in AD era

In [None]:
games_df[games_df['yearpublished']<0].yearpublished.count()
#12 games in BC era

A filter was created with the condition that min and max players, min age, and min and max playing time are all greater than 0.
After filtering out the records that fail it, we are left with **37005** records.

Let us check describe function again and see descriptive statistics of our data.

In [None]:
games_df.describe().transpose()

In [None]:
games_df[games_df['yearpublished']==0].yearpublished.count()
#1587 records where year published is 0. I will remove these records.

A year of 0 is not valid, as 1 BC is followed directly by 1 AD. The games where yearpublished is 0 are games like carrom or unpublished prototypes. It looks like either BGG could not trace the history or the data was not available. 

**Remove records where years published is 0**

In [None]:
games_df=games_df[games_df['yearpublished']!=0]
print(games_df.shape)


In [None]:
print(games_df.shape)


The final data looks good now. The summary of **data cleaning and preparation** is as follows.
We started with **81312** records in our data set.

* Null values: 44 records removed.
* Duplicates: 1849 duplicate records removed.
* Average rating of 0 because no user rated the game: 24355 such records removed, leaving 55064.
* Invalid data for minage: 12319 records where the minimum age is 0.
* Invalid data for minplayers and maxplayers: 3229 records with 0 in either field.
* Invalid data for playingtime, minplaytime and maxplaytime: 9712 records with 0 in any of these fields.
* The three counts above overlap, so the combined filter removed 18059 records in total, leaving 37005.
* yearpublished: The year is kept in int format because of the BC and AD values. I am keeping negative values as they represent BC. I removed the records where yearpublished = 0, *1587* such records after all the above cleaning. The original data set had 7668 such records.
* After data cleaning and preparation, we are left with **35418** records.
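
The bookkeeping in this summary can be automated. The following is a minimal sketch on a toy DataFrame, not the cleaning code used above: `cleaning_funnel` and the step labels are hypothetical names, and the toy values are made up for illustration. Because each step runs on what the previous step left behind, the logged counts never overlap.

```python
import pandas as pd

def cleaning_funnel(df, steps):
    """Apply (label, row_filter) steps in order and record rows removed by each."""
    log = []
    for label, keep in steps:
        before = len(df)
        df = df[keep(df)]          # keep only rows where the filter is True
        log.append((label, before - len(df)))
    return df, log

# Toy data standing in for games_df (values are illustrative only)
toy = pd.DataFrame({
    "average_rating": [0.0, 6.5, 7.0, 5.5],
    "minage": [8, 0, 10, 12],
    "yearpublished": [2000, 1995, 0, 2010],
})
steps = [
    ("unrated", lambda d: d["average_rating"] > 0),
    ("minage is 0", lambda d: d["minage"] > 0),
    ("year is 0", lambda d: d["yearpublished"] != 0),
]
cleaned, log = cleaning_funnel(toy, steps)
for label, removed in log:
    print(f"{label}: {removed} removed")
```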

<p style="font-family: Arial; font-size:1.75em;color:purple; font-style:bold"><br> Top Board Games</p>

**Further exploring the data**
What is the mean playing time for all the games put together?

In [None]:
games_df['playingtime'].mean()
# the mean has increased after all the cleaning and preparation

**Which board game has the highest number of comments, and in which year was it published?**

In [None]:
games_df['total_comments'].max()

In [None]:
games_df[(games_df['total_comments']==games_df['total_comments'].max())]['name']

In [None]:
games_df[(games_df['total_comments']==games_df['total_comments'].max())]['yearpublished']

In [None]:
games_df[games_df['total_comments']== games_df['total_comments'].max()]

**Which games have received least number of comments?**

In [None]:
games_df[games_df['total_comments']== games_df['total_comments'].min()]

**What was the average minage of all games per game "type"? (boardgame & boardgameexpansion)**

In [None]:
# Select minage before aggregating so the non-numeric 'name' column is never averaged
games_df.groupby('type')['minage'].mean()
#board game expansions have a higher mean minage; creators may want to look into why

**Is there a correlation between average_rating and the Bayesian rating for the games?**
I will also check the scatter plot of these two fields.

In [None]:
games_df[['average_rating','bayes_average_rating']].corr() # very low correlation

In [None]:
high_rated = games_df['average_rating'].sort_values().value_counts()
high_rated[:10]


The majority of the games are rated 6 and 5. Games with the highest rating of 10 are the fewest.

In [None]:
high_rated_games = games_df[games_df['average_rating']==10][['name','total_comments','users_rated']]
high_rated_games[:10]
#a misleading result: very few users rated these games, which pushes the average rating to a higher value.


**Top games with respect to total users rated, comments ,bayes rating**

In [None]:
games_df[['name','users_rated','total_comments','average_rating','bayes_average_rating']].sort_values('total_comments',ascending=False)[:10]

In [None]:
# Bar chart of the 10 games with the most comments, labelled by game name
top_commented=games_df[['name','total_comments']].sort_values('total_comments',ascending=False)
top_commented[:10].plot(kind='bar', x='name', y='total_comments')

In [None]:
top_users_rated=games_df[['name','users_rated','total_comments','average_rating','bayes_average_rating']].sort_values('users_rated',ascending=False)
top_users_rated[:10]

In [None]:
# Bar chart of the 10 games rated by the most users, labelled by game name
top_users_rated=games_df[['name','users_rated']].sort_values('users_rated',ascending=False)
top_users_rated[:10].plot(kind='bar', x='name', y='users_rated')

**Scatter plots for visualizing the relation between features**

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('Scatter Plot',fontsize=10)
axis.set_ylabel('Average Rating',fontsize=10)
axis.set_xlabel('Bayes Average Rating',fontsize=10)

X = games_df['bayes_average_rating']
Y = games_df['average_rating']

axis.scatter(X, Y)
plt.show()

It looks like bayes_average_rating is 0 even when the average rating is high.
We have cases where the Bayes rating adjusts the average rating to a lower or higher value. The majority of Bayes ratings lie in the range 5-8. Let us see how the Bayes rating varies with the number of users and the year published. From the previous graph, we saw that games published after 1900 received higher ratings, which may also be attributed to improved games. We would like to see how the Bayes rating changed.


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('Scatter Plot',fontsize=10)
axis.set_ylabel('# of Users rated',fontsize=10)
axis.set_xlabel('Bayes Average Rating',fontsize=10)

X = games_df['bayes_average_rating']
Y = games_df['users_rated']

axis.scatter(X, Y)
plt.show()

If a game was rated by a large number of users, its Bayes rating was at a higher level. The Bayes rating thus compensates for a low number of votes: a high rating from only a few users can be misleading, and the Bayes rating adjusts for that.


In [None]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, axis = plt.subplots()
# Grid lines, Xticks, Xlabel, Ylabel

axis.yaxis.grid(True)
axis.set_title('Scatter Plot',fontsize=10)
axis.set_ylabel('years published',fontsize=10)
axis.set_xlabel('Bayes Average Rating',fontsize=10)

X = games_df['bayes_average_rating']
Y = games_df['yearpublished']

axis.scatter(X, Y)
plt.show()

In [None]:
games_df[['yearpublished','users_rated','average_rating','bayes_average_rating']].groupby('yearpublished').mean()

From the output of the above cells, we can observe that the Bayes rating is lower than the average rating where the number of users who rated the game is small. If the number of users is large, the Bayes rating might not change much. There can be other factors affecting the Bayes rating. For 2017, the mean of users_rated is only 7.5, and despite a high mean average rating of 6.8, the Bayes rating dropped to 0.
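
The shrinkage effect described here can be sketched with the common textbook form of a Bayesian average, which blends a game's own mean with a prior mean. The prior mean and prior weight below are hypothetical values chosen for illustration; the data set does not state which values BGG uses, and this sketch does not explain why some games receive a Bayes rating of exactly 0.

```python
def bayesian_average(mean_rating, n_ratings, prior_mean=5.5, prior_weight=100):
    """Blend a game's mean rating with a prior; both prior values are assumptions."""
    # The prior acts like prior_weight imaginary votes at prior_mean
    total = prior_mean * prior_weight + mean_rating * n_ratings
    return total / (prior_weight + n_ratings)

few = bayesian_average(10.0, 2)        # perfect score, but only 2 votes
many = bayesian_average(8.0, 10000)    # lower score, but 10000 votes
print(round(few, 3), round(many, 3))
```

With only two votes the perfect 10 is pulled most of the way back to the prior, while the heavily rated game stays close to its own mean, which matches the pattern seen in the scatter plots.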


**For our project, average_rating is the target (response) variable, so we will analyse the rest of the features against average_rating.**

In [None]:
plt.scatter(games_df['users_rated'],games_df['average_rating'],color='b')
plt.ylabel('Average Rating')
plt.xlabel('# of Users rated the games')
plt.title('Scatter plot')

plt.grid(True)

plt.show()


In [None]:
plt.scatter(games_df['average_weight'],games_df['average_rating'],color='b')
plt.ylabel('Average Rating')
plt.xlabel('Average Weight')
plt.title('Scatter plot')

plt.grid(True)

plt.show()


I have wrapped the scatter plot in a function, which is then called in a for loop.

In [None]:
def plot_scatter(df, x, y):
    #function to plot scatter plots
    fig, axis = plt.subplots()
    # Grid lines, Xticks, Xlabel, Ylabel

    axis.yaxis.grid(True)
    axis.set_title('Scatter Plot',fontsize=10)
    axis.set_xlabel(x,fontsize=10)
    axis.set_ylabel(y,fontsize=10)

    X = df[x]
    Y = df[y]

    axis.scatter(X, Y)
    plt.show()
    #End of function



In [None]:
plot_df=games_df
plot_df=plot_df.drop(columns=['id','name','type','average_rating'])

In [None]:
#fig, axs = plt.subplots(1,2)
for i in plot_df:
    print(i+" Vs average rating")
    plot_scatter(games_df, i, 'average_rating')
  

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">INFERENCES from  Plots
    
* The features don't show a linear relation with average_rating.
* Only average_weight shows a slightly linear relationship. We will cross-check this with the correlation plot.

## CORRELATION

In [None]:
corr_df=games_df
corr_df=corr_df.drop(columns=['id','name','type'])
corr_df.columns

In [None]:
corr_df.corr()

In [None]:
sns.set(rc={'figure.figsize':(15,15)})
sns.heatmap(corr_df.corr(), annot=True)
plt.show()

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">INFERENCE
    
* playingtime, minplaytime and maxplaytime are highly correlated with each other
* users_rated, total_owners, total_traders, total_wanters, total_wishers, total_comments and total_weights are also highly correlated with each other 
* total_owners shows high correlation with users_rated, total_weights and total_comments
* total_wanters and total_wishers are highly correlated
* users_rated is also highly correlated with total_weights and total_comments
* none of the variables shows high correlation with average_rating
* average_weight, which is a scale for gauging how easy a game is to understand, shows some correlation with average_rating, i.e. 0.33
* minage also shows a correlation of 0.3 with average_rating
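
Reading highly correlated pairs off a 15x15 heatmap is error-prone, so they can also be listed programmatically. This is a sketch on made-up data: `correlated_pairs` is a hypothetical helper and the 0.8 threshold is an arbitrary assumption, not a value used elsewhere in this report.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.8):
    """Return column pairs whose absolute Pearson correlation exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack()
    return pairs[pairs > threshold].sort_values(ascending=False)

# Toy columns: 'a' and 'b' move together, 'c' does not
demo = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10.5],
    "c": [5, 1, 4, 2, 3],
})
strong = correlated_pairs(demo)
print(strong)
```

On the real data it could be called as `correlated_pairs(corr_df)`, and one column from each listed pair could then be dropped when building a reduced feature list.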

### Is there a linear relationship between average_weight & average_rating?

In [None]:
plt.figure(figsize=(5,5))
sns.regplot(x="average_weight", y="average_rating", data=games_df)

Extract Features and Target ('average_rating') Values into Separate Dataframes

In [None]:
games_df.columns

In [None]:
#All features
features=['minplayers', 'maxplayers','playingtime', 'minplaytime', 'maxplaytime', 'minage', 'users_rated', 'total_owners',
      'total_traders', 'total_wanters', 'total_wishers', 'total_comments','total_weights', 'average_weight']
    
    
#Features after removing correlated independent variables    
features2=['minplayers', 'maxplayers','playingtime', 'minage', 'users_rated', 'average_weight']
# of these, only average_weight shows some correlation with the target
      
#features=[ 'minage',  'average_weight']

In [None]:
#with features1:
X = games_df[features]
#With features 2
X2 = games_df[features2]

y = games_df['average_rating']


In [None]:
y.mean()

Split the Dataset into Training and Test Datasets.
I create one train/test split for features (all features) and one for features2 (the reduced feature list).

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=222)#feature 1

In [None]:
X_train2, X_test2, y_train2, y_test2 = train_test_split(X2, y, test_size=0.30, random_state=222)#  feature2

## LINEAR REGRESSION

   
* Phase 1: First linear regression model, built on features

* Phase 2: Second model, built on features2 with correlated independent variables removed

In [None]:
regressor = LinearRegression()
regressor.fit(X_train, y_train)

In [None]:
#features 2, correlated variables removed
regressor2 = LinearRegression()
regressor2.fit(X_train2, y_train2)

### result of model with feature 1

In [None]:
print(regressor.coef_)
intercept = regressor.intercept_
print("Intercept",intercept)
#the coefficient of average_weight is the largest: with other features held fixed, a 1-unit increase in average weight raises the predicted rating by 0.27 units

In [None]:
regressor.score(X_train, y_train)

* The coefficient of average_weight is the largest: with the other features held fixed, a 1-unit increase in average weight raises the predicted rating by 0.27 units.

### result of model with  feature 2, correlated independent variables removed

In [None]:
print(regressor2.coef_)

intercept2 = regressor2.intercept_
print("Intercept",intercept2)
print(regressor2.score(X_train2, y_train2))
#the coefficient of average_weight is the largest: with other features held fixed, a 1-unit increase in average weight raises the predicted rating by 0.308 units

* The coefficient of average_weight is the largest: with the other features held fixed, a 1-unit increase in average weight raises the predicted rating by about 0.3 units.
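
Raw coefficients depend on each feature's units, so the largest raw coefficient is not necessarily the most influential feature. A hedged sketch of one way to compare them is to standardize the features first. The data below is synthetic and the variable names are illustrative; `small` plays the role of a 0-5 scale feature like average_weight and `large` the role of a count like users_rated.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
small = rng.normal(0, 1, 500)       # feature on a small scale
large = rng.normal(0, 1000, 500)    # feature on a large scale
y_demo = 0.3 * small + 0.002 * large + rng.normal(0, 0.1, 500)
X_demo = np.column_stack([small, large])

# Coefficients per original unit of each feature
raw = LinearRegression().fit(X_demo, y_demo)

# Coefficients per standard deviation of each feature
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
scaled_coefs = scaled.named_steps["linearregression"].coef_

print("raw:   ", raw.coef_)
print("scaled:", scaled_coefs)
```

In this toy case the raw coefficients rank `small` first, while the standardized coefficients rank `large` first, because `large` varies over a much wider range. The same pipeline could be fitted on `X_train` and `y_train` to check whether average_weight still dominates.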

Prediction using Linear Regression Model

In [None]:
y_prediction = regressor.predict(X_test)#feature 1
y_prediction

In [None]:
y_prediction2 = regressor2.predict(X_test2)#feature 2
y_prediction2

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold"><br>
What is the mean of the expected target value in test set ?

In [None]:
y_test.describe()

In [None]:
from scipy import stats

# Summary statistics of the predictions from the feature 1 model
stats.describe(y_prediction)

In [None]:
from scipy import stats

# Summary statistics of the predictions from the feature 2 model
stats.describe(y_prediction2)

The variance is lower for the feature 2 based model, but the min and max predicted values differ a lot from the actual min and max. Let us calculate the root mean squared error.

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold"><br>
Evaluate Linear Regression Accuracy using Root Mean Square Error

In [None]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
RMSE

In [None]:
RMSE2 = sqrt(mean_squared_error(y_true = y_test2, y_pred = y_prediction2))
RMSE2

RMSE is high for both models, so neither fits the data well.
Next we will use only feature 1, as there is not much difference between the two models.
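
Whether an RMSE counts as high is easier to judge against a naive baseline. The sketch below, on made-up numbers, computes the RMSE of always predicting the training mean; `rmse` and `baseline_rmse` are hypothetical helper names, not functions used elsewhere in this notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

def baseline_rmse(y_train, y_test):
    """RMSE of a model that always predicts the mean of the training target."""
    constant = np.full(len(y_test), np.mean(y_train))
    return rmse(y_test, constant)

# Toy targets: the training mean is 6.0, so both test errors are 1.0
y_train_demo = np.array([5.0, 6.0, 7.0])
y_test_demo = np.array([5.0, 7.0])
base = baseline_rmse(y_train_demo, y_test_demo)
print(base)
```

Calling `baseline_rmse(y_train, y_test)` on the real split would give a reference value: a model is only useful to the extent that its RMSE falls below it.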

## DECISION TREE REGRESSOR

In [None]:
regressor = DecisionTreeRegressor(max_depth=10)
regressor.fit(X_train, y_train)
print(regressor.score(X_train, y_train))

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">Prediction using Decision Tree Regressor


In [None]:
y_prediction = regressor.predict(X_test)
y_prediction

In [None]:
from scipy import stats

# Summary statistics of the decision tree predictions
stats.describe(y_prediction)

In [None]:
#sqrt(0.939)
0.9690201236300513*0.9690201236300513

<p style="font-family: Arial; font-size:1.25em;color:purple; font-style:bold">Evaluate Decision Tree Regression Accuracy using Root Mean Square Error

In [None]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
RMSE

## ADABOOST REGRESSOR

In [None]:
from sklearn.ensemble import AdaBoostRegressor
_regressor = AdaBoostRegressor()
_regressor.fit(X_train, y_train)

In [None]:
print(_regressor.score(X_train, y_train))

In [None]:
y_prediction = _regressor.predict(X_test)
y_prediction

In [None]:
RMSE = sqrt(mean_squared_error(y_true = y_test, y_pred = y_prediction))
RMSE

**INFERENCE**

* AdaBoost Regression gave the lowest RMSE
* The RMSE values from all 3 models show that the errors are high and that none of the models fits the data well
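
The cells above report R² on the training data only. Comparing it with R² on the test data helps to tell the two failure modes apart: low scores on both suggest underfitting, while a high training score with a much lower test score suggests overfitting. The following sketch uses synthetic data and an unrestricted tree purely for illustration; it is not a result from this data set.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(222)
X_demo = rng.normal(size=(400, 3))
# Only the first feature carries signal; the rest of the target is noise
y_demo = X_demo[:, 0] + rng.normal(scale=1.0, size=400)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=222
)

# No max_depth, so the tree is free to memorise the training set
deep_tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
train_r2 = deep_tree.score(X_tr, y_tr)
test_r2 = deep_tree.score(X_te, y_te)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

Printing `regressor.score(X_test, y_test)` next to the training score for each of the three models above would give the same comparison on the real data.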


## CONCLUSIONS

The high RMSE limits the use of these models in a production setting. The
data has inherent limitations. From our study, we can see that more informative
features are required to make the model robust. We do not have data on
the demographics of users that could be incorporated in the model. The data also gives
little information about the game itself.
Additional information, such as the category of a game or tags that describe its
theme (e.g. Strategy, Mystery, Wargame, Sports), could have improved our
model.
Text analytics on user reviews could also help to show which
features influence user ratings.
Due to data quality issues, the models were highly underfit. 
Still, we can conclude that our features capture 44% of the variance in average_rating.

## FUTURE WORK

It was exciting to work on real data, even though it might not give the expected results. In future, I
would like to apply unsupervised learning and create clusters on this data. Feature
engineering can also be used to create new features from the available data.
The Bayesian average rating is another area that would be interesting to look at.

## REFERENCES

1. https://www.evanmiller.org/bayesian-average-ratings.html
2. https://github.com/ThaWeatherman/scrapers/tree/master/boardgamegeek
3. https://en.wikipedia.org/wiki/BoardGameGeek
4. https://www.kaggle.com/gutsyrobot/games-data/data
5. https://www.kaggle.com/thuwaarahanragu/basic-data-visualization