# BGG Average Rating Prediction

## Step 1: Frame the problem

- Objective: Predict the average rating a board game will receive from attributes such as the number of players, difficulty, playing time, etc.

- Today we have access to a wide variety of products from many domains. Whether we are shopping, watching movies or playing games, we can do all of it online. The ability of consumers to share reviews of the products and services they receive has had a massive influence on shopping behaviour.

- Products with very good ratings have a high chance of ranking near the top of 'top product' lists. Many positive reviews boost sales by attracting new consumers, because positive reviews give customers trust in the quality of the product. By the same token, a few negative reviews can drive consumers away. A major benefit of a review or rating system is the resulting decline in malfunctioning products.

- On the BoardGameGeek (BGG) website, from which the data for this problem was collected, game ratings play a crucial role in attracting new players to highly rated games. Ratings are on a scale of 1 to 10, with 1 being 'Awful' and 10 being 'Outstanding'. For more information about the ratings, visit https://boardgamegeek.com/wiki/page/Ratings&redirectedfrom=rating#


- Data Description:
There are 20 columns in total for each game.
The following features are associated with each game:
    - id: Identifier of the game
    - type: Type of game
    - name: Name of game
    - yearpublished: Year in which the game was published
    - minplayers & maxplayers: Minimum and maximum number of players allowed to participate in the game
    - playingtime: Allowed playing time (maximum)
    - minplaytime & maxplaytime: Minimum and maximum allowed playing time
    - minage: Minimum age required to play
    - users_rated: Total number of users who rated the game
    - average_rating: Average rating of the game
    - bayes_average_rating: Bayesian average rating of the game
    - total_owners: Total number of players who own this game
    - total_traders: Total number of players who want to trade this game away
    - total_wanters: Total number of players who want this game in trade
    - total_wishers: Total number of players who added this game to their wishlist
    - total_comments: Total number of comments
    - total_weights: Total number of people who gave this game a Game Play Weight
    - average_weight: Average weight of this game

Important Terms:
- Rating: range 1-10. For more info: https://boardgamegeek.com/wiki/page/Ratings&redirectedfrom=rating#
- Game Play Weight: range 1-5. Community rating for how difficult a game is to understand. Lower rating (lighter weight) means easier. For more info: https://boardgamegeek.com/wiki/page/Weight

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings("ignore")

In [None]:
data = pd.read_csv('../input/board-games-prediction-data/games.csv')
print(data.shape)
data.head()

## Step 2: Data Exploration

In [None]:
data_explore = data.copy()

- The average rating of a game does not depend on its id, type or name, so we will drop these columns from the dataset.
- The Bayes average rating is another form of rating, used to ensure that highly rated minority-interest games are ranked lower than highly rated mass-interest games. Because it is derived from the ratings themselves, using it as a feature would leak the target, so we should not use this column for making predictions.
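
To see why a Bayesian average behaves differently from a plain average, here is a minimal sketch of a generic Bayesian average. The prior mean of 5.5 and the weight of 100 dummy votes are assumptions chosen for illustration (the constants BGG actually uses are not part of this dataset), and `bayes_average` is a hypothetical helper.

```python
def bayes_average(ratings, prior_mean=5.5, prior_votes=100):
    """Generic Bayesian average: mix the real votes with `prior_votes`
    dummy votes of value `prior_mean` (both values are assumptions)."""
    total = prior_mean * prior_votes + sum(ratings)
    return total / (prior_votes + len(ratings))


# A niche game with 10 votes of 9.0 versus a popular game with 5000 votes of 8.0
niche = bayes_average([9.0] * 10)
popular = bayes_average([8.0] * 5000)

print(round(niche, 2), round(popular, 2))  # 5.82 7.95
```

With few votes the score stays close to the prior, so the niche game ranks below the popular one even though its plain average is higher.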

In [None]:
drop_features = ['id', 'type', 'name', 'bayes_average_rating']
data_explore = data_explore.drop(columns=drop_features)

In [None]:
data_explore.info()

- Following columns contain null values: yearpublished, minplayers, maxplayers, playingtime, minplaytime, maxplaytime, minage
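
The list of columns with nulls can also be obtained programmatically rather than read off the `info()` output. A small sketch on a made-up frame (the values are illustrative, not taken from the BGG data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset
toy = pd.DataFrame({
    "yearpublished": [2001, np.nan, 1999, 2010],
    "minplayers": [2, 2, np.nan, np.nan],
    "users_rated": [10, 0, 5, 7],
})

null_counts = toy.isnull().sum()                       # nulls per column
cols_with_nulls = null_counts[null_counts > 0].index.tolist()

print(cols_with_nulls)  # ['yearpublished', 'minplayers']
```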

In [None]:
data_explore.describe()

- Observations:
    - The minimum value of average rating and average weight is zero, which is invalid according to the information on the BGG website. The expected minimum for both features is 1.
    - The minimum value in the yearpublished column is negative, indicating that some games have an invalid publication year.
    - The minimum value in the users_rated column indicates that there are games which no user has rated.
    - The statistics of the playingtime and maxplaytime columns are the same, which suggests these columns are identical.
    - There are games with zero maximum playing time, or with zero minimum and/or maximum players. These can be considered invalid records.
    - There is no categorical feature in this dataset.
    - Comparing the max and 75% values of columns such as maxplaytime, minplaytime, users_rated, maxplayers and minplayers tells us that there are outliers in the dataset.
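
Matching summary statistics only suggest that two columns are identical. A direct check is cheap; the sketch below uses a made-up frame, and `Series.equals` treats NaNs in the same position as equal.

```python
import numpy as np
import pandas as pd

# Toy frame: playingtime duplicates maxplaytime, including the missing value
toy = pd.DataFrame({
    "playingtime": [30.0, 60.0, np.nan, 120.0],
    "maxplaytime": [30.0, 60.0, np.nan, 120.0],
    "minplaytime": [20.0, 60.0, np.nan, 90.0],
})

same_as_max = toy["playingtime"].equals(toy["maxplaytime"])
same_as_min = toy["playingtime"].equals(toy["minplaytime"])

print(same_as_max, same_as_min)  # True False
```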
    
Let's first focus on the average rating.

In [None]:
def plot_histogram(data):
    ax = plt.gca()
    counts, _, patches = ax.hist(data)
    for count, patch in zip(counts, patches):
        if count>0:
            ax.annotate(str(int(count)), xy=(patch.get_x(), patch.get_height()+5))
    if data.name:
        plt.xlabel(data.name)

In [None]:
plt.figure(figsize=(15, 25))
i=1
for col in data_explore.columns:
    plt.subplot(6, 3, i)
    plot_histogram(data_explore[col])
    i+=1

In [None]:
plt.title('Histogram of Average Ratings')
plot_histogram(data_explore['average_rating'])

There are almost 24000 records with a zero average rating. According to BoardGameGeek, the minimum rating any game can receive is 1, so the records of games with a zero average rating are definitely not going to be useful for us.

Let's explore these zero-rated games further.

In [None]:
data_explore_zero_ratings = data_explore[data_explore['average_rating']==0]
data_explore_zero_ratings.describe()

The above stats tell us that these games have a zero average rating because no users have rated them; hardly 1 or 2 users own or want these games.

Let's get rid of the games with an average rating equal to zero.

In [None]:
data_explore = data_explore[data_explore['average_rating']>0]
data = data[data['average_rating']>0] # making this change in the original dataframe

In [None]:
plt.title('Histogram of Average Ratings')
plot_histogram(data_explore['average_rating'])

We also observe that there are games having average weight equal to zero. We should get rid of those records also.

In [None]:
plt.title('Histogram of Average Weight')
plot_histogram(data_explore['average_weight'])

In [None]:
data_explore = data_explore[data_explore['average_weight']>0]
data = data[data['average_weight']>0] # making this change in the original dataframe

We also saw that for some games the year of publication is negative.

In [None]:
plot_histogram(data_explore['yearpublished'])

We can see that for most games the year of publication is above 1500.

In [None]:
data_explore = data_explore[data_explore['yearpublished']>0]
plot_histogram(data_explore['yearpublished'])

Again, there are not many games from before the middle of the 19th century.

In [None]:
data_explore_1920 = data_explore.query('yearpublished > 1900 and yearpublished < 2000')
plot_histogram(data_explore_1920['yearpublished'])

- For our problem, I will focus on the games which were published after 1950.
- Dropping records just because they are old might not be a good move. But I prefer to keep information from the period in which board games were more popular and many people were playing them, because the model we are going to build is meant to help us make better decisions about new, upcoming games.

In [None]:
data_explore = data_explore[data_explore['yearpublished']>1950]
data = data[data['yearpublished']>1950]
plot_histogram(data_explore['yearpublished'])

- It can be seen that the popularity of board games started to rise in the late 90's.
- The popularity of board games affects the quality, quantity, longevity and competition among games. In my opinion these factors can influence the average rating of games.

- Let's compare the average ratings of board games published in 1975-1985, when games were moderately to less popular, with those of games published in 2005-2015.

In [None]:
data_7585 = data_explore.query('yearpublished >= 1975 and yearpublished <= 1985')
data_0515 = data_explore.query('yearpublished >= 2005 and yearpublished <= 2015')

plt.figure(figsize=(15, 8))
plt.subplot(1, 2, 1)
plt.scatter(data_7585['yearpublished'], data_7585['average_rating'], s=data_7585['average_weight']*10)
plt.title("1975-85 (Games {})".format(data_7585['yearpublished'].count()))
plt.xlabel('Published year')
plt.ylabel('Average rating')
plt.subplot(1, 2, 2)
plt.scatter(data_0515['yearpublished'], data_0515['average_rating'], s=data_0515['average_weight']*10)
plt.title("2005-15 (Games {})".format(data_0515['yearpublished'].count()))
plt.xlabel('Published year')
plt.show()

Look at both charts for ratings above 7. There are many games in 2015 with an average rating above 8, whereas in 1975-85 there were hardly any games with an average rating higher than 7.

Note: The size of a circle indicates the complexity of the game: the larger the circle, the higher the complexity.
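
The visual comparison can also be expressed as a number, for example the share of games in each era rated above 7. The sketch below uses a small made-up frame rather than the BGG data, and `share_above` is a hypothetical helper.

```python
import pandas as pd

# Made-up records for illustration only
toy = pd.DataFrame({
    "yearpublished": [1976, 1980, 1984, 2006, 2010, 2014, 2015],
    "average_rating": [5.5, 6.1, 7.2, 6.8, 7.4, 8.1, 7.9],
})


def share_above(df, start, end, threshold=7.0):
    """Fraction of games published in [start, end] rated above `threshold`."""
    era = df[df["yearpublished"].between(start, end)]
    return (era["average_rating"] > threshold).mean()


share_7585 = share_above(toy, 1975, 1985)
share_0515 = share_above(toy, 2005, 2015)
print(share_7585, share_0515)
```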

Moving on, one of the key observations we made earlier was that some of the columns contain outliers. Let's focus on some of those columns.

In [None]:
columns = ['minplaytime', 'maxplaytime', 'minplayers', 'maxplayers', 'users_rated']
plt.figure(figsize=(15, 8))
sns.boxplot(data=data_explore[columns])
plt.ylim(-100, 500)

- We can see that there are outliers in each of the above features. Let's calculate the total number of outliers in each feature column.
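
The count uses the common 1.5 × IQR rule: a value is flagged as an outlier if it lies below Q1 − 1.5·IQR or above Q3 + 1.5·IQR. A worked sketch on a handful of made-up playing times:

```python
import numpy as np

# Made-up playing times in minutes; 6000 is an obvious outlier
playtimes = np.array([30, 45, 60, 60, 90, 6000])

q1, q3 = np.percentile(playtimes, [25, 75])   # 48.75 and 82.5
iqr = q3 - q1                                 # 33.75
lower_fence = q1 - 1.5 * iqr                  # -1.875
upper_fence = q3 + 1.5 * iqr                  # 133.125

outliers = playtimes[(playtimes < lower_fence) | (playtimes > upper_fence)]
print(outliers)  # [6000]
```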

In [None]:
Q1 = data_explore.quantile(0.25)
Q3 = data_explore.quantile(0.75)
IQR = Q3 - Q1
((data_explore < (Q1 - 1.5 * IQR)) | (data_explore > (Q3 + 1.5 * IQR))).sum()

- As we can see, each column has many outliers.
- Since the data was collected by web scraping, it is possible that some outliers are the result of mistakes made in the data-collection process.
- For now, we will not drop the outliers. Outliers do not necessarily affect the model's performance.
- Since there are outliers, we will replace the null values with the median, which is less sensitive to extreme values than the mean.
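
The preference for the median is easy to see on a tiny example. With one extreme value in a made-up column, the mean is pulled far away from the typical game while the median is not:

```python
import numpy as np

# Made-up playing times in minutes with a single extreme value
playtimes = np.array([30, 45, 60, 60, 90, 6000])

mean_fill = playtimes.mean()        # dragged up by the outlier
median_fill = np.median(playtimes)  # stays at a typical value

print(mean_fill, median_fill)  # 1047.5 60.0
```

Filling missing playing times with 1047.5 minutes would misrepresent almost every game, whereas 60 is a plausible value.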

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy='median')

In [None]:
data_columns = data_explore.columns
data_explore = imputer.fit_transform(data_explore)
data_explore = pd.DataFrame(data=data_explore, columns=data_columns)

We also saw earlier that the playingtime column is the same as maxplaytime, so we will drop the playingtime column.

In [None]:
drop_features.append('playingtime')
data_explore = data_explore.drop(columns=['playingtime'])

Now let's see how the features are correlated with the average rating.

In [None]:
plt.figure(figsize=(15, 10))
corr_matrix = data_explore.corr()
sns.heatmap(corr_matrix, mask=np.zeros_like(corr_matrix, dtype=bool), square=True, annot=True)  # np.bool was removed in NumPy 1.24; use the built-in bool

In [None]:
corr_matrix['average_rating'].sort_values(ascending=False)

- Observations:
    - Average rating is correlated with the complexity of a game and the year in which it was published. Average rating is less correlated with the number of users who rated the game. I feel this low correlation is a good indicator, because ratings should depend more on what users think about a game than on how many users rate it.
    - Average rating is more correlated with the number of people who want the game in trade than with the number of people who own the game.
    - The rating of a game shows almost no linear correlation with the number of players or the playing time of the game.
    - From the correlation plot, we can see that many features are fairly strongly correlated with other features.
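
One caveat when reading these numbers: the Pearson correlation that `corr()` computes by default measures only linear association, so a value near zero does not by itself prove that a feature is unrelated to the rating. A minimal sketch with synthetic data where `y` is fully determined by `x` yet the correlation is zero:

```python
import numpy as np

x = np.arange(-5, 6, dtype=float)   # symmetric around zero
y = x ** 2                          # y depends entirely on x, but not linearly

corr = np.corrcoef(x, y)[0, 1]
print(abs(corr) < 1e-12)  # True
```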

## Step 3: Data Preprocessing

In [None]:
data.shape

The original dataset had more than 80000 records, but we had to remove many of them because of invalid values in some attributes.

Let's now prepare the training and testing datasets.

In [None]:
y = data[['average_rating']].copy()
X = data.drop(columns=['average_rating'])

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

We are going to build a pipeline which will take care of the data cleaning and model training.

In the data cleaning process we will focus on the following aspects of the data:
1. drop unnecessary features
2. replace null values
3. standardization of features

In [None]:
feature_columns =[ feature for feature in list(X.columns) if feature not in drop_features ]

In [None]:
from sklearn.compose import ColumnTransformer

drop_feature_cols = ColumnTransformer(transformers=[('drop_columns', 'drop', drop_features)], remainder='passthrough')

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

In [None]:
pre_process = Pipeline(steps=[('drop_features', drop_feature_cols),
                              ('imputer', SimpleImputer(strategy="median")),
                              ('scaler', StandardScaler())])

In [None]:
X_train_transformed = pre_process.fit_transform(X_train)
X_test_transformed = pre_process.transform(X_test)

## Step 4: Select and Train a model

In the data exploration step we found two key facts about the dataset:
* Many features are correlated with each other.
* Many features contain outliers.


- Correlated features might affect the performance of a linear model, but they are not a concern for tree-based models.
- I will use the following machine learning algorithms:
    * Linear Regression
    * Decision Tree
    * Random Forest
   
   
- RMSE will be the metric used to evaluate the model's performance.
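
As a reminder of what the metric measures, RMSE is the square root of the mean squared difference between observed and predicted ratings, so it is expressed in rating points. A worked sketch with made-up numbers:

```python
import numpy as np

# Made-up observed and predicted ratings
y_true = np.array([6.0, 7.0, 8.0, 5.0])
y_pred = np.array([6.5, 6.5, 7.0, 6.0])

squared_errors = (y_true - y_pred) ** 2   # [0.25, 0.25, 1.0, 1.0]
mse = squared_errors.mean()               # 0.625
rmse = np.sqrt(mse)                       # about 0.79 rating points

print(round(rmse, 4))  # 0.7906
```

scikit-learn's `"neg_mean_squared_error"` scorer returns the negative of the MSE, which is why the scores have to be negated before taking the square root.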

In [None]:
from sklearn.model_selection import cross_val_score

def cv_results(model, X, y):
    scores = cross_val_score(model, X, y, cv = 7, scoring="neg_mean_squared_error", n_jobs=-1)
    rmse_scores = np.sqrt(-scores)
    print('CV Scores: ', rmse_scores)
    print('rmse: {},  S.D.:{} '.format(np.mean(rmse_scores), np.std(rmse_scores)))

### Linear Regression

In [None]:
from sklearn.linear_model import LinearRegression

linear_reg = LinearRegression()
linear_reg.fit(X_train_transformed, y_train)

In [None]:
coefs = list(zip(feature_columns, linear_reg.coef_[0]))
coefs.sort(key= lambda x:x[1], reverse=True)
coefs

We expect the features that are strongly correlated with the target variable to have higher coefficient values than the others.

Some interesting facts:
- The model has given a higher weight to the 'total_owners' feature than to 'average_weight' and 'total_wanters', even though the latter two are more correlated with average rating than total_owners is.
- Very little weight is given to the 'users_rated' feature compared to other features such as maxplaytime, minplaytime, etc.
    
This indicates that, for predicting the average rating, Linear Regression gives more importance to the number of people who own the game than to the number of people who want it in trade.

In [None]:
print("Linear Regression Model Cross Validation Results")
cv_results(linear_reg, X_train_transformed, y_train)

A variance of 2 in the predicted rating is not very good.
Can we reduce it by reducing the correlation among the feature variables?

Let's apply PCA to remove the correlation among the input features.
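
To illustrate what "removing the correlation" means, the sketch below projects two strongly correlated synthetic features onto their principal axes and checks that the resulting components are uncorrelated. It uses NumPy's SVD on centered data (the decomposition underlying PCA) rather than scikit-learn's `PCA` class, and the data is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + 0.1 * rng.normal(size=500)      # b is almost a copy of a
X = np.column_stack([a, b])

corr_before = np.corrcoef(X.T)[0, 1]    # close to 1

# PCA via SVD: center the data, then project onto the right singular vectors
X_centered = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
Z = X_centered @ Vt.T                   # principal component scores

cov_after = np.cov(Z.T)[0, 1]           # off-diagonal covariance, essentially 0

print(corr_before > 0.9, abs(cov_after) < 1e-8)  # True True
```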

In [None]:
from sklearn.decomposition import PCA

pca = PCA(n_components=0.95)   # Keep 95% of the variance so that we do not lose much information.
X_train_reduced = pca.fit_transform(X_train_transformed)
X_test_reduced = pca.transform(X_test_transformed)
pca.n_components_, X_train_reduced.shape[1]

In [None]:
linear_reg.fit(X_train_reduced, y_train)

In [None]:
print("Linear Regression Model Cross Validation Results")
cv_results(linear_reg, X_train_reduced, y_train)

Unfortunately we did not get any improvement in the results. The only improvement is a reduced training time, which is not a concern for this problem.

The effect of multicollinearity on model performance is well described at the following link:
<br>https://datascience.stackexchange.com/questions/24452/in-supervised-learning-why-is-it-bad-to-have-correlated-features
<br>

- A few important learnings:
    - In supervised learning for prediction, the only reason to remove multicollinearity is to improve training time and reduce storage.
    - If we add too many correlated features to the model, we may cause it to consider unnecessary features and run into the curse of dimensionality.
    - Multicollinearity affects the coefficients and p-values, but it does not influence the predictions, the precision of the predictions, or the goodness-of-fit statistics. If your primary goal is to make predictions, and you don’t need to understand the role of each independent variable, you don’t need to reduce severe multicollinearity.
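
The last point can be checked on synthetic data. The sketch below fits ordinary least squares (via `np.linalg.lstsq`, not the notebook's `LinearRegression`) once with a single feature and once with an added near-duplicate of it. All data is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)            # almost a copy of x1
y = 3.0 * x1 + rng.normal(scale=0.1, size=n)

ones = np.ones(n)
X_single = np.column_stack([ones, x1])         # intercept + x1
X_both = np.column_stack([ones, x1, x2])       # intercept + x1 + x2

coef_single, *_ = np.linalg.lstsq(X_single, y, rcond=None)
coef_both, *_ = np.linalg.lstsq(X_both, y, rcond=None)

# Compare in-sample predictions of the two fits
max_pred_gap = np.max(np.abs(X_single @ coef_single - X_both @ coef_both))

print("coefficients with x1 only:  ", coef_single)
print("coefficients with x1 and x2:", coef_both)
print("largest prediction gap:     ", max_pred_gap)
```

The weight on `x1` is shared with `x2` in the second fit, so the individual coefficients are hard to interpret, while the two sets of predictions stay very close.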

Now let's shift our focus to tree-based models. We will first implement a Decision Tree and then see whether a Random Forest improves the result.

### Decision Tree Regression

In [None]:
from sklearn.tree import DecisionTreeRegressor
tree_reg = DecisionTreeRegressor(criterion='squared_error', random_state=42)  # 'squared_error' was named 'mse' before scikit-learn 1.0
tree_reg.fit(X_train_transformed, y_train)

In [None]:
coefs = list(zip(feature_columns, tree_reg.feature_importances_))
coefs.sort(key= lambda x:x[1], reverse=True)
coefs

- The Decision Tree has given more importance to the total_wishers and yearpublished features than to average_weight and total_wanters.
- This is completely different from what we got from Linear Regression, which gave very little weight to total_wishers and yearpublished.

In [None]:
print("Decision Tree Regression Model Cross Validation Results")
cv_results(tree_reg, X_train_transformed, y_train)

This is even worse than Linear Regression. Let's see what we get from Random Forest.

### Random Forest

In [None]:
from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(criterion='squared_error', random_state=42, n_jobs=-1)  # 'squared_error' was named 'mse' before scikit-learn 1.0
forest_reg.fit(X_train_transformed, y_train.values.flatten())

In [None]:
coefs = list(zip(feature_columns, forest_reg.feature_importances_))
coefs.sort(key= lambda x:x[1], reverse=True)
coefs

In [None]:
print("Random Forest Regression Model Cross Validation Results")
cv_results(forest_reg, X_train_transformed, y_train.values.flatten())

A much better result. Now let's tune the Random Forest model to obtain the best model, which we can use as our final model.

## Step 5: Fine Tune a Model

In [None]:
from sklearn.model_selection import GridSearchCV

grid_parm=[{'n_estimators':[25, 50, 75, 100], 'max_depth':[4, 8, 12, 16]}]
grid_search = GridSearchCV(RandomForestRegressor(random_state=42, n_jobs=-1), grid_parm, cv=5, scoring="neg_mean_squared_error", return_train_score=True, n_jobs=-1)
grid_search.fit(X_train_transformed, y_train.values.flatten())

In [None]:
cvres = grid_search.cv_results_
print("Results for each run of Random Forest Regression...")
for train_mean_score, test_mean_score, params in zip(cvres["mean_train_score"], cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-train_mean_score), np.sqrt(-test_mean_score), params)

In [None]:
grid_search.best_params_, np.sqrt(-grid_search.best_score_)  # best_score_ is the negative MSE; convert it to RMSE

- If we observe the train and test scores, there is not much improvement in the test score beyond the parameter set {'max_depth': 12, 'n_estimators': 100}. On the other hand, the training RMSE keeps decreasing as max_depth increases. This indicates that if we continue to increase max_depth, the model will start to overfit the training dataset.
- So I have decided to use the model with parameters {'max_depth': 12, 'n_estimators': 100}.
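
This kind of choice can be made explicit as a rule: pick the shallowest model whose validation RMSE is within a small tolerance of the best one. The sketch below is one possible formulation. The scores are invented for illustration (they are not the results of the grid search above), and the tolerance of 0.02 is an arbitrary assumption.

```python
# (max_depth, train RMSE, validation RMSE): invented numbers, not real results
results = [
    (4, 1.40, 1.42),
    (8, 1.10, 1.25),
    (12, 0.90, 1.20),
    (16, 0.70, 1.19),
]

tolerance = 0.02  # how much validation RMSE we are willing to give up
best_validation = min(val for _, _, val in results)

# Shallowest depth whose validation RMSE is within the tolerance of the best
chosen_depth = min(depth for depth, _, val in results
                   if val - best_validation <= tolerance)

print(chosen_depth)  # 12
```

In this toy table the training RMSE keeps falling at depth 16 while the validation RMSE barely moves, which is the pattern described above.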

In [None]:
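# Changing max_depth on the already-fitted best_estimator_ would not refit its trees,
# so train a fresh forest with the chosen parameters instead.
best_forest_reg = RandomForestRegressor(n_estimators=100, max_depth=12, random_state=42, n_jobs=-1)
best_forest_reg.fit(X_train_transformed, y_train.values.flatten())
best_forest_reg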

## Step 6: Model Evaluation

In [None]:
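from sklearn.metrics import mean_squared_error

# Score the model fitted on the training set against the held-out test set
test_rmse = np.sqrt(mean_squared_error(y_test.values.flatten(), best_forest_reg.predict(X_test_transformed)))
print("Best Random Forest test RMSE: {}".format(test_rmse))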

We also got much better RMSE on test dataset.

## Step 7: Analysis of Model Performance

Before saving the model, let's look at the predictions made by the model on the overall dataset. This analysis will help us see where the model has underperformed.

In [None]:
y_train_pred = best_forest_reg.predict(X_train_transformed)
y_test_pred = best_forest_reg.predict(X_test_transformed)

In [None]:
y_pred = np.concatenate((y_train_pred, y_test_pred), axis=0)
y_pred.shape

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.title('Histogram of Observed Average Ratings')
plt.hist(data['average_rating'], bins=np.arange(1, 11), rwidth=0.85)  # bin edges 1..10 cover the full rating scale
plt.subplot(1, 2, 2)
plt.title('Histogram of Predicted Average Ratings')
plt.hist(y_pred, bins=np.arange(1, 11), rwidth=0.85)
plt.show()

In [None]:
combine_data = pd.concat([X_train, X_test], axis=0)

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.scatter(data['yearpublished'], data['average_rating'],  c='green')
plt.title('Distribution of Observed Average Rating')
plt.subplot(1, 2, 2)
plt.scatter(combine_data['yearpublished'], y_pred, c='red')
plt.title('Distribution of Predicted Average Rating')
plt.show()

In [None]:
plt.figure(figsize=(15, 6))
plt.subplot(1, 2, 1)
plt.scatter(data['average_weight'], data['average_rating'],  c='green')
plt.title('Distribution of Observed Average Rating')
plt.ylabel('Average Rating')
plt.xlabel('Average Weight')
plt.subplot(1, 2, 2)
plt.scatter(combine_data['average_weight'], y_pred, c='red')
plt.title('Distribution of Predicted Average Rating')
plt.ylabel('Average Rating')
plt.xlabel('Average Weight')
plt.show()

- From the above charts we can see that our model has failed to correctly predict the average ratings for games having
    - an average rating less than 4.
    - an average rating higher than 8.

- In the given dataset there were many games published after 1990 with an average rating less than 3. Our model has failed to predict the correct rating for those games.

- In the original dataset, some games with a high average weight have low ratings. But in the predictions made by the model, very few games with a high average weight receive low ratings.
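
One way to quantify this weakness, rather than reading it off the charts, is to compute the RMSE separately for low, middle and high observed ratings. The sketch below uses made-up observed and predicted values; on the real data the arrays would be the observed ratings and the model's predictions in matching order.

```python
import numpy as np

# Made-up observed and predicted ratings for illustration
observed = np.array([2.0, 3.5, 5.0, 6.0, 7.0, 8.5, 9.0])
predicted = np.array([5.0, 5.0, 5.5, 6.0, 6.8, 7.0, 7.2])

# Boolean masks for the three rating bands discussed above
bands = {
    "low (<4)": observed < 4,
    "mid (4-8)": (observed >= 4) & (observed <= 8),
    "high (>8)": observed > 8,
}

rmse_by_band = {
    name: float(np.sqrt(np.mean((observed[mask] - predicted[mask]) ** 2)))
    for name, mask in bands.items()
}

for name, value in rmse_by_band.items():
    print(name, round(value, 3))
```

A model that regresses towards the middle of the scale, as described above, shows a much larger error in the low and high bands than in the middle band.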