# Introduction

BoardGameGeek is a popular site where board games such as Settlers of Catan and Through the Ages: A Story of Civilization are discussed and reviewed. Its data is organized into seven major data types. Each type has subtypes, which vary across the domains of this collection of sites, but ID numbers are unique within any given type. For example, [boardgames] and [rpgitems] are subtypes of [thing]. The ID numbers for [things] are shared between its subtypes, so any one [thing] could have the [boardgame] or [rpgitem] subtype, but different [boardgames] and [rpgitems] will never have the same ID number.


# **Objective**
The objective of this notebook is to analyze the given review data and predict the rating.

# Dataset
The dataset contains data on 80,000 board games, with several data points about each one. Here is a list of the interesting ones:

* name – name of the board game.
* playingtime – the playing time (given by the manufacturer).
* minplaytime – the minimum playing time (given by the manufacturer).
* maxplaytime – the maximum playing time (given by the manufacturer).
* minage – the minimum recommended age to play.
* users_rated – the number of users who rated the game.
* average_rating – the average rating given to the game by users. (0-10)
* total_weights – the number of weights given by users. Weight is a subjective measure defined by BoardGameGeek that describes how “deep” or involved a game is.
* average_weight – the average of all the subjective weights (0-5).

# Machine Learning Algorithms Used For Predicting The Rating
We have used the following 3 Machine Learning Algorithms in this project:
1.  k-means clustering
2.  Linear Regression
3. RandomForest Regression

# k-means clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms.
Typically, unsupervised algorithms make inferences from datasets using only input vectors, without referring to known, or labelled, outcomes.
The objective of K-means is simple: group similar data points together and discover underlying patterns. To achieve this objective, K-means looks for a fixed number (k) of clusters in a dataset.
A cluster refers to a collection of data points aggregated together because of certain similarities.
You define a target number k, which is the number of centroids you want in the dataset. A centroid is the imaginary or real location representing the center of a cluster.
Every data point is allocated to one of the clusters so as to reduce the within-cluster sum of squares.
In other words, the K-means algorithm identifies k centroids and then allocates every data point to the nearest one, while keeping the clusters as compact as possible.
The ‘means’ in K-means refers to averaging of the data; that is, finding the centroid.

**How the K-means algorithm works**

To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which are used as the starting points for every cluster, and then performs iterative (repetitive) calculations to optimize the positions of the centroids.
It halts creating and optimizing clusters when either:
* The centroids have stabilized — there is no change in their values because the clustering has been successful.
* The defined number of iterations has been achieved.
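
The assign-then-update loop and the two stopping conditions above can be sketched in plain Python. This is only an illustrative sketch: the function names (`squared_distance`, `mean_point`, `kmeans`) and the toy points are made up for this example, and the notebook itself relies on scikit-learn's `KMeans` rather than this code.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, max_iter=100, seed=0):
    """Hypothetical minimal k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Start from k randomly selected data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Stop early once the centroids have stabilized.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, assignments

# Made-up toy data: two loose groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, assignments = kmeans(points, k=2)
print(centroids)
print(assignments)
```

Because the starting centroids are random, different seeds can lead to different final clusters, which is one reason a fixed `random_state` is useful for reproducibility.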

# Linear Regression
Linear regression is a supervised machine learning algorithm that performs a regression task. Regression models a target prediction value based on independent variables. It is mostly used for finding the relationship between variables and for forecasting. Regression models differ in the kind of relationship they assume between the dependent and independent variables and in the number of independent variables used.
Linear regression predicts the value of a dependent variable (y) from a given independent variable (x). It does so by finding a linear relationship between x (input) and y (output), hence the name linear regression.


In [None]:
from IPython.display import Image
import os
!ls ../input/
Image("../input/diagram/pic.png")

In the figure above, X (input) is the work experience and Y (output) is the salary of a person. The regression line is the best fit line for our model.

In [None]:
from IPython.display import Image
import os
!ls ../input/
Image("../input/downloadd/pic1.png")


While training the model we are given:
* x: input training data (univariate, i.e. one input variable or parameter)
* y: labels for the data (supervised learning)

When training, the model fits the best line to predict the value of y for a given value of x. It obtains the best-fit regression line by finding the best θ1 and θ2 values.
* θ1: intercept
* θ2: coefficient of x

Once we find the best θ1 and θ2 values, we get the best fit line. So when we are finally using our model for prediction, it will predict the value of y for the input value of x.
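
As a rough illustration of what "the best θ1 and θ2" means for a single input variable, the sketch below computes them with the standard closed-form least-squares formulas and then predicts y for a new x. The function names (`fit_line`, `predict`) and the experience/salary numbers are invented for this example and do not come from the notebook's data.

```python
def fit_line(xs, ys):
    """Least-squares intercept (theta1) and slope (theta2) for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    theta2 = sxy / sxx
    # The best-fit line passes through the point (mean_x, mean_y).
    theta1 = mean_y - theta2 * mean_x
    return theta1, theta2

def predict(theta1, theta2, x):
    """Predicted y for input x on the fitted line."""
    return theta1 + theta2 * x

# Made-up data: salary is exactly 30 + 5 * years of experience.
years = [1.0, 2.0, 3.0, 4.0, 5.0]
salary = [35.0, 40.0, 45.0, 50.0, 55.0]
theta1, theta2 = fit_line(years, salary)
print(theta1, theta2)
print(predict(theta1, theta2, 6.0))
```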

How do we update the θ1 and θ2 values to get the best-fit line?

**Cost Function (J):**
In fitting the best-fit regression line, the model aims to predict y values such that the error between the predicted value and the true value is minimal. It is therefore important to update θ1 and θ2 until they reach the values that minimize the error between the predicted y value (pred) and the true y value (y).


In [None]:
from IPython.display import Image
import os
!ls ../input/
Image("../input/costfunction/Annotation 2020-05-11 091201.png")

The cost function (J) of linear regression is the Root Mean Squared Error (RMSE) between the predicted y value (pred) and the true y value (y).

**Gradient Descent:**
To update the θ1 and θ2 values so as to reduce the cost function (minimizing the RMSE) and reach the best-fit line, the model uses gradient descent. The idea is to start with random θ1 and θ2 values and then iteratively update them until the cost reaches a minimum.
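
The update loop can be sketched as follows. This is a minimal illustration under stated assumptions: it minimizes the mean squared error (which has the same minimizer as the RMSE, since the square root is monotonic), it starts from zero rather than random values so the run is reproducible, and the names `mse` and `gradient_descent`, the learning rate, and the toy data are all chosen for this example only.

```python
def mse(theta1, theta2, xs, ys):
    """Mean squared error of the line theta1 + theta2 * x on the data."""
    n = len(xs)
    return sum((theta1 + theta2 * x - y) ** 2 for x, y in zip(xs, ys)) / n

def gradient_descent(xs, ys, learning_rate=0.05, n_steps=5000):
    """Iteratively move theta1 and theta2 against the gradient of the MSE."""
    theta1, theta2 = 0.0, 0.0  # assumed starting point (random values also work)
    n = len(xs)
    for _ in range(n_steps):
        errors = [theta1 + theta2 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the MSE with respect to theta1 and theta2.
        grad1 = 2 * sum(errors) / n
        grad2 = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        theta1 -= learning_rate * grad1
        theta2 -= learning_rate * grad2
    return theta1, theta2

# Made-up data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
theta1, theta2 = gradient_descent(xs, ys)
print(theta1, theta2, mse(theta1, theta2, xs, ys))
```

A learning rate that is too large makes the updates diverge, while one that is too small makes convergence slow, so in practice this value usually needs tuning.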

# RandomForest Regression

Random forest is a supervised learning algorithm that uses an ensemble learning method for classification and regression.
Random forest is a bagging technique, not a boosting technique. The trees in a random forest are built independently and can be trained in parallel; there is no interaction between them while they are being built.
It operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees.
A random forest is a meta-estimator (i.e. it combines the result of multiple predictions) which aggregates many decision trees, with some helpful modifications:
1. The number of features that can be split on at each node is limited to some fraction of the total (this fraction is a hyperparameter). This ensures that the ensemble model does not rely too heavily on any individual feature, and makes fair use of all potentially predictive features.
2. Each tree draws a random sample from the original data set when generating its splits, adding a further element of randomness that helps reduce overfitting.
The above modifications help prevent the trees from being too highly correlated.
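
The two sources of randomness listed above can be sketched in a few lines. This is a simplified illustration, not how any particular library implements it: it assumes rows are sampled with replacement (as in standard bagging), and the helper names `bootstrap_sample` and `random_feature_subset` are hypothetical.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(feature_names, fraction, rng):
    """Pick a random subset containing roughly `fraction` of the features."""
    k = max(1, int(len(feature_names) * fraction))
    return rng.sample(feature_names, k)

rng = random.Random(1)
rows = list(range(10))  # stand-in for row indices of a data set
features = ["minage", "playingtime", "users_rated", "average_weight"]

sample = bootstrap_sample(rows, rng)
subset = random_feature_subset(features, 0.5, rng)
print(sample)
print(subset)
```

Each tree would be trained on its own `sample`, and each split would consider only a fresh `subset`, which is what keeps the trees from all looking alike.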

For example, consider the nine decision tree classifiers below:

In [None]:
from IPython.display import Image
import os
!ls ../input/
Image("../input/chekkkkkkkkk/Annotation 2020-05-11 091744.png")

These decision tree classifiers can be aggregated into a random forest ensemble which combines their outputs. Think of the horizontal and vertical axes of the above decision tree outputs as features x1 and x2. At certain values of each feature, the decision tree outputs a classification of “blue”, “green”, “red”, etc.

These results are aggregated, through model votes or averaging, into a single ensemble model that often outperforms any individual decision tree.
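
The aggregation step itself is simple: a majority vote for classification and an average for regression. The sketch below uses made-up outputs from nine hypothetical trees; the function names are chosen for this example only.

```python
from collections import Counter
from statistics import mean

def aggregate_classification(tree_votes):
    """Return the class predicted by the largest number of trees (the mode)."""
    return Counter(tree_votes).most_common(1)[0][0]

def aggregate_regression(tree_predictions):
    """Return the mean of the individual tree predictions."""
    return mean(tree_predictions)

# Made-up outputs of nine trees for a single input point.
votes = ["blue", "green", "blue", "red", "blue", "green", "blue", "red", "blue"]
preds = [6.1, 5.8, 6.4, 6.0, 5.9, 6.2, 6.3, 5.7, 6.5]

print(aggregate_classification(votes))
print(aggregate_regression(preds))
```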

The aggregated result for the nine decision tree classifiers is shown below:


In [None]:
from IPython.display import Image
import os
!ls ../input/
Image("../input/cheeckkkkkkkk/forestttt.png")

Features and Advantages of Random Forest:

It is one of the most accurate learning algorithms available. For many data sets, it produces a highly accurate classifier.

1. It runs efficiently on large databases.
2. It can handle thousands of input variables without variable deletion.
3. It gives estimates of which variables are important in the classification.
4. It generates an internal unbiased estimate of the generalization error as the forest building progresses.
5. It has an effective method for estimating missing data and maintains accuracy when a large proportion of the data are missing.

Disadvantages of Random Forest:

1. Random forests have been observed to overfit for some datasets with noisy classification/regression tasks.
2. For data including categorical variables with different number of levels, random forests are biased in favor of those attributes with more levels. Therefore, the variable importance scores from random forest are not reliable for this type of data.

# Importing Board Game Geek Review Data

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.preprocessing import StandardScaler


# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Importing datasets and displaying the rows 

In [None]:
rank = pd.read_csv('../input/boardgamegeek-reviews/2019-05-02.csv')
rank.head()

In [None]:
review= pd.read_csv('../input/boardgamegeek-reviews/bgg-13m-reviews.csv', index_col=0)
review.head()

In [None]:
rank.describe()

In [None]:
rank['Average'].plot(kind='hist')
plt.show()

In [None]:
rank['Bayes average'].plot(kind='hist')
plt.show()

# Removing the rows that contain NaN

In [None]:
review.replace('-', np.nan, inplace = True)
review = review.dropna()

In [None]:
review.head()

In [None]:
review.describe()

In [None]:
review['rating'].plot(kind='hist')
plt.show()

In [None]:
detail= pd.read_csv('/kaggle/input/boardgamegeek-reviews/games_detailed_info.csv', index_col=0)
detail.iloc[:, 10:20].head()

In [None]:
detail.iloc[:, :16].describe()


In [None]:
detail['Board Game Rank'] = detail['Board Game Rank'].replace('Not Ranked', np.nan)

In [None]:
detail['average'].plot(kind='hist')
plt.show()

In [None]:
detail['bayesaverage'].plot(kind='hist')
plt.show()

# Joining Repeated Columns in Rank & Detail datasets
The ranking and detail datasets share several columns. Below we select those repeated columns from each dataset and join them on the game ID so that they can be compared side by side.

In [None]:
rank_sub= rank[['ID', 'Year', 'Rank', 'Average', 'Bayes average', 'Users rated', 'Thumbnail']]
rank_sub.head()

In [None]:
detail_sub= detail[['id', 'yearpublished', 'Board Game Rank', 'average', 'bayesaverage', 'usersrated', 'thumbnail']]
detail_sub.head()

In [None]:
joined_df = rank_sub.merge(detail_sub, left_on='ID', right_on='id', how='left')
joined_df.head(20)

# Plot to show Number of Board Games Published 

In [None]:
plt.figure(figsize=(20, 5))
detail['yearpublished'].value_counts().sort_index().plot()
plt.xlabel('Year')
plt.ylabel('Board Games Published')
plt.title('Number of Board Games Published over Time')
plt.show()

# Top 50 Most Rated Board Games

In [None]:
plt.figure(figsize=(20, 7))
review['name'].value_counts()[:50].plot(kind='bar')
plt.xlabel('Board Game Name')
plt.ylabel('Rating Count')
plt.title('Top 50 Most Rated Board Games')
plt.show()

# Correlation Plots

In [None]:
# Correlate only the numeric (and boolean) columns from position 20 onward.
corr = detail.iloc[:, 20:].select_dtypes(include=['number', 'bool']).corr()
# Drop rows and columns that are entirely NaN, then round for display.
corr = corr.dropna(how='all', axis=1).dropna(how='all', axis=0).round(2)

# Draw the heatmap with a diverging colormap.
plt.subplots(figsize=(20,20))
sns.heatmap(corr, cmap='PiYG', annot=True, linewidths=.5)
plt.title('Correlation plot without genre variables')
plt.show()

In [None]:
# Exclude rank columns for other domains plus median, thumbnail, id and image,
# then correlate the remaining numeric (and boolean) columns.
corr = detail[detail.columns.difference(['Accessory Rank', "Amiga Rank", "Arcade Rank", "Atari ST Rank","Commodore 64 Rank",
                                               "RPG Item Rank", "Video Game Rank", "median", "thumbnail", "id", 
                                               "image"])] \
       .select_dtypes(include=['number', 'bool']) \
       .corr()
plt.figure(figsize=(20,20))
sns.heatmap(corr, annot=True, cmap='PiYG')
plt.title('Correlation plot with genre variables that have >1 board game (excluding median, thumbnail, id and image)')
plt.show()

In [None]:
games = pd.read_csv('../input/boardgames/games.csv')
games.head()

In [None]:
print(games.shape)


# Plotting Our Target Variable

In [None]:
# Import matplotlib
import matplotlib.pyplot as plt
# Make a histogram of all the ratings in the average_rating column.
plt.hist(games["average_rating"])
# Show the plot.
plt.show()

# Zero Ratings Review

In [None]:
games[games["average_rating"] == 0]


In [None]:
# Print the first row of all the games with zero scores.
# The .iloc method on dataframes allows us to index by position.
print(games[games["average_rating"] == 0].iloc[0])
# Print the first row of all the games with scores greater than 0.
print(games[games["average_rating"] > 0].iloc[0])

# Removing Games without Review

In [None]:
# Remove any rows without user reviews.
games = games[games["users_rated"] > 0]
# Remove any rows with missing values.
games = games.dropna(axis=0)

In [None]:
sd = games["average_rating"].std()
mean = games["average_rating"].mean()

print(sd,mean)

# Clustering Games

We’ll use a particular type of clustering called k-means clustering. Scikit-learn has an excellent implementation of k-means clustering that we can use. Scikit-learn is the primary machine learning library in Python, and contains implementations of most common algorithms, including random forests, support vector machines, and logistic regression. Scikit-learn has a consistent API for accessing these algorithms.

In [None]:
# Import the kmeans clustering model.
from sklearn.cluster import KMeans
from pandas import DataFrame
%matplotlib inline

# Initialize the model with 2 parameters -- number of clusters and random state.
kmeans_model = KMeans(n_clusters=5, random_state=1)
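# Get only the numeric columns from games (this keeps id, which is numeric).
good_columns = games.select_dtypes(include=['number', 'bool'])
# Drop the identifier columns; games_numeric is reused in later cells.
games_numeric = games.drop(['name','type','id'],axis=1)
# Per-row mean and standard deviation (not used by the clustering below).
games_mean = games_numeric.apply(np.mean, axis=1)
games_std = games_numeric.apply(np.std, axis=1)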
# Fit the model using the good columns.
kmeans_model.fit(good_columns)
# Get the cluster assignments.
labels = kmeans_model.labels_

In [None]:
correlations=games_numeric.corr()

correlations["average_rating"] #Shows us how each column in board game dataset is correlated with average_rating

# Plot Clusters

Now that we have cluster labels, let’s plot the clusters. One sticking point is that our data has many columns – it’s outside of the realm of human understanding and physics to be able to visualize things in more than 3 dimensions. So we’ll have to reduce the dimensionality of our data, without losing too much information. One way to do this is a technique called principal component analysis, or PCA. PCA takes multiple columns, and turns them into fewer columns while trying to preserve the unique information in each column. To simplify, say we have two columns, total_owners, and total_traders. There is some correlation between these two columns, and some overlapping information. PCA will compress this information into one column with new numbers while trying not to lose any information.

We’ll try to turn our board game data into two dimensions, or columns, so we can easily plot it out.

We first initialize a PCA model from Scikit-learn. PCA isn’t a predictive model, but Scikit-learn includes it alongside other tools that are useful when performing machine learning. Dimensionality reduction techniques like PCA are widely used when preprocessing data for machine learning algorithms.

We then turn our data into 2 columns, and plot the columns. When we plot the columns, we shade them according to their cluster assignment.

The plot shows us that there are 5 distinct clusters. We could dive more into which games are in each cluster to learn more about what factors cause games to be clustered.
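
To make the two-column example from the PCA explanation concrete, the sketch below compresses two correlated columns into one and reports how much of the total variance the single new column keeps. It is a hand-rolled illustration for exactly two columns, using the closed-form angle of the first principal axis. The function name `pca_2d_to_1d` and the `total_owners` / `total_traders` values are made up for this example.

```python
import math

def pca_2d_to_1d(xs, ys):
    """Project two columns onto their first principal axis.

    Returns (scores, explained), where `explained` is the fraction of the
    total variance retained by the single projected column.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Center both columns.
    cx = [x - mean_x for x in xs]
    cy = [y - mean_y for y in ys]
    var_x = sum(a * a for a in cx) / n
    var_y = sum(b * b for b in cy) / n
    cov_xy = sum(a * b for a, b in zip(cx, cy)) / n
    # Direction of maximum variance for a 2x2 covariance matrix.
    angle = 0.5 * math.atan2(2 * cov_xy, var_x - var_y)
    ux, uy = math.cos(angle), math.sin(angle)
    # One new number per row: the position along the principal axis.
    scores = [a * ux + b * uy for a, b in zip(cx, cy)]
    explained = (sum(s * s for s in scores) / n) / (var_x + var_y)
    return scores, explained

# Made-up, strongly correlated columns.
total_owners = [100.0, 200.0, 300.0, 400.0, 500.0]
total_traders = [12.0, 19.0, 33.0, 41.0, 48.0]
scores, explained = pca_2d_to_1d(total_owners, total_traders)
print(scores)
print(explained)
```

Note that the columns here are not rescaled, so the column with the larger numeric range dominates the principal axis. Standardizing columns first is a common choice when their units differ.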

In [None]:
# Import the PCA model.
from sklearn.decomposition import PCA
# Create a PCA model.
pca_2 = PCA(2)
# Fit the PCA model on the numeric columns from earlier.
plot_columns = pca_2.fit_transform(good_columns)
# Make a scatter plot of each game, shaded according to cluster assignment.
plt.scatter(x=plot_columns[:,0], y=plot_columns[:,1], c=labels)
# Show the plot.
plt.show()

# Split into test and train

In [None]:
# Split the data by random sampling with pandas (train_test_split from
# sklearn.model_selection would be an alternative).
# Generate the training set.  Set random_state to be able to replicate results.
train = games.sample(frac=0.8, random_state=1)
# Select anything not in the training set and put it in the testing set.
test = games.loc[~games.index.isin(train.index)]
# Print the shapes of both sets.
print(train.shape)
print(test.shape)

# Picking Predicting Columns

In [None]:
# Get all the columns from the dataframe.
columns = games.columns.tolist()
# Filter the columns to remove ones we don't want.
columns = [c for c in columns if c not in ["bayes_average_rating", "average_rating", "type", "name"]]
# Store the variable we'll be predicting on.
target = "average_rating"

# LinearRegression

In [None]:
# Import the linearregression model.
from sklearn.linear_model import LinearRegression
# Initialize the model class.
model = LinearRegression()
# Fit the model to the training data.
model.fit(train[columns], train[target])

# Predicting error

After we train the model, we can make predictions on new data with it. This new data has to be in exactly the same format as the training data, or the model won’t make accurate predictions. Our testing set has the same columns as the training set (only the rows, which contain different board games, differ). We select the same subset of columns from the test set, and then make predictions on it.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Fit on the training set only, then predict on the held-out test set,
# so that the error is comparable with the random forest's test error.
lr = LinearRegression()
lr.fit(train[columns], train[target])
predictions = lr.predict(test[columns])

# Mean squared error and its square root (RMSE) on the test set.
mse = mean_squared_error(test[target], predictions)
print(mse)
rmse = mse ** (1/2)
print(rmse)

In [None]:
sns.distplot(games["average_rating"])
sns.distplot(predictions)


# RandomForestRegressor

In [None]:
# Import the random forest model.
from sklearn.ensemble import RandomForestRegressor
# Initialize the model with some parameters.
model = RandomForestRegressor(n_estimators=100, min_samples_leaf=10, random_state=1)
# Fit the model to the data.
model.fit(train[columns], train[target])
# Make predictions.
predictions = model.predict(test[columns])
# Compute the mean squared error on the test set (true values first).
mean_squared_error(test[target], predictions)


# Challenges and Conclusion

From the above analysis we can see that the error of the random forest is lower than that of linear regression. The random forest algorithm can find nonlinearities in the data that a linear regression cannot pick up, because a linear model only captures linear relationships between the predictors and the target. Predictions made with a random forest often have less error than predictions made by a linear regression.

The challenges I faced in this project were selecting the machine learning algorithms for handling the data, choosing the column to correlate against, and handling the huge sample datasets. I chose "average_rating" for the correlation because our final project was based on predicting the rating. However, we could try predicting a different column, such as average_weight. Regarding the handling of datasets, I was able to divide the dataset into test and train sets, which could be easily handled by the Kaggle server.

Finally, we were able to find which machine learning algorithms were best suited for this dataset.


# References
https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
https://github.com/GuneetKohli/Predicting-Board-Game-Reviews-Using-KMeans-Clustering-Linear-Regression/blob/master/BoardGameReviews.ipynb
https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html
https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
https://www.kaggle.com/mrpantherson/board-game-data
