## **Libraries**

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
import xgboost as xgb
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

## **Dataset**

In [None]:
merged_df = pd.read_csv("merged_df.csv")
merged_df.head()

Unnamed: 0,Game,UserID,HoursPlayed,Ratings,Metadata.Genres,Release.Year
0,Alone in the Dark,189858084,0.4,5,"Action,Adventure,Racing / Driving",2008
1,Assassin's Creed,76451157,7.3,4,Action,2007
2,Assassin's Creed,22371742,10.9,2,Action,2007
3,Assassin's Creed,33865373,1.1,2,Action,2007
4,Assassin's Creed,37490443,29.0,3,Action,2007


## **Preprocessing:**

In [None]:
merged_df.dropna(inplace=True)

In [None]:
# One-hot encoding - Convert categorical features into numerical format
transformed_df = pd.get_dummies(merged_df, columns=['Metadata.Genres'])
transformed_df.head()

Unnamed: 0,Game,UserID,HoursPlayed,Ratings,Release.Year,Metadata.Genres_Action,"Metadata.Genres_Action,Adventure,Racing / Driving","Metadata.Genres_Action,Racing / Driving","Metadata.Genres_Action,Role-Playing (RPG)","Metadata.Genres_Action,Role-Playing (RPG),Strategy","Metadata.Genres_Action,Strategy",Metadata.Genres_Adventure,"Metadata.Genres_Racing / Driving,Simulation,Sports",Metadata.Genres_Role-Playing (RPG),Metadata.Genres_Simulation,Metadata.Genres_Sports,Metadata.Genres_Strategy
0,Alone in the Dark,189858084,0.4,5,2008,0,1,0,0,0,0,0,0,0,0,0,0
1,Assassin's Creed,76451157,7.3,4,2007,1,0,0,0,0,0,0,0,0,0,0,0
2,Assassin's Creed,22371742,10.9,2,2007,1,0,0,0,0,0,0,0,0,0,0,0
3,Assassin's Creed,33865373,1.1,2,2007,1,0,0,0,0,0,0,0,0,0,0,0
4,Assassin's Creed,37490443,29.0,3,2007,1,0,0,0,0,0,0,0,0,0,0,0
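
In the table above, each combined genre string (for example "Action,Adventure,Racing / Driving") becomes its own dummy column. An alternative, sketched below under the assumption that a cell holds several comma-separated genre labels, is to split the string so that every individual genre gets one indicator column. The `demo` frame and the `Genre_` prefix are made up for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy frame mimicking the 'Metadata.Genres' column (illustrative values only)
demo = pd.DataFrame({
    "Game": ["A", "B", "C"],
    "Metadata.Genres": ["Action,Adventure,Racing / Driving", "Action", "Strategy"],
})

# Split each cell on commas: one 0/1 column per individual genre
genre_dummies = demo["Metadata.Genres"].str.get_dummies(sep=",").add_prefix("Genre_")

# Replace the raw string column with the indicator columns
demo_encoded = pd.concat([demo.drop(columns="Metadata.Genres"), genre_dummies], axis=1)
print(demo_encoded)
```

With this encoding a game tagged with three genres has three indicator columns set to 1, so it shares features with every single-genre game in those categories.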


In [None]:
# Standardize numerical features
scaler = StandardScaler()
transformed_df[["HoursPlayed", "Release.Year"]] = scaler.fit_transform(transformed_df[["HoursPlayed", "Release.Year"]])
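
The cell above fits the scaler on the full dataset before any train/test split, so test rows influence the mean and standard deviation used for scaling. A minimal sketch of the alternative, fitting the scaler on the training rows only, is shown below. The data is synthetic and the names `demo`, `train_df`, `test_df` and `scaler_demo` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two numeric columns (not the real data)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "HoursPlayed": rng.exponential(5.0, size=200),
    "Release.Year": rng.integers(2004, 2009, size=200).astype(float),
})

# Split first, then fit the scaler on the training rows only
train_df, test_df = train_test_split(demo, test_size=0.2, random_state=42)
train_df, test_df = train_df.copy(), test_df.copy()

num_cols = ["HoursPlayed", "Release.Year"]
scaler_demo = StandardScaler().fit(train_df[num_cols])

# Apply the training statistics to both splits
train_df[num_cols] = scaler_demo.transform(train_df[num_cols])
test_df[num_cols] = scaler_demo.transform(test_df[num_cols])
```

The training columns end up with mean 0 and unit variance, while the test columns are only approximately standardized, which is expected.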

## **Models**:

The following models are compared to see which performs best:
1. Linear Regression
2. Random Forest
3. XGBoost
4. Neural Network

We will aim to predict the variables:
1. Ratings
2. HoursPlayed

### Splitting the data considering 2 target variables:
1. 'Ratings' is the target variable
2. 'HoursPlayed' is the target variable

In [None]:
## RATINGS as a target variable
# Feature & target variable
X_ratings = transformed_df.drop(["Game", "UserID", "Ratings"], axis=1)
y_ratings = transformed_df["Ratings"]

# Splitting the dataset
X_train_ratings, X_test_ratings, y_train_ratings, y_test_ratings = train_test_split(X_ratings, y_ratings, test_size=0.2, random_state=42)

In [None]:
## HOURSPLAYED as a target variable
# Feature & target variable
X_hp = transformed_df.drop(["Game", "UserID", "HoursPlayed"], axis=1)
y_hp = transformed_df["HoursPlayed"]

# Splitting the dataset
X_train_hp, X_test_hp, y_train_hp, y_test_hp = train_test_split(X_hp, y_hp, test_size=0.2, random_state=42)

### **Linear Regression**: '*Ratings*'

Ratings are a direct indicator of user preference and satisfaction. Linear Regression with '*Ratings*' as the target variable will be used to predict the rating a user might give to a game.

Prediction will be based on the following variables:
1. HoursPlayed
2. Genre of the game
3. Year of Release of the game

In [None]:
lin_reg_ratings = LinearRegression()
lin_reg_ratings.fit(X_train_ratings, y_train_ratings)

In [None]:
# Cross Validation
cv_scores = cross_val_score(lin_reg_ratings, X_train_ratings, y_train_ratings, cv=5)

In [None]:
# Mean & SD of the cross-validation
cv_mean_ratings = np.mean(cv_scores)
cv_std_ratings = np.std(cv_scores)

print("Mean CV Score on Training Data:", cv_mean_ratings)
print("Standard Deviation of CV Scores on Training Data:", cv_std_ratings)

Mean CV Score on Training Data: -0.0143607094448835
Standard Deviation of CV Scores on Training Data: 0.012369944823791475


***Interpretation of the CV:***
* The default score for a regressor in `cross_val_score` is R-squared. A negative mean (about -0.014) means that, averaged across the folds, the model does slightly worse than always predicting the mean rating.
* The standard deviation of the CV scores (about 0.012) is small in absolute terms, so the model is consistently poor across the folds rather than poor in only a few of them.
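
As a small illustration of what `cross_val_score` reports, the sketch below uses synthetic data (the names `X_demo` and `y_demo` are hypothetical) to show that the default score for a regressor matches `scoring="r2"`, and how an error-based score such as MAE can be requested instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data: the target is pure noise, unrelated to the features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
y_demo = rng.normal(size=300)

# Default scoring for a regressor is its .score() method, i.e. R-squared
r2_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)
r2_explicit = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")

# Error metrics are exposed as negated scores, so flip the sign back
mae_scores = -cross_val_score(
    LinearRegression(), X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
)

print("R2 per fold: ", np.round(r2_scores, 3))
print("MAE per fold:", np.round(mae_scores, 3))
```

With a target the features cannot explain, the fold-wise R-squared values hover around zero and can dip below it, which is the pattern seen in the notebook.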



In [None]:
# Predicting on the test set
y_pred_ratings = lin_reg_ratings.predict(X_test_ratings)

In [None]:
# Performance metrics
mse_test = mean_squared_error(y_test_ratings, y_pred_ratings)
mae_test = mean_absolute_error(y_test_ratings, y_pred_ratings)
r2_test = r2_score(y_test_ratings, y_pred_ratings)

print(f"Mean Squared Error: {mse_test}")
print(f"Mean Absolute Error: {mae_test}")
print(f"R-squared: {r2_test}")

Mean Squared Error: 1.3513255520771437
Mean Absolute Error: 0.9102339300690079
R-squared: -0.003671482748920374


***Interpretation:***
* `MSE = 1.35` - The lower the MSE, the closer the predictions are to the true values. On a rating scale of 1-5, an MSE of 1.35 (a root-mean-squared error of about 1.16 rating points) is moderately high and shows that the model's predictive power is not satisfactory.
* `MAE = 0.91` - On average, predictions are off by nearly one rating point, which is large for a scale that only spans 1-5.
* `R2 = -0.0037` - A negative R2 means the model explains none of the variance in 'Ratings' and does marginally worse than always predicting the mean rating.

**Summary:** The model does not reliably predict how a user would rate a game. Another model will be considered.
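
To make the reading of R2 concrete: R2 = 1 - MSE / Var(y), so a model that always predicts the mean scores exactly 0, and any model with a larger squared error scores below 0. The toy values below are chosen for illustration and do not come from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Toy ratings on a 1-5 scale (illustrative values only)
y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Baseline: always predict the mean rating
mean_pred = np.full_like(y_true, y_true.mean())
r2_mean = r2_score(y_true, mean_pred)

# A deliberately bad model: predicts the ratings in reverse order
worse_pred = np.array([5.0, 4.0, 3.0, 2.0, 1.0])
r2_worse = r2_score(y_true, worse_pred)

# Same number from the identity R2 = 1 - MSE / Var(y)
r2_manual = 1 - mean_squared_error(y_true, worse_pred) / y_true.var()

print(r2_mean, r2_worse, r2_manual)
```

The mean predictor gives R2 = 0 and the reversed predictions give R2 = -3, so a slightly negative R2 such as -0.0037 means the model is essentially on par with the mean baseline.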



### **Linear Regression**: '*HoursPlayed*'

HoursPlayed is a direct measure of user engagement. Linear Regression will be used to predict how much time a user is likely to spend playing a particular game. This prediction will be based on the following variables:
1. Ratings
2. Genre of the game
3. Year of Release of the game

In [None]:
lin_reg_hours = LinearRegression()
lin_reg_hours.fit(X_train_hp, y_train_hp)

In [None]:
# Cross Validation
cv_scores_hp = cross_val_score(lin_reg_hours, X_train_hp, y_train_hp, cv=5)

In [None]:
# Mean & SD of the cross-validation
cv_mean_hp = np.mean(cv_scores_hp)
cv_std_hp = np.std(cv_scores_hp)

print("Mean CV Score on Training Data:", cv_mean_hp)
print("Standard Deviation of CV Scores on Training Data:", cv_std_hp)

Mean CV Score on Training Data: -0.02103726930000971
Standard Deviation of CV Scores on Training Data: 0.0179373783311712


***Interpretation:***

Once again, the mean CV score (R-squared by default) is negative, indicating poor model performance despite the change of target variable. The standard deviation across the folds remains small.

In [None]:
# Predicting on the test set
y_pred_hp = lin_reg_hours.predict(X_test_hp)

In [None]:
# Performance metrics
mse_test2 = mean_squared_error(y_test_hp, y_pred_hp)
mae_test2 = mean_absolute_error(y_test_hp, y_pred_hp)
r2_test2 = r2_score(y_test_hp, y_pred_hp)

print(f"Mean Squared Error: {mse_test2}")
print(f"Mean Absolute Error: {mae_test2}")
print(f"R-squared: {r2_test2}")

Mean Squared Error: 1.2348510357295686
Mean Absolute Error: 0.4136283577619329
R-squared: -0.01101127845288441


***Interpretation:***
* `MSE = 1.23` - Because 'HoursPlayed' was standardized during preprocessing, this error is in squared standard-deviation units rather than squared hours. It is roughly the variance of the test-set target, which is what a model that always predicts the mean would score.
* `MAE = 0.41` - Predictions are off by about 0.41 standard deviations of 'HoursPlayed' on average. This cannot be read as 0.41 hours until it is converted back to the original scale.
* `R2 = -0.011` - The negative value shows that the model does not explain the variance in the 'HoursPlayed' variable.

In summary, the MAE here is lower than for 'Ratings', but the two targets are on different scales (standardized hours versus raw rating points), so the values are not directly comparable. By R2, both linear models fail to explain the variance of their target variables. Another model will be considered next to try to get better results.
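
Because 'HoursPlayed' was standardized in preprocessing, the error metrics for this target are in standard-deviation units. A sketch of converting an MAE back to hours is given below, using synthetic data and hypothetical names (`hours`, `hp_scaler`, `pred_std`). In the notebook, `scaler` was fit on `["HoursPlayed", "Release.Year"]`, so the corresponding factor would be `scaler.scale_[0]`.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.preprocessing import StandardScaler

# Made-up raw playtimes in hours (not the real data)
rng = np.random.default_rng(1)
hours = rng.exponential(6.0, size=500).reshape(-1, 1)

hp_scaler = StandardScaler().fit(hours)
hours_std = hp_scaler.transform(hours).ravel()

# Stand-in predictions on the standardized scale (always the mean, i.e. 0)
pred_std = np.zeros_like(hours_std)
mae_std = mean_absolute_error(hours_std, pred_std)

# MAE scales linearly, so multiply by the fitted standard deviation
mae_hours = mae_std * hp_scaler.scale_[0]

# Cross-check by inverse-transforming the predictions first
pred_hours = hp_scaler.inverse_transform(pred_std.reshape(-1, 1)).ravel()
mae_hours_check = mean_absolute_error(hours.ravel(), pred_hours)

print(f"MAE (std units): {mae_std:.3f}  MAE (hours): {mae_hours:.3f}")
```

An MSE would be converted the same way, but multiplied by the squared scale factor.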




### **Random Forest**: '*Ratings*'

A Random Forest model is trained next to see whether a non-linear model gives more reliable predictions, and therefore more effective personalization.

In [None]:
# Random Forest Regressor
rf_ratings = RandomForestRegressor(random_state=42)
rf_ratings.fit(X_train_ratings, y_train_ratings)

In [None]:
# Predicting on test set
# Use a separate name so the Linear Regression predictions are not overwritten
y_pred_rf_ratings = rf_ratings.predict(X_test_ratings)

In [None]:
# Performance Metrics
mse_rf = mean_squared_error(y_test_ratings, y_pred_rf_ratings)
mae_rf = mean_absolute_error(y_test_ratings, y_pred_rf_ratings)
r2_rf = r2_score(y_test_ratings, y_pred_rf_ratings)

print(f"Mean Squared Error: {mse_rf}")
print(f"Mean Absolute Error: {mae_rf}")
print(f"R-squared: {r2_rf}")

Mean Squared Error: 1.7885774766117621
Mean Absolute Error: 1.073271979751252
R-squared: -0.32843207560376975


***Interpretation:***
* `MSE = 1.79` - The squared error between the actual and predicted ratings is high, and higher than for the Linear Regression model (1.35).
* `MAE = 1.07` - On average, the model's predictions are off by about 1.07 rating points, which is large on a rating scale of 1-5.
* `R2 = -0.33` - The model performs worse than a simple model that would always predict the mean rating.

**Summary:** The Random Forest model does not predict "Ratings" effectively and performs worse than Linear Regression on every metric.
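
One plausible explanation for a clearly negative test R2 is overfitting, although this was not verified in the notebook. A quick check is to compare training and test R2, and to see whether constraining the trees narrows the gap. The sketch below does this on synthetic noise data. The names and the hyperparameter values (`max_depth=2`, `min_samples_leaf=20`) are illustrative assumptions, not tuned settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data where the target is unrelated to the features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(400, 4))
y_demo = rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default forest: fully grown trees can memorize the training noise
deep_rf = RandomForestRegressor(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Constrained forest: shallow trees with large leaves
shallow_rf = RandomForestRegressor(
    n_estimators=50, max_depth=2, min_samples_leaf=20, random_state=42
).fit(X_tr, y_tr)

for name, rf in [("default", deep_rf), ("constrained", shallow_rf)]:
    print(f"{name}: train R2 = {rf.score(X_tr, y_tr):.3f}, test R2 = {rf.score(X_te, y_te):.3f}")
```

A large gap between training and test R2 for the default forest would point to overfitting rather than to a lack of model capacity.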

### **Random Forest**: '*HoursPlayed*'

In [None]:
# Random Forest Regressor
rf_hours = RandomForestRegressor(random_state=42)
rf_hours.fit(X_train_hp, y_train_hp)

In [None]:
# Predicting on test set
y_pred_hours = rf_hours.predict(X_test_hp)

In [None]:
# Performance Metrics
mse_rf2 = mean_squared_error(y_test_hp, y_pred_hours)
mae_rf2 = mean_absolute_error(y_test_hp, y_pred_hours)
r2_rf2 = r2_score(y_test_hp, y_pred_hours)

print(f"Mean Squared Error: {mse_rf2}")
print(f"Mean Absolute Error: {mae_rf2}")
print(f"R-squared: {r2_rf2}")

Mean Squared Error: 1.2254321306834384
Mean Absolute Error: 0.4074215371395529
R-squared: -0.0032997254341122773


***Interpretation:***
From our EDA, we learned that the typical HoursPlayed by users falls around 1-10, with a median of approximately 6 hours. Note, however, that 'HoursPlayed' was standardized during preprocessing, so the errors below are in standard-deviation units, not hours.
* `MSE = 1.23` - Marginally lower than for Linear Regression (1.2254 vs. 1.2349) and again close to the variance of the test-set target, so there is clear room for improvement.
* `MAE = 0.41` - The model's predictions are off by about 0.41 standard deviations on average. This is not 0.41 hours, so it cannot be compared directly with the typical playtime of about 6 hours.
* `R2 = -0.0033` - The slightly negative value means the model does no better than predicting the mean and is not capturing patterns in the data.

**Summary**: The metrics are almost identical to those of the Linear Regression model with the target of "HoursPlayed". R2 is closer to zero than for our Random Forest model for "Ratings", but it is still negative. Other approaches will be tried to improve on these results.

### **XGBoost**: '*Ratings*'
XGBoost will be explored given the challenges faced with the previous models. We will first try 'Ratings', then 'HoursPlayed', as we did with our other models.

In [None]:
# XGBoost Regressor
xgb_ratings = xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)
xgb_ratings.fit(X_train_ratings, y_train_ratings)

In [None]:
# Predicting on the test set
xgb_ypred = xgb_ratings.predict(X_test_ratings)

In [None]:
# Evaluating the model
mse_xgb = mean_squared_error(y_test_ratings, xgb_ypred)
mae_xgb = mean_absolute_error(y_test_ratings, xgb_ypred)
r2_xgb = r2_score(y_test_ratings, xgb_ypred)

print(f"Mean Squared Error: {mse_xgb}")
print(f"Mean Absolute Error: {mae_xgb}")
print(f"R-squared: {r2_xgb}")

Mean Squared Error: 1.7125829540344872
Mean Absolute Error: 1.0408333239764194
R-squared: -0.2719885819995167


***Interpretation***:
* `MSE = 1.71` - Given our rating scale, the MSE is again high and does not indicate good predictive power.
* `MAE = 1.04` - The average absolute error is high, at roughly a quarter of the width of the 1-5 rating scale.
* `R2 = -0.27` - The negative value shows that the model is not capturing the patterns in the rating data and does worse than predicting the mean.

### **XGBoost**: '*HoursPlayed*'


In [None]:
# XGBoost Regressor
xgb_hours = xgb.XGBRegressor(objective ='reg:squarederror', random_state=42)
xgb_hours.fit(X_train_hp, y_train_hp)

In [None]:
# Predicting on the test set
xgb_ypred2 = xgb_hours.predict(X_test_hp)

In [None]:
# Evaluating the model
# Separate names so the 'Ratings' XGBoost metrics are not overwritten
mse_xgb2 = mean_squared_error(y_test_hp, xgb_ypred2)
mae_xgb2 = mean_absolute_error(y_test_hp, xgb_ypred2)
r2_xgb2 = r2_score(y_test_hp, xgb_ypred2)

print(f"Mean Squared Error: {mse_xgb2}")
print(f"Mean Absolute Error: {mae_xgb2}")
print(f"R-squared: {r2_xgb2}")

Mean Squared Error: 1.224642514015813
Mean Absolute Error: 0.4079470400876704
R-squared: -0.002653241499187997


***Interpretation***:
* `MSE = 1.22` - 'HoursPlayed' was standardized, so this is in squared standard-deviation units. It is about the same as the variance of the test-set target, so the small-looking value does not indicate accurate predictions.
* `MAE = 0.41` - About 0.41 standard deviations on average, in line with the other models. It is not directly comparable to the median playtime of around 6 hours.
* `R2 = -0.003` - Poor performance in explaining the variance in HoursPlayed.

### **Neural Network**: *HoursPlayed*

In [None]:
X_nn = transformed_df.drop(["HoursPlayed", "Game", "UserID"], axis=1)
y_nn = transformed_df["HoursPlayed"]

# Splitting the dataset into training and testing sets
X_train_nn, X_test_nn, y_train_nn, y_test_nn = train_test_split(X_nn, y_nn, test_size=0.2, random_state=42)

In [None]:
model = Sequential()
model.add(Dense(128, input_dim=X_train_nn.shape[1], activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dense(1, activation='linear'))

model.compile(optimizer='adam', loss='mean_squared_error')

In [None]:
# Training the model
model.fit(X_train_nn, y_train_nn, epochs=10, batch_size=32)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.src.callbacks.History at 0x7f4c87d79fc0>

In [None]:
y_pred_nn = model.predict(X_test_nn)
mse_nn = mean_squared_error(y_test_nn, y_pred_nn)
mae_nn = mean_absolute_error(y_test_nn, y_pred_nn)
r2_nn = r2_score(y_test_nn, y_pred_nn)

print(f"Mean Squared Error: {mse_nn}")
print(f"Mean Absolute Error: {mae_nn}")
print(f"R-squared: {r2_nn}")

Mean Squared Error: 1.2191385883583072
Mean Absolute Error: 0.43649162900435506
R-squared: 0.0018529950868443334


***Interpretation:***
* `MSE = 1.22` - The lowest MSE so far, by a small margin. As 'HoursPlayed' was standardized, it is in squared standard-deviation units and is close to the variance of the test-set target.
* `MAE = 0.44` - The model's predictions are off by about 0.44 standard deviations on average, slightly higher than for the other 'HoursPlayed' models.
* `R2 = 0.0019` - While very close to zero, this is the only model with a positive R2, meaning it is the only one that does marginally better than always predicting the mean.

## **Summary:**
**'Ratings'** vs. **'HoursPlayed'**:
* Models predicting ***'Ratings'*** all show negative R-squared values, indicating poor performance across all approaches. Linear Regression performed best of the three models, with the least negative R-squared value along with the lowest MSE and MAE.

* Models predicting ***'HoursPlayed'*** have R-squared values closer to zero than those for our '*Ratings*' variable, but none explains a meaningful share of the variance. The Neural Network is the only model with a positive R-squared value and has the lowest MSE, though its R-squared is still very close to 0. While its MAE is higher than for the others, its overall performance seems to be the best for predicting '*HoursPlayed*'.
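
The comparison above can be collected into a single table. The sketch below hard-codes the test-set metrics printed earlier in this notebook (rounded to four decimals) and ranks the models per target by R-squared. The names `results`, `ranked` and `best` are introduced only for this illustration.

```python
import pandas as pd

# Test-set metrics as printed in the cells above, rounded to 4 decimals
results = pd.DataFrame(
    [
        ("Ratings", "Linear Regression", 1.3513, 0.9102, -0.0037),
        ("Ratings", "Random Forest", 1.7886, 1.0733, -0.3284),
        ("Ratings", "XGBoost", 1.7126, 1.0408, -0.2720),
        ("HoursPlayed", "Linear Regression", 1.2349, 0.4136, -0.0110),
        ("HoursPlayed", "Random Forest", 1.2254, 0.4074, -0.0033),
        ("HoursPlayed", "XGBoost", 1.2246, 0.4079, -0.0027),
        ("HoursPlayed", "Neural Network", 1.2191, 0.4365, 0.0019),
    ],
    columns=["Target", "Model", "MSE", "MAE", "R2"],
)

# Rank models within each target, best R2 first
ranked = results.sort_values(["Target", "R2"], ascending=[True, False])
print(ranked.to_string(index=False))

# Best model per target by R2
best = results.loc[results.groupby("Target")["R2"].idxmax(), ["Target", "Model", "R2"]]
print(best.to_string(index=False))
```

MSE and MAE are only comparable within a target, since 'HoursPlayed' was standardized and 'Ratings' was not.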



### Best Performing Models:
1. Linear Regression for predicting "Ratings".
2. Neural Networks for predicting "HoursPlayed".
3. XGBoost for predicting "HoursPlayed".