This project generates global baseline recommender predictions for items in a shop.

The analysis uses a made-up dataset. The core idea of global baseline estimation is to estimate a rough rating (1-5) for each user-item pair from the overall raw average of all ratings plus a user bias and an item bias.
The user bias is the difference between a user's average rating and the overall global average rating. A negative user bias means the user tends to rate items lower than average, which suggests a harsher critic. A positive user bias means the user is more lenient in their ratings.
The item bias measures how much more liked or disliked an item is in general, by comparing its average rating to the global average. A positive item bias indicates an item that is generally liked; a negative item bias indicates one that is generally disliked.
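
Written out, the estimate described above takes the following form. The symbols are introduced here for illustration and do not appear in the notebook: $\mu$ is the global average, and $\bar{r}_u$ and $\bar{r}_i$ are the average ratings of user $u$ and item $i$.

$$
b_u = \bar{r}_u - \mu, \qquad b_i = \bar{r}_i - \mu, \qquad \hat{r}_{ui} = \mu + b_u + b_i
$$

For instance, with hypothetical numbers, a global average of 3.5, a user bias of -0.5, and an item bias of +1.0 give an estimate of 3.5 - 0.5 + 1.0 = 4.0.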

### Importing libraries

In [49]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

### Loading data into dataframe

In [50]:
#Generating a toy dataset:
data = {
    'IPad': [5, 3, np.nan, 1, np.nan],
    'Smart_TV': [4, np.nan, 2, np.nan, 5],
    'Xbox_One': [np.nan, 2, 4, 5, 3],
    '5.1_Speakers': [3, np.nan, np.nan, 4, 2],
    'Smartphones': [1, 5, 3, np.nan, np.nan],
}

df = pd.DataFrame(data, index=[f'User{i}' for i in range(1, 6)])

#Move the user labels from the index into a regular column
df.reset_index(inplace=True)

#Rename that column from 'index' to 'User'
df.rename(columns={'index': 'User'}, inplace=True)
df

Unnamed: 0,User,IPad,Smart_TV,Xbox_One,5.1_Speakers,Smartphones
0,User1,5.0,4.0,,3.0,1.0
1,User2,3.0,,2.0,,5.0
2,User3,,2.0,4.0,,3.0
3,User4,1.0,,5.0,4.0,
4,User5,,5.0,3.0,2.0,


#### Converting to long format dataframe:

In [51]:
#Melt every item column (all columns after 'User') into long format
df_long = df.melt(id_vars='User', value_vars=list(df.columns[1:]), value_name='Rating')

#Rename column to 'item'
df_long.rename(columns={'variable': 'item'}, inplace=True)

#To preview:
df_long.head(2)

Unnamed: 0,User,item,Rating
0,User1,IPad,5.0
1,User2,IPad,3.0


#### Filter out the unknown values:

In [52]:
df_unkowns = df_long[df_long['Rating'].isna()]

#Keep only the known ratings for the train/test split:
df_train_test = df_long.dropna()

#Due to the small sample size, use a 70/30 train/test split:
df_train, df_test = train_test_split(df_train_test, test_size=0.3, random_state=41)

#### Calculating the raw mean for both training and test set data:

In [53]:
raw_train_mean = df_train['Rating'].mean()
raw_test_mean = df_test['Rating'].mean()
print(f'Raw train mean: {raw_train_mean:.3f}\nRaw test mean: {raw_test_mean:.3f}')

Raw train mean: 3.455
Raw test mean: 2.800


#### Calculate the RMSE for both Test and Training set:

In [54]:
#Insert the raw_mean values to the training and test dataset for easier computation:
df_train['raw_mean'] = raw_train_mean
df_test['raw_mean'] = raw_test_mean

#Compute the RMSE for both training and test dataset:
rmse_train = np.sqrt(mean_squared_error(df_train['Rating'], df_train['raw_mean']))
rmse_test = np.sqrt(mean_squared_error(df_test['Rating'], df_test['raw_mean']))
print(f'RMSE train: {rmse_train:.3f}\nRMSE test: {rmse_test:.3f}')

RMSE train: 1.499
RMSE test: 0.748


The RMSE for the test set is lower than for the training set because of how the data happens to be split at random and how small the dataset is. Note also that the test RMSE here is measured against the test set's own mean, which can only match or understate the error obtained when predicting from the training mean.
The test RMSE becomes larger than the training RMSE if random_state is set to 42, and would typically do so with a larger dataset. With a dataset as small as this made-up one, it is not surprising to see the RMSE behave this way.
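
To get a feel for how much the split matters, here is a minimal sketch that repeats a 70/30 split over several seeds on the 16 observed ratings. It rests on a few assumptions that are not in the notebook: it shuffles with NumPy rather than `train_test_split`, so its seeds do not correspond to the `random_state` values above; it scores both sets against the training mean; and `split_rmse` and `test_fraction` are illustrative names.

```python
import numpy as np

# The 16 observed ratings from the toy dataset (missing entries dropped),
# listed item by item: IPad, Smart_TV, Xbox_One, 5.1_Speakers, Smartphones.
ratings = np.array([5, 3, 1, 4, 2, 5, 2, 4, 5, 3, 3, 4, 2, 1, 5, 3], dtype=float)

def rmse(actual, predicted):
    """Root mean squared error between actual values and a prediction."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def split_rmse(seed, test_fraction=0.3):
    """Shuffle with the given seed, split, and score both parts against the train mean."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(ratings)
    n_test = int(np.ceil(test_fraction * len(shuffled)))  # 5 of 16 ratings
    test, train = shuffled[:n_test], shuffled[n_test:]
    train_mean = train.mean()
    return rmse(train, train_mean), rmse(test, train_mean)

results = [split_rmse(seed) for seed in range(20)]
test_lower = sum(test < train for train, test in results)
print(f'Test RMSE below train RMSE in {test_lower} of {len(results)} splits')
```

The count printed at the end depends on the seeds chosen, which is the point: with only 16 ratings, which set comes out ahead is largely down to the split.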

Calculate the User and Item bias for both training and test set:

In [55]:
#User and item bias for training set:
user_bias_train = df_train.groupby('User')['Rating'].mean() - raw_train_mean
item_bias_train = df_train.groupby('item')['Rating'].mean() - raw_train_mean

#User and item bias for test set:
user_bias_test = df_test.groupby('User')['Rating'].mean() - raw_test_mean
item_bias_test = df_test.groupby('item')['Rating'].mean() - raw_test_mean

print(f'User bias train: {user_bias_train}\nItem bias train: {item_bias_train}')
print(f'User bias test: {user_bias_test}\nItem bias test: {item_bias_test}')

User bias train: User
User1   -0.121212
User2    0.545455
User3    0.045455
User4   -0.454545
User5    0.045455
Name: Rating, dtype: float64
Item bias train: item
5.1_Speakers   -1.454545
IPad           -0.454545
Smart_TV        1.045455
Smartphones    -0.454545
Xbox_One        1.045455
Name: Rating, dtype: float64
User bias test: User
User1    0.2
User2   -0.8
User3   -0.8
User4    1.2
User5    0.2
Name: Rating, dtype: float64
Item bias test: item
5.1_Speakers    0.7
Smart_TV       -0.8
Xbox_One       -0.3
Name: Rating, dtype: float64


From the raw average and the appropriate user and item biases, calculate the baseline predictors for every user-item combination.

In [56]:
#Compute the baseline estimator for the training set:
df_train['baseline_estimate'] = raw_train_mean + user_bias_train.reindex(df_train['User']).values + item_bias_train.reindex(df_train['item']).values

#Compute the baseline estimator for the test set:
df_test['baseline_estimate'] = raw_test_mean + user_bias_test.reindex(df_test['User']).values + item_bias_test.reindex(df_test['item']).values

In [57]:
df_train.head(2)


Unnamed: 0,User,item,Rating,raw_mean,baseline_estimate
13,User4,Xbox_One,5.0,3.454545,4.045455
20,User1,Smartphones,1.0,3.454545,2.878788


In [58]:
df_test.head(2)

Unnamed: 0,User,item,Rating,raw_mean,baseline_estimate
15,User1,5.1_Speakers,3.0,2.8,3.7
14,User5,Xbox_One,3.0,2.8,2.7


Calculate the RMSE of the baseline predictors for both the training data and the test data.

In [59]:
#RMSE for the training baseline estimator:
rmse_train_baseline = np.sqrt(mean_squared_error(df_train['Rating'], df_train['baseline_estimate']))

#RMSE for the test baseline estimator:
rmse_test_baseline = np.sqrt(mean_squared_error(df_test['Rating'], df_test['baseline_estimate']))

Results summary:

In [60]:
print(f'RMSE for raw_mean training set: {rmse_train:.3f}')
print(f'RMSE for raw_mean test set: {rmse_test:.3f}')
print(f'RMSE for baseline training set: {rmse_train_baseline:.3f}')
print(f'RMSE for baseline test set: {rmse_test_baseline:.3f}')

RMSE for raw_mean training set: 1.499
RMSE for raw_mean test set: 0.748
RMSE for baseline training set: 1.144
RMSE for baseline test set: 0.600
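
The test-set figures above use a mean and biases computed from the test set itself. An alternative worth considering, sketched below under stated assumptions, is to derive the global mean and biases from the training rows only and apply them to held-out rows, falling back to a zero bias for users or items not seen in training. The function name `baseline_predict` and the small frames are hypothetical and are not part of the notebook.

```python
import pandas as pd

def baseline_predict(train, targets):
    """Predict ratings for `targets` using statistics from `train` only."""
    mu = train['Rating'].mean()
    user_bias = train.groupby('User')['Rating'].mean() - mu
    item_bias = train.groupby('item')['Rating'].mean() - mu
    # Users or items absent from the training data get a bias of zero.
    b_u = targets['User'].map(user_bias).fillna(0.0)
    b_i = targets['item'].map(item_bias).fillna(0.0)
    return mu + b_u + b_i

# Hypothetical training rows: global mean 3.0, user A bias +1, user B bias -2.
train = pd.DataFrame({'User': ['A', 'A', 'B'],
                      'item': ['x', 'y', 'x'],
                      'Rating': [5.0, 3.0, 1.0]})
# Held-out pairs, including an unseen user 'C' and an unseen item 'z'.
targets = pd.DataFrame({'User': ['B', 'C', 'A'],
                        'item': ['y', 'x', 'z']})

preds = baseline_predict(train, targets)
print(preds.tolist())
```

Applied to the notebook's data, the call would look like `baseline_predict(df_train, df_test)`; the resulting RMSE would differ from the figures printed above.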


To predict the unknown ratings using the training baseline estimator:

In [61]:
df_unkowns_predicted = df_unkowns.copy()
df_unkowns_predicted['baseline_estimate'] = raw_train_mean + user_bias_train.reindex(df_unkowns['User']).values + item_bias_train.reindex(df_unkowns['item']).values
df_unkowns_predicted

Unnamed: 0,User,item,Rating,baseline_estimate
2,User3,IPad,,3.045455
4,User5,IPad,,3.045455
6,User2,Smart_TV,,5.045455
8,User4,Smart_TV,,4.045455
10,User1,Xbox_One,,4.378788
16,User2,5.1_Speakers,,2.545455
17,User3,5.1_Speakers,,2.045455
23,User4,Smartphones,,2.545455
24,User5,Smartphones,,3.045455
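
One of the estimates above (User2, Smart_TV at 5.045455) falls outside the 1-5 rating scale. A common post-processing step, which is not part of the notebook, is to clip estimates to the valid range. The sketch below copies the values from the table and clips them with NumPy; the bounds 1.0 and 5.0 follow the scale stated in the introduction.

```python
import numpy as np

# Baseline estimates for the unknown ratings, copied from the table above.
estimates = np.array([3.045455, 3.045455, 5.045455, 4.045455, 4.378788,
                      2.545455, 2.045455, 2.545455, 3.045455])

# Clip to the 1-5 rating scale; only out-of-range values change.
clipped = np.clip(estimates, 1.0, 5.0)
print(clipped.max())
```

On the dataframe itself, the equivalent would be `df_unkowns_predicted['baseline_estimate'].clip(1, 5)`.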


To merge training, test, and unknown back into one dataframe:

In [62]:
df_final = pd.concat([df_train, df_test, df_unkowns_predicted])
df_final['baseline_estimate'] = round(df_final['baseline_estimate'], 2)
df_final

Unnamed: 0,User,item,Rating,raw_mean,baseline_estimate
13,User4,Xbox_One,5.0,3.454545,4.05
20,User1,Smartphones,1.0,3.454545,2.88
9,User5,Smart_TV,5.0,3.454545,4.55
12,User3,Xbox_One,4.0,3.454545,4.55
21,User2,Smartphones,5.0,3.454545,3.55
1,User2,IPad,3.0,3.454545,3.55
22,User3,Smartphones,3.0,3.454545,3.05
3,User4,IPad,1.0,3.454545,2.55
19,User5,5.1_Speakers,2.0,3.454545,2.05
5,User1,Smart_TV,4.0,3.454545,4.38


To pivot the table into a user-item matrix:

In [63]:
df_final.pivot_table(index='User', columns='item', values='baseline_estimate')


User,5.1_Speakers,IPad,Smart_TV,Smartphones,Xbox_One
User1,3.7,2.88,4.38,2.88,4.38
User2,2.55,3.55,5.05,3.55,1.7
User3,2.05,3.05,1.2,3.05,4.55
User4,4.7,2.55,4.05,2.55,4.05
User5,2.05,3.05,4.55,3.05,2.7


Conclusion:
The global baseline estimate by itself is not an accurate prediction method, since it starts from the raw average across all users and items and adds only coarse user and item biases, without considering individual preferences or item characteristics. However, it becomes useful when paired with other methods such as content-based filtering.
This is particularly helpful in addressing the cold-start problem, a common limitation of filtering methods when there is insufficient user activity data available for analysis. The global baseline estimate provides a reasonable starting point for predictions when personalized data is not yet available.
Additionally, the global baseline estimate can be improved through regularization, which helps to prevent overfitting by penalizing overly large user or item biases. The strength of the penalty is controlled by a hyperparameter (lambda). Alternating least squares (ALS) is one commonly used method for fitting such regularized biases.
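
As a rough illustration of the regularization idea, one common form of shrinkage divides the summed deviations from the global mean by the number of ratings plus lambda, so that biases backed by few ratings are pulled toward zero. The sketch below assumes that form; the function name `regularized_bias`, the value `lam=2.0`, and the ratings are all hypothetical and are not taken from the notebook.

```python
def regularized_bias(ratings, global_mean, lam):
    """Shrunken bias: summed deviations divided by (lam + number of ratings)."""
    deviations = [r - global_mean for r in ratings]
    return sum(deviations) / (lam + len(ratings))

global_mean = 3.0      # hypothetical global average
few = [5.0, 5.0]       # a user with only two ratings
many = [5.0] * 20      # a user with twenty ratings

plain_few = regularized_bias(few, global_mean, lam=0.0)    # ordinary mean deviation
shrunk_few = regularized_bias(few, global_mean, lam=2.0)   # heavily shrunk
shrunk_many = regularized_bias(many, global_mean, lam=2.0) # barely shrunk
print(plain_few, shrunk_few, round(shrunk_many, 3))
```

With `lam=0.0` the function reduces to the unregularized bias used in the notebook. A larger lambda shrinks the two-rating user much more than the twenty-rating user, which is the intended effect.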

RMSE, or root mean squared error, is an error metric typically used to evaluate the accuracy of prediction models. The lower the RMSE, the better the model performs, because the predicted values are closer to the actual values. Typically, the training set shows a lower RMSE than the test set because the model is fitted directly to the training data. However, this is not always the case when the sample is small, as in this project. With a small sample, the ratings that land in each split vary more from one split to the next, and this can change the RMSE drastically. In addition, certain random_state values, which set the seed for splitting the data, can produce a more lopsided split when the sample size is small. So while this result may seem counterintuitive, it is not unusual with small or synthetic datasets. With a larger, more representative dataset, the expected pattern of lower RMSE on the training set generally holds.
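
For reference, the metric can be written as follows. The notation is introduced here for illustration: $r_k$ is an actual rating, $\hat{r}_k$ the corresponding prediction, and $n$ the number of rated pairs being scored.

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(r_k - \hat{r}_k\right)^2}
$$

As a quick check of the formula with made-up numbers, two predictions that miss by +1 and -1 give squared errors of 1 and 1, a mean of 1, and an RMSE of 1.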