<p style="text-align:center">
    <a href="https://skills.network/?utm_medium=Exinfluencer&utm_source=Exinfluencer&utm_content=000026UJ&utm_term=10006555&utm_id=NA-SkillsNetwork-Channel-SkillsNetworkCoursesIBMML321ENSkillsNetwork817-2022-01-01" target="_blank">
    <img src="https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/assets/logos/SN_web_lightmode.png" width="200" alt="Skills Network Logo">
    </a>
</p>


# **Regression-based Rating Score Prediction using Embedding Features**


Estimated time needed: **45** minutes


In the previous lab, you trained a neural network to predict user-item interactions while simultaneously extracting user and item embedding features. This lab extends that work by using the two embedding vectors as input to a separate model that predicts the rating.


![](https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/module_4/images/rating_regression.png)



One way to make rating predictions from the embeddings is to aggregate the user and item vectors into a single feature vector and use it as input data `X`.

With an interaction label `Y`, such as a rating score or an enrollment mode, we can then build other standalone predictive models to approximate the mapping from `X` to `Y`, as shown in the flowchart above.


In this lab, you will be given the course interaction feature vectors as input data `X` and consider label `Y` as the numerical rating scores. As such, we turn the recommender system into a common regression task and you can apply what you have learned about regression modeling to predict the ratings.


## Objectives


After completing this lab you will be able to:


* Build regression models to predict ratings using the combined embedding vectors


----


## Prepare and set up the lab environment


First, install and import the required libraries:


In [None]:
#%pip install scikit-learn
#%pip install pandas



In [1]:
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error

In [2]:
# also set a random state
rs = 123


### Load datasets


In [4]:
rating_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMSkillsNetwork-ML0321EN-Coursera/labs/v2/module_3/ratings.csv"
user_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/user_embeddings.csv"
item_emb_url = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-ML321EN-SkillsNetwork/labs/datasets/course_embeddings.csv"

The first dataset is the rating dataset, which stores the user-item interactions in long format with one row per user, item, and rating.


In [5]:
rating_df = pd.read_csv(rating_url)

In [6]:
rating_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233306 entries, 0 to 233305
Data columns (total 3 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   user    233306 non-null  int64 
 1   item    233306 non-null  object
 2   rating  233306 non-null  int64 
dtypes: int64(2), object(1)
memory usage: 5.3+ MB


In [7]:
rating_df.describe()

| |user|rating|
|-|-|-|
|count|233306.0|233306.0|
|mean|1099162.0|3.998448|
|std|477166.1|0.816058|
|min|2.0|3.0|
|25%|721040.0|3.0|
|50%|1080061.0|4.0|
|75%|1466616.0|5.0|
|max|2103039.0|5.0|


In [8]:
rating_df.head()

| |user|item|rating|
|-|-|-|-|
|0|1889878|CC0101EN|5|
|1|1342067|CL0101EN|3|
|2|1990814|ML0120ENv3|5|
|3|380098|BD0211EN|5|
|4|779563|DS0101EN|3|


As you can see from the data above, the `user` and `item` columns are just IDs. Let's substitute them with their embedding vectors:


In [9]:
# Load user embeddings
user_emb = pd.read_csv(user_emb_url)
# Load item embeddings
item_emb = pd.read_csv(item_emb_url)

In [10]:
user_emb

Unnamed: 0,user,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,UFeature7,UFeature8,UFeature9,UFeature10,UFeature11,UFeature12,UFeature13,UFeature14,UFeature15
0,1889878,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.003480,0.091464,-0.040247,0.018958,-0.153328,-0.090143,0.082830,-0.058721,0.057929,-0.001472
1,1342067,0.068047,-0.112781,0.045208,-0.007570,-0.038382,0.068037,0.114949,0.104128,-0.034401,0.004011,0.064832,0.165857,-0.004384,0.053257,0.014308,0.056684
2,1990814,0.124623,0.012910,-0.072627,0.049935,0.020158,0.133306,-0.035366,-0.156026,0.039269,0.042195,0.014695,-0.115989,0.031158,0.102021,-0.020601,0.116488
3,380098,-0.034870,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,-0.060944,0.112384,0.002114,0.090660,-0.068545,0.008967,0.063962,0.052347,0.018072
4,779563,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.074610,-0.019367,-0.031341,0.064896,-0.048158,-0.047309,-0.007544,0.010474,-0.032287,-0.083983
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
33896,1525198,0.030928,0.023841,-0.043546,0.007947,0.042259,-0.033503,-0.031111,0.036865,-0.027394,-0.014666,0.027467,0.056807,0.035407,0.016624,0.003942,-0.042969
33897,1047293,-0.025774,-0.011417,0.002209,-0.020827,-0.002285,-0.007308,0.023367,-0.006890,-0.022531,0.014554,0.031916,-0.028916,0.031161,-0.031941,0.012866,0.016397
33898,1653442,0.029938,-0.038859,-0.014539,-0.046586,0.014946,0.017971,0.049848,0.007205,0.072723,-0.072453,-0.064114,0.006508,0.040719,-0.055399,0.007838,-0.050092
33899,946438,0.031057,0.031040,0.015897,0.024325,0.036805,0.001677,-0.029462,-0.033028,0.047495,0.040393,-0.034266,0.035967,-0.033399,0.037442,0.039109,-0.031742


In [11]:
item_emb

Unnamed: 0,item,CFeature0,CFeature1,CFeature2,CFeature3,CFeature4,CFeature5,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,CC0101EN,0.009657,-0.005238,-0.004098,0.016303,-0.005274,-0.000361,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.032560,-0.007292,0.000966,-0.006218
1,CL0101EN,-0.008611,0.028041,0.021899,-0.001465,0.006900,-0.017981,0.010899,-0.037610,-0.019397,-0.025682,-0.000620,0.038803,0.000196,-0.045343,0.012863,0.019429
2,ML0120ENv3,0.027439,-0.027649,-0.007484,-0.059451,0.003972,0.020496,-0.012695,0.036138,0.019965,0.018686,-0.010450,-0.050011,0.013845,-0.044454,-0.001480,-0.007559
3,BD0211EN,0.020163,-0.011972,-0.003714,-0.015548,-0.007540,0.014847,-0.005700,-0.006068,-0.005792,-0.023036,0.015999,-0.023480,0.015469,0.022221,-0.023115,-0.001785
4,DS0101EN,0.006399,0.000492,0.005640,0.009639,-0.005487,-0.000590,-0.010015,-0.001514,-0.017598,0.003590,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
121,DB0113EN,-0.005431,-0.051409,0.070075,-0.053410,-0.110620,0.042510,0.015849,-0.023665,-0.067477,-0.017871,0.038501,0.054688,0.003747,0.053903,0.140372,-0.065488
122,DX0108EN,-0.010177,-0.047900,0.003748,0.020915,-0.034402,-0.042355,-0.054569,-0.059799,-0.038959,0.041647,0.037121,-0.039135,-0.031249,0.010422,-0.042940,0.050035
123,DS0107,0.028308,-0.022252,-0.039251,-0.027236,0.032327,-0.049738,0.029155,-0.030566,-0.043763,-0.020083,0.045287,-0.035698,-0.037330,0.024567,-0.032049,-0.004837
124,DB0115EN,-0.037462,-0.103100,0.066244,-0.004039,0.041908,-0.090526,-0.047779,0.012038,-0.060363,-0.008462,-0.058119,0.036157,-0.040502,-0.037642,0.006365,0.097246


In [17]:
item_emb.shape[1]

17

In [12]:
# Merge user embedding features
user_emb_merged = pd.merge(rating_df, user_emb, how='left', left_on='user', right_on='user').fillna(0)
# Merge course embedding features
merged_df = pd.merge(user_emb_merged, item_emb, how='left', left_on='item', right_on='item').fillna(0)
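Because both merges are left joins followed by `.fillna(0)`, a user or course without an embedding row would silently receive an all-zero vector. The following is a minimal sketch of how one might check for such unmatched keys. It uses small made-up frames (`toy_ratings` and `toy_user_emb` are illustrative names, not the lab's data) and the `indicator` option of `pd.merge`:

```python
import pandas as pd

# Toy stand-ins for rating_df and user_emb (values are made up for illustration)
toy_ratings = pd.DataFrame(
    {"user": [1, 2, 3], "item": ["A", "B", "A"], "rating": [5, 3, 4]}
)
toy_user_emb = pd.DataFrame({"user": [1, 2], "UFeature0": [0.1, -0.2]})

# indicator=True adds a `_merge` column that records where each key was found
checked = pd.merge(toy_ratings, toy_user_emb, how="left", on="user", indicator=True)

# Rows marked 'left_only' have no matching embedding and would be zero-filled
n_unmatched = int((checked["_merge"] == "left_only").sum())
print(f"Ratings without a user embedding: {n_unmatched}")
```

The same check could be applied to the real `rating_df` and embedding frames before calling `.fillna(0)`.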

In [13]:
merged_df

Unnamed: 0,user,item,rating,UFeature0,UFeature1,UFeature2,UFeature3,UFeature4,UFeature5,UFeature6,...,CFeature6,CFeature7,CFeature8,CFeature9,CFeature10,CFeature11,CFeature12,CFeature13,CFeature14,CFeature15
0,1889878,CC0101EN,5,0.080721,-0.129561,0.087998,0.030231,0.082691,-0.004176,-0.003480,...,-0.015081,-0.012229,0.015686,0.008401,-0.035495,0.009381,-0.032560,-0.007292,0.000966,-0.006218
1,1342067,CL0101EN,3,0.068047,-0.112781,0.045208,-0.007570,-0.038382,0.068037,0.114949,...,0.010899,-0.037610,-0.019397,-0.025682,-0.000620,0.038803,0.000196,-0.045343,0.012863,0.019429
2,1990814,ML0120ENv3,5,0.124623,0.012910,-0.072627,0.049935,0.020158,0.133306,-0.035366,...,-0.012695,0.036138,0.019965,0.018686,-0.010450,-0.050011,0.013845,-0.044454,-0.001480,-0.007559
3,380098,BD0211EN,5,-0.034870,0.000715,0.077406,0.070311,-0.043007,-0.035446,0.032846,...,-0.005700,-0.006068,-0.005792,-0.023036,0.015999,-0.023480,0.015469,0.022221,-0.023115,-0.001785
4,779563,DS0101EN,3,0.106414,-0.001887,-0.017211,-0.042277,-0.074953,-0.056732,0.074610,...,-0.010015,-0.001514,-0.017598,0.003590,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,1540125,DS0101EN,5,-0.021376,-0.081750,-0.140323,0.018257,0.070857,-0.150106,-0.101541,...,-0.010015,-0.001514,-0.017598,0.003590,0.016799,0.002732,0.005162,0.015031,-0.000877,-0.021283
233302,1250651,PY0101EN,5,0.038751,-0.045833,0.007787,0.054884,0.008866,-0.016915,-0.007734,...,-0.006246,-0.001485,0.007065,-0.003130,0.007294,-0.000657,0.006152,-0.001489,0.015253,0.000122
233303,1003832,CB0105ENv1,3,0.055601,0.032458,0.138734,-0.103575,-0.040634,0.019715,-0.024687,...,-0.004814,0.032963,-0.018020,0.013813,-0.048995,0.009753,-0.019230,-0.042314,-0.022855,0.008192
233304,922065,BD0141EN,4,0.098573,-0.033596,0.146387,0.002943,0.111133,-0.100475,0.097536,...,0.002402,0.003107,-0.019846,0.013243,0.010134,0.016171,-0.019714,-0.005965,-0.014285,0.006799


Next, we can combine the user features (the column labels starting with `UFeature`) and the item features (the column labels starting with `CFeature`). In machine learning, there are many ways to aggregate two feature vectors, such as element-wise add, multiply, max/min, or average. Here we simply add the two sets of feature columns:


In [14]:
# Define column names for user and course embedding features
u_features = [f"UFeature{i}" for i in range(16)] # Assuming there are 16 user embedding features
c_features = [f"CFeature{i}" for i in range(16)]  # Assuming there are 16 course embedding features

# Extract user embedding features
user_embeddings = merged_df[u_features]
# Extract course embedding features
course_embeddings = merged_df[c_features]
# Extract ratings
ratings = merged_df['rating']

# Aggregate the two feature columns using element-wise add
regression_dataset = user_embeddings + course_embeddings.values
# Rename the columns of the resulting DataFrame
regression_dataset.columns = [f"Feature{i}" for i in range(16)]# Assuming there are 16 features
# Add the 'rating' column from the original DataFrame to the regression dataset
regression_dataset['rating'] = ratings
# Display the regression dataset (the notebook shows the first and last rows)
regression_dataset

Unnamed: 0,Feature0,Feature1,Feature2,Feature3,Feature4,Feature5,Feature6,Feature7,Feature8,Feature9,Feature10,Feature11,Feature12,Feature13,Feature14,Feature15,rating
0,0.090378,-0.134799,0.083900,0.046534,0.077417,-0.004537,-0.018561,0.079236,-0.024561,0.027359,-0.188823,-0.080762,0.050271,-0.066013,0.058894,-0.007689,5
1,0.059437,-0.084740,0.067107,-0.009036,-0.031482,0.050057,0.125847,0.066517,-0.053798,-0.021671,0.064212,0.204660,-0.004188,0.007914,0.027170,0.076114,3
2,0.152061,-0.014739,-0.080112,-0.009516,0.024130,0.153802,-0.048061,-0.119888,0.059234,0.060882,0.004244,-0.166000,0.045002,0.057566,-0.022081,0.108929,5
3,-0.014707,-0.011257,0.073692,0.054763,-0.050547,-0.020599,0.027146,-0.067012,0.106593,-0.020921,0.106658,-0.092025,0.024436,0.086183,0.029232,0.016287,5
4,0.112812,-0.001395,-0.011572,-0.032638,-0.080440,-0.057321,0.064595,-0.020880,-0.048939,0.068486,-0.031359,-0.044577,-0.002381,0.025505,-0.033164,-0.105266,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
233301,-0.014977,-0.081258,-0.134683,0.027895,0.065370,-0.150696,-0.111557,0.068990,0.023886,-0.130328,0.108049,0.113518,0.083626,-0.134038,-0.002495,-0.016603,5
233302,0.026693,-0.047697,0.010914,0.066091,0.023919,-0.017845,-0.013980,-0.010845,0.030093,-0.025450,0.082910,-0.043803,0.015785,0.040697,-0.066637,-0.033264,5
233303,0.049292,0.062408,0.137864,-0.134142,-0.072878,0.031165,-0.029502,0.173918,-0.104943,0.029938,-0.138595,-0.000103,-0.007854,0.026256,-0.072040,0.149764,3
233304,0.106140,-0.062923,0.147306,0.033648,0.101269,-0.099624,0.099939,0.091838,-0.026377,0.046507,0.088269,0.078541,-0.089107,0.001519,-0.048838,0.147942,4
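The element-wise sum used above is only one of the aggregation choices mentioned earlier. As a rough illustration, the sketch below applies several of them to two made-up 4-dimensional vectors (`user_vec` and `item_vec` are hypothetical values, not rows from the lab's data). Concatenation is included as an additional common option that the text does not list:

```python
import numpy as np

# Two made-up 4-dimensional embedding vectors (illustrative values only)
user_vec = np.array([0.2, -0.1, 0.0, 0.4])
item_vec = np.array([0.1, 0.3, -0.2, 0.4])

aggregations = {
    "add": user_vec + item_vec,
    "multiply": user_vec * item_vec,
    "max": np.maximum(user_vec, item_vec),
    "min": np.minimum(user_vec, item_vec),
    "average": (user_vec + item_vec) / 2,
    # Concatenation keeps both vectors intact at the cost of doubling the width
    "concatenate": np.concatenate([user_vec, item_vec]),
}

for name, vec in aggregations.items():
    print(f"{name:>12}: shape={vec.shape}")
```

All element-wise options keep the feature width at the embedding size, while concatenation doubles it. Which one works best is an empirical question for a given dataset.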


By now, we have built the input dataset `X` and the output vector `y`:


In [15]:
X = regression_dataset.iloc[:, :-1]
y = regression_dataset.iloc[:, -1]
print(f"Input data shape: {X.shape}, Output data shape: {y.shape}")

Input data shape: (233306, 16), Output data shape: (233306,)


## TASK: Perform regression on the interaction dataset


Now that the input data `X` and output `y` are ready, let's build regression models that map `X` to `y` and predict ratings. Calling `y.unique()` lists the distinct rating values in the target.


You may use `sklearn` to train and evaluate various regression models.


_TODO: First split dataset into training and testing datasets_


In [17]:
### WRITE YOUR CODE HERE
x_train,x_test,y_train,y_test = train_test_split(X,y,test_size=0.3, random_state=rs)

<details>
    <summary>Click here for Hints</summary>
    
Use `train_test_split()` to split the dataset into training and testing sets. Use `X, y` as the input dataset and output vector. Don't forget to specify `random_state=rs` and `test_size=0.3`.

</details>
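To see what `test_size` and `random_state` do, here is a small sketch on made-up data (`X_toy` and `y_toy` are hypothetical arrays, not the lab's dataset). It splits the same data twice with the same seed and compares the held-out rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 samples with 2 features each
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# Splitting twice with the same random_state yields the same partition
_, X_test_a, _, y_test_a = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)
_, X_test_b, _, y_test_b = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

print(X_test_a.shape)                      # 30% of the 10 rows are held out
print(np.array_equal(y_test_a, y_test_b))  # the same rows are held out both times
```

Fixing the seed this way makes later RMSE comparisons between models refer to the same test rows.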


_TODO: Create a basic linear regression model_


In [23]:
### WRITE YOUR CODE HERE
from sklearn.linear_model import LinearRegression,Ridge,Lasso,ElasticNet

In [19]:
lin_reg = LinearRegression()

<details>
    <summary>Click here for Hints</summary>
    
You can call `LinearRegression()` to create a linear regression object, for example `linear_regression = LinearRegression()`.

</details>


_TODO: Train the basic regression model with training data_


In [20]:
### WRITE YOUR CODE HERE
lin_reg.fit(x_train,y_train)

|Parameter|Value|
|-|-|
|fit_intercept|True|
|copy_X|True|
|tol|1e-06|
|n_jobs|None|
|positive|False|

<details>
    <summary>Click here for Hints</summary>
    
Call the model's `fit()` method with the training features and training labels (`X_train, y_train`) as arguments.

</details>
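As a quick sanity check of what `fit()` estimates, the sketch below trains a linear regression on synthetic data generated from a known rule. The data and the names `X_toy`, `y_toy`, and `toy_model` are made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known rule: y = 2*x0 - 1*x1 + 3 (no noise)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = 2 * X_toy[:, 0] - X_toy[:, 1] + 3

toy_model = LinearRegression().fit(X_toy, y_toy)

# With noise-free data the fitted parameters should recover the generating rule
print(np.round(toy_model.coef_, 3), round(float(toy_model.intercept_), 3))
```

On the lab's data there is no such known rule, so the fitted coefficients can only be judged through the prediction error on the test set.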


_TODO: Evaluate the basic regression model_


In [22]:
### WRITE YOUR CODE HERE
from sklearn.metrics import root_mean_squared_error

# Predict ratings for the held-out test set
predictions = lin_reg.predict(x_test)
# The main evaluation metric is RMSE, but you may use other metrics as well
root_mean_squared_error(y_true=y_test, y_pred=predictions)

0.8138346209967071

<details>
    <summary>Click here for Hints</summary>
    
Call the model's `predict()` method with `X_test` to get the model predictions. Then pass `y_test` and your predictions to `mean_squared_error()` and take the square root of the result to obtain the RMSE, or use `root_mean_squared_error()` if your scikit-learn version provides it.

</details>
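An RMSE value is easier to interpret next to a trivial baseline. The sketch below uses made-up ratings (`y_true_toy` is hypothetical) to compute the RMSE of a model that always predicts the mean, taking the square root of the MSE explicitly. For such a constant predictor evaluated on its own training labels, the RMSE equals the population standard deviation of the labels:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

# Made-up ratings in {3, 4, 5}; features are irrelevant to a constant baseline
y_true_toy = np.array([3, 4, 5, 5, 3, 4, 4, 5])
X_dummy = np.zeros((len(y_true_toy), 1))

# DummyRegressor(strategy="mean") always predicts the mean of the training labels
baseline = DummyRegressor(strategy="mean").fit(X_dummy, y_true_toy)
baseline_pred = baseline.predict(X_dummy)

# RMSE is the square root of the MSE; taking the root explicitly
# works regardless of the installed scikit-learn version
rmse = np.sqrt(mean_squared_error(y_true_toy, baseline_pred))
print(f"Baseline RMSE: {rmse:.4f}")
```

In this lab, `rating_df.describe()` reports a rating standard deviation of about 0.816, so a test RMSE of about 0.814 is probably only slightly better than always predicting the mean rating. Fitting the same baseline on `x_train, y_train` and scoring it on the test set would confirm this.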


_TODO: Try different regression models such as Ridge, Lasso, ElasticNet and tune their hyperparameters to see which one has the best performance_


In [30]:
### WRITE YOUR CODE HERE
ridge=Ridge()
lasso=Lasso()
elastic=ElasticNet()

In [28]:
from sklearn.model_selection import RandomizedSearchCV
import scipy.stats  # import the submodule explicitly so scipy.stats.loguniform is available

In [38]:
parameters_ridge={
    'alpha':scipy.stats.loguniform(10**-5,10**4)
}
parameters_lasso = {
    'alpha': scipy.stats.loguniform(10**-5,10**4)
}
parameters_elastic = {
    'alpha': scipy.stats.loguniform(10**-5,10**4),
    'l1_ratio': scipy.stats.loguniform(10**-5,1)
    }
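A log-uniform distribution spreads candidates evenly across orders of magnitude instead of concentrating them near the upper bound. The sketch below draws samples from the same range as the `alpha` distributions above and counts how many land in each decade. The sample size and seed are arbitrary choices for illustration:

```python
import numpy as np
from scipy import stats

# Same range as the `alpha` distributions above
alpha_dist = stats.loguniform(1e-5, 1e4)

# Draw candidates; a fixed random_state makes the draw repeatable
samples = alpha_dist.rvs(size=1000, random_state=0)

# Count how many samples fall into each decade between 1e-5 and 1e4
decades = np.floor(np.log10(samples)).astype(int)
values, counts = np.unique(decades, return_counts=True)
for d, c in zip(values, counts):
    print(f"1e{d:+d} to 1e{d + 1:+d}: {c}")
```

Each decade should receive a roughly similar share of the samples, which is why this distribution suits regularization strengths whose useful scale is unknown in advance.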

In [32]:
ridge_cv=RandomizedSearchCV(estimator=ridge,param_distributions=parameters_ridge,n_iter=500,scoring='neg_mean_squared_error',n_jobs=-1,verbose=1,cv=3)
ridge_cv.fit(x_train,y_train)

Fitting 3 folds for each of 500 candidates, totalling 1500 fits


RandomizedSearchCV parameters:

|Parameter|Value|
|-|-|
|estimator|Ridge()|
|param_distributions|{'alpha': <scipy.stats....x7438342bd010>}|
|n_iter|500|
|scoring|'neg_mean_squared_error'|
|n_jobs|-1|
|refit|True|
|cv|3|
|verbose|1|
|pre_dispatch|'2*n_jobs'|
|random_state|None|

Best estimator (Ridge) parameters:

|Parameter|Value|
|-|-|
|alpha|np.float64(2728.4331206330617)|
|fit_intercept|True|
|copy_X|True|
|max_iter|None|
|tol|0.0001|
|solver|'auto'|
|positive|False|
|random_state|None|

In [33]:
print(ridge_cv.best_params_)

{'alpha': np.float64(2728.4331206330617)}


In [35]:
print(ridge_cv.best_params_)
ridge.set_params(**ridge_cv.best_params_)
ridge.fit(x_train,y_train)
y_pred=ridge.predict(x_test)
ridge_score=root_mean_squared_error(y_test,y_pred)
ridge_score

0.81381622686498
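Two details of `RandomizedSearchCV` are worth illustrating. First, with `scoring='neg_mean_squared_error'` the reported `best_score_` is a negated MSE, so it has to be sign-flipped and square-rooted before it can be compared with an RMSE. Second, with the default `refit=True` the search already retrains the best configuration on the full training data and exposes it as `best_estimator_`. The sketch below shows both on a small synthetic problem. The data, `n_iter=20`, and the seed are assumptions chosen to keep it fast:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic regression problem (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 5))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.1, size=200)

search = RandomizedSearchCV(
    estimator=Ridge(),
    param_distributions={"alpha": stats.loguniform(1e-5, 1e4)},
    n_iter=20,
    scoring="neg_mean_squared_error",
    cv=3,
    random_state=123,  # fixes which alphas are sampled
)
search.fit(X_toy, y_toy)

# best_score_ is a negated MSE: flip the sign and take the root to get an RMSE
cv_rmse = np.sqrt(-search.best_score_)

# With refit=True (the default), best_estimator_ is already trained on all of X_toy
manual = Ridge(**search.best_params_).fit(X_toy, y_toy)
same = np.allclose(manual.predict(X_toy), search.best_estimator_.predict(X_toy))
print(f"CV RMSE: {cv_rmse:.4f}, refit matches manual fit: {same}")
```

The searches in this lab do not pass `random_state`, so the sampled `alpha` values, and therefore the best parameters, can differ from run to run. Passing `random_state=rs` would be one way to make them repeatable.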

In [36]:
lasso_cv=RandomizedSearchCV(estimator=lasso,param_distributions=parameters_lasso,n_iter=500,scoring='neg_mean_squared_error',n_jobs=-1,verbose=1,cv=3)
lasso_cv.fit(x_train,y_train)
print(lasso_cv.best_params_)
lasso.set_params(**lasso_cv.best_params_)
lasso.fit(x_train,y_train)
y_pred=lasso.predict(x_test)
lasso_score=root_mean_squared_error(y_test,y_pred)
lasso_score

Fitting 3 folds for each of 500 candidates, totalling 1500 fits
{'alpha': np.float64(0.00030173208824153726)}


0.8138177875012098

In [39]:
elastic_cv=RandomizedSearchCV(estimator=elastic,param_distributions=parameters_elastic,n_iter=500,scoring='neg_mean_squared_error',n_jobs=-1,verbose=1,cv=3)
elastic_cv.fit(x_train,y_train)
print(elastic_cv.best_params_)
elastic.set_params(**elastic_cv.best_params_)
elastic.fit(x_train,y_train)
y_pred=elastic.predict(x_test)
elastic_score=root_mean_squared_error(y_test,y_pred)
elastic_score

Fitting 3 folds for each of 500 candidates, totalling 1500 fits
{'alpha': np.float64(0.02450941827900188), 'l1_ratio': np.float64(0.0008914801106233032)}


0.8138160248753364
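To compare the four models side by side, the RMSE values printed in the cells above can be collected into one sorted table. This is a small sketch that copies those numbers by hand. In a live notebook one would more likely reuse the `ridge_score`, `lasso_score`, and `elastic_score` variables:

```python
import pandas as pd

# Test-set RMSE values copied from the cell outputs in this notebook
rmse_scores = pd.Series(
    {
        "LinearRegression": 0.8138346209967071,
        "Ridge": 0.81381622686498,
        "Lasso": 0.8138177875012098,
        "ElasticNet": 0.8138160248753364,
    },
    name="rmse",
).sort_values()

print(rmse_scores)
# Spread between the best and the worst model
print(f"Range: {rmse_scores.max() - rmse_scores.min():.2e}")
```

The models differ only from the fifth decimal place onward, so regularization appears to make little practical difference on these features. The ranking could plausibly change with a different split or search seed.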

### Summary


In this lab, you built regression models to predict numerical course ratings using the embedding feature vectors extracted from neural networks. In the next lab, we will treat the prediction problem as a classification problem: the rating takes only a few discrete values (3, 4, and 5 in this dataset), so classification can be a more natural problem statement.


## Authors


[Yan Luo](https://www.linkedin.com/in/yan-luo-96288783/)


### Other Contributors


## Change Log


|Date (YYYY-MM-DD)|Version|Changed By|Change Description|
|-|-|-|-|
|2021-10-25|1.0|Yan|Created the initial version|


Copyright © 2021 IBM Corporation. All rights reserved.
