# Using an ML model to recommend content (e.g., movies, articles, songs) to users based on preferences or features

In [2]:
import pandas as pd

## Load Data
I will be using the MovieLens data set (https://grouplens.org/datasets/movielens/) for my Portfolio Project. The 100K version used here contains 100,000 ratings by 943 users on 1,682 items.


In [3]:
# Load ratings
column_names = ['user_id', 'item_id', 'rating', 'timestamp']
ratings = pd.read_csv('ml-100k/u.data', sep='\t', names=column_names)

# Load movie information
movie_columns = [
    'item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL',
    'unknown', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime',
    'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror', 'Musical',
    'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'
]
movies = pd.read_csv('ml-100k/u.item', sep='|', encoding='latin-1', names=movie_columns)

# Merge ratings with movie metadata
data = pd.merge(ratings, movies, on='item_id')

# Drop unused columns
data = data.drop(['timestamp', 'release_date', 'video_release_date', 'IMDb_URL'], axis=1)

# Preview
data.head()

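| | user_id | item_id | rating | title | unknown | Action | Adventure | Animation | Children | Comedy | ... | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 196 | 242 | 3 | Kolya (1996) | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 186 | 302 | 3 | L.A. Confidential (1997) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 22 | 377 | 1 | Heavyweights (1994) | 0 | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 244 | 51 | 2 | Legends of the Fall (1994) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 4 | 166 | 346 | 1 | Jackie Brown (1997) | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |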


## Checks for Missing Data and Other Preprocessing
The **MovieLens 100K dataset** (https://grouplens.org/datasets/movielens/100k/) is **very clean** and was curated specifically for ML research. Users with fewer than 20 ratings or without complete demographic information were removed before release, so I expect that **minimal preprocessing is needed**.
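
The "at least 20 ratings per user" rule can be spot-checked with a groupby. The following is a minimal sketch on a toy frame. The helper name `users_below_min_ratings` and the toy data are illustrative assumptions, not part of the notebook.

```python
import pandas as pd

def users_below_min_ratings(ratings: pd.DataFrame, min_ratings: int = 20) -> pd.Series:
    """Return the rating counts of users who have fewer than `min_ratings` ratings."""
    counts = ratings.groupby('user_id')['item_id'].count()
    return counts[counts < min_ratings]

# Toy data (hypothetical): user 1 has three ratings, user 2 has one
toy = pd.DataFrame({
    'user_id': [1, 1, 1, 2],
    'item_id': [10, 11, 12, 10],
    'rating':  [5, 3, 4, 2],
})

# A low threshold is used here only because the toy frame is tiny
sparse_users = users_below_min_ratings(toy, min_ratings=2)
print(sparse_users)
```

On the real `ratings` frame, `users_below_min_ratings(ratings)` should come back empty if the dataset description holds.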

#### Check for Missing Values
The cell below counts the null values in each column of the merged DataFrame. Given how the dataset was curated, every count should be zero.

In [4]:
print("Missing values in ratings + metadata:")
print(data.isnull().sum())
# Expected result: all zeros (no missing values)

Missing values in ratings + metadata:
user_id        0
item_id        0
rating         0
title          0
unknown        0
Action         0
Adventure      0
Animation      0
Children       0
Comedy         0
Crime          0
Documentary    0
Drama          0
Fantasy        0
Film-Noir      0
Horror         0
Musical        0
Mystery        0
Romance        0
Sci-Fi         0
Thriller       0
War            0
Western        0
dtype: int64


#### Check for Duplicates

In [5]:
print("Number of duplicate rows:", data.duplicated().sum())
# There shouldn't be any duplicates

Number of duplicate rows: 0
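
`data.duplicated()` only flags rows that are identical in every column, so a user who rated the same movie twice with different scores would pass unnoticed. A stricter check restricts the comparison to the `(user_id, item_id)` pair. This is a minimal sketch on toy data (the `toy` frame is hypothetical, and whether such pairs exist in MovieLens 100K is not verified here):

```python
import pandas as pd

# Toy data (hypothetical): user 1 rated item 10 twice with different scores
toy = pd.DataFrame({
    'user_id': [1, 1, 2],
    'item_id': [10, 10, 10],
    'rating':  [5, 3, 4],
})

# Full-row comparison: the two user-1 rows differ in 'rating', so none are flagged
full_row_dupes = int(toy.duplicated().sum())

# Pair-level comparison: the second (user 1, item 10) row is flagged
pair_dupes = int(toy.duplicated(subset=['user_id', 'item_id']).sum())

print("Full-row duplicates:", full_row_dupes)
print("Duplicate (user, item) pairs:", pair_dupes)
```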


## Prepare Features and Labels

We'll frame this as a binary classification problem: the label is 1 if rating ≥ 4 (the user liked the movie) and 0 otherwise. The genre flags serve as features.

In [6]:
# Create binary target variable
data['liked'] = data['rating'] >= 4

# Use genre columns as features
genre_cols = movie_columns[6:]  # The 18 named genre columns ('unknown' at index 5 is excluded)
X = data[genre_cols]
y = data['liked'].astype(int)

print("Feature matrix shape:", X.shape)
print("Target vector shape:", y.shape)

Feature matrix shape: (100000, 18)
Target vector shape: (100000,)


### Check for Class Imbalance

The output below shows roughly a 55:45 split between liked and not-liked ratings. That is relatively balanced compared with the severe imbalances seen in some other datasets (90:10 or worse), so most algorithms should handle it reasonably well without major modifications.

In [7]:
print(data['liked'].value_counts(normalize=True))

liked
True     0.55375
False    0.44625
Name: proportion, dtype: float64
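
A classifier that always predicts the majority class would score about 55% accuracy on data with this split, which gives a useful floor when reading accuracy numbers. Below is a minimal sketch using scikit-learn's `DummyClassifier` on synthetic labels. The names `X_toy` and `y_toy` and the 554/446 counts are illustrative assumptions that only approximate the proportions above.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic labels (assumption): 554 "liked" and 446 "not liked"
y_toy = np.array([1] * 554 + [0] * 446)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)

baseline_acc = accuracy_score(y_toy, baseline.predict(X_toy))
print("Majority-class baseline accuracy:", round(baseline_acc, 3))
```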


## Train / Validation / Test Split

In [8]:
from sklearn.model_selection import train_test_split

# Step 1: First split into train+val and test
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Step 2: Split train+val into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp)
# Note: 0.25 * 0.8 = 0.2 → so you get 60% train, 20% val, 20% test

print("Train size:", len(y_train))
print("Validation size:", len(y_val))
print("Test size:", len(y_test))

Train size: 60000
Validation size: 20000
Test size: 20000
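
To see what `stratify` does, the same two-step split can be run on synthetic labels and the class ratio compared across partitions. This is a sketch under stated assumptions: `X_toy`, `y_toy`, and the 550/450 class counts are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 1000 samples, 55% positive
rng = np.random.default_rng(42)
y_toy = np.array([1] * 550 + [0] * 450)
X_toy = rng.integers(0, 2, size=(1000, 3))

# Same two-step procedure: 20% test, then 25% of the remainder for validation
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)

# Each partition should keep a positive rate close to 0.55
for name, part in [('train', y_tr), ('val', y_va), ('test', y_te)]:
    print(name, len(part), round(float(part.mean()), 3))
```

Applying the same loop to `y_train`, `y_val`, and `y_test` from the notebook would show whether each partition preserves the 55:45 ratio.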


## Train the Model and Predict on the Validation Set Using Logistic Regression

In [9]:
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report

# Scale the features (fit on the training set only, then reuse for validation and test)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_val_scaled = scaler.transform(X_val)
X_test_scaled = scaler.transform(X_test)

# Train the Logistic Regression model
logreg = LogisticRegression(random_state=42, max_iter=1000)
logreg.fit(X_train_scaled, y_train)

# Predict on validation set
y_val_pred_lr = logreg.predict(X_val_scaled)

# Evaluate
print("Logistic Regression - Validation Performance:\n")
print(classification_report(y_val, y_val_pred_lr))


Logistic Regression - Validation Performance:

              precision    recall  f1-score   support

           0       0.55      0.34      0.42      8925
           1       0.59      0.77      0.67     11075

    accuracy                           0.58     20000
   macro avg       0.57      0.56      0.55     20000
weighted avg       0.57      0.58      0.56     20000
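
Because the features are genre flags, the fitted coefficients indicate which genres push a prediction toward "liked". The sketch below shows one way to inspect them. It uses a small, deterministic synthetic dataset, not MovieLens, and the two-genre setup and like rates are assumptions chosen for illustration.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): (Drama, Horror, n_liked, n_not_liked) per combination
rows = []
for drama, horror, n_liked, n_not in [(1, 0, 80, 20), (0, 1, 20, 80),
                                      (0, 0, 50, 50), (1, 1, 50, 50)]:
    rows += [(drama, horror, 1)] * n_liked + [(drama, horror, 0)] * n_not
toy = pd.DataFrame(rows, columns=['Drama', 'Horror', 'liked'])

toy_features = ['Drama', 'Horror']
toy_model = LogisticRegression(max_iter=1000)
toy_model.fit(toy[toy_features], toy['liked'])

# One coefficient per feature; a positive value raises the predicted chance of "liked"
coefs = pd.Series(toy_model.coef_[0], index=toy_features).sort_values(ascending=False)
print(coefs)
```

In the notebook, the analogous call would be `pd.Series(logreg.coef_[0], index=genre_cols)`. Since that model was trained on standardized features, each coefficient there is per standard deviation of the genre flag.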



## Evaluation on the Test Set

In [10]:
# Final test set evaluation
y_test_pred_lr = logreg.predict(X_test_scaled)

print("Logistic Regression - Test Performance:\n")
print(classification_report(y_test, y_test_pred_lr))


Logistic Regression - Test Performance:

              precision    recall  f1-score   support

           0       0.54      0.34      0.42      8925
           1       0.59      0.77      0.67     11075

    accuracy                           0.58     20000
   macro avg       0.57      0.56      0.54     20000
weighted avg       0.57      0.58      0.56     20000

