## Recommendation System: Overview and Dataset

Recommendation systems are what make platforms like **Netflix, Amazon, Spotify, and YouTube** so personalized. They suggest what to watch, buy, or listen to next based on content or user behavior.

In this project, we’ll build a **content-based recommendation system** for Netflix 2023 content data. The dataset includes:

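* `Title` – name of the show/movie
* `Available Globally?` – availability flag (e.g. `Yes`)
* `Release Date` – release date in `YYYY-MM-DD` format
* `Hours Viewed` – popularity measure
* `Language Indicator` – language of the title (e.g. English, Korean)
* `Content Type` – movie, series, etc.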

We’ll convert this data to a numerical format, train a neural network, and use embeddings to find **similar content**.

## Step 1: Load and Understand the Dataset

In [18]:
import pandas as pd

# Load dataset
df = pd.read_csv("netflix_content.csv")
df.head()

|   | Title | Available Globally? | Release Date | Hours Viewed | Language Indicator | Content Type |
|---|---|---|---|---|---|---|
| 0 | The Night Agent: Season 1 | Yes | 2023-03-23 | 812100000 | English | Show |
| 1 | Ginny & Georgia: Season 2 | Yes | 2023-01-05 | 665100000 | English | Show |
| 2 | The Glory: Season 1 // 더 글로리: 시즌 1 | Yes | 2022-12-30 | 622800000 | Korean | Show |
| 3 | Wednesday: Season 1 | Yes | 2022-11-23 | 507700000 | English | Show |
| 4 | Queen Charlotte: A Bridgerton Story | Yes | 2023-05-04 | 503000000 | English | Movie |


Each title comes with metadata (language, content type, release date, and hours viewed) that a content-based model can use to judge how similar two titles are.
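Before cleaning, it can help to check for missing values and see how the categorical columns are distributed. The sketch below runs on a tiny stand-in frame built from three of the sample rows shown above, not on the real CSV; the variable names (`sample`, `missing_per_column`, and so on) are illustrative.

```python
import pandas as pd

# Stand-in for netflix_content.csv: three of the sample rows shown above.
sample = pd.DataFrame({
    "Title": [
        "The Night Agent: Season 1",
        "Wednesday: Season 1",
        "Queen Charlotte: A Bridgerton Story",
    ],
    "Available Globally?": ["Yes", "Yes", "Yes"],
    "Release Date": ["2023-03-23", "2022-11-23", "2023-05-04"],
    "Hours Viewed": [812100000, 507700000, 503000000],
    "Language Indicator": ["English", "English", "English"],
    "Content Type": ["Show", "Show", "Movie"],
})

# Missing values per column (rows with a missing Title are dropped in Step 2).
missing_per_column = sample.isna().sum()

# How many rows fall into each content type and language.
type_counts = sample["Content Type"].value_counts()
language_counts = sample["Language Indicator"].value_counts()

print(missing_per_column)
print(type_counts)
print(language_counts)
```

On the full dataset, the same three calls applied to `df` would show which columns need cleaning and how unbalanced the categories are.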



## Step 2: Clean and Preprocess the Data

In [19]:
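# Remove thousands separators from 'Hours Viewed' and convert to integer.
# astype(str) keeps this working if pandas already parsed the column as numeric.
df['Hours Viewed'] = df['Hours Viewed'].astype(str).str.replace(',', '', regex=False).astype('int64')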

# Drop missing or duplicate titles
df.dropna(subset=['Title'], inplace=True)
df.drop_duplicates(subset=['Title'], inplace=True)

# Create IDs for embeddings
df['Content_ID'] = df.reset_index().index.astype('int32')

# Encode categorical features
df['Language_ID'] = df['Language Indicator'].astype('category').cat.codes
df['ContentType_ID'] = df['Content Type'].astype('category').cat.codes

df[['Content_ID', 'Title', 'Hours Viewed', 'Language_ID', 'ContentType_ID']].head()

|   | Content_ID | Title | Hours Viewed | Language_ID | ContentType_ID |
|---|---|---|---|---|---|
| 0 | 0 | The Night Agent: Season 1 | 812100000 | 0 | 1 |
| 1 | 1 | Ginny & Georgia: Season 2 | 665100000 | 0 | 1 |
| 2 | 2 | The Glory: Season 1 // 더 글로리: 시즌 1 | 622800000 | 3 | 1 |
| 3 | 3 | Wednesday: Season 1 | 507700000 | 0 | 1 |
| 4 | 4 | Queen Charlotte: A Bridgerton Story | 503000000 | 0 | 0 |


TensorFlow's embedding layers take integer indices, so we convert the categorical string columns (`Language Indicator` and `Content Type`) to numeric codes. Each content item now has its own ID plus an encoded language and type.
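To see what `astype('category').cat.codes` does, here is a small sketch on a made-up series (the values are illustrative, not taken from the dataset). pandas sorts the distinct values, numbers them from 0, and assigns `-1` to missing entries, which an embedding layer cannot look up, so missing values are worth handling before encoding.

```python
import pandas as pd

# Hypothetical language column with one missing value.
languages = pd.Series(["English", "Korean", "English", None, "Japanese"])

as_category = languages.astype("category")
categories = list(as_category.cat.categories)  # sorted distinct values
codes = as_category.cat.codes                  # integer code per row, -1 for missing

# Mapping from code back to the original label, useful when reading results.
code_to_label = dict(enumerate(categories))

print(categories)
print(codes.tolist())
print(code_to_label)
```

Because the codes follow sorted order, the same label can receive a different code if the set of categories changes between runs on different data.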

## Step 3: Build a Neural Recommendation Model

In [20]:
import tensorflow as tf
from tensorflow.keras import layers, Model

num_contents = df['Content_ID'].nunique()
num_languages = df['Language_ID'].nunique()
num_types = df['ContentType_ID'].nunique()

# Input layers
content_input = layers.Input(shape=(1,), dtype=tf.int32, name='content_id')
language_input = layers.Input(shape=(1,), dtype=tf.int32, name='language_id')
type_input = layers.Input(shape=(1,), dtype=tf.int32, name='content_type')

# Embeddings
content_embedding = layers.Embedding(input_dim=num_contents+1, output_dim=32)(content_input)
language_embedding = layers.Embedding(input_dim=num_languages+1, output_dim=8)(language_input)
type_embedding = layers.Embedding(input_dim=num_types+1, output_dim=4)(type_input)

# Flatten embeddings
content_vec = layers.Flatten()(content_embedding)
language_vec = layers.Flatten()(language_embedding)
type_vec = layers.Flatten()(type_embedding)

# Combine features
combined = layers.Concatenate()([content_vec, language_vec, type_vec])
x = layers.Dense(64, activation='relu')(combined)
x = layers.Dense(32, activation='relu')(x)
output = layers.Dense(num_contents, activation='softmax')(x)

# Build and compile model
model = Model(inputs=[content_input, language_input, type_input], outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

Embeddings map each discrete ID (a content ID, language, or content type) to a dense, learned vector. The goal is for similar content to end up with similar vectors, so the model can capture relationships between shows and movies.
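Conceptually, an embedding layer is a lookup into a trainable table with one row per ID. The NumPy sketch below illustrates the idea with a random table standing in for learned weights; the sizes and names are assumptions chosen for the example. It also shows that the lookup is equivalent to multiplying a one-hot vector by the table.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, embed_dim = 5, 4                 # illustrative sizes
embedding_table = rng.normal(size=(vocab_size, embed_dim))

ids = np.array([3, 0, 3])                    # a batch of integer IDs

# Embedding lookup: pick one row of the table per ID.
looked_up = embedding_table[ids]             # shape (3, 4)

# Same result via one-hot encoding followed by a matrix product.
one_hot = np.eye(vocab_size)[ids]            # shape (3, 5)
via_matmul = one_hot @ embedding_table       # shape (3, 4)

print(looked_up.shape)
print(np.allclose(looked_up, via_matmul))
```

The lookup form avoids materialising the one-hot matrix, which matters when the vocabulary has thousands of content IDs.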


## Step 4: Train the Recommendation Model

In [21]:
model.fit(
    x={
        'content_id': df['Content_ID'],
        'language_id': df['Language_ID'],
        'content_type': df['ContentType_ID']
    },
    y=df['Content_ID'],
    epochs=5,
    batch_size=64
)

Epoch 1/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 6s 17ms/step - accuracy: 0.0000e+00 - loss: 9.9127
Epoch 2/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.0000e+00 - loss: 9.8677
Epoch 3/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 5.2198e-04 - loss: 9.4953
Epoch 4/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 5s 16ms/step - accuracy: 0.0129 - loss: 7.7804
Epoch 5/5
300/300 ━━━━━━━━━━━━━━━━━━━━ 5s 17ms/step - accuracy: 0.1430 - loss: 5.5733


<keras.src.callbacks.history.History at 0x14c7856bc80>

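This is a self-supervised setup: each title's own `Content_ID` serves as the training label, so no user ratings are required. The model learns to predict a title from its ID and metadata, and in doing so builds a **vector space** in which titles that share metadata are expected to sit close together.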
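A quick sanity check on the training log: a softmax classifier that spreads probability evenly over N classes has a cross-entropy loss of ln N. The notebook does not state the number of titles, so the sketch below infers a range from the log (300 steps per epoch at batch size 64) and assumes one class per title. The helper name `uniform_softmax_loss` is made up for this example.

```python
import math

def uniform_softmax_loss(num_classes: int) -> float:
    """Cross-entropy of a uniform prediction over num_classes classes."""
    return math.log(num_classes)

batch_size = 64
steps_per_epoch = 300

# 300 steps means between 299 full batches plus one row and 300 full batches.
min_rows = batch_size * (steps_per_epoch - 1) + 1
max_rows = batch_size * steps_per_epoch

lower = uniform_softmax_loss(min_rows)
upper = uniform_softmax_loss(max_rows)

print(f"rows between {min_rows} and {max_rows}")
print(f"expected starting loss between {lower:.3f} and {upper:.3f}")
```

The first-epoch loss in the log (9.9127) is close to this range, which is consistent with the model starting from near-uniform predictions.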

## Step 5: Recommend Similar Content

In [22]:
import numpy as np

def recommend_similar(content_title, top_k=5):
    # Find the first title containing the query (case-insensitive, literal match)
    matches = df[df['Title'].str.contains(content_title, case=False, na=False, regex=False)]
    if matches.empty:
        raise ValueError(f"No title matching {content_title!r}")
    content_row = matches.iloc[0]
    content_id = content_row['Content_ID']
    language_id = content_row['Language_ID']
    content_type_id = content_row['ContentType_ID']

    # Softmax scores over all Content_IDs for this title
    predictions = model.predict({
        'content_id': np.array([content_id]),
        'language_id': np.array([language_id]),
        'content_type': np.array([content_type_id])
    })

    # Take the top_k + 1 highest-scoring IDs; the extra slot allows for the
    # query title itself, which usually ranks among them
    top_indices = predictions[0].argsort()[-top_k-1:][::-1]
    # Note: isin() returns rows in dataset order, not ranked by score
    recommendations = df[df['Content_ID'].isin(top_indices)]
    return recommendations[['Title', 'Language Indicator', 'Content Type', 'Hours Viewed']]

# Example recommendation
recommend_similar("Wednesday")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 271ms/step


|   | Title | Language Indicator | Content Type | Hours Viewed |
|---|---|---|---|---|
| 3 | Wednesday: Season 1 | English | Show | 507700000 |
| 3193 | Somebody Feed Phil: Season 6 | English | Show | 5800000 |
| 6161 | 1983: Season 1 | English | Show | 1800000 |
| 8501 | 24 Hours in A&E: Season 15 | English | Show | 800000 |
| 10856 | Innocent (2018): Season 1 | English | Show | 400000 |
| 13003 | First Class: Season 1 | English | Show | 200000 |


The model's suggestions match the query title on **language and content type**, the two metadata features it was trained on (`Hours Viewed` is not a model input). Even without user feedback, this gives a simple baseline for finding related titles.
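An alternative way to find similar content is to compare embedding vectors directly with cosine similarity, dropping the query itself and keeping the results in ranked order. The sketch below is one possible approach, not the notebook's method: `top_k_similar` is a hypothetical helper, and the toy matrix stands in for learned weights. In the notebook, such a matrix could plausibly be read from the content embedding layer with Keras's `get_weights()[0]`, assuming a reference to that layer is kept.

```python
import numpy as np

def top_k_similar(embeddings: np.ndarray, query_idx: int, top_k: int = 5) -> list[int]:
    """Return row indices of the top_k nearest rows by cosine similarity."""
    # Normalise each row to unit length (guarding against zero vectors).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)

    # Cosine similarity between the query row and every row.
    scores = unit @ unit[query_idx]

    # Exclude the query itself, then sort from most to least similar.
    scores[query_idx] = -np.inf
    ranked = np.argsort(scores)[::-1]
    return ranked[:top_k].tolist()

# Toy 2-D embeddings (illustrative values only).
toy = np.array([
    [1.0, 0.0],    # 0: query
    [0.9, 0.1],    # 1: nearly parallel to the query
    [0.0, 1.0],    # 2: orthogonal
    [-1.0, 0.0],   # 3: opposite direction
    [0.5, 0.5],    # 4: 45 degrees away
])

neighbours = top_k_similar(toy, query_idx=0, top_k=3)
print(neighbours)
```

The returned indices are positions in the embedding matrix, so they would map to `Content_ID` values when the matrix rows follow the same numbering.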


## Final Summary

In this project, I built a **content-based recommendation system** using Netflix metadata and TensorFlow. Encoding content features as embeddings lets the model relate titles that share metadata. After training, it can recommend shows and movies related to a given title: for example, if someone liked *Wednesday*, it suggests other English-language shows.
