# 🎓 Netflix Modeling + Deployment Capstone Notebook


This notebook builds and explains a machine learning model that predicts whether a Netflix title is a Movie or TV Show.
It includes:
- Preprocessing and feature engineering
- Hyperparameter tuning
- Model evaluation
- SHAP explainability
- Streamlit + Docker deployment templates


## 📥 Step 1: Load and Prepare Data

In [None]:

import pandas as pd
df = pd.read_csv("netflix_titles.csv")

# Fill missing values
df['description'] = df['description'].fillna('')
df['listed_in'] = df['listed_in'].fillna('')
df['rating'] = df['rating'].fillna('Unknown')
df['release_year'] = pd.to_numeric(df['release_year'], errors='coerce')
df = df.dropna(subset=['release_year'])

# Binary encode type
df['type_encoded'] = df['type'].map({'Movie': 1, 'TV Show': 0})



We clean the dataset by filling missing values, converting `release_year`, and encoding `type` as binary (Movie = 1, TV Show = 0).
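To make the effect of these cleaning steps concrete, here is a minimal sketch on a hypothetical three-row frame (the `toy` DataFrame and its values are invented for illustration and are not taken from `netflix_titles.csv`). It applies the same `fillna`, `to_numeric`, `dropna`, and `map` calls and shows that the row with an unparseable year is dropped.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "rating": ["PG", None, "R"],
    "release_year": ["2019", "2020", "not a year"],
})

# Same steps as the cell above, applied to the toy frame
toy["rating"] = toy["rating"].fillna("Unknown")
toy["release_year"] = pd.to_numeric(toy["release_year"], errors="coerce")
toy = toy.dropna(subset=["release_year"])  # drops the "not a year" row
toy["type_encoded"] = toy["type"].map({"Movie": 1, "TV Show": 0})

remaining_rows = len(toy)
encoded_types = toy["type_encoded"].tolist()
ratings = toy["rating"].tolist()
print(remaining_rows, encoded_types, ratings)
```

Running the same kind of check on the real `df` (for example, `df['type_encoded'].isna().sum()`) is a cheap way to confirm that every `type` value was mapped.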


## 🛠️ Step 2: Feature Engineering

In [None]:

from sklearn.preprocessing import LabelEncoder

rating_enc = LabelEncoder()
df['rating_encoded'] = rating_enc.fit_transform(df['rating'])

# Count non-empty genres; a blank `listed_in` yields 0 instead of 1
df['genre_count'] = df['listed_in'].apply(lambda x: len([g for g in x.split(",") if g.strip()]))

features = ['release_year', 'rating_encoded', 'genre_count']
X = df[features]
y = df['type_encoded']



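We encode each rating as an integer label and count the genres listed per title. Together with `release_year`, these form the three model features. The integer codes follow the sorted order of the rating strings and carry no ordinal meaning.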
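One subtlety when counting genres: splitting an empty string still returns a one-element list, so a title with no genres can be miscounted as having one. The sketch below uses a hypothetical helper, `count_genres`, to show a count that treats blank input as zero; the example strings are illustrative.

```python
def count_genres(listed_in: str) -> int:
    """Count comma-separated genres, treating blank entries as absent."""
    return len([g for g in listed_in.split(",") if g.strip()])


# A naive split reports one "genre" for an empty string
naive_empty = len("".split(", "))

two_genres = count_genres("Dramas, International Movies")
no_genres = count_genres("")
print(naive_empty, two_genres, no_genres)
```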


## 🤖 Step 3: Train + Tune Random Forest Model

In [None]:

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
import joblib

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2, random_state=42)

param_grid = {
    'n_estimators': [100, 150],
    'max_depth': [10, None]
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, scoring='f1')
grid.fit(X_train, y_train)

best_model = grid.best_estimator_
joblib.dump(best_model, "best_type_model.pkl")
joblib.dump(rating_enc, "rating_encoder.pkl")



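We hold out 20% of the data as a stratified test set, then tune `n_estimators` and `max_depth` for a `RandomForestClassifier` with a 3-fold grid search scored by F1. The best estimator and the rating encoder are saved with `joblib` for deployment.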
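After a grid search it is usually worth checking which combination won and what its cross-validated score was. The following is a minimal sketch on synthetic data from `make_classification` (a stand-in for `X_train` and `y_train`, with a smaller grid so it runs quickly); the names `X_demo`, `y_demo`, and `demo_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in: three features and a binary target
X_demo, y_demo = make_classification(
    n_samples=200, n_features=3, n_informative=2, n_redundant=0, random_state=42
)

demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"n_estimators": [10, 20], "max_depth": [3, None]},
    cv=3,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# best_params_ holds the winning combination; best_score_ is its mean CV F1
print("Best params:", demo_grid.best_params_)
print("Best mean CV F1:", round(demo_grid.best_score_, 3))
```

The same two attributes are available on the notebook's `grid` object after `grid.fit(X_train, y_train)`.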


## 📊 Step 4: Evaluate Model

In [None]:

from sklearn.metrics import classification_report, roc_auc_score

y_pred = best_model.predict(X_test)
y_proba = best_model.predict_proba(X_test)[:, 1]

print(classification_report(y_test, y_pred))
print("ROC AUC Score:", roc_auc_score(y_test, y_proba))



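We evaluate on the held-out test set. `classification_report` gives per-class precision, recall, and F1, and ROC AUC measures how well the predicted Movie probabilities rank Movies above TV Shows.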
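To see how ROC AUC differs from thresholded metrics, consider a small worked example with four hypothetical labels and scores (not model output). ROC AUC depends only on how the scores rank positives against negatives, while the confusion matrix depends on the 0.5 cutoff.

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical ground truth (1 = Movie, 0 = TV Show) and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# 3 of the 4 (positive, negative) pairs are ranked correctly, so AUC = 0.75
auc = roc_auc_score(y_true, y_score)

# Thresholding at 0.5 turns scores into hard predictions
y_hat = [int(s >= 0.5) for s in y_score]
cm = confusion_matrix(y_true, y_hat)  # rows: true class, columns: predicted class

print("ROC AUC:", auc)
print(cm)
```

Here one Movie (score 0.35) falls below the cutoff and is predicted as a TV Show, which lowers recall for the Movie class even though the ranking is mostly correct.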


## 🧠 Step 5: SHAP Explainability

In [None]:

import shap
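explainer = shap.Explainer(best_model, X_train)
shap_values = explainer(X_test)
# Some SHAP versions return one set of values per class; keep class 1 (Movie)
if shap_values.values.ndim == 3:
    shap_values = shap_values[:, :, 1]
shap.summary_plot(shap_values, X_test)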



SHAP attributes each prediction to the input features, showing which ones pushed it toward Movie or TV Show and by how much.


## 🚀 Step 6: Streamlit & Docker Deployment

In [None]:

# Streamlit snippet (save as streamlit_app.py, the file the Dockerfile below runs)
'''
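import streamlit as st
import joblib
import pandas as pd

model = joblib.load("best_type_model.pkl")
encoder = joblib.load("rating_encoder.pkl")

year = st.slider("Release Year", 1950, 2025, 2020)
# Offer only the ratings the encoder saw during training
rating = st.selectbox("Rating", list(encoder.classes_))
rating_enc = encoder.transform([rating])[0]
genre_count = st.number_input("Number of Genres", min_value=0, value=2)

if st.button("Predict Type"):
    # Reuse the training column names so the input matches the fitted model
    row = pd.DataFrame([[year, rating_enc, genre_count]],
                       columns=['release_year', 'rating_encoded', 'genre_count'])
    pred = model.predict(row)[0]
    st.success("Prediction: Movie" if pred else "Prediction: TV Show")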
'''


In [None]:

# Dockerfile snippet
'''
FROM python:3.10
WORKDIR /app
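# Install dependencies first so this layer is cached when only app code changes
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
EXPOSE 8501
CMD ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]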
'''
