# Netflix Movies & TV Shows — Unsupervised Clustering
**Ready-to-run Colab notebook**

**What this notebook does**
- Loads the Netflix dataset (2019 snapshot) from a Google Drive link.
- Performs EDA, data cleaning, and feature engineering (TF-IDF on descriptions plus an encoded content-type feature).
- Runs KMeans and Agglomerative clustering and evaluates both with the silhouette score.
- Visualizes the KMeans clusters using PCA and saves the cluster labels to a CSV file.

**How to use**
1. Open this notebook in Google Colab.
2. Run cells from top to bottom.
3. The download cell fetches the dataset using the provided Drive file ID. If gdown fails, upload the CSV manually to the Colab session and set `output` to its path.
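
Step 3 can be made a little more robust by checking which file actually exists before loading. The following is a minimal sketch, not part of the original notebook: `find_dataset` is a hypothetical helper, and both candidate file names are assumptions that should be adjusted to whatever the uploaded file is called.

```python
from pathlib import Path

def find_dataset(candidates):
    """Return the first candidate path that exists as a file, else None."""
    for name in candidates:
        path = Path(name)
        if path.is_file():
            return path
    return None

# Candidate names are assumptions; edit them to match your session.
csv_path = find_dataset([
    'netflix_titles.csv',
    'NETFLIX MOVIES AND TV SHOWS CLUSTERING.csv',
])
if csv_path is None:
    print('No dataset file found. Upload the CSV to the session first.')
else:
    print('Using dataset at', csv_path)
```

If a path is found, it can be passed straight to `pd.read_csv`.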


In [None]:
# Install required packages (only needed in Colab)
!pip install -q scikit-learn pandas matplotlib seaborn gdown scipy

In [None]:
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score
from scipy.sparse import hstack

sns.set(style='whitegrid')


In [None]:
# Download dataset from Google Drive using gdown.
FILE_ID = '1xJGlInE12mAggLuRo8b0oNSshUIG8GvF'  # Google Drive file ID of the dataset CSV
output = 'netflix_titles.csv'

try:
    import gdown
    url = f'https://drive.google.com/uc?id={FILE_ID}'
    print('Downloading dataset...')
    gdown.download(url, output, quiet=False)
except Exception as e:
    print('gdown failed or isn\'t available.\nError:', e)
    print('If running in Colab, enable internet or upload the CSV manually to the session and set "output" to that path.')


In [None]:
# Load dataset (expects the CSV at the path stored in `output` by the download cell)
try:
    df = pd.read_csv(output)
    print('Loaded dataset with shape:', df.shape)
    display(df.head())
except FileNotFoundError:
    print(f"File {output} not found in the session. Please upload it or re-run the download cell.")

In [None]:
# Basic dataset information
df.info()

In [None]:
# Missing values summary
df.isnull().sum()

In [None]:
# Type distribution (Movie vs TV Show)
plt.figure(figsize=(6,4))
sns.countplot(data=df, x='type')
plt.title('Movies vs TV Shows')
plt.show()


In [None]:
# Top 15 countries by content (note: 'country' may have multiple countries per row)
top_countries = (df['country'].dropna()
                  .str.split(',').explode()
                  .str.strip()
                  .value_counts().head(15))
plt.figure(figsize=(8,6))
sns.barplot(x=top_countries.values, y=top_countries.index)
plt.title('Top 15 countries by content count')
plt.xlabel('Count')
plt.show()


In [None]:
# Releases over years
plt.figure(figsize=(10,4))
df['release_year'].value_counts().sort_index().plot(kind='line')
plt.title('Content by Release Year')
plt.xlabel('Year')
plt.ylabel('Count')
plt.show()


In [None]:
# Basic cleaning
df = df.drop_duplicates(subset=['title', 'type', 'release_year'])
df = df.dropna(subset=['title', 'description'])
df.reset_index(drop=True, inplace=True)
print('After cleaning shape:', df.shape)


In [None]:
# Feature engineering
df['type_encoded'] = df['type'].map({'Movie':0, 'TV Show':1})
df['primary_genre'] = df['listed_in'].fillna('Unknown').apply(lambda x: x.split(',')[0].strip())
df['desc'] = df['description'].fillna('')
df[['title','type','type_encoded','primary_genre']].head()


In [None]:
# TF-IDF on description
tfidf = TfidfVectorizer(stop_words='english', max_features=3000)
tfidf_matrix = tfidf.fit_transform(df['desc'])
print('TF-IDF matrix shape:', tfidf_matrix.shape)


In [None]:
# Numeric features to combine (scaled)
num_feats = df[['type_encoded']].astype(float)
scaler = StandardScaler()
num_scaled = scaler.fit_transform(num_feats)
from scipy.sparse import csr_matrix
num_scaled_sparse = csr_matrix(num_scaled)

# Combined feature matrix (CSR format, so estimators do not have to convert it on every call)
X = hstack([num_scaled_sparse, tfidf_matrix], format='csr')
print('Combined feature matrix shape:', X.shape)
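
One thing to keep in mind about the combined matrix: `TfidfVectorizer` L2-normalizes each row by default, so two descriptions can differ by at most a squared Euclidean distance of 2, while a standardized binary column separates its two classes by a gap of at least 2 (squared: at least 4). The type feature can therefore dominate the distances that KMeans sees. The sketch below illustrates this on toy data and shows one possible mitigation, down-weighting the type column. The toy descriptions and the `TYPE_WEIGHT` value are assumptions for illustration, not tuned settings.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for df['desc'] and df['type_encoded'] (illustrative only).
toy_desc = [
    'A detective hunts a serial killer in the city',
    'Two friends open a bakery and fall in love',
    'A family of chefs competes in a cooking contest',
    'Scientists explore deep ocean creatures',
]
toy_type = np.array([[0.0], [0.0], [1.0], [1.0]])

toy_tfidf = TfidfVectorizer(stop_words='english').fit_transform(toy_desc)
toy_type_scaled = StandardScaler().fit_transform(toy_type)

# Each TF-IDF row has unit L2 norm (the vectorizer's default normalization).
row_norms = np.sqrt(np.asarray(toy_tfidf.multiply(toy_tfidf).sum(axis=1)).ravel())
# Gap between the two standardized type values.
type_gap = float(toy_type_scaled.max() - toy_type_scaled.min())
print('TF-IDF row norms:', row_norms)
print('Gap between scaled type values:', type_gap)

# Hypothetical weight that shrinks the influence of the type column.
TYPE_WEIGHT = 0.3
X_weighted = hstack(
    [csr_matrix(toy_type_scaled * TYPE_WEIGHT), toy_tfidf], format='csr'
)
print('Weighted feature matrix shape:', X_weighted.shape)
```

Whether down-weighting helps depends on the goal: if separating Movies from TV Shows is the point, the unweighted feature is fine; if the clusters should reflect description content, a smaller weight (or dropping the column) is worth trying.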


In [None]:
# Helper functions to run each clustering algorithm and report its silhouette score
def run_kmeans(X, k, random_state=42):
    km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
    labels = km.fit_predict(X)
    score = silhouette_score(X, labels)
    return labels, score, km

def run_agglomerative(X, k):
    agg = AgglomerativeClustering(n_clusters=k)
    labels = agg.fit_predict(X.toarray()) if hasattr(X, 'toarray') else agg.fit_predict(X)
    score = silhouette_score(X, labels)
    return labels, score, agg

# Try a range of k and pick best by silhouette (KMeans)
scores = []
ks = list(range(2,8))
for k in ks:
    labels, score, _ = run_kmeans(X, k)
    scores.append(score)
    print(f'KMeans k={k} -> silhouette={score:.4f}')

best_k = ks[np.argmax(scores)]
print('\nBest k by silhouette (KMeans):', best_k)
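
The silhouette score needs pairwise distances, so the loop above gets slow as the number of rows grows. `silhouette_score` accepts `sample_size` and `random_state` arguments that score a random subset instead. The sketch below shows the same select-by-argmax pattern with sampling, on synthetic blobs rather than the Netflix features; the blob layout and the sample size are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the feature matrix: three well-separated blobs.
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)

demo_ks = list(range(2, 6))
demo_scores = []
for k in demo_ks:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X_demo)
    # sample_size scores a random subset of rows instead of all of them
    score = silhouette_score(X_demo, labels, sample_size=200, random_state=42)
    demo_scores.append(score)
    print(f'k={k} -> sampled silhouette={score:.4f}')

demo_best_k = demo_ks[int(np.argmax(demo_scores))]
print('Best k on the synthetic data:', demo_best_k)
```

A sampled score is an estimate, so close calls between neighboring values of k are worth re-checking with the full score.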


In [None]:
# Run KMeans with best_k and Agglomerative for comparison
best_k = int(best_k)
k_labels, k_score, k_model = run_kmeans(X, best_k)
print('KMeans silhouette:', k_score)

agg_labels, agg_score, agg_model = run_agglomerative(X, best_k)
print('Agglomerative silhouette:', agg_score)

# Attach labels to df (use KMeans labels by default)
df['cluster_kmeans'] = k_labels
df['cluster_agglo'] = agg_labels


In [None]:
# PCA to 2D for visualization (use dense array for PCA)
print('Converting features to dense array for PCA (may be memory heavy). If memory problems occur, reduce TF-IDF max_features.')
X_dense = X.toarray()
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_dense)

plt.figure(figsize=(10,6))
sns.scatterplot(x=X_pca[:,0], y=X_pca[:,1], hue=df['cluster_kmeans'].astype(str), palette='tab10', s=40)
plt.title('KMeans clusters visualized with PCA (2 components)')
plt.legend(title='cluster', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.show()
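
If the dense conversion above runs out of memory, `TruncatedSVD` is a common alternative because it works on sparse matrices directly. Unlike PCA it does not center the data, so the projection will not be identical. The sketch below uses a synthetic sparse matrix as a stand-in for `X`; the shape and sparsity are assumptions for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Synthetic sparse stand-in for the combined feature matrix (about 90% zeros).
# It is built from a small dense array only to keep the demo self-contained.
rng = np.random.default_rng(0)
dense_demo = rng.random((200, 50))
dense_demo[dense_demo < 0.9] = 0.0
X_sparse_demo = csr_matrix(dense_demo)

# TruncatedSVD accepts the sparse matrix as-is, so no .toarray() call is needed.
svd = TruncatedSVD(n_components=2, random_state=42)
X_svd = svd.fit_transform(X_sparse_demo)
print('Projected shape:', X_svd.shape)
```

In the notebook, the two columns of the projected array would take the place of `X_pca[:,0]` and `X_pca[:,1]` in the scatter plot.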


In [None]:
# Inspect top genres and sample titles per cluster
for c in sorted(df['cluster_kmeans'].unique()):
    subset = df[df['cluster_kmeans'] == c]
    print('\n=== Cluster', c, 'summary ===')
    print('Count:', len(subset))
    print('Top primary genres:')
    print(subset['primary_genre'].value_counts().head(5))
    print('\nSample titles:')
    # A fixed random_state keeps the sampled titles reproducible across runs
    print(subset.sample(min(5, len(subset)), random_state=42)[['title','type','release_year']].to_string(index=False))


In [None]:
# Save clustered dataset to CSV
out_file = 'netflix_titles_clustered.csv'
df.to_csv(out_file, index=False)
print('Saved clustered dataset to', out_file)


## Conclusion & Next steps

**What I did**
- Cleaned dataset, created TF-IDF features from descriptions, added a simple type feature.
- Ran KMeans and Agglomerative clustering and visualized clusters via PCA.
- Saved cluster assignments.

**Next steps / improvements**
- Use more engineered features: cast, director, runtime, multiple-genre indicators.
- Use UMAP for better visualization with sparse inputs.
- Try topic modeling (LDA) on descriptions before clustering.
- Link clusters with IMDB ratings or user engagement metrics for business insights.
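
As a starting point for the multiple-genre indicators mentioned above, one option is scikit-learn's `MultiLabelBinarizer`, which turns each title's genre list into a row of 0/1 columns. This is a minimal sketch on a toy frame; the rows are made up, and it assumes `listed_in` holds comma-separated genre names as in the feature engineering cell.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy stand-in for df['listed_in'] (illustrative values only).
toy = pd.DataFrame({
    'listed_in': [
        'Dramas, International Movies',
        'Comedies',
        None,
        'Dramas, Comedies',
    ]
})

# Split the comma-separated string into a clean list of genres per title.
genre_lists = (
    toy['listed_in']
    .fillna('Unknown')
    .str.split(',')
    .apply(lambda genres: [g.strip() for g in genres])
)

mlb = MultiLabelBinarizer()
genre_matrix = mlb.fit_transform(genre_lists)
genre_df = pd.DataFrame(genre_matrix, columns=mlb.classes_)
print(genre_df)
```

Passing `sparse_output=True` to `MultiLabelBinarizer` returns a sparse matrix, which could then be stacked next to the TF-IDF features with `hstack`.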
