# **Movie Recommender Pipeline**

### Overview  
This notebook builds a content-based KNN recommender for movies using a variety of feature types:
- **Text embeddings** (overview, keywords) via a SentenceTransformer  
- **Categorical TF-IDF** (genres, themes, cast, director, collection)  
- **Numeric features** (runtime, budget, revenue, votes) scaled with StandardScaler  
- **Weighted fusion** of all feature blocks  
- **Dimensionality reduction** with TruncatedSVD  
- **Nearest-neighbors indexing** for fast similarity search  
- **Model artifact persistence** for deployment  

1. **Multi-Modal Feature Fusion**  
   - **Text Variables**: Movie overviews and keyword lists carry most of the language signal about plot and themes. I encoded them with a SentenceTransformer (all-mpnet-base-v2), chosen as a trade-off between accuracy and speed, to capture semantic relationships that exact token matching would miss.  
   - **Categorical Data**: Genres, themes, cast, director and collection names each describe a different facet of a movie's identity. TF-IDF treats every comma-separated value as one token, so values shared by many movies are down-weighted automatically and rarer, more distinctive values count for more.  
   - **Numeric Attributes**: Runtime, budget, revenue, average rating, and vote counts capture scale and popularity. The skewed columns were log-transformed in an earlier cleaning step, and all five are then standardized with StandardScaler so that no column dominates purely because of its units.

2. **Tuned Weighting**  
   - Not all features matter equally. For example, "director" or "keywords" may be far more predictive of user taste than "runtime." After experimenting with the tuning program, I settled on a set of weights that amplify the most important blocks.

3. **Feature Stacking & Normalization**  
   - We multiply each TF-IDF / embedding / numeric block by its weight, then horizontally stack them into a single sparse (CSR) matrix.  
   - L2-normalizing each movie vector ensures that cosine similarity reflects the relative composition of a vector rather than its absolute magnitude.

4. **Dimensionality Reduction (TruncatedSVD)**  
   - Stacking everything yields high dimensionality (tens of thousands of TF-IDF tokens + embedding dims + numeric dims). TruncatedSVD compresses it to 125 components, which reduces noise and speeds up neighbor search. How much variance those components retain can be checked with `svd.explained_variance_ratio_.sum()`.

5. **Nearest-Neighbors Indexing**  
   - We fit a cosine-metric KNN on the reduced vectors. At recommendation time, a single lookup returns the top K+1 neighbors; dropping the seed itself gives the most similar movies.

6. **Creating the App**  
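   - I saved the reduced feature matrix separately for fast loading.  
   - The fitted KNN model, scaler, and TF-IDF vectorizers are serialized with joblib for reproducibility. The SVD itself is not saved, because the app only looks up rows of the already-reduced matrix.  
   - The movie DataFrame is pickled and gzipped so the app can quickly look up titles and metadata without reloading and reprocessing the raw data.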

In [2]:
# standard library
import gzip
import os
import pickle

# third-party
import joblib
import numpy as np
import pandas as pd
from scipy.sparse import hstack
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, normalize



In [None]:
# load and preprocess data
df = pd.read_csv('final_cleaned_tmdb.csv', keep_default_na=False)
df.drop_duplicates(subset=['title'], inplace=True)
df.reset_index(drop=True, inplace=True)

In [None]:
# compute sentence‐transformer embeddings
model = SentenceTransformer('all-mpnet-base-v2')
overview_embeds = model.encode(df['overview'].tolist(), show_progress_bar=True)
keywords_embeds = model.encode(df['keywords'].tolist(), show_progress_bar=True)

Batches: 100%|██████████| 1691/1691 [01:03<00:00, 26.65it/s]
Batches: 100%|██████████| 1691/1691 [00:22<00:00, 76.44it/s] 


In [None]:
# build TF-IDF matrices for categorical features
vect_genres     = TfidfVectorizer(token_pattern='[^,]+')
genre_tfidf     = vect_genres.fit_transform(df['genres'])

vect_themes     = TfidfVectorizer(token_pattern='[^,]+')
themes_tfidf    = vect_themes.fit_transform(df['themes'])

vect_cast       = TfidfVectorizer(token_pattern='[^,]+')
cast_tfidf      = vect_cast.fit_transform(df['top_cast'])

vect_director   = TfidfVectorizer(token_pattern='[^,]+')
director_tfidf  = vect_director.fit_transform(df['director'])

vect_collection = TfidfVectorizer(token_pattern='[^,]+')
collection_tfidf= vect_collection.fit_transform(df['collection_name'])

In [None]:
# scale numeric columns
numeric_cols = ['runtime_log','budget_log','revenue_log','vote_average','vote_count_log']
scaler       = StandardScaler()
num_scaled   = scaler.fit_transform(df[numeric_cols])

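The `*_log` columns are produced before this notebook runs, and the exact transform is not shown here. The sketch below assumes `np.log1p` was used, which is a common choice because it handles zeros. The `raw` frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical raw values; the real notebook reads precomputed *_log columns.
raw = pd.DataFrame({
    "budget": [0, 1_000_000, 200_000_000],
    "vote_count": [10, 500, 30_000],
})

# Assumption: log1p(x) = log(1 + x), which is defined at 0 and compresses long right tails.
logged = np.log1p(raw).add_suffix("_log")

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(logged)
col_means = scaled.mean(axis=0)
col_stds = scaled.std(axis=0)
```

After scaling, every numeric column has mean 0 and standard deviation 1, so the single `w_numeric` weight applies evenly across them.
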

In [None]:
# set your tuned weights
w_genres     = 2.1248831057594555
w_themes     = 0.7380308776580978
w_cast       = 1.8518025540709728
w_director   = 5.611039339547304
w_collection = 7.692623326659378
w_overview   = 6.552969618108489
w_keywords   = 7.377335381712764
w_numeric    = 1.9105801674680045

In [8]:
# scale each feature block by its tuned weight
blocks = [
    genre_tfidf    * w_genres,
    themes_tfidf   * w_themes,
    cast_tfidf     * w_cast,
    director_tfidf * w_director,
    collection_tfidf * w_collection,
    overview_embeds * w_overview,
    keywords_embeds * w_keywords,
    num_scaled      * w_numeric,
]

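To see why the block weights still matter once every row is L2-normalized, here is a minimal sketch on two made-up movies. The names `sparse_block`, `dense_block`, and `fused_similarity` are hypothetical and exist only for this illustration. The two movies disagree on the sparse block and agree on the dense block, so the fused cosine similarity depends entirely on which block is weighted more heavily.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import normalize

# Hypothetical blocks for two movies.
# Sparse block: the movies have different values (e.g. different directors).
sparse_block = csr_matrix(np.array([[1.0, 0.0], [0.0, 1.0]]))
# Dense block: the movies have the same value.
dense_block = np.array([[1.0], [1.0]])

def fused_similarity(w_sparse, w_dense):
    """Cosine similarity of the two rows after weighting, stacking, and L2-normalizing."""
    stacked = hstack([sparse_block * w_sparse, dense_block * w_dense], format="csr")
    X = normalize(stacked, norm="l2", axis=1).toarray()
    # Rows have unit length, so the dot product is the cosine similarity.
    return float(X[0] @ X[1])

sim_dense_heavy = fused_similarity(w_sparse=1.0, w_dense=5.0)   # agreement dominates
sim_sparse_heavy = fused_similarity(w_sparse=5.0, w_dense=1.0)  # disagreement dominates
```

For this toy case the similarity works out to `w_dense**2 / (w_sparse**2 + w_dense**2)`, so a block's influence grows with the square of its weight.
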
In [None]:
# stack and L2-normalize into a single matrix
X = normalize(
    hstack(blocks, format='csr'),
    norm='l2',
    axis=1
)
# reduce to 125 components; float32 halves the memory of the saved matrix
svd = TruncatedSVD(n_components=125, random_state=42)
X_reduced = svd.fit_transform(X).astype('float32')

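One way to check how much information the reduction keeps is to sum `explained_variance_ratio_` on the fitted SVD. The sketch below does this on a synthetic low-rank matrix, since the real `X` is not reproduced here. `X_toy` and its shape are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(0)

# Hypothetical stand-in for the fused matrix: 200 rows, 50 columns, rank 5.
X_toy = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 50))

svd_toy = TruncatedSVD(n_components=10, random_state=42)
X_toy_reduced = svd_toy.fit_transform(X_toy)

# Fraction of the total variance retained by the kept components.
retained = float(svd_toy.explained_variance_ratio_.sum())
```

Because the toy matrix has rank 5, ten components retain essentially all of its variance. On the real data, the same sum over `svd.explained_variance_ratio_` would show what 125 components actually keep.
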
In [None]:
# fit a NearestNeighbors index (k+1 so we can drop the seed itself)
knn = NearestNeighbors(n_neighbors=4, metric='cosine', n_jobs=-1)
knn.fit(X_reduced)

NearestNeighbors(metric='cosine', n_jobs=-1, n_neighbors=4)


In [22]:
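def recommend(seed_title, df, feature_matrix, knn_model, k=3):
    """Return the k movies most similar to seed_title, with cosine similarities."""
    # positional row of the seed, so it lines up with feature_matrix
    positions = np.flatnonzero(df['title'].values == seed_title)
    if positions.size == 0:
        raise ValueError(f"Movie '{seed_title}' not found in DataFrame.")
    seed_idx = positions[0]

    # reshape the seed vector to 2D
    query_vec = feature_matrix[seed_idx].reshape(1, -1)

    distances, indices = knn_model.kneighbors(query_vec, n_neighbors=k+1)

    # drop the seed by index rather than by position: a movie with an identical
    # vector can tie at distance 0 and be returned ahead of the seed
    keep         = indices[0] != seed_idx
    rec_idxs     = indices[0][keep][:k]
    similarities = 1 - distances[0][keep][:k]  # cosine-distance → similarity

    recs = df.iloc[rec_idxs][['title','director','vote_average']].copy()
    recs['similarity'] = similarities
    return recs.reset_index(drop=True)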



In [None]:
# example usage
if __name__ == "__main__":
    top3 = recommend("Star Wars: The Force Awakens", df, X_reduced, knn, k=3)
    print(top3)
    

                              title          director  vote_average  \
0  Star Wars: The Rise of Skywalker       J.J. Abrams         6.296   
1                Return of the Jedi  Richard Marquand         7.902   
2           The Empire Strikes Back    Irvin Kershner         8.395   

   similarity  
0    0.903056  
1    0.890735  
2    0.883968  


In [None]:
MODELS_DIR = 'models'
os.makedirs(MODELS_DIR, exist_ok=True)

# save reduced feature matrix
np.savez_compressed(
    os.path.join(MODELS_DIR, 'feature_matrix_reduced.npz'),
    X_reduced=X_reduced
)

# only the objects the Streamlit app needs at runtime
artifacts = [
    (knn,         'knn_model'),
    (scaler,      'scaler'),
    (vect_genres, 'tfidf_genres'),
    (vect_themes, 'tfidf_themes'),
    (vect_cast,   'tfidf_cast'),
    (vect_director,'tfidf_director'),
    (vect_collection,'tfidf_collection'),
]
for obj, name in artifacts:
    path = os.path.join(MODELS_DIR, f'{name}.joblib')
    joblib.dump(obj, path, compress=('lzma', 9))

# gzip dataframe for indexing
with gzip.open(os.path.join(MODELS_DIR, 'movies_df.pkl.gz'), 'wb', compresslevel=9) as f:
    pickle.dump(df, f, protocol=pickle.HIGHEST_PROTOCOL)

print("Saved compressed models (no SVD) to the models/ folder.")

Saved compressed models (no SVD) to the models/ folder.

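For completeness, here is a minimal sketch of how the saved artifacts could be read back at app startup. It is a round trip on small stand-in objects in a temporary directory, using the same file names and save calls as the cell above. `X_small`, `knn_small`, and `df_small` are hypothetical, and the actual app's loading code is not shown in this notebook.

```python
import gzip
import os
import pickle
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-ins for X_reduced, knn and df.
X_small = np.random.default_rng(0).normal(size=(20, 4)).astype("float32")
knn_small = NearestNeighbors(n_neighbors=4, metric="cosine").fit(X_small)
df_small = pd.DataFrame({"title": [f"Movie {i}" for i in range(20)]})

with tempfile.TemporaryDirectory() as models_dir:
    matrix_path = os.path.join(models_dir, "feature_matrix_reduced.npz")
    knn_path = os.path.join(models_dir, "knn_model.joblib")
    df_path = os.path.join(models_dir, "movies_df.pkl.gz")

    # Save with the same calls as the notebook.
    np.savez_compressed(matrix_path, X_reduced=X_small)
    joblib.dump(knn_small, knn_path, compress=("lzma", 9))
    with gzip.open(df_path, "wb", compresslevel=9) as f:
        pickle.dump(df_small, f, protocol=pickle.HIGHEST_PROTOCOL)

    # Load everything back; the array is read under the key used when saving.
    with np.load(matrix_path) as data:
        X_loaded = data["X_reduced"]
    knn_loaded = joblib.load(knn_path)
    with gzip.open(df_path, "rb") as f:
        df_loaded = pickle.load(f)

# The reloaded index answers queries against the reloaded matrix.
_, idx = knn_loaded.kneighbors(X_loaded[:1], n_neighbors=4)
```

Pickle and joblib files should only be loaded from a trusted source, since unpickling can execute arbitrary code.
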