# Movie Recommendation System - PHASE 2: Data Preprocessing

## Overview
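This notebook covers the second preprocessing phase:
- Text cleaning, tokenization, stopword removal, and stemming/lemmatization of movie overviews
- TF-IDF vectorization and weighted combination with the genre and numeric features from Phase 1
- Stratified train-test split (80/20) on rating category
- A synthetic user-item interaction matrix for collaborative filtering
- Saving the preprocessed data to disk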

In [8]:
# Import all required libraries
import pandas as pd
import numpy as np
import pickle
import json
import warnings
warnings.filterwarnings('ignore')

# Import processing libraries
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import MinMaxScaler, StandardScaler, MultiLabelBinarizer
from sklearn.model_selection import train_test_split
from scipy.sparse import hstack
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download required NLTK data
for resource in ['stopwords', 'wordnet']:
    try:
        nltk.data.find(f'corpora/{resource}')
    except LookupError:
        nltk.download(resource, quiet=True)

print("✓ All libraries imported successfully!")

✓ All libraries imported successfully!


## Section 1: Load EDA Results and Raw Data

In [9]:
# Load EDA results from Phase 1
import os

results_dir = '../results'

# Load the pickled EDA results
with open(os.path.join(results_dir, 'eda_results.pkl'), 'rb') as f:
    eda_results = pickle.load(f)

# Extract components
df_movies = eda_results['movies_df'].copy()
tfidf_vectorizer_desc = eda_results['tfidf_vectorizer']
mlb_genres = eda_results['mlb_genres']
scaler_numeric = eda_results['scaler']
tfidf_matrix_desc = eda_results['tfidf_matrix_desc']
genres_df = eda_results['genres_df']
numeric_features_df = eda_results['numeric_features_df']

print("✓ EDA results loaded successfully!")
print(f"\nLoaded Data Shape: {df_movies.shape}")
print(f"Loaded Features Shape: {tfidf_matrix_desc.shape}")

✓ EDA results loaded successfully!

Loaded Data Shape: (4389, 28)
Loaded Features Shape: (4389, 1000)
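
The cell above indexes `eda_results` by key, so a missing key only surfaces as a bare `KeyError` partway through the cell. A minimal sketch of an up-front check is shown below. `require_keys` and `EXPECTED_KEYS` are hypothetical names that are not part of the notebook; the key names are the ones the cell reads, and the dictionary is a toy stand-in.

```python
# Hypothetical helper (not in the original notebook): fail fast if the
# Phase 1 pickle is missing any key that this notebook reads.
EXPECTED_KEYS = {
    'movies_df', 'tfidf_vectorizer', 'mlb_genres', 'scaler',
    'tfidf_matrix_desc', 'genres_df', 'numeric_features_df',
}

def require_keys(results, expected=EXPECTED_KEYS):
    """Return `results` unchanged, or raise KeyError naming the missing keys."""
    missing = sorted(set(expected) - set(results))
    if missing:
        raise KeyError(f"eda_results is missing keys: {missing}")
    return results

# Toy stand-in for the loaded dictionary, for illustration only
toy_results = {key: None for key in EXPECTED_KEYS}
checked = require_keys(toy_results)
print("all expected keys present")
```

In the notebook this could be called as `eda_results = require_keys(eda_results)` right after `pickle.load`.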


## Section 2: Advanced Text Processing with Multiple Techniques

In [10]:
# Advanced text preprocessing with stemming and lemmatization
import string

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Build these once instead of on every call
STOP_WORDS = set(stopwords.words('english'))
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def advanced_text_processing(text):
    """
    Comprehensive text processing:
    - Lowercase conversion
    - Punctuation removal
    - Tokenization
    - Stopword and short-token removal
    - Stemming or lemmatization (whichever result is shorter)
    """
    if not isinstance(text, str):
        return ""

    # Convert to lowercase
    text = text.lower()

    # Strip ASCII punctuation; letters, digits and whitespace are kept
    text = text.translate(PUNCT_TABLE)

    # Tokenization (simple split instead of word_tokenize to avoid punkt_tab dependency)
    tokens = text.split()

    # Remove stopwords and tokens of one or two characters
    tokens = [token for token in tokens if token not in STOP_WORDS and len(token) > 2]

    # Stem and lemmatize each token, then keep the shorter form.
    # On ties min() keeps the stem; the shorter form is usually the stem,
    # i.e. the more aggressive reduction.
    processed_tokens = []
    for token in tokens:
        stemmed = stemmer.stem(token)
        lemmatized = lemmatizer.lemmatize(token)
        processed_tokens.append(min([stemmed, lemmatized], key=len))

    return ' '.join(processed_tokens)

# Apply advanced processing to overview text
print("Applying advanced text processing...")
df_movies['overview_processed'] = df_movies['overview'].apply(advanced_text_processing)

print("✓ Text processing completed!")
print("\nSample processed text:")
for i in range(2):
    print(f"\nMovie: {df_movies.iloc[i]['title']}")
    print(f"Original (first 100 chars): {df_movies.iloc[i]['overview'][:100]}")
    print(f"Processed (first 100 chars): {df_movies.iloc[i]['overview_processed'][:100]}")

Applying advanced text processing...
✓ Text processing completed!

Sample processed text:

Movie: Avatar
Original (first 100 chars): In the 22nd century, a paraplegic Marine is dispatched to the moon Pandora on a unique mission, but 
Processed (first 100 chars): 22nd centuri parapleg marin dispatch moon pandora uniqu mission becom torn follow order protect alie

Movie: Pirates of the Caribbean: At World's End
Original (first 100 chars): Captain Barbossa, long believed to be dead, has come back to life and is headed to the edge of the E
Processed (first 100 chars): captain barbossa long believ dead come back life head edg earth turner elizabeth swann noth quit see
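
The cleaning steps that run before stemming can be reproduced without NLTK. The sketch below uses a tiny illustrative stopword set (an assumption, not NLTK's English list) and a hypothetical `clean_tokens` helper to show why short words such as "in" or "to" disappear even when they are not listed as stopwords, and why "22nd" survives.

```python
import string

# Assumption: a tiny illustrative stopword set, NOT NLTK's English list
TOY_STOP_WORDS = {"the", "but", "and"}
PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_tokens(text, stop_words=TOY_STOP_WORDS):
    """Lowercase, strip ASCII punctuation, split on whitespace, then drop
    stopwords and tokens of one or two characters."""
    if not isinstance(text, str):
        return []
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [t for t in tokens if t not in stop_words and len(t) > 2]

sample = ("In the 22nd century, a paraplegic Marine is dispatched to the "
          "moon Pandora on a unique mission, but")
tokens = clean_tokens(sample)
print(tokens)
```

Note that removing punctuation also joins possessives: "World's" becomes the single token "worlds".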


## Section 3: Vectorization and Feature Engineering

In [11]:
# Create enhanced TF-IDF on processed text
print("Creating enhanced TF-IDF vectorization on processed text...")
tfidf_vectorizer_processed = TfidfVectorizer(
    max_features=1500,           # Increased from 1000
    ngram_range=(1, 3),          # Include trigrams
    min_df=1,                    # Minimum document frequency
    max_df=0.9,                  # Maximum document frequency
    sublinear_tf=True            # Sublinear term frequency scaling
)

tfidf_matrix_processed = tfidf_vectorizer_processed.fit_transform(df_movies['overview_processed'])
print(f"✓ Processed TF-IDF Matrix Shape: {tfidf_matrix_processed.shape}")

# Create combined tags vectorization
print("\nCreating tags vectorization...")
tags_text = df_movies['tags_cleaned'].fillna('')
tfidf_vectorizer_tags = TfidfVectorizer(
    max_features=500,
    ngram_range=(1, 2),
    min_df=1,
    max_df=0.95
)

tfidf_matrix_tags = tfidf_vectorizer_tags.fit_transform(tags_text)
print(f"✓ Tags TF-IDF Matrix Shape: {tfidf_matrix_tags.shape}")

# Combine all features; each block is multiplied by a fixed scale factor before stacking
print("\nCombining all features...")
combined_feature_matrix = hstack([
    tfidf_matrix_processed * 0.4,    # Weight descriptions 40%
    tfidf_matrix_tags * 0.3,         # Weight tags 30%
    genres_df.values * 0.2,          # Weight genres 20%
    numeric_features_df.values * 0.1 # Weight numeric features 10%
])

print(f"✓ Combined Feature Matrix Shape: {combined_feature_matrix.shape}")
print(f"  Total Features: {combined_feature_matrix.shape[1]}")
print(f"  Sparsity: {100 * (1 - combined_feature_matrix.nnz / (combined_feature_matrix.shape[0] * combined_feature_matrix.shape[1])):.2f}%")

Creating enhanced TF-IDF vectorization on processed text...
✓ Processed TF-IDF Matrix Shape: (4389, 1500)

Creating tags vectorization...
✓ Tags TF-IDF Matrix Shape: (4389, 500)

Combining all features...
✓ Combined Feature Matrix Shape: (4389, 2023)
  Total Features: 2023
  Sparsity: 98.27%
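
Because each block is scaled before stacking, a weight of w multiplies that block's contribution to any dot product by w². The toy sketch below (plain Python, made-up vectors, hypothetical helper names) suggests that a block's share of cosine similarity also depends on its norm, so the 0.4/0.3/0.2/0.1 factors are better read as scale factors than as percentage contributions.

```python
import math

def weighted_concat(blocks, weights):
    """Scale each block by its weight and concatenate into one vector."""
    vector = []
    for block, weight in zip(blocks, weights):
        vector.extend(weight * value for value in block)
    return vector

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy data: unit-norm text vectors that share no terms, identical genre flags
text_a, text_b = [1.0, 0.0], [0.0, 1.0]
genres_a = genres_b = [1.0, 1.0, 0.0]
weights = (0.4, 0.2)  # text and genre scale factors, as in the cell above

movie_a = weighted_concat([text_a, genres_a], weights)
movie_b = weighted_concat([text_b, genres_b], weights)
similarity = cosine(movie_a, movie_b)
print(f"cosine similarity: {similarity:.4f}")
```

In this toy case the two movies share no overview terms, yet the genre block alone gives them a cosine similarity of one third, because a two-genre row has a larger norm than a unit-norm text row.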


## Section 4: Train-Test Split with Stratification

In [12]:
# Create stratification column based on rating categories
df_movies['rating_category'] = pd.cut(df_movies['vote_average'], 
                                        bins=[0, 4, 6, 8, 10],
                                        labels=['Low', 'Medium', 'High', 'Very High'])

print("Rating Distribution:")
print(df_movies['rating_category'].value_counts().sort_index())

# Perform stratified train-test split (80-20)
print("\nPerforming stratified train-test split (80-20)...")

indices = np.arange(len(df_movies))

train_indices, test_indices = train_test_split(
    indices,
    test_size=0.2,
    stratify=df_movies['rating_category'],
    random_state=42
)

print(f"✓ Train indices: {len(train_indices)}")
print(f"✓ Test indices: {len(test_indices)}")

# Convert combined_feature_matrix to CSR format for efficient indexing
from scipy.sparse import csr_matrix
combined_feature_matrix = csr_matrix(combined_feature_matrix)

# Split all features
train_features = combined_feature_matrix[train_indices]
test_features = combined_feature_matrix[test_indices]

# Create training and testing dataframes
train_df = df_movies.iloc[train_indices].reset_index(drop=True)
test_df = df_movies.iloc[test_indices].reset_index(drop=True)

print(f"\nTrain set shape: {train_df.shape}")
print(f"Test set shape: {test_df.shape}")

print(f"\nTrain set rating distribution:")
print(train_df['rating_category'].value_counts().sort_index())

print(f"\nTest set rating distribution:")
print(test_df['rating_category'].value_counts().sort_index())

print("\n✓ Train-test split completed!")

Rating Distribution:
rating_category
Low            66
Medium       1692
High         2589
Very High      42
Name: count, dtype: int64

Performing stratified train-test split (80-20)...
✓ Train indices: 3511
✓ Test indices: 878

Train set shape: (3511, 30)
Test set shape: (878, 30)

Train set rating distribution:
rating_category
Low            53
Medium       1353
High         2071
Very High      34
Name: count, dtype: int64

Test set rating distribution:
rating_category
Low           13
Medium       339
High         518
Very High      8
Name: count, dtype: int64

✓ Train-test split completed!
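
One way to confirm that stratification worked is to compare each class's share in the full data with its share in each split. The sketch below does this with the counts printed above; the `shares` helper is a hypothetical name introduced here for illustration.

```python
# Counts copied from the printed output of the split cell
full = {'Low': 66, 'Medium': 1692, 'High': 2589, 'Very High': 42}
train = {'Low': 53, 'Medium': 1353, 'High': 2071, 'Very High': 34}
test = {'Low': 13, 'Medium': 339, 'High': 518, 'Very High': 8}

def shares(counts):
    """Convert class counts into class proportions."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

full_s, train_s, test_s = shares(full), shares(train), shares(test)

# Largest absolute difference between a split's class share and the full share
max_gap = max(
    abs(split[label] - full_s[label])
    for split in (train_s, test_s)
    for label in full_s
)

for label in full:
    print(f"{label:>9}: full {full_s[label]:.4f}  "
          f"train {train_s[label]:.4f}  test {test_s[label]:.4f}")
print(f"max gap: {max_gap:.5f}")
```

Every class share in both splits stays within 0.1 percentage points of its share in the full data.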


## Section 5: Create User-Item Interaction Matrix (for Collaborative Filtering)

In [13]:
# Create synthetic user-item interaction matrix for collaborative filtering
print("Creating synthetic user-item interaction matrix...")

n_users = 150
n_movies = len(df_movies)

# Collect (user, movie, rating) triplets; the seed makes the draws reproducible
np.random.seed(42)
user_indices = []
movie_indices = []
ratings = []

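# Target 30% density (70% sparsity). Pairs are drawn with replacement, so
# repeated (user, movie) pairs occur; csr_matrix sums their ratings, which
# leaves fewer distinct interactions than n_interactions and can push a
# summed rating above 10.
sparsity = 0.7
n_interactions = int(n_users * n_movies * (1 - sparsity))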

for _ in range(n_interactions):
    user_id = np.random.randint(0, n_users)
    movie_id = np.random.randint(0, n_movies)
    # Rating = the movie's vote_average plus Gaussian noise (sd 1.5), clipped to [1, 10]
    base_rating = df_movies.iloc[movie_id]['vote_average']
    rating = base_rating + np.random.normal(0, 1.5)
    rating = np.clip(rating, 1, 10)
    
    user_indices.append(user_id)
    movie_indices.append(movie_id)
    ratings.append(rating)

# Create sparse user-item matrix
from scipy.sparse import csr_matrix

user_item_matrix = csr_matrix(
    (ratings, (user_indices, movie_indices)),
    shape=(n_users, n_movies)
)

print(f"✓ User-Item Matrix Shape: {user_item_matrix.shape}")
print(f"✓ Sparsity: {100 * (1 - user_item_matrix.nnz / (n_users * n_movies)):.2f}%")
print(f"✓ Non-zero interactions: {user_item_matrix.nnz}")

# Calculate user-user and item-item similarity matrices for collaborative filtering
from sklearn.metrics.pairwise import cosine_similarity

print("\nCalculating user similarity matrix...")
user_similarity = cosine_similarity(user_item_matrix)
print(f"✓ User-User Similarity Shape: {user_similarity.shape}")

print("\nCalculating item similarity matrix...")
item_similarity = cosine_similarity(user_item_matrix.T)
print(f"✓ Item-Item Similarity Shape: {item_similarity.shape}")

Creating synthetic user-item interaction matrix...
✓ User-Item Matrix Shape: (150, 4389)
✓ Sparsity: 74.10%
✓ Non-zero interactions: 170524

Calculating user similarity matrix...
✓ User-User Similarity Shape: (150, 150)

Calculating item similarity matrix...
✓ Item-Item Similarity Shape: (4389, 4389)
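
The reported sparsity (74.10%) is higher than the 70% target because the loop draws user and movie ids independently, so some pairs repeat. The sketch below estimates the expected number of distinct pairs under uniform sampling with replacement. It then shows one possible alternative, which is an assumption and not the notebook's method: sampling distinct cells without replacement so the target density is met exactly.

```python
import random

n_users, n_movies = 150, 4389          # sizes from the cell above
n_cells = n_users * n_movies
n_draws = int(n_cells * (1 - 0.7))     # same formula as n_interactions

# Expected number of distinct cells hit by n_draws uniform draws with replacement
expected_unique = n_cells * (1 - (1 - 1 / n_cells) ** n_draws)
expected_sparsity = 100 * (1 - expected_unique / n_cells)
print(f"expected distinct pairs: {expected_unique:.0f}")
print(f"expected sparsity: {expected_sparsity:.2f}%")

# Alternative (an assumption, not the notebook's method): sample distinct
# cells without replacement, then decode each flat index into (user, movie).
rng = random.Random(42)
flat_indices = rng.sample(range(n_cells), n_draws)
pairs = [divmod(flat, n_movies) for flat in flat_indices]
print(f"distinct pairs sampled: {len(set(pairs))}")
```

The estimate lands close to the 170524 non-zero interactions reported above. With distinct pairs, no ratings are summed, so every stored value stays inside the clipped range.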


## Section 6: Save Preprocessed Data

In [14]:
# Save all preprocessed data
import os

results_dir = '../results'
os.makedirs(results_dir, exist_ok=True)

preprocessing_results = {
    'train_df': train_df,
    'test_df': test_df,
    'train_features': train_features,
    'test_features': test_features,
    'combined_feature_matrix': combined_feature_matrix,
    'tfidf_vectorizer_processed': tfidf_vectorizer_processed,
    'tfidf_vectorizer_tags': tfidf_vectorizer_tags,
    'mlb_genres': mlb_genres,
    'train_indices': train_indices,
    'test_indices': test_indices,
    'user_item_matrix': user_item_matrix,
    'user_similarity': user_similarity,
    'item_similarity': item_similarity
}

# Save to pickle
pickle_path = os.path.join(results_dir, 'preprocessed_data.pkl')
with open(pickle_path, 'wb') as f:
    pickle.dump(preprocessing_results, f)

print(f"✓ Preprocessed data saved to: {pickle_path}")

# Also save train and test dataframes as CSV for reference
train_df.to_csv(os.path.join(results_dir, 'train_movies.csv'), index=False)
test_df.to_csv(os.path.join(results_dir, 'test_movies.csv'), index=False)

print(f"✓ Train/Test sets saved to CSV")

# Print summary
print("\n" + "=" * 80)
print("PREPROCESSING SUMMARY")
print("=" * 80)
print(f"\nTotal Movies: {len(df_movies)}")
print(f"Training Movies: {len(train_df)} ({len(train_df)/len(df_movies)*100:.1f}%)")
print(f"Testing Movies: {len(test_df)} ({len(test_df)/len(df_movies)*100:.1f}%)")
print(f"\nFeature Dimensions:")
print(f"  - Combined Features: {combined_feature_matrix.shape[1]}")
print(f"  - TF-IDF (Descriptions): {tfidf_matrix_processed.shape[1]}")
print(f"  - TF-IDF (Tags): {tfidf_matrix_tags.shape[1]}")
print(f"  - Genres (One-Hot): {genres_df.shape[1]}")
print(f"  - Numeric Features: {numeric_features_df.shape[1]}")
print(f"\nCollaborative Filtering Data:")
print(f"  - Users: {n_users}")
print(f"  - Movies: {n_movies}")
print(f"  - User-Item Interactions: {user_item_matrix.nnz}")
print(f"  - Sparsity: {100 * (1 - user_item_matrix.nnz / (n_users * n_movies)):.2f}%")
print(f"\n✓ Preprocessing Phase completed successfully!")

✓ Preprocessed data saved to: ../results\preprocessed_data.pkl
✓ Train/Test sets saved to CSV

PREPROCESSING SUMMARY

Total Movies: 4389
Training Movies: 3511 (80.0%)
Testing Movies: 878 (20.0%)

Feature Dimensions:
  - Combined Features: 2023
  - TF-IDF (Descriptions): 1500
  - TF-IDF (Tags): 500
  - Genres (One-Hot): 20
  - Numeric Features: 3

Collaborative Filtering Data:
  - Users: 150
  - Movies: 4389
  - User-Item Interactions: 170524
  - Sparsity: 74.10%

✓ Preprocessing Phase completed successfully!
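
Before a later phase consumes the pickle, it may be worth reloading it and checking that the train and test indices do not overlap. The sketch below is a minimal illustration on toy data round-tripped through an in-memory buffer; `check_split` is a hypothetical helper, and only the `train_indices` and `test_indices` keys are taken from the notebook.

```python
import io
import pickle

def check_split(results):
    """Hypothetical sanity check: train and test indices must not overlap."""
    train = set(results['train_indices'])
    test = set(results['test_indices'])
    overlap = train & test
    if overlap:
        raise ValueError(f"{len(overlap)} indices appear in both splits")
    return len(train), len(test)

# Toy stand-in for preprocessing_results, round-tripped through pickle in memory
toy = {'train_indices': [0, 1, 2, 4], 'test_indices': [3]}
buffer = io.BytesIO()
pickle.dump(toy, buffer)
buffer.seek(0)
reloaded = pickle.load(buffer)

sizes = check_split(reloaded)
print(f"train/test sizes: {sizes}")
```

With the real file, the same function could be applied to the dictionary loaded from `preprocessed_data.pkl`.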
