# Machine Learning Revisions

Target: Classify movies as "high-grossing" or "low-grossing" and build a basic recommendation system.

Steps Overview: We’ll first preprocess the data, then build a classification model using logistic regression, and finally implement a recommendation system based on movie similarity.

In [67]:
#Imports 
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns

In [44]:
#Load Merged and Cleaned Data
path = "/Users/saniaspry/Documents/Flatiron/Phase-3/Movie-Recommender/data/merged_data.csv"
ml_df = pd.read_csv(path)

In [45]:
# Use 2018 data as it's the most recent; .copy() avoids SettingWithCopyWarning when columns are added later
data_2018 = ml_df[ml_df["year"] == 2018].copy()
data_2018

Unnamed: 0,movie_id,primary_title,genres,individual_genre,runtime_minutes,title,studio,domestic_gross,foreign_gross,year,averagerating,numvotes,director_id,director_name,total_gross
240,tt0800054,The Guardians,"Comedy,Family",Comedy,88.0,The Guardians,MBox,177000.0,0.0,2018,7.8,68,nm0401827,Chris Hummel,177000.0
241,tt0800054,The Guardians,"Comedy,Family",Family,88.0,The Guardians,MBox,177000.0,0.0,2018,7.8,68,nm0401827,Chris Hummel,177000.0
242,tt6213362,The Guardians,"Drama,War",Drama,138.0,The Guardians,MBox,177000.0,0.0,2018,6.8,1314,nm0064741,Xavier Beauvois,177000.0
243,tt6213362,The Guardians,"Drama,War",War,138.0,The Guardians,MBox,177000.0,0.0,2018,6.8,1314,nm0064741,Xavier Beauvois,177000.0
244,tt6901956,The Guardians,"Action,Adventure,Comedy",Action,46.0,The Guardians,MBox,177000.0,0.0,2018,4.1,7,nm7014443,Sebastian Garcia Lorenzo,177000.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7764,tt8427036,Helicopter Eela,Drama,Drama,135.0,Helicopter Eela,Eros,72000.0,0.0,2018,5.4,673,nm1224879,Pradeep Sarkar,72000.0
7765,tt9078374,Last Letter,"Drama,Romance",Drama,114.0,Last Letter,CL,181000.0,0.0,2018,6.4,322,nm0412517,Shunji Iwai,181000.0
7766,tt9078374,Last Letter,"Drama,Romance",Romance,114.0,Last Letter,CL,181000.0,0.0,2018,6.4,322,nm0412517,Shunji Iwai,181000.0
7767,tt9151704,Burn the Stage: The Movie,"Documentary,Music",Documentary,84.0,Burn the Stage: The Movie,Trafalgar,4200000.0,16100000.0,2018,8.8,2067,nm10201503,Jun-Soo Park,20300000.0


### Data Check:
Here, we confirm the data is ready by checking for missing values, ensuring our columns are formatted correctly, and identifying any features we need to create (e.g., a label for high-grossing movies).

In [46]:
# Check for missing values
print(data_2018.isnull().sum())

# Check column names and data types
print(data_2018.dtypes)


movie_id            0
primary_title       0
genres              0
individual_genre    0
runtime_minutes     1
title               0
studio              0
domestic_gross      0
foreign_gross       0
year                0
averagerating       0
numvotes            0
director_id         0
director_name       0
total_gross         0
dtype: int64
movie_id             object
primary_title        object
genres               object
individual_genre     object
runtime_minutes     float64
title                object
studio               object
domestic_gross      float64
foreign_gross       float64
year                  int64
averagerating       float64
numvotes              int64
director_id          object
director_name        object
total_gross         float64
dtype: object


Remove duplicate titles

In [47]:
# Remove duplicate titles before creating the similarity matrix
unique_titles = data_2018.drop_duplicates(subset='title')

### Feature Engineering
We define a revenue threshold to separate “high-grossing” from “low-grossing” movies: a movie is labeled high-grossing (1) if its total_gross is at least $100 million, and low-grossing (0) otherwise.

In [48]:
# Creating a binary label
threshold = 100000000  # Example threshold
data_2018['high_grossing'] = np.where(data_2018['total_gross'] >= threshold, 1, 0)
data_2018['high_grossing'].value_counts()



high_grossing
0    561
1    161
Name: count, dtype: int64
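
The class counts above are imbalanced, which sets a reference point for every accuracy figure that follows. As a rough sketch (the variable names here are illustrative, not from the notebook), the accuracy of a classifier that always predicts the majority class can be computed directly from those counts:

```python
# Class counts taken from the value_counts() output above
n_low, n_high = 561, 161
total = n_low + n_high

# Accuracy of a trivial model that always predicts the majority class (0)
baseline_accuracy = max(n_low, n_high) / total
# Share of rows labeled high-grossing
positive_share = n_high / total

print(f"Majority-class baseline accuracy: {baseline_accuracy:.3f}")
print(f"Share of high-grossing rows:      {positive_share:.3f}")
```

Any model worth keeping should beat this baseline; accuracy alone can look good on imbalanced data, so precision and recall per class matter as well.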

### Adding a Franchise Column

In [49]:
# Define a function to check if a movie title belongs to a franchise
def check_franchise(title):
    # List of keywords associated with popular franchises
    franchise_keywords = [
        'Avengers', 'Star Wars', 'Harry Potter', 'Marvel', 'Toy Story', 
        'Fast & Furious', 'Transformers', 'Pirates of the Caribbean', 
        'Spider-Man', 'Batman', 'Superman', 'James Bond', 'X-Men', 
        'Jurassic', 'Mission: Impossible', 'Despicable Me', 'Shrek', 
        'Hobbit', 'Lord of the Rings', 'Tomb Raider', 'Mamma Mia', ':',
        'Detective Chinatown', 'Incredibles', 'Deadpool', 'The Guardians of the Galaxy'
    ]
    # Check if any keyword is in the title
    for keyword in franchise_keywords:
        if keyword in title:
            return 'Yes'
    return 'No'

# Apply this function to each title to create a new franchise column
data_2018['franchise'] = data_2018['title'].apply(check_franchise)

# Display unique titles with franchise column
data_2018[['title', 'franchise']].drop_duplicates(subset='title').head()




Unnamed: 0,title,franchise
240,The Guardians,No
259,The Mule,No
575,Suspiria,No
608,Beast,No
711,Winchester,No
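
The keyword list in check_franchise includes ':' on its own, so any title containing a colon is flagged as a franchise, whether or not it belongs to one. The sketch below uses a hypothetical helper, matched_keyword, and a shortened keyword list to show which keyword triggers the match for a few titles from the data:

```python
from typing import Optional

# Shortened, illustrative subset of the keyword list used in check_franchise
FRANCHISE_KEYWORDS = ['Avengers', 'Tomb Raider', 'Jurassic', ':']

def matched_keyword(title: str, keywords=FRANCHISE_KEYWORDS) -> Optional[str]:
    """Return the first keyword found in the title, or None (hypothetical helper)."""
    for keyword in keywords:
        if keyword in title:
            return keyword
    return None

# Titles taken from the 2018 data shown earlier
for title in ["Tomb Raider", "Burn the Stage: The Movie", "The Guardians"]:
    print(f"{title!r} -> {matched_keyword(title)!r}")
```

Inspecting the matched keyword this way makes it easier to decide whether the ':' rule is an acceptable proxy for sequels and spin-offs or whether it produces too many false positives.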


### Adding a Director Loyalty Column

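This feature is meant to capture director-studio collaboration. As implemented below, it is the mean total_gross of each director-studio pair in the 2018 data, not a count of repeat collaborations. Because it is computed from total_gross, the same column that defines the high_grossing label, it carries information about the target into the features.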

In [50]:
# Calculate director loyalty by averaging revenue for each director-studio pair
director_loyalty = data_2018.groupby(['studio', 'director_name'])['total_gross'].mean().reset_index()
director_loyalty.rename(columns={'total_gross': 'avg_revenue_with_studio'}, inplace=True)

# Merge this director loyalty information back into your main DataFrame
data_2018 = data_2018.merge(director_loyalty, on=['studio', 'director_name'], how='left')

# Preview the updated DataFrame to confirm changes
data_2018[['studio', 'director_name', 'avg_revenue_with_studio']].head()

Unnamed: 0,studio,director_name,avg_revenue_with_studio
0,MBox,Chris Hummel,177000.0
1,MBox,Chris Hummel,177000.0
2,MBox,Xavier Beauvois,177000.0
3,MBox,Xavier Beauvois,177000.0
4,MBox,Sebastian Garcia Lorenzo,177000.0
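
For comparison, a loyalty measure that does not depend on revenue could count how many distinct films each director has made with each studio. This is only a sketch on hypothetical rows: the column name n_films_with_studio is an assumption, and within a single year most counts would be 1, so in practice the counts would probably need to come from earlier years of data.

```python
import pandas as pd

# Hypothetical rows; column names follow the notebook's merged data
movies = pd.DataFrame({
    "movie_id": ["m1", "m1", "m2", "m3", "m4"],  # m1 appears twice (one row per genre)
    "studio": ["A", "A", "A", "B", "A"],
    "director_name": ["Dir X", "Dir X", "Dir X", "Dir X", "Dir Y"],
})

# Count distinct films per director-studio pair (revenue is not used)
collab_counts = (
    movies.groupby(["studio", "director_name"])["movie_id"]
    .nunique()
    .rename("n_films_with_studio")
    .reset_index()
)

# Attach the count to every row of the main table
movies = movies.merge(collab_counts, on=["studio", "director_name"], how="left")
print(movies)
```

Counting distinct movie_id values rather than rows matters here because the data has one row per movie-genre combination.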


### Label Encoding

In [51]:
label_encoder = LabelEncoder()

# Apply label encoding to the 'individual_genre' column
data_2018['genre_encoded'] = label_encoder.fit_transform(data_2018['individual_genre'])
data_2018['genre_encoded']

0       4
1       8
2       7
3      19
4       0
       ..
717     7
718     7
719    15
720     6
721    12
Name: genre_encoded, Length: 722, dtype: int64
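
Label encoding assigns each genre an arbitrary integer, which a linear model such as logistic regression treats as an ordered quantity (for example, code 19 being "more" than code 4). One common alternative is one-hot encoding. The sketch below applies pandas' get_dummies to a few genres from the data; it is an illustration, not the encoding used in the rest of this notebook.

```python
import pandas as pd

# A few genres from the 2018 data
genres = pd.DataFrame({"individual_genre": ["Comedy", "Family", "Drama", "War", "Action"]})

# One indicator column per genre; each row has exactly one 1
genre_dummies = pd.get_dummies(genres["individual_genre"], prefix="genre", dtype=int)
print(genre_dummies)
```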

In [52]:
# Convert the 'franchise' column to binary (0 or 1); map() avoids the downcasting warning raised by replace()
data_2018['franchise_binary'] = data_2018['franchise'].map({'Yes': 1, 'No': 0})




### Feature Selection:

Define the Target: Apply the threshold to total_gross to label each movie as "high-grossing" (1) or "low-grossing" (0).

Choose Features: We use genre_encoded, avg_revenue_with_studio, and franchise_binary.

In [53]:
# Selecting features and target variable
X = data_2018[['genre_encoded', 'avg_revenue_with_studio', 'franchise_binary']]
y = data_2018['high_grossing']


## Building the Logistic Regression Model

We split the data into training and testing sets to evaluate model performance.

In [54]:
#Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
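
Because only about 22% of rows are high-grossing, a plain random split can leave the training and test sets with different class proportions. A possible refinement is to pass stratify to train_test_split, sketched here on synthetic data (X_demo and y_demo are stand-ins, not the notebook's X and y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 20% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([1] * 20 + [0] * 80)

# stratify=y_demo keeps the class proportions similar in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

print("Positive share in train:", y_tr.mean())
print("Positive share in test: ", y_te.mean())
```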


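Logistic regression is a linear model that estimates the probability that an observation belongs to the positive class and classifies it by thresholding that probability (0.5 by default). Here, it will help us classify movies as high- or low-grossing.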

In [55]:
#Model Training
model = LogisticRegression()
model.fit(X_train, y_train)


After training, we evaluate how well our model performed on the test set by calculating accuracy and other metrics.

In [56]:
y_pred = model.predict(X_test)

# Model evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


Accuracy: 0.2119815668202765
Classification Report:
               precision    recall  f1-score   support

           0       0.00      0.00      0.00       171
           1       0.21      1.00      0.35        46

    accuracy                           0.21       217
   macro avg       0.11      0.50      0.17       217
weighted avg       0.04      0.21      0.07       217

Confusion Matrix:
 [[  0 171]
 [  0  46]]



Logistic Regression Model Evaluation:

The logistic regression model performs **poorly**: it predicts class 1 (high-grossing) for every movie in the test set, as the confusion matrix shows.

**Accuracy: 0.21**
- The model is correct for only about 21.2% of test movies (46 of 217), which is exactly the share of high-grossing movies in the test set. Always predicting class 0 would have scored about 79%.

**Precision (Class 0): 0.00**
- The model never predicted "0" (the negative class), so precision for this class is undefined; scikit-learn reports it as 0.00.

**Recall (Class 1): 1.00**
- The model identified every instance of the positive class ("1"), but only because it labels everything as positive.

**F1-Score (Class 1): 0.35**
- Although recall is perfect, precision for Class 1 is only 0.21 (171 of the 217 positive predictions are false positives), which pulls the F1-score down.

**Confusion Matrix:**
- All 171 low-grossing movies were misclassified as high-grossing, and all 46 high-grossing movies were classified correctly. A likely cause is that the features are unscaled: avg_revenue_with_studio is measured in dollars and reaches the order of 10^8, while the other two features are small integers.
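
The reported metrics can be reproduced by hand from the confusion matrix, which helps confirm how they relate to each other. The sketch below uses the matrix printed above; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows = actual class, columns = predicted class
cm = np.array([[0, 171],
               [0, 46]])

tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of all predicted positives, how many are truly positive
recall_1 = tp / (tp + fn)      # of all actual positives, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f} precision_1={precision_1:.2f} "
      f"recall_1={recall_1:.2f} f1_1={f1_1:.2f}")
```

Precision for class 0 would require dividing by the number of predicted negatives (tn + fn), which is zero here; that is why it is undefined rather than genuinely 0.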

### Non-Technical Summary:
The logistic regression model's performance is not acceptable. It does not distinguish between the two classes at all: it labels every movie as high-grossing ("1"). As a result, it catches every high-grossing movie but misclassifies every low-grossing one, which leads to poor accuracy and a large number of false positives.

### Tuning the Logistic Regression Model using Hyperparameter Tuning & Cross Validation

In [65]:
# 1. Selecting features and target variable
X = data_2018[['genre_encoded', 'avg_revenue_with_studio', 'franchise_binary']]
y = data_2018['high_grossing']

In [68]:
# 2. Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 3. Standardize the features (important for Logistic Regression)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [69]:

# 4. Create a logistic regression model
log_reg = LogisticRegression(max_iter=1000)

# 5. Define the hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],  # Inverse of regularization strength (smaller C = stronger regularization)
    'solver': ['liblinear', 'saga'],  # Solvers for optimization
    'penalty': ['l2']  # Regularization method (L2 is common for logistic regression)
}

# 6. Perform Grid Search with Cross-Validation
grid_search = GridSearchCV(estimator=log_reg, param_grid=param_grid, cv=5, n_jobs=-1, scoring='accuracy')

# 7. Fit the grid search to the data
grid_search.fit(X_train_scaled, y_train)



We define a parameter grid for C (regularization strength), solver, and penalty. We use GridSearchCV to perform an exhaustive search across the parameter grid using 5-fold cross-validation.
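
One possible refinement, sketched below on synthetic data, is to place the scaler and the model in a scikit-learn Pipeline and pass the pipeline to GridSearchCV. The scaler is then refit on the training portion of each cross-validation fold instead of on the full training set. The step names "scaler" and "clf" and the synthetic data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the notebook's three features and binary target
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=2, n_redundant=0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Scaling happens inside the pipeline, so each CV fold is scaled independently
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Pipeline parameters are addressed as <step name>__<parameter>
param_grid = {
    "clf__C": [0.01, 0.1, 1, 10, 100],
    "clf__solver": ["liblinear", "saga"],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

test_accuracy = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print("Test accuracy:", test_accuracy)
```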

In [70]:
# 8. Get the best parameters from the grid search
print("Best Hyperparameters:", grid_search.best_params_)

Best Hyperparameters: {'C': 100, 'penalty': 'l2', 'solver': 'liblinear'}


After fitting the grid search, we select the best model, i.e., the parameter combination with the highest mean cross-validated accuracy.

In [71]:
# 9. Use the best model to make predictions
best_model = grid_search.best_estimator_

In [72]:
# 10. Evaluate the model on the test set
y_pred = best_model.predict(X_test_scaled)

# 11. Print the classification report and confusion matrix
print("Classification Report:")
print(classification_report(y_test, y_pred))

print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Classification Report:
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       115
           1       1.00      0.97      0.98        30

    accuracy                           0.99       145
   macro avg       1.00      0.98      0.99       145
weighted avg       0.99      0.99      0.99       145

Confusion Matrix:
[[115   0]
 [  1  29]]


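### Updated Regression Model Results Interpretation:

On the 145-movie test set, the tuned model misclassifies a single movie: all 115 low-grossing movies are predicted correctly, and 29 of the 30 high-grossing movies are identified (recall 0.97), with no false positives (precision 1.00).
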

### Summary of Findings from Both Logistic Regression Models:

First Model: The initial model had very low performance (accuracy around 21%) because it predicted the high-grossing class for every movie, giving perfect recall but only 0.21 precision for that class. Both models use the same three features, so the failure likely stems from fitting on unscaled features rather than from the features themselves.

Updated Model: The tuned model reaches 99% accuracy on the test set, with near-perfect precision and recall for the low-grossing class (0) and 0.97 recall for the high-grossing class (1). The improvement comes from standardizing the features and tuning hyperparameters, not from new features. This score should be read with caution: avg_revenue_with_studio is computed from total_gross, the same column that defines the label, and it is computed over all rows before the train-test split, so the model is largely recovering the threshold rather than predicting revenue from independent information.
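
To illustrate why a revenue-derived feature can make the task trivial, the sketch below builds a small hypothetical table in which each director-studio pair has a single movie. In that case avg_revenue_with_studio equals total_gross, and a one-line rule on that feature reproduces the label exactly. The rows are invented for illustration; only the column names and the threshold come from the notebook.

```python
import numpy as np
import pandas as pd

threshold = 100_000_000

# Hypothetical movies, one per director-studio pair
df = pd.DataFrame({
    "studio": ["A", "A", "B", "C"],
    "director_name": ["D1", "D2", "D3", "D4"],
    "total_gross": [250e6, 40e6, 120e6, 5e6],
})

# Label and feature built the same way as in the notebook
df["high_grossing"] = np.where(df["total_gross"] >= threshold, 1, 0)
df["avg_revenue_with_studio"] = (
    df.groupby(["studio", "director_name"])["total_gross"].transform("mean")
)

# A rule that only looks at the "feature" recovers the label
rule_pred = (df["avg_revenue_with_studio"] >= threshold).astype(int)
rule_accuracy = (rule_pred == df["high_grossing"]).mean()
print("Accuracy of the threshold rule:", rule_accuracy)
```

A fairer evaluation would likely drop this feature or derive it only from information available before a film's release, such as earlier years of data.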

## Basic Recommendation System Using Similarity Calculation

In content-based filtering, we calculate similarity between movies based on their features (here, encoded genre and franchise status). We use cosine similarity to find movies that are similar to a given movie.

In [60]:
# Use genre and franchise status for similarity calculation
similarity_features = data_2018[['genre_encoded', 'franchise_binary']]

#  Compute cosine similarity
similarity_matrix = cosine_similarity(similarity_features)

# Convert to DataFrame for easier interpretation
similarity_df = pd.DataFrame(similarity_matrix, index=data_2018['primary_title'], columns=data_2018['primary_title'])

# Remove duplicate rows and columns in the similarity matrix
similarity_df = similarity_df.loc[~similarity_df.index.duplicated(), ~similarity_df.columns.duplicated()]

similarity_df.head()


primary_title,The Guardians,The Mule,Suspiria,Beast,Winchester,First Man,Den of Thieves,The Happytime Murders,Tomb Raider,12 Strong,...,Shoplifters,Nobody's Fool,Andhadhun,Gonjiam: Haunted Asylum,Capernaum,The Spy Gone North,How Long Will I Love U,Helicopter Eela,Last Letter,Burn the Stage: The Movie
The Guardians,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.995893,1.0,1.0,1.0,1.0,1.0,0.986394
The Mule,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.995893,1.0,1.0,1.0,1.0,1.0,0.986394
Suspiria,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.995893,1.0,1.0,1.0,1.0,1.0,0.986394
Beast,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.995893,1.0,1.0,1.0,1.0,1.0,0.986394
Winchester,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,...,1.0,1.0,1.0,0.995893,1.0,1.0,1.0,1.0,1.0,0.986394
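
Most entries in the matrix above are exactly 1.0 because cosine similarity measures only the angle between vectors. Any two non-franchise rows have the form [genre_code, 0] and therefore point in the same direction regardless of genre, while rows with genre code 0 and no franchise are zero vectors and score 0.0 against everything. The sketch below reproduces this on three hand-picked rows and contrasts it with a one-hot encoding over a hypothetical three-genre vocabulary.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# [genre_encoded, franchise_binary] rows: genre codes 4, 19 and 0, all non-franchise
label_encoded = np.array([[4, 0], [19, 0], [0, 0]])
sim_label = cosine_similarity(label_encoded)
print(sim_label)

# One-hot rows (hypothetical 3-genre vocabulary): rows 0 and 2 share a genre, row 1 differs
one_hot = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0]])
sim_onehot = cosine_similarity(one_hot)
print(sim_onehot)
```

With one-hot features, movies in different genres get a similarity of 0 and movies in the same genre get 1, which is closer to what a genre-based recommender needs.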


Recommendation Function: This function will take a movie title and recommend other movies based on similarity.

In [64]:
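def recommend_movie(movie_title, n_recommendations=5):
    # Sort by similarity (highest first) and skip the first entry, assumed to be the movie itself.
    # With many tied scores of 1.0, the queried movie is not guaranteed to come first.
    recommendations = similarity_df[movie_title].sort_values(ascending=False)[1:n_recommendations+1]
    return recommendations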

# Example recommendation
print(recommend_movie("First Man"))


primary_title
Under the Tree      1.0
The Last Suit       1.0
Before We Vanish    1.0
The Children Act    1.0
Madame              1.0
Name: First Man, dtype: float64
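
All five recommendations above have a similarity of 1.0, so their order is arbitrary, and slicing off the first sorted entry only works when the queried movie happens to sort first. A slightly more defensive variant, sketched here with a hypothetical function name and a small made-up similarity table, drops the queried title explicitly and raises a clear error for unknown titles:

```python
import pandas as pd

# Small made-up similarity table for illustration
titles = ["First Man", "The Mule", "Suspiria", "Tomb Raider"]
sim = pd.DataFrame(
    [[1.0, 1.0, 0.9, 0.0],
     [1.0, 1.0, 0.9, 0.0],
     [0.9, 0.9, 1.0, 0.1],
     [0.0, 0.0, 0.1, 1.0]],
    index=titles, columns=titles,
)

def recommend_movie_safe(movie_title, similarity, n_recommendations=5):
    """Return the most similar titles, excluding the queried title itself."""
    if movie_title not in similarity.columns:
        raise KeyError(f"Unknown title: {movie_title!r}")
    # Remove the queried movie by label instead of assuming it sorts first
    scores = similarity[movie_title].drop(labels=movie_title)
    return scores.sort_values(ascending=False).head(n_recommendations)

recs = recommend_movie_safe("First Man", sim, n_recommendations=2)
print(recs)
```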
