## Feature Engineering

Now that the data has been preprocessed and stored in tables in normalized form, I will derive a few more features. These features will help me check whether they have any influence on the ratings of the movies.

I will add new features (listed below):

1. avg_rating
2. votes_weighted_rating
3. decade
4. is_multigenre
5. top_director
6. top_star
7. tag_count
8. most_common_tag
9. star_director_pair

Steps to derive features:

1. Load all data into one dataframe with a SQL query on the DB
2. Add the new columns to the dataframe
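Step 2 could look like the sketch below for two of the listed features, `decade` and `is_multigenre`. The toy rows are made up for illustration, and the definitions are my assumptions; only the column names `release_year` and `genres` come from the query used in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; the real frame comes from the SQL query.
toy = pd.DataFrame({
    "release_year": [1995, 2003, 2019],
    "genres": ["Comedy,Animation", "Drama", "Action,Thriller,Sci-Fi"],
})

# decade: floor the year to the start of its decade (1995 -> 1990)
toy["decade"] = (toy["release_year"] // 10) * 10

# is_multigenre: 1 when the comma-separated genre list has more than one entry
toy["is_multigenre"] = (toy["genres"].str.split(",").str.len() > 1).astype(int)

print(toy)
```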

In [1]:
import sqlite3
import pandas as pd

from collections import Counter


In [2]:
db_name = '../Data/movies.db'
conn = sqlite3.connect(db_name)

query = """SELECT
    l.movieid,
    i.movie_name,
    i.rating AS imdb_rating,
    i.votes AS imdb_votes,
    i.runtime AS imdb_runtime,
    i.year AS year,
    t.vote_average AS tmdb_vote_average,
    t.vote_count AS tmdb_votes,
    t.original_language as language,
    t.popularity as popularity,
    t.release_year,
    t.budget as budget,
    t.revenue as revenue,
    GROUP_CONCAT(DISTINCT g.genre_name) AS genres,
    GROUP_CONCAT(DISTINCT d.director_name) AS directors,
    GROUP_CONCAT(DISTINCT s.star_name) AS stars,
    GROUP_CONCAT(DISTINCT p.production_companies_name) AS production_house
FROM links l
JOIN imdb i ON l.imdbid = i.movie_id
LEFT JOIN tmdb t ON l.tmdbid = t.id
LEFT JOIN genre_imdb gi ON i.movie_id = gi.movie_id
LEFT JOIN genre g ON gi.genre_id = g.genre_id
LEFT JOIN director_imdb di ON i.movie_id = di.movie_id
LEFT JOIN director d ON di.director_id = d.director_id
LEFT JOIN star_imdb si ON i.movie_id = si.movie_id
LEFT JOIN star s ON si.star_id = s.star_id
LEFT JOIN production_companies_tmdb pi ON t.id = pi.id
LEFT JOIN production_companies p ON pi.production_companies_id = p.production_companies_id 
GROUP BY l.movieid
"""

df = pd.read_sql_query(query, conn)
print(df.shape)
# print(df.head())

print("\n ***** Converting Data for columns imdb_votes, year, release_year, budget, revenue from float to int *****")
df['imdb_votes'] = df['imdb_votes'].astype('Int64')
df['year'] = df['year'].astype('Int64')
df['release_year'] = df['release_year'].astype('Int64')

df['budget'] = df['budget'].astype('Int64')
df['revenue'] = df['revenue'].astype('Int64')

print("\n ***** Imputing Data for columns release_year, tmdb_vote_average, tmdb_votes *****")
df['release_year'] = df['release_year'].fillna(df['year'])
df['tmdb_vote_average'] = df['tmdb_vote_average'].fillna(df['imdb_rating'])
df['tmdb_votes'] = df['tmdb_votes'].fillna(df['imdb_votes'])

# Note: 5 records still have year as Null but it doesn't matter
# print(df['release_year'].isnull().sum())

# we have release_year column so drop year column
df.drop(columns='year', inplace = True)
# print(df[df['release_year'].isnull()])

(46173, 17)

 ***** Converting Data for columns imdb_votes, year, release_year, budget, revenue from float to int *****

 ***** Imputing Data for columns release_year, tmdb_vote_average, tmdb_votes *****


In [3]:
# rename column
df.rename(columns={
    'imdb_runtime': 'runtime'
}, inplace=True)

df.dropna(inplace = True)
df.head()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,runtime,tmdb_vote_average,tmdb_votes,language,popularity,release_year,budget,revenue,genres,directors,stars,production_house
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,English,78.404,1995,30000000,394400000,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",Pixar
1,2,Jumanji,7.0,352469,104,7.2,9833.0,English,13.444,1995,65000000,262821940,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...","PolyGram Filmed Entertainment,TriStar Pictures..."
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,English,14.815,1995,25000000,71500000,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...","Warner Bros. Pictures,Lancaster Gate"
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,English,14.451,1995,16000000,81452156,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",20th Century Fox
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,English,14.537,1995,0,76594107,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...","Touchstone Pictures,Sandollar Productions"


In [4]:
# feature 1: avg_rating across imdb and tmdb
df['rating'] = ((df['imdb_rating'] + df['tmdb_vote_average']) / 2).round(1)
df['votes'] = ((df['imdb_votes'] + df['tmdb_votes'])).astype('Int64')

# as we have new avg rating and total_counts, drop the old ones (we no longer need them)
df.drop(columns=['tmdb_vote_average', 'imdb_rating', 'imdb_votes', 'tmdb_votes'], inplace = True)
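The feature list also names `votes_weighted_rating`, which the cell above does not compute. One possible definition (my assumption, not something this notebook specifies) is the mean of the two ratings weighted by their vote counts. It would have to run before the source columns are dropped. The helper name and signature below are hypothetical.

```python
def votes_weighted_rating(imdb_rating, imdb_votes, tmdb_rating, tmdb_votes):
    """Mean of the two ratings, each weighted by its vote count."""
    total = imdb_votes + tmdb_votes
    if total == 0:
        # No votes on either side: fall back to the plain average
        return round((imdb_rating + tmdb_rating) / 2, 1)
    return round((imdb_rating * imdb_votes + tmdb_rating * tmdb_votes) / total, 1)

# Toy Story row from the table above: 8.3 with 1002538 votes, 8.0 with 17152 votes
print(votes_weighted_rating(8.3, 1002538, 8.0, 17152))

# Made-up values where the second source has three times the votes
print(votes_weighted_rating(6.0, 100, 8.0, 300))
```

Unlike the plain average, the source with far more votes dominates the result.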

In [5]:
# replace spaces with underscores in 'directors', 'stars', 'genres', 'production_house' so each name stays a single token
df['directors'] = df['directors'].str.replace(' ', '_')
df['stars'] = df['stars'].str.replace(' ', '_')
df['genres'] = df['genres'].str.replace(' ', '_')
df['production_house'] = df['production_house'].str.replace(' ', '_')

In [6]:
# create a backup df to validate data for models at the end
main_df = df.copy()

In [7]:
# performing data scaling on numerical columns
import numpy as np
from sklearn.preprocessing import StandardScaler

df[['budget', 'revenue', 'votes']] = df[['budget', 'revenue', 'votes']].replace(0, np.nan)
df[['budget', 'revenue', 'votes']] = df[['budget', 'revenue', 'votes']].fillna(df[['budget', 'revenue', 'votes']].median())
df[['budget', 'revenue', 'votes']] = df[['budget', 'revenue', 'votes']].apply(np.log1p)

scaler = StandardScaler()
df[['runtime', 'popularity', 'budget', 'revenue', 'votes']] = scaler.fit_transform(df[['runtime', 'popularity', 'budget', 'revenue', 'votes']])

df.head()

Unnamed: 0,movieId,movie_name,runtime,language,popularity,release_year,budget,revenue,genres,directors,stars,production_house,rating,votes
0,1,Toy Story,-0.876283,English,2.233116,1995,1.195327,2.712847,"Comedy,Animation,Adventure",John_Lasseter,"Tom_Hanks,Tim_Allen,Jim_Varney,Don_Rickles",Pixar,8.2,3.173651
1,2,Jumanji,0.172073,English,0.153076,1995,1.816543,2.412624,"Comedy,Adventure,Family",Joe_Johnston,"Bonnie_Hunt,Robin_Williams,Kirsten_Dunst,Jonat...","PolyGram_Filmed_Entertainment,TriStar_Pictures...",7.1,2.647084
2,3,Grumpier Old Men,0.035331,English,0.196976,1995,1.048842,1.449742,"Comedy,Romance",Howard_Deutch,"Jack_Lemmon,Ann-Margret,Walter_Matthau,Sophia_...","Warner_Bros._Pictures,Lancaster_Gate",6.6,1.359258
3,4,Waiting to Exhale,1.083687,English,0.185321,1995,0.690274,1.546134,"Comedy,Drama,Romance",Forest_Whitaker,"Angela_Bassett,Loretta_Devine,Lela_Rochon,Whit...",20th_Century_Fox,6.1,0.893266
4,5,Father of the Bride Part II,0.263235,English,0.188075,1995,0.092157,1.500648,"Comedy,Romance,Family",Charles_Shyer,"Diane_Keaton,Steve_Martin,Martin_Short,Kimberl...","Touchstone_Pictures,Sandollar_Productions",6.1,1.528488
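To see why the cell applies `np.log1p` before `StandardScaler`, here is a small sketch with made-up budget values. The standardization is done by hand with numpy using the population standard deviation, which is what `StandardScaler` uses; the numbers are not from the dataset.

```python
import numpy as np

# Made-up budgets spanning several orders of magnitude
budget = np.array([1e4, 1e5, 1e6, 1e7, 3e8])

def standardize(x):
    # Zero mean, unit variance (population std, ddof=0)
    return (x - x.mean()) / x.std()

raw_z = standardize(budget)
log_z = standardize(np.log1p(budget))

print(np.round(raw_z, 2))  # the single largest value dominates
print(np.round(log_z, 2))  # values are spread more evenly
```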


In [8]:
# checking stats of the final data, to verify if there are any missing values
num_rows, num_cols = df.shape
print(num_rows, num_cols)

# get missing values count and %
print("Missing values per column:")
missing_info = df.isnull().sum()
for column, count in missing_info.items():
    percent = (count / num_rows) * 100
    print(f"  {column}: {count} missing ({percent:.2f}%)")

46063 14
Missing values per column:
  movieId: 0 missing (0.00%)
  movie_name: 0 missing (0.00%)
  runtime: 0 missing (0.00%)
  language: 0 missing (0.00%)
  popularity: 0 missing (0.00%)
  release_year: 0 missing (0.00%)
  budget: 0 missing (0.00%)
  revenue: 0 missing (0.00%)
  genres: 0 missing (0.00%)
  directors: 0 missing (0.00%)
  stars: 0 missing (0.00%)
  production_house: 0 missing (0.00%)
  rating: 0 missing (0.00%)
  votes: 0 missing (0.00%)


In [9]:
# defining functions to extract features for stars, directors and production companies
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

# General function to extract top-N from a comma-separated column
def extract_top_n(series, n):
    # explode() flattens the per-row lists without repeated list concatenation
    all_items = series.dropna().str.split(',').explode()
    return [item.strip() for item, _ in Counter(all_items).most_common(n)]

# Function to filter only top-N from each row
def filter_top_items(text, top_n_list):
    if not isinstance(text, str):
        return ''
    items = [i.strip() for i in text.split(',')]
    return ' '.join([i for i in items if i in top_n_list])


## Strategy behind selecting other features

1. Movie ratings are often influenced by the cast, the production companies, and the directors.
2. I have extensive data in these fields for each of the movies.
3. When recommending a movie, it is also essential to consider these factors rather than just genre or revenue.

## Primary Analysis
1. The first approach I considered was `one hot encoding`, but the number of distinct genres, stars, directors, and production companies is far too large.
2. It could lead to 1000+ columns in the dataset, a sparse matrix, and a heavy load on further processing.

## What's next

1. To keep it simple and easy to process, I am targeting only the top 10 genres, top 10 stars, top 5 directors, and top 5 production companies (as they will usually be the key factors influencing ratings).
2. Further, I decided to use `tf-idf`. Why? Let's say a movie has 5 stars and 3 of them are very popular. After filtering, only those 3 names remain, and tf-idf gives each of them a weight that depends on how rare the name is across all movies, normalized within the movie, instead of a flat 0/1 flag.
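The weighting can be sketched by hand. The snippet below follows the smoothed formula that scikit-learn documents for its default settings, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`, followed by L2 normalization of each row. The three toy "movies" and the star names are made up.

```python
import math

# Made-up filtered star strings, one per movie (names joined by spaces)
docs = ["Star_A Star_B", "Star_A", "Star_A Star_C"]
tokens = [d.split() for d in docs]
n_docs = len(tokens)

# Document frequency: in how many movies each name appears
vocab = sorted({t for doc in tokens for t in doc})
df_count = {t: sum(t in doc for doc in tokens) for t in vocab}

# Smoothed idf (assumed to match the TfidfVectorizer defaults)
idf = {t: math.log((1 + n_docs) / (1 + df_count[t])) + 1 for t in vocab}

def tfidf_row(doc):
    # term frequency * idf, then L2-normalize the row
    weights = {t: doc.count(t) * idf[t] for t in set(doc)}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

for doc in tokens:
    print({t: round(w, 3) for t, w in sorted(tfidf_row(doc).items())})
```

In this sketch the rarer name (`Star_B`) gets a larger weight than the name that appears in every movie (`Star_A`), and a movie with a single retained name gets weight 1.0.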

In [10]:
# 1. Genres (Top 10)
top_genres = extract_top_n(df['genres'], 10)
df['genres_filtered'] = df['genres'].apply(lambda x: filter_top_items(x, top_genres))
print(top_genres)

# 2. Stars (Top 10)
top_stars = extract_top_n(df['stars'], 10)
df['stars_filtered'] = df['stars'].apply(lambda x: filter_top_items(x, top_stars))
print(top_stars)

# 3. Directors (Top 5)
top_directors = extract_top_n(df['directors'], 5)
df['directors_filtered'] = df['directors'].apply(lambda x: filter_top_items(x, top_directors))
print(top_directors)

# 4. Production Companies (Top 5)
top_producers = extract_top_n(df['production_house'], 5)
print(top_producers)
df['prod_filtered'] = df['production_house'].apply(lambda x: filter_top_items(x, top_producers))

['Drama', 'Comedy', 'Romance', 'Action', 'Crime', 'Thriller', 'Horror', 'Adventure', 'Mystery', 'Fantasy']
['Nicolas_Cage', 'Bruce_Willis', 'Christopher_Lee', 'Gérard_Depardieu', 'Eric_Roberts', 'John_Wayne', 'James_Mason', 'Michael_Caine', 'Robert_De_Niro', 'Boris_Karloff']
['Michael_Curtiz', 'Richard_Thorpe', 'Cheh_Chang', 'Alfred_Hitchcock', 'Mervyn_LeRoy']
['Unknown', 'Warner_Bros._Pictures', 'Metro-Goldwyn-Mayer', 'Paramount', '20th_Century_Fox']


In [11]:
# Create and apply TF-IDF for each
def tfidf_to_df(column, prefix):
    vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b[\w.-]+\b')
    tfidf_matrix = vectorizer.fit_transform(df[column])
    
    # Replace spaces with underscores in feature names
    feature_names = [f"{prefix}_{feat.replace(' ', '_')}" for feat in vectorizer.get_feature_names_out()]
    
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=feature_names)
    return tfidf_df

# Apply to each filtered field
genre_tfidf_df = tfidf_to_df('genres_filtered', 'genre')
star_tfidf_df = tfidf_to_df('stars_filtered', 'star')
dir_tfidf_df = tfidf_to_df('directors_filtered', 'director')
prod_tfidf_df = tfidf_to_df('prod_filtered', 'prod')

# Concatenate all
df = pd.concat([df.reset_index(drop=True), genre_tfidf_df, star_tfidf_df, dir_tfidf_df, prod_tfidf_df], axis=1)

# Drop temp columns
df.drop(columns=['genres_filtered', 'stars_filtered', 'directors_filtered', 'prod_filtered'], inplace=True)
# Drop old columns
df.drop(columns=['genres', 'stars', 'directors', 'production_house'], inplace=True)

print(df.shape)
df.head()

(46063, 40)


Unnamed: 0,movieId,movie_name,runtime,language,popularity,release_year,budget,revenue,rating,votes,...,director_alfred_hitchcock,director_cheh_chang,director_mervyn_leroy,director_michael_curtiz,director_richard_thorpe,prod_20th_century_fox,prod_metro-goldwyn-mayer,prod_paramount,prod_unknown,prod_warner_bros._pictures
0,1,Toy Story,-0.876283,English,2.233116,1995,1.195327,2.712847,8.2,3.173651,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,2,Jumanji,0.172073,English,0.153076,1995,1.816543,2.412624,7.1,2.647084,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,3,Grumpier Old Men,0.035331,English,0.196976,1995,1.048842,1.449742,6.6,1.359258,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
3,4,Waiting to Exhale,1.083687,English,0.185321,1995,0.690274,1.546134,6.1,0.893266,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,5,Father of the Bride Part II,0.263235,English,0.188075,1995,0.092157,1.500648,6.1,1.528488,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [12]:
df.columns

Index(['movieId', 'movie_name', 'runtime', 'language', 'popularity',
       'release_year', 'budget', 'revenue', 'rating', 'votes', 'genre_action',
       'genre_adventure', 'genre_comedy', 'genre_crime', 'genre_drama',
       'genre_fantasy', 'genre_horror', 'genre_mystery', 'genre_romance',
       'genre_thriller', 'star_boris_karloff', 'star_bruce_willis',
       'star_christopher_lee', 'star_eric_roberts', 'star_gérard_depardieu',
       'star_james_mason', 'star_john_wayne', 'star_michael_caine',
       'star_nicolas_cage', 'star_robert_de_niro', 'director_alfred_hitchcock',
       'director_cheh_chang', 'director_mervyn_leroy',
       'director_michael_curtiz', 'director_richard_thorpe',
       'prod_20th_century_fox', 'prod_metro-goldwyn-mayer', 'prod_paramount',
       'prod_unknown', 'prod_warner_bros._pictures'],
      dtype='object')

## Feature Engineering: Conclusion

1. As we can see, our data looks clean and less sparse now. Many columns still contain mostly 0 values, but they are equally important.
2. E.g., if none of the popular stars appears in a movie and it still has a high rating, that may be due to other features.

Now we will proceed ahead with feature selection.

## Feature selection

### Steps:
1. We will perform two types of tests as per the requirement: correlation analysis and tree-based feature importance.
2. For `Correlation Analysis`: We will run the test on numerical columns such as `popularity`, `release_year`, `budget`, `revenue`, and `votes`, together with the target `rating`.
3. For `tree-based feature importance`: We will find the key features (ranked by importance).

In [13]:
# separating target variable from dataset

X = df.drop(columns=['rating'])  # Assuming 'rating' is your target
y = df['rating']

In [14]:
import numpy as np
import pandas as pd

numerical_cols = ['popularity', 'release_year', 'budget', 'revenue', 'votes', 'rating']

X_numeric = df[numerical_cols]

# Compute correlation matrix
corr_matrix = X_numeric.corr().abs()

# Get upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

print(upper)

# Find features with correlation > 0.9
high_corr_features = [column for column in upper.columns if any(upper[column] > 0.9)]

# Drop them
X_corr_reduced = X_numeric.drop(columns=high_corr_features)


              popularity  release_year    budget   revenue     votes    rating
popularity           NaN       0.09184  0.121412  0.133249  0.242904  0.104728
release_year         NaN           NaN  0.053815  0.007208  0.170181  0.137125
budget               NaN           NaN       NaN  0.302511  0.222280  0.099507
revenue              NaN           NaN       NaN       NaN  0.196978  0.067312
votes                NaN           NaN       NaN       NaN       NaN  0.403179
rating               NaN           NaN       NaN       NaN       NaN       NaN


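## Correlation Conclusion:

1. No pair of variables comes close to the 0.9 threshold (the highest value is 0.40, between `votes` and `rating`), so nothing is dropped. Even though the variables are only weakly correlated, they are still important.
2. A low correlation only means a variable is not linearly correlated with the target; a non-linear relationship may still exist.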
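A low Pearson correlation only rules out a linear relationship. The sketch below uses synthetic data (not the movie dataset) in which one variable is fully determined by the other, yet the correlation is close to zero.

```python
import numpy as np

x = np.linspace(-1, 1, 201)   # symmetric around 0
y_linear = 2 * x + 1          # linear dependence
y_quadratic = x ** 2          # fully determined by x, but not linearly

r_linear = np.corrcoef(x, y_linear)[0, 1]
r_quadratic = np.corrcoef(x, y_quadratic)[0, 1]

print(f"linear:    {r_linear:.3f}")
print(f"quadratic: {r_quadratic:.3f}")
```

This is one reason to follow the correlation check with a tree-based method, which can pick up non-linear effects.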

## Tree based feature importance (random forest)

In [15]:
from sklearn.ensemble import RandomForestRegressor
import pandas as pd

X = df.drop(columns=['rating', 'movieId', 'movie_name', 'language'])  # Drop non-feature columns
y = df['rating']

model = RandomForestRegressor(random_state=42)
model.fit(X, y)

importances = pd.Series(model.feature_importances_, index=X.columns)
top_features = importances.sort_values(ascending=False).head(20)

print(top_features)


votes                         0.285324
runtime                       0.144640
popularity                    0.134923
release_year                  0.132341
genre_horror                  0.079131
genre_drama                   0.050674
budget                        0.031600
genre_action                  0.019502
genre_comedy                  0.017593
revenue                       0.016870
genre_thriller                0.014850
genre_adventure               0.014430
genre_crime                   0.012214
prod_unknown                  0.010913
genre_fantasy                 0.009635
genre_romance                 0.009315
genre_mystery                 0.006844
prod_warner_bros._pictures    0.001702
prod_metro-goldwyn-mayer      0.001667
prod_20th_century_fox         0.001256
dtype: float64


## Conclusion: Tree Based feature importance (random forest)

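1. This helped us understand the importance of each feature for the target (`rating`) column.
2. The ranking is also consistent with real-world behavior.
3. E.g., `votes` is the most important feature (0.29): how many people rate a movie is closely tied to its rating.
4. `runtime`, `popularity`, and `release_year` follow (0.13 to 0.14 each), then the genre features. `budget` and `revenue` contribute comparatively little (0.03 and 0.02).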

In [16]:
# backup copy for scaled df
scaled_df = df.copy()

## Linear Regression

In [17]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import mean_squared_error, accuracy_score, f1_score, classification_report

# Assuming df is your preprocessed DataFrame (TF-IDF + Scaled)
# Make sure it has no missing values
df = scaled_df.copy()
df.dropna(inplace=True)


In [18]:
# Prepare features and target
X_reg = df.drop(['rating', 'movieId', 'movie_name', 'language'], axis=1)
y_reg = df['rating']

# Split data
X_train_reg, X_temp_reg, y_train_reg, y_temp_reg = train_test_split(X_reg, y_reg, test_size=0.3, random_state=42)
X_val_reg, X_test_reg, y_val_reg, y_test_reg = train_test_split(X_temp_reg, y_temp_reg, test_size=0.5, random_state=42)

# Train Linear Regression model
lin_reg = LinearRegression()
lin_reg.fit(X_train_reg, y_train_reg)

# Validate
val_preds_reg = lin_reg.predict(X_val_reg)
val_mse = mean_squared_error(y_val_reg, val_preds_reg)

# Final test evaluation
test_preds_reg = lin_reg.predict(X_test_reg)
test_mse = mean_squared_error(y_test_reg, test_preds_reg)

print("📊 Linear Regression Results")
print(f"Validation MSE: {val_mse:.4f}")
print(f"Test MSE: {test_mse:.4f}")

📊 Linear Regression Results
Validation MSE: 0.8367
Test MSE: 0.8312
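An MSE is easier to read as an RMSE, which is in rating points, and next to a naive baseline. The sketch below converts the two MSE values printed above and shows a hypothetical helper, `baseline_mse`, for the error of always predicting the training mean. The ratings passed to it are made up; in the notebook it would be called with `y_train_reg` and `y_test_reg`.

```python
import math

# MSE values printed above
val_mse, test_mse = 0.8367, 0.8312

# RMSE is in the same units as the target (rating points)
print(f"Validation RMSE: {math.sqrt(val_mse):.3f}")
print(f"Test RMSE: {math.sqrt(test_mse):.3f}")

def baseline_mse(y_train, y_eval):
    """MSE of always predicting the training mean."""
    mean = sum(y_train) / len(y_train)
    return sum((y - mean) ** 2 for y in y_eval) / len(y_eval)

# Made-up ratings, only to show the call
print(baseline_mse([6.0, 7.0, 8.0], [6.5, 7.5]))
```

If the model's MSE is not clearly below the baseline value, the features add little beyond the average rating.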


## Logistic regression

In [19]:
# Create binary target
df = scaled_df.copy()
df['is_good'] = (df['rating'] >= 7).astype(int)

# Check class distribution
print("Class distribution:\n", df['is_good'].value_counts())


Class distribution:
 is_good
0    38379
1     7684
Name: count, dtype: int64


In [20]:
df = scaled_df.copy()

min_rating = df['rating'].min()
max_rating = df['rating'].max()

print(f"Minimum rating: {min_rating}")
print(f"Maximum rating: {max_rating}")

Minimum rating: 0.6
Maximum rating: 9.3


In [21]:
df = scaled_df.copy()
# Create binary label
df['is_good'] = (df['rating'] >= 7).astype(int)

# Prepare features and target
X_clf = df.drop(['rating', 'is_good', 'movieId', 'movie_name', 'language', 'budget', 'revenue'], axis=1)
y_clf = df['is_good']

# Split data
X_train_clf, X_temp_clf, y_train_clf, y_temp_clf = train_test_split(X_clf, y_clf, test_size=0.3, random_state=42)
X_val_clf, X_test_clf, y_val_clf, y_test_clf = train_test_split(X_temp_clf, y_temp_clf, test_size=0.5, random_state=42)

# Train Logistic Regression model
log_reg = LogisticRegression(class_weight='balanced', max_iter=10000)
log_reg.fit(X_train_clf, y_train_clf)

# Validate
val_preds_clf = log_reg.predict(X_val_clf)
val_acc = accuracy_score(y_val_clf, val_preds_clf)
val_f1 = f1_score(y_val_clf, val_preds_clf)

# Final test evaluation
test_preds_clf = log_reg.predict(X_test_clf)

print("\n📊 Logistic Regression Results")
print(f"Validation Accuracy: {val_acc:.4f}")
print(f"Validation F1 Score: {val_f1:.4f}")
print("\nTest Classification Report:")
print(classification_report(y_test_clf, test_preds_clf))


📊 Logistic Regression Results
Validation Accuracy: 0.7319
Validation F1 Score: 0.4959

Test Classification Report:
              precision    recall  f1-score   support

           0       0.93      0.72      0.81      5832
           1       0.32      0.71      0.44      1078

    accuracy                           0.72      6910
   macro avg       0.62      0.72      0.63      6910
weighted avg       0.84      0.72      0.75      6910
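Because the classes are imbalanced, accuracy needs a reference point. The sketch below uses only numbers already printed in this notebook: the class counts from the distribution check and the class 1 precision and recall from the test report. The helper `f1` is defined here for illustration.

```python
# Class counts from the distribution printed earlier
n_neg, n_pos = 38379, 7684

# Accuracy of always predicting the majority class (0)
majority_acc = n_neg / (n_neg + n_pos)
print(f"Majority-class accuracy: {majority_acc:.4f}")

def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# Class 1 precision and recall from the test report above
print(f"F1 for class 1: {f1(0.32, 0.71):.2f}")
```

Always predicting "not good" would reach about 83% accuracy on the full dataset, so the 0.72 to 0.73 accuracy of the balanced model is better judged by its recall and F1 on class 1.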



## KNN algorithm (unsupervised) model

In [22]:
# Knn recommendations:

from sklearn.neighbors import NearestNeighbors
from sklearn.model_selection import train_test_split

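# Drop non-feature columns
# Note: df still carries 'is_good' from the logistic regression cell, so it stays in X as a feature
X = df.drop(columns=['movieId', 'movie_name', 'rating', 'language'])

X_train, X_val = train_test_split(X, test_size=0.2, random_state=42)

knn = NearestNeighbors(n_neighbors=6, metric='cosine')  # query comes from X_val, so all 6 neighbors are recommendations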
knn.fit(X_train)

# Pick a movie from the validation set
movie_idx = 42
# Find the actual index in X_val (needed to get correct data from df)
actual_val_idx = X_val.index[movie_idx]
# Run KNN
distances, indices = knn.kneighbors(X_val.iloc[[movie_idx]])

# Map KNN output indices (relative to X_train) to actual indices in df
recommended_indices = [X_train.index[i] for i in indices[0]]

print("KNN Recommendations:")
print(scaled_df.loc[recommended_indices][['movieId', 'movie_name']])

movie_ids = scaled_df.loc[recommended_indices]['movieId'].tolist()

KNN Recommendations:
       movieId                    movie_name
25137   159542  Endgame - Bronx lotta finale
38148   220848             The Last Sentinel
30696   184295                   Gunned Down
36700   212519          Disturbing the Peace
32888   194204                    Alterscape
44512   280400                 Double Threat


In [23]:
from sklearn.metrics.pairwise import cosine_similarity

# Input movie vector
input_vector = X_val.iloc[[movie_idx]]

# Get same recommended vectors (mapped to actual df indices)
recommended_vectors = X.loc[recommended_indices]

# Cosine similarity
similarities = cosine_similarity(input_vector, recommended_vectors)[0]

print("\nCosine Similarities with Recommended Movies:")
for idx, sim in zip(recommended_indices, similarities):
    print(f"{df.loc[idx, 'movie_name']}: {sim:.4f}")


Cosine Similarities with Recommended Movies:
Endgame - Bronx lotta finale: 1.0000
The Last Sentinel: 1.0000
Gunned Down: 1.0000
Disturbing the Peace: 1.0000
Alterscape: 1.0000
Double Threat: 1.0000


In [24]:
filtered_df = main_df[main_df['movieId'].isin(movie_ids)][['movie_name', 'genres']]
print(filtered_df)


                         movie_name                  genres
25184  Endgame - Bronx lotta finale  Action,Thriller,Sci-Fi
30766                   Gunned Down         Action,Thriller
32964                    Alterscape  Action,Thriller,Sci-Fi
36786          Disturbing the Peace         Action,Thriller
38238             The Last Sentinel  Action,Thriller,Sci-Fi
44620                 Double Threat         Action,Thriller


## Conclusion: KNN

1. For KNN, we use an unsupervised algorithm (as our data doesn't have actual labels).
2. So, to validate our results, we can either check cosine similarity or manually verify the attributes.
3. We checked both above and both are up to the mark. Genre stands out as the shared attribute: all 6 recommended movies have the `Action` and `Thriller` genres, and 3 of them also have `Sci-Fi`.
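One thing worth checking before trusting the similarity scores: `release_year` is not among the columns passed to `StandardScaler`, so its values near 2000 are far larger than the scaled features. This may explain why every score prints as 1.0000, although I have not verified that on this dataset. The sketch below uses made-up vectors to show the effect.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity of two 1-D vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors: [release_year, scaled_feature_1, scaled_feature_2]
a = np.array([1995.0, 1.5, -0.8])
b = np.array([2010.0, -1.2, 0.9])

print(f"with raw year:    {cosine(a, b):.4f}")
print(f"without the year: {cosine(a[1:], b[1:]):.4f}")
```

The two toy movies point in nearly opposite directions on the scaled features, yet the raw year pushes their similarity to almost 1. Scaling `release_year` together with the other numeric columns would be one way to avoid this.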