## Feature Engineering

Now that the data has been processed and stored in tables in normalized form, I will derive a few more features. These features will help me check whether they have any influence on movie ratings.

I will add the following new features:

1. avg_rating
2. votes_weighted_rating
3. decade
4. is_multigenre
5. top_director
6. top_star
7. tag_count
8. most_common_tag
9. star_director_pair

Steps to derive features:

1. Load all the data into a single DataFrame with one SQL query against the database.
2. Add each new feature as a column of that DataFrame.
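
The two steps above can be sketched on a toy in-memory database. The tables `movie` and `movie_genre`, their rows, and the derived `genre_count` column are hypothetical and exist only for illustration; they are not the project's schema.

```python
import sqlite3
import pandas as pd

# Toy in-memory database; table names and rows are illustrative only.
toy_conn = sqlite3.connect(":memory:")
toy_conn.executescript("""
    CREATE TABLE movie (movie_id INTEGER PRIMARY KEY, title TEXT, rating REAL);
    CREATE TABLE movie_genre (movie_id INTEGER, genre_name TEXT);
    INSERT INTO movie VALUES (1, 'A', 8.0), (2, 'B', 6.5);
    INSERT INTO movie_genre VALUES (1, 'Comedy'), (1, 'Drama'), (2, 'Horror');
""")

# Step 1: one row per movie, with the one-to-many genres collapsed into a string.
toy_df = pd.read_sql_query(
    """
    SELECT m.movie_id, m.title, m.rating,
           GROUP_CONCAT(DISTINCT g.genre_name) AS genres
    FROM movie m
    LEFT JOIN movie_genre g ON m.movie_id = g.movie_id
    GROUP BY m.movie_id
    ORDER BY m.movie_id
    """,
    toy_conn,
)

# Step 2: derive a new column from the existing ones.
toy_df["genre_count"] = toy_df["genres"].str.split(",").str.len()
toy_conn.close()
```

Collapsing genres, directors, and stars with `GROUP_CONCAT` keeps one row per movie, which is why the later cells split those strings on commas.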

In [1]:
import sqlite3
import pandas as pd

from collections import Counter


In [2]:
db_name = '../Data/movies.db'
conn = sqlite3.connect(db_name)

query = """SELECT
    l.movieid,
    i.movie_name,
    i.rating AS imdb_rating,
    i.votes AS imdb_votes,
    i.runtime AS imdb_runtime,
    i.year AS year,
    t.vote_average AS tmdb_vote_average,
    t.vote_count AS tmdb_votes,
    t.release_year,
    GROUP_CONCAT(DISTINCT g.genre_name) AS genres,
    GROUP_CONCAT(DISTINCT d.director_name) AS directors,
    GROUP_CONCAT(DISTINCT s.star_name) AS stars
FROM links l
JOIN imdb i ON l.imdbid = i.movie_id
LEFT JOIN tmdb t ON l.tmdbid = t.id
LEFT JOIN genre_imdb gi ON i.movie_id = gi.movie_id
LEFT JOIN genre g ON gi.genre_id = g.genre_id
LEFT JOIN director_imdb di ON i.movie_id = di.movie_id
LEFT JOIN director d ON di.director_id = d.director_id
LEFT JOIN star_imdb si ON i.movie_id = si.movie_id
LEFT JOIN star s ON si.star_id = s.star_id
GROUP BY l.movieid
"""

df = pd.read_sql_query(query, conn)
print(df.shape)

print("\n ***** Converting Data for columns imdb_votes, year, release_year from float to int *****")
df['imdb_votes'] = df['imdb_votes'].astype('Int64')
df['year'] = df['year'].astype('Int64')
df['release_year'] = df['release_year'].astype('Int64')

print("\n ***** Imputing Data for columns release_year, tmdb_vote_average *****")
df['release_year'] = df['release_year'].fillna(df['year'])
df['tmdb_vote_average'] = df['tmdb_vote_average'].fillna(df['imdb_rating'])
df['tmdb_votes'] = df['tmdb_votes'].fillna(df['imdb_votes'])

# Note: 5 records still have release_year as null after imputation; they are left as-is
# print(df['release_year'].isnull().sum())

# release_year now covers year, so drop the year column
df.drop(columns='year', inplace = True)
# print(df[df['release_year'].isnull()])


# print(df.head())

(46173, 12)

 ***** Converting Data for columns imdb_votes, year, release_year from float to int *****

 ***** Imputing Data for columns release_year, tmdb_vote_average *****
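
The conversion and imputation in this cell rely on two pandas behaviours: the nullable `Int64` dtype can hold missing values alongside integers, and `fillna` accepts another Series and fills by matching index. A minimal sketch on made-up data (the `toy` frame is hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up rows: the second is missing release_year only, the third is missing both.
toy = pd.DataFrame({
    "year": [1995.0, 2001.0, np.nan],
    "release_year": [1995.0, np.nan, np.nan],
})

# Nullable Int64 keeps the missing entries as <NA> instead of forcing floats.
toy["year"] = toy["year"].astype("Int64")
toy["release_year"] = toy["release_year"].astype("Int64")

# Fill gaps in release_year from year, row by row.
toy["release_year"] = toy["release_year"].fillna(toy["year"])
```

A row that is missing in both columns stays missing, which is consistent with the handful of records that keep a null `release_year` after imputation.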


In [3]:
# feature 1: avg_rating across imdb and tmdb
df['avg_rating'] = ((df['imdb_rating'] + df['tmdb_vote_average']) / 2).round(1)
# helper column: combined vote count from both sources, used by the weighted rating
df['total_votes'] = (df['imdb_votes'] + df['tmdb_votes']).astype('Int64')

df.head()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216


In [4]:
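# feature 2: votes-weighted rating

# C: mean rating across all movies (the value every movie is pulled toward)
C = df['avg_rating'].mean()
# m: vote threshold, set to the 80th percentile of total_votes
m = df['total_votes'].quantile(0.80)

def weighted_rating(x, m=m, C=C):
    v = x['total_votes']
    R = x['avg_rating']
    # leave the result empty when either input is missing
    if pd.isnull(v) or pd.isnull(R):
        return None
    # blend the movie's own rating with the global mean, weighted by vote count
    return (v / (v + m)) * R + (m / (v + m)) * C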

df['votes_weighted_rating'] = df.apply(weighted_rating, axis=1)
df['votes_weighted_rating'] = df['votes_weighted_rating'].round(1)

df.head()


Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes,votes_weighted_rating
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690,8.2
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302,7.1
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838,6.4
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541,6.0
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216,6.1
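
The formula in this cell, `(v / (v + m)) * R + (m / (v + m)) * C`, shrinks each movie's rating `R` toward the global mean `C`, and the pull is stronger when the vote count `v` is small relative to the threshold `m`. As a sketch, the same computation can also be written on whole columns instead of row by row with `apply`. The function name `weighted_rating_vectorized` and the numbers below are made up for illustration.

```python
import pandas as pd

def weighted_rating_vectorized(ratings, votes, m, C):
    """Column-wise form of the same shrinkage formula; NaN inputs give NaN."""
    votes = votes.astype(float)
    return (votes / (votes + m)) * ratings + (m / (votes + m)) * C

# Two hypothetical movies with the same raw rating but very different vote counts.
toy_ratings = pd.Series([8.0, 8.0])
toy_votes = pd.Series([10, 10000])
toy_wr = weighted_rating_vectorized(toy_ratings, toy_votes, m=1000, C=6.0).round(1)
# The 10-vote movie lands near C (6.0); the 10,000-vote movie stays close to 8.0 (7.8).
```

This is why rows with fewer votes in the output above can show a weighted rating slightly different from `avg_rating`, while heavily voted movies keep theirs.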


In [5]:
# feature 3: decade

df['decade'] = (df['release_year'] // 10) * 10
df.head()

# df['decade'].unique()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes,votes_weighted_rating,decade
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690,8.2,1990
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302,7.1,1990
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838,6.4,1990
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541,6.0,1990
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216,6.1,1990
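
Because `release_year` is a nullable `Int64` column, the floor division keeps missing years missing rather than raising an error. A small sketch with made-up years:

```python
import pandas as pd

# Hypothetical years, including one missing value.
toy_years = pd.Series([1995, 2003, 1989, pd.NA], dtype="Int64")

# Integer-divide by 10 to drop the last digit, then scale back up.
toy_decades = (toy_years // 10) * 10
```

The missing year produces a missing decade, so any record without a `release_year` will also lack a `decade`.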


In [6]:
# feature 4: check if movie is multi genre

df['is_multigenre'] = df['genres'].apply(lambda x: int(len(str(x).split(',')) > 1))
df.head()

# df['is_multigenre'].unique()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes,votes_weighted_rating,decade,is_multigenre
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690,8.2,1990,1
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302,7.1,1990,1
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838,6.4,1990,1
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541,6.0,1990,1
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216,6.1,1990,1
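
One detail of the lambda above: `str(x)` turns a missing value into the string `'None'` (or `'nan'`), which has no comma, so a movie without genres is counted as single-genre. A possible vectorized equivalent, shown here on made-up data, makes that choice explicit through `na=False`:

```python
import pandas as pd

# Hypothetical genre strings, including one missing value.
toy_genres = pd.Series(["Comedy,Drama", "Horror", None])

# A comma means at least two genres; missing values are treated as not multi-genre.
toy_is_multigenre = toy_genres.str.contains(",", regex=False, na=False).astype(int)
```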


In [7]:
# feature 5: top director

# count how many movies each director appears in
director_counts = Counter(d for sublist in df['directors'].dropna().str.split(',') for d in sublist)
# keep the 100 most frequent directors
top_directors = set([d for d, count in director_counts.most_common(100)])

# return the first top-100 director credited on the movie, or 'Other' if there is none
def extract_top_director(director_str):
    if pd.isna(director_str): return 'Other'
    for d in director_str.split(','):
        if d.strip() in top_directors:
            return d.strip()
    return 'Other'

df['top_director'] = df['directors'].apply(extract_top_director)

df.head()
# df['top_director'].unique()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes,votes_weighted_rating,decade,is_multigenre,top_director
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690,8.2,1990,1,Other
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302,7.1,1990,1,Other
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838,6.4,1990,1,Other
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541,6.0,1990,1,Other
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216,6.1,1990,1,Other


In [8]:
# feature 6: top star

# count how many movies each star appears in
star_counts = Counter(s for sublist in df['stars'].dropna().str.split(',') for s in sublist)
# keep the 100 most frequent stars
top_stars = set([s for s, count in star_counts.most_common(100)])

# return the first top-100 star credited on the movie, or 'Other' if there is none
def extract_top_star(star_str):
    if pd.isna(star_str): return 'Other'
    for s in star_str.split(','):
        if s.strip() in top_stars:
            return s.strip()
    return 'Other'

df['top_star'] = df['stars'].apply(extract_top_star)

df.head()
# df['top_star'].unique()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,stars,avg_rating,total_votes,votes_weighted_rating,decade,is_multigenre,top_director,top_star
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,"Tom Hanks,Tim Allen,Jim Varney,Don Rickles",8.2,1019690,8.2,1990,1,Other,Tom Hanks
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,"Bonnie Hunt,Robin Williams,Kirsten Dunst,Jonat...",7.1,362302,7.1,1990,1,Other,Other
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,"Jack Lemmon,Ann-Margret,Walter Matthau,Sophia ...",6.6,28838,6.4,1990,1,Other,Other
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,"Angela Bassett,Loretta Devine,Lela Rochon,Whit...",6.1,11541,6.0,1990,1,Other,Other
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,"Diane Keaton,Steve Martin,Martin Short,Kimberl...",6.1,40216,6.1,1990,1,Other,Other
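
The `top_director` and `top_star` cells follow the same pattern: count names across all movies, keep the most frequent ones, then label each movie with the first of its names that made the cut. A sketch of that shared logic as two helpers, on made-up data. The helper names `top_names` and `first_top_name` are hypothetical, and unlike the cells above this version strips whitespace before counting.

```python
from collections import Counter
import pandas as pd

def top_names(series, n):
    """Return the n most frequent names in a column of comma-separated strings."""
    counts = Counter(
        name.strip()
        for names in series.dropna().str.split(",")
        for name in names
    )
    return {name for name, _ in counts.most_common(n)}

def first_top_name(value, top, default="Other"):
    """Return the first name in value that belongs to top, else the default."""
    if pd.isna(value):
        return default
    for name in value.split(","):
        if name.strip() in top:
            return name.strip()
    return default

# Hypothetical director credits: A appears three times, B twice, C once.
toy_directors = pd.Series(["A,B", "A", "C", None, "B,A"])
toy_top = top_names(toy_directors, n=1)
toy_result = toy_directors.apply(first_top_name, top=toy_top)
```

With a small cutoff most movies fall into `'Other'`, which matches the sample rows above, where none of the five directors is among the 100 most frequent.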


In [9]:
# feature 7: tag_count and feature 8: most_common_tag from tags table

# Load tags separately
tags_df = pd.read_sql_query("SELECT movieid, tag FROM tags", conn)

# Tag count
tag_count_df = tags_df.groupby('movieId').size().reset_index(name='tag_count')
df = df.merge(tag_count_df, on='movieId', how='left')
df['tag_count'] = df['tag_count'].fillna(0).astype(int)

# Most common tag
tag_mode_df = tags_df.groupby('movieId')['tag'].agg(lambda x: x.mode().iloc[0] if not x.mode().empty else None).reset_index(name='most_common_tag')
df = df.merge(tag_mode_df, on='movieId', how='left')

print(df.head())

   movieId                   movie_name  imdb_rating  imdb_votes  \
0        1                    Toy Story          8.3     1002538   
1        2                      Jumanji          7.0      352469   
2        3             Grumpier Old Men          6.6       28491   
3        4            Waiting to Exhale          5.9       11399   
4        5  Father of the Bride Part II          6.0       39557   

   imdb_runtime  tmdb_vote_average  tmdb_votes  release_year  \
0            81                8.0     17152.0          1995   
1           104                7.2      9833.0          1995   
2           101                6.5       347.0          1995   
3           124                6.2       142.0          1995   
4           106                6.2       659.0          1995   

                       genres        directors  \
0  Comedy,Animation,Adventure    John Lasseter   
1     Comedy,Adventure,Family     Joe Johnston   
2              Comedy,Romance    Howard Deutch   
3     
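
The tag features can be checked on a toy example. Two behaviours are worth noting: movies without any tags get `tag_count` 0 but keep a missing `most_common_tag` after the left merge, and when several tags tie, `mode()` returns them sorted, so `.iloc[0]` picks the alphabetically first one. The frames below are made up for illustration.

```python
import pandas as pd

# Hypothetical tags: movie 1 has a clear winner, movie 2 has a tie, movie 3 has none.
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 1, 2, 2, 2, 2],
    "tag": ["pixar", "pixar", "toys", "b", "a", "a", "b"],
})
toy_movies = pd.DataFrame({"movieId": [1, 2, 3]})

# Number of tag rows per movie.
tag_count = toy_tags.groupby("movieId").size().rename("tag_count")
# Most frequent tag per movie; ties resolve to the first value in sorted order.
most_common = (
    toy_tags.groupby("movieId")["tag"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("most_common_tag")
)

tag_stats = pd.concat([tag_count, most_common], axis=1).reset_index()
toy_movies = toy_movies.merge(tag_stats, on="movieId", how="left")
toy_movies["tag_count"] = toy_movies["tag_count"].fillna(0).astype(int)
```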

In [10]:
# feature 9: star_director_pair (top star and top director joined by '_')
df['star_director_pair'] = df['top_star'] + '_' + df['top_director']

df.head()

Unnamed: 0,movieId,movie_name,imdb_rating,imdb_votes,imdb_runtime,tmdb_vote_average,tmdb_votes,release_year,genres,directors,...,avg_rating,total_votes,votes_weighted_rating,decade,is_multigenre,top_director,top_star,tag_count,most_common_tag,star_director_pair
0,1,Toy Story,8.3,1002538,81,8.0,17152.0,1995,"Comedy,Animation,Adventure",John Lasseter,...,8.2,1019690,8.2,1990,1,Other,Tom Hanks,1230,Pixar,Tom Hanks_Other
1,2,Jumanji,7.0,352469,104,7.2,9833.0,1995,"Comedy,Adventure,Family",Joe Johnston,...,7.1,362302,7.1,1990,1,Other,Other,573,Robin Williams,Other_Other
2,3,Grumpier Old Men,6.6,28491,101,6.5,347.0,1995,"Comedy,Romance",Howard Deutch,...,6.6,28838,6.4,1990,1,Other,Other,23,CLV,Other_Other
3,4,Waiting to Exhale,5.9,11399,124,6.2,142.0,1995,"Comedy,Drama,Romance",Forest Whitaker,...,6.1,11541,6.0,1990,1,Other,Other,12,chick flick,Other_Other
4,5,Father of the Bride Part II,6.0,39557,106,6.2,659.0,1995,"Comedy,Romance,Family",Charles Shyer,...,6.1,40216,6.1,1990,1,Other,Other,64,Steve Martin,Other_Other


In [11]:
# get missing values count and %
num_rows, num_cols = df.shape

print(num_rows, num_cols)

print("Missing values per column:")
missing_info = df.isnull().sum()
for column, count in missing_info.items():
    percent = (count / num_rows) * 100
    print(f"  {column}: {count} missing ({percent:.2f}%)")
print("\nColumn details:\n")

46173 21
Missing values per column:
  movieId: 0 missing (0.00%)
  movie_name: 0 missing (0.00%)
  imdb_rating: 0 missing (0.00%)
  imdb_votes: 0 missing (0.00%)
  imdb_runtime: 0 missing (0.00%)
  tmdb_vote_average: 0 missing (0.00%)
  tmdb_votes: 0 missing (0.00%)
  release_year: 5 missing (0.01%)
  genres: 0 missing (0.00%)
  directors: 0 missing (0.00%)
  stars: 0 missing (0.00%)
  avg_rating: 0 missing (0.00%)
  total_votes: 0 missing (0.00%)
  votes_weighted_rating: 0 missing (0.00%)
  decade: 5 missing (0.01%)
  is_multigenre: 0 missing (0.00%)
  top_director: 0 missing (0.00%)
  top_star: 0 missing (0.00%)
  tag_count: 0 missing (0.00%)
  most_common_tag: 15454 missing (33.47%)
  star_director_pair: 0 missing (0.00%)

Column details:
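
The same missing-value report can also be built as a DataFrame, which is easier to sort and filter than printed lines. A minimal sketch, where `missing_summary` is a hypothetical helper and the data is made up:

```python
import numpy as np
import pandas as pd

def missing_summary(frame):
    """Count and percentage of missing values per column, worst columns first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical frame: column b is half missing, c has one gap, a is complete.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1.0, np.nan, np.nan, 4.0],
    "c": ["x", None, "y", "z"],
})
toy_summary = missing_summary(toy)
```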

