# Feature Engineering

IMDb, short for Internet Movie Database, is a website owned by Amazon.com where users can look up details about movies and TV shows: plot summaries, user reviews, genre, director and cast are just some of the attributes stored on IMDb.

The goal of this project is to predict the rating of a given movie using trained Machine Learning models and to evaluate the quality of the predictions.
The chosen dataset was made available on the Kaggle platform and contains the top 100 popular movies of each year from 2003 to 2022.

Dataset: https://www.kaggle.com/datasets/georgescutelnicu/top-100-popular-movies-from-2003-to-2022-imdb 

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import cpi

## 1. Understanding the Data

In [2]:
dataset_raw = pd.read_csv('movies.csv')
dataset_raw.head(2)

Unnamed: 0,Title,Rating,Year,Month,Certificate,Runtime,Directors,Stars,Genre,Filming_location,Budget,Income,Country_of_origin
0,Avatar: The Way of Water,7.8,2022,December,PG-13,192,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...","Action, Adventure, Fantasy",New Zealand,"$350,000,000","$2,267,946,983",United States
1,Guillermo del Toro's Pinocchio,7.6,2022,December,PG,117,"Guillermo del Toro, Mark Gustafson","Ewan McGregor, David Bradley, Gregory Mann, Bu...","Animation, Drama, Family",USA,"$35,000,000","$108,967","United States, Mexico, France"


### 3.4. Budget & Income: Converting to USD (VEDA)

In [3]:
dataset_processed_finance = dataset_raw

# Checking how the finance data looks
budget_values = dataset_processed_finance['Budget'].unique()
income_values = dataset_processed_finance['Income'].unique()

#print(f'Budget values: {budget_values}')
#print(f'Income values: {income_values}')

# There are inconsistencies in how the values are written (extra
# spaces etc.), and some entries are the placeholder 'Unknown'.

# 2. Getting the number of movies with unknown budget or income
unknown_budget_currency = dataset_processed_finance['Budget'].value_counts()['Unknown']
unknown_income_currency = dataset_processed_finance['Income'].value_counts()['Unknown']

missing_budget_values = dataset_processed_finance['Budget'].isnull().sum()
missing_income_values = dataset_processed_finance['Income'].isnull().sum()

print(f'Number of movies with unknown budget: {unknown_budget_currency}')
print(f'Number of movies with empty budget: {missing_budget_values}')
print(f'Number of movies with unknown income: {unknown_income_currency}')
print(f'Number of movies with empty income: {missing_income_values}')


Number of movies with unknown budget: 304
Number of movies with empty budget: 0
Number of movies with unknown income: 145
Number of movies with empty income: 0
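
Indexing `value_counts()` with `['Unknown']` raises a `KeyError` when a column happens to contain no such entry. A minimal sketch of a count that falls back to 0 instead, on a small made-up frame (the values are illustrative, not taken from the dataset; `count_placeholder` is a hypothetical helper):

```python
import pandas as pd

# Illustrative values only; not taken from movies.csv
sample = pd.DataFrame({
    'Budget': ['$1,000', 'Unknown', '€2,000', 'Unknown'],
    'Income': ['$5,000', '$6,000', '$7,000', '$8,000'],
})

def count_placeholder(series, placeholder='Unknown'):
    """Count entries equal to the placeholder; returns 0 if there are none."""
    return int((series == placeholder).sum())

unknown_budget = count_placeholder(sample['Budget'])
unknown_income = count_placeholder(sample['Income'])
```

The boolean comparison never fails on a missing label, so the same helper works for both columns regardless of their content.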


In [4]:
# 3. Dropping all movies without a budget
dataset_processed_finance = dataset_processed_finance.drop(dataset_processed_finance[dataset_processed_finance['Budget'].isin(['Unknown'])].index)

# 4. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns (Budget): {dataset_processed_finance.shape[1]}')
print(f'Current number of rows (Budget): {dataset_processed_finance.shape[0]}')


# 5. Dropping all movies without an income
dataset_processed_finance = dataset_processed_finance.drop(dataset_processed_finance[dataset_processed_finance['Income'].isin(['Unknown'])].index)

# 6. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns (Income): {dataset_processed_finance.shape[1]}')
print(f'Current number of rows (Income): {dataset_processed_finance.shape[0]}')


-----------------------------
Current number of columns (Budget): 13
Current number of rows (Budget): 1696
-----------------------------
Current number of columns (Income): 13
Current number of rows (Income): 1651


In [5]:
# Approximate exchange rates to USD, keyed by currency prefix
USD_RATES = {
    '$': 1.0,
    '€': 1.06,     # EUR
    '¥': 0.0075,   # JPY
    '₹': 0.012,    # INR
    'SEK': 0.094,
    'DKK': 0.14,
    '£': 1.21,     # GBP
}

# Conversion function (converts a currency string to a float amount in USD)
def convert_to_usd(amount):
    # str.replace returns a new string, so the result must be assigned back
    amount = amount.replace(' ', '').replace('\xa0', '')
    for prefix, rate in USD_RATES.items():
        if amount.startswith(prefix):
            # Drop the currency prefix and the thousands separators, then convert
            return float(amount[len(prefix):].replace(',', '')) * rate
    # Unrecognised currency: treated as missing
    return None
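
Two string pitfalls are easy to hit in a parser like this: `str.replace` returns a new string rather than modifying the original, and `str.strip('SEK')` removes any of the characters `S`, `E`, `K` from both ends rather than the literal prefix. A minimal sketch of prefix-based parsing without currency conversion; `parse_amount` is a hypothetical helper and the input strings are made up for illustration:

```python
def parse_amount(text, prefix):
    """Parse a string such as 'SEK 1,500,000' into a float (no conversion)."""
    # Strings are immutable, so the cleaned result has to be assigned
    cleaned = text.replace(' ', '').replace('\xa0', '')
    if not cleaned.startswith(prefix):
        return None
    # Slice off the literal prefix, then remove thousands separators
    digits = cleaned[len(prefix):].replace(',', '')
    return float(digits)

amount_sek = parse_amount('SEK\xa01,500,000', 'SEK')
amount_usd = parse_amount(' $350,000,000', '$')
not_matched = parse_amount('CA$1,000', '$')
```

A value whose prefix is not recognised comes back as `None`, which is what later shows up as a missing budget.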

In [6]:
# 7. Applying the conversion function
dataset_processed_finance['Budget'] = dataset_processed_finance['Budget'].apply(convert_to_usd)
dataset_processed_finance['Income'] = dataset_processed_finance['Income'].apply(convert_to_usd)

dataset_processed_finance.info()  # 9 missing values in Budget

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1651 entries, 0 to 1999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Title              1651 non-null   object 
 1   Rating             1651 non-null   float64
 2   Year               1651 non-null   int64  
 3   Month              1651 non-null   object 
 4   Certificate        1646 non-null   object 
 5   Runtime            1651 non-null   object 
 6   Directors          1651 non-null   object 
 7   Stars              1651 non-null   object 
 8   Genre              1651 non-null   object 
 9   Filming_location   1651 non-null   object 
 10  Budget             1642 non-null   float64
 11  Income             1651 non-null   float64
 12  Country_of_origin  1651 non-null   object 
dtypes: float64(3), int64(1), object(9)
memory usage: 180.6+ KB


In [7]:
# 8. Removing 9 titles with empty budget values 
dataset_processed_finance = dataset_processed_finance.dropna(axis=0, subset=['Budget'])

# 9. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns: {dataset_processed_finance.shape[1]}')
print(f'Current number of rows: {dataset_processed_finance.shape[0]}')


-----------------------------
Current number of columns: 13
Current number of rows: 1642


In [8]:
dataset_processed_1 = dataset_processed_finance

In [9]:
dataset_processed_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1642 entries, 0 to 1999
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Title              1642 non-null   object 
 1   Rating             1642 non-null   float64
 2   Year               1642 non-null   int64  
 3   Month              1642 non-null   object 
 4   Certificate        1637 non-null   object 
 5   Runtime            1642 non-null   object 
 6   Directors          1642 non-null   object 
 7   Stars              1642 non-null   object 
 8   Genre              1642 non-null   object 
 9   Filming_location   1642 non-null   object 
 10  Budget             1642 non-null   float64
 11  Income             1642 non-null   float64
 12  Country_of_origin  1642 non-null   object 
dtypes: float64(3), int64(1), object(9)
memory usage: 179.6+ KB


In [10]:
dataset_processed_1.head()

Unnamed: 0,Title,Rating,Year,Month,Certificate,Runtime,Directors,Stars,Genre,Filming_location,Budget,Income,Country_of_origin
0,Avatar: The Way of Water,7.8,2022,December,PG-13,192,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...","Action, Adventure, Fantasy",New Zealand,350000000.0,2267947000.0,United States
1,Guillermo del Toro's Pinocchio,7.6,2022,December,PG,117,"Guillermo del Toro, Mark Gustafson","Ewan McGregor, David Bradley, Gregory Mann, Bu...","Animation, Drama, Family",USA,35000000.0,108967.0,"United States, Mexico, France"
2,Bullet Train,7.3,2022,August,R,127,David Leitch,"Brad Pitt, Joey King, Aaron Taylor Johnson, Br...","Action, Comedy, Thriller",Japan,85900000.0,239268600.0,"Japan, United States"
4,M3gan,6.4,2022,December,PG-13,102,Gerard Johnstone,"Jenna Davis, Amie Donald, Allison Williams, Vi...","Horror, Sci-Fi, Thriller",New Zealand,12000000.0,171253900.0,United States
6,Amsterdam,6.1,2022,October,R,134,David O Russell,"Christian Bale, Margot Robbie, John David Wash...","Comedy, Drama, History",USA,80000000.0,31245810.0,"United States, Japan"


### Adjusting income & budget based on release year and inflation
Why? Otherwise new releases would have a bigger impact on the model than older ones.
Since inflation isn't linear, I found a library called cpi: you give it a year and an amount and get back how much that would be in 2022 dollars (I guess that's when the library was last updated, but that's not a big deal for our model).
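
The idea behind such an adjustment is to scale an amount by the ratio of the price index in the target year to the price index in the source year. A minimal sketch of that arithmetic; the index values below are hypothetical placeholders chosen for easy arithmetic, not real CPI figures, and `inflate_to` is an illustrative helper rather than the cpi library's API:

```python
# Hypothetical price index values, NOT real CPI data
PRICE_INDEX = {2003: 100.0, 2012: 125.0, 2022: 160.0}

def inflate_to(amount, from_year, to_year=2022, index=PRICE_INDEX):
    """Express `amount` from `from_year` in `to_year` money."""
    return amount * index[to_year] / index[from_year]

budget_from_2003 = inflate_to(1_000_000, 2003)
budget_from_2022 = inflate_to(1_000_000, 2022)
```

With these placeholder values an amount from 2003 grows by the factor 160/100, while an amount from the target year is returned unchanged.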

In [None]:
#loop through the rows in the dataset
for index, row in dataset_processed_1.iterrows():
    # calculate the budget with inflation included
    budget_inf = cpi.inflate(row['Budget'], row['Year'])
    # save the result in a new column named 'budget_inf'
    dataset_processed_1.at[index, 'budget_inf'] = budget_inf
    
    # calculate the income with inflation included
    income_inf = cpi.inflate(row['Income'], row['Year'])
    # save the result in a new column named 'income_inf'
    dataset_processed_1.at[index, 'income_inf'] = income_inf

In [None]:
dataset_processed_1.head(120)

In [None]:
#loop through the rows in the dataset
for index, row in dataset_processed_1.iterrows():
    # calculate the roi in percentage
    roi_inf = ((row['income_inf'] - row['budget_inf']) / row['budget_inf']) * 100
    # save the result in a new column named 'roi_inf'
    dataset_processed_1.at[index, 'roi_inf'] = roi_inf
    

In [None]:
dataset_processed_1.head(120)
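
The ROI computation is plain column arithmetic, so it can also be written without a row loop. A minimal sketch on a made-up frame (the values are illustrative); the column names mirror the ones used above:

```python
import pandas as pd

# Illustrative values only
frame = pd.DataFrame({
    'budget_inf': [100.0, 200.0],
    'income_inf': [250.0, 150.0],
})

# ROI in percent: (income - budget) / budget * 100, computed for all rows at once
frame['roi_inf'] = (frame['income_inf'] - frame['budget_inf']) / frame['budget_inf'] * 100
```

A movie that earns less than its budget gets a negative ROI, as in the second row.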

# One-Hot Encoding Country of origin

In [None]:
def one_hot_encoding_dummy(column, seperator, dataset, y_column):
    dataset_processed_genre = dataset
    # 1. Creating a list with all genres mentioned in the dataset
    genre_list = []
    for genres in dataset_processed_genre[column]:
        genre_list.append(genres.split(seperator))

    # 2. Creating a list with genre categories
    unique_g= set()
    for genres in genre_list:
        for genre in genres:
            unique_g.add(genre)

    unique_genres = list(unique_g)
    amount = len(unique_genres)
    #return ((amount), (unique_genres))

    # 3. Building a subtable with y_column as the first column followed by
    # one 0/1 column per category. Rows are collected in a list because
    # DataFrame.append was removed in pandas 2.0.
    rows = []
    for i, row in dataset_processed_genre.iterrows():
        new_row = {y_column: row[y_column]}
        for genre in unique_genres:
            new_row[genre] = 0
        # 4. For each movie, assign 1 to the categories it belongs to
        for genre in row[column].split(seperator):
            new_row[genre] = 1
        rows.append(new_row)
    genre_subtable = pd.DataFrame(rows, columns=[y_column] + unique_genres)

    # 5. Merging the subtable with the main dataset
    pivot_table = genre_subtable.melt(id_vars=y_column, var_name=column)
    pivot_table = pivot_table[pivot_table['value'] == 1]
    pivot_table = pivot_table.drop(columns=['value'])
    pivot_table = pivot_table.pivot_table(index=y_column, columns=column, aggfunc='size', fill_value=0).reset_index()

    # 6. Merge the pivot table with the original DataFrame
    dataset_processed_genre = pd.merge(dataset_processed_genre, pivot_table, on=y_column, how='left')
    dataset_processed_genre.drop(columns=[column], inplace=True)

    # 7. Returning the encoded dataset
    return dataset_processed_genre
    #print('-----------------------------')
    #print(f'Current number of columns: {dataset_processed_genre.shape[1]}')
    #print(f'Current number of rows: {dataset_processed_genre.shape[0]}')
    
    
new_dataset =  one_hot_encoding_dummy("Country_of_origin", ", ", dataset_processed_1, "Title" )
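
For a column holding several values in one delimited string, pandas also offers `Series.str.get_dummies`, which builds the indicator columns in one call. A minimal sketch on a made-up frame (titles are placeholders), offered as an alternative to the function above rather than a drop-in replacement:

```python
import pandas as pd

# Placeholder titles; the country strings follow the format seen in the dataset
movies = pd.DataFrame({
    'Title': ['A', 'B', 'C'],
    'Country_of_origin': ['United States', 'United States, Mexico', 'Japan, United States'],
})

# One 0/1 column per distinct country, split on ', '
dummies = movies['Country_of_origin'].str.get_dummies(sep=', ')

# Align by row index instead of merging on Title
encoded = pd.concat([movies.drop(columns=['Country_of_origin']), dummies], axis=1)
```

Because the indicator columns are aligned by row index, two movies sharing a title would not be conflated, which a merge on `Title` cannot guarantee.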

In [None]:
# new_dataset2.to_csv('C:/zlatte1/my_data2.csv', index=False)

# One-Hot Encoding Director & actors/actresses

In [None]:
def one_hot_encoding_bin(orginal_column, new_column_name, seperator, file_location):
    # 1. Only keeping the first name listed in the column
    dataset_processed_1[new_column_name] = dataset_processed_1[orginal_column].str.split(seperator).str[0]

    # 2. Dropping the original column
    dataset_processed_1.drop(columns=[orginal_column], inplace=True)

    # 3. Importing the reference list (expects a 'Name' column)
    dataset_directors = pd.read_csv(file_location)

    # 4. If the name is in the reference list, replace it with 1
    # else replace it with 0
    replace = dataset_processed_1[new_column_name].isin(dataset_directors['Name'])
    dataset_processed_1[new_column_name] = replace.astype(int)

    return dataset_processed_1
    

dataset_director = one_hot_encoding_bin("Directors", "Top_50_Director", ", ", "top_50_directors.csv")  

In [None]:
dataset_director.head(2)

In [None]:
#print(dataset_director['Certificate'].unique())

In [None]:
dataset_director_actors = one_hot_encoding_bin("Stars", "Star_top_1000", ", ", "top_1000_actors.csv") 

In [None]:
dataset_director_actors.head(2)

In [None]:
#dataset_director_actors.to_csv('C:/zlatte1/my_data3.csv', index=False)

In [None]:
def one_hot_encoding_bin_multi(orginal_column, new_column_name, seperator, file_location):
    # 1. Only keeping the first name listed in the column
    dataset_processed_1[new_column_name] = dataset_processed_1[orginal_column].str.split(seperator).str[0]

    # 2. Dropping the original column
    dataset_processed_1.drop(columns=[orginal_column], inplace=True)

    # 3. Importing top directors dataset
    dataset_directors = pd.read_csv(file_location)

    # 4. If Director is in dataset_directors, replace name with 1
    # else replace name with 0
    replace = dataset_processed_1[new_column_name].isin(dataset_directors['Name'])
    dataset_processed_1[new_column_name] = replace.astype(int)

    # 5. Return the updated table
    return dataset_processed_1
    #print(f'Number of top 50 directors in dataset: {(dataset_processed_1["Top_50_Director"] == 1).astype(int).sum()}')

dataset_director = one_hot_encoding_bin_multi("Directors", "Top_50_Director", ", ", "top_50_directors.csv")  
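
The binary flag built above can be tried out without the CSV file. A minimal sketch in which `top_names` is a hypothetical stand-in for the `Name` column of `top_50_directors.csv` (its real content is not shown here):

```python
import pandas as pd

movies = pd.DataFrame({
    'Directors': ['James Cameron', 'Guillermo del Toro, Mark Gustafson', 'David Leitch'],
})

# Hypothetical stand-in for the 'Name' column of top_50_directors.csv
top_names = pd.Series(['James Cameron', 'Guillermo del Toro'], name='Name')

# Keep only the first listed director, then flag membership as 1/0
first_director = movies['Directors'].str.split(', ').str[0]
movies['Top_50_Director'] = first_director.isin(top_names).astype(int)
```

Only the first listed director is checked, so a co-director who appears in the list but is named second would not set the flag.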

In [12]:
dataset_processed_1.head(2)

Unnamed: 0,Title,Rating,Year,Month,Certificate,Runtime,Directors,Stars,Genre,Filming_location,Budget,Income,Country_of_origin
0,Avatar: The Way of Water,7.8,2022,December,PG-13,192,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...","Action, Adventure, Fantasy",New Zealand,350000000.0,2267947000.0,United States
1,Guillermo del Toro's Pinocchio,7.6,2022,December,PG,117,"Guillermo del Toro, Mark Gustafson","Ewan McGregor, David Bradley, Gregory Mann, Bu...","Animation, Drama, Family",USA,35000000.0,108967.0,"United States, Mexico, France"


In [13]:
def one_hot_encoding_bin_multi(dataset, orginal_column, new_column_name, seperator, file_location):
    # 1. Only keeping the first name listed in the column
    dataset[new_column_name] = dataset[orginal_column].str.split(seperator).str[0]

    # 2. Dropping the original column
    dataset.drop(columns=[orginal_column], inplace=True)

    # 3. Importing the reference dataset (loaded but not used below yet)
    dataset_directors = pd.read_csv(file_location)

    # 4. Extracting the first run of digits from each value and binning it
    # into classes 1-4; rows without digits fall into class 4.
    # expand=False makes str.extract return a Series: the default DataFrame
    # caused "TypeError: arg must be a list, tuple, 1-d array, or Series"
    # in pd.to_numeric.
    new_column = dataset[new_column_name].str.extract(r'(\d+)', expand=False)
    replace = pd.to_numeric(new_column, errors='coerce')
    replace = pd.cut(replace, bins=[0, 100, 250, 1000, np.inf], labels=[1, 2, 3, 4]).fillna(4)
    dataset[new_column_name] = replace.astype(int)

    # 5. Return the updated table
    return dataset
    #print(f'Number of top 50 directors in dataset: {(dataset_processed_1["Top_50_Director"] == 1).astype(int).sum()}')

dataset_director = one_hot_encoding_bin_multi(dataset_processed_1, "Stars", "Star_top_1000", ", ", "top_1000_actors.csv")  
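
The bins used in the cell above suggest grouping stars into tiers by rank. A minimal sketch of that idea, assuming the reference file provides a name and a numeric rank per actor; the `Rank` column and the actor names are hypothetical, since the real columns of `top_1000_actors.csv` are not shown here:

```python
import numpy as np
import pandas as pd

# Hypothetical ranking table standing in for top_1000_actors.csv
ranking = pd.DataFrame({
    'Name': ['Actor A', 'Actor B', 'Actor C'],
    'Rank': [5, 150, 999],
})
stars = pd.Series(['Actor A', 'Actor B', 'Actor C', 'Actor D'])

# Look up each star's rank; names missing from the table become NaN
rank = stars.map(ranking.set_index('Name')['Rank'])

# Tier 1: rank 1-100, tier 2: 101-250, tier 3: 251-1000, tier 4: everything else
tier = pd.cut(rank, bins=[0, 100, 250, 1000, np.inf], labels=[1, 2, 3, 4])
tier = tier.fillna(4).astype(int)
```

Unranked stars end up in tier 4 through `fillna(4)`, the same fallback the cell above applies.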