# Feature Engineering

IMDb, acronym for Internet Movie Database, is a website owned by Amazon.com where users can look for details about movies and TV shows: plot summaries, users reviews, genre, director and cast are just some of the attributes stored on IMDb. 

The goal of this project is to predict the rating of a given movie using properly trained Machine Learning models and evaluate the goodness of the predictions.
The chosen dataset was made available on the Kaggle platform and it contains the top 100 movies between 2003 - 2022. 

Dataset: https://www.kaggle.com/datasets/georgescutelnicu/top-100-popular-movies-from-2003-to-2022-imdb 

In [213]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## 1. Understanding the Data

In [214]:
dataset_raw = pd.read_csv('movies.csv')
dataset_raw.head(2)

Unnamed: 0,Title,Rating,Year,Month,Certificate,Runtime,Directors,Stars,Genre,Filming_location,Budget,Income,Country_of_origin
0,Avatar: The Way of Water,7.8,2022,December,PG-13,192,James Cameron,"Sam Worthington, Zoe Saldana, Sigourney Weaver...","Action, Adventure, Fantasy",New Zealand,"$350,000,000","$2,267,946,983",United States
1,Guillermo del Toro's Pinocchio,7.6,2022,December,PG,117,"Guillermo del Toro, Mark Gustafson","Ewan McGregor, David Bradley, Gregory Mann, Bu...","Animation, Drama, Family",USA,"$35,000,000","$108,967","United States, Mexico, France"


### Dataset

The dataset consists of 13 different features (columns) with 2000 movies (rows). Summary of all the variables in the dataset. 

| Variable | Description | Type | Remove | Processing |
| --- | --- | --- | --- | --- |
| Title | Movie title | str |  |  |
| Rating | Movie rating | float |  |  |
| Year | Release year | int |  |  |
| Month | Release month | str |  | Categorical variable |
| Certificate |  Parental guidence categorisation | str | X |  |
| Runtime | Movie duration in minutes | int |  | Convert to int |
| Directors | Movie directors (can be more than one) | str |  | Replace by IMDb scoring/popularity score |
| Stars | Actors and Actresses (can be more than one) | str |  | Replace by IMDb scoring/popularity score |
| Genre | Type of movie (always more than one) | str |  | One-hot encoding |
| Filming_location | Where the movie was filmed | str | X |  |
| Budget | Amount of money spend on production (different currencies) | str |  | Common currency (USD) and convert to int |
| Income | Movie profit (different currencies) | str |  | Common currency (USD) and convert to int |
| Country_of_origin | Where the movie is produced | str | X |  |

## Data problems

The dataset requires some processing and feature engineering before POC can be proceeded as most of the data is str type. Following problems must be solved:
1. Some variables will affect the prediction less then others and given that they are str values it's easier to remove them than spend time on processing their values. (*2. Primary clean-up of the dataset*)
2. Variable **Month** can be turned into a categorical value by replacing the names of the months by their corresponding numbers (*3.1 Month*)
3. The **Runtime** will be simply converted to int instead of current str (*3.2 Runtime*)
4. Currently each value of the **Genre** variable contains multiple genres. These will be handled with one-hot encoding (*3.3 Genre*)
5. The **Budget** and **Income** variables are string values with varying currencies. Thess will be converted to USD and simultaneously to int (*3.4 Budget and Income*)
6. The **Directors** and **Stars** variables are similar to the genre, where multuple names are listed for each value. For the director only the first listed name will be kept. The stars column on the other hand will be split into 'Lead' (first name) and 'Supporting' (second name). Finally, for the now three columns - 'Director', 'Lead', and 'Supporting' - the names will be replaced by their rankings based on IMDb data (*4. Feature Engineering*)

## 2. Primary clean-up of the dataset

1. Removing the 3 columns (not so significant for the analysis)
2. Removing the movie that doesn't have a rating

In [215]:
# Removing unwanted columns
dataset_pruned = dataset_raw.drop(['Certificate', 'Filming_location', 'Country_of_origin'], axis = 1)

# Checking the number of rows
print('-----------------------------')
print(f'Current number of columns: {dataset_pruned.shape[1]}')
print(f'Current number of rows: {dataset_pruned.shape[0]}')

-----------------------------
Current number of columns: 10
Current number of rows: 2000


In [216]:
# Missing values
dataset_pruned.isna().any()

# Which doesn't have a rating
movie_wo_rating = dataset_pruned[dataset_pruned['Rating'].isna()]['Title']
print(f' Movie without rating: {movie_wo_rating}')

# Removing movie from dataset
delete_row = dataset_pruned[dataset_pruned['Title']=='A Man Called Otto'].index
dataset_pruned = dataset_pruned.drop(delete_row)

# Checking the number of rows
print('-----------------------------')
print(f'Current number of columns: {dataset_pruned.shape[1]}')
print(f'Current number of rows: {dataset_pruned.shape[0]}')


 Movie without rating: 85    A Man Called Otto
Name: Title, dtype: object
-----------------------------
Current number of columns: 10
Current number of rows: 1999


## 3. Variable Processing

1. Turning 'Month' into a categorical variable i.e. Jan = 1, Feb = 2 etc.
2. Converting 'Runtime' into type integer 
3. Turning 'Genre' into one-hot encoding 
4. Converting 'Budget' and 'Income' to USD currency

'Directors' and 'Stars' are processed in the next part 

### 3.1. Month: Converting to categorical variable 

In [217]:
# Checking how many unique values there are for column Month
month_values = dataset_pruned['Month'].unique()
odd_values_2008 = dataset_pruned['Month'].value_counts()['2008']
odd_values_2014 = dataset_pruned['Month'].value_counts()['2014']

print(f'Unique values: {month_values}')  # Here are some odd values
print(f'Number of times 2008 appears: {odd_values_2008}')
print(f'Number of times 2014 appears: {odd_values_2014}')

Unique values: ['December' 'August' 'November' 'October' 'March' 'September' 'May'
 'April' 'January' 'July' 'June' 'February' '2014' '2008']
Number of times 2008 appears: 1
Number of times 2014 appears: 1


In [218]:
# 1. Drop the 2008 and 2014 containing row
dataset_processed_month = dataset_pruned.drop(dataset_pruned[dataset_pruned['Month'].isin(['2008', '2014'])].index)

# 2. Maping names to integers
month_map = {
    'January': 1, 
    'February': 2, 
    'March': 3, 
    'April': 4, 
    'May': 5,
    'June': 6,
    'July': 7,
    'August': 8,
    'September': 9,
    'October': 10,
    'November': 11,
    'December': 12
}

# 3. Replacing the names (str) with integers
dataset_processed_month['Month'] = dataset_processed_month['Month'].replace(month_map)
print(dataset_processed_month['Month'])

# 4. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns: {dataset_processed_month.shape[1]}')
print(f'Current number of rows: {dataset_processed_month.shape[0]}')

0       12
1       12
2        8
3       11
4       12
        ..
1995     6
1996     7
1997    11
1998     2
1999     8
Name: Month, Length: 1997, dtype: int64
-----------------------------
Current number of columns: 10
Current number of rows: 1997


### 3.2. Runtime: Convering values into integer

In [219]:
# 1. Checking for non-numerical values
dataset_processed_month['Runtime'].unique()  # Movie with unknown duration

# 2. Getting number of movies with uknown runtime
unknown_runtime = dataset_processed_month['Runtime'].value_counts()['Unknown']
print(f'Number of movies with unknown runtime: {unknown_runtime}')

# 3. Removing these movies from the dataset
dataset_processed_runtime = dataset_processed_month.drop(dataset_processed_month[dataset_processed_month['Runtime'].isin(['Unknown'])].index)

# 4. Converting to int values
dataset_processed_runtime['Runtime'] = dataset_processed_runtime['Runtime'].astype(int)

# 5. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns: {dataset_processed_runtime.shape[1]}')
print(f'Current number of rows: {dataset_processed_runtime.shape[0]}')

Number of movies with unknown runtime: 1
-----------------------------
Current number of columns: 10
Current number of rows: 1996


### 3.3. Genre: One-Hot Encoding

In [220]:
dataset_processed_genre = dataset_processed_runtime

# 1. Creating a list with all genres mentioned in the dataset
genre_list = []
for genres in dataset_processed_genre['Genre']:
    genre_list.append(genres.split(', '))

# 2. Creating a list with genre categories
unique_g= set()
for genres in genre_list:
    for genre in genres:
        unique_g.add(genre)

unique_genres = list(unique_g)
print(f'''There are {len(unique_genres)}
        possible genres and they are: {unique_genres}''')

There are 20
        possible genres and they are: ['History', 'Western', 'Animation', 'Mystery', 'Action', 'Crime', 'War', 'Comedy', 'Drama', 'Horror', 'Thriller', 'Biography', 'Family', 'Romance', 'Adventure', 'Sci-Fi', 'Sport', 'Fantasy', 'Musical', 'Music']


In [221]:
# 3. Creating a subtable with title as the first column followed by one 
# column per genre
genre_subtable = pd.DataFrame(columns=['Title'] + unique_genres)

# 4. For each movie, assign 1 to those genres it has been assigned 
for i, row in dataset_processed_genre.iterrows():
    new_row = {'Title': row['Title']}
    for genre in unique_genres:
        new_row[genre] = 0
    for genre in row['Genre'].split(', '):
        new_row[genre] = 1
    genre_subtable = genre_subtable.append(new_row, ignore_index=True)

print(genre_subtable.head(2))

  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.

                            Title History Western Animation Mystery Action  \
0        Avatar: The Way of Water       0       0         0       0      1   
1  Guillermo del Toro's Pinocchio       0       0         1       0      0   

  Crime War Comedy Drama  ... Thriller Biography Family Romance Adventure  \
0     0   0      0     0  ...        0         0      0       0         1   
1     0   0      0     1  ...        0         0      1       0         0   

  Sci-Fi Sport Fantasy Musical Music  
0      0     0       1       0     0  
1      0     0       0       0     0  

[2 rows x 21 columns]


  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.append(new_row, ignore_index=True)
  genre_subtable = genre_subtable.

In [222]:
# 5. Merging the subtable with the main dataset
pivot_table = genre_subtable.melt(id_vars='Title', var_name='Genre')
pivot_table = pivot_table[pivot_table['value'] == 1]
pivot_table = pivot_table.drop(columns=['value'])
pivot_table = pivot_table.pivot_table(index='Title', columns='Genre', aggfunc='size', fill_value=0).reset_index()

# 7. Merge the pivot table with the original DataFrame
dataset_processed_genre = pd.merge(dataset_processed_genre, pivot_table, on='Title', how='left')
dataset_processed_genre.drop(columns=['Genre'], inplace=True)

# 8. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns: {dataset_processed_genre.shape[1]}')
print(f'Current number of rows: {dataset_processed_genre.shape[0]}')

-----------------------------
Current number of columns: 29
Current number of rows: 1996


### 3.4. Budget & Income: Turning into USD currency

In [223]:
dataset_processed_finance = dataset_processed_genre

# Checking for how the finance data looks
budget_values = dataset_processed_finance['Budget'].unique()
income_values = dataset_processed_finance['Income'].unique()

#print(f'Budget values: {budget_values}')
#print(f'Income values: {income_values}')

'''There are unconsistencies in how the values are writen, extra 
# spaces etc. Also presence of 'Unknown'''

# 2. Getting number of movies with uknown runtime
unknown_budget_currency = dataset_processed_finance['Budget'].value_counts()['Unknown']
unknown_income_currency = dataset_processed_finance['Income'].value_counts()['Unknown']

missing_budget_values = dataset_processed_finance['Budget'].isnull().sum()
missing_income_values = dataset_processed_finance['Income'].isnull().sum()

print(f'Number of movies with unknown budget: {unknown_budget_currency}')
print(f'Number of movies with empty budget: {missing_budget_values}')
print(f'Number of movies with unknown income: {unknown_income_currency}')
print(f'Number of movies with empty budget: {missing_income_values}')


Number of movies with unknown budget: 303
Number of movies with empty budget: 0
Number of movies with unknown income: 142
Number of movies with empty budget: 0


In [224]:
# 3. Dropping all movies without a budget
dataset_processed_finance = dataset_processed_finance.drop(dataset_processed_finance[dataset_processed_finance['Budget'].isin(['Unknown'])].index)

# 4. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns (Budget): {dataset_processed_finance.shape[1]}')
print(f'Current number of rows (Budget): {dataset_processed_finance.shape[0]}')


# 5. Dropping all movies without a income
dataset_processed_finance = dataset_processed_finance.drop(dataset_processed_finance[dataset_processed_finance['Income'].isin(['Unknown'])].index)

# 6. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns (Income): {dataset_processed_finance.shape[1]}')
print(f'Current number of rows (Income): {dataset_processed_finance.shape[0]}')


-----------------------------
Current number of columns (Budget): 29
Current number of rows (Budget): 1693
-----------------------------
Current number of columns (Income): 29
Current number of rows (Income): 1650


In [225]:
# Conversion function (converts currency and to float)
def convert_to_usd(amount):
    amount.replace(' ', '')
    amount.replace('\xa0', '')
    if amount.startswith('$'):
        amount = amount.strip('$').replace(',', '')   # must remove commas
        return float(amount)   # convert str into float
    elif amount.startswith('€'):
        # Exchange rate for EUR to USD
        amount = amount.strip('€').replace(',', '')
        return float(amount) * 1.06 
    elif amount.startswith('¥'):
        # Exchange rate for YEN to USD
        amount = amount.strip('¥').replace(',', '')
        return float(amount) * 0.0075
    elif amount.startswith('₹'):
        # Exchange rate for RPL to USD
        amount = amount.strip('₹').replace(',', '')
        return float(amount) * 0.012 
    elif amount.startswith('SEK'):
        # Exchange rate for SEK to USD
        amount = amount.strip('SEK').replace(',', '')
        return float(amount) * 0.094
    elif amount.startswith('DKK'):
        # Exchange rate for RPL to USD
        amount = amount.strip('DKK').replace(',', '')
        return float(amount) * 0.14
    elif amount.startswith('£'):
        # Exchange rate for RPL to USD
        amount = amount.strip('£').replace(',', '')
        return float(amount) * 1.21  
    else:
        return None

In [226]:
# 7. Applying the conversion function
dataset_processed_finance['Budget'] = dataset_processed_finance['Budget'].apply(convert_to_usd)
dataset_processed_finance['Income'] = dataset_processed_finance['Income'].apply(convert_to_usd)

dataset_processed_finance.info()  # 9 missing values in Budget

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1650 entries, 0 to 1995
Data columns (total 29 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Title      1650 non-null   object 
 1   Rating     1650 non-null   float64
 2   Year       1650 non-null   int64  
 3   Month      1650 non-null   int64  
 4   Runtime    1650 non-null   int64  
 5   Directors  1650 non-null   object 
 6   Stars      1650 non-null   object 
 7   Budget     1641 non-null   float64
 8   Income     1650 non-null   float64
 9   Action     1650 non-null   int64  
 10  Adventure  1650 non-null   int64  
 11  Animation  1650 non-null   int64  
 12  Biography  1650 non-null   int64  
 13  Comedy     1650 non-null   int64  
 14  Crime      1650 non-null   int64  
 15  Drama      1650 non-null   int64  
 16  Family     1650 non-null   int64  
 17  Fantasy    1650 non-null   int64  
 18  History    1650 non-null   int64  
 19  Horror     1650 non-null   int64  
 20  Music   

In [227]:
# 8. Removing 9 titles with empty budget values 
dataset_processed_finance = dataset_processed_finance.dropna(axis=0, subset=['Budget'])

# 9. Checking the dataset shape
print('-----------------------------')
print(f'Current number of columns: {dataset_processed_finance.shape[1]}')
print(f'Current number of rows: {dataset_processed_finance.shape[0]}')


-----------------------------
Current number of columns: 29
Current number of rows: 1641


## 4. Feature Engineering: 'Directors' and 'Stars'

This part needs more work and additional database. I think these columns will have an effect on the scoring of the movie, but they are strings right now. I think we should:
1. Directors column should only contain the first director and ignore all other
2. Stars column should be split into 'Lead' and 'Supporting' where the first actor is added to the lead and the second to the supporting
3. Replace the directors and actors with their imdb score of how good they are or some popularity score

### 4.1. Directors

1. Only one director/film, only the first director will be kept
2. Replace director with top 50 direcor = 1 or not top 50 director = 0

#### One director per movie

In [228]:
dataset_processed_1 = dataset_processed_finance

# 1. Only keeping the first director
dataset_processed_1['Top_50_Director'] = dataset_processed_1['Directors'].str.split(',').str[0]

#### Replacing the director's name with ranking

In [229]:
# 1. Importing top directors dataset
dataset_directors = pd.read_csv('top_50_directors.csv')

# 2. If Director is in dataset_directors, replace name with 1
# else replace name with 0
replace = dataset_processed_1['Director_Top'].isin(dataset_directors['Name'])
dataset_processed_1['Director_Top'] = replace.astype(int)


KeyError: 'Director_Top'

In [None]:
# 3. Check the updated table
dataset_processed_1.head(2)
print(f'Number of top 50 directors in dataset: {(dataset_processed_1["Director_Top"] == 1).astype(int).sum()}')

Number of top 50 directors in dataset: 179


### 4.2. Actors

1. Splitting 'Stars' column into two; (1) 'Lead' (the first value of original column) and (2) 'Supporting' (the second value of the original column)
2. Replace the name in these two columns by their rating

#### Splitting Stars into Lead and Supporting

In [None]:
# 1. Creating a new column 'Lead' with the first value of 'Stars'
dataset_processed_1['Lead'] = dataset_processed_1['Stars'].fillna('').str.split(',', expand=True)[0]

# 2. Creating a new column 'Supporting' with the second value of 'Stars'
dataset_processed_1['Supporting'] = dataset_processed_1['Stars'].fillna('').str.split(',', expand=True)[1]

# 3. Dropping the original 'Stars' column
dataset_processed_1.drop(columns=['Stars'], inplace=True)

#### Replacing name with rating

In [None]:
# 1. Importing top a list of 1000 top actors and actresses
dataset_top_actors = pd.read_csv('top_1000_actors.csv', usecols=['Position', 'Name'])

# 2. If star is in dataset_directors, replace name with 1
# else replace name with 0
replace_lead = dataset_processed_1['Lead'].isin(dataset_top_actors['Name'])
dataset_processed_1['Lead_Top'] = replace_lead.astype(int)

replace_support = dataset_processed_1['Supporting'].isin(dataset_top_actors['Name'])
dataset_processed_1['Supporting_Top'] = replace_support.astype(int)

In [None]:
# 3. Check the updated table
dataset_processed_1.head(2)
print(f'Number of top 1000 actors in dataset: {(dataset_processed_1["Lead_Top"] == 0).astype(int).sum()}')

Number of top 1000 actors in dataset: 480


In [None]:
dataset_processed_1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1641 entries, 0 to 1995
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1641 non-null   object 
 1   Rating           1641 non-null   float64
 2   Year             1641 non-null   int64  
 3   Month            1641 non-null   int64  
 4   Runtime          1641 non-null   int64  
 5   Directors        1641 non-null   object 
 6   Budget           1641 non-null   float64
 7   Income           1641 non-null   float64
 8   Action           1641 non-null   int64  
 9   Adventure        1641 non-null   int64  
 10  Animation        1641 non-null   int64  
 11  Biography        1641 non-null   int64  
 12  Comedy           1641 non-null   int64  
 13  Crime            1641 non-null   int64  
 14  Drama            1641 non-null   int64  
 15  Family           1641 non-null   int64  
 16  Fantasy          1641 non-null   int64  
 17  History       

In [None]:
dataset_processed_1['Directors'].value_counts()

Ridley Scott        13
Steven Spielberg    12
Shawn Levy          10
Clint Eastwood      10
Michael Bay          9
                    ..
Michael Engler       1
May el Toukhy        1
David Yarovesky      1
Josh Cooley          1
Chris Kentis         1
Name: Directors, Length: 845, dtype: int64

## 5. Multiple rating of the same movie

Another problem is when a movie appears more than once in the list. Therefore, only the top rating for the movie will be kept and all others removed from the dataset.

In [None]:
# 1. Keeping only the highest score for each name
dataset_final = dataset_processed_1.sort_values(by=['Rating'], ascending=False)
dataset_final = dataset_processed_1.drop_duplicates(subset=['Title'], keep='first')

In [None]:
dataset_final.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1633 entries, 0 to 1995
Data columns (total 33 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Title            1633 non-null   object 
 1   Rating           1633 non-null   float64
 2   Year             1633 non-null   int64  
 3   Month            1633 non-null   int64  
 4   Runtime          1633 non-null   int64  
 5   Directors        1633 non-null   object 
 6   Budget           1633 non-null   float64
 7   Income           1633 non-null   float64
 8   Action           1633 non-null   int64  
 9   Adventure        1633 non-null   int64  
 10  Animation        1633 non-null   int64  
 11  Biography        1633 non-null   int64  
 12  Comedy           1633 non-null   int64  
 13  Crime            1633 non-null   int64  
 14  Drama            1633 non-null   int64  
 15  Family           1633 non-null   int64  
 16  Fantasy          1633 non-null   int64  
 17  History       

## 6. Summary of the Dataprocessing and Feature Engineering

### 6.1. The new Dataset 

| Variable | Processing description | Type |
| --- | --- | --- |
| Title | Movie title | str |
| Rating | Movie rating | float |
| Year | Release year | int |
| Month | Release month | int | 
| Runtime | Movie duration in minutes | int |  | Convert to int |
| Genre | 20 columns with differnt genres with One-Hot Encoding (1-0) | int |
| Budget | Set to USD | int |
| Income | Set to USD | int |
| Director | Director name | str |
| Director_Top | If director is a top 50 director or not (1-0) | int |
| Lead | Lead star name | str |
| Lead_Top | Name replaced by actor's rating (1-1001, 1001 given to those that are not in the list) | int |
| Supporting | Supporting star name | str |
| Supporting_Top | Name replaced by actor's rating (1-1001, 1001 given to those that are not in the list)| int |

In [None]:
def create_table(dataframes, step):
    num_dataframes = len(dataframes)
    num_rows = []
    num_cols = []
    for df in dataframes:
        num_rows.append(df.shape[0])
        num_cols.append(df.shape[1])
    indices = range(0, num_dataframes, step)
    df_dict = {'Rows': num_rows, 'Columns': num_cols}
    table = pd.DataFrame(df_dict, index=range(1, num_dataframes+1))
    table.index.name = 'Processing step'
    table = table.iloc[indices, :]
    return table

### 6.2. Data loss due to processing

In [None]:
df_list = [
    dataset_raw, 
    dataset_pruned, 
    dataset_processed_month,
    dataset_processed_runtime,
    dataset_processed_genre,
    dataset_processed_finance,
    dataset_processed_1,
    dataset_final]
step = 1
create_table(df_list, step)

Unnamed: 0_level_0,Rows,Columns
Processing step,Unnamed: 1_level_1,Unnamed: 2_level_1
1,2000,13
2,1999,10
3,1997,10
4,1996,10
5,1996,29
6,1641,33
7,1641,33
8,1633,33


## 7. Exporting the new dataset

In [None]:
dataset_processed_1.to_csv('new_dataset.csv', index=False)