# Data Cleaning
The cleaning process is organized into sections, each corresponding to a different dataset (`.csv` file). Each section includes the following steps:

1. **Data Understanding**: Initial exploration of the dataset.
2. **Data Cleaning**: Handling of missing values (NaN), removal of duplicates, validating foreign keys to identify and manage invalid references, setting correct data types, and renaming columns. <br>
   *(Optional)* **Deep Clean**: Custom cleaning steps applied to a specific dataset, if necessary.
3. **Final Result**: Displays the cleaned dataset and saves it to a new `.csv` file.

All the datasets combined have a total size of ~1Gb and can be all uploaded in memory at the same time on almost every PC.

First, import the necessary libraries and set up any required options.

In [151]:
import pandas as pd
import numpy as np

from utils.utils import find_matching, summarize_nulls

# Set to True to print cleaned data into new csv
PRINT_CSV = False

## Movies

In [3]:
# Import 'movies.csv' dataset
movies_df = pd.read_csv('datasets/movies.csv')

### 1. Data Understanding

In [4]:
movies_df.head()

Unnamed: 0,id,name,date,tagline,description,minute,rating
0,1000001,Barbie,2023.0,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114.0,3.86
1,1000002,Parasite,2019.0,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133.0,4.56
2,1000003,Everything Everywhere All at Once,2022.0,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140.0,4.3
3,1000004,Fight Club,1999.0,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139.0,4.27
4,1000005,La La Land,2016.0,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129.0,4.09


In [5]:
movies_df.shape

(941597, 7)

In [6]:
movies_df.dtypes

id               int64
name            object
date           float64
tagline         object
description     object
minute         float64
rating         float64
dtype: object

### 2. Data Cleaning

In [7]:
# Rename columns
movies_df = movies_df.rename(columns={'name': 'title', 'minute': 'runtime', 'date': 'release_year'})
print(f"Movies dataset columns: {', '.join(movies_df.columns)}")

Movies dataset columns: id, title, release_year, tagline, description, runtime, rating


In [8]:
# Check for null values
summarize_nulls(movies_df)

Unnamed: 0,Null Count,Null Percentage
id,0,0.0%
title,10,0.0011%
release_year,91913,9.7614%
tagline,802210,85.1967%
description,160812,17.0786%
runtime,181570,19.2832%
rating,850598,90.3357%


There are null values in most of the columns.
The fields '**release_year**', '**tagline**', '**description**', '**runtime**' and '**rating**' don't cause any problems so we'll keep them, but the few movies that are without a title can't be used and will be removed.

In [9]:
# Removing rows with null title
no_title = movies_df[movies_df['title'].isna()]
movies_df = movies_df.dropna(subset=['title'])

print("Movies dataset without title:")
no_title.head()

Movies dataset without title:


Unnamed: 0,id,title,release_year,tagline,description,runtime,rating
287514,1287515,,2015.0,,NONE is a short film that explores the balance...,4.0,
617642,1617643,,,,,,
646520,1646521,,2008.0,,,,
648185,1648186,,,,,,
720294,1720295,,,,"In this directorial debut of Eden Ewardson, he...",8.0,


In [10]:
# Check for duplicate rows
print(f"There are {movies_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [11]:
# Setting the correct type for columns
movies_df['release_year'] = movies_df['release_year'].astype('Int64')
movies_df['runtime'] = movies_df['runtime'].astype('Int64')

In [12]:
# Check if 'id' column has unique values
print(f"'id' duplicates: {movies_df[movies_df['id'].duplicated()].shape[0]}")
movies_df = movies_df.set_index("id")

'id' duplicates: 0


The '**id**' field is the unique identifier of a movie, so it's been set as the index.

#### Deep Clean
Let's look inside some columns to see most frequent values:

In [13]:
movies_df['description'].value_counts().head(10)

description
Mexican feature film    893
Plot Unavailable.       504
Hong Kong movie         449
to be added later       304
Chinese movie           203
plot is unknown         182
Short film.             180
A short film.           168
1962 Japanese movie     162
Documentary film.       151
Name: count, dtype: int64

Many descriptions seem to have a description like "Plot Unavailable" or similar instead of a null value. The other fields seem fine.<br>
Let's try to fix as many as possible (fixing only the most frequent variation, not 100% accurate).

In [14]:
from utils.utils import null_movie_description_keywords

# Find null description variation
matches = find_matching(movies_df, 'description', null_movie_description_keywords, max_length=30)

# Fill with NaN values the result obtained
movies_df.loc[matches.index, ['description']] = np.nan

# Manual check to be sure to not overwrite real descriptions
matches['description'].value_counts().head(15)

description
Plot Unavailable.         504
plot is unknown           182
Plot Unknown              103
Plot Unknown.              45
Plot unknown.              35
Plot unknown               26
Plot details unknown.      16
Synopsis unknown.          11
Plot is unknown             9
plot is unknown.            5
plot unknown                5
plot currently unknown      2
Plot is Unknown             2
Overview unknown.           2
Plot unavailable            2
Name: count, dtype: int64

### 3. Final Result
All datasets reference the **movies** dataset. A movie is uniquely identified by his **id** and a movie id has multiple occurrences in other datasets. A movie has a title, a tagline, a description, the release year, the duration and a rating. Only the title is mandatory and all the other attributes could be missing.

In [15]:
movies_df.head()

Unnamed: 0_level_0,title,release_year,tagline,description,runtime,rating
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
1000001,Barbie,2023,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114,3.86
1000002,Parasite,2019,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133,4.56
1000003,Everything Everywhere All at Once,2022,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140,4.3
1000004,Fight Club,1999,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139,4.27
1000005,La La Land,2016,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129,4.09


In [16]:
movies_df.shape

(941587, 6)

## Languages

In [27]:
# Import 'languages.csv' dataset
lang_df = pd.read_csv('datasets/languages.csv')

### 1. Data Understanding

In [28]:
lang_df.head()

Unnamed: 0,id,type,language
0,1000001,Language,English
1,1000002,Primary language,Korean
2,1000002,Spoken language,English
3,1000002,Spoken language,German
4,1000002,Spoken language,Korean


In [29]:
lang_df.shape

(1038762, 3)

In [30]:
lang_df.dtypes

id           int64
type        object
language    object
dtype: object

### 2. Data Cleaning

In [31]:
# Rename columns
lang_df = lang_df.rename(columns={'id': 'movie_id'})
print(f"Languages dataset columns: {', '.join(lang_df.columns)}")

Languages dataset columns: movie_id, type, language


In [32]:
# Check for null values
summarize_nulls(lang_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
type,0,0.0%
language,0,0.0%


In [33]:
# Check for duplicate rows
print(f"There are {lang_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [34]:
# Identifying invalid foreign keys
invalid_values = lang_df[~lang_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
lang_df = lang_df.drop(invalid_values.index)

invalid_values.head()

There were 9 rows with invalid foreign keys


Unnamed: 0,movie_id,type,language
703424,1617643,Language,English
733826,1646521,Language,English
735509,1648186,Language,English
810990,1720295,Language,Burmese
816243,1725370,Language,English


In [35]:
# Setting the category data type for column 'type'
print(f"types: {', '.join(lang_df['type'].unique())}")
lang_df['type'] = lang_df['type'].astype('category')

types: Language, Primary language, Spoken language


The `type` field has only 3 possible values, so we can set it as a categorical type.

### 3. Final Result
The languages dataset is directly linked to the movies dataset through the `movie_id` column. There are more rows in the languages dataset than in the movies dataset because a movie can be associated with multiple languages. Additionally, not all movies have a language defined. <br>
A language associated with a movie can fall into one or more of the following categories:
- *Language*: Refers to a generic language associated with the movie, typically used when there is a single dominant language.
- *Primary Language*: The main or original language of the movie.
- *Spoken Language*: All the languages actually used in the movie's dialogues.

In [36]:
lang_df.head()

Unnamed: 0,movie_id,type,language
0,1000001,Language,English
1,1000002,Primary language,Korean
2,1000002,Spoken language,English
3,1000002,Spoken language,German
4,1000002,Spoken language,Korean


In [37]:
lang_df.shape

(1038753, 3)

## Actors

In [38]:
# Import 'actors.csv' dataset
actors_df = pd.read_csv('datasets/actors.csv')

### 1. Data Understanding

In [39]:
actors_df.head()

Unnamed: 0,id,name,role
0,1000001,Margot Robbie,Barbie
1,1000001,Ryan Gosling,Ken
2,1000001,America Ferrera,Gloria
3,1000001,Ariana Greenblatt,Sasha
4,1000001,Issa Rae,Barbie


In [40]:
actors_df.shape

(5798450, 3)

In [41]:
actors_df.dtypes

id       int64
name    object
role    object
dtype: object

### 2. Data Cleaning

In [42]:
# Rename columns
actors_df = actors_df.rename(columns={'id': 'movie_id'})
print(f"Actors dataset columns: {', '.join(actors_df.columns)}")

Actors dataset columns: movie_id, name, role


In [43]:
# Check for null values
summarize_nulls(actors_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
name,4,0.0001%
role,1361559,23.4814%


Despite the high number of null values in the `role` field, the rows will be maintained because they still contain information about the actor's `name`. However, an actor without a name is unusable, so the corresponding rows will be removed.

In [44]:
# Removing actors without name
no_name = actors_df[actors_df['name'].isna()]
actors_df = actors_df.dropna(subset=['name'])
no_name

Unnamed: 0,movie_id,name,role
4145738,1443629,,
4281100,1469981,,Self
4306960,1474958,,Cinematography
5430275,1773264,,


In [45]:
# Check for duplicate rows
print(f"There are {actors_df.duplicated().sum()} duplicated rows")
actors_duplicates = actors_df[actors_df.duplicated(keep=False)].head(6)

# Dropping the duplicates
actors_df = actors_df.drop_duplicates()

actors_duplicates

There are 946 duplicated rows


Unnamed: 0,movie_id,name,role
3967,1000062,Rosie Jones,Lady of the Boot of Jemiah
3993,1000062,Rosie Jones,Lady of the Boot of Jemiah
44615,1000797,Karel Heřmánek,Czech Injured Man
44642,1000797,Karel Heřmánek,Czech Injured Man
47806,1000863,Michael Fennimore,Car Salesman
47807,1000863,Michael Fennimore,Car Salesman


Completely duplicated rows are clearly errors and can be removed.

In [46]:
# Identifying invalid foreign keys
invalid_values = actors_df[~actors_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
actors_df = actors_df.drop(invalid_values.index)

invalid_values.head()

There were 1 rows with invalid foreign keys


Unnamed: 0,movie_id,name,role
5027860,1646521,Catherine R,Self


#### Deep Clean

In [47]:
actors_df['role'].value_counts().head(10)

role
Self                      188465
Himself                    74797
Herself                    22678
Self (archive footage)     21851
Narrator                   12276
(uncredited)               10816
(voice)                     8435
Narrator (voice)            7569
Dancer                      6874
Doctor                      5742
Name: count, dtype: int64

The role column contains many variations of the 'Self' role. Let's examine this more closely.

In [48]:
from utils.utils import self_actor_role_keywords

# Find self variation
matches = find_matching(actors_df, 'role', self_actor_role_keywords)
print(f"Rows contains 'self' variations: {matches['role'].shape[0]}")
matches['role'].value_counts().head()

Rows contains 'self' variations: 375240


role
Self                      188465
Himself                    74797
Herself                    22678
Self (archive footage)     21851
himself                     5273
Name: count, dtype: int64

There are over 300,000 values similar to 'Self', but many of them also contain additional information, such as 'Self - Presenter' or 'Self - Guest'. Overwriting all these values could lead to a loss of information, so they won't be overwritten in the cleaned dataset. However, they may be modified when visualizing the data for statistical purposes.

In [49]:
# Reset indexing after removing rows
actors_df = actors_df.reset_index(drop=True)

### 3. Final Result
The actors dataset is directly linked to the movies dataset through the `movie_id` column and contains nearly six times the number of rows as the movies dataset. Additionally, a movie may have no actors associated with it. <br>
The same actor can appear multiple times in the dataset if they feature in more than one movie. <br>
An actor is identified solely by their full name, stored in a single field.


In [50]:
actors_df.head()

Unnamed: 0,movie_id,name,role
0,1000001,Margot Robbie,Barbie
1,1000001,Ryan Gosling,Ken
2,1000001,America Ferrera,Gloria
3,1000001,Ariana Greenblatt,Sasha
4,1000001,Issa Rae,Barbie


In [51]:
actors_df.shape

(5797499, 3)

## Countries

In [52]:
# Import 'countries.csv' dataset
countries_df = pd.read_csv('datasets/countries.csv')

### 1. Data Understanding

In [53]:
countries_df.head()

Unnamed: 0,id,country
0,1000001,UK
1,1000001,USA
2,1000002,South Korea
3,1000003,USA
4,1000004,Germany


In [54]:
countries_df.shape

(693476, 2)

In [55]:
countries_df.dtypes

id          int64
country    object
dtype: object

### 2. Data Cleaning

In [56]:
# Rename columns
countries_df = countries_df.rename(columns={'id': 'movie_id'})
print(f"Countries dataset columns: {', '.join(countries_df.columns)}")

Countries dataset columns: movie_id, country


In [57]:
# Check for null values
summarize_nulls(countries_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
country,0,0.0%


In [58]:
# Identifying invalid foreign keys
invalid_values = countries_df[~countries_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
countries_df = countries_df.drop(invalid_values.index)

invalid_values.head()

There were 1 rows with invalid foreign keys


Unnamed: 0,movie_id,country
544278,1646521,USA


In [59]:
# Check for duplicate rows
print(f"There are {countries_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


### 3. Final Results
The **countries** dataset is directly connected to the movies dataset through the 'movie_id' column. This dataset contains all the countries where the movies were produced.


In [60]:
countries_df.head()

Unnamed: 0,movie_id,country
0,1000001,UK
1,1000001,USA
2,1000002,South Korea
3,1000003,USA
4,1000004,Germany


In [61]:
countries_df.shape

(693475, 2)

## Crew

In [62]:
# Import 'crew.csv' dataset
crew_df = pd.read_csv('datasets/crew.csv')

### 1. Data Understanding

In [63]:
crew_df.head()

Unnamed: 0,id,role,name
0,1000001,Director,Greta Gerwig
1,1000001,Producer,Tom Ackerley
2,1000001,Producer,Margot Robbie
3,1000001,Producer,Robbie Brenner
4,1000001,Producer,David Heyman


In [64]:
crew_df.shape

(4720183, 3)

In [65]:
crew_df.dtypes

id       int64
role    object
name    object
dtype: object

### 2. Data Cleaning

In [66]:
# Rename columns
crew_df = crew_df.rename(columns={'id': 'movie_id'})
print(f"Crew dataset columns: {', '.join(crew_df.columns)}")

Crew dataset columns: movie_id, role, name


In [67]:
# Check for null values
summarize_nulls(crew_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
role,0,0.0%
name,1,0.0%


In [68]:
find_matching(crew_df, 'name', ["anonymous", "unknown"]).head()

Unnamed: 0,movie_id,role,name
122552,1001960,Co-director,Anonymous
488047,1014226,Executive producer,Anonymous Anonymous
510633,1015341,Director,Anonymous
510634,1015341,Editor,Anonymous
740024,1028791,Original writer,Anonymous


The existing `NaN` value was not removed because it is minimal compared to the overall size of the dataset, and removing it would not significantly impact the analysis. Similarly, values such as *Unknown* or *Anonymous* were kept because they account for less than 0.001% of the data and do not affect the overall results. <br>
So, instead of removing the `NaN` value was replaced with *Unknown*

In [69]:
crew_df['name'] = crew_df['name'].fillna('Unknown')

In [70]:
# Check for duplicate rows
print('Duplicated rows:', crew_df.duplicated().sum())
crew_duplicates = crew_df[crew_df.duplicated(keep=False)].head()

# Dropping the duplicates
crew_df = crew_df.drop_duplicates()

crew_duplicates

Duplicated rows: 1282


Unnamed: 0,movie_id,role,name
1680,1000018,Stunts,Chris Webb
1721,1000018,Stunts,Chris Webb
2690,1000031,Stunts,Sarah Irwin
2691,1000031,Stunts,Sarah Irwin
2692,1000031,Stunts,Sarah Irwin


Completely duplicated rows are clearly an error and can be safely removed.

In [71]:
# Identifying invalid foreign keys
invalid_values = crew_df[~crew_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
crew_df = crew_df.drop(invalid_values.index)

invalid_values.head()

There were 3 rows with invalid foreign keys


Unnamed: 0,movie_id,role,name
2644692,1287515,Director,Ash Thorp
2644693,1287515,Composer,Ben Lukas Boysen
3997184,1646521,Director,Giovanni De Nava


In [72]:
# Reset indexing after removing rows
crew_df = crew_df.reset_index(drop=True)

### 3. Final Results
The **crew** dataset is connected to the movies dataset through the `movie_id` column. It includes the names of all crew members along with their roles. <br>
A crew member can have different roles in the same movies and can appear in more than one movie. <br>
A crew member is solely identified by his full name.

In [73]:
crew_df.head()

Unnamed: 0,movie_id,role,name
0,1000001,Director,Greta Gerwig
1,1000001,Producer,Tom Ackerley
2,1000001,Producer,Margot Robbie
3,1000001,Producer,Robbie Brenner
4,1000001,Producer,David Heyman


In [74]:
crew_df.shape

(4718898, 3)

## Genres

In [75]:
# Import 'genres.csv' dataset
genres_df = pd.read_csv('datasets/genres.csv')

### 1. Data Understanding

In [76]:
genres_df.head()

Unnamed: 0,id,genre
0,1000001,Comedy
1,1000001,Adventure
2,1000002,Comedy
3,1000002,Thriller
4,1000002,Drama


In [77]:
genres_df.shape

(1046849, 2)

In [78]:
genres_df.dtypes

id        int64
genre    object
dtype: object

### 2. Data Cleaning

In [79]:
# Rename columns
genres_df = genres_df.rename(columns={'id': 'movie_id'})
print(f"Genres dataset columns: {', '.join(genres_df.columns)}")

Genres dataset columns: movie_id, genre


In [80]:
# Check for null values
summarize_nulls(genres_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
genre,0,0.0%


In [81]:
# Check for duplicate rows
print(f"There are {genres_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [82]:
# Identifying invalid foreign keys
invalid_values = genres_df[~genres_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
genres_df = genres_df.drop(invalid_values.index)

invalid_values.head()

There were 4 rows with invalid foreign keys


Unnamed: 0,movie_id,genre
465349,1287515,Animation
810489,1617643,Documentary
836148,1646521,Documentary
837569,1648186,Thriller


In [83]:
# Setting the correct type for columns
genres_list = list(genres_df["genre"].unique())
print(f'There are {len(genres_list)} genres in the dataset: {", ".join(genres_list)}')

genres_df['genre'] = genres_df['genre'].astype('category')

There are 19 genres in the dataset: Comedy, Adventure, Thriller, Drama, Science Fiction, Action, Music, Romance, History, Crime, Animation, Mystery, Horror, Family, Fantasy, War, Western, TV Movie, Documentary


Given the limited number of genres in the dataset, the field can be optimized setting it to category type.

### 3. Final Results
The **genres** dataset is connected to the movies dataset through the `movie_id` column. A movie can have multiple genres.


In [84]:
genres_df.head()

Unnamed: 0,movie_id,genre
0,1000001,Comedy
1,1000001,Adventure
2,1000002,Comedy
3,1000002,Thriller
4,1000002,Drama


In [85]:
genres_df.shape

(1046845, 2)

## Posters

In [86]:
# Import 'posters.csv' dataset
posters_df = pd.read_csv('datasets/posters.csv')

### 1. Data Understanding

In [87]:
posters_df.head()

Unnamed: 0,id,link
0,1000001,https://a.ltrbxd.com/resized/film-poster/2/7/7...
1,1000002,https://a.ltrbxd.com/resized/film-poster/4/2/6...
2,1000003,https://a.ltrbxd.com/resized/film-poster/4/7/4...
3,1000004,https://a.ltrbxd.com/resized/film-poster/5/1/5...
4,1000005,https://a.ltrbxd.com/resized/film-poster/2/4/0...


In [88]:
posters_df.shape

(941597, 2)

In [89]:
posters_df.dtypes

id       int64
link    object
dtype: object

### 2. Data Cleaning

In [90]:
# Rename columns
posters_df = posters_df.rename(columns={'id': 'movie_id', 'link': 'poster_link'})
print(f"Posters dataset columns: {', '.join(posters_df.columns)}")

Posters dataset columns: movie_id, poster_link


In [91]:
# Check for null values
summarize_nulls(posters_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
poster_link,180712,19.1921%


In [92]:
# Removing null rows
posters_df = posters_df.dropna()

Removing the `NaN` values as they do not contribute meaningful information to the dataset and could hinder data consistency and analysis. <br>
We also need to check the validity of the link:

In [93]:
# Check for invalid links
link_regex = r'\bhttps?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?'
print(f"There are {(~posters_df['poster_link'].str.contains(link_regex, na=False)).sum()} invalid links.")

There are 0 invalid links.


In [94]:
# Check for duplicate rows
print(f"There are {posters_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [95]:
# Check if a movie can have more than 1 poster
print(f"There are {posters_df['movie_id'].duplicated().sum()} duplicates in the 'movie_id' column.")

There are 0 duplicates in the 'movie_id' column.


A movie can have at most only one poster. The relationship between the **posters** dataset and the **movies** dataset is One-to-One, allowing us to merge the two datasets.

In [96]:
# Identifying invalid foreign keys
invalid_values = posters_df[~posters_df['movie_id'].isin(movies_df.index)]

print(f"There are {len(invalid_values)} with invalid foreign keys")

invalid_values.head()

There are 6 with invalid foreign keys


Unnamed: 0,movie_id,poster_link
287514,1287515,https://a.ltrbxd.com/resized/film-poster/4/4/7...
646520,1646521,https://a.ltrbxd.com/resized/film-poster/1/0/1...
741481,1741482,https://a.ltrbxd.com/resized/film-poster/6/5/6...
840337,1840338,https://a.ltrbxd.com/resized/film-poster/5/8/4...
883228,1883229,https://a.ltrbxd.com/resized/film-poster/7/4/1...


There is no need to remove rows that have an invalid foreign key in this dataset, since the **merge** operation with the movies dataset can naturally differentiate between matching and unmatched records

In [97]:
# Merging the datasets on 'id' from 'movies' and 'movie_id' from 'posters'
movies_df = pd.merge(movies_df, posters_df, left_index=True, right_on='movie_id', how='left')

# Re-set the id as index
movies_df = movies_df.rename(columns={'movie_id': 'id'})
movies_df = movies_df.set_index('id')

### 3. Final Results
The posters dataset has now been merged into the movies dataset under the `poster_link` column. To ensure the merge was successful, the dataset should have the same number of rows as before (previously 941,587) and one additional column (previously 6).

In [98]:
movies_df.head()

Unnamed: 0_level_0,title,release_year,tagline,description,runtime,rating,poster_link
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
1000001,Barbie,2023,She's everything. He's just Ken.,Barbie and Ken are having the time of their li...,114,3.86,https://a.ltrbxd.com/resized/film-poster/2/7/7...
1000002,Parasite,2019,Act like you own the place.,"All unemployed, Ki-taek's family takes peculia...",133,4.56,https://a.ltrbxd.com/resized/film-poster/4/2/6...
1000003,Everything Everywhere All at Once,2022,The universe is so much bigger than you realize.,An aging Chinese immigrant is swept up in an i...,140,4.3,https://a.ltrbxd.com/resized/film-poster/4/7/4...
1000004,Fight Club,1999,Mischief. Mayhem. Soap.,A ticking-time-bomb insomniac and a slippery s...,139,4.27,https://a.ltrbxd.com/resized/film-poster/5/1/5...
1000005,La La Land,2016,Here's to the fools who dream.,"Mia, an aspiring actress, serves lattes to mov...",129,4.09,https://a.ltrbxd.com/resized/film-poster/2/4/0...


In [99]:
movies_df.shape

(941587, 7)

## Releases

In [100]:
# Import 'release.csv' dataset
releases_df = pd.read_csv('datasets/releases.csv')

### 1. Data Understanding

In [101]:
releases_df.head()

Unnamed: 0,id,country,date,type,rating
0,1000001,Andorra,2023-07-21,Theatrical,
1,1000001,Argentina,2023-07-20,Theatrical,ATP
2,1000001,Australia,2023-07-19,Theatrical,PG
3,1000001,Australia,2023-10-01,Digital,PG
4,1000001,Austria,2023-07-20,Theatrical,


In [102]:
releases_df.shape

(1332782, 5)

In [103]:
releases_df.dtypes

id          int64
country    object
date       object
type       object
rating     object
dtype: object

### 2. Data Cleaning

In [104]:
# Rename columns
releases_df = releases_df.rename(columns={'id': 'movie_id', 'type': 'distribution_format'})
print(f"Release dataset columns: {', '.join(releases_df.columns)}")

Release dataset columns: movie_id, country, date, distribution_format, rating


In [105]:
# Check for null values
summarize_nulls(releases_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
country,0,0.0%
date,0,0.0%
distribution_format,0,0.0%
rating,998802,74.9411%


It's fine for the dataset to have null values  in the `rating` column. <br>
The absence of a rating could also be due to some countries not having an official rating system. If that is the case, we don't have enough information to determine whether it's a missing value or a country without a rating system. To make the data more meaningful, we should identify the countries without an official rating system and differentiate them from with a specific value. However, for our purposes, we don't need to clean the data to that extent.

In [106]:
# Check for alternative null values in the dataset
releases_df[(releases_df['rating'] == "0") & (~releases_df['country'].isin(["Germany", "Austria", "Switzerland"]))].head()

Unnamed: 0,movie_id,country,date,distribution_format,rating
216945,1009817,Hungary,2013-08-22,Theatrical,0
223492,1010561,Thailand,2011-11-21,Physical,0
279945,1018726,Israel,2022-06-16,Digital,0
289168,1020593,Israel,2022-06-16,Digital,0
291982,1021166,Israel,2022-06-16,Digital,0


We initially checked the rating value 0 in the dataset, assuming it might represent a null or missing value. However, we discovered that in some countries (e.g., Germany), a rating of 0 has a meaningful interpretation, indicating that the film is suitable for all audiences, including children. While in some cases the 0 rating might still be an error, it appears only 70 times in a dataset of over one million rows. Given that we don't have a reference dataset with the rating system for every country, we can ignore these values.

In [107]:
# Check for duplicate rows
print(f"There are {releases_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [108]:
# Identifying invalid foreign keys
invalid_values = releases_df[~releases_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
releases_df = releases_df.drop(invalid_values.index)

invalid_values.head()

There were 2 rows with invalid foreign keys


Unnamed: 0,movie_id,country,date,distribution_format,rating
730832,1287515,USA,2015-01-01,Theatrical,
1098232,1646521,USA,2008-03-02,Theatrical,PG


In [109]:
# Setting the correct type for the columns
releases_df['date'] = pd.to_datetime(releases_df['date'], format='%Y-%m-%d')
releases_df['distribution_format'] = releases_df['distribution_format'].astype('category')

distribution_formats = list(releases_df['distribution_format'].unique())
print(f'There are {len(distribution_formats)} distribution formats in the dataset: {", ".join(distribution_formats)}')

There are 6 distribution formats in the dataset: Theatrical, Digital, Physical, Premiere, Theatrical limited, TV


Given the limited number of distribution formats in the dataset, the field can be optimized setting it to category type.

### 3. Final Results
The **release** dataset is linked to the movies dataset through the 'movie_id' column. It contains details about the movie releases worldwide, including the release date, distribution format, and rating.

In [110]:
releases_df.head()

Unnamed: 0,movie_id,country,date,distribution_format,rating
0,1000001,Andorra,2023-07-21,Theatrical,
1,1000001,Argentina,2023-07-20,Theatrical,ATP
2,1000001,Australia,2023-07-19,Theatrical,PG
3,1000001,Australia,2023-10-01,Digital,PG
4,1000001,Austria,2023-07-20,Theatrical,


In [111]:
releases_df.shape

(1332780, 5)

## Studios

In [112]:
# Import 'studios.csv' dataset
studios_df = pd.read_csv('datasets/studios.csv')

### 1. Data Understanding

In [113]:
studios_df.head()

Unnamed: 0,id,studio
0,1000001,LuckyChap Entertainment
1,1000001,Heyday Films
2,1000001,NB/GG Pictures
3,1000001,Mattel
4,1000001,Warner Bros. Pictures


In [114]:
studios_df.shape

(679283, 2)

In [115]:
studios_df.dtypes

id         int64
studio    object
dtype: object

### 2. Data Cleaning

In [116]:
# Rename columns
studios_df = studios_df.rename(columns={'id': 'movie_id'})
print(f"Studios dataset columns: {', '.join(studios_df.columns)}")

Studios dataset columns: movie_id, studio


In [117]:
# Check for null values
summarize_nulls(studios_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
studio,10,0.0015%


Unnamed studios are not usable and can be safely removed.

In [118]:
# Removing rows with null studio name
no_name_studios = studios_df[studios_df['studio'].isna()]
studios_df = studios_df.dropna(subset=['studio'])

print("Studios without name:")
no_name_studios.head()

Studios without name:


Unnamed: 0,movie_id,studio
347347,1259717,
411467,1350206,
473794,1450762,
534117,1565428,
541076,1579904,


In [119]:
# Check for duplicate rows
print(f"There are {studios_df.duplicated().sum()} duplicated rows")
studios_duplicates = studios_df[studios_df.duplicated(keep=False)].head(6)

# Drop duplicates
studios_df = studios_df.drop_duplicates()

studios_duplicates

There are 212 duplicated rows


Unnamed: 0,movie_id,studio
145,1000044,Working Title Films
146,1000044,Working Title Films
485,1000165,Working Title Films
487,1000165,Working Title Films
809,1000263,Working Title Films
810,1000263,Working Title Films


Completely duplicated rows are clearly an error and can be safely removed.

In [120]:
# Identifying invalid foreign keys
invalid_values = studios_df[~studios_df['movie_id'].isin(movies_df.index)]

print(f"No rows with an invalid foreign key")

invalid_values.head()

No rows with an invalid foreign key


Unnamed: 0,movie_id,studio


In [121]:
# Reset indexing after removing rows
studios_df = studios_df.reset_index(drop=True)

### 3. Final Results
The **studios** dataset is linked to the movies dataset through the `movie_id` column. It lists all the studios involved in each movie, allowing a movie to be associated with multiple studios and a studio to collaborate on multiple movies.

In [122]:
studios_df.head()

Unnamed: 0,movie_id,studio
0,1000001,LuckyChap Entertainment
1,1000001,Heyday Films
2,1000001,NB/GG Pictures
3,1000001,Mattel
4,1000001,Warner Bros. Pictures


In [123]:
studios_df.shape

(679061, 2)

## Themes

In [124]:
# Import 'themes.csv' dataset
themes_df = pd.read_csv('datasets/themes.csv')

### 1. Data Understanding

In [125]:
themes_df.head()

Unnamed: 0,id,theme
0,1000001,Humanity and the world around us
1,1000001,Crude humor and satire
2,1000001,Moving relationship stories
3,1000001,Emotional and captivating fantasy storytelling
4,1000001,Surreal and thought-provoking visions of life ...


In [126]:
themes_df.shape

(125641, 2)

In [127]:
themes_df.dtypes

id        int64
theme    object
dtype: object

### 2. Data Cleaning

In [128]:
# Rename columns
themes_df = themes_df.rename(columns={'id': 'movie_id'})
print(f"Themes dataset columns: {', '.join(themes_df.columns)}")

Themes dataset columns: movie_id, theme


In [129]:
# Check for null values
summarize_nulls(themes_df)

Unnamed: 0,Null Count,Null Percentage
movie_id,0,0.0%
theme,0,0.0%


In [130]:
# Check for duplicate rows
print(f"There are {themes_df.duplicated().sum()} duplicated rows")

There are 0 duplicated rows


In [131]:
# Identifying invalid foreign keys
invalid_values = themes_df[~themes_df['movie_id'].isin(movies_df.index)]

print(f"No rows with an invalid foreign key")

invalid_values.head()

No rows with an invalid foreign key


Unnamed: 0,movie_id,theme


In [132]:
# Check for unique values
print(f"There are {len(themes_df['theme'].unique())} unique themes")

# Setting the correct type for columns
themes_df['theme'] = themes_df['theme'].astype('category')

There are 109 unique themes


There are a limited number of values for the theme, which means the column can be set as a categorical variable to optimize the analysis.

### 3. Final Results
The **themes** dataset is linked to the movies dataset through the `movie_id` column. It contains 125,000 rows, with multiple occurrences referring to the same movie, suggesting that themes are not frequently assigned to movies. Each theme describes a movie using few standard phrases.

In [133]:
themes_df.head()

Unnamed: 0,movie_id,theme
0,1000001,Humanity and the world around us
1,1000001,Crude humor and satire
2,1000001,Moving relationship stories
3,1000001,Emotional and captivating fantasy storytelling
4,1000001,Surreal and thought-provoking visions of life ...


In [134]:
themes_df.shape

(125641, 2)

## The Oscar Awards

In [135]:
# Import 'the_oscar_awards.csv' dataset
oscars_df = pd.read_csv('datasets/the_oscar_awards.csv')

### 1. Data Understanding

In [136]:
oscars_df.head()

Unnamed: 0,year_film,year_ceremony,ceremony,category,name,film,winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


In [137]:
oscars_df.shape

(10889, 7)

In [138]:
oscars_df.dtypes

year_film         int64
year_ceremony     int64
ceremony          int64
category         object
name             object
film             object
winner             bool
dtype: object

### 2. Data Cleaning

In [139]:
# Rename columns
oscars_df = oscars_df.rename(columns={'ceremony': 'number_ceremony', 'year_film': 'year_movie', 'name': 'nominee_name', 'film': 'nominee_movie', 'winner': 'is_winner'})
print(f"Oscar dataset columns: {', '.join(oscars_df.columns)}")

Oscar dataset columns: year_movie, year_ceremony, number_ceremony, category, nominee_name, nominee_movie, is_winner


In [140]:
# Check for null values
summarize_nulls(oscars_df)

Unnamed: 0,Null Count,Null Percentage
year_movie,0,0.0%
year_ceremony,0,0.0%
number_ceremony,0,0.0%
category,0,0.0%
nominee_name,5,0.0459%
nominee_movie,319,2.9296%
is_winner,0,0.0%


In [141]:
# Check for duplicate rows
print(f"There are {oscars_df.duplicated().sum()} duplicated rows")
oscars_df[oscars_df.duplicated(keep=False)].head(6)

There are 7 duplicated rows


Unnamed: 0,year_movie,year_ceremony,number_ceremony,category,nominee_name,nominee_movie,is_winner
6219,1983,1984,56,MUSIC (Original Song),Music by Michel Legrand; Lyric by Alan Bergman...,Yentl,False
6220,1983,1984,56,MUSIC (Original Song),Music by Michel Legrand; Lyric by Alan Bergman...,Yentl,False
7066,1991,1992,64,MUSIC (Original Song),Music by Alan Menken; Lyric by Howard Ashman,Beauty and the Beast,False
7068,1991,1992,64,MUSIC (Original Song),Music by Alan Menken; Lyric by Howard Ashman,Beauty and the Beast,False
7394,1994,1995,67,MUSIC (Original Song),Music by Elton John; Lyric by Tim Rice,The Lion King,False
7395,1994,1995,67,MUSIC (Original Song),Music by Elton John; Lyric by Tim Rice,The Lion King,False


Duplicates are retained because, in music categories, the same artists can receive identical nominations for different songs, with the song titles not specified in the dataset.

In [142]:
# Check the consistency between year_film and year_ceremony
print(f"There are {oscars_df[oscars_df['year_movie'] > oscars_df['year_ceremony']].shape[0]} rows where the movie has been released after ceremony")

There are 0 rows where the movie has been released after ceremony


In [143]:
# Check for multiple winner possibilities
from utils.utils import special_oscar_awards

# Filtering rows where the 'category' is in the special_oscar_awards list
filtered_oscars = find_matching(oscars_df, 'category', special_oscar_awards)

# Find groups with more than one winner
multiple_winners = filtered_oscars.groupby(['year_ceremony', 'category']).filter(
    lambda x: x['is_winner'].sum() > 1
)

# Keep only the rows where 'is_winner' is True
multiple_winners = multiple_winners[multiple_winners['is_winner'] == True]
multiple_winners.head(6)

Unnamed: 0,year_movie,year_ceremony,number_ceremony,category,nominee_name,nominee_movie,is_winner
33,1927,1928,1,SPECIAL AWARD,Warner Bros.,,True
34,1927,1928,1,SPECIAL AWARD,Charles Chaplin,,True
520,1936,1937,9,SPECIAL AWARD,The March of Time for its significance to mot...,,True
521,1936,1937,9,SPECIAL AWARD,W. Howard Greene and Harold Rosson for the co...,,True
640,1937,1938,10,SPECIAL AWARD,Mack Sennett,,True
641,1937,1938,10,SPECIAL AWARD,Edgar Bergen for his outstanding comedy creation,,True


There can be multiple winners for the same Oscar category (e.g., 1932 Best Actor). <br>
In the count, we excluded special awards, such as the Jean Hersholt Humanitarian Award, because they are given without nominations and have an undefined number of winners each year. These awards are not part of traditional competitive categories but rather honorary recognitions.

In [144]:
# Check for unique values
print(f"There are {len(oscars_df['category'].unique())} unique categories")

# Setting the correct type for columns
oscars_df['category'] = oscars_df['category'].astype('category')

There are 115 unique categories


There are a limited number of values for the `category` field, which means the column can be set as a categorical variable to optimize the analysis.

### 3. Final Results
The Oscar dataset is not linked to the movies dataset in any way, and it cannot be automatically connected due to the absence of a unique identifier for the movies. <br>
This dataset includes all nominations and winners of the Oscars, both ordinary and special, starting from the first ceremony.

In [145]:
oscars_df.head()

Unnamed: 0,year_movie,year_ceremony,number_ceremony,category,nominee_name,nominee_movie,is_winner
0,1927,1928,1,ACTOR,Richard Barthelmess,The Noose,False
1,1927,1928,1,ACTOR,Emil Jannings,The Last Command,True
2,1927,1928,1,ACTRESS,Louise Dresser,A Ship Comes In,False
3,1927,1928,1,ACTRESS,Janet Gaynor,7th Heaven,True
4,1927,1928,1,ACTRESS,Gloria Swanson,Sadie Thompson,False


In [146]:
oscars_df.shape

(10889, 7)

## Rotten Tomatoes Reviews

In [191]:
# Import 'rotten_tomatoes_reviews' dataset
reviews_df = pd.read_csv('datasets/rotten_tomatoes_reviews.csv')

### 1. Data Understanding

In [192]:
reviews_df.head()

Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,top_critic,publisher_name,review_type,review_score,review_date,review_content
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,Percy Jackson & the Olympians: The Lightning T...,,False,FILMINK (Australia),Fresh,,2010-02-09,With a top-notch cast and dazzling special eff...
3,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
4,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...


In [193]:
reviews_df.shape

(1129887, 9)

In [194]:
reviews_df.dtypes

rotten_tomatoes_link    object
movie_title             object
critic_name             object
top_critic                bool
publisher_name          object
review_type             object
review_score            object
review_date             object
review_content          object
dtype: object

### 2. Data Cleaning

In [195]:
# Rename columns
reviews_df = reviews_df.rename(columns={'review_type': 'type', 'review_score': 'score', 'review_content': 'content', 'top_critic': 'is_top_critic'})
print(f"Rotten Tomatoes Reviews dataset columns: {', '.join(reviews_df.columns)}")

Rotten Tomatoes Reviews dataset columns: rotten_tomatoes_link, movie_title, critic_name, is_top_critic, publisher_name, type, score, review_date, content


In [196]:
# Check for null values
summarize_nulls(reviews_df)

Unnamed: 0,Null Count,Null Percentage
rotten_tomatoes_link,0,0.0%
movie_title,0,0.0%
critic_name,18521,1.6392%
is_top_critic,0,0.0%
publisher_name,0,0.0%
type,0,0.0%
score,305902,27.0737%
review_date,0,0.0%
content,65778,5.8216%


After checking the Rotten Tomatoes website, it's fine to have the publisher name, critic name, and content as null values.

In [197]:
# Check for duplicate rows
print(f"There are {reviews_df.duplicated().sum()} duplicated rows")

filtered_df = reviews_df[reviews_df['critic_name'].notna()]
filtered_df = filtered_df[filtered_df.duplicated(keep=False)]

reviews_df = reviews_df.drop(reviews_df[reviews_df['critic_name'].isna()].index)

filtered_df.head()

There are 119471 duplicated rows


Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,is_top_critic,publisher_name,type,score,review_date,content
35513,m/1069696-screamers,Screamers,Dave White,False,Movies.com,Fresh,B-,1996-01-26,
35514,m/1069696-screamers,Screamers,Dave White,False,Movies.com,Fresh,B-,1996-01-26,
35576,m/1069707-othello,Othello,Fred Topel,False,About.com,Fresh,4/5,2003-11-25,Fine Shakespeare adaptation
35577,m/1069707-othello,Othello,Fred Topel,False,About.com,Fresh,4/5,2003-11-25,Fine Shakespeare adaptation
135520,m/animatrix,The Animatrix,Michael Dequina,False,TheMovieReport.com,Fresh,3/5,2007-01-12,


There are many duplicate reviews in the datasets. After checking the official website, it is possible to have multiple reviews for the same movie from the same publisher and with an unspecified critic. These rows will be excluded from the total count of duplicates and will not be removed. All other duplicate rows are being removed.

In [198]:
# Check for unique values
print(f"There are {len(reviews_df['type'].unique())} review types")

# Setting the correct type for columns
reviews_df['type'] = reviews_df['type'].astype('category')
reviews_df['review_date'] = pd.to_datetime(reviews_df['review_date'], format='%Y-%m-%d')

There are 2 review types


A review can be either Fresh or Rotten, so the `type` column is set as a categorical variable.

In [199]:
# Check the range of values of review date
print("Min Value: ", reviews_df['review_date'].min())
print("Max Value: ", reviews_df['review_date'].max())

Min Value:  1800-01-01 00:00:00
Max Value:  2020-10-29 00:00:00


The minimum date is probably an error, so we need to check of how many of those there are, and eventually proceed to remove them.

In [200]:
# Setting invalid dates as null
print("Review with '1800-01-01' as date:", reviews_df[reviews_df['review_date'] <= "1800-01-01"].shape[0])
reviews_df.loc[reviews_df['review_date'] <= "1800-01-01", 'review_date'] = pd.NaT

Review with '1800-01-01' as date: 39


In [201]:
# Reset indexing after removing rows
reviews_df = reviews_df.reset_index(drop=True)

### 3. Final Result
The review dataset is not linked to the movies dataset in any way, and it cannot be automatically connected due to the absence of a unique identifier for the movies. <br>
This dataset contains reviews collected from the Rotten Tomatoes review aggregator website, featuring reviews from various publishers and critics. Each review is categorized as either "Fresh" or "Rotten," which are equivalent to "Positive" and "Negative," respectively.


In [202]:
reviews_df.head()

Unnamed: 0,rotten_tomatoes_link,movie_title,critic_name,is_top_critic,publisher_name,type,score,review_date,content
0,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Andrew L. Urban,False,Urban Cinefile,Fresh,,2010-02-06,A fantasy adventure that fuses Greek mythology...
1,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Louise Keller,False,Urban Cinefile,Fresh,,2010-02-06,"Uma Thurman as Medusa, the gorgon with a coiff..."
2,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ben McEachen,False,Sunday Mail (Australia),Fresh,3.5/5,2010-02-09,Whether audiences will get behind The Lightnin...
3,m/0814255,Percy Jackson & the Olympians: The Lightning T...,Ethan Alter,True,Hollywood Reporter,Rotten,,2010-02-10,What's really lacking in The Lightning Thief i...
4,m/0814255,Percy Jackson & the Olympians: The Lightning T...,David Germain,True,Associated Press,Rotten,,2010-02-10,It's more a list of ingredients than a movie-m...


In [203]:
reviews_df.shape

(1111366, 9)

---

## Save the clean datasets to new `.csv` files

In [190]:
# Print the clean datasets to new csv files
if PRINT_CSV:
    movies_df.to_csv('clean_datasets/movies.csv')
    lang_df.to_csv('clean_datasets/languages.csv', index=False)
    actors_df.to_csv('clean_datasets/actors.csv', index=False)
    countries_df.to_csv('clean_datasets/countries.csv', index=False)
    crew_df.to_csv('clean_datasets/crew.csv', index=False)
    genres_df.to_csv('clean_datasets/genres.csv', index=False)
    releases_df.to_csv('clean_datasets/releases.csv', index=False)
    studios_df.to_csv('clean_datasets/studios.csv', index=False)
    themes_df.to_csv('clean_datasets/themes.csv', index=False)
    oscars_df.to_csv('clean_datasets/oscars.csv', index=False)
    reviews_df.to_csv('clean_datasets/reviews.csv', index=False)
    # JSON for import in MongoDB
    reviews_df['review_date'] = reviews_df['review_date'].dt.strftime('%Y-%m-%dT%H:%M:%S').apply(lambda x: {"$date": f"{x}Z"})
    reviews_df.to_json('clean_datasets/reviews.json', orient='records', lines=True)