# Data Cleaning
The cleaning process is organized into sections, each corresponding to a different dataset (`.csv` file). Each section includes the following steps:

1. **Data Understanding**: Initial exploration of the dataset.
2. **Data Cleaning**: Handling of missing values (NaN), removal of duplicates, validating foreign keys to identify and manage invalid references, setting correct data types, and renaming columns. <br>
   *(Optional)* **Deep Clean**: Custom cleaning steps applied to a specific dataset, if necessary.
3. **Final Result**: Displays the cleaned dataset and saves it to a new `.csv` file.
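
The steps listed above can be sketched on toy data as follows. This is only a minimal illustration under stated assumptions, not the notebook's actual code: the `parent` and `child` frames and their values are invented for the example.

```python
import pandas as pd

# Toy parent table (plays the role of movies) and child table (e.g. genres)
parent = pd.DataFrame({"id": [1, 2, 3], "name": ["A", "B", None]})
child = pd.DataFrame({"id": [1, 1, 2, 9], "genre": ["Drama", "Drama", "Comedy", "Horror"]})

# Rename columns so the foreign key is explicit
child = child.rename(columns={"id": "movie_id"})

# Missing values: drop parent rows that lack the mandatory field
parent = parent.dropna(subset=["name"]).set_index("id")

# Duplicates: remove rows that are identical in every column
child = child.drop_duplicates()

# Foreign keys: keep only rows that reference an existing parent id
child = child[child["movie_id"].isin(parent.index)].reset_index(drop=True)

# Data types: a low-cardinality text column becomes a categorical
child["genre"] = child["genre"].astype("category")

print(child)
```

Each dataset section below applies the same sequence, adapted to the columns of that file.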

Together, the datasets total about 1 GB, so they can all be loaded into memory at the same time on most PCs.

First, import the necessary libraries and set up any required options.

In [None]:
import pandas as pd
import numpy as np

from utils.utils import find_matching, summarize_nulls

# Set to True to write the cleaned data to new .csv files
PRINT_CSV = False
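
The `summarize_nulls` helper is imported from `utils/utils.py` and its code is not shown in this notebook. As a rough idea of what such a helper could do, here is a hypothetical stand-in (the name `summarize_nulls_sketch` and its output columns are assumptions, not the project's implementation):

```python
import pandas as pd

def summarize_nulls_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in: count and percentage of nulls per column."""
    null_count = df.isna().sum()
    summary = pd.DataFrame({
        "null_count": null_count,
        "null_pct": (null_count / len(df) * 100).round(2),
    })
    # Report only the columns that contain nulls, worst first
    return summary[summary["null_count"] > 0].sort_values("null_count", ascending=False)

toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "title": ["A", None, "C", "D"],
    "runtime": [90, None, None, 120],
})
report = summarize_nulls_sketch(toy)
print(report)
```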

## Movies

In [None]:
# Import 'movies.csv' dataset
movies_df = pd.read_csv('datasets/movies.csv')

### 1. Data Understanding

In [None]:
movies_df.head()

In [None]:
movies_df.shape

In [None]:
movies_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
movies_df = movies_df.rename(columns={'name': 'title', 'minute': 'runtime', 'date': 'release_year'})
print(f"Movies dataset columns: {', '.join(movies_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(movies_df)

There are null values in most of the columns.
Null values in '**release_year**', '**tagline**', '**description**', '**runtime**' and '**rating**' don't cause any problems, so those rows are kept. The few movies without a title, however, can't be used and will be removed.

In [None]:
# Removing rows with null title
no_title = movies_df[movies_df['title'].isna()]
movies_df = movies_df.dropna(subset=['title'])

print("Movies dataset without title:")
no_title.head()

In [None]:
# Check for duplicate rows
print(f"There are {movies_df.duplicated().sum()} duplicated rows")

In [None]:
# Setting the correct type for columns
movies_df['release_year'] = movies_df['release_year'].astype('Int64')
movies_df['runtime'] = movies_df['runtime'].astype('Int64')
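
The capitalized `'Int64'` used above is pandas' nullable integer dtype, which matters because these columns contain nulls. A small sketch on an invented series shows the difference from NumPy's plain `int64`:

```python
import pandas as pd

# A missing value forces an otherwise integer column to float64
years = pd.Series([1999.0, None, 2021.0])
print(years.dtype)            # float64

# 'Int64' (capital I) keeps integers and still allows missing values
nullable_years = years.astype("Int64")
print(nullable_years.dtype)   # Int64

# Plain 'int64' cannot represent the missing value and raises
try:
    years.astype("int64")
    plain_cast_failed = False
except (ValueError, TypeError) as exc:
    plain_cast_failed = True
    print(f"plain int64 fails: {exc}")
```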

In [None]:
# Check if 'id' column has unique values
print(f"'id' duplicates: {movies_df[movies_df['id'].duplicated()].shape[0]}")
movies_df = movies_df.set_index("id")

The '**id**' field is the unique identifier of a movie, so it's been set as the index.

#### Deep Clean
Let's look inside some columns to see most frequent values:

In [None]:
movies_df['description'].value_counts().head(10)

Many rows contain a placeholder such as "Plot Unavailable" instead of a null description. The other fields seem fine.<br>
Let's replace as many of these placeholders as possible with nulls. Only the most frequent variations are handled, so the fix is not 100% accurate.

In [None]:
from utils.utils import null_movie_description_keywords

# Find null description variation
matches = find_matching(movies_df, 'description', null_movie_description_keywords, max_length=30)

# Replace the matched placeholder descriptions with NaN
movies_df.loc[matches.index, ['description']] = np.nan

# Manual check to be sure to not overwrite real descriptions
matches['description'].value_counts().head(15)
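
`find_matching` and `null_movie_description_keywords` come from `utils/utils.py` and are not shown here. The sketch below is a hypothetical stand-in that assumes a case-insensitive keyword search combined with a length cap; the function name, keywords and sample rows are all invented for illustration.

```python
import re

import numpy as np
import pandas as pd

def find_matching_sketch(df, column, keywords, max_length=None):
    """Hypothetical stand-in: rows whose `column` contains any keyword (case-insensitive)."""
    pattern = "|".join(re.escape(k) for k in keywords)
    mask = df[column].str.contains(pattern, case=False, na=False)
    if max_length is not None:
        # Only short strings qualify, so long real descriptions that
        # happen to mention a keyword are left alone
        mask &= df[column].str.len() <= max_length
    return df[mask]

toy = pd.DataFrame({"description": [
    "Plot unavailable",
    "No description available.",
    "A detective finds the plot unavailable to reason and chases the truth anyway.",
    None,
]})

placeholders = find_matching_sketch(toy, "description", ["unavailable", "no description"], max_length=30)
toy.loc[placeholders.index, "description"] = np.nan
print(toy)
```

The length cap is what protects the third row: it contains a keyword but is too long to be a placeholder.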

### 3. Final Result
All datasets reference the **movies** dataset. A movie is uniquely identified by its **id**, and a movie id can occur multiple times in other datasets. A movie has a title, a tagline, a description, a release year, a runtime and a rating. Only the title is mandatory; all the other attributes may be missing.

In [None]:
movies_df.head()

In [None]:
movies_df.shape

## Languages

In [None]:
# Import 'languages.csv' dataset
lang_df = pd.read_csv('datasets/languages.csv')

### 1. Data Understanding

In [None]:
lang_df.head()

In [None]:
lang_df.shape

In [None]:
lang_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
lang_df = lang_df.rename(columns={'id': 'movie_id'})
print(f"Languages dataset columns: {', '.join(lang_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(lang_df)

In [None]:
# Check for duplicate rows
print(f"There are {lang_df.duplicated().sum()} duplicated rows")

In [None]:
# Identifying invalid foreign keys
invalid_values = lang_df[~lang_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
lang_df = lang_df.drop(invalid_values.index)

invalid_values.head()

In [None]:
# Setting the category data type for column 'type'
print(f"types: {', '.join(lang_df['type'].unique())}")
lang_df['type'] = lang_df['type'].astype('category')

The `type` field has only 3 possible values, so we can set it as a categorical type.
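
To illustrate why a categorical type suits a column with so few distinct values, the following sketch compares memory usage on an invented series. The sizes depend on the pandas version and are not measurements from the real dataset.

```python
import pandas as pd

# Toy column mimicking a low-cardinality text field
labels = pd.Series(["Language", "Primary language", "Spoken language"] * 10_000)

as_strings = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)

print(f"strings:  {as_strings:,} bytes")
print(f"category: {as_category:,} bytes")
```

A categorical column stores each distinct label once and keeps a small integer code per row.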

### 3. Final Result
The languages dataset is directly linked to the movies dataset through the `movie_id` column. There are more rows in the languages dataset than in the movies dataset because a movie can be associated with multiple languages. Additionally, not all movies have a language defined. <br>
A language associated with a movie can fall into one or more of the following categories:
- *Language*: Refers to a generic language associated with the movie, typically used when there is a single dominant language.
- *Primary Language*: The main or original language of the movie.
- *Spoken Language*: All the languages actually used in the movie's dialogues.

In [None]:
lang_df.head()

In [None]:
lang_df.shape

## Actors

In [None]:
# Import 'actors.csv' dataset
actors_df = pd.read_csv('datasets/actors.csv')

### 1. Data Understanding

In [None]:
actors_df.head()

In [None]:
actors_df.shape

In [None]:
actors_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
actors_df = actors_df.rename(columns={'id': 'movie_id'})
print(f"Actors dataset columns: {', '.join(actors_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(actors_df)

Despite the high number of null values in the `role` field, the rows will be maintained because they still contain information about the actor's `name`. However, an actor without a name is unusable, so the corresponding rows will be removed.

In [None]:
# Removing actors without name
no_name = actors_df[actors_df['name'].isna()]
actors_df = actors_df.dropna(subset=['name'])
no_name

In [None]:
# Check for duplicate rows
print(f"There are {actors_df.duplicated().sum()} duplicated rows")
actors_duplicates = actors_df[actors_df.duplicated(keep=False)].head(6)

# Dropping the duplicates
actors_df = actors_df.drop_duplicates()

actors_duplicates

Completely duplicated rows are clearly errors and can be removed.

In [None]:
# Identifying invalid foreign keys
invalid_values = actors_df[~actors_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
actors_df = actors_df.drop(invalid_values.index)

invalid_values.head()

#### Deep Clean

In [None]:
actors_df['role'].value_counts().head(10)

The role column contains many variations of the 'Self' role. Let's examine this more closely.

In [None]:
from utils.utils import self_actor_role_keywords

# Find self variation
matches = find_matching(actors_df, 'role', self_actor_role_keywords)
print(f"Rows containing 'self' variations: {matches['role'].shape[0]}")
matches['role'].value_counts().head()

There are over 300,000 values similar to 'Self', but many of them also contain additional information, such as 'Self - Presenter' or 'Self - Guest'. Overwriting all these values could lead to a loss of information, so they won't be overwritten in the cleaned dataset. However, they may be modified when visualizing the data for statistical purposes.
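
One way to group these variations for statistics without touching the stored values is to derive a separate column. The sketch below assumes, purely for illustration, that any role starting with "self" counts as 'Self'; the `role_group` column and the sample roles are hypothetical.

```python
import pandas as pd

# Toy roles; the real column has many more variations
roles = pd.DataFrame({"role": [
    "Self", "Self - Presenter", "self (archive footage)", "Herself", "Batman", None,
]})

# Assumed rule for this sketch: the role begins with the word "self"
is_self = roles["role"].str.match(r"\s*self\b", case=False, na=False)

# Write the grouped label to a new column so the original text is preserved
roles["role_group"] = roles["role"].where(~is_self, "Self")
print(roles)
```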

In [None]:
# Reset indexing after removing rows
actors_df = actors_df.reset_index(drop=True)

### 3. Final Result
The actors dataset is directly linked to the movies dataset through the `movie_id` column and has nearly six times as many rows as the movies dataset. Additionally, a movie may have no actors associated with it. <br>
The same actor can appear multiple times in the dataset if they feature in more than one movie. <br>
An actor is identified solely by their full name, stored in a single field.


In [None]:
actors_df.head()

In [None]:
actors_df.shape

## Countries

In [None]:
# Import 'countries.csv' dataset
countries_df = pd.read_csv('datasets/countries.csv')

### 1. Data Understanding

In [None]:
countries_df.head()

In [None]:
countries_df.shape

In [None]:
countries_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
countries_df = countries_df.rename(columns={'id': 'movie_id'})
print(f"Countries dataset columns: {', '.join(countries_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(countries_df)

In [None]:
# Identifying invalid foreign keys
invalid_values = countries_df[~countries_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
countries_df = countries_df.drop(invalid_values.index)

invalid_values.head()

In [None]:
# Check for duplicate rows
print(f"There are {countries_df.duplicated().sum()} duplicated rows")

### 3. Final Result
The **countries** dataset is directly connected to the movies dataset through the 'movie_id' column. This dataset contains all the countries where the movies were produced.


In [None]:
countries_df.head()

In [None]:
countries_df.shape

## Crew

In [None]:
# Import 'crew.csv' dataset
crew_df = pd.read_csv('datasets/crew.csv')

### 1. Data Understanding

In [None]:
crew_df.head()

In [None]:
crew_df.shape

In [None]:
crew_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
crew_df = crew_df.rename(columns={'id': 'movie_id'})
print(f"Crew dataset columns: {', '.join(crew_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(crew_df)

In [None]:
find_matching(crew_df, 'name', ["anonymous", "unknown"]).head()

The existing `NaN` value was kept because it is negligible compared to the overall size of the dataset and does not significantly affect the analysis. Similarly, values such as *Unknown* or *Anonymous* were kept because they account for less than 0.001% of the data and do not affect the overall results.

In [None]:
# Check for duplicate rows
print('Duplicated rows:', crew_df.duplicated().sum())
crew_duplicates = crew_df[crew_df.duplicated(keep=False)].head()

# Dropping the duplicates
crew_df = crew_df.drop_duplicates()

crew_duplicates

Completely duplicated rows are clearly an error and can be safely removed.

In [None]:
# Identifying invalid foreign keys
invalid_values = crew_df[~crew_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
crew_df = crew_df.drop(invalid_values.index)

invalid_values.head()

In [None]:
# Reset indexing after removing rows
crew_df = crew_df.reset_index(drop=True)

### 3. Final Result
The **crew** dataset is connected to the movies dataset through the `movie_id` column. It includes the names of all crew members along with their roles. <br>
A crew member can have different roles in the same movie and can appear in more than one movie. <br>
A crew member is identified solely by their full name.

In [None]:
crew_df.head()

In [None]:
crew_df.shape

## Genres

In [None]:
# Import 'genres.csv' dataset
genres_df = pd.read_csv('datasets/genres.csv')

### 1. Data Understanding

In [None]:
genres_df.head()

In [None]:
genres_df.shape

In [None]:
genres_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
genres_df = genres_df.rename(columns={'id': 'movie_id'})
print(f"Genres dataset columns: {', '.join(genres_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(genres_df)

In [None]:
# Check for duplicate rows
print(f"There are {genres_df.duplicated().sum()} duplicated rows")

In [None]:
# Identifying invalid foreign keys
invalid_values = genres_df[~genres_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
genres_df = genres_df.drop(invalid_values.index)

invalid_values.head()

In [None]:
# Setting the correct type for columns
genres_list = list(genres_df["genre"].unique())
print(f'There are {len(genres_list)} genres in the dataset: {", ".join(genres_list)}')

genres_df['genre'] = genres_df['genre'].astype('category')

Given the limited number of genres in the dataset, the field can be optimized by setting it to the category type.

### 3. Final Result
The **genres** dataset is connected to the movies dataset through the `movie_id` column. A movie can have multiple genres.


In [None]:
genres_df.head()

In [None]:
genres_df.shape

## Posters

In [None]:
# Import 'posters.csv' dataset
posters_df = pd.read_csv('datasets/posters.csv')

### 1. Data Understanding

In [None]:
posters_df.head()

In [None]:
posters_df.shape

In [None]:
posters_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
posters_df = posters_df.rename(columns={'id': 'movie_id', 'link': 'poster_link'})
print(f"Posters dataset columns: {', '.join(posters_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(posters_df)

In [None]:
# Removing null rows
posters_df = posters_df.dropna()

Rows with `NaN` values are removed because a poster without a link carries no useful information and would only hinder consistency and analysis. <br>
We also need to check the validity of the link:

In [None]:
# Check for invalid links
link_regex = r'\bhttps?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?'
print(f"There are {(~posters_df['poster_link'].str.contains(link_regex, na=False)).sum()} invalid links.")
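
To see how this check behaves, the same pattern can be tried on a few made-up values (none of them come from the dataset). The sketch also contrasts `str.contains`, which accepts a link anywhere in the string, with the stricter `str.fullmatch`.

```python
import pandas as pd

link_regex = r'\bhttps?:\/\/(?:www\.)?[a-zA-Z0-9-]+\.[a-zA-Z]{2,}(?:\/[^\s]*)?'

links = pd.Series([
    "https://example.com/poster.jpg",
    "http://www.example.org",
    "ftp://example.com/poster.jpg",
    "not a link",
    "see https://example.com now",
    None,
])

# contains: the pattern may match anywhere; na=False counts nulls as invalid
is_valid = links.str.contains(link_regex, na=False)

# fullmatch: the whole value must be a link, so surrounding text is rejected
is_strict = links.str.fullmatch(link_regex, na=False)

print(pd.DataFrame({"link": links, "contains": is_valid, "fullmatch": is_strict}))
```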

In [None]:
# Check for duplicate rows
print(f"There are {posters_df.duplicated().sum()} duplicated rows")

In [None]:
# Check if a movie can have more than 1 poster
print(f"There are {posters_df['movie_id'].duplicated().sum()} duplicates in the 'movie_id' column.")

A movie can have at most one poster. The relationship between the **posters** dataset and the **movies** dataset is therefore one-to-one, which allows us to merge the two datasets.

In [None]:
# Identifying invalid foreign keys
invalid_values = posters_df[~posters_df['movie_id'].isin(movies_df.index)]

print(f"There are {len(invalid_values)} rows with invalid foreign keys")

invalid_values.head()

There is no need to remove rows with an invalid foreign key from this dataset: a left **merge** with the movies dataset keeps only the posters that match an existing movie, so unmatched records are discarded automatically.

In [None]:
# Merging the datasets on 'id' from 'movies' and 'movie_id' from 'posters'
movies_df = pd.merge(movies_df, posters_df, left_index=True, right_on='movie_id', how='left')

# Re-set the id as index
movies_df = movies_df.rename(columns={'movie_id': 'id'})
movies_df = movies_df.set_index('id')

### 3. Final Result
The posters dataset has now been merged into the movies dataset under the `poster_link` column. To ensure the merge was successful, the dataset should have the same number of rows as before (previously 941,587) and one additional column (previously 6).
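
The shape check described above can be sketched on toy frames. The data is invented, and `validate="one_to_one"` is an optional safeguard added for this example; it is not part of the notebook's merge call.

```python
import pandas as pd

movies = pd.DataFrame({"title": ["A", "B", "C"]}, index=pd.Index([1, 2, 3], name="id"))
posters = pd.DataFrame({
    "movie_id": [1, 3, 9],  # 9 references a movie that does not exist
    "poster_link": [
        "https://example.com/1.jpg",
        "https://example.com/3.jpg",
        "https://example.com/9.jpg",
    ],
})

rows_before, cols_before = movies.shape

merged = pd.merge(
    movies, posters,
    left_index=True, right_on="movie_id",
    how="left",
    validate="one_to_one",  # raises MergeError if a movie has several posters
)
merged = merged.rename(columns={"movie_id": "id"}).set_index("id")

# A left merge keeps every movie and drops the poster with the unknown id
print(merged)
print(merged.shape == (rows_before, cols_before + 1))
```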

In [None]:
movies_df.head()

In [None]:
movies_df.shape

## Releases

In [None]:
# Import 'releases.csv' dataset
releases_df = pd.read_csv('datasets/releases.csv')

### 1. Data Understanding

In [None]:
releases_df.head()

In [None]:
releases_df.shape

In [None]:
releases_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
releases_df = releases_df.rename(columns={'id': 'movie_id', 'type': 'distribution_format'})
print(f"Release dataset columns: {', '.join(releases_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(releases_df)

It's fine for the dataset to have null values in the `rating` column. <br>
A missing rating may also mean that the country has no official rating system, and the data alone doesn't tell us which of the two cases applies. To make the data more meaningful, we would have to identify the countries without an official rating system and mark them with a dedicated value. However, for our purposes, we don't need to clean the data to that extent.

In [None]:
# Check for alternative null values in the dataset
releases_df[(releases_df['rating'] == "0") & (~releases_df['country'].isin(["Germany", "Austria", "Switzerland"]))].head()

We initially checked the rating value 0 in the dataset, assuming it might represent a null or missing value. However, we discovered that in some countries (e.g., Germany), a rating of 0 has a meaningful interpretation, indicating that the film is suitable for all audiences, including children. While in some cases the 0 rating might still be an error, it appears only 70 times in a dataset of over one million rows. Given that we don't have a reference dataset with the rating system for every country, we can ignore these values.

In [None]:
# Check for duplicate rows
print(f"There are {releases_df.duplicated().sum()} duplicated rows")

In [None]:
# Identifying invalid foreign keys
invalid_values = releases_df[~releases_df['movie_id'].isin(movies_df.index)]

print(f"There were {len(invalid_values)} rows with invalid foreign keys")

# Removing invalid values
releases_df = releases_df.drop(invalid_values.index)

invalid_values.head()

In [None]:
# Setting the correct type for the columns
releases_df['date'] = pd.to_datetime(releases_df['date'], format='%Y-%m-%d')
releases_df['distribution_format'] = releases_df['distribution_format'].astype('category')

distribution_formats = list(releases_df['distribution_format'].unique())
print(f'There are {len(distribution_formats)} distribution formats in the dataset: {", ".join(distribution_formats)}')

Given the limited number of distribution formats in the dataset, the field can be optimized by setting it to the category type.
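
The date conversion above uses a strict format, so a malformed value would raise an error. If that ever happened, one option (an assumption for this sketch, not something the notebook does) is `errors="coerce"`, shown here on invented values:

```python
import pandas as pd

raw_dates = pd.Series(["2020-01-31", "2021-13-01", None])  # month 13 is invalid

# errors="coerce" turns unparseable values into NaT instead of raising
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")
print(parsed)
print(f"Unparseable or missing dates: {parsed.isna().sum()}")
```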

### 3. Final Result
The **release** dataset is linked to the movies dataset through the 'movie_id' column. It contains details about the movie releases worldwide, including the release date, distribution format, and rating.

In [None]:
releases_df.head()

In [None]:
releases_df.shape

## Studios

In [None]:
# Import 'studios.csv' dataset
studios_df = pd.read_csv('datasets/studios.csv')

### 1. Data Understanding

In [None]:
studios_df.head()

In [None]:
studios_df.shape

In [None]:
studios_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
studios_df = studios_df.rename(columns={'id': 'movie_id'})
print(f"Studios dataset columns: {', '.join(studios_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(studios_df)

Unnamed studios are not usable and can be safely removed.

In [None]:
# Removing rows with null studio name
no_name_studios = studios_df[studios_df['studio'].isna()]
studios_df = studios_df.dropna(subset=['studio'])

print("Studios without name:")
no_name_studios.head()

In [None]:
# Check for duplicate rows
print(f"There are {studios_df.duplicated().sum()} duplicated rows")
studios_duplicates = studios_df[studios_df.duplicated(keep=False)].head(6)

# Drop duplicates
studios_df = studios_df.drop_duplicates()

studios_duplicates

Completely duplicated rows are clearly an error and can be safely removed.

In [None]:
# Identifying invalid foreign keys
invalid_values = studios_df[~studios_df['movie_id'].isin(movies_df.index)]

print(f"There are {len(invalid_values)} rows with invalid foreign keys")

invalid_values.head()

In [None]:
# Reset indexing after removing rows
studios_df = studios_df.reset_index(drop=True)

### 3. Final Result
The **studios** dataset is linked to the movies dataset through the `movie_id` column. It lists all the studios involved in each movie, allowing a movie to be associated with multiple studios and a studio to collaborate on multiple movies.

In [None]:
studios_df.head()

In [None]:
studios_df.shape

## Themes

In [None]:
# Import 'themes.csv' dataset
themes_df = pd.read_csv('datasets/themes.csv')

### 1. Data Understanding

In [None]:
themes_df.head()

In [None]:
themes_df.shape

In [None]:
themes_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
themes_df = themes_df.rename(columns={'id': 'movie_id'})
print(f"Themes dataset columns: {', '.join(themes_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(themes_df)

In [None]:
# Check for duplicate rows
print(f"There are {themes_df.duplicated().sum()} duplicated rows")

In [None]:
# Identifying invalid foreign keys
invalid_values = themes_df[~themes_df['movie_id'].isin(movies_df.index)]

print(f"There are {len(invalid_values)} rows with invalid foreign keys")

invalid_values.head()

In [None]:
# Check for unique values
print(f"There are {len(themes_df['theme'].unique())} unique themes")

# Setting the correct type for columns
themes_df['theme'] = themes_df['theme'].astype('category')

There are a limited number of values for the theme, which means the column can be set as a categorical variable to optimize the analysis.

### 3. Final Result
The **themes** dataset is linked to the movies dataset through the `movie_id` column. It contains 125,000 rows, with multiple occurrences referring to the same movie, suggesting that themes are not frequently assigned to movies. Each theme describes a movie using a few standard phrases.

In [None]:
themes_df.head()

In [None]:
themes_df.shape

## The Oscar Awards

In [None]:
# Import 'the_oscar_awards.csv' dataset
oscars_df = pd.read_csv('datasets/the_oscar_awards.csv')

### 1. Data Understanding

In [None]:
oscars_df.head()

In [None]:
oscars_df.shape

In [None]:
oscars_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
oscars_df = oscars_df.rename(columns={'ceremony': 'number_ceremony', 'year_film': 'year_movie', 'name': 'nominee_name', 'film': 'nominee_movie', 'winner': 'is_winner'})
print(f"Oscar dataset columns: {', '.join(oscars_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(oscars_df)

In [None]:
# Check for duplicate rows
print(f"There are {oscars_df.duplicated().sum()} duplicated rows")
oscars_df[oscars_df.duplicated(keep=False)].head(6)

Duplicates are retained because, in music categories, the same artists can receive identical nominations for different songs, with the song titles not specified in the dataset.

In [None]:
# Check the consistency between year_movie and year_ceremony
print(f"There are {oscars_df[oscars_df['year_movie'] > oscars_df['year_ceremony']].shape[0]} rows where the movie was released after the ceremony")

In [None]:
# Check for multiple winner possibilities
from utils.utils import special_oscar_awards

# Filtering rows where the 'category' is in the special_oscar_awards list
filtered_oscars = find_matching(oscars_df, 'category', special_oscar_awards)

# Find groups with more than one winner
multiple_winners = filtered_oscars.groupby(['year_ceremony', 'category']).filter(
    lambda x: x['is_winner'].sum() > 1
)

# Keep only the rows where 'is_winner' is True
multiple_winners = multiple_winners[multiple_winners['is_winner'] == True]
multiple_winners.head(6)

There can be multiple winners for the same Oscar category (e.g., 1932 Best Actor). <br>
In the count, we excluded special awards, such as the Jean Hersholt Humanitarian Award, because they are given without nominations and have an undefined number of winners each year. These awards are not part of traditional competitive categories but rather honorary recognitions.
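
The group-and-filter pattern used to find ties can be sketched on invented nominations as follows (the names and years are illustrative only):

```python
import pandas as pd

awards = pd.DataFrame({
    "year_ceremony": [2000, 2000, 2000, 2001, 2001],
    "category": ["ACTOR", "ACTOR", "ACTOR", "ACTOR", "ACTOR"],
    "nominee_name": ["P1", "P2", "P3", "P4", "P5"],
    "is_winner": [True, True, False, True, False],
})

# Keep only the (ceremony, category) groups with more than one winner
tied_groups = awards.groupby(["year_ceremony", "category"]).filter(
    lambda group: group["is_winner"].sum() > 1
)

# Within those groups, keep the winning rows
tied_winners = tied_groups[tied_groups["is_winner"]]
print(tied_winners)
```

`filter` returns all rows of the qualifying groups, which is why a second step is needed to keep only the winners.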

In [None]:
# Check for unique values
print(f"There are {len(oscars_df['category'].unique())} unique categories")

# Setting the correct type for columns
oscars_df['category'] = oscars_df['category'].astype('category')

There are a limited number of values for the `category` field, which means the column can be set as a categorical variable to optimize the analysis.

### 3. Final Result
The Oscar dataset is not linked to the movies dataset in any way, and it cannot be automatically connected due to the absence of a unique identifier for the movies. <br>
This dataset includes all nominations and winners of the Oscars, both ordinary and special, starting from the first ceremony.

In [None]:
oscars_df.head()

In [None]:
oscars_df.shape

## Rotten Tomatoes Reviews

In [None]:
# Import 'rotten_tomatoes_reviews.csv' dataset
reviews_df = pd.read_csv('datasets/rotten_tomatoes_reviews.csv')

### 1. Data Understanding

In [None]:
reviews_df.head()

In [None]:
reviews_df.shape

In [None]:
reviews_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
reviews_df = reviews_df.rename(columns={'review_type': 'type', 'review_score': 'score', 'review_content': 'content', 'top_critic': 'is_top_critic'})
print(f"Rotten Tomatoes Reviews dataset columns: {', '.join(reviews_df.columns)}")

In [None]:
# Check for null values
summarize_nulls(reviews_df)

After checking the Rotten Tomatoes website, it's fine to have the publisher name, critic name, and content as null values.

In [None]:
# Check for duplicate rows
print(f"There are {reviews_df.duplicated().sum()} duplicated rows")

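# Duplicates with a named critic are errors; those with an unspecified critic are legitimate
has_critic = reviews_df['critic_name'].notna()
filtered_df = reviews_df[has_critic & reviews_df.duplicated(keep=False)]

# Drop repeated rows that have a named critic, keeping the first occurrence
reviews_df = reviews_df.drop(reviews_df[has_critic & reviews_df.duplicated()].index)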

filtered_df.head()

There are many duplicate reviews in the dataset. According to the official website, the same movie can legitimately have multiple reviews from the same publisher with an unspecified critic. These rows are excluded from the duplicates listed above and are not removed. All other duplicate rows are removed.
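
The rule described above (keep repeated rows whose critic is unspecified, remove the other repeats) can be expressed as a boolean mask. This is a minimal sketch on invented rows:

```python
import numpy as np
import pandas as pd

reviews = pd.DataFrame({
    "critic_name": ["Ann", "Ann", np.nan, np.nan, "Bob"],
    "publisher_name": ["X", "X", "Y", "Y", "Z"],
    "content": ["Good", "Good", "Fine", "Fine", "Bad"],
})

# A row is removable when it repeats an earlier row AND names a critic
removable = reviews["critic_name"].notna() & reviews.duplicated()

deduplicated = reviews[~removable].reset_index(drop=True)
print(f"Removed {removable.sum()} row(s)")
print(deduplicated)
```

Both rows with a missing critic survive, while only the first of the two identical "Ann" rows is kept.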

In [None]:
# Check for unique values
print(f"There are {len(reviews_df['type'].unique())} review types")

# Setting the correct type for columns
reviews_df['type'] = reviews_df['type'].astype('category')
reviews_df['review_date'] = pd.to_datetime(reviews_df['review_date'], format='%Y-%m-%d')

A review can be either Fresh or Rotten, so the `type` column is set as a categorical variable.

The `rotten_tomatoes_link` column is not useful for visualization or statistical purposes, so it can be deleted.

In [None]:
# Reset indexing after removing rows
reviews_df = reviews_df.reset_index(drop=True)

### 3. Final Result
The review dataset is not linked to the movies dataset in any way, and it cannot be automatically connected due to the absence of a unique identifier for the movies. <br>
This dataset contains reviews collected from the Rotten Tomatoes review aggregator website, featuring reviews from various publishers and critics. Each review is categorized as either "Fresh" or "Rotten," which are equivalent to "Positive" and "Negative," respectively.


In [None]:
reviews_df.head()

In [None]:
reviews_df.shape

---
## Save the clean datasets to new `.csv` files

In [None]:
# Write the clean datasets to new csv files
if PRINT_CSV:
    movies_df.to_csv('clean_datasets/movies.csv')
    lang_df.to_csv('clean_datasets/languages.csv', index=False)
    actors_df.to_csv('clean_datasets/actors.csv', index=False)
    countries_df.to_csv('clean_datasets/countries.csv', index=False)
    crew_df.to_csv('clean_datasets/crew.csv', index=False)
    genres_df.to_csv('clean_datasets/genres.csv', index=False)
    releases_df.to_csv('clean_datasets/releases.csv', index=False)
    studios_df.to_csv('clean_datasets/studios.csv', index=False)
    themes_df.to_csv('clean_datasets/themes.csv', index=False)
    oscars_df.to_csv('clean_datasets/oscars.csv', index=False)
    reviews_df.to_csv('clean_datasets/reviews.csv', index=False)
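
Two practical points about this step are worth a sketch: `to_csv` does not create a missing output folder, and CSV files store plain text, so the category and datetime types set during cleaning have to be restored when the files are read back. The example below uses a temporary directory and an invented frame; the column names mirror the releases dataset only for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

clean = pd.DataFrame({
    "movie_id": [1, 2],
    "date": pd.to_datetime(["2020-01-31", "2021-06-15"]),
    "distribution_format": pd.Series(["Theatrical", "Digital"], dtype="category"),
})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "clean_datasets"
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder before writing
    path = out_dir / "releases.csv"
    clean.to_csv(path, index=False)

    # Without hints, dates and categories come back as plain text
    naive = pd.read_csv(path)

    # Restore the cleaned types explicitly
    restored = pd.read_csv(
        path,
        parse_dates=["date"],
        dtype={"distribution_format": "category"},
    )

print(naive.dtypes)
print(restored.dtypes)
```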