# Data Cleaning
Each dataset (csv file) has its own section.
Every section contains:
1. **Data Understanding**: a first look at the dataset.
2. **Data Cleaning**: handling NaN values and duplicates, setting types and renaming columns. <br>
   (Optional) **Deep Clean**: custom cleaning for a specific dataset, if needed.
3. **Final Result**: shows the final result and saves the clean dataset to a new .csv file.
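
The checks repeated in every section can be summarised in one small helper. This is only a sketch: `quick_overview` and the toy data are hypothetical and are not part of `utils.utils`.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> dict:
    """Collect the basic facts checked in every section (hypothetical helper)."""
    return {
        "shape": df.shape,
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


# Toy data standing in for one of the csv files
sample = pd.DataFrame({"id": [1, 2, 2, 3], "name": ["A", "B", "B", None]})
overview = quick_overview(sample)
print(overview)
```

The notebook keeps the individual calls in separate cells so that each result can be inspected on its own.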

First, the library imports and options:

In [None]:
import pandas as pd
import numpy as np

from utils.utils import find_matching

# Set to True to write the cleaned data to new csv files
PRINT_CSV = False

## Movies

In [None]:
# Import 'movies.csv' dataset
movies_df = pd.read_csv('datasets/movies.csv')

### 1. Data Understanding

In [None]:
movies_df.head()

In [None]:
movies_df.shape

In [None]:
movies_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
movies_df = movies_df.rename(columns={'name': 'title', 'minute': 'duration_in_minutes', 'date': 'release_year'})
movies_df.columns

In [None]:
# Check for null values
movies_df.isna().sum()

Most columns contain null values.
Nulls in '**release_year**', '**tagline**', '**description**', '**duration_in_minutes**' and '**rating**' don't cause any problems, so those rows are kept. The few movies without a title can't be used and will be removed.

In [None]:
# Removing rows with null title
no_title = movies_df[movies_df['title'].isna()]
movies_df = movies_df.dropna(subset=['title'])
no_title

In [None]:
# Check for duplicate rows
movies_df.duplicated().sum()

In [None]:
# If the 'id' column has only unique values, it can be used as the index
duplicates_id = movies_df['id'].duplicated().sum()
print("'id' duplicates:", duplicates_id)
if duplicates_id == 0:
    movies_df = movies_df.set_index('id')

In [None]:
# Setting the correct type for columns
movies_df['release_year'] = movies_df['release_year'].astype('Int64')
movies_df['duration_in_minutes'] = movies_df['duration_in_minutes'].astype('Int64')
movies_df[['release_year', 'duration_in_minutes']].dtypes

#### Deep Clean
Let's look at the most frequent values in some columns.

In [None]:
movies_df['description'].value_counts().head(10)

Many rows seem to hold a placeholder such as "Plot Unavailable" instead of a null description. The other fields seem fine.<br>
Let's fix as many as possible (only the most frequent variations are handled, so this is not 100% accurate).
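
The `find_matching` helper lives in `utils.utils` and its code is not shown in this notebook. The sketch below is a guess at what a helper of this kind could do: a case-insensitive keyword search with an optional length cap. The name `find_matching_sketch`, the keywords and the toy data are all assumptions, and the real helper may behave differently.

```python
import re

import pandas as pd


def find_matching_sketch(df, column, keywords, max_length=None):
    """Return the rows whose `column` contains any keyword (case-insensitive).

    Hypothetical stand-in for `utils.utils.find_matching`.
    """
    pattern = "|".join(re.escape(k) for k in keywords)
    values = df[column]
    mask = values.str.contains(pattern, case=False, na=False)
    if max_length is not None:
        # Long texts that merely mention a keyword are probably real descriptions
        mask = mask & (values.fillna("").str.len() <= max_length)
    return df[mask]


toy = pd.DataFrame({
    "description": [
        "Plot unavailable",
        "No plot available.",
        "A detective finds the plot unavailable in the archives and investigates.",
        None,
    ]
})
hits = find_matching_sketch(
    toy, "description", ["plot unavailable", "no plot"], max_length=30
)
print(hits)
```

In this sketch the length cap is what protects the third row: it contains a keyword but is too long to be a placeholder.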

In [None]:
from utils.utils import null_movie_description_keywords

# Find null description variations
matches = find_matching(movies_df, 'description', null_movie_description_keywords, max_length=30).copy()

# Replace the placeholders with NaN in the original dataframe
# (assumes find_matching returns the matching rows with their original index)
movies_df.loc[matches.index, 'description'] = np.nan

# Manual check to make sure no real descriptions were overwritten
matches['description'].value_counts()

### 3. Final Result

All datasets reference the **movies** dataset. A movie is uniquely identified by its **id**, and a movie id can occur multiple times in the other datasets.
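
Because the other datasets point to movies by id, it can be useful to check that every reference still resolves, for example after rows have been removed from the movies dataset. A minimal sketch on toy data follows; the frames and values are made up for illustration.

```python
import pandas as pd

# Toy stand-ins: movies indexed by id, and a child table with a movie_id column
movies = pd.DataFrame(
    {"title": ["Alpha", "Beta", "Gamma"]},
    index=pd.Index([10, 20, 30], name="id"),
)
languages = pd.DataFrame({
    "movie_id": [10, 10, 20, 99],
    "language": ["English", "French", "Italian", "German"],
})

# Rows whose movie_id has no matching movie (orphans)
orphans = languages[~languages["movie_id"].isin(movies.index)]
print("Orphan rows:", len(orphans))
```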

In [None]:
movies_df.head()

In [None]:
movies_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    movies_df.to_csv('clean_datasets/movies.csv')

# Free memory
movies_df = None

## Languages

In [None]:
# Import 'languages.csv' dataset
lang_df = pd.read_csv('datasets/languages.csv')

### 1. Data Understanding

In [None]:
lang_df.head()

In [None]:
lang_df.shape

In [None]:
lang_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
lang_df = lang_df.rename(columns={'id': 'movie_id'})

In [None]:
# Check for null values
lang_df.isna().sum()

In [None]:
# check for duplicate row
lang_df.duplicated().sum()

In [None]:
# Setting the category data type for column 'type'
lang_types = lang_df['type'].unique()
lang_df['type'] = lang_df['type'].astype('category')
lang_types

### 3. Final Result
The languages dataset is directly connected to the movies dataset through the movie_id column. It has more rows than the movies dataset because a movie can have several languages. Also, not every movie has a language defined.
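
The one-to-many relationship described above can be inspected by counting rows per movie. This sketch uses toy data, and the `language` column name is an assumption.

```python
import pandas as pd

movie_ids = pd.Index([10, 20, 30], name="id")  # toy movie ids
languages = pd.DataFrame({
    "movie_id": [10, 10, 20],
    "language": ["English", "French", "Italian"],
})

# Number of language rows per movie; movies with no rows get 0
per_movie = (
    languages.groupby("movie_id").size()
    .reindex(movie_ids, fill_value=0)
)
without_language = per_movie[per_movie == 0].index.tolist()
print(per_movie.to_dict())
print("Movies without a language:", without_language)
```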

In [None]:
lang_df.head()

In [None]:
lang_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    lang_df.to_csv('clean_datasets/languages.csv')

# Free memory
lang_df = None

## Actors

In [None]:
actors_df = pd.read_csv('datasets/actors.csv')

### 1. Data Understanding

In [None]:
actors_df.head()

In [None]:
actors_df.shape

In [None]:
actors_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
actors_df = actors_df.rename(columns={'id': 'movie_id'})

In [None]:
# check NaN values
actors_df.isna().sum()

Many roles are missing, but nothing can be done about it.

In [None]:
# Few actors are without a name and can't be used. Remove them
no_name = actors_df[actors_df['name'].isna()]
actors_df = actors_df.dropna(subset=['name'])
no_name

In [None]:
# Check for duplicate rows
print('Duplicated rows:', actors_df.duplicated().sum())
actors_duplicates = actors_df[actors_df.duplicated(keep=False)].head(6)

# Dropping the duplicates
actors_df = actors_df.drop_duplicates()

actors_duplicates

#### Deep Clean

In [None]:
actors_df['role'].value_counts().head(10)

The role column has many "Self" role variations. Let's look more closely.

In [None]:
from utils.utils import self_actor_role_keywords

# Find self variation
result = find_matching(actors_df, 'role', self_actor_role_keywords)
print('Values matching:', result['role'].shape[0])
result['role'].value_counts().head(10)

There are over 300,000 values similar to "Self", but many of them also contain other information, such as "Self - Presenter" or "Self - Guest". Overwriting all of those values could lose information, so they are left unchanged in the cleaned dataset, but they might be grouped when visualizing the data for statistical purposes.
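
One way to group the "Self" variations for statistics without touching the stored values is to build the grouping on a copy. The regular expression below is a hypothetical pattern, not the content of `self_actor_role_keywords`, and the roles are toy values.

```python
import pandas as pd

roles = pd.Series([
    "Self",
    "Self - Presenter",
    "Himself",
    "Detective",
    None,
    "Herself (archive footage)",
])

# Hypothetical pattern: roles that start with a "self"-like word
self_pattern = r"^\s*(?:self|himself|herself|themselves)\b"
is_self = roles.str.contains(self_pattern, case=False, regex=True, na=False)

# Grouped view for plots only; the original `roles` Series is left unchanged
grouped = roles.where(~is_self, "Self")
print(grouped.value_counts())
```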

In [None]:
# Reset indexing after removing rows
actors_df = actors_df.reset_index(drop=True)

### 3. Final Result
The actors dataset is directly connected to the movies dataset and has almost six times as many rows. Also, a movie can have no actors connected.

In [None]:
actors_df.head()

In [None]:
actors_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    actors_df.to_csv('clean_datasets/actors.csv')

# Free memory
actors_df = None

## Countries

In [None]:
# Import 'countries.csv' dataset
countries_df = pd.read_csv('datasets/countries.csv')

### 1. Data Understanding

In [None]:
countries_df.head()

In [None]:
countries_df.shape

In [None]:
countries_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
countries_df = countries_df.rename(columns={'id': 'movie_id'})

In [None]:
# Check for null values
countries_df.isna().sum()

In [None]:
# check for duplicate row
countries_df.duplicated().sum()

### 3. Final Result

The **countries** dataset is directly connected to the movies dataset with the movie_id column as a foreign key. This dataset contains all the countries where the movies were released.

In [None]:
countries_df.head()

In [None]:
countries_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    countries_df.to_csv('clean_datasets/countries.csv')

# Free memory
countries_df = None

## Crew

In [None]:
# Import 'crew.csv' dataset
crew_df = pd.read_csv('datasets/crew.csv')

### 1. Data Understanding

In [None]:
crew_df.head()

In [None]:
crew_df.shape

In [None]:
crew_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
crew_df = crew_df.rename(columns={'id': 'movie_id', 'name': 'crew_member_name'})

In [None]:
# Check for null values
crew_df.isna().sum()

In [None]:
# check for duplicate row
print('Duplicated rows:', crew_df.duplicated().sum())
crew_duplicates = crew_df[crew_df.duplicated(keep=False)].head()

# Dropping the duplicates
crew_df = crew_df.drop_duplicates()

crew_duplicates

### 3. Final Result

The **crew** dataset is connected to the movies dataset through the 'movie_id' foreign key. It includes the names of all crew members along with their roles. A crew member can have multiple roles, but the same member cannot appear more than once with the same role in the same movie.
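
The uniqueness rule above can be checked on the combination of movie, role and name. The sketch below uses toy data and assumes the role column is called `role`.

```python
import pandas as pd

crew = pd.DataFrame({
    "movie_id": [1, 1, 1, 2],
    "role": ["Director", "Writer", "Director", "Director"],
    "crew_member_name": ["Ann Lee", "Ann Lee", "Ann Lee", "Ann Lee"],
})

key = ["movie_id", "role", "crew_member_name"]
# keep=False marks every copy, which is handy for inspection
repeated = crew[crew.duplicated(subset=key, keep=False)]
# drop_duplicates keeps the first copy of each combination
deduplicated = crew.drop_duplicates(subset=key)
print(repeated)
print(len(deduplicated))
```

The same person directing two different movies, or holding two roles in one movie, is not a duplicate under this key.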

In [None]:
crew_df.head()

In [None]:
crew_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    crew_df.to_csv('clean_datasets/crew.csv')

# Free memory
crew_df = None

## Genres

In [None]:
genres_df = pd.read_csv('datasets/genres.csv')
genres_df

In [None]:
genres_df.dtypes

In [None]:
# check for NaN values
genres_df.isna().sum()

In [None]:
# check for duplicated values
genres_df.duplicated().sum()

## Posters

In [None]:
poster_df = pd.read_csv('datasets/posters.csv')
poster_df

In [None]:
poster_df.dtypes

In [None]:
# check for NaN values
poster_df.isna().sum()
poster_df[poster_df['link'].isna()]

poster_df = poster_df.dropna()

In [None]:
# check for duplicated values
poster_df.duplicated().sum()

In [None]:
poster_df['link'].str.len().max()

## Releases

In [None]:
releases_df = pd.read_csv('datasets/releases.csv')
releases_df

In [None]:
releases_df.dtypes

In [None]:
# typing columns
releases_df['date'] = pd.to_datetime(releases_df['date'], format='%Y-%m-%d')

release_types = releases_df['type'].unique()
releases_df['type'] = releases_df['type'].astype('category')
release_types

In [None]:
# check for NaN values
releases_df.isna().sum()
# releases_df[releases_df['date'].str.len() != 10]

In [None]:
# check for duplicates values
releases_df.duplicated().sum()

## Studios

In [None]:
studios_df = pd.read_csv('datasets/studios.csv')
studios_df

In [None]:
studios_df.dtypes

In [None]:
# check for NaN values
print(studios_df.isna().sum())
missing_studios = studios_df[studios_df['studio'].isna()]

# Dropping rows with NaN values
studios_df = studios_df.dropna()

missing_studios

In [None]:
# check for duplicated values
print('Duplicated rows:', studios_df.duplicated().sum())
studios_duplicates = studios_df[studios_df.duplicated(keep=False)]

# Dropping the duplicates
studios_df = studios_df.drop_duplicates()

studios_duplicates

## Themes

In [None]:
themes_df = pd.read_csv('datasets/themes.csv')
themes_df

In [None]:
themes_df.dtypes

In [None]:
print('Unique themes:', len(themes_df['theme'].unique()))

themes_df['theme'].unique()

In [None]:
# check for NaN values
themes_df.isna().sum()

In [None]:
# check for duplicated values
themes_df.duplicated().sum()

## The Oscar Awards

In [None]:
oscars_df = pd.read_csv('datasets/the_oscar_awards.csv')
oscars_df

In [None]:
oscars_df.dtypes

In [None]:
# typing columns
print(oscars_df['ceremony'].unique())

# year_film always <= year_ceremony
# oscars_df[oscars_df['year_film'] > oscars_df['year_ceremony']]

oscars_df['category'] = oscars_df['category'].astype('category')

In [None]:
# check for NaN values
print(oscars_df.isna().sum())

oscars_df[oscars_df['category'] == "JEAN HERSHOLT HUMANITARIAN AWARD"]

In [None]:
# check for duplicated values
print('Duplicated rows:', oscars_df.duplicated().sum())

oscars_df[oscars_df.duplicated(keep=False)]

## Rotten Tomatoes Reviews

In [None]:
reviews_df = pd.read_csv('datasets/rotten_tomatoes_reviews.csv')

### 1. Data Understanding

In [None]:
reviews_df.head()

In [None]:
reviews_df.shape

In [None]:
reviews_df.dtypes

### 2. Data Cleaning

In [None]:
# Rename columns
reviews_df = reviews_df.rename(columns={
    'review_type': 'type',
    'review_score': 'score',
    'review_date': 'date',
    'review_content': 'content',
    'top_critic': 'is_top_critic',
})

In [None]:
# Check for null values
reviews_df.isna().sum()

A check on the Rotten Tomatoes website shows that it is fine for the publisher, critic name and content to be null, so these null values are kept.

In [None]:
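# Check for duplicate rows, ignoring reviews without a critic name
has_critic = reviews_df['critic_name'].notna()
filtered_df = reviews_df[has_critic & reviews_df.duplicated(keep=False)]
print('Duplicated rows:', (has_critic & reviews_df.duplicated()).sum())

# Drop the repeated rows; reviews without a critic name are kept
reviews_df = reviews_df[~(has_critic & reviews_df.duplicated())]

filtered_df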

There are many duplicate reviews in the dataset. On closer inspection, a movie can legitimately have several reviews from the same publisher with no author specified. Those rows are excluded from the duplicate count and are not removed. All other duplicate rows are removed.
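
The rule described above, removing repeats only when the critic is named, can be sketched on toy data as follows. The `movie`, `publisher_name` and `content` columns and all values are assumptions made for illustration.

```python
import pandas as pd

reviews = pd.DataFrame({
    "movie": ["m1", "m1", "m1", "m1", "m2"],
    "critic_name": ["Ann Lee", "Ann Lee", None, None, "Bo Chen"],
    "publisher_name": ["Daily", "Daily", "Weekly", "Weekly", "Daily"],
    "content": ["Great", "Great", None, None, "Fine"],
})

has_critic = reviews["critic_name"].notna()
# duplicated() flags every repeat after the first occurrence
removable = has_critic & reviews.duplicated()
cleaned = reviews[~removable]
print(cleaned)
```

Rows 2 and 3 are identical but have no critic name, so both survive. Row 1 repeats row 0 with a named critic and is dropped.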

In [None]:
# Setting the correct type for columns
reviews_df['type'] = reviews_df['type'].astype('category')
reviews_df['date'] = pd.to_datetime(reviews_df['date'], format='%Y-%m-%d')

### 3. Final Result

In [None]:
reviews_df.head()

In [None]:
reviews_df.shape

In [None]:
# Print clean dataset to new csv file
if PRINT_CSV:
    reviews_df.to_csv('clean_datasets/reviews.csv')

# Free memory
reviews_df = None