# Task

**Preprocess the credits dataset for the upcoming EDA**: remove duplicates, handle missing values and mistyped columns, and remove outliers.

The movies dataset -> https://www.kaggle.com/rounakbanik/the-movies-dataset

# 1. Loading the data

>## First, we import the libraries we need
>- `Path` for file path handling
>- pandas and NumPy for dataframe manipulation
>- seaborn and matplotlib for data visualization

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

>## We set up the data directories

In [2]:
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

>## We load the credits dataset into a dataframe using pandas

In [3]:
raw_credits_df = pd.read_csv(RAW_DIR / 'credits.csv')

display(raw_credits_df.sample(3))

Unnamed: 0,cast,crew,id
12061,"[{'cast_id': 13, 'character': ""Thomas B. 'Tom'...","[{'credit_id': '52fe4962c3a368484e12896f', 'de...",77210
10623,"[{'cast_id': 1, 'character': 'Donald Morton', ...","[{'credit_id': '52fe446ac3a368484e021ce9', 'de...",23478
1163,"[{'cast_id': 4, 'character': 'Alexander DeLar...","[{'credit_id': '52fe4224c3a36847f80071db', 'de...",185


# 2. Preprocessing of the data 

The dataset contains cast and crew information for all movies, stored as stringified JSON objects.<br>
Column contents:

- **cast:** The group of actors who make up a film or stage play.
- **crew:** The group of people hired by a production company to produce a film or motion picture. The crew is distinguished from the cast, who are understood to be the actors who appear in front of the camera or provide voices for characters in the film.
- **id:** The id of the movie. Movie ids are consistent between `ratings_small.csv`, `movies_metadata.csv`, `credits.csv`, `keywords.csv` and `links_small.csv` (i.e., the same id refers to the same movie across these data files).


## 2.1. Information on the data

In [4]:
raw_credits_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45476 entries, 0 to 45475
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   cast    45476 non-null  object
 1   crew    45476 non-null  object
 2   id      45476 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 1.0+ MB


In [5]:
# Check whether there are any duplicate rows
raw_credits_df.duplicated().value_counts()

False    45439
True        37
dtype: int64

> The credits table:
>- has 45476 rows and 3 columns:
    >    - no missing values
    >    - 37 duplicated rows
>- has 1 integer column and 2 object columns
>- the 2 object columns store their information as a stringified list of dicts (JSON-like) **(we will extract that information)**

## 2.2. Handling duplicate rows and missing, incorrect, or invalid data

#### 2.2.1. Duplicate rows

In [6]:
# remove the duplicate rows and create processed_credits_df
processed_credits_df = raw_credits_df.drop_duplicates().copy()
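Note that `drop_duplicates()` only removes rows that are identical in every column. As an optional check that is not part of this notebook, the sketch below looks for ids that still appear more than once after exact duplicates are removed. The toy frame and the names `toy_df`, `deduped`, and `repeated_ids` are made up for illustration.

```python
import pandas as pd

# Toy frame (made-up values): two exact duplicates plus one row that
# shares the same id but has a different cast string.
toy_df = pd.DataFrame({
    "cast": ["A|B", "A|B", "A|C"],
    "crew": ["X", "X", "X"],
    "id": [1, 1, 1],
})

# Exact duplicates are removed, as in the cell above.
deduped = toy_df.drop_duplicates()

# Rows whose id still appears more than once after that step.
repeated_ids = deduped[deduped.duplicated(subset="id", keep=False)]
print(repeated_ids)
```

If such rows exist in the real data, whether to keep the first one or merge them is a judgment call left for the EDA.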

#### 2.2.2. Missing data

In [7]:
# No missing data

#### 2.2.3. Mistyped columns

In [8]:
# Inspect the first raw value of the cast column
processed_credits_df['cast'].iloc[0]

"[{'cast_id': 14, 'character': 'Woody (voice)', 'credit_id': '52fe4284c3a36847f8024f95', 'gender': 2, 'id': 31, 'name': 'Tom Hanks', 'order': 0, 'profile_path': '/pQFoyx7rp09CJTAb932F2g8Nlho.jpg'}, {'cast_id': 15, 'character': 'Buzz Lightyear (voice)', 'credit_id': '52fe4284c3a36847f8024f99', 'gender': 2, 'id': 12898, 'name': 'Tim Allen', 'order': 1, 'profile_path': '/uX2xVf6pMmPepxnvFWyBtjexzgY.jpg'}, {'cast_id': 16, 'character': 'Mr. Potato Head (voice)', 'credit_id': '52fe4284c3a36847f8024f9d', 'gender': 2, 'id': 7167, 'name': 'Don Rickles', 'order': 2, 'profile_path': '/h5BcaDMPRVLHLDzbQavec4xfSdt.jpg'}, {'cast_id': 17, 'character': 'Slinky Dog (voice)', 'credit_id': '52fe4284c3a36847f8024fa1', 'gender': 2, 'id': 12899, 'name': 'Jim Varney', 'order': 3, 'profile_path': '/eIo2jVVXYgjDtaHoF19Ll9vtW7h.jpg'}, {'cast_id': 18, 'character': 'Rex (voice)', 'credit_id': '52fe4284c3a36847f8024fa5', 'gender': 2, 'id': 12900, 'name': 'Wallace Shawn', 'order': 4, 'profile_path': '/oGE6JqPP2xH4t

> We will use our own `extract_name` function to parse these stringified lists
>    - `extract_name` is defined in the `handle_json` file inside `package_folder`
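The source of `extract_name` is not shown in this notebook. Judging only from the outputs further down (names joined with `|`, and null values for empty lists), a minimal sketch of such a helper might look like the following. The name `extract_name_sketch`, its `sep` parameter, and the choice of `ast.literal_eval` are assumptions, not the actual implementation.

```python
import ast

import numpy as np
import pandas as pd


def extract_name_sketch(raw, sep="|"):
    """Hypothetical stand-in for extract_name: join the 'name' fields
    of a stringified list of dicts into one separated string."""
    try:
        # The values use single-quoted Python literals, so
        # ast.literal_eval is used here rather than json.loads.
        records = ast.literal_eval(raw)
    except (ValueError, SyntaxError, TypeError):
        return np.nan
    if not isinstance(records, list):
        return np.nan
    names = [rec["name"] for rec in records
             if isinstance(rec, dict) and "name" in rec]
    # An empty list becomes NaN so that isna() can flag it afterwards.
    return sep.join(names) if names else np.nan


# Shortened, made-up examples in the same shape as the raw column.
sample = pd.Series([
    "[{'cast_id': 14, 'name': 'Tom Hanks'}, {'cast_id': 15, 'name': 'Tim Allen'}]",
    "[]",
])
extracted = sample.apply(extract_name_sketch)
print(extracted)
```

Returning NaN for an empty list is a design choice in this sketch: it lets the later `isna()` check surface movies with no cast or crew listed.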

In [9]:
# importing the function
import sys
sys.path.append('../package_folder')

from handle_json import extract_name

In [10]:
# extract the cast names and store them back in the column
processed_credits_df['cast'] = processed_credits_df['cast'].apply(extract_name)

In [11]:
# same for the crew names
processed_credits_df['crew'] = processed_credits_df['crew'].apply(extract_name)

In [12]:
processed_credits_df.sample(3)

Unnamed: 0,cast,crew,id
31270,Reese C. Hartwig|James Hong|Grey Griffin|Tim M...,Spike Brandt|Tony Cervone|James Krieg|Tony Cer...,343977
11894,Daniel Schlachet|Craig Chester,Tom Kalin,54934
32850,Brock Peters|Raymond St. Jacques|Melba Moore|C...,Daniel Mann,231392


> We need to check whether the extraction introduced any null values

In [13]:
processed_credits_df.isna().sum()

cast    2414
crew     771
id         0
dtype: int64

> We now have null values because some movies have an empty cast or crew list. Since the proportion is low (about 5% of rows for cast and under 2% for crew), we will drop those rows
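One way to quantify "low" is to compute the share of missing values per column and the share of rows that `dropna()` would remove. The sketch below does this on a small made-up frame; `toy_df`, `missing_share`, and `rows_dropped_share` are illustrative names, and the same two expressions could be applied to `processed_credits_df`.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for processed_credits_df (values are made up).
toy_df = pd.DataFrame({
    "cast": ["A|B", np.nan, "C", "D"],
    "crew": ["X", "Y", np.nan, "Z"],
    "id": [1, 2, 3, 4],
})

# Share of missing values in each column.
missing_share = toy_df.isna().mean()

# Share of rows with at least one missing value,
# i.e. the rows that dropna() would remove.
rows_dropped_share = toy_df.isna().any(axis=1).mean()

print(missing_share)
print(rows_dropped_share)
```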

In [14]:
processed_credits_df = processed_credits_df.dropna()

In [15]:
# For convenience, rename the id column to movieId
processed_credits_df = processed_credits_df.rename(columns={"id": "movieId"})


In [16]:
# We do not explode the cast and crew columns now; we will do it during the EDA if needed
processed_credits_df.to_csv(PROCESSED_DIR / 'credits.csv', index=False)
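If the EDA does need one row per person, a possible approach is to split the `|`-separated strings and use `DataFrame.explode`. This is only a sketch on a toy frame: the first row reuses values from the sample output above, the second row is made up, and `long_cast_df` is an illustrative name.

```python
import pandas as pd

# Toy frame in the same shape as the processed credits data.
toy_df = pd.DataFrame({
    "cast": ["Daniel Schlachet|Craig Chester", "Actor A|Actor B"],
    "crew": ["Tom Kalin", "Director A"],
    "movieId": [54934, 1],
})

# Split the pipe-separated cast string into a list, then emit
# one row per cast member while repeating the other columns.
long_cast_df = (
    toy_df
    .assign(cast=toy_df["cast"].str.split("|"))
    .explode("cast")
    .reset_index(drop=True)
)
print(long_cast_df)
```

Keeping the saved CSV in the compact one-row-per-movie form and exploding only on demand avoids multiplying the row count before it is needed.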