# Task to do

**Preprocess the keywords dataset for the upcoming EDA**: remove duplicates, handle missing values and mistyped columns, and remove outliers.

The movies dataset -> https://www.kaggle.com/rounakbanik/the-movies-dataset

# 1. Loading the data

>## First we import the libraries we need
>- `Path` (from `pathlib`) for file path handling
>- `pandas` and `numpy` for dataframe manipulation
>- `seaborn` and `matplotlib` for data visualisation

In [1]:
from pathlib import Path

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

>##  We organise our data directories

In [2]:
DATA_DIR = Path('../data')
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

>##  We load the keywords dataset into a dataframe using pandas

In [3]:
raw_keywords_df = pd.read_csv(RAW_DIR / 'keywords.csv')

display(raw_keywords_df.sample(3))


Unnamed: 0,id,keywords
35504,128230,[]
18000,61730,"[{'id': 10183, 'name': 'independent film'}]"
1195,218,"[{'id': 83, 'name': 'saving the world'}, {'id'..."


# 2. Preprocessing of the data 


This dataset contains the movie plot keywords for the MovieLens movies, available in the form of a stringified JSON object. <br>
Column contents:

- **id:** The id of the movie. Movie ids are consistent between `ratings_small.csv`, `movies_metadata.csv`, and `links_small.csv` (i.e., the same id refers to the same movie across these data files).
- **keywords:** A stringified list of dictionaries, each holding the `id` and `name` of one keyword of the movie identified by the `id` column.
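
To make the format concrete, here is a minimal sketch that parses one raw value copied from the sample above. It assumes the strings are Python literals (they use single quotes, so they are not strict JSON) and uses the standard library's `ast.literal_eval`; this is an illustration, not necessarily how the notebook's own helper parses them.

```python
import ast

# One raw value as it appears in the keywords column.
# It uses single quotes, so json.loads would reject it.
raw_value = "[{'id': 10183, 'name': 'independent film'}]"

# ast.literal_eval safely evaluates Python literals (lists, dicts, strings, numbers).
parsed = ast.literal_eval(raw_value)

# Each element is a dictionary with an 'id' and a 'name'.
first_name = parsed[0]['name']
print(type(parsed).__name__, first_name)

# A movie without keywords is stored as the string '[]', which parses to an empty list.
no_keywords = ast.literal_eval("[]")
```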


## 2.1. Information on the data

In [4]:
# basic info
raw_keywords_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46419 entries, 0 to 46418
Data columns (total 2 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        46419 non-null  int64 
 1   keywords  46419 non-null  object
dtypes: int64(1), object(1)
memory usage: 725.4+ KB


In [5]:
raw_keywords_df.duplicated().value_counts()

False    45432
True       987
dtype: int64
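
The count above only covers rows that are identical in every column. As a sketch on made-up toy data (the values below are illustrative, not taken from the dataset), the difference between full-row duplicates and repeated `id` values looks like this:

```python
import pandas as pd

# Toy data (hypothetical values): id 2 appears twice with identical keywords,
# id 3 appears twice with different keywords.
toy_df = pd.DataFrame({
    'id': [1, 2, 2, 3, 3],
    'keywords': [
        '[]',
        "[{'id': 1, 'name': 'a'}]",
        "[{'id': 1, 'name': 'a'}]",
        '[]',
        "[{'id': 2, 'name': 'b'}]",
    ],
})

# Default keep='first': the first copy is kept, later identical rows are flagged.
full_row_dupes = int(toy_df.duplicated().sum())

# Restricting the check to 'id' also flags ids that repeat with different keywords.
id_dupes = int(toy_df.duplicated(subset='id').sum())

print(full_row_dupes, id_dupes)
```

If the second number were larger than the first on the real data, some movie ids would still be repeated after `drop_duplicates()`; whether that happens here is not something the output above tells us.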

In [6]:
raw_keywords_df.sample(4)

Unnamed: 0,id,keywords
7055,3022,"[{'id': 848, 'name': 'double life'}, {'id': 15..."
26729,47902,"[{'id': 10183, 'name': 'independent film'}]"
14219,90056,"[{'id': 596, 'name': 'adultery'}, {'id': 9748,..."
5289,10394,"[{'id': 378, 'name': 'prison'}, {'id': 585, 'n..."


> The keywords table has:
    >- 46419 rows and 2 columns:
    >    - all non-null **(we don't have missing values)**
    >    - 987 duplicate rows **(we will drop them)**
    >- 1 integer column (`id`) and 1 object column (`keywords`)
    >- some `keywords` values are empty lists (`[]`)
## 2.2. Handling duplicate rows and missing, incorrect and invalid data

#### 2.2.1. Duplicate rows

In [7]:
# remove the duplicate rows and start building processed_keywords_df
processed_keywords_df = raw_keywords_df.drop_duplicates().copy()

#### 2.2.2. Missing data

In [8]:
# No missing data

#### 2.2.3. Mistyped columns

In [9]:
# Let us see what is inside 
processed_keywords_df['keywords'].tolist()[0]

"[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}, {'id': 5202, 'name': 'boy'}, {'id': 6054, 'name': 'friendship'}, {'id': 9713, 'name': 'friends'}, {'id': 9823, 'name': 'rivalry'}, {'id': 165503, 'name': 'boy next door'}, {'id': 170722, 'name': 'new toy'}, {'id': 187065, 'name': 'toy comes to life'}]"

> We will use our own `extract_name` function to handle the stringified JSON
>    - `extract_name` is defined in the `handle_json` module in `package_folder`
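
The source of `extract_name` is not shown in this notebook. Judging only from its output further down (names joined with `|`), a helper of this kind might look roughly like the sketch below. The name `extract_name_sketch`, the `sep` parameter, and the choice to return an empty string for an empty list and `NaN` for unparsable input are all assumptions made for illustration.

```python
import ast

import numpy as np


def extract_name_sketch(raw_value, sep='|'):
    """Hypothetical stand-in for extract_name: join the 'name' fields with sep."""
    try:
        # The column holds Python-literal strings, e.g. "[{'id': 931, 'name': 'jealousy'}]".
        items = ast.literal_eval(raw_value)
    except (ValueError, SyntaxError, TypeError):
        # Unparsable input (or a non-string such as NaN) is treated as missing.
        return np.nan
    if not isinstance(items, list):
        return np.nan
    names = [item['name'] for item in items if isinstance(item, dict) and 'name' in item]
    # An empty list yields '' here (an assumption about the real helper's behaviour).
    return sep.join(names)


example = extract_name_sketch("[{'id': 931, 'name': 'jealousy'}, {'id': 4290, 'name': 'toy'}]")
print(example)
```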

In [10]:
# importing the function
import sys
sys.path.append('../package_folder')

from handle_json import extract_name

In [11]:
# extract the keyword names and store them back in the keywords column
processed_keywords_df['keywords'] = processed_keywords_df['keywords'].apply(extract_name)

In [12]:
processed_keywords_df.sample(3)

Unnamed: 0,id,keywords
24911,183111,
29897,118612,coma|crash|accepting death
23757,242095,hacker|supernatural powers|road trip|independe...


In [13]:
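# Rows without keywords cannot be filled in because we have no other information, so we drop them
# (assumption: extract_name may return '' instead of NaN for an empty list, so we filter out both)
processed_keywords_df = processed_keywords_df.dropna(subset=['keywords'])
processed_keywords_df = processed_keywords_df[processed_keywords_df['keywords'] != '']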
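
Whether `dropna()` removes the keyword-less rows depends on what `extract_name` returns for an empty list, which this notebook does not show. A small sketch on toy data, assuming such rows may hold either an empty string or `NaN`:

```python
import numpy as np
import pandas as pd

# Toy frame: one real value, one empty string, one NaN.
toy_df = pd.DataFrame({'id': [1, 2, 3], 'keywords': ['coma|crash', '', np.nan]})

# dropna() only removes NaN; the empty string is a valid (non-null) value.
after_dropna = toy_df.dropna()

# Filtering on string length removes the empty string as well.
cleaned = after_dropna[after_dropna['keywords'].str.len() > 0]

print(len(toy_df), len(after_dropna), len(cleaned))
```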

In [14]:
# Exploding the keywords into one row per keyword is left for the EDA (kept here for reference)
# processed_keywords_df.keywords = processed_keywords_df.keywords.str.split('|')
# processed_keywords_df = processed_keywords_df.explode('keywords')
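
For reference, this is what the commented-out split-and-explode step would do, sketched on a toy frame (the values are illustrative):

```python
import pandas as pd

# Toy frame in the same shape as the processed data: one '|'-joined string per movie.
toy_df = pd.DataFrame({'id': [1, 2], 'keywords': ['coma|crash', 'hacker']})

exploded = (
    toy_df
    # Turn each '|'-joined string into a list of keywords.
    .assign(keywords=toy_df['keywords'].str.split('|'))
    # Produce one row per (id, keyword) pair.
    .explode('keywords')
    .reset_index(drop=True)
)

print(exploded)
```

The exploded form is convenient for counting keyword frequencies, at the cost of repeating each `id` once per keyword.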

In [15]:
# For convenience we will rename id column to movieId
processed_keywords_df = processed_keywords_df.rename(columns={"id": "movieId"})


In [16]:
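# We won't explode the keywords now; we will do it later during the EDA if needed
# We stop here and save our preprocessed data
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)  # make sure the output directory exists
processed_keywords_df.to_csv(PROCESSED_DIR / 'keywords.csv', index=False)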
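
A quick way to check the saved file is to read it back. The sketch below does this with a toy frame and a temporary directory, so it does not touch the real `PROCESSED_DIR`; the values are illustrative.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame in the final shape: movieId plus '|'-joined keywords (one row left empty on purpose).
toy_df = pd.DataFrame({'movieId': [1, 2], 'keywords': ['coma|crash', '']})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / 'keywords.csv'
    toy_df.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Column names and ids survive the round trip.
same_columns = reloaded.columns.tolist() == toy_df.columns.tolist()
same_ids = reloaded['movieId'].tolist() == toy_df['movieId'].tolist()

# With default settings, read_csv turns an empty field into NaN.
empty_became_nan = reloaded['keywords'].isna().tolist()

print(same_columns, same_ids, empty_became_nan)
```

Because empty fields come back as `NaN`, any keyword-less row still present at save time would show up as a missing value when the file is loaded for the EDA.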