# Table of Contents
<a id='table-of-contents'></a>
- [1 Introduction](#1)
- [2 Data Preparation](#2)
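    - [2.1 Data cleaning](#2.1)
        - [2.1.1 Converting Cast, Genre, and Tags](#2.1.1)
        - [2.1.2 Converting Content Rating](#2.1.2)
        - [2.1.3 Converting Duration](#2.1.3)
        - [2.1.4 Converting Network](#2.1.4)
        - [2.1.5 Converting Aired On](#2.1.5)
        - [2.1.6 Converting Aired Date](#2.1.6)
        - [2.1.7 Last Tidy](#2.1.7)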
     

[back to top](#table-of-contents)
<a id='1'></a>
# 1. Introduction
In this notebook, the top 100 K-dramas CSV is cleaned and prepared so that I can run some EDA on it and eventually build a recommender system from it.
## 1.1 Preloading packages

In [None]:
#core packages
import os 
import numpy as np
import pandas as pd
import warnings

#visualisation
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib import ticker
import plotly.express as px
import seaborn as sns
plt.rcParams['figure.dpi'] = 600
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:f}'.format)
warnings.filterwarnings('ignore')

#reduce memory usage conversions
def reduce_mem_usage(df, verbose=True):
    numerics = ['int8','int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df
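
The helper above picks the narrowest dtype whose range covers a column's minimum and maximum. Below is a minimal sketch of the ranges those checks compare against, and of the precision cost of `float16`. The value 9.2 is a hypothetical rating used for illustration, not a value taken from the dataset.

```python
import numpy as np

# Integer ranges that the helper compares the column min/max against
int8_range = (np.iinfo(np.int8).min, np.iinfo(np.int8).max)
int16_range = (np.iinfo(np.int16).min, np.iinfo(np.int16).max)

# float16 covers a fairly wide range but keeps only about 3 significant digits
float16_max = float(np.finfo(np.float16).max)

# Hypothetical rating: it fits the float16 range, yet is not stored exactly
rating = 9.2
stored = float(np.float16(rating))
print(int8_range, int16_range, float16_max, stored)
```

A range check alone says nothing about precision, so it may be worth keeping columns such as ratings at `float32` or wider if exact decimals matter later on.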

[back to top](#table-of-contents)
<a id = '2'></a>
# 2. Data Preparation

In [None]:
data = pd.read_csv('../input/top-100-korean-drama-mydramalist/top100_kdrama.csv')
print(data.shape)
data.head()

**Notes on data prep to be done:**
- `Aired date` should be split into two columns: `Aired on`, `Final Episode` or something similar
- `Network` should be split into `Network 1`, `Network 2`, etc.
- `Duration` should be changed from `X hr X min` to an integer number of minutes.
- `Content Rating` should be just a number; an extra column explaining it is possible but not really necessary
- `Cast` can be split into multiple columns
- `Genre` split into multiple columns
- `Tags` into multiple columns 
- Any dates changed to proper datetime format 

In [None]:
data.dtypes

In [None]:
print(f'Number of rows: {data.shape[0]};  Number of columns: {data.shape[1]}; No of missing values: {sum(data.isna().sum())}')

No missing values, nice.

<a id='2.1'></a>
## 2.1 Data cleaning
<a id = '2.1.1'></a>
### 2.1.1 Converting Cast, Genre, and Tags

In [None]:
#GENRE Encoding
# Collect the sorted list of unique genre labels
g = sorted({genre.strip() for genres in data['Genre'] for genre in genres.split(',')})

# One indicator column per genre, initialised to 0
for genre in g:
    data[f'Genre_{genre}'] = np.zeros(len(data), dtype = int)
# Flag every genre a drama lists; .loc avoids chained assignment
for idx, genres in data['Genre'].items():
    for genre in genres.split(','):
        data.loc[idx, f'Genre_{genre.strip()}'] = 1
data.drop(['Genre'], axis = 1, inplace = True)

#TAG Encoding
# Collect the sorted list of unique tags
t = sorted({tag.strip() for tags in data['Tags'] for tag in tags.split(',')})

# One indicator column per tag, initialised to 0
for tag in t:
    data[f'Tag_{tag}'] = np.zeros(len(data), dtype = int)
# Flag every tag a drama lists; .loc avoids chained assignment
for idx, tags in data['Tags'].items():
    for tag in tags.split(','):
        data.loc[idx, f'Tag_{tag.strip()}'] = 1
data.drop(['Tags'], axis = 1, inplace = True)

#CAST Encoding
# Collect the sorted list of unique cast members
c = sorted({cas.strip() for cast in data['Cast'] for cas in cast.split(',')})

def actor_col(name):
    # Column name for an actor: lower case, spaces replaced by underscores
    return str(name).replace(" ", "_").lower()

# One indicator column per cast member, initialised to 0
for actor in c:
    data[actor_col(actor)] = np.zeros(len(data), dtype = int)
# Flag every cast member of a drama; .loc avoids chained assignment
for idx, cast in data['Cast'].items():
    for act in cast.split(','):
        data.loc[idx, actor_col(act.strip())] = 1
data.drop(['Cast'], axis = 1, inplace = True)

print(data.shape)
data.head()

The dataset is now much wider, with one indicator column per genre, tag, and cast member.
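
For comparison, pandas can build the same kind of indicator columns in one step with `Series.str.get_dummies`. The sketch below uses a made-up toy frame rather than the real CSV, so the genre values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the 'Genre' column (values are made up for illustration)
toy = pd.DataFrame({'Genre': ['Romance, Drama', 'Thriller,  Drama', 'Comedy']})

# Normalise the separators, then build one 0/1 column per label
genre_flags = (
    toy['Genre']
    .str.replace(r'\s*,\s*', ',', regex=True)
    .str.strip()
    .str.get_dummies(sep=',')
    .add_prefix('Genre_')
)
toy = pd.concat([toy.drop(columns='Genre'), genre_flags], axis=1)
print(toy)
```

The same pattern would apply to tags and cast with a different prefix; whether it is preferable to the explicit loops is mostly a matter of taste at this data size.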

<a id = '2.1.2'></a>
### 2.1.2 Converting Content Rating
Let's look at the unique values of `Content Rating` and decide what to do. One option is an integer-encoded column, so that instead of 15, 18, etc. we have 1, 2, and so on.

In [None]:
data['Content Rating'].value_counts() # After looking at this, I want to convert it to an integer-encoded column.

In [None]:
from sklearn.preprocessing import OrdinalEncoder
enc = OrdinalEncoder()
Rating = np.array(data['Content Rating'])
# OrdinalEncoder expects 2D input and returns a 2D array, so flatten the result back to 1D
data['Content Rating'] = enc.fit_transform(Rating.reshape(-1,1)).ravel() # 2 = 18+, 1 = 15+, 0 = 13+
data.head()
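
By default `OrdinalEncoder` numbers the categories in sorted order, which is what produces the mapping in the comment above. If the order should not depend on how the labels sort, it can be spelled out explicitly. The sketch below assumes short hypothetical labels (`'13+'`, `'15+'`, `'18+'`); the real column may contain longer strings.

```python
import pandas as pd

# Hypothetical rating labels; the real column may use different strings
ratings = pd.Series(['15+', '18+', '13+', '15+'])

# Spell out the order instead of relying on alphabetical sorting
order = ['13+', '15+', '18+']
codes = pd.Categorical(ratings, categories=order, ordered=True).codes
print(list(codes))
```

Any label missing from `order` is encoded as -1, which makes unexpected categories easy to spot.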

<a id = '2.1.3'></a>
### 2.1.3 Converting Duration
The duration needs to be converted from `X hr. Y min.` to a total number of minutes.

In [None]:
def t_converter(duration):
    # Strip the trailing ' min.' characters and turn ' hr. ' into ':'
    duration = duration.strip(" min.").replace(" hr. ",":")
    # Should now have a string of the format 'X:YY' (or just 'YY')
    T = duration.split(':')
    if len(T)==2:
        hours = int(T[0])*60
        mins = int(T[1])
        time = hours + mins
    else:
        time = int(T[0]) #this only has minutes
    return time
# Convert the whole column at once instead of assigning cell by cell
data['Duration'] = data['Duration'].apply(t_converter).astype(int)
data.rename(columns = {'Duration':'Duration/min'}, inplace = True)
data.head()
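
A duration with whole hours and no minutes part (for example `'2 hr.'`) would not fit the `X:YY` pattern that the converter above relies on. Below is a small regex-based sketch that also covers that case; the function name and the sample strings are illustrative assumptions, not values from the dataset.

```python
import re

def duration_to_minutes(text):
    """Convert strings such as '1 hr. 10 min.' to whole minutes."""
    hours = re.search(r'(\d+)\s*hr', text)
    minutes = re.search(r'(\d+)\s*min', text)
    if not hours and not minutes:
        raise ValueError(f'Unrecognised duration: {text!r}')
    total = 0
    if hours:
        total += int(hours.group(1)) * 60
    if minutes:
        total += int(minutes.group(1))
    return total

# Made-up samples covering the three shapes a duration can take
samples = ['1 hr. 10 min.', '45 min.', '2 hr.']
print([duration_to_minutes(s) for s in samples])
```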

<a id = '2.1.4'></a>
### 2.1.4 Converting Network
I am going to assume that the first network in each list is the original broadcast network.

In [None]:
# Keep only the first network listed for each drama
data['Network'] = data['Network'].astype(str).str.split(',').str[0].str.strip()
data.head()

In [None]:
data['Network'].value_counts()

<a id = '2.1.5'></a>
### 2.1.5 Converting Aired On
This column first needs to be split into multiple columns: `Air Day 1`, `Air Day 2`, etc.
After that, all of those columns are one-hot encoded. This is easier than integer-encoding all three columns, and also easier to interpret.

In [None]:
day_cols = ['Air Day 1', 'Air Day 2', 'Air Day 3']
# Start with empty (NaN) object columns so they can hold weekday names
for day_col in day_cols:
    data[day_col] = pd.Series(np.nan, index = data.index, dtype = object)
for i, col in data['Aired On'].items():
    days = [day.strip() for day in col.split(',')]
    # Fill as many of the three columns as there are days; the rest stay NaN
    for day_col, day in zip(day_cols, days):
        data.loc[i, day_col] = day
data.drop(['Aired On'], axis = 1, inplace = True)

In [None]:
cols = ['Air Day 1', 'Air Day 2', 'Air Day 3']
dummies = pd.get_dummies(data[cols])
data = pd.concat([data, dummies], axis = 1)
data.drop(cols, axis =1, inplace = True)
data.head()
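
To make the resulting column names concrete, here is a sketch of what `pd.get_dummies` produces for positional day columns. The toy frame and its weekday values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the split air-day columns (values are made up)
toy = pd.DataFrame({
    'Air Day 1': ['Friday', 'Monday'],
    'Air Day 2': ['Saturday', np.nan],
})

# Each (position, weekday) pair becomes its own indicator column;
# a missing day leaves every indicator for that position at 0
dummies = pd.get_dummies(toy, dtype=int)
print(dummies.columns.tolist())
print(dummies)
```

One consequence of encoding by position is that the same weekday can end up in different columns, for example `Air Day 1_Saturday` for one show and `Air Day 2_Saturday` for another, which is worth keeping in mind when comparing dramas by air day.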

<a id = '2.1.6'></a>
### 2.1.6 Converting Aired Date
The `Aired Date` column needs to be split into two columns: `First Aired` and `Last Aired`. The dates are also reformatted as day/month/year.

In [None]:
import datetime

def to_dmy(text):
    # Parse a date written like 'Nov 20, 2020' and return it as day/month/year
    return datetime.datetime.strptime(text.strip(), '%b %d, %Y').strftime('%d/%m/%y')

first_aired, last_aired = [], []
for row in data['Aired Date']:
    dates = row.split(' - ')
    first_aired.append(to_dmy(dates[0]))
    # Entries with a single date have no separate final air date
    last_aired.append(to_dmy(dates[1]) if len(dates) > 1 else np.nan)
data['First Aired'] = first_aired
data['Last Aired'] = last_aired
data.drop('Aired Date', axis =1, inplace = True)
data.head()
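
The cell above stores the dates as formatted strings. If real date objects are wanted for later arithmetic (such as the length of a run), the parsing step can return them directly. This is a minimal standard-library sketch; the function name and sample inputs are illustrative assumptions.

```python
from datetime import datetime

def split_aired_date(text):
    """Return (first, last) dates; last is None when only one date is given."""
    parts = [p.strip() for p in text.split(' - ')]
    first = datetime.strptime(parts[0], '%b %d, %Y').date()
    last = datetime.strptime(parts[1], '%b %d, %Y').date() if len(parts) > 1 else None
    return first, last

# Hypothetical inputs in the 'Mon DD, YYYY' style the format string expects
first, last = split_aired_date('Nov 20, 2020 - Jan 30, 2021')
print(first.strftime('%d/%m/%y'), last.strftime('%d/%m/%y'))
print((last - first).days)
print(split_aired_date('Dec 4, 2020'))
```

Note that `%y` writes a two-digit year, while `%Y` keeps all four digits, which may be the safer choice if the strings are ever parsed again.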

<a id = '2.1.7'></a>
### 2.1.7 Last Tidy
Lastly, I want to remove the `#` symbol from the `Rank` column and drop the `Synopsis` column, as it won't be needed for any EDA.

In [None]:
# Strip the leading '#' and store the rank as an integer
data['Rank'] = data['Rank'].str.strip('#').astype(int)
data.drop('Synopsis', axis = 1, inplace = True)
data = reduce_mem_usage(data)
data.head()

In [None]:
data.to_csv('top_100_k_drama_clean.csv', index = False, header = True)

That has created a cleaned-up and much wider CSV.