# DATA ENCODING

Here we consider three types of encoding for our categorical features:
- Label Encoding
- One-Hot Encoding
- Target Encoding

We will also convert the problem from regression to classification by binning the ratings, which we expect to give better results.

In [2]:
# IMPORTS
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
from category_encoders.cat_boost import CatBoostEncoder
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import pickle

In [3]:
df = pd.read_pickle(r'../datasets/pickle/processed_action_movie_data.pkl')
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6648 entries, 0 to 6647
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   index     6648 non-null   int64  
 1   year      6648 non-null   int16  
 2   runtime   6648 non-null   int16  
 3   director  6648 non-null   object 
 4   star      6648 non-null   object 
 5   rating    6648 non-null   float32
dtypes: float32(1), int16(2), int64(1), object(2)
memory usage: 207.9+ KB


## ~~LABEL ENCODING~~ **[SKIPPED]**

We will perform label encoding on the following column(s):
- Certificate

In [None]:
# Map each certificate to an ordinal code, ordered from least to most restrictive
# ('Not Rated' and 'Unrated' share one code)
certificates = {
    'G': 0,
    'PG': 1,
    'PG-13': 2,
    'R': 3,
    'NC-17': 4,
    'Not Rated': 5,
    'Unrated': 5
}

# Drop rows whose certificate has no entry in the mapping above
df = df.drop(df[(df['certificate'] == '16+') | (df['certificate'] == 'TV-14')].index)
print(df['certificate'].unique()) # inspect the remaining certificate values

# Create a new column 'certificate_enc' with the certificate order
# (astype(int) raises if any certificate is missing from the mapping)
df['certificate_enc'] = df['certificate'].map(certificates).astype(int)
df['certificate_enc'].value_counts()
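
As a hedged sketch of the mapping step: `astype(int)` fails as soon as one certificate is missing from the dictionary, so it can help to list the unmapped values first and to use pandas' nullable `Int64` dtype. The `toy` frame and its values are made up for illustration; the real `certificate` column is not part of the dataset loaded above.

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({'certificate': ['PG', 'R', 'TV-14', None, 'Unrated']})

certificates = {'G': 0, 'PG': 1, 'PG-13': 2, 'R': 3, 'NC-17': 4,
                'Not Rated': 5, 'Unrated': 5}

# Values that are present in the data but have no entry in the mapping
unmapped = sorted(set(toy['certificate'].dropna()) - set(certificates))

# The nullable integer dtype keeps missing and unmapped rows as <NA> instead of raising
toy['certificate_enc'] = toy['certificate'].map(certificates).astype('Int64')
```

Checking `unmapped` before encoding makes it an explicit decision whether such rows are dropped or given a code of their own.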

## ~~ONE-HOT ENCODING~~ **[SKIPPED]**

We will perform one-hot encoding on the following column(s):
- Director
- Star

Since a row can list several directors and actors, they have to be split into separate values before one-hot encoding can be applied properly.

*NOTE: We are aware that one-hot encoding high-cardinality columns can lead to overfitting and the curse of dimensionality.*

In [14]:
encoder = OneHotEncoder(handle_unknown="ignore")

In [15]:
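cols_to_enc = ['director', 'star']

# Fit the encoder and build a dense 0/1 matrix with one column per category
enc_cols = encoder.fit_transform(df[cols_to_enc]).toarray()
# Reuse df's index so the concat below aligns row by row
enc_df = pd.DataFrame(enc_cols, columns=encoder.get_feature_names_out(cols_to_enc), index=df.index)

# Concatenate the new dataframe with the original dataframe
df = pd.concat([df, enc_df], axis=1)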

df.columns

Index(['year', 'runtime', 'director', 'star', 'rating',
       'director_A'Ali de Sousa', 'director_A. Raven Cruz',
       'director_A. Todd Smith', 'director_A.J. Ager',
       'director_A.J. Martinson',
       ...
       'star_Zlatko Buric', 'star_Zlatko Krickic', 'star_Zoe Naylor',
       'star_Zoe Saldana', 'star_Zoey D'Arienzo', 'star_Zoey Deutch',
       'star_Zoë Bell', 'star_Zoë Nathenson', 'star_Zul Ariffin',
       'star_Óscar Jaenada'],
      dtype='object', length=9718)
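
The splitting step mentioned in this section is not shown in the cells above. A minimal sketch of one way to do it uses `Series.str.get_dummies`, which splits each cell on a separator and one-hot encodes the parts in a single call. The `', '` separator and the toy rows are assumptions, not taken from the dataset.

```python
import pandas as pd

# Hypothetical rows; the ', ' separator is an assumption about the raw data
toy = pd.DataFrame({'director': ['Dan Kwan, Daniel Scheinert', 'Dan Kwan', 'Jason Moore']})

# Split each cell on the separator and create one 0/1 column per distinct name
director_ohe = toy['director'].str.get_dummies(sep=', ').add_prefix('director_')

# Attach the indicator columns to the original frame
toy = pd.concat([toy, director_ohe], axis=1)
```

A row with two directors then has a 1 in both of their columns, which a single `OneHotEncoder` pass over the unsplit strings would not produce.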

# COUNT ENCODING

Let's see if we can perform count encoding, instead of one-hot encoding, on the following column(s):
- Director
- Star

## TARGET ENCODING

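Target encoding is another alternative to one-hot encoding for the following column(s):
- Director
- Star

The cells up to the next heading carry out the count encoding described above; a target-based encoding is applied later, in the CATBOOST section.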
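
For reference, a minimal sketch of plain target encoding with smoothing, written in pandas. Each category is replaced by its mean target, pulled towards the global mean when the category has few rows. The toy data and the smoothing strength `m` are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical toy data with a numeric target
toy = pd.DataFrame({
    'director': ['A', 'A', 'A', 'B', 'C'],
    'rating':   [8.0, 7.0, 6.0, 4.0, 9.0],
})

m = 2.0                         # smoothing strength (assumed value)
prior = toy['rating'].mean()    # global mean, used as the prior
stats = toy.groupby('director')['rating'].agg(['sum', 'count'])

# Smoothed mean: categories with few rows are pulled towards the global mean
smoothed = (stats['sum'] + m * prior) / (stats['count'] + m)
toy['director_te'] = toy['director'].map(smoothed)
```

Because the statistics here are computed on the same rows they encode, the target leaks into the feature; in practice they would usually be learned on a training fold and then applied to held-out rows.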

In [49]:
# Count encoding: replace each category with the number of rows it appears in
director_counts = df['director'].value_counts()
star_counts = df['star'].value_counts()

df['director_count'] = df['director'].map(director_counts)
df['star_count'] = df['star'].map(star_counts)
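
If the data is later split into training and test sets, the counts would normally be learned from the training rows only. The following is a sketch under that assumption; the `train` and `test` frames are hypothetical and not part of this notebook.

```python
import pandas as pd

# Hypothetical split, for illustration only
train = pd.DataFrame({'star': ['Brad Pitt', 'Brad Pitt', 'Michelle Yeoh']})
test = pd.DataFrame({'star': ['Brad Pitt', 'Zoe Saldana']})

# Learn the counts from the training rows only
star_counts = train['star'].value_counts()

train['star_count'] = train['star'].map(star_counts)
# Categories never seen in training get a count of 0 instead of NaN
test['star_count'] = test['star'].map(star_counts).fillna(0).astype(int)
```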

In [39]:
df.head()

| | year | runtime | director | star | rating | director_count | star_count |
|---|---|---|---|---|---|---|---|
| 0 | 2022 | 161 | Ryan Coogler | Letitia Wright | 6.9 | 3 | 1 |
| 1 | 2022 | 192 | James Cameron | Sam Worthington | 7.8 | 2 | 8 |
| 2 | 2022 | 139 | Dan Kwan | Michelle Yeoh | 8.0 | 1 | 4 |
| 3 | 2022 | 100 | Jason Moore | Jennifer Lopez | 5.4 | 1 | 1 |
| 4 | 2022 | 127 | David Leitch | Brad Pitt | 7.3 | 4 | 5 |


In [40]:
df['director_count'].value_counts()

director_count
1     3898
2     1140
3      483
4      300
5      220
6      150
7      147
8       88
9       63
14      28
12      24
11      22
10      20
19      19
17      17
16      16
13      13
Name: count, dtype: int64

In [41]:
df['star_count'].value_counts()

star_count
1     4212
2      730
3      369
5      235
4      204
6      168
8       96
9       90
16      80
7       77
10      70
20      60
27      54
12      48
17      34
15      30
14      28
28      28
24      24
11      11
Name: count, dtype: int64

# LABEL ENCODING (for Ratings)

We will change the problem from regression to classification by binning the ratings, in the hope of achieving better accuracy.

In [4]:
# Convert rating into a classification target
bin_num = 3
bin_edges = np.linspace(start=0, stop=10, num=bin_num+1) # Evenly spaced bin edges between 0 and 10
bin_labels = list(range(bin_num))

# Create a new column 'rating_class' that contains the bin labels;
# include_lowest=True keeps a rating of exactly 0 in the first bin
df['rating_class'] = pd.cut(df['rating'], bins=bin_edges, labels=bin_labels, include_lowest=True)
# Convert the categorical labels to the smallest integer dtype that fits
df['rating_class'] = pd.to_numeric(df['rating_class'], downcast='integer')
df.drop(columns=['rating'], inplace=True)

In [5]:
df['rating_class'].value_counts()

rating_class
1    4408
2    1489
0     751
Name: count, dtype: int64
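
The equal-width bins leave most movies in the middle class (4408 of 6648). If balanced classes matter, quantile-based binning with `pd.qcut` is one possible alternative. The sketch below compares the two on synthetic ratings; the distribution parameters are assumptions and the data is not the movie data.

```python
import numpy as np
import pandas as pd

# Synthetic ratings for illustration only
rng = np.random.default_rng(0)
ratings = pd.Series(rng.normal(loc=5.8, scale=1.2, size=900).clip(1, 10))

bin_num = 3
bin_labels = list(range(bin_num))

# Equal-width bins between 0 and 10, as in the cell above
equal_width = pd.cut(ratings, bins=np.linspace(0, 10, bin_num + 1), labels=bin_labels)
# Quantile bins: edges are chosen so each class holds about a third of the rows
quantile = pd.qcut(ratings, q=bin_num, labels=bin_labels)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```

The trade-off is that quantile edges depend on the data, so the classes no longer correspond to fixed rating ranges and the edges should be derived from training data only.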

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6648 entries, 0 to 6647
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   index         6648 non-null   int64 
 1   year          6648 non-null   int16 
 2   runtime       6648 non-null   int16 
 3   director      6648 non-null   object
 4   star          6648 non-null   object
 5   rating_class  6648 non-null   int8  
dtypes: int16(2), int64(1), int8(1), object(2)
memory usage: 188.4+ KB


# CATBOOST

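A technique similar to target encoding. CatBoost encoding replaces each category with a smoothed mean of the target and, when it is given the target during training, can compute that mean row by row from the preceding rows only, which can reduce target leakage.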
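
One way to picture how CatBoost-style encoding differs from plain target encoding is the ordered target statistic, where a row is encoded using only the rows that come before it. The sketch below illustrates that idea in plain pandas; it is not the `category_encoders` implementation, and the toy data, the smoothing weight `a`, and the `prior` name are assumptions.

```python
import pandas as pd

# Hypothetical toy data; row order matters for an ordered statistic
toy = pd.DataFrame({
    'director': ['A', 'B', 'A', 'A', 'B'],
    'target':   [2, 0, 1, 2, 1],
})

a = 1.0                         # smoothing weight (assumed value)
prior = toy['target'].mean()    # global target mean

grp = toy.groupby('director')['target']
prev_sum = grp.cumsum() - toy['target']   # target sum over earlier rows of the same category
prev_cnt = grp.cumcount()                 # number of earlier rows of the same category

# Each row is encoded from earlier rows only, so its own target never enters its encoding
toy['director_enc'] = (prev_sum + a * prior) / (prev_cnt + a)
```

The first occurrence of every category falls back to the prior, and later occurrences move towards the category's own mean as more rows accumulate.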

In [7]:
cbe = CatBoostEncoder()
cat_cols = ['director', 'star']
train, target = df[cat_cols], df['rating_class']

# Learn per-category target statistics, then replace each category with its encoded value
cbe.fit(train, target)
df[cat_cols] = cbe.transform(train)

In [8]:
df.head()

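| | index | year | runtime | director | star | rating_class |
|---|---|---|---|---|---|---|
| 0 | 0 | 2022 | 161 | 1.777753 | 1.111011 | 2 |
| 1 | 1 | 2022 | 192 | 1.70367 | 1.234557 | 2 |
| 2 | 2 | 2022 | 139 | 1.111011 | 1.422202 | 2 |
| 3 | 3 | 2022 | 100 | 1.111011 | 1.111011 | 1 |
| 4 | 4 | 2022 | 127 | 1.622202 | 1.685168 | 2 |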


In [9]:
df['director'].describe()

count    6648.000000
mean        1.103998
std         0.232665
min         0.111101
25%         1.037004
50%         1.111011
75%         1.111011
max         1.888876
Name: director, dtype: float64

# SAVE IT!

In [78]:
df.head()

| | year | runtime | director | star | rating_class |
|---|---|---|---|---|---|
| 0 | 2022 | 161 | 1.777753 | 1.111011 | 2 |
| 1 | 2022 | 192 | 1.70367 | 1.234557 | 2 |
| 2 | 2022 | 139 | 1.111011 | 1.422202 | 2 |
| 3 | 2022 | 100 | 1.111011 | 1.111011 | 1 |
| 4 | 2022 | 127 | 1.622202 | 1.685168 | 2 |


In [79]:
df.describe()

| | year | runtime | director | star | rating_class |
|---|---|---|---|---|---|
| count | 6648.0 | 6648.0 | 6648.0 | 6648.0 | 6648.0 |
| mean | 2012.455325 | 97.242329 | 1.103998 | 1.10818 | 1.111011 |
| std | 5.921617 | 22.203019 | 0.232665 | 0.182494 | 0.569797 |
| min | 2000.0 | 1.0 | 0.111101 | 0.277753 | 0.0 |
| 25% | 2008.0 | 87.0 | 1.037004 | 1.111011 | 1.0 |
| 50% | 2013.0 | 94.0 | 1.111011 | 1.111011 | 1.0 |
| 75% | 2017.0 | 106.0 | 1.111011 | 1.111011 | 1.0 |
| max | 2023.0 | 700.0 | 1.888876 | 1.911101 | 2.0 |


In [84]:
extra_name = 'catboost_3'
# Append a separator only when a prefix is set (an empty prefix stays empty)
extra_name += "_" if extra_name else ""

In [85]:
# Save it as CSV
df.to_csv(f'../datasets/csv/{extra_name}encoded_action_movie_data.csv', index=False)

In [86]:
# Save it as pickle
with open(f'../datasets/pickle/{extra_name}encoded_action_movie_data.pkl', 'wb') as f:
    pickle.dump(df, f)
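
The saving steps can also be wrapped in a small helper and checked with a round trip. This is a sketch only: `save_encoded` is a hypothetical name, both files go to one directory for simplicity (the cells above use separate `csv` and `pickle` folders), and a temporary directory stands in for the real paths.

```python
import tempfile
from pathlib import Path

import pandas as pd

def save_encoded(df, out_dir, prefix=''):
    """Write df as CSV and pickle; `prefix` is an optional tag such as 'catboost_3'."""
    tag = f'{prefix}_' if prefix else ''
    out_dir = Path(out_dir)
    csv_path = out_dir / f'{tag}encoded_action_movie_data.csv'
    pkl_path = out_dir / f'{tag}encoded_action_movie_data.pkl'
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    return csv_path, pkl_path

# Round-trip check on a small hypothetical frame, written to a temporary directory
toy = pd.DataFrame({'year': [2022, 2021], 'rating_class': [2, 1]})
with tempfile.TemporaryDirectory() as tmp:
    csv_path, pkl_path = save_encoded(toy, tmp, prefix='catboost_3')
    restored = pd.read_pickle(pkl_path)
    same = restored.equals(toy)
    names = (csv_path.name, pkl_path.name)
```

The pickle keeps the compact dtypes (such as `int16` and `int8`) exactly, whereas the CSV stores plain text and the dtypes are inferred again on load.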