# Machine Learning

---

### Essential Libraries

Let us begin by importing the essential Python libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [1]:
# Basic Libraries
import json
import statistics
import math

from collections import defaultdict

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt  # we only need pyplot

sb.set()  # set the default Seaborn style for graphics


## Advantages and disadvantages of various linear regression models

The following is from [https://statisticsbyjim.com/regression/choosing-regression-analysis/](https://statisticsbyjim.com/regression/choosing-regression-analysis/)

`Ordinary Least Squares (OLS)`
- Sensitive to both outliers and multicollinearity
- Prone to overfitting

`Ridge regression`
- Allows you to analyze data even when severe multicollinearity is present
- Helps prevent overfitting
- Introduces a slight bias in the coefficient estimates in exchange for a large reduction in the variance that multicollinearity causes, which produces more useful estimates
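
The bias-for-variance trade can be illustrated with a minimal NumPy sketch. The data is synthetic (two nearly identical predictors), `fit_linear` is a hypothetical helper written for this illustration, and the penalty `alpha=1.0` is an arbitrary choice, not a tuned value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: two nearly identical predictors -> severe multicollinearity
n = 100
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

# Center predictors and response so that no intercept term is needed
Xc = X - X.mean(axis=0)
yc = y - y.mean()


def fit_linear(X, y, alpha=0.0):
    """Solve (X'X + alpha * I) beta = X'y.

    alpha = 0 gives the OLS solution, alpha > 0 gives the ridge solution.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


beta_ols = fit_linear(Xc, yc)
beta_ridge = fit_linear(Xc, yc, alpha=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
```

With collinear predictors, the OLS coefficients tend to swing far apart to fit noise, while the ridge penalty pulls them toward a smaller, more stable pair.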

`Lasso regression` (least absolute shrinkage and selection operator)
- Performs variable selection, aiming to increase prediction accuracy by identifying a simpler model
- Similar to Ridge regression, but it can shrink coefficients to exactly zero, which is how it selects variables
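
As a rough sketch of the variable-selection behaviour, the snippet below fits scikit-learn's `Lasso` on synthetic data in which only the first two of ten features influence the response. The data, the seed, and `alpha=0.5` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 10 features, only features 0 and 1 affect the response
n_samples, n_features = 200, 10
X = rng.normal(size=(n_samples, n_features))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n_samples)

# The L1 penalty drives the coefficients of uninformative features to exactly zero
lasso = Lasso(alpha=0.5)
lasso.fit(X, y)

selected = np.flatnonzero(lasso.coef_)
print("Coefficients:", np.round(lasso.coef_, 3))
print("Selected feature indices:", selected)
```

The surviving coefficients are shrunk toward zero as well, which is the price paid for the simpler model.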

`Partial least squares` (PLS) regression
- Useful when you have very few observations compared to the number of independent variables, or when your independent variables are highly correlated
- Reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis
- Performs linear regression on these components rather than on the original data
- Emphasizes developing predictive models and is not used for screening variables
- Unlike OLS, it can include multiple continuous dependent variables
- Uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables

## Importing the data

In [43]:
anime_df = pd.read_csv('dataset/anime_cleaned_2.csv')
print("Number of animes:", len(anime_df))
anime_df.head(1)

Number of animes: 8661


Unnamed: 0,id,title,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,...,broadcast_day_of_the_week,broadcast_start_time,statistics_watching,statistics_completed,statistics_on_hold,statistics_dropped,statistics_plan_to_watch,statistics_num_list_users,positive_viewership_fraction,negative_viewership_fraction
0,95,Turn A Gundam,1999-04-09,2000-04-14,"It is the Correct Century, two millennia after...",7.71,1049,2892,40743,13338,...,friday,17:00,2735.0,16661.0,2538.0,1597.0,17292.0,40823.0,0.8987,0.1013


## Unravel genres from one column to multiple columns

In [46]:
anime_df = json_genres(anime_df)  # convert genres column to json

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [52]:
from ipynb.fs.full.helpers import json_genres

def unravel_genres(row):
    # Collect the genre names of one anime into a Series indexed 0..n-1.
    # Series.append was removed in pandas 2.0, so build the Series in one step.
    return pd.Series([elem['name'] for elem in row], dtype=str)

anime_df_genres_expanded = anime_df['genres'].apply(unravel_genres)
anime_df_genres_expanded

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Action,Adventure,Drama,Mecha,Military,Romance,Sci-Fi,Space,,,
1,Action,Drama,Military,Sci-Fi,Space,,,,,,
2,Adventure,Comedy,Fantasy,Kids,Sci-Fi,Shounen,,,,,
3,Action,Adventure,Comedy,Drama,Fantasy,Shounen,Super Power,,,,
4,Adventure,Comedy,Kids,Sci-Fi,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
8656,Music,,,,,,,,,,
8657,Comedy,Slice of Life,,,,,,,,,
8658,Adventure,Comedy,Demons,Fantasy,Historical,Supernatural,,,,,
8659,Music,Supernatural,Vampire,,,,,,,,


In [91]:
anime_df_genres_expanded = anime_df_genres_expanded.fillna('NA')

In [92]:
anime_df_genres_expanded.iloc[:, 0].value_counts(normalize=True)

Action           0.315783
Comedy           0.311396
Adventure        0.127583
Drama            0.057730
Music            0.056114
Fantasy          0.029327
Kids             0.009121
Game             0.008660
Mystery          0.008082
Historical       0.007043
Avant Garde      0.006928
Slice of Life    0.006697
Mecha            0.005427
Romance          0.005080
Shounen          0.005080
Demons           0.004965
Horror           0.004849
School           0.004734
Sci-Fi           0.004157
Boys Love        0.003579
Cars             0.003002
Ecchi            0.002309
Military         0.001963
Sports           0.001847
Parody           0.001039
Shoujo           0.000924
no_genre         0.000924
Girls Love       0.000808
Gourmet          0.000808
Supernatural     0.000693
Psychological    0.000693
Harem            0.000577
Martial Arts     0.000462
Josei            0.000462
Seinen           0.000462
Award Winning    0.000231
Suspense         0.000115
Vampire          0.000115
Space       

In [93]:
# Append the positional genre columns to the original data as genre-0, genre-1, ...
anime_expanded_df = anime_df.join(anime_df_genres_expanded.add_prefix("genre-"))

In [94]:
anime_expanded_df.head(2)

Unnamed: 0,id,title,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,...,genre-1,genre-2,genre-3,genre-4,genre-5,genre-6,genre-7,genre-8,genre-9,genre-10
0,95,Turn A Gundam,1999-04-09,2000-04-14,"It is the Correct Century, two millennia after...",7.71,1049,2892,40743,13338,...,Adventure,Drama,Mecha,Military,Romance,Sci-Fi,Space,,,
1,3665,Ginga Eiyuu Densetsu Gaiden (1999),1999-12-24,2000-07-21,Ginga Eiyuu Densetsu Gaiden (1999) is the seco...,8.07,472,4347,17849,6478,...,Drama,Military,Sci-Fi,Space,,,,,,


### Encoding nominal (unordered) categorical variables using `OneHotEncoder`

Our dataset contains a lot of categorical variables such as:
- media_type
- genre
- source
- rating
- start_season

In [107]:
# Import the encoder from sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors (not the response)
cat_variables = ['media_type', 'source', 'rating', 'start_season_season'
                 ] + [f"genre-{i}" for i in anime_df_genres_expanded.columns]
anime_cat = anime_expanded_df[cat_variables]

ohe.fit(anime_cat)
# get_feature_names was removed in scikit-learn 1.2; use get_feature_names_out instead
anime_cat_ohe = pd.DataFrame(ohe.transform(anime_cat).toarray(),
                             columns=ohe.get_feature_names_out(anime_cat.columns))

# Check the encoded variables
anime_cat_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8661 entries, 0 to 8660
Columns: 290 entries, media_type_movie to genre-10_Supernatural
dtypes: float64(290)
memory usage: 19.2 MB
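
Because the genres were spread over positional columns, the encoder treats the same genre in different slots as different features (for example, the last column above is `genre-10_Supernatural`). One possible alternative, not used in this notebook, is a multi-hot encoding with a single indicator column per genre. The sketch below uses a toy frame and assumes each `genres` entry is a list of dicts with a `name` key, as in `unravel_genres`.

```python
import pandas as pd

# Toy data shaped like the parsed `genres` column (illustrative values)
toy = pd.DataFrame({"genres": [
    [{"name": "Action"}, {"name": "Drama"}],
    [{"name": "Drama"}],
    [{"name": "Comedy"}, {"name": "Action"}],
]})

# One list of genre names per row
genre_names = toy["genres"].apply(lambda genres: [g["name"] for g in genres])

# One row per (anime, genre) pair; the original row index is repeated
exploded = genre_names.explode()

# Indicator columns per genre, collapsed back to one row per anime
genre_indicators = pd.get_dummies(exploded, dtype=int).groupby(level=0).max()
print(genre_indicators)
```

This yields one column per distinct genre regardless of the order in which genres are listed.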


In [110]:
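# Every column that was not one-hot encoded. Despite the name, this list also
# holds non-numeric columns such as 'title' and 'synopsis'.
num_variable = [col for col in anime_df if col not in cat_variables]
num_variable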

['id',
 'title',
 'start_date',
 'end_date',
 'synopsis',
 'mean',
 'rank',
 'popularity',
 'num_list_users',
 'num_scoring_users',
 'nsfw',
 'status',
 'genres',
 'num_episodes',
 'average_episode_duration',
 'studios',
 'start_season_year',
 'broadcast_day_of_the_week',
 'broadcast_start_time',
 'statistics_watching',
 'statistics_completed',
 'statistics_on_hold',
 'statistics_dropped',
 'statistics_plan_to_watch',
 'statistics_num_list_users',
 'positive_viewership_fraction',
 'negative_viewership_fraction']

In [113]:
# Combine the remaining (non-encoded) columns with the one-hot encoded categorical features
animeData_num = anime_df[num_variable]
animeData_ohe = pd.concat([animeData_num, anime_cat_ohe],
                          sort=False,
                          axis=1).reindex(index=animeData_num.index)

# Check the final dataframe
animeData_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8661 entries, 0 to 8660
Columns: 317 entries, id to genre-10_Supernatural
dtypes: float64(300), int64(7), object(10)
memory usage: 20.9+ MB
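
The summary above still lists 10 `object` columns, which most regression models cannot consume directly. One way to keep only numeric columns is `DataFrame.select_dtypes`; the sketch below runs on a toy frame, and whether to drop or further encode the object columns is a modelling decision this notebook has not made at this point.

```python
import pandas as pd

# Toy frame mixing text and numeric columns (illustrative values)
toy = pd.DataFrame({
    "title": ["A", "B"],
    "mean": [7.71, 8.07],
    "rank": [1049, 472],
    "media_type_tv": [1.0, 0.0],
})

# Keep integer and float columns, drop object (text) columns
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```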
