# Machine Learning

---

### Essential Libraries

Let us begin by importing the essential Python libraries.

> NumPy : Library for Numeric Computations in Python  
> Pandas : Library for Data Acquisition and Preparation  
> Matplotlib : Low-level library for Data Visualization  
> Seaborn : Higher-level library for Data Visualization  

In [120]:
# Basic Libraries
import json
import statistics
import math

from collections import defaultdict

import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt  # we only need pyplot

sb.set()  # set the default Seaborn style for graphics


## Advantages and disadvantages of various linear regression models

The following is from [https://statisticsbyjim.com/regression/choosing-regression-analysis/](https://statisticsbyjim.com/regression/choosing-regression-analysis/)

`Ordinary Least Squares (OLS)`
- is sensitive to both outliers and multicollinearity
- is prone to overfitting

`Ridge regression`
- allows you to analyze data even when severe multicollinearity is present
- helps prevent overfitting
- reduces the large, problematic variance that multicollinearity causes by introducing a slight bias in the estimates
- this trade of much variance for a little bias produces more useful coefficient estimates when multicollinearity is present

`Lasso regression` (least absolute shrinkage and selection operator)
- performs variable selection that aims to increase prediction accuracy by identifying a simpler model
- is similar to Ridge regression, but additionally performs variable selection

`Partial least squares` (PLS) regression
- is useful when you have very few observations compared to the number of independent variables, or when your independent variables are highly correlated
- reduces the independent variables to a smaller number of uncorrelated components, similar to Principal Components Analysis
- then performs linear regression on these components rather than on the original data
- emphasizes developing predictive models and is not used for screening variables
- unlike OLS, can include multiple continuous dependent variables
- uses the correlation structure to identify smaller effects and model multivariate patterns in the dependent variables
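
The variance-for-bias trade that Ridge makes can be seen on a small synthetic example. The sketch below is illustrative only: the data, the penalty `alpha=1.0` and the helper names are assumptions, not part of the original analysis. It fits OLS and a closed-form Ridge (both without an intercept) on two nearly identical predictors and compares the size of the resulting coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two almost perfectly correlated predictors (severe multicollinearity)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=1e-3, size=n)
X = np.column_stack([x1, x2])
y = 3.0 * x1 + rng.normal(scale=0.5, size=n)  # true signal uses x1 only


def ols_coefficients(X, y):
    """Least-squares solution of X @ beta = y (no intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta


def ridge_coefficients(X, y, alpha):
    """Closed-form ridge solution (X'X + alpha * I)^-1 X'y (no intercept)."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ y)


beta_ols = ols_coefficients(X, y)
beta_ridge = ridge_coefficients(X, y, alpha=1.0)

print("OLS coefficients:  ", beta_ols)
print("Ridge coefficients:", beta_ridge)
print("OLS norm:  ", np.linalg.norm(beta_ols))
print("Ridge norm:", np.linalg.norm(beta_ridge))
```

The Ridge coefficient vector is always shorter than the OLS one for a positive penalty; with correlated predictors the penalty mainly removes the unstable difference between the two coefficients while leaving their sum close to the true effect.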

## Importing the data

In [121]:
anime_df = pd.read_csv('dataset/anime_cleaned_2.csv')
print("Number of animes:", len(anime_df))
anime_df.head(1)

Number of animes: 8661


Unnamed: 0,id,title,start_date,end_date,synopsis,mean,rank,popularity,num_list_users,num_scoring_users,...,broadcast_day_of_the_week,broadcast_start_time,statistics_watching,statistics_completed,statistics_on_hold,statistics_dropped,statistics_plan_to_watch,statistics_num_list_users,positive_viewership_fraction,negative_viewership_fraction
0,95,Turn A Gundam,1999-04-09,2000-04-14,"It is the Correct Century, two millennia after...",7.71,1049,2892,40743,13338,...,friday,17:00,2735.0,16661.0,2538.0,1597.0,17292.0,40823.0,0.8987,0.1013


## Dropping columns that are not needed for linear regression

In [122]:
anime_df.drop([
    'title', 'start_date', 'end_date', 'id', 'synopsis', 'rank', 'popularity',
    'num_list_users', 'num_scoring_users', 'broadcast_day_of_the_week',
    'broadcast_start_time', 'statistics_watching', 'statistics_completed',
    'statistics_on_hold', 'statistics_dropped', 'statistics_plan_to_watch',
    'statistics_num_list_users', 'positive_viewership_fraction', 'negative_viewership_fraction'
],
              axis=1,
              inplace=True)


In [123]:
anime_df.head(1)

Unnamed: 0,mean,nsfw,media_type,status,genres,num_episodes,source,average_episode_duration,rating,studios,start_season_year,start_season_season
0,7.71,white,tv,finished_airing,"[{'id': 1, 'name': 'Action'}, {'id': 2, 'name'...",50,original,1445,pg_13,"[{'id': 14, 'name': 'Sunrise'}, {'id': 1260, '...",1999.0,spring


## Unravel `genres` and `studios` from one column to multiple columns

In [124]:
from ipynb.fs.full.helpers import json_genres, json_studios

anime_df = json_genres(anime_df)  # convert genres column to json
anime_df = json_studios(anime_df)  # convert studios column to json

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy


In [125]:
def unravel_genre_or_studio(row):
    # Collect the 'name' of every genre/studio dict in the row into one Series
    # (Series.append was removed in pandas 2.0, so build the Series directly)
    return pd.Series([elem['name'] for elem in row], dtype=str)


genres_expanded = anime_df['genres'].apply(
    lambda row: unravel_genre_or_studio(row))
genres_expanded

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10
0,Action,Adventure,Drama,Mecha,Military,Romance,Sci-Fi,Space,,,
1,Action,Drama,Military,Sci-Fi,Space,,,,,,
2,Adventure,Comedy,Fantasy,Kids,Sci-Fi,Shounen,,,,,
3,Action,Adventure,Comedy,Drama,Fantasy,Shounen,Super Power,,,,
4,Adventure,Comedy,Kids,Sci-Fi,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
8656,Music,,,,,,,,,,
8657,Comedy,Slice of Life,,,,,,,,,
8658,Adventure,Comedy,Demons,Fantasy,Historical,Supernatural,,,,,
8659,Music,Supernatural,Vampire,,,,,,,,


In [126]:
genres_expanded = genres_expanded.fillna('NA')

In [127]:
studios_expanded = anime_df['studios'].apply(
    lambda row: unravel_genre_or_studio(row))
studios_expanded = studios_expanded.fillna('NA')
studios_expanded

Unnamed: 0,0,1,2,3,4,5,6,7,8,9
0,Sunrise,Nakamura Production,,,,,,,,
1,Artland,Magic Bus,,,,,,,,
2,Shin-Ei Animation,,,,,,,,,
3,Toei Animation,,,,,,,,,
4,Toei Animation,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...
8656,Doga Kobo,,,,,,,,,
8657,,,,,,,,,,
8658,Sunrise,,,,,,,,,
8659,A-1 Pictures,,,,,,,,,


In [128]:
# Attach the expanded columns as genre-0, genre-1, ... and studio-0, studio-1, ...
# (rows are aligned on the shared index, so no explicit loop is needed)
anime_expanded_df = pd.concat(
    [
        anime_df,
        genres_expanded.add_prefix("genre-"),
        studios_expanded.add_prefix("studio-"),
    ],
    axis=1,
)


In [129]:
anime_expanded_df.head(2)

Unnamed: 0,mean,nsfw,media_type,status,genres,num_episodes,source,average_episode_duration,rating,studios,...,studio-0,studio-1,studio-2,studio-3,studio-4,studio-5,studio-6,studio-7,studio-8,studio-9
0,7.71,white,tv,finished_airing,"[{'id': 1, 'name': 'Action'}, {'id': 2, 'name'...",50,original,1445,pg_13,"[{'id': 14, 'name': 'Sunrise'}, {'id': 1260, '...",...,Sunrise,Nakamura Production,,,,,,,,
1,8.07,white,ova,finished_airing,"[{'id': 1, 'name': 'Action'}, {'id': 8, 'name'...",28,novel,1560,r,"[{'id': 8, 'name': 'Artland'}, {'id': 207, 'na...",...,Artland,Magic Bus,,,,,,,,


### Encoding nominal (unordered) categorical variables using `OneHotEncoder`

Our dataset contains many categorical variables, such as:
- media_type
- source
- rating
- start_season_season
- start_season_year
- status
- nsfw
- genres
- studios

In [130]:
# Import the encoder from sklearn
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

# OneHotEncoding of categorical predictors (not the response)
cat_variables = [
    'media_type', 'source', 'rating', 'start_season_season',
    'start_season_year', 'status', 'nsfw'
] + [f"genre-{i}" for i in genres_expanded.columns] + [f"studio-{i}" for i in studios_expanded.columns]
anime_cat = anime_expanded_df[cat_variables]

ohe.fit(anime_cat)
# get_feature_names was removed in scikit-learn 1.2; get_feature_names_out replaces it
anime_cat_ohe = pd.DataFrame(ohe.transform(anime_cat).toarray(),
                             columns=ohe.get_feature_names_out(anime_cat.columns))

# Check the encoded variables
anime_cat_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8661 entries, 0 to 8660
Columns: 917 entries, media_type_movie to studio-9_Steve N' Steven
dtypes: float64(917)
memory usage: 60.6 MB
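
Because the expanded columns are positional (`genre-0`, `genre-1`, ...), a genre such as Drama, which sits in different positions in different rows, ends up spread over several indicator columns. One possible alternative, sketched here on made-up toy data, is a multi-hot encoding with a single column per genre regardless of position. The frame `toy_df` and the helper `multi_hot` are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data in the same shape as the parsed `genres` column
toy_df = pd.DataFrame({
    "genres": [
        [{"id": 1, "name": "Action"}, {"id": 8, "name": "Drama"}],
        [{"id": 8, "name": "Drama"}],
        [],  # a title with no genres
    ]
})


def multi_hot(list_column):
    """Return one 0/1 column per distinct name found in a column of dict lists."""
    names = list_column.apply(lambda items: [item["name"] for item in items])
    exploded = names.explode()          # one row per (title, name) pair
    dummies = pd.get_dummies(exploded)  # indicator per name; NaN gives all zeros
    # Collapse back to one row per title
    return dummies.groupby(level=0).max().astype(int)


genre_multi_hot = multi_hot(toy_df["genres"])
print(genre_multi_hot)
```

This keeps the number of genre columns equal to the number of distinct genres, which would also shrink the feature matrix considerably compared with the positional encoding.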


In [131]:
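# Columns that were not one-hot encoded. Note that `genres` and `studios`
# are the raw list columns, not numeric; they are dropped before fitting.
num_variable = [col for col in anime_df if col not in cat_variables]
num_variable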

['mean', 'genres', 'num_episodes', 'average_episode_duration', 'studios']

In [132]:
# Combining Numeric features with the OHE Categorical features
animeData_num = anime_df[num_variable]
animeData_ohe = pd.concat([animeData_num, anime_cat_ohe],
                          sort=False,
                          axis=1).reindex(index=animeData_num.index)

# Check the final dataframe
animeData_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8661 entries, 0 to 8660
Columns: 922 entries, mean to studio-9_Steve N' Steven
dtypes: float64(918), int64(2), object(2)
memory usage: 60.9+ MB


## Linear Regression

In [133]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Extract Response and Predictors
y = pd.DataFrame(animeData_ohe['mean'])
X = pd.DataFrame(animeData_ohe.drop(['mean', 'genres', 'studios'], axis=1))

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

In [134]:
print("Length")
print(f"X_train:\t {len(X_train)}")
print(f"X_test: \t {len(X_test)}")
print(f"y_train:\t {len(y_train)}")
print(f"y_test: \t {len(y_test)}")

Length
X_train:	 6062
X_test: 	 2599
y_train:	 6062
y_test: 	 2599


In [135]:
X_train.head(2)

Unnamed: 0,num_episodes,average_episode_duration,media_type_movie,media_type_music,media_type_ona,media_type_ova,media_type_special,media_type_tv,source_4_koma_manga,source_book,...,studio-5_Trigger,studio-6_Graphinica,studio-6_NA,studio-6_Science SARU,studio-7_NA,studio-7_Studio Colorido,studio-8_NA,studio-8_Sola Digital Arts,studio-9_NA,studio-9_Steve N' Steven
347,26,1460,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0
5322,1,5940,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0,1.0,0.0


In [136]:
y_train.head(2)

Unnamed: 0,mean
347,7.32
5322,7.87


In [137]:
# fit model
linreg = LinearRegression()
linreg.fit(X_train, y_train)

# predict
y_train_pred = linreg.predict(X_train)


In [138]:
from sklearn.metrics import mean_squared_error

# Goodness of Fit for Train Data
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:",
      mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_train, y_train_pred)))
print()

# Accuracy for Test Data
y_test_pred = linreg.predict(X_test)
print("Accuracy of Model        \tTest Dataset")
print("Explained Variance (R^2) \t:", linreg.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_test, y_test_pred)))
print()

Goodness of Fit of Model 	Train Dataset
Explained Variance (R^2) 	: 0.7236039823951999
Mean Squared Error (MSE) 	: 0.1743214633722397
Root Mean Squared Error (RMSE) 	: 0.4175182192099402

Accuracy of Model        	Test Dataset
Explained Variance (R^2) 	: -2776297658460.2754
Mean Squared Error (MSE) 	: 1757429263340.7527
Root Mean Squared Error (RMSE) 	: 1325680.679251513



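The train R^2 is about 0.72, but the test R^2 is hugely negative. A good fit on `train` combined with a very poor fit on `test` signifies severe overfitting: the model reproduces the training rows but gives wildly wrong predictions on unseen data. A likely contributor is the large number of one-hot columns (over 900), many of which are set for only a handful of titles.

Now, we will try to reduce the number of variables used.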
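
A negative test R^2 is possible because of how the score is defined: R^2 = 1 - SS_res / SS_tot, so predictions that are worse than always guessing the mean of the observed values give a value below zero. The following is a small stand-alone sketch with made-up numbers; `r_squared` is a hypothetical helper, not a function from the notebook.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [6.0, 7.0, 8.0]

r2_perfect = r_squared(y_true, [6.0, 7.0, 8.0])     # perfect predictions
r2_mean = r_squared(y_true, [7.0, 7.0, 7.0])        # always predicting the mean
r2_outlier = r_squared(y_true, [7.0, 7.0, 1000.0])  # one wildly wrong prediction

print(r2_perfect, r2_mean, r2_outlier)
```

A single extreme prediction is enough to push the score far below zero, which is consistent with a few unstable coefficients dominating the test error.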

### a) Using only `genres`

In [159]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Extract Response and Predictors
y = pd.DataFrame(animeData_ohe['mean'])
X = pd.DataFrame(animeData_ohe[[col for col in animeData_ohe if 'genre-' in col]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit model
linreg_genre = LinearRegression()
linreg_genre.fit(X_train, y_train)

# predict
y_train_pred = linreg_genre.predict(X_train)

# Goodness of Fit for Train Data
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", linreg_genre.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:",
      mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_train, y_train_pred)))
print()

# Accuracy for Test Data
y_test_pred = linreg_genre.predict(X_test)
print("Accuracy of Model        \tTest Dataset")
print("Explained Variance (R^2) \t:", linreg_genre.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_test, y_test_pred)))
print()

Goodness of Fit of Model 	Train Dataset
Explained Variance (R^2) 	: 0.44800395294866946
Mean Squared Error (MSE) 	: 0.3430266515880524
Root Mean Squared Error (RMSE) 	: 0.5856847715179663

Accuracy of Model        	Test Dataset
Explained Variance (R^2) 	: 0.4256690691614351
Mean Squared Error (MSE) 	: 0.37590551042482817
Root Mean Squared Error (RMSE) 	: 0.6131113360759431



### b) Using only `studios`

In [181]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Extract Response and Predictors
y = pd.DataFrame(animeData_ohe['mean'])
X = pd.DataFrame(
    animeData_ohe[[col for col in animeData_ohe if 'studio-' in col]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit model
linreg_studio = LinearRegression()
linreg_studio.fit(X_train, y_train)

# predict
y_train_pred = linreg_studio.predict(X_train)

# Goodness of Fit for Train Data
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", linreg_studio.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:",
      mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_train, y_train_pred)))
print()

# Accuracy for Test Data
y_test_pred = linreg_studio.predict(X_test)
print("Accuracy of Model        \tTest Dataset")
print("Explained Variance (R^2) \t:", linreg_studio.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_test, y_test_pred)))
print()

Goodness of Fit of Model 	Train Dataset
Explained Variance (R^2) 	: 0.4190236176853088
Mean Squared Error (MSE) 	: 0.36649173061631374
Root Mean Squared Error (RMSE) 	: 0.6053856048968407

Accuracy of Model        	Test Dataset
Explained Variance (R^2) 	: -4.5928276807914545e+24
Mean Squared Error (MSE) 	: 2.9051449111004296e+24
Root Mean Squared Error (RMSE) 	: 1704448565108.5015



### c) Using only `media_type`

In [183]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Extract Response and Predictors
y = pd.DataFrame(animeData_ohe['mean'])
X = pd.DataFrame(animeData_ohe[[col for col in animeData_ohe if 'media_type_' in col]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit model
linreg_media_type = LinearRegression()
linreg_media_type.fit(X_train, y_train)

# predict
y_train_pred = linreg_media_type.predict(X_train)

# Goodness of Fit for Train Data
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", linreg_media_type.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:",
      mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_train, y_train_pred)))
print()

# Accuracy for Test Data
y_test_pred = linreg_media_type.predict(X_test)
print("Accuracy of Model        \tTest Dataset")
print("Explained Variance (R^2) \t:", linreg_media_type.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_test, y_test_pred)))
print()

Goodness of Fit of Model 	Train Dataset
Explained Variance (R^2) 	: 0.09672436475816382
Mean Squared Error (MSE) 	: 0.56362963582048
Root Mean Squared Error (RMSE) 	: 0.7507527128292512

Accuracy of Model        	Test Dataset
Explained Variance (R^2) 	: 0.10377370002762032
Mean Squared Error (MSE) 	: 0.5815163898581035
Root Mean Squared Error (RMSE) 	: 0.7625722194376763



### d) Using `genres` and `media_type`

In [188]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Extract Response and Predictors
y = pd.DataFrame(animeData_ohe['mean'])
X = pd.DataFrame(animeData_ohe[[col for col in animeData_ohe if 'genre-' in col or 'media_type_' in col]])

# Split the Dataset into Train and Test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# fit model
linreg_genre_media_type = LinearRegression()
linreg_genre_media_type.fit(X_train, y_train)

# predict
y_train_pred = linreg_genre_media_type.predict(X_train)

# Goodness of Fit for Train Data
print("Goodness of Fit of Model \tTrain Dataset")
print("Explained Variance (R^2) \t:", linreg_genre_media_type.score(X_train, y_train))
print("Mean Squared Error (MSE) \t:",
      mean_squared_error(y_train, y_train_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_train, y_train_pred)))
print()

# Accuracy for Test Data
y_test_pred = linreg_genre_media_type.predict(X_test)
print("Accuracy of Model        \tTest Dataset")
print("Explained Variance (R^2) \t:", linreg_genre_media_type.score(X_test, y_test))
print("Mean Squared Error (MSE) \t:", mean_squared_error(y_test, y_test_pred))
print("Root Mean Squared Error (RMSE) \t:",
      np.sqrt(mean_squared_error(y_test, y_test_pred)))
print()

Goodness of Fit of Model 	Train Dataset
Explained Variance (R^2) 	: 0.464147484340933
Mean Squared Error (MSE) 	: 0.3372254601597825
Root Mean Squared Error (RMSE) 	: 0.5807111675865916

Accuracy of Model        	Test Dataset
Explained Variance (R^2) 	: 0.47536605932800424
Mean Squared Error (MSE) 	: 0.33374574030119897
Root Mean Squared Error (RMSE) 	: 0.5777073136988997

