<a href="https://colab.research.google.com/github/villafue/Capstone_2_MovieLens/blob/main/MovieLens.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Introduction

Hello! My name is Roland, and this is my first data science project as well as my first Capstone project with Springboard. My goal was to explore a range of models, including scikit-learn's Stacking Regressor.

I did my best to give my notebook a logical flow and organization. Everything is categorized into sections and hyperlinked for easier navigation. Wherever possible, I included Markdown cells between the code to explain the thinking behind what I did. Please do not hesitate to contact me, and thank you for reading!

Lastly, I am grateful for the plethora of notebooks that others were kind enough to share on Kaggle. I adapted much of their code and many of their processes in my notebook. They are referenced in section 6.3.


***

## Content

Note: The internal links work when the notebook is run via Google Colab.

1. **[Import Packages](#import_packages)**
2. **[Load Data](#load_data)**
3. **[Data Preparation](#data_preparation)**
    - 3.1 - [Remove Outliers](#remove_outliers)
    - 3.2 - [Data Exploration](#data_exploration)   
    - 3.3 - [Treat Missing Values](#treat_missing_values)   
    - 3.4 - [Check Duplicates](#duplicates)
4. **[Exploratory Data Analysis](#exploratory_data_analysis)**
    - 4.1 - [Correlation Matrix](#correlation_matrix)
    - 4.2 - [Feature Engineering](#feature_engineering)
        - 4.2.1 - [Polynomials](#polynomials)
        - 4.2.2 - [Interior](#interior)
        - 4.2.3 - [Architectural & Structural](#architectural_&_structural)
        - 4.2.4 - [Exterior](#exterior)
        - 4.2.5 - [Location](#location)
        - 4.2.6 - [Land](#land)
        - 4.2.7 - [Access](#access)
        - 4.2.8 - [Utilities](#utilities)
        - 4.2.9 - [Miscellaneous](#miscellaneous)
    - 4.3 - [Target Variable](#target_variable)
    - 4.4 - [Treating Skewed Features](#treating_skewed_features)
5. **[Modeling](#modeling)**
    - 5.1 - [Model Exploration](#model_exploration)
        - 5.1.1 - [Baseline](#baseline)
        - 5.1.2 - [Simple Models](#simple_models)
        - 5.1.3 - [Advanced Linear Models](#advanced_linear_models)
        - 5.1.4 - [Ensemble Tree Models](#ensemble_tree_models)
    - 5.2 - [Training](#training)
        - 5.2.1 - [Feature Selection](#feature_selection)
        - 5.2.2 - [Reduction Comparison](#reduction_comparison)
    - 5.3 - [Optimization](#optimization)
        - 5.3.1 - [Linear Regression](#linear_regression)
        - 5.3.2 - [Ridge Regression](#ridge_regression)
        - 5.3.3 - [Lasso Regression](#lasso_regression)
        - 5.3.4 - [Elastic Net Regression](#elastic_net_regression)
        - 5.3.5 - [Random Forest Regressor](#random_forest_regressor)
        - 5.3.6 - [AdaBoost Regressor](#adaboost_regressor)
        - 5.3.7 - [GradientBoost Regressor](#gradientboost_regressor)
        - 5.3.8 - [XGBoost Regressor](#xgboost_regressor)
        - 5.3.9 - [Models Summary](#models_summary)
    - 5.4 - [Stacking](#stacking)
6. **[Conclusion](#conclusion)**
    - 6.1 - [Submission](#submission)
    - 6.2 - [Final Thoughts](#final_thoughts)
    - 6.3 - [References](#references)
    - 6.4 - [TPOT](#TPOT)

<a id='import_packages'></a>
# 1. 

---

<a name="import_packages"></a>
## Import Packages

In [None]:

# This first set of packages includes pandas for data manipulation,
# NumPy for mathematical computation, and matplotlib & seaborn for visualisation.
import pandas as pd
pd.set_option('display.notebook_repr_html', True)
# Use the full option name; the bare 'max_columns' can match more than one option in newer pandas.
pd.set_option('display.max_columns', 82)
pd.options.display.max_rows = 100
import numpy as np
from IPython.display import display
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print('1: Data Manipulation, Mathematical Computation and Visualisation packages imported!')

# Statistical packages used for transformations
from scipy import stats
from scipy.stats import skew, norm
from scipy.special import boxcox1p
from scipy.stats import pearsonr  # public import path; scipy.stats.stats is deprecated
print('2: Statistical packages imported!')

# Metrics used for measuring the accuracy and performance of the models
from sklearn import metrics
from sklearn.metrics import mean_squared_error
print('3: Metrics packages imported!')

# Algorithms used for modeling
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, GradientBoostingRegressor
import xgboost as xgb
print('4: Algorithm packages imported!')

# Pipeline and scaling preprocessing will be used for models that are sensitive to feature scale
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
print('5: Pipeline and preprocessing packages imported!')

# Model selection packages used for sampling dataset and optimising parameters
from sklearn import model_selection
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import ShuffleSplit
print('6: Model selection packages imported!')

# Set visualisation colours
mycols = ["#66c2ff", "#5cd6d6", "#00cc99", "#85e085", "#ffd966", "#ffb366", "#ffb3b3", "#dab3ff", "#c2c2d6"]
print('7: My colours are ready! :)')

# Silence noisy warnings by replacing warnings.warn with a no-op
# (this also hides warnings raised by sklearn and seaborn)
import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn
warnings.filterwarnings("ignore", category=DeprecationWarning)
print('8: Deprecation warning will be ignored!')

1: Data Manipulation, Mathematical Computation and Visualisation packages imported!
2: Statistical packages imported!
3: Metrics packages imported!
4: Algorithm packages imported!
5: Pipeline and preprocessing packages imported!
6: Model selection packages imported!
7: My colours are ready! :)
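The import cell above silences warnings for the whole session by replacing `warnings.warn`. A narrower alternative, sketched here with only the standard library, is to suppress warnings inside a `with` block so that the previous behaviour comes back afterwards. `noisy_step` is a hypothetical stand-in for a library call that emits a `DeprecationWarning`; it is not part of this notebook.

```python
import warnings

def noisy_step():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress DeprecationWarning only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    quiet_result = noisy_step()

# Filters are restored when the block exits. With record=True the warnings
# raised inside the block are collected in a list instead of being printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    loud_result = noisy_step()

print(quiet_result, loud_result, len(caught))
```

This sketch assumes `warnings.warn` has not already been replaced, so it is best tried in a fresh session.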


In [None]:
plt.rcParams.update(plt.rcParamsDefault)
%matplotlib inline
plt.style.use(['seaborn-whitegrid'])
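sns.set_palette(palette = mycols, n_colors = 4)
# Note: sns.set() applies seaborn's own defaults, so it overrides the palette set on the previous line.
sns.set(context='notebook', palette='deep')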
print('9: My Axes are visible in "Dark Mode!"')

9: My Axes are visible in "Dark Mode!"
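One caveat with the style cell above: to my knowledge, Matplotlib 3.6 renamed its bundled seaborn styles to `seaborn-v0_8-*` and later releases dropped the old names, so `'seaborn-whitegrid'` may not exist on a current Colab runtime. Here is a minimal sketch of a fallback. `pick_style` is a hypothetical helper, not part of Matplotlib, and the `available` list is a hard-coded example rather than the real contents of `plt.style.available`.

```python
def pick_style(candidates, available):
    """Return the first candidate style name that is available, else 'default'."""
    for name in candidates:
        if name in available:
            return name
    return "default"

# Hard-coded example list; in the notebook this would be plt.style.available.
available = ["classic", "ggplot", "seaborn-v0_8-whitegrid"]

# Try the old name first, then the renamed one.
style = pick_style(["seaborn-whitegrid", "seaborn-v0_8-whitegrid"], available)
print(style)
```

In the notebook, the result could be passed on as `plt.style.use(pick_style([...], plt.style.available))`.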


<a id='load_data'></a>
# 2. 


<a name="load_data"></a>
## Load Data

First, I load the Netflix titles dataset directly into a pandas DataFrame. I load the data file straight from my GitHub, as this gives me the flexibility to work from different computers.





In [None]:
# pandas profiling

In [None]:
url = 'https://raw.githubusercontent.com/villafue/Capstone_2_Netflix/main/Data/netflix_titles.csv'
nf = pd.read_csv(url)
print(nf.shape, '\n' * 2, 'The Netflix dataset has: ', nf.shape[0], 'rows and ', nf.shape[1], 'columns.')
# This makes a little barrier between printed outputs.
print('\n', '=' * 136, '\n' * 2, 'nf Set:', '\n')
display(nf.head())
print('\n', '=' * 136)

(7787, 12) 

 The Netflix dataset has:  7787 rows and  12 columns.


 nf Set: 



| | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | s1 | TV Show | 3% | NaN | João Miguel, Bianca Comparato, Michel Gomes, R... | Brazil | August 14, 2020 | 2020 | TV-MA | 4 Seasons | International TV Shows, TV Dramas, TV Sci-Fi &... | In a future where the elite inhabit an island ... |
| 1 | s2 | Movie | 7:19 | Jorge Michel Grau | Demián Bichir, Héctor Bonilla, Oscar Serrano, ... | Mexico | December 23, 2016 | 2016 | TV-MA | 93 min | Dramas, International Movies | After a devastating earthquake hits Mexico Cit... |
| 2 | s3 | Movie | 23:59 | Gilbert Chan | Tedd Chan, Stella Chung, Henley Hii, Lawrence ... | Singapore | December 20, 2018 | 2011 | R | 78 min | Horror Movies, International Movies | When an army recruit is found dead, his fellow... |
| 3 | s4 | Movie | 9 | Shane Acker | Elijah Wood, John C. Reilly, Jennifer Connelly... | United States | November 16, 2017 | 2009 | PG-13 | 80 min | Action & Adventure, Independent Movies, Sci-Fi... | In a postapocalyptic world, rag-doll robots hi... |
| 4 | s5 | Movie | 21 | Robert Luketic | Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar... | United States | January 1, 2020 | 2008 | PG-13 | 123 min | Dramas | A brilliant group of students become card-coun... |
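To show what `pd.read_csv` does with blank fields, like the missing director in the first row, here is a small sketch that reads a made-up two-row CSV from memory instead of the GitHub URL. The `sample_csv` text is invented for illustration and is not the real dataset.

```python
import io
import pandas as pd

# Made-up sample with a few of the same column names as the real file.
sample_csv = io.StringIO(
    "show_id,type,title,director,release_year\n"
    "s1,TV Show,Example Show,,2020\n"
    "s2,Movie,Example Movie,Some Director,2016\n"
)

sample = pd.read_csv(sample_csv)

# shape is a (rows, columns) tuple, just like nf.shape.
n_rows, n_cols = sample.shape
print('rows:', n_rows, 'columns:', n_cols)

# Empty fields are read as NaN, so isna() can count them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)
```

The same `isna().sum()` call should work on `nf` to see how many values are missing in each column.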





The "Test Set" has one fewer column than my train set. This is because the dependent variable, "SalePrice," is kept secret for the test set and is used as the final measure of my model's accuracy.



Next, I want to see the column names.

In [None]:
nf.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

Now, I will save the 'Id' columns from both datasets, as they are needed to prepare the final submission data. I'll then drop them from my training and test datasets, since they are redundant for modeling.

Here is a helpful reference on `.format` and strings: [Python Format Function](https://www.geeksforgeeks.org/python-format-function/)

In [None]:
train_ID = train['Id']
test_ID = test['Id']

train.drop("Id", axis = 1, inplace = True)
test.drop("Id", axis = 1, inplace = True)

print("The train data size after dropping Id feature is : {} ".format(train.shape)) 
print("The test data size after dropping Id feature is : {} ".format(test.shape))


The train data size after dropping Id feature is : (1460, 80) 
The test data size after dropping Id feature is : (1459, 79) 
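Since the cell above leans on `.format`, here is a short sketch of how the `{}` placeholder gets filled in, next to the equivalent f-string. The `shape` tuple is hard-coded from the output above rather than taken from a real dataframe.

```python
# Hard-coded stand-in for train.shape.
shape = (1460, 80)

# Positional formatting: {} is replaced by the string form of the argument.
with_format = "The train data size is : {}".format(shape)

# Placeholders can also be numbered to pick out individual values.
with_index = "{0} rows and {1} columns".format(shape[0], shape[1])

# An f-string builds the same text as the first line.
with_fstring = f"The train data size is : {shape}"

print(with_format)
print(with_index)
```

Either style works; I find f-strings a little easier to read once there is more than one value to insert.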


***

I'll change my working directory to the folder where I upload files from my computer into Colab.

In [None]:
# import os
# os.chdir('sample_data')