The merged dataset has quite a lot of features. I will explore them and see whether some can be optimized, using two techniques:

- Graphical [Scatter Plots](21.Scatter%20plots%20for%20EDA.ipynb) between numerical features and the dependent variable.

- Using [PCA and Scree plot](22.PCA%20and%20Screeplot.ipynb).

Before going further, I've taken the time to write new functions in my [my_utils](my_utils.py) library to make it easier to load datasets from the NPZ backup file created in the previous [notebook](17.The%20global%20Dataset%20-%20Merging%20all%20the%20datasets%20into%20a%20big%20one.ipynb):

- load_npz_as_dict()
- load_dataset()
- load_Xy_as_dict()
- load_Xy()

The code of these functions is inspired by the code I wrote in my [my_lib.py library](https://github.com/epfl-extension-school/project-adsml19-c4-s11-3871-2111/blob/master/mylib.py) while completing my [course #4 project](https://github.com/xnicolovici/machine_learning/tree/master/notebooks/Course%20No%204/11.%20Course%20project), and by the function *duildDataMatrix()* written in the [House Price model training chapter](https://github.com/epfl-extension-school/project-adsml19-c3-s9-3871-2111/blob/master/house-prices/house-prices-solution-2-of-2.ipynb) while completing my [course #3 project](https://github.com/xnicolovici/machine_learning/tree/master/notebooks/Course%20No%203/09.%20Course%20project).


In [1]:
# Load my_utils.ipynb in Notebook
from ipynb.fs.full.my_utils import *

Opening connection to database
Add pythagore() function to SQLite engine
Fraction of the dataset used to train models: 10.00%
my_utils library loaded :-)


# load_npz_as_dict()

Remember, in the previous [notebook](17.The%20global%20Dataset%20-%20Merging%20all%20the%20datasets%20into%20a%20big%20one.ipynb), I stored all the datasets in an NPZ file so that they can easily be reloaded from disk.

The *load_npz_as_dict()* function returns a Python *dict* object filled with the dataset loaded from an NPZ file. The function expects two main parameters:
- filename, the path to the NPZ file on disk, which defaults to the *NPZ_NORMALIZED_DATAFILE* constant.
- dataset, the name of the dataset I'd like to load, the default being the *full* one. The others are *stations*, *travel*, *weather_num* and *weather_cat*.

Two other optional parameters are available:
- frac, a value between 0 and 1 used to get a sample of the requested dataset, 1 being 100% and 0 none. This parameter will be useful when coding and training models, to work on a small subset of the data: 1.5 million rows and 48 features might be too heavy to process on my desktop computer, even if it runs on Apple M1 Silicon ;-)
- verbose, a boolean parameter that asks the function to display some information when its value is *True*.

When called, this function opens the NPZ file, loads the requested dataset and converts it to a *pandas.DataFrame* object using the column names stored in the NPZ file. It adds the DataFrame to a Python *dict* object along with the *frac* parameter value and, if the requested dataset is the *full* one, also adds the feature name details of this *full* dataset (numerical, categorical, y and all).

This utility function will be used each time I need to get one of the *engineered* datasets, using the *frac* parameter to work on a subset of it.

> Note: When verbose=True, this function displays the *shape* of the dataset returned.
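
To make the steps above more concrete, here is a minimal sketch of the same idea. It is not the actual code from my_utils: the function name `load_npz_sketch()`, the NPZ key layout (`<name>` for the values, `<name>_columns` for the column names) and the toy data are all assumptions made for illustration.

```python
import io

import numpy as np
import pandas as pd


def load_npz_sketch(file, dataset="full", frac=1, random_state=5):
    """Hypothetical, simplified stand-in for load_npz_as_dict()."""
    with np.load(file) as npz:
        # Assumed layout: '<name>' holds the values,
        # '<name>_columns' holds the matching column names.
        values = npz[dataset]
        columns = npz[f"{dataset}_columns"]

    df = pd.DataFrame(values, columns=list(columns))

    # Keep only a random fraction of the rows when frac < 1
    if frac < 1:
        df = df.sample(frac=frac, random_state=random_state)

    return {"dataset": df, "frac": frac}


# Build a toy NPZ archive in memory (83 rows, 5 columns of dummy values)
buffer = io.BytesIO()
np.savez(
    buffer,
    stations=np.arange(83 * 5, dtype=float).reshape(83, 5),
    stations_columns=np.array(["c0", "c1", "c2", "c3", "c4"]),
)
buffer.seek(0)

result = load_npz_sketch(buffer, dataset="stations", frac=0.1)
print(result["dataset"].shape)
```

With 83 rows and frac=0.1, *DataFrame.sample()* rounds 8.3 to 8 rows, so the sketch prints a shape of (8, 5).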


## Header of *load_npz_as_dict()*
The implementation can be read in the [my_utils](my_utils.ipynb) library.

    def load_npz_as_dict(filename=NPZ_DATAFILE, dataset='full', frac=1, verbose=True) -> dict:
        """
        This function returns one of the datasets stored in the NPZ file passed as parameter,
        and if the requested dataset is the full one, then its feature names are added to the
        dict returned by the function.

        The dict structure returned looks like this:
            - feature_names: (if requested dataset is the full one)
                - numerical
                - categorical
                - all
                - y
            - dataset
            - frac  

        The NPZ file passed should contain a Python dict built in Notebook No 17

        The dataset parameter is used to determine which dataset the function should return.
        Default value is 'full'

        The frac parameter is a value between 0 and 1 used to return a sample of the requested
        dataset, 1 being 100%.

        Returns:
        --------
        dict

        """
    

## Demonstration of the *load_npz_as_dict()* function

I'd like to load 10% of the *stations* dataset and display the first three rows.

As the *stations* dataset contains 83 rows and 5 columns, I should obtain a dataset with shape=(8, 5): 10% of 83 rows, rounded, with 5 columns.

In [2]:
npz_dict=load_npz_as_dict(dataset='stations', frac=0.1)
df=npz_dict['dataset']
df.head(3)

Loading dataset 'stations' from NPZ file ./data/capstone-data-normalized.npz
Building sample from dataset (frac=0.1)
 Dataset shape: (8, 5)


Dataset loaded, returning dict


Unnamed: 0,STATION,NAME,LATITUDE,LONGITUDE,ELEVATION
58,USC00282023,"CRANFORD, NJ US",40.6666,-74.3235,24.4
25,US1NJHD0002,"KEARNY 1.7 NW, NJ US",40.7729,-74.1409,29.0
14,US1NJUN0014,"WESTFIELD 0.6 NE, NJ US",40.6588,-74.3358,36.3


# load_dataset()

Along with the *load_npz_as_dict()* function, I've written a wrapper function around it: *load_dataset()*

This wrapper function returns a tuple instead of a dict, to simplify the calling code. For example, the following instruction loads the full dataset, stores it in the *df* variable, and initializes a *features* variable containing a list of the remaining values of the returned tuple.
    
        df,*features=load_dataset()

Be aware that this function removes the name of the result vector column from the returned list of feature column names. This is a big difference from the *load_npz_as_dict()* function, which returns **all** DataFrame column names in a single list (*load_npz_as_dict()['features']['all']*).
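
As a small illustration of the starred unpacking used with this tuple, here is a sketch with a hypothetical stand-in, `fake_load_dataset()`, that only mimics the shape of the tuple described for *load_dataset()*. The placeholder values and feature names are invented for the example.

```python
def fake_load_dataset():
    """Hypothetical stand-in that mimics the tuple described for load_dataset()."""
    dataset = [[1.0, 2.0, 3.0]]        # placeholder for the DataFrame
    x_columns = ["feat_a", "feat_b"]   # feature names, result column excluded
    y_column = "target"                # name of the result vector
    numerical = ["feat_a"]
    categorical = ["feat_b"]
    return dataset, x_columns, y_column, numerical, categorical


# The starred name collects every remaining value of the tuple in a list
df, *features = fake_load_dataset()
print(len(features))

# The list can be unpacked again when the individual values are needed
x_columns, y_column, numerical, categorical = features

# Or the values that are not needed can be discarded with *_
df, x_columns, *_ = fake_load_dataset()
print(x_columns)
```

The first *print()* shows 4, since the tuple holds four values after the dataset, and the result vector name is not part of *x_columns*.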

## Header of *load_dataset()*
The implementation can be read in the [my_utils](my_utils.ipynb) library.

    def load_dataset(frac=1, random_state=5, verbose=True, y_dtype='float', npz_filename=NPZ_DATAFILE) -> tuple:
        """
        Convenient wrapper around the load_npz_as_dict() function that returns the full dataset, its feature names
        and its result vector name as a tuple.

        This function exists to simplify the code when loading the full dataset. For example, the following instruction
        loads the full dataset and stores it in the df variable; the features variable will contain a list of the rest of the tuple

            df,*features=load_dataset()

        Parameters are passed as is to the load_npz_as_dict() function.

        Returned tuple is:
            - dataset
            - all feature names
            - y result vector name
            - numerical feature names
            - categorical feature names

        Returns:
        --------
        tuple

        """
    

## Demonstration of the *load_dataset()* function

I'd like to load the *full* dataset (the default, *frac=1*) and display the feature names and the first three rows of the dataset returned:


In [3]:
df,x_column,*_=load_dataset(verbose=False)

print("Feature column names:")
print(','.join(x_column))

print("\nFirst three line of dataset:")
df.head(3)


Feature column names:
pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,weekend,day_period_afternoon,day_period_evening,day_period_morning,passenger_alone,diff_ELEVATION,diff_ASCENDING,diff_DESCENDING,WC_WT01,WC_WT02,WC_WT03,WC_WT04,WC_WT06,WC_WT08,WC_WT09,WC_WT11,WC_WDIR_E,WC_WDIR_N,WC_WDIR_NE,WC_WDIR_NW,WC_WDIR_S,WC_WDIR_SE,WC_WDIR_SW,WC_WDIR_W,WC_PEAK_Y,WC_SNOW_FALL,WC_SNOW_ROAD,distance_in_km_square_log10,dropoff_distance_to_STATION_log1p,WNP_AWND_log1p,WNP_SNWD_log1p,WND_AWND_log1p,WND_SNWD_log1p

First three line of dataset:


Unnamed: 0,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,store_and_fwd_flag,weekend,day_period_afternoon,day_period_evening,day_period_morning,km_per_hour,passenger_alone,diff_ELEVATION,diff_ASCENDING,diff_DESCENDING,WC_WT01,WC_WT02,WC_WT03,WC_WT04,WC_WT06,WC_WT08,WC_WT09,WC_WT11,WC_WDIR_E,WC_WDIR_N,WC_WDIR_NE,WC_WDIR_NW,WC_WDIR_S,WC_WDIR_SE,WC_WDIR_SW,WC_WDIR_W,WC_PEAK_Y,WC_SNOW_FALL,WC_SNOW_ROAD,distance_in_km_square_log10,dropoff_distance_to_STATION_log1p,WNP_AWND_log1p,WNP_SNWD_log1p,WND_AWND_log1p,WND_SNWD_log1p
0,-73.982155,40.767937,-73.96463,40.765602,0,0,1,0,0,1.073954,1,0.0,0,0,1,0,0,0,0,1,0,0,1,0,1,0,0,0,0,0,1,0,0,0.351326,0.931212,1.987874,0.0,1.987874,0.0
1,-73.980415,40.738564,-73.999481,40.731152,0,1,0,0,0,0.991388,1,37.2,0,1,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0.513198,1.83852,1.526056,0.0,1.173514,0.0
2,-73.979027,40.763939,-74.005333,40.710087,0,0,0,0,1,1.034316,1,37.2,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,1,1,1,1.610335,1.444618,1.774952,0.0,0.0,2.87168


Cool, isn't it?

# load_Xy_as_dict()

Training a model relies on an *np.array* of features (X) and a result vector (y) built from the dataset and split into train and validation subsets.

I've coded the *load_Xy_as_dict()* function to do the job in one call:

- Load the full dataset using *load_dataset()*
- Split the DataFrame into two subsets, *train* and *valid*, using the *sklearn.model_selection.train_test_split()* function
- Print some information on the shape of the subsets created (if verbose=True)
- Return a dict of the different *np.arrays* created
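
The steps above can be sketched as follows. This is not the my_utils implementation: `split_as_dict()` is a hypothetical helper that takes a DataFrame directly instead of calling *load_dataset()*, and the toy data and column names are invented. It only shows how *train_test_split()* output can be arranged in a dict with train, valid and all entries.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_as_dict(df, x_columns, y_column, train_size=0.8, random_state=5):
    """Hypothetical sketch: split a DataFrame into X/y arrays and return them in a dict."""
    X = df[x_columns].to_numpy()
    y = df[y_column].to_numpy()

    # train_size is passed as is to train_test_split()
    X_tr, X_va, y_tr, y_va = train_test_split(
        X, y, train_size=train_size, random_state=random_state
    )

    return {
        "train": {"X": X_tr, "y": y_tr},
        "valid": {"X": X_va, "y": y_va},
        "all": {"X": X, "y": y},
        "features": list(x_columns),
        "result": y_column,
    }


# Toy dataset: 100 rows, two features and one result column
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.normal(size=(100, 3)), columns=["feat_a", "feat_b", "target"])

xy = split_as_dict(toy, ["feat_a", "feat_b"], "target")
print(xy["train"]["X"].shape, xy["valid"]["X"].shape)
```

With 100 rows and train_size=0.8, the train subset gets 80 rows and the validation subset 20.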


## Header of *load_Xy_as_dict()*
The implementation can be read in the [my_utils](my_utils.ipynb) library.

    def load_Xy_as_dict(train_size=TRAIN_SIZE_DEFAULT, frac=1, random_state=5, verbose=True, npz_filename=NPZ_DATAFILE) -> dict:
        """
        Used to get features and vector result of the 'full' dataset as X and y np.array, split into two datasets: a train
        and a valid one.

        The 'train_size' parameter may be used to fix the train size (defaults 0.8). This parameter is passed as is to the
        'sklearn.model_selection.train_test_split()' method.

        The value returned is a dict object:

            - train:
                - X:    Train set of X features
                - y:    Train set of y vector result

            - valid:
                - X:    Validation set of X features
                - y:    Validation set of y vector result

            - all:
                - X:    Complete set of X features
                - y:    Complete set of y vector result

            - features: List of feature names
            - result:   Name of the y result vector

        The 'full' dataset is retrieved using the 'load_dataset()' function.

        Returns:
        --------
        dict

        """
    

## Demonstration of the *load_Xy_as_dict()* function

I'd like to load the X train feature values built from a 1% sample of the *full* dataset:


In [4]:
X_tr=load_Xy_as_dict(frac=0.01, verbose=False)['train']['X']

print("Shape of the X train dataset using 1% of the full dataset:", X_tr.shape)

Shape of the X train dataset using 1% of the full dataset: (11411, 38)


# load_Xy()

This is a wrapper function around *load_Xy_as_dict()* that returns the train and valid X/y values as a tuple, to simplify the code in the next notebooks.

More information is given below in the header of the function.
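
The wrapper idea can be sketched in a few lines. Both functions below are hypothetical stand-ins with invented values: `fake_load_Xy_as_dict()` only mimics the train/valid part of the dict, and `load_Xy_sketch()` shows how such a dict can be flattened into a tuple.

```python
def fake_load_Xy_as_dict():
    """Hypothetical stand-in returning a dict shaped like the train/valid part of load_Xy_as_dict()."""
    return {
        "train": {"X": [[1, 2], [3, 4]], "y": [0.5, 1.5]},
        "valid": {"X": [[5, 6]], "y": [2.5]},
    }


def load_Xy_sketch():
    """Flatten the nested dict into a (X_tr, y_tr, X_va, y_va) tuple."""
    xy = fake_load_Xy_as_dict()
    return xy["train"]["X"], xy["train"]["y"], xy["valid"]["X"], xy["valid"]["y"]


# One instruction gives the four values, with no dict keys to remember
X_tr, y_tr, X_va, y_va = load_Xy_sketch()
print(len(X_tr), len(y_tr), len(X_va), len(y_va))
```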

## Header of *load_Xy()*
The implementation can be read in the [my_utils](my_utils.ipynb) library.

    def load_Xy(train_size=TRAIN_SIZE_DEFAULT, frac=1, random_state=5, verbose=True, npz_filename=NPZ_DATAFILE) -> tuple:
        """
        A wrapper function around load_Xy_as_dict() that returns X_tr, y_tr, X_va and y_va as a tuple.

        This function aims to simplify the code in Notebooks

        Returns:
        --------
        (X_tr, y_tr, X_va, y_va)

        """
    

## Demonstration of the *load_Xy()* function

I'd like to load, in one instruction, the train and validation values from a 1% sample of the *full* dataset, with 60% of the rows going to the train set:


In [5]:
X_tr, y_tr, X_va, y_va = load_Xy(frac=0.01, train_size=0.6)
print("X_tr shape", X_tr.shape)
print("y_tr shape", y_tr.shape)
print("X_va shape", X_va.shape)
print("y_va shape", y_va.shape)

X_tr shape (8558, 38)
y_tr shape (8558,)
X_va shape (5706, 38)
y_va shape (5706,)
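
These shapes can be cross-checked with a little arithmetic. The sketch below assumes that, when only a fractional *train_size* is given, *train_test_split()* floors the train count and gives the remaining rows to the validation set, which is my understanding of its behaviour. The `split_sizes()` helper is hypothetical, and the row count is taken from the shapes printed above.

```python
import math


def split_sizes(n_rows, train_size):
    """Hypothetical helper: expected (train, valid) row counts for a fractional train_size."""
    n_train = math.floor(train_size * n_rows)
    return n_train, n_rows - n_train


# Rows in the 1% sample, from the shapes printed above
n_rows = 8558 + 5706

print(split_sizes(n_rows, 0.6))
print(split_sizes(n_rows, 0.8))
```

With train_size=0.6 this gives (8558, 5706), matching the output above, and with train_size=0.8 the train count is 11411, which matches the shape displayed in the *load_Xy_as_dict()* demonstration.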


# Let's continue
to the next notebook, [Scatter plots for EDA](21.Scatter%20plots%20for%20EDA.ipynb)