## Pipelines Challenge

In this challenge, we will be working with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing), where we will be predicting sales. 

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use a PCA to reduce number of dummy variables to 3 principal components. PCA will be used directly after the OneHotEncoder that outputs data into a SparseMatrix, so we will need to use the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using KBest
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models: 
        - Random Forest Regressor
        - Gradient Boosting Regressor 
        - Ridge Regression. 
- For the task 2, we will need to use the same approach from this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where we tried different kinds of scalers. (Use the article as reference.)

_________________________________

In [1]:
import pandas as pd
df = pd.read_csv("regression_exercise.csv")
df.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,FDA15,9.3,Low Fat,0.016047,Dairy,249.8092,OUT049,1999,Medium,Tier 1,Supermarket Type1,3735.138
1,DRC01,5.92,Regular,0.019278,Soft Drinks,48.2692,OUT018,2009,Medium,Tier 3,Supermarket Type2,443.4228
2,FDN15,17.5,Low Fat,0.01676,Meat,141.618,OUT049,1999,Medium,Tier 1,Supermarket Type1,2097.27
3,FDX07,19.2,Regular,0.0,Fruits and Vegetables,182.095,OUT010,1998,,Tier 3,Grocery Store,732.38
4,NCD19,8.93,Low Fat,0.0,Household,53.8614,OUT013,1987,High,Tier 3,Supermarket Type1,994.7052


In [2]:
# creating target variable
y = df["Item_Outlet_Sales"]
df = df.drop(["Item_Outlet_Sales","Item_Identifier"],axis = 1)

In [12]:
import pandas as pd
import sys
sys.path.append(r"C:\Users\silvh\OneDrive\lighthouse\custom python")
from silvhua import *
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

from sklearn.feature_selection import SelectKBest

from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import FeatureUnion
from sklearn.compose import ColumnTransformer

In [32]:
import numpy as np
import scipy

Split the dataset into a train and test set.

**Note:** We should always do this at the beginning before the pipeline.

In [4]:
df_train = df.sample(frac=0.8).sort_index()
y_train = y[y.index.isin(df_train.index.tolist())]

In [5]:
df_test = df[~df.index.isin(df_train.index.tolist())].sort_index()
y_test = y[y.index.isin(df_test.index.tolist())]
# SH 2022-10-31 15:35 Seems like a more involved way to split into test and train subsets
#  than train_test_split

---------------------
## Task I

### Split Features into numerical and categorical

In [6]:
cat_feats = df.dtypes[df.dtypes == 'object'].index.tolist()
num_feats = df.dtypes[~df.dtypes.index.isin(cat_feats)].index.tolist()

In [7]:
from sklearn.preprocessing import FunctionTransformer

# Using own function in Pipeline
def numFeat(data):
    return data[num_feats]

def catFeat(data):
    return data[cat_feats]

In [8]:
# we will start two separate pipelines for each type of features
keep_num = FunctionTransformer(numFeat)
keep_cat = FunctionTransformer(catFeat)

### replacing null values

### Creating dummy variables

#### *Scale numeric features* (not included in exercise instructions)

In [13]:
scaler = StandardScaler()

In [15]:
# use OneHotEncoder


### Use PCA to reduce the number of dummy variables to 3 principal components.

In [31]:
# don't forget ToDenseTransformer after one hot encoder 
class ToDenseTransformer():
    
    # here you define the operation it should perform
    def transform(self, X, y=None, **fit_params):
        return X.todense()

    # just return self
    def fit(self, X, y=None, **fit_params):
        return self
# SH 2022-10-31 16:25 this function doesn't work because `.todense()` 
    # returns error message, possibly due to the object type of the data

### Select the 3 best numeric features

In [3]:
# use SelectKBest

### Fitting models

In [19]:
# Use base_model in Task I
base_model = Ridge()

### Building a Pipeline

In [35]:
numeric_transform = Pipeline(steps=[
    ('keep_numeric', keep_num),
    ('impute_mean', SimpleImputer(strategy='mean')), # fill missing values
    ('scale', scaler), # scale
    ('select_kbest', SelectKBest(k=3)) # select 3 best numeric features
])
categorical_transform = Pipeline(steps=[
    ('keep_categorical', keep_cat),
    ('impute_mode', SimpleImputer(strategy='most_frequent')), # fill missing values
    ('one_hot_encode', OneHotEncoder(sparse=False)), # one-hot encode 
    # ('to_dense', ToDenseTransformer()),
    ('pca', PCA(n_components=3)),  # select 3 best categorical features
])
union = FeatureUnion([
    ('numeric', numeric_transform),
    ('categorical', categorical_transform)
])
pipeline = Pipeline(steps=[
    ('feature_union', union),
    ('ridge_regressor', base_model)
])

In [36]:
pipeline.fit(df_train, y_train)

In [41]:
ridge_y = pipeline.predict(df_test)
print('Ridge regression score:',pipeline.score(df_test,y_test))

Ridge regression score: 0.355797725830201


----------------------------
## Task II

### *Test parameter tuning with GridSearch for Ridge*

In [42]:
from sklearn.model_selection import GridSearchCV

In [60]:
params = {
    'feature_union__categorical__pca__n_components': [2,3,4]
}
grid_ridge = GridSearchCV(pipeline, param_grid=params)

In [61]:
grid_ridge.fit(df_test, y_test)
best_model = grid_ridge.best_estimator_
best_model

In [78]:
grid_ridge.cv_results_

{'mean_fit_time': array([0.09940906, 0.10576267, 0.09600883]),
 'std_fit_time': array([0.00889504, 0.01232014, 0.00594433]),
 'mean_score_time': array([0.00759788, 0.0082046 , 0.00639949]),
 'std_score_time': array([0.00243041, 0.00192069, 0.0014926 ]),
 'param_feature_union__categorical__pca__n_components': masked_array(data=[2, 3, 4],
              mask=[False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'feature_union__categorical__pca__n_components': 2},
  {'feature_union__categorical__pca__n_components': 3},
  {'feature_union__categorical__pca__n_components': 4}],
 'split0_test_score': array([0.38314762, 0.39365484, 0.39057693]),
 'split1_test_score': array([0.38482098, 0.38688649, 0.38169308]),
 'split2_test_score': array([0.32899884, 0.32531758, 0.33603943]),
 'split3_test_score': array([0.33483626, 0.33305193, 0.33418138]),
 'split4_test_score': array([0.29112719, 0.29655354, 0.29767655]),
 'mean_test_score': array([0.34458618, 0.34709288, 0.

In [90]:
# Get the best score from the grid search's best model
print('best score:',grid_ridge.best_score_)

best score: 0.34803347302981547


In [79]:
grid_ridge.scorer_

<function sklearn.metrics._scorer._passthrough_scorer(estimator, *args, **kwargs)>

In [219]:
# print('Final score is: ', tuned_model.score(df_test, y_test))

Final score is:  0.6241741712069144


### *Test Grid Search for multiple estimators in a for loop*

In [91]:
classifiers = {
    # 'ridge': Ridge(),
    'gradient': GradientBoostingRegressor(),
    'random_forest': RandomForestRegressor()
}
params = {
    'feature_union__categorical__pca__n_components': [2,4]
}

grid = dict()

for name, classifier in classifiers.items():
    # Create a new pipeline with classifier as last step
    grid_pipeline = Pipeline(steps=[
        ('feature_union', union),
        ('classifier', classifier)
    ])
    grid[name] = GridSearchCV(grid_pipeline, param_grid=params)
    grid[name].fit(df_test, y_test)
    print(f'Best grid search score for {name}: {grid[name].best_score_}')

Best grid search score for gradient: 0.5630236775755714
Best grid search score for random_forest: 0.5253332488848742


In [85]:
grid.keys()

dict_keys(['gradient', 'random_forest'])

In [65]:
grid['gradient'].fit(df_test, y_test)
grid['gradient'].best_estimator_

In [93]:
grid['random_forest']