## Pipelines Challenge

In this challenge, we will be working with this [dataset](https://drive.google.com/file/d/1B07fvYosBNdIwlZxSmxDfeAf9KaygX89/view?usp=sharing), where we will be predicting sales. 

**The main goal is to create a `pipeline` that covers all the data preprocessing and modeling steps.**


**TASK 1**: Build a pipeline that ends with a regression model, to predict `Item_Outlet_Sales` from the dataset. 

**The pipeline should have the following steps:**

1. Split the features into numerical and categorical (text)
2. Replace null values
    - the mean for numerical variables
    - the most frequent value for categorical variables
3. Create dummy variables from categorical features
4. Use PCA to reduce the dummy variables to 3 principal components. PCA comes directly after the `OneHotEncoder`, which outputs a sparse matrix, so we will need the **ToDenseTransformer** from the [article about custom pipelines](https://queirozf.com/entries/scikit-learn-pipelines-custom-pipelines-and-pandas-integration).
5. Select the 3 best candidates from the original numerical features using `SelectKBest`
6. Fit a Ridge regression (default alpha is fine for now)

**TASK 2**: Tune the parameters of multiple models as well as the preprocessing steps and find the best solution.
- Try these models:
    - Random Forest Regressor
    - Gradient Boosting Regressor
    - Ridge Regression
- For Task 2, use the approach from this [earlier article](https://iaml.it/blog/optimizing-sklearn-pipelines), in the section `PIPELINE TUNING (ADVANCED VERSION)`, where different kinds of scalers are tried. (Use the article as reference.)

_________________________________

In [104]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler, RobustScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin

In [91]:
# Read the data from the CSV file
df = pd.read_csv('regression_exercise.csv')

# Display the first few rows of the dataframe
df.head()

|   | Item_Identifier | Item_Weight | Item_Fat_Content | Item_Visibility | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Establishment_Year | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | FDA15 | 9.3 | Low Fat | 0.016047 | Dairy | 249.8092 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 3735.138 |
| 1 | DRC01 | 5.92 | Regular | 0.019278 | Soft Drinks | 48.2692 | OUT018 | 2009 | Medium | Tier 3 | Supermarket Type2 | 443.4228 |
| 2 | FDN15 | 17.5 | Low Fat | 0.01676 | Meat | 141.618 | OUT049 | 1999 | Medium | Tier 1 | Supermarket Type1 | 2097.27 |
| 3 | FDX07 | 19.2 | Regular | 0.0 | Fruits and Vegetables | 182.095 | OUT010 | 1998 | NaN | Tier 3 | Grocery Store | 732.38 |
| 4 | NCD19 | 8.93 | Low Fat | 0.0 | Household | 53.8614 | OUT013 | 1987 | High | Tier 3 | Supermarket Type1 | 994.7052 |


In [92]:
# Create the target variable
y = df["Item_Outlet_Sales"]

In [93]:
# Drop 'Item_Outlet_Sales' and 'Item_Identifier' columns from the DataFrame
df = df.drop(["Item_Outlet_Sales", "Item_Identifier"], axis=1)

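Split the dataset into a train and test set.

**Note:** We should always split before fitting the pipeline, so that imputation values, encodings, PCA components, and selected features are learned from the training data only. Here the split is done in the *Building a Pipeline* section of Task I, just before the pipeline is fitted.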
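
To make the note concrete, here is a minimal sketch on toy data (the array below is hypothetical, not the sales dataset). It shows that an imputer fitted on the training rows alone learns a different mean than one computed over all rows, which is the kind of information that would leak from the test set if the split came later.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy single-feature data with two missing values (hypothetical values)
X = np.array([[1.0], [2.0], [np.nan], [4.0], [100.0], [6.0], [np.nan], [8.0]])
y = np.arange(len(X))

# shuffle=False keeps the example deterministic: the last 2 rows become the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Fit the imputer on the training rows only, then reuse it on the test rows
imputer = SimpleImputer(strategy="mean").fit(X_train)
X_test_imputed = imputer.transform(X_test)

train_mean = imputer.statistics_[0]   # mean of the non-missing training values
full_mean = np.nanmean(X)             # mean that would include the test rows

print("Mean learned from the training rows:", train_mean)
print("Mean over all rows (includes test rows):", full_mean)
```

Placing the imputer inside a `Pipeline` gives the same guarantee automatically, because `fit` only ever sees the rows passed to it.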

---------------------
## Task I

### Split Features into numerical and categorical

In [94]:
# Split the features into numerical and categorical
numerical_features = df.select_dtypes(include=['number']).columns
categorical_features = df.select_dtypes(include=['object']).columns

In [95]:
print("Numerical features:", numerical_features)
print("Categorical features:", categorical_features)

Numerical features: Index(['Item_Weight', 'Item_Visibility', 'Item_MRP',
       'Outlet_Establishment_Year'],
      dtype='object')
Categorical features: Index(['Item_Fat_Content', 'Item_Type', 'Outlet_Identifier', 'Outlet_Size',
       'Outlet_Location_Type', 'Outlet_Type'],
      dtype='object')
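
As an alternative to storing the column lists up front, scikit-learn's `make_column_selector` can pick columns by dtype at fit time. The sketch below uses a small hypothetical frame whose column names are borrowed from the dataset; selecting the categorical columns by excluding numeric dtypes is an assumption made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

# Toy frame with two numerical columns and one text column (hypothetical values)
toy = pd.DataFrame({
    "Item_Weight": [9.3, 5.92, 17.5],
    "Item_MRP": [249.8092, 48.2692, 141.618],
    "Outlet_Size": ["Medium", "Medium", "High"],
})

# Selectors are callables: given a DataFrame, they return the matching column names
num_selector = make_column_selector(dtype_include=np.number)
cat_selector = make_column_selector(dtype_exclude=np.number)

numerical = num_selector(toy)
categorical = cat_selector(toy)

print("Numerical:", numerical)
print("Categorical:", categorical)
```

The selector objects can be passed to `ColumnTransformer` in place of explicit column lists.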


### Replace null values

### Creating dummy variables

### Use PCA to reduce the number of dummy variables to 3 principal components.

In [96]:
# Transformer to convert sparse matrix to dense array
class ToDenseTransformer(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.toarray() if hasattr(X, 'toarray') else np.asarray(X)
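
A quick check of the transformer on its own can be sketched as follows. The class is repeated so that the snippet runs standalone, and the small SciPy matrix is a stand-in for `OneHotEncoder` output, not real data.

```python
import numpy as np
from scipy import sparse
from sklearn.base import BaseEstimator, TransformerMixin


class ToDenseTransformer(BaseEstimator, TransformerMixin):
    """Same transformer as in the cell above, repeated so this snippet is self-contained."""

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X.toarray() if hasattr(X, "toarray") else np.asarray(X)


# Small sparse matrix standing in for one-hot encoded output (toy values)
X_sparse = sparse.csr_matrix([[1, 0, 0], [0, 0, 1], [0, 1, 0]])
X_dense = ToDenseTransformer().fit_transform(X_sparse)

# Dense input passes through unchanged
X_already_dense = ToDenseTransformer().fit_transform([[1, 2], [3, 4]])

print(type(X_dense).__name__, X_dense.shape)
```

In scikit-learn versions that support the `sparse_output` parameter, `OneHotEncoder(sparse_output=False)` returns a dense array directly, which may make the extra step unnecessary.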

In [97]:
# Define the preprocessing for numerical features: Imputation followed by feature selection
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('select', SelectKBest(f_regression, k=3))
])
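
To see what `SelectKBest` with `f_regression` does, here is a small sketch on synthetic data (the data and the choice of `k=2` are assumptions for illustration). The target depends on two of four columns, and the selector keeps the columns with the highest univariate F-scores.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# The target depends on columns 0 and 2 only; columns 1 and 3 are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

selector = SelectKBest(f_regression, k=2).fit(X, y)
kept = selector.get_support(indices=True)

print("F-scores:", selector.scores_.round(1))
print("Kept column indices:", kept)
```

In the exercise there are only four numerical features, so `k=3` drops exactly one of them.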

In [98]:
# Define the preprocessing for categorical features: Imputation, OneHotEncoding, followed by PCA
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(drop='first', sparse_output=True)),
    ('to_dense', ToDenseTransformer()),
    ('pca', PCA(n_components=3))
])

In [99]:
# Combine numerical and categorical transformers into a preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

### Select the 3 best numeric features

### Fitting models

### Building a Pipeline

In [100]:
# Split the features and target into training and test sets
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)

In [101]:
# Create the full pipeline with Ridge regression
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('scaler', StandardScaler()),  # Not in the task list; added because Ridge's L2 penalty is sensitive to feature scale
    ('regressor', Ridge())
])

In [102]:
# Fit the pipeline
pipeline.fit(X_train, y_train)

In [103]:
# Print the R^2 score on the test set (the default `score` for regressors)
print("Test score:", pipeline.score(X_test, y_test))

Test score: 0.3587951708308351
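
R^2 alone does not say how large the errors are in sales units. As a sketch on synthetic data (the data and model below are stand-ins, not the sales pipeline), error metrics can be reported alongside it:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data (hypothetical)
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 3))
y = 100.0 * X[:, 0] + 50.0 * X[:, 1] + rng.normal(scale=20.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = Ridge().fit(X_train, y_train)
y_pred = model.predict(X_test)

r2 = r2_score(y_test, y_pred)                        # same value as model.score(X_test, y_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # in the units of the target
mae = mean_absolute_error(y_test, y_pred)            # less sensitive to large errors

print(f"R^2: {r2:.3f}  RMSE: {rmse:.2f}  MAE: {mae:.2f}")
```

The same calls should work with the fitted `pipeline`, since it exposes `predict` like any regressor.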


----------------------------
## Task II

In [105]:
# Define the parameter grid for GridSearchCV
param_grid = [
    {
        'scaler': [StandardScaler(), MinMaxScaler(), RobustScaler()],
        'regressor': [Ridge()],
        'regressor__alpha': [0.1, 1.0, 10.0]
    },
    {
        'scaler': [StandardScaler(), MinMaxScaler(), RobustScaler()],
        'regressor': [RandomForestRegressor()],
        'regressor__n_estimators': [100, 200],
        'regressor__max_depth': [None, 10, 20]
    },
    {
        'scaler': [StandardScaler(), MinMaxScaler(), RobustScaler()],
        'regressor': [GradientBoostingRegressor()],
        'regressor__n_estimators': [100, 200],
        'regressor__learning_rate': [0.01, 0.1, 0.2],
        'regressor__max_depth': [3, 5, 7]
    }
]

In [106]:
# Initialize GridSearchCV
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1, scoring='neg_mean_squared_error', verbose=3)

In [107]:
# Fit the GridSearchCV
grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 81 candidates, totalling 405 fits
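
The 81 candidates in the log can be reproduced with `ParameterGrid`, which is a handy way to estimate the cost of a search before launching it. The grid below is a copy of the one defined above.

```python
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

param_grid = [
    {
        "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
        "regressor": [Ridge()],
        "regressor__alpha": [0.1, 1.0, 10.0],
    },
    {
        "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
        "regressor": [RandomForestRegressor()],
        "regressor__n_estimators": [100, 200],
        "regressor__max_depth": [None, 10, 20],
    },
    {
        "scaler": [StandardScaler(), MinMaxScaler(), RobustScaler()],
        "regressor": [GradientBoostingRegressor()],
        "regressor__n_estimators": [100, 200],
        "regressor__learning_rate": [0.01, 0.1, 0.2],
        "regressor__max_depth": [3, 5, 7],
    },
]

# Candidates per sub-grid, then the total across all three
per_model = [len(ParameterGrid(grid)) for grid in param_grid]
total = len(ParameterGrid(param_grid))
n_fits = total * 5  # cv=5 folds per candidate

print(per_model, total, n_fits)  # [9, 18, 54] 81 405
```

Tree-based models split on thresholds and are largely unaffected by feature scaling, so one possible way to shrink the search is to set `'scaler': ['passthrough']` in the two tree-model grids.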


In [108]:
# Print the best parameters and the best mean cross-validated score (negative MSE, so closer to 0 is better)
print("Best parameters:", grid_search.best_params_)
print("Best score:", grid_search.best_score_)

Best parameters: {'regressor': GradientBoostingRegressor(), 'regressor__learning_rate': 0.1, 'regressor__max_depth': 3, 'regressor__n_estimators': 100, 'scaler': MinMaxScaler()}
Best score: -1232752.7655483559


In [109]:
# Evaluate on the test set; `score` uses the grid search's scoring, so this is negative MSE, not R^2
test_score = grid_search.score(X_test, y_test)
print("Test score:", test_score)

Test score: -1062514.8391612796
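
Negative MSE values are hard to read directly. Taking the square root of their magnitude gives the RMSE in sales units. The sketch below simply plugs in the two numbers reported above.

```python
import math

best_cv_score = -1232752.7655483559  # grid_search.best_score_ reported above
test_score = -1062514.8391612796     # grid_search.score(X_test, y_test) reported above

# The scorer returns -MSE, so flip the sign before taking the square root
cv_rmse = math.sqrt(-best_cv_score)
test_rmse = math.sqrt(-test_score)

print(f"CV RMSE:   {cv_rmse:.1f}")
print(f"Test RMSE: {test_rmse:.1f}")
```

Note that these values are not directly comparable with the R^2 of about 0.359 from Task I. For a like-for-like comparison, one option is to pass `scoring='r2'` to `GridSearchCV`, or to compute `r2_score` on the predictions of the best estimator.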
