# Pipelines and Custom Transformation

In [None]:
import seaborn as sns
import pandas as pd
import numpy as np

In [None]:
df = pd.read_csv("data/olist.csv")
df.head()

- Each observation of the dataset represents an item being delivered from a `seller_state` to a `customer_state`.
- Other columns describe the packaging properties of each item.

The target is the number of days between the order and the delivery.

In [None]:
# Check target
sns.histplot(df['days_until_delivery']);

## Creating X and y

In [None]:
X = df.drop(columns=['days_until_delivery'])
y = df['days_until_delivery']
X.head()

## Pipeline

❓ **>>>** Create a scikit-learn pipeline named `pipe` that:

- Engineers a `volume` feature from the dimension features by multiplying `product_length_cm`, `product_height_cm` and `product_width_cm`. To help you, the `multiply` function is already provided: it takes a dataframe and returns a new dataframe.
- Preserves the original product dimensions features for training.
- Scales all numerical features.
- Encodes the categorical features.
- Adds a default `Ridge` regression estimator.
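
One possible way to assemble the steps listed above is sketched here on a tiny invented dataset, not on the Olist data. The names `X_toy`, `y_toy`, `volume_pipe` and `toy_pipe`, the `handle_unknown="ignore"` option, and the `to_frame` variant of the helper are assumptions made for illustration; other layouts (for example with `FeatureUnion`) are equally valid.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

DIMENSIONS = ["product_length_cm", "product_height_cm", "product_width_cm"]


def multiply(df):
    # Variant of the notebook helper (assumption): names the output column "volume"
    volume = df["product_length_cm"] * df["product_height_cm"] * df["product_width_cm"]
    return volume.to_frame(name="volume")


# Toy data invented for illustration only
X_toy = pd.DataFrame({
    "customer_state": ["RJ", "SP", "MG", "SP"],
    "seller_state": ["SP", "SP", "RJ", "MG"],
    "product_weight_g": [500, 1200, 300, 2000],
    "product_length_cm": [20, 30, 15, 50],
    "product_height_cm": [10, 5, 8, 20],
    "product_width_cm": [15, 20, 10, 30],
})
y_toy = [8, 12, 6, 15]

# Compute the volume from the three dimensions, then scale it
volume_pipe = make_pipeline(FunctionTransformer(multiply), StandardScaler())

preprocessor = ColumnTransformer([
    # New feature built from the dimensions
    ("volume", volume_pipe, DIMENSIONS),
    # All numerical columns are scaled, so the original dimensions are preserved
    ("num", StandardScaler(), make_column_selector(dtype_include="number")),
    # Remaining (categorical) columns are one-hot encoded
    ("cat", OneHotEncoder(handle_unknown="ignore"),
     make_column_selector(dtype_exclude="number")),
])

toy_pipe = make_pipeline(preprocessor, Ridge())
toy_pipe.fit(X_toy, y_toy)

# 1 volume column + 4 numerical columns + 6 one-hot columns
X_prep = preprocessor.fit_transform(X_toy)
print(X_prep.shape)
```

The same column may be routed to several transformers of a `ColumnTransformer`, which is what lets the dimensions feed both the `volume` branch and the scaling branch.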

**Note:**

- For this exercise, ignore the holdout method, so there is no need to use `train_test_split()`!
- If you are in the mood, you can try to build your own `class` called `ColumnMultiplier`, but you really don't have to.

**Hints**:

- There are many ways to create your preprocessed matrix (using `ColumnTransformer` and/or `FeatureUnion`). 
    
- If your transformed feature matrix looks weird, it may be stored as a sparse matrix, which is the default behavior of `OneHotEncoder(sparse_output=True)`. Use `.todense()` to turn it back into a dense matrix.
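
As a minimal sketch of the sparse-output hint, the snippet below encodes a three-row toy column (invented for illustration) with the default `OneHotEncoder` settings and converts the result to a dense array. It uses `.toarray()`, which returns a plain NumPy array; the variable names are assumptions.

```python
import pandas as pd
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder

# Toy column invented for illustration
states = pd.DataFrame({"seller_state": ["SP", "RJ", "SP"]})

# With default settings the encoder returns a SciPy sparse matrix
encoded = OneHotEncoder().fit_transform(states)
is_sparse = sparse.issparse(encoded)

# Convert to a regular NumPy array to inspect the values
dense = encoded.toarray()
print(is_sparse)
print(dense)
```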

In [None]:
def multiply(df):
    # Volume = length x height x width, returned as a one-column dataframe
    return pd.DataFrame(df['product_length_cm'] * df['product_height_cm'] * df['product_width_cm'])

In [None]:
# You probably won't use all of these functions, but just in case, let's import them all!
from sklearn.compose import ColumnTransformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.base import TransformerMixin
from sklearn.base import BaseEstimator
from sklearn.linear_model import Ridge

# Code here!



In [None]:
## Using a class as a ColumnMultiplier (you don't need to do it, but you can see an example)

# Create a class
class ColumnMultiplier(TransformerMixin, BaseEstimator):
    # TransformerMixin generates a fit_transform method from fit and transform
    # BaseEstimator generates get_params and set_params methods
    
    # Create parameters "column_1", "column_2", "column_3" to choose which columns of dataframe to multiply
    def __init__(self, column_1, column_2, column_3):
        self.column_1 = column_1
        self.column_2 = column_2
        self.column_3 = column_3
        
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        
        # Multiplication
        multiplied_features = X[self.column_1] * X[self.column_2] * X[self.column_3]

        # Return result as a one-column dataframe named 'volume' (for integration into ColumnTransformer)
        return multiplied_features.to_frame(name='volume')
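
The following sketch shows what the two mixins provide, using a compact copy of the class and a two-row toy dataframe. The column names `length`, `height`, `width` and the variable names are invented for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class ColumnMultiplier(TransformerMixin, BaseEstimator):
    """Compact copy of the transformer above, repeated so the snippet runs alone."""

    def __init__(self, column_1, column_2, column_3):
        self.column_1 = column_1
        self.column_2 = column_2
        self.column_3 = column_3

    def fit(self, X, y=None):
        # Nothing to learn: the transformer is stateless
        return self

    def transform(self, X, y=None):
        product = X[self.column_1] * X[self.column_2] * X[self.column_3]
        return product.to_frame(name="volume")


# Toy data invented for illustration
toy = pd.DataFrame({"length": [2, 3], "height": [4, 5], "width": [6, 7]})

multiplier = ColumnMultiplier("length", "height", "width")

# fit_transform comes from TransformerMixin
volume = multiplier.fit_transform(toy)

# get_params comes from BaseEstimator and reads the __init__ arguments
params = multiplier.get_params()

print(volume)
print(params)
```

Storing each constructor argument under an attribute of the same name is what allows `get_params` and `set_params` to work, and therefore what lets the transformer be cloned inside a pipeline or tuned in a grid search.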

In [None]:
final_preprocessor3

In [None]:
pd.DataFrame(final_preprocessor3.fit_transform(X)).head()

In [None]:
final_preprocessor3.fit_transform(X).shape

## Train and Predict

❓ **>>>** Let's imagine `df` is your entire training set. Cross-validate your pipeline on this dataset (a low $R^2$ score is expected).
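
As a rough sketch of the cross-validation call, the snippet below scores a small pipeline on synthetic data with a strong linear signal, so its scores are unrelated to the ones expected on the Olist data. The data, the pipeline and the names `toy_pipe`, `X_toy`, `y_toy` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data, invented for illustration
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

toy_pipe = make_pipeline(StandardScaler(), Ridge())

# One R^2 score per fold; the whole pipeline is refitted on each training split
scores = cross_val_score(toy_pipe, X_toy, y_toy, cv=10, scoring="r2")
mean_r2 = scores.mean()
print(scores.shape, mean_r2)
```

Passing the full pipeline (rather than a pre-transformed matrix) to `cross_val_score` keeps the scaler and encoder from seeing the validation fold during fitting.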

In [None]:
from sklearn.model_selection import cross_val_score
# Code here!


✅ **Expected results** : An $R^2$ around 0.15824 for `cv=10`.

❓ **>>>** Now imagine you have just received a new order, `new_obs`. Predict its delivery duration and store it in a variable named `prediction`.
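
A pipeline fitted on a dataframe expects a dataframe with the same column names at prediction time, so a single order stored as a dictionary of lists is typically wrapped in `pd.DataFrame` first. The sketch below illustrates this on a reduced toy pipeline; the data, the two-column layout and the names `toy_pipe`, `toy_obs`, `toy_prediction` are assumptions for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data invented for illustration
X_toy = pd.DataFrame({
    "seller_state": ["SP", "RJ", "SP", "MG"],
    "product_weight_g": [500, 1200, 300, 2000],
})
y_toy = [8, 12, 6, 15]

toy_pipe = make_pipeline(
    ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["seller_state"]),
        ("num", StandardScaler(), ["product_weight_g"]),
    ]),
    Ridge(),
)
toy_pipe.fit(X_toy, y_toy)

# A single observation as a dictionary of one-element lists
toy_obs = {"seller_state": ["SP"], "product_weight_g": [1825]}

# Wrap the dictionary in a one-row dataframe before predicting
toy_prediction = toy_pipe.predict(pd.DataFrame(toy_obs))
print(toy_prediction)
```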

In [None]:
new_obs = {'customer_state': ['RJ'],
           'seller_state': ['SP'],
           'product_weight_g': [1825],
           'product_length_cm': [53],
           'product_height_cm': [10],
           'product_width_cm': [40]}

In [None]:
# Code here!



✅ **Expected results** : ```array([20.67221182])```