## Week 6 - Preprocessing Pipelines

In previous weeks, we learned how to load data, clean missing values, handle categorical variables,  
perform aggregations, and transform the data.

Now we move to a **new phase** that is part of modern workflows:

> **Preparing the dataset for modeling using feature transformations with preprocessing pipelines.**

This includes:
- A review of cleaning techniques, encoding of categorical features, and scaling of numerical features
- Preprocessing Pipelines using `sklearn.pipeline`  

**Dataset Reference:** 🔗 https://archive.ics.uci.edu/dataset/2/adult

---
## SETUP

**Install the required packages to load the dataset from the UCI repository**

In [None]:
# uncomment the following lines to install the required packages
#!pip install ucimlrepo
#!pip install IPython
#!pip install scikit-learn

**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
from ucimlrepo import fetch_ucirepo

adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

pd.set_option('display.max_columns', None)

**Import pipeline related classes from `sklearn`**

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.utils import estimator_html_repr

## 1. Load the Dataset

> When loading the Adult Census Income dataset from the UCI repository, you will notice
> that the data is split into **two separate DataFrames**:
>
> - `X` contains all feature columns (`age` … `native-country`)
> - `y` contains the target variable (`income`)
>
> This separation is common in Machine Learning libraries because it clearly
> distinguishes:
>
> - **independent variables** → used to make predictions  
> - **dependent variable** → the value we want to predict
>
> However, for **Exploratory Data Analysis (EDA)**, it is usually more convenient
> to work with a **single unified table**.
>
> Having both features and the target in the same DataFrame simplifies:
>
> - inspecting the overall structure  
> - checking distributions  
> - computing correlations
> - detecting missing values  
> - visualizing relationships between variables
>
> To prepare for EDA, we will **concatenate** the two parts into one unified table.
>
> ### Concatenating DataFrames
>
> The simplest way to combine `X` and `y` is with `pd.concat`, which allows us to
> join DataFrames **side-by-side** using `axis=1`:
>
> - `pd.concat([...])` → specifies the DataFrames to combine  
> - `axis=1` or `axis='columns'` → concatenate **column-wise**, placing the
>   target column next to the features  
>
> **Example:**
>
> ```python
> df = pd.concat([df_1, df_2], axis=1)
>
> # or equivalently
>
> df = pd.concat([df_1, df_2], axis="columns")
> ```
>
> ### What about `axis=0` or `axis='rows'`?
>
> - This stacks DataFrames **row-wise**, one on top of the other.  
> - It requires both DataFrames to have the **same columns**.  
> - Therefore it is *not* appropriate for joining `X` and `y`.

---
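To make the difference between the two axes concrete, here is a minimal sketch on a tiny hand-made table. The `features` and `target` frames and their values are assumptions for illustration only; they stand in for `X` and `y`.

```python
import pandas as pd

# Toy stand-ins for X and y (hypothetical values, not the real dataset)
features = pd.DataFrame({"age": [39, 50], "hours": [40, 13]})
target = pd.DataFrame({"income": ["<=50K", ">50K"]})

# axis=1: side by side, rows are aligned on the index
combined = pd.concat([features, target], axis=1)

# axis=0: stacked on top of each other; columns that do not match
# are filled with NaN, which is why this is wrong for X and y
stacked = pd.concat([features, target], axis=0)

print(combined.shape)  # 2 rows, 3 columns
print(stacked.shape)   # 4 rows, 3 columns, with NaN padding
```

Note that `axis=1` aligns rows by index, so it only behaves as expected when both DataFrames share the same index.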

### Q1.1 Inspect the two separate DataFrames `X` and `y`, then concatenate them into one using `pd.concat`.


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


## Data Dictionary
>
>Below is the official data dictionary for the **Adult Census Income** dataset (UCI ML Repository).  
>
>Unlike the German Credit dataset, the Adult dataset already includes descriptive column names.
>However, several columns contain coded categories, ambiguous meanings, or missing values masked as `" ?"`,  
>which we will address in the cleaning and preprocessing stages.
>
>| Variable Name     | Role    | Type         | Demographic     | Description                                                                                               | Units | Missing Values |
>|-------------------|---------|--------------|-----------------|-----------------------------------------------------------------------------------------------------------|--------|----------------|
>| age               | Feature | Integer      | Age             | Age of the individual                                                                                     | years  | no             |
>| workclass         | Feature | Categorical  | Income/Employment | Employment status (Private, Self-emp-not-inc, Federal-gov, …)                                             |        | yes (encoded as `" ?"`) |
>| fnlwgt            | Feature | Integer      | -               | Final sampling weight (used by US Census Bureau)                                                          |        | no             |
>| education         | Feature | Categorical  | Education       | Highest level of education achieved (Bachelors, HS-grad, Some-college, …)                                 |        | no             |
>| education-num     | Feature | Integer      | Education       | Numerical representation of education level                                                                |        | no             |
>| marital-status    | Feature | Categorical  | Other           | Marital status (Married, Divorced, Never-married, …)                                                      |        | no             |
>| occupation        | Feature | Categorical  | Employment      | Type of occupation (Tech-support, Sales, Exec-managerial, …)                                              |        | yes (encoded as `" ?"`) |
>| relationship      | Feature | Categorical  | Other           | Relationship of the individual to their household                                                          |        | no             |
>| race              | Feature | Categorical  | Race            | Race group (White, Black, Asian-Pac-Islander, …)                                                           |        | no             |
>| sex               | Feature | Binary       | Sex             | Biological sex (Male, Female)                                                                              |        | no             |
>| capital-gain      | Feature | Integer      | -               | Capital gain from investment income                                                                        | USD    | no             |
>| capital-loss      | Feature | Integer      | -               | Capital loss from investment income                                                                        | USD    | no             |
>| hours-per-week    | Feature | Integer      | Employment      | Working hours per week                                                                                     | hours  | no             |
>| native-country    | Feature | Categorical  | Other           | Country of origin                                                                                          |        | yes (encoded as `" ?"`) |
>| income            | Target  | Binary       | Income          | Income category: `<=50K` or `>50K`                                                                         |        | no             |
---
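Because the data dictionary mentions missing values encoded as a `?` placeholder, `.isna()` alone can undercount them. The sketch below shows one way to count both kinds on a toy frame; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with one real NaN and one "?" placeholder (hypothetical values)
toy = pd.DataFrame({
    "workclass": ["Private", "?", np.nan, "State-gov"],
    "age": [39, 50, 38, 53],
})

# Real missing values only
nan_counts = toy.isna().sum()

# Placeholder values only
placeholder_counts = (toy == "?").sum()

# Treat the placeholder as missing, then count everything together
total_missing = toy.mask(toy == "?").isna().sum()

print(nan_counts)
print(placeholder_counts)
print(total_missing)
```

If the real data carries a leading space in the placeholder, strip whitespace first or compare against that exact string.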
### Q1.2 Obtain the `.info()` from the Dataset:

>Investigate the datatypes of each column. Are they appropriate?

In [None]:
# your code here


### Q1.3 Obtain descriptive statistics using `.describe()`

In [None]:
# your code here


### Q1.4 Investigate how many missing values are in each column

In [None]:
# your code here


>## 2. Preprocessing Pipelines
>
>### Why use preprocessing pipelines?
>
>In the previous class you cleaned the dataset **manually**, step by step.
>
>In this section, we will **recreate the same cleaning logic** using scikit-learn pipelines so that:
>
>- The preprocessing steps are **reusable** and **reproducible**.
>- You can apply the **same transformations** to any new data in the future.
>- All cleaning logic is kept in **one single object** instead of many scattered lines of code.
>
>We will use:
>
>- `Pipeline`
>- `ColumnTransformer`
>- `SimpleImputer`
>- `StandardScaler`
>- `OneHotEncoder`
---
>### Identify numerical and categorical columns for the pipelines
>
>We first separate the feature names into:
>
>- `numeric_features` – columns treated as **numerical**.
>- `categorical_features` – columns treated as **categorical**.
>
>**Example:**
>
>```python
>numeric_features = df.select_dtypes(include=["int64", "float64"]).columns.tolist() # or use include=np.number
>categorical_features = df.select_dtypes(include=["object"]).columns.tolist() # or also include=["object", "category", "bool"]
>```
---
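As a quick sanity check of the idea, the following sketch splits the columns of a small toy frame by dtype. It uses `include="number"` and `exclude="number"` rather than explicit dtype names, which is an assumption made here so that the split does not depend on how a given pandas version labels string columns.

```python
import pandas as pd

# Toy frame (hypothetical values) with two numeric and one text column
toy = pd.DataFrame({
    "age": [39, 50],
    "hours-per-week": [40, 13],
    "sex": ["Male", "Female"],
})

numeric_features = toy.select_dtypes(include="number").columns.tolist()
categorical_features = toy.select_dtypes(exclude="number").columns.tolist()

# Every column should land in exactly one of the two lists
assert set(numeric_features) | set(categorical_features) == set(toy.columns)

print(numeric_features)
print(categorical_features)
```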
### Q2.1. Create the `numeric_features` and `categorical_features` lists
- use the `.select_dtypes()` method exactly as shown above.

In [None]:
# your code here


In [None]:
# your code here


>### Build the preprocessing pipeline for numerical features
>
>For numerical columns, a common preprocessing flow is:
>
>1. **Impute missing values** (e.g. with the median).
>2. **Scale the values** to have mean 0 and variance 1.
>
>We can capture this logic in a `Pipeline`:
>
>```python
>from sklearn.pipeline import Pipeline
>from sklearn.impute import SimpleImputer
>from sklearn.preprocessing import StandardScaler
>
>numeric_pipeline = Pipeline([
>    ("imputer", SimpleImputer(strategy="median")),
>    ("scaler", StandardScaler()),
>])
>```
---
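To see what the two steps do to actual numbers, here is a small sketch that runs the same kind of pipeline on a single toy column with one missing value. The data is invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy column with one missing value (hypothetical data)
toy = pd.DataFrame({"hours-per-week": [40.0, np.nan, 60.0, 20.0]})

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),  # NaN -> median of 20, 40, 60
    ("scaler", StandardScaler()),                   # then mean 0, variance 1
])

scaled = numeric_pipeline.fit_transform(toy)

print(scaled.ravel())
print(scaled.mean(), scaled.std())
```

The imputed row receives the median (40), which here equals the column mean after imputation, so it ends up at 0 after scaling.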
### Q2.2. Build `numeric_pipeline` using:
- median imputation  
- standardization with `StandardScaler()`

In [None]:
# your code here


>### Build the preprocessing pipeline for categorical features
>
>For categorical columns, a common preprocessing flow is:
>
>1. **Impute missing values** with the most frequent category.
>2. **Encode categories** using one-hot encoding.
>
>We also ask the encoder to return **dense output** so that the final result can easily be converted to a pandas DataFrame.
>
>```python
>from sklearn.preprocessing import OneHotEncoder
>
>categorical_pipeline = Pipeline([
>    ("imputer", SimpleImputer(strategy="most_frequent")),
>    ("encoder", OneHotEncoder(
>        drop="first",
>        handle_unknown="ignore",
>        sparse_output=False,
>    )),
>])
>```
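>- `handle_unknown='ignore'`: if a category that was not seen during fitting appears in new data, the encoder does not raise an error; it encodes that value as all zeros across the one-hot columns of that feature. Note that with `drop="first"` an unknown category then has the same encoding as the dropped (first) category, and scikit-learn emits a warning when this happens.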
>
>- `sparse_output=False`: makes sure the output is a dense array, which is easier to convert to a DataFrame.
>
>   **Example of Dense Output:**
>```console
>                education_Bachelors  education_Masters  education_PhD  education_Some-college
>           0    1                    0                  0              0
>           1    0                    0                  0              1
>           2    0                    1                  0              0
>```
>   - Dense arrays are easier to convert to pandas DataFrames. With sparse output you would instead get a sparse matrix, which stores only the non-zero entries and is less intuitive to inspect, as in the example below:
>
>       **Example of Sparse Output:**
>```console
>           (0, 2)	1
>           (1, 0)	1
>           (2, 1)	1
>```
>
>   - Each row shows:
>
>       - the **position** `(row_index, column_index)`
>       - the **value** stored at that position  
>       - all other positions not shown are implicitly **zeros**.
---
### Q2.3. Build the `categorical_pipeline` using:
- imputation with `most_frequent`
- one-hot encoding (drop first, ignore unknowns, dense output)

In [None]:
# your code here


>---
>### Custom Cleaning Stage
>
>At this point, we have already built:
>
>- the **numeric preprocessing pipeline**
>- the **categorical preprocessing pipeline**
>
>However, before sending the features into the encoder,  
>we must ensure that some columns (specifically those in the Adult Income dataset  
>that contain formatting inconsistencies) are **cleaned first**.
>
>These issues include:
>
>- values ending in `"."` (e.g., `">50K."`, `"United-States."`)
>- values containing `"?"` instead of real missing values
>- extra spaces, uppercase/lowercase inconsistencies
>
>This is exactly what we manually fixed in `Week 5`.  
>Now we reproduce this using a **custom preprocessing function**.
>
>---
>**Example: Define a custom cleaning function**
>
>```python
>def clean_categorical_values(df):
>    """
>    Applies simple normalization on all object columns:
>    - strip whitespace
>    - remove trailing periods
>    - replace '?' with ''
>    - lowercase everything
>    """
>    df = df.copy()
>    for col in df.select_dtypes(include=["object"]).columns:
>        df[col] = df[col].str.strip()
>        # rstrip removes only trailing periods (e.g. ">50K." -> ">50K")
>        df[col] = df[col].str.rstrip(".")
>        df[col] = df[col].str.replace("?", "", regex=False)
>        df[col] = df[col].str.lower()
>    return df
>```
>---
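One detail worth keeping in mind: an empty string is not a missing value for pandas or for `SimpleImputer`, so a `?` that is replaced by `''` will not be imputed later. If you want placeholders to be imputed, a possible variation is to turn them into `NaN` instead. The helper below, `mark_placeholders_as_missing`, is a hypothetical name and only a sketch of that idea, not the function used in this class.

```python
import numpy as np
import pandas as pd

def mark_placeholders_as_missing(df, placeholders=("?", "")):
    """Hypothetical helper: strip text columns and turn placeholders into NaN."""
    df = df.copy()
    # Assumes every non-numeric column holds strings
    for col in df.select_dtypes(exclude="number").columns:
        cleaned = df[col].str.strip()
        df[col] = cleaned.mask(cleaned.isin(list(placeholders)))
    return df

# Toy data (invented values); note the leading space in " ?"
raw = pd.DataFrame({
    "occupation": ["Sales", " ?", "Tech-support"],
    "age": [39, 50, 38],
})

marked = mark_placeholders_as_missing(raw)
print(marked["occupation"].isna().sum())
```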
### Q2.4. Run the following function

In [None]:
def clean_categorical_values(df):
    """
    Applies simple normalization on all object columns:
    - strip whitespace
    - remove trailing periods
    - replace '?' with ''
    - lowercase everything
    """
    df = df.copy()
    for col in df.select_dtypes(include=["object"]).columns:
        df[col] = df[col].str.strip()
        # rstrip removes only trailing periods (e.g. ">50K." -> ">50K")
        df[col] = df[col].str.rstrip(".")
        df[col] = df[col].str.replace("?", "", regex=False)
        df[col] = df[col].str.lower()

    return df

### Q2.5. Identify which columns must pass through the custom cleaning step
>
>Inspect the categorical variables and create a list named **`custom_features`**  
>containing only the columns that have:
>
>- `"."`  
>- `"?"`  
>- leading/trailing spaces  
>- inconsistent labels  
>
>Examples based on Week 5:
>
>- `"income"`  
>- `"native-country"`  
>- `"occupation"`  
>- `"workclass"`  
>
>**Note**: Make sure that the columns you select for this custom pipeline are no longer present in the `categorical_features` list. Otherwise they will pass through both pipelines. Use `categorical_features.remove('col')` to remove an item from the list.

---

In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


>### Q2.6. Wrap your function using `FunctionTransformer`
>
>```python
>from sklearn.preprocessing import FunctionTransformer
>
>custom_cleaner = FunctionTransformer(clean_categorical_values)
>```
>---
>### Build a pipeline that applies ONLY the custom cleaning
>
>This pipeline will operate **before** the normal preprocessing.
>
>```python
>custom_pipeline = Pipeline([
>    ("custom_pipeline", custom_cleaner)
>])
>```
---
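The wrapping step can be tried in isolation. The sketch below uses a deliberately small, hypothetical function (`strip_and_lower`) instead of the full cleaning function, just to show that a plain Python function becomes a pipeline step with the usual `fit`/`transform` interface.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def strip_and_lower(frame):
    """Hypothetical mini-cleaner: trim whitespace and lowercase every column."""
    frame = frame.copy()
    for col in frame.columns:
        frame[col] = frame[col].str.strip().str.lower()
    return frame

custom_cleaner = FunctionTransformer(strip_and_lower)

custom_pipeline = Pipeline([
    ("cleaner", custom_cleaner),
])

# Toy data (invented values)
toy = pd.DataFrame({"native-country": [" United-States", "MEXICO "]})

# The function is stateless, so fit learns nothing and transform just calls it
cleaned = custom_pipeline.fit_transform(toy)
print(cleaned)
```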

In [None]:
# your code here


>### Combine custom, numerical and categorical pipelines with `ColumnTransformer`
>
>Now we combine all three pipelines in a single object that knows:
>- which columns need custom processing and use `custom_pipeline`.
>- which columns are numerical and use `numeric_pipeline`.
>- which columns are categorical and use `categorical_pipeline`.
>
>We also ask the transformer to return a **pandas DataFrame** instead of a NumPy array.
>
>```python
>from sklearn.compose import ColumnTransformer
>
>full_preprocessor = ColumnTransformer([
>    ("cus", custom_pipeline, custom_features),
>    ("num", numeric_pipeline, numeric_features),
>    ("cat", categorical_pipeline, categorical_features),
>])
>
>full_preprocessor.set_output(transform="pandas")
>```
>- `full_preprocessor.set_output(transform="pandas")`: overrides the default behavior of `ColumnTransformer`, which is to return a NumPy array, so that the result is a pandas DataFrame with named columns.
---
### Q2.7. Create the `full_preprocessor` using ColumnTransformer

In [None]:
# your code here


>### Visualizing the Preprocessing Pipeline (Diagram)
>
>Scikit-learn can **render the entire preprocessing pipeline as a diagram** using  
>`sklearn.utils.estimator_html_repr`. The earlier `set_output(transform="pandas")` call  
>only affects the type of the transformed output; it is not required for the diagram.
>
>This creates an interactive HTML diagram that shows:
>
>- each step in the pipeline,
>- how data flows through numerical and categorical branches,
>- how transformations are combined in the `ColumnTransformer`,
>- and the final output.
>
>This is extremely useful for understanding the structure of your preprocessing workflow.
>
>```python
>from sklearn.utils import estimator_html_repr
>from IPython.display import display, HTML
>
># render the diagram in the notebook output
>display(HTML(estimator_html_repr(full_preprocessor)))
>```
>**Example Output**:
>
>![Pipeline Diagram](https://github.com/tgvp/PACD/blob/main/img/pipeline.png?raw=1)
---
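Outside a notebook, the same HTML string can be written to a file and opened in a browser. This is a small sketch under that assumption; the file name `pipeline_diagram.html` and the stand-in `demo_pipeline` are arbitrary choices for illustration.

```python
import tempfile
from pathlib import Path

from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils import estimator_html_repr

# A small stand-in pipeline; any estimator or ColumnTransformer works the same way
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

# estimator_html_repr returns the diagram as an HTML string
html = estimator_html_repr(demo_pipeline)

# Write it to a temporary location (hypothetical file name)
out_path = Path(tempfile.gettempdir()) / "pipeline_diagram.html"
out_path.write_text(html, encoding="utf-8")

print(out_path)
```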
### Q2.8. Display your pipeline diagram

In [None]:
# your code here


>### Apply the preprocessing pipeline to the whole dataset
>
>With a **single call** we now apply:
>
>- imputation of missing values
>- scaling of numerical features
>- one-hot encoding of categorical features
>
>and obtain a fully processed DataFrame.
>
>```python
>df_clean = full_preprocessor.fit_transform(df)
>df_clean.head()
>```
---
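`fit_transform` learns the statistics (medians, means, categories) and applies them in one go. For data that arrives later you would call `transform` only, so that the statistics learned earlier are reused. The sketch below illustrates this with a numeric pipeline on invented values.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): the pipeline is fitted on `train` only
train = pd.DataFrame({"age": [20.0, 30.0, 40.0]})
new = pd.DataFrame({"age": [30.0, 50.0]})

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

numeric_pipeline.fit(train)                       # learns median, mean and std from train
train_scaled = numeric_pipeline.transform(train)
new_scaled = numeric_pipeline.transform(new)      # reuses the train statistics

print(train_scaled.ravel())
print(new_scaled.ravel())
```

The value 30 in the new data maps to 0 because 30 is the mean of the training data, not because of anything computed from the new rows.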
### Q2.9. Apply the pipeline and create `df_clean`

In [None]:
# your code here


>### Export the cleaned dataset produced by the pipeline
>
>Finally, save the preprocessed dataset for future use (for example, in a Machine Learning course).
>
>```python
>df_clean.to_csv("clean_dataset.csv", index=False)
>```
---
### Q2.10. Save the cleaned dataset as `clean_dataset.csv`

In [None]:
# your code here


### Concluding
>From now on, instead of repeating all cleaning steps manually, you can:
>
>- reuse `full_preprocessor` on new data with `full_preprocessor.transform(new_df)`;
>- keep all preprocessing logic **centralized and reproducible** in a single object.
>
> **Note 1**: You could also **improve** the pipeline by **reusing the group-based imputation** we applied in **Week 5** and building a different pipeline around it.
>
> **Note 2**: Be deliberate about **when to apply one-hot encoding** and **when not to**. For example, ask yourself whether it makes sense for a column with many distinct categories such as `native-country`.
>
> **Note 3**: Try to **chain pipelines** when it makes sense.
>
> **Example**:
>```python
>cat_preproc = Pipeline([
>    ("cus", custom_pipeline),
>    ("cat", categorical_pipeline)
>])
>
>full_preprocessor = ColumnTransformer([
>    ("cat", cat_preproc, categorical_features),
>    ("num", numeric_pipeline, numeric_features),
>])
>
>full_preprocessor.set_output(transform="pandas")
>```
>
>Verify the diagram of this alternative, making sure to include all the features you want in the lists.
---

## Let's practice a bit more!!!

By now you have defined a preprocessing pipeline that can be reused on a different dataset.

### 📌 **Bank Marketing Dataset (UCI Machine Learning Repository)**
This dataset contains information about clients of a Portuguese bank and whether they subscribed to a term deposit.

---

**Dataset Reference:** 🔗 https://archive.ics.uci.edu/ml/machine-learning-databases/00222/bank-additional-full.csv

### Data Dictionary
> | Variable Name | Role     | Type         | Demographic      | Description                                                                                                                                                                                                                       | Units | Missing Values |
> |---------------|----------|--------------|------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------|----------------|
> | age           | Feature  | Integer      | Age              | Client age                                                                                                                                                                                                                        |       | no             |
> | job           | Feature  | Categorical  | Occupation       | Type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')                                                      |       | no             |
> | marital       | Feature  | Categorical  | Marital Status   | Marital status (categorical: 'divorced','married','single','unknown'; *note*: "divorced" includes widowed)                                                                                                                       |       | no             |
> | education     | Feature  | Categorical  | Education Level  | Education level (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')                                                                                   |       | no             |
> | default       | Feature  | Binary       |                  | Has credit in default?                                                                                                                                                                                                            |       | no             |
> | balance       | Feature  | Integer      |                  | Average yearly balance                                                                                                                                                                                                            | euros | no             |
> | housing       | Feature  | Binary       |                  | Has housing loan?                                                                                                                                                                                                                 |       | no             |
> | loan          | Feature  | Binary       |                  | Has personal loan?                                                                                                                                                                                                                |       | no             |
> | contact       | Feature  | Categorical  |                  | Contact communication type (categorical: 'cellular','telephone')                                                                                                                                                                  |       | yes            |
> | day_of_week   | Feature  | Date         |                  | Last contact day of the week                                                                                                                                                                                                      |       | no             |
> | month         | Feature  | Date         |                  | Last contact month (categorical: 'jan','feb','mar','apr','may','jun','jul','aug','sep','oct','nov','dec')                                                                                                                         |       | no             |
> | duration      | Feature  | Integer      |                  | Last contact duration (seconds). **Important:** heavily affects target. Should be excluded from realistic predictive models. Included only for benchmark comparisons.                                                             | sec   | no             |
> | campaign      | Feature  | Integer      |                  | Number of contacts performed during this campaign (includes last contact)                                                                                                                                                         |       | no             |
> | pdays         | Feature  | Integer      |                  | Days since last contact from previous campaign (-1 means never contacted)                                                                                                                                                         |       | yes            |
> | previous      | Feature  | Integer      |                  | Number of contacts performed before this campaign                                                                                                                                                                                 |       | no             |
> | poutcome      | Feature  | Categorical  |                  | Outcome of previous marketing campaign (categorical: 'failure','nonexistent','success')                                                                                                                                           |       | yes            |
> | y             | Target   | Binary       |                  | Has the client subscribed a term deposit?                                                                                                                                                                                         |       | no             |
---
### Load the Dataset

In [None]:
# fetch dataset
bank_marketing = fetch_ucirepo(id=222)

# data (as pandas dataframes)
X = bank_marketing.data.features
y = bank_marketing.data.targets

### Now you will:

1. **Import the dataset**
2. **Inspect its structure**
3. **Identify numerical and categorical columns**
4. **Identify columns that will require custom processing**
5. **Apply the preprocessing pipeline already built**
6. **Export the cleaned dataset**

In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your tears here 😊