## Week 5: Data Transformation

In previous weeks, we learned how to load data, clean missing values, handle categorical variables,  
perform aggregations, and extract meaningful insights.

Now we move to a **new phase** that is a standard part of modern machine learning workflows:

> **Preparing the dataset for modeling using feature transformations.**

This includes:
- Review of Cleaning Techniques
- Encoding categorical features
- Scaling numerical features
- Avoiding multicollinearity  

**Dataset Reference:** 🔗 https://archive.ics.uci.edu/dataset/2/adult

---
## SETUP

**Install the package required to load the dataset from the UCI repository**

In [None]:
# uncomment the following line to install the required package
#!pip install ucimlrepo

**Import necessary libraries**

In [None]:
import pandas as pd
import numpy as np

from ucimlrepo import fetch_ucirepo

adult = fetch_ucirepo(id=2)
X = adult.data.features
y = adult.data.targets

pd.set_option('display.max_columns', None)

## 1. Load the Dataset

> When loading the Adult Census Income dataset from the UCI repository, you will notice
> that the data is split into **two separate DataFrames**:
>
> - `X` contains all feature columns (`age` … `native-country`)
> - `y` contains the target variable (`income`)
>
> This separation is common in Machine Learning libraries because it clearly
> distinguishes:
>
> - **independent variables** → used to make predictions  
> - **dependent variable** → the value we want to predict
>
> However, for **Exploratory Data Analysis (EDA)**, it is usually more convenient
> to work with a **single unified table**.
>
> Having both features and the target in the same DataFrame simplifies:
>
> - inspecting the overall structure  
> - checking distributions  
> - computing correlations
> - detecting missing values  
> - visualizing relationships between variables
>
> To prepare for EDA, we will **concatenate** the two parts into one unified table.
>
> ### Concatenating DataFrames
>
> The simplest way to combine `X` and `y` is with `pd.concat`, which allows us to
> join DataFrames **side-by-side** using `axis=1`:
>
> - `pd.concat([...])` → specifies the DataFrames to combine  
> - `axis=1` or `axis='columns'` → concatenate **column-wise**, placing the
>   target column next to the features  
>
> **Example:**
>
> ```python
> df = pd.concat([df_1, df_2], axis=1)
>
> # or equivalently
>
> df = pd.concat([df_1, df_2], axis="columns")
> ```
>
> ### What about `axis=0` or `axis='rows'`?
>
> - This stacks DataFrames **row-wise**, one on top of the other.  
> - It is meant for DataFrames that share the **same columns**; columns missing from one of them are filled with `NaN`.  
> - Therefore it is *not* appropriate for joining `X` and `y`.
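
The difference between the two axes can be seen on a tiny example. This is only a sketch: `features` and `target` are made-up stand-ins for `X` and `y`, not the real dataset.

```python
import pandas as pd

# Toy stand-ins for X (features) and y (target); the values are invented.
features = pd.DataFrame({"age": [39, 50, 38], "hours-per-week": [40, 13, 40]})
target = pd.DataFrame({"income": ["<=50K", "<=50K", ">50K"]})

# axis=1 places the target next to the features, aligning rows by index.
combined = pd.concat([features, target], axis=1)
print(combined.shape)  # (3, 3)

# axis=0 stacks the frames vertically; cells for columns that a frame
# does not have are filled with NaN, which is not what we want here.
stacked = pd.concat([features, target], axis=0)
print(stacked.shape)                  # (6, 3)
print(stacked["income"].isna().sum()) # 3
```

Because `axis=1` aligns on the index, both DataFrames should share the same index before they are combined.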

---

### Q1.1 Inspect the two separate DataFrames `X` and `y`, then concatenate them into one using `pd.concat`.


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


## Data Dictionary
>
>Below is the official data dictionary for the **Adult Census Income** dataset (UCI ML Repository).  
>
>Unlike the German Credit dataset, the Adult dataset already includes descriptive column names.
>However, several columns contain coded categories, ambiguous meanings, or missing values masked as `" ?"`,  
>which we will address in the cleaning and preprocessing stages.
>
>| Variable Name     | Role    | Type         | Demographic     | Description                                                                                               | Units | Missing Values |
>|-------------------|---------|--------------|-----------------|-----------------------------------------------------------------------------------------------------------|--------|----------------|
>| age               | Feature | Integer      | Age             | Age of the individual                                                                                     | years  | no             |
>| workclass         | Feature | Categorical  | Income/Employment | Employment status (Private, Self-emp-not-inc, Federal-gov, …)                                             |        | yes (encoded as `" ?"`) |
>| fnlwgt            | Feature | Integer      | -               | Final sampling weight (used by US Census Bureau)                                                          |        | no             |
>| education         | Feature | Categorical  | Education       | Highest level of education achieved (Bachelors, HS-grad, Some-college, …)                                 |        | no             |
>| education-num     | Feature | Integer      | Education       | Numerical representation of education level                                                                |        | no             |
>| marital-status    | Feature | Categorical  | Other           | Marital status (Married, Divorced, Never-married, …)                                                      |        | no             |
>| occupation        | Feature | Categorical  | Employment      | Type of occupation (Tech-support, Sales, Exec-managerial, …)                                              |        | yes (encoded as `" ?"`) |
>| relationship      | Feature | Categorical  | Other           | Relationship of the individual to their household                                                          |        | no             |
>| race              | Feature | Categorical  | Race            | Race group (White, Black, Asian-Pac-Islander, …)                                                           |        | no             |
>| sex               | Feature | Binary       | Sex             | Biological sex (Male, Female)                                                                              |        | no             |
>| capital-gain      | Feature | Integer      | -               | Capital gain from investment income                                                                        | USD    | no             |
>| capital-loss      | Feature | Integer      | -               | Capital loss from investment income                                                                        | USD    | no             |
>| hours-per-week    | Feature | Integer      | Employment      | Working hours per week                                                                                     | hours  | no             |
>| native-country    | Feature | Categorical  | Other           | Country of origin                                                                                          |        | yes (encoded as `" ?"`) |
>| income            | Target  | Binary       | Income          | Income category: `<=50K` or `>50K`                                                                         |        | no             |
---
### Q1.2 Obtain the `.info()` from the Dataset:

>Investigate the datatypes of each column. Are they appropriate?

In [None]:
# your code here


### Q1.3 Obtain descriptive statistics using `.describe()`

In [None]:
# your code here


### Q1.4 Investigate how many missing values are in each column

In [None]:
# your code here


## 2. Clean the Dataset

Before we proceed any further, it is necessary to ensure that the dataset is **consistent** and **clean**.

> Even though the Adult Income dataset already has descriptive column names,  
> it still contains issues that must be addressed before transformation:
>
> - missing values
> - categorical values with inconsistent spacing  
> - mixed types inside categorical columns  
> - skewed numerical variables  
> - redundant or correlated columns
___
### Q2.1. Create a list containing the names of the categorical columns, selected by their dtype.

In [None]:
# your code here


### Q2.2. Create a list containing the names of the numerical columns, selected by their dtype.

In [None]:
# your code here


### Q2.3. For every categorical column verify the unique values

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here

In [None]:
# your code here


### Q2.4. Clean the `income` column:
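> - Remove the stray `.` characters from the labels with `df["col"].str.replace(".", "", regex=False)`
> - `regex=False` makes pandas treat `"."` as a literal period; as a regular expression, `.` matches any character, so every character would be removed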
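
The effect of `regex=False` is easy to check on a toy Series. The labels below are an assumption about what the `income` column looks like (some values with a trailing period); compare them with your own `unique()` output.

```python
import pandas as pd

# Hypothetical labels: the same two classes, some with a trailing period.
labels = pd.Series(["<=50K", "<=50K.", ">50K", ">50K."])

# Literal replacement: only the "." character is removed.
literal = labels.str.replace(".", "", regex=False)
print(literal.unique().tolist())   # ['<=50K', '>50K']

# Regex replacement: "." matches any character, so everything is removed.
as_regex = labels.str.replace(".", "", regex=True)
print(as_regex.unique().tolist())  # ['']
```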

In [None]:
# your code here


### Q2.5 Replace `"?"` with proper `NaN` in all categorical columns

> As seen in the unique values above, several categorical columns use `"?"`  
> instead of `NaN` to mark missing entries.
>
> Replace **all occurrences** of `"?"` with `np.nan`.
>
>**Example for a specific column:**
>```python
>df['col'] = df['col'].replace("?", np.nan)
>```
>**Example for the whole DataFrame:**
>```python
>df = df.replace("?", np.nan)
>```
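
A minimal sketch on toy data, assuming the marker may appear either as `"?"` or with surrounding whitespace (for example `" ?"`). Stripping the text column first lets a single `replace` catch both spellings; the DataFrame `toy` is invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented example: two different spellings of the missing-value marker.
toy = pd.DataFrame({
    "workclass": ["Private", "?", " ?", "State-gov"],
    "age": [39, 50, 38, 53],
})

# Remove leading/trailing whitespace so " ?" becomes "?".
toy["workclass"] = toy["workclass"].str.strip()

# Replace the marker with a real missing value and reassign the result.
toy = toy.replace("?", np.nan)

print(toy["workclass"].isna().sum())  # 2
```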
___

In [None]:
# your code here


In [None]:
# your code here


### Q2.6. Create a `categoricals_with_nans` list containing categorical columns with missing values.

In [None]:
# your code here


In [None]:
# your code here


### Q2.7. For each column in `categoricals_with_nans`, verify the proportion of each category

In [None]:
# your code here


### Q2.8. Verify the percentage of missing values in `categoricals_with_nans`

In [None]:
# your code here


### Q2.9 Decide the imputation strategy for:
>- `workclass`
>- `occupation`
>- `native-country`
>
>**Example of imputation with mode using `.fillna(df[col].mode()[0])`.**
>```python
>   df['col'] = df['col'].fillna(df['col'].mode()[0])
>```
>
>**Example dropping rows with missing values:**
>```python
>   df = df.dropna(subset=["col1", "col2", "..."])
>```
>
>**Example of group-based imputation:**
>```python
>   def group_impute_mode(df, target_col, group_col):
>       """
>       Impute missing values in `target_col` using the mode of each group defined by `group_col`.
>       Groups with no observed value in `target_col` are left unchanged.
>       """
>       group_modes = {}
>
>       # Compute mode per group (skip groups where `target_col` is entirely missing,
>       # because `.mode()` returns an empty Series for them)
>       for group_value, group_df in df.groupby(group_col):
>           modes = group_df[target_col].mode()
>           if not modes.empty:
>               group_modes[group_value] = modes.iloc[0]
>
>       # Apply the imputation group-wise
>       for group_value, mode_val in group_modes.items():
>           mask = (df[group_col] == group_value)
>           df.loc[mask, target_col] = df.loc[mask, target_col].fillna(mode_val)
>
>       return df
>```
___
**Run the following cell if you decide to use this function...**

**You still need to decide the `target_col` and the `group_col`**

In [None]:
def group_impute_mode(df, target_col, group_col):
    """
    Impute missing values in `target_col` using the mode of each group defined by `group_col`.
    Groups with no observed value in `target_col` are left unchanged.
    """
    group_modes = {}

    # Compute mode per group (skip groups where `target_col` is entirely missing,
    # because `.mode()` returns an empty Series for them)
    for group_value, group_df in df.groupby(group_col):
        modes = group_df[target_col].mode()
        if not modes.empty:
            group_modes[group_value] = modes.iloc[0]

    # Apply the imputation group-wise
    for group_value, mode_val in group_modes.items():
        mask = (df[group_col] == group_value)
        df.loc[mask, target_col] = df.loc[mask, target_col].fillna(mode_val)

    return df

# Usage:
# df = group_impute_mode(df, target_col="col_with_missing", group_col="col_to_groupby")

In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


### 🤷‍♂️ As discussed in previous classes, every strategy has drawbacks and may introduce bias

## 3. Why Data Transformation?
>
> Raw datasets usually **cannot** be directly fed into models.
>
> **Many algorithms expect features to be:**
> - on comparable scales  
> - numerical  
> - free of redundant or perfectly correlated dimensions  
> - consistent between train and test sets  
>
> **In practice, real-world datasets contain:**
> - variables measured in different units  
> - skewed distributions  
> - categorical labels  
> - missing values  
> - outliers  
>
> **Transformations help ensure:**
>
>- numerical stability  
>- faster gradient convergence  
>- better performance in distance-based models  
>- reduced bias in parameter estimation  
>- interpretability and model consistency  
>- compatibility with modern pipelines  
___
>
>## Label Encoding
>
>Label Encoding consists of assigning a **numeric code** to each category.
>
>**Example:**
>
>```python
>   education_map = {
>        "Preschool": 0,
>        "1st-4th": 1,
>        "5th-6th": 2,
>        "7th-8th": 3,
>        "9th": 4,
>        "10th": 5,
>        "11th": 6,
>        "12th": 7,
>        "HS-grad": 8,
>        "Some-college": 9,
>        "Assoc-voc": 10,
>        "Assoc-acdm": 11,
>        "Bachelors": 12,
>        "Masters": 13,
>        "Prof-school": 14,
>        "Doctorate": 15
>    }
>
>    df["education_encoded"] = df["education"].map(education_map)
>```
>When is **Label Encoding** appropriate?
> - When a categorical feature has a natural order:
>
>   - Education level
>
>   - Satisfaction scores
>
>   - Risk levels
>
>   - Stage or level progressions
>
>In these cases, the assigned numbers represent rank, not category labels.
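
One practical detail of `.map`: any category missing from the dictionary silently becomes `NaN`. The sketch below uses a hypothetical `size_map` and toy values (not the `education` column) to show a quick check for unmapped categories.

```python
import pandas as pd

# Hypothetical ordinal scale, for illustration only.
size_map = {"small": 0, "medium": 1, "large": 2}
sizes = pd.Series(["small", "large", "medium", "Large"])

encoded = sizes.map(size_map)

# Values that were not found in the dictionary are NaN after mapping.
unmapped = sizes[encoded.isna()].unique().tolist()
print(unmapped)  # ['Large']
```

An empty `unmapped` list indicates that every category was covered by the mapping.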
___
### Q3.1. Use the **mapping dictionary** below to `Label Encode` the `education` column

In [None]:
education_map = {
    "Preschool": 0,
    "1st-4th": 1,
    "5th-6th": 2,
    "7th-8th": 3,
    "9th": 4,
    "10th": 5,
    "11th": 6,
    "12th": 7,
    "HS-grad": 8,
    "Some-college": 9,
    "Assoc-voc": 10,
    "Assoc-acdm": 11,
    "Bachelors": 12,
    "Masters": 13,
    "Prof-school": 14,
    "Doctorate": 15
}

In [None]:
# your code here


## One-Hot Encoding (OHE)

> After understanding Label Encoding, we now need a method that works for  
> categorical variables **without natural order**.  
>
> Assigning integers to categories like `"Red" → 0`, `"Blue" → 1`, `"Green" → 2`  
> would incorrectly imply:
>
> - Blue > Red  
> - Green > Blue  
>
> which creates a **fake ordering relationship** that does not exist.
>
> To avoid this problem, we use **One-Hot Encoding (OHE)**.
___
### What is One-Hot Encoding?
>
> One-Hot Encoding converts each category into a **binary indicator column**.
>
> Suppose the column **`color`** contains the following categories:
>
> - <span style="color:#D62828; font-weight:bold;">Red</span>  
> - <span style="color:#1D3557; font-weight:bold;">Blue</span>  
> - <span style="color:#2A9D8F; font-weight:bold;">Green</span>  
>
> Using **One-Hot Encoding**, each category becomes its own variable.
>
> **Assume the input column `color`:**
>
> | color |
> |-------|
> | <span style="color:#D62828; font-weight:bold;">Red</span> |
> | <span style="color:#1D3557; font-weight:bold;">Blue</span> |
> | <span style="color:#2A9D8F; font-weight:bold;">Green</span> |
>
>**Example using One-Hot Encoding with `pd.get_dummies()`**
>
> ```python
>   df_encoded = pd.get_dummies(df, columns=list_of_columns, drop_first=True, dtype=int)
> ```
>
>**Expected result:**
>
> | color_Green | color_Red |
> |-------------|-----------|
> |      0      |     1     |
> |      0      |     0     |
> |      1      |     0     |
>
> **Why is one category omitted?**
>
> `drop_first=True` drops the first category in sorted order (here `Blue`). Dropping one dummy avoids the <span style="color:#E76F51; font-weight:bold;">dummy variable trap</span> caused by **perfect multicollinearity**, where:
>
> ```
> Blue = 1 − (Green + Red)
> ```
>
> **Now every row still uniquely identifies its color, without redundancy.**
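
The color example can be reproduced directly. This sketch assumes a toy DataFrame named `colors`; note that pandas orders the categories alphabetically, so `drop_first=True` drops `Blue`, the first category in that order.

```python
import pandas as pd

colors = pd.DataFrame({"color": ["Red", "Blue", "Green"]})

# Full encoding: one indicator column per category.
full = pd.get_dummies(colors, columns=["color"], dtype=int)
print(full.columns.tolist())      # ['color_Blue', 'color_Green', 'color_Red']
print(full.sum(axis=1).tolist())  # [1, 1, 1] -> the columns are linearly dependent

# drop_first=True removes the first category in sorted order (Blue).
encoded = pd.get_dummies(colors, columns=["color"], drop_first=True, dtype=int)
print(encoded.columns.tolist())   # ['color_Green', 'color_Red']
```

A row of all zeros in `encoded` then stands for the dropped category.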
___
### Q3.2. Create a DataFrame named `df_encoded` using `OHE` on the remaining categorical columns:
>- Do not forget that `education` column must not be included

In [None]:
# your code here


In [None]:
# your code here


### Q3.3. Verify if the original columns are still in the DataFrame
>- Use `occupation` column for example

In [None]:
# your code here


### Q3.4. Verify the number of columns after the `OHE`

In [None]:
# your code here


### Q3.5. Concatenate `df_encoded` with the `education` encoded column

In [None]:
# your code here


### Q3.6 Verify if there are **outliers** in `numerical` columns using the `IQR method`

> To detect outliers in a numerical column, we can use the **Interquartile Range (IQR) method**.
> The IQR represents the spread of the middle 50% of the data.
>
> The formula works as follows:
>
> - Compute the 1st quartile (Q1) → 25th percentile  
> - Compute the 3rd quartile (Q3) → 75th percentile  
> - Compute the **IQR**:
>
> $$
> \text{IQR} = Q3 - Q1
> $$
>
> Outliers are any observations outside the following bounds:
>
> $$
> \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
> $$
> $$
> \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
> $$
>
> Values smaller than the lower bound or greater than the upper bound are considered **outliers**.
>
> Now, define a function that verifies whether a column contains outliers:
>
>```python
>def verify_outliers(df: pd.DataFrame, col: str) -> bool:
>    q1 = df[col].quantile(0.25)
>    q3 = df[col].quantile(0.75)
>    iqr = q3 - q1
>    # continue from here
>```
>
>Return a bool from the function and apply it on every `numerical` column
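
As a worked example of the bounds, here is the arithmetic on a small made-up Series (not a column of the dataset). It only illustrates the formulas above; the function for the exercise is still yours to write.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100])  # invented numbers

q1 = values.quantile(0.25)   # 3.0
q3 = values.quantile(0.75)   # 7.0
iqr = q3 - q1                # 4.0

lower = q1 - 1.5 * iqr       # -3.0
upper = q3 + 1.5 * iqr       # 13.0

# Observations outside [lower, upper] are flagged as outliers.
flagged = values[(values < lower) | (values > upper)]
print(flagged.tolist())      # [100]
```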


In [None]:
def verify_outliers(df: pd.DataFrame, col: str) -> bool:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1
    
    # continue from here


In [None]:
# your code here


<details>
<summary><h3>Can we also use the IQR method to remove outliers?</h3></summary>

> Yes, the same mathematical rule used to *detect* outliers can also be used
> to *remove* them.  
>
> Once we compute the lower and upper bounds:
>
> $$
> \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
> $$
> $$
> \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
> $$
>
> We can simply filter the DataFrame to keep only the values **within these limits**.
>
> This is known as **IQR-based outlier removal** and is one of the most common
> preprocessing techniques in data cleaning, especially for algorithms that are 
> sensitive to extreme values.
>
> Example function to *remove* outliers from a column:
>
> ```python
> def remove_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
>     q1 = df[col].quantile(0.25)
>     q3 = df[col].quantile(0.75)
>     iqr = q3 - q1
>
>     lower = q1 - 1.5 * iqr
>     upper = q3 + 1.5 * iqr
>
>     return df[(df[col] >= lower) & (df[col] <= upper)]
> ```
>
___
<div style="background-color:#f2f2f2; padding:12px; border-left:4px solid #d9534f; border-radius:4px; margin:10px 0;">
<strong>⚠️ NOTE:</strong> REMOVING OUTLIERS IS NOT ALWAYS RECOMMENDED.<br>
It depends on the context and whether extreme values are real observations or measurement errors.<br>
In census income data like this one, extreme values (for example, very large capital gains) may be genuine observations that carry information about income.
</div>


</details>

### Transforming Numerical Features
>
>Before applying more advanced preprocessing techniques, we must ensure that our **numerical features are properly transformed**.
>
>**Raw numerical variables typically present issues such as:**
>
>- very different scales  
>- extreme outliers  
>- heavy-tailed shapes  
>- inconsistent units  
>
>These problems can negatively impact both **data analysis** and **modeling**.
___
### Why Numerical Transformation Matters
>
>**Many algorithms assume that:**
>
> - features are on comparable scales  
> - values are not dominated by extreme outliers  
> - distributions are not extremely skewed  
> - distances between samples are meaningful  
>
>**However, real-world datasets often include:**
>
>- **large scale differences**  
>   - e.g., `capital-gain` ranges from 0 to 99,999  
>   - e.g., `hours-per-week` ranges from 1 to 99  
>
>- **strong skewness**  
>   - e.g., most people have zero capital gain or loss
>
>
>**Numerical transformations help ensure:**
>
>- numerical stability  
>- consistency between features  
>- reduction of bias introduced by outliers  
>- improved interpretability  
>- better suitability for downstream ML tasks  
---
### Standardization (Z-Score Scaling)
>
> Standardization rescales a feature so that it has:
>
> - mean = **0**  
> - standard deviation = **1**
>
>**Mathematically:**
>
>$$
>z = \frac{x - \mu}{\sigma}
>$$
>
>**Key characteristics:**
>
>- preserves the shape of the distribution  
>- keeps outliers (does not remove them)  
>- spreads values around zero  
>- widely used in:  
>  - Linear / Logistic Regression  
>  - Neural Networks  
>  - PCA  
>  - SVM
>
>**Example:**
>
>```python
>   def standardize(column: pd.Series) -> pd.Series:
>      """
>      Applies Z-score standardization to a numerical column.
>      (x - mean) / std
>      """
>      mean = column.mean()
>      std = column.std()
>      return (column - mean) / std
>
>   df_standard = pd.DataFrame()
>   for col in numericals:
>       df_standard[col] = standardize(df[col])
>
>   df_standard.head()
>```
>
>**Alternatively, you can use `sklearn.preprocessing.StandardScaler`, which scales all selected columns in one call.**
>
>```python
>   from sklearn.preprocessing import StandardScaler
>   scaler = StandardScaler()
>   # keep the original index so the result aligns with `df` when concatenated later
>   df_standard = pd.DataFrame(scaler.fit_transform(df[numericals]), columns=numericals, index=df.index)
>```
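
The two approaches differ slightly in how the standard deviation is computed: `Series.std()` defaults to the sample standard deviation (`ddof=1`), while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below uses only pandas and a made-up Series to show the size of the difference.

```python
import pandas as pd

x = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # invented values

# pandas default: sample standard deviation (ddof=1)
z_sample = (x - x.mean()) / x.std()

# population standard deviation (ddof=0), the convention StandardScaler follows
z_population = (x - x.mean()) / x.std(ddof=0)

print(x.std(ddof=0))               # 2.0
print(round(x.std(), 4))           # 2.1381
print(z_population.iloc[-1])       # 2.0
print(round(z_sample.iloc[-1], 4)) # 1.8708
```

On a dataset with tens of thousands of rows the two results are almost identical, so either convention is fine as long as it is applied consistently.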
___
### Q3.7. Create a new DataFrame named `df_standard` containing the standardized versions of the `numerical` columns
>- `age` **→ Good candidate for scaling**
>   - Moderate range (17–90)
>   - No extreme outliers
___
>- `fnlwgt` **→ Strong candidate for scaling**
>   - Very large magnitude (20,000 to over 1,000,000)  
>   - Extremely skewed and dominates the dataset if not scaled  
>   - Standardization reduces magnitude-related bias  
---
>- `education-num` **→ Do *NOT* standardize**
>   - Encodes an **ordinal** variable (1 to 16)  
>   - Small range and meaningful progression  
>   - Scaling removes interpretability without adding value  
---
>- `capital-gain` **→ Do *NOT* scale raw values**
>   - About 92% of observations are **zero**  
>   - Very few extremely large values  
---
>- `capital-loss` **→ Do *NOT* scale raw values**
>  - Same skewness behavior as `capital-gain`  
>  - Mostly zeros with rare large values
___

In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


In [None]:
# your code here


### Q3.8 Concatenate `df_standard` with the rest of the dataset (`df_encoded`) to create the final cleaned DataFrame named `df_final`

In [None]:
# your code here


In [None]:
# your code here


### Q3.9 Export the dataset to a `csv` file as `cleaned_adult_income.csv`

In [None]:
# your code here


>### In the next class, we will introduce `Preprocessing Pipelines` using `scikit-learn`
>**This will allow us to combine all the steps above into a single reusable workflow.**