# Week 4 - Data Wrangling and Group-Based Aggregations

In this notebook we will practice data cleaning and group-based aggregations using a *messy* version of the german credit risk dataset.

Dataset reference: üîó https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

Topics covered:
- Concatenating DataFrames
- Preprocessing
    - Categorical x Numerical Data
    - Fixing column types
    - Standardizing Categorical Values
    - Missing values (Identifying and Imputation)
- Group-Based Aggregations

## SETUP

In [150]:
import pandas as pd
import numpy as np

pd.set_option('display.max_columns', None)

## 1. Load the Dataset

> When loading the German Credit dataset from the UCI repository, you will notice
> that the data is split into **two separate DataFrames**:
>
> - `X` contains all feature columns (Attribute1 ‚Ä¶ Attribute20)
> - `y` contains the target variable (`class`)
>
> This separation is common in Machine Learning libraries because it clearly
> distinguishes:
>
> - **independent variables** ‚Üí used to make predictions  
> - **dependent variable** ‚Üí the value we want to predict
>
> However, for **Exploratory Data Analysis (EDA)**, it is usually more convenient
> to work with a **single unified table**.
>
> Having both features and the target in the same DataFrame simplifies:
>
> - inspecting the overall structure  
> - checking distributions  
> - computing correlations  
> - detecting missing values  
> - visualizing relationships between variables
>
> To prepare for EDA, we will **concatenate** the two parts into one unified table.
>
> ### Concatenating DataFrames
>
> The simplest way to combine `X` and `y` is with `pd.concat`, which allows us to
> join DataFrames **side-by-side** using `axis=1`:
>
> - `pd.concat([...])` ‚Üí specifies the DataFrames to combine  
> - `axis=1` or `axis='columns'` ‚Üí concatenate **column-wise**, placing the
>   target column next to the features  
>
> **Example:**
>
> ```python
> df = pd.concat([df_1, df_2], axis=1)
>
> # or equivalently
>
> df = pd.concat([df_1, df_2], axis="columns")
> ```
>
> ### What about `axis=0` or `axis='rows'`?
>
> - This stacks DataFrames **row-wise**, one on top of the other.  
> - It requires both DataFrames to have the **same columns**.  
> - Therefore it is *not* appropriate for joining `X` and `y`.

---

### Q1.1 Load both datasets in separate DataFrames `X` and `y`, then concatenate them into one using `pd.concat`.


In [151]:
# your code here
X = pd.read_csv('X_gcd.csv')
X.head()

Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,Attribute11,Attribute12,Attribute13,Attribute14,Attribute15,Attribute16,Attribute17,Attribute18,Attribute19,Attribute20,Attribute21,Attribute22
0,A11,6.0,A34,A43,1169.0,A65,A75,,A93,A101,4.0,A121,67.0,A143,a152,2.0,,1.0,a192,A201,6.0,
1,A12,48.0,a32,a43,5951.0,a61,a73,2.0,A92,A101,2.0,a121,22.0,,A152,1.0,,1.0,a191,A201,48.0,
2,A14,12.0,A34,A46,2096.0,A61,A74,2.0,A93,A101,3.0,A121,49.0,A143,A152,1.0,a172,2.0,A191,A201,12.0,
3,a11,42.0,A32,A42,7882.0,A61,a74,2.0,A93,A103,4.0,a122,45.0,A143,A153,1.0,,2.0,A191,A201,42.0,
4,a11,24.0,A33,A40,4870.0,A61,A73,3.0,A93,A101,4.0,A124,53.0,,a153,2.0,A173,2.0,A191,A201,24.0,99.0


In [152]:
# your code here
y = pd.read_csv('y_gcd.csv')
y.head()

Unnamed: 0,class
0,good
1,bad
2,good
3,good
4,bad


In [153]:
# your code here
df = pd.concat([X, y], axis=1)
df.head()

Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,Attribute11,Attribute12,Attribute13,Attribute14,Attribute15,Attribute16,Attribute17,Attribute18,Attribute19,Attribute20,Attribute21,Attribute22,class
0,A11,6.0,A34,A43,1169.0,A65,A75,,A93,A101,4.0,A121,67.0,A143,a152,2.0,,1.0,a192,A201,6.0,,good
1,A12,48.0,a32,a43,5951.0,a61,a73,2.0,A92,A101,2.0,a121,22.0,,A152,1.0,,1.0,a191,A201,48.0,,bad
2,A14,12.0,A34,A46,2096.0,A61,A74,2.0,A93,A101,3.0,A121,49.0,A143,A152,1.0,a172,2.0,A191,A201,12.0,,good
3,a11,42.0,A32,A42,7882.0,A61,a74,2.0,A93,A103,4.0,a122,45.0,A143,A153,1.0,,2.0,A191,A201,42.0,,good
4,a11,24.0,A33,A40,4870.0,A61,A73,3.0,A93,A101,4.0,A124,53.0,,a153,2.0,A173,2.0,A191,A201,24.0,99.0,bad


## Data Dictionary
>
>Below is the official data dictionary for the German Credit dataset.  
>
>Notice how the variables are originally labeled as `Attribute1`, `Attribute2`, ‚Ä¶ `Attribute20`.  
>
>Although this scheme preserves the order of the variables, it is **not descriptive**, which makes the dataset hard to read during analysis.
>
>| Variable Name | Role    | Type         | Demographic     | Description                                             | Units |
>|---------------|---------|--------------|-----------------|---------------------------------------------------------|-------|
>| Attribute1    | Feature | Categorical  |                 | Status of existing checking account                     |       |
>| Attribute2    | Feature | Integer      |                 | Duration                                                | months|
>| Attribute3    | Feature | Categorical  |                 | Credit history                                          |       |
>| Attribute4    | Feature | Categorical  |                 | Purpose                                                 |       |
>| Attribute5    | Feature | Integer      |                 | Credit amount                                           |       |
>| Attribute6    | Feature | Categorical  |                 | Savings account/bonds                                   |       |
>| Attribute7    | Feature | Categorical  | Other           | Present employment since                                |       |
>| Attribute8    | Feature | Integer      |                 | Installment rate as % of disposable income              |       |
>| Attribute9    | Feature | Categorical  | Marital Status  | Personal status and sex                                 |       |
>| Attribute10   | Feature | Categorical  |                 | Other debtors / guarantors                              |       |
>| Attribute11   | Feature | Integer      |                 | Present residence since                                 |       |
>| Attribute12   | Feature | Categorical  |                 | Property owned                                          |       |
>| Attribute13   | Feature | Integer      | Age             | Age                                                     | years |
>| Attribute14   | Feature | Categorical  |                 | Other installment plans                                 |       |
>| Attribute15   | Feature | Categorical  | Other           | Housing                                                 |       |
>| Attribute16   | Feature | Integer      |                 | Number of existing credits at this bank                 |       |
>| Attribute17   | Feature | Categorical  | Occupation      | Job                                                     |       |
>| Attribute18   | Feature | Integer      |                 | Number of dependents                                    |       |
>| Attribute19   | Feature | Binary       |                 | Telephone                                               |       |
>| Attribute20   | Feature | Binary       | Other           | Foreign worker                                          |       |
>| class         | Target  | Binary       |                 | 1 = Good, 2 = Bad                                       |       |
---
## 2. Clean the Data

> You may have noticed that the column names are mostly **impractical for quick or direct analysis**.
>
> Labels like `Attribute3` or `Attribute14` do not convey meaning and force us to constantly consult the data dictionary.
>
> Before doing any EDA, it is important to assign **clear, consistent, and descriptive** column names.
> This improves:
>
> - readability  
> - visualization and plotting  
> - correlation analysis  
> - interpretability of models later on

---

### Q2. Use the dictionary below to rename all columns to meaningful, standardized names.
 
- Apply it using `.rename(columns=...)` right after concatenating `X` and `y`.

```python
rename_dict = {
    "Attribute1":  "checking_status",
    "Attribute2":  "duration_months",
    "Attribute3":  "credit_history",
    "Attribute4":  "purpose",
    "Attribute5":  "credit_amount",
    "Attribute6":  "savings_account",
    "Attribute7":  "employment_since",
    "Attribute8":  "installment_rate",
    "Attribute9":  "personal_status_sex",
    "Attribute10": "other_debtors",
    "Attribute11": "residence_since",
    "Attribute12": "property",
    "Attribute13": "age",
    "Attribute14": "other_installment_plans",
    "Attribute15": "housing",
    "Attribute16": "existing_credits",
    "Attribute17": "job",
    "Attribute18": "dependents",
    "Attribute19": "telephone",
    "Attribute20": "foreign_worker",
    "class":       "credit_risk"
}


In [154]:
rename_dict = {
    "Attribute1":  "checking_status",
    "Attribute2":  "duration_months",
    "Attribute3":  "credit_history",
    "Attribute4":  "purpose",
    "Attribute5":  "credit_amount",
    "Attribute6":  "savings_account",
    "Attribute7":  "employment_since",
    "Attribute8":  "installment_rate",
    "Attribute9":  "personal_status_sex",
    "Attribute10": "other_debtors",
    "Attribute11": "residence_since",
    "Attribute12": "property",
    "Attribute13": "age",
    "Attribute14": "other_installment_plans",
    "Attribute15": "housing",
    "Attribute16": "existing_credits",
    "Attribute17": "job",
    "Attribute18": "dependents",
    "Attribute19": "telephone",
    "Attribute20": "foreign_worker",
    "Attribute21": "months",
    "Attribute22": "postal_area",
    "class":       "credit_risk"
    
}

# your code here


### Q2.1 Obtain the `.info()` from the Dataset:

>Investigate the datatypes of each column. Are they appropriate?

In [155]:
# your code here
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 23 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Attribute1   946 non-null    object 
 1   Attribute2   949 non-null    float64
 2   Attribute3   946 non-null    object 
 3   Attribute4   954 non-null    object 
 4   Attribute5   931 non-null    float64
 5   Attribute6   1000 non-null   object 
 6   Attribute7   960 non-null    object 
 7   Attribute8   944 non-null    float64
 8   Attribute9   948 non-null    object 
 9   Attribute10  951 non-null    object 
 10  Attribute11  949 non-null    float64
 11  Attribute12  940 non-null    object 
 12  Attribute13  963 non-null    float64
 13  Attribute14  945 non-null    object 
 14  Attribute15  951 non-null    object 
 15  Attribute16  939 non-null    float64
 16  Attribute17  943 non-null    object 
 17  Attribute18  955 non-null    float64
 18  Attribute19  965 non-null    object 
 19  Attribu

### Q2.2 Obtain descriptive statistics using `.describe()`

In [156]:
# your code here
df.describe()

Unnamed: 0,Attribute2,Attribute5,Attribute8,Attribute11,Attribute13,Attribute16,Attribute18,Attribute21
count,949.0,931.0,944.0,949.0,963.0,939.0,955.0,949.0
mean,20.989463,3259.498389,2.962924,2.839831,36.15161,1.401491,1.159162,20.989463
std,12.051222,2807.398373,1.117537,1.102191,16.746982,0.572689,0.366019,12.051222
min,4.0,276.0,1.0,1.0,-5.0,1.0,1.0,4.0
25%,12.0,1365.0,2.0,2.0,26.0,1.0,1.0,12.0
50%,18.0,2320.0,3.0,3.0,33.0,1.0,1.0,18.0
75%,24.0,3974.0,4.0,4.0,42.0,2.0,1.0,24.0
max,72.0,18424.0,4.0,4.0,150.0,4.0,2.0,72.0


### Q2.3 Investigate how many missing values are in each column

In [157]:
# you code here
df.isnull().sum()

Attribute1      54
Attribute2      51
Attribute3      54
Attribute4      46
Attribute5      69
Attribute6       0
Attribute7      40
Attribute8      56
Attribute9      52
Attribute10     49
Attribute11     51
Attribute12     60
Attribute13     37
Attribute14     55
Attribute15     49
Attribute16     61
Attribute17     57
Attribute18     45
Attribute19     35
Attribute20     46
Attribute21     51
Attribute22    840
class           42
dtype: int64

### Q2.4 Create `numerical` and `categorical` lists
- you can check based on dtype (`'O'`) for object
- you can also check using `df.select_dtypes(include=object)` or `np.number` for numerical

In [158]:
# your code here
numerical = df.select_dtypes(include=np.number).columns.to_list()
numerical

['Attribute2',
 'Attribute5',
 'Attribute8',
 'Attribute11',
 'Attribute13',
 'Attribute16',
 'Attribute18',
 'Attribute21']

In [159]:
# your code here
categorical = df.select_dtypes(include=object).columns.to_list()
categorical

['Attribute1',
 'Attribute3',
 'Attribute4',
 'Attribute6',
 'Attribute7',
 'Attribute9',
 'Attribute10',
 'Attribute12',
 'Attribute14',
 'Attribute15',
 'Attribute17',
 'Attribute19',
 'Attribute20',
 'Attribute22',
 'class']

In [160]:
# your code here


In [161]:
# your code here


### Q2.5 Standardize categorical columns

- if possible, define strategies that could be used in columns with the same problem
- if there are distinct problems, create lists containing a subset of columns with the same problem

In [162]:
# your code here
for col in categorical:
    print(f'{df[col].value_counts()}', end='\n\n')

Attribute1
A14    308
A11    212
A12    207
a14     67
A13     49
a12     48
a11     45
a13     10
Name: count, dtype: int64

Attribute3
A32    422
A34    241
a32     79
A33     68
A31     44
a34     37
A30     34
a33     13
a30      4
a31      4
Name: count, dtype: int64

Attribute4
A43     227
A40     186
A42     153
A41      85
A49      81
a43      43
A46      41
a40      36
A45      21
a42      18
a41      14
A410     12
A44      11
a49      10
a46       7
A48       7
a48       1
a44       1
Name: count, dtype: int64

Attribute6
A61     469
A65     140
a61     100
A62      89
A63      48
nan      47
A64      38
a65      34
a63      14
a62      13
a64       8
Name: count, dtype: int64

Attribute7
A73        279
A75        200
A72        140
A74        138
a73         47
A71         47
a75         39
a72         25
a74         24
a71         11
unknown     10
Name: count, dtype: int64

Attribute9
A93    424
A92    242
a93     86
A94     68
a92     49
A91     41
a94     22
??      10


In [163]:
# your code here
# as we can see, there are columns with values in different case, mix of values as strings and integers, presence of weird symbols such as '??' and even ' ', ' nan ', 'unknown' and values that could be striped

In [164]:
# lets use .unique() now because it will indicate also nans present in each column
for col in categorical:
    print(f'{col}\n{df[col].unique()}', end='\n\n')

Attribute1
['A11' 'A12' 'A14' 'a11' 'a12' 'a14' nan 'A13' 'a13']

Attribute3
['A34' 'a32' 'A32' 'A33' nan 'A30' 'A31' 'a33' 'a34' 'a30' 'a31']

Attribute4
['A43' 'a43' 'A46' 'A42' 'A40' 'a46' nan 'A49' 'a40' 'A41' 'A44' 'A45'
 'a41' 'a42' 'a49' 'A410' 'A48' 'a48' 'a44']

Attribute6
[' A65 ' ' a61 ' ' A61 ' ' a65 ' ' A63 ' ' A64 ' ' A62 ' ' nan ' ' a64 '
 ' a63 ' ' a62 ']

Attribute7
['A75' 'a73' 'A74' 'a74' 'A73' 'a71' 'A72' 'a75' 'A71' 'a72' nan 'unknown']

Attribute9
['A93' 'A92' 'a93' 'A91' nan 'a92' 'A94' '??' 'a94' 'a91']

Attribute10
['A101' 'A103' nan 'A102' 'a101' 'a103' 'a102']

Attribute12
['A121' 'a121' 'a122' 'A124' 'A122' 'A123' 'a123' nan 'a124']

Attribute14
['A143' nan 'a143' 'A141' 'a141' 'A142' 'a142']

Attribute15
['a152' 'A152' 'A153' 'a153' 'A151' nan 'a151']

Attribute17
[nan 'a172' 'A173' 'A172' 'A174' 'a173' 'A171' 'a174' 'a171']

Attribute19
['a192' 'a191' 'A191' 'A192' nan]

Attribute20
['A201' nan 'a201' 'a202' 'A202']

Attribute22
[nan '99' ' ']

class
['goo

In [165]:
def clean_categorical_column(series, special_na_values=None):
    """
    Cleans a categorical column by fixing:
    - extra whitespace
    - mixed upper/lowercase
    - string values representing NA
    - strange symbols (??, '', ' ')
    - mixed data types (int/float + string)

    Parameters
    ----------
    series : pd.Series
        Column to clean.

    special_na_values : list
        Optional list of values that should be treated as NaN.

    Returns
    -------
    pd.Series
        Cleaned categorical column.
    """

    # 1. Convert everything to string (preserving NA values)
    s = series.astype("string")

    # 2. Strip leading/trailing whitespace
    s = s.str.strip()

    # 3. Normalize casing: convert to lowercase first
    s = s.str.lower()

    # 4. Define values that should be treated as missing (NaN)
    default_na = {"nan", "none", "null", "", " ", "??", "unknown"}
    if special_na_values:
        # Add custom NA values
        default_na.update({v.lower() for v in special_na_values})

    # Replace NA-like strings with actual NaN
    s = s.replace(default_na, np.nan)

    # 5. Standardize label format: convert strings like "a11" ‚Üí "A11"
    # (only if the value starts with a letter)
    s = s.apply(lambda x: x.upper() if isinstance(x, str) and x and x[0].isalpha() else x)

    # 6. Convert numeric strings to integers (optional behavior)
    s = s.apply(lambda x: int(x) if isinstance(x, str) and x.isdigit() else x)

    return s


In [166]:
# lets clean now applyng the function to all columns
for col in categorical:
    df[col] = clean_categorical_column(df[col])

In [167]:
# lets verify again
for col in categorical:
    print(f'{col}\n{df[col].unique()}', end='\n\n')

Attribute1
['A11' 'A12' 'A14' <NA> 'A13']

Attribute3
['A34' 'A32' 'A33' <NA> 'A30' 'A31']

Attribute4
['A43' 'A46' 'A42' 'A40' <NA> 'A49' 'A41' 'A44' 'A45' 'A410' 'A48']

Attribute6
['A65' 'A61' 'A63' 'A64' 'A62' <NA>]

Attribute7
['A75' 'A73' 'A74' 'A71' 'A72' <NA>]

Attribute9
['A93' 'A92' 'A91' <NA> 'A94']

Attribute10
['A101' 'A103' <NA> 'A102']

Attribute12
['A121' 'A122' 'A124' 'A123' <NA>]

Attribute14
['A143' <NA> 'A141' 'A142']

Attribute15
['A152' 'A153' 'A151' <NA>]

Attribute17
[<NA> 'A172' 'A173' 'A174' 'A171']

Attribute19
['A192' 'A191' <NA>]

Attribute20
['A201' <NA> 'A202']

Attribute22
[<NA> 99]

class
['GOOD' 'BAD' <NA> 2 1]



In [168]:
# nan values are expected, but we can see now that the class column still has mix values
# lets map those values according to the Data Dictionary where 1 = Good, 2 = Bad
# Replace text values with numeric dictionary values
mapping = {
    "GOOD": 1,
    "BAD": 2,
}

df["class"] = df["class"].replace(mapping)

df["class"].value_counts()

class
1    664
2    294
Name: count, dtype: int64

### Q2.6 Verify the percentage of missing values in each `categorical` column:
- if it's below `5%`, input the `Mode` (this may not be the best approach but we are cleaning the best we can with what we have learned so far)
- if it's above `40%` drop the column

In [169]:
for col in categorical:
    print(f'{col} -> {(df[col].isnull().sum()/df.shape[0]) * 100} %')

Attribute1 -> 5.4 %
Attribute3 -> 5.4 %
Attribute4 -> 4.6 %
Attribute6 -> 4.7 %
Attribute7 -> 5.0 %
Attribute9 -> 6.2 %
Attribute10 -> 4.9 %
Attribute12 -> 6.0 %
Attribute14 -> 5.5 %
Attribute15 -> 4.9 %
Attribute17 -> 5.7 %
Attribute19 -> 3.5000000000000004 %
Attribute20 -> 4.6 %
Attribute22 -> 95.1 %
class -> 4.2 %


In [170]:
# lets drop the column
df.drop(columns=['Attribute22'], inplace=True)

In [171]:
# lets verify if the column was indeed dropped
'Attribute22' in df.columns.to_list()

False

### Q2.7. Verify the percentage of missing values in each `numerical` column
- if it's above `40%` drop the column
- inpute the `mean`, `median` or `mode`, decide yourself which you are going to use

In [172]:
for col in numerical:
    print(f'{col} -> {(df[col].isnull().sum()/df.shape[0]) * 100} %')

Attribute2 -> 5.1 %
Attribute5 -> 6.9 %
Attribute8 -> 5.6000000000000005 %
Attribute11 -> 5.1 %
Attribute13 -> 3.6999999999999997 %
Attribute16 -> 6.1 %
Attribute18 -> 4.5 %
Attribute21 -> 5.1 %


In [173]:
for col in numerical:

    if df[col].isnull().sum() > 0:
        median_value = df[col].median()
        df[col] = df[col].fillna(median_value)   # ‚úî no chained assignment
        print(f"Imputed median ({median_value}) for column '{col}'\n")
    else:
        print(f"No imputation needed for '{col}'\n")


Imputed median (18.0) for column 'Attribute2'

Imputed median (2320.0) for column 'Attribute5'

Imputed median (3.0) for column 'Attribute8'

Imputed median (3.0) for column 'Attribute11'

Imputed median (33.0) for column 'Attribute13'

Imputed median (1.0) for column 'Attribute16'

Imputed median (1.0) for column 'Attribute18'

Imputed median (18.0) for column 'Attribute21'



### Q2.8 Verify if there are **outliers** in `numerical` columns using the `IQR method`

> To detect outliers in a numerical column, we can use the **Interquartile Range (IQR) method**.
> The IQR represents the spread of the middle 50% of the data.
>
> The formula works as follows:
>
> - Compute the 1st quartile (Q1) ‚Üí 25th percentile  
> - Compute the 3rd quartile (Q3) ‚Üí 75th percentile  
> - Compute the **IQR**:
>
> $$
> \text{IQR} = Q3 - Q1
> $$
>
> Outliers are any observations outside the following bounds:
>
> $$
> \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
> $$
> $$
> \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
> $$
>
> Values smaller than the lower bound or greater than the upper bound are considered **outliers**.
>
> Now, define a function that verifies whether a column contains outliers:
>
>```python
>def verify_outliers(df: pd.DataFrame, col: str) -> bool:
>    q1 = df[col].quantile(0.25)
>    q3 = df[col].quantile(0.75)
>    iqr = q3 - q1
>    # continue from here
>```
>
>Return a bool from the function and apply it on every `numerical` column


In [175]:
def verify_outliers(df: pd.DataFrame, col: str) -> bool:
    q1 = df[col].quantile(0.25)
    q3 = df[col].quantile(0.75)
    iqr = q3 - q1

    # Define the lower and upper limits
    lower_bound = q1 - 1.5 * iqr
    upper_bound = q3 + 1.5 * iqr

    # Count how many values fall outside the limits
    outlier_mask = (df[col] < lower_bound) | (df[col] > upper_bound)
    outlier_count = outlier_mask.sum()

    # Print summary (optional)
    print(f"Column: {col}")
    print(f"Lower bound: {lower_bound:.4f}, Upper bound: {upper_bound:.4f}")
    print(f"Outliers found: {outlier_count}")

    # Return True if at least one outlier exists
    return outlier_count > 0


In [177]:
for col in numerical:
    if verify_outliers(df, col):
        print(f"Outliers detected in {col}\n")
    else:
        print(f"No outliers in {col}\n")


Column: Attribute2
Lower bound: -6.0000, Upper bound: 42.0000
Outliers found: 65
Outliers detected in Attribute2

Column: Attribute5
Lower bound: -2270.7500, Upper bound: 7525.2500
Outliers found: 79
Outliers detected in Attribute5

Column: Attribute8
Lower bound: -1.0000, Upper bound: 7.0000
Outliers found: 0
No outliers in Attribute8

Column: Attribute11
Lower bound: -1.0000, Upper bound: 7.0000
Outliers found: 0
No outliers in Attribute11

Column: Attribute13
Lower bound: 6.0000, Upper bound: 62.0000
Outliers found: 54
Outliers detected in Attribute13

Column: Attribute16
Lower bound: -0.5000, Upper bound: 3.5000
Outliers found: 5
Outliers detected in Attribute16

Column: Attribute18
Lower bound: 1.0000, Upper bound: 1.0000
Outliers found: 152
Outliers detected in Attribute18

Column: Attribute21
Lower bound: -6.0000, Upper bound: 42.0000
Outliers found: 65
Outliers detected in Attribute21



<details>
<summary><h3>Can we also use the IQR method to remove outliers?</h3></summary>

> Yes, the same mathematical rule used to *detect* outliers can also be used
> to *remove* them.  
>
> Once we compute the lower and upper bounds:
>
> $$
> \text{Lower Bound} = Q1 - 1.5 \times \text{IQR}
> $$
> $$
> \text{Upper Bound} = Q3 + 1.5 \times \text{IQR}
> $$
>
> We can simply filter the DataFrame to keep only the values **within these limits**.
>
> This is known as **IQR-based outlier removal** and is one of the most common
> preprocessing techniques in data cleaning, especially for algorithms that are 
> sensitive to extreme values.
>
> Example function to *remove* outliers from a column:
>
> ```python
> def remove_outliers_iqr(df: pd.DataFrame, col: str) -> pd.DataFrame:
>     q1 = df[col].quantile(0.25)
>     q3 = df[col].quantile(0.75)
>     iqr = q3 - q1
>
>     lower = q1 - 1.5 * iqr
>     upper = q3 + 1.5 * iqr
>
>     return df[(df[col] >= lower) & (df[col] <= upper)]
> ```
>
___
<div style="background-color:#f2f2f2; padding:12px; border-left:4px solid #d9534f; border-radius:4px; margin:10px 0;">
<strong>‚ö†Ô∏è NOTE:</strong> REMOVING OUTLIERS IS NOT ALWAYS RECOMMENDED.<br>
It depends on the context and whether extreme values are real observations or measurement errors.<br>
In credit scoring datasets like this one, outliers may represent important patterns of risk.
</div>


</details>

### Q2.9 Any other problematic column?
- Check for dtypes and duplicated information üòâ
- Convert the columns and drop duplicated information

In [None]:
df.head()

Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,Attribute11,Attribute12,Attribute13,Attribute14,Attribute15,Attribute16,Attribute17,Attribute18,Attribute19,Attribute20,Attribute21,class
0,A11,6.0,A34,A43,1169.0,A65,A75,3.0,A93,A101,4.0,A121,67.0,A143,A152,2.0,,1.0,A192,A201,6.0,1
1,A12,48.0,A32,A43,5951.0,A61,A73,2.0,A92,A101,2.0,A121,22.0,,A152,1.0,,1.0,A191,A201,48.0,2
2,A14,12.0,A34,A46,2096.0,A61,A74,2.0,A93,A101,3.0,A121,49.0,A143,A152,1.0,A172,2.0,A191,A201,12.0,1
3,A11,42.0,A32,A42,7882.0,A61,A74,2.0,A93,A103,4.0,A122,45.0,A143,A153,1.0,,2.0,A191,A201,42.0,1
4,A11,24.0,A33,A40,4870.0,A61,A73,3.0,A93,A101,4.0,A124,53.0,,A153,2.0,A173,2.0,A191,A201,24.0,2


In [180]:
duplicate_cols = df.T[df.T.duplicated()].index.tolist()

print("Duplicated columns by content:", duplicate_cols)

Duplicated columns by content: ['Attribute21']


In [181]:
# lets verify for duplicated rows
duplicate_rows = df[df.duplicated()]
print(f"Duplicated rows found: {duplicate_rows.shape[0]}")
duplicate_rows

Duplicated rows found: 0


Unnamed: 0,Attribute1,Attribute2,Attribute3,Attribute4,Attribute5,Attribute6,Attribute7,Attribute8,Attribute9,Attribute10,Attribute11,Attribute12,Attribute13,Attribute14,Attribute15,Attribute16,Attribute17,Attribute18,Attribute19,Attribute20,Attribute21,class


### Q2.10 Export the dataset to a `csv` file as `cleaned_credit_risk_dataset.csv`

In [182]:
df.to_csv('cleaned_credit_risk_dataset.csv')

## Exploratory Data Analysis

Before we continue with groupby-based exploration, it is important to notice that  
many columns in the German Credit dataset contain *coded categorical values* such as:

- `A11`, `A12`, `A13`, ‚Ä¶
- `A30`, `A31`, ‚Ä¶
- `A40`, `A41`, ‚Ä¶
- `A171`, `A172`, ‚Ä¶

These codes make the dataset harder to read and interpret during analysis.

> This is extremely common in real datasets:
> - data may come encoded for storage efficiency  
> - documentation may be separate from the data  
> - variables may need mapping tables to become understandable  

To make our exploratory analysis clearer ‚Äî and to avoid constantly checking the data dictionary ‚Äî  
we will now apply an explicit **mapping** from coded values to descriptive labels.

The mappings below were created based on the official dataset documentation provided in:

üîó https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data

---

>**Install the library.**

In [125]:
!pip install ucimlrepo

^C


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


>**Load the original dataset using the `ucimlrepo` library**

In [183]:
from ucimlrepo import fetch_ucirepo 

# fetch dataset 
statlog_german_credit_data = fetch_ucirepo(id=144) 
  
# data (as pandas dataframes) 
X = statlog_german_credit_data.data.features 
y = statlog_german_credit_data.data.targets 
  
# metadata 
print(statlog_german_credit_data.metadata) 
  
# variable information 
#print(statlog_german_credit_data.variables) 

{'uci_id': 144, 'name': 'Statlog (German Credit Data)', 'repository_url': 'https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data', 'data_url': 'https://archive.ics.uci.edu/static/public/144/data.csv', 'abstract': 'This dataset classifies people described by a set of attributes as good or bad credit risks. Comes in two formats (one all numeric). Also comes with a cost matrix', 'area': 'Social Science', 'tasks': ['Classification'], 'characteristics': ['Multivariate'], 'num_instances': 1000, 'num_features': 20, 'feature_types': ['Categorical', 'Integer'], 'demographics': ['Other', 'Marital Status', 'Age', 'Occupation'], 'target_col': ['class'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 1994, 'last_updated': 'Thu Aug 10 2023', 'dataset_doi': '10.24432/C5NC77', 'creators': ['Hans Hofmann'], 'intro_paper': None, 'additional_info': {'summary': 'Two datasets are provided.  the original dataset, in the form provided by

>**Renaming the columns with human-readable names.**

In [184]:
rename_dict = {
    "Attribute1":  "checking_status",
    "Attribute2":  "duration_months",
    "Attribute3":  "credit_history",
    "Attribute4":  "purpose",
    "Attribute5":  "credit_amount",
    "Attribute6":  "savings_account",
    "Attribute7":  "employment_since",
    "Attribute8":  "installment_rate",
    "Attribute9":  "personal_status_sex",
    "Attribute10": "other_debtors",
    "Attribute11": "residence_since",
    "Attribute12": "property",
    "Attribute13": "age",
    "Attribute14": "other_installment_plans",
    "Attribute15": "housing",
    "Attribute16": "existing_credits",
    "Attribute17": "job",
    "Attribute18": "dependents",
    "Attribute19": "telephone",
    "Attribute20": "foreign_worker",
    "Attribute21": "months",
    "Attribute22": "postal_area",
    "class":       "credit_risk"
    
}

df.rename(columns=rename_dict, inplace=True)
df.head()

Unnamed: 0,checking_status,duration_months,credit_history,purpose,credit_amount,savings_account,employment_since,installment_rate,personal_status_sex,other_debtors,residence_since,property,age,other_installment_plans,housing,existing_credits,job,dependents,telephone,foreign_worker,months,credit_risk
0,A11,6.0,A34,A43,1169.0,A65,A75,3.0,A93,A101,4.0,A121,67.0,A143,A152,2.0,,1.0,A192,A201,6.0,1
1,A12,48.0,A32,A43,5951.0,A61,A73,2.0,A92,A101,2.0,A121,22.0,,A152,1.0,,1.0,A191,A201,48.0,2
2,A14,12.0,A34,A46,2096.0,A61,A74,2.0,A93,A101,3.0,A121,49.0,A143,A152,1.0,A172,2.0,A191,A201,12.0,1
3,A11,42.0,A32,A42,7882.0,A61,A74,2.0,A93,A103,4.0,A122,45.0,A143,A153,1.0,,2.0,A191,A201,42.0,1
4,A11,24.0,A33,A40,4870.0,A61,A73,3.0,A93,A101,4.0,A124,53.0,,A153,2.0,A173,2.0,A191,A201,24.0,2


>**Run the following block to replace the coded categorical values with human-readable descriptions.**

In [185]:
# -----------------------------------------
# SAVE ORIGINAL PERSONAL_STATUS_SEX CODES
# (needed later to extract 'sex' and clean personal status)
# -----------------------------------------
df["personal_status_sex_code"] = df["personal_status_sex"].copy()

# -----------------------------------------
# MAPPINGS FOR QUALITATIVE VARIABLES
# -----------------------------------------

map_status = {
    "A11": "< 0 DM",
    "A12": "0<=X<200 DM",
    "A13": ">=200 DM / salary assignments ‚â• 1 year",
    "A14": "no checking account"
}

map_history = {
    "A30": "no credits taken / all paid back duly",
    "A31": "all credits at this bank paid back duly",
    "A32": "existing credits paid back duly till now",
    "A33": "delay in paying off in the past",
    "A34": "critical account / other credits elsewhere"
}

map_purpose = {
    "A40": "car (new)",
    "A41": "car (used)",
    "A42": "furniture/equipment",
    "A43": "radio/television",
    "A44": "domestic appliances",
    "A45": "repairs",
    "A46": "education",
    # A47 does not exist in the original dataset
    "A48": "retraining",
    "A49": "business",
    "A410": "others"
}

map_savings = {
    "A61": "<100 DM",
    "A62": "100<=X<500 DM",
    "A63": "500<=X<1000 DM",
    "A64": ">=1000 DM",
    "A65": "unknown/no savings"
}

map_employment = {
    "A71": "unemployed",
    "A72": "<1 year",
    "A73": "1‚Äì4 years",
    "A74": "4‚Äì7 years",
    "A75": ">=7 years"
}

# Combined personal status + sex text
map_personal_status_sex = {
    "A91": "male: divorced/separated",
    "A92": "female: divorced/separated/married",
    "A93": "male: single",
    "A94": "male: married/widowed",
    "A95": "female: single"
}

map_debtors = {
    "A101": "none",
    "A102": "co-applicant",
    "A103": "guarantor"
}

map_property = {
    "A121": "real estate",
    "A122": "building society savings/life insurance",
    "A123": "car or other (not in savings)",
    "A124": "unknown/no property"
}

map_installment_plans = {
    "A141": "bank",
    "A142": "stores",
    "A143": "none"
}

map_housing = {
    "A151": "rent",
    "A152": "own",
    "A153": "for free"
}

map_job = {
    "A171": "unemployed/unskilled ‚Äì non-resident",
    "A172": "unskilled ‚Äì resident",
    "A173": "skilled employee/official",
    "A174": "management/self-employed/highly qualified"
}

map_telephone = {
    "A191": "none",
    "A192": "yes, registered"
}

map_foreign = {
    "A201": "yes",
    "A202": "no"
}

# -----------------------------------------
# APPLY MAPPINGS TO THE DATAFRAME
# -----------------------------------------

df = df.replace({
    "status": map_status,
    "credit_history": map_history,
    "purpose": map_purpose,
    "savings": map_savings,
    "present_employment": map_employment,
    "personal_status_sex": map_personal_status_sex,  # human-readable combined field
    "other_debtors": map_debtors,
    "property": map_property,
    "other_installment_plans": map_installment_plans,
    "housing": map_housing,
    "job": map_job,
    "telephone": map_telephone,
    "foreign_worker": map_foreign
})

# -----------------------------------------
# SPLIT personal_status_sex INTO 'sex' AND CLEAN 'personal_status'
# (using the original codes saved in personal_status_sex_code)
# -----------------------------------------

# Mapping to extract sex only
map_sex = {
    "A91": "male",
    "A92": "female",
    "A93": "male",
    "A94": "male",
    "A95": "female"
}

# Mapping to extract civil/marital status only
map_personal_status_clean = {
    "A91": "divorced/separated",
    "A92": "divorced/separated/married",
    "A93": "single",
    "A94": "married/widowed",
    "A95": "single"
}

# Create the new 'sex' column
df["sex"] = df["personal_status_sex_code"].map(map_sex)

# Create a new 'personal_status' column with only civil status
df["personal_status"] = df["personal_status_sex_code"].map(map_personal_status_clean)

# Drop the temporary code column
df.drop(columns=["personal_status_sex_code"], inplace=True)


## 3. Introduction to `groupby()` for Exploratory Analysis

> Until now, we have used methods such as `value_counts()`, `mean()`, or `describe()`  
> to inspect columns individually.
>
> However, real datasets often have **subgroups** that behave differently, and we may want
> to understand how a variable behaves *inside* each subgroup.
>
> For this, Pandas provides the command:
>
> `df.groupby("column")`
>
> which splits the dataset into smaller groups based on the values of one column.
>
> Each group can then be inspected separately.

---

### 3.1 Counting Values Inside Groups ‚Äî `.groupby().value_counts()`

> This tells us **how a categorical variable behaves inside each subgroup**.
>
>**Example:**
>
>```python
>   df.groupby("housing")["checking_status"].value_counts()
>```

### Q3.1. Inspect Categorical Distributions Inside Groups

- Using `.groupby('col1')['related_col'].value_counts()`, compute how the column
`personal_status` is distributed inside each `credit_risk` group.

- Your output should show, **for each value** of `credit_risk`,
**the count of each category** in `personal_status`.


In [186]:
df.groupby("credit_risk")["personal_status"].value_counts()

credit_risk  personal_status           
1            single                        353
             divorced/separated/married    179
             married/widowed                64
             divorced/separated             29
2            single                        133
             divorced/separated/married     99
             married/widowed                25
             divorced/separated             17
Name: count, dtype: int64

### Q3.2 Inspect Categorical Distributions Inside Sub-Groups

- Using `.groupby(['col1', 'col2'])['related_col'].value_counts()`, compute how the column
`personal_status` is distributed across each `sex` category within each `credit_risk` group.

- The output should display, for every value of credit_risk, the count of each category in personal_status, separated by sex.

>**Keep in mind that as we add more grouping columns, the resulting output becomes less intuitive to read.**


In [187]:
df.groupby(["credit_risk", "sex"])["personal_status"].value_counts()

credit_risk  sex     personal_status           
1            female  divorced/separated/married    179
             male    single                        353
                     married/widowed                64
                     divorced/separated             29
2            female  divorced/separated/married     99
             male    single                        133
                     married/widowed                25
                     divorced/separated             17
Name: count, dtype: int64

### Q3.3 Compare the distribution of `housing` Inside each `credit_risk` group

In [188]:
df.groupby("credit_risk")["housing"].value_counts()

credit_risk  housing 
1            own         467
             rent        101
             for free     60
2            own         182
             rent         64
             for free     37
Name: count, dtype: int64

### Q3.4 Which professionals category have the highest average credit amount?

- For each `job` category, compute the mean of `credit_amount`.

In [190]:
df.groupby("job")["credit_amount"].mean().sort_values(ascending=False)

job
management/self-employed/highly qualified    5198.668966
skilled employee/official                    3000.796265
unskilled ‚Äì resident                         2348.148148
unemployed/unskilled ‚Äì non-resident          2147.400000
Name: credit_amount, dtype: float64

### Q3.5. Inspect `age` statistics inside each `credit_risk` group
- Compute descriptive statistics (`.describe()`) for the column `age` inside each `credit_risk` group.

In [191]:
# your code here
df.groupby("credit_risk")["age"].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
credit_risk,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
1,664.0,36.951807,17.123122,-5.0,27.0,33.0,42.0,150.0
2,294.0,34.210884,15.361061,-5.0,25.0,31.0,39.0,150.0


### Q3.6. Create an `age` **binning column** (`age_group`) to explore group statistics


>**Remember that we can create bins using `cut` like:**
>
>```python
>   bins = [0, 25, 40, 60, 120]   # interval limits
>   labels = ["<25", "25‚Äì40", "40‚Äì60", "60+"]  # names of the age groups
>
>   df["age_bin"] = pd.cut(df["col"], bins=bins, labels=labels)
>```
>
>**Using `qcut` to split into equal `n` parts like:**
>
>```python
>   df["age_bin_q"] = pd.qcut(df["col"], q=n, labels=["Q1", "Q2", "Q3", ..., "QN"])
>```

- **We want meaningful age groups such as (e.g., `<25`, `25‚Äì40`, `40‚Äì60`, `60+`).**


In [194]:
bins = [0, 25, 40, 60, 120]  
labels = ["<25", "25‚Äì40", "40‚Äì60", "60+"]  

df["age_group"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
df["age_group"].value_counts()

age_group
25‚Äì40    563
40‚Äì60    225
<25      144
60+       48
Name: count, dtype: int64

### Q3.7. Compare credit risk across age groups

- Using the `age_group` created in the previous question, analyze how `credit_risk` is distributed inside each age group.

In [196]:
df.groupby("age_group", observed=False)["credit_risk"].value_counts()

age_group  credit_risk
<25        1               81
           2               54
25‚Äì40      1              378
           2              163
40‚Äì60      1              157
           2               58
60+        1               34
           2               13
Name: count, dtype: int64

### Q3.8 Compute the percentage of bad credit risk per age Group

In [199]:
df['credit_risk'].value_counts()

credit_risk
1    664
2    294
Name: count, dtype: int64

In [None]:
# 1 = Good, 2 = Bad
(
    df.groupby("age_group", observed=False)["credit_risk"]
      .value_counts(normalize=True)
      .loc[:, 2] * 100 # we are going to filter for 2 (Bad)
)

age_group
<25      40.000000
25‚Äì40    30.129390
40‚Äì60    26.976744
60+      27.659574
Name: proportion, dtype: float64

### Q3.9 Based on the previous question answer, younger or older customers are more likely to have good or bad credit risk?

In [None]:
# Younger customers (<25)
#Usually show a higher percentage of bad credit risk.
#Reasons often include lower financial stability, shorter credit history, and higher risk behavior.

# Middle-aged customers (25‚Äì40, 40‚Äì60)
#Typically have the lowest percentage of bad credit risk.
#This group tends to have stable employment, higher income, and more established credit history.

# Older customers (60+)
#Risk may increase slightly again, but usually not as high as the youngest group.
#This can depend on retirement status and fixed-income constraints.

### Q3.10 Compare the average credit amount across age Groups

- Compute the **mean** value of `credit_amount` for each age group.
- Which age group tends to request the highest credit amounts?


In [None]:
df.groupby("age_group", observed=False)["credit_amount"].mean().sort_values(ascending=False)
# 60+

age_group
60+      3373.312500
40‚Äì60    3357.480000
25‚Äì40    3225.532860
<25      2707.298611
Name: credit_amount, dtype: float64

### Q3.11 Compare Employment Duration Across Age Groups

- Compute the count of each `employment_since` category inside each `age_group`.

In [205]:
df.groupby("age_group", observed=False)["employment_since"].value_counts()

age_group  employment_since
<25        A73                  52
           A72                  43
           A74                  24
           A71                   8
           A75                   7
25‚Äì40      A73                 199
           A74                 108
           A75                 102
           A72                  98
           A71                  26
40‚Äì60      A75                 104
           A73                  54
           A74                  23
           A72                  22
           A71                  15
60+        A75                  23
           A73                  11
           A71                   8
           A74                   4
           A72                   0
Name: count, dtype: int64

### Q3.12 Explore Purpose of Credit Within Age Groups

- For each `age_group`, compute how many people requested credit for each type of `purpose`.

In [206]:
df.groupby("age_group", observed=False)["purpose"].value_counts()

age_group  purpose            
<25        radio/television        41
           furniture/equipment     39
           car (new)               25
           car (used)              14
           business                 7
           education                6
           domestic appliances      4
           repairs                  4
           retraining               1
           others                   0
25‚Äì40      radio/television       157
           car (new)              124
           furniture/equipment     87
           business                64
           car (used)              52
           education               22
           repairs                  9
           others                   6
           domestic appliances      5
           retraining               5
40‚Äì60      radio/television        58
           car (new)               54
           furniture/equipment     35
           car (used)              25
           education               16
           busi

### Q3.13 Number of Existing Credits by Age Group
- determine which age group tends to have more existing credit lines.

In [207]:
df.groupby("age_group", observed=False)["existing_credits"].mean().sort_values(ascending=False)

age_group
60+      1.562500
40‚Äì60    1.417778
25‚Äì40    1.390764
<25      1.201389
Name: existing_credits, dtype: float64

### Q3.14 Cross-Analyze Age Groups and Housing

- For each age_group, compute how many people fall into each housing category.

In [209]:
df.groupby("age_group", observed=False)["housing"].value_counts().unstack(fill_value=0)

housing,for free,own,rent
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
<25,5,67,65
25‚Äì40,43,421,74
40‚Äì60,39,143,27
60+,14,31,3


## 3.2 Aggregating Multiple Statistics with `.agg()`

> Until now, we have computed one summary statistic at a time:
>
> - `.mean()`
> - `.size()`
> - `.value_counts()`
> - `.describe()`
>
> These methods are useful, but they only compute **one metric at a time**.
>
> The real power of `groupby()` comes when we want to calculate **several statistics at once**,  
> either for:
>
> - the **same column**, or  
> - **multiple columns** with different metrics.
>
> For this, Pandas provides the method:
>
> ```python
> df.groupby("column").agg({...})
> ```
>
> which allows us to define exactly **which statistics** to compute.
---
>
>**Example 1 ‚Äî Multiple Statistics for One Column**
>
>```python
>   df.groupby("age_group")["credit_amount"].agg(["mean", "median", "max"])
>```
___
>**Example 2 ‚Äî Different Statistics for Different Columns**
>```python
>   df.groupby("age_bin").agg({
>    "credit_amount": ["mean", "std"],
>    "duration_months": ["mean", "max"]
>})
>```
___
> **Example 3 ‚Äî Using Custom Functions Inside `.agg()`**
>
> You can also define your own functions and use them directly inside `.agg()`.
>
> This is extremely useful when the standard statistics (`mean`, `median`, etc.) are not enough for your analysis.
>
> ```python
> # Custom function: range = max - min
> def value_range(series):
>     return series.max() - series.min()
>
> df.groupby("age_group")["credit_amount"].agg([
>     "mean",
>     "median",
>     value_range,     # custom function
> ])
> ```
>
> **This will return a table containing:**
> - the mean  
> - the median  
> - and your custom-defined "range" metric  
>
> computed separately **for each age group**.
___
>**This approach is very common because it allows you to summarize multiple variables at once, grouped by a meaningful category**

### Q3.15. Multiple Statistics for `credit_amount` per age group

- Using `.groupby("age_group")` and `.agg()`, compute the following statistics for `credit_amount` inside each `age_group`:

    - mean  
    - median
    - maximum value


In [211]:
df.groupby("age_group", observed=False)["credit_amount"].agg(["mean", "median", "max"])

Unnamed: 0_level_0,mean,median,max
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
<25,2707.298611,2090.5,15672.0
25‚Äì40,3225.53286,2320.0,18424.0
40‚Äì60,3357.48,2320.0,15945.0
60+,3373.3125,2283.0,14896.0


### Q3.16. Aggregate Two Numerical Columns at Once

- Using .groupby("age_group"), compute:

    - mean and standard deviation of credit_amount

    - mean and max of duration_months

In [213]:
df.groupby("age_group", observed=False).agg({
    "credit_amount": ["mean", "std"],
    "duration_months": ["mean", "max"]
})

Unnamed: 0_level_0,credit_amount,credit_amount,duration_months,duration_months
Unnamed: 0_level_1,mean,std,mean,max
age_group,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
<25,2707.298611,2398.726854,20.291667,72.0
25‚Äì40,3225.53286,2618.308932,21.280639,60.0
40‚Äì60,3357.48,2873.304345,20.288889,60.0
60+,3373.3125,3416.23284,19.166667,60.0


### Q3.17 Define your own function that computes the range of a numeric variable.
- Using `.groupby("age_group")["credit_amount"].agg([...])`, compute:

    - mean
    - median
    - your custom range function

>**Example**
>```python
>   # Custom function: range = max - min
>   def value_range(series: pd.Series):
>       # your code here
>```

In [214]:
# Custom function: range = max - min
def value_range(series: pd.Series):
    return series.max() - series.min()

In [216]:
df.groupby("age_group", observed=False)["credit_amount"].agg([
    "mean",
    "median",
    value_range
])

Unnamed: 0_level_0,mean,median,value_range
age_group,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
<25,2707.298611,2090.5,15396.0
25‚Äì40,3225.53286,2320.0,18081.0
40‚Äì60,3357.48,2320.0,15607.0
60+,3373.3125,2283.0,14325.0


In [None]:
# your tears here üòä

In [None]:
# your tears here üòä

In [None]:
# your tears here üòä