# Regression Part A: Regularized Linear Regression

### Target Variable
- **sbp** (systolic blood pressure): A continuous variable representing the maximum arterial pressure during heart contraction. This is our regression target.

### Features
We use the following features to predict sbp:
- **ldl**: Low-density lipoprotein cholesterol
- **adiposity**: Adiposity measure
- **obesity**: Obesity measure
- **typea**: Type A behavior score
- **age**: Age of the patient
- **tobacco**: Cumulative tobacco consumption (log-transformed)
- **alcohol**: Current alcohol consumption (log-transformed)
- **famhist**: Family history of coronary heart disease (binary: Present=1, Absent=0)
- **chd**: Coronary heart disease (binary indicator)

### Feature Transformations

To prepare the data for regularized linear regression, we apply the following transformations:

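1. **Standardization**: Each column is standardized to have mean = 0 and standard deviation = 1. This is essential for regularization: the penalty acts on the size of the coefficients, and coefficient size depends on the scale of each feature, so standardizing ensures that the penalty affects all features equally regardless of their original units. In the code below, the target `sbp` and the binary `chd` are standardized together with the other columns, while `famhist` is kept as 0/1.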

2. **Log transformation**: The `tobacco` and `alcohol` features are log-transformed using `log1p` (log(1+x)), which is defined at x = 0, to reduce the skew of their distributions. In the code below, this is applied before standardization.

3. **Binary encoding**: The categorical feature `famhist` is encoded as binary (Present=1, Absent=0).
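
To illustrate why standardization matters for a penalized fit, here is a minimal sketch on synthetic data. It assumes an L2 (ridge) penalty, which this section does not name explicitly, and the helpers `ridge_fit` and `standardize`, the value `lam=50.0`, and the generated data are all hypothetical and not part of the analysis.

```python
import numpy as np


def standardize(X):
    """Scale each column to mean 0 and sample standard deviation 1."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)


def ridge_fit(X, y, lam):
    """Closed-form ridge coefficients on centered data (intercept not penalized)."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    return np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ yc)


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)

# Same information, but the second feature is recorded in different units,
# so its raw values are 100 times smaller.
unit_change = np.array([1.0, 0.01])
X_rescaled = X * unit_change

w_raw = ridge_fit(X, y, lam=50.0)
w_rescaled = ridge_fit(X_rescaled, y, lam=50.0)

# Express the rescaled fit in the original units to compare the two fits.
effect_rescaled = w_rescaled * unit_change
print("effect per original unit, raw fit:     ", w_raw)
print("effect per original unit, rescaled fit:", effect_rescaled)

# After standardization the choice of units no longer matters.
w_std_a = ridge_fit(standardize(X), y, lam=50.0)
w_std_b = ridge_fit(standardize(X_rescaled), y, lam=50.0)
print("fits agree after standardization:", np.allclose(w_std_a, w_std_b))
```

Without standardization, the feature with the smaller raw scale needs a larger coefficient to have the same effect, so the penalty shrinks it much harder. Once both versions are standardized, the two fits coincide.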

### Goal
The goal is to predict systolic blood pressure to assess cardiovascular risk. By understanding which factors contribute most to elevated blood pressure, we can better identify patients at risk and inform preventive interventions.

In [1]:
# load and transform the data
import pandas as pd
df_heart = pd.read_csv(
    "https://hastie.su.domains/ElemStatLearn/datasets/SAheart.data",
    sep=",",
    header=0,
    index_col=0,
)
columns_ordered = [
    "sbp",
    "ldl",
    "adiposity",
    "obesity",
    "typea",
    "age",
    "tobacco",
    "alcohol",
    "famhist",
    "chd",
]
missing = [c for c in columns_ordered if c not in df_heart.columns]
if missing:
    raise KeyError(f"Missing columns in df_heart: {missing}")

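# take an explicit copy so later column assignments do not act on a view of the original frame
df_heart = df_heart[columns_ordered].copy()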

In [2]:
import numpy as np

# encode the categorical attribute famhist as binary (Present=1, Absent=0)
df_heart['famhist'] = df_heart['famhist'].map({'Present': 1, 'Absent': 0})

# log transform skewed columns as described in part 1:
columns_to_log_transform = ['tobacco','alcohol']
for column in columns_to_log_transform:
    df_heart[column] = np.log1p(df_heart[column])

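# standardize each column to have mean = 0 and standard deviation = 1
# note: pandas .std() uses the sample standard deviation (ddof=1), so manual scaling returns
# very slightly different values than a scaler that divides by the population standard
# deviation (ddof=0), e.g. scikit-learn's StandardScaler
df_heart_standarized = (df_heart - df_heart.mean()) / df_heart.std()
# keep famhist in its 0/1 encoding instead of the standardized values
df_heart_standarized["famhist"] = df_heart["famhist"]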

df_heart_standarized.head()

| row.names | sbp | ldl | adiposity | obesity | typea | age | tobacco | alcohol | famhist | chd |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.057417 | 0.477894 | -0.295183 | -0.176594 | -0.418017 | 0.628654 | 1.576878 | 1.772852 | 1 | 1.372375 |
| 2 | 0.276789 | -0.159507 | 0.411694 | 0.670646 | 0.193134 | 1.381617 | -1.193044 | -0.563821 | 0 | 1.372375 |
| 3 | -0.991731 | -0.608585 | 0.883374 | 0.734723 | -0.112441 | 0.217947 | -1.120397 | -0.259134 | 1 | -0.727086 |
| 4 | 1.54531 | 0.806252 | 1.622382 | 1.411091 | -0.2143 | 1.039361 | 1.116254 | 0.858159 | 1 | 1.372375 |
| 5 | -0.211103 | -0.598928 | 0.30502 | -0.012842 | 0.702427 | 0.423301 | 1.702714 | 1.422062 | 1 | 1.372375 |
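
As a sanity check on the preprocessing steps, the following sketch applies the same three transformations to a small made-up frame and verifies the result. The `toy` data and the `continuous` column list are illustrative assumptions, not values from the SAheart dataset, and unlike the cell above it standardizes only the listed continuous columns.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen only to exercise the transformations.
toy = pd.DataFrame(
    {
        "tobacco": [0.0, 1.5, 12.0, 0.3, 4.1],
        "age": [25, 40, 58, 33, 61],
        "famhist": ["Present", "Absent", "Absent", "Present", "Absent"],
    }
)

# Binary encoding of the categorical column.
toy["famhist"] = toy["famhist"].map({"Present": 1, "Absent": 0})

# log(1 + x) is defined at x = 0, so zero consumption maps to 0.
toy["tobacco"] = np.log1p(toy["tobacco"])

# Standardize only the continuous columns; famhist stays 0/1.
continuous = ["tobacco", "age"]
toy_std = toy.copy()
toy_std[continuous] = (toy[continuous] - toy[continuous].mean()) / toy[continuous].std()

print(toy_std[continuous].mean())  # approximately 0 for each column
print(toy_std[continuous].std())   # 1 for each column (sample std, ddof=1)

# pandas uses ddof=1 by default; a population std (ddof=0) is smaller
# by a factor of sqrt(n / (n - 1)), which explains small differences
# between manual scaling and scalers that use ddof=0.
ratio = toy[continuous].std(ddof=1) / toy[continuous].std(ddof=0)
print(ratio)
```

Running the same mean and standard deviation checks on `df_heart_standarized` should give the analogous confirmation for the real data.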
