### Problem Statement

You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol levels of the patient.
- `glucose`: Glucose levels of the patient.
- `insulin`: Insulin levels of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of diet of the patient.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The amount of alcohol intake by the patient.
- `health_risk_score`: A composite score representing the overall health risk of a patient.

Your task is to use this dataset to build a linear regression model to predict the health risk score based on the given predictor variables. Additionally, you will use L1 (Lasso) and L2 (Ridge) regularization techniques to improve the model's performance. 
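
For reference, the objectives minimized by the three estimators can be written as follows. This uses the parameterization documented for scikit-learn's `LinearRegression`, `Lasso`, and `Ridge`; the symbols are introduced here for illustration: $n$ is the number of training rows, $X$ the feature matrix, $y$ the target vector, $w$ the coefficient vector, and $\alpha$ the regularization strength (the intercept is omitted for brevity).

```latex
\begin{aligned}
\text{OLS:}   &\quad \min_{w} \; \lVert y - Xw \rVert_2^2 \\
\text{Lasso:} &\quad \min_{w} \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge:} &\quad \min_{w} \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
```

Because the Lasso loss is scaled by $1/(2n)$ and the Ridge loss is not, the same numeric $\alpha$ does not represent the same penalty strength in the two models.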

**Import Necessary Libraries**

In [11]:
# Import necessary libraries
import pandas as pd 

### Task 1: Data Preparation and Exploration

1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [12]:
# Step 1: Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv("patient_health_data.csv")


# Step 2: Display the number of rows and columns in the dataset
print(f"Shape: {df.shape[0]} rows × {df.shape[1]} cols")


# Step 3: Display the first few rows of the dataset to get an overview
print("\nFirst 5 rows:")
display(df.head())


Shape: 250 rows × 12 cols

First 5 rows:


| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | No | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | No | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | Yes | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | No | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | No | 3.944011 | 170.609655 |


In [13]:
# Step 4: Check for any missing values in the dataset and handle them appropriately
print("\nMissing values per column:")
print(df.isnull().sum())
# No imputation is applied in this notebook: the models in Task 2 fit without error,
# and scikit-learn's linear models raise on NaN input.
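
If the check in Step 4 does report missing values, one simple way to handle them is sketched below. This is a minimal illustration on a made-up toy frame (the `toy` DataFrame and its values are assumptions, not rows from `patient_health_data.csv`); it fills numeric columns with the median and the categorical column with the most frequent label.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the patient data; the values are made up.
toy = pd.DataFrame({
    "age": [58, 71, np.nan, 34],
    "bmi": [24.9, np.nan, 22.3, 22.2],
    "smoking_status": ["No", "No", "Yes", None],
})

# Count missing values per column.
missing_counts = toy.isnull().sum()
print(missing_counts)

# Numeric columns: fill each with its own column median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Categorical column: fill with the most frequent label.
most_common = toy["smoking_status"].mode()[0]
toy["smoking_status"] = toy["smoking_status"].fillna(most_common)

# Total number of missing values remaining after filling.
remaining = int(toy.isnull().sum().sum())
print(remaining)
```

In a modeling workflow it is usually safer to compute the fill values on the training split only and reuse them on the test split, so that no test information leaks into training.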


In [14]:
# Step 5: Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
df['smoking_status'] = df['smoking_status'].map({'Yes': 1, 'No': 0})
print("\nAfter encoding 'smoking_status':")
display(df.head())


After encoding 'smoking_status':


| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | 0 | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | 0 | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | 1 | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | 0 | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | 0 | 3.944011 | 170.609655 |
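
One caveat with `Series.map` and a dictionary: any label that is not a key in the dictionary becomes `NaN` without raising an error. The sketch below uses a hypothetical series (the `status` values, including the inconsistently typed `"yes "`, are invented for illustration) to show the effect and one way to guard against it.

```python
import pandas as pd

# Hypothetical labels, including inconsistent casing and trailing whitespace.
status = pd.Series(["Yes", "No", "yes ", "No"])

# A plain map leaves unmatched labels as NaN.
naive = status.map({"Yes": 1, "No": 0})
unmatched = int(naive.isnull().sum())
print(unmatched)

# Normalizing the text first makes the mapping robust to casing and whitespace.
cleaned = status.str.strip().str.lower().map({"yes": 1, "no": 0})
print(cleaned.tolist())
```

Checking `isnull().sum()` on the encoded column right after mapping is a cheap way to confirm that every label was recognized.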


### Task 2: Train Linear Regression Models

1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [15]:
# Step 1: Select the features and target variable for modeling
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.metrics import r2_score

# Select features (X) and target (y)
X = df.drop(['health_risk_score'], axis=1)
y = df['health_risk_score']

# Step 2: Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)


In [16]:
# Step 3: Initialize and train a Linear Regression model, and evaluate its performance using R-squared
lr = LinearRegression()
lr.fit(X_train, y_train)
print("Linear Regression R²:", r2_score(y_test, lr.predict(X_test)))

Linear Regression R²: 0.7643620906757491
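
For reference, the R² reported by `r2_score` is the coefficient of determination computed on the test set, where $y_i$ are the true test targets, $\hat{y}_i$ the model's predictions, and $\bar{y}$ the mean of the true test targets:

```latex
R^2 = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^2}{\sum_{i} (y_i - \bar{y})^2}
```

An R² of about 0.764 therefore means the model's squared error on the test set is roughly 23.6% of the squared error of always predicting the mean.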


In [19]:
# Step 4: Initialize and train a Lasso Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
alphas = [0.01, 0.1, 1.0, 10.0]   # Different alpha values to test
for a in alphas:
    lasso = Lasso(alpha=a, max_iter=10000)   # Create Lasso model
    lasso.fit(X_train, y_train)              # Train the model
    print(f"Lasso (alpha={a}) R²:", r2_score(y_test, lasso.predict(X_test)))

Lasso (alpha=0.01) R²: 0.7645437646395714
Lasso (alpha=0.1) R²: 0.766050991480216
Lasso (alpha=1.0) R²: 0.7819763683575135
Lasso (alpha=10.0) R²: 0.7873364302158369
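
R² alone does not show whether Lasso actually removed any features. One way to check is to count the non-zero entries of `coef_` for each alpha. The sketch below does this on synthetic data from `make_regression` (the `X_demo`/`y_demo` names and the choice of 4 informative features are assumptions for illustration, not properties of the patient dataset).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic stand-in: 11 features, only 4 of which carry signal.
X_demo, y_demo = make_regression(
    n_samples=250, n_features=11, n_informative=4, noise=10.0, random_state=42
)

nonzero_counts = {}
for a in [0.01, 0.1, 1.0, 10.0]:
    model = Lasso(alpha=a, max_iter=10000).fit(X_demo, y_demo)
    # Coefficients that Lasso has set exactly to zero are excluded from the count.
    nonzero_counts[a] = int(np.sum(model.coef_ != 0))
    print(f"alpha={a}: {nonzero_counts[a]} non-zero coefficients")
```

On the notebook's own model, the same check would be `pd.Series(lasso.coef_, index=X_train.columns)` inside the loop, which also shows which columns were dropped.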


In [20]:
# Step 5: Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
for a in alphas:
    ridge = Ridge(alpha=a)
    ridge.fit(X_train, y_train)
    print(f"Ridge (alpha={a}) R²:", r2_score(y_test, ridge.predict(X_test)))

Ridge (alpha=0.01) R²: 0.764363158939054
Ridge (alpha=0.1) R²: 0.7643727707489341
Ridge (alpha=1.0) R²: 0.7644686367656158
Ridge (alpha=10.0) R²: 0.7654030812954533


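“We compared Linear Regression, Lasso, and Ridge on the same 75/25 train/test split.
Linear Regression gave a test R² of ~0.764.
With Lasso, test R² rose steadily with α, reaching ~0.787 at α = 10.0, the largest value tried. Lasso can force the coefficients of unimportant features to exactly zero, which reduces variance and can improve generalization. That is a plausible explanation for the gain, though confirming it means inspecting the fitted coefficients. Because the best score sits at the edge of the grid, larger α values may do better still.
Ridge, on the other hand, barely changed the R² (~0.764 to ~0.765). Ridge only shrinks coefficients and does not remove features, so when the dataset lacks strong multicollinearity or many irrelevant features, it makes little difference.
This shows why α tuning and the choice of L1 vs L2 matter depending on the dataset. Since α was compared using test-set R² on a single split, the differences should be confirmed with cross-validation.”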
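
The loops above pick alpha by looking at test-set R², which makes the reported score optimistic. A minimal sketch of the usual alternative follows: choose alpha by cross-validation on the training split with `LassoCV` and `RidgeCV`, then score once on the held-out test split. The data here is synthetic (`X_demo`/`y_demo` are assumptions for illustration); in the notebook, `X_train`, `X_test`, `y_train`, and `y_test` would take their place.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient data.
X_demo, y_demo = make_regression(
    n_samples=250, n_features=11, n_informative=4, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)

alphas = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation on the training split selects alpha for each model.
lasso_cv = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X_tr, y_tr)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)

best_lasso_alpha = float(lasso_cv.alpha_)
best_ridge_alpha = float(ridge_cv.alpha_)

# The test split is used only once, after alpha has been chosen.
print("Lasso alpha:", best_lasso_alpha, "test R²:", lasso_cv.score(X_te, y_te))
print("Ridge alpha:", best_ridge_alpha, "test R²:", ridge_cv.score(X_te, y_te))
```

If the selected alpha lands on the largest or smallest value in the list, widening the grid in that direction is a reasonable next step.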