### Problem Statement

You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol levels of the patient.
- `glucose`: Glucose levels of the patient.
- `insulin`: Insulin levels of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of diet of the patient.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The amount of alcohol intake by the patient.
- `health_risk_score`: A composite score representing the overall health risk of a patient.

Your task is to use this dataset to build a linear regression model that predicts `health_risk_score` from the other columns. You will then apply L1 (Lasso) and L2 (Ridge) regularization and check whether either improves the model's R-squared on the test set.
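For reference, the two penalties can be written out as below. This follows the objective that scikit-learn documents for `Lasso` and `Ridge`, with $n$ training samples, coefficient vector $w$, and regularization strength $\alpha$; the intercept is left out for brevity.

```latex
\begin{aligned}
\text{Lasso (L1):}\quad & \min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_1 \\
\text{Ridge (L2):}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha\,\lVert w \rVert_2^2
\end{aligned}
```

Because the Lasso loss is divided by $2n$ and the Ridge loss is not, the same numeric `alpha` does not represent the same amount of shrinkage in the two models, so the alpha lists used later are not directly comparable across Lasso and Ridge.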

**Import Necessary Libraries**

In [13]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split

### Task 1: Data Preparation and Exploration

1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable `df`.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [8]:
# Step 1: Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv('patient_health_data.csv')

# Step 2: Display the number of rows and columns in the dataset
print(df.shape)

# Step 3: Display the first few rows of the dataset to get an overview
df.head(3)

(250, 12)


| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | No | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | No | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | Yes | 4.740542 | 187.487398 |


In [12]:
# Step 4: Check for any missing values in the dataset and handle them appropriately
# Every column reports 0 missing values, so no imputation or row removal is needed here
df.isna().sum()

age                  0
bmi                  0
blood_pressure       0
cholesterol          0
glucose              0
insulin              0
heart_rate           0
activity_level       0
diet_quality         0
smoking_status       0
alcohol_intake       0
health_risk_score    0
dtype: int64
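
This dataset has no gaps, so Step 4 ends with the check. Had some values been missing, one common way to handle them is sketched below: median fill for numeric columns and most-frequent fill for categorical ones. The `toy` frame and its values are made up for illustration and are not taken from `patient_health_data.csv`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps; the real dataset has none.
toy = pd.DataFrame({
    "bmi": [24.9, np.nan, 22.3, 30.1],
    "smoking_status": ["No", "Yes", None, "No"],
})

# Numeric columns: fill each gap with that column's median.
numeric_cols = toy.select_dtypes(include="number").columns
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Remaining (categorical) columns: fill with the most frequent value.
for col in toy.columns.difference(numeric_cols):
    toy[col] = toy[col].fillna(toy[col].mode().iloc[0])

print(toy.isna().sum().sum())
```

In a modeling workflow, the fill values would normally be computed from the training split only and then applied to the test split, so that no test-set information leaks into training.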

In [14]:
# Step 5: Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
df['smoking_status'] = df['smoking_status'].apply(lambda x: 1 if x == 'Yes' else 0)
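
Note that the lambda above sends every value other than `'Yes'` to 0, so a typo such as `'yes'`, or a second run of the cell on the already-encoded column, silently yields 0. A slightly stricter alternative, shown here on a made-up `status` series rather than the real column, is to use an explicit mapping and look at what falls outside it.

```python
import pandas as pd

# Hypothetical sample that includes one unexpected label ("yes").
status = pd.Series(["Yes", "No", "yes", "No"])

# Explicit mapping: only the two known labels are encoded.
encoded = status.map({"Yes": 1, "No": 0})

# Labels outside the mapping become NaN instead of silently turning into 0,
# so they can be listed and fixed before modeling.
unexpected = status[encoded.isna()].unique().tolist()
print(unexpected)
```

If `unexpected` is empty, the encoded column can be cast to `int` and used as is.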

### Task 2: Train Linear Regression Models

1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [15]:
# Step 1: Select the features and target variable for modeling
X = df.drop(columns=['health_risk_score'])
y = df['health_risk_score']

# Step 2: Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [17]:
# Step 3: Initialize and train a Linear Regression model, and evaluate its performance using R-squared
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)
lr_r2 = lr_model.score(X_test, y_test)  # .score() returns R-squared for regressors
print('Linear Regression R2 Score :', lr_r2)

Linear Regression R2 Score : 0.7643620906757488
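
The value returned by `.score()` is the coefficient of determination, $R^2 = 1 - SS_{res} / SS_{tot}$. The sketch below recomputes it by hand on synthetic data (the `X_demo` and `y_demo` arrays are invented for this illustration) and compares it with scikit-learn's own result.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression data, used only to illustrate the formula.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=100)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred = demo_model.predict(X_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y_demo - pred) ** 2)
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(r2_manual, demo_model.score(X_demo, y_demo)))
```

Read this way, the score of about 0.76 above means the linear model accounts for roughly 76% of the variance in `health_risk_score` on the test set.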


In [21]:
# Step 4: Initialize and train a Lasso Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
lasso_alphas = [0.01, 0.1, 1.0, 10]
for alpha in lasso_alphas:
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    lasso_r2 = lasso_model.score(X_test, y_test)
    print(f"lasso regression R2 score for alpha({alpha}) =", lasso_r2)

lasso regression R2 score for alpha(0.01) = 0.7645437646395714
lasso regression R2 score for alpha(0.1) = 0.7660509914802164
lasso regression R2 score for alpha(1.0) = 0.7819763683575137
lasso regression R2 score for alpha(10) = 0.7873364302158369
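
The L1 penalty can set coefficients exactly to zero, and it acts on raw coefficient sizes, so features measured on different scales (for example `cholesterol` versus `alcohol_intake`) are penalized unevenly unless they are standardized first. The following is a minimal sketch of both points on synthetic data; `X_demo`, `y_demo`, and the pipeline are assumptions made for this example, not part of the original notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of five features drive the target.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 5))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=200)

zero_counts = {}
for alpha in [0.01, 0.1, 1.0, 10.0]:
    # Standardize features so the penalty treats every column alike.
    pipe = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    pipe.fit(X_demo, y_demo)
    coefs = pipe.named_steps["lasso"].coef_
    zero_counts[alpha] = int(np.sum(coefs == 0))
    print(f"alpha={alpha}: {zero_counts[alpha]} of {coefs.size} coefficients are zero")
```

Printing `lasso_model.coef_` inside the loop above would show, in the same way, which patient features each alpha keeps or drops.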


In [22]:
# Step 5: Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
ridge_alphas = [0.01, 0.1, 1.0, 10]
for alpha in ridge_alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    ridge_r2 = ridge_model.score(X_test, y_test)
    print(f"Ridge regression R2 score for alpha({alpha}) =", ridge_r2)

Ridge regression R2 score for alpha(0.01) = 0.764363158939054
Ridge regression R2 score for alpha(0.1) = 0.7643727707489341
Ridge regression R2 score for alpha(1.0) = 0.7644686367656159
Ridge regression R2 score for alpha(10) = 0.7654030812954533
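
Picking the alpha with the highest test-set score, as the loops above invite, tends to give an optimistic estimate because the test set has then been used for tuning. A common alternative is to choose alpha by cross-validation on the training split and score on the test split only once. The sketch below does this with `LassoCV` and `RidgeCV` on synthetic data; the arrays and `true_coef` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in with the same shape as the feature matrix (250 x 11).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 11))
true_coef = np.array([5.0, 3.0, 0, 0, 2.0, 0, 0, 0, 1.0, 0, 0])
y_demo = X_demo @ true_coef + rng.normal(scale=2.0, size=250)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42
)
alphas = [0.01, 0.1, 1.0, 10.0]

# 5-fold cross-validation on the training split selects alpha.
lasso_cv = LassoCV(alphas=alphas, cv=5).fit(X_tr, y_tr)
ridge_cv = RidgeCV(alphas=alphas, cv=5).fit(X_tr, y_tr)

# The test split is touched only once, for the final score.
print("Lasso alpha:", lasso_cv.alpha_, "test R2:", lasso_cv.score(X_te, y_te))
print("Ridge alpha:", ridge_cv.alpha_, "test R2:", ridge_cv.score(X_te, y_te))
```

The same pattern applies to `X_train` and `y_train` from Task 2, with `X_test` held back for the final evaluation.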
