### Problem Statement

We are given a dataset named **`"patient_health_data.csv"`**, which contains records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age`: The age of the patient.
- `bmi`: Body Mass Index of the patient.
- `blood_pressure`: The blood pressure of the patient.
- `cholesterol`: Cholesterol levels of the patient.
- `glucose`: Glucose levels of the patient.
- `insulin`: Insulin levels of the patient.
- `heart_rate`: Heart rate of the patient.
- `activity_level`: Activity level of the patient.
- `diet_quality`: Quality of diet of the patient.
- `smoking_status`: Whether the patient smokes (Yes or No).
- `alcohol_intake`: The amount of alcohol intake by the patient.
- `health_risk_score`: A composite score representing the overall health risk of a patient.

The goal is to build a linear regression model that predicts the health risk score from the given predictor variables, and then to apply L1 (Lasso) and L2 (Ridge) regularization to try to improve the model's performance.
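
For reference, the three models differ only in the penalty added to the squared-error loss. A sketch of the objectives as documented for scikit-learn's `LinearRegression`, `Lasso`, and `Ridge`, with the intercept omitted for brevity. The symbols are notation chosen here, not taken from the notebook: $X$ is the feature matrix, $y$ the target vector, $w$ the coefficient vector, $n$ the number of training samples, and $\alpha \ge 0$ the regularization strength.

$$
\begin{aligned}
\text{Linear regression:}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 \\
\text{Lasso (L1):}\quad & \min_{w}\; \frac{1}{2n}\lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1 \\
\text{Ridge (L2):}\quad & \min_{w}\; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2
\end{aligned}
$$

Because the Lasso loss is scaled by $1/(2n)$ and the Ridge loss is not, the same numeric value of $\alpha$ does not represent the same amount of shrinkage in the two models.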

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("patient_health_data.csv")

print(df.shape)

df.head()

(250, 12)


| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | No | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | No | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | Yes | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | No | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | No | 3.944011 | 170.609655 |


In [3]:
df.isna().sum()

age                  0
bmi                  0
blood_pressure       0
cholesterol          0
glucose              0
insulin              0
heart_rate           0
activity_level       0
diet_quality         0
smoking_status       0
alcohol_intake       0
health_risk_score    0
dtype: int64

In [4]:
df.smoking_status = df.smoking_status.apply(lambda x : 1 if x == 'Yes' else 0)
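
The lambda above silently encodes every value other than `'Yes'` as 0, including typos or differently cased labels. As an alternative, here is a minimal sketch using an explicit mapping, which leaves unrecognized labels as `NaN` so they can be inspected. The `toy` frame and its values are hypothetical and stand in for the real data.

```python
import pandas as pd

# Hypothetical toy frame; "yes" (lowercase) simulates an unexpected label.
toy = pd.DataFrame({"smoking_status": ["No", "Yes", "yes", "No"]})

# Explicit mapping: labels missing from the dict become NaN instead of 0.
encoded = toy["smoking_status"].map({"Yes": 1, "No": 0})

# Collect the original labels that the mapping did not recognize.
unexpected = toy.loc[encoded.isna(), "smoking_status"]

print(encoded.tolist())
print(unexpected.tolist())
```

On the patient dataset, the same pattern would be applied to `df["smoking_status"]`; if `unexpected` is empty, both approaches give the same encoding.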

In [5]:
df.head()

| | age | bmi | blood_pressure | cholesterol | glucose | insulin | heart_rate | activity_level | diet_quality | smoking_status | alcohol_intake | health_risk_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 58 | 24.865215 | 122.347094 | 165.730375 | 149.289441 | 22.306844 | 75.866391 | 1.180237 | 7.675409 | 0 | 0.824123 | 150.547752 |
| 1 | 71 | 19.103168 | 136.852028 | 260.610781 | 158.584646 | 13.869817 | 69.481114 | 7.634622 | 8.933057 | 0 | 0.85291 | 160.32035 |
| 2 | 48 | 22.316562 | 137.592457 | 177.342582 | 178.760166 | 22.849816 | 69.386962 | 7.917398 | 3.501119 | 1 | 4.740542 | 187.487398 |
| 3 | 34 | 22.196893 | 153.164775 | 234.594764 | 136.351714 | 15.140336 | 95.348387 | 3.19291 | 2.745585 | 0 | 2.226231 | 148.773138 |
| 4 | 62 | 29.837173 | 92.768973 | 276.106498 | 158.753516 | 17.228576 | 77.680975 | 7.044026 | 8.918348 | 0 | 3.944011 | 170.609655 |


### Training Linear, Ridge, and Lasso Regression Models


In [13]:
X = df.drop('health_risk_score',axis=1)
y = df['health_risk_score']

from sklearn.model_selection import train_test_split

X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.25,random_state=42)

In [14]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train,y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [8]:
y_pred = model.predict(X_test)

In [9]:
from sklearn.metrics import r2_score

# r2_score expects the true values first: r2_score(y_true, y_pred)
r2 = r2_score(y_test,y_pred)
r2

0.7643620906757489

In [10]:
model.score(X_test,y_test)

0.7643620906757489
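
`r2_score` is not symmetric in its arguments: its signature is `r2_score(y_true, y_pred)`, and `model.score(X, y)` computes the same quantity internally. A minimal sketch on synthetic data (not the patient dataset; all names below are illustrative) showing that the correct argument order matches `.score()` while the swapped order gives a different number:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic regression problem with noticeable noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=200)

demo_model = LinearRegression().fit(X_demo, y_demo)
pred_demo = demo_model.predict(X_demo)

r2_correct = r2_score(y_demo, pred_demo)      # (y_true, y_pred)
r2_swapped = r2_score(pred_demo, y_demo)      # arguments reversed
r2_method = demo_model.score(X_demo, y_demo)  # same as r2_correct

print(r2_correct, r2_swapped, r2_method)
```

With the arguments swapped, the denominator becomes the variance of the predictions rather than of the true values, which is why the two results differ.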

In [11]:
from sklearn.linear_model import Lasso
alpha_values = [0.01,0.1,1.0,10.0]

for alpha in alpha_values:
    model_lasso = Lasso(alpha=alpha)
    model_lasso.fit(X_train,y_train)

    # .score() returns R^2 on the test set, i.e. r2_score(y_test, predictions)
    score = model_lasso.score(X_test,y_test)
    print(f"at alpha {alpha}  score ==> {score} ")

at alpha 0.01  score ==> 0.7645437646395714 
at alpha 0.1  score ==> 0.766050991480216 
at alpha 1.0  score ==> 0.7819763683575135 
at alpha 10.0  score ==> 0.7873364302158369 
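
One way to see what a larger alpha does to a Lasso model is to count how many coefficients remain non-zero. The sketch below uses synthetic data from `make_regression` (not the patient dataset), and the alpha of 1000.0 is added only to show the extreme case; on the notebook's own models the same check would read `model_lasso.coef_`.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic data: 10 features, of which only 3 carry signal.
X_syn, y_syn = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

nonzero_counts = {}
for alpha in [0.01, 1.0, 10.0, 1000.0]:
    lasso = Lasso(alpha=alpha, max_iter=10_000).fit(X_syn, y_syn)
    # The L1 penalty drives some coefficients to exactly zero.
    nonzero_counts[alpha] = int(np.sum(lasso.coef_ != 0))

print(nonzero_counts)
```

A sufficiently large alpha zeroes every coefficient, leaving a model that predicts only the intercept.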


In [12]:
from sklearn.linear_model import Ridge

alpha_values = [0.01,0.1,1.0,10.0]

for alpha in alpha_values:
    model_ridge = Ridge(alpha=alpha)
    model_ridge.fit(X_train,y_train)

    # .score() returns R^2 on the test set, i.e. r2_score(y_test, predictions)
    score = model_ridge.score(X_test,y_test)
    print(f"at alpha {alpha} score ==> {score}")

at alpha 0.01 score ==> 0.7643631589390542
at alpha 0.1 score ==> 0.7643727707489341
at alpha 1.0 score ==> 0.7644686367656158
at alpha 10.0 score ==> 0.7654030812954534


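* Insights
  - Lasso regression gave the best test R² score of the three models.
  - Lasso with alpha 1.0 and 10.0 both reached a test R² of about 0.78 (0.782 and 0.787), compared with about 0.764 for plain linear regression and at most 0.765 for Ridge.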
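
The alpha values above were compared directly on the test set. A common alternative is to choose alpha by cross-validation on the training data only, and to standardize the features first so the penalty treats them on a comparable scale. A minimal sketch of that approach, assuming synthetic data from `make_regression` in place of the patient dataset and reusing the same four candidate alphas:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's feature matrix.
X_syn, y_syn = make_regression(
    n_samples=250, n_features=11, n_informative=5, noise=10.0, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.25, random_state=42
)

# Scale features, then pick alpha by 5-fold cross-validation on the training set.
pipe = make_pipeline(
    StandardScaler(),
    LassoCV(alphas=[0.01, 0.1, 1.0, 10.0], cv=5),
)
pipe.fit(X_tr, y_tr)

best_alpha = pipe.named_steps["lassocv"].alpha_
test_r2 = pipe.score(X_te, y_te)  # R^2 on data not used for choosing alpha
print(best_alpha, test_r2)
```

With this setup the test set is touched only once, so the reported R² is not influenced by the choice of alpha.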