### <div align="center">Supervised Machine Learning - Regression</div>

#### 3.1: Simple Linear Regression
- Simple Linear Regression (one independent variable): y = mx + b, where m is the slope (coefficient), b is the intercept (the point where the line crosses the y-axis), x is the independent variable, and y is the dependent variable.
- Linear regression models the relationship between the dependent and independent variables.
- Gradient Descent is one of the most important concepts in Supervised Machine Learning.

In [1]:
# Predict home price from area in square feet
import pandas as pd

df = pd.read_csv("../../data/home_prices.csv")
df.head()

Unnamed: 0,area_sqr_ft,price_lakhs
0,656,39.0
1,1260,83.2
2,1057,86.6
3,1259,59.0
4,1800,140.0


In [4]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[["area_sqr_ft"]], df["price_lakhs"])

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


- We pass df[['area_sqr_ft']] rather than df['area_sqr_ft'] to model.fit because Scikit-learn models expect the input features (X) to be a 2D array or DataFrame of shape (n_samples, n_features), even when there is only a single feature.
- The target (y) can be a 1D array or Series of shape (n_samples,), because the model predicts a single value for each sample.
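A quick way to see the difference is to compare the shapes directly. The sketch below uses a small made-up DataFrame (the variable name `demo` and its three rows are illustrative, not the full `home_prices.csv`):

```python
import pandas as pd

# Hypothetical three-row frame, for illustration only
demo = pd.DataFrame({"area_sqr_ft": [656, 1260, 1057], "price_lakhs": [39.0, 83.2, 86.6]})

X_2d = demo[["area_sqr_ft"]]   # double brackets -> DataFrame, shape (n_samples, n_features)
x_1d = demo["area_sqr_ft"]     # single brackets -> Series, shape (n_samples,)
y = demo["price_lakhs"]        # target as a 1D Series

print(X_2d.shape)  # (3, 1)
print(x_1d.shape)  # (3,)

# A 1D array can also be turned into a single-feature 2D array with reshape
X_from_1d = x_1d.to_numpy().reshape(-1, 1)
print(X_from_1d.shape)  # (3, 1)
```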

In [5]:
model.predict([[1500],[700]])



array([95.56913434, 54.22372995])

The correct usage is model.predict([[1500], [700]]), with one inner list per sample, because Scikit-learn's predict method expects a 2D array-like input of shape (n_samples, n_features).

In [6]:
model.coef_, model.intercept_

(array([0.05168176]), 18.046501102723433)

In [7]:
model.coef_[0]*1500 + model.intercept_

95.56913433756426

In [8]:
model.coef_[0]*700 + model.intercept_

54.223729945649154

#### 3.2: Multiple Linear Regression
- When there is more than one independent variable, the model is called Multiple Linear Regression.
- Multiple Linear Regression (multiple independent variables): y = m1⋅x1 + m2⋅x2 + … + b ; m1, m2: coefficients, x1, x2: independent variables, b: intercept
- The slope is also called the gradient.

In [9]:
df_multi = pd.read_csv("../../data/home_prices_multi.csv")
df_multi.head()

Unnamed: 0,area_sqr_ft,price_lakhs,bedrooms
0,656,39.0,2
1,1260,83.2,2
2,1057,86.6,3
3,1259,59.0,2
4,1800,140.0,3


In [12]:
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df_multi[["area_sqr_ft","bedrooms"]], df_multi["price_lakhs"])

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


In [13]:
test = pd.DataFrame([
    {'area_sqr_ft': 1500, "bedrooms": 3},
    {'area_sqr_ft': 2000, "bedrooms": 2}
])

In [14]:
model.predict(test)

array([106.82967328,  89.13769907])

In [15]:
model.coef_, model.intercept_

(array([2.66959723e-02, 3.10399604e+01]), -26.334166207939006)

In [16]:
# y = m1*area + m2*bedrooms + b
predicted_price = model.coef_[0]*1500 + model.coef_[1]*3 + model.intercept_
predicted_price

106.8296732785733
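The same formula can be applied to several homes at once as a matrix-vector product. This is a minimal sketch that hard-codes the coefficients and intercept printed above; the names `coef`, `intercept`, and `homes` are illustrative:

```python
import numpy as np

# Coefficients and intercept copied from the fitted model output above
coef = np.array([2.66959723e-02, 3.10399604e+01])   # [m1 (area), m2 (bedrooms)]
intercept = -26.334166207939006                      # b

# Each row is one home: [area_sqr_ft, bedrooms]
homes = np.array([[1500, 3],
                  [2000, 2]])

# y = m1*x1 + m2*x2 + b for every row at once
predictions = homes @ coef + intercept
print(predictions)
```

The two values should agree with the `model.predict(test)` output above up to the rounding of the printed coefficients.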

#### 3.5: Cost Function
- The best-fit line is the line whose slope m and intercept b minimize the prediction error.
- Understanding the cost function is essential for understanding Gradient Descent.
- Line, slope, and intercept are the basic terms to know.
- Error/Loss: The difference between the predicted and actual y value.
- Mean Absolute Error (MAE): The average of the absolute errors, disregarding their direction, i.e. the average absolute difference between prediction and actual observation.
- Mean Squared Error (MSE): The average of the squared differences between prediction and actual observation. Squaring gives larger errors more weight.
  - MSE = (1/n) Σ(yᵢ - ŷᵢ)²
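As a small worked example of the two metrics, the sketch below computes MAE and MSE with NumPy. The actual and predicted values are made up for illustration:

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 35.0])

error = y_true - y_pred            # [-2, 2, -3, 5]
mae = np.mean(np.abs(error))       # (2 + 2 + 3 + 5) / 4
mse = np.mean(error ** 2)          # (4 + 4 + 9 + 25) / 4

print(mae)  # 3.0
print(mse)  # 10.5
```

The single error of 5 contributes 25 of the 42 squared-error units, which is why MSE reacts more strongly to large errors than MAE.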

#### 3.6: Derivatives and Partial Derivatives
- The slope tells us how much y changes for a given change in x.
- The derivative of a function at a given point is the slope of its tangent line at that point.
- Power rule: d/dx (x^n) = n * x^(n−1)
- Slope is the term used for linear equations, whereas derivative is the term used for non-linear equations.
- The slope of a line is constant, whereas the derivative of a non-linear function changes from point to point and is therefore itself a function of x.
- The purpose of a Partial Derivative is to measure how a function changes as one of its variables is varied while keeping the other variables constant.
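These rules can be checked numerically. The sketch below approximates derivatives with a central difference; the helper `numerical_derivative` and the example functions are illustrative choices, not part of any library:

```python
def numerical_derivative(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

# Power rule: d/dx (x^3) = 3 * x^2, so the slope at x = 2 should be 12
cube = lambda x: x ** 3
slope_at_2 = numerical_derivative(cube, 2.0)
print(round(slope_at_2, 4))  # 12.0

# Partial derivatives of g(x, y) = x^2 * y + y^3 at the point (1, 2)
def g(x, y):
    return x ** 2 * y + y ** 3

dg_dx = numerical_derivative(lambda x: g(x, 2.0), 1.0)  # y held constant; exact value 2*x*y = 4
dg_dy = numerical_derivative(lambda y: g(1.0, y), 2.0)  # x held constant; exact value x^2 + 3*y^2 = 13
print(round(dg_dx, 4), round(dg_dy, 4))  # 4.0 13.0
```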

#### 3.7: Chain Rule
- Chain rule: A technique used to compute the derivative of a function composed of multiple functions.
- Application: Chain rule will be used in the Gradient Descent Technique.
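To make the link to Gradient Descent concrete, the sketch below applies the chain rule to the squared error of a single data point, treated as an outer function (squaring) wrapped around an inner function (the residual). The data point and slope are hypothetical values:

```python
# Squared error for a single point as a composition:
#   inner: u(m) = y - m * x      outer: L(u) = u ** 2
x, y = 3.0, 10.0   # hypothetical data point
m = 2.0            # hypothetical current slope

u = y - m * x            # inner value: 10 - 6 = 4
dL_du = 2 * u            # derivative of the outer function: 8
du_dm = -x               # derivative of the inner function: -3
dL_dm = dL_du * du_dm    # chain rule: 8 * (-3) = -24

# Numerical check with a central difference
h = 1e-6
loss = lambda m_: (y - m_ * x) ** 2
numeric = (loss(m + h) - loss(m - h)) / (2 * h)

print(dL_dm)              # -24.0
print(round(numeric, 4))  # -24.0
```

Averaging this per-point gradient over all data points gives the gradient of the MSE with respect to the slope.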

#### 3.10: Gradient Descent Theory
- Gradient Descent: An optimization method used in linear regression to find the best-fit line by iteratively adjusting the slope (m) and intercept (b) to minimize the cost function, usually the mean squared error (MSE).
- Efficiency: Since testing every combination for MSE is impractical, gradient descent efficiently minimizes MSE with fewer iterative adjustments.
- Mean Squared Error (MSE): MSE = (1/n) Σ(yᵢ − (m·xᵢ + b))²
- The learning rate is the size of the step that gradient descent takes on each iteration.
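The effect of the learning rate is easiest to see on a one-parameter toy problem. The sketch below minimizes f(w) = w² rather than a regression cost; the function `minimize_quadratic` and its starting point are assumptions made for illustration:

```python
def minimize_quadratic(lr, steps=20, w=5.0):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w and minimum is at w = 0."""
    for _ in range(steps):
        grad = 2 * w
        w -= lr * grad
    return w

small_lr = minimize_quadratic(lr=0.1)   # each step multiplies w by 0.8, so w shrinks toward 0
large_lr = minimize_quadratic(lr=1.1)   # each step multiplies w by -1.2, so |w| grows

print(abs(small_lr) < 0.1)   # True
print(abs(large_lr) > 5.0)   # True
```

A step that is too large overshoots the minimum on every iteration and diverges, while a very small step converges but needs more iterations.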

#### 3.11: Gradient Descent Implementation
- In a role as a Data Scientist or AI Engineer, your daily tasks will not typically involve implementing Gradient Descent. Instead, you will use ML libraries. However, a solid understanding of this concept is beneficial for both your work and potential interviews.
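- Scikit-learn's LinearRegression.fit() does not run gradient descent; it solves the least-squares problem directly. Gradient-descent-based estimators such as SGDRegressor train iteratively instead. In both cases the goal is the same: to arrive at the coefficients and the intercept.
- Adjusting the learning rate and the number of epochs based on the observed outputs (the parameters and the loss) will help you reach the desired result in the Gradient Descent implementation.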
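For a single feature, the least-squares slope and intercept also have a closed-form solution, which gives a handy reference value when tuning the learning rate and epochs of a gradient descent run. This is a minimal sketch on made-up points that lie exactly on y = 2x + 1:

```python
import numpy as np

# Hypothetical points lying exactly on y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Closed-form least-squares estimates for one feature
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b = y.mean() - m * x.mean()

print(m, b)  # 2.0 1.0
```

A hand-written gradient descent run on the same points should approach these values as the number of epochs grows.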

In [3]:
import pandas as pd
import numpy as np

# Expected answer. m = 0.05168176, b=18.0465

def gradient_descent(x, y, lr=0.1, epochs=3000):
    # Scale x and y using Min-Max Scaling
    x_min, x_max = x.min(), x.max()
    y_min, y_max = y.min(), y.max()

    x_scaled = (x - x_min) / (x_max - x_min)
    y_scaled = (y - y_min) / (y_max - y_min)

    # Initialize parameters
    b = 0.0  # Intercept
    m = 0.0  # Slope
    n = len(y_scaled)  # Number of data points

    # Perform gradient descent
    for epoch in range(epochs):
        y_pred = b + m * x_scaled  # Predicted y values
        error = y_scaled - y_pred  # Error in prediction
        cost = np.mean(error ** 2)   # Mean squared error

        # Calculate gradients
        db = -2 * np.mean(error)  # Derivative w.r.t. intercept b
        dm = -2 * np.mean(error * x_scaled)  # Derivative w.r.t. slope m

        # Update parameters
        b -= lr * db
        m -= lr * dm

        # Optional: Print cost every 100 iterations to monitor progress
        if epoch % 100 == 0:
            print(f"Epoch {epoch}: Cost = {cost}, b = {b}, m = {m}")

    # Scale back the coefficients to original scale
    b_original = b * (y_max - y_min) + y_min - m * (y_max - y_min) * x_min / (x_max - x_min)
    m_original = m * (y_max - y_min) / (x_max - x_min)

    return b_original, m_original

In [4]:
if __name__ == "__main__":
    df = pd.read_csv("../../data/home_prices.csv")

    x = df["area_sqr_ft"].to_numpy()
    y = df["price_lakhs"].to_numpy()

    b, m = gradient_descent(x, y)

    print(f"Final Results: m={m}, b={b}")

Epoch 0: Cost = 0.2564831062314152, b = 0.08397689768976899, m = 0.05399231750098086
Epoch 100: Cost = 0.045238527469297275, b = 0.16916604992569337, m = 0.5099109453609146
Epoch 200: Cost = 0.044639043137800115, b = 0.13609990777687922, m = 0.5708534773777333
Epoch 300: Cost = 0.04461681897553835, b = 0.1297333115176334, m = 0.5825874285178556
Epoch 400: Cost = 0.044615995078461666, b = 0.1285074791315702, m = 0.5848466981084404
Epoch 500: Cost = 0.044615964534840506, b = 0.12827145583473568, m = 0.585281700693843
Epoch 600: Cost = 0.04461596340252335, b = 0.12822601161472477, m = 0.5853654566343008
Epoch 700: Cost = 0.044615963360545935, b = 0.12821726172791226, m = 0.585381583107562
Epoch 800: Cost = 0.04461596335898974, b = 0.12821557701379, m = 0.5853846881188415
Epoch 900: Cost = 0.04461596335893206, b = 0.12821525263683137, m = 0.5853852859615863
Epoch 1000: Cost = 0.044615963358929915, b = 0.128215190180887, m = 0.5853854010709744
Epoch 1100: Cost = 0.04461596335892984, b = 0.1

#### 3.12: Why MSE (Not MAE)
- Mean Squared Error (MSE) is the usual choice of cost function for Gradient Descent because:
  1. It is sensitive to outliers (an outlier is a data point far away from the other data points; its error is large, and squaring makes it larger still).
  2. It is continuously differentiable (MSE is a smooth and continuous function of the parameters it depends on, so its gradient, the derivative that tells us the direction in which to adjust the parameters, exists everywhere).
- In rarer scenarios, Mean Absolute Error (MAE) is the better cost function for Gradient Descent, namely when the data contains many outliers that should not dominate the fit.
- Even though these principles are basic, in real-world scenarios, you'll be using machine learning libraries directly.
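The difference in outlier sensitivity can be seen with a few numbers. In the sketch below the error lists are made up, and `mae` and `mse` are small helper functions written for this example:

```python
def mae(errors):
    return sum(abs(e) for e in errors) / len(errors)

def mse(errors):
    return sum(e ** 2 for e in errors) / len(errors)

# Hypothetical prediction errors, without and with a single outlier
clean = [1, 1, 1, 1]
with_outlier = [1, 1, 1, 10]

print(mae(clean), mse(clean))                # 1.0 1.0
print(mae(with_outlier), mse(with_outlier))  # 3.25 25.75
```

One outlier roughly triples the MAE but multiplies the MSE by more than 25, so a model trained on MSE is pulled much harder toward fitting that point.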

#### 3.13: Model Evaluation (Train, Test Split)
- Just as obtaining a driver's license involves passing a test, not just learning to drive, a machine learning model must be tested as well as trained: we split the dataset into training and test parts and evaluate the model's performance on the test part, which it has not seen during training.
- We utilize the train_test_split() function from the sklearn library for this purpose.
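A minimal sketch of the split on made-up data is shown below. The arrays are hypothetical; `test_size` and `random_state` are real parameters of `train_test_split`:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the shuffle repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)

# The same random_state reproduces exactly the same split
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_test, X_test_again))  # True
```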


#### 3.14: Model Evaluation - Metrics
- To evaluate the performance of a Machine Learning model, we can use metrics such as MSE, MAE, or R2 score.
- The R2 score is easier to interpret, compared to other metrics.
- The parameter random_state is used in the train_test_split() function to ensure reproducibility.
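The R2 score compares the model's squared error with that of a baseline that always predicts the mean: R2 = 1 − SS_res / SS_tot. A score of 1 is a perfect fit, and 0 is no better than predicting the mean. The sketch below computes it by hand on made-up values:

```python
import numpy as np

# Hypothetical actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 0.75
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares: 20.0
r2 = 1 - ss_res / ss_tot

print(round(r2, 4))  # 0.9625
```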


In [5]:
import pandas as pd

# URL to the Auto MPG dataset (Check for the latest URL or availability on the UCI repository)
url = "http://archive.ics.uci.edu/ml/machine-learning-databases/auto-mpg/auto-mpg.data"

# Column names based on the dataset description
column_names = ['mpg', 'cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin', 'car_name']

# Read the dataset from the URL
# Note: The dataset uses various delimiters and contains missing values denoted as '?'
df = pd.read_csv(url, sep=r'\s+', names=column_names, na_values='?', comment='\t')  # sep=r'\s+' replaces the deprecated delim_whitespace=True

# Display the first few rows of the dataframe
df.head()

Unnamed: 0,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,car_name
0,18.0,8,307.0,130.0,3504.0,12.0,70,1,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693.0,11.5,70,1,buick skylark 320
2,18.0,8,318.0,150.0,3436.0,11.0,70,1,plymouth satellite
3,16.0,8,304.0,150.0,3433.0,12.0,70,1,amc rebel sst
4,17.0,8,302.0,140.0,3449.0,10.5,70,1,ford torino


In [7]:
df.to_csv("mpg.csv",index=False)
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      6
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

In [8]:
df.horsepower.describe()

count    392.000000
mean     104.469388
std       38.491160
min       46.000000
25%       75.000000
50%       93.500000
75%      126.000000
max      230.000000
Name: horsepower, dtype: float64

In [10]:
# Assign the result back instead of using inplace=True on a column selection
df['horsepower'] = df['horsepower'].fillna(df['horsepower'].mean())
df.isna().sum()

mpg             0
cylinders       0
displacement    0
horsepower      0
weight          0
acceleration    0
model_year      0
origin          0
car_name        0
dtype: int64

In [11]:
# Select features and target for modeling
features = ['cylinders', 'displacement', 'horsepower', 'weight', 'acceleration', 'model_year', 'origin']
X = df[features]
y = df['mpg']

In [12]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

LinearRegression(fit_intercept=True, copy_X=True, tol=1e-06, n_jobs=None, positive=False)


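In [13]:
# Make predictions on the test set before comparing them with the actual values
y_pred = model.predict(X_test)
y_test[:5].to_list(), y_pred[:5]

In [None]:
# Actual and predicted values side by side
pd.DataFrame({"actual": y_test, "predicted": y_pred})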

In [None]:
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
mse, r2

In [None]:
model.score(X_test, y_test)

#### 3.18: Data preprocessing - One hot encoding
- Data Science/AI Project Stages:
  Data Collection -> Data Preprocessing (cleaning bad data; feature engineering, such as one-hot encoding) -> Model Training & Evaluation -> Model Tuning
- In simple terms, computers understand numbers, not text. The process of turning text into numbers is called encoding.
- We use label encoding for ordinal categories that have a specific order. For nominal categories, which don't have an order, we use one-hot encoding.
- One-hot encoding transforms categorical data into a binary vector format that's easier for machine learning to understand. Each category is represented by a binary vector, with a "1" at the position corresponding to the category and "0"s everywhere else.
- Multicollinearity occurs when two or more independent variables are highly correlated, making it difficult to distinguish their individual effects on the dependent variable.
- One-hot encoding introduces multicollinearity because the dummy columns always sum to 1, so any one of them can be derived from the others. To handle this, we remove one of the dummy columns during or after the one-hot encoding process.
- The Pandas library includes a built-in function named get_dummies() for implementing one-hot encoding. By using the drop_first parameter, we can eliminate the first dummy column.
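The two encodings can be sketched on a tiny made-up frame. The locality "Banjara Hills" and the `condition` categories below are hypothetical values added for illustration:

```python
import pandas as pd

# Hypothetical rows; "Banjara Hills" is an illustrative locality name
homes = pd.DataFrame({
    "locality": ["Kollur", "Mankhal", "Banjara Hills", "Kollur"],
    "area_sqr_ft": [656, 1100, 2400, 1260],
})

# Nominal category -> one-hot encoding; drop_first removes the first
# (alphabetically sorted) dummy column to avoid perfectly correlated columns
encoded = pd.get_dummies(homes, columns=["locality"], drop_first=True)
print(list(encoded.columns))
# ['area_sqr_ft', 'locality_Kollur', 'locality_Mankhal']

# Ordinal category -> label encoding with an explicit order (hypothetical column)
condition_order = {"poor": 0, "average": 1, "good": 2}
conditions = pd.Series(["good", "poor", "average"])
print(conditions.map(condition_order).tolist())  # [2, 0, 1]
```

A row with both dummy columns set to False belongs to the dropped category, so no information is lost by removing that column.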

In [15]:
import pandas as pd

df = pd.read_csv("../../data/home_prices.csv")
df.head()

Unnamed: 0,locality,area_sqr_ft,price_lakhs,bedrooms
0,Kollur,656,39.0,2
1,Kollur,1260,83.2,2
2,Kollur,1057,86.6,3
3,Kollur,1259,59.0,2
4,Kollur,1800,140.0,3


In [16]:
df_encoded = pd.get_dummies(df, columns=['locality'], drop_first=True)
df_encoded.sample(5)

Unnamed: 0,area_sqr_ft,price_lakhs,bedrooms,locality_Kollur,locality_Mankhal
13,2400,300.0,3,False,False
7,1110,45.0,2,True,False
14,1100,85.0,2,False,False
15,2600,400.0,3,False,False
12,1600,150.0,3,False,False


In [17]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression


X = df_encoded.drop('price_lakhs', axis=1)
y = df_encoded['price_lakhs']

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Building the Linear Regression model
model = LinearRegression()
model.fit(X_train, y_train)

model.score(X_test, y_test)

0.855890526315538

In [18]:
X_test

Unnamed: 0,area_sqr_ft,bedrooms,locality_Kollur,locality_Mankhal
0,656,2,True,False
13,2400,3,False,False
8,1700,3,True,False
1,1260,2,True,False
15,2600,3,False,False


In [19]:
# Model is trained. Now let's predict prices of homes
test = pd.DataFrame([
    {'area_sqr_ft': 1600, "bedrooms": 2, "locality_Kollur": False, "locality_Mankhal": False},
    {'area_sqr_ft': 1600, "bedrooms": 2, "locality_Kollur": False, "locality_Mankhal": True},
])

model.predict(test)

array([157.03383393, 109.25104283])

#### 3.20: Polynomial Regression
- Simple linear regression models a straight-line (y=b0+b1x) relationship between a dependent variable y and an independent variable x. Polynomial regression extends this by including higher powers of x for complex, non-linear relationships.
- y = b0 + b1*x + b2*x**2 + b3*x**3 + ... + bn*x**n ; b0, b1, …, bn: coefficients ; n: degree
- Polynomial Regression with degree=1 is nothing but a Linear Regression.
- Deciding the degree will be based on trial and error, as well as domain knowledge.
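One common way to fit such a model in Scikit-learn is to generate the polynomial terms with `PolynomialFeatures` and pass them to `LinearRegression`. The sketch below assumes noise-free made-up data generated from y = 2 + 3x + 0.5x², and compares a degree-1 fit with a degree-2 fit:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical noise-free data generated from y = 2 + 3x + 0.5x^2
x = np.linspace(-3, 3, 30).reshape(-1, 1)
y = 2 + 3 * x.ravel() + 0.5 * x.ravel() ** 2

# degree=1 is plain linear regression; degree=2 adds the x^2 term
linear_fit = make_pipeline(PolynomialFeatures(degree=1), LinearRegression()).fit(x, y)
quadratic_fit = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(x, y)

linear_r2 = linear_fit.score(x, y)
quadratic_r2 = quadratic_fit.score(x, y)
print(linear_r2 < quadratic_r2)  # True
```

Because the data is exactly quadratic, the degree-2 model reproduces it almost perfectly, while the straight line cannot capture the curvature.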

#### 3.23: Overfitting and Underfitting
- Overfitting: Occurs when a model learns too much detail and noise from the training data, affecting its performance on new data.
- Underfitting: Happens when a model is too simple and cannot learn the data pattern, leading to poor performance on all data.
- Balanced Fit: Achieved when a model accurately learns the training data's patterns and performs well on unseen data.
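The three cases can be explored by fitting polynomials of increasing degree to a small noisy sample. This is a sketch under stated assumptions: the sine-curve data, the noise level, and the degrees 1, 3, and 9 are all illustrative choices, and `np.polyfit` stands in for any regression model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical noisy samples of a sine curve
x_train = np.linspace(0, 1, 10)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, size=10)
x_test = np.linspace(0.05, 0.95, 10)
y_test = np.sin(2 * np.pi * x_test) + rng.normal(0, 0.2, size=10)

def mse(y_true, y_hat):
    return np.mean((y_true - y_hat) ** 2)

train_mse = {}
for degree in (1, 3, 9):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse[degree] = mse(y_train, np.polyval(coeffs, x_train))
    test_mse = mse(y_test, np.polyval(coeffs, x_test))
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, test MSE={test_mse:.4f}")
```

The training error can only go down as the degree increases, and a degree-9 polynomial passes through all 10 training points. What matters is the test error, which typically stops improving once the model starts fitting the noise.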

#### 3.24: Reasons and Remedies for Overfitting/Underfitting
- Overfitting can happen due to any one or a combination of the following:
  1. Reason: Poor model or hyperparameter selection. Solution: Better model or hyperparameter selection (an overly flexible model tries to pass through every point, so the fitted line becomes a zigzag instead of a smooth curve).
  2. Reason: Insufficient training data. Solution: Sufficient training data.
  3. Reason: Poor feature selection. Solution: Careful feature selection.
  4. Reason: Inadequate validation. Solution: Adequate validation (if the training and test datasets come from the same region of the data, points from other regions will not fit the derived model).
  5. Reason: Lack of regularization. Solution: Apply regularization (a technique used to reduce overfitting; with too much regularization the model goes from one extreme to the other, from overfitting to underfitting).
- Underfitting can happen due to any one or a combination of the following:
  1. Reason: Too simple a model. Solution: Use a more complex model that can capture the data patterns.
  2. Reason: Insufficient training data. Solution: Sufficient training data.
  3. Reason: Insufficient features or poor feature engineering. Solution: Better feature selection/engineering.
  4. Reason: Insufficient training time. Solution: Sufficient training time (a larger number of epochs/iterations).
  5. Reason: Inadequate validation. Solution: Adequate validation.
  6. Reason: Excessive regularization. Solution: Adequate regularization.

#### 3.25: L1 and L2 Regularization
- L1 and L2 regularization are effective tools for reducing overfitting. Applying L2 regularization to Linear Regression gives Ridge Regression.
- Likewise, applying L1 regularization gives Lasso Regression.
- Linear Regression can also overfit when the number of features is excessive; regularization is used to reduce the problem.
- The choice between Ridge and Lasso Regression, and of parameters such as alpha, depends on multiple factors and is typically made through trial and error.
- In the 150-feature example that follows, adding either penalty improves the test score over plain Linear Regression.
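Both techniques add a penalty on the size of the coefficients to the cost function: L1 penalizes the sum of their absolute values, and L2 the sum of their squares. The sketch below computes only the penalty terms for a made-up coefficient vector; Scikit-learn scales the data-fit part of the cost differently for each estimator, so that part is left out:

```python
import numpy as np

# Hypothetical coefficient vector of a fitted linear model
weights = np.array([3.0, -4.0, 0.0])
alpha = 0.5  # regularization strength (the alpha parameter of Lasso/Ridge)

l1_penalty = alpha * np.sum(np.abs(weights))   # Lasso-style: 0.5 * (3 + 4 + 0)
l2_penalty = alpha * np.sum(weights ** 2)      # Ridge-style: 0.5 * (9 + 16 + 0)

print(l1_penalty)  # 3.5
print(l2_penalty)  # 12.5
```

A larger alpha makes large coefficients more expensive, which pushes the model toward simpler fits.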

In [20]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
import matplotlib.pyplot as plt

In [21]:
df = pd.read_csv("../../data/dataset.csv")
df.head(3)

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f142,f143,f144,f145,f146,f147,f148,f149,f150,target
0,-0.924357,-0.326536,-1.875007,-1.780626,-0.630143,0.788204,2.792209,-0.772192,-0.450994,0.4,...,0.968645,-0.702053,-0.327662,-0.392108,-1.463515,0.29612,0.261055,0.005113,-0.234587,10.681366
1,0.001795,-1.285599,-0.726774,0.385711,0.891863,0.599451,-0.140553,-0.76176,0.117707,0.333231,...,0.856399,0.214094,-1.245739,0.173181,0.385317,-0.883857,0.153725,0.058209,-1.14297,-60.163343
2,0.956702,2.31933,-0.705012,0.081829,0.33088,0.838491,2.493,1.227669,-0.785989,-0.920674,...,-0.493001,-0.589365,0.849602,0.357015,-0.69291,0.8996,0.3073,0.812862,0.629629,131.226545


In [22]:
X = df.drop(["target"], axis=1)
y = df["target"]
X.head(3)

Unnamed: 0,f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,...,f141,f142,f143,f144,f145,f146,f147,f148,f149,f150
0,-0.924357,-0.326536,-1.875007,-1.780626,-0.630143,0.788204,2.792209,-0.772192,-0.450994,0.4,...,0.097078,0.968645,-0.702053,-0.327662,-0.392108,-1.463515,0.29612,0.261055,0.005113,-0.234587
1,0.001795,-1.285599,-0.726774,0.385711,0.891863,0.599451,-0.140553,-0.76176,0.117707,0.333231,...,-0.446515,0.856399,0.214094,-1.245739,0.173181,0.385317,-0.883857,0.153725,0.058209,-1.14297
2,0.956702,2.31933,-0.705012,0.081829,0.33088,0.838491,2.493,1.227669,-0.785989,-0.920674,...,-0.208122,-0.493001,-0.589365,0.849602,0.357015,-0.69291,0.8996,0.3073,0.812862,0.629629


In [23]:
y[:5]

0     10.681366
1    -60.163343
2    131.226545
3   -131.889020
4   -138.566956
Name: target, dtype: float64

In [24]:
# Train a plain Linear Regression model (no regularization)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Initialize and train the model
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_model.score(X_test, y_test)

0.7410183335246907

In [25]:
# Training Using Lasso Regression Model (L1 Regularization)
lasso_model = Lasso(alpha=1.0)  # Adjust alpha for Lasso
lasso_model.fit(X_train, y_train)
lasso_model.score(X_test, y_test)

0.9901476660336155

In [26]:
# Training Using Ridge Regression Model (L2 Regularization)
ridge_model = Ridge(alpha=1.0)  # Adjust alpha for Ridge
ridge_model.fit(X_train, y_train)
ridge_model.score(X_test, y_test)

0.9462671341742027

#### 3.26: Bias-Variance Trade-off
- Bias is the error that comes from a model being too simple to capture the patterns in the training dataset.
- High bias occurs when an algorithm misses significant patterns in the data due to its simplicity, while high variance occurs when the fitted model changes significantly based on minor differences in the training data.
- BUVO: Bias-Underfitting (high bias often leads to underfitting), Variance-Overfitting (high variance often leads to overfitting).
- Overfitting: Training Error - Low, Test Error - High (the model has learned too much detail from the training data).
- Underfitting: Training Error - High, Test Error - High (the model is too simple to learn the data pattern).

### Linear Regression Use case
You are a data scientist / AI engineer at a healthcare consulting firm. You have been provided with a dataset named **`"patient_health_data.csv"`**, which includes records of various health indicators for a group of patients. The dataset comprises the following columns:

- `age:` The age of the patient.
- `bmi:` Body Mass Index of the patient.
- `blood_pressure:` The blood pressure of the patient.
- `cholesterol:` Cholesterol levels of the patient.
- `glucose:` Glucose levels of the patient.
- `insulin:` Insulin levels of the patient.
- `heart_rate:` Heart rate of the patient.
- `activity_level:` Activity level of the patient.
- `diet_quality:` Quality of diet of the patient.
- `smoking_status:` Whether the patient smokes (Yes or No).
- `alcohol_intake:` The amount of alcohol intake by the patient.
- `health_risk_score:` A composite score representing the overall health risk of a patient.

Your task is to use this dataset to build a linear regression model to predict the health risk score based on the given predictor variables. Additionally, you will use L1 (Lasso) and L2 (Ridge) regularization techniques to improve the model's performance. 

#### Import necessary libraries

In [3]:
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

#### Step 1: Data Preparation and Exploration
1. Import the data from the **`"patient_health_data.csv"`** file and store it in a variable df.
2. Display the number of rows and columns in the dataset.
3. Display the first few rows of the dataset to get an overview.
4. Check for any missing values in the dataset and handle them appropriately.
5. Encode the categorical variable `'smoking_status'` by converting 'Yes' to 1 and 'No' to 0.

In [4]:
# Import the data from the "patient_health_data.csv" file and store it in a variable 'df'
df = pd.read_csv("../data/patient_health_data.csv")

# Display the number of rows and columns in the dataset
print("Number of rows and columns:", df.shape)

# Display the first few rows of the dataset to get an overview
print("First few rows of the dataset:")
df.head()

Number of rows and columns: (250, 12)
First few rows of the dataset:


Unnamed: 0,age,bmi,blood_pressure,cholesterol,glucose,insulin,heart_rate,activity_level,diet_quality,smoking_status,alcohol_intake,health_risk_score
0,58,24.865215,122.347094,165.730375,149.289441,22.306844,75.866391,1.180237,7.675409,No,0.824123,150.547752
1,71,19.103168,136.852028,260.610781,158.584646,13.869817,69.481114,7.634622,8.933057,No,0.85291,160.32035
2,48,22.316562,137.592457,177.342582,178.760166,22.849816,69.386962,7.917398,3.501119,Yes,4.740542,187.487398
3,34,22.196893,153.164775,234.594764,136.351714,15.140336,95.348387,3.19291,2.745585,No,2.226231,148.773138
4,62,29.837173,92.768973,276.106498,158.753516,17.228576,77.680975,7.044026,8.918348,No,3.944011,170.609655


In [6]:
# Check for any missing values in the dataset and handle them appropriately
print("Missing values in the dataset:")
df.isna().sum()

Missing values in the dataset:


age                  0
bmi                  0
blood_pressure       0
cholesterol          0
glucose              0
insulin              0
heart_rate           0
activity_level       0
diet_quality         0
smoking_status       0
alcohol_intake       0
health_risk_score    0
dtype: int64

In [10]:
# Encode the categorical variable 'smoking_status' by converting 'Yes' to 1 and 'No' to 0.
df['smoking_status'] = df['smoking_status'].apply(lambda x: 1 if x == 'Yes' else 0)

#### Step 2: Train Linear Regression Models
1. Select the features and the target variable for modeling.
2. Split the data into training and test sets with a test size of 25%.
3. Initialize and train a Linear Regression model, and evaluate its performance using R-squared.
4. Initialize and train a Lasso Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.
5. Initialize and train a Ridge Regression model with various alpha values provided in a list: [0.01, 0.1, 1.0, 10.0], and evaluate its performance using R-squared.

In [11]:
df.head()

Unnamed: 0,age,bmi,blood_pressure,cholesterol,glucose,insulin,heart_rate,activity_level,diet_quality,smoking_status,alcohol_intake,health_risk_score
0,58,24.865215,122.347094,165.730375,149.289441,22.306844,75.866391,1.180237,7.675409,0,0.824123,150.547752
1,71,19.103168,136.852028,260.610781,158.584646,13.869817,69.481114,7.634622,8.933057,0,0.85291,160.32035
2,48,22.316562,137.592457,177.342582,178.760166,22.849816,69.386962,7.917398,3.501119,1,4.740542,187.487398
3,34,22.196893,153.164775,234.594764,136.351714,15.140336,95.348387,3.19291,2.745585,0,2.226231,148.773138
4,62,29.837173,92.768973,276.106498,158.753516,17.228576,77.680975,7.044026,8.918348,0,3.944011,170.609655


In [13]:
# Select the features and target variable for modeling
X = df.drop(['health_risk_score'], axis=1)
y = df['health_risk_score']

# Split the data into training and test sets with a test size of 25%
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [16]:
# Initialize and train a Linear Regression model, and evaluate its performance using R-squared
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
linear_r2 = linear_model.score(X_test, y_test)
print("Linear Regression R-squared:", linear_r2)

Linear Regression R-squared: 0.7643620906757489


In [17]:
# Initialize and train a Lasso Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
lasso_alphas = [0.01, 0.1, 1.0, 10.0]
for alpha in lasso_alphas:
    lasso_model = Lasso(alpha=alpha)
    lasso_model.fit(X_train, y_train)
    lasso_r2 = lasso_model.score(X_test, y_test)
    print(f"Lasso Regression R-squared (alpha={alpha}):", lasso_r2)

Lasso Regression R-squared (alpha=0.01): 0.7645437646395714
Lasso Regression R-squared (alpha=0.1): 0.7660509914802165
Lasso Regression R-squared (alpha=1.0): 0.781976368357514
Lasso Regression R-squared (alpha=10.0): 0.7873364302158368


- Lasso regression gives a small improvement in the R-squared score: `0.78` with alpha `1.0` and `0.79` with alpha `10.0`, as opposed to `0.76` with plain Linear Regression.

In [18]:
# Initialize and train a Ridge Regression model with various alpha values provided in a list, and evaluate its performance using R-squared
ridge_alphas = [0.01, 0.1, 1.0, 10.0]
for alpha in ridge_alphas:
    ridge_model = Ridge(alpha=alpha)
    ridge_model.fit(X_train, y_train)
    ridge_r2 = ridge_model.score(X_test, y_test)
    print(f"Ridge Regression R-squared (alpha={alpha}):", ridge_r2)

Ridge Regression R-squared (alpha=0.01): 0.7643631589390539
Ridge Regression R-squared (alpha=0.1): 0.7643727707489341
Ridge Regression R-squared (alpha=1.0): 0.7644686367656156
Ridge Regression R-squared (alpha=10.0): 0.7654030812954538


- Ridge regression performs similarly to plain linear regression here. However, for complex datasets where overfitting is likely, both `Lasso and Ridge` regression can give improvements over the unregularized model.