# Statistics and Research Methods

## Understanding Statistical Models vs. Machine Learning Models

It's essential first to understand the distinctions between statistical models and machine learning models, as they differ in purpose, assumptions, and interpretability.

- Statistical Models: 
    - These are rooted in traditional statistics and focus on relationships between variables through predefined equations. 
    - Statistical models aim to understand the underlying data-generating process, focusing on hypothesis testing and inference. 
    - These models often rely on strong assumptions like:
        - linearity, 
        - normality, and 
        - homoscedasticity.
    - They are **interpretable**, making it easier to understand the impact of individual variables.

- Machine Learning Models: 
    - These prioritize **predictive** power over interpretability. 
    - They are designed to automatically learn patterns and relationships within data, often with minimal assumptions. 
    - Machine learning models can handle complex and high-dimensional data but may lack transparency about how individual features affect the outcome, especially in “black box” models like neural networks or ensemble methods.


## Choosing the Right Statistical Model

The type of statistical model you use depends on your data and problem:

- Linear Regression: For predicting a **continuous target variable** based on one or more predictors.
- Logistic Regression: For predicting a **binary outcome**, often used in classification problems.
- ANOVA (Analysis of Variance): For comparing means across multiple groups.
- Time Series Models: For data that’s ordered by time (e.g., ARIMA, SARIMA).
- Survival Analysis: For time-to-event data, such as customer churn timing.
- Multivariate Analysis: For understanding interactions across multiple variables (e.g., MANOVA, PCA).

## Preprocessing the Data
Prepare your data by cleaning and preprocessing it:

- Missing Values: Decide whether to impute or drop missing values.
- Outliers: Identify and consider handling outliers, especially in regression.
- Data Transformation: Transform non-normal variables if required (e.g., using log transformations).
- Feature Scaling: For some models, standardizing or normalizing data is essential.
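
As a rough illustration of the steps above, here is a minimal NumPy sketch. The array values are invented for the example, and median imputation is only one of several reasonable choices.

```python
import numpy as np

# Hypothetical right-skewed feature with one missing value (np.nan).
raw = np.array([1.0, 2.0, np.nan, 3.0, 4.0, 5.0, 40.0])

# Missing values: impute with the median of the observed values.
filled = np.where(np.isnan(raw), np.nanmedian(raw), raw)

# Data transformation: log1p(x) = log(1 + x) compresses the long right tail.
logged = np.log1p(filled)

# Feature scaling: standardize to mean 0 and standard deviation 1.
standardized = (filled - filled.mean()) / filled.std()

# Feature scaling: min-max normalize to the [0, 1] range.
normalized = (filled - filled.min()) / (filled.max() - filled.min())
```

Standardization is usually preferred when a model assumes roughly centered inputs, while min-max scaling keeps values in a fixed range.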

## Exploratory Data Analysis (EDA)

EDA is essential to understand: 
- patterns (through visualizations),
- distributions (through summary statistics), and
- relationships (through correlation matrices).

This helps identify relevant features and spot potential issues like multicollinearity.

## Building the Statistical Model

- **Statsmodels** provides 
    - coefficients, 
    - p-values, and 
    - confidence intervals for each variable, 
        - enabling hypothesis testing on whether each predictor significantly affects the outcome.

## Evaluating Model Performance
Regression Metrics: 
- Use R-squared, 
- Adjusted R-squared, 
- RMSE, and 
- MAE to evaluate regression models.
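
The four regression metrics can be computed directly from their definitions. The sketch below assumes NumPy; the helper name `regression_metrics` and the toy numbers are illustrative only.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_predictors):
    """Return R-squared, adjusted R-squared, RMSE and MAE for a fitted model."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    residuals = y_true - y_pred

    ss_res = np.sum(residuals ** 2)                   # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # total variation
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes extra predictors.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    rmse = np.sqrt(np.mean(residuals ** 2))
    mae = np.mean(np.abs(residuals))
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}

metrics = regression_metrics([3, 5, 7, 9], [2, 5, 8, 9], n_predictors=1)
print(metrics)
```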

Classification Metrics: 
- Use confusion matrix, 
- accuracy, 
- precision, 
- recall, and 
- AUC-ROC.
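
For a binary classifier, the confusion-matrix counts and the metrics derived from them can be sketched in plain Python. The function name and the labels below are made up for illustration. AUC-ROC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision and recall (positive class = 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": accuracy, "precision": precision, "recall": recall}

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
result = classification_metrics(y_true, y_pred)
print(result)
```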

Residual Analysis: 
- Residual plots help assess assumptions such as
    - homoscedasticity and 
    - normality of residuals.

## Model Interpretation
Statistical models are highly interpretable. 
- In linear regression, each coefficient represents the expected change in the dependent variable for a one-unit change in the predictor, holding all else constant.

Confidence Intervals: 
- Look at the 95% CI for each coefficient; if it does not contain zero, the predictor's effect is statistically significant at the 5% level.

P-Values: 
- A p-value below a chosen threshold (usually 0.05) indicates that the predictor's association with the outcome is statistically significant; it does not by itself measure how large the effect is.

## Validating Assumptions
- Linearity: Check scatter plots of residuals.
- Normality of Residuals: Use a Q-Q plot to verify.
- No Multicollinearity: Variance inflation factor (VIF) helps detect multicollinearity.
- Homoscedasticity: Plot residuals vs. fitted values.

## Reporting and Communicating Results
Present your findings by focusing on:

- Key Coefficients: Explain which predictors significantly affect the outcome.
- Model Fit: Interpret R-squared values (e.g., explaining how much variance in the target variable is explained).
- Real-World Implications: Describe how insights from the model can impact business decisions.

# Approach to statistical modeling

Each model type has specific 
- applications, 
- strengths, and 
- limitations.

Understand when and how to use them.

### Step 1: Define Objectives and Hypotheses

Identify the Problem and Objectives: 
- Clearly define the goal.
    - Are you trying to predict, classify, find patterns, or estimate relationships? 
    - Setting objectives helps in choosing the right model.

- Formulate Hypotheses: 
    - Based on the problem, develop hypotheses. 
        - For instance, in a sales prediction problem, you may hypothesize that `certain features like advertising spend, time of year, and economic indicators affect sales.`

### Step 2: Data Collection and Preprocessing
Data Collection: 
- Gather historical data related to the problem. 

Data Cleaning: 
- Handle missing values, remove duplicates, and ensure consistency.

Feature Engineering: 
- Create new features if necessary. 
- This could involve 
    - transformations, 
    - encoding categorical variables, or 
    - creating interaction terms.

Data Splitting: 
- Split the data into training and testing sets. Typically, an 80-20 or 70-30 split is used.
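
A shuffled 80-20 split can be sketched with NumPy alone. The helper `train_test_split_simple` is a hypothetical name used only here; in practice a library routine such as scikit-learn's `train_test_split` does the same job.

```python
import numpy as np

def train_test_split_simple(X, y, test_size=0.2, seed=42):
    """Shuffle the row indices once, then cut them into test and train parts."""
    X = np.asarray(X)
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(X))
    n_test = int(round(len(X) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

# Toy data: 10 rows, 2 features; row i is [2i, 2i + 1] and its target is i.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)
X_train, X_test, y_train, y_test = train_test_split_simple(X, y)
print(X_train.shape, X_test.shape)
```

Fixing the seed keeps the split reproducible, which matters when comparing models on the same test set.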

### Step 3: Select the Type of Statistical Model
Statistical models can be broadly categorized as:

- **Descriptive Models**: Summarize data patterns.
- **Inferential Models**: Help make inferences about the population.
- **Predictive Models**: Used to predict future outcomes based on historical data.
- **Prescriptive Models**: Suggest actions based on predictions.

Let's go through common types of statistical models and their applications.


# Regression Analysis
 
Regression Analysis is a statistical method to analyze the relationship between a dependent variable and one or more independent variables.

### Three types of regression analysis

##### Real-world examples
- Simple linear regression
    - A real estate agent wants to determine the relationship between the size of a house (in square feet) and its selling price. They can use simple linear regression to predict the selling price of a house based on its size.
    
- Multiple Linear Regression (sometimes loosely called Multivariate Linear Regression)
    - A car manufacturer wants to predict the fuel efficiency of their vehicles based on various independent variables such as engine size, horsepower, and weight.
    
- Logistic regression
    - A bank wants to predict whether a customer will default on their loan based on their credit score, income, and other factors. By using logistic regression, the bank can estimate the probability of default and take appropriate measures to minimize their risk.

## 1. Linear Regression

Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors).

Linear Regression predicts a continuous target variable (e.g., the number of readmissions) by minimizing the residual sum of squares between observed and predicted values.

What It Means: 
- Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. It assumes a straight-line relationship. 

- In its simplest form, it establishes a link between a dependent variable and a single independent variable. 
    - A linear equation defines the relationship, with the 
        - slope and 
        - intercept 
    - of the line representing the effect of the independent variable on the dependent variable.
        - An independent variable is the variable that is controlled in a scientific experiment to test the effects on the dependent variable.
        - A dependent variable is the variable being measured in a scientific experiment.

Outcome Interpretation: 
- Each coefficient represents how much the dependent variable (outcome) changes when the predictor variable changes by one unit, keeping all else constant.

**Assumptions of Linear Regression**
- Linearity (Linear Relationship): The relationship between the predictors and the outcome is linear.
- Independence of Errors: Residuals (errors) are independent of each other.
- Normality of Errors: Residuals are normally distributed.
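- Multivariate Normality: Often listed as a requirement that the variables be jointly normal; in practice it is checked through the normality of the residuals.
- No or Little Multicollinearity: Predictors are not highly correlated with each other.
- No or Little Autocorrelation: Residuals are not correlated with one another, for example across time.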
- Homoscedasticity: Variance of residuals is constant across all levels of predictors.

Performance Measures:
- R-squared: Indicates the proportion of the variance in the dependent variable explained by the independent variables. 
    - Values closer to 1 indicate a better fit.
- Mean Squared Error (MSE): The average squared difference between observed and predicted values; lower values are better.

Lay Explanation: 
- Think of linear regression like drawing a best-fit line through a scatterplot of data points, aiming to predict outcomes based on relationships in the data.
- It finds the relationship between independent and dependent variables by fitting a “best-fit line” that keeps the total distance from the data points as small as possible.

Use Case: 
- When there is a linear relationship between the target and predictor variables.

### Mathematics of Linear Regression

- The least squares method finds the linear equation that minimizes the sum of squared residuals (SSR).
- Cost Function:

$J(\theta) = \frac{1}{2m}\sum^{m}_{i=1}(h_{\theta}(x^{(i)})- y^{(i)})^{2}$

where $m$ is the number of training examples and $h_{\theta}(x^{(i)})$ is the model's prediction for example $i$.

Model Equation:

$y = \beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n} + \epsilon$

where:
- $y$ = dependent variable
- $\beta_{0}$ = Y intercept
- $\beta_{1}, \dots, \beta_{n}$ = slope coefficients
- $x_{1}, \dots, x_{n}$ = independent variables
- $\epsilon$ = error term
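
To make the model equation concrete, the sketch below recovers the coefficients by least squares with `numpy.linalg.lstsq`. The data are synthetic and noise-free (an assumption made so the true coefficients are known), with intercept 1 and slopes 2 and 3.

```python
import numpy as np

# Two predictors, five observations (invented values).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 5.0]])

# Noise-free target generated from y = 1 + 2*x1 + 3*x2.
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the first coefficient acts as the intercept.
design = np.column_stack([np.ones(len(X)), X])

# Least squares: find beta that minimizes the sum of squared residuals.
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta)  # intercept first, then one slope per predictor
```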

**What is Cost Function ?**

A cost function is closely related to two other terms: 
- loss function: the error for a single training example. 
- objective function: the general name for whatever function is being optimized; the cost function, the average of the losses over the entire training dataset, is the usual objective.

It quantifies the difference between predicted and actual values, serving as a metric to evaluate the performance of a model.

Objective 
- is to minimize the cost function, indicating better alignment between predicted and observed outcomes.
- Guides the model towards optimal predictions by measuring its accuracy against the training data.

**Why to use a Cost function**

Cost function helps us reach the optimal solution. 
- How: It takes both the outputs predicted by the model and the actual outputs and calculates how wrong the model was in its prediction.
    - It measures the discrepancy between the model’s predictions and the true values it is attempting to predict. 
    - This discrepancy is expressed as a single number, enabling us to measure how **precise** the model's predictions are.
- The cost function is the technique for evaluating “the performance of our algorithm/model”.

For example, several candidate classifiers may all have very high accuracy, but one solution (classifier) is the best because it does not misclassify any point.
- It classifies all the points perfectly because its
    - decision line lies almost exactly midway between the groups, and not closer to any one of them.

Explanation of the function of a cost function:

- Error calculation: It determines the difference between the predicted outputs (what the model predicts as the answer) and the actual outputs (the true values we possess for the data).
- Gives one value: This simplifies comparing the model’s performance on various datasets or training rounds.
- Guides improvement: The objective is to reduce the cost function. 
    - How: By modifying the internal parameters of the model, such as weights and biases, we aim to minimize the total error and improve the accuracy of the model.

**Types of Cost function in machine learning**

The appropriate cost function depends on whether the task is a regression problem or a classification problem.
- Regression cost Function
- Binary Classification cost Functions
- Multi-class Classification cost Functions



### Problem Context: Predicting Hospital Readmission Rates
The aim is to reduce hospital readmission rates. 
- High readmission rates can strain resources and negatively impact patient outcomes.
- The goal is to predict the number of readmissions within 30 days of discharge for a particular condition, such as 
    - diabetes, based on 
        - patient demographic, 
        - clinical data, and 
        - treatment data.

**Step 1. Define the Problem**

We want to predict the number of readmissions ($𝑌$) using features ($𝑋$) such as:
- Patient age
- Length of hospital stay
- Severity of condition
- Medication adherence rate
- Comorbidities (e.g., hypertension, kidney disease)
- Number of follow-up visits scheduled

**Step 2. Collect and Prepare Data**

- Data Collection: Gather historical patient data from the hospital's database.
- Understand the 
    - model description
    - causality and 
    - directionality
- Check the data
    - categorical data, 
    - missing data and 
    - outliers
- Data Cleaning: 
    - A dummy variable takes only the value 0 or 1 to indicate whether a category of a categorical variable is present.
    - Handle missing values, 
    - remove duplicates, and 
    - correct errors.
    - An outlier is a data point that differs significantly from other observations. To detect outliers, 
        - use the standard deviation method or 
        - the interquartile range (IQR) method.
- Feature Engineering: 
    - Encode categorical variables (e.g., age group), 
    - scale continuous variables (e.g., length of stay), and 
    - create interaction terms if necessary.
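
The two outlier rules and the dummy-variable encoding mentioned above can be sketched with NumPy. The values, the cutoff `k = 2.0`, and the age-group labels are assumptions chosen for illustration, not hospital data.

```python
import numpy as np

# Hypothetical lengths of stay (days) with one extreme value.
values = np.array([2, 3, 3, 4, 4, 5, 5, 6, 30], dtype=float)

# Standard deviation method: flag points more than k standard deviations from the mean.
k = 2.0
z_scores = (values - values.mean()) / values.std()
sd_outliers = values[np.abs(z_scores) > k]

# IQR method: flag points beyond 1.5 * IQR outside the quartiles.
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Dummy variables: one 0/1 column per category, dropping the first
# category ("middle" after sorting) to serve as the reference level.
age_group = np.array(["young", "old", "middle", "old"])
categories = np.unique(age_group)                      # sorted labels
dummies = (age_group[:, None] == categories[1:]).astype(int)

print(sd_outliers, iqr_outliers)
print(categories[1:], dummies, sep="\n")
```

Dropping one category avoids perfect collinearity between the dummy columns and the intercept.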

**Step 3. Conduct a Simple Analysis**
- Check the **effect** by comparing 
    - the dependent variable with each independent variable and 
    - each independent variable with the other independent variables.
- Check the correlation.
    - Use scatter plots.
- Check Multicollinearity 
    - This occurs when two or more independent variables are highly correlated. 
    - Use the Variance Inflation Factor (VIF). Common rules of thumb: 
        - VIF > 5 suggests the variable is highly correlated with the other predictors, and 
        - VIF > 10 indicates serious multicollinearity among the variables.
- An interaction term implies that the slope for one predictor changes depending on the value of another predictor.
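
VIF for predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. A minimal NumPy sketch follows; the function name and the synthetic data are assumptions for illustration.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return one VIF per column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r2 = 1 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)   # nearly a copy of x1
x3 = rng.normal(size=200)              # unrelated to the others
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print(vifs)
```

Here the first two columns should show large VIFs, while the third stays close to 1.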

**Step 4. Formulate the Model (From Scratch)**
The simple regression line is $y = mx + b$, where
- $y$ stands for the predicted value, 
- $x$ is the independent variable, and 
- $m$ and $b$ are the **coefficients** we need to optimize in order to fit the regression line to our data.

Calculating the coefficients of the equation:
- To calculate the coefficients we need the formulas for covariance and variance.

Covariance 

$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_{i} - \bar{X})(Y_{i} - \bar{Y})}{n}$

Variance

$\mathrm{Var}(X) = \frac{\sum_{i=1}^{n} (X_{i} - \bar{X})^2}{n}$

- To calculate the coefficients:
    - m = cov(x, y) / var(x)
    - b = mean(y) - m * mean(x)
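
The formulas for m and b follow from minimizing the sum of squared residuals. A short derivation sketch, writing $\bar{x}$ and $\bar{y}$ for the means of $x$ and $y$:

$$
\begin{aligned}
S(m, b) &= \sum_{i=1}^{n} (y_i - m x_i - b)^2 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} (y_i - m x_i - b) = 0 \;\Rightarrow\; b = \bar{y} - m \bar{x} \\
\frac{\partial S}{\partial m} &= -2 \sum_{i=1}^{n} x_i (y_i - m x_i - b) = 0 \\
&\Rightarrow\; \sum_{i=1}^{n} (x_i - \bar{x}) \big[ (y_i - \bar{y}) - m (x_i - \bar{x}) \big] = 0 \\
&\Rightarrow\; m = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)}
\end{aligned}
$$

The fourth line substitutes $b = \bar{y} - m\bar{x}$ and uses the fact that the bracketed terms sum to zero, so $x_i$ can be replaced by $x_i - \bar{x}$. The factor $1/n$ in the covariance and variance cancels in the ratio.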

**Functions to calculate the Mean, Covariance, and Variance.**

In [None]:
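import numpy as np

# mean 
def get_mean(arr):
    return np.sum(arr)/len(arr)

# variance (population form: divide by n, as in the formula above)
def get_variance(arr, mean):
    return np.sum((arr-mean)**2)/len(arr)

# covariance (population form: divide by n, as in the formula above)
def get_covariance(arr_x, mean_x, arr_y, mean_y):
    final_arr = (arr_x - mean_x)*(arr_y - mean_y)
    return np.sum(final_arr)/len(arr_x)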

**Function to calculate the coefficients and the Linear Regression Function**

In [None]:
# Coefficients 
# m = cov(x, y) / var(x)
# b = mean(y) - m*mean(x)

def get_coefficients(x, y):
    x_mean = get_mean(x)
    y_mean = get_mean(y)
    m = get_covariance(x, x_mean, y, y_mean)/get_variance(x, x_mean)
    b = y_mean - x_mean*m
    return m, b

In [None]:
# Linear Regression 
# Train and Test
# Train Split 80 % Test Split 20 %
from sklearn.metrics import r2_score, mean_squared_error

def linear_regression(x_train, y_train, x_test, y_test):
    prediction = []
    m, b = get_coefficients(x_train, y_train)
    for x in x_test:
        y = m*x + b
        prediction.append(y)
    
    # scikit-learn metrics expect (y_true, y_pred); R2 is not symmetric in its arguments
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    print("The R2 score of the model is: ", r2)
    print("The MSE score of the model is: ", mse)
    return prediction

# Assumes x and y are 1-D NumPy arrays of 100 observations defined earlier
prediction = linear_regression(x[:80], y[:80], x[80:], y[80:])

**Visualize the regression line**

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

def plot_reg_line(x, y):
    # Calculate predictions for x ranging from 1 to 100
    prediction = []
    m, c = get_coefficients(x, y)
    for x0 in range(1,100):
        yhat = m*x0 + c
        prediction.append(yhat)
    
    # Scatter plot without regression line
    fig = plt.figure(figsize=(20,7))
    plt.subplot(1,2,1)
    sns.scatterplot(x=x, y=y)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Scatter Plot between X and Y')
    
    # Scatter plot with regression line
    plt.subplot(1,2,2)
    sns.scatterplot(x=x, y=y, color = 'blue')
    sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Regression Plot')
    plt.show()

In [None]:
# Regression plot from seaborn
# regplot is basically the combination of the scatter plot and the line plot
sns.regplot(x=x, y=y)
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Regression Plot")
plt.show()

**Step 4. Formulate the model and Fit the Model (using library)**

- Split the Data: Divide data into training and testing sets (e.g., 80% training, 20% testing).
- Train the Model: Use a library like sklearn in Python to fit the regression model on the training data.
- Evaluate the Model: Check metrics such as $𝑅^2$ (explained variance) and RMSE (Root Mean Squared Error).

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create the dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 5, 7, 8, 10, 11, 13, 14, 16])

# Create the linear regression model
model = LinearRegression().fit(X, y)

# Get the slope and intercept of the line
slope = model.coef_
intercept = model.intercept_

# Plot the data points and the regression line
plt.scatter(X, y)
plt.plot(X, slope*X + intercept, color='red')
plt.show()


In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Example dataset (assumes `data` is a pandas DataFrame that contains these columns)
X = data[['age', 'length_of_stay', 'severity', 'medication_adherence', 'comorbidities']]
y = data['readmissions']

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
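# RMSE is the square root of MSE (the `squared=False` argument is deprecated in newer scikit-learn releases)
rmse = mean_squared_error(y_test, y_pred) ** 0.5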
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse}, R^2: {r2}")


**Regression cost functions: Regression model evaluation metrics**

A **loss function** is defined for a single training example. It is also sometimes called an error function. 

A **cost function**, on the other hand, is the average loss over the entire training dataset. 

**Steps for Loss Functions**
1. Define the predictor function f(X), and identify the parameters to find.
2. Determine the loss for each training example.
3. Derive the expression for the Cost Function, representing the average loss across all examples.
4. Compute the gradient of the Cost Function with respect to each unknown parameter.
5. Select the learning rate and execute the weight update rule for a fixed number of iterations.

These steps guide the optimization process, aiding in the determination of optimal model parameters.

The following metrics are generally used to evaluate prediction error and model performance in regression analysis.

- **R-squared (Coefficient of determination)** measures how well the predicted values fit the original values. It usually lies between 0 and 1 and can be read as the percentage of variance explained (it can be negative on held-out data when the model fits worse than predicting the mean). The higher the value, the better the model.
- **MSE (Mean Squared Error)** is the average of the squared differences between the original and predicted values over the data set.
- **RMSE (Root Mean Squared Error)** is the square root of MSE, which expresses the error in the same units as the target.
- **MAE (Mean Absolute Error)** is the average of the absolute differences between the original and predicted values over the data set.

1. Mean Error (ME)
- The error for each training example is calculated and then the mean value of all these errors is derived.
- Errors can be both negative and positive. So they can cancel each other out during summation giving zero mean error for the model.
- Not a recommended cost function but it does lay the foundation for other cost functions of regression models.
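
A tiny plain-Python example, with invented numbers, shows how positive and negative errors cancel in the mean error even though every prediction is off:

```python
y_true = [10, 20, 30, 40]
y_pred = [12, 18, 33, 37]

# Signed errors: +2, -2, +3, -3
errors = [p - t for p, t in zip(y_pred, y_true)]

mean_error = sum(errors) / len(errors)                        # signs cancel
mean_absolute_error = sum(abs(e) for e in errors) / len(errors)
mean_squared_error = sum(e ** 2 for e in errors) / len(errors)

print(mean_error, mean_absolute_error, mean_squared_error)
```

The mean error is zero here, while the absolute and squared versions both report a clearly non-zero error.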

2. Mean Squared Error (MSE)
- Also known as L2 loss.
- The square of the difference between the actual and predicted value is calculated, which prevents negative errors from cancelling positive ones (the drawback of mean error).
- It is measured as the average of the squared differences between predictions and actual observations.
- Since each error is squared, large deviations are penalized much more heavily than under MAE. 
    - If our dataset has outliers that contribute larger prediction errors, squaring magnifies those errors many times over and leads to a higher MSE.
    - As a result, MSE
        - is less robust to outliers, and
        - should be used with care if our data is prone to many outliers.

Graphically
- For linear regression, MSE is a positive quadratic function of the parameters (of the form $ax^2 + bx + c$ where $a > 0$).
- Such a function has a single global minimum. 
    - Since there are no other local minima, we will never get stuck in one. 
- Hence, if Gradient Descent converges, it converges to the global minimum.
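
For the single-feature model $y = mx + b$, the gradient of the MSE cost can be written out as follows. This is a sketch in which $N$ denotes the number of training examples and $\alpha$ the learning rate:

$$
\begin{aligned}
J(m, b) &= \frac{1}{N} \sum_{i=1}^{N} \big( y_i - (m x_i + b) \big)^2 \\
\frac{\partial J}{\partial m} &= \frac{1}{N} \sum_{i=1}^{N} -2\, x_i \big( y_i - (m x_i + b) \big) \\
\frac{\partial J}{\partial b} &= \frac{1}{N} \sum_{i=1}^{N} -2 \big( y_i - (m x_i + b) \big) \\
m &\leftarrow m - \alpha \frac{\partial J}{\partial m}, \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}
\end{aligned}
$$

Each update moves the parameters a small step against the gradient, which is the direction in which the cost decreases fastest.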

In [None]:
def update_weights_MSE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

3. Mean Absolute Error (MAE)
- Also known as L1 loss.
- The absolute error for each training example is the distance between the predicted and the actual values, irrespective of the sign.
- Taking the absolute difference prevents negative errors from cancelling positive ones.
- It is measured as the average of the absolute differences between predictions and actual observations.
    - MAE is more robust to outliers than MSE, so it gives better results when our dataset has noise or outliers.
- The cost is the mean of these absolute errors.
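
To see the difference in outlier sensitivity, the sketch below compares MSE and MAE before and after a single target value is replaced by an extreme one. The numbers are invented for the illustration.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 11.0, 13.0, 12.0])
y_pred = np.array([11.0, 11.0, 12.0, 12.0, 13.0])   # every prediction is off by 1

mse_clean = np.mean((y_true - y_pred) ** 2)
mae_clean = np.mean(np.abs(y_true - y_pred))

# Replace the last observation with an outlier: its error becomes 49.
y_outlier = y_true.copy()
y_outlier[-1] = 62.0

mse_outlier = np.mean((y_outlier - y_pred) ** 2)
mae_outlier = np.mean(np.abs(y_outlier - y_pred))

print(mse_clean, mae_clean)      # both equal 1 on the clean data
print(mse_outlier, mae_outlier)  # MSE grows far more than MAE
```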

In [None]:
import numpy as np

def update_weights_MAE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        residual = Y[i] - (m*X[i] + b)
        # Calculate partial derivatives
        # -x * sign(y - (mx + b)); np.sign(0) is 0, which avoids dividing by zero
        m_deriv += -X[i] * np.sign(residual)

        # -sign(y - (mx + b))
        b_deriv += -np.sign(residual)

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

4. Huber Loss

- The Huber loss combines the best properties of MSE and MAE.
- It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient). 
- It is identified by its delta parameter:

In [None]:
def update_weights_Huber(m, b, X, Y, delta, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # derivative of quadratic for small values and of linear for large values
        if abs(Y[i] - m*X[i] - b) <= delta:
            m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
            b_deriv += - (Y[i] - (m*X[i] + b))
        else:
            m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
            b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
    
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b
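
The update rule above only uses the gradient. The Huber loss value itself can be sketched as below, assuming the common form $\frac{1}{2}r^2$ for $|r| \le \delta$ and $\delta(|r| - \frac{1}{2}\delta)$ otherwise, which is consistent with those derivatives. The function name `huber_loss` is illustrative.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    residual = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    abs_residual = np.abs(residual)
    quadratic = 0.5 * residual ** 2
    linear = delta * (abs_residual - 0.5 * delta)
    return np.mean(np.where(abs_residual <= delta, quadratic, linear))

# One small residual (0.5, quadratic branch) and one large residual (3.0, linear branch).
loss = huber_loss([0.0, 0.0], [0.5, 3.0], delta=1.0)
print(loss)
```

A smaller `delta` makes the loss behave more like MAE, while a larger one makes it behave more like MSE.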

**Step 5: Interpret the Results**

Residual Analysis:
- Check that the residuals are approximately normally distributed.
- Homoscedasticity describes a situation in which the variance of the error term is the same across all values of the independent variables. 
    - This means that the spread of the residuals is roughly equal along the regression line.

Interpretation of Regression Output
- R-Squared: a statistical measure of fit that indicates how much variation of a dependent variable is explained by the independent variables. 
    - A higher R-Squared value represents smaller differences between the observed data and fitted values.
- P-value: a p-value below the chosen threshold (usually 0.05) indicates that a predictor's effect is statistically significant.

Interpret the Regression Equation
- The coefficients ($𝛽$) indicate the magnitude and direction of the relationship between each predictor and readmissions.
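    - Example: A coefficient of -0.5 for medication_adherence means that, with adherence measured in percentage points and all other predictors held constant, every 1-point increase in medication adherence is associated with 0.5 fewer expected readmissions.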
- The intercept ($𝛽_0$) represents the expected number of readmissions when all predictors are zero.

**Optimization technique/Strategy**

We will use Gradient Descent as an optimization strategy to find the regression line.
- Weight Update Rule

NB: Perform optimization on the training data and check its performance on a new validation data.

What is gradient descent?
- In layman's terms: 
    - It is a way of checking the ground near you and observing where the land tends to descend.
    - It gives an idea of the direction in which you should take your steps.
    - It helps models find the optimal set of parameters by iteratively adjusting them in the opposite direction of the gradient.

Mathematical terms:
- Find the best parameters ($θ_0$ and $θ_1$) for our learning algorithm.

The cost space shows how our algorithm would perform when we choose a particular value for a parameter.

Cost Function is a function that measures the performance of a model for any given data. Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.

1. Make a hypothesis with initial parameters
- Hypothesis: $h_θ(x) = θ_0 + θ_1 x$
- Parameters: $θ_0, θ_1$
2. Calculate the Cost function
- Cost Function: $J(θ_0, θ_1) = \frac{1}{2m}\sum^{m}_{i = 1} (h_θ (x^{(i)}) - y^{(i)})^2$
3. To reduce the cost function, we modify the parameters by using the Gradient descent algorithm over the given data.
- Goal: $\min_{θ_0, θ_1} J(θ_0, θ_1)$
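
The hypothesis and cost function from the steps above can be evaluated for a few candidate parameter values. This is a plain-Python sketch on a toy dataset in which $y = 2x$ exactly; the function names are illustrative.

```python
def hypothesis(theta0, theta1, x):
    """Linear hypothesis h(x) = theta0 + theta1 * x."""
    return theta0 + theta1 * x

def cost(theta0, theta1, xs, ys):
    """Half the mean squared error, J = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    return sum((hypothesis(theta0, theta1, x) - y) ** 2 for x, y in zip(xs, ys)) / (2 * m)

xs = [1, 2, 3]
ys = [2, 4, 6]

cost_perfect = cost(0, 2, xs, ys)   # the true line: cost 0
cost_half = cost(0, 1, xs, ys)      # slope too small
cost_zero = cost(0, 0, xs, ys)      # predicts 0 everywhere

print(cost_perfect, cost_half, cost_zero)
```

The cost shrinks as the parameters approach the values that generated the data, which is what gradient descent exploits.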

Gradient descent aims to find the parameters that minimize this discrepancy and improve the model’s performance.

The algorithm operates by calculating the gradient of the cost function, which indicates the direction and magnitude of the steepest ascent. 

However, since the goal is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, known as the negative gradient direction.

By iteratively updating the model’s parameters in the negative gradient direction, gradient descent gradually converges towards the set of parameters that yields the lowest cost.

- Hyperparameter: the learning rate determines the step size taken in each iteration, influencing the speed and stability of convergence.
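
The effect of the learning rate can be seen on the simplest possible cost, $J(θ) = θ^2$, whose gradient is $2θ$ and whose minimum is at $θ = 0$. The starting point, step count, and the three rates below are arbitrary choices for illustration.

```python
def run_gradient_descent(learning_rate, start=5.0, steps=25):
    """Minimize J(theta) = theta**2 with a fixed number of gradient steps."""
    theta = start
    for _ in range(steps):
        gradient = 2 * theta                 # dJ/dtheta
        theta -= learning_rate * gradient    # step against the gradient
    return theta

too_small = run_gradient_descent(0.01)   # moves slowly, still far from 0
reasonable = run_gradient_descent(0.1)   # ends close to 0
too_large = run_gradient_descent(1.1)    # overshoots and diverges

print(too_small, reasonable, too_large)
```

With a rate that is too small the parameter barely moves in 25 steps; with one that is too large every step overshoots the minimum by more than the previous error.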

Gradient descent can be applied to:
- linear regression, 
- logistic regression, 
- neural networks, and 
- support vector machines.

**Definition**: Gradient descent is an iterative optimization algorithm for finding the local minimum of a function.

To find the local minimum of a function using gradient descent, we must take steps proportional to the negative of the gradient (moving against the gradient) of the function at the current point.
- If we take steps proportional to the positive of the gradient (moving along the gradient), we will approach a local maximum of the function, and the procedure is called Gradient Ascent.

The goal of the gradient descent algorithm is to minimize the given function (say, cost function)
- it performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Make a step (move) in the direction opposite to the gradient: move from the current point by alpha (the learning rate) times the gradient at that point.

The code below defines a function called gradient_descent, which takes the training data, learning rate, and number of iterations as parameters.

Steps :
1. Sets the weights to arbitrary values and the bias to zero during initialization.
2. Runs a loop for a set number of iterations.
3. Computes the estimated y values by utilizing the existing weights and bias.
4. Calculates the discrepancy between predicted and real y values.
5. Determines the gradients of the cost function with respect to the weights and bias.
6. Adjusts the weights and bias by incorporating the gradients and learning rate.
7. Outputs the acquired weights and bias.


In [None]:
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values and bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0.0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

# Example usage
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)

How Does Gradient Descent Work?
1. The algorithm optimizes to minimize the model’s cost function.
2. The cost function measures how well the model fits the training data and defines the difference between the predicted and actual values.
3. The cost function’s gradient is the derivative with respect to the model’s parameters and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost function.
5. In each iteration of the algorithm, it computes the gradient of the cost function with respect to each parameter.
6. The gradient tells us the direction of the steepest ascent, and by moving in the opposite direction, we can find the direction of the steepest descent.
7. The learning rate controls the step size, which determines how quickly the algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating that the model has reached an optimal (possibly only locally optimal) set of parameters.
9. Different variations of gradient descent include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with advantages and limitations.
10. Efficient implementation of gradient descent is essential for performing well in machine learning tasks. The choice of the learning rate and the number of iterations can significantly impact the algorithm’s performance.
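
To make the role of the learning rate concrete, here is a minimal sketch on a one-parameter cost function. The function `minimize_quadratic` and the cost $f(w) = (w - 3)^2$ are illustrative assumptions, not part of the regression example above.

```python
def minimize_quadratic(learning_rate, num_iters=50, w_start=0.0):
    """Run gradient descent on f(w) = (w - 3)^2 and return the final w."""
    w = w_start
    for _ in range(num_iters):
        gradient = 2 * (w - 3)          # derivative of (w - 3)^2
        w -= learning_rate * gradient   # step in the direction opposite to the gradient
    return w

w_small = minimize_quadratic(learning_rate=0.1)   # small steps: moves steadily towards 3
w_large = minimize_quadratic(learning_rate=1.1)   # steps too large: overshoots and diverges

print("learning_rate=0.1 ->", w_small)
print("learning_rate=1.1 ->", w_large)
```

With the smaller rate each step shrinks the distance to the minimum, while with the larger rate each step overshoots by more than it corrects, so the iterates move away from the minimum.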

On the basis of differentiation techniques
- Gradient descent requires computing the gradient by differentiating the cost function. Methods can use either first-order or second-order derivatives.
    - First-order differentiation: uses only the gradient (standard gradient descent).
    - Second-order differentiation: also uses second derivatives, i.e., the Hessian (e.g., Newton’s method).

**Types of Gradient Descent**

Classified by two methods mainly:
- On the basis of data ingestion: choice of gradient descent algorithm depends on the problem at hand and the size of the dataset.
    - Full Batch Gradient Descent Algorithm:
        - Batch gradient descent, also known as vanilla gradient descent, uses the whole dataset at once to compute the gradient.
            - It updates the model’s parameters using the gradient of the entire training set.
        - It calculates the average gradient of the cost function over all the training examples and updates the parameters in the opposite direction.
            - It calculates the error for each example within the training dataset.
            - The model is not changed until every training sample has been assessed.
                - One full pass of this procedure is referred to as a **cycle or training epoch**.
        - For convex cost functions and a suitably small learning rate, batch gradient descent converges to the global minimum; for non-convex cost functions it may settle in a local minimum. It can be computationally expensive and slow for large datasets.
            - Batch gradient descent is suitable for small datasets.
            - Its benefit is a stable error gradient and stable convergence.
        - A drawback is that the stable error gradient can sometimes result in a state of convergence that isn’t the best the model can achieve.
            - It also requires the entire training dataset to be in memory and available to the algorithm.

Advantages
- Fewer model updates mean that this variant of the steepest descent method is more computationally efficient than the stochastic gradient descent method.
- Reducing the update frequency provides a more stable error gradient and a more stable convergence for some problems.
- Separating the prediction-error calculations from the model updates lends the algorithm to parallel-processing implementations.

Disadvantages
- A more stable error gradient can cause the model to prematurely converge to a suboptimal set of parameters.
- End-of-training epoch updates require the additional complexity of accumulating prediction errors across all training examples.
- The batch gradient descent method typically requires the entire training dataset to be in memory and available to the algorithm.
- Large datasets can result in very slow model updates or training speeds.
- Each update is slow and requires more computational power.

In [None]:
class GDRegressor:
    
    def __init__(self,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            # update all the coef and the intercept
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            #print("Shape of y_hat",y_hat.shape)
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            
            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

- Stochastic Gradient Descent Algorithm
    - In stochastic gradient descent, you take a single sample when computing the gradient.
        - It randomly selects an example from the training dataset,
            - so the parameters are updated one training example at a time.
                - The frequent updates give a fairly detailed picture of the rate of improvement. (benefit)
        - computes the gradient of the cost function for that example,
        - and updates the parameters in the opposite direction.
    - The stochastic gradient descent algorithm is more suitable for large datasets.
    - Each update is cheap to compute, and it can converge faster than batch gradient descent. However, its gradients are noisy, so the error rate fluctuates rather than decreasing steadily, and it may not converge to the global minimum.

Advantages
- You can instantly see your model’s performance and improvement rates with frequent updates.
- This variant of gradient descent is probably the easiest to understand and implement, especially for beginners.
- The increased frequency of model updates can result in faster learning on some problems.
- The noisy update process can help the model escape local minima (e.g., premature convergence).
- Each update is fast and needs little computation or memory.
- Suitable for larger datasets.

Disadvantages
- Updating the model for every example is more computationally intensive per epoch than other gradient descent configurations, so training on large datasets can take considerable time.
- Frequent updates can result in a noisy gradient signal, which may cause the model parameters, and in turn the model error, to jump around (higher variance across training epochs).
- A noisy learning process along the error gradient can also make it difficult for the algorithm to settle on the model’s minimum error.
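
The per-example update described above can be sketched from scratch for linear regression. This is a minimal illustration under stated assumptions: the function name `sgd_linear_regression`, the zero initialization, and the noiseless demo data are all hypothetical choices, not the implementation used by any library.

```python
import numpy as np

def sgd_linear_regression(X, y, learning_rate=0.01, epochs=50, seed=0):
    """Fit y ~ X @ weights + bias, updating on one example at a time."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0
    for _ in range(epochs):
        # Visit the examples in a fresh random order each epoch
        for i in rng.permutation(n_samples):
            error = y[i] - (X[i] @ weights + bias)
            # Step against the gradient of the squared error for this single example
            weights += learning_rate * 2 * error * X[i]
            bias += learning_rate * 2 * error
    return weights, bias

# Hypothetical noiseless data generated from y = 2x + 1
X_demo = np.linspace(0, 1, 20).reshape(-1, 1)
y_demo = 2 * X_demo[:, 0] + 1

weights, bias = sgd_linear_regression(X_demo, y_demo, learning_rate=0.1, epochs=200)
print("Learned weights:", weights)
print("Learned bias:", bias)
```

Because the demo data contain no noise, the learned values should end up close to the slope and intercept used to generate them; on real data the estimates keep fluctuating around the minimum unless the learning rate is decreased over time.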

In [None]:
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)

Mini-batch Gradient Descent
- Mini-batch is a good compromise between the two and is often used in practice.
- It updates the model’s parameters using the gradient computed on a small subset of the training dataset, known as a mini-batch.
- It calculates the average gradient of the cost function for the mini-batch and updates the parameters in the opposite direction.
- It is the most commonly used method in practice because it combines the ideas of batch gradient descent with SGD.
        - It strikes a balance between the efficiency of batch gradient descent and the robustness of stochastic gradient descent.
- It is computationally efficient and less noisy than stochastic gradient descent while still being able to converge to a good solution.
- Mini-batch sizes typically range from 50 to 256.

Advantages
- The model is updated more frequently than in batch gradient descent, allowing for more robust convergence and helping to avoid local minima.
- Batch updates provide a more computationally efficient process than stochastic gradient descent.
- Batching means the training data does not all have to be in memory at once, and it keeps the algorithm simple to implement.

Disadvantages
- Mini-batch requires an additional hyperparameter, the mini-batch size, to be set for the learning algorithm.
- Error information must be accumulated across each mini-batch of training samples, as in batch gradient descent.
- it will generate complex functions.

Configure Mini-Batch Gradient Descent:

- Mini-batch gradient descent is the variant of gradient descent recommended for most applications, especially deep learning.
- Mini-batch sizes, commonly called “batch sizes” for brevity, are often tailored to some aspect of the computing architecture in which the implementation is running. 
        - For example, a power of 2 that matches the memory requirements of the GPU or CPU hardware, such as 32, 64, 128, and 256.
- The batch size is a slider for the learning process.
- Smaller values allow the learning process to converge quickly at the expense of noise in the training process. Larger values result in a learning process that converges slowly but with a more accurate estimate of the error gradient.
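
One common way to form mini-batches is to shuffle the data once per epoch and slice it into consecutive chunks, so every example is used exactly once per epoch. The helper `iterate_minibatches` below is a hypothetical sketch of that idea, with made-up demo arrays.

```python
import numpy as np

def iterate_minibatches(X, y, batch_size, rng):
    """Yield (X_batch, y_batch) pairs that cover the data once in random order."""
    indices = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch_idx = indices[start:start + batch_size]
        yield X[batch_idx], y[batch_idx]

rng = np.random.default_rng(0)
X_demo = np.arange(20).reshape(10, 2)   # 10 hypothetical samples with 2 features
y_demo = np.arange(10)

batch_sizes = [len(y_batch) for _, y_batch in iterate_minibatches(X_demo, y_demo, batch_size=4, rng=rng)]
print(batch_sizes)  # the last batch is smaller because 4 does not divide 10
```

Sampling without replacement in this way differs from drawing a fresh random sample for every batch, where some examples may be seen several times in an epoch and others not at all.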

In [None]:
import random

class MBGDRegressor:
    
    def __init__(self,batch_size,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        
    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            
            for j in range(int(X_train.shape[0]/self.batch_size)):
                
                idx = random.sample(range(X_train.shape[0]),self.batch_size)
                
                y_hat = np.dot(X_train[idx],self.coef_) + self.intercept_
                #print("Shape of y_hat",y_hat.shape)
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Average over the mini-batch so the step size does not grow with batch_size
                coef_der = -2 * np.dot((y_train[idx] - y_hat),X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)
        
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

**Step 6: Use the Model for Decision-Making**

The goal is to understand which factors significantly influence readmissions.

To do this, you need a systematic approach grounded in exploratory analysis, statistical rigor, and effective communication.

1. Thinking Approach: Identifying Significant Factors
- Define the Business Objective
    - Objective: Identify key drivers of hospital readmissions (to improve patient care and optimize resource allocation)
    - Questions to Answer:
        - What are the strongest predictors of readmissions?
        - Which predictors can be influenced through policy or operational changes?
        - How much can readmissions be reduced if certain factors are addressed?

- Perform Exploratory Data Analysis (EDA)
    - Inspect Data Distributions: Use histograms and boxplots to understand the spread of variables.
    - Check Relationships:
        - Pairwise correlations for numerical variables (e.g., length_of_stay vs. readmissions).
        - Grouped summaries for categorical variables (e.g., readmissions across age groups).
        - Example Insights:
            - Patients with longer stays might have higher readmission risks.
            - Non-adherence to medication might strongly correlate with readmissions.

- Statistical Hypothesis Testing
    - Use statistical tests to confirm relationships:
        - T-tests for differences in means (e.g., medication adherence between high and low readmission groups).
        - Chi-square tests for independence between categorical variables (e.g., age group vs. readmission rates).

Example 1: Statistical Hypothesis Testing for Medication Adherence
- Objective: Determine if medication adherence significantly differs between patients who are readmitted and those who are not.
- Approach: Two-Sample t-Test
- Hypotheses: 
    - $𝐻_0$ : The mean adherence rate is the same for both groups (readmitted and not readmitted).
    - $𝐻_𝑎$ : The mean adherence rate differs between the groups.

- Steps:
    - Prepare the Data:
    - Split patients into two groups: "Readmitted" and "Not Readmitted."
    - Collect medication adherence rates for each group.

- Check Assumptions:
    - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if adherence rates are normally distributed.
    - Equal Variance: Use Levene’s test or Bartlett’s test.

- Perform the t-Test:
    - If variances are equal, use a standard t-test. If not, use Welch’s t-test.

- Interpret Results: 
    - If $𝑝 < 0.05$, reject $𝐻_0$
    - Conclude that adherence rates differ significantly.

In [None]:
from scipy.stats import ttest_ind

# Example data
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]  # Adherence rates for readmitted
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]  # Adherence rates for not readmitted

# Perform t-test
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
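
The assumption checks listed in the steps above can be sketched with SciPy as follows. This is a minimal illustration on the same toy adherence values; the 0.05 cut-off for the checks is an assumed convention, and with only five values per group these tests have very little power.

```python
from scipy.stats import shapiro, levene, ttest_ind

# Same illustrative adherence rates as in the example above
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]

alpha = 0.05  # assumed significance level for the checks

# Normality check per group (Shapiro-Wilk); a small p-value suggests non-normality
_, p_norm_readmitted = shapiro(adherence_readmitted)
_, p_norm_not_readmitted = shapiro(adherence_not_readmitted)

# Equal-variance check (Levene); a small p-value suggests unequal variances
_, p_levene = levene(adherence_readmitted, adherence_not_readmitted)
equal_var = bool(p_levene >= alpha)

# Standard t-test if the variances look equal, Welch's t-test otherwise
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted, equal_var=equal_var)

print(f"Shapiro p-values: {p_norm_readmitted:.3f}, {p_norm_not_readmitted:.3f}")
print(f"Levene p-value: {p_levene:.3f}, equal_var={equal_var}")
print(f"T-statistic: {t_stat:.3f}, P-value: {p_value:.4f}")
```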

Example 2: Statistical Hypothesis Testing for Age Group vs. Readmission Rates
- Objective: Test if age group (categorical variable) is independent of readmission status.
- Approach: Chi-Square Test of Independence
- Hypotheses:
    - $𝐻_0$ : Age group is independent of readmission status.
    - $𝐻_𝑎$ : Age group and readmission status are dependent.

- Steps:
    - Create a Contingency Table:
        - Rows: Age groups (e.g., <40, 40–60, >60).
        - Columns: Readmission status (e.g., Yes, No).

- Perform the Chi-Square Test:

- Interpret Results:
    - If $p < 0.05$, reject $H_0$
    - Conclude that age group and readmission status are associated.

In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table
table = np.array([[50, 200], [70, 230], [100, 300]])

# Perform Chi-Square Test
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Chi2 Statistic: {chi2}, P-value: {p_value}")
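
To show what the test computes, the statistic can be reproduced by hand with NumPy: expected counts come from the row and column totals, and the statistic sums the squared deviations scaled by the expected counts. This is a sketch on the same illustrative table; the variable names are assumptions.

```python
import numpy as np

# Same illustrative contingency table: rows are age groups, columns are readmission status
observed = np.array([[50, 200], [70, 230], [100, 300]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if age group and readmission status were independent
expected_counts = row_totals @ col_totals / grand_total

# Pearson chi-square statistic and its degrees of freedom
chi2_stat = ((observed - expected_counts) ** 2 / expected_counts).sum()
degrees_of_freedom = (observed.shape[0] - 1) * (observed.shape[1] - 1)

print("Expected counts:\n", expected_counts.round(2))
print(f"Chi2 statistic: {chi2_stat:.3f}, degrees of freedom: {degrees_of_freedom}")
```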

Example 3: Statistical Hypothesis Testing for Length of Stay (LOS)
- Objective: Compare Average LOS for Readmitted vs. Not Readmitted Patients
- Approach: Two-Sample t-Test
    - $𝐻_0$ : The mean LOS is the same for readmitted and non-readmitted patients.
    - $𝐻_𝑎$ : The mean LOS differs.
- Steps:
    - Prepare the Data:
    - Split patients into two groups: "Readmitted" and "Not Readmitted."
    - Collect the length of stay for each group.

- Check Assumptions:
    - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if lengths of stay are normally distributed.
    - Equal Variance: Use Levene’s test or Bartlett’s test.

- Perform the t-Test:
    - If variances are equal, use a standard t-test. If not, use Welch’s t-test.

- Interpret Results: 
    - If $𝑝 < 0.05$, reject $𝐻_0$
    - Conclude that the mean LOS differs significantly between the groups.

Example 4: Relationship Between LOS and Readmission Rate
- Approach: ANOVA (Analysis of Variance)
- Objective: Check if LOS groups (<3 days, 3–7 days, >7 days) have significantly different readmission rates.
- Hypotheses: 
    - $𝐻_0$ : The mean readmission rate is the same across all LOS groups.
    - $𝐻_𝑎$ : At least one group differs.
- Steps:
    - Group the Data:
        - Divide LOS into groups.
        - Calculate readmission rates for each group.
- Perform ANOVA:
- Interpret Results:
    - If $p < 0.05$, reject $H_0$
    - Conclude that LOS impacts readmission rates.

In [None]:
from scipy.stats import f_oneway

# Example data
readmission_short = [0.1, 0.12, 0.08, 0.15]  # Readmission rates for <3 days
readmission_medium = [0.2, 0.22, 0.25, 0.18]  # Readmission rates for 3–7 days
readmission_long = [0.35, 0.4, 0.38, 0.42]  # Readmission rates for >7 days

# Perform ANOVA
f_stat, p_value = f_oneway(readmission_short, readmission_medium, readmission_long)
print(f"F-statistic: {f_stat}, P-value: {p_value}")



- Build and Interpret a Regression Model
    - Fit the Linear Regression model to identify significant predictors:
    - Check p-values of coefficients: Variables with p-values below a chosen threshold (e.g., 0.05) are statistically significant.
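    - Evaluate effect size: Large coefficients indicate strong influence on the target, but coefficients are only directly comparable when the predictors are on the same scale (e.g., after standardization).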
    - Test for interaction effects, such as how length_of_stay and severity jointly influence readmissions.

- Refine the Model
    - Handle multicollinearity: Use Variance Inflation Factor (VIF) to remove or combine highly correlated predictors.
    - Validate the model: Perform cross-validation to ensure robustness.
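
The VIF of a predictor is $1 / (1 - R^2)$, where $R^2$ comes from regressing that predictor on all the others. Below is a minimal NumPy sketch; the function `variance_inflation_factors` and the simulated `los`, `severity`, and `age` columns are hypothetical and only serve to show one collinear pair and one unrelated predictor.

```python
import numpy as np

def variance_inflation_factors(X):
    """Return the VIF of each column of X (predictors only, no intercept column)."""
    n_samples, n_features = X.shape
    vifs = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n_samples), others])
        # Regress predictor j on the remaining predictors (with an intercept)
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs.append(1 / (1 - r_squared))
    return np.array(vifs)

# Simulated, made-up predictors
rng = np.random.default_rng(0)
los = rng.normal(5, 2, size=200)
severity = 0.9 * los + rng.normal(0, 0.5, size=200)   # strongly related to LOS
age = rng.normal(60, 10, size=200)                    # unrelated to the others

vifs = variance_inflation_factors(np.column_stack([los, severity, age]))
print(dict(zip(["los", "severity", "age"], vifs.round(2))))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic multicollinearity; in this simulation the two related predictors should exceed that range while the unrelated one stays near 1.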

This will help the institute to:
- Improve medication adherence programs for high-risk patients.
- Extend hospital stays for patients with severe conditions if needed.
- Schedule follow-up visits more effectively to minimize readmission risks.

Example 5: Predicting Readmissions Based on LOS
- Approach: Linear Regression
- Objective: Use regression to predict readmissions based on LOS and other predictors.

##### Linear Regression Helps Solve This Problem
- Quantifies Relationships: Identifies and quantifies the factors contributing to readmissions.
- Predicts Outcomes: Provides actionable predictions to guide healthcare interventions.
- Allocates Resources: Helps prioritize patients who need more attention post-discharge.
- Supports Policy Changes: Enables data-driven policy improvements in patient care.

In [None]:
import statsmodels.api as sm

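# Example data
X = [2, 4, 6, 8, 10]  # LOS
y = [0, 1, 0, 1, 1]  # Readmission (0 = No, 1 = Yes)

# Add constant for intercept
X = sm.add_constant(X)

# The outcome here is binary, so a logistic model (sm.Logit) is fitted;
# for a continuous outcome such as a readmission rate, sm.OLS(y, X) would be the linear regression.
model = sm.Logit(y, X).fit()
print(model.summary())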

2. Presenting Findings to Senior Management and Board
- Tailor Communication to the Audience
    - Senior management: Focus on actionable insights, resource implications, and patient care improvements.
    - Board of directors: Emphasize high-level trends, financial impacts, and alignment with strategic goals.

- Structure of Presentation
    - Introduction
        - Start with the context: "Readmission rates are a critical indicator of hospital performance and patient care quality."
        - Summarize the objective: "This study identifies key factors driving readmissions and proposes targeted interventions."

    - Key Findings
        - Use visuals like 
            - bar charts, 
            - scatter plots, and 
            - regression coefficient tables:
                - Example: "Medication adherence has the strongest inverse relationship with readmissions. A 10% increase in adherence reduces readmissions by 5%."
            - Highlight statistical significance:
                - "Length of stay and severity are significant at p < 0.05, confirming their importance."
    
    - Implications
        - Show real-world impact: "Addressing non-adherence could prevent ~300 readmissions annually, saving $1.2M in costs."
        - Prioritize recommendations: "Focus on medication adherence programs, especially for older patients with comorbidities."

    - Actionable Recommendations
        - Immediate Steps:
            - Develop a post-discharge follow-up protocol for high-risk groups.
            - Launch an adherence monitoring program.
        - Future Research:
            - Investigate additional factors like social determinants of health.

    - Conclusion
        - Reinforce value: "By addressing these factors, we can improve patient outcomes, meet regulatory benchmarks, and reduce financial strain."

- Tools for Communication
    - Visual Dashboards: Create dashboards showing predicted readmissions, trends over time, and "what-if" scenarios.
    - Executive Summaries: Provide concise summaries with high-impact visuals and key takeaways.
    - Financial Impact Models: Quantify cost savings or ROI of proposed interventions.

3. Example Insights and Visualizations
Insight Example: Medication Adherence
    - Insight: "Medication adherence has a strong negative correlation with readmissions ($R = -0.65$). A 10% increase in adherence is associated with a 5% reduction in readmissions."

Visualization:
    - A bar chart comparing adherence rates and average readmissions.
    - Regression coefficient chart showing the magnitude of influence.

Insight Example: Length of Stay
    - Insight: "Patients with hospital stays >7 days are 2x more likely to be readmitted within 30 days."

Visualization:
    - Scatter plot: length_of_stay vs. readmissions.
    - Box plot: Readmission rates by length-of-stay categories.

4. Implementation Plan
Once the board approves, focus on operationalizing findings:

- Deploy targeted interventions for high-risk patients.
- Set KPIs to monitor the effectiveness of changes.
- Continuously refine the model based on new data.

##### Set KPIs to monitor the effectiveness of changes

**KPI 1: 30-Day Readmission Rate**
- Definition: Percentage of patients readmitted to the hospital within 30 days of discharge.
- Why Important: This is the primary metric to assess whether interventions are reducing readmissions.
- Formula: $\text{Readmission Rate} = \frac{\text{Number of patients readmitted within 30 days}}{\text{Total number of discharged patients}} \times 100$
- Target: A reduction in the readmission rate over time indicates success.

**KPI 2: Medication Adherence Rate**
- Definition: Percentage of patients adhering to their prescribed medications post-discharge.
- Why Important: Non-adherence is a leading cause of readmissions. Monitoring this ensures interventions like counseling and follow-ups are effective.
- Formula: $\text{Medication Adherence Rate} = \frac{\text{Number of patients adhering to medications}}{\text{Total number of patients}} \times 100$
- Target: An increase in adherence correlates with better outcomes and fewer readmissions.

**KPI 3: Follow-Up Appointment Compliance**
- Definition: Percentage of discharged patients attending follow-up appointments within the recommended time frame.
- Why Important: Follow-up visits can identify issues early and prevent readmissions.
- Formula: $\text{Compliance Rate} = \frac{\text{Number of attended follow-ups}}{\text{Number of scheduled follow-ups}} \times 100$
- Target: High compliance indicates improved patient engagement.

**KPI 4: Average Length of Stay (LOS)**
- Definition: Average number of days patients spend in the hospital.
- Why Important: Shorter stays can indicate efficiency but might increase readmissions if patients are discharged prematurely.
- Formula: $\text{Average LOS} = \frac{\text{Total inpatient days}}{\text{Number of discharges}}$
- Target: Maintain an optimal LOS that balances cost and readmission prevention.

**KPI 5: Percentage of High-Risk Patients Identified**
- Definition: Proportion of discharged patients flagged as high-risk for readmission and targeted for interventions.
- Why Important: Monitoring ensures that predictive models and risk stratification tools are working effectively.
- Formula: $\text{High-Risk Patients Identified} = \frac{\text{Number of flagged high-risk patients}}{\text{Total number of discharged patients}} \times 100$
- Target: Increase the identification rate while reducing actual readmissions.
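
As a rough sketch, the five KPIs can be computed from simple counts. Every number below is made up for illustration, and the ratios take each rate as "part over whole" (for example, attended follow-ups over scheduled follow-ups, and inpatient days per discharge for LOS).

```python
# Hypothetical monthly counts; every number is made up for illustration
discharged_patients = 400
readmitted_within_30_days = 60
patients_adhering = 300
followups_scheduled = 350
followups_attended = 280
total_inpatient_days = 2000
flagged_high_risk = 100

def percentage(part, whole):
    """Return part / whole expressed as a percentage."""
    return part / whole * 100

kpis = {
    "readmission_rate_pct": percentage(readmitted_within_30_days, discharged_patients),
    "adherence_rate_pct": percentage(patients_adhering, discharged_patients),
    "followup_compliance_pct": percentage(followups_attended, followups_scheduled),
    "average_los_days": total_inpatient_days / discharged_patients,
    "high_risk_identified_pct": percentage(flagged_high_risk, discharged_patients),
}

for name, value in kpis.items():
    print(f"{name}: {value:.1f}")
```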

##### Presenting KPIs to Stakeholders

**Visual Presentation**

Use dashboards and visualizations:
- Bar charts to compare readmission rates before and after interventions.
- Line graphs showing trends over time for medication adherence and follow-up compliance.
- Heatmaps for condition-specific readmission trends.

Narrative
- Highlight success: "We reduced the 30-day readmission rate from 18% to 12%, saving $500,000 annually."
- Focus on actionable insights: "Medication adherence programs have been effective, with a 15% increase in adherence leading to a 5% drop in readmissions."

Recommendations
- Continue monitoring these KPIs for sustained improvements.
- Scale successful interventions to other patient groups or hospitals.

## 2. Multiple Linear Regression:

What it means:
- It is used when two or more independent variables influence the dependent variable.

- A linear equation defines the relationship, with the coefficients of the independent variables representing the effect of each variable on the dependent variable.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm

data = pd.read_csv('data.csv') # read data from csv file
X = data[['Independent_Var_1', 'Independent_Var_2', 'Independent_Var_3']] # select independent variables
Y = data['Dependent_Var'] # select dependent variable

# Add a constant to the independent variable set
X = sm.add_constant(X)

# Fit the multiple linear regression model (ordinary least squares)
model = sm.OLS(Y, X).fit()

# Print model summary
print(model.summary())


## 3. Logistic Regression

Logistic regression is a statistical model used for binary classification tasks.
- the outcome variable is categorical with two possible values (e.g., 1/0, Yes/No, Positive/Negative). 

It predicts the probability of an event occurring, transforming the linear combination of predictors through a logistic function (sigmoid function) to ensure the predicted probabilities lie between 0 and 1.

What It Means: 
- Logistic regression estimates the probability of a binary outcome (e.g., yes/no, success/failure) based on predictor variables. 
    - It uses a logistic function to map predictions to probabilities between 0 and 1.

- It is a statistical technique for investigating the relationship between a binary dependent variable (outcome) and one or more independent variables (predictors). 

- The goal of logistic regression is to find the best-fitting model to describe the relationship between the dependent variable and the independent variables and then use that model to predict the outcome variable.

Outcome Interpretation: 
- The model outputs probabilities that can be converted to binary outcomes. 
- Coefficients show how each predictor variable influences the likelihood of the outcome.

Performance Measures:
- Accuracy: Proportion of correct predictions.
- AUC-ROC: Measures the model's ability to distinguish between classes; values closer to 1 indicate a better model.

Lay Explanation: 
- Logistic regression is like a yes-or-no decision helper. It estimates the chances of an event happening (e.g., a customer buying a product) based on known factors.
- It tries to find the best-fitted curve for the data

Use Case: Used for binary classification (e.g., churn prediction, fraud detection).

Model Equation: 
$P(y=1) = \frac{1}{1 + e^{-(\beta_{0} + \beta_{1}x_{1} + \dots + \beta_{n}x_{n})}}$
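
The equation can be evaluated directly: compute the linear score, then pass it through the logistic (sigmoid) function. The function `predict_probability` and the coefficient values below are hypothetical, chosen only to show the mapping from score to probability.

```python
import numpy as np

def predict_probability(x, intercept, coefficients):
    """Return P(y = 1) for one row of predictors under a logistic model."""
    linear_score = intercept + np.dot(coefficients, x)
    return 1 / (1 + np.exp(-linear_score))

# Hypothetical coefficients for two predictors (illustrative values only)
intercept = -1.0
coefficients = np.array([0.5, -2.0])

p_zero_score = predict_probability(np.array([2.0, 0.0]), intercept, coefficients)   # linear score = 0
p_high_score = predict_probability(np.array([10.0, 0.0]), intercept, coefficients)  # linear score = 4

print(f"Score 0 -> probability {p_zero_score:.3f}")
print(f"Score 4 -> probability {p_high_score:.3f}")
```

A linear score of zero maps to a probability of exactly 0.5, and larger scores approach 1 without ever reaching it.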

### Problem Statement

Objective:
- For the medical institute, we want to estimate the likelihood of patients being readmitted within 30 days of discharge based on patient 
    - demographics, 
    - medical history, 
    - length of stay (LOS), and 
    - clinical metrics such as blood pressure, blood glucose levels, and medication adherence.

Why Logistic Regression?

Logistic regression is ideal for this problem because:
- Binary Outcome: The target variable is binary: Readmitted (1) or Not Readmitted (0).
- Interpretability: It provides coefficients (log odds) that indicate how changes in predictors affect the likelihood of the event (readmission).
- Insights: It helps identify the significant factors influencing readmissions.

**Key Assumptions of Logistic Regression**
- Binary Outcome: The dependent variable is binary.
- Independence of Observations: Observations are independent of each other.
- Linearity of Log-Odds: There is a linear relationship between the log-odds of the outcome and the independent variables.
- No Multicollinearity: Independent variables are not highly correlated.
- Large Sample Size: Logistic regression performs well with larger datasets.

**Step 1: Define the Problem**
- Target Variable: Readmission within 30 days (1 = Yes, 0 = No).
- Predictors:
    - Patient Demographics: Age, gender, insurance status.
    - Clinical Metrics: Blood glucose levels, blood pressure, medication adherence.
    - Hospital Metrics: Length of Stay (LOS), number of previous visits.

**Step 2: Collect and Prepare Data**
- Gather historical patient data and ensure it's clean and consistent.
    - Check for Missing Data:
        - Impute missing values for predictors like glucose levels using median or mean.
    - Standardize Continuous Variables:
        - Standardize LOS, glucose levels, and blood pressure for consistency.
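
The two preparation steps above can be sketched with pandas as follows. The `patients` frame and its values are hypothetical, and median imputation is just one of the two options mentioned.

```python
import numpy as np
import pandas as pd

# Hypothetical patient records with one missing glucose value
patients = pd.DataFrame({
    "los": [3, 7, 4, 2, 10],
    "glucose": [150.0, 200.0, np.nan, 140.0, 220.0],
})

# Impute the missing glucose value with the median of the observed values
patients["glucose"] = patients["glucose"].fillna(patients["glucose"].median())

# Standardize each continuous column to mean 0 and standard deviation 1
continuous = ["los", "glucose"]
standardized = (patients[continuous] - patients[continuous].mean()) / patients[continuous].std()

print(patients)
print(standardized.round(3))
```

When the data are later split into training and test sets, the median, mean, and standard deviation would normally be computed on the training portion only and then reused for the test portion.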

In [None]:
# Example dataset (five rows for illustration only; too few to fit a reliable model)
data = pd.DataFrame({
    'age': [45, 60, 50, 40, 70],
    'los': [3, 7, 4, 2, 10],
    'glucose': [150, 200, 180, 140, 220],
    'med_adherence': [0.8, 0.6, 0.75, 0.9, 0.5],
    'readmitted': [1, 1, 0, 0, 1]
})

# Features and target
X = data[['age', 'los', 'glucose', 'med_adherence']]
y = data['readmitted']

# Add constant for intercept
X = sm.add_constant(X)

**Step 3: Exploratory Data Analysis**
- Univariate Analysis: Examine distributions of continuous variables.
- Bivariate Analysis: Analyze relationships between predictors and the target variable.
- Correlation Matrix: Identify multicollinearity among predictors.

**Step 4: Perform Logistic Regression**

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

In [None]:
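from sklearn.linear_model import LogisticRegression

# Hold out a test set (assumes X and y come from a dataset large enough to split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)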

In [None]:
model = sm.Logit(y, X).fit()
print(model.summary())

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the data
data = pd.read_csv('data.csv')

# Split the data into training and testing sets
train = data[:800]
test = data[800:]

# Define the independent variables
X_train = train[['age', 'gender', 'income']]
X_test = test[['age', 'gender', 'income']]

# Define the dependent variable
y_train = train['buy_product']
y_test = test['buy_product']

# Fit the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict the outcomes for the test data
y_pred = logreg.predict(X_test)

# Evaluate the performance of the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

**Step 5: Interpret Coefficients and Evaluate the Model**

- Log Odds: Each coefficient represents the change in log odds of readmission for a unit increase in the predictor.
- Odds Ratios: Use np.exp(model.params) to convert coefficients to odds ratios.
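
As a worked illustration of the conversion, the sketch below exponentiates a single coefficient and an approximate 95% confidence interval. The coefficient and standard error are hypothetical values, not output from the model above.

```python
import numpy as np

# Hypothetical fitted coefficient (log odds per extra day of stay) and its standard error
los_coefficient = 0.25
los_std_error = 0.08

# Odds ratio: multiplicative change in the odds for a one-unit increase in LOS
odds_ratio = np.exp(los_coefficient)

# Approximate 95% interval, built on the log-odds scale and then exponentiated
ci_low = np.exp(los_coefficient - 1.96 * los_std_error)
ci_high = np.exp(los_coefficient + 1.96 * los_std_error)

print(f"Odds ratio: {odds_ratio:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f})")
```

An odds ratio above 1 means the odds of readmission rise with the predictor, below 1 means they fall, and an interval that excludes 1 corresponds to a coefficient that differs significantly from zero at roughly the 5% level.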

1. Accuracy
2. Confusion Matrix
3. ROC Curve and AUC

In [None]:
y_pred = model.predict(X) > 0.5
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")

In [None]:
cm = confusion_matrix(y, y_pred)
print(cm)

In [None]:
fpr, tpr, _ = roc_curve(y, model.predict(X))
auc = roc_auc_score(y, model.predict(X))
print(f"AUC: {auc}")

**Understanding Factors Significantly Influencing Readmission**

1. Use p-values from the logistic regression summary:
- Predictors with $𝑝< 0.05$ are statistically significant.
2. Assess the odds ratios:
- For example, if the odds ratio for LOS is 2.0, each additional day in the hospital doubles the odds of readmission.
3. Visualize relationships:
- Plot odds ratios for key predictors to present to stakeholders.

**Statistical Hypothesis Testing**

Example 1: Relationship Between LOS and Readmission
- Hypotheses:
    - $𝐻_0$: LOS has no effect on readmission.
    - $𝐻_𝑎$: LOS has a significant effect on readmission.
- Approach: Fit a logistic regression model and check the p-value of the LOS coefficient.

Example 2: Age Group vs. Readmission
- Hypotheses:
    - $𝐻_0$: Age group is independent of readmission.
    - $𝐻_𝑎$: Age group and readmission are dependent.
- Approach: Use a Chi-Square test of independence (see previous example).

**Actionable Insights**
- Highlight key factors significantly influencing readmission (e.g., LOS, medication adherence).
- Use odds ratios to explain how much each factor increases or decreases the likelihood of readmission.
- Present findings visually (e.g., bar charts for odds ratios, ROC curves for model performance).


##### 3. Generalized Linear Models (GLMs)
What It Means: 
- GLMs extend linear regression to outcomes that follow other distributions from the exponential family, for example:
    - Poisson for count data,
    - binomial for binary data.
- A link function connects the mean of the outcome variable to a linear combination of the predictors.

Outcome Interpretation: 
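- The coefficients describe how each predictor shifts the mean outcome on the scale of the link function. With a log link (Poisson), exponentiating a coefficient gives the multiplicative change in the expected count per unit increase in the predictor.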
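
The role of the link function can be sketched for a Poisson model with a log link. The intercept and coefficient below are hypothetical values, not fitted results; the point is only that a one-unit change in the predictor multiplies the expected count by the exponentiated coefficient.

```python
import math

# Hypothetical Poisson regression coefficients (log link); illustrative values only
intercept = 0.5
coef_x = 0.2

def expected_count(x):
    """Mean of the outcome: the inverse log link applied to the linear predictor."""
    return math.exp(intercept + coef_x * x)

# A one-unit increase in x multiplies the expected count by exp(coef_x)
rate_ratio = math.exp(coef_x)
print(f"Expected count at x=3: {expected_count(3):.3f}")
print(f"Expected count at x=4: {expected_count(4):.3f}")
print(f"Rate ratio per unit of x: {rate_ratio:.3f}")
```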

Performance Measures:
- Deviance: Measures how far the model's fit is from that of a saturated model, which fits every observation exactly; lower values are better.
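
As a rough illustration of deviance, the sketch below computes the Poisson deviance for two sets of fitted means. The observed counts and fitted values are hypothetical, and `poisson_deviance` is an illustrative helper, not a library function.

```python
import math

def poisson_deviance(y_true, mu_pred):
    """Poisson deviance: 2 * sum(y * log(y / mu) - (y - mu)), with y * log(y / mu) = 0 when y = 0."""
    total = 0.0
    for y, mu in zip(y_true, mu_pred):
        log_term = y * math.log(y / mu) if y > 0 else 0.0
        total += log_term - (y - mu)
    return 2 * total

# Hypothetical observed counts and two sets of fitted means
counts = [0, 2, 3, 5]
good_fit = [0.5, 2.0, 3.5, 4.5]
poor_fit = [2.5, 2.5, 2.5, 2.5]

print(f"Deviance (good fit): {poisson_deviance(counts, good_fit):.3f}")
print(f"Deviance (poor fit): {poisson_deviance(counts, poor_fit):.3f}")
```

Fitted means that reproduce the observations exactly give a deviance of zero, which is the saturated-model baseline.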

Lay Explanation: 
- GLMs are like flexible versions of linear regression that can handle different data types (like counts or binary data), giving predictions that respect the data’s nature.

Use Case: 
- Extends linear regression for non-normal distributions (e.g., Poisson regression for count data).

Model Types: 
- Poisson regression, 
- Binomial regression.


In [None]:
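import statsmodels.api as sm
# statsmodels does not add an intercept automatically, so add one explicitly
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
poisson_model = sm.GLM(y_train, X_train_const, family=sm.families.Poisson()).fit()
predictions = poisson_model.predict(X_test_const)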

##### 4. Time Series Models (e.g., ARIMA)
What It Means: 
- Time series models account for:
    - trends, 
    - seasonality, and 
    - temporal dependencies in data collected over time, often used for forecasting future values.

Outcome Interpretation: 
- Each prediction is based on patterns in past data points, accounting for recent trends and cycles.

Performance Measures:
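- Mean Absolute Percentage Error (MAPE): Shows the average absolute prediction error as a percentage of the actual values.
- Root Mean Squared Error (RMSE): Measures the typical size of the prediction errors in the units of the data, penalizing large errors more heavily; lower values mean better predictions.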
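
Both measures are simple to compute by hand. The following sketch uses hypothetical actual values and forecasts; `mape` and `rmse` are illustrative helper names.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean squared error, in the units of the data."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))

# Hypothetical actual values and forecasts
actual = [100, 200, 300]
forecast = [110, 190, 300]
print(f"MAPE: {mape(actual, forecast):.2f}%")
print(f"RMSE: {rmse(actual, forecast):.2f}")
```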

Lay Explanation: 
- Time series models are like weather forecasts—they predict future values based on past patterns, like trends and cycles.

Use Case: 
- Forecasting for data with a temporal component (e.g., sales data, stock prices).

Model Types: 
- ARIMA, 
- SARIMA, 
- Exponential Smoothing.

In [None]:
from statsmodels.tsa.arima.model import ARIMA
# order=(p, d, q): autoregressive lags, differencing order, moving-average lags
model = ARIMA(time_series_data, order=(1,1,1))
model_fit = model.fit()
predictions = model_fit.forecast(steps=10)

##### 5. Decision Trees and Random Forests
What It Means: 
- Decision trees split data based on conditions, creating branches that lead to a prediction. 
- Random forests use multiple trees to improve accuracy and reduce overfitting.

Outcome Interpretation: 
- Each "branch" shows how a sequence of conditions leads to an outcome.
- Random forests average the predictions of many trees (or take a majority vote) for more robust predictions.

Performance Measures:
- Accuracy: Proportion of correctly classified samples.
- Gini Index / Entropy: Measure the impurity of a node and are used to choose splits; lower values mean purer nodes.
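
A short sketch of the two impurity measures, computed from the class counts in a node. The counts are hypothetical; a perfectly mixed two-class node gives the maximum impurity and a pure node gives zero.

```python
import math

def gini(class_counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(class_counts)
    return 1 - sum((c / total) ** 2 for c in class_counts)

def entropy(class_counts):
    """Shannon entropy in bits; empty classes contribute nothing."""
    total = sum(class_counts)
    return sum(-(c / total) * math.log2(c / total) for c in class_counts if c > 0)

# Hypothetical class counts in three candidate nodes
for counts in ([5, 5], [8, 2], [10, 0]):
    print(counts, f"gini={gini(counts):.3f}", f"entropy={entropy(counts):.3f}")
```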

Lay Explanation: 
- Decision trees are like flowcharts that guide predictions based on conditions. 
- Random forests combine many trees to make stronger, more reliable decisions.

Use Case: 
- For classification or regression problems with non-linear relationships and high dimensionality.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
predictions_tree = tree_model.predict(X_test)
predictions_rf = rf_model.predict(X_test)


##### 6. Support Vector Machines (SVM)
What It Means: 
- SVMs classify data by finding the best “boundary” (hyperplane) that separates classes with the widest possible margin.

Outcome Interpretation: 
- Data points on either side of the boundary belong to different classes, with "support vectors" helping to define the boundary.
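
For a linear SVM, the boundary can be sketched as the set of points where a weighted sum of the features plus a bias equals zero. The weights and bias below are hypothetical values standing in for an already-fitted model; points whose score is exactly +1 or -1 lie on the margin, which is where support vectors sit.

```python
import math

# Hypothetical weights and bias of an already-fitted linear SVM (illustrative values)
w = (3.0, 4.0)
b = -12.0

def decision_value(x):
    """Signed score w . x + b; its sign gives the predicted side of the boundary."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x):
    return 1 if decision_value(x) >= 0 else -1

# Width of the margin between the two supporting hyperplanes: 2 / ||w||
margin_width = 2 / math.sqrt(sum(wi ** 2 for wi in w))

for point in [(4.0, 1.0), (1.0, 1.0), (3.0, 1.0)]:
    print(point, decision_value(point), predict(point))
print(f"Margin width: {margin_width:.2f}")
```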

Performance Measures:
- Accuracy: Proportion of correct classifications.
- Precision and Recall: Especially useful when classes are imbalanced; precision is the share of positive predictions that are correct, and recall is the share of actual positives that the model finds.

Lay Explanation: 
- SVMs are like drawing a line to separate different groups, ensuring the groups are as distinct as possible with the help of a few key points.

Use Case: 
- Used for classification and regression in high-dimensional spaces, often for non-linearly separable data.

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)


##### 7. Clustering Models (e.g., K-Means)
What It Means: 
- Clustering groups similar data points together without predefined labels, often used for segmenting customers or finding patterns.

Outcome Interpretation: 
- Each cluster represents a natural grouping in the data, with data points in the same cluster sharing similar characteristics.

Performance Measures:
- Silhouette Score: Measures how well each point fits within its cluster; values closer to 1 indicate better-defined clusters.
- Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters; lower values are better.
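
WCSS can be computed directly from the points and their cluster labels. The sketch below uses hypothetical 2-D points and compares a sensible assignment with a poor one; `wcss` is an illustrative helper name.

```python
def wcss(points, labels):
    """Sum of squared distances from each point to the centroid of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    total = 0.0
    for members in clusters.values():
        centroid = [sum(coords) / len(members) for coords in zip(*members)]
        for point in members:
            total += sum((p - c) ** 2 for p, c in zip(point, centroid))
    return total

# Hypothetical 2-D points and two candidate cluster assignments
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
tight_labels = [0, 0, 1, 1]
loose_labels = [0, 1, 0, 1]

print(f"WCSS (tight clusters): {wcss(points, tight_labels)}")
print(f"WCSS (loose clusters): {wcss(points, loose_labels)}")
```

A fitted scikit-learn `KMeans` object reports the same quantity as its `inertia_` attribute.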

Lay Explanation: 
- Clustering is like sorting items into bins based on similarity, helping us identify groups in our data.

Use Case: 
- To group similar observations without predefined labels.

Model Types: 
- K-Means, 
- Hierarchical Clustering, 
- DBSCAN.

In [None]:
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(X)

##### 8. Principal Component Analysis (PCA)
What It Means: 
- PCA reduces the number of variables in the data by finding combinations of variables that capture the most information (variance).

Outcome Interpretation: 
- Each "principal component" explains a percentage of the total variance, helping simplify the data without losing much information.

Performance Measures:
- Explained Variance Ratio: Shows how much information each principal component holds; higher is better.
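
To show where the explained variance ratio comes from, the sketch below handles the two-feature case by hand: it builds the sample covariance matrix, takes its two eigenvalues in closed form, and divides each by their sum. The data are hypothetical and the helper `explained_variance_ratio_2d` is illustrative; for real data, scikit-learn's `PCA` exposes `explained_variance_ratio_` after fitting.

```python
import math

def explained_variance_ratio_2d(x, y):
    """Explained variance ratios of the two principal components of 2-D data."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sample covariance matrix entries (divide by n - 1)
    sxx = sum((a - mean_x) ** 2 for a in x) / (n - 1)
    syy = sum((b - mean_y) ** 2 for b in y) / (n - 1)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix in closed form
    trace = sxx + syy
    det = sxx * syy - sxy ** 2
    root = math.sqrt(max(trace ** 2 / 4 - det, 0.0))
    eigenvalues = [trace / 2 + root, trace / 2 - root]
    return [value / trace for value in eigenvalues]

# Hypothetical, strongly correlated features
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]
ratios = explained_variance_ratio_2d(x, y)
print([round(r, 4) for r in ratios])
```

Because the two features are almost perfectly correlated, nearly all of the variance falls on the first component.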

Lay Explanation: 
- PCA is like summarizing a book by keeping only the most important points, making data easier to work with without losing key insights.

Use Case: 
- Dimensionality reduction while retaining the most critical information.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)

##### 9. Bayesian Models
What It Means: 
- Bayesian models combine prior knowledge or beliefs with the data and update the probability of outcomes as new evidence becomes available.

Outcome Interpretation: 
- Each output is a probability distribution reflecting both prior knowledge and the new data, offering a range of likely outcomes.

Performance Measures:
- Log-Likelihood: Measures how well the model explains the data; higher values indicate better fit.

Lay Explanation: 
- Bayesian models are like revising a guess based on new evidence—updating beliefs as we get more information.
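
The idea of revising a belief can be sketched with the simplest conjugate case: a Beta prior on a success probability updated with binomial data. The prior parameters and the observed counts are hypothetical, and the helper names are illustrative.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data (conjugate update)."""
    return prior_a + successes, prior_b + failures

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Hypothetical prior belief about a success rate: Beta(2, 2), centred on 0.5
prior_a, prior_b = 2, 2
# Hypothetical new evidence: 7 successes and 3 failures
post_a, post_b = update_beta(prior_a, prior_b, successes=7, failures=3)

print(f"Prior mean: {beta_mean(prior_a, prior_b):.3f}")
print(f"Posterior: Beta({post_a}, {post_b}), mean {beta_mean(post_a, post_b):.3f}")
```

The posterior mean lies between the prior mean and the observed success rate, and it moves closer to the data as more observations arrive.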

Use Case: 
- To incorporate prior knowledge and quantify uncertainty.

Model Types: 
- Bayesian Linear Regression, 
- Bayesian Networks.



In [None]:
import pymc as pm  # PyMC v4+; older releases were imported as pymc3

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sigma=1)
    beta = pm.Normal('beta', mu=0, sigma=1, shape=len(X_train.columns))
    epsilon = pm.HalfNormal('epsilon', sigma=1)
    mu = alpha + pm.math.dot(X_train, beta)
    y_pred = pm.Normal('y_pred', mu=mu, sigma=epsilon, observed=y_train)
    trace = pm.sample(2000)

##### 10. Survival Analysis (e.g., Cox Proportional Hazards)
What It Means: 
- Survival analysis predicts the time until an event occurs, such as customer churn or equipment failure.

Outcome Interpretation: 
- Each output shows the likelihood of the event happening over time, considering various risk factors.

Performance Measures:
- Concordance Index (C-Index): Measures the model’s ability to correctly rank predictions; values closer to 1 indicate better performance.
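
The ranking idea behind the C-index can be sketched in a few lines. The version below follows Harrell's definition under simple assumptions: a pair is comparable when the subject with the shorter time actually experienced the event, and ties in predicted risk count as half. The times, event flags, and risk scores are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs where the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event before subject j's time
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    return concordant / comparable

# Hypothetical durations, event indicators (1 = event, 0 = censored), and risk scores
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk_scores = [0.9, 0.6, 0.7, 0.1]
print(f"C-index: {concordance_index(times, events, risk_scores):.2f}")
```

A value of 0.5 corresponds to random ranking. A fitted lifelines `CoxPHFitter` reports its own value as `concordance_index_`.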

Lay Explanation: 
- Survival analysis is like tracking how long something will last, based on factors that might speed it up or slow it down.

Use Case: 
- For time-to-event data, such as time until a customer churns or equipment fails.

Model Types: 
- Kaplan-Meier estimator, 
- Cox Proportional Hazards Model.

In [None]:
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(data, 'time', event_col='event')
cph.predict_survival_function(data)

# Metrics

In [None]:
# Functions to compute True Positives, True Negatives, False Positives and False Negatives

def true_positive(y_true, y_pred):
    tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
    return tp

def true_negative(y_true, y_pred):
    tn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1        
    return tn

def false_positive(y_true, y_pred):
    fp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 1:
            fp += 1       
    return fp

def false_negative(y_true, y_pred):
    fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 0:
            fn += 1        
    return fn

In [None]:
# cnf_matrix is assumed to be a NumPy confusion matrix (rows = actual, columns = predicted)
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)

# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP) 
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)
# Overall accuracy for each class
ACC = (TP+TN)/(TP+FP+FN+TN)

In [None]:
# Implementation of a table of metrics derived from a binary confusion matrix
import math
import pandas as pd
import sklearn.metrics

def matrix_metrix(real_values, pred_values, beta):
    CM = sklearn.metrics.confusion_matrix(real_values, pred_values)
    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]
    Population = TN + FN + TP + FP
    # Prevalence is the share of actual positives in the population
    Prevalence = round((TP + FN) / Population, 2)
    Accuracy   = round((TP + TN) / Population, 4)
    Precision  = round(TP / (TP + FP), 4)
    NPV        = round(TN / (TN + FN), 4)
    FDR        = round(FP / (TP + FP), 4)
    FOR        = round(FN / (TN + FN), 4)
    # Sanity checks: each pair of complementary rates should sum to 1
    check_Pos  = Precision + FDR
    check_Neg  = NPV + FOR
    Recall     = round(TP / (TP + FN), 4)
    FPR        = round(FP / (TN + FP), 4)
    FNR        = round(FN / (TP + FN), 4)
    TNR        = round(TN / (TN + FP), 4)
    check_Pos2 = Recall + FNR
    check_Neg2 = FPR + TNR
    # Likelihood ratios and diagnostic odds ratio
    LRPos      = round(Recall / FPR, 4)
    LRNeg      = round(FNR / TNR, 4)
    DOR        = round(LRPos / LRNeg)
    F1         = round(2 * ((Precision * Recall) / (Precision + Recall)), 4)
    FBeta      = round((1 + beta**2) * ((Precision * Recall) / ((beta**2 * Precision) + Recall)), 4)
    MCC        = round(((TP * TN) - (FP * FN)) / math.sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)), 4)
    # Bookmaker informedness and markedness
    BM         = Recall + TNR - 1
    MK         = Precision + NPV - 1

    mat_met = pd.DataFrame({'Metric':['TP','TN','FP','FN','Prevalence','Accuracy','Precision','NPV','FDR','FOR','check_Pos','check_Neg','Recall','FPR','FNR','TNR','check_Pos2','check_Neg2','LR+','LR-','DOR','F1','FBeta','MCC','BM','MK'],
                            'Value':[TP,TN,FP,FN,Prevalence,Accuracy,Precision,NPV,FDR,FOR,check_Pos,check_Neg,Recall,FPR,FNR,TNR,check_Pos2,check_Neg2,LRPos,LRNeg,DOR,F1,FBeta,MCC,BM,MK]})

    return mat_met

In [None]:
# ROC Implementation

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

fpr, tpr, thresholds = roc_curve(real_values, prob_values)

auc = roc_auc_score(real_values, prob_values)
print('AUC: %.3f' % auc)

pyplot.plot(fpr, tpr, linestyle='--', label='ROC curve')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

# Precision-recall implementation

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(real_values,prob_values)

pyplot.plot(recall, precision, linestyle='--', label='Precision versus Recall')
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
pyplot.legend()
pyplot.show()

In [None]:
# Function to get many metrics directly from sklearn

def sk_metrix(real_values, pred_values, beta):
    Accuracy  = round(sklearn.metrics.accuracy_score(real_values, pred_values), 4)
    Precision = round(sklearn.metrics.precision_score(real_values, pred_values), 4)
    Recall    = round(sklearn.metrics.recall_score(real_values, pred_values), 4)
    F1        = round(sklearn.metrics.f1_score(real_values, pred_values), 4)
    # beta must be passed by keyword in current scikit-learn releases
    FBeta     = round(sklearn.metrics.fbeta_score(real_values, pred_values, beta=beta), 4)
    MCC       = round(sklearn.metrics.matthews_corrcoef(real_values, pred_values), 4)
    Hamming   = round(sklearn.metrics.hamming_loss(real_values, pred_values), 4)
    Jaccard   = round(sklearn.metrics.jaccard_score(real_values, pred_values), 4)
    Prec_Avg  = round(sklearn.metrics.average_precision_score(real_values, pred_values), 4)
    Accu_Avg  = round(sklearn.metrics.balanced_accuracy_score(real_values, pred_values), 4)

    mat_met = pd.DataFrame({
        'Metric': ['Accuracy','Precision','Recall','F1','FBeta','MCC','Hamming','Jaccard','Precision_Avg','Accuracy_Avg'],
        'Value': [Accuracy,Precision,Recall,F1,FBeta,MCC,Hamming,Jaccard,Prec_Avg,Accu_Avg]})

    return mat_met


In [None]:
# Evaluation Metrics For Multi-class Classification

def accuracy(y_true, y_pred):
    
    """
    Function to calculate accuracy
    -> param y_true: list of true values
    -> param y_pred: list of predicted values
    -> return: accuracy score
    
    """
    
    # Initialize a counter for correctly predicted classes
    correct_predictions = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            correct_predictions += 1
    # Return the fraction of correct predictions
    return correct_predictions / len(y_true)

In [None]:
#Computation of macro-averaged precision

def macro_precision(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize precision to 0
    precision = 0
    
    # loop over all classes
    for class_ in list(y_true.unique()):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        # keep adding precision for all classes
        precision += temp_precision
        
    # calculate and return average precision over all classes
    precision /= num_classes
    
    return precision

print(f"Macro-averaged Precision score : {macro_precision(y_test, y_pred) }")

# implement macro-averaged precision using sklearn
from sklearn import metrics
macro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged Precision score using sklearn library : {macro_averaged_precision}")

In [None]:
#Computation of micro-averaged precision

def micro_precision(y_true, y_pred):


    # find the number of classes 
    num_classes = len(np.unique(y_true))
    
    # initialize tp and fp to 0
    tp = 0
    fp = 0
    
    # loop over all classes
    for class_ in y_true.unique():
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false positive for current class
        # and update overall fp
        fp += false_positive(temp_true, temp_pred)
        
    # calculate and return overall precision
    precision = tp / (tp + fp)
    return precision

print(f"Micro-averaged Precision score : {micro_precision(y_test, y_pred)}")


# implement micro-averaged precision using sklearn
micro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged Precision score using sklearn library : {micro_averaged_precision}")

In [None]:
#Computation of macro-averaged recall

def macro_recall(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize recall to 0
    recall = 0
    
    # loop over all classes
    for class_ in list(y_true.unique()):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # keep adding recall for all classes
        recall += temp_recall
        
    # calculate and return average recall over all classes
    recall /= num_classes
    
    return recall

print(f"Macro-averaged recall score : {macro_recall(y_test, y_pred)}")


# implement macro-averaged recall using sklearn

macro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'macro')
print(f"Macro-averaged recall score using sklearn : {macro_averaged_recall}")


In [None]:
#Computation of micro-averaged recall

def micro_recall(y_true, y_pred):


    # find the number of classes 
    num_classes = len(np.unique(y_true))
    
    # initialize tp and fn to 0
    tp = 0
    fn = 0
    
    # loop over all classes
    for class_ in y_true.unique():
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false negative for current class
        # and update overall fn
        fn += false_negative(temp_true, temp_pred)
        
    # calculate and return overall recall
    recall = tp / (tp + fn)
    return recall

print(f"Micro-averaged recall score : {micro_recall(y_test, y_pred)}")


#  implement micro-averaged recall using sklearn

micro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged recall score using sklearn library : {micro_averaged_recall}")

In [None]:
#Computation of macro-averaged f1 score

def macro_f1(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize f1 to 0
    f1 = 0
    
    # loop over all classes
    for class_ in list(y_true.unique()):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        
        
        temp_f1 = 2 * temp_precision * temp_recall / (temp_precision + temp_recall + 1e-6)
        
        # keep adding f1 score for all classes
        f1 += temp_f1
        
    # calculate and return average f1 score over all classes
    f1 /= num_classes
    
    return f1


print(f"Macro-averaged f1 score : {macro_f1(y_test, y_pred)}")


# implement macro-averaged F1 score using sklearn

macro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged F1 score using sklearn library : {macro_averaged_f1}")

In [None]:
#Computation of micro-averaged f1 score

def micro_f1(y_true, y_pred):


    #micro-averaged precision score
    P = micro_precision(y_true, y_pred)

    #micro-averaged recall score
    R = micro_recall(y_true, y_pred)

    #micro averaged f1 score
    f1 = 2*P*R / (P + R)    

    return f1

print(f"Micro-averaged f1 score : {micro_f1(y_test, y_pred)}")


# implement micro-averaged F1 score using sklearn

micro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged F1 score using sklearn library : {micro_averaged_f1}")


In [None]:
# ROC AUC computation (one-vs-rest, per class)

from sklearn.metrics import roc_auc_score

def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):
    
    #creating a set of all the unique classes using the actual class list
    unique_class = set(actual_class)
    roc_auc_dict = {}
    for per_class in unique_class:
        
        #creating a list of all the classes except the current class 
        other_class = [x for x in unique_class if x != per_class]

        #marking the current class as 1 and all other classes as 0
        new_actual_class = [0 if x in other_class else 1 for x in actual_class]
        new_pred_class = [0 if x in other_class else 1 for x in pred_class]

        #using the sklearn metrics method to calculate the roc_auc_score
        roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
        roc_auc_dict[per_class] = roc_auc

    return roc_auc_dict

roc_auc_dict = roc_auc_score_multiclass(y_test, y_pred)
roc_auc_dict

In [None]:
# ROC implementation: 

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from itertools import cycle
plt.style.use('ggplot')

# Load the iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]

# We split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size= 0.5, random_state=0)


# We define the model as an SVC in a OneVsRestClassifier setting.
# This means one binary classifier is fitted per class: class 1 vs. the rest,
# class 2 vs. the rest, and class 3 vs. the rest.
# So we have 3 cases in the end, and within each case the decision threshold
# is varied to get the ROC curve of that case: 3 ROC curves as output.

classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Plotting and estimation of FPR, TPR
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green'])

for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=1.5, label='ROC curve of class {0} (area = {1:0.2f})'.format(i+1, roc_auc[i]))

# Draw the chance diagonal and finish the figure once, after all class curves are plotted
plt.plot([0, 1], [0, 1], 'k-', lw=1.5)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()