# Statistics and Research Methods

## Understanding Statistical Models vs. Machine Learning Models

It is essential first to understand the distinctions between statistical models and machine learning models, as they differ in purpose, assumptions, and interpretative depth.

- Statistical Models: 
    - These are rooted in traditional statistics and focus on relationships between variables through predefined equations. 
    - Statistical models aim to understand the underlying data-generating process, focusing on hypothesis testing and inference. 
    - These models often rely on strong assumptions like:
        - linearity, 
        - normality, and 
        - homoscedasticity.
    - They are **interpretable**, making it easier to understand the impact of individual variables.

- Machine Learning Models: 
    - These prioritize **predictive** power over interpretability. 
    - They are designed to automatically learn patterns and relationships within data, often with minimal assumptions. 
    - Machine learning models can handle complex and high-dimensional data but may lack transparency about how individual features affect the outcome, especially in "black box" models like neural networks or ensemble methods.


## Choosing the Right Statistical Model

The type of statistical model you use depends on your data and problem:

- Linear Regression: For predicting a **continuous target variable** based on one or more predictors.
- Logistic Regression: For predicting **binary outcomes**, often used in classification problems.
- ANOVA (Analysis of Variance): For comparing means across multiple groups.
- Time Series Models: For data that's ordered by time (e.g., ARIMA, SARIMA).
- Survival Analysis: For time-to-event data, such as customer churn timing.
- Multivariate Analysis: For understanding interactions across multiple variables (e.g., MANOVA, PCA).

## Preprocessing the Data
Prepare your data by cleaning and preprocessing it:

- Missing Values: Decide whether to impute or drop missing values.
- Outliers: Identify and consider handling outliers, especially in regression.
- Data Transformation: Transform non-normal variables if required (e.g., using log transformations).
- Feature Scaling: For some models, standardizing or normalizing data is essential.
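
The preprocessing steps above can be sketched on a tiny example. This is only an illustration: the toy DataFrame and its `income` column are hypothetical and do not come from any dataset used in these notes, and median imputation is just one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one very large value
toy = pd.DataFrame({"income": [30.0, 35.0, np.nan, 40.0, 400.0]})

# Missing values: impute with the median (less sensitive to the large value than the mean)
toy["income"] = toy["income"].fillna(toy["income"].median())

# Data transformation: log transform to reduce right skew (values must be positive)
toy["income_log"] = np.log(toy["income"])

# Feature scaling: standardize to zero mean and unit sample standard deviation
toy["income_std"] = (toy["income_log"] - toy["income_log"].mean()) / toy["income_log"].std()

print(toy)
```

Whether to impute, drop, transform, or scale depends on the model and the data, so treat each line as an option rather than a fixed recipe.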

## Exploratory Data Analysis (EDA)

EDA is essential to understand: 
- patterns, through visualizations,
- distributions, through summary statistics, and
- relationships, through correlation matrices.

This helps identify relevant features and spot potential issues like multicollinearity.

## Building the Statistical Model

- **Statsmodels** provides 
    - coefficients, 
    - p-values, and 
    - confidence intervals for each variable, 
        - enabling hypothesis testing on whether each predictor significantly affects the outcome.

## Evaluating Model Performance
Regression Metrics: 
- Use R-squared, 
- Adjusted R-squared, 
- RMSE, and 
- MAE to evaluate regression models.
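
As a minimal sketch, the regression metrics listed above can be computed by hand with NumPy. The `y_true` and `y_pred` arrays are made-up values, and the model is assumed to have a single predictor for the adjusted R-squared.

```python
import numpy as np

# Made-up observed values and model predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])
n_obs, n_predictors = len(y_true), 1  # assumption: one predictor in the model

residuals = y_true - y_pred
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)
rmse = np.sqrt(np.mean(residuals ** 2))
mae = np.mean(np.abs(residuals))

print(f"R2={r2:.4f}, adjusted R2={adj_r2:.4f}, RMSE={rmse:.4f}, MAE={mae:.4f}")
```

Adjusted R-squared penalizes additional predictors, which is why it is lower than R-squared here even with one predictor.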

Classification Metrics: 
- Use confusion matrix, 
- accuracy, 
- precision, 
- recall, and 
- AUC-ROC.
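
The classification metrics can be illustrated in the same way. The labels below are invented for the example; AUC-ROC is left out because it needs predicted scores rather than hard class labels.

```python
import numpy as np

# Invented true labels and predicted labels (1 = positive class)
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Cells of the confusion matrix
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of the predicted positives, how many were right
recall = tp / (tp + fn)      # of the actual positives, how many were found

print(tp, tn, fp, fn, accuracy, precision, recall)
```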

Residual Analysis: 
- Residual plots help assess assumptions:
    - homoscedasticity, 
    - normality of residuals.

## Model Interpretation
Statistical models are highly interpretable. 
- In linear regression, each coefficient represents the expected change in the dependent variable for a one-unit change in the predictor, holding all else constant.

Confidence Intervals: 
- Look at the 95% CI for each coefficient; if it does not contain zero, the predictor's effect is statistically significant at the 5% level.

P-Values: 
- A p-value below a chosen threshold (usually 0.05) indicates that the predictor's association with the outcome is statistically significant; it does not measure the size of the effect.

## Validating Assumptions
- Linearity: Check scatter plots of residuals.
- Normality of Residuals: Use a Q-Q plot to verify.
- No Multicollinearity: Variance inflation factor (VIF) helps detect multicollinearity.
- Homoscedasticity: Plot residuals vs. fitted values.
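
To make the multicollinearity check concrete, here is a sketch that computes the variance inflation factor directly from its definition, VIF = 1 / (1 - R²), where R² comes from regressing one predictor on the others. The helper `vif` and the synthetic predictors `x1`, `x2`, `x3` are assumptions for illustration; libraries such as statsmodels provide their own implementation.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X."""
    X = np.asarray(X, dtype=float)
    factors = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept
        design = np.column_stack([np.ones(len(others)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1 - resid.var() / target.var()
        factors.append(1 / (1 - r2))
    return np.array(factors)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)               # independent of x1
x3 = x1 + 0.1 * rng.normal(size=500)    # nearly a copy of x1

vifs = vif(np.column_stack([x1, x2, x3]))
print(vifs)
```

A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity; here `x1` and `x3` are flagged while `x2` stays close to 1.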

## Reporting and Communicating Results
Present your findings by focusing on:

- Key Coefficients: Explain which predictors significantly affect the outcome.
- Model Fit: Interpret R-squared values (e.g., explaining how much variance in the target variable is explained).
- Real-World Implications: Describe how insights from the model can impact business decisions.

# Approach to statistical modeling

Each model type has specific 
- applications, 
- strengths, and 
- limitations.

Understand when and how to use them.

### Step 1: Define Objectives and Hypotheses

Identify the Problem and Objectives: 
- Clearly define the goal.
    - Are you trying to predict, classify, find patterns, or estimate relationships? 
    - Setting objectives helps in choosing the right model.

- Formulate Hypotheses: 
    - Based on the problem, develop hypotheses. 
        - For instance, in a sales prediction problem, you may hypothesize that certain features like advertising spend, time of year, and economic indicators affect sales.

### Step 2: Data Collection and Preprocessing
Data Collection: 
- Gather historical data related to the problem. 

Data Cleaning: 
- Handle missing values, remove duplicates, and ensure consistency.

Feature Engineering: 
- Create new features if necessary. 
- This could involve 
    - transformations, 
    - encoding categorical variables, or 
    - creating interaction terms.

Data Splitting: 
- Split the data into training and testing sets. Typically, an 80-20 or 70-30 split is used.
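
A minimal sketch of an 80-20 split using only NumPy is shown below. The arrays `X` and `y` are placeholders; in practice a helper such as scikit-learn's `train_test_split` is commonly used instead.

```python
import numpy as np

rng = np.random.default_rng(42)   # fixed seed so the split is reproducible
n_rows = 100
X = np.arange(n_rows * 2).reshape(n_rows, 2)   # placeholder feature matrix
y = np.arange(n_rows)                          # placeholder target

# Shuffle the row indices, then cut at 80%
indices = rng.permutation(n_rows)
n_train = int(0.8 * n_rows)
train_idx, test_idx = indices[:n_train], indices[n_train:]

X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]

print(X_train.shape, X_test.shape)
```

Shuffling before splitting matters when the rows are ordered; for time series data a chronological split is usually more appropriate.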

## Exploratory Data Analysis

### Why is EDA important?

Exploratory Data Analysis (EDA) helps us to understand our data without making any assumptions. EDA is a vital component before we continue with the modelling phase as it provides context and guidance on the course of action to take when developing the appropriate model. It will also assist in interpreting the results correctly. Without doing EDA you will not understand your data fully.


### The different types of EDA

EDA is generally classified in two ways:

    1) Non-graphical or Graphical
    2) Univariate or Multivariate
    
<div align="left" style="width: 600px; text-align: left;">
<img src="https://github.com/Explore-AI/Pictures/blob/f860f39251c523eda779dea0140316ccbefdd8e0/eda_map.jpg?raw=True"
     alt="EDA Diagram"
     style="padding-bottom=0.5em"
     width=600px/>
</div>


#### Non-graphical EDA
Involves calculations of summary/descriptive statistics. 

#### Graphical EDA
This type of analysis will contain data visualisations.

#### Univariate Analysis 
This is performed on one variable at a time as the prefix 'uni' indicates. 

#### Multivariate Analysis 
This type of analysis explores the relationship between two or more variables. 
When only comparing two variables it is known as **bivariate analysis** as indicated by the prefix 'bi'.

Read a more detailed explanation <a href="https://www.stat.cmu.edu/~hseltman/309/Book/chapter4.pdf">here</a>.

### 1. Basic Analysis

For a practical example, we will be looking at the Medical Claims Data. Using these four commands, we will perform a basic analysis:

    - df.head()
    - df.shape
    - df.info()
        - If a feature (variable) is categorical, the Dtype is object; if it is numerical, the Dtype is int64 or float64. 
        - This command also shows us that none of the features contain null values across the 1338 rows.
    - df.describe()

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')

df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/claims_data.csv')

# Looking at the top five rows of our data
df.head()

# shape command shows us that we have x rows of data and y features.
df.shape

#  confirms our categorical and numerical features.
df.info()

# Null values for each feature can also be checked by using the following command
df.isnull().sum()

# Population and Sample

**Population**
- Population is a collection of all data points of interest.
    - e.g.: all employees in the organization form the population.
- **Parameter**
    - a number obtained when working with a population
        - e.g.: we survey the whole organization to count its employees and arrive at the number 20,000. 
    
**Sample**
- Sample is a subset of the population.
    - e.g.: the employees working on one project form a sample.
- **Statistic**
    - a number obtained when working with a sample
        - e.g.: we count the employees working on a particular project and arrive at the number 20.

What to choose between Population and Sample?

In real-life scenarios, we almost always deal with sample data. 
- The reason is that a sample is easier to collect and cheaper to compute on than the whole population. 
- Based on the result obtained for a sample, we can then use statistical inference to draw conclusions about the entire population.
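
The difference between a parameter and a statistic can be sketched with simulated data. The salary figures below are synthetic and purely illustrative; only the population size of 20,000 echoes the example above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "population": salaries of 20,000 employees (made-up distribution)
population = rng.normal(loc=50_000, scale=8_000, size=20_000)

# A random sample of 200 employees drawn without replacement
sample = rng.choice(population, size=200, replace=False)

parameter = population.mean()   # population mean: a parameter
statistic = sample.mean()       # sample mean: a statistic that estimates it

print(f"parameter={parameter:.0f}, statistic={statistic:.0f}")
```

The statistic will differ slightly from the parameter on every draw; quantifying that difference is what inference is about.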

# The Measure of Central tendency

The concept of central tendency is based on the following observation:
- "Given a large number of observations of a similar type, most of the observations tend to cluster around a central position when represented as a graph."

# Univariate Analysis: Non-Graphical

The first univariate analysis will be non-graphical. This is where we will be looking at the **descriptive statistics** of each feature. 

## Continuous/Numeric Feature

##### **Descriptive Statistics**

We can get the descriptive statistics of each **numerical feature** by using the following command:

    - df.describe()

This command will provide the 
- mean, 
    - also known as the arithmetic mean, 
    - the statistical average of all data points in question.
- standard deviation, and
- the five-number summary of each numerical feature, which is also used for creating the box plot:
    - Minimum, 
    - Lower Quartile (Q1) = 25%,
    - Median (Q2) = 50%, 
        - the middlemost data point in the dataset when arranged in ascending or descending order.
        - More resistant to outliers than the mean.
        - Median with an even number of data points = average of the middle two numbers.
        - Median with an odd number of data points = middlemost observation.
    - Upper Quartile (Q3) = 75%, 
    - Maximum.
- The box plot built from this summary exposes **outliers**: data points that are significantly different from the rest of the data points in consideration.

Individual statistical measures can also be calculated by using the following commands:

    - df.count()
    - df.mean()
    - df.std()
    - df.min()
    - df.quantile([0.25, 0.5, 0.75], axis = 0)
    - df.median()
    - df.max()

The three measures of central tendency are the:
- **mode**,
    - the value that appears most often in the dataset, 
- **mean**, and 
- **median**. 

The command to determine the mode is:

    - df.mode()
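
A small sketch with Python's built-in `statistics` module shows how the three measures behave, including the odd/even rule for the median and the effect of an outlier. The `ages` values are invented for the example.

```python
import statistics

ages = [23, 25, 25, 27, 30]            # odd number of data points
ages_even = [23, 25, 25, 27, 30, 32]   # even number of data points

median_odd = statistics.median(ages)         # middlemost observation
median_even = statistics.median(ages_even)   # average of the middle two numbers
mode = statistics.mode(ages)                 # most frequent value

# Add one extreme value and compare how far the mean and the median move
with_outlier = ages + [95]
mean_shift = statistics.mean(with_outlier) - statistics.mean(ages)
median_shift = statistics.median(with_outlier) - statistics.median(ages)

print(median_odd, median_even, mode, mean_shift, median_shift)
```

The single outlier moves the mean by 11.5 but the median by only 1, which is what "more resistant to outliers" means in practice.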

In [None]:
df.describe()

# statistics of a specific feature
df.age.describe()
df['age'].describe()

##### **Dispersion of Data**

Measures of dispersion describe how spread out the data is.
- They help to understand the variation in the data and provide information about its distribution.

These include: 
- Range,
     - measured by subtracting the lowest value from the highest value. 
          - A wide range indicates high variability,
          - A small range indicates low variability in the distribution.
     - Range = Highest_value - Lowest_value
          - The range can be influenced by outliers.
- Interquartile Range (IQR),
     - IQR is the range between Q1 (the boundary between the first and second quartile) and Q3 (the boundary between the third and fourth quartile): IQR = Q3 - Q1.
     - IQR is preferred over the range because, unlike the range, it is not influenced by outliers. 
     - IQR measures variability after splitting a data set into four equal parts (quartiles).
          - The box plot uses the IQR to flag outliers.
               - Rule for flagging outliers: any point outside the interval [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR]
- Variance, 
     - Variance measures how far, on average, each value in the dataset is from the mean (as a squared distance).

Population variance
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Sample variance
$$ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$  

- Standard Deviation
     - Standard deviation is the square root of the variance, which brings the measure back to the original units of the data. 
     - A low standard deviation indicates that data points lie close to the mean.
     - For normally distributed data:
         - about 68% of values lie within 1 standard deviation of the mean.
         - about 95% of values lie within 2 standard deviations.
         - about 99.7% of values lie within 3 standard deviations.

Population std
$$\sigma = \sqrt{\frac{1}{N}\sum (x_i - \mu)^2}$$
sample std
$$ s = \sqrt{\frac{1}{n - 1}\sum (x_i - \bar{x})^2}$$
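
The dispersion measures above can be sketched with NumPy on a small made-up array. Note the `ddof` argument: NumPy divides by N by default (`ddof=0`, the population formula), whereas pandas' `df.var()` and `df.std()` divide by n - 1 by default (the sample formula).

```python
import numpy as np

# Made-up data with one obvious extreme value
data = np.array([2, 4, 4, 5, 7, 9, 10, 12, 40], dtype=float)

value_range = data.max() - data.min()

# Quartiles, IQR, and the 1.5 * IQR fences used by the box plot
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

pop_var = data.var(ddof=0)      # population variance: divide by N
sample_var = data.var(ddof=1)   # sample variance: divide by n - 1
sample_std = data.std(ddof=1)   # square root of the sample variance

print(value_range, q1, q3, iqr, outliers, pop_var, sample_var, sample_std)
```

The value 40 inflates the range to 38, while the IQR of 6 is unaffected by it.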

##### Standard deviation and Mean Absolute deviation (Why SD is more reliable than MAD)


# Univariate Analysis: Graphical

Objective:
- Trends and Patterns of data
- Frequency
- Distribution of the variables
- Relationship that may exist between different variables

You can look at the **distribution** of any numerical feature by using the following plots:
- Scatter plot
- histogram
- density plot
- box plot
- violin plot
    
For a categorical feature we will use a:
- bar plot

## Continuous/Numerical variable

### Univariate summary plots:
These plots give a more concise description of the location, dispersion, and distribution of a variable than an enumerative plot. 
- A summary plot does not show every individual data value, but it efficiently represents the entire dataset.

#### Histogram and Density Plot

For displaying a histogram and density plot we will be using the Matplotlib library and create a list of all numerical features to visualise these features at the same time.

Both the histogram and density plot display the same information. The density plot can be considered a smoothed version of the histogram and does not depend on the size of bins (it depends on a smoothing bandwidth instead).

In [None]:
features = ['age', 'bmi', 'steps', 'children', 'claim_amount'] # create a list of all numerical features
df[features].hist(figsize=(10,10))

In [None]:
df[features].plot(kind='density', subplots=True, layout=(3, 2), sharex=False, figsize=(10, 10));

#### Box Plot and Violin Plot

For the Box Plot and Violin Plot, we will use the seaborn library and only select one feature instead of all the numerical features. We can visualise all numerical features simultaneously, but as the range of values for each feature is different, it will not create a useful visualisation. Standardisation or normalisation can be applied to a feature to adjust the range, but we will not apply it in this notebook. Further reading on standardisation and normalisation can be done <a href="https://medium.com/@dataakkadian/standardization-vs-normalization-da7a3a308c64">here</a>.

The `bmi` feature will be used.

Although both the box plot and violin plot display the distribution of the data, the boxplot provides certain statistics that are useful. 

The five vertical lines in the boxplot correspond to the five-number summary (by default the whiskers stop at the most extreme points that are not flagged as outliers), and the dots on the right-hand side of the graph are outliers. The violin plot focuses more on a smoothed distribution.

In [None]:
sns.boxplot(x='bmi', data=df)

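# Box plots of all numerical features side by side.
# pd.melt reshapes to long format with default columns 'variable' and 'value';
# `features` is the list of numerical columns defined earlier.
sns.set(rc={'figure.figsize':(9,9)})
sns.boxplot(x='variable', y='value', data=pd.melt(df[features]))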

In [None]:
sns.violinplot(x='bmi', data=df)


### Univariate enumerative Plots

#### Scatter plot

Plots different observations/values of the same variable corresponding to the index/observation number.
- plot the variable
- against the corresponding observation number stored as the index of the data frame (df.index)

In [None]:
plt.scatter(df.index, df['var1'])
plt.show()

In [None]:
# 'var' and 'variety' are placeholder column names
sns.scatterplot(x=df.index, y=df['var'], hue=df['variety'])

# In seaborn, the 'hue' parameter determines which column in the data frame to use for color encoding.

#### Line plot
A line plot visualizes data by connecting the data points via line segments. 
- It resembles a scatter plot but differs by ordering the measurement points (usually by their x-axis value) and connecting them with straight line segments.

In [None]:
sns.set(rc = {'figure.figsize': (7,7)})
sns.set(font_scale= 1.5)

fig = sns.lineplot(x=df.index, y=df['var2'], markevery=1, marker='d', data=df, hue=df['variety'])

#### Strip plot and Swarm Plot :
- The strip plot is similar to a scatter plot.
    - helps to plot the distribution of variables for each category as individual data points.
- The swarm-plot, similar to a strip-plot, provides a visualization technique for univariate data to view the spread of values in a continuous variable.
    - The only difference between the strip-plot and the swarm-plot is that the swarm-plot spreads out the data points of the variable automatically to avoid overlap and hence provides a better visual overview of the data.

In [None]:
sns.stripplot(y=df['var1'])
sns.stripplot(x=df['variety'], y=df['var1'])

In [None]:
sns.set(rc={'figure.figsize': (5, 5)})
sns.swarmplot(x = df['var'])
sns.swarmplot(x = df['variety'], y = df['var'])

### Categorical Data

#### Bar Plot

For the categorical features, we can create a **bar plot** to display the frequency distribution. 

A bar plot is drawn on two axes. 
- One axis is the category axis indicating the category, while the 
- second axis is the value axis that shows the numeric value of that category, indicated by the length of the bar.

We'll generate a bar plot of the `children` feature, where each bar represents a unique number of children from the data, and the height represents how many times that number of children occurred. This can be done by using seaborn's `countplot`. 

In [None]:
df['var'].value_counts().plot.bar()

In [None]:
sns.countplot(x = 'children', data = df, palette="hls")
plt.title("Distribution of Children")

##### Pie Chart:
Shows the numerical proportion occupied by each category
- Pass an array of category names to the 'labels' parameter to add labels.

In [None]:
plt.pie(df['var'].value_counts(), labels= ['cat1', 'cat2', 'cat3'], shadow= True)

In [None]:
plt.pie(df['var'].value_counts(), startangle= 90, autopct='%.3f', labels= ['cat1', 'cat2', 'cat3'], shadow= True)

# Normal Distribution

Examples such as birth weight, IQ scores, and stock price returns often form a bell-shaped curve.

One reason the Normal Distribution is essential for data scientists is the Central Limit Theorem.
- The theorem states that the distribution of the mean of independent samples approaches a normal distribution as the sample size grows (given finite variance), and it is the foundation for many hypothesis testing techniques.

### Properties of Normal Distribution
- Bell-shaped curve
    - curve is symmetric around the Mean
    - Mean, Median, and Mode are all the same.
- Normal Distribution is symmetric, which means its tail on one side is the mirror image of the other side
- also called a Gaussian Distribution
- its Probability Density is fully described by only two parameters
    - $\mu$
    - $\sigma^2$
- Normal distribution retains the normal shape under many operations, unlike most other probability distributions, which change their family after a transformation. 

For a Normal Distribution:
- The product of two Normal densities is proportional to a Normal density
- The sum of two independent Normal random variables is Normal
- The convolution of two Normal densities is also a Normal density
- The Fourier transform of a Normal density is also Gaussian in shape

Empirical Rule for Normal Distribution
- According to the Empirical Rule for Normal Distribution:
    - 68.27% of data lies within 1 standard deviation of the mean
    - 95.45% of data lies within 2 standard deviations of the mean
    - 99.73% of data lies within 3 standard deviations of the mean
-  almost all the data lies within 3 standard deviations. 

This rule enables us to check for Outliers and is very helpful when determining the normality of any distribution.
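
The Empirical Rule can be checked by simulation. This sketch draws synthetic normal data (the mean of 100 and standard deviation of 15 are arbitrary choices) and measures the share of values within 1, 2, and 3 standard deviations.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=100, scale=15, size=100_000)   # synthetic normal data

# Distance of every value from the mean, measured in standard deviations
z = (x - x.mean()) / x.std()

# Share of values within k standard deviations of the mean
within = {k: float(np.mean(np.abs(z) <= k)) for k in (1, 2, 3)}
print(within)
```

The simulated shares should land close to 68.27%, 95.45%, and 99.73%; data whose shares are far from these values is unlikely to be normal.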

### Standard Normal Distribution
Standard Normal Distribution is a special case of Normal Distribution when
- $\mu$ = 0
- $\sigma$ = 1

Convert Normal Distribution into Standard Normal distribution with
$$ Z = \frac{X - \mu}{\sigma}$$

Example: comparing the Maths mark of one student with the History mark of another
- whoever gets the higher z-score performed better relative to their own class.
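
As a worked illustration of the z-score comparison, the marks and class statistics below are hypothetical numbers chosen to keep the arithmetic simple.

```python
def z_score(x, mu, sigma):
    """Standardize x given the mean and standard deviation of its distribution."""
    return (x - mu) / sigma

# Hypothetical marks with hypothetical class means and standard deviations
maths_z = z_score(80, mu=70, sigma=10)     # (80 - 70) / 10
history_z = z_score(75, mu=60, sigma=5)    # (75 - 60) / 5

print(maths_z, history_z)
```

Although the raw History mark (75) is lower than the Maths mark (80), its z-score is higher, so it is the stronger result relative to the class.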

### Skewed Distribution

When data points cluster more on one side than the other, the distribution is called a Skewed Distribution.

##### **kurtosis** and **skew**. 

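Both kurtosis and skew are important statistical terms to be familiar with in data science. Kurtosis measures how heavy the tails of a distribution are, and so how prone the data is to outliers. A normal distribution has a kurtosis of 3: **high kurtosis (>3)** indicates heavier tails and more outliers, and **low kurtosis (<3)** lighter tails and fewer outliers. Note that pandas' `df.kurtosis()` reports *excess* kurtosis (kurtosis minus 3), so the reference value there is 0. Skew indicates how symmetrical your data is. Below is a table that explains the range of values with regards to skew.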

Left skewed distribution
- Mode > Median > Mean.

Right Skewed Distribution
- Mode < Median < Mean


|   Skew Value (x)  |       Description of Data      |
|:-------------------|:---------------:|
| -0.5 < x < 0.5              |Fairly Symmetrical |
| -1 < x < -0.5 | Moderate Negative Skew  | 
| 0.5 < x < 1             | Moderate Positive Skew  | 
|       x < -1     |High Negative Skew  | 
|       x > 1  |High Positive Skew | 

<div align="left" style="width: 500px; font-size: 80%; text-align: left; margin: 0 auto">
<img src="https://github.com/Explore-AI/Pictures/blob/f3aeedd2c056ddd233301c7186063618c1041140/regression_analysis_notebook/skew.jpg?raw=True"
     alt="Dummy image 1"
     style="float: left; padding-bottom=0.5em"
     width=500px/>
     For a more detailed explanation on skew and kurtosis read <a href="https://codeburst.io/2-important-statistics-terms-you-need-to-know-in-data-science-skewness-and-kurtosis-388fef94eeaa">here</a>.
</div>


The commands used to determine the skewness of data are:

    - df.skew()
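
A short sketch on synthetic data shows what the skew value looks like for a right-skewed and a symmetric sample. The exponential and normal samples are stand-ins chosen for illustration, not features from the claims data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
right_skewed = pd.Series(rng.exponential(scale=1.0, size=5_000))  # long right tail
symmetric = pd.Series(rng.normal(size=5_000))                     # no skew expected

skew_right = right_skewed.skew()
skew_sym = symmetric.skew()

print(f"right-skewed: skew={skew_right:.2f}, "
      f"mean={right_skewed.mean():.2f}, median={right_skewed.median():.2f}")
print(f"symmetric:    skew={skew_sym:.2f}")
```

For the right-skewed sample the mean sits above the median, matching the ordering Mode < Median < Mean.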

### Check the **Normality** of a Distribution
- Histogram
- KDE Plots
- Q-Q Plots
- Skewness
- Kurtosis


In [None]:
df.skew()

# Between -0.5 and 0.5 implies fairly symmetrical.
# Between 0.5 and 1 implies moderately skewed in a positive direction.
# Above 1 implies highly skewed in a positive direction.

### Kurtosis

Another check for normality is kurtosis. 

Kurtosis describes the tailedness of a distribution, that is, how much of the data lies in the tails. The values below refer to excess kurtosis (K), which is what `df.kurtosis()` returns.
- For a distribution whose tails are similar to those of a Gaussian Distribution, the excess kurtosis will be close to zero. Such distributions are called Mesokurtic. 

- If extreme values are present in the data, more data points lie in the tails and K will be greater than zero.
    - The tails are fatter and longer than those of a Gaussian Distribution. Such distributions are called Leptokurtic.

- If there are fewer extreme values than in a Normal Distribution, fewer data points lie in the tails.
    - K will be less than zero. Such distributions are called Platykurtic. 
        - They have thinner and shorter tails than the Normal distribution.

The commands used to determine the kurtosis of data are:

    - df.kurtosis()
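
The three cases can be illustrated with synthetic samples. The choice of a Laplace sample for heavy tails and a uniform sample for light tails is an assumption made for this sketch; pandas reports excess kurtosis, so a normal sample should come out near 0.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 50_000
samples = pd.DataFrame({
    "normal": rng.normal(size=n),            # mesokurtic: excess kurtosis near 0
    "laplace": rng.laplace(size=n),          # leptokurtic: heavier tails, K > 0
    "uniform": rng.uniform(-1, 1, size=n),   # platykurtic: lighter tails, K < 0
})

excess_kurtosis = samples.kurtosis()
print(excess_kurtosis)
```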

In [None]:
# Indicates a lack of outliers for all features.
df.kurtosis()

### Transform features into Normal/Gaussian Distribution
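- Models such as Linear Regression and Logistic Regression are often described as assuming normally distributed features; strictly, linear regression assumes normally distributed residuals rather than features.
- In practice these models, and Artificial Neural Networks, often perform better or train more stably when heavily skewed features are transformed to be closer to normal.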

**What do we do when data provided to us does not necessarily follow a normal distribution?**

### Gaussian Distribution

In probability theory, a normal (or Gaussian) distribution is a type of continuous probability distribution for a real-valued random variable.
- general form of its probability density function is
$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}}e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2}$$

Samples from a Gaussian Distribution follow a bell-shaped curve centred on the mean. 
- The mean, median, and mode of a Gaussian Distribution are the same.

Steps:
1. Check whether a variable follows a Normal Distribution (see above)
- Check the distribution of the variable using a Q-Q plot
    - A Q-Q plot is a scatterplot created by plotting two sets of quantiles against one another. If both sets of quantiles came from the same distribution, we should see the points forming a roughly straight line.
        - If the data falls on a straight line against normal quantiles, the variable follows a normal distribution; otherwise it does not.

Example: if a variable is highly positively skewed
- plot the Q-Q plot for the variable and check.

If the data points of the feature do not fall on a straight line, it does not follow a normal distribution.
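
To show what a Q-Q plot computes, the sketch below builds the two sets of quantiles by hand and uses their correlation as a rough measure of straightness. The helper `qq_points`, the plotting positions (i - 0.5) / n, and the synthetic samples are assumptions for illustration; `scipy.stats.probplot` uses a slightly different formula for the theoretical quantiles.

```python
import numpy as np
from statistics import NormalDist

def qq_points(values):
    """Return (theoretical normal quantiles, sorted sample values)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    probs = (np.arange(1, n + 1) - 0.5) / n   # one common choice of plotting positions
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, ordered

rng = np.random.default_rng(0)
normal_sample = rng.normal(size=1_000)
skewed_sample = rng.exponential(size=1_000)   # highly positively skewed

# Correlation close to 1 means the Q-Q points lie close to a straight line
r_normal = np.corrcoef(*qq_points(normal_sample))[0, 1]
r_skewed = np.corrcoef(*qq_points(skewed_sample))[0, 1]

print(f"normal: r={r_normal:.4f}, skewed: r={r_skewed:.4f}")
```

The skewed sample gives a visibly lower correlation, which corresponds to the curved pattern seen in its Q-Q plot.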

In [None]:
#importing necessary libraries
import scipy.stats as stats
import pylab

# `cp` is assumed to be a DataFrame with a `price` column, loaded elsewhere
stats.probplot(cp.price,plot=pylab)

##### Function in python which will take data and feature name as inputs and return the KDE plot and Q-Q plot of the feature.

In [None]:
# function to return plots for the feature
def normality(data,feature):
    plt.figure(figsize=(10,5))
    plt.subplot(1,2,1)
    sns.kdeplot(data[feature])
    plt.subplot(1,2,2)
    stats.probplot(data[feature],plot=pylab)
    plt.show()

### Performing the transformations

##### **Logarithmic Transformation**
Convert the variable to its log value, i.e. log(Price). This requires strictly positive values.

In [None]:
# performing logarithmic transformation on the feature
cp['price_log']=np.log(cp['price'])
# plotting to check the transformation
normality(cp,'price_log')

##### **Reciprocal Transformation**
This takes the inverse of each Price value, i.e. 1/Price.

In [None]:
cp['price_reciprocal']=1/cp.price
normality(cp,'price_reciprocal')

##### **Square Root Transformation**

This transformation will take the square root of the Price column, i.e. sqrt(Price).

In [None]:
cp['price_sqroot']=np.sqrt(cp.price)
normality(cp,'price_sqroot')

##### **Exponential Transformation**

The Price variable is raised to a fixed power, here 1/1.2 (strictly a power transformation rather than an exponential one).

In [None]:
cp['price_exponential']=cp.price**(1/1.2)
normality(cp,'price_exponential')

##### **Box-Cox Transformation**

$$ y_i^{(\lambda)} = \begin{cases} \dfrac{y_i^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0, \\ \ln(y_i) & \text{if } \lambda = 0, \end{cases}$$

where:
- y is the response variable and 
- λ is the transformation parameter. 
    - λ typically varies from -5 to 5. 

During the transformation, a range of λ values is considered and the optimal value for the variable is selected (scipy's `stats.boxcox` picks the λ that maximizes the log-likelihood). 
- log(y) is only applied when λ=0.

Box-Cox chooses λ from the data, so it adapts to the skewness of each variable; this often makes it a better choice than a fixed transformation. Note that it requires strictly positive values.
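
The Box-Cox formula for a fixed λ can be written out directly. The helper `box_cox` and the `prices` array are hypothetical and only illustrate the formula; they do not search for the best λ the way `stats.boxcox` does.

```python
import numpy as np

def box_cox(y, lam):
    """Apply the Box-Cox transform for a fixed lambda; y must be strictly positive."""
    y = np.asarray(y, dtype=float)
    if np.any(y <= 0):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return np.log(y)
    return (y ** lam - 1) / lam

prices = np.array([1.0, 10.0, 100.0, 1000.0])   # hypothetical positive values

log_case = box_cox(prices, 0)         # lambda = 0 is the log transform
near_zero = box_cox(prices, 1e-6)     # approaches the log case as lambda -> 0
identity_shift = box_cox(prices, 1)   # lambda = 1 only subtracts 1

print(log_case, near_zero, identity_shift)
```

Seeing that λ = 0 and λ = 1 reduce to the log and to a simple shift helps explain why Box-Cox covers the earlier fixed transformations as special or nearby cases.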

In [None]:
cp['price_Boxcox'],parameters=stats.boxcox(cp['price'])
normality(cp,'price_Boxcox')

# Types Of Probability Distribution Function in Univariate Analysis

Probability Distribution Function (PDF) is a mathematical way of showing how likely different outcomes are in a random event. 
- It assigns a probability to each possible result, and 
- adding up all the probabilities, the total is always 1. 

The PDF helps us understand the chances of different outcomes in a random experiment.

### Distribution Function
- is a mathematical expression that describes the probability of different possible outcomes for an experiment.
- denoted as Variable ~ Type (Characteristics)

Data Types
- We have Qualitative and Quantitative data. 
    - Within Quantitative data, we have 
        - Continuous data types/random variables. 
            - Continuous data is measured and can take any value within a given finite or infinite range.
            - Continuous data is typically represented in decimal format.
            - Example: 
                - person's height, 
                - time, 
                - distance
        - Discrete data types.
            - Discrete data is counted and can take only specific, separate values.
            - Discrete data is represented as whole numbers.
            - Example:
                - number of students in a class, 
                - number of workers in a company

### Types of distribution functions

|   Discrete distributions   |      Continuous distributions     |
|:-------------------|:---------------:|
|  Uniform distribution | Normal distribution |
| Binomial distribution | Standard Normal distribution  | 
| Bernoulli distribution  | Student's T distribution  | 
| Poisson distribution  | Chi-squared  distribution  |

#### **Probability Density Function (PDF):**
- Statistical term that describes the probability distribution of a **continuous** random variable.
- The probability associated with any single value is always zero; probabilities are obtained by integrating the density $f(x) \geq 0$ over an interval.

$$P(a \leq X \leq b) = \int^{b}_{a} f(x)\,dx$$

#### **Probability Mass Function (PMF):**
- Statistical term that describes the probability distribution of a **discrete** random variable.

$$p(x) = P(X=x)$$

Where:
- $p(x)$ is the probability that the random variable $X$ takes the specific value $x$

#### **Cumulative Distribution Function (CDF):**
- It is another method to describe the distribution of a random variable (either continuous or discrete).

$$ F_X (x) = P(X \leq x)$$

Where:
- $F_X(x)$ = the CDF of the random variable $X$, evaluated at $x$
- $X$ = a real-valued random variable
- $P(X \leq x)$ = the probability that $X$ takes a value less than or equal to $x$
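
The relationship between the density and the CDF can be sketched with the standard library's `NormalDist`. The standard normal is used here only as a convenient continuous example.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# P(a <= X <= b) = F(b) - F(a): here the probability of falling within 1 sd of the mean
p_within_one_sd = std_normal.cdf(1) - std_normal.cdf(-1)

# The probability of one exact value is an interval of zero width
p_single_point = std_normal.cdf(0.5) - std_normal.cdf(0.5)

# The density is the height of the curve, not a probability
density_at_zero = std_normal.pdf(0)

print(p_within_one_sd, p_single_point, density_at_zero)
```

For a discrete variable the same CDF definition applies, but it is built by summing PMF values instead of integrating a density.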

### Discrete Distribution

##### **1. Discrete Uniform distribution**
- Denoted as X ~ U (a, b)
- where X is a discrete random variable that follows uniform distribution ranging from a to b.
- Uniform distribution is when all the possible events are equally likely.
- Example:
    - Experiment of rolling a die
    - six possible events X = {1, 2, 3, 4, 5, 6} each having a probability of P(X) = 1/6.

Formulas for the support, PMF, CDF, mean, and variance of the discrete uniform distribution, where $n = b - a + 1$ is the number of possible values:

|   Term   |     Formula     |
|:-------------------|:---------------:|
|  Support | $k \in \{a, a + 1, \dots, b-1, b\}$ |
| PMF | $\frac{1}{n}$  | 
| CDF | $\frac{\lfloor k \rfloor - a + 1}{n}$  |
| Mean | $\frac{(a + b)}{2}$  | 
| Variance | $\frac{(n^2 - 1)}{12}$  |

Case Study: Lottery Number Simulation

A lottery system allows participants to pick a number between 1 and 6, inclusive, where each number has an equal chance of being selected. 
- This setup represents a discrete uniform distribution.

PMF:
- Since each outcome is equally likely, the probability for each number from 1 to 6 will be $\frac{1}{6} \approx 0.1667$.

CDF:
- The cumulative probabilities for the outcomes [1, 2, 3, 4, 5, 6] will increase incrementally as: [0.1667, 0.3333, 0.5, 0.6667, 0.8333, 1.0]

Mean:
- For a discrete uniform distribution:

$$ Mean = \frac{Low + High}{2}$$
$$ = \frac{1 + 6}{2}$$
$$ = 3.5 $$

Variance:
- For a discrete uniform distribution:

$$ Variance = \frac{(High - Low + 1)^2 - 1}{12}$$
$$ = \frac{(6 -1 + 1)^2 - 1}{12}$$
$$ = \frac{35}{12} $$
$$ \approx 2.92 $$


In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import randint

# 1. Define the parameters of the discrete uniform distribution
low, high = 1, 6  # Numbers range from 1 to 6, inclusive

# 2. Simulate the discrete uniform distribution
n_samples = 10000
samples = np.random.randint(low, high + 1, size=n_samples)

# 3. Calculate the PMF
pmf = [1 / (high - low + 1)] * (high - low + 1)  # Since it's uniform, all probabilities are equal
outcomes = np.arange(low, high + 1)

# 4. Calculate the CDF
cdf = np.cumsum(pmf)

# 5. Mean and Variance
mean = np.mean(samples)
variance = np.var(samples)

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(outcomes, pmf, color='skyblue', alpha=0.7)
plt.title("PMF of Discrete Uniform Distribution")
plt.xlabel("Outcomes")
plt.ylabel("Probability")
plt.xticks(outcomes)

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(outcomes, cdf, where='post', color='orange', label="CDF")
plt.title("CDF of Discrete Uniform Distribution")
plt.xlabel("Outcomes")
plt.ylabel("Cumulative Probability")
plt.xticks(outcomes)
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Simulated Mean: {mean:.2f}")
print(f"Simulated Variance: {variance:.2f}")


##### **2. Binomial distribution**
- Denoted as X ~ B(n, p),
- where X is a discrete random variable that follows a Binomial distribution with parameters n and p:
    - n is the number of trials,
    - p is the success probability for each trial.
- It is the probability distribution of the number of successes in a sequence of n independent trials.
    - A Binomial event indicates how many times a specific outcome can be expected.
- Each Binomial trial has two possible outcomes:
    - Success is denoted as 1, and the probability associated with it is p.
    - Failure is denoted as 0, and the probability associated with it is q = 1 - p.
- Examples: 
    - Success/Failure, 
    - Pass/Fail, 
    - Win/Lose.


|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $\binom{n}{k} p^k q^{n - k}$  | 
| CDF | $I_q(n - k, 1 + k)$, where $I$ is the regularized incomplete beta function  |
| Mean | $n \times p$  | 
| Variance | $n \times p \times q$  |

Case Study: Quality Control in Manufacturing
- A manufacturing plant produces light bulbs. We inspect a batch of 10 bulbs. Each light bulb has a: 
    - 90% probability of passing quality control (success) and a 
    - 10% probability of failing (failure). 

PMF: 
- The PMF provides the probability of having exactly k successes in n trials:
- For example, P(X=9) represents the probability that 9 out of 10 light bulbs pass quality control.

$$ P(X = k) = \binom{n}{k} p^k q^{n - k}$$

CDF: 
- The CDF provides the cumulative probability of having up to k successes:
$$ P(X \leq k) = \sum_{i=0}^{k} P(X = i) $$
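To make the PMF and CDF concrete for the light bulb batch, the sketch below evaluates both formulas directly with the standard library. The helper names `binom_pmf` and `binom_cdf` are illustrative assumptions, not functions from the original notebook.

```python
from math import comb

n, p = 10, 0.9  # batch size and pass probability from the case study

def binom_pmf(k, n, p):
    """P(X = k): exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k): sum of the PMF from 0 up to k."""
    return sum(binom_pmf(i, n, p) for i in range(k + 1))

p_exactly_9 = binom_pmf(9, n, p)
p_at_most_9 = binom_cdf(9, n, p)

print(f"P(X = 9)  = {p_exactly_9:.4f}")  # 0.3874
print(f"P(X <= 9) = {p_at_most_9:.4f}")  # 0.6513
```

The same values should be returned by `scipy.stats.binom.pmf(9, 10, 0.9)` and `scipy.stats.binom.cdf(9, 10, 0.9)`.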


Mean: For a Binomial distribution:
$$ Mean = n \times p$$
$$ = 10 \times 0.9 $$
$$ = 9 $$

Variance: For a Binomial distribution:

$$ Variance = n \times p \times (1 - p) $$
$$ = 10 \times 0.9 \times 0.1 $$
$$ = 0.9 $$

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

# 1. Define the parameters of the Binomial distribution
n = 10  # Number of trials (light bulbs in the batch)
p = 0.9  # Probability of success (passing quality control)

# 2. Simulate the Binomial distribution
n_samples = 10000
samples = np.random.binomial(n, p, size=n_samples)

# 3. Calculate the PMF
x = np.arange(0, n + 1)  # Possible outcomes: 0 to n successes
pmf = binom.pmf(x, n, p)

# 4. Calculate the CDF
cdf = binom.cdf(x, n, p)

# 5. Mean and Variance
mean = n * p  # Mean of a Binomial distribution
variance = n * p * (1 - p)  # Variance of a Binomial distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Binomial Distribution")
plt.xlabel("Number of Successes")
plt.ylabel("Probability")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Binomial Distribution")
plt.xlabel("Number of Successes")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


##### **3. Bernoulli distribution**
- Denoted as X ~ Bern(p),
- where X is a discrete random variable that follows a Bernoulli distribution with parameter p:
    - p is the probability of success.
- A Bernoulli trial is a Binomial experiment with a single trial (n = 1).
    - A Bernoulli event indicates which outcome can be expected for a single trial.
- Example: tossing a fair coin. The two possible outcomes are 
    - Heads and Tails. 
    - The probability (p) associated with each of them is 1/2.
- Example: for an unfair coin,
    - if Heads has probability p = 0.8, then Tails has probability q = 1 - p = 1 - 0.8 = 0.2.

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $p^k (1 - p)^{1 - k}$ for $k \in \{0, 1\}$, i.e. $q = 1 - p$ if $k = 0$ and $p$ if $k = 1$  | 
| CDF | $0$ if $k < 0$; $1 - p$ if $0 \leq k < 1$; $1$ if $k \geq 1$  |
| Mean | $p$  | 
| Variance | $p(1 - p) = p \times q$  |

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli

# 1. Define the parameters of the Bernoulli distribution
p = 0.5  # Probability of success (Heads)

# 2. Simulate the Bernoulli distribution
n_samples = 10000
samples = np.random.binomial(1, p, size=n_samples)  # Equivalent to Bernoulli

# 3. Calculate the PMF
x = [0, 1]  # Possible outcomes: 0 (Tails), 1 (Heads)
pmf = bernoulli.pmf(x, p)

# 4. Calculate the CDF
cdf = bernoulli.cdf(x, p)

# 5. Mean and Variance
mean = p  # Mean of a Bernoulli distribution
variance = p * (1 - p)  # Variance of a Bernoulli distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Bernoulli Distribution")
plt.xlabel("Outcomes (0: Tails, 1: Heads)")
plt.ylabel("Probability")
plt.xticks(x)
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Bernoulli Distribution")
plt.xlabel("Outcomes (0: Tails, 1: Heads)")
plt.ylabel("Cumulative Probability")
plt.xticks(x)
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


##### **4. Poisson Distribution**
- Denoted as X ~ Po(λ), 
- where X is a discrete random variable that follows a Poisson distribution with parameter λ:
    - λ is the expected rate of occurrences.
- It expresses the probability of a given number of events occurring in a fixed time interval.
- Examples: 
    - The number of diners at a restaurant on a given day.
    - Calls per hour at a call centre.

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PMF | $ \frac{\lambda^k e^{-\lambda}}{k!} $ | 
| CDF | $ e^{-\lambda} \sum^{\lfloor k \rfloor}_{i = 0} \frac{\lambda^i}{i!}$  |
| Mean | $ \lambda $  | 
| Variance | $ \lambda $  |

Case Study: Website Traffic
- A website receives an average of λ = 3 inquiries per minute. 
- The number of inquiries in any given minute can be modeled using a Poisson distribution.

PMF:
- For λ = 3 and k = 2:

$$ P(X = k ) = \frac{\lambda^k e^{-\lambda}}{k!}$$
$$ P(X = 2 ) = \frac{3^2 e^{-3}}{2!}$$
$$ \approx 0.224$$
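The PMF value for the website traffic example can be reproduced in a few lines. This is a minimal sketch using only the standard library; `poisson_pmf` is an illustrative helper name, and the cumulative probability is obtained by summing the PMF from 0 to 2.

```python
from math import exp, factorial

lam = 3  # average number of inquiries per minute

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson variable with rate lam."""
    return lam**k * exp(-lam) / factorial(k)

p_2 = poisson_pmf(2, lam)                             # exactly 2 inquiries
p_le_2 = sum(poisson_pmf(i, lam) for i in range(3))   # at most 2 inquiries

print(f"P(X = 2)  = {p_2:.3f}")     # 0.224
print(f"P(X <= 2) = {p_le_2:.3f}")  # 0.423
```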

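CDF: 
- For λ = 3 and k = 2:

$$ P(X \leq 2) = e^{-3} \left( \frac{3^0}{0!} + \frac{3^1}{1!} + \frac{3^2}{2!} \right) = 8.5 \, e^{-3} \approx 0.423 $$

Mean:
- Mean = λ = 3

Variance:
- Variance = λ = 3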

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

# 1. Define the parameter of the Poisson distribution
lam = 3  # Average rate (inquiries per minute)

# 2. Simulate the Poisson distribution
n_samples = 10000
samples = np.random.poisson(lam, size=n_samples)

# 3. Calculate the PMF
x = np.arange(0, 15)  # Possible outcomes (0 to 14 inquiries)
pmf = poisson.pmf(x, lam)

# 4. Calculate the CDF
cdf = poisson.cdf(x, lam)

# 5. Mean and Variance
mean = lam  # Mean of a Poisson distribution
variance = lam  # Variance of a Poisson distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PMF Plot
plt.subplot(1, 2, 1)
plt.bar(x, pmf, color='skyblue', alpha=0.7, label='PMF')
plt.title("PMF of Poisson Distribution")
plt.xlabel("Number of Inquiries")
plt.ylabel("Probability")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.step(x, cdf, where='post', color='orange', label='CDF')
plt.title("CDF of Poisson Distribution")
plt.xlabel("Number of Inquiries")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print("PMF:", pmf)
print("CDF:", cdf)
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


### Continuous Distributions

##### **1. Normal or Gaussian Distribution**
- Denoted as $X \sim N(\mu, \sigma^2)$, 
- where X is a continuous random variable that follows a Normal distribution with parameters μ and σ²:
    - μ is the mean. 
    - $\sigma^2$ is the variance.
- It describes the probability distribution of a continuous random variable that takes real values.
- Examples:
    - Heights of people, 
    - exam scores of students, 
    - IQ scores.
- The Normal distribution follows the 68-95-99.7 rule (empirical rule): 
    - about 68% of data lies within one standard deviation of the mean (μ ± σ), 
    - about 95% of data lies within two standard deviations (μ ± 2σ), and 
    - about 99.7% of data lies within three standard deviations (μ ± 3σ).
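The empirical rule can be verified numerically. For a Normal variable, the probability of falling within k standard deviations of the mean equals erf(k / √2), regardless of μ and σ. The sketch below uses only the standard library; `prob_within` is an illustrative helper name.

```python
from math import erf, sqrt

def prob_within(k):
    """P(|X - mu| <= k * sigma) for a Normal random variable X."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {prob_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```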

Properties of Normal distribution:
- The random variable takes values from -∞ to +∞.
- The probability associated with any single value is zero.
- The density looks like a bell curve and is symmetric about x = μ. 
    - 50% of data lies on the left-hand side and 
    - 50% of the data lies on the right-hand side.
- The area under the curve (AUC) = 1
- All the measures of central tendency coincide i.e., mean = median = mode

|   Term   |     Formula     |
|:-------------------|:---------------:|
| PDF | $ \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2} $ | 
| CDF | $ \frac{1}{2} \left[ 1 + \operatorname{erf}\left(\frac{x - \mu}{\sigma\sqrt{2}}\right) \right]$  |
| Mean | $ \mu $  | 
| Variance | $ \sigma^2 $  |

Case Study: Human Heights
- Assume the heights of adults in a population are normally distributed with:
    - μ = 170 cm (average height).
    - σ = 10 cm (standard deviation).

PDF:
- For μ = 170, σ = 10, and x = 180:

$$ f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$ 
$$ f(180) = \frac{1}{10 \sqrt{2\pi}} e^{-\frac{1}{2}} \approx 0.0242 $$

CDF:
- For x = 180, $P(X \leq 180) \approx 0.841$.
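The density and cumulative probability at x = 180 can be evaluated straight from the PDF and CDF formulas. This is a minimal standard-library sketch; `normal_pdf` and `normal_cdf` are illustrative helper names.

```python
from math import erf, exp, pi, sqrt

mu, sigma, x = 170, 10, 180  # case-study parameters

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return exp(-0.5 * z**2) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), written with the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

print(f"f(180)      = {normal_pdf(x, mu, sigma):.4f}")  # 0.0242
print(f"P(X <= 180) = {normal_cdf(x, mu, sigma):.3f}")  # 0.841
```

Note that the PDF value is a density, not a probability; only areas under the curve, such as the CDF, are probabilities.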

Mean:
- Mean = μ = 170

Variance: 
- Variance = $\sigma^2$ = 100

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# 1. Define the parameters of the Normal distribution
mu = 170  # Mean (average height in cm)
sigma = 10  # Standard deviation (spread of height in cm)

# 2. Simulate the Normal distribution
n_samples = 10000
samples = np.random.normal(mu, sigma, size=n_samples)

# 3. Calculate the PDF
x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 1000)  # Range of values (±4σ)
pdf = norm.pdf(x, mu, sigma)

# 4. Calculate the CDF
cdf = norm.cdf(x, mu, sigma)

# 5. Mean and Variance
mean = mu  # Mean of a Normal distribution
variance = sigma ** 2  # Variance of a Normal distribution

# 6. Visualization
plt.figure(figsize=(12, 6))

# PDF Plot
plt.subplot(1, 2, 1)
plt.plot(x, pdf, color='skyblue', label='PDF')
plt.title("PDF of Normal Distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Density")
plt.legend()

# CDF Plot
plt.subplot(1, 2, 2)
plt.plot(x, cdf, color='orange', label='CDF')
plt.title("CDF of Normal Distribution")
plt.xlabel("Height (cm)")
plt.ylabel("Cumulative Probability")
plt.legend()

plt.tight_layout()
plt.show()

# 7. Print results
print(f"Theoretical Mean: {mean:.2f}")
print(f"Theoretical Variance: {variance:.2f}")
print(f"Simulated Mean: {np.mean(samples):.2f}")
print(f"Simulated Variance: {np.var(samples):.2f}")


# Multivariate Analysis: Non-Graphical 

### Continuous - Continuous

##### **Covariance**

Covariance is a statistical measure of how two random variables vary together around their expected values (means).
- It is a measure of the linear relationship between two random variables. 
- It can take any positive or negative value.
    - Positive Covariance: 
        - It indicates that the two variables tend to move in the same direction: when one variable increases, the other tends to increase as well.
    - Zero Covariance: 
        - It indicates that there is no linear relationship between them.
    - Negative Covariance: 
        - It indicates that the two variables tend to move in opposite directions: when one variable increases, the other tends to decrease, and vice versa.

Formula:

$$ Cov(X, Y) = E[(X - E[X])(Y - E[Y])] $$

For a sample of n paired observations, the sample covariance is:

$$ Cov(X, Y) = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$

Limitations of Covariance
- The magnitude of the covariance does not indicate the strength of the relationship, because it depends on the units of the variables; only the sign (positive or negative) is directly interpretable.
- Covariance is not scale-invariant: if we convert or rescale the measurements of X and Y, then in general Cov(X', Y') ≠ Cov(X, Y), even though the underlying relationship is unchanged.
- Covariance does not capture non-linear relationships between two variables.
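The scale dependence of covariance is easy to demonstrate. The sketch below uses hypothetical height and weight data (the variable names and parameters are assumptions made for illustration): converting height from metres to centimetres multiplies the covariance by 100, while the Pearson correlation is unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: height in metres and a weight that depends on it
height_m = rng.normal(1.70, 0.10, size=200)
weight_kg = 60 * height_m + rng.normal(0, 5, size=200)

# Same heights expressed in centimetres
height_cm = 100 * height_m

cov_m = np.cov(height_m, weight_kg)[0, 1]
cov_cm = np.cov(height_cm, weight_kg)[0, 1]
corr_m = np.corrcoef(height_m, weight_kg)[0, 1]
corr_cm = np.corrcoef(height_cm, weight_kg)[0, 1]

print(f"Covariance  (m): {cov_m:.4f}   (cm): {cov_cm:.4f}")
print(f"Correlation (m): {corr_m:.4f}   (cm): {corr_cm:.4f}")
```

This is why correlation, which divides the covariance by both standard deviations, is usually preferred when comparing the strength of relationships.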

In [None]:
import numpy as np
import pandas as pd
from scipy.stats import pearsonr, spearmanr, kendalltau, pointbiserialr

# Step 1: Simulate Two Random Variables
np.random.seed(42)  # For reproducibility
n = 100  # Number of samples

# Variable X: Continuous Random Variable
X = np.random.normal(loc=50, scale=10, size=n)  # Mean=50, Std Dev=10

# Variable Y: Continuous Random Variable
Y = 0.5 * X + np.random.normal(loc=0, scale=5, size=n)  # Linear relationship with noise

# Convert to DataFrame for easier handling
data = pd.DataFrame({'X': X, 'Y': Y})

# Step 2: Covariance Calculation
# Covariance measures how two variables vary together.
cov_matrix = np.cov(X, Y)
covariance = cov_matrix[0, 1]

print(f"Covariance between X and Y: {covariance:.4f}")

# Step 3: Pearson Correlation Coefficient
# Measures linear correlation between X and Y.
pearson_corr, pearson_p_value = pearsonr(X, Y)
print(f"Pearson Correlation Coefficient: {pearson_corr:.4f}, p-value: {pearson_p_value:.4f}")

# Step 4: Spearman's Rank Correlation
# Measures monotonic relationship between variables.
spearman_corr, spearman_p_value = spearmanr(X, Y)
print(f"Spearman's Rank Correlation: {spearman_corr:.4f}, p-value: {spearman_p_value:.4f}")

# Step 5: Kendall's Tau Rank Correlation
# Measures ordinal association between variables.
kendall_corr, kendall_p_value = kendalltau(X, Y)
print(f"Kendall's Tau Correlation: {kendall_corr:.4f}, p-value: {kendall_p_value:.4f}")

# Step 6: Point Biserial Correlation
# Requires one continuous variable and one binary variable.
# Simulate a binary variable from X.
Z = (X > np.median(X)).astype(int)  # Binary variable based on X's median
point_biserial_corr, point_biserial_p_value = pointbiserialr(Z, Y)
print(f"Point Biserial Correlation: {point_biserial_corr:.4f}, p-value: {point_biserial_p_value:.4f}")


### Correlation

For this analysis, we can **determine the relationship between any two numerical features** by calculating the **correlation coefficient**. 
- Correlation is a measure of the degree to which two variables change together, if at all. 
    - If two features have a strong positive correlation, it means that if the value of one feature increases, the value of the other feature also increases. 
    - There are three different correlation measures:
        - Pearson correlation 
        - Spearman rank correlation
        - Kendall correlation

For this lesson, we will focus on the **Pearson correlation**. The Pearson correlation measures the linear relationship between features and assumes that the features are normally distributed. Below is a table that explains how to interpret the Pearson correlation measure:

|   Pearson Correlation Coefficient (r)  |       Description of Relationship     |
|:-------------------|:---------------:|
| r = -1 | Perfect Negative Correlation |
| -1 < r ≤ -0.8 | Strong Negative Correlation |
| -0.8 < r ≤ -0.5 | Moderate Negative Correlation |
| -0.5 < r < 0 | Weak Negative Correlation |
| r = 0 | No Linear Correlation |
| 0 < r < 0.5 | Weak Positive Correlation |
| 0.5 ≤ r < 0.8 | Moderate Positive Correlation |
| 0.8 ≤ r < 1 | Strong Positive Correlation |
| r = 1 | Perfect Positive Correlation |
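The interpretation table can be turned into a small helper for labelling coefficients. This is a sketch only: the function name `describe_pearson` is hypothetical, and assigning the boundary values ±0.5 and ±0.8 to the stronger category is an assumption.

```python
def describe_pearson(r):
    """Return the table's description for a Pearson coefficient r."""
    if not -1 <= r <= 1:
        raise ValueError("r must lie in [-1, 1]")
    if r == 0:
        return "No Linear Correlation"

    direction = "Positive" if r > 0 else "Negative"
    size = abs(r)
    if size == 1:
        strength = "Perfect"
    elif size >= 0.8:   # assumption: 0.8 itself counts as strong
        strength = "Strong"
    elif size >= 0.5:   # assumption: 0.5 itself counts as moderate
        strength = "Moderate"
    else:
        strength = "Weak"
    return f"{strength} {direction} Correlation"

print(describe_pearson(0.65))   # Moderate Positive Correlation
print(describe_pearson(-0.92))  # Strong Negative Correlation
```

These thresholds are conventions rather than fixed rules, so the labels should be read as a rough guide.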


<div align="left" style="width: 800px; text-align: left;">
<img src="https://github.com/Explore-AI/Pictures/blob/f3aeedd2c056ddd233301c7186063618c1041140/regression_analysis_notebook/pearson_corr.jpg?raw=True"
     alt="Pearson Correlation"
     style="padding-bottom=0.5em"
     width=800px/>
</div>

For a more detailed explanation of correlations, read <a href="https://medium.com/fintechexplained/did-you-know-the-importance-of-finding-correlations-in-data-science-1fa3943debc2#:~:text=Correlation%20is%20a%20statistical%20measure,to%20forecast%20our%20target%20variable.&text=It%20means%20that%20when%20the,variable(s)%20also%20increases.">here</a>.

The command we will use to determine the correlation between features is:

    - df.corr()

In [None]:
df.corr()

# Multivariate Analysis: Graphical

For the multivariate graphical analysis the following visualisations will be considered:

    - Heatmap
    - Scatter Plot
    - Pair Plot
    - Joint Plot
    - Bubble Plot
    
#### Heatmap

The relationship between features can also be displayed graphically using a **heatmap**. The Seaborn library will be used for this basic heatmap visualisation. 

To see how different heatmap variations can be created, read <a href="https://medium.com/@szabo.bibor/how-to-create-a-seaborn-correlation-heatmap-in-python-834c0686b88e">here</a>.

The colour scale of the heatmap is fixed to the full range of the correlation coefficient using the `vmin` and `vmax` parameters, and `annot=True` displays the coefficient value in each cell.

In [None]:
heatmap = sns.heatmap(df.corr(), vmin=-1, vmax=1, annot=True)
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12)

#### Scatter Plot

A Scatter plot is used to visualise the relationship between two different features and is most likely the primary multivariate graphical method. For this exercise, we will create a scatter plot to determine if there is a relationship between `bmi` and `age`. The parameter `hue` is set to the feature `insurance_claim`, colouring the points according to whether or not a claim was submitted.

In [None]:
sns.scatterplot(x='age',y='bmi',hue='insurance_claim', data=df)

#### Pair Plot

A pair plot can be used to visualise the relationships between all the numerical features at the same time. 

The `hue` is once again set to the feature `insurance_claim` to indicate which data points submitted an insurance claim and which didn't.

In [None]:
sns.set_style("whitegrid")
sns.pairplot(df, hue="insurance_claim")
plt.show()

#### Joint Plot

The joint plot can be used to provide univariate and multivariate analyses at the same time. The central part of the plot will be a scatter plot comparing two different features. The top and right visualisations will display the distribution of each feature as a histogram. 

For this joint plot, we will once again compare `age` and `bmi`.

In [None]:
sns.jointplot(x = 'age', y = 'bmi', data = df)

# including the hue as insurance_claim
sns.jointplot(x = 'age', y = 'bmi', data = df, hue='insurance_claim')

#### Bubble Plots

A bubble plot is a variation of a scatter plot. Bubbles vary in size, dependent on another feature in the data. The same applies to the colour of the bubbles, which can be set to vary with the values of another feature. This way, we can visualise up to four dimensions/features at the same time.

For this bubble plot, `bmi` and `claim_amount` will be plotted on the x-axis and y-axis, respectively. The colours of the bubbles will vary based on whether the observation is a `smoker` or not, and lastly, the size of the bubbles will vary based on the number of `children` the observation has. We will create this bubble plot by using `seaborn`'s scatter plot.

In [None]:
plt.figure(figsize=(12,8))
sns.scatterplot(x="bmi", 
                y="claim_amount",
                size="children",
                sizes=(20,100),
                alpha=0.8,
                hue="smoker",
                data=df)

## Splitting the Data
### Two-Way Split

When fitting a machine learning model to some data, we ultimately intend to use that model to make predictions/forecasts on real-world data. 
- Real-world data is unseen - it doesn't exist in the dataset we have at our disposal - so in order to validate our model (check how well it performs), we need to test it on unseen data too.
- Gathering unseen data is not as simple as collecting it from outside the window and exposing it to the model: any new data would need to be 
    - cleaned, 
    - wrangled and 
    - annotated just like the data in our dataset.
- The next best thing, then, is to simulate some unseen data, which we can do using the existing dataset by splitting it into two sets:
    - One for training the model; and
    - A second for testing it.
   
We fit a model using the training data, and then assess its accuracy using the test set. A common choice is an 80/20 split:
- use 80% of the data for training: 
    - the training set will contain 80% of the rows, or data points;
- keep 20% aside for testing: 
    - the remaining 20% of rows will be in the test set.

These rows are selected at random, to ensure that the mix of data in the train set is as close as possible to the mix in the test set.

### Three-Way Split

Many academic works on machine learning talk about splitting the dataset into three distinct parts: 
- `train`: 
    - the training set is used to fit the model to the observations.
- `validation`: 
    - during the model tuning process, where hyperparameters are tweaked and decisions about the dataset are made, the validation set is used to test the performance of the model.
- `test`: 
    - once the model designer is satisfied with the performance of the model on the validation set, the previously unseen test set is brought out and used to provide an unbiased evaluation of the final model fit on the training dataset.
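One way to obtain a three-way split is to call scikit-learn's `train_test_split` twice: first to set aside the test set, then to carve a validation set out of what remains. The sketch below uses placeholder data, and the 60/20/20 proportions are an assumption chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 100 rows with a single feature
X = np.arange(100).reshape(-1, 1)
y = np.arange(100)

# First split: hold out 20% of all rows as the test set
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=50
)

# Second split: 25% of the remaining 80% becomes the validation set,
# which is 20% of the original data
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=50
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

The test set is touched only once, after all tuning decisions have been made on the validation set.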

#### Caveats for using a validation set

On small datasets, it may not be feasible to include a validation set for the following reasons, both of which should be intuitive:

- The model may need every possible data point to adequately determine model values;
- For small enough test sets, the uncertainty of the test-set estimate can be large, to the point where different test sets may produce very different results.

Clearly, further splitting the training data into training and validation sets would remove precious observations for the training process.

### Cross-Validation

If the designer does not want to use a separate validation set, or there is simply not enough data, 
- a technique known as cross-validation may be used. 

A common version of cross-validation is known as K-fold cross-validation: 
- the training data is divided into K folds of roughly equal size; in each of K rounds, one fold (10% of the data when K = 10) is held back and effectively used as a validation set while the model parameters are calculated on the remaining folds, and the K validation scores are then averaged.
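The mechanics of K-fold cross-validation can be sketched by hand with NumPy. The data below is hypothetical, a straight-line fit via `np.polyfit` stands in for the model, and K = 5 is an arbitrary choice; in practice, scikit-learn's `KFold` and `cross_val_score` offer the same procedure.

```python
import numpy as np

rng = np.random.default_rng(50)

# Hypothetical data: a noisy linear relationship
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 1, size=50)

k = 5
# Shuffle the row indices and cut them into k folds
folds = np.array_split(rng.permutation(len(x)), k)

scores = []
for i, val_idx in enumerate(folds):
    # Train on every fold except the i-th, validate on the i-th
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)
    pred = slope * x[val_idx] + intercept
    scores.append(np.mean((y[val_idx] - pred) ** 2))  # validation MSE

print(f"Mean validation MSE over {k} folds: {np.mean(scores):.3f}")
```

Every row is used for validation exactly once, so no data is permanently withheld from training.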

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib import rc
import seaborn as sns

# Import the split function from sklearn
from sklearn.model_selection import train_test_split

In [None]:
# Split the dataset into the response, y, and features, X
y = df['ZAR/USD']
X = df.drop('ZAR/USD', axis=1)

Understand the four parameters to hand to the splitting function.

- `X` contains the features on which we will be training the model. In this case: just `exports`;
- `y` is the response variable, that which we are trying to predict. In this case: `exchange rate`;
- `test_size` is a value between 0 and 1: the proportion of our dataset that we want to be used as test data. Typically 0.2 (20%);
- `random_state` is an arbitrary value which, when set, ensures that the _random_ nature in which rows are picked to be in the test set is the same each time the split is carried out. In other words, the rows are picked at random, but we can ensure these random picks are repeatable by using the same value here. This makes it easier to assess model performance across iterations.

In [None]:
#  Call the train_test_split function:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=50)

Plotting the data points in each of the training and testing sets in different colours, we should be able to see that we have a similar spread of data in each.

In [None]:
# Plot the results
plt.scatter(X_train, y_train, color='green', label='Training')  # plot the training data in green
plt.scatter(X_test, y_test, color='darkblue', label='Testing')  # plot the testing data in blue
plt.legend()
plt.show()

## Advanced plotting
Let's try and create something a little more visually appealing than the two plots above.

- We'll plot both dependent data series on the same graph;
- We'll assign two separate y-axes: one for each series;
- We'll display a legend near the top of the plot.

In [None]:
rc('mathtext', default='regular')
# Create blank figure
fig = plt.figure()

# Split figure to allow two sets of y axes
ax = fig.add_subplot(111)

# Plot the first line on its axis
ax.plot(np.arange(len(df.Y)), df.Y, '-', label = 'ZAR/USD', color='orange')

# Create second y axis and plot second line
ax2 = ax.twinx()
ax2.plot(np.arange(len(df.X)), df.X, '-', label = 'Exports (ZAR)')

# Add legends for each axis
ax.legend(loc=2)
ax2.legend(loc=9)

ax.grid()

# Set labels of axes
ax.set_xlabel("Months")
ax.set_ylabel("ZAR/USD")
ax2.set_ylabel("Exports (ZAR, millions)")
plt.show()

### Step 3: Select the Type of Statistical Model
Statistical models can be broadly categorized as:

- **Descriptive Models**: Summarize data patterns.
- **Inferential Models**: Help make inferences about the population.
- **Predictive Models**: Used to predict future outcomes based on historical data.
- **Prescriptive Models**: Suggest actions based on predictions.

Let's go through common types of statistical models and their applications.

# Regression Analysis
 
Regression Analysis is a statistical method to analyze the relationship between a dependent variable and one or more independent variables.

Use regression analysis for one of two purposes: 
- predict the value of the dependent variable when you know the independent variables, or 
- estimate the effect of an independent variable on the dependent variable.

### Types of regression analysis

- **Simple linear regression**
    - Assumes a linear connection between a dependent variable (Y) and an independent variable (X).
    - The linear regression model is simple when it has 
        - only one dependent and one independent variable.
    - A real estate agent wants to determine the relationship between the size of a house (in square feet) and its selling price. They can use simple linear regression to predict the selling price of a house based on its size.
    
-  **Multiple Linear Regression / Multivariate Linear Regression**
    - Assumes a linear connection between a dependent variable (Y) and two or more independent variables (X1, X2, ...).
    - The linear regression model can be more complex: 
        - multiple linear regression has one dependent variable and more than one independent variable;
        - multivariate linear regression has more than one dependent variable.
    - A car manufacturer wants to predict the fuel efficiency of their vehicles based on various independent variables such as engine size, horsepower, and weight.
    
- **Logistic regression**
    - Used when the dependent variable is discrete, most commonly binary:
        - the target variable can take on only one of two values. 
    - The sigmoid curve represents its connection to the independent variables, and the predicted probability has a value between 0 and 1.
    - A bank wants to predict whether a customer will default on their loan based on their credit score, income, and other factors. By using logistic regression, the bank can estimate the probability of default and take appropriate measures to minimize their risk.

- **Polynomial Regression**
    - Represents a non-linear relationship between dependent and independent variables. 
    - This technique is a variant of the multiple linear regression model, but the best fit line is curved rather than straight.

- **Ridge Regression**
    - Applied when the independent variables are highly correlated.
        - When data exhibits multicollinearity
    - Under multicollinearity, least squares estimates remain unbiased, but their variances are large, so the estimated coefficients can be far from the true values. 
    - Ridge regression reduces standard errors by adding a small amount of bias to the regression estimates.
    - The penalty parameter lambda (λ) in the ridge regression equation controls the amount of shrinkage and thereby mitigates the multicollinearity problem.

- **Lasso Regression**
    - The lasso regression (Least Absolute Shrinkage and Selection Operator) technique penalizes the absolute magnitude of the regression coefficients. 
    - Lasso performs variable selection: the penalty can shrink some coefficient values exactly to zero.

- **Quantile Regression**
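    - The quantile regression approach is an extension of the linear regression technique that models conditional quantiles (such as the median) of the dependent variable rather than its mean. 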
    - Statisticians and econometricians employ quantile regression when linear regression requirements are not met or when the data contains outliers.

- **Bayesian Linear Regression**
    - Bayesian linear regression, a form of regression analysis also used in machine learning, estimates the regression coefficients using Bayes' theorem. 
    - Rather than determining a single least-squares estimate, this technique determines the posterior distribution of the model parameters.
    - The approach can be more stable than ordinary linear regression, for example with small samples. 

- **Principal Components Regression**
    - Multicollinear regression data is often evaluated using the principal components regression approach. 
    - The principal components regression approach, like ridge regression, reduces standard errors by biasing the regression estimates. 
    - First, principal component analysis (PCA) modifies the training data, and then the resulting transformed samples train the regressors.

- **Partial Least Squares Regression**
    - The partial least squares regression technique is a fast and efficient covariance-based regression analysis technique. 
    - It is advantageous for regression problems with many independent variables with a high probability of multicollinearity between the variables. 
    - The method reduces the number of variables to a manageable number of predictors, then uses them in regression.

- **Elastic Net Regression**
    - Elastic net regression combines ridge and lasso regression techniques that are particularly useful when dealing with strongly correlated data. 
    - It regularizes regression models by utilizing the penalties associated with the ridge and lasso regression methods.
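The different behaviour of the ridge, lasso, and elastic net penalties can be seen by fitting them to the same data. The sketch below uses scikit-learn on synthetic data with two nearly identical features; the data, the seed, and the `alpha` values are assumptions chosen for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200

# Synthetic features: column 1 is almost a copy of column 0 (multicollinearity),
# and only columns 0 and 2 actually influence the response
X = rng.normal(size=(n, 5))
X[:, 1] = X[:, 0] + rng.normal(scale=0.05, size=n)
y = 3.0 * X[:, 0] + 2.0 * X[:, 2] + rng.normal(scale=0.5, size=n)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=10.0),                          # L2 penalty
    "Lasso": Lasso(alpha=0.5),                           # L1 penalty
    "ElasticNet": ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name:>10}: {np.round(model.coef_, 2)}")
```

With data like this, the ridge coefficients are typically smaller and more evenly shared between the two correlated columns than the least-squares ones, while lasso tends to set the coefficients of irrelevant columns exactly to zero.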


### Complete Workflow for Regression Modeling
Steps of a regression modeling process, covering:
- Exploratory Data Analysis (EDA), 
- assumption checking, 
- data transformations, 
- model fitting, and 
- interpretation.

**Step 1: Problem Definition and Data Understanding**

1. Define the Problem:
- Identify the dependent (response) variable and independent (predictor) variables.
- Clarify objectives
    - prediction, 
    - inference,
    - explanation.
2. Understand the Data:
- Review the dataset's structure, variable types, and context.

In [None]:
# missing values
def missing_values_table(df):
        mis_val = df.isnull().sum()
        
        mis_val_percent = 100 * df.isnull().sum() / len(df)
        
        mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
        
        mis_val_table_ren_columns = mis_val_table.rename(
        columns = {0 : 'Missing Values', 1 : '% of Total Values'})
        
        mis_val_table_ren_columns = mis_val_table_ren_columns[
            mis_val_table_ren_columns.iloc[:,1] != 0].sort_values(
        '% of Total Values', ascending=False).round(1)
        
        print ("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"      
            "There are " + str(mis_val_table_ren_columns.shape[0]) +
              " columns that have missing values.")
        
        # Return the dataframe with missing information
        return mis_val_table_ren_columns

train.head()
train.info()
train.shape

missing_values_table(train)

**Step 2: Exploratory Data Analysis (EDA)**

1. Summary Statistics:
Compute 
- mean, 
- median, 
- standard deviation, and 
- correlations.

In [None]:
print(df.describe())
print(df.corr(numeric_only=True))  # correlations are defined for numeric columns only

2. Visualization:
- Histogram for distributions.
- Scatter plots for relationships.
- Box plots for detecting outliers.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

sns.pairplot(df, diag_kind="kde")
plt.show()


3. Check Multicollinearity:

- Compute the Variance Inflation Factor (VIF).

In [None]:
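import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF should be computed on a design matrix that includes the intercept
X_vif = sm.add_constant(df[["SquareFootage", "Bedrooms", "LocationIndex"]])
vif = pd.DataFrame()
vif["Features"] = X_vif.columns
vif["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif[vif["Features"] != "const"])  # the intercept's own VIF is not of interest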


**Step 3: Preprocessing and Transformations**

1. Handle Missing Data:
- Impute missing values or drop rows/columns.

In [None]:
df.fillna(df.median(numeric_only=True), inplace=True)  # median imputation for numeric columns

2. Encode Categorical Variables:
- Use one-hot encoding or label encoding.

In [None]:
df = pd.get_dummies(df, columns=["Location"], drop_first=True)

3. Feature Scaling:
- Standardize or normalize numerical variables.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
df_scaled = scaler.fit_transform(df[["SquareFootage", "Bedrooms"]])

4. Transform Non-linear Relationships:
- Apply log, Box-Cox, or square root transformations for skewed variables.

In [None]:
import numpy as np
from scipy.stats import boxcox

# Both transformations require strictly positive values
df["LogPrice"] = np.log(df["Price"])
df["BoxCoxPrice"], _ = boxcox(df["Price"])

**Step 4: Model Fitting and Assumption Checking**
1. Fit the Regression Model:

In [None]:
import statsmodels.api as sm

X = sm.add_constant(df[["SquareFootage", "Bedrooms"]])  # Add intercept
Y = df["Price"]
model = sm.OLS(Y, X).fit()
print(model.summary())

2. Check Model Assumptions:

(a) Linearity: 

Residuals vs. Fitted Plot: Look for randomness.

In [None]:
plt.scatter(model.fittedvalues, model.resid)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residuals vs. Fitted")
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
plt.show()

(b) Normality of Residuals:

Use a histogram and Q-Q plot.

In [None]:
import scipy.stats as stats

stats.probplot(model.resid, dist="norm", plot=plt)
plt.title("Q-Q Plot")
plt.show()

(c) Homoscedasticity:

Breusch-Pagan test.

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

bp_test = het_breuschpagan(model.resid, X)
print(f"p-value: {bp_test[1]}")

(d) Multicollinearity:

Variance Inflation Factor (VIF) as shown earlier.

In [None]:
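from statsmodels.stats.outliers_influence import variance_inflation_factor

# Use a separate name so the regression design matrix X (with its intercept) is not overwritten
X_vif = sm.add_constant(df[["SquareFootage", "Bedrooms", "LocationIndex"]])
vif = pd.DataFrame()
vif["Features"] = X_vif.columns
vif["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]
print(vif[vif["Features"] != "const"])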

**Step 5: Address Issues and Refine the Model**

1. Linearity:

Transform variables if residual plots show non-linearity.

In [None]:
df["SquareFootage_Sq"] = df["SquareFootage"] ** 2

2. Non-Normal Residuals:

Apply transformations to the dependent variable.

In [None]:
df["LogPrice"] = np.log(df["Price"])
model_log = sm.OLS(df["LogPrice"], X).fit()

3. Heteroscedasticity:

Use Weighted Least Squares (WLS).

In [None]:
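# Estimate how the error variance changes with X, then weight each row by its inverse
aux = sm.OLS(np.log(model.resid ** 2), X).fit()
weights = 1 / np.exp(aux.fittedvalues)
model_wls = sm.WLS(Y, X, weights=weights).fit()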

4. Multicollinearity:
- Drop or combine highly correlated variables.
- Use PCA, Ridge, or Lasso regression.

In [None]:
from sklearn.linear_model import Ridge

ridge = Ridge(alpha=1.0)
ridge.fit(X, Y)

**Step 6: Evaluate Model Performance**
- Metrics: 
    - $R^2$
    - Adjusted $R^2$
    - RMSE, 
    - MAE.

- Residual Plots:
    - Confirm residuals are normally distributed and homoscedastic.

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error

rmse = np.sqrt(mean_squared_error(Y, model.predict(X)))
mae = mean_absolute_error(Y, model.predict(X))
print(f"RMSE: {rmse}, MAE: {mae}")
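
The metrics list above also names adjusted $R^2$, which the cell does not compute. The sketch below shows one way to calculate it by hand; the function name `adjusted_r2` and the small arrays are hypothetical, chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    tss = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    r2 = 1 - rss / tss
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical observed values and predictions from a model with two predictors
y_obs = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])
y_hat = np.array([3.2, 4.9, 7.1, 8.8, 11.3, 12.9])

adj = adjusted_r2(y_obs, y_hat, n_predictors=2)
print(f"Adjusted R^2: {adj:.4f}")
```

Unlike plain $R^2$, the adjusted version falls when a predictor is added that does not reduce the residual sum of squares enough to justify the lost degree of freedom.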

**Step 7: Interpretation and Communication**

Coefficient Interpretation:
- For each predictor, interpret its coefficient in terms of the dependent variable.

Confidence Intervals:
- Report 95% confidence intervals for coefficients.

Visualize Results:

In [None]:
import seaborn as sns

sns.regplot(x="SquareFootage", y="Price", data=df, line_kws={"color": "red"})
plt.title("Regression Line: Square Footage vs. Price")
plt.show()

# Linear Regression

Linear Regression is a supervised learning algorithm used to model the relationship between a dependent variable (outcome) and one or more independent variables (predictors).
- Predicts the relationship between two variables by assuming they have a straight-line connection. 

Linear Regression predicts a continuous target variable (e.g., the number of readmissions) by minimizing the residual sum of squares between observed and predicted values.
- It finds the best line that minimizes the differences between predicted and actual values.

## 1. Simple Linear Regression

In a simple linear regression, there is 
- one independent variable and 
- one dependent variable. 

The model estimates the slope and intercept of the line of best fit, which represents the relationship between the variables. 
- The slope represents the change in the dependent variable for each unit change in the independent variable, while 
- The intercept represents the predicted value of the dependent variable when the independent variable is zero.

What It Means: 
- Linear regression models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. 
- It assumes a straight-line relationship. 
- It describes the relationship between the independent (predictor) variable on the x-axis and the dependent (output) variable on the y-axis with a straight line, hence the name linear regression.

- Simple linear regression links a dependent variable to a single independent variable. 
    - A linear equation defines the relationship: the slope gives the effect of the independent variable on the dependent variable, and the intercept gives the baseline level.
        - An independent variable is the variable that is controlled in a scientific experiment to test the effects on the dependent variable.
        - A dependent variable is the variable being measured in a scientific experiment.

Outcome Interpretation: 
- Each coefficient represents how much the dependent variable (outcome) changes when the predictor variable changes by one unit, keeping all else constant.

**Assumptions of Linear Regression**

Regression is a parametric approach, which means that it makes assumptions about the data.

For successful regression analysis, it's essential to validate the following assumptions.

- Linearity (Linear Relationship): The relationship between the predictors and the outcome is linear.
    - Plot the dependent variable against each independent variable and look for a linear relationship.
- Independence of Errors (No or Little Autocorrelation): Residuals (errors) are independent of each other.
    - The error terms should not depend on one another (as they can in time-series data, where the next value depends on the previous one). 
    - There should be no correlation between the residual terms.
    - Correlation between successive error terms is known as autocorrelation; the assumption is that it is absent.
- Normality of Errors: Residuals are normally distributed.
    - The residuals should follow a normal distribution with a mean equal to zero or close to zero. 
    - This matters mainly for inference: confidence intervals and hypothesis tests on the coefficients rely on it. 
    - If the error terms are non-normally distributed, it suggests that there are a few unusual data points that must be studied closely to make a better model.
- Multivariate Normality
- No or Little Multicollinearity: The predictors are not highly correlated with one another.
- Homoscedasticity: Variance of residuals is constant across all levels of predictors.
    - The error terms must have constant variance. 
    - The presence of non-constant variance in the error terms is referred to as Heteroscedasticity. 
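
One common way to check the independence assumption is the Durbin-Watson statistic, which is close to 2 for uncorrelated residuals and moves toward 0 under positive autocorrelation. The sketch below computes it by hand on simulated residuals; the helper name `durbin_watson_stat` and both residual series are assumptions made for illustration.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(42)

# Case 1: independent residuals (white noise)
independent = rng.normal(size=500)

# Case 2: positively autocorrelated residuals, each depending on the previous one
noise = rng.normal(size=500)
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + noise[t]

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)
print(f"Independent residuals: DW = {dw_independent:.2f}")
print(f"Autocorrelated residuals: DW = {dw_correlated:.2f}")
```

statsmodels provides the same statistic as `statsmodels.stats.stattools.durbin_watson`, which can be applied directly to the residuals of a fitted model.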

Performance Measures:
- R-squared: Indicates the proportion of the variance in the dependent variable explained by the independent variables. 
    - Values closer to 1 indicate a better fit.
- Mean Squared Error (MSE): The average squared difference between observed and predicted values; lower values are better.
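
To make the two measures concrete, the sketch below computes them by hand for a close fit and for a baseline that always predicts the mean. The function names and the toy arrays are hypothetical.

```python
import numpy as np

def mse(y_true, y_pred):
    """Average squared difference between observed and predicted values."""
    return np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)

def r_squared(y_true, y_pred):
    """Share of the variance in y_true explained by the predictions."""
    y_true = np.asarray(y_true, dtype=float)
    rss = np.sum((y_true - np.asarray(y_pred, dtype=float)) ** 2)
    tss = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - rss / tss

y_true = np.array([2.0, 4.0, 6.0, 8.0])
good_fit = np.array([2.1, 3.9, 6.2, 7.8])           # predictions close to the data
mean_only = np.full_like(y_true, y_true.mean())     # baseline: always predict the mean

print("Good fit:  R^2 =", r_squared(y_true, good_fit), " MSE =", mse(y_true, good_fit))
print("Mean only: R^2 =", r_squared(y_true, mean_only), " MSE =", mse(y_true, mean_only))
```

The mean-only baseline scores an $R^2$ of zero by construction, which is why $R^2$ can be read as the improvement over simply predicting the average.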

Lay Explanation: 
- Think of linear regression like drawing a best-fit line through a scatterplot of data points, aiming to predict outcomes based on relationships in the data.
- Finds a relationship between independent and dependent variables by finding a "best-fitted line" that has minimal distance from all the data points.
- The algorithm explains the linear relationship between the dependent (output) variable y and the independent (predictor) variable X using a straight line.

Use Case: 
- When there is a linear relationship between the target and predictor variables.

### Mathematics of Linear Regression

- The least squares method finds the linear equation that minimizes the sum of squared residuals (SSR).
- Cost Function:

$ J(\theta) = \frac{1}{2m}\sum^{m}_{i=1}(h_{\theta}(x^{(i)})- y^{(i)})^{2}$

Model Equation:
$ y=\beta_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n}+ \epsilon $

where:
- $y$ = dependent variable
- $\beta_{0}$ = Y intercept / constant
- $\beta_{1}$ = slope coefficient of $x_{1}$
- $x_{1}$ = independent variable
- $\epsilon$ = error term

**What is Cost Function ?**

The goal of the linear regression algorithm is to get the best values for $\beta_{0}$ and $\beta_{1}$ to find the **best-fit line**.
- The best-fit line is the line with the least error, meaning the error between predicted values and actual values is at a minimum.

A cost function is closely related to two other terms: 
- loss function: used when we refer to the error for a single training example. 
- objective function: used to refer to the quantity being optimized, typically the average of the loss over the entire training dataset.

It quantifies the difference between predicted and actual values, serving as a metric to evaluate the performance of a model.

Objective 
- is to minimize the cost function, indicating better alignment between predicted and observed outcomes.
- Guides the model towards optimal predictions by measuring its accuracy against the training data.
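
As a small illustration of the cost function $J(\theta)$ defined above, the sketch below evaluates it for a one-predictor model $h_\theta(x) = \theta_0 + \theta_1 x$. The function name `cost` and the toy data are assumptions for this example.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta) = 1/(2m) * sum((h_theta(x_i) - y_i)^2) for a single predictor."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

# Toy data lying exactly on the line y = 1 + 2x
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

print("Cost at the true parameters (1, 2):", cost(1.0, 2.0, x, y))
print("Cost with the intercept off by one (0, 2):", cost(0.0, 2.0, x, y))
```

The cost is zero only when every prediction matches its observation, and it grows as the parameters move away from the best-fit values. This is what an optimizer exploits when it searches for the minimum.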

Random Error (Residuals)
- The difference between the observed value of the dependent variable ($y_{i}$) and the predicted value ($\hat{y}_{i}$) is called the residual.
    - $\epsilon_{i} = y_{i} - \hat{y}_{i}$

where $\hat{y}_{i} = \beta_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n}$

**Why to use a Cost function**

The cost function helps us reach the optimal solution, that is, work out the optimal values for $\beta_{0}$ and $\beta_{1}$. 
- How: It takes both the outputs predicted by the model and the actual outputs and calculates how wrong the model was in its prediction.
    - It measures the discrepancy between the model's predictions and the true values it is attempting to predict. 
    - This discrepancy is summarized as a single number, enabling us to measure the model's **precision**.
- The cost function is the technique of evaluating "the performance of our algorithm/model".

The same idea applies to classification: several classifiers can have very high accuracy, but one solution (classifier) is the best because it does not misclassify any point.
- The reason it classifies all the points perfectly is that the:
    - line is almost exactly in between the two (n) groups, and not closer to any one of the groups.

Explanation of the function of a cost function:

- Error calculation: It determines the difference between the predicted outputs (what the model predicts as the answer) and the actual outputs (the true values we possess for the data).
- Gives one value: This simplifies comparing the model‚Äôs performance on various datasets or training rounds.
- Guides improvement: The objective is to reduce the cost function. 
    - How: By modifying the internal parameters of the model, such as weights and biases, we can minimize the total error and improve the accuracy / precision of the model.

**Types of Cost function in machine learning**

Its use cases depend on whether it is a regression problem or classification problem.
- Regression cost Function
- Binary Classification cost Functions
- Multi-class Classification cost Functions

### Problem Context: Predicting Hospital Readmission Rates
The aim is to reduce hospital readmission rates. 
- High readmission rates can strain resources and negatively impact patient outcomes.
- The goal is to predict the number of readmissions within 30 days of discharge for a particular condition, such as 
    - diabetes, based on 
        - patient demographic, 
        - clinical data, and 
        - treatment data.

**Step 1. Define the Problem**

We want to predict the number of readmissions ($Y$) using features ($X$) such as:
- Patient age
- Length of hospital stay
- Severity of condition
- Medication adherence rate
- Comorbidities (e.g., hypertension, kidney disease)
- Number of follow-up visits scheduled

**Step 2. Collect and Prepare Data**

- Data Collection: Gather historical patient data from the hospital's database.
- Understand the 
    - model description
    - causality and 
    - directionality
- Check the data
    - categorical data, 
    - missing data and 
    - outliers
- Data Cleaning: 
    - Dummy variable takes only the value 0 or 1 to indicate the effect for categorical variables.
    - Handle missing values, 
    - remove duplicates, and 
    - correct errors.
    - Outlier is a data point that differs significantly from other observations. 
        - use standard deviation method and 
        - interquartile range (IQR) method.
- Feature Engineering: 
    - Encode categorical variables (e.g., age group), 
    - scale continuous variables (e.g., length of stay), and 
    - create interaction terms if necessary.
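
The two outlier rules named in the data-cleaning step can be sketched as follows. The function names, the thresholds, and the `length_of_stay` values are hypothetical, chosen so that one observation is clearly extreme.

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag points more than k * IQR below the first or above the third quartile."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

def std_outliers(values, n_std=3.0):
    """Flag points more than n_std standard deviations away from the mean."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > n_std

# Hypothetical lengths of stay in days; the last value is far from the rest
length_of_stay = np.array([2, 3, 3, 4, 4, 5, 5, 6, 7, 45])

iqr_flags = iqr_outliers(length_of_stay)
# With only ten points, one extreme value inflates the standard deviation,
# so a threshold of 2 is used here instead of the usual 3
std_flags = std_outliers(length_of_stay, n_std=2.0)

print("IQR method flags:", length_of_stay[iqr_flags])
print("Std method flags:", length_of_stay[std_flags])
```

The IQR rule is based on quartiles and is therefore less affected by the extreme values it is trying to detect, which is one reason it is often preferred for small or skewed samples.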

**Step 3. Conduct a Simple Analysis**
- Check the **effect** comparing between 
    - Dependent variable to independent variable and 
    - Independent variable to independent variable
- Check the correlation.
    - Use scatter plots
- Check Multicollinearity 
    - This occurs when two or more independent variables are highly correlated. 
    - Use Variance Inflation Factor (VIF) 
        - if VIF > 5 the variables are highly correlated and 
        - if VIF > 10 there is certainly multicollinearity among the variables.
- An interaction term implies that the slope of one predictor changes with the value of another predictor.
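
To show where the VIF thresholds come from, the sketch below computes VIF by hand as $1 / (1 - R_j^2)$, where $R_j^2$ is obtained by regressing predictor $j$ on all the other predictors. The function name `vif_scores` and the simulated patient features are assumptions for illustration.

```python
import numpy as np

def vif_scores(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing column j on the other columns."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    scores = []
    for j in range(k):
        target = X[:, j]
        # Design matrix: intercept plus every predictor except column j
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        scores.append(1 / (1 - r2))
    return np.array(scores)

# Simulated features: length of stay is built to track age closely
rng = np.random.default_rng(1)
age = rng.normal(60, 10, size=300)
length_of_stay = 0.2 * age + rng.normal(0, 0.5, size=300)
follow_ups = rng.normal(3, 1, size=300)

vifs = vif_scores(np.column_stack([age, length_of_stay, follow_ups]))
for name, value in zip(["age", "length_of_stay", "follow_ups"], vifs):
    print(f"{name}: VIF = {value:.2f}")
```

The two related predictors receive large VIF values while the unrelated one stays near 1, matching the rule of thumb above.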

`Show the relationship between the two variables using a scatter plot.`
- We have our Y, our X, and time (months), but we're just trying to model ZAR/USD as a *function* of Exports. 
    - To see if we can see that there possibly exists a linear relationship between the two variables: Value of Exports and ZAR/USD.

In [None]:
plt.scatter(df['X'], df['Y'])
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

**Step 4. Formulate the Model (From Scratch)**
The simple linear regression line has the form $y = mx + b$, where:
- y stands for the predicted value, 
- x is the independent variable and 
- m & b are the **coefficients** (slope and intercept) we need to optimize in order to fit the regression line to our data.

#### Finding the Best Fit Line
Let's say we have estimated some values for $a$ and $b$. We could plug in all of our values of X to find the corresponding values of Y. These *new* values of Y could be compared to the *actual* values of Y to assess the fit of the line. This becomes tedious as the number of data points increases.
   
Looking at the data, we can make a guess at the values of the slope and intercept of the line. We'll use a rough estimate of the slope as $\frac{rise}{run} = \frac{16}{80000} = 0.0002$. For the intercept, we'll just take a guess and call it $-3$.   
   
Let's plot a line with values of $a = -3$, and $b = 0.0002$:   
   
First, we will need to generate some values of y using the following formula:
   
$$\hat{y}_i = a + bx_i$$   



Calculating the coefficients of the equation:
- To calculate the coefficients we need the formulas for covariance and variance.

Covariance 

$Cov(x,y) = \frac{\sum_{i=1}^{n} (x_{i}- \bar{x})(y_{i} - \bar{y})}{n}$

Variance

$Var(x) = \frac{\sum_{i=1}^{n} (x_{i} - \bar{x})^2}{n}$

- To calculate the coefficients m and b
    - m = cov(x, y) / var(x)
    - b = mean(y) - m * mean(x)

**Functions to calculate the Mean, Covariance, and Variance.**

In [None]:
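import numpy as np

# mean 
def get_mean(arr):
    return np.sum(arr)/len(arr)

# variance (population form: divide by n, matching the formula above)
def get_variance(arr, mean):
    return np.sum((arr-mean)**2)/len(arr)

# covariance (population form: divide by n, matching the formula above)
def get_covariance(arr_x, mean_x, arr_y, mean_y):
    final_arr = (arr_x - mean_x)*(arr_y - mean_y)
    return np.sum(final_arr)/len(arr_x)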

**Fuction to calculate the coefficients and the Linear Regression Function**

In [None]:
# Coefficients 
# m = cov(x, y) / var(x)
# b = mean(y) - m*mean(x)

def get_coefficients(x, y):
    x_mean = get_mean(x)
    y_mean = get_mean(y)
    m = get_covariance(x, x_mean, y, y_mean)/get_variance(x, x_mean)
    b = y_mean - x_mean*m
    return m, b

In [None]:
from sklearn.metrics import r2_score, mean_squared_error

# Linear Regression 
# Train and Test
# Train Split 80 % Test Split 20 %
def linear_regression(x_train, y_train, x_test, y_test):
    prediction = []
    m, b = get_coefficients(x_train, y_train)
    for x in x_test:
        y = m*x + b
        prediction.append(y)
    
    # scikit-learn metrics expect the arguments in the order (y_true, y_pred)
    r2 = r2_score(y_test, prediction)
    mse = mean_squared_error(y_test, prediction)
    print("The R2 score of the model is: ", r2)
    print("The MSE score of the model is: ", mse)
    return prediction

# First 80 observations for training, the remainder for testing
prediction = linear_regression(x[:80], y[:80], x[80:], y[80:])

In [None]:
# Define a function to generate values of y from a list of x, 
# Given parameters a and b

def gen_y(x_list, a, b):
    y_gen = []
    for x_i in x_list:
        y_i = a + b*x_i
        y_gen.append(y_i)
    
    return(y_gen)

# Generate the values by invoking the 'gen_y' function
y_gen = gen_y(df.X, -3, 0.0002)

# Plot the results
plt.scatter(df.X, df.Y)  # Plot the original data
plt.plot(df.X, y_gen, color='red')  # Plot the line connecting the generated y-values
plt.ylabel("ZAR/USD")
plt.xlabel("Value of Exports (ZAR, millions)")
plt.show()

**Visualize the regression line**

In [None]:
def plot_reg_line(x, y):
    # Calculate predictions for x ranging from 1 to 100
    prediction = []
    m, c = get_coefficients(x, y)
    for x0 in range(1,100):
        yhat = m*x0 + c
        prediction.append(yhat)
    
    # Scatter plot without regression line
    fig = plt.figure(figsize=(20,7))
    plt.subplot(1,2,1)
    sns.scatterplot(x=x, y=y)
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Scatter Plot between X and Y')
    
    # Scatter plot with regression line
    plt.subplot(1,2,2)
    sns.scatterplot(x=x, y=y, color = 'blue')
    sns.lineplot(x = [i for i in range(1, 100)], y = prediction, color='red')
    plt.xlabel('X')
    plt.ylabel('Y')
    plt.title('Regression Plot')
    plt.show()

In [None]:
# Regression plot form seaborn
# regplot is basically the combination of the scatter plot and the line plot
sns.regplot(x=x, y=y)  # recent seaborn versions require keyword arguments here
plt.xlabel('X')
plt.ylabel('Y')
plt.title("Regression Plot")
plt.show()

**Step 4. Formulate the model and Fit the Model (using library)**

- Split the Data: Divide data into training and testing sets (e.g., 80% training, 20% testing).
- Train the Model: Use a library like sklearn in Python to fit the regression model on the training data.
- Evaluate the Model: Check metrics such as $R^2$ (explained variance) and RMSE (Root Mean Squared Error).

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# Example

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Create the dataset
X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)
y = np.array([2, 4, 5, 7, 8, 10, 11, 13, 14, 16])

# Create the linear regression model
model = LinearRegression().fit(X, y)


##### Calculate the Regression Coefficients

Use the formulas for $\beta_1$ (slope) and $\beta_0$ (intercept):

$\beta_1 = \frac{\sum (x_{i}- \bar{x})(y_{i} - \bar{y})}{\sum (x_{i}- \bar{x})^2}$

$\beta_0 = \bar{y} - \beta_1 \bar{x}$

In [None]:
# Mean of x and y from scratch 
x = X.ravel()  # flatten the (n, 1) feature matrix from the example above
x_mean = np.mean(x)
y_mean = np.mean(y)

# Calculating beta1 (slope)
numerator = np.sum((x - x_mean) * (y - y_mean))
denominator = np.sum((x - x_mean) ** 2)
beta1 = numerator / denominator

# Calculating beta0 (intercept)
beta0 = y_mean - beta1 * x_mean

print(f"Beta0 (Intercept): {beta0:.3f}")
print(f"Beta1 (Slope): {beta1:.3f}")

In [None]:
# Get the slope and intercept of the line
slope = model.coef_
intercept = model.intercept_

# Plot the data points and the regression line
plt.scatter(X, y)
plt.plot(X, slope*X + intercept, color='red')
plt.show()

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Example dataset
X = data[['age', 'length_of_stay', 'severity', 'medication_adherence', 'comorbidities']]
y = data['readmissions']

# Split data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)

# Evaluation
rmse = mean_squared_error(y_test, y_pred) ** 0.5  # RMSE; avoids the deprecated squared=False argument
r2 = r2_score(y_test, y_pred)

print(f"RMSE: {rmse}, R^2: {r2}")


**Let's check the calculated fit of the line** by measuring how far the true y-values of each point are from their corresponding y-value on the line.   
   
We'll use the equation below to calculate the error of each generated value of y:   
   
$$e_i = y_i - \hat{y}_i$$   

In [None]:
errors = np.array(df.Y - y_gen)
np.round(errors, 2)

In addition to having some very large errors, we can also see that most of the errors are positive numbers. Ideally, we want our errors to be evenly distributed either side of zero - we want our line to best fit the data, i.e. no bias.
   
We can measure the overall error of the fit by calculating the **Residual Sum of Squares**:
   
$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

##### Residual Sum of Squares (RSS)
Definition: The Residual Sum of Squares (RSS) measures the discrepancy between the actual data points and the estimated values predicted by a regression model. It is calculated as the sum of the squared differences between actual ($y_i$) and predicted ($\hat{y}_i$) values.

The RSS finds the difference between the y-value of each data point and our estimated line (which may be either negative or positive), squares the difference, and then adds all the differences up. In other words, it's the sum of the squares of all the errors we calculated before.

Here:

- $y_i$ = Actual value of the dependent variable for observation $i$.
- $\hat{y}_i = \beta_0 + \beta_1 x_i$ , where:
    - $\beta_0$ is the intercept.
    - $\beta_1$ is the slope of the regression line.
    - $x_i$ is the value of the independent variable for observation $i$.

Substituting $\hat{y}_i$:

$$RSS = \sum_{i=1}^n(y_i-(\beta_0 + \beta_1 x_i))^2$$

The RSS quantifies the "unexplained variance" by the model.

In a simple linear regression, minimizing RSS is equivalent to finding the best-fit line.

In [None]:
# Residual sum of squares from scratch (y_pred holds the test-set predictions made above)
rss = np.sum((y_test - y_pred) ** 2)
print(f"Residual Sum of Squares (RSS): {rss:.3f}")

In [None]:
print("Residual sum of squares:", (errors ** 2).sum())

## Least Squares Method
Least Squares is another method that allows us to find the line of best fit while enforcing the constraint of minimising the residuals. More specifically, the **Least Squares Criterion** states that the sum of the squares of the residuals should be minimized, i.e.   
$$Q = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$

The formulae for the intercept, $a$, and the slope, $b$, are determined by minimizing the equation for the sum of the squared prediction errors:   
$$Q = \sum_{i=1}^n(y_i-(a+bx_i))^2$$

Optimal values for $a$ and $b$ are found by differentiating $Q$ with respect to $a$ and $b$, setting both equal to 0 and then solving for $a$ and $b$.   
   
We won't go into the [derivation process](http://seismo.berkeley.edu/~kirchner/eps_120/Toolkits/Toolkit_10.pdf) here, but the equations for $a$ and $b$ are:   
   
$$b = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$   
   
and:   
   
$$a = \bar{y} - b\bar{x}$$

where:
- $x_i$ are the values of the independent variable.
- $y_i$ are the values of the dependent variable.
- $\bar{x}$ and $\bar{y}$ are the mean values of $x$ and $y$ in our dataset, respectively.
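
For readers who want the intermediate steps, the derivation can be sketched as follows, using the same symbols as above. Setting the partial derivative with respect to $a$ to zero gives the intercept:

$$\frac{\partial Q}{\partial a} = -2\sum_{i=1}^n\big(y_i-(a+bx_i)\big) = 0 \quad\Rightarrow\quad a = \bar{y} - b\bar{x}$$

Setting the partial derivative with respect to $b$ to zero and substituting this expression for $a$ gives:

$$\frac{\partial Q}{\partial b} = -2\sum_{i=1}^n x_i\big(y_i-(a+bx_i)\big) = 0 \quad\Rightarrow\quad \sum_{i=1}^n x_i\big((y_i-\bar{y}) - b(x_i-\bar{x})\big) = 0$$

Because $\sum_{i=1}^n (y_i-\bar{y}) = 0$ and $\sum_{i=1}^n (x_i-\bar{x}) = 0$, each $x_i$ multiplying a bracket can be replaced by $(x_i-\bar{x})$ without changing the sums, which yields the slope formula:

$$b = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$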

### Interpreting least-squares coefficients

Interpreting the least-squares coefficients provides insights into the relationship between the independent variable (x) and the dependent variable (y) in a simple linear regression.

#### The Slope ($\beta_1$)

Interpretation:

- If $\beta_1 > 0$: y increases as x increases (positive relationship).
- If $\beta_1 < 0$: y decreases as x increases (negative relationship).
- If $\beta_1 = 0$: No linear relationship exists between x and y.

If $\beta_1 = 0.28$ this means that for every one-unit increase in x, y is expected to increase by 0.28 units.

Key Considerations:

- The magnitude of $\beta_1$ indicates the size of the effect per unit of x, so it depends on the units in which x and y are measured.
- The direction (+/-) indicates the nature of the relationship.

#### The Intercept ($\beta_0$)

Definition:
The intercept ($\beta_0$) represents the predicted value of the dependent variable (y) when the independent variable (x) is zero.

Interpretation:
- The intercept gives a baseline value of y when x=0.
- It is meaningful only if $x=0$ is within the range of observed data. 
    - If not, the intercept is an extrapolation and has limited interpretive value.

Limitations
- Causation vs. Correlation: The coefficients indicate relationships, not causation, unless you have a well-controlled experimental design.
- Range of x: The interpretation of $\beta_0$ and $\beta_1$ applies only within the range of observed x-values.
- Other Factors: The model assumes that other variables do not influence $y$, which might not be the case in real-world scenarios.

In [None]:
X = df.X.values
Y = df.Y.values

# Calculate x bar, y bar
x_bar = np.mean(X)
y_bar = np.mean(Y)

# Calculate slope
b = sum( (X-x_bar)*(Y-y_bar) ) / sum( (X-x_bar)**2 )

# Calculate intercept
a = y_bar - b*x_bar

print("Slope = " + str(b))
print("Intercept = " + str(a))

In [None]:
# Use the function we created earlier:
# it generates y-values for given x-values based on parameters a, b
y_gen2 = gen_y(df.X, a, b)

plt.scatter(df.X, df.Y)
plt.plot(df.X, y_gen2, color='red')
plt.show()

In [None]:
errors2 = np.array(df.Y - y_gen2)  # observed minus predicted, matching e_i = y_i - y_hat_i
print(np.round(errors2, 2))

In [None]:
print("Residual sum of squares:", (errors2 ** 2).sum())

Here we can see our RSS has improved from ~867 down to ~321.  
Furthermore, if we calculate the sum of the errors we find that the value is close to 0. Intuitively, this should make sense as it is an indication that the sum of the positive errors is equal in magnitude to the sum of the negative errors. The line fits in the 'middle' of the data.

In [None]:
# Round off to 11 decimal places
np.round(errors2.sum(),11)

##### Recognise the Standard error of a statistic

The standard error (SE) of a statistic in linear regression quantifies the variability of the estimated coefficients ($\beta_0$ and $\beta_1$) and other regression outputs. 
- It measures how much the coefficient estimates are expected to vary from sample to sample due to random noise in the data.

**Standard Error of the Regression Coefficients**

For the slope $\beta_1$ in simple linear regression, the standard error ($SE_{\beta_1}$) is calculated as:

$$SE_{\beta_1} = \sqrt{\frac{\sigma^2}{\sum_{i=1}^n(x_i-\bar{x})^2}}$$

Where:

- $\sigma^2$: The variance of the residuals, often estimated as the mean squared error (MSE):
$$\sigma^2 = \frac{RSS}{n-2}$$
- $n$ is the number of observations.
- $\sum_{i=1}^n(x_i - \bar{x})^2$: The total variation in the independent variable $x$.

**Standard Error of the Regression**

The standard error of the regression (also called the residual standard error, $RSE$) measures the average distance that the observed values fall from the regression line.

$$RSE = \sqrt{\frac{RSS}{n-2}}$$

Where:

- RSS: Residual Sum of Squares.
- $n - 2$: Degrees of freedom for simple linear regression ($n - k - 1$)
    - with $k = 1$ predictor.

##### Role of Standard Errors in Linear Regression

Standard errors are used to evaluate the reliability and precision of your regression model.

1.  Coefficient Standard Errors ($SE_{\beta_0}$ and $SE_{\beta_1}$)
- These are used to:
    - Quantify Precision: Smaller standard errors indicate more precise estimates of the coefficients. 
    - Construct Confidence Intervals: The confidence interval for $\beta_j$ is:

$$\beta_j \pm t \cdot SE_{\beta_j}$$

where $t$ is the critical value from the t-distribution for the desired confidence level.
 
- Perform Hypothesis Tests: To test if $\beta_j = 0$, we calculate:

$$t = \frac{\beta_j}{SE_{\beta_j}}$$

Compare t to the critical t-value to determine significance.

2. Residual Standard Error ($RSE$) 
- Indicates the average error in predictions.
- Provides a baseline for assessing the fit of the model (smaller $RSE$ implies a better fit).
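
The hypothesis test on the slope described above can be sketched by hand for a simple regression. The five data points are made up for illustration, and `scipy.stats.t` supplies the critical value and p-value; a 5% significance level is assumed.

```python
import numpy as np
from scipy import stats

# Hypothetical data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])
n = len(x)

# Least-squares estimates of the slope and intercept
sxx = np.sum((x - x.mean()) ** 2)
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
beta0 = y.mean() - beta1 * x.mean()

# Residual standard error and standard error of the slope
residuals = y - (beta0 + beta1 * x)
rse = np.sqrt(np.sum(residuals ** 2) / (n - 2))
se_beta1 = rse / np.sqrt(sxx)

# t-statistic for H0: beta1 = 0, with a two-sided p-value on n - 2 degrees of freedom
t_stat = beta1 / se_beta1
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
t_crit = stats.t.ppf(0.975, df=n - 2)

print(f"t = {t_stat:.2f}, critical t = {t_crit:.3f}, p-value = {p_value:.5f}")
print("Reject H0" if abs(t_stat) > t_crit else "Fail to reject H0")
```

Comparing $|t|$ to the critical value and comparing the p-value to the significance level are two views of the same decision, so they always agree.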
 
**Interpretation**

Residual Standard Error (RSE):
- For example, an RSE of 0.147 means that, on average, the observed y-values deviate from the predicted y-values by 0.147 units.

Standard Error of Slope ($SE_{\beta_1}$):
- A value of 0.065 means the estimated slope varies by about 0.065 from sample to sample. This is used to assess the precision of $\beta_1$.

Confidence in Coefficients:
- Smaller standard errors indicate more confidence in the coefficient estimates.
- Standard errors also allow hypothesis testing to determine if a predictor has a statistically significant impact on y.

In [None]:
import numpy as np

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])  # Independent variable
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])  # Dependent variable

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals and RSS
y_pred = beta0 + beta1 * x  # Predicted values
residuals = y - y_pred  # Residuals
RSS = np.sum(residuals ** 2)  # Residual Sum of Squares

# Step 4: Calculate standard error of the regression (RSE)
n = len(x)  # Number of observations
RSE = np.sqrt(RSS / (n - 2))  # Residual Standard Error

# Step 5: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 6: Print results
print(f"Residual Standard Error (RSE): {RSE:.3f}")
print(f"Standard Error of Slope (SE_beta1): {SE_beta1:.3f}")


### Applications of Standard Errors

The standard errors (SEs) in linear regression are used to assess the precision and reliability of the estimated coefficients and model predictions. They serve as the foundation for key inferential techniques like 
- confidence intervals, 
- hypothesis testing, and 
- evaluating the overall fit of the regression model.

1. Constructing Confidence Intervals

Confidence intervals provide a range of plausible values for the regression coefficients.

$$\beta_j \pm t \cdot SE_{\beta_j}$$

where:
- $\beta_j$: Estimated coefficient.
- $SE_{\beta_j}$: Standard error of the coefficient.
- $t$: Critical value from the t-distribution based on the desired confidence level and degrees of freedom ($n - k - 1$).

Interpretation: If the confidence interval for a coefficient does not include 0, it indicates that the predictor variable has a statistically significant relationship with the dependent variable at the given confidence level.

**Calculate the 95% confidence interval for a regression coefficient, such as the slope ($\beta_1$)**

Use the following formula:

$$\text{Confidence Interval} = \beta_1 \pm t \cdot SE_{\beta_1}$$

Steps to Calculate the 95% Confidence Interval

1. Estimate the Slope Coefficient ($\beta_1$)

$$\beta_1 = \frac{\sum_{i=1}^n(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^n(x_i-\bar{x})^2}$$

2. Calculate the Standard Error of the Slope ($SE_{\beta_1}$)

- The standard error of the slope is:

$$SE_{\beta_1} = \frac{RSE}{\sqrt{\sum_{i=1}^n(x_i-\bar{x})^2}}$$

- RSE (Residual Standard Error) is:

$$RSE = \sqrt{\frac{RSS}{n-2}}$$
where $RSS = \sum_{i=1}^n (y_i-\hat{y}_i)^2$

3. Find the Critical t-value ($t_{critical}$):

Use the t-distribution with $n - 2$ degrees of freedom to find the critical value for the 95% confidence level ($t_{critical}$).

4. Apply the Confidence Interval Formula:

- Combine the values:
    - Confidence Interval = $\beta_1 \pm t_{critical} \cdot SE_{\beta_1}$
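The four steps can be cross-checked against a library routine. The sketch below assumes SciPy is available and uses `scipy.stats.linregress`, which reports the slope and its standard error for a simple regression. The data are the same toy values used in this section; the variable names are illustrative.

```python
import numpy as np
from scipy import stats

# Same toy data as the surrounding examples
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# linregress fits simple OLS and reports the slope's standard error
res = stats.linregress(x, y)

# 95% CI for the slope, using n - 2 degrees of freedom
df = len(x) - 2
t_critical = stats.t.ppf(0.975, df)
ci_lower = res.slope - t_critical * res.stderr
ci_upper = res.slope + t_critical * res.stderr

print(f"slope = {res.slope:.3f}, SE = {res.stderr:.3f}")
print(f"95% CI = ({ci_lower:.3f}, {ci_upper:.3f})")
```

If the manual formulas are implemented correctly, they should agree with these values up to floating-point rounding.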

_______

2. Hypothesis Testing

Hypothesis testing in Linear Regression
- Once you have fitted a straight line to the data, you need to ask:
    - "Is this straight line a significant fit for the data?" or
    - "Does the beta coefficient explain the variance in the plotted data?"
- This is where hypothesis testing on the beta coefficient comes in:

$H_0 : \beta_1 = 0$

$H_A : \beta_1 \neq 0$

Interpret the Regression Equation
- The coefficients ($\beta$) indicate the magnitude and direction of the relationship between each predictor and readmissions.
    - Example: A coefficient of -0.5 for medication_adherence means that for every 1% increase in medication adherence, expected readmissions decrease by 0.5, holding the other predictors constant.
- The intercept ($\beta_0$) represents the expected number of readmissions when all predictors are zero.

Assessing the Model Fit
- Other parameters to assess a model are:
    - t statistic: It is used to determine the p-value and hence helps in determining whether an individual coefficient is significant or not.
    - F statistic: It is used to assess whether the overall model fit is significant or not.
        - The higher the value of the F-statistic, the stronger the evidence that at least one predictor is related to the response.
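As a minimal sketch of how the F-statistic relates to the sums of squares, the snippet below computes it for the toy data used in this section. With a single predictor, the F-statistic equals the square of the slope's t-statistic, which the code checks. The variable names (`ss_reg`, `f_stat`) are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])
n, k = len(x), 1  # k = number of predictors

# Ordinary least squares fit
beta1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
beta0 = y.mean() - beta1 * x.mean()
y_pred = beta0 + beta1 * x

ss_res = np.sum((y - y_pred) ** 2)    # unexplained variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
ss_reg = ss_tot - ss_res              # explained variation

# F = (explained variation per predictor) / (unexplained variation per residual df)
f_stat = (ss_reg / k) / (ss_res / (n - k - 1))

# Slope t-statistic for comparison
se_beta1 = np.sqrt(ss_res / (n - 2)) / np.sqrt(np.sum((x - x.mean()) ** 2))
t_stat = beta1 / se_beta1
print(f"F = {f_stat:.2f}, t^2 = {t_stat ** 2:.2f}")
```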

To determine whether a predictor variable has a significant impact on the dependent variable, use hypothesis testing.
- Null Hypothesis ($H_0$): $\beta_j = 0$ (the predictor has no effect on the response variable $y$).
- Alternative Hypothesis ($H_a$): $\beta_j \neq 0$ (the predictor has an effect, i.e. there is a relationship).
- t-statistic: the test statistic used to decide between the two hypotheses.

**How to Calculate the t-statistic in Linear Regression**

The t-statistic in linear regression measures how many standard errors the estimated coefficient is away from zero.
- It is used for hypothesis testing to determine if a predictor variable is statistically significant.

The formula to calculate the t-statistic for a coefficient is:

$$t = \frac{\beta_j}{SE_{\beta_j}}$$

Where:
- $\beta_j$: Estimated coefficient (e.g., slope or intercept).
- $SE_{\beta_j}$: Standard error of the estimated coefficient.

If the t-statistic is large in magnitude, it indicates that $\beta_j$ (or $\beta_1$ in this case) is far from zero relative to its standard error, suggesting the predictor has a significant effect on the dependent variable.

- P-Value: Compare the computed t-value to the critical value from the t-distribution, or calculate the p-value:
    - If $p < \alpha$ (e.g., 0.05), reject $H_0$ and conclude the predictor is statistically significant.
_________

3. Evaluating Model Fit

The standard error of the regression (Residual Standard Error, RSE) assesses the accuracy of the model's predictions.

$$RSE = \sqrt{\frac{RSS}{n-k-1}}$$

where:
- RSS: Residual sum of squares.
- n: Number of observations.
- k: Number of predictors (excluding the intercept).

**Degrees of Freedom**

The t-statistic follows a t-distribution with $n - k - 1$ degrees of freedom.

Interpretation:

- A smaller RSE indicates better model fit.
- Used as a baseline to evaluate other models.
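To make the general formula concrete, here is a small sketch of the RSE for a model with $k = 2$ predictors. The data are synthetic (generated with a fixed seed and made-up coefficients purely for illustration), and the fit uses NumPy's least-squares solver rather than any particular statistics package.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, illustrative synthetic data
n, k = 50, 2
X = rng.normal(size=(n, k))
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)

# Add a column of ones so the intercept is estimated too
X_design = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)

residuals = y - X_design @ beta
rss = np.sum(residuals ** 2)
rse = np.sqrt(rss / (n - k - 1))  # k + 1 parameters were estimated

print(f"RSE = {rse:.3f} on {n - k - 1} degrees of freedom")
```

Because the noise was generated with standard deviation 0.3, the RSE should land in that neighbourhood; it estimates the noise level, not the size of the coefficients.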

________

4. Comparing Predictors

Standard errors help compare how precisely different coefficients are estimated; dividing each estimate by its standard error (the t-statistic) puts predictors on a common scale.
- Predictors with smaller $SE_{\beta_j}$ have more precisely estimated effects on $y$.
- Variables with larger $SE_{\beta_j}$ might need further investigation (e.g., for multicollinearity).

In [None]:
############## Applications of Standard Errors

import numpy as np
import scipy.stats as stats

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals and RSS
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)

# Step 4: Calculate RSE
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 5: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 6: Hypothesis Testing and Confidence Interval
t_stat = beta1 / SE_beta1  # t-statistic
p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=n-2))  # Two-tailed test

# Confidence Interval for beta1
t_critical = stats.t.ppf(0.975, df=n-2)  # 95% confidence level
conf_interval = (beta1 - t_critical * SE_beta1, beta1 + t_critical * SE_beta1)

# Step 7: Print results
print(f"Coefficient (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Statistic: {t_stat:.3f}")
print(f"p-Value: {p_value:.5f}")
print(f"95% Confidence Interval for beta1: ({conf_interval[0]:.3f}, {conf_interval[1]:.3f})")


Interpretation
- Coefficient ($\beta_1$): The slope is 0.75, indicating that $y$ increases by 0.75 units for every one-unit increase in $x$.
- Standard Error ($SE_{\beta_1}$): The slope estimate has a standard error of 0.030, which is small relative to the slope and indicates a precise estimate.
- t-Statistic and p-Value: The large t-statistic (25.0) and small p-value (about 0.00014) indicate that $\beta_1$ is statistically significant.
- Confidence Interval: We are 95% confident that the true value of $\beta_1$ lies between 0.655 and 0.845.

Summary

- Confidence Intervals: Quantify the uncertainty around coefficient estimates.
- Hypothesis Testing: Assess the statistical significance of predictors.
- Model Diagnostics: Evaluate and compare models using RSE.
- Decision-Making: Use SEs to identify reliable predictors and improve the model.

In [None]:
############ Calculate the 95% confidence interval for a regression coefficient such as beta1 (the slope)

import numpy as np
from scipy.stats import t

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate the slope (beta1) and intercept (beta0)
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals, RSS, and RSE
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 4: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 5: Determine t-critical value for 95% confidence interval
alpha = 0.05  # 95% confidence level
df = n - 2  # Degrees of freedom
t_critical = t.ppf(1 - alpha/2, df)

# Step 6: Calculate confidence interval
lower_bound = beta1 - t_critical * SE_beta1
upper_bound = beta1 + t_critical * SE_beta1

# Step 7: Print results
print(f"Slope (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Critical: {t_critical:.3f}")
print(f"95% Confidence Interval for beta1: ({lower_bound:.3f}, {upper_bound:.3f})")


Interpretation
- The 95% confidence interval for $\beta_1$ is (0.655, 0.845).
- This means we are 95% confident that the true slope ($\beta_1$) lies within this range.
- Since the interval does not include 0, it indicates that the relationship between $x$ and $y$ is statistically significant at the 5% significance level.



In [None]:
########### calculate the t-statistic for a simple linear regression with one predictor (x) and one response variable (y).

import numpy as np
from scipy.stats import t

# Step 1: Define data
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.2, 2.8, 3.6, 4.5, 5.1])

# Step 2: Calculate coefficients
x_mean = np.mean(x)
y_mean = np.mean(y)
beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
beta0 = y_mean - beta1 * x_mean

# Step 3: Calculate residuals, RSS, and RSE
y_pred = beta0 + beta1 * x
residuals = y - y_pred
RSS = np.sum(residuals ** 2)
n = len(x)
RSE = np.sqrt(RSS / (n - 2))

# Step 4: Calculate standard error of the slope (SE_beta1)
SE_beta1 = RSE / np.sqrt(np.sum((x - x_mean) ** 2))

# Step 5: Calculate the t-statistic
t_statistic = beta1 / SE_beta1

# Step 6: Calculate p-value (two-tailed test)
df = n - 2  # Degrees of freedom
p_value = 2 * (1 - t.cdf(abs(t_statistic), df))

# Step 7: Print results
print(f"Slope (beta1): {beta1:.3f}")
print(f"Standard Error (SE_beta1): {SE_beta1:.3f}")
print(f"t-Statistic: {t_statistic:.3f}")
print(f"p-Value: {p_value:.5f}")


##### Interpretation
t-Statistic:
- The t-statistic for the slope is 25.000, indicating the estimated coefficient is 25 standard errors away from zero.

p-Value:
- The small p-value (0.00014) suggests strong evidence against the null hypothesis ($\beta_1 = 0$).
- The predictor (x) is statistically significant at a 5% significance level.

By comparing the t-statistic to critical t-values or using the p-value, you can conclude whether the predictor is significantly associated with the response variable.

### Explaining the rules for rejecting the null hypothesis using p-values

1. What is a p-value?

The p-value is the probability of observing the data (or something more extreme) if the null hypothesis ($H_0$) is true.
- A low p-value indicates that the observed result is unlikely under the assumption of the null hypothesis.

2. Decision Rule for Rejecting the Null Hypothesis

The decision rule depends on the significance level ($\alpha$), which is the threshold for rejecting $H_0$.
- Common choices for $\alpha$ are 0.05 (5%) or 0.01 (1%).

If p-value ≤ $\alpha$: Reject the null hypothesis ($H_0$).
- The result is statistically significant.
- There is sufficient evidence against the null hypothesis at the chosen significance level.

If p-value > $\alpha$: Fail to reject the null hypothesis ($H_0$).
- The result is not statistically significant.
- There isn't enough evidence to conclude that the null hypothesis is false.
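The decision rule is simple enough to write down directly. The helper below is a hypothetical illustration (the name `decide` and its return strings are not from any library); it only encodes the comparison of the p-value with $\alpha$.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the p-value decision rule at significance level alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if p_value <= alpha:
        return "reject H0"
    return "fail to reject H0"

print(decide(0.03))        # below alpha = 0.05
print(decide(0.10))        # above alpha = 0.05
print(decide(0.03, 0.01))  # same p-value, stricter alpha
```

Note that the same p-value can lead to different decisions under different significance levels, which is why $\alpha$ should be fixed before looking at the data.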

3. Interpretation Guidelines

Small p-value (≤ $\alpha$):
- The observed effect is unlikely to be due to chance alone.

Example:
- p = 0.03 means that, if $H_0$ were true, there would be only a 3% chance of observing data at least as extreme as yours.

Large p-value (> $\alpha$):
- The observed effect could plausibly occur due to chance.

Example:
- p = 0.10 means that, if $H_0$ were true, there would be a 10% chance of observing data at least as extreme as yours.

Common Misinterpretations to Avoid

1. The p-value is not the probability that $H_0$ is true.
- It is the probability of observing data at least as extreme as yours, assuming $H_0$ is true.
2. Failing to reject $H_0$ does not mean $H_0$ is true.
- It only means there isn't enough evidence to conclude otherwise.
3. A small p-value does not indicate a large effect size.
- Statistical significance doesn't always mean practical significance.

Summary

- The choice of significance level ($\alpha$) determines the threshold for rejecting $H_0$.
- The p-value provides a way to quantify the strength of evidence against $H_0$.
- Always report both the p-value and $\alpha$ for transparency in hypothesis testing.

##### Practical Example
Suppose you are testing whether a new marketing strategy improves sales:

Null Hypothesis ($H_0$): The new marketing strategy has no effect on sales ($\beta = 0$).

Alternative Hypothesis ($H_a$): The new marketing strategy increases sales ($\beta > 0$).

If your analysis gives a p-value of 0.03, and you have set $\alpha = 0.05$:
- Since p-value (0.03) < $\alpha$ (0.05), you reject $H_0$.
- You conclude that the new marketing strategy likely increases sales.
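The alternative hypothesis in this example is one-sided ($\beta > 0$), whereas the regression code in this section computes two-sided p-values. The sketch below shows how the two relate, assuming SciPy is available. The t-statistic and degrees of freedom are hypothetical numbers chosen for illustration, not results from a real analysis.

```python
from scipy import stats

t_stat = 2.10  # hypothetical t-statistic for the strategy coefficient
df = 28        # hypothetical residual degrees of freedom

# Two-sided: evidence for an effect in either direction
p_two_sided = 2 * stats.t.sf(abs(t_stat), df)
# One-sided (H_a: beta > 0): only large positive t counts as evidence
p_one_sided = stats.t.sf(t_stat, df)

print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

For a positive t-statistic the one-sided p-value is half the two-sided one, so it matters which of the two a report is quoting.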

### Regression cost functions: Regression model evaluation metrics

A **loss function** measures the error for a single training example. It is also sometimes called an error function.

A **cost function**, on the other hand, is the average loss over the entire training dataset.

**Steps for Loss Functions**
1. Define the predictor function f(X), and identify the parameters to find.
2. Determine the loss for each training example.
3. Derive the expression for the Cost Function, representing the average loss across all examples.
4. Compute the gradient of the Cost Function with respect to each unknown parameter.
5. Select the learning rate and execute the weight update rule for a fixed number of iterations.

These steps guide the optimization process, aiding in the determination of optimal model parameters.
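The distinction between loss and cost (steps 1 to 3) can be sketched in a few lines. The predictor, the squared-error loss, and the toy data below are illustrative choices, not the only possible ones.

```python
import numpy as np

def predict(x, m, b):
    """Predictor f(x) = m * x + b with parameters m (slope) and b (intercept)."""
    return m * x + b

def squared_loss(y_true, y_hat):
    """Loss for each individual training example."""
    return (y_true - y_hat) ** 2

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])  # generated by y = 2x + 1

per_example_loss = squared_loss(y, predict(x, m=1.5, b=1.0))
cost = per_example_loss.mean()  # cost = average loss over the dataset

print("per-example losses:", per_example_loss)
print("cost:", cost)
```

The loss is a vector with one entry per training example, while the cost collapses it into the single number that the optimizer tries to reduce.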

The following metrics are commonly used to evaluate prediction error and model performance in regression analysis.

1. **R-squared (Coefficient of determination)** 
- Indicates the proportion of variance in the dependent variable explained by the independent variables.
- It measures how closely the fitted values track the observed values.
- It helps answer the question: "How well does my model explain the variability in the dependent variable?"
- For a least-squares model with an intercept, evaluated on its training data, the value ranges from 0 to 1 and can be read as a percentage.
    - where:
        - $R^2$ = 1 (or close to 1): Perfect (or near-perfect) fit; all variability in y is explained by X.
            - The model explains a large proportion of the variability in the data.
            - Example: If $R^2$ = 0.85
                - Then 85% of the variability in y is explained by X. The remaining 15% is due to unexplained variability (e.g., noise, unobserved variables).
        - $R^2$ = 0 (or close to 0): No linear relationship; the model does not explain any variability in y.
            - The model explains only a small proportion of the variance.
            - Example: If $R^2$ = 0.1
                - Only 10% of the variability is explained by the model. This suggests one or more of the following:
                    - The model lacks important predictors.
                    - The relationship between X and y may not be linear.
                    - There is high variability in y that cannot be captured effectively.
    - A higher value means the model fits the data better, but it does not necessarily mean the model predicts accurately, and it does not imply causation or that the model is the best predictor.

The R-squared statistic is calculated as:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

Where:
- $SS_{res}$ (Residual Sum of Squares): Sum of squared differences between actual and predicted values.

$$SS_{res} = \sum^{n}_{i = 1} (y_i - \hat{y}_i)^2$$

- $SS_{tot}$ (Total Sum of Squares): Sum of squared differences between actual values and their mean.

$$SS_{tot} = \sum^{n}_{i = 1} (y_i - \bar{y})^2$$

### Important Caveats
1. Overfitting in Complex Models
- High $R^2$ may result from overfitting, especially in models with many predictors.
2. Does Not Imply Causation
- High $R^2$ shows correlation, not causation. For example, a model may explain variability due to spurious relationships.
3. Limited Applicability to Prediction
- High $R^2$ doesn't guarantee that the model predicts new data well (check with metrics like RMSE on test data).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Step 1: Create synthetic data
np.random.seed(42)
X = np.random.rand(100, 1) * 10  # Independent variable
y = 3 * X.squeeze() + 7 + np.random.randn(100) * 3  # Dependent variable with noise

# Step 2: Fit a simple linear regression model

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

# Step 3: Calculate R-squared manually
y_mean = np.mean(y)
SS_res = np.sum((y - y_pred) ** 2)  # Residual Sum of Squares
SS_tot = np.sum((y - y_mean) ** 2)  # Total Sum of Squares
R_squared = 1 - (SS_res / SS_tot)

print(f"Manual R-squared: {R_squared:.4f}")

# Step 4: Calculate R-squared using sklearn
R_squared_sklearn = r2_score(y, y_pred)
print(f"Sklearn R-squared: {R_squared_sklearn:.4f}")

# Plotting
plt.scatter(X, y, label="Actual Data", alpha=0.7)
plt.plot(X, y_pred, color="red", label="Fitted Line")
plt.xlabel("X")
plt.ylabel("y")
plt.legend()
plt.title(f"Linear Regression (R-squared: {R_squared:.4f})")
plt.show()

2. **Adjusted R-squared**
- Adjusted version of $R^2$ that accounts for the number of predictors in the model.

$$R^2_{adj} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where:
- n is the number of observations, and
- p is the number of predictors.

As a rough guide to reading R-squared:
- if $R^2$ = 1: the fitted line passes through every point (perfect fit)
- if $R^2$ = 0.5: about half of the variance is still unexplained
- if $R^2$ = 0.05: the model explains very little of the variance

Usage: Useful when comparing models with a different number of predictors.
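A short sketch can show why the adjustment matters. Below, a deliberately irrelevant predictor is added to a model: plain $R^2$ cannot go down when a predictor is added, while adjusted $R^2$ applies a penalty and may go up or down. The data are synthetic and the helper name `r2_and_adjusted` is an illustrative assumption.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Fit OLS with an intercept; return (R^2, adjusted R^2)."""
    n, p = X.shape
    X_design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(X_design, y, rcond=None)
    ss_res = np.sum((y - X_design @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

rng = np.random.default_rng(42)
n = 40
x1 = rng.normal(size=n)
noise_feature = rng.normal(size=n)  # unrelated to y by construction
y = 3.0 * x1 + rng.normal(scale=1.0, size=n)

r2_small, adj_small = r2_and_adjusted(x1.reshape(-1, 1), y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x1, noise_feature]), y)

print(f"1 predictor : R2 = {r2_small:.4f}, adjusted = {adj_small:.4f}")
print(f"2 predictors: R2 = {r2_big:.4f}, adjusted = {adj_big:.4f}")
```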

_____

3. Mean Error (ME)
- The error for each training example is calculated, and then the mean of all these errors is taken.
- Errors can be both negative and positive, so they can cancel each other out during summation, giving a mean error near zero even for a poor model.
- It is not recommended as a cost function, but it lays the foundation for the other cost functions of regression models.
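The cancellation problem is easy to see numerically. In this made-up example every prediction is wrong by 2 or 3 units, yet the mean error is exactly zero; taking absolute values first reveals the real size of the errors.

```python
import numpy as np

y_true = np.array([10.0, 12.0, 14.0, 16.0])
y_pred = np.array([12.0, 10.0, 17.0, 13.0])  # every prediction is off by 2 or 3

errors = y_true - y_pred                # [-2, 2, -3, 3]
mean_error = errors.mean()              # signs cancel
mean_abs_error = np.abs(errors).mean()  # signs removed before averaging

print(f"ME = {mean_error:.1f}, MAE = {mean_abs_error:.1f}")
```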

Residual Analysis

- Residuals: Differences between actual and predicted values ($y_i - \hat{y}_i$).
- Analysis: Residual plots help diagnose issues like non-linearity, heteroscedasticity, and non-independence of errors.

_____

4. **MSE (Mean Squared Error)**
- Also known as L2 loss.
- It is the average of the squared differences between the actual and predicted values over the dataset.
- Squaring each difference makes every term non-negative, so positive and negative errors cannot cancel out (the drawback of Mean Error).
$$MSE = \frac{1}{n} \sum^{n}_{i = 1} (y_i - \hat{y}_i)^2$$
- Since each error is squared, large deviations are penalized much more heavily than under MAE.
    - If the dataset has outliers that produce large prediction errors, squaring magnifies those errors and leads to a much higher MSE.
        - It is less robust to outliers.
        - It should be used with care if the data is prone to many outliers.

Usage: Penalizes larger errors more heavily than smaller ones.

Graphically
- For linear regression, it is a positive quadratic function of the parameters (of the form $ax^2 + bx + c$ where $a > 0$).
- Such a quadratic function has only a global minimum.
    - Since there are no local minima, we will never get stuck in one.
- Hence, if Gradient Descent converges (which requires a suitably small learning rate), it converges to the global minimum.

In [None]:
def update_weights_MSE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        m_deriv += -2*X[i] * (Y[i] - (m*X[i] + b))

        # -2(y - (mx + b))
        b_deriv += -2*(Y[i] - (m*X[i] + b))

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

**Advantages of R-squared**:

- Interpretability: $R^2$ is unitless and typically ranges between 0 and 1, making it easy to understand and compare across datasets.
- Proportional Explanation: Quantifies the proportion of variance explained by the model, offering insights into model effectiveness.
- Model Comparison: Useful for comparing the explanatory power of different models or regression equations.

**Disadvantages of RSE Compared to R-squared**:

- RSE depends on the scale of the dependent variable, making it hard to compare across datasets with different units.
- RSE alone does not provide information on how much variance the model explains.

_______

5. **RMSE (Root Mean Squared Error)** 
- It is the square root of the MSE.

$$RMSE = \sqrt{MSE}$$

Usage: Commonly used because it is in the same units as the dependent variable and emphasizes larger errors.

______

6. **MAE (Mean absolute error)**
- Also known as L1 loss.
- It is the average of the absolute differences between the predicted and actual values over the dataset.
- The absolute error for each training example is the distance between the predicted and the actual value, irrespective of sign, so positive and negative errors cannot cancel out.
- It is more robust to outliers than MSE, so it gives more stable results when the dataset has noise or outliers.
- The cost is the mean of these absolute errors:

$$MAE = \frac{1}{n} \sum^{n}_{i = 1} |y_i - \hat{y_i}|$$

Usage: Provides an easily interpretable measure of error in the same units as the dependent variable.
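To illustrate the robustness claim, the sketch below evaluates MSE, RMSE, and MAE on the same made-up predictions with and without a single large error. The helper functions are minimal NumPy implementations written for this example.

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def rmse(y_true, y_pred):
    return np.sqrt(mse(y_true, y_pred))

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
y_clean = y_true + np.array([0.5, -0.5, 0.5, -0.5, 0.5])     # small errors only
y_outlier = y_true + np.array([0.5, -0.5, 0.5, -0.5, 10.0])  # one large error

for name, y_pred in [("clean", y_clean), ("with outlier", y_outlier)]:
    print(f"{name:>12}: MSE={mse(y_true, y_pred):.2f}  "
          f"RMSE={rmse(y_true, y_pred):.2f}  MAE={mae(y_true, y_pred):.2f}")
```

In this example the single outlier multiplies the MSE by roughly 80 but the MAE by less than 5, which is the sense in which MAE is the more robust of the two.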

In [None]:
def update_weights_MAE(m, b, X, Y, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        residual = Y[i] - (m*X[i] + b)
        # sign(residual) is the subgradient of |residual|; it is taken as 0 when
        # the residual is exactly 0, which avoids a division by zero.
        sign = int(residual > 0) - int(residual < 0)

        # Calculate partial derivatives
        # d|y - (mx + b)|/dm = -x * sign(y - (mx + b))
        m_deriv += -X[i] * sign

        # d|y - (mx + b)|/db = -sign(y - (mx + b))
        b_deriv += -sign

    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

7. **Mean Absolute Percentage Error (MAPE)**

Definition: The mean of the absolute percentage differences between predicted and actual values.

$$MAPE = \frac{1}{n} \sum^{n}_{i = 1} |\frac{y_i - \hat{y_i}}{y_i}| \times 100$$

Usage: Expresses error as a percentage, making it easier to interpret across datasets. It is undefined when any actual value $y_i$ is zero.
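A minimal sketch of the MAPE formula follows. The function name `mape` and the choice to raise an error when an actual value is zero are assumptions made for this example; other implementations handle zeros differently.

```python
import numpy as np

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if np.any(y_true == 0):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return np.mean(np.abs((y_true - y_pred) / y_true)) * 100

# Errors of 10%, 10%, and 0% average to about 6.67%
print(mape([100, 200, 400], [110, 180, 400]))
```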

___________

8. Huber Loss

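- The Huber loss combines the best properties of MSE and MAE.
- It is quadratic for smaller errors and is linear otherwise (and similarly for its gradient).
- It is controlled by its delta parameter ($\delta$), the error size at which it switches from quadratic to linear:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta \left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$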
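As a minimal sketch, the loss value itself can be computed as below. This assumes the standard Huber definition with the factor of one half on the quadratic branch, which is consistent with the derivatives used in the weight-update function in this section. The function name `huber_loss` is illustrative.

```python
import numpy as np

def huber_loss(residual, delta=1.0):
    """Elementwise Huber loss for residuals r = y - y_hat."""
    r = np.abs(np.asarray(residual, dtype=float))
    quadratic = 0.5 * r ** 2            # used where |r| <= delta
    linear = delta * (r - 0.5 * delta)  # used where |r| > delta
    return np.where(r <= delta, quadratic, linear)

residuals = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(huber_loss(residuals, delta=1.0))
```

Residuals of size 0.5 fall on the quadratic branch, while residuals of size 3 fall on the linear branch, so large errors grow linearly rather than quadratically.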

In [None]:
def update_weights_Huber(m, b, X, Y, delta, learning_rate):
    m_deriv = 0
    b_deriv = 0
    N = len(X)
    for i in range(N):
        # derivative of quadratic for small values and of linear for large values
        if abs(Y[i] - m*X[i] - b) <= delta:
            m_deriv += -X[i] * (Y[i] - (m*X[i] + b))
            b_deriv += -(Y[i] - (m*X[i] + b))
        else:
            m_deriv += delta * X[i] * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
            b_deriv += delta * ((m*X[i] + b) - Y[i]) / abs((m*X[i] + b) - Y[i])
    
    # We subtract because the derivatives point in direction of steepest ascent
    m -= (m_deriv / float(N)) * learning_rate
    b -= (b_deriv / float(N)) * learning_rate

    return m, b

##### Choosing the Right Measure
- Use R-squared or Adjusted R-squared to evaluate the proportion of variance explained.
- Use MAE, MSE, or RMSE for measuring prediction accuracy in units of the target variable.
- Use MAPE for interpreting errors as percentages.

**Step 5: Interpret the Results**

Residual Analysis:
- Check that the residuals are approximately normally distributed.
- Homoscedasticity describes a situation in which the variance of the error term is the same across all values of the independent variables.
    - This means that the residuals have a roughly constant spread along the regression line.

Interpretation of Regression Output
- R-Squared: a statistical measure of fit that indicates how much of the variation in the dependent variable is explained by the independent variables.
    - A higher R-Squared value represents smaller differences between the observed data and the fitted values.

**Optimization technique/Strategy**

We will use Gradient Descent as an optimization strategy to find the regression line.
- Weight Update Rule

NB: Perform optimization on the training data and check performance on new validation data.

**Gradient Descent for Linear Regression**

What is gradient descent?
- In layman's terms:
    - It is a way of checking the ground near you and observing where the land tends to descend.
    - It gives an idea of the direction in which you should take your steps.
    - It helps models find the optimal set of parameters by iteratively adjusting them in the opposite direction of the gradient.

Mathematical terms:
- Find the best parameters ($\theta_0$ and $\theta_1$) for our learning algorithm.

Cost space is how our algorithm would perform when we choose a particular value for a parameter.

Cost Function is a function that measures the performance of a model for any given data. Cost Function quantifies the error between predicted values and expected values and presents it in the form of a single real number.

1. Make a hypothesis with initial parameters
- Hypothesis: $h_\theta(x) = \theta_0 + \theta_1 x$
- Parameters: $\theta_0, \theta_1$
2. Calculate the Cost function
- Cost Function: $J(\theta_0, \theta_1) = \frac{1}{2m}\sum^{m}_{i = 1} (h_\theta (x^{(i)}) - y^{(i)})^2$
3. To reduce the cost function, modify the parameters by applying the gradient descent algorithm to the given data.
- Goal: $\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$
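The cost function above translates directly into code. The sketch below evaluates $J$ for a few values of $\theta_1$ on noiseless toy data, so the cost is zero at the true parameters and grows as the slope moves away from them. The function name `cost` and the data are illustrative.

```python
import numpy as np

def cost(theta0, theta1, x, y):
    """J(theta0, theta1) = 1/(2m) * sum((h(x_i) - y_i)^2), h(x) = theta0 + theta1 * x."""
    m = len(x)
    predictions = theta0 + theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x  # noiseless line, so the true parameters give zero cost

for theta1 in [0.0, 1.0, 2.0, 3.0]:
    print(f"theta1 = {theta1:.1f} -> J = {cost(1.0, theta1, x, y):.3f}")
```

Plotting these values against $\theta_1$ would trace out the bowl-shaped curve that gradient descent walks down.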

**Gradient descent**

- It is an optimization algorithm that minimizes the cost function (objective function).
- It aims to find the parameters that minimize the discrepancy between predicted and actual values and so improve the model's performance.
    - We need to reduce the cost function (MSE) over all data points.
    - This is done by updating the values of the slope coefficient and the constant coefficient iteratively until we get an optimal solution for the linear function.

The algorithm operates by calculating the gradient of the cost function, 
- which indicates the direction and magnitude of the steepest ascent. 

However, since the goal is to minimize the cost function, gradient descent moves in the opposite direction of the gradient, 
- known as the negative gradient direction.

By iteratively updating the model's parameters in the negative gradient direction, gradient descent gradually converges towards the set of parameters that yields the lowest cost.

- Hyperparameter: learning rate, determines the step size taken in each iteration, influencing the speed and stability of convergence.

Gradient descent can be applied to:
- linear regression, 
- logistic regression, 
- neural networks, and 
- support vector machines.

**Definition**: Gradient descent is an iterative optimization algorithm for finding a local minimum of a function.

To find a local minimum of a function using gradient descent, we must take steps proportional to the negative of the gradient (move against the gradient) of the function at the current point.
- If we take steps proportional to the positive of the gradient (moving along the gradient), we will approach a local maximum of the function, and the procedure is called Gradient Ascent.

The goal of the gradient descent algorithm is to minimize the given function (say, the cost function).
- It performs two steps iteratively:
1. Compute the gradient (slope), the first-order derivative of the function at the current point.
2. Make a step (move) in the direction opposite to the gradient: the new point is the current point minus alpha times the gradient at that point.
- The step size is controlled by the learning rate (alpha), which decides how fast the algorithm converges to the minimum.
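The effect of the learning rate can be seen on the simplest possible example, $f(w) = w^2$, whose gradient is $2w$ and whose minimum is at $w = 0$. The function name, the starting point, and the learning rates below are illustrative choices.

```python
def minimize_quadratic(learning_rate, steps=50, start=5.0):
    """Run gradient descent on f(w) = w**2, whose gradient is 2 * w."""
    w = start
    for _ in range(steps):
        gradient = 2 * w
        w -= learning_rate * gradient  # step against the gradient
    return w

for lr in [0.01, 0.1, 0.9, 1.1]:
    print(f"learning rate {lr:>4}: w after 50 steps = {minimize_quadratic(lr):.6g}")
```

A very small rate makes slow progress, a moderate rate converges quickly, a rate just under 1 overshoots the minimum on every step but still converges, and a rate above 1 diverges for this particular function.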

This code creates a function called gradient_descent, which requires the training data, learning rate, and number of iterations as parameters.

Steps :
1. Initializes the weights to random values and the bias to zero.
2. Runs a loop for a set number of iterations.
3. Computes the predicted y values using the current weights and bias.
4. Calculates the error between the actual and predicted y values.
5. Computes the gradients of the cost function with respect to the weights and bias.
6. Updates the weights and bias using the gradients and the learning rate.
7. Returns the learned weights and bias.


In [None]:
import numpy as np

def gradient_descent(X, y, learning_rate, num_iters):
  """
  Performs gradient descent to find optimal weights and bias for linear regression.

  Args:
      X: A numpy array of shape (m, n) representing the training data features.
      y: A numpy array of shape (m,) representing the training data target values.
      learning_rate: The learning rate to control the step size during updates.
      num_iters: The number of iterations to perform gradient descent.

  Returns:
      A tuple containing the learned weights and bias.
  """

  # Initialize weights with random values in [0, 1) and bias with zero
  m, n = X.shape
  weights = np.random.rand(n)
  bias = 0.0

  # Loop for the number of iterations
  for i in range(num_iters):
    # Predict y values using current weights and bias
    y_predicted = np.dot(X, weights) + bias

    # Calculate the error
    error = y - y_predicted

    # Calculate gradients for weights and bias
    weights_gradient = -2/m * np.dot(X.T, error)
    bias_gradient = -2/m * np.sum(error)

    # Update weights and bias using learning rate
    weights -= learning_rate * weights_gradient
    bias -= learning_rate * bias_gradient

  return weights, bias

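# Example usage
# Note: both feature columns are identical here, so the individual weights are not
# uniquely determined; only their sum is.
X = np.array([[1, 1], [2, 2], [3, 3]])
y = np.array([2, 4, 5])
learning_rate = 0.01
num_iters = 100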

weights, bias = gradient_descent(X, y, learning_rate, num_iters)

print("Learned weights:", weights)
print("Learned bias:", bias)

How Does Gradient Descent Work?
1. The algorithm minimizes the model's cost function.
2. The cost function measures how well the model fits the training data by quantifying the difference between the predicted and actual values.
3. The cost function's gradient is its derivative with respect to the model's parameters and points in the direction of the steepest ascent.
4. The algorithm starts with an initial set of parameters and updates them in small steps to minimize the cost function.
5. In each iteration, it computes the gradient of the cost function with respect to each parameter.
6. The gradient tells us the direction of the steepest ascent, and by moving in the opposite direction, we move in the direction of the steepest descent.
7. The learning rate controls the step size, which determines how quickly the algorithm moves towards the minimum.
8. The process is repeated until the cost function converges to a minimum, indicating that the model has reached a (locally) optimal set of parameters.
9. Different variations of gradient descent include batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with advantages and limitations.
10. Efficient implementation of gradient descent is essential for performing well in machine learning tasks. The choice of the learning rate and the number of iterations can significantly impact the algorithm's performance.

On the basis of differentiation techniques
- Gradient descent requires calculating the gradient by differentiating the cost function. We can use either first-order or second-order differentiation.
    - First-order differentiation uses only the gradient.
    - Second-order differentiation also uses curvature (second-derivative) information.

To update $\beta_0$ and $\beta_1$, we take the gradient of the cost function, that is, its partial derivatives with respect to $\beta_0$ and $\beta_1$.

$J = \frac{1}{n} \sum^{n}_{i = 1} (\beta_{0} + \beta_{1} x_i - y_i)^2$

$\frac{\partial J}{\partial \beta_{0}} = \frac{2}{n} \sum^{n}_{i = 1} (\beta_{0} + \beta_{1} x_i - y_i)$

$\frac{\partial J}{\partial \beta_{1}} = \frac{2}{n} \sum^{n}_{i = 1} (\beta_{0} + \beta_{1} x_i - y_i) \cdot x_i$

$\beta_{0} = \beta_{0} - \alpha \cdot \frac{2}{n} \sum^{n}_{i = 1} (y_{pred,i} - y_{i})$

$\beta_{1} = \beta_{1} - \alpha \cdot \frac{2}{n} \sum^{n}_{i = 1} (y_{pred,i} - y_{i}) \cdot x_i$

Where:
- $y_{pred,i} = \beta_0 + \beta_1 x_i$ is the prediction for observation $i$.
- The partial derivatives are the gradients, and they are used to update the values of $\beta_0$ and $\beta_1$.
- $\alpha$ is the learning rate.
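
The update rules above can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `fit_simple_linear_gd`, the toy data, and the values chosen for `alpha` and `num_iters` are all assumptions made for the example.

```python
import numpy as np


def fit_simple_linear_gd(x, y, alpha=0.05, num_iters=5000):
    """Gradient descent for y ~ b0 + b1 * x using the MSE gradients."""
    n = len(x)
    b0, b1 = 0.0, 0.0
    for _ in range(num_iters):
        residual = (b0 + b1 * x) - y                 # y_pred - y for every observation
        grad_b0 = (2.0 / n) * residual.sum()         # dJ/d(beta_0)
        grad_b1 = (2.0 / n) * (residual * x).sum()   # dJ/d(beta_1)
        b0 -= alpha * grad_b0                        # step against the gradient
        b1 -= alpha * grad_b1
    return b0, b1


# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

b0, b1 = fit_simple_linear_gd(x, y)
print(f"intercept: {b0:.4f}, slope: {b1:.4f}")
```

Because the toy data contain no noise, the estimates should approach the generating values (intercept 1, slope 2); with real data they approach the least-squares solution instead.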

**Types of Gradient Descent**

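Gradient descent algorithms are mainly classified in two ways:
- On the basis of data ingestion (how much data is used for each update): full batch, stochastic, or mini-batch. The choice depends on the problem at hand and the size of the dataset.
- On the basis of differentiation technique: first-order or second-order.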

**Full Batch Gradient Descent Algorithm**:
- Batch gradient descent, also known as vanilla gradient descent, uses the whole training set at once to compute the gradient.
    - It updates the model's parameters using the gradient of the entire training set.
- It calculates the average gradient of the cost function over all training examples and updates the parameters in the opposite direction.
    - The error is calculated for each example in the training dataset.
    - The model is not changed until every training sample has been assessed.
        - One such full pass over the training data is referred to as a **cycle**, or a **training epoch**.
- For a convex cost function and a suitably small learning rate, batch gradient descent converges to the global minimum, but it can be computationally expensive and slow for large datasets.
    - Batch gradient descent is suitable for small datasets.
    - Its main benefit is a stable error gradient, which leads to stable convergence.
- A drawback is that, for non-convex problems, the stable error gradient can sometimes settle into a state of convergence that isn't the best the model can achieve.
    - It also requires the entire training dataset to be in memory and available to the algorithm.

Advantages
- Fewer model updates mean that this variant of gradient descent is more computationally efficient per epoch than stochastic gradient descent.
- Reducing the update frequency provides a more stable error gradient and, for some problems, more stable convergence.
- Separating the prediction-error calculations from the model update lends itself to parallel implementations.

Disadvantages
- A more stable error gradient can cause the model to converge prematurely to a suboptimal set of parameters.
- Updating only at the end of each training epoch requires the additional complexity of accumulating prediction errors across all training examples.
- Batch gradient descent typically requires the entire training dataset to be in memory and available to the algorithm.
- Large datasets can result in very slow model updates and slow training.
- Each update is slow and requires more computational power.

#### Variants

##### Vanilla Gradient Descent

Vanilla means pure, without any adulteration.
- This is the simplest form of the gradient descent technique.
    - Its main feature is that we take small steps towards the minimum by following the negative gradient of the cost function.

Pseudocode Vanilla Gradient Descent

`update = learning_rate * gradient_of_parameters`

`parameters = parameters - update`

- We update the parameters by taking the gradient of the cost function with respect to the parameters.
- We then multiply it by a learning rate, which is essentially a constant that sets how fast we move towards the minimum.

**Learning rate** is a hyper-parameter and should be treated with care when choosing its value.

##### Gradient Descent with Momentum

Momentum tweaks the above algorithm so that we take the previous step into account before taking the next one.

Pseudocode Gradient Descent with Momentum

`update = learning_rate * gradient`

`velocity = previous_update * momentum`

`parameter = parameter + velocity - update`

This introduces a velocity term, which combines the previous update with a constant called momentum.
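
The momentum pseudocode can be sketched as follows. This is an illustration under stated assumptions: `previous_update` is taken to mean the full parameter change applied in the previous step, and the function name `gd_momentum`, the quadratic test function, and the hyper-parameter values are hypothetical choices for the example.

```python
import numpy as np


def gd_momentum(grad_fn, theta0, learning_rate=0.02, momentum=0.9, num_iters=500):
    """Gradient descent with momentum; grad_fn returns the gradient at theta."""
    theta = np.asarray(theta0, dtype=float).copy()
    previous_step = np.zeros_like(theta)  # parameter change applied last iteration
    for _ in range(num_iters):
        update = learning_rate * grad_fn(theta)
        velocity = momentum * previous_step
        step = velocity - update          # parameter = parameter + velocity - update
        theta = theta + step
        previous_step = step
    return theta


# Hypothetical cost: f(theta) = (theta_0 - 3)^2 + 10 * (theta_1 + 1)^2
def quadratic_grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])


theta = gd_momentum(quadratic_grad, [0.0, 0.0])
print(theta)  # should be close to the minimizer (3, -1)
```

The velocity term lets consecutive steps in the same direction reinforce each other, which tends to help on cost surfaces that are much steeper along one axis than another.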

##### ADAGRAD

ADAGRAD adapts the learning rate as training proceeds. It changes the effective learning rate of each parameter based on how the gradient has behaved over all previous iterations: parameters with a history of large gradients get smaller steps.

Pseudocode ADAGRAD

`grad_component = previous_grad_component + (gradient * gradient)`

`rate_change = square_root(grad_component) + epsilon`

`adapted_learning_rate = learning_rate / rate_change`

`update = adapted_learning_rate * gradient`

`parameter = parameter - update`

where:
- epsilon is a small constant that keeps the division numerically stable (it prevents division by zero).
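
A minimal sketch of an AdaGrad-style update, assuming the standard form in which the base learning rate is divided by the square root of the accumulated squared gradients. The function name `adagrad`, the quadratic test function, and the hyper-parameter values are assumptions for illustration only.

```python
import numpy as np


def adagrad(grad_fn, theta0, learning_rate=0.5, epsilon=1e-8, num_iters=500):
    """AdaGrad: per-parameter step sizes shrink as squared gradients accumulate."""
    theta = np.asarray(theta0, dtype=float).copy()
    grad_component = np.zeros_like(theta)  # running sum of squared gradients
    for _ in range(num_iters):
        gradient = grad_fn(theta)
        grad_component += gradient * gradient
        adapted_learning_rate = learning_rate / (np.sqrt(grad_component) + epsilon)
        theta -= adapted_learning_rate * gradient
    return theta


# Hypothetical cost: f(theta) = (theta_0 - 3)^2 + 10 * (theta_1 + 1)^2
def quadratic_grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])


theta = adagrad(quadratic_grad, [0.0, 0.0])
print(theta)  # should be close to the minimizer (3, -1)
```

Each coordinate keeps its own accumulator, so the steep direction (the second coordinate here) automatically receives a smaller effective learning rate than the shallow one.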

##### ADAM

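ADAM is another adaptive technique, which builds on ADAGRAD and further reduces its downsides.
- Consider this as momentum + ADAGRAD: it keeps a running average of the gradient (like momentum) and a running average of the squared gradient (to scale the learning rate).

Pseudocode

`adapted_gradient = previous_adapted_gradient + ((gradient - previous_adapted_gradient) * (1 - beta1))`

`gradient_component = previous_gradient_component + (((gradient * gradient) - previous_gradient_component) * (1 - beta2))`

`adapted_learning_rate = learning_rate / (square_root(gradient_component) + epsilon)`

`update = adapted_learning_rate * adapted_gradient`

`parameter = parameter - update`

where:
- beta1 and beta2 are constants that control how quickly the running averages of the gradient and the squared gradient change;
- epsilon is a small constant that prevents division by zero;
- the full algorithm also applies a bias correction to both running averages, omitted here for brevity.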
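
A compact sketch of the Adam update, assuming the commonly published form with bias-corrected running averages of the gradient and the squared gradient. The function name `adam`, the quadratic test function, and the hyper-parameter values are illustrative assumptions, not part of the original notes.

```python
import numpy as np


def adam(grad_fn, theta0, learning_rate=0.05, beta1=0.9, beta2=0.999,
         epsilon=1e-8, num_iters=2000):
    """Adam: momentum-style gradient average scaled by a squared-gradient average."""
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)  # running average of the gradient
    v = np.zeros_like(theta)  # running average of the squared gradient
    for t in range(1, num_iters + 1):
        gradient = grad_fn(theta)
        m = beta1 * m + (1.0 - beta1) * gradient
        v = beta2 * v + (1.0 - beta2) * gradient * gradient
        m_hat = m / (1.0 - beta1 ** t)  # bias correction for the zero initialization
        v_hat = v / (1.0 - beta2 ** t)
        theta -= learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return theta


# Hypothetical cost: f(theta) = (theta_0 - 3)^2 + 10 * (theta_1 + 1)^2
def quadratic_grad(theta):
    return np.array([2.0 * (theta[0] - 3.0), 20.0 * (theta[1] + 1.0)])


theta = adam(quadratic_grad, [0.0, 0.0])
print(theta)  # should be close to the minimizer (3, -1)
```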

There are also second-order (quasi-Newton) methods such as **L-BFGS**, which approximate curvature (second-derivative) information from successive gradients.
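
For comparison, a quasi-Newton fit can be sketched with SciPy's general-purpose optimizer. This assumes `scipy.optimize.minimize` with `method="L-BFGS-B"` (the bounded L-BFGS variant that SciPy provides); the helper name `cost_and_grad` and the toy data are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical noise-free data generated from y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x


def cost_and_grad(beta):
    """Return the MSE cost and its gradient for y ~ beta[0] + beta[1] * x."""
    residual = beta[0] + beta[1] * x - y
    cost = np.mean(residual ** 2)
    grad = np.array([2.0 * residual.mean(), 2.0 * (residual * x).mean()])
    return cost, grad


# jac=True tells minimize that cost_and_grad returns (cost, gradient)
result = minimize(cost_and_grad, x0=np.zeros(2), jac=True, method="L-BFGS-B")
print(result.x, result.nit)
```

On a small, well-behaved problem like this one, a quasi-Newton method typically needs far fewer iterations than plain gradient descent, at the cost of more work per iteration.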

In [None]:
class GDRegressor:
    
    def __init__(self,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        
    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            # update all the coef and the intercept
            y_hat = np.dot(X_train,self.coef_) + self.intercept_
            #print("Shape of y_hat",y_hat.shape)
            intercept_der = -2 * np.mean(y_train - y_hat)
            self.intercept_ = self.intercept_ - (self.lr * intercept_der)
            
            coef_der = -2 * np.dot((y_train - y_hat),X_train)/X_train.shape[0]
            self.coef_ = self.coef_ - (self.lr * coef_der)
        
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

**Stochastic Gradient Descent Algorithm**
- In stochastic gradient descent (SGD), you use a single sample at a time to compute the gradient.
    - It randomly selects an example from the training dataset,
    - computes the gradient of the cost function for that example,
    - and updates the parameters in the opposite direction.
    - The parameters are therefore updated once for each training example, one at a time.
        - A benefit is that the frequent updates give a fairly detailed picture of the rate of improvement.
- Stochastic gradient descent is more suitable for large datasets.
- Each update is computationally cheap, and SGD can converge faster than batch gradient descent. However, its gradients are noisy, which causes the error to fluctuate rather than decrease steadily, and it may not settle exactly at the global minimum.

Advantages
- Frequent updates let you see the model's performance and rate of improvement almost immediately.
- This variant of gradient descent is probably the easiest to understand and implement, especially for beginners.
- The higher update frequency can result in faster learning on some problems.
- The noisy update process can help the model escape local minima (i.e., avoid premature convergence).
- Each update is fast and needs little computation and memory.
- Suitable for larger datasets.

Disadvantages
- Updating the model after every example means many more updates per epoch, which cannot be vectorized as efficiently as batch computations, so a full pass over a large dataset can take considerable time.
- Frequent updates result in noisy gradient signals, which can cause the model parameters, and in turn the error, to jump around (higher variance across training epochs).
- A noisy learning process along the error gradient can also make it difficult for the algorithm to settle on the model's minimum error.
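
The per-example update described above can be sketched from scratch for linear regression. This is a minimal illustration under assumptions: the function name `sgd_linear_regression`, the synthetic data, and the learning rate and epoch count are hypothetical choices.

```python
import numpy as np


def sgd_linear_regression(X, y, learning_rate=0.05, epochs=200, seed=0):
    """Stochastic gradient descent: one parameter update per training example."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    coef = np.zeros(n_features)
    intercept = 0.0
    for _ in range(epochs):
        # Visit the examples in a fresh random order every epoch
        for i in rng.permutation(n_samples):
            error = (X[i] @ coef + intercept) - y[i]       # prediction error for one example
            coef -= learning_rate * 2.0 * error * X[i]     # gradient of the squared error w.r.t. coef
            intercept -= learning_rate * 2.0 * error       # gradient w.r.t. the intercept
    return coef, intercept


# Hypothetical noise-free data generated from y = 1 + 2x
data_rng = np.random.default_rng(42)
X = data_rng.uniform(0.0, 1.0, size=(100, 1))
y = 1.0 + 2.0 * X[:, 0]

coef, intercept = sgd_linear_regression(X, y)
print(coef, intercept)
```

With noisy real-world data and a constant learning rate, the parameters would keep fluctuating around the optimum; decaying the learning rate over time is a common remedy.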

In [None]:
from sklearn.linear_model import SGDClassifier
X = [[0., 0.], [1., 1.]]
y = [0, 1]
clf = SGDClassifier(loss="hinge", penalty="l2", max_iter=5)
clf.fit(X, y)
# Displayed result of the fit call: SGDClassifier(max_iter=5)

**Mini-batch Gradient Descent**
- Mini-batch gradient descent is a compromise between batch and stochastic gradient descent and is often used in practice.
- It updates the model's parameters using the gradient of a small subset of the training dataset, known as a mini-batch.
- It calculates the average gradient of the cost function over the mini-batch and updates the parameters in the opposite direction.
- It is the most commonly used method in practice because it combines the ideas of batch gradient descent and SGD.
    - It strikes a balance between the efficiency of batch gradient descent and the robustness of stochastic gradient descent.
- It is computationally efficient and less noisy than stochastic gradient descent while still being able to converge to a good solution.
- Mini-batch sizes typically range from 50 to 256.

Advantages
- The model is updated more frequently than in batch gradient descent, allowing for more robust convergence and helping to avoid local minima.
- Batched updates provide a more computationally efficient process than stochastic gradient descent.
- Batching makes it possible to train without holding all the training data in memory at once.

Disadvantages
- Mini-batch gradient descent requires an additional hyperparameter, the mini-batch size, to be set for the learning algorithm.
- Error information must be accumulated over each mini-batch of training samples, as in batch gradient descent.
- The implementation is somewhat more complex than pure batch or stochastic gradient descent.

Configure Mini-Batch Gradient Descent:

- Mini-batch gradient descent is the variant of gradient descent recommended for most applications, especially deep learning.
- Mini-batch sizes, commonly called "batch sizes" for brevity, are often tailored to some aspect of the computing architecture on which the implementation is running.
    - For example, a power of 2 that fits the memory of the GPU or CPU hardware, such as 32, 64, 128, or 256.
- The batch size is a slider on the learning process.
- Smaller values give a learning process that converges quickly at the expense of noise in the training process. Larger values give a learning process that converges slowly with more accurate estimates of the error gradient.

In [None]:
import random

import numpy as np


class MBGDRegressor:
    
    def __init__(self,batch_size,learning_rate=0.01,epochs=100):
        
        self.coef_ = None
        self.intercept_ = None
        self.lr = learning_rate
        self.epochs = epochs
        self.batch_size = batch_size
        
    def fit(self,X_train,y_train):
        # init your coefs
        self.intercept_ = 0
        self.coef_ = np.ones(X_train.shape[1])
        
        for i in range(self.epochs):
            
            for j in range(int(X_train.shape[0]/self.batch_size)):
                
                idx = random.sample(range(X_train.shape[0]),self.batch_size)
                
                y_hat = np.dot(X_train[idx],self.coef_) + self.intercept_
                #print("Shape of y_hat",y_hat.shape)
                intercept_der = -2 * np.mean(y_train[idx] - y_hat)
                self.intercept_ = self.intercept_ - (self.lr * intercept_der)

                # Average over the mini-batch so the step size does not scale with batch_size
                coef_der = -2 * np.dot((y_train[idx] - y_hat),X_train[idx]) / self.batch_size
                self.coef_ = self.coef_ - (self.lr * coef_der)
        
        print(self.intercept_,self.coef_)
    
    def predict(self,X_test):
        return np.dot(X_test,self.coef_) + self.intercept_

**Step 6: Use the Model for Decision-Making**

The goal is to understand which factors significantly influence readmissions.

To do this, you need a systematic approach grounded in exploratory analysis, statistical rigor, and effective communication.

1. Thinking Approach: Identifying Significant Factors
- Define the Business Objective
    - Objective: Identify key drivers of hospital readmissions (to improve patient care and optimize resource allocation)
    - Questions to Answer:
        - What are the strongest predictors of readmissions?
        - Which predictors can be influenced through policy or operational changes?
        - How much can readmissions be reduced if certain factors are addressed?

- Perform Exploratory Data Analysis (EDA)
    - Inspect Data Distributions: Use histograms and boxplots to understand the spread of variables.
    - Check Relationships:
        - Pairwise correlations for numerical variables (e.g., length_of_stay vs. readmissions).
        - Grouped summaries for categorical variables (e.g., readmissions across age groups).
        - Example Insights:
            - Patients with longer stays might have higher readmission risks.
            - Non-adherence to medication might strongly correlate with readmissions.

- Statistical Hypothesis Testing
    - Use statistical tests to confirm relationships:
        - T-tests for differences in means (e.g., medication adherence between high and low readmission groups).
        - Chi-square tests for independence between categorical variables (e.g., age group vs. readmission rates).

Example 1: Statistical Hypothesis Testing for Medication Adherence
- Objective: Determine if medication adherence significantly differs between patients who are readmitted and those who are not.
- Approach: Two-Sample t-Test
- Hypotheses:
    - $H_0$: The mean adherence rate is the same for both groups (readmitted and not readmitted).
    - $H_a$: The mean adherence rate differs between the groups.

- Steps:
    - Prepare the Data:
        - Split patients into two groups: "Readmitted" and "Not Readmitted."
        - Collect medication adherence rates for each group.

    - Check Assumptions:
        - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if adherence rates are normally distributed.
        - Equal Variance: Use Levene's test or Bartlett's test.

    - Perform the t-Test:
        - If variances are equal, use a standard (pooled) t-test. If not, use Welch's t-test.

    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that adherence rates differ significantly.

In [None]:
from scipy.stats import ttest_ind

# Example data
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]  # Adherence rates for readmitted
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]  # Adherence rates for not readmitted

# Perform t-test
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted, equal_var=False)
print(f"T-statistic: {t_stat}, P-value: {p_value}")
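
The assumption checks listed in the steps can be sketched with SciPy before choosing between the pooled and Welch variants of the test. The sample values repeat the example data; the 0.05 threshold and the variable names are assumptions for illustration.

```python
from scipy.stats import levene, shapiro, ttest_ind

# Example adherence rates for the two groups
adherence_readmitted = [0.7, 0.65, 0.6, 0.75, 0.8]
adherence_not_readmitted = [0.9, 0.85, 0.88, 0.92, 0.89]

significance_level = 0.05  # assumed threshold

# Normality check per group (Shapiro-Wilk)
_, p_normal_readmitted = shapiro(adherence_readmitted)
_, p_normal_not_readmitted = shapiro(adherence_not_readmitted)

# Equal-variance check (Levene's test)
_, p_levene = levene(adherence_readmitted, adherence_not_readmitted)

# Pooled t-test only if equal variances are plausible; otherwise Welch's t-test
equal_var = bool(p_levene >= significance_level)
t_stat, p_value = ttest_ind(adherence_readmitted, adherence_not_readmitted,
                            equal_var=equal_var)

print(f"Shapiro p-values: {p_normal_readmitted:.3f}, {p_normal_not_readmitted:.3f}")
print(f"Levene p-value: {p_levene:.3f} -> equal_var={equal_var}")
print(f"T-statistic: {t_stat:.3f}, P-value: {p_value:.4f}")
```

With samples this small, normality tests have little power, so a non-significant Shapiro-Wilk result is weak evidence at best; a non-parametric test is a reasonable fallback.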

Example 2: Statistical Hypothesis Testing for Age Group vs. Readmission Rates
- Objective: Test if age group (categorical variable) is independent of readmission status.
- Approach: Chi-Square Test of Independence
- Hypotheses:
    - $H_0$: Age group is independent of readmission status.
    - $H_a$: Age group and readmission status are dependent.

- Steps:
    - Create a Contingency Table:
        - Rows: Age groups (e.g., <40, 40–60, >60).
        - Columns: Readmission status (e.g., Yes, No).

    - Perform the Chi-Square Test.

    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that age group and readmission status are associated. This shows an association, not that age causes readmission.

In [None]:
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table
table = np.array([[50, 200], [70, 230], [100, 300]])

# Perform Chi-Square Test
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"Chi2 Statistic: {chi2}, P-value: {p_value}")
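
Beyond the p-value, it can help to inspect the expected counts and an effect size. The sketch below reuses the contingency table above and assumes the first column counts readmitted patients; Cramér's V is one common effect-size choice, added here for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: age groups; columns assumed to be (readmitted, not readmitted)
table = np.array([[50, 200], [70, 230], [100, 300]])

chi2, p_value, dof, expected = chi2_contingency(table)

# Observed readmission rate within each age group
readmission_rate_by_group = table[:, 0] / table.sum(axis=1)

# Rule of thumb: the chi-square approximation needs expected counts of at least 5
counts_large_enough = bool((expected >= 5).all())

# Cramér's V: strength of association on a 0-1 scale
n = table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

print("Readmission rate by age group:", readmission_rate_by_group)
print("Expected counts all >= 5:", counts_large_enough)
print(f"Chi2: {chi2:.3f}, dof: {dof}, p-value: {p_value:.4f}, Cramér's V: {cramers_v:.3f}")
```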

Example 3: Statistical Hypothesis Testing for Length of Stay (LOS)
- Objective: Compare the average LOS for readmitted vs. not readmitted patients.
- Approach: Two-Sample t-Test
- Hypotheses:
    - $H_0$: The mean LOS is the same for readmitted and non-readmitted patients.
    - $H_a$: The mean LOS differs.
- Steps:
    - Prepare the Data:
        - Split patients into two groups: "Readmitted" and "Not Readmitted."
        - Collect the length of stay for each group.

    - Check Assumptions:
        - Normality: Use a Shapiro-Wilk or Kolmogorov-Smirnov test to check if lengths of stay are normally distributed.
        - Equal Variance: Use Levene's test or Bartlett's test.

    - Perform the t-Test:
        - If variances are equal, use a standard (pooled) t-test. If not, use Welch's t-test.

    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that the mean LOS differs significantly between the groups.
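
A minimal sketch of this comparison, using made-up LOS values in days. Welch's t-test is used because it does not assume equal variances, and a Mann-Whitney U test is added as a rank-based alternative since length of stay is often right-skewed; both choices and all data values are assumptions for illustration.

```python
from scipy.stats import mannwhitneyu, ttest_ind

# Hypothetical lengths of stay in days
los_readmitted = [8, 10, 7, 12, 9, 11]
los_not_readmitted = [4, 5, 3, 6, 5, 4]

# Welch's t-test: compares means without assuming equal variances
t_stat, p_value = ttest_ind(los_readmitted, los_not_readmitted, equal_var=False)

# Mann-Whitney U: rank-based alternative that does not assume normality
u_stat, p_value_mw = mannwhitneyu(los_readmitted, los_not_readmitted,
                                  alternative="two-sided")

print(f"Welch t-statistic: {t_stat:.3f}, P-value: {p_value:.4f}")
print(f"Mann-Whitney U: {u_stat:.1f}, P-value: {p_value_mw:.4f}")
```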

Example 4: Relationship Between LOS and Readmission Rate
- Approach: ANOVA (Analysis of Variance)
- Objective: Check if LOS groups (<3 days, 3‚Äì7 days, >7 days) have significantly different readmission rates.
- Hypotheses:
    - $H_0$: The mean readmission rate is the same across all LOS groups.
    - $H_a$: At least one group differs.
- Steps:
    - Group the Data:
        - Divide LOS into groups.
        - Calculate readmission rates for each group (several observations per group are needed, for example one rate per ward or per month).
    - Perform ANOVA.
    - Interpret Results:
        - If $p < 0.05$, reject $H_0$ and conclude that readmission rates differ across LOS groups.

In [None]:
from scipy.stats import f_oneway

# Example data
readmission_short = [0.1, 0.12, 0.08, 0.15]  # Readmission rates for <3 days
readmission_medium = [0.2, 0.22, 0.25, 0.18]  # Readmission rates for 3-7 days
readmission_long = [0.35, 0.4, 0.38, 0.42]  # Readmission rates for >7 days

# Perform ANOVA
f_stat, p_value = f_oneway(readmission_short, readmission_medium, readmission_long)
print(f"F-statistic: {f_stat}, P-value: {p_value}")



- Build and Interpret a Regression Model
    - Fit a regression model to identify significant predictors (linear regression for a continuous outcome such as a readmission rate; logistic regression for a binary readmitted/not readmitted outcome).
    - Check p-values of coefficients: Variables with p-values below a chosen threshold (e.g., 0.05) are statistically significant.
    - Evaluate effect size: Larger coefficients indicate a stronger influence on the target, provided the predictors are on comparable scales (standardize them before comparing magnitudes).
    - Test for interaction effects, such as how length_of_stay and severity jointly influence readmissions.

- Refine the Model
    - Handle multicollinearity: Use the Variance Inflation Factor (VIF) to identify highly correlated predictors, then remove or combine them.
    - Validate the model: Perform cross-validation to ensure robustness.
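
The VIF check can be sketched with NumPy alone: each predictor is regressed on the others, and $VIF_j = 1 / (1 - R_j^2)$. The helper name `vif`, the synthetic predictors, and the rule-of-thumb threshold in the comment are assumptions for illustration; statsmodels also ships a ready-made VIF function.

```python
import numpy as np


def vif(X):
    """Variance Inflation Factor for each column of a 2-D predictor array."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = np.empty(n_features)
    for j in range(n_features):
        target = X[:, j]
        # Regress predictor j on all other predictors plus an intercept
        others = np.column_stack([np.ones(n_samples), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r_squared = 1.0 - residual.var() / target.var()
        factors[j] = 1.0 / (1.0 - r_squared)
    return factors


# Synthetic predictors: severity is built to be strongly correlated with length of stay
rng = np.random.default_rng(0)
length_of_stay = rng.normal(5.0, 2.0, 200)
severity = 0.9 * length_of_stay + rng.normal(0.0, 0.5, 200)
age = rng.normal(60.0, 10.0, 200)

vifs = vif(np.column_stack([length_of_stay, severity, age]))
print(dict(zip(["length_of_stay", "severity", "age"], vifs.round(2))))
# A common rule of thumb treats VIF values above roughly 5 to 10 as a warning sign
```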

This will help the institute to:
- Improve medication adherence programs for high-risk patients.
- Extend hospital stays for patients with severe conditions if needed.
- Schedule follow-up visits more effectively to minimize readmission risks.

Example 5: Predicting Readmissions Based on LOS
- Approach: Logistic Regression (the outcome, readmitted or not, is binary)
- Objective: Use regression to predict readmissions based on LOS and other predictors.

##### How Regression Helps Solve This Problem
- Quantifies Relationships: Identifies and quantifies the factors contributing to readmissions.
- Predicts Outcomes: Provides actionable predictions to guide healthcare interventions.
- Allocates Resources: Helps prioritize patients who need more attention post-discharge.
- Supports Policy Changes: Enables data-driven policy improvements in patient care.

In [None]:
import statsmodels.api as sm

# Example data
X = [2, 4, 6, 8, 10]  # LOS
y = [0, 1, 0, 1, 1]  # Readmission (0 = No, 1 = Yes)

# Add constant for intercept
X = sm.add_constant(X)
model = sm.Logit(y, X).fit()
print(model.summary())

2. Presenting Findings to Senior Management and Board
- Tailor Communication to the Audience
    - Senior management: Focus on actionable insights, resource implications, and patient care improvements.
    - Board of directors: Emphasize high-level trends, financial impacts, and alignment with strategic goals.

- Structure of Presentation
    - Introduction
        - Start with the context: "Readmission rates are a critical indicator of hospital performance and patient care quality."
        - Summarize the objective: "This study identifies key factors driving readmissions and proposes targeted interventions."

    - Key Findings
        - Use visuals like 
            - bar charts, 
            - scatter plots, and 
            - regression coefficient tables:
                - Example: "Medication adherence has the strongest inverse relationship with readmissions. A 10% increase in adherence reduces readmissions by 5%."
            - Highlight statistical significance:
                - "Length of stay and severity are significant at p < 0.05, confirming their importance."
    
    - Implications
        - Show real-world impact: "Addressing non-adherence could prevent ~300 readmissions annually, saving $1.2M in costs."
        - Prioritize recommendations: "Focus on medication adherence programs, especially for older patients with comorbidities."

    - Actionable Recommendations
        - Immediate Steps:
            - Develop a post-discharge follow-up protocol for high-risk groups.
            - Launch an adherence monitoring program.
        - Future Research:
            - Investigate additional factors like social determinants of health.

    - Conclusion
        - Reinforce value: "By addressing these factors, we can improve patient outcomes, meet regulatory benchmarks, and reduce financial strain."

- Tools for Communication
    - Visual Dashboards: Create dashboards showing predicted readmissions, trends over time, and "what-if" scenarios.
    - Executive Summaries: Provide concise summaries with high-impact visuals and key takeaways.
    - Financial Impact Models: Quantify cost savings or ROI of proposed interventions.

3. Example Insights and Visualizations
Insight Example: Medication Adherence
- Insight: "Medication adherence has a strong negative correlation with readmissions ($R = -0.65$). A 10% increase in adherence is associated with a 5% reduction in readmissions."

Visualization:
- A bar chart comparing adherence rates and average readmissions.
- A regression coefficient chart showing the magnitude of influence.

Insight Example: Length of Stay
- Insight: "Patients with hospital stays >7 days are 2x more likely to be readmitted within 30 days."

Visualization:
- Scatter plot: length_of_stay vs. readmissions.
- Box plot: Readmission rates by length-of-stay categories.

4. Implementation Plan
Once the board approves, focus on operationalizing findings:

- Deploy targeted interventions for high-risk patients.
- Set KPIs to monitor the effectiveness of changes.
- Continuously refine the model based on new data.

##### Set KPIs to monitor the effectiveness of changes

**KPI 1: 30-Day Readmission Rate**
- Definition: Percentage of patients readmitted to the hospital within 30 days of discharge.
- Why Important: This is the primary metric to assess whether interventions are reducing readmissions.
- Formula: $\text{Readmission Rate} = \frac{\text{Number of patients readmitted within 30 days}}{\text{Total number of discharged patients}} \times 100$
- Target: A reduction in the readmission rate over time indicates success.

**KPI 2: Medication Adherence Rate**
- Definition: Percentage of patients adhering to their prescribed medications post-discharge.
- Why Important: Non-adherence is a leading cause of readmissions. Monitoring this ensures interventions like counseling and follow-ups are effective.
- Formula: $\text{Medication Adherence Rate} = \frac{\text{Number of patients adhering to medications}}{\text{Total number of patients}} \times 100$
- Target: An increase in adherence correlates with better outcomes and fewer readmissions.

**KPI 3: Follow-Up Appointment Compliance**
- Definition: Percentage of discharged patients attending follow-up appointments within the recommended time frame.
- Why Important: Follow-up visits can identify issues early and prevent readmissions.
- Formula: $\text{Compliance Rate} = \frac{\text{Number of attended follow-ups}}{\text{Number of scheduled follow-ups}} \times 100$
- Target: High compliance indicates improved patient engagement.

**KPI 4: Average Length of Stay (LOS)**
- Definition: Average number of days patients spend in the hospital.
- Why Important: Shorter stays can indicate efficiency but might increase readmissions if patients are discharged prematurely.
- Formula: $\text{LOS} = \frac{\text{Total inpatient days}}{\text{Number of discharges}}$
- Target: Maintain an optimal LOS that balances cost and readmission prevention.

**KPI 5: Percentage of High-Risk Patients Identified**
- Definition: Proportion of discharged patients flagged as high-risk for readmission and targeted for interventions.
- Why Important: Monitoring ensures that predictive models and risk stratification tools are working effectively.
- Formula: $\text{High-Risk Patients Identified} = \frac{\text{Number of flagged high-risk patients}}{\text{Total number of discharged patients}} \times 100$
- Target: Increase the identification rate while reducing actual readmissions.
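
The KPI formulas can be computed directly from period counts. In this sketch every count is a made-up monthly figure, and the helper name `percentage` and the dictionary keys are hypothetical.

```python
def percentage(part, whole):
    """Return part / whole as a percentage, guarding against an empty denominator."""
    if whole == 0:
        raise ValueError("denominator must be non-zero")
    return 100.0 * part / whole


# Hypothetical monthly counts
counts = {
    "discharged": 1000,
    "readmitted_within_30_days": 120,
    "adhering_to_medication": 780,
    "follow_ups_scheduled": 800,
    "follow_ups_attended": 640,
    "total_inpatient_days": 4500,
    "flagged_high_risk": 250,
}

kpis = {
    "readmission_rate_pct": percentage(counts["readmitted_within_30_days"], counts["discharged"]),
    "medication_adherence_pct": percentage(counts["adhering_to_medication"], counts["discharged"]),
    "follow_up_compliance_pct": percentage(counts["follow_ups_attended"], counts["follow_ups_scheduled"]),
    "average_los_days": counts["total_inpatient_days"] / counts["discharged"],
    "high_risk_identified_pct": percentage(counts["flagged_high_risk"], counts["discharged"]),
}

for name, value in kpis.items():
    print(f"{name}: {value:.1f}")
```

Here the adherence rate uses discharged patients as its denominator, which is one reasonable reading of "total number of patients"; a different patient population would need a different count.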

##### Presenting KPIs to Stakeholders

**Visual Presentation**

Use dashboards and visualizations:
- Bar charts to compare readmission rates before and after interventions.
- Line graphs showing trends over time for medication adherence and follow-up compliance.
- Heatmaps for condition-specific readmission trends.

Narrative
- Highlight success: "We reduced the 30-day readmission rate from 18% to 12%, saving $500,000 annually."
- Focus on actionable insights: "Medication adherence programs have been effective, with a 15% increase in adherence leading to a 5% drop in readmissions."

Recommendations
- Continue monitoring these KPIs for sustained improvements.
- Scale successful interventions to other patient groups or hospitals.

## 2. Multiple Linear Regression:

The simple linear regression equation is as follows:

$$Y = \beta_{0} + \beta_{1}X_1$$

where:
- $\beta_{0}$ is the intercept, interpreted as the value of $Y$ when $X_1 = 0$;
- $\beta_{1}$ is the coefficient, interpreted as the effect on $Y$ for a one unit increase in $X_1$; and
- $X_1$ is the single predictor variable.

Extending that idea to multiple linear regression is as simple as adding an $X_{j}$ and corresponding $\beta_{j}$ for each of the $p$ predictor variables, where $j \in \{1, \dots, p\}$.
   
Hence in multiple linear regression, our regression equation becomes:   

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$

where:

- $Y$ is the response variable which depends on the $p$ predictor variables;
- $\beta_0$ is the intercept, interpreted as the value of $Y$ when _all_ predictor variables are equal to zero;
- $\beta_j$ is the average effect on $Y$ of a one unit increase in $X_j$, assuming all other predictors are held fixed.

Multiple linear regression is a technique to understand the relationship between a single dependent variable and multiple independent variables. Adding an error term $\epsilon$ for the variation that the predictors do not explain gives:

$$ y = \beta_{0} + \beta_{1}x_{1} + \dots + \beta_{p}x_{p} + \epsilon $$

What it means:
- It is used when two or more independent variables influence the dependent variable.
- A linear equation defines the relationship, with the coefficients of the independent variables representing the effect of each variable on the dependent variable.
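
As a small illustration of the equation, the coefficients can be estimated by ordinary least squares with NumPy. The synthetic data, the generating coefficients (1, 2, -3), and the variable names are assumptions for the example; the noise term is left out so the recovered values are easy to check.

```python
import numpy as np

# Synthetic data with two predictors and known coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]  # beta_0 = 1, beta_1 = 2, beta_2 = -3

# Add a column of ones so the first coefficient is the intercept beta_0
design = np.column_stack([np.ones(len(X)), X])

# Ordinary least squares: minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
print(beta)  # approximately [1, 2, -3]
```

Each fitted coefficient is read as the change in $y$ for a one-unit increase in that predictor with the other predictor held fixed.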

# Assumptions of Multiple Linear Regression

Regression is a parametric approach, which means that it makes assumptions about the data.

For successful regression analysis, it's essential to validate these assumptions and to keep the following related considerations in mind.

- Overfitting: When more and more variables are added to a model, the model may become far too complex and usually ends up memorizing all the data points in the training set.
    - This phenomenon is known as the overfitting of a model.
    - This usually leads to high training accuracy and very low test accuracy.
- Linearity and multicollinearity (predictors): each predictor should be linearly related to the response, and the predictors should not be strongly correlated with one another.
    - Multicollinearity is the phenomenon where, in a model with several independent variables, some of those variables are interrelated.
- Independence, homoscedasticity, and normality (residuals): the residuals should be independent of each other, have constant variance, and be approximately normally distributed.
- Feature Selection: With more variables present, selecting the optimal set of predictors from the pool of given features (many of which might be redundant) becomes an important task for building a relevant and better model.

We'll be moving through the following sections in order to achieve our objectives:

- Investigating our predictor variables:
    - Checking for linearity;
    - Checking for multicollinearity;
- Fitting a model with `statsmodels.api.OLS`;
- Evaluating our fitted model:
    - Checking for independence;
    - Checking for homoscedasticity;
    - Checking for normality;
    - Checking for outliers.

### Checking for Linearity

Linearity is a key assumption in multiple linear regression. It states that the relationship between each predictor and the response variable should be linear. When this assumption is violated, the model's predictions may be biased or less effective.

The first thing we need to check is the mathematical relationship between each predictor variable and the response variable, that is, linearity.
- A linear relationship means that a change in the response *Y* due to a one-unit change in the predictor $X_j$ is constant, regardless of the value of $X_j$.

If we fit a linear regression model to data whose underlying relationship is non-linear,
- it will fail to adequately capture the relationship in the data, resulting in a mathematically inappropriate model.
### Detecting Non-Linearity

To check for linearity, 
- we can produce scatter plots of each individual predictor against the response variable. 
- The intuition here is that we are looking for obvious linear relationships.

**Result**

- State which variables appear to have an approximately linear relationship with the response variable.
- State which variables exhibit no linear relationship with the response variable.

In [None]:
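import matplotlib.pyplot as plt

# One scatter plot per predictor against the response (mpg);
# assumes df holds mpg plus at most 10 predictor columns
fig, axs = plt.subplots(2, 5, figsize=(14, 6))
fig.subplots_adjust(hspace=0.5, wspace=0.2)
axs = axs.ravel()

# Skip the response itself so that mpg is not plotted against mpg
predictors = [column for column in df.columns if column != 'mpg']
for ax, column in zip(axs, predictors):
    ax.set_title("{} vs. mpg".format(column), fontsize=16)
    ax.scatter(x=df[column], y=df['mpg'], color='blue', edgecolor='k')

fig.tight_layout(pad=1)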

Step 1: Diagnosing Non-Linearity

Visual Inspection
- Use scatter plots to visualize the relationship between predictors and the response variable.

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt

# Scatter plots for each predictor vs. response
for predictor in ["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]:
    sns.scatterplot(x=df[predictor], y=df["HousePrice"])
    plt.title(f"{predictor} vs. HousePrice")
    plt.xlabel(predictor)
    plt.ylabel("HousePrice")
    plt.show()


In [None]:
# Pairplot to visualize relationships
sns.pairplot(df, x_vars=["SquareFootage", "Bedrooms", "DistanceFromCityCenter"], y_vars="HousePrice", kind="reg")
plt.show()

Residual Plots
- Residual plots help check for linearity by plotting residuals against predicted values.

In [None]:
# Predicted values and residuals
predicted = model.predict(X)
residuals = Y - predicted

# Residual plot
plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


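If you see a pattern (e.g., a curve or a funnel of increasing spread), the model is misspecified: a curve indicates non-linearity, while increasing spread indicates heteroscedasticity.

If the relationship between variables is non-linear, applying a transformation such as a log transformation can help.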

Step 2: Transforming Predictors or the Response Variable

When to Transform
- Predictors: Transform when individual predictors have non-linear relationships with the response.
- Response Variable: Transform when the response itself shows a skewed distribution or non-linear relationship with predictors.

|Transformation |Formula|Use Case|
|---------------|-------|--------|
|Log            |$\log(x)$|Skewed data, multiplicative relationships, exponential growth.|
|Square Root    |$\sqrt{x}$|Reduces spread while preserving the order of values.|
|Polynomial     |$x^2, x^3, \dots$|For non-linear relationships that resemble curves.|
|Reciprocal     |$\frac{1}{x}$|When values decrease rapidly as the predictor increases.|
|Box-Cox        |$\frac{y^{\lambda}-1}{\lambda}$ (for $\lambda \neq 0$)|Optimal transformation for normalizing data or reducing variance.|

Step 3: Applying Transformations in Python

Example 1: Log Transformation
- Suppose SquareFootage has a non-linear relationship with HousePrice.

In [None]:
import numpy as np
import statsmodels.api as sm

# Log transformation of SquareFootage (requires strictly positive values)
df["Log_SquareFootage"] = np.log(df["SquareFootage"])

# Fit the model again with the transformed predictor
X_trans = df[["Log_SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
X_trans = sm.add_constant(X_trans)
model_log = sm.OLS(Y, X_trans).fit()

print(model_log.summary())

Example 2: Polynomial Transformation
- Suppose DistanceFromCityCenter has a curved relationship with HousePrice.

In [None]:
# Add polynomial terms
df["Distance_Squared"] = df["DistanceFromCityCenter"] ** 2

# Fit model with polynomial term
X_poly = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter", "Distance_Squared"]]
X_poly = sm.add_constant(X_poly)
model_poly = sm.OLS(Y, X_poly).fit()

print(model_poly.summary())


Example 3: Box-Cox Transformation for Response Variable
- Normalize HousePrice if it's highly skewed.

In [None]:
from scipy.stats import boxcox

# Box-Cox transformation (the response must be strictly positive)
Y_boxcox, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

# Fit model with transformed response
model_boxcox = sm.OLS(Y_boxcox, X).fit()
print(model_boxcox.summary())


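Step 4: Comparing Models
Use metrics like adjusted $R^2$, AIC, and BIC to compare the effectiveness of models before and after transformations. Note that AIC and BIC are only comparable between models fitted to the same response variable, so the values for the Box-Cox model (fitted to the transformed response) are not directly comparable with the others.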

In [None]:
# Compare models
print("Original Model AIC:", model.aic)
print("Log-Transformed Model AIC:", model_log.aic)
print("Polynomial Model AIC:", model_poly.aic)
print("Box-Cox Model AIC:", model_boxcox.aic)

Step 5: Visualizing and Validating Improvements
- Visualizing Residuals After Transformation

In [None]:
# Residual plot after transformation
predicted_trans = model_log.predict(X_trans)
residuals_trans = Y - predicted_trans

plt.scatter(predicted_trans, residuals_trans)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot After Transformation")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()


Checking $R^2$ and Adjusted $R^2$: compare the values before and after applying transformations.

In [None]:
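print(f"Original Model R^2: {model.rsquared}")
print(f"Transformed Model R^2: {model_log.rsquared}")
print(f"Original Model Adjusted R^2: {model.rsquared_adj}")
print(f"Transformed Model Adjusted R^2: {model_log.rsquared_adj}")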

### Checking for Multicollinearity

Multicollinearity occurs when two or more predictors in a regression model are highly correlated. This inflates the standard errors of the coefficient estimates, making it difficult to assess the individual impact of each predictor on the response variable.
- Because multicollinearity makes it hard to tell which variable is contributing to the prediction of the response variable, it can lead to incorrect conclusions about the effect of a variable on the target variable.
- Detect and deal with multicollinearity systematically: arbitrarily removing one of the correlated variables can cause the remaining coefficient values to swing wildly and even change sign.

Strong correlation among predictors is detrimental to model quality for two reasons:

- It tends to increase the standard errors of the coefficient estimates;

- It becomes difficult to estimate the effect of any one predictor variable on the response variable.

We will check for multicollinearity by generating 
- pairwise scatter plots among predictors
- a correlation heatmap.

Multicollinearity can be detected using the following methods.

- Pairwise Correlations: Checking the correlations between pairs of independent variables can give useful insight into multicollinearity.
    - Pairwise correlations are not always sufficient: a single variable may not explain another variable well on its own, while a combination of several variables could. To check for this kind of relationship, one can use the VIF:
- Variance Inflation Factor (VIF): VIF measures how well one independent variable is explained by all the other independent variables. 
    - VIF is given by,

$VIF_i = \frac{1}{1 - R_i^2}$

where 
- $i$ refers to the $i$th variable, which is regressed on the rest of the independent variables, and $R_i^2$ is the $R^2$ of that regression.

Heuristics
- If VIF > 10, multicollinearity is high and the variable is a candidate for removal.
- If VIF is between 5 and 10, the variable may be acceptable but should be inspected first.
- If VIF < 5, the value is generally considered acceptable.
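
The VIF formula can be traced by hand: regress each predictor on the others, take the resulting $R^2$, and plug it into $\frac{1}{1 - R^2}$. The following is a minimal sketch on synthetic data; the helper `vif_by_hand` and the variables `x1`, `x2`, `x3` are hypothetical names introduced only for this illustration.

```python
import numpy as np

def vif_by_hand(X):
    """Return the VIF of each column of X (n_samples x n_predictors)."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for i in range(X.shape[1]):
        target = X[:, i]
        # Regress predictor i on the remaining predictors plus an intercept
        others = np.column_stack([np.ones(len(X)), np.delete(X, i, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        r2_i = 1 - residuals.var() / target.var()
        vifs.append(1.0 / (1.0 - r2_i))
    return np.array(vifs)

# Synthetic illustration: x3 is almost a linear combination of x1 and x2
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + rng.normal(scale=0.1, size=200)

vifs = vif_by_hand(np.column_stack([x1, x2, x3]))
print(dict(zip(["x1", "x2", "x3"], vifs.round(1))))
```

Because `x3` is nearly determined by `x1` and `x2`, all three predictors should show VIF values well above the thresholds listed above, even though no single pairwise relationship is perfect.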

**Step 1: Detecting Multicollinearity**

(a) Pairwise scatter plots

As the name suggests, a pairwise scatter plot produces a visual $n \times n$ matrix, where $n$ is the total number of variables compared, in which each cell represents the relationship between two variables. The diagonal cells represent the comparison of a variable with itself, and as such are replaced by a representation of the distribution of values taken by that variable.


In [None]:
# Due to the number of visuals created, this codeblock takes about one minute to run.
from seaborn import pairplot
g = pairplot(df1.drop('mpg', axis='columns'))
g.fig.set_size_inches(9,9)

(b) Correlation Matrix
- Use a correlation matrix to identify highly correlated predictors.

Correlation heatmap

Another way to visually discover linear relationships between pairs of variables in our dataset is a correlation heatmap. Similar to the pairwise scatter plot we produced above, this visual presents a matrix in which each row represents a distinct variable, and each column holds the correlation between that variable and another one in the dataset.

Result Interpretation
- Look for correlations > 0.8 or < -0.8, which may indicate multicollinearity.

In [None]:
# We only compare the predictor variables, and thus drop the target `mpg` column.
corr = df1.drop('mpg', axis='columns').corr()

from statsmodels.graphics.correlation import plot_corr

fig=plot_corr(corr,xnames=corr.columns)

In [None]:
import pandas as pd

# Compute correlation matrix
correlation_matrix = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]].corr()

# Display the heatmap
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()
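
The correlation threshold from the interpretation note can also be applied programmatically rather than read off a heatmap. This is a rough sketch on synthetic data; `high_corr_pairs` and the columns `a`, `b`, `c` are hypothetical names, and the 0.8 cut-off is the rule of thumb quoted above rather than a fixed standard.

```python
import numpy as np

def high_corr_pairs(data, names, threshold=0.8):
    """List predictor pairs whose absolute Pearson correlation exceeds threshold."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], round(float(corr[i, j]), 3)))
    return pairs

rng = np.random.default_rng(1)
a = rng.normal(size=300)
b = 0.9 * a + rng.normal(scale=0.3, size=300)  # strongly related to a
c = rng.normal(size=300)                        # unrelated to a and b

pairs = high_corr_pairs(np.column_stack([a, b, c]), ["a", "b", "c"])
print(pairs)
```

Only the upper triangle of the matrix is scanned, so each pair is reported once and the diagonal (always 1) is skipped.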


(c) Variance Inflation Factor (VIF)
- VIF quantifies how much the variance of a regression coefficient is inflated due to multicollinearity.

Result Interpretation

Rule of Thumb:
- $VIF=1$: No multicollinearity.
- $1<VIF<5$: Low multicollinearity.
- $5 \le VIF \le 10$: High multicollinearity.
- $VIF>10$: Severe multicollinearity.

In [None]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Calculate VIF for each predictor
vif_data = pd.DataFrame()
vif_data["Feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

print(vif_data)

In [None]:
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Prepare data for VIF calculation (the constant gives each auxiliary regression an intercept)
X_vif = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
X_vif = sm.add_constant(X_vif)

# Calculate VIF for each column; the value reported for 'const' is not meaningful and can be ignored
vif_data = pd.DataFrame()
vif_data["Feature"] = X_vif.columns
vif_data["VIF"] = [variance_inflation_factor(X_vif.values, i) for i in range(X_vif.shape[1])]

print(vif_data)

**Step 2: Mitigating Multicollinearity**

1. Drop Highly Correlated Predictors:
- If two predictors are highly correlated, remove one to reduce redundancy.

In [None]:
X_reduced = X.drop(columns=["Bedrooms"])  # Example

In [None]:
# Drop 'Bedrooms' if it has high multicollinearity
X_reduced = df[["SquareFootage", "DistanceFromCityCenter"]]
X_reduced = sm.add_constant(X_reduced)

# Fit the model with reduced predictors
model_reduced = sm.OLS(Y, X_reduced).fit()
print(model_reduced.summary())

2. Apply Ridge or Lasso Regression:

- Ridge regression penalizes large coefficients to handle multicollinearity.
- Lasso regression performs feature selection by shrinking some coefficients to zero.

In [None]:
from sklearn.linear_model import Ridge, Lasso

ridge = Ridge(alpha=1.0)  # Regularization strength
ridge.fit(X, Y)
print("Ridge coefficients:", ridge.coef_)

lasso = Lasso(alpha=0.1)
lasso.fit(X, Y)
print("Lasso coefficients:", lasso.coef_)

In [None]:
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Split data
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=42)

# Ridge regression
ridge = Ridge(alpha=1.0)  # Adjust alpha (regularization strength) as needed
ridge.fit(X_train, Y_train)

# Evaluate Ridge model
ridge_predictions = ridge.predict(X_test)
ridge_mse = mean_squared_error(Y_test, ridge_predictions)
print("Ridge Regression MSE:", ridge_mse)
print("Ridge Coefficients:", ridge.coef_)

In [None]:
from sklearn.linear_model import Lasso

# Lasso regression
lasso = Lasso(alpha=0.1)  # Adjust alpha as needed
lasso.fit(X_train, Y_train)

# Evaluate Lasso model
lasso_predictions = lasso.predict(X_test)
lasso_mse = mean_squared_error(Y_test, lasso_predictions)
print("Lasso Regression MSE:", lasso_mse)
print("Lasso Coefficients:", lasso.coef_)


3. Principal Component Analysis (PCA):
- PCA reduces dimensions by transforming correlated predictors into uncorrelated components.

Interpreting PCA:
- Principal components represent uncorrelated combinations of the original predictors.
- The explained variance ratio tells you how much variance is captured by each component.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)  # Reduce dimensions
X_pca = pca.fit_transform(X.iloc[:, 1:])

In [None]:
from sklearn.decomposition import PCA
import numpy as np

# Scale predictors for PCA
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X.iloc[:, 1:])  # Exclude constant term

# Apply PCA
pca = PCA(n_components=2)  # Choose number of components
X_pca = pca.fit_transform(X_scaled)

# Fit model with PCA components
model_pca = sm.OLS(Y, sm.add_constant(X_pca)).fit()
print(model_pca.summary())
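
To make the two interpretation points above concrete (uncorrelated components and the explained variance ratio), the sketch below computes both directly from a singular value decomposition. It uses synthetic data and NumPy only; `X_demo` and the shared factor `z` are assumptions made for illustration, not part of the house-price example.

```python
import numpy as np

rng = np.random.default_rng(2)
z = rng.normal(size=(200, 1))
# Three hypothetical predictors that share one underlying factor z
X_demo = np.hstack([z + rng.normal(scale=0.2, size=(200, 1)) for _ in range(3)])

# Standardize, then take the SVD of the standardized matrix
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
_, singular_values, components = np.linalg.svd(X_std, full_matrices=False)

# Share of total variance captured by each principal component
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Principal component scores (the transformed, uncorrelated predictors)
scores = X_std @ components.T
pc_corr = float(np.corrcoef(scores[:, 0], scores[:, 1])[0, 1])

print("Explained variance ratio:", explained_variance_ratio.round(3))
print("Correlation between PC1 and PC2 scores:", round(pc_corr, 6))
```

With strongly correlated inputs, most of the variance should land on the first component, and the component scores should be uncorrelated up to floating-point error, which is what removes the multicollinearity.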

**Step 3: Comparing Models**

Evaluate model performance before and after applying mitigation techniques using metrics such as:

- Mean Squared Error (MSE)
- Adjusted $R^2$
- Akaike Information Criterion (AIC)

In [None]:
# Compare AIC (statsmodels models) and test MSE (Ridge and Lasso)
print("Original Model AIC:", model.aic)
print("Reduced Model AIC:", model_reduced.aic)
print("Ridge Model MSE:", ridge_mse)
print("Lasso Model MSE:", lasso_mse)

# Overfitting and Underfitting in Linear Regression

Overfitting occurs when a model performs well on the training data but not on the test data; underfitting occurs when it performs poorly on both.

**Bias**

Bias is the error introduced by the simplifying assumptions a model makes about the relationship between the inputs and the output.
- A model with high bias systematically misses relevant patterns, even in the training data.
    - Complex models, assuming there is enough training data available, can make accurate predictions. 
    - Models that are too naive are very likely to make poor predictions.
- Linear algorithms have a high bias, which makes them fast to learn and easier to understand, but in general less flexible. 
    - This implies lower predictive performance on complex problems whose structure does not match the model's assumptions.

**Variance**

Variance is the sensitivity of the model to the training data.
- It quantifies how much the fitted model changes when the training data changes.
    - The model shouldn't change too much from one training dataset to the next.
        - This means that the algorithm is good at picking out the hidden underlying patterns between the input and output variables.
    - A model with low variance doesn't change drastically when the training data changes (it is generalizable). 
        - A model with high variance changes drastically even with a small change in the training dataset.

**Bias Variance Tradeoff**

A supervised machine learning algorithm seeks to strike a balance between low bias and low variance for increased robustness.

Bias and variance typically move in opposite directions as model complexity changes.
- Reducing variance (a simpler model) tends to increase bias.
- Reducing bias (a more flexible model) tends to increase variance.

Finding an equilibrium between bias and variance is crucial, and algorithms must navigate this trade-off for optimal outcomes.
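
The trade-off can be made precise with the standard decomposition of the expected squared prediction error. As a sketch, assume the data follow $y = f(x) + \epsilon$ with noise variance $\sigma^2$, and let $\hat{f}$ be the fitted model (this notation is introduced here for illustration):

$$ E\left[(y - \hat{f}(x))^2\right] = \left(E[\hat{f}(x)] - f(x)\right)^2 + E\left[\left(\hat{f}(x) - E[\hat{f}(x)]\right)^2\right] + \sigma^2 $$

The first term is the squared bias, the second is the variance, and $\sigma^2$ is irreducible noise that no model can remove. Lowering one of the first two terms by changing model complexity usually raises the other.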

**Overfitting**

Overfitting occurs when a model learns the patterns and the noise in the training data to such an extent that its performance on unseen data suffers.
- The model fits the data so closely that it interprets noise as patterns in the data.

It arises when a model has low bias and high variance: the model ends up memorizing the training data.

Overfitting causes the model to become specific rather than generic. This usually leads to 
- high training accuracy and 
- much lower test accuracy.

There are several ways to prevent overfitting:
- Cross-validation
- If the training data is too small, add more relevant and clean data.
- If there are too many features, do some feature selection and remove unnecessary features.
- Regularization

**Underfitting**

Underfitting occurs when the model fails to learn the patterns in the training dataset and therefore also fails to generalize to the test dataset.

It is detected through poor performance metrics on both the training and the test data.

When a model has high bias and low variance, it is too simple to capture the structure of the data, which causes underfitting. 
- It is unable to find the hidden underlying patterns in the data. 
- This usually leads to low training accuracy and low test accuracy.

Ways to prevent underfitting:
- Increase the model complexity
- Increase the number of features in the training data
- Remove noise from the data.
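
The contrast between underfitting and overfitting can be seen by fitting polynomials of increasing degree to the same noisy data. The sketch below is illustrative only: the sine-shaped signal, the noise level, and the degrees 0, 3 and 15 are arbitrary choices made for this example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(-3, 3, size=60))
y = np.sin(x) + rng.normal(scale=0.3, size=60)  # non-linear signal plus noise

# Alternate the sorted points between a training and a test set
x_train, y_train = x[::2], y[::2]
x_test, y_test = x[1::2], y[1::2]

def mse(poly, xs, ys):
    """Mean squared error of a fitted polynomial on the points (xs, ys)."""
    return float(np.mean((poly(xs) - ys) ** 2))

results = {}
for degree in (0, 3, 15):
    poly = Polynomial.fit(x_train, y_train, degree)  # least-squares fit
    results[degree] = (mse(poly, x_train, y_train), mse(poly, x_test, y_test))
    print(f"degree {degree:>2}: train MSE = {results[degree][0]:.3f}, "
          f"test MSE = {results[degree][1]:.3f}")
```

The degree-0 model (a constant) underfits and should show a high error on both sets. Training error can only fall as the degree grows, so a high-degree fit typically shows a very low training error alongside a noticeably larger test error, which is the signature of overfitting.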

### Fitting the model using `statsmodels.OLS`

`sklearn` is limited in terms of the metrics and tools available to evaluate the appropriateness of the regression models we fit.
- To expand our analysis, we import the `statsmodels` library, which has a rich set of statistical tools to help us. 

##### Generating the regression string

Those of you familiar with the R language will know that fitting a regression model requires a formula string of the form:

`y ~ X`

which is read as follows: "Regress y on X". The `statsmodels` library works in a similar way, so we need to generate an appropriate string to feed to the method when we wish to fit the model.

In [None]:
import statsmodels.formula.api as smf

In [None]:
df.describe().T

In [None]:
# Regress target variable on all of the predictors.
formula_str = df.columns[0]+' ~ '+'+'.join(df.columns[1:]); formula_str

In [None]:
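# Importing seaborn and matplotlib for visualizations
import seaborn as sns
import matplotlib.pyplot as plt


# To plot all the scatterplots in a single plot
# (`height` replaces the deprecated `size` argument; column names must match the DataFrame exactly)
sns.pairplot(df, x_vars=['TV', 'Newspaper', 'Radio'], y_vars='Sales', height=4, kind='scatter')
plt.show()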

##### Plotting 3D plot for multiple Linear regression

To get a better idea of what a multi-dimensional dataset looks like, we'll generate a 3D scatter plot showing `mpg` on the _z_-axis (height), with two predictor variables, `cyl` and `wt`, on the _x_- and _y_-axes.

In [None]:
# create figure and 3d axes
fig = plt.figure(figsize=(8,7))
ax = fig.add_subplot(111, projection='3d')

# set axis labels
ax.set_zlabel('MPG')
ax.set_xlabel('No. of Cylinders')
ax.set_ylabel('Weight (1000 lbs)')

# scatter plot with response variable and 2 predictors
ax.scatter(df['cyl'], df['wt'], df['mpg'])

We know that in simple linear regression (2D), any model that we fit to data manifests in the form of a straight line. Extending this idea to 3D, the line becomes a plane - a flat surface which is chosen to minimise the squared vertical distances between each observation (red dots), and the plane, as shown in the figure below from ISLR.

<img src="https://github.com/Explore-AI/Public-Data/raw/master/3D%20regression%20ISLR.jpg" alt="plane" style="width: 450px"/>

The result of a multivariate linear regression in higher dimensionality is known as a _hyperplane_ - similar to the flat surface in the figure above, but in a _p_-dimensional space, where $p>3$. Unfortunately, humans lack the ability to visualise any number of dimensions greater than three - so we have to be content with the idea that a hyperplane in _p_-dimensional space is effectively like a flat surface in 3-dimensional space.

In [None]:
# To plot heatmap to find out correlations
sns.heatmap(df.corr(), cmap='YlGnBu', annot=True)
plt.show()

### Fitting the Multivariate Regression Model

In `sklearn`, fitting a multiple linear regression model is much the same as fitting a simple linear regression. This time, of course, our $X$ contains multiple columns, where it only contained one before. 

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split( X, y, train_size = 0.7, test_size = 0.3, random_state = 100 )

In [None]:
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

### Construct and fit the model

We now go ahead and fit our model.
- use the `ols` or Ordinary Least Squares regression model from the `statsmodels` library

In [None]:
import statsmodels.api as sm

In [None]:
import statsmodels.formula.api as smf

# Add a constant to get an intercept
X_train_sm = sm.add_constant(X_train)
# Fit the regression line using 'OLS'
lr = sm.OLS(y_train, X_train_sm).fit()

# OR, using the formula interface (the intercept is added automatically)

model = smf.ols(formula=formula_str, data=df1)
fitted = model.fit()

In [None]:
# Print the parameters, i.e. intercept and slopes of the regression obtained
lr.params

# extract model intercept (named 'const' by sm.add_constant)
beta_0 = float(lr.params['const'])

# extract model coeffs
beta_js = lr.params.drop('const').to_frame('Coefficient')
beta_js

### Interpreting Coefficients of Multilinear Regression

In a multilinear regression model, the coefficients represent the relationship between 
- each predictor (independent variable) and 
- the response (dependent variable), 
while controlling for the effects of other predictors in the model.

Intercept ($\beta_0$ | `beta_0`):
- This is the predicted value of the response variable when all predictors are set to zero.
- It is meaningful only if all predictors can realistically take a value of zero.

Slope Coefficients ($\beta_i$ | `beta_js`):
- Each $\beta_i$ measures the change in the response variable for a one-unit increase in predictor $X_i$, assuming all other predictors remain constant.
- A positive $\beta_i$: Indicates that an increase in $X_i$ is associated with an increase in the response.
- A negative $\beta_i$: Indicates that an increase in $X_i$ is associated with a decrease in the response.

P-Values:
- A p-value tests the null hypothesis that the coefficient $\beta_i$ is zero (no effect). 
    - If the p-value is small (typically <0.05), the predictor is considered statistically significant in explaining the response variable.

Standardized Coefficients:
- If predictors are measured in different units, their coefficients can't be directly compared. Standardized coefficients (beta weights) are used to determine the relative importance of predictors.

##### Explaining Multilinear Regression equation

$$ y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+ \epsilon $$

- $\beta_{1}$: if $\beta_{1}$ = 2, then a one-unit increase in $x_{1}$ is associated with an average increase of 2 units in $Y$, holding $x_{2}$ constant.
- $\beta_{2}$: if $\beta_{2}$ = -3, then a one-unit increase in $x_{2}$ is associated with an average decrease of 3 units in $Y$, holding $x_{1}$ constant.
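
The "holding the other predictor constant" reading can be checked with simple arithmetic. In this sketch the slopes 2 and -3 come from the bullets above, while the intercept of 1 and the input values are hypothetical choices; the error term is left out, so only the deterministic part of the equation is evaluated.

```python
def predict(x1, x2, beta_0=1.0, beta_1=2.0, beta_2=-3.0):
    """Deterministic part of y = beta_0 + beta_1*x1 + beta_2*x2 (error term omitted)."""
    return beta_0 + beta_1 * x1 + beta_2 * x2

baseline = predict(x1=4, x2=1)

# Effect of a one-unit increase in x1 with x2 held fixed
effect_x1 = predict(x1=5, x2=1) - baseline
# Effect of a one-unit increase in x2 with x1 held fixed
effect_x2 = predict(x1=4, x2=2) - baseline

print(effect_x1, effect_x2)
```

Whatever baseline values are chosen, the two differences equal the corresponding slopes, because the model is linear in each predictor.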

### Testing Relationships Between Response and Predictors

Multilinear regression tests the relationship between the response variable ($Y$) and the predictors ($x_{1}$, $x_{2}$, …, $x_{p}$) by modeling $Y$ as a linear combination of the predictors:

$$ y=\beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}+ \epsilon $$

1. Hypothesis Testing:
- For each predictor $x_{j}$, Null Hypothesis ($H_0$): $\beta_j = 0$ (the predictor has no effect on the response variable $y$, i.e. there is no relationship between the predictor $x_j$ and the response $y$).
- Alternative Hypothesis ($H_a$): $\beta_j \neq 0$ (the predictor has an effect, i.e. there is a relationship).

2. **t-statistic: test is performed for each coefficient**

How to Calculate the t-statistic in Linear regression

The t-statistic in linear regression measures how many standard errors the estimated coefficient is away from zero. 
- It is used for hypothesis testing to determine if a predictor variable is statistically significant.

The formula to calculate the t-statistic for a coefficient is:

$$t = \frac{\hat{\beta}_j}{SE_{\hat{\beta}_j}}$$

Where:
- $\hat{\beta}_j$: Estimated coefficient (e.g., slope or intercept).
- $SE_{\hat{\beta}_j}$: Standard error of the estimated coefficient $\hat{\beta}_j$.
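
The formula can be followed step by step with plain linear algebra. The sketch below assumes the usual OLS setting (standard errors taken from $\hat{\sigma}^2 (X^TX)^{-1}$) and uses synthetic data; `x1`, `x2` and the true coefficients are hypothetical values chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)                     # has no true effect on y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])   # design matrix with intercept
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                # OLS coefficient estimates

residuals = y - X @ beta_hat
dof = n - X.shape[1]                        # n - p - 1 with p = 2 predictors
sigma2_hat = residuals @ residuals / dof    # estimated error variance
se = np.sqrt(sigma2_hat * np.diag(XtX_inv)) # standard error of each coefficient
t_stats = beta_hat / se

for name, b, s, t in zip(["const", "x1", "x2"], beta_hat, se, t_stats):
    print(f"{name}: coef = {b:.3f}, SE = {s:.3f}, t = {t:.2f}")
```

These quantities should correspond to the `coef`, `std err` and `t` columns that `statsmodels` reports in its OLS summary for the same data. The predictor with a real effect (`x1`) gets a large t-statistic, while `x2` would usually stay close to zero.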

### The t-statistic may be a misleading variable importance indicator:

In multiple linear regression, the t-statistic evaluates the significance of individual predictor variables by testing the null hypothesis that a predictor's coefficient is zero ($H_0: \beta_j = 0$).

It can be misleading as an indicator of variable importance in multilinear regression for the following reasons:

- Multicollinearity
    - When predictor variables are highly correlated, the variance of the coefficient estimates increases.
    - This can lead to inflated standard errors and reduced t-statistics, causing variables to appear insignificant even if they are important.
        - Conversely, some variables might have significant t-statistics due to correlation with other predictors rather than their actual contribution to the response variable.

- Dependency on Units of Measurement
    - The coefficient estimates depend on the units of the predictor variables, while the t-statistic reflects how precisely a coefficient is estimated rather than how large its effect is. 
        - For example, rescaling a predictor changes its coefficient but leaves its t-statistic unchanged, so neither raw coefficients nor t-statistics are a sound basis for ranking variables measured in different units without standardization.

- Context of the Model
    - The importance of a variable depends on the context of other predictors in the model. 
        - Adding or removing predictors can change the coefficients and t-statistics, leading to different conclusions about importance.

- Does Not Reflect Contribution to $R^2$
    - The t-statistic evaluates the statistical significance of a single variable, but it does not measure its contribution to the model's overall explanatory power ($R^2$).
    - A variable may be statistically significant (high t-statistic) yet contribute little to the variance explained.

- Focuses on Statistical Significance Over Practical Significance
    - A high t-statistic indicates statistical significance but does not imply that the variable is practically meaningful or contributes substantially to predictions.

Best Practices to Assess Variable Importance
- Use metrics like standardized coefficients to account for differences in units.
- Evaluate variable importance metrics, such as partial $R^2$ , Shapley values, or permutation importance, especially in models with multicollinearity.
- Perform model comparison using adjusted $R^2$ or the Akaike Information Criterion (AIC) to assess the model's explanatory power with and without specific variables.

##### Implementation of best practices for assessing variable importance in multilinear regression:

- Standardized Coefficients: Calculates coefficients on a standardized scale for comparison.
- Partial $R^2$: Measures the share of the variation left unexplained by the other predictors that is explained once the variable is added.
- Permutation Importance: Evaluates the change in model performance when a variable's values are randomly shuffled.

In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
from sklearn.inspection import permutation_importance

# Example data
np.random.seed(42)
X = pd.DataFrame({
    'Variable_A': np.random.rand(100) * 100,
    'Variable_B': np.random.rand(100) * 50,
    'Variable_C': np.random.rand(100) * 10
})
y = 2 * X['Variable_A'] + 0.5 * X['Variable_B'] + 0.1 * X['Variable_C'] + np.random.randn(100) * 5

# Step 1: Fit a linear regression model
model = LinearRegression()
model.fit(X, y)

# Step 2: Assess importance using standardized coefficients
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model_scaled = LinearRegression()
model_scaled.fit(X_scaled, y)
standardized_coefficients = model_scaled.coef_

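# Step 3: Compute partial R-squared for each variable
def partial_r2(X, y, variable):
    # RSS of the full model with all predictors
    full_model = LinearRegression().fit(X, y)
    full_rss = np.sum((y - full_model.predict(X)) ** 2)
    # RSS of the reduced model without `variable`
    X_reduced = X.drop(columns=[variable])
    reduced_model = LinearRegression().fit(X_reduced, y)
    reduced_rss = np.sum((y - reduced_model.predict(X_reduced)) ** 2)
    # Share of the reduced model's unexplained variation accounted for by `variable`
    return (reduced_rss - full_rss) / reduced_rss

partial_r2_values = {var: partial_r2(X, y, var) for var in X.columns}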

# Step 4: Compute permutation importance
perm_importance = permutation_importance(model, X, y, n_repeats=30, random_state=42)

# Step 5: Display results
print("Standardized Coefficients:")
for var, coef in zip(X.columns, standardized_coefficients):
    print(f"{var}: {coef:.4f}")

print("\nPartial R-squared Values:")
for var, r2 in partial_r2_values.items():
    print(f"{var}: {r2:.4f}")

print("\nPermutation Importance:")
for var, importance in zip(X.columns, perm_importance.importances_mean):
    print(f"{var}: {importance:.4f}")


2. **F-Test for Overall Model Significance**:

The F-statistic is used in hypothesis testing to evaluate the overall significance of a multiple linear regression model. Specifically, it tests whether at least one of the predictor variables in the model significantly explains variation in the dependent variable.
- Tests the null hypothesis that all coefficients are zero ($\beta_{1} = \beta_{2} = \dots = \beta_{p} = 0$).
    - If the F-statistic is significant, at least one predictor has a relationship with $Y$.

The formula for the F-statistic is:

$$ F= \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$
$$ F= \frac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}}$$

Where:
- TSS: Total Sum of Squares
- RSS: Residual Sum of Squares
- n: Number of observations
- p: Number of predictors (excluding the intercept)
- Mean Square Regression (MSR): $\frac{TSS-RSS}{p}$
- Mean Square Error (MSE): $\frac{RSS}{n-p-1}$
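
As a short derivation, the same statistic can be written in terms of $R^2$, using the standard identity $R^2 = 1 - \frac{RSS}{TSS}$ (so that $TSS - RSS = R^2 \cdot TSS$ and $RSS = (1 - R^2) \cdot TSS$):

$$ F = \frac{\frac{R^2 \cdot TSS}{p}}{\frac{(1-R^2) \cdot TSS}{n-p-1}} = \frac{R^2 / p}{(1-R^2)/(n-p-1)} $$

For illustration, take hypothetical values $TSS = 1200$, $RSS = 300$, $n = 50$ and $p = 3$. Then $R^2 = 0.75$ and

$$ F = \frac{0.75 / 3}{0.25 / 46} = 46 $$

which matches $\frac{MSR}{MSE} = \frac{900/3}{300/46} = 46$ computed from the sums of squares directly.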

Steps:

1. Calculate the degrees of freedom:
- For regression: $p$
- For error: $n-p-1$

2. Compute the explained variance: $TSS-RSS$

3. Calculate Mean Square Regression (MSR) and Mean Square Error (MSE)

4. Compute the F-statistic:
$$F = \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$

When to Perform the F-Test?
- Perform the F-test whenever you have a regression model and want to evaluate its overall significance. 
- It is especially relevant in multiple linear regression with several predictors.

Why Perform the F-Test?
- To determine if the model as a whole is useful for predicting the dependent variable.
- It helps decide whether further analysis (e.g., testing individual predictors or refining the model) is warranted.

##### Practical Steps in Hypothesis Testing:

i.  Formulate the Hypotheses
- Null Hypothesis ($H_0$): All regression coefficients (except the intercept) are equal to zero, i.e., the predictors do not explain the variability in the dependent variable.
$$ H_0: \beta_{1} = \beta_{2} = \dots = \beta_{p} = 0$$
- Alternative Hypothesis ($H_a$): At least one of the regression coefficients is not zero, i.e., at least one predictor contributes to explaining the variability.
$$ H_a: \text{at least one } \beta_{j}\neq 0, \text{ for } j = 1,2,\dots,p$$

ii. Calculate the F-Statistic

$$ F= \frac{\text{Explained Mean Square (MSR)}}{\text{Residual Mean Square (MSE)}}$$
$$ F= \frac{\frac{TSS-RSS}{p}}{\frac{RSS}{n-p-1}}$$

where:

$$MSR = \frac{\text{Explained Variance } (TSS-RSS)}{\text{Degrees of Freedom for Regression } (df_{reg})}$$

$$MSE = \frac{\text{Residual Sum of Squares } (RSS)}{\text{Degrees of Freedom for Error } (df_{error})}$$

iii. Determine the Degrees of Freedom

- $df_{reg} = p$: Number of predictors.
- $df_{error} = n-p-1$: Residual degrees of freedom, where $n$ is the number of observations.

iv. Find the Critical Value

- Use the F-distribution table or Python to find the critical value for the given $\alpha$ (commonly 0.05), $df_{reg}$ and $df_{error}$.

v. Compare F-Statistic with the Critical Value
- If $F > F_{critical}$, reject $H_0$. 
    - This implies that at least one predictor is significant.
- If $F \le F_{critical}$, fail to reject $H_0$. 
    - This implies there is insufficient evidence that the predictors collectively explain the variability better than an intercept-only model.

vi. Use the p-Value (Optional)
- Instead of using a critical value, you can calculate the p-value associated with the F-statistic:
    - If p-value < $\alpha$, reject $H_0$.
    - If p-value ≥ $\alpha$, fail to reject $H_0$.

Interpreting Results
- Significant F-statistic: Indicates the model has predictive power and at least one predictor is meaningful.
- Non-significant F-statistic: Suggests the model does not explain variability better than a simple mean-based model.

In [None]:
# Example parameters
TSS = 1200  # Total Sum of Squares
RSS = 300   # Residual Sum of Squares
n = 50      # Number of observations
p = 3       # Number of predictors (excluding the intercept)

# Step 1: Degrees of freedom
df_regression = p               # Degrees of freedom for regression
df_error = n - p - 1           # Degrees of freedom for error

# Step 2: Explained variance
explained_variance = TSS - RSS

# Step 3: Calculate MSR and MSE
MSR = explained_variance / df_regression  # Mean Square Regression
MSE = RSS / df_error                     # Mean Square Error

# Step 4: Calculate the F-statistic
F_statistic = MSR / MSE

# Step 5: Perform hypothesis testing
import scipy.stats as stats

# Calculate the critical value for the F-distribution
alpha = 0.05  # Significance level
F_critical = stats.f.ppf(1 - alpha, df_regression, df_error)

# Calculate the p-value for the F-statistic
p_value = 1 - stats.f.cdf(F_statistic, df_regression, df_error)

# Print the results
print(f"Degrees of Freedom (Regression): {df_regression}")
print(f"Degrees of Freedom (Error): {df_error}")
print(f"Explained Variance: {explained_variance}")
print(f"Mean Square Regression (MSR): {MSR}")
print(f"Mean Square Error (MSE): {MSE}")
print(f"F-Statistic: {F_statistic}")
print(f"Critical F-Value: {F_critical}")
print(f"P-Value: {p_value}")

# Decision based on F-statistic
if F_statistic > F_critical:
    print("Reject the null hypothesis: At least one predictor is significant.")
else:
    print("Fail to reject the null hypothesis: The model is not significant.")


3. Assessing Fit:
- Coefficient of Determination $R^2$ : 
    - Proportion of variance in $Y$ explained by the predictors.
    - Purpose: Measures the proportion of variance in the dependent variable explained by the independent variables.
    - When to Use: Always, as a baseline measure of model fit.
- Adjusted $R^2$: 
    - Adjusts $R^2$ for the number of predictors, penalizing the inclusion of irrelevant predictors.
    - Key Consideration: Adjusted $R^2$ accounts for the number of predictors, providing a better measure for models with multiple variables.
- Residual Analysis
    - Purpose: Examines the residuals (differences between observed and predicted values) to check assumptions of the regression model.
    - How to Use:
        - Plot residuals vs. predicted values to check for patterns (should appear random).
        - Use a histogram or Q-Q plot of residuals to check normality.
        - Examine residuals vs. independent variables to check for independence.
    - When to Use: Always, to validate assumptions like linearity, homoscedasticity, and normality.
- Mean Squared Error (MSE)
    - Purpose: Measures the average squared difference between observed and predicted values.
    - When to Use: To quantify model error; lower MSE indicates better fit.
- F-Statistic
    - Purpose: Tests the overall significance of the model by comparing explained variance to unexplained variance.
    - When to Use: To test whether at least one predictor is significant in explaining the variance of the dependent variable.
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
    - Purpose: Compare models, penalizing for model complexity.
    - When to Use: When comparing multiple models with different numbers of predictors or structures.
- Cross-Validation
    - Purpose: Evaluates the model's performance on unseen data.
    - How to Use:
        - Use k-fold cross-validation to divide data into training and test sets.
        - Calculate metrics (e.g., $R^2$ , MSE) on test sets.
    - When to Use: To assess model generalizability.
- Variance Inflation Factor (VIF)
    - Purpose: Detects multicollinearity among predictors.
    - How to Use: Compute VIF for each predictor; values > 10 indicate high multicollinearity.
    - When to Use: To assess stability of coefficient estimates.
- Cook's Distance and Leverage
    - Purpose: Identifies influential observations that disproportionately affect the regression results.
    - How to Use:
        - Cook's Distance: Observations with values > 1 are considered influential.
        - Leverage: High-leverage points have significant potential to influence the model.
    - When to Use: To identify outliers and influential data points.
- Normalized Residual Standard Error (NRSE)
    - Purpose: Provides a standardized measure of error in the model.
    - When to Use: To compare models with different dependent variable scales.
- Predictive Metrics (e.g., RMSE, MAE)
    - Purpose: Evaluate model accuracy in predicting outcomes.
    - When to Use: For regression models focused on prediction.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.api import OLS, add_constant

# Example dataset
np.random.seed(42)
X = pd.DataFrame({
    'Variable_A': np.random.rand(100) * 100,
    'Variable_B': np.random.rand(100) * 50,
    'Variable_C': np.random.rand(100) * 10
})
y = 2 * X['Variable_A'] + 0.5 * X['Variable_B'] + 0.1 * X['Variable_C'] + np.random.randn(100) * 5

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit a linear regression model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 1. Coefficient of Determination (R^2)
r2 = r2_score(y_test, y_pred)
adjusted_r2 = 1 - (1 - r2) * (len(y_test) - 1) / (len(y_test) - X_test.shape[1] - 1)
print(f"R^2: {r2:.4f}, Adjusted R^2: {adjusted_r2:.4f}")

# 2. Residual Analysis
residuals = y_test - y_pred
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
sns.scatterplot(x=y_pred, y=residuals)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted')

plt.subplot(1, 2, 2)
sns.histplot(residuals, kde=True)
plt.xlabel('Residuals')
plt.title('Distribution of Residuals')
plt.show()

# 3. Mean Squared Error (MSE)
mse = mean_squared_error(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse:.4f}")

# 4. F-Statistic (using statsmodels)
X_train_const = add_constant(X_train)
ols_model = OLS(y_train, X_train_const).fit()
print(ols_model.summary())

# 5. Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
print(f"AIC: {ols_model.aic:.4f}, BIC: {ols_model.bic:.4f}")

# 6. Cross-Validation
cv_scores = cross_val_score(model, X, y, cv=5, scoring='r2')
print(f"Cross-Validation R^2 Scores: {cv_scores}")
print(f"Mean CV R^2: {np.mean(cv_scores):.4f}")

# 7. Variance Inflation Factor (VIF)
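# Include an intercept column so each VIF comes from an auxiliary regression with a constant
X_vif = add_constant(X)
vif_data = pd.DataFrame()
vif_data['Variable'] = X.columns
vif_data['VIF'] = [variance_inflation_factor(X_vif.values, i) for i in range(1, X_vif.shape[1])]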
print("Variance Inflation Factor (VIF):")
print(vif_data)

# 8. Cook's Distance and Leverage
influence = ols_model.get_influence()
cooks_d = influence.cooks_distance[0]
high_influence_points = np.where(cooks_d > 4 / len(X_train))[0]
print(f"High Influence Points (Cook's Distance > 4/n): {high_influence_points}")

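# 9. Residual Standard Error (RSE)
# RSE = sqrt(RSS / (n - p - 1)) on the training fit; statsmodels exposes RSS / df as mse_resid
rse = np.sqrt(ols_model.mse_resid)
print(f"Residual Standard Error (RSE): {rse:.4f}")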

# 10. Predictive Metrics (RMSE and MAE)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = np.mean(np.abs(y_test - y_pred))
print(f"Root Mean Squared Error (RMSE): {rmse:.4f}, Mean Absolute Error (MAE): {mae:.4f}")


In [None]:
############## Step 1: Import Libraries and Load Data
import pandas as pd
import numpy as np
import statsmodels.api as sm
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Generate example dataset
np.random.seed(42)
data = {
    "SquareFootage": np.random.uniform(500, 4000, 100),
    "Bedrooms": np.random.randint(1, 6, 100),
    "DistanceFromCityCenter": np.random.uniform(1, 20, 100),
    "HousePrice": np.random.uniform(50000, 500000, 100),
}

df = pd.DataFrame(data)

# Print sample data
print(df.head())

############# Step 2: Fit the Multilinear Regression Model

# Define predictors (X) and response (Y)
X = df[["SquareFootage", "Bedrooms", "DistanceFromCityCenter"]]
Y = df["HousePrice"]

# Add a constant for the intercept
X = sm.add_constant(X)

# Fit the model
model = sm.OLS(Y, X).fit()

# Summary of the model
print(model.summary())

############## Step 3: Interpret the Output

# summary() lists the fitted coefficients and diagnostics of the regression.
# It is available on any fitted statsmodels results object (here, `model`).
print(model.summary())

##### Interpretation of `summary`

The `model.summary()` output provides:

Coefficients:
- $\beta_{0}$: The intercept.
- $\beta_{1}$, $\beta_{2}$, $\beta_{3}$: Coefficients for predictors.

P-values:
- Assess the significance of each predictor.
  - If $p < 0.05$, the predictor significantly explains variations in the response variable.

$R^{2}$ and Adjusted $R^{2}$ :
- Measure how much variance in the response is explained by the predictors.

**F-statistic**:
- Tests the overall significance of the model.
- It tests whether at least one predictor variable in the model has a non-zero coefficient, meaning it contributes significantly to explaining the variance in the dependent variable.

In [None]:
# Example parameters
TSS = 1200  # Total Sum of Squares
RSS = 300   # Residual Sum of Squares
n = 50      # Number of observations
p = 3       # Number of predictors (excluding the intercept)

# Step 1: Degrees of freedom
df_regression = p               # Degrees of freedom for regression
df_error = n - p - 1           # Degrees of freedom for error

# Step 2: Explained variance
explained_variance = TSS - RSS

# Step 3: Calculate MSR and MSE
MSR = explained_variance / df_regression  # Mean Square Regression
MSE = RSS / df_error                     # Mean Square Error

# Step 4: Calculate the F-statistic
F_statistic = MSR / MSE

# Print the results
print(f"Degrees of Freedom (Regression): {df_regression}")
print(f"Degrees of Freedom (Error): {df_error}")
print(f"Explained Variance: {explained_variance}")
print(f"Mean Square Regression (MSR): {MSR}")
print(f"Mean Square Error (MSE): {MSE}")
print(f"F-Statistic: {F_statistic}")


4. Assumptions:
- Linearity: The relationship between predictors and response is linear.
- Independence of Errors: Errors are independent of each other.
- Homoscedasticity: Constant variance of errors.
- Normality of Errors: Errors are normally distributed.

##### Practical Steps:
1. Plot residuals to check assumptions.
2. Use statistical tests (e.g., Shapiro-Wilk for normality, Breusch-Pagan for homoscedasticity).
3. Apply transformations or alternative models if assumptions are violated.

In [None]:
############### Step 4: Visualize Residuals to Check Assumptions

# Linearity and Homoscedasticity
# Plot residuals against predicted values
predicted = model.predict(X)
residuals = Y - predicted

plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.title("Residuals vs Predicted Values")
plt.show()

# Normality of Errors
# Plot residual distribution
sns.histplot(residuals, kde=True)
plt.title("Residual Distribution")
plt.show()

# Perform Shapiro-Wilk test for normality
from scipy.stats import shapiro
shapiro_test = shapiro(residuals)
print(f"Shapiro-Wilk test p-value: {shapiro_test.pvalue}")


Step 5: Interpretation

Coefficient Interpretation
- If the coefficient $\beta_{1}$ = 50, then each additional square foot is associated with a \$50 increase in house price, assuming the other predictors are held constant.

Model Fit
- If $R^{2}$ = 0.85, then 85% of the variance in house prices is explained by the predictors.
- Check adjusted $R^{2}$ to ensure added predictors improve the model meaningfully.

Assumptions
- Linearity: A residual plot with no visible pattern is consistent with linearity.
- Homoscedasticity: Residuals should have constant variance (scatter evenly around zero).
- Normality: Residuals should approximately follow a normal distribution.
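
The "held constant" reading of a coefficient can be checked numerically. The sketch below is a minimal illustration on made-up, noise-free data (the prices and the generating equation are assumptions for this example, not results from the dataset above): two houses that differ by exactly one square foot, with the same number of bedrooms, differ in predicted price by the fitted coefficient.

```python
import numpy as np

# Hypothetical noise-free data: price = 20000 + 50 * sqft + 10000 * bedrooms
sqft = np.array([800.0, 1200.0, 1500.0, 2000.0, 2400.0, 3000.0])
bedrooms = np.array([1.0, 2.0, 3.0, 3.0, 4.0, 5.0])
price = 20000 + 50 * sqft + 10000 * bedrooms

# Ordinary least squares with an explicit intercept column
design = np.column_stack([np.ones_like(sqft), sqft, bedrooms])
beta, *_ = np.linalg.lstsq(design, price, rcond=None)
intercept, beta_sqft, beta_bedrooms = beta

# Two houses that differ by one square foot, bedrooms held constant
house_a = np.array([1.0, 1500.0, 3.0])
house_b = np.array([1.0, 1501.0, 3.0])
price_difference = house_b @ beta - house_a @ beta

print(f"Coefficient for square footage: {beta_sqft:.2f}")
print(f"Predicted price difference    : {price_difference:.2f}")
```
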
___________

A few 2-dimensional plots: `wt`, `disp`, `cyl`, and `hp` are each plotted against `mpg` (top-left to bottom-right).

In [None]:
fig, axs = plt.subplots(2, 2, figsize=(9,7))

axs[0,0].scatter(df['wt'], df['mpg'])
axs[0,0].plot(df['wt'], lm.intercept_ + lm.coef_[4]*df['wt'], color='red')
axs[0,0].title.set_text('Weight (wt) vs. mpg')

axs[0,1].scatter(df['disp'], df['mpg'])
axs[0,1].plot(df['disp'], lm.intercept_ + lm.coef_[1]*df['disp'], color='red')
axs[0,1].title.set_text('Engine displacement (disp) vs. mpg')

axs[1,0].scatter(df['cyl'], df['mpg'])
axs[1,0].plot(df['cyl'], lm.intercept_ + lm.coef_[0]*df['cyl'], color='red')
axs[1,0].title.set_text('Number of cylinders (cyl) vs. mpg')

axs[1,1].scatter(df['hp'], df['mpg'])
axs[1,1].plot(df['hp'], lm.intercept_ + lm.coef_[2]*df['hp'], color='red')
axs[1,1].title.set_text('Horsepower (hp) vs. mpg')

fig.tight_layout(pad=3.0)

plt.show()

### Assessing Model Accuracy

Let's assess the fit of our multivariate model. For a rudimentary comparison, let's measure its accuracy against a simple linear regression model.

In [None]:
# Add a constant to X_test
X_test_sm = sm.add_constant(X_test)
# Predict the y values corresponding to X_test_sm
y_pred = lr.predict(X_test_sm)

We have included a column *Test RMSE*, which is simply the square root of the *Test MSE*.


\begin{align}
RMSE & = \sqrt{MSE} \\
     & = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^{2}}
\end{align}

Where $y_i$ are the actual target values for a dataset with $N$ data points, and $\hat{y}_i$ are our corresponding predictions. RMSE is a more intuitive metric than MSE because it is in the same units as the variable being predicted.
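
To make the formula concrete, here is a small worked example on four made-up observations (the numbers are assumptions chosen so the arithmetic is easy to follow by hand).

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])     # actual values y_i
y_hat = np.array([2.0, 5.0, 8.0, 11.0])     # predictions

squared_errors = (y_hat - y_true) ** 2      # [1, 0, 1, 4]
mse = squared_errors.mean()                 # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                         # square root of 1.5, about 1.2247

print(f"MSE : {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
```

The MSE is in squared units of the target, whereas the RMSE is back in the target's own units.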

In [None]:
from sklearn import metrics
import math

# Importing libraries
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

# dictionary of results
results_dict = {'Training MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_train, slr.predict(X_train[['disp']])),
                        "MLR": metrics.mean_squared_error(y_train, lm.predict(X_train))
                    },
                'Test MSE':
                    {
                        "SLR": metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']])),
                        "MLR": metrics.mean_squared_error(y_test, lm.predict(X_test))
                    },
                'Test RMSE':
                    {
                        "SLR": math.sqrt(metrics.mean_squared_error(y_test, slr.predict(X_test[['disp']]))),
                        "MLR": math.sqrt(metrics.mean_squared_error(y_test, lm.predict(X_test)))
                    }
                }

In [None]:
# RMSE value
print("RMSE: ", np.sqrt(mean_squared_error(y_test, y_pred)))
# R-squared value
print("R-squared: ", r2_score(y_test, y_pred))

In [None]:
X_train_lm = X_train_lm.values.reshape(-1,1)
X_test_lm = X_test_lm.values.reshape(-1,1)

In [None]:
print(X_train_lm.shape)
print(X_test_lm.shape)

In [None]:
from sklearn.linear_model import LinearRegression
#Representing LinearRegression as lr (creating LinearRegression object)
lr = LinearRegression()
#Fit the model using lr.fit()
lr.fit(X_train_lm,y_train_lm)

In [None]:
#get intercept
print(lr.intercept_)
#get slope
print(lr.coef_)

# Addressing Assumptions in Multilinear Regression

Initial Diagnostics:
- Examine scatter plots and residual plots.
- Test assumptions (e.g., Breusch-Pagan for heteroscedasticity, Shapiro-Wilk for normality).

Transform Data if Necessary:
- Use log, Box-Cox, or polynomial transformations to address issues like non-linearity and heteroscedasticity.

Refit and Compare Models:
- Use metrics like adjusted $R^2$, Akaike Information Criterion (AIC), and Bayesian Information Criterion (BIC) to compare models.

Document Interpretations:
- Explain coefficients in the context of transformed variables.
- Discuss any trade-offs made during model selection.
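
As one example of explaining coefficients in the context of a transformed variable: when the response is log-transformed, a slope is read multiplicatively rather than additively. The sketch below uses made-up, noise-free data (the generating equation is an assumption for illustration) and converts the fitted slope into a percentage change.

```python
import numpy as np

# Hypothetical noise-free data: log(price) = 11.0 + 0.0004 * sqft
sqft = np.array([800.0, 1200.0, 1600.0, 2000.0, 2400.0])
price = np.exp(11.0 + 0.0004 * sqft)

# Regress log(price) on sqft with an intercept
design = np.column_stack([np.ones_like(sqft), sqft])
beta, *_ = np.linalg.lstsq(design, np.log(price), rcond=None)
slope = beta[1]

# In a log-response model, adding k units of x multiplies y by exp(k * slope)
pct_change_per_sqft = (np.exp(slope) - 1) * 100
pct_change_per_100_sqft = (np.exp(100 * slope) - 1) * 100

print(f"Slope on the log scale        : {slope:.6f}")
print(f"Change per extra square foot  : {pct_change_per_sqft:.4f}%")
print(f"Change per extra 100 sq. feet : {pct_change_per_100_sqft:.4f}%")
```

For small slopes the percentage change per unit is close to `100 * slope`, which is why log-model coefficients are often quoted directly as approximate percentage effects.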

# Checking for Independence

Independence of Errors: Errors should be independent (important for time series or clustered data).
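
For data with a natural ordering (such as a time series), one common numeric check of independence is the Durbin-Watson statistic. It is not used elsewhere in this notebook, so the following is offered as a complementary sketch on synthetic residuals. Values near 2 suggest no first-order autocorrelation, while values well below 2 suggest positive autocorrelation. The helper name `durbin_watson_stat` and the simulated series are assumptions for this example; statsmodels provides an equivalent `durbin_watson` function in `statsmodels.stats.stattools`.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

rng = np.random.default_rng(1)

# Independent residuals
independent = rng.normal(size=500)

# AR(1) residuals with positive autocorrelation (rho = 0.8, an assumed value)
shocks = rng.normal(size=500)
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.8 * correlated[t - 1] + shocks[t]

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)

print(f"Durbin-Watson, independent residuals: {dw_independent:.3f}")
print(f"Durbin-Watson, correlated residuals : {dw_correlated:.3f}")
```
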

We have done checks for linearity and multicollinearity, which both concerned the predictor variables. 

We now check some of the artefacts of the fitted model for three more statistical phenomena that further help us determine its quality.

#### Residuals vs. Predictor Variables Plots 

The first check we do involves plotting the residuals (vertical distances between each data point and the regression hyperplane) against each predictor variable. 
- We are looking to confirm the independence assumption here, i.e. the residuals should be independent. 

If they are, we will see:
- Residuals approximately uniformly and randomly distributed about the zero line;
- Residuals not forming specific clusters.

When observing the plots, check two things:

- Skew: whether the residuals lean to the positive or negative side (for example, reaching +5 but only about -3);

- Clustering: whether the residuals for any predictor bunch around a particular value (for example, a cluster at the value 6).

Conclusion: decide whether the residuals are largely independent.

In [None]:
fig, axs = plt.subplots(2,5, figsize=(14,6),sharey=True)
fig.subplots_adjust(hspace = 0.5, wspace=.2)
fig.suptitle('Predictor variables vs. model residuals', fontsize=16)
axs = axs.ravel()

# Plot residuals against every column except the response (`mpg` is assumed to be the target here)
predictors = [column for column in df.columns if column != 'mpg']
for index, (ax, column) in enumerate(zip(axs, predictors)):
    ax.set_title("{}".format(column),fontsize=12)
    ax.scatter(x=df[column],y=fitted.resid,color='blue',edgecolor='k')
    ax.grid(True)
    xmin = min(df[column])
    xmax = max(df[column])
    ax.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
    # Label the y-axis on the first panel of each row
    if index % 5 == 0:
        ax.set_ylabel('Residuals')

# Checking for Homoscedasticity

Homoscedasticity is an important assumption in linear regression. It implies that the variance of the residuals (errors) is constant across all levels of the independent variables. When this assumption is violated (heteroscedasticity), the model's standard errors and p-values can become unreliable, potentially leading to incorrect inferences.

Homoscedasticity: Residuals should have constant variance.

What needs to be done: Check whether the variance of the residuals (the error terms) is constant as the fitted values increase. 

#### Fitted vs. Residuals

Determine this by plotting the fitted values (i.e. the predicted `mpg`) against the residuals. 
- What we are looking for is the plotted points to approximately form a rectangle.
- The magnitude of the residuals should not increase as the fitted values increase (if it does, the data will form the shape of a cone on its side).

**Observation**
- If the variance is constant, we have observed _homoscedasticity_. 
- If the variance is not constant, we have observed _heteroscedasticity_. 

Use the same plot to check for outliers: any plotted points that are visibly separate from the random pattern of the rest of the residuals.

**Observation**
- Look at the data points on each side of the plot and compare how densely they are scattered.
    - If points towards one side of the plot (for example, the right-hand side) are scattered less densely than the rest, this indicates heteroscedasticity.
    - This violates our assumption of homoscedasticity. 
- Look for the presence of outliers.
    - Outliers are weighted too heavily in the fitting process, disproportionately influencing the model's performance. 
    - This in turn can lead to confidence intervals for out-of-sample predictions (unseen data) being unrealistically wide or narrow.

If heteroscedasticity is present:
- Solution: Use transformations (log, Box-Cox) or weighted least squares regression.

**Step 1: Diagnosing Heteroscedasticity**

(a) Residual Plot
- Plot the residuals against the predicted values to check for patterns.

Interpretation:
- If the points are randomly scattered, homoscedasticity is likely satisfied.
- A funnel-shaped or other pattern suggests heteroscedasticity.

In [None]:
plt.figure(figsize=(8,3))
p=plt.scatter(x=fitted.fittedvalues,y=fitted.resid,edgecolor='k')
xmin = min(fitted.fittedvalues)
xmax = max(fitted.fittedvalues)
plt.hlines(y=0,xmin=xmin*0.9,xmax=xmax*1.1,color='red',linestyle='--',lw=3)
plt.xlabel("Fitted values",fontsize=15)
plt.ylabel("Residuals",fontsize=15)
plt.title("Fitted vs. residuals plot",fontsize=18)
plt.grid(True)
plt.show()

In [None]:
import matplotlib.pyplot as plt

# Predicted values and residuals
predicted = model.predict(X)
residuals = Y - predicted

# Residual plot
plt.scatter(predicted, residuals)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()

(b) Breusch-Pagan Test
- This statistical test explicitly checks for heteroscedasticity.

Interpretation:
- Null Hypothesis: Homoscedasticity is present.
- If p-value < 0.05, reject the null hypothesis, indicating heteroscedasticity.

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan

# Breusch-Pagan test
bp_test = het_breuschpagan(residuals, X)
print("Breusch-Pagan Test Results:")
print(f"LM Statistic: {bp_test[0]}")
print(f"p-value: {bp_test[1]}")
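
To show what the test is doing under the hood, the sketch below computes the studentized (Koenker) LM form of the Breusch-Pagan statistic by hand: regress the squared residuals on the predictors and take $n \cdot R^2$, which is compared against a chi-squared distribution. The synthetic data and the helper names `breusch_pagan_lm` and `ols_residuals` are assumptions made for this illustration.

```python
import numpy as np
from scipy import stats

def ols_residuals(y, exog):
    """Residuals from an OLS fit; `exog` must already include an intercept column."""
    beta, *_ = np.linalg.lstsq(exog, y, rcond=None)
    return y - exog @ beta

def breusch_pagan_lm(residuals, exog):
    """LM statistic n * R^2 from regressing squared residuals on exog, with its p-value."""
    n, k = exog.shape
    squared = residuals ** 2
    aux_residuals = ols_residuals(squared, exog)
    r2 = 1 - np.sum(aux_residuals ** 2) / np.sum((squared - squared.mean()) ** 2)
    lm = n * r2
    p_value = stats.chi2.sf(lm, k - 1)   # degrees of freedom: non-constant regressors
    return lm, p_value

rng = np.random.default_rng(2)
n = 400
x = rng.uniform(1, 10, size=n)
exog = np.column_stack([np.ones(n), x])

# Error spread grows with x: heteroscedastic by construction
y_hetero = 2 + 3 * x + rng.normal(scale=x, size=n)
# Constant error spread: homoscedastic by construction
y_homo = 2 + 3 * x + rng.normal(scale=3.0, size=n)

lm_het, p_het = breusch_pagan_lm(ols_residuals(y_hetero, exog), exog)
lm_hom, p_hom = breusch_pagan_lm(ols_residuals(y_homo, exog), exog)

print(f"Heteroscedastic data: LM = {lm_het:.2f}, p-value = {p_het:.4g}")
print(f"Homoscedastic data  : LM = {lm_hom:.2f}, p-value = {p_hom:.4g}")
```

A very small p-value on the first dataset leads to rejecting the null hypothesis of homoscedasticity, matching the interpretation given above.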


(c) White Test
- Another test for heteroscedasticity, more flexible than Breusch-Pagan.

In [None]:
from statsmodels.stats.diagnostic import het_white

# White test
white_test = het_white(residuals, X)
print("White Test Results:")
print(f"LM Statistic: {white_test[0]}")
print(f"p-value: {white_test[1]}")


**Step 2: Addressing Heteroscedasticity/ Handling Homoscedasticity**

1. Transforming the Response Variable
- Apply transformations to stabilize variance.

(a) Log Transformation: Use when variance increases with the response variable.

In [None]:
df["Log_HousePrice"] = np.log(df["HousePrice"])
model_log = sm.OLS(df["Log_HousePrice"], X).fit()
print(model_log.summary())

(b) Box-Cox Transformation: Estimates the power-transformation parameter $\lambda$ that makes the response most nearly normal. It requires a strictly positive response.

In [None]:
from scipy.stats import boxcox

Y_transformed, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

model_boxcox = sm.OLS(Y_transformed, X).fit()
print(model_boxcox.summary())


2. Applying Weighted Least Squares (WLS)

If heteroscedasticity is detected:
- Use WLS to assign weights inversely proportional to the variance of residuals.

When to use:
- When residual patterns vary predictably with certain predictors.

In [None]:
import statsmodels.api as sm
import numpy as np

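# One common feasible-WLS approach: estimate how the error variance changes with
# the predictors by regressing log(squared residuals) on X. Using 1 / residuals**2
# directly would give huge weights to points the OLS line happens to fit well.
log_resid_sq = np.log(residuals**2)
variance_model = sm.OLS(log_resid_sq, X).fit()
estimated_variance = np.exp(variance_model.fittedvalues)

# Weights are inversely proportional to the estimated error variance
weights = 1 / estimated_variance

# Fit WLS model
model_wls = sm.WLS(Y, X, weights=weights).fit()
print(model_wls.summary())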


3. Heteroscedasticity-Robust Standard Errors
- Use robust standard errors to correct inference without changing the model structure.

Types of Robust Covariance:
- "HC0": White's original heteroscedasticity-consistent estimator.
- "HC1", "HC2", "HC3": Small-sample corrections of HC0, with "HC3" typically the most conservative.

In [None]:
# Fit OLS model with robust standard errors
model_robust = sm.OLS(Y, X).fit(cov_type="HC3")
print(model_robust.summary())

Check Residual Plots After Mitigation

In [None]:
# Residuals of WLS model
predicted_wls = model_wls.predict(X)
residuals_wls = Y - predicted_wls

plt.scatter(predicted_wls, residuals_wls)
plt.axhline(0, color="red", linestyle="--")
plt.title("Residual Plot After WLS")
plt.xlabel("Predicted Values")
plt.ylabel("Residuals")
plt.show()

**Step 3: Comparing Models**

Evaluate and Compare Performance
- Residual plots before and after adjustments.
- Metrics like $R^2$, AIC, and BIC.

In [None]:
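# Compare AIC across models (lower is better)
# Note: AIC values are only directly comparable between models fitted to the same response,
# so the log and Box-Cox models (transformed response) are not on the same scale as the original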
print(f"Original Model AIC: {model.aic}")
print(f"Log-Transformed Model AIC: {model_log.aic}")
print(f"Box-Cox Model AIC: {model_boxcox.aic}")
print(f"WLS Model AIC: {model_wls.aic}")

Best Practices and Considerations

Diagnosis:
- Always check residual plots and use tests like Breusch-Pagan or White.

Correction:
- Start with transformations if patterns suggest non-linearity or skewed responses.
- Use WLS or robust standard errors for complex variance structures.

Validation:
- Ensure improvements in residual plots and metrics.
- Balance interpretability and complexity when applying advanced techniques.

# Checking for Normality

The normality of residuals is a key assumption in linear regression, especially for inference. It ensures that t-tests and F-tests for significance are valid. If residuals are not normally distributed, it can lead to unreliable p-values and confidence intervals.

We now confirm our assumption of normality amongst the residuals. 
- If the residuals are non-normally distributed, confidence intervals can become too wide or too narrow, 
    - which makes inference about the coefficients estimated by ordinary least squares unreliable.

Check for violation of the normality assumption in two different ways:
1. Plotting a histogram of the normalised residuals;
2. Generating a Q-Q plot of the residuals.

**Step 1: Testing for Normality**

(a) Visual Inspection: Histogram and Q-Q Plot

1. Histogram: Examine the residuals' distribution.

Plot a histogram of the residuals to take a look at their distribution. 
- It is fairly easy to pick up when a distribution looks similar to the classic _bell curve_ shape of the normal distribution.

Interpretation:
- Histogram: A bell-shaped curve suggests normality.

In [None]:
plt.figure(figsize=(8,5))
plt.hist(fitted.resid_pearson,bins=8,edgecolor='k')
plt.ylabel('Count',fontsize=15)
plt.xlabel('Normalized residuals',fontsize=15)
plt.title("Histogram of normalized residuals",fontsize=18)
plt.show()

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats

# Calculate residuals
residuals = Y - model.predict(X)

# Histogram
plt.hist(residuals, bins=20, edgecolor='k', alpha=0.7)
plt.title("Residual Histogram")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

2. Q-Q plot of the residuals

Compare residuals to a normal distribution.
- A Q-Q plot (quantile-quantile plot) plots the theoretical quantiles of the standard normal distribution against the quantiles of the residuals. 
- The one-to-one line, indicated in red below, is the ideal line indicating normality. 
- The closer the plotted points are to the red line, the closer the residual distribution is to the standard normal distribution.

Quantiles are cut points dividing the range of a probability distribution into continuous intervals with equal probabilities, or dividing the observations in a sample in the same way.
- The 2-quantile is known as the median
- The 4-quantiles are known as quartiles
- The 10-quantiles are known as deciles
- The 100-quantiles are known as percentiles

Interpretation:
- Q-Q Plot: Points should lie close to the 45° line for normality.

For example, the 10-quantiles divide the normal distribution into 10 parts, each containing 10% of the data points. A Q-Q plot is a scatter plot created by plotting two sets of quantiles against one another.
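
The quantile pairing behind a Q-Q plot can be computed by hand. The sketch below is a minimal illustration on simulated data standing in for standardized residuals. The plotting positions `(i - 0.5) / n` are one common convention among several, and the sample itself is an assumption for this example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=0.0, scale=1.0, size=200)   # stand-in for standardized residuals

# Sample quantiles: simply the sorted observations
sample_quantiles = np.sort(sample)

# Probabilities at which to evaluate the theoretical distribution
n = len(sample)
probabilities = (np.arange(1, n + 1) - 0.5) / n

# Theoretical quantiles of the standard normal at those probabilities
theoretical_quantiles = stats.norm.ppf(probabilities)

# For normal data the pairs fall near the 45-degree line, so their correlation is close to 1
qq_correlation = np.corrcoef(theoretical_quantiles, sample_quantiles)[0, 1]
print(f"Correlation between theoretical and sample quantiles: {qq_correlation:.4f}")

# Deciles (10-quantiles) of the standard normal: 9 cut points, 10 equal-probability parts
deciles = stats.norm.ppf(np.arange(1, 10) / 10)
print("Standard normal deciles:", np.round(deciles, 3))
```

Plotting `theoretical_quantiles` on the x-axis against `sample_quantiles` on the y-axis gives the Q-Q plot itself.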

In [None]:
# We once again use the statsmodel library to assist us in producing our qqplot visualisation. 
from statsmodels.graphics.gofplots import qqplot

In [None]:
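fig, ax = plt.subplots(figsize=(8,5))
# qqplot draws on the supplied axes; fit=True standardizes the residuals before plotting
qqplot(fitted.resid_pearson,line='45',fit=True,ax=ax)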
plt.xticks(fontsize=13)
plt.yticks(fontsize=13)
plt.xlabel("Theoretical quantiles",fontsize=15)
plt.ylabel("Sample quantiles",fontsize=15)
plt.title("Q-Q plot of normalized residuals",fontsize=18)
plt.grid(True)
plt.show()

In [None]:
import matplotlib.pyplot as plt
import scipy.stats as stats

# Calculate residuals
residuals = Y - model.predict(X)

# Q-Q plot
stats.probplot(residuals, dist="norm", plot=plt)
plt.title("Q-Q Plot of Residuals")
plt.show()

(b) Shapiro-Wilk Test
- Statistical test for normality.

Interpretation

Null Hypothesis: Residuals follow a normal distribution.
- If p-value < 0.05, reject the null hypothesis, indicating non-normality.

In [None]:
from scipy.stats import shapiro

shapiro_test = shapiro(residuals)
print(f"Shapiro-Wilk Test Statistic: {shapiro_test.statistic}, p-value: {shapiro_test.pvalue}")

(c) Kolmogorov-Smirnov Test
- Another test for normality.

In [None]:
from scipy.stats import kstest

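# Caveat: the mean and standard deviation are estimated from the same residuals,
# so this p-value is only approximate (it tends to be too large)
ks_test = kstest(residuals, 'norm', args=(residuals.mean(), residuals.std()))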
print(f"KS Test Statistic: {ks_test.statistic}, p-value: {ks_test.pvalue}")

(d) Anderson-Darling Test
- Tests for how well data fits a specific distribution.

Compare the statistic to critical values. 
- If the statistic exceeds the critical value for a given significance level, residuals deviate from normality.

In [None]:
from scipy.stats import anderson

anderson_test = anderson(residuals, dist="norm")
print("Anderson-Darling Test Results:")
print(f"Statistic: {anderson_test.statistic}")
print("Critical Values:", anderson_test.critical_values)
print("Significance Levels (%):", anderson_test.significance_level)

**Step 2: Addressing Non-Normal Residuals/ Handling Normality of Errors**

(a) Transform the Response Variable

1. Log Transformation: Use if residuals are right-skewed.

In [None]:
import numpy as np

df["Log_HousePrice"] = np.log(df["HousePrice"])
model_log = sm.OLS(df["Log_HousePrice"], X).fit()
print(model_log.summary())

2. Applying Box-Cox Transformation
- Finds the best transformation parameter ($\lambda$).

In [None]:
from scipy.stats import boxcox

# Apply Box-Cox to the response variable
Y_boxcox, lambda_boxcox = boxcox(Y)
print(f"Optimal lambda for Box-Cox: {lambda_boxcox}")

# Fit the model again
model_boxcox = sm.OLS(Y_boxcox, X).fit()
print(model_boxcox.summary())


3. Square Root Transformation: Helps stabilize variance and normalize data.

In [None]:
df["Sqrt_HousePrice"] = np.sqrt(df["HousePrice"])
model_sqrt = sm.OLS(df["Sqrt_HousePrice"], X).fit()
print(model_sqrt.summary())

(b) Using Robust Regression

If normality cannot be achieved:
- Robust regression minimizes the influence of outliers and non-normal errors.

1. Huber Regression: Combines linear regression with robustness to outliers.

In [None]:
from sklearn.linear_model import HuberRegressor

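# X already contains the "const" column from sm.add_constant; drop it here
# because HuberRegressor fits its own intercept by default
X_no_const = X.drop(columns="const")
huber = HuberRegressor()
huber.fit(X_no_const, Y)
print("Huber Intercept:", huber.intercept_)
print("Huber Coefficients:", huber.coef_)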

2. Quantile Regression: Models conditional medians instead of means.

In [None]:
import statsmodels.api as sm

model_quantile = sm.QuantReg(Y, X).fit(q=0.5)  # Median regression
print(model_quantile.summary())

3. Robust linear model:

In [None]:
from statsmodels.robust.robust_linear_model import RLM

# Fit a robust linear model (Huber's T norm by default)
# Named model_rlm so it does not overwrite the HC3 model_robust fitted earlier
model_rlm = RLM(Y, X).fit()
print(model_rlm.summary())

(c) Bootstrap for Non-Normal Residuals
- Bootstrapping creates confidence intervals without assuming normality.

In [None]:
from sklearn.utils import resample
import numpy as np

# Pairs (case) bootstrap: resample rows of (Y, X) with replacement and refit
bootstrap_samples = 1000
boot_params = []

for _ in range(bootstrap_samples):
    Y_boot, X_boot = resample(Y, X)
    model_boot = sm.OLS(Y_boot, X_boot).fit()
    boot_params.append(model_boot.params)

boot_params = np.array(boot_params)
# Rows: 2.5th and 97.5th percentiles; columns: one per coefficient (same order as X.columns)
print("Bootstrap 95% Confidence Intervals:")
print(np.percentile(boot_params, [2.5, 97.5], axis=0))


**Step 3: Evaluating Adjustments**

Evaluate and Compare Performance
- Residual plots before and after adjustments.
- Normality tests on new residuals.
- Performance metrics like $R^2$, AIC, and BIC.

In [None]:
print(f"Original Model AIC: {model.aic}")
print(f"Log-Transformed Model AIC: {model_log.aic}")
print(f"Box-Cox Model AIC: {model_boxcox.aic}")

Plot Residuals After Adjustments:

In [None]:
# Residuals from Box-Cox model (on the transformed scale)
residuals_boxcox = model_boxcox.resid

plt.hist(residuals_boxcox, bins=20, edgecolor='k', alpha=0.7)
plt.title("Residual Histogram After Box-Cox")
plt.xlabel("Residuals")
plt.ylabel("Frequency")
plt.show()

Testing:
- Use visual methods like histograms and Q-Q plots.
- Statistical tests (Shapiro-Wilk, Anderson-Darling) can flag non-normality.

Correction:
- Start with transformations like log or Box-Cox.
- Use robust regression if transformations fail or residuals deviate significantly.

Validation:
- Reassess residual plots and metrics post-adjustment.
- Ensure the model aligns with assumptions.

# Checking for Outliers in Residuals

Check for outliers amongst the residuals.

#### Plotting Cook's Distance

Cook's distance is a calculation which measures the effect of deleting an observation from the data. 
- Observations with large Cook's distances should be earmarked for closer examination in the analysis due to their disproportionate impact on the model.

**Observation**

Check values with much higher Cook's distances than the rest. 
- A rule of thumb for determining whether a Cook's distance is too large is whether it is greater than four times the mean Cook's distance.

In [None]:
from statsmodels.stats.outliers_influence import OLSInfluence as influence

In [None]:
inf=influence(fitted)

In [None]:
(c, p) = inf.cooks_distance
plt.figure(figsize=(8,5))
plt.title("Cook's distance plot for the residuals",fontsize=16)
plt.stem(np.arange(len(c)), c, markerfmt=",")
plt.grid(True)
plt.show()

#### Calculate the mean Cook's distance

Check which observations have a Cook's distance more than four times the mean.

Implications: Such observations are highly influential in this dataset
- and warrant closer examination.

In [None]:
print('Mean Cook\'s distance: ', c.mean())
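
The four-times-the-mean rule of thumb can be applied directly once the distances are available. The sketch below is self-contained: it computes Cook's distance from the standard OLS formula with NumPy on synthetic data that contains one deliberately planted influential point. The data and the helper name `cooks_distance` are assumptions for this example. With a fitted statsmodels model, the array `c` from the cell above could be used in place of the hand-computed distances.

```python
import numpy as np

def cooks_distance(exog, y):
    """Cook's distance for an OLS fit; `exog` must include the intercept column."""
    n, p = exog.shape
    beta, *_ = np.linalg.lstsq(exog, y, rcond=None)
    residuals = y - exog @ beta
    # Leverage values: diagonal of the hat matrix X (X'X)^-1 X'
    hat_diag = np.einsum("ij,ji->i", exog, np.linalg.pinv(exog))
    mse = np.sum(residuals ** 2) / (n - p)
    return (residuals ** 2 / (p * mse)) * hat_diag / (1 - hat_diag) ** 2

rng = np.random.default_rng(4)
n = 50
x = rng.uniform(0, 10, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)
x[0], y[0] = 25.0, 10.0                      # planted far-out, badly fitting point
exog = np.column_stack([np.ones(n), x])

c = cooks_distance(exog, y)
threshold = 4 * c.mean()
flagged = np.where(c > threshold)[0]

print(f"Mean Cook's distance: {c.mean():.4f}")
print(f"Threshold (4 x mean): {threshold:.4f}")
print(f"Flagged observations: {flagged}")
```

Flagged observations are candidates for closer inspection, not automatic deletion.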

## 3. Logistic Regression

Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).

It is used to describe data and to explain the relationship between one dependent binary variable and one or more 
- nominal, 
- ordinal, 
- interval or 
- ratio-level independent variables.

Logistic regression is a statistical model used for binary classification tasks.
- The outcome variable is categorical with two possible values (e.g., 1/0, Yes/No, Positive/Negative).
- It is used to predict class probabilities in classification problems.

It predicts the probability of an event occurring, transforming the linear combination of predictors through a logistic function (sigmoid function) to ensure the predicted probabilities lie between 0 and 1.

Model Equation: 
$ P(y=1) = \frac{1}{1+e^{-(\beta_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n})}}$

**What It Means:** 
- Logistic regression estimates the probability of a binary outcome (e.g., yes/no, success/failure) based on predictor variables. 
    - It uses a logistic function to map predictions to probabilities between 0 and 1.

- It is a statistical technique for investigating the relationship between a binary dependent variable (outcome) and one or more independent variables (predictors). 

- The goal of logistic regression is to find the best-fitting model to describe the relationship between the dependent variable and the independent variables and then use that model to predict the outcome variable.

**Lay Explanation:**
- Logistic regression is like a yes-or-no decision helper. It estimates the chances of an event happening (e.g., a customer buying a product) based on known factors.
- It tries to find the best-fitting curve for the data.

**Why use Logistic Regression rather than Linear Regression?**

Outlier Influence:
- A single extreme observation can shift the best-fit line in linear regression toward that point, which changes the predictions for every other observation.

Predicted outcome out of range:
- In linear regression, the predicted values can fall below 0 or above 1, so they cannot be read as probabilities.

Response Variable:
- Linear regression is used when the dependent variable is continuous.
- Logistic regression is used when the dependent variable is binary.

Logistic regression is ideal for this problem because:
- Binary Outcome: The target variable is binary: Readmitted (1) or Not Readmitted (0).
- Interpretability: It provides coefficients (log odds) that indicate how changes in predictors affect the likelihood of the event (readmission).
- Insights: It helps identify the significant factors influencing readmissions.

### Outcome Interpretation: 
- The model outputs probabilities that can be converted to binary outcomes. 
- Coefficients show how each predictor variable influences the likelihood of the outcome.

### Performance Measures:
- Accuracy: Proportion of correct predictions.
- AUC-ROC: Measures the model's ability to distinguish between classes; values closer to 1 indicate a better model.

### Types of Logistic Regression

#### Binary Logistic Regression
Binary logistic regression is used to predict the probability of a binary outcome, such as 
- yes or no, 
- true or false, or 
- 0 or 1. 

For example, it could be used to:
- predict whether a customer will churn or not, 
- predict whether a patient has a disease or not, or 
- predict whether a loan will be repaid or not.

#### Multinomial Logistic Regression
Multinomial logistic regression is used to predict the probability of one of three or more possible outcomes, such as 
- the type of product a customer will buy, 
- the rating a customer will give a product, or 
- the political party a person will vote for.

#### Ordinal Logistic Regression
Used to predict the probability of an outcome that falls into a predetermined order, such as 
- the level of customer satisfaction, 
- the severity of a disease, or 
- the stage of cancer.

#### Least-Squares Regression 
Least-squares regression is a foundational method in statistics and data science for modeling relationships between variables, particularly for continuous dependent variables. 
- It finds the line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences (residuals) between the observed and predicted values of the dependent variable.

Application:
- Continuous Outcomes: Least-squares regression is most commonly used for problems where the dependent variable is continuous, such as 
    - predicting house prices, 
    - stock prices, or 
    - blood pressure.
- Exploratory Analysis: Identifying potential relationships between variables.

##### Drawback of least-squares regression
When applied to classification tasks, which logistic regression is designed for, least-squares regression assumes a linear relationship between the predictors and the outcome, which leads to problems when modeling binary or categorical outcomes.

Key Issues:

1. Inappropriate Predictions
- Least-squares regression is designed for continuous outcomes and does not restrict predictions to the range [0, 1], which is required for probabilities in classification problems.
- For binary classification, it can result in predictions outside the valid probability range, such as negative values or values greater than 1, which are meaningless.

2. Violation of Assumptions
- The error terms (residuals) in least-squares regression are assumed to be normally distributed and homoscedastic (constant variance). 
    - However, in classification problems, these assumptions are violated because:
        - The dependent variable is not continuous but binary.
        - The variance of the binary response variable is a function of the mean (heteroscedasticity), not constant.

3. Inefficient Parameter Estimation (relationship between the predictors and the binary outcome)
- In classification tasks, the relationship between the predictors and the probability of the binary outcome is often non-linear (sigmoid-shaped in logistic regression). 
- Linear least squares does not model this non-linear relationship correctly.
    - As a result, least squares is inefficient in estimating parameters and may lead to biased coefficients.

4. Poor Performance for Separation
- Least-squares regression does not inherently maximize the separation between the two classes in binary classification problems. 
- Logistic regression, on the other hand, maximizes the likelihood of the observed data, providing a more suitable objective for classification tasks.

5. Susceptibility to Outliers
- Least-squares regression is sensitive to outliers, as it minimizes the squared residuals. 
- In a classification context, outliers in the feature space can have a disproportionately large influence on the model, leading to poor generalization.
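The out-of-range problem from point 1 is easy to reproduce. A small sketch, using made-up data and NumPy's `np.polyfit` as a stand-in for any least-squares straight-line fit:

```python
import numpy as np

# Made-up binary outcome that switches from 0 to 1 as x grows
x = np.arange(10, dtype=float)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype=float)

# Ordinary least-squares straight line: y ~ intercept + slope * x
slope, intercept = np.polyfit(x, y, deg=1)
linear_pred = intercept + slope * x

print("Smallest prediction:", linear_pred.min())
print("Largest prediction:", linear_pred.max())
```

At the extremes of x the fitted line drops below 0 and rises above 1, so the fitted values cannot be read as probabilities.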

##### Why Logistic Regression Instead of Least Squares?
Logistic regression overcomes these drawbacks by:
- Modeling the probability of the binary outcome using the logit function (log-odds), ensuring probabilities stay within [0, 1].
- Using maximum likelihood estimation (MLE) to fit the model, which aligns with the probabilistic nature of classification problems.
- Making no assumptions about normally distributed errors, as it focuses on the Bernoulli distribution of binary outcomes.


##### Differences Between Linear and Logistic Regression
The core difference lies in their target predictions.
- Linear regression excels at predicting continuous values along a spectrum. 
    - The resulting output is a specific amount, a continuous value on the amount scale.
- Linear regression answers "how much" questions, providing a specific value on a continuous scale.

- Logistic regression deals with categories. 
    - It doesn't predict a specific value but rather the likelihood of something belonging to a particular class.
    - For an email spam filter, the output would be a probability between 0 (not likely spam) and 1 (very likely spam). 
    - This probability is then used to assign an email to a definitive category (spam or not spam) based on a chosen threshold.
- Logistic regression tackles "yes or no" scenarios, giving the probability of something belonging to a certain category.

### Problem Statement
Objective:
- The medical institute wants to identify the likelihood of patients being readmitted within 30 days of discharge based on patient 
    - demographics, 
    - medical history, 
    - length of stay (LOS), and 
    - clinical metrics such as blood pressure, blood glucose levels, and medication adherence.

**Key Assumptions of Logistic Regression**

Data Specific
- Binary Outcome: The dependent variable is binary.
    - Logistic regression is designed for binary dependent variables. 
    - If your outcome has more than two categories, you might need a multinomial logistic regression or other classification techniques.
- Independence of Observations: Observations are independent of each other.
    -  This means no repeated measurements or clustering within the data.

Relationship Between Variables
- Linearity of Log-Odds: There is a linear relationship between the log-odds of the outcome and the independent variables.
    - The probability of the outcome is related to the predictors through the log-odds.
    - The probability itself does not have a linear relationship with the independent variables.
- No Multicollinearity: Independent variables are not highly correlated.
    - Multicollinearity can cause instability in the model and make it difficult to interpret the coefficients.

Other
- Large Sample Size: Logistic regression performs well with larger datasets.
    - A large sample is needed to ensure reliable parameter estimates.
- Absence of Outliers: Outliers can significantly influence the model. 
    - It's important to check for and address any outliers that might distort the results.
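One common way to screen for the multicollinearity assumption is the variance inflation factor (VIF). The sketch below is illustrative only: the data are synthetic, and it relies on the identity that the VIFs are the diagonal of the inverse of the predictors' correlation matrix. A frequently quoted but informal guideline treats values above roughly 5 to 10 as a warning sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic predictors: x2 is almost a copy of x1, x3 is unrelated
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# VIF_j = 1 / (1 - R_j^2), which equals the j-th diagonal entry
# of the inverse correlation matrix of the predictors
corr = np.corrcoef(X, rowvar=False)
vif = np.diag(np.linalg.inv(corr))

for name, value in zip(["x1", "x2", "x3"], vif):
    print(f"VIF({name}) = {value:.1f}")
```

The two near-duplicate columns receive very large VIFs, while the unrelated column stays close to 1.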

**Step 1: Define the Problem**
- Target Variable: Readmission within 30 days (1 = Yes, 0 = No).
- Predictors:
    - Patient Demographics: Age, gender, insurance status.
    - Clinical Metrics: Blood glucose levels, blood pressure, medication adherence.
    - Hospital Metrics: Length of Stay (LOS), number of previous visits.

**Step 2: Collect and Prepare Data**
- Gather historical patient data and ensure it's clean and consistent.
    - Check for Missing Data:
        - Impute missing values for predictors like glucose levels using median or mean.
    - Standardize Continuous Variables:
        - Standardize LOS, glucose levels, and blood pressure for consistency.
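The two preparation steps can be sketched with NumPy alone. The glucose readings below are made up, and median imputation followed by z-score standardization is just one reasonable choice:

```python
import numpy as np

# Made-up glucose readings with two missing values
glucose = np.array([150.0, np.nan, 180.0, 140.0, np.nan, 220.0])

# Impute missing entries with the median of the observed values
median = np.nanmedian(glucose)
glucose_imputed = np.where(np.isnan(glucose), median, glucose)

# Standardize to mean 0 and standard deviation 1 (z-scores)
glucose_std = (glucose_imputed - glucose_imputed.mean()) / glucose_imputed.std()

print("Imputed:", glucose_imputed)
print("Standardized:", np.round(glucose_std, 2))
```

In a real workflow the median, mean, and standard deviation would normally be computed on the training data only and then reused on the test data.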

In [None]:
import pandas as pd
import statsmodels.api as sm

# Example dataset
data = pd.DataFrame({
    'age': [45, 60, 50, 40, 70],
    'los': [3, 7, 4, 2, 10],
    'glucose': [150, 200, 180, 140, 220],
    'med_adherence': [0.8, 0.6, 0.75, 0.9, 0.5],
    'readmitted': [1, 1, 0, 0, 1]
})

# Features and target
X = data[['age', 'los', 'glucose', 'med_adherence']]
y = data['readmitted']

# Add constant for intercept
X = sm.add_constant(X)

**Step 3: Exploratory Data Analysis**
- Univariate Analysis: Examine distributions of continuous variables.
- Bivariate Analysis: Analyze relationships between predictors and the target variable.
- Correlation Matrix: Identify multicollinearity among predictors.

**Step 4: Perform Logistic Regression**

How logistic regression squeezes the output of linear regression between 0 and 1.

Best Fit Equation in Linear Regression

$ y = \beta_{0}+\beta_{1}x_{1}$

Now we want to predict probabilities (P) instead of y.

**Issue**: 
A straight line is unbounded, so the value of (P) can exceed 1 or go below 0, while a probability must lie in the range (0-1).

Odds and log-odds are central to understanding the relationship between predictors and the probability of an event occurring.

**Overcome the issue of keeping $0 < P < 1$**

by taking the "odds" of P:

Odds: The odds represent the ratio of the probability of an event occurring (P) to the probability of it not occurring (1-P).

$$ \text{Odds} =  \frac{P}{1-P}$$

Log-Odds (Logit): The natural logarithm of the odds.

$$ \text{Log-Odds} =  \log\left(\frac{P}{1-P}\right)$$

In logistic regression, the log-odds are ultimately modeled as a linear function of the predictors. The derivation starts by replacing $y$ with $P$, and then $P$ with the odds:

$$ P = \beta_{0}+\beta_{1}x_{1}$$
$$ \frac{P}{1-P} = \beta_{0}+\beta_{1}x_{1}$$

Odds are never negative, which means their range is always $[0,+\infty)$.
- Odds are the ratio of the probability of success and probability of failure.

Why "odds"?
- Odds are probably the easiest way to remove the upper bound of 1 on P.

Problem: the range of the odds is still restricted, and we don't want a restricted range because it tends to lower the correlation.
- By restricting the range we are effectively decreasing the number of data points, and if we decrease our data points, our correlation will decrease.
- This makes it difficult to model a variable that has a restricted range.

Control:
- To control this, we take the log of the odds, which has a range of $(-\infty,+\infty)$.

$ \log\left(\frac{P}{1-P}\right) = \beta_{0}+\beta_{1}x_{1}$

Now we just want a function of P because we want to predict probability, not log of odds. To do so we will 
- exponentiate both sides and then solve for P.

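$ \exp\left[\log\left(\frac{P}{1-P}\right)\right] = \exp(\beta_{0}+\beta_{1}x_{1})$

$ e^{\ln\left(\frac{P}{1-P}\right)} = e^{(\beta_{0}+\beta_{1}x_{1})} $

$ \frac{P}{1-P} = e^{(\beta_{0}+\beta_{1}x_{1})} $

$ P = e^{(\beta_{0}+\beta_{1}x_{1})}  - P\,e^{(\beta_{0}+\beta_{1}x_{1})}$

$ P\left[1 + e^{(\beta_{0}+\beta_{1}x_{1})}\right] = e^{(\beta_{0}+\beta_{1}x_{1})}$

$ P = \frac{e^{(\beta_{0}+\beta_{1}x_{1})}}{1 + e^{(\beta_{0}+\beta_{1}x_{1})}} = \frac{1}{1 + e^{-(\beta_{0}+\beta_{1}x_{1})}}$

Now we have the sigmoid function.

Model Equation: 
$ P(y=1) = \frac{1}{1+e^{-(\beta_{0}+\beta_{1}x_{1}+\dots+\beta_{n}x_{n})}}$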

It squeezes a straight line into an S-curve.

In [None]:
import numpy as np

# Function to calculate the sigmoid function
def sigmoid(z):
    """
    The sigmoid function maps log-odds to probabilities between 0 and 1.
    """
    return 1 / (1 + np.exp(-z))

# Function to calculate odds and log-odds
def logistic_regression_predict(X, coefficients):
    """
    Predict probabilities, odds, and log-odds using logistic regression.
    
    Parameters:
    - X: Feature matrix (numpy array of shape [n_samples, n_features])
    - coefficients: Coefficients including intercept (numpy array of shape [n_features + 1])
    
    Returns:
    - probabilities: Predicted probabilities (numpy array of shape [n_samples])
    - odds: Odds of event occurring (numpy array of shape [n_samples])
    - log_odds: Log-Odds (numpy array of shape [n_samples])
    """
    # Add intercept to the feature matrix
    X = np.hstack((np.ones((X.shape[0], 1)), X))  # Add a column of ones for the intercept
    
    # Calculate log-odds (z = X * coefficients)
    log_odds = np.dot(X, coefficients)
    
    # Calculate probabilities using the sigmoid function
    probabilities = sigmoid(log_odds)
    
    # Calculate odds
    #  Derived from probabilities using the formula
    odds = probabilities / (1 - probabilities)
    
    return probabilities, odds, log_odds

# Example usage
# Example dataset: X contains two features, and coefficients include intercept and weights
X = np.array([[2, 3], [1, 0], [4, 5]])  # Feature matrix
coefficients = np.array([-3, 0.5, 1])  # Coefficients (intercept + weights for features)

# Predict probabilities, odds, and log-odds
probabilities, odds, log_odds = logistic_regression_predict(X, coefficients)

# Predicted Probabilities: Likelihood of the event occurring.
# Odds: Ratio of the probability of success to failure.
# Log-Odds: Linear transformation of the predictors.

# Print results
print("Predicted Probabilities:", probabilities)
print("Odds:", odds)
print("Log-Odds:", log_odds)

**log-odds linear function**

The log-odds linear function is a core concept in logistic regression and represents the relationship between the independent variables (predictors) and the log-odds of the dependent variable (outcome).

$$ \log\left(\frac{P}{1-P}\right) = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}$$

Where:
- $\beta_{0}$: Intercept (bias term).
- $\beta_{1}, \beta_{2},..., \beta_{p}$: Coefficients of the predictors $x_{1}, x_{2},..., x_{p}$
- $x_{1}, x_{2},..., x_{p}$: Values of the independent variables.
- $P$: Predicted probability of the event occurring.

Steps to Calculate Log-Odds
1. Start with the linear combination: Compute a weighted sum of the predictors and the intercept:

$$ z = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}$$

2. Interpret z as the log-odds: The value z is the log-odds, which can be converted to:
- Odds using: 
$$ \text{odds} = e^z $$
- Probability using the sigmoid function:
$$ P= \frac{1}{1+e^{-z}}$$

In [None]:
import numpy as np

# Function to calculate log-odds
def calculate_log_odds(intercept, coefficients, predictors):
    """
    Calculate log-odds for logistic regression.

    Parameters:
    - intercept: Intercept term (beta_0)
    - coefficients: Coefficients for the predictors (list or array)
    - predictors: Values of the predictors (list or array)

    Returns:
    - log_odds: Computed log-odds
    """
    # Ensure inputs are numpy arrays
    coefficients = np.array(coefficients)
    predictors = np.array(predictors)
    
    # Compute log-odds
    log_odds = intercept + np.dot(coefficients, predictors)
    return log_odds

# Example inputs
intercept = -2
coefficients = [0.8, -1.2]  # Beta coefficients
predictors = [3, 5]         # Predictor values (x_1, x_2)

# Calculate log-odds
log_odds = calculate_log_odds(intercept, coefficients, predictors)
print("Log-Odds:", log_odds)


**Calculate class probabilities in logistic regression**

The logistic (sigmoid) function is used to transform the log-odds into probabilities. 
- The logistic function ensures the probabilities range between 0 and 1, making it suitable for classification problems.

Logistic Function for Probability

$$ P= \frac{1}{1+e^{-z}}$$

Where:
- P: Probability of the positive class (class 1).
- z: Log-odds, calculated as:

$$ z = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p}$$

- z is the weighted sum of the predictors and the intercept.

The logistic function outputs:
- P: Probability of the positive class (class 1).
- 1-P: Probability of the negative class (class 0).

Steps to Calculate Class Probability
1. Calculate Log-Odds (z): Compute the linear combination of the intercept ($\beta_0$) and the predictor variables.
2. Apply the Logistic Function: 
Use the formula: 
$$ P= \frac{1}{1+e^{-z}}$$

3. Interpret the Result:
- If $P \geq 0.5$, classify the observation as the positive class (class 1).
- If $P < 0.5$, classify the observation as the negative class (class 0).

In [None]:
import numpy as np

# Sigmoid (logistic) function
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Function to calculate probability
def calculate_probability(intercept, coefficients, predictors):
    """
    Calculate class probability using the logistic function.

    Parameters:
    - intercept: Intercept term (beta_0)
    - coefficients: Coefficients for predictors (list or array)
    - predictors: Values of predictors (list or array)

    Returns:
    - probability: Probability of the positive class (class 1)
    """
    # Ensure inputs are numpy arrays
    coefficients = np.array(coefficients)
    predictors = np.array(predictors)
    
    # Calculate log-odds
    log_odds = intercept + np.dot(coefficients, predictors)
    
    # Apply sigmoid function to get probability
    probability = sigmoid(log_odds)
    return probability

# Example inputs
intercept = -2
coefficients = [0.8, -1.2]  # Beta coefficients
predictors = [3, 5]         # Predictor values (x_1, x_2)

# Calculate class probability
probability = calculate_probability(intercept, coefficients, predictors)
print("Class Probability (P for class 1):", probability)
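As a worked check of the example inputs used in the code above (intercept $-2$, coefficients $0.8$ and $-1.2$, predictor values $3$ and $5$), the arithmetic goes as follows, with the results rounded to four decimal places:

$$ z = -2 + 0.8(3) + (-1.2)(5) = -2 + 2.4 - 6 = -5.6 $$
$$ \text{odds} = e^{-5.6} \approx 0.0037 $$
$$ P = \frac{1}{1+e^{5.6}} \approx 0.0037 $$

Since $P < 0.5$, this observation would be assigned to the negative class (class 0) under the usual 0.5 threshold.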


### Decision boundary in logistic regression

The decision boundary in logistic regression is the threshold at which the model predicts one class over the other. It represents the dividing line (or surface in higher dimensions) between the predicted classes in the feature space.

Key Points about Decision Boundary in Logistic Regression
1. Sigmoid Function and Threshold:
- Logistic regression uses the sigmoid function to output probabilities between 0 and 1.
- A commonly used threshold is 0.5:
    - If $P \geq 0.5$, classify as class 1 (positive class).
    - If $P < 0.5$, classify as class 0 (negative class).

2. Log-Odds and Decision Boundary:
- The decision boundary corresponds to where the log-odds (z) equals zero.
- At z=0:

$$ P= \frac{1}{1+e^{-z}}$$
$$ P= \frac{1}{1+e^0}$$
$$ P= \frac{1}{2}$$
$$ P= 0.5$$

- Thus, the decision boundary is the set of points where z=0, or equivalently:

$$ z = \beta_{0}+\beta_{1}x_{1}+\beta_{2}x_{2}+\dots+\beta_{p}x_{p} = 0$$

3. Geometric Interpretation:
- With two predictors (2D feature space): The decision boundary is a line.
- With three predictors (3D feature space): The decision boundary is a plane.
- In higher dimensions: The decision boundary is a hyperplane.

4. Linear Nature of Decision Boundary:
- Logistic regression assumes a linear relationship between the predictors and the log-odds.
- The decision boundary is linear unless the model is extended with non-linear transformations of the predictors (e.g., polynomial features).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Coefficients
beta_0 = -2      # Intercept
beta_1 = 0.8     # Coefficient for x1
beta_2 = -1.2    # Coefficient for x2

# Generate a range of x1 values
x1 = np.linspace(-10, 10, 100)

# Calculate x2 for the decision boundary
x2 = (-beta_0 - beta_1 * x1) / beta_2

# Plot the decision boundary
plt.figure(figsize=(8, 6))
plt.plot(x1, x2, label="Decision Boundary", color="red")

# Add some illustrative random points for class 0 and class 1
# (centers chosen so each cloud falls on its own side of the boundary)
np.random.seed(42)
class_0 = np.random.multivariate_normal([-3, 3], [[2, 1], [1, 2]], size=50)
class_1 = np.random.multivariate_normal([3, -3], [[2, 1], [1, 2]], size=50)

plt.scatter(class_0[:, 0], class_0[:, 1], label="Class 0", color="blue", alpha=0.7)
plt.scatter(class_1[:, 0], class_1[:, 1], label="Class 1", color="green", alpha=0.7)

# Formatting
plt.axhline(0, color="black", linewidth=0.5, linestyle="--")
plt.axvline(0, color="black", linewidth=0.5, linestyle="--")
plt.title("Decision Boundary of Logistic Regression")
plt.xlabel("x1")
plt.ylabel("x2")
plt.legend()
plt.grid()
plt.show()
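Because the sigmoid crosses 0.5 exactly at z = 0, thresholding the probability at 0.5 is equivalent to checking the sign of z. A short sketch using the same illustrative coefficients as the plot above; the three test points are hypothetical:

```python
import numpy as np

beta_0, beta_1, beta_2 = -2, 0.8, -1.2   # same illustrative coefficients

# Hypothetical points (x1, x2) on either side of the boundary
points = np.array([[3.0, -3.0], [0.0, 0.0], [-3.0, 3.0]])

z = beta_0 + beta_1 * points[:, 0] + beta_2 * points[:, 1]
p = 1 / (1 + np.exp(-z))

# Both rules give the same labels
labels_from_p = (p >= 0.5).astype(int)
labels_from_z = (z >= 0).astype(int)

print("z:", z)
print("P:", p)
print("Labels:", labels_from_p)
```

Choosing a threshold other than 0.5 shifts the boundary to a different constant value of z but keeps it linear.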


### Key properties of the logistic regression equation

Explain the logistic regression model

Sigmoid Function:
- Logistic regression uses a special "S" shaped curve to predict probabilities. It ensures that the predicted probabilities stay between 0 and 1.

Straightforward Relationship:
- The relationship between our inputs and the outcome is like drawing a straight line, except that the line is bent into a curve: it is straight on the log-odds scale and S-shaped on the probability scale.

Coefficients / parameters:
- These are numbers that tell us how much each input affects the outcome in the logistic regression model.
- Each coefficient tells us how much the log-odds of the outcome change for every one-unit increase in the predictor variable.

Best Guess: 
- Figure out the best coefficients for the logistic regression model by looking at the data we have and tweaking them until our predictions match the real outcomes as closely as possible.

Basic Assumptions:
- We assume that our observations are independent, meaning one doesn't affect the other. 
- We assume that there's not too much overlap between our predictors (like age and height).
- We assume the relationship between our predictors and the log-odds of the outcome is kind of like a straight line.

Probabilities, Not Certainties:
- Logistic regression gives us probabilities.
- We then decide on a cutoff point to make our final decision.

Checking Our Work:
- We check that our predictions are good using measures like 
    - accuracy, 
    - precision, 
    - recall,
    - ROC curve.

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score, roc_curve

data = pd.read_csv('data.csv') # read data from csv file
X = data[['Independent_Var_1', 'Independent_Var_2', 'Independent_Var_3']] # select independent variables
Y = data['Dependent_Var'] # select dependent variable

# Add a constant to the independent variable set
X = sm.add_constant(X)

# Fit the logistic regression model
model = sm.Logit(Y, X).fit()

# Print model summary
print(model.summary())

In [None]:
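# Minimal scikit-learn usage; assumes X_train, X_test, y_train, y_test come from a train/test split
from sklearn.linear_model import LogisticRegression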
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

In [None]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression

# Load the data
data = pd.read_csv('data.csv')

# Split the data into training and testing sets
train = data[:800]
test = data[800:]

# Define the independent variables
X_train = train[['age', 'gender', 'income']]
X_test = test[['age', 'gender', 'income']]

# Define the dependent variable
y_train = train['buy_product']
y_test = test['buy_product']

# Fit the logistic regression model
logreg = LogisticRegression()
logreg.fit(X_train, y_train)

# Predict the outcomes for the test data
y_pred = logreg.predict(X_test)

# Evaluate the performance of the model
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy}")

**Step 5: Interpret Coefficients and Evaluate the Model**

- Log Odds: Each coefficient represents the change in log odds of readmission for a unit increase in the predictor.
- Odds Ratios: Use np.exp(model.params) to convert coefficients to odds ratios.
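The conversion from log-odds coefficients to odds ratios can be sketched without a fitted model. The coefficient names and values below are hypothetical and are not output from the readmission data:

```python
import numpy as np

# Hypothetical fitted coefficients on the log-odds scale
coef_names = ["los", "med_adherence"]
coefs = np.array([0.693, -1.2])

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratios = np.exp(coefs)

for name, ratio in zip(coef_names, odds_ratios):
    print(f"{name}: odds ratio = {ratio:.2f}")
```

An odds ratio of about 2 for `los` would mean each additional day roughly doubles the odds of readmission, while a ratio below 1 means the odds fall as the predictor rises.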

1. Accuracy
2. Confusion Matrix
3. ROC Curve and AUC

**Step 6: Optimisation**

### Cost Function in Logistic Regression

Linear regression uses the mean squared error (MSE), the average of the squared differences between y_predicted and y_actual, as its cost function.
- This is derived from the **maximum likelihood estimator**.

In logistic regression, $\hat{Y}_i$ is a non-linear function of the inputs ($\hat{Y} = \frac{1}{1+ e^{-z}}$).
- If we use this in the MSE equation then it will give a non-convex graph with many local minima.

Problem: the MSE cost function will give results with local minima.
- Gradient descent may end up missing the global minimum, and our error will increase.

Solution: derive a different cost function for logistic regression.
- **Log loss**, which is also derived from the **maximum likelihood estimation method**.

$ \text{Log Loss} = \frac{1}{N} \sum^{N}_{i = 1} - \left( y_i \log(\hat{Y}_i) + (1 - y_i) \log (1 - \hat{Y}_i)\right)$
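The log loss formula translates almost line for line into NumPy. In this sketch the helper name `log_loss`, the `eps` clipping value, and the example probabilities are all assumptions made for illustration:

```python
import numpy as np

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip so that log(0) is never evaluated
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

y_true = [1, 0, 1, 1]
good = [0.9, 0.1, 0.8, 0.7]   # confident and mostly right
bad = [0.2, 0.9, 0.3, 0.4]    # confident and mostly wrong

print("Loss (good predictions):", log_loss(y_true, good))
print("Loss (bad predictions):", log_loss(y_true, bad))
```

Predictions that put high probability on the true class get a small loss, and confident mistakes are penalized heavily.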

#### Maximum likelihood estimator

Primary Objective:
- The goal is to identify parameter values that maximize the likelihood function.
- The likelihood represents the joint probability density function (pdf) of our sample observations.
- It involves multiplying the conditional probabilities for observing each example given the distribution parameters.
- This process aims to discover parameter values such that, when plugged into the model for P(x), it produces a value close to one for individuals who experienced the outcome and close to zero for those who did not.

Start by defining our likelihood function. 
- We now know that the labels are binary
- we have two outcomes success and failure. 
- This means we can interpret each label as Bernoulli random variable.

**Random experiment** whose outcomes are of two types, success S and failure F, occurring with probabilities p and q respectively is called a Bernoulli trial. If for this experiment a random variable X is defined such that it takes value 1 when S occurs and 0 if F occurs, then X follows a Bernoulli Distribution.

#### Math behind this log loss function

$ Y \sim \mathrm{Ber}(P)$

Where P is our sigmoid function

$ P[Y=y | X=x] = \sigma ( \theta^{T} x^i)^y (1 - \sigma(\theta^{T} x^i))^{1-y} $

where $\sigma(\theta^{T} x^i)$ is the sigmoid function. Now for n observations

$ L(\theta) = \prod^{n}_{i=1} \sigma ( \theta^{T} x^i)^{y^i} (1 - \sigma(\theta^{T} x^i))^{1-y^i} $

We need a value for theta which will maximize this likelihood function. 

To make our calculations easier
- we take the log of both sides. 

The function we get is also called the 
- log-likelihood function or 
- sum of the log conditional probability

$ \log(L(\theta)) = \sum^{n}_{i=1} \left[ y^i \log(\sigma ( \theta^{T} x^i)) + (1-y^i) \log(1 - \sigma(\theta^{T} x^i)) \right] $

In ML, it is conventional to minimize a loss (error) function via gradient descent, rather than maximize an objective function via gradient ascent. 
- If we maximize the above function we'll have to deal with gradient ascent; to avoid this we take the negative of the log-likelihood so that we can use gradient descent.

$ \arg\max_{\theta}[\log(L(\theta))] = \arg\min_{\theta}[-\log(L(\theta))] $

The negative of this function is our cost function, and what we want from a cost function is that it reaches a minimum value. 
It is common practice to minimize a cost function for optimization problems; therefore, we can invert the function so that we minimize the negative log-likelihood (NLL).

$ - \log(L(\theta)) =  -\sum^{n}_{i=1} \left[ y^i \log(\sigma ( \theta^{T} x^i)) + (1-y^i) \log(1 - \sigma(\theta^{T} x^i)) \right] $

where 
- $y^i$ represents the actual class of observation $i$,
- $\sigma(\theta^{T} x^i)$ is the predicted probability of class 1, so $\log(\sigma(\theta^{T} x^i))$ is the log-probability of that class, and
- $1 - \sigma(\theta^{T} x^i)$ is the predicted probability of class 0.

Plot the cost function for y=1 and y=0.
- The result is a convex graph with only one minimum, which makes it easy to use gradient descent.
    - The red line represents class 1 (y=1): the right term of the cost function vanishes. If the predicted probability is close to 1 then the loss is small, and as the probability approaches 0 the loss approaches infinity.
    - The black line represents class 0 (y=0): the left term of the cost function vanishes. If the predicted probability is close to 0 then the loss is small, but as the probability approaches 1 the loss approaches infinity.

$ Cost(h_{\Theta}(x),y) = \left\{ \begin{array}{ll} - \log(h_{\Theta}(x)) & \text{if } y = 1\\ - \log(1 - h_{\Theta}(x)) & \text{if } y = 0 \end{array}\right.$

Cost function is also called **log loss**

It also ensures that as the
- probability of the correct answer is maximized, 
- probability of the incorrect answer is minimized. 
    - The lower the value of this cost function, the higher the accuracy will generally be.

### Gradient Descent Optimization

How to use Gradient Descent to compute the minimum cost.

- Gradient descent changes the value of our weights in such a way that it converges toward the minimum point (given a suitable learning rate).
    - It aims at finding the optimal weights which minimize the loss function of our model.
- Gradient descent is an iterative method that finds the minimum of a function by figuring out the slope at a random point and then moving in the opposite direction.

At first 
- gradient descent takes a random value of our parameters from our function. 
- We need an algorithm that will tell us whether at the next iteration we should move left or right to reach the minimum point.
    - The gradient descent algorithm finds the slope of the loss function at that particular point.

In the next iteration, 
- it moves in the opposite direction of the slope to reach the minimum.

Since we have a convex graph now we don't need to worry about local minima. 
- A convex curve has only one minimum.

Gradient descent algorithm

$ \theta_{new} = \theta_{old} - \alpha \frac{\partial J(\theta)}{\partial \theta_j} $

where alpha is known as the learning rate. 
- It determines the step size at each iteration while moving towards the minimum point. 
    - A lower value of "alpha" is usually preferred, because if the learning rate is too large we may overshoot the minimum point and keep oscillating around it on the convex curve.
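The update rule can be sketched for logistic regression in a few lines of NumPy. This is an illustrative implementation under stated assumptions, not a production optimizer: the helper name `fit_logistic_gd`, the learning rate, and the iteration count are arbitrary choices, the weights start at zero rather than at a random point, the data are synthetic, and the gradient of the mean log loss is taken to be the standard result $X^\top(p - y)/n$.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def fit_logistic_gd(X, y, alpha=0.1, n_iter=2000):
    """Batch gradient descent on the mean log loss (hypothetical helper)."""
    X = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    theta = np.zeros(X.shape[1])               # start from all-zero weights
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        gradient = X.T @ (p - y) / len(y)      # gradient of the mean log loss
        theta = theta - alpha * gradient       # step against the slope
    return theta

# Synthetic data generated from known coefficients (intercept -1, slope 2)
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
y = (rng.random(500) < sigmoid(-1 + 2 * x[:, 0])).astype(float)

theta = fit_logistic_gd(x, y)
print("Estimated [intercept, slope]:", theta)
```

With enough iterations the estimates should land near the generating values, up to sampling noise; raising `alpha` too far makes the updates overshoot.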

#### Derivation of Cost Function
Differentiate this cost function w.r.t. our parameters. This first requires the derivative of the sigmoid function:

$\frac{d\sigma(x)}{dx} = \frac{d}{dx} \left( \frac{1}{1+e^{-x}} \right) = \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} $

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \frac{d}{dx} \left(1 + e^{-x}\right)$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[ 0 + \frac{d}{dx} \left(e^{-x}\right) \right]$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[e^{-x} \times \frac{d}{dx}(-x) \right]$

$\Rightarrow -\left(1 + e^{-x}\right)^{-2} \times \left[e^{-x} \times (-1) \right]$

$\Rightarrow e^{-x} \left(1 + e^{-x}\right)^{-2}$

$\Rightarrow \frac{e^{-x}}{(1+e^{-x})^2} = \frac{e^{-x} + 1 - 1}{(1+e^{-x})(1+e^{-x})}$

$\Rightarrow \frac{(1+e^{-x}) - 1}{(1+e^{-x})(1+e^{-x})} = \frac{1}{(1+e^{-x})} \left[ \frac{(1+e^{-x})}{(1+e^{-x})} - \frac{1}{(1+e^{-x})} \right]$

$\Rightarrow \frac{1}{(1+e^{-x})} \left[ 1 - \frac{1}{(1+e^{-x})} \right] = \sigma(x)\left[1 - \sigma(x)\right]$
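The result of this derivation, the sigmoid multiplied by one minus the sigmoid, can be sanity-checked numerically. A small sketch comparing it with a central finite-difference approximation; the grid of x values and the step size `h` are arbitrary choices:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.linspace(-5, 5, 11)
h = 1e-6

# Central finite-difference approximation of the derivative
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)

# Closed form obtained in the derivation
analytic = sigmoid(x) * (1 - sigmoid(x))

print("Max absolute difference:", np.max(np.abs(numeric - analytic)))
```

The two agree up to floating-point and truncation error, and the derivative peaks at 0.25 when x = 0.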

Derive the cost function with the help of the chain rule as it allows us to calculate complex partial derivatives by breaking them down.

**Step-1: Use chain rule and break the partial derivative of log-likelihood**

$-\frac{\partial LL(\theta)}{\partial \theta_j} = -\frac{\partial LL(\theta)}{\partial p} \cdot \frac{\partial p}{\partial \theta_j} \quad$
$\text{where } p= \sigma\left[\theta^\top x\right]$

$= -\frac{\partial LL(\theta)}{\partial p} \cdot \frac{\partial p}{\partial z} \cdot \frac{\partial z}{\partial \theta_j} \quad $
$\text{where } z =\theta^\top x$

**Step-2: Find derivative of log-likelihood w.r.t p**

We know,

$LL(\theta) = y \log(p) + (1-y)\log(1-p) \quad \text{where } p = \sigma\left[\theta^\top x\right]$

$\frac{\partial LL(\theta)}{\partial p} = \frac{y}{p} - \frac{(1-y)}{(1-p)}$

**Step-3: Find derivative of 'p' w.r.t 'z'**

$ p= \sigma(z)$

$\frac{\partial p}{\partial z} = \frac{\partial[ \sigma (z)]}{\partial z}$

We know the derivative of sigmoid function is $\sigma[\theta^\top x][1 - \sigma(\theta^\top x)]$

$\Rightarrow \frac{\partial p}{\partial z} =  \sigma [z][1 - \sigma(z)]$

**Step-4: Find derivative of z w.r.t θ**

$ z=\theta^\top x$

$\frac{\partial z}{\partial \theta_j} = x_j$

**Step-5: Put all the derivatives in equation 1**
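
Substituting the results of Steps 2 to 4 into the chain-rule expression from Step-1 gives the following. This is a sketch of the substitution that uses the same symbols as above, with the derivative of the log-likelihood w.r.t. $p$ taken as $\frac{y}{p} - \frac{(1-y)}{(1-p)}$:

$-\frac{\partial LL(\theta)}{\partial \theta_j} = -\left[\frac{y}{p} - \frac{(1-y)}{(1-p)}\right] \cdot p(1-p) \cdot x_j$

$\Rightarrow -\left[ y(1-p) - (1-y)p \right] x_j$

$\Rightarrow -(y - p)\, x_j = (p - y)\, x_j \quad \text{where } p = \sigma\left[\theta^\top x\right]$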


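In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, roc_curve, roc_auc_score

# Classify as positive when the predicted probability exceeds 0.5
y_pred = model.predict(X) > 0.5
accuracy = accuracy_score(y, y_pred)
print(f"Accuracy: {accuracy}")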

In [None]:
cm = confusion_matrix(y, y_pred)
print(cm)

In [None]:
fpr, tpr, _ = roc_curve(y, model.predict(X))
auc = roc_auc_score(y, model.predict(X))
print(f"AUC: {auc}")

**Understanding Factors Significantly Influencing Readmission**

1. Use p-values from the logistic regression summary:
- Predictors with $p < 0.05$ are statistically significant.
2. Assess the odds ratios:
- For example, if the odds ratio for LOS is 2.0, each additional day in the hospital doubles the odds of readmission.
3. Visualize relationships:
- Plot odds ratios for key predictors to present to stakeholders.
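
An odds ratio is the exponential of a logistic regression coefficient. The sketch below uses made-up coefficients (the names `LOS` and `medication_adherence` and their values are hypothetical, not fitted results) to show the conversion:

```python
import math

# Hypothetical logistic regression coefficients (log-odds), for illustration only
coefficients = {"LOS": math.log(2.0), "medication_adherence": -0.5}

# Odds ratio = exp(coefficient)
odds_ratios = {name: math.exp(beta) for name, beta in coefficients.items()}

for name, ratio in odds_ratios.items():
    change = (ratio - 1.0) * 100
    print(f"{name}: odds ratio = {ratio:.2f} ({change:+.0f}% change in odds per unit)")
```

An odds ratio above 1 means the odds of readmission rise with the predictor; a value below 1 means they fall.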

**Statistical Hypothesis Testing**

Example 1: Relationship Between LOS and Readmission
- Hypotheses:
    - $H_0$: LOS has no effect on readmission.
    - $H_a$: LOS has a significant effect on readmission.
- Approach: Perform a logistic regression test and check the p-value for LOS.

Example 2: Age Group vs. Readmission
- Hypotheses:
    - $H_0$: Age group is independent of readmission.
    - $H_a$: Age group and readmission are dependent.
- Approach: Use a Chi-Square test of independence (see previous example).

**Actionable Insights**
- Highlight key factors significantly influencing readmission (e.g., LOS, medication adherence).
- Use odds ratios to explain how much each factor increases or decreases the likelihood of readmission.
- Present findings visually (e.g., bar charts for odds ratios, ROC curves for model performance).

### Variables and Variable Selection

Learn how to:

- Differentiate between Variable Types and Dummy Variables;
- Select features based on correlation;
- Select features based on variance thresholds.

#### Introduction

**Variables** are the basic building blocks of datasets. 
- The quality of the variables present within your dataset has a direct impact on the intuition and overall outcome of your machine learning model. 

**Variable selection** and an in-depth knowledge of the domain you're building your model in remain essential when developing a predictive model.

The purpose of regression is essentially to build associations between multiple variables. 
- Variable selection involves the elimination of input variables, which may in turn
    - reduce the computational cost of modeling, and
    - improve the performance of the model. 

The model is structured around the belief that one of the variables in our dataset is a dependent variable (DV) that is explained or predicted in some way by the other, independent variables (IVs). In this sense we work with: 

**Input variables** - also referred to as independent variables (IVs); they are used to explain or predict the target variable.

**Target variable** - also referred to as the dependent variable (DV); it is the variable you want to predict.

In [None]:
# column names contain white space, which we replace with underscores (so that the column names can be used as variable names in formulas later on)
df.columns = [col.replace(" ","_") for col in df.columns] 
df.head()

##### Perform preliminary data preprocessing

We want to relate the predictors to the dependent variable for customers who have actually taken a loan, so we only consider the instances (customers) with a positive outcome (took a personal loan). After filtering, the `Personal_Loan` column is constant, so we drop it:

In [None]:
df = df[df['Personal_Loan'] == 1]
df = df.drop(['Personal_Loan'],axis=1)
df.head()

##### Variable types

`df.info()` specifically outputs the number of non-null entries in each column. 
- We can be certain that our data has missing values if columns have a varying number of non-null entries.

`df.describe()` shows the summary statistics of the data.

In [None]:
df.info()

df.describe()

#### Dummy Variable Encoding
The summary statistics tell us little about the categorical variables that are stored as numbers ('Online', 'CD_Account', 'Securities_Account').

NB: All input data for regression model building needs to be numerical. 

We therefore need to transform the text data (found within columns such as 'Education', 'Gender', and 'Area') into numbers before we can train our machine learning model.

To facilitate this transformation from textual-categorical data to numerical equivalents, 
- use a pandas function called `get_dummies`. 
- The text data are categorical variables, and `get_dummies` will transform all the categorical text data into numbers by adding a column for each distinct category. 
    - The new column has a 1 for observations which were in this category, and a 0 for observations that were not.

For example, the dataframe:

| Dog Age | Breed      |
|---------|------------|
| 15      | "Bulldog"  |
| 12      | "Labrador" |
| 10      | "Labrador" |
| 22      | "Beagle"   |
| 9       | "Labrador" |


After `pd.get_dummies` it becomes:

| Dog Age | Breed_Labrador | Breed_Bulldog | Breed_Beagle |
|---------|----------------|---------------|--------------|
| 15      | 0              | 1             | 0            |
| 12      | 1              | 0             | 0            |
| 10      | 1              | 0             | 0            |
| 22      | 0              | 0             | 1            |
| 9       | 1              | 0             | 0            |

This process is known as dummy variable encoding.
- It is an important step in preprocessing data for regression analysis.
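
The dog example above can be reproduced with a short sketch. The variable names `dogs` and `dogs_encoded` are illustrative, and `dtype=int` is passed so that the dummy columns contain 0/1 rather than booleans:

```python
import pandas as pd

dogs = pd.DataFrame({
    "Dog Age": [15, 12, 10, 22, 9],
    "Breed": ["Bulldog", "Labrador", "Labrador", "Beagle", "Labrador"],
})

# One new 0/1 column is created per distinct breed; numeric columns are left untouched
dogs_encoded = pd.get_dummies(dogs, dtype=int)
print(dogs_encoded)
```

Note that pandas orders the new columns alphabetically by category (Beagle, Bulldog, Labrador), so the column order may differ from a hand-written table.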

In [None]:
df_dummies = pd.get_dummies(df)

# Again we make sure that all the column names have underscores instead of whitespaces
df_dummies.columns = [col.replace(" ","_") for col in df_dummies.columns] 

df_dummies.head()

#### Correlations and Model Structure
Now, we can build a model that predicts `Loan_Size` (our dependent variable) as a function of 43 different independent variables (IVs).

1. Reorder the columns so that our dependent variable is the last column of the dataframe. 
- This makes a heatmap visualisation of the correlation matrix of our data easier to interpret.

2. Compute the correlation matrix.

In [None]:
column_titles = [col for col in df_dummies.columns if col!= 'Loan_Size'] + ['Loan_Size']
df_dummies=df_dummies.reindex(columns=column_titles)

In [None]:
import matplotlib.pyplot as plt
from statsmodels.graphics.correlation import plot_corr

# Compute the correlation matrix once and reuse it
corr_matrix = df_dummies.corr()

# Plot the correlation matrix as a heatmap
fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(111)
plot_corr(corr_matrix, xnames=corr_matrix.columns, ax=ax);

#### Rerun your Model: Fitting the model using `statsmodels.OLS`

##### Generating the regression string

We import the statsmodels library, which has a rich set of statistical tools to help us. 

Those of you familiar with the R language will know that fitting a model there requires a formula string of the form:

`y ~ X`

- which is read as follows: "Regress y on X". 

`statsmodels` works in a similar way, so we need to generate an appropriate string to feed to the method when we wish to fit the model.

In [None]:
from statsmodels.formula.api import ols

# Model DataFrame with all of the columns:
dfm = df_dummies.copy()

# The dependent variable:
y_name = 'Loan_Size'
# The independent variables
# (let's first try all of the columns in the model DataFrame)
X_names = [col for col in dfm.columns if col != y_name]

# Build the OLS formula string " y ~ X "
formula_str = y_name + " ~ " + " + ".join(X_names)
print('Formula:\n\t {}'.format(formula_str))

In [None]:
# Fit the model using the model dataframe
model=ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Output the fitted summary
print(fitted.summary())

### Interpreting the OLS Regression Summary

**Model Performance**

|Measure           |Value             |
|------------------|------------------|
| Dep. Variable:   |        Loan_Size | 
| Model:           |              OLS | 
| Method:          |    Least Squares | 
| Date:            | Sat, 02 May 2020 |
| Time:            |         13:21:01 |
| No. Observations:|              471 |
| Df Residuals:    |              430 |
| Df Model:        |               40 |
| Covariance Type: |        nonrobust |
| R-squared:       |             0.777|
| Adj. R-squared:  |             0.757|
| F-statistic:     |             37.56|
| Prob (F-statistic): |      1.71e-115|
| Log-Likelihood:  |           -1387.0|
| AIC:             |             2856.|
| BIC:             |             3026.|

Dependent Variable: Loan_Size
- The target variable being modeled, indicating the size of loans in this context.

R-squared: 0.777
- Meaning: 77.7% of the variation in Loan_Size is explained by the independent variables in the model.
- Thresholds: Higher values (closer to 1) indicate better model fit. For real-world data, 77.7% is a strong fit.
- Stakeholder Message: The model is effective at explaining the variability in loan sizes based on the input variables.

Adj. R-squared: 0.757
- Meaning: 75.7% of the variation in Loan_Size is explained by the independent variables in the model after adjusting for the number of predictors, which penalises adding variables that contribute little.
- A slight drop from R-squared suggests that some variables may add limited value to the model.

**Statistical Significance of the Model**

F-statistic: 37.56
- Meaning: The F-test checks whether the predictors are jointly significant, i.e. whether at least one coefficient differs from zero.

Prob (F-statistic): 1.71e-115 (extremely small, close to 0)
- Stakeholder Message: The overall model is statistically significant, indicating that the predictors together effectively explain variations in loan size.
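
The adjusted R-squared and the F-statistic can be approximately reproduced from the other values in the summary table. The sketch below uses the standard formulas with the rounded numbers reported above, so small differences from the table are expected:

```python
# Values read from the summary table above
r_squared = 0.777
n_obs = 471        # No. Observations
df_model = 40      # Df Model
df_resid = 430     # Df Residuals = n_obs - df_model - 1

# Adjusted R-squared penalises R-squared for the number of estimated parameters
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

# F-statistic: explained variance per model df over unexplained variance per residual df
f_statistic = (r_squared / df_model) / ((1 - r_squared) / df_resid)

print(round(adj_r_squared, 3), round(f_statistic, 2))
```

Both results land close to the reported 0.757 and 37.56; the gap comes from R-squared being rounded to three decimals in the table.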

__________________________________________________________________________________________________________________________________________________________

**Coefficients and Their Interpretation**

|                          |   coef   | std err    |      t     | P>\|t\|   | [0.025      0.975]|
|--------------------------|----------|------------|------------|-----------|-------------------|
|Intercept                 | 6.4496   |   2.696    |  2.392     | 0.017     |  1.150      11.749|
|Age                       |-0.3140   |   0.194    | -1.620     | 0.106     | -0.695       0.067|
|Experience                | 0.2226   |   0.195    |  1.142     | 0.254     | -0.160       0.605|
|Income                    | 0.1777   |   0.008    | 23.319     | 0.000     |  0.163       0.193|
|Family                    | 1.3289   |   0.219    |  6.060     | 0.000     |  0.898       1.760|
|CCAvg                     | 1.4333   |   0.114    | 12.521     | 0.000     |  1.208       1.658|
|Mortgage                  |-0.0370   |   0.001    |-24.962     | 0.000     | -0.040      -0.034|
|Securities_Account        | 1.5816   |   0.798    |  1.982     | 0.048     |  0.013       3.150|
|CD_Account                |-0.6828   |   0.634    | -1.078     | 0.282     | -1.928       0.563|
|Online                    | 0.1235   |   0.513    |  0.241     | 0.810     | -0.886       1.133|
|Education_Postgrad        | 2.3492   |   0.941    |  2.496     | 0.013     |  0.499       4.199|
|Education_Professional    | 1.9695   |   0.968    |  2.034     | 0.043     |  0.066       3.873|
|Education_Undergrad       | 2.1309   |   0.988    |  2.156     | 0.032     |  0.188       4.074|
|Gender_Female             | 3.6759   |   1.383    |  2.658     | 0.008     |  0.958       6.394|
|Gender_Male               | 2.7737   |   1.352    |  2.052     | 0.041     |  0.117       5.431|
|Area_Alameda              |-0.0350   |   0.854    | -0.041     | 0.967     | -1.714       1.644|
|Area_Butte                |-2.9267   |   3.371    | -0.868     | 0.386     | -9.553       3.700|
|Area_Contra_Costa         |-0.1349   |   1.435    | -0.094     | 0.925     | -2.956       2.686|
|Area_Fresno               | 2.0428   |   3.397    |  0.601     | 0.548     | -4.634       8.719|
|Area_Humboldt             | 0.0294   |   3.371    |  0.009     | 0.993     | -6.596       6.655|
|Area_Kern                 | 1.1313   |   1.830    |  0.618     | 0.537     | -2.465       4.727|
|Area_Los_Angeles          |-0.2556   |   0.653    | -0.392     | 0.696     | -1.538       1.027|
|Area_Marin                | 0.2734   |   1.969    |  0.139     | 0.890     | -3.596       4.143|
|Area_Mendocino            | 4.0507   |   4.756    |  0.852     | 0.395     | -5.297      13.398|
|Area_Monterey             |-2.3811   |   1.289    | -1.847     | 0.065     | -4.914       0.152|
|Area_Orange               | 0.5804   |   1.005    |  0.578     | 0.564     | -1.395       2.556|
|Area_Placer               |-0.1183   |   3.351    | -0.035     | 0.972     | -6.706       6.469|
|Area_Riverside            |-0.4246   |   1.991    | -0.213     | 0.831     | -4.339       3.489|
|Area_Sacramento           | 0.9005   |   1.310    |  0.687     | 0.492     | -1.675       3.476|
|Area_San_Bernardino       | 2.3827   |   2.770    |  0.860     | 0.390     | -3.062       7.827|
|Area_San_Diego            | 0.4737   |   0.767    |  0.618     | 0.537     | -1.034       1.981|
|Area_San_Francisco        |-1.4785   |   1.173    | -1.260     | 0.208     | -3.785       0.828|
|Area_San_Joaquin          | 1.1931   |   4.742    |  0.252     | 0.801     | -8.128      10.514|
|Area_San_Luis_Obispo      |-0.2345   |   2.408    | -0.097     | 0.922     | -4.968       4.499|
|Area_San_Mateo            | 0.6569   |   1.559    |  0.421     | 0.674     | -2.408       3.722|
|Area_Santa_Barbara        |-0.0998   |   1.498    | -0.067     | 0.947     |  -3.044       2.845|
|Area_Santa_Clara          | 0.1681   |   0.729    |  0.231     | 0.818     |  -1.265       1.601|
|Area_Santa_Cruz           |-0.3529   |   1.827    | -0.193     | 0.847     |  -3.944       3.238|
|Area_Shasta               |-0.6051   |   2.779    | -0.218     | 0.828     |  -6.068       4.858|
|Area_Solano               |-2.0356   |   2.749    | -0.740     | 0.459     |  -7.440       3.368|
|Area_Sonoma               | 0.4197   |   1.987    |  0.211     | 0.833     |  -3.485       4.325|
|Area_Stanislaus           |-0.9779   |   4.726    | -0.207     | 0.836     | -10.268       8.312|
|Area_Ventura              | 1.7134   |   1.487    |   1.152    |  0.250    |   -1.210       4.636|
|Area_Yolo                 | 2.4941   |   1.719    |  1.451     | 0.148     |  -0.885       5.873|


The coef values represent the average change in Loan_Size for a one-unit change in each predictor, holding other variables constant.

Significant Predictors:

Income (coef = 0.1777, p < 0.001):
- A one-unit increase in income is associated with an increase of 0.1777 in loan size, on average.
- Stakeholder Message: Higher income levels significantly increase loan size, suggesting income is a major determinant of loan allocation.

Family (coef = 1.3289, p < 0.001):
- Loan size increases by 1.33 units for each additional family member.
- Stakeholder Message: Family size positively influences loan size, which may reflect financial responsibilities influencing loan demand.

CCAvg (coef = 1.4333, p < 0.001):
- Average monthly credit card spending significantly increases loan size.
- Stakeholder Message: High credit card spending is a strong indicator of higher loan eligibility or need.

Mortgage (coef = -0.0370, p < 0.001):
- A negative coefficient implies that larger mortgages slightly reduce loan size.
- Stakeholder Message: Customers with higher mortgage liabilities may receive lower loans, possibly reflecting risk concerns.

Non-Significant Predictors:

Age (p = 0.106), Experience (p = 0.254), Online (p = 0.810), many Area variables:
- These variables do not have a statistically significant relationship with loan size as p > 0.05.
- Stakeholder Message: These factors might be excluded in future models unless they align with business insights or strategies.

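Categorical Predictors:

Education:
- All three Education dummies (Postgrad 2.35, Professional 1.97, Undergrad 2.13) have positive coefficients of similar size. Because `pd.get_dummies` was called without `drop_first=True`, every category has its own column and there is no reference category, so these values cannot be read as differences from a baseline.
- Stakeholder Message: Education level appears to be associated with loan size, but the individual category effects should only be interpreted after refitting the model with a reference category.

Gender:
- Both Gender_Female (3.68) and Gender_Male (2.77) are in the model, so neither coefficient is measured relative to the other. The gap between them suggests that women receive loans that are on average about 0.90 units larger than men's.
- Stakeholder Message: Gender differences in loan sizes could reflect underlying demographic or financial patterns.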

___________________________________________________________________________________________________________________________________________________________

**Diagnostic Measures**
| Measure                  |  Value      |
|--------------------------|-------------|
|Omnibus:                  |  17.650 |
|Durbin-Watson:            |     2.004  |
|Prob(Omnibus):            |    0.000   |
|Jarque-Bera (JB):         |    19.137|
|Skew:                     |   -0.431   |
|Prob(JB):                 |  6.99e-05|
|Kurtosis:                 |    3.482 |
|Cond. No.                 |  7.37e+16|

Omnibus and Jarque-Bera Tests (p < 0.001):
- Indicate that residuals (errors) may not be perfectly normally distributed.
- Stakeholder Message: While the model is strong, residual non-normality could be further investigated to refine the model.

Durbin-Watson Statistic (2.004):
- A value close to 2 suggests no significant autocorrelation in residuals, meaning errors are independent.
- Stakeholder Message: The model meets the independence of errors assumption.
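
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The sketch below computes it by hand on simulated residuals (the data and the helper name `durbin_watson_stat` are assumptions for illustration; statsmodels also provides a `durbin_watson` function in `statsmodels.stats.stattools`):

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)


rng = np.random.default_rng(0)
independent = rng.normal(size=1000)     # simulated residuals with no autocorrelation
random_walk = np.cumsum(independent)    # strongly positively autocorrelated series

dw_independent = durbin_watson_stat(independent)
dw_random_walk = durbin_watson_stat(random_walk)

print(dw_independent)   # near 2
print(dw_random_walk)   # near 0
```

Values near 2 indicate independent errors, values towards 0 indicate positive autocorrelation, and values towards 4 indicate negative autocorrelation.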

Condition Number (7.37e+16):
- High values suggest multicollinearity issues (predictors are highly correlated).
- Stakeholder Message: Some predictors may overlap in their explanatory power. This could be addressed through techniques like variable selection or regularization (e.g., Ridge/Lasso regression).

Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The smallest eigenvalue is 4.16e-27. This might indicate that there are
strong multicollinearity problems or that the design matrix is singular.
- This is most likely the dummy variable trap: every category of each one-hot encoded variable was kept, so the dummy columns of a variable always sum to one and are perfectly collinear with the intercept.
- To avoid this redundancy between the categories,
    - call `pd.get_dummies` with the argument `drop_first=True` so that we only create n-1 columns for each variable with n categories
        - (i.e. one variable/column with five categories will be transformed into four columns of 0's and 1's)
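
The singular design matrix mentioned in the warning can be demonstrated on a tiny made-up sample. The data, the `toy` DataFrame, and the helper `design_matrix` below are assumptions for illustration only:

```python
import numpy as np
import pandas as pd

# Small made-up sample, for illustration only
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male", "Female"],
    "Income": [40, 55, 62, 38, 71],
})


def design_matrix(frame, drop_first):
    """Dummy-encode the predictors and prepend an intercept column of ones."""
    encoded = pd.get_dummies(frame, drop_first=drop_first, dtype=float)
    encoded.insert(0, "Intercept", 1.0)
    return encoded


X_all = design_matrix(toy, drop_first=False)
X_dropped = design_matrix(toy, drop_first=True)

for name, X in [("all dummies", X_all), ("drop_first=True", X_dropped)]:
    rank = np.linalg.matrix_rank(X.to_numpy())
    print(f"{name}: {X.shape[1]} columns, rank {rank}")
```

When every dummy is kept, the matrix has more columns than its rank, because `Gender_Female + Gender_Male` equals the intercept column. Dropping the first category restores full column rank.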

_______________________________________________________________________________________________________________________________________________________

**Actionable Insights for Stakeholders**
- Focus efforts on variables like Income, CCAvg, and Family, which are key drivers of loan sizes.
- Investigate non-significant variables for possible removal to simplify the model and enhance interpretability.
- Address potential multicollinearity by refining the input variables.
- Consider segmentation by Education and Gender to tailor loan products effectively.
- Reassess area-specific variables since many are non-significant; geographical targeting may not substantially impact loan size decisions.

In [None]:
df_dummies = pd.get_dummies(df, drop_first=True)

# Again make sure that all the column names have underscores instead of whitespaces
df_dummies.columns = [col.replace(" ", "_") for col in df_dummies.columns]

# Reorder columns so that the dependent variable (Loan_Size) is the last column
column_titles = [col for col in df_dummies.columns if col !=
                 'Loan_Size'] + ['Loan_Size']
df_dummies = df_dummies.reindex(columns=column_titles)

df_dummies.head()

In [None]:
# We'll keep the model DataFrame, but only specify the columns we want to fit this time
X_names = [col for col in df_dummies.columns if col != y_name]

# Build the OLS formula string " y ~ X "
formula_str = y_name+' ~ '+'+'.join(X_names)

# Fit the model using the model dataframe
model = ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Output the fitted summary
print(fitted.summary())

## Making further selections on the variables now using their significance.

### Variable Selection by Correlation and Significance

We need to choose the best variables to be our predictors. 

One way is to 
- look at the correlations between `Loan_Size` and each variable in our DataFrame
    - and select those with the strongest correlations (both positive and negative).
- consider how significant those features are. 

Create a new DataFrame and store the correlation coefficients and p-values in that DataFrame for reference.

In [None]:
# Calculate correlations between predictor variables and the response variable
corrs = df_dummies.corr()['Loan_Size'].sort_values(ascending=False)

In [None]:
from scipy.stats import pearsonr

# Build a dictionary of correlation coefficients and p-values
dict_cp = {}

column_titles = [col for col in corrs.index if col!= 'Loan_Size']
for col in column_titles:
    p_val = round(pearsonr(df_dummies[col], df_dummies['Loan_Size'])[1],6)
    dict_cp[col] = {'Correlation_Coefficient':corrs[col],
                    'P_Value':p_val}
    
df_cp = pd.DataFrame(dict_cp).T
df_cp_sorted = df_cp.sort_values('P_Value')
df_cp_sorted[df_cp_sorted['P_Value']<0.1]

Get a sorted list of the p-values and correlation coefficients for each of the features, when considered on their own.  

If we were to use a hypothesis test at the 5% significance level (p-value < 0.05), 
- we could infer that the following features are statistically significant:
    - List features

Keep only the variables that have a significant correlation with the dependent variable. 
- Put them into an independent variable DataFrame `X`

In [None]:
# The dependent variable remains the same:
y_data = df_dummies[y_name]  # y_name = 'Loan_Size'

# Model building - Independent Variable (IV) DataFrame
X_names = list(df_cp[df_cp['P_Value'] < 0.05].index)
X_data = df_dummies[X_names]

Also, look for pairs of predictor variables which are highly correlated with each other, to avoid multicollinearity.

It is easier to isolate the sections of the correlation matrix where the off-diagonal correlations are high:

In [None]:
import numpy as np

# Create the correlation matrix
corr = X_data.corr()

# Find rows and columns where correlation coefficients > 0.9 or < -0.9
corr[np.abs(corr) > 0.9]

In [None]:
# As before, we create the correlation matrix
# and find rows and columns where correlation coefficients > 0.9 or < -0.9
corr = X_data.corr()
r, c = np.where(np.abs(corr) > 0.9)

# We are only interested in the off diagonal entries:
off_diagonal = np.where(r != c)

# Show the correlation matrix rows and columns where we have highly correlated off diagonal entries:
corr.iloc[r[off_diagonal], c[off_diagonal]]

##### Resulting OLS fit summary

In [None]:
# Lets take a new subset of our potential independent variables
X_remove = ['Age']
X_corr_names = [col for col in X_names if col not in X_remove]

# Create our new OLS formula based-upon our smaller subset
formula_str = y_name+' ~ '+' + '.join(X_corr_names);
print('Formula:\n\t{}'.format(formula_str))

In [None]:
# Fit the OLS model using the model dataframe
model=ols(formula=formula_str, data=dfm)
fitted = model.fit()

# Display the fitted summary
print(fitted.summary())

### Variable Selection by Variance Thresholds

Variance Thresholds remove features whose values don't change much from observation to observation. 

The objective here is to remove all features that have a variance lower than the selected threshold.
- Suppose that in our loans dataset 97% of observations were for 40-year-old women, then the *Age* and *Gender* features can be removed without a great loss in information.

It is important to note that variance is dependent on scale, so the features will have to be normalized before implementing variance thresholding.
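
The scale dependence of variance can be seen in a minimal sketch. The income values and the helper `min_max` are made up for illustration:

```python
import numpy as np

income_thousands = np.array([40.0, 55.0, 62.0, 38.0, 71.0])  # made-up values
income_units = income_thousands * 1000                        # same data, different unit


def min_max(x):
    """Rescale values to the range [0, 1]."""
    return (x - x.min()) / (x.max() - x.min())


# Raw variances differ by a factor of 1000 squared, purely because of the unit
print(np.var(income_thousands), np.var(income_units))

# After normalisation, both versions have the same variance
print(np.var(min_max(income_thousands)), np.var(min_max(income_units)))
```

A fixed threshold would treat the two raw versions very differently even though they carry the same information, which is why the features are normalised first.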

In [None]:
# Separate data into independent (X) and dependent (y) variables
X_names = list(df_dummies.columns)
X_names.remove(y_name)
X_data = df_dummies[X_names]
y_data = df_dummies[y_name]

In [None]:
# Normalize data
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_data)
X_normalize = pd.DataFrame(X_scaled, columns=X_data.columns)

#### Variance Threshold in Scikit Learn

To implement Variance Threshold in Scikit Learn we have to do the following:

- Import and create an instance of the `VarianceThreshold` class;
- Use the `.fit()` method to select a subset of features based on the threshold.

In [None]:
from sklearn.feature_selection import VarianceThreshold

# Create VarianceThreshold object
selector = VarianceThreshold(threshold=0.03)

# Use the object to apply the threshold on data
selector.fit(X_normalize)

##### Calculated variance for each predictive variable.

Show the variances of the individual columns before any threshold is applied. 

It allows us to revise our initial variance threshold if we feel that we might exclude important variables.

In [None]:
# Get column variances
column_variances = selector.variances_

# Build one record per column holding its name and variance
vars_dict = [{"Variable_Name": c_name, "Variance": c_var}
             for c_name, c_var in zip(X_normalize.columns, column_variances)]
df_vars = pd.DataFrame(vars_dict)
df_vars.sort_values(by='Variance', ascending=False)

#### Extract the results and use them to select our new columns

In [None]:
# Select new columns
X_new = X_normalize[X_normalize.columns[selector.get_support(indices=True)]]

# Save variable names for later
X_var_names = X_new.columns

# View first few entries
X_new.head()

In [None]:
# Create Variance Threshold objects
selector_1 = VarianceThreshold(threshold=0.05)
selector_2 = VarianceThreshold(threshold=0.1)
selector_3 = VarianceThreshold(threshold=0.15)

In [None]:
selector_1.fit(X_normalize)

In [None]:
selector_2.fit(X_normalize)

In [None]:
selector_3.fit(X_normalize)

In [None]:
# Select subset of columns
X_1 = X_normalize[X_normalize.columns[selector_1.get_support(indices=True)]]
X_2 = X_normalize[X_normalize.columns[selector_2.get_support(indices=True)]]
X_3 = X_normalize[X_normalize.columns[selector_3.get_support(indices=True)]]

In [None]:
# Create figure and axes
f, ax = plt.subplots(figsize=(8, 3), nrows=1, ncols=1)

# Create list of titles and predictions to use in for loop
subset_preds = [X_1.shape[1], X_2.shape[1], X_3.shape[1]]
thresholds = ['0.05', '0.1', '0.15']

# Plot graph
ax.set_title('# of Predictors vs Thresholds')
ax.set_ylabel('# of Predictors')
ax.set_xlabel('Threshold')
sns.barplot(x=thresholds, y=subset_preds)
plt.show()


##### Extract the predictor names of the 3 different datasets above?

Resulting OLS fit summary for a threshold of 0.03

In [None]:
# What is our new OLS formula?
formula_str = y_name+' ~ '+' + '.join(X_new.columns)
print('Formula:\n\t{}'.format(formula_str))

In [None]:
# Fit the model using the model dataframe
model = ols(formula=formula_str, data=df_dummies)
fitted = model.fit()

print(fitted.summary())

#### Advantages & Disadvantages of Variance Thresholds

Let's consider some trade-offs associated with using variance thresholds for variable selection: 

**Advantages**

* Applying variance thresholds is based on solid intuition: features that don't change much also don't add much information;
* Easy and relatively safe way to reduce dimensionality (i.e. number of features) at the start of the modeling process.

**Disadvantages**

* Not the ideal algorithm if dimensionality reduction is not really required;
* The threshold must be manually tuned, which can be a fickle process requiring domain/problem expertise.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

Preprocess the data

Make sure that all models are trained and tested on the same data.

In [None]:
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X_data,
                                                    y_data,
                                                    test_size=0.20,
                                                    shuffle=False)

In [None]:
# Get training and testing data for variance threshold model
X_var_train = X_train[X_var_names]
X_var_test = X_test[X_var_names]

In [None]:
# Get training and testing data for correlation threshold model
X_corr_train = X_train[X_corr_names]
X_corr_test = X_test[X_corr_names]

##### Fit models

instantiate and fit our models

In [None]:
lm = LinearRegression()
lm_corr = LinearRegression()
lm_var = LinearRegression()

In [None]:
lm.fit(X_train, y_train);
lm_corr.fit(X_corr_train,y_train);
lm_var.fit(X_var_train,y_train);

##### Assess model accuracy 
Let's see how our linear models performed!

In [None]:
from sklearn.metrics import mean_squared_error, r2_score

# Create figure and axes
f, ax = plt.subplots(figsize=(15, 5), nrows=1, ncols=3, sharey=True)

# Create list of titles and predictions to use in for loop
train_pred = [lm.predict(X_train),
              lm_corr.predict(X_corr_train),
              lm_var.predict(X_var_train)]
test_pred = [lm.predict(X_test),
             lm_corr.predict(X_corr_test),
             lm_var.predict(X_var_test)]
title = ['No threshold', 'Corr threshold', 'Var threshold']

# Key:
# No threshold - linear regression with all predictive variables
# Corr threshold - linear regression with correlation thresholded predictive variables
# Var threshold - linear regression with variance thresholded predictive variables


# Loop through all axes to plot each model's results
for i in range(3):
    # scikit-learn metrics expect (y_true, y_pred) in that order; r2_score is not symmetric
    test_mse = round(mean_squared_error(y_test, test_pred[i]), 4)
    test_r2 = round(r2_score(y_test, test_pred[i]), 4)
    train_mse = round(mean_squared_error(y_train, train_pred[i]), 4)
    train_r2 = round(r2_score(y_train, train_pred[i]), 4)
    title_str = f"Linear Regression({title[i]}) \n train MSE = {train_mse} \n " + \
                f"test MSE = {test_mse} \n training $R^2$ = {train_r2} \n " + \
                f"test $R^2$ = {test_r2}"
    ax[i].set_title(title_str)
    ax[i].set_xlabel('Actual')
    ax[i].set_ylabel('Predicted')
    ax[i].plot(y_test, y_test, 'r')
    ax[i].scatter(y_test, test_pred[i])

### Regularisation Preprocessing: Scaling Data for Regularisation

Scaling data is critical to regularisation, as the penalty on particular coefficients in regularisation techniques, namely L1 and L2, depends largely on the scale associated with the variables. 

Regularisation puts constraints on the size of the coefficients related to each variable.
- Rescaling is very important for methods with regularisation because the size of the variables affects how much regularisation will be applied to that specific variable. 
- To make it fair, we need to get all the features on the same scale. 

There are two common scaling techniques: 

#### Normalisation

One way to do this is with $[0,1]$-normalisation: 
- Squeezing your data into the range $[0,1]$. 

Through normalisation, 
- the maximum value of a variable becomes one, 
- the minimum becomes zero, and 
- the values in-between become decimals between zero and one.

We implement this transformation by applying the following operation to each of the values of a predictor variable:

$$\hat{x}_{ij} = \frac{x_{ij}-min(x_j)}{max(x_j)-min(x_j)},$$

where 
- $\hat{x}_{ij}$ is the value after normalisation, 
- $x_{ij}$ is the $i^{th}$ item of $x_j$, 
- and $min()$, $max()$ return the smallest and largest values of variable $x_j$ respectively. 

Normalisation is useful because it ensures all variables share the same range: $[0,1]$. 

Normalisation has one drawback:
- if there are outliers, the bulk of your data will be squeezed into a small part of the range, so you lose information.
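
The outlier drawback can be seen in a minimal sketch. The helper `min_max_normalise` and the five values below are made up for illustration and are not part of the exchange-rate data.

```python
import numpy as np

def min_max_normalise(x):
    """Rescale a 1-D array to the range [0, 1]."""
    x = np.asarray(x, dtype=float)
    return (x - x.min()) / (x.max() - x.min())

# Hypothetical values: four typical observations and one outlier
values = np.array([10.0, 12.0, 11.0, 13.0, 100.0])
normalised = min_max_normalise(values)

# The outlier maps to 1, while the other four values are squeezed close to 0
print(normalised)
```

With one extreme value in the sample, the remaining observations become almost indistinguishable after rescaling.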

#### Standardisation

Z-score standardisation, or simply standardisation,
- suffers less from this drawback, as it handles outliers more gracefully. 

We implement Z-score standardisation by applying the following operation to each of our variables: 

$$\hat{x}_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j}.$$

where
- $\mu_j$ represents the mean of variable $x_j$, 
- while $\sigma_j$ is the variable's standard deviation.

As can be seen from the above formula, instead of dividing by the full range of our variable, we divide by a more distribution-aware measure: the standard deviation. 
- While this doesn't completely remove the effects of outliers, it does consider them in a more conservative manner. 

As a trade-off to using this transformation, our variable is no longer contained within the $[0,1]$ range as it was during normalisation.
- It can now take on a range which includes negative values.
- This means that all our variables won't be bound to the exact same range. 
    - They can have slightly different influence levels on the learnt regression coefficients during regularisation,
    - but they are far closer to one another than they were before the use of standardisation.
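
As a rough check of the formula, the sketch below standardises a made-up column by hand and compares it with scikit-learn's `StandardScaler`. The values are hypothetical; the only assumption about the library is that `StandardScaler` divides by the population standard deviation, which is also NumPy's default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-column data set with one outlier
values = np.array([[10.0], [12.0], [11.0], [13.0], [100.0]])

# Z-score by hand: subtract the column mean, divide by the column standard deviation
manual = (values - values.mean(axis=0)) / values.std(axis=0)

# StandardScaler applies the same transformation column by column
scaled = StandardScaler().fit_transform(values)

print(np.allclose(manual, scaled))   # both routes agree
print(scaled.min() < 0)              # standardised values can be negative
```

Note that pandas computes the sample standard deviation by default, so `describe()` on a standardised dataframe reports values slightly above one rather than exactly one.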

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
df = pd.read_csv('https://raw.githubusercontent.com/Explore-AI/Public-Data/master/Data/regression_sprint/regression_sprint_data_2.csv', index_col=0)
df.head()

We use monthly data for the Rand/Dollar exchange rate, as well as a few potential predictor variables. 

The goal is to model the exchange rate using the other 19 variables.

We write the model as follows:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$

- $Y$ is the response variable, which depends on the _p_ predictor variables.

In [None]:
# split data into predictors and response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

In [None]:
# import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

# create scaler object
scaler = StandardScaler()

# create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

# convert the scaled predictor values into a dataframe
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

Taking a look at one of the variables as an example (Value of Exports (USD)), we can see that standardising the data has caused it to be centred around zero.

The variance within each variable in the data is now equal to one.

In [None]:
plt.hist(X_standardise['Value of Exports (USD)'])
plt.show()

In [None]:
X_standardise.describe().loc['std']

### 3. Regularisation Methods: Ridge Regression

Understand what regularisation is and how to implement it using the ridge method

Linear regression is a popular choice, but it often faces the challenge of overfitting, especially with a high number of parameters. 

This is where ridge and lasso regression come in, offering practical solutions to enhance model accuracy and make informed decisions in data analysis. 

Regularization techniques are used to address overfitting and enhance model generalizability. 
- Ridge and lasso regression are effective methods in machine learning that introduce penalties on the magnitude of regression coefficients. 
    - They work by penalizing the magnitude of coefficients of features while minimizing the error between predicted and actual observations. These are called 'regularization' techniques.

Ridge and lasso regression are powerful techniques generally used for creating parsimonious models in the presence of a 'large' number of features. 
- 'Large' can typically mean either of two things:
    - Large enough to enhance the tendency of a model to overfit (as few as 10 variables might cause overfitting)
    - Large enough to cause computational challenges. With modern systems, this situation might arise in the case of millions or billions of features.

#### Shrinkage Methods

Ridge regression aims to modify and potentially improve the test-set performance of a least squares regression model by reducing the magnitude of some subset of the coefficients $\hat{\beta}$.
- The ridge regression process of reducing the magnitude of those coefficients is a type of _shrinkage_ method - we are attempting to shrink the values of those less important coefficients.
- In ridge regression, it is possible to shrink a coefficient's value towards zero, but never reaching exactly zero.

#### Usage of Ridge Regression:
- When the independent variables are highly collinear, ordinary linear or polynomial regression gives unstable coefficient estimates.
    - Ridge regression can be used to address this problem.
- If we have more parameters than samples,
    - ridge regression helps here too.
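
The collinearity case can be sketched on synthetic data. Everything below is an illustrative assumption: two made-up predictors that are almost copies of one another, a made-up response, and the default-style penalty `alpha=1.0`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(4)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=1e-3, size=200)      # almost an exact copy of x1
X_col = np.column_stack([x1, x2])
y_col = x1 + x2 + rng.normal(scale=0.5, size=200)

# Least squares has no reason to prefer any split of the effect between x1 and x2
ols_col = LinearRegression().fit(X_col, y_col).coef_

# The ridge penalty favours sharing the effect evenly between the two predictors
ridge_col = Ridge(alpha=1.0).fit(X_col, y_col).coef_

print("Least squares:", ols_col)
print("Ridge        :", ridge_col)
```

In a run like this, the least squares coefficients are typically large and of opposite sign, while the ridge coefficients stay close to each other and close to the values used to generate the data.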

#### Limitation of Ridge Regression:

Does not help with feature selection: 
- It decreases the complexity of a model but does not reduce the number of independent variables, since it never sets a coefficient to exactly zero; it only shrinks it. 
    - This technique is therefore not suited to feature selection.

Model interpretability: 
- It shrinks the coefficients of the least important predictors very close to zero, but it will never make them exactly zero. 
- The final model will include all the independent variables, also known as predictors.

### Ridge Regression

Description
- Ridge regression, also known as Tikhonov regularization, 
- is a technique that introduces a penalty term to the linear regression model to shrink the coefficient values.

Penalty Type
- Ridge regression utilizes an L2 penalty, 
    - which adds the sum of the squared coefficient values multiplied by a tuning parameter (lambda).

Coefficient Impact
- The L2 penalty in ridge regression discourages large coefficient values, pushing them towards zero but never exactly reaching zero. This shrinks the less important features' impact.

Feature Selection
- Ridge regression retains all features in the model, reducing the impact of less important features by shrinking their coefficients.

Use Case
- Ridge regression is useful when the goal is to minimize the impact of less important features while keeping all variables in the model.

Model Complexity
- Ridge regression tends to favor a model with a higher number of parameters, as it shrinks less important coefficients but keeps them in the model.

Interpretability
- The results of ridge regression may be less interpretable due to the inclusion of all features, each with a reduced but non-zero coefficient.

Sparsity
- Ridge regression does not yield sparse models since all coefficients remain non-zero.

Sensitivity
- More robust and less sensitive to outliers compared to lasso regression.

#### Regularisation: The theory behind regularisation.

Manual variable selection is often performed to improve the predictive accuracy of a model.

The process of variable selection is discrete in that we either keep a variable, or we throw it away.   

**Regularisation** offers an alternative method in which all predictor variables are included, but are subject to constraint. 

Recall that the least squares method seeks to minimise the sum of the squares of the residuals:

$$RSS = \sum_{i=1}^n(y_i-\hat{y}_i)^2$$   

which can be written in terms of the predictor variable coefficients, [$b_1, b_2, \dots, b_p$], and intercept, $a$:

$$RSS = \sum_{i=1}^n(y_i-(a+\sum_{j=1}^pb_jx_{ij}))^2$$

where 
- _n_ is the number of observations, and 
- _p_ is the number of predictor variables. 

In the case of **ridge regression**, the regression coefficients are calculated as the values that minimise:

$$\sum_{i=1}^n(y_i-(a+\sum_{j=1}^pb_jx_{ij}))^2 + \alpha\sum_{j=1}^pb_j^2$$

which is rewritten simply as:

$$\min_{\beta} (RSS + \alpha\sum_{j=1}^pb_j^2)$$

Objective = RSS + α * (sum of the square of coefficients)

In minimising _RSS_ , we improve the overall fit of the model. 

Ridge regression performs 'L2 regularization', i.e., it adds a factor of the sum of squares of coefficients to the optimization objective.

In the newly introduced term, $\alpha\sum_{j=1}^pb_j^2$, 
- the intention is to penalise those individual coefficients that get too large (those that contribute the most to reducing the fit).
- $\alpha$ is a tuning parameter (which we calculate later on), which controls the degree to which the regression coefficients are penalised. 
    - The effect of this penalty parameter is to create a tradeoff between how much a coefficient contributes to minimising RSS and the size of the coefficient. 
    - In other words: _training fit_ vs. _size of coefficients_. 
- We can see that the penalty parameter $\alpha$ is applied to the sum of the squares of the coefficients. 
- This means that as we increase the size of the coefficients, the penalty will increase too. 
- This has the effect of _shrinking_ the coefficients towards zero.

$\alpha$ (alpha) is the parameter that balances the amount of emphasis given to minimizing RSS vs minimizing the sum of squares of coefficients. $\alpha$ can take various values:

$\alpha$ = 0:
- The objective becomes the same as simple linear regression.
    - We'll get the same coefficients as simple linear regression.

$\alpha$ = ∞:
- The coefficients will be zero. Why? 
    - Because of infinite weightage on the square of coefficients, any coefficient other than zero will make the objective infinite.

0 < $\alpha$ < ∞:
- The magnitude of $\alpha$ will decide the weightage given to different parts of the objective.
- The coefficients will be somewhere between 0 and the ones for simple linear regression.
- Any non-zero value of $\alpha$ shrinks the coefficients, so that overall they are smaller in magnitude than those of simple linear regression.
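
The three regimes of $\alpha$ can be sketched on synthetic data. The data, the true coefficients and the grid of $\alpha$ values below are assumptions chosen purely for illustration; a very small and a very large $\alpha$ stand in for the limits 0 and ∞.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=50)

# Reference: ordinary least squares coefficients
ols_coef = LinearRegression().fit(X_demo, y_demo).coef_

# Track the size of the ridge coefficients as the penalty grows
coef_sizes = {}
for alpha in [1e-8, 1.0, 100.0, 1e6]:
    coef = Ridge(alpha=alpha).fit(X_demo, y_demo).coef_
    coef_sizes[alpha] = np.sum(coef ** 2)
    print(f"alpha={alpha:g}: sum of squared coefficients = {coef_sizes[alpha]:.6f}")
```

The sum of squared coefficients falls as $\alpha$ grows: it is essentially the least squares value for a tiny $\alpha$ and approaches, but does not reach, zero for a huge one.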

##### Example

The dataset contains monthly data for the Rand/Dollar exchange rate, as well as a few potential predictor variables.
- The goal is to model the exchange rate using the other 19 variables.

We write the model as follows:

$$Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + ... + \beta_pX_p$$

where
- $Y$ is the response variable, which depends on the _p_ predictor variables.

##### Review of Data Scaling

Data scaling is essential in regularisation as regularising penalizes a model for large coefficients. 

The magnitude of coefficients is dependent on the following:

* The strength of the relationship between the predictor variables (`x`) and the output variable (`y`)
* The units of measurement of `x` (e.g. distance measured in millimetres or metres).

For example, if `x` is measured in metres and its coefficient is 5, then expressing `x` in kilometres gives a coefficient of $5\times10^3$.

We want regularisation to be impacted by the strength of the relationship between the `x` and `y` variables, not by the units in which `x` happens to be measured.
- Thus, to eliminate the impact of the units of measurement of the variables on the coefficients, 
- we perform data scaling to ensure variables are fairly scaled. 

**Z-score standardisation** is a great way to scale variables such that they have similar (though not identical) ranges, in a way that is fairly robust to outlier values.
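
The metres-versus-kilometres example can be reproduced with a small sketch. The distances and the response below are synthetic and chosen only to match the coefficient of 5 used in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
distance_m = rng.uniform(0, 10_000, size=100)            # hypothetical distances in metres
response = 5.0 * distance_m + rng.normal(scale=1.0, size=100)

distance_km = distance_m / 1000.0                         # same information, different unit

coef_m = LinearRegression().fit(distance_m.reshape(-1, 1), response).coef_[0]
coef_km = LinearRegression().fit(distance_km.reshape(-1, 1), response).coef_[0]

# The relationship is identical, but the coefficient is 1000 times larger in kilometres
print(coef_m, coef_km)
```

A penalty on coefficient size would therefore treat the kilometre version far more harshly than the metre version, even though both describe the same relationship. Scaling removes that dependence on units.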


In [None]:
# Split data into predictors and response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

In [None]:
# Import scaler method from sklearn
from sklearn.preprocessing import StandardScaler

# Create scaler object
scaler = StandardScaler()

# Create scaled version of the predictors (there is no need to scale the response)
X_scaled = scaler.fit_transform(X)

# Convert the scaled predictor values into a dataframe
X_standardise = pd.DataFrame(X_scaled,columns=X.columns)
X_standardise.head()

### Ridge Regression
Split our data into a training and a testing set.
- Use the first eight years of data as our training set and 
- test the model on the final two years. 

Note that with time-series data it isn't appropriate to sample rows randomly for the training and testing sets because **chronological order** remains important.

Fit and test our model.
- Create a `Ridge()` object without modifying any of the parameters. 
    - This means that we will use the default value of $\alpha=1$. 
    
We'll learn about choosing a better value for this hyperparameter.

In [None]:
# Import train/test splitting function from sklearn
from sklearn.model_selection import train_test_split

# Split the data into train and test, being sure to use the standardised predictors
X_train, X_test, y_train, y_test = train_test_split(X_standardise, 
                                                    y, 
                                                    test_size=0.2, 
                                                    shuffle=False)

In [None]:
# Import the ridge regression module from sklearn
from sklearn.linear_model import Ridge

In [None]:
# Create ridge model
ridge = Ridge()

In [None]:
# Train the model
ridge.fit(X_train, y_train)

In [None]:
# Extract the model intercept value
b0 = float(ridge.intercept_)

In [None]:
# Extract the model coefficient value
coeff = pd.DataFrame(ridge.coef_, X.columns, columns=['Coefficient'])

In [None]:
print("Intercept:", float(b0))

In [None]:
# Check out the coefficients
coeff

##### Interpretation of the intercept and coefficients

Since we standardised the features,
- we can compare coefficients to each other,
- because the respective variables are all on the same scale.
- We interpret the intercept as the expected exchange rate when all the features are equal to their respective means, and each coefficient as the expected change in exchange rate given an increase of 1 in the **scaled feature value**. 

We can interpret variables with smaller coefficients as less important, as they have suffered more in the shrinkage tradeoff.

In [None]:
# Fit a basic linear model
from sklearn.linear_model import LinearRegression

# Create model object
lm = LinearRegression()

# Train model
lm.fit(X_train, y_train)

In [None]:
# Import metrics module
from sklearn import metrics

# Check training accuracy
train_lm = lm.predict(X_train)
train_ridge = ridge.predict(X_train)

print('Training MSE')
print('Linear:', metrics.mean_squared_error(y_train, train_lm))
print('Ridge :', metrics.mean_squared_error(y_train, train_ridge))

In [None]:
test_lm = lm.predict(X_test)
test_ridge = ridge.predict(X_test)

print('Testing MSE')
print('Linear:', metrics.mean_squared_error(y_test, test_lm))
print('Ridge :', metrics.mean_squared_error(y_test, test_ridge))

Ridge regression achieves a much lower MSE on the testing set at the expense of a slightly higher MSE on the training set.
 
The increase in training MSE is not anything to be worried about since we want to avoid overfitting on the training set.

In [None]:
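# We want to plot the train and test response variables as a continuous line,
# so extend the training series with the first test observation.
# (Series.append was removed in pandas 2.0; pd.concat is the replacement.)
train_plot = pd.concat([y_train, y_test.iloc[:1]])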

In [None]:
plt.plot(np.arange(len(y)), ridge.predict(X_standardise), label='Predicted')
plt.plot(np.arange(len(train_plot)), train_plot, label='Training')
plt.plot(np.arange(len(y_test))+len(y_train), y_test, label='Testing')
plt.legend()

plt.show()

#### Ridge regression in Sine / polynomial problem as below under GLMs

Function for Ridge Regression

It takes `alpha` as a parameter on initialization.

Remember that normalizing the inputs generally benefits every type of regression and should apply to ridge regression

In [None]:
from sklearn.linear_model import Ridge

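def ridge_regression(data, predictors, alpha, models_to_plot=None):
    # Avoid a mutable default argument
    models_to_plot = models_to_plot or {}

    # `normalize=True` was removed in scikit-learn 1.2, so reproduce it by hand:
    # centre each predictor and divide it by its L2 norm
    X = data[predictors]
    X_mean = X.mean()
    X_norm = np.sqrt(((X - X_mean) ** 2).sum())
    X_normed = (X - X_mean) / X_norm

    # Fit the model
    ridgereg = Ridge(alpha=alpha)
    ridgereg.fit(X_normed, data['y'])
    y_pred = ridgereg.predict(X_normed)

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g' % alpha)

    # Return the result in pre-defined format, with coefficients on the original scale
    rss = sum((y_pred - data['y'])**2)
    coef = ridgereg.coef_ / X_norm.values
    intercept = ridgereg.intercept_ - (X_mean.values * coef).sum()
    ret = [rss, intercept]
    ret.extend(coef)
    return ret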

Analyze the result of Ridge regression for 10 different values of $\alpha$ ranging from 1e-15 to 20. 

These values have been chosen so that we can easily analyze the trend with changes in values of $\alpha$.

These 10 models will contain all the 15 variables, and only the value of alpha would differ. 
- This differs from the simple linear regression case, where each model had a subset of features.

In [None]:
#Initialize predictors to be set of 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]

#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_ridge[i] for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)

models_to_plot = {1e-15:231, 1e-10:232, 1e-4:233, 1e-3:234, 1e-2:235, 5:236}
for i in range(10):
    coef_matrix_ridge.iloc[i,] = ridge_regression(data, predictors, alpha_ridge[i], models_to_plot)

Observation: 
- As the value of alpha increases, the model complexity reduces. 
    - Though higher values of alpha reduce overfitting, significantly high values can cause underfitting as well (e.g., alpha = 5). 
        - Thus alpha should be chosen wisely. 
- A widely accepted technique is **cross-validation**, i.e., the value of alpha is iterated over a range of values, and the one giving a higher cross-validation score is chosen.
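
A minimal sketch of choosing alpha by cross-validation, using scikit-learn's `RidgeCV`. The synthetic data, the candidate grid and the choice of five folds are all assumptions made for illustration, not values taken from the examples in this section.

```python
import numpy as np
from sklearn.linear_model import RidgeCV

rng = np.random.default_rng(2)
X_cv = rng.normal(size=(80, 5))
y_cv = X_cv @ np.array([2.0, 0.0, -1.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=80)

# Candidate penalties on a logarithmic grid (hypothetical range)
candidate_alphas = np.logspace(-4, 2, 20)

# Fit one ridge model per candidate and keep the alpha with the best 5-fold CV score
ridge_cv = RidgeCV(alphas=candidate_alphas, cv=5).fit(X_cv, y_cv)

print("Chosen alpha:", ridge_cv.alpha_)
print("Coefficients:", ridge_cv.coef_)
```

For time-ordered data such as the exchange-rate series, the `cv` argument also accepts a splitter object such as `TimeSeriesSplit`, which keeps each validation fold later in time than its training fold.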

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_ridge

Inferences:

- The RSS increases with an increase in alpha.
- An alpha value as small as 1e-15 gives us a significant reduction in the magnitude of coefficients. 
    - How? Compare the coefficients in the first row of this table to the last row of the simple linear regression table.
- High alpha values can lead to significant underfitting. Note the rapid increase in RSS for values of alpha greater than 1.
- Though the coefficients are really small, they are NOT zero.

Reconfirm the same by determining the number of zeros in each row of the coefficients data set:

This should confirm that all 15 coefficients are greater than zero in magnitude (can be +ve or -ve).

In [None]:
coef_matrix_ridge.apply(lambda x: sum(x.values==0),axis=1)

### 3.1. Regularisation Methods: LASSO Regression

- Understand the difference between L1 and L2 regularisation
- Understand the concept of sparsity.

#### Shrinkage Methods

In ridge regression, we learned that it is possible to modify and potentially improve the test-set performance of a least squares regression model by reducing the magnitude of some subset of the coefficients $\hat{\beta}$.
- The ridge regression process of reducing the magnitude of those coefficients is a type of _shrinkage_ method - we are attempting to shrink the values of those less important coefficients.
- In ridge regression, it is possible to shrink a coefficient's value towards zero, but never reaching exactly zero.

#### Sparsity

The L1 penalty has the effect of forcing some of the coefficient estimates to be exactly equal to zero, which means some of the features are removed from the model completely, when the tuning parameter λ is sufficiently large.
- Therefore, the lasso method also performs Feature selection and is said to yield sparse models.

#### Limitation of Lasso Regression:

Number of predictors vs. data points: 
- If the number of predictors is greater than the number of data points ($p > n$), 
    - lasso will pick at most $n$ predictors as non-zero, even if all predictors are relevant.

Multicollinearity: 
- If there are two or more highly collinear variables, lasso tends to select one of them more or less arbitrarily, which is not good for the interpretation of our model.

### LASSO Regression

Description
- Lasso regression, or Least Absolute Shrinkage and Selection Operator, 
- is a regularization method that also includes a penalty term but can set some coefficients exactly to zero, effectively selecting relevant features.

Penalty Type
- Lasso regression employs an L1 penalty, 
    - which sums the absolute values of the coefficients multiplied by lambda.

Coefficient Impact
- The L1 penalty in lasso regression can drive some coefficients to exactly zero when the lambda value is large enough, performing feature selection and resulting in a sparse model.

Feature Selection
- Lasso regression can set some coefficients to zero, effectively selecting the most relevant features and improving model interpretability.

Use Case
- Lasso regression is preferred when the goal is feature selection, resulting in a simpler and more interpretable model with fewer variables.

Model Complexity
- Lasso regression can lead to a less complex model by setting some coefficients to zero, reducing the number of effective parameters.

Interpretability
- Lasso regression can improve interpretability by selecting only the most relevant features, making the model's predictions more explainable.

Sparsity
- Lasso regression can produce sparse models by setting some coefficients to exactly zero.

Sensitivity
- More sensitive to outliers due to the absolute value in the penalty term.

### L1 (LASSO) vs. L2(Ridge) Regularization Techniques

The key difference is in how they assign penalties to the coefficients:

Ridge Regression:
- Performs L2 regularization, i.e., adds a penalty equivalent to the square of the magnitude of coefficients
    - Minimization objective = LS Obj + α * (sum of square of coefficients)

Lasso Regression:
- Performs L1 regularization, i.e., adds a penalty equivalent to the absolute value of the magnitude of coefficients
    - Minimization objective = LS Obj + α * (sum of the absolute value of coefficients)

LS Obj refers to the 'least squares objective', i.e., the linear regression objective without regularization.

#### Key Differences between Ridge and Lasso Regression
- Ridge regression reduces overfitting while keeping all the features present in the model.
    - It reduces the complexity of the model by shrinking the coefficients, whereas lasso regression reduces overfitting and also performs automatic feature selection.
- Lasso regression tends to set some coefficients to exactly zero, whereas ridge regression never sets a coefficient to exactly zero.

Recall the optimisation expression for ridge regression:

$$\min_{\beta} (RSS + \alpha\sum_{j=1}^pb_j^2)$$

where we attempt to minimise the RSS and some penalty term. This can be rewritten:

$$\min_{\beta} (RSS + \alpha(L2\_norm))$$

where $L2\_norm$ is the *sum of the squares of the coefficients*.

In LASSO regularisation, 
- we replace the $L2\_norm$ with what is known as the $L1\_norm$: the *sum of the _absolute_ values of the coefficients*.

This is a relatively recent adaptation of ridge regression which is capable of shrinking coefficients to exactly zero, effectively removing the corresponding predictors from the model entirely and creating what we call a sparse model (one which uses some subset of all of the available predictors).

LASSO achieves both shrinkage and subset selection.

A LASSO model is fit under the constraint of minimizing the following equation:

$$\sum_{i=1}^n(y_i-(a+\sum_{j=1}^pb_jx_{ij}))^2 + \alpha\sum_{j=1}^p|b_j|$$

which can be rewritten as follows:

$$\min_{\beta} (RSS + \alpha\sum_{j=1}^p|b_j|)$$

or,

$$\min_{\beta} (RSS + \alpha(L1\_norm))$$

Lasso regression performs L1 regularization, i.e., it adds a factor of the sum of the absolute value of coefficients in the optimization objective.

Objective = RSS + $\alpha$ * (sum of the absolute value of coefficients)

$\alpha$ (alpha) works similarly to that of ridge and provides a trade-off between RSS and the magnitude of coefficients. 

Like that of ridge, $\alpha$ can take various values.

- $\alpha$ = 0: Same coefficients as simple linear regression
- $\alpha$ = ∞: All coefficients zero (same logic as before)
- 0 < $\alpha$ < ∞: coefficients between 0 and that of simple linear regression
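
The difference between the two penalties can be sketched by counting exact zeros on synthetic data. The data and the value `alpha=0.5` are assumptions for illustration. Note also that scikit-learn's `Lasso` divides the RSS term by $2n$, so its `alpha` is not numerically comparable with the $\alpha$ in the formula above or with `Ridge`'s `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(3)
X_sp = rng.normal(size=(100, 10))
# Only the first two predictors matter in this made-up example
y_sp = 3.0 * X_sp[:, 0] - 2.0 * X_sp[:, 1] + rng.normal(scale=0.5, size=100)

lasso_demo = Lasso(alpha=0.5).fit(X_sp, y_sp)
ridge_demo = Ridge(alpha=0.5).fit(X_sp, y_sp)

# Count coefficients that are exactly zero under each penalty
n_zero_lasso = int(np.sum(lasso_demo.coef_ == 0))
n_zero_ridge = int(np.sum(ridge_demo.coef_ == 0))

print("Exact zeros, lasso:", n_zero_lasso)
print("Exact zeros, ridge:", n_zero_ridge)
```

Lasso drops irrelevant predictors outright, while ridge keeps every predictor with a small but non-zero coefficient.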

In [None]:
# Separate the features from the response
X = df.drop('ZAR/USD', axis=1)
y = df['ZAR/USD']

In [None]:
# Import the scaling module
from sklearn.preprocessing import StandardScaler

# Create standardization object
scaler = StandardScaler()

# Save standardized features into new variable
X_scaled = scaler.fit_transform(X)

In [None]:
# Import train/test split module
from sklearn.model_selection import train_test_split

# Split dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X_scaled, 
                                                    y, 
                                                    test_size=0.20,
                                                    random_state=1,
                                                    shuffle=False)

In [None]:
# Import LASSO module
from sklearn.linear_model import Lasso

# Create LASSO model object, setting alpha to 0.01
lasso = Lasso(alpha=0.01)

# Train the LASSO model
lasso.fit(X_train, y_train)

# Extract intercept from model
intercept = float(lasso.intercept_)

# Extract coefficient from model
coeff = pd.DataFrame(lasso.coef_, X.columns, columns=['Coefficient'])

# Extract intercept
print("Intercept:", float(intercept))

coeff

##### Interpretation of the intercept and coefficients

We interpret the values of the intercept and coefficients the same way as before:

 - The intercept can be interpreted as the **expected exchange rate when all the features are equal to their means**.
 - Each coefficient is interpreted as the expected change in the response variable given an increase of 1 in the **scaled feature value**.
 
See from the list of coefficients above that some of the coefficients have indeed been shrunk to exactly zero.

##### Assessment of predictive accuracy
We fit the following models as well, in order to compare the LASSO results thoroughly:

- A least squares model using all available predictors;
- A least squares model using the predictors with non-zero coefficients from LASSO;
- A ridge regression model using all available predictors.

In [None]:
# Fit a basic linear model
from sklearn.linear_model import LinearRegression, Ridge

X_subset = df.drop(['ZAR/USD',
                   'Total Reserves excl Gold (USD)',
                   'IMF Reserve Position (USD)',
                   'Claims on Non-residents (USD)',
                   'Central Bank Policy Rate',
                   'Treasury Bill Rate',
                   'Savings Rate',
                   'Deposit Rate',
                   'Lending Rate',
                   'Government Bonds'], axis=1)

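# Standardise the subset with its own scaler so the fitted `scaler` is left untouched
X_subset_scaled = StandardScaler().fit_transform(X_subset)

# Split the standardised subset (the original code split the unscaled subset)
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_subset_scaled, 
                                                        y, 
                                                        test_size=0.20, 
                                                        random_state=1,
                                                        shuffle=False)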

# Least squares using non-zero variables from LASSO
lm_subset = LinearRegression()

# Least squares using all predictors
lm_all = LinearRegression()

# Ridge using all predictors
ridge = Ridge()

lm_subset.fit(X_train2, y_train2)
lm_all.fit(X_train, y_train)
ridge.fit(X_train, y_train)

In [None]:
from sklearn import metrics

# Make training set predictions for each model
train_lm_subset = lm_subset.predict(X_train2)
train_lm_all = lm_all.predict(X_train)
train_ridge = ridge.predict(X_train)
train_lasso = lasso.predict(X_train)

# Make test set predictions for each model
test_lm_subset = lm_subset.predict(X_test2)
test_lm_all = lm_all.predict(X_test)
test_ridge = ridge.predict(X_test)
test_lasso = lasso.predict(X_test)

In [None]:
# Dictionary of results
results_dict = {'Training MSE':
                    {
                        "Least Squares, Subset": metrics.mean_squared_error(y_train2, train_lm_subset),
                        "Least Squares, All": metrics.mean_squared_error(y_train, train_lm_all),
                        "Ridge": metrics.mean_squared_error(y_train, train_ridge),
                        "LASSO": metrics.mean_squared_error(y_train, train_lasso)
                    },
                    'Test MSE':
                    {
                        "Least Squares, Subset": metrics.mean_squared_error(y_test2, test_lm_subset),
                        "Least Squares, All": metrics.mean_squared_error(y_test, test_lm_all),
                        "Ridge": metrics.mean_squared_error(y_test, test_ridge),
                        "LASSO": metrics.mean_squared_error(y_test, test_lasso)
                    }
                }

In [None]:
# Create dataframe from dictionary
results_df = pd.DataFrame(data=results_dict)

# View the results
results_df

##### Result interpretation

LASSO was able to perform subset selection, while also performing shrinkage. 
- The result is a more generalised model with greater predictive capacity. 

The least squares model which we trained on the same subset of variables that LASSO retained as non-zero scored a higher MSE on the test set, 
- indicating that the shrinkage that LASSO applied to those remaining variables was effective.

LASSO achieved the best MSE on the test set, followed by ridge regression.

##### Plot our results to end off.
Plot the test set against the three primary methods explored here:

- Least squares using all predictors;
- Ridge using all predictors;
- LASSO using all predictors.

In [None]:
# We want to plot the train and test response variables as a continuous line
# (Series.append was removed in pandas 2.0, so use pd.concat instead)
train_plot = pd.concat([y_train, y_test.iloc[:1]])

plt.plot(np.arange(96,120), lasso.predict(X_test), label='LASSO')
plt.plot(np.arange(96,120), ridge.predict(X_test), label='Ridge')
plt.plot(np.arange(96,120), lm_all.predict(X_test), label='Least Squares')
plt.plot(np.arange(96,120), y_test, label='Testing')
plt.legend()

plt.show()

#### Lasso regression in Sine / polynomial problem as below under GLMs

LASSO stands for Least Absolute Shrinkage and Selection Operator.

Two keywords here:
- absolute and
- selection.

In [None]:
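from sklearn.linear_model import Lasso

def lasso_regression(data, predictors, alpha, models_to_plot=None):
    # Avoid a mutable default argument
    models_to_plot = models_to_plot or {}

    # `normalize=True` was removed in scikit-learn 1.2, so reproduce it by hand:
    # centre each predictor and divide it by its L2 norm
    X = data[predictors]
    X_mean = X.mean()
    X_norm = np.sqrt(((X - X_mean) ** 2).sum())
    X_normed = (X - X_mean) / X_norm

    # Fit the model (max_iter must be an integer)
    lassoreg = Lasso(alpha=alpha, max_iter=100000)
    lassoreg.fit(X_normed, data['y'])
    y_pred = lassoreg.predict(X_normed)

    # Check if a plot is to be made for the entered alpha
    if alpha in models_to_plot:
        plt.subplot(models_to_plot[alpha])
        plt.tight_layout()
        plt.plot(data['x'], y_pred)
        plt.plot(data['x'], data['y'], '.')
        plt.title('Plot for alpha: %.3g' % alpha)

    # Return the result in pre-defined format, with coefficients on the original scale
    rss = sum((y_pred - data['y'])**2)
    coef = lassoreg.coef_ / X_norm.values
    intercept = lassoreg.intercept_ - (X_mean.values * coef).sum()
    ret = [rss, intercept]
    ret.extend(coef)
    return ret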

An additional parameter is defined in the Lasso function: 
- `max_iter`.
    - This is the maximum number of iterations for which we want the model to run if it doesn't converge before. 
    - This exists for Ridge as well, but setting this to a higher than default value was required in this case.

In [None]:
#Initialize predictors to all 15 powers of x
predictors=['x']
predictors.extend(['x_%d'%i for i in range(2,16)])

#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]

#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['alpha_%.2g'%alpha_lasso[i] for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)

#Define the models to plot
models_to_plot = {1e-10:231, 1e-5:232,1e-4:233, 1e-3:234, 1e-2:235, 1:236}

#Iterate over the 10 alpha values:
for i in range(10):
    coef_matrix_lasso.iloc[i,] = lasso_regression(data, predictors, alpha_lasso[i], models_to_plot)

Observations:

Model complexity decreases with an increase in the values of alpha. But notice the straight line at alpha=1.

Expected inference: 
- higher RSS for higher alphas
- For the same values of alpha, the coefficients of lasso regression are much smaller than those of ridge regression (compare row 1 of the two tables).
- For the same alpha, lasso has higher RSS (poorer fit) as compared to ridge regression.
- Many of the coefficients are zero, even for very small values of alpha.

Check the number of coefficients that are zero in each model.

In [None]:
coef_matrix_lasso.apply(lambda x: sum(x.values==0),axis=1)

Observations:

- Even for small values of alpha, a significant number of coefficients are zero. 
- This also explains the horizontal line fit for alpha=1 in the lasso plots; it's just a baseline model! 
- This phenomenon, where most coefficients become zero, is called **sparsity**. 
- Although lasso performs feature selection, we achieve this level of sparsity only in special cases.

#### Mathematics behind why coefficients are zero in the case of lasso but not ridge.

In [None]:
'''
LINEAR, RIDGE AND LASSO REGRESSION
'''
# importing required libraries
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Lasso, Ridge

# read test and train file
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print('\n\n---------DATA---------------\n\n')
print(train.head())

#splitting into training and test
## try building model with the different features and compare the result.
X = train.loc[:,['Outlet_Establishment_Year','Item_MRP']]
x_train, x_cv, y_train, y_cv = train_test_split(X,train.Item_Outlet_Sales,random_state=5)

print('--------Training Linear Regression Model---------------')
lreg = LinearRegression()
#training the model
lreg.fit(x_train,y_train)

#predicting on cv
pred = lreg.predict(x_cv)

#calculating mse
mse = np.mean((pred - y_cv)**2)
print('\nMean Squared Error = ',mse )

#Let us take a look at the coefficients of this linear regression model.
# calculating coefficients
coeff = DataFrame(x_train.columns)

coeff['Coefficient Estimate'] = Series(lreg.coef_)

print(coeff)

print('\n\nModel performance on Test data = ')
print(lreg.score(x_cv,y_cv))

print('\n\n---------Training Ridge Regression Model----------------')

ridge = Ridge()
ridge.fit(x_train,y_train)
pred1 = ridge.predict(x_cv)
mse_1 = np.mean((pred1-y_cv)**2)

print('\n\nMean Squared Error = ',mse_1)

# calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(ridge.coef_)
print(coeff)

print('\n\nModel performance on Test data = ')
print(ridge.score(x_cv,y_cv))


print('\n\n---------Training Lasso Regression Model----------------')

lasso = Lasso()
lasso.fit(x_train,y_train)
pred2 = lasso.predict(x_cv)
mse_2 = np.mean((pred2-y_cv)**2)

print('\n\nMean Squared Error = ',mse_2)

# calculating coefficients
coeff = DataFrame(x_train.columns)
coeff['Coefficient Estimate'] = Series(lasso.coef_)
print(coeff)

print('\n\nModel performance on Test data = ')
print(lasso.score(x_cv,y_cv))
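
The comparison above is empirical. As a hedged sketch of the mathematics, assume a single standardized predictor (or an orthonormal design) whose least-squares coefficient is `b_ols`. Under that assumption the ridge and lasso solutions have closed forms: ridge rescales the coefficient, while lasso shifts it toward zero and clips it at zero. The function names `ridge_shrink` and `lasso_soft_threshold` and the example values are illustrative, not library calls.

```python
import numpy as np

def ridge_shrink(b_ols, lam):
    # Minimizer of 0.5*(b_ols - b)**2 + 0.5*lam*b**2:
    # proportional shrinkage, so a non-zero coefficient never becomes exactly zero
    return b_ols / (1.0 + lam)

def lasso_soft_threshold(b_ols, lam):
    # Minimizer of 0.5*(b_ols - b)**2 + lam*|b|:
    # the coefficient moves toward zero by lam and is set to zero if |b_ols| <= lam
    return np.sign(b_ols) * np.maximum(np.abs(b_ols) - lam, 0.0)

b_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical least-squares coefficients
lam = 0.5
ridge_coefs = ridge_shrink(b_ols, lam)
lasso_coefs = lasso_soft_threshold(b_ols, lam)
print("ridge:", ridge_coefs)
print("lasso:", lasso_coefs)
```

In this simplified setting the two small coefficients survive under ridge but are removed by lasso, which is the mechanism behind the sparsity seen in the tables.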

### 4. Generalized Linear Models (GLMs)
What It Means: 
- GLMs extend linear regression by allowing different distributions for the outcome variable
    - for example, Poisson for count data or binomial for binary data. 
- A link function relates the mean of the outcome variable to a linear combination of the predictors.

Outcome Interpretation: 
- The coefficients explain how each predictor affects the mean outcome on the scale of the link function, given the distribution.

Performance Measures:
- Deviance: Measures how well the model fits compared to a saturated (perfect-fit) model; lower values are better.

Lay Explanation: 
- GLMs are like flexible versions of linear regression that can handle different data types (like counts or binary data), giving predictions that respect the data's nature.

Use Case: 
- Extends linear regression for non-normal distributions (e.g., Poisson regression for count data).

Model Types: 
- Poisson regression, 
- Binomial regression.


In [None]:
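import statsmodels.api as sm
# statsmodels does not add an intercept automatically, so add a constant column
X_train_const = sm.add_constant(X_train)
X_test_const = sm.add_constant(X_test)
poisson_model = sm.GLM(y_train, X_train_const, family=sm.families.Poisson()).fit()
predictions = poisson_model.predict(X_test_const)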
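
The deviance listed under Performance Measures can be computed by hand for a Poisson model. The sketch below assumes the log link, so the mean is the exponential of the linear predictor. The data, the coefficients `b0` and `b1`, and the helper name `poisson_deviance` are made up for illustration and are not fitted values.

```python
import numpy as np

def poisson_deviance(y, mu):
    # D = 2 * sum( y*log(y/mu) - (y - mu) ), using the convention y*log(y/mu) = 0 when y == 0
    y = np.asarray(y, dtype=float)
    mu = np.asarray(mu, dtype=float)
    ratio = np.where(y > 0, y / mu, 1.0)
    return 2.0 * np.sum(y * np.log(ratio) - (y - mu))

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1, 2, 4, 9])

# Log link: log(mean) = b0 + b1*x, so mean = exp(b0 + b1*x)
b0, b1 = 0.0, 0.7                      # hypothetical coefficients
mu = np.exp(b0 + b1 * x)

dev_model = poisson_deviance(y, mu)
dev_saturated = poisson_deviance(y, y)  # a model that reproduces y exactly
print(dev_model, dev_saturated)
```

The saturated model has zero deviance, so the model's deviance measures how far its fitted means are from reproducing the observed counts.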

#### Why penalizing the magnitude of coefficients should work in the first place

To understand the impact of model complexity on the magnitude of coefficients, we simulate a sine curve (between 60° and 300°) and add some random noise.

The resulting data resembles a sine curve, but not exactly, because of the noise.

We then estimate the sine function using polynomial regression with powers of x from 1 to 15, adding a column for each power up to 15 to the dataframe.

In [None]:
#Importing libraries. The same will be used throughout the article.
import numpy as np
import pandas as pd
import random

import matplotlib.pyplot as plt
from matplotlib.pylab import rcParams
rcParams['figure.figsize'] = 12, 10

#Define input array with angles from 60deg to 300deg converted to radians
x = np.array([i*np.pi/180 for i in range(60,300,4)])
np.random.seed(10)  #Setting seed for reproducibility
y = np.sin(x) + np.random.normal(0,0.15,len(x))
data = pd.DataFrame(np.column_stack([x,y]),columns=['x','y'])
plt.plot(data['x'],data['y'],'.')
plt.show()

In [None]:
for i in range(2,16):  #power of 1 is already there
    colname = 'x_%d'%i      #new var will be x_power
    data[colname] = data['x']**i
print(data.head()) # add a column for each power upto 15 

#### Making 15 Different Linear Regression Models

Now that we have all 15 powers, let's make 15 different linear regression models, with each model containing the powers of x from 1 up to that model's number.

Define a generic function that takes the required maximum power of x as input and returns a list containing 
- model RSS, 
- intercept, 
- coef_x, 
- coef_x2, ... up to the entered power 

Here RSS refers to the "Residual Sum of Squares", which is the sum of squared errors between the predicted and actual values in the training data set; it serves as the cost (loss) function.

The function plots the model fit only for selected powers, but returns the RSS and coefficient values for every model.

In [None]:
# Import Linear Regression model from scikit-learn.
from sklearn.linear_model import LinearRegression

def linear_regression(data, power, models_to_plot):
    #initialize predictors:
    predictors=['x']
    if power>=2:
        predictors.extend(['x_%d'%i for i in range(2,power+1)])
    
    #Fit the model
    # `normalize` was removed in scikit-learn 1.2; least-squares predictions do not depend on feature scaling
    linreg = LinearRegression()
    linreg.fit(data[predictors],data['y'])
    y_pred = linreg.predict(data[predictors])
    
    #Check if a plot is to be made for the entered power
    if power in models_to_plot:
        plt.subplot(models_to_plot[power])
        plt.tight_layout()
        plt.plot(data['x'],y_pred)
        plt.plot(data['x'],data['y'],'.')
        plt.title('Plot for power: %d'%power)
    
    #Return the result in pre-defined format
    rss = sum((y_pred-data['y'])**2)
    ret = [rss]
    ret.extend([linreg.intercept_])
    ret.extend(linreg.coef_)
    return ret

##### Store all the Results in Pandas Dataframe

Store all the results in a Pandas dataframe and plot 6 models to get an idea of the trend.

Expectation: models with increasing complexity fit the data better and result in lower RSS values.
- As the model complexity increases, the models tend to fit even smaller deviations in the training data set. 
- Though this leads to overfitting, let's set that issue aside for now and come to our main objective, i.e., the impact on the magnitude of coefficients.

In [None]:
# Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef_x_%d'%i for i in range(1,16)]
ind = ['model_pow_%d'%i for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)

# Define the powers for which a plot is required:
models_to_plot = {1:231,3:232,6:233,9:234,12:235,15:236}

# Iterate through all powers and assimilate results
for i in range(1,16):
    coef_matrix_simple.iloc[i-1,0:i+2] = linear_regression(data, power=i, models_to_plot=models_to_plot)

In [None]:
#Set the display format to be scientific for ease of analysis
pd.options.display.float_format = '{:,.2g}'.format
coef_matrix_simple

It's evident that the size of the coefficients increases exponentially with an increase in model complexity.
- Intuition: this is why putting a constraint on the magnitude of coefficients can be a good way to reduce model complexity.

##### Significance of Large Coefficients

A large coefficient means that we're putting a lot of emphasis on that feature, i.e., the model treats the feature as a strong predictor of the outcome. 
- When a coefficient becomes too large, the algorithm starts modeling intricate relations to estimate the output and ends up overfitting the particular training data.

Solution
- Ridge and lasso regression, which penalize the magnitude of the coefficients.
- Both are applied to this same problem to see how well they work.

### 5. Decision Trees and Random Forests
What It Means: 
- Decision trees split data based on conditions, creating branches that lead to a prediction. 
- Random forests use multiple trees to improve accuracy and reduce overfitting.

Outcome Interpretation: 
- Each "branch" shows how different conditions affect the outcome, 
- and random forests average the results of many trees for robust predictions.

Performance Measures:
- Accuracy: Proportion of correctly classified samples.
- Gini Index / Entropy: Measure the impurity of the nodes produced by a split; lower values mean purer nodes.

Lay Explanation: 
- Decision trees are like flowcharts that guide predictions based on conditions. 
- Random forests combine many trees to make stronger, more reliable decisions.

Use Case: 
- For classification or regression problems with non-linear relationships and high dimensionality.

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

tree_model = DecisionTreeClassifier()
tree_model.fit(X_train, y_train)
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
predictions_tree = tree_model.predict(X_test)
predictions_rf = rf_model.predict(X_test)
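
The Gini index and entropy listed under Performance Measures can be computed by hand for a single node. This is a minimal sketch with made-up labels; the helper names `gini` and `entropy` are illustrative and are not scikit-learn functions.

```python
import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum(p_k^2) over the class proportions p_k in the node
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Shannon entropy in bits: -sum(p_k * log2(p_k))
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

pure_node = np.array([1, 1, 1, 1])     # every sample has the same class
mixed_node = np.array([0, 0, 1, 1])    # 50/50 split between two classes

print(gini(pure_node), entropy(pure_node))     # both are zero for a pure node
print(gini(mixed_node), entropy(mixed_node))   # 0.5 and 1.0 for a 50/50 split
```

A tree chooses the split whose child nodes have the lowest weighted impurity, which is why lower values are preferred.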


# Classification via Mathematics Functions

Classification Using the Equation of a Straight Line

Steps:

1. Begin with the Equation of a Line: The general equation of a straight line in a 2D plane is:

$$ y = m \times x + c $$

- m: Slope of the line (how steep it is)
- c: Intercept (where the line crosses the y-axis)

2. Connect it to Classification:
- In binary classification, the goal is to separate two classes (e.g., Class 0 and Class 1).
- The equation of a line can act as a decision boundary: 
    - points on one side of the line belong to Class 0, while 
    - points on the other side belong to Class 1.

3. Interactive Example: 
- Imagine a dataset with two features, $x_1$ and $x_2$.

For simplicity:
- $x_1$: Horizontal axis
- $x_2$: Vertical axis

A simple decision boundary can be represented as:
$$ x_2 = m \times x_1 + c $$

**How the Slope (m) and Intercept (c) Influence the Boundary**

The slope and intercept determine the orientation and position of the decision boundary in the feature space.

- Slope (m):
    - Controls the steepness or angle of the line.
    - A larger absolute value of m means the line is steeper; a smaller absolute value means it is flatter.
    - Example: In $ x_2 = m \times x_1 + c $
        - If m > 0, the line slopes upward.
        - If m < 0, the line slopes downward.
        - If m = 0, the line is horizontal.
- Intercept (c):
    - Determines where the line crosses the $x_2$ (vertical) axis.
    - Changing c shifts the line up or down without changing its slope.
    - Example: If c=1, the line crosses the $x_2$ axis at 1.

Together, m and c define how the decision boundary separates the feature space. Adjusting these values can change which points fall into Class 0 or Class 1.

4. Decision Boundary in Classification: Modify the equation to reflect classification logic:
$$ x_2 - m \times x_1 - c = 0 $$

- Points where this equation equals 0 lie exactly on the line.
- Points where $ x_2 - m \times x_1 - c > 0 $ belong to Class 1.
- Points where $ x_2 - m \times x_1 - c < 0 $ belong to Class 0.
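
The decision rule above can be sketched in a few lines. The function name `classify_by_line` and the sample points are hypothetical; points that fall exactly on the line are assigned to Class 0 here, which is an arbitrary tie-break.

```python
import numpy as np

def classify_by_line(x1, x2, m, c):
    # Class 1 if the point lies above the line x2 = m*x1 + c, otherwise Class 0
    score = x2 - m * x1 - c
    return (score > 0).astype(int)

m, c = 2.0, 0.0
x1 = np.array([0.0, 0.5, 1.0, 1.0])
x2 = np.array([1.0, 0.2, 2.5, 2.0])

labels = classify_by_line(x1, x2, m, c)
print(labels)
```

Changing `m` or `c` moves the line and can flip the label of points close to it.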

5. Visualization: Plot this line on a 2D plane with some example data points:
- Red points for Class 0
- Blue points for Class 1
- The line $ x_2 = m \times x_1 + c $ separates the two classes

6. Extend to Higher Dimensions: In higher dimensions, the decision boundary becomes a hyperplane:

$$ w_1 x_1 + w_2 x_2 + ... + w_n x_n + b = 0 $$

- Where: 
    - $w_1, w_2, ..., w_n$ are weights (equivalent to slopes) and
    - $b$ is the intercept.

**What Happens When the Data Points Overlap Significantly?**

When data points from different classes overlap, the decision boundary may not cleanly separate the two classes, leading to misclassification. Here's what happens:

Misclassification:
- Points from one class appear on the "wrong" side of the decision boundary.
- This results in a classification error (false positives or false negatives).

Impact on Model:
- A linear decision boundary (a straight line) may not be flexible enough to separate overlapping or complex distributions.
- Performance metrics like accuracy, precision, and recall can degrade.

Example Scenario:
- Consider a dataset where the two classes form concentric circles. A straight-line boundary cannot separate the classes, leading to significant misclassification.

**Transition from Linear to Non-Linear Decision Boundaries**

Linear decision boundaries work well when data is linearly separable. However, real-world data is often complex, requiring non-linear boundaries. Here's how we transition:

Extend the Feature Space:
- Use techniques like polynomial features to introduce non-linear relationships.
    - where $x_1$ and $x_2$ can be transformed to:
        - $x_1^2$ and $x_2^2$
        - $x_1 \times x_2$
- The linear classifier now operates in this transformed space, creating a non-linear boundary in the original feature space.
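
As a hedged illustration of the feature-space idea, the sketch below builds two noise-free concentric circles and adds the squared features $x_1^2$ and $x_2^2$. The radii and the threshold are chosen by hand for this example rather than learned by a classifier.

```python
import numpy as np

rng = np.random.default_rng(0)
angles = rng.uniform(0, 2 * np.pi, 100)

# Inner circle (radius 0.5) is Class 1, outer circle (radius 1.0) is Class 0
inner = 0.5 * np.column_stack([np.cos(angles[:50]), np.sin(angles[:50])])
outer = 1.0 * np.column_stack([np.cos(angles[50:]), np.sin(angles[50:])])
X = np.vstack([inner, outer])
y = np.array([1] * 50 + [0] * 50)

# Added features: x1^2 and x2^2. The rule x1^2 + x2^2 < 0.75^2 is linear in
# these features, but it traces a circle of radius 0.75 in the original space.
squared = X ** 2
pred = (squared.sum(axis=1) < 0.75 ** 2).astype(int)
accuracy = (pred == y).mean()
print(accuracy)
```

No straight line in the original $(x_1, x_2)$ plane separates these classes, yet a single linear threshold on the squared features does.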

1. Kernel Methods (e.g., in SVMs):
- Apply kernel functions like RBF (Radial Basis Function) to map data into a higher-dimensional space where it is linearly separable.
- The decision boundary in the original space appears non-linear.

2. Neural Networks:
- Multi-layer perceptrons (MLPs) can learn complex, non-linear decision boundaries by stacking layers of non-linear activation functions.
- Neural networks are particularly powerful for high-dimensional and unstructured data.

3. Ensemble Models:
- Techniques like random forests or gradient boosting combine multiple weak learners to create flexible decision boundaries.

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Example data
np.random.seed(0)
# Class 0 lies below the line and Class 1 above it, matching the rule x2 - m*x1 - c > 0 => Class 1
x1_class0 = np.random.rand(50)
x2_class0 = 2 * x1_class0 - 0.5 + np.random.normal(0, 0.1, 50)
x1_class1 = np.random.rand(50)
x2_class1 = 2 * x1_class1 + 0.5 + np.random.normal(0, 0.1, 50)

# Equation of line: x2 = m*x1 + c
m = 2  # slope
c = 0  # intercept
x_line = np.linspace(0, 1, 100)
y_line = m * x_line + c

# Plot
plt.figure(figsize=(8, 6))
plt.scatter(x1_class0, x2_class0, color='red', label='Class 0')
plt.scatter(x1_class1, x2_class1, color='blue', label='Class 1')
plt.plot(x_line, y_line, color='black', label='Decision Boundary')
plt.xlabel('x1')
plt.ylabel('x2')
plt.legend()
plt.title('Linear Decision Boundary for Classification')
plt.grid()
plt.show()

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import Pipeline

# Generate non-linear dataset
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=0)

# Plot raw data
plt.figure(figsize=(6, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', label='Class 0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', label='Class 1')
plt.title('Non-linear Data')
plt.legend()
plt.show()

# Linear decision boundary (fails for non-linear data)
linear_svm = SVC(kernel='linear', C=1)
linear_svm.fit(X, y)

# Non-linear decision boundary using kernel trick
nonlinear_svm = SVC(kernel='rbf', C=1, gamma=2)
nonlinear_svm.fit(X, y)

# Visualize decision boundaries
def plot_decision_boundary(clf, X, y, title):
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
    plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', edgecolor='k', label='Class 0')
    plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', edgecolor='k', label='Class 1')
    plt.title(title)
    plt.legend()
    plt.show()

# Linear decision boundary
plot_decision_boundary(linear_svm, X, y, title='Linear Decision Boundary (Fails)')

# Non-linear decision boundary
plot_decision_boundary(nonlinear_svm, X, y, title='Non-linear Decision Boundary (Succeeds)')


### **Linear Discriminant Analysis (LDA)**
Linear Discriminant Analysis (LDA) is a classification technique that uses a linear combination of features to separate classes. 

It assumes:
- The data within each class is normally distributed.
- The covariance of each class is identical (homoscedasticity).

LDA works by finding a linear decision boundary that maximizes the separation between classes.

Goal:
- Project data onto a lower-dimensional space (usually 1D for binary classification).
- Maximize the distance between class means while minimizing the variance within each class.

**Calculating the best values for the parameters of a linear discriminant**
- Estimate the coefficients that define the linear decision boundary based on your dataset.
- In Linear Discriminant Analysis (LDA), these coefficients are derived by maximizing the separation between the means of the classes while minimizing the variance within each class.

Steps:
1. Define the Linear Discriminant Function

The linear discriminant function for binary classification can be written as:

$$ y = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d $$

- where:
    - $w_0$: Intercept (bias term).
    - $w_1, w_2, ..., w_d$: Coefficients for each feature $x_1, x_2, ..., x_d$
    - y: The decision score. A threshold is applied to classify points.

2. Estimate Class Statistics

To compute the parameters, you first need the following statistics from the data:

- Compute Class Means ($\mu_0$ and $\mu_1$): Calculate the mean vector for each class.
    - For each class $C_k$ ($k = 0, 1$):
$$ \mu_k = \frac{1}{N_k} \sum_{x \in C_k} x $$
- where 
    - $N_k$ is the number of instances in class k.

- Compute Pooled Covariance Matrix ($S_w$):
    - Within-Class Scatter Matrix ($S_w$): Measures the spread of points within each class.
$$ S_w = \sum_{i=1}^{c} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T $$
    - Divide by the total number of samples (N) to get the pooled covariance matrix.
- Compute Between-Class Scatter Matrix ($S_b$): Measures the separation between class means.
$$ S_b = \sum_{i=1}^{c} N_i (\mu_i - \mu)(\mu_i - \mu)^T $$
- where 
    - $c$ is the number of classes,
    - $N_i$ is the number of samples in class i,
    - $\mu$ is the overall mean.

- Prior Probabilities (P($C_0$) and P($C_1$)):
    - These are the proportions of each class in the dataset:
$$ P(C_k) = \frac{N_k}{N} $$

3. Compute the Parameters

- Find Optimal Projection: Solve the eigenvalue problem for $S_w^{-1} S_b$ and select the eigenvector with the largest eigenvalue.
- For two classes, the parameters of the discriminant function have a closed form:
    - Linear Coefficients ($w$):
$$ w = S_w^{-1} (\mu_1 - \mu_0) $$
    - Intercept ($w_0$):
$$ w_0 = -\frac{1}{2} (\mu_1^T S_w^{-1} \mu_1 - \mu_0^T S_w^{-1} \mu_0) + \ln \frac{P(C_1)}{P(C_0)} $$
- where:
    - $S_w^{-1}$ is the inverse of the pooled covariance matrix.

4. Predict Class Labels (Decision Rule)

- Project data points onto the linear discriminant.
    - For a new instance $x$, compute the linear discriminant score:
$$ y = w_0 + w^T x $$

- Use a threshold to classify points. With the intercept $w_0$ defined above, the threshold is 0, which corresponds to the midpoint between the class means when the priors are equal.
- Classify based on the threshold (usually $y > 0 \Rightarrow C_1$, $y \leq 0 \Rightarrow C_0$).

**Interpretation of Parameters**

Linear Coefficients ($w$):
- What does a feature weight represent?
    - A feature weight (coefficient) indicates the change in the predicted outcome associated with a unit change in the feature, keeping all other features constant.
    - These determine how much each feature contributes to the decision boundary.

- A larger magnitude of $w_i$ means the corresponding feature $x_i$ has more influence.
    - Interpreting the Magnitude (Absolute Value):
        - Larger Magnitudes: Indicate that a feature has a stronger effect on the outcome.
        - Smaller Magnitudes: Suggest that the feature has less influence on the outcome.
    - Interpreting the Sign (Positive or Negative)
        - Positive Weight: Indicates a positive relationship between the feature and the outcome.
        - Negative Weight: Indicates a negative relationship between the feature and the outcome.

- Impact of Scaling on Magnitudes
    - Coefficient magnitudes are comparable only if the features are on the same scale. If features differ in scale:
        - A feature recorded in large numeric units tends to receive a small coefficient, and one recorded in small units a large coefficient, regardless of its relative importance.
        - Standardizing or normalizing the features (e.g., using z-scores or min-max scaling) ensures that coefficient magnitudes are comparable.
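
To make the effect of scale concrete, the sketch below fits the same simulated relationship twice, once with the feature in metres and once in centimetres, and then again after z-scoring. The data and the helper names `slope` and `zscore` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
length_m = rng.uniform(1, 10, 200)                # feature measured in metres
y = 3.0 * length_m + rng.normal(0, 0.1, 200)      # outcome depends on length

def slope(x, y):
    # Least-squares slope of y on x, with an intercept column
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[1]

def zscore(x):
    # Standardize to mean 0 and standard deviation 1
    return (x - x.mean()) / x.std()

slope_m = slope(length_m, y)
slope_cm = slope(length_m * 100.0, y)             # same feature in centimetres
slope_z_m = slope(zscore(length_m), y)
slope_z_cm = slope(zscore(length_m * 100.0), y)
print(slope_m, slope_cm)
print(slope_z_m, slope_z_cm)
```

The raw coefficient changes by a factor of 100 when only the unit changes, while the standardized coefficients agree, which is why standardization is needed before comparing magnitudes.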

Linear Regression Example
- Model: Predict house price (y) using square footage ($x_1$) and number of bedrooms ($x_2$):

$$ y = w_0 + w_1 x_1 + w_2 x_2 $$
$$ y = 50 + 300 x_1 + 10000 x_2 $$

- Interpretation:
    - $w_1$ = 300: Increasing square footage by 1 unit increases the house price by R300.
    - $w_2$ = 10,000: Adding one bedroom increases the house price by R10,000.

1. Interpretation of Weight Magnitude in Logistic Regression
- In logistic regression, weights do not directly represent the change in the outcome; they represent the change in the log-odds of the outcome.

$$ \log(\frac{P(y = 1)}{P(y = 0)}) = w_0 + w_1 x_1 + w_2 x_2 + ... + w_d x_d $$

- Exponentiated coefficients ($e^{w_i}$) indicate the multiplicative effect on the odds for a unit change in $x_i$.

Logistic Regression Example
- Model: Predict customer churn (y) based on monthly charges ($x_1$) and contract length ($x_2$):

$$ \log(\frac{P(y = 1)}{P(y = 0)}) = w_0 + w_1 x_1 + w_2 x_2 $$
$$ \log(\frac{P(y = 1)}{P(y = 0)}) = -3 + 0.05 x_1 + 2 x_2 $$

- Interpretation:
    - $w_1$ = 0.05: For every R1 increase in monthly charges, the log-odds of churn increase by 0.05.
    - $w_2$ = 2: For each additional month of contract length, the log-odds of churn increase by 2.
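
The log-odds in this example can be turned into odds ratios and probabilities. The sketch below uses the coefficients from the churn equation (-3, 0.05 and 2); the function name `churn_probability` and the input values are hypothetical.

```python
import math

w0, w1, w2 = -3.0, 0.05, 2.0   # coefficients from the churn equation above

# Multiplicative effect on the odds for a one-unit increase in each feature
odds_ratio_charges = math.exp(w1)
odds_ratio_contract = math.exp(w2)

def churn_probability(x1, x2):
    # Convert the log-odds to a probability with the logistic (sigmoid) function
    log_odds = w0 + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-log_odds))

p_low = churn_probability(20, 1)    # log-odds = -3 + 1 + 2 = 0
p_high = churn_probability(40, 1)   # log-odds = -3 + 2 + 2 = 1
print(round(odds_ratio_charges, 3), round(odds_ratio_contract, 3))
print(p_low, p_high)
```

A constant change in log-odds does not translate into a constant change in probability, so it is usually clearer to report odds ratios or predicted probabilities for specific cases.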

2. Interpretation of Weight Magnitude in Regularized Models (Lasso and Ridge)
- Coefficients may be shrunk or set to zero based on regularization strength, which impacts their magnitude.
- Regularization makes it more likely that the weights that remain large correspond to features that matter for prediction, although it does not guarantee this.

Intercept ($w_0$):
- Adjusts the position of the decision boundary.

Decision Rule:
- $y > 0$: Class 1.
- $y \leq 0$: Class 0.

Considerations for interpretations

Multicollinearity:
- If features are highly correlated, the magnitude of weights can become unstable and misleading.
    - Techniques like Variance Inflation Factor (VIF) or regularization can mitigate this.
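
As a minimal sketch of the Variance Inflation Factor, each feature is regressed on the remaining features and $VIF_j = 1 / (1 - R_j^2)$ is reported. The helper name `vif` and the simulated columns are assumptions for this illustration; statistical libraries provide their own implementations.

```python
import numpy as np

def vif(X):
    # VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j on the other columns
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)   # nearly a copy of a
c = rng.normal(size=500)                  # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns receive very large values, while the unrelated column stays close to 1.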

Standardization:
- Always standardize features to ensure meaningful comparisons between coefficients.

Model-Specific Meaning:
- Interpretations vary slightly across linear regression, logistic regression, and other models.
    - In logistic regression, remember that coefficients affect the log-odds, not the raw probabilities.

**Interpretation of Results**

Confusion Matrix and Classification Report:
- The confusion matrix indicates true positives, true negatives, false positives, and false negatives.
- The classification report shows metrics like precision, recall, F1-score, and accuracy.
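
The metrics in the classification report follow directly from the four confusion matrix counts. This sketch uses made-up labels and treats Class 1 as the positive class.

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

# Confusion matrix counts, with Class 1 as the positive class
tp = int(np.sum((y_true == 1) & (y_pred == 1)))
tn = int(np.sum((y_true == 0) & (y_pred == 0)))
fp = int(np.sum((y_true == 0) & (y_pred == 1)))
fn = int(np.sum((y_true == 1) & (y_pred == 0)))

precision = tp / (tp + fp)          # of the predicted positives, how many are correct
recall = tp / (tp + fn)             # of the actual positives, how many are found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / len(y_true)

print(tp, tn, fp, fn)
print(precision, recall, f1, accuracy)
```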

Decision Boundary:
- The plot shows the LDA decision boundary, which is linear. 
    - It separates the two classes by maximizing the ratio of between-class variance to within-class variance.
- Data points on either side of the boundary are classified into their respective classes.

**Assumptions and Limitations:**

Assumptions:
- Classes have a normal distribution.
- Classes share the same covariance matrix.

Limitations:
- LDA struggles with non-linear boundaries or when the assumptions of normality and homoscedasticity are violated.

**When to Use LDA**

Advantages:
- Works well when the data satisfies its assumptions.
- Provides interpretable results with clear decision boundaries.

Use Cases:
- Medical diagnosis (e.g., distinguishing between disease states).
- Marketing (e.g., classifying customer preferences).
- Text classification (when transformed into vector space).

Not Suitable:
- When classes are non-linearly separable (use non-linear methods like quadratic discriminant analysis or kernel methods in such cases).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

# Generate synthetic data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0,
                           n_clusters_per_class=1, class_sep=2, random_state=42)

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Visualize the data
plt.figure(figsize=(8, 6))
plt.scatter(X[y == 0][:, 0], X[y == 0][:, 1], color='red', label='Class 0')
plt.scatter(X[y == 1][:, 0], X[y == 1][:, 1], color='blue', label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Synthetic Dataset')
plt.legend()
plt.show()

# Apply LDA
lda = LDA()
lda.fit(X_train, y_train)

# Predictions
y_pred = lda.predict(X_test)

# Model evaluation
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))

# Visualize decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)

plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
plt.scatter(X_test[y_test == 0][:, 0], X_test[y_test == 0][:, 1], color='red', edgecolor='k', label='Class 0')
plt.scatter(X_test[y_test == 1][:, 0], X_test[y_test == 1][:, 1], color='blue', edgecolor='k', label='Class 1')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('LDA Decision Boundary')
plt.legend()
plt.show()


In [None]:
import numpy as np

# Example data: two classes, each with 2 features.
# Points within a class must not all lie on one line, otherwise S_w is singular.
class_0 = np.array([[2, 3], [3, 5], [4, 4]])
class_1 = np.array([[6, 8], [7, 10], [8, 9]])

# Combine data and calculate class statistics
X = np.vstack([class_0, class_1])
y = np.array([0] * len(class_0) + [1] * len(class_1))

# Calculate class means
mu_0 = np.mean(class_0, axis=0)
mu_1 = np.mean(class_1, axis=0)

# Calculate within-class scatter matrix
S_w = np.zeros((X.shape[1], X.shape[1]))
for xi in class_0:
    S_w += np.outer(xi - mu_0, xi - mu_0)
for xi in class_1:
    S_w += np.outer(xi - mu_1, xi - mu_1)

# Divide by the number of samples to get the pooled covariance matrix, then invert once
S_w_inv = np.linalg.inv(S_w / len(X))

# Calculate linear coefficients
w = S_w_inv.dot(mu_1 - mu_0)

# Calculate intercept
prior_0 = len(class_0) / len(X)
prior_1 = len(class_1) / len(X)
intercept = -0.5 * (mu_1.T @ S_w_inv @ mu_1 - mu_0.T @ S_w_inv @ mu_0) + np.log(prior_1 / prior_0)

# Display results
print("Linear Coefficients (w):", w)
print("Intercept (w0):", intercept)

# Predict for a new sample
sample = np.array([5, 6])
decision_score = intercept + w.T.dot(sample)
prediction = 1 if decision_score > 0 else 0
print("Prediction for sample {}: Class {}".format(sample, prediction))

### Optimizing the Objective Function in a Linear Discriminant Model

Objective of a Linear Discriminant Analysis (LDA) model is to:
- Find a linear combination of features that best separates two or more classes. This is achieved by 
    - optimizing an objective function that 
        - maximizes the separation between classes while 
        - minimizing the spread (variance) within each class.

##### **Objective Function of LDA**
The objective function in LDA is based on two key matrices:
1. Between-Class Variance ($S_B$):
    - Measures the separation between the class means.
    - Defined as:
$$ S_B = \sum_{i=1}^{k} n_i (\mu_i - \mu)(\mu_i - \mu)^T $$

- where:
    - k: Number of classes.
    - $n_i$: Number of instances in class i
    - $\mu_i$: Mean vector of class i
    - $\mu$: Overall mean vector.

2. Within-Class Variance ($S_W$): 
    - Measures the spread of data points within each class.
    - Defined as:
$$ S_W = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \mu_i)(x - \mu_i)^T $$

- where:
    - $C_i$ represents all instances belonging to class i.

The objective function to optimize in LDA is:

$$ J(w) = \frac{w^T S_B w}{w^T S_W w}$$

- Where:
    - w : is the weight vector that defines the linear discriminant

##### **Optimizing the Objective Function**
To maximize J(w):
1. Solve the generalized eigenvalue problem:
$$ S_W^{-1} S_B w = \lambda w $$

- Where:
    - $\lambda$ is the eigenvalue and 
    - w is the eigenvector.

2. Select the eigenvector corresponding to the largest eigenvalue $\lambda_1$,  as it maximizes the class separation.

3. For multiclass problems, select the top k-1 eigenvectors (for k classes) to project data into a lower-dimensional space with maximum discrimination. 
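
The optimization steps above can be sketched with NumPy for a two-class toy dataset. The data points are made up, and the check at the end relies on the two-class closed form, in which the leading eigenvector is parallel to $S_W^{-1}(\mu_b - \mu_a)$.

```python
import numpy as np

class_a = np.array([[2.0, 3.0], [3.0, 5.0], [4.0, 4.0]])
class_b = np.array([[6.0, 8.0], [7.0, 10.0], [8.0, 9.0]])
X = np.vstack([class_a, class_b])
mu = X.mean(axis=0)                                       # overall mean

S_W = np.zeros((2, 2))
S_B = np.zeros((2, 2))
for cls in (class_a, class_b):
    mu_i = cls.mean(axis=0)
    centered = cls - mu_i
    S_W += centered.T @ centered                          # within-class scatter
    S_B += len(cls) * np.outer(mu_i - mu, mu_i - mu)      # between-class scatter

# Eigen-decomposition of S_W^{-1} S_B; the product is not symmetric, so use eig rather than eigh
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w = eigvecs[:, np.argmax(eigvals.real)].real

# Two-class closed form for comparison
w_direct = np.linalg.solve(S_W, class_b.mean(axis=0) - class_a.mean(axis=0))
cosine = abs(w @ w_direct) / (np.linalg.norm(w) * np.linalg.norm(w_direct))
print(w, cosine)
```

Eigenvectors are defined only up to scale and sign, so the comparison uses the absolute cosine between the two directions instead of comparing the vectors element by element.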

### Scoring and Ranking Instances
Once the linear discriminant function is computed, it can be used to score and rank instances as follows:

Scoring:
- The discriminant score for an instance x is calculated as:

$$ y = w^T x + b$$

- Where
    - w is the optimized weight vector.
    - b is the intercept (bias term).
    - y is the scalar discriminant score.

- The score indicates how far x lies from the decision boundary:
    - Positive scores suggest the instance is likely to belong to one class.
    - Negative scores suggest the instance is likely to belong to the other class.

Ranking:
- Instances can be ranked based on their discriminant scores y.
    - Larger absolute scores indicate greater confidence in classification.
    - Instances closer to zero are near the decision boundary, indicating uncertainty.

##### Practical Example

Given Dataset
Suppose you have two classes (Class A and Class B) and two features $x_1, x_2$

Steps to Optimize and Use the Objective Function:
1. Compute Class Means:
- $\mu_A$ and $\mu_B$ are the mean vectors for Class A and Class B.
- $\mu$ is the overall mean.

2. Compute Variance Matrices:
- Calculate $S_B, S_W$

3. Solve for w:
- Find the eigenvector corresponding to the largest eigenvalue of $S_W^{-1} S_B$

4. Calculate Scores:
- For each instance $x_i$, compute the discriminant score:
$$ y_i = w^T x_i + b$$

5. Rank Instances:
- Sort instances by their discriminant scores to rank them by their likelihood of belonging to a specific class
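
The five steps above can be sketched in plain NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the toy data is made up, $S_B$ is weighted by class size, and the intercept $b$ is chosen so that the midpoint between the two class means scores zero (which corresponds to equal class priors).

```python
import numpy as np

# Hypothetical toy data: two classes (A = 0, B = 1) with two features each
rng = np.random.default_rng(0)
X_a = rng.normal(loc=[2.0, 3.0], scale=1.0, size=(50, 2))
X_b = rng.normal(loc=[6.0, 7.0], scale=1.0, size=(50, 2))
X = np.vstack([X_a, X_b])
y = np.array([0] * 50 + [1] * 50)

# Step 1: class means and overall mean
mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
mu = X.mean(axis=0)

# Step 2: within-class (S_W) and between-class (S_B) scatter matrices
S_W = (X_a - mu_a).T @ (X_a - mu_a) + (X_b - mu_b).T @ (X_b - mu_b)
S_B = (len(X_a) * np.outer(mu_a - mu, mu_a - mu)
       + len(X_b) * np.outer(mu_b - mu, mu_b - mu))

# Step 3: eigenvector of S_W^{-1} S_B with the largest eigenvalue
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
if w @ (mu_b - mu_a) < 0:  # orient w so that Class B receives higher scores
    w = -w

# Step 4: discriminant scores (assumed intercept: midpoint of the means scores 0)
b = -w @ (mu_a + mu_b) / 2
scores = X @ w + b

# Step 5: rank instances from most B-like to most A-like
ranking = np.argsort(-scores)
print("Weight vector w:", w)
print("Top 5 most B-like instances:", ranking[:5])
```

For two classes the leading eigenvector points in the same direction as $S_W^{-1}(\mu_B - \mu_A)$, which is a handy sanity check on the eigen-decomposition.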

__________________

Note: The `decision_function()` method is provided by several classifiers in libraries such as scikit-learn. It returns a signed score for each sample which, for linear models, is proportional to the sample's distance from the decision boundary. 
- This function is particularly useful in models that rely on decision boundaries, such as 
    - Linear Discriminant Analysis (LDA), 
    - Support Vector Machines (SVM),
    - Logistic Regression.

What Does `decision_function()` Return?

Binary Classification (2 Classes):
- Returns a 1D array of scores where each score indicates the distance of the instance from the decision boundary.
    - Positive scores suggest one class (e.g., Class 1), and negative scores suggest the other class (e.g., Class 0).

Multiclass Classification (More than 2 Classes):
- Returns a 2D array of scores (one score per class for each instance).
    - The classifier assigns a class label based on the highest score.

Why Use decision_function()?
- To understand the confidence of predictions.
- To enable custom ranking or thresholding based on discriminant scores.
- To analyze how far instances are from the boundary, providing insight into borderline cases.

Use `decision_function()` in Ranking and Thresholding

- Ranking: Instances can be ranked by their scores. 
    - Higher absolute values indicate greater confidence in classification.
- Thresholding: The decision scores can be used to apply custom thresholds to refine classification decisions.

Decision Boundary

In the case of Linear Discriminant Analysis:
- The decision boundary corresponds to where the decision_function() outputs 0.
- This boundary is a hyperplane that separates the feature space into regions corresponding to each class.

In [None]:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

# Example data
X = np.array([[2, 3], [3, 5], [5, 7], [6, 8], [8, 10], [9, 12]])  # Features
y = np.array([0, 0, 0, 1, 1, 1])  # Labels (0: Class A, 1: Class B)

# Fit LDA model
lda = LDA()
lda.fit(X, y)

# Compute discriminant scores
scores = lda.decision_function(X)

# Print scores and rankings
print("Discriminant Scores:", scores)
print("Ranking of Instances:", np.argsort(-scores))  # Descending order


Interpretation of Scores and Rankings

Discriminant Scores:
- Positive scores suggest membership in Class 1.
- Negative scores suggest membership in Class 0.

Rankings:
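- The code above sorts by signed score in descending order, so the instances most strongly assigned to Class 1 come first.
- To rank by confidence regardless of class, sort by the absolute score instead: instances with higher absolute scores are those the classifier is more confident about.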

Decision Boundary:
- The boundary is where the discriminant score y = 0
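
To make the link between the score, the boundary and custom thresholds concrete, here is a small sketch. The toy data, the `_demo` variable names and the threshold of 1.0 are assumptions chosen purely for illustration; the scikit-learn attributes used (`coef_`, `intercept_`) are the fitted linear coefficients of the binary model.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical toy data: two Gaussian clusters
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(40, 2)),
                    rng.normal(3.0, 1.0, size=(40, 2))])
y_demo = np.array([0] * 40 + [1] * 40)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)
scores_demo = lda_demo.decision_function(X_demo)

# For two classes the score is the linear function w^T x + b
manual_scores = X_demo @ lda_demo.coef_.ravel() + lda_demo.intercept_[0]

# Default rule: predict Class 1 when the score is positive (boundary at 0)
default_pred = (scores_demo > 0).astype(int)

# Custom rule: require a score above 1.0 (an arbitrary, stricter threshold)
strict_pred = (scores_demo > 1.0).astype(int)

# Rank by confidence regardless of class: largest |score| first
confidence_rank = np.argsort(-np.abs(scores_demo))

print("Predicted as Class 1 (default threshold):", default_pred.sum())
print("Predicted as Class 1 (stricter threshold):", strict_pred.sum())
```

Raising the threshold trades recall for precision on Class 1; which threshold is appropriate depends on the cost of each kind of error.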

### Analyzing the relationship between the distance from the decision boundary of a linear discriminant and the likelihood of response
This helps us understand how confident the model is in its predictions.

- The distance from the decision boundary (discriminant score) relates directly to the confidence in classification.
- Scores are transformed into posterior probabilities using logistic (binary) or softmax (multiclass) functions.
- These probabilities are interpretable as the likelihood of response and can be used for scoring, ranking, and applying thresholds for decision-making.

##### Theoretical Relationship : Distance from the decision boundary

The decision boundary in a Linear Discriminant Analysis (LDA) separates classes by 
- maximizing the distance between class means while 
- minimizing variance within each class. 

The discriminant score $y = w^T x + b$ is proportional to the signed distance of an instance x from the decision boundary (the geometric distance is $y / ||w||$):
- Positive scores indicate the instance is classified into one class (e.g., Class 1).
- Negative scores indicate the instance is classified into the other class (e.g., Class 0).

The magnitude of the score reflects the confidence in classification:
- Larger absolute values imply that the instance is far from the decision boundary and thus more confidently classified.
- Smaller absolute values (near zero) indicate that the instance is close to the boundary, suggesting uncertainty.

##### Likelihood of Response
In LDA, we can link the discriminant score to the posterior probability of a class, which represents the likelihood of the instance belonging to that class:

$$ P(C_k | x) = \frac{\exp(y_k)}{\sum^K_{j = 1} \exp(y_j)} $$

- Where: 
    - $P(C_k | x)$ is the posterior probability for class k.
    - $y_k = w_k^T x + b_k$ is the discriminant score of instance x for class k.
    - The denominator is the normalization factor across all K classes.

The posterior probability serves as a soft classification metric:
- Probabilities closer to 1 indicate high confidence.
- Probabilities closer to 0.5 (in a binary classification) indicate uncertainty.
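
As a quick check of this link, the sketch below compares the logistic transform of the discriminant score with the posterior reported by scikit-learn's LDA for a binary problem. The toy data and the `_demo` names are assumptions; the agreement shown holds for the two-class case in the scikit-learn versions I am aware of.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical toy data: two overlapping Gaussian clusters
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(60, 2)),
                    rng.normal(2.0, 1.0, size=(60, 2))])
y_demo = np.array([0] * 60 + [1] * 60)

lda_demo = LinearDiscriminantAnalysis().fit(X_demo, y_demo)
scores_demo = lda_demo.decision_function(X_demo)

# Logistic transform of the discriminant score
manual_proba = 1.0 / (1.0 + np.exp(-scores_demo))

# Posterior probability of Class 1 as reported by the fitted model
sklearn_proba = lda_demo.predict_proba(X_demo)[:, 1]

print("Largest difference:", np.max(np.abs(manual_proba - sklearn_proba)))
```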

#####  Practical Example
Let's calculate the relationship between discriminant scores and posterior probabilities for a **Binary classification**.

Example Dataset

Suppose we have a binary classification problem with discriminant scores:
$$ y = [2.0, 0.5, 0.0, -0.5, -2.0] $$

We can compute the posterior probabilities using the logistic function:
$$ P(C_1 | x) = \frac{1}{1 + exp(-y)} $$

_______________

Generalization to **Multiclass Classification**
- In multiclass problems, the discriminant scores $y_k$ are normalized using the softmax function to compute posterior probabilities for each class:

$$ P(C_k | x) = \frac{\exp(y_k)}{\sum^K_{j = 1} \exp(y_j)} $$

- And the class with the highest posterior probability is the predicted class.
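
A minimal NumPy sketch of the softmax step follows. The score matrix is invented for illustration, and subtracting the row maximum before exponentiating is a common numerical-stability choice that does not change the result.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row max avoids overflow in exp."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Hypothetical discriminant scores: 3 instances (rows) x 3 classes (columns)
class_scores = np.array([[2.0, 0.5, -1.0],
                         [0.1, 0.2, 0.0],
                         [-1.0, 0.0, 3.0]])

posteriors = softmax(class_scores)           # each row sums to 1
predicted_class = posteriors.argmax(axis=1)  # class with the highest posterior

print(posteriors.round(3))
print("Predicted classes:", predicted_class)
```

The second row shows the multiclass analogue of a score near zero: the three scores are almost equal, so the posteriors are all close to 1/3.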

In [None]:
import numpy as np

# Discriminant scores
scores = np.array([2.0, 0.5, 0.0, -0.5, -2.0])

# Compute posterior probabilities using the logistic function
posterior_probabilities = 1 / (1 + np.exp(-scores))

# Print results
print("Scores:", scores)
print("Posterior Probabilities:", posterior_probabilities)


Interpretation of Results

1. Scores Far from Zero:
- y = 2.0: High confidence in Class 1 ($P(C_1 \mid x) = 0.88$).
- y = -2.0: High confidence in Class 0 ($P(C_0 \mid x) = 0.88$).

2. Scores Near Zero:
- y = 0.0: The posterior probability is 0.5, indicating complete uncertainty.

3. Intermediate Scores:
- y = 0.5: Moderately confident in Class 1 ($P(C_1 \mid x) = 0.62$).
- y = -0.5: Moderately confident in Class 0 ($P(C_0 \mid x) = 0.62$).

Insights

Distance and Likelihood:
- Instances farther from the boundary (large $|y|$) have posterior probabilities close to 0 or 1, indicating higher confidence in classification.
- Instances near the boundary ($y \approx 0$) have probabilities close to 0.5, indicating uncertainty.

Scoring and Ranking:
- By sorting instances based on posterior probabilities, you can rank them in terms of likelihood of response (e.g., likelihood of belonging to Class 1).

Visualization
To better understand the relationship, plot the discriminant score against the posterior probability:

In [None]:
import matplotlib.pyplot as plt

# Plot scores vs posterior probabilities
plt.plot(scores, posterior_probabilities, marker='o')
plt.axvline(0, color='gray', linestyle='--', label='Decision Boundary')
plt.title('Discriminant Score vs Posterior Probability')
plt.xlabel('Discriminant Score (y)')
plt.ylabel('Posterior Probability')
plt.legend(['Scores', 'Decision Boundary'])
plt.grid()
plt.show()

### Understanding Decision Boundaries in Depth

Decision boundaries are surfaces (lines, planes, or hypersurfaces) that separate data points into different classes in a feature space.
- These boundaries are derived based on the decision rules of a classifier, and they indicate the regions where the classifier predicts different outcomes.

##### Decision Boundaries in 2D
In 2D space, the decision boundary is a 
- line (for linear classifiers) or a 
- curve (for non-linear classifiers).

1. Linear Decision Boundaries
- For a binary classification problem, a linear decision boundary is represented as:

$$ w_0 + w_1 x_1 + w_2 x_2 = 0 $$

- where:
    - $w_1, w_2$ are the coefficients.
    - $x_1, x_2$ are the features.
    - $w_0$ is the intercept.

Separating red and blue points in a 2D space, the decision boundary is a straight line. Points on one side belong to one class, while points on the other side belong to the other class.

2. Non-Linear Decision Boundaries
- For complex data distributions, non-linear classifiers create curved decision boundaries. Examples include:
    - SVMs with the kernel trick, and 
    - neural networks. 
- Example: A circular boundary might separate inner and outer regions in a concentric circle dataset.
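
The concentric-circle case can be reproduced with scikit-learn's `make_circles`. The sketch below is only illustrative: the sample size, `factor`, `noise` and the `_circ` names are assumptions, and both SVMs use default hyperparameters.

```python
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two concentric rings: inner ring is one class, outer ring the other
X_circ, y_circ = make_circles(n_samples=400, factor=0.4, noise=0.05, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_circ, y_circ, test_size=0.3, random_state=0)

# A straight line cannot separate the rings; an RBF kernel can curve around them
linear_acc = SVC(kernel="linear").fit(X_tr, y_tr).score(X_te, y_te)
rbf_acc = SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te)

print(f"Linear kernel accuracy: {linear_acc:.2f}")
print(f"RBF kernel accuracy:    {rbf_acc:.2f}")
```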

Visualization
- The boundary is typically visualized by plotting the equation in 2D space and showing the classification regions.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Generate 2D data
X, y = make_classification(n_samples=200, n_features=2, n_informative=2, n_redundant=0, class_sep=1.5, random_state=42)
model = LogisticRegression()
model.fit(X, y)

# Create grid for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100), np.linspace(y_min, y_max, 100))
Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

# Plot data and decision boundary
plt.contourf(xx, yy, Z, alpha=0.8, cmap='coolwarm')
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolor='k', cmap='coolwarm')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('2D Decision Boundary')
plt.show()

##### Decision Boundaries in 3D
In 3D space, the decision boundary becomes a 
- plane.

1. Linear Decision Boundaries
- Represented as:

$$ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 = 0 $$

- where:
    - $w_1, w_2, w_3$ are the coefficients.
    - $x_1, x_2, x_3$ are the features.
    - $w_0$ is the intercept.

The plane separates the three-dimensional feature space into two regions, one for each class.

2. Non-Linear Decision Boundaries

- Non-linear models define curved surfaces in 3D space, such as
    - spheres or 
    - paraboloids.
- Example: In 3D, the boundary might look like a bowl separating one region (inside the bowl) from another (outside the bowl).

Visualization
- Visualizing a plane or curved surface in 3D is possible with tools like Matplotlib's 3D plotting. It shows how the boundary divides the space.


In [None]:
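import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older Matplotlib versions
from sklearn.svm import SVC

# Generate 3D data (seeded for reproducibility)
rng = np.random.default_rng(42)
X = rng.random((200, 3))
# Linear decision boundary that depends on all three features, so the
# coefficient of Feature 3 is non-zero and the plane can be solved for zz below
y = (X[:, 0] + X[:, 1] + X[:, 2] > 1.5).astype(int)
model = SVC(kernel='linear')
model.fit(X, y)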

# Create grid for decision boundary
xx, yy = np.meshgrid(np.linspace(0, 1, 50), np.linspace(0, 1, 50))
zz = (-model.intercept_[0] - model.coef_[0][0] * xx - model.coef_[0][1] * yy) / model.coef_[0][2]

# Plot data and decision boundary
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=y, cmap='coolwarm', edgecolor='k')
ax.plot_surface(xx, yy, zz, alpha=0.5, color='gray')
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_zlabel('Feature 3')
ax.set_title('3D Decision Boundary')
plt.show()

##### Decision Boundaries in Higher Dimensions
In higher-dimensional spaces, the decision boundary becomes a 
- hyperplane (for linear classifiers) or a
- more complex hypersurface (for non-linear classifiers).

1. Linear Decision Boundaries
- For a d-dimensional feature space, the equation is:
$$ w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3 + ... + w_d x_d = 0 $$
- This hyperplane divides the d-dimensional space into regions for classification.

In a 4D feature space $x_1, x_2, x_3, x_4$, the decision boundary is a 3D hyperplane.

2. Non-Linear Decision Boundaries
- Non-linear models use transformations (e.g., polynomial, kernel tricks) to create non-linear hypersurfaces.
- These hypersurfaces can separate data points that are non-linearly separable in their original feature space.

Visualization
- Direct visualization becomes challenging beyond 3 dimensions. 
- However, techniques like dimensionality reduction (PCA, t-SNE, UMAP) can project high-dimensional data and decision boundaries into 2D or 3D for interpretation.

**Impact of Dimension on Decision Boundaries**

Curse of Dimensionality:
- As dimensions increase, data points become sparse, making classification harder.
- Models like LDA or logistic regression may underperform without feature selection.

Model Complexity:
- Non-linear decision boundaries require more complex models (e.g., SVM with RBF kernels, neural networks).
- Overfitting is a significant risk in high dimensions.

### **6. Support Vector Machines (SVM)**
What It Means: 
- SVMs classify data by finding the best "boundary" (hyperplane) that separates classes with the widest possible margin.

Outcome Interpretation: 
- Data points on either side of the boundary belong to different classes, with "support vectors" helping to define the boundary.

Performance Measures:
- Accuracy: Proportion of correct classifications.
- Precision and Recall: Especially useful when classes are imbalanced; precision is the share of positive predictions that are correct, and recall is the share of actual positives that the model finds.

Lay Explanation: 
- SVMs are like drawing a line to separate different groups, ensuring the groups are as distinct as possible with the help of a few key points.

Use Case: 
- Used for classification and regression in high-dimensional spaces, often for non-linearly separable data.

### **Support Vector Machines (SVM): Key Idea**

The idea behind SVM is to find the optimal hyperplane that best separates data points of different classes in the feature space.
- The basic idea of the SVM is to separate points using a $(p - 1)$-dimensional hyperplane, where $p$ is the number of features. 

What does it mean to separate points? 
- This means that the SVM will construct a decision boundary such that points on one side are assigned a label of $A$ and points on the other side are assigned a label of $B$.  
- When finding this separating hyperplane, we wish to maximise its distance to the nearest points of either class. 
    - The technical term for the result is the **maximum-margin hyperplane** (also called the maximum separating hyperplane).
- The data points which dictate where the separating hyperplane goes are called **support vectors**.

How it works in layman's terms:

Pretend that you want to classify data points into group $A$ or group $B$. An SVM will plot your labelled training data as points in space and will:
- look for the widest, clearest gap between points belonging to group A and points belonging to group B. 
- It will then use this newly identified dividing line (known as a hyperplane) and the margin around it to classify new observations. 
- An unseen data point will be classified into group A or B depending on which side of the hyperplane it falls. 

##### Important Concepts in SVM
1. Hyperplane:
- A decision boundary that separates classes in the feature space.
    - In 2D, it's a line: 
        - When your data has only 2 features, a one-dimensional decision boundary (a line) is enough to classify the data.
    - In 3D, it's a plane. 
    - In higher dimensions (4 or more features), it's a hyperplane:
        - As more features are added, the boundary takes on more dimensions.
        - In SVM, the hyperplane always has one dimension fewer than the number of input features ($p$), for a total of $(p-1)$ dimensions.

2. Margin:
- The distance between the hyperplane and the closest data points (called support vectors) of either class.
    - SVM maximizes this margin to create the most robust separation.

3. Support Vectors:
- The data points closest to the hyperplane, which influence its position and orientation.
4. Optimal Hyperplane:
- The hyperplane that maximizes the margin while correctly classifying the training data (or minimizing misclassifications).
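
These concepts can be read off a fitted linear SVM. The sketch below uses six hand-picked points (an assumption, chosen so the answer is easy to verify by hand) and a very large `C` to approximate a hard margin; `1 / ||w||` is the standard expression for the distance from the hyperplane to the closest points.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable toy data
X_toy = np.array([[0, 0], [1, 0], [0, 1],
                  [3, 3], [4, 3], [3, 4]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard-margin SVM
svm_toy = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm_toy.coef_.ravel()     # normal vector of the hyperplane
b = svm_toy.intercept_[0]     # offset of the hyperplane
margin = 1.0 / np.linalg.norm(w)  # distance from hyperplane to closest points

# Support vectors sit on the margin, where |w^T x + b| is (approximately) 1
sv_scores = svm_toy.decision_function(svm_toy.support_vectors_)

print("Support vectors:\n", svm_toy.support_vectors_)
print("Margin (hyperplane to closest point):", round(margin, 3))
print("Scores of support vectors:", sv_scores.round(3))
```

For this data the closest points lie on the lines $x_1 + x_2 = 1$ and $x_1 + x_2 = 6$, so the margin should come out close to $5 / (2\sqrt{2}) \approx 1.77$.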

Support Vector Machines in a nutshell:
- Like logistic regression, SVMs fit a linear decision boundary. 
- Unlike logistic regression, SVMs do this in a non-probabilistic way and are able to fit non-linear data using a technique known as the [kernel trick](https://en.wikipedia.org/wiki/Kernel_method).

SVMs can be used for both classification and regression. In `sklearn`, these are called:
- `SVC` (Support Vector Classifier)
- `SVR` (Support Vector Regression) 

Note that outside `sklearn`, the abbreviation SVC can also refer to Support Vector **Clustering**.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, accuracy_score

import warnings
warnings.filterwarnings("ignore")

##### Generate synthetic data
Generate a random dataset to experiment with:
- `make_gaussian_quantiles` draws samples from a multi-dimensional isotropic normal distribution and defines classes separated by nested concentric multi-dimensional spheres, such that roughly equal numbers of samples fall in each class (using quantiles of the $\chi^2$ distribution).
    - With two features and two classes this produces a donut-shaped dataset, where 
        - the samples belonging to one class are generally located in the centre, and
        - the samples belonging to the other class are generally located in the outer ring.


##### **Reasons for Normalizing Data in SVMs**
- Normalization matters because of how the SVM algorithm calculates margins and distances between data points.

SVM is Sensitive to Feature Scales
- SVM relies on calculating distances (e.g., Euclidean distance) between data points to determine margins and support vectors. 
    - If one feature has a much larger range than others, it will dominate the distance calculation, leading to biased results.
- Example: In a dataset with two features, age (ranging from 0 to 100) and income (ranging from 0 to 100,000), income will heavily influence the decision boundary, even if age is equally or more important.

Ensures Proper Margins
- The SVM objective is to find the hyperplane that maximizes the margin between classes. 
    - Without normalization, the margin calculation may become skewed, resulting in suboptimal or incorrect decision boundaries.
- Example: If one feature has a larger scale, the margin might stretch disproportionately along that dimension, ignoring other features.

Improves Kernel Performance
- SVMs often use kernels (e.g., RBF, polynomial) to project data into higher dimensions. 
    - Kernels are sensitive to the relative scaling of features. Normalization ensures that all features contribute equally to the projection.
- Example: An RBF kernel requires well-scaled data to compute meaningful similarity measures between points. Poorly scaled data may lead to ineffective kernel computations.

Reduces Convergence Time
- SVM optimization involves iterative calculations that are influenced by feature scaling. 
    - Normalized data leads to faster and more stable convergence of the optimization algorithm.
- Example: When features are on drastically different scales, the optimization problem may take longer to converge or fail to converge entirely.

Handles Non-linear Decision Boundaries Better
- Why? For non-linear kernels (like RBF), the distance between points in feature space influences the shape of the decision boundary. 
    - Normalization ensures these distances are meaningful, leading to smoother and more accurate decision boundaries.

##### **Consequences of Not Normalizing**
- Poor Decision Boundaries: The SVM may create biased or incorrect hyperplanes, reducing model performance.
- Misclassification: The model may misclassify data, especially when features with large ranges dominate.
- Kernel Inefficiency: Kernels may fail to project the data effectively, leading to poor separation of classes.
- Increased Training Time: Optimization takes longer, impacting the efficiency of training.

##### **How to Normalize Data for SVMs**

1. **Standardization**: Subtract the mean and divide by the standard deviation for each feature
$$ z = \frac{x - \mu}{\sigma} $$
- This scales features to have a mean of 0 and a standard deviation of 1.

2. **Min-Max Scaling**: Rescale each feature to a fixed range, typically [0, 1] 
$$ z = \frac{x - min(x)}{max(x) - min(x)} $$
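
Both formulas are easy to verify against scikit-learn's scalers. The four-row age/income array below is invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: age (years) and income on very different scales
X_raw = np.array([[25, 30_000],
                  [40, 52_000],
                  [58, 91_000],
                  [33, 47_500]], dtype=float)

# Standardization: subtract the column mean, divide by the column std
z_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)
z_sklearn = StandardScaler().fit_transform(X_raw)

# Min-max scaling: map each column onto [0, 1]
col_min, col_max = X_raw.min(axis=0), X_raw.max(axis=0)
mm_manual = (X_raw - col_min) / (col_max - col_min)
mm_sklearn = MinMaxScaler().fit_transform(X_raw)

print("Standardized:\n", z_sklearn.round(2))
print("Min-max scaled:\n", mm_sklearn.round(2))
```

In practice the scaler should be fitted on the training data only and then applied to the test data; wrapping it with the model, for example `make_pipeline(StandardScaler(), SVC())` from `sklearn.pipeline`, takes care of this.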


In [None]:
from sklearn.datasets import make_gaussian_quantiles

# Set the feature dimensionality
p = 2

# Construct the dataset
X, y = make_gaussian_quantiles(cov=3.,
                                 n_samples=1000, n_features=p,
                                 n_classes=2, random_state=1)

In [None]:
# get training and testing data 
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

##### Fit a SVM classifier with a linear decision boundary
We are going to fit an SVC model with a `linear kernel`. This means that we are telling the SVC to fit the data using a linear decision boundary. Let's also take a look at the accuracy score:

In [None]:
from sklearn.svm import SVC

svc = SVC(kernel='linear')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print("The accuracy score of the SVC is:", accuracy_score(y_test, y_pred))
print("\n\nClassification Report:\n\n", classification_report(y_test, y_pred))

In [None]:
from sklearn.svm import SVC
model = SVC()
model.fit(X_train, y_train)
predictions = model.predict(X_test)

##### Plot the decision boundary for the SVC
The accuracy score doesn't look very good. To understand what's going on, visualise the decision boundary.

The SVC calculates and implements a $p-1$ dimensional decision boundary (hyperplane) over the input features.
- Since our synthetic dataset only has two features ($p=2$), the hyperplane has just 1 dimension ($p-1$) and looks like a single line.
- If your model has more than 2 features, you can plot the hyperplane for any 2 features you choose.

##### **Calculating the Dimensions of a separating Hyperplane**
The dimensions of a separating hyperplane depend on the number of features (or predictors) in the dataset.

Definition of a Hyperplane

A hyperplane in n-dimensional space is defined as:
$$ w \cdot x + b = 0 $$

where:
- $w = [w_1, w_2, ..., w_n]$: Weight vector normal to the hyperplane.
- $x = [x_1, x_2, ..., x_n]$: Feature vector of an instance.
- $b$: Bias term (offset from the origin).

The hyperplane separates data into two classes:
- $w \cdot x + b > 0$: Class 1
- $w \cdot x + b < 0$: Class 2

Dimensions of a Hyperplane

The dimensionality of the hyperplane is determined by the number of features in the dataset:
- If the dataset has n features, the hyperplane is an (n-1)-dimensional subspace.

Examples:
- 2 Features (2D): The hyperplane is a 1D line.
- 3 Features (3D): The hyperplane is a 2D plane.
- 4 Features (4D): The hyperplane is a 3D subspace (hard to visualize, but mathematically valid).

Intuition Behind Dimensions

- The hyperplane must divide the feature space into two regions corresponding to different classes.
- Each added feature adds one coefficient to $w$, so the hyperplane gains a dimension, but the boundary stays linear in the input features.
- Kernels: When data is mapped to a higher-dimensional feature space using kernels (e.g., RBF), the hyperplane exists in that higher-dimensional space, and its dimension depends on the kernel's transformation.

When Are the Dimensions Relevant?

- At Training Time: The dimensions of the hyperplane are implicitly calculated when the SVM solves the optimization problem to find $w$ and $b$.
    - The optimization ensures the hyperplane maximizes the margin between support vectors of the two classes.
- During Prediction: The dimensionality of the hyperplane affects how data points are classified. The model computes:
    - Decision Function: $w \cdot x + b$
        - The sign of this value determines the predicted class.

##### **Calculation of the Dimensions**
- The dimensions are calculated implicitly when the SVM solves its optimization problem to find $w$ and $b$.
- The dimensionality of the hyperplane is directly tied to the feature space in which the data resides.

Steps:
1. Input Data Dimension: Count the number of features n in your dataset.
- Example: If your dataset has features $x = [x_1, x_2, x_3]$, it's a 3-dimensional space.

2. Hyperplane Dimension: The hyperplane will have (n-1) dimensions.
- For the 3-feature example, the hyperplane is a 2D plane.
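
The two steps can be checked on a fitted linear SVC. The sketch assumes a synthetic 3-feature dataset from `make_classification`; the names `X3`, `y3` and `svc3` are placeholders. The weight vector has one entry per feature, and the decision function is the linear expression $w \cdot x + b$.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Hypothetical dataset with n = 3 features
X3, y3 = make_classification(n_samples=200, n_features=3, n_informative=3,
                             n_redundant=0, random_state=0)
svc3 = SVC(kernel="linear").fit(X3, y3)

# Step 1: input dimension = number of features
n_features = X3.shape[1]

# Step 2: the hyperplane has one dimension fewer than the feature space
hyperplane_dim = n_features - 1

w = svc3.coef_.ravel()    # one weight per feature (normal to the hyperplane)
b = svc3.intercept_[0]

# The decision function is w . x + b; its sign gives the predicted class
manual_decision = X3 @ w + b

print("Features:", n_features, "-> hyperplane dimensions:", hyperplane_dim)
print("Weight vector:", w.round(3), "bias:", round(b, 3))
```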

In this case, the donut-shaped data is not `linearly separable`, so no straight line can split the two classes well.

In [None]:
i = 0 # Feature 1
j = 1 # Feature 2

svc.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
 
x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = svc.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# Pass the colormap by name; plt.cm.get_cmap is no longer available in recent Matplotlib releases
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

Solution: Use of SVM's [kernel trick](https://en.wikipedia.org/wiki/Kernel_method) to use a **non-linear** decision boundary instead.



##### Fit a SVC classifier with a non-linear decision boundary

Use the RBF (radial basis function) kernel, which allows the SVC to fit a non-linear decision boundary. 

In [None]:
svc = SVC(kernel='rbf')
svc.fit(X_train, y_train)
y_pred = svc.predict(X_test)

print("The accuracy score of the SVC is:", accuracy_score(y_test, y_pred))
print("\n\nClassification Report:\n\n",classification_report(y_test, y_pred))

##### Plot the decision boundary for the SVC using the non-linear rbf kernel

Plot the decision boundary (now a curve rather than a straight line) between the 2 features present in our synthetic dataset:

In [None]:
i = 0 # Feature 1
j = 1 # Feature 2

svc.fit(X[:, [i, j]], y)
fig = plt.figure(figsize=(8, 8))
ax1 = fig.add_subplot(111)
 
x_min, x_max = X[:, i].min(), X[:, i].max()
y_min, y_max = X[:, j].min(), X[:, j].max()
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 1000), np.linspace(y_min, y_max, 1000))

y_hat = svc.predict(np.concatenate((xx.reshape(-1,1), yy.reshape(-1,1)), axis=1))
y_hat = y_hat.reshape(xx.shape)

# Pass the colormap by name; plt.cm.get_cmap is no longer available in recent Matplotlib releases
ax1.pcolormesh(xx, yy, y_hat, cmap='RdBu_r')
ax1.scatter(X[:, i], X[:, j], c=y, edgecolors='k', cmap='RdBu_r')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')
ax1.set_xlim(xx.min(), xx.max())
ax1.set_ylim(yy.min(), yy.max())
ax1.set_xticks(())
ax1.set_yticks(())
plt.show()

### **Objective Function of SVM**
The objective of SVM is twofold:
- Maximize the margin (maximize separation between classes).
- Minimize classification errors for non-linearly separable data (using a hinge-loss function).

It is designed to maximize the margin between the two classes while minimizing classification errors. 
- For d features, the objective function considers the weight vector $w \in R^d$, which defines the orientation of the separating hyperplane in the feature space.

#####  Primal Form of the Objective Function: Mathematical Formulation
Given:
- A dataset with n training samples $(x_i, y_i)$ where 
    - $x_i \in R^d$ are feature vectors
    - $y_i \in \{-1, 1\}$ are class labels.
- A weight vector $w$ and bias $b$ defining the hyperplane.

The decision boundary is represented by:
$$ f(x) = w^T x + b $$

The **SVM primal objective function** is:

$$ \min_{w,b,\xi} \; \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i $$

- Where:
    - $\frac{1}{2} ||w||^2$: Margin term.
        - Minimizing the norm of the weight vector $w$ maximizes the margin between the two classes, which favours a simpler decision boundary.
    - $C$: Regularization parameter that controls the trade-off between margin maximization and classification errors.
        - The choice of $C$ determines where this balance lies.
    - $\sum^n_{i=1} \xi_i$: A penalty for all margin violations. 
        - A higher sum of $\xi_i$ implies more violations.
    - $\xi_i$: Slack variables, one per data point.
        - $\xi_i$ measures the extent to which the i-th data point violates the margin (by lying within the margin or by being misclassified).


The minimization is subject to the constraints:

$$ y_i \cdot (w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 \quad \forall i $$

##### Hinge-Loss Function

The hinge-loss function is used to penalize misclassifications and points close to the margin. It is defined as:

$$ L(y, f(x)) = \max(0, 1 - y \cdot f(x)) $$

- If $y \cdot f(x) \geq 1$, the loss is 0 (correctly classified and on or beyond the margin).
- If $y \cdot f(x) < 1$, the loss is positive and grows linearly the further the point moves into the margin or onto the wrong side of the boundary.

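Because the constraints force $\xi_i \geq \max(0, 1 - y_i \cdot (w^T x_i + b))$ and the objective pushes every $\xi_i$ as low as possible, each slack variable equals its hinge loss at the optimum. Substituting this in, the **SVM primal objective function** becomes:

$$ \min_{w,b} \; \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \max(0, 1 - y_i \cdot (w^T x_i + b)) $$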
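
To see the formula at work, the sketch below evaluates the objective for a fixed, hand-picked hyperplane. The function names, the three data points and the choice $C = 1$ are assumptions for illustration; nothing is optimized here.

```python
import numpy as np

def hinge_loss(y, f):
    """Per-sample hinge loss max(0, 1 - y * f) for labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * f)

def svm_primal_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of hinge losses)."""
    f = X @ w + b
    return 0.5 * (w @ w) + C * hinge_loss(y, f).sum()

# Hypothetical data and a hand-picked hyperplane
X_toy = np.array([[2.0, 0.0], [0.0, 0.0], [1.0, 0.0]])
y_toy = np.array([1, -1, 1])
w_toy = np.array([1.0, 0.0])
b_toy = -1.0

losses = hinge_loss(y_toy, X_toy @ w_toy + b_toy)
objective = svm_primal_objective(w_toy, b_toy, X_toy, y_toy, C=1.0)

print(losses)     # [0. 0. 1.]
print(objective)  # 1.5
```

The first two points sit exactly on the margin ($y \cdot f(x) = 1$) and incur no loss; the third lies on the decision boundary ($f(x) = 0$) and contributes a loss of 1, so the objective is $0.5 \cdot 1 + 1 \cdot 1 = 1.5$.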

### **Understanding the Objective Function**

1. Margin Maximization ($\frac{1}{2} ||w||^2$): 
- The first term ensures that the hyperplane has the largest margin by minimizing the norm of the weight vector ($||w||$).
    - A smaller ||w|| corresponds to a larger margin.

2. Hinge Loss ($C \sum^n_{i=1} max(0,1 - y_i \cdot f(x_i))$)
- The second term penalizes points that are misclassified or fall within the margin; each summand $max(0,1 - y_i \cdot (w^T x_i + b))$ equals the slack variable $\xi_i$ of that point.
- The parameter C>0 is a `regularization parameter` that controls the trade-off between maximizing the margin and minimizing classification errors.
    - How C Works:
        - Large C: Strongly penalizes misclassifications, leading to a tighter fit to the training data.
        - Small C: Allows for more margin violations, leading to a simpler, more generalizable model.
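
The effect of `C` can be observed directly by fitting the same data twice. In this sketch the overlapping blobs from `make_blobs` and the two values of `C` are arbitrary choices; the margin width is computed as $2 / ||w||$.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Hypothetical data: two noisy clusters
X_blob, y_blob = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

results = {}
for C in (0.01, 100.0):
    model_c = SVC(kernel="linear", C=C).fit(X_blob, y_blob)
    results[C] = {
        "margin_width": 2.0 / np.linalg.norm(model_c.coef_),
        "n_support_vectors": int(model_c.n_support_.sum()),
        "train_accuracy": model_c.score(X_blob, y_blob),
    }
    print(f"C={C}: {results[C]}")
```

A small `C` tolerates more violations and therefore ends up with a margin at least as wide as the one found with a large `C`, usually with more support vectors.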

Regularization parameter trade-off:
- Regularization adjusts the balance between two objectives:
    - Maximizing the margin: Keeping $||w||^2$ small promotes a large margin and simpler models.
    - Minimizing misclassification error: Penalizing $\sum^n_{i=1} \xi_i$ ensures the model correctly classifies most training instances.


##### Interpretation of the Objective Function
The function combines two objectives:
1. Maximizing the margin: Achieved by minimizing $\frac{1}{2} ||w||^2$ resulting in a decision boundary that is as far as possible from the nearest data points (support vectors).
    - A larger margin improves the model's generalization ability (i.e., it performs better on unseen data).
2. Minimizing classification errors: Achieved by penalizing the slack variables $\xi_i$ via the term $C \sum^n_{i=1} \xi_i$, which accounts for points inside the margin or on the wrong side of the decision boundary.
    - Points misclassified or within the margin are penalized, encouraging the model to position the hyperplane optimally.

#### How SVM Uses Hinge Loss

##### 2.1 How violations and misclassification are measured in soft margin classification.
Soft Margin Classification:
- The soft margin SVM introduces flexibility by allowing violations of the margin through $\xi_i$, making it suitable for non-linearly separable and noisy datasets.
- For linearly inseparable data, SVM introduces slack variables ($\xi_i$) to allow some points to violate the margin constraints.
- The hinge loss incorporates these violations, enabling SVM to work with noisy or overlapping data.

In soft margin classification, violations and misclassification are measured using slack variables ($\xi_i$), which represent the extent to which a data point deviates from the ideal separation defined by the decision boundary and margin.

The Role of the Slack Variables ($\xi_i$)
- Slack variables are introduced in the soft margin SVM to allow for some data points to:
    - Lie inside the margin (violations).
    - Be misclassified (on the wrong side of the decision boundary).
- Each data point i has an associated slack variable ($\xi_i \geq 0$), which quantifies its violation of the margin constraints.

Decision Boundary and Constraints
- The margin constraints in soft margin classification are:
$$ y_i (w \cdot x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0 $$

- When $ y_i (w \cdot x_i + b) \geq 1$:
    - The data point is correctly classified and outside the margin. No violation occurs, so $\xi_i = 0$.
- When $ 0 < y_i (w  \cdot x_i + b) < 1$: 
    - The data point is correctly classified but lies inside the margin. The margin is violated, and $\xi_i > 0$.
- When $ y_i (w \cdot x_i + b) < 0$:
    - The data point is misclassified and on the wrong side of the decision boundary. This is a severe violation, with a larger $\xi_i$.

Measuring Margin Violations
- The slack variable $\xi_i$ measures the distance a point falls short of the margin boundary. Specifically:
    - $\xi_i = 0$, the point lies on or outside the correct margin.
    - $0 < \xi_i \leq 1$, the point is inside the margin but correctly classified.
    - $\xi_i > 1$, the point is misclassified.

- Total Margin Violation
    - The total violation across all data points is:
$$ \sum^N_{i=1} \xi_i$$

Misclassification
- Misclassification occurs when a data point lies on the wrong side of the decision boundary, that is, when:
$$y_i (w \cdot x_i + b) < 0$$

- For misclassified points, $\xi_i > 1$.
    - The quantity $\xi_i - 1$ represents the extent of misclassification.

- Misclassification Count
    - The number of misclassified points can be roughly estimated as:
        - number of misclassifications $\approx \sum^N_{i=1} 1 (\xi_i >1)$ where:
            - $1(\cdot)$ is an indicator function that equals 1 if the condition is true, and 0 otherwise.

- Example: assume we have the following
    - $y_i = +1$: Positive class.
        - The margin for $y_i = +1$ is defined as $(w \cdot x_i + b) \geq +1$

- Possible Cases:
    - Correct Classification Outside the Margin: $y_i(w \cdot x_i + b) \geq 1$
        - No violation: $\xi_i = 0$
    - Correct Classification Inside the Margin: $0 < y_i(w \cdot x_i + b) < 1$
        - Margin violation occurs: $\xi_i > 0$
    - Misclassified Point: $y_i(w \cdot x_i + b) < 0$
        - Severe violation: $\xi_i > 1$
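
The three cases above can be counted on a fitted model. This is a rough sketch, assuming scikit-learn's `SVC` with a linear kernel and a synthetic, overlapping dataset from `make_blobs`; the dataset and the choice `C=1.0` are assumptions for illustration, and the slack is recovered as $\xi_i = \max(0, 1 - y_i f(x_i))$ from the decision function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration).
X, y01 = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)
y = np.where(y01 == 1, 1, -1)  # relabel the classes to {-1, +1}

model = SVC(kernel="linear", C=1.0).fit(X, y)

# Functional margin y_i * f(x_i) and the implied slack xi_i.
margins = y * model.decision_function(X)
slack = np.maximum(0.0, 1.0 - margins)

outside = int(np.sum(slack == 0))                 # on or outside the margin
inside = int(np.sum((slack > 0) & (slack <= 1)))  # margin violation, correct side
misclassified = int(np.sum(slack > 1))            # wrong side of the boundary

print("outside margin:", outside)
print("inside margin:", inside)
print("misclassified:", misclassified)
print("total violation (sum of slack):", slack.sum())
```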

##### 2.2 Reasons for using `Regularization` in SVM
Regularization in Support Vector Machines (SVMs) is crucial to ensure that the model generalizes well to unseen data. 
- Regularization introduces a penalty for overly complex models, preventing overfitting.
- Regularization in SVM controls the trade-off between:
    - Maximizing the margin: Ensuring the decision boundary is as far as possible from the nearest data points.
    - Minimizing misclassification errors: Allowing some points to fall inside the margin or on the wrong side of the decision boundary for better generalization.

Control Overfitting
- Reason: SVM aims to maximize the margin between classes while minimizing misclassification errors. Without regularization, the model might try to perfectly classify the training data, resulting in overfitting.
- Solution: Regularization balances the trade-off between achieving a larger margin (simpler model) and minimizing classification errors.
    - A larger regularization parameter (C): penalizes misclassifications more heavily, potentially leading to overfitting.
    - A smaller regularization parameter (C):  favors a larger margin and allows for more misclassifications, promoting generalization.

Handle Noisy Data
- Reason: Real-world datasets often contain noise, outliers, or mislabeled data points. Without regularization, SVM may overemphasize these noisy points, leading to a distorted decision boundary.
- Solution: Regularization reduces the influence of such noisy points by allowing some tolerance for misclassification, leading to a more robust model.

Promote Simpler Decision Boundaries
- Reason: Complex decision boundaries can lead to poor generalization on new data.
- Solution: Regularization encourages the SVM to find a simpler decision boundary by controlling the weight vector (w) through a regularization term in the objective function.

Avoid Curse of Dimensionality
- Reason: In high-dimensional spaces, the risk of overfitting increases because the model has more capacity to fit the training data perfectly.
- Solution: Regularization reduces the model's flexibility, preventing overfitting in high-dimensional feature spaces.

Improve Generalization Performance
- Reason: A model that fits the training data too closely may fail to generalize to unseen data.
- Solution: Regularization ensures that the SVM focuses on the most informative patterns in the data, improving performance on test data.

Kernel Methods and Regularization
- Reason: When using kernel functions (e.g., RBF, polynomial), the feature space is transformed into a higher dimension, increasing the model's capacity to overfit.
- Solution: Regularization mitigates overfitting by constraining the optimization process, ensuring the model finds a balance between complexity and accuracy.

##### **Type of regularization used in Soft Margin Classification for Support Vector Machines (SVMs)** 
Type used is L2 regularization.

L2 Regularization in the Objective Function: 
$$Minimize_{w,b,\xi}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i$$

- Where:
    - $\frac{1}{2} ||w||^2$: represents L2 regularization, as it minimizes the squared Euclidean norm of the weight vector w. 
        - It helps in maximizing the margin by penalizing larger weight values, which results in a smoother and more generalizable decision boundary.
    - $C \sum^n_{i=1} \xi_i$: This term penalizes margin violations (misclassification or points lying within the margin). 
        - The parameter C determines the penalty strength.

Why L2 Regularization?
- L2 regularization is chosen because:
    - It encourages smaller weight magnitudes ($w_i^2$), which leads to a more stable model that is less sensitive to noise in the data.
    - It avoids overfitting by penalizing complex decision boundaries.
    - The quadratic term $||w||^2$ ensures that the solution is smooth and generalizes well to unseen data.

Mathematical Interpretation
- The L2 regularization term $\frac{1}{2} ||w||^2$ ensures that the weight vector w remains small, which effectively controls the model's complexity. 
    - Smaller weights correspond to a more stable and less overfitted model.

Connection to Dual Formulation
- In the dual formulation, the regularization parameter C indirectly limits the Lagrange multipliers $\alpha_i$:

$$ 0 \leq \alpha_i \leq C $$

This constraint ensures that the influence of each data point on the decision boundary is limited, balancing the trade-off between margin maximization and classification accuracy.

##### 2.3 Hyperparameter C in soft margin classification
The objective function for soft margin SVM is:

$$Minimize_{w,b,\xi}: \frac{1}{2} ||w||^2 + C \sum^n_{i=1} \xi_i$$

Hyperparameter C
- Determines the trade-off between maximizing the margin (pathway width) and minimizing classification errors.
- Determines how much weight is given to minimizing slack variables relative to maximizing the margin width.

Trade-Off Parameter (C):
- Governs the trade-off between the two components of the objective function:
    - Large C: Focuses on minimizing misclassification, potentially at the cost of a smaller margin (risk of overfitting).
        - Strongly penalizes misclassification.
        - Results in a smaller margin as the model tries to classify every point correctly.
        - May lead to overfitting, especially on noisy data.
    - Small C: Focuses on maximizing the margin, tolerating some misclassifications (risk of underfitting).
        - Allows more margin violations (misclassified points).
        - Results in a larger margin and simpler decision boundary.
        - Promotes better generalization, reducing the risk of overfitting.

High C: Narrower Pathway Width
- The penalty for margin violations ($C \sum \xi_i$) becomes significant.
- The SVM prioritizes classifying training points correctly over maximizing the margin width.
- The model becomes more sensitive to individual data points, which can lead to:
    - A narrower pathway width (smaller margin).
    - Overfitting, where the decision boundary conforms too closely to the training data.
- Behavior:
    - The margin shrinks to fit the data points tightly.
    - Misclassified points are heavily penalized, so the model tries to minimize their number at the cost of a smaller margin.

Low C: Wider Pathway Width
- The penalty for margin violations becomes less significant.
- The SVM focuses on maximizing the margin width, even if it means allowing some misclassified points.
- The model becomes less sensitive to noise and outliers, leading to:
    - A wider pathway width (larger margin).
    - Better generalization to unseen data.
- Behavior:
    - The decision boundary prioritizes a larger margin over perfect classification.
    - Misclassified points are tolerated, reducing the risk of overfitting.

##### **Relationship Between C and Pathway Width**
The pathway width (or margin width) is inversely related to C:
- High C: Narrower pathway (small margin).
- Low C: Wider pathway (large margin).

This trade-off reflects the bias-variance trade-off:
- High C: Low bias, high variance (more complex model).
- Low C: High bias, low variance (simpler model).

Practical Analysis of C and Pathway Width
- To analyze the relationship between C and margin width:
    - Train the SVM Model: Train models with different values of C.
    - Visualize the Decision Boundary:
        - Plot the decision boundary and margins for low, medium, and high C values.
        - Observe how the margin width and boundary placement change.
    - Evaluate Performance:
        - On training data, high C often results in lower misclassification rates.
        - On test data, low C often results in better generalization.
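
The analysis steps above can be sketched without plotting by computing the margin width $2 / ||w||$ directly. This is a minimal example, assuming a linear-kernel `SVC` and a synthetic overlapping dataset from `make_blobs`; the three C values are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

margin_widths = {}
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="linear", C=C).fit(X, y)
    w = model.coef_.ravel()                     # weight vector of the hyperplane
    margin_widths[C] = 2.0 / np.linalg.norm(w)  # pathway width between the margins
    print(f"C={C:>6}: margin width={margin_widths[C]:.3f}, "
          f"support vectors={model.n_support_.sum()}, "
          f"train accuracy={model.score(X, y):.3f}")
```

On data like this, the width should shrink as C grows, and the number of support vectors typically drops with it.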

Impact of C on Generalization
- High C: The model prioritizes accuracy on the training data but risks overfitting due to a narrow margin.
- Low C: The model sacrifices some accuracy on the training data but generalizes better due to a wider margin.

##### **Dual Formulation of the Soft Margin Objective**
The objective function for soft margin classification in Support Vector Machines (SVMs) allows for some misclassification or margin violations in the dataset. 
- This makes the model more robust to noisy and non-linearly separable data.

The dual formulation is more computationally efficient for many datasets, especially when using kernels. It is expressed as:

$$ \max_{\alpha} \sum^{N}_{i = 1} \alpha_i - \frac{1}{2} \sum^{N}_{i = 1} \sum^{N}_{j = 1} \alpha_i \alpha_j y_i y_j K(x_i, x_j) $$

Subject to:

$$ 0 \leq \alpha_i \leq C, \quad \sum^{N}_{i = 1} \alpha_i  y_i = 0 $$

Where: 
- $\alpha_i$: Lagrange multipliers.
- $K(x_i, x_j)$: Kernel function, used for non-linear decision boundaries.
- C: Controls the range of $\alpha_i$, balancing margin width and classification error.
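
Both dual constraints can be inspected on a fitted model. The sketch below assumes scikit-learn's `SVC`, whose `dual_coef_` attribute stores $y_i \alpha_i$ for the support vectors only; the dataset and the value `C = 0.5` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Synthetic, overlapping two-class data (an assumption for illustration).
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.0, random_state=0)

C = 0.5
model = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)

# dual_coef_ holds y_i * alpha_i for each support vector (binary case: one row).
signed_alphas = model.dual_coef_.ravel()
alphas = np.abs(signed_alphas)

print("largest alpha:", alphas.max(), "(upper bound C =", C, ")")
print("sum of alpha_i * y_i:", signed_alphas.sum())
print("support vectors at the bound alpha_i = C:", int(np.sum(np.isclose(alphas, C))))
```

Support vectors whose multiplier sits at the bound C are the ones that violate the margin.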

##### **Components in Relation to m Features**
The objective function in SVM for m-features balances the goals of maximizing the margin and minimizing misclassification through the weight vector w, bias b, and slack variables $\xi_i$. Regularization, via C, plays a key role in ensuring that the model generalizes well to unseen data.

- w: Weight vector of dimension m, one weight per feature, defines the hyperplane's orientation.
- $x_i \in \mathbb{R}^m$: Feature vectors in the m-dimensional space.
- Kernel $K(x_i, x_j)$: Allows mapping of $x_i$ into a higher-dimensional feature space for non-linear separability, indirectly involving m.

##### Intuition for m Features
- The dimension m dictates the complexity of the weight vector w, which defines the separating hyperplane.
- Larger m means a higher-dimensional feature space, potentially increasing the model's capacity but also the risk of overfitting.
- Regularization (C) ensures that the optimization remains robust, even with a large number of features.

### **Optimization**
- To solve the SVM objective, quadratic programming methods or specialized optimization algorithms (e.g., SMO, Sequential Minimal Optimization) are used. 
- For large datasets, linear solvers or approximate methods (such as kernel approximations) are often applied.

##### **Tuning an SVM model**
Use `sklearn`'s [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). 
- This procedure allows us to specify a set of possible parameters for a specific model.
    - `GridSearchCV` will then go through those parameters and try every possible combination of them (kind of like it's working through a grid in a systematic way - that's where the name comes from). 
    - `GridSearchCV` will then return the combination of parameters that resulted in a model with the best score. 
    - `GridSearchCV` makes use of **cross validation**, helping to ensure the robustness of its results.

Grid search is a systematic method for hyperparameter optimization that evaluates a predefined set of hyperparameters for a machine learning model, such as an SVM.

**Hyperparameters**

C: Regularization parameter.
- Higher C: Focuses on minimizing classification errors (narrower margin, higher risk of overfitting).
- Lower C: Allows more classification errors (wider margin, higher risk of underfitting).

Kernel: Specifies the kernel function.
- Linear: Best for linearly separable data.
- Polynomial/RBF (Radial Basis Function): Handles nonlinear decision boundaries.

Gamma: Used with RBF and polynomial kernels.
- Controls the influence of a single training example.
    - Lower values: More generalized decision boundaries.
    - Higher values: Tighter fit around data points.

Degree: Relevant for polynomial kernels, representing the polynomial degree.
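
The effect of gamma described above can be observed by comparing training and test accuracy for a few values. This is a small sketch, assuming an RBF-kernel `SVC` on the synthetic `make_moons` dataset; the noise level and the three gamma values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic non-linear data (an assumption for illustration).
X, y = make_moons(n_samples=300, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

train_acc, test_acc = {}, {}
for gamma in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    train_acc[gamma] = model.score(X_train, y_train)
    test_acc[gamma] = model.score(X_test, y_test)
    print(f"gamma={gamma:>6}: train={train_acc[gamma]:.3f}, test={test_acc[gamma]:.3f}")
```

A large gap between training and test accuracy at high gamma is the usual sign of the tighter fit around individual points.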

Steps:

1. Create a dictionary that contains the parameters you want to tune as `keys` and all the different options you want to test for those parameters as `values`.

```python
# Keys are SVC parameter names; values are the candidate settings to try.
parameters = {'kernel': ('linear', 'rbf'),
              'C': (0.25, 1.0),
              'gamma': (1, 2)}
```

2. Instantiate an SVC classifier and tell `GridSearchCV` to test it using the parameters we previously specified:

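```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

svm = SVC()
clf = GridSearchCV(svm, parameters)  # default: 5-fold cross-validation
clf.fit(X_train, y_train)  # X_train, y_train: your training features and labels
```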

**Understanding the Output of Grid Search for SVM**

1. Best Parameters (best_params_)

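This indicates the combination of hyperparameters that resulted in the best cross-validation score. The values below are illustrative and come from a wider grid than the one defined in the steps above.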

    - {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
        - A regularization strength of C=10.
        - A kernel function of Radial Basis Function (RBF).
        - Gamma value of 0.1.

Extract the Best Parameters:
- Use `grid_search.best_params_` to identify the best-performing combination (`grid_search` is the fitted `GridSearchCV` object, named `clf` in the code above).

2. Best Score (best_score_)

This is the highest cross-validation score achieved for the best parameter combination. It indicates how well the model generalized during validation.

    - 0.93
    - The best parameter combination resulted in 93% accuracy during cross-validation.

Examine the Best Score:
- Use `grid_search.best_score_` to see the best validation accuracy achieved (e.g., 0.93).

Make Predictions:
- Use `grid_search.best_estimator_` to make predictions on new data.

3. Complete Results (cv_results_)

- A dictionary containing detailed results for all parameter combinations evaluated during grid search. Key fields:
    - mean_test_score: Average cross-validation score for each parameter set.
    - std_test_score: Standard deviation of scores across folds (indicates variability).
    - params: Parameter combinations corresponding to the scores.

            - {'mean_test_score': [0.91, 0.93, 0.89],
            'std_test_score': [0.01, 0.02, 0.03],
            'params': [{'C': 1, 'gamma': 0.1, 'kernel': 'rbf'},
                        {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'},
                        {'C': 1, 'gamma': 0.01, 'kernel': 'rbf'}]}

            - The best score (0.93) corresponds to {'C': 10, 'gamma': 0.1, 'kernel': 'rbf'}
            - The variability of scores (e.g., 0.02) reflects the model's consistency during cross-validation.

4.  Best Estimator (best_estimator_)

The trained SVM model with the best parameters. This can be used for predictions.

    - SVC(C=10, gamma=0.1, kernel='rbf')

##### How to Use the Grid Search Results
- Best Parameters: Use `grid_search.best_params_` to train the final SVM model on the full training data for optimal performance.
- Best Estimator: Use `grid_search.best_estimator_` directly for prediction.
- Scoring and Ranking: The `mean_test_score` field in `cv_results_` can be used to evaluate how different parameter combinations perform.
- Variability: Use `std_test_score` to assess how consistent the model performance is across folds. Lower variability indicates a robust model.
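
The attributes described above can be read off a fitted search object. The following is a self-contained sketch, assuming the Iris dataset as stand-in data and the same small grid as in the steps above; the variable name `grid_search` and the train/test split settings are choices made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Stand-in data (an assumption); replace with your own features and labels.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

parameters = {"kernel": ["linear", "rbf"], "C": [0.25, 1.0], "gamma": [1, 2]}
grid_search = GridSearchCV(SVC(), parameters, cv=5)
grid_search.fit(X_train, y_train)

print("best params:", grid_search.best_params_)
print("best CV score:", round(grid_search.best_score_, 3))

# One row per parameter combination: mean and spread of the fold scores.
results = grid_search.cv_results_
for mean, std, params in zip(
    results["mean_test_score"], results["std_test_score"], results["params"]
):
    print(f"{mean:.3f} +/- {std:.3f}  {params}")

# best_estimator_ is already refit on all of X_train (refit=True by default).
print("held-out accuracy:", grid_search.best_estimator_.score(X_test, y_test))
```

Scoring the refit estimator on data that the search never saw gives a less optimistic estimate than `best_score_` alone.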

**GridSearch Output**

    - GridSearchCV(cv=None, error_score=nan,
                estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                            class_weight=None, coef0=0.0,
                            decision_function_shape='ovr', degree=3,
                            gamma='scale', kernel='rbf', max_iter=-1,
                            probability=False, random_state=None, shrinking=True,
                            tol=0.001, verbose=False),
                iid='deprecated', n_jobs=None,
                param_grid={'C': (0.25, 1.0), 'gamma': (1, 2),
                            'kernel': ('linear', 'rbf')},
                pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
                scoring=None, verbose=0)

1. cv=None:
- By default, cv=None uses 5-fold cross-validation to evaluate the performance of each hyperparameter combination.

2. error_score=nan:
- Specifies what happens if a model fails during training. 
    - If set to nan, the model skips that combination and assigns a score of nan.

3. estimator=SVC(...):
- The base model being optimized, in this case, an SVM classifier (SVC).
- The parameters within the SVC object (e.g., C=1.0, kernel='rbf') represent its default settings, which may be overridden by the grid search.

4. param_grid={'C': (0.25, 1.0), 'gamma': (1, 2), 'kernel': ('linear', 'rbf')}:
- The hyperparameter combinations being evaluated:
    - C: Regularization parameter values [0.25,1.0].
    - gamma: Kernel coefficient values [1,2].
    - kernel: Kernel types [linear ,  rbf].
- The grid search will test all possible combinations of these parameters (a total of $2 \times 2 \times 2 = 8$ combinations).

5. iid='deprecated':
- `iid` was a former option that weighted each fold's score by the fold size, on the assumption that the data are independent and identically distributed. It was deprecated in scikit-learn 0.22 and removed in 0.24. 
    - It is safe to ignore this unless you're using an older version of scikit-learn.

6. n_jobs=None:
- Specifies the number of CPU cores to use for parallel computation. None means it will run in serial mode on a single core.

7. pre_dispatch='2*n_jobs':
- Controls the number of jobs that get dispatched during parallel computation. 
    - Since n_jobs=None, this has no effect.

8. refit=True:
- After finding the best hyperparameter combination, the grid search automatically refits the model on the entire training dataset using those parameters.

9. return_train_score=False:
- If True, the results would include training scores in addition to validation scores. 
- Here, it is False, so only validation scores are calculated.

10. scoring=None:
- Indicates that the default scoring metric for the estimator (e.g., accuracy for classification) is used.

**This Configuration Means**

The grid search is tuning an SVM classifier with:
- Two values of C [0.25,1.0],
- Two values of gamma [1,2], and
- Two kernel types (linear and rbf).

Each of these $2 \times 2 \times 2 = 8$ combinations is evaluated using 5-fold cross-validation.

The performance of each combination is assessed using the default scoring metric (accuracy for classification).

The best-performing combination is automatically selected and refitted on the entire training dataset.
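
The number of model fits implied by this configuration can be counted up front. A short sketch, assuming scikit-learn's `ParameterGrid` helper and the default of 5 folds; the variable names are illustrative.

```python
from sklearn.model_selection import ParameterGrid

parameters = {"kernel": ["linear", "rbf"], "C": [0.25, 1.0], "gamma": [1, 2]}

n_combinations = len(ParameterGrid(parameters))  # every combination in the grid
n_folds = 5                                      # default when cv=None
n_cv_fits = n_combinations * n_folds             # fits during cross-validation
n_total_fits = n_cv_fits + 1                     # plus one final refit (refit=True)

print(n_combinations, n_cv_fits, n_total_fits)
```

The count grows multiplicatively with each added parameter, which is why large grids become expensive quickly.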

### **Reasons for not using squared loss function in classification problems**

1. It is sensitive to outliers.
2. It does not align with the probabilistic interpretation of classification tasks.
3. It fails to emphasize the separation of classes effectively.
4. Alternatives like cross-entropy or hinge loss are better suited for optimizing classification models, focusing on class separation and meaningful probabilities.

Non-robustness to Outliers
- In Classification: Misclassified points, especially outliers, can disproportionately influence the decision boundary, leading to poor generalization.
- Squared loss penalizes large errors quadratically, 
    - meaning that a few instances with large prediction errors can dominate the loss function.

Misalignment with Classification Goals
- Nature of Classification: Classification problems aim to predict discrete labels or probabilities for class membership, focusing on correctly separating classes.
- Squared Loss Behavior: Squared loss minimizes the difference between predicted and true values. 
    - In classification, true labels are usually encoded as 0 or 1, and predictions outside [0,1] are meaningless as probabilities. 
        - This can result in illogical outcomes for probabilities and suboptimal boundaries.

Poor Handling of Probabilities
- Probabilistic Interpretation: Classification models often interpret predictions as probabilities of class membership.
- Squared Loss Issues: It does not naturally account for the probabilistic nature of classification. 
    - Loss functions such as log-loss (cross-entropy) directly optimize for probability-based interpretations, ensuring that predictions align better with actual class probabilities.

Inappropriate Gradients for Classification
- Gradient Shape: The gradient of squared loss grows linearly with the error, regardless of whether the point is on the correct side of the boundary.
- Impact: In classification problems, points that are already classified correctly and confidently can still produce large gradients, leading to inefficient updates. 
    - Loss functions like hinge loss or cross-entropy loss prioritize the misclassified or uncertain points more effectively, which aligns with the goal of improving class separation.

Squared Loss Leads to Non-optimal Decision Boundaries
- Decision Boundary Nature: In classification, the goal is to maximize the margin between classes or ensure good separation.
- Squared Loss Focus: By trying to minimize the distance between predicted and true labels, squared loss tends to favor a compromise boundary, potentially leading to poorly separated classes, especially in non-linear classification problems.

Better Alternatives Exist
- Hinge Loss: Used in SVMs, it focuses on maximizing the margin and ensures only points near or across the boundary contribute to the loss.
- Cross-Entropy Loss: Used in logistic regression and neural networks, it optimizes probabilities directly, aligning well with classification goals.
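
A small numerical comparison makes the difference concrete. The sketch below assumes labels in {-1, +1} and a raw score $f$, and writes cross-entropy in its equivalent logistic form $\log(1 + e^{-y f})$; the function names and the chosen scores are illustrative.

```python
import numpy as np

def squared_loss(y, f):
    return (y - f) ** 2

def hinge_loss(y, f):
    return np.maximum(0.0, 1.0 - y * f)

def logistic_loss(y, f):
    # Cross-entropy for labels in {-1, +1}, written in terms of the raw score f.
    return np.log1p(np.exp(-y * f))

y = 1.0
scores = np.array([-2.0, 0.0, 1.0, 3.0])  # from badly wrong to confidently right

for f in scores:
    print(f"f={f:+.1f}  squared={squared_loss(y, f):.3f}  "
          f"hinge={hinge_loss(y, f):.3f}  logistic={logistic_loss(y, f):.3f}")
```

At $f = 3$ the point is classified correctly with a wide margin, yet squared loss still charges it, while hinge loss is zero and logistic loss is close to zero.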


### Advantages of SVMs
1. Effective in High-Dimensional Spaces
- SVMs perform well when the number of features is large relative to the number of observations.
- Example: Applications in text classification or genomics, where the feature space is often very high-dimensional.

2. Works Well with Clear Margins of Separation
- SVM aims to find the optimal hyperplane that maximizes the margin between classes, which ensures robust classification when classes are well-separated.
- Example: Binary classification tasks where the data is linearly separable.

3. Kernel Trick for Non-linear Data
- SVM uses the "kernel trick" to map non-linearly separable data into a higher-dimensional space where a linear separation is possible.
- Example: Radial Basis Function (RBF) and polynomial kernels can handle complex decision boundaries.

4. Regularization Through C Parameter
- The regularization parameter C controls the trade-off between maximizing the margin and minimizing classification errors, making SVMs flexible to different types of data distributions.
- Example: Adjusting C to avoid overfitting on small datasets or noisy data.

5. Robust to Overfitting (with Proper Tuning)
- By controlling the margin size and kernel functions, SVMs can generalize well, especially for small datasets.
- Example: SVMs often perform better than many other models when data has limited examples but a high feature count.

6. Effective for Outlier Detection
- SVM variants, such as one-class SVM, are used to detect anomalies by learning the boundaries of the majority class.
- Example: Fraud detection or network intrusion detection.

### Disadvantages of SVMs
1. High Computational Cost
- SVM training involves solving a convex optimization problem, which can become computationally expensive for large datasets.
- Example: For datasets with millions of samples, training can be significantly slower compared to models like logistic regression or decision trees.

2. Sensitive to Choice of Kernel
- The performance of SVM heavily depends on the choice of kernel function and its parameters (e.g., RBF kernel with parameters $\gamma$ and C).
- Example: Incorrect kernel choice may lead to poor performance or overfitting.

3. Inefficient for Large Datasets
- The training complexity of kernel SVMs scales between $O(n^2)$ and $O(n^3)$ in the number of samples $n$, making them less suitable for massive datasets.
- Example: SVM may struggle with datasets containing millions of instances compared to neural networks or gradient-boosted trees.

4. Difficulty Handling Noisy Data
- SVMs try to maximize the margin and are sensitive to mislabeled data points, which can shift the decision boundary significantly.
- Example: In datasets with a high degree of label noise, SVMs may underperform compared to models with robust loss functions.

5. Lack of Probabilistic Output
- SVMs do not naturally provide probabilities for predictions. While this can be approximated using Platt scaling (typically fitted with cross-validation), the results are not as interpretable as those of probabilistic models.
- Example: Logistic regression offers direct probabilities, which are more useful in some applications, like medical diagnosis.

6. Hyperparameter Tuning is Non-trivial
- Choosing the right values for hyperparameters like C, $\gamma$, and the kernel function often requires extensive grid search or cross-validation.
- Example: Poorly tuned parameters can lead to overfitting or underfitting, requiring careful experimentation.

7. Not Easily Scalable for Multiclass Problems
- SVMs are inherently binary classifiers. 
- For multiclass classification, strategies like 
    - one-vs-rest (OVR) or 
    - one-vs-one (OVO) must be used, adding complexity and computational cost.
- Example: For 10 classes, OVO requires $10 \times (10 - 1) / 2 = 45$ classifiers to be trained.
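
The classifier counts for the two strategies can be confirmed directly. This sketch assumes scikit-learn's explicit `OneVsOneClassifier` and `OneVsRestClassifier` wrappers and the bundled 10-class digits dataset; note that `SVC` also handles multiclass problems internally with a one-vs-one scheme, so the wrappers are used here only to make the counts visible.

```python
from sklearn.datasets import load_digits
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 10 classes (digits 0 to 9)

ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print("one-vs-one classifiers:", len(ovo.estimators_))   # 10 * 9 / 2 = 45
print("one-vs-rest classifiers:", len(ovr.estimators_))  # one per class = 10
```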

When to Use SVMs
- Best Use Cases:
    - High-dimensional datasets with clear margins of separation.
    - Small-to-medium-sized datasets with complex decision boundaries.
    - Applications where interpretability of the decision boundary is important (e.g., feature weights in a linear kernel).
- Not Ideal For:
    - Large datasets due to computational cost.
    - Noisy datasets where robust models like random forests or neural networks might outperform.
    - Problems requiring probabilistic outputs or interpretable probabilities.

##### Common use cases that require probabilistic outputs or interpretable probabilities, where SVMs perform poorly

Key Challenges with SVMs in Probabilistic Scenarios
- Calibration Issues: SVM probabilities (from methods like Platt Scaling) are often less reliable than probabilities from inherently probabilistic models.
- Interpretability: Decision boundaries and margins are not intuitive for users who need to interpret confidence levels.
- Actionable Insights: Many use cases (e.g., credit scoring, fraud detection) require actionable thresholds or prioritization, which hinge on well-calibrated probabilities.
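
The gap between raw SVM scores and probabilities can be seen side by side. The sketch below assumes scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` (Platt scaling) wrapped around an `SVC`; the synthetic dataset and all parameter values are assumptions for illustration, not a recommended configuration.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic binary data (an assumption for illustration).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A plain SVC only returns signed distances to the decision boundary.
svm = SVC(kernel="rbf", C=1.0).fit(X_train, y_train)
raw_scores = svm.decision_function(X_test[:5])

# Platt scaling: fit a sigmoid on held-out decision scores via 5-fold CV.
calibrated = CalibratedClassifierCV(SVC(kernel="rbf", C=1.0), method="sigmoid", cv=5)
calibrated.fit(X_train, y_train)
probabilities = calibrated.predict_proba(X_test[:5])

print("raw scores:", raw_scores)
print("calibrated P(class 1):", probabilities[:, 1])
```

The calibrated values are valid probabilities, but their quality depends on the calibration data, which is the reliability concern raised above.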

Medical Diagnosis
- Why Probabilities are Needed: In medical applications, probabilistic outputs help determine the likelihood of a disease or condition, allowing practitioners to weigh risks and make informed decisions.
    - Example: Predicting whether a patient has cancer with a 90% probability versus 55%.
- Why SVM Fails: SVM outputs are distances from the decision boundary, which don't naturally translate to probabilities. While calibration techniques like Platt Scaling can convert these into probabilities, they often yield less reliable and less interpretable probabilities than models like logistic regression.

Fraud Detection
- Why Probabilities are Needed: In fraud detection, probabilities allow for setting thresholds based on the acceptable level of risk. For instance, transactions with a probability of fraud >95% may trigger an immediate block, while transactions with 60%-80% may require manual review.
    - Example: Flagging fraudulent transactions on an e-commerce platform.
- Why SVM Fails: SVMs don't inherently provide probabilities for these thresholds, making it difficult to prioritize actions based on the confidence level of predictions. This lack of interpretability can lead to either overreaction (blocking too many transactions) or underreaction.

Customer Churn Prediction
- Why Probabilities are Needed: Businesses use churn probability to allocate resources effectively, targeting high-probability churners with retention offers. Probabilities help prioritize interventions.
    - Example: Predicting that a customer has a 70% chance of leaving allows the company to offer personalized discounts or incentives.
- Why SVM Fails: SVM's non-probabilistic nature makes it hard to prioritize customers effectively. In contrast, logistic regression or gradient boosting models provide reliable churn probabilities, directly guiding resource allocation.

Marketing Campaign Effectiveness
- Why Probabilities are Needed: Campaign optimization often relies on the likelihood of conversion or engagement. For example, targeting customers with an 80% chance of responding to an ad is more efficient than targeting those with only 20%.
    - Example: Predicting the probability that a customer will click on an ad or make a purchase.
- Why SVM Fails: SVM outputs distances, not probabilities, making it harder to assign confidence levels to predictions. This lack of probabilistic output complicates the ranking of prospects for targeted campaigns.

##### 7. Clustering Models (e.g., K-Means)
What It Means: 
- Clustering groups similar data points together without predefined labels, often used for segmenting customers or finding patterns.

Outcome Interpretation: 
- Each cluster represents a natural grouping in the data, with data points in the same cluster sharing similar characteristics.

Performance Measures:
- Silhouette Score: Measures how well each point fits within its cluster; values closer to 1 indicate better-defined clusters.
- Within-Cluster Sum of Squares (WCSS): Measures the compactness of clusters; lower values are better.

Lay Explanation: 
- Clustering is like sorting items into bins based on similarity, helping us identify groups in our data.

Use Case: 
- To group similar observations without predefined labels.

Model Types: 
- K-Means, 
- Hierarchical Clustering, 
- DBSCAN.

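```python
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3)     # number of clusters is chosen in advance
clusters = kmeans.fit_predict(X)  # X: feature matrix; returns one cluster label per row
```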
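
The two performance measures listed above can be computed for several candidate cluster counts. This is a minimal sketch, assuming synthetic data from `make_blobs` with three true groups; WCSS is read from the fitted model's `inertia_` attribute, and the range of k values is an arbitrary choice.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with three groups (an assumption for illustration).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)

wcss, silhouette = {}, {}
for k in (2, 3, 4, 5):
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = kmeans.fit_predict(X)
    wcss[k] = kmeans.inertia_                  # within-cluster sum of squares
    silhouette[k] = silhouette_score(X, labels)
    print(f"k={k}: WCSS={wcss[k]:.1f}, silhouette={silhouette[k]:.3f}")
```

WCSS keeps falling as k grows, so it is usually read for an "elbow", whereas the silhouette score can be compared across k directly.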

##### 8. Principal Component Analysis (PCA)
What It Means: 
- PCA reduces the number of variables in the data by finding combinations of variables that capture the most information (variance).

Outcome Interpretation: 
- Each "principal component" explains a percentage of the total variance, helping simplify the data without losing much information.

Performance Measures:
- Explained Variance Ratio: Shows how much information each principal component holds; higher is better.
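
A short sketch of reading the explained variance ratio, assuming the iris dataset as example data and a 95% cumulative-variance target chosen only for illustration:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so that no feature dominates purely because of its scale
X_iris = load_iris().data
X_scaled = StandardScaler().fit_transform(X_iris)

# Fit PCA with all components to inspect how the variance is distributed
pca_full = PCA().fit(X_scaled)
ratios = pca_full.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative ratio reaches the target
target = 0.95  # illustrative threshold, not a universal rule
n_keep = int(np.searchsorted(cumulative, target) + 1)

print("ratio per component:", np.round(ratios, 4))
print("cumulative:", np.round(cumulative, 4))
print("components needed for 95%:", n_keep)
```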

Lay Explanation: 
- PCA is like summarizing a book by keeping only the most important points, making data easier to work with without losing key insights.

Use Case: 
- Dimensionality reduction while retaining the most critical information.

In [None]:
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(X)

##### 9. Bayesian Models
What It Means: 
- Bayesian models incorporate prior knowledge or beliefs with the data to update the probability of outcomes as new evidence is available.

Outcome Interpretation: 
- Each output is a probability distribution reflecting both prior knowledge and the new data, offering a range of likely outcomes.

Performance Measures:
- Log-Likelihood: Measures how well the model explains the data; higher values indicate better fit.

Lay Explanation: 
- Bayesian models are like revising a guess based on new evidence: we update our beliefs as we get more information.
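
The idea of updating a belief can be illustrated with the simplest conjugate case, a Beta prior on a churn rate combined with binomial data. The prior `Beta(2, 8)` and the observed counts below are made-up numbers, and the helper names are hypothetical.

```python
def update_beta(prior_a, prior_b, successes, failures):
    """Posterior Beta parameters after observing binomial data."""
    return prior_a + successes, prior_b + failures


def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)


# Prior belief: churn rate around 20% (Beta(2, 8) has mean 0.2)
prior = (2, 8)

# New evidence (made-up): 30 of 100 observed customers churned
posterior = update_beta(*prior, successes=30, failures=70)

print("prior mean:    ", beta_mean(*prior))
print("posterior mean:", round(beta_mean(*posterior), 4))
```

The posterior mean lands between the prior mean and the observed rate of 30%, and it moves closer to the data as more observations arrive.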

Use Case: 
- To incorporate prior knowledge and quantify uncertainty.

Model Types: 
- Bayesian Linear Regression, 
- Bayesian Networks.



In [None]:
import pymc3 as pm

with pm.Model() as model:
    alpha = pm.Normal('alpha', mu=0, sigma=1)
    beta = pm.Normal('beta', mu=0, sigma=1, shape=len(X_train.columns))
    epsilon = pm.HalfNormal('epsilon', sigma=1)
    mu = alpha + pm.math.dot(X_train, beta)
    y_pred = pm.Normal('y_pred', mu=mu, sigma=epsilon, observed=y_train)
    trace = pm.sample(2000)

##### 10. Survival Analysis (e.g., Cox Proportional Hazards)
What It Means: 
- Survival analysis predicts the time until an event occurs, such as customer churn or equipment failure.

Outcome Interpretation: 
- Each output shows the likelihood of the event happening over time, considering various risk factors.

Performance Measures:
- Concordance Index (C-Index): Measures the model's ability to correctly rank predictions; values closer to 1 indicate better performance.
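
To make the ranking idea concrete, here is a minimal pure-Python sketch of the concordance index. It assumes higher risk scores mean an earlier expected event, skips pairs with tied times, and uses made-up data; library implementations such as the one in `lifelines` handle more edge cases and may use the opposite score convention.

```python
def concordance_index(times, risk_scores, events):
    """Share of comparable pairs in which the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i had the event before time j
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied scores count as half
    return concordant / comparable


# Made-up example: the third subject is censored (event = 0)
times = [5, 10, 15, 20]
events = [1, 1, 0, 1]
risk = [0.9, 0.6, 0.7, 0.2]

c_index = concordance_index(times, risk, events)
print(c_index)  # 4 of 5 comparable pairs are ranked correctly -> 0.8
```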

Lay Explanation: 
- Survival analysis is like tracking how long something will last, based on factors that might speed it up or slow it down.

Use Case: 
- For time-to-event data, such as time until a customer churns or equipment fails.

Model Types: 
- Kaplan-Meier estimator, 
- Cox Proportional Hazards Model.
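
As a rough illustration of the Kaplan-Meier estimator, the sketch below multiplies, at each observed event time, the fraction of at-risk subjects that survive it. The function name and the data are hypothetical; in practice a library such as `lifelines` would normally be used.

```python
def kaplan_meier(durations, events):
    """Return {event_time: estimated survival probability} (illustrative helper)."""
    survival = 1.0
    curve = {}
    for t in sorted(set(durations)):
        # Subjects still under observation just before time t
        at_risk = sum(1 for d in durations if d >= t)
        # Events (not censorings) that happen exactly at time t
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e == 1)
        if deaths > 0:
            survival *= 1 - deaths / at_risk
            curve[t] = survival
    return curve


# Made-up data: event = 1 means the event was observed, 0 means censored
durations = [2, 3, 3, 5, 8]
events = [1, 1, 0, 1, 0]

km_curve = kaplan_meier(durations, events)
for t, s in km_curve.items():
    print(f"S({t}) = {s:.2f}")
```

Censored subjects never trigger a drop in the curve, but they still count in the at-risk set until their censoring time.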

In [None]:
from lifelines import CoxPHFitter
cph = CoxPHFitter()
cph.fit(data, 'time', event_col='event')
cph.predict_survival_function(data)

##### 4. Time Series Models (e.g., ARIMA)
What It Means: 
- Time series models account for:
    - trends, 
    - seasonality, and 
    - temporal dependencies in data collected over time, often used for forecasting future values.

Outcome Interpretation: 
- Each prediction is based on patterns in past data points, accounting for recent trends and cycles.

Performance Measures:
- Mean Absolute Percentage Error (MAPE): Shows the average prediction error in percentage terms.
- Root Mean Squared Error (RMSE): Measures the prediction accuracy; lower values mean better predictions.
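
Both error measures are easy to compute by hand. The sketch below uses NumPy and made-up actual and forecast values; note that MAPE is undefined when an actual value is zero.

```python
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100


def rmse(actual, forecast):
    """Root mean squared error, in the same units as the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.sqrt(np.mean((actual - forecast) ** 2))


# Made-up example values
actual_values = [100, 200, 400]
forecast_values = [110, 180, 400]

mape_value = mape(actual_values, forecast_values)
rmse_value = rmse(actual_values, forecast_values)
print(f"MAPE: {mape_value:.2f}%")
print(f"RMSE: {rmse_value:.2f}")
```

MAPE is scale-free, which makes it convenient for comparing series of different magnitudes, while RMSE penalizes large individual errors more heavily.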

Lay Explanation: 
- Time series models are like weather forecasts: they predict future values based on past patterns, like trends and cycles.

Use Case: 
- Forecasting for data with a temporal component (e.g., sales data, stock prices).

Model Types: 
- ARIMA, 
- SARIMA, 
- Exponential Smoothing.

In [None]:
from statsmodels.tsa.arima.model import ARIMA
model = ARIMA(time_series_data, order=(1,1,1))
model_fit = model.fit()
predictions = model_fit.forecast(steps=10)

# Metrics

In [None]:
# Functions to compute True Positives, True Negatives, False Positives and False Negatives

def true_positive(y_true, y_pred):
    tp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 1:
            tp += 1
    return tp

def true_negative(y_true, y_pred):
    tn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 0:
            tn += 1        
    return tn

def false_positive(y_true, y_pred):
    fp = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 0 and yp == 1:
            fp += 1       
    return fp

def false_negative(y_true, y_pred):
    fn = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == 1 and yp == 0:
            fn += 1        
    return fn

In [None]:
import numpy as np

# cnf_matrix: multi-class confusion matrix (rows = true classes, columns = predicted classes)
FP = cnf_matrix.sum(axis=0) - np.diag(cnf_matrix)
FN = cnf_matrix.sum(axis=1) - np.diag(cnf_matrix)
TP = np.diag(cnf_matrix)
TN = cnf_matrix.sum() - (FP + FN + TP)

FP = FP.astype(float)
FN = FN.astype(float)
TP = TP.astype(float)
TN = TN.astype(float)

# Sensitivity, hit rate, recall, or true positive rate
TPR = TP/(TP+FN)
# Specificity or true negative rate
TNR = TN/(TN+FP)
# Precision or positive predictive value
PPV = TP/(TP+FP)
# Negative predictive value
NPV = TN/(TN+FN)
# Fall out or false positive rate
FPR = FP/(FP+TN)
# False negative rate
FNR = FN/(TP+FN)
# False discovery rate
FDR = FP/(TP+FP)
# Overall accuracy for each class
ACC = (TP+TN)/(TP+FP+FN+TN)

In [None]:
# Implementation for the table of metrics:
import math

import pandas as pd
import sklearn.metrics
from sklearn.metrics import confusion_matrix


def matrix_metrix(real_values, pred_values, beta):
    # Binary confusion matrix: rows = true labels, columns = predicted labels
    CM = confusion_matrix(real_values, pred_values)
    TN = CM[0][0]
    FN = CM[1][0]
    TP = CM[1][1]
    FP = CM[0][1]
    Population = TN+FN+TP+FP
    # Prevalence is the share of actual positives (TP + FN) in the population
    Prevalence = round( (TP+FN) / Population,2)
    Accuracy   = round( (TP+TN) / Population,4)
    Precision  = round( TP / (TP+FP),4 )
    NPV        = round( TN / (TN+FN),4 )
    FDR        = round( FP / (TP+FP),4 )
    FOR        = round( FN / (TN+FN),4 )
    check_Pos  = Precision + FDR   # should equal 1
    check_Neg  = NPV + FOR         # should equal 1
    Recall     = round( TP / (TP+FN),4 )
    FPR        = round( FP / (TN+FP),4 )
    FNR        = round( FN / (TP+FN),4 )
    TNR        = round( TN / (TN+FP),4 )
    check_Pos2 = Recall + FNR      # should equal 1
    check_Neg2 = FPR + TNR         # should equal 1
    LRPos      = round( Recall/FPR,4 )
    LRNeg      = round( FNR / TNR ,4 )
    DOR        = round( LRPos/LRNeg)
    F1         = round ( 2 * ((Precision*Recall)/(Precision+Recall)),4)
    FBeta      = round ( (1+beta**2)*((Precision*Recall)/((beta**2 * Precision)+ Recall)) ,4)
    MCC        = round ( ((TP*TN)-(FP*FN))/math.sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN))  ,4)
    BM         = Recall+TNR-1
    MK         = Precision+NPV-1

    mat_met = pd.DataFrame({'Metric':['TP','TN','FP','FN','Prevalence','Accuracy','Precision','NPV','FDR','FOR','check_Pos','check_Neg','Recall','FPR','FNR','TNR','check_Pos2','check_Neg2','LR+','LR-','DOR','F1','FBeta','MCC','BM','MK'],
                            'Value':[TP,TN,FP,FN,Prevalence,Accuracy,Precision,NPV,FDR,FOR,check_Pos,check_Neg,Recall,FPR,FNR,TNR,check_Pos2,check_Neg2,LRPos,LRNeg,DOR,F1,FBeta,MCC,BM,MK]})

    return mat_met

In [None]:
# ROC Implementation

import sklearn.metrics
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from matplotlib import pyplot

# prob_values: predicted probabilities (or scores) for the positive class
fpr, tpr, thresholds = roc_curve(real_values, prob_values)

auc = roc_auc_score(real_values, prob_values)
print('AUC: %.3f' % auc)

pyplot.plot(fpr, tpr, linestyle='--', label='Roc curve')
pyplot.xlabel('False Positive Rate')
pyplot.ylabel('True Positive Rate')
pyplot.legend()
pyplot.show()

# Precision-recall implementation

precision, recall, thresholds = sklearn.metrics.precision_recall_curve(real_values, prob_values)

pyplot.plot(recall, precision, linestyle='--', label='Precision versus Recall')
pyplot.xlabel('Recall')
pyplot.ylabel('Precision')
pyplot.legend()
pyplot.show()

In [None]:
# Function to get many metrics directly from sklearn

def sk_metrix(real_values, pred_values, beta):
    Accuracy = round( sklearn.metrics.accuracy_score(real_values,pred_values) ,4)
    Precision= round( sklearn.metrics.precision_score(real_values,pred_values),4 )
    Recall   = round( sklearn.metrics.recall_score(real_values,pred_values),4 )
    F1       = round ( sklearn.metrics.f1_score(real_values,pred_values),4)
    # beta must be passed as a keyword argument in recent scikit-learn versions
    FBeta    = round ( sklearn.metrics.fbeta_score(real_values,pred_values,beta=beta) ,4)
    MCC      = round ( sklearn.metrics.matthews_corrcoef(real_values,pred_values)  ,4)
    Hamming  = round ( sklearn.metrics.hamming_loss(real_values,pred_values) ,4)
    Jaccard  = round ( sklearn.metrics.jaccard_score(real_values,pred_values) ,4)
    Prec_Avg = round ( sklearn.metrics.average_precision_score(real_values,pred_values) ,4)
    Accu_Avg = round ( sklearn.metrics.balanced_accuracy_score(real_values,pred_values) ,4)

    mat_met = pd.DataFrame({
        'Metric': ['Accuracy','Precision','Recall','F1','FBeta','MCC','Hamming','Jaccard','Precision_Avg','Accuracy_Avg'],
        'Value': [Accuracy,Precision,Recall,F1,FBeta,MCC,Hamming,Jaccard,Prec_Avg,Accu_Avg]})

    return mat_met


In [None]:
# Evaluation Metrics For Multi-class Classification

def accuracy(y_true, y_pred):
    
    """
    Function to calculate accuracy
    -> param y_true: list of true values
    -> param y_pred: list of predicted values
    -> return: accuracy score
    
    """
    
    # Initializing variable to store count of correctly predicted classes
    correct_predictions = 0
    for yt, yp in zip(y_true, y_pred):
        if yt == yp:
            correct_predictions += 1
    #returns accuracy
    return correct_predictions / len(y_true)

In [None]:
# Computation of macro-averaged precision
import numpy as np
from sklearn import metrics

def macro_precision(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize precision to 0
    precision = 0
    
    # loop over all classes
    for class_ in np.unique(y_true):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        # keep adding precision for all classes
        precision += temp_precision
        
    # calculate and return average precision over all classes
    precision /= num_classes
    
    return precision

print(f"Macro-averaged Precision score : {macro_precision(y_test, y_pred) }")

# implement macro-averaged precision using sklearn
macro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged Precision score using sklearn library : {macro_averaged_precision}")

In [None]:
#Computation of micro-averaged precision

def micro_precision(y_true, y_pred):


    # find the number of classes 
    num_classes = len(np.unique(y_true))
    
    # initialize tp and fp to 0
    tp = 0
    fp = 0
    
    # loop over all classes
    for class_ in np.unique(y_true):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false positive for current class
        # and update overall fp
        fp += false_positive(temp_true, temp_pred)
        
    # calculate and return overall precision
    precision = tp / (tp + fp)
    return precision

print(f"Micro-averaged Precision score : {micro_precision(y_test, y_pred)}")


# implement micro-averaged precision using sklearn
micro_averaged_precision = metrics.precision_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged Precision score using sklearn library : {micro_averaged_precision}")

In [None]:
#Computation of macro-averaged recall

def macro_recall(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize recall to 0
    recall = 0
    
    # loop over all classes
    for class_ in np.unique(y_true):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # keep adding recall for all classes
        recall += temp_recall
        
    # calculate and return average recall over all classes
    recall /= num_classes
    
    return recall

print(f"Macro-averaged recall score : {macro_recall(y_test, y_pred)}")


# implement macro-averaged recall using sklearn

macro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'macro')
print(f"Macro-averaged recall score using sklearn : {macro_averaged_recall}")


In [None]:
#Computation of micro-averaged recall

def micro_recall(y_true, y_pred):


    # find the number of classes 
    num_classes = len(np.unique(y_true))
    
    # initialize tp and fn to 0
    tp = 0
    fn = 0
    
    # loop over all classes
    for class_ in np.unique(y_true):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        # calculate true positive for current class
        # and update overall tp
        tp += true_positive(temp_true, temp_pred)
        
        # calculate false negative for current class
        # and update overall fn
        fn += false_negative(temp_true, temp_pred)
        
    # calculate and return overall recall
    recall = tp / (tp + fn)
    return recall

print(f"Micro-averaged recall score : {micro_recall(y_test, y_pred)}")


#  implement micro-averaged recall using sklearn

micro_averaged_recall = metrics.recall_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged recall score using sklearn library : {micro_averaged_recall}")

In [None]:
#Computation of macro-averaged f1 score

def macro_f1(y_true, y_pred):

    # find the number of classes
    num_classes = len(np.unique(y_true))

    # initialize f1 to 0
    f1 = 0
    
    # loop over all classes
    for class_ in np.unique(y_true):
        
        # all classes except current are considered negative
        temp_true = [1 if p == class_ else 0 for p in y_true]
        temp_pred = [1 if p == class_ else 0 for p in y_pred]
        
        
        # compute true positive for current class
        tp = true_positive(temp_true, temp_pred)
        
        # compute false negative for current class
        fn = false_negative(temp_true, temp_pred)
        
        # compute false positive for current class
        fp = false_positive(temp_true, temp_pred)
        
        
        # compute recall for current class
        temp_recall = tp / (tp + fn + 1e-6)
        
        # compute precision for current class
        temp_precision = tp / (tp + fp + 1e-6)
        
        
        temp_f1 = 2 * temp_precision * temp_recall / (temp_precision + temp_recall + 1e-6)
        
        # keep adding f1 score for all classes
        f1 += temp_f1
        
    # calculate and return average f1 score over all classes
    f1 /= num_classes
    
    return f1


print(f"Macro-averaged f1 score : {macro_f1(y_test, y_pred)}")


# implement macro-averaged F1 score using sklearn

macro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'macro')
print(f"Macro-Averaged F1 score using sklearn library : {macro_averaged_f1}")

In [None]:
# Computation of micro-averaged f1 score

def micro_f1(y_true, y_pred):


    #micro-averaged precision score
    P = micro_precision(y_true, y_pred)

    #micro-averaged recall score
    R = micro_recall(y_true, y_pred)

    #micro averaged f1 score
    f1 = 2*P*R / (P + R)    

    return f1

print(f"Micro-averaged f1 score : {micro_f1(y_test, y_pred)}")


# implement micro-averaged F1 score using sklearn

micro_averaged_f1 = metrics.f1_score(y_test, y_pred, average = 'micro')
print(f"Micro-Averaged F1 score using sklearn library : {micro_averaged_f1}")
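
One property worth checking: for single-label multi-class problems, micro-averaged precision, recall, and F1 all coincide with plain accuracy, because every misclassified sample counts once as a false positive (for the predicted class) and once as a false negative (for the true class). A minimal sketch with made-up labels:

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels for a 3-class problem
y_true_demo = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred_demo = [0, 2, 2, 2, 1, 0, 1, 1]

acc = accuracy_score(y_true_demo, y_pred_demo)
micro_p = precision_score(y_true_demo, y_pred_demo, average="micro")
micro_r = recall_score(y_true_demo, y_pred_demo, average="micro")
micro_f = f1_score(y_true_demo, y_pred_demo, average="micro")

# 6 of the 8 predictions are correct, so all four values are 0.75
print(acc, micro_p, micro_r, micro_f)
```

Macro averages, by contrast, weight every class equally and therefore diverge from accuracy when classes are imbalanced.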


In [None]:
# ROC AUC computation (one-vs-rest, per class)

from sklearn.metrics import roc_auc_score

def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):
    
    #creating a set of all the unique classes using the actual class list
    unique_class = set(actual_class)
    roc_auc_dict = {}
    for per_class in unique_class:
        
        #creating a list of all the classes except the current class 
        other_class = [x for x in unique_class if x != per_class]

        #marking the current class as 1 and all other classes as 0
        new_actual_class = [0 if x in other_class else 1 for x in actual_class]
        new_pred_class = [0 if x in other_class else 1 for x in pred_class]

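        #using the sklearn metrics method to calculate the roc_auc_score
        #(with hard class labels the ROC curve has a single operating point, so this equals balanced accuracy)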
        roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
        roc_auc_dict[per_class] = roc_auc

    return roc_auc_dict

roc_auc_dict = roc_auc_score_multiclass(y_test, y_pred)
roc_auc_dict
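
When the classifier can output class probabilities, scikit-learn's `roc_auc_score` can compute a multi-class AUC directly through its `multi_class` argument. The sketch below assumes the iris dataset and a logistic regression model purely for illustration; the variable names are not from the cells above.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Example data and model (assumptions for illustration)
X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_iris, y_iris, test_size=0.5, random_state=0, stratify=y_iris
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# One probability column per class; rows sum to 1
proba = clf.predict_proba(X_te)

# One-vs-rest AUC averaged over classes with equal weight
macro_auc = roc_auc_score(y_te, proba, multi_class="ovr", average="macro")
print(f"Macro-averaged one-vs-rest AUC: {macro_auc:.3f}")
```

Using probabilities rather than hard predictions lets the AUC reflect the full ranking quality of the model across all thresholds.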

In [None]:
# ROC implementation: 

import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.metrics import roc_curve, auc
from sklearn.multiclass import OneVsRestClassifier
from itertools import cycle
plt.style.use('ggplot')

# Load the iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y_bin = label_binarize(y, classes=[0, 1, 2])
n_classes = y_bin.shape[1]

# We split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y_bin, test_size= 0.5, random_state=0)


# We define the model as an SVC in a OneVsRestClassifier setting.
# This means one binary classifier is fitted per class, each separating that class
# from the other two (class 1 vs rest, class 2 vs rest and class 3 vs rest).
# So, we have 3 cases at the end and within each case, the decision threshold is varied
# in order to get the ROC curve of the given case: 3 ROC curves as output.

classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True, random_state=0))
y_score = classifier.fit(X_train, y_train).decision_function(X_test)
# Plotting and estimation of FPR, TPR
fpr = dict()
tpr = dict()
roc_auc = dict()
for i in range(n_classes):
    fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])
    roc_auc[i] = auc(fpr[i], tpr[i])
colors = cycle(['blue', 'red', 'green'])

for i, color in zip(range(n_classes), colors):
    plt.plot(fpr[i], tpr[i], color=color, lw=1.5,
             label='ROC curve of class {0} (area = {1:0.2f})'.format(i+1, roc_auc[i]))

# Draw the chance diagonal and the labels once, so all curves share one figure
plt.plot([0, 1], [0, 1], 'k-', lw=1.5)
plt.xlim([-0.05, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic for multi-class data')
plt.legend(loc="lower right")
plt.show()