# Solution Seekers Group

Lead of the Study Group Discussion: **Badr Bensassi**

Author: **Youssef Laouina**


> **“In summary, both intuition and empirical evidence suggest that when you have heteroscedasticity, ordinary least squares standard errors, confidence intervals, and hypothesis tests are unreliable. The heteroscedasticity-robust standard errors can provide more reliable inferences.”**

*Francis J. Anscombe*

# 1. Introduction

## Define heteroscedasticity

**Understanding Heteroscedasticity in Statistical Analysis**

Heteroscedasticity is a term used in statistics to describe a situation where the variability of a variable is unequal across the range of values of a second variable that predicts it. It comes from the Greek words "hetero," meaning different, and "skedasis," meaning dispersion or spreading.

In the context of statistical analysis, particularly in machine learning using Ordinary Least Squares (OLS), the concept of heteroscedasticity challenges the assumption made by OLS that the variance of the error term remains constant. OLS assumes that for all observations, the variance of the error term, denoted as ε, is consistent, a condition known as homoscedasticity. However, if the error terms do not exhibit this constant variance, they are described as heteroscedastic. 

This inconsistency in the spread or dispersion of errors leaves the OLS coefficient estimates unbiased, but it makes them inefficient and biases their standard errors, which undermines confidence intervals and hypothesis tests. Thus, identifying and addressing heteroscedasticity is crucial for improving the accuracy and reliability of statistical models, particularly in predictive analytics and machine learning applications.
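
To make the definition concrete, here is a minimal sketch on synthetic data (the sample size, seed, and error scales are arbitrary assumptions, not values from this notebook). It draws one set of errors with constant spread and one whose spread grows with `x`, then compares the spread for small and large `x`.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.uniform(1, 10, size=n)

# Homoscedastic errors: the same standard deviation for every observation
eps_homo = rng.normal(0, 2.0, size=n)
# Heteroscedastic errors: the standard deviation grows in proportion to x
eps_hetero = rng.normal(0, 0.5 * x)

def spread_by_half(x, eps):
    """Standard deviation of the errors for small-x and large-x observations."""
    low = eps[x <= np.median(x)]
    high = eps[x > np.median(x)]
    return low.std(), high.std()

homo_low, homo_high = spread_by_half(x, eps_homo)
hetero_low, hetero_high = spread_by_half(x, eps_hetero)
print(f"homoscedastic:   low-x sd = {homo_low:.2f}, high-x sd = {homo_high:.2f}")
print(f"heteroscedastic: low-x sd = {hetero_low:.2f}, high-x sd = {hetero_high:.2f}")
```

In the homoscedastic case the two standard deviations should be close; in the heteroscedastic case the large-`x` half should be clearly more spread out.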


# 2. Impact on Regression Models 

**Exposing the importance of Heteroskedasticity through its consequences in Statistical Analysis**

Heteroskedasticity, while not resulting in biased parameter estimates, leads to several important consequences that affect the efficiency and reliability of Ordinary Least Squares (OLS) estimates:

- **Non-BLUE Estimates**: Although heteroskedasticity does not introduce bias into parameter estimates, OLS estimates are no longer Best Linear Unbiased Estimators (BLUE). This means that among all linear unbiased estimators, OLS no longer provides the estimate with the smallest variance. Depending on the nature of the heteroskedasticity, significance tests can be too liberal or too conservative. Allison explains this as OLS giving equal weight to all observations, even though those with larger disturbance variance carry less information than observations with smaller disturbance variance.

- **Biased Standard Errors**: Heteroskedasticity also leads to biased standard errors. This bias affects test statistics and confidence intervals derived from OLS estimates, potentially leading to incorrect conclusions about the significance of variables or the overall model.

- **Impact on Significance Tests**: Unless heteroskedasticity is severe, significance tests may remain relatively unaffected. This implies that OLS estimation can still be used without significant distortion in most cases. However, severe heteroskedasticity can introduce serious problems, affecting the reliability of parameter estimates and hypothesis testing outcomes.


<center><img src="../images/Heteroskedasticity_overall_effect.jpg" alt="Overall Effect of Heteroskedasticity" style="width: 800px;"/></center> 

## Causes of Heteroskedasticity:
- Errors increase with higher values of independent variables (IVs).
> For example, consider a model in which
annual family income is the IV and annual family expenditures on vacations is the DV.
Families with low incomes will spend relatively little on vacations, and the variations in
expenditures across such families will be small. But for families with large incomes, the
amount of discretionary income will be higher. The mean amount spent on vacations will be
higher, and there will also be greater variability among such families, resulting in
heteroskedasticity.

<center><img src="../images/heteroskedastic_dataset_graph.png" alt="Heteroskedasticity: visual representation" style="width: 800px;"/></center> 

**Note that, in this example, a high family income is a necessary but not sufficient condition
for large vacation expenditures. Any time a high value for an IV is a necessary but not
sufficient condition for an observation to have a high value on a DV, heteroskedasticity is
likely.**

- Errors may also increase with extreme values of IVs in either direction, e.g. with attitudes that range from extremely negative to extremely positive. This will produce something that looks like an hourglass shape.

<center><img src="../images/Hour_glass_shape_scatter_plot.png" alt="Hour Glass shape scatter plot of residuals" style="width: 800px;"/></center> 

- Measurement error in data collection can contribute to heteroskedasticity, e.g. some respondents might provide more accurate responses than others. (**Note** that this problem arises from the violation of another assumption, that variables are measured without error.)
- Subpopulation differences or interaction effects can lead to heteroskedasticity (e.g. the effect of income on expenditures differs for whites and blacks). (Again, the problem arises from violation of the assumption that no such differences exist or have already been incorporated into the model.)
> For example, in the following diagram suppose that **Z** stands for three different populations. At low values of *X*, the regression lines for each population are very close to each other. As *X* gets bigger, the regression lines get further and further apart. This means that the residual values will also get further and further apart.

<center><img src="../images/subpopulation_differences.png" alt="subpopulation differences in the dataset" style="width: 800px;"/></center> 


# 3. Detecting Heteroscedasticity

## Visual inspection


The main visual inspection tool for identifying heteroscedasticity is the residuals plot. However, there are several other visual tools that can provide insights into heteroscedasticity:

1. **Residuals Plot:** Shows the residuals against predicted values or independent variables.
2. **Scatterplot of Residuals:** Plots residuals against actual data points to identify patterns.
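3. **Variance Inflation Factor (VIF) Plot:** Indicates multicollinearity among predictor variables. It does not reveal heteroscedasticity by itself, but it is a useful companion diagnostic when checking a regression model.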
4. **Leverage Plot:** Shows the influence of each data point on the regression model.
5. **Standardized Residuals Plot:** Displays standardized residuals against predictor variables or fitted values.


## Formal tests like Breusch-Pagan test

In regression analysis, detecting heteroscedasticity is crucial for ensuring the reliability of statistical inference. Several formal tests have been developed to assess heteroscedasticity:

### Breusch-Pagan Test: 

The Breusch-Pagan test evaluates whether the variance of residuals changes systematically with the predictor variables in the model. This test involves the following steps:
1. Fit the regression model.
2. Obtain the residuals from the model.
3. Perform a regression of squared residuals on the independent variables.
4. Use the LM statistic ($n R^2$, chi-squared distributed under the null hypothesis) or the F-statistic from this regression to test for heteroscedasticity.

### White's Test: 

White's test is a general test for heteroscedasticity that does not require specifying the form the heteroscedasticity takes. The steps include:
1. Fit the regression model.
2. Obtain the residuals and square them.
3. Run a regression of squared residuals on the independent variables, their squares, and their cross-products.
4. Use the LM statistic ($n R^2$) or the F-statistic from this regression to assess heteroscedasticity.

### Goldfeld-Quandt Test: 

The Goldfeld-Quandt test is particularly useful when investigating heteroscedasticity by comparing variances of residuals between two subgroups of data. The procedure involves:
1. Sorting the data by the variable suspected of driving the variance and splitting it into two subgroups.
2. Fitting separate regression models to each subgroup.
3. Comparing the variances of residuals between subgroups using an F-test on their ratio.

### Park Test: 

The Park test assumes that the error variance is a power function of a predictor, $\sigma_i^2 = \sigma^2 X_i^{\beta}$, and checks this with a log-log auxiliary regression. The steps are as follows:
1. Fit the regression model.
2. Obtain the residuals.
3. Conduct a regression of the logarithm of the squared residuals on the logarithm of the predictor.
4. Utilize the significance of the slope coefficient from this regression to evaluate heteroscedasticity.

### Glejser Test: 

The Glejser test explores the relationship between absolute residuals and predictor variables to detect heteroscedasticity. The steps include:
1. Fit the regression model.
2. Obtain the residuals.
3. Run a regression of absolute residuals on independent variables.
4. Use the coefficients from this regression to assess heteroscedasticity.


# 4. Addressing Heteroscedasticity

## Transformations

Transformations modify variable scales or distributions to achieve a more uniform spread of residuals, aligning with regression assumptions. Common methods such as logarithmic, square root, reciprocal, power transformations, and Box-Cox transformations offer distinct strategies based on data characteristics and heteroscedasticity patterns.

In this part, we delve into the scientific rationale behind using transformations in heteroscedastic datasets, exploring their intuitive impact and providing step-by-step procedures for implementation.

### Logarithmic Transformation

#### Intuition:
- **Purpose:** Logarithmic transformation is employed to stabilize the variance of residuals, particularly when the variance increases exponentially with predictor variables.
- **Effect:** By compressing large values and spreading out small values, logarithmic transformation aims to achieve a more consistent spread of residuals across predictor variable values.

#### Procedure:
1. **Identify Variables:** Determine which predictor variable(s) contribute to heteroscedasticity.
2. **Apply Transformation:** Utilize the natural logarithm (ln) or base-10 logarithm (log10) to transform the identified variable(s).
3. **Model Adjustment:** Incorporate the transformed variables into your regression model.
4. **Validation:** Validate the effectiveness of the transformation through residual plots or formal tests for heteroscedasticity.

### Square Root Transformation

#### Intuition:
- **Purpose:** Square root transformation is suitable for stabilizing variance when the variance grows roughly in proportion to the mean, as is typical of count data.
- **Effect:** Taking the square root of a variable helps in moderating extreme values and promoting a more even dispersion of residuals.

#### Procedure:
1. **Identify Variables:** Determine which variable(s) exhibit heteroscedasticity based on variance trends.
2. **Apply Transformation:** Implement the square root transformation to the identified variable(s).
3. **Model Integration:** Integrate the transformed variables into your regression model.
4. **Evaluation:** Assess the transformation's efficacy using residual analysis or formal heteroscedasticity tests.
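
One setting where the square root is a common choice is count data, where the variance tends to grow with the mean. The sketch below uses simulated Poisson counts (the means and sample size are arbitrary assumptions) to show the variance before and after the transformation.

```python
import numpy as np

rng = np.random.default_rng(6)
means = [10, 40, 160]

raw_variances, sqrt_variances = [], []
for mu in means:
    counts = rng.poisson(mu, size=20_000)
    raw_variances.append(counts.var())            # grows with the mean
    sqrt_variances.append(np.sqrt(counts).var())  # roughly constant
    print(f"mean {mu:>3}: raw variance = {raw_variances[-1]:7.2f}, "
          f"sqrt variance = {sqrt_variances[-1]:.3f}")
```

The raw variances track the means, while the variances of the square roots stay close to one another.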

### Other Transformations
- **Reciprocal Transformation (1/x):** Useful for variables with inversely proportional relationships to variance.
- **Power Transformations (x^p):** Offers flexibility in adjusting variance based on the power (p) applied.
- **Box-Cox Transformation:** A parametric approach that optimizes the transformation to achieve a constant variance.
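
As an illustration of the Box-Cox idea, the sketch below applies `scipy.stats.boxcox` to synthetic log-normal data. Note the assumptions: SciPy chooses the power λ by maximum likelihood for a single positive variable under a normality criterion, so using it on a regression response is an approximation to the variance-stabilizing goal described above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic positive, right-skewed data (log-normal)
values = rng.lognormal(mean=2.0, sigma=1.0, size=5000)

# boxcox returns the transformed data and the estimated power lambda
transformed, lam = stats.boxcox(values)

print("estimated lambda:", lam)
print("skewness before:", stats.skew(values))
print("skewness after: ", stats.skew(transformed))
```

An estimated λ near 0 corresponds to the logarithmic transformation, λ near 0.5 to the square root, and λ near -1 to the reciprocal.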

### Considerations
- **Interpretation:** Be mindful of the interpretation changes introduced by transformations, especially in terms of coefficients and effect sizes.
- **Validation:** Always validate transformations through diagnostic checks to ensure they address heteroscedasticity effectively.
- **Theoretical Alignment:** Ensure transformations align with the theoretical understanding of relationships in your data to avoid misleading interpretations.


## Weighted least squares

Weighted least squares (WLS) regression is a method used to address heteroscedasticity. This method incorporates weights into the regression model, assigning higher weights to observations with lower variance and lower weights to observations with higher variance. By doing so, WLS aims to mitigate the impact of heteroscedasticity on the estimation process and improve the accuracy of parameter estimates.

**Scientific Intuition:**
- **Variance Adjustment:** WLS accounts for the varying levels of variance in residuals by adjusting the contribution of each observation based on its estimated variance.
- **Efficiency in Estimation:** By giving more weight to observations with lower variance, WLS enhances the efficiency of parameter estimation compared to ordinary least squares (OLS) regression under heteroscedasticity.

**Procedure:**
1. **Heteroscedasticity Assessment:** Utilize diagnostic methods or formal tests to confirm the presence of heteroscedasticity in the residuals of the OLS regression model.
2. **Weight Calculation:**
   Compute appropriate weights for each observation using the following formula:
   $\text{Weight}_i = \frac{1}{\sigma_i^2}$
   where $\sigma_i^2$ represents the estimated variance for observation $i$. This weight calculation assigns higher weights to observations with lower variance $\sigma_i^2$, reflecting their greater influence in the regression model. Lower variance observations contribute more significantly to parameter estimation in weighted least squares (WLS) regression, effectively addressing heteroscedasticity.

3. **Weighted Regression:** Implement the weighted regression using the WLS method, minimizing the sum of squared weighted residuals to account for varying impact of observations.
4. **Model Validation:** Validate the weighted regression model using diagnostic procedures such as residual plots to ensure effective mitigation of heteroscedasticity.


# 5. Example with Dataset

Importing our libraries

In [None]:
import warnings

import pandas  as pd 
import numpy as np 
import matplotlib.pyplot as plt 
import seaborn as sns

import statsmodels.formula.api as smf

warnings.simplefilter(action='ignore', category=FutureWarning)

## Importing our data into a Pandas DataFrame

In [None]:
df = pd.read_csv('../data/home_price.csv')  # the CSV file is in the repo's data folder

In [None]:
df.info()

## Exploratory data analysis

In [None]:
pd.options.display.float_format = "{:0.2f}".format

In [None]:
df.describe()

In [None]:
fig1, ax = plt.subplots(1, 2, figsize=(10, 5))
flat_ax = ax.flatten()

sns.histplot(data=df, x='Price', kde=True, color='orange', ax=flat_ax[0])
sns.histplot(data=df, x='House_size', kde=True, ax=flat_ax[1])

plt.show()

Both features `Price` and `House_size` look right-skewed. This information might help us later in our analysis.

In [None]:
sns.lmplot(data=df, x='House_size', y='Price', aspect=2, height=4)
plt.xlabel('House size: as Independent variable')
plt.ylabel('House Price: as Dependent variable')
plt.title('House Price Vs House Size')
plt.show()

From the plot above, we can observe that `Price` tends to scatter more widely as `House_size` gets larger. At first glance, this suggests that the residuals will show some heteroskedasticity.

## Model building using OLS from statsmodels library

In [None]:
feature_matrix = df.drop(columns='Price')
target_vector = df.Price

formula_str = target_vector.name + ' ~ ' + ' + '.join(feature_matrix)

model = smf.ols(formula=formula_str, data=df)
fitted = model.fit()

print(fitted.summary())

### Detect Heteroskedasticity using visual methods: Residuals plot

In [None]:
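fig2, ax = plt.subplots(figsize=(8, 5))

# Plot the residuals against the fitted values (not against the row index)
sns.scatterplot(x=fitted.fittedvalues, y=fitted.resid, color='green', s=80, ax=ax)
ax.axhline(0, color='red')

# ax.axhline(0.18, color='blue')
# ax.axhline(-0.18, color='blue')

ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')

plt.show()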

### Detect Heteroskedasticity using formal test:

The null hypothesis $( H_0 )$ in the context of heteroscedasticity tests is typically formulated as follows:

$H_0: \text{The errors/residuals exhibit homoscedasticity (constant variance).}$

In simpler terms, the null hypothesis states that there is no systematic difference in the variance of errors across different levels or values of the independent variables. If the p-value from the heteroscedasticity test is sufficiently small, we reject this null hypothesis in favor of the alternative hypothesis, indicating that there is evidence of heteroscedasticity present in the data.

To determine whether to reject or not reject the null hypothesis $( H_0)$ based on the p-value:

1. **Reject $( H_0 )$ (Evidence of Effect):**
   - Small p-value (p < α, where α is the significance level): Strong evidence against $( H_0 )$.
   - Indicates that observed data is unlikely under the assumption of $( H_0 )$.

2. **Do Not Reject $( H_0 )$ (Lack of Evidence of Effect):**
   - Large p-value (p ≥ α): Not enough evidence to conclude against $( H_0 )$.
   - Suggests that observed data is consistent with $( H_0 )$ or does not deviate significantly from expected values under $( H_0 )$.

In the context of heteroscedasticity tests:
- Small p-value: Evidence of heteroscedasticity; reject the assumption of constant variance.
- Large p-value: Insufficient evidence of heteroscedasticity; do not reject the assumption of constant variance.
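
The decision rule above can be wrapped in a small helper. The function name `interpret_het_test` and the default `alpha=0.05` are illustrative choices, not part of statsmodels.

```python
def interpret_het_test(p_value, alpha=0.05):
    """Translate a heteroscedasticity-test p-value into the decision rule above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie between 0 and 1")
    if p_value < alpha:
        return "reject H0: evidence of heteroscedasticity"
    return "do not reject H0: insufficient evidence of heteroscedasticity"

print(interpret_het_test(0.003))
print(interpret_het_test(0.40))
```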

In [None]:
from statsmodels.stats.diagnostic import het_breuschpagan, het_white, het_goldfeldquandt

In [None]:
# Perform Breusch-Pagan test
bp_test = het_breuschpagan(fitted.resid, fitted.model.exog)
print("Breusch-Pagan test p-value:", bp_test[1])

In [None]:
# Perform White test
white_test = het_white(fitted.resid, fitted.model.exog)
print("White test p-value:", white_test[1])

In [None]:
from statsmodels.api import OLS

# het_goldfeldquandt expects the response and the design matrix (not residuals).
# It sorts the observations by the column given in `idx`, splits them into two
# subgroups, fits a regression to each and compares their residual variances.
house_size_col = fitted.model.exog_names.index('House_size')

gq_fstat, gq_pvalue, _ = het_goldfeldquandt(
    fitted.model.endog, fitted.model.exog, idx=house_size_col, split=0.5
)
print("Goldfeld-Quandt Test F-statistic:", gq_fstat)
print("Goldfeld-Quandt Test p-value:", gq_pvalue)

>An F-statistic well above 1, together with a small p-value, indicates that the residual variance is larger in the subgroup of larger houses than in the subgroup of smaller houses.

In [None]:
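# Perform Glejser Test
# statsmodels has no built-in Glejser test, so run the auxiliary regression directly:
# absolute OLS residuals regressed on the original design matrix (intercept included).
glejser_aux = OLS(np.abs(fitted.resid), fitted.model.exog).fit()
print("Glejser Test F-statistic:", glejser_aux.fvalue)
print("Glejser Test p-value:", glejser_aux.f_pvalue)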

In [None]:
from statsmodels.api import add_constant

# Fit a linear regression model (OLS from statsmodels.api does not add an intercept itself)
design_matrix = add_constant(feature_matrix)
model = OLS(target_vector, design_matrix).fit()

# Obtain the residuals
residuals = model.resid

# Create absolute residuals
abs_residuals = np.abs(residuals)

# Run a regression of absolute residuals on independent variables
abs_residuals_model = OLS(abs_residuals, design_matrix).fit()

# Use the coefficients from this regression to assess heteroscedasticity
glejser_coeff = abs_residuals_model.params['House_size']


# Check the p-value of the Glejser test coefficient
glejser_pvalue = abs_residuals_model.pvalues['House_size']

print("Glejser Test Coefficient:", glejser_coeff)
print("Glejser Test p-value:", glejser_pvalue)

# Determine significance based on a chosen alpha level (e.g., 0.05)
alpha = 0.05
if glejser_pvalue < alpha:
    print("\nThe Glejser test coefficient is statistically significant.")
else:
    print("\nThe Glejser test coefficient is not statistically significant.")


> The Glejser test coefficient is statistically significant, as indicated by the extremely low p-value (close to zero). This implies a strong and meaningful relationship between the `House_size` variable and the variability of residuals in our regression model. Specifically, the positive Glejser test coefficient suggests that larger values of `House_size` are associated with larger absolute residuals, indicating increasing variability or heteroscedasticity as `House_size` increases.

In [None]:
fig3, ax = plt.subplots(figsize=(8, 5))

sns.scatterplot(x=df.House_size, y=abs_residuals, s=80, ax=ax)
plt.ylabel('Absolute Residuals')
plt.title('Relationship between House Size and Absolute Residuals\n')
plt.show()

<center><img src="../images/differences_log_base10-expo.png" alt="Graph showcasing the differences between a log base 10 plot, the exponential plot and the second bisectrix" style="width: 500px;"/></center> 

> "When an exponential trend is observed in the residuals plot, characterized by a changing spread of residuals that increases or decreases exponentially with predicted values, a log transformation is often best suited to stabilize variance and linearize the relationship in regression analysis."

# 6. Dealing with Heteroscedasticity

## Logarithmic Transformation

In [None]:
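from statsmodels.api import add_constant

# add_constant supplies the intercept that OLS omits by default
model_log = OLS(np.log10(df.Price), add_constant(np.log10(df.House_size))).fit()

print(model_log.summary())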

In [None]:
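fig4, ax = plt.subplots(figsize=(8, 5))

# Plot the residuals against the fitted values (not against the row index)
sns.scatterplot(x=model_log.fittedvalues, y=model_log.resid, color='green', s=80, ax=ax)
ax.axhline(0, color='red')

#ax.axhline(0.2, color='blue')
#ax.axhline(-0.2, color='blue')

ax.set_xlabel('Fitted values')
ax.set_ylabel('Residuals')
plt.title('Residuals vs. Fitted Values (After Log transformation)')

plt.show()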

Part II until next time.

~ Y.L