In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
import statsmodels.api as sm
import statsmodels.formula.api as smf

In [2]:
m = pd.read_csv("measurements.csv")

In [3]:
m.columns

Index(['ID', 'group', 'sex', 'birth_year', 'chol_init', 'chol_final',
       'trig_init', 'trig_final', 'weight_init', 'weight_final', 'weight_diff',
       'height_init', 'height_final', 'height_diff', 'armspan_init',
       'armspan_final', 'armspan_diff', 'arm_perimeter_init',
       'arm_periemter_final', 'arm_perimeter_diff', 'thorax_perimeter_init',
       'thorax_perimeter_final', 'thorax_perimeter_diff',
       'abdominal_perimeter_init', 'abdominal_perimeter_final',
       'abdominal_perimeter_diff', 'hip_perimeter_init', 'hip_perimeter_final',
       'hip_perimeter_diff', 'BMI_init', 'BMI_final', 'BMI_diff',
       'tricipital_fold_final', 'tricipital_fold_init', 'abdominal_fold_final',
       'abdominal_fold_init', 'subscapular_fold_final',
       'subscapular_fold_init'],
      dtype='object')

Regression analysis relies on several assumptions for its validity. These assumptions include:

1. **Linearity**: There should be a linear relationship between the independent variables (predictors) and the dependent variable (outcome). This means that a one-unit change in a predictor is associated with a constant change in the expected value of the outcome, holding the other predictors fixed.

2. **Homoscedasticity**: Also known as homogeneity of variance, this assumption states that the variance of the residuals should be constant across all levels of the predictors. In other words, the spread of residuals should remain consistent throughout the range of the predictors.

3. **Normality of residuals**: The residuals should be normally distributed. This assumption underpins the exact validity of hypothesis tests and confidence intervals for the regression coefficients. Violating it is usually less critical for large samples, because the central limit theorem makes the coefficient estimates approximately normal.

4. **No perfect multicollinearity**: There should not be perfect linear relationships among the independent variables. This means that one independent variable should not be a perfect linear combination of the others. High multicollinearity can inflate standard errors and lead to unstable estimates.

5. **No influential outliers**: Outliers or influential data points should not excessively influence the regression results. Outliers can distort parameter estimates and affect the overall fit of the model.

6. **Independence of residuals**: The residuals (the differences between observed and predicted values) should be independent of each other. This assumption is similar to the independence assumption in correlation tests.


When performing regression analysis on time series data, additional assumptions need to be considered to ensure the validity of the regression model. These assumptions include:

1. **Stationarity**: The time series data should be stationary, meaning that the statistical properties of the data (such as mean and variance) should remain constant over time. Non-stationarity can lead to spurious regression results and inaccurate parameter estimates.

2. **Absence of seasonality**: If the time series data exhibit seasonal patterns, appropriate seasonal adjustments should be made to account for them. Failure to address seasonality can lead to biased estimates and incorrect inferences.

3. **No autocorrelation (serial correlation)**: The residuals (errors) of the regression model should not be correlated with each other over time. Autocorrelation in the residuals indicates a systematic pattern that the model has not captured. This assumption is particularly important in time series data, where observations are often correlated with their past values.

The Omnibus, Durbin-Watson, and Jarque-Bera statistics, together with the condition number, are commonly used diagnostics for regression models; statsmodels reports all four in the OLS summary:

1. Omnibus Test:
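The Omnibus test reported in the statsmodels OLS summary is a test for normality of the residuals (the D'Agostino-Pearson test). It combines the skewness and kurtosis of the residuals into a single statistic which, under the null hypothesis of normality, approximately follows a chi-squared distribution with 2 degrees of freedom; the summary reports the corresponding p-value as Prob(Omnibus). The overall significance of the model is assessed separately by the F-statistic, which compares the residual sum of squares of the fitted model to that of a model with no predictors.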

2. Durbin-Watson Test:
The Durbin-Watson test is a test for autocorrelation in the residuals of a regression model. It assesses whether there is serial correlation between consecutive residuals. The Durbin-Watson statistic ranges from 0 to 4, with a value close to 2 indicating no significant autocorrelation. Values significantly below 2 indicate positive autocorrelation, while values significantly above 2 indicate negative autocorrelation.

3. Jarque-Bera Test:
The Jarque-Bera test is a test for normality of the residuals in a regression model. The Jarque-Bera statistic is calculated from the skewness and kurtosis of the residuals. Under the null hypothesis of normality, it asymptotically follows a chi-squared distribution with 2 degrees of freedom.

4. Condition Number:
The condition number is a measure of how sensitive a function (or system of equations) is to changes or errors in its input. In regression analysis it is used to assess multicollinearity among the predictor variables. A high condition number indicates that the matrix of predictor variables is ill-conditioned, meaning that small changes in the data could lead to large changes in the estimated coefficients. In practice it is computed from the singular value decomposition (SVD) of the predictor matrix, as the ratio of the largest to the smallest singular value. A value close to 1 indicates nearly orthogonal predictors. There is no universal cutoff: a commonly cited rule of thumb treats values above roughly 30 as a sign of notable multicollinearity, and statsmodels adds a warning to the OLS summary when the value exceeds 1000. Because the condition number depends on the scale of the predictors, a large value can also reflect predictors measured in very different units rather than multicollinearity alone.