# 1. Ask the Right Question:

The purpose of this notebook is to investigate the linear relationship (if any) between the independent variables [Height, Gender] and the dependent variable Weight. We then build a model that captures this relationship, validate it, interpret the results, and make some predictions.

# 2. Data Collection:
The second step in data analysis is to collect the data. In our case we're using a simple bivariate dataset that contains the weights (in lbs) and heights (in inches) of 10,000 observations, split between male and female adults.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import statsmodels.formula.api as smf
import seaborn as sns
import matplotlib.pyplot as plt

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df = pd.read_csv('../input/weight-height/weight-height.csv')
df.head()

# 3. Data Cleaning:
**First order** of business is to take a quick look at our dataset and check the following:   
### 3.1 No missing values


In [None]:
# 1. No missing values
print(df.isna().sum())

### 3.2 Outliers

Sometimes outliers are bad data and must be removed, and sometimes they're Michael Jordan or Shaquille O'Neal and must be kept!  

Either way, they must be identified first. To do so, we define outliers as observations that are: 
1. More than 3 standard deviations away from the mean.
2. Or more than 1.5 * Inter-Quartile-Range (IQR) below the 25th or above the 75th percentile (beyond the whiskers of the boxplot).

The current dataset contains two continuous variables (Height and Weight) and one categorical (discrete) variable (Gender). We can use a boxplot to visualize outliers, using the built-in feature in matplotlib, seaborn or pandas.

**More on outlier detection**   
Outlier detection methods by data type include:  
- Univariate -> boxplot. Anything outside 1.5 times the inter-quartile range is an outlier.  
- Bivariate -> scatterplot with confidence ellipse. Anything outside of, say, the 95% confidence ellipse is an outlier.  
- Multivariate -> Mahalanobis distance ($D^2$)   

Once defined, mark those observations as outliers. Then run a logistic regression to see if there are any systematic patterns.   
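
The two rules listed above can be sketched with NumPy alone. This is only an illustration on synthetic numbers standing in for a single column such as `df.Weight`; the sample values and the two injected extreme points are assumptions, not data from the notebook.

```python
import numpy as np

# Synthetic stand-in for one numeric column (assumed values, not the real data)
rng = np.random.default_rng(0)
values = np.append(rng.normal(loc=160, scale=30, size=1000), [400.0, -50.0])

# Rule 1: more than 3 standard deviations away from the mean
z_scores = (values - values.mean()) / values.std()
z_outliers = np.abs(z_scores) > 3

# Rule 2: more than 1.5 * IQR below the 25th or above the 75th percentile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

print(f"3-sigma rule flags {z_outliers.sum()} points")
print(f"1.5*IQR rule flags {iqr_outliers.sum()} points")
```

The boolean masks can be stored as an extra column, which is one way to "mark" observations before looking for systematic patterns.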

In [None]:
# 2. Outliers: 
fig, (ax1,ax2) = plt.subplots(2,1,figsize=(8,4))
ax1.grid()
sns.boxplot(x=df.Weight, y=df.Gender, ax=ax1)
ax2.grid()
sns.boxplot(x=df.Height, y=df.Gender, ax=ax2);

I don't see a reason why a 270 lbs male or a 55 inch female should be excluded from the dataset. Thus, I'm going to leave the dataset as is for now and revisit the outliers later from a predictive point of view, i.e. when plotting the residuals.

### 3.3 Balance between Genders

In general, balanced data tends to produce more accurate models. This holds for most regression types and machine learning algorithms.  

In [None]:
# 3. Distribution across genders is perfect: share of rows per gender (%)
# size() counts rows per group; nunique() would count distinct weights instead
df.groupby('Gender').size()/len(df)*100

In our case the data is split equally between Male and Female, which is ideal. If you're dealing with an imbalanced dataset, you can use WLS (weighted least squares regression) instead of OLS. This is outside the scope of this notebook.

### 3.4 Distribution of Variables (Optional)
Unlike the error terms, the dependent and independent variables don't need to be normally distributed. But if their distributions are very far from normal, it'll be difficult to fit a line that results in normally distributed errors (which is one of the assumptions of linear regression, see assumption 6 in section 4.1). That's why I like to check the distributions upfront, just to know what to expect. This is an optional step, but an extremely helpful one.   

The easiest way to do that in practice is to visualize the distribution using histogram:

In [None]:
fig,ax = plt.subplots(2,1,figsize=(8,6))
fig.suptitle('Distribution of Variables', fontsize=20)
sns.histplot(data=df, x='Weight', hue='Gender', ax=ax[0], stat='percent')
sns.histplot(data=df, x='Height', hue='Gender', ax=ax[1], stat='percent');

Both variables look approximately normally distributed within each gender. This will make our life easier when investigating some of the typical problems of linear regression. 

# 4. Analyzing the Data

This step should focus on answering the question asked in step (1): is there a linear relationship between the dependent and independent variables? That being said, and before we get to the regression itself, let's list the assumptions of a linear regression model:
### 4.1 Assumptions of a Linear Regression
1. There is a linear relationship between the dependent and independent variables
2. The independent variables are not random, and there is no exact linear relationship between them (no multicollinearity)
3. The expected value of the error term is zero: $E(\epsilon | x_i)=0$
4. The variance of the error term is constant for all observations, i.e. $E(\epsilon_i^2)=\sigma_{\epsilon}^2$ (homoscedasticity)
5. The error term of one observation is not correlated with that of another (no serial correlation)
6. The error term is normally distributed

We are going to test each of these assumptions after regressing the data!

### 4.2 Pre-Processing
#### 4.2.1 Split the Data

It's standard practice to split the data into training and test sets to see how the model behaves when presented with data it hasn't seen. I'm going to use pandas to do this. 

Another very common way to do it is with `train_test_split` from `sklearn.model_selection`, but I prefer pandas: it keeps both sets as DataFrames without an extra dependency, which makes the regression output much more user friendly!  

In [None]:
# Split the data into train and test sets using pandas

# Hold out a random 30% of the rows as the test set
df_test = df.sample(frac=0.3)
# Every row that was not sampled goes to the training set
# (dropping by index avoids a merge on all columns, which breaks if rows are duplicated)
df_train = df.drop(df_test.index)
print(f'Size of training set: {len(df_train)}, size of test set: {len(df_test)}\n')

#### 4.2.2 Let's make sure splitting didn't affect the data balance between Genders:

In [None]:
print("Gender count across Train Set:\n",df_train.groupby('Gender').size())
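
If a random split ever did disturb the balance, one option would be to sample within each gender. The sketch below assumes a toy DataFrame with made-up values (it is not the notebook's data) and uses pandas' `groupby(...).sample`, which needs pandas 1.1 or later.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the notebook's DataFrame (assumed values, 50/50 split)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Gender": ["Male"] * 500 + ["Female"] * 500,
    "Height": rng.normal(66, 4, 1000),
})

# Sample 30% of the rows inside each gender, then drop them to get the training set
toy_test = toy.groupby("Gender").sample(frac=0.3, random_state=0)
toy_train = toy.drop(toy_test.index)

print(toy_train.Gender.value_counts())
print(toy_test.Gender.value_counts())
```

Both sets keep the original 50/50 ratio by construction, at the cost of the split no longer being a simple random sample of rows.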

### 4.3 OLS regression model

In [None]:
# Ordinary Least Squares (OLS) regression

ols = smf.ols(formula="Weight ~ C(Gender) + Height", data=df_train)
res = ols.fit(cov_type='nonrobust')
print(res.summary())

# 5. Interpreting Regression Results
### 5.1 R-squared
R-squared/Adj. R-squared: 90.4% of the variability in the dependent variable can be explained by the independent variables. This is high enough for the training set, but let's see how well the model deals with data it hasn't seen (the test set).

In [None]:
from sklearn.metrics import r2_score

y_test_pred = res.predict(exog=df_test)
test_rsquare= r2_score(df_test.Weight, y_test_pred)
print(f"Test R-squared:{test_rsquare:.3f} ")

Test R-squared at 90.5% is as high as the train R-squared, which means the model generalizes well.
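
For reference, R-squared and its adjusted variant can be written out by hand. This is a minimal sketch on four made-up points; the helper names `r_squared` and `adjusted_r_squared` are hypothetical and not part of any library used in this notebook.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalise R^2 for the number of predictors (intercept not counted)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)

# Made-up observations and predictions, for illustration only
y = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 6.5, 9.5])

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(r2, n_obs=len(y), n_predictors=1)
print(r2, adj_r2)
```

With thousands of rows and only two predictors, as in this notebook, the adjustment is tiny, which is why the two values in the summary are nearly identical.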


### 5.2 F-statistic / Prob(F)
The F-statistic tests how well the independent variables, as a group, explain the variation in the dependent variable (i.e. it tests the null hypothesis $H_0: coef_1=coef_2=\dots=coef_n=0$).   
Prob(F) is the p-value of the F-statistic: the probability of seeing an F-statistic at least this large if the null hypothesis were true.  
In this case the F-statistic is very large and Prob(F) is effectively zero, so we can reject the null hypothesis, which is great!  
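
The overall F-statistic can also be recovered from R-squared, which shows why it becomes so large with many observations. A minimal sketch, assuming a model with an intercept; the helper `f_statistic` is hypothetical, and the inputs are the approximate figures quoted in this notebook (R-squared of about 0.904, roughly 7,000 training rows, two predictors).

```python
def f_statistic(r2, n_obs, n_predictors):
    """Overall F-statistic of a regression with an intercept."""
    explained = r2 / n_predictors                           # explained share per predictor
    unexplained = (1.0 - r2) / (n_obs - n_predictors - 1)   # residual share per degree of freedom
    return explained / unexplained

# Approximate values from this notebook; the exact figures depend on the random split
f_stat = f_statistic(r2=0.904, n_obs=7000, n_predictors=2)
print(f"F-statistic: {f_stat:.0f}")
```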

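### 5.3 AIC:
The Akaike Information Criterion (AIC) helps guard against overfitting. It rewards the model for goodness of fit and penalizes it for every additional parameter.  
The AIC has no meaning on its own: it is used to compare models fitted to the same data, and the model with the lower AIC is preferred. A large value here is therefore neither good nor bad by itself.  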
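
To make the trade-off concrete, here is a sketch of the AIC for a Gaussian linear model, $AIC = 2k - 2\ln L$, computed with NumPy on synthetic data. The helper `ols_aic` and all data are assumptions for illustration; the formula follows the usual Gaussian log-likelihood definition.

```python
import numpy as np

def ols_aic(X, y):
    """AIC = 2k - 2*lnL for a least-squares fit; X must include the intercept column."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ssr = np.sum((y - X @ beta) ** 2)
    # Gaussian log-likelihood evaluated at the maximum-likelihood variance ssr / n
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(ssr / n) + 1)
    return 2 * k - 2 * log_lik

rng = np.random.default_rng(0)
x = rng.normal(size=500)
irrelevant = rng.normal(size=500)            # predictor unrelated to y
y = 1.0 + 2.0 * x + rng.normal(size=500)

X_null = np.ones((500, 1))                   # intercept only
X_small = np.column_stack([X_null, x])       # intercept + x
X_big = np.column_stack([X_small, irrelevant])

aic_null, aic_small, aic_big = (ols_aic(X, y) for X in (X_null, X_small, X_big))
print(aic_null, aic_small, aic_big)
```

Adding `x` lowers the AIC sharply because the fit improves a lot. Adding the irrelevant column costs a penalty of 2 that its small gain in fit will often fail to offset.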

### 5.4 Omnibus/Prob(Omnibus):
Omnibus tests the skewness and kurtosis of the residuals.  
**The closer to zero the better**  
Prob(Omnibus) is the p-value of that test, under the null hypothesis that the residuals are normally distributed.  
**The closer to one the better**  

In this case Omnibus is relatively high and Prob(Omnibus) is relatively low, so the residuals deviate from a normal distribution. With several thousand observations even small deviations are detected, so it is worth confirming this visually with the residual histogram before concluding that a nonlinear approach would do better.  

### 5.5 Skew 
Measures the symmetry of the residual distribution. We want to see something close to zero, which is what a normal distribution produces. Note that this value also drives the Omnibus.  
**The closer to zero the better**   

### 5.6 Kurtosis
Measure of "peakiness" and tail weight of the residual distribution. A normal distribution has a kurtosis of 3. Higher values mean a sharper peak with heavier tails (more extreme residuals), lower values mean a flatter distribution with lighter tails.  
**The closer to 3 the better**  

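### 5.7 Durbin-Watson
Tests for serial correlation (autocorrelation) in the residuals, i.e. assumption 5 in section 4.1. The statistic ranges from 0 to 4, and a value of 2 means no autocorrelation. We hope to have a value between 1.5 and 2.5. In this case, the value is within those limits.  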
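
The Durbin-Watson statistic is the sum of squared differences between consecutive residuals divided by the sum of squared residuals. A minimal NumPy sketch on synthetic residuals (the helper name `durbin_watson_stat` and the data are assumptions for illustration):

```python
import numpy as np

def durbin_watson_stat(residuals):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2); about 2 when residuals are uncorrelated."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)
white = rng.normal(size=5000)   # uncorrelated residuals
walk = np.cumsum(white)         # strongly positively autocorrelated residuals

dw_white = durbin_watson_stat(white)
dw_walk = durbin_watson_stat(walk)
print(dw_white, dw_walk)
```

Values well below 2 point to positive autocorrelation and values well above 2 to negative autocorrelation. For cross-sectional data like this one, the result also depends on the row order.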

### 5.8 Jarque-Bera (JB)/Prob(JB)
Like the Omnibus test, it tests both skew and kurtosis. We hope to see in this test a confirmation of the Omnibus test.   
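
Skew, kurtosis and the Jarque-Bera statistic are all simple functions of the residual moments, with $JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$. The sketch below computes them with NumPy on synthetic samples; the helper `skew_kurtosis_jb` is hypothetical, and kurtosis is taken as the non-excess version, which equals 3 for a normal distribution.

```python
import numpy as np

def skew_kurtosis_jb(residuals):
    """Return (skew, kurtosis, Jarque-Bera statistic) from sample moments."""
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2   # 3 for a normal distribution
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    return skew, kurt, jb

rng = np.random.default_rng(0)
normal_resid = rng.normal(size=5000)        # symmetric, normal tails
skewed_resid = rng.exponential(size=5000)   # right-skewed, heavy tail

print(skew_kurtosis_jb(normal_resid))
print(skew_kurtosis_jb(skewed_resid))
```

The normal sample gives a skew near 0, a kurtosis near 3 and a small JB value, while the skewed sample pushes all three away from those targets.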

### 5.9 Condition Number
Measures how sensitive the model's output is to small changes in its input. With multicollinearity we can expect much larger fluctuations from small changes in the data, so we hope to see a relatively small number, something below 30.  
**The smaller below 30 the better**
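
A large condition number does not always mean the predictors are collinear: it is also sensitive to the scale and location of the columns. The sketch below builds a design matrix from synthetic heights and a 0/1 gender dummy (assumed values, not the notebook's data) and compares `np.linalg.cond` before and after centering Height.

```python
import numpy as np

rng = np.random.default_rng(0)
height = rng.normal(66, 4, size=1000)      # inches, synthetic
is_male = rng.integers(0, 2, size=1000)    # 0/1 dummy, synthetic

# Design matrix with an intercept column, dummy and raw Height
X_raw = np.column_stack([np.ones(1000), is_male, height])

# Same model, but with Height centered on its mean
X_centered = X_raw.copy()
X_centered[:, 2] -= height.mean()

cond_raw = np.linalg.cond(X_raw)
cond_centered = np.linalg.cond(X_centered)
print(cond_raw, cond_centered)
```

Raw Height sits far from zero and is therefore almost proportional to the intercept column, which inflates the number. Centering removes that effect without changing the fitted slope for Height.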


In [None]:
res.resid.hist()
res.resid.describe()

In [None]:
pred = res.get_prediction().summary_frame().rename(columns={'mean':'Weight_P'})
pred['Weight'] = df_train.Weight
pred['Height'] = df_train.Height
pred['Gender'] = df_train.Gender

In [None]:
colors = [None, 'orange']  # None keeps matplotlib's default colour for the first panel
fig, ax = plt.subplots(1,2,figsize=(17,6), sharey=True)
fig.suptitle("Train Set Observed vs Predicted", fontsize=16)
for i,gender in enumerate(pred.Gender.unique()):
    # Select this gender's rows explicitly so each panel always matches its title,
    # and sort by Height so the lines are drawn from left to right
    sub = pred[pred.Gender == gender].sort_values('Height')
    ax[i].scatter(data=sub, x='Height', y='Weight',s=2,alpha=0.5, label='True', color=colors[i])
    ax[i].plot('Height','Weight_P',data=sub, linewidth=0.75, label='Predicted')
    ax[i].plot('Height','obs_ci_upper',data=sub, linestyle=':', linewidth=0.5, label='Upper Bound')
    ax[i].plot('Height','obs_ci_lower',data=sub, linestyle=':', linewidth=0.5,label='Lower Bound')
    ax[i].set_title(gender, fontsize=12)
    ax[i].set_xlabel('Height')
    ax[i].set_ylabel('Weight')
    ax[i].grid(True)
    ax[i].legend()

In [None]:
res.predict(df_test)

In [None]:
"""
Correction methods accepted by outlier_test:
    - `bonferroni` : one-step correction
    - `sidak` : one-step correction
    - `holm-sidak` :
    - `holm` :
    - `simes-hochberg` :
    - `hommel` :
    - `fdr_bh` : Benjamini/Hochberg
    - `fdr_by` : Benjamini/Yekutieli
"""
outliers = res.outlier_test("bonferroni")
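
The Bonferroni correction itself is simple: each raw p-value is multiplied by the number of tests and capped at 1. The sketch below applies it to four made-up p-values; the helper `bonferroni_adjust` is hypothetical and only illustrates the arithmetic, not how statsmodels implements it.

```python
import numpy as np

def bonferroni_adjust(p_values):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

# Made-up raw p-values, one per observation being tested
raw_p = np.array([0.00001, 0.004, 0.03, 0.6])
adjusted = bonferroni_adjust(raw_p)

print(adjusted)
print(adjusted < 0.05)   # which observations remain significant after correction
```

With thousands of observations the multiplier is large, so only residuals with extremely small raw p-values survive the correction.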

In [None]:
outidx = outliers.iloc[:,-1].nsmallest(5).index
df_train.loc[outidx]

In [None]:
fig= plt.figure(figsize=(10,6))
plt.scatter(df_train.Height, df_train.Weight, c=df_train.Gender.values=='Male', s=0.7)
plt.scatter(df_train.loc[outidx].Height, df_train.loc[outidx].Weight, c=df_train.loc[outidx].Gender.values=='Male', marker='<');