# Basic Statistical Models

This notebook demonstrates how to estimate a regression model with the module [statsmodels][statmodels]. The techniques introduced here can also be applied using [scikit-learn](https://www.google.com/search?client=firefox-b-d&q=scikit-learn), one of the more popular Python modules for machine learning. 


## Linear Regression

A linear regression model takes the form 

$$ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki} + u_i \qquad i = 1, \ldots, N $$

where $y_i$ denotes the dependent (or response) variable and $x_{ki}$ denotes the $k^{th}$ covariate (or explanatory variable), with $k=1,\ldots, K$. 

Given data on the response and the covariates, the aim is to estimate the coefficients $\beta_0, \beta_1, \ldots, \beta_K$. A standard method of estimating these coefficients is the **Ordinary Least Squares Estimator** (OLS). 

Once we have an estimate of each coefficient, denoted $\hat{\beta}_{k}$, $k=0,\ldots,K$, we can predict the response for a new observation $j$ outside the sample by 

$$ \hat{y}_{j} = \hat{\beta}_0 + \hat{\beta}_1 x_{1j} + \hat{\beta}_2 x_{2j} + \cdots + \hat{\beta}_K x_{Kj}, \qquad j = N+1, N+2, \ldots $$
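To make the two equations above concrete, here is a minimal sketch that estimates the coefficients by least squares and then predicts for new observations. It uses simulated data and NumPy's `lstsq` rather than statsmodels; the sample size, the coefficient values and all variable names (`X_sim`, `y_sim`, and so on) are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (illustrative values only): N observations, K = 2 covariates
N = 200
x_sim = rng.normal(size=(N, 2))
beta_true = np.array([1.0, 2.0, -0.5])            # beta_0, beta_1, beta_2
u_sim = rng.normal(scale=0.1, size=N)             # error term u_i
X_sim = np.column_stack([np.ones(N), x_sim])      # column of ones for the intercept
y_sim = X_sim @ beta_true + u_sim

# OLS: the coefficients that minimise the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X_sim, y_sim, rcond=None)

# Prediction for two new observations j = N+1, N+2
x_new = np.array([[0.5, -1.0], [1.5, 2.0]])
X_new = np.column_stack([np.ones(len(x_new)), x_new])
y_hat_new = X_new @ beta_hat

print(beta_hat)
print(y_hat_new)
```

With this much data and little noise, the estimates should land close to the values used to simulate the data.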

We won't go through the mathematics here; for a comprehensive treatment of the linear regression model see [here](https://www.ssc.wisc.edu/~bhansen/econometrics/). Instead, we will focus on how to use [statsmodels][statmodels].


[statmodels]: https://www.statsmodels.org/stable/index.html

## Economic Return to Schooling 

The example we will be using to demonstrate this is based on the paper [Z. Griliches (1976)](https://www.jstor.org/stable/1831103?seq=1#metadata_info_tab_contents) which examined the economic return to formal schooling. 

First we import the necessary modules and the data 

In [None]:
import pandas as pd
import statsmodels.api as sm 
import sklearn.linear_model as skl
import sklearn.metrics as skm 
import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['figure.figsize'] = [12,9]

In [None]:
school = pd.read_csv('../input/schooling/griliches.csv', header=0)
school.info()

Below is a brief data dictionary:

1. $year_i$: year of observation. This is the base year in which individual $i$ entered the sample. 
2. $rns_i$: dummy variable, residency in South.
3. $mrt_i$: dummy variable, marital status (=1 if married). 
4. $smsa_i$: dummy variable, resides in a metropolitan area. 
5. $med_i$: Mother's education measured in years. 
6. $iq_i$: IQ score. 
7. $kww_i$: Knowledge of the World of Work test. 
8. $age_i$: age in years. 
9. $s_i$: completed years of schooling.
10. $expr_i$: work experience measured in years. 
11. $tenure_i$: tenure measured in years. 
12. $lw_i$: log of wage. 

Variables with the suffix $80$ measure the same variable again in 1980 for the same individuals. They can be seen as *follow-up* measurements. 

The linear regression model considered is 

$$ lw_i = \beta_0 + \beta_1 rns_i + \beta_2 mrt_i + \beta_3 smsa_i + \beta_4 iq_i + \beta_5 kww_i + \beta_6 age_i + \beta_7 s_i + \beta_8 expr_i + \beta_9 tenure_i + u_i. $$

**Exercise:** <a style="color:red">IMPORTANT</a> Conduct some preliminary Exploratory Data Analysis before proceeding to the regression estimation. What types of EDA do you think are appropriate in this case? 

To estimate the coefficients, we first need to identify the response variable and the covariates.

In [None]:
covariate = ['rns', 'mrt', 'smsa', 'iq', 'kww', 'age', 's', 'expr', 'tenure']
y = school['lw']
# Equivalent alternative: X = school.loc[:, covariate]
X = school[covariate]
X = sm.add_constant(X) # add a column of ones so the model includes the intercept
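As a rough illustration of what `add_constant` does: with its default settings it prepends a column named `const` filled with ones, which is what lets $\beta_0$ be estimated alongside the other coefficients. The sketch below imitates that by hand with pandas only; the toy values and the names `toy_X` and `toy_X_const` are hypothetical.

```python
import pandas as pd

# Toy covariates (made-up values), named after two of the variables above
toy_X = pd.DataFrame({'s': [12, 16, 10], 'expr': [3.0, 1.5, 8.0]})

# Prepend a column of ones by hand, imitating the default behaviour of add_constant
toy_X_const = toy_X.copy()
toy_X_const.insert(0, 'const', 1.0)

print(toy_X_const)
```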

Given these, we can then get *statsmodels* to estimate the coefficients. 

We split the process here into three steps:

1. We specify the model
2. We estimate (fit) the model
3. We use the variables with the 80 suffix for prediction. 

In [None]:
model = sm.OLS(y, X) # specify the model
est = model.fit() # estimate (fit) the coefficients
est.summary() # display the result

In [None]:
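# 1980 versions of the covariates; iq and kww are reused as they are
testcovariate = ['rns80', 'mrt80', 'smsa80', 'iq', 'kww', 'age80', 's80', 'expr80', 'tenure80']
test = school.loc[:, testcovariate]
test = sm.add_constant(test) # include the intercept, as in the fitted model
# The columns of test must be in the same order as the columns of X
predict80 = model.predict(params=est.params, exog=test)
predict80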

Add the predicted values into the dataframe

In [None]:
school['predicted80'] = predict80

Plotting predicted values against actual 

In [None]:
school.plot.scatter(y='lw80', x='predicted80')

Three criteria are typically used to measure predictive performance: Mean Squared Error (MSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE). They are defined as 

$$ 
\begin{align}
    MSE &= \frac{1}{N} \sum^N_{i=1} ( \hat{y}_i - y_i )^2 \\
    MAE &= \frac{1}{N} \sum^N_{i=1} | \hat{y}_i - y_i | \\
    MAPE &= \frac{1}{N} \sum^N_{i=1} \frac{| \hat{y}_i - y_i |}{| y_i |} 
\end{align}
$$

where $N$ is the number of predictions. 
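Before handing the work to a library, it can help to compute the three criteria by hand on a tiny example. The four actual and predicted values below are made up for illustration, and the names `y_true`, `y_pred` and `*_manual` are hypothetical.

```python
import numpy as np

# Made-up actual and predicted values
y_true = np.array([2.0, 4.0, 5.0, 8.0])
y_pred = np.array([2.5, 3.5, 5.0, 7.0])

errors = y_pred - y_true                                  # [0.5, -0.5, 0.0, -1.0]

mse_manual = np.mean(errors ** 2)                         # (0.25 + 0.25 + 0 + 1) / 4
mae_manual = np.mean(np.abs(errors))                      # (0.5 + 0.5 + 0 + 1) / 4
mape_manual = np.mean(np.abs(errors) / np.abs(y_true))    # (0.25 + 0.125 + 0 + 0.125) / 4

print(mse_manual, mae_manual, mape_manual)
```

Note that MAPE is a proportion here (0.125 means 12.5%) and is undefined when an actual value is zero.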

The *metrics* submodule in scikit-learn can compute all of these. 

In [None]:
mse = skm.mean_squared_error(school['lw80'], school['predicted80'])
mae = skm.mean_absolute_error(school['lw80'], school['predicted80'])
mape = skm.mean_absolute_percentage_error(school['lw80'], school['predicted80'])
mse, mae, mape

For readability, we can put these criteria into a dataframe

In [None]:
error = pd.DataFrame([mse, mae, mape], index=['MSE', 'MAE', 'MAPE'], columns=['error'])
error

## Function

This section can be skipped, but it is useful if you wish to learn how to automate a sequence of actions in Python by defining a *function*. 

Let's say we need to calculate these errors over and over again for different models. We would rather not cut and paste the code above each time. Ideally we want to *automate* this process up to a point. Putting the code into a function is one way to achieve this.

The idea of a function is that you provide certain inputs (or arguments) and Python will manipulate those inputs and produce some outputs. We have been using various functions already, but we have yet to introduce how to create one yourself. 

The basic syntax for defining a function is 

def function_name(list of arguments):

    """
    A doc-string explaining what the function does and what inputs and outputs the user should expect. 
    """

    *the code you want Python to perform*
    
    ......
    
    *return something*
    
Note that indentation is important in Python. The indented lines are the actions which you wish Python to perform. The function definition ends at the first line that is no longer indented. 

For example, if we wish to develop a function that takes the actual and predicted values from above and returns a dataframe that contains MSE, MAE and MAPE, we can consider 

In [None]:
def getForecastCriteria(y, yhat):
    """
    A function that returns MSE, MAE and MAPE based on the actual and predicted values. The function requires the scikit-learn module.
    Inputs:
        y: (T,) array containing the actual values of the variable. 
        yhat: (T,) array containing the predicted values of the variable. 
    Output:
        fcTable: a dataframe containing the MSE, MAE and MAPE. 
    
    """
    mse = skm.mean_squared_error(y, yhat)
    mae = skm.mean_absolute_error(y, yhat)
    mape = skm.mean_absolute_percentage_error(y, yhat)
    fcTable = pd.DataFrame([mse, mae, mape], index=['MSE', 'MAE', 'MAPE'], columns=['error'])
    return fcTable

Check if the function works. 

In [None]:
getForecastCriteria(school['lw80'], school['predicted80'])

You can now reuse the getForecastCriteria function for different response and/or predicted values. 

## Discrete outcome 

The linear regression model works well when the response variable is continuous (and unbounded). When the response variable is binary, such as Male or Female, or Yes or No, or takes one of several discrete values, such as a choice of transportation mode, the standard linear regression model may not be the most appropriate approach.

To demonstrate the idea, let's consider the following scenario. 

A retail bank would like to hire you to build a credit default model for their credit card portfolio. The bank expects the model to identify the consumers who will default on their credit card payments over the next 12 months and as such reduce their losses. The bank is willing to provide you with the data (credit_data.txt) that they can currently extract from their systems. This data set consists of 13,444 records with 14 fields. In addition to the default indicator for each observation, the remaining fields capture customer attributes and their credit history. A brief data dictionary can be found below:

  - Cardhldr = Dummy variable, 1 if application for credit card accepted, 0 if not
  - Default = 1 if defaulted 0 if not (observed when Cardhldr = 1, 10,499 observations),
  - Age = Age in years plus twelfths of a year,
  - Adepcnt = 1 + number of dependents,
  - Acadmos = months living at current address,
  - Majordrg = Number of major derogatory reports,
  - Minordrg = Number of minor derogatory reports,
  - Ownrent = 1 if owns their home, 0 if rent
  - Income = Monthly income (divided by 10,000),
  - Selfempl = 1 if self employed, 0 if not,
  - Inc_per = Income divided by number of dependents,
  - Exp_Inc = Ratio of monthly credit card expenditure to yearly income,
  - Spending = Average monthly credit card expenditure (for Cardhldr = 1),
  - Logspend = Log of spending.
  
In this case, the response variable, *Default*, is a binary variable whose value can only be either 0 or 1. One way to approach this is to model the *log odds* of default rather than the variable itself. The log odds is 

$$z_i =  \log \frac{\Pr (Default_i = 1) }{\Pr (Default_i=0)} $$

where $\Pr(A)$ denotes the probability that an event $A$ occurs. The main idea is that while $Default_i$ is a binary variable, the log odds $z_i$ is a continuous variable that ranges from $-\infty$ to $\infty$. To see this, note that

1. $\Pr (Default_i = 1)$ ranges from 0 to 1. 
2. $\Pr (Default_i = 1)/\Pr (Default_i = 0)$ ranges from 0 to $\infty$.
3. $\log \left[ \Pr (Default_i = 1)/\Pr (Default_i = 0) \right]$ ranges from $-\infty$ to $\infty$.
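The three steps above can be checked numerically. The sketch below takes a handful of made-up probabilities and computes the corresponding odds and log odds; the variable names are hypothetical.

```python
import numpy as np

# Made-up probabilities of default, strictly between 0 and 1
p = np.array([0.01, 0.25, 0.5, 0.75, 0.99])

odds = p / (1 - p)        # always positive, unbounded above as p approaches 1
log_odds = np.log(odds)   # negative when p < 0.5, zero at p = 0.5, positive when p > 0.5

for p_i, o_i, z_i in zip(p, odds, log_odds):
    print(f"p = {p_i:.2f}   odds = {o_i:8.4f}   log odds = {z_i:8.4f}")
```

As the probability moves towards 0 or 1, the log odds grows without bound in the negative or positive direction, which is why a linear model for it is not constrained to an interval.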

Since $z_i$ is continuous and unbounded, we can model it as a linear function of the covariates, just as in the regression model: 

$$z_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_K x_{Ki}.$$

The objective is therefore to estimate the coefficients $\beta_0, \ldots, \beta_K$ from the data. 
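It is worth seeing how a value of $z_i$ maps back to a probability. Writing $p_i = \Pr(Default_i = 1)$, so that $\Pr(Default_i = 0) = 1 - p_i$, the definition of $z_i$ can be solved for $p_i$ with standard algebra:

$$
\begin{align}
    z_i &= \log \frac{p_i}{1 - p_i} \\
    e^{z_i} &= \frac{p_i}{1 - p_i} \\
    p_i &= \frac{e^{z_i}}{1 + e^{z_i}} = \frac{1}{1 + e^{-z_i}}
\end{align}
$$

So any real value of $z_i$ corresponds to a probability strictly between 0 and 1, with $z_i = 0$ giving $p_i = 0.5$.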

We will demonstrate how to estimate these coefficients using [statsmodels](https://www.statsmodels.org/stable/index.html). First we import the data. 

Note that this data contains *missing values*, so we need to tell Pandas how missing values are represented using the input argument *na_values*. In this particular dataset, the missing values are simply empty entries without any whitespace. 

In [None]:
credit = pd.read_csv("../input/credit/creditData.csv", header=0, na_values='')
credit.info()
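To see how empty entries are read, here is a small sketch that parses an in-memory CSV instead of the real file. The column names and values are made up for illustration. Note that pandas already treats empty fields as missing by default, so passing `na_values=''` mainly makes the intent explicit.

```python
import io
import pandas as pd

# A tiny made-up CSV with two empty entries (no whitespace between the commas)
csv_text = "ID,INCOME,AGE\n1,2500,34\n2,,51\n3,1800,\n"

toy = pd.read_csv(io.StringIO(csv_text), header=0, na_values='')

print(toy)
print(toy.isna().sum())   # number of missing values per column
```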

**Exercise** Please conduct some Exploratory Data Analysis before proceeding to the estimation of a logit model. 

Note that we only want those with a credit card i.e. CARDHLDR=1 and we also need to drop all the rows with missing values for purposes of estimation. 

<a style="color:red">IMPORTANT NOTE:</a> Dropping missing values is not always the appropriate method. There is a literature on how to handle missing values based on the objectives of the analysis. This includes the possibility of estimating the missing values using the [Expectation-Maximisation](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) algorithm. The details are beyond the scope of this notebook, so we simply drop them for now. 

The method *dropna* can be used to drop the missing values from the dataframe. 

In [None]:
creditClean = credit.loc[credit['CARDHLDR']==1,:].dropna()

The code above does two things in sequence. The first step, `.loc[...]`, extracts the individuals with a credit card using *conditional slicing*, and the second step, `.dropna()`, removes every row that contains a missing value. 
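The same two-step chain can be traced on a toy dataframe where the result is easy to verify by eye. The values below are made up, and `toy_credit`, `holders` and `toy_clean` are hypothetical names used only in this sketch.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 is not a card holder, row 2 has a missing AGE
toy_credit = pd.DataFrame({
    'CARDHLDR': [1, 0, 1, 1],
    'DEFAULT': [0, np.nan, 1, 0],
    'AGE': [34.0, 51.0, np.nan, 28.5],
})

# Step 1: conditional slicing keeps the rows where the condition is True
holders = toy_credit.loc[toy_credit['CARDHLDR'] == 1, :]

# Step 2: dropna removes any remaining row with at least one missing value
toy_clean = holders.dropna()

print(holders)
print(toy_clean)
```

Only the rows that are card holders and have no missing entries survive both steps.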

The next step is to specify our model. We follow the steps below

1. Identify and extract the data of the response (endogenous) variable. 
2. Identify and extract the data of the covariates (exogenous variables). 
3. Add a constant term to the dataframe of the covariates. 
4. Specify our logit model. 
5. Estimate (fit) our logit model. 

In [None]:
y = creditClean['DEFAULT'] # define the response variable
credit_covariates = creditClean.columns[2:-1] # all columns except the first two and the last one
X = creditClean[credit_covariates] # define the covariates
X = sm.add_constant(X) # add a constant (intercept) column
Model0 = sm.Logit(endog=y, exog=X) # specify the logit model
Model0_fit = Model0.fit() # estimate (fit) the model

To show our result, we can use the *summary* method. 

In [None]:
Model0_fit.summary()