# Module 9: Inferential statistics and regression

**In this module, we will learn about inferential statistics and regression modelling using Python.**

We will be using the following statistical Python packages:

1. [seaborn](http://seaborn.pydata.org/) for statistical visualization, such as the scatter plots we learned earlier, frequency distributions, and linear regression plots. These are helpful for exploratory data visualization.

2. [scipy.stats](https://docs.scipy.org/doc/scipy/reference/stats.html) for basic statistical testing (correlation, t-tests, etc.)

3. [statsmodels](https://www.statsmodels.org/stable/examples/index.html) for more advanced statistical models, like multiple regression and other complex models.

**Key concepts**  
Statistical inference is the process of using a sample to *infer* the characteristics of an underlying population (from which this sample was drawn) through estimation and hypothesis testing. Contrast this with descriptive statistics, which focus simply on describing the characteristics of the sample itself. Key concepts include: 
- parameter estimation and confidence intervals 
- normal distribution, sampling distribution and Central Limit Theorem
- statistical hypothesis testing
- statistical modelling (e.g. t-test, regression)
- *p*-values

To conduct statistical inference, we rely on *statistical models*: sets of assumptions plus mathematical relationships between variables, producing a formal representation of some theory. We are essentially trying to explain the process underlying the generation of our data. 
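
As a small illustration of the sampling distribution and the Central Limit Theorem listed above, the sketch below draws many samples from a skewed, made-up population and checks that the sample means cluster around the population mean with a spread close to $\sigma / \sqrt{n}$. The exponential population, the sample size, and the number of samples are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical skewed "population": exponential with mean 2 (and std 2)
pop_mean, pop_std = 2.0, 2.0
n = 50             # size of each sample
n_samples = 10000  # number of repeated samples

# draw many samples and compute the mean of each one
samples = rng.exponential(scale=pop_mean, size=(n_samples, n))
sample_means = samples.mean(axis=1)

# the sampling distribution of the mean is centered on the population mean,
# and its standard deviation (the standard error) is close to sigma / sqrt(n)
standard_error = pop_std / np.sqrt(n)
print(sample_means.mean(), sample_means.std(ddof=1), standard_error)
```

Even though the population is far from normal, a histogram of `sample_means` would look roughly bell-shaped, which is what lets us use t-tests and confidence intervals later in this notebook.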

Readings:
- Urdan, Statistics in Plain English, ch. 8, 9, 13.
- Rumsey, Statistics For Dummies, 2nd Edition (optional; if you have not taken a statistics course before, this is a good entry-level introduction to the key concepts), available online for free through the NEU library

In this notebook, we will focus on the basics of specifying, estimating, and interpreting regression models using Python tools. The goal is to make you a knowledgeable consumer of studies that use regression, and to introduce the tools that can help you conduct regression analysis yourself.

There is a lot more to cover in a course on regression that we must skip for today's quick overview (such as interactions, transforming variables, handling multicollinearity, handling outliers, conducting diagnostics, etc.). That's why there are entire courses dedicated to regression analysis. 


In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import statsmodels.api as sm
from scipy import stats
from statsmodels.tools.tools import add_constant


## 1. Load and prep the data

Let's continue the example topic we used in week 4, income inequality in US Metropolitan Statistical Areas (MSAs). We will conduct some statistical modelling to examine the relationship between income inequality and some characteristics of MSAs.

Related working papers and publications by Census:  
Income Inequality: https://www.census.gov/topics/income-poverty/income-inequality.html  
Income Inequality among Regions and Metropolitan Statistical Areas: 2005 to 2015: https://www.census.gov/library/working-papers/2017/demo/SESHD-WP2017-41.html

In [None]:
# load the data
df = pd.read_csv('../data/hhincome_msa_05_15.csv', dtype={'msacode':str})

# data cleaning and processing
# drop the unneeded column
df = df.drop(columns=['Unnamed: 0'])

# drop rows with missing values
df = df.dropna()

# change year data type to int
df['year'] = df['year'].astype(int)

# let's only focus on one-year data as an example
df = df[df['year'] == 2015]
len(df)

In [None]:
df.columns

In [None]:
df.dtypes

## 2. Descriptive statistics and visualization (review)

In [None]:
# choose a variable
var = 'hhinc_50prct'

In [None]:
# select multiple columns with a list (double brackets)
df[['gini', 'hhinc_50prct']].describe()

In [None]:
# create two data subsets
# subset the dataframe into large and small metropolitan areas
df_lar_msa = df[df['tot_pop'] > 1000000]
group1 = df_lar_msa[var]

df_sml_msa = df[df['tot_pop'] < 1000000]
group2 = df_sml_msa[var]

In [None]:
# what are the probability distributions of these two groups?
# (sns.distplot is deprecated, so we draw the density curves with kdeplot)
ax = sns.kdeplot(x=group1.dropna(), fill=True, label='group1-large-cities')
ax = sns.kdeplot(x=group2.dropna(), fill=True, label='group2-small-cities', ax=ax)
ax.legend()

Probability distributions describe how likely different values or outcomes are. A density plot represents the distribution of a numeric variable: the height of the curve shows the relative likelihood of values in that range, and the total area under the curve is 1. The peaks of the density plot are at the locations where there is the highest concentration of points. 
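
To make the idea of a density concrete, here is a minimal sketch using a histogram scaled as a density. The income values are randomly generated stand-ins, not the MSA data, and the bin count is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(7)

# made-up incomes standing in for a numeric variable
values = rng.normal(loc=55000, scale=8000, size=1000)

# density=True rescales the bar heights so that the total area is 1
density, edges = np.histogram(values, bins=30, density=True)
widths = np.diff(edges)
area = np.sum(density * widths)

# the tallest bar marks where the values are most concentrated
peak_bin = np.argmax(density)
peak_center = (edges[peak_bin] + edges[peak_bin + 1]) / 2
print(area, peak_center)
```

The heights themselves are not probabilities; only areas under the curve (height times width) are.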

## 3. Hypothesis testing: difference in means and t-tests 

In the above plot, the probability distributions of median household income seem to differ between large and small cities. But is the difference between the two groups statistically significant? Large cities tend to have more economic resources, employment opportunities and business activities than small cities. Given this theory, I want to test a hypothesis: median household income is higher in large cities than in small cities.

In [None]:
print(group1.mean())
print(group2.mean())

In [None]:
# calculate difference in means
diff = group1.mean() - group2.mean()
diff

On average, median household income in large cities is approx. $10,197 higher than in small cities. But is the difference statistically significant? To determine this, we calculate the t-statistic and its p-value.

In [None]:
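# compute the t-stat and its p-value (Welch's t-test, since equal_var=False)
# note: ttest_ind is two-sided by default; recent SciPy versions accept
# alternative='greater' for a one-sided test of group1 > group2
t_statistic, p_value = stats.ttest_ind(group1, group2, equal_var=False, nan_policy='omit')
p_value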

In [None]:
# is the difference in means statistically significant?
# My chosen confidence level is 95% (and thus my significance level is 0.05).
alpha = 0.05 #significance level
p_value < alpha

Remember my original hypothesis: "median household income is higher for large cities than small cities." Let's express it formally in the parlance of statistical hypothesis testing:

H0: average median household income in large cities is not higher than in small cities (null hypothesis)  
H1: average median household income in large cities is higher than in small cities (alternative hypothesis)

The two possible outcomes of a hypothesis test are 1) I reject the null hypothesis or 2) I cannot reject the null hypothesis.

My p-value is less than my chosen significance level, so I can reject the null hypothesis. You can only reject the null hypothesis if p is less than the significance level (which itself is an arbitrarily chosen probability threshold). Rejecting the null hypothesis does not mean that we've proven the alternative hypothesis; it means only that the data provide some evidence for it.
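
To see what the t-test is doing under the hood, the sketch below computes Welch's t-statistic, its degrees of freedom, and the p-values by hand, then compares them with `stats.ttest_ind`. The two groups are randomly generated stand-ins (their means, spreads, and sizes are assumptions), not the MSA data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# made-up data standing in for two groups (not the MSA data)
a = rng.normal(loc=60000, scale=9000, size=50)
b = rng.normal(loc=52000, scale=7000, size=300)

# variance of each group's mean, and the standard error of the difference
va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)
se = np.sqrt(va + vb)

# Welch's t-statistic: difference in means divided by its standard error
t_manual = (a.mean() - b.mean()) / se

# Welch-Satterthwaite degrees of freedom
dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

# two-sided p-value, and one-sided p-value for H1: mean(a) > mean(b)
p_two_sided = 2 * stats.t.sf(abs(t_manual), dof)
p_one_sided = stats.t.sf(t_manual, dof)

# 95% confidence interval for the difference in means
margin = stats.t.ppf(0.975, dof) * se
ci = (a.mean() - b.mean() - margin, a.mean() - b.mean() + margin)

# scipy's result (two-sided by default) should match the manual calculation
t_scipy, p_scipy = stats.ttest_ind(a, b, equal_var=False)
print(np.isclose(t_manual, t_scipy), np.isclose(p_two_sided, p_scipy))
```

When the observed difference is in the hypothesized direction, the one-sided p-value is half the two-sided one, so a two-sided test is the more conservative choice for a directional hypothesis like ours.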

In [None]:
# now it's your turn
# what is the difference in mean poverty_rate in large vs small cities?
# is it statistically significant?


## 4. Correlation and Regression

### 4a. Scatter plot and correlation


Let's say we want to know why some areas are more unequal than others. What urban and regional factors contribute to inequality in cities? Normally we would ask these questions first, form hypotheses based on existing studies or theories about the factors that drive inequality in cities, and then look for relevant data for the study. But for this example, let's work with the data we have on hand. 

In [None]:
# what variables might be related to income inequality in cities?
df.columns

In [None]:
# Let's make a scatter plot to picture a relationship
# You interpret a scatterplot by looking for trends in the data as you go from left to right
# use seaborn to scatter-plot two variables
ax = sns.scatterplot(x=df['poverty_rate'], y=df['gini'])

- If the data show an uphill pattern as you move from left to right, this indicates a positive relationship between X and Y. As the X-values increase (move right), the Y-values increase (move up) a certain amount.  
- If the data show a downhill pattern as you move from left to right, this indicates a negative relationship between X and Y. As the X-values increase (move right) the Y-values decrease (move down) by a certain amount.  
- If the data don't seem to resemble any kind of pattern (even a vague one), then no linear relationship is apparent between X and Y.  

### 4b. Correlation coefficient

The bullets above describe data that resemble an uphill line as having a positive linear relationship, and data that resemble a downhill line as having a negative linear relationship. However, they don't address whether the linear relationship is strong or weak. 

Can one statistic measure both the strength and direction of a linear relationship between two variables? Sure! Statisticians use the correlation coefficient to measure the strength and direction of the linear relationship between two numerical variables. 

In [None]:
# specify variables that you think may have some levels of relationship
x = 'gini'
y = 'poverty_rate'

# correlation matrix
# how well are predictors correlated with response... and with each other?
correlations = df[[x, y]].corr()
correlations.round(2)

Some rules of thumb:
- Exactly −1: A perfect downhill (negative) linear relationship
- −0.70: A strong downhill (negative) linear relationship
- −0.50: A moderate downhill (negative) linear relationship
- −0.30: A weak downhill (negative) linear relationship
- 0: No linear relationship
- +0.30: A weak uphill (positive) linear relationship
- +0.50: A moderate uphill (positive) linear relationship
- +0.70: A strong uphill (positive) linear relationship
- Exactly +1: A perfect uphill (positive) linear relationship
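
For reference, Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it by hand on made-up data (the variables and the built-in slope of 0.5 are assumptions) and compares it with `stats.pearsonr`, which also returns a p-value for the null hypothesis of no linear relationship.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# made-up variables with a built-in positive linear relationship
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(size=200)

# Pearson's r: covariance divided by the product of the standard deviations
r_manual = np.cov(x, y, ddof=1)[0, 1] / (x.std(ddof=1) * y.std(ddof=1))

# the same value from scipy, along with a p-value
r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy)
```

Because the covariance is rescaled by both standard deviations, r has no units and always falls between −1 and +1.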

### 4c. Simple (bivariate) linear regression
In the case of two numerical variables x and y, when at least a moderate correlation has been established through both the correlation and the scatterplot, you know they have some type of linear relationship.  

Researchers often use that relationship to predict the (average) value of y for a given value of x using a straight line. Statisticians call this line the regression line. If you know the slope and the y-intercept of that regression line, then you can plug in a value for x and predict the average value for y. In other words, you predict (the average) y from x.

Simple (aka bivariate) regression has just 2 variables: one is used to predict the other.
  
  - **Response** variable = what you are predicting (synonyms: dependent variable, outcome variable, regressand)
  - **Predictor** variable = what you are using to predict (synonyms: independent variable, feature, covariate, regressor)
  

In this example, I want to predict Gini income inequality as a function of poverty rate. Therefore, I **specify** my model as $y = \beta_0 + \beta_1 \times x_1 + \epsilon$ where $y$ represents a city's Gini index, $x_1$ represents a city's poverty rate, and $\epsilon$ is the error term (the variation in $y$ not explained by $x_1$). $\beta_0$ (the intercept, aka constant) and $\beta_1$ (the coefficient on poverty) are the model parameters to be estimated. My chosen confidence level is 95% (and thus my significance level is 0.05). 
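
In simple regression, the OLS estimates have a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below checks this on made-up data generated from a known line (the true intercept 0.30, slope 0.005, and noise level are assumptions for illustration, not estimates from the MSA data).

```python
import numpy as np

rng = np.random.default_rng(3)

# made-up data generated from a known line: y = 0.30 + 0.005 * x + noise
x = rng.uniform(5, 30, size=300)
y = 0.30 + 0.005 * x + rng.normal(scale=0.02, size=300)

# closed-form OLS estimates for simple regression
beta_1 = np.cov(x, y, ddof=1)[0, 1] / x.var(ddof=1)  # slope
beta_0 = y.mean() - beta_1 * x.mean()                # intercept

# np.polyfit with degree 1 fits the same least-squares line
slope, intercept = np.polyfit(x, y, deg=1)
print(np.isclose(beta_1, slope), np.isclose(beta_0, intercept))

# predicted average y for a given x, e.g. x = 15
y_hat = beta_0 + beta_1 * 15
```

The estimates land close to, but not exactly on, the true values because of the random noise; that sampling uncertainty is what the standard errors and p-values in a regression table describe.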

In [None]:
# choose a response and predictor
response = 'gini'
predictor = 'poverty_rate'

In [None]:
# we can see the regression line visually
ax = sns.regplot(data=df, x=predictor, y=response)

In [None]:
# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response, predictor]].dropna()
print(data.shape)

# create design matrix and response vector
X = data[predictor]
y = data[response]

In [None]:
# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

**How do I interpret this regression results table?**

The coefficient (**coef**) is the estimated relationship between my variables (the slope of the line). The *p*-value allows us to determine whether each predictor variable is statistically significantly related to the response variable.

The *p*-value associated with the predictor variable is much smaller than 0.05, indicating that my predictor variable is a statistically significant predictor of my response variable.



So how good is my model? What does the $R^2$ value tell me? This single predictor explains about 12% of the variation in the response. 
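
$R^2$ is one minus the ratio of unexplained (residual) variation to total variation in the response. The sketch below computes it by hand on made-up data (the variables and the slope of 0.35 are assumptions) and confirms that, with a single predictor, $R^2$ equals the squared correlation coefficient.

```python
import numpy as np

rng = np.random.default_rng(4)

# made-up predictor and response with a weak linear relationship
x = rng.normal(size=400)
y = 0.35 * x + rng.normal(size=400)

# fit the least-squares line and compute the fitted values
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x

# R-squared: share of the response's variation explained by the model
ss_res = np.sum((y - y_hat) ** 2)     # unexplained (residual) variation
ss_tot = np.sum((y - y.mean()) ** 2)  # total variation
r_squared = 1 - ss_res / ss_tot

# with a single predictor, R-squared equals the squared correlation coefficient
r = np.corrcoef(x, y)[0, 1]
print(r_squared, r ** 2)
```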


To explain more (and predict better), we need more predictors in our model, as other factors are likely to help predict income inequality in cities.

### 4d. Multiple regression

OLS regression with multiple predictors

In [None]:
# choose a response and predictors
response = 'gini'
predictors = ['unemp_rate', 'poverty_rate']

# filter full dataset to retain only these columns and only rows without nulls in these columns
data = df[[response] +  predictors].dropna()

# create design matrix and response vector
X = data[predictors]
y = data[response]

# estimate a linear regression model with OLS, using statsmodels
model = sm.OLS(y, sm.add_constant(X))
result = model.fit()
print(result.summary())

#### Now add in more variables...

In [None]:
df.columns

In [None]:
# create design matrix containing predictors (drop nulls), and a response variable vector
predictors = ['unemp_rate', 'poverty_rate', 'black_alonerate', 'hispanic_anyrate', 'asian_alonerate']
X = df[predictors].dropna()
y = df.loc[X.index][response]

In [None]:
# estimate a linear regression model
Xc = add_constant(X)
model = sm.OLS(y, Xc)
result = model.fit()
print(result.summary())

**How do I interpret this multiple regression results table?**

Each coefficient represents the individual predictor's relationship with the response, while holding all the other predictors constant. 

When looking at the variables and their associated p-values, all variables except hispanic_anyrate have a p-value that is less than 0.05. This indicates that these variables (i.e. unemp_rate, poverty_rate, black_alonerate, asian_alonerate) are statistically significantly associated with the Gini index.

When looking at the variables and their associated coefficients, for example, the coefficient on the unemployment rate is negative (-0.171), which means that a city's unemployment rate is negatively associated with its income Gini index.

To interpret my results based on the size of the coefficient, I would say that for every one-unit (or one percentage point) increase in a city's unemployment rate, the city's household income Gini index is expected to decrease by 0.171, holding the other factors in the model constant.

To interpret my results in plain language, I would say that, among cities that are otherwise similar on the other predictors, a city with a higher unemployment rate tends to have lower income inequality. This is an association, not evidence of a causal effect.
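
The phrase "holding other factors constant" can be checked directly: if two cities are identical except that one predictor is one unit higher, their predicted responses differ by exactly that predictor's coefficient. The sketch below shows this with made-up data and two hypothetical cities; the predictors `x1` and `x2` and their true coefficients are assumptions, not the MSA variables.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500

# two made-up, correlated predictors and a response with known coefficients
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 1.0 - 0.2 * x1 + 0.5 * x2 + rng.normal(scale=0.5, size=n)

# design matrix with a constant column, then least-squares estimates
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# two hypothetical cities, identical except x1 is one unit higher in city_b
city_a = np.array([1.0, 2.0, 3.0])  # [constant, x1, x2]
city_b = np.array([1.0, 3.0, 3.0])
diff = city_b @ beta - city_a @ beta

# the difference in predictions equals the coefficient on x1
print(np.isclose(diff, beta[1]))
```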


In [None]:
# now it's your turn
# pick another variable and see if you can interpret the result

In [None]:
# now it's your turn
# try different sets of predictors and see if R-squared increases


**Regression modeling steps**:

  1. think through the relevant theory and assumptions
  2. specify a model based on theory
  3. collect data and clean/prep it
  4. estimate model parameters using the data
  5. interpret and report the results


As we add more and more predictors to a regression model, the model results may indicate possible numerical problems. There could be many reasons for problems in the model, like endogeneity, multicollinearity, outliers, etc. We may want to take a step back to think about the validity of the model and run some diagnostic checks (like the correlation checks below) to examine what might be causing the problem. This is also a huge topic, and it requires a solid understanding of your data, the related theories, and experience. We are not going to cover it in this lecture, but these are things to stay alert to. 
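
One common multicollinearity diagnostic, not covered further in this notebook, is the variance inflation factor (VIF): regress each predictor on all the others and compute $1 / (1 - R_j^2)$. The sketch below implements it with numpy on made-up predictors; the helper name `vif` and the data are hypothetical. statsmodels also ships a `variance_inflation_factor` function in `statsmodels.stats.outliers_influence`.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X (predictors only, no constant)."""
    n, k = X.shape
    out = []
    for j in range(k):
        # regress predictor j on the remaining predictors (plus a constant)
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
        resid = X[:, j] - others @ beta
        r2 = 1 - resid.var() / X[:, j].var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(6)
a = rng.normal(size=300)
b = a + rng.normal(scale=0.2, size=300)  # nearly a copy of a
c = rng.normal(size=300)                 # unrelated to a and b

vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

A VIF near 1 means a predictor is nearly uncorrelated with the others; a common rule of thumb treats values above roughly 5 to 10 as a sign that the coefficient estimates may be unstable.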

Take a step back to examine the correlation...

In [None]:
# correlation matrix
# how well are predictors correlated with response... and with each other?
correlations = df[predictors + [response]].corr()
correlations.round(2)

In [None]:
# visual correlation matrix via seaborn heatmap
# use vmin, vmax, center to set colorbar scale properly
sns.set(style='white')
ax = sns.heatmap(correlations, vmin=-1, vmax=1, center=0,
                 cmap='coolwarm', square=True, linewidths=1)

# Looking back
  - Fundamentals of programming with Python and Jupyter notebook  
  - The basic Python programming language and its uses  
  - Cleaning, manipulating, and analyzing data with Python’s pandas library  
  - Descriptive statistics and data visualization 
  - Querying APIs and scraping for public open data  
  - Spatial analysis and mapping  
  - Inferential statistics and regression  

# What's next

- Make your work presentable
    - Rule A, Birmingham A, Zuniga C, Altintas I, Huang S-C, Knight R, et al. (2019) [Ten simple rules for writing and sharing computational analyses in Jupyter Notebooks](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007007). PLoS Comput Biol 15(7)
    - Open science: share your code and data and avoid point-and-click software. 
  
- Assignment 6: run through this notebook and complete the exercises in the "now it's your turn" cells. Then submit this notebook.

- Final project

- Continue your own Python journey

- Enjoy the rest of the summer!