## Linear Regression



\index{linear regression}
\index{type hinting!external libraries}



### Correlation versus Causation



\index{Correlation} \index{Causation} \index{Storks}
Back in my home country, before the hippie movement changed our
culture, kids who were curious about where babies come from were told
that they are brought by the stork (a large bird, see
Fig. [fig:storksa](#fig:storksa)). Storks were indeed a common sight in
rural areas, and large enough to sell this story to a 3-year-old.

![img](Linear_Regression/Ringed_white_stork_2019-11-22_15-14-15.png "The Stork. Image by Soloneying, from https://commons.wikimedia.org/wiki/File:Ringed_white_stork.jpg, downloaded Nov 22<sup>nd</sup> 2019.")

Too bad we are now grown-up scientists with a penchant for critical
thinking. Rather than believing this story, we want to see the data: if
it were true, we should see a good correlation between the number of storks
and the number of babies. Lo and behold, these two variables actually
correlate in a statistically significant way, i.e., the more storks we count in
a country, the higher the (human) birthrate. Since both variables increase
together, this is called a positive correlation (see Fig. [4](#org193e58b)).

![img](storks.png "The birthrate and the number of stork pairs correlate in a statistically significant way. This analysis suggests that each stork pair delivers about 29 human babies, and that about 225 thousand babies were born otherwise. Data after <sup id="33933dcd3d4eb462061d758c913e8daa"><a href="#matthews-2000-stork-devil" title="Robert Matthews, Storks Deliver Babies (p = 0.008), {Teaching Statistics}, v(2), 36--38 (2000).">matthews-2000-stork-devil</a></sup>.")

Now, does this prove that the storks deliver the babies? Obviously (or so we
think) not. The fact that two observable quantities correlate in no way
implies that one is the cause of the other. The more likely explanation is that
both variables are affected by a common process (e.g., industrialization).
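
To illustrate how a common process alone can produce such a correlation, here is a minimal simulation sketch. All names and numbers in it (`driver`, the scaling factors, the noise levels) are made up for illustration and are not taken from the stork data. The helper `pearson_r` is written out by hand so that the example needs nothing beyond the standard library.

```python
import math
import random


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)  # fixed seed, so the run is reproducible

# hypothetical common driver, one value per imaginary country
driver = [rng.uniform(0, 1) for _ in range(200)]

# both variables depend on the driver plus independent noise,
# but neither variable appears in the formula of the other
storks = [1000 * d + rng.gauss(0, 100) for d in driver]
babies = [500 * d + rng.gauss(0, 50) for d in driver]

r = pearson_r(storks, babies)
print(f"r = {r:.2f}")  # clearly positive, although storks never enter `babies`
```

The two simulated variables correlate strongly, even though by construction neither one causes the other.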

It is a common mistake to confuse correlation with causation. Another
good example is the correlation between drinking and heart attacks. The
two will surely correlate, but the story is more complicated. Is it,
for example, that drinkers tend to exercise less than non-drinkers? So
even if you have a good hypothesis for why two variables are correlated,
the correlation on its own proves nothing.



### Understanding the results of a linear regression analysis



\index{linear regression!dependent variable}
\index{linear regression!independent variable}
Regression analysis assesses how well a dataset of two variables (let's
call them `x` and `y`) can be described by a function that allows us
to predict the value of the dependent variable `y` based on the value
of the independent variable `x`. In the case of a linear regression,
this function is a linear equation:

\begin{equation}
\label{eq:1}
y = a+mx
\end{equation}

where `a` denotes the y-axis intercept and `m` denotes the slope. Note
that the above equation is a simple model, which we can use to make
predictions about actual data. Linear regression analysis adjusts the
parameters `a` and `m` in such a way that the sum of the squared
differences between the measured data and the model prediction is
minimized.
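
For a single independent variable, this minimization has a well-known closed-form solution, which the following sketch spells out. The data points and the helper name `fit_line` are invented for illustration. In practice, a library will do this for you.

```python
def fit_line(x, y):
    """Return (a, m) of the line y = a + m*x that minimizes the
    sum of squared differences between data and prediction."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    a = mean_y - m * mean_x  # the fitted line passes through the mean point
    return a, m


# made-up measurements that scatter around a straight line
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

a, m = fit_line(x, y)
print(f"y = {a:.2f} + {m:.2f} x")  # prints: y = 0.09 + 1.97 x
```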

From a user perspective, we are interested in understanding how good the
model actually is, and how to interpret the key indicators of a given
regression model:

-   **r<sup>2</sup>:** or coefficient of determination. \index{linear
    regression!rsquare} \index{linear regression!coefficient of
    determination} This value is in the range from zero to one and
    expresses how much of the observed variance \index{linear
    regression!variance} in the data is explained by the regression
    model. So a value of 0.7 indicates that 70% of the variance is
    explained by the model, and that the remaining 30% is caused
    by other processes which are not captured by the linear model
    (e.g., measurement errors, or some non-linear effect affecting `x`
    and `y`). In Fig. [4](#org193e58b), 38% of the variance in the birthrate
    can be explained by the increase in stork pairs. Note that you
    will often also find the term R<sup>2</sup>. For a simple linear regression with
    two variables, r<sup>2</sup> equals R<sup>2</sup>. However, if your model incorporates
    more than two variables, these numbers can be different.
-   **p:** When you do a linear regression, you basically state the
    hypothesis that `y` depends on `x` and that they are linked by a
    linear equation. If you test a hypothesis, however, you also have to
    test the so-called **null-hypothesis**, which in this case would
    \index{linear regression!null hypothesis} state that `y` is
    \index{linear regression!p-value} unrelated to `x`. The p-value
    expresses the probability of obtaining a correlation at least as
    strong as the observed one if the null-hypothesis were true. So a
    p-value of 0.1 indicates that unrelated data would produce such a
    correlation by chance in 10% of all cases, and a p-value of 0.01
    indicates that this would happen in only 1% of all cases. Typically,
    we reject the null-hypothesis if `p < 0.05`. In
    Fig. [4](#org193e58b), the p-value is 0.008, i.e., unrelated data would
    show a correlation this strong in only 0.8% of all cases. Note that
    this is not the same as the probability that the null-hypothesis
    is true, and that there is not always a simple relationship between
    r<sup>2</sup> and p.
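
Both indicators can be computed by hand, which may help to demystify them. The sketch below uses synthetic data, and every name in it is made up for illustration. Note that the p-value is estimated here with a permutation test, which is a different technique from the t-test that regression libraries usually report. It is used because it makes the idea tangible: shuffle `y` so that it is unrelated to `x`, and count how often chance alone produces a correlation at least as strong as the observed one.

```python
import math
import random


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared(x, y):
    """Coefficient of determination of the least-squares line through (x, y)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    m = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
    a = mean_y - m * mean_x
    ss_res = sum((yi - (a + m * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)                    # total
    return 1 - ss_res / ss_tot


rng = random.Random(1)  # fixed seed, so the run is reproducible
x = [float(i) for i in range(30)]
y = [xi + rng.gauss(0, 5) for xi in x]  # synthetic: linear trend plus noise

r2 = r_squared(x, y)
r_obs = pearson_r(x, y)

# permutation test: destroy any relation by shuffling y, then count how
# often the shuffled data correlate at least as strongly as the original
n_perm = 2000
y_shuffled = y[:]
hits = 0
for _ in range(n_perm):
    rng.shuffle(y_shuffled)
    if abs(pearson_r(x, y_shuffled)) >= abs(r_obs):
        hits += 1
p_est = (hits + 1) / (n_perm + 1)  # the +1 avoids reporting exactly zero

print(f"r2 = {r2:.2f}, squared Pearson r = {r_obs**2:.2f}, p ~ {p_est:.4f}")
```

For a simple linear regression, the two ways of computing r<sup>2</sup> agree, which is why the squared correlation coefficient and the coefficient of determination are often used interchangeably in this setting.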



### The statsmodels library



\index{library!statsmodel} \index{library!statsmodel!formula api}
Python's success rests to a considerable degree on the myriad of
third-party libraries which, unlike MATLAB, are typically free to use. In
the following we will use the "statsmodels" library, but there are
plenty of other statistical libraries we could use as well.

The statsmodels library provides different interfaces. Here we will use
the formula interface, which is similar to the R formula
syntax. However, not all statsmodels functions are available through
this interface (yet?). First we import the needed libraries:



In [1]:
import os  # no need to set an alias, since os is already short
import pandas as pd
import statsmodels.formula.api as smf

# define the name of the file we want to read. Note that the file
# has to be present in the local working directory!
fn: str = "storks_vs_birth_rate.csv"  # file name

# this little piece of code could have saved me 20 minutes
if not os.path.exists(fn):  # check if the file is actually there
    raise FileNotFoundError(f"Cannot find file {fn}")

df: pd.DataFrame = pd.read_csv(fn)  # read data
df.columns = ["Babies", "Storks"]  # replace column names
df.head()  # test that all went well

Unnamed: 0,Babies,Storks
0,83,100
1,87,300
2,118,1
3,117,5000
4,59,9


In [2]:
dir(smf)


['__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 'gee',
 'glm',
 'glmgam',
 'gls',
 'glsar',
 'logit',
 'mixedlm',
 'mnlogit',
 'negativebinomial',
 'nominal_gee',
 'ols',
 'ordinal_gee',
 'phreg',
 'poisson',
 'probit',
 'quantreg',
 'rlm',
 'wls']

For the statistical analysis, we want to analyze whether the number of
storks predicts the number of babies. In other words, does the birth
rate depend on the number of storks?



In [1]:
from statsmodels.regression.linear_model import OLS  # only needed for the type hint

# next initialize our statistical model which describes our analysis
# as well as the data source. "ols" stands for ordinary least squares
model: OLS = smf.ols(formula="Babies ~ Storks", data=df)
results = model.fit()      # fit the model to the data
print(results.summary())   # print the results of the analysis

Plenty of information here, probably more than you asked for. But
note the first line, which states that 'Babies' is the dependent
variable. This is useful and will help you to catch errors in your
model definition. But what we really want are the slope, r<sup>2</sup>, and
p-value. The code below demonstrates how to extract these from the
model results:



In [1]:
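# retrieve values from the model results. The entries are labeled
# "Intercept" and by the name of the independent variable, so we
# access them by label rather than by position
slope: float = results.params["Storks"]     # the slope
y0: float = results.params["Intercept"]     # the y-intercept
rsquare: float = results.rsquared           # rsquare
pvalue: float = results.pvalues["Storks"]   # the p-value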
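
As a hedged, self-contained sketch of how these values might be used, the following refits the model on only the five rows shown in the `df.head()` output above (so the numbers will differ from the full analysis) and then predicts the birth rate for two arbitrarily chosen stork counts. The names `small`, `new_counts`, and `predicted` are made up for this example; `results.predict` is the statsmodels method for evaluating a fitted formula model on new data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# the five rows printed by df.head(); NOT the full data set
small = pd.DataFrame(
    {
        "Babies": [83, 87, 118, 117, 59],
        "Storks": [100, 300, 1, 5000, 9],
    }
)

results = smf.ols(formula="Babies ~ Storks", data=small).fit()

# params is a labeled pandas Series, so access by name is unambiguous
slope = float(results.params["Storks"])
y0 = float(results.params["Intercept"])

# predict the dependent variable for new (arbitrarily chosen) stork counts
new_counts = pd.DataFrame({"Storks": [100, 1000]})
predicted = results.predict(new_counts)

for storks, babies in zip(new_counts["Storks"], predicted):
    # each prediction is simply the linear equation y = a + m*x
    print(f"{storks} storks -> {babies:.1f} babies (same units as the data)")
```

With only five data points the fit is of course not meaningful. The point is merely that the prediction equals `y0 + slope * storks`, i.e., the linear model introduced at the beginning of this chapter.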