## Linear Regression



Before diving into linear regression analysis, I need to elaborate a
bit on type hinting for variable types which are defined by external
libraries. Since type hinting is a rather new addition to Python, most
third-party packages have no support for it. On the other hand, it is
a useful tool, which reminds us that while `ax.scatter()` and
`sns.scatterplot()` look very similar, `ax` is a plot handle, whereas
`sns` is an alias for the seaborn library.

The typing library provides support to declare generic variable types,
and we used this without much explanation before, e.g.:

    pdf = TypeVar('pandas.core.frame.DataFrame')

However, how would you know the argument to `TypeVar()`? We can simply
create a dataframe object, and then query its type:



In [1]:
import pandas as pd
newDF = pd.DataFrame() #create empty dataframe
print(type(newDF))     #print its type

So now you know that a pandas dataframe is of type
`pandas.core.frame.DataFrame`, and we can rewrite the above code as:



In [1]:
from typing import TypeVar
import pandas as pd

pdf = TypeVar('pandas.core.frame.DataFrame')
newDF :pdf = pd.DataFrame() #create empty dataframe

Granted, line 5 above is pretty redundant. However, sometimes we don't
want to create the actual data object in the header section of our
program. Consider, e.g., the case of a plot handle which is specific to
a given figure handle. You may not want to create these in the top
section of your program, but it would still be useful to declare all
the types we use for type hinting there. Python allows us to do this with
a sort of non-statement: an annotation without an assignment. We can, e.g., write:



In [1]:
a : int # this is a counter

So we declare the type of a variable, but we do not initialize its
value (add a `print(a)` to test for yourself). This allows us to
explain all variables at the beginning of our code, without actually
initializing them. From here on, please use the following type
hint definitions in your code:



In [1]:
from typing import TypeVar

# declare the non-standard type hints
pdf = TypeVar('pandas.core.frame.DataFrame')
pds = TypeVar('pandas.core.series.Series')
smm = TypeVar('statsmodels.regression.linear_model.OLS')
smr = TypeVar('statsmodels.regression.linear_model.RegressionResultsWrapper')
npa = TypeVar('numpy.ndarray')
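
The two ideas from this section can be tried out in isolation. The following is a minimal sketch, not part of the course code: `type_path` is a hypothetical helper name, and a standard-library `Fraction` stands in for a dataframe so that the example runs without any third-party package. The same helper should work for a pandas or numpy object.

```python
from fractions import Fraction


def type_path(obj) -> str:
    """Return the module path plus class name of an object's type."""
    t = type(obj)
    return f"{t.__module__}.{t.__qualname__}"


# a standard-library object stands in for a dataframe here
name = type_path(Fraction(1, 3))
print(name)

# an annotation without an assignment does not create the variable
counter: int  # this is a counter
try:
    print(counter)
    counter_exists = True
except NameError:
    counter_exists = False
print(counter_exists)
```

The string returned by the helper is the kind of name that is passed to `TypeVar()` above.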

### Correlation versus Causation



Back in my home country, and before the hippie movement changed our
culture, kids who were curious about where babies come from were told
that they are brought by the stork (a large bird, see
Fig. [fig:storksa](#fig:storksa)). Storks were indeed a common sight in rural
areas, and large enough to sell this story to a 3-year-old.

![img](Linear_Regression/Ringed_white_stork_2019-11-22_15-14-15.png "The Stork. Image by Soloneying, from https://commons.wikimedia.org/wiki/File:Ringed_white_stork.jpg, downloaded Nov 22<sup>nd</sup> 2019.")

Too bad, we are now grown-up scientists with a penchant for critical
thinking. Rather than believing this story, we want to see the data. If the
story were true, we should see a good correlation between the number of storks
and the number of babies. Lo and behold, these two variables actually
correlate in a statistically significant way, i.e., the more storks we count in
a country, the higher the (human) birthrate. Since both variables increase
together, this is called a positive correlation. See Fig. [4](#orgdbab230).

![img](storks.png "The birthrate and the number of stork pairs correlate in a statistically significant way. This analysis suggests that each stork pair delivers about 29 human babies, and that about 225 thousand babies were born otherwise. Data after [Mathews (2000)](https://drive.google.com/open?id=1zXX-6dp4X1heAb9XKq9ya3PrENI5beIH).")

Now, does this prove that the storks deliver the babies? Obviously (or so we
think) not. The fact that two observable quantities correlate in no way
implies that one is the cause of the other. The more likely explanation is that
both variables are affected by a common process (e.g., industrialization).

It is a common mistake to confuse correlation with causation. Another good
example is to correlate drinking with heart attacks. These two will surely
correlate, but the story is more complicated. For example, do drinkers tend
to exercise less than non-drinkers? So even if you have a
good hypothesis why two variables are correlated, the correlation on its own
proves nothing.
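
The effect of a common process can be illustrated with simulated data. The sketch below is an assumption-laden toy model, not the stork dataset: the hidden driver `z`, the noise levels, and the variable names are all made up. Two quantities that never influence each other still correlate strongly because both depend on `z`, and the correlation vanishes once `z` is removed.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed for repeatable numbers
n = 1000

# hidden common driver (hypothetical)
z = rng.normal(0.0, 1.0, size=n)

# two observables that both depend on z, but not on each other
storks = z + rng.normal(0.0, 0.5, size=n)
babies = z + rng.normal(0.0, 0.5, size=n)

# correlation of the raw observables
r_raw = np.corrcoef(storks, babies)[0, 1]

# remove the common driver: what remains is independent noise
r_controlled = np.corrcoef(storks - z, babies - z)[0, 1]

print(f"correlation with common driver:    {r_raw:.2f}")
print(f"correlation after removing driver: {r_controlled:.2f}")
```

In real data the common driver is usually not measured, which is why a correlation alone cannot settle the question.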



### Understanding the results of a linear regression analysis



Regression analysis examines how well a dataset of two variables (let's
call them `x` and `y`) can be described by a function which allows us
to predict the value of the dependent variable `y` based on the value
of the independent variable `x`. In the case of a linear regression,
this function is a linear equation:

\begin{equation}
\label{eq:1}
y = a+mx
\end{equation}

where `a` denotes the y-axis intercept and `m` denotes the slope. Note
that the above equation is a simple model, which we can use to make
predictions about actual data. Linear regression analysis adjusts the
parameters `a` and `m` in such a way that the difference between the
measured data and the model prediction is minimized.

From a user perspective, we are interested in understanding how good the
model actually is, and how to interpret the key indicators of a given
regression model:
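
For a straight line, the parameters that minimize the sum of squared differences can be written down directly. The following sketch uses made-up data points and assumes the ordinary least squares criterion; it cross-checks the hand-computed values of `m` and `a` against numpy's `polyfit`.

```python
import numpy as np

# made-up data points, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# closed-form least-squares solution for y = a + m*x
m = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - m * x.mean()

# the quantity being minimized: the sum of squared differences
residuals = y - (a + m * x)
sse = np.sum(residuals ** 2)

# cross-check against numpy's polynomial fit (degree 1 = straight line)
m_np, a_np = np.polyfit(x, y, 1)

print(f"m = {m:.3f}, a = {a:.3f}, sum of squared residuals = {sse:.3f}")
```

Any other choice of `m` and `a` gives a larger sum of squared residuals for these data.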

-   **r^2:** or coefficient of determination. This value is in the range
    from zero to one and expresses how much of the observed
    variance in the data is explained by the regression
    model. So a value of 0.7 indicates that 70% of the variance
    is explained by the model, and that the remaining 30% is
    caused by other processes which are not captured by the
    linear model (e.g., measurement errors, or some non-linear
    effect affecting `x` and `y`). In the stork figure above, 38% of the
    variance in the birthrate can be explained by the increase
    in stork pairs. Note that often you will also find the term
    R^2. For a simple linear regression with two variables, r^2
    equals R^2. However, if your model incorporates more than two
    variables, these numbers can be different.
-   **p:** When you do a linear regression, you basically state the
    hypothesis that `y` depends on `x` and that they are linked by a
    linear equation. If you test a hypothesis, however, you also
    have to consider the so-called **null-hypothesis**, which in this
    case states that `y` is unrelated to `x`. The p-value
    expresses how likely it is to obtain a correlation at least as
    strong as the observed one if the null-hypothesis were true. So
    a p-value of 0.1 indicates a 10% chance that unrelated data
    would correlate this well by accident, and a p-value of 0.01
    indicates a 1% chance. Typically, we reject the
    null-hypothesis if `p < 0.05`. In the stork figure above,
    `p = 0.008`, so unrelated data would produce such a correlation
    in fewer than 1% of cases. Note that there is not
    always a simple relationship between r^2 and p.
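
Both indicators can be computed by hand for a small example. The sketch below uses made-up data (a linear trend plus random noise) and is only an illustration: r^2 is computed from the explained variance, and the p-value is estimated with a simple shuffling experiment rather than with the analytical test a statistics library would typically use.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # fixed seed so the run is repeatable

# made-up example data: a linear trend plus noise
x = np.arange(20, dtype=float)
y = 3.0 + 0.5 * x + rng.normal(0.0, 2.0, size=x.size)

# least-squares fit y = a + m*x
m, a = np.polyfit(x, y, 1)
y_pred = a + m * x

# r^2 = 1 - (unexplained variance / total variance)
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

# for a single predictor this equals the squared Pearson correlation
r = np.corrcoef(x, y)[0, 1]

# shuffling estimate of p: how often does unrelated (shuffled) data
# correlate at least as strongly as the real data?
n_shuffles = 2000
count = 0
for _ in range(n_shuffles):
    r_shuffled = np.corrcoef(x, rng.permutation(y))[0, 1]
    if abs(r_shuffled) >= abs(r):
        count += 1
p_estimate = count / n_shuffles

print(f"r^2 = {r_squared:.3f}, estimated p = {p_estimate:.4f}")
```

Shuffling `y` destroys any relationship with `x`, so the shuffled datasets play the role of the null-hypothesis.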



### The statsmodels library



Python's success rests to a considerable degree on the myriad of third-party
libraries which, unlike MATLAB, are typically free to use. In
the following we will use the "statsmodels" library, but there are
plenty of other statistical libraries we could use as well.

The statsmodels library provides different interfaces. Here we will use
the formula interface, which is similar to the R formula
syntax. However, not all statsmodels functions are available through
this interface (yet?). First we import the needed libraries:



In [1]:
from typing import TypeVar # type hinting support
import os                  # os support
import pandas as pd        # use pandas to read the data
# and the statsmodel formula interface for the regression
import statsmodels.formula.api as smf

Next we declare the non-standard type hints:



In [1]:
# declare the type hints
pdf = TypeVar('pandas.core.frame.DataFrame')
pds = TypeVar('pandas.core.series.Series')
smm = TypeVar('statsmodels.regression.linear_model.OLS')
smr = TypeVar('statsmodels.regression.linear_model.RegressionResultsWrapper')
npa = TypeVar('numpy.ndarray')
mpf = TypeVar('matplotlib.figure.Figure')
mpa = TypeVar('matplotlib.axes._subplots.AxesSubplot')

Now we read the data:



In [1]:
# the filename
fn :str = "storks_vs_birth_rate.csv" # file name

# read the data
if os.path.exists(fn): # check if the file is actually there
     df :pdf = pd.read_csv(fn)         # read data
     df.columns = ["Babies", "Storks"] # replace column names
     print(df.head())
else:
     print("\n ------------------------------- \n")
     print(f"{fn} not found")
     print("\n ------------------------------- \n")
     exit()

For the statistical analysis, we want to analyze whether the number of
storks predicts the number of babies. In other words does the birth
rate depend on the number of storks?



In [1]:
# first we declare some variable names, otherwise the lines below will
# look quite messy. Note that these variables are not initialized with
# any value. The next two lines are in fact non-statements and merely
# help to improve the clarity of the code
model :smm      # this variable will hold our statistical model
results :smr    # this variable will hold the results of the analysis

# next initialize our statistical model which describes our analysis
# as well as the datasource. "ols" stands for ordinary least squares
model   = smf.ols(formula="Babies ~ Storks",data=df)
results = model.fit()      # fit the model to the data
print(results.summary())   # print the results of the analysis

Plenty of information here, probably more than you asked for. But
note the first line, which states that 'Babies' is the dependent
variable. This is useful and will help you to catch errors in your
model definition. But what we really want are the slope, r^2, and
p-values. The code below demonstrates how to extract these from the
model results:



In [1]:
# retrieve values from the model results
slope   :float = results.params["Storks"]    # the slope
y0      :float = results.params["Intercept"] # the y-intercept
rsquare :float = results.rsquared            # rsquare
pvalue  :float = results.pvalues["Storks"]   # the p-value of the slope
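
The parameters and p-values are stored as labeled pandas Series, so they can be looked up by term name. The sketch below uses made-up data with the same column names as the stork example. It also prints the confidence interval of the slope via `conf_int()`, which is not needed for the assignments and is shown here only as an optional extra.

```python
import pandas as pd
import statsmodels.formula.api as smf

# made-up stand-in data with the same column names as the stork example
df = pd.DataFrame({"Storks": [1.0, 2.0, 3.0, 4.0, 5.0],
                   "Babies": [2.1, 3.9, 6.2, 7.8, 10.1]})
results = smf.ols(formula="Babies ~ Storks", data=df).fit()

# params and pvalues are pandas Series, labeled by term name
labels = list(results.params.index)
print(labels)

slope = results.params["Storks"]
intercept = results.params["Intercept"]

# 95% confidence interval of each parameter (columns 0 and 1 = lower, upper)
ci = results.conf_int(alpha=0.05)
slope_low, slope_high = ci.loc["Storks"]
print(f"slope = {slope:.2f}, 95% interval {slope_low:.2f} to {slope_high:.2f}")
```

Looking values up by label keeps the code correct even if the order of the terms in the model changes.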

## Assignments



Notes: Create a notebook in your submissions folder with this name:
"linear-regression-FirstName-LastName". In order to submit your
assignment, you need to download it and submit it on Quercus (ipynb
and pdf format). Please have the usual header with date, name
etc. Don't forget to copy all `csv` files into your submission folder,
otherwise the assignments won't work.

Make sure that your pdf file is complete and shows all output from
your code cells.

Marking Scheme: 1 pt per question. It is either right or wrong. No half
points.

All exercises:

-   use self contained code
-   use 100 dpi plot resolution
-   use 5x4 inch plot size
-   use the darkgrid scheme
-   use the object oriented interface to matplotlib.pyplot or seaborn
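
The plot settings in this list could be set up as in the following sketch. It assumes that "darkgrid" refers to the seaborn style of that name. The axis labels are placeholders. The `Agg` backend line only ensures that the sketch runs without a display and should be left out in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")  # the darkgrid scheme

# object oriented interface: keep explicit figure and axes handles
fig, ax = plt.subplots(figsize=(5, 4), dpi=100)  # 5x4 inch at 100 dpi

ax.set_xlabel("x label")  # placeholder labels
ax.set_ylabel("y label")
fig.tight_layout()
```

All further plotting calls then go through `ax`, e.g. `ax.scatter(...)` or `sns.regplot(..., ax=ax)`.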

Here you go:

1.  Create a scatter plot of the stork data, and annotate the x and y axes
2.  Analyze the stork data using the example code, then extract the
    regression parameters, and use the results to create a single
    multiline f-string which shows the regression parameters as seen
    in Fig. 2. Use the `text()` method of the plot handle to place
    this string in the upper left corner of the plot. Here is a quick
    refresher on f-strings:



In [1]:
mystring :str = (f"y = {y0:1.6f}\n"  # "\n" starts a new line
                 f"next line"
                 )

Look up the print module of the course for details.
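
As a small illustration of how adjacent f-strings inside parentheses are joined into one string, consider the sketch below. The numbers and the variable names ending in `_demo` are made up and are not results of the stork analysis.

```python
# made-up numbers, for illustration only
intercept_demo = 0.05
slope_demo = 1.98765
rsq_demo = 0.38

# adjacent string literals inside the parentheses are joined into one string;
# "\n" marks where the next line starts, ".2f" rounds to two decimals
label = (f"y = {intercept_demo:.2f} + {slope_demo:.2f} x\n"
         f"r^2 = {rsq_demo:.2f}")

print(label)
```

Without the `"\n"` the two pieces would end up on the same line.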

3.  Using the predict method of your model result, generate a new set
    of predicted y-values (yp = results.predict()), which you then
    plot as a line on top of your scatterplot. By now, this should
    look very similar to Figure 2.

4.  Rather than doing a scatter and line plot with matplotlib, create
    a plot with seaborn's regression plot function. This plot type is
    called in the following way:
    
        sns.regplot(x='Storks', y='Babies', data=df, ax=ax)
    
    and add the regression parameters. Your plot should now look
    exactly like Fig. 2.
5.  Using the plot (and code) from #4 as a template, and the
    `grades_vs_attendance.csv` file as data source, analyze the
    relationship between class attendance (the independent variable)
    and grade performance (the dependent variable).
6.  Discuss the statistical relevance of your analysis.
7.  Using the code below, we will first create a dataset with a known
    linear relationship, then add some noise to it, and then analyze
    the noisy data and compare it to the actual data. Plot the
    results of this experiment similar to Fig. [fig:example](#fig:example). This
    figure shows an example of how this could look. The orange line
    represents the original equation, whereas the blue line is the
    result of the regression analysis.

![img](Assignments/example_2019-11-25_13-28-13.png "Example of how a plot for Question 7 could look. The orange line represents the original equation, whereas the blue line is the result of the regression analysis.")



In [1]:
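from typing import TypeVar
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# type hints as declared earlier
npa = TypeVar('numpy.ndarray')
pdf = TypeVar('pandas.core.frame.DataFrame')
smm = TypeVar('statsmodels.regression.linear_model.OLS')
smr = TypeVar('statsmodels.regression.linear_model.RegressionResultsWrapper')

# create 30 values from 0 to 29 with a stepsize of 1
x0 :npa = np.arange(0, 30, 1) # this is the independent variable

# calculate y-values based on x0
y0 :npa = 12.1 + 0.23 * x0

# create some random noise which varies between -3 and 3
noise :npa = np.random.uniform(-3, 3, size=30)

# and add it to the dataset
yn :npa = y0 + noise

# create a dataframe
df :pdf = pd.DataFrame({'X':x0,'Y':yn})

# create a regression model, using the dataframe column names
model :smm = smf.ols(formula="Y ~ X",data=df)
results :smr = model.fit()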

8.  Rerun the code in #7 but this time, instead of using 30 data
    points, use only 10. Then rerun it with 100 data points. There is
    no need to plot the results, but you should tabulate `rsquare`
    and `p` versus the number of observations `N`. Describe the
    results in your own words.

