# Fundamentals of Statistics
## Correlation and Normal Linear Regression

## Data

This dataset is from the [World Happiness Report](https://worldhappiness.report/ed/2018/). The report uses six key variables to measure happiness differences: “income, healthy life expectancy, having someone to count on in times of trouble, generosity, freedom and trust, with the latter measured by the absence of corruption in business and government.”
The Happiness Index is a survey-based measure of happiness that was first used in the 2012 World Happiness Report. Respondents rate their happiness on a scale from 0 to 10, and the index is the average of their ratings. In this dataset it is stored in the "Life Ladder" column.

We import the required modules and then read in the data, which is stored in an Excel file. We keep only the rows for 2018 (`Year == 2018`) and only the first 11 columns/variables. Because the data contain many missing values, shown as NaN (Not a Number), we use the `dropna()` function to drop every row in which at least one element is missing.

The final dataset (DataFrame) that we will work with is `happy_2018`.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
happyness = pd.read_excel("Chapter2OnlineData.xls")     # load the full dataset
happy_2018 = happyness.loc[(happyness['Year'] == 2018)] # make a subset for the year 2018
happy_2018 = happy_2018.iloc[0:136, 0:11]               # keep the rows and only the first 11 columns we need
happy_2018 = happy_2018.dropna()                        # drop the countries/rows that have NaNs
happy_2018                                              # the final dataset we work with has 118 rows and 11 columns
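
To see what the year filter and `dropna()` do without opening the Excel file, here is a minimal sketch on a small made-up table. The column names mimic those in the report, but the countries and values are invented purely for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: invented values, NOT taken from the World Happiness Report
toy = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Year": [2017, 2018, 2018, 2018],
    "Life Ladder": [5.1, 6.3, np.nan, 4.8],
    "Social support": [0.80, 0.91, 0.75, np.nan],
})

toy_2018 = toy.loc[toy["Year"] == 2018]  # keep only the 2018 rows
toy_clean = toy_2018.dropna()            # drop rows with at least one NaN

print(len(toy_2018), "rows for 2018,", len(toy_clean), "left after dropna()")
print(toy_2018.isna().sum())             # number of missing values per column
```

Counting the missing values per column with `isna().sum()` before dropping rows is a quick way to check how much data `dropna()` is about to remove.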

## Correlation

To investigate the relationship between two variables, we can calculate the correlation between them. It also helps to look at a scatter plot of the two variables. We use the `seaborn` module, imported as `sns`, to make such a plot easily. Let's explore the relationship between "Social support" and "Life Ladder".

In [None]:
sns.scatterplot(data=happy_2018, x="Social support", y="Life Ladder")
plt.title("Scatter plot")
plt.show()

The plot shows that countries with more social support tend to score higher on the life ladder. But how strong is this linear relationship? There are different ways to calculate the Pearson correlation in Python. We import the `scipy.stats` module and use the `pearsonr` function; the first element of its result, `[0]`, is the correlation coefficient.

In [None]:
import scipy.stats
scipy.stats.pearsonr(happy_2018["Social support"], happy_2018["Life Ladder"])[0]
# or you can try
# from scipy.stats import pearsonr
# pearsonr(happy_2018["Social support"], happy_2018["Life Ladder"])[0]
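
To make clear what the coefficient measures, the sketch below computes the Pearson correlation by hand and checks it against NumPy. The helper name `pearson_r` and the data values are invented for illustration; they are not part of the dataset.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: sum of products of deviations from the mean,
    divided by the square root of the product of the sums of squared deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Invented example values with a clear increasing trend
x = [0.60, 0.70, 0.80, 0.90, 0.95]
y = [4.0, 4.9, 5.6, 6.8, 7.1]

r_manual = pearson_r(x, y)
r_numpy = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r_manual, r_numpy)
```

The two numbers should agree up to floating-point rounding. A value close to 1 means the points lie close to an increasing straight line.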

Next, we look at the relationship between "Perceptions of corruption" and "Life Ladder".

In [None]:
sns.scatterplot(data=happy_2018, x="Perceptions of corruption", y="Life Ladder")
plt.title("Scatter plot")
plt.show()

In [None]:
scipy.stats.pearsonr(happy_2018["Perceptions of corruption"], happy_2018["Life Ladder"])[0]
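
The `[0]` index is needed because `pearsonr` returns two values: the correlation coefficient and a p-value. A short sketch with invented, decreasing data (not from the dataset) shows both elements:

```python
import numpy as np
from scipy.stats import pearsonr

# Invented values with a decreasing trend
x = np.array([0.2, 0.4, 0.5, 0.7, 0.8, 0.9])
y = np.array([7.2, 6.9, 6.1, 5.6, 5.0, 4.4])

# Element [0] is the correlation coefficient, element [1] is the p-value
r, p_value = pearsonr(x, y)
print(f"r = {r:.3f}, p-value = {p_value:.4f}")
```

A negative coefficient means that larger values of one variable tend to go with smaller values of the other.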

## Simple Linear Regression (with one predictor)

### Regression model for Life Ladder as the response variable and Social Support as the predictor

The `lmplot` function from the Seaborn module draws a scatter plot with a regression line fitted to the variables. In the plot below, the axes are extended so that the point $(0,0)$ is also included.

In [None]:
sns.lmplot(data=happy_2018, x="Social support", y="Life Ladder", ci=None)
plt.title("Scatter plot with a regression line")
plt.ylim(bottom=0)  # start the y-axis at 0
plt.xlim(left=0)    # start the x-axis at 0
plt.show()

We use the Python module [statsmodels](https://www.statsmodels.org/stable/examples/index.html) to fit regression models. There are other Python modules for this purpose, but `statsmodels` produces clear and complete output, which helps with understanding the process. See the [Scipy Lecture Notes](http://scipy-lectures.org/packages/statistics/index.html#linear-models-multiple-factors-and-analysis-of-variance) for extra regression examples with `statsmodels`.

We import the function `ols` (ordinary least squares) from `statsmodels.formula.api` to fit the regression model and estimate its parameters. `Q("...")` quotes a variable name inside the formula. It is needed here because the variable names contain spaces; otherwise you wouldn't need it. The general format of using `ols` is:

`model_name = ols('y ~ x', data=data_name).fit()`

In [None]:
from statsmodels.formula.api import ols
model_ss = ols('Q("Life Ladder") ~ Q("Social support")', data=happy_2018).fit()
model_ss.params

The `summary()` method of the fitted model gives a complete and detailed output of the model.

In [None]:
model_ss.summary()
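
For a model with one predictor, the least-squares estimates have a simple closed form: the slope is the sum of products of deviations divided by the sum of squared deviations of the predictor, and the intercept makes the line pass through the point of means. The sketch below uses invented values and NumPy only (`np.polyfit` stands in for a library fit here); it is an illustration of the formulas, not the `statsmodels` implementation.

```python
import numpy as np

# Invented values, for illustration only
x = np.array([0.60, 0.70, 0.80, 0.90, 0.95])
y = np.array([4.0, 4.9, 5.6, 6.8, 7.1])

# Closed-form least-squares estimates for y = intercept + slope * x
slope = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
intercept = y.mean() - slope * x.mean()

# Cross-check with NumPy's degree-1 polynomial fit (returns slope first)
slope_np, intercept_np = np.polyfit(x, y, deg=1)

# Predicted response at a new predictor value
y_hat = intercept + slope * 0.85
print(slope, intercept, y_hat)
```

The `params` of a fitted `ols` model hold the same two quantities, the intercept and the slope, for the real data.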

__Model without an intercept:__
In the model output, you can see that the p-value of the intercept is large (> 0.05). This means the data do not rule out an intercept of zero, so we can try to fit a regression model without an intercept by adding `-1` to the formula. The model then becomes `y = b*x`.

In [None]:
from statsmodels.formula.api import ols
model_ss0 = ols('Q("Life Ladder") ~ Q("Social support") -1', data=happy_2018).fit()
model_ss0.summary()
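
For a line through the origin, `y = b*x`, the least-squares slope is the sum of `x*y` divided by the sum of `x**2`. The sketch below checks this on invented values with NumPy. It also computes R-squared in two ways, because `statsmodels` labels the R-squared of a model without an intercept as uncentered, which means it is not directly comparable with the R-squared of the model with an intercept.

```python
import numpy as np

# Invented values, for illustration only
x = np.array([0.60, 0.70, 0.80, 0.90, 0.95])
y = np.array([4.0, 4.9, 5.6, 6.8, 7.1])

# Least-squares slope for the model y = b * x (no intercept)
b = (x * y).sum() / (x ** 2).sum()

# Cross-check with a general least-squares solver on a one-column design matrix
b_lstsq = np.linalg.lstsq(x[:, np.newaxis], y, rcond=None)[0][0]

resid = y - b * x
# Uncentered: total variation is measured around zero
r2_uncentered = 1 - (resid ** 2).sum() / (y ** 2).sum()
# Centered: total variation is measured around the mean of y
r2_centered = 1 - (resid ** 2).sum() / ((y - y.mean()) ** 2).sum()
print(b, r2_uncentered, r2_centered)
```

The uncentered value is the larger of the two here, so a high R-squared in a no-intercept summary does not by itself show that dropping the intercept improved the fit.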

### Regression model for Life Ladder as the response variable and Perceptions of corruption as the predictor

We repeat the same process to fit another model, this time using "Perceptions of corruption" to predict the response variable "Life Ladder".

In [None]:
sns.lmplot(data=happy_2018, x="Perceptions of corruption", y="Life Ladder", ci=None)
plt.title("Scatter plot with a regression line")
plt.show()

In [None]:
from statsmodels.formula.api import ols
model_pc = ols('Q("Life Ladder") ~ Q("Perceptions of corruption")', data=happy_2018).fit()
model_pc.params

In [None]:
model_pc.summary()
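
In a simple linear regression with an intercept, the R-squared of the model equals the square of the Pearson correlation between the predictor and the response, and the slope has the same sign as the correlation. The sketch below verifies this on invented, decreasing data with NumPy; the values are not from the dataset.

```python
import numpy as np

# Invented values with a decreasing trend
x = np.array([0.2, 0.4, 0.5, 0.7, 0.8, 0.9])
y = np.array([7.2, 6.9, 6.1, 5.6, 5.0, 4.4])

r = np.corrcoef(x, y)[0, 1]                 # Pearson correlation
slope, intercept = np.polyfit(x, y, deg=1)  # least-squares line

fitted = intercept + slope * x
ss_res = ((y - fitted) ** 2).sum()    # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()  # total sum of squares around the mean
r_squared = 1 - ss_res / ss_tot
print(r, r ** 2, r_squared)
```

This links the two halves of the notebook: squaring the correlations computed earlier should reproduce the R-squared values reported in the summaries of the models that include an intercept.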

# Exercises

1. Make a scatterplot of "Life Ladder" and "Freedom to make life choices". Calculate the Pearson correlation between these two variables. What does this correlation mean?

2. Fit a simple linear regression model for "Life Ladder" as the dependent variable and "Freedom to make life choices" as the predictor variable. Get the summary of the model.

3. Explain what you learn about the life ladder from this regression model.