In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("09-exercise-pids2024.ipynb")

# Exercise sheet 9
**Hello everyone!**

**Points: 15**

Topics of this exercise sheet are:
* P-Values
* Correlations

Please let us know if you have questions or problems! <br>
Contact us during the exercise session or on [Piazza](https://piazza.com/unibas.ch/spring2024/63982).

**Automatic Feedback**

This notebook can be automatically graded using Otter grader. To find out how many points you get, simply run `grader.check_all()` in a new cell.

In [None]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from scipy.stats import norm

![Basel](basel.jpg)

# Basel's neighborhoods
We are working with a dataset containing information about Basel and its neighborhoods. You can find it here: https://opendata.swiss/de/dataset/kennzahlen-zu-den-basler-wohnvierteln-und-landgemeinden. It is also included as a CSV file in this folder.

In [None]:
bs = pd.read_csv("basel_neighborhoods.csv", sep=";")
bs.head()

## Question 1) P value (10 points)
Let's see if there's an answer to the following question: Did the amount of green spaces in Basel's neighborhoods increase *significantly* between 2015 and 2021? <br>

The column that measures the amount of green spaces (parks and such) is called `anteil_gruenflaechen`.

### 1a) Observed difference (1 point)
What is the observed difference between the mean of the amount of green spaces in 2015 and in 2021? <br>
Assign the mean of the amount of green spaces for the years 2015 and 2021 to the variables `mean_2015` and `mean_2021` respectively.

*Hint:* You can average over all neighborhoods
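
A minimal sketch of this hint on toy data: select the rows of one year, then average the column over all neighborhoods. The column names `year` and `green_share` and all values are made up for illustration; the real dataset uses its own column names.

```python
import pandas as pd

# Toy data: two hypothetical neighborhoods observed in two years
toy = pd.DataFrame({
    "neighborhood": ["A", "B", "A", "B"],
    "year": [2015, 2015, 2021, 2021],
    "green_share": [10.0, 20.0, 12.0, 22.0],
})

# Average over all neighborhoods, separately for each year
toy_mean_2015 = toy.loc[toy["year"] == 2015, "green_share"].mean()
toy_mean_2021 = toy.loc[toy["year"] == 2021, "green_share"].mean()

print(toy_mean_2021 - toy_mean_2015)
```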

In [None]:
class Question1a:
    mean_2015 = ...
    mean_2021 = ...
    
    print(mean_2021 - mean_2015)

In [None]:
grader.check("Question 1a")

### 1b) Artificial variables (3 points)
Let us make a new dataframe called `neighborhoods`. It contains the neighborhoods of Basel in a column called `wohnviertel_name`. Additionally, it should have a column called `increase_green_space` that contains
- the number 1 if the amount of green space increased between 2015 and 2021 for this neighborhood
- the number 0 otherwise
      
*More formally:*
$$
  \begin{equation}
    increase\_green\_space=
    \begin{cases}
      1, & \text{if}\ \text{anteil\_gruenflaechen\_2021}[x] > \text{anteil\_gruenflaechen\_2015}[x],\ x \in \text{wohnviertel\_name} \\
      0, & \text{otherwise}
    \end{cases}
  \end{equation}
$$
**Hints:** 
- Split the data into two new dataframes: `bs_2021` and `bs_2015`. `bs_2021` will be reused in the last exercise of this sheet.
- Merge `bs_2021` and `bs_2015` on the column `wohnviertel_name` and use the `suffixes=("_2021", "_2015")` option so that the column names match the formula above.
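
The merge step can be sketched on toy data as follows. The dataframes, the column names `name` and `green`, and the values are hypothetical. Passing the suffixes as strings with a leading underscore is an assumption about the intended column names.

```python
import pandas as pd

# Toy data with hypothetical column names and values
toy_2021 = pd.DataFrame({"name": ["A", "B", "C"], "green": [12.0, 20.0, 8.0]})
toy_2015 = pd.DataFrame({"name": ["A", "B", "C"], "green": [10.0, 20.0, 9.0]})

# Overlapping non-key columns get the suffixes appended to their names
merged = toy_2021.merge(toy_2015, on="name", suffixes=("_2021", "_2015"))

# A boolean comparison converted to 1/0
merged["increased"] = (merged["green_2021"] > merged["green_2015"]).astype(int)

print(merged)
```

Only neighborhood `A` has a strictly larger value in 2021, so it is the only row marked with 1; an unchanged value counts as 0.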

In [None]:
class Question1b:
    ...
    neighborhoods = ...
    display(neighborhoods.head())

In [None]:
grader.check("Question 1b")

### 1c) Number of neighborhoods with an increase in green spaces (2 points)

We want to verify if there was an increase or decrease in green spaces. For this, we work under the null hypothesis:
$$ \mathbb{P}(\text{green spaces increase}) -0.5 = 0. $$


What proportion of neighborhoods had increased green spaces from 2015 to 2021? Save this decimal number (float) in the variable called `increase_2021`. <br>
This value is our estimate of $\mathbb{P}(\text{green spaces increase})$, so it lets us calculate the difference from the null hypothesis value. Assign the difference between `increase_2021` and the null hypothesis value (0.5) to the variable called `difference_null_hypothesis` (float).
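
As a small illustration with made-up values: the mean of a 0/1 column equals the share of ones, which is why an indicator column is convenient here. The variable names below are hypothetical, and the sign convention (share minus 0.5) follows the form of the null hypothesis above.

```python
import pandas as pd

# Hypothetical 0/1 indicator: 1 = increase, 0 = no increase
indicator = pd.Series([1, 0, 1, 1])

# The mean of a 0/1 column equals the share of ones
share_of_ones = indicator.mean()

# Distance from the null hypothesis value 0.5
diff_from_null = share_of_ones - 0.5

print(share_of_ones, diff_from_null)
```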

In [None]:
class Question1c:
    increase_2021 = ...
    difference_null_hypothesis = ...
    print(increase_2021, difference_null_hypothesis)

In [None]:
grader.check("Question 1c")

### 1d) Calculate the p-value (3 points)
What is the p-value for the observed difference in green space between 2015 and 2021? In other words, what is the p-value for
$$
\begin{align*}
P(|\text{mean2021} - 0.5| > \text{observed\_difference})
\end{align*}
$$
Assign the p-value to the variable called `p_value`

*Hint:* Look at the slides from lecture 9 to find an example.
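
One possible way to obtain such a p-value, not necessarily the one shown in the lecture, is an exact two-sided binomial test. Under the null hypothesis each neighborhood shows an increase with probability 0.5, and the p-value is the probability of a count at least as far from `n / 2` as the observed one. The numbers `n = 20` and `k = 14` are made up for illustration.

```python
from math import comb

n = 20  # hypothetical number of neighborhoods
k = 14  # hypothetical number of neighborhoods with an increase

# Under the null hypothesis, the count of increases is Binomial(n, 0.5).
# Sum the probabilities of all counts at least as far from n/2 as k.
# Comparing |2*i - n| keeps the comparison in exact integer arithmetic.
toy_p_value = sum(
    comb(n, i) for i in range(n + 1) if abs(2 * i - n) >= abs(2 * k - n)
) / 2**n

print(toy_p_value)
```

For these toy numbers the result is about 0.115. A normal approximation or a simulation would be alternative routes to a similar value.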

In [None]:
class Question1d:
    ...
    p_value = ...
    print("The p-value is: ", p_value)

In [None]:
grader.check("Question 1d")

### 1e) Do we reject or fail to reject the null hypothesis based on our p-value? (1 point)
Here our significance level is 0.05.

Please write your answer like this:
* If we reject the null hypothesis: `null_hypothesis = "reject"`
* If we cannot reject the null hypothesis: `null_hypothesis = "fail to reject"`

(Hint: Read the question carefully and google it if you're not sure what it means)

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: Question 1e      # (required) the name of the question
manual: true     # whether this is a manually-graded question
points: 1      # how many points this question is worth; defaults to 1 internally
check_cell: false  # whether to include a check cell after this question (for autograded questions only)
-->

In [None]:
class Question1e:
    null_hypothesis = "..."

In [None]:
grader.check("Question 1e")

<!-- END QUESTION -->



## Question 2) Correlation and regression (5 points)

### 2a) Plotting the correlation (2 points)
Make a scatter plot **using the pandas plotting function** that shows how income tax `einkommenssteuer_pro_veranlagung` and apartment size `flaeche_pro_wohnung` are correlated.

Plot **for the year 2021**: 
- `einkommenssteuer_pro_veranlagung` on the *x*-axis and  
- `flaeche_pro_wohnung` on the *y*-axis.
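
A minimal sketch of a pandas scatter plot on toy data. The column names `income_tax` and `flat_size` and the values are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Toy data with hypothetical column names and values
toy = pd.DataFrame({
    "income_tax": [5.0, 8.0, 12.0, 20.0],
    "flat_size": [60.0, 70.0, 85.0, 110.0],
})

# DataFrame.plot.scatter returns a matplotlib Axes object
ax = toy.plot.scatter(x="income_tax", y="flat_size")
ax.set_title("Toy example")
```

The returned `Axes` object can be customized further, for example with axis labels or a title.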


<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: Question 2a      # (required) the name of the question
manual: true     # whether this is a manually-graded question
points: 2     # how many points this question is worth; defaults to 1 internally
check_cell: false  # whether to include a check cell after this question (for autograded questions only)
-->

In [None]:
class Question2a:
    corr_plot = ...  # your scatter plot
    display(corr_plot)

In [None]:
grader.check("Question 2a")

<!-- END QUESTION -->



### 2b) Correlation coefficient (1 point)
What is the correlation coefficient between income tax and apartment size in Basel for 2021? Assign this number (float) to the variable called `corr_coeff`.

*Hint:* `pandas.DataFrame` has a method called `.corr()`
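
A short sketch of how `.corr()` behaves, using toy columns `x` and `y` with made-up values. On a dataframe it returns the full correlation matrix (Pearson by default), so a single coefficient has to be picked out of it; on a series it returns the coefficient directly.

```python
import pandas as pd

# Toy data with hypothetical column names and values
toy = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 5, 4, 5],
})

# Full correlation matrix (Pearson by default)
corr_matrix = toy.corr()

# Single coefficient: pick one off-diagonal entry ...
r_from_matrix = corr_matrix.loc["x", "y"]
# ... or correlate two columns directly
r_direct = toy["x"].corr(toy["y"])

print(r_from_matrix, r_direct)
```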

In [None]:
class Question2b:
    corr_coeff = ...
    print("correlation coefficient: ", corr_coeff)

In [None]:
grader.check("Question 2b")

### 2c) Plot another correlation (1 point)
Make a scatter plot **using seaborn** that shows you how living space per person `wohnflaeche_pro_person` and apartment size `flaeche_pro_wohnung` are correlated **for the year 2021.**

**General Remark:** If you have to plot something on the *x*- and *y*-axes, always ask yourself: *Which variable depends on the other?*

<!-- BEGIN QUESTION -->

<!--
BEGIN QUESTION
name: Question 2c      # (required) the name of the question
manual: true     # whether this is a manually-graded question
points: 1    # how many points this question is worth; defaults to 1 internally
check_cell: false  # whether to include a check cell after this question (for autograded questions only)
-->

In [None]:
class Question2c:
    corr_plot = ...
    display(corr_plot)

In [None]:
grader.check("Question 2c")

<!-- END QUESTION -->



What do you notice in this plot? What kind of correlation do you observe? What could be problematic about this correlation?
Write your answer in the cell below as a comment (use #).

In [None]:
# WRITE DOWN YOUR OBSERVATIONS

### 2d) Regression (1 point)

Compute the **slope** `alpha` and **intercept** `beta` of the regression line that relates income tax (predictor) to apartment size (response).

Use the data from the year 2021 (`Question1b.bs_2021`), and plot the data together with the regression line.


**Reminder:** The regression line is defined as $f(x) = \alpha x + \beta$, with
$$
\begin{align*}
& \alpha = \text{corr}(x, y) \cdot \frac{\sigma_y}{\sigma_x}, \\
& \beta = \bar{y} - \alpha \cdot \bar{x}, \\
& x = \text{tax}, \quad y = \text{apartment size}, \\
& \bar{x}, \bar{y} = \text{arithmetic means}, \quad \sigma_x, \sigma_y = \text{standard deviations}
\end{align*}
$$

*Hints:*
- Make use of the following functions in pandas:  
    - `mean()`
    - `std()`
    - `corr()`
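
The formulas above can be tried out on toy data first. The series `x` and `y` and their values are made up. The comparison with `np.polyfit` is only a sanity check and not part of the exercise.

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical predictor
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])  # hypothetical response

# Slope and intercept from the formulas above
toy_alpha = x.corr(y) * y.std() / x.std()
toy_beta = y.mean() - toy_alpha * x.mean()

# Cross-check against a least-squares fit of a degree-1 polynomial
slope_check, intercept_check = np.polyfit(x.to_numpy(), y.to_numpy(), deg=1)

print(toy_alpha, toy_beta)
```

For these toy values the slope is 0.6 and the intercept is 2.2, and the least-squares fit agrees up to floating-point error.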

In [None]:
class Question2d:

    mean_income_tax = ...
    stddev_income_tax = ...

    mean_apartment_size = ...
    stddev_apartment_size = ...
  
    corr = ...

    alpha = ...
    beta = ...

    ...

In [None]:
grader.check("Question 2d")

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [None]:
grader.check_all()