# Exercises course 1 - Data manipulation, representation and statistics
-------------------------------------------------------------------------------------------------------

This notebook contains exercises for Intermediate Python Course 1 - data manipulation, representation and statistics.

<br>
<br>

# Notebook 1 - Data manipulation
-------------------------------------------------

## Exercise 1.1

Load the 1880 Swiss census dataset located in `data/swiss_census_1880.csv`.
The main columns of this dataset are:

 * **town name**
 * **Total** : total number of inhabitants in the town
 * **canton** : 2-letter code of the town's canton
 * `Swiss`, `Foreigner`, `Male`, `Female`, `0-14 y.o.`, `15-59 y.o.`, `60+ y.o.`, `Reformed`,
   `Catholic`, `Other`, `German speakers`, `French speakers`, `Italian speakers`, `Romansch speakers`,
   `Non-national tongue speakers`: number of inhabitants of the town belonging to that category



Perform the following tasks:

1. Load the 1880 census dataset as a pandas DataFrame.
2. Subset the DataFrame to keep only the following columns:  
   `"town name", "Total", "German speakers", "French speakers", "Italian speakers",
   "Romansch speakers", "canton"`.
3. How many people live in the least populated town?
4. What fraction of the population lives in a town with more than 1'000 inhabitants?
   **Hint:** to make things easier, you can first compute how many people live in towns with more
   than 1'000 inhabitants.
5. **If you have the time:** how many towns have more than 50% of their population speaking French?
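
The hint about towns with more than 1'000 inhabitants relies on boolean masking. Here is a minimal sketch on a made-up three-town table; the town names and numbers are invented for illustration and are not taken from the census file.

```python
import pandas as pd

# Toy data, invented for illustration (not from the census file).
toy = pd.DataFrame({
    "town name": ["A", "B", "C"],
    "Total": [500, 1500, 3000],
})

# Boolean mask: True for towns with more than 1000 inhabitants.
is_large = toy["Total"] > 1000

# Number of people living in those towns, then as a fraction of everyone.
people_in_large_towns = toy.loc[is_large, "Total"].sum()
fraction = people_in_large_towns / toy["Total"].sum()
print(people_in_large_towns, fraction)
```

The same mask-then-sum pattern should carry over to the real DataFrame once it is loaded.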


<br>

**Solution:**

1. Load the data as a pandas DataFrame.

In [None]:
# %load -r -7 solutions/solution_11.py

<br>

2. Subset the DataFrame to keep only certain columns.

In [None]:
# %load -r 9-22 solutions/solution_11.py

<br>

3. How many people live in the least populated town?

In [None]:
# %load -r 24-31 solutions/solution_11.py

<br>

4. What fraction of the population lives in a town with more than 1'000 inhabitants?
    *Hint:* as a first step, compute how many people live in towns with more than 1'000 inhabitants.

In [None]:
# %load -r 33-44 solutions/solution_11.py

<br>

5. **If you have the time:** how many towns have more than 50% of their population speaking French?

In [None]:
# %load -r 46- solutions/solution_11.py

<br>
<br>

## Additional exercise 1.2

In notebook 01 of this course, we have seen how to apply a custom function to a DataFrame column using the `.map()` method.

In this exercise, your task is to **write your own implementation of the `.mean()`** method of a DataFrame, then apply it to the `Age` and `Fare` columns of the Titanic dataset. Proceed as follows:

1. Load the Titanic dataset (`data/titanic.csv`) as a pandas DataFrame.
2. Write your own implementation of a "mean" function (let's call it `custom_mean()`), which computes
   the mean value of a sequence of values.  
   **Important:** your implementation should be able to skip `NA` values.  
   **Hint:** to detect whether a value is `NA` or not, you can use `math.isnan(x)` from the `math` module.
   <br>
   
3. Apply your `custom_mean()` function to the `Age` and `Fare` columns of the Titanic dataset using
   the `.apply()` method.  
4. Compare your results to those of the built-in `DataFrame.mean()` method provided by pandas.
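
To illustrate the `math.isnan()` hint without giving away the full solution, here is a small sketch on an invented list of values. It shows why an equality test cannot detect `NaN`, and how the values that are present can be kept.

```python
import math

# Invented values for illustration; one of them is missing (NaN).
values = [22.0, float("nan"), 38.0]

# NaN never compares equal to anything, itself included,
# so an equality test cannot detect it.
nan = float("nan")
print(nan == nan)        # False
print(math.isnan(nan))   # True

# Keep only the values that are not NaN.
valid = [x for x in values if not math.isnan(x)]
print(valid)             # [22.0, 38.0]
```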


<br>

### Solution:
Uncomment and run the cell below to show the solution.

In [None]:
# %load solutions/solution_12.py


<br>
<br>
<br>

## Additional exercise 1.3

Using the Titanic dataset located in `data/titanic.csv`, perform the following tasks:

1. Load the data as a pandas DataFrame.
2. Select the passengers that survived. How many are males/females?
3. Create a new column `Title` in the `DataFrame` representing the title by which passengers should
   be addressed. The title can be found in the passenger name and is the only word ending with a `.`
   
   **Hint for part 3:** there is no *easy, one-line* answer. Create a function to extract the
   title from the name and work your way from there.
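
One possible starting point for the hint above is sketched here. The function name `extract_title` and the example name are hypothetical; the sketch assumes that names are made of whitespace-separated words.

```python
def extract_title(name):
    """Return the first word of `name` that ends with a period, or None."""
    for word in name.split():
        if word.endswith("."):
            return word
    return None


# Invented example name, written in a "Surname, Title. First name" style.
print(extract_title("Doe, Mr. John"))  # Mr.
```

A function like this can then be applied to every passenger name with `.map()`, as seen in notebook 01.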
   

<br>

### Solution:
Uncomment and run the cells below to show the solution.

1. Load the data as a pandas DataFrame.

In [None]:
# %load -r 1-5 solutions/solution_13.py

<br>

2. Select the passengers who survived. How many are males/females?

In [None]:
# %load -r 6-22 solutions/solution_13.py

In [None]:
# %load -r 23-27 solutions/solution_13.py

<br>

3. Create a new column named "Title" in the DataFrame, representing the title by which
   passengers should be addressed.

In [None]:
# %load -r 28- solutions/solution_13.py

<br>
<br>
<br>


# Notebook 2 - Data description and representation
---------------------------------------------------------------------------

## Exercise 2.1 - Histograms

Using the Titanic dataset from `data/titanic.csv`:

1. Plot the `Age` distribution among first class passengers. Try to choose an appropriate mode of 
   representation (histogram? density line? number of bins?).
2. Make a figure with 3 panels. In the panels, plot the histogram of the `Fare` among passengers in
   the first, second, and third class, respectively.

<br>

To load the dataset and the different modules that you will need, you can copy/paste the following:
```py
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Load the data as DataFrame.
df = pd.read_csv("data/titanic.csv")
```

<br>

### Solution:
Uncomment and run the cells below to show the solution.

1. Plot the Age distribution among first class passengers.

In [None]:
# %load -r 1-16 solutions/solution_21.py

<br>

2. Make a figure with 3 panels. 

In [None]:
# %load -r 17- solutions/solution_21.py

<br>
<br>
<br>

## Exercise 2.2 - Representing categories

Using the 1880 Swiss census data found in the file `data/swiss_census_1880.csv`:

1. Compute a new column `fraction60+` representing the fraction of people aged 60 and over in each town.  
   **Hint:** the column `60+ y.o.` contains the number of inhabitants aged 60 and over. The column `Total` contains
    the total number of inhabitants.
    
2. Graphically represent the proportion of people over 60 years old (`60+ y.o.`) across all cantons.
   Choose the most appropriate kind of plot.

<br>

### Solution:
Uncomment and run the cells below to show the solution.

1. Compute a new column `fraction60+`.

In [None]:
# %load -r 1-12 solutions/solution_22.py

<br>

2. Graphically represent the proportion of people over 60 years old.

In [None]:
# %load -r 14- solutions/solution_22.py

<br>

Of course, the possibilities are endless. Here is **a fancier solution**, inspired by [this seaborn ridge plot example](https://seaborn.pydata.org/examples/kde_ridgeplot.html) and its [later correction](https://www.pythonfixing.com/2022/02/fixed-python-seaborn-ridge-plot.html).

In [None]:
# %load solutions/solution_22_fancy.py

<br>
<br>
<br>

## Exercise 2.3 - Free-form exercise

The goal of this exercise is to **perform an exploration of a dataset** related to heart disease.
* In particular, we want to explore the relationship between the `target` variable (whether the patient
  has heart disease or not) and several other variables such as cholesterol level, age, etc.
* The data is present in the file `'data/heartData_simplified.csv'`, which is a cleaned and simplified
  version of the [UCI heart disease data set](https://archive.ics.uci.edu/ml/datasets/heart+Disease).


### Description of the dataset columns

* `age`: Patient age in years
* `sex`: Patient sex
* `chol`: Cholesterol level in mg/dl. 
* `thalach`: Maximum heart rate during the stress test
* `oldpeak`: ST segment depression induced by exercise, relative to rest.
* `ca`: Number of main blood vessels coloured by the radioactive dye. The number ranges from 0 to 3.
* `thal`: Results of the blood flow observed via the radioactive dye.
	* `defect` -> fixed defect (no blood flow in some part of the heart)
	* `normal` -> normal blood flow
	* `reversible` -> reversible defect (blood flow is observed but it is not normal)
* `target`: Whether the patient has heart disease or not


### Instructions

As stated earlier, your goal is to explore this dataset. 
One objective is to diagnose potential problems in the dataset (outliers, strange values) and to prepare for further statistical analysis and reporting.

To this end, formulate a number of hypotheses that would be interesting to pursue with this data (*e.g.*, is heart disease linked to cholesterol levels?), and gather evidence (plots, summary statistics) showing why each hypothesis seems worth testing.

> Note: we do not ask you to perform the statistical testing itself, but you can do it if you feel like it.

We will not provide a precise set of questions, but here are a few checkpoints to help you get started:
* Read the data as a pandas `DataFrame`.
* Compute summary statistics for the different variables.
* Optionally, do the same for different subsets of the data (for instance, grouping by sex).
* Use visualization to help you describe the relationship between the different variables.
* Choose a few associations (2 to 4) that seem promising and describe them.
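
As an illustration of the "summary statistics by subset" checkpoint, here is a minimal sketch on invented data. The column names follow the description above, but the values (including how `sex` is encoded) are assumptions and may differ from the actual file.

```python
import pandas as pd

# Toy data, invented for illustration; the real file may encode `sex` differently.
toy = pd.DataFrame({
    "sex": ["male", "female", "male", "female"],
    "chol": [200, 240, 260, 220],
})

# Summary statistics of cholesterol, computed separately for each sex.
summary = toy.groupby("sex")["chol"].agg(["mean", "min", "max"])
print(summary)
```

Replacing `.agg([...])` with `.describe()` gives a fuller set of statistics for each group.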


<br>
<br>
<br>


# Extra Notebook - Statistics with python
------------------------------------------------------

## Exercise 3.1

Load the Swiss census data from 1880 (available in the file `data/census1880_fractions.csv`) and display its first few rows. For each town in Switzerland (one row of the table), it gives the majority religion and the majority language.

* Test the association between the majority religion (`majority religion`) and majority language (`majority language`).
* **Hint**: to create a contingency table between `colA` and `colB` in `df`:
  `table = pd.crosstab(df.colA, df.colB)`
* **Additional task, if you have time:**
    * How could you make Fisher's test work here?
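
To make the contingency table hint concrete, here is a small sketch on an invented five-row table. The column names `colA` and `colB` and their values are placeholders, not the census columns.

```python
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    "colA": ["x", "x", "y", "y", "y"],
    "colB": ["u", "v", "u", "u", "v"],
})

# Rows are the categories of colA, columns the categories of colB,
# and each cell counts the rows of `toy` with that combination.
table = pd.crosstab(toy["colA"], toy["colB"])
print(table)
```

Bracket notation is used here because column names that contain spaces, such as `majority religion`, cannot be accessed with the `df.colA` attribute syntax.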

<br>

### Solution:
Uncomment and run the cells below to show the solution.

1. Creating the contingency table.

In [None]:
# %load -r 1-13 solutions/solution_31.py

<br>

2. Test the association between the majority religion and the majority language.

In [None]:
# %load -r 14-21 solutions/solution_31.py

<br>

3. Additional task: using Fisher's exact test.

In [None]:
# %load -r 22- solutions/solution_31.py

<br>
<br>
<br>

## Exercise 3.2 - Free-form exercise

Continuing the free-form exercise from the previous notebook (exercise 2.3), take the few associations (2 to 4) that you described there and see how you could test or model them.

* The data is in the file `data/heartData_simplified.csv`.
* Here is a reminder of the description of the dataset columns (see also exercise 2.3):
    * `age`: Patient age in years
    * `sex`: Patient sex
    * `chol`: Cholesterol level in mg/dl. 
    * `thalach`: Maximum heart rate during the stress test
    * `oldpeak`: ST segment depression induced by exercise, relative to rest.
    * `ca`: Number of main blood vessels coloured by the radioactive dye. The number ranges from 0 to 3.
    * `thal`: Results of the blood flow observed via the radioactive dye.
        * `defect` -> fixed defect (no blood flow in some part of the heart)
        * `normal` -> normal blood flow
        * `reversible` -> reversible defect (blood flow is observed but it is not normal)
    * `target`: Whether the patient has heart disease or not

