# Final exam
# YourUserID: xxxxxxxx
## General 

The instructions for the final exam for DS2000B/IS2002B are included in this Jupyter Notebook. 
Some basic rules: 
- You are allowed to use any document and source on your computer and look up documents on the internet. 
- You are not allowed to share documents or communicate in any other way with other people about the final during the 6-hour period after the start of the final (2 pm - 8 pm, April 23rd). 
- You are only allowed to use the packages listed under "preliminaries" - the use of other regression or machine learning toolboxes is not permitted. 
- All the code you are using from previous Assignments or Labs needs to be included in the notebook. 
- Most questions also require some written answer. The answer to these questions should be given in full English sentences. 
- All Figures should be appropriately labeled and should have a figure caption. 
- The Final exam needs to be submitted on OWL (Assignments->Final) before 6 pm (unless you have an extension).

## Preliminaries

In [1]:
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt 
%matplotlib inline
import scipy.stats as ss
import pandas as pd 
import numpy as np
import scipy.optimize as so
import seaborn as sns

In [4]:
# import functions


## Data set

The data frame contains the total number of confirmed Covid-19 infections, vaccination data, and number of deaths as of March 8th, 2022, across 79 countries. It also contains some basic socio-economic data from these countries. 
- continent: Asia, Europe, Africa, Oceania, North America, South America. 
- country: Name of the country 
- total_cases: total number of COVID cases by March 8th, 2022
- total_deaths: total number of COVID-19 deaths by March 8th, 2022
- total_vaccinations: total number of vaccines 
- people_fully_vaccinated: number of people who had two doses of vaccine 
- population: Population size 
- median_age: Median Age of the population 
- gdp_per_capita: Gross Domestic Product (GDP) per capita in US Dollars
- diabetes_prevalence: share of the population with diabetes
- hospital_beds: Number of Hospital beds per 1000 inhabitants 
- life_expectancy: life expectancy in years
- HDI: Human Development Index. A composite index combining life expectancy, years of education, and GDP per capita.

## Task 1: Vaccination rate (40 pts)
In this task we are looking at vaccination rate across different countries.

### Question 1 (3 pts) 
Load the data set 'covid19_2022.csv'. How many observations and variables do you have? Show the first few rows.

### Question 2  (6 pts)
To calculate the vaccination rate, we need to normalize the number of fully vaccinated people ('people_fully_vaccinated') by the population of the country ('population'). Make a new column in the data frame, called 'vax_rate', that codes for the vaccination rate per capita.
Plot a histogram of the distribution using the appropriate number of bins.

Written answer: How would you characterize the distribution in terms of modality and skew?
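The normalization described above can be sketched on a toy data frame. This is only an illustration under stated assumptions: the numbers are made up (they are not from 'covid19_2022.csv'), the name `toy` is hypothetical, and the square-root rule is just one common rule of thumb for choosing the number of bins.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data frame; all numbers are made up for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "people_fully_vaccinated": [50.0, 30.0, 80.0, 10.0],
    "population": [100.0, 120.0, 100.0, 200.0],
})

# Per-capita vaccination rate: fully vaccinated people divided by population.
toy["vax_rate"] = toy["people_fully_vaccinated"] / toy["population"]

# Square-root rule for the bin count (an assumption; other rules are equally defensible).
n_bins = int(np.ceil(np.sqrt(len(toy))))

# Bin the rates without plotting; plt.hist(toy["vax_rate"], bins=n_bins) would draw the same bins.
counts, edges = np.histogram(toy["vax_rate"], bins=n_bins)

print(toy[["country", "vax_rate"]])
print(n_bins, counts)
```

With the real data, the same division applies column-wise to the full data frame, and the bin count should be judged against the 79 observations.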

### Question 3 (5 pts)
Let's look at how to explain the differences in the vaccination rates. 

First, let's consider the Human Development Index (HDI) that combines life expectancy, years of education, and GDP per capita. Make a scatterplot of HDI (x-axis) and Vaccination Rate (y-axis). 

Written answer: What relationship do you observe? Give at least 2 possible reasons that could explain such a relationship. 


### Question 4 (4 pts)
Fit a simple linear regression model, explaining vaccination rate by the human development index (HDI). 
Report the R2 value of the fit. 

Then plot the data and the prediction line. Don't forget the Figure caption. 
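As a reminder of what the R2 value measures, here is a minimal sketch of a least-squares line fit on synthetic data. The helper name `simple_reg_fit` and the simulated `x` and `y` are assumptions for illustration only; they are not the course's own functions or the exam data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.4, 1.0, size=50)              # synthetic stand-in for HDI
y = -0.3 + 1.0 * x + rng.normal(0, 0.05, 50)    # synthetic stand-in for the vaccination rate

def simple_reg_fit(x, y):
    """Least-squares fit of y = b0 + b1 * x. Returns (b0, b1, R2)."""
    X = np.column_stack([np.ones_like(x), x])   # design matrix with an intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    rss = np.sum(resid ** 2)                    # residual sum of squares
    tss = np.sum((y - y.mean()) ** 2)           # total sum of squares
    return b[0], b[1], 1.0 - rss / tss

b0, b1, r2 = simple_reg_fit(x, y)
print(f"intercept={b0:.3f}, slope={b1:.3f}, R2={r2:.3f}")
```

The prediction line can then be drawn by evaluating `b0 + b1 * x_grid` on a grid of x values and overlaying it on the scatterplot.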

### Question 5 (4 pts)
What else could explain differences in vaccination rates? We know that vaccination access varies a lot in different regions. Use a boxplot to visualize vaccination rate (y axis) for different continents (x axis).

Written answer: what do you observe?

### Question 6 (10 pts)
Let's investigate whether the relationship between HDI and vaccination rate still holds when we account for the geographical region. 
* First, make a dummy-coded variable that indicates whether the country is in Africa or not (using the variable 'continent')
* Second, restrict the data to African and European countries. *Hint: If you want a vector of True/False values that indicates whether a column in the data frame (D.col) equals A OR B, you can use  `np.logical_or(D.col==A,D.col==B)`*
* Using this restricted data, fit a regression model, using the dummy-coded variable (Africa vs Europe) and HDI as regressors. 
* Make a scatterplot of HDI against vaccination rates, with different color dots for European and African countries. In the same plot, add separate regression lines for Africa and Europe from the model that you fit.
* Written answer: describe the resulting plot. How do the two regression lines differ? How would you interpret this?
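The steps in the list above can be sketched on a small toy data frame. Everything here is an assumption for illustration: the values are made up, and the names `toy`, `sub`, and `africa` are hypothetical. Note that a model with a dummy and a shared HDI slope yields two parallel lines that differ only in their intercept.

```python
import numpy as np
import pandas as pd

# Made-up toy data; not taken from covid19_2022.csv.
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Europe", "Europe", "Europe", "Asia"],
    "HDI":       [0.45, 0.55, 0.65, 0.80, 0.85, 0.95, 0.70],
    "vax_rate":  [0.10, 0.18, 0.22, 0.65, 0.70, 0.78, 0.50],
})

# Step 1: dummy-coded variable, 1 for African countries and 0 otherwise.
toy["africa"] = (toy.continent == "Africa").astype(float)

# Step 2: restrict to African and European countries.
keep = np.logical_or(toy.continent == "Africa", toy.continent == "Europe")
sub = toy[keep].copy()

# Step 3: regression with intercept, dummy, and HDI as regressors.
X = np.column_stack([np.ones(len(sub)), sub["africa"].to_numpy(), sub["HDI"].to_numpy()])
b, *_ = np.linalg.lstsq(X, sub["vax_rate"].to_numpy(), rcond=None)

# Step 4: predicted lines for each group over the observed HDI range.
hdi_grid = np.linspace(sub["HDI"].min(), sub["HDI"].max(), 5)
line_europe = b[0] + b[2] * hdi_grid          # dummy = 0
line_africa = b[0] + b[1] + b[2] * hdi_grid   # dummy = 1 shifts the intercept by b[1]

print(b)
```

The two arrays `line_europe` and `line_africa` can be passed to `plt.plot(hdi_grid, ...)` on top of a scatterplot colored by continent.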

### Question 7 (8 pts)
Use Bootstrap analysis (1000 iterations) to determine whether the relationship between Human Development Index and vaccination rate is significant when we account for whether the country is European or African. Again, limit the analysis to European and African countries only. Report the confidence intervals for the influence of continent and HDI separately. 

Written answer: What do you conclude from the bootstrap in terms of the significance of a relationship between HDI and vaccination rate? What claims can you make? 
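A percentile bootstrap for regression coefficients can be sketched as follows. This is a minimal illustration on simulated data, not the exam data: the variables `africa`, `hdi`, and `vax`, the helper `fit`, and the choice of a 95% percentile interval are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40

# Simulated stand-ins: a 0/1 continent dummy, an HDI-like regressor, and a response.
africa = (np.arange(n) < n // 2).astype(float)
hdi = np.where(africa == 1, rng.uniform(0.4, 0.7, n), rng.uniform(0.7, 0.95, n))
vax = 0.2 - 0.3 * africa + 0.6 * hdi + rng.normal(0, 0.05, n)

def fit(X, y):
    """Ordinary least-squares coefficients for design matrix X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b

X = np.column_stack([np.ones(n), africa, hdi])

n_boot = 1000
boot = np.empty((n_boot, X.shape[1]))
for i in range(n_boot):
    idx = rng.integers(0, n, size=n)      # resample rows (countries) with replacement
    boot[i] = fit(X[idx], vax[idx])       # refit the model on the resampled data

# 95% percentile confidence interval for each coefficient (intercept, africa, hdi).
ci_low, ci_high = np.percentile(boot, [2.5, 97.5], axis=0)
for name, lo, hi in zip(["intercept", "africa", "hdi"], ci_low, ci_high):
    print(f"{name}: [{lo:.3f}, {hi:.3f}]")
```

A coefficient whose interval excludes zero is typically reported as significant at the corresponding level; whether that supports a causal claim is a separate question.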

## Task 2: Mortality rate (40 pts)
### Question 1 (3 pts) 
Define a new column in your dataframe called 'Mortality' which corresponds to the mortality rate. The mortality rate of a disease is defined as the probability that a patient will die, given that he or she contracted the disease (total number of deaths divided by total number of positive cases). Make a histogram of mortality, show the median as a vertical line, and don't forget axis labels. How do you describe the distribution in terms of symmetry and skew? 

### Question 2  (6 pts)
Why may TotalDeaths/TotalCases not be a good estimator for the true mortality rate? See the definition of Mortality rate in question 1. Name at least 1 factor that could make your estimate lower than the true mortality rate, and 1 factor that would make the estimate higher than the true mortality rate. For each factor you describe, explain what data a research team could realistically acquire within a month of work to make the estimate of mortality better.

### Question 3 (5 pts) 
Questions 3-6 look at the influence of human development index (HDI) and number of hospital beds per 1000 people ('hospital_beds') on mortality rate. First, run a multiple regression analysis of mortality (response variable) as a function of HDI (with figure, don't forget to label the axes). 

Written answer: How would you describe the relationship between HDI and mortality? Write down the regression equation. Report the R2 value and the slope value for HDI. What do the R2 and the slope values mean? 

### Question 4 (4 pts)
Run a multiple regression analysis of mortality as a function of number of hospital beds (with figure).  Again, report the R2 value and the slope for hospital beds. Based on the slope, does the number of hospital beds have an influence on mortality? 

### Question 5  (4 pts)
Now run a multiple regression analysis of mortality with both HDI and hospital beds as predictors. How has R2 changed from the answer you obtained in Question 3? Based on the new slope, what can you conclude about the influence of HDI on mortality (after controlling for the number of hospital beds)? 

### Question 6 (6 pts)
Use leave-one-out cross-validation and forward stepwise regression to build the best predictive model of mortality using as candidates the five explanatory variables: HDI, diabetes prevalence, median age, hospital beds, and vaccination rate.

Show all steps of the step-wise regression explicitly and comment on the decisions you make in the process (and why you made them). Report the formula for the best predictive model that you found. 
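For orientation, the general shape of forward stepwise selection scored by leave-one-out cross-validation is sketched below on synthetic data. The function names `loo_cv_r2` and `forward_stepwise`, the predictor names `x1` to `x3`, and the stopping rule (stop when the cross-validated R2 no longer improves) are assumptions for illustration, not the course's reference implementation.

```python
import numpy as np

def loo_cv_r2(X, y):
    """Leave-one-out cross-validated R2 of a linear model; X must include the intercept column."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        train = np.arange(n) != i                       # leave observation i out
        b, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred[i] = X[i] @ b                              # predict the held-out point
    return 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)

def forward_stepwise(candidates, y):
    """candidates: dict name -> 1-D array. Greedily add the predictor that most improves CV R2."""
    n = len(y)
    selected = []
    best = loo_cv_r2(np.ones((n, 1)), y)                # intercept-only baseline
    history = [(tuple(selected), best)]
    while len(selected) < len(candidates):
        scores = {}
        for name in candidates:
            if name in selected:
                continue
            cols = [np.ones(n)] + [candidates[k] for k in selected + [name]]
            scores[name] = loo_cv_r2(np.column_stack(cols), y)
        name, score = max(scores.items(), key=lambda kv: kv[1])
        if score <= best:                               # no candidate improves prediction: stop
            break
        selected.append(name)
        best = score
        history.append((tuple(selected), best))
    return selected, best, history

rng = np.random.default_rng(2)
n = 60
cands = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 1.0 + 2.0 * cands["x1"] - 1.0 * cands["x2"] + rng.normal(0, 0.5, n)

selected, best, history = forward_stepwise(cands, y)
for step in history:
    print(step)
```

Printing `history` shows each step of the selection, which is the kind of explicit trace the question asks you to comment on.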

### Question 7 (2 pts)
Let's now look at mortality rate by continent. Make a boxplot of continent on the x-axis and mortality on the y-axis. Which continent has the highest overall mortality?  

### Question 8  (5 pts)
Create a set of dummy variables that together code the continents. Set Oceania to be your comparison group. Run a multiple regression model of mortality with the dummy variables as explanatory variables. Report the intercept and slope values. What do the intercept and slope values mean? 
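A small sketch of dummy coding with a comparison group, using made-up numbers and only three continents for brevity. The names `toy`, `dummies`, and `coefs` are hypothetical. Dropping the Oceania column makes it the reference: the intercept then equals the Oceania mean, and each slope is the difference between that continent's mean and the Oceania mean.

```python
import numpy as np
import pandas as pd

# Made-up toy data; not taken from covid19_2022.csv.
toy = pd.DataFrame({
    "continent": ["Oceania", "Oceania", "Asia", "Asia", "Europe", "Europe"],
    "Mortality": [0.010, 0.012, 0.020, 0.024, 0.015, 0.017],
})

# One 0/1 column per continent, then drop the comparison group.
dummies = pd.get_dummies(toy["continent"], dtype=float)
dummies = dummies.drop(columns="Oceania")

# Regression with an intercept plus the remaining dummy columns.
X = np.column_stack([np.ones(len(toy)), dummies.to_numpy()])
b, *_ = np.linalg.lstsq(X, toy["Mortality"].to_numpy(), rcond=None)

coefs = dict(zip(["intercept"] + list(dummies.columns), b))
group_means = toy.groupby("continent")["Mortality"].mean()

print(coefs)
print(group_means)
```

Comparing `coefs` with `group_means` makes the interpretation of the intercept and slopes concrete.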

### Question 9  (5 pts)
Repeat the same model but this time add human development index (HDI) as one of the regressors. Report R2. How does the R2 change? Is it meaningful?

Now use cross-validation to compare the following two models:
1. mortality as a function of continents (previous question)
2. mortality as a function of continents and HDI

Which model is better? How do you explain changes in R2 with the addition of HDI? What do you conclude about the effect of the continent on mortality?

## Task 3: Logistic regression (20 pts)
### Question 1 (5 pts) 
Generate a new column in the data frame called high_gdp that assigns a value of 1 to countries with a GDP per capita ('gdp_per_capita') larger than 15000, and 0 to the rest. Make a scatter plot (with labeled axes, and caption) that shows life expectancy on the x-axis and high_gdp on the y axis. Written answer: what do you observe?

### Question 2 (5 pts) 
Fit a logistic regression model that predicts whether a country has high or low GDP based on its life expectancy. Make sure to return a figure of the results (complete with axes and caption). Print out the optimal parameters and interpret your result (negative log likelihood). Is it meaningful? 
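One way to fit a logistic regression with the permitted packages is to minimize the negative log-likelihood with `scipy.optimize.minimize`. The sketch below uses simulated data; the names `neg_log_lik`, `life`, and `high`, the centering of the predictor, and the choice of the BFGS method are assumptions for illustration.

```python
import numpy as np
import scipy.optimize as so

def neg_log_lik(b, X, y):
    """Negative Bernoulli log-likelihood of a logistic model with linear predictor X @ b."""
    z = X @ b
    # log(1 + exp(z)) is computed stably with logaddexp(0, z).
    return np.sum(np.logaddexp(0.0, z) - y * z)

rng = np.random.default_rng(3)
n = 200
life = rng.uniform(55, 85, n)                              # simulated life expectancy
p_true = 1.0 / (1.0 + np.exp(-(-21.0 + 0.3 * life)))       # simulated true probability
high = (rng.uniform(size=n) < p_true).astype(float)        # simulated 0/1 outcome

# Centering the predictor tends to make the optimization better conditioned.
x_c = life - life.mean()
X = np.column_stack([np.ones(n), x_c])

res = so.minimize(neg_log_lik, x0=np.zeros(2), args=(X, high), method="BFGS")
b_hat = res.x

# Predicted probability curve, e.g. for overlaying on the scatterplot.
grid = np.linspace(life.min(), life.max(), 100)
p_hat = 1.0 / (1.0 + np.exp(-(b_hat[0] + b_hat[1] * (grid - life.mean()))))

print("parameters:", b_hat)
print("negative log-likelihood:", res.fun)
```

A useful reference point for the negative log-likelihood is `n * log(2)`, the value obtained when every prediction is 0.5.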

### Question 3 (4 pts) 
Now try a different logistic regression model that uses both life expectancy and mortality to predict whether a country has high GDP or not. How does the model compare to the previous one? Explain whether we can interpret the difference between the two models and motivate why.

### Question 4 (6 pts)
Using leave-one-out cross-validation, compare first the simple intercept model (b0 only) to the model that also uses life expectancy as explanatory variable for high GDP (on top of the intercept).

Next, compare this model (intercept + life expectancy) to one that additionally includes the human development index (HDI).

For each comparison, calculate the difference in cross-validated log-likelihood and answer the following questions: which one is the best model? How confident are you in your result?
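The comparison described above can be sketched as a leave-one-out loop that refits the logistic model without one observation and scores the held-out point by its log-likelihood. The data are simulated, and the names `neg_log_lik` and `loo_log_lik` are hypothetical helpers, not the course's own functions.

```python
import numpy as np
import scipy.optimize as so

def neg_log_lik(b, X, y):
    """Negative Bernoulli log-likelihood of a logistic model with linear predictor X @ b."""
    z = X @ b
    return np.sum(np.logaddexp(0.0, z) - y * z)

def loo_log_lik(X, y):
    """Sum of held-out log-likelihoods under leave-one-out cross-validation."""
    n = len(y)
    total = 0.0
    for i in range(n):
        train = np.arange(n) != i
        res = so.minimize(neg_log_lik, np.zeros(X.shape[1]),
                          args=(X[train], y[train]), method="BFGS")
        # Log-likelihood of the single held-out observation under the refitted model.
        total -= neg_log_lik(res.x, X[i:i + 1], y[i:i + 1])
    return total

rng = np.random.default_rng(4)
n = 60
x = rng.normal(size=n)                                              # simulated predictor
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-2.0 * x))).astype(float)

X0 = np.ones((n, 1))                        # intercept-only model
X1 = np.column_stack([np.ones(n), x])       # intercept + predictor

ll0 = loo_log_lik(X0, y)
ll1 = loo_log_lik(X1, y)
diff = ll1 - ll0

print(f"CV log-likelihood: intercept only = {ll0:.2f}, with predictor = {ll1:.2f}")
print(f"difference = {diff:.2f}")
```

A positive difference favors the larger model; how large a difference counts as convincing is a judgment the written answer should justify.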