# First Year Project - Project 1 - Corona and Weather

----
#### **Group 6(F)**: Bjørn Søvad (bjso@itu.dk), Katarina Kraljevic (katkr@itu.dk), Mirka Katuscáková (katu@itu.dk), Emma Cecilie Bjerring Jensen (emcj@itu.dk), Viggo Yann Unmack Gascou (viga@itu.dk)
----

## Required Libraries
* [Pandas Documentation](https://pandas.pydata.org/docs/)
* [Numpy Documentation](https://numpy.org/doc/)
* [Folium Documentation](https://python-visualization.github.io/folium/)
* [Json Documentation](https://docs.python.org/3/library/json.html)
* [Statsmodels Documentation](https://www.statsmodels.org/stable/)
* [Scipy Documentation](https://scipy.github.io/devdocs/index.html)

In [1]:
#Importing needed libraries
import pandas as pd                                                            # provides the main data structure pd.DataFrame() used to store the datasets
import numpy as np                                                             # used for numerical calculations and fast array manipulations
import folium                                                                  # used for spatial data visualizations
import json                                                                    # used for loading json data correctly
import statsmodels.api as sm                                                   # used to run multivariate linear regression
from scipy.stats import pearsonr, spearmanr                                    # used to run Pearson and Spearman correlation tests between two numerical variables
from statsmodels.stats.multitest import multipletests                          # used to correct p-values for multiple testing when several variables are tested

### Loading the raw data

In [41]:
#Importing the raw corona data from Germany
corona_df = pd.read_csv('../../data/raw/corona/de_corona.csv', sep = '\t')
corona_df.name = 'corona_df'

#Importing the raw weather data for Germany, the Netherlands, Sweden and Denmark
weather_df = pd.read_csv("../../data/raw/weather/weather.csv")
weather_df.name = 'weather_df'

#Loading in the metadata json using the Python json library
with open('../../data/raw/metadata/de_metadata.json','r', encoding="utf8") as f:
    country_metadata=json.load(f)

#Creating a folium map (called de_map) centered on Germany, with zooming and dragging disabled so the view stays fixed
de_map = folium.Map(location = [51.1657, 10.4515], zoom_start = 6, crs = 'EPSG3857', 
    zoom_control = False, scrollWheelZoom = False, dragging = False)

#Loading in the geojson that contains data for the regions and borders of Germany and adding it to the folium map
folium.GeoJson('../../data/raw/shapefiles/de.geojson', name = "geojson").add_to(de_map)
folium.LayerControl().add_to(de_map);


### Task 0: Data filtering and cleaning

The analysis in this notebook draws on four datasets:

> CSV: Corona (DE) - Contains the number of new infections and the number of new deaths per day, broken down by region in Germany.
>
> CSV: Weather - Contains several weather indicators for each region in Germany, Denmark, Sweden and the Netherlands for each day in the period `2020-02-13` to `2021-02-21`.
>
> JSON: Metadata (DE) - Contains additional information about the regions in Germany.
>
> GEOJSON: Geojson (DE) - Holds the region and border geometry for Germany.
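
The stated coverage of the weather file (date range and countries) can be checked once the data is loaded. Below is a minimal sketch on a small, made-up frame rather than the real `weather.csv`; it assumes the columns are named `date` and `iso3166-2`, and the helper name `summarize_coverage` is hypothetical.

```python
import pandas as pd

def summarize_coverage(weather: pd.DataFrame) -> dict:
    """Return the first/last date and the country prefixes found in a weather frame."""
    dates = pd.to_datetime(weather["date"])
    # The first two characters of an ISO 3166-2 code identify the country.
    countries = sorted(weather["iso3166-2"].str[:2].unique())
    return {
        "first_day": dates.min().date().isoformat(),
        "last_day": dates.max().date().isoformat(),
        "countries": countries,
    }

# Toy rows standing in for weather.csv (values are made up for illustration).
toy_weather = pd.DataFrame({
    "date": ["2020-02-13", "2021-02-21", "2020-06-01"],
    "iso3166-2": ["DE-BE", "DK-84", "NL-UT"],
})
coverage = summarize_coverage(toy_weather)
print(coverage)
```

Run against the real weather frame, the same helper would show whether the file actually spans the period and countries listed above.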

### Initial inspection of the datasets

In [21]:
weather_df.describe()

| | RelativeHumiditySurface | SolarRadiation | Surfacepressure | TemperatureAboveGround | Totalprecipitation | UVIndex | WindSpeed |
|---|---|---|---|---|---|---|---|
| count | 6000.0 | 6000.0 | 6000.0 | 6000.0 | 6000.0 | 6000.0 | 6000.0 |
| mean | 74.33212 | 6678336.0 | 2379588.0 | 10.131258 | 0.002206 | 16.0508 | 3.503221 |
| std | 13.595421 | 6212940.0 | 51211.44 | 7.14661 | 0.003439 | 14.515517 | 1.438837 |
| min | 33.880265 | 0.0 | 2212828.0 | -12.618286 | 0.0 | 0.0 | 1.091346 |
| 25% | 64.268213 | 1090176.0 | 2343795.0 | 4.68739 | 5.4e-05 | 2.500231 | 2.427226 |
| 50% | 76.469758 | 4610296.0 | 2385826.0 | 10.078646 | 0.000684 | 12.720154 | 3.202055 |
| 75% | 85.703146 | 11509290.0 | 2422283.0 | 15.722994 | 0.002907 | 27.398392 | 4.261427 |
| max | 98.264247 | 23708230.0 | 2497243.0 | 27.810922 | 0.031971 | 52.792235 | 11.221876 |


We can see here that the `weather` dataset contains 6,000 observations for each of its seven numerical weather indicators.

In [None]:
corona_df.head()

| | date | region_code | confirmed_addition | deceased_addition | iso3166-2 | population | cases_pc |
|---|---|---|---|---|---|---|---|
| 0 | 2020-01-02 | Nordrhein-Westfalen | 1 | 0 | DE-NW | 17932651 | 5.57642e-08 |
| 1 | 2020-01-07 | Nordrhein-Westfalen | 1 | 0 | DE-NW | 17932651 | 5.57642e-08 |
| 2 | 2020-01-09 | Nordrhein-Westfalen | 1 | 1 | DE-NW | 17932651 | 5.57642e-08 |
| 3 | 2020-01-12 | Nordrhein-Westfalen | 1 | 0 | DE-NW | 17932651 | 5.57642e-08 |
| 4 | 2020-01-14 | Nordrhein-Westfalen | 1 | 0 | DE-NW | 17932651 | 5.57642e-08 |


In [50]:
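#Checking both datasets for missing values
datasets = [weather_df, corona_df]
for dataset in datasets:
    #Printing a title based on the name that was assigned when the dataset was loaded
    print("Weather Dataset" if dataset.name == "weather_df" else "Corona Dataset")
    #Showing, for each column, whether it contains any missing values
    print(dataset.isnull().any())
    print("----------------------------------------------")
    #Summarizing whether any column in the dataset has a missing value
    print("There are no missing values in the dataset!" if not dataset.isnull().any().any()
            else "There are missing values in the dataset")
    print("______________________________________________")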


Weather Dataset
date                       False
iso3166-2                  False
RelativeHumiditySurface    False
SolarRadiation             False
Surfacepressure            False
TemperatureAboveGround     False
Totalprecipitation         False
UVIndex                    False
WindSpeed                  False
dtype: bool
----------------------------------------------
There are no missing values in the dataset!
______________________________________________
Corona Dataset
date                  False
region_code           False
confirmed_addition    False
deceased_addition     False
dtype: bool
----------------------------------------------
There are no missing values in the dataset!
______________________________________________
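
`isnull().any()` only reports whether a column has at least one gap. When gaps do exist, a count per column is usually more informative. The following is a minimal sketch on a made-up frame, not the project data; the helper name `missing_report` is hypothetical.

```python
import numpy as np
import pandas as pd

def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]

# Toy frame with two deliberately missing entries (values are made up).
toy = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03"],
    "confirmed_addition": [5, np.nan, 7],
    "deceased_addition": [0, 0, np.nan],
})
report = missing_report(toy)
print(report if not report.empty else "No missing values")
```

For the two datasets above the report would be empty, which matches the printed result.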


In [18]:
#Creating a dictionary that contains the full names of the different regions as keys and their respective iso3166-2 code as values
region_map = {region["covid_region_code"]: region["iso3166-2_code"]
    for region in country_metadata["country_metadata"]}

#Using the region_map dictionary to create a new column with the respective iso3166-2 code for each region based on the full region name
#from the region_code column
corona_df["iso3166-2"] = corona_df["region_code"].map(region_map)

#Creating a dictionary that contains the iso3166-2 codes of the different regions as keys and their respective populations as values
population_map = {region["iso3166-2_code"]: region["population"]
    for region in country_metadata["country_metadata"]}

#Using the population_map dictionary to create a new column with the respective population for each region based on the iso3166-2 code
#from the iso3166-2 column
corona_df["population"] = corona_df["iso3166-2"].map(population_map)

#Also adding a cases per capita column: the number of new confirmed covid cases that day divided by the population of that region
corona_df["cases_pc"] = corona_df["confirmed_addition"] / corona_df["population"]

#Converting the temperature from Kelvin to Celsius
weather_df["TemperatureAboveGround"] = weather_df["TemperatureAboveGround"] - 273.15

#Filtering out all the weather data that is not relevant, as we are only interested in weather data from Germany
weather_df = weather_df[weather_df["iso3166-2"].str.startswith("DE")]

#Merging the weather data with the corona data on the two columns they share (date and iso3166-2)
#to create one dataframe with all the data that we need
df = corona_df.merge(weather_df, on = ["date", "iso3166-2"])
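
Two steps in the cell above can fail silently: `Series.map` leaves `NaN` wherever a region name has no entry in the dictionary, and `merge` called without arguments joins on every column name the two frames share. The sketch below makes both explicit on made-up toy data (the region "Atlantis" and all values are invented for illustration); `validate` is a standard `DataFrame.merge` parameter, and nothing here is taken from the project files.

```python
import pandas as pd

# Toy stand-ins for the metadata mapping and the two datasets (values are made up).
region_map = {"Berlin": "DE-BE", "Bayern": "DE-BY"}

corona = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01", "2020-03-02"],
    "region_code": ["Berlin", "Bayern", "Atlantis"],  # "Atlantis" is deliberately unmapped
    "confirmed_addition": [3, 8, 1],
})
weather = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-01"],
    "iso3166-2": ["DE-BE", "DE-BY"],
    "TemperatureAboveGround": [277.15, 279.15],  # Kelvin
})

# map() leaves NaN where a region name has no entry in the dictionary.
corona["iso3166-2"] = corona["region_code"].map(region_map)
unmapped = corona.loc[corona["iso3166-2"].isna(), "region_code"].unique().tolist()
print("Unmapped regions:", unmapped)

# Convert on a copy so that re-running the cell cannot subtract 273.15 twice.
weather_c = weather.assign(
    TemperatureAboveGround=weather["TemperatureAboveGround"] - 273.15
)

# Name the join keys and let pandas verify there is one weather row per key.
merged = corona.dropna(subset=["iso3166-2"]).merge(
    weather_c, on=["date", "iso3166-2"], how="inner", validate="many_to_one"
)
print(merged)
```

With `validate="many_to_one"`, pandas raises a `MergeError` if the weather frame has duplicate `(date, iso3166-2)` rows, instead of quietly duplicating case counts.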

### Inspection of the datasets

In [None]:
df = pd.read_csv("../../Data/processed/de_weather+cases.csv")
corona_df = pd.read_csv("../../Data/processed/de_corona.csv")
# case_by_region
# case_by_region = case_by_region.set_index(['iso3166-2', case_by_region.index])
# case_by_region
df.describe()
