# Task 2

Task 2: You are free to find and define a problem (apply the discovery and define phases first, from the UK Design Council Double Diamond, 3.007 Design Thinking and Innovation) of your interest related to COVID-19. The problem can be modelled either using Linear Regression (or Multiple Linear Regression) or Logistic Regression, which means you can work with either continuous numerical data or classification.

The following technical/tool constraint applies: you are NOT allowed to use Neural Networks or other Machine Learning models. You must use Python and Jupyter Notebook.

In general, you may want to consider performing the following steps:
- Find an interesting problem which you want to solve either using **Linear Regression or Classification** (please check with your instructors first on whether the problem makes sense).
- Find a **dataset** to build your model. For example, you can use Kaggle (https://www.kaggle.com/datasets) to find suitable datasets.
- Use **plots** to visualize and understand your data.
- Create **training and test** data sets.
- Build your model.
- Choose an **appropriate metric** to evaluate your model (you may use the same metric as the one used in Task 1).
- Improve your model.
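
The split, fit and evaluate steps above can be sketched with NumPy alone, which stays within the "no other Machine Learning models" constraint. This is a minimal sketch on synthetic data, not the model built later in this notebook: the helper names (`train_test_split_simple`, `fit_linear`, `predict_linear`, `r2_score`) are hypothetical, and R² is assumed as the metric because the Task 1 metric is not stated here.

```python
import numpy as np

def train_test_split_simple(X, y, test_fraction=0.3, seed=0):
    # Shuffle the row indices, then cut off the first part as the test set
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(round(len(X) * test_fraction))
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]

def fit_linear(X, y):
    # Least-squares fit of y = b0 + b1*x1 + ... (a column of ones gives the intercept)
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    A = np.column_stack([np.ones(len(X)), X])
    return A @ beta

def r2_score(y_true, y_pred):
    # Coefficient of determination: 1 - SS_res / SS_tot
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (made up for illustration): y = 3 + 2*x1 - 1*x2 + noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 1, size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(0, 0.05, size=200)

X_train, X_test, y_train, y_test = train_test_split_simple(X, y)
beta = fit_linear(X_train, y_train)
r2 = r2_score(y_test, predict_linear(beta, X_test))
print(beta, r2)
```

For time-indexed data such as weekly prices, a chronological split (train on earlier weeks, test on later ones) may be more appropriate than the random split used in this sketch.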

Problem: predict gold prices from COVID-19 indicators.
Motivation: the effect of COVID-19 on the economy.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Load data
We use weekly GLD price history from [Yahoo Finance](https://finance.yahoo.com/quote/GLD/history?period1=1479859200&period2=1637625600&interval=1wk&filter=history&frequency=1wk&includeAdjustedClose=true) (`GLD.csv`) together with a COVID-19 dataset (`covid_data.csv`).

In [92]:
df_gold_all = pd.read_csv('GLD.csv')   # weekly GLD prices
df = pd.read_csv('covid_data.csv')     # daily COVID-19 data per location

# display(df_gold_all)
# display(df)


## Clean data

In [93]:
# Clean the COVID data: keep only the relevant columns and the selected countries.
# Split the columns into a categorical set and a numerical set so that the
# numerical set can be normalized.

columns_cat = ['date', 'location', 'continent']
columns_num = ['new_deaths', 'new_cases',
               'stringency_index', 'total_tests', 'total_vaccinations',
               'reproduction_rate', 'hospital_beds_per_thousand', 'hosp_patients_per_million',
               'hosp_patients', 'icu_patients_per_million', 'icu_patients']

# Countries kept for the analysis
locations = ['United States', 'China', 'Japan', 'Hong Kong', 'United Kingdom',
             'Canada', 'India', 'Saudi Arabia', 'France', 'Germany',
             'South Korea', 'Switzerland', 'Australia', 'Netherlands',
             'Iran', 'Sweden', 'Brazil', 'Spain', 'Russia', 'Singapore']

# Build the row mask once and reuse it for both subsets
mask = df['location'].isin(locations)
df_cat = df.loc[mask, columns_cat]
df_num = df.loc[mask, columns_num]

# & ((df['date']> '2021-01-01') & (df['date']< '2021-11-17'))
#'population_density','handwashing_facilities','extreme_poverty','gdp_per_capita',
# (df['location']=='Bulgaria') | (df['location']=='Croatia') | (df['location']=='Cyprus') | (df['location']=='Czech Republic') | 
# (df['location']=='Latvia') | (df['location']=='Lithuania') | (df['location']=='Luxembourg') | 
# (df['location']=='Malta') | 
# (df['location']=='Romania')| 
# (df['location']=='Slovakia')|

In [94]:
def normalize_minmax(dfin):
    # Scale every column to [0, 1]: (x - min) / (max - min)
    min_v = dfin.min(axis=0)
    max_v = dfin.max(axis=0)
    dfout = (dfin - min_v) / (max_v - min_v)
    return dfout

df_num_norm = normalize_minmax(df_num)
stats = df_num_norm.describe()
# display(stats)

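# Recombine the categorical columns with the normalized numerical columns
frames = [df_cat, df_num_norm]
result = pd.concat(frames, axis=1)

# Replace missing values with 0, which is the column minimum after normalization
df_covid = result.fillna(0)
display(df_covid)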

Unnamed: 0,date,location,continent,new_deaths,new_cases,stringency_index,total_tests,total_vaccinations,reproduction_rate,hospital_beds_per_thousand,hosp_patients_per_million,hosp_patients,icu_patients_per_million,icu_patients
6885,2020-01-26,Australia,Oceania,0.000000,0.152192,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000
6886,2020-01-27,Australia,Oceania,0.000000,0.152186,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000
6887,2020-01-28,Australia,Oceania,0.000000,0.152184,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000
6888,2020-01-29,Australia,Oceania,0.000000,0.152186,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000
6889,2020-01-30,Australia,Oceania,0.000000,0.152190,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127133,2021-11-13,United States,North America,0.255058,0.249411,0.4491,0.0,0.184892,0.0,0.178914,0.181536,0.310692,0.324237,0.391160
127134,2021-11-14,United States,North America,0.219328,0.202968,0.4491,0.0,0.185005,0.0,0.178914,0.183070,0.313318,0.323511,0.390295
127135,2021-11-15,United States,North America,0.343198,0.443051,0.4491,0.0,0.185044,0.0,0.178914,0.187783,0.321385,0.327733,0.395383
127136,2021-11-16,United States,North America,0.362247,0.331252,0.4491,0.0,0.185044,0.0,0.178914,0.184496,0.315757,0.320617,0.386799
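
One edge case in min-max scaling is a column whose values are all equal: `max - min` is then 0 and the division yields NaN, which the later `fillna(0)` would silently turn into 0. The sketch below shows one way to make that explicit. `normalize_minmax_safe` and the toy frame are hypothetical names and made-up values used only for illustration; they are not part of the notebook.

```python
import pandas as pd

def normalize_minmax_safe(dfin):
    # min/max skip NaN by default, so missing values do not affect the range
    min_v = dfin.min(axis=0)
    max_v = dfin.max(axis=0)
    # A constant column has span 0; use 1 instead so it maps to 0 rather than NaN
    span = (max_v - min_v).replace(0, 1)
    return (dfin - min_v) / span

# Toy frame (made-up values): one varying column, one constant column
toy = pd.DataFrame({
    "new_cases": [0.0, 50.0, 100.0],
    "hospital_beds": [2.5, 2.5, 2.5],
})
toy_norm = normalize_minmax_safe(toy)
print(toy_norm)
```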


In [95]:
# Clean up the gold data frame

# Remove non-close related columns
# df_gold.drop(['Volume','Open','High','Low'], axis=1, inplace=True)
# Rename Close/Last -> Close
# df_gold.rename(columns = {'Close/Last':'Close'}, inplace = True)

# Work on a copy of the gold data loaded above
df_gold = df_gold_all.copy()

# Parse the date column so that comparisons are made on dates, not on strings
df_gold['Date'] = pd.to_datetime(df_gold['Date'])

# Restrict the gold data to the same date range as the COVID data
df_gold = df_gold.loc[(df_gold['Date'] > '2020-01-26') & (df_gold['Date'] < '2021-11-17'), :]

# pd.set_option('display.max_rows', None)
# display(df_gold)
# print(pd.options.display.max_rows)
# pd.reset_option('display.max_rows')
# print(pd.options.display.max_rows)
print(df_gold.shape)

(95, 7)
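
The date-range filter can be checked on a small example. The sketch below parses the dates first and then compares against `pd.Timestamp` bounds, so the comparison is chronological rather than lexicographic. The frame `toy_gold` and its prices are made up for illustration; only the column names `Date` and `Close` and the two bounds follow the notebook.

```python
import pandas as pd

# Toy weekly rows (made-up prices); the first and last row fall outside the window
toy_gold = pd.DataFrame({
    "Date": ["2020-01-20", "2020-01-27", "2020-02-03", "2021-11-22"],
    "Close": [146.0, 148.0, 149.0, 167.0],
})

# Parse strings into datetime64 values before filtering
toy_gold["Date"] = pd.to_datetime(toy_gold["Date"])

start = pd.Timestamp("2020-01-26")
end = pd.Timestamp("2021-11-17")

# Strict inequalities, matching the filter used for df_gold
in_range = (toy_gold["Date"] > start) & (toy_gold["Date"] < end)

# Setting the parsed column as the index gives a DatetimeIndex
toy_gold_window = toy_gold.loc[in_range].set_index("Date")
print(toy_gold_window)
```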


In [99]:
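df_covid['d'] = pd.to_datetime(df_covid['date'])
# print(df_covid)

# Weekly mean of the numerical columns, averaged over all selected locations.
# groupby returns a new frame, so the result is stored instead of discarded.
df_covid_weekly = df_covid.groupby(pd.Grouper(key='d', freq="1W")).mean(numeric_only=True)
display(df_covid)
print(df_covid.shape)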



Unnamed: 0,date,location,continent,new_deaths,new_cases,stringency_index,total_tests,total_vaccinations,reproduction_rate,hospital_beds_per_thousand,hosp_patients_per_million,hosp_patients,icu_patients_per_million,icu_patients,d,f
6885,2020-01-26,Australia,Oceania,0.000000,0.152192,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000,2020-01-26,2020-01-26
6886,2020-01-27,Australia,Oceania,0.000000,0.152186,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000,2020-01-27,2020-01-27
6887,2020-01-28,Australia,Oceania,0.000000,0.152184,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000,2020-01-28,2020-01-28
6888,2020-01-29,Australia,Oceania,0.000000,0.152186,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000,2020-01-29,2020-01-29
6889,2020-01-30,Australia,Oceania,0.000000,0.152190,0.0556,0.0,0.000000,0.0,0.264377,0.000000,0.000000,0.000000,0.000000,2020-01-30,2020-01-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
127133,2021-11-13,United States,North America,0.255058,0.249411,0.4491,0.0,0.184892,0.0,0.178914,0.181536,0.310692,0.324237,0.391160,2021-11-13,2021-11-13
127134,2021-11-14,United States,North America,0.219328,0.202968,0.4491,0.0,0.185005,0.0,0.178914,0.183070,0.313318,0.323511,0.390295,2021-11-14,2021-11-14
127135,2021-11-15,United States,North America,0.343198,0.443051,0.4491,0.0,0.185044,0.0,0.178914,0.187783,0.321385,0.327733,0.395383,2021-11-15,2021-11-15
127136,2021-11-16,United States,North America,0.362247,0.331252,0.4491,0.0,0.185044,0.0,0.178914,0.184496,0.315757,0.320617,0.386799,2021-11-16,2021-11-16


(13085, 16)
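
To use the daily COVID rows together with weekly gold prices, both need a common weekly key. The sketch below is one possible approach, not necessarily the one used later in this notebook: it labels each daily row with the Monday that starts its week, averages per week, and joins onto the gold rows. It assumes that the weekly gold rows are dated by the Monday of each week, which should be checked against `GLD.csv`. All frames and values here are made up, and `week_start` is a hypothetical column name.

```python
import pandas as pd

# Toy daily COVID rows (made-up, already normalized values) for two locations
toy_covid = pd.DataFrame({
    "date": pd.to_datetime(["2020-03-02", "2020-03-03", "2020-03-09", "2020-03-10"]),
    "location": ["A", "B", "A", "B"],
    "new_cases": [0.1, 0.3, 0.2, 0.6],
})

# Label each row with the Monday of its week (weekday: Monday=0, ..., Sunday=6)
toy_covid["week_start"] = toy_covid["date"] - pd.to_timedelta(
    toy_covid["date"].dt.weekday, unit="D"
)

# Weekly mean over all locations and days in the week
weekly = toy_covid.groupby("week_start", as_index=False)["new_cases"].mean()

# Toy weekly gold rows (made-up prices), assumed to be dated by the week's Monday
toy_gold = pd.DataFrame({
    "Date": pd.to_datetime(["2020-03-02", "2020-03-09"]),
    "Close": [150.0, 155.0],
})

# An inner join keeps only the weeks present in both tables
merged = toy_gold.merge(weekly, left_on="Date", right_on="week_start", how="inner")
print(merged)
```

Computing the week label explicitly keeps the bin boundaries visible; `pd.Grouper(key='d', freq="1W")` groups by weeks ending on Sunday by default, so its labels would not line up with Monday-dated rows without further adjustment.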


In [47]:
myplot = sns.pairplot(data=result, hue='location',x_vars=['new_deaths', 'new_cases',
         'stringency_index','total_tests','total_vaccinations',
         'reproduction_rate'],y_vars=['new_deaths'])

0    2021-12-27
dtype: object

In [None]:
myplot = sns.pairplot(data=result, hue='location',x_vars=['new_deaths','hospital_beds_per_thousand','hosp_patients_per_million',
         'hosp_patients','icu_patients_per_million','icu_patients'],y_vars=['new_deaths'])

In [None]:
myplot = sns.pairplot(data=result, hue='location',x_vars=['new_cases','new_deaths', 
         'stringency_index','total_tests','total_vaccinations',
         'reproduction_rate'],y_vars=['new_cases'])

In [None]:
myplot = sns.pairplot(data=result, hue='location',x_vars=['new_cases','new_deaths',
         'hospital_beds_per_thousand','hosp_patients_per_million',
         'hosp_patients','icu_patients_per_million','icu_patients'],y_vars=['new_cases'])