<h1> DC19 - Analysis of Deaths in COVID-19 </h1>

<h2> Alternate Hypothesis : Which people are at the highest risk of contracting COVID-19, and which factors do COVID-19 deaths depend on? </h2>

# General Overview

<H3> Task Details </H3>

The initiative is prompted by the suggestion that there may be a link between reduced rates of infection and lower case fatality rates associated with COVID-19 in countries that recommend the BCG vaccine for all, as opposed to countries that recommend BCG only for specific high-risk groups. We hope that the analysis done as part of this task might help discover useful information about the BCG - COVID-19 clinical trials. For example, some insights that may come from this analysis are whether factors such as the strain of BCG, the age at which people were vaccinated, revaccination, or how long ago people were vaccinated are important.

Contact LinkedIn - https://www.linkedin.com/in/amankumar01/

<h3> Key questions considered in the Notebook </h3>

* Do certain population demographics play a role in the spread of COVID-19?
* Does the climate of a place affect COVID-19 spread patterns?
* How does mobility affect COVID-19?
* Which gender has seen more COVID-19 deaths?
* Are there any clinical factors that relate to COVID-19 mortality?

<h3> Overview of Topics Researched on in this notebook </h3>
This notebook uses some datasets created by me, which have been posted publicly under the Kaggle datasets for COVID-19. The notebook analyzes the medical parameters that contribute to the confirmed cases and deaths of COVID-19, since medical parameters are essential to look at when analyzing cases. It also analyzes the role of certain population demographics (gender, population parameters) and climatic conditions in the spread of the pandemic. I update the notebook regularly, so any new analyses and medical research will be added here.

The detailed conclusions for the notebook are available under the findings section.


# <a id='main'><h3>Table of Contents</h3></a>
- [Importing the Essential Libraries](#lib)
- [What do we actually know?](#knw)
- [Datasets used in notebook](#data)
- [Analysis of Spread of COVID-19 : Bar Graphs](#barspread)
- [Which demographic factors play a role in Transmission?](#demo)
- [Effect of Temperature on Transmission of COVID-19](#temp)
- [Dependency of COVID-19 Spread on certain Health/Demographic Figures : USA](#age)
- [Are certain age groups at a higher risk of contracting COVID-19?](#metric)
- [The factors that lead to death of people](#death)
- [Findings from the analyses](#findings)

<a id='lib'><h3>Importing the Essential Libraries </h3></a>

In [None]:
#Data Analyses Libraries
import pandas as pd                
import numpy as np    
from urllib.request import urlopen
import json
import glob
import os

#Importing Data plotting libraries
import matplotlib.pyplot as plt     
import plotly.express as px       
import plotly.offline as py       
import seaborn as sns             
import plotly.graph_objects as go 
from plotly.subplots import make_subplots
import matplotlib.ticker as ticker
import matplotlib.animation as animation

#Other Miscellaneous Libraries
import warnings
warnings.filterwarnings('ignore')
from IPython.display import HTML
import matplotlib.colors as mc
import colorsys
from random import randint
import re

<a id='knw'><h3>What do we actually know?</h3></a>

A general trend is that patients who belong to an elderly age group are at a higher risk of severe COVID-19 than younger people. [-Reports ABC News](https://abcnews.go.com/Health/risk-severe-covid-19-increases-decade-age/story?id=69914642)

The datasets mentioned under this challenge take data from Worldometer, which reports similar figures for the age-group-wise distribution of COVID-19 cases. The figures there also highlight that people with pre-existing conditions such as COPD or other medical ailments are at a higher risk from COVID-19 - [See here](https://www.worldometers.info/coronavirus/coronavirus-age-sex-demographics/)

We can analyze multiple datasets to understand this much better.



<a id='data'><h3>Datasets used in the notebook</h3></a>

1. We read the Novel-Corona-Virus-2019-dataset managed by SRK into this notebook. The dataset holds information about the cumulative case counts of COVID-19 across the world. The dataset can be viewed and downloaded from - [here](https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset)

2. The dataset CovCSD - COVID-19 Countries Statistical Dataset created by me (available at https://www.kaggle.com/aestheteaman01/covcsd-covid19-countries-statistical-dataset) is loaded here. Details about the dataset can be found in its description section.

3. COVID-19 UNCOVER Collection of Datasets available from Kaggle.

4. US-Counties Covid-19 Dataset

In [None]:
#Reading the cumulative cases dataset
covid_cases = pd.read_csv('../input/novel-corona-virus-2019-dataset/covid_19_data.csv')

#Viewing the dataset
covid_cases.head()

<h3> Further Analysis for the dataset </h3>

The following steps are taken:

1. We group the dataset country-wise.
2. The data for the country we want to check is later fetched from the grouped dataset.
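
The two steps above can be sketched on a toy frame. The rows below are made-up values that only mimic the column names of `covid_19_data.csv`, and a stable sort is just one possible way to keep the rows of each country together; it is not necessarily how the notebook cell does it.

```python
import pandas as pd

# Toy rows (made-up values) that mimic the columns of covid_19_data.csv
toy_cases = pd.DataFrame({
    'ObservationDate': ['01/22/2020', '01/22/2020', '01/23/2020', '01/23/2020'],
    'Country/Region': ['Japan', 'Thailand', 'Japan', 'Thailand'],
    'Confirmed': [2, 2, 1, 3],
})

# Step 1: keep the rows of each country together.
# A stable sort (mergesort) preserves the original date order inside each country.
toy_grouped = toy_cases.sort_values('Country/Region', kind='mergesort').reset_index(drop=True)

# Step 2: fetch the rows of the country we want to check with a boolean mask
japan_rows = toy_grouped[toy_grouped['Country/Region'] == 'Japan']
print(japan_rows)
```

The mask in step 2 works on the ungrouped frame as well; the grouping only makes the table easier to read country by country.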

In [None]:
#Grouping the rows of each country together, keeping their successive dates in order.

country_list = covid_cases['Country/Region'].unique()

#Collecting the rows of every country and concatenating them once
#(starting from a list of per-country frames avoids duplicating the first row of the dataset)
country_frames = [covid_cases[covid_cases['Country/Region'] == country] for country in country_list]
country_grouped_covid = pd.concat(country_frames, axis=0)

#reset_index returns a new dataframe, so the result has to be assigned back
country_grouped_covid = country_grouped_covid.reset_index(drop=True)
country_grouped_covid.head()

#Dropping of the column Last Update
country_grouped_covid.drop('Last Update', axis=1, inplace=True)

#Replacing NaN Values in Province/State with a string "Not Reported"
country_grouped_covid['Province/State'] = country_grouped_covid['Province/State'].fillna("Not Reported")

#Printing the dataset
country_grouped_covid.head()

#country_grouped_covid holds the dataset for the country

In [None]:
#Creating a dataset to analyze the cases country wise - As of 12/06/2020

latest_data = country_grouped_covid['ObservationDate'] == '12/06/2020'
country_data = country_grouped_covid[latest_data]

#The total number of reported Countries
country_list = country_data['Country/Region'].unique()
print("The total number of countries with COVID-19 Confirmed cases = {}".format(country_list.size))

<a id='barspread'><h3> Analysis of Spread and deaths due to COVID-19 from Bar Graphs </h3></a>

In [None]:
#Plotting a bar graph for confirmed cases vs deaths due to COVID-19 in World.

unique_dates = country_grouped_covid['ObservationDate'].unique()
confirmed_cases = []
recovered = []
deaths = []

for date in unique_dates:
    date_wise = country_grouped_covid['ObservationDate'] == date  
    test_data = country_grouped_covid[date_wise]
    
    confirmed_cases.append(test_data['Confirmed'].sum())
    deaths.append(test_data['Deaths'].sum())
    recovered.append(test_data['Recovered'].sum())
    
#Converting the lists to a pandas dataframe.

country_dataset = {'Date' : unique_dates, 'Confirmed' : confirmed_cases, 'Recovered' : recovered, 'Deaths' : deaths}
country_dataset = pd.DataFrame(country_dataset)

#Plotting the Graph of Cases vs Deaths Globally.

fig = go.Figure()
fig.add_trace(go.Bar(x=country_dataset['Date'], y=country_dataset['Confirmed'], name='Confirmed Cases of COVID-19', marker_color='rgb(55, 83, 109)'))
fig.add_trace(go.Bar(x=country_dataset['Date'],y=country_dataset['Deaths'],name='Total Deaths because of COVID-19',marker_color='rgb(26, 118, 255)'))

fig.update_layout(title='Confirmed Cases and Deaths from COVID-19',xaxis_tickfont_size=14,
                  yaxis=dict(title=dict(text='Reported Numbers',font=dict(size=16)),tickfont_size=14,),
    legend=dict(x=0,y=1.0,bgcolor='rgba(255, 255, 255, 0)',bordercolor='rgba(255, 255, 255, 0)'),barmode='group',bargap=0.15, bargroupgap=0.1)
fig.show()


fig = go.Figure()
fig.add_trace(go.Bar(x=country_dataset['Date'], y=country_dataset['Confirmed'], name='Confirmed Cases of COVID-19', marker_color='rgb(55, 83, 109)'))
fig.add_trace(go.Bar(x=country_dataset['Date'],y=country_dataset['Recovered'],name='Total Recoveries from COVID-19',marker_color='rgb(26, 118, 255)'))

fig.update_layout(title='Confirmed Cases and Recoveries from COVID-19',xaxis_tickfont_size=14,
                  yaxis=dict(title=dict(text='Reported Numbers',font=dict(size=16)),tickfont_size=14,),
    legend=dict(x=0,y=1.0,bgcolor='rgba(255, 255, 255, 0)',bordercolor='rgba(255, 255, 255, 0)'),
    barmode='group',bargap=0.15, bargroupgap=0.1)
fig.show()
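
The date-wise totals built with the loop in the cell above can also be obtained with a single `groupby`. The sketch below uses a toy frame with made-up numbers; only the column names are taken from the notebook.

```python
import pandas as pd

# Toy rows (made-up values) with the same column names as country_grouped_covid
toy = pd.DataFrame({
    'ObservationDate': ['01/22/2020', '01/22/2020', '01/23/2020'],
    'Confirmed': [10, 5, 20],
    'Deaths': [1, 0, 2],
    'Recovered': [0, 1, 3],
})

# sort=False keeps the dates in order of first appearance, like unique() does
daily_totals = (toy.groupby('ObservationDate', sort=False)[['Confirmed', 'Deaths', 'Recovered']]
                   .sum()
                   .reset_index()
                   .rename(columns={'ObservationDate': 'Date'}))
print(daily_totals)
```

The resulting frame has the same `Date`, `Confirmed`, `Recovered` and `Deaths` columns that the bar charts read from `country_dataset`.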

# Do Population Demographic Factors correlate with COVID-19 Cases/Deaths?

<h3> Observations using CoV-CSD | Covid-19 Countries Statistical Dataset </h3>

In [None]:
#Concatenating all of the csv files available in the dataset folder into one dataframe.

folder_name = '../input/covcsd-covid19-countries-statistical-dataset'
file_type = 'csv'
separator =','
dataframe = pd.concat([pd.read_csv(f, sep=separator) for f in glob.glob(folder_name + "/*."+file_type)],ignore_index=True,sort=False)

In [None]:
#Selecting the columns that are required for the data-wrangling task

covid_data = dataframe[['Date', 'State', 'Country', 'Cumulative_cases', 'Cumulative_death',
       'Daily_cases', 'Daily_death', 'Latitude', 'Longitude', 'Temperature',
       'Min_temperature', 'Max_temperature', 'Wind_speed', 'Precipitation',
       'Fog_Presence', 'Population', 'Population Density/km', 'Median_Age',
       'Sex_Ratio', 'Age%_65+', 'Hospital Beds/1000', 'Available Beds/1000',
       'Confirmed Cases/1000', 'Lung Patients (F)', 'Lung Patients (M)',
       'Life Expectancy (M)', 'Life Expectancy (F)', 'Total_tests_conducted',
       'Out_Travels (mill.)', 'In_travels(mill.)', 'Domestic_Travels (mill.)']]

<h3> A little editing with the dataset </h3>

In [None]:
#Filtering of the dataset to view the contents (as of 30-03-2020) #Taking early date into consideration to traceback cause of spread
latest_data = covid_data['Date'] == '30-03-2020'
country_data_detailed = covid_data[latest_data].copy() #copy() so that the edits below do not act on a view of covid_data

#Dropping off unnecessary columns from the country_data_detailed dataset
country_data_detailed.drop(['Daily_cases','Daily_death','Latitude','Longitude'],axis=1,inplace=True)

#Viewing the dataset
country_data_detailed.head(3)

In [None]:
#Replacing the text Not Reported and N/A with the numpy missing value (np.nan)

country_data_detailed.replace('Not Reported',np.nan,inplace=True)
country_data_detailed.replace('N/A',np.nan,inplace=True)


#Viewing the dataset
country_data_detailed.head(3)

In [None]:
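#Converting the datatypes (the lowercase spelling 'Not reported' is not caught by the replacement above)

country_data_detailed['Lung Patients (F)'] = country_data_detailed['Lung Patients (F)'].replace('Not reported',np.nan)
country_data_detailed['Lung Patients (F)'] = country_data_detailed['Lung Patients (F)'].astype("float")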
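
When a column mixes numbers with several spellings of a missing-value marker, `pd.to_numeric` with `errors='coerce'` is one way to convert it in a single step. The sketch below uses a made-up series; the values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up values: numbers stored as text plus two different missing-value markers
raw = pd.Series(['12.5', 'Not reported', 'N/A', '7'], name='Lung Patients (F)')

# Anything that cannot be parsed as a number becomes NaN
numeric = pd.to_numeric(raw, errors='coerce')
print(numeric)
```

The trade-off is that every unparseable entry silently becomes NaN, so it is worth checking the count of missing values before and after the conversion.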

<H3> Understanding the dataset generated above </H3>

The dataset holds information about:

1. The name of the country
2. Total deaths and cases reported from COVID-19 as of March 30th 2020
3. Climate figures such as temperature, wind speed, precipitation and fog presence (latitude and longitude were dropped above)
4. Other demographics

In [None]:
#Getting the dataset to check the correlation (only numeric columns can be correlated)
corr_data = country_data_detailed.drop(['Date','State','Country','Min_temperature','Max_temperature','Out_Travels (mill.)',
                                        'In_travels(mill.)','Domestic_Travels (mill.)','Total_tests_conducted','Age%_65+'], axis=1)
corr_data = corr_data.select_dtypes(include='number')

#Computing the pairwise correlation of the columns
corr = corr_data.corr()

# <a id='demo'><h3>Which demographic factors play a role in transmission across populations?</h3></a>


In [None]:
#Plotting a heatmap

def heatmap(x, y, size,color):
    fig, ax = plt.subplots(figsize=(20,3))
    
    # Mapping from column names to integer coordinates
    x_labels = corr_data.columns
    y_labels = ['Cumulative_cases', 'Cumulative_death']
    x_to_num = {p[1]:p[0] for p in enumerate(x_labels)} 
    y_to_num = {p[1]:p[0] for p in enumerate(y_labels)} 
    
    n_colors = 256 # Use 256 colors for the diverging color palette
    palette = sns.cubehelix_palette(n_colors) # Create the palette
    color_min, color_max = [-1, 1] # Range of values that will be mapped to the palette, i.e. min and max possible correlation

    def value_to_color(val):
        val_position = float((val - color_min)) / (color_max - color_min) # position of value in the input range, relative to the length of the input range
        ind = int(val_position * (n_colors - 1)) # target index in the color palette
        return palette[ind]

    
    ax.scatter(
    x=x.map(x_to_num),
    y=y.map(y_to_num),
    s=size * 1000,
    c=color.apply(value_to_color), # Vector of square color values, mapped to color palette
    marker='s'
)
    
    # Show column labels on the axes
    ax.set_xticks([x_to_num[v] for v in x_labels])
    ax.set_xticklabels(x_labels, rotation=30, horizontalalignment='right')
    ax.set_yticks([y_to_num[v] for v in y_labels])
    ax.set_yticklabels(y_labels)
    
    
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)
    
    ax.set_xlim([-0.5, max([v for v in x_to_num.values()]) + 0.5]) 
    ax.set_ylim([-0.5, max([v for v in y_to_num.values()]) + 0.5])
    
corr = pd.melt(corr.reset_index(), id_vars='index') 
corr.columns = ['x', 'y', 'value']
heatmap(x=corr['x'],y=corr['y'],size=corr['value'].abs(),color=corr['value'])

In [None]:
#Creating a correlation matrix

matrix = corr_data.corr()
#print(matrix)

<h3> Initial Analysis from the datasets </h3>

With a weak correlation, we observe the following trends:

1. With the rise in temperature, the confirmed cases tend to slow down (negative correlation). However, more substantial proof is needed here. For this reason, in upcoming versions of the notebook I'll analyze the trends over all the days to check the effect of temperature.

2. Median age tends to affect the cases. For a country with a higher median age, cases tend to increase.

3. Life expectancy also seems to affect the COVID-19 confirmed cases, with a weak correlation. The effect is more prominent in males than in females.

We go on to look at more datasets to analyze the correlations, since the correlations obtained here are too weak.
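
To judge whether a weak correlation is more than noise, the coefficient can be reported together with its p-value. The sketch below uses `scipy.stats.pearsonr` on synthetic data; the numbers are generated for illustration and are not the notebook's figures.

```python
import numpy as np
from scipy.stats import pearsonr

# Synthetic data (an assumption for illustration): a weak negative trend plus noise
rng = np.random.default_rng(0)
temperature = rng.uniform(-5, 35, size=60)
cases = 500 - 4 * temperature + rng.normal(0, 150, size=60)

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = pearsonr(temperature, cases)
print('r=%.3f, p=%.3f' % (r, p_value))
```

A small coefficient with a large p-value would support treating the trend as inconclusive, which is the cautious reading taken above.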

# <a id='temp'><h3>Role of Temperature and Climate in spread of COVID-19</h3></a>

Are people living in colder environments more prone to COVID-19 than those in warmer regions?

In [None]:
#Reading the temperature data file
temperature_data = pd.read_csv('../input/covcsd-covid19-countries-statistical-dataset/temperature_data.csv')

#Viewing the dataset
temperature_data.head()

In [None]:
#Checking the dependence of Confirmed COVID-19 Cases on Temperature

unique_temp = temperature_data['Temperature'].unique()
confirmed_cases = []
deaths = []

for temp in unique_temp:
    temp_wise = temperature_data['Temperature'] == temp
    test_data = temperature_data[temp_wise]
    
    confirmed_cases.append(test_data['Daily_cases'].sum())
    deaths.append(test_data['Daily_death'].sum())
    
#Converting the lists to a pandas dataframe.

temperature_dataset = {'Temperature' : unique_temp, 'Confirmed' : confirmed_cases, 'Deaths' : deaths}
temperature_dataset = pd.DataFrame(temperature_dataset)

<h3> Analysis of Temperature and Confirmed Cases via Plotly Graphs </h3>

In [None]:
#Plotting a scatter plot for cases vs. Temperature

fig = make_subplots(specs=[[{"secondary_y": True}]])

#Colouring the markers by temperature, so that the colour array always matches the number of points
fig.add_trace(go.Scattergl(x = temperature_dataset['Temperature'],y = temperature_dataset['Confirmed'], mode='markers',
                                  marker=dict(color=temperature_dataset['Temperature'],colorscale='Viridis',line_width=1)),secondary_y=False)

fig.add_trace(go.Box(x=temperature_dataset['Temperature']),secondary_y=True)

fig.update_layout(title='Daily Confirmed Cases (COVID-19) vs. Temperature (Celsius) : Global Figures - January 22 - March 30 2020',
                  yaxis=dict(title='Reported Numbers'),xaxis=dict(title='Temperature in Celsius'))

fig.update_yaxes(title_text="BoxPlot Range ", secondary_y=True)

fig.show()


<h3> Digging deeper into understanding the effect of temperature </h3>

We import a dataset: Weather Data for COVID-19 Data Analysis, uploaded by Davin Bonin - [See here](https://www.kaggle.com/davidbnn92/weather-data-for-covid19-data-analysis#training_data_with_weather_info_week_4.csv). This dataset contains information about temperature and other weather figures for the countries with confirmed COVID-19 infections. The dataset is updated till April 14th 2020.

In [None]:
#Importing the dataset
temperature_figures = pd.read_csv('../input/weather-data-for-covid19-data-analysis/training_data_with_weather_info_week_4.csv')

#Converting Temperature from the Fahrenheit to the Celsius scale
temperature_figures['Temperature'] = (temperature_figures['temp']-32)*(5/9)
temperature_figures['Days since reported'] = temperature_figures['day_from_jan_first']-22

#Removing the columns that are not needed from the dataset.
temperature_figures.drop(['Id','Lat','Long','day_from_jan_first','wdsp', 'prcp','fog','min','max','temp'],axis=1,inplace=True)

#Viewing the dataset
temperature_figures.head()
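
The conversion applied to the `temp` column above is the standard Fahrenheit-to-Celsius formula, C = (F - 32) * 5/9. A small check with two well-known reference points is sketched below; the helper name `fahrenheit_to_celsius` is introduced here only for illustration.

```python
def fahrenheit_to_celsius(temp_f):
    """Convert a temperature from degrees Fahrenheit to degrees Celsius."""
    return (temp_f - 32) * 5 / 9

# Freezing and boiling points of water as sanity checks
print(fahrenheit_to_celsius(32))
print(fahrenheit_to_celsius(212))
```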

<h3> Understanding the temperature trends of cities with highest covid-19 cases </h3>

We plot a Circular Weight Hierarchy Plot to understand the spread and temperature

<iframe src='https://flo.uri.sh/visualisation/2014223/embed' frameborder='0' scrolling='no' style='width:100%;height:600px;'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/2014223/?utm_source=embed&utm_campaign=visualisation/2014223' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'></a></div>

From the above figure, we see that the countries with the highest number of cases have cooler temperatures. We analyze and confirm the same trend with bar charts.

<h3> Observing the Trends </h3>

The temperature range from -4 to 17 degrees Celsius has the highest confirmed case count, and the majority of the confirmed cases fall within it. COVID-19 has spread across the whole temperature range, but the spread is observed to be highest within this range.

As the boxplot drawn over the scatter plot shows, the majority of the confirmed cases, including the highest counts, lie between the first and third quartiles of temperature (Q1-Q3).
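
The share of cases that falls between the first and third quartiles of temperature can be computed directly. The sketch below uses a toy frame with made-up values; only the column names follow `temperature_dataset`.

```python
import pandas as pd

# Toy values (made-up) using the same column names as temperature_dataset
toy = pd.DataFrame({
    'Temperature': [-4, 2, 6, 9, 12, 17, 24, 30],
    'Confirmed': [50, 400, 900, 1200, 800, 300, 60, 20],
})

# First and third quartiles of the temperature values
q1, q3 = toy['Temperature'].quantile([0.25, 0.75])

# Rows whose temperature lies inside the interquartile range (inclusive)
in_iqr = toy['Temperature'].between(q1, q3)

# Fraction of all confirmed cases recorded at those temperatures
share = toy.loc[in_iqr, 'Confirmed'].sum() / toy['Confirmed'].sum()
print('Q1=%.2f, Q3=%.2f, share of cases inside=%.2f' % (q1, q3, share))
```

Note that the quartiles here are taken over the distinct temperature values, as in the boxplot, not weighted by the number of cases.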

<h3> Considerations while figuring out the trends </h3>

It is quite likely that this dataset, like other publicly available datasets, does not account for asymptomatic COVID-19 cases that were never tested, so the actual number of COVID-19 infections can be much higher than the official figures.

For this reason, the dataset above can be treated as sample data. We can run a hypothesis test on this sample to extend our analysis to the population (i.e. the actual number of infections).

Hence, if the p-value is greater than 0.05, we fail to reject our null hypothesis and conclude that the effect of temperature would remain the same even if asymptomatic COVID-19 cases were taken into consideration.

<h3> Conducting Hypothesis Testing </h3>



In [None]:
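from scipy.stats import ttest_ind

#Drawing a random sample of temperatures (at most 250 values) and comparing it with the full column
temperatures = temperature_dataset['Temperature'].dropna()
sample = temperatures.sample(n=min(250, len(temperatures)))
test = temperatures

#Two-sample t-test: the null hypothesis is that both groups have the same mean temperature
stat, p = ttest_ind(sample, test)
print('Statistics=%.3f, p=%.3f' % (stat, p))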

Since the p-value is greater than 0.05, we fail to reject our null hypothesis and conclude that the effect of temperature on COVID-19 remains the same over the population data. No statistically significant difference is present between the two datasets, so temperature cannot be singled out as the sole driver of the spread of COVID-19. However, the idea that COVID-19 spreads more within a certain range of temperatures needs more data and statistical testing before a substantial conclusion can be reached.
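
One possible form of the further statistical testing mentioned above is to compare case counts between a colder and a warmer group directly. The sketch below runs Welch's t-test on synthetic counts; the two groups, their sizes and the 17 degree cut-off are assumptions made for illustration, not results from the notebook's data.

```python
import numpy as np
from scipy.stats import ttest_ind

# Synthetic daily case counts (assumed values, for illustration only)
rng = np.random.default_rng(1)
cold_cases = rng.poisson(lam=120, size=40)   # days at or below 17 degrees Celsius
warm_cases = rng.poisson(lam=100, size=40)   # days above 17 degrees Celsius

# Welch's t-test: does not assume equal variances in the two groups
stat, p = ttest_ind(cold_cases, warm_cases, equal_var=False)
print('Statistics=%.3f, p=%.3f' % (stat, p))

if p > 0.05:
    print('Fail to reject the null hypothesis of equal mean case counts')
else:
    print('Reject the null hypothesis of equal mean case counts')
```

Unlike a comparison of a sample against the column it was drawn from, this design puts the quantity of interest (cases) on both sides of the test and uses temperature only to define the groups.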

# <a id='age'><h3>Dependency of COVID-19 Spread on certain Health/Demographic Figures : USA</h3></a>

Do certain population/health demographics affect the spread of COVID-19, or is the spread completely random? - Case Study of USA

<h3> Loading the datasets </h3>

We load the following datasets from the UNCOVER COVID-19 Challenge datasets:

1. US Counties COVID-19 Dataset : Available on Kaggle by MyrnaMFL - [See here](https://www.kaggle.com/fireballbyedimyrnmom/us-counties-covid-19-dataset)
2. UNCOVER COVID-19 USAFacts Dataset : Confirmed COVID-19 Cases in US by county and state.
3. CovCSD : COVID-19 Countries Statistical Dataset, prepared by me.

In [None]:
#Loading US County Wise Confirmed Cases Dataset
usa_cases_tot = pd.read_csv('../input/covcsd-covid19-countries-statistical-dataset/us-county.csv',dtype={"fips": str})

#Viewing the data
usa_cases_tot.head()

In [None]:
#Getting the geo-json files
with urlopen('https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json') as response:
    counties = json.load(response)

#Plotting the data    

usa_cases_tot['log_ConfirmedCases'] = np.log(usa_cases_tot.Confirmed + 1)
usa_cases_tot['fips'] = usa_cases_tot['fips'].astype(str).str.rjust(5,'0')
 
fig = px.choropleth(usa_cases_tot, geojson=counties, locations='fips', color='log_ConfirmedCases',
                           color_continuous_scale="Viridis",
                           range_color=(0, 12),
                           scope="usa")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
py.offline.iplot(fig)
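
Two small transformations in the cell above deserve a note: the case counts are log-transformed so that a few very large counties do not dominate the colour scale, and the FIPS codes are left-padded to five digits so that they match the keys of the GeoJSON file. The sketch below reproduces both on a toy frame with made-up counts, and shows which case count the upper colour limit of 12 corresponds to.

```python
import numpy as np
import pandas as pd

# Toy rows (made-up counts); one FIPS code has lost its leading zero
toy = pd.DataFrame({'fips': ['1001', '36061'], 'Confirmed': [0, 150000]})

# log1p(x) equals log(x + 1), so counties with zero cases map to 0
toy['log_ConfirmedCases'] = np.log1p(toy['Confirmed'])

# zfill pads with zeros on the left, like str.rjust(5, '0')
toy['fips'] = toy['fips'].str.zfill(5)

# Inverting the transform for the upper colour limit used in range_color=(0, 12)
upper = np.expm1(12)
print(toy)
print(round(upper))
```

An upper limit of 12 on the log scale therefore covers counties with up to roughly 163,000 confirmed cases.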

<h3> Observations from the Choropleth Map Generated Above </h3>

The spread of COVID-19 is concentrated around the eastern coast of the US. New York is the major epicenter, and counties near New York have a higher concentration of cases than those farther away. The area around Chicago also has a higher case density than other parts of the US.

<h3> Analysis of Spread of COVID-19 Across US Counties via Running Chart Analysis </h3>

<iframe src='https://flo.uri.sh/visualisation/2020035/embed' frameborder='0' scrolling='no' style='width:100%;height:600px;'></iframe><div style='width:100%!;margin-top:4px!important;text-align:right!important;'><a class='flourish-credit' href='https://public.flourish.studio/visualisation/2020035/?utm_source=embed&utm_campaign=visualisation/2020035' target='_top' style='text-decoration:none!important'><img alt='Made with Flourish' src='https://public.flourish.studio/resources/made_with_flourish.svg' style='width:105px!important;height:16px!important;border:none!important;margin:0!important;'></a></div>

<h3> Case counts in the US </h3>

1. New York City, Nassau, Suffolk and Westchester have the highest reported cases of COVID-19.
2. We will look further at the demographic distribution of these regions to analyze the trends in more detail.

<h3> Does the spread of COVID-19 across US counties have any relation to health indices? </h3>

To analyze this question, we turn to our generated dataset.


In [None]:
#Getting the dataset to check the correlation (only numeric columns can be correlated)
corr_data = usa_cases_tot.drop(['fips','state','county'], axis=1).select_dtypes(include='number')

#Computing the pairwise correlation of the columns
corr = corr_data.corr()

#Plotting a heatmap

def heatmap(x, y, size,color):
    fig, ax = plt.subplots(figsize=(20,10))
    
    # Mapping from column names to integer coordinates
    x_labels = corr_data.columns
    y_labels = corr_data.columns
    x_to_num = {p[1]:p[0] for p in enumerate(x_labels)} 
    y_to_num = {p[1]:p[0] for p in enumerate(y_labels)} 
    
    n_colors = 256 # Use 256 colors for the diverging color palette
    palette = sns.cubehelix_palette(n_colors) # Create the palette
    color_min, color_max = [-1, 1] # Range of values that will be mapped to the palette, i.e. min and max possible correlation

    def value_to_color(val):
        val_position = float((val - color_min)) / (color_max - color_min) # position of value in the input range, relative to the length of the input range
        ind = int(val_position * (n_colors - 1)) # target index in the color palette
        return palette[ind]

    
    ax.scatter(
    x=x.map(x_to_num),
    y=y.map(y_to_num),
    s=size * 1000,
    c=color.apply(value_to_color), # Vector of square color values, mapped to color palette
    marker='s')
    
    # Show column labels on the axes
    ax.set_xticks([x_to_num[v] for v in x_labels])
    ax.set_xticklabels(x_labels, rotation=30, horizontalalignment='right')
    ax.set_yticks([y_to_num[v] for v in y_labels])
    ax.set_yticklabels(y_labels)
    
    
    ax.set_xticks([t + 0.5 for t in ax.get_xticks()], minor=True)
    ax.set_yticks([t + 0.5 for t in ax.get_yticks()], minor=True)
    
    ax.set_xlim([-0.5, max([v for v in x_to_num.values()]) + 0.5]) 
    ax.set_ylim([-0.5, max([v for v in y_to_num.values()]) + 0.5])
    
corr = pd.melt(corr.reset_index(), id_vars='index') 
corr.columns = ['x', 'y', 'value']
heatmap(x=corr['x'],y=corr['y'],size=corr['value'].abs(),color=corr['value'])

In [None]:
#Plotting a scatter plot for confirmed cases vs. traffic volume

fig = make_subplots(specs=[[{"secondary_y": True}]])

#Colouring the markers by traffic volume, so that the colour array always matches the number of points
fig.add_trace(go.Scattergl(y = usa_cases_tot['Traffic Volume'],x = usa_cases_tot['Confirmed'], mode='markers',
                                  marker=dict(color=usa_cases_tot['Traffic Volume'],colorscale='Viridis',line_width=1)),secondary_y=False)

fig.update_layout(title='Daily Confirmed Cases (COVID-19) vs. Traffic Volume : US Figures - January 22 - April 14 2020',
                  xaxis=dict(title='Reported Numbers'),yaxis=dict(title='Traffic Volume'))

fig.show()

from scipy.stats import ttest_ind

#Dropping missing values first, since they would turn the test statistic into NaN
traffic_volume = usa_cases_tot['Traffic Volume'].dropna()
sample = traffic_volume.sample(n=min(250, len(traffic_volume)))
test = traffic_volume

#Two-sample t-test: the null hypothesis is that both groups have the same mean
stat, p = ttest_ind(sample, test)
print('Statistics=%.3f, p=%.3f' % (stat, p))

<h3> Observations from the above Heatmap over Population Habits </h3>

1. None of the figures such as the percentage of smokers in the population, obesity, or diabetes tends to affect the spread of COVID-19 infections in the US in general.

2. A correlation is observed between the number of confirmed cases in a county and the traffic congestion in that county (as of 2020). The correlation between the variables is 0.613053. This might be significant, as the quarantine and total isolation measures restricting movement across US counties came late in comparison to countries like India/Korea/China/Japan. Hence asymptomatic carriers of the virus might have spread it, since movement wasn't restricted and traffic congestion in these counties is high.

The p-value is higher than 0.05, so we cannot reject our null hypothesis in this case. However, more research is needed before this can be treated as a firm conclusion.
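
Whether a correlation of this size is statistically significant depends on the number of counties behind it. A common check converts Pearson's r into a t statistic with n - 2 degrees of freedom. The sketch below applies it to the reported r = 0.613053; the sample size `n=100` is a hypothetical placeholder, since the number of counties used is not stated here, and the helper name `correlation_p_value` is introduced only for illustration.

```python
import math
from scipy.stats import t as t_dist

def correlation_p_value(r, n):
    """Two-sided p-value for H0: no linear correlation, given Pearson r and sample size n."""
    # t statistic of the correlation coefficient with n - 2 degrees of freedom
    t_stat = r * math.sqrt((n - 2) / (1 - r ** 2))
    # sf is the survival function (1 - cdf); doubled for a two-sided test
    p_value = 2 * t_dist.sf(abs(t_stat), df=n - 2)
    return t_stat, p_value

# r is the value reported above; n=100 is a hypothetical sample size
t_stat, p_value = correlation_p_value(r=0.613053, n=100)
print('t=%.3f, p=%.3g' % (t_stat, p_value))
```

This tests the correlation itself, which is a different question from comparing a random sample of traffic volumes with the full column.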

# <a id='metric'><h3>Are certain age/gender groups at a higher risk of contracting COVID-19?</h3></a>

Investigating the role of the age/gender of the population and its relation to the COVID-19 infection rate.

<h3> Analysis of the gender and age figures via Graphs </h3>

Constraints : As the data mentioned above is insufficient to draw meaningful conclusions, in this notebook I used publicly available sources such as Statista to analyze various trends of COVID-19 spread on the basis of gender and age. Below are graphical analyses of the spread of COVID-19 across population groups for various countries across the globe.

<img src="https://www.statista.com/graphic/1/1105512/coronavirus-covid-19-deaths-by-gender-germany.jpg" alt="Statistic: Number of coronavirus (COVID-19) deaths in Germany in 2020, by gender and age | Statista" style="width: 100%; height: auto !important; max-width:1000px;-ms-interpolation-mode: bicubic;"/>

<h3> Observation from the graph </h3>

For Germany, we observe the following trends:
1. People aged 60+ have a higher chance of death from COVID-19.
2. Males aged 60+ had a greater chance of death than females of the same age.

We look at multiple graphs available across the web to consolidate the findings.

<img src="https://www.statista.com/graphic/1/1109638/covid-19-deaths-by-age-and-gender-ukraine.jpg" alt="Statistic: Distribution of coronavirus (COVID-19) lethal cases in Ukraine as of April 4, 2020, by age and gender | Statista" style="width: 100%; height: auto !important; max-width:1000px;-ms-interpolation-mode: bicubic;"/>

Similar patterns were observed for Ukraine, where people older than 50 developed lethal symptoms of COVID-19 and were more prone to becoming seriously ill from the disease. A similar statistical report by the Wall Street Journal says more about these figures. The information is cited in the section beneath.

<img src="https://cdn.statcdn.com/Infographic/images/normal/21345.jpeg" alt="Infographic: More Men Dying to COVID-19 Than Women | Statista" width="100%" height="auto" style="width: 100%; height: auto !important; max-width:960px;-ms-interpolation-mode: bicubic;"/>   



Though initial trends suggest that males are more prone to COVID-19 than females, this inference can be set aside, since in most countries males are more likely to move out of their homes, even in quarantine zones.

# <a id='death'><h3>The Decider Factors : Death and Recovery</h3></a>

<h3> Which factor leads to death of people suffering from COVID-19? </h3>




In [None]:
#Importing the clinical spectrum data
clinical_spectrum = pd.read_csv('../input/uncover/UNCOVER/einstein/diagnosis-of-covid-19-and-its-clinical-spectrum.csv')

#Filtering the data to contain the values only for the confirmed COVID-19 Tests
confirmed = clinical_spectrum['sars_cov_2_exam_result'] == 'positive'
clinical_spectrum = clinical_spectrum[confirmed]

#Filtering the datasets
hospitalized_condtion = clinical_spectrum['patient_addmited_to_regular_ward_1_yes_0_no'] == 't'
us_hospitalized_spectra = clinical_spectrum[hospitalized_condtion]


unhospitalized_condtion = clinical_spectrum['patient_addmited_to_regular_ward_1_yes_0_no'] == 'f'
us_unhospitalized_spectra = clinical_spectrum[unhospitalized_condtion]

#Taking mean value of the spectra conditions (numeric columns only, since text columns have no mean)
hospitalized_mean = us_hospitalized_spectra.mean(axis = 0, skipna = True, numeric_only = True)
unhospitalized_mean = us_unhospitalized_spectra.mean(axis = 0, skipna = True, numeric_only = True)

#Making columns for the dataset
hospitalized_mean = hospitalized_mean.to_frame()
hospitalized_mean = hospitalized_mean.reset_index()
hospitalized_mean.columns = ['Parameter','Hospitalized_figures']

unhospitalized_mean = unhospitalized_mean.to_frame()
unhospitalized_mean = unhospitalized_mean.reset_index()
unhospitalized_mean.columns = ['Parameter','Unhospitalized_figures']

#Merging both the dataframes together on the parameter name
hospitalized_mean = hospitalized_mean.merge(unhospitalized_mean, on='Parameter', how='left')

#Dropping the parameters for which a mean is missing (dropna returns a new dataframe, so it is assigned back)
hospitalized_mean = hospitalized_mean.dropna()

#The most important clinical factors
hospitalized_mean['Change'] =  hospitalized_mean['Hospitalized_figures'] - hospitalized_mean['Unhospitalized_figures']
hospitalized_mean = hospitalized_mean.sort_values(['Change'], axis=0, ascending=True)

#Getting to know the health factors that define HCP Requirement for a patient
lower = hospitalized_mean.head(10)
higher = hospitalized_mean.tail(10)

#Printing the values
for i in lower['Parameter']:
    print('For lower value of {}, the patient may require HCP'.format(i))
    
for i in higher['Parameter']:
    print('For higher value of {}, the patient may require HCP'.format(i))
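
The comparison of group means in the cell above can be written more compactly with `groupby`. The sketch below uses a toy frame; the column names `admitted`, `marker_a` and `marker_b` and all values are made up for illustration and do not come from the clinical spectrum dataset.

```python
import pandas as pd

# Toy data (made-up): an admission flag and two standardized clinical markers
toy = pd.DataFrame({
    'admitted': ['t', 't', 'f', 'f'],
    'marker_a': [1.2, 0.8, -0.1, 0.1],
    'marker_b': [-0.9, -1.1, 0.2, 0.0],
})

# Mean of every numeric marker per admission group
group_means = toy.groupby('admitted').mean(numeric_only=True)

# Difference between admitted and non-admitted patients, smallest first
change = (group_means.loc['t'] - group_means.loc['f']).sort_values()
print(change)
```

Markers at the top of the sorted series are lower on average in admitted patients, and markers at the bottom are higher, which mirrors the `lower` and `higher` tables printed above. A difference in means alone does not show that the difference is statistically significant.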

We now look at a recent article published in the New York Times - [Available here](https://www.nytimes.com/2020/04/20/opinion/coronavirus-testing-pneumonia.html). The article states:

Pneumonia caused by the coronavirus has had a stunning impact on the city’s hospital system. Normally an E.R. has a mix of patients with conditions ranging from the serious, such as heart attacks, strokes and traumatic injuries, to the nonlife-threatening, such as minor lacerations, intoxication, orthopedic injuries and migraine headaches.

These patients did not report any sensation of breathing problems, even though their chest X-rays showed diffuse pneumonia and their oxygen was below normal. We are just beginning to recognize that Covid pneumonia initially causes a form of oxygen deprivation we call “silent hypoxia” — “silent” because of its insidious, hard-to-detect nature.

Pneumonia is an infection of the lungs in which the air sacs fill with fluid or pus. Normally, patients develop chest discomfort, pain with breathing and other breathing problems. But when Covid pneumonia first strikes, patients don’t feel short of breath, even as their oxygen levels fall. And by the time they do, they have alarmingly low oxygen levels and moderate-to-severe pneumonia (as seen on chest X-rays).


**The clinical figures mentioned above are related to pneumonia. These figures, when undetected in the population, can be a main source of deaths related to COVID-19. Because they may go undetected, hospitalization is given only once patients have become very serious, which leads to death. Early detection of the figures above can be a good way to reduce deaths by a significant number.**

# <a id='findings'><h3>The findings from the Analyses</h3></a>

1. Median age tends to affect the cases. For a country that has a higher median age, the cases tend to increase.

2. Life expectancy also seems to affect the COVID-19 confirmed cases, with a weak correlation. The effect is more prominent in males than in females. This weak correlation might reflect the medical facilities of the country. Further research on this needs to be carried out.

3. With the rise in temperature, the confirmed cases tend to slow down (negative correlation). The maximum number of cases occurred within a temperature range of -4 to 17 degrees Celsius. Transmission is substantially higher in colder regions and tends to decrease as temperature increases, so people in colder environments and cooler climatic conditions appear more prone to the transmission of COVID-19. However, this effect isn't confirmed for the general population data, as the p-value is higher than 0.05 and we can't reject our null hypothesis.

4. None of the figures such as the percentage of smokers in the population, obesity, or diabetes tends to affect the spread of COVID-19 infections in the US in general.

5. A correlation is observed between the number of confirmed cases in a county and the traffic congestion in that county (as of 2020). The correlation between the variables is 0.613053. This might be significant, as the quarantine and total isolation measures restricting movement across US counties came late in comparison to countries like India/Korea/China/Japan. Hence asymptomatic carriers of the virus might have spread it, since movement wasn't restricted and traffic congestion in these counties is high.

6. For multiple countries, the elderly population (above 50 years of age) was more seriously affected by COVID-19 than the younger population. Even in countries like India, where the younger population contracted COVID-19, the majority of deaths were observed in the elderly population.

7. Though initial trends suggest that males are more prone to COVID-19 than females, this inference can be set aside, since in most countries males are more likely to move out of their homes, even in quarantine zones.