In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

# UNCOVER COVID-19 CHALLENGE

**Challenge Description**

The Roche Data Science Coalition (RDSC) is requesting the collaborative effort of the AI community to fight COVID-19. This challenge presents a curated collection of datasets from 20 global sources and asks you to model solutions to key questions that were developed and evaluated by a global frontline of healthcare providers, hospitals, suppliers, and policy makers.

**Dataset Description**

This dataset is composed of a curated collection of over 200 publicly available COVID-19 related datasets from sources like Johns Hopkins, the WHO, the World Bank, the New York Times, and many others. It includes data on a wide variety of potentially powerful statistics and indicators, like local and national infection rates, global social distancing policies, geospatial data on movement of people, and more.

**Challenge Details**

The tasks associated with this dataset were developed and evaluated by global frontline healthcare providers, hospitals, suppliers, and policy makers. They represent key research questions where insights developed by the Kaggle community can be most impactful in the areas of at-risk population evaluation and capacity management.
To participate in this challenge, review the research questions posed in the dataset tasks and submit solutions in the form of Kaggle Notebooks.

We encourage participants to use the presented data and if needed, their own proprietary and non-proprietary datasets to create their submissions.

**Timeline**

The goal of the UNCOVER challenge is to connect the AI community with the frontline of responders to this global crisis, therefore the Roche Data Science Coalition will be evaluating solutions and surfacing this research to experts on the following schedule:

* Wednesday, April 22nd
* Wednesday, May 6th
* Wednesday, May 20th

On each of those dates, each task will have one submission identified as the best response to the research question posed in the task. That submission will be marked as the “accepted solution” to that task, and will be reevaluated by the next deadline against the new research contributed to that task.

Submissions will be reviewed on a rolling basis, so participants are encouraged to work publicly and collaboratively to accelerate the research available for each task. Roche Canada will be inviting the authors of accepted task submissions to present their solutions to a panel of Roche leadership for potential application and scale in various regions across the globe.

**Accessing the Data**

Datasets have been made available here on Kaggle and are intermittently being updated from their respective sources.

You may also access the datasets through the Namara platform to get the most up to date version of each dataset, thanks to our collaborators at ThinkData Works.

Details on the provenance of each dataset are available in the file descriptions of each folder.

**Acknowledgements**

Hoffmann-La Roche Limited (Roche Canada) is committed to working with the global community to develop solutions to the challenges of the SARS-CoV-2 (COVID-19) pandemic. We believe that an important way in which the world can win this fight is through the sharing of knowledge and healthcare data to better inform patient care and health system decision making.

To help achieve this, we have assembled a group of like-minded public and private organizations with a common mission and vision to bring actionable intelligence to patients, frontline healthcare providers, institutions, supply chains, and government. We call ourselves the Roche Data Science Coalition.

# TASK 1

Here, the dataset **'canadian_outbreak_tracker/canada-cumulative-case-count-by-new-hybrid-regional-health-boundaries.csv'** is used for the first task of the challenge: identifying the populations at risk of contracting COVID-19 through exploratory data analysis (EDA).


In [None]:
#importing the dataset
p=pd.read_csv('/kaggle/input/uncover/UNCOVER/canadian_outbreak_tracker/canada-cumulative-case-count-by-new-hybrid-regional-health-boundaries.csv')
p.head()#printing first 5 rows

In [None]:
#checking for missing values
p.isnull().any()

In [None]:
#using missingno library to identify missing values visually 
import missingno as msno
msno.matrix(p)

In [None]:
#determining the datatypes of features
p.info()

# EXPLORATORY DATA ANALYSIS
**HANDLING MISSING VALUES**

Here, three features of the dataset `p` contain missing values: 'deaths', 'recovered' and 'tests'.
The missing values of 'deaths' and 'recovered' are replaced with 0 using the pandas **fillna()** function, whereas the missing values of 'tests' are replaced with the column mean using **numpy.where()**. **(The mean can't be used for deaths and recovered cases, because the imputed values could exceed the case count, which is impossible.)**

In [None]:
p['deaths']=p['deaths'].fillna(0)
p['recovered']=p['recovered'].fillna(0)
p['tests']=np.where(p['tests'].isnull(),p['tests'].mean(),p['tests'])
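
The imputation above can be checked on a small toy frame. This is only a sketch: the values are hypothetical, and only the column names mirror the notebook. It uses `fillna()` for all three columns, which should give the same result as the `np.where` call for `tests`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the notebook.
toy = pd.DataFrame({
    "casecount": [10, 20, 30],
    "deaths": [1.0, np.nan, 3.0],
    "recovered": [np.nan, 5.0, 6.0],
    "tests": [100.0, np.nan, 300.0],
})

# Missing deaths/recovered are treated as zero reported events.
toy[["deaths", "recovered"]] = toy[["deaths", "recovered"]].fillna(0)

# Missing tests are replaced by the column mean (NaN is skipped when averaging).
toy["tests"] = toy["tests"].fillna(toy["tests"].mean())

# No missing values should remain in the imputed columns.
remaining_nulls = int(toy[["deaths", "recovered", "tests"]].isnull().sum().sum())
print(toy)
print("remaining nulls:", remaining_nulls)
```

Filling `deaths` and `recovered` with 0 keeps them at or below `casecount`, which is the reason given above for not using the mean there.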

In [None]:
#checking again
msno.matrix(p)

In [None]:
#display of first 5 rows without null values
p.head()

**Where a region's case count has no deaths and no recovered cases (both 0), those cases may still be active. Active cases are therefore estimated as the case count minus recovered cases and deaths.**

In [None]:
#finding active cases and storing it in the dataframe 
p['active']=p['casecount']-p['recovered']-p['deaths']
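
As a sanity check on the active-case formula, the sketch below applies it to a toy frame with hypothetical values and flags rows where the result is negative. Whether such rows exist in the real dataset is an assumption worth testing, not something the notebook establishes.

```python
import pandas as pd

# Hypothetical values; the last row is deliberately inconsistent.
toy = pd.DataFrame({
    "casecount": [10, 20, 5],
    "recovered": [4, 0, 3],
    "deaths": [1, 0, 4],
})

# Same formula as in the notebook: active = cases - recovered - deaths.
toy["active"] = toy["casecount"] - toy["recovered"] - toy["deaths"]

# A negative active count means recovered + deaths exceed the reported cases.
inconsistent = toy[toy["active"] < 0]
print(toy)
print("inconsistent rows:", inconsistent.index.tolist())
```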

In [None]:
#viewing the changes
p.head()

**REMOVING IRRELEVANT COLUMNS**

Features such as 'frename', 'shape_area', 'shape_length', 'last_updated', 'sourceurl', 'globalid' and 'retrieved_at' aren't relevant to the task, so they will be removed.

In [None]:
#removing the irrelevant columns (selected by position)
p.drop(p.columns[[4,30,31,32,33,34,35]],axis=1,inplace=True)
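
Dropping by position depends on the column order staying fixed. A name-based alternative is sketched below on a toy frame. The names are taken from the list above and are assumed to match the dataset's headers exactly; `errors="ignore"` skips any that are absent.

```python
import pandas as pd

# Toy frame with hypothetical values and only a few of the real columns.
toy = pd.DataFrame({
    "engname": ["Region A"],
    "frename": ["Region A (fr)"],
    "casecount": [10],
    "shape_area": [1.5],
    "globalid": ["abc"],
})

# Column names as listed in the text; assumed to match the CSV headers.
irrelevant = ["frename", "shape_area", "shape_length", "last_updated",
              "sourceurl", "globalid", "retrieved_at"]

# errors="ignore" drops the names that exist and silently skips the rest.
trimmed = toy.drop(columns=irrelevant, errors="ignore")
print(trimmed.columns.tolist())
```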

In [None]:
p.head()

**DATA VISUALISATION**

Here, the data is presented as graphs. Since the task is to find the populations at risk of contracting the disease, each age category's population is considered against the deaths that occurred in that population.

In [None]:
#importing visualisation libraries
import seaborn as sns
import matplotlib.pyplot as plt
#taking all population age groups in separate dataframe for easier implementation
y=p[["pop0to4_2019", "pop5to9_2019","pop10to14_2019","pop15to19_2019",
'pop20to24_2019',
'pop25to29_2019',
'pop30to34_2019',
'pop35to39_2019',
'pop40to44_2019',
'pop45to49_2019',
'pop50to54_2019',
'pop55to59_2019',
'pop60to64_2019',
'pop65to69_2019',
'pop70to74_2019',
'pop75to79_2019','pop80to84_2019','pop85older']]

In [None]:
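#plotting each age category against the deaths that occurred
plt.figure(figsize=(40,30))
plt.subplots_adjust(hspace=1.0)
#enumerate numbers the subplots from 1, replacing the manual counter
for j,i in enumerate(y.columns,start=1):
    plt.subplot(4,5,j)
    #x and y are passed as keywords; recent seaborn releases reject positional data arguments
    sns.scatterplot(x=p['deaths'],y=p[i])
    plt.ylabel(i)
    plt.xlabel('deaths')
    plt.xticks(rotation=90)
plt.suptitle('DEATHS PER AGE POPULATION CATEGORY',fontsize=41)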

**OBSERVATIONS**

By plotting the deaths that occurred in each age category, from '0 to 4' (pop0to4_2019) to '85 or older' (pop85older), we divide the risk of contraction into three categories (considering only the maximum level of deaths):
1. **LOW RISK**
2. **HIGH RISK**
3. **SEVERE RISK**

# LOW RISK

This category covers ages 20 to 39, where the maximum number of deaths (here, 32) occurs per 1.3 to 2.5 lakh (130,000 to 250,000) individuals. The 20 to 24 age group shows 32 deaths per 2 lakh individuals in Canada, and the population rises gradually to 2.5 lakh for ages 25 to 39. It is considered low risk because deaths occur at a slow rate while the population increases gradually, possibly because these individuals have stronger immune systems or a higher recovery rate. (graphs 5, 6, 7, 8)


# HIGH RISK

**For population group of age between 0 to 19**

This group covers ages 0 to 19, where the maximum number of deaths (here, 32) occurs per 1.3 to 1.5 lakh individuals. The 0 to 4 age group shows 32 deaths per 1.5 lakh infants and children in Canada, and the population falls gradually to 1.3 lakh for ages 5 to 19. It is considered high risk because the population shrinks as the ratio moves from the maximum deaths per 1.5 lakh to the maximum deaths per 1.3 lakh individuals, possibly because young people with weaker or compromised immune systems are less able to withstand the disease. (graphs 1, 2, 3, 4)

**For population group of age between 40 to 59**

This group covers ages 40 to 59, where the maximum number of deaths (here, 32) occurs per 2.0 to 2.1 lakh individuals. Moving on from the low risk category (a population of up to 2.5 lakh until age 39), the population decreases at a faster rate: it drops to the maximum deaths per 2.1 lakh for ages 40 to 44, and then to the maximum deaths per 2.0 lakh individuals for ages 45 to 59. This may happen because the immune system weakens with age, or because other diseases are present, causing individuals to succumb to the disease. (graphs 9, 10, 11, 12)

# SEVERE RISK

This category covers the population above 60 years of age, which has the worst ratio: 32 deaths per 1.8 lakh down to 70,000 individuals. The population declines gradually between ages 60 and 74 and then drops sharply, reaching the maximum deaths per 70,000 individuals. This is likely due to the older population having weaker immune systems. (graphs 13, 14, 15, 16, 17, 18)
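
The "maximum deaths per population" figures above are read off the scatter plots. A minimal sketch of the same lookup in code follows, using a toy frame with hypothetical values and only two of the age columns. Note that `deaths` appears to be a regional total rather than an age-specific count, so the resulting ratio is a rough indicator, not an age-specific mortality rate.

```python
import pandas as pd

# Hypothetical values; column names mirror the notebook's age-group columns.
toy = pd.DataFrame({
    "deaths": [2, 32, 7],
    "pop20to24_2019": [50_000, 200_000, 90_000],
    "pop85older": [20_000, 70_000, 30_000],
})
age_cols = ["pop20to24_2019", "pop85older"]

# Row (region) with the maximum number of deaths.
worst_row = toy.loc[toy["deaths"].idxmax()]

# Maximum deaths expressed per 100,000 people of each age group in that region.
deaths_per_100k = worst_row["deaths"] / worst_row[age_cols] * 100_000
print(deaths_per_100k.round(1))
```

A smaller age-group population at the same death count gives a higher ratio, which is the pattern the risk categories describe.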


In [None]:
#converting average age and median age to integers so that they plot as discrete values
p['averageage_2019']=p['averageage_2019'].astype('int64')
p['medianage_2019']=p['medianage_2019'].astype('int64')

In [None]:
p.info()

In [None]:
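#plotting deaths against the average age of each region
plt.figure(figsize=(20,20))
#x and y are passed as keywords; recent seaborn releases reject positional data arguments
sns.barplot(x='averageage_2019',y='deaths',data=p,ci=None)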

*Most deaths occur where the average age is 40 or above.*

In [None]:
#plotting deaths against the median age of each region
plt.figure(figsize=(20,20))
#x and y are passed as keywords; recent seaborn releases reject positional data arguments
sns.barplot(x='medianage_2019',y='deaths',data=p,ci=None)

*Individuals aged 41 or lower are more likely to survive than individuals older than 41.*

In [None]:
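#identifying the areas with the most deaths
plt.figure(figsize=(30,30))
#x and y are passed as keywords; recent seaborn releases reject positional data arguments
sns.barplot(x='deaths',y='engname',hue='province',data=p,ci=None)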

# Worst cases are observed among populations residing in:

**BC PROVINCE**

* ***Vancouver Coastal Health***

* ***Fraser Health***

**ON PROVINCE**

* ***Haliburton, Kawartha, Pine Ridge District Health Unit***

* ***Durham Regional Health Unit***

* ***City of Toronto Health Unit***

* ***York Regional Health Unit***

**AB PROVINCE**

* ***Calgary Zone***
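
The same ranking can be produced as a table instead of being read off the bar chart. The sketch below uses a toy frame with hypothetical regions and values; only the column names `engname`, `province` and `deaths` come from the notebook.

```python
import pandas as pd

# Hypothetical regions and death counts.
toy = pd.DataFrame({
    "engname": ["Region A", "Region B", "Region C", "Region D"],
    "province": ["BC", "ON", "ON", "AB"],
    "deaths": [12, 30, 4, 9],
})

# Health regions with the highest death counts.
top_regions = toy.nlargest(2, "deaths")[["province", "engname", "deaths"]]

# Total deaths per province, highest first.
deaths_by_province = (
    toy.groupby("province")["deaths"].sum().sort_values(ascending=False)
)
print(top_regions)
print(deaths_by_province)
```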





In [None]:
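#plotting case counts against the average age of each region
plt.figure(figsize=(30,30))
#x and y are passed as keywords; recent seaborn releases reject positional data arguments
sns.barplot(x='averageage_2019',y='casecount',data=p,ci=None)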

*Since people above 40 are more likely to be infected, as the previous observations suggest, their case counts are also the highest.*

**If you like this notebook do upvote it.**

Do provide your valuable feedback.

Do checkout my other notebooks at https://www.kaggle.com/tmchls