# Task: Covid-19 Data Analysis
### This notebook is used to practice and assess data analysis techniques using the pandas library.

### Data Source: 
https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports

### File naming convention

MM-DD-YYYY.csv in UTC.
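
The naming convention above can be turned into a download URL programmatically. The following is a minimal sketch, assuming the raw-file base URL that this notebook passes to `pd.read_csv`; `report_url` is a hypothetical helper name, not part of the dataset's own tooling.

```python
from datetime import date

# Base URL taken from the read_csv calls used in this notebook.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)


def report_url(day: date) -> str:
    """Build the URL of the daily report for `day` (hypothetical helper)."""
    # Daily report files are named MM-DD-YYYY.csv.
    return f"{BASE_URL}{day.strftime('%m-%d-%Y')}.csv"


print(report_url(date(2021, 1, 1)))
```

The result can be passed directly to `pd.read_csv`.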

### Field description

- Province_State: China - province name; US/Canada/Australia - city name, state/province name; Others - name of the event (e.g., "Diamond Princess" cruise ship); other countries - blank.

- Country_Region: country/region name conforming to WHO (will be updated).

- Last_Update: MM/DD/YYYY HH:mm (24 hour format, in UTC).

- Confirmed: the number of confirmed cases. For Hubei Province: from Feb 13 (GMT +8), we report both clinically diagnosed and lab-confirmed cases. For lab-confirmed cases only (Before Feb 17), please refer to who_covid_19_situation_reports. For Italy, diagnosis standard might be changed since Feb 27 to "slow the growth of new case numbers." (Source)

- Deaths: the number of deaths.

- Recovered: the number of recovered cases.

In [1]:
import pandas as pd

### Question 1

#### Read the dataset

In [47]:
covid_df = pd.read_csv("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-01-2021.csv")
covid_df

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2021-01-02 05:22:33,33.93911,67.709953,52513,2201,41727,8585,Afghanistan,134.896578,4.191343
1,,,,Albania,2021-01-02 05:22:33,41.15330,20.168300,58316,1181,33634,23501,Albania,2026.409062,2.025173
2,,,,Algeria,2021-01-02 05:22:33,28.03390,1.659600,99897,2762,67395,29740,Algeria,227.809861,2.764848
3,,,,Andorra,2021-01-02 05:22:33,42.50630,1.521800,8117,84,7463,570,Andorra,10505.403482,1.034865
4,,,,Angola,2021-01-02 05:22:33,-11.20270,17.873900,17568,405,11146,6017,Angola,53.452981,2.305328
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4006,,,Unknown,Ukraine,2021-01-02 05:22:33,,,0,0,0,0,"Unknown, Ukraine",0.000000,0.000000
4007,,,,Nauru,2021-01-02 05:22:33,-0.52280,166.931500,0,0,0,0,Nauru,0.000000,0.000000
4008,,,Niue,New Zealand,2021-01-02 05:22:33,-19.05440,-169.867200,0,0,0,0,"Niue, New Zealand",0.000000,0.000000
4009,,,,Tuvalu,2021-01-02 05:22:33,-7.10950,177.649300,0,0,0,0,Tuvalu,0.000000,0.000000


#### Display the top 5 rows in the data

In [48]:
covid_df.head()

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2021-01-02 05:22:33,33.93911,67.709953,52513,2201,41727,8585,Afghanistan,134.896578,4.191343
1,,,,Albania,2021-01-02 05:22:33,41.1533,20.1683,58316,1181,33634,23501,Albania,2026.409062,2.025173
2,,,,Algeria,2021-01-02 05:22:33,28.0339,1.6596,99897,2762,67395,29740,Algeria,227.809861,2.764848
3,,,,Andorra,2021-01-02 05:22:33,42.5063,1.5218,8117,84,7463,570,Andorra,10505.403482,1.034865
4,,,,Angola,2021-01-02 05:22:33,-11.2027,17.8739,17568,405,11146,6017,Angola,53.452981,2.305328


#### Show the information of the dataset

In [49]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4011 entries, 0 to 4010
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3265 non-null   float64
 1   Admin2               3270 non-null   object 
 2   Province_State       3833 non-null   object 
 3   Country_Region       4011 non-null   object 
 4   Last_Update          4011 non-null   object 
 5   Lat                  3922 non-null   float64
 6   Long_                3922 non-null   float64
 7   Confirmed            4011 non-null   int64  
 8   Deaths               4011 non-null   int64  
 9   Recovered            4011 non-null   int64  
 10  Active               4011 non-null   int64  
 11  Combined_Key         4011 non-null   object 
 12  Incident_Rate        3922 non-null   float64
 13  Case_Fatality_Ratio  3963 non-null   float64
dtypes: float64(5), int64(4), object(5)
memory usage: 438.8+ KB


#### Show the sum of missing values of features in the dataset

In [50]:
covid_df.isna().sum()

FIPS                   746
Admin2                 741
Province_State         178
Country_Region           0
Last_Update              0
Lat                     89
Long_                   89
Confirmed                0
Deaths                   0
Recovered                0
Active                   0
Combined_Key             0
Incident_Rate           89
Case_Fatality_Ratio     48
dtype: int64
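
Raw counts are easier to judge relative to the number of rows. The sketch below reports the percentage of missing values per column; the small frame is made-up illustrative data, and `missing_share` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def missing_share(df: pd.DataFrame) -> pd.Series:
    """Percentage of missing values per column, largest first (hypothetical helper)."""
    # isna() yields booleans; their mean is the fraction of missing entries.
    return (df.isna().mean() * 100).sort_values(ascending=False)


# Toy frame with made-up values, only to show the shape of the result.
toy = pd.DataFrame(
    {
        "FIPS": [np.nan, np.nan, 1001.0, 1003.0],
        "Country_Region": ["Afghanistan", "Albania", "US", "US"],
        "Lat": [33.9, 41.2, np.nan, 30.7],
    }
)
print(missing_share(toy))
```

On the notebook's data this would be called as `missing_share(covid_df)`.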

### Question 2

#### Show the number of Confirmed cases by Country

In [51]:
covid_df[["Country_Region", "Confirmed"]].groupby("Country_Region",  as_index = False).sum()

Unnamed: 0,Country_Region,Confirmed
0,Afghanistan,52513
1,Albania,58316
2,Algeria,99897
3,Andorra,8117
4,Angola,17568
...,...,...
195,West Bank and Gaza,139223
196,Winter Olympics 2022,0
197,Yemen,2101
198,Zambia,20997


#### Show the number of Deaths by Country

In [52]:
covid_df[["Country_Region", "Deaths"]].groupby("Country_Region",  as_index = False).sum()

Unnamed: 0,Country_Region,Deaths
0,Afghanistan,2201
1,Albania,1181
2,Algeria,2762
3,Andorra,84
4,Angola,405
...,...,...
195,West Bank and Gaza,1418
196,Winter Olympics 2022,0
197,Yemen,610
198,Zambia,390


#### Show the number of Recovered cases by Country

In [53]:
covid_df[["Country_Region", "Recovered"]].groupby("Country_Region",  as_index = False).sum()

Unnamed: 0,Country_Region,Recovered
0,Afghanistan,41727
1,Albania,33634
2,Algeria,67395
3,Andorra,7463
4,Angola,11146
...,...,...
195,West Bank and Gaza,118926
196,Winter Olympics 2022,0
197,Yemen,1396
198,Zambia,18773


#### Show the number of Active Cases by Country

In [54]:
covid_df[["Country_Region", "Active"]].groupby("Country_Region",  as_index = False).sum()

Unnamed: 0,Country_Region,Active
0,Afghanistan,8585
1,Albania,23501
2,Algeria,29740
3,Andorra,570
4,Angola,6017
...,...,...
195,West Bank and Gaza,18879
196,Winter Olympics 2022,0
197,Yemen,95
198,Zambia,1834


#### Show the latest number of Confirmed, Deaths, Recovered and Active cases Country-wise

In [27]:
covid_df["Last_Update"] = pd.to_datetime(covid_df.Last_Update)

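The date and time are the same for every record in this DataFrame, so no record is more recent than another. We therefore take the last record of each country as its latest. Note that `last()` returns a single row per country, so for countries reported at province or county level (such as the US) it reflects only that one row rather than the country total.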

In [55]:
covid_df[["Country_Region", "Last_Update",
          "Active", "Confirmed", 
          "Deaths", "Recovered"]].groupby("Country_Region", 
                                          as_index = False).last()

Unnamed: 0,Country_Region,Last_Update,Active,Confirmed,Deaths,Recovered
0,Afghanistan,2021-01-02 05:22:33,8585,52513,2201,41727
1,Albania,2021-01-02 05:22:33,23501,58316,1181,33634
2,Algeria,2021-01-02 05:22:33,29740,99897,2762,67395
3,Andorra,2021-01-02 05:22:33,570,8117,84,7463
4,Angola,2021-01-02 05:22:33,6017,17568,405,11146
...,...,...,...,...,...,...
195,West Bank and Gaza,2021-01-02 05:22:33,18879,139223,1418,118926
196,Winter Olympics 2022,2021-01-02 05:22:33,0,0,0,0
197,Yemen,2021-01-02 05:22:33,95,2101,610,1396
198,Zambia,2021-01-02 05:22:33,1834,20997,390,18773
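
Because every row carries the same timestamp, another reading of "latest" is each country's total across all of its rows, paired with the most recent `Last_Update`. The following is a minimal sketch of that alternative using named aggregation; `country_totals` is a hypothetical helper and the toy frame holds made-up values.

```python
import pandas as pd


def country_totals(df: pd.DataFrame) -> pd.DataFrame:
    """Sum the case columns per country and keep the newest Last_Update."""
    return df.groupby("Country_Region", as_index=False).agg(
        Last_Update=("Last_Update", "max"),
        Active=("Active", "sum"),
        Confirmed=("Confirmed", "sum"),
        Deaths=("Deaths", "sum"),
        Recovered=("Recovered", "sum"),
    )


# Toy data (made-up): country "A" is split over two province rows.
toy = pd.DataFrame(
    {
        "Country_Region": ["A", "A", "B"],
        "Last_Update": pd.to_datetime(["2021-01-02 05:22:33"] * 3),
        "Active": [1, 2, 3],
        "Confirmed": [10, 20, 30],
        "Deaths": [0, 1, 2],
        "Recovered": [9, 17, 25],
    }
)
totals = country_totals(toy)
print(totals)
```

Unlike `last()`, this adds up both rows of country "A" instead of returning only the second one.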


### Question 3

#### Show the countries with no recovered cases

In [60]:
no_recovered_df = covid_df[["Country_Region", "Recovered"]].groupby("Country_Region", as_index = False).sum()
no_recovered_df[no_recovered_df["Recovered"] == 0]

Unnamed: 0,Country_Region,Recovered
5,Antarctica,0
17,Belgium,0
92,Kiribati,0
93,"Korea, North",0
125,Nauru,0
136,Palau,0
156,Serbia,0
169,Summer Olympics 2020,0
171,Sweden,0
180,Tonga,0


#### Show the countries with no confirmed cases

In [62]:
no_confirmed_df = covid_df[["Country_Region", "Confirmed"]].groupby("Country_Region", as_index = False).sum()
no_confirmed_df[no_confirmed_df["Confirmed"] == 0]

Unnamed: 0,Country_Region,Confirmed
5,Antarctica,0
92,Kiribati,0
93,"Korea, North",0
125,Nauru,0
136,Palau,0
169,Summer Olympics 2020,0
180,Tonga,0
184,Tuvalu,0
196,Winter Olympics 2022,0


#### Show the countries with no deaths

In [65]:
no_deaths_df = covid_df[["Country_Region", "Deaths"]].groupby("Country_Region", as_index = False).sum()
no_deaths_df[no_deaths_df["Deaths"] == 0]

Unnamed: 0,Country_Region,Deaths
5,Antarctica,0
20,Bhutan,0
31,Cambodia,0
51,Dominica,0
70,Grenada,0
76,Holy See,0
92,Kiribati,0
93,"Korea, North",0
98,Laos,0
114,Marshall Islands,0
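
The three cells above repeat the same aggregate-then-filter step. A minimal sketch of a reusable version follows; the helper names are hypothetical and the toy frame holds made-up values. The second helper separates countries that have confirmed cases but zero recoveries, which may indicate that recoveries were not reported rather than that nobody recovered (an assumption, not something the data states).

```python
import pandas as pd

METRICS = ["Confirmed", "Deaths", "Recovered"]


def countries_with_zero(df: pd.DataFrame, metric: str) -> list:
    """Countries whose country-level total for `metric` is zero."""
    totals = df.groupby("Country_Region")[METRICS].sum()
    return totals[totals[metric] == 0].index.tolist()


def confirmed_but_no_recovered(df: pd.DataFrame) -> list:
    """Countries with confirmed cases but a recovered total of zero."""
    totals = df.groupby("Country_Region")[METRICS].sum()
    mask = (totals["Confirmed"] > 0) & (totals["Recovered"] == 0)
    return totals[mask].index.tolist()


# Toy data (made-up): "Z" is split over two rows.
toy = pd.DataFrame(
    {
        "Country_Region": ["X", "Y", "Z", "Z"],
        "Confirmed": [0, 5, 4, 6],
        "Deaths": [0, 0, 1, 0],
        "Recovered": [0, 0, 1, 3],
    }
)
print(countries_with_zero(toy, "Recovered"))
print(confirmed_but_no_recovered(toy))
```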


### Question 4

#### Show the Top 10 countries with Confirmed cases

In [73]:
covid_df[["Country_Region", "Confirmed", "Deaths", "Recovered"]].groupby("Country_Region",  as_index = False).sum().sort_values("Confirmed", ascending=False).head(10)

Unnamed: 0,Country_Region,Confirmed,Deaths,Recovered
185,US,20397401,352844,0
80,India,10305788,149218,9929568
24,Brazil,7703971,195541,6855372
146,Russia,3153960,56798,2553467
63,France,2697014,64891,200822
189,United Kingdom,2549671,95917,5682
183,Turkey,2220855,21093,2114760
86,Italy,2129376,74621,1479988
166,Spain,1928265,50837,150376
67,Germany,1721839,33071,1388744


#### Show the Top 10 Countries with Active cases

In [74]:
covid_df[["Country_Region", "Confirmed", "Deaths", "Recovered", "Active"]].groupby("Country_Region",  as_index = False).sum().sort_values("Active", ascending=False).head(10)

Unnamed: 0,Country_Region,Confirmed,Deaths,Recovered,Active
185,US,20397401,352844,0,19978335
189,United Kingdom,2549671,95917,5682,2469774
63,France,2697014,64891,200822,2431301
166,Spain,1928265,50837,150376,1727052
117,Mexico,1437185,126507,1083768,1310678
140,Peru,1015137,93231,951318,921906
127,Netherlands,816616,11624,9651,795341
24,Brazil,7703971,195541,6855372,649795
17,Belgium,648289,19581,0,637588
86,Italy,2129376,74621,1479988,574767
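
The sort-then-head pattern used above can also be written with `DataFrame.nlargest`. A minimal sketch, with a hypothetical helper name and made-up toy values:

```python
import pandas as pd

CASE_COLUMNS = ["Confirmed", "Deaths", "Recovered", "Active"]


def top_countries(df: pd.DataFrame, metric: str, n: int = 10) -> pd.DataFrame:
    """Return the n countries with the largest country-level total of `metric`."""
    totals = df.groupby("Country_Region", as_index=False)[CASE_COLUMNS].sum()
    return totals.nlargest(n, metric)


# Toy data (made-up): "A" is split over two rows.
toy = pd.DataFrame(
    {
        "Country_Region": ["A", "A", "B", "C"],
        "Confirmed": [5, 7, 20, 1],
        "Deaths": [1, 0, 2, 0],
        "Recovered": [3, 4, 10, 1],
        "Active": [1, 3, 8, 0],
    }
)
print(top_countries(toy, "Confirmed", n=2))
```

With the notebook's data, `top_countries(covid_df, "Active")` would correspond to the second cell above.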


### Question 5

#### Plot country-wise total Deaths, Confirmed, Recovered and Active cases where total deaths have exceeded 50,000

In [80]:
import plotly.graph_objects as go

In [78]:
grouped_df1 = covid_df[["Country_Region", "Confirmed", "Active", "Recovered", "Deaths"]].groupby("Country_Region", as_index = False).sum()
# Keep only countries whose total deaths exceed 50,000 (strictly greater, matching the heading).
grouped_df1 = grouped_df1[grouped_df1["Deaths"] > 50000]

In [79]:
fig = go.Figure()

fig.add_trace(go.Scatter(x=grouped_df1["Country_Region"], y=grouped_df1["Confirmed"],
                    mode='lines', name='Confirmed'))
fig.add_trace(go.Scatter(x=grouped_df1["Country_Region"], y=grouped_df1["Deaths"],
                    mode='lines', name='Deaths'))
fig.add_trace(go.Scatter(x=grouped_df1["Country_Region"], y=grouped_df1["Recovered"],
                    mode='lines', name='Recovered'))
fig.add_trace(go.Scatter(x=grouped_df1["Country_Region"], y=grouped_df1["Active"],
                    mode='lines', name='Active'))

fig.show()
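
Countries are categories rather than points on a continuous axis, so grouped bars may read more naturally than connected lines. The sketch below reshapes the totals into long format with `melt` and passes them to `px.bar`; the helper names, the toy values and the choice of a log-scaled y-axis are assumptions made for illustration.

```python
import pandas as pd

CASE_COLUMNS = ["Confirmed", "Active", "Recovered", "Deaths"]


def deaths_over(df: pd.DataFrame, threshold: int = 50_000) -> pd.DataFrame:
    """Country totals restricted to countries with more than `threshold` deaths."""
    totals = df.groupby("Country_Region", as_index=False)[CASE_COLUMNS].sum()
    return totals[totals["Deaths"] > threshold]


def to_long(totals: pd.DataFrame) -> pd.DataFrame:
    """One row per (country, metric) pair, the shape px.bar expects for grouping."""
    return totals.melt(id_vars="Country_Region", var_name="Metric", value_name="Cases")


def plot_grouped_bars(long_df: pd.DataFrame):
    # Imported here so the reshaping helpers can be used without plotly installed.
    import plotly.express as px

    return px.bar(long_df, x="Country_Region", y="Cases", color="Metric",
                  barmode="group", log_y=True)


# Toy data (made-up) with a small threshold so that only "P" qualifies.
toy = pd.DataFrame(
    {
        "Country_Region": ["P", "Q", "Q"],
        "Confirmed": [900, 300, 200],
        "Active": [100, 50, 30],
        "Recovered": [740, 220, 160],
        "Deaths": [60, 30, 10],
    }
)
long_toy = to_long(deaths_over(toy, threshold=50))
print(long_toy)
```

With the notebook's data, `plot_grouped_bars(to_long(deaths_over(covid_df))).show()` would draw the chart.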

### Question 6

#### Plot Province/State wise Deaths in USA

In [87]:
import plotly.express as px

In [81]:
covid_data= pd.read_csv('https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/01-09-2021.csv')

In [82]:
covid_data.columns

Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')

In [101]:
usa_data = covid_data[covid_data["Country_Region"] == "US"]

In [108]:
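# groupby drops rows whose key is missing, so label countries without a Province_State first.
all_country_data = (
    covid_data[["Country_Region", "Province_State", "Active", "Confirmed", "Deaths"]]
    .fillna({"Province_State": "Not reported"})
    .groupby(["Country_Region", "Province_State"], as_index=False)
    .sum()
)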

In [102]:
# px.histogram builds the figure itself, so no empty go.Figure() is needed first.
fig = px.histogram(usa_data, x='Province_State', y='Deaths', title="State wise Deaths in USA")

fig.show()
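
The US rows are reported per county, so each state appears many times in `usa_data`. Aggregating explicitly makes the per-state totals available as a table that can be sorted and inspected before plotting. A minimal sketch, with a hypothetical helper name and made-up toy values:

```python
import pandas as pd


def state_totals(df: pd.DataFrame, metric: str, country: str = "US") -> pd.Series:
    """Total of `metric` per Province_State for one country, largest first."""
    subset = df[df["Country_Region"] == country]
    return subset.groupby("Province_State")[metric].sum().sort_values(ascending=False)


# Toy data (made-up): two county rows for Alabama, one for Alaska.
toy = pd.DataFrame(
    {
        "Country_Region": ["US", "US", "US", "Canada"],
        "Province_State": ["Alabama", "Alabama", "Alaska", "Ontario"],
        "Deaths": [2, 3, 1, 9],
    }
)
print(state_totals(toy, "Deaths"))
```

The same helper covers the Active and Confirmed questions by changing `metric`, and `state_totals(covid_data, "Deaths").reset_index()` can be passed to `px.bar` with `x="Province_State"` and `y="Deaths"`.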

To plot for all the countries:

In [109]:
fig = px.histogram(all_country_data, x="Country_Region", y="Deaths", color="Province_State", title="Country-wise Deaths")
fig.show()

### Question 7

#### Plot Province/State Wise Active Cases in USA

In [103]:
# px.histogram builds the figure itself, so no empty go.Figure() is needed first.
fig = px.histogram(usa_data, x='Province_State', y='Active', title="State wise Active Cases in USA")

fig.show()

In [111]:

fig = px.histogram(all_country_data, x="Country_Region", y="Active", color="Province_State", title="Country-wise Active cases")
fig.show()

### Question 8

#### Plot Province/State Wise Confirmed cases in USA

In [104]:
# px.histogram builds the figure itself, so no empty go.Figure() is needed first.
fig = px.histogram(usa_data, x='Province_State', y='Confirmed', title="State wise Confirmed Cases in USA")

fig.show()

In [112]:

fig = px.histogram(all_country_data, x="Country_Region", y="Confirmed", color="Province_State", title="Country-wise Confirmed cases")
fig.show()

### Question 9

#### Plot Worldwide Confirmed Cases over time

In [24]:
import plotly.express as px

In [114]:
covid_data["Last_Update"] = pd.to_datetime(covid_data.Last_Update)

In [117]:
covid_data["Last_Update"]

0      2021-01-10 05:22:12
1      2021-01-10 05:22:12
2      2021-01-10 05:22:12
3      2021-01-10 05:22:12
4      2021-01-10 05:22:12
               ...        
4007   2021-01-10 05:22:12
4008   2021-01-10 05:22:12
4009   2021-01-10 05:22:12
4010   2021-01-10 05:22:12
4011   2021-01-10 05:22:12
Name: Last_Update, Length: 4012, dtype: datetime64[ns]

In [118]:
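# Note: this file is a single daily snapshot, so every row shares one Last_Update value
# and the chart contains only one time point per country.
fig = px.line(covid_data, x='Last_Update', y="Confirmed", color = "Country_Region", title = "Worldwide Confirmed Cases over time")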
fig.show()
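
A trend over time needs more than one daily report. The following is a minimal sketch that loads several daily files and sums `Confirmed` per date; the helper names are hypothetical, the date range is only an example, and the sketch assumes every loaded file has a `Confirmed` column (older reports in the repository may use a different layout).

```python
import pandas as pd

# Base URL taken from the read_csv calls used in this notebook.
BASE_URL = (
    "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/"
    "csse_covid_19_data/csse_covid_19_daily_reports/"
)


def load_reports(days) -> dict:
    """Download one daily report per date (requires network access)."""
    return {day: pd.read_csv(f"{BASE_URL}{day.strftime('%m-%d-%Y')}.csv") for day in days}


def worldwide_confirmed(frames: dict) -> pd.DataFrame:
    """One row per date with the worldwide sum of Confirmed, ordered by date."""
    rows = [
        {"Date": day, "Confirmed": frame["Confirmed"].sum()}
        for day, frame in frames.items()
    ]
    return pd.DataFrame(rows).sort_values("Date").reset_index(drop=True)


# Offline illustration with made-up numbers, given out of order on purpose.
toy_frames = {
    pd.Timestamp("2021-01-02"): pd.DataFrame({"Confirmed": [3, 4]}),
    pd.Timestamp("2021-01-01"): pd.DataFrame({"Confirmed": [1, 2]}),
}
series = worldwide_confirmed(toy_frames)
print(series)
```

With network access, `worldwide_confirmed(load_reports(pd.date_range("2021-01-01", "2021-01-09")))` would produce a frame suitable for `px.line(..., x="Date", y="Confirmed")`.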