## Notes
### Useful links
- blog about data sources: https://towardsdatascience.com/a-short-review-of-covid-19-data-sources-ba7f7aa1c342
- GitHub repo of Johns Hopkins CSSE: https://github.com/CSSEGISandData/COVID-19
### Learning pandas
- useful link to learn pandas: https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
- the name pandas is derived from "panel data", an econometrics term for data sets that include observations of the same individuals over multiple time periods
- pandas is built on top of the NumPy package, so much of NumPy's structure is used or replicated in pandas. Data in pandas is often fed into statistical analysis in SciPy, plotting functions from Matplotlib, and machine learning algorithms in scikit-learn
- the two primary components of pandas are the Series and the DataFrame
- a Series is essentially a single column, and a DataFrame is a two-dimensional table made up of a collection of Series
- handling missing data values: most commonly you'll see Python's `None` or NumPy's `np.nan`, which are handled differently in some situations
- two options for dealing with missing data values: get rid of them, or replace them (called imputation)
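The notes above on Series, DataFrames and missing values can be sketched on a small made-up table. The frame `toy`, its column names and its values are illustrative assumptions, not taken from the JHU data:

```python
import numpy as np
import pandas as pd

# A Series is a single labelled column (values here are made up).
area = pd.Series(["Hubei", None, "Ontario"], name="area")

# A DataFrame is a table whose columns are Series sharing one index.
toy = pd.DataFrame({"area": area, "deaths": [17.0, np.nan, 0.0]})

# None and np.nan are both reported as missing by isnull().
missing_per_column = toy.isnull().sum()

# One difference between the two: None placed in numeric data
# is converted to NaN, so the Series becomes float64.
numeric_with_none = pd.Series([1, None])

# Option 1: drop every row that contains at least one missing value.
dropped = toy.dropna()

# Option 2: impute, i.e. replace missing values with a chosen filler per column.
imputed = toy.fillna({"area": "Unknown", "deaths": 0.0})

print(missing_per_column)
print(numeric_with_none.dtype)
print(dropped)
print(imputed)
```

Dropping is only reasonable when few rows are affected; imputation keeps every row but the filler value (here `'Unknown'` and `0.0`) is a choice that can bias later statistics.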

In [61]:
import pandas as pd

#reading data from GitHub
jhCsseURL='https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_deaths_global.csv'
frm = pd.read_csv(jhCsseURL,index_col=1)

#dataset information
frm.shape
frm.head()
frm.tail()
#frm.info()

#data cleaning
##remove duplicates
frm.shape
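tempfrm=frm #NOTE: this is an alias, not a copy, so frm is deduplicated as well (use frm.copy() for an independent frame)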
tempfrm.drop_duplicates(inplace=True) # OR tempfrm=tempfrm.drop_duplicates()
#tempfrm.drop_duplicates(inplace=True,keep='first') #remove all duplicates except first (DEFAULT)
#tempfrm.drop_duplicates(inplace=True,keep='last') #remove all duplicates except last
#tempfrm.drop_duplicates(inplace=True,keep=False) #remove all duplicates
tempfrm.shape
##cleaning columns
frm.columns
frm.rename(columns={'Province/State':'Area'},inplace=True) #renaming to simpler names
frm.columns
frm.columns=[col.lower() for col in frm.columns] #lowercase for all colnames
frm.columns
##dealing with missing data values
frm.isnull() #returns True/False for each cell
frm.isnull().sum() #counts the null values in each column
#frm.dropna(inplace=True)#will delete any row with at least a single null value (only use if very few na)
#frm.dropna(axis=1,inplace=True)#will delete any column with at least a single null value (only use if very few na)
area=frm['area']
area.head()
frm['area']=frm['area'].fillna('Unknown') #imputing null values with 'Unknown'; assigning back to frm instead of relying on inplace changes to a selected column
frm.isnull().sum()
frm.head()

#data statistics
#distribution
frm.describe()
frm['area'].describe() #can be performed on categorical data also
frm['area'].value_counts() #gives frequency of all the categories
#correlation
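frm.corr(numeric_only=True) #generates correlation matrix; numeric_only=True skips text columns such as 'area' (needed in pandas 2.0+, where they are no longer dropped silently)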


| | lat | long | 1/22/20 | 1/23/20 | 1/24/20 | 1/25/20 | 1/26/20 | 1/27/20 | 1/28/20 | 1/29/20 | ... | 3/25/20 | 3/26/20 | 3/27/20 | 3/28/20 | 3/29/20 | 3/30/20 | 3/31/20 | 4/1/20 | 4/2/20 | 4/3/20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| lat | 1.000000 | -0.140507 | 0.023716 | 0.026359 | 0.028360 | 0.026515 | 0.026932 | 0.026457 | 0.025386 | 0.025816 | ... | 0.103567 | 0.107285 | 0.110694 | 0.113666 | 0.116205 | 0.118741 | 0.123155 | 0.127146 | 0.131927 | 0.135580 |
| long | -0.140507 | 1.000000 | 0.078559 | 0.083255 | 0.085696 | 0.082878 | 0.085071 | 0.085109 | 0.082551 | 0.083745 | ... | 0.001667 | -0.004373 | -0.010843 | -0.016954 | -0.021901 | -0.026764 | -0.033680 | -0.039450 | -0.046403 | -0.051842 |
| 1/22/20 | 0.023716 | 0.078559 | 1.000000 | 0.998274 | 0.998274 | 0.999378 | 0.999269 | 0.999491 | 0.999812 | 0.999687 | ... | 0.334174 | 0.301585 | 0.269236 | 0.241892 | 0.221255 | 0.201755 | 0.184001 | 0.168638 | 0.152316 | 0.138884 |
| 1/23/20 | 0.026359 | 0.083255 | 0.998274 | 1.000000 | 0.998985 | 0.999114 | 0.998660 | 0.998523 | 0.998546 | 0.998418 | ... | 0.333187 | 0.300631 | 0.268318 | 0.241003 | 0.220387 | 0.200906 | 0.183162 | 0.167806 | 0.151490 | 0.138065 |
| 1/24/20 | 0.028360 | 0.085696 | 0.998274 | 0.998985 | 1.000000 | 0.999724 | 0.999126 | 0.998839 | 0.998739 | 0.998609 | ... | 0.333048 | 0.300479 | 0.268155 | 0.240830 | 0.220206 | 0.200717 | 0.182962 | 0.167597 | 0.151270 | 0.137837 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3/30/20 | 0.118741 | -0.026764 | 0.201755 | 0.200906 | 0.200717 | 0.201215 | 0.200981 | 0.201001 | 0.201321 | 0.201166 | ... | 0.975173 | 0.985552 | 0.992460 | 0.997029 | 0.999255 | 1.000000 | 0.998738 | 0.995216 | 0.986722 | 0.976786 |
| 3/31/20 | 0.123155 | -0.033680 | 0.184001 | 0.183162 | 0.182962 | 0.183453 | 0.183209 | 0.183225 | 0.183552 | 0.183393 | ... | 0.965035 | 0.977300 | 0.986077 | 0.992556 | 0.996390 | 0.998738 | 1.000000 | 0.998802 | 0.993347 | 0.985822 |
| 4/1/20 | 0.127146 | -0.039450 | 0.168638 | 0.167806 | 0.167597 | 0.168082 | 0.167831 | 0.167843 | 0.168175 | 0.168013 | ... | 0.952434 | 0.966659 | 0.977293 | 0.985745 | 0.991238 | 0.995216 | 0.998802 | 1.000000 | 0.997470 | 0.992296 |
| 4/2/20 | 0.131927 | -0.046403 | 0.152316 | 0.151490 | 0.151270 | 0.151751 | 0.151490 | 0.151498 | 0.151837 | 0.151670 | ... | 0.933093 | 0.949928 | 0.962650 | 0.973296 | 0.980650 | 0.986722 | 0.993347 | 0.997470 | 1.000000 | 0.998557 |
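
The `keep` options of `drop_duplicates` used in the cleaning step of the cell above can be compared side by side on a small made-up frame. The frame `dup_toy` and its values are illustrative assumptions, not part of the JHU data:

```python
import pandas as pd

# Rows 0, 2 and 4 are identical; rows 1 and 3 are unique.
dup_toy = pd.DataFrame({
    "area": ["A", "B", "A", "C", "A"],
    "deaths": [1, 2, 1, 3, 1],
})

# keep='first' (the default): keep the first row of each duplicate group.
kept_first = dup_toy.drop_duplicates()

# keep='last': keep the last row of each duplicate group.
kept_last = dup_toy.drop_duplicates(keep="last")

# keep=False: drop every row that has a duplicate anywhere in the frame.
kept_none = dup_toy.drop_duplicates(keep=False)

print(kept_first.index.tolist())
print(kept_last.index.tolist())
print(kept_none.index.tolist())
```

`drop_duplicates` compares column values only and ignores the index, so with `index_col=1` the country name in the index does not take part in the comparison.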
