#### READ CSV FILE WITH PANDAS

In [1]:
from urllib.request import urlretrieve

In [2]:
italy_covid_url = 'https://gist.githubusercontent.com/aakashns/f6a004fa20c84fec53262f9a8bfee775/raw/f309558b1cf5103424cef58e2ecb8704dcd4d74c/italy-covid-daywise.csv'

urlretrieve(italy_covid_url, 'italy-covid-daywise.csv')

('italy-covid-daywise.csv', <http.client.HTTPMessage at 0x2769a669ac0>)

We can read a CSV file into pandas using the pd.read_csv() function.

In [11]:
import sys
!{sys.executable} -m pip install pandas

Collecting pandas
  Downloading pandas-2.2.1-cp312-cp312-win_amd64.whl.metadata (19 kB)
Collecting pytz>=2020.1 (from pandas)
  Using cached pytz-2024.1-py2.py3-none-any.whl.metadata (22 kB)
Collecting tzdata>=2022.7 (from pandas)
  Using cached tzdata-2024.1-py2.py3-none-any.whl.metadata (1.4 kB)
Downloading pandas-2.2.1-cp312-cp312-win_amd64.whl (11.5 MB)

In [12]:
import pandas as pd

In [15]:
covid_df = pd.read_csv('italy-covid-daywise.csv')
covid_df

,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,
...,...,...,...,...
243,2020-08-30,1444.0,1.0,53541.0
244,2020-08-31,1365.0,4.0,42583.0
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,
247,2020-09-03,1326.0,6.0,


Data from a CSV file will be stored in a DataFrame object.

In [14]:
type(covid_df)

pandas.core.frame.DataFrame
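
pd.read_csv() is not limited to file paths; it also accepts file-like objects. The sketch below reads a small in-memory CSV so it runs without the downloaded file. The three rows are an excerpt chosen for illustration, and the names `csv_text` and `sample_df` are hypothetical.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt in the same layout as italy-covid-daywise.csv
csv_text = """date,new_cases,new_deaths,new_tests
2020-08-30,1444.0,1.0,53541.0
2020-08-31,1365.0,4.0,42583.0
2020-09-02,975.0,8.0,
"""

# pd.read_csv accepts a path or any file-like object such as StringIO
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))   # the result is a DataFrame
print(sample_df.shape)   # (number of rows, number of columns)
print(sample_df.dtypes)  # the empty new_tests field is read as NaN, so the column is float64
```

An empty field in a numeric column is read as NaN, which is why new_tests is stored as a float column.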

Here's what the dataframe shows us:
- Three day-wise counts for Covid-19 in Italy
- The metrics reported are new cases, new deaths, and new tests
- 248 days' worth of data (Dec 31, 2019 to Sep 3, 2020)

We can view basic information about a dataframe using .info(), .columns, and .shape.

In [16]:
covid_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 248 entries, 0 to 247
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   date        248 non-null    object 
 1   new_cases   248 non-null    float64
 2   new_deaths  248 non-null    float64
 3   new_tests   135 non-null    float64
dtypes: float64(3), object(1)
memory usage: 7.9+ KB


In [18]:
covid_df.columns

Index(['date', 'new_cases', 'new_deaths', 'new_tests'], dtype='object')

In [19]:
covid_df.shape

(248, 4)

Each column contains a specific data type. For numeric columns, you can view summary statistics (mean, standard deviation, min/max values, quartiles, and the number of non-empty values) using .describe().

In [17]:
covid_df.describe()

,new_cases,new_deaths,new_tests
count,248.0,248.0,135.0
mean,1094.818548,143.133065,31699.674074
std,1554.508002,227.105538,11622.209757
min,-148.0,-31.0,7841.0
25%,123.0,3.0,25259.0
50%,342.0,17.0,29545.0
75%,1371.75,175.25,37711.0
max,6557.0,971.0,95273.0
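
The count row is what explains the 135 under new_tests: .describe() counts only non-NaN values. A minimal sketch on a four-row frame (values taken from rows 243 to 246; the names `demo_df` and `summary` are hypothetical) shows this, and also how to pull a single statistic out of the result:

```python
import numpy as np
import pandas as pd

# Four rows copied from the end of the dataset; the last new_tests value is missing
demo_df = pd.DataFrame({
    'new_cases': [1444.0, 1365.0, 996.0, 975.0],
    'new_tests': [53541.0, 42583.0, 54395.0, np.nan],
})

# describe() returns a DataFrame indexed by statistic name
summary = demo_df.describe()

print(summary.loc['count'])              # non-NaN values per column
print(summary.loc['mean', 'new_cases'])  # a single statistic for a single column
```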


#### RETRIEVE DATA FROM DATAFRAME
First, let's retrieve data from this dataframe. To do this, it helps to understand the internal representation of data in a dataframe. You can think of a dataframe as a dictionary of lists.

In [20]:
# Dataframe format is similar to this
covid_data_dict = {
    'date': ['2020-08-30', '2020-08-31'],
    'new_cases': [1444, 1365],
    'new_deaths': [1, 4],
    'new_tests': [53541, 42583],
}

This format has a few benefits:
- All values in a column have the same data type, so each column can be stored efficiently in a single array
- Retrieving the values for a row only requires reading the element at the same index from each column's list
- The representation is more compact, because column names are recorded only once (compared to the list of dictionaries we created using numpy where each dictionary is 1 row)
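
As a rough illustration of these points, the sketch below pulls a column and a row out of a plain dictionary of lists, then hands the same structure to pandas. The variable names `cases`, `row_1`, and `small_df` are hypothetical.

```python
import pandas as pd

covid_data_dict = {
    'date': ['2020-08-30', '2020-08-31'],
    'new_cases': [1444, 1365],
    'new_deaths': [1, 4],
    'new_tests': [53541, 42583],
}

# A column is a single dictionary lookup
cases = covid_data_dict['new_cases']

# A row is the element at the same position in every column's list
row_1 = {column: values[1] for column, values in covid_data_dict.items()}

# pandas accepts the same dictionary-of-lists structure directly
small_df = pd.DataFrame(covid_data_dict)

print(cases)
print(row_1)
print(small_df)
```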

In [24]:
# Retrieving data using simple indexing
covid_data_dict['new_cases']

[1444, 1365]

In [25]:
covid_df['new_cases']

0         0.0
1         0.0
2         0.0
3         0.0
4         0.0
        ...  
243    1444.0
244    1365.0
245     996.0
246     975.0
247    1326.0
Name: new_cases, Length: 248, dtype: float64

Each column is represented using a data structure called a Series, which is basically a numpy array with extra methods/properties, so we can index it as well.

In [29]:
covid_df['new_cases'][246]

975.0
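
One detail worth keeping in mind: with an integer index, square brackets on a Series look up by label, not by position. The sketch below builds a small standalone Series (the name `new_cases` and the three-row excerpt are illustrative assumptions) to show label lookup, positional lookup with .iloc, and the underlying numpy array.

```python
import pandas as pd

# Three values from the end of the dataset, keeping their original row labels
new_cases = pd.Series([996.0, 975.0, 1326.0], index=[245, 246, 247], name='new_cases')

print(new_cases[246])        # lookup by label 246
print(new_cases.iloc[0])     # lookup by position 0, which is label 245
print(new_cases.to_numpy())  # the values as a plain numpy array
print(new_cases.sum())       # Series add methods on top of the array
```

Here `new_cases[0]` would raise a KeyError, because no row carries the label 0.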

There's also an .at accessor that retrieves the value at a specific row and column.

In [30]:
covid_df.at[246, 'new_cases']

975.0
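
.at reads a single value, and it can also be used to write one. A minimal sketch, assuming a two-row frame with hypothetical name `demo_df`; .loc with a row label and a column label returns the same scalar:

```python
import pandas as pd

# Two rows copied from the dataset, keeping their original row labels
demo_df = pd.DataFrame(
    {'new_cases': [996.0, 975.0], 'new_deaths': [6.0, 8.0]},
    index=[245, 246],
)

value_at = demo_df.at[246, 'new_cases']    # scalar access by row label and column name
value_loc = demo_df.loc[246, 'new_cases']  # .loc returns the same value

# .at also supports assignment to a single cell (9.0 is an arbitrary illustrative value)
demo_df.at[246, 'new_deaths'] = 9.0

print(value_at, value_loc)
print(demo_df)
```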

You can also pass a list of column names to access a subset of the dataframe.

Do not rely on the result sharing memory with the original dataframe. In current pandas versions, selecting a list of columns returns a new object, and pandas may emit a SettingWithCopyWarning if you then assign into it. Whether a change to one variable ever shows up in the other depends on the operation and the pandas version.

If you need a guaranteed independent duplicate of a dataframe, use .copy().

In [31]:
cases_df = covid_df[['date', 'new_cases']]
cases_df

,date,new_cases
0,2019-12-31,0.0
1,2020-01-01,0.0
2,2020-01-02,0.0
3,2020-01-03,0.0
4,2020-01-04,0.0
...,...,...
243,2020-08-30,1444.0
244,2020-08-31,1365.0
245,2020-09-01,996.0
246,2020-09-02,975.0
247,2020-09-03,1326.0


In [32]:
covid_df_copy = covid_df.copy()
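
To check that .copy() really produces an independent dataframe, the sketch below modifies the copy and confirms the original is untouched. The frame and the names `original_df` and `independent_df` are hypothetical.

```python
import pandas as pd

original_df = pd.DataFrame({
    'date': ['2020-09-01', '2020-09-02'],
    'new_cases': [996.0, 975.0],
})

# copy() makes a deep copy by default, so the data is duplicated
independent_df = original_df.copy()

# Overwrite a value in the copy only (-1.0 is an arbitrary marker value)
independent_df.at[0, 'new_cases'] = -1.0

print(original_df.at[0, 'new_cases'])     # the original value is unchanged
print(independent_df.at[0, 'new_cases'])  # only the copy was modified
```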

To access values at a specific row:

In [33]:
covid_df.loc[243]

date          2020-08-30
new_cases         1444.0
new_deaths           1.0
new_tests        53541.0
Name: 243, dtype: object

In [34]:
type(covid_df.loc[243])

pandas.core.series.Series
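
Because a row comes back as a Series, its index is made up of the column names, and individual fields can be read by name. A small sketch using a one-row frame (the names `demo_df` and `row` are hypothetical):

```python
import pandas as pd

# One row copied from the dataset, keeping its original label 243
demo_df = pd.DataFrame(
    {
        'date': ['2020-08-30'],
        'new_cases': [1444.0],
        'new_deaths': [1.0],
        'new_tests': [53541.0],
    },
    index=[243],
)

row = demo_df.loc[243]

print(list(row.index))   # the column names become the Series index
print(row.name)          # the row label is kept as the Series name
print(row['new_cases'])  # fields are read by column name
```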

To get the first or last few rows:

In [35]:
covid_df.head()

,date,new_cases,new_deaths,new_tests
0,2019-12-31,0.0,0.0,
1,2020-01-01,0.0,0.0,
2,2020-01-02,0.0,0.0,
3,2020-01-03,0.0,0.0,
4,2020-01-04,0.0,0.0,


In [36]:
covid_df.tail(3)

,date,new_cases,new_deaths,new_tests
245,2020-09-01,996.0,6.0,54395.0
246,2020-09-02,975.0,8.0,
247,2020-09-03,1326.0,6.0,


Notice that some values are 0 while others are NaN (shown as empty cells above). NaN marks entries where the data is missing.

0 and NaN are distinct: NaN means "not a number" and indicates a lack of information, while 0 is a real, recorded value of zero.

We can find the index of the first non-NaN value in a column using .first_valid_index().

In [43]:
# NaN datatype is a float
type(covid_df.at[0, 'new_tests'])

numpy.float64

In [41]:
covid_df.new_tests.first_valid_index()

111
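
The difference between 0 and NaN matters for .first_valid_index(): a zero counts as valid data, while NaN does not. The sketch below uses a made-up four-value Series (the data and the name `new_tests` are illustrative, not taken from the dataset) to show this, along with .isna() for counting missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column: two missing values, then a genuine zero, then a count
new_tests = pd.Series([np.nan, np.nan, 0.0, 7841.0], name='new_tests')

print(new_tests.isna().sum())         # number of missing entries
print(new_tests.first_valid_index())  # label of the first non-NaN value; the zero qualifies
print(new_tests.sum())                # NaN values are skipped by default
```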

To verify that first_valid_index() works:

In [53]:
covid_df.loc[108:113]

,date,new_cases,new_deaths,new_tests
108,2020-04-17,3786.0,525.0,
109,2020-04-18,3493.0,575.0,
110,2020-04-19,3491.0,480.0,
111,2020-04-20,3047.0,433.0,7841.0
112,2020-04-21,2256.0,454.0,28095.0
113,2020-04-22,2729.0,534.0,44248.0
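
Note that .loc[108:113] returned six rows, 108 through 113. Label-based slices with .loc include both endpoints, unlike ordinary Python slices. A minimal comparison with position-based .iloc, using a hypothetical ten-row frame:

```python
import pandas as pd

# Ten rows with the default labels 0..9
demo_df = pd.DataFrame({'new_tests': range(10)})

by_label = demo_df.loc[3:5]      # label slice: includes both 3 and 5
by_position = demo_df.iloc[3:5]  # position slice: excludes the end, like a Python list

print(list(by_label.index))
print(list(by_position.index))
```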


The method .sample() can be used to retrieve a random sample from the dataframe.

In [54]:
covid_df.sample(10)

,date,new_cases,new_deaths,new_tests
147,2020-05-26,300.0,92.0,33944.0
123,2020-05-02,1965.0,269.0,31231.0
201,2020-07-19,249.0,14.0,20621.0
210,2020-07-28,168.0,5.0,25341.0
6,2020-01-06,0.0,0.0,
230,2020-08-17,477.0,4.0,21379.0
10,2020-01-10,0.0,0.0,
148,2020-05-27,397.0,78.0,37299.0
199,2020-07-17,230.0,20.0,28661.0
231,2020-08-18,320.0,4.0,32687.0


When we take a random sample from a dataframe, the original index of each row is preserved. This is another useful feature of dataframes.
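
Since the labels are preserved, any sampled row can be traced back to the original dataframe. The sketch below also passes `random_state`, a parameter of .sample() that makes the draw repeatable; the five-row frame and the names `demo_df`, `picked`, and `again` are hypothetical.

```python
import pandas as pd

# Five rows with non-consecutive labels, as they might appear after filtering
demo_df = pd.DataFrame(
    {'new_cases': [0.0, 300.0, 1965.0, 249.0, 168.0]},
    index=[6, 147, 123, 201, 210],
)

# A fixed random_state makes the sample reproducible
picked = demo_df.sample(3, random_state=0)
again = demo_df.sample(3, random_state=0)

# Each sampled label still points at the same row in the original frame
for label in picked.index:
    print(label, picked.at[label, 'new_cases'], demo_df.at[label, 'new_cases'])

print(picked.equals(again))  # same seed, same sample
```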

#### ANALYZE DATA FROM DATAFRAME

## EVERY TOOL FROM THIS SECTION:

Reading CSV files:
- pd.read_csv()
- .info()
- .describe()
- .columns
- .shape

Retrieving data:
- covid_df['new_cases']
- covid_df['new_cases'][243]
- covid_df.at[243, 'new_cases']
- covid_df.copy()
- covid_df.loc[243]
- .head(), .tail(), .sample()
- covid_df.new_tests.first_valid_index()

Analyzing data: