# READING DATAFRAMES IN PYTHON
© Copyright: 2024, Selma Hadzic, all rights reserved.

## Installation

Install Pandas from a terminal:

```bash
pip install pandas
```

In [1]:
# or directly within the notebook:
!pip install pandas



# 1. Pandas, DataFrames and Series

- Pandas is an open-source Python library that provides powerful and flexible tools for data manipulation, analysis, and cleaning. It is one of the most popular libraries for working with structured data, especially tabular data (like spreadsheets, CSV files, SQL tables, etc.).

- Pandas is built on top of NumPy and provides two main data structures:
`pd.Series` and `pd.DataFrame`   

In [3]:
# First import the package to be able to use it
import pandas as pd

## DataFrames

A ``pd.DataFrame`` is the central data structure in ``pandas``:

- It is a two-dimensional labeled data structure (like a table with rows and columns).
- It can have one or multiple rows and one or multiple columns.
- Each column can hold a different data type (``int64`` for integer values, ``float64`` for float values, ``object`` for string values, ``bool`` for boolean values).
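
As a minimal sketch of the points above, the following builds a small DataFrame by hand. The column names and values are hypothetical toy data (they are not taken from the WHO file used later); the goal is only to show that each column carries its own data type.

```python
import pandas as pd

# Hypothetical toy data, not taken from the WHO file
toy = pd.DataFrame(
    {
        "country": ["Germany", "France", "Italy"],
        "new_cases": [10, 25, 7],
        "positivity_rate": [0.12, 0.30, 0.05],
        "reported": [True, False, True],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one data type per column
```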

## Series

Each column of a DataFrame is an object of type ``pd.Series``.

A ``pd.Series`` is a one-dimensional array-like object provided by the Pandas library.
- It is similar to a list or a single column of a table.
- It has a label for each element (its index), allowing for easy access and manipulation of the data.
- A DataFrame, by contrast, has two indexes:
    - a row index (row names)
    - a column index (column names)
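
A short sketch of a Series, assuming hypothetical values and labels chosen only for illustration. It shows access by label and by position, and that a single column taken from a DataFrame is itself a Series.

```python
import pandas as pd

# Hypothetical values with a custom index (labels)
cases = pd.Series([10, 25, 7], index=["DE", "FR", "IT"], name="new_cases")

print(cases["FR"])    # access by label
print(cases.iloc[0])  # access by position

# A single column taken from a DataFrame is a Series
toy = pd.DataFrame({"country": ["Germany", "France"], "new_cases": [10, 25]})
print(type(toy["new_cases"]))
```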

# 2. Read and display the DataFrame

### 2.1 Read the DataFrame

In [4]:
# Read the DataFrame
df = pd.read_csv('./data/WHO_COVID19_cases.csv')
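
To see what `pd.read_csv` does with empty fields without needing the file on disk, here is a minimal sketch that reads CSV text from memory. The two rows are a hypothetical excerpt written for this example; the point is that an empty field becomes `NaN`, which in turn makes the whole column a float column.

```python
import io

import pandas as pd

# Hypothetical two-row excerpt; the first row has an empty New_cases field
csv_text = """Date_reported,Country,New_cases
2020-01-05,Afghanistan,
2020-03-01,Afghanistan,1
"""

toy = pd.read_csv(io.StringIO(csv_text))

print(toy)
print(toy.dtypes)  # New_cases is float64 because of the missing value
```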

### 2.2 Display the DataFrame

In [5]:
df

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,AF,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


### 2.3 Show the first/last 10 rows

In [6]:
df.head(10)

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,AF,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
5,2020-02-09,AF,Afghanistan,Asia,EMRO,,0,,0
6,2020-02-16,AF,Afghanistan,Asia,EMRO,,0,,0
7,2020-02-23,AF,Afghanistan,Asia,EMRO,,0,,0
8,2020-03-01,AF,Afghanistan,Asia,EMRO,1.0,1,,0
9,2020-03-08,AF,Afghanistan,Asia,EMRO,,1,,0


In [7]:
df.tail(10)

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
58550,2024-06-30,ZW,Zimbabwe,Africa,AFRO,4.0,266378,,5740
58551,2024-07-07,ZW,Zimbabwe,Africa,AFRO,6.0,266384,,5740
58552,2024-07-14,ZW,Zimbabwe,Africa,AFRO,1.0,266385,,5740
58553,2024-07-21,ZW,Zimbabwe,Africa,AFRO,,266385,,5740
58554,2024-07-28,ZW,Zimbabwe,Africa,AFRO,1.0,266386,,5740
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58559,2024-09-01,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


### 2.4 Show the number of rows and columns

In [10]:
# Number of rows
df.shape[0]

58560

In [12]:
# Number of columns
df.shape[1]

9

In [13]:
# Summary of both rows and columns
df.shape

(58560, 9)

# 3. Data Selection

### 3.1 Selecting a single column by name

In [14]:
# let's select Cumulative_cases

df[['Cumulative_cases']]

Unnamed: 0,Cumulative_cases
0,0
1,0
2,0
3,0
4,0
...,...
58555,266387
58556,266387
58557,266387
58558,266387
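
The cell above uses double brackets, which return a DataFrame. As a small sketch on hypothetical toy data, single brackets return a Series instead:

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"country": ["Germany", "France"], "cumulative_cases": [1, 5]})

as_series = toy["cumulative_cases"]    # single brackets -> pd.Series
as_frame = toy[["cumulative_cases"]]   # double brackets -> pd.DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```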


### 3.2 Selecting multiple columns by name

In [15]:
# Let's select 'Date_reported', 'Country', 'Cumulative_cases'
df[['Date_reported', 'Country', 'Cumulative_cases']]

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


### 3.3 Selecting based on the integer position: iloc

In [18]:
# Selecting the first row: index always starts at 0 in Python, so the first row has index '0'
df.iloc[0]

Date_reported         2020-01-05
Country_code                  AF
Country              Afghanistan
Continent                   Asia
WHO_region                  EMRO
New_cases                    NaN
Cumulative_cases               0
New_deaths                   NaN
Cumulative_deaths              0
Name: 0, dtype: object

In [19]:
# Selecting the first row and showing it as a DataFrame: double brackets
df.iloc[[0]]

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0


In [23]:
# Select the first row and third column
df.iloc[0,2]

'Afghanistan'

In [24]:
# Show the 'Date_reported', 'Country', 'Cumulative_cases', by using the indexes:
df.iloc[:, [0,2,6]]

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


In [25]:
# Select only the first 3 rows (slicing) and the same columns. 
df.iloc[:3, [0,2,6]]

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0


In [26]:
# Select only the rows at positions 2 through 9 (slicing) and the same columns
# Note that the right bound is not included in slicing
df.iloc[2:10, [0,2,6]]

Unnamed: 0,Date_reported,Country,Cumulative_cases
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
5,2020-02-09,Afghanistan,0
6,2020-02-16,Afghanistan,0
7,2020-02-23,Afghanistan,0
8,2020-03-01,Afghanistan,1
9,2020-03-08,Afghanistan,1


### 3.4 Selecting based on the label: loc

In [27]:
# Filter based on index labels
df.loc[[0]]

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0


In [28]:
# Show the 'Date_reported', 'Country', 'Cumulative_cases', by using the labels
df.loc[:, ['Date_reported', 'Country', 'Cumulative_cases']]

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


#### loc and iloc cannot be combined in a single call
- You _can't_ mix labels and integer positions within one `.loc` or `.iloc` call.
- Within either of them, you _can_ combine selecting by a single value, by multiple values, and by slicing.
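
The difference between the two accessors is easiest to see when the row labels are not 0, 1, 2, ... The sketch below uses a hypothetical toy DataFrame with custom row labels. It also shows one possible way to pair a row position with a column label: translate the label into a position with `df.columns.get_loc` first.

```python
import pandas as pd

toy = pd.DataFrame(
    {"country": ["A", "B", "C", "D"], "cases": [1, 2, 3, 4]},
    index=[10, 20, 30, 40],  # hypothetical non-default row labels
)

by_position = toy.iloc[0:2]  # positions 0 and 1; the right bound is excluded
by_label = toy.loc[10:30]    # labels 10, 20 and 30; the right bound is included

# Row by position, column by label: convert the column label to a position
col_pos = toy.columns.get_loc("cases")
value = toy.iloc[0, col_pos]

print(by_position)
print(by_label)
print(value)
```

Note that slicing with `.loc` includes the right bound, unlike slicing with `.iloc`.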

# 4. Rename and drop rows or columns

### 4.1 Show the columns

In [29]:
df.columns

Index(['Date_reported', 'Country_code', 'Country', 'Continent', 'WHO_region',
       'New_cases', 'Cumulative_cases', 'New_deaths', 'Cumulative_deaths'],
      dtype='object')

### 4.2 Rename the columns

In [30]:
df.columns = ['date_reported', 'country_code','country', 'continent', 'region',
       'new_cases', 'cumulative_cases', 'new_deaths', 'cumulative_deaths']

In [31]:
df.columns

Index(['date_reported', 'country_code', 'country', 'continent', 'region',
       'new_cases', 'cumulative_cases', 'new_deaths', 'cumulative_deaths'],
      dtype='object')

#### A - Replacing the name of specific columns

`df.rename(columns={'new_deaths':'number_new_deaths', 'new_cases': 'number_new_cases'}, inplace=True)`


#### B - Converting all column names to lowercase with the `str.lower` method

`df.columns = df.columns.str.lower()`
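
Both approaches can be tried safely on a small DataFrame first. The sketch below uses hypothetical toy data; note that `rename` returns a new DataFrame and leaves the original untouched unless `inplace=True` is passed.

```python
import pandas as pd

# Hypothetical toy data with capitalized column names
toy = pd.DataFrame({"Country": ["Germany"], "New_cases": [1], "New_deaths": [0]})

# A - rename one specific column; the result is a new DataFrame
renamed = toy.rename(columns={"New_cases": "number_new_cases"})

# B - lower-case every column name at once (done on a copy here)
lowered = toy.copy()
lowered.columns = lowered.columns.str.lower()

print(list(renamed.columns))
print(list(lowered.columns))
```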

### 4.3 Drop columns

In [32]:
# Let's drop the country code and assign the new dataframe to a new variable called `df_dropcol`
df_dropcol = df.drop('country_code', axis= 'columns') 

In [33]:
# Display the DataFrame
df_dropcol

Unnamed: 0,date_reported,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
0,2020-01-05,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...
58555,2024-08-04,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,Zimbabwe,Africa,AFRO,,266387,,5740


### 4.4 Drop rows

In [35]:
# Let's drop the first two rows and assign the new DataFrame to a new variable called df_droprows
df_droprows = df.drop(index=[0, 1])

In [36]:
# Display the DataFrame
df_droprows

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
5,2020-02-09,AF,Afghanistan,Asia,EMRO,,0,,0
6,2020-02-16,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


In [None]:
# We can also modify the DataFrame in place instead of creating a new one. Let's drop the row labeled 2 from the newly created DataFrame:
df_droprows.drop(index = 2, inplace = True)

In [None]:
# Let's check: the row labeled 2 should no longer appear
df_droprows.head()


# 5. Filtering DataFrames

### 5.1 Filtering with a single condition

In [37]:
# Let's filter on the country 'Germany'
df[df['country']== 'Germany']

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
19032,2020-01-05,DE,Germany,Europe,EURO,1.0,1,3.0,3
19033,2020-01-12,DE,Germany,Europe,EURO,,1,,3
19034,2020-01-19,DE,Germany,Europe,EURO,,1,,3
19035,2020-01-26,DE,Germany,Europe,EURO,1.0,2,3.0,6
19036,2020-02-02,DE,Germany,Europe,EURO,9.0,11,3.0,9
...,...,...,...,...,...,...,...,...,...
19271,2024-08-04,DE,Germany,Europe,EURO,,38437756,,174979
19272,2024-08-11,DE,Germany,Europe,EURO,,38437756,,174979
19273,2024-08-18,DE,Germany,Europe,EURO,,38437756,,174979
19274,2024-08-25,DE,Germany,Europe,EURO,,38437756,,174979


In [39]:
# Let's assign this DataFrame to a new variable called `germany`
germany = df[df['country']== 'Germany']

In [40]:
germany.head(5)

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
19032,2020-01-05,DE,Germany,Europe,EURO,1.0,1,3.0,3
19033,2020-01-12,DE,Germany,Europe,EURO,,1,,3
19034,2020-01-19,DE,Germany,Europe,EURO,,1,,3
19035,2020-01-26,DE,Germany,Europe,EURO,1.0,2,3.0,6
19036,2020-02-02,DE,Germany,Europe,EURO,9.0,11,3.0,9


### 5.2 Using other comparison operators such as > , < , >= , <= and !=

In [42]:
# Let's filter for all rows with more than 1 million cumulative deaths

df[df['cumulative_deaths'] > 1000000]

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
56002,2022-06-05,US,United States of America,North America,AMRO,675627.0,83847498,1979.0,1000425
56003,2022-06-12,US,United States of America,North America,AMRO,741044.0,84588542,2284.0,1002709
56004,2022-06-19,US,United States of America,North America,AMRO,672256.0,85260798,2202.0,1004911
56005,2022-06-26,US,United States of America,North America,AMRO,709637.0,85970435,2501.0,1007412
56006,2022-07-03,US,United States of America,North America,AMRO,773772.0,86744207,2642.0,1010054
...,...,...,...,...,...,...,...,...,...
56115,2024-08-04,US,United States of America,North America,AMRO,,103436829,895.0,1193733
56116,2024-08-11,US,United States of America,North America,AMRO,,103436829,973.0,1194706
56117,2024-08-18,US,United States of America,North America,AMRO,,103436829,937.0,1195643
56118,2024-08-25,US,United States of America,North America,AMRO,,103436829,907.0,1196550


### 5.3 Filtering with multiple conditions

In [47]:
# Filter on rows outside the continent 'Asia' with fewer than 1000 cumulative deaths
boolean_mask = (df['cumulative_deaths'] < 1000) & (df['continent'] != 'Asia')
df[boolean_mask]

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
244,2020-01-05,AL,Albania,Europe,EURO,,0,,0
245,2020-01-12,AL,Albania,Europe,EURO,,0,,0
246,2020-01-19,AL,Albania,Europe,EURO,,0,,0
247,2020-01-26,AL,Albania,Europe,EURO,,0,,0
248,2020-02-02,AL,Albania,Europe,EURO,,0,,0
...,...,...,...,...,...,...,...,...,...
58367,2020-12-27,ZW,Zimbabwe,Africa,AFRO,812.0,12963,23.0,341
58368,2021-01-03,ZW,Zimbabwe,Africa,AFRO,1528.0,14491,36.0,377
58369,2021-01-10,ZW,Zimbabwe,Africa,AFRO,6008.0,20499,106.0,483
58370,2021-01-17,ZW,Zimbabwe,Africa,AFRO,6382.0,26881,200.0,683
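
Besides `&` (and), boolean masks can be combined with `|` (or) and inverted with `~` (not). A minimal sketch on hypothetical toy data follows. Each condition needs its own parentheses when written inline, which is why the masks are stored in named variables here.

```python
import pandas as pd

# Hypothetical toy data; the numbers are made up for illustration
toy = pd.DataFrame(
    {
        "country": ["India", "Albania", "Japan", "Brazil"],
        "continent": ["Asia", "Europe", "Asia", "South America"],
        "cumulative_deaths": [500, 200, 5000, 900],
    }
)

in_asia = toy["continent"] == "Asia"
few_deaths = toy["cumulative_deaths"] < 1000

both = toy[in_asia & few_deaths]    # AND: both conditions hold
either = toy[in_asia | few_deaths]  # OR: at least one condition holds
not_asia = toy[~in_asia]            # NOT: invert a condition

print(both)
print(either)
print(not_asia)
```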


In [None]:
df[df['country'].

### 5.4 Other methods

- x.isin([...])
- x.between(a, b)
- x.isna()
- x.str.startswith('A')
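
Each of these methods returns a boolean Series that can be used as a mask, just like the comparisons above. A small sketch on hypothetical toy data, where `x` is a column of the DataFrame:

```python
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame(
    {
        "country": ["Albania", "Austria", "Germany", "France"],
        "new_cases": [1.0, None, 30.0, 5.0],
    }
)

selected = toy[toy["country"].isin(["Germany", "France"])]  # value is in a list
in_range = toy[toy["new_cases"].between(1, 5)]              # both bounds included by default
missing = toy[toy["new_cases"].isna()]                      # value is missing (NaN)
starts_a = toy[toy["country"].str.startswith("A")]          # string starts with 'A'

print(selected)
print(in_range)
print(missing)
print(starts_a)
```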

# 6. Save a new dataset to a CSV file

In [48]:
# We can save a DataFrame to a CSV file

germany.to_csv("./data/germany_covid19_cases.csv", index=False)
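
To illustrate what `index=False` does, the sketch below writes a hypothetical toy DataFrame to a temporary file and reads it back. The file name and values are assumptions made for this example. Because the row labels are not written, the reloaded DataFrame gets a fresh default index starting at 0.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data with non-default row labels
toy = pd.DataFrame(
    {"country": ["Germany", "Germany"], "new_cases": [1, 9]},
    index=[19032, 19033],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")  # hypothetical file name
    toy.to_csv(path, index=False)            # the row labels are not written
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))
print(list(reloaded.index))
```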

# Readings
**pandas user guide**
https://pandas.pydata.org/docs/user_guide/index.html#user-guide