# READING DATAFRAMES IN PYTHON
© Copyright: 2024, Selma Hadzic, all rights reserved.

## Installation

Install Pandas from a terminal:

```bash
pip install pandas
```

In [33]:
# or directly within the notebook:
!pip install pandas




# 1. Pandas, DataFrames, and Series

- Pandas is an open-source Python library that provides powerful and flexible tools for data manipulation, analysis, and cleaning. It is one of the most popular libraries for working with structured data, especially tabular data (like spreadsheets, CSV files, SQL tables, etc.).

- Pandas is built on top of NumPy and provides two main data structures:
`pd.Series` and `pd.DataFrame`   

In [34]:
# First import the package to be able to use it
import pandas as pd

## DataFrames

``pd.DataFrame`` objects are the central data structure in ``pandas``:

- A two-dimensional labeled data structure (like a table with rows and columns).
- It can have one or more rows and one or more columns.
- Each column can hold a different data type (``int64`` for integer values, ``float64`` for float values, ``object`` for string values, ``bool`` for boolean values).
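
To make the points above concrete, here is a minimal sketch that builds a tiny DataFrame by hand and inspects it. The variable `toy` and its values are made up for illustration; only the column names echo the WHO dataset used later.

```python
import pandas as pd

# Hypothetical toy data (not taken from the real file)
toy = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Germany", "Zimbabwe"],
        "Cumulative_cases": [0, 11, 266387],
        "New_cases": [float("nan"), 9.0, 1.0],
        "Reported": [False, True, True],
    }
)

print(toy.shape)   # (number of rows, number of columns)
print(toy.dtypes)  # one data type per column
```

Each column keeps its own data type, which is why `dtypes` returns one entry per column.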

## Series

Each column of a DataFrame is an object of type ``pd.Series``.

A ``pd.Series`` is a one-dimensional, array-like object provided by the Pandas library.
- It is similar to a list or to a single column of a table.
- It has a label (an index) for each element, which allows easy access to and manipulation of the data.
- A Series has a single index. A DataFrame has two:
    - a row index (row names)
    - a column index (column names)
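
As a small sketch of the relationship between a DataFrame and its columns, the example below pulls one column out of a hand-made DataFrame and looks at its index and name. The data in `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"Country": ["Afghanistan", "Germany"], "Cumulative_cases": [0, 11]}
)

cases = toy["Cumulative_cases"]   # one column of the DataFrame
print(type(cases).__name__)       # the column is a Series
print(cases.index.tolist())       # row labels shared with the DataFrame
print(cases.name)                 # the Series is named after the column
print(cases.loc[1])               # access one element by its label
```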

# 2. Read and display the DataFrame

### 2.1 Read the DataFrame

In [35]:
# Read the DataFrame
df = pd.read_csv('/Users/cb/Downloads/WHO_COVID19_cases.csv')
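
`pd.read_csv` accepts optional parameters that are often useful when loading a file like this one. The sketch below uses a two-row stand-in for the CSV (held in a string so that it runs without the download) and shows `parse_dates` and `usecols`; the names `csv_text` and `sample` are assumptions made for this example.

```python
import io
import pandas as pd

# A small made-up stand-in for the CSV file
csv_text = (
    "Date_reported,Country_code,Country,New_cases,Cumulative_cases\n"
    "2020-01-05,AF,Afghanistan,,0\n"
    "2020-03-01,AF,Afghanistan,1,1\n"
)

sample = pd.read_csv(
    io.StringIO(csv_text),                  # a file path works the same way
    parse_dates=["Date_reported"],          # convert this column to datetime
    usecols=["Date_reported", "Country", "New_cases", "Cumulative_cases"],
)

print(sample.dtypes)
print(sample)
```

Empty fields in the file are read as missing values (`NaN`), which is why `New_cases` becomes a float column.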

### 2.2 Display the DataFrame

In [36]:
df

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,AF,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


### 2.3 Show the first/last 5 rows

In [37]:
df.head() #first 5


Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,AF,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0


In [38]:
df.tail() #last 5

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58559,2024-09-01,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


### 2.4 Show the number of rows and columns

In [39]:
# Number of rows
num_rows  = len(df)
num_rows

58560

In [40]:
# Number of columns
num_cols = len(df.columns)
num_cols

9

In [41]:
# Summary of both rows and columns: shape returns (rows, columns)
df.shape

(58560, 9)

# 3. Data Selection

### 3.1 Selecting a single column by name

In [42]:
# let's select Cumulative_cases

df[['Cumulative_cases']]

Unnamed: 0,Cumulative_cases
0,0
1,0
2,0
3,0
4,0
...,...
58555,266387
58556,266387
58557,266387
58558,266387
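
The number of brackets matters when selecting one column. The following sketch, on hypothetical toy data, compares single and double brackets.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"Country": ["Afghanistan", "Germany"], "Cumulative_cases": [0, 11]}
)

as_series = toy["Cumulative_cases"]    # single brackets: a pd.Series
as_frame = toy[["Cumulative_cases"]]   # double brackets: a one-column pd.DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

Double brackets pass a list of column names, so the result stays a DataFrame even when the list holds a single name.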


### 3.2 Selecting multiple columns by name

In [43]:
# Let's select 'Date_reported', 'Country', 'Cumulative_cases'
df[['Date_reported', 'Country', 'Cumulative_cases']]


Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


### 3.3 Selecting based on the integer position: iloc

In [44]:
# Selecting the first row: positions always start at 0 in Python, so the first row has position 0
df.iloc[0]

Date_reported         2020-01-05
Country_code                  AF
Country              Afghanistan
Continent                   Asia
WHO_region                  EMRO
New_cases                    NaN
Cumulative_cases               0
New_deaths                   NaN
Cumulative_deaths              0
Name: 0, dtype: object

In [45]:
# Selecting the first row and showing it as a DataFrame: double brackets
df_fr= df.iloc[[0]]
df_fr

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0


In [46]:
# Select the first row and third column
df_fr_tc= df.iloc[[0],[2]]
df_fr_tc

Unnamed: 0,Country
0,Afghanistan


In [47]:
# Show 'Date_reported', 'Country', 'Cumulative_cases' by using their integer positions:
selected_columns_df = df.iloc[:, [0, 2, 6]]
selected_columns_df

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


In [48]:
# Select only the first 3 rows (slicing) and the same columns. 
selected_columns_df[:3]

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0


In [49]:
# Select only the rows at positions 2 to 8 (slicing) and the same columns
# Notice that the right bound (9) is not included in slicing
selected_columns_df[2:9]

Unnamed: 0,Date_reported,Country,Cumulative_cases
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
5,2020-02-09,Afghanistan,0
6,2020-02-16,Afghanistan,0
7,2020-02-23,Afghanistan,0
8,2020-03-01,Afghanistan,1


### 3.4 Selecting based on the label: loc

In [50]:
# Filter based on index labels
df.loc[[0]]

Unnamed: 0,Date_reported,Country_code,Country,Continent,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0


In [51]:
# Show the 'Date_reported', 'Country', 'Cumulative_cases', by using the labels
df_loc_labels = df.loc[:, ['Date_reported', 'Country', 'Cumulative_cases']]
df_loc_labels

Unnamed: 0,Date_reported,Country,Cumulative_cases
0,2020-01-05,Afghanistan,0
1,2020-01-12,Afghanistan,0
2,2020-01-19,Afghanistan,0
3,2020-01-26,Afghanistan,0
4,2020-02-02,Afghanistan,0
...,...,...,...
58555,2024-08-04,Zimbabwe,266387
58556,2024-08-11,Zimbabwe,266387
58557,2024-08-18,Zimbabwe,266387
58558,2024-08-25,Zimbabwe,266387


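#### iloc and loc cannot be mixed in a single selection
- You _can't_ combine `.loc` and `.iloc` in one indexing expression: use labels with `.loc` and integer positions with `.iloc`.
- Within either one, you _can_ combine selecting by a single value, by a list of values, and by slicing.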
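
The difference between the two indexers is easiest to see when row labels do not match row positions. The sketch below uses a hypothetical DataFrame `toy` whose index labels are 10, 20, 30 and 40.

```python
import pandas as pd

# Hypothetical toy data; labels deliberately differ from positions
toy = pd.DataFrame(
    {
        "Country": ["Afghanistan", "Germany", "Yemen", "Zimbabwe"],
        "Cumulative_cases": [0, 11, 212, 266387],
    },
    index=[10, 20, 30, 40],
)

by_position = toy.iloc[0:2]   # positions 0 and 1: right bound excluded
by_label = toy.loc[10:30]     # labels 10, 20 and 30: right bound included

# Single values, lists and slices can be mixed inside one indexer
mixed = toy.loc[[10, 40], "Country"]   # list of row labels, single column label
cell = toy.iloc[1, 1]                  # single row position, single column position

# To go from a position to a label, look the label up in the index first
first_label = toy.index[0]
value = toy.loc[first_label, "Cumulative_cases"]
```

Note that slicing with `.loc` includes the right bound, while slicing with `.iloc` does not.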

# 4. Rename and drop rows or columns

### 4.1 Show the columns

In [52]:
df.columns

Index(['Date_reported', 'Country_code', 'Country', 'Continent', 'WHO_region',
       'New_cases', 'Cumulative_cases', 'New_deaths', 'Cumulative_deaths'],
      dtype='object')

### 4.2 Rename the columns

In [53]:
df.columns = ['date_reported', 'country_code','country', 'continent', 'region',
       'new_cases', 'cumulative_cases', 'new_deaths', 'cumulative_deaths']

Two alternatives, where `data` stands for any DataFrame:

#### A. Renaming specific columns

`data.rename(columns={'new_deaths':'number_new_deaths', 'new_cases': 'number_new_cases'}, inplace=True)`


#### B. Lowercasing all column names with the `str.lower()` method

`data.columns = data.columns.str.lower()`
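
Both renaming approaches can be tried on a small example. This is a sketch on hypothetical data, with the lowercase step applied first and `rename` used without `inplace=True` so that the original stays available for comparison.

```python
import pandas as pd

# Hypothetical one-row DataFrame with capitalised column names
toy = pd.DataFrame({"New_cases": [1.0], "New_deaths": [3.0], "Country": ["Germany"]})

# B: lowercase every column name
toy.columns = toy.columns.str.lower()

# A: rename selected columns; rename() returns a new DataFrame unless inplace=True
renamed = toy.rename(
    columns={"new_deaths": "number_new_deaths", "new_cases": "number_new_cases"}
)

print(list(toy.columns))
print(list(renamed.columns))
```

Columns that are not listed in the `columns` mapping keep their names.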

### 4.3 Drop columns

In [54]:
# Let's drop the country code and assign the new DataFrame to a new variable called `df_dropcol`
df_dropcol = df.drop(columns='country_code')

In [55]:
# Display the DataFrame
df_dropcol

Unnamed: 0,date_reported,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
0,2020-01-05,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...
58555,2024-08-04,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,Zimbabwe,Africa,AFRO,,266387,,5740


### 4.4 Drop rows

In [56]:
# Let's drop the first two rows and assign the new DataFrame into a new variable called df_droprows
df_droprows = df.drop([0, 1])

In [57]:
# Display the DataFrame
df_droprows

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
5,2020-02-09,AF,Afghanistan,Asia,EMRO,,0,,0
6,2020-02-16,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740


In [58]:
# We can also modify a DataFrame in place instead of creating a new one. Let's drop the row with index label 2 from the newly created DataFrame:
df_droprows.drop(index=2, inplace=True)

In [59]:
# Let's check
df_droprows

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
5,2020-02-09,AF,Afghanistan,Asia,EMRO,,0,,0
6,2020-02-16,AF,Afghanistan,Asia,EMRO,,0,,0
7,2020-02-23,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
58555,2024-08-04,ZW,Zimbabwe,Africa,AFRO,1.0,266387,,5740
58556,2024-08-11,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58557,2024-08-18,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
58558,2024-08-25,ZW,Zimbabwe,Africa,AFRO,,266387,,5740
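
As the outputs above show, the remaining rows keep their original index labels after a drop. The sketch below, on hypothetical toy data, shows that `drop` leaves the original DataFrame untouched by default and how `reset_index` can renumber the rows if desired.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {"country": ["Afghanistan", "Germany", "Zimbabwe"], "cumulative_cases": [0, 11, 266387]}
)

# drop() returns a new DataFrame by default; `toy` keeps all three rows
trimmed = toy.drop(index=[0, 1])

# The surviving row keeps its label (2); reset_index(drop=True) renumbers from 0
renumbered = trimmed.reset_index(drop=True)

print(len(toy), trimmed.index.tolist(), renumbered.index.tolist())
```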


# 5. Filtering DataFrames

### 5.1 Filtering with a single comparison operator

In [60]:
# Let's filter on the country 'Germany'
df_filtered = df[df['country'] == 'Germany']
df_filtered


Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
19032,2020-01-05,DE,Germany,Europe,EURO,1.0,1,3.0,3
19033,2020-01-12,DE,Germany,Europe,EURO,,1,,3
19034,2020-01-19,DE,Germany,Europe,EURO,,1,,3
19035,2020-01-26,DE,Germany,Europe,EURO,1.0,2,3.0,6
19036,2020-02-02,DE,Germany,Europe,EURO,9.0,11,3.0,9
...,...,...,...,...,...,...,...,...,...
19271,2024-08-04,DE,Germany,Europe,EURO,,38437756,,174979
19272,2024-08-11,DE,Germany,Europe,EURO,,38437756,,174979
19273,2024-08-18,DE,Germany,Europe,EURO,,38437756,,174979
19274,2024-08-25,DE,Germany,Europe,EURO,,38437756,,174979


In [61]:
# Let's assign this DataFrame to a new variable called `germany`
germany = df[df['country'] == 'Germany']
germany

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
19032,2020-01-05,DE,Germany,Europe,EURO,1.0,1,3.0,3
19033,2020-01-12,DE,Germany,Europe,EURO,,1,,3
19034,2020-01-19,DE,Germany,Europe,EURO,,1,,3
19035,2020-01-26,DE,Germany,Europe,EURO,1.0,2,3.0,6
19036,2020-02-02,DE,Germany,Europe,EURO,9.0,11,3.0,9
...,...,...,...,...,...,...,...,...,...
19271,2024-08-04,DE,Germany,Europe,EURO,,38437756,,174979
19272,2024-08-11,DE,Germany,Europe,EURO,,38437756,,174979
19273,2024-08-18,DE,Germany,Europe,EURO,,38437756,,174979
19274,2024-08-25,DE,Germany,Europe,EURO,,38437756,,174979


### 5.2 Using other comparison operators such as >, <, >=, <= and !=

In [62]:
# Let's filter for all rows that have cumulative cases over 1 million
germany = germany[germany['cumulative_cases'] > 1000000]
germany



Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
19079,2020-11-29,DE,Germany,Europe,EURO,123122.0,1062290,3601.0,24579
19080,2020-12-06,DE,Germany,Europe,EURO,128258.0,1190548,4480.0,29059
19081,2020-12-13,DE,Germany,Europe,EURO,156213.0,1346761,5762.0,34821
19082,2020-12-20,DE,Germany,Europe,EURO,174589.0,1521350,6460.0,41281
19083,2020-12-27,DE,Germany,Europe,EURO,138828.0,1660178,5728.0,47009
...,...,...,...,...,...,...,...,...,...
19271,2024-08-04,DE,Germany,Europe,EURO,,38437756,,174979
19272,2024-08-11,DE,Germany,Europe,EURO,,38437756,,174979
19273,2024-08-18,DE,Germany,Europe,EURO,,38437756,,174979
19274,2024-08-25,DE,Germany,Europe,EURO,,38437756,,174979


### 5.3 Filtering with multiple conditions

In [31]:
# Filter on the continent 'Asia' with fewer than 1000 cumulative cases
asia_less_cd = df[(df['cumulative_cases'] < 1000) & (df['continent'] == 'Asia')]
asia_less_cd

Unnamed: 0,date_reported,country_code,country,continent,region,new_cases,cumulative_cases,new_deaths,cumulative_deaths
0,2020-01-05,AF,Afghanistan,Asia,EMRO,,0,,0
1,2020-01-12,AF,Afghanistan,Asia,EMRO,,0,,0
2,2020-01-19,AF,Afghanistan,Asia,EMRO,,0,,0
3,2020-01-26,AF,Afghanistan,Asia,EMRO,,0,,0
4,2020-02-02,AF,Afghanistan,Asia,EMRO,,0,,0
...,...,...,...,...,...,...,...,...,...
57848,2020-05-24,YE,Yemen,Asia,EMRO,90.0,212,21.0,39
57849,2020-05-31,YE,Yemen,Asia,EMRO,98.0,310,38.0,77
57850,2020-06-07,YE,Yemen,Asia,EMRO,172.0,482,34.0,111
57851,2020-06-14,YE,Yemen,Asia,EMRO,223.0,705,49.0,160


### 5.4 Other methods

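Each method below returns a boolean mask that can be used inside `df[...]`; here `x` is a column such as `df['country']`.

- `x.isin([...])`: True where the value is in the given list
- `x.between(a, b)`: True where the value lies between `a` and `b` (both bounds included by default)
- `x.isna()`: True where the value is missing
- `x.str.startswith('A')`: True where the string value starts with 'A'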
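
A short sketch of these four methods on hypothetical toy data (the values are made up):

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame(
    {
        "country": ["Afghanistan", "Albania", "Germany", "Zimbabwe"],
        "new_cases": [float("nan"), 5.0, 9.0, 1.0],
        "cumulative_cases": [0, 50, 11, 266387],
    }
)

in_list = toy[toy["country"].isin(["Germany", "Zimbabwe"])]
in_range = toy[toy["cumulative_cases"].between(10, 100)]   # 10 and 100 included
missing = toy[toy["new_cases"].isna()]
starts_a = toy[toy["country"].str.startswith("A")]
```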

# 6. Save a new dataset to a CSV file

In [32]:
# We can save a DataFrame to a CSV file; index=False leaves the row index out of the file

germany.to_csv("./germany_covid19_cases.csv", index=False)
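
To see what `index=False` changes, the sketch below writes a small hypothetical DataFrame twice, once with and once without the index, and reads both results back. It writes to in-memory buffers (`io.StringIO`) instead of real files so that it can run anywhere.

```python
import io
import pandas as pd

# Hypothetical toy data with non-default row labels
toy = pd.DataFrame(
    {"country": ["Germany", "Germany"], "cumulative_cases": [1062290, 1190548]},
    index=[19079, 19080],
)

# Default behaviour: the row index is written as an extra, unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()

# index=False: only the data columns are written
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()

print(cols_with)
print(cols_without)
```

When the index is written, reading the file back produces an additional column that pandas names `Unnamed: 0`.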

# Readings
**pandas user guide**
https://pandas.pydata.org/docs/user_guide/index.html#user-guide