# Python Data Science

> Dataframe Wrangling with Pandas

Kuo, Yao-Jen from [DATAINPOINT](https://www.datainpoint.com/)

In [1]:
import requests
import json
from datetime import date
from datetime import timedelta

## TL;DR

> In this lecture, we will talk about essential data wrangling skills in `pandas`.

## Essential Data Wrangling Skills in `pandas`

## What is `pandas`?

> Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more.

Source: <https://github.com/pandas-dev/pandas>

## Why `pandas`?

Python used to have a weak spot in data analysis because it lacked an appropriate structure for handling common tabular datasets. Before `pandas` arrived, Python users had to switch to a more data-centric language such as R or Matlab for the analysis stage.

## Import Pandas with `import` command

Pandas is conventionally aliased as `pd`.

In [2]:
import pandas as pd

## If Pandas is not installed, we will encounter a `ModuleNotFoundError`

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'pandas'
```

## Use `pip install` in a terminal to install pandas

```bash
pip install pandas
```

## Check version and its installation file path

- `__version__` attribute
- `__file__` attribute

In [3]:
print(pd.__version__)
print(pd.__file__)

1.1.3
/opt/conda/lib/python3.8/site-packages/pandas/__init__.py


## What does `pandas` mean?

![](https://media.giphy.com/media/46Zj6ze2Z2t4k/giphy.gif)

Source: <https://giphy.com/>

## It turns out the name has nothing to do with the animal; it can be read as a combination of the three primary classes designed by its author, [Wes McKinney](https://wesmckinney.com/)

- **Pan**el (deprecated since version 0.20.0)
- **Da**taFrame
- **S**eries

## In order to master `pandas`, it is vital to understand the relationships between `Index`, `ndarray`, `Series`, and `DataFrame`

- An `Index` and an `ndarray` together make up a `Series`
- Several `Series` sharing the same `Index` can then form a `DataFrame`

## `Index` from Pandas

The simplest way to create an `Index` is to use `pd.Index()`.

In [4]:
prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
print(type(prime_indices))

<class 'pandas.core.indexes.numeric.Int64Index'>


## An `Index` is like a combination of `tuple` and `set`

- It is immutable.
- It has the characteristics of a set.

In [5]:
# It is immutable
prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
#prime_indices[-1] = 31
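
The commented-out assignment above fails if it is run. A minimal sketch of what happens, and of one way to get a longer `Index` without mutating the original (the name `extended_indices` is only illustrative):

```python
import pandas as pd

prime_indices = pd.Index([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])

try:
    prime_indices[-1] = 31  # item assignment is not supported on an Index
    mutated = True
except TypeError as error:
    mutated = False
    print("TypeError:", error)

# Build a new Index instead of modifying the existing one
extended_indices = prime_indices.append(pd.Index([31]))
print(len(prime_indices), len(extended_indices))
```

The original `prime_indices` is left untouched; `append` returns a new object.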

In [6]:
# It has the characteristics of a set
odd_indices = pd.Index(range(1, 30, 2))
print(prime_indices.intersection(odd_indices)) # prime_indices & odd_indices
print(prime_indices.union(odd_indices)) # prime_indices | odd_indices
print(prime_indices.symmetric_difference(odd_indices)) # prime_indices ^ odd_indices
print(prime_indices.difference(odd_indices))
print(odd_indices.difference(prime_indices))

Int64Index([3, 5, 7, 11, 13, 17, 19, 23, 29], dtype='int64')
Int64Index([1, 2, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29], dtype='int64')
Int64Index([1, 2, 9, 15, 21, 25, 27], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([1, 9, 15, 21, 25, 27], dtype='int64')


## `Series` from Pandas

The simplest way to create a `Series` is to use `pd.Series()`.

In [7]:
prime_series = pd.Series([2, 3, 5, 7, 11, 13, 17, 19, 23, 29])
print(type(prime_series))

<class 'pandas.core.series.Series'>


## A `Series` is a combination of `Index` and `ndarray`

In [8]:
print(type(prime_series.index))
print(type(prime_series.values))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'numpy.ndarray'>
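
To make the combination explicit, here is a small sketch that assembles a `Series` from a hand-made `Index` and an `ndarray`. The labels and the name `labeled_primes` are made up for illustration.

```python
import numpy as np
import pandas as pd

values = np.array([2, 3, 5, 7, 11])
labels = pd.Index(["first", "second", "third", "fourth", "fifth"])

# The Index supplies the labels, the ndarray supplies the data
labeled_primes = pd.Series(values, index=labels)

print(labeled_primes["third"])        # label-based access through the Index
print(labeled_primes.to_numpy()[2])   # position-based access through the ndarray
```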


## `DataFrame` from Pandas

The simplest way to create a `DataFrame` is to use `pd.DataFrame()`.

In [9]:
movie_df = pd.DataFrame()
movie_df["title"] = ["The Shawshank Redemption", "The Dark Knight", "Schindler's List", "Forrest Gump", "Inception"]
movie_df["imdb_rating"] = [9.3, 9.0, 8.9, 8.8, 8.7]
print(type(movie_df))

<class 'pandas.core.frame.DataFrame'>


## A `DataFrame` is a combination of multiple `Series` sharing the same `Index`

In [10]:
print(type(movie_df.index))
print(type(movie_df["title"]))
print(type(movie_df["imdb_rating"]))

<class 'pandas.core.indexes.range.RangeIndex'>
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
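
The same relationship can be shown in the other direction. This sketch builds a `DataFrame` from two `Series` that share one `Index`; the labels `m1` and `m2` are hypothetical.

```python
import pandas as pd

shared_labels = ["m1", "m2"]  # hypothetical row labels
titles = pd.Series(["The Shawshank Redemption", "The Dark Knight"], index=shared_labels)
ratings = pd.Series([9.3, 9.0], index=shared_labels)

# Each Series becomes a column; the shared Index becomes the row labels
movies = pd.DataFrame({"title": titles, "imdb_rating": ratings})

print(movies.index.equals(titles.index))
print(movies.loc["m2", "imdb_rating"])
```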


## Review of the definition of modern data science

> Modern data science is a huge field. It involves applications and tools like importing, tidying, transformation, visualization, modeling, and communication. Surrounding all these is programming.

![Imgur](https://i.imgur.com/din6Ig6.png)

Source: [R for Data Science](https://r4ds.had.co.nz/)

## Key functionalities analysts rely on `pandas` for

- Importing
- Tidying
- Transforming

## Tidying and transforming together are also known as WRANGLING

![](https://media.giphy.com/media/MnlZWRFHR4xruE4N2Z/giphy.gif)

Source: <https://giphy.com/>

## Importing

## `pandas` has many functions for importing tabular data

- Flat text file
- Database table
- Spreadsheet
- Array of JSONs
- HTML `<table></table>` tags
- ...etc.

Source: <https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html>

## Using `read_csv` function for flat text files

In [11]:
def get_covid19_latest_daily_report():
    """
    Get the latest daily report (world) from:
    https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_daily_reports
    """
    data_date = date.today()
    data_date_delta = timedelta(days=1)
    daily_report_url_no_date = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_daily_reports/{}.csv"
    while True:
        # File names follow the MM-DD-YYYY pattern
        data_date_str = data_date.strftime('%m-%d-%Y')
        daily_report_url = daily_report_url_no_date.format(data_date_str)
        try:
            print("Trying to load the daily report for {}".format(data_date_str))
            daily_report = pd.read_csv(daily_report_url)
            print("File exists; retrieved the daily report for {}".format(data_date_str))
            break
        except Exception:
            # The file is missing or the request failed, so step back one day and retry
            print("The file for {} has not been uploaded yet".format(data_date_str))
            data_date -= data_date_delta # data_date = data_date - data_date_delta
    return daily_report

In [12]:
daily_report = get_covid19_latest_daily_report()

Trying to load the daily report for 12-13-2020
The file for 12-13-2020 has not been uploaded yet
Trying to load the daily report for 12-12-2020
The file for 12-12-2020 has not been uploaded yet
Trying to load the daily report for 12-11-2020
File exists; retrieved the daily report for 12-11-2020


## Using `read_sql` function for database tables

```python
import sqlite3

conn = sqlite3.connect('YOUR_DATABASE.db')
sql_query = """
SELECT * 
  FROM YOUR_TABLE
 LIMIT 100;
"""
pd.read_sql(sql_query, conn)
```
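
For readers without a database file at hand, the following sketch runs the same idea against an in-memory SQLite database. The `movies` table and its rows are created here purely for illustration.

```python
import sqlite3
import pandas as pd

# An in-memory database stands in for YOUR_DATABASE.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT, imdb_rating REAL)")
conn.executemany(
    "INSERT INTO movies VALUES (?, ?)",
    [("The Shawshank Redemption", 9.3), ("The Dark Knight", 9.0)],
)
conn.commit()

# The query result comes back as a DataFrame
movies_from_sql = pd.read_sql("SELECT * FROM movies ORDER BY imdb_rating DESC", conn)
conn.close()
print(movies_from_sql)
```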

## Using `read_excel` function for spreadsheets

```python
excel_file_path = "PATH/TO/YOUR/EXCEL/FILE"
pd.read_excel(excel_file_path)
```

## Using `read_json` function for array of JSONs

```python
json_file_path = "PATH/TO/YOUR/JSON/FILE"
pd.read_json(json_file_path)
```

## What is JSON?

> JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write. It is easy for machines to parse and generate. It is based on a subset of the JavaScript Programming Language. JSON is a text format that is completely language independent but uses conventions that are familiar to programmers of the C-family of languages, including C, C++, C#, Java, JavaScript, Perl, Python, and many others. These properties make JSON an ideal data-interchange language.

Source: <https://www.json.org/json-en.html>
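
As a small illustration of an array of JSON objects, the sketch below parses an in-memory string instead of a file. The `json_text` content is made up, and `orient="records"` is passed explicitly to state that each object is one row.

```python
from io import StringIO

import pandas as pd

json_text = (
    '[{"title": "Forrest Gump", "imdb_rating": 8.8},'
    ' {"title": "Inception", "imdb_rating": 8.7}]'
)

# Wrapping the string in StringIO lets read_json treat it like a file
movies_from_json = pd.read_json(StringIO(json_text), orient="records")
print(movies_from_json)
```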

## Using `read_html` function for HTML `<table></table>` tags

> The `<table>` tag defines an HTML table. An HTML table consists of one `<table>` element and one or more `<tr>`, `<th>`, and `<td>` elements. The `<tr>` element defines a table row, the `<th>` element defines a table header, and the `<td>` element defines a table cell.

Source: <https://www.w3schools.com/default.asp>

In [13]:
request_url = "https://www.imdb.com/chart/top"
html_tables = pd.read_html(request_url)
print(type(html_tables))
print(len(html_tables))

<class 'list'>
1


In [14]:
html_tables[0]

Unnamed: 0.1,Unnamed: 0,Rank & Title,IMDb Rating,Your Rating,Unnamed: 4
0,,1. 刺激1995 (1994),9.2,12345678910 NOT YET RELEASED Seen,
1,,2. 教父 (1972),9.1,12345678910 NOT YET RELEASED Seen,
2,,3. 教父第二集 (1974),9.0,12345678910 NOT YET RELEASED Seen,
3,,4. 黑暗騎士 (2008),9.0,12345678910 NOT YET RELEASED Seen,
4,,5. 十二怒漢 (1957),8.9,12345678910 NOT YET RELEASED Seen,
...,...,...,...,...,...
245,,246. 公主新娘 (1987),8.0,12345678910 NOT YET RELEASED Seen,
246,,247. 阿爾及爾之戰 (1966),8.0,12345678910 NOT YET RELEASED Seen,
247,,248. 橘子收成時 (2013),8.0,12345678910 NOT YET RELEASED Seen,
248,,249. 電影版聲之形 (2016),8.0,12345678910 NOT YET RELEASED Seen,


## Basic attributes and methods

## Basic attributes of a `DataFrame` object

- `shape`
- `dtypes`
- `index`
- `columns`

In [15]:
print(daily_report.shape)
print(daily_report.dtypes)
print(daily_report.index)
print(daily_report.columns)

(3976, 14)
FIPS                   float64
Admin2                  object
Province_State          object
Country_Region          object
Last_Update             object
Lat                    float64
Long_                  float64
Confirmed                int64
Deaths                   int64
Recovered                int64
Active                 float64
Combined_Key            object
Incident_Rate          float64
Case_Fatality_Ratio    float64
dtype: object
RangeIndex(start=0, stop=3976, step=1)
Index(['FIPS', 'Admin2', 'Province_State', 'Country_Region', 'Last_Update',
       'Lat', 'Long_', 'Confirmed', 'Deaths', 'Recovered', 'Active',
       'Combined_Key', 'Incident_Rate', 'Case_Fatality_Ratio'],
      dtype='object')


## Basic methods of a `DataFrame` object

- `head(n)`
- `tail(n)`
- `describe`
- `info`
- `set_index`
- `reset_index`

## `head(n)` returns the top n observations with header

In [16]:
daily_report.head() # n defaults to 5

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2020-12-12 05:26:19,33.93911,67.709953,48116,1945,38141,8030.0,Afghanistan,123.601466,4.042314
1,,,,Albania,2020-12-12 05:26:19,41.1533,20.1683,46863,977,24136,21750.0,Albania,1628.431441,2.0848
2,,,,Algeria,2020-12-12 05:26:19,28.0339,1.6596,91121,2575,59590,28956.0,Algeria,207.796654,2.825913
3,,,,Andorra,2020-12-12 05:26:19,42.5063,1.5218,7236,78,6598,560.0,Andorra,9365.171811,1.077944
4,,,,Angola,2020-12-12 05:26:19,-11.2027,17.8739,16061,365,8798,6898.0,Angola,48.867733,2.272586


## `tail(n)` returns the bottom n observations with header

In [17]:
daily_report.tail(3)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
3973,,,,Yemen,2020-12-12 05:26:19,15.552727,48.516388,2082,606,1383,93.0,Yemen,6.980494,29.106628
3974,,,,Zambia,2020-12-12 05:26:19,-13.133897,27.849332,18161,365,17329,467.0,Zambia,98.787225,2.009801
3975,,,,Zimbabwe,2020-12-12 05:26:19,-19.015438,29.154857,11162,306,9324,1532.0,Zimbabwe,75.099609,2.741444


## `describe` returns the descriptive summary for numeric columns

In [18]:
daily_report.describe()

Unnamed: 0,FIPS,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case_Fatality_Ratio
count,3263.0,3890.0,3890.0,3976.0,3976.0,3976.0,3974.0,3890.0,3934.0
mean,32405.273675,36.004323,-72.036928,17877.66,401.100855,11405.92,6073.491,4676.883893,1.905616
std,18007.1615,12.968793,53.506558,91457.23,2387.924952,118376.9,111954.9,2847.216796,3.462757
min,66.0,-52.368,-174.1596,0.0,0.0,0.0,-6135314.0,0.0,0.0
25%,19052.0,33.274355,-96.591968,549.25,7.0,0.0,470.25,2690.317591,0.886351
50%,30069.0,37.953523,-86.85422,1522.0,24.0,0.0,1329.5,4535.390075,1.504688
75%,47040.0,42.217931,-77.5045,6178.0,91.0,0.0,4075.75,6329.453684,2.361019
max,99999.0,71.7069,178.065,2348795.0,57210.0,6135314.0,2138952.0,23444.976077,178.887765


## `info` prints a concise summary of the dataframe

In [19]:
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3976 entries, 0 to 3975
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3263 non-null   float64
 1   Admin2               3268 non-null   object 
 2   Province_State       3806 non-null   object 
 3   Country_Region       3976 non-null   object 
 4   Last_Update          3976 non-null   object 
 5   Lat                  3890 non-null   float64
 6   Long_                3890 non-null   float64
 7   Confirmed            3976 non-null   int64  
 8   Deaths               3976 non-null   int64  
 9   Recovered            3976 non-null   int64  
 10  Active               3974 non-null   float64
 11  Combined_Key         3976 non-null   object 
 12  Incident_Rate        3890 non-null   float64
 13  Case_Fatality_Ratio  3934 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 435.0+ KB


## `set_index` replaces the current `Index` with a specified column

In [20]:
daily_report.set_index('Combined_Key')

Combined_Key,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case_Fatality_Ratio
Afghanistan,,,,Afghanistan,2020-12-12 05:26:19,33.939110,67.709953,48116,1945,38141,8030.0,123.601466,4.042314
Albania,,,,Albania,2020-12-12 05:26:19,41.153300,20.168300,46863,977,24136,21750.0,1628.431441,2.084800
Algeria,,,,Algeria,2020-12-12 05:26:19,28.033900,1.659600,91121,2575,59590,28956.0,207.796654,2.825913
Andorra,,,,Andorra,2020-12-12 05:26:19,42.506300,1.521800,7236,78,6598,560.0,9365.171811,1.077944
Angola,,,,Angola,2020-12-12 05:26:19,-11.202700,17.873900,16061,365,8798,6898.0,48.867733,2.272586
...,...,...,...,...,...,...,...,...,...,...,...,...,...
Vietnam,,,,Vietnam,2020-12-12 05:26:19,14.058324,108.277199,1391,35,1238,118.0,1.429033,2.516175
West Bank and Gaza,,,,West Bank and Gaza,2020-12-12 05:26:19,31.952200,35.233200,106622,931,81166,24525.0,2090.047156,0.873178
Yemen,,,,Yemen,2020-12-12 05:26:19,15.552727,48.516388,2082,606,1383,93.0,6.980494,29.106628
Zambia,,,,Zambia,2020-12-12 05:26:19,-13.133897,27.849332,18161,365,17329,467.0,98.787225,2.009801


## `reset_index` replaces the current `Index` with the default `RangeIndex`

In [21]:
daily_report.set_index('Combined_Key').reset_index()

Unnamed: 0,Combined_Key,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Incident_Rate,Case_Fatality_Ratio
0,Afghanistan,,,,Afghanistan,2020-12-12 05:26:19,33.939110,67.709953,48116,1945,38141,8030.0,123.601466,4.042314
1,Albania,,,,Albania,2020-12-12 05:26:19,41.153300,20.168300,46863,977,24136,21750.0,1628.431441,2.084800
2,Algeria,,,,Algeria,2020-12-12 05:26:19,28.033900,1.659600,91121,2575,59590,28956.0,207.796654,2.825913
3,Andorra,,,,Andorra,2020-12-12 05:26:19,42.506300,1.521800,7236,78,6598,560.0,9365.171811,1.077944
4,Angola,,,,Angola,2020-12-12 05:26:19,-11.202700,17.873900,16061,365,8798,6898.0,48.867733,2.272586
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3971,Vietnam,,,,Vietnam,2020-12-12 05:26:19,14.058324,108.277199,1391,35,1238,118.0,1.429033,2.516175
3972,West Bank and Gaza,,,,West Bank and Gaza,2020-12-12 05:26:19,31.952200,35.233200,106622,931,81166,24525.0,2090.047156,0.873178
3973,Yemen,,,,Yemen,2020-12-12 05:26:19,15.552727,48.516388,2082,606,1383,93.0,6.980494,29.106628
3974,Zambia,,,,Zambia,2020-12-12 05:26:19,-13.133897,27.849332,18161,365,17329,467.0,98.787225,2.009801


## Basic Dataframe Wrangling

## Basic wrangling is like writing SQL queries

- Selecting: `SELECT FROM`
- Filtering: `WHERE`
- Subsetting: `SELECT FROM WHERE`
- Indexing
- Sorting: `ORDER BY`
- Deriving
- Summarizing
- Summarizing and Grouping: `GROUP BY`

## Selecting a column as `Series`

In [22]:
print(daily_report['Country_Region'])
print(type(daily_report['Country_Region']))

0              Afghanistan
1                  Albania
2                  Algeria
3                  Andorra
4                   Angola
               ...        
3971               Vietnam
3972    West Bank and Gaza
3973                 Yemen
3974                Zambia
3975              Zimbabwe
Name: Country_Region, Length: 3976, dtype: object
<class 'pandas.core.series.Series'>


## Selecting a column as `DataFrame`

In [23]:
print(type(daily_report[['Country_Region']]))
daily_report[['Country_Region']]

<class 'pandas.core.frame.DataFrame'>


Unnamed: 0,Country_Region
0,Afghanistan
1,Albania
2,Algeria
3,Andorra
4,Angola
...,...
3971,Vietnam
3972,West Bank and Gaza
3973,Yemen
3974,Zambia


## Selecting multiple columns returns a `DataFrame`

In [24]:
cols = ['Country_Region', 'Province_State']
daily_report[cols]

Unnamed: 0,Country_Region,Province_State
0,Afghanistan,
1,Albania,
2,Algeria,
3,Andorra,
4,Angola,
...,...,...
3971,Vietnam,
3972,West Bank and Gaza,
3973,Yemen,
3974,Zambia,


## Filtering rows with conditional statements

In [25]:
is_taiwan = daily_report['Country_Region'] == 'Taiwan*'
daily_report[is_taiwan]

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
640,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517


## Subsetting columns and rows simultaneously

In [26]:
cols_to_select = ['Country_Region', 'Confirmed']
rows_to_filter = daily_report['Country_Region'] == 'Taiwan*'
daily_report[rows_to_filter][cols_to_select]

Unnamed: 0,Country_Region,Confirmed
640,Taiwan*,725
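
The chained brackets above can also be written as a single `loc[]` call that takes the row filter and the column list together. A minimal sketch on a toy frame (the Japan and US figures are illustrative, not taken from the report):

```python
import pandas as pd

report = pd.DataFrame({
    "Country_Region": ["Taiwan*", "Japan", "US"],
    "Confirmed": [725, 100, 200],  # only the Taiwan* value mirrors the output above
    "Deaths": [7, 1, 2],
})

rows_to_filter = report["Country_Region"] == "Taiwan*"
cols_to_select = ["Country_Region", "Confirmed"]

# One indexing step: rows first, then columns
subset = report.loc[rows_to_filter, cols_to_select]
print(subset)
```

Doing both selections in one step avoids creating an intermediate `DataFrame`.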


## Indexing `DataFrame` with

- `loc[]`
- `iloc[]`

## `loc[]` indexes a `DataFrame` by `Index` labels

In [27]:
print(daily_report.loc[3388, ['Country_Region', 'Confirmed']]) # as Series
daily_report.loc[[3388], ['Country_Region', 'Confirmed']] # as DataFrame

Country_Region    US
Confirmed         35
Name: 3388, dtype: object


Unnamed: 0,Country_Region,Confirmed
3388,US,35


## `iloc[]` indexes a `DataFrame` by integer position

In [28]:
print(daily_report.iloc[3388, [3, 7]]) # as Series
daily_report.iloc[[3388], [3, 7]] # as DataFrame

Country_Region    US
Confirmed         35
Name: 3388, dtype: object


Unnamed: 0,Country_Region,Confirmed
3388,US,35
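
With the default `RangeIndex`, labels and positions coincide, which is why both cells above return the same row. The sketch below uses a deliberately shuffled index (made-up values) to show where the two diverge.

```python
import pandas as pd

report = pd.DataFrame(
    {"Confirmed": [10, 20, 30]},
    index=[2, 0, 1],  # labels deliberately out of order
)

by_label = report.loc[0, "Confirmed"]   # the row whose label is 0 (second row)
by_position = report.iloc[0, 0]         # the first row, whatever its label

print(by_label, by_position)
```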


## Sorting `DataFrame` with

- `sort_values`
- `sort_index`

## `sort_values` sorts a `DataFrame` by specific columns

In [29]:
daily_report.sort_values(['Country_Region', 'Confirmed'])

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
0,,,,Afghanistan,2020-12-12 05:26:19,33.939110,67.709953,48116,1945,38141,8030.0,Afghanistan,123.601466,4.042314
1,,,,Albania,2020-12-12 05:26:19,41.153300,20.168300,46863,977,24136,21750.0,Albania,1628.431441,2.084800
2,,,,Algeria,2020-12-12 05:26:19,28.033900,1.659600,91121,2575,59590,28956.0,Algeria,207.796654,2.825913
3,,,,Andorra,2020-12-12 05:26:19,42.506300,1.521800,7236,78,6598,560.0,Andorra,9365.171811,1.077944
4,,,,Angola,2020-12-12 05:26:19,-11.202700,17.873900,16061,365,8798,6898.0,Angola,48.867733,2.272586
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3971,,,,Vietnam,2020-12-12 05:26:19,14.058324,108.277199,1391,35,1238,118.0,Vietnam,1.429033,2.516175
3972,,,,West Bank and Gaza,2020-12-12 05:26:19,31.952200,35.233200,106622,931,81166,24525.0,West Bank and Gaza,2090.047156,0.873178
3973,,,,Yemen,2020-12-12 05:26:19,15.552727,48.516388,2082,606,1383,93.0,Yemen,6.980494,29.106628
3974,,,,Zambia,2020-12-12 05:26:19,-13.133897,27.849332,18161,365,17329,467.0,Zambia,98.787225,2.009801
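
When sorting by several columns, each one can have its own direction. A short sketch with made-up numbers, sorting countries alphabetically and confirmed counts from largest to smallest within each country:

```python
import pandas as pd

report = pd.DataFrame({
    "Country_Region": ["US", "Japan", "US", "Japan"],
    "Confirmed": [35, 500, 900, 20],  # illustrative values
})

# ascending takes one flag per sort column
sorted_report = report.sort_values(
    ["Country_Region", "Confirmed"],
    ascending=[True, False],
)
print(sorted_report)
```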


## `sort_index` sorts a `DataFrame` by its `Index`

In [30]:
daily_report.sort_index(ascending=False)

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
3975,,,,Zimbabwe,2020-12-12 05:26:19,-19.015438,29.154857,11162,306,9324,1532.0,Zimbabwe,75.099609,2.741444
3974,,,,Zambia,2020-12-12 05:26:19,-13.133897,27.849332,18161,365,17329,467.0,Zambia,98.787225,2.009801
3973,,,,Yemen,2020-12-12 05:26:19,15.552727,48.516388,2082,606,1383,93.0,Yemen,6.980494,29.106628
3972,,,,West Bank and Gaza,2020-12-12 05:26:19,31.952200,35.233200,106622,931,81166,24525.0,West Bank and Gaza,2090.047156,0.873178
3971,,,,Vietnam,2020-12-12 05:26:19,14.058324,108.277199,1391,35,1238,118.0,Vietnam,1.429033,2.516175
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4,,,,Angola,2020-12-12 05:26:19,-11.202700,17.873900,16061,365,8798,6898.0,Angola,48.867733,2.272586
3,,,,Andorra,2020-12-12 05:26:19,42.506300,1.521800,7236,78,6598,560.0,Andorra,9365.171811,1.077944
2,,,,Algeria,2020-12-12 05:26:19,28.033900,1.659600,91121,2575,59590,28956.0,Algeria,207.796654,2.825913
1,,,,Albania,2020-12-12 05:26:19,41.153300,20.168300,46863,977,24136,21750.0,Albania,1628.431441,2.084800


## Deriving new variables from `DataFrame`

- Simple operations
- `pd.cut`
- `map` with a `dict`
- `map` with a function (or a lambda expression)

## Deriving new variable with simple operations

In [31]:
active = daily_report['Confirmed'] - daily_report['Deaths'] - daily_report['Recovered']
print(active)

0        8030
1       21750
2       28956
3         560
4        6898
        ...  
3971      118
3972    24525
3973       93
3974      467
3975     1532
Length: 3976, dtype: int64


## Deriving categorical from numerical with `pd.cut`

In [32]:
import numpy as np

cut_bins = [0, 1000, 10000, 100000, np.inf]
cut_labels = ['Less than 1000', 'Between 1000 and 10000', 'Between 10000 and 100000', 'Above 100000']
confirmed_categorical = pd.cut(daily_report['Confirmed'], bins=cut_bins, labels=cut_labels, right=False)
print(confirmed_categorical)

0       Between 10000 and 100000
1       Between 10000 and 100000
2       Between 10000 and 100000
3         Between 1000 and 10000
4       Between 10000 and 100000
                  ...           
3971      Between 1000 and 10000
3972                Above 100000
3973      Between 1000 and 10000
3974    Between 10000 and 100000
3975    Between 10000 and 100000
Name: Confirmed, Length: 3976, dtype: category
Categories (4, object): ['Less than 1000' < 'Between 1000 and 10000' < 'Between 10000 and 100000' < 'Above 100000']
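
The `right=False` argument matters at the bin edges. The sketch below, using a handful of made-up values, compares left-closed bins with the default right-closed bins.

```python
import numpy as np
import pandas as pd

confirmed = pd.Series([0, 999, 1000, 250000])  # illustrative values on and around the edges
cut_bins = [0, 1000, 10000, 100000, np.inf]
cut_labels = ['Less than 1000', 'Between 1000 and 10000', 'Between 10000 and 100000', 'Above 100000']

# right=False: bins are [0, 1000), [1000, 10000), ... so 0 is included and 1000 moves up a bin
left_closed = pd.cut(confirmed, bins=cut_bins, labels=cut_labels, right=False)

# default right=True: bins are (0, 1000], (1000, 10000], ... so 0 falls outside every bin
right_closed = pd.cut(confirmed, bins=cut_bins, labels=cut_labels)

print(pd.DataFrame({"value": confirmed, "right=False": left_closed, "right=True": right_closed}))
```

With the labels used here, left-closed bins are the ones that match the wording "Less than 1000".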


## Deriving categorical from categorical with `map`

- Passing a `dict`
- Passing a function (or a lambda expression)

In [33]:
# Passing a dict
country_name = {
    'Taiwan*': 'Taiwan'
}
daily_report_tw = daily_report[is_taiwan]
daily_report_tw['Country_Region'].map(country_name)

640    Taiwan
Name: Country_Region, dtype: object
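
One detail worth knowing: `map` with a `dict` turns every value that is not a key into a missing value. The sketch below shows this on a made-up three-country `Series`, alongside `replace`, which keeps unmatched values as they are.

```python
import pandas as pd

countries = pd.Series(["Taiwan*", "Japan", "Korea, South"])
country_name = {"Taiwan*": "Taiwan"}

mapped = countries.map(country_name)        # values missing from the dict become NaN
renamed = countries.replace(country_name)   # values missing from the dict are kept

print(mapped)
print(renamed)
```

This is why the example above filters down to the Taiwan row before mapping.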

In [34]:
# Passing a function
def is_us(x):
    if x == 'US':
        return 'US'
    else:
        return 'Not US'
daily_report['Country_Region'].map(is_us)

0       Not US
1       Not US
2       Not US
3       Not US
4       Not US
         ...  
3971    Not US
3972    Not US
3973    Not US
3974    Not US
3975    Not US
Name: Country_Region, Length: 3976, dtype: object

In [35]:
# Passing a lambda expression
daily_report['Country_Region'].map(lambda x: 'US' if x == 'US' else 'Not US')

0       Not US
1       Not US
2       Not US
3       Not US
4       Not US
         ...  
3971    Not US
3972    Not US
3973    Not US
3974    Not US
3975    Not US
Name: Country_Region, Length: 3976, dtype: object

## Summarizing `DataFrame` with aggregate methods

In [36]:
daily_report['Confirmed'].sum()

71081574

## Summarizing and grouping `DataFrame` with aggregate methods

In [37]:
daily_report.groupby('Country_Region')['Confirmed'].sum()

Country_Region
Afghanistan            48116
Albania                46863
Algeria                91121
Andorra                 7236
Angola                 16061
                       ...  
Vietnam                 1391
West Bank and Gaza    106622
Yemen                   2082
Zambia                 18161
Zimbabwe               11162
Name: Confirmed, Length: 191, dtype: int64
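
Several summaries can be computed per group in one pass with `agg`. This is a sketch on a toy frame; the numbers and the output column names `total_confirmed` and `max_confirmed` are made up for illustration.

```python
import pandas as pd

report = pd.DataFrame({
    "Country_Region": ["US", "US", "Japan"],
    "Confirmed": [35, 65, 20],  # illustrative values
})

# Named aggregation: new_column=(source_column, aggregate_function)
summary = report.groupby("Country_Region").agg(
    total_confirmed=("Confirmed", "sum"),
    max_confirmed=("Confirmed", "max"),
)
print(summary)
```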

## More Dataframe Wrangling Operations

## Other common `DataFrame` wrangling operations include

- Dealing with missing values
- Dealing with text values
- Reshaping dataframes
- Merging and joining dataframes

## Dealing with missing values

- Using `isnull` or `notnull` to check whether `np.nan` exists
- Using `dropna` to drop rows with `np.nan`
- Using `fillna` to fill `np.nan` with specific values

In [38]:
print(daily_report['Province_State'].size)
print(daily_report['Province_State'].isnull().sum())
print(daily_report['Province_State'].notnull().sum())

3976
170
3806


In [39]:
print(daily_report.dropna().shape)
print(daily_report['FIPS'].fillna(0))

(3193, 14)
0       0.0
1       0.0
2       0.0
3       0.0
4       0.0
       ... 
3971    0.0
3972    0.0
3973    0.0
3974    0.0
3975    0.0
Name: FIPS, Length: 3976, dtype: float64
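
`dropna()` without arguments removes a row if any of its columns is missing, which explains the drop from 3976 to 3193 rows above. The sketch below uses a three-row toy frame (made-up values) to contrast that default with the `subset` argument.

```python
import numpy as np
import pandas as pd

report = pd.DataFrame({
    "Province_State": ["Alabama", np.nan, "Alaska"],
    "FIPS": [1.0, np.nan, np.nan],  # illustrative values
})

complete_rows = report.dropna()                           # drop rows with any missing value
has_province = report.dropna(subset=["Province_State"])   # only require Province_State
filled_fips = report["FIPS"].fillna(0)                    # replace missing values with 0

print(complete_rows.shape, has_province.shape)
print(filled_fips.tolist())
```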


## Dealing with text values

Prior to `pandas` 1.0, `object` dtype was the only option.

In [40]:
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3976 entries, 0 to 3975
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3263 non-null   float64
 1   Admin2               3268 non-null   object 
 2   Province_State       3806 non-null   object 
 3   Country_Region       3976 non-null   object 
 4   Last_Update          3976 non-null   object 
 5   Lat                  3890 non-null   float64
 6   Long_                3890 non-null   float64
 7   Confirmed            3976 non-null   int64  
 8   Deaths               3976 non-null   int64  
 9   Recovered            3976 non-null   int64  
 10  Active               3974 non-null   float64
 11  Combined_Key         3976 non-null   object 
 12  Incident_Rate        3890 non-null   float64
 13  Case_Fatality_Ratio  3934 non-null   float64
dtypes: float64(6), int64(3), object(5)
memory usage: 435.0+ KB


## Now we can specify the `string` dtype for text values

In [41]:
for col in daily_report.columns:
    if daily_report[col].dtype == 'object':
        daily_report[col] = daily_report[col].astype('string')
daily_report.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3976 entries, 0 to 3975
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   FIPS                 3263 non-null   float64
 1   Admin2               3268 non-null   string 
 2   Province_State       3806 non-null   string 
 3   Country_Region       3976 non-null   string 
 4   Last_Update          3976 non-null   string 
 5   Lat                  3890 non-null   float64
 6   Long_                3890 non-null   float64
 7   Confirmed            3976 non-null   int64  
 8   Deaths               3976 non-null   int64  
 9   Recovered            3976 non-null   int64  
 10  Active               3974 non-null   float64
 11  Combined_Key         3976 non-null   string 
 12  Incident_Rate        3890 non-null   float64
 13  Case_Fatality_Ratio  3934 non-null   float64
dtypes: float64(6), int64(3), string(5)
memory usage: 435.0 KB


## Splitting strings with `str.split` as a `Series`

In [42]:
split_pattern = ', '
daily_report['Combined_Key'].str.split(split_pattern)

0              [Afghanistan]
1                  [Albania]
2                  [Algeria]
3                  [Andorra]
4                   [Angola]
                ...         
3971               [Vietnam]
3972    [West Bank and Gaza]
3973                 [Yemen]
3974                [Zambia]
3975              [Zimbabwe]
Name: Combined_Key, Length: 3976, dtype: object

## Splitting strings with `str.split` as a `DataFrame`

In [43]:
split_pattern = ', '
daily_report['Combined_Key'].str.split(split_pattern, expand=True)

Unnamed: 0,0,1,2
0,Afghanistan,,
1,Albania,,
2,Algeria,,
3,Andorra,,
4,Angola,,
...,...,...,...
3971,Vietnam,,
3972,West Bank and Gaza,,
3973,Yemen,,
3974,Zambia,,


## Along with the new `string` data type, `pd.NA` is introduced 

In [44]:
split_key = daily_report['Combined_Key'].str.split(split_pattern, expand=True)
print(split_key[1][3408] is pd.NA)

False
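
The check above prints `False` because the selected cell holds a value. For contrast, here is a minimal sketch with a made-up two-element `Series` in which one entry really is missing:

```python
import pandas as pd

text = pd.Series(["Taipei, Taiwan", None], dtype="string")

# In a string-dtype Series, missing entries are represented by pd.NA
second_is_na = text[1] is pd.NA
print(second_is_na)
print(text.isna().tolist())
```

In practice `isna()` is the safer test, since it works regardless of which missing-value marker a column uses.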


## Replacing strings with `str.replace`

In [45]:
daily_report['Combined_Key'].str.replace(", ", ';')

0              Afghanistan
1                  Albania
2                  Algeria
3                  Andorra
4                   Angola
               ...        
3971               Vietnam
3972    West Bank and Gaza
3973                 Yemen
3974                Zambia
3975              Zimbabwe
Name: Combined_Key, Length: 3976, dtype: string

## Testing for strings that match or contain a pattern with `str.contains`

In [46]:
print(daily_report['Country_Region'].str.contains('land').sum())
daily_report[daily_report['Country_Region'].str.contains('land')]

26


Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat,Long_,Confirmed,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio
203,,,,Finland,2020-12-12 05:26:19,61.92411,25.748151,30073,453,20000,9620.0,Finland,542.763591,1.506335
246,,,,Iceland,2020-12-12 05:26:19,64.9631,-19.0208,5539,28,5323,188.0,Iceland,1623.150183,0.505506
287,,,,Ireland,2020-12-12 05:26:19,53.1424,-7.6921,75507,2120,23364,50023.0,Ireland,1529.164024,2.807687
383,,,,Marshall Islands,2020-12-12 05:26:19,7.1315,171.1845,4,0,4,0.0,Marshall Islands,6.847791,0.0
427,,,Aruba,Netherlands,2020-12-12 05:26:19,12.5211,-69.9683,5031,46,4879,106.0,"Aruba, Netherlands",4712.174288,0.914331
428,,,"Bonaire, Sint Eustatius and Saba",Netherlands,2020-12-12 05:26:19,12.1784,-68.2385,172,3,160,9.0,"Bonaire, Sint Eustatius and Saba, Netherlands",655.962778,1.744186
429,,,Curacao,Netherlands,2020-12-12 05:26:19,12.1696,-68.99,3501,10,1638,1853.0,"Curacao, Netherlands",2133.45521,0.285633
430,,,Drenthe,Netherlands,2020-12-12 05:26:19,52.862485,6.618435,8829,117,0,8712.0,"Drenthe, Netherlands",1788.3982,1.325178
431,,,Flevoland,Netherlands,2020-12-12 05:26:19,52.550383,5.515162,13222,149,0,13073.0,"Flevoland, Netherlands",3125.613149,1.12691
432,,,Friesland,Netherlands,2020-12-12 05:26:19,53.087337,5.7925,10128,143,0,9985.0,"Friesland, Netherlands",1558.256931,1.411927


## Reshaping dataframes from wide to long format with `pd.melt`

A common problem is a dataset in which some of the column names are not names of variables but values of a variable.

In [47]:
ts_confirmed_global_url = "https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv"
ts_confirmed_global = pd.read_csv(ts_confirmed_global_url)
ts_confirmed_global

Unnamed: 0,Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20,1/25/20,1/26/20,1/27/20,...,12/2/20,12/3/20,12/4/20,12/5/20,12/6/20,12/7/20,12/8/20,12/9/20,12/10/20,12/11/20
0,,Afghanistan,33.939110,67.709953,0,0,0,0,0,0,...,46718,46837,46837,47072,47306,47516,47716,47851,48053,48116
1,,Albania,41.153300,20.168300,0,0,0,0,0,0,...,39719,40501,41302,42148,42988,43683,44436,45188,46061,46863
2,,Algeria,28.033900,1.659600,0,0,0,0,0,0,...,85084,85927,86730,87502,88252,88825,89416,90014,90579,91121
3,,Andorra,42.506300,1.521800,0,0,0,0,0,0,...,6842,6904,6955,7005,7050,7084,7127,7162,7190,7236
4,,Angola,-11.202700,17.873900,0,0,0,0,0,0,...,15319,15361,15493,15536,15591,15648,15729,15804,15925,16061
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
266,,Vietnam,14.058324,108.277199,0,2,2,2,2,2,...,1358,1361,1361,1365,1366,1367,1377,1381,1385,1391
267,,West Bank and Gaza,31.952200,35.233200,0,0,0,0,0,0,...,90192,92708,94676,96098,98038,99758,101109,102992,104879,106622
268,,Yemen,15.552727,48.516388,0,0,0,0,0,0,...,2217,2239,2267,2304,2337,2383,2078,2079,2081,2082
269,,Zambia,-13.133897,27.849332,0,0,0,0,0,0,...,17700,17730,17857,17898,17916,17931,17963,18062,18091,18161


## We can pivot the columns into a new pair of variables

To describe that operation, we need four parameters:

- The set of columns that identify each observation and should be kept as they are
- The set of columns whose names are values of a variable
- The name of the new variable that receives the column names
- The name of the new variable that receives the column values

## In this example, the four parameters are

- `id_vars`: `['Province/State', 'Country/Region', 'Lat', 'Long']`
- `value_vars`: The columns from `1/22/20` to the last column
- `var_name`: Let's name it `Date`
- `value_name`: Let's name it `Confirmed`

In [48]:
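# Identifier columns that stay as they are
id_vars = ['Province/State', 'Country/Region', 'Lat', 'Long']
# value_vars is omitted, so every column not listed in id_vars is melted
ts_confirmed_global_long = pd.melt(ts_confirmed_global,
                                  id_vars=id_vars,
                                  var_name='Date',
                                  value_name='Confirmed')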
ts_confirmed_global_long

Unnamed: 0,Province/State,Country/Region,Lat,Long,Date,Confirmed
0,,Afghanistan,33.939110,67.709953,1/22/20,0
1,,Albania,41.153300,20.168300,1/22/20,0
2,,Algeria,28.033900,1.659600,1/22/20,0
3,,Andorra,42.506300,1.521800,1/22/20,0
4,,Angola,-11.202700,17.873900,1/22/20,0
...,...,...,...,...,...,...
88070,,Vietnam,14.058324,108.277199,12/11/20,1391
88071,,West Bank and Gaza,31.952200,35.233200,12/11/20,106622
88072,,Yemen,15.552727,48.516388,12/11/20,2082
88073,,Zambia,-13.133897,27.849332,12/11/20,18161
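
The call above does not pass `value_vars`. A minimal sketch on a made-up two-row dataframe (the names `wide`, `implicit`, `explicit`, and `long_with_dates` and all values are assumptions for illustration) suggests that leaving it out melts every column not listed in `id_vars`, and that the resulting `Date` strings can be parsed with `pd.to_datetime` afterwards.

```python
import pandas as pd

# Hypothetical toy data in the same wide layout: one column per date.
wide = pd.DataFrame({
    "Country/Region": ["A", "B"],
    "1/22/20": [0, 1],
    "1/23/20": [2, 3],
})

# Omitting value_vars melts every column that is not listed in id_vars.
implicit = pd.melt(wide, id_vars=["Country/Region"],
                   var_name="Date", value_name="Confirmed")

# Passing value_vars explicitly gives the same long dataframe here.
explicit = pd.melt(wide, id_vars=["Country/Region"],
                   value_vars=["1/22/20", "1/23/20"],
                   var_name="Date", value_name="Confirmed")

# The melted Date column holds strings; parse them before time-series work.
long_with_dates = implicit.copy()
long_with_dates["Date"] = pd.to_datetime(long_with_dates["Date"],
                                         format="%m/%d/%y")
```

The long dataframe has one row per combination of identifier and date, so two countries and two dates give four rows.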


## Merging and joining dataframes

- `merge` matches rows on column names by default
- `join` matches rows on the index by default
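
The difference can be sketched on two tiny made-up dataframes before applying it to the COVID-19 data. The names `left`, `right`, `merged`, and `joined` and the column `key` are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical toy data sharing a "key" column.
left = pd.DataFrame({"key": ["a", "b"], "x": [1, 2]})
right = pd.DataFrame({"key": ["b", "c"], "y": [3, 4]})

# merge matches rows on a column; the default is an inner join,
# so only the key present on both sides ("b") survives.
merged = pd.merge(left, right, on="key")

# join matches rows on the index; the default is a left join,
# so every row of the left dataframe is kept and gaps become NaN.
joined = left.set_index("key").join(right.set_index("key"))
```

Note the different defaults: `merge` performs an inner join, whereas `join` performs a left join.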

## Using the `merge` function to join dataframes on columns

In [49]:
left_df = daily_report[daily_report['Country_Region'].isin(['Taiwan*', 'Japan'])]
right_df = ts_confirmed_global_long[ts_confirmed_global_long['Country/Region'].isin(['Taiwan*', 'Korea, South'])]
# default: inner join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,1/22/20,1
1,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,1/23/20,1
2,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,1/24/20,3
3,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,1/25/20,3
4,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,1/26/20,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
320,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/7/20,716
321,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/8/20,718
322,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/9/20,720
323,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/10/20,724
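
The single `Taiwan*` row of the daily report is repeated once for every matching date on the right side, which is a one-to-many match. A small sketch with made-up values (the names `snapshot`, `series`, and `merged` are assumptions) shows the effect, using the `validate` parameter of `pd.merge` to assert that keys on the left are unique.

```python
import pandas as pd

# Hypothetical toy data: one snapshot row, three time-series rows.
snapshot = pd.DataFrame({"country": ["T"], "deaths": [7]})
series = pd.DataFrame({
    "country": ["T", "T", "T"],
    "date": ["d1", "d2", "d3"],
    "confirmed": [1, 1, 3],
})

# validate="one_to_many" raises MergeError if left keys are not unique.
merged = pd.merge(snapshot, series, on="country", validate="one_to_many")

# The snapshot row is repeated once per matching right-hand row.
print(len(merged))
```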


In [50]:
# left join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region', how='left')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,Aichi,Japan,2020-12-12 05:26:19,35.035551,137.211621,12332,127,9717,2488.0,"Aichi, Japan",163.289324,1.029841,,,,,,
1,,,Akita,Japan,2020-12-12 05:26:19,39.748679,140.408228,90,1,89,0.0,"Akita, Japan",9.312047,1.111111,,,,,,
2,,,Aomori,Japan,2020-12-12 05:26:19,40.781541,140.828896,364,6,291,67.0,"Aomori, Japan",29.204787,1.648352,,,,,,
3,,,Chiba,Japan,2020-12-12 05:26:19,35.510141,140.198917,7961,96,6891,974.0,"Chiba, Japan",127.185080,1.205879,,,,,,
4,,,Ehime,Japan,2020-12-12 05:26:19,33.624835,132.856842,356,8,278,70.0,"Ehime, Japan",26.582737,2.247191,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,,,,Taiwan*,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/7/20,716.0
370,,,,Taiwan*,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/8/20,718.0
371,,,,Taiwan*,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/9/20,720.0
372,,,,Taiwan*,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.7,121.0,12/10/20,724.0


In [51]:
# right join
pd.merge(left_df, right_df, left_on='Country_Region', right_on='Country/Region', how='right')

Unnamed: 0,FIPS,Admin2,Province_State,Country_Region,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Country/Region,Lat_y,Long,Date,Confirmed_y
0,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/22/20,1
1,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/23/20,1
2,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/24/20,2
3,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/25/20,2
4,,,,,,,,,,,,,,,,"Korea, South",35.907757,127.766922,1/26/20,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
645,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.700000,121.000000,12/7/20,716
646,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.700000,121.000000,12/8/20,718
647,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.700000,121.000000,12/9/20,720
648,,,,Taiwan*,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,Taiwan*,23.700000,121.000000,12/10/20,724
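
Besides `inner`, `left`, and `right`, `pd.merge` also accepts `how='outer'`, which keeps unmatched rows from both sides. The sketch below uses made-up values (the dataframes `left` and `right` and their numbers are assumptions, not taken from the datasets above) together with `indicator=True`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy data mirroring the key columns used above.
left = pd.DataFrame({"Country_Region": ["Japan", "Taiwan*"],
                     "Deaths": [1, 2]})
right = pd.DataFrame({"Country/Region": ["Taiwan*", "Korea, South"],
                      "Confirmed": [3, 4]})

# An outer join keeps rows from both sides; indicator=True labels
# each row as "left_only", "right_only", or "both".
outer = pd.merge(left, right,
                 left_on="Country_Region", right_on="Country/Region",
                 how="outer", indicator=True)
print(outer[["Country_Region", "Country/Region", "_merge"]])
```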


## Using the `join` method to join dataframes on index

In [52]:
left_df = daily_report[daily_report['Country_Region'].isin(['Taiwan*', 'Japan'])]
right_df = ts_confirmed_global_long[ts_confirmed_global_long['Country/Region'].isin(['Taiwan*', 'Korea, South'])]
left_df = left_df.set_index('Country_Region')
right_df = right_df.set_index('Country/Region')

In [53]:
# default: left join
left_df.join(right_df, lsuffix='_x', rsuffix='_y')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
Japan,,,Aichi,2020-12-12 05:26:19,35.035551,137.211621,12332,127,9717,2488.0,"Aichi, Japan",163.289324,1.029841,,,,,
Japan,,,Akita,2020-12-12 05:26:19,39.748679,140.408228,90,1,89,0.0,"Akita, Japan",9.312047,1.111111,,,,,
Japan,,,Aomori,2020-12-12 05:26:19,40.781541,140.828896,364,6,291,67.0,"Aomori, Japan",29.204787,1.648352,,,,,
Japan,,,Chiba,2020-12-12 05:26:19,35.510141,140.198917,7961,96,6891,974.0,"Chiba, Japan",127.185080,1.205879,,,,,
Japan,,,Ehime,2020-12-12 05:26:19,33.624835,132.856842,356,8,278,70.0,"Ehime, Japan",26.582737,2.247191,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/7/20,716.0
Taiwan*,,,,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/8/20,718.0
Taiwan*,,,,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/9/20,720.0
Taiwan*,,,,2020-12-12 05:26:19,23.700000,121.000000,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/10/20,724.0
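
The `lsuffix` and `rsuffix` arguments are needed because both dataframes contain columns with the same name, such as `Lat` and `Confirmed`. A minimal sketch with made-up single-row dataframes (the names `left`, `right`, `overlap_error`, and `joined` are assumptions) shows that `join` raises a `ValueError` for overlapping columns unless suffixes are supplied.

```python
import pandas as pd

# Hypothetical toy data: both sides have a "Lat" column.
left = pd.DataFrame({"Lat": [23.7]}, index=["Taiwan*"])
right = pd.DataFrame({"Lat": [23.7]}, index=["Taiwan*"])

# Without suffixes, join refuses to create two columns named "Lat".
try:
    left.join(right)
    overlap_error = None
except ValueError as error:
    overlap_error = str(error)

# With suffixes, the overlapping columns are renamed on each side.
joined = left.join(right, lsuffix="_x", rsuffix="_y")
print(list(joined.columns))
```

By contrast, `pd.merge` applies the suffixes `_x` and `_y` by default, which is why the `merge` calls above did not need them.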


In [54]:
# inner join
left_df.join(right_df, lsuffix='_x', rsuffix='_y', how='inner')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,1/22/20,1
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,1/23/20,1
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,1/24/20,3
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,1/25/20,3
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,1/26/20,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/7/20,716
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/8/20,718
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/9/20,720
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725,7,595,123.0,Taiwan*,3.044073,0.965517,,23.7,121.0,12/10/20,724


In [56]:
# right join
left_df.join(right_df, lsuffix='_x', rsuffix='_y', how='right')

Unnamed: 0,FIPS,Admin2,Province_State,Last_Update,Lat_x,Long_,Confirmed_x,Deaths,Recovered,Active,Combined_Key,Incident_Rate,Case_Fatality_Ratio,Province/State,Lat_y,Long,Date,Confirmed_y
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/22/20,1
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/23/20,1
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/24/20,2
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/25/20,2
"Korea, South",,,,,,,,,,,,,,,35.907757,127.766922,1/26/20,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,23.700000,121.000000,12/7/20,716
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,23.700000,121.000000,12/8/20,718
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,23.700000,121.000000,12/9/20,720
Taiwan*,,,,2020-12-12 05:26:19,23.7,121.0,725.0,7.0,595.0,123.0,Taiwan*,3.044073,0.965517,,23.700000,121.000000,12/10/20,724
