# Data Analysis with Python and Pandas Tutorial
# Part 3 - Merging and Joining Data

This notebook is partially based on:

https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html

## Tutorial Objectives

In this tutorial, you will learn:

  * How to merge dataframes on a common column, similar to SQL JOINs
  * How to choose between inner, left, right, and (full) outer joins
  * How to identify rows with NaN value(s)

In [None]:
# import the Pandas library
import pandas as pd

In [None]:
# First, let's define some simple employee data

empl_df1 = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]
})

empl_df2 = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})

sal_df = pd.DataFrame({
    'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'salary': [70000, 80000, 120000, 90000]
})

bonus_df = pd.DataFrame({
    'name': ['Sue', 'Timmy', 'Lisa', 'Bob'],
    'bonus': [500, 550, 1000, 300]
})

boss_df = pd.DataFrame({
    'group': ['Accounting', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve']}
)

skills_df = pd.DataFrame({
    'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
    'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']
})

In [None]:
# Review the data frames
empl_df1

In [None]:
empl_df2

In [None]:
# Merge the data frames into a single frame
# Pandas finds the common column automatically
df = pd.merge(empl_df1, empl_df2)
df
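
When `on` is omitted, `pd.merge` uses the column names that both frames share as the join keys. The sketch below rebuilds the two employee frames so it can run on its own, and checks that the implicit merge matches one with the key spelled out. The names `common`, `implicit`, and `explicit` are only for illustration.

```python
import pandas as pd

empl_df1 = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]
})

empl_df2 = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})

# Column names present in both frames; these become the default join keys
common = empl_df1.columns.intersection(empl_df2.columns).tolist()
print(common)

# Merging without `on` and merging with the key named explicitly
implicit = pd.merge(empl_df1, empl_df2)
explicit = pd.merge(empl_df1, empl_df2, on='employee')
print(implicit.equals(explicit))
```

Note that the rows are matched by the value in `employee`, not by position, so the different row order in the two frames does not matter.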

In [None]:
# review the bosses dataframe
boss_df

In [None]:
# Merge employees with their group's supervisor (many to one)
# The common column is "group", which we specify explicitly with `on`
df = pd.merge(df, boss_df, on='group')
df
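
If you want pandas to confirm that a merge really is many-to-one, the `validate` argument of `pd.merge` can do the check for you. The sketch below is one possible way to use it; `groups`, `bosses`, and `dup_bosses` are illustrative names, and the duplicated supervisor row is made up purely to trigger the check.

```python
import pandas as pd

groups = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})

bosses = pd.DataFrame({
    'group': ['Accounting', 'Engineering', 'HR'],
    'supervisor': ['Carly', 'Guido', 'Steve']
})

# Passes: every group appears only once on the right-hand side
checked = pd.merge(groups, bosses, on='group', validate='many_to_one')
print(checked)

# Hypothetical bad input: the Accounting row appears twice on the right
dup_bosses = pd.concat([bosses, bosses.iloc[[0]]], ignore_index=True)

rejected = False
try:
    pd.merge(groups, dup_bosses, on='group', validate='many_to_one')
except pd.errors.MergeError as err:
    rejected = True
    print('merge rejected:', err)
```

Without `validate`, the second merge would succeed silently and duplicate Bob's row, which is an easy mistake to miss in larger datasets.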

In [None]:
# review skills data
skills_df

In [None]:
# Merge employee/group data with skills data (many to many)
# `on` is optional here because "group" is the only shared column,
# but naming it explicitly makes the intent clearer
pd.merge(empl_df2, skills_df, on='group')
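
In a many-to-many merge, every left row is paired with every right row that has the same key, so the result can be longer than either input. A rough way to predict the row count is to multiply the per-key counts on both sides. The sketch below assumes the same toy data as above; `empl`, `skills`, and `expected_rows` are illustrative names.

```python
import pandas as pd

empl = pd.DataFrame({
    'employee': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'group': ['Accounting', 'Engineering', 'Engineering', 'HR']
})

skills = pd.DataFrame({
    'group': ['Accounting', 'Accounting', 'Engineering', 'Engineering', 'HR', 'HR'],
    'skills': ['math', 'spreadsheets', 'coding', 'linux', 'spreadsheets', 'organization']
})

merged = pd.merge(empl, skills, on='group')

# For each group: (rows on the left) * (rows on the right), summed over groups
left_counts = empl['group'].value_counts()
right_counts = skills['group'].value_counts()
expected_rows = int((left_counts * right_counts).dropna().sum())

print(len(merged), expected_rows)
```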

In [None]:
# review salary data
sal_df

In [None]:
# Merge employee data with salary
# The key columns have different names ("employee" vs. "name"),
# so we name them explicitly with left_on and right_on
df = pd.merge(df, sal_df, left_on='employee', right_on='name')
df

In [None]:
# drop the duplicated column "name"
df.drop(columns='name', inplace=True)
df
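
An alternative to dropping the redundant column afterwards is to rename the key on one side before merging, so both frames share a single key column. This is only a sketch of that option; `empl`, `sal`, and `renamed` are illustrative names.

```python
import pandas as pd

empl = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'hire_date': [2004, 2008, 2012, 2014]
})

sal = pd.DataFrame({
    'name': ['Bob', 'Jake', 'Lisa', 'Sue'],
    'salary': [70000, 80000, 120000, 90000]
})

# rename() returns a new frame, so `sal` itself is left unchanged
renamed = pd.merge(empl, sal.rename(columns={'name': 'employee'}), on='employee')
print(renamed.columns.tolist())
```

Either approach gives the same data; renaming first simply avoids carrying two copies of the key through the merge.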

In [None]:
# review end-of-year bonus data
bonus_df

In [None]:
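# Merge employee data with bonus data (the key sets only partially overlap:
# Jake has no bonus row, and Timmy has no employee row)
# The default merge method is an inner join, which keeps only matching keys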
pd.merge(df, bonus_df, left_on='employee', right_on='name')

In [None]:
# left join to include every row on the left
pd.merge(df, bonus_df, left_on='employee', right_on='name', how='left')

In [None]:
# right join to include every row on the right
pd.merge(df, bonus_df, left_on='employee', right_on='name', how='right')

In [None]:
# outer join to include every row
all_df = pd.merge(df, bonus_df, left_on='employee', right_on='name', how='outer')
all_df
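
To see where each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. The sketch below uses a cut-down `staff` frame (an illustrative name) instead of the full `df` so that it runs on its own.

```python
import pandas as pd

staff = pd.DataFrame({'employee': ['Lisa', 'Bob', 'Jake', 'Sue']})

bonus = pd.DataFrame({
    'name': ['Sue', 'Timmy', 'Lisa', 'Bob'],
    'bonus': [500, 550, 1000, 300]
})

tagged = pd.merge(staff, bonus, left_on='employee', right_on='name',
                  how='outer', indicator=True)
print(tagged)

# Count how many rows matched on both sides versus only one side
counts = tagged['_merge'].astype(str).value_counts().to_dict()
print(counts)
```

Here Jake shows up as `left_only` and Timmy as `right_only`; these are exactly the rows that the inner join drops.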

In [None]:
all_df.columns

In [None]:
# find rows with NaN
na_rows = all_df.isna().any(axis='columns')

In [None]:
# output those rows
all_df[na_rows]
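
Once the incomplete rows are identified, two common follow-ups are counting the missing values per column and filling them in. The sketch below assumes that a missing bonus can be treated as 0, which may not be right for real data; `staff`, `all_demo`, and `filled` are illustrative names.

```python
import pandas as pd

staff = pd.DataFrame({
    'employee': ['Lisa', 'Bob', 'Jake', 'Sue'],
    'salary': [120000, 70000, 80000, 90000]
})

bonus = pd.DataFrame({
    'name': ['Sue', 'Timmy', 'Lisa', 'Bob'],
    'bonus': [500, 550, 1000, 300]
})

all_demo = pd.merge(staff, bonus, left_on='employee', right_on='name', how='outer')

# Number of missing values in each column
nan_per_column = all_demo.isna().sum()
print(nan_per_column)

# Assumption: no bonus row means a bonus of 0
filled = all_demo.assign(bonus=all_demo['bonus'].fillna(0))
print(filled)
```

Integer columns that pick up a NaN during the merge are converted to floating point, which is why `salary` and `bonus` print with a decimal point in the merged frame.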

## Exercise

Load the Vietnam weather data and do some basic cleanup of the columns. The primary objective of the exercise is to merge the two datasets so that temperature and humidity are combined in a single dataframe.

The dataset is available at:

https://1drv.ms/u/s!AgtH78k0_cuvglx5ww3BMV9GpIm1

Discuss your solutions with the person next to you!