# Welcome

## Merging in Pandas

Date: 11/17/2023

Data Analytics Working Group

## Goal

* Explore the Pandas merging method 

* Reference: https://pandas.pydata.org/docs/user_guide/merging.html

These notes will be available [here](https://github.com/youngsuKim-CSUSB/presentations).




# Preliminaries

## Basic JupyterLab 

* How to execute cells
* How to install Python packages

## Import Pandas and CSV files

We will load the package `Pandas` and a CSV file:

> import pandas as pd  
> pd.read_csv(filename)

In [2]:
import pandas as pd

If the package is not installed, you can install it with one of the following commands without leaving JupyterLab.

> conda install pandas

> pip install pandas

# Load CSV files as Datasets

Use `pd.read_csv(filename)` but first download the CSV files by typing



In [3]:
# Load tables for exercises
df1 = pd.read_csv("table1.csv")
df2 = pd.read_csv("table2.csv")
df3 = pd.read_csv("table3.csv")
df4 = pd.read_csv("table4.csv")
df5 = pd.read_csv("table5.csv")
df6 = pd.read_csv("table6.csv")
df7 = pd.read_csv("table7.csv")

In [4]:
# pd.read_csv?

In [5]:
# print df1
df1

Unnamed: 0,Column A,Column B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50
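For readers who do not have the CSV files, the sketch below rebuilds `df1` from text held in memory. The CSV contents are an assumption inferred from the output shown above; the real `table1.csv` may be laid out differently (for example, it may store the row index as an extra column).

```python
import io

import pandas as pd

# Assumed contents of table1.csv, inferred from the displayed df1.
csv_text = """Column A,Column B
1,10
2,20
3,30
4,40
5,50
"""

# pd.read_csv accepts a file path or any file-like object.
df1 = pd.read_csv(io.StringIO(csv_text))

print(df1.shape)   # (number of rows, number of columns)
print(df1.dtypes)  # data type inferred for each column
```

Reading from `io.StringIO` is only a stand-in for a file on disk; with the real file, `pd.read_csv("table1.csv")` is used exactly as in the cell above.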


In [6]:
# print df2
# Write your code below


**Example**: Use `pd.concat` to concatenate `df1` and `df2` row-wise.

In [7]:
# pd.concat([list of dataframes])
pd.concat([df1,df2])

Unnamed: 0,Column A,Column B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50
0,6,60
1,7,70
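Note that the row labels in the concatenated result repeat (0 to 4, then 0 and 1) because `pd.concat` keeps each table's own index. A minimal sketch of renumbering the rows with `ignore_index=True`, using the values of `df1` and `df2` displayed in these notes (built in memory here rather than read from the CSV files):

```python
import pandas as pd

# Values copied from the df1 and df2 outputs shown in these notes.
df1 = pd.DataFrame({"Column A": [1, 2, 3, 4, 5], "Column B": [10, 20, 30, 40, 50]})
df2 = pd.DataFrame({"Column A": [6, 7], "Column B": [60, 70]})

# Default behavior: each table keeps its original row labels.
stacked = pd.concat([df1, df2])

# ignore_index=True discards the old labels and numbers the rows afresh.
renumbered = pd.concat([df1, df2], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```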


**Student Exercise**:

Check out `df3` and use `pd.concat` to concatenate df1 and df3 row-wise. Your result should look similar to the following.

![df1-df3](./df1-df3.png)


In [9]:
# Now check out df4, df1, and df7 in the next three cells
df4

Unnamed: 0,Column A,Column B,Column C
0,1,10,2
1,2,20,4
2,3,30,6
3,4,40,8
4,5,50,10


In [12]:
df1

Unnamed: 0,Column A,Column B
0,1,10
1,2,20
2,3,30
3,4,40
4,5,50


In [13]:
df7

Unnamed: 0,Column A,Column C
0,1,2
1,2,4
2,3,6
3,4,8
4,5,10


**Example**: Merge `df1` and `df7` to get `df4`.



In [None]:
# pd.merge(left,right, on=, how=)
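One possible way to carry out this example is sketched below. The three tables are rebuilt in memory from the outputs displayed above, so the snippet runs without the CSV files; the variable name `merged` is only illustrative.

```python
import pandas as pd

# Values copied from the df1, df7, and df4 outputs shown in these notes.
df1 = pd.DataFrame({"Column A": [1, 2, 3, 4, 5], "Column B": [10, 20, 30, 40, 50]})
df7 = pd.DataFrame({"Column A": [1, 2, 3, 4, 5], "Column C": [2, 4, 6, 8, 10]})
df4 = pd.DataFrame(
    {
        "Column A": [1, 2, 3, 4, 5],
        "Column B": [10, 20, 30, 40, 50],
        "Column C": [2, 4, 6, 8, 10],
    }
)

# Match rows of df1 and df7 that share the same value in "Column A".
merged = pd.merge(df1, df7, on="Column A", how="inner")

print(merged)
print(merged.equals(df4))  # compare the merged table with df4
```

Because every value of `Column A` appears in both tables, the choice of `how` does not change the result here; the hands-on exercise that follows uses tables where it does.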

**Student Hands-on Exercise**: The dataframes `df2` and `df6` have a small number of rows. Execute the following four cells, which demonstrate different ways of merging. Before executing each cell, guess the result, then compare your guess with the output.

In [16]:
df2

Unnamed: 0,Column A,Column B
0,6,60
1,7,70


In [31]:
df6

Unnamed: 0,Column A,Column B
0,2,20
1,3,30
2,6,60


In [22]:
pd.merge(df2,df6, on=['Column A'], how='inner') # 'inner' is the default

Unnamed: 0,Column A,Column B_x,Column B_y
0,6,60,60
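The `_x` and `_y` endings in the output above are the default suffixes that `pd.merge` appends when both tables have a non-key column with the same name. The sketch below, which rebuilds `df2` and `df6` in memory from the displayed values, shows two ways to avoid them: choosing custom suffixes, or merging on both columns. The suffix strings are arbitrary choices for illustration.

```python
import pandas as pd

# Values copied from the df2 and df6 outputs shown in these notes.
df2 = pd.DataFrame({"Column A": [6, 7], "Column B": [60, 70]})
df6 = pd.DataFrame({"Column A": [2, 3, 6], "Column B": [20, 30, 60]})

# Custom suffixes make it clear which table each "Column B" came from.
labelled = pd.merge(df2, df6, on="Column A", how="inner", suffixes=("_df2", "_df6"))
print(list(labelled.columns))

# Merging on both columns leaves no overlapping non-key column to rename.
both = pd.merge(df2, df6, on=["Column A", "Column B"], how="inner")
print(list(both.columns))
```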


In [23]:
pd.merge(df2,df6, on=['Column A'], how='left') 

Unnamed: 0,Column A,Column B_x,Column B_y
0,6,60,60.0
1,7,70,


In [24]:
pd.merge(df2,df6, on=['Column A'], how='right') 

Unnamed: 0,Column A,Column B_x,Column B_y
0,2,,20
1,3,,30
2,6,60.0,60


In [25]:
pd.merge(df2,df6, on=['Column A'], how='outer') # 'outer' keeps the keys from both tables

Unnamed: 0,Column A,Column B_x,Column B_y
0,6,60.0,60.0
1,7,70.0,
2,2,,20.0
3,3,,30.0


## Discussion

Do these results make sense to you? Here, `NaN` ("not a number") marks a missing value: the key in that row had no match in the other table.
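To help with this discussion, the following sketch adds `indicator=True`, a `pd.merge` option that records which table each row came from in an extra `_merge` column. `df2` and `df6` are rebuilt in memory from the displayed values. The rows are sorted before printing because the row order of an outer merge can differ between pandas versions.

```python
import pandas as pd

# Values copied from the df2 and df6 outputs shown in these notes.
df2 = pd.DataFrame({"Column A": [6, 7], "Column B": [60, 70]})
df6 = pd.DataFrame({"Column A": [2, 3, 6], "Column B": [20, 30, 60]})

# _merge is "left_only", "right_only", or "both" for each row.
outer = pd.merge(df2, df6, on="Column A", how="outer", indicator=True)

# Sort so the display does not depend on the pandas version.
outer = outer.sort_values("Column A").reset_index(drop=True)
print(outer)

# NaN is a floating-point value, so a column that receives NaN becomes float64.
# This is why 60 is displayed as 60.0 in the merged tables.
print(outer.dtypes)
```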

## Task

Count the number of attendance points for this week.

1. Read the files
> [ "roster.csv", "11-13-2023.csv", "11-14-2023.csv", "11-15-2023.csv", "11-17-2023.csv" ]
2. Merge the data one file at a time
3. Replace `NaN` with 0
4. Create a column with the sum

The final dataframe should look like the following:

![roster_goal](./roster-goal.png)

In [45]:
# 1

files = [ "roster.csv", "11-13-2023.csv", "11-14-2023.csv", "11-15-2023.csv", "11-17-2023.csv" ]

roster = pd.read_csv(files[0])
df13 = pd.read_csv(files[1])
df14 = pd.read_csv(files[2])
df15 = pd.read_csv(files[3])
df17 = pd.read_csv(files[4])

In [43]:
# Check df 
roster


In [44]:
# Check df 
df13


In [125]:
# 2. First part
df = pd.merge(roster, df13, on='Name', how='left')
#### Fill this part out




#### 
df


In [45]:
# 3. Use df.fillna(0) to replace NaN with 0
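The two methods with similar names behave very differently: `fillna` replaces missing values, while `dropna` removes the rows that contain them. A minimal sketch on a made-up attendance table (the names and the column label are hypothetical, not taken from the workshop files):

```python
import numpy as np
import pandas as pd

# Hypothetical table: Ben has no attendance record for this day.
df = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"], "11-13": [1.0, np.nan, 1.0]})

filled = df.fillna(0)  # keeps every row; NaN becomes 0
dropped = df.dropna()  # removes every row that contains NaN

print(filled)
print(dropped)
```

For the attendance task, `fillna(0)` is the one that fits, since a student who missed a day should stay on the roster with 0 points.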



In [None]:
# 4. Add the total column
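The four steps of the task can be sketched end to end as follows. This is only one possible approach, and everything except the `Name` key is an assumption: the small tables stand in for `roster.csv` and the daily files, and the day column labels and the `Total` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-ins for roster.csv and three daily sign-in files.
roster = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"]})
daily = {
    "11-13": pd.DataFrame({"Name": ["Ana", "Cy"], "11-13": [1, 1]}),
    "11-14": pd.DataFrame({"Name": ["Ben"], "11-14": [1]}),
    "11-15": pd.DataFrame({"Name": ["Ana", "Cy"], "11-15": [1, 1]}),
}

# Step 2: left-merge each day onto the roster so that no student is lost.
df = roster
for day_table in daily.values():
    df = pd.merge(df, day_table, on="Name", how="left")

# Step 3: a student absent on a given day has NaN there; count it as 0.
df = df.fillna(0)

# Step 4: add up the day columns row by row.
day_columns = list(daily)
df["Total"] = df[day_columns].sum(axis=1)

print(df)
```

Using `how="left"` with the roster on the left keeps exactly one row per rostered student, regardless of who signed in on a given day.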

In [32]:
# If time permits, go over the following

## More examples from the Pandas documentation

In [33]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

result = pd.merge(left, right, on="key", how='outer')

In [34]:
left

Unnamed: 0,key,A,B
0,K0,A0,B0
1,K1,A1,B1
2,K2,A2,B2
3,K3,A3,B3


In [35]:
right

Unnamed: 0,key,C,D
0,K0,C0,D0
1,K1,C1,D1
2,K2,C2,D2
3,K3,C3,D3


In [36]:
result

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,C0,D0
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,A3,B3,C3,D3


In [37]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


result = pd.merge(left, right, on=["key1", "key2"]) #, indicator=True) #, validate='1:m')
result2 = pd.merge(left, right, on=["key1", "key2"], how='left')
result3 = pd.merge(left, right, on=["key1", "key2"], how='right')

In [38]:
left

Unnamed: 0,key1,key2,A,B
0,K0,K0,A0,B0
1,K0,K1,A1,B1
2,K1,K0,A2,B2
3,K2,K1,A3,B3


In [39]:
right

Unnamed: 0,key1,key2,C,D
0,K0,K0,C0,D0
1,K1,K0,C1,D1
2,K1,K0,C2,D2
3,K2,K0,C3,D3


In [40]:
result

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2


In [41]:
result2

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K0,K1,A1,B1,,
2,K1,K0,A2,B2,C1,D1
3,K1,K0,A2,B2,C2,D2
4,K2,K1,A3,B3,,


In [42]:
result3

Unnamed: 0,key1,key2,A,B,C,D
0,K0,K0,A0,B0,C0,D0
1,K1,K0,A2,B2,C1,D1
2,K1,K0,A2,B2,C2,D2
3,K2,K0,,,C3,D3
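In the results above, the row `A2, B2` appears twice because the key pair `(K1, K0)` occurs twice in `right`. The `validate` option of `pd.merge` (commented out in the cell that builds `result`) can check such assumptions about key uniqueness. The sketch below rebuilds the same two tables and tries two settings; the variable names `checked` and `one_to_one_ok` are illustrative.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)

right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# "1:m" requires unique key pairs in left only, which holds here.
checked = pd.merge(left, right, on=["key1", "key2"], validate="1:m")
print(len(checked))

# "1:1" also requires unique key pairs in right, which does not hold.
try:
    pd.merge(left, right, on=["key1", "key2"], validate="1:1")
    one_to_one_ok = True
except pd.errors.MergeError as err:
    one_to_one_ok = False
    print("validate='1:1' failed:", err)
```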


# Questions?

## Thank you!