# Enrichment
***
## Learning Objectives
- Understand the concept of enrichment
- Learn about joining dataframes
- Learn about the different types of joins
- Insert new data into a dataframe 
- Update existing data in a dataframe 
- Delete rows from a dataframe 

## Links

## Additional Material


## Sources
- [Difference between `concat` and `merge`](https://stackoverflow.com/questions/38256104/difference-between-pandas-merge-and-concat)
- [SQL Joins](https://www.w3schools.com/sql/sql_join.asp)
- [Indexes in Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html)
- [Join in Pandas](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html)
- [Merge and Join differences](https://g.co/bard/share/219948ee80d0)
- [Adding new data](https://g.co/bard/share/91a6e1839e86)
- [Deleting rows in a dataframe that don't exist in another dataframe](https://g.co/bard/share/c8fc62b9b166)

## Series  

In Pandas, a Series is a one-dimensional labeled array that can hold any data type (integers, strings, floating-point numbers, Python objects, etc.). It is similar to a column in a spreadsheet or a SQL table. A Series has two main components: the data and the index. The data can be a NumPy array, Python dictionary, or a scalar value. The index is a sequence of labels that correspond to the values in the data. 

For example, a Series built from the values 'fish' and 'hamster' with the index labels 4 and 5 (like the `more_animals` Series in the Concat section below) holds two elements. You can access the data and index of a Series using the `values` and `index` attributes, respectively.
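
A minimal sketch of the `values` and `index` attributes on a two-element Series. The variable name `pets` is only illustrative:

```python
import pandas as pd

# A Series with explicit index labels 4 and 5
pets = pd.Series(['fish', 'hamster'], index=[4, 5])

print(pets.values)   # the underlying data as a NumPy array
print(pets.index)    # the index labels
print(pets.loc[4])   # look up a value by its label
```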

In [None]:
import pandas as pd

# create a Series from a list of integers
my_series = pd.Series([10, 20, 30, 40, 50])

print(my_series)

## Indexes


In Pandas, an index is a sequence of labels that identifies each row or column in a DataFrame or a Series. Labels are usually unique, although pandas does not require it. The index provides a way to access and manipulate data in a Pandas object.

By default, when you create a DataFrame or a Series, Pandas assigns a numeric index starting from 0. However, you can also specify your own index using the `index` parameter when creating a DataFrame or a Series. The index can be any sequence of hashable objects, such as a list, an array, or a range of values.

Indexes can be used to select, filter, and manipulate data in a Pandas object. For example, you can use the `loc` attribute to select rows or columns by their index labels, or the `iloc` attribute to select rows or columns by their integer position. You can also use the `reindex` method to change the order of the rows or columns in a DataFrame or a Series based on a new index.
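
The difference between `loc`, `iloc`, and `reindex` is easier to see on a small Series with string labels. This is a sketch with made-up data; the name `scores` is an assumption for illustration:

```python
import pandas as pd

# A Series with string labels instead of the default 0, 1, 2, ...
scores = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

by_label = scores.loc['b']       # select by index label
by_position = scores.iloc[0]     # select by integer position
reordered = scores.reindex(['c', 'a', 'b'])  # same data, new row order

print(by_label, by_position)
print(reordered)
```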

Indexes can also have hierarchical levels, which allow you to represent multi-dimensional data in a Pandas object. A hierarchical index is a sequence of tuples, where each tuple represents a unique combination of index labels at each level. You can create a hierarchical index using the `MultiIndex` class in Pandas.
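
A small sketch of a hierarchical index built from tuples. The city names, years, and values are invented for illustration:

```python
import pandas as pd

# Each tuple is one (city, year) combination
idx = pd.MultiIndex.from_tuples(
    [('Paris', 2022), ('Paris', 2023), ('Rome', 2022)],
    names=['city', 'year'],
)
visitors = pd.Series([100, 120, 90], index=idx)

print(visitors.loc['Paris'])         # all rows whose first level is 'Paris'
print(visitors.loc[('Rome', 2022)])  # one exact combination
```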

In [None]:
import pandas as pd

# create a Series from a list of integers
my_series = pd.Series([10, 20, 30, 40, 50])
my_series


In [None]:

# get the value at index 2
value = my_series[2]

print(value)

## Concat
`concat` is the simplest way to combine two `Series` or `DataFrames`: it stacks them one after the other and keeps each object's index labels. The objects do not need to share an index. In the following example, `animals` has the labels 1 to 3 and `more_animals` has the labels 4 and 5, so the concatenated result runs from 1 to 5:

```python
>>> animals = pd.Series(['dog', 'cat', 'bird'], index=[1, 2, 3])
>>> more_animals = pd.Series(['fish', 'hamster'], index=[4, 5])
>>> pd.concat([animals, more_animals])
1        dog
2        cat
3       bird
4       fish
5    hamster
dtype: object
```
As described in the Indexes section, the index is the column of labels on the very left of the output. In this case, the index is the same as the row number, but it doesn't have to be. Here's the code above to run for yourself:

In [52]:
import pandas as pd
animals = pd.Series(['dog', 'cat', 'bird'], index=[1, 2, 3])
animals

1     dog
2     cat
3    bird
dtype: object

In [53]:

more_animals = pd.Series(['fish', 'hamster'], index=[4, 5])
more_animals

4       fish
5    hamster
dtype: object

In [54]:
pd.concat([animals, more_animals])

1        dog
2        cat
3       bird
4       fish
5    hamster
dtype: object

Can you see how the two series were combined? Here's another example:

In [55]:
import pandas as pd
# Create two DataFrames
data1 = {'Name': ['Alice', 'Bob'],
         'Age': [25, 30]}
data2 = {'Name': ['Charlie', 'David'],
         'Age': [35, 40]}

df1 = pd.DataFrame(data1)
df1 


    Name  Age
0  Alice   25
1    Bob   30


In [56]:

df2 = pd.DataFrame(data2)
df2 


      Name  Age
0  Charlie   35
1    David   40


In [57]:

# Concatenate the two DataFrames vertically (along rows)
result = pd.concat([df1, df2], ignore_index=True)
# Print the result
print(result)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40


`concat` can also be used to combine data horizontally, i.e. to put columns side by side. This is done by setting the `axis` parameter to `1`.


In [58]:
import pandas as pd

# Create two DataFrames
data1 = {'A': ['A0', 'A1', 'A2'],
         'B': ['B0', 'B1', 'B2']}
df1 = pd.DataFrame(data1)
df1 

    A   B
0  A0  B0
1  A1  B1
2  A2  B2


In [59]:

data2 = {'C': ['C0', 'C1', 'C2'],
         'D': ['D0', 'D1', 'D2']}
df2 = pd.DataFrame(data2)
df2 


    C   D
0  C0  D0
1  C1  D1
2  C2  D2


In [61]:
# Concatenate DataFrames horizontally
result = pd.concat([df1, df2], axis=1)

# Print the result
print(result)


    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
2  A2  B2  C2  D2


`A` and `B` are columns in `data1` and `C` and `D` are columns in `data2`. The result is a new dataframe, `result`, with the columns `A`, `B`, `C`, `D`.
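
With `axis=1`, rows are lined up by index label rather than by position. The following sketch, with made-up frames named `left` and `right`, shows what happens when the labels only partly overlap:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C1', 'C2']}, index=[1, 2])

# axis=1 aligns rows by index label; labels missing on one side become NaN
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only label 1 exists in both frames, so it is the only row with values in both columns.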

`concat` should be used whenever you want to *stack* data on top of (or alongside) each other. 

`merge` should be used whenever you want to *combine* data based on a common column.


## Merge
Imagine you have two data frames: 

In [65]:
orders_data = {
    'order_id': [101, 102, 103, 104],
    'customer_id': ['A', 'B', 'A', 'C'],
    'order_total': [50, 75, 60, 90]
}

orders = pd.DataFrame(orders_data)
orders

   order_id customer_id  order_total
0       101           A           50
1       102           B           75
2       103           A           60
3       104           C           90


In [66]:
# Sample customers DataFrame
customers_data = {
    'customer_id': ['A', 'B', 'C', 'D'],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David']
}
customers_df = pd.DataFrame(customers_data)
customers_df

  customer_id customer_name
0           A         Alice
1           B           Bob
2           C       Charlie
3           D         David


Can you tell which column the two DataFrames have in common? It makes sense to combine them using the `customer_id` column. Let's do that using the `merge` function from `pandas`:

```python
merged = pd.merge(customers_df, orders, on='customer_id', how='inner')
```


In [67]:
merged = pd.merge(customers_df, orders, on='customer_id', how='inner')

print(merged)

  customer_id customer_name  order_id  order_total
0           A         Alice       101           50
1           A         Alice       103           60
2           B           Bob       102           75
3           C       Charlie       104           90


The two datasets are combined. The arguments for the `merge()` method are:
* `left`: the left DataFrame
* `right`: the right DataFrame
* `on`: the column(s) to join on. If not specified, and no other join keys given, will use the intersection of the column names in `left` and `right` as the join keys.
* `left_on`: the column(s) to use from the left DataFrame as the join key(s)
* `right_on`: the column(s) to use from the right DataFrame as the join key(s)
* `left_index`: if `True`, use the index (row labels) from the left DataFrame as its join key(s). If specified, `left_on` must be `None`.
* `right_index`: if `True`, use the index (row labels) from the right DataFrame as its join key(s). If specified, `right_on` must be `None`.
* `how`: the type of join to perform. Possible values are: `'left'`, `'right'`, `'outer'`, `'inner'`. Defaults to `'inner'`.
* `suffixes`: a tuple of strings to append to the column names of the overlapping columns in the left and right DataFrames, respectively. Defaults to `('_x', '_y')`.

For our purpose, we use `left`, `right`, `on`, and `how` arguments. We will use the `left` and `right` arguments to specify the DataFrames we want to join, the `on` argument to specify the column(s) to join on, and the `how` argument to specify the type of join to perform.

`on` is used to tell `merge()` which column(s) can be used to combine the datasets. Normally, the column names are the same in both DataFrames, but they don't need to be: when they differ, use `left_on` and `right_on` instead. Think of `VLOOKUP` in a spreadsheet program: you're basically performing a join when you use it.
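
When the key column has a different name on each side, `left_on` and `right_on` name the column for each DataFrame. A minimal sketch, assuming a hypothetical `clients` table whose key is called `client_code`:

```python
import pandas as pd

orders = pd.DataFrame({'order_id': [101, 102], 'customer_id': ['A', 'B']})
# Hypothetical customers table whose key column has a different name
clients = pd.DataFrame({'client_code': ['A', 'B'],
                        'customer_name': ['Alice', 'Bob']})

# left_on / right_on name the key column on each side
merged = pd.merge(orders, clients,
                  left_on='customer_id', right_on='client_code', how='inner')
print(merged)
```

Both key columns are kept in the result, so one of them can be dropped afterwards if it is not needed.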

`how` is used to tell `merge()` what type of join to perform. The default is an inner join, which is what we want. An inner join will only keep rows where the join key(s) are present in both datasets. You get 4 types of joins: 
* Inner join: only keep rows where the join key(s) are present in both datasets

![image.png](attachment:image.png)

* Left join: keep all rows from the left dataset, and only keep rows from the right dataset where the join key(s) are present in both datasets

![image-2.png](attachment:image-2.png)

* Right join: keep all rows from the right dataset, and only keep rows from the left dataset where the join key(s) are present in both datasets

![image-3.png](attachment:image-3.png)

* Outer join: keep all rows from both datasets

![image-4.png](attachment:image-4.png)

Here is a left join example:

In [8]:
import pandas as pd

# Sample orders DataFrame
orders_data = {
    'order_id': [101, 102, 103, 104 , 105],
    'customer_id': ['A', 'B', 'A', 'C' , 'F'],
    'order_total': [50, 75, 60, 90 , 100]
}
orders_df = pd.DataFrame(orders_data)
orders_df 

   order_id customer_id  order_total
0       101           A           50
1       102           B           75
2       103           A           60
3       104           C           90
4       105           F          100


In [14]:
# Sample customers DataFrame
customers_data = {
    'customer_id': ['A', 'B', 'C', 'D'],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David']
}
customers_df = pd.DataFrame(customers_data)
customers_df

  customer_id customer_name
0           A         Alice
1           B           Bob
2           C       Charlie
3           D         David


In [12]:

# Perform left join based on 'customer_id'
result_df = pd.merge(orders_df, customers_df, on='customer_id', how='left')
print(result_df)

   order_id customer_id  order_total customer_name
0       101           A           50         Alice
1       102           B           75           Bob
2       103           A           60         Alice
3       104           C           90       Charlie
4       105           F          100           NaN


A left join keeps every row from the left DataFrame, `orders_df`, so the result has 5 rows, one per order. Order 105 belongs to `customer_id` `F`, which does not exist in `customers_df`, so its `customer_name` is `NaN`. Notice that `David` (or `customer_id` `D`) does not have an order, so he does not appear in the result at all.

We can swap it around and ask for all customers, whether or not they have an order, by using a `right join`:

In [15]:
result_df = pd.merge(orders_df, customers_df, on='customer_id', how='right')

print(result_df)

   order_id customer_id  order_total customer_name
0     101.0           A         50.0         Alice
1     103.0           A         60.0         Alice
2     102.0           B         75.0           Bob
3     104.0           C         90.0       Charlie
4       NaN           D          NaN         David


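Notice the last row. David has no orders, so `order_id` and `order_total` are `NaN` (not a number). Because `NaN` is a floating-point value, pandas also converts these two integer columns to floats. Order 105 (customer `F`) is missing from the result because `F` is not in `customers_df`.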

`OUTER JOIN` is even less restrictive than `LEFT JOIN` or `RIGHT JOIN`. It will return all rows from both tables, even if there is no match. Using the same example as above, the result would look like this:

In [17]:
result_df = pd.merge(orders_df, customers_df, on='customer_id', how='outer')

print(result_df)

   order_id customer_id  order_total customer_name
0     101.0           A         50.0         Alice
1     103.0           A         60.0         Alice
2     102.0           B         75.0           Bob
3     104.0           C         90.0       Charlie
4     105.0           F        100.0           NaN
5       NaN           D          NaN         David
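
To compare the four join types side by side, the sketch below counts the rows each one produces for the same orders and customers data (the `order_total` column is left out for brevity):

```python
import pandas as pd

orders_df = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105],
    'customer_id': ['A', 'B', 'A', 'C', 'F'],
})
customers_df = pd.DataFrame({
    'customer_id': ['A', 'B', 'C', 'D'],
    'customer_name': ['Alice', 'Bob', 'Charlie', 'David'],
})

# Count the rows each join type produces
row_counts = {
    how: len(pd.merge(orders_df, customers_df, on='customer_id', how=how))
    for how in ['inner', 'left', 'right', 'outer']
}
print(row_counts)
```

The inner join drops both the unmatched order (`F`) and the unmatched customer (`D`), the left and right joins each keep one of them, and the outer join keeps both.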


## Join
The previous section was called **MERGE** because pandas also has a separate **JOIN**.

In Pandas, both `join` and `merge` are used to combine two or more DataFrames into a single DataFrame. However, they differ in how they combine the DataFrames:

- `join` is used to combine DataFrames based on their indexes. It is a convenient method for combining DataFrames that have the same or similar indexes. By default, `join` performs a left join, which means that all the rows from the left DataFrame are included in the result, and only the matching rows from the right DataFrame are included. You can also specify other types of joins, such as inner join, right join, and outer join, using the `how` parameter.

- `merge` is used to combine DataFrames based on one or more columns that they share. It is a more flexible method for combining DataFrames that have different indexes or column names. By default, `merge` performs an inner join, which means that only the matching rows from both DataFrames are included in the result. You can also specify other types of joins, such as left join, right join, and outer join, using the `how` parameter.

In general, you should use `join` when you want to combine DataFrames based on their indexes, and `merge` when you want to combine DataFrames based on their columns. However, both methods can be used in a variety of situations, and the choice between them depends on the specific requirements of your analysis.

In [37]:
import pandas as pd

# create two DataFrames with overlapping (but not identical) indexes
data1 = {'A': ['A0', 'A1', 'A2'], 'B': ['B0', 'B1', 'B2']}
df1 = pd.DataFrame(data1, index=['K0', 'K1', 'K2'])
df1 

     A   B
K0  A0  B0
K1  A1  B1
K2  A2  B2


In [38]:
data2 = {'C': ['C0', 'C1', 'C2'], 'D': ['D0', 'D1', 'D2']}
df2 = pd.DataFrame(data2, index=['K0', 'K2', 'K3'])
df2

     C   D
K0  C0  D0
K2  C1  D1
K3  C2  D2


In [39]:

# join the DataFrames on their index
result = df1.join(df2)
# print the result
print(result)

     A   B    C    D
K0  A0  B0   C0   D0
K1  A1  B1  NaN  NaN
K2  A2  B2   C1   D1
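
The index-based join above can also be written as a `merge` that uses the index of both sides as the key. A minimal sketch with a reduced version of the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2']}, index=['K0', 'K1', 'K2'])
df2 = pd.DataFrame({'C': ['C0', 'C1', 'C2']}, index=['K0', 'K2', 'K3'])

joined = df1.join(df2)  # left join on the index by default
merged = pd.merge(df1, df2, left_index=True, right_index=True, how='left')

# Both approaches should produce the same DataFrame
print(joined.equals(merged))
```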


In [40]:
import pandas as pd

# Create the first DataFrame
df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})
df1 


   A  B
0  1  4
1  2  5
2  3  6
3  4  7
4  5  8


In [41]:
# Create the second DataFrame
df2 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 10]})
df2


   A   B
0  2   5
1  4   7
2  6   9
3  8  10


In [42]:

# Join df1's 'A' column against df2's index (not df2's 'A' column)
df_joined = df1.join(df2, on='A', lsuffix='_left', rsuffix='_right')

# Print the DataFrame
print(df_joined)


   A_left  B_left  A_right  B_right
0       1       4      4.0      7.0
1       2       5      6.0      9.0
2       3       6      8.0     10.0
3       4       7      NaN      NaN
4       5       8      NaN      NaN
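
In the example above, `join(..., on='A')` looks up the values of `df1['A']` in the *index* of `df2` (0 to 3), not in `df2['A']`. That is why `A_left` 1 is paired with `A_right` 4. If the intent is to match rows whose `A` values are equal, a `merge` on the column is one way to do it. A minimal sketch with the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [4, 5, 6, 7, 8]})
df2 = pd.DataFrame({'A': [2, 4, 6, 8], 'B': [5, 7, 9, 10]})

# merge compares the values in column 'A' on both sides
matched = pd.merge(df1, df2, on='A', how='left', suffixes=('_left', '_right'))
print(matched)
```

Only the rows with `A` equal to 2 and 4 find a partner in `df2`; the other rows get `NaN` in `B_right`.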


## Adding new rows 
You frequently need to add new rows to existing data. Pandas has several methods that achieve this task. Here is `concat()`:

In [50]:
# Create a DataFrame
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
df = pd.DataFrame(data)

df


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35


In [49]:

# Create a new DataFrame with the new row

new_row = pd.DataFrame({"Name": ["David"], "Age": [40]})

new_row


    Name  Age
0  David   40


In [51]:

# Use the concat function to concatenate the DataFrames
df = pd.concat([df, new_row], ignore_index=True)
# Print the updated DataFrame
print(df)

      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40


In [52]:
# remove duplicates from the DataFrame
df = df.drop_duplicates()
# print the updated DataFrame
print(df)


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40


In the code above, we added one row to the `df` dataframe but, more often, you'll add multiple rows to existing data:

In [53]:
import pandas as pd

# Create a DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]}
df = pd.DataFrame(data)

# Create a new DataFrame with multiple new rows
new_data = {'Name': ['David', 'Eva'],
            'Age': [40, 28]}
new_rows = pd.DataFrame(new_data)

# Use the concat function to concatenate the DataFrames
df = pd.concat([df, new_rows], ignore_index=True)

# Print the updated DataFrame
print(df)


      Name  Age
0    Alice   25
1      Bob   30
2  Charlie   35
3    David   40
4      Eva   28


The syntax is the same. 
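
The `drop_duplicates()` call shown earlier had no visible effect because none of the rows were repeated. The sketch below, with made-up data, appends one row that already exists so the effect is visible:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})
# Bob is already present, Charlie is new
new_rows = pd.DataFrame({'Name': ['Bob', 'Charlie'], 'Age': [30, 35]})

combined = pd.concat([df, new_rows], ignore_index=True)
# Keep the first occurrence of each fully identical row, then renumber the index
deduplicated = combined.drop_duplicates().reset_index(drop=True)
print(deduplicated)
```

By default a row counts as a duplicate only if every column matches an earlier row.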

## Update existing data
Inserting new and updating existing data are tasks that are frequently performed together. To update existing data, you can use the `update()` method or the `merge()` function shown earlier:

### Update

In [74]:
import pandas as pd

# Create the original DataFrame
data = {'ID': [1, 2, 3, 4],
        'Name': ['John', 'Jane', 'Bob', 'Alice'],
        'Age': [25, 30, 35, 40]}
df = pd.DataFrame(data)
df 


   ID   Name  Age
0   1   John   25
1   2   Jane   30
2   3    Bob   35
3   4  Alice   40


In [77]:

# Create the new DataFrame with updated values
new_data = {'ID': [2, 3],
            'Age': [31, 36]}
new_df = pd.DataFrame(new_data)
new_df


   ID  Age
0   2   31
1   3   36


In [78]:
# update() aligns rows by index label (0, 1), not by the 'ID' column
df.update(new_df)

# Print the updated DataFrame
print(df)

    ID   Name   Age
0  2.0   John  31.0
1  3.0   Jane  36.0
2  3.0    Bob  35.0
3  4.0  Alice  40.0

Notice that this result is wrong: `update()` matches rows by index label, not by `ID`, so rows 0 and 1 (John and Jane) were overwritten, including their `ID` values. The next example avoids this by setting `ID` as the index on both DataFrames before calling `update()`.


In [88]:
import pandas as pd

# Create the original DataFrame
data = {'ID': [1, 2, 3, 4],
        'Name': ['John', 'Jane', 'Bob', 'Alice'],
        'Age': [25, 30, 35, 40]}
df_main = pd.DataFrame(data)
df_main

   ID   Name  Age
0   1   John   25
1   2   Jane   30
2   3    Bob   35
3   4  Alice   40


In [89]:

# Create the new DataFrame with updated values and a new row
new_data = {'ID': [2, 3, 5],
            'Age': [31, 36, 27],
            'Name': ['Jane', 'Bob', 'Eve']}
df_update = pd.DataFrame(new_data)
df_update


   ID  Age  Name
0   2   31  Jane
1   3   36   Bob
2   5   27   Eve


In [90]:

# Set 'ID' as the index for both DataFrames
df_main.set_index('ID', inplace=True)
df_update.set_index('ID', inplace=True)

# Update df_main with values from df_update; only IDs already in df_main change,
# so the new row (ID 5, Eve) is not added
df_main.update(df_update, overwrite=True)

# Reset the index if needed
df_main.reset_index(inplace=True)

print(df_main)


   ID   Name   Age
0   1   John  25.0
1   2   Jane  31.0
2   3    Bob  36.0
3   4  Alice  40.0
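
`update()` only changes rows that already exist, so Eve (`ID` 5) is missing from the result above. One way to update and insert in a single step is `combine_first`, which takes values from the calling DataFrame first and falls back to the other one. This is a minimal sketch with the same data, not the only approach; the column order and numeric dtypes of the result may differ from the originals:

```python
import pandas as pd

df_main = pd.DataFrame({'ID': [1, 2, 3, 4],
                        'Name': ['John', 'Jane', 'Bob', 'Alice'],
                        'Age': [25, 30, 35, 40]}).set_index('ID')
df_update = pd.DataFrame({'ID': [2, 3, 5],
                          'Age': [31, 36, 27],
                          'Name': ['Jane', 'Bob', 'Eve']}).set_index('ID')

# Values from df_update win; rows only in df_main are kept; ID 5 is added
upserted = df_update.combine_first(df_main).reset_index()
print(upserted)
```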


### Merge

In [96]:
# Sample DataFrames
# Create the original DataFrame
data = {'ID': [1, 2, 3, 4],
        'Value': ['A', 'B', 'C', 'D']}
df_original = pd.DataFrame(data)
df_original



   ID Value
0   1     A
1   2     B
2   3     C
3   4     D


In [97]:

# Create a DataFrame with updated values
updated_data = {'ID': [2, 4],
                'Value': ['X', 'Y']}
df_updated = pd.DataFrame(updated_data)

df_updated


   ID Value
0   2     X
1   4     Y


In [99]:

# Merge the original DataFrame with the updated DataFrame based on 'ID'
merged_df = df_original.merge(df_updated, on='ID', how='left', suffixes=('_original', '_updated'))

print(merged_df)

   ID Value_original Value_updated
0   1              A           NaN
1   2              B             X
2   3              C           NaN
3   4              D             Y


Notice the `suffixes` argument. It's used to give unique names to columns that appear in both dataframes. `Value` has the same column name in both datasets, so in the merged dataframe it has been renamed to `Value_original` and `Value_updated` respectively.

In [100]:
# Update the original DataFrame with values from the merged DataFrame
df_original['Value'] = merged_df['Value_updated'].fillna(merged_df['Value_original'])

# Print the updated DataFrame
print(df_original)

   ID Value
0   1     A
1   2     X
2   3     C
3   4     Y


To update the original dataframe, replace the column that you want to update, `Value`, in `df_original` with `Value_updated` from `merged_df`:

```python 
df_original['Value'] = merged_df['Value_updated']
```

Where `ID`s don't match, i.e. existing rows that don't need to be updated, use original values in `merged_df`: 
```python
.fillna(merged_df['Value_original'])
```

When complete, the dataframe is updated: 

In [None]:
print(df_original)

## Removing data

An operation that goes hand-in-hand with adding and updating data is removing data. In this section, we will learn how to remove data from a dataframe.


In [114]:
import pandas as pd

# Sample data for DataFrame 1
data1 = {'A': [1, 2, 3, 4, 5],
         'B': ['a', 'b', 'c', 'd', 'e']}
df1 = pd.DataFrame(data1)

# Sample data for DataFrame 2
data2 = {'A': [2, 4],
         'B': ['b', 'd']}
df2 = pd.DataFrame(data2)
# Merge the dataframes based on a common key (e.g., 'A' and 'B')
merged_df = df1.merge(df2, on=['A', 'B'], how='left', indicator=True)
merged_df

   A  B     _merge
0  1  a  left_only
1  2  b       both
2  3  c  left_only
3  4  d       both
4  5  e  left_only


In [115]:

# Drop rows that exist in DataFrame 2 (based on the indicator column)
filtered_df = merged_df[merged_df['_merge'] == 'left_only'].drop('_merge', axis=1)

# Display the filtered DataFrame
print(filtered_df)


   A  B
0  1  a
2  3  c
4  5  e


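Setting `indicator=True` adds a `_merge` column that records where each row was found: `left_only`, `right_only`, or `both`. Keeping only the `left_only` rows reads as "give me all the rows in `df1` that are not in `df2`". The same idea can be written with a boolean mask and the `~` sign, which means NOT, for example `df1[~df1['A'].isin(df2['A'])]`. Either pattern can be used to remove rows from a dataframe that are in another dataframe.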
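
A minimal sketch of the `~` and `isin` variant with the same data. It assumes that column `A` alone is enough to identify a row; unlike the merge above, it does not compare column `B`:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': ['a', 'b', 'c', 'd', 'e']})
df2 = pd.DataFrame({'A': [2, 4], 'B': ['b', 'd']})

# True for rows of df1 whose 'A' value also appears in df2['A']
in_df2 = df1['A'].isin(df2['A'])

# ~ inverts the mask, keeping only the rows NOT found in df2
remaining = df1[~in_df2]
print(remaining)
```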