# Intro to Pandas
by Ryan Orsinger

## Module 5: Combining Dataframes
- Using `pd.concat` to combine dataframes vertically or horizontally
- Intro to joining dataframes together like database tables
- Understanding different types of joins
- Using `.merge` to join dataframes together based on column values in common

In [None]:
import pandas as pd

In [None]:
# String concatenation
"con" + "cat" + "e" + "nation"

In [None]:
# List concatenation
["con", "cat"] + ["e", "nation"]

In [None]:
# Dataframe Concatenation 
fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
    "quantity": [2, 1, 3]
})

vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [1, 7, 4]
})

In [None]:
# Default arguments preserve the original index for each dataframe
pd.concat([fruits, vegetables])

In [None]:
# axis=0 is the default argument for concatenating dataframes
# This is vertical concatenation, since we're stacking the rows of one dataframe under the other
pd.concat([fruits, vegetables], axis=0)

In [None]:
pd.concat([fruits, vegetables], ignore_index=True)
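
The cell above passes `ignore_index=True` without comment. As a minimal sketch of what that argument changes, the snippet below rebuilds the same two dataframes so it runs on its own, then compares the index labels with and without it.

```python
import pandas as pd

fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
    "quantity": [2, 1, 3],
})
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [1, 7, 4],
})

# Default: each dataframe keeps its own labels, so 0, 1, 2 appear twice
kept = pd.concat([fruits, vegetables])
print(kept.index.tolist())        # [0, 1, 2, 0, 1, 2]

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([fruits, vegetables], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]

# With duplicate labels, selecting label 0 returns two rows instead of one
print(len(kept.loc[[0]]))         # 2
```

Renumbering is usually what you want when the original row labels carry no meaning of their own.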

In [None]:
# Dataframe Concatenation
# Notice that this instance of fruits lacks a quantity column
fruits = pd.DataFrame({
    "name": ["mango", "guava", "orange"],
})

vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4]
})

# If a column is missing from one dataframe, pandas fills its values with NaN for those rows, so the concatenation still succeeds
pd.concat([fruits, vegetables])
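
To make the missing values visible, here is a small sketch using the same data as the cell above. It checks which rows have no `quantity` and shows a side effect that is easy to miss: the integer column becomes a float column.

```python
import pandas as pd

fruits = pd.DataFrame({"name": ["mango", "guava", "orange"]})
vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4],
})

combined = pd.concat([fruits, vegetables], ignore_index=True)

# Rows that came from fruits have no quantity, so pandas fills in NaN
print(combined["quantity"].isna().tolist())
# [True, True, True, False, False, False]

# NaN is a float, so the integer column is upcast to float64
print(combined["quantity"].dtype)  # float64
```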

In [None]:
# axis=1 concatenates dataframes horizontally
# This is a column-wise concatenation: columns are placed side by side
price_quality = pd.DataFrame({
    "price": [2.99, 1.99, 3.99],
    "presentation": ["frozen", "washed", "raw, bunch"] 
})

pd.concat([vegetables, price_quality], axis=1)
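
One detail worth knowing about `axis=1`: rows are matched by index label, not by position. The sketch below first repeats the concatenation above, then relabels the second dataframe (a hypothetical variation, not part of the lesson data) to show what happens when the labels do not line up.

```python
import pandas as pd

vegetables = pd.DataFrame({
    "name": ["Brussels sprouts", "spinach", "broccoli"],
    "quantity": [2, 3, 4],
})
price_quality = pd.DataFrame({
    "price": [2.99, 1.99, 3.99],
    "presentation": ["frozen", "washed", "raw, bunch"],
})

# Matching indexes (0, 1, 2 on both): rows line up one to one
aligned = pd.concat([vegetables, price_quality], axis=1)
print(aligned.shape)  # (3, 4)

# Hypothetical variation: relabel the second dataframe so the indexes differ
shifted = price_quality.copy()
shifted.index = [1, 2, 3]

misaligned = pd.concat([vegetables, shifted], axis=1)
print(misaligned.shape)            # (4, 4): one row per label 0, 1, 2, 3
print(misaligned.loc[0, "price"])  # nan, since label 0 has no price row
```

If the rows should pair up by position, resetting the index on both dataframes before concatenating avoids this surprise.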

In [None]:
# concat can combine an arbitrary number of dataframes
# This can be helpful if you have many dataframes from multiple sources
pd.concat([vegetables, vegetables, vegetables, vegetables])
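
When the dataframes come from different sources, it often helps to record where each row came from. The sketch below shows two ways to do that, using made-up store inventories (`store_a`, `store_b`, and `store_c` are hypothetical names for illustration).

```python
import pandas as pd

# Hypothetical per-store inventories standing in for "multiple sources"
store_a = pd.DataFrame({"name": ["spinach", "broccoli"], "quantity": [7, 4]})
store_b = pd.DataFrame({"name": ["spinach", "kale"], "quantity": [2, 5]})
store_c = pd.DataFrame({"name": ["broccoli"], "quantity": [1]})

# keys= adds an outer index level naming the dataframe each row came from
inventory = pd.concat([store_a, store_b, store_c], keys=["a", "b", "c"])
print(inventory.loc["b"])  # only the rows that came from store_b

# Alternatively, tag each dataframe with a column before concatenating
frames = {"a": store_a, "b": store_b, "c": store_c}
tagged = pd.concat(
    [df.assign(store=label) for label, df in frames.items()],
    ignore_index=True,
)
print(tagged["store"].tolist())  # ['a', 'a', 'b', 'b', 'c']
```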

## Using `.merge` to combine dataframes on common column values
- Database style join for Pandas Dataframes
- Pandas `.join` joins dataframes on their index by default, or on a key column of the calling dataframe matched against the index of the other dataframe
- Using `.merge` can be more flexible, since it can join on any columns, even when the column names are not identical
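
As a rough illustration of the difference, the sketch below uses a hypothetical variant of `roles` whose key column is named `id` rather than `role_id`. With `.merge` we name the key column on each side; with `.join` we match a column of `users` against the index of the other dataframe.

```python
import pandas as pd

users = pd.DataFrame({
    "name": ["bob", "mary", "sally"],
    "role_id": [1, 2, 3],
})
# Hypothetical variant of roles whose key column is named "id"
roles = pd.DataFrame({
    "id": [1, 2, 3],
    "role": ["admin", "author", "reviewer"],
})

# .merge: name the key column on each side
merged = users.merge(roles, left_on="role_id", right_on="id")
print(merged.columns.tolist())  # ['name', 'role_id', 'id', 'role']

# .join: match users.role_id against the index of the other dataframe
joined = users.join(roles.set_index("id"), on="role_id")
print(joined.columns.tolist())  # ['name', 'role_id', 'role']
```

Note that `.merge` keeps both key columns when their names differ, so you may want to drop one afterwards.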

## Types of Joins
- "Inner" returns records that have matching values in both tables.
- "Left" returns all records from the left table, and the matched records from the right table.
- "Right" returns all records from the right table, and the matched records from the left table.
- "Outer" returns all records from both tables, matching them up where possible.
![diagram of different types of joins](types_of_joins.png)
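
The four join types can be compared on a tiny example. The tables below are hypothetical and chosen so that key 1 exists only on the left and key 4 only on the right; the sketch lists which keys survive each kind of join.

```python
import pandas as pd

# Hypothetical tables: key 1 is only on the left, key 4 only on the right
left = pd.DataFrame({"key": [1, 2, 3], "left_val": ["a", "b", "c"]})
right = pd.DataFrame({"key": [2, 3, 4], "right_val": ["x", "y", "z"]})

# Collect the keys present in the result of each join type
keys_by_join = {
    how: left.merge(right, on="key", how=how)["key"].tolist()
    for how in ["inner", "left", "right", "outer"]
}
for how, keys in keys_by_join.items():
    print(how, keys)

# inner [2, 3]
# left [1, 2, 3]
# right [2, 3, 4]
# outer [1, 2, 3, 4]
```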

In [None]:
# Notice how role_id points to the role_id on the roles dataframe
# Take note of the missing data
users = pd.DataFrame({
    'user_id': [1, 2, 3, 4, 5, 6],
    'name': ['bob', 'mary', 'sally', 'adam', 'jane', 'mike'],
    'role_id': [1, 2, 3, 3, None, None]
})

users

In [None]:
# Notice that the role id column is also called "role_id" on the roles dataframe
roles = pd.DataFrame({
    'role_id': [1, 2, 3, 4],
    'role': ['admin', 'author', 'reviewer', 'commenter']
})

roles

In [None]:
# An inner join returns only the rows whose role_id exists on both dataframes
# left_on and right_on name the key column on each side
users.merge(roles, left_on='role_id', right_on='role_id', how='inner')

In [None]:
# If the same exact column name exists on both dataframes, we can use the "on" argument
users.merge(roles, on='role_id', how='inner')

In [None]:
# Notice that the left join keeps all records from the users dataframe, even if they are missing on the right dataframe
users.merge(roles, on='role_id', how='left')

In [None]:
# Notice that the right join keeps all records from the roles dataframe, even if they have no match on the users dataframe
users.merge(roles, left_on='role_id', right_on='role_id', how='right')

In [None]:
# The outer join keeps all records from both dataframes and matches them up where possible
# Rows without a match, including users with a null role_id, are kept
users.merge(roles, on='role_id', how='outer')
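
With an outer join it is not always obvious which side each row came from. One way to check, sketched below with the same `users` and `roles` data, is the `indicator=True` option of `.merge`, which adds a `_merge` column marking each row as `left_only`, `right_only`, or `both`.

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "mary", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, None, None],
})
roles = pd.DataFrame({
    "role_id": [1, 2, 3, 4],
    "role": ["admin", "author", "reviewer", "commenter"],
})

# indicator=True adds a "_merge" column: left_only, right_only, or both
outer = users.merge(roles, on="role_id", how="outer", indicator=True)
print(outer["_merge"].value_counts().to_dict())
# {'both': 4, 'left_only': 2, 'right_only': 1}

# Users without a role, and roles without a user
print(outer.loc[outer["_merge"] == "left_only", "name"].tolist())   # ['jane', 'mike']
print(outer.loc[outer["_merge"] == "right_only", "role"].tolist())  # ['commenter']
```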

In [None]:
# Relationship between dataframe order and join type 
# Consider the result of starting with users and left joining roles
users.merge(roles, on="role_id", how='left')

In [None]:
# Compare to starting with roles and using right join with users
roles.merge(users, on="role_id", how='right')
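
The two cells above should describe the same records, with the columns in a different order. The sketch below is one way to confirm that, assuming the same `users` and `roles` data: it reorders the columns, sorts both results by `user_id`, and compares them with `pd.testing.assert_frame_equal`.

```python
import pandas as pd

users = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "name": ["bob", "mary", "sally", "adam", "jane", "mike"],
    "role_id": [1, 2, 3, 3, None, None],
})
roles = pd.DataFrame({
    "role_id": [1, 2, 3, 4],
    "role": ["admin", "author", "reviewer", "commenter"],
})

users_left = users.merge(roles, on="role_id", how="left")
roles_right = roles.merge(users, on="role_id", how="right")

# Same columns, different order: the calling dataframe's columns come first
print(users_left.columns.tolist())   # ['user_id', 'name', 'role_id', 'role']
print(roles_right.columns.tolist())  # ['role_id', 'role', 'user_id', 'name']

# Align the column order and row order, then compare the contents
a = users_left.sort_values("user_id").reset_index(drop=True)
b = roles_right[users_left.columns].sort_values("user_id").reset_index(drop=True)
pd.testing.assert_frame_equal(a, b, check_dtype=False)  # raises if they differ
```

In other words, "left" and "right" only describe which dataframe is on which side of the call, so swapping the dataframes and the join type gives the same rows.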

## Additional Resources
- https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.concat.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html
- https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
- https://pandas.pydata.org/docs/user_guide/merging.html
- https://pandas.pydata.org/pandas-docs/stable/getting_started/comparison/comparison_with_sql.html#compare-with-sql-join

## Exercises
- As an experiment, use `pd.concat` to concatenate the `vegetables` and `users` dataframes together. 
    - First, use the default axis argument (or explicitly state `axis=0`). What do you notice?
    - Then, use `axis=1`. What do you notice in the results?
- Read "2020-sales.csv", "2021-sales.csv", and "2022-sales.csv" into dataframes, then concatenate these 3 dataframes vertically.
- Create a `posts` dataframe from the following information.
```
[
    {
        "author_id": 1,
        "title": "How I Learned Python"
    },
    {
        "author_id": 2,
        "title": "How I Learned to Stop Worrying and Love Pandas"
    },
    {
        "author_id": 2,
        "title": "Quick Tutorial on Installing Anaconda"
    },
    {
        "author_id": 9,
        "title": "Learning Pandas If You Already Work With Spreadsheets"
    }
]
```
- Perform an inner join of `users` and `posts`. *Hint:* Think about what data these two dataframes share in common.
- Start with `users`, then left join the `posts` dataframe.
- Start with `posts`, then right join the `users` dataframe.
- Finally, perform an outer join of `users` and `posts`.

In [None]:
# Concatenate vegetables and users, using the default axis argument, for vertical, row-wise concatenation


In [None]:
# Concatenate vegetables and users, using the axis=1 argument, for horizontal, column-wise concatenation


In [None]:
# Read "2020-sales.csv", "2021-sales.csv", and "2022-sales.csv" into dataframes
# Concatenate these 3 dataframes together, vertically
    

In [None]:
# Create a `posts` dataframe of the above blog post data


In [None]:
# Perform an inner join of `users` and `posts`. 
# Hint: Think about what data these two dataframes share in common.


In [None]:
# Start with `users` then left join the `posts` dataframe


In [None]:
# Start with `posts` then right join the `users` dataframe


In [None]:
# Finally, perform an outer join of `users` and `posts`
