# Basic Joins

## Data Preparation
Joins need two DataFrames, so here I take a tiny, known slice of the taxis dataset to play about with.

In [22]:
import seaborn as sns
import pandas as pd
from pyprojroot import here

In [15]:
taxis = sns.load_dataset("taxis")
to_ozone = taxis[taxis["dropoff_zone"] == "Ozone Park"].copy(deep=True)
print(f"taxis has {len(taxis)} rows and to_ozone has {len(to_ozone)} row")

taxis has 6433 rows and to_ozone has 1 row


## pd.merge suffixes

The `suffixes` argument controls the suffix appended to column names that appear in both DataFrames. The default is `("_x", "_y")`, which gives an x/y labelling.

In [18]:
taxis.merge(to_ozone, on="pickup").columns

Index(['pickup', 'dropoff_x', 'passengers_x', 'distance_x', 'fare_x', 'tip_x',
       'tolls_x', 'total_x', 'color_x', 'payment_x', 'pickup_zone_x',
       'dropoff_zone_x', 'pickup_borough_x', 'dropoff_borough_x', 'dropoff_y',
       'passengers_y', 'distance_y', 'fare_y', 'tip_y', 'tolls_y', 'total_y',
       'color_y', 'payment_y', 'pickup_zone_y', 'dropoff_zone_y',
       'pickup_borough_y', 'dropoff_borough_y'],
      dtype='object')

In [19]:
taxis.merge(to_ozone, on="pickup", suffixes=("_taxis", "_to_ozone")).columns

Index(['pickup', 'dropoff_taxis', 'passengers_taxis', 'distance_taxis',
       'fare_taxis', 'tip_taxis', 'tolls_taxis', 'total_taxis', 'color_taxis',
       'payment_taxis', 'pickup_zone_taxis', 'dropoff_zone_taxis',
       'pickup_borough_taxis', 'dropoff_borough_taxis', 'dropoff_to_ozone',
       'passengers_to_ozone', 'distance_to_ozone', 'fare_to_ozone',
       'tip_to_ozone', 'tolls_to_ozone', 'total_to_ozone', 'color_to_ozone',
       'payment_to_ozone', 'pickup_zone_to_ozone', 'dropoff_zone_to_ozone',
       'pickup_borough_to_ozone', 'dropoff_borough_to_ozone'],
      dtype='object')
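
To see the suffix behaviour without loading the taxis data, here is a minimal sketch on two toy frames. The frames `left` and `right` and their values are made up for illustration; only `merge` and its `suffixes` argument come from pandas.

```python
import pandas as pd

# Toy frames (hypothetical data) sharing the key "pickup" and the column "fare".
left = pd.DataFrame({"pickup": ["a", "b"], "fare": [10.0, 12.5]})
right = pd.DataFrame({"pickup": ["a", "b"], "fare": [11.0, 13.0]})

# Default suffixes are ("_x", "_y").
default_cols = list(left.merge(right, on="pickup").columns)

# Custom suffixes for both sides.
custom_cols = list(
    left.merge(right, on="pickup", suffixes=("_left", "_right")).columns
)

# Passing None for one side leaves that side's column names unchanged.
right_only_cols = list(
    left.merge(right, on="pickup", suffixes=(None, "_right")).columns
)

print(default_cols)     # ['pickup', 'fare_x', 'fare_y']
print(custom_cols)      # ['pickup', 'fare_left', 'fare_right']
print(right_only_cols)  # ['pickup', 'fare', 'fare_right']
```

The key column `pickup` never receives a suffix because it is the join key; only the remaining overlapping names are renamed.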

***
## Kaggle Relational Data

[Social media data](https://www.kaggle.com/datasets/iqbalrony/relational-data-engineering?resource=download) comprising:

* friends_table 
* posts_table
* reactions_table
* user_table

This will be used to exemplify a more complex join across **several tables** and **several keys**.


In [35]:
users = pd.read_csv(here("data/kaggle-data/user_table.csv"))
reactions = pd.read_csv(here("data/kaggle-data/reactions_table.csv"))
posts = pd.read_csv(here("data/kaggle-data/posts_table.csv"))
# The users table has no user_id column, so derive a 1-based id from row order
users["user_id"] = range(1, len(users) + 1)
users.tail()

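| Unnamed: 0 | Surname | Name | Age | Subscription Date | user_id |
| --- | --- | --- | --- | --- | --- |
| 995 | Kirk | Lee | 19 | 1588160246 | 996 |
| 996 | Pomme | Franz | 40 | 1588159625 | 997 |
| 997 | Gwahsi | Thomas | 40 | 1588165504 | 998 |
| 998 | Beierlorzer | Jean-Luc | 32 | 1588151074 | 999 |
| 999 | Thronton | Franz | 28 | 1588171183 | 1000 |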
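
Deriving `user_id` from row order assumes the ids used in the other tables line up with that order. As a rough sketch of how that assumption could be checked, a left merge with `indicator=True` flags keys that have no match. The miniature `users` and `posts` frames below are hypothetical stand-ins, not the Kaggle data.

```python
import pandas as pd

# Hypothetical miniature tables; column names mirror those used in this notebook.
users = pd.DataFrame({"Name": ["Lee", "Franz", "Thomas"]})
users["user_id"] = range(1, len(users) + 1)
posts = pd.DataFrame({"User": [1, 1, 3, 7], "Post Date": [100, 101, 102, 103]})

# indicator=True adds a "_merge" column recording where each row's key was found.
check = posts.merge(
    users, left_on="User", right_on="user_id", how="left", indicator=True
)

# Posts whose "User" value has no matching user_id.
orphans = check.loc[check["_merge"] == "left_only", "User"].tolist()
print(orphans)  # [7]
```

An empty `orphans` list on the real tables would suggest every post maps to a generated `user_id`, though it would not prove the ids were assigned in the same order.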


### Count reactions to posts

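Return counts of reactions to posts made by each user on each date. This exemplifies chaining merges across several tables. Note that the second merge joins on `User` alone, so every post by a user is paired with every row in `reactions` that carries the same `User` value. That would explain why user 642 shows the same count on every post date in the output.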

In [53]:
user_post_react = (
    users.merge(posts, left_on="user_id", right_on="User")
    .merge(reactions, on="User")
)
(
    user_post_react.groupby(["User", "Post Date"])
    .agg({"Reaction Type": "count"})
    .sort_values(by=["Reaction Type", "User"], ascending=[False, True])
    .head()
)

| User | Post Date | Reaction Type |
| --- | --- | --- |
| 465 | 1588163323 | 226 |
| 642 | 1588162733 | 217 |
| 642 | 1588163391 | 217 |
| 642 | 1588163484 | 217 |
| 642 | 1588163543 | 217 |
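
The choice of join key changes what these counts mean. The sketch below uses toy tables with a hypothetical `post_id` column (an assumption, the Kaggle files may name or structure this differently) to contrast joining on `User` alone with joining on a post identifier.

```python
import pandas as pd

# Hypothetical toy tables; "post_id" is an assumed column, not taken from the Kaggle data.
posts = pd.DataFrame({
    "post_id": [1, 2, 3],
    "User": [10, 10, 20],        # author of the post
    "Post Date": [100, 101, 100],
})
reactions = pd.DataFrame({
    "post_id": [1, 1, 2, 3],     # post being reacted to
    "User": [20, 20, 10, 10],    # user who reacted
    "Reaction Type": ["like", "love", "like", "like"],
})

# Joining on "User" pairs each post with every reaction made BY its author.
by_user = posts.merge(reactions, on="User")

# Joining on "post_id" pairs each post with the reactions it received.
# validate raises if post_id is not unique in posts.
by_post = posts.merge(
    reactions, on="post_id", suffixes=("_author", "_reactor"),
    validate="one_to_many",
)

counts = by_post.groupby(["User_author", "Post Date"])["Reaction Type"].count()
print(len(by_user), len(by_post))
print(counts)
```

Here `by_user` has 6 rows while `by_post` has 4, because the `User` join multiplies each author's posts by that author's reactions. The `validate` argument is a cheap guard against this kind of unintended row multiplication.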
