# Joins

When working with datasets, we often need to merge data from different sources. To do this, we need keys on which to join our data.

Let's use the <a href="https://www.kaggle.com/datasets/usdot/flight-delays">US Flights 2015 datasets</a>. We have three datasets.

First, we have information on each flight.

In [57]:
import verticapy as vp
flights = vp.read_csv("data/flights.csv")
display(flights)

Unnamed: 0,YEAR (Int),MONTH (Int),DAY (Int),DAY_OF_WEEK (Int),AIRLINE (Varchar(20)),FLIGHT_NUMBER (Int),TAIL_NUMBER (Varchar(20)),ORIGIN_AIRPORT (Varchar(20)),DESTINATION_AIRPORT (Varchar(20)),SCHEDULED_DEPARTURE (Int),DEPARTURE_TIME (Int),DEPARTURE_DELAY (Int),TAXI_OUT (Int),WHEELS_OFF (Int),SCHEDULED_TIME (Int),ELAPSED_TIME (Int),AIR_TIME (Int),DISTANCE (Int),WHEELS_ON (Int),TAXI_IN (Int),SCHEDULED_ARRIVAL (Int),ARRIVAL_TIME (Int),ARRIVAL_DELAY (Int),DIVERTED (Int),CANCELLED (Int),CANCELLATION_REASON (Varchar(20)),AIR_SYSTEM_DELAY (Int),SECURITY_DELAY (Int),AIRLINE_DELAY (Int),LATE_AIRCRAFT_DELAY (Int),WEATHER_DELAY (Int)
1,2015,1,1,4,AA,1,N787AA,JFK,LAX,900,855,-5,17,912,390,402,378,2475,1230,7,1230,1237,7,0,0,[null],[null],[null],[null],[null],[null]
2,2015,1,1,4,AA,2,N795AA,LAX,JFK,900,856,-4,16,912,335,295,271,2475,1643,8,1735,1651,-44,0,0,[null],[null],[null],[null],[null],[null]
3,2015,1,1,4,AA,3,N798AA,JFK,LAX,1230,1226,-4,19,1245,380,382,358,2475,1543,5,1550,1548,-2,0,0,[null],[null],[null],[null],[null],[null]
4,2015,1,1,4,AA,4,N799AA,LAX,JFK,1220,1214,-6,23,1237,330,319,284,2475,2021,12,2050,2033,-17,0,0,[null],[null],[null],[null],[null],[null]
5,2015,1,1,4,AA,5,N376AA,DFW,HNL,1305,1754,289,21,1815,515,526,499,3784,2234,6,1740,2240,300,0,0,[null],11,0,197,92,0
6,2015,1,1,4,AA,6,N398AA,OGG,DFW,1805,[null],[null],[null],[null],425,[null],[null],3711,[null],[null],510,[null],[null],0,1,A,[null],[null],[null],[null],[null]
7,2015,1,1,4,AA,7,N398AA,DFW,OGG,1215,1513,178,24,1537,500,517,490,3711,1947,3,1635,1950,195,0,0,[null],17,0,178,0,0
8,2015,1,1,4,AA,8,N368AA,HNL,DFW,1745,1933,108,15,1948,445,446,420,3784,648,11,510,659,109,0,0,[null],1,0,0,108,0
9,2015,1,1,4,AA,9,N792AA,JFK,LAX,700,649,-11,22,711,380,397,368,2475,1019,7,1020,1026,6,0,0,[null],[null],[null],[null],[null],[null]
10,2015,1,1,4,AA,10,N796AA,LAX,JFK,2150,2150,0,14,2204,309,294,275,2475,539,5,559,544,-15,0,0,[null],[null],[null],[null],[null],[null]


Second, we have information on each airport.

In [58]:
airports = vp.read_csv("data/airports.csv")
display(airports)

Unnamed: 0,IATA_CODE (Varchar(20)),(Varchar(156)),CITY (Varchar(60)),STATE (Varchar(20)),COUNTRY (Varchar(20)),"LATITUDE (Numeric(10,6))","LONGITUDE (Numeric(11,6))"
1,ABE,,Allentown,PA,USA,40.65236,-75.4404
2,ABI,,Abilene,TX,USA,32.41132,-99.6819
3,ABQ,,Albuquerque,NM,USA,35.04022,-106.60919
4,ABR,,Aberdeen,SD,USA,45.44906,-98.42183
5,ABY,,Albany,GA,USA,31.53552,-84.19447
6,ACK,,Nantucket,MA,USA,41.25305,-70.06018
7,ACT,,Waco,TX,USA,31.61129,-97.23052
8,ACV,,Arcata/Eureka,CA,USA,40.97812,-124.10862
9,ACY,,Atlantic City,NJ,USA,39.45758,-74.57717
10,ADK,,Adak,AK,USA,51.87796,-176.64603


Third, we have the names of each airline.

In [59]:
airlines = vp.read_csv("data/airlines.csv")
display(airlines)

Unnamed: 0,IATA_CODE (Varchar(20)),AIRLINE (Varchar(56))
1,AA,American Airlines Inc.
2,AS,Alaska Airlines Inc.
3,B6,JetBlue Airways
4,DL,Delta Air Lines Inc.
5,EV,Atlantic Southeast Airlines
6,F9,Frontier Airlines Inc.
7,HA,Hawaiian Airlines Inc.
8,MQ,American Eagle Airlines Inc.
9,NK,Spirit Air Lines
10,OO,Skywest Airlines Inc.


Notice that each dataset has a key (primary or foreign) on which to join the data. For example, we can join the 'flights' dataset to the 'airlines' and 'airports' datasets using the corresponding IATA code.
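Before joining, it can help to check that the key is unique in the lookup table and that every key in the main table has a match. The following is a minimal plain-Python sketch of that check, not a VerticaPy feature: the helper names `duplicated_keys` and `unmatched_keys` are hypothetical, and the flight sample is made up for illustration.

```python
from collections import Counter

# Only the key columns are kept in this toy example.
flight_airlines = ["AA", "AA", "AA", "DL", "B6"]  # hypothetical sample of flights.AIRLINE
airline_codes = ["AA", "AS", "B6", "DL", "EV"]    # first IATA codes of the airlines table

def duplicated_keys(keys):
    """Return the key values that appear more than once."""
    return sorted(k for k, n in Counter(keys).items() if n > 1)

def unmatched_keys(left_keys, right_keys):
    """Return the left-side key values that have no match on the right side."""
    return sorted(set(left_keys) - set(right_keys))

dup = duplicated_keys(airline_codes)
missing = unmatched_keys(flight_airlines, airline_codes)
print("duplicated lookup keys:", dup)
print("flight keys without a match:", missing)
```

If `dup` is not empty, a join would multiply the matching rows; if `missing` is not empty, a left join would leave nulls for those rows.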

To join datasets in VerticaPy, use the vDataFrame's 'join' method.

In [60]:
help(vp.vDataFrame.join)

Help on function join in module verticapy.vdataframe:

join(self, input_relation, on:dict={}, on_interpolate:dict={}, how:str='natural', expr1:list=['*'], expr2:list=['*'])
    ---------------------------------------------------------------------------
    Joins the vDataFrame with another one or an input relation.
    
                     recommended to always check the current structure 
                     using the 'current_relation' method and to save it using the 
                     'to_db' method with the parameters 'inplace = True' and 
                     'relation_type = table'
    
    Parameters
    ----------
    input_relation: str/vDataFrame
        Relation to use to do the merging.
    on: dict, optional
        Dictionary of all different keys. The dict must be similar to the following:
        {"relationA_key1": "relationB_key1" ..., "relationA_keyk": "relationB_keyk"}
        where relationA is the current vDataFrame and relationB is the input relation
        

Let's use a left join to merge the 'airlines' dataset and the 'flights' dataset.

In [61]:
flights = flights.join(airlines,
                       how = "left",
                       on = {"airline": "IATA_CODE"},
                       expr2 = ["AIRLINE AS airline_long"])
display(flights)

Unnamed: 0,YEAR (Integer),MONTH (Integer),DAY (Integer),DAY_OF_WEEK (Integer),AIRLINE (Varchar(20)),FLIGHT_NUMBER (Integer),TAIL_NUMBER (Varchar(20)),ORIGIN_AIRPORT (Varchar(20)),DESTINATION_AIRPORT (Varchar(20)),SCHEDULED_DEPARTURE (Integer),DEPARTURE_TIME (Integer),DEPARTURE_DELAY (Integer),TAXI_OUT (Integer),WHEELS_OFF (Integer),SCHEDULED_TIME (Integer),ELAPSED_TIME (Integer),AIR_TIME (Integer),DISTANCE (Integer),WHEELS_ON (Integer),TAXI_IN (Integer),SCHEDULED_ARRIVAL (Integer),ARRIVAL_TIME (Integer),ARRIVAL_DELAY (Integer),DIVERTED (Integer),CANCELLED (Integer),CANCELLATION_REASON (Varchar(20)),AIR_SYSTEM_DELAY (Integer),SECURITY_DELAY (Integer),AIRLINE_DELAY (Integer),LATE_AIRCRAFT_DELAY (Integer),WEATHER_DELAY (Integer),airline_long (Varchar(56))
1,2015,1,1,4,AA,1,N787AA,JFK,LAX,900,855,-5,17,912,390,402,378,2475,1230,7,1230,1237,7,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
2,2015,1,1,4,AA,2,N795AA,LAX,JFK,900,856,-4,16,912,335,295,271,2475,1643,8,1735,1651,-44,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
3,2015,1,1,4,AA,3,N798AA,JFK,LAX,1230,1226,-4,19,1245,380,382,358,2475,1543,5,1550,1548,-2,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
4,2015,1,1,4,AA,4,N799AA,LAX,JFK,1220,1214,-6,23,1237,330,319,284,2475,2021,12,2050,2033,-17,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
5,2015,1,1,4,AA,5,N376AA,DFW,HNL,1305,1754,289,21,1815,515,526,499,3784,2234,6,1740,2240,300,0,0,[null],11,0,197,92,0,American Airlines Inc.
6,2015,1,1,4,AA,6,N398AA,OGG,DFW,1805,[null],[null],[null],[null],425,[null],[null],3711,[null],[null],510,[null],[null],0,1,A,[null],[null],[null],[null],[null],American Airlines Inc.
7,2015,1,1,4,AA,7,N398AA,DFW,OGG,1215,1513,178,24,1537,500,517,490,3711,1947,3,1635,1950,195,0,0,[null],17,0,178,0,0,American Airlines Inc.
8,2015,1,1,4,AA,8,N368AA,HNL,DFW,1745,1933,108,15,1948,445,446,420,3784,648,11,510,659,109,0,0,[null],1,0,0,108,0,American Airlines Inc.
9,2015,1,1,4,AA,9,N792AA,JFK,LAX,700,649,-11,22,711,380,397,368,2475,1019,7,1020,1026,6,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
10,2015,1,1,4,AA,10,N796AA,LAX,JFK,2150,2150,0,14,2204,309,294,275,2475,539,5,559,544,-15,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.
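To make the semantics of a left join concrete, here is a small plain-Python sketch. It only illustrates the behavior; it is not how VerticaPy runs the join. The helper `left_join`, the `ZZ` airline code, and the three sample flights are assumptions made for this example, and the sketch assumes the lookup key is unique.

```python
# Plain-Python illustration of left-join semantics (not VerticaPy code).
airlines = {"AA": "American Airlines Inc.", "DL": "Delta Air Lines Inc."}

flights = [
    {"FLIGHT_NUMBER": 1, "AIRLINE": "AA"},
    {"FLIGHT_NUMBER": 2, "AIRLINE": "DL"},
    {"FLIGHT_NUMBER": 3, "AIRLINE": "ZZ"},  # hypothetical code with no match
]

def left_join(rows, lookup, key, new_column):
    """Keep every left row; attach the looked-up value, or None if unmatched."""
    return [{**row, new_column: lookup.get(row[key])} for row in rows]

joined = left_join(flights, airlines, "AIRLINE", "airline_long")
for row in joined:
    print(row)
```

Every flight is kept, including the one whose airline code has no match; its `airline_long` value is simply empty. An inner join would drop that row instead.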


Let's use two left joins to get the information on the origin and destination airports.

In [62]:
flights = flights.join(airports,
                       how = "left",
                       on = {"origin_airport": "IATA_CODE"},
                       expr2 = ["LATITUDE AS origin_lat",
                                "LONGITUDE AS origin_lon"])
flights = flights.join(airports,
                       how = "left",
                       on = {"destination_airport": "IATA_CODE"},
                       expr2 = ["LATITUDE AS destination_lat",
                                "LONGITUDE AS destination_lon"])
display(flights)

Unnamed: 0,YEAR (Integer),MONTH (Integer),DAY (Integer),DAY_OF_WEEK (Integer),AIRLINE (Varchar(20)),FLIGHT_NUMBER (Integer),TAIL_NUMBER (Varchar(20)),ORIGIN_AIRPORT (Varchar(20)),DESTINATION_AIRPORT (Varchar(20)),SCHEDULED_DEPARTURE (Integer),DEPARTURE_TIME (Integer),DEPARTURE_DELAY (Integer),TAXI_OUT (Integer),WHEELS_OFF (Integer),SCHEDULED_TIME (Integer),ELAPSED_TIME (Integer),AIR_TIME (Integer),DISTANCE (Integer),WHEELS_ON (Integer),TAXI_IN (Integer),SCHEDULED_ARRIVAL (Integer),ARRIVAL_TIME (Integer),ARRIVAL_DELAY (Integer),DIVERTED (Integer),CANCELLED (Integer),CANCELLATION_REASON (Varchar(20)),AIR_SYSTEM_DELAY (Integer),SECURITY_DELAY (Integer),AIRLINE_DELAY (Integer),LATE_AIRCRAFT_DELAY (Integer),WEATHER_DELAY (Integer),airline_long (Varchar(56)),"origin_lat (Numeric(10,6))","origin_lon (Numeric(11,6))","destination_lat (Numeric(10,6))","destination_lon (Numeric(11,6))"
1,2015,1,1,4,AA,1,N787AA,JFK,LAX,900,855,-5,17,912,390,402,378,2475,1230,7,1230,1237,7,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,40.63975,-73.77893,33.94254,-118.40807
2,2015,1,1,4,AA,2,N795AA,LAX,JFK,900,856,-4,16,912,335,295,271,2475,1643,8,1735,1651,-44,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,33.94254,-118.40807,40.63975,-73.77893
3,2015,1,1,4,AA,3,N798AA,JFK,LAX,1230,1226,-4,19,1245,380,382,358,2475,1543,5,1550,1548,-2,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,40.63975,-73.77893,33.94254,-118.40807
4,2015,1,1,4,AA,4,N799AA,LAX,JFK,1220,1214,-6,23,1237,330,319,284,2475,2021,12,2050,2033,-17,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,33.94254,-118.40807,40.63975,-73.77893
5,2015,1,1,4,AA,5,N376AA,DFW,HNL,1305,1754,289,21,1815,515,526,499,3784,2234,6,1740,2240,300,0,0,[null],11,0,197,92,0,American Airlines Inc.,32.89595,-97.0372,21.31869,-157.92241
6,2015,1,1,4,AA,6,N398AA,OGG,DFW,1805,[null],[null],[null],[null],425,[null],[null],3711,[null],[null],510,[null],[null],0,1,A,[null],[null],[null],[null],[null],American Airlines Inc.,20.89865,-156.43046,32.89595,-97.0372
7,2015,1,1,4,AA,7,N398AA,DFW,OGG,1215,1513,178,24,1537,500,517,490,3711,1947,3,1635,1950,195,0,0,[null],17,0,178,0,0,American Airlines Inc.,32.89595,-97.0372,20.89865,-156.43046
8,2015,1,1,4,AA,8,N368AA,HNL,DFW,1745,1933,108,15,1948,445,446,420,3784,648,11,510,659,109,0,0,[null],1,0,0,108,0,American Airlines Inc.,21.31869,-157.92241,32.89595,-97.0372
9,2015,1,1,4,AA,9,N792AA,JFK,LAX,700,649,-11,22,711,380,397,368,2475,1019,7,1020,1026,6,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,40.63975,-73.77893,33.94254,-118.40807
10,2015,1,1,4,AA,10,N796AA,LAX,JFK,2150,2150,0,14,2204,309,294,275,2475,539,5,559,544,-15,0,0,[null],[null],[null],[null],[null],[null],American Airlines Inc.,33.94254,-118.40807,40.63975,-73.77893


Splitting the data into separate tables avoids storing duplicate information. Imagine storing the latitude and longitude of the origin and destination airports with every flight: the same coordinates would be repeated for each flight through an airport, which would greatly increase the volume of the data.
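A quick back-of-the-envelope calculation shows the scale of the duplication. The airport count below is inferred from the cross-join output later in this lesson (322 self-pairs are filtered out); the flight count is a hypothetical round number chosen only for illustration.

```python
n_airports = 322        # inferred from the cross-join filter output in this lesson
n_flights = 1_000_000   # hypothetical flight count, for illustration only

# Separate tables: one (latitude, longitude) pair stored per airport.
normalized_values = n_airports * 2

# Single wide table: origin and destination coordinates repeated on every flight.
denormalized_values = n_flights * 4

ratio = denormalized_values / normalized_values
print(normalized_values, denormalized_values, round(ratio))
```

Under these assumptions, the wide table stores several thousand times more coordinate values than the separate 'airports' table.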

Cross joins are special: they don't need a key. A cross join pairs every row of one relation with every row of the other, which makes it useful for pairwise computations.

Let's use a cross join of the 'airports' dataset on itself to compute the distance between every pair of airports.

In [63]:
distances = airports.join(airports, 
                          how = "cross", 
                          expr1 = ["IATA_CODE AS airport1", 
                                   "LATITUDE AS airport1_latitude", 
                                   "LONGITUDE AS airport1_longitude"],
                          expr2 = ["IATA_CODE AS airport2", 
                                   "LATITUDE AS airport2_latitude", 
                                   "LONGITUDE AS airport2_longitude"])
distances.filter("airport1 != airport2")

import verticapy.stats as st
distances["distance"] = st.distance(distances["airport1_latitude"], distances["airport1_longitude"],
                                    distances["airport2_latitude"], distances["airport2_longitude"])

322 elements were filtered.
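The same idea can be sketched in plain Python on three airports whose coordinates appear in the outputs above. This sketch assumes a great-circle (haversine) distance in kilometres with a mean Earth radius of 6371 km, so its results may differ slightly from what `st.distance` returns; the function `haversine_km` is a hypothetical helper written for this example.

```python
from itertools import product
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius

# (IATA code, latitude, longitude), taken from the joined output above.
airports = [
    ("JFK", 40.63975, -73.77893),
    ("LAX", 33.94254, -118.40807),
    ("DFW", 32.89595, -97.0372),
]

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

# Cross join: pair every airport with every airport, then drop the self-pairs.
distances = {
    (a[0], b[0]): haversine_km(a[1], a[2], b[1], b[2])
    for a, b in product(airports, repeat=2)
    if a[0] != b[0]
}

for pair, km in sorted(distances.items()):
    print(pair, round(km))
```

With n airports, the cross join yields n × n pairs, and the filter removes the n pairs where an airport is matched with itself. That is consistent with the 322 filtered elements reported above.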


VerticaPy offers many powerful options for joining datasets.

In the next lesson, we'll learn how to deal with duplicates.