# Data Enrichment

Data enrichment and data aggregation cover the processes of joining or merging datasets, creating new columns, calculating values over certain windows, grouping values into bins, or even changing the values themselves. All of these processes assist data analysis by providing specific insight into the data. Quantities such as the severity of anomalous data, rolling averages, cumulative sums, or counts grouped by age can all make the data easier to interpret. We'll focus on the following for this class:
- Querying and merging dataframes
- Aggregating dataframes

In [2]:
import pandas as pd

## Querying data frames
In previous sections we learned how to filter a dataframe by building a boolean mask that indicates exactly which rows to keep and which to remove. Another approach is to use the `query` method.

Below we import an expanded version of the weather data set that we've seen in previous lessons.

In [4]:
weather = pd.read_csv('https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_04/data/nyc_weather_2018.csv')
weather

Unnamed: 0,attributes,datatype,date,station,value
0,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1CTFR0039,0.0
1,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1NJBG0015,0.0
2,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NJBG0015,0.0
3,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1NJBG0017,0.0
4,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NJBG0017,0.0
...,...,...,...,...,...
80251,",,W,",WDF5,2018-12-31T00:00:00,GHCND:USW00094789,130.0
80252,",,W,",WSF2,2018-12-31T00:00:00,GHCND:USW00094789,9.8
80253,",,W,",WSF5,2018-12-31T00:00:00,GHCND:USW00094789,12.5
80254,",,W,",WT01,2018-12-31T00:00:00,GHCND:USW00094789,1.0


Take note of the `datatype` column. Previously we had seen only temperature-related data; now we see a few more options.

In [5]:
weather.datatype.unique()

array(['PRCP', 'SNOW', 'SNWD', 'WESF', 'WESD', 'TMAX', 'TMIN', 'TOBS',
       'AWND', 'TAVG', 'WDF2', 'WDF5', 'WSF2', 'WSF5', 'PGTM', 'DAPR',
       'MDPR', 'WT01', 'WT02', 'WT09', 'WT08', 'WT04', 'WT06', 'WT03',
       'WT11', 'WT05', 'TSUN'], dtype=object)

We could use the same method discussed earlier and build a boolean mask to filter the rows, as follows:

In [6]:
snow_data_mask = weather[
    (weather.datatype == 'SNOW') & (weather.value > 0)
]
snow_data_mask.head()

Unnamed: 0,attributes,datatype,date,station,value
124,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NYWC0019,25.0
723,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0015,229.0
726,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0017,10.0
730,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0018,46.0
737,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJES0018,10.0


However, pandas also provides a method called `query`, which lets us write our filters in a format closer to what is commonly seen in SQL, or even Python boolean logic.

In [7]:
snow_data = weather.query('datatype == "SNOW" and value > 0')
snow_data.head()

Unnamed: 0,attributes,datatype,date,station,value
124,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NYWC0019,25.0
723,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0015,229.0
726,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0017,10.0
730,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJBG0018,46.0
737,",,N,",SNOW,2018-01-04T00:00:00,GHCND:US1NJES0018,10.0


We can even demonstrate that both of these methods return exactly the same dataframe:

In [9]:
snow_data_mask.equals(snow_data)

True

The method that you choose largely depends on preference. There can be differences in speed and complexity between the two, but those are well beyond the scope of this course.
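One practical difference is how each style refers to ordinary Python variables. The sketch below uses a small made-up dataframe (`toy_weather` and `min_snow` are illustrative names, not part of the lesson's data) to show that `query` can reference a local variable by prefixing it with `@`, and that the result matches the boolean-mask version.

```python
import pandas as pd

# Hypothetical miniature of the weather data, for illustration only
toy_weather = pd.DataFrame({
    "datatype": ["SNOW", "SNOW", "PRCP", "SNOW"],
    "station": ["A", "B", "A", "C"],
    "value": [25.0, 0.0, 3.1, 229.0],
})

min_snow = 20  # local Python variable, referenced inside the query with "@"

# Boolean-mask version
heavy_snow_mask = toy_weather[
    (toy_weather.datatype == "SNOW") & (toy_weather.value > min_snow)
]

# Equivalent query version
heavy_snow_query = toy_weather.query('datatype == "SNOW" and value > @min_snow')

print(heavy_snow_query)
print(heavy_snow_mask.equals(heavy_snow_query))
```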

## Merging DataFrames

Another crucial part of data analysis is combining datasets to create a more complete understanding of the data. There are two types of merges that we typically talk about. Using the vernacular common in SQL:
- `join`: A join combines the columns of one dataset with the columns of another dataset based on some condition on the row. If the condition is met, the columns from dataset B are attached to dataset A.
- `union`: A union can be thought of as appending two datasets that share the same columns.

Join is the more complex of the two, so we'll mostly talk about that.
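Since unions are not covered further, here is a minimal sketch of one using `pd.concat`. The two small dataframes (`january` and `february`) are made up for illustration; the only requirement is that they share the same columns.

```python
import pandas as pd

# Two hypothetical tables that share the same columns
january = pd.DataFrame({"station": ["A", "B"], "value": [0.0, 2.5]})
february = pd.DataFrame({"station": ["A", "C"], "value": [1.2, 0.0]})

# Stack the rows of both tables; ignore_index=True renumbers the rows 0..n-1
combined = pd.concat([january, february], ignore_index=True)
print(combined)
```

Note that `pd.concat` keeps duplicate rows, which is closer to SQL's `UNION ALL`. Calling `drop_duplicates()` on the result would behave more like a plain `UNION`.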

### Joining Datasets
So far in this class, we have only worked with a single dataset. Joins provide us the ability to take two separate tables or dataframes with related information, and combine them into a single table. For the weather data we've been using, we might perform a join to attach a physical location to the weather measurements using the weather station's id to gain a better idea of how the weather patterns are distributed.

The most common types of joins are described below using Venn diagrams.

![joins diagram](Assets/joins.jpg "Joins Diagram")

Think of the circles as the complete set of rows for each dataframe and the shaded region as the rows that are returned as a result of the join. Described briefly, 
- Inner joins return only the rows that are present in both dataframes.
- Left (right) joins return all of the rows from the left (right) dataframe. Wherever no matching ID was found in the right (left) table, the columns from that table are left null or missing (depending on your language).
- Full or outer joins return all rows from both tables, leaving missing values in the columns on either side for IDs that are missing from the other table (think of the left and right joins happening at the same time).
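The four join types above can be compared side by side on a toy example. This is only a sketch: `readings` and `locations` are hypothetical tables, built so that station "C" has readings but no location and station "D" has a location but no readings.

```python
import pandas as pd

# Hypothetical toy tables: "C" appears only on the left, "D" only on the right
readings = pd.DataFrame({"station": ["A", "B", "C"], "value": [1.0, 2.0, 3.0]})
locations = pd.DataFrame({"station": ["A", "B", "D"], "name": ["Alpha", "Beta", "Delta"]})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    joined = readings.merge(locations, on="station", how=how)
    row_counts[how] = len(joined)
    print(f"{how:>5}: {len(joined)} rows, stations = {sorted(joined.station)}")
```

The inner join keeps only "A" and "B", the left join adds "C" with a missing `name`, the right join adds "D" with a missing `value`, and the outer join keeps all four stations.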

Since it can be easier to understand something by doing, let's join some information about the weather stations to the rows of observations from above.

In [10]:
# reading in the weather station data
stations = pd.read_csv("https://raw.githubusercontent.com/stefmolin/Hands-On-Data-Analysis-with-Pandas/master/ch_04/data/weather_stations.csv")
stations.head()

Unnamed: 0,id,name,latitude,longitude,elevation
0,GHCND:US1CTFR0022,"STAMFORD 2.6 SSW, CT US",41.0641,-73.577,36.6
1,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4
2,GHCND:US1NJBG0001,"BERGENFIELD 0.3 SW, NJ US",40.9213,-74.002,20.1
3,GHCND:US1NJBG0002,"SADDLE BROOK TWP 0.6 E, NJ US",40.9027,-74.0834,16.8
4,GHCND:US1NJBG0003,"TENAFLY 1.3 W, NJ US",40.9147,-73.9775,21.6


The first thing we need to do to join two dataframes is determine the columns that we can join on. Note the column called `id`. This value is designed as a unique identifier for each row. If we look at the `weather` dataframe we imported above, we find that the column `station` contains data that seems to match `stations.id`.

In [12]:
weather.head()

Unnamed: 0,attributes,datatype,date,station,value
0,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1CTFR0039,0.0
1,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1NJBG0015,0.0
2,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NJBG0015,0.0
3,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1NJBG0017,0.0
4,",,N,",SNOW,2018-01-01T00:00:00,GHCND:US1NJBG0017,0.0


To determine how to join, we can take a look at the shape of each dataframe. It is especially important to check the number of unique values in the column we want to join on. We can check that below:

In [13]:
print(f"Weather: {weather.shape}")
print(f"Stations: {stations.shape}")
print(f"Unique stations from Weather: {weather.station.unique().shape}")
print(f"Unique stations from Stations: {stations.id.unique().shape}")

Weather: (80256, 5)
Stations: (262, 5)
Unique stations from Weather: (109,)
Unique stations from Stations: (262,)


Here we can see that `weather` has more rows than `stations`. In order to preserve the actual data, we want to make sure that we start from `weather` and attach the matching information from `stations` to it. This is due to the way relational databases store their information, which unfortunately is beyond the scope of this course. We do see that we will lose data on some of the stations (note that `stations.id` has more unique values than `weather.station`); however, this is acceptable since our primary focus is the weather data, not the station data. We could go a step further and double check that all of the station IDs in `weather` are also present in `stations`, but I'm not that concerned about that right now.
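For reference, the double check mentioned above could be sketched with `isin`. The example uses small stand-in dataframes (`toy_weather` and `toy_stations` are hypothetical) rather than the real data, so it runs on its own.

```python
import pandas as pd

# Hypothetical stand-ins for the `weather` and `stations` dataframes
toy_weather = pd.DataFrame({"station": ["A", "A", "B"], "value": [0.0, 1.5, 2.0]})
toy_stations = pd.DataFrame({"id": ["A", "B", "C"], "name": ["Alpha", "Beta", "Gamma"]})

# True only if every station ID in the observations has a row in the station table
all_known = toy_weather.station.isin(toy_stations.id).all()

# Station IDs that have metadata but no observations
unused_ids = set(toy_stations.id) - set(toy_weather.station)

print(all_known)
print(unused_ids)
```

The same two expressions should work on the real data by swapping in `weather` and `stations`.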

All of the joins we are interested in for this course can be executed using the `merge` method and specifying keyword arguments, although there are other methods that could be called instead.

Inner join is the default join type of `merge`. The most important argument is the second dataframe to be merged. On top of that you may need to specify the names of the columns to be checked.

In [21]:
inner_joined_tables = weather.merge(stations, left_on='station', right_on='id')
inner_joined_tables.head()

Unnamed: 0,attributes,datatype,date,station,value,id,name,latitude,longitude,elevation
0,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1CTFR0039,0.0,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4
1,",,N,",PRCP,2018-01-02T00:00:00,GHCND:US1CTFR0039,0.0,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4
2,",,N,",PRCP,2018-01-03T00:00:00,GHCND:US1CTFR0039,0.0,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4
3,",,N,",DAPR,2018-01-05T00:00:00,GHCND:US1CTFR0039,2.0,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4
4,",,N,",MDPR,2018-01-05T00:00:00,GHCND:US1CTFR0039,15.5,GHCND:US1CTFR0039,"STAMFORD 4.2 S, CT US",41.0378,-73.5682,6.4


Note that we now have all of the weather data from before (datatype, date, value, etc.) on the left side of the table, and all of the location data (latitude, longitude, and elevation) on the right. Unfortunately, we do have two columns with identical information, but this can be solved easily by dropping one of the columns after the fact or by renaming one of the columns beforehand. If the joining columns have the same name, pandas includes only one column with that name.

In [15]:
weather.merge(stations.rename({'id':'station'}, axis='columns').drop('elevation', axis='columns'), on='station').head()

Unnamed: 0,attributes,datatype,date,station,value,name,latitude,longitude
0,",,N,",PRCP,2018-01-01T00:00:00,GHCND:US1CTFR0039,0.0,"STAMFORD 4.2 S, CT US",41.0378,-73.5682
1,",,N,",PRCP,2018-01-02T00:00:00,GHCND:US1CTFR0039,0.0,"STAMFORD 4.2 S, CT US",41.0378,-73.5682
2,",,N,",PRCP,2018-01-03T00:00:00,GHCND:US1CTFR0039,0.0,"STAMFORD 4.2 S, CT US",41.0378,-73.5682
3,",,N,",DAPR,2018-01-05T00:00:00,GHCND:US1CTFR0039,2.0,"STAMFORD 4.2 S, CT US",41.0378,-73.5682
4,",,N,",MDPR,2018-01-05T00:00:00,GHCND:US1CTFR0039,15.5,"STAMFORD 4.2 S, CT US",41.0378,-73.5682
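
The cell above takes the renaming route. The other option, dropping the redundant key column after the merge, might look like the following sketch. It uses small hypothetical stand-ins (`toy_weather` and `toy_stations`) instead of the real dataframes.

```python
import pandas as pd

# Hypothetical stand-ins for `weather` and `stations`
toy_weather = pd.DataFrame({"station": ["A", "B"], "value": [0.0, 2.0]})
toy_stations = pd.DataFrame({"id": ["A", "B"], "name": ["Alpha", "Beta"]})

# Merge on differently named key columns, then drop the redundant right-hand key
merged = toy_weather.merge(
    toy_stations, left_on="station", right_on="id"
).drop(columns="id")

print(merged.columns.tolist())
```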


Note that with this inner join, there are no missing key values on either side of the table (at least none beyond data that was already bad or missing in the source).

In [19]:
inner_joined_tables.query('id.isna() or station.isna()').shape

(0, 10)

This is where an inner join differs from right, left, and outer joins. With these types of joins, we allow for the possibility of missing data. Let's attempt a left join. 

In [23]:
left_joined_tables = weather.merge(stations, left_on='station', right_on='id', how='left')
left_joined_tables.query("id.isna() or station.isna()").shape

(0, 10)

In this particular case, we don't have any missing data in `station` or `id`. This tells us that all station IDs found in `weather` are part of the `stations` dataframe. However, remember that `stations` has more unique values in its `id` column than `weather.station`. If we perform a right join, we know for sure that we will get some missing weather data, because `weather.station` does not have all the values that `stations.id` has.

In [24]:
right_joined_tables = weather.merge(stations, left_on='station', right_on='id', how='right')
right_joined_tables.query("id.isna() or station.isna()")

Unnamed: 0,attributes,datatype,date,station,value,id,name,latitude,longitude,elevation
0,,,,,,GHCND:US1CTFR0022,"STAMFORD 2.6 SSW, CT US",41.06410,-73.577000,36.6
344,,,,,,GHCND:US1NJBG0001,"BERGENFIELD 0.3 SW, NJ US",40.92130,-74.002000,20.1
345,,,,,,GHCND:US1NJBG0002,"SADDLE BROOK TWP 0.6 E, NJ US",40.90270,-74.083400,16.8
718,,,,,,GHCND:US1NJBG0005,"WESTWOOD 0.8 ESE, NJ US",40.98300,-74.015900,15.8
719,,,,,,GHCND:US1NJBG0006,"RAMSEY 0.6 E, NJ US",41.05860,-74.134100,112.2
...,...,...,...,...,...,...,...,...,...,...
50877,,,,,,GHCND:USC00309400,"WHITE PLAINS MAPLE M, NY US",41.01667,-73.733330,45.7
50878,,,,,,GHCND:USC00309466,WILLETS POINT,40.80000,-73.766667,16.8
50879,,,,,,GHCND:USC00309576,"WOODLANDS ARDSLEY, NY US",41.01667,-73.850000,42.7
50880,,,,,,GHCND:USW00014708,"HEMPSTEAD MITCHELL FIELD AFB, NY US",40.73333,-73.600000,38.1


Notice that we now have a set of rows for which the weather station data is present, but the weather columns contain missing values. This tells us that some of the weather stations present in the `stations` dataframe are *not* present in the `weather` dataframe. Doing a quick bit of math, we can confirm that the number of rows with missing weather data matches what we expected. Performing an outer join is essentially the same as a left and a right join together, with the above behavior expected for both the left and right tables.

In [34]:
len(stations.id.unique()) - len(weather.station.unique())

153

The only difference between an outer join and a right or left join is that we are essentially doing both a left and a right join at the same time. We would get all rows where `weather.station` had a matching value in `stations.id`, all rows where `weather.station` had a value that was not present in `stations.id` (in this case that number is 0), and all rows where `stations.id` has a value not present in `weather.station`.
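To see those three groups of rows explicitly, `merge` accepts `indicator=True`, which adds a `_merge` column labeling each row as `both`, `left_only`, or `right_only`. The sketch below uses hypothetical toy tables (`toy_weather` and `toy_stations`) in which, as in the real data, every observation has a station but one station has no observations.

```python
import pandas as pd

# Hypothetical toy tables: station "C" has no observations
toy_weather = pd.DataFrame({"station": ["A", "A", "B"], "value": [0.0, 1.5, 2.0]})
toy_stations = pd.DataFrame({"id": ["A", "B", "C"], "name": ["Alpha", "Beta", "Gamma"]})

outer_joined = toy_weather.merge(
    toy_stations, left_on="station", right_on="id", how="outer", indicator=True
)

# Count how many rows came from both tables, only the left, or only the right
merge_counts = outer_joined["_merge"].value_counts()
print(merge_counts)
```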

## Aggregation
Another important function of data analysis is aggregation. This refers to the process of taking the data and rolling it up into a single value or set of values instead of looking at the individual measurements. We might want the sum, average, or maximum value of the dataset. We can use fairly simple function calls that we have actually used to a degree before. I will use pivot tables to combine the different aggregations together in our weather data example.

In [35]:
weather.date = pd.to_datetime(weather.date)
weather_pivot = weather.set_index(['date', 'station'])\
    .pivot(columns='datatype', values='value')[['PRCP', 'SNOW', 'TAVG', 'TMAX', 'TMIN']]
weather_pivot.query('station == "GHCND:USW00094728"')[['PRCP', 'SNOW']].sum()


datatype
PRCP    1665.3
SNOW    1007.0
dtype: float64

In the above example, we pivot the dataframe so that each datatype is its own column, filter down to the values from one specific station, and select only the columns for which the aggregation makes sense. In doing this, we have computed the total amount of rain and snowfall for the entire measurement period.

The above gives us the totals over the entire dataset. We could apply the same logic to calculate things like the average per day, or to find the days with the most or the least. The method calls are fairly intuitive and can be quickly found in the pandas documentation. A functionality very commonly used with these aggregation methods is grouping (via `groupby`), which partitions the data and performs the aggregation function on each group.
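Before moving on to grouping, here is a brief sketch of the other aggregations mentioned above: the average per day and the day with the most rainfall. The `rain` series is made-up data for a single hypothetical station, not values from the lesson's dataset.

```python
import pandas as pd

# Hypothetical daily rainfall for a single station
rain = pd.Series(
    [0.0, 19.3, 0.0, 4.1],
    index=pd.to_datetime(["2018-01-03", "2018-01-04", "2018-01-05", "2018-01-06"]),
    name="PRCP",
)

print(rain.mean())    # average per day
print(rain.max())     # largest daily value
print(rain.idxmax())  # date on which the largest value occurred

# Several aggregations at once
summary = rain.agg(["sum", "mean", "min", "max"])
print(summary)
```

The grouping cell that follows applies this same kind of aggregation once per station.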

In [36]:
weather_pivot.groupby("station")[['PRCP', 'SNOW']].sum()

datatype,PRCP,SNOW
station,Unnamed: 1_level_1,Unnamed: 2_level_1
GHCND:US1CTFR0039,1424.8,543.0
GHCND:US1NJBG0003,1590.7,1046.0
GHCND:US1NJBG0010,1233.2,0.0
GHCND:US1NJBG0015,1539.0,770.0
GHCND:US1NJBG0017,1447.8,1019.0
...,...,...
GHCND:USW00054787,1291.5,0.0
GHCND:USW00094728,1665.3,1007.0
GHCND:USW00094741,1611.3,1192.0
GHCND:USW00094745,1512.4,0.0


We could find the same result as in our first example by grouping first, then filtering at the end.

In [38]:
weather_pivot.groupby("station")[['PRCP', 'SNOW']].sum().query('station == "GHCND:USW00094728"')

datatype,PRCP,SNOW
station,Unnamed: 1_level_1,Unnamed: 2_level_1
GHCND:USW00094728,1665.3,1007.0


Instead of grouping by the station, we could group by the date to find daily statistics, such as the minimum of each column on every day across all stations.

In [39]:
weather_pivot.groupby('date').min()

datatype,PRCP,SNOW,TAVG,TMAX,TMIN
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-01-01,0.0,0.0,-11.3,-10.6,-22.2
2018-01-02,0.0,0.0,-7.8,-10.0,-20.0
2018-01-03,0.0,0.0,-6.8,-10.6,-16.1
2018-01-04,0.0,0.0,-4.0,-3.8,-15.6
2018-01-05,0.0,0.0,-9.2,-8.8,-17.2
...,...,...,...,...,...
2018-12-27,0.0,0.0,4.2,4.4,-2.8
2018-12-28,0.0,0.0,8.0,6.1,-2.8
2018-12-29,0.0,0.0,10.2,12.2,1.1
2018-12-30,0.0,0.0,3.0,1.7,-2.1


Additionally, you can group by multiple columns. It will create a set of nested groups that can be aggregated together for further separation of the metrics being calculated. 
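A minimal sketch of grouping by multiple columns is shown below. The `obs` dataframe is a small hypothetical table in the same long format as `weather`; passing a list of column names to `groupby` produces one group per combination of values.

```python
import pandas as pd

# Hypothetical long-format observations
obs = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B"],
    "datatype": ["PRCP", "PRCP", "SNOW", "PRCP", "SNOW"],
    "value": [1.0, 2.0, 10.0, 0.5, 0.0],
})

# One total per (station, datatype) pair; the result has a two-level index
totals = obs.groupby(["station", "datatype"])["value"].sum()
print(totals)

# Pick out a single nested group with a tuple
print(totals.loc[("A", "PRCP")])
```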

There is also a pandas class called `Grouper` that allows for even more complex grouping functionality, but we will not go into that class in this lesson.

## Window Functions

Window functions are a very interesting and useful tool that can provide additional insight. Sometimes we want to know the maximum or the average of an entire column, but it can be just as interesting to know the rolling average of the data. Window functions allow us to perform calculations on a group of rows that are close to each other in some way. Using the weather data, we can calculate the average rainfall over the past week. With the `rolling` method, we can include a predefined number of values in each calculation. In this example, we compute the rolling average of rainfall over the last 7 days at a specific station.

In [43]:
rainfall = weather.query("datatype == 'PRCP' and station == 'GHCND:USW00094728'")\
    .set_index("date")\
    .assign(rolling_average=lambda x: x.value.rolling(7).mean())

In [44]:
rainfall

Unnamed: 0_level_0,attributes,datatype,station,value,rolling_average
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2018-01-01,",,W,2400",PRCP,GHCND:USW00094728,0.0,
2018-01-02,",,W,2400",PRCP,GHCND:USW00094728,0.0,
2018-01-03,",,W,2400",PRCP,GHCND:USW00094728,0.0,
2018-01-04,",,W,2400",PRCP,GHCND:USW00094728,19.3,
2018-01-05,",,W,2400",PRCP,GHCND:USW00094728,0.0,
...,...,...,...,...,...
2018-12-27,",,W,2400",PRCP,GHCND:USW00094728,0.0,6.342857
2018-12-28,",,W,2400",PRCP,GHCND:USW00094728,29.2,4.671429
2018-12-29,",,W,2400",PRCP,GHCND:USW00094728,0.0,4.600000
2018-12-30,"T,,W,2400",PRCP,GHCND:USW00094728,0.0,4.600000


Note that the first six rows do not have a value, because `rolling(7)` needs seven observations before it can compute an average.
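If those leading missing values are a problem, `rolling` accepts a `min_periods` argument that lowers the number of observations required. The sketch below compares the two behaviors on ten days of made-up rainfall (the `rain` series is hypothetical).

```python
import pandas as pd

# Hypothetical ten days of rainfall
days = pd.date_range("2018-01-01", periods=10, freq="D")
rain = pd.Series([0.0, 0.0, 0.0, 19.3, 0.0, 0.0, 0.0, 7.0, 0.0, 0.0], index=days)

strict = rain.rolling(7).mean()                  # missing until 7 values are available
lenient = rain.rolling(7, min_periods=1).mean()  # averages whatever is available so far

print(strict.isna().sum())   # number of missing values with the default behavior
print(lenient.isna().sum())  # number of missing values with min_periods=1
```

With `min_periods=1`, the early values are averages over fewer than seven days, so they are not directly comparable to the later ones.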