# <span style="color:darkblue"> Lecture 13 - Aggregating Data </span>

<font size = "5">

In the previous class we covered

- Missing values
- The basics of data cleaning

In this class we will talk about
- Computing aggregate statistics by group
- Introduction to merging

# <span style="color:darkblue"> I. Import Libraries and Data </span>


<font size = "5">
Key libraries

In [102]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

<font size = "5">

Read dataset on Formula One race results

- https://en.wikipedia.org/wiki/Formula_One <br>
- [See Data Source](https://www.kaggle.com/datasets/rohanrao/formula-1-world-championship-1950-2020)

In [103]:
results = pd.read_csv("data_raw/results.csv")

<font size = "5">

The dataset "codebook" is a table with ...

- Key column information
- Main things:  Field, Type, Key, and Description

<img src="figures/codebook_races.png" alt="drawing" width="600"/>


In [104]:
# The codebook contains basic information about the columns
# "Field" is the name of the column
# "Type"  is the variable type:
#         integer (int)
#         string (varchar - "variable character")
#         float (float)
#         The number in parentheses is the maximum number of characters/digits
#         For most purposes we can ignore the numbers in parentheses.
# "Key" denotes whether this is the primary key "PRI" (also known as the identifier)
#         This is a column with unique values, that uniquely identifies each row
# "Description" contains a label with the content of the variable

<font size = "5">

Get column names + types

- Do types match the codebook?
- If not the data may need to be cleaned

In [105]:
# This code displays column types
# "int" or "float" columns are numeric
# "object" typically denotes strings
# If a column that is supposed to be numeric appears as "object",
# then it needs to be cleaned and converted to numeric

results.dtypes


resultId             int64
raceId               int64
driverId             int64
constructorId        int64
number              object
grid                 int64
position            object
positionText        object
positionOrder        int64
points             float64
laps                 int64
time                object
milliseconds        object
fastestLap          object
rank                object
fastestLapTime      object
fastestLapSpeed     object
statusId             int64
dtype: object
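
As a minimal sketch of the cleaning step mentioned above, the following uses a small made-up table (not the real `results` data) in which a missing value is stored as the text `\N`. It assumes that `pd.to_numeric` with `errors = "coerce"` is an acceptable way to turn unparseable entries into missing values; whether that is appropriate depends on the dataset. The names `toy` and `position_numeric` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: "\\N" marks a missing value, so pandas stores
# the whole column as text rather than as numbers
toy = pd.DataFrame({"driverId": [1, 2, 3],
                    "position": ["1", "\\N", "3"]})
print(toy.dtypes)

# errors = "coerce" replaces entries that cannot be parsed with NaN
toy["position_numeric"] = pd.to_numeric(toy["position"],
                                        errors = "coerce")
print(toy.dtypes)
```

After the conversion the new column is numeric, and the entry that held `\N` shows up as a missing value.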

<font size = "5">

Try it yourself!

- How many rows does the dataset have?
- How many unique values are there for the columns <br>
$\qquad$ "resultId"? <br>
$\qquad$ "raceId"? <br>
$\qquad$ "driverId"? <br>

HINT: Use the "len()" and the "pd.unique()" functions

In [106]:
# Write your own code here
#len(results)

print(results['resultId'].unique())
print(len(results['raceId'].unique()))

[    1     2     3 ... 25843 25844 25845]
1079
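
Note that printing `.unique()` on its own shows the unique values, while wrapping it in `len()` gives how many there are. The sketch below illustrates the difference on a small made-up table; `toy` and its values are hypothetical, and `.nunique()` is shown only as an alternative that should return the same count.

```python
import pandas as pd

# Hypothetical toy data: five results spread over three races
toy = pd.DataFrame({"resultId": [1, 2, 3, 4, 5],
                    "raceId":   [10, 10, 11, 11, 12]})

# Number of rows in the dataset
n_rows = len(toy)

# Number of unique values: wrap pd.unique() in len()
n_unique_results = len(pd.unique(toy["resultId"]))
n_unique_races   = len(pd.unique(toy["raceId"]))

# ".nunique()" is a shortcut for the same count
n_unique_races_alt = toy["raceId"].nunique()

print(n_rows, n_unique_results, n_unique_races, n_unique_races_alt)
```

If the number of unique values of a column equals the number of rows, that column identifies each row uniquely, which is what the codebook's primary key describes.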


# <span style="color:darkblue"> II. Aggregate + groupby  </span>



<font size = "5">

Splitting code into multiple lines

- Makes it easier to read
- Simply wrap the code in round parentheses "()"

In [107]:
# The following code computes descriptive statistics for points 
# Wrapping the code in parentheses "()" allows you to split it into multiple 
# lines. It's considered good practice to make each line less than 80 characters
# This makes it easier to scroll up and down without going sideways.

descriptives_singleline = results["points"].describe()
descriptives_multiline = (results["points"]
                          .describe())
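
The benefit of the parentheses becomes clearer once several steps are chained. The following sketch, on a hypothetical `toy` table, checks that the line breaks do not change the result and then splits a longer chain across lines.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"points": [10.0, 4.0, 0.0, 6.0, 8.0]})

single = toy["points"].describe()
multi  = (toy["points"]
          .describe())

# Line breaks inside parentheses do not change the result
print(single.equals(multi))

# A longer chain: sort in descending order, then keep the top two values
top_two = (toy["points"]
           .sort_values(ascending = False)
           .head(2))
print(top_two)
```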

<font size = "5">

Aggregate statistics 

In [108]:
# The ".agg()" method computes aggregate statistics
# The syntax is new_name = ("column_name","function_name")
# The first element is the column name
# The second element is the function name
# The command works with single quotations '...' or double "..."

results_agg = results.agg(mean_points = ('points','mean'),
                          sd_points =   ('points','std'),
                          min_points =  ('points','min'),
                          max_points =  ('points','max'),
                          count_obs   = ('points',len))

display(results_agg)

| | points |
|---|---|
| mean_points | 1.877053 |
| sd_points | 4.169849 |
| min_points | 0.0 |
| max_points | 50.0 |
| count_obs | 25840.0 |
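
To see what each `("column_name", "function_name")` pair does, the sketch below runs `.agg()` on a small made-up column and compares it with calling the methods directly. The table `toy` is hypothetical. Using `"count"` here is a substitution for `len`: `"count"` only counts non-missing values, whereas `len` counts every row.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value
toy = pd.DataFrame({"points": [10.0, 4.0, np.nan, 1.0]})

toy_agg = toy.agg(mean_points = ("points", "mean"),
                  max_points  = ("points", "max"),
                  count_obs   = ("points", "count"))
print(toy_agg)

# The same statistics computed one at a time
# (missing values are skipped by default)
print(toy["points"].mean())
print(toy["points"].max())
print(toy["points"].count())

# len() counts all rows, including the one with a missing value
print(len(toy))
```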


<font size = "5" >

Groupby + Aggregate statistics

<img src="figures/agg.png" alt="drawing" width="400"/>

In [109]:
# In this case, each driver competes in multiple car races
# We can compute the aggregate statistics for each specific driver across
# multiple car races

drivers_agg = (results.groupby("driverId")
                      .agg(mean_points = ('points','mean'),
                           sd_points =   ('points','std'),
                           min_points =  ('points','min'),
                           max_points =  ('points','max'),
                           count_obs   = ('points',len)))

drivers_agg

| driverId | mean_points | sd_points | min_points | max_points | count_obs |
|---|---|---|---|---|---|
| 1 | 14.182258 | 9.224098 | 0.0 | 50.0 | 310 |
| 2 | 1.407609 | 2.372923 | 0.0 | 15.0 | 184 |
| 3 | 7.740291 | 8.672456 | 0.0 | 25.0 | 206 |
| 4 | 5.756983 | 6.330721 | 0.0 | 25.0 | 358 |
| 5 | 0.937500 | 1.969503 | 0.0 | 10.0 | 112 |
| ... | ... | ... | ... | ... | ... |
| 852 | 1.000000 | 2.477808 | 0.0 | 12.0 | 44 |
| 853 | 0.000000 | 0.000000 | 0.0 | 0.0 | 22 |
| 854 | 0.272727 | 1.335798 | 0.0 | 8.0 | 44 |
| 855 | 0.272727 | 0.882735 | 0.0 | 4.0 | 22 |


In [110]:
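# Display the "points" of the driver with "driverId" equal to 1
# These are the values summarized in the first row of "drivers_agg"
results.loc[results['driverId']==1, ['points']]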

| | points |
|---|---|
| 0 | 10.0 |
| 26 | 4.0 |
| 56 | 0.0 |
| 68 | 6.0 |
| 89 | 8.0 |
| ... | ... |
| 25744 | 10.0 |
| 25761 | 18.0 |
| 25781 | 18.0 |
| 25801 | 18.0 |
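
One way to convince yourself that `.groupby().agg()` does what the figure describes is to recompute one group's statistic by hand. The sketch below does this on a hypothetical `toy` table with two drivers; it is an illustration, not a check on the real data.

```python
import pandas as pd

# Hypothetical toy data: driver 1 has three races, driver 2 has two
toy = pd.DataFrame({"driverId": [1, 1, 1, 2, 2],
                    "points":   [10.0, 4.0, 1.0, 6.0, 8.0]})

toy_agg = (toy.groupby("driverId")
              .agg(mean_points = ("points", "mean"),
                   count_obs   = ("points", "count")))

# Manual check for driver 1: subset the rows, then take the mean
manual_mean = toy.loc[toy["driverId"] == 1, "points"].mean()

print(toy_agg)
print(manual_mean)
```

The row of `toy_agg` for driver 1 should match the manual calculation on the subset.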


<font size = "5" >
Groupby + Aggregate statistics (multigroup)

In [111]:
# We can compute aggregate statistics by multiple grouping columns by
# entering a list of column names in "groupby"
# In this case "constructor" denotes the team
# The following computes aggregate point stats for each (team, race) combination

teamrace_agg = (results.groupby(["raceId","constructorId"])
                       .agg(mean_points = ('points','mean'),
                            sd_points =   ('points','std'),
                            min_points =  ('points','min'),
                            max_points =  ('points','max'),
                            count_obs   = ('points',len)))

len(teamrace_agg)

12568
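
The length of a multigroup result is the number of distinct key combinations that actually occur in the data. The following sketch checks this on a hypothetical `toy` table; `toy_teamrace` and `n_pairs` are illustrative names.

```python
import pandas as pd

# Hypothetical toy data: two races, three teams
toy = pd.DataFrame({"raceId":        [1, 1, 1, 1, 2, 2],
                    "constructorId": [7, 7, 8, 8, 7, 9],
                    "points":        [10.0, 8.0, 6.0, 0.0, 10.0, 4.0]})

toy_teamrace = (toy.groupby(["raceId", "constructorId"])
                   .agg(mean_points = ("points", "mean")))

# One row per distinct (raceId, constructorId) pair found in the data
n_pairs = len(toy[["raceId", "constructorId"]].drop_duplicates())

print(len(toy_teamrace), n_pairs)

# Rows of the result are indexed by the pair of grouping values
print(toy_teamrace.loc[(1, 7), "mean_points"])
```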

<font size = "5">

Filtering + Grouping + Aggregating: <br>

```python 
.query().groupby().agg()
```

- This sequential syntax is known as "chaining"


In [112]:
# The following gets a subset of the data using .query()
# In this case we subset the data before computing aggregate statistics
# Note: "filtering" is the word often used for obtaining a subset

teamrace_agg = (results.query("raceId >= 500")
                       .groupby(["raceId","constructorId"])
                       .agg(mean_points = ('points','mean'),
                            sd_points =   ('points','std'),
                            min_points =  ('points','min'),
                            max_points =  ('points','max'),
                            count_obs   = ('points',len)))


In [113]:
teamrace_agg

| raceId | constructorId | mean_points | sd_points | min_points | max_points | count_obs |
|---|---|---|---|---|---|---|
| 500 | 1 | 0.0 | 0.000000 | 0.0 | 0.0 | 2 |
| 500 | 3 | 1.0 | 1.414214 | 0.0 | 2.0 | 2 |
| 500 | 4 | 4.5 | 6.363961 | 0.0 | 9.0 | 2 |
| 500 | 6 | 0.0 | 0.000000 | 0.0 | 0.0 | 2 |
| 500 | 21 | 0.5 | 0.707107 | 0.0 | 1.0 | 2 |
| ... | ... | ... | ... | ... | ... | ... |
| 1096 | 117 | 2.5 | 2.121320 | 1.0 | 4.0 | 2 |
| 1096 | 131 | 5.0 | 7.071068 | 0.0 | 10.0 | 2 |
| 1096 | 210 | 0.0 | 0.000000 | 0.0 | 0.0 | 2 |
| 1096 | 213 | 0.0 | 0.000000 | 0.0 | 0.0 | 2 |
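
In a chain, each step receives the output of the previous one, so rows removed by `.query()` never reach `.groupby()`. The sketch below shows this on a hypothetical `toy` table with three races.

```python
import pandas as pd

# Hypothetical toy data: three races
toy = pd.DataFrame({"raceId": [1, 1, 2, 2, 3],
                    "points": [10.0, 8.0, 6.0, 0.0, 4.0]})

# Step 1: keep rows with raceId >= 2
# Step 2: group the remaining rows by race
# Step 3: compute statistics within each group
toy_late = (toy.query("raceId >= 2")
               .groupby("raceId")
               .agg(mean_points = ("points", "mean"),
                    count_obs   = ("points", "count")))

print(toy_late)
```

Race 1 does not appear in `toy_late` because its rows were filtered out before grouping.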


<font size = "5">

Try it yourself!

- Create a new dataset by chaining commands that <br>
group by "raceId" and then compute the <br>
aggregate statistics: "points" average <br> and "laps" average


In [114]:
# Write your own code
newdataset = (results.groupby("raceId")
                     .agg(points_av = ('points','mean'),
                          lap_av    = ('laps','mean')))
newdataset

| raceId | points_av | lap_av |
|---|---|---|
| 1 | 1.950 | 50.25 |
| 2 | 0.975 | 27.70 |
| 3 | 1.950 | 51.15 |
| 4 | 1.950 | 56.25 |
| 5 | 1.950 | 47.10 |
| ... | ... | ... |
| 1092 | 5.050 | 25.20 |
| 1093 | 5.100 | 49.50 |
| 1094 | 5.100 | 68.90 |
| 1095 | 5.100 | 62.80 |


<font size = "5">

Try it yourself!

- Create a new dataset by chaining commands that <br>
group by "constructorId" (the team) and then <br> 
compute the average number of "points"
- Add a chain ".sort_values(...,ascending = False)" <br>
to sort by team points in descending order


In [115]:
# Write your own code


# <span style="color:darkblue"> III. Relative statistics within group </span>



<font size = "5">

Merging

<img src="figures/merge_stats.png" alt="drawing" width="600"/>


In [122]:
# This command merges the "aggregate" information in "drivers_agg" into
# "results" as shown in the figure
# The merging variable "on" is "driverId", which identifies drivers in both
# datasets (a column in "results" and the index of "drivers_agg")
# how = "left" indicates that the left dataset is the baseline
#
# Note: For this method to work well "driverId" needs to contain unique values
# in "drivers_agg". If not you may need to clean the data beforehand

results_merge = pd.merge(results,
                         drivers_agg,
                         on = "driverId",
                         how = "left").sort_values(by="grid",ascending=True)

results_merge



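| | resultId | raceId | driverId | constructorId | number | grid | position | positionText | positionOrder | points | ... | fastestLap | rank | fastestLapTime | fastestLapSpeed | statusId | mean_points | sd_points | min_points | max_points | count_obs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12919 | 12920 | 531 | 206 | 64 | 24 | 0 | \N | F | 26 | 0.0 | ... | \N | \N | \N | \N | 97 | 0.234375 | 0.684110 | 0.0 | 3.0 | 64 |
| 10935 | 10936 | 461 | 190 | 45 | 31 | 0 | \N | F | 28 | 0.0 | ... | \N | \N | \N | \N | 81 | 0.000000 | 0.000000 | 0.0 | 0.0 | 18 |
| 10936 | 10937 | 461 | 181 | 54 | 9 | 0 | \N | F | 29 | 0.0 | ... | \N | \N | \N | \N | 81 | 0.036364 | 0.269680 | 0.0 | 2.0 | 55 |
| 10963 | 10964 | 462 | 196 | 27 | 26 | 0 | \N | F | 27 | 0.0 | ... | \N | \N | \N | \N | 81 | 0.000000 | 0.000000 | 0.0 | 0.0 | 30 |
| 10964 | 10965 | 462 | 188 | 55 | 34 | 0 | \N | F | 28 | 0.0 | ... | \N | \N | \N | \N | 81 | 0.043478 | 0.208514 | 0.0 | 1.0 | 23 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19462 | 19463 | 809 | 512 | 143 | 56 | 33 | \N | R | 32 | 0.0 | ... | \N | \N | \N | \N | 80 | 1.111111 | 1.833333 | 0.0 | 5.0 | 9 |
| 20083 | 20084 | 835 | 732 | 113 | 62 | 33 | 18 | 18 | 18 | 0.0 | ... | \N | \N | \N | \N | 88 | 0.000000 | 0.000000 | 0.0 | 0.0 | 3 |
| 19249 | 19250 | 800 | 676 | 113 | 71 | 33 | 19 | 19 | 19 | 0.0 | ... | \N | \N | \N | \N | 17 | 0.000000 | NaN | 0.0 | 0.0 | 1 |
| 18929 | 18930 | 786 | 630 | 113 | 64 | 33 | 17 | 17 | 17 | 0.0 | ... | \N | \N | \N | \N | 19 | 0.000000 | 0.000000 | 0.0 | 0.0 | 4 |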
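
The note about unique values can be enforced in code. As a sketch on hypothetical toy tables, the merge below adds the `validate` argument of `pd.merge`, which is not used in the original command: with `validate = "many_to_one"` pandas raises an error if the merging key is repeated in the right-hand dataset. A left merge of this kind should also leave the number of rows unchanged.

```python
import pandas as pd

# Hypothetical toy data: driver 1 appears in two races
toy_results = pd.DataFrame({"driverId": [1, 1, 2, 3],
                            "points":   [10.0, 4.0, 6.0, 0.0]})

toy_agg = (toy_results.groupby("driverId")
                      .agg(mean_points = ("points", "mean")))

# validate = "many_to_one" raises an error if "driverId" is repeated
# in the right-hand dataset, which would otherwise duplicate rows
toy_merge = pd.merge(toy_results,
                     toy_agg,
                     on = "driverId",
                     how = "left",
                     validate = "many_to_one")

print(toy_merge)

# The merged dataset has as many rows as the left dataset
print(len(toy_merge) == len(toy_results))
```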


<font size = "5">

Check out another approach to compute <br>
aggregate statistics using ``` .transform() ```<br>
 in the optional lecture!
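
For reference, a minimal sketch of the `.transform()` idea on a hypothetical `toy` table is shown below. The optional lecture may present it differently; this only illustrates that `.transform()` returns one value per original row, so the group statistic can be stored as a new column without a separate merge.

```python
import pandas as pd

# Hypothetical toy data: driver 1 appears in two races
toy = pd.DataFrame({"driverId": [1, 1, 2, 3],
                    "points":   [10.0, 4.0, 6.0, 0.0]})

# Each row receives the mean of "points" within its own driver group
toy["mean_points"] = (toy.groupby("driverId")["points"]
                         .transform("mean"))

print(toy)
```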

<font size = "5">

Try it yourself!

- Create a scatter plot with ...
- "points" (y-axis) vs "mean_points" (x-axis)

Note: This plot tells you how much a driver's <br>
performance in individual races deviates from <br>
their overall average

In [None]:
# Write your own code


<font size = "5">

Try it yourself!

- Merge the "teamrace_agg" data into "results"
- This time use the option:

$\qquad$ ```on = ["raceId","constructorId"]```

In [None]:
# Write your own code
