In [1]:
import numpy as np
import pandas as pd

%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

<center> 

# Group, Pivot, and Merge

## Amanda R. Kube Jotte
    
<img src="https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/DSSI.png" width="800">
    
</center>

## Anybody like penguins?

In [73]:
penguins = pd.read_csv("https://raw.githubusercontent.com/amandakube/Data118LectureImages/main/penguins_size.csv")
penguins.head(10) # What is a culmen?

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE
3,Adelie,Torgersen,,,,,
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE
5,Adelie,Torgersen,39.3,20.6,190.0,3650.0,MALE
6,Adelie,Torgersen,38.9,17.8,181.0,3625.0,FEMALE
7,Adelie,Torgersen,39.2,19.6,195.0,4675.0,MALE
8,Adelie,Torgersen,34.1,18.1,193.0,3475.0,
9,Adelie,Torgersen,42.0,20.2,190.0,4250.0,


<center>
<img src="https://pbs.twimg.com/media/EaAXQn8U4AAoKUj?format=jpg&name=medium" width="400">
</center>

## Let's look more closely at the data

In [74]:
penguins.species.unique()

array(['Adelie', 'Chinstrap', 'Gentoo'], dtype=object)

<img src="https://www.bas.ac.uk/wp-content/uploads/2015/04/Penguin-heights-736x419.jpg" width="800">

In [75]:
penguins.island.unique()

array(['Torgersen', 'Biscoe', 'Dream'], dtype=object)

<img src="https://www.researchgate.net/profile/William-Fraser/publication/260557350/figure/fig6/AS:669677869076485@1536675054300/The-marine-ecosystem-west-of-the-Antarctic-Peninsula-a-extends-from-northern-Alexander_W640.jpg" width="800">

## Aggregate statistics

Often, we want to compute aggregate statistics over rows - pandas has great functionality for this!

For example, we may want to know how many of each species of penguin are in our data.

In [76]:
penguins.groupby('species').count()

Unnamed: 0_level_0,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Adelie,152,151,151,151,151,146
Chinstrap,68,68,68,68,68,68
Gentoo,124,123,123,123,123,120


Why does `sex` have different values?
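
As a hint toward the answer, here is a minimal sketch on hypothetical toy data (the `toy` frame is not part of the penguin dataset). It compares `.count()`, which counts non-missing values in each column, with `.size()`, which counts rows whether or not values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one Adelie penguin has no recorded sex
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "sex": ["MALE", np.nan, "FEMALE"],
})

non_missing = toy.groupby("species").count()  # non-missing values per column
group_sizes = toy.groupby("species").size()   # rows per group, missing or not

print(non_missing)
print(group_sizes)
```

If the two numbers differ for a group, some rows in that group are missing a value in that column.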

## Which species of penguin is bigger on average?

In [77]:
penguins.groupby('species').mean(numeric_only = True)

Unnamed: 0_level_0,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Adelie,38.791391,18.346358,189.953642,3700.662252
Chinstrap,48.833824,18.420588,195.823529,3733.088235
Gentoo,47.504878,14.982114,217.186992,5076.01626


## How many penguins live on each island?

In [78]:
penguins.groupby('island').count()

Unnamed: 0_level_0,species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
island,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Biscoe,168,167,167,167,167,164
Dream,124,124,124,124,124,123
Torgersen,52,51,51,51,51,47


What is the index now?

## So how do we select elements with this new index?

In [79]:
penguins_count = penguins.groupby('island').count()
penguins_count

Unnamed: 0_level_0,species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
island,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Biscoe,168,167,167,167,167,164
Dream,124,124,124,124,124,123
Torgersen,52,51,51,51,51,47


In [80]:
penguins_count.species["Biscoe"]

168
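
The same lookup can also be written with `.loc`, which takes a row label and then a column label. The sketch below rebuilds a small table by hand (the `toy_count` name is only for illustration) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for penguins_count: counts indexed by island
toy_count = pd.DataFrame(
    {"species": [168, 124, 52]},
    index=pd.Index(["Biscoe", "Dream", "Torgersen"], name="island"),
)

biscoe_species = toy_count.loc["Biscoe", "species"]  # row label, then column label
biscoe_row = toy_count.loc["Biscoe"]                 # the whole row as a Series

print(biscoe_species)
```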

We can use the `.reset_index()` method to make `island` a regular column and return to the default pandas index.

In [81]:
penguins.groupby('island').count().reset_index()

Unnamed: 0,island,species,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex
0,Biscoe,168,167,167,167,167,164
1,Dream,124,124,124,124,124,123
2,Torgersen,52,51,51,51,51,47


If we want a dataframe with a numeric index and a single count column...

In [82]:
# we can chain calls
penguins.groupby('island').count()[['species']].reset_index()

Unnamed: 0,island,species
0,Biscoe,168
1,Dream,124
2,Torgersen,52


In [83]:
# change the name
penguins.groupby('island').count()[['species']].reset_index().rename(columns={"species": "penguin count"})

Unnamed: 0,island,penguin count
0,Biscoe,168
1,Dream,124
2,Torgersen,52
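
A shorter route to a similar table, sketched here on hypothetical toy data, is `.size()` followed by `.reset_index(name=...)`. One difference to keep in mind: `.size()` counts every row in the group, while `.count()` on a column skips missing values.

```python
import pandas as pd

# Hypothetical toy data with one row per penguin
toy = pd.DataFrame({"island": ["Biscoe", "Dream", "Biscoe", "Torgersen"]})

# size() gives a Series of group sizes; reset_index(name=...) names the count column
island_counts = toy.groupby("island").size().reset_index(name="penguin count")

print(island_counts)
```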


We can group by multiple columns

Notice the row indices are constructed from both grouping columns

In [84]:
penguins.groupby(['species','island']).count()[['body_mass_g']].rename(columns={"body_mass_g": "penguin count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,penguin count
species,island,Unnamed: 2_level_1
Adelie,Biscoe,44
Adelie,Dream,56
Adelie,Torgersen,51
Chinstrap,Dream,68
Gentoo,Biscoe,123


In [85]:
#order matters
penguins.groupby(['island','species']).count()[['body_mass_g']].rename(columns={"body_mass_g": "penguin count"})

Unnamed: 0_level_0,Unnamed: 1_level_0,penguin count
island,species,Unnamed: 2_level_1
Biscoe,Adelie,44
Biscoe,Gentoo,123
Dream,Adelie,56
Dream,Chinstrap,68
Torgersen,Adelie,51


We would use tables like this for cross-classification - looking at patterns between variables

This format is difficult to read

Let's use a pivot_table!

In [86]:
penguins.pivot_table(index=['species'],columns=['island'],values=['body_mass_g'])

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
island,Biscoe,Dream,Torgersen
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,3709.659091,3688.392857,3706.372549
Chinstrap,,3733.088235,
Gentoo,5076.01626,,


What's going on here?

Each unique value in `index` gets its own row, and each unique value in `columns` gets its own column

`values` specifies the metric(s) to aggregate

The default aggregation function is the mean (combinations with no data show up as NaN)

`aggfunc` says how to aggregate the metrics
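
To see the four arguments at work on something small enough to check by hand, here is a sketch on hypothetical toy data (the numbers are made up, not taken from the penguin dataset).

```python
import pandas as pd

# Hypothetical toy data: two Adelie penguins on Biscoe, one on Dream, one Gentoo on Biscoe
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo"],
    "island": ["Biscoe", "Biscoe", "Dream", "Biscoe"],
    "body_mass_g": [3000.0, 4000.0, 3500.0, 5000.0],
})

# Rows come from species, columns from island, cells hold the mean body mass
mean_mass = toy.pivot_table(
    index="species", columns="island", values="body_mass_g", aggfunc="mean"
)

print(mean_mass)
```

The Adelie/Biscoe cell averages the two matching rows, and the Gentoo/Dream cell is NaN because no row has that combination.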

We can remake our earlier table by changing aggfunc

In [87]:
penguins.pivot_table(index=['species'],columns=['island'],values=['body_mass_g'], aggfunc='count')

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
island,Biscoe,Dream,Torgersen
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,44.0,56.0,51.0
Chinstrap,,68.0,
Gentoo,123.0,,


We can use the argument `fill_value` to replace those NaN values with 0 since no penguins are in those groups.

In [88]:
penguins.pivot_table(index=['species'],columns=['island'],values=['body_mass_g'], aggfunc='count', fill_value=0)

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
island,Biscoe,Dream,Torgersen
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,44,56,51
Chinstrap,0,68,0
Gentoo,123,0,0


We can change the rows and columns...

In [89]:
penguins.pivot_table(index=['island'],columns=['species'],values=['body_mass_g'], aggfunc='count', fill_value=0)

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
species,Adelie,Chinstrap,Gentoo
island,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Biscoe,44,0,123
Dream,56,68,0
Torgersen,51,0,0


An aside:

Instead of using the format `dataframe.pivot_table()` we can use `pd.pivot_table(dataframe)` to get the same result

In [90]:
pd.pivot_table(penguins, index=['island'],columns=['species'],values=['body_mass_g'], aggfunc='count', fill_value=0)

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
species,Adelie,Chinstrap,Gentoo
island,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Biscoe,44,0,123
Dream,56,68,0
Torgersen,51,0,0


For each species, what is the heaviest penguin weight? Do some islands have heavier penguins than others?

In [91]:
penguins.pivot_table(index=['species'],columns=['island'],\
               values=['body_mass_g'],aggfunc=np.max) # This backslash lets me continue my code on the next line

Unnamed: 0_level_0,body_mass_g,body_mass_g,body_mass_g
island,Biscoe,Dream,Torgersen
species,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
Adelie,4775.0,4650.0,4700.0
Chinstrap,,4800.0,
Gentoo,6300.0,,


## Your turn!

The Illinois Department of Transportation (IDOT) collects records of all traffic stops made by CPD. We've provided data on all stops made in the years 2015 - 2021.

Using the IDOT data on the Jupyter Hub, answer the following questions:

In [2]:
# Read in data here
idot = pd.read_csv('../datasets/idot_clean_abbrv.csv')
idot.head()


Unnamed: 0,year,reason,driver_race,beat,district
0,2017.0,equipment,asian,2032.0,20.0
1,2017.0,license,black,1923.0,19.0
2,2017.0,equipment,black,1831.0,18.0
3,2017.0,moving,black,724.0,7.0
4,2017.0,license,asian,1732.0,17.0


### Question 1. How many stops occurred for each race in each year?

In [4]:
# it does not matter whether you put race on the index and year on the columns or the other way around, both are ok
# since we are counting, you can count any column that has no missing values, all will give the same number of rows

idot.pivot_table(index='driver_race', columns='year', values='reason', aggfunc='count')

year,2017.0,2018.0
driver_race,Unnamed: 1_level_1,Unnamed: 2_level_1
am_indian,1058,2132
asian,7249,12433
black,172113,300954
hispanic,59557,104083
other,1287,1958
white,43799,67900


In [12]:
# or using groupby you can do

idot.groupby(['driver_race','year']).count()[['reason']].rename(columns={'reason':'stops'})

Unnamed: 0_level_0,Unnamed: 1_level_0,stops
driver_race,year,Unnamed: 2_level_1
am_indian,2017.0,1058
am_indian,2018.0,2132
asian,2017.0,7249
asian,2018.0,12433
black,2017.0,172113
black,2018.0,300954
hispanic,2017.0,59557
hispanic,2018.0,104083
other,2017.0,1287
other,2018.0,1958
white,2017.0,43799
white,2018.0,67900


### Question 2. Across all years, which racial group was stopped most often?

In [11]:
# Since we are now looking across all years and don't care which year it is, we no longer need to group by year
# Since we are only grouping by one thing, this is no longer cross-classification, so we use groupby

race_table = idot.groupby('driver_race').count()[['reason']].rename(columns={'reason':'stops'})
race_table

Unnamed: 0_level_0,stops
driver_race,Unnamed: 1_level_1
am_indian,3190
asian,19682
black,473067
hispanic,163640
other,3245
white,111699


### Question 3. Stop rates are more useful than counts. Using the table you made in Q2 and what you know about indexing dataframes, calculate the stop rate (percentage of total stops) for each racial group. 

In [14]:
race_table['stop rate'] = race_table.stops / race_table.stops.sum() * 100 # Multiplying by 100 turns proportion into percentage (you can keep it as a proportion if that is easier for you to interpret)
race_table

Unnamed: 0_level_0,stops,stop rate
driver_race,Unnamed: 1_level_1,Unnamed: 2_level_1
am_indian,3190,0.411866
asian,19682,2.541177
black,473067,61.078496
hispanic,163640,21.127843
other,3245,0.418968
white,111699,14.42165


## Are these stops racially biased?

Maybe there are just more drivers...

We need some way to compare. We need more data!

## Merge

Often, the information we need to analyze comes from different sources and is in different files, dataframes, tables, etc.

pandas has a `merge` function to do this (similar to a "join" operation in a database)

Let's try this on our penguin data - add some information on the islands


In [104]:
islands = pd.DataFrame(
    {'island': ['Torgersen', 'Biscoe', 'Dream'], 
    'Continent': ['Antarctica',np.nan,np.nan],
    'Distance from Chicago (mi)': [7473,7545,7470]
    })
islands

Unnamed: 0,island,Continent,Distance from Chicago (mi)
0,Torgersen,Antarctica,7473
1,Biscoe,,7545
2,Dream,,7470


## Let's Merge Them

In [106]:
penguins_full = pd.merge(penguins, islands)
penguins_full

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Continent,Distance from Chicago (mi)
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,Antarctica,7473
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,Antarctica,7473
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,Antarctica,7473
3,Adelie,Torgersen,,,,,,Antarctica,7473
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,Antarctica,7473
...,...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,MALE,,7470
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,FEMALE,,7470
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,MALE,,7470
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,MALE,,7470


How did it know how to merge the two?
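
One way to check the answer is to look at which column names the two DataFrames share. When no `on` argument is given, `pd.merge` joins on those shared columns. The sketch below uses hypothetical toy frames (`toy_penguins`, `toy_islands`, and the `distance_mi` column are made up for illustration).

```python
import pandas as pd

# Hypothetical toy frames that share exactly one column name: "island"
toy_penguins = pd.DataFrame({"species": ["Adelie", "Gentoo"], "island": ["Dream", "Biscoe"]})
toy_islands = pd.DataFrame({"island": ["Biscoe", "Dream"], "distance_mi": [7545, 7470]})

shared = toy_penguins.columns.intersection(toy_islands.columns)

implicit = pd.merge(toy_penguins, toy_islands)               # joins on the shared columns
explicit = pd.merge(toy_penguins, toy_islands, on="island")  # same join, spelled out

print(list(shared))
print(implicit)
```

Writing `on="island"` explicitly can make the code easier to read and protects against an accidental join on some other shared column.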

## What if the names didn't match?

In [107]:
islands = pd.DataFrame(
    {'Island': ['Torgersen', 'Biscoe', 'Dream'], 
    'Continent': ['Antarctica','Antarctica','Antarctica'],
    'Distance from Chicago (mi)': [7473,7545,7470]
    })

penguins_full = pd.merge(penguins, islands, left_on = 'island', right_on = "Island")
penguins_full

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Island,Continent,Distance from Chicago (mi)
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,Torgersen,Antarctica,7473
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,Torgersen,Antarctica,7473
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,Torgersen,Antarctica,7473
3,Adelie,Torgersen,,,,,,Torgersen,Antarctica,7473
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,Torgersen,Antarctica,7473
...,...,...,...,...,...,...,...,...,...,...
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,MALE,Dream,Antarctica,7470
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,FEMALE,Dream,Antarctica,7470
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,MALE,Dream,Antarctica,7470
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,MALE,Dream,Antarctica,7470


Now, it kept `island` and `Island` so we have a duplicate column...

In [98]:
penguins_full = penguins_full.T.drop_duplicates().T # T transposes the dataframe
penguins_full.tail(5)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Continent,Distance from Chicago (mi)
339,Chinstrap,Dream,55.8,19.8,207.0,4000.0,MALE,Antarctica,7470
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,FEMALE,Antarctica,7470
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,MALE,Antarctica,7470
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,MALE,Antarctica,7470
343,Chinstrap,Dream,50.2,18.7,198.0,3775.0,FEMALE,Antarctica,7470
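
The transpose trick works, but transposing a DataFrame with mixed column types generally turns every column into the generic `object` type. A more direct alternative, sketched here on hypothetical toy frames, is to drop the extra key column by name.

```python
import pandas as pd

# Hypothetical toy frames whose key columns differ only in capitalization
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "island": ["Dream", "Biscoe"],
    "body_mass_g": [3500.0, 5000.0],
})
toy_islands = pd.DataFrame({
    "Island": ["Biscoe", "Dream"],
    "Continent": ["Antarctica", "Antarctica"],
})

merged = pd.merge(toy_penguins, toy_islands, left_on="island", right_on="Island")
merged = merged.drop(columns="Island")  # remove the duplicate key column by name

print(merged.dtypes)
```

Dropping by name leaves the numeric columns numeric, which matters for later calls such as `.mean()`.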


## Let's look more closely at what `merge` does

Which rows of each dataframe should we put together?

What should be in the result?

Each row of a dataframe should have a way to tell it apart from the others

This is called a "key" or an "identifier", which is why pandas adds an index by default!

#### Some possible keys (how do they compare?):
- Social Security Number
- Cell phone number
- Email address
- Full name
- Retina scan
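
In pandas terms, a good key column has no repeated values in the table it identifies. The sketch below, on hypothetical toy frames, checks this with `.is_unique` and then asks `pd.merge` to verify it through the `validate` argument.

```python
import pandas as pd

# Hypothetical toy frames: each island appears once in toy_islands, many times in toy_penguins
toy_islands = pd.DataFrame({
    "island": ["Torgersen", "Biscoe", "Dream"],
    "Continent": ["Antarctica"] * 3,
})
toy_penguins = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo"],
    "island": ["Dream", "Dream", "Biscoe"],
})

key_is_unique = toy_islands["island"].is_unique  # True when no island is repeated

# validate="many_to_one" raises an error if the right-hand key has duplicates
checked = pd.merge(toy_penguins, toy_islands, on="island", validate="many_to_one")

print(key_is_unique)
print(checked)
```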

### Merge matches rows that have the same value for a chosen column

Conceptually: these are data about the same real-world entity

By default, rows from either table that don't have a match in the other table are left out of the answer.

By default, `pd.merge` joins on every column name the two DataFrames share and keeps only the rows whose key values appear in both

This is called an **inner join** and is also the **intersection** of the two DataFrames.

By default the ordering of the left DataFrame input is kept


In [119]:
islands = pd.DataFrame(
    {'island': ['Torgersen', 'Biscoe', 'Ross'], 
    'Continent': ['Antarctica','Antarctica','Antarctica'],
    'Distance from Chicago (mi)': [7473,7545,7480]
    })
penguins_full = pd.merge(penguins, islands, how="inner")
penguins_full

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Continent,Distance from Chicago (mi)
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,Antarctica,7473
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,Antarctica,7473
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,Antarctica,7473
3,Adelie,Torgersen,,,,,,Antarctica,7473
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,Antarctica,7473
...,...,...,...,...,...,...,...,...,...
215,Gentoo,Biscoe,,,,,,Antarctica,7545
216,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE,Antarctica,7545
217,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE,Antarctica,7545
218,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE,Antarctica,7545


If we want to combine DataFrames in a different way, we can change the `how` argument.

Different options to merge include: 'left', 'right', 'inner', and 'outer'.

We specify `how='left'` or `how='right'` to keep every row of the left or right DataFrame, even when it has no match in the other one.

Any information not present will be labeled as NaN

In [120]:
islands = pd.DataFrame(
    {'island': ['Torgersen', 'Biscoe', 'Ross'], 
    'Continent': ['Antarctica','Antarctica','Antarctica'],
    'Distance from Chicago (mi)': [7473,7545,7480]
    })
penguins_full = pd.merge(penguins, islands, how="left")
penguins_full.tail(200)

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Continent,Distance from Chicago (mi)
144,Adelie,Dream,37.3,16.8,192.0,3000.0,FEMALE,,
145,Adelie,Dream,39.0,18.7,185.0,3650.0,MALE,,
146,Adelie,Dream,39.2,18.6,190.0,4250.0,MALE,,
147,Adelie,Dream,36.6,18.4,184.0,3475.0,FEMALE,,
148,Adelie,Dream,36.0,17.8,195.0,3450.0,FEMALE,,
...,...,...,...,...,...,...,...,...,...
339,Gentoo,Biscoe,,,,,,Antarctica,7545.0
340,Gentoo,Biscoe,46.8,14.3,215.0,4850.0,FEMALE,Antarctica,7545.0
341,Gentoo,Biscoe,50.4,15.7,222.0,5750.0,MALE,Antarctica,7545.0
342,Gentoo,Biscoe,45.2,14.8,212.0,5200.0,FEMALE,Antarctica,7545.0


The **outer join** contains all the row entries from both DataFrames (the **union** of the DataFrames). 

In [121]:
islands = pd.DataFrame(
    {'island': ['Torgersen', 'Biscoe', 'Ross'], 
    'Continent': ['Antarctica','Antarctica','Antarctica'],
    'Distance from Chicago (mi)': [7473,7545,7480]
    })
penguins_full = pd.merge(penguins, islands, how="outer")
penguins_full

Unnamed: 0,species,island,culmen_length_mm,culmen_depth_mm,flipper_length_mm,body_mass_g,sex,Continent,Distance from Chicago (mi)
0,Adelie,Torgersen,39.1,18.7,181.0,3750.0,MALE,Antarctica,7473.0
1,Adelie,Torgersen,39.5,17.4,186.0,3800.0,FEMALE,Antarctica,7473.0
2,Adelie,Torgersen,40.3,18.0,195.0,3250.0,FEMALE,Antarctica,7473.0
3,Adelie,Torgersen,,,,,,Antarctica,7473.0
4,Adelie,Torgersen,36.7,19.3,193.0,3450.0,FEMALE,Antarctica,7473.0
...,...,...,...,...,...,...,...,...,...
340,Chinstrap,Dream,43.5,18.1,202.0,3400.0,FEMALE,,
341,Chinstrap,Dream,49.6,18.2,193.0,3775.0,MALE,,
342,Chinstrap,Dream,50.8,19.0,210.0,4100.0,MALE,,
343,Chinstrap,Dream,50.2,18.7,198.0,3775.0,FEMALE,,
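
To compare the four `how` options side by side, here is a sketch on hypothetical toy frames in which each table has one island the other lacks. It also uses the `indicator=True` argument, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

# Hypothetical toy frames: "Dream" appears only on the left, "Ross" only on the right
toy_penguins = pd.DataFrame({"species": ["Adelie", "Chinstrap"], "island": ["Torgersen", "Dream"]})
toy_islands = pd.DataFrame({"island": ["Torgersen", "Ross"], "distance_mi": [7473, 7480]})

# Number of rows produced by each kind of join
row_counts = {
    how: len(pd.merge(toy_penguins, toy_islands, how=how))
    for how in ["inner", "left", "right", "outer"]
}

# indicator=True labels each row as "both", "left_only", or "right_only"
outer = pd.merge(toy_penguins, toy_islands, how="outer", indicator=True)

print(row_counts)
print(outer)
```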


## Now, you try...

Use the population by beat data provided on Jupyter Hub.

In [15]:
# Read in data here
pop = pd.read_csv('../datasets/adjusted_population_beat.csv')
pop.head()

Unnamed: 0,beat,White,Black,Hispanic,Asian,Native,Other
0,1713,1341.069794,1865.300113,937.499902,317.076466,0.0,0.0
1,1651,0.0,0.0,0.0,0.0,0.0,0.0
2,1914,641.288133,5878.781077,1621.201006,407.487888,0.0,0.0
3,1915,1178.322329,1331.124001,1597.027253,283.175022,0.0,0.0
4,1913,739.593202,2429.896179,535.721544,121.08396,159.015646,0.0


### Question 1. Merge the population data with the idot data from earlier. (What kind of join should we use? What are we joining on?)

In [25]:
# There are a lot of ways you could do this. Here is one I did not have in mind when I wrote this question, but I think it works better and gives practice with more of what you've learned
# Create a dataframe of total population and format it nicely
pop_sum = pd.DataFrame(pop.sum()[1:]).reset_index().rename(columns={'index':'driver_race', 0:'population'})
pop_sum

Unnamed: 0,driver_race,population
0,White,298755.197477
1,Black,248709.530847
2,Hispanic,222341.224432
3,Asian,60069.768541
4,Native,2664.876752
5,Other,212.53729


In [26]:
#Change race values to match other dataset so that they will merge
race_map = {'White':'white','Black':'black','Hispanic':'hispanic','Asian':'asian','Native':'am_indian','Other':'other'}
pop_sum.driver_race = pop_sum.driver_race.map(race_map)
pop_sum

Unnamed: 0,driver_race,population
0,white,298755.197477
1,black,248709.530847
2,hispanic,222341.224432
3,asian,60069.768541
4,am_indian,2664.876752
5,other,212.53729


In [27]:
#Merge the two together but first the stops needs reindexing
race_table = race_table.reset_index()
total_table = pd.merge(race_table,pop_sum) # since all elements have a match it does not matter which type of merge you do
total_table

Unnamed: 0,driver_race,stops,stop rate,population
0,am_indian,3190,0.411866,2664.876752
1,asian,19682,2.541177,60069.768541
2,black,473067,61.078496,248709.530847
3,hispanic,163640,21.127843,222341.224432
4,other,3245,0.418968,212.53729
5,white,111699,14.42165,298755.197477


### Question 2. To compare population to the stop rates we calculated earlier, we need to have population rates. Calculate the percentage of the total population each race makes up in 2017.

In [28]:
total_table['population percentage'] = total_table.population / sum(total_table.population) * 100
total_table

Unnamed: 0,driver_race,stops,stop rate,population,population percentage
0,am_indian,3190,0.411866,2664.876752,0.320008
1,asian,19682,2.541177,60069.768541,7.213394
2,black,473067,61.078496,248709.530847,29.865937
3,hispanic,163640,21.127843,222341.224432,26.699536
4,other,3245,0.418968,212.53729,0.025522
5,white,111699,14.42165,298755.197477,35.875602


### Question 3. Looking at the above answer, is there evidence of bias?

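If there were no bias, we would expect each racial group to be pulled over by CPD at a rate similar to its share of the population. In our data, Black drivers account for about 61% of stops but only about 30% of the population, while white drivers account for about 14% of stops and about 36% of the population. This disparity is clear evidence of bias against Black drivers.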
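
One way to put a number on this comparison is to divide each group's stop rate by its population percentage. The sketch below retypes the values from the table above so that it runs on its own; the `total` name and the `ratio` column are introduced here for illustration and are not part of the original solution.

```python
import pandas as pd

# Values copied from the merged table above
total = pd.DataFrame({
    "driver_race": ["am_indian", "asian", "black", "hispanic", "other", "white"],
    "stop rate": [0.411866, 2.541177, 61.078496, 21.127843, 0.418968, 14.42165],
    "population percentage": [0.320008, 7.213394, 29.865937, 26.699536, 0.025522, 35.875602],
})

# A ratio above 1 means the group is stopped more often than its population share suggests
total["ratio"] = total["stop rate"] / total["population percentage"]

print(total[["driver_race", "ratio"]])
```

Ratios for very small groups, such as `other`, rest on tiny population estimates and should be read with caution.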

## Our repertoire is growing quickly! 

* Arithmetic Operations
* Comparisons
* Assignment Statements
* Call Expressions
* Arrays
* Lists
* DataFrames
* Groupby
* Pivot_table
* Merge

Now that we know how to summarize our data in tables, we will move to depicting these summaries in graphs and images. 

**Question to leave on:**

What are some best practices for making graphs?