## Change Granularity

#### Ways of Changing Granularity

    1. Grouping --> aggregating
        - Goes from fine-grained data to less fine-grained data, e.g. from the play level to the
        game level. Involves a loss of information: once the data is at the game level, we have
        no idea what happened on any particular play.
    2. Stacking/Unstacking --> reshaping
        - Less common than grouping. No loss of information. Unstacking moves data that was in
        separate rows into separate columns; stacking does the reverse.
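A minimal sketch of the difference between the two approaches, using a made-up four-row table (the `plays` frame and its `qtr` and `yards` columns are hypothetical, not part of `play_data_sample.csv`):

```python
import pandas as pd

# Hypothetical toy data: two games, two quarters each
plays = pd.DataFrame({
    'game_id': [1, 1, 2, 2],
    'qtr': [1, 2, 1, 2],
    'yards': [10, 5, 3, 7],
})

# Grouping: 4 rows collapse to 2, and the per-quarter detail is gone
per_game = plays.groupby('game_id')['yards'].sum()

# Unstacking: still one row per game, but each quarter gets its own column
wide = plays.set_index(['game_id', 'qtr'])['yards'].unstack('qtr')

# Stacking reverses the reshape, so nothing was lost
long_again = wide.stack()
```

Here `per_game` cannot be turned back into `plays`, while `long_again` holds the same four values that went into `wide`.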

In [3]:
# Loading libraries

import pandas as pd
import numpy as np
from os import path

# file path

dataDir = '/Users/simmsjn/Documents/GitHub/ltcwff-files/data'

# loading the DF

pbp = pd.read_csv(path.join(dataDir, 'play_data_sample.csv'))
pbp.head()

Unnamed: 0,play_id,game_id,posteam,defteam,posteam_score,defteam_score,qtr,time,yardline_100,down,...,wp,wpa,passer_player_id,passer_player_name,receiver_player_id,receiver_player_name,rusher_player_id,rusher_player_name,turnover,first_down
0,51,2018101412,NE,KC,0.0,0.0,1,15:00:00,75.0,1.0,...,0.500007,0.035322,00-0019596,T.Brady,00-0029664,J.Gordon,,,False,True
1,75,2018101412,NE,KC,0.0,0.0,1,14:27:00,63.0,1.0,...,0.535329,0.004602,,,,,00-0034845,S.Michel,False,False
2,96,2018101412,NE,KC,0.0,0.0,1,13:54:00,58.0,2.0,...,0.53993,0.038361,,,,,00-0034845,S.Michel,False,True
3,117,2018101412,NE,KC,0.0,0.0,1,13:13:00,47.0,1.0,...,0.578292,-0.01862,00-0019596,T.Brady,00-0029664,J.Gordon,,,False,False
4,139,2018101412,NE,KC,0.0,0.0,1,13:10:00,47.0,2.0,...,0.559672,0.008721,,,,,00-0034845,S.Michel,False,False


#### Grouping

In [4]:
# groupby 

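pbp.groupby('game_id').sum(numeric_only=True)

# We get a DF where every numeric column is summed within each game_id
# (numeric_only=True skips text columns such as team and player names)
# also, game_id is the new index
# this can be prevented by passing the as_index=False argument to groupby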


game_id,play_id,posteam_score,defteam_score,qtr,yardline_100,down,ydstogo,yards_gained,rush_attempt,pass_attempt,...,punt_attempt,shotgun,no_huddle,air_yards,yards_after_catch,epa,wp,wpa,turnover,first_down
2018101412,287794,2269.0,2546.0,361,5750.0,260.0,1060,946,55.0,73.0,...,1.0,85,1,642.0,361.0,28.748338,72.384102,1.37429,3,39
2018111900,472385,3745.0,3995.0,429,7991.0,283.0,1362,1001,41.0,103.0,...,7.0,101,12,953.0,407.0,19.171737,76.67725,0.823359,7,41
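A small sketch of the `as_index=False` option mentioned in the comments above, on a made-up frame (the `toy` data is hypothetical). It also passes `numeric_only=True`, on the assumption that the text column should be left out of the sum:

```python
import pandas as pd

# Hypothetical toy data with one text column and one numeric column
toy = pd.DataFrame({
    'game_id': [1, 1, 2],
    'posteam': ['NE', 'KC', 'LA'],
    'yards_gained': [5, 12, 7],
})

# Default: game_id becomes the index of the result
by_index = toy.groupby('game_id').sum(numeric_only=True)

# as_index=False: game_id stays a regular column and the index is 0, 1, ...
as_column = toy.groupby('game_id', as_index=False).sum(numeric_only=True)
```

Keeping `game_id` as a column is often handier when the result will be merged with another table later.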


In [10]:
sum_cols = ['yards_gained', 'rush_attempt', 'pass_attempt', 'shotgun']


In [11]:
# Select only the columns we want, then sum them
pbp.groupby('game_id')[sum_cols].sum()

game_id,yards_gained,rush_attempt,pass_attempt,shotgun
2018101412,946,55.0,73.0,85
2018111900,1001,41.0,103.0,101


In [12]:
# agg() lets us apply a different function to each column
# it takes a dictionary mapping column name -> function

pbp.groupby('game_id').agg({
    'yards_gained': 'sum',
    'play_id': 'count',
    'interception': 'sum',
    'touchdown': 'sum'
})

game_id,yards_gained,play_id,interception,touchdown
2018101412,946,144,2.0,8.0
2018111900,1001,160,3.0,14.0


In [13]:
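# Same as above, but with named aggregation: output_name = (column, function)
# this lets us rename columns in the result, e.g. play_id count -> nplays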

pbp.groupby('game_id').agg(
    yards_gained = ('yards_gained', 'sum'),
    nplays = ('play_id', 'count'),
    interception = ('interception', 'sum'),
    touchdown = ('touchdown', 'sum')
)

game_id,yards_gained,nplays,interception,touchdown
2018101412,946,144,2.0,8.0
2018111900,1001,160,3.0,14.0
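To check that the two `agg()` styles above really produce the same numbers, here is a minimal sketch on a made-up frame (the `toy` data is hypothetical). The only difference is the name of the counted column:

```python
import pandas as pd

# Hypothetical toy data: three plays in game 1, two in game 2
toy = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'play_id': [10, 11, 12, 20, 21],
    'yards_gained': [3, 7, 0, 15, -2],
})

# Dictionary style: output columns keep the input column names
dict_style = toy.groupby('game_id').agg({'yards_gained': 'sum', 'play_id': 'count'})

# Named style: the keyword becomes the output column name
named_style = toy.groupby('game_id').agg(
    yards_gained=('yards_gained', 'sum'),
    nplays=('play_id', 'count'),
)
```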


In [18]:
# grouping by more than one column

yards_per_team_game = (
    pbp
    .groupby(['game_id', 'posteam'])
    .agg(ave_yards_per_play=('yards_gained', 'mean'),
         total_yards=('yards_gained', 'sum'))
)

yards_per_team_game.head()

game_id,posteam,ave_yards_per_play,total_yards
2018101412,KC,7.689655,446
2018101412,NE,6.25,500
2018111900,KC,7.479452,546
2018111900,LA,5.617284,455


#### A note on multilevel indexing

In [21]:
# you can still use the loc method with multilevel-indexed DFs, but you need to pass it a tuple (or a list of tuples).

yards_per_team_game.loc[[(2018101412, 'NE'), (2018111900, 'LA')]]

# This can be avoided by calling the reset_index method immediately after the multi-column groupby


game_id,posteam,ave_yards_per_play,total_yards
2018101412,NE,6.25,500
2018111900,LA,5.617284,455
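A rough sketch of the `reset_index` alternative mentioned above, on a made-up frame (the `toy` data is hypothetical). After the reset, `game_id` and `posteam` are ordinary columns, so rows can be picked with a plain boolean filter instead of tuples:

```python
import pandas as pd

# Hypothetical toy data: plays from two games
toy = pd.DataFrame({
    'game_id': [1, 1, 1, 2, 2],
    'posteam': ['NE', 'NE', 'KC', 'LA', 'KC'],
    'yards_gained': [4, 8, 10, 6, 2],
})

# Multi-column groupby gives a two-level index: (game_id, posteam)
grouped = toy.groupby(['game_id', 'posteam']).agg(
    total_yards=('yards_gained', 'sum'))

# Selecting from the multilevel index needs a tuple
ne_row = grouped.loc[[(1, 'NE')]]

# reset_index turns both index levels back into regular columns
flat = grouped.reset_index()
ne_flat = flat[(flat['game_id'] == 1) & (flat['posteam'] == 'NE')]
```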


#### Stacking and Unstacking Data
