# Exploratory Data Analysis of RLCS (Rocket League Championship Series) competitive matches

## Motivation/Background:

As a player on the UCLA Rocket League team, I found it clear that individual mechanical ability and decision-making skills were not enough to succeed on the pitch: team positioning, play-styles, kickoff strategies, communication, and more were integral components too. My discovery of the public Octane.gg API led me to wonder whether analyzing matches between professional players and teams would yield any interesting insights. Over time, we might be able to observe the rise and fall of different "metas": aggressive playstyles, aerial-focused play, in-field passing, and other cool trends. More importantly, insights from this analysis may prove applicable and change the way our team competes for the better...

–––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––––

### Preface:

You may notice that throughout the analysis, I refrain from using *inplace=True* for any commands. Its usage is generally discouraged for a number of reasons: it is bug-prone, it removes the ability to chain methods, and the pandas developers have discussed deprecating it. Additionally, in most cases it is no more efficient, because under the hood a new copy of the object is often still created and then assigned over the previous object.
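
As a small illustration of the chaining point, here is a sketch on a made-up dataframe (the frame and its column names are hypothetical and not part of the RLCS data):

```python
import pandas as pd

# Toy frame; the column names are invented for illustration only.
toy = pd.DataFrame({"team": ["a", "b", "c"], "score": [3, None, 1], "tmp": [0, 0, 0]})

# Without inplace=True, each method returns a new dataframe, so the steps chain.
cleaned = (
    toy.drop(columns=["tmp"])
       .dropna(subset=["score"])
       .reset_index(drop=True)
)

print(cleaned)
```

The original `toy` frame is left untouched, which makes it easy to re-run a cell without side effects.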

In addition, I tend to use *.copy()* whenever modifying our dataframe, because we want to avoid the *SettingWithCopyWarning* that can arise when we modify a subset of our dataframe. This is safe because we know the warning's concern does not apply here: we *want* to modify the original dataframe.
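
A minimal sketch of the pattern, again on a made-up frame (the names `toy`, `na_only`, and `won` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({"region": ["NA", "EU", "NA"], "score": [4, 2, 3]})

# Taking an explicit copy makes the subset an independent dataframe,
# so adding a column to it is unambiguous and leaves `toy` unchanged.
na_only = toy[toy["region"] == "NA"].copy()
na_only["won"] = na_only["score"] >= 4

print(na_only)
```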

## Part I: Importing and Wrangling Data

## Importing packages

In [4]:
import json
import requests
import pandas as pd

## Extracting our data from the octane.gg API:

To extract the series data, we retrieve 500 series per page, parse the data as JSON and convert it into a dataframe, then iterate through all the pages.

In [2]:
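def get_page(page_num):
    """Fetch one page of up to 500 series and return it as a dataframe."""
    response = requests.get(
        "https://zsr.octane.gg/matches",
        params={"page": page_num, "perPage": 500},
    )
    # Fail loudly on an HTTP error instead of trying to parse an error body.
    response.raise_for_status()
    data = pd.DataFrame.from_dict(response.json())
    return data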

As of Sep 27 2022, 10 PM PT, there are a total of 37107 series available, so we will use a list comprehension to retrieve 75 pages (74 full pages of 500 series plus a final page of 107), and then concatenate them into a single dataframe.

Note that this step may take upwards of 4-5 minutes because of the size of the dataset.

In [3]:
pages = [get_page(n+1) for n in range(75)]
df = pd.concat(pages, ignore_index=True)
df

Unnamed: 0,matches,page,perPage,pageSize
0,"{'_id': '6043145f91504896348eae05', 'slug': 'a...",1,500,500
1,"{'_id': '6043145f91504896348eae0c', 'slug': 'a...",1,500,500
2,"{'_id': '6043145f91504896348eae36', 'slug': 'a...",1,500,500
3,"{'_id': '6043145f91504896348eae2e', 'slug': 'a...",1,500,500
4,"{'_id': '6043145f91504896348eae30', 'slug': 'a...",1,500,500
...,...,...,...,...
37102,"{'_id': '63332ed6c437fde7e02dc201', 'slug': 'c...",75,500,107
37103,"{'_id': '63332ed6c437fde7e02dc202', 'slug': 'c...",75,500,107
37104,"{'_id': '63332ed6c437fde7e02dc203', 'slug': 'c...",75,500,107
37105,"{'_id': '63332ed6c437fde7e02dc204', 'slug': 'c...",75,500,107
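
The page count of 75 above is tied to the dataset size on the day it was retrieved. As a hedged alternative, the loop can stop on its own once a page comes back short. This is only a sketch: `fetch_all_pages` and its parameters are hypothetical names, and it assumes the final page is the first one with fewer than `per_page` rows.

```python
import pandas as pd

def fetch_all_pages(fetch_page, per_page=500, max_pages=1000):
    """Collect pages until one comes back with fewer than per_page rows.

    fetch_page is any callable that takes a 1-based page number and returns
    a dataframe, such as get_page.
    """
    pages = []
    for page_num in range(1, max_pages + 1):
        page = fetch_page(page_num)
        if len(page) == 0:
            break                      # nothing left to read
        pages.append(page)
        if len(page) < per_page:
            break                      # a short page is assumed to be the last one
    if not pages:
        return pd.DataFrame()
    return pd.concat(pages, ignore_index=True)
```

With the function from this notebook, the call would be `df = fetch_all_pages(get_page)`; the `max_pages` cap is only a safety net against an endless loop.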


Now, let's extract only the "matches" column and expand its values, which are currently dictionaries, into separate columns:

In [4]:
df = pd.DataFrame(df['matches'].values.tolist(), index=df.index).copy()
df

Unnamed: 0,_id,slug,octane_id,event,stage,date,format,blue,orange,number,games,reverseSweepAttempt,reverseSweep
0,6043145f91504896348eae05,ae05-chasers-vs-team-synergy,1110201,"{'_id': '5f35882d53fbbb5894b43083', 'slug': '3...","{'_id': 1, 'name': 'Playoffs', 'format': 'brac...",2018-07-07T21:00:00Z,"{'type': 'best', 'length': 7}","{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,
1,6043145f91504896348eae0c,ae0c-lucky-bounce-vs-kings-of-urban,0010201,"{'_id': '5f35882d53fbbb5894b43039', 'slug': '3...","{'_id': 2, 'name': 'Regional Championship', 'f...",2016-07-09T00:00:00Z,"{'type': 'best', 'length': 7}","{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,
2,6043145f91504896348eae36,ae36-cloud9-vs-gale-force,0200201,"{'_id': '5f35882d53fbbb5894b4306c', 'slug': '3...","{'_id': 1, 'name': 'Playoffs', 'format': 'brac...",2017-12-03T14:00:00Z,"{'type': 'best', 'length': 7}","{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,
3,6043145f91504896348eae2e,ae2e-who-vs-canyons,1140101,"{'_id': '5f35882d53fbbb5894b4313d', 'slug': '3...","{'_id': 0, 'name': 'Main Event', 'format': 'br...",2020-05-23T12:00:00Z,"{'type': 'best', 'length': 5}",{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,
4,6043145f91504896348eae30,ae30-chiefs-esports-vs-avant-gaming,1120201,"{'_id': '5f35882d53fbbb5894b43084', 'slug': '3...","{'_id': 1, 'name': 'Playoffs', 'format': 'brac...",2018-07-08T00:00:00Z,"{'type': 'best', 'length': 7}","{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
37102,63332ed6c437fde7e02dc201,c201-tbd-vs-tbd,,"{'_id': '632f05f0c437fde7e02dc03d', 'slug': 'c...","{'_id': 0, 'name': 'Closed Qualifier', 'qualif...",2022-10-23T18:00:00Z,"{'type': 'best', 'length': 5}",,,29.0,,,
37103,63332ed6c437fde7e02dc202,c202-tbd-vs-tbd,,"{'_id': '632f05f0c437fde7e02dc03d', 'slug': 'c...","{'_id': 0, 'name': 'Closed Qualifier', 'qualif...",2022-10-23T18:00:00Z,"{'type': 'best', 'length': 5}",,,30.0,,,
37104,63332ed6c437fde7e02dc203,c203-tbd-vs-tbd,,"{'_id': '632f05f0c437fde7e02dc03d', 'slug': 'c...","{'_id': 0, 'name': 'Closed Qualifier', 'qualif...",2022-10-23T19:00:00Z,"{'type': 'best', 'length': 5}",,,31.0,,,
37105,63332ed6c437fde7e02dc204,c204-tbd-vs-tbd,,"{'_id': '632f05f0c437fde7e02dc03d', 'slug': 'c...","{'_id': 0, 'name': 'Closed Qualifier', 'qualif...",2022-10-23T19:00:00Z,"{'type': 'best', 'length': 5}",,,32.0,,,


We can see that for many columns, their data remains hidden in a dictionary. We will flatten the dictionaries for some simple columns here, but leave the bulk of the data in the "blue", "orange", and "games" columns as they are for now because they contain many levels of nested dictionaries and will be complicated to extract.
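
To make the flattening step concrete, here is a small sketch of what `pd.json_normalize` does to nested dictionaries. The records below are made up and only loosely shaped like the API data:

```python
import pandas as pd

# Two invented records with one nested level.
records = [
    {"type": "best", "length": 7, "stage": {"name": "Playoffs"}},
    {"type": "best", "length": 5, "stage": {"name": "Main Event"}},
]

flat = pd.json_normalize(records)

# Nested keys are joined with a dot, e.g. "stage.name".
print(flat.columns.tolist())
```

Each top-level key becomes a column, and nested keys get a dotted name, which is why deeply nested columns such as "blue" and "orange" would produce a very wide frame.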

In [5]:
event = pd.json_normalize(df['event'])
event

Unnamed: 0,_id,slug,name,region,mode,tier,image,groups
0,5f35882d53fbbb5894b43083,3083-sam-championship-season-1,SAM Championship Season 1,SAM,3,B,https://griffon.octane.gg/events/sam-champions...,
1,5f35882d53fbbb5894b43039,3039-rlcs-season-1-north-america-stage-2,RLCS Season 1 North America Stage 2,,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]"
2,5f35882d53fbbb5894b4306c,306c-eleague-2017,ELEAGUE 2017,INT,3,S,https://griffon.octane.gg/events/eleague.png,
3,5f35882d53fbbb5894b4313d,313d-red-bull-gaming-world-finals,Red Bull Gaming World Finals,EU,3,C,https://griffon.octane.gg/events/red-bull-gami...,
4,5f35882d53fbbb5894b43084,3084-gfinity-australia-elite-series-season-1,Gfinity Australia Elite Series Season 1,OCE,3,A,https://griffon.octane.gg/events/gfinity.png,
...,...,...,...,...,...,...,...,...
37102,632f05f0c437fde7e02dc03d,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]"
37103,632f05f0c437fde7e02dc03d,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]"
37104,632f05f0c437fde7e02dc03d,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]"
37105,632f05f0c437fde7e02dc03d,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]"


In [6]:
stage = pd.json_normalize(df['stage'])
stage

Unnamed: 0,_id,name,format,qualifier,lan
0,1,Playoffs,bracket-4se,,
1,2,Regional Championship,bracket-4se+3,,
2,1,Playoffs,bracket-4se,,
3,0,Main Event,bracket-4se,,
4,1,Playoffs,bracket-4se,,
...,...,...,...,...,...
37102,0,Closed Qualifier,,True,
37103,0,Closed Qualifier,,True,
37104,0,Closed Qualifier,,True,
37105,0,Closed Qualifier,,True,


In [7]:
formats = pd.json_normalize(df['format'])
formats

Unnamed: 0,type,length
0,best,7.0
1,best,7.0
2,best,7.0
3,best,5.0
4,best,7.0
...,...,...
37102,best,5.0
37103,best,5.0
37104,best,5.0
37105,best,5.0


Let's concatenate these three new dataframes back into our main dataframe and delete the original un-normalized ones.

Note: We have to be careful when dropping the original *format* column because there is now also a *format* column from the *stage* dataframe. To fix this, we'll just delete it from *df* before the concatenation.

In [8]:
df = df.drop(columns=["format"]).copy()
to_concat = [df, event, stage, formats]
df = pd.concat(to_concat, axis=1).copy()
df = df.drop(columns=["event", "stage"]).copy()
df

Unnamed: 0,_id,slug,octane_id,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,...,tier,image,groups,_id.1,name,format,qualifier,lan,type,length
0,6043145f91504896348eae05,ae05-chasers-vs-team-synergy,1110201,2018-07-07T21:00:00Z,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,...,B,https://griffon.octane.gg/events/sam-champions...,,1,Playoffs,bracket-4se,,,best,7.0
1,6043145f91504896348eae0c,ae0c-lucky-bounce-vs-kings-of-urban,0010201,2016-07-09T00:00:00Z,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,...,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",2,Regional Championship,bracket-4se+3,,,best,7.0
2,6043145f91504896348eae36,ae36-cloud9-vs-gale-force,0200201,2017-12-03T14:00:00Z,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,...,S,https://griffon.octane.gg/events/eleague.png,,1,Playoffs,bracket-4se,,,best,7.0
3,6043145f91504896348eae2e,ae2e-who-vs-canyons,1140101,2020-05-23T12:00:00Z,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,...,C,https://griffon.octane.gg/events/red-bull-gami...,,0,Main Event,bracket-4se,,,best,5.0
4,6043145f91504896348eae30,ae30-chiefs-esports-vs-avant-gaming,1120201,2018-07-08T00:00:00Z,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,...,A,https://griffon.octane.gg/events/gfinity.png,,1,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37102,63332ed6c437fde7e02dc201,c201-tbd-vs-tbd,,2022-10-23T18:00:00Z,,,29.0,,,,...,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",0,Closed Qualifier,,True,,best,5.0
37103,63332ed6c437fde7e02dc202,c202-tbd-vs-tbd,,2022-10-23T18:00:00Z,,,30.0,,,,...,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",0,Closed Qualifier,,True,,best,5.0
37104,63332ed6c437fde7e02dc203,c203-tbd-vs-tbd,,2022-10-23T19:00:00Z,,,31.0,,,,...,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",0,Closed Qualifier,,True,,best,5.0
37105,63332ed6c437fde7e02dc204,c204-tbd-vs-tbd,,2022-10-23T19:00:00Z,,,32.0,,,,...,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",0,Closed Qualifier,,True,,best,5.0
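
Notice that the concatenated frame now contains repeated column names (shown above with a ".1" suffix). One alternative, not used in this notebook, is to prefix each normalized frame before concatenating so that every name stays unique. A sketch on made-up frames (the prefixes are arbitrary choices):

```python
import pandas as pd

# Toy stand-ins for the normalized "event" and "stage" frames.
event = pd.DataFrame({"_id": ["e1", "e2"], "name": ["Event A", "Event B"]})
stage = pd.DataFrame({"_id": [1, 0], "name": ["Playoffs", "Main Event"]})

# Prefixing keeps every column name unique after the concatenation.
combined = pd.concat(
    [event.add_prefix("event_"), stage.add_prefix("stage_")],
    axis=1,
)

print(combined.columns.tolist())
```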


We won't really need any of the "_id" and "octane_id" identifier columns anymore, so let's drop them too. Because the name "_id" now appears more than once, dropping it removes every copy:

In [9]:
df = df.drop(columns=["_id", "octane_id"]).copy()
df

Unnamed: 0,slug,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,slug.1,name,...,mode,tier,image,groups,name.1,format,qualifier,lan,type,length
0,ae05-chasers-vs-team-synergy,2018-07-07T21:00:00Z,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,3083-sam-championship-season-1,SAM Championship Season 1,...,3,B,https://griffon.octane.gg/events/sam-champions...,,Playoffs,bracket-4se,,,best,7.0
1,ae0c-lucky-bounce-vs-kings-of-urban,2016-07-09T00:00:00Z,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,3039-rlcs-season-1-north-america-stage-2,RLCS Season 1 North America Stage 2,...,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,best,7.0
2,ae36-cloud9-vs-gale-force,2017-12-03T14:00:00Z,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,306c-eleague-2017,ELEAGUE 2017,...,3,S,https://griffon.octane.gg/events/eleague.png,,Playoffs,bracket-4se,,,best,7.0
3,ae2e-who-vs-canyons,2020-05-23T12:00:00Z,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,313d-red-bull-gaming-world-finals,Red Bull Gaming World Finals,...,3,C,https://griffon.octane.gg/events/red-bull-gami...,,Main Event,bracket-4se,,,best,5.0
4,ae30-chiefs-esports-vs-avant-gaming,2018-07-08T00:00:00Z,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,3084-gfinity-australia-elite-series-season-1,Gfinity Australia Elite Series Season 1,...,3,A,https://griffon.octane.gg/events/gfinity.png,,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37102,c201-tbd-vs-tbd,2022-10-23T18:00:00Z,,,29.0,,,,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,...,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37103,c202-tbd-vs-tbd,2022-10-23T18:00:00Z,,,30.0,,,,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,...,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37104,c203-tbd-vs-tbd,2022-10-23T19:00:00Z,,,31.0,,,,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,...,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37105,c204-tbd-vs-tbd,2022-10-23T19:00:00Z,,,32.0,,,,c03d-rlcs-2022-23-fall-sub-saharan-africa-regi...,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,...,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0


## Checking out our column data:

Great! Let's get a feel for what our column data looks like:

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37107 entries, 0 to 37106
Data columns (total 21 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   slug                 37107 non-null  object 
 1   date                 37105 non-null  object 
 2   blue                 36050 non-null  object 
 3   orange               36050 non-null  object 
 4   number               37106 non-null  float64
 5   games                22947 non-null  object 
 6   reverseSweepAttempt  2443 non-null   object 
 7   reverseSweep         1207 non-null   object 
 8   slug                 37107 non-null  object 
 9   name                 37107 non-null  object 
 10  region               37107 non-null  object 
 11  mode                 37107 non-null  int64  
 12  tier                 37107 non-null  object 
 13  image                36842 non-null  object 
 14  groups               17405 non-null  object 
 15  name                 37107 non-null 

We can see our dates are stored in ISO 8601 format but are currently read by pandas as type *object*. Let's convert the column to a datetime dtype:

In [11]:
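# Keep only the calendar date, then store it with a proper datetime dtype.
df["date"] = pd.to_datetime(df["date"]).dt.date
# A unit is required here: recent pandas versions reject a bare 'datetime64'.
df["date"] = df["date"].astype('datetime64[ns]')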
df["date"]

0       2018-07-07
1       2016-07-09
2       2017-12-03
3       2020-05-23
4       2018-07-08
           ...    
37102   2022-10-23
37103   2022-10-23
37104   2022-10-23
37105   2022-10-23
37106   2022-10-23
Name: date, Length: 37107, dtype: datetime64[ns]

Note that our date-times all had a "Z" at the end, indicating zero UTC offset, so we can safely assume all times have already been converted to UTC time and not worry about the headache that is determining timezones, where an event was played (and what about online events?), etc. Besides, any sort of time series analysis we may conduct will most likely focus heavily on long-term trends, so this shouldn't be a big problem.
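
For reference, here is a sketch of how the "Z" suffix is handled when the timestamps are parsed as timezone-aware values first. The sample strings are copied from the table above; the variable names are hypothetical, and this is an alternative route to the same dates rather than the code used in this notebook:

```python
import pandas as pd

raw = pd.Series(["2018-07-07T21:00:00Z", "2016-07-09T00:00:00Z", None])

# The trailing "Z" is parsed as UTC, giving timezone-aware timestamps.
parsed = pd.to_datetime(raw, utc=True)

# Drop the timezone and the time of day to keep one naive date per series.
dates = parsed.dt.tz_localize(None).dt.normalize()

print(dates)
```

Missing values simply become `NaT`, so no special handling is needed for the handful of series without a date.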

Let's first rename our columns to something more descriptive. More importantly, we are avoiding duplicate names so it's easier to operate on those columns later:

In [12]:
df.columns = ["matchup", "date", "blue", "orange", "number", "games", "reverseSweepAttempt", "reverseSweep", "event-name", "event_name", "region", "mode", "tier", "image", "groups", "stage_name", "format", "qualifier", "lan", "type", "length"]

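Note I confusingly named two adjacent columns "event-name" and "event_name". Yeah, that's because they provide redundant information: "event-name" holds the event slug, which is just a URL-friendly version of the name in "event_name". Let's drop the "event-name" column: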

In [13]:
df = df.drop(columns=["event-name"]).copy()
df

Unnamed: 0,matchup,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,region,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
0,ae05-chasers-vs-team-synergy,2018-07-07,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,SAM Championship Season 1,SAM,3,B,https://griffon.octane.gg/events/sam-champions...,,Playoffs,bracket-4se,,,best,7.0
1,ae0c-lucky-bounce-vs-kings-of-urban,2016-07-09,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,RLCS Season 1 North America Stage 2,,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,best,7.0
2,ae36-cloud9-vs-gale-force,2017-12-03,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,ELEAGUE 2017,INT,3,S,https://griffon.octane.gg/events/eleague.png,,Playoffs,bracket-4se,,,best,7.0
3,ae2e-who-vs-canyons,2020-05-23,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,Red Bull Gaming World Finals,EU,3,C,https://griffon.octane.gg/events/red-bull-gami...,,Main Event,bracket-4se,,,best,5.0
4,ae30-chiefs-esports-vs-avant-gaming,2018-07-08,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,Gfinity Australia Elite Series Season 1,OCE,3,A,https://griffon.octane.gg/events/gfinity.png,,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
37102,c201-tbd-vs-tbd,2022-10-23,,,29.0,,,,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37103,c202-tbd-vs-tbd,2022-10-23,,,30.0,,,,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37104,c203-tbd-vs-tbd,2022-10-23,,,31.0,,,,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0
37105,c204-tbd-vs-tbd,2022-10-23,,,32.0,,,,RLCS 2022-23 Fall Sub-Saharan Africa Regional 2,AF,3,A,https://griffon.octane.gg/events/rlcs-2022-23.png,"[rlcs, rlcs2223, rlcs2223fall]",Closed Qualifier,,True,,best,5.0


Let's revisit our trusty friend *df.info()*:

In [14]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 37107 entries, 0 to 37106
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   matchup              37107 non-null  object        
 1   date                 37105 non-null  datetime64[ns]
 2   blue                 36050 non-null  object        
 3   orange               36050 non-null  object        
 4   number               37106 non-null  float64       
 5   games                22947 non-null  object        
 6   reverseSweepAttempt  2443 non-null   object        
 7   reverseSweep         1207 non-null   object        
 8   event_name           37107 non-null  object        
 9   region               37107 non-null  object        
 10  mode                 37107 non-null  int64         
 11  tier                 37107 non-null  object        
 12  image                36842 non-null  object        
 13  groups               17405 non-

This is minor, but of course our "matchup", "event_name", and "stage_name" columns should all be strings. Let's convert those values into string datatypes:

In [15]:
df = df.astype({"matchup": "string", "event_name": "string", "stage_name": "string"}).copy()

*Note:* Pandas strings and objects are virtually interchangeable: the *object* datatype is still the default datatype for strings, while StringDtype is relatively new. StringDtype is recommended for storing text data because it is stricter and will not mask accidental mixing of strings and non-strings the way an object dtype would. That said, you may notice later that some columns with string values are of object dtype, simply because functionally there is almost no difference, and I may forget to explicitly convert column types every time I append a new column.
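
A small sketch of the strictness difference, using made-up values. It works on a pandas array directly, and the exact exception type is an assumption that may vary between pandas versions, so both common ones are caught:

```python
import pandas as pd

# An object-dtype Series silently accepts a mix of strings and non-strings.
mixed = pd.Series(["nrg-esports-vs-g2-esports", 42])

# A "string"-dtype array rejects a non-string assignment.
strict = pd.array(["nrg-esports-vs-g2-esports", "cloud9-vs-gale-force"], dtype="string")
try:
    strict[0] = 42
    rejected = False
except (TypeError, ValueError):
    rejected = True

print(mixed.dtype, strict.dtype, rejected)
```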

In [16]:
df["matchup"].head()

0           ae05-chasers-vs-team-synergy
1    ae0c-lucky-bounce-vs-kings-of-urban
2              ae36-cloud9-vs-gale-force
3                    ae2e-who-vs-canyons
4    ae30-chiefs-esports-vs-avant-gaming
Name: matchup, dtype: string

Okay, but our *matchup* column in its current form still isn't very convenient for searching. Each value begins with a four-character alphanumeric sequence and a hyphen before the two teams' names, which seems to serve as a unique identifier, but we *want* duplicates so that we can compare the same matchup over time and perform other equally informative analysis. So, let's strip the first five characters:

In [17]:
df["matchup"] = df["matchup"].str[5:]

In [18]:
df["matchup"]

0               chasers-vs-team-synergy
1        lucky-bounce-vs-kings-of-urban
2                  cloud9-vs-gale-force
3                        who-vs-canyons
4        chiefs-esports-vs-avant-gaming
                      ...              
37102                        tbd-vs-tbd
37103                        tbd-vs-tbd
37104                        tbd-vs-tbd
37105                        tbd-vs-tbd
37106                        tbd-vs-tbd
Name: matchup, Length: 37107, dtype: string
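
The fixed slice above assumes every slug starts with exactly five prefix characters. A slightly more defensive sketch strips the prefix only when it matches a pattern; the pattern (four lowercase hexadecimal characters plus a hyphen) is an assumption based on the slugs shown so far, and the third sample value is invented:

```python
import pandas as pd

slugs = pd.Series(
    ["ae05-chasers-vs-team-synergy", "c201-tbd-vs-tbd", "no-prefix-here"],
    dtype="string",
)

# Remove a leading 4-character hex id and hyphen only when it is present.
matchups = slugs.str.replace(r"^[0-9a-f]{4}-", "", regex=True)

print(matchups.tolist())
```

A slug without the prefix is left unchanged instead of losing its first five characters, although a team name that itself begins with four hexadecimal characters would still be misread.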

Now there exist matchup duplicates:

In [19]:
df["matchup"].value_counts()

tbd-vs-tbd                           1190
nrg-esports-vs-g2-esports              31
ground-zero-gaming-vs-renegades        26
spacestation-gaming-vs-rogue           26
g2-esports-vs-nrg-esports              25
                                     ... 
detonator-vs-gokkies                    1
falcons-esports-vs-joga-bonito          1
team-inherits-vs-moti-moti-gaming       1
kings-esports-vs-joga-bonito            1
team-toastie-vs-team-memory             1
Name: matchup, Length: 27285, dtype: Int64

Hmmm, there seem to be a lot of "tbd-vs-tbd" matchups. Let's just see what they look like:

In [20]:
pd.set_option('display.min_rows', 20)
df[df["matchup"] == "tbd-vs-tbd"]

Unnamed: 0,matchup,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,region,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
4082,tbd-vs-tbd,2020-11-29,{},{},5.0,,,,Rocket Drift Season 3,,3,C,https://griffon.octane.gg/events/rocket-drift.png,,Playoffs,8se,,,,
4084,tbd-vs-tbd,2020-11-29,{},{},6.0,,,,Rocket Drift Season 3,,3,C,https://griffon.octane.gg/events/rocket-drift.png,,Playoffs,8se,,,,
4085,tbd-vs-tbd,2020-11-30,{},{},7.0,,,,Rocket Drift Season 3,,3,C,https://griffon.octane.gg/events/rocket-drift.png,,Playoffs,8se,,,,
8925,tbd-vs-tbd,2021-01-24,{},{},1.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Playoffs,bracket-4se,,,,
8926,tbd-vs-tbd,2021-01-24,{},{},2.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Playoffs,bracket-4se,,,,
8927,tbd-vs-tbd,2021-01-24,{},{},3.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Playoffs,bracket-4se,,,,
9108,tbd-vs-tbd,2020-12-05,{},{},7.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Group Stage,rr-1g8,,,,
9109,tbd-vs-tbd,2020-12-05,{},{},8.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Group Stage,rr-1g8,,,,
9111,tbd-vs-tbd,2020-12-06,{},{},9.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Group Stage,rr-1g8,,,,
9112,tbd-vs-tbd,2020-12-06,{},{},10.0,,,,Liga Raketa Season 5,EU,3,C,https://griffon.octane.gg/events/liga-raketa.png,,Group Stage,rr-1g8,,,,


In [21]:
pd.reset_option('display.min_rows')

I've displayed so many rows because they highlight the fact that series with "tbd-vs-tbd" tend to have large amounts of missing data in other columns as well. That data is critical for analysis: the final series score, players, team and individual stats, individual game scores, series progression, and more, all nested within the "blue", "orange", and "games" columns.

In fact, I scanned through all the rows (a few months ago), and only 4 series have information for at least one of the blue/orange columns. Therefore, let's remove all the above records save for those 4:

In [22]:
to_keep = df.index.isin([23042,23057,23072,23087])

In [23]:
to_drop = (df["matchup"] == "tbd-vs-tbd") & (~to_keep)

In [24]:
df = df.drop(df[to_drop].index).copy()
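
The four index values above were found by hand and depend on the row order at retrieval time. As a hedged alternative, the same filter can be expressed as a rule: keep a "tbd-vs-tbd" row only if its "blue" or "orange" value is a non-empty dictionary. The toy frame and the helper `has_data` below are hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "matchup": ["tbd-vs-tbd", "tbd-vs-tbd", "a-vs-b", "tbd-vs-tbd"],
    "blue":    [{}, {"score": 3}, {"score": 1}, None],
    "orange":  [{}, {}, {"score": 3}, None],
})

def has_data(value):
    """True for a non-empty dictionary; False for {}, None, or NaN."""
    return isinstance(value, dict) and len(value) > 0

is_tbd = toy["matchup"] == "tbd-vs-tbd"
informative = toy["blue"].map(has_data) | toy["orange"].map(has_data)

# Keep everything except TBD rows with no blue/orange information.
kept = toy[~(is_tbd & ~informative)].reset_index(drop=True)

print(kept)
```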

...And don't forget to reset the index.

In [25]:
df = df.reset_index(drop=True).copy()
df

Unnamed: 0,matchup,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,region,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
0,chasers-vs-team-synergy,2018-07-07,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,SAM Championship Season 1,SAM,3,B,https://griffon.octane.gg/events/sam-champions...,,Playoffs,bracket-4se,,,best,7.0
1,lucky-bounce-vs-kings-of-urban,2016-07-09,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,RLCS Season 1 North America Stage 2,,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,best,7.0
2,cloud9-vs-gale-force,2017-12-03,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,ELEAGUE 2017,INT,3,S,https://griffon.octane.gg/events/eleague.png,,Playoffs,bracket-4se,,,best,7.0
3,who-vs-canyons,2020-05-23,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,Red Bull Gaming World Finals,EU,3,C,https://griffon.octane.gg/events/red-bull-gami...,,Main Event,bracket-4se,,,best,5.0
4,chiefs-esports-vs-avant-gaming,2018-07-08,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,Gfinity Australia Elite Series Season 1,OCE,3,A,https://griffon.octane.gg/events/gfinity.png,,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,team-memory-vs-team-gyro,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a08...",3.0,"[{'_id': '6328ab43c437fde7e02dbcd7', 'blue': 1...",,,College Carball Association x Immortals Bundle...,,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35917,team-tcorrell-vs-team-andy,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a06...",4.0,"[{'_id': '6328abeeda9d7ca1c7bb4359', 'blue': 1...",,,College Carball Association x Immortals Bundle...,,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35918,team-toastie-vs-team-jstn,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 1, 'team': {'team': {'_id': '6328a0b...",5.0,"[{'_id': '6328aaebda9d7ca1c7bb42f0', 'blue': 3...",,,College Carball Association x Immortals Bundle...,,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35919,team-memory-vs-team-tcorrell,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a0d...",6.0,"[{'_id': '6328ab81da9d7ca1c7bb4336', 'blue': 4...",,,College Carball Association x Immortals Bundle...,,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0


## Analyzing NA's and duplicates:

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35921 entries, 0 to 35920
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   matchup              35921 non-null  string        
 1   date                 35919 non-null  datetime64[ns]
 2   blue                 35921 non-null  object        
 3   orange               35921 non-null  object        
 4   number               35920 non-null  float64       
 5   games                22947 non-null  object        
 6   reverseSweepAttempt  2443 non-null   object        
 7   reverseSweep         1207 non-null   object        
 8   event_name           35921 non-null  string        
 9   region               35921 non-null  object        
 10  mode                 35921 non-null  int64         
 11  tier                 35921 non-null  object        
 12  image                35656 non-null  object        
 13  groups               16472 non-

We can see that the "games" column has many null values (only 22,947 of 35,921 entries are non-null), as do the "reverseSweepAttempt" and "reverseSweep" columns. Some less important columns, such as "groups", "format", "qualifier", and "lan", are missing many values as well.
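
To rank columns by how much is missing rather than reading the counts off `df.info()`, something like the following could be used. This is a minimal sketch on a made-up toy frame (`toy` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Toy stand-in for the real dataframe; values are invented for illustration.
toy = pd.DataFrame({
    "games": [[1], None, None, [2]],
    "event_name": ["a", "b", "c", "d"],
})

# isna() marks missing cells; the column-wise mean is the share of missing values.
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, `df.isna().mean()` would give the same kind of summary for all 20 columns.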

In [27]:
df["reverseSweepAttempt"].value_counts()

True    2443
Name: reverseSweepAttempt, dtype: int64

In [28]:
df["reverseSweep"].value_counts()

True    1207
Name: reverseSweep, dtype: int64

Clearly, data is only entered into these columns when the value is True, as there are no False values. However, this does not mean every reverse sweep attempt has been recorded; in other words, we can't simply fill all NaN values with False. In addition, when "reverseSweepAttempt" is True but "reverseSweep" is NaN, we do not know the outcome of the attempt and cannot fill in any particular value. Therefore, let's leave the column data as it is for now. We may not even touch these columns.

In [29]:
len(df[(df["reverseSweepAttempt"] == True) & (df["reverseSweep"] == True)])

1207

We note that whenever "reverseSweep" is True, "reverseSweepAttempt" is True as well, since we obtain the same number of records as the number of non-null "reverseSweep" values. This makes logical sense and is not revolutionary, but had we obtained evidence to the contrary (i.e. a reverse sweep occurred and yet "reverseSweepAttempt" is NaN or False), we would have been able to fix this by imputing True values for "reverseSweepAttempt".
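
The hypothetical fix described above can be sketched on toy data. The frame below is invented for illustration and deliberately contains one inconsistent row, which the real data does not:

```python
import pandas as pd

# Toy data: row 3 records a reverse sweep without a recorded attempt.
toy = pd.DataFrame({
    "reverseSweepAttempt": [True, True, None, None],
    "reverseSweep": [True, None, None, True],
})

# Rows where a sweep is recorded but the attempt flag is not True.
inconsistent = toy[(toy["reverseSweep"] == True) & (toy["reverseSweepAttempt"] != True)]

# A completed reverse sweep implies an attempt, so True can be imputed safely.
toy.loc[inconsistent.index, "reverseSweepAttempt"] = True
```

Only the "attempt" flag is imputed here; rows where both columns are NaN remain unknown.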

*Note:* Technically, I believe it could be possible for us to deduce and fill in the missing values of "reverseSweepAttempt" and "reverseSweep" for some series. However, this would be a Herculean and tediously mind-numbing task: for each and every series, we'd have to extract the series format (Best-of-3, 5, 7, etc.) as well as the game progression (e.g. 1-0, 2-4, 3-2, 4-1), then loop through those games, noting when a team is at match point and observing whether the other team attempts a reverse sweep and whether it is ultimately successful. In addition, data for the "games" column is missing for more than 1/3 of the series.

*Therefore, let's just say this is left as an exercise to the reader.*
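
For readers who do want to take up that exercise, a rough starting point might look like the sketch below. The function name, its inputs, and above all the definition of an "attempt" are assumptions of mine, not something confirmed by the Octane.gg data: here an attempt means one team reaches match point while the other has zero wins, and the trailing team then wins at least one game.

```python
def reverse_sweep_status(game_winners, best_of):
    """Return (attempt, success) for one series.

    game_winners: list of "blue"/"orange", one entry per game in order.
    best_of: series length (3, 5, 7, ...).
    The attempt/success definition used here is an assumption.
    """
    wins_needed = best_of // 2 + 1
    wins = {"blue": 0, "orange": 0}
    trailing = None   # team that fell behind 0-(wins_needed - 1), if any
    attempt = False
    for winner in game_winners:
        wins[winner] += 1
        loser = "orange" if winner == "blue" else "blue"
        if trailing is None and wins[winner] == wins_needed - 1 and wins[loser] == 0:
            trailing = loser          # opponent is at match point, we have no wins
        elif trailing is not None and winner == trailing:
            attempt = True            # trailing team took a game after match point
    success = attempt and wins[trailing] == wins_needed
    return attempt, success


# A Bo7 where orange loses three games, then wins four straight.
print(reverse_sweep_status(["blue"] * 3 + ["orange"] * 4, best_of=7))
```

Extracting `game_winners` from the nested "games" column is a separate step and is not shown here.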

However, as seen above, there are only two records with null values for the date-time. That seems possible to fix manually:

In [30]:
df[pd.isna(df["date"])]

Unnamed: 0,matchup,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,region,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
22000,berlin-phoenix-vs-basilisks-berlin,NaT,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 1, 'team': {'team': {'_id': '604da3d...",9.0,,,,European University Rocketeers' Championship 2021,EU,3,C,https://griffon.octane.gg/events/eurc.png,,Playoffs,bracket,,,best,7.0
22006,berlin-phoenix-vs-portsmouth-paladins,NaT,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '604da3d...",11.0,,,,European University Rocketeers' Championship 2021,EU,3,C,https://griffon.octane.gg/events/eurc.png,,Playoffs,bracket,,,best,7.0


From the "event_name" and "stage_name" columns, we see that both series were played in the playoff bracket of the European University Rocketeers' Championship 2021. Digging around on liquipedia.net, we find EURC 2021 and scroll down to the playoff bracket. We then match up the series scores (4-1) and (4-3) shown in the "blue" and "orange" columns with the appropriate series and voila!

![NA Date 1](media/NAdate1.png)

![NA Date 2](media/NAdate2.png)

Let's fill in those values:

In [31]:
df.at[22000, "date"] = pd.to_datetime("2021-04-19")
df.at[22006, "date"] = pd.to_datetime("2021-04-25")

## The problem with matchups

In [32]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35921 entries, 0 to 35920
Data columns (total 20 columns):
 #   Column               Non-Null Count  Dtype         
---  ------               --------------  -----         
 0   matchup              35921 non-null  string        
 1   date                 35921 non-null  datetime64[ns]
 2   blue                 35921 non-null  object        
 3   orange               35921 non-null  object        
 4   number               35920 non-null  float64       
 5   games                22947 non-null  object        
 6   reverseSweepAttempt  2443 non-null   object        
 7   reverseSweep         1207 non-null   object        
 8   event_name           35921 non-null  string        
 9   region               35921 non-null  object        
 10  mode                 35921 non-null  int64         
 11  tier                 35921 non-null  object        
 12  image                35656 non-null  object        
 13  groups               16472 non-

In [33]:
df["matchup"].value_counts()

nrg-esports-vs-g2-esports            31
ground-zero-gaming-vs-renegades      26
spacestation-gaming-vs-rogue         26
g2-esports-vs-nrg-esports            25
g2-esports-vs-ghost-gaming           25
                                     ..
sudor-esports-vs-the-evil-esports     1
ainda-cria-vs-qshaws                  1
locusts-vs-insight                    1
team-nazgul-vs-team-celestial         1
team-toastie-vs-team-memory           1
Name: matchup, Length: 27285, dtype: Int64

Back to the matchups. There are 27,285 unique matchups out of 35,921 series, which seems oddly high.

**My proposition: The matchup names are ordered, so when the same two teams play each other but on opposite sides (blue vs. orange), the two orderings count as different matchups, which of course is not what we want. Let's investigate:**

Take, for example, the most frequent matchup as stated by our current data: NRG Esports vs. G2 Esports, with 31 series:

In [34]:
df[df["matchup"] == "nrg-esports-vs-g2-esports"]["matchup"].count()

31

Now let's swap the order of the teams and see if we get any matches:

In [35]:
df[df["matchup"] == "g2-esports-vs-nrg-esports"]["matchup"].count()

25

**Aha!** So the same pairing does show up under both orderings: 31 series as NRG vs. G2 plus 25 as G2 vs. NRG, for 56 in total, almost twice what the top count suggested.

Now the question is, how do we recognize these duplicates and consolidate them, *without* losing critical information relating our "blue" and "orange" column data to the correct teams?

I propose the following general approach:  

**Step 1: Split the current "matchup" column into two new columns "blue_team" and "orange_team" and create a new dataframe with these columns to retain information about which team was on which side.**

**Step 2: Sort the team names alphabetically into lists of pairs.**

**Step 3: Iterate through the dataframe and swap blue and orange teams if the order is different from its corresponding sorted list. This allows matchups which are reverse duplicates to be treated as if they are identical.**

## Performing surgery on our matchup reverse duplicates

#### Step 1: Splitting our "matchup" column

To begin, we've got to split our matchups into their individual teams:

In [36]:
df["matchup"]

0               chasers-vs-team-synergy
1        lucky-bounce-vs-kings-of-urban
2                  cloud9-vs-gale-force
3                        who-vs-canyons
4        chiefs-esports-vs-avant-gaming
                      ...              
35916          team-memory-vs-team-gyro
35917        team-tcorrell-vs-team-andy
35918         team-toastie-vs-team-jstn
35919      team-memory-vs-team-tcorrell
35920       team-toastie-vs-team-memory
Name: matchup, Length: 35921, dtype: string

In [37]:
matchups = [matchup.split('-vs-') for matchup in df["matchup"]]
matchups[0:10]

[['chasers', 'team-synergy'],
 ['lucky-bounce', 'kings-of-urban'],
 ['cloud9', 'gale-force'],
 ['who', 'canyons'],
 ['chiefs-esports', 'avant-gaming'],
 ['endpoint', 'prophecy'],
 ['nrg-esports', 'settodestroyx'],
 ['allegiance', 'out-of-style'],
 ['team-envy', 'the-juicy-kids'],
 ['nrg-esports', 'the-muffin-men']]

Okay, but when a team name has more than one word, the hyphens between the words remain. Since "matchups" is now a list of lists, we're going to need a nested list comprehension:

In [38]:
team_names = [[team_name.replace("-", " ") for team_name in matchup] for matchup in matchups]
team_names[0:10]

[['chasers', 'team synergy'],
 ['lucky bounce', 'kings of urban'],
 ['cloud9', 'gale force'],
 ['who', 'canyons'],
 ['chiefs esports', 'avant gaming'],
 ['endpoint', 'prophecy'],
 ['nrg esports', 'settodestroyx'],
 ['allegiance', 'out of style'],
 ['team envy', 'the juicy kids'],
 ['nrg esports', 'the muffin men']]
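
The same split-and-clean step can also be written with pandas string methods instead of list comprehensions. A minimal sketch on two sample slugs (the `toy` series and `teams` name are illustrative):

```python
import pandas as pd

toy = pd.Series(["cloud9-vs-gale-force", "who-vs-canyons"])

# Split each slug on "-vs-" into two columns, then replace the remaining hyphens.
teams = toy.str.split("-vs-", expand=True).apply(
    lambda col: col.str.replace("-", " ", regex=False)
)
teams.columns = ["blue_team", "orange_team"]
print(teams)
```

Either approach gives the same result; the list-comprehension version simply makes the intermediate steps easier to inspect.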

Perfect. Now let's create a dataframe with these matchups:

In [39]:
matchup_df = pd.DataFrame(team_names).copy()
matchup_df.columns = ["blue_team", "orange_team"]
matchup_df

Unnamed: 0,blue_team,orange_team
0,chasers,team synergy
1,lucky bounce,kings of urban
2,cloud9,gale force
3,who,canyons
4,chiefs esports,avant gaming
...,...,...
35916,team memory,team gyro
35917,team tcorrell,team andy
35918,team toastie,team jstn
35919,team memory,team tcorrell


Before we move forward, let's append this to our original dataframe. We also won't need the "matchup" column afterwards because the blue_team and orange_team columns encode the same information in a clearer format.

In [40]:
df = pd.concat([matchup_df, df], axis=1).copy()
df = df.drop(columns=["matchup"]).copy()
df

Unnamed: 0,blue_team,orange_team,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,...,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
0,chasers,team synergy,2018-07-07,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,SAM Championship Season 1,...,3,B,https://griffon.octane.gg/events/sam-champions...,,Playoffs,bracket-4se,,,best,7.0
1,lucky bounce,kings of urban,2016-07-09,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,RLCS Season 1 North America Stage 2,...,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,best,7.0
2,cloud9,gale force,2017-12-03,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,ELEAGUE 2017,...,3,S,https://griffon.octane.gg/events/eleague.png,,Playoffs,bracket-4se,,,best,7.0
3,who,canyons,2020-05-23,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,Red Bull Gaming World Finals,...,3,C,https://griffon.octane.gg/events/red-bull-gami...,,Main Event,bracket-4se,,,best,5.0
4,chiefs esports,avant gaming,2018-07-08,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,Gfinity Australia Elite Series Season 1,...,3,A,https://griffon.octane.gg/events/gfinity.png,,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,team memory,team gyro,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a08...",3.0,"[{'_id': '6328ab43c437fde7e02dbcd7', 'blue': 1...",,,College Carball Association x Immortals Bundle...,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35917,team tcorrell,team andy,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a06...",4.0,"[{'_id': '6328abeeda9d7ca1c7bb4359', 'blue': 1...",,,College Carball Association x Immortals Bundle...,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35918,team toastie,team jstn,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 1, 'team': {'team': {'_id': '6328a0b...",5.0,"[{'_id': '6328aaebda9d7ca1c7bb42f0', 'blue': 3...",,,College Carball Association x Immortals Bundle...,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35919,team memory,team tcorrell,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a0d...",6.0,"[{'_id': '6328ab81da9d7ca1c7bb4336', 'blue': 4...",,,College Carball Association x Immortals Bundle...,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0


### Actually finding the reverse duplicates

So, originally I was planning to use a clever trick to find the reverse duplicates and operate on them separately from matchups that did not have reverse duplicates. (For the curious, the trick was to create a new dataframe with the column names swapped, and perform an inner join to find the reverse duplicates.)

However, I ultimately decided it was just easier to operate on all the series at once via iteration, as it doesn't take too long.
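
For the curious, the join-based trick mentioned above might look roughly like this. It is a sketch on toy data, not the code used in this analysis, and the variable names are made up:

```python
import pandas as pd

pairs = pd.DataFrame({
    "blue_team": ["nrg esports", "g2 esports", "cloud9"],
    "orange_team": ["g2 esports", "nrg esports", "gale force"],
})

# Same data with the two column names exchanged, i.e. every matchup reversed.
swapped = pairs.rename(columns={"blue_team": "orange_team", "orange_team": "blue_team"})

# A matchup survives the inner join only if its reverse also exists in the data.
reverse_dupes = pairs.drop_duplicates().merge(
    swapped.drop_duplicates(), on=["blue_team", "orange_team"], how="inner"
)
print(reverse_dupes)
```

Here only the NRG/G2 pairing (in both orders) is returned, since Cloud9 vs. Gale Force never appears reversed.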

**Okay, let's refresh our memory. We are looking to, in some way or another, allow our original dataframe to treat reverse matchup duplicates as if they were exact duplicates, because then we can group matchups properly and perform analyses.** The most straightforward approach here is to create a new column in our dataframe that states the matchup regardless of order.

So, let's apply the general strategy I outlined earlier:

**Step 1: Sort each combination of teams alphabetically and create a new column with this data.**  
**Step 2: Compare each blue_team, orange_team matchup with the order of the column in Step 1; if it is different, swap the names.**  
**Step 3: Concatenate the team names into a new column and append this back to our original dataframe.**

### Step 1:

Before we can start sorting our teams, we need to check for null values. 

In [41]:
matchup_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35921 entries, 0 to 35920
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   blue_team    35921 non-null  object
 1   orange_team  35915 non-null  object
dtypes: object(2)
memory usage: 561.4+ KB


In [42]:
matchup_df[pd.isnull(matchup_df["orange_team"])]

Unnamed: 0,blue_team,orange_team
11214,kim kardashian vs,
12654,illusionist esports vs,
13566,vs sway green,
14405,clappers vs,
20017,vs no clue,
20025,big goose vs,


So, it turns out when we split our raw "matchup" column earlier, there were some series missing a blue/orange team name.  

This is simply a matter of what we want to do with such types of series. I say let's assign their team name to be "tbd". We'll also have to remove the "vs" from the blue_team names.

In [43]:
# Rows where the split produced only one team name
missing = matchup_df["orange_team"].isna()
# Strip only a leading or trailing "vs" (plus its whitespace), not "vs" inside a team name
matchup_df.loc[missing, "blue_team"] = matchup_df.loc[missing, "blue_team"].str.replace(r"^vs\s+|\s+vs$", "", regex=True)
matchup_df.loc[missing, "orange_team"] = "tbd"

You may note we're just taking the team name that exists and slapping it under *blue_team*, although perhaps when the "vs" comes first, the team name that follows should actually be that of the orange_team. However, this really doesn't matter as we won't be looking at these series.

Okay, now we can actually do the sorting.

In [44]:
matchup_df["sorted"] = [sorted([x,y]) for x,y in zip(matchup_df["blue_team"], matchup_df["orange_team"])]
matchup_df

Unnamed: 0,blue_team,orange_team,sorted
0,chasers,team synergy,"[chasers, team synergy]"
1,lucky bounce,kings of urban,"[kings of urban, lucky bounce]"
2,cloud9,gale force,"[cloud9, gale force]"
3,who,canyons,"[canyons, who]"
4,chiefs esports,avant gaming,"[avant gaming, chiefs esports]"
...,...,...,...
35916,team memory,team gyro,"[team gyro, team memory]"
35917,team tcorrell,team andy,"[team andy, team tcorrell]"
35918,team toastie,team jstn,"[team jstn, team toastie]"
35919,team memory,team tcorrell,"[team memory, team tcorrell]"


### Step 2:

It's time to execute the actual team name swapping. We iterate through the dataframe row by row, checking whether the blue_team and orange_team names are in the same order as in the "sorted" column. If not, we swap them.

*Note this procedure may take up to 20 seconds.*

In [45]:
for i in matchup_df.index:
    sorted_pair = matchup_df.at[i, "sorted"]
    matchup_list = [matchup_df.at[i, "blue_team"], matchup_df.at[i, "orange_team"]]
    if matchup_list != sorted_pair:
        # .at writes to the dataframe itself, avoiding chained assignment
        matchup_df.at[i, "blue_team"] = sorted_pair[0]
        matchup_df.at[i, "orange_team"] = sorted_pair[1]
matchup_df

Unnamed: 0,blue_team,orange_team,sorted
0,chasers,team synergy,"[chasers, team synergy]"
1,kings of urban,lucky bounce,"[kings of urban, lucky bounce]"
2,cloud9,gale force,"[cloud9, gale force]"
3,canyons,who,"[canyons, who]"
4,avant gaming,chiefs esports,"[avant gaming, chiefs esports]"
...,...,...,...
35916,team gyro,team memory,"[team gyro, team memory]"
35917,team andy,team tcorrell,"[team andy, team tcorrell]"
35918,team jstn,team toastie,"[team jstn, team toastie]"
35919,team memory,team tcorrell,"[team memory, team tcorrell]"


Perfect. See that we've now changed the order of "blue_team" and "orange_team" so that it aligns with the order in the "sorted" column? Now, we can concatenate the team names:

In [46]:
matchup_df["matchup"] = matchup_df["blue_team"] + " " + "vs" + " " + matchup_df["orange_team"]
matchup_df = matchup_df.drop(columns=["blue_team", "orange_team", "sorted"])
matchup_df

Unnamed: 0,matchup
0,chasers vs team synergy
1,kings of urban vs lucky bounce
2,cloud9 vs gale force
3,canyons vs who
4,avant gaming vs chiefs esports
...,...
35916,team gyro vs team memory
35917,team andy vs team tcorrell
35918,team jstn vs team toastie
35919,team memory vs team tcorrell


In [47]:
matchup_df.value_counts()

matchup                           
g2 esports vs nrg esports             56
ground zero gaming vs renegades       49
nrg esports vs spacestation gaming    37
rogue vs spacestation gaming          36
nrg esports vs rogue                  36
                                      ..
downloadable content vs pneumono       1
downloadable content vs keratosis      1
down two earth vs zookeepers           1
down two earth vs zero issue           1
zap vs zero empathy                    1
Length: 24138, dtype: int64

Looks like we were successful in consolidating reverse duplicates into this count! There are still 24,138 unique matchups (down from 27,285), which isn't too surprising. The vast majority of teams appear in very few events before making name changes, roster changes, leaving the scene, falling behind, etc.
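
The sort-and-swap procedure could also be done without an explicit Python loop. The following is a sketch on toy data, assuming neither team name is missing; it builds the order-independent matchup label directly and leaves the side columns untouched:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "blue_team": ["nrg esports", "g2 esports", "who"],
    "orange_team": ["g2 esports", "nrg esports", "canyons"],
})

# Sort the two names within each row (axis=1), independent of blue/orange side.
ordered = np.sort(toy[["blue_team", "orange_team"]].to_numpy(dtype=object), axis=1)

# Build the order-independent label; blue_team/orange_team stay as they were.
toy["matchup"] = [f"{first} vs {second}" for first, second in ordered]
counts = toy["matchup"].value_counts()
print(counts)
```

Both orderings of NRG vs. G2 now collapse into a single label with a count of 2.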

Ok, perfect. Now all that remains is for us to append this column back into our original dataframe.

In [48]:
df = pd.concat([matchup_df, df], axis=1).copy()
df

Unnamed: 0,matchup,blue_team,orange_team,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,...,mode,tier,image,groups,stage_name,format,qualifier,lan,type,length
0,chasers vs team synergy,chasers,team synergy,2018-07-07,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,...,3,B,https://griffon.octane.gg/events/sam-champions...,,Playoffs,bracket-4se,,,best,7.0
1,kings of urban vs lucky bounce,lucky bounce,kings of urban,2016-07-09,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,...,3,S,https://griffon.octane.gg/events/rlcs.png,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,best,7.0
2,cloud9 vs gale force,cloud9,gale force,2017-12-03,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,...,3,S,https://griffon.octane.gg/events/eleague.png,,Playoffs,bracket-4se,,,best,7.0
3,canyons vs who,who,canyons,2020-05-23,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,...,3,C,https://griffon.octane.gg/events/red-bull-gami...,,Main Event,bracket-4se,,,best,5.0
4,avant gaming vs chiefs esports,chiefs esports,avant gaming,2018-07-08,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,...,3,A,https://griffon.octane.gg/events/gfinity.png,,Playoffs,bracket-4se,,,best,7.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,team gyro vs team memory,team memory,team gyro,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a08...",3.0,"[{'_id': '6328ab43c437fde7e02dbcd7', 'blue': 1...",,,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35917,team andy vs team tcorrell,team tcorrell,team andy,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a06...",4.0,"[{'_id': '6328abeeda9d7ca1c7bb4359', 'blue': 1...",,,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35918,team jstn vs team toastie,team toastie,team jstn,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 1, 'team': {'team': {'_id': '6328a0b...",5.0,"[{'_id': '6328aaebda9d7ca1c7bb42f0', 'blue': 3...",,,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0
35919,team memory vs team tcorrell,team memory,team tcorrell,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a0d...",6.0,"[{'_id': '6328ab81da9d7ca1c7bb4336', 'blue': 4...",,,...,3,B,https://griffon.octane.gg/events/immortals-bun...,,Playoffs,,,,best,7.0


Some more small quality-of-life improvements: let's condense the "type" and "length" columns into a single label (e.g. "Bo7") and drop the "image" column, as we won't need event logos in our statistical analyses.

In [49]:
df["length"] = ("Bo" + df["length"].astype(str)).str.slice(stop=3)
df = df.drop(columns=["type", "image"]).copy()
df

Unnamed: 0,matchup,blue_team,orange_team,date,blue,orange,number,games,reverseSweepAttempt,reverseSweep,event_name,region,mode,tier,groups,stage_name,format,qualifier,lan,length
0,chasers vs team synergy,chasers,team synergy,2018-07-07,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 2, 'team': {'team': {'_id': '6020bf0...",1.0,"[{'_id': '6043145f91504896348eae82', 'blue': 2...",,,SAM Championship Season 1,SAM,3,B,,Playoffs,bracket-4se,,,Bo7
1,kings of urban vs lucky bounce,lucky bounce,kings of urban,2016-07-09,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146091504896348eaf64', 'blue': 0...",,,RLCS Season 1 North America Stage 2,,3,S,"[rlcs, rlcs1, rlcsna, rlcs19, rlcs19lp]",Regional Championship,bracket-4se+3,,,Bo7
2,cloud9 vs gale force,cloud9,gale force,2017-12-03,"{'score': 1, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146191504896348eb05e', 'blue': 1...",,,ELEAGUE 2017,INT,3,S,,Playoffs,bracket-4se,,,Bo7
3,canyons vs who,who,canyons,2020-05-23,{'team': {'team': {'_id': '605d09394d63e1b16e2...,"{'score': 3, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb085', 'blue': 0...",,,Red Bull Gaming World Finals,EU,3,C,,Main Event,bracket-4se,,,Bo5
4,avant gaming vs chiefs esports,chiefs esports,avant gaming,2018-07-08,"{'score': 2, 'team': {'team': {'_id': '6020bc7...","{'score': 4, 'winner': True, 'team': {'team': ...",1.0,"[{'_id': '6043146291504896348eb089', 'blue': 3...",,,Gfinity Australia Elite Series Season 1,OCE,3,A,,Playoffs,bracket-4se,,,Bo7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,team gyro vs team memory,team memory,team gyro,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a08...",3.0,"[{'_id': '6328ab43c437fde7e02dbcd7', 'blue': 1...",,,College Carball Association x Immortals Bundle...,,3,B,,Playoffs,,,,Bo7
35917,team andy vs team tcorrell,team tcorrell,team andy,2022-09-18,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a06...",4.0,"[{'_id': '6328abeeda9d7ca1c7bb4359', 'blue': 1...",,,College Carball Association x Immortals Bundle...,,3,B,,Playoffs,,,,Bo7
35918,team jstn vs team toastie,team toastie,team jstn,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 1, 'team': {'team': {'_id': '6328a0b...",5.0,"[{'_id': '6328aaebda9d7ca1c7bb42f0', 'blue': 3...",,,College Carball Association x Immortals Bundle...,,3,B,,Playoffs,,,,Bo7
35919,team memory vs team tcorrell,team memory,team tcorrell,2022-09-19,"{'score': 4, 'winner': True, 'team': {'team': ...","{'score': 3, 'team': {'team': {'_id': '6328a0d...",6.0,"[{'_id': '6328ab81da9d7ca1c7bb4336', 'blue': 4...",,,College Carball Association x Immortals Bundle...,,3,B,,Playoffs,,,,Bo7
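
The `str.slice(stop=3)` trick above relies on every length rendering as a single digit followed by `.0`. If a length were missing or had two digits, a more defensive version might look like this sketch (the `lengths` series is made-up data, not from the dataset):

```python
import pandas as pd

lengths = pd.Series([7.0, 5.0, None, 11.0])

# Convert each present value to an integer label; keep missing values missing.
labels = lengths.map(lambda n: f"Bo{int(n)}" if pd.notna(n) else pd.NA)
print(labels)
```

Whether the real "length" column actually contains such values is not something the output above tells us, so this is purely precautionary.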


## Unraveling the "blue", "orange" and "games" columns

At this point, the bulk of the interesting game statistics is still hidden away in deeply nested lists of JSON dictionaries in the "blue", "orange", and "games" columns. In order to perform our exploratory analysis (and further visualisations), we will no doubt need to reformat and present this information in an easy-to-access way.
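
As a quick illustration of what flattening such nested dictionaries does, here is a minimal sketch using `pd.json_normalize` on two invented records shaped loosely like the "blue" column (the keys and values are simplified assumptions):

```python
import pandas as pd

toy = [
    {"score": 4, "winner": True,
     "team": {"team": {"name": "Chasers"}, "stats": {"core": {"goals": 12}}}},
    {"score": 2,
     "team": {"team": {"name": "Team Synergy"}, "stats": {"core": {"goals": 7}}}},
]

# Nested keys become dotted column names; keys absent from a record become NaN.
flat = pd.json_normalize(toy)
print(flat.columns.tolist())
```

Note how the second record has no "winner" key, which shows up as NaN after flattening, matching the pattern seen in the real data.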

Let's do our engineering on the blue team first, and the process will be analogous for the orange team later:

#### Blue team:

In [58]:
blue_df = pd.json_normalize(df["blue"])
blue_df

Unnamed: 0,score,winner,players,team.team._id,team.team.slug,team.team.name,team.team.image,team.stats.core.shots,team.stats.core.goals,team.stats.core.saves,...,team.stats.positioning.timeNeutralThird,team.stats.positioning.timeOffensiveThird,team.stats.positioning.timeDefensiveHalf,team.stats.positioning.timeOffensiveHalf,team.stats.positioning.timeBehindBall,team.stats.positioning.timeInfrontBall,team.stats.demo.inflicted,team.stats.demo.taken,team.team.region,team.team.relevant
0,4.0,True,[{'player': {'_id': '5f3d8fdd95f40596eae23f4d'...,6020bd08f1e4807cc7008781,8781-chasers,Chasers,https://griffon.octane.gg/teams/chasers.png,41.0,12.0,15.0,...,,,,,,,,,,
1,1.0,,[{'player': {'_id': '5f3d8fdd95f40596eae23d6e'...,6020bc70f1e4807cc70023c9,23c9-lucky-bounce,Lucky Bounce,https://griffon.octane.gg/teams/Lucky_Bounce_2...,28.0,7.0,11.0,...,1273.02,867.25,2690.55,1485.95,2939.21,1237.62,3.0,8.0,,
2,1.0,,[{'player': {'_id': '5f3d8fdd95f40596eae23d72'...,6020bc70f1e4807cc700239d,239d-cloud9,Cloud9,https://griffon.octane.gg/teams/cloud9.png,38.0,8.0,31.0,...,,,,,,,,,,
3,,,[{'player': {'_id': '5f3d8fdd95f40596eae23de1'...,605d09394d63e1b16e2bf768,f768-who,who?,,12.0,0.0,17.0,...,,,,,,,,,,
4,2.0,,[{'player': {'_id': '5f3d8fdd95f40596eae23dff'...,6020bc70f1e4807cc70023ce,23ce-chiefs-esports,Chiefs Esports,https://griffon.octane.gg/teams/chiefs-esports...,44.0,13.0,19.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,4.0,True,[{'player': {'_id': '612ca9439d3c410761e8adb9'...,6328a0c4da9d7ca1c7bb4176,4176-team-memory,Team Memory,,57.0,12.0,30.0,...,2302.24,1560.81,4473.18,2693.18,5291.26,1875.09,26.0,20.0,,
35917,4.0,True,[{'player': {'_id': '610656dd87f814e9fbffefa6'...,6328a0deda9d7ca1c7bb4179,4179-team-tcorrell,Team tcorrell,,53.0,12.0,45.0,...,2426.39,1580.08,5254.73,2717.24,5606.71,2365.24,28.0,13.0,,
35918,4.0,True,[{'player': {'_id': '5f3d8fdd95f40596eae241bc'...,6328a0e6da9d7ca1c7bb417a,417a-team-toastie,Team Toastie,,51.0,11.0,24.0,...,1663.23,1075.92,3430.28,1879.32,3838.73,1470.84,15.0,12.0,,
35919,4.0,True,[{'player': {'_id': '5f3d8fdd95f40596eae23d8e'...,6328a0c4da9d7ca1c7bb4176,4176-team-memory,Team Memory,,57.0,13.0,32.0,...,2374.40,1564.20,4480.19,2704.97,5352.71,1832.46,22.0,21.0,,


#### Blue team players

In [59]:
blue_players_df = pd.json_normalize(blue_df["players"])
blue_players_df

Unnamed: 0,0,1,2,3,4,5
0,"{'player._id': '5f3d8fdd95f40596eae23f4d', 'pl...","{'player._id': '5f99d0c8786e9eb85284db78', 'pl...","{'player._id': '5f3d8fdd95f40596eae23f65', 'pl...",,,
1,"{'player._id': '5f3d8fdd95f40596eae23d6e', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d71', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d72', 'pl...",,,
2,"{'player._id': '5f3d8fdd95f40596eae23d72', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d94', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d95', 'pl...",,,
3,"{'player._id': '5f3d8fdd95f40596eae23de1', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e28', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e2b', 'pl...",,,
4,"{'player._id': '5f3d8fdd95f40596eae23dff', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e00', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e01', 'pl...",,,
...,...,...,...,...,...,...
35916,"{'player._id': '612ca9439d3c410761e8adb9', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d8e', 'pl...","{'player._id': '5f3d8fdd95f40596eae2452d', 'pl...",,,
35917,"{'player._id': '610656dd87f814e9fbffefa6', 'pl...","{'player._id': '5f3d8fdd95f40596eae23f42', 'pl...","{'player._id': '5f9c839d5246bf27936bc978', 'pl...",,,
35918,"{'player._id': '5f3d8fdd95f40596eae241bc', 'pl...","{'player._id': '60d9c3b888116f536df97f94', 'pl...","{'player._id': '6328a270da9d7ca1c7bb419c', 'pl...",,,
35919,"{'player._id': '5f3d8fdd95f40596eae23d8e', 'pl...","{'player._id': '5f3d8fdd95f40596eae2452d', 'pl...","{'player._id': '612ca9439d3c410761e8adb9', 'pl...",,,


Yikes, that's not what we want at all. The data for the three or more blue team players in each series is still hidden away inside dictionaries, one column per player slot. We want the data of each player for each series. This necessitates another round of json_normalize (I did say this was going to be annoyingly complicated!). Also, since there is more than one blue team player per series and players can appear more than once, we need to somehow keep track of the series their stats are tied to. To do so, let's break down each of the columns first and then concatenate the resulting dataframes while ignoring index. This allows us to retain the index as a unique series identifier.
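
One possible alternative to normalizing each player slot separately is to explode the list of players first, so that every player gets a row while the series index is carried along. This is a sketch on invented toy records, not the approach taken below:

```python
import pandas as pd

# Toy "players" column: one list of player dictionaries per series.
toy = pd.Series([
    [{"player": {"tag": "A"}, "stats": {"core": {"goals": 2}}},
     {"player": {"tag": "B"}, "stats": {"core": {"goals": 1}}}],
    [{"player": {"tag": "C"}, "stats": {"core": {"goals": 0}}}],
], name="players")

# One row per player; the index still points at the originating series.
exploded = toy.explode().dropna()

players = pd.json_normalize(exploded.tolist())
players.insert(0, "series_index", exploded.index)
print(players)
```

The `series_index` column plays the same role as the retained index described above: it ties each player's stats back to a specific series.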

In [60]:
first_blue_player_df = pd.json_normalize(blue_players_df[blue_players_df.columns[0]])
first_blue_player_df

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.accounts,player.relevant,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.team.relevant,player.coach,player.substitute
0,5f3d8fdd95f40596eae23f4d,3f4d-caiotg1,CaioTG1,br,14.0,5.0,7.0,6.0,1620.0,35.714286,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d6e,3d6e-darkfire,DarkFire,us,13.0,1.0,6.0,3.0,1005.0,7.692308,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d72,3d72-torment,Torment,us,11.0,0.0,12.0,3.0,1390.0,0.000000,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23de1,3de1-rix_ronday,Rix_Ronday,nl,4.0,0.0,3.0,0.0,604.0,0.000000,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23dff,3dff-drippay,Drippay,au,24.0,8.0,8.0,3.0,1990.0,33.333333,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,612ca9439d3c410761e8adb9,adb9-spoods,spoods,ca,16.0,3.0,8.0,4.0,2186.0,18.750000,...,,,,,,,,,,
35917,610656dd87f814e9fbffefa6,efa6-tide,Tide,us,19.0,7.0,14.0,1.0,2670.0,36.842105,...,,,,,,,,,,
35918,5f3d8fdd95f40596eae241bc,41bc-toastie,Toastie,us,24.0,4.0,12.0,4.0,2321.0,16.666667,...,,,,,,,,,,
35919,5f3d8fdd95f40596eae23d8e,3d8e-memory,Memory,us,30.0,7.0,9.0,4.0,2729.0,23.333333,...,,,,,,,,,,


In [61]:
second_blue_player_df = pd.json_normalize(blue_players_df[blue_players_df.columns[1]])
second_blue_player_df

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.accounts,player.substitute,player.coach,player.relevant,player.team.relevant
0,5f99d0c8786e9eb85284db78,db78-noiisey,Noiisey,br,12.0,3.0,5.0,3.0,1280.0,25.000000,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d71,3d71-timbathy,Timbathy,us,10.0,4.0,4.0,0.0,875.0,40.000000,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d94,3d94-gimmick,Gimmick,us,13.0,4.0,7.0,2.0,1325.0,30.769231,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23e28,3e28-dmentza,DmentZa,es,2.0,0.0,8.0,0.0,999.0,0.000000,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23e00,3e00-jake,Jake,au,13.0,3.0,3.0,6.0,1190.0,23.076923,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,5f3d8fdd95f40596eae23d8e,3d8e-memory,Memory,us,29.0,5.0,15.0,3.0,2969.0,17.241379,...,,,,,,,,,,
35917,5f3d8fdd95f40596eae23f42,3f42-tcorrell,tcorrell,us,21.0,1.0,22.0,6.0,3277.0,4.761905,...,,,,,,,,,,
35918,60d9c3b888116f536df97f94,7f94-dripinho,Dripinho,us,17.0,4.0,9.0,2.0,1772.0,23.529412,...,,,,,,,,,,
35919,5f3d8fdd95f40596eae2452d,452d-reeves,Reeves,us,11.0,2.0,10.0,1.0,1637.0,18.181818,...,,,,,,,,,,


In [62]:
third_blue_player_df = pd.json_normalize(blue_players_df[blue_players_df.columns[2]])
third_blue_player_df

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.accounts,player.coach,player.relevant,player.substitute,player.team.relevant
0,5f3d8fdd95f40596eae23f65,3f65-protomz,Protomz,br,15.0,4.0,3.0,1.0,1255.0,26.666667,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d72,3d72-torment,Torment,us,5.0,2.0,1.0,3.0,790.0,40.000000,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d95,3d95-squishy,Squishy,ca,14.0,4.0,12.0,1.0,1565.0,28.571429,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23e2b,3e2b-nachitow,Nachitow,es,6.0,0.0,6.0,0.0,968.0,0.000000,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23e01,3e01-torsos,Torsos,au,7.0,2.0,8.0,1.0,1080.0,28.571429,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,5f3d8fdd95f40596eae2452d,452d-reeves,Reeves,us,12.0,4.0,7.0,1.0,1814.0,33.333333,...,,,,,,,,,,
35917,5f9c839d5246bf27936bc978,c978-dbanq,dbanq,us,13.0,4.0,9.0,3.0,2086.0,30.769231,...,,,,,,,,,,
35918,6328a270da9d7ca1c7bb419c,419c-zath,zath,us,10.0,3.0,3.0,2.0,1497.0,30.000000,...,,,,,,,,,,
35919,612ca9439d3c410761e8adb9,adb9-spoods,spoods,ca,16.0,4.0,13.0,2.0,2477.0,25.000000,...,,,,,,,,,,


Note: There are actually a few (albeit extremely few) non-null values in the 4th column of blue_players_df, which correspond to substitute players, and even 2 records with a 5th and 6th player! For the purposes of our analysis, we can disregard them.
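To see how rare those extra player slots are, one could count the non-null entries in each column. The snippet below is a minimal sketch on a made-up stand-in for blue_players_df (the name toy_players_df and its values are assumptions for illustration only):

```python
import pandas as pd

# Toy stand-in for blue_players_df: one column per player slot, with None
# where a series had no player in that slot. All values are made up.
toy_players_df = pd.DataFrame({
    0: [{"player.tag": "A"}, {"player.tag": "D"}, {"player.tag": "G"}],
    1: [{"player.tag": "B"}, {"player.tag": "E"}, {"player.tag": "H"}],
    2: [{"player.tag": "C"}, {"player.tag": "F"}, {"player.tag": "I"}],
    3: [None, {"player.tag": "Sub"}, None],
})

# Number of series that actually have a player in each slot.
slot_counts = toy_players_df.notna().sum()
print(slot_counts)
```

Running the same count on the real blue_players_df would show how many series have a 4th, 5th, or 6th player before we decide to drop them.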

Now, we can concatenate these three dataframes to form a master dataframe of all blue players and their stats. We'll add some other columns from our original df such as "tier" and "date" for ease of visualisation later:

In [80]:
blue_playerlist = [first_blue_player_df, second_blue_player_df, third_blue_player_df]
blue_players = pd.concat(blue_playerlist)
blue_players = blue_players.sort_index().copy()
blue_players = pd.concat([blue_players,
                          df["tier"].repeat(3),
                          df["date"].repeat(3)], axis=1).copy()
blue_players

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.team.relevant,player.coach,player.substitute,tier,date
0,5f3d8fdd95f40596eae23f4d,3f4d-caiotg1,CaioTG1,br,14.0,5.0,7.0,6.0,1620.0,35.714286,...,,,,,,,,,B,2018-07-07
0,5f3d8fdd95f40596eae23f65,3f65-protomz,Protomz,br,15.0,4.0,3.0,1.0,1255.0,26.666667,...,,,,,,,,,B,2018-07-07
0,5f99d0c8786e9eb85284db78,db78-noiisey,Noiisey,br,12.0,3.0,5.0,3.0,1280.0,25.000000,...,,,,,,,,,B,2018-07-07
1,5f3d8fdd95f40596eae23d6e,3d6e-darkfire,DarkFire,us,13.0,1.0,6.0,3.0,1005.0,7.692308,...,,,,,,,,,S,2016-07-09
1,5f3d8fdd95f40596eae23d72,3d72-torment,Torment,us,5.0,2.0,1.0,3.0,790.0,40.000000,...,,,,,,,,,S,2016-07-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35919,5f3d8fdd95f40596eae23d8e,3d8e-memory,Memory,us,30.0,7.0,9.0,4.0,2729.0,23.333333,...,,,,,,,,,B,2022-09-19
35919,5f3d8fdd95f40596eae2452d,452d-reeves,Reeves,us,11.0,2.0,10.0,1.0,1637.0,18.181818,...,,,,,,,,,B,2022-09-19
35920,60d9c3b888116f536df97f94,7f94-dripinho,Dripinho,us,10.0,2.0,5.0,6.0,1534.0,20.000000,...,,,,,,,,,B,2022-09-19
35920,5f3d8fdd95f40596eae241bc,41bc-toastie,Toastie,us,20.0,5.0,7.0,3.0,2315.0,25.000000,...,,,,,,,,,B,2022-09-19


However, we may run into problems when visualizing and manipulating our dataframe if it has duplicate indices. Therefore, let's bring those indices out as a new column.

In [None]:
blue_players = blue_players.reset_index().copy()
blue_players = blue_players.rename(columns={"index": "series_id"})
blue_players
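As a small illustration of why duplicate indices are awkward, here is a sketch on a toy frame (the frame and its values are made up; only the pattern matters):

```python
import pandas as pd

# Toy frame with a repeated index, mimicking three players from series 0
# and one player from series 1.
toy = pd.DataFrame({"player.tag": ["A", "B", "C", "D"]}, index=[0, 0, 0, 1])

print(toy.index.is_unique)  # False
print(toy.loc[0])           # selects all three rows labelled 0, not one row

# Moving the index into a column gives every row its own unique label
# while keeping the series identifier available for grouping.
toy_flat = toy.rename_axis("series_id").reset_index()
print(toy_flat)
```

Here rename_axis names the index before reset_index, which is an alternative to renaming the "index" column afterwards.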

Perfect. Now we have our data for all three blue team players of each series. Let's now do the same for the orange team.
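Since the per-column steps are identical for both teams, they could also be wrapped in a small helper. The sketch below is one hypothetical way to do that, not the code used in this notebook: the name flatten_team_players and the toy data are made up for illustration, and it swaps the repeat(3) approach for a join on the series index, which does not rely on both sides having the same row order.

```python
import pandas as pd

def flatten_team_players(players_df, series_meta, n_players=3):
    """Hypothetical helper: expand the first n_players slots into one row per player."""
    frames = []
    for slot in players_df.columns[:n_players]:
        # Normalize one slot column, then re-attach the series index explicitly.
        frame = pd.json_normalize(players_df[slot].tolist())
        frame.index = players_df.index
        frames.append(frame)
    # No ignore_index here: each row keeps the index of the series it came from.
    # mergesort is stable, so players stay in slot order within a series.
    stacked = pd.concat(frames).sort_index(kind="mergesort")
    # Attach per-series columns by aligning on the series index.
    stacked = stacked.join(series_meta)
    return stacked.rename_axis("series_id").reset_index()

# Toy inputs (made-up values) standing in for blue_players_df and df[["tier", "date"]].
toy_players_df = pd.DataFrame({
    0: [{"player.tag": "A", "stats.core.goals": 2}, {"player.tag": "D", "stats.core.goals": 1}],
    1: [{"player.tag": "B", "stats.core.goals": 0}, {"player.tag": "E", "stats.core.goals": 3}],
    2: [{"player.tag": "C", "stats.core.goals": 1}, {"player.tag": "F", "stats.core.goals": 0}],
})
toy_meta = pd.DataFrame({"tier": ["B", "S"], "date": ["2018-07-07", "2016-07-09"]})

toy_players = flatten_team_players(toy_players_df, toy_meta)
print(toy_players)
```

With a helper like this, the orange team would be a single call on orange_players_df, assuming its columns have the same layout as the blue team's.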

#### Orange team:

In [87]:
orange_df = pd.json_normalize(df["orange"])
orange_df

Unnamed: 0,score,players,team.team._id,team.team.slug,team.team.name,team.team.image,team.stats.core.shots,team.stats.core.goals,team.stats.core.saves,team.stats.core.assists,...,team.stats.positioning.timeNeutralThird,team.stats.positioning.timeOffensiveThird,team.stats.positioning.timeDefensiveHalf,team.stats.positioning.timeOffensiveHalf,team.stats.positioning.timeBehindBall,team.stats.positioning.timeInfrontBall,team.stats.demo.inflicted,team.stats.demo.taken,team.team.region,team.team.relevant
0,2.0,[{'player': {'_id': '5f3d8fdd95f40596eae23f4a'...,6020bf0bf1e4807cc7017c3f,7c3f-team-synergy,Team Synergy,https://griffon.octane.gg/teams/team-synergy.png,27.0,7.0,23.0,2.0,...,,,,,,,,,,
1,4.0,[{'player': {'_id': '5f3d8fdd95f40596eae23d7b'...,6020bc70f1e4807cc700239e,239e-kings-of-urban,Kings of Urban,https://griffon.octane.gg/teams/kings-of-urban...,37.0,16.0,15.0,12.0,...,1266.25,932.45,2687.93,1500.32,3007.45,1181.28,8.0,3.0,,
2,4.0,[{'player': {'_id': '5f3d8fdd95f40596eae23d9a'...,6020bc70f1e4807cc700239f,239f-gale-force,Gale Force,https://griffon.octane.gg/teams/gale-force.png,60.0,15.0,20.0,11.0,...,,,,,,,,,,
3,3.0,[{'player': {'_id': '5f3d8fdd95f40596eae23e1d'...,6020bc70f1e4807cc7002487,2487-canyons,Canyons,https://griffon.octane.gg/teams/canyons.png,28.0,5.0,10.0,3.0,...,,,,,,,,,,
4,4.0,[{'player': {'_id': '5f3d8fdd95f40596eae23e44'...,6020bcb8f1e4807cc700554d,554d-avant-gaming,Avant Gaming,https://griffon.octane.gg/teams/avant-gaming.png,32.0,13.0,28.0,7.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,3.0,[{'player': {'_id': '5f3d8fdd95f40596eae24512'...,6328a08bda9d7ca1c7bb4171,4171-team-gyro,Team Gyro,,53.0,10.0,34.0,8.0,...,2213.94,1255.43,4887.99,2260.23,5185.37,1962.84,20.0,26.0,,
35917,3.0,[{'player': {'_id': '5f3d8fdd95f40596eae24136'...,6328a06ec437fde7e02dbb50,bb50-team-andy,Team Andy,,68.0,14.0,25.0,12.0,...,2580.50,1891.81,4797.57,3127.26,5788.97,2135.90,13.0,28.0,,
35918,1.0,[{'player': {'_id': '5f3d8fdd95f40596eae23dcf'...,6328a0bbda9d7ca1c7bb4175,4175-team-jstn,Team jstn.,,42.0,9.0,25.0,7.0,...,1734.74,1142.67,3336.24,1966.17,3798.85,1503.54,12.0,15.0,,
35919,3.0,[{'player': {'_id': '610656dd87f814e9fbffefa6'...,6328a0deda9d7ca1c7bb4179,4179-team-tcorrell,Team tcorrell,,52.0,13.0,37.0,11.0,...,2382.95,1353.31,4752.04,2426.12,4889.73,2288.41,21.0,22.0,,


In [88]:
orange_players_df = pd.json_normalize(orange_df["players"])
orange_players_df

Unnamed: 0,0,1,2,3,4,5
0,"{'player._id': '5f3d8fdd95f40596eae23f4a', 'pl...","{'player._id': '5f3d8fdd95f40596eae23f4c', 'pl...","{'player._id': '5f3d8fdd95f40596eae23f61', 'pl...",,,
1,"{'player._id': '5f3d8fdd95f40596eae23d7b', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d7c', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d7a', 'pl...",,,
2,"{'player._id': '5f3d8fdd95f40596eae23d9a', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d9b', 'pl...","{'player._id': '5f3d8fdd95f40596eae23d9c', 'pl...",,,
3,"{'player._id': '5f3d8fdd95f40596eae23e1d', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e1e', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e20', 'pl...",,,
4,"{'player._id': '5f3d8fdd95f40596eae23e44', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e5e', 'pl...","{'player._id': '5f3d8fdd95f40596eae23e58', 'pl...",,,
...,...,...,...,...,...,...
35916,"{'player._id': '5f3d8fdd95f40596eae24512', 'pl...","{'player._id': '5f3d8fdd95f40596eae23edb', 'pl...","{'player._id': '6328a824da9d7ca1c7bb4243', 'pl...",,,
35917,"{'player._id': '5f3d8fdd95f40596eae24136', 'pl...","{'player._id': '5f3d8fdd95f40596eae2430e', 'pl...","{'player._id': '613cceea143c37878b236f82', 'pl...",,,
35918,"{'player._id': '5f3d8fdd95f40596eae23dcf', 'pl...","{'player._id': '5f3d8fdd95f40596eae2431b', 'pl...","{'player._id': '5f3d8fdd95f40596eae24532', 'pl...",,,
35919,"{'player._id': '610656dd87f814e9fbffefa6', 'pl...","{'player._id': '5f9c839d5246bf27936bc978', 'pl...","{'player._id': '5f3d8fdd95f40596eae23f42', 'pl...",,,


In [89]:
first_orange_player_df = pd.json_normalize(orange_players_df[orange_players_df.columns[0]])
first_orange_player_df

Unnamed: 0,player._id,player.slug,player.tag,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,advanced.goalParticipation,...,player.accounts,player.relevant,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.team.relevant,player.coach,player.substitute
0,5f3d8fdd95f40596eae23f4a,3f4a-nizzer,Nizzer,11.0,2.0,11.0,0.0,1245.0,18.181818,28.571429,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d7b,3d7b-jacob,Jacob,13.0,4.0,4.0,5.0,1195.0,30.769231,56.250000,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d9a,3d9a-kaydop,Kaydop,16.0,3.0,9.0,2.0,1365.0,18.750000,33.333333,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23e1d,3e1d-stake,Stake,10.0,2.0,6.0,1.0,1176.0,20.000000,60.000000,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23e44,3e44-plitz,Plitz,10.0,5.0,10.0,3.0,1405.0,50.000000,61.538462,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,5f3d8fdd95f40596eae24512,4512-comp,Comp,24.0,4.0,16.0,2.0,2910.0,16.666667,60.000000,...,,,,,,,,,,
35917,5f3d8fdd95f40596eae24136,4136-andy,Andy,27.0,5.0,11.0,2.0,2881.0,18.518519,50.000000,...,,,,,,,,,,
35918,5f3d8fdd95f40596eae23dcf,3dcf-jstn,jstn.,18.0,3.0,11.0,2.0,2039.0,16.666667,55.555556,...,,,,,,,,,,
35919,610656dd87f814e9fbffefa6,efa6-tide,Tide,18.0,7.0,10.0,2.0,2264.0,38.888889,69.230769,...,,,,,,,,,,


In [90]:
second_orange_player_df = pd.json_normalize(orange_players_df[orange_players_df.columns[1]])
second_orange_player_df

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.accounts,player.name,player.relevant,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.coach,player.team.relevant
0,5f3d8fdd95f40596eae23f4c,3f4c-wais,Wais,ar,9.0,4.0,8.0,1.0,1225.0,44.444444,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d7c,3d7c-sadjunior,Sadjunior,ca,12.0,6.0,3.0,4.0,1260.0,50.000000,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d9b,3d9b-turbopolsa,Turbopolsa,se,23.0,9.0,6.0,1.0,1860.0,39.130435,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23e1e,3e1e-tox,Tox,de,7.0,2.0,1.0,0.0,761.0,28.571429,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23e5e,3e5e-sammy,Sammy,au,11.0,3.0,10.0,3.0,1350.0,27.272727,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,5f3d8fdd95f40596eae23edb,3edb-gyro,Gyro.,us,18.0,3.0,8.0,5.0,2246.0,16.666667,...,,,,,,,,,,
35917,5f3d8fdd95f40596eae2430e,430e-busse,Busse,us,18.0,3.0,9.0,8.0,2470.0,16.666667,...,,,,,,,,,,
35918,5f3d8fdd95f40596eae2431b,431b-rahz,Rahz,us,14.0,3.0,6.0,4.0,1592.0,21.428571,...,,,,,,,,,,
35919,5f9c839d5246bf27936bc978,c978-dbanq,dbanq,us,13.0,2.0,12.0,4.0,2409.0,15.384615,...,,,,,,,,,,


In [91]:
third_orange_player_df = pd.json_normalize(orange_players_df[orange_players_df.columns[2]])
third_orange_player_df

Unnamed: 0,player._id,player.slug,player.tag,player.country,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.accounts,player.relevant,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.team.relevant,player.coach,player.substitute
0,5f3d8fdd95f40596eae23f61,3f61-freeway,freeway,ar,7.0,1.0,4.0,1.0,935.0,14.285714,...,,,,,,,,,,
1,5f3d8fdd95f40596eae23d7a,3d7a-fireburner,Fireburner,us,12.0,6.0,8.0,3.0,1140.0,50.000000,...,,,,,,,,,,
2,5f3d8fdd95f40596eae23d9c,3d9c-violentpanda,ViolentPanda,nl,21.0,3.0,5.0,8.0,1510.0,14.285714,...,,,,,,,,,,
3,5f3d8fdd95f40596eae23e20,3e20-zamue,Zamué,es,11.0,1.0,3.0,2.0,911.0,9.090909,...,,,,,,,,,,
4,5f3d8fdd95f40596eae23e58,3e58-zenulous,zenulous,au,11.0,5.0,8.0,1.0,1525.0,45.454545,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
35916,6328a824da9d7ca1c7bb4243,4243-liss,liss,us,11.0,3.0,10.0,1.0,1910.0,27.272727,...,,,,,,,,,,
35917,613cceea143c37878b236f82,6f82-banger,Banger,us,23.0,6.0,5.0,2.0,2072.0,26.086957,...,,,,,,,,,,
35918,5f3d8fdd95f40596eae24532,4532-dalton,Dalton,us,10.0,3.0,8.0,1.0,1568.0,30.000000,...,,,,,,,,,,
35919,5f3d8fdd95f40596eae23f42,3f42-tcorrell,tcorrell,us,21.0,4.0,15.0,5.0,2953.0,19.047619,...,,,,,,,,,,


In [109]:
orange_playerlist = [first_orange_player_df, second_orange_player_df, third_orange_player_df]
orange_players = pd.concat(orange_playerlist)
orange_players = orange_players.sort_index().copy()
orange_players = pd.concat([orange_players,
                            df["tier"].repeat(3),
                            df["date"].repeat(3)], axis=1).copy()
orange_players = orange_players.reset_index().copy()
orange_players = orange_players.rename(columns={"index": "series_id"})
orange_players

Unnamed: 0,series_id,player._id,player.slug,player.tag,stats.core.shots,stats.core.goals,stats.core.saves,stats.core.assists,stats.core.score,stats.core.shootingPercentage,...,player.team._id,player.team.slug,player.team.name,player.team.region,player.team.image,player.team.relevant,player.coach,player.substitute,tier,date
0,0,5f3d8fdd95f40596eae23f4a,3f4a-nizzer,Nizzer,11.0,2.0,11.0,0.0,1245.0,18.181818,...,,,,,,,,,B,2018-07-07
1,0,5f3d8fdd95f40596eae23f61,3f61-freeway,freeway,7.0,1.0,4.0,1.0,935.0,14.285714,...,,,,,,,,,B,2018-07-07
2,0,5f3d8fdd95f40596eae23f4c,3f4c-wais,Wais,9.0,4.0,8.0,1.0,1225.0,44.444444,...,,,,,,,,,B,2018-07-07
3,1,5f3d8fdd95f40596eae23d7b,3d7b-jacob,Jacob,13.0,4.0,4.0,5.0,1195.0,30.769231,...,,,,,,,,,S,2016-07-09
4,1,5f3d8fdd95f40596eae23d7a,3d7a-fireburner,Fireburner,12.0,6.0,8.0,3.0,1140.0,50.000000,...,,,,,,,,,S,2016-07-09
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
107758,35919,610656dd87f814e9fbffefa6,efa6-tide,Tide,18.0,7.0,10.0,2.0,2264.0,38.888889,...,,,,,,,,,B,2022-09-19
107759,35919,5f9c839d5246bf27936bc978,c978-dbanq,dbanq,13.0,2.0,12.0,4.0,2409.0,15.384615,...,,,,,,,,,B,2022-09-19
107760,35920,5f3d8fdd95f40596eae2452d,452d-reeves,Reeves,16.0,5.0,5.0,3.0,1887.0,31.250000,...,,,,,,,,,B,2022-09-19
107761,35920,5f3d8fdd95f40596eae23d8e,3d8e-memory,Memory,13.0,3.0,13.0,2.0,2282.0,23.076923,...,,,,,,,,,B,2022-09-19


### Saving our dataframes

Whew, all that wrangling and we're finally ready to dig in and search for insights and treasures! But first, let's save our dataframes!

Clearly, we don't want to fetch all that data from the API and wrangle it every time, so let's store the dataframes as pickle files. (In fact, when I was first working on this project, I re-ran the API request every time I added a line of code, and one afternoon my code suddenly slowed to a crawl and then stopped working... Yup, I was being rate limited. Lesson learned!)

In [111]:
df.to_pickle("/Users/Terru/Desktop/RL-Analysis/dataframes/df.pkl")
blue_players.to_pickle("/Users/Terru/Desktop/RL-Analysis/dataframes/blue_players.pkl")
orange_players.to_pickle("/Users/Terru/Desktop/RL-Analysis/dataframes/orange_players.pkl")
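To avoid re-running the fetch by accident, the save step can be paired with a "load if cached, otherwise build" check. The following is a rough sketch of that pattern under assumptions: load_or_build and build_toy_df are hypothetical names, and the toy builder stands in for the real API fetch and wrangling.

```python
import os
import tempfile
import pandas as pd

def load_or_build(path, build_fn):
    """Hypothetical helper: read a cached pickle if it exists, else build and cache it."""
    if os.path.exists(path):
        return pd.read_pickle(path)
    frame = build_fn()
    frame.to_pickle(path)
    return frame

# Stand-in for the expensive API fetch + wrangling step; counts its own calls.
calls = []
def build_toy_df():
    calls.append(1)
    return pd.DataFrame({"tier": ["B", "S"], "score": [2, 4]})

cache_path = os.path.join(tempfile.mkdtemp(), "toy_df.pkl")
first = load_or_build(cache_path, build_toy_df)   # builds the frame and writes the pickle
second = load_or_build(cache_path, build_toy_df)  # reads the pickle, no rebuild
print(len(calls))
```

The trade-off is that a stale pickle has to be deleted by hand whenever fresh data from the API is wanted.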

Let's also save a mini-version of these dataframes so they can be uploaded and previewed on GitHub:

## ^ TO-DO
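Until this step is filled in, one possible approach is to write only the first few hundred rows of each dataframe to a CSV file, which GitHub can render as a table. The sketch below is an assumption about how that might look, not the notebook's actual code: save_preview, the row count, and the file name are all hypothetical.

```python
import os
import tempfile
import pandas as pd

def save_preview(frame, path, n_rows=200):
    """Hypothetical helper: write the first n_rows of a dataframe to a CSV preview."""
    frame.head(n_rows).to_csv(path, index=False)

# Toy dataframe (made-up values) standing in for df, blue_players or orange_players.
toy_df = pd.DataFrame({"series_id": range(1000), "tier": ["B"] * 1000})

preview_path = os.path.join(tempfile.mkdtemp(), "df_preview.csv")
save_preview(toy_df, preview_path, n_rows=200)

# Read the preview back to confirm its size.
preview = pd.read_csv(preview_path)
print(preview.shape)
```

A random sample via frame.sample(n_rows, random_state=0) could be used instead of head if the preview should cover the whole date range.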

Sweet. This'll come in handy later when we want to perform visualisations and/or advanced modelling on our data in a separate notebook, as all we'll need to do is load the file.