# Data Extraction & Transformation

##### Parsing raw StatsBomb data and storing it in a Pandas DataFrame

---

In [25]:
import requests
import pandas as pd
from tqdm import tqdm

- `requests` is a great library for executing HTTP requests
- `pandas` is a data analysis and manipulation package
- `tqdm` is a clean progress bar library

---

In [26]:
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

These URLs point to where the raw StatsBomb data lives. Each `{}` is a placeholder that `.format()` fills in with a competition, season, or match ID.
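As a quick illustration of how the placeholders are filled, here is a minimal sketch that formats both templates. The IDs are taken from elsewhere in this notebook (the `parse_data` defaults of 16 and 4, and match 22912 from the output table); no request is sent.

```python
base_url = "https://raw.githubusercontent.com/statsbomb/open-data/master/data/"
comp_url = base_url + "matches/{}/{}.json"
match_url = base_url + "events/{}.json"

# Positional arguments fill the {} placeholders from left to right:
# competition ID first, then season ID.
example_comp_url = comp_url.format(16, 4)

# The events template has a single placeholder for the match ID.
example_match_url = match_url.format(22912)

print(example_comp_url)
print(example_match_url)
```

The templates themselves are left unchanged by `.format()`, so the same two strings can be reused for any competition, season, or match.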

---

In [27]:
def parse_data(competition_id=16, season_id=4):
    # Fetch the list of matches for the chosen competition and season
    matches = requests.get(url=comp_url.format(competition_id, season_id)).json()
    match_ids = [x["match_id"] for x in matches]

    all_events = []
    for match_id in tqdm(match_ids):
        # Download every event for this match and keep only the shots
        events = requests.get(url=match_url.format(match_id)).json()
        shots = [x for x in events if x["type"]["name"] == "Shot"]

        for y in shots:
            attributes = {
                "match_id": match_id,
                "team": y["possession_team"]["name"],
                "player": y["player"]["name"],
                "x": y["location"][0],
                "y": y["location"][1],
                "outcome": y["shot"]["outcome"]["name"],
            }
            all_events.append(attributes)

    # Build the DataFrame once, after every match has been processed
    return pd.DataFrame(all_events)
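
The filter-and-flatten step in the body of `parse_data` can be tried without any network access. The sketch below is only an illustration: `sample_events` is hand-written to mimic the fields the function reads (it is not a real StatsBomb record), and `flatten_shot` is a hypothetical helper name, not part of the notebook.

```python
# Hand-written events that mimic only the fields parse_data reads.
# Illustrative data, not real StatsBomb records.
sample_events = [
    {
        "type": {"name": "Pass"},
        "possession_team": {"name": "Liverpool"},
        "player": {"name": "Mohamed Salah"},
        "location": [60.0, 40.0],
    },
    {
        "type": {"name": "Shot"},
        "possession_team": {"name": "Liverpool"},
        "player": {"name": "Mohamed Salah"},
        "location": [108.2, 40.1],
        "shot": {"outcome": {"name": "Goal"}},
    },
]


def flatten_shot(event, match_id):
    """Hypothetical helper: flatten one nested shot event into a flat dict."""
    return {
        "match_id": match_id,
        "team": event["possession_team"]["name"],
        "player": event["player"]["name"],
        "x": event["location"][0],
        "y": event["location"][1],
        "outcome": event["shot"]["outcome"]["name"],
    }


# Keep only the shots, then flatten each one into a row.
shots = [e for e in sample_events if e["type"]["name"] == "Shot"]
rows = [flatten_shot(e, match_id=22912) for e in shots]
print(rows)
```

Passing a list of flat dicts like `rows` to `pd.DataFrame` yields one row per shot and one column per key, which is what `parse_data` does with `all_events`.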


In [28]:
df=parse_data()

  0%|          | 0/1 [00:00<?, ?it/s]


In [29]:
df.head(10)

| | match_id | team | player | x | y | outcome |
|---|---|---|---|---|---|---|
| 0 | 22912 | Liverpool | Mohamed Salah | 108.2 | 40.1 | Goal |
| 1 | 22912 | Tottenham Hotspur | Moussa Sissoko | 91.9 | 43.1 | Off T |
| 2 | 22912 | Liverpool | Trent Alexander-Arnold | 90.2 | 59.3 | Off T |
| 3 | 22912 | Liverpool | Mohamed Salah | 95.2 | 47.2 | Blocked |
| 4 | 22912 | Liverpool | Mohamed Salah | 113.0 | 59.5 | Wayward |
| 5 | 22912 | Liverpool | Andrew Robertson | 98.4 | 20.4 | Saved |
| 6 | 22912 | Liverpool | Mohamed Salah | 97.6 | 37.4 | Off T |
| 7 | 22912 | Liverpool | Jordan Brian Henderson | 89.0 | 47.9 | Blocked |
| 8 | 22912 | Liverpool | Mohamed Salah | 94.0 | 31.3 | Blocked |
| 9 | 22912 | Tottenham Hotspur | Christian Dannemann Eriksen | 96.1 | 41.6 | Off T |
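
With the shots in a DataFrame, simple summaries take one line each. As a rough sketch, the snippet below rebuilds only the ten rows displayed above by hand (so it runs offline and does not cover the whole match) and counts shots per team and per outcome. The variable names are illustrative.

```python
import pandas as pd

# The ten rows shown by df.head(10), typed in by hand so this runs offline.
rows = [
    (22912, "Liverpool", "Mohamed Salah", 108.2, 40.1, "Goal"),
    (22912, "Tottenham Hotspur", "Moussa Sissoko", 91.9, 43.1, "Off T"),
    (22912, "Liverpool", "Trent Alexander-Arnold", 90.2, 59.3, "Off T"),
    (22912, "Liverpool", "Mohamed Salah", 95.2, 47.2, "Blocked"),
    (22912, "Liverpool", "Mohamed Salah", 113.0, 59.5, "Wayward"),
    (22912, "Liverpool", "Andrew Robertson", 98.4, 20.4, "Saved"),
    (22912, "Liverpool", "Mohamed Salah", 97.6, 37.4, "Off T"),
    (22912, "Liverpool", "Jordan Brian Henderson", 89.0, 47.9, "Blocked"),
    (22912, "Liverpool", "Mohamed Salah", 94.0, 31.3, "Blocked"),
    (22912, "Tottenham Hotspur", "Christian Dannemann Eriksen", 96.1, 41.6, "Off T"),
]
columns = ["match_id", "team", "player", "x", "y", "outcome"]
preview = pd.DataFrame(rows, columns=columns)

# Number of shots taken by each team within these ten rows.
shots_per_team = preview.groupby("team").size()

# How often each shot outcome occurs within these ten rows.
outcome_counts = preview["outcome"].value_counts()

print(shots_per_team)
print(outcome_counts)
```

The same two calls work unchanged on the full `df` returned by `parse_data()`.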


---

Devin Pleuler 2020