# Loading and Investigating World Cup Data

In this notebook, we learn how to load and inspect event data from Women's World Cup matches. We follow Prof. David Sumpter's [video](https://www.youtube.com/watch?v=GTtuOt03FM0&ab_channel=FriendsofTracking) to download the data and inspect it using Python. Throughout the notebook, we assume that both the Statsbomb and Wyscout data are available in the `data` directory. URLs to download the data are provided in the *References* section.

The event data is provided in JSON files, so we need the `json` package to load them. We also need `matplotlib` to plot the data, `numpy` to transform it, and `pandas` to inspect it as a dataframe.

In [None]:
import json
from typing import Union

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

from plot_utils import create_pitch

## Load data

First, we will use the Statsbomb data. Let us load information about the competitions for which data is available.

In [None]:
with open("../data/statsbomb/data/competitions.json", "r") as f:
    competitions: list = json.load(f)

We have a list of 19 competitions covered in the Statsbomb data. Let us look at the information of the first competition.

In [None]:
competitions[0]

In this notebook, we want to inspect data for the 2019 Women's World Cup. Its competition ID is `72`.

In [None]:
[competition for competition in competitions if competition["competition_id"] == 72]

In [None]:
competition_id: int = 72
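
Rather than hard-coding the IDs, they can also be looked up by name. The sketch below is a minimal example under assumptions: `find_competition` is a hypothetical helper, and `sample_competitions` is hand-written sample data that only imitates the shape of the entries in `competitions.json`. In the open-data repository layout, the matches file is named after the season ID, which is where the `30.json` used in this notebook comes from.

```python
from typing import Optional

# Illustrative records that imitate the shape of entries in competitions.json.
# They are sample data, not a copy of the real file.
sample_competitions = [
    {"competition_id": 49, "season_id": 3,
     "competition_name": "NWSL", "season_name": "2018"},
    {"competition_id": 72, "season_id": 30,
     "competition_name": "Women's World Cup", "season_name": "2019"},
]


def find_competition(competitions: list, name: str, season: str) -> Optional[dict]:
    """Return the first competition matching the name and season, or None."""
    for competition in competitions:
        if (competition["competition_name"] == name
                and competition["season_name"] == season):
            return competition
    return None


world_cup = find_competition(sample_competitions, "Women's World Cup", "2019")
print(world_cup["competition_id"], world_cup["season_id"])
```

With the real `competitions` list loaded earlier, the same call would replace `sample_competitions` with `competitions`.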

Let us load information about all matches from the competition.

In [None]:
with open(f"../data/statsbomb/data/matches/{competition_id}/30.json", "r") as f:
    matches: list = json.load(f)

There were 52 matches played during the World Cup.

In [None]:
len(matches)

Let us now print the result of every match in the World Cup. This will help us understand the structure of a match record.

While it would be better for readability to use `match["home_team"]["country"]["name"]`, the event data that we want to analyse specifies `match["home_team"]["home_team_name"]` for every event. The same applies to the away team.

In [None]:
match: dict
for match in matches:
    home_team_name: str = match["home_team"]["home_team_name"]
    away_team_name: str = match["away_team"]["away_team_name"]
    home_score: int = match["home_score"]
    away_score: int = match["away_score"]
    print(f"The match between {home_team_name} and {away_team_name} finished {home_score}-{away_score}")

Let us consider the final of the World Cup between the USA and the Netherlands and find its match ID.

In [None]:
required_home_team: str = "United States Women's"
required_away_team: str = "Netherlands Women's"

In [None]:
required_match_id: Union[int, str] = "Not found"
for match in matches:
    home_team_name: str = match["home_team"]["home_team_name"]
    away_team_name: str = match["away_team"]["away_team_name"]
    if (home_team_name == required_home_team) and (away_team_name == required_away_team):
        required_match_id = match["match_id"]
        break  # stop at the first matching fixture

print(f"{required_home_team} vs {required_away_team} has ID: {required_match_id}")
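
The search above can also be wrapped in a small function that returns `None` when no match is found, which avoids mixing an integer ID with a `"Not found"` string. This is only a sketch: `find_match_id` is a hypothetical helper, and the sample matches use made-up IDs purely to exercise it.

```python
from typing import Optional


def find_match_id(matches: list, home_team: str, away_team: str) -> Optional[int]:
    """Return the ID of the first match with the given home and away teams."""
    for match in matches:
        if (match["home_team"]["home_team_name"] == home_team
                and match["away_team"]["away_team_name"] == away_team):
            return match["match_id"]
    return None


# Hypothetical match records with made-up IDs, only to exercise the function.
sample_matches = [
    {"match_id": 1,
     "home_team": {"home_team_name": "England Women's"},
     "away_team": {"away_team_name": "Sweden Women's"}},
    {"match_id": 2,
     "home_team": {"home_team_name": "United States Women's"},
     "away_team": {"away_team_name": "Netherlands Women's"}},
]

final_id = find_match_id(sample_matches, "United States Women's", "Netherlands Women's")
# Swapping home and away finds nothing, because the order matters.
missing_id = find_match_id(sample_matches, "Netherlands Women's", "United States Women's")
print(final_id, missing_id)
```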

Let us now load the event data for this match based on its ID.

In [None]:
with open(f"../data/statsbomb/data/events/{required_match_id}.json", "r") as f:
    match_events: list = json.load(f)

This is the event data that we can use for purposes such as creating different kinds of plots and building models like expected goals. The first part of the data contains information about lineups and formations. After that, every on-ball event is recorded, including passes, interceptions, and shots. For a pass, the start and end coordinates (X, Y) are noted. For a shot, the data records the (X, Y) coordinate from which the shot was taken as well as where it ended up (inside or outside the frame of the goal).

Let us transform this data into a Pandas dataframe so that it is easier to inspect.

In [None]:
events: pd.DataFrame = (pd.json_normalize(match_events, sep="_")
                        .assign(match_id=required_match_id))
events.head()
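
To see where column names such as `type_name` or `pass_end_location` come from, it may help to run `pd.json_normalize` on a tiny input. The two events below are hand-written and only imitate the nested shape of the real data; their values are illustrative.

```python
import pandas as pd

# Two hand-written events that imitate the nested shape of Statsbomb events.
# The values are illustrative only.
sample_events = [
    {"id": "a1", "type": {"id": 30, "name": "Pass"}, "location": [60.0, 40.0],
     "pass": {"end_location": [75.0, 30.0]}},
    {"id": "b2", "type": {"id": 16, "name": "Shot"}, "location": [108.0, 36.0],
     "shot": {"statsbomb_xg": 0.12, "outcome": {"name": "Goal"}}},
]

# Nested dictionaries are flattened, and their keys are joined with `sep`.
# Lists such as `location` are kept as single list-valued cells.
flat = pd.json_normalize(sample_events, sep="_")
print(sorted(flat.columns))
```

Keys that are missing from an event, such as `shot` on a pass, become `NaN` in that row, which is one reason the full dataframe ends up with so many columns.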

This is a large dataframe with 117 columns! Let us filter it to only include data about shots.

In [None]:
shots: pd.DataFrame = events.loc[events["type_name"] == "Shot"].set_index("id")
shots.head()

## Plot data

As these are football events, we should ideally plot them on a pitch. Borrowing code from [SoccermaticsForPython](https://github.com/Friends-of-Tracking-Data-FoTD/SoccermaticsForPython/blob/master/FCPython.py), we can first plot the pitch using Matplotlib. The `create_pitch()` function defined in `plot_utils.py` generates the pitch, and it takes pitch length and width as input along with the units of those values. The event data provided by Statsbomb assumes the pitch to be measured in yards.

In [None]:
pitch_length_x: int = 120  # yards
pitch_width_y: int = 80  # yards

In [None]:
fig, ax = create_pitch(pitch_length_x, pitch_width_y, "yards", "gray")

In [None]:
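i: str  # the index is the event ID, because of set_index("id")
shot: pd.Series  # iterrows() yields each row as a Series
for i, shot in shots.iterrows():
    x: float = shot["location"][0]
    y: float = shot["location"][1]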

    is_goal: bool = shot["shot_outcome_name"] == "Goal"
    team_name: str = shot["team_name"]

    circle_size: float = np.sqrt(shot["shot_statsbomb_xg"] * 15)

    if team_name == required_home_team:
        shot_circle = plt.Circle((x, pitch_width_y - y), circle_size, color="red")
        if is_goal:
            plt.text((x + 1), (pitch_width_y - y + 1), shot["player_name"])
        else:
            shot_circle.set_alpha(0.2)
    else:
        shot_circle = plt.Circle((pitch_length_x - x, y), circle_size, color="blue")
        if is_goal:
            plt.text((pitch_length_x - x + 1), (y + 1), shot["player_name"])
        else:
            shot_circle.set_alpha(0.2)

    ax.add_patch(shot_circle)

plt.text(5, 75, f"{required_away_team} shots")
plt.text(80, 75, f"{required_home_team} shots")

# fig.set_size_inches(10, 7)
# fig.savefig("results/shots.pdf", dpi=100)
plt.show()
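
The coordinate handling in the loop above can be isolated into two small functions. This is a sketch with hypothetical helper names (`to_plot_coordinates`, `shot_radius`), assuming that the Statsbomb origin is at the top-left corner with y increasing downwards, which is why the home team's y value is flipped for Matplotlib. The away team's shots are mirrored along the length of the pitch instead, so that the two teams appear to attack opposite ends.

```python
import math

PITCH_LENGTH_X = 120  # yards, as used in this notebook
PITCH_WIDTH_Y = 80  # yards


def to_plot_coordinates(x: float, y: float, is_home_team: bool) -> tuple:
    """Map an event location to the plotting coordinates used in the shot map."""
    if is_home_team:
        # Flip y so the plot matches the data's top-left origin.
        return (x, PITCH_WIDTH_Y - y)
    # Mirror x so the away team attacks the opposite goal.
    return (PITCH_LENGTH_X - x, y)


def shot_radius(xg: float, scale: float = 15.0) -> float:
    """Circle radius chosen so that the circle's area grows linearly with xG."""
    return math.sqrt(xg * scale)


home_xy = to_plot_coordinates(108.0, 30.0, is_home_team=True)
away_xy = to_plot_coordinates(108.0, 30.0, is_home_team=False)
print(home_xy, away_xy, shot_radius(0.6))
```

Taking the square root means a shot with twice the xG gets a circle with twice the area rather than twice the radius.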

Let us now get the data for passes and plot the passes of *Megan Anna Rapinoe* of the USA. When plotting pass maps, it is advisable to plot the passes of one or two players rather than a whole team, as the latter leads to a pitch full of arrows from which it is difficult to derive meaningful insights.

In [None]:
required_player_name: str = "Megan Anna Rapinoe"

In [None]:
passes: pd.DataFrame = events.loc[events["type_name"] == "Pass"].set_index("id")
passes.head()

In [None]:
fig, ax = create_pitch(pitch_length_x, pitch_width_y, "yards", "gray")

In [None]:
a_pass: pd.Series  # `pass` is a Python keyword, so it cannot be used as a variable name.
for i, a_pass in passes.iterrows():
    if a_pass["player_name"] != required_player_name:
        continue

    x: float = a_pass["location"][0]
    y: float = a_pass["location"][1]

    pass_circle = plt.Circle((x, pitch_width_y - y), 2, color="blue")
    pass_circle.set_alpha(0.2)

    ax.add_patch(pass_circle)

    dx: float = a_pass["pass_end_location"][0] - x
    dy: float = a_pass["pass_end_location"][1] - y

    pass_arrow = plt.Arrow(x, (pitch_width_y - y), dx, -dy, width=3, color="blue")
    ax.add_patch(pass_arrow)

ax.set_title(f"Passes played by {required_player_name}")
# fig.set_size_inches(10, 7)
# fig.savefig("results/passes.pdf", dpi=100)
plt.show()
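
As an alternative to skipping rows inside the loop, the passes can be filtered for the player first and the arrow geometry computed afterwards. The sketch below uses a small hand-made dataframe with placeholder player names and coordinates. It only computes the `(x, y, dx, dy)` values that would be handed to `plt.Arrow`, with the same y flip as in the cell above.

```python
import pandas as pd

pitch_width_y = 80  # yards, as used in this notebook

# Hand-made sample passes with placeholder names and coordinates.
sample_passes = pd.DataFrame({
    "player_name": ["Player A", "Player B", "Player A"],
    "location": [[60.0, 40.0], [50.0, 20.0], [90.0, 10.0]],
    "pass_end_location": [[75.0, 30.0], [55.0, 25.0], [100.0, 40.0]],
})

# Keep only the rows for the player of interest before looping.
player_passes = sample_passes.loc[sample_passes["player_name"] == "Player A"]

arrows = []
for location, end_location in zip(player_passes["location"],
                                  player_passes["pass_end_location"]):
    x, y = location
    dx = end_location[0] - x
    dy = end_location[1] - y
    # Flip the start y and negate dy, because the plot's y-axis points upwards.
    arrows.append((x, pitch_width_y - y, dx, -dy))

print(arrows)
```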

## References
- [Statsbomb event data](https://github.com/statsbomb/open-data)
- [Wyscout event data](https://figshare.com/collections/Soccer_match_event_dataset/4415000/5)
- [Loading in and investigating World Cup data in Python](https://www.youtube.com/watch?v=GTtuOt03FM0&ab_channel=FriendsofTracking)
- [Making Your Own Shot and Pass Maps](https://www.youtube.com/watch?v=oOAnERLiN5U&ab_channel=FriendsofTracking)