-   Tyler Arista, tja9@calvin.edu
-   Student name, e-mail

# Instructions for today's practice

- Create a copy of this Jupyter Notebook and share it with your partner.
- Fill student names and e-mails in the text cell above.
- At the end of the practice, download the .ipynb file and upload it on Moodle.

![](https://images.unsplash.com/photo-1473976345543-9ffc928e648d?q=80&w=1859&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D)

# Our dataset: Premier League 2023/24 Player Dataset

Our dataset contains detailed statistics for individual football players from the 2023/24 Premier League season. The data is useful for analyzing player performance, team contributions, and other metrics like progressive passing, carrying, and expected goals. It is a simplified version of the dataset made available on [Kaggle](https://www.kaggle.com/datasets/orkunaktas/premier-league-all-players-stats-2324). You can download our data here: [https://cs.calvin.edu/courses/data/202/fsantos/premier-league.csv](https://cs.calvin.edu/courses/data/202/fsantos/premier-league.csv)

#### **Column Descriptions:**

1. **Player**: The name of the footballer.
2. **Nation**: The nationality of the footballer (e.g., ENG for England, ESP for Spain).
3. **Pos**: The player’s position(s) on the field (e.g., FW for Forward, MF for Midfielder, DF for Defender, GK for Goalkeeper).
4. **Age**: The player's age during the season.
5. **MP**: The total number of matches in which the player participated.
6. **Starts**: The number of matches the player started.
7. **Min**: Total minutes played by the player across all matches.
8. **90s**: The equivalent number of full 90-minute matches played by the player.
9. **Gls**: Total goals scored by the player.
10. **Ast**: Total assists made by the player.
11. **G+A**: The combined total of goals and assists for the player.
12. **G-PK**: Goals scored excluding penalty kicks.
13. **PK**: Total penalty goals scored by the player.
14. **PKatt**: Total number of penalty kicks attempted by the player.
15. **CrdY**: Total yellow cards received.
16. **CrdR**: Total red cards received.
17. **xG**: Expected goals—an estimate of the number of goals the player is expected to score based on the quality of their shots.
18. **npxG**: Non-penalty expected goals—expected goals excluding penalties.
19. **xAG**: Expected assists—an estimate of how many assists the player should have based on their passes.
20. **npxG+xAG**: Combined non-penalty expected goals and expected assists.
21. **PrgC**: Progressive carries—the number of times the player carried the ball forward at least 5 yards.
22. **PrgP**: Progressive passes—the number of passes that moved the ball forward by at least 10 yards or into the attacking third of the pitch.
23. **PrgR**: Progressive runs—the number of times the player made forward runs with the ball.
24. **Team**: The team the player belongs to in the Premier League.

# Initial analyses

**📝 Exercise 1**: Load the dataset and print some rows of it. Then, calculate basic statistics such as mean, median, and standard deviation for numerical columns like Gls, Ast, xG, etc.

In [1]:
import pandas as pd
df = pd.read_csv("https://cs.calvin.edu/courses/data/202/fsantos/premier-league.csv")
df.head()

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,CrdY,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Team
0,Rodri,es ESP,MF,27.0,34,34,2931.0,32.6,8.0,9.0,...,8.0,1.0,4.1,4.1,3.9,8.0,76.0,376.0,55.0,Manchester City
1,Phil Foden,eng ENG,"FW,MF",23.0,35,33,2857.0,31.7,19.0,8.0,...,2.0,0.0,10.3,10.3,8.4,18.7,93.0,168.0,269.0,Manchester City
2,Ederson,br BRA,GK,29.0,33,33,2785.0,30.9,0.0,0.0,...,5.0,0.0,0.0,0.0,0.1,0.1,0.0,4.0,0.0,Manchester City
3,Julián Álvarez,ar ARG,"MF,FW",23.0,36,31,2647.0,29.4,11.0,8.0,...,2.0,0.0,13.0,11.5,6.4,17.9,64.0,103.0,180.0,Manchester City
4,Kyle Walker,eng ENG,DF,33.0,32,30,2767.0,30.7,0.0,4.0,...,2.0,0.0,0.4,0.4,2.6,3.0,74.0,157.0,172.0,Manchester City


In [2]:
mean_goals = df['Gls'].mean()
median_Ast = df['Ast'].median()
std_g_pk = df['G-PK'].std()

print("Mean Goals: ", mean_goals)
print("Median Assists:", median_Ast)
print("Standard Deviation of Goals - Penalty Kicks:", std_g_pk)

Mean Goals:  2.0637931034482757
Median Assists: 0.0
Standard Deviation of Goals - Penalty Kicks: 3.1897391287699906


**📝 Exercise 2**: Identify the 10 first players with the most goals scored that were not penalty kicks (i.e., `G-PK`).

In [26]:
most_goals_not_penalty = df.sort_values('G-PK', ascending=False).head(10)
most_goals_not_penalty[['Player', 'G-PK']].head(10)

Unnamed: 0,Player,G-PK
6,Erling Haaland,20.0
204,Ollie Watkins,19.0
1,Phil Foden,19.0
369,Dominic Solanke,17.0
237,Jarrod Bowen,16.0
117,Alexander Isak,16.0
146,Son Heung-min,15.0
82,Nicolas Jackson,14.0
458,Chris Wood,14.0
265,Jean-Philippe Mateta,14.0


**📝 Exercise 3**: Derive a new row, `Goals per 90s`, consisting of how many goals the player score for every 90 minutes of game (`Gls/90s`). Show the dataframe.

In [6]:
df["Goals Per 90s"] = df['Gls'] / df['90s']
df.head()

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Team,Goals Per 90s
0,Rodri,es ESP,MF,27.0,34,34,2931.0,32.6,8.0,9.0,...,1.0,4.1,4.1,3.9,8.0,76.0,376.0,55.0,Manchester City,0.245399
1,Phil Foden,eng ENG,"FW,MF",23.0,35,33,2857.0,31.7,19.0,8.0,...,0.0,10.3,10.3,8.4,18.7,93.0,168.0,269.0,Manchester City,0.599369
2,Ederson,br BRA,GK,29.0,33,33,2785.0,30.9,0.0,0.0,...,0.0,0.0,0.0,0.1,0.1,0.0,4.0,0.0,Manchester City,0.0
3,Julián Álvarez,ar ARG,"MF,FW",23.0,36,31,2647.0,29.4,11.0,8.0,...,0.0,13.0,11.5,6.4,17.9,64.0,103.0,180.0,Manchester City,0.37415
4,Kyle Walker,eng ENG,DF,33.0,32,30,2767.0,30.7,0.0,4.0,...,0.0,0.4,0.4,2.6,3.0,74.0,157.0,172.0,Manchester City,0.0


# Filtering

**📝 Exercise 4**: Filter the dataset to return all players who are younger than 25. Call this new dataframe `young_players`. Who are the best scoring players there? Show that.

In [7]:
young_players = df[df['Age'] < 25]
young_players.sort_values(by='Gls', ascending=False).head(10)

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Team,Goals Per 90s
6,Erling Haaland,no NOR,FW,23.0,31,29,2552.0,28.4,27.0,5.0,...,0.0,29.2,22.9,4.3,27.2,35.0,26.0,126.0,Manchester City,0.950704
83,Cole Palmer,eng ENG,"FW,MF",21.0,33,29,2607.0,29.0,22.0,11.0,...,0.0,18.2,11.1,11.1,22.2,117.0,197.0,195.0,Chelsea,0.758621
117,Alexander Isak,se SWE,FW,23.0,30,27,2255.0,25.1,21.0,2.0,...,0.0,20.3,15.6,3.7,19.3,68.0,71.0,129.0,Newcastle United,0.836653
1,Phil Foden,eng ENG,"FW,MF",23.0,35,33,2857.0,31.7,19.0,8.0,...,0.0,10.3,10.3,8.4,18.7,93.0,168.0,269.0,Manchester City,0.599369
59,Bukayo Saka,eng ENG,FW,21.0,35,35,2919.0,32.4,16.0,9.0,...,0.0,15.5,10.8,10.5,21.2,155.0,126.0,508.0,Arsenal,0.493827
82,Nicolas Jackson,sn SEN,FW,22.0,35,31,2799.0,31.1,14.0,5.0,...,0.0,18.6,18.6,4.3,22.9,70.0,67.0,200.0,Chelsea,0.450161
62,Kai Havertz,de GER,"MF,FW",24.0,37,30,2634.0,29.3,13.0,7.0,...,0.0,12.3,11.6,4.4,15.9,55.0,99.0,191.0,Arsenal,0.443686
404,Matheus Cunha,br BRA,"FW,MF",24.0,32,29,2440.0,27.1,12.0,7.0,...,0.0,9.5,8.7,3.2,11.9,107.0,87.0,139.0,Wolverhampton,0.442804
114,Anthony Gordon,eng ENG,FW,22.0,35,34,2890.0,32.1,11.0,10.0,...,1.0,10.2,9.4,8.0,17.4,138.0,101.0,232.0,Newcastle United,0.342679
32,Darwin Núñez,uy URU,FW,24.0,36,22,2047.0,22.7,11.0,8.0,...,0.0,16.3,15.5,6.0,21.5,58.0,54.0,206.0,Liverpool,0.484581


**📝 Exercise 5**: In the same way, who are the best scoring players from England? Filter the dataset and show that.

In [8]:
df[df['Nation'] == 'br BRA'].sort_values(by='Gls', ascending=False).head(10)

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,CrdR,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Team,Goals Per 90s
404,Matheus Cunha,br BRA,"FW,MF",24.0,32,29,2440.0,27.1,12.0,7.0,...,0.0,9.5,8.7,3.2,11.9,107.0,87.0,139.0,Wolverhampton,0.442804
155,Richarlison,br BRA,FW,26.0,28,18,1491.0,16.6,11.0,4.0,...,0.0,9.6,9.6,2.1,11.7,26.0,38.0,106.0,Tottenham Hotspur,0.662651
296,Rodrigo Muniz,br BRA,FW,22.0,26,18,1593.0,17.7,9.0,1.0,...,0.0,8.7,8.7,2.0,10.7,8.0,19.0,76.0,Fulham,0.508475
207,Douglas Luiz,br BRA,MF,25.0,35,35,2993.0,33.3,9.0,5.0,...,0.0,6.9,3.8,5.4,9.2,60.0,168.0,50.0,Aston Villa,0.27027
345,João Pedro,br BRA,"FW,MF",21.0,31,19,2045.0,22.7,9.0,3.0,...,0.0,11.9,7.9,3.8,11.7,83.0,92.0,205.0,Brighton,0.396476
112,Bruno Guimarães,br BRA,MF,25.0,37,37,3263.0,36.3,7.0,8.0,...,0.0,4.8,4.8,6.4,11.2,65.0,283.0,73.0,Newcastle United,0.192837
63,Gabriel Martinelli,br BRA,FW,22.0,35,24,2019.0,22.4,6.0,4.0,...,0.0,6.8,6.8,6.1,12.9,127.0,65.0,345.0,Arsenal,0.267857
242,Lucas Paquetá,br BRA,"FW,MF",25.0,31,31,2622.0,29.1,4.0,6.0,...,0.0,5.4,3.8,5.4,9.2,34.0,187.0,116.0,West Ham United,0.137457
294,Willian,br BRA,FW,34.0,31,24,2053.0,22.8,4.0,2.0,...,0.0,5.1,3.6,3.9,7.5,92.0,153.0,181.0,Fulham,0.175439
66,Gabriel Jesus,br BRA,FW,26.0,27,17,1478.0,16.4,4.0,5.0,...,0.0,6.3,6.3,3.8,10.1,38.0,42.0,155.0,Arsenal,0.243902


# Grouping

**📝 Exercise 6**: Group the dataset by `Nation` and calculate the total number of goals (Gls) scored by players from each country. (For example, call this new dataframe `total_goals_by_country`) Find the top 5 countries by total goals scored.

Remember to use the syntax we've been recommending:

```python
new_dataframe = (
  dataframe
  .groupby('Column', as_index=False)
  .agg(new_column1=('Col', 'function'),
       new_column2=('Col', 'function') )
)
```

In [17]:
total_goals_by_country = (
    df
    .groupby('Nation', as_index = False)
    .agg(totalNumGoals = ('Gls', 'sum'))
    .sort_values('totalNumGoals', ascending = False)
)
total_goals_by_country.head()

Unnamed: 0,Nation,totalNumGoals
21,eng ENG,369.0
7,br BRA,97.0
23,fr FRA,63.0
2,ar ARG,43.0
45,no NOR,39.0


We will now create age groups (e.g., < 25, 25-30, > 30) in our data. This will be a new column `Age Group` with three categories (represented as strings, `'< 25'`, `'25-30'`, `'> 30'`). We will use the function `cut()` for doing that. Observe:

In [18]:
bins = [0, 25, 30, 100]  # Defining bins for age groups
labels = ['< 25', '25-30', '> 30']  # Defining labels for the bins
df['Age Group'] = pd.cut(df['Age'], bins=bins, labels=labels, right=False)

**📝 Exercise 7**: Now, group the dataset by these age groups (`Age Group`). Calculate the total number of yellow cards (`CrdY`) and red cards (`CrdR`) received by players in each group.

In [20]:
cards_by_age_group = (
    df
    .groupby('Age Group', as_index=False)
    .agg(totalYellowCards=('CrdY', 'sum'), totalRedCards=('CrdR', 'sum'))
)
cards_by_age_group

  df


Unnamed: 0,Age Group,totalYellowCards,totalRedCards
0,< 25,634.0,24.0
1,25-30,743.0,25.0
2,> 30,276.0,9.0


Now, let's do some transformations. Notice the column `Pos`, that may contain multiple labels for the same player. We are performing an "explode" operation here. Observe:

In [21]:
# Split the positions into multiple possible ones for players who play in more than one position
df['Pos'] = df['Pos'].str.split(',')

# Explode the positions into separate rows for each player
df_exploded = df.explode('Pos')

df_exploded[df_exploded['Player']=="Phil Foden"]

Unnamed: 0,Player,Nation,Pos,Age,MP,Starts,Min,90s,Gls,Ast,...,xG,npxG,xAG,npxG+xAG,PrgC,PrgP,PrgR,Team,Goals Per 90s,Age Group
1,Phil Foden,eng ENG,FW,23.0,35,33,2857.0,31.7,19.0,8.0,...,10.3,10.3,8.4,18.7,93.0,168.0,269.0,Manchester City,0.599369,< 25
1,Phil Foden,eng ENG,MF,23.0,35,33,2857.0,31.7,19.0,8.0,...,10.3,10.3,8.4,18.7,93.0,168.0,269.0,Manchester City,0.599369,< 25


**CAREFUL**: notice that we are updating the column `Pos` with a new value. So, if we run this code again, it will mess up the column again and then our `df_exploded` will be compromised!
If you happen to run this code twice, remember to load it again in the first code cell!
Or, alternatively, you can do these kinds of operations with a copy of the original dataframe, like: `new_df = df.copy()`. This way, you will not mess up the original dataframe.

Check this new dataframe. What happened? Observe specifically players who play in multiple positions, like Phil Foden.

**📝 Exercise 8**: Now, using this new dataframe, `df_exploded`, get the average number of goals per position (`Pos`).

In [22]:
average_goals_per_position = (
    df_exploded
    .groupby('Pos', as_index=False)
    .agg(averageGoals=('Gls', 'mean'))
)
average_goals_per_position


Unnamed: 0,Pos,averageGoals
0,DF,0.834821
1,FW,3.963134
2,GK,0.0
3,MF,2.191571


**📝 Exercise 9**: With `df_exploded`, group by both `Pos` and `Team` (remember to pass a list to `groupby()`, and calculate the mean number of goals per 90 minutes (`PrgC`/`90s`). Order these means in descending order.

In [24]:
average_prgC_per_90s = (
    df_exploded
    .groupby(['Pos', 'Team'], as_index=False)
    .agg(averagePrgCPer90s=('PrgC', 'mean'))
    .sort_values('averagePrgCPer90s', ascending=False)
)
average_prgC_per_90s

Unnamed: 0,Pos,Team,averagePrgCPer90s
32,FW,Manchester City,81.750000
20,FW,Arsenal,54.000000
72,MF,Manchester City,52.764706
30,FW,Liverpool,48.222222
37,FW,Tottenham Hotspur,46.818182
...,...,...,...
57,GK,Tottenham Hotspur,0.000000
56,GK,Sheffield United,0.000000
55,GK,Nottingham Forest,0.000000
54,GK,Newcastle United,0.000000


# Reflection: what is the point of sports analytics?

- Check this [article](https://theconvivialsociety.substack.com/p/the-limits-of-optimization) by L. M. Sacasas commenting on **how sports today have changed with the introduction of quantification and analytics**.
- Are we really having fun with that?

> “So you can’t blame anyone for the way the game has developed,” Jacobs concludes. “It has become more rational, with a better command of the laws of probability, and stricter, more rigorous canons of efficiency.”

> "It’s worth pausing to consider wherein the purported rationality lies. It is the logic of competition. Within the sporting world, of course, the point is to win, and to do so in a way that can be clearly determined quantitatively. There are no grounds for anyone to ask a manager or a player to pursue a strategy that will diminish their competitive edge. Most of life, however, is not a game with quantifiable outcomes, and probably shouldn’t be treated as such. However, the triumph of technique in Ellul’s sense encourages the competitive mode of experience. Indeed, quantification itself invites it. This dynamic can be put to beneficial use, and, in clearly delineated circumstances, is perfectly appropriate. But applied uncritically and indiscriminately or even nefariously (see e.g. social media metrics) it can introduce destructive tendencies and eclipse qualitative or otherwise unquantifiable values. Generally speaking, quantification and the logic of optimization which it encourages tend to transform our field of experience into points of aggression, as the sociologist Hartmut Rosa has aptly put it. Data-driven optimization is, in this sense, a way of perceiving the world. And what may matter most about this is not necessarily what it allows us to see, but it keeps us from perceiving: in short, all that cannot be quantified or measured."

**📝 Reflection Exercise**: Write a sentence or two of your overall
reflections on this practice. You may write whatever you want, but you
might perhaps respond to one or two of these questions:

-   Was anything unclear about this assignment?
-   How hard was it for you? Where did you get “stuck”?
-   How long did it take you?
-   What questions or uncertainties remain?
-   What skills do you think you’ll need more practice with?
-   Did you try anything out of curiosity that you weren’t specifically
    asked to do?

**Reflection Response:** I thought that this practice assignment was clear overall and I thought the way that topics were sectioned off/grouped was really helpful for understanding the material. I did find this assignment to be a little bit more difficult than last week's because some of the syntax was a little bit trickier, but will become easier over time. I would say this assignment took me about 35-40 minutes to complete just to make sure everything was done right from a syntax perspective. I would say I don't have any other questions specifically about this assignment.