In [1]:
import altair as alt
import pandas as pd

## Assignment description
**1)** Locate a dataset that you are interested in working with. The data should be sufficiently complex that you can ask lots of questions about it and engage in creative design techniques, but not so complex that you need specialized hardware or algorithmic approaches to analyze it. While you are welcome to use any data you’d like, I recommend that your dataset be tabular (e.g., CSV, TSV, SQL, etc.), contain 5,000 or fewer datapoints (on the order of one hundred or so tends to be sufficiently interesting without causing lag in Altair), and be data that you’re comfortable discussing as part of the course (e.g., avoid data that is overly private or classified).

**2)** Discuss your dataset, including the data’s source, key attributes/dimensions of the data, and your goals for working with that data (i.e., what are the key questions you want to answer). Identify existing relevant visualizations for working with that data (either using the same data, showing the same concepts, or just that might provide some inspiration) and critique those visualizations based on the practices from this module. What works well? What might need improvement or to change to answer your target questions? 

### Dataset
I chose a dataset covering every player in the squads of the teams participating in UEFA EURO 2024. It contains information about clubs, age, height, market value, and more, which makes it well suited to EDA and data visualization.

This dataset contains 623 records, one per soccer player who participated in UEFA EURO 2024.


Source - https://www.kaggle.com/datasets/damirdizdarevic/uefa-euro-2024-players

In [9]:
df = pd.read_csv("data/euro2024_players.csv")

In [10]:
df.head()

|   | Name | Position | Age | Club | Height | Foot | Caps | Goals | MarketValue | Country |
|---|------|----------|-----|------|--------|------|------|-------|-------------|---------|
| 0 | Marc-André ter Stegen | Goalkeeper | 32 | FC Barcelona | 187 | right | 40 | 0 | 28000000 | Germany |
| 1 | Manuel Neuer | Goalkeeper | 38 | Bayern Munich | 193 | right | 119 | 0 | 4000000 | Germany |
| 2 | Oliver Baumann | Goalkeeper | 34 | TSG 1899 Hoffenheim | 187 | right | 0 | 0 | 3000000 | Germany |
| 3 | Nico Schlotterbeck | Centre-Back | 24 | Borussia Dortmund | 191 | left | 12 | 0 | 40000000 | Germany |
| 4 | Jonathan Tah | Centre-Back | 28 | Bayer 04 Leverkusen | 195 | right | 25 | 0 | 30000000 | Germany |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 623 entries, 0 to 622
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Name         623 non-null    object
 1   Position     623 non-null    object
 2   Age          623 non-null    int64 
 3   Club         623 non-null    object
 4   Height       623 non-null    int64 
 5   Foot         620 non-null    object
 6   Caps         623 non-null    int64 
 7   Goals        623 non-null    int64 
 8   MarketValue  623 non-null    int64 
 9   Country      623 non-null    object
dtypes: int64(5), object(5)
memory usage: 48.8+ KB


### Discussion

As mentioned above, the data comes from Kaggle. The original source is Transfermarkt, an aggregator of statistics about soccer clubs and players.

The key dimensions in this data are **Position, Height, Goals, MarketValue**.

My goal is to analyze which **factors** are closely related to the **Market Value** of a soccer player.

To achieve this, I plan to build **scatterplots** for continuous data and **bar charts** for categorical columns.

### Distribution of Market value

In [28]:
# Histogram of market value over the full range (0 to 200M euro, 5M-euro bins)


hist_full = alt.Chart(data=df).mark_bar().encode(x=alt.X("MarketValue", scale=alt.Scale(domain=[0, 200_000_000]),
                                             bin=alt.Bin(extent=[0, 200_000_000], step=5_000_000), 
                                             title="Market Value in Euro",
                                             axis=alt.Axis(labelAngle=-30)),
                                     y=alt.Y("count()", scale=alt.Scale(domain=[0, 240])),
                                     tooltip=["count()"]
                                    ).properties(width=400,
                                                 height=300,
                                                title="Market Value Distribution")

hist_full.show()


This visualization is a classic starting point. We plot a histogram of players' market values to
get an idea of how prices are distributed. The distribution is heavily right-skewed and looks roughly exponential. In other words,
most players have market values below **20M euro**, while a few superstar players are valued far higher **(120-180M euro)**.
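
The claim that most players sit below 20M euro can be checked numerically rather than read off the bars. The following is a minimal sketch, assuming a `MarketValue` series in euros; `share_below` is a hypothetical helper name, and the sample values are made up for illustration rather than taken from the dataset.

```python
import pandas as pd


def share_below(values: pd.Series, threshold: float) -> float:
    """Return the fraction of values strictly below `threshold`."""
    return float((values < threshold).mean())


# Made-up sample for illustration only; in the notebook, pass df["MarketValue"].
sample = pd.Series([1_000_000, 3_000_000, 8_000_000, 15_000_000, 180_000_000])

# Fraction of the sample valued under 20M euro.
print(share_below(sample, 20_000_000))

# A mean well above the median is a simple sign of a long right tail.
print(sample.median() < sample.mean())
```

Comparing the median with the mean is only a rough indicator of skew, but it is usually enough to back up what the histogram suggests.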

This raises the question of what limits the X axis should have. Should we cut off the outliers to focus on the majority of players, or should we keep them to emphasize the skewness of the distribution?

**Let's trim the X axis and compare how we perceive both plots**

In [31]:
hist_trim = alt.Chart(df).transform_filter(
    (alt.datum.MarketValue >= 0) & (alt.datum.MarketValue <= 110_000_000)
).mark_bar(color="lightblue").encode(
    x=alt.X("MarketValue", scale=alt.Scale(domain=[0, 120_000_000]),
            bin=alt.Bin(extent=[0, 200_000_000], step=5_000_000), 
            title="Market Value in Euro (trimmed)",
            axis=alt.Axis(labelAngle=-30)),
    y=alt.Y("count()", scale=alt.Scale(domain=[0, 250])),
    tooltip=["count()"]
).properties(
    width=400,
    height=300,
    title="Market Value Distribution"
)


alt.concat(hist_full, hist_trim)

Both plots convey the basic shape of the Market Value distribution.
In my opinion, though, we should stick with the *first plot*: it shows the whole picture, including the few extremely expensive superstars in football.
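
One way to inform the choice between the two plots is to count how many players the trimmed version leaves out. The sketch below assumes the same closed range as the filter above (0 to 110M euro); `count_outside` is a hypothetical helper name, and the sample values are made up for illustration.

```python
import pandas as pd


def count_outside(values: pd.Series, low: float, high: float) -> int:
    """Count values outside the closed interval [low, high].

    Missing values are counted as outside, because `between` returns
    False for them.
    """
    return int((~values.between(low, high)).sum())


# Made-up sample for illustration only; in the notebook, pass df["MarketValue"].
sample = pd.Series([2_000_000, 40_000_000, 110_000_000, 150_000_000, 180_000_000])

# Number of values the trimmed histogram would drop.
print(count_outside(sample, 0, 110_000_000))
```

If only a handful of players fall outside the range, the trimmed plot loses little data, but it also hides exactly the players that make the distribution interesting.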

### Market Value by Position

In [40]:

bar_position = alt.Chart(df).mark_bar().encode(
    x=alt.X('Position:N', sort=alt.EncodingSortField(
        field='MarketValue',
        op='mean', 
        order='descending'  
    )),
    y=alt.Y('mean(MarketValue):Q', title='Mean Market Value'),
    color=alt.Color('Position:N',
                    legend=None,
                   scale=alt.Scale(scheme='category20')),  
    tooltip=['Position:N', 'mean(MarketValue):Q', "count()"]
).properties(
    width=600,
    height=400,
    title='Market Values vs Position'
)


bar_position.show()

This plot shows the mean market value of players in **each position**, which helps us see which positions tend to be more expensive on the transfer market. According to the data from **UEFA EURO 2024**, **Attacking Midfield** is the position with the highest mean market value.
However, it is hard to compare the averages of similar positions (Left, Right, and Central Midfield, for example), because the bars are sorted by value and related positions end up scattered along the axis. Color can group similar positions and make them easier to find.
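
The ranking shown by the bars can also be cross-checked as a table. This is a minimal sketch, assuming the `Position` and `MarketValue` columns from the dataset; `mean_value_by_position` is a hypothetical helper name, and the toy rows are made up for illustration.

```python
import pandas as pd


def mean_value_by_position(players: pd.DataFrame) -> pd.DataFrame:
    """Mean market value and player count per position, highest mean first."""
    return (
        players.groupby("Position")["MarketValue"]
        .agg(mean_value="mean", n_players="count")
        .sort_values("mean_value", ascending=False)
    )


# Made-up rows for illustration only; in the notebook, pass df.
toy = pd.DataFrame(
    {
        "Position": ["Goalkeeper", "Goalkeeper", "Attacking Midfield", "Centre-Back"],
        "MarketValue": [4_000_000, 6_000_000, 50_000_000, 30_000_000],
    }
)

summary = mean_value_by_position(toy)
print(summary)
```

Reading `n_players` next to the mean is useful because a mean over only a few players can be dominated by a single expensive one.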

In [44]:
position_colors = {
    'Goalkeeper': '#a6cee3', # light blue
    'Centre-Back': '#b2df8a', # light green
    'Left-Back': '#b2df8a', # light green
    'Right-Back': '#b2df8a', # light green
    'Defensive Midfield': '#fb9a99', # light red
    'Central Midfield': '#fb9a99', # light red
    'Attacking Midfield': '#fb9a99', # light red
    'Left Winger': '#fdbf6f', # light orange
    'Right Winger': '#fdbf6f', # light orange
    'Second Striker': '#fdbf6f', # light orange
    'Centre-Forward': '#fdbf6f', # light orange
    'Right Midfield': '#fb9a99', # light red
    'Left Midfield': '#fb9a99', # light red
}

bar_position_color = alt.Chart(df).mark_bar().encode(
    x=alt.X('Position:N', sort=alt.EncodingSortField(
        field='MarketValue',
        op='mean',
        order='descending'
    )),
    y=alt.Y('mean(MarketValue):Q', title='Mean Market Value'),
    color=alt.Color('Position:N',
                    legend=alt.Legend(title="Position", orient="right"),
                    scale=alt.Scale(domain=list(position_colors.keys()), range=list(position_colors.values()))
                   ),
    tooltip=['Position:N', 'mean(MarketValue):Q', 'count()']
).properties(
    width=600,
    height=400,
    title='Market Values vs Position'
)

bar_position_color.show()


Using the same color for similar positions improves the readability of the plot.

Now it is much easier to compare the average value of Left-Back with that of Centre-Back.
I think this is a good example of how a deliberate color choice can improve a visualization.
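
A possible variation is to store the grouping in the data instead of only in the color scale. The sketch below mirrors the color assignment in `position_colors` above. The group labels ("Defender", "Midfielder", "Forward"), the `PositionGroup` column, and the helper name are my own assumptions, not part of the dataset.

```python
import pandas as pd

# Assumed grouping that follows the colors used in position_colors.
POSITION_GROUPS = {
    "Goalkeeper": "Goalkeeper",
    "Centre-Back": "Defender",
    "Left-Back": "Defender",
    "Right-Back": "Defender",
    "Defensive Midfield": "Midfielder",
    "Central Midfield": "Midfielder",
    "Attacking Midfield": "Midfielder",
    "Right Midfield": "Midfielder",
    "Left Midfield": "Midfielder",
    "Left Winger": "Forward",
    "Right Winger": "Forward",
    "Second Striker": "Forward",
    "Centre-Forward": "Forward",
}


def add_position_group(players: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a PositionGroup column; unmapped positions become 'Other'."""
    out = players.copy()
    out["PositionGroup"] = out["Position"].map(POSITION_GROUPS).fillna("Other")
    return out


# Made-up rows for illustration only; in the notebook, pass df.
toy = pd.DataFrame({"Position": ["Left-Back", "Centre-Forward", "Sweeper"]})
grouped = add_position_group(toy)
print(grouped)
```

With such a column, the color channel could be encoded as `alt.Color('PositionGroup:N')`, so the legend would list four groups instead of thirteen positions that share four colors.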