# Football Transfers — Age vs Transfer Fee
DSA210 — Data Science Term Project  
Kerem Ersoy

I want to explore whether younger players are transferred for higher fees.  

This dataset contains football transfer records, including:
- player age
- transfer fee (millions EUR)
- player name and other attributes

For the 28 November submission, I will:
- Load and clean this dataset
- Perform exploratory data analysis
- Run hypothesis tests about the relationship between age and transfer fee


In [None]:
import os
os.makedirs('data', exist_ok=True)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

In [None]:
df = pd.read_csv(
    'data/premier-league.csv',
    engine='python',
    on_bad_lines='skip'
)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
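# Cleaning: keep only the columns needed for the analysis
data = df[["player_name", "age", "fee_cleaned"]].copy()

# Drop rows where age or fee is missing
data = data.dropna(subset=["age", "fee_cleaned"])

# Create high_fee indicator: 1 if the fee is at or above the median, else 0
median_fee = data["fee_cleaned"].median()
data["high_fee"] = (data["fee_cleaned"] >= median_fee).astype(int)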
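As a quick sanity check of the median split, the sketch below applies the same rule to a tiny made-up table. The values are invented for illustration and are not taken from the dataset. It also coerces the fee column with `pd.to_numeric`, on the assumption that `fee_cleaned` might contain non-numeric entries; if the column is already numeric, this step changes nothing.

```python
import pandas as pd

# Toy data, invented for illustration only (not from the transfer dataset)
toy = pd.DataFrame({
    "player_name": ["A", "B", "C", "D", "E"],
    "age": [19, 24, 31, 27, None],
    "fee_cleaned": ["12.5", "3.0", "n/a", "40.0", "8.0"],
})

# Turn anything that is not a number into NaN so it can be dropped
toy["fee_cleaned"] = pd.to_numeric(toy["fee_cleaned"], errors="coerce")
toy = toy.dropna(subset=["age", "fee_cleaned"])

# Same rule as above: fees at or above the median count as high
toy_median = toy["fee_cleaned"].median()
toy["high_fee"] = (toy["fee_cleaned"] >= toy_median).astype(int)
print(toy)
```

Because fees exactly equal to the median are labelled 1, the two groups do not have to be exactly the same size.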


In [None]:
data[["age", "high_fee"]].describe()

In this section, I:

- examine the distribution of age
- look at the balance between high-fee and low-fee transfers
- visualize the relationship between age and transfer fee


In [None]:
plt.figure(figsize=(8,5))
sns.histplot(data['age'], discrete=True)  # one bar per whole year of age, avoids empty bins
plt.title('Distribution of Age')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

In [None]:
data['high_fee'].value_counts(normalize=True)

In [None]:
plt.figure(figsize=(6,4))
sns.countplot(x='high_fee', data=data)
plt.title('High Fee Transfers (0 = Low, 1 = High)')
plt.xlabel('high_fee')
plt.ylabel('Count')
plt.show()

In [None]:
plt.figure(figsize=(8,6))
sns.scatterplot(x='age', y='fee_cleaned', data=data)
plt.title('Age vs Transfer Fee')
plt.xlabel('Age')
plt.ylabel('Transfer Fee (millions EUR)')
plt.show()

In [None]:
data[["age", "fee_cleaned"]].corr()
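A scatter plot can hide the trend when many points overlap, and a single correlation number cannot show a rise-then-fall pattern. One possible complement is to summarise fees by age band. The sketch below does this on a small made-up table; the values and the band edges are arbitrary choices for illustration, not results from the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "age": [18, 20, 22, 24, 26, 28, 30, 33],
    "fee_cleaned": [5.0, 15.0, 30.0, 22.0, 18.0, 10.0, 6.0, 2.0],
})

# Age bands are an arbitrary choice for this sketch (right edge included)
bands = pd.cut(
    toy["age"],
    bins=[15, 21, 25, 29, 40],
    labels=["<=21", "22-25", "26-29", "30+"],
)

# Number of transfers and median fee within each age band
summary = toy.groupby(bands, observed=True)["fee_cleaned"].agg(["count", "median"])
print(summary)
```

The median is used instead of the mean so that a few very expensive transfers do not dominate a band.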

## Hypothesis Test 1: Do high-fee transfers involve younger players?

- H0 (null): The average age is the same for high-fee vs low-fee transfers.  
- H1 (alt): High-fee transfers have a different (typically lower) average age.

I will use an independent-samples t-test (Welch's version, which does not assume equal variances in the two groups).


In [None]:
# Ages of players in each fee group
age_high = data[data['high_fee'] == 1]['age']
age_low = data[data['high_fee'] == 0]['age']

# Welch's t-test: equal_var=False does not assume equal variances
t_stat, p_val = stats.ttest_ind(age_high, age_low, equal_var=False)

t_stat, p_val

- If the p-value is below 0.05, I reject H0 and conclude that the average age differs significantly between high-fee and low-fee transfers. The sign of the t-statistic shows the direction: a negative value means high-fee players are younger on average.
- If the p-value is 0.05 or above, I fail to reject H0 and cannot claim a significant difference.
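Since I expect high-fee players to be younger, a one-sided version of the same Welch test is also possible. The sketch below uses synthetic ages generated with NumPy (not the real data) and assumes SciPy 1.6 or later, where `ttest_ind` accepts the `alternative` argument. It also reports Cohen's d as a rough effect size, because with many rows even a very small age difference can come out as significant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic ages for illustration only (not the transfer data)
age_high = rng.normal(loc=24.0, scale=3.0, size=200)
age_low = rng.normal(loc=26.0, scale=4.0, size=200)

# Two-sided Welch test: H1 is that the two means differ
t_two, p_two = stats.ttest_ind(age_high, age_low, equal_var=False)

# One-sided Welch test: H1 is mean(age_high) < mean(age_low)
t_one, p_one = stats.ttest_ind(
    age_high, age_low, equal_var=False, alternative="less"
)

# Cohen's d with a simple pooled standard deviation (equal group sizes here)
pooled_sd = np.sqrt((age_high.var(ddof=1) + age_low.var(ddof=1)) / 2)
cohens_d = (age_high.mean() - age_low.mean()) / pooled_sd

print(f"two-sided p = {p_two:.4g}, one-sided p = {p_one:.4g}, d = {cohens_d:.2f}")
```

The one-sided hypothesis should be chosen before looking at the data; picking the direction after seeing the result would overstate the evidence.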


## Hypothesis Test 2: Is there a correlation between age and transfer fee?

Here, both age and fee_cleaned are treated as continuous variables, so I use Pearson's correlation coefficient.

- H0: There is no correlation between age and transfer fee.  
- H1: There is a non-zero correlation between age and transfer fee.


In [None]:
corr, p_corr = stats.pearsonr(data['age'], data['fee_cleaned'])
corr, p_corr

- corr is Pearson's r: its sign gives the direction of the linear relationship and its magnitude gives the strength (closer to 1 or -1 = stronger).
- If the p-value is below 0.05, the correlation is statistically significant.
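Pearson's r only measures linear association and can be sensitive to a few very large fees. As a possible robustness check (an addition to the test above, not a replacement for it), Spearman's rank correlation can be computed alongside it. The sketch below uses synthetic, right-skewed fees, so its numbers say nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Synthetic data for illustration: fees fall with age and are right-skewed
age = rng.integers(17, 36, size=300)
fee = np.exp(3.0 - 0.08 * age + rng.normal(0.0, 0.8, size=300))

# Pearson measures linear association, Spearman measures monotonic association
r_pearson, p_pearson = stats.pearsonr(age, fee)
rho_spearman, p_spearman = stats.spearmanr(age, fee)

print(f"Pearson r = {r_pearson:.3f} (p = {p_pearson:.3g})")
print(f"Spearman rho = {rho_spearman:.3f} (p = {p_spearman:.3g})")
```

If the two coefficients disagree noticeably on the real data, that would be a hint that outliers or a non-linear pattern are influencing the Pearson result.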


## Summary

- I loaded and cleaned a football transfers dataset.
- I explored the distribution of age and high-fee labels.
- Scatter plots and descriptive statistics suggest that high-fee transfers may involve different age patterns.
- A t-test was used to compare average age between high-fee vs low-fee transfers.
- A correlation test examined the association between age and transfer fee.
