# 🧑‍🤝‍🧑 Team Task: Exploratory Data Analysis (EDA)

Dear team,  

This notebook is our shared workspace. Each member should explore the dataset and contribute their insights. Please follow the steps below:

1. **Explore the dataset**  
   - Look at the columns, data types, and missing values.  
   - Summarize key statistics (mean, median, min, max, etc.).  
   - Visualize distributions (histograms, boxplots, etc.).  

2. **Analyze patterns**  
   - Identify correlations between features.  
   - Look for outliers or unusual trends.  
   - Suggest potential transformations or feature engineering ideas.  

3. **Document your findings**  
   - Write your analysis and observations in *Markdown cells*.  
   - Include plots and code where relevant.  
   - End with **suggestions** for what we, as a group, should focus on.  
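
As a possible starting point for steps 1 and 2, here is a minimal sketch of two helpers: one ranks the strongest pairwise correlations, the other flags outliers with the common 1.5 × IQR rule. The helper names (`top_correlations`, `iqr_outliers`) and the toy data are made up for illustration; swap in `df` and whichever columns you are exploring.

```python
import numpy as np
import pandas as pd


def top_correlations(frame: pd.DataFrame, n: int = 5) -> pd.Series:
    """Return the n strongest absolute pairwise correlations between numeric columns."""
    corr = frame.select_dtypes(include="number").corr().abs()
    # Keep only the upper triangle so each pair appears once and the diagonal is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Boolean mask of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)


# Toy data for illustration only; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 100],
    "b": [2, 4, 6, 8, 200],
    "c": [5, 3, 4, 1, 2],
    "label": list("vwxyz"),
})

print(top_correlations(toy, n=1))            # strongest pair of numeric columns
print(toy.loc[iqr_outliers(toy["a"]), "a"])  # rows of "a" flagged as outliers
```

The IQR rule is only one convention; for heavily skewed columns a log transform or a percentile cutoff may suit better.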

---

✅ **Goal:**  
By the end, we will have multiple perspectives on the data. Then, as a group, we’ll decide on:  
- The main problems or questions to solve  
- The best features to use for modeling  
- Next steps in our project  

---


In [2]:
import pandas as pd
import numpy as np

In [3]:
# NOTE: absolute path from one machine; point this at your local copy of the CSV.
df = pd.read_csv("/home/rahaf/code/rahafw/future_stars/data/players_data_light-2024_2025.csv")
df.head()

,Rk,Player,Nation,Pos,Squad,Comp,Age,Born,MP,Starts,...,Att (GK),Thr,Launch%,AvgLen,Opp,Stp,Stp%,#OPA,#OPA/90,AvgDist
0,1,Max Aarons,eng ENG,DF,Bournemouth,eng Premier League,24.0,2000.0,3,1,...,,,,,,,,,,
1,2,Max Aarons,eng ENG,"DF,MF",Valencia,es La Liga,24.0,2000.0,4,1,...,,,,,,,,,,
2,3,Rodrigo Abajas,es ESP,DF,Valencia,es La Liga,21.0,2003.0,1,1,...,,,,,,,,,,
3,4,James Abankwah,ie IRL,"DF,MF",Udinese,it Serie A,20.0,2004.0,6,0,...,,,,,,,,,,
4,5,Keyliane Abdallah,fr FRA,FW,Marseille,fr Ligue 1,18.0,2006.0,1,0,...,,,,,,,,,,
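
The preview shows three text columns with compound values: `Nation` ("eng ENG"), `Comp` ("eng Premier League") and `Pos` ("DF,MF"). One possible way to tidy them is sketched below on three rows copied from the output above. The new column names (`nation_code`, `league`, `primary_pos`) are hypothetical, and treating the first listed position as the primary one is an assumption we should check as a group.

```python
import pandas as pd

# Three rows copied from the df.head() preview; replace `sample` with `df` in the notebook.
sample = pd.DataFrame({
    "Nation": ["eng ENG", "eng ENG", "ie IRL"],
    "Pos": ["DF", "DF,MF", "DF,MF"],
    "Comp": ["eng Premier League", "es La Liga", "it Serie A"],
})

cleaned = sample.assign(
    # Keep the upper-case code after the space: "eng ENG" -> "ENG".
    nation_code=sample["Nation"].str.split().str[-1],
    # Drop the lower-case country prefix: "eng Premier League" -> "Premier League".
    league=sample["Comp"].str.split(n=1).str[1],
    # Assumption: the first listed position is the primary one: "DF,MF" -> "DF".
    primary_pos=sample["Pos"].str.split(",").str[0],
)

print(cleaned[["nation_code", "league", "primary_pos"]])
```

Missing values (for example the few rows without a `Nation`) stay missing, since the `.str` accessor propagates NaN.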


# Exploring the data

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2854 entries, 0 to 2853
Columns: 165 entries, Rk to AvgDist
dtypes: float64(61), int64(99), object(5)
memory usage: 3.6+ MB


In [5]:
df.shape

(2854, 165)

In [6]:
display(df.describe(include="all").T)

,count,unique,top,freq,mean,std,min,25%,50%,75%,max
Rk,2854.0,,,,1427.5,824.023159,1.0,714.25,1427.5,2140.75,2854.0
Player,2854,2702,Rodri,2,,,,,,,
Nation,2847,113,es ESP,415,,,,,,,
Pos,2854,10,DF,859,,,,,,,
Squad,2854,96,Como,38,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...
Stp,212.0,,,,14.377358,13.874832,0.0,2.0,10.5,22.0,64.0
Stp%,211.0,,,,6.159716,4.074863,0.0,4.0,5.6,7.9,33.3
#OPA,212.0,,,,18.768868,18.276921,0.0,3.0,14.0,30.25,89.0
#OPA/90,212.0,,,,1.164528,1.00875,0.0,0.67,1.0,1.47,10.0
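
The summary lists 2854 rows but only 2702 unique `Player` names, and the preview shows Max Aarons once for Bournemouth and once for Valencia, so some players appear on more than one row. The sketch below shows one way to surface and combine such rows. Using `Player` plus `Born` as the identifier is an assumption (two different players could share a name and birth year), and the toy rows are modelled on the preview rather than taken from the full file.

```python
import pandas as pd

# Toy rows modelled on the df.head() preview; replace `toy` with `df` in the notebook.
toy = pd.DataFrame({
    "Player": ["Max Aarons", "Max Aarons", "Rodrigo Abajas", "James Abankwah"],
    "Born": [2000.0, 2000.0, 2003.0, 2004.0],
    "Squad": ["Bournemouth", "Valencia", "Valencia", "Udinese"],
    "MP": [3, 4, 1, 6],
})

key = ["Player", "Born"]  # assumed identifier, not guaranteed to be unique per person

# keep=False marks every row of a repeated key, not just the later ones.
repeated = toy[toy.duplicated(subset=key, keep=False)]
print(repeated)

# One row per player: count the rows, list the squads, and sum matches played.
per_player = toy.groupby(key, as_index=False).agg(
    n_rows=("Squad", "size"),
    squads=("Squad", lambda s: ", ".join(s)),
    MP=("MP", "sum"),
)
print(per_player)
```

Note that `groupby` drops rows whose key contains NaN by default, so rows with a missing `Born` would need `dropna=False` or separate handling.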


In [None]:
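# Per-column overview: dtype, missing values, and number of distinct values.
na = df.isna()  # compute the missing-value mask once and reuse it
d = pd.DataFrame({
    "dtype": df.dtypes,
    "missing_count": na.sum(),
    "missing_pct": na.mean().round(4),  # fraction in [0, 1], not a percentage
    "nunique": df.nunique()
}).sort_values(["missing_pct","nunique"], ascending=[False, True])

# Columns with the most missing values come first.
d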
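
Columns such as `Stp` and `#OPA` have only 212 non-null values out of 2854 rows (about 93% missing); they look like goalkeeper statistics that are empty for outfield players. A missingness summary like the one above can be used to separate such sparse columns from the rest. The sketch below does this on a toy frame; the 0.5 cutoff and the names `sparse_cols`, `core` and `sparse_rows` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame: "Stp" mimics a column that is filled for only a few rows.
toy = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "MP": [3, 4, 1, 6],
    "Stp": [np.nan, np.nan, np.nan, 14.0],
})

missing_frac = toy.isna().mean()  # fraction of missing values per column
threshold = 0.5                   # assumed cutoff; tune it after inspecting the summary

sparse_cols = missing_frac[missing_frac > threshold].index.tolist()
print(sparse_cols)

# Frame without the sparse columns, usable for all rows.
core = toy.drop(columns=sparse_cols)

# Rows that have at least one value in the sparse columns.
sparse_rows = toy.dropna(subset=sparse_cols, how="all")
print(sparse_rows)
```

Splitting the data this way avoids imputing statistics that are structurally absent, but whether to model those rows separately is a decision for the group.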