# Position Classification for FC Barcelona's Youth Players

## Project Overview

- As the newly appointed data scientist for FC Barcelona, your pivotal role is to harness the power of data analytics to enhance the team's talent development strategy.
- Leveraging the extensive player data from FIFA 2022, this project endeavors to establish a robust classification model that can accurately recommend the optimal playing position for each youth player.
- By delving into the players' physical attributes and their correlation with on-field performance, the project aims to provide actionable insights to the coaching staff, fostering effective player positioning and skill development.

## A. Project Objectives

1. **Data Preprocessing and Quality Assurance:** Thoroughly clean the FIFA 2022 player dataset, addressing missing values, outliers, and inconsistencies to ensure the accuracy and reliability of subsequent analyses.

2. **Feature Creation and Engineering:** Utilize domain knowledge to craft meaningful new features from the existing dataset, amplifying the predictive power of the model and capturing nuanced player attributes.

3. **Exploratory Data Analysis (EDA):** Conduct an in-depth analysis of the most influential features, unveiling correlations between physical attributes and playing positions, guiding feature selection and model development.

4. **Cross-Validation Strategy:** Implement robust cross-validation techniques to mitigate overfitting and optimize model generalization performance.

5. **Model Selection and Justification:** Deploy a minimum of two machine learning models, such as Random Forest and Gradient Boosting, seen in class. Elaborate on the rationale behind each model choice, considering their ability to handle non-linearity, feature interactions, and overall performance.

6. **Ensemble Modeling (Bonus Objective):** Demonstrate advanced expertise by creating an ensemble model that amalgamates predictions from multiple base models, leveraging the strengths of individual models for improved accuracy and robustness.

7. **Performance Evaluation:** Utilize appropriate evaluation metrics such as accuracy, precision, recall, and F1-score to comprehensively assess the models' classification performance.

8. **Kaggle Competition:** Showcase your model's prowess by participating in a Kaggle competition, uploading the refined data and comparing your model's accuracy against other teams' submissions. This step provides real-world validation of your model's effectiveness.
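
Objectives 4, 5 and 7 can be combined into one small loop. The sketch below is only an illustration under stated assumptions: it uses scikit-learn, synthetic data from `make_classification` instead of the FIFA dataset, arbitrary hyperparameters, and macro-averaged F1 as the score (the averaging scheme used by the competition is not stated here).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the player features and positions (NOT the FIFA data).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=6, n_classes=3, random_state=0
)

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Hyperparameters are placeholders chosen only to keep the example fast.
models = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
}

cv_scores = {}
for name, model in models.items():
    # One macro-F1 value per fold.
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
    cv_scores[name] = scores
    print(f"{name}: mean macro-F1 = {scores.mean():.3f} (std {scores.std():.3f})")
```

Comparing the mean and the spread across folds, rather than a single train/test split, gives a less noisy basis for choosing between the two models.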

## B. Data provided

Below is the full list of variables available for analysis. The primary objective is to predict each player's **position** from these variables.

* id: Unique identifier for each player.
* short_name: Short name or nickname of the player.
* overall: Player's overall rating, representing their overall skill level.
* potential: Player's potential rating, indicating their potential skill growth.
* value_eur: Player's market value in euros.
* wage_eur: Player's weekly wage in euros.
* birthday_date: Player's date of birth.
* height_cm: Player's height in centimeters.
* weight_kg: Player's weight in kilograms.
* club_name: Name of the player's club.
* league_name: Name of the league the club belongs to.
* position: Player's preferred playing position.
* preferred_foot: Player's preferred kicking foot (left or right).
* weak_foot: Player's weak foot rating, indicating their weaker kicking foot's ability.
* skill_moves: Player's skill moves rating, representing their dribbling and ball control skills.
* international_reputation: Player's international reputation level.
* pace, shooting, passing, dribbling, defending, physic: Attributes representing different aspects of a player's playing style and skills.
* mentality_aggression, mentality_vision, mentality_composure: Attributes representing mental aspects of a player's game.
* attacking_crossing, attacking_finishing, attacking_heading_accuracy: Attributes related to attacking and finishing skills.
* movement_acceleration, movement_sprint_speed, movement_agility: Attributes related to a player's speed and agility.
* power_shot_power, power_jumping, power_stamina: Attributes representing a player's physical power and endurance.
* defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle: Attributes representing a player's defensive skills.
* goalkeeping_diving, goalkeeping_handling, goalkeeping_positioning: Goalkeeping attributes related to diving, handling, and positioning.
* goalkeeping_reflexes, goalkeeping_speed: Attributes representing a goalkeeper's reflexes and speed.
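
As one possible example of deriving a feature from these variables, `birthday_date` can be turned into an age in completed years. This is a minimal sketch, not the notebook's method: the `add_age` helper and the reference date are hypothetical, and the three dates are copied from the `df_tr.head()` preview shown later in this notebook.

```python
import pandas as pd

# Three birthday_date values copied from the df_tr.head() preview.
toy = pd.DataFrame({"birthday_date": ["1989-12-28", "1996-10-23", "2001-09-27"]})

# Hypothetical reference date; choose whichever date suits your analysis.
REFERENCE_DATE = pd.Timestamp("2021-10-01")

def add_age(df, reference=REFERENCE_DATE):
    """Return a copy of df with an integer 'age' column in completed years."""
    out = df.copy()
    birth = pd.to_datetime(out["birthday_date"])
    # True where the birthday has already occurred in the reference year.
    had_birthday = (birth.dt.month < reference.month) | (
        (birth.dt.month == reference.month) & (birth.dt.day <= reference.day)
    )
    out["age"] = reference.year - birth.dt.year - (~had_birthday).astype(int)
    return out

toy_with_age = add_age(toy)
print(toy_with_age)
```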

## C. Positions possible

**Your objective is to predict the player's position**. The position variable takes one of the values described below; each represents a playing role that a soccer player can assume on the field:

* RW (Right Winger): A player positioned on the right flank of the field, often involved in attacking and crossing from the right side.
* ST (Striker): The primary goal-scoring position, responsible for leading the attack and converting chances into goals.
* LW (Left Winger): Similar to the RW, a player positioned on the left flank, typically involved in offensive plays and providing crosses.
* RCM (Right Center Midfielder): Positioned centrally, this player often participates in both offensive and defensive actions from the right side of the midfield.
* GK (Goalkeeper): The player who guards the goal and is primarily responsible for preventing the opposing team from scoring.
* CDM (Central Defensive Midfielder): Positioned centrally, this player's role involves shielding the defense and distributing the ball to initiate attacks.
* LCB (Left Center Back): Positioned on the left side of the central defense, responsible for stopping opposing attacks and distributing the ball to start build-up.
* RDM (Right Defensive Midfielder): Positioned centrally, this player focuses on defensive duties and ball distribution from the right side of the midfield.
* RS (Right Striker): Positioned towards the right side of the attacking line, tasked with goal-scoring and contributing to offensive plays.
* LCM (Left Center Midfielder): Positioned centrally, this player contributes to both defensive and offensive actions from the left side of the midfield.
* CAM (Central Attacking Midfielder): Positioned centrally, this player operates behind the forwards, often responsible for creating scoring opportunities.
* RCB (Right Center Back): Positioned on the right side of the central defense, involved in defensive actions and initiating play from the back.
* LDM (Left Defensive Midfielder): Positioned centrally, this player combines defensive responsibilities with ball distribution from the left side of the midfield.
* LB (Left Back): Positioned on the left side of the defensive line, responsible for both defensive actions and supporting attacking plays.
* RB (Right Back): Positioned on the right side of the defensive line, performing defensive duties and contributing to the attack.
* LM (Left Midfielder): Positioned on the left flank, this player combines defensive actions with supporting offensive plays from the left side.
* RM (Right Midfielder): Positioned on the right flank, this player performs defensive and offensive roles from the right side of the field.
* LS (Left Striker): Positioned towards the left side of the attacking line, contributing to goal-scoring and offensive maneuvers.
* CB (Center Back): Positioned centrally in the defense, this player's main role is to stop opposing attacks and maintain defensive stability.
* RWB (Right Wing Back): Positioned on the right flank, this player combines defensive actions with supporting attacks from the right side.
* RF (Right Forward): Positioned in the forward line, primarily tasked with scoring goals and participating in offensive plays from the right side.
* CM (Central Midfielder): Positioned centrally in the midfield, this player is involved in both defensive and offensive actions.
* LWB (Left Wing Back): Positioned on the left flank, this player performs defensive and attacking roles, mirroring the RWB on the opposite side.
* LF (Left Forward): Positioned in the forward line, responsible for goal-scoring and contributing to offensive maneuvers from the left side.
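
The 24 codes above can double as a sanity check on predicted labels before submission. The following is a small sketch; the helper name `find_invalid_labels` and the example predictions are made up for illustration.

```python
# The 24 position codes listed above, as a set for quick membership checks.
VALID_POSITIONS = {
    "RW", "ST", "LW", "RCM", "GK", "CDM", "LCB", "RDM", "RS", "LCM", "CAM", "RCB",
    "LDM", "LB", "RB", "LM", "RM", "LS", "CB", "RWB", "RF", "CM", "LWB", "LF",
}

def find_invalid_labels(labels):
    """Return the sorted list of distinct labels that are not known position codes."""
    return sorted(set(labels) - VALID_POSITIONS)

# Made-up predictions: "lb" has the wrong case and "CF" is not in the list above.
example_predictions = ["LB", "GK", "ST", "lb", "CF"]
invalid = find_invalid_labels(example_predictions)
print(invalid)  # ['CF', 'lb']
```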

## D. Kaggle submission

Once you have produced test-set predictions, you can submit them to <i>Kaggle</i> to see how your model performs and compete with your colleagues.

The following code provides an example of generating a <i>.csv</i> file to submit to Kaggle:

1. Create a pandas DataFrame with two columns: one with the test set "id" values and the other with your predicted "position" for each observation.

2. Use the pandas <i>.to_csv</i> method to create the CSV file. Setting <i>index = False</i> is important to ensure the <i>.csv</i> is in the format Kaggle expects.

In [None]:
# Produce .csv for kaggle testing
# test_predictions_submit = pd.DataFrame({"id": test_df["id"], "position": test_predictions})
# test_predictions_submit.to_csv("test_predictions_submit.csv", index = False)

## E. Performance measure: F1-score

* **Interpretation:** F1-score is the harmonic mean of precision and recall, providing a balance between the two metrics.
* **Use Cases:** F1-score is useful on imbalanced datasets, where accuracy alone can be misleading and you want to balance precision and recall.
* **Advantages:** Serves as a single metric that considers both false positives and false negatives.
* **Limitations:** F1-score may not be the best choice when the cost of false positives and false negatives is significantly different.

$$\text{F1-score} = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

where recall is the ratio of true positives to the total number of actual positives:

$$\text{Recall} = \frac{TP}{TP + FN}$$

* **Interpretation:** Recall is the proportion of true positive predictions out of all actual positive instances (true positives + false negatives).
* **Use Cases:** Recall is important when the cost of false negatives is high, and you want to capture as many positive instances as possible.
* **Advantages:** Focuses on the ability of the model to identify positive instances correctly.
* **Limitations:** It ignores false positives, so a model that labels every instance as positive reaches perfect recall.

and precision is the ratio of true positive predictions to all positive predictions:

$$\text{Precision} = \frac{TP}{TP + FP}$$

* **Interpretation:** Precision is the proportion of true positive predictions out of all positive predictions (true positives + false positives).
* **Use Cases:** Precision is valuable when the cost of false positives is high, and you want to minimize false alarms.
* **Advantages:** Focuses on how trustworthy the model's positive predictions are.
* **Limitations:** It ignores false negatives, so a model that makes only a few very safe positive predictions can score highly while missing most actual positives.
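
Because position prediction is a multi-class problem, the formulas above are applied one class at a time (that class against all the others) and then combined. The sketch below works through a tiny example in plain Python. The labels are made up, and combining the per-class values by a simple mean (macro averaging) is an assumption, since the averaging used for the competition score is not stated here.

```python
from collections import Counter

def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} from one-vs-rest TP, FP and FN counts."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # actual class gains a false negative
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Made-up labels for six players.
y_true = ["GK", "GK", "ST", "ST", "ST", "CB"]
y_pred = ["GK", "ST", "ST", "ST", "CB", "CB"]

scores = per_class_scores(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in scores.values()) / len(scores)

for label, (p, r, f1) in scores.items():
    print(f"{label}: precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
print(f"macro F1 = {macro_f1:.3f}")  # 0.667
```

In this example GK has perfect precision but recall 0.5, CB has the reverse, and both end up with the same F1 of about 0.667, which shows how the harmonic mean penalizes whichever of the two is lower.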

# 0. Libraries and file imports

In [2]:
import pandas as pd

In [3]:
# We import the training dataset
df_tr_og = pd.read_csv('../train.csv')

# We import the test dataset
df_te_og = pd.read_csv('../test.csv')

# We create a copy of the original dataset, so that we do not have to load the dataset
# again if we want to go back to the original one
df_tr = df_tr_og.copy()

df_te = df_te_og.copy()

In this notebook we explore the test set alongside the training set, going into less detail on the test set wherever the two appear to be similarly distributed.
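
One simple way to judge whether the two sets are similarly distributed is to place their summary statistics side by side. This is a minimal sketch with made-up toy frames standing in for `df_tr` and `df_te`; the helper `compare_summaries` is hypothetical.

```python
import pandas as pd

# Tiny made-up frames standing in for df_tr and df_te.
toy_tr = pd.DataFrame({"height_cm": [176, 183, 178, 188], "weight_kg": [73, 73, 69, 81]})
toy_te = pd.DataFrame({"height_cm": [179, 181, 175], "weight_kg": [74, 77, 70]})

def compare_summaries(train, test, columns):
    """Return train and test summary statistics side by side, one row per column."""
    stats = ["mean", "std", "min", "max"]
    train_stats = train[columns].agg(stats).T.add_prefix("train_")
    test_stats = test[columns].agg(stats).T.add_prefix("test_")
    return pd.concat([train_stats, test_stats], axis=1)

summary = compare_summaries(toy_tr, toy_te, ["height_cm", "weight_kg"])
print(summary.round(2))
```

With the real data, the same call could be made as `compare_summaries(df_tr, df_te, [...])` on any numerical columns shared by both sets.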

In [3]:
df_tr.head()

| | id | short_name | overall | potential | value_eur | wage_eur | birthday_date | height_cm | weight_kg | club_name | ... | defending_marking_awareness | defending_standing_tackle | defending_sliding_tackle | goalkeeping_diving | goalkeeping_handling | goalkeeping_kicking | goalkeeping_positioning | goalkeeping_reflexes | goalkeeping_speed | position |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 216302 | E. García | 71 | 71 | 1400000.0 | 10000 | 1989-12-28 | 176 | 73 | Club Atlético de San Luis | ... | 65 | 66 | 65 | 14 | 11 | 12 | 12 | 12 | NaN | LB |
| 1 | 237867 | D. Cancola | 65 | 71 | 1000000.0 | 2000 | 1996-10-23 | 183 | 73 | Ross County FC | ... | 65 | 61 | 58 | 10 | 13 | 7 | 6 | 11 | NaN | LDM |
| 2 | 253472 | E. Kahl | 65 | 77 | 1600000.0 | 2000 | 2001-09-27 | 178 | 69 | Aarhus GF | ... | 60 | 58 | 59 | 10 | 10 | 8 | 10 | 11 | NaN | LWB |
| 3 | 223994 | S. Mugoša | 72 | 72 | 2300000.0 | 5000 | 1992-02-26 | 188 | 81 | Incheon United FC | ... | 16 | 22 | 19 | 16 | 15 | 13 | 8 | 9 | NaN | LS |
| 4 | 251635 | A. Țigănașu | 65 | 65 | 525000.0 | 3000 | 1990-06-12 | 179 | 74 | FC Botoşani | ... | 64 | 61 | 58 | 12 | 5 | 11 | 12 | 15 | NaN | LB |



What we have to predict are the values in the last column: position.
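
Since the F1-score is sensitive to how the classes are distributed, a quick look at the share of each position is a natural first check on the target. The sketch below uses only the five `position` values visible in the preview above; on the full data the same call would be made on `df_tr["position"]`.

```python
import pandas as pd

# The five position labels visible in the df_tr.head() preview above.
toy_positions = pd.Series(["LB", "LDM", "LWB", "LS", "LB"], name="position")

# Fraction of rows per class, sorted from most to least frequent.
class_share = toy_positions.value_counts(normalize=True)
print(class_share)
```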

# 1. EDA

## 1.1. Numerical description

### 1.1.1. General characteristics of the dataset

In [6]:
# We print the data types
print(df_tr.dtypes)

id                           int64
short_name                  object
overall                      int64
potential                    int64
value_eur                  float64
                            ...   
goalkeeping_kicking          int64
goalkeeping_positioning      int64
goalkeeping_reflexes         int64
goalkeeping_speed          float64
position                    object
Length: 70, dtype: object


In [9]:
# Now, we get a better visualization of the data types of the dataset
def dataframe_dtypes_overview(df):

    """
    Provides an overview of the data types in a DataFrame, including counts and variable names.
    
    Parameters:
    df (pd.DataFrame): The input DataFrame.
    
    Returns:
    None: Prints the data type counts and variable names grouped by type.
    """

    dtype_counts = df.dtypes.value_counts()
    print("Data Type Overview:")
    for dtype, count in dtype_counts.items():
        print(f"\nData type: {dtype} ({count} variables)")
        cols = df.select_dtypes(include=[dtype]).columns
        print("Variables:", ", ".join(cols))

dataframe_dtypes_overview(df_tr)

Data Type Overview:

Data type: int64 (46 variables)
Variables: id, overall, potential, wage_eur, height_cm, weight_kg, league_level, club_jersey_number, club_contract_valid_until, weak_foot, skill_moves, international_reputation, attacking_crossing, attacking_finishing, attacking_heading_accuracy, attacking_short_passing, attacking_volleys, skill_dribbling, skill_curve, skill_fk_accuracy, skill_long_passing, skill_ball_control, movement_acceleration, movement_sprint_speed, movement_agility, movement_reactions, movement_balance, power_shot_power, power_jumping, power_stamina, power_strength, power_long_shots, mentality_aggression, mentality_interceptions, mentality_positioning, mentality_vision, mentality_penalties, mentality_composure, defending_marking_awareness, defending_standing_tackle, defending_sliding_tackle, goalkeeping_diving, goalkeeping_handling, goalkeeping_kicking, goalkeeping_positioning, goalkeeping_reflexes

Data type: object (14 variables)
Variables: short_name, birth

Overall, we have:
- 47 purely numerical features:
    - Of type *int64*: overall, potential, wage_eur, height_cm, weight_kg, 5 *attacking* variables, 5 *skill* variables (excluding *skill_moves*), 5 *movement* variables, 5 *power* variables, 6 *mentality* variables, 3 *defending* variables, 6 *goalkeeping* variables (all *int64* except goalkeeping_speed, which is of type *float64*).
    - Of type *float64*: value_eur, release_clause_eur, pace, shooting, passing, dribbling, defending.
- 4 categorical ordinal features: weak_foot, skill_moves, international_reputation, work_rate (interpreted as how much the player trains).
- (70 - 51 - 1) = 18 categorical nominal features: all the other variables.

Target: *position*.
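
This grouping can be kept in code so that later steps (encoding, scaling, plotting) can refer to it. The sketch below is one possible way to do so, not the notebook's own implementation: the helper `split_feature_types`, its `nominal_override` parameter and the toy frame are hypothetical. A purely dtype-based split would count integer-coded identifiers such as `id` as numerical, so those columns have to be overridden by hand.

```python
import pandas as pd

ORDINAL = ["weak_foot", "skill_moves", "international_reputation", "work_rate"]
TARGET = "position"

def split_feature_types(df, ordinal=ORDINAL, target=TARGET, nominal_override=()):
    """Group column names into numerical, ordinal and nominal features."""
    ordinal_cols = [c for c in ordinal if c in df.columns]
    reserved = set(ordinal_cols) | set(nominal_override) | {target}
    # Numeric dtypes count as numerical unless they are reserved for another group.
    numerical_cols = [
        c for c in df.select_dtypes(include="number").columns if c not in reserved
    ]
    # Everything left over, except the target, is treated as nominal.
    nominal_cols = [
        c for c in df.columns
        if c not in numerical_cols and c not in ordinal_cols and c != target
    ]
    return {"numerical": numerical_cols, "ordinal": ordinal_cols, "nominal": nominal_cols}

# Made-up toy frame with a handful of the dataset's column names.
toy = pd.DataFrame({
    "id": [1, 2],
    "overall": [71, 65],
    "value_eur": [1400000.0, 1000000.0],
    "weak_foot": [3, 4],
    "work_rate": ["High/Medium", "Medium/Medium"],
    "club_name": ["Club A", "Club B"],
    "position": ["LB", "LDM"],
})

groups = split_feature_types(toy, nominal_override=["id"])
print(groups)
```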

## 1.2. Graphical description