# Project: Machine Learning

**Instructions for Students:**

Please carefully follow these steps to complete and submit your project:

1. **Completing the Project**: You are required to work on and complete all tasks in the provided project. Be disciplined and ensure that you thoroughly engage with each task.
   
2. **Creating a Google Drive Folder**: Each of you must create a new folder on your Google Drive if you haven't already. This will be the repository for all your completed assignment and project files, aiding you in keeping your work organized and accessible.
   
3. **Uploading Completed Project**: Upon completion of your project, make sure to upload all necessary files, including code, reports, and related documents, into the Google Drive folder you created. Save the folder link in the 'Student Identity' section and also provide it as the last parameter in the `submit` function that has been provided.
   
4. **Sharing Folder Link**: You're required to share the link to your project Google Drive folder. This is crucial for the submission and evaluation of your project.
   
5. **Setting Permission to Public**: Please make sure your Google Drive folder is set to public. This allows your instructor to access your solutions and assess your work correctly.

Adhering to these procedures will facilitate a smooth project evaluation process for you and the reviewers.

## Student Identity

In [1]:
# @title #### Student Identity
student_id = "REA3X5EN" # @param {type:"string"}
name = "Steven Adi Santoso" # @param {type:"string"}
drive_link = "https://drive.google.com/drive/folders/10NSMcDCiZ-aaZD5lq62Z07iGYbQ18j9S?usp=sharing"  # @param {type:"string"}

assignment_id = "00_ml_project"

# Import grader package
!pip install rggrader
from rggrader import submit, submit_image



## Project Description

In this Machine Learning Project, you will create your own supervised Machine Learning (ML) model. We will use the full FIFA21 Dataset and we will identify players that are above average.

We will use the column "Overall" with a threshold of 75 to define players that are 'Valuable'. This will become our target output, which we need for a supervised ML model. Because we use "Overall" as our target output, you cannot use "Overall" in your features; this will be explained further below.

This project will provide a comprehensive overview of your abilities in machine learning, from understanding the problem, choosing the right model, training, and optimizing it.

## Grading Criteria

Your score will be awarded based on the following criteria:
* 100: The model has an accuracy of more than 80% and an F1 score of more than 85%. This model is excellent and demonstrates a strong understanding of the task.
* 90: The model has an accuracy of more than 75% and an F1 score of more than 80%. This model is very good, with some room for improvement.
* 80: The model has an accuracy of more than 70% and an F1 score between 70% and 80%. This model is fairly good but needs improvement in balancing precision and recall.
* 70: The model has an accuracy of more than 65% and an F1 score between 60% and 70%. This model is below average and needs significant improvement.
* 60 or below: The model has an accuracy of less than 65% or an F1 score of less than 60%, or the student did not submit the accuracy and F1 score. This model is poor and needs considerable improvement.
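
As a rough aid for reading the rubric above, the tiers can be written as a small lookup. This is only a sketch: the function name `project_score` is hypothetical, and it assumes the "between" bands act as lower bounds (a model earns the highest tier whose accuracy and F1 floors it clears). The actual grader may interpret the bands differently.

```python
def project_score(accuracy: float, f1: float) -> int:
    """Map accuracy and F1 (fractions in [0, 1]) to a rubric score.

    Assumption: each tier is defined by a minimum accuracy and a minimum F1,
    and the first tier whose two floors are both exceeded is awarded.
    """
    tiers = [
        # (score, accuracy floor, F1 floor)
        (100, 0.80, 0.85),
        (90, 0.75, 0.80),
        (80, 0.70, 0.70),
        (70, 0.65, 0.60),
    ]
    for score, min_accuracy, min_f1 in tiers:
        if accuracy > min_accuracy and f1 > min_f1:
            return score
    return 60  # stands in for "60 or below" in the rubric


# Clears the 90 tier (accuracy > 75%, F1 > 80%) but not the 100 tier.
print(project_score(0.78, 0.82))
```

Note that both conditions must hold at once: a high accuracy with a low F1 still falls through to a lower tier.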

Remember to make a copy of this notebook in your Google Drive and work in your own copy.

Happy modeling!

>Note: If you get an accuracy of 100% and an F1 score of 100%, it may earn you a good grade, but it usually indicates overfitting or target leakage (for example, a feature that effectively encodes "Overall").
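
One hedged way to look for overfitting is to compare the score on the training split with the score on held-out data, and to cross-validate on the training split. The sketch below uses synthetic data from `make_classification` as a stand-in for the FIFA features; the sample size, class weights, and number of trees are assumptions chosen only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the real data (assumption).
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.85, 0.15], random_state=7
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

model = RandomForestClassifier(n_estimators=20, random_state=7)
model.fit(X_train, y_train)

# A large gap between these two numbers suggests the model memorised the training set.
train_accuracy = model.score(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

# 5-fold cross-validated F1 on the training split gives a less optimistic estimate.
cv_f1 = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

print(f"train accuracy: {train_accuracy:.3f}")
print(f"test accuracy : {test_accuracy:.3f}")
print(f"CV F1 (mean)  : {cv_f1.mean():.3f}")
```

A near-perfect training score next to a clearly lower test or cross-validated score is the pattern to watch for.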

In [2]:
# Write any package/module installation that you need
# pip install goes here, this helps declutter your output below

!pip install scikit-learn
!pip install seaborn
!pip install matplotlib



In [3]:
# Import package

# Data preprocessing
import pandas as pd
import numpy as np

# Machine Learning
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import classification_report, f1_score, accuracy_score

## Load the dataset and clean it

In this task, you will prepare and load your dataset. You need to download the full FIFA 21 Dataset from the link here: [Kaggle FIFA Player Stats Database](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA21_official_data.csv).

>Note: Make sure you download the FIFA 21 dataset.
>
>![FIFA21 Dataset](https://storage.googleapis.com/rg-ai-bootcamp/projects/fifa21_dataset-min.png)

After you download the dataset, import it and clean the data. For example, some cells may be empty and need to be filled, and some columns may need to be converted to numeric values for analysis. Identify the incomplete data and fix it.
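
As an illustration of filling empty cells rather than discarding rows, the sketch below imputes numeric columns with their median. The miniature frame `df_demo` is hypothetical; only the column names follow the dataset. In a real pipeline the medians would normally be computed on the training split only.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; values are made up for illustration.
df_demo = pd.DataFrame(
    {
        "Composure": [85.0, np.nan, 86.0, 89.0],
        "Volleys": [88.0, 82.0, np.nan, 84.0],
        "Value": ["€31.5M", "€87M", "€63M", "€50.5M"],
    }
)

# Fill only the numeric columns; text columns are left untouched.
numeric_cols = df_demo.select_dtypes(include="number").columns
df_filled = df_demo.copy()
df_filled[numeric_cols] = df_filled[numeric_cols].fillna(df_filled[numeric_cols].median())

print("rows kept by dropna():", len(df_demo.dropna()))
print("rows kept by fillna():", len(df_filled))
print("missing cells left   :", int(df_filled.isnull().sum().sum()))
```

Dropping is simpler, but filling keeps every row, which matters more when the missing values are concentrated in a few columns.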

In the code block below, you can use the comments to guide you on what to do.

In [4]:
# Load your data
df = pd.read_csv(r"FIFA21_official_data.csv")
df.head()

Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Position,Best Overall Rating,Release Clause,DefensiveAwareness
0,176580,L. Suárez,33,https://cdn.sofifa.com/players/176/580/20_60.png,Uruguay,https://cdn.sofifa.com/flags/uy.png,87,87,Atlético Madrid,https://cdn.sofifa.com/teams/240/light_30.png,...,38.0,27.0,25.0,31.0,33.0,37.0,ST,87.0,€64.6M,57.0
1,192985,K. De Bruyne,29,https://cdn.sofifa.com/players/192/985/20_60.png,Belgium,https://cdn.sofifa.com/flags/be.png,91,91,Manchester City,https://cdn.sofifa.com/teams/10/light_30.png,...,53.0,15.0,13.0,5.0,10.0,13.0,CAM,91.0,€161M,68.0
2,212198,Bruno Fernandes,25,https://cdn.sofifa.com/players/212/198/20_60.png,Portugal,https://cdn.sofifa.com/flags/pt.png,87,90,Manchester United,https://cdn.sofifa.com/teams/11/light_30.png,...,55.0,12.0,14.0,15.0,8.0,14.0,CAM,88.0,€124.4M,72.0
3,194765,A. Griezmann,29,https://cdn.sofifa.com/players/194/765/20_60.png,France,https://cdn.sofifa.com/flags/fr.png,87,87,FC Barcelona,https://cdn.sofifa.com/teams/241/light_30.png,...,49.0,14.0,8.0,14.0,13.0,14.0,ST,87.0,€103.5M,59.0
4,224334,M. Acuña,28,https://cdn.sofifa.com/players/224/334/20_60.png,Argentina,https://cdn.sofifa.com/flags/ar.png,83,83,Sevilla FC,https://cdn.sofifa.com/teams/481/light_30.png,...,79.0,8.0,14.0,13.0,13.0,14.0,LB,83.0,€46.2M,79.0


In [5]:
# Dropping unused columns
unused_columns = ["ID", "Name", "Photo", "Nationality", "Flag", "Club", "Club Logo", "Real Face", "Position", "Jersey Number", "Joined", "Loaned From", "Contract Valid Until",
                  "Marking", "Best Position", "Release Clause", "Body Type"]
df_used_columns = df.drop(columns=unused_columns)
df_used_columns.head()

Unnamed: 0,Age,Overall,Potential,Value,Wage,Special,Preferred Foot,International Reputation,Weak Foot,Skill Moves,...,Composure,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Overall Rating,DefensiveAwareness
0,33,87,87,€31.5M,€115K,2316,Right,5.0,4.0,3.0,...,85.0,45.0,38.0,27.0,25.0,31.0,33.0,37.0,87.0,57.0
1,29,91,91,€87M,€370K,2304,Right,4.0,5.0,4.0,...,91.0,65.0,53.0,15.0,13.0,5.0,10.0,13.0,91.0,68.0
2,25,87,90,€63M,€195K,2303,Right,2.0,4.0,4.0,...,86.0,67.0,55.0,12.0,14.0,15.0,8.0,14.0,88.0,72.0
3,29,87,87,€50.5M,€290K,2288,Left,4.0,3.0,4.0,...,89.0,54.0,49.0,14.0,8.0,14.0,13.0,14.0,87.0,59.0
4,28,83,83,€22M,€41K,2280,Left,2.0,3.0,4.0,...,87.0,82.0,79.0,8.0,14.0,13.0,13.0,14.0,83.0,79.0


In [6]:
# Check null data
null_counts = df_used_columns.isnull().sum()  # count nulls on the trimmed frame only
null_columns = null_counts[null_counts > 0]
null_columns

Volleys                39
Curve                  39
Agility                39
Balance                39
Jumping                39
Interceptions           3
Positioning             3
Vision                 39
Composure             287
SlidingTackle          39
DefensiveAwareness    942
dtype: int64

In [7]:
# Processing null value
df_no_null = df_used_columns.dropna() # Drop the rows with missing values because relatively few rows are affected
df_no_null.isnull().sum()

Age                         0
Overall                     0
Potential                   0
Value                       0
Wage                        0
Special                     0
Preferred Foot              0
International Reputation    0
Weak Foot                   0
Skill Moves                 0
Work Rate                   0
Height                      0
Weight                      0
Crossing                    0
Finishing                   0
HeadingAccuracy             0
ShortPassing                0
Volleys                     0
Dribbling                   0
Curve                       0
FKAccuracy                  0
LongPassing                 0
BallControl                 0
Acceleration                0
SprintSpeed                 0
Agility                     0
Reactions                   0
Balance                     0
ShotPower                   0
Jumping                     0
Stamina                     0
Strength                    0
LongShots                   0
Aggression

In [8]:
# Checking the column dtypes for later processing
df_no_null.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16166 entries, 0 to 17106
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       16166 non-null  int64  
 1   Overall                   16166 non-null  int64  
 2   Potential                 16166 non-null  int64  
 3   Value                     16166 non-null  object 
 4   Wage                      16166 non-null  object 
 5   Special                   16166 non-null  int64  
 6   Preferred Foot            16166 non-null  object 
 7   International Reputation  16166 non-null  float64
 8   Weak Foot                 16166 non-null  float64
 9   Skill Moves               16166 non-null  float64
 10  Work Rate                 16166 non-null  object 
 11  Height                    16166 non-null  object 
 12  Weight                    16166 non-null  object 
 13  Crossing                  16166 non-null  float64
 14  Finishing  

In [9]:
# Encoder for categorizing the work rate
le = LabelEncoder()
le.fit(df_no_null["Work Rate"].unique())

# Processing columns
df_processed = df_no_null.copy()

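# Value and Wage: convert strings such as "€31.5M" or "€115K" to whole euros
def convert_money(x):
    x = x.replace('€', '')

    if "K" in x:
        return round(float(x.replace("K", "")) * 1000)

    if "M" in x:
        return round(float(x.replace("M", "")) * 1000000)

    # Plain amounts without a suffix (e.g. "€0") are parsed as-is
    return round(float(x))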

for col in ["Value", "Wage"]:
    df_processed[col] = df_processed[col].apply(convert_money)

# Foot
df_processed["Preferred Foot"] = df_processed["Preferred Foot"].apply(lambda x : 0 if x == "Right" else 1)

# Work Rate
df_processed["Work Rate"] = le.transform(df_processed["Work Rate"])

# Height
def convert_height(x):
    feet, inches = map(int, x.split("'")) 
  
    cm = 30.48 * feet + 2.54 * inches
    
    return round(cm)

df_processed["Height"] = df_processed.Height.apply(convert_height)

# Weight
def convert_weight(x):
    kg = int(x.replace("lbs", "")) * 0.45359237
    
    return round(kg)

df_processed["Weight"] = df_processed.Weight.apply(convert_weight)

df_processed.head()

Unnamed: 0,Age,Overall,Potential,Value,Wage,Special,Preferred Foot,International Reputation,Weak Foot,Skill Moves,...,Composure,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes,Best Overall Rating,DefensiveAwareness
0,33,87,87,31500000,115000,2316,0,5.0,4.0,3.0,...,85.0,45.0,38.0,27.0,25.0,31.0,33.0,37.0,87.0,57.0
1,29,91,91,87000000,370000,2304,0,4.0,5.0,4.0,...,91.0,65.0,53.0,15.0,13.0,5.0,10.0,13.0,91.0,68.0
2,25,87,90,63000000,195000,2303,0,2.0,4.0,4.0,...,86.0,67.0,55.0,12.0,14.0,15.0,8.0,14.0,88.0,72.0
3,29,87,87,50500000,290000,2288,1,4.0,3.0,4.0,...,89.0,54.0,49.0,14.0,8.0,14.0,13.0,14.0,87.0,59.0
4,28,83,83,22000000,41000,2280,1,2.0,3.0,4.0,...,87.0,82.0,79.0,8.0,14.0,13.0,13.0,14.0,83.0,79.0


In [10]:
# Rechecking the dtypes
df_processed.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16166 entries, 0 to 17106
Data columns (total 48 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       16166 non-null  int64  
 1   Overall                   16166 non-null  int64  
 2   Potential                 16166 non-null  int64  
 3   Value                     16166 non-null  int64  
 4   Wage                      16166 non-null  int64  
 5   Special                   16166 non-null  int64  
 6   Preferred Foot            16166 non-null  int64  
 7   International Reputation  16166 non-null  float64
 8   Weak Foot                 16166 non-null  float64
 9   Skill Moves               16166 non-null  float64
 10  Work Rate                 16166 non-null  int64  
 11  Height                    16166 non-null  int64  
 12  Weight                    16166 non-null  int64  
 13  Crossing                  16166 non-null  float64
 14  Finishing  

## Build and Train your model

In this task, you will analyze the data and select the features that are best at predicting whether a player is 'Valuable' or not.

The first step is to **define the target output** that you will use for training. Here's an example of how to create a target output:
- `df['OK Player'] = df['Overall'].apply(lambda x: 1 if x >= 50 else 0) #Define the OK Player using threshold of 50.`
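
For this project the threshold is 75 rather than 50. A minimal sketch of defining that target and checking how balanced it is follows; the `Overall` values in `df_demo` are made up for illustration and the column name `Target` is an arbitrary choice.

```python
import pandas as pd

# Hypothetical ratings, not taken from the dataset.
df_demo = pd.DataFrame({"Overall": [87, 91, 74, 75, 62, 68, 80, 55]})

THRESHOLD = 75
# Vectorised comparison; equivalent to the apply/lambda form for integer ratings.
df_demo["Target"] = (df_demo["Overall"] >= THRESHOLD).astype(int)

# Share of each class; a strongly skewed share means accuracy alone is misleading.
class_share = df_demo["Target"].value_counts(normalize=True).sort_index()
print(class_share)
```

Checking the class share early helps decide whether to stratify the split and how much weight to give F1 over accuracy.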

Next you will **identify the features** that will best predict a 'Valuable' player. You are required to **submit the features you selected** in the Submission section below. Because we use the "Overall" as our target output, the use of "Overall" in your features is not allowed. You will automatically get 0 if you submit "Overall" in your features.
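
Columns that are near-copies of "Overall" leak the target even when "Overall" itself is removed. One hedged way to screen for them is to check how strongly each candidate feature correlates with the target. The data below is synthetic, and the 0.7 cutoff is an arbitrary assumption, not a rule from the project.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
overall = rng.integers(50, 95, size=200)

# Synthetic stand-in: one column that shadows Overall, one unrelated column.
df_demo = pd.DataFrame(
    {
        "Overall": overall,
        "Best Overall Rating": overall + rng.integers(0, 3, size=200),
        "Age": rng.integers(17, 40, size=200),
    }
)

target = (df_demo["Overall"] >= 75).astype(int)
features = df_demo.drop(columns="Overall")

# Absolute Pearson correlation of every feature with the binary target.
correlation_with_target = features.corrwith(target).abs().sort_values(ascending=False)
suspects = correlation_with_target[correlation_with_target > 0.7].index.tolist()

print(correlation_with_target)
print("possible leakage:", suspects)
```

A flagged column is not automatically forbidden, but it deserves a closer look before it goes into `ml_features`.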

Once you identify the features, you will then **split the data** into Training set and Testing/Validation set.
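
When the target is imbalanced, a stratified split keeps the class proportions the same in both sets. The sketch below uses a made-up label vector with 13% positives; the sizes are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 rows, 13% of them labelled 1.
X = np.arange(1000).reshape(-1, 1)
y = np.array([1] * 130 + [0] * 870)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=7, stratify=y
)

# With stratify=y the positive share is preserved in both splits.
print(f"overall: {y.mean():.3f}")
print(f"train  : {y_train.mean():.3f}")
print(f"test   : {y_test.mean():.3f}")
```

Without `stratify`, a small test set can end up with noticeably more or fewer positives than the full data, which makes F1 unstable.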

Depending on the features you selected, **you may need to scale the features**.
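
Whether scaling matters depends on the algorithm: tree-based models such as random forests are generally insensitive to feature scale, while linear and distance-based models usually benefit from it. A minimal sketch with a `StandardScaler` inside a pipeline follows. The data is synthetic and `LogisticRegression` is just an example of a scale-sensitive model, not a recommendation for this project.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=8, random_state=7)
X[:, 0] *= 1_000_000  # mimic a euro-valued column sitting next to 0-100 ratings

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y
)

# The pipeline fits the scaler on the training split only, then reuses it on the test split.
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(X_train, y_train)

test_accuracy = scaled_model.score(X_test, y_test)
print(f"test accuracy: {test_accuracy:.3f}")
```

Putting the scaler in the pipeline avoids fitting it on the test data by accident.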

Next, you will **train your model**. **Choose the algorithm** you are going to use carefully to make sure it gives the best result.

Once you have trained your model, you need to test the model effectiveness. **Make predictions against your Testing/Validation set** and evaluate your model. You are required to **submit the Accuracy Score and F1 score** in the Submission section below.

In the code block below, you can use the comments to guide you on what to do.

We have also provided 3 variables that you must use in your code, `ml_features`, `ml_accuracy` and `ml_f1_score`. You can move the variables around your code, assign values to them, but you cannot delete them.

In [11]:
# Write your code here
np.random.seed(7) # Lucky seven :D, for reproducibility

# Define the target output (Good >= 75)
df_data = df_processed.copy()

# To be honest, Potential or Best Overall Rating alone (a single column) would be enough to predict the target.
# The task only tells us to drop the Overall column, but I want to try the data without those two as well
df_data["Target"] = df_data['Overall'].apply(lambda x: 1 if x >= 75 else 0)
df_data.drop(columns=["Overall", "Potential", "Best Overall Rating"], inplace=True)

# The data is not balanced: there are more 0s than 1s (roughly 6.6:1, i.e. 1405 vs 212 in the test split)
y = df_data.Target
X = df_data.drop(columns="Target")

# Identify the features you will use in your model
ml_features = X.columns

# I should process the data a bit more
# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.1, random_state=7, stratify=y)

# Train the model
model = RandomForestClassifier(n_estimators=5)
model.fit(X_train, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test)

# Evaluate the model
ml_accuracy = accuracy_score(y_test, y_pred)
ml_f1_score = f1_score(y_test, y_pred)

print(f"""
Classification report:
{classification_report(y_test, y_pred)}

F1 score        : {f1_score(y_test, y_pred)}
Accuracy score  : {accuracy_score(y_test, y_pred)}
""")


Classification report:
              precision    recall  f1-score   support

           0       0.99      0.98      0.99      1405
           1       0.89      0.92      0.91       212

    accuracy                           0.98      1617
   macro avg       0.94      0.95      0.95      1617
weighted avg       0.98      0.98      0.98      1617


F1 score        : 0.9074074074074074
Accuracy score  : 0.9752628324056896
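
The reported scores can be reproduced by hand from confusion-matrix counts. The counts below are not printed by the notebook; they are inferred as the integers consistent with the reported support (1405 and 212), F1 score, and accuracy, so treat them as a reconstruction.

```python
# Counts for the positive class (1 = 'Valuable'), inferred from the report above.
tp, fp, fn, tn = 196, 24, 16, 1381
total = tp + fp + fn + tn

precision = tp / (tp + fp)  # of the predicted 1s, how many were right
recall = tp / (tp + fn)     # of the actual 1s, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / total

# Accuracy of a model that always predicts 0, for comparison.
baseline_accuracy = (tn + fp) / total

print(f"precision: {precision:.4f}")          # 0.8909
print(f"recall   : {recall:.4f}")             # 0.9245
print(f"F1       : {f1:.4f}")                 # 0.9074
print(f"accuracy : {accuracy:.4f}")           # 0.9753
print(f"baseline : {baseline_accuracy:.4f}")  # 0.8689
```

The baseline shows why the rubric asks for F1 as well: always predicting 0 already reaches about 87% accuracy on this split while its F1 for class 1 would be 0.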


## Conclusion

The data I used is already straightforward. You can predict whether a player is 'Valuable' using only the stats columns (crossing, shooting, accuracy, etc.). For the data cleaning, I should try filling the missing values instead of dropping them; I dropped them because it was much faster. There are still a lot of columns in use, and they need to be filtered further. I didn't apply any standardization because the result was already good enough. There is a lot left to try, such as bucketing the wage, value, and other stats, or using a scaler. I also still don't know what the "Special" column represents.

I should also try plotting the data. It would help me see what the data looks like and learn more about the columns. I didn't do it because the result was already good, but I may add it if I have the time. Sorry :(
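
One of the ideas mentioned above, bucketing the wage, could be sketched with quantile bins. The wage figures and the choice of four buckets are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical wages in euros.
wage = pd.Series(
    [115000, 370000, 195000, 290000, 41000, 9000, 2000, 500], name="Wage"
)

# Quantile-based buckets: each bucket receives roughly the same number of players.
# labels=False returns integer codes (0 = lowest quartile, 3 = highest).
wage_bucket = pd.qcut(wage, q=4, labels=False)

print(pd.DataFrame({"Wage": wage, "Wage Bucket": wage_bucket}))
```

Quantile bins cope with the long right tail of wages better than equal-width bins, which would put almost every player in the lowest bucket.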

## Submission

Once you are satisfied with the performance of your model, run the code block below to submit your project.


In [12]:
# Submit Method

# Do not change the code below
question_id = "01_ml_project_features"
submit(student_id, name, assignment_id, str(ml_features), question_id, drive_link)
question_id = "02_ml_project_accuracy"
submit(student_id, name, assignment_id, str(ml_accuracy), question_id, drive_link)
question_id = "03_ml_project_f1score"
submit(student_id, name, assignment_id, str(ml_f1_score), question_id, drive_link)

'Assignment successfully submitted'

## FIN