# Project: Machine Learning

**Instructions for Students:**

Please carefully follow these steps to complete and submit your project:

1. **Make a copy of the Project**: Copy this project to your own Google Drive or download it locally, and work only on that copy. The master project is **Read-Only**: you can edit it, but your changes will not be saved when you close it. To avoid losing your work, always work on your own copy.

2. **Completing the Project**: You are required to work on and complete all tasks in the provided project. Be disciplined and ensure that you thoroughly engage with each task.
   
3. **Creating a Google Drive Folder**: Each of you must create a new folder on your Google Drive. This will be the repository for all your completed project files, aiding you in keeping your work organized and accessible.
   
4. **Uploading Completed Project**: Upon completion of your project, upload all necessary files, including code, reports, and related documents, to the Google Drive folder you created. Save the folder link in the 'Student Identity' section and also provide it as the last parameter of the provided `submit` function.
   
5. **Sharing Folder Link**: You're required to share the link to your project Google Drive folder. This is crucial for the submission and evaluation of your project.
   
6. **Setting Permission to Public**: Please make sure your Google Drive folder is set to public. This allows your instructor to access your solutions and assess your work correctly.

Adhering to these procedures will facilitate a smooth project evaluation process for you and the reviewers.

## Project Description

In this Machine Learning Project, you will create your own supervised Machine Learning (ML) model. We will use the full FIFA21 Dataset and we will identify players that are above average.

We will use the column "Overall" with a threshold of 75 to define players that are 'Valuable'. This becomes the target output that a supervised ML model needs. Because "Overall" is the target output, you cannot use "Overall" in your features; this is explained further below.

This project will provide a comprehensive overview of your abilities in machine learning, from understanding the problem, choosing the right model, training, and optimizing it.

## Grading Criteria

Your score will be awarded based on the following criteria:
* 100: The model has an accuracy of more than 90% and an F1 score of more than 80%. This model is excellent and demonstrates a strong understanding of the task.
* 90: The model has an accuracy of more than 85% and an F1 score of more than 75%. This model is very good, with some room for improvement.
* 80: The model has an accuracy of more than 80% and an F1 score between 70% and 75%. This model is fairly good but needs improvement in balancing precision and recall.
* 70: The model has an accuracy of more than 70% and an F1 score between 60% and 70%. This model is below average and needs significant improvement.
* 60 or below: The model has an accuracy of less than 70% or an F1 score of less than 60%, or the student did not submit the accuracy and F1 score. This model is poor and needs considerable improvement.

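The criteria above can be read as a lookup from the two metrics to a score. The sketch below is one possible reading, not the grader's actual logic: `project_score` is a hypothetical helper, it treats each F1 range as a lower bound only, and it uses strict "more than" comparisons at every boundary.

```python
def project_score(accuracy: float, f1: float) -> int:
    """Map accuracy and F1 (both as fractions in [0, 1]) to a rubric score.

    Assumption: each tier is a pair of lower bounds, checked from best to worst.
    """
    tiers = [
        (100, 0.90, 0.80),
        (90, 0.85, 0.75),
        (80, 0.80, 0.70),
        (70, 0.70, 0.60),
    ]
    for score, min_accuracy, min_f1 in tiers:
        if accuracy > min_accuracy and f1 > min_f1:
            return score
    return 60  # the "60 or below" tier


print(project_score(0.98, 0.92))  # 100
print(project_score(0.82, 0.72))  # 80
```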
Remember to make a copy of this notebook in your Google Drive and work in your own copy.

Happy modeling!

>Note: If you get an accuracy of 100% and an F1 score of 100%, it may earn you a good grade, but it usually indicates overfitting or target information leaking into your features.

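One way to check for this is to compare the score on the training set with the score on held-out data and with cross-validated scores. The sketch below uses synthetic data from `make_classification` as a stand-in for the FIFA features; the sample size, model settings, and fold count are arbitrary assumptions. A large gap between training and test scores suggests the model is memorizing the training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the project data (not the real FIFA dataset)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

# Compare performance on seen data, unseen data, and across folds
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")

print(f"Train accuracy: {train_acc:.3f}")
print(f"Test accuracy:  {test_acc:.3f}")
print(f"5-fold CV F1:   {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
```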
## Student Identity

In [None]:
# @title #### Student Identity
student_id = "REA6UCWBO" # @param {type:"string"}
name = "Ida Bagus Teguh Teja Murti" # @param {type:"string"}
drive_link = "https://colab.research.google.com/drive/1uCtzZEddZL4a2yGhUKeOWvSCQQy9iHhj?usp=sharing"  # @param {type:"string"}

assignment_id = "00_ml_project"

# Import grader package
!pip install rggrader
from rggrader import submit, submit_image

Collecting rggrader
  Downloading rggrader-0.1.6-py3-none-any.whl.metadata (485 bytes)
Downloading rggrader-0.1.6-py3-none-any.whl (2.5 kB)
Installing collected packages: rggrader
Successfully installed rggrader-0.1.6


In [None]:
# Write any package/module installation that you need
# pip install goes here, this helps declutter your output below



## Load the dataset and clean it

In this task, you will prepare and load your dataset. You need to download the full FIFA 21 Dataset from the link here: [Kaggle FIFA Player Stats Database](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA21_official_data.csv).

>Note: Make sure you download FIFA 21 dataset.
>
>![FIFA21 Dataset](https://storage.googleapis.com/rg-ai-bootcamp/projects/fifa21_dataset-min.png)

After you download the dataset, import it and clean the data. For example, there may be empty cells that you need to fill, or columns that you need to convert to numeric values for analysis. Identify the incomplete data and fix it.

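As a rough illustration of converting text columns to numbers, the sketch below works on a small toy frame. The string formats (`'€1.5M'`, `'€560K'`, `'170cm'`) are assumptions about how such columns may look, `money_to_float` is a hypothetical helper, and filling with the median is just one possible choice. Check the real columns before reusing any of this.

```python
import pandas as pd


def money_to_float(value):
    """Convert strings such as '€1.5M' or '€560K' to a float (assumed format)."""
    if pd.isna(value):
        return float("nan")
    text = str(value).replace("€", "").strip()
    multiplier = 1.0
    if text.endswith("M"):
        multiplier, text = 1_000_000.0, text[:-1]
    elif text.endswith("K"):
        multiplier, text = 1_000.0, text[:-1]
    return float(text) * multiplier


# Toy rows with assumed formats; these are not rows from the real file
toy = pd.DataFrame({
    "Value": ["€1.5M", "€560K", None],
    "Height": ["170cm", "185cm", "178cm"],
})

# Parse the money strings, then fill the missing entry with the median
toy["Value"] = toy["Value"].apply(money_to_float)
toy["Value"] = toy["Value"].fillna(toy["Value"].median())

# Strip the unit suffix and convert to a numeric dtype
toy["Height"] = pd.to_numeric(
    toy["Height"].str.replace("cm", "", regex=False), errors="coerce"
)

print(toy)
```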
In the code block below, you can use the comments to guide you on what to do.

In [None]:
#!/bin/bash
!curl -L -o fifa-player-stats-database.zip\
  https://www.kaggle.com/api/v1/datasets/download/bryanb/fifa-player-stats-database
!unzip fifa-player-stats-database.zip -d fifa-player-stats-database
!mv ./fifa-player-stats-database/FIFA21_official_data.csv ./

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100 13.2M  100 13.2M    0     0  25.1M      0 --:--:-- --:--:-- --:--:-- 71.2M
Archive:  fifa-player-stats-database.zip
  inflating: fifa-player-stats-database/FIFA17_official_data.csv  
  inflating: fifa-player-stats-database/FIFA18_official_data.csv  
  inflating: fifa-player-stats-database/FIFA19_official_data.csv  
  inflating: fifa-player-stats-database/FIFA20_official_data.csv  
  inflating: fifa-player-stats-database/FIFA21_official_data.csv  
  inflating: fifa-player-stats-database/FIFA22_official_data.csv  
  inflating: fifa-player-stats-database/FIFA23_official_data.csv  


In [None]:
import pandas as pd

# Load your data
# Make sure to specify the correct path to the dataset file
df = pd.read_csv('./FIFA21_official_data.csv')

# Check your data for empty cells
print(df.isnull().sum())

# Fill the empty cells with data or drop the column
# This is a decision you need to make based on the context of your analysis
# Example: Fill numeric columns with the mean
for col in df.select_dtypes(include='number').columns:
    # Assign the result back instead of using inplace=True on a column selection
    df[col] = df[col].fillna(df[col].mean())

# Example: Fill categorical columns with the mode
for col in df.select_dtypes(include='object').columns:
    df[col] = df[col].fillna(df[col].mode()[0])

# Alternatively, if a column has too many missing values, you might decide to drop it
# df.drop(columns=['column_name'], inplace=True)

# Convert data to numeric values where necessary
# Example: Convert a 'price' column to floats
# df['price'] = pd.to_numeric(df['price'].str.replace('[^0-9.]', '', regex=True), errors='coerce')

# Save the cleaned dataset
df.to_csv('./FIFA21_cleaned_data.csv', index=False)

print("Data cleaning complete. Cleaned data saved.")


ID                        0
Name                      0
Age                       0
Photo                     0
Nationality               0
                       ... 
GKReflexes                0
Best Position             0
Best Overall Rating       0
Release Clause         1629
DefensiveAwareness      942
Length: 65, dtype: int64




Data cleaning complete. Cleaned data saved.


## Build and Train your model

In this task you will analyze the data and select the features that are best at predicting whether a player is a 'Valuable' player or not.

The first step is to **define the target output** that you will use for training. Here's an example of how to create a target output:
- `df['OK Player'] = df['Overall'].apply(lambda x: 1 if x >= 50 else 0) # Define the OK Player using a threshold of 50.`

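The same idea can be sketched on a toy frame with the project's threshold of 75. The rows and player names here are made up, and treating a rating of exactly 75 as 'Valuable' (`>=`) is an assumption. Checking the class balance afterwards is useful because an imbalanced target makes accuracy alone misleading.

```python
import pandas as pd

# Toy rows, not taken from the real dataset
toy = pd.DataFrame({
    "Name": ["Player A", "Player B", "Player C", "Player D"],
    "Overall": [93, 75, 74, 60],
})

# Vectorized comparison; assumes a rating of exactly 75 counts as 'Valuable'
toy["Valuable Player"] = (toy["Overall"] >= 75).astype(int)

# Share of each class in the target
print(toy["Valuable Player"].value_counts(normalize=True))
```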
Next you will **identify the features** that will best predict a 'Valuable' player. You are required to **submit the features you selected** in the Submission section below.

> **Because we use the "Overall" as our target output, the use of "Overall" in your features is not allowed. You will automatically get 0 if you submit "Overall" in your features. The use of "Best Overall Rating" is also not allowed and will automatically get you a score of 0.**

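A small guard can catch a forbidden column before you submit. This is only a sketch: `FORBIDDEN_FEATURES` and `check_features` are hypothetical names, not part of the provided grader package.

```python
# Columns the project rules say must not be used as features
FORBIDDEN_FEATURES = {"Overall", "Best Overall Rating"}


def check_features(features):
    """Return the feature list unchanged, or raise if it contains a forbidden column."""
    leaked = FORBIDDEN_FEATURES.intersection(features)
    if leaked:
        raise ValueError(f"Forbidden features selected: {sorted(leaked)}")
    return list(features)


print(check_features(["Age", "Potential", "Skill Moves"]))
```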
Once you identify the features, you will then **split the data** into Training set and Testing/Validation set.

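When one class is much rarer than the other, a stratified split keeps the class ratio roughly equal in both sets. The sketch below uses random synthetic data with an assumed positive rate of about 10%; the `stratify=y` argument of `train_test_split` does the work.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic features and an imbalanced binary target (about 10% positives)
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.1).astype(int)

# stratify=y preserves the class proportions in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Positive rate overall:", y.mean())
print("Positive rate train:  ", y_train.mean())
print("Positive rate test:   ", y_test.mean())
```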
Depending on the features you selected, **you may need to scale the features**.

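Scaling mostly matters for models that are sensitive to feature magnitudes, such as logistic regression; tree-based models generally do not need it. The sketch below uses synthetic data and a logistic regression chosen purely for illustration. Wrapping the scaler and the model in one pipeline means the scaler is fitted on the training data only, so no test-set statistics leak into training.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data; one feature is inflated to mimic very different units
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X[:, 0] *= 1000
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# fit() learns the scaling from X_train only; score() reuses it on X_test
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipeline.fit(X_train, y_train)

test_accuracy = pipeline.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```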
Now you will **train your model**. **Choose the algorithm** you are going to use carefully to make sure it gives the best result.

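One common way to choose between algorithms is to compare their cross-validated scores on the training data. The sketch below does this for two candidates on synthetic, mildly imbalanced data; the candidate list, fold count, and F1 scoring are illustrative assumptions rather than a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data with roughly 20% positives
X, y = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=1
)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=1),
}

# Mean 5-fold F1 score per candidate
results = {}
for name, estimator in candidates.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring="f1")
    results[name] = scores.mean()
    print(f"{name}: mean F1 = {results[name]:.3f}")

best_name = max(results, key=results.get)
print("Best candidate:", best_name)
```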
Once you have trained your model, you need to test its effectiveness. **Make predictions against your Testing/Validation set** and evaluate your model. You are required to **submit the Accuracy Score and F1 score** in the Submission section below.

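To see how the two required metrics relate, the sketch below computes them on a tiny hand-made set of labels (made up for illustration). The confusion matrix shows where the errors fall: rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Hand-made labels: 3 true positives, 3 true negatives, 1 false positive, 1 false negative
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)  # (3 + 3) / 8
f1 = f1_score(y_true, y_pred)              # precision = recall = 3 / 4
cm = confusion_matrix(y_true, y_pred)      # [[TN, FP], [FN, TP]]

print("Accuracy:", accuracy)
print("F1 score:", f1)
print(cm)
```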
In the code block below, you can use the comments to guide you on what to do.

We have also provided 3 variables that you must use in your code, `ml_features`, `ml_accuracy` and `ml_f1_score`. You can move the variables around your code, assign values to them, but you cannot delete them.

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score
import numpy as np



In [None]:
# Load your data
df = pd.read_csv('./FIFA21_cleaned_data.csv')

# Define the target output (Valuable Player with threshold of 75)
df['Valuable Player'] = df['Overall'].apply(lambda x: 1 if x >= 75 else 0)

# Identify the features you will use in your model
# Avoid using 'Overall' and 'Best Overall Rating'
ml_features = ['Age', 'Potential', 'International Reputation', 'Weak Foot', 'Skill Moves',
               'Work Rate', 'Position', 'Joined', 'Contract Valid Until', 'Height', 'Weight']

In [None]:
# Assuming 'Joined' is a date column in your DataFrame
# Convert 'Joined' to datetime
df['Joined'] = pd.to_datetime(df['Joined'], errors='coerce')

# Example: Extract year from 'Joined' and use it as a feature
df['Joined_Year'] = df['Joined'].dt.year

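# You can also calculate the signed number of days between the join date and a reference date
# (the value is negative for players who joined before the reference date)
reference_date = pd.to_datetime('2020-01-01')  # Example reference date
df['Days_Since_Joined'] = (df['Joined'] - reference_date).dt.days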

# Now update ml_features to include 'Joined_Year' and 'Days_Since_Joined' instead of 'Joined'
ml_features.remove('Joined')
ml_features.extend(['Joined_Year', 'Days_Since_Joined'])

# Convert 'Contract Valid Until' to datetime and then to a more useful numeric feature
df['Contract Valid Until'] = pd.to_datetime(df['Contract Valid Until'], errors='coerce')

# Extract year from 'Contract Valid Until' as a feature (or any other relevant extraction)
df['Contract Year'] = df['Contract Valid Until'].dt.year

# You can also calculate the number of days until the contract expires from a reference date
reference_date = pd.to_datetime('2020-01-01')  # Example reference date
df['Days Until Contract Expires'] = (df['Contract Valid Until'] - reference_date).dt.days

# Now update ml_features to include these new columns instead of 'Contract Valid Until'
if 'Contract Valid Until' in ml_features:
    ml_features.remove('Contract Valid Until')
ml_features.extend(['Contract Year', 'Days Until Contract Expires'])

In [None]:
# One-hot encode the categorical columns
df = pd.get_dummies(df, columns=['Work Rate', 'Position'])

# Include dummy variables and the new date-derived features in the feature list
ml_features = [col for col in df.columns if col in ml_features or col.startswith('Work Rate_') or col.startswith('Position_')]

# Keep only numeric columns. String columns are dropped here, and so are boolean
# dummy columns (the default get_dummies dtype in pandas >= 2.0), so X may
# contain fewer columns than ml_features
X = df[ml_features].select_dtypes(include=[np.number])

In [None]:
# Split data into training set and test set
y = df['Valuable Player']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features (if needed, optional)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train_scaled, y_train)

# Make predictions using the test set
y_pred = model.predict(X_test_scaled)

# Evaluate the model
ml_accuracy = accuracy_score(y_test, y_pred)
ml_f1_score = f1_score(y_test, y_pred)

# Print the results
print("Accuracy:", ml_accuracy)
print("F1 Score:", ml_f1_score)


Accuracy: 0.9801285797779077
F1 Score: 0.9249448123620309


## Submission

Once you are satisfied with the performance of your model, run the code block below to submit your project.


In [None]:
# Submit Method
print(f'ml_features : {str(ml_features)}')
print(f'ml_accuracy : {str(ml_accuracy)}')
print(f'ml_f1_score : {str(ml_f1_score)}')

# Do not change the code below
question_id = "01_ml_project_features"
submit(student_id, name, assignment_id, str(ml_features), question_id, drive_link)
question_id = "02_ml_project_accuracy"
submit(student_id, name, assignment_id, str(ml_accuracy), question_id, drive_link)
question_id = "03_ml_project_f1score"
submit(student_id, name, assignment_id, str(ml_f1_score), question_id, drive_link)

ml_features : ['Age', 'Potential', 'International Reputation', 'Weak Foot', 'Skill Moves', 'Height', 'Weight', 'Joined_Year', 'Days_Since_Joined', 'Contract Year', 'Days Until Contract Expires', 'Work Rate_High/ High', 'Work Rate_High/ Low', 'Work Rate_High/ Medium', 'Work Rate_Low/ High', 'Work Rate_Low/ Low', 'Work Rate_Low/ Medium', 'Work Rate_Medium/ High', 'Work Rate_Medium/ Low', 'Work Rate_Medium/ Medium', 'Work Rate_N/A/ N/A', 'Position_<span class="pos pos0">GK', 'Position_<span class="pos pos10">CDM', 'Position_<span class="pos pos11">LDM', 'Position_<span class="pos pos12">RM', 'Position_<span class="pos pos13">RCM', 'Position_<span class="pos pos14">CM', 'Position_<span class="pos pos15">LCM', 'Position_<span class="pos pos16">LM', 'Position_<span class="pos pos17">RAM', 'Position_<span class="pos pos18">CAM', 'Position_<span class="pos pos19">LAM', 'Position_<span class="pos pos2">RWB', 'Position_<span class="pos pos20">RF', 'Position_<span class="pos pos21">CF', 'Position

'Assignment successfully submitted'

## FIN