## Project Description

In this machine learning project, you will build your own supervised machine learning (ML) model. We will use the full FIFA 21 dataset to identify players that are above average.

We will use the column "Overall" with a threshold of 75 to define players that are 'Valuable'. This becomes the target output that a supervised ML model needs. Because "Overall" is the target output, you cannot use "Overall" in your features. This is explained further below.

This project will provide a comprehensive overview of your abilities in machine learning, from understanding the problem and choosing the right model to training and optimizing it.

In [6]:
# Write any package/module installation that you need
# pip install goes here, this helps declutter your output below

import numpy as np 
import pandas as pd

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

## Load the dataset and clean it

In this task, you will prepare and load your dataset. You need to download the full FIFA 21 Dataset from the link here: [Kaggle FIFA Player Stats Database](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?resource=download&select=FIFA21_official_data.csv).

>Note: Make sure you download the FIFA 21 dataset.
>
>![FIFA21 Dataset](https://storage.googleapis.com/rg-ai-bootcamp/projects/fifa21_dataset-min.png)

After you download the dataset, import it and clean the data. For example, some cells may be empty and need to be filled, and some columns may need to be converted to numeric values before analysis. Identify the data that is incomplete and fix it.
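As a rough illustration of these cleaning steps, the sketch below works on a tiny made-up frame rather than the real dataset. The `Value` column, its `€1.5M` style formatting, and the helper `to_euros` are assumptions chosen for the example, so check them against what the CSV actually contains before reusing any of it.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real data; columns and formats are assumed.
toy_df = pd.DataFrame({
    "Age": [21, 30, np.nan, 25],
    "Value": ["€1.5M", "€750K", "€0", None],
})

# Count missing values per column, largest first.
missing_counts = toy_df.isnull().sum().sort_values(ascending=False)
print(missing_counts)


def to_euros(text):
    """Hypothetical helper: turn strings like '€1.5M' or '€750K' into floats."""
    if pd.isna(text):
        return np.nan
    text = text.replace("€", "")
    multiplier = 1.0
    if text.endswith("M"):
        multiplier, text = 1_000_000.0, text[:-1]
    elif text.endswith("K"):
        multiplier, text = 1_000.0, text[:-1]
    return float(text) * multiplier


# Convert the text column to numbers, then fill the gaps with the column median.
toy_df["Value"] = toy_df["Value"].apply(to_euros)
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Value"] = toy_df["Value"].fillna(toy_df["Value"].median())
print(toy_df)
```

Filling with the median is only one option. Dropping rows or columns, as the cell below does, can be just as reasonable when few values are missing or a column carries little information.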

In the code block below, you can use the comments to guide you on what to do.

In [None]:
# Write your preprocessing and data cleaning here

# Load your data
fifa21_df = pd.read_csv("FIFA21_official_data.csv")

# Check your data for empty cells (missing values per column, largest first)
nullAmount = fifa21_df.isnull().sum().sort_values(ascending=False).to_string()
print(nullAmount)

# Fill the empty cells with data, or drop the affected column or rows
fifa21_df = fifa21_df.drop(['Loaned From'], axis=1)
fifa21_df = fifa21_df.dropna(subset=['Age', 'Potential', 'Special', 'International Reputation', 'Best Overall Rating'])

# Print a summary instead of every row to keep the output readable
fifa21_df.info()


## Build and Train your model

In this task, you will analyze the data and select the features that best predict whether a player is 'Valuable' or not.

The first step is to **define the target output** that you will use for training. Here's an example of how to create a target output:
- `df['OK Player'] = df['Overall'].apply(lambda x: 1 if x >= 50 else 0) #Define the OK Player using a threshold of 50.`

Next you will **identify the features** that will best predict a 'Valuable' player. You are required to **submit the features you selected** in the Submission section below.
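One simple heuristic for shortlisting features is to rank candidate columns by the absolute value of their correlation with the target. The sketch below uses synthetic data with placeholder column names (`informative`, `noise`, `target`), not the FIFA columns, and correlation is only one of several possible selection criteria.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; the column names are placeholders.
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
toy = pd.DataFrame({
    "informative": signal + rng.normal(scale=0.3, size=n),
    "noise": rng.normal(size=n),
})
toy["target"] = (signal > 0.5).astype(int)

# Absolute Pearson correlation of each candidate feature with the binary target.
ranking = (
    toy.drop(columns="target")
    .corrwith(toy["target"])
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

A high correlation can also be a warning sign. A column that is nearly a copy of the target leaks the answer, which is why "Overall" and "Best Overall Rating" are excluded by the rule below.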

> **Because we use the "Overall" as our target output, the use of "Overall" in your features is not allowed. You will automatically get 0 if you submit "Overall" in your features. The use of "Best Overall Rating" is also not allowed and will automatically get you a score of 0.**

Once you have identified the features, you will then **split the data** into a training set and a testing/validation set.
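When the positive class is rare, a plain random split can leave the two sets with different class ratios. A minimal sketch of a stratified split on synthetic data follows; the 10% positive rate and the `random_state` value are arbitrary choices for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 1000 rows, 3 features, 10% positives.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.zeros(1000, dtype=int)
y[:100] = 1

# stratify=y keeps the class ratio roughly equal in both sets;
# random_state makes the split repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```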

Depending on the features you selected, **you may need to scale the features**.
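If you do scale, a common convention is to learn the scaling statistics from the training set only and then apply them unchanged to the test set, so that no information from the test rows reaches the model. A small sketch on synthetic arrays (the `_toy` names are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic arrays standing in for a train/test split.
rng = np.random.default_rng(0)
X_train_toy = rng.normal(loc=50, scale=10, size=(80, 2))
X_test_toy = rng.normal(loc=50, scale=10, size=(20, 2))

scaler_toy = StandardScaler()
# fit_transform learns the mean and standard deviation from the training rows.
X_train_scaled = scaler_toy.fit_transform(X_train_toy)
# transform reuses those training statistics on the test rows.
X_test_scaled = scaler_toy.transform(X_test_toy)

print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
```

The scaled test set will generally not have a mean of exactly 0 or a standard deviation of exactly 1, and that is expected.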

Now you will **train your model**. **Choose the algorithm** you are going to use carefully to make sure it gives the best result.

Once you have trained your model, you need to test the model's effectiveness. **Make predictions against your Testing/Validation set** and evaluate your model. You are required to **submit the Accuracy Score and F1 score** in the Submission section below.
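To make the two metrics concrete, the sketch below computes accuracy and F1 by hand from the confusion counts on a small made-up label vector and compares them with scikit-learn's `accuracy_score` and `f1_score`. The labels are invented for illustration only.

```python
from sklearn.metrics import accuracy_score, f1_score

# Made-up labels for illustration (1 = positive class).
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

accuracy_manual = (tp + tn) / len(pairs)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_manual = 2 * precision * recall / (precision + recall)

print(accuracy_manual, accuracy_score(y_true, y_pred))
print(f1_manual, f1_score(y_true, y_pred))
```

Accuracy counts every correct prediction equally, while F1 only looks at the positive class. On an imbalanced target the two can diverge noticeably, which is one reason to report both.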

In the code block below, you can use the comments to guide you on what to do.

We have also provided 3 variables that you must use in your code: `ml_features`, `ml_accuracy` and `ml_f1_score`. You can move the variables around your code and assign values to them, but you cannot delete them.

In [14]:
# Write your code here

# Define the target output (1 = 'Valuable', i.e. Overall >= 75; otherwise 0)
fifa21_df['OK Player'] = fifa21_df['Overall'].apply(lambda x: 1 if x >= 75 else 0)
targetOutput = fifa21_df['OK Player']

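# Identify the features you will use in your model
# Caution: the rules above disallow 'Overall' and 'Best Overall Rating' as features;
# remove 'Best Overall Rating' from this list before submitting
ml_features = fifa21_df[['Age', 'Potential', 'Special', 'International Reputation', 'Best Overall Rating']]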

# Split data into training set and test set
X_train, X_test, y_train, y_test = train_test_split(ml_features, targetOutput, test_size=0.2, train_size=0.8, shuffle=True)

# Scale the features (if needed, optional)
scaler = StandardScaler()

# Fit the scaler on the training set only, then reuse its statistics on the test set
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train the model
training_model = LogisticRegression()
training_model.fit(X_train, y_train)

# Make predictions using the test set
doPred = training_model.predict(X_test)

# Evaluate the model
ml_accuracy = accuracy_score(y_test, doPred)
ml_f1_score = f1_score(y_test, doPred)

print(f"ml_accuracy: {ml_accuracy} \n ml_f1_score: {ml_f1_score}")



ml_accuracy: 0.9815897136177674 
 ml_f1_score: 0.9312977099236641
