# Assignment 1: Implement Gaussian Mixture Model (GMM)

In this task, you are required to implement the GMM algorithm from scratch and apply it to the FIFA 23 Players Dataset and the EastWestAirlines Dataset for clustering. You should evaluate the performance of your clustering results using both qualitative and quantitative measures.

## Datasets
* **FIFA 23 Players Dataset**: This dataset contains detailed attributes of professional soccer players. The objective is to cluster players based on their skills and playing styles. Relevant features for clustering include:
  * Age
  * Overall rating (general skill level)
  * Potential (maximum projected skill level)
  * Value (market price in €)
  * Wage (weekly salary)
  * Shooting, Passing, Dribbling (technical abilities)
  * Defending, Physicality (defensive capabilities)

  You can download the dataset from Kaggle: [FIFA 23 Dataset - Kaggle](https://www.kaggle.com/datasets/bryanb/fifa-player-stats-database?select=FIFA23_official_data.csv).

* **EastWestAirlines Dataset**: This dataset contains information about airline customers and their behaviors. You should preprocess the dataset as necessary before applying the clustering algorithm.

## Tasks

1. Implement the Gaussian Mixture Model algorithm from scratch. Do not use libraries like scikit-learn's GMM implementation for this part.
2. Fit the GMM to both datasets (FIFA 23 Players and EastWestAirlines) and perform clustering.
3. Evaluate your clustering results using the following methods:
   * **Rand Index**: Calculate the Rand Index to compare your clustering results against meaningful labels (e.g., player positions). Read more about the Rand Index here: [Rand Index - Wikipedia](https://en.wikipedia.org/wiki/Rand_index).
   * **Qualitative Evaluation**: For both datasets, visualize and describe the resulting clusters (e.g., scatter plots, pair plots, or other visualizations that highlight the formed groups). For visualizations in Python, you may refer to this guide: [Seaborn Visualization Library](https://seaborn.pydata.org/).
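
The Rand Index is the fraction of sample pairs on which two labelings agree: both put the pair in the same group, or both put it in different groups. The following is a minimal sketch, assuming two equal-length 1-D sequences of labels; the function name `rand_index` is illustrative and not prescribed by the assignment. It counts pairs through a contingency table instead of looping over every pair, which keeps it practical for datasets with thousands of rows.

```python
import numpy as np


def rand_index(labels_true, labels_pred):
    """Rand Index between two labelings of the same samples (hypothetical helper)."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    if labels_true.ndim != 1 or labels_true.shape != labels_pred.shape:
        raise ValueError("expected two 1-D label sequences of equal length")
    n = labels_true.size
    if n < 2:
        raise ValueError("need at least two samples to form a pair")

    # Map arbitrary labels (ints or strings) to 0..k-1 codes.
    _, t = np.unique(labels_true, return_inverse=True)
    _, p = np.unique(labels_pred, return_inverse=True)

    # contingency[i, j] = number of samples with true group i and predicted group j.
    contingency = np.zeros((t.max() + 1, p.max() + 1), dtype=np.int64)
    np.add.at(contingency, (t, p), 1)

    def pairs(x):
        # Number of unordered pairs that can be drawn from x items.
        return x * (x - 1) // 2

    same_both = pairs(contingency).sum()              # together in both labelings
    same_true = pairs(contingency.sum(axis=1)).sum()  # together in labels_true
    same_pred = pairs(contingency.sum(axis=0)).sum()  # together in labels_pred
    total = pairs(n)

    # Agreements = pairs together in both + pairs apart in both.
    return (total + 2 * same_both - same_true - same_pred) / total
```

The index is invariant to how the cluster ids are numbered, so the predicted cluster `0` does not need to correspond to any particular reference label.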

## Hints and Useful Links

* For a detailed explanation of how Gaussian Mixture Models work and how to implement them, see this tutorial: [Gaussian Mixture Model - scikit-learn documentation](https://scikit-learn.org/stable/modules/mixture.html).
* To understand the mathematical background and principles behind GMM, this reference may help: [Mixture Model - Wikipedia](https://en.wikipedia.org/wiki/Mixture_model).
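
For orientation, the standard expectation-maximization (EM) updates for a GMM with $K$ components on $N$ samples can be summarized as follows. The notation ($\gamma_{nk}$ for responsibilities, $\Sigma_k$ for covariances) is a common textbook convention and an assumption here, not something fixed by the assignment.

```latex
\begin{align}
% Mixture density
p(x_n) &= \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k),
\qquad \sum_{k=1}^{K} \pi_k = 1 \\
% E-step: responsibility of component k for sample n
\gamma_{nk} &= \frac{\pi_k \, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
                   {\sum_{j=1}^{K} \pi_j \, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)} \\
% M-step: re-estimate the parameters from the responsibilities
N_k &= \sum_{n=1}^{N} \gamma_{nk} \\
\mu_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \, x_n \\
\Sigma_k^{\text{new}} &= \frac{1}{N_k} \sum_{n=1}^{N} \gamma_{nk} \,
    (x_n - \mu_k^{\text{new}})(x_n - \mu_k^{\text{new}})^{\top} \\
\pi_k^{\text{new}} &= \frac{N_k}{N}
\end{align}
```

The two steps are alternated until the log-likelihood $\sum_{n} \log p(x_n)$ stops improving.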

In [1]:
import numpy as np
import pandas as pd

In [2]:
players = pd.read_csv("FIFA23_official_data.csv")

In [4]:
players.head()

Unnamed: 0,ID,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,Real Face,Position,Joined,Loaned From,Contract Valid Until,Height,Weight,Release Clause,Kit Number,Best Overall Rating
0,209658,L. Goretzka,27,https://cdn.sofifa.net/players/209/658/23_60.png,Germany,https://cdn.sofifa.net/flags/de.png,87,88,FC Bayern München,https://cdn.sofifa.net/teams/21/30.png,...,Yes,"<span class=""pos pos28"">SUB","Jul 1, 2018",,2026,189cm,82kg,€157M,8.0,
1,212198,Bruno Fernandes,27,https://cdn.sofifa.net/players/212/198/23_60.png,Portugal,https://cdn.sofifa.net/flags/pt.png,86,87,Manchester United,https://cdn.sofifa.net/teams/11/30.png,...,Yes,"<span class=""pos pos15"">LCM","Jan 30, 2020",,2026,179cm,69kg,€155M,8.0,
2,224334,M. Acuña,30,https://cdn.sofifa.net/players/224/334/23_60.png,Argentina,https://cdn.sofifa.net/flags/ar.png,85,85,Sevilla FC,https://cdn.sofifa.net/teams/481/30.png,...,No,"<span class=""pos pos7"">LB","Sep 14, 2020",,2024,172cm,69kg,€97.7M,19.0,
3,192985,K. De Bruyne,31,https://cdn.sofifa.net/players/192/985/23_60.png,Belgium,https://cdn.sofifa.net/flags/be.png,91,91,Manchester City,https://cdn.sofifa.net/teams/10/30.png,...,Yes,"<span class=""pos pos13"">RCM","Aug 30, 2015",,2025,181cm,70kg,€198.9M,17.0,
4,224232,N. Barella,25,https://cdn.sofifa.net/players/224/232/23_60.png,Italy,https://cdn.sofifa.net/flags/it.png,86,89,Inter,https://cdn.sofifa.net/teams/44/30.png,...,Yes,"<span class=""pos pos13"">RCM","Sep 1, 2020",,2026,172cm,68kg,€154.4M,23.0,


In [28]:
players.columns

Index(['ID', 'Name', 'Age', 'Photo', 'Nationality', 'Flag', 'Overall',
       'Potential', 'Club', 'Club Logo', 'Value', 'Wage', 'Special',
       'Preferred Foot', 'International Reputation', 'Weak Foot',
       'Skill Moves', 'Work Rate', 'Body Type', 'Real Face', 'Position',
       'Joined', 'Loaned From', 'Contract Valid Until', 'Height', 'Weight',
       'Release Clause', 'Kit Number', 'Best Overall Rating'],
      dtype='object')

In [43]:
relevant_features = ['Age', 'Overall', 'Potential', 'Value', 'Wage',
                     'Special', 'Preferred Foot', 'Weak Foot', 'Skill Moves',
                     'International Reputation', 'Work Rate']

df = players[relevant_features]
D = df.shape[1]  # number of features
N = df.shape[0]  # number of players
print("Number of players:", N)
print("Number of features:", D)

Number of players: 17660
Number of features: 11


In [44]:
# Reassign instead of renaming in place: df is a slice of players, and
# in-place changes on a slice trigger pandas' SettingWithCopyWarning.
df = df.rename(columns=lambda col: col.replace(' ', '_'))
df.columns

Index(['Age', 'Overall', 'Potential', 'Value', 'Wage', 'Special',
       'Preferred_Foot', 'Weak_Foot', 'Skill_Moves',
       'International_Reputation', 'Work_Rate'],
      dtype='object')

In [45]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17660 entries, 0 to 17659
Data columns (total 11 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Age                       17660 non-null  int64  
 1   Overall                   17660 non-null  int64  
 2   Potential                 17660 non-null  int64  
 3   Value                     17660 non-null  object 
 4   Wage                      17660 non-null  object 
 5   Special                   17660 non-null  int64  
 6   Preferred_Foot            17660 non-null  object 
 7   Weak_Foot                 17660 non-null  float64
 8   Skill_Moves               17660 non-null  float64
 9   International_Reputation  17660 non-null  float64
 10  Work_Rate                 17660 non-null  object 
dtypes: float64(3), int64(4), object(4)
memory usage: 1.5+ MB


In [46]:
df.head()

Unnamed: 0,Age,Overall,Potential,Value,Wage,Special,Preferred_Foot,Weak_Foot,Skill_Moves,International_Reputation,Work_Rate
0,27,87,88,€91M,€115K,2312,Right,4.0,3.0,4.0,High/ Medium
1,27,86,87,€78.5M,€190K,2305,Right,3.0,4.0,3.0,High/ High
2,30,85,85,€46.5M,€46K,2303,Left,3.0,3.0,2.0,High/ High
3,31,91,91,€107.5M,€350K,2303,Right,5.0,4.0,4.0,High/ High
4,25,86,89,€89.5M,€110K,2296,Right,3.0,3.0,3.0,High/ High
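
The `Value` and `Wage` columns are stored as strings such as `€91M` and `€115K` (dtype `object` in the `df.info()` output), so they have to be converted to numbers before they can enter a Gaussian model. The following is a minimal sketch, assuming the only suffixes are `K` (thousands) and `M` (millions); the helper name `parse_euro` and the small `sample` frame are illustrative, with values copied from the rows shown above.

```python
import pandas as pd


def parse_euro(text):
    """Convert strings like '€91M' or '€115K' to a float amount in euros.

    Hypothetical helper; assumes only the suffixes K and M occur.
    """
    multipliers = {"K": 1e3, "M": 1e6}
    cleaned = str(text).strip().lstrip("€")
    if not cleaned:
        return float("nan")
    suffix = cleaned[-1].upper()
    if suffix in multipliers:
        return float(cleaned[:-1]) * multipliers[suffix]
    return float(cleaned)  # plain amount without a suffix, e.g. '€0'


# Small stand-in for df, using the first three rows printed above.
sample = pd.DataFrame({
    "Value": ["€91M", "€78.5M", "€46.5M"],
    "Wage": ["€115K", "€190K", "€46K"],
})
for col in ["Value", "Wage"]:
    sample[col] = sample[col].map(parse_euro)
```

After conversion, both columns span several orders of magnitude more than `Age` or `Overall`, so rescaling the features before fitting is worth considering.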


In [47]:
# Null values: report every column that contains missing entries
for col_name in df.columns:
    n_null = df[col_name].isnull().sum()
    if n_null:
        print(f'Column {col_name} has {n_null} null values')

In [50]:
df['Preferred_Foot'].unique()

array(['Right', 'Left'], dtype=object)

In [51]:
df['Work_Rate'].unique()

array(['High/ Medium', 'High/ High', 'Medium/ Medium', 'Medium/ High',
       'High/ Low', 'Low/ Low', 'Low/ High', 'Medium/ Low', 'Low/ Medium',
       'N/A/ N/A'], dtype=object)
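
`Preferred_Foot` and `Work_Rate` are categorical, so they need a numeric encoding before clustering. The following is one possible sketch, not the only reasonable choice: it maps the foot to 0/1 and splits `Work_Rate` on `"/ "` into two ordinal columns. The column names `Work_Rate_Attack` and `Work_Rate_Defense` assume the usual FIFA convention of attacking rate first and defensive rate second, which the dataset itself does not state; `N/A` entries become `NaN` and still have to be handled afterwards.

```python
import numpy as np
import pandas as pd

# Ordinal codes for the three work-rate levels; 'N/A' is left unmapped (NaN).
LEVELS = {"Low": 0, "Medium": 1, "High": 2}


def encode_categoricals(frame):
    """Return a copy of frame with numeric foot and work-rate columns (hypothetical helper)."""
    out = frame.copy()
    out["Preferred_Foot"] = (out["Preferred_Foot"] == "Right").astype(int)
    # 'High/ Medium' -> ['High', 'Medium']; 'N/A/ N/A' -> ['N/A', 'N/A']
    parts = out["Work_Rate"].str.split("/ ", expand=True)
    out["Work_Rate_Attack"] = parts[0].map(LEVELS)
    out["Work_Rate_Defense"] = parts[1].map(LEVELS)
    return out.drop(columns="Work_Rate")


# Small stand-in for df with values taken from the unique() outputs above.
sample = pd.DataFrame({
    "Preferred_Foot": ["Right", "Left", "Right"],
    "Work_Rate": ["High/ Medium", "N/A/ N/A", "Low/ High"],
})
encoded = encode_categoricals(sample)
```

Treating the levels as ordinal keeps the feature count low; one-hot encoding is an alternative if the ordering should not be imposed.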

## Algorithm

In [None]:
# Parameter initialization (theta_old)

K = 3  # number of mixture components

np.random.seed(42)
# Mixing weights; they must sum to 1
pi1 = 0.35
pi2 = 0.05
pi3 = 0.6
# Initial 2-D means: a fixed center plus standard normal noise
mu_1 = np.random.randn(2,) + np.reshape([3,1],(2,))
mu_2 = np.random.randn(2,) + np.reshape([7,2],(2,))
mu_3 = np.random.randn(2,) + np.reshape([3,8],(2,))

In [8]:
np.random.randn(2,).shape

(2,)

In [7]:
np.reshape([3,1],(2,))

array([3, 1])
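
With weights and means initialized, the next step of EM is the E-step, which computes how responsible each component is for each sample. The following is a minimal sketch that reuses the weights, seed, and mean centers from the initialization cell. The identity covariances, the three test points in `X`, and the helper names `gaussian_pdf` and `e_step` are assumptions added for illustration; the notebook has not defined them.

```python
import numpy as np


def gaussian_pdf(X, mean, cov):
    """Multivariate normal density evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mean
    # Squared Mahalanobis distance, computed without forming the inverse explicitly.
    solved = np.linalg.solve(cov, diff.T).T
    maha = np.sum(diff * solved, axis=1)
    _, logdet = np.linalg.slogdet(cov)
    log_norm = -0.5 * (D * np.log(2 * np.pi) + logdet)
    return np.exp(log_norm - 0.5 * maha)


def e_step(X, weights, means, covs):
    """Return an (N, K) matrix of responsibilities; each row sums to 1."""
    weighted = np.column_stack([
        w * gaussian_pdf(X, m, c) for w, m, c in zip(weights, means, covs)
    ])
    return weighted / weighted.sum(axis=1, keepdims=True)


# Same seed, weights, and mean centers as the initialization cell.
np.random.seed(42)
weights = np.array([0.35, 0.05, 0.6])
means = np.array([
    np.random.randn(2) + [3, 1],
    np.random.randn(2) + [7, 2],
    np.random.randn(2) + [3, 8],
])
covs = np.array([np.eye(2)] * 3)  # assumption: identity covariance per component

# Assumption: three test points placed at the mean centers.
X = np.array([[3.0, 1.0], [7.0, 2.0], [3.0, 8.0]])
resp = e_step(X, weights, means, covs)
hard_labels = resp.argmax(axis=1)  # most responsible component per point
```

For high-dimensional or badly scaled data the densities can underflow, so a full implementation would typically work with log-densities and a log-sum-exp normalization instead of the direct ratio used here.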