Inspired by:

- [An Introduction to Principal Component Analysis (PCA) with 2018 World Soccer Players Data](https://blog.exploratory.io/an-introduction-to-principal-component-analysis-pca-with-2018-world-soccer-players-data-810d84a14eab): The primary source of this lab.

- [Using PCA to See Which Countries have Better Players for World Cup Games](https://blog.exploratory.io/using-pca-to-see-which-countries-have-better-players-for-world-cup-games-a72f91698b95)

Helpful resources:

- [PCA clearly explained —When, Why, How to use it and feature importance: A guide in Python](https://towardsdatascience.com/pca-clearly-explained-how-when-why-to-use-it-and-feature-importance-a-guide-in-python-7c274582c37e)
- [Principal Component Analysis Visualization by Prasad Ostwal](https://ostwalprasad.github.io/machine-learning/PCA-using-python.html)
- [PCA in 3 steps by Sebastian Raschka](http://sebastianraschka.com/Articles/2015_pca_in_3_steps.html)
- [PCA by plotly](https://plot.ly/ipython-notebooks/principal-component-analysis/)
- [In Depth: Principal Component Analysis](https://jakevdp.github.io/PythonDataScienceHandbook/05.09-principal-component-
- [Principal Component Analysis (PCA) from Scratch](https://drscotthawley.github.io/blog/2019/12/21/PCA-From-Scratch.html)
- [Understanding the Covariance Matrix](https://datascienceplus.com/understanding-the-covariance-matrix/)

# Principal Component Analysis (PCA)

## Data wrangling & EDA

In [2]:
pip install plotly

Collecting plotly
  Downloading plotly-4.12.0-py2.py3-none-any.whl (13.1 MB)
     |████████████████████████████████| 13.1 MB 3.0 MB/s eta 0:00:01
Collecting retrying>=1.3.3
  Downloading retrying-1.3.3.tar.gz (10 kB)
Building wheels for collected packages: retrying
  Building wheel for retrying (setup.py) ... done
  Created wheel for retrying: filename=retrying-1.3.3-py3-none-any.whl size=11430 sha256=d623356410ad220e7aa2e30af7b22e41a3941bb81d7f973ced515cd36ffb3f0a
  Stored in directory: /Users/tendaimunyanyi/Library/Caches/pip/wheels/c4/a7/48/0a434133f6d56e878ca511c0e6c38326907c0792f67b476e56
Successfully built retrying
Installing collected packages: retrying, plotly
Successfully installed plotly-4.12.0 retrying-1.3.3
Note: you may need to restart the kernel to use updated packages.


In [3]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
import plotly.express as px  # if plotly is missing, run "pip install plotly" first

# regression
import sklearn.linear_model as lm
from sklearn import metrics
from sklearn.metrics import mean_squared_error
from sklearn.metrics import explained_variance_score
from sklearn.metrics import r2_score

# validation
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score

# regression feature selection
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_regression
from sklearn.feature_selection import RFE

# pca
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings("ignore")

### Data: FIFA 18 Complete Player Dataset

FIFA 18 Complete Player Dataset: 17k+ players, 70+ attributes extracted from the latest edition of FIFA [Source: Kaggle](https://www.kaggle.com/thec03u5/fifa-18-demo-player-dataset/kernels)

The complete dataset includes:

- Player personal data like Nationality, Photo, Club, Age, Wage, Salary etc.
- **Player skill measures** such as Dribbling, Aggression, GK Skills etc.
- Playing position related data.

The dataset contains **34 measures of the players' skills**. [Source](https://blog.exploratory.io/an-introduction-to-principal-component-analysis-pca-with-2018-world-soccer-players-data-810d84a14eab)

In [4]:
# import CompleteDataset.csv
df = pd.read_csv("data/CompleteDataset.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,Name,Age,Photo,Nationality,Flag,Overall,Potential,Club,Club Logo,...,RB,RCB,RCM,RDM,RF,RM,RS,RW,RWB,ST
0,0,Cristiano Ronaldo,32,https://cdn.sofifa.org/48/18/players/20801.png,Portugal,https://cdn.sofifa.org/flags/38.png,94,94,Real Madrid CF,https://cdn.sofifa.org/24/18/teams/243.png,...,61.0,53.0,82.0,62.0,91.0,89.0,92.0,91.0,66.0,92.0
1,1,L. Messi,30,https://cdn.sofifa.org/48/18/players/158023.png,Argentina,https://cdn.sofifa.org/flags/52.png,93,93,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,...,57.0,45.0,84.0,59.0,92.0,90.0,88.0,91.0,62.0,88.0
2,2,Neymar,25,https://cdn.sofifa.org/48/18/players/190871.png,Brazil,https://cdn.sofifa.org/flags/54.png,92,94,Paris Saint-Germain,https://cdn.sofifa.org/24/18/teams/73.png,...,59.0,46.0,79.0,59.0,88.0,87.0,84.0,89.0,64.0,84.0
3,3,L. Suárez,30,https://cdn.sofifa.org/48/18/players/176580.png,Uruguay,https://cdn.sofifa.org/flags/60.png,92,92,FC Barcelona,https://cdn.sofifa.org/24/18/teams/241.png,...,64.0,58.0,80.0,65.0,88.0,85.0,88.0,87.0,68.0,88.0
4,4,M. Neuer,31,https://cdn.sofifa.org/48/18/players/167495.png,Germany,https://cdn.sofifa.org/flags/21.png,92,92,FC Bayern Munich,https://cdn.sofifa.org/24/18/teams/21.png,...,,,,,,,,,,


In [None]:
# how many rows


In [None]:
# print column names


In [None]:
# overall player score using boxplot


In [None]:
# get the columns associated with skills
skills = ['Acceleration', 'Aggression', 'Agility', 'Balance', 'Ball control',
       'Composure', 'Crossing', 'Curve', 'Dribbling', 'Finishing',
       'Free kick accuracy', 'GK diving', 'GK handling', 'GK kicking',
       'GK positioning', 'GK reflexes', 'Heading accuracy', 'Interceptions',
       'Jumping', 'Long passing', 'Long shots', 'Marking', 'Penalties',
       'Positioning', 'Reactions', 'Short passing', 'Shot power',
       'Sliding tackle', 'Sprint speed', 'Stamina', 'Standing tackle',
       'Strength', 'Vision', 'Volleys']

# print how many skills in the dataset
len(skills)   # 34 skills

In [None]:
# add the overall column to the skills column ... we will use the overall column as y value in regression

# how many columns ... should print 35


In [None]:
# select 34 skills and the overall column ... total 35 columns for further analysis

#head()
df.head()

In [None]:
# print data types


In [None]:
# convert 34 skill columns to numeric


In [None]:
# check missing values


In [None]:
# drop rows with missing values

# print how many rows
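
The wrangling steps above (select the skill columns plus `Overall`, convert to numeric, check and drop missing values) could be sketched roughly as follows. This is a minimal illustration on a small made-up frame, not the real `CompleteDataset.csv`; the `raw` frame, its values, and the assumption that some skill entries are strings that do not parse cleanly as numbers are all hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the FIFA data: skill values stored as strings,
# one of them not cleanly numeric ("70+2") and one missing.
raw = pd.DataFrame({
    "Name": ["A", "B", "C", "D"],
    "Overall": [90, 85, 80, 75],
    "Acceleration": ["89", "70+2", "65", None],
    "Aggression": ["63", "48", "56", "40"],
})

skills = ["Acceleration", "Aggression"]   # the lab uses all 34 skill names
cols = skills + ["Overall"]               # skills plus the regression target

data = raw[cols].copy()

# Coerce every skill column to numeric; strings that cannot be parsed become NaN.
data[skills] = data[skills].apply(pd.to_numeric, errors="coerce")

print(data.dtypes)             # data types after conversion
print(data.isnull().sum())     # missing values per column

data = data.dropna()           # drop rows with any missing value
print(data.shape[0])           # number of rows remaining
```

Using `errors="coerce"` turns unparseable entries into `NaN` so they are removed by the same `dropna()` call as genuinely missing values; whether dropping or repairing such rows is preferable depends on the data.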


### Correlation analysis

In [None]:
# a quick correlation analysis


In [None]:
# heatmap for correlation analysis
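
A quick correlation analysis can be sketched with `DataFrame.corr()`. The example below uses made-up columns (the `toy` frame and its values are assumptions for illustration only); the resulting matrix is the kind of object that could be passed to `sns.heatmap` for the heatmap cell.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
speed = rng.normal(size=n)

# Hypothetical columns: two driven by the same latent "speed", one independent.
toy = pd.DataFrame({
    "Acceleration": speed + 0.1 * rng.normal(size=n),
    "Sprint speed": speed + 0.1 * rng.normal(size=n),
    "Strength": rng.normal(size=n),
})

corr = toy.corr()              # Pearson correlation matrix
print(corr.round(2))

# Rank variable pairs by correlation (upper triangle only, diagonal excluded).
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna().sort_values(ascending=False)
print(pairs.head(1))           # the most strongly correlated pair
```

Groups of strongly correlated skills are a hint that a few principal components may summarize many columns.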


## PCA

In [None]:
# let's define X and y
# (use .values to convert the data to NumPy arrays)
        

In [None]:
# Standardize the features (zero mean, unit variance)


In [None]:
# run PCA with 15 components


In [None]:
# print cumulative sum of explained variance
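
One possible shape for the standardize, fit, and cumulative-variance steps is sketched below. The data is synthetic (34 columns generated from 3 hidden factors plus noise), so the numbers it prints say nothing about the FIFA dataset; the variable names follow the cells in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 500 "players", 34 "skills" driven by 3 latent factors plus noise.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 34)) + 0.3 * rng.normal(size=(500, 34))

# Standardize so each feature has mean 0 and variance 1 before PCA.
x = StandardScaler().fit_transform(X)

# Fit PCA with 15 components and project the data onto them.
pca = PCA(n_components=15)
components = pca.fit_transform(x)

# Cumulative share of total variance explained by the first k components.
cum_ratio = np.cumsum(pca.explained_variance_ratio_)
print(cum_ratio.round(3))
```

Standardizing first matters because PCA is driven by variance: without it, columns measured on larger scales would dominate the leading components.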


In [None]:
# visualize Cumulative Explained Variance
plt.bar(range(1,len(pca.explained_variance_ )+1),pca.explained_variance_ )
plt.ylabel('Explained variance')
plt.xlabel('Components')
plt.plot(range(1,len(pca.explained_variance_ )+1),
         np.cumsum(pca.explained_variance_),
         c='red',
         label="Cumulative Explained Variance")
plt.legend(loc='upper left');

In [None]:
# visualize Cumulative Explained Variance Ratio
plt.bar(range(1,len(pca.explained_variance_ratio_ )+1),pca.explained_variance_ratio_ )
plt.ylabel('Explained variance Ratio')
plt.xlabel('Components')
plt.plot(range(1,len(pca.explained_variance_ratio_ )+1),
         np.cumsum(pca.explained_variance_ratio_),
         c='red',
         label="Cumulative Explained Variance Ratio")
plt.legend(loc='upper left')

Two components look good.

In [None]:
# run 2 components PCA
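
A two-component PCA might look like the sketch below. The four columns and their values are invented for illustration (two "outfield" and two "goalkeeper" style variables); `components` is used here as the name for the projected scores, which is an assumption about how the later plotting cell expects it to be defined.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
names = ["Acceleration", "Sprint speed", "GK diving", "GK reflexes"]  # small subset for illustration

# Hypothetical data: two variables follow one latent factor, two follow another.
outfield = rng.normal(size=300)
keeper = rng.normal(size=300)
X = np.column_stack([
    outfield + 0.2 * rng.normal(size=300),
    outfield + 0.2 * rng.normal(size=300),
    keeper + 0.2 * rng.normal(size=300),
    keeper + 0.2 * rng.normal(size=300),
])
x = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(x)   # scores: one row per player, one column per component

# Loadings: the weight of each original variable in each component.
loadings = pd.DataFrame(pca.components_, columns=names, index=["PC1", "PC2"])
print(loadings.round(2))
print(pca.explained_variance_ratio_.sum())
```

The rows of `pca.components_` are unit-length vectors, so the loadings can be compared across variables within a component.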



### Effect of variables on each component

In [None]:
# visualize the effect of variables on each component
plt.figure(figsize=(14,14))

X_features = df.drop('Overall', axis=1)

ax = sns.heatmap(pca.components_,
                 cmap='YlGnBu',
                 yticklabels=[ "PCA"+str(x) for x in range(1,pca.n_components_+1)],
                 xticklabels=list(X_features.columns),
                 cbar_kws={"orientation": "horizontal"})
ax.set_aspect("equal")

In [None]:
X_features = df.drop('Overall', axis=1)

def myplot(score, coeff, labels=None):
    """Biplot: PCA scores as points, variable loadings as arrows."""
    xs = score[:, 0]
    ys = score[:, 1]
    n = coeff.shape[0]
    # rescale the scores so the points and the loading arrows share a comparable range
    scalex = 1.0 / (xs.max() - xs.min())
    scaley = 1.0 / (ys.max() - ys.min())
    plt.scatter(xs * scalex, ys * scaley, s=5)
    for i in range(n):
        # one arrow per original variable, from the origin to its loadings on PC1 and PC2
        plt.arrow(0, 0, coeff[i, 0], coeff[i, 1], color='r', alpha=0.5)
        label = "Var" + str(i + 1) if labels is None else labels[i]
        plt.text(coeff[i, 0] * 1.5, coeff[i, 1] * 1.5, label, color='g', ha='center', va='center')

    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.grid()

# `components` is assumed to hold the scores returned by pca.fit_transform(x) in the 2-component cell
myplot(components[:, 0:2], np.transpose(pca.components_[0:2, :]), list(X_features.columns))
plt.show()

## Regression: using PCA components as the inputs for machine learning

### Why use PCA with machine learning? What benefits?

- The primary goal of PCA (and other dimensionality reduction techniques) is to reduce the number of variables (the dimension of your data) to a smaller, more manageable set.
- A dataset with many columns (**high-dimensional data**) brings challenges such as **overfitting** (due to model complexity) and **slow model building**.
- Reducing the number of features enables **faster ML model building** and helps **reduce overfitting**. PCA and similar techniques are often useful for high-dimensional data.

### Regression analysis without PCA

Build the full regression model with **split validation**.

Report the model's performance (mean squared error).

In [None]:
# split validation


# initialize LinearRegression and fit it to the training data



# print mean squared error
# print explained variance score
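
A baseline regression with split validation could be sketched as below. The feature matrix and target are synthetic stand-ins (in the lab, `X` would be the 34 skill columns and `y` the `Overall` column); the 70/30 split and `random_state=42` are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical data: 10 features and a target that is a noisy linear combination of them.
X = rng.normal(size=(400, 10))
y = X @ rng.normal(size=10) + 0.5 * rng.normal(size=400)

# Split validation: hold out 30% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fit on the training rows only, then predict the held-out rows.
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print("Mean squared error:", mse)
print("Explained variance score:", evs)
```

Both metrics are computed on the held-out rows, so they estimate how the model performs on data it has not seen.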



### Regression analysis with PCA

Choose the minimum number of principal components such that 95% of the variance is retained.

Report the model accuracy.

In [None]:
X = df.drop('Overall', axis=1).values
y = df['Overall'].values

# Standardize the features (zero mean, unit variance)
x = StandardScaler().fit_transform(X)
pd.DataFrame(x).head()

In [None]:
# obtain components satisfying 95% variance
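
One way to obtain the smallest number of components that retains 95% of the variance is to pass a float between 0 and 1 as `n_components`, which scikit-learn's `PCA` interprets as a variance target. The sketch below uses synthetic data, so the number of components it selects is not the answer for the FIFA dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 34 columns generated from 3 latent factors plus noise.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 34)) + 0.3 * rng.normal(size=(500, 34))
x = StandardScaler().fit_transform(X)

# A float n_components asks PCA to keep just enough components
# to explain at least that share of the total variance.
pca = PCA(n_components=0.95)
x_reduced = pca.fit_transform(x)

cum_ratio = np.cumsum(pca.explained_variance_ratio_)
print("Components kept:", pca.n_components_)
print("Variance retained:", cum_ratio[-1])
```

An equivalent manual approach is to fit a full PCA and pick the first index where the cumulative explained variance ratio reaches 0.95.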




In [None]:
# visualize Cumulative Explained Variance Ratio
plt.bar(range(1,len(pca.explained_variance_ratio_ )+1),pca.explained_variance_ratio_ )
plt.ylabel('Explained variance Ratio')
plt.xlabel('Components')
plt.plot(range(1,len(pca.explained_variance_ratio_ )+1),
         np.cumsum(pca.explained_variance_ratio_),
         c='red',
         label="Cumulative Explained Variance Ratio")
plt.legend(loc='upper left')

In [None]:
# split validation


# initialize LinearRegression and fit it to the training data


# print mean squared error
# print explained variance score
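
Regression on principal components could be sketched as follows. This version chains scaling, PCA, and linear regression with `make_pipeline`, which is a design choice of this sketch rather than something the lab prescribes: it fits the scaler and PCA on the training rows only, so no information from the test rows leaks into the transformation. The data is synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 34 features and a target that both depend on 3 latent factors.
latent = rng.normal(size=(500, 3))
X = latent @ rng.normal(size=(3, 34)) + 0.3 * rng.normal(size=(500, 34))
y = latent @ np.array([2.0, -1.0, 0.5]) + 0.2 * rng.normal(size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scale, keep enough components for 95% of the variance, then regress on them.
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), LinearRegression())
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

n_kept = model.named_steps["pca"].n_components_
mse = mean_squared_error(y_test, y_pred)
evs = explained_variance_score(y_test, y_pred)
print("Components kept:", n_kept)
print("Mean squared error:", mse)
print("Explained variance score:", evs)
```

Comparing these metrics with the full-feature regression shows how much predictive performance is traded for the smaller input.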



15 components can deliver good accuracy.