> # PUBG: Creating player profiles
#### by Kristofer Söderström
___
*This notebook attempts to create player profiles based on clustering analysis. It is based on [these](https://towardsdatascience.com/clustering-algorithms-for-customer-segmentation-af637c6830ac) [notebooks](https://medium.com/datadriveninvestor/unsupervised-learning-with-python-k-means-and-hierarchical-clustering-f36ceeec919c).*

## Contents
1. Database Description
1. Exploratory Analysis
1. Clustering 


## 1. Database Description

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.
You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

**Data fields**
* DBNOs - Number of enemy players knocked.
* assists - Number of enemy players this player damaged that were killed by teammates.
* boosts - Number of boost items used.
* damageDealt - Total damage dealt. Note: Self inflicted damage is subtracted.
* headshotKills - Number of enemy players killed with headshots.
* heals - Number of healing items used.
* Id - Player’s Id
* killPlace - Ranking in match of number of enemy players killed.
* killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* killStreaks - Max number of enemy players killed in a short amount of time.
* kills - Number of enemy players killed.
* longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* matchDuration - Duration of match in seconds.
* matchId - ID to identify match. There are no matches that are in both the training and testing set.
* matchType - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* rankPoints - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* revives - Number of times this player revived teammates.
* rideDistance - Total distance traveled in vehicles measured in meters.
* roadKills - Number of kills while in a vehicle.
* swimDistance - Total distance traveled by swimming measured in meters.
* teamKills - Number of times this player killed a teammate.
* vehicleDestroys - Number of vehicles destroyed.
* walkDistance - Total distance traveled on foot measured in meters.
* weaponsAcquired - Number of weapons picked up.
* winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* groupId - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* numGroups - Number of groups we have data for in the match.
* maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
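
The "None" rules described for rankPoints, killPoints and winPoints can be applied before any analysis. The following is a minimal sketch, assuming a pandas DataFrame with those three columns; `mask_unranked_points` is a hypothetical helper name and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd

def mask_unranked_points(df):
    """Return a copy where 0 in killPoints/winPoints becomes NaN
    whenever rankPoints holds a real value (anything other than -1)."""
    out = df.copy()
    has_rank = out["rankPoints"] != -1
    for col in ("killPoints", "winPoints"):
        # a 0 next to a valid rankPoints value means "None", not a true zero
        out[col] = out[col].astype(float).mask(has_rank & (out[col] == 0))
    # rankPoints itself uses -1 as its "None" marker
    out["rankPoints"] = out["rankPoints"].astype(float).replace(-1, np.nan)
    return out

# toy data (made up): one row ranked by killPoints/winPoints, two by rankPoints
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480],
    "killPoints": [1200, 0, 0],
    "winPoints": [1450, 0, 0],
})
cleaned = mask_unranked_points(toy)
print(cleaned)
```

Using NaN rather than 0 keeps the placeholder values from being read as genuinely low rankings in summary statistics.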

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
import os
print(os.listdir("../input"))
#loading additional dependencies 
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from math import sqrt
#seed
import random
random.seed(30) #seed for reproducibility
#ml 
from sklearn.cluster import KMeans
from sklearn import preprocessing
from sklearn.decomposition import PCA
#viz
import matplotlib.pyplot as plt #plots and graphs
import seaborn as sns #additional functionality and visualization
plt.style.use('fivethirtyeight')#set style
%matplotlib inline




## 2. Exploratory Analysis
* There is a mix of object, integer, and float data.
* Around 4.5 million rows, one per player (Id) per match (matchId)
* 29 columns
* Most data is numerical. Id, groupId, matchId and matchType are object data

In [None]:
#load data and create dataframe 
train_data = pd.read_csv('../input/train_V2.csv')
#summarize information 
print("database shape:",train_data.shape)
before = train_data.shape
print("missing data?",train_data.isnull().values.any())
print("deleting missing values...") # rows with missing values are dropped because of time constraints; this is usually undesirable, since missingness itself can carry information
train_data = train_data.dropna()
print("missing data?",train_data.isnull().values.any())
after = train_data.shape
#print("using random sample (1% of data) to speed up computation...")
#train_data = train_data.sample(n=None, frac=0.01, replace=False, weights=None, random_state=None, axis=None)
print("database shape:",train_data.shape)
print("Dropped rows:",before[0]-after[0])
train_data.head()

Player actions like assists, boosts, heals, and kills have high standard deviations relative to their means, indicating a distribution that is concentrated near zero with a long right tail. The data implies that most players perform few actions in a game, while a small percentage perform many.

In [None]:
train_data.describe()
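
The skewness claim above can be checked numerically instead of being read off the `describe()` table. This is a small sketch on made-up toy counts, not on the competition data; the column names are borrowed from the data fields and the `summary` layout is an assumption chosen for illustration.

```python
import pandas as pd

# toy per-player counts (made up): mostly zeros with a few large values
toy = pd.DataFrame({
    "kills": [0, 0, 0, 0, 1, 0, 2, 0, 0, 9],
    "heals": [0, 1, 0, 0, 0, 3, 0, 0, 12, 0],
})

summary = pd.DataFrame({
    "mean": toy.mean(),
    "std": toy.std(),
    "cv": toy.std() / toy.mean(),      # coefficient of variation: std relative to mean
    "skew": toy.skew(),                # positive values indicate a long right tail
    "share_zero": (toy == 0).mean(),   # fraction of players with no such action
})
print(summary.round(2))
```

A coefficient of variation above 1 together with positive skew and a large share of zeros is the pattern described in the text.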

According to the database description, Id refers to the individual player. The summary below shows as many unique Ids as there are rows in the database, around 4.5 million.

In [None]:
train_data["Id"].describe()

We are interested in building player profiles based on the data, regardless of winning placement. For now, we drop all features that do not represent an action or behaviour.

In [None]:
#we will drop winning placement and all features that do not represent player behaviour
cluster_data = train_data.iloc[:,3:-2]
cluster_data= cluster_data.drop(["matchType","rankPoints","maxPlace","killPlace",
                                "killPoints","matchDuration","numGroups"],axis=1)
print("Database shape: ",cluster_data.shape)
cluster_data.head()

A **heatmap** is a good way to start visualization. It gives a bird's-eye view of the dataset and helps identify correlations between features. We can also list the most highly correlated pairs of features.
* Offensive player actions such as damage dealt, kills, kill streaks, headshot kills and knockouts are all highly correlated (>85%) with each other. It might be a good idea to select only one or two of these features for parsimony.
* We will create one feature, called footDistance, from the sum of the walkDistance and swimDistance features, to better represent movement without a vehicle.

In [None]:
#developing a heatmap with example from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
corr = cluster_data.corr() # compute correlation matrix
f, ax = plt.subplots(figsize=(16,16)) #set size
cmap = sns.diverging_palette(220,10,as_cmap=True) #define a custom color palette
sns.heatmap(corr,annot=False,cmap=cmap,square=True,linewidths=0.5) #draw graph
plt.show()

In [None]:
#top n correlations
n=10
def get_redundant_pairs(df):
    '''Get diagonal and lower triangular pairs of correlation matrix'''
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

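#pass the feature data, not the correlation matrix: the function computes the correlations itself
print(get_top_abs_correlations(cluster_data,n))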
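
The same ranking of correlated pairs can also be written with an upper-triangle mask instead of an explicit set of pairs to drop. This is an alternative sketch, not the notebook's method; `top_abs_correlations` is a hypothetical name and the toy data is synthetic.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(df, n=5):
    """Return the n feature pairs with the largest absolute correlation."""
    corr = df.corr().abs()
    # keep only entries strictly above the diagonal so each pair appears once
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# synthetic data: damageDealt is built from kills, heals is independent noise
rng = np.random.default_rng(30)
base = rng.normal(size=500)
toy = pd.DataFrame({
    "kills": base,
    "damageDealt": base * 100 + rng.normal(scale=20, size=500),
    "heals": rng.normal(size=500),
})
top = top_abs_correlations(toy, n=3)
print(top)
```

Note that the function takes the feature table itself; passing an already computed correlation matrix would rank correlations between correlation profiles rather than between features.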

In [None]:
#create footDistance feature and drop highly correlated features
cluster_data['footDistance'] = cluster_data['walkDistance'] + cluster_data['swimDistance']
cluster_data= cluster_data.drop(["kills","killStreaks","DBNOs","walkDistance","swimDistance"],axis=1)

#developing a heatmap with example from https://seaborn.pydata.org/examples/many_pairwise_correlations.html
corr = cluster_data.corr() # compute correlation matrix
f, ax = plt.subplots(figsize=(16,16)) #set size
cmap = sns.diverging_palette(220,10,as_cmap=True) #define a custom color palette
sns.heatmap(corr,annot=False,cmap=cmap,square=True,linewidths=0.5) #draw graph
plt.show()


## 3. Clustering
Clustering is useful for extracting information from data to create the profiles. Ideally, it will group players based on their similarity of actions, making it possible to infer different play styles. 

In [None]:
#we standardize the data because clustering algorithms measure distance and are sensitive to scale
standardized = preprocessing.scale(cluster_data)
#building the df again with the original column names
df_labels = cluster_data.columns
st_cluster_data = pd.DataFrame(standardized, columns=df_labels)
st_cluster_data.head()
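
To make explicit what the standardization step does, the sketch below compares `preprocessing.scale` with a manual z-score on a small made-up table. The toy values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# toy features on very different scales (made-up values)
toy = pd.DataFrame({
    "damageDealt": [0.0, 100.0, 250.0, 900.0],
    "boosts": [0.0, 1.0, 2.0, 5.0],
})

# preprocessing.scale returns a NumPy array, so column names and index are restored by hand
scaled = pd.DataFrame(preprocessing.scale(toy), columns=toy.columns, index=toy.index)

# equivalent manual z-score: subtract the mean, divide by the population std (ddof=0)
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.round(3))
```

After scaling, every column has mean 0 and unit variance, so no single feature dominates the Euclidean distances used by K-Means. Passing `index=toy.index` keeps the scaled rows aligned with the source rows if the two tables are combined later.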

As mentioned earlier, the data is highly skewed. This is visually represented in the following graph where we plot the distribution of some player actions. 

In [None]:
#plot distribution
plt.rcParams['figure.figsize'] = (16, 9)
plot_1 = sns.distplot(st_cluster_data["damageDealt"], kde_kws={"label": "damageDealt"})
plot_2 = sns.distplot(st_cluster_data["assists"], kde_kws={"label": "assists"})
plot_3 = sns.distplot(st_cluster_data["boosts"], kde_kws={"label": "boosts"})
plt.xlabel('Player actions distribution')

In [None]:
#Using the elbow method to find the optimum number of clusters
wcss = []
for i in range(1,11):
    km=KMeans(n_clusters=i,init='k-means++', max_iter=300, n_init=10, random_state=30)
    km.fit(st_cluster_data)
    wcss.append(km.inertia_)
plt.plot(range(1,11),wcss)
plt.title('Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('wcss')
plt.show()

The plot shows the within-cluster sum of squared distances (WCSS) of samples to their closest cluster center as we increase the number of clusters. The usual heuristic is to pick the point where adding another cluster no longer reduces WCSS substantially. However, there is some degree of subjectivity based on a priori expectations. For now, we will choose 4 as the number of clusters.
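
One way to make the elbow less subjective is to look at the relative drop in WCSS from one value of k to the next. The sketch below does this on synthetic blobs rather than on the player data; the dataset, the range of k and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# synthetic data with four generated groups (not the PUBG data)
X, _ = make_blobs(n_samples=600, centers=4, cluster_std=0.6, random_state=30)

ks = list(range(1, 9))
wcss = np.array([
    KMeans(n_clusters=k, n_init=10, random_state=30).fit(X).inertia_
    for k in ks
])

# relative reduction in WCSS when moving from k-1 to k clusters
rel_drop = -np.diff(wcss) / wcss[:-1]
for k, drop in zip(ks[1:], rel_drop):
    print(f"k={k}: WCSS falls by {drop:.1%} relative to k={k - 1}")
```

A value of k after which the relative drops become small is a candidate elbow; this remains a heuristic and does not replace judgment about what the clusters should represent.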


In [None]:
# Fitting K-Means to the dataset
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters,
                init='k-means++',
                max_iter=1000,
                n_init=20,
                random_state=30)
y_kmeans = kmeans.fit_predict(st_cluster_data)

In [None]:
#shift the cluster numbering to start at 1 instead of 0 (optional)
y_kmeans1 = y_kmeans + 1
#add the labels as a plain array: cluster_data keeps its original index after dropna(),
#so assigning a freshly indexed DataFrame would misalign rows
cluster_data['cluster'] = y_kmeans1
#mean of each feature by cluster
kmeans_mean_cluster = pd.DataFrame(round(cluster_data.groupby('cluster').mean(),4))
#display the table (one row per cluster)
kmeans_mean_cluster

The previous table shows the mean of each attribute by cluster. Before visualizing the clusters in separate radar graphs, we check how many players fall into each one.

In [None]:
#count players per cluster once, then report absolute and relative sizes
cluster_sizes = cluster_data["cluster"].value_counts().sort_index()
for label, obs in cluster_sizes.items():
    percentage = round(obs / cluster_sizes.sum() * 100, 2)
    print("Cluster {} has".format(int(label)), obs, "players, or {}%".format(percentage))

It is worth noting that cluster size varies greatly: over 60% of players belong to the first cluster, and the first two clusters together represent almost 90% of the data. Cluster 4 represents less than 1% of the data.

In [None]:
#we standardize the cluster means so that all features can be drawn on the same scale
radar_data = preprocessing.scale(kmeans_mean_cluster)
radar_data = pd.DataFrame(radar_data, columns=kmeans_mean_cluster.columns) #keep feature names
radar_data

* **Cluster 1** shows players with seemingly higher accuracy and longer-distance kills than the rest. Low walking distance and few item pickups suggest that these players remain more static than their counterparts.

*Note: The input code for the rest of the clusters has been hidden from the notebook for cleanliness.*

In [None]:
#https://www.kaggle.com/typewind/draw-a-radar-chart-with-python-in-a-simple-way
labels = np.array(cluster_data.columns.values)
labels = labels[:-1]
stats = radar_data.loc[0].values

angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
# close the plot
stats=np.concatenate((stats,[stats[0]]))
angles=np.concatenate((angles,[angles[0]]))

#plot the figure
fig=plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, stats, 'o-', linewidth=2)
ax.fill(angles, stats, alpha=0.25)
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels) #one tick per label; the last angle only closes the polygon
ax.set_title("Cluster 1: Snipers")
ax.grid(True)

* **Cluster 2** shows players who on average collect more weapons and travel longer distances on foot. They seem to be more passive in their gameplay, perhaps roaming from isolated points instead of dropping into the action right away. The relatively high use of healing items suggests a defensive play style.

In [None]:
#https://www.kaggle.com/typewind/draw-a-radar-chart-with-python-in-a-simple-way
labels = np.array(cluster_data.columns.values)
labels = labels[:-1]
stats = radar_data.loc[1].values


angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
# close the plot
stats=np.concatenate((stats,[stats[0]]))
angles=np.concatenate((angles,[angles[0]]))

#plot the figure
fig=plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, stats, 'o-', linewidth=2)
ax.fill(angles, stats, alpha=0.25)
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels) #one tick per label; the last angle only closes the polygon
ax.set_title("Cluster 2: Roamers")
ax.grid(True)

* **Cluster 3** shows players with relatively high offensive solo actions. These seemingly aggressive players appear to be the most effective at dealing damage.

In [None]:
#https://www.kaggle.com/typewind/draw-a-radar-chart-with-python-in-a-simple-way
labels = np.array(cluster_data.columns.values)
labels = labels[:-1]
stats = radar_data.loc[2].values

angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
# close the plot
stats=np.concatenate((stats,[stats[0]]))
angles=np.concatenate((angles,[angles[0]]))

#plot the figure
fig=plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, stats, 'o-', linewidth=2)
ax.fill(angles, stats, alpha=0.25)
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels) #one tick per label; the last angle only closes the polygon
ax.set_title("Cluster 3: Aggressive solo")
ax.grid(True)

* **Cluster 4** shows players with relatively higher ride distance, road kills, vehicle destroys and team kills, which suggests a preference for using vehicles in teams.

In [None]:
#https://www.kaggle.com/typewind/draw-a-radar-chart-with-python-in-a-simple-way
labels = np.array(cluster_data.columns.values)
labels = labels[:-1]
stats = radar_data.loc[3].values

angles=np.linspace(0, 2*np.pi, len(labels), endpoint=False)
# close the plot
stats=np.concatenate((stats,[stats[0]]))
angles=np.concatenate((angles,[angles[0]]))

#plot the figure
fig=plt.figure()
ax = fig.add_subplot(111, polar=True)
ax.plot(angles, stats, 'o-', linewidth=2)
ax.fill(angles, stats, alpha=0.25)
ax.set_thetagrids(angles[:-1] * 180/np.pi, labels) #one tick per label; the last angle only closes the polygon
ax.set_title("Cluster 4: Vehicle team riders")
ax.grid(True)
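
The four radar cells above differ only in the row index and the title, so they could be folded into one helper. The following is a sketch of that idea on a made-up table of standardized means; `plot_radar`, the feature names and the profile titles are hypothetical and not taken from the fitted clusters.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

def plot_radar(ax, values, labels, title):
    """Draw one closed radar polygon on an existing polar axis."""
    angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False)
    # repeat the first point so the polygon closes
    closed_values = np.concatenate((values, values[:1]))
    closed_angles = np.concatenate((angles, angles[:1]))
    ax.plot(closed_angles, closed_values, "o-", linewidth=2)
    ax.fill(closed_angles, closed_values, alpha=0.25)
    # one tick per label: use the unclosed angles here
    ax.set_thetagrids(np.degrees(angles), labels)
    ax.set_title(title)
    ax.grid(True)

# made-up standardized means, one row per profile
toy_means = pd.DataFrame(
    [[1.2, -0.4, 0.3], [-0.8, 1.1, -0.5]],
    columns=["damageDealt", "heals", "footDistance"],
)
titles = ["Profile A", "Profile B"]

fig, axes = plt.subplots(1, len(titles), subplot_kw={"projection": "polar"}, figsize=(10, 5))
for ax, (_, row), title in zip(axes, toy_means.iterrows(), titles):
    plot_radar(ax, row.values, list(toy_means.columns), title)
```

Placing the profiles side by side on a shared layout also makes it easier to compare them than scrolling between separate figures.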

Another way to visualize the cluster data is to reduce the dimensionality and plot the data with the number of clusters we have defined earlier. First we visualize the data in 2D. 

In [None]:
#we use the standardized cluster data and reduce it to two dimensions
pca = PCA(n_components=2)
pca_result = pca.fit_transform(st_cluster_data)

#https://github.com/llSourcell/spike_sorting
# Plot the 1st principal component against the 2nd
fig, ax = plt.subplots(figsize=(16, 9)) 
ax.scatter(pca_result[:, 0], pca_result[:, 1])
ax.set_xlabel('1st principal component', fontsize=20)
ax.set_ylabel('2nd principal component', fontsize=20)
ax.set_title('Principal Component Analysis', fontsize=23)

fig.subplots_adjust(wspace=0.1, hspace=0.1)
plt.show()
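
A 2D projection is only as informative as the share of variance it retains, which scikit-learn exposes as `explained_variance_ratio_`. The sketch below shows the check on synthetic data; the dataset and its dimensions are assumptions, not the player features.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# synthetic standardized data with six features (not the PUBG data)
X, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=30)
X = preprocessing.scale(X)

pca = PCA(n_components=2)
projected = pca.fit_transform(X)

# fraction of total variance captured by each retained component
ratios = pca.explained_variance_ratio_
print("variance explained per component:", np.round(ratios, 3))
print("total retained in 2D:", round(float(ratios.sum()), 3))
```

If the two components retain only a small share of the variance, apparent overlap between groups in the scatter plot may be an artifact of the projection rather than a property of the data.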

In [None]:
# Fitting K-Means to the 2D PCA projection
num_clusters = 4
kmeans = KMeans(n_clusters=num_clusters,
                init='k-means++',
                max_iter=1000,
                n_init=20,
                random_state=30)
y_kmeans_pca = kmeans.fit_predict(pca_result)
y_kmeans_pca=y_kmeans_pca+1

In [None]:
# Plot the result
plt.scatter(pca_result[:, 0], pca_result[:, 1],
           c=y_kmeans_pca, edgecolor='none', cmap=plt.get_cmap('Spectral',4))
plt.xlabel('1st principal component', fontsize=20)
plt.ylabel('2nd principal component', fontsize=20)
plt.title('Data Clusters in 2D', fontsize=23)
plt.colorbar();

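The plot allows us to see the assigned clusters in a two-dimensional space. Note that these clusters were refit on the two principal components, so they need not coincide with the four profiles obtained from the full feature set.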
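
Because K-Means is fit once on the full standardized features and once on the 2D projection, the two labelings can differ. One way to quantify the agreement is the adjusted Rand index, which ignores how clusters are numbered. This is a sketch on synthetic data and is an added check, not part of the original analysis.

```python
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# synthetic standardized data with six features (not the PUBG data)
X, _ = make_blobs(n_samples=500, n_features=6, centers=4, random_state=30)
X = preprocessing.scale(X)

# labels from clustering in the full feature space
labels_full = KMeans(n_clusters=4, n_init=10, random_state=30).fit_predict(X)

# labels from clustering after projecting to two principal components
X_2d = PCA(n_components=2).fit_transform(X)
labels_2d = KMeans(n_clusters=4, n_init=10, random_state=30).fit_predict(X_2d)

# 1.0 means identical partitions; values near 0 mean chance-level agreement
ari = adjusted_rand_score(labels_full, labels_2d)
print(f"adjusted Rand index between the two labelings: {ari:.3f}")
```

An alternative that avoids the question altogether is to color the 2D scatter plot with the labels from the full-feature fit, so the plot shows the same clusters that the profiles describe.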