Since I am interested in working as a data scientist in the video game industry, I thought I would look at this StarCraft data set. I am looking to answer the questions posed with the set. 

How do the replay attributes differ by level of player expertise?
What are significant predictors of a player's league?

It seemed like a clustering problem to me. I enjoyed my data mining course, and since we did not spend much time on a clustering project, I decided to use K-means as my clustering algorithm. There are 21 different attributes in this set. Of those 21, I removed the attributes that were user-reported (age, hours per week, and total hours). I also removed the professional players. 

The user-reported values are likely to contain "errors", like the person reporting a total play time of 1,000,000 hours.  The professionals were removed because the game itself only ranks players from bronze to grand master. The researchers who created the data set added the professional rank, but within the game the professionals can only have a rank between bronze and grand master. This makes their data points invalid because they are improperly labeled. 

In [None]:
import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.model_selection import train_test_split
from sklearn.model_selection import KFold
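# NOTE: jaccard_similarity_score was removed in scikit-learn 0.23,
# so this cell needs an older scikit-learn release to run as written.
from sklearn.metrics import jaccard_similarity_score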
from sklearn.metrics import homogeneity_score

kf = KFold(n_splits = 10) 
kmeans = KMeans(n_clusters = 7, random_state = 31133113)

starcraft = pd.read_csv('../input/starcraft.csv')
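# Keep only rows with a recorded TotalHours; this is the step that drops
# the professional players described above.
star = starcraft.loc[starcraft['TotalHours'].notnull()]
y = pd.DataFrame(star, columns = ['LeagueIndex'])-1 # make zero-indexed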

best_attr = [['UniqueHotkeys', 'ComplexUnitsMade', 'ComplexAbilityUsed', 'MaxTimeStamp'],
             ['MinimapAttacks', 'ComplexUnitsMade', 'ComplexAbilityUsed', 'MaxTimeStamp'],
             ['APM', 'UniqueHotkeys', 'TotalMapExplored', 'UniqueUnitsMade', 'ComplexUnitsMade',
              'ComplexAbilityUsed', 'MaxTimeStamp'],
             ['UniqueHotkeys', 'MinimapAttacks', 'ActionLatency', 'WorkersMade', 
              'ComplexUnitsMade']]

I selected the attribute sets in the best attributes list using a genetic algorithm (GA) I modified for this analysis. I did not establish a fitness score threshold for my algorithm. Instead, I allowed the algorithm to run for 5000 generations. The algorithm returned the attribute sets with the highest minimum silhouette score. 
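For readers unfamiliar with the approach, a genetic search over attribute subsets can be sketched roughly as follows. This is not the algorithm from the repository: the function name `evolve_feature_mask`, its parameters, the tournament selection, and the toy fitness function are all assumptions made for illustration. In the analysis described here, the fitness would instead be the minimum silhouette score of a K-means clustering on the selected columns.

```python
import numpy as np

def evolve_feature_mask(fitness, n_features, pop_size=20, generations=50,
                        mutation_rate=0.1, seed=0):
    """Return the best boolean feature mask found by a simple GA (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Each individual is a boolean mask: True means "use this attribute".
    pop = rng.random((pop_size, n_features)) < 0.5
    best_mask, best_score = None, -np.inf
    for _ in range(generations):
        scores = np.array([fitness(mask) for mask in pop])
        top = int(scores.argmax())
        if scores[top] > best_score:
            best_mask, best_score = pop[top].copy(), float(scores[top])
        # Tournament selection: the fitter of two random individuals becomes a parent.
        a, b = rng.integers(pop_size, size=(2, pop_size))
        parents = np.where((scores[a] >= scores[b])[:, None], pop[a], pop[b])
        # Uniform crossover between each parent and its neighbour in the list.
        mates = np.roll(parents, 1, axis=0)
        take = rng.random(parents.shape) < 0.5
        children = np.where(take, parents, mates)
        # Bit-flip mutation keeps some diversity in the population.
        flips = rng.random(children.shape) < mutation_rate
        pop = children ^ flips
    return best_mask, best_score

# Toy fitness (made up): count how many positions match a hidden target mask.
target = np.array([True, False, True, False, False, True])
toy_fitness = lambda mask: float((mask == target).sum())

best_mask, best_score = evolve_feature_mask(toy_fitness, n_features=6)
print(best_mask, best_score)
```

A real fitness function would also have to handle an empty mask (no attributes selected), which the toy fitness here does not need to worry about.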

Silhouette scores take values between -1 and 1 inclusive. A larger positive value indicates that a point belongs more strongly to its assigned cluster. A more negative value indicates that the point belongs more strongly to a different cluster. Higher minimum silhouette scores correlate well with higher average silhouette scores. With no negative silhouette scores, I can assume that even the most poorly placed point is still closer to the cluster it was assigned to than to any other cluster.
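The per-point silhouette can be written out by hand to make the sign convention concrete. The sketch below is a plain NumPy re-implementation for illustration only (the notebook itself uses scikit-learn's `silhouette_samples`); the helper name `silhouette_per_point` and the toy data are made up.

```python
import numpy as np

def silhouette_per_point(X, labels):
    """Silhouette s = (b - a) / max(a, b) for every point (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all points.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False  # a point is not compared with itself
        if not same.any():
            continue  # a cluster of one point gets a silhouette of 0
        # a: mean distance to the other points in the same cluster.
        a = dist[i, same].mean()
        # b: mean distance to the points of the nearest other cluster.
        b = min(dist[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

# Toy data (made up): two tight groups on a line.
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_per_point(X_toy, [0, 0, 1, 1])  # sensible grouping
bad = silhouette_per_point(X_toy, [0, 1, 0, 1])   # deliberately scrambled grouping

print(good.min(), good.mean())  # every value is positive
print(bad.min(), bad.mean())    # every value is negative
```

With the sensible grouping the minimum silhouette stays well above 0, which is the property the genetic algorithm was asked to maximise.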

The genetic algorithm can be found on my GitHub page located [here][1].

The following loop evaluates the best attributes based on my GA results.


  [1]: https://github.com/elleqelle/StarCraft-II-K-means

In [None]:
for attr in best_attr:
    X = pd.DataFrame(star, columns = attr)
    X += 0.0000001
    X = X.apply(np.log)

    X_sample, X_validation, y_sample, y_validation = train_test_split(
        X, y, test_size=0.2, random_state = 13)
    
    sil_min = []
    sil_mean = []
    jaccard = []
    purity = []

    for train, test in kf.split(X_sample):        
        labels = kmeans.fit_predict(X_sample.iloc[train,:])
        sil_vals = silhouette_samples(X_sample.iloc[train,:], labels)
        sil_min.append(min(sil_vals))
        sil_mean.append(np.mean(sil_vals))
        
        jaccard.append(jaccard_similarity_score(y_sample.iloc[train,:], labels)) 
        purity.append(homogeneity_score(y_sample.iloc[train,:].values.flatten(), labels))
        
    print(attr)
    print('Avg Silhouette min: ' + str(np.mean(np.asarray(sil_min))))
    print('Avg Silhouette mean: ' + str(np.mean(np.asarray(sil_mean))))
    print('Avg Jaccard similarity: ' + str(np.mean(np.asarray(jaccard))))
    print('Avg Purity: ' + str(np.mean(np.asarray(purity))))
    print()

Based on the testing above, the set UniqueHotkeys, ComplexUnitsMade, ComplexAbilityUsed, MaxTimeStamp has a high average minimum silhouette and the highest average mean silhouette. It is also one of the two smallest attribute sets, so I will use those attributes for evaluation with the validation set. 

I included two additional metrics for comparison: the Jaccard similarity and the purity score (computed here with scikit-learn's homogeneity_score). Both are external measures ranging from 0 to 1 inclusive. They require a priori class membership of the clustered points in order to produce a score. The silhouette score, however, is an internal measure. It evaluates cluster membership using only distances: a point's mean distance to the other points in its assigned cluster compared with its mean distance to the points in the nearest other cluster. It does not require any knowledge of true class membership. I'll delve more into the external measures later.
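One thing worth keeping in mind with external measures is that cluster ids are arbitrary. The sketch below uses made-up labels and two hypothetical helpers: `label_agreement`, which compares ids directly (which is, as far as the scikit-learn documentation describes it, how the old `jaccard_similarity_score` behaved for multiclass labels), and `purity`, the textbook majority-class definition. Note that textbook purity is not the same formula as `homogeneity_score`, although both ignore how the clusters happen to be numbered.

```python
import numpy as np

def label_agreement(y_true, y_pred):
    """Fraction of points whose cluster id equals the class id."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def purity(y_true, y_pred):
    """Fraction of points that belong to the majority class of their cluster."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    majority_total = 0
    for cluster in np.unique(y_pred):
        members = y_true[y_pred == cluster]
        # Count the most common true class inside this cluster.
        majority_total += np.bincount(members).max()
    return float(majority_total / len(y_true))

# Made-up example: a perfect clustering whose ids are merely renumbered.
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
relabelled = np.array([2, 2, 2, 0, 0, 0, 1, 1, 1])

print(label_agreement(y_true, relabelled))  # 0.0
print(purity(y_true, relabelled))           # 1.0
```

A low id-matching score can therefore mean either that the clusters do not follow the classes or simply that they are numbered differently, while a low purity or homogeneity only has the first explanation.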

Now, let's look at the validation set classification.

In [None]:
top_attr = ['UniqueHotkeys', 'ComplexUnitsMade', 'ComplexAbilityUsed', 'MaxTimeStamp']
X = pd.DataFrame(star, columns = top_attr)
X += 0.0000001
X = X.apply(np.log)

X_sample, X_validation, y_sample, y_validation = train_test_split(
    X, y, test_size=0.2, random_state = 13)
    
centers = kmeans.fit(X_sample)
labels = kmeans.predict(X_validation)
sil_vals = silhouette_samples(X_validation, labels)


print("Validation Silhouette min: " + str(min(sil_vals)))
print("Validation Silhouette mean: " + str(np.mean(sil_vals)))


This looks pretty promising. An actual improvement in the validation set over the training set!?! A minimum silhouette well over 0 and a fairly high mean silhouette score seem like a great set of clusters! I want to get a closer look at these cluster centers.

In [None]:
star_centers = []

for i in range(7):
    star_centers.append(np.exp(centers.cluster_centers_[i]))
    
level_centers = pd.DataFrame(star_centers, columns = top_attr)
level_centers.index = range(1, len(level_centers)+1)
level_centers[level_centers <= 0.0000001] = 0

print(level_centers)

I based the data frame's index on the ranks provided in the data: 1 is the bronze level and 7 is the grand master level. 

A quick glance at this data frame tells you almost nothing intuitive. The values for the centers appear randomly placed. My metrics tell me I have relatively good separation for all the data points, but these centers are meaningless as there is no trend along any of the columns. One can learn nothing tangible from these results outside of the ability to separate the points into reasonable clusters. 

I used the other high scoring list of attributes in the selection to see if this trend holds.

In [None]:
top_attr = ['APM', 'UniqueHotkeys', 'TotalMapExplored', 'UniqueUnitsMade', 'ComplexUnitsMade',
              'ComplexAbilityUsed', 'MaxTimeStamp']
X = pd.DataFrame(star, columns = top_attr)
X += 0.0000001
X = X.apply(np.log)

X_sample, X_validation, y_sample, y_validation = train_test_split(
    X, y, test_size=0.2, random_state = 13)
    
centers = kmeans.fit(X_sample)
labels = kmeans.predict(X_validation)
sil_vals = silhouette_samples(X_validation, labels)
star_centers = []

for i in range(7):
    star_centers.append(np.exp(centers.cluster_centers_[i]))
    
level_centers = pd.DataFrame(star_centers, columns = top_attr)
level_centers.index = range(1, len(level_centers)+1)
level_centers[level_centers <= 0.0000001] = 0

print("Validation Silhouette min: " + str(min(sil_vals)))
print("Validation Silhouette mean: " + str(np.mean(sil_vals)))
print()
print(level_centers)

Sure enough, it does. 

Prior to running these algorithms, I did some exploratory data analysis, and one could be led to believe that certain attributes increase or decrease steadily with league. For example, consider the box-and-whisker plots of APM, UniqueHotkeys, ComplexUnitsMade, and MaxTimeStamp.

In [None]:
import seaborn as sns

sns.boxplot(x = star['LeagueIndex'], y = star['APM'])

In [None]:
sns.boxplot(x = star['LeagueIndex'], y = star['UniqueHotkeys'])

In [None]:
sns.boxplot(x = star['LeagueIndex'], y = star['ComplexUnitsMade'])

In [None]:
sns.boxplot(x = star['LeagueIndex'], y = star['MaxTimeStamp'])

One can see that even with attributes showing the greatest separation of value, the whiskers and the IQRs overlap throughout. Values across the league index show great similarity and will not lead to any satisfying, or significant, separations. Sure, grand master players perform a lot of actions per minute (APM), but so do the rest of the players. 
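The overlap visible in the plots can also be checked numerically by comparing quartiles per league. The sketch below uses a tiny made-up table rather than the StarCraft data, so the numbers mean nothing on their own; only the column names `LeagueIndex` and `APM` are borrowed from the data set.

```python
import pandas as pd

# Made-up values for two leagues, for illustration only.
toy = pd.DataFrame({
    'LeagueIndex': [1, 1, 1, 1, 2, 2, 2, 2],
    'APM': [40.0, 60.0, 80.0, 100.0, 60.0, 80.0, 100.0, 120.0],
})

# First and third quartile of APM within each league.
q = toy.groupby('LeagueIndex')['APM'].quantile([0.25, 0.75]).unstack()
q.columns = ['q1', 'q3']

# Two interquartile ranges overlap when each one starts before the other ends.
overlap = bool((q.loc[1, 'q1'] <= q.loc[2, 'q3']) and
               (q.loc[2, 'q1'] <= q.loc[1, 'q3']))

print(q)
print(overlap)
```

Running the same comparison on `star` for each pair of neighbouring leagues would put a number on how much the boxes overlap.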

This brings me back to my earlier look at internal and external measures. The internal measures tell me that the clusters I produced are valid. The external measures disagree, because K-means assigns points to clusters without any regard for the a priori classification. It has no clue about league index and groups the points purely by distance in attribute space. If the clustering had grouped the points by league index, we would see agreement between the internal and external measure values. But we don't. 

This leads me to believe that the values provided in this data set are not enough to determine what rank a StarCraft player has. With regard to external validation, the final cluster centers do not provide a coherent pattern for assessing a player's rank from the cluster attributes. I can easily, and fairly accurately, assign a new player to one of the computed centers, but that assignment has no relationship to the external rankings.  

In order to answer the question posed originally, I would either need additional data to improve agreement between internal and external measures, or I would conclude that K-means clustering is not sufficient to answer the question as stated.  (Or, I need to "git gud" at K-means :) ) 