# DOTA 2 Player Segmentation through Unsupervised Learning

Riley Stange<br>
Thinkful Data Science Capstone 3: Unsupervised Learning<br>
December 2019

The goal of this project is to take very detailed player gameplay statistics from Defense of the Ancients 2 (DOTA 2) and cluster them into player segments. DOTA 2 is a Multiplayer Online Battle Arena (MOBA) game in which two teams of 5 players each try to destroy the opposing team's "ancient," a stationary structure inside each team's base, at opposite sides of the map. Each team has never-ending waves of "minions," computer-controlled characters that march down the map's three lanes and, if the players did not interfere, would end up in a deadlock at the center point of each lane. Players take turns, in random order, choosing one character from an extensive roster of 119 (as of writing) characters, known as "heroes." Once a player chooses a hero, no one else in the game can play as a copy of that character, as one could in some other similar games.

In [1]:
import numpy as np
import pandas as pd
from pandas import json_normalize  # top-level import since pandas 1.0 (formerly pandas.io.json)

from dask_ml.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import AgglomerativeClustering, DBSCAN  # KMeans comes from dask_ml below
from sklearn.manifold import TSNE
from sklearn import metrics
from dask_ml.cluster import KMeans

#from hdbscan import HDBSCAN

#from umap import UMAP

import warnings

from dask.distributed import Client, progress
import dask.dataframe as dd
import joblib

warnings.filterwarnings("ignore")

In [2]:
client = Client(n_workers=4, threads_per_worker=2, memory_limit='2GB')
client

| Client | Cluster |
|---|---|
| Scheduler: tcp://127.0.0.1:55084 | Workers: 4 |
| Dashboard: http://127.0.0.1:8787/status | Cores: 8 |
| | Memory: 8.00 GB |


# Data Exploration and Cleaning

The data was gathered from [OpenDota's second data dump](https://blog.opendota.com/2017/03/24/datadump2/). I downloaded the small subsample that they link, which consists of a 4GB file containing match information and a 1GB file containing the skill rank of each match, keyed by match ID so that it can be merged with the match data for further filtering.

## Process to Import Data Before Dask



This is how I loaded and filtered the 4GB file, which would otherwise have crashed my machine:
```
# Read the 4GB file in chunks, keeping only rows that have ability data.
chunks = pd.read_csv(r"~\projects\Dota 2\player_matches_small.csv",
                     chunksize=2500000, low_memory=False)
for i, chunk in enumerate(chunks, start=1):
    chunk = chunk.loc[chunk["ability_uses"].notnull()].drop(columns="account_id")
    chunk.to_csv(r"~\projects\Dota 2\player_stats{}.csv".format(i), index=False)
```
I then appended these files into one dataframe:
```
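# Stitch the 16 filtered chunk files back together (same order as before: 16 down to 1).
# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
player_stats = pd.concat(
    [
        pd.read_csv(r"~\projects\Dota 2\player_stats{}.csv".format(i))
        for i in range(16, 0, -1)
    ]
)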
```
I then did the same for the file containing match_id and skill bracket, keeping only bracket 2, the highest, which is the one I want for this project.
```
# Same chunked pass over the skill file, keeping only skill bracket 2.
chunks = pd.read_csv(r"~\projects\Dota 2\match_skill.csv", chunksize=2500000)
for i, chunk in enumerate(chunks, start=1):
    chunk = chunk.loc[chunk["skill"] == 2]
    chunk.to_csv(r"~\projects\Dota 2\match_skill{}.csv".format(i), index=False)
```
Now to combine the filtered skill files into one dataframe:
```
# Combine the 53 filtered skill files (same order as before: 53 down to 1).
high_skill = pd.concat(
    [
        pd.read_csv(r"~\projects\Dota 2\match_skill{}.csv".format(i))
        for i in range(53, 0, -1)
    ]
)
```
I then performed an inner merge to combine the two dataframes into one, using match_id as the key:
```
skilled_player_matches = pd.merge(left=high_skill, right=player_stats, on="match_id")
```
And for later use, I wrote the result of the dataframe to a new CSV file, allowing me to dispose of all the small CSV files I generated earlier.
```
skilled_player_matches.to_csv(r"~\projects\Dota 2\skilled_player_matches.csv", index=False)
```
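The chunk-filter-append steps above can also be sketched without writing any intermediate files. This is a minimal illustration rather than the code used for the project: `filter_csv_in_chunks` and the tiny in-memory CSV are hypothetical stand-ins, and the sketch assumes the filtered rows fit in memory once combined.

```python
import io

import pandas as pd


def filter_csv_in_chunks(source, keep, chunksize=2_500_000):
    """Read a CSV piece by piece and keep only the rows selected by `keep`."""
    kept = [keep(chunk) for chunk in pd.read_csv(source, chunksize=chunksize)]
    return pd.concat(kept, ignore_index=True)


# Tiny stand-in for match_skill.csv (made-up values, for illustration only).
demo_csv = io.StringIO("match_id,skill\n1,1\n2,2\n3,2\n4,3\n5,2\n")

high_skill_demo = filter_csv_in_chunks(
    demo_csv,
    keep=lambda chunk: chunk.loc[chunk["skill"] == 2],
    chunksize=2,  # deliberately tiny so the demo spans several chunks
)
print(high_skill_demo)
```

Only one chunk plus the rows kept so far are held in memory at a time, which is the same property the file-based version relies on.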

## Process to Import Data With Dask

The above method works, but it is clumsy, leaves intermediate files behind, and does not lend itself well to reproducibility.

First, to prevent Dask-related issues, I must explicitly state the datatype of each column.

In [3]:
dtypes = {'match_id': 'int64',
 'skill': 'int64',
 'player_slot': 'int64',
 'hero_id': 'int64',
 'item_0': 'int64',
 'item_1': 'int64',
 'item_2': 'int64',
 'item_3': 'int64',
 'item_4': 'int64',
 'item_5': 'int64',
 'kills': 'int64',
 'deaths': 'int64',
 'assists': 'int64',
 'leaver_status': 'float64',
 'gold': 'float64',
 'last_hits': 'int64',
 'denies': 'int64',
 'gold_per_min': 'int64',
 'xp_per_min': 'int64',
 'gold_spent': 'float64',
 'hero_damage': 'float64',
 'tower_damage': 'float64',
 'hero_healing': 'float64',
 'level': 'int64',
 'additional_units': 'object',
 'stuns': 'float64',
 'max_hero_hit': 'object',
 'times': 'object',
 'gold_t': 'object',
 'lh_t': 'object',
 'xp_t': 'object',
 'obs_log': 'object',
 'sen_log': 'object',
 'purchase_log': 'object',
 'kills_log': 'object',
 'buyback_log': 'object',
 'lane_pos': 'object',
 'obs': 'object',
 'sen': 'object',
 'actions': 'object',
 'pings': 'object',
 'purchase': 'object',
 'gold_reasons': 'object',
 'xp_reasons': 'object',
 'killed': 'object',
 'item_uses': 'object',
 'ability_uses': 'object',
 'hero_hits': 'object',
 'damage': 'object',
 'damage_taken': 'object',
 'damage_inflictor': 'object',
 'runes': 'object',
 'killed_by': 'object',
 'kill_streaks': 'object',
 'multi_kills': 'object',
 'life_state': 'object'}
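
A mapping like the one above does not have to be typed from scratch. As a rough sketch, a first draft can be inferred from a small sample with plain pandas and then corrected by hand. The helper name `draft_dtypes` and the three-column sample are hypothetical, and a sample can mislead: a column that looks like `int64` in the first rows may contain missing values further down and need `float64`.

```python
import io

import pandas as pd


def draft_dtypes(source, nrows=1000):
    """Infer a {column: dtype-name} mapping from the first `nrows` rows."""
    sample = pd.read_csv(source, nrows=nrows)
    return {column: str(dtype) for column, dtype in sample.dtypes.items()}


# Made-up three-column sample standing in for the real file.
demo_csv = io.StringIO(
    'match_id,stuns,pings\n'
    '1,20.5,"{""0"":3}"\n'
    '2,,"{""0"":38}"\n'
)
demo_dtypes = draft_dtypes(demo_csv)
print(demo_dtypes)
```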

In [4]:
player_stats = dd.read_csv("player_matches_small.csv", dtype=dtypes)

Now I filter out the data I do not need: rows that contain null values in the "ability_uses" column are removed, and the "account_id" column is dropped.

In [5]:
player_stats = player_stats.loc[player_stats["ability_uses"].notnull()].drop(columns="account_id")

Now to read in the separate CSV that lets me remove observations that are not from the highest-skilled players (skill 2). Dropping the rows whose "skill" value is not 2 before merging helps to eliminate unnecessary memory overhead.

In [6]:
match_skill = dd.read_csv("match_skill.csv")
match_skill = match_skill.loc[match_skill["skill"] == 2]

Now that the Dask tasks are defined in their proper order, I can compute the result into a Pandas dataframe. I will explain below why this is suitable.

In [7]:
skilled_player_stats = dd.merge(left=match_skill, right=player_stats, on="match_id").compute()
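
The filter-then-merge logic can be checked on a toy example. The sketch below uses plain pandas rather than Dask, with made-up rows, purely to show which rows an inner merge on `match_id` keeps.

```python
import pandas as pd

# Made-up rows standing in for the two filtered inputs.
match_skill_demo = pd.DataFrame({"match_id": [10, 11, 12], "skill": [2, 2, 2]})
player_stats_demo = pd.DataFrame(
    {"match_id": [10, 10, 12, 13], "hero_id": [5, 7, 9, 1]}
)

# The default how="inner" keeps only match_ids present in both frames.
merged_demo = pd.merge(left=match_skill_demo, right=player_stats_demo, on="match_id")
print(merged_demo)
```

Match 11 has no player rows and match 13 has no skill row, so neither appears in the result.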

In [10]:
skilled_player_stats.info(memory_usage="deep")

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9289 entries, 0 to 129
Data columns (total 56 columns):
match_id            9289 non-null int64
skill               9289 non-null int64
player_slot         9289 non-null int64
hero_id             9289 non-null int64
item_0              9289 non-null int64
item_1              9289 non-null int64
item_2              9289 non-null int64
item_3              9289 non-null int64
item_4              9289 non-null int64
item_5              9289 non-null int64
kills               9289 non-null int64
deaths              9289 non-null int64
assists             9289 non-null int64
leaver_status       9289 non-null float64
gold                9289 non-null float64
last_hits           9289 non-null int64
denies              9289 non-null int64
gold_per_min        9289 non-null int64
xp_per_min          9289 non-null int64
gold_spent          9289 non-null float64
hero_damage         9289 non-null float64
tower_damage        9289 non-null float64
hero

Since the filtered data is so small (9,289 rows), I will retain it as a Pandas dataframe until I get to the model training.

In [11]:
skilled_player_stats.describe()

| | match_id | skill | player_slot | hero_id | item_0 | item_1 | item_2 | item_3 | item_4 | item_5 | ... | last_hits | denies | gold_per_min | xp_per_min | gold_spent | hero_damage | tower_damage | hero_healing | level | stuns |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | ... | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 9289.0 | 6877.0 |
| mean | 2315296000.0 | 2.0 | 66.147056 | 52.104963 | 99.561847 | 102.90914 | 97.861018 | 93.113037 | 87.903972 | 77.570244 | ... | 140.677037 | 5.33825 | 420.980299 | 446.880289 | 14689.853052 | 11878.72236 | 1337.827538 | 459.928625 | 18.154161 | 40.877252 |
| std | 39157540.0 | 0.0 | 64.016537 | 31.723346 | 73.20633 | 71.746296 | 71.560717 | 72.14697 | 72.549821 | 73.718667 | ... | 109.909288 | 5.967534 | 141.416588 | 145.307008 | 7029.270891 | 6918.410586 | 1813.633859 | 1213.586262 | 4.59376 | 40.756453 |
| min | 1662940000.0 | 2.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 106.0 | 0.0 | 75.0 | 0.0 | 0.0 | 0.0 | 1.0 | -2.89397 |
| 25% | 2317870000.0 | 2.0 | 2.0 | 23.0 | 46.0 | 46.0 | 41.0 | 40.0 | 34.0 | 1.0 | ... | 57.0 | 1.0 | 317.0 | 340.0 | 9750.0 | 6819.0 | 114.0 | 0.0 | 15.0 | 10.8656 |
| 50% | 2317901000.0 | 2.0 | 128.0 | 52.0 | 92.0 | 100.0 | 92.0 | 71.0 | 63.0 | 48.0 | ... | 117.0 | 3.0 | 402.0 | 444.0 | 13600.0 | 10474.0 | 560.0 | 0.0 | 18.0 | 30.7122 |
| 75% | 2317936000.0 | 2.0 | 130.0 | 75.0 | 164.0 | 158.0 | 152.0 | 149.0 | 143.0 | 135.0 | ... | 194.0 | 7.0 | 511.0 | 547.0 | 18555.0 | 15558.0 | 1841.0 | 224.0 | 22.0 | 56.8816 |
| max | 2317991000.0 | 2.0 | 132.0 | 113.0 | 254.0 | 254.0 | 254.0 | 254.0 | 254.0 | 254.0 | ... | 1068.0 | 81.0 | 1154.0 | 1006.0 | 67260.0 | 61295.0 | 13747.0 | 17123.0 | 25.0 | 422.876 |


More than half of the dataset's features appear to be non-numerical. Let's look at the last 32 columns to see what they contain.

In [12]:
skilled_player_stats[skilled_player_stats.columns[24:40]]

,additional_units,stuns,max_hero_hit,times,gold_t,lh_t,xp_t,obs_log,sen_log,purchase_log,kills_log,buyback_log,lane_pos,obs,sen,actions
0,,20.569000,"{""type"":""max_hero_hit"",""time"":1170,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,100,282,381,948,1127,1227,1366,1466,1566,18...","{0,0,2,2,5,5,5,6,6,6,6,10,11,15,15,15,15,18,19...","{0,51,339,514,929,1102,1153,1359,1359,1503,153...","{""{\""time\"":-9,\""key\"":[100,158]}"",""{\""time\"":...",{},"{""{\""time\"":-81,\""key\"":\""tango\""}"",""{\""time\""...","{""{\""time\"":189,\""key\"":\""npc_dota_hero_furion...",{},"{""70"":{""164"":1,""166"":1,""168"":1,""170"":5},""72"":{...","{""100"":{""158"":1},""140"":{""106"":1},""166"":{""102"":1}}",{},"{""1"":3491,""2"":90,""4"":382,""5"":11,""6"":107,""7"":5,..."
1,,38.754900,"{""type"":""max_hero_hit"",""time"":1053,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,100,201,300,578,678,1266,1438,1538,1638,189...","{0,0,0,0,1,1,1,4,4,4,4,4,5,5,5,5,8,8,8,15,20,2...","{0,56,231,313,712,774,1216,1339,1640,1887,2083...","{""{\""time\"":143,\""key\"":[118,138]}"",""{\""time\""...","{""{\""time\"":201,\""key\"":[154,88]}"",""{\""time\"":...","{""{\""time\"":-41,\""key\"":\""ward_observer\""}"",""{...","{""{\""time\"":305,\""key\"":\""npc_dota_hero_slark\...",{},"{""70"":{""74"":1,""76"":1,""78"":8},""74"":{""78"":1,""158...","{""94"":{""160"":1},""100"":{""128"":1},""112"":{""146"":1...","{""90"":{""160"":1},""94"":{""160"":1},""110"":{""146"":1}...","{""1"":1461,""2"":11,""4"":107,""5"":61,""6"":116,""7"":16..."
2,,85.917400,"{""type"":""max_hero_hit"",""time"":1589,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,280,460,765,1231,1808,2198,2457,2557,3047,4...","{0,2,4,9,17,24,31,35,35,44,53,55,65,67,72,77,8...","{0,317,760,1131,1673,2173,2637,2864,2864,3674,...",{},{},"{""{\""time\"":-65,\""key\"":\""circlet\""}"",""{\""time...","{""{\""time\"":293,\""key\"":\""npc_dota_hero_furion...",{},"{""72"":{""76"":50,""78"":1},""74"":{""78"":32,""80"":1,""8...",{},{},"{""1"":3203,""2"":89,""3"":83,""4"":714,""5"":68,""6"":46,..."
3,,13.024900,"{""type"":""max_hero_hit"",""time"":1951,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,146,413,553,1022,1314,1882,2146,2494,2680,3...","{0,1,5,6,11,14,19,23,29,31,34,43,48,49,55,62,6...","{0,51,339,463,878,1175,1717,1923,2506,2650,279...",{},{},"{""{\""time\"":-80,\""key\"":\""quelling_blade\""}"",""...","{""{\""time\"":306,\""key\"":\""npc_dota_hero_omnikn...",{},"{""70"":{""74"":10,""76"":8},""72"":{""74"":2,""76"":4,""16...",{},{},"{""1"":4110,""2"":249,""4"":1903,""5"":103,""6"":23,""7"":..."
4,,0.033418,"{""type"":""max_hero_hit"",""time"":2130,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,138,363,631,1113,1297,1605,1995,2262,2528,3...","{0,1,4,8,11,13,18,25,29,33,42,48,54,56,65,77,8...","{0,87,262,716,1185,1505,1979,2433,2672,2878,39...",{},{},"{""{\""time\"":-85,\""key\"":\""tango\""}"",""{\""time\""...","{""{\""time\"":227,\""key\"":\""npc_dota_hero_windru...",{},"{""70"":{""76"":8},""72"":{""76"":1,""78"":7},""74"":{""76""...",{},{},"{""1"":4306,""2"":96,""3"":1,""4"":625,""5"":19,""6"":20,""..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,,26.494800,"{""type"":""max_hero_hit"",""time"":1100,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,181,405,620,853,1023,1166,1634,1773,2056,23...","{0,2,5,8,11,12,13,14,15,18,21,21,22,22,23,28,3...","{0,144,319,742,1030,1195,1494,1894,2089,2406,2...","{""{\""time\"":1834,\""key\"":[176,142]}"",""{\""time\...","{""{\""time\"":2086,\""key\"":[176,92]}""}","{""{\""time\"":-81,\""key\"":\""ring_of_protection\""...","{""{\""time\"":400,\""key\"":\""npc_dota_hero_phanto...",{},"{""148"":{""118"":1,""120"":2},""150"":{""108"":2,""110"":...","{""90"":{""174"":1},""144"":{""168"":1},""176"":{""142"":1}}","{""176"":{""92"":1}}","{""1"":4148,""4"":538,""5"":37,""6"":49,""7"":3,""8"":64,""..."
126,,57.974100,"{""type"":""max_hero_hit"",""time"":2051,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,100,200,300,400,500,600,868,968,1068,1168,1...","{0,0,0,0,0,0,0,0,0,0,0,1,1,2,2,2,4,6,8,8,11,11...","{0,82,257,339,390,493,565,843,1038,1100,1277,1...",{},{},"{""{\""time\"":-74,\""key\"":\""orb_of_venom\""}"",""{\...","{""{\""time\"":1101,\""key\"":\""npc_dota_hero_legio...",{},"{""68"":{""160"":1},""70"":{""156"":2,""158"":1,""160"":2,...",{},{},"{""1"":1363,""2"":1,""3"":30,""4"":310,""5"":6,""6"":44,""7..."
127,,15.179000,"{""type"":""max_hero_hit"",""time"":2180,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,100,200,300,400,500,600,700,800,942,1042,13...","{0,0,0,0,0,0,0,0,0,1,1,4,7,8,8,8,8,8,8,8,8,11,...","{0,0,164,246,246,374,617,750,1000,1144,1177,15...","{""{\""time\"":115,\""key\"":[80,150]}"",""{\""time\"":...","{""{\""time\"":1273,\""key\"":[148,114]}"",""{\""time\...","{""{\""time\"":-81,\""key\"":\""courier\""}"",""{\""time...",{},{},"{""70"":{""164"":2,""166"":2},""72"":{""164"":2,""168"":2,...","{""80"":{""150"":1},""100"":{""142"":1},""110"":{""148"":2...","{""88"":{""160"":1},""122"":{""136"":1},""128"":{""144"":1...","{""1"":3392,""4"":389,""5"":31,""6"":76,""7"":2,""8"":22,""..."
128,,,"{""type"":""max_hero_hit"",""time"":2190,""max"":true,...","{0,60,120,180,240,300,360,420,480,540,600,660,...","{0,182,319,500,764,1028,1408,1623,1956,2141,25...","{0,2,3,5,9,13,19,22,27,29,36,38,44,52,60,69,75...","{0,196,515,783,1278,1618,1861,2180,2430,2677,2...",{},{},"{""{\""time\"":-70,\""key\"":\""stout_shield\""}"",""{\...","{""{\""time\"":2061,\""key\"":\""npc_dota_hero_trean...",{},"{""72"":{""160"":2,""162"":6},""74"":{""158"":2,""160"":2,...",{},{},"{""1"":3875,""3"":4,""4"":1074,""5"":6,""6"":2,""7"":4,""8""..."


In [13]:
skilled_player_stats[skilled_player_stats.columns[-16:]]

,pings,purchase,gold_reasons,xp_reasons,killed,item_uses,ability_uses,hero_hits,damage,damage_taken,damage_inflictor,runes,killed_by,kill_streaks,multi_kills,life_state
0,"{""0"":3}","{""tango"":1,""branches"":3,""enchanted_mango"":1,""w...","{""0"":897,""1"":-1075,""6"":137,""11"":4181,""12"":2748...","{""0"":526,""1"":4800,""2"":7057}","{""npc_dota_creep_badguys_melee"":35,""npc_dota_h...","{""tango"":4,""ward_observer"":3,""enchanted_mango""...","{""ogre_magi_ignite"":40,""ogre_magi_fireblast"":1...","{""ogre_magi_ignite"":148,""undefined"":44,""ogre_m...","{""npc_dota_hero_slark"":4519,""npc_dota_creep_ba...","{""npc_dota_creep_badguys_melee"":724,""npc_dota_...","{""ogre_magi_ignite"":4744,""undefined"":2123,""ogr...","{""3"":1,""5"":3}","{""npc_dota_hero_omniknight"":1,""npc_dota_hero_z...","{""3"":1}",{},"{""0"":2163,""1"":10,""2"":148}"
1,"{""0"":38}","{""ward_observer"":10,""courier"":1,""clarity"":4,""t...","{""0"":816,""11"":3860,""12"":2525,""13"":1520,""14"":200}","{""0"":318,""1"":6018,""2"":7607}","{""npc_dota_creep_goodguys_melee"":2,""npc_dota_c...","{""courier"":1,""ward_observer"":6,""ward_sentry"":5...","{""treant_living_armor"":54,""treant_natures_guis...","{""treant_leech_seed"":61,""undefined"":8}","{""npc_dota_creep_goodguys_melee"":382,""npc_dota...","{""npc_dota_hero_windrunner"":382,""npc_dota_cree...","{""treant_leech_seed"":1580,""undefined"":468}","{""2"":1,""3"":1,""4"":1,""5"":1,""6"":2}",{},{},{},"{""0"":2321}"
2,"{""0"":45}","{""circlet"":1,""mantle"":1,""null_talisman"":1,""rec...","{""0"":969,""1"":-1317,""6"":330,""11"":5120,""12"":7047...","{""0"":506,""1"":10992,""2"":13979,""3"":894}","{""npc_dota_creep_badguys_melee"":117,""npc_dota_...","{""flask"":1,""tango_single"":1,""bottle"":43,""tpscr...","{""invoker_exort"":560,""invoker_invoke"":85,""invo...","{""undefined"":95,""invoker_cold_snap"":36,""invoke...","{""npc_dota_creep_goodguys_melee"":336,""npc_dota...","{""npc_dota_hero_zuus"":8990,""npc_dota_creep_bad...","{""undefined"":8374,""invoker_cold_snap"":1012,""in...","{""4"":2,""5"":4}","{""npc_dota_hero_furion"":1,""npc_dota_hero_slark...","{""3"":2,""4"":2,""5"":2,""6"":2,""7"":2,""8"":2,""9"":1}","{""2"":1}","{""0"":2142,""1"":10,""2"":169}"
3,"{""0"":7}","{""quelling_blade"":1,""ring_of_protection"":1,""ta...","{""0"":728,""1"":-956,""11"":4747,""12"":2783,""13"":631...","{""0"":196,""1"":6218,""2"":11610}","{""npc_dota_creep_badguys_ranged"":15,""npc_dota_...","{""tango"":12,""faerie_fire"":2,""tpscroll"":8,""quel...","{""beastmaster_call_of_the_wild"":23,""beastmaste...","{""undefined"":172,""beastmaster_wild_axes"":21,""b...","{""npc_dota_hero_omniknight"":1380,""npc_dota_her...","{""npc_dota_hero_omniknight"":1052,""npc_dota_her...","{""undefined"":3530,""beastmaster_wild_axes"":1930...","{""5"":1}","{""npc_dota_hero_omniknight"":1,""npc_dota_hero_s...","{""3"":1}",{},"{""0"":2183,""1"":12,""2"":126}"
4,"{""0"":4}","{""tango"":2,""slippers"":1,""clarity"":2,""flask"":1,...","{""0"":710,""1"":-269,""6"":1000,""11"":5084,""12"":5063...","{""0"":156,""1"":9888,""2"":18448,""3"":894}","{""npc_dota_creep_badguys_melee"":114,""npc_dota_...","{""tango"":8,""flask"":1,""clarity"":2,""phase_boots""...","{""juggernaut_blade_fury"":17,""juggernaut_healin...","{""juggernaut_blade_fury"":166,""undefined"":46,""j...","{""npc_dota_creep_badguys_melee"":63382,""npc_dot...","{""npc_dota_creep_badguys_melee"":1205,""npc_dota...","{""juggernaut_blade_fury"":3586,""undefined"":9240...","{""1"":2,""5"":1,""6"":1}","{""npc_dota_hero_zuus"":1}","{""3"":1,""4"":1,""5"":1,""6"":1,""7"":1,""8"":1,""9"":1}","{""2"":1}","{""0"":2282,""1"":3,""2"":36}"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
125,"{""0"":41}","{""ring_of_protection"":1,""stout_shield"":1,""tang...","{""0"":943,""1"":-1854,""6"":100,""11"":240,""12"":1430,...","{""0"":498,""1"":2761,""2"":9659}","{""npc_dota_creep_goodguys_melee"":85,""npc_dota_...","{""tango"":4,""faerie_fire"":1,""branches"":1,""magic...","{""centaur_hoof_stomp"":25,""centaur_double_edge""...","{""centaur_hoof_stomp"":11,""undefined"":15,""centa...","{""npc_dota_creep_goodguys_melee"":44278,""npc_do...","{""npc_dota_hero_lich"":2366,""npc_dota_hero_phan...","{""centaur_hoof_stomp"":1549,""undefined"":842,""ce...","{""5"":4}","{""npc_dota_hero_lich"":2,""npc_dota_hero_phantom...",{},{},"{""0"":2264,""1"":20,""2"":240}"
126,"{""0"":24}","{""orb_of_venom"":1,""stout_shield"":1,""tango"":1,""...","{""0"":781,""1"":-3019,""6"":1887,""11"":483,""12"":1465...","{""0"":245,""1"":1803,""2"":9000}","{""npc_dota_creep_goodguys_melee"":26,""npc_dota_...","{""tango"":4,""magic_stick"":14,""tango_single"":1,""...","{""spirit_breaker_charge_of_darkness"":24,""spiri...","{""undefined"":22,""orb_of_venom"":51,""spirit_brea...","{""npc_dota_hero_phantom_assassin"":1355,""npc_do...","{""npc_dota_hero_phantom_assassin"":4834,""npc_do...","{""undefined"":1137,""orb_of_venom"":102,""spirit_b...","{""3"":2,""5"":2}","{""npc_dota_hero_phantom_assassin"":2,""npc_dota_...",{},{},"{""0"":2094,""1"":31,""2"":399}"
127,"{""0"":14}","{""courier"":1,""ward_observer"":14,""tango"":1,""fla...","{""0"":745,""1"":-1763,""11"":240,""12"":471,""13"":897}","{""0"":266,""1"":1141,""2"":5482}","{""npc_dota_creep_badguys_ranged"":3,""npc_dota_c...","{""courier"":1,""ward_observer"":8,""tango"":4,""bran...","{""rubick_telekinesis"":8,""rubick_telekinesis_la...","{""undefined"":26,""rubick_fade_bolt"":5}","{""npc_dota_hero_treant"":571,""npc_dota_hero_leg...","{""npc_dota_hero_treant"":512,""npc_dota_creep_go...","{""undefined"":981,""rubick_fade_bolt"":565}","{""4"":1,""5"":1}","{""npc_dota_hero_treant"":1,""npc_dota_hero_invok...",{},{},"{""0"":2270,""1"":21,""2"":233}"
128,"{""0"":33}","{""stout_shield"":1,""tango"":3,""branches"":2,""flas...","{""0"":913,""1"":-509,""6"":350,""11"":240,""12"":2036,""...","{""0"":498,""1"":3541,""2"":12450}","{""npc_dota_creep_goodguys_melee"":89,""npc_dota_...","{""tango"":4,""branches"":2,""power_treads"":9,""faer...","{""slark_pounce"":16,""slark_dark_pact"":17,""slark...","{""undefined"":56,""slark_pounce"":7,""slark_dark_p...","{""npc_dota_creep_goodguys_melee"":28193,""npc_do...","{""npc_dota_creep_goodguys_melee"":862,""npc_dota...","{""undefined"":4404,""slark_pounce"":1012,""slark_d...","{""5"":3}","{""npc_dota_hero_phantom_assassin"":1}",{},{},"{""0"":2456,""1"":1,""2"":67}"


So, these 31 object columns mostly contain dictionaries with various numerical values. It is unclear how deep some of these dictionaries go, so it is time to do some exploring to find out just how much dimensionality this dataset could potentially have! I can already see that "times" and every column with the "_t" suffix are actually lists of numerical values, and would probably create far too much dimensionality.
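
One way to explore how deep these cells go is to parse each one and count the levels of nesting. The sketch below is an assumption about how that exploration could be done, not the method used in this notebook; `nesting_depth` and `cell_depth` are hypothetical helpers. It uses `json.loads`, so cells that are not valid JSON (such as the brace-delimited lists in "times") return `None` instead of a depth.

```python
import json


def nesting_depth(value):
    """Return how many levels of dict/list nesting `value` contains."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def cell_depth(raw):
    """Depth of one raw cell, or None when the cell is not valid JSON."""
    try:
        return nesting_depth(json.loads(raw))
    except (TypeError, ValueError):
        return None


# Cell values shortened from the sample rows shown above.
print(cell_depth('{"0":3}'))            # a "pings" cell
print(cell_depth('{"100":{"158":1}}'))  # an "obs" cell
print(cell_depth('{0,60,120,180}'))     # a "times" cell: not JSON
```

Applied with `Series.map`, the maximum depth per column gives a quick indication of which columns can be flattened into a handful of features and which would explode into many.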

## The Informed Approach

There is a column called "leaver_status"; a value greater than 0 indicates that the player left the game at some point.

The different numbers indicate how the player left the game. 1 is where they disconnected, but reconnected within 5 minutes of disconnection, 2 is where they disconnected and didn't come back within 5 minutes of disconnecting, 3 is where they actually pressed the "Quit" button, and 4 is where they stayed connected to the game, but didn't actively participate. 

Since the game is team based, if a team is lacking one or more players, it will make that team perform worse, and the opposing team perform better, causing the data to not be an accurate representation of a player's usual behavior and performance.

In [14]:
skilled_player_stats['leaver_status'].value_counts()

0.0    9004
1.0     187
3.0      48
2.0      41
4.0       9
Name: leaver_status, dtype: int64

So I will filter out all players that were in a game where at least one player left (or was inactive) at any point in the game.

In [8]:
# Matches in which at least one player left (or was inactive) at any point
leaver_match_ids = skilled_player_stats.loc[
    skilled_player_stats["leaver_status"] > 0, "match_id"]

# Keep only the players from matches with no leavers
skilled_player_stats_informed = skilled_player_stats[
    ~skilled_player_stats["match_id"].isin(leaver_match_ids)].reset_index(drop=True)

In [16]:
len(skilled_player_stats_informed)

7794

That filtered out 1,495 player records (9,289 down to 7,794), but there is still a decent amount to work with.
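
The match-level filter can also be written with a group-wise transform, which may read more directly as "drop every row of a match that contains a leaver." This is an alternative formulation on made-up rows, not the code used above, and it assumes "leaver_status" has no missing or negative values.

```python
import pandas as pd

# Made-up rows: match 1 is clean, match 2 contains one leaver.
demo = pd.DataFrame(
    {
        "match_id": [1, 1, 2, 2],
        "player_slot": [0, 128, 0, 128],
        "leaver_status": [0.0, 0.0, 0.0, 3.0],
    }
)

# Worst leaver_status seen in each row's match, broadcast back to every row.
worst_in_match = demo.groupby("match_id")["leaver_status"].transform("max")

clean_demo = demo[worst_in_match == 0].reset_index(drop=True)
print(clean_demo)
```

Note that the player in match 2 who stayed connected is removed as well, which is the intended behavior here.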

I will start by dropping the same columns that I dropped in the naive approach.

**Daskified Version Note**: I omitted the naive approach section of this notebook as it ended up being inferior, and I mainly wanted to demonstrate how Dask can assist in reading in the large amount of data and in the parallelization of clustering.

In [9]:
skilled_player_stats_informed = skilled_player_stats_informed.drop(
    columns=["leaver_status", "actions", "additional_units", "times", 
             "gold_t", "lh_t", "xp_t", "max_hero_hit", "obs_log", 
             "sen_log", "purchase_log", "kills_log", "buyback_log", 
             "life_state"])

In [18]:
skilled_player_stats_informed[skilled_player_stats_informed.columns[4:23]
                             ].head()

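| | item_0 | item_1 | item_2 | item_3 | item_4 | item_5 | kills | deaths | assists | gold | last_hits | denies | gold_per_min | xp_per_min | gold_spent | hero_damage | tower_damage | hero_healing | level |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100 | 90 | 46 | 178 | 214 | 181 | 4 | 5 | 12 | 3648.0 | 58 | 6 | 369 | 340 | 9800.0 | 9320.0 | 362.0 | 0.0 | 15 |
| 1 | 11 | 231 | 37 | 1 | 0 | 46 | 1 | 0 | 16 | 822.0 | 36 | 3 | 328 | 384 | 11725.0 | 1699.0 | 94.0 | 1519.0 | 16 |
| 2 | 48 | 100 | 203 | 108 | 110 | 65 | 20 | 3 | 8 | 1740.0 | 188 | 4 | 709 | 726 | 24140.0 | 20841.0 | 4494.0 | 0.0 | 22 |
| 3 | 1 | 81 | 48 | 11 | 194 | 46 | 4 | 4 | 12 | 3797.0 | 158 | 0 | 490 | 496 | 13650.0 | 6316.0 | 2112.0 | 454.0 | 18 |
| 4 | 50 | 249 | 145 | 135 | 168 | 108 | 11 | 1 | 6 | 858.0 | 312 | 21 | 735 | 810 | 27425.0 | 14669.0 | 4621.0 | 0.0 | 23 |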


The item_0 through item_5 columns are useful, as they tell what items each player had equipped at the end of their game, and the other columns so far seem useful and do not need to be dropped or engineered.

In [19]:
skilled_player_stats_informed[skilled_player_stats_informed.columns[23:43]
                             ].head()

,stuns,lane_pos,obs,sen,pings,purchase,gold_reasons,xp_reasons,killed,item_uses,ability_uses,hero_hits,damage,damage_taken,damage_inflictor,runes,killed_by,kill_streaks,multi_kills
0,20.569,"{""70"":{""164"":1,""166"":1,""168"":1,""170"":5},""72"":{...","{""100"":{""158"":1},""140"":{""106"":1},""166"":{""102"":1}}",{},"{""0"":3}","{""tango"":1,""branches"":3,""enchanted_mango"":1,""w...","{""0"":897,""1"":-1075,""6"":137,""11"":4181,""12"":2748...","{""0"":526,""1"":4800,""2"":7057}","{""npc_dota_creep_badguys_melee"":35,""npc_dota_h...","{""tango"":4,""ward_observer"":3,""enchanted_mango""...","{""ogre_magi_ignite"":40,""ogre_magi_fireblast"":1...","{""ogre_magi_ignite"":148,""undefined"":44,""ogre_m...","{""npc_dota_hero_slark"":4519,""npc_dota_creep_ba...","{""npc_dota_creep_badguys_melee"":724,""npc_dota_...","{""ogre_magi_ignite"":4744,""undefined"":2123,""ogr...","{""3"":1,""5"":3}","{""npc_dota_hero_omniknight"":1,""npc_dota_hero_z...","{""3"":1}",{}
1,38.7549,"{""70"":{""74"":1,""76"":1,""78"":8},""74"":{""78"":1,""158...","{""94"":{""160"":1},""100"":{""128"":1},""112"":{""146"":1...","{""90"":{""160"":1},""94"":{""160"":1},""110"":{""146"":1}...","{""0"":38}","{""ward_observer"":10,""courier"":1,""clarity"":4,""t...","{""0"":816,""11"":3860,""12"":2525,""13"":1520,""14"":200}","{""0"":318,""1"":6018,""2"":7607}","{""npc_dota_creep_goodguys_melee"":2,""npc_dota_c...","{""courier"":1,""ward_observer"":6,""ward_sentry"":5...","{""treant_living_armor"":54,""treant_natures_guis...","{""treant_leech_seed"":61,""undefined"":8}","{""npc_dota_creep_goodguys_melee"":382,""npc_dota...","{""npc_dota_hero_windrunner"":382,""npc_dota_cree...","{""treant_leech_seed"":1580,""undefined"":468}","{""2"":1,""3"":1,""4"":1,""5"":1,""6"":2}",{},{},{}
2,85.9174,"{""72"":{""76"":50,""78"":1},""74"":{""78"":32,""80"":1,""8...",{},{},"{""0"":45}","{""circlet"":1,""mantle"":1,""null_talisman"":1,""rec...","{""0"":969,""1"":-1317,""6"":330,""11"":5120,""12"":7047...","{""0"":506,""1"":10992,""2"":13979,""3"":894}","{""npc_dota_creep_badguys_melee"":117,""npc_dota_...","{""flask"":1,""tango_single"":1,""bottle"":43,""tpscr...","{""invoker_exort"":560,""invoker_invoke"":85,""invo...","{""undefined"":95,""invoker_cold_snap"":36,""invoke...","{""npc_dota_creep_goodguys_melee"":336,""npc_dota...","{""npc_dota_hero_zuus"":8990,""npc_dota_creep_bad...","{""undefined"":8374,""invoker_cold_snap"":1012,""in...","{""4"":2,""5"":4}","{""npc_dota_hero_furion"":1,""npc_dota_hero_slark...","{""3"":2,""4"":2,""5"":2,""6"":2,""7"":2,""8"":2,""9"":1}","{""2"":1}"
3,13.0249,"{""70"":{""74"":10,""76"":8},""72"":{""74"":2,""76"":4,""16...",{},{},"{""0"":7}","{""quelling_blade"":1,""ring_of_protection"":1,""ta...","{""0"":728,""1"":-956,""11"":4747,""12"":2783,""13"":631...","{""0"":196,""1"":6218,""2"":11610}","{""npc_dota_creep_badguys_ranged"":15,""npc_dota_...","{""tango"":12,""faerie_fire"":2,""tpscroll"":8,""quel...","{""beastmaster_call_of_the_wild"":23,""beastmaste...","{""undefined"":172,""beastmaster_wild_axes"":21,""b...","{""npc_dota_hero_omniknight"":1380,""npc_dota_her...","{""npc_dota_hero_omniknight"":1052,""npc_dota_her...","{""undefined"":3530,""beastmaster_wild_axes"":1930...","{""5"":1}","{""npc_dota_hero_omniknight"":1,""npc_dota_hero_s...","{""3"":1}",{}
4,0.033418,"{""70"":{""76"":8},""72"":{""76"":1,""78"":7},""74"":{""76""...",{},{},"{""0"":4}","{""tango"":2,""slippers"":1,""clarity"":2,""flask"":1,...","{""0"":710,""1"":-269,""6"":1000,""11"":5084,""12"":5063...","{""0"":156,""1"":9888,""2"":18448,""3"":894}","{""npc_dota_creep_badguys_melee"":114,""npc_dota_...","{""tango"":8,""flask"":1,""clarity"":2,""phase_boots""...","{""juggernaut_blade_fury"":17,""juggernaut_healin...","{""juggernaut_blade_fury"":166,""undefined"":46,""j...","{""npc_dota_creep_badguys_melee"":63382,""npc_dot...","{""npc_dota_creep_badguys_melee"":1205,""npc_dota...","{""juggernaut_blade_fury"":3586,""undefined"":9240...","{""1"":2,""5"":1,""6"":1}","{""npc_dota_hero_zuus"":1}","{""3"":1,""4"":1,""5"":1,""6"":1,""7"":1,""8"":1,""9"":1}","{""2"":1}"


## Feature Engineering and Cleaning

### Feature Engineering

The "pings" column can easily be converted to just a single column of the amount of times a player marked a location of interest on the map, showing how active of a communicator they are.

In [12]:
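# In the rows shown above every "pings" cell has only the key "0", so this yields a single column.
skilled_player_stats_informed["pings"] = json_normalize(
    skilled_player_stats_informed["pings"].map(eval)
)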
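
Because these cells are JSON text, `json.loads` is a safer parser than `eval`: `eval` executes whatever the string contains and also fails on JSON literals such as `true` and `null`. The sketch below shows the idea on two cell values; `ping_count` is a hypothetical helper, and it assumes every cell is a JSON object with numeric values. Unlike the cell above, an empty `{}` becomes 0 directly rather than a missing value.

```python
import json


def ping_count(raw):
    """Total pings recorded in one raw "pings" cell."""
    return sum(json.loads(raw).values())


print(ping_count('{"0":38}'))
print(ping_count('{}'))
```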

In [10]:
skilled_player_stats_informed = pd.concat([
    skilled_player_stats_informed,
    json_normalize(skilled_player_stats_informed["gold_reasons"].map(eval)).add_prefix("gold_reasons_")
    ], axis=1
)

In [11]:
skilled_player_stats_informed = pd.concat([
    skilled_player_stats_informed,
    json_normalize(skilled_player_stats_informed["xp_reasons"].map(eval)).add_prefix("xp_reasons_")
    ], axis=1
)
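
The expand-and-prefix pattern used for "gold_reasons" and "xp_reasons" can be wrapped in one small helper. This is a sketch under assumptions, not the notebook's code: `expand_dict_column` is a hypothetical name, it parses with `json.loads` instead of `eval`, and it builds the frame with the plain `pd.DataFrame` constructor, which is enough when the dictionaries are flat.

```python
import json

import pandas as pd


def expand_dict_column(frame, column, prefix):
    """Return one numeric column per key found in a JSON-object column."""
    parsed = [json.loads(raw) for raw in frame[column]]
    return pd.DataFrame(parsed, index=frame.index).add_prefix(prefix)


# Two cells copied from the "xp_reasons" sample rows shown earlier.
demo = pd.DataFrame(
    {
        "xp_reasons": [
            '{"0":526,"1":4800,"2":7057}',
            '{"0":506,"1":10992,"2":13979,"3":894}',
        ]
    }
)
expanded = expand_dict_column(demo, "xp_reasons", "xp_reasons_")
print(pd.concat([demo, expanded], axis=1))
```

Keys that are absent from a row come out as missing values, which is why a later `fillna(0)` is needed before modeling.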

"damage_taken" is interesting. It looks like a lot of dimensionality, but it can be filtered down to the columns for neutral monsters, since a certain play style takes much more damage from neutral monsters than others do. I will sum the damage taken from all neutral monsters into one single column, called "neutral_damage_taken."

In [13]:
# Expand "damage_taken" once, then keep only the neutral-monster columns.
damage_taken = json_normalize(skilled_player_stats_informed["damage_taken"].map(eval))
neutral_cols = [col for col in damage_taken if col.startswith('npc_dota_neutral')]

skilled_player_stats_informed["neutral_damage_taken"] = damage_taken[neutral_cols].sum(
    axis=1, skipna=True)
del damage_taken, neutral_cols

skilled_player_stats_informed["neutral_damage_taken"]

0       1009.0
1        525.0
2        159.0
3       6192.0
4       3045.0
         ...  
7789     816.0
7790     516.0
7791      48.0
7792    3360.0
7793    1516.0
Name: neutral_damage_taken, Length: 7794, dtype: float64
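
The same total can be computed one cell at a time, without first expanding every damage source into its own column. The sketch below is a hedged alternative: `sum_by_prefix` is a hypothetical helper, the unit names in the demo cell are made up for illustration, and the cell is assumed to be a flat JSON object with numeric values.

```python
import json


def sum_by_prefix(raw, prefix):
    """Add up the values in one JSON-object cell whose keys start with `prefix`."""
    return sum(
        amount for key, amount in json.loads(raw).items() if key.startswith(prefix)
    )


# Made-up cell; the unit names are illustrative, not taken from the data.
demo_cell = (
    '{"npc_dota_neutral_kobold":120,'
    '"npc_dota_neutral_centaur_khan":300,'
    '"npc_dota_hero_slark":4519}'
)
print(sum_by_prefix(demo_cell, "npc_dota_neutral"))
```

Mapped over the "damage_taken" column, this avoids materializing a wide intermediate frame whose only purpose is to be summed.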

I will create another feature in a similar way to "neutral_damage_taken," this time measuring how much damage a player does to lane minions (every unit whose name starts with "npc_dota_creep").

In [14]:
# Expand "damage" once, then keep only the lane-creep columns.
damage = json_normalize(skilled_player_stats_informed["damage"].map(eval))
minion_cols = [col for col in damage if col.startswith('npc_dota_creep')]

skilled_player_stats_informed['damage_to_minion'] = damage[minion_cols].sum(
    axis=1, skipna=True)
del damage, minion_cols

skilled_player_stats_informed['damage_to_minion']

0       23369.0
1        6932.0
2       71579.0
3       28463.0
4       82082.0
         ...   
7789    53928.0
7790    19797.0
7791     7211.0
7792    36437.0
7793    93117.0
Name: damage_to_minion, Length: 7794, dtype: float64

Now for "runes." Runes are power-up items that can be found on the map at certain locations. I expand this column into one column per rune type, each prefixed with "rune_."

In [15]:
skilled_player_stats_informed = pd.concat([
    skilled_player_stats_informed,
    json_normalize(skilled_player_stats_informed["runes"].map(eval)).add_prefix("rune_")
], axis=1)

Now for the final two columns, which are related: "kill_streaks" and "multi_kills."
A "kill_streak" starts when a player gets at least three kills without dying (the expanded columns start at "kill_streak_3").
A "multi_kill" is when a player kills more than one player in a short period of time.

These both should add some value in differentiating very high performing offensive players without adding too much dimensionality.

In [16]:
skilled_player_stats_informed = pd.concat([
    skilled_player_stats_informed,
    json_normalize(skilled_player_stats_informed["kill_streaks"].map(eval)
                  ).add_prefix("kill_streak_"),
    json_normalize(skilled_player_stats_informed["multi_kills"].map(eval)
                  ).add_prefix("multi_kill_")
], axis=1)
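
To make the expansion step concrete, here is a small standard-library sketch of what `json_normalize(...).add_prefix(...)` followed by a later `fillna(0)` amounts to: every key seen in any row becomes a prefixed column, and rows missing that key get 0. The helper `expand_counts` and the sample rows are hypothetical, and I am assuming each parsed entry is a flat dict of counts.

```python
def expand_counts(records, prefix):
    """Turn a list of count dicts into rows that share the same prefixed keys."""
    # Collect every key that appears in at least one record
    keys = sorted({key for record in records for key in record})
    # Missing keys are filled with 0, mirroring the later fillna(0)
    return [{f"{prefix}{key}": record.get(key, 0) for key in keys} for record in records]


# Made-up rows standing in for parsed "kill_streaks" entries
kill_streaks = [{"3": 1}, {"3": 2, "4": 1}, {}]

for row in expand_counts(kill_streaks, "kill_streak_"):
    print(row)
```

Players with no streaks at all end up as a row of zeros, which is why the zero-fill matters before scaling.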

Now to drop all the columns I won't need.

In [17]:
skilled_player_stats_informed = skilled_player_stats_informed.drop(
    columns=[
        "lane_pos", "obs", "sen", "purchase", "killed", "item_uses", 
        "ability_uses", "hero_hits", "damage", "damage_taken", 
        "damage_inflictor", "runes", "killed_by", "kill_streaks",
        "multi_kills", "gold_reasons", "xp_reasons", "skill"
    ]
).fillna(0)

In [41]:
skilled_player_stats_informed.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7794 entries, 0 to 7793
Data columns (total 59 columns):
match_id                7794 non-null int64
player_slot             7794 non-null int64
hero_id                 7794 non-null int64
item_0                  7794 non-null int64
item_1                  7794 non-null int64
item_2                  7794 non-null int64
item_3                  7794 non-null int64
item_4                  7794 non-null int64
item_5                  7794 non-null int64
kills                   7794 non-null int64
deaths                  7794 non-null int64
assists                 7794 non-null int64
gold                    7794 non-null float64
last_hits               7794 non-null int64
denies                  7794 non-null int64
gold_per_min            7794 non-null int64
xp_per_min              7794 non-null int64
gold_spent              7794 non-null float64
hero_damage             7794 non-null float64
tower_damage            7794 non-null float64
hero_

### Back to Dask

Now that feature engineering and cleaning are complete, I will convert the pandas dataframe into a Dask dataframe.

In [18]:
skilled_player_stats = dd.from_pandas(skilled_player_stats_informed, npartitions=4)

### Scaling with Dask

Since the dataframe is so small, I will have it persist in memory to improve performance.

In [19]:
scaler = StandardScaler()
X_std = scaler.fit_transform(
    skilled_player_stats.drop(
        columns=["match_id", "player_slot", "hero_id"]))
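# persist() returns a new collection; keep the reference so the result stays in memory
X_std = X_std.persist()
X_std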

Dask DataFrame Structure (npartitions=4, divisions: 0, 1949, 3898, 5847, 7793). All 56 columns are float64:

item_0, item_1, item_2, item_3, item_4, item_5, kills, deaths, assists, gold, last_hits, denies, gold_per_min, xp_per_min, gold_spent, hero_damage, tower_damage, hero_healing, level, stuns, pings, gold_reasons_0, gold_reasons_1, gold_reasons_6, gold_reasons_11, gold_reasons_12, gold_reasons_13, gold_reasons_14, gold_reasons_15, gold_reasons_2, gold_reasons_5, xp_reasons_0, xp_reasons_1, xp_reasons_2, xp_reasons_3, neutral_damage_taken, damage_to_minion, rune_3, rune_5, rune_2, rune_4, rune_6, rune_1, rune_0, kill_streak_3, kill_streak_4, kill_streak_5, kill_streak_6, kill_streak_7, kill_streak_8, kill_streak_9, kill_streak_10, multi_kill_2, multi_kill_3, multi_kill_4, multi_kill_5
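
For reference, standard scaling rescales each column to zero mean and unit variance, z = (x - mean) / std. The sketch below shows that arithmetic for a single made-up column using only the standard library. It uses the population standard deviation, which is what scikit-learn's `StandardScaler` uses; I am assuming dask_ml's scaler follows the same definition. The function name `standardize` and the sample values are illustrative only.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values) or 1.0  # avoid dividing by zero for a constant column
    return [(value - mean) / std for value in values]


gold_per_min = [250, 400, 550, 700]  # made-up values
z_scores = standardize(gold_per_min)
print([round(z, 3) for z in z_scores])
```

After scaling, columns measured in very different units (gold, damage, counts of runes) contribute on a comparable scale to the Euclidean distances used by the clustering models.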


### K-Means

Now, finally onto some clustering. First with K-means.

With only 56 features, the dimensionality is manageable, so I will skip straight to the clustering. I will use 5 clusters, since there are 5 defined roles in this type of game.

#### Without Dask

In [24]:
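# Re-import scikit-learn's KMeans explicitly: the notebook's top-level imports
# also bind dask_ml's KMeans to the same name
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5, n_jobs=-1)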

%timeit kmeans.fit(X_std)

2.69 s ± 226 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [53]:
y_pred = kmeans.predict(X_std)

In [54]:
%timeit print(metrics.silhouette_score(X_std, y_pred, metric='euclidean'))

0.08549047511180916
0.08549047511180916
0.08549047511180916
0.08549047511180916
0.08549047511180916
0.08549047511180916
0.08549047511180916
0.08549047511180916
1.26 s ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
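
To help read a score like this: the silhouette coefficient of a point is (b - a) / max(a, b), where a is its mean distance to the other members of its own cluster and b is the smallest mean distance to any other cluster. Values near 1 indicate well-separated clusters, values near 0 indicate overlapping clusters, and negative values suggest points assigned to the wrong cluster. The following is a small pure-Python sketch of that definition on made-up 2-D points. It is not scikit-learn's implementation, and the function name `silhouette` is my own.

```python
from math import dist
from statistics import fmean


def silhouette(points, labels):
    """Mean silhouette coefficient (Euclidean) for at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(dist(point, other) for other in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            fmean(dist(point, other) for other in members)
            for other_label, members in clusters.items()
            if other_label != label
        )
        scores.append((b - a) / max(a, b))
    return fmean(scores)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette(points, [0, 0, 1, 1]))  # tight, well-separated clusters
print(silhouette(points, [0, 1, 0, 1]))  # labels that cut across the groups
```

By this reading, a score of about 0.085 means the five K-Means clusters overlap heavily in the scaled feature space.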


#### With Dask

I tried running the cell below to use dask_ml's built-in KMeans and got an error message. I tried researching it, but could not find an answer within a reasonable amount of time. I assumed it would be no different from using joblib.parallel_backend("dask").

In [23]:
from dask_ml.cluster import KMeans as dask_KMeans

kmeans = dask_KMeans(5)
%timeit kmeans.fit(X_std)

9.26 s ± 2.83 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Very strange: at 9.26 s versus 2.69 s, this is more than 3x slower than sklearn's KMeans with n_jobs=-1, and the run-to-run spread is much larger. I'll see whether sklearn's KMeans with joblib.parallel_backend("dask") does any better.

In [22]:
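# scikit-learn's KMeans (not dask_ml's, which the top-level imports bind to the same name)
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=5)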
with joblib.parallel_backend("dask"):
    %timeit kmeans.fit(X_std)

674 ms ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Impressive. This completes ~4x faster than using n_jobs=-1 without Dask (674 ms vs. 2.69 s). At this data size, at least, I see no reason to use dask_ml's KMeans over joblib; its benefits presumably show up on datasets too large to fit in memory.

In [62]:
with joblib.parallel_backend("dask"):    
    %timeit print(metrics.silhouette_score(X_std, y_pred, metric='euclidean'))

0.08549047511180108
0.08549047511180108
0.08549047511180108
0.08549047511180108
0.08549047511180108
0.08549047511180108
0.08549047511180108
0.08549047511180108
2.17 s ± 59.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


So now I know that computing and printing metrics does not lend itself well to Dask (2.17 s vs. 1.26 s without it), and I will not do it again.
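
One way to avoid the repeated printed scores is to separate timing from reporting. The helper below is a hedged sketch using only the standard library. Unlike `%timeit`, which reports a mean and standard deviation, it returns the result of the last call together with the best wall-clock time. `time_call` is a hypothetical name, and the lambda is a stand-in workload rather than the silhouette computation itself.

```python
import time


def time_call(fn, repeats=7):
    """Run fn() `repeats` times; return (last result, best wall-clock seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn()
        best = min(best, time.perf_counter() - start)
    return result, best


# Stand-in workload; in the notebook this would be the metric computation
value, seconds = time_call(lambda: sum(i * i for i in range(10_000)))
print(f"result={value}, best of 7 runs: {seconds:.6f} s")
```

With this pattern the metric is printed once, and the returned value can be reused instead of being recomputed.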

### t-SNE

#### Without Dask

In [58]:
tsne = TSNE(n_components=2, perplexity=40, n_iter=300, verbose=0)
%timeit tsne_results = pd.DataFrame(tsne.fit_transform(X_std))

20.8 s ± 528 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### With Dask

In [64]:
with joblib.parallel_backend("dask"):
    %timeit tsne_results = pd.DataFrame(tsne.fit_transform(X_std))

18.6 s ± 697 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Not nearly as impressive an improvement as with KMeans (18.6 s vs. 20.8 s, roughly 11% less time), but still noticeable, and it may improve with more data and a larger worker pool.

### UMAP

#### Without Dask

In [65]:
from umap import UMAP  # this import is commented out at the top of the notebook

umap = UMAP(n_neighbors=15, min_dist=0.6)
%timeit umap_results = pd.DataFrame(umap.fit_transform(X_std))

15.1 s ± 117 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In terms of pure speed, K-Means wins, but I am not satisfied with its results: it assigned the far-off outlier group discovered by UMAP to the nearer, lower part of the main cluster, which is only just starting to break away.

#### With Dask

In [66]:
with joblib.parallel_backend("dask"):
    %timeit umap_results = pd.DataFrame(umap.fit_transform(X_std))

14.7 s ± 344 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Not worth doing with Dask (14.7 s vs. 15.1 s). I saw no activity on my dashboard.

### Agglomerative

#### Without Dask

In [67]:
agg_cluster = AgglomerativeClustering(n_clusters=5)

%timeit y_pred = agg_cluster.fit_predict(X_std)

3.56 s ± 77.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [95]:
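# %timeit does not keep the y_pred assigned inside it, so read the labels from the fitted model
print(metrics.silhouette_score(X_std, agg_cluster.labels_, metric='euclidean'))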

0.022524470671924212


#### With Dask

In [68]:
with joblib.parallel_backend("dask"):
    %timeit y_pred = agg_cluster.fit_predict(X_std)

3.34 s ± 93.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Again, Dask does not seem to provide a meaningful improvement (3.34 s vs. 3.56 s).

### HDBSCAN

#### Without Dask

In [70]:
from hdbscan import HDBSCAN  # this import is commented out at the top of the notebook

hdbscan = HDBSCAN()

%timeit y_pred = hdbscan.fit_predict(X_std)

4.4 s ± 93.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


After seeing the visualizations for UMAP, I really liked HDBSCAN.

#### With Dask

In [71]:
with joblib.parallel_backend("dask"):
    %timeit y_pred = hdbscan.fit_predict(X_std)

4.34 s ± 107 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


Another example of a model not being worth running through Dask (4.34 s vs. 4.4 s).

# Summary

I'll cut it short here. For the remainder of my Capstone I continued with UMAP fed into Agglomerative Clustering for visualization, and neither step benefited from Dask. However, Dask was extremely useful for reading in the data without the clumsy, irreproducible approach I employed at first.
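
As a compact recap, the speedups can be computed directly from the mean timings reported by `%timeit` in the cells above. The sketch below only does that arithmetic: the numbers are copied from the outputs, the dictionary layout and the `speedup` helper are my own, and a value above 1 means the Dask variant was faster.

```python
# (seconds without Dask, seconds with Dask), copied from the %timeit means above
timings = {
    "KMeans, sklearn vs. joblib Dask backend": (2.69, 0.674),
    "KMeans, sklearn vs. dask_ml": (2.69, 9.26),
    "silhouette_score": (1.26, 2.17),
    "t-SNE": (20.8, 18.6),
    "UMAP": (15.1, 14.7),
    "Agglomerative": (3.56, 3.34),
    "HDBSCAN": (4.4, 4.34),
}


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline."""
    return baseline / candidate


for name, (without_dask, with_dask) in timings.items():
    print(f"{name:<42} {speedup(without_dask, with_dask):.2f}x")
```

Only the KMeans fit through the joblib backend shows a large gain. The other ratios sit close to 1, and both dask_ml's KMeans and the silhouette score came out slower with Dask.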