# CS:GO Professional Player Cluster Analysis

The most recent Counter-Strike: Global Offensive (CS:GO) major tournament offered a first-ever \\$1,000,000 first-place prize for the winning team, with an overall prize pool of \\$3,000,000 for all participants. With the stakes in professional play higher than ever before, I figured it would be pertinent to develop a tool that coaches and team managers can use when putting together their next million-dollar lineup.

This project aims to use machine learning models to cluster players based on their historical performance, allowing the end user to identify which players perform most similarly to one another.

The dataset used to train and test the model can be found [here](https://www.kaggle.com/mateusdmachado/csgo-professional-matches). It covers professional matches from 11/2015 to 03/2020.

## What is CS:GO?

CS:GO, or Counter-Strike: Global Offensive, is the latest entry in the Counter-Strike franchise from Valve. Launched in 2012, it is a 5-versus-5 tactical FPS (First-Person Shooter) in which two teams face each other. The Terrorist (T) side is tasked with planting a bomb and having it detonate, while the Counter-Terrorist (CT) side must defuse it or prevent it from being planted. Either team can also win a round by eliminating all players on the opposing team.

A standard game lasts up to 30 rounds, and the first team to win 16 rounds takes the map. If the score is tied at 15-15, the game goes into overtime. In professional play overtime is typically a set of six rounds, and the first team to win four of them takes the map.

Counter-Strike has an economic system that governs how players acquire armor, weapons and grenades. Winning a round awards the players \\$3250, while losing a round after a winning streak gives them \\$1400. Each additional consecutive loss increases the losing bonus by \\$500, so as not to penalize the losing team too much. Players can also earn money by getting kills and by planting or defusing the bomb.
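
The round payouts described above can be sketched as a small function. This is only an illustration of the arithmetic: the function names are hypothetical, kill and objective rewards are ignored, and the cap on the losing bonus (`MAX_LOSS_BONUS`) is an assumption that the paragraph above does not state.

```python
# Values taken from the paragraph above, except MAX_LOSS_BONUS.
WIN_REWARD = 3250
BASE_LOSS_BONUS = 1400
LOSS_BONUS_STEP = 500
MAX_LOSS_BONUS = 3400  # assumed cap, not stated in the text


def loss_bonus(consecutive_losses: int) -> int:
    """Return the round-loss payout for a losing streak of the given length."""
    if consecutive_losses < 1:
        raise ValueError("consecutive_losses must be at least 1")
    bonus = BASE_LOSS_BONUS + LOSS_BONUS_STEP * (consecutive_losses - 1)
    return min(bonus, MAX_LOSS_BONUS)


def round_payout(won: bool, consecutive_losses: int = 0) -> int:
    """Team-level payout for one round, ignoring kill and objective rewards."""
    return WIN_REWARD if won else loss_bonus(consecutive_losses)


# Payouts for losing streaks of length 1 through 6
streak_payouts = [loss_bonus(n) for n in range(1, 7)]
print(streak_payouts)  # [1400, 1900, 2400, 2900, 3400, 3400]
```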

Much like in traditional sports, a number of statistics are tied to each player that allow us to evaluate individual performance. For example:

- Kills
- Assists
- Deaths
- Headshot Percentage
- Flash Assists
- ADR (Average Damage per Round)
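
Two of these statistics are ratios rather than raw counts. As a rough sketch of how they are derived, the hypothetical helpers below compute headshot percentage and ADR from made-up totals; the function names and input values are illustrative and do not come from the dataset.

```python
def headshot_percentage(headshot_kills: int, kills: int) -> float:
    """Share of kills that were headshots, as a percentage."""
    return 100.0 * headshot_kills / kills if kills else 0.0


def average_damage_per_round(total_damage: float, rounds_played: int) -> float:
    """ADR: total damage dealt divided by the number of rounds played."""
    if rounds_played <= 0:
        raise ValueError("rounds_played must be positive")
    return total_damage / rounds_played


# Hypothetical single-map stat line
hs_pct = headshot_percentage(headshot_kills=5, kills=11)
adr = average_damage_per_round(total_damage=1400, rounds_played=23)
print(f"HS%: {hs_pct:.1f}, ADR: {adr:.1f}")
```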

There are informal roles that each of the 5 players on a team takes on over the course of a game. I say informal because, due to the nature of the game, these roles are flexible and a player may be forced to wear multiple hats during a single round. With that said, the roles are as follows:

- **Entry Fragger**: The first player to get into a position, whether it be a bombsite or another objective. These players initiate the push and look to open up a path by securing an entry kill or otherwise creating an opportunity for the team. The success of this role is measured by how often they get the first kill and give their team an advantage. Oftentimes these players will also be the first to die.
- **Support**: This player pushes alongside the entry fragger and provides cover through the use of utility (grenades). Additionally, these players will trade a kill if the entry fragger falls to an enemy, or cover additional lines of sight that the entry fragger cannot hold on their own.
- **AWPer**: This role gets its name from the AWP, a high-powered long-range rifle which can be purchased by both teams for \\$4750. This player holds long angles to secure crucial kills as well as to deny space to the enemy team.
- **Lurker**: A lurker is a player that typically plays separately from their team, looking to catch an enemy player out of place, in isolation, or transitioning from one bombsite to another. Their primary focus is to catch the other team off guard and prevent them from retaking a bombsite with ease.
- **IGL (In-Game Leader)**: As the name suggests, this is the player calling the shots. They determine how the team uses its economy, which strategies it employs, and just about any other macro-level decision made over the course of a round.

The objective of this project is to see whether a player's statistics can tell us which of these roles they fall into, as well as to determine how similar a given player is to another.
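
To make "how similar a given player is to another" concrete, here is a minimal sketch that standardizes a few statistics and finds each player's nearest neighbour by Euclidean distance. It is an illustration only and not necessarily the method used later in this project; the player names, the choice of statistics, and all values are hypothetical.

```python
import numpy as np

# Hypothetical per-player averages: kills, assists, deaths, ADR
player_names = ["player_a", "player_b", "player_c", "player_d"]
stats = np.array([
    [20.0, 3.0, 15.0, 85.0],
    [19.0, 4.0, 16.0, 82.0],
    [12.0, 7.0, 17.0, 60.0],
    [13.0, 8.0, 18.0, 63.0],
])

# Standardize each column so no single statistic dominates the distance
z_scores = (stats - stats.mean(axis=0)) / stats.std(axis=0)


def most_similar(player_index: int) -> str:
    """Name of the player with the smallest Euclidean distance in z-score space."""
    distances = np.linalg.norm(z_scores - z_scores[player_index], axis=1)
    distances[player_index] = np.inf  # exclude the player themselves
    return player_names[int(np.argmin(distances))]


print(most_similar(0))
```

Standardizing first matters because the raw columns are on very different scales (ADR values are several times larger than assist counts).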

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

### Data Exploration

In [2]:
# Reading in the relevant tables
players_df = pd.read_csv('data/players.csv')
results_df = pd.read_csv('data/results.csv')

These two tables provide all the information we need about the players and the matches. The remaining tables in the dataset contain map picks and bans as well as economy information. That data goes beyond the scope of this project, so we won't be using it.

In [3]:
results_df.head()

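| | date | team_1 | team_2 | _map | result_1 | result_2 | map_winner | starting_ct | ct_1 | t_2 | t_1 | ct_2 | event_id | match_id | rank_1 | rank_2 | map_wins_1 | map_wins_2 | match_winner |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-03-18 | Recon 5 | TeamOne | Dust2 | 0 | 16 | 2 | 2 | 0 | 1 | 0 | 15 | 5151 | 2340454 | 62 | 63 | 0 | 2 | 2 |
| 1 | 2020-03-18 | Recon 5 | TeamOne | Inferno | 13 | 16 | 2 | 2 | 8 | 6 | 5 | 10 | 5151 | 2340454 | 62 | 63 | 0 | 2 | 2 |
| 2 | 2020-03-18 | New England Whalers | Station7 | Inferno | 12 | 16 | 2 | 1 | 9 | 6 | 3 | 10 | 5243 | 2340461 | 140 | 118 | 12 | 16 | 2 |
| 3 | 2020-03-18 | Rugratz | Bad News Bears | Inferno | 7 | 16 | 2 | 2 | 0 | 8 | 7 | 8 | 5151 | 2340453 | 61 | 38 | 0 | 2 | 2 |
| 4 | 2020-03-18 | Rugratz | Bad News Bears | Vertigo | 8 | 16 | 2 | 2 | 4 | 5 | 4 | 11 | 5151 | 2340453 | 61 | 38 | 0 | 2 | 2 |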


In [4]:
players_df.head()

| | date | player_name | team | opponent | country | player_id | match_id | event_id | event_name | best_of | ... | m3_kddiff_ct | m3_adr_ct | m3_kast_ct | m3_rating_ct | m3_kills_t | m3_deaths_t | m3_kddiff_t | m3_adr_t | m3_kast_t | m3_rating_t |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-26 | Brehze | Evil Geniuses | Liquid | United States | 9136 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | -1.0 | 72.5 | 80.0 | 0.93 | 7.0 | 9.0 | -2.0 | 70.4 | 63.6 | 0.89 |
| 1 | 2020-02-26 | CeRq | Evil Geniuses | Liquid | Bulgaria | 11219 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 3.0 | 79.5 | 53.3 | 1.12 | 4.0 | 8.0 | -4.0 | 40.7 | 54.5 | 0.53 |
| 2 | 2020-02-26 | EliGE | Liquid | Evil Geniuses | United States | 8738 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 1.0 | 81.5 | 63.6 | 1.03 | 9.0 | 9.0 | 0.0 | 87.9 | 73.3 | 1.05 |
| 3 | 2020-02-26 | Ethan | Evil Geniuses | Liquid | United States | 10671 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | 0.0 | 67.2 | 66.7 | 0.97 | 1.0 | 9.0 | -8.0 | 14.8 | 45.5 | 0.31 |
| 4 | 2020-02-26 | NAF | Liquid | Evil Geniuses | Canada | 8520 | 2339385 | 4901 | IEM Katowice 2020 | 3 | ... | -1.0 | 72.9 | 81.8 | 0.96 | 8.0 | 7.0 | 1.0 | 56.3 | 80.0 | 0.99 |


### Preparing Data for Clustering

The code below, which extracts per-match player statistics from our original data tables, is adapted from a notebook by Kaggle user evoluu. The source can be found [here](https://www.kaggle.com/evoluu/csgo-player-statistics-vs-rounds-won/notebook).

The code was modified to include some additional labels.

In [6]:
map_frames = []

# Join all the individual maps played into one table
for i in [1, 2, 3]:
    player_columns = ['date', 'match_id', 'event_name', 'player_id', 'player_name', 'team', f'map_{i}',
                      f'm{i}_kills', f'm{i}_assists', f'm{i}_deaths', f'm{i}_hs', f'm{i}_flash_assists',
                      f'm{i}_kast', f'm{i}_kddiff', f'm{i}_adr', f'm{i}_fkdiff', f'm{i}_rating']
    # Copy the selection so renaming its columns does not act on a view of players_df
    temp_df = players_df[player_columns].copy()
    # Rename the columns to exclude map index
    temp_df.columns = ['date', 'match_id', 'event_name', 'player_id', 'player_name', 'team', '_map',
                      'kills', 'assists', 'deaths', 'hs', 'flash_assists',
                      'kast', 'kddiff', 'adr', 'fkdiff', 'rating']
    # Drop rows with missing values, e.g. maps that were not played in the series
    map_frames.append(temp_df.dropna())

# Stack the per-map frames into a single table
playerstats_df = pd.concat(map_frames)
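
The reshaping step above turns one wide row per match into one row per map. A toy version may make this easier to follow. The frame below is a hypothetical stand-in for `players_df` with only two maps and a single statistic, and the sketch uses `pd.concat` rather than `DataFrame.append`, which was removed in pandas 2.0.

```python
import pandas as pd

# Toy stand-in for players_df (hypothetical values)
wide_df = pd.DataFrame({
    "player_name": ["player_a", "player_b"],
    "map_1": ["Dust2", "Dust2"],
    "m1_kills": [20.0, 15.0],
    "map_2": ["Inferno", None],  # no second map recorded for this row
    "m2_kills": [18.0, None],
})

frames = []
for i in [1, 2]:
    # Select this map's columns and strip the map index from their names
    map_df = wide_df[["player_name", f"map_{i}", f"m{i}_kills"]].rename(
        columns={f"map_{i}": "_map", f"m{i}_kills": "kills"}
    )
    # Rows with missing values (maps that were never played) are dropped
    frames.append(map_df.dropna())

# One row per player per map actually played
long_df = pd.concat(frames, ignore_index=True)
print(long_df)
```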

In [8]:
# Add the rounds won/lost data from results_df to our dataframe
teams_df_list = []
for i in range(2):
    own, opp = i + 1, 2 - i  # column suffix for this team and for its opponent
    team_df = results_df[['match_id', '_map', f'team_{own}', f'result_{own}', f'result_{opp}']]
    # rename returns a new frame, so we never modify a slice of results_df in place
    team_df = team_df.rename(columns={f'team_{own}': 'team', f'result_{own}': 'rounds_won', f'result_{opp}': 'rounds_lost'})
    teams_df_list.append(team_df)

teams_df = pd.concat(teams_df_list)
playerstats_df = playerstats_df.merge(teams_df, on=['match_id', '_map', 'team'])
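
Because `merge` defaults to an inner join, player rows with no matching entry in `teams_df` are dropped silently. The sketch below shows one way to check for that on toy data, using a left join with `indicator=True`. The frames, team names, and values are hypothetical.

```python
import pandas as pd

# Hypothetical one-map result and three player rows (one has no result)
results = pd.DataFrame({
    "match_id": [1], "_map": ["Dust2"],
    "team_1": ["Alpha"], "team_2": ["Bravo"],
    "result_1": [16], "result_2": [7],
})
players = pd.DataFrame({
    "match_id": [1, 1, 2], "_map": ["Dust2", "Dust2", "Nuke"],
    "team": ["Alpha", "Bravo", "Alpha"], "kills": [22.0, 11.0, 17.0],
})

# Build one row per team per map, as in the cell above
team_frames = []
for own, opp in [(1, 2), (2, 1)]:
    team_frames.append(
        results[["match_id", "_map", f"team_{own}", f"result_{own}", f"result_{opp}"]].rename(
            columns={f"team_{own}": "team", f"result_{own}": "rounds_won", f"result_{opp}": "rounds_lost"}
        )
    )
teams = pd.concat(team_frames, ignore_index=True)

# how="left" keeps every player row; indicator=True adds a _merge column
checked = players.merge(teams, on=["match_id", "_map", "team"], how="left", indicator=True)
unmatched = checked[checked["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(players)} player rows have no matching result")
```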

In [9]:
playerstats_df.head()

| | date | match_id | event_name | player_id | player_name | team | _map | kills | assists | deaths | hs | flash_assists | kast | kddiff | adr | fkdiff | rating | rounds_won | rounds_lost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2020-02-26 | 2339385 | IEM Katowice 2020 | 9136 | Brehze | Evil Geniuses | Overpass | 11.0 | 3.0 | 18.0 | 5.0 | 0.0 | 65.2 | -7.0 | 60.8 | -1.0 | 0.7 | 7 | 16 |
| 1 | 2020-02-26 | 2339385 | IEM Katowice 2020 | 11219 | CeRq | Evil Geniuses | Overpass | 11.0 | 2.0 | 17.0 | 4.0 | 2.0 | 60.9 | -6.0 | 68.9 | -1.0 | 0.75 | 7 | 16 |
| 2 | 2020-02-26 | 2339385 | IEM Katowice 2020 | 10671 | Ethan | Evil Geniuses | Overpass | 11.0 | 1.0 | 15.0 | 6.0 | 1.0 | 65.2 | -4.0 | 60.7 | -2.0 | 0.73 | 7 | 16 |
| 3 | 2020-02-26 | 2339385 | IEM Katowice 2020 | 8507 | stanislaw | Evil Geniuses | Overpass | 10.0 | 1.0 | 17.0 | 6.0 | 0.0 | 43.5 | -7.0 | 64.5 | -4.0 | 0.65 | 7 | 16 |
| 4 | 2020-02-26 | 2339385 | IEM Katowice 2020 | 8523 | tarik | Evil Geniuses | Overpass | 14.0 | 7.0 | 15.0 | 6.0 | 3.0 | 69.6 | -1.0 | 63.4 | -1.0 | 0.95 | 7 | 16 |


In [10]:
# Saving to CSV for future usage
playerstats_df.to_csv("data/player_stats.csv")
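
One thing to keep in mind when reloading this file: by default `to_csv` writes the DataFrame index as an unnamed first column, which `read_csv` then loads as `Unnamed: 0`. The sketch below shows two ways around that on a small hypothetical frame, writing to a temporary directory rather than to `data/`.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame standing in for playerstats_df
df = pd.DataFrame({"player_name": ["player_a", "player_b"], "kills": [20.0, 15.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "player_stats.csv")

    # Default to_csv writes the index as an unnamed first column
    df.to_csv(path)
    with_index_column = pd.read_csv(path)
    restored_index = pd.read_csv(path, index_col=0)  # treat that column as the index

    # index=False leaves the index out of the file entirely
    df.to_csv(path, index=False)
    without_index = pd.read_csv(path)

print(list(with_index_column.columns))  # ['Unnamed: 0', 'player_name', 'kills']
print(list(restored_index.columns))     # ['player_name', 'kills']
print(list(without_index.columns))      # ['player_name', 'kills']
```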