# 1.1 The Question
Which specific in-game actions can a player add or improve in order to rank up in StarCraft 2 over a given season? (A season is defined as the time between rank resets.)

# 1.2 Imports
All of my imports are in one cell so that we need only check one place for the notebook's dependencies.

In [1]:
import pandas as pd
import os

# 1.3 Objectives
These fundamental questions need to be answered before moving on, to ensure we are working from a solid base of data.

* Do we have the data needed to tackle the desired question?
* Are there fundamental issues with the data?

# 1.4 Load the Data

In [2]:
starcraft_data = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/00272/SkillCraft1_Dataset.csv')
starcraft_data.head()

Unnamed: 0,GameID,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
0,52,5,27,10,3000,143.718,0.003515,0.00022,7,0.00011,0.000392,0.004849,32.6677,40.8673,4.7508,28,0.001397,6,0.0,0.0
1,55,5,23,10,5000,129.2322,0.003304,0.000259,4,0.000294,0.000432,0.004307,32.9194,42.3454,4.8434,22,0.001194,5,0.0,0.000208
2,56,4,30,10,200,69.9612,0.001101,0.000336,4,0.000294,0.000461,0.002926,44.6475,75.3548,4.043,22,0.000745,6,0.0,0.000189
3,57,3,19,20,400,107.6016,0.001034,0.000213,1,5.3e-05,0.000543,0.003783,29.2203,53.7352,4.9155,19,0.000426,7,0.0,0.000384
4,58,3,32,10,500,122.8908,0.001136,0.000327,2,0.0,0.001329,0.002368,22.6885,62.0813,9.374,15,0.001174,4,0.0,1.9e-05


In [3]:
starcraft_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3395 entries, 0 to 3394
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   GameID                3395 non-null   int64  
 1   LeagueIndex           3395 non-null   int64  
 2   Age                   3395 non-null   object 
 3   HoursPerWeek          3395 non-null   object 
 4   TotalHours            3395 non-null   object 
 5   APM                   3395 non-null   float64
 6   SelectByHotkeys       3395 non-null   float64
 7   AssignToHotkeys       3395 non-null   float64
 8   UniqueHotkeys         3395 non-null   int64  
 9   MinimapAttacks        3395 non-null   float64
 10  MinimapRightClicks    3395 non-null   float64
 11  NumberOfPACs          3395 non-null   float64
 12  GapBetweenPACs        3395 non-null   float64
 13  ActionLatency         3395 non-null   float64
 14  ActionsInPAC          3395 non-null   float64
 15  TotalMapExplored     

# Data Definition
* Do our column names match up well to what they store?
* Are the data types stored in our columns data types that make sense?
* Do we have any obvious missing values?
* Do summary statistics of our columns offer any insight into our data? Do they prompt further investigation?

Our column names do seem to match up with the data stored. I will add more complete column definitions below, as copied from the UCI machine learning repository:
1. GameID: Unique ID number for each game (integer)
2. LeagueIndex: Bronze, Silver, Gold, Platinum, Diamond, Master, GrandMaster, and Professional leagues coded 1-8 (Ordinal)
3. Age: Age of each player (integer)
4. HoursPerWeek: Reported hours spent playing per week (integer)
5. TotalHours: Reported total hours spent playing (integer)
6. APM: Actions per minute (continuous)
7. SelectByHotkeys: Number of unit or building selections made using hotkeys per timestamp (continuous)
8. AssignToHotkeys: Number of units or buildings assigned to hotkeys per timestamp (continuous)
9. UniqueHotkeys: Number of unique hotkeys used per timestamp (continuous)
10. MinimapAttacks: Number of attack actions on minimap per timestamp (continuous)
11. MinimapRightClicks: Number of right-clicks on minimap per timestamp (continuous)
12. NumberOfPACs: Number of PACs per timestamp (continuous)
13. GapBetweenPACs: Mean duration in milliseconds between PACs (continuous)
14. ActionLatency: Mean latency from the onset of a PAC to its first action, in milliseconds (continuous)
15. ActionsInPAC: Mean number of actions within each PAC (continuous)
16. TotalMapExplored: The number of 24x24 game coordinate grids viewed by the player per timestamp (continuous)
17. WorkersMade: Number of SCVs, drones, and probes trained per timestamp (continuous)
18. UniqueUnitsMade: Unique units made per timestamp (continuous)
19. ComplexUnitsMade: Number of ghosts, infestors, and high templars trained per timestamp (continuous)
20. ComplexAbilitiesUsed: Abilities requiring specific targeting instructions used per timestamp (continuous)

Of note, age, hours per week, and total hours are defined by the original data set as integers, while our info output above shows them as string objects. Also of note, the league index should be a category rather than an integer, since values below 1 or above 8 are meaningless in this coding.
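
As a side note on the league index, one option is an ordered categorical, which keeps the Bronze-to-Professional ordering while rejecting codes outside 1-8. This is a minimal sketch on invented values, not the conversion used in this notebook; the names `league`, `league_dtype`, `league_cat`, `above_gold`, and `invalid` are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the LeagueIndex column (values invented for illustration).
league = pd.Series([5, 5, 4, 3, 8, 1], name="LeagueIndex")

# Ordered categorical restricted to the eight league codes 1-8.
league_dtype = pd.CategoricalDtype(categories=list(range(1, 9)), ordered=True)
league_cat = league.astype(league_dtype)

# Because the categories are ordered, rank comparisons remain meaningful.
above_gold = league_cat > 3

# A code outside 1-8 has no matching category and becomes NaN on conversion.
invalid = pd.Series([0, 9]).astype(league_dtype)
```

Whether ordering matters depends on the later analysis; a plain unordered category is enough if the leagues are only used as group labels.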

An attempt to cast the age column to an int raised an error, which revealed that missing values in the age, hours per week, and total hours columns are represented by the string '?'. We filter for them below.
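
A failed cast is one way to discover a placeholder like '?'. Another is to coerce the column to numbers and list whatever did not convert. The sketch below uses an invented `ages` series rather than the real column; `as_numbers` and `bad_tokens` are hypothetical names.

```python
import pandas as pd

# Invented string values mimicking an object-typed numeric column.
ages = pd.Series(["27", "23", "?", "19", "?"], name="Age")

# errors="coerce" turns anything that is not a number into NaN instead of raising.
as_numbers = pd.to_numeric(ages, errors="coerce")

# The original strings behind those NaNs are the tokens blocking the cast.
bad_tokens = ages[as_numbers.isna()].unique().tolist()
```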

In [4]:
starcraft_data[starcraft_data['TotalHours'] == '?']

Unnamed: 0,GameID,LeagueIndex,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
358,1064,5,17,20,?,94.4724,0.003846,0.000783,3,1e-05,0.000135,0.004474,50.5455,54.9287,3.0972,31,0.000763,7,0.000106,0.000116
1841,5255,5,18,?,?,122.247,0.006357,0.000433,3,1.4e-05,0.000257,0.003043,30.8929,62.2933,5.3822,23,0.001055,5,0.0,0.000338
3340,10001,8,?,?,?,189.7404,0.004582,0.000655,4,7.3e-05,0.000618,0.006291,23.513,32.5665,4.4451,25,0.002218,6,0.0,0.0
3341,10005,8,?,?,?,287.8128,0.02904,0.001041,9,0.000231,0.000656,0.005399,31.6416,36.1143,4.5893,34,0.001138,6,5.8e-05,0.0
3342,10006,8,?,?,?,294.0996,0.02964,0.001076,6,0.000302,0.002374,0.006294,16.6393,36.8192,4.185,26,0.000987,6,0.0,0.0
3343,10015,8,?,?,?,274.2552,0.018121,0.001264,8,5.3e-05,0.000975,0.007111,10.6419,24.3556,4.387,28,0.001106,6,0.0,0.0
3344,10016,8,?,?,?,274.3404,0.023131,0.000739,8,0.000622,0.003552,0.005355,19.1568,36.3098,5.2811,28,0.000739,6,0.0,0.0
3345,10017,8,?,?,?,245.8188,0.010471,0.000841,10,0.000657,0.001314,0.005031,14.5518,36.7134,7.1943,33,0.001474,11,4e-05,4.8e-05
3346,10018,8,?,?,?,211.0722,0.013049,0.00094,10,0.000366,0.000909,0.003719,19.6169,38.9326,7.132,23,0.000898,9,0.0,0.0
3347,10021,8,?,?,?,189.5778,0.007559,0.000487,10,0.000606,0.000566,0.005821,22.0317,36.733,4.905,28,0.00054,5,0.0,0.0


We have 57 rows where at least one of total hours, hours per week, or age is missing (the row count falls from 3395 to 3338 once they are dropped). Every row missing age or hours per week is also missing total hours, so filtering on total hours alone catches all of them. We are most interested in values that can be changed, so were it just age missing we could simply drop the age column. Since the affected rows make up less than 2% of our dataset, we are going to drop them. That said, it is worth noting that the vast majority of the rows missing these values belong to players in the highest rank. At a later date, should we find that total hours or weekly hours have little effect on league index, we may add these rows back in order to have more information from higher-ranked players.
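
The counts discussed here (missing values per column, affected rows per league, and their share of the data) can be computed directly once '?' is parsed as missing. This is a sketch on five invented rows, assuming the `na_values` option of `pd.read_csv` is used at load time, which this notebook does not do; every variable name is hypothetical.

```python
import io
import pandas as pd

# Invented rows in the same shape as the raw file; '?' marks a missing value.
raw = io.StringIO(
    "GameID,LeagueIndex,Age,HoursPerWeek,TotalHours\n"
    "1,5,17,20,?\n"
    "2,5,18,?,?\n"
    "3,8,?,?,?\n"
    "4,8,?,?,?\n"
    "5,3,19,20,400\n"
)

# na_values tells read_csv to parse '?' as NaN, so the columns stay numeric.
sample = pd.read_csv(raw, na_values="?")

cols = ["Age", "HoursPerWeek", "TotalHours"]
missing_per_column = sample[cols].isna().sum()
rows_with_missing = sample[cols].isna().any(axis=1)

# Affected rows per league, and the share of the data they represent.
missing_by_league = (
    sample.loc[rows_with_missing, "LeagueIndex"].value_counts().sort_index()
)
share_missing = rows_with_missing.mean()
```

On the real data, the same per-league count would show how heavily the missing values concentrate in the top league.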

In [5]:
#drop rows preventing age, hours per week, and total hours from being ints
starcraft_data.drop(starcraft_data[starcraft_data['TotalHours'] == '?'].index, inplace = True)

In [6]:
#convert age, hours per week, and total hours to ints
#convert league index to category
starcraft_data["Age"] = starcraft_data['Age'].astype(int)
starcraft_data["HoursPerWeek"] = starcraft_data['HoursPerWeek'].astype(int)
starcraft_data["TotalHours"] = starcraft_data['TotalHours'].astype(int)
starcraft_data['LeagueIndex'] = starcraft_data['LeagueIndex'].astype('category')
starcraft_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3338 entries, 0 to 3339
Data columns (total 20 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   GameID                3338 non-null   int64   
 1   LeagueIndex           3338 non-null   category
 2   Age                   3338 non-null   int64   
 3   HoursPerWeek          3338 non-null   int64   
 4   TotalHours            3338 non-null   int64   
 5   APM                   3338 non-null   float64 
 6   SelectByHotkeys       3338 non-null   float64 
 7   AssignToHotkeys       3338 non-null   float64 
 8   UniqueHotkeys         3338 non-null   int64   
 9   MinimapAttacks        3338 non-null   float64 
 10  MinimapRightClicks    3338 non-null   float64 
 11  NumberOfPACs          3338 non-null   float64 
 12  GapBetweenPACs        3338 non-null   float64 
 13  ActionLatency         3338 non-null   float64 
 14  ActionsInPAC          3338 non-null   float64 
 15  Tota

In [7]:
starcraft_data.describe()

Unnamed: 0,GameID,Age,HoursPerWeek,TotalHours,APM,SelectByHotkeys,AssignToHotkeys,UniqueHotkeys,MinimapAttacks,MinimapRightClicks,NumberOfPACs,GapBetweenPACs,ActionLatency,ActionsInPAC,TotalMapExplored,WorkersMade,UniqueUnitsMade,ComplexUnitsMade,ComplexAbilitiesUsed
count,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0,3338.0
mean,4719.552127,21.650389,15.909527,960.421809,114.575763,0.004023,0.000364,4.316357,9.4e-05,0.00038,0.003433,40.713819,64.209584,5.266955,22.116836,0.001031,6.541043,6e-05,0.000142
std,2656.91963,4.206357,11.964495,17318.133922,48.111912,0.004726,0.00021,2.333322,0.000159,0.000359,0.000966,17.057191,19.037394,1.500605,7.440875,0.00052,1.859049,0.000112,0.000266
min,52.0,16.0,0.0,3.0,22.0596,0.0,0.0,0.0,0.0,0.0,0.000679,6.6667,24.6326,2.0389,5.0,7.7e-05,2.0,0.0,0.0
25%,2423.25,19.0,8.0,300.0,79.2315,0.001245,0.000202,3.0,0.0,0.000139,0.002743,29.3266,50.886425,4.261525,17.0,0.000682,5.0,0.0,0.0
50%,4788.0,21.0,12.0,500.0,107.0703,0.002445,0.000349,4.0,3.9e-05,0.000278,0.003376,37.0589,61.2961,5.08705,22.0,0.000904,6.0,0.0,2e-05
75%,6994.75,24.0,20.0,800.0,140.1561,0.004945,0.000493,6.0,0.000113,0.000508,0.004003,48.510425,74.032525,6.02735,27.0,0.001258,8.0,8.7e-05,0.000182
max,9271.0,44.0,168.0,1000000.0,389.8314,0.043088,0.001648,10.0,0.003019,0.003688,0.007971,237.1429,176.3721,18.5581,58.0,0.005149,13.0,0.000902,0.003084
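
A summary table like the one above can be complemented with simple range checks on the self-reported columns. The sketch below flags rows whose hours exceed a plausible bound. The sample values are invented, and the `MAX_PLAUSIBLE_TOTAL_HOURS` cap is a hypothetical threshold chosen only for illustration; only the 168-hour week is a hard limit.

```python
import pandas as pd

# Invented values for illustration; not rows from the dataset.
sample = pd.DataFrame({
    "HoursPerWeek": [10, 20, 168, 200],
    "TotalHours": [3000, 1_000_000, 500, 400],
})

HOURS_IN_WEEK = 7 * 24  # 168, a hard physical limit on weekly play time
# Hypothetical cap: playing every hour for 10 years (an assumption, not a rule).
MAX_PLAUSIBLE_TOTAL_HOURS = 10 * 365 * 24

# Keep rows that break either bound so they can be inspected by hand.
suspect = sample[
    (sample["HoursPerWeek"] > HOURS_IN_WEEK)
    | (sample["TotalHours"] > MAX_PLAUSIBLE_TOTAL_HOURS)
]
```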


Our summary statistics don't suggest glaring errors in most columns. Minimap attacks are 0 at both the minimum and the first quartile, but since even the maximum is a small number, this more likely means that players rarely issue minimap attacks than that data is missing. Two maxima do stand out and may deserve a closer look later: TotalHours reaches 1,000,000 (more than a century of continuous play) and HoursPerWeek reaches 168 (every hour of the week). We do one last check to make sure the GameID is actually unique and we aren't dealing with any duplicate entries.

In [8]:
starcraft_data['GameID'].is_unique

True
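
Had the check returned False, the next step would be to look at the offending rows. A minimal sketch on an invented frame, using `duplicated` to pull out every row that shares a GameID; `sample`, `ids_unique`, and `repeated` are hypothetical names.

```python
import pandas as pd

# Invented rows; GameID 55 appears twice on purpose.
sample = pd.DataFrame({
    "GameID": [52, 55, 56, 55],
    "APM": [143.7, 129.2, 70.0, 129.2],
})

ids_unique = sample["GameID"].is_unique

# keep=False marks every member of a duplicated group, not just the repeats.
repeated = sample[sample["GameID"].duplicated(keep=False)]
```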

# Save the Data
Now that we have our data cleaned, we save it for use in future notebooks and analysis.

In [9]:
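data_path = '../data/interim'
# os.path.join inserts the path separator that plain string concatenation would omit
starcraft_data.to_csv(os.path.join(data_path, 'Starcraft_cleaned.csv'))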
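
The save step can be sanity-checked with a round trip. The sketch below writes a small invented frame and reads it back. It uses a temporary directory as a stand-in for the project's data folder, and the `index=False` argument and the `os.makedirs` call are assumptions that go beyond what this notebook does.

```python
import os
import tempfile
import pandas as pd

sample = pd.DataFrame({"GameID": [52, 55], "LeagueIndex": [5, 5]})

# A temporary directory stands in for the project's data folder.
with tempfile.TemporaryDirectory() as root:
    data_path = os.path.join(root, "data", "interim")
    os.makedirs(data_path, exist_ok=True)  # create the folder if it is absent

    out_file = os.path.join(data_path, "Starcraft_cleaned.csv")
    sample.to_csv(out_file, index=False)  # index=False skips the row-index column

    reloaded = pd.read_csv(out_file)
    saved_in_folder = os.path.dirname(out_file) == data_path
```

CSV files do not store dtypes, so a categorical column such as LeagueIndex is read back as plain integers and needs converting again in the next notebook.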