# Steam Data Cleaning

In this section, we clean the data set generated from the Steam Store API. By the end of the section, we aim to have dropped the unnecessary columns and added derived columns that are more useful for analysis.

## Import Libraries and Inspect Data

In [156]:
# standard library imports
from ast import literal_eval
import itertools
import time
import re
import calendar

# third-party imports
import numpy as np
import pandas as pd

In [3]:
# read in downloaded data
raw_steam_data = pd.read_csv('../data/gathering/steam_data.csv')

# print out number of rows and columns
print('Rows:', raw_steam_data.shape[0])
print('Columns:', raw_steam_data.shape[1])

# view first five rows
raw_steam_data.head()

Rows: 59159
Columns: 39




Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
0,game,Counter-Strike,10,0.0,False,,,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 134178},,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,game,Team Fortress Classic,20,0.0,False,,,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 5251},,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,game,Day of Defeat,30,0.0,False,,,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 3564},,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
3,game,Deathmatch Classic,40,0.0,False,,,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 1811},,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
4,game,Half-Life: Opposing Force,50,0.0,False,,,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 14552},,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


Here is a quick inspection of the missing values in the data set.

In [4]:
null_counts = raw_steam_data.isnull().sum()
null_counts

type                          68
name                           3
steam_appid                    0
required_age                  68
is_free                       68
controller_support         46048
dlc                        49181
detailed_description          99
about_the_game               100
short_description             94
fullgame                   59159
supported_languages           88
header_image                  68
website                    27639
pc_requirements               68
mac_requirements              68
linux_requirements            68
legal_notice               40908
drm_notice                 58875
ext_user_account_notice    58203
developers                    82
publishers                    68
demos                      53837
price_overview              8118
packages                    7540
package_groups                68
platforms                     68
metacritic                 55338
reviews                    50041
categories                   222
genres    

## Initial Processing

Here, we remove the unnecessary columns and rows using the following criteria:

#### 1. Columns with more than 50% missing values:

Columns with high null counts carry little information. We compute the threshold (half the number of rows) and list the columns whose null count exceeds it.

In [5]:
threshold = raw_steam_data.shape[0] // 2

print('Drop columns with more than {} missing rows'.format(threshold))
print()

drop_cols = raw_steam_data.columns[null_counts > threshold]

print('Columns to drop: {}'.format(list(drop_cols)))

Drop columns with more than 29579 missing rows

Columns to drop: ['controller_support', 'dlc', 'fullgame', 'legal_notice', 'drm_notice', 'ext_user_account_notice', 'demos', 'metacritic', 'reviews', 'recommendations']
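
The same rule can be written with `dropna`, whose `thresh` argument counts non-null values rather than nulls. The sketch below uses a made-up three-column frame (the names `toy`, `kept` and `kept_dropna` are illustrative only) to show that both forms keep the same columns:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'a' has no gaps, 'b' is 50% missing, 'c' is 75% missing
toy = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, np.nan, 3.0, np.nan],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# share of missing values per column
missing_share = toy.isnull().mean()

# keep columns where at most half of the rows are missing
kept = toy.loc[:, missing_share <= 0.5]

# dropna expresses the same rule as a minimum number of non-null values
kept_dropna = toy.dropna(thresh=len(toy) // 2, axis=1)

print(list(kept.columns))
print(list(kept_dropna.columns))
```

A column that is exactly 50% missing is kept by both versions, which matches the "more than 50%" wording of the criterion.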


#### 2. Rows with no information or "none" in the name column:

By looking at the `type` column, we can see how many rows contain no information: rows with a missing `type` are empty apart from the name and app ID.


In [6]:
print('Rows to remove:', raw_steam_data[raw_steam_data['type'].isnull()].shape[0])

# preview rows with missing type data
raw_steam_data[raw_steam_data['type'].isnull()].head(3)

Rows to remove: 68


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
288,,Dragon Nest,11610,,,,,,,,...,,,,,,,,,,
293,,Max Payne,12140,,,,,,,,...,,,,,,,,,,
294,,Max Payne 2: The Fall of Max Payne,12150,,,,,,,,...,,,,,,,,,,


By looking at the `name` column, we can see which games have no name or the placeholder name "none". We exclude these as well.

In [7]:
raw_steam_data[(raw_steam_data['name'].isnull()) | (raw_steam_data['name'] == 'none')]

Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
4583,game,none,339860,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 3, 'highlighted': [{'name': 'Master ...","{'coming_soon': False, 'date': '27 Feb, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6382,game,none,385020,0.0,False,,,- discontinued - (please remove),- discontinued - (please remove),- discontinued - (please remove),...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...",,,,,"{'coming_soon': False, 'date': '4 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
6910,game,none,398970,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '25', 'description': 'Adventure'}, {'i...",,,,"{'total': 35, 'highlighted': [{'name': ""They'v...","{'coming_soon': False, 'date': '5 Nov, 2015'}","{'url': '', 'email': ''}",,"{'ids': [], 'notes': None}"
32184,game,,1116910,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256762655, 'name': '宣传片', 'thumbnail':...",,,"{'coming_soon': False, 'date': '25 Sep, 2019'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [1, 3, 5, 4], 'notes': 'The content co..."
34052,game,,1172120,0.0,False,,,,,,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}, {'id': ...",,,,,"{'coming_soon': False, 'date': '23 Jan, 2020'}","{'url': '', 'email': ''}",,"{'ids': [2, 5], 'notes': 'This game includes c..."
35529,game,,1216770,0.0,False,,,&quot;Our Journeys ~ A Collection of Visual No...,&quot;Our Journeys ~ A Collection of Visual No...,&quot;Our Journeys ~ A Collection of Visual No...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '4', 'description': 'Casual'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 256771159, 'name': 'Lorange Domain Tra...",,"{'total': 21, 'highlighted': [{'name': 'Before...","{'coming_soon': False, 'date': '4 Feb, 2020'}","{'url': 'http://miaqc.ca/en', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


#### 3. Duplicated rows

For rows that appear more than once, we keep the first occurrence and remove the rest.

In [14]:
duplicate_rows = raw_steam_data[raw_steam_data.duplicated()]

print('Duplicate rows to remove:', duplicate_rows.shape[0])

duplicate_rows

Duplicate rows to remove: 5


Unnamed: 0,type,name,steam_appid,required_age,is_free,controller_support,dlc,detailed_description,about_the_game,short_description,...,categories,genres,screenshots,movies,recommendations,achievements,release_date,support_info,background,content_descriptors
30,game,SiN Episodes: Emergence,1300,0.0,False,,,"You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...","You are John Blade, commander of HardCorps, an...",...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 599},,"{'coming_soon': False, 'date': '10 May, 2006'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
305,game,Jagged Alliance 2 Gold,1620,0.0,False,,,<p>The small country of Arulco has been taken ...,<p>The small country of Arulco has been taken ...,The small country of Arulco has been taken ove...,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '2', 'description': 'Strategy'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,,"{'coming_soon': False, 'date': '6 Jul, 2006'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
455,game,F.E.A.R.,21090,18.0,False,,,Be the hero in your own cinematic epic of acti...,Be the hero in your own cinematic epic of acti...,Experience the original F.E.A.R. along with F....,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 8800},,"{'coming_soon': False, 'date': '21 May, 2010'}",{'url': 'https://wbgamessupport.wbgames.com/hc...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
456,game,F.E.A.R.,21090,18.0,False,,,Be the hero in your own cinematic epic of acti...,Be the hero in your own cinematic epic of acti...,Experience the original F.E.A.R. along with F....,...,"[{'id': 2, 'description': 'Single-player'}]","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,{'total': 8800},,"{'coming_soon': False, 'date': '21 May, 2010'}",{'url': 'https://wbgamessupport.wbgames.com/hc...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
707,game,Batman: Arkham Asylum Game of the Year Edition,35140,0.0,False,,,Critically acclaimed Batman: Arkham Asylum ret...,Critically acclaimed Batman: Arkham Asylum ret...,Experience what it’s like to be Batman and fac...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}, {'id': ...","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...","[{'id': 5642, 'name': 'Batman Game of the Year...",{'total': 34624},"{'total': 47, 'highlighted': [{'name': 'Shocki...","{'coming_soon': False, 'date': '26 Mar, 2010'}",{'url': 'https://community.wbgames.com/t5/Arkh...,https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
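
Note that `duplicated()` without arguments only flags rows that are identical in every column. Since the goal is one row per app, it may be worth checking `steam_appid` on its own as well. A minimal sketch on made-up data (the frame `toy` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact copies, rows 2 and 3 share an appid only
toy = pd.DataFrame({
    'steam_appid': [10, 10, 20, 20],
    'name': ['A', 'A', 'B', 'B (updated)'],
})

# rows identical in every column
exact_dupes = toy.duplicated().sum()

# rows repeating an appid, whatever the other columns contain
appid_dupes = toy.duplicated(subset='steam_appid').sum()

# after removing exact copies, appids may still repeat
deduped = toy.drop_duplicates()
appids_unique = deduped['steam_appid'].is_unique

print(exact_dupes, appid_dupes, appids_unique)
```

If `is_unique` comes back `False` on the real data, `drop_duplicates(subset='steam_appid')` would be one way to enforce uniqueness, at the cost of choosing which copy to keep.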


### Processing

Now we define a general-purpose processing function that removes all the unnecessary information identified above, and then run it on the raw data.

In [59]:
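def drop_null_cols(df, thresh=0.5):
    """
    Drop columns whose share of non-null values is below `thresh`
    (by default, columns with more than 50% missing values).
    """
    # dropna's thresh is the minimum number of non-null values a column needs to be kept
    cutoff_count = int(len(df) * thresh)
    
    return df.dropna(thresh=cutoff_count, axis=1)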


def process_name_type(df):
    """
    Remove rows with a missing type (no information) or with a missing or "none" name,
    then drop the type column.
    """
    df = df[df['type'].notnull()]
    
    df = df[df['name'].notnull()]
    df = df[df['name'] != 'none']
    
    df = df.drop('type', axis=1)
    
    return df
    

def general_process(df):
    """
    Run the initial clean-up: drop duplicate rows, drop sparse columns,
    then remove rows without type or name information.
    """
    
    # Copy the input dataframe to avoid accidentally modifying original data
    df = df.copy()
    
    # Remove duplicate rows - all appids should be unique
    df = df.drop_duplicates()
    
    # Remove columns with more than 50% null values
    df = drop_null_cols(df)
        
    # Process rest of columns
    df = process_name_type(df)
    
    return df

print(raw_steam_data.shape)
initial_processing = general_process(raw_steam_data)
print(initial_processing.shape)
initial_processing.head()

(59159, 39)
(59080, 28)


Unnamed: 0,name,steam_appid,required_age,is_free,detailed_description,about_the_game,short_description,supported_languages,header_image,website,...,platforms,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors
0,Counter-Strike,10,0.0,False,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://cdn.akamai.steamstatic.com/steam/apps/...,,...,"{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
1,Team Fortress Classic,20,0.0,False,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://cdn.akamai.steamstatic.com/steam/apps/...,,...,"{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio..."
2,Day of Defeat,30,0.0,False,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.dayofdefeat.com/,...,"{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
3,Deathmatch Classic,40,0.0,False,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://cdn.akamai.steamstatic.com/steam/apps/...,,...,"{'windows': True, 'mac': True, 'linux': True}","[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"
4,Half-Life: Opposing Force,50,0.0,False,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://cdn.akamai.steamstatic.com/steam/apps/...,,...,"{'windows': True, 'mac': True, 'linux': True}","[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}"


## Processing on `required_age` Column


In [60]:
initial_processing['required_age'].value_counts(dropna=False).sort_values()

21          1
6.0         1
4.0         1
5.0         1
18+         1
20.0        1
14          1
17          1
1.0         1
7           2
3           2
10.0        3
14.0        3
10          3
15          3
11.0        4
3.0        13
12         14
7.0        16
13.0       25
15.0       25
16         44
17.0       54
18         66
12.0       86
16.0      225
18.0      366
0       16240
0.0     41877
Name: required_age, dtype: int64

The value counts show that the column mixes types: the same age appears in two forms (e.g. `18` and `18.0`), and one entry, "18+", is a string that cannot be converted to a number directly. We first replace "18+" with "18", then convert all values to float and inspect the column again.

In [61]:
# change "18+" to "18"
initial_processing.loc[initial_processing['required_age'] == '18+', 'required_age'] = '18'
# convert all values to float
initial_processing['required_age'] = initial_processing['required_age'].astype(float)

initial_processing['required_age'].value_counts(dropna=False).sort_index()

0.0     58117
1.0         1
3.0        15
4.0         1
5.0         1
6.0         1
7.0        18
10.0        6
11.0        4
12.0      100
13.0       25
14.0        4
15.0       28
16.0      269
17.0       55
18.0      433
20.0        1
21.0        1
Name: required_age, dtype: int64
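
As an alternative to patching the single "18+" value by hand, the suffix can be stripped from every entry before conversion. This is only a sketch on a hypothetical sample (`ages` is made up to mimic the mixed values seen above), not the approach used in this notebook:

```python
import pandas as pd

# Hypothetical sample mixing the kinds of values seen in the value counts
ages = pd.Series([0, '0', 18.0, '18', '18+', '16'], dtype=object)

# turn everything into text, strip a trailing '+', then parse as numbers
cleaned = pd.to_numeric(ages.astype(str).str.rstrip('+'))

print(cleaned.tolist())
```

This would also handle values such as "16+" should they appear in a later download.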

## Processing on `platforms` Column

In [62]:
initial_processing['platforms'].head()

0    {'windows': True, 'mac': True, 'linux': True}
1    {'windows': True, 'mac': True, 'linux': True}
2    {'windows': True, 'mac': True, 'linux': True}
3    {'windows': True, 'mac': True, 'linux': True}
4    {'windows': True, 'mac': True, 'linux': True}
Name: platforms, dtype: object

The values in the `platforms` column are dictionaries (stored as strings) that flag each platform as supported or not. We convert each one into a single string listing the supported platforms, separated by `;`.

For example: "windows;mac;linux".

In [65]:
def parse_platforms(x):
    """
    Convert the dictionaries in platform column to strings.
    """
        
    d = literal_eval(x)
    return ';'.join(platform for platform in d.keys() if d[platform])

platforms_df = initial_processing.copy()
platforms_df['platforms'] = platforms_df['platforms'].apply(parse_platforms)
platforms_df['platforms'].value_counts()

windows              43971
windows;mac;linux     7136
windows;mac           6124
windows;linux         1839
mac                      6
linux                    3
mac;linux                1
Name: platforms, dtype: int64
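
To make the conversion concrete, here is a small worked example on three hand-written inputs. The helper `join_true_keys` is a hypothetical stand-in that mirrors `parse_platforms`:

```python
from ast import literal_eval


def join_true_keys(raw):
    """Hypothetical helper mirroring parse_platforms: keep keys whose value is True."""
    flags = literal_eval(raw)
    return ';'.join(name for name, supported in flags.items() if supported)


# hand-written inputs in the same shape as the platforms column
examples = [
    "{'windows': True, 'mac': True, 'linux': True}",
    "{'windows': True, 'mac': False, 'linux': False}",
    "{'windows': False, 'mac': True, 'linux': True}",
]
results = [join_true_keys(raw) for raw in examples]

for raw, result in zip(examples, results):
    print(raw, '->', result)
```

Because dictionaries preserve insertion order, the platforms always appear in the order windows, mac, linux, which is why the value counts above contain "windows;mac" but never "mac;windows".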

## Processing on `Price` Column
Here we aim to:

1. Combine the `is_free` and `price_overview` columns;
2. Expand the dictionaries into separate columns;
3. Convert the price from the smallest currency unit (cents) to the main unit.

In [77]:
platforms_df[['name', 'is_free', 'price_overview']].head()

Unnamed: 0,name,is_free,price_overview
0,Counter-Strike,False,"{'currency': 'EUR', 'initial': 819, 'final': 8..."
1,Team Fortress Classic,False,"{'currency': 'EUR', 'initial': 399, 'final': 3..."
2,Day of Defeat,False,"{'currency': 'EUR', 'initial': 399, 'final': 3..."
3,Deathmatch Classic,False,"{'currency': 'EUR', 'initial': 399, 'final': 3..."
4,Half-Life: Opposing Force,False,"{'currency': 'EUR', 'initial': 399, 'final': 3..."


In [78]:
def parse_price(x):
    """
    Parse the price dictionary stored as a string.
    Missing values get a placeholder price of -1 (to separate them from free games).
    """
    if pd.notnull(x):
        return literal_eval(x)
    else:
        return {'currency': 'EUR', 'initial': -1}


price_df = platforms_df.copy()
price_df['price_overview'] = price_df['price_overview'].apply(parse_price)
    
# create columns from currency and initial values
price_df['currency'] = price_df['price_overview'].apply(lambda x: x['currency'])

price_df['price'] = price_df['price_overview'].apply(lambda x: x['initial'])
    
# set price of free games to 0
price_df.loc[price_df['is_free'], 'price'] = 0
    
# convert prices from cents to the main currency unit
price_df.loc[price_df['price'] > 0, 'price'] /= 100
    
# remove columns no longer needed
price_df = price_df.drop(['is_free', 'price_overview'], axis=1)

price_df[['name', 'price', 'currency']].head()

Unnamed: 0,name,price,currency
0,Counter-Strike,8.19,EUR
1,Team Fortress Classic,3.99,EUR
2,Day of Defeat,3.99,EUR
3,Deathmatch Classic,3.99,EUR
4,Half-Life: Opposing Force,3.99,EUR
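
A note on detecting missing values in helpers such as `parse_price`: comparing against `np.nan` by identity only works when the missing value is the `np.nan` object itself. That appears to hold for this data, but a value-based check such as `pd.isnull` is more robust. A short illustration:

```python
import numpy as np
import pandas as pd

# a NaN created independently is a different object from np.nan
missing = float('nan')

identity_check = missing is np.nan   # compares object identity
isnull_check = pd.isnull(missing)    # checks the value itself

print(identity_check, isnull_check)
```

The identity check misses this NaN while `pd.isnull` catches it, so the value-based form is the safer default when the origin of the data is uncertain.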


**However,** the data set contains several currencies. To compare prices fairly, we convert all of them to `EUR` using the exchange rates as of **13 April 2023**.

* 1 USD = 0.91 EUR
* 1 SGD = 0.68 EUR
* 1 BRL = 0.18 EUR
* 1 AUD = 0.61 EUR
* 1 JPY = 0.0068 EUR

In [82]:
price_df[price_df['currency'] != 'EUR']['currency'].value_counts()

USD    77
SGD     7
BRL     4
AUD     2
JPY     1
Name: currency, dtype: int64

In [95]:
currency_df = price_df.copy()
currency_df.loc[currency_df['currency'] == 'USD', 'price'] *= 0.91
currency_df.loc[currency_df['currency'] == 'SGD', 'price'] *= 0.68
currency_df.loc[currency_df['currency'] == 'BRL', 'price'] *= 0.18
currency_df.loc[currency_df['currency'] == 'AUD', 'price'] *= 0.61
currency_df.loc[currency_df['currency'] == 'JPY', 'price'] *= 0.0068

# now remove the currency column, which is no longer needed
currency_df = currency_df.drop(['currency'], axis=1)

currency_df[['name', 'price']].head()

Unnamed: 0,name,price
0,Counter-Strike,8.19
1,Team Fortress Classic,3.99
2,Day of Defeat,3.99
3,Deathmatch Classic,3.99
4,Half-Life: Opposing Force,3.99
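
The conversion can also be written as a single lookup with `Series.map`. The sketch below reuses the rates listed earlier but runs on a hypothetical toy frame (`toy` and the `price_eur` column are illustrative names):

```python
import pandas as pd

# rates copied from the list above (EUR per unit of currency); EUR maps to 1
rates_to_eur = {
    'EUR': 1.0, 'USD': 0.91, 'SGD': 0.68,
    'BRL': 0.18, 'AUD': 0.61, 'JPY': 0.0068,
}

# hypothetical toy prices, including a free game (0) and a missing price (-1)
toy = pd.DataFrame({
    'price': [10.0, 10.0, 1000.0, 0.0, -1.0],
    'currency': ['EUR', 'USD', 'JPY', 'USD', 'EUR'],
})

# look up each row's rate and multiply
toy['price_eur'] = toy['price'] * toy['currency'].map(rates_to_eur)

print(toy)
```

Two properties are worth noting. The `-1` placeholder is tagged `EUR` by `parse_price`, so a rate of 1 leaves it untouched. A currency missing from the dictionary produces `NaN` instead of silently passing through unconverted, which makes gaps in the rate table easy to spot.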


## Processing on `Packages` Column

The `packages` and `package_groups` columns hold little useful information on their own. Instead, we use them to check whether the missing `price` values could be filled in.

Previously, we set the missing values in the `price_overview` column to -1. Now we check whether the `package_groups` column has any information for those rows. First, let's take a quick look at them.

In [98]:
print('Number of rows with missing price:', currency_df[currency_df['price'] == -1].shape[0], '\n')

missing_price_and_package = currency_df[(currency_df['price'] == -1) & (currency_df['package_groups'] == "[]")]
print('Number of rows with both missing price and package:', missing_price_and_package.shape[0], '\n')

Number of rows with missing price: 2688 

Number of rows with both missing price and package: 2638 



Most of the games with missing price data (2638 of 2688) have no package information either. With so little to recover, we drop all 2688 games whose price is still set to -1, including the 50 that do have package data, since they lack the price information needed for further analysis.

In [99]:
currency_df = currency_df[currency_df['price'] != -1]
currency_df = currency_df.drop(['packages', 'package_groups'], axis=1)

currency_df.head()

Unnamed: 0,name,steam_appid,required_age,detailed_description,about_the_game,short_description,supported_languages,header_image,website,pc_requirements,...,categories,genres,screenshots,movies,achievements,release_date,support_info,background,content_descriptors,price
0,Counter-Strike,10,0.0,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,Play the world's number 1 online action game. ...,"English<strong>*</strong>, French<strong>*</st...",https://cdn.akamai.steamstatic.com/steam/apps/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Nov, 2000'}","{'url': 'http://steamcommunity.com/app/10', 'e...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",8.19
1,Team Fortress Classic,20,0.0,One of the most popular online action games of...,One of the most popular online action games of...,One of the most popular online action games of...,"English, French, German, Italian, Spanish - Sp...",https://cdn.akamai.steamstatic.com/steam/apps/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Apr, 1999'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [2, 5], 'notes': 'Includes intense vio...",3.99
2,Day of Defeat,30,0.0,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,Enlist in an intense brand of Axis vs. Allied ...,"English, French, German, Italian, Spanish - Spain",https://cdn.akamai.steamstatic.com/steam/apps/...,http://www.dayofdefeat.com/,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 May, 2003'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",3.99
3,Deathmatch Classic,40,0.0,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,Enjoy fast-paced multiplayer gaming with Death...,"English, French, German, Italian, Spanish - Sp...",https://cdn.akamai.steamstatic.com/steam/apps/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,...,"[{'id': 1, 'description': 'Multi-player'}, {'i...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Jun, 2001'}","{'url': '', 'email': ''}",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",3.99
4,Half-Life: Opposing Force,50,0.0,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,Return to the Black Mesa Research Facility as ...,"English, French, German, Korean",https://cdn.akamai.steamstatic.com/steam/apps/...,,{'minimum': '\r\n\t\t\t<p><strong>Minimum:</st...,...,"[{'id': 2, 'description': 'Single-player'}, {'...","[{'id': '1', 'description': 'Action'}]","[{'id': 0, 'path_thumbnail': 'https://cdn.akam...",,,"{'coming_soon': False, 'date': '1 Nov, 1999'}","{'url': 'https://help.steampowered.com', 'emai...",https://cdn.akamai.steamstatic.com/steam/apps/...,"{'ids': [], 'notes': None}",3.99


## Processing on the `supported_languages` Column

Here, we create a column that marks whether a game supports English: 1 if it does, 0 otherwise.

In [101]:
currency_df['supported_languages'].value_counts().head(10)

English                                                                                                        15762
English<strong>*</strong><br><strong>*</strong>languages with full audio support                               13782
English, Russian                                                                                                1300
English, Simplified Chinese                                                                                      737
Simplified Chinese                                                                                               639
English, Japanese                                                                                                610
Simplified Chinese<strong>*</strong><br><strong>*</strong>languages with full audio support                      459
English<strong>*</strong>, Russian<strong>*</strong><br><strong>*</strong>languages with full audio support      355
English, Portuguese - Brazil                                    

In [102]:
language_df = currency_df.copy()

# drop rows with missing language data
language_df = language_df.dropna(subset=['supported_languages'])

language_df['english'] = language_df['supported_languages'].apply(lambda x: 1 if 'english' in x.lower() else 0)
language_df = language_df.drop('supported_languages', axis=1)

language_df[['name', 'english']].head()

Unnamed: 0,name,english
0,Counter-Strike,1
1,Team Fortress Classic,1
2,Day of Defeat,1
3,Deathmatch Classic,1
4,Half-Life: Opposing Force,1


In [103]:
language_df['english'].value_counts()

1    54394
0     1988
Name: english, dtype: int64

## Processing on the `Developers` and `Publishers` Columns

Here, we unpack each list into a single string. If the list holds only one value, we get a string with just that value. If it holds multiple values, we get one string with the values separated by `;`.

In [104]:
# remove rows with missing data
dev_pub_df = language_df[(language_df['developers'].notnull()) & (language_df['publishers'] != "['']")].copy()
dev_pub_df = dev_pub_df[~(dev_pub_df['developers'].str.contains(';')) & ~(dev_pub_df['publishers'].str.contains(';'))]
dev_pub_df = dev_pub_df[(dev_pub_df['publishers'] != "['NA']") & (dev_pub_df['publishers'] != "['N/A']")]

# create the new columns
dev_pub_df['developer'] = dev_pub_df['developers'].apply(lambda x: ';'.join(literal_eval(x)))
dev_pub_df['publisher'] = dev_pub_df['publishers'].apply(lambda x: ';'.join(literal_eval(x)))

dev_pub_df = dev_pub_df.drop(['developers', 'publishers'], axis=1)
dev_pub_df[['name', 'steam_appid', 'developer', 'publisher']].head()

Unnamed: 0,name,steam_appid,developer,publisher
0,Counter-Strike,10,Valve,Valve
1,Team Fortress Classic,20,Valve,Valve
2,Day of Defeat,30,Valve,Valve
3,Deathmatch Classic,40,Valve,Valve
4,Half-Life: Opposing Force,50,Gearbox Software,Valve


## Processing on the `Categories` and `Genres` Columns

Here, we likewise turn each column from a list into a single string, keeping only the description of each entry and separating the values with `;`.

In [105]:
cat_gen_df = dev_pub_df.copy()
cat_gen_df = cat_gen_df[(cat_gen_df['categories'].notnull()) & (cat_gen_df['genres'].notnull())]

for col in ['categories', 'genres']:
    cat_gen_df[col] = cat_gen_df[col].apply(lambda x: ';'.join(item['description'] for item in literal_eval(x)))

cat_gen_df[['steam_appid', 'categories', 'genres']].head()

Unnamed: 0,steam_appid,categories,genres
0,10,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action
1,20,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action
2,30,Multi-player;Valve Anti-Cheat enabled,Action
3,40,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action
4,50,Single-player;Multi-player;Valve Anti-Cheat en...,Action
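
For reference, this is what the transformation does to a single cell, and how the list can be recovered later. The input string is a hand-written sample in the same shape as the raw data, and its `id` values are placeholders:

```python
from ast import literal_eval

# hand-written sample; the ids are placeholders
raw = "[{'id': 1, 'description': 'Multi-player'}, {'id': 36, 'description': 'Online PvP'}]"

# keep only the descriptions and join them with ';'
joined = ';'.join(item['description'] for item in literal_eval(raw))

# the original descriptions can be recovered by splitting on ';'
recovered = joined.split(';')

print(joined)
print(recovered)
```

The round trip relies on no description containing a `;` itself.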


## Processing on the `Achievements` Column

In [106]:
def parse_achievements(x):
    """
    Return the total number of achievements, assuming 0 when the value is missing.
    """
    if pd.isnull(x):
        return 0
    else:
        # extract the number stored under 'total'
        return literal_eval(x)['total']

achiev_df = cat_gen_df.copy()
achiev_df = achiev_df.drop('content_descriptors', axis=1)

achiev_df['achievements'] = achiev_df['achievements'].apply(parse_achievements)

achiev_df['achievements'].value_counts().head()

0     24267
10     1499
12     1230
20     1122
15     1044
Name: achievements, dtype: int64
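
The sketch below exercises the same logic on a few hand-written values, using `pd.isna` for the missing-value check. The helper name `count_achievements` and the sample strings are assumptions made for illustration; they are not taken from the dataset.

```python
from ast import literal_eval

import numpy as np
import pandas as pd


def count_achievements(x):
    # hypothetical helper mirroring parse_achievements
    if pd.isna(x):
        # a missing value is treated as zero achievements
        return 0
    # the raw cell is a dict stored as a string; read its 'total' key
    return literal_eval(x)['total']


# illustrative inputs: a populated cell, a missing cell, an explicit zero
sample = pd.Series(["{'total': 12, 'highlighted': []}", np.nan, "{'total': 0}"])
counts = sample.apply(count_achievements)
print(counts.tolist())

# an identity check against np.nan misses NaN objects created elsewhere
other_nan = float('nan')
print(other_nan is np.nan, pd.isna(other_nan))
```

Note that missing values and an explicit total of 0 both end up as 0, so the two cases cannot be told apart afterwards.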

## Processing on the `release_date` Column

Here, we normalise the various date formats into a single one, then remove rows whose date is missing or cannot be parsed.

In [167]:
date_df = achiev_df.copy()

def eval_date(x):
    x = literal_eval(x)
    if x['coming_soon']:
        # skip upcoming games since they don't have playing data yet
        return ''
    return x['date']

date_df['release_date'] = date_df['release_date'].apply(eval_date)
    
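def parse_date(x):
    """
    Normalise a raw release-date string to "%d %b %Y" format.

    Returns np.nan for an empty string. Unrecognised formats return None,
    which pd.to_datetime(..., errors='coerce') later converts to NaT.
    """
    # e.g. "1 Nov, 2000"
    if re.search(r'\d{1,2} [A-Za-z]{3}, \d{4}', x):
        return x.replace(',', '')

    # e.g. "Nov 2000": no day given, so default to the 1st
    elif re.search(r'[A-Za-z]{3} \d{4}', x):
        return '1 ' + x

    # e.g. "Nov 01, 2000"
    elif re.search(r'[A-Za-z]{3} \d{2}, \d{4}', x):
        day = x[4:6]
        month = x[:3]
        year = x[-4:]
        return day + ' ' + month + ' ' + year

    # e.g. "Nov 1, 2000"
    elif re.search(r'[A-Za-z]{3} \d{1}, \d{4}', x):
        day = x[4]
        month = x[:3]
        year = x[-4:]
        return day + ' ' + month + ' ' + year

    # e.g. "01. Nov. 2000" (dots escaped so they match literal periods only)
    elif re.search(r'\d{2}\. [A-Za-z]{3}\. \d{4}', x):
        return x.replace('.', '')

    # e.g. "2000 年 11 月 1 日" (year, month, day separated by spaces)
    elif re.search(r'\d{4} 年 \d{1,2} 月 \d{1,2} 日', x):
        year, month, day = re.findall(r'\d+', x)
        month = calendar.month_name[int(month)][:3]
        return day + ' ' + month + ' ' + year

    # e.g. "2000年11月1日" (same format without spaces)
    elif re.search(r'\d{4}年\d{1,2}月\d{1,2}日', x):
        year, month, day = re.findall(r'\d+', x)
        month = calendar.month_name[int(month)][:3]
        return day + ' ' + month + ' ' + year

    # upcoming games were mapped to '' by eval_date
    elif x == '':
        return np.nan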

            
date_df['release_date'] = date_df['release_date'].apply(parse_date)
date_df['release_date'] = pd.to_datetime(date_df['release_date'], format='%d %b %Y', errors='coerce')
    
date_df = date_df[date_df['release_date'].notnull()]
date_df[['name', 'release_date']].head()

Unnamed: 0,name,release_date
0,Counter-Strike,2000-11-01
1,Team Fortress Classic,1999-04-01
2,Day of Defeat,2003-05-01
3,Deathmatch Classic,2001-06-01
4,Half-Life: Opposing Force,1999-11-01
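
As a rough illustration of the last two steps (normalising a date string, then parsing with `errors='coerce'`), here is a small self-contained sketch. The helper `chinese_to_dmy` and the sample strings are hypothetical, and the month abbreviation assumes the default English locale.

```python
import calendar
import re

import pandas as pd


def chinese_to_dmy(text):
    # hypothetical helper: "2019年3月12日" -> "12 Mar 2019"
    year, month, day = re.findall(r'\d+', text)
    return f"{day} {calendar.month_abbr[int(month)]} {year}"


# illustrative inputs: an already-normalised date, a converted date,
# and a string that does not fit the target format at all
candidates = pd.Series([
    '1 Nov 2000',
    chinese_to_dmy('2019年3月12日'),
    'Coming soon',
])

# values that do not match '%d %b %Y' become NaT instead of raising
parsed = pd.to_datetime(candidates, format='%d %b %Y', errors='coerce')
print(parsed)

# rows with NaT can then be filtered out with notnull()
kept = parsed[parsed.notnull()]
print(len(kept))
```

Because unparseable strings are coerced silently, it can be worth counting the resulting `NaT` values before dropping them to see how many rows are lost.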


## Drop All the Other Columns

Since we're analysing how each factor contributes to a game's success, we don't need the remaining information in the raw dataset, such as the media columns and descriptions. Therefore, we will drop these columns at this stage.

In [170]:
date_df.columns

Index(['name', 'steam_appid', 'required_age', 'detailed_description',
       'about_the_game', 'short_description', 'header_image', 'website',
       'pc_requirements', 'mac_requirements', 'linux_requirements',
       'platforms', 'categories', 'genres', 'screenshots', 'movies',
       'achievements', 'release_date', 'support_info', 'background', 'price',
       'english', 'developer', 'publisher'],
      dtype='object')

In [171]:
steam_data_cleaned = date_df[['name', 'steam_appid', 'required_age', 'platforms',
                              'categories', 'genres', 'achievements', 'release_date',
                              'price', 'english', 'developer', 'publisher']]

steam_data_cleaned.head()

Unnamed: 0,name,steam_appid,required_age,platforms,categories,genres,achievements,release_date,price,english,developer,publisher
0,Counter-Strike,10,0.0,windows;mac;linux,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action,0,2000-11-01,8.19,1,Valve,Valve
1,Team Fortress Classic,20,0.0,windows;mac;linux,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action,0,1999-04-01,3.99,1,Valve,Valve
2,Day of Defeat,30,0.0,windows;mac;linux,Multi-player;Valve Anti-Cheat enabled,Action,0,2003-05-01,3.99,1,Valve,Valve
3,Deathmatch Classic,40,0.0,windows;mac;linux,Multi-player;PvP;Online PvP;Shared/Split Scree...,Action,0,2001-06-01,3.99,1,Valve,Valve
4,Half-Life: Opposing Force,50,0.0,windows;mac;linux,Single-player;Multi-player;Valve Anti-Cheat en...,Action,0,1999-11-01,3.99,1,Gearbox Software,Valve


### Export the dataset

Finally, we can export the cleaned dataset to a CSV file.

In [173]:
steam_data_cleaned.to_csv('../data/cleaned/steam_data_cleaned.csv', index=False)
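
When the file is read back later, the `release_date` column comes back as plain strings unless it is parsed explicitly. The sketch below shows one way to handle this using a tiny stand-in frame and a temporary directory; the frame contents and the path are assumptions for illustration, not the real exported file.

```python
import os
import tempfile

import pandas as pd

# a tiny stand-in for the cleaned dataset
df = pd.DataFrame({
    'name': ['Counter-Strike'],
    'steam_appid': [10],
    'release_date': pd.to_datetime(['2000-11-01']),
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'steam_data_cleaned.csv')

    # index=False avoids writing the row index as an extra column
    df.to_csv(path, index=False)

    # parse_dates restores the datetime dtype on load
    reloaded = pd.read_csv(path, parse_dates=['release_date'])

print(reloaded.dtypes)
```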