![logo](./images/OPTIMISE.%20Logo%20(green).png)

This is Notebook 1 of 4 in this project.

# Optimise.
BUSINESS INTELLIGENCE SOLUTIONS

Optimise. uses data analysis to give businesses a clear view of their current operations, and provides them with actionable advice, grounded in meticulous analysis, that produces tangible results.   

The analysis focuses on these main areas:     
- Product Analysis
    - Performance
    - Classification
    - Pricing
- Customer Analysis
    - Customer Profile
    - Customer Trends
    - Customer Lifetime Value
- Sales Analysis
    - Date/Time Overview
    - Discount Efficiency
    - Projections
    
The deliverables to be expected are a comprehensive report with useful visualizations, combined with specific recommendations based on the results obtained from the analysis.

## Steam Business Analysis
In this project, the analysis is carried out on Steam.    

Steam is a video game digital distribution service by Valve. The Steam platform is the largest digital distribution platform for PC gaming, holding around 75% of the market space in 2013. By 2017, users purchasing games through Steam totaled roughly US$4.3 billion, representing at least 18% of global PC game sales. By 2019, the service had over 34,000 games with over 95 million monthly active users. 

The data for the analysis comes from two sources:
1. Steam Store Games - https://www.kaggle.com/nikdavis/steam-store-games
2. Steam API - https://steamcommunity.com/dev

# First Exploration
### Import Data

In [1]:
import pandas as pd
import numpy as np

In [2]:
s = pd.read_csv("./data/steam.csv")

In [3]:
s.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,3318,633,277,62,5000000-10000000,3.99
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Valve Anti-Cheat enabled,Action,FPS;World War II;Multiplayer,0,3416,398,187,34,5000000-10000000,3.99
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,1273,267,258,184,5000000-10000000,3.99
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,windows;mac;linux,0,Single-player;Multi-player;Valve Anti-Cheat en...,Action,FPS;Action;Sci-fi,0,5250,288,624,415,5000000-10000000,3.99


In [4]:
s.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 18 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   appid             27075 non-null  int64  
 1   name              27075 non-null  object 
 2   release_date      27075 non-null  object 
 3   english           27075 non-null  int64  
 4   developer         27075 non-null  object 
 5   publisher         27075 non-null  object 
 6   platforms         27075 non-null  object 
 7   required_age      27075 non-null  int64  
 8   categories        27075 non-null  object 
 9   genres            27075 non-null  object 
 10  steamspy_tags     27075 non-null  object 
 11  achievements      27075 non-null  int64  
 12  positive_ratings  27075 non-null  int64  
 13  negative_ratings  27075 non-null  int64  
 14  average_playtime  27075 non-null  int64  
 15  median_playtime   27075 non-null  int64  
 16  owners            27075 non-null  object

### Uniques

In [5]:
print(len(s["developer"].unique()))
s["developer"].unique()

17113


array(['Valve', 'Gearbox Software', 'Valve;Hidden Path Entertainment',
       ..., 'SHEN JIAWEI', 'Semyon Maximov', 'Adept Studios GD'],
      dtype=object)

In [6]:
print(len(s["genres"].unique()))
s["genres"].unique()

1552


array(['Action', 'Action;Free to Play', 'Action;Free to Play;Strategy',
       ...,
       'Action;Adventure;Indie;Massively Multiplayer;RPG;Strategy;Early Access',
       'Action;Adventure;Casual;Free to Play;Indie;RPG;Simulation;Sports;Strategy',
       'Casual;Free to Play;Massively Multiplayer;RPG;Early Access'],
      dtype=object)

**Observations**: I need to unpack the `genres` column. I also observe a disparity among genre types: some relate to the actual genre of the game (e.g. action) and some to the play style (e.g. multiplayer).

In [7]:
print(len(s["categories"].unique()))
s["categories"].unique()

3333


array(['Multi-player;Online Multi-Player;Local Multi-Player;Valve Anti-Cheat enabled',
       'Multi-player;Valve Anti-Cheat enabled',
       'Single-player;Multi-player;Valve Anti-Cheat enabled', ...,
       'Online Multi-Player;Steam Achievements;Full controller support;In-App Purchases;Steam Cloud',
       'Multi-player;Local Multi-Player;Co-op;Local Co-op;Shared/Split Screen',
       'Multi-player;Online Multi-Player;Cross-Platform Multiplayer;Stats'],
      dtype=object)

In [8]:
print(len(s["steamspy_tags"].unique()))
s["steamspy_tags"].unique()

6423


array(['Action;FPS;Multiplayer', 'FPS;World War II;Multiplayer',
       'FPS;Action;Sci-fi', ..., 'Casual;Adventure;Arcade',
       'Free to Play;Visual Novel',
       'Early Access;Adventure;Sexual Content'], dtype=object)

In [9]:
print(len(s["publisher"].unique()))
s["publisher"].unique()

14354


array(['Valve', 'Mark Healey', 'Tripwire Interactive', ..., 'MonteCube',
       'Velvet Paradise Games', 'SHEN JIAWEI'], dtype=object)

**Observations:** Given that the number of unique values in this column is too high, I could identify the top publishers and label the rest as `Other`.
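
A minimal sketch of that idea, assuming a small toy series in place of `s["publisher"]`. The helper name `keep_top_n` and the sample values are hypothetical and are not part of the original notebook.

```python
import pandas as pd

# Toy data standing in for s["publisher"]; the names are made up.
publishers = pd.Series(["A", "A", "A", "B", "B", "C", "D", "E"], name="publisher")

def keep_top_n(col: pd.Series, n: int, other: str = "Other") -> pd.Series:
    """Keep the n most frequent values of col and relabel the rest as `other`."""
    top = col.value_counts().nlargest(n).index
    # where() keeps values that satisfy the condition and replaces the others.
    return col.where(col.isin(top), other)

grouped = keep_top_n(publishers, n=2)
print(grouped.value_counts())
```

The cut-off `n` is a judgement call; it could be chosen after inspecting the counts per publisher.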

In [10]:
print(len(s["platforms"].unique()))
s["platforms"].unique()

7


array(['windows;mac;linux', 'windows;mac', 'windows', 'windows;linux',
       'mac', 'mac;linux', 'linux'], dtype=object)

In [11]:
# Confirming that the "appid" column is a unique identifier
print(len(s["appid"].unique()))

27075


### Understanding the Data

It is important to understand the meaning of our data and its parameters.   
I noticed a confusing element in the dataset: the `average_playtime` and `median_playtime` columns, whose values are difficult to interpret. The dataset documentation confirms that they are per-user averages. For the values to make sense, I assume that the unit is minutes.
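
As a quick sanity check on the minutes assumption, the playtime values of the Counter-Strike row shown in `s.head()` can be converted to hours. This is only a sketch; the helper `minutes_to_hours` is hypothetical.

```python
# Playtime values taken from the Counter-Strike row shown earlier,
# read as minutes (an assumption, not something the data states).
average_playtime = 17612
median_playtime = 317

def minutes_to_hours(minutes: float) -> float:
    """Convert a duration in minutes to hours."""
    return minutes / 60

print(f"average: {minutes_to_hours(average_playtime):.1f} h")
print(f"median:  {minutes_to_hours(median_playtime):.1f} h")
```

Read as minutes, the average comes to roughly 293.5 hours and the median to roughly 5.3 hours per user.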

### Outliers Check

In [12]:
s.describe()

Unnamed: 0,appid,english,required_age,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,price
count,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0,27075.0
mean,596203.5,0.981127,0.354903,45.248864,1000.559,211.027147,149.804949,146.05603,6.078193
std,250894.2,0.136081,2.406044,352.670281,18988.72,4284.938531,1827.038141,2353.88008,7.874922
min,10.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,401230.0,1.0,0.0,0.0,6.0,2.0,0.0,0.0,1.69
50%,599070.0,1.0,0.0,7.0,24.0,9.0,0.0,0.0,3.99
75%,798760.0,1.0,0.0,23.0,126.0,42.0,0.0,0.0,7.19
max,1069460.0,1.0,18.0,9821.0,2644404.0,487076.0,190625.0,190625.0,421.99


**Observations:** There are some extreme values in the data set, but these are to be expected given the different playability levels among games and the extreme behaviour of some players. Moreover, a big difference between median playtime and max playtime is to be expected, since we need to take into account the players who only try a game for a few minutes.
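
In the `describe()` output, the 75th percentile of both playtime columns is 0, so most rows carry no recorded playtime at all. The sketch below shows one way such a column could be inspected, using made-up values and the common 1.5 × IQR rule of thumb, which is an assumption here and not a method used in the notebook.

```python
import pandas as pd

# Toy playtime values in minutes (made up); many zeros and one extreme value.
playtime = pd.Series([0, 0, 0, 0, 5, 12, 40, 90, 300, 190625], name="average_playtime")

# Share of rows with no recorded playtime.
zero_share = (playtime == 0).mean()

# Flag values above the upper fence q3 + 1.5 * IQR.
q1, q3 = playtime.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = playtime[playtime > upper_fence]

print(f"zero share: {zero_share:.0%}")
print(outliers)
```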

# Data Cleaning

In order to analyse the data correctly, I need to adjust it to my desired parameters. Thankfully, the dataset is already very comprehensible and neat, thanks to the good work of [Nik Davis](https://www.kaggle.com/nikdavis/steam-store-games).   
However, major work needs to be done on the categorical columns, which hold semicolon-separated lists of values and are thus impossible to compare and classify as they stand. I will attempt to solve this problem in one of the following two ways:

1. Grouping by a given value with an `if in text` (substring) test.
    - *The main drawback of this strategy is the heavy computation the program would have to perform to produce a result.*

2. Unpacking the values, separating them and reassigning them.
    - *This is more involved and implies either some operational difficulty or some loss of data. The options are:*
        1. Choosing only the 1 or 2 top categories and assigning those to the game instead of the current multitude of combinations.
        2. Creating a boolean column for each possible category.

Moreover, I will also identify the top values of each categorical column and reassign the rest to an `Other` category.
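
For the "one boolean column per value" option, pandas offers `Series.str.get_dummies`, which splits on a separator and returns one 0/1 column per distinct token. The following is a minimal sketch on a toy frame (the rows are made up). Note that it matches exact tokens, whereas an `in` test matches substrings.

```python
import pandas as pd

# Toy frame mimicking the semicolon-separated "platforms" column.
toy = pd.DataFrame({
    "appid": [10, 20, 30],
    "platforms": ["windows;mac;linux", "windows", "mac;linux"],
})

# One 0/1 column per distinct token, split on ";".
flags = toy["platforms"].str.get_dummies(sep=";")

# Attach the flags and drop the packed column.
unpacked = pd.concat([toy.drop(columns="platforms"), flags], axis=1)
print(unpacked)
```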

In [13]:
s.head(1)

Unnamed: 0,appid,name,release_date,english,developer,publisher,platforms,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,windows;mac;linux,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19


### Datetime

In [14]:
s["release_date"] = pd.to_datetime(s["release_date"], yearfirst=True)

### Unpacking categorical data

#### Platforms

In [15]:
s["platforms"].unique()

array(['windows;mac;linux', 'windows;mac', 'windows', 'windows;linux',
       'mac', 'mac;linux', 'linux'], dtype=object)

In [16]:
# Creating a list with the true unique values
splits = []
for word in s["platforms"].unique():
    splits.append(word.split(';'))
total = [w for lst in splits for w in lst]
unique_platforms = set(total)
unique_platforms

{'linux', 'mac', 'windows'}

In [17]:
# Adding a 0/1 column for each unique value
s_cols = s.copy()

for w in unique_platforms:
    # Vectorised substring test; avoids chained assignment (s_cols[w][i] = 1),
    # which triggers SettingWithCopyWarning
    s_cols[w] = s_cols["platforms"].str.contains(w, regex=False).astype(int)


In [18]:
# Dropping the original column
s_cols.drop(["platforms"], axis=1, inplace=True)

In [19]:
s_cols.head(1)

Unnamed: 0,appid,name,release_date,english,developer,publisher,required_age,categories,genres,steamspy_tags,achievements,positive_ratings,negative_ratings,average_playtime,median_playtime,owners,price,windows,linux,mac
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,0,Multi-player;Online Multi-Player;Local Multi-P...,Action,Action;FPS;Multiplayer,0,124534,3339,17612,317,10000000-20000000,7.19,1,1,1


#### Genres

In [20]:
# Creating a list with the true unique values
splits = []
for word in s["genres"].unique():
    splits.append(word.split(';'))
total = [w for lst in splits for w in lst]
unique_genres = set(total)
print(len(unique_genres))
unique_genres

29


{'Accounting',
 'Action',
 'Adventure',
 'Animation & Modeling',
 'Audio Production',
 'Casual',
 'Design & Illustration',
 'Documentary',
 'Early Access',
 'Education',
 'Free to Play',
 'Game Development',
 'Gore',
 'Indie',
 'Massively Multiplayer',
 'Nudity',
 'Photo Editing',
 'RPG',
 'Racing',
 'Sexual Content',
 'Simulation',
 'Software Training',
 'Sports',
 'Strategy',
 'Tutorial',
 'Utilities',
 'Video Production',
 'Violent',
 'Web Publishing'}

**Observations:** This unpacking turned out better than I thought, since there are only 29 distinct genres (as opposed to the 1,552 different combinations uncovered earlier). Therefore, I will create a column for each value, then assess how many games there are per genre and potentially drop the columns that have negligible numbers.

In [21]:
# Adding a 0/1 column for each unique value
for w in unique_genres:
    # Vectorised substring test; avoids chained assignment (s_cols[w][i] = 1)
    s_cols[w] = s_cols["genres"].str.contains(w, regex=False).astype(int)


In [22]:
# Creating a temporary table of the counts per type
genres = pd.DataFrame(unique_genres)
genres['count'] = genres[0].apply(lambda w: s_cols[w].sum())
genres.rename(columns={0:"genre"}, inplace=True)
genres.sort_values(["count"], ascending=False, inplace=True)

In [23]:
genres

Unnamed: 0,genre,count
19,Indie,19421
17,Action,11903
24,Casual,10210
9,Adventure,10032
0,Strategy,5247
1,Simulation,5194
23,RPG,4311
13,Early Access,2954
28,Free to Play,1704
22,Sports,1322


**Observations:** With this data frame I can clearly see the top genres. After some examination, I've decided to split the data frame into keep and toss after the 11th value (`Racing`). I will categorise the rest as `Other`.
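
The cell that follows drops the rare genre columns. If an explicit `Other` flag is wanted, one way to derive it before the drop might look like the sketch below. The toy rows and the `rare` list are hypothetical stand-ins for `s_cols` and `toss["genre"]`.

```python
import pandas as pd

# Toy one-hot genre flags (made-up rows); "Gore" and "Tutorial" play the rare genres.
toy = pd.DataFrame({
    "Action":   [1, 0, 0],
    "Indie":    [1, 1, 0],
    "Gore":     [0, 1, 0],
    "Tutorial": [0, 0, 1],
})
rare = ["Gore", "Tutorial"]  # hypothetical "toss" list

# 1 if the game carries at least one rare genre, else 0.
toy["Other"] = toy[rare].any(axis=1).astype(int)

# Only now drop the rare genre columns.
toy = toy.drop(columns=rare)
print(toy)
```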

In [24]:
# Splitting the genres data frame
keep = genres[:11]
toss = genres[11:]

# Dropping the unwanted genre columns.
s_cols.drop(toss["genre"], axis=1, inplace=True)

In [25]:
# Dropping the original column
s_cols.drop(["genres"], axis=1, inplace=True)

In [26]:
s_cols.head(1)

Unnamed: 0,appid,name,release_date,english,developer,publisher,required_age,categories,steamspy_tags,achievements,...,Simulation,Racing,Adventure,Early Access,Action,Indie,Sports,RPG,Casual,Free to Play
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,0,Multi-player;Online Multi-Player;Local Multi-P...,Action;FPS;Multiplayer,0,...,0,0,0,0,1,0,0,0,0,0


#### Categories

In [27]:
# Creating a list with the true unique values
splits = []
for word in s["categories"].unique():
    splits.append(word.split(';'))
total = [w for lst in splits for w in lst]
unique_categories = set(total)
print(len(unique_categories))
unique_categories

29


{'Captions available',
 'Co-op',
 'Commentary available',
 'Cross-Platform Multiplayer',
 'Full controller support',
 'In-App Purchases',
 'Includes Source SDK',
 'Includes level editor',
 'Local Co-op',
 'Local Multi-Player',
 'MMO',
 'Mods',
 'Mods (require HL2)',
 'Multi-player',
 'Online Co-op',
 'Online Multi-Player',
 'Partial Controller Support',
 'Shared/Split Screen',
 'Single-player',
 'Stats',
 'Steam Achievements',
 'Steam Cloud',
 'Steam Leaderboards',
 'Steam Trading Cards',
 'Steam Turn Notifications',
 'Steam Workshop',
 'SteamVR Collectibles',
 'VR Support',
 'Valve Anti-Cheat enabled'}

**Observations:** Similarly to the earlier case, there are only 29 distinct categories (as opposed to the 3,333 different combinations uncovered earlier). Therefore, I will create a column for each value, then assess how many games there are per category and potentially drop the columns that have negligible numbers.

In [28]:
# Adding a 0/1 column for each unique value
for w in unique_categories:
    # Vectorised substring test, so "Co-op" also matches "Online Co-op" and "Local Co-op"
    s_cols[w] = s_cols["categories"].str.contains(w, regex=False).astype(int)


In [29]:
# Creating a temporary table of the counts per type
categories = pd.DataFrame(unique_categories)
categories['count'] = categories[0].apply(lambda w: s_cols[w].sum())
categories.rename(columns={0:"category"}, inplace=True)
categories.sort_values(["count"], ascending=False, inplace=True)
categories

Unnamed: 0,category,count
23,Single-player,25678
20,Steam Achievements,14130
22,Steam Trading Cards,7918
26,Steam Cloud,7219
1,Full controller support,5695
5,Partial Controller Support,4234
4,Multi-player,3974
17,Steam Leaderboards,3439
13,Co-op,2604
16,Online Multi-Player,2487


**Observations:** The assessment here is more complicated, since the categories carry information that can be interesting to visualise even when very few games fall into a category (e.g. `VR Support`). Moreover, I can see some similar categories that could be grouped together.    

Therefore, for this step I will apply my own judgement, together with what I can learn about the subject, and devise a specific approach to reduce the number of categories.
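
One compact way to express this kind of grouping is a mapping from each new column to the flag columns it absorbs. The sketch below uses a toy frame and a hypothetical `groups` dictionary; the actual groupings chosen in this notebook are defined in the cells that follow.

```python
import pandas as pd

# Toy 0/1 category flags (made-up rows).
toy = pd.DataFrame({
    "Full controller support":    [1, 0, 0],
    "Partial Controller Support": [0, 0, 1],
    "Co-op":                      [0, 1, 0],
    "Local Co-op":                [0, 0, 0],
})

# Hypothetical mapping: new grouped column -> source columns it absorbs.
groups = {
    "Controller Support": ["Full controller support", "Partial Controller Support"],
    "co-op": ["Co-op", "Local Co-op"],
}

for new_col, sources in groups.items():
    # A game belongs to the group if any of its source flags is set.
    toy[new_col] = toy[sources].any(axis=1).astype(int)

print(toy[list(groups)])
```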

In [30]:
# Listing the categories that are unnecessary from the get-go
toss = ["Mods (require HL2)", "Mods", "Includes Source SDK", "SteamVR Collectibles", "Steam Turn Notifications", "Valve Anti-Cheat enabled", "Captions available", "Stats", "Steam Workshop", "In-App Purchases", "Commentary available"]

In [31]:
# Grouping controller support
controller_cols = ["Full controller support", "Partial Controller Support"]
# 1 if either flag is set; vectorised to avoid chained assignment
s_cols["Controller Support"] = s_cols[controller_cols].any(axis=1).astype(int)

toss.extend(controller_cols)


In [32]:
# Grouping online games
online_cols = ["Online Co-op", "Cross-Platform Multiplayer", "Online Multi-Player"]
# 1 if any of the flags is set; vectorised to avoid chained assignment
s_cols["Online"] = s_cols[online_cols].any(axis=1).astype(int)

toss.extend(online_cols)


In [33]:
# Grouping multi-player games
multi_cols = ["Multi-player", "Online Multi-Player", "Local Multi-Player",
              "Cross-Platform Multiplayer", "Shared/Split Screen", "MMO"]
# 1 if any of the flags is set; vectorised to avoid chained assignment
s_cols["Multi-Player"] = s_cols[multi_cols].any(axis=1).astype(int)

toss.extend(["Multi-player", "Local Multi-Player", "Shared/Split Screen"])

In [34]:
# Grouping co-op games
coop_cols = ["Co-op", "Online Co-op", "Local Co-op"]
# 1 if any of the flags is set; vectorised to avoid chained assignment
s_cols["co-op"] = s_cols[coop_cols].any(axis=1).astype(int)

# "Online Co-op" is already in toss (added with the online group above)
toss.extend(["Co-op", "Local Co-op"])


In [35]:
# Grouping local games
local_cols = ["Single-player", "Local Multi-Player", "Local Co-op"]
# 1 if any of the flags is set; vectorised to avoid chained assignment
s_cols["Local"] = s_cols[local_cols].any(axis=1).astype(int)


In [36]:
# Removing the Steam Interface categories
toss.append("Steam Achievements")
toss.append("Steam Trading Cards")
toss.append("Steam Cloud")
toss.append("Steam Leaderboards")

In [37]:
# Dropping the original category columns from the data frame
s_cols.drop(toss, axis=1, inplace=True)

In [38]:
# Dropping the original column
s_cols.drop(["categories"], axis=1, inplace=True)

In [39]:
s_cols.head(1)

Unnamed: 0,appid,name,release_date,english,developer,publisher,required_age,steamspy_tags,achievements,positive_ratings,...,Free to Play,Includes level editor,VR Support,Single-player,MMO,Controller Support,Online,Multi-Player,co-op,Local
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,0,Action;FPS;Multiplayer,0,124534,...,0,0,0,0,0,0,1,1,0,1


#### Tags

In [40]:
# Creating a list with the true unique values
splits = []
for word in s["steamspy_tags"].unique():
    splits.append(word.split(';'))
total = [w for lst in splits for w in lst]
unique_tags = set(total)
print(len(unique_tags))
unique_tags

339


{'1980s',
 "1990's",
 '2.5D',
 '2D',
 '2D Fighter',
 '360 Video',
 '3D',
 '3D Platformer',
 '3D Vision',
 '4 Player Local',
 '4X',
 '6DOF',
 'Abstract',
 'Action',
 'Action RPG',
 'Action-Adventure',
 'Adventure',
 'Agriculture',
 'Aliens',
 'Alternate History',
 'America',
 'Animation & Modeling',
 'Anime',
 'Arcade',
 'Arena Shooter',
 'Assassin',
 'Atmospheric',
 'Audio Production',
 'BMX',
 'Base-Building',
 'Baseball',
 'Basketball',
 'Batman',
 'Battle Royale',
 "Beat 'em up",
 'Beautiful',
 'Benchmark',
 'Bikes',
 'Blood',
 'Board Game',
 'Bowling',
 'Building',
 'Bullet Hell',
 'Bullet Time',
 'CRPG',
 'Capitalism',
 'Card Game',
 'Cartoon',
 'Cartoony',
 'Casual',
 'Cats',
 'Character Action Game',
 'Character Customization',
 'Chess',
 'Choices Matter',
 'Choose Your Own Adventure',
 'Cinematic',
 'City Builder',
 'Class-Based',
 'Classic',
 'Clicker',
 'Co-op',
 'Cold War',
 'Colorful',
 'Comedy',
 'Comic Book',
 'Competitive',
 'Controller',
 'Conversation',
 'Crafting',
 '

**Observations:** The number of unique values in this column is far too high for the approach used on the other columns to be viable. Moreover, many of the tags simply repeat the genre or the category of the game, which makes them redundant.

However, it would still be interesting to check the most popular tags per genre and category, especially ones like `Robots` or `Battle Royale` that give more precise information about the game and can shed light on obscure but successful subgenres.

Therefore, I will attempt to create an additional data frame that contains the tags in case I want to use them in my analysis. However, I will still attempt to reduce the number of tags I add by only including the ones that have a significant number of games attached to them.
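
The filtering idea can be sketched without building any columns, by counting exact tags first and keeping only those above a threshold. The tag strings and the threshold below are made up for illustration; the notebook itself counts via the 0/1 columns created in the next cells.

```python
from collections import Counter

# Toy semicolon-separated tag strings (made up), one per game.
tag_strings = [
    "Action;FPS;Multiplayer",
    "Action;Indie",
    "Indie;Casual",
    "Action;Robots",
]

# Count how many games carry each exact tag
# (assumes a tag appears at most once per game).
tag_counts = Counter(tag for row in tag_strings for tag in row.split(";"))

# Keep only tags attached to at least `min_games` games (arbitrary threshold here).
min_games = 2
frequent_tags = sorted(t for t, n in tag_counts.items() if n >= min_games)
print(frequent_tags)
```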

In [41]:
s_tags = s_cols.copy()

# Adding a 0/1 column for each unique value
for w in unique_tags:
    # Vectorised substring test, so "Action" also matches "Action RPG"
    s_tags[w] = s_tags["steamspy_tags"].str.contains(w, regex=False).astype(int)



In [42]:
# Creating a temporary table of the counts per type
tags = pd.DataFrame(unique_tags)
tags['count'] = tags[0].apply(lambda w: s_tags[w].sum())
tags.rename(columns={0:"tag"}, inplace=True)
tags.sort_values(["count"], ascending=False, inplace=True)
tags

Unnamed: 0,tag,count
38,Indie,16232
269,Action,10344
86,Casual,8205
124,Adventure,7796
203,Strategy,4180
...,...,...
121,3D Vision,1
168,Gun Customization,1
142,Spectacle fighter,1
137,Intentionally Awkward Controls,1


In [43]:
# Counting the tags attached to at least 100 games
print((tags["count"] >= 100).sum())
tags

60


Unnamed: 0,tag,count
38,Indie,16232
269,Action,10344
86,Casual,8205
124,Adventure,7796
203,Strategy,4180
...,...,...
121,3D Vision,1
168,Gun Customization,1
142,Spectacle fighter,1
137,Intentionally Awkward Controls,1


In [44]:
# Splitting the tags data frame
keep = tags[:60]
toss = tags[60:]

# Dropping the unwanted tag columns.
s_tags.drop(toss["tag"], axis=1, inplace=True)

In [45]:
# Dropping the original column
s_cols.drop(["steamspy_tags"], axis=1, inplace=True)

### Reorganising
In order to facilitate my analysis later on, I'm going to create some useful variables that collect the information and the work I've done in this part.

In [46]:
s_tags.columns

Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'required_age', 'steamspy_tags', 'achievements', 'positive_ratings',
       'negative_ratings', 'average_playtime', 'median_playtime', 'owners',
       'price', 'windows', 'linux', 'mac', 'Strategy', 'Simulation', 'Racing',
       'Adventure', 'Early Access', 'Action', 'Indie', 'Sports', 'RPG',
       'Casual', 'Free to Play', 'Includes level editor', 'VR Support',
       'Single-player', 'MMO', 'Controller Support', 'Online', 'Multi-Player',
       'co-op', 'Local', 'Fighting', 'Great Soundtrack', 'Nudity', 'RPGMaker',
       'Open World', 'RTS', 'War', 'Sci-fi', 'Tower Defense', 'Platformer',
       'Sexual Content', 'Point & Click', 'Rogue-like', '2D', 'Anime',
       'Massively Multiplayer', 'Management', 'Utilities', 'Horror',
       'Board Game', 'Card Game', 'Pixel Graphics', 'Shooter', 'Multiplayer',
       "Shoot 'Em Up", 'Zombies', 'Hidden Object', 'Fantasy', 'World War II',
       'Survival',

In [47]:
platforms = ['linux', 'windows', 'mac']
genres = ['Indie', 'Sports', 'Simulation', 'Strategy', 'Early Access', 'Casual',
       'RPG', 'Free to Play', 'Adventure', 'Action', 'Racing']
categories = ['Includes level editor', 'MMO', 'VR Support', 'Single-player',
       'Controller Support', 'Online', 'Multi-Player', 'co-op', 'Local']
tags = ['Nudity', 'Retro', 'Violent', 'Visual Novel', 'RPGMaker', 'Fighting',
       'FPS', 'Female Protagonist', 'Board Game', 'Space', 'World War II',
       'Platformer', 'Anime', 'Great Soundtrack', 'Massively Multiplayer',
       'Open World', 'Sexual Content', 'Arcade', 'Gore', 'Pixel Graphics',
       'Turn-Based', 'Music', 'Fantasy', 'Point & Click', 'Rogue-like',
       'World War I', "Shoot 'Em Up", 'RTS', 'Story Rich', 'Hidden Object',
       'Turn-Based Strategy', 'Survival', 'Match 3', 'Horror', 'Puzzle',
       'Sci-fi', 'Tower Defense', 'VR', 'Management', '2D', 'Card Game',
       'Multiplayer', 'Utilities', 'Shooter', 'War', 'Co-op', 'Zombies',
       'Classic', 'Singleplayer']

### Selecting top values from the categorical data and reassigning

#### Developers

In [48]:
print(len(s["developer"].unique()))
s["developer"].unique()

17113


array(['Valve', 'Gearbox Software', 'Valve;Hidden Path Entertainment',
       ..., 'SHEN JIAWEI', 'Semyon Maximov', 'Adept Studios GD'],
      dtype=object)

In [49]:
# Creating a list with the true unique values
splits = []
for word in s["developer"].unique():
    splits.append(word.split(';'))
total = [w for lst in splits for w in lst]
unique_developer = set(total)
print(len(unique_developer))
unique_developer

17953


{'Gunpowder Games, LLC',
 'Breaking Dimensions',
 'Jamong Inc.',
 'LongGe',
 'Bitca',
 'b-Alive',
 'ShotX Studio',
 'Moon Studios GmbH',
 'Yumoon',
 'George Allan',
 'Moonster Studio',
 'Rawkins Games',
 'DonkeyKwon Games',
 'Dark-Spot Studio',
 'Wild Guess Software',
 'Batu Games LLC',
 'AquaBomber',
 '1C Game Studios',
 'Fastermind Games',
 'Scary Bee LLC',
 'Revistronic',
 'Paul Fisch',
 'Playdek, Inc.',
 'Beijing Xinrun Technology Co.,Ltd',
 'Starfish-SD Inc',
 'Helvetica Scenario',
 '2ndRevelation',
 'RedBedlam',
 'Sun-Studios',
 'bu2 sutdio',
 'Accolade, Inc',
 'Because I Can',
 'keyreal',
 'Gattai Games',
 'Beckoning Cat',
 'Cemil Tasdemir',
 'Polinc Games',
 'VitruviusVR',
 'Stolpskott Studios',
 'Trav Nash',
 'Blot Interactive',
 'Triumph Studios',
 'Konstantin Koshutin',
 'MachineSpirit',
 'YOUGAKE',
 'Nuclear Strawberry',
 'etoylab',
 'BulatHard',
 'Hammerson Games',
 '2049VR',
 'Yvo Geldhof',
 'Gampixi',
 'Terapoly',
 'Dima Kiva',
 'VladProduction',
 'Good Bit',
 'Anoman St

In [50]:
group = s.groupby("developer").agg({"appid" : "count"}).sort_values("appid", ascending=False)
group

Unnamed: 0_level_0,appid
developer,Unnamed: 1_level_1
Choice of Games,94
"KOEI TECMO GAMES CO., LTD.",72
Ripknot Systems,62
Laush Dmitriy Sergeevich,51
"Nikita ""Ghost_RUS""",50
...,...
"Hollowhead, Inc.",1
Holmsario Games,1
Hologram Software LTD.;iCandy Games Inc.,1
Holomia,1


In [51]:
group.head(20)

Unnamed: 0_level_0,appid
developer,Unnamed: 1_level_1
Choice of Games,94
"KOEI TECMO GAMES CO., LTD.",72
Ripknot Systems,62
Laush Dmitriy Sergeevich,51
"Nikita ""Ghost_RUS""",50
Dexion Games,45
RewindApp,43
Hosted Games,42
Blender Games,40
Humongous Entertainment,36


**Observations:** Upon grouping by `developer`, I could see that the majority of developers have only 1 game. However, when taking a closer look at the top 20 developers, there wasn't a significant jump at any level, and the differences among them didn't seem relevant. At this stage, delving into an analysis by developer wouldn't yield interesting results, so I will leave that column as is and probably not use it in my analysis.

From this I do learn that developers are multitudinous and the amount of games that each developer produces is limited, which makes sense since developing a game is a huge endeavour.
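
The "most developers have a single game" observation can be quantified with two `value_counts` calls. This is a sketch on made-up names; on the real data the input would be `s["developer"]`.

```python
import pandas as pd

# Toy developer column (made-up names).
developer = pd.Series(["Dev A", "Dev A", "Dev B", "Dev C", "Dev D"], name="developer")

# Number of games per developer.
games_per_dev = developer.value_counts()

# Fraction of developers with exactly one game.
single_game_share = (games_per_dev == 1).mean()

# How many developers have 1 game, 2 games, and so on.
distribution = games_per_dev.value_counts().sort_index()

print(f"single-game developers: {single_game_share:.0%}")
print(distribution)
```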

#### Publishers

In [52]:
group = s.groupby("publisher").agg({"appid" : "count"}).sort_values("appid", ascending=False)
group

Unnamed: 0_level_0,appid
publisher,Unnamed: 1_level_1
Big Fish Games,212
Strategy First,136
Ubisoft,111
THQ Nordic,98
Square Enix,97
...,...
Homebrew Cult,1
Homegrown Games,1
Homegrown Games - a HRMC Label,1
Homemade Games,1


In [53]:
group.head(20)

Unnamed: 0_level_0,appid
publisher,Unnamed: 1_level_1
Big Fish Games,212
Strategy First,136
Ubisoft,111
THQ Nordic,98
Square Enix,97
Sekai Project,96
Choice of Games,94
Dagestan Technology,88
1C Entertainment,88
SEGA,78


**Observations:** Similarly to grouping by `developer`, I can see that the majority of publishers have published only 1 game. However, when taking a closer look at the top 20 publishers, there wasn't a significant jump at any level, and the differences among them didn't seem relevant. At this stage, delving into an analysis by publisher wouldn't yield interesting results, so I will leave that column as is.

However, I may reuse this same groupby method later in my analysis to report the top publishers.
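
If the groupby is going to be reused, it may help to wrap it in a small function. This is only a sketch: the name `top_n` and the toy data are assumptions, and it uses pandas named aggregation so that the result column is called `games` rather than `appid`.

```python
import pandas as pd


def top_n(df: pd.DataFrame, column: str, n: int = 10) -> pd.DataFrame:
    """Return the `n` most frequent values of `column` with their game counts."""
    return (
        df.groupby(column)
          .agg(games=("appid", "count"))       # count games per group
          .sort_values("games", ascending=False)
          .head(n)
    )


# Toy data standing in for the Steam data frame `s` (hypothetical values)
toy = pd.DataFrame({
    "appid": [1, 2, 3, 4, 5, 6],
    "publisher": ["P1", "P1", "P1", "P2", "P2", "P3"],
})
top = top_n(toy, "publisher", n=2)
print(top)
```

With the real data this would be called as `top_n(s, "publisher", 20)` or `top_n(s, "developer", 20)`.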

# Final Look

In [54]:
s_tags.head()

Unnamed: 0,appid,name,release_date,english,developer,publisher,required_age,steamspy_tags,achievements,positive_ratings,...,Visual Novel,Retro,World War I,Match 3,Female Protagonist,Turn-Based,Space,Singleplayer,Violent,Turn-Based Strategy
0,10,Counter-Strike,2000-11-01,1,Valve,Valve,0,Action;FPS;Multiplayer,0,124534,...,0,0,0,0,0,0,0,0,0,0
1,20,Team Fortress Classic,1999-04-01,1,Valve,Valve,0,Action;FPS;Multiplayer,0,3318,...,0,0,0,0,0,0,0,0,0,0
2,30,Day of Defeat,2003-05-01,1,Valve,Valve,0,FPS;World War II;Multiplayer,0,3416,...,0,0,1,0,0,0,0,0,0,0
3,40,Deathmatch Classic,2001-06-01,1,Valve,Valve,0,Action;FPS;Multiplayer,0,1273,...,0,0,0,0,0,0,0,0,0,0
4,50,Half-Life: Opposing Force,1999-11-01,1,Gearbox Software,Valve,0,FPS;Action;Sci-fi,0,5250,...,0,0,0,0,0,0,0,0,0,0


In [55]:
s_tags.columns

Index(['appid', 'name', 'release_date', 'english', 'developer', 'publisher',
       'required_age', 'steamspy_tags', 'achievements', 'positive_ratings',
       'negative_ratings', 'average_playtime', 'median_playtime', 'owners',
       'price', 'windows', 'linux', 'mac', 'Strategy', 'Simulation', 'Racing',
       'Adventure', 'Early Access', 'Action', 'Indie', 'Sports', 'RPG',
       'Casual', 'Free to Play', 'Includes level editor', 'VR Support',
       'Single-player', 'MMO', 'Controller Support', 'Online', 'Multi-Player',
       'co-op', 'Local', 'Fighting', 'Great Soundtrack', 'Nudity', 'RPGMaker',
       'Open World', 'RTS', 'War', 'Sci-fi', 'Tower Defense', 'Platformer',
       'Sexual Content', 'Point & Click', 'Rogue-like', '2D', 'Anime',
       'Massively Multiplayer', 'Management', 'Utilities', 'Horror',
       'Board Game', 'Card Game', 'Pixel Graphics', 'Shooter', 'Multiplayer',
       'Shoot 'Em Up', 'Zombies', 'Hidden Object', 'Fantasy', 'World War II',
       'Survival',

In [56]:
s_tags.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 27075 entries, 0 to 27074
Data columns (total 87 columns):
 #   Column                 Non-Null Count  Dtype         
---  ------                 --------------  -----         
 0   appid                  27075 non-null  int64         
 1   name                   27075 non-null  object        
 2   release_date           27075 non-null  datetime64[ns]
 3   english                27075 non-null  int64         
 4   developer              27075 non-null  object        
 5   publisher              27075 non-null  object        
 6   required_age           27075 non-null  int64         
 7   steamspy_tags          27075 non-null  object        
 8   achievements           27075 non-null  int64         
 9   positive_ratings       27075 non-null  int64         
 10  negative_ratings       27075 non-null  int64         
 11  average_playtime       27075 non-null  int64         
 12  median_playtime        27075 non-null  int64         
 13  o

**Observations:** On this last review I realise that I did not convert the `owners` column to a numeric data type. I will do that now as a last step before exporting my clean data set. Everything else seems to be correct.

In [57]:
s["owners"]

0        10000000-20000000
1         5000000-10000000
2         5000000-10000000
3         5000000-10000000
4         5000000-10000000
               ...        
27070              0-20000
27071              0-20000
27072              0-20000
27073              0-20000
27074              0-20000
Name: owners, Length: 27075, dtype: object

**Observations:** The `owners` column stores each value as a range of numbers in the form `lower-upper`. I will therefore split the range and assign its midpoint as the new value.

In [58]:
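def avg(string):
    """Return the integer midpoint of an owners range such as "0-20000"."""
    # Split "lower-upper" into its two bounds and convert them to integers
    bounds = [int(part) for part in string.split("-")]
    # Average the two bounds and truncate the result to an integer
    return int(np.average(bounds))

#s["owners"] = s["owners"].apply(lambda x: avg(x))
#s_cols["owners"] = s_cols["owners"].apply(lambda x: avg(x))
#s_tags["owners"] = s_tags["owners"].apply(lambda x: avg(x))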
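
As an alternative to applying `avg` row by row, the midpoint can be computed with vectorised string operations. The sketch below is an assumption about how this could be done, not the notebook's method; the name `owners_midpoint` is hypothetical. It also returns numeric input unchanged, so re-running the cell on an already converted column does not fail.

```python
import pandas as pd


def owners_midpoint(owners: pd.Series) -> pd.Series:
    """Convert "lower-upper" range strings to their integer midpoint."""
    # Already numeric: nothing to convert, which makes the step safe to re-run
    if pd.api.types.is_numeric_dtype(owners):
        return owners
    # Split each range into two columns (0 = lower bound, 1 = upper bound)
    bounds = owners.str.split("-", expand=True).astype("int64")
    # Integer midpoint of the two bounds
    return (bounds[0] + bounds[1]) // 2


# Toy values copied from the ranges shown in the `owners` column
toy = pd.Series(["10000000-20000000", "5000000-10000000", "0-20000"], name="owners")
converted = owners_midpoint(toy)
print(converted.tolist())
```

On the real data this would be used as `s["owners"] = owners_midpoint(s["owners"])`, and likewise for `s_cols` and `s_tags`.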

## Exporting the Clean Dataset

In [59]:
s.to_csv('data/steam_clean.csv', index=False)
s_cols.to_csv('data/steam_cols_clean.csv', index=False)
s_tags.to_csv('data/steam_tags_clean.csv', index=False)

**Done!** 

The data frames have been exported to 3 separate files:
1. `steam_clean` is a cleaner version of the original dataset.
2. `steam_cols_clean` is `steam_clean` with the categorical data separated into new columns, except tags.
3. `steam_tags_clean` is `steam_cols_clean` with the tags also separated into new columns.
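
One caveat when loading these files in the next notebook: CSV does not store data types, so `release_date` comes back as plain text unless it is parsed again. The sketch below shows this on a toy frame written to an in-memory buffer (an assumption made to keep the example self-contained; the real files live under `data/`).

```python
import io

import pandas as pd

# Toy frame with the same kind of columns as the cleaned data (hypothetical values)
toy = pd.DataFrame({
    "appid": [10, 20],
    "release_date": pd.to_datetime(["2000-11-01", "1999-04-01"]),
    "owners": [15000000, 7500000],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading without options loses the datetime type
buffer.seek(0)
plain = pd.read_csv(buffer)

# Passing parse_dates restores it
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain.dtypes)
print(parsed.dtypes)
```

For the exported files, the equivalent call would be `pd.read_csv("data/steam_clean.csv", parse_dates=["release_date"])`.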