## Data Cleaning Summary
In this notebook, we focus on the initial cleaning and validation of the raw Steam dataset to prepare it for modeling and analysis. The dataset includes game metadata, pricing, reviews, achievements, and other attributes collected from [Steam Game Data on Kaggle](https://www.kaggle.com/datasets/artyomkruglov/gaming-profiles-2025-steam-playstation-xbox/). 

The cleaned dataset is intended to serve as a reliable foundation for downstream modeling in Power BI by eliminating structural issues that could affect joins, aggregations, and filtering during reporting. The output of this notebook is the input to the normalization process in the next notebook.



In [3]:
import pandas as pd
import numpy as np
import sys
import ast
import os
sys.path.append(os.path.abspath(".."))
from utils.data_utils import missing_value_summary

### Games

In [23]:
gamesdf = pd.read_csv("../data_steam/raw/games.csv")

In [24]:
games = gamesdf.copy()

In [63]:
games.head(3)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
0,3281560,Horror Game To Play With Friends! Playtest,,,,,2024-10-21
1,3280930,Eternals' Path Playtest,,,,,2024-10-17
2,3280770,ANGST: A TALE OF SURVIVAL - Singleplayer Playtest,,,,,2024-10-13


In [170]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92965 entries, 5 to 98247
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   gameid               92965 non-null  int64 
 1   title                92965 non-null  object
 2   developers           92965 non-null  object
 3   publishers           92965 non-null  object
 4   genres               92965 non-null  object
 5   supported_languages  92965 non-null  object
 6   release_date         92965 non-null  object
dtypes: int64(1), object(6)
memory usage: 7.7+ MB


### Initial Overview of the `games` Table

The `games` table contains 98,248 rows and 98,248 unique `gameid`s, confirming that each row represents a distinct game — there are no duplicates.

As described by the dataset curator on Kaggle, the table includes the following fields:

- **gameid** – unique game ID on the Steam platform  
- **title** – full title of the game  
- **developers** – list of game developers  
- **publishers** – list of game publishers  
- **genres** – list of associated game genres  
- **supported_languages** – available languages (subtitles or voice-over)  
- **release_date** – date the game was released

Some of these columns contain missing values, which will be addressed shortly. One notable pattern in the data is that some game titles contain the word *"Playtest"*. These entries are typically stripped-down test versions of games and, as shown below, most lack critical metadata such as developers, publishers, or genres.

Since these *"Playtest"* entries are incomplete and not analytically useful, we remove them from the dataset. There are 5,280 such entries, accounting for roughly 5% of the total.


In [25]:
n_rows = len(games)
uniq_game_ids = games['gameid'].nunique()
playtest_games = games[games['title'].astype(str).str.contains('Playtest')] 
playtest_num = len(playtest_games)

print(f'Number of rows in games table:  {n_rows}\n')
print(f'Number of unique gameids in games table: {uniq_game_ids}\n')

print(f'Number of playtest games: {playtest_num}\n')
playtest_perc = (playtest_num / len(games) ) * 100
print(f'Percentage of playtest games: {playtest_perc: .2f}%')
print("\n Number of NaNs per column:\n")
print(playtest_games.isna().sum())


Number of rows in games table:  98248

Number of unique gameids in games table: 98248

Number of playtest games: 5280

Percentage of playtest games:  5.37%

 Number of NaNs per column:

gameid                    0
title                     0
developers             5273
publishers             5276
genres                 5273
supported_languages    5273
release_date              0
dtype: int64


In [238]:
# View a few rows to confirm the filter returns the expected entries.
playtest_games.head(5)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
0,3281560,Horror Game To Play With Friends! Playtest,,,,,2024-10-21
1,3280930,Eternals' Path Playtest,,,,,2024-10-17
2,3280770,ANGST: A TALE OF SURVIVAL - Singleplayer Playtest,,,,,2024-10-13
3,3279790,Montabi Playtest,,,,,2024-10-13
4,3278320,파이팅걸 유리 Playtest,,,,,2024-10-12


In [346]:
games = games[~games['title'].astype(str).str.contains('Playtest')]  # remove Playtest games

In [347]:
games.info()

<class 'pandas.core.frame.DataFrame'>
Index: 92968 entries, 5 to 98247
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   gameid               92968 non-null  int64 
 1   title                92965 non-null  object
 2   developers           92682 non-null  object
 3   publishers           92303 non-null  object
 4   genres               92692 non-null  object
 5   supported_languages  92735 non-null  object
 6   release_date         92968 non-null  object
dtypes: int64(1), object(6)
memory usage: 5.7+ MB


#### Games – Missing Values

After removing the *"Playtest"* games, we now explore the missing values in the partially cleaned `games` dataset. As shown in the table below, every column has less than 1% missing values, indicating good data completeness. The remaining null entries are handled in the next steps of the cleaning process.


In [348]:
miss_per_col  = missing_value_summary(games)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
publishers,665,0.72
developers,286,0.31
genres,276,0.3
supported_languages,233,0.25
gameid,0,0.0
title,3,0.0
release_date,0,0.0
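
The `missing_value_summary` helper is imported from `utils.data_utils` and its source is not shown in this notebook. As a rough idea of what such a helper might do, here is a minimal sketch under our own assumptions: the function name `missing_value_summary_sketch`, the sort order, and the toy data are all hypothetical, and the real helper may differ.

```python
import pandas as pd


def missing_value_summary_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.data_utils.missing_value_summary."""
    n_missing = df.isna().sum()                          # NaN count per column
    pct_missing = (n_missing / len(df) * 100).round(2)   # share of all rows, in %
    summary = pd.DataFrame({
        "Number of Missing Values": n_missing,
        "Percent of Total Values": pct_missing,
    })
    # Assumed ordering: columns with the most gaps first.
    return summary.sort_values("Number of Missing Values", ascending=False)


# Toy data, not the Steam dataset.
toy = pd.DataFrame({
    "gameid": [1, 2, 3, 4],
    "title": ["A", "B", None, "D"],
    "publishers": [None, None, "['X']", "['Y']"],
})
summary = missing_value_summary_sketch(toy)
print(summary)
```

The percentages are relative to the number of rows in whatever DataFrame is passed in, which matters when the helper is applied to a filtered subset.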


##### Handling Missing Titles

We begin with the `title` column, which has only 3 missing entries. Since this table serves as the primary source of game information, and these rows lack titles, we have no reliable way to identify or interpret these games.

As confirmed in the output below, these entries also lack developer and publisher information — further supporting their removal. We therefore drop these 3 rows from the dataset.


In [349]:
no_title_games = games[games['title'].isna()]
games = games[~games['title'].isna()].copy()  # copy so later column assignments do not act on a slice
no_title_games

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
2853,1347240,,,,,['English'],2021-04-20
39705,1116910,,,,"['Action', 'Adventure', 'Casual', 'Indie', 'RPG', 'Simulation', 'Strategy']",,2019-09-25
77700,396420,,,,,,2016-11-01


In [350]:
miss_per_col  = missing_value_summary(games)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
publishers,662,0.71
developers,283,0.3
genres,274,0.29
supported_languages,231,0.25
gameid,0,0.0
title,0,0.0
release_date,0,0.0


##### Handling Missing Categorical Fields

For the remaining categorical fields (`developers`, `publishers`, `genres`, and `supported_languages`), we replace missing values with the placeholder `"None stated"`, stored as a single-item list (`['None stated']`) in keeping with the list-style values of these columns. Each column is missing well under 1% of its values, and dropping these rows would result in unnecessary data loss.

Using a consistent placeholder allows us to retain these games in the dataset while making it clear that the specific information is unavailable.

Once the placeholder has been applied and verified below, we turn our attention to the `prices` table. We will return to the `games` table several times, since it is our primary source of information about games.

In [351]:
miss_vals_col = list(miss_per_col[(miss_per_col['Number of Missing Values'] > 0)].index)
for col in miss_vals_col:
    games[col] = games[col].apply(lambda x: ['None stated'] if pd.isna(x) else x)


In [352]:
miss_per_col  = missing_value_summary(games)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
gameid,0,0.0
title,0,0.0
developers,0,0.0
publishers,0,0.0
genres,0,0.0
supported_languages,0,0.0
release_date,0,0.0


## Prices table

In [50]:
pricesdf = pd.read_csv('../data_steam/raw/prices.csv')

In [353]:
prices = pricesdf.copy()

As per the Kaggle documentation, the prices table has the following columns:

* `gameid` - unique gameID on the Steam platform
* `usd` - game price in USD
* `eur` - game price in EUR
* `gbp` - game price in GBP
* `jpy` - game price in JPY
* `rub` - game price in RUB
* `date_acquired` - date when the price information was recorded
  
The `prices` table contains 4,414,273 entries, reflecting game prices (in multiple currencies) on different `date_acquired` snapshots. Since prices can change due to discounts, sales, or business model shifts, a single game can appear multiple times with varying prices.

The original `games` table contains 98,248 unique `gameid`s, while the `prices` table contains 98,465 — meaning there are 217 games in `prices` that do not exist in the `games` table.

Since the aim is to build a normalised data model with a `games_dim` table (where `gameid` is the primary key; see the next notebook), it makes no sense to retain price records for games that don't exist in the main games dataset. We therefore remove all price records belonging to these 217 unmatched `gameid`s, as they cannot be joined or analyzed meaningfully in our model.



In [354]:
uniq_ids_prices = set(prices['gameid'])
uniq_ids_original_games = set(gamesdf['gameid']) #use the original games, before removing playtests
price_id_not_in_games = uniq_ids_prices - uniq_ids_original_games
print(f'Number of unique game ids in prices: {len(uniq_ids_prices)}')
print(f'Number of unique game ids in original games: {len(uniq_ids_original_games)}')
print(f'Number of unique game ids in prices but not in original games: {len(price_id_not_in_games)}')

Number of unique game ids in prices: 98465
Number of unique game ids in original games: 98248
Number of unique game ids in prices but not in original games: 217
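
The set difference above tells us which IDs are orphaned. An alternative way to run the same referential-integrity check is an anti-join with `DataFrame.merge(..., indicator=True)`, which labels every fact row as matched or unmatched in one pass. The sketch below uses toy tables (`games_toy`, `prices_toy`) that are our own illustration, not the Steam data.

```python
import pandas as pd

# Toy stand-ins for the dimension (games) and fact (prices) tables.
games_toy = pd.DataFrame({"gameid": [10, 20, 30]})
prices_toy = pd.DataFrame({
    "gameid": [10, 10, 20, 40, 50],
    "usd": [9.99, 4.99, None, 1.99, 0.99],
})

# Left-join prices to the game keys; indicator=True adds a `_merge` column
# recording whether each price row found a matching game.
checked = prices_toy.merge(
    games_toy[["gameid"]].drop_duplicates(),
    on="gameid",
    how="left",
    indicator=True,
)

# Price rows whose gameid has no counterpart in the games table.
orphans = checked[checked["_merge"] == "left_only"].drop(columns="_merge")
# Price rows that can safely be joined to the games table.
matched = checked[checked["_merge"] == "both"].drop(columns="_merge")

print(sorted(orphans["gameid"].unique()))
print(len(matched))
```

The `isin` filter used in this notebook is simpler when only the filtered table is needed; the indicator approach is handy when we also want to inspect the rejected rows.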


In [355]:
gameid_not_in_games = prices[prices['gameid'].isin(list(price_id_not_in_games))]
prices = prices[prices['gameid'].isin(gamesdf['gameid'])] #remove these games
print("For interest, here is a snapshot of the gameids and prices for games not found in the games table\n")
gameid_not_in_games.head(2)


For interest, here is a snapshot of the gameids and prices for games not found in the games table



Unnamed: 0,gameid,usd,eur,gbp,jpy,rub,date_acquired
361,1423450,9.99,9.75,8.5,1200.0,385.0,2024-11-28
1816,1378820,0.99,0.99,0.84,117.0,41.0,2024-11-28


In [356]:
uniq_ids_prices = set(prices['gameid'])
uniq_ids_original_games = set(gamesdf['gameid']) #use the original games, before removing playtests
uniq_ids_updated_games = set(games['gameid'])
print(f'Number of unique game ids in new prices table: {len(uniq_ids_prices)}')
print(f'Number of unique game ids in original games: {len(uniq_ids_original_games)}')
print(f'Number of unique game ids in updated games: {len(uniq_ids_updated_games)}')

Number of unique game ids in new prices table: 98248
Number of unique game ids in original games: 98248
Number of unique game ids in updated games: 92965


#### Removing Playtest Games from the Prices Table

As part of the earlier cleaning process, we removed "Playtest" games from the `games` table due to missing metadata (e.g., no developer or publisher information). To ensure consistency across the model, we now remove any corresponding entries from the `prices` table.

These playtest games should not have valid price data — and as we confirm below, all associated `usd` prices are indeed missing (`NaN`). Removing them ensures that the `prices` table aligns with the cleaned `games` table, and helps us maintain a lean, reliable fact table free of placeholder or incomplete entries.


In [357]:
playtest_ids = playtest_games['gameid']  # game ids with playtest in 'title' in games df 
playtest_prices = prices[prices['gameid'].isin(playtest_ids)] #prices for games which are playtests
print(f'Number of rows in Playtests prices table: {playtest_prices.shape[0]}') #number of rows
print('Number of missing values per column:')
playtest_prices.isna().sum() #which of the playtest games are NaNs?

Number of rows in Playtests prices table: 233714
Number of missing values per column:


gameid                0
usd              233714
eur              233714
gbp              233714
jpy              233714
rub              233714
date_acquired         0
dtype: int64

In [358]:
prices = prices[~prices['gameid'].isin(playtest_ids)]#remove playtests

#### Removing Games Without Titles from the Prices Table

Earlier, we removed a small number of games from the `games` table that lacked a valid `title`, as these entries were incomplete and unusable. When comparing the number of unique `gameid`s between the `games` and `prices` tables, we find a small discrepancy (3 extra IDs in `prices`).

Upon inspection, these correspond to the same title-less games previously excluded. To maintain consistency, we remove these `gameid`s from the `prices` table as well — ensuring it remains aligned with the cleaned `games` dataset and avoids entries with missing reference information.


In [359]:
uniq_ids_prices = set(prices['gameid'])
uniq_ids_updated_games = set(games['gameid'])
# Check that the gameids in prices that don't appear in games are exactly the title-less games removed earlier
check = set(prices['gameid']) - set(games['gameid'])  == set(no_title_games['gameid']) 
print(f'Number of unique game ids in updated prices table: {len(uniq_ids_prices)}')
print(f'Number of unique game ids in updated games: {len(uniq_ids_updated_games)}')
print('The gameids in prices that dont appear in games are exactly the ones that have no title from before: ', check)

Number of unique game ids in updated prices table: 92968
Number of unique game ids in updated games: 92965
The gameids in prices that dont appear in games are exactly the ones that have no title from before:  True


In [360]:
prices = prices[prices['gameid'].isin(games['gameid'])]
uniq_ids_prices = set(prices['gameid'])
uniq_ids_updated_games = set(games['gameid'])
print(f'Number of unique game ids in updated prices table: {len(uniq_ids_prices)}')
print(f'Number of unique game ids in updated games: {len(uniq_ids_updated_games)}')
test = set(games['gameid']) == set(prices['gameid']) #check if all IDs in games are in prices and vice versa
print("All gameids in games also in prices and vice versa: ",test)

Number of unique game ids in updated prices table: 92965
Number of unique game ids in updated games: 92965
All gameids in games also in prices and vice versa:  True


Just a reminder of how the prices table looks:

In [121]:
prices.head(3)

Unnamed: 0,gameid,usd,eur,gbp,jpy,rub,date_acquired
5,3278740,5.99,5.85,5.1,720.0,228.0,2024-11-28
10,3270850,3.99,3.99,3.89,470.0,165.0,2024-11-28
15,3267350,,,,,,2024-11-28


#### Exploring Missing Values in the `prices` Table

We now continue with the `prices` table to check for missing values. Our first step is to identify rows that have a `NaN` in **any** of the currency columns. This helps us flag games that are missing price information in at least one currency.

As shown in the table below, missing values are widespread. Note that the percentages are computed over the rows that have at least one missing price, not over the full table: within that subset, `eur` is missing in 93.89% of rows, while the other four currencies are each missing in between 42.9% and 49.2% of rows. How incomplete price records should be handled therefore depends on the specific use case or analysis.

To streamline the analysis and reduce the impact of missing data, we focus exclusively on the `usd` column. It has the **fewest missing values** (668,743) and provides a consistent basis for comparisons across games. All future price-related analysis will therefore be conducted using USD-based prices.


In [361]:
rows_with_nan = prices[prices.isnull().any(axis=1)]
#print('')
print('Sample of table with at least one missing price:\n')
print(rows_with_nan.head(3))
print('\nPercentage of missing values per column: ')
miss_prices = missing_value_summary(rows_with_nan)
miss_prices

Sample of table with at least one missing price:

     gameid   usd  eur   gbp    jpy    rub date_acquired
15  3267350   NaN  NaN   NaN    NaN    NaN    2024-11-28
17  3266470  3.49  NaN  3.00  406.0  140.0    2024-11-28
22  3263370  0.99  NaN  0.89  120.0   42.0    2024-11-28

Percentage of missing values per column: 


Unnamed: 0,Number of Missing Values,Percent of Total Values
eur,1463601,93.89
rub,767321,49.22
jpy,681478,43.72
gbp,669734,42.96
usd,668743,42.9
gameid,0,0.0
date_acquired,0,0.0


In [362]:
cols_to_drop = ['eur','gbp','jpy','rub']
prices = prices.drop(columns= cols_to_drop)

#### Investigating Missing Prices

To better understand the `NaN` values in the `prices` table, we examine one such example (`gameid`=3267350). Upon checking its entry in the `games` table, we observe that it is tagged as **"Free to Play"** in the `genres` column — meaning players are not expected to pay for it.

It is important to note that while the `genres` field is curated (not user-defined), some inaccuracies still exist. In particular, there are games that are free to play but are not tagged explicitly as such, likely due to inconsistencies or incomplete updates in the dataset.

In cases where the "Free to Play" tag is present, we initially plan to impute the missing price as `0.00` to reflect the game's intended free status. While imputing zeros may skew overall price distributions, we will mitigate this by also introducing a new categorical variable, `Free to Play`, based on the game's genre.

This flag will allow us to filter or group games by payment model during price-based analysis, helping preserve analytical integrity despite the imputation.

For now, we take a closer look at the "Free to Play" games.

In [270]:
prices.head(5)

Unnamed: 0,gameid,usd,date_acquired
5,3278740,5.99,2024-11-28
10,3270850,3.99,2024-11-28
15,3267350,,2024-11-28
17,3266470,3.49,2024-11-28
20,3264110,2.99,2024-11-28


In [332]:
one_suspicious_game = prices[prices['gameid']==3267350]
games[games['gameid']==3267350]
#one_suspicious_game.head(2)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
15,3267350,Tiny Shooters,['madilumar'],['Thetinyverse'],"['Action', 'Free To Play']",['English'],2024-10-19


In [333]:
free_to_play = games[games['genres'].astype(str).str.contains('Free To Play', case=False)] 
prices_free_to_play = prices[prices['gameid'].isin(free_to_play['gameid'])]
prices_free_to_play[~prices_free_to_play['usd'].isna()].head(3)

Unnamed: 0,gameid,usd,date_acquired
340,1424420,24.99,2024-11-28
1009,1402280,14.39,2024-11-28
3100,1340180,8.79,2024-11-28



##### Investigating "Free to Play" Games

Before imputing prices for games tagged as "Free to Play", we take a closer look at these entries. There are 8,750 such games in the dataset — approximately 9% of the total.

We then check the `prices` table to determine whether all of these games truly lack price data. Interestingly, not all games labeled "Free to Play" have `NaN` values in the `usd` column. This may suggest that the game was once free but now requires payment, or that pricing varies by region.

To ensure consistency, we decide to remove the `"Free to Play"` genre tag from any game that has a valid price listed in the `usd` column. This helps the genre field more accurately reflect the game's current status rather than a historical or partial classification.

We begin by filtering on the `usd` column, and may revisit other currencies if needed based on the results.


In [364]:
free_to_play = games[games['genres'].astype(str).str.contains('Free To Play', case=False)] 
num_free_to_play = len(free_to_play)
print(f'Number of free to play games: {num_free_to_play}')
free_to_play_perc = (len(free_to_play) / len(games) )*100
print(f'Percentage of free to play games: {free_to_play_perc: .2f}%')
free_to_play.head(2)

Number of free to play games: 8750
Percentage of free to play games:  9.41%


Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
15,3267350,Tiny Shooters,['madilumar'],['Thetinyverse'],"['Action', 'Free To Play']",['English'],2024-10-19
28,3261860,CAIGE: Cum Overflow,['Zapt5454'],['Zapt5454'],"['Casual', 'Free To Play']",['English'],2024-10-18


An example of a game with 'Free to play' as one of the genres...

In [365]:
games[games['gameid']==1424420]

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
340,1424420,Warchief,['Honikou Games'],['Honikou Games'],"['Action', 'Indie', 'Strategy', 'Free To Play', 'Early Access']","['English', 'French', 'Italian', 'German', 'Spanish - Spain', 'Bulgarian', 'Czech', 'Danish', 'Dutch', 'Finnish', 'Greek', 'Hungarian', 'Indonesian', 'Japanese', 'Korean', 'Norwegian', 'Polish', 'Portuguese - Brazil', 'Portuguese - Portugal', 'Romanian', 'Russian', 'Simplified Chinese', 'Spanish - Latin America', 'Swedish', 'Thai', 'Traditional Chinese', 'Turkish', 'Ukrainian', 'Vietnamese']",2024-06-27


...as seen below, this game (first row) has at least one price.

In [366]:
free_to_play_prices = prices[prices['gameid'].isin(free_to_play['gameid'])]
free_to_play_prices[~free_to_play_prices['usd'].isna()].head(2)

Unnamed: 0,gameid,usd,date_acquired
340,1424420,24.99,2024-11-28
1009,1402280,14.39,2024-11-28


##### Identifying "Free to Play" Games with Prices on All Dates

While some games are tagged as "Free to Play" in the `genres` column, not all of them behave consistently in the `prices` table. To investigate further, we extract the games that have a recorded (non-missing) USD price on **every** date they appear.

These games have never truly had a missing price, suggesting that despite their genre tag, they have consistently required payment.

This list of game IDs will help us refine the "Free to Play" classification and ensure that subsequent price analysis is based on games whose genre and pricing behavior are consistent.


In [367]:
free_to_play = games[games['genres'].astype(str).str.contains('Free To Play', case=False)] 
free_to_play_ids = free_to_play['gameid']

free_to_play_prices = prices[prices['gameid'].isin(free_to_play_ids)]

# group by gameid and check if all USD prices are NOT NaN
games_with_all_prices_present = (
    free_to_play_prices
    .groupby('gameid')['usd']
    .apply(lambda x: x.notna().all())  
)


games_with_all_prices_present_ids = set(
    games_with_all_prices_present[games_with_all_prices_present].index  # boolean Series used directly as a mask
)

print(f"Number of 'Free to Play' games with prices on all dates: {len(games_with_all_prices_present_ids)}")


Number of 'Free to Play' games with prices on all dates: 65


##### Identifying "Free to Play" Games with Prices on All Dates cont...
There are 65 games with prices on all dates that are nevertheless tagged as "Free to Play". We remove this tag from their `genres` entries so that the genre better reflects how the game is actually sold.

In [285]:
pd.set_option('display.max_colwidth', None)

In [368]:
games[games['gameid']==236130]['genres']

87901    ['Adventure', 'Indie', 'Simulation', 'Strategy', 'Free To Play']
Name: genres, dtype: object

In [369]:
#the goal in this cell is just some clean up on the genres column so we can access the genres to remove the 'Free to play' tag.
#For instance, the genre is represented as a list inside a string: "['Action', 'Free To Play']" but we want it as a list ['Action', 'Free To Play']
#so we can loop through it
games['genres'] = games['genres'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
games.head(3)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
5,3278740,NEURO,['Revolt Games'],['Strategy First'],[Action],"['English', 'Russian']",2024-10-11
10,3270850,Keep Your Eyes Open,['Texerikus'],['Texerikus'],[Indie],['English'],2024-10-21
15,3267350,Tiny Shooters,['madilumar'],['Thetinyverse'],"[Action, Free To Play]",['English'],2024-10-19
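
The cell above relies on every string in `genres` being a valid Python literal. A slightly more defensive version of the same idea is sketched below; `parse_list_cell` is a hypothetical helper of our own, and the choice to map missing values to an empty list is an assumption (in this notebook, missing genres were already replaced with `['None stated']` before this step).

```python
import ast

import pandas as pd


def parse_list_cell(value):
    """Turn "['Action', 'Indie']" into ['Action', 'Indie']; pass real lists through."""
    if isinstance(value, list):
        return value
    if isinstance(value, str):
        try:
            parsed = ast.literal_eval(value)   # safe: evaluates literals only
        except (ValueError, SyntaxError):
            return [value]                     # not a literal: keep raw text as one item
        return parsed if isinstance(parsed, list) else [parsed]
    return []                                  # None / NaN: assumed to mean "no entries"


# Toy column mixing a stringified list, a real list, a missing value, and plain text.
raw = pd.Series(
    ["['Action', 'Free To Play']", ["Indie"], None, "Casual"],
    dtype=object,
)
parsed = raw.apply(parse_list_cell)
print(parsed.tolist())
```

`ast.literal_eval` is preferable to `eval` here because it only accepts literal structures and will not execute arbitrary code found in a CSV cell.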


In [370]:
target_ids = list(games_with_all_prices_present_ids)

# Function to remove 'Free To Play' from the genre column for those with prices
def remove_free_to_play(genres):
    return [g for g in genres if g != 'Free To Play']
games.loc[games['gameid'].isin(target_ids), 'genres'] = games.loc[games['gameid'].isin(target_ids), 'genres'].apply(remove_free_to_play)


In [381]:
games_with_all_prices_present2 = (
    free_to_play_prices
    .groupby('gameid')['usd']
    .apply(lambda x: x.notna().all())  # True if ALL usd prices are NOT NaN
)

In [371]:
#test
games[games['gameid']==236130]['genres']

87901    [Adventure, Indie, Simulation, Strategy]
Name: genres, dtype: object

#### Final Observations on "Free to Play" Tag Inconsistencies

After this clean-up, we found that some games still tagged as "Free to Play" show valid prices on later dates (for example, `gameid` 2964440 below has no price until 2024-12-06 and costs 1.99 USD from then on). This suggests that pricing models change over time and that the genre tags are not always promptly updated.

Rather than modifying the `genres` field any further, we rely on a newly created categorical variable, `is_free_to_play`, derived from price behavior: a game is flagged `Yes` only if its `usd` price is missing on every recorded date. This flag will be used for all price analysis and modeling, ensuring consistency while preserving the remaining genre metadata for reference.


Also note that 8,685 games now carry the "Free To Play" tag (8,750 minus the 65 corrected above).


In [388]:
free_to_play = games[games['genres'].astype(str).str.contains('Free To Play')] 
num_free_to_play = len(free_to_play)
print(f'Number of free to play games: {num_free_to_play: }')
free_to_play_perc = (len(free_to_play) / len(games) )*100
print(f'Percentage of free to play games: {free_to_play_perc: .2f}%')
free_to_play_ids = free_to_play['gameid']
free_to_play_prices = prices[prices['gameid'].isin(free_to_play_ids)]
free_to_play_prices.info()

Number of free to play games:  8685
Percentage of free to play games:  9.34%
<class 'pandas.core.frame.DataFrame'>
Index: 388383 entries, 15 to 4414269
Data columns (total 3 columns):
 #   Column         Non-Null Count   Dtype  
---  ------         --------------   -----  
 0   gameid         388383 non-null  int64  
 1   usd            449 non-null     float64
 2   date_acquired  388383 non-null  object 
dtypes: float64(1), int64(1), object(1)
memory usage: 11.9+ MB


In [387]:
games[games['gameid']==2964440]

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date
65683,2964440,Yapori Minigames,['Hayllon'],['Hayllon'],"[Casual, Free To Play]","['Portuguese - Brazil', 'English']",2024-10-31


In [389]:
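# Price rows for games still tagged 'Free To Play' that nonetheless carry a USD price
remaining_free_to_play = free_to_play_prices[~free_to_play_prices['usd'].isna()]
remaining_free_to_play['gameid'].nunique()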

19

In [386]:
remaining_free_to_play = free_to_play_prices[~free_to_play_prices['usd'].isna()]
prices[prices['gameid']==2964440]

Unnamed: 0,gameid,usd,date_acquired
65862,2964440,,2024-11-28
163681,2964440,,2024-11-30
261366,2964440,,2024-12-02
358946,2964440,,2024-12-04
456683,2964440,1.99,2024-12-06
554312,2964440,1.99,2024-12-08
652246,2964440,1.99,2024-12-10
749862,2964440,1.99,2024-12-12
847537,2964440,1.99,2024-12-14
945251,2964440,1.99,2024-12-16


In [396]:
# for each game, check if it had only NaN prices
games_with_only_nan_prices = (
    prices.groupby('gameid')['usd']
    .apply(lambda x: x.isna().all())
)

#games['is_free_to_play'] = games['gameid'].apply(lambda gid: 'Yes' if gid in set(games_with_only_nan_prices[games_with_only_nan_prices == True].index) else 'No')


In [399]:
# Keep the IDs in a set: membership checks are O(1), whereas a list is scanned for every row
target_ids = set(games_with_only_nan_prices[games_with_only_nan_prices].index)
games['is_free_to_play'] = games['gameid'].apply(lambda gid: 'Yes' if gid in target_ids else 'No')


In [432]:
games['is_free_to_play'].value_counts()/len(games)

is_free_to_play
No     0.842285
Yes    0.157715
Name: count, dtype: float64
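
To make the rule behind the flag concrete, here is a small self-contained sketch on toy data (the frames `prices_toy` and `games_toy` are illustrative only). It applies the same logic as above, with a vectorised `isin` lookup in place of the per-row `apply`: a game is flagged `Yes` only when every one of its `usd` observations is missing.

```python
import numpy as np
import pandas as pd

prices_toy = pd.DataFrame({
    "gameid": [1, 1, 2, 2, 3, 3],
    "usd": [np.nan, np.nan, np.nan, 1.99, 4.99, 4.99],
})
games_toy = pd.DataFrame({"gameid": [1, 2, 3]})

# True for a game only when all of its usd observations are NaN.
only_nan = prices_toy.groupby("gameid")["usd"].apply(lambda s: s.isna().all())
free_ids = only_nan[only_nan].index

# Vectorised membership test instead of a per-row Python lookup.
games_toy["is_free_to_play"] = np.where(
    games_toy["gameid"].isin(free_ids), "Yes", "No"
)
print(games_toy)
```

Game 2 is flagged `No` even though it starts out without a price, mirroring the behavior of games that were free at first and later gained a price.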

##### Decision on Free-to-Play Prices

Since we have created a clean `is_free_to_play` flag, we do not impute missing (`NaN`) prices for Free-to-Play games. 

In Power BI and later analysis, Free-to-Play games will be filtered and treated separately. Blank USD prices will remain for these entries, ensuring that pricing analysis focuses exclusively on paid games where price information is meaningful.

This approach preserves data integrity and avoids introducing artificial 0.00 values into the dataset.
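
A quick toy illustration of why we leave the prices blank: pandas aggregations skip `NaN` by default, whereas imputed zeros pull averages down. The frame `latest_toy` and its values are made up for this sketch.

```python
import numpy as np
import pandas as pd

latest_toy = pd.DataFrame({
    "gameid": [1, 2, 3, 4],
    "usd": [10.0, 20.0, np.nan, np.nan],
    "is_free_to_play": ["No", "No", "Yes", "Yes"],
})

# Average price over paid games only, selected via the flag.
paid = latest_toy[latest_toy["is_free_to_play"] == "No"]
mean_paid = paid["usd"].mean()

# Average price if the missing values had been imputed as 0.00 instead.
mean_zero_imputed = latest_toy["usd"].fillna(0).mean()

print(mean_paid, mean_zero_imputed)
```

With blanks preserved, the mean reflects paid games only (15.0 here); with zeros imputed, it halves to 7.5 because the two free games are counted as costing nothing.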


In [403]:
prices.head(1)

Unnamed: 0,gameid,usd,date_acquired
5,3278740,5.99,2024-11-28


### Verifying Latest Price Snapshot

Since we intend to use a snapshot of the latest prices for each game in our Power BI dashboard, we validate whether all games have their most recent price recorded on the same date. This check ensures that aggregations like minimum, maximum, and average price reflect a consistent time frame. While this wasn’t part of the initial cleaning, it emerged from iterative dashboard testing — highlighting the importance of refining assumptions as new insights arise.


In [43]:
prices['date_acquired'] = pd.to_datetime(prices['date_acquired'], errors='coerce')

# get the latest price date per game
latest_price_per_game = prices.groupby('gameid')['date_acquired'].max().reset_index()
latest_price_per_game.rename(columns={'date_acquired': 'latest_game_date'}, inplace=True)

#how many unique "latest" dates exist?
unique_latest_dates = latest_price_per_game['latest_game_date'].nunique()
print(f"Number of unique latest price dates: {unique_latest_dates}")

# Optional: Display how many games fall under each latest date
date_counts = latest_price_per_game['latest_game_date'].value_counts().sort_index()
print("\nNumber of games by their latest price date:")
print(date_counts)


Number of unique latest price dates: 1

Number of games by their latest price date:
latest_game_date
2025-02-24    92964
Name: count, dtype: int64


In [44]:
latest_price_per_game

Unnamed: 0,gameid,latest_game_date
0,10,2025-02-24
1,20,2025-02-24
2,30,2025-02-24
3,40,2025-02-24
4,50,2025-02-24
...,...,...
92959,3416510,2025-02-24
92960,3416630,2025-02-24
92961,3416810,2025-02-24
92962,3417390,2025-02-24
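
The check above only confirms that the latest dates agree; the snapshot itself is not extracted in this notebook. One possible way to pull the most recent price row per game is sketched below on toy data. The names `prices_toy` and `latest_snapshot` are hypothetical, and the approach shown is our own suggestion rather than the method used downstream.

```python
import pandas as pd

prices_toy = pd.DataFrame({
    "gameid": [1, 1, 2, 2],
    "usd": [9.99, 4.99, None, 1.99],
    "date_acquired": ["2024-11-28", "2025-02-24", "2024-11-28", "2025-02-24"],
})
prices_toy["date_acquired"] = pd.to_datetime(prices_toy["date_acquired"])

# Sort chronologically within each game, then keep only the last row per game.
latest_snapshot = (
    prices_toy
    .sort_values(["gameid", "date_acquired"])
    .drop_duplicates(subset="gameid", keep="last")
    .reset_index(drop=True)
)
print(latest_snapshot)
```

Because each game keeps exactly one row, the result can be joined to the games table without duplicating games.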


## Reviews table

We now look at the table with reviews. As per the documentation, this table has the following columns: 
* `reviewid` - unique reviewID (Serial Key)
* `playerid` - userID on the Steam platform who submitted the review
* `gameid` - gameID for which the review was written
* `review` - user-submitted review for the game
* `helpful` - number of users who found the review helpful
* `funny` - number of users who found the review funny
* `awards` - number of awards given to the review
* `posted` - timestamp of when the review was posted

Our main focus is the `gameid` since, later on, this can help in finding the number of reviews per game. We will drop the `review` column since it will not be useful in our analyses. 

In [1]:
reviewsdf = pd.read_csv('../data_steam/raw/reviews.csv')


In [405]:
reviews = reviewsdf.copy()
reviews.head(3)

Unnamed: 0,reviewid,playerid,gameid,review,helpful,funny,awards,posted
0,639543,76561198796340888,730,Goud gamę i have 3 vac ban acont but i stilll playj thiz gesme,0,0,0,2018-03-22
1,639544,76561198028706627,393380,---{ Graphics }---☐ You forget what reality is☑ Beautiful☐ Good☐ Decent☐ Bad☐ Don‘t look too long at it☐ MS-DOS---{ Gameplay }---☑ Very good☐ Good☐ It's just gameplay☐ Mehh☐ Watch paint dry instead☐ Just don't---{ Audio }---☑ Eargasm☐ Very good☐ Good☐ Not too bad☐ Bad☐ I'm now deaf---{ Audience }---☐ Kids☑ Teens☑ Adults☐ Grandma---{ PC Requirements }---☐ Check if you can run paint☐ Potato☑ Decent☐ Fast☐ Rich boi☐ Ask NASA if they have a spare computer---{ Game Size }---☐ Floppy Disk☐ Old Fashioned☐ Workable☐ Big☑ Will eat 10% of your 1TB hard drive☐ You will want an entire hard drive to hold it☐ You will need to invest in a black hole to hold all the data---{ Difficulty }---☐ Just press 'W'☐ Easy☐ Easy to learn / Hard to master☑ Significant brain usage☐ Difficult☐ Dark Souls---{ Grind }---☑ Nothing to grind☐ Only if u care about leaderboards/ranks☐ Isn't necessary to progress☐ Average grind level☐ Too much grind☐ You'll need a second life for grinding---{ Story }---☑ No Story☐ Some lore☐ Average☐ Good☐ Lovely☐ It'll replace your life---{ Game Time }---☐ Long enough for a cup of coffee☐ Short☐ Average☐ Long☑ To infinity and beyond---{ Price }---☐ It's free!☑ Worth the price☐ If it's on sale☐ If u have some spare money left☐ Not recommended☐ You could also just burn your money---{ Bugs }---☐ Never heard of☑ Minor bugs☐ Can get annoying☐ ARK: Survival Evolved☐ The game itself is a big terrarium for bugs---{ ? / 10 }---☐ 1☐ 2☐ 3☐ 4☐ 5☐ 6☐ 7☐ 8☑ 9☐ 10,0,0,0,2025-01-03
2,639545,76561198028706627,10,One of the best FPS games :),0,0,0,2012-05-13


In [406]:
reviews = reviews.drop(columns=['review'])
reviews.head(2)

Unnamed: 0,reviewid,playerid,gameid,helpful,funny,awards,posted
0,639543,76561198796340888,730,0,0,0,2018-03-22
1,639544,76561198028706627,393380,0,0,0,2025-01-03
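
With the text column gone, counting reviews per game is a plain group-by on `gameid`. The sketch below uses a small made-up frame (`reviews_demo` and its values are illustrative assumptions, not rows from the dataset) to show the shape of the result.

```python
import pandas as pd

# Toy stand-in for the trimmed reviews table (values are illustrative only)
reviews_demo = pd.DataFrame({
    "reviewid": [1, 2, 3, 4],
    "gameid": [730, 393380, 10, 730],
})

# One row per game with its review count, most-reviewed game first
reviews_per_game = (
    reviews_demo.groupby("gameid")
    .size()
    .rename("n_reviews")
    .sort_values(ascending=False)
    .reset_index()
)
print(reviews_per_game)
```

Games with zero reviews do not appear in this result. A left join from `games` would be needed to keep them.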


#### Checking if all gameids in Reviews table are in main games table
We want to check whether every `gameid` in the reviews table appears in our main `games` table. A review for a game we cannot identify would be problematic. We first check against the original games table, before any rows were removed.

In [407]:
original_unmatched_ids = set(reviewsdf['gameid']) - set(gamesdf['gameid'])
print(f"Total review gameids missing in original games: {len(original_unmatched_ids)}")


Total review gameids missing in original games: 161
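
The set difference above counts unmatched IDs. If the number of affected review rows is also of interest, one possible alternative is an anti-join using `merge` with `indicator=True`. This is only a sketch on made-up frames (`games_demo` and `reviews_demo` are assumptions), and it relies on `gameid` being unique in the games table so the join does not duplicate rows.

```python
import pandas as pd

games_demo = pd.DataFrame({"gameid": [10, 730]})
reviews_demo = pd.DataFrame({"reviewid": [1, 2, 3], "gameid": [730, 999, 999]})

# indicator=True adds a `_merge` column; 'left_only' marks reviews with no matching game
merged = reviews_demo.merge(games_demo, on="gameid", how="left", indicator=True)
orphan_reviews = merged[merged["_merge"] == "left_only"]

n_orphan_ids = orphan_reviews["gameid"].nunique()   # distinct unmatched gameids
n_orphan_rows = len(orphan_reviews)                 # review rows pointing at them
print(n_orphan_ids, n_orphan_rows)
```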


In [408]:
num_games_reviewed = set(reviews['gameid'])
tot_games = set(games['gameid'])
print(f'Number of games reviewed is: {len(num_games_reviewed)}')
print(f'Total number of games is: {len(tot_games)}')

Number of games reviewed is: 51910
Total number of games is: 92965


From the output above, not all games were reviewed (so some games have zero reviews). Next, we check whether all `gameid`s in `reviews` appear in `games`. There are 234 `gameid`s in the `reviews` table that do not appear in the main `games` table. This is concerning, as `games` is the main table holding all game-related information. Some of these rogue IDs may be the playtests we removed earlier.

In [409]:
#set(reviews['gameid'])
#set(reviews['gameid']) == set(games['gameid'])
review_id_not_in_games = list(set(reviews['gameid']) - set(games['gameid'])) #IDs in reviews not in games

print(f'Number of games in reviews table that do not appear in games table: {len(review_id_not_in_games)}')

Number of games in reviews table that do not appear in games table: 234


In [410]:
reviews_ids = set(reviews['gameid'])
games_ids = set(games['gameid'])
# IDs in reviews that don't exist in games
review_id_not_in_games = reviews_ids - games_ids
print("Number of games in reviews table that do not appear in games table:", len(review_id_not_in_games))  
unmatched_playtest_ids = review_id_not_in_games.intersection(playtest_ids)
print("Number of games in reviews not in games table but are in the playtest:", len(unmatched_playtest_ids))  
remaining_unmatched = review_id_not_in_games - unmatched_playtest_ids
print("Number of games in reviews that do not appear in games after removing playtests:", len(remaining_unmatched))  


Number of games in reviews table that do not appear in games table: 234
Number of games in reviews not in games table but are in the playtest: 72
Number of games in reviews that do not appear in games after removing playtests: 162


Of the unmatched IDs, 72 are playtest games. We remove these from the reviews for consistency, then continue to investigate which IDs are missing from the main games table.

In [411]:
reviews = reviews[~reviews['gameid'].isin(playtest_ids)]
# sanity check
games_ids = set(games['gameid'])
reviews_ids = set(reviews['gameid'])# Gameids in reviews (after playtest removal)
unmatched_review_ids = reviews_ids - games_ids # IDs in reviews that are not in games

print(f"Unmatched review gameids (post-playtest-removal): {len(unmatched_review_ids)}")  # Expect 162


Unmatched review gameids (post-playtest-removal): 162


We will also remove the no-title games, as we did for games and prices (in this case, only one of them appears in reviews).


In [57]:
no_title_games['gameid'].isin(reviews['gameid'])

2853     False
39705    False
77700     True
Name: gameid, dtype: bool

In [412]:
reviews = reviews[~reviews['gameid'].isin(no_title_games['gameid'])]

In [413]:
# sanity check
games_ids = set(games['gameid'])
reviews_ids = set(reviews['gameid'])# Gameids in reviews (after playtest and no title games removal)
unmatched_review_ids = reviews_ids - games_ids # IDs in reviews that are NOT in games

print(f"Unmatched review gameids (post-playtest-no-title-removal): {len(unmatched_review_ids)}")  # Expect 161


Unmatched review gameids (post-playtest-no-title-removal): 161


### Decision on games not appearing in main games table
We now have 161 games that appear in reviews but not in games. Since these games have no information associated with them in the games table, I am making the decision to remove them. This may be a case of missing data in the games table (perhaps de-listed games) or of incorrectly captured IDs in the reviews table. In any case, with no way to verify game information (title, developers, etc.), these reviews would be of little use, hence the decision.
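
Before dropping rows, it can help to see how much of the reviews table the decision touches. The following is a minimal sketch on made-up data (`games_demo` and `reviews_demo` are assumptions); the same mask could be computed on the real `reviews` and `games` frames.

```python
import pandas as pd

games_demo = pd.DataFrame({"gameid": [10, 730]})
reviews_demo = pd.DataFrame({"reviewid": [1, 2, 3, 4], "gameid": [730, 999, 10, 730]})

# True where the review points at a game we have metadata for
has_game = reviews_demo["gameid"].isin(games_demo["gameid"])

n_dropped = int((~has_game).sum())
pct_dropped = 100 * (~has_game).mean()
print(f"Rows that would be dropped: {n_dropped} ({pct_dropped:.2f}%)")
```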

In [414]:
# Keep only reviews whose gameid appears in the main games table
reviews = reviews[reviews['gameid'].isin(games['gameid'])].copy()


## Achievements 
The `achievements` table contains:
* `achievementid` - unique achievementID, constructed as gameID + '_' + achievementNotUniqueID
* `gameid` - unique gameID on the Steam platform
* `title` - achievement title
* `description` - description of how to unlock the achievement

For data consistency here, we will only keep the achievements that have `gameid`s in the main games table. We will also drop the `title` and `description` for our analyses.

In [69]:
achievementsdf = pd.read_csv('../data_steam/raw/achievements.csv')

In [70]:
achievements = achievementsdf.copy()

In [71]:
achievements.head()

Unnamed: 0,achievementid,gameid,title,description
0,2621440_ACH_FIRST_KILL,2621440,FIRST KILL,You should kill ONE enemy.
1,2621440_ACH_0_LEVEL_COMPLETED,2621440,TUTORIAL COMPLETED,You should complete tutorial.
2,2621440_ACH_1_LEVEL_COMPLETED,2621440,FIRST LEVEL,You should complete first level
3,2621440_ACH_2_LEVEL_COMPLETED,2621440,SECOND LEVEL,You should complete second level
4,2621440_ACH_3_LEVEL_COMPLETED,2621440,THIRD LEVEL,You should complete third level
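
Since `achievementid` is documented as gameID + '_' + a per-game ID, one optional consistency check is to confirm that the prefix before the first underscore equals `gameid`. This check is not part of the original notebook; the sketch below uses a made-up frame (`achievements_demo`) with one deliberately inconsistent row.

```python
import pandas as pd

achievements_demo = pd.DataFrame({
    "achievementid": [
        "2621440_ACH_FIRST_KILL",
        "2621440_ACH_0_LEVEL_COMPLETED",
        "999_ACH_X",  # deliberately inconsistent with its gameid
    ],
    "gameid": [2621440, 2621440, 2621440],
})

# Split once on the first underscore; the part before it should equal gameid
prefix = achievements_demo["achievementid"].str.split("_", n=1).str[0]
prefix_matches = prefix == achievements_demo["gameid"].astype(str)

mismatched = achievements_demo[~prefix_matches]
print(f"Rows whose achievementid prefix differs from gameid: {len(mismatched)}")
```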


In [72]:
# Keep the unmatched rows in case we want to come back and analyse them later
unmatched_achievements = achievements[~achievements['gameid'].isin(games['gameid'])].copy()
achievements = achievements[achievements['gameid'].isin(games['gameid'])].copy()
achievements = achievements.drop(columns=['title', 'description'])

In [73]:
achievements.head(2)

Unnamed: 0,achievementid,gameid
0,2621440_ACH_FIRST_KILL,2621440
1,2621440_ACH_0_LEVEL_COMPLETED,2621440


In [82]:
#achievements['achievementid'].nunique()
achievements[achievements['gameid'].isin(games['gameid'])]['achievementid'].nunique()


1935882

In [77]:
games['gameid'].nunique()

98248

In [62]:
unmatched_achievements['gameid'].nunique()

486

In [66]:
unmatched_achievements.head(3)

Unnamed: 0,achievementid,gameid,title,description
9761,2884310_AC_OpenTheDoor,2884310,Open the door,Replace the Bypass chip of the door control.
9762,2884310_AC_NANDDone,2884310,NAND Designer,Create a NAND gate.
9763,2884310_AC_Chapter1Done,2884310,Ghost in the Grid,Complete the Chapter 1.


## History 
The `history` table contains:
* `playerid` - unique userID on the Steam platform who earned achievementID
* `achievementid` - unique achievementID, constructed as gameID + '_' + achievementNotUniqueID
* `date_acquired` - timestamp of when the achievement was earned

For data consistency here, I will keep only the rows where the achievementid is in the cleaned achievements table.

In [418]:
historydf = pd.read_csv("../data_steam/raw/history.csv")

In [419]:
history = historydf.copy()

In [69]:
history.head(3)

Unnamed: 0,playerid,achievementid,date_acquired
0,76561198220441373,403640_ACH_1,2019-12-18 15:33:43
1,76561198220441373,403640_ACH_2,2019-12-18 23:49:51
2,76561198220441373,403640_ACH_3,2019-12-19 23:05:07


In [70]:
history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10693879 entries, 0 to 10693878
Data columns (total 3 columns):
 #   Column         Dtype 
---  ------         ----- 
 0   playerid       int64 
 1   achievementid  object
 2   date_acquired  object
dtypes: int64(1), object(2)
memory usage: 244.8+ MB


Luckily for this table, it appears there aren't any missing values. 

In [420]:
miss_per_col  = missing_value_summary(history)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
playerid,0,0.0
achievementid,0,0.0
date_acquired,0,0.0


There are 1928259 unique achievements in the cleaned `achievements` table and, from the output below, 1043018 of them (54.09%) have never been earned by any player in `history`.

In [421]:
# get sets of unique achievement IDs from both tables
achievement_ids_master = set(achievements['achievementid'])
achievement_ids_history = set(history['achievementid'])

unearned_achievements = achievement_ids_master - achievement_ids_history 

orphan_achievements = achievement_ids_history - achievement_ids_master 
perc_unearned = (len(unearned_achievements) / len(achievement_ids_master)) * 100 

print(f" Total unique achievements in achievements table: {len(achievement_ids_master)}")
print(f" Achievements never earned by any player: {len(unearned_achievements)}")
print(f" Percentage of unearned achievements: {perc_unearned:0.2f}%")
print(f" Achievements in history but missing in achievements table: {len(orphan_achievements)}")


print("\nSample unearned achievement IDs:")
print(list(unearned_achievements)[:5])


 Total unique achievements in achievements table: 1928259
 Achievements never earned by any player: 1043018
 Percentage of unearned achievements: 54.09%
 Achievements in history but missing in achievements table: 1747

Sample unearned achievement IDs:
['2938830_FBQTheFall', '738610_ach_24', '1363210_ACH_STORY_2_4', '509600_com.foggybus.battletime.playerpassedmissions_all', '1281360_award3']


From the check below, all 1747 `achievementid`s that appear in `history` but not in the cleaned `achievements` table are found in `unmatched_achievements`; that is, they belong to games that are not in our main `games` table.

In [422]:
unmatched_ids = set(unmatched_achievements['achievementid'])

orphans_in_unmatched = orphan_achievements.intersection(unmatched_ids)
print(f"Number of orphan achievements found in unmatched_achievements: {len(orphans_in_unmatched)}")


all_orphans_present = orphan_achievements.issubset(unmatched_ids)
print(f"Are all orphan achievements present in unmatched_achievements? {all_orphans_present}")


Number of orphan achievements found in unmatched_achievements: 1747
Are all orphan achievements present in unmatched_achievements? True


In [423]:
history_clean = history[history['achievementid'].isin(achievements['achievementid'])].copy()
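
In the same spirit as the sanity checks used for reviews, the filter can be verified by confirming that no orphan IDs remain and by counting how many rows were removed. The following is a small sketch on made-up frames (`achievements_demo` and `history_demo` are assumptions), not output from the real data.

```python
import pandas as pd

achievements_demo = pd.DataFrame({"achievementid": ["403640_ACH_1", "403640_ACH_2"]})
history_demo = pd.DataFrame({
    "playerid": [1, 1, 2],
    "achievementid": ["403640_ACH_1", "2884310_AC_OpenTheDoor", "403640_ACH_1"],
})

# Same filter as above: keep history rows whose achievement is in the cleaned table
history_clean_demo = history_demo[
    history_demo["achievementid"].isin(achievements_demo["achievementid"])
].copy()

rows_removed = len(history_demo) - len(history_clean_demo)
remaining_orphans = (
    set(history_clean_demo["achievementid"]) - set(achievements_demo["achievementid"])
)
print(f"Rows removed: {rows_removed}, orphan IDs remaining: {len(remaining_orphans)}")
```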

### Players
According to the documentation, the Players table contains the following:
* `playerid` - unique userID on the Steam platform
* `country` - the country in which the user resides
* `created` - date of creation of the gaming profile


In [8]:
playersdf = pd.read_csv("../data_steam/raw/players.csv")

In [9]:
players = playersdf.copy()

In [65]:
players.head()

Unnamed: 0,playerid,country,created
0,76561198287452552,Brazil,2016-03-02 06:14:20
1,76561198040436563,Israel,2011-04-10 17:10:06
2,76561198049686270,,2011-09-28 21:43:59
3,76561198155814250,Kazakhstan,2014-09-24 19:52:47
4,76561198119605821,,2013-12-26 00:25:50


There are 424683 unique player IDs (it is hard to say whether these are unique players, since one person can create multiple accounts). The `country` column has about 42% missing values. We will replace these with `Unknown` to avoid losing rows. The `created` column also has missing values (about 11%), which we cannot reliably impute, so we will leave them in for now and deal with them in the normalisation section.

In [10]:
players['playerid'].nunique()

424683

In [11]:
miss_per_col  = missing_value_summary(players)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
country,177868,41.88
created,47669,11.22
playerid,0,0.0
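
The helper `missing_value_summary` comes from `utils.data_utils` and its source is not shown in this notebook. Judging only from its output above, it might look roughly like the sketch below. The function name `missing_value_summary_sketch` and its body are assumptions for illustration, not the actual implementation.

```python
import pandas as pd

def missing_value_summary_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for utils.data_utils.missing_value_summary."""
    n_missing = df.isna().sum()
    pct_missing = (100 * n_missing / len(df)).round(2)
    summary = pd.DataFrame({
        "Number of Missing Values": n_missing,
        "Percent of Total Values": pct_missing,
    })
    # Columns with the most missing values first, as in the output above
    return summary.sort_values("Number of Missing Values", ascending=False)

# Made-up rows, not taken from the dataset
players_demo = pd.DataFrame({
    "playerid": [1, 2, 3, 4],
    "country": ["Brazil", None, None, "Israel"],
    "created": ["2016-03-02", None, "2011-09-28", "2013-12-26"],
})
summary = missing_value_summary_sketch(players_demo)
print(summary)
```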


In [428]:
players['country'] = players['country'].fillna('Unknown')
miss_per_col  = missing_value_summary(players)
miss_per_col

Unnamed: 0,Number of Missing Values,Percent of Total Values
created,47669,11.22
playerid,0,0.0
country,0,0.0


In [4]:
games = pd.read_csv('../data_steam/cleaned/games_cleaned.csv')

In [5]:
games.head(2)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date,is_free_to_play
0,3278740,NEURO,['Revolt Games'],['Strategy First'],['Action'],"['English', 'Russian']",2024-10-11,No
1,3270850,Keep Your Eyes Open,['Texerikus'],['Texerikus'],['Indie'],['English'],2024-10-21,No


## Purchased
We now turn our attention to the `purchased_games` table, which has the following data:
- `playerid` - unique userID
- `library` - a list of purchased games for the entire usage period

This table has information that will enable us to answer questions like how many players own a certain game which in turn helps us to judge the popularity of said game. The initial goal here is to make sure that the `playerid`s all appear in the `players` table and the `gameid`s appear in the main `games` table. 

It appears, from initial analysis, that not all players have a game in their library: there are 424683 unique players in the main `players` table but only 102548 unique players in the `purchased_games` table. This possibly indicates that some players sign up on the platform without purchasing a game.

In [41]:
purchased_games = pd.read_csv('../data_steam/raw/purchased_games.csv')
purchased_games['library'] = purchased_games['library'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
purchased_games.head(2)

Unnamed: 0,playerid,library
0,76561198060698936,"[60, 1670, 3830, 1600, 2900, 2910, 2920, 4800,..."
1,76561198287452552,"[10, 80, 100, 240, 2990, 6880, 6910, 6920, 698..."


In [43]:
print("Number of rows in players: ", len(players))
print("Number of unique ids in players: ", players["playerid"].nunique())
print("Number of unique ids in purchased: ", purchased_games["playerid"].nunique())

# Rows in purchased_games whose playerid also exists in players
m = purchased_games[purchased_games['playerid'].isin(players['playerid'])]
len(m)

Number of rows in players:  424683
Number of unique ids in players:  424683
Number of unique ids in purchased:  102548


102548

### Preparing for normalisation
Below we prepare the data for normalisation in the next notebook by reshaping it so that each row corresponds to one player-game pair.

In [44]:
purchased_games_exploded = purchased_games.explode('library')
purchased_games_exploded = purchased_games_exploded.rename(columns={'library': 'gameid'})

In [45]:
purchased_games_exploded.head()

Unnamed: 0,playerid,gameid
0,76561198060698936,60
0,76561198060698936,1670
0,76561198060698936,3830
0,76561198060698936,1600
0,76561198060698936,2900
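
One thing to watch with `explode`: a row whose `library` is an empty list or missing still produces one output row, with a missing `gameid`. Whether the raw file contains such rows is not shown here, so the sketch below uses made-up data (`purchased_demo` is an assumption) to show the behaviour and one way to handle it.

```python
import pandas as pd

purchased_demo = pd.DataFrame({
    "playerid": [1, 2, 3],
    "library": [[60, 1670], [], None],  # one full, one empty, one missing library
})

exploded = (
    purchased_demo.explode("library")
    .rename(columns={"library": "gameid"})
)

# Empty lists and missing libraries each explode to a single row with no gameid
n_empty = int(exploded["gameid"].isna().sum())

# Drop those rows and restore an integer dtype for joins against games.gameid
exploded = (
    exploded.dropna(subset=["gameid"])
    .astype({"gameid": "int64"})
    .reset_index(drop=True)
)
print(n_empty)
print(exploded)
```

Filtering with `isin` against the games table, as done later in this notebook, also removes rows with a missing `gameid`, but the column would then keep its `object` dtype.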


### Removing rogue games
There are 3818 game IDs in the purchased libraries that do not appear in the original `games` table (prior to any removals). We remove these games since we have no way of finding out their relevant information.

In [56]:
uniq_games = set(games['gameid'])
uniq_games_df = set(gamesdf['gameid'])
uniq_games_purchased = set(purchased_games_exploded['gameid'])
games_missing_from_games_dim = uniq_games_purchased - uniq_games
games_missing_from_games_dim_og = uniq_games_purchased - uniq_games_df
print('Unique games: ', len(uniq_games))
print('Not in games dim: ', len(games_missing_from_games_dim))
print('Not in original games dim: ', len(games_missing_from_games_dim_og))

Unique games:  98248
Not in games dim:  3818
Not in original games dim:  3818


In [58]:
# Remove games that don't appear in the original games table (gamesdf)
purchased_games_exploded = purchased_games_exploded[purchased_games_exploded['gameid'].isin(gamesdf['gameid'])]

In [61]:
purchased_games_exploded['gameid'].nunique()

37171

### Saving dataframes

In [429]:
prices.to_csv('../data_steam/cleaned/prices_cleaned.csv', index=False)
games.to_csv('../data_steam/cleaned/games_cleaned.csv', index=False)
achievements.to_csv('../data_steam/cleaned/achievements_cleaned.csv', index=False)
reviews.to_csv('../data_steam/cleaned/reviews_cleaned.csv', index=False)
players.to_csv('../data_steam/cleaned/players_cleaned.csv', index=False)
history_clean.to_csv('../data_steam/cleaned/history_cleaned.csv', index=False)
purchased_games_exploded.to_csv('../data_steam/cleaned/library_cleaned.csv', index=False)
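
A check like the one below could be run just before saving, to confirm that every foreign key in a child table resolves to a row in its parent table. It is a sketch only: `orphan_keys` is a hypothetical helper, and the demo frames are made up rather than taken from the dataset.

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> set:
    """Return key values present in `child` but absent from `parent`."""
    return set(child[key].dropna()) - set(parent[key])

games_demo = pd.DataFrame({"gameid": [10, 730]})
players_demo = pd.DataFrame({"playerid": [1, 2]})
library_demo = pd.DataFrame({"playerid": [1, 1, 3], "gameid": [10, 730, 10]})

checks = {
    "library.gameid -> games": orphan_keys(library_demo, games_demo, "gameid"),
    "library.playerid -> players": orphan_keys(library_demo, players_demo, "playerid"),
}
for name, orphans in checks.items():
    print(f"{name}: {len(orphans)} orphan key(s)")
```

An empty set for every relationship means joins in Power BI will not silently lose rows on that key.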

## Summary
This is what we have done in this notebook:
* Removed irrelevant or inconsistent records, such as playtest games

* Handled missing values appropriately across all key fields, with one exception: missing `created` dates in `players` are left in place for the normalisation step

* Converted stringified lists (e.g., developers, genres) into actual Python lists for later normalisation

* Ensured data consistency across tables (e.g., all references to gameid are valid and aligned)

The next step is to perform data normalisation to get the tables ready for Power BI.