# Project Overview: Data Normalization for Power BI Modeling
In this notebook, we focus on normalising the cleaned Steam dataset for use in Power BI, following best practices in dimensional modeling (star schema). The aim is to transform the dataset into a structure that enables accurate, scalable, and flexible analysis across developers, genres, publishers, and more.

### Objectives:
* Normalize multi-valued fields (developers, genres, publishers, languages) into clean, analyzable formats

* Create bridge tables to model many-to-many relationships between games and related entities

* Preserve referential integrity between fact and dimension tables

* Prepare clean, one-row-per-entity dimension tables (e.g., developers, genres)

* Align fact tables (prices, reviews, achievements) to the normalized games structure



In [1]:
import pandas as pd
import numpy as np
import sys
import ast
import os
sys.path.append(os.path.abspath(".."))
from utils.data_utils import create_dim_and_bridge, generate_date_dim  # utility functions for building the dimension, bridge and date tables

In [3]:
prices = pd.read_csv('../data_steam/cleaned/prices_cleaned.csv')
games = pd.read_csv('../data_steam/cleaned/games_cleaned.csv')
achievements = pd.read_csv('../data_steam/cleaned/achievements_cleaned.csv')
reviews = pd.read_csv('../data_steam/cleaned/reviews_cleaned.csv')
history = pd.read_csv('../data_steam/cleaned/history_cleaned.csv')
players = pd.read_csv('../data_steam/cleaned/players_cleaned.csv')

### Date
To enable proper time-series analysis in Power BI, we create a date dimension table for each date column in the data.
The games table has a release_date column recording when each game was released. The prices table has a date_acquired column which, according to the documentation, is the date on which the price was recorded. A game can have several date_acquired values because its price varies over time (for example, during sales), but it is expected to have exactly one release date. The same approach is applied to the review posting dates, the achievement unlock dates and the player account creation dates.
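
The source of generate_date_dim is not shown in this notebook. As a rough illustration of what such a helper might do, here is a minimal sketch that builds one row per calendar day between the earliest and latest date in a series. The function name sketch_date_dim is hypothetical, and the column names are an assumption based on the date tables described in this notebook.

```python
import pandas as pd

def sketch_date_dim(dates: pd.Series) -> pd.DataFrame:
    """Hypothetical stand-in for generate_date_dim: one row per calendar day."""
    # Unparseable values become NaT and are ignored when finding the range
    parsed = pd.to_datetime(dates, errors="coerce").dropna()
    # Normalize to midnight so the calendar carries no time-of-day component
    days = pd.date_range(parsed.min().normalize(), parsed.max().normalize(), freq="D")
    return pd.DataFrame({
        "date_id": days.year * 10000 + days.month * 100 + days.day,  # YYYYMMDD as an integer
        "Date": days,
        "Year": days.year,
        "Month": days.strftime("%b"),
        "MonthNumber": days.month,
        "YearMonth": days.strftime("%Y-%m"),
    })

# Made-up input: two valid dates two days apart and one bad value
example = sketch_date_dim(pd.Series(["2024-10-11", "2024-10-13", "not a date"]))
```

Generating a continuous range rather than only the observed dates is a design choice: Power BI time-intelligence functions generally expect a date table without gaps.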

In [8]:
date_release = generate_date_dim(games['release_date'])
date_price = generate_date_dim(prices['date_acquired'])
date_review = generate_date_dim(reviews['posted'])
date_achieved = generate_date_dim(history['date_acquired'])
date_created = generate_date_dim(players['created'])

### Developers and Games
First we deal with developers. As the example below shows, a game can have more than one developer (the same is true for publishers, genres and supported languages, which we handle later). The goal is a games dimension table, a developers dimension table, and a bridge table that resolves the many-to-many relationship between them.

In [9]:
games[games['gameid']==3255400]

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date,Free to play
17,3255400,Zone Meditation,"['Kevin Wacknov', 'Stuart Heller PhD 7th Dan']",['Mind Body Aware Games LLC'],"['Casual', 'Free To Play', 'Education']",['English'],2024-10-11,Yes


For the games dimension, we extract the gameid, title, Free to play and release_date columns from the games table, which gives a cleaner dimension. We then derive a date key (release_date_id) from release_date. It serves as a foreign key to the release date dimension created earlier, which carries the same key.


In [11]:
games_dim = games[['gameid', 'title', 'Free to play', 'release_date']].copy()
games_dim.rename(columns={
    'title': 'game_title',
    'Free to play': 'is_free_to_play'
}, inplace=True)
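games_dim['release_date'] = pd.to_datetime(games_dim['release_date'], errors='coerce')
# Drop games whose release date could not be parsed. Chaining .dropna() onto the
# assignment above has no effect, because pandas realigns the result on the index.
games_dim.dropna(subset=['release_date'], inplace=True)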
games_dim['release_date_id'] = games_dim['release_date'].dt.strftime('%Y%m%d').astype(int)
games_dim.drop(columns=['release_date'], inplace=True)

In [12]:
games_dim.head()

Unnamed: 0,gameid,game_title,is_free_to_play,release_date_id
0,3278740,NEURO,No,20241011
1,3270850,Keep Your Eyes Open,No,20241021
2,3267350,Tiny Shooters,Yes,20241019
3,3266470,Futanari Sex Adventures - Episode 5,No,20241017
4,3264110,AUTO_BATTLER_RPG,No,20241022


In [13]:
devs_dim, dev_game_map = create_dim_and_bridge(games, 'developers', 'developer')
publishers_dim, publisher_game_map = create_dim_and_bridge(games, 'publishers', 'publisher')
genres_dim, genre_game_map = create_dim_and_bridge(games, 'genres', 'genre')
languages_dim, language_game_map = create_dim_and_bridge(games, 'supported_languages', 'language')
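
The implementation of create_dim_and_bridge lives in utils/data_utils.py and is not reproduced here. The sketch below shows one way such a helper could work, assuming each cell holds a Python-style list stored as a string (as in the developers column shown earlier). The name sketch_dim_and_bridge, its parameters and the surrogate-key naming are all hypothetical.

```python
import ast
import pandas as pd

def sketch_dim_and_bridge(df, list_col, name, key="gameid"):
    """Hypothetical stand-in for create_dim_and_bridge."""
    pairs = df[[key, list_col]].copy()
    # Cells read from CSV are strings such as "['A', 'B']"; parse them into lists
    pairs[list_col] = pairs[list_col].apply(
        lambda v: ast.literal_eval(v) if isinstance(v, str) else []
    )
    # One row per (game, value) pair; empty lists become NaN and are dropped
    pairs = pairs.explode(list_col).dropna(subset=[list_col])
    pairs = pairs.rename(columns={list_col: name})
    # Dimension: one row per distinct value, with an integer surrogate key
    dim = pairs[[name]].drop_duplicates().sort_values(name).reset_index(drop=True)
    dim.insert(0, f"{name}_id", range(1, len(dim) + 1))
    # Bridge: the game key paired with the surrogate key
    bridge = pairs.merge(dim, on=name)[[key, f"{name}_id"]]
    bridge = bridge.drop_duplicates().reset_index(drop=True)
    return dim, bridge

# Made-up rows for illustration only
sample = pd.DataFrame({
    "gameid": [1, 2],
    "developers": ["['Studio A', 'Studio B']", "['Studio A']"],
})
dev_dim, dev_bridge = sketch_dim_and_bridge(sample, "developers", "developer")
```

In this sketch the dimension has two rows (Studio A and Studio B) and the bridge has three, one per game-developer pair. A malformed list string would make ast.literal_eval raise, so the real helper may need extra handling.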


In [34]:
prices.head()

Unnamed: 0,gameid,usd,date_acquired
0,3278740,5.99,2024-11-28
1,3270850,3.99,2024-11-28
2,3267350,0.0,2024-11-28
3,3266470,3.49,2024-11-28
4,3264110,2.99,2024-11-28


In [39]:
reviews.head()

Unnamed: 0,reviewid,playerid,gameid,helpful,funny,awards,posted
0,639543,76561198796340888,730,0,0,0,2018-03-22
1,639544,76561198028706627,393380,0,0,0,2025-01-03
2,639545,76561198028706627,10,0,0,0,2012-05-13
3,639546,76561198049356580,469600,0,0,0,2018-04-21
4,639547,76561198272817436,730,2,0,0,2020-01-23


Next we tidy the prices table and link it to the date_price dimension table. We create date_id, an integer key in YYYYMMDD form that links each price row to date_price, and drop date_acquired to remove the redundancy.

In [14]:
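prices_fact = prices.copy()
# Parse the dates, then build an integer YYYYMMDD key
prices_fact['date_acquired'] = pd.to_datetime(prices_fact['date_acquired'], errors='coerce')
# Note: astype(int) raises if any date failed to parse (NaT); the cleaned data is assumed complete
prices_fact['date_id'] = prices_fact['date_acquired'].dt.strftime('%Y%m%d').astype(int)
# The key replaces the raw date column
prices_fact.drop(columns=['date_acquired'], inplace=True)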
prices_fact.head()

Unnamed: 0,gameid,usd,date_id
0,3278740,5.99,20241128
1,3270850,3.99,20241128
2,3267350,0.0,20241128
3,3266470,3.49,20241128
4,3264110,2.99,20241128


In [16]:
reviews_fact = reviews.copy()
reviews_fact['posted'] = pd.to_datetime(reviews_fact['posted'], errors='coerce')
reviews_fact['date_id'] = reviews_fact['posted'].dt.strftime('%Y%m%d').astype(int)
reviews_fact.drop(columns=['posted'], inplace=True)
reviews_fact.head()

Unnamed: 0,reviewid,playerid,gameid,helpful,funny,awards,date_id
0,639543,76561198796340888,730,0,0,0,20180322
1,639544,76561198028706627,393380,0,0,0,20250103
2,639545,76561198028706627,10,0,0,0,20120513
3,639546,76561198049356580,469600,0,0,0,20180421
4,639547,76561198272817436,730,2,0,0,20200123


In [19]:
history_fact = history.copy()
history_fact['date_acquired'] = pd.to_datetime(history_fact['date_acquired'], errors='coerce')
history_fact['date_id'] = history_fact['date_acquired'].dt.strftime('%Y%m%d').astype(int)
history_fact.drop(columns=['date_acquired'], inplace=True)
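
The three fact tables above repeat the same date-key steps, and astype(int) will raise if any date fails to parse. A small helper could factor this out and tolerate missing dates. The sketch below is one possible version. The name to_date_key is hypothetical and is not part of utils/data_utils.py as far as this notebook shows.

```python
import pandas as pd

def to_date_key(dates: pd.Series) -> pd.Series:
    """Return YYYYMMDD keys as nullable integers; unparseable dates become <NA>."""
    parsed = pd.to_datetime(dates, errors="coerce")
    # Arithmetic on the date parts keeps missing values as NaN instead of raising
    key = parsed.dt.year * 10000 + parsed.dt.month * 100 + parsed.dt.day
    # The nullable Int64 dtype can hold integers and missing values side by side
    return key.astype("Int64")

# Made-up input with one bad value
keys = to_date_key(pd.Series(["2024-11-28", "bad value", "2012-05-13"]))
```

Whether rows with a missing key should stay in a fact table is a modeling decision. Keeping them as <NA> at least makes the gap visible instead of stopping the notebook.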

In [6]:
date_achieved.head()

Unnamed: 0,date_id,Date,Year,Month,MonthNumber,YearMonth
0,20080913,2008-09-13 01:37:54,2008,Sep,9,2008-09
1,20080914,2008-09-14 01:37:54,2008,Sep,9,2008-09
2,20080915,2008-09-15 01:37:54,2008,Sep,9,2008-09
3,20080916,2008-09-16 01:37:54,2008,Sep,9,2008-09
4,20080917,2008-09-17 01:37:54,2008,Sep,9,2008-09


In [20]:
players['created'] = pd.to_datetime(players['created'], errors='coerce')
players['date_created_id'] = players['created'].dt.strftime('%Y%m%d').astype('Int64')
players_dim = players[['playerid', 'country', 'date_created_id']].drop_duplicates().reset_index(drop=True)


## Summary
We have now transformed a messy raw dataset with multi-valued fields and mixed date logic into a clean, Power BI-ready star schema. The importance of getting the data into this form cannot be overstated, especially if we want to use Power BI's full capabilities seamlessly.
These are our tables now:
#### Fact Tables:

* prices_fact: Game price observations over time

* reviews_fact: Player reviews with helpful/funny/awards

* history_fact: When players unlocked achievements

#### Dimension Tables:

* games_dim: Game metadata with release date, F2P flag

* players_dim: Player IDs, country, account creation

* achievements_dim: Achievement details per game

* devs_dim, genres_dim, publishers_dim, languages_dim: Normalized lists

#### Bridge Tables (many-to-many):

* dev_game_map, genre_game_map, publisher_game_map, language_game_map

#### Date Dimensions:

* date_release, date_price, date_review, date_achieved, date_created

Each has date_id, Year, Month, etc.
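
Because the fact and bridge tables refer to games_dim through gameid, a quick check before export can confirm that no key is orphaned. The following is a minimal sketch of such a check. The helper orphan_keys is hypothetical and the demo rows are made up.

```python
import pandas as pd

def orphan_keys(child: pd.DataFrame, parent: pd.DataFrame, key: str) -> pd.Series:
    """Return the distinct key values in `child` that have no match in `parent`."""
    missing = ~child[key].isin(parent[key])
    return child.loc[missing, key].drop_duplicates().reset_index(drop=True)

# Made-up rows for illustration only
games_demo = pd.DataFrame({"gameid": [10, 730]})
reviews_demo = pd.DataFrame({"reviewid": [1, 2, 3], "gameid": [730, 10, 999]})
orphans = orphan_keys(reviews_demo, games_demo, "gameid")
```

Here orphans holds the single value 999. Applied to the real tables, a call such as orphan_keys(reviews_fact, games_dim, "gameid") returning an empty series would suggest the relationship is safe to define in Power BI.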

## Utility functions
Besides preparing the data for Power BI, reusable utilities were created:
* create_dim_and_bridge(): turns any multi-valued column into a dimension table plus a bridge table

* generate_date_dim(): creates a reusable date dimension from any datetime series

Both are stored in utils/data_utils.py and imported into the notebooks.



In [None]:
# Export dimensions
games_dim.to_csv("../data_steam/normalised/games_dim.csv", index=False)
devs_dim.to_csv("../data_steam/normalised/devs_dim.csv", index=False)
publishers_dim.to_csv("../data_steam/normalised/publishers_dim.csv", index=False)
genres_dim.to_csv("../data_steam/normalised/genres_dim.csv", index=False)
languages_dim.to_csv("../data_steam/normalised/languages_dim.csv", index=False)
players_dim.to_csv("../data_steam/normalised/players_dim.csv", index=False)
achievements.to_csv("../data_steam/normalised/achievements_dim.csv", index=False)

# Export bridge tables
dev_game_map.to_csv("../data_steam/normalised/dev_game_map.csv", index=False)
publisher_game_map.to_csv("../data_steam/normalised/publisher_game_map.csv", index=False)
genre_game_map.to_csv("../data_steam/normalised/genre_game_map.csv", index=False)
language_game_map.to_csv("../data_steam/normalised/language_game_map.csv", index=False)

# Export fact tables
prices_fact.to_csv("../data_steam/normalised/prices_fact.csv", index=False)
reviews_fact.to_csv("../data_steam/normalised/reviews_fact.csv", index=False)
history_fact.to_csv("../data_steam/normalised/history_fact.csv", index=False)

# Export date tables
date_release.to_csv("../data_steam/normalised/date_release.csv", index=False)
date_price.to_csv("../data_steam/normalised/date_price.csv", index=False)
date_review.to_csv("../data_steam/normalised/date_review.csv", index=False)
date_achieved.to_csv("../data_steam/normalised/date_achieved.csv", index=False)
date_created.to_csv("../data_steam/normalised/date_created.csv", index=False)  
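
The export cell above assumes the normalised folder already exists, since to_csv does not create missing directories. A loop over a dictionary of tables is one way to shorten the cell and create the folder on demand. This is only a sketch: export_tables is a hypothetical helper, and the demo writes a made-up table to a temporary folder.

```python
import tempfile
from pathlib import Path
import pandas as pd

def export_tables(tables: dict, out_dir) -> list:
    """Write each DataFrame to <out_dir>/<name>.csv, creating the folder if needed."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in tables.items():
        target = out_path / f"{name}.csv"
        frame.to_csv(target, index=False)
        written.append(target)
    return written

# Demo against a temporary folder with a made-up table
with tempfile.TemporaryDirectory() as tmp:
    paths = export_tables({"games_dim": pd.DataFrame({"gameid": [1]})}, Path(tmp) / "normalised")
    exported_names = [p.name for p in paths]
```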
