# Project Overview: Data Normalization for Power BI Modeling
In this notebook, we normalize the cleaned Steam dataset for use in Power BI, following dimensional-modeling best practices (star schema). The aim is to reshape the dataset into a structure that supports accurate, scalable, and flexible analysis across developers, genres, publishers, and more.

### Objectives:
* Normalize multi-valued fields (developers, genres, publishers, languages) into clean, analyzable formats

* Create bridge tables to model many-to-many relationships between games and related entities

* Preserve referential integrity between fact and dimension tables

* Prepare clean, one-row-per-entity dimension tables (e.g., developers, genres)

* Align fact tables (prices, reviews, achievements) to the normalized games structure



In [28]:
import pandas as pd
import numpy as np
import sys
import ast
import os
sys.path.append(os.path.abspath(".."))
from utils.data_utils import create_dim_and_bridge  # utility function for creating the dimension and bridge tables

In [44]:
prices = pd.read_csv('../steam/prices_cleaned.csv')
games = pd.read_csv('../steam/games_cleaned.csv')
achievements = pd.read_csv('../steam/achievements_cleaned.csv')
reviews = pd.read_csv('../steam/reviews_cleaned.csv')


### Date
To enable proper time-series analysis in Power BI, we create date dimension tables.
The games table has a release_date column recording when each game was released. The prices table has a date_acquired column which, according to the documentation, is the date on which the price was acquired. A game can have several date_acquired values because its price can change over time (for example, during sales), whereas each game is expected to have a single release date.

In [38]:
# Parse release dates; values that cannot be parsed become NaT
games['release_date'] = pd.to_datetime(games['release_date'], errors='coerce')

# One row per calendar day between the earliest and latest release
date_release = pd.DataFrame({
    'Date': pd.date_range(start=games['release_date'].min(), end=games['release_date'].max())
})
date_release['Year'] = date_release['Date'].dt.year
date_release['Month'] = date_release['Date'].dt.strftime('%b')
date_release['MonthNumber'] = date_release['Date'].dt.month
date_release['YearMonth'] = date_release['Date'].dt.strftime('%Y-%m')
# Integer key in yyyymmdd form, e.g. 20241011
date_release['date_id'] = date_release['Date'].dt.strftime('%Y%m%d').astype(int)

# date_acquired is still a string column here, so parse it before taking the range
price_dates = pd.to_datetime(prices['date_acquired'], errors='coerce')
date_price = pd.DataFrame({
    'Date': pd.date_range(start=price_dates.min(), end=price_dates.max())
})

date_price['date_id'] = date_price['Date'].dt.strftime('%Y%m%d').astype(int)
date_price['Year'] = date_price['Date'].dt.year
date_price['Month'] = date_price['Date'].dt.strftime('%b')
date_price['MonthNumber'] = date_price['Date'].dt.month
date_price['YearMonth'] = date_price['Date'].dt.strftime('%Y-%m')

# Put the key first for readability
date_price = date_price[['date_id', 'Date', 'Year', 'Month', 'MonthNumber', 'YearMonth']]
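The date tables in this notebook are all built with the same steps, so the logic could be factored into one helper. The following is a minimal sketch, not the notebook's own code: `build_date_dim` is a hypothetical name, and it assumes the input column holds dates that `pd.to_datetime` can parse.

```python
import pandas as pd

def build_date_dim(dates: pd.Series) -> pd.DataFrame:
    """Build a daily date dimension covering the min..max of `dates` (hypothetical helper)."""
    # Parse defensively and ignore values that could not be parsed
    parsed = pd.to_datetime(dates, errors="coerce").dropna()
    dim = pd.DataFrame({
        "Date": pd.date_range(start=parsed.min().normalize(),
                              end=parsed.max().normalize(), freq="D")
    })
    dim["date_id"] = dim["Date"].dt.strftime("%Y%m%d").astype(int)  # yyyymmdd key
    dim["Year"] = dim["Date"].dt.year
    dim["Month"] = dim["Date"].dt.strftime("%b")
    dim["MonthNumber"] = dim["Date"].dt.month
    dim["YearMonth"] = dim["Date"].dt.strftime("%Y-%m")
    return dim[["date_id", "Date", "Year", "Month", "MonthNumber", "YearMonth"]]

# Small demonstration with made-up dates, including one missing value
sample_dates = pd.Series(["2024-11-28", "2024-12-02", None])
demo_date_dim = build_date_dim(sample_dates)
```

With such a helper, each date table would reduce to a single call, for example `build_date_dim(prices['date_acquired'])`.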


### Developers and Games
First we deal with developers. As the example below shows, a game can have more than one developer (the same is true for publishers, genres, and supported languages, which we handle later). The goal is a games dimension table, a developers dimension table, and a bridge table that resolves the many-to-many relationship between them.

In [5]:
games[games['gameid']==3255400]

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date,Free to play
17,3255400,Zone Meditation,"['Kevin Wacknov', 'Stuart Heller PhD 7th Dan']",['Mind Body Aware Games LLC'],"['Casual', 'Free To Play', 'Education']",['English'],2024-10-11,Yes
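The row above suggests that the multi-valued columns are stored in the CSV as stringified Python lists. One way to turn such cells back into real lists is `ast.literal_eval`, followed by `DataFrame.explode` to get one row per value. This is a sketch under that assumption; `parse_list_cell` is a hypothetical helper, not part of the notebook.

```python
import ast
import pandas as pd

def parse_list_cell(value):
    """Turn a stringified list such as "['A', 'B']" into a Python list (hypothetical helper)."""
    if isinstance(value, list):
        return value
    if not isinstance(value, str):
        return []  # NaN or another missing value
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return [value]  # plain text that is not a list literal: keep as one name
    return list(parsed) if isinstance(parsed, (list, tuple)) else [parsed]

# Demonstration using the developers value from the row shown above
sample = pd.DataFrame({
    "gameid": [3255400],
    "developers": ["['Kevin Wacknov', 'Stuart Heller PhD 7th Dan']"],
})
sample["developers"] = sample["developers"].apply(parse_list_cell)
exploded = sample.explode("developers")  # one row per (game, developer) pair
```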


For the games dimension, we extract the gameid, title, and Free to play columns from the games table and rename the latter two for a cleaner dimension. We also create a date key (release_date_id) that serves as a foreign key to the release date dimension created earlier, which carries the matching date_id key.


In [11]:
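# Keep only the descriptive columns needed for the games dimension
games_dim = games[['gameid', 'title', 'Free to play', 'release_date']].copy()
games_dim.rename(columns={
    'title': 'game_title',
    'Free to play': 'is_free_to_play'
}, inplace=True)

# yyyymmdd integer key matching date_release['date_id'];
# astype(int) raises if any release_date failed to parse (NaT)
games_dim['release_date_id'] = games_dim['release_date'].dt.strftime('%Y%m%d').astype(int)
games_dim.drop(columns=['release_date'], inplace=True)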
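Because the release dates are parsed with `errors='coerce'`, any unparseable value becomes NaT, and a plain `astype(int)` on the formatted key would then fail. The sketch below shows one possible NaT-tolerant way to build the same yyyymmdd key using pandas' nullable `Int64` dtype. `to_date_key` is a hypothetical helper, and whether missing keys are acceptable in the model is a separate design decision.

```python
import pandas as pd

def to_date_key(dates: pd.Series) -> pd.Series:
    """Return yyyymmdd integer keys, with <NA> where the date is missing (hypothetical helper)."""
    parsed = pd.to_datetime(dates, errors="coerce")
    # Arithmetic form of the key: 2024-10-11 -> 2024*10000 + 10*100 + 11
    key = parsed.dt.year * 10000 + parsed.dt.month * 100 + parsed.dt.day
    return key.astype("Int64")  # nullable integer dtype keeps missing values

# Demonstration with one valid and one unparseable value
demo_keys = to_date_key(pd.Series(["2024-10-11", "not a date"]))
```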


In [12]:
games_dim.head()

Unnamed: 0,gameid,game_title,is_free_to_play,release_date_id
0,3278740,NEURO,No,20241011
1,3270850,Keep Your Eyes Open,No,20241021
2,3267350,Tiny Shooters,Yes,20241019
3,3266470,Futanari Sex Adventures - Episode 5,No,20241017
4,3264110,AUTO_BATTLER_RPG,No,20241022


In [32]:
devs_dim, dev_game_map = create_dim_and_bridge(games, 'developers', 'developer')
publishers_dim, publisher_game_map = create_dim_and_bridge(games, 'publishers', 'publisher')
genres_dim, genre_game_map = create_dim_and_bridge(games, 'genres', 'genre')
langs_dim, lang_game_map = create_dim_and_bridge(games, 'supported_languages', 'language')
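The implementation of `create_dim_and_bridge` lives in `utils.data_utils` and is not shown in this notebook. As an illustration only, a helper of this kind could work roughly as sketched below. The function name `sketch_dim_and_bridge`, the output column names (`<entity>_id`, `<entity>_name`), and the assumption that each cell is a valid stringified list are all hypothetical and may differ from the real utility.

```python
import ast
import pandas as pd

def sketch_dim_and_bridge(df, list_col, entity, key_col="gameid"):
    """Illustrative only: build a dimension and a bridge table from a list-valued column."""
    pairs = df[[key_col, list_col]].copy()
    # Assumption: cells are stringified lists such as "['A', 'B']"
    pairs[list_col] = pairs[list_col].apply(
        lambda v: ast.literal_eval(v) if isinstance(v, str) else v
    )
    # One row per (game, entity) pair; drop games with no value
    pairs = pairs.explode(list_col).dropna(subset=[list_col])
    pairs[list_col] = pairs[list_col].astype(str).str.strip()

    name_col, id_col = f"{entity}_name", f"{entity}_id"
    # Dimension: one row per distinct entity, with a generated integer key
    dim = (pairs[[list_col]].drop_duplicates()
           .sort_values(list_col).reset_index(drop=True)
           .rename(columns={list_col: name_col}))
    dim.insert(0, id_col, range(1, len(dim) + 1))

    # Bridge: (game key, entity key) pairs resolving the many-to-many relationship
    bridge = (pairs.rename(columns={list_col: name_col})
              .merge(dim, on=name_col)[[key_col, id_col]]
              .drop_duplicates().reset_index(drop=True))
    return dim, bridge

# Demonstration on made-up data: game 1 has two developers, game 2 shares one of them
demo_games = pd.DataFrame({"gameid": [1, 2], "developers": ["['A', 'B']", "['B']"]})
demo_dev_dim, demo_dev_bridge = sketch_dim_and_bridge(demo_games, "developers", "developer")
```

In this shape, the dimension holds each developer once, and the bridge table carries one row per game and developer pair, which is what lets Power BI filter games by developer without duplicating game rows.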


Next we tidy the prices table and link it to the date_price dimension. We create a date_id column that acts as the foreign key to date_price, then drop date_acquired to remove the redundancy.

In [42]:
prices.head()

Unnamed: 0,gameid,usd,date_acquired
0,3278740,5.99,2024-11-28
1,3270850,3.99,2024-11-28
2,3267350,0.0,2024-11-28
3,3266470,3.49,2024-11-28
4,3264110,2.99,2024-11-28


In [45]:
# Work on a copy so the original prices frame stays untouched
prices_fact = prices.copy()
prices_fact['date_acquired'] = pd.to_datetime(prices_fact['date_acquired'], errors='coerce')
# yyyymmdd integer key matching date_price['date_id']
prices_fact['date_id'] = prices_fact['date_acquired'].dt.strftime('%Y%m%d').astype(int)
prices_fact.drop(columns=['date_acquired'], inplace=True)
prices_fact.head()

Unnamed: 0,gameid,usd,date_id
0,3278740,5.99,20241128
1,3270850,3.99,20241128
2,3267350,0.0,20241128
3,3266470,3.49,20241128
4,3264110,2.99,20241128
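One of the stated objectives is to preserve referential integrity between fact and dimension tables. A simple way to check this is an anti-join: find fact rows whose key has no match in the dimension. The sketch below uses `DataFrame.merge` with `indicator=True`; `find_orphans` is a hypothetical helper and the demonstration data is made up.

```python
import pandas as pd

def find_orphans(fact: pd.DataFrame, dim: pd.DataFrame, key: str) -> pd.DataFrame:
    """Return fact rows whose `key` value does not appear in `dim` (hypothetical helper)."""
    dim_keys = dim[[key]].drop_duplicates()
    merged = fact.merge(dim_keys, on=key, how="left", indicator=True)
    # "left_only" marks rows found in the fact table but not in the dimension
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")

# Demonstration: gameid 30 has a price row but no dimension row
demo_fact = pd.DataFrame({"gameid": [10, 20, 30], "usd": [1.99, 0.0, 4.99]})
demo_dim = pd.DataFrame({"gameid": [10, 20]})
orphans = find_orphans(demo_fact, demo_dim, "gameid")
```

Applied to the tables in this notebook, the same check could be run as `find_orphans(prices_fact, games_dim, 'gameid')` and `find_orphans(prices_fact, date_price, 'date_id')`; an empty result would indicate that every fact row has a matching dimension row.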


In [46]:
reviews.head()


Unnamed: 0,reviewid,playerid,gameid,helpful,funny,awards,posted
0,639543,76561198796340888,730,0,0,0,2018-03-22
1,639544,76561198028706627,393380,0,0,0,2025-01-03
2,639545,76561198028706627,10,0,0,0,2012-05-13
3,639546,76561198049356580,469600,0,0,0,2018-04-21
4,639547,76561198272817436,730,2,0,0,2020-01-23


In [47]:
# Parse review dates, then build a contiguous daily date dimension for them
reviews['posted'] = pd.to_datetime(reviews['posted'], errors='coerce')
date_review = pd.DataFrame({
    'Date': pd.date_range(start=reviews['posted'].min(), end=reviews['posted'].max())
})

date_review['date_id'] = date_review['Date'].dt.strftime('%Y%m%d').astype(int)
date_review['Year'] = date_review['Date'].dt.year
date_review['Month'] = date_review['Date'].dt.strftime('%b')
date_review['MonthNumber'] = date_review['Date'].dt.month
date_review['YearMonth'] = date_review['Date'].dt.strftime('%Y-%m')

date_review = date_review[['date_id', 'Date', 'Year', 'Month', 'MonthNumber', 'YearMonth']]

# Fact table: replace the posted date with its yyyymmdd key into date_review
reviews_fact = reviews.copy()
reviews_fact['date_id'] = reviews_fact['posted'].dt.strftime('%Y%m%d').astype(int)
reviews_fact.drop(columns=['posted'], inplace=True)
reviews_fact.head()

Unnamed: 0,reviewid,playerid,gameid,helpful,funny,awards,date_id
0,639543,76561198796340888,730,0,0,0,20180322
1,639544,76561198028706627,393380,0,0,0,20250103
2,639545,76561198028706627,10,0,0,0,20120513
3,639546,76561198049356580,469600,0,0,0,20180421
4,639547,76561198272817436,730,2,0,0,20200123


In [None]:
prices.to_csv('../steam/prices_cleaned.csv', index=False)
games.to_csv('../steam/games_cleaned.csv', index=False)
achievements.to_csv('../steam/achievements_cleaned.csv', index=False)
reviews.to_csv('../steam/reviews_cleaned.csv', index=False)
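The cell above writes the four input frames back to their cleaned CSV files; the dimension, bridge, and fact tables built in this notebook are not exported in the cells shown. If they are to be loaded into Power BI from disk, they could be written out along the lines of the sketch below. `export_tables` is a hypothetical helper, and the output location is an assumption rather than a path used by the project.

```python
import tempfile
from pathlib import Path

import pandas as pd

def export_tables(tables: dict, out_dir) -> list:
    """Write each DataFrame in `tables` to <out_dir>/<name>.csv (hypothetical helper)."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for name, frame in tables.items():
        target = out_path / f"{name}.csv"
        frame.to_csv(target, index=False)  # index=False avoids an extra "Unnamed: 0" column
        written.append(target)
    return written

# Demonstration in a temporary directory with a made-up table
with tempfile.TemporaryDirectory() as tmp_dir:
    written_paths = export_tables({"demo_dim": pd.DataFrame({"id": [1, 2]})}, tmp_dir)
    round_trip = pd.read_csv(written_paths[0])
```

In the notebook, this might be called with a mapping such as `{'games_dim': games_dim, 'prices_fact': prices_fact, 'date_price': date_price}` and a directory of your choice.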