# Merging data

The data is split into 3 parts, each with 3 subparts, for 9 files in total. To get a complete dataset, we need to merge all of them.

In [2]:
import pandas as pd
import numpy as np

In [3]:
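# Collect the 9 partial files (3 parts x 3 subparts) before concatenating.
parts = []
for i in range(1, 4):
    for j in range(1, 4):
        filepath = f"./data/outputs/part_{i}_{j}_steam_data.csv"
        parts.append(pd.read_csv(filepath))

# Concatenate once; each part keeps its own row index.
raw_df = pd.concat(parts)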
raw_df.to_csv("./data/outputs/steam_data.csv", index=False)
raw_df.head()

,Game Title,Game Genre,Pricing,Publisher,Release Date,Platform,Rating,Number of Ratings
0,Mycelium,"Adventure, Indie, RPG, Strategy",$5.99,Alex Grim,Oct 22 2024,,100.0,12
1,Relic Keepers,"Action, Adventure, Indie",$0.99,Idea Cabin,Sep 12 2017,,13.0,15
2,OUTBRK,"Action, Adventure, Simulation, Strategy, Early...",$34.99,Sublime,Jun 28 2024,,78.0,1132
3,Whipseey and the Lost Atlas,"Action, Adventure, Indie",$5.99,Daniel A. Ramirez,Aug 27 2019,,62.0,24
4,TT Isle Of Man: Ride on the Edge 3,"Racing, Simulation, Sports",$49.99,Raceward Studio,May 11 2023,,75.0,308


# Understanding the data
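First we need a sense of the data's shape, structure, and basic statistics. Note that this cell was run after the deduplication step in Preprocessing (In [8] versus In [5]), so it reports 49229 rows rather than the 58320 rows of the raw merge.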

In [8]:
print("Data frame overview:")
raw_df.info()  # info() prints the summary itself and returns None

Data frame overview:
<class 'pandas.core.frame.DataFrame'>
Index: 49229 entries, 0 to 6493
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Game Title         49229 non-null  object 
 1   Game Genre         49229 non-null  object 
 2   Pricing            49229 non-null  object 
 3   Publisher          49229 non-null  object 
 4   Release Date       49229 non-null  object 
 5   Platform           0 non-null      float64
 6   Rating             49229 non-null  float64
 7   Number of Ratings  49229 non-null  int64  
dtypes: float64(2), int64(1), object(5)
memory usage: 3.4+ MB
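
The overview above covers shape and column types. For the basic statistics mentioned earlier, `describe()` is a common companion. The sketch below is only an illustration: `sample_df` is a hypothetical toy frame built from the five rows shown in the merge output, not the real dataset.

```python
import pandas as pd

# Hypothetical toy frame: values copied from the first five rows shown above.
sample_df = pd.DataFrame({
    "Rating": [100.0, 13.0, 78.0, 62.0, 75.0],
    "Number of Ratings": [12, 15, 1132, 24, 308],
})

# describe() summarises numeric columns: count, mean, std, min, quartiles, max.
stats = sample_df.describe()
print(stats)
```

On the full data the same call would be `raw_df.describe()`, which only covers the numeric columns until the `object` columns are converted.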


# Preprocessing

The steps to prepare data for merging with other datasets are:
- Remove duplicates
- Remove missing values
- Convert each column to an appropriate data type

### Handle duplicates

We check for duplicates in the `Game Title` column, keeping the first occurrence and removing the rest.

In [5]:
print(f"Before removing duplicates:\t{raw_df.shape}")
raw_df.drop_duplicates(subset=["Game Title"], keep="first", inplace=True)
print(f"After removing duplicates:\t{raw_df.shape}")

Before removing duplicates:	(58320, 8)
After removing duplicates:	(49229, 8)
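
To make the effect of `keep="first"` concrete, here is a minimal sketch on a hypothetical three-row frame. The titles are taken from the sample output above, but the second `Mycelium` row and its price are invented for illustration.

```python
import pandas as pd

# Hypothetical toy frame: the third row repeats the title of the first.
toy = pd.DataFrame({
    "Game Title": ["Mycelium", "OUTBRK", "Mycelium"],
    "Pricing": ["$5.99", "$34.99", "$4.99"],
})

# Only "Game Title" is compared; the first row for each title is kept.
deduped = toy.drop_duplicates(subset=["Game Title"], keep="first")
print(deduped)
```

Because only the title is compared, two different games that happen to share a title would also be collapsed into one row.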


### Handle missing values

We check for missing values in the dataset and remove the affected rows if a column's missing ratio is below a predefined threshold.

In [6]:
raw_df.isnull().sum()

Game Title               0
Game Genre               0
Pricing                  0
Publisher                0
Release Date             0
Platform             49229
Rating                   0
Number of Ratings        0
dtype: int64

There are no missing values except in the `Platform` column, which was intentionally left empty because Steam does not provide information about the platforms each game is available on. This column will be filled in later when merging with other sources.
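
The threshold rule described above could be sketched as follows. The notebook does not state the threshold, so `MISSING_THRESHOLD` and the toy frame are assumptions made purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical threshold; the notebook does not say which value is used.
MISSING_THRESHOLD = 0.5

# Hypothetical toy frame: "Rating" is 25% missing, "Platform" is 100% missing.
toy = pd.DataFrame({
    "Game Title": ["A", "B", "C", "D"],
    "Rating": [100.0, np.nan, 78.0, 62.0],
    "Platform": [np.nan, np.nan, np.nan, np.nan],
})

# Fraction of missing values per column.
missing_ratio = toy.isnull().mean()

# Columns with some missing values, but fewer than the threshold allows.
mask = (missing_ratio > 0) & (missing_ratio < MISSING_THRESHOLD)
cols_to_clean = missing_ratio[mask].index.tolist()

# Drop rows with gaps in those columns only; mostly-empty columns are left alone.
cleaned = toy.dropna(subset=cols_to_clean) if cols_to_clean else toy
print(cleaned)
```

Under this rule a fully empty column such as `Platform` is never used to drop rows, which matches the decision to keep it for a later merge.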

### Convert columns to appropriate data types

In [None]:
raw_df.dtypes

Game Title            object
Game Genre            object
Pricing               object
Publisher             object
Release Date          object
Platform             float64
Rating               float64
Number of Ratings      int64
dtype: object
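
Judging from the sample rows, `Pricing` and `Release Date` are the obvious candidates for conversion. The sketch below shows one possible approach on a hypothetical toy frame. It assumes prices look like `$5.99` and dates like `Oct 22 2024`, as in the sample output. Values in the full dataset that do not fit these patterns would become `NaN`/`NaT` and need separate handling.

```python
import pandas as pd

# Hypothetical toy frame using values from the sample rows shown earlier.
toy = pd.DataFrame({
    "Pricing": ["$5.99", "$0.99", "$34.99"],
    "Release Date": ["Oct 22 2024", "Sep 12 2017", "Jun 28 2024"],
})

# Strip the literal "$" and parse as float; unparseable values become NaN.
toy["Pricing"] = pd.to_numeric(
    toy["Pricing"].str.replace("$", "", regex=False), errors="coerce"
)

# Parse dates in "Mon DD YYYY" form; values that do not match become NaT.
toy["Release Date"] = pd.to_datetime(
    toy["Release Date"], format="%b %d %Y", errors="coerce"
)

print(toy.dtypes)
```

Using `errors="coerce"` keeps the conversion from failing on unexpected values, at the cost of needing a follow-up check for how many entries were coerced.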