# SI 618 WN Project Part I

## Project Title:
> Provide a descriptive working title for your project.

**Game Sales, Popularity, and Achievements: A Comparative Analysis Across Platforms**

## Team Members
> List each team member and include their uniqname

* Yufeng Song (yfsong)
* Ziqi Wang (Venchy)
* Muyu Lin (linmuyu)

## Overview
> Give a high level description of your project

Our project explores key trends in the gaming industry by analyzing data on game sales, player achievements, and platform preferences. We aim to understand how the same game performs across different platforms, how purchasing preferences vary across countries, and how game popularity evolves over time. Additionally, we will investigate the relationship between supported languages and sales, as well as the impact of pricing on game popularity across multiple platforms. By analyzing these factors, we hope to uncover insights into player behavior, market trends, and factors influencing game success.

## Motivation
> Explain why you chose this particular topic for your project. Include the three "real-world" questions that you generated about the data, and be sure to explain what you hope to learn by answering them.

We chose this topic because gaming is a massive industry with a highly diverse audience, and understanding player preferences and market trends can provide valuable insights for developers, publishers, and gaming communities. Our project aims to answer the following real-world questions:

1. **How do in-game achievements compare across different platforms for the same game?**

  - By analyzing achievement data, we aim to understand whether players engage with a game differently depending on the platform they use. This could reveal differences in play styles, game difficulty adjustments, or platform-specific engagement trends.

2. **How do game purchase preferences vary by country and active playtime?**

  - We seek to determine whether purchasing behaviors differ based on regional preferences and player engagement levels. Understanding this can help developers tailor their marketing strategies and optimize game pricing for different audiences.

3. **How has the popularity of different game types changed over time?**

  - By tracking shifts in game genre popularity over time, we hope to uncover trends that indicate the rise and fall of specific genres. This insight could be useful for predicting future market demands and guiding game development strategies.

By answering these questions, we hope to gain a deeper understanding of the gaming landscape, helping stakeholders make data-driven decisions about game development, pricing, and distribution strategies.

## Data Sources
> List the two (or more) sources of data that you'll be using. Provide URLs where appropriate. **Explain how the two (or more) datasets complement each other.**

1. https://www.kaggle.com/datasets/artyomkruglov/gaming-profiles-2025-steam-playstation-xbox

2. https://github.com/Smipe-a/gamestatshub

- The Gaming Profiles dataset from Kaggle includes **game** and **player** data from three **platforms**: PlayStation, Steam, and Xbox.
- For each platform, the **player** and **game** data is complemented by **achievement** data. The relationships between players, games, and achievements are one-to-many: each player can play multiple games, and each game can have multiple achievements.
- Data from the three platforms can be combined based on **game titles** to analyze shared games across platforms.

## Data Description
> List the variables of interest, the size of the data sets, missing values, etc.

### Common Datasets Across 3 Platforms & Variables of Interest:

#### Player Metadata
- **players.csv**: platform-specific `playerid` and `country` (the Xbox data has no `country` column)
- **purchased_games.csv** lists players' purchased games, with:
    - `playerid`
    - `library`: a list of games the player bought.
- **history.csv** records when each player unlocked each achievement, with:
    - `playerid`
    - `achievementid`
    - `date_acquired`

#### Game Metadata
- **games.csv**: game details such as `genres`, `developers`, `publishers`, `supported_languages`, and `release_date`.
- **achievements.csv** maps achievements to their respective game, with:
    - `achievementid`: combines the unique game ID on the platform with the achievement ID within the game
    - `gameid`
    - `title`
    - `description`
- **prices.csv**: games' `price` in various currencies, with `date_acquired` indicating the date when the price was recorded.

## Data Manipulation
> Mostly code in this section.  This is where you merge your data sets, as well as create new columns (if appropriate)

#### Merged Datasets Explained - Flattened Table For Raw Analysis
- Platform-Specific Datasets:
    - Player-Info Dataset: merged on `playerid`
    - Game-Info Dataset: merged on `gameid`
- Cross-Platform Game Dataset: merged on game `title` to combine shared game data from all three platforms.


**Note**: the merged tables contain repeated rows because of the one-to-many relationships between players, games, and achievements.
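
To make the row repetition concrete, here is a minimal sketch on toy frames. The values are made up and only the column names mirror `games.csv` and `achievements.csv`. Passing `validate="one_to_many"` to `merge` is an optional check that `gameid` is unique on the left side; it is not part of the notebook's own merges.

```python
import pandas as pd

# Toy stand-ins for games.csv and achievements.csv (values are made up)
games = pd.DataFrame({"gameid": [1, 2], "title": ["Game A", "Game B"]})
achievements = pd.DataFrame({
    "achievementid": ["1_a", "1_b", "2_a"],
    "gameid": [1, 1, 2],
})

# validate="one_to_many" raises MergeError if gameid is not unique in `games`
game_info = games.merge(achievements, on="gameid", how="left", validate="one_to_many")

# Game A now appears twice, once per achievement
rows_per_game = game_info.groupby("title").size().to_dict()
print(rows_per_game)
```

Each game row is repeated once per matching achievement, which is why the flattened tables are much longer than `games.csv`.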

In [66]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## PS

In [67]:
import os

PS_DATA_FOLDER = "data/ps"

dfs = {}

for root, dirs, files in os.walk(PS_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"  # e.g. "ps_games"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")

ps_purchased_games - shape: (46582, 2)
ps_history - shape: (19510083, 3)
ps_prices - shape: (62816, 7)
ps_players - shape: (356600, 3)
ps_games - shape: (23151, 8)
ps_achievements - shape: (846563, 5)
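
The same loading loop is used for each platform folder, so it could be wrapped in a helper. The sketch below is one possible shape for it; `load_platform` is a hypothetical name, and the demo writes a tiny CSV to a temporary folder so it runs without the real data.

```python
import os
import tempfile
import pandas as pd

def load_platform(folder):
    """Read every CSV under `folder` into a dict keyed by '<folder name>_<file stem>'."""
    frames = {}
    for root, _dirs, files in os.walk(folder):
        for file in files:
            if file.endswith(".csv"):
                stem = os.path.splitext(file)[0]
                key = f"{os.path.basename(root)}_{stem}"
                frames[key] = pd.read_csv(os.path.join(root, file))
    return frames

# Demo on a temporary folder (hypothetical data, not the project files)
with tempfile.TemporaryDirectory() as tmp:
    ps_dir = os.path.join(tmp, "ps")
    os.makedirs(ps_dir)
    pd.DataFrame({"playerid": [1, 2]}).to_csv(os.path.join(ps_dir, "players.csv"), index=False)
    demo = load_platform(ps_dir)

print({name: df.shape for name, df in demo.items()})
```

With the real data this would be called as `load_platform("data/ps")`, `load_platform("data/steam")`, and `load_platform("data/xbox")`.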


In [68]:
ps_achievements = dfs['ps_achievements']
ps_games = dfs['ps_games']
ps_history = dfs['ps_history']
ps_players = dfs['ps_players']
ps_prices = dfs['ps_prices']
ps_purchased_games = dfs['ps_purchased_games']

### Player focus

In [69]:
ps_player_info = ps_players.merge(ps_history, on="playerid", how="left")

In [70]:
ps_player_info = ps_player_info.merge(ps_achievements, on="achievementid", how="left")

In [71]:
ps_player_info = ps_player_info.merge(ps_purchased_games, on="playerid", how="left")

In [72]:
ps_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19861683 entries, 0 to 19861682
Data columns (total 10 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   nickname       object 
 2   country        object 
 3   achievementid  object 
 4   date_acquired  object 
 5   gameid         float64
 6   title          object 
 7   description    object 
 8   rarity         object 
 9   library        object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.5+ GB
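
The merged player table occupies more than 1.5 GB, and most of its columns are stored as `object`. One possible way to shrink it, shown here on a toy frame with made-up values, is to convert repeated labels to `category` and timestamps to `datetime64`. How much this saves on the real table is untested here.

```python
import pandas as pd

# Toy rows shaped like ps_player_info (values are made up)
df = pd.DataFrame({
    "playerid": [1, 1, 2, 2],
    "country": ["Spain", "Spain", "United States", "United States"],
    "rarity": ["Bronze", "Silver", "Bronze", "Bronze"],
    "date_acquired": ["2015-08-31 01:15:00", "2012-01-23 02:35:41",
                      "2018-01-15 02:36:57", "2013-12-30 18:25:07"],
})

# Repeated labels become categories; timestamp strings become datetime64
df["country"] = df["country"].astype("category")
df["rarity"] = df["rarity"].astype("category")
df["date_acquired"] = pd.to_datetime(df["date_acquired"])

print(df.dtypes)
```

Comparing `df.memory_usage(deep=True).sum()` before and after the conversion on the real table would show whether the change is worthwhile.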


In [73]:
ps_player_info.sample(5)

Unnamed: 0,playerid,nickname,country,achievementid,date_acquired,gameid,title,description,rarity,library
12809754,1697762,Majin_OLI,Spain,7501_92204,2015-08-31 01:15:00,7501.0,First Step into Darkness,Complete the game on Casual difficulty.,Bronze,"[721802, 622669, 429727, 706550, 619583, 45776..."
5461554,340741,Boiro13,Spain,742_22561,2012-01-23 02:35:41,742.0,10 shots,Finish the 'The last hunt' mission having fi...,Silver,"[169937, 399413, 328918, 410650, 378968, 16824..."
9984156,337230,jamie-thorn,United States,11552_149562,2018-01-15 02:36:57,11552.0,Bee Plot,Got stung by a bee. It happens.,Bronze,"[624236, 437991, 602279, 20725, 333127, 607575..."
1440151,307263,Killah184,United States,5665_71411,2013-12-30 18:25:07,5665.0,Overthrown,Dethrone the King in an online match,Bronze,"[550305, 618458, 167482, 10020, 15572, 9475, 4..."
14804677,138265,FENIX-EL-DORADO,Spain,19863_188575,2017-11-23 20:49:58,19863.0,Stylist,"In the journey, add a cosmetic item to Alex Hu...",Silver,"[10020, 15085, 10403, 138763, 619584, 11928, 3..."


In [74]:
ps_player_info['playerid'].value_counts()

playerid
3065998    157477
336474      73075
381705      67034
333392      66380
435952      61346
            ...  
1171190         1
80539           1
1702479         1
1995222         1
140656          1
Name: count, Length: 356600, dtype: int64

### Game focus
The `date_acquired` column is dropped because it only records when the price information was extracted from multiple databases and does not add helpful insight to our analysis.

In [75]:
ps_game_info = ps_games.merge(ps_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [76]:
ps_game_info = ps_game_info.merge(ps_prices, on="gameid", how="left")

In [77]:
ps_game_info = ps_game_info.drop(columns=['date_acquired'])

In [78]:
ps_game_info.rename(columns={'platform': 'PS_platform'}, inplace=True)

In [79]:
ps_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [80]:
ps_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1289501 entries, 0 to 1289500
Data columns (total 17 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   gameid                   1289501 non-null  int64  
 1   title                    1289501 non-null  object 
 2   PS_platform              1289501 non-null  object 
 3   developers               1288417 non-null  object 
 4   publishers               1288833 non-null  object 
 5   genres                   1281054 non-null  object 
 6   supported_languages      655941 non-null   object 
 7   release_date             1289501 non-null  object 
 8   achievementid            1289501 non-null  object 
 9   achievement_title        1289493 non-null  object 
 10  achievement_description  1289469 non-null  object 
 11  rarity                   1289501 non-null  object 
 12  usd                      1101806 non-null  float64
 13  eur                      1047106 non-null 

PS-only cols: rarity, PS_platform

In [81]:
ps_game_info.sample(5)

Unnamed: 0,gameid,title,PS_platform,developers,publishers,genres,supported_languages,release_date,achievementid,achievement_title,achievement_description,rarity,usd,eur,gbp,jpy,rub
487997,145420,My Riding Stables - Life with Horses,PS4,['Independent Arts Software'],['Kalypso Media'],['Simulation'],"['French', 'Spanish', 'German', 'Italian', 'Du...",2018-11-13,145420_2690465,Host,Accommodate a guest in the guesthouse.,Bronze,3.99,,3.19,,
955816,589240,LocoRoco Midnight Carnival,PS5,['SIE Japan Studio'],['Sony Interactive Entertainment'],['Platformer'],"['Japanese', 'French', 'Spanish', 'German', 'I...",2022-07-19,589240_4881204,Flow,Start playing Bui Bui Fort 1.,Bronze,9.99,9.99,7.99,1100.0,
569467,13926,LASTFIGHT,PS4,['Piranaking'],['Piranaking'],['fighting'],"['Japanese', 'French', 'Spanish', 'German', 'I...",2016-09-20,13926_145859,Mind over matter!,Win the game after being one round down and ha...,Silver,14.99,14.99,11.99,,
824352,684077,Daxter,PS4,['Ready At Dawn'],['Sony Interactive Entertainment'],['Platformer'],"['Japanese', 'French', 'Spanish', 'German', 'I...",2024-06-18,684077_5526916,Pyromaniacal,Obtain the Flame Thrower upgrade from Taryn.,Bronze,9.99,9.99,7.99,,
1113564,446001,Bowling (Story Two) (Jane Version) - Project: ...,PS4,['Breakthrough Gaming'],['Breakthrough Gaming'],"['Sports', 'Bowling']",,2021-07-11,446001_4032987,"Get a final score of at least 10 in ""Play Bowl...","Get a final score of at least 10 in ""Play Bowl...",Gold,0.99,0.99,0.79,,


## Steam

In [82]:
import os
import pandas as pd

STEAM_DATA_FOLDER = "data/steam"

dfs = {}

for root, dirs, files in os.walk(STEAM_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"  # e.g. "steam_games"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")

steam_achievements = dfs['steam_achievements']
steam_games = dfs['steam_games']
steam_history = dfs['steam_history']
steam_players = dfs['steam_players']
steam_prices = dfs['steam_prices']
steam_purchased_games = dfs['steam_purchased_games']


steam_purchased_games - shape: (102548, 2)
steam_reviews - shape: (1204534, 8)
steam_history - shape: (10693879, 3)
steam_friends - shape: (424683, 2)
steam_prices - shape: (4414273, 7)
steam_players - shape: (424683, 3)
steam_games - shape: (98248, 7)
steam_private_steamids - shape: (227963, 1)
steam_achievements - shape: (1939027, 4)


### Player focus

In [83]:
st_player_info = steam_players.merge(steam_history, on="playerid", how="left")

In [84]:
st_player_info = st_player_info.merge(steam_achievements, on="achievementid", how="left")

In [85]:
st_player_info = st_player_info.merge(steam_purchased_games, on="playerid", how="left")

# # Save player-focused data
# st_player_info.to_csv('data/steam_players.csv', index=False)

In [86]:
st_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11113724 entries, 0 to 11113723
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   country        object 
 2   created        object 
 3   achievementid  object 
 4   date_acquired  object 
 5   gameid         float64
 6   title          object 
 7   description    object 
 8   library        object 
dtypes: float64(1), int64(1), object(7)
memory usage: 763.1+ MB


In [87]:
st_player_info.shape

(11113724, 9)

In [88]:
st_player_info.head(5)

Unnamed: 0,playerid,country,created,achievementid,date_acquired,gameid,title,description,library
0,76561198287452552,Brazil,2016-03-02 06:14:20,,,,,,"[10, 80, 100, 240, 2990, 6880, 6910, 6920, 698..."
1,76561198040436563,Israel,2011-04-10 17:10:06,,,,,,"[10, 80, 100, 300, 20, 30, 40, 50, 60, 70, 130..."
2,76561198049686270,,2011-09-28 21:43:59,,,,,,
3,76561198155814250,Kazakhstan,2014-09-24 19:52:47,,,,,,
4,76561198119605821,,2013-12-26 00:25:50,,,,,,"[47870, 108600, 550, 271590, 331470, 381210, 2..."


### Game focus

In [89]:
st_game_info = steam_games.merge(steam_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [90]:
st_game_info = st_game_info.merge(steam_prices, on="gameid", how="left")

In [91]:
st_game_info = st_game_info.drop(columns=['date_acquired'])

In [92]:
st_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [93]:
st_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89149729 entries, 0 to 89149728
Data columns (total 15 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  usd                      float64
 11  eur                      float64
 12  gbp                      float64
 13  jpy                      float64
 14  rub                      float64
dtypes: float64(5), int64(1), object(9)
memory usage: 10.0+ GB


In [94]:
st_game_info.shape

(89149729, 15)
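
The Steam game table reaches about 89 million rows. `steam_prices` has 4,414,273 rows for 98,248 games, which suggests many price snapshots per `gameid`, and each snapshot repeats every achievement row in the merge. If only one price per game is needed, one option is to keep the most recent snapshot before merging. The sketch below uses toy frames with made-up values and is an alternative to the merge above, not what the notebook currently does.

```python
import pandas as pd

# Toy stand-ins (values are made up); prices holds several snapshots per game
games = pd.DataFrame({"gameid": [10, 20], "title": ["Game A", "Game B"]})
prices = pd.DataFrame({
    "gameid": [10, 10, 10, 20],
    "usd": [9.99, 7.99, 4.99, 19.99],
    "date_acquired": ["2025-01-01", "2025-02-01", "2025-03-01", "2025-01-01"],
})

# Keep only the most recent snapshot per game
prices["date_acquired"] = pd.to_datetime(prices["date_acquired"])
latest_prices = (
    prices.sort_values("date_acquired")
          .drop_duplicates(subset="gameid", keep="last")
)

# One price row per game, so the merge no longer multiplies rows
game_info = games.merge(latest_prices, on="gameid", how="left", validate="one_to_one")
print(game_info[["title", "usd"]])
```

Keeping the full price history is still the right choice if the analysis needs price changes over time.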

## XBOX

In [95]:
import os
import pandas as pd

XBOX_DATA_FOLDER = "data/xbox"

dfs = {}

for root, dirs, files in os.walk(XBOX_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")


xbox_purchased_games - shape: (46466, 2)
xbox_history - shape: (15275900, 3)
xbox_prices - shape: (22638, 7)
xbox_players - shape: (274450, 2)
xbox_games - shape: (10489, 7)
xbox_achievements - shape: (351111, 5)


In [96]:
xbox_achievements = dfs['xbox_achievements']
xbox_games = dfs['xbox_games']
xbox_history = dfs['xbox_history']
xbox_players = dfs['xbox_players']
xbox_prices = dfs['xbox_prices']
xbox_purchased_games = dfs['xbox_purchased_games']

### Player focus

In [97]:
xb_player_info = xbox_players.merge(xbox_history, on="playerid", how="left")

In [98]:
xb_player_info = xb_player_info.merge(xbox_achievements, on="achievementid", how="left")

In [99]:
xb_player_info = xb_player_info.merge(xbox_purchased_games, on="playerid", how="left")

In [100]:
xb_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15545366 entries, 0 to 15545365
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   nickname       object 
 2   achievementid  object 
 3   date_acquired  object 
 4   gameid         float64
 5   title          object 
 6   description    object 
 7   points         float64
 8   library        object 
dtypes: float64(2), int64(1), object(6)
memory usage: 1.0+ GB


In [101]:
xb_player_info.shape

(15545366, 9)

### Game focus

In [102]:
xb_game_info = xbox_games.merge(xbox_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [103]:
xb_game_info = xb_game_info.merge(xbox_prices, on="gameid", how="left")

In [104]:
xb_game_info = xb_game_info.drop(columns=['date_acquired', 'points'])

In [105]:
xb_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [106]:
xb_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 646076 entries, 0 to 646075
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   gameid                   646076 non-null  int64  
 1   title                    646076 non-null  object 
 2   developers               608804 non-null  object 
 3   publishers               610118 non-null  object 
 4   genres                   608336 non-null  object 
 5   supported_languages      265610 non-null  object 
 6   release_date             646076 non-null  object 
 7   achievementid            646076 non-null  object 
 8   achievement_title        646074 non-null  object 
 9   achievement_description  645968 non-null  object 
 10  usd                      474444 non-null  float64
 11  eur                      453524 non-null  float64
 12  gbp                      469236 non-null  float64
 13  jpy                      0 non-null       float64
 14  rub 

## Game Info Across Platforms

In [107]:
len(ps_game_info['gameid'].unique())

23151

In [108]:
len(st_game_info['gameid'].unique())

98248

In [109]:
len(xb_game_info['gameid'].unique())

10489

In [110]:
ps_titles = set(ps_game_info['title'].str.lower().str.strip())
st_titles = set(st_game_info['title'].str.lower().str.strip())
xb_titles = set(xb_game_info['title'].str.lower().str.strip())

shared_ps_st = ps_titles.intersection(st_titles)
shared_ps_xb = ps_titles.intersection(xb_titles)
shared_st_xb = st_titles.intersection(xb_titles)
shared_all = ps_titles.intersection(st_titles, xb_titles)

print(f"Shared titles between PS & ST: {len(shared_ps_st)}")
print(f"Shared titles between PS & XB: {len(shared_ps_xb)}")
print(f"Shared titles between ST & XB: {len(shared_st_xb)}")
print(f"Shared across all three: {len(shared_all)}")


Shared titles between PS & ST: 5571
Shared titles between PS & XB: 5360
Shared titles between ST & XB: 5540
Shared across all three: 3815


In [111]:
def clean_game_info(df, platform):
    df = df.copy()
    
    # drop platform-specific columns (ps_game_info's 'rarity' and 'PS_platform')
    drop_cols = ['date_acquired', 'rarity', 'PS_platform']
    df = df.drop(columns=[col for col in drop_cols if col in df.columns], errors='ignore')

    # add platform column
    df['platform'] = platform
    
    return df

In [112]:
ps_game_info_clean = clean_game_info(ps_game_info, 'ps')

In [113]:
ps_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [114]:
st_game_info_clean = clean_game_info(st_game_info, 'st')

In [115]:
st_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [116]:
xb_game_info_clean = clean_game_info(xb_game_info, 'xb')

In [117]:
xb_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [118]:
shared_games = set(ps_game_info_clean['title']) & set(st_game_info_clean['title']) & set(xb_game_info_clean['title'])

ps_game_info_clean = ps_game_info_clean[ps_game_info_clean['title'].isin(shared_games)]
st_game_info_clean = st_game_info_clean[st_game_info_clean['title'].isin(shared_games)]
xb_game_info_clean = xb_game_info_clean[xb_game_info_clean['title'].isin(shared_games)]

merged_game_info = pd.concat([ps_game_info_clean, st_game_info_clean, xb_game_info_clean], ignore_index=True)

print(f"Merged dataset shape: {merged_game_info.shape}")

Merged dataset shape: (5808510, 16)
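
The shared-title counts earlier were computed on lower-cased, stripped titles, whereas the filter in this cell matches raw titles, so the two can disagree on titles that differ only in case or whitespace. A minimal sketch of filtering on a normalized key follows. `add_title_key` and the `title_key` column are hypothetical names, and the frames hold made-up values.

```python
import pandas as pd

def add_title_key(df):
    """Return a copy with a normalized `title_key` column (lower-cased, stripped)."""
    out = df.copy()
    out["title_key"] = out["title"].str.lower().str.strip()
    return out

# Toy frames (values are made up)
ps = add_title_key(pd.DataFrame({"title": ["Borderlands 2", "Daxter"]}))
st = add_title_key(pd.DataFrame({"title": ["borderlands 2 ", "Agony"]}))
xb = add_title_key(pd.DataFrame({"title": ["BORDERLANDS 2", "Agony"]}))

# Intersect on the normalized key, then keep matching rows from each platform
shared_keys = set(ps["title_key"]) & set(st["title_key"]) & set(xb["title_key"])
merged = pd.concat(
    [frame[frame["title_key"].isin(shared_keys)].assign(platform=name)
     for frame, name in [(ps, "ps"), (st, "st"), (xb, "xb")]],
    ignore_index=True,
)
print(merged)
```

Grouping later steps by `title_key` rather than `title` would then treat the three spellings as the same game.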


In [119]:
merged_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5808510 entries, 0 to 5808509
Data columns (total 16 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  usd                      float64
 11  eur                      float64
 12  gbp                      float64
 13  jpy                      float64
 14  rub                      float64
 15  platform                 object 
dtypes: float64(5), int64(1), object(10)
memory usage: 709.0+ MB


In [146]:
merged_game_info['achievementid'].value_counts()

achievementid
1729090_TIME_TRIAL_73                    45
517330_VisitMerchant1TimeAchievement     45
517330_VisitOverlord50TimeAchievement    45
517330_VisitOverlord25TimeAchievement    45
517330_VisitOverlord10TimeAchievement    45
                                         ..
617563_5085687                            2
617563_5085686                            2
617563_5085685                            2
617563_5085684                            2
711118_5703719                            2
Name: count, Length: 431807, dtype: int64

In [148]:
filtered_rows = merged_game_info[merged_game_info['achievementid'] == '1729090_TIME_TRIAL_73']
print(filtered_rows.head(3))

         gameid      title               developers               publishers  \
870514  1729090  Chameneon  ['Burning Goat Studio']  ['Burning Goat Studio']   
870515  1729090  Chameneon  ['Burning Goat Studio']  ['Burning Goat Studio']   
870516  1729090  Chameneon  ['Burning Goat Studio']  ['Burning Goat Studio']   

                                  genres                 supported_languages  \
870514  ['Adventure', 'Casual', 'Indie']  ['English', 'Portuguese - Brazil']   
870515  ['Adventure', 'Casual', 'Indie']  ['English', 'Portuguese - Brazil']   
870516  ['Adventure', 'Casual', 'Indie']  ['English', 'Portuguese - Brazil']   

       release_date          achievementid  achievement_title  \
870514   2021-12-16  1729090_TIME_TRIAL_73  Time Challenge 73   
870515   2021-12-16  1729090_TIME_TRIAL_73  Time Challenge 73   
870516   2021-12-16  1729090_TIME_TRIAL_73  Time Challenge 73   

             achievement_description   usd   eur   gbp    jpy   rub platform  
870514  Time to be

## Long and Tidy Form
A tidy dataset follows the principles:
1. Each column is a single variable.
2. Each row is a single observation.
3. Each cell contains a single value.

#### Issues with the current merged_game_info:
- The `platform` column is categorical, so the same game appears multiple times (once per platform). This is already long format, and **we leave it unchanged because our analysis is based on it**.
- The price columns (`usd`, `eur`, etc.) are separate columns rather than a single melted column, which makes the dataset wide rather than long and therefore not tidy.

In [121]:
# convert price columns into a long format
tidy_game_info = merged_game_info.melt(
    id_vars=[
        "gameid", "title", "developers", "publishers", "genres",
        "supported_languages", "release_date", "achievementid",
        "achievement_title", "achievement_description", "platform"
    ],
    value_vars=["usd", "eur", "gbp", "jpy", "rub"],
    var_name="currency",
    value_name="price"
)

In [122]:
tidy_game_info.shape

(29042550, 13)

In [123]:
tidy_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29042550 entries, 0 to 29042549
Data columns (total 13 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  platform                 object 
 11  currency                 object 
 12  price                    float64
dtypes: float64(1), int64(1), object(11)
memory usage: 2.8+ GB


In [139]:
tidy_game_info.sample(5)

Unnamed: 0,gameid,title,developers,publishers,genres,supported_languages,release_date,achievementid,achievement_title,achievement_description,platform,currency,price
940285,1844170,Galactic Wars EX,['VolcanoBytes'],['JanduSoft'],"['Action', 'Indie']",['English'],2022-04-21,1844170_ACHIEVEMENT_12,Untouchable,Complete the first mission without losing a si...,st,usd,7.99
21198747,534550,Guacamelee! 2,['DrinkBox Studios'],['DrinkBox Studios'],"['Action', 'Adventure', 'Indie']","['English', 'French', 'Italian', 'German', 'Po...",2018-08-21,534550_EAwards_12,Cluckstorm,Kill 50 enemies as chicken,st,jpy,2300.0
11598878,8525,Borderlands 2,['gearbox software'],['2K Games'],['shooter'],"['Japanese', 'French', 'Spanish', 'German', 'I...",2015-03-24,8525_357376,Did It All,Completed all side missions.,xb,eur,29.99
1880422,487720,Agony,['Madmind Studio'],['Playway S.A.'],"['Action', 'Adventure', 'Indie']","['English', 'French', 'German', 'Polish', 'Ita...",2018-05-29,487720_succubus_ending,Pull it out!,succubus ending,st,usd,14.99
3046341,427100,Fernbus Simulator,['TML-Studios'],['Aerosoft GmbH'],"['Casual', 'Simulation']","['English', 'German', 'French', 'Polish', 'Rus...",2016-08-25,427100_KILOMETERS_WITH_LIONS_COACH_10000,Lion Fan,Driven 10.000 km with a MAN Lion's Coach,st,usd,29.99


In [140]:
tidy_game_info['achievementid'].value_counts()

achievementid
1729090_TIME_TRIAL_73                    225
517330_VisitMerchant1TimeAchievement     225
517330_VisitOverlord50TimeAchievement    225
517330_VisitOverlord25TimeAchievement    225
517330_VisitOverlord10TimeAchievement    225
                                        ... 
617563_5085687                            10
617563_5085686                            10
617563_5085685                            10
617563_5085684                            10
711118_5703719                            10
Name: count, Length: 431807, dtype: int64

In [None]:
filtered_rows = tidy_game_info[tidy_game_info['achievementid'] == '617563_5085687']
print(filtered_rows.head(3))

         gameid             title                  developers  \
287852   617563  Tenebris Pictura  ['Pentadimensional Games']   
287853   617563  Tenebris Pictura  ['Pentadimensional Games']   
6096362  617563  Tenebris Pictura  ['Pentadimensional Games']   

                         publishers                genres  \
287852   ['Pentadimensional Games']  ['Action-Adventure']   
287853   ['Pentadimensional Games']  ['Action-Adventure']   
6096362  ['Pentadimensional Games']  ['Action-Adventure']   

                                       supported_languages release_date  \
287852   ['Japanese', 'French', 'Spanish', 'German', 'I...   2023-08-31   
287853   ['Japanese', 'French', 'Spanish', 'German', 'I...   2023-08-31   
6096362  ['Japanese', 'French', 'Spanish', 'German', 'I...   2023-08-31   

          achievementid     achievement_title  achievement_description  \
287852   617563_5085687  Tenebris Vitrum x 40  Earn 40 Tenebris Vitrum   
287853   617563_5085687  Tenebris Vitrum x 40

## Preprocessing

### 1. Check for missing values

In [124]:
# count missing values per column
null_counts = tidy_game_info.isnull().sum().reset_index()
null_counts.columns = ["col", "num_missing"]
null_counts["missing_pcnt"] = round((null_counts["num_missing"] / len(merged_game_info)) * 100, 2)

null_counts

Unnamed: 0,col,num_missing,missing_pcnt
0,gameid,0,0.0
1,title,0,0.0
2,developers,15850,0.27
3,publishers,79975,1.38
4,genres,24935,0.43
5,supported_languages,1588735,27.35
6,release_date,0,0.0
7,achievementid,40050,0.69
8,achievement_title,40325,0.69
9,achievement_description,4014060,69.11
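
In the cell above, the null counts come from `tidy_game_info` while the denominator is `len(merged_game_info)`. The melt produces five rows per original row (29,042,550 versus 5,808,510), so the percentages shown are relative to the pre-melt row count. A sketch that takes both numbers from the same frame, using a toy frame with made-up values:

```python
import numpy as np
import pandas as pd

# Toy frame (values are made up)
df = pd.DataFrame({
    "developers": ["A", None, "C", "D"],
    "price": [1.0, np.nan, np.nan, 4.0],
})

null_counts = pd.DataFrame({
    "num_missing": df.isna().sum(),
    # isna().mean() divides by the row count of the same frame
    "missing_pcnt": (df.isna().mean() * 100).round(2),
}).rename_axis("col").reset_index()
print(null_counts)
```

Using `isna().mean()` avoids naming a separate frame for the denominator.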


#### Here’s how we decide to handle missing values effectively [Used GPT-4o for tailoring to table format]:

**Columns with Minor Missing Data (< 5%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **developers** | 0.27% | Fill with "unknown" to maintain consistency. |
| **publishers** | 1.38% | Fill with "unknown" for completeness. |
| **genres** | 0.43% | Fill with "unknown", since games without a genre classification are rare. |
| **achievementid** | 0.69% | Likely an error or missing achievements, can be kept as NaN. |
| **achievement_title** | 0.69% | Likely corresponds to missing achievementid, can be kept as NaN. |

- **Action: Fill developers, publishers, and genres with "unknown", and leave missing achievements as NaN.**

---

**Columns with Moderate Missing Data (5% - 30%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **supported_languages** | 27.35% | Keep as NaN, since the language list is not recorded for these games. |

- **Action: Keep supported_languages as NaN. A missing value means the language list was not recorded, and filling it in would introduce incorrect data.**

---

**Columns with High Missing Data (> 50%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **achievement_description** | 69.11% | Fill missing values with "unknown". |
| **price** | 71.58% | Keep as NaN—missing prices could indicate unavailable data, regional limitations, or discontinued games. |

**Action:**
- **Fill achievement_description with "unknown"** to avoid empty fields.
- **Keep price as NaN**, since forcing imputation could lead to inaccurate pricing.

In [125]:
# fill missing values for categorical text columns
# (assign the result back instead of calling fillna(..., inplace=True) on a
# column selection, which pandas warns will stop working in 3.0)
text_cols = ["developers", "publishers", "genres", "achievement_description"]
tidy_game_info[text_cols] = tidy_game_info[text_cols].fillna("unknown")

## 2. Multi-valued columns

In [126]:
tidy_game_info.sample(5)

| Unnamed: 0 | gameid | title | developers | publishers | genres | supported_languages | release_date | achievementid | achievement_title | achievement_description | platform | currency | price |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3595266 | 2301100 | Reverie: Sweet As Edition | ['Rainbite Ltd'] | ['Eastasiasoft Limited'] | ['Action', 'Adventure', 'Indie'] | ['English', 'French', 'German', 'Spanish - Spa... | 2023-06-28 | 2301100_Snorkel | Ain't No River Wide Enough | Acquire the snorkel. | st | usd | 6.49 |
| 12348987 | 2620210 | Jinshin | ['Exe Create Inc.'] | ['KEMCO'] | ['Adventure', 'RPG', 'Simulation', 'Strategy'] | ['English', 'Japanese'] | 2023-12-21 | 2620210_5 | Onigami Kegare | unknown | st | gbp | 13.49 |
| 20530077 | 494730 | Galacticare | ['Brightrock Games'] | ['CULT Games'] | ['Simulation', 'Strategy'] | ['English', 'French', 'German', 'Spanish - Spa... | 2024-05-23 | 494730_bonus_mc2 | Doc Idol | unknown | st | jpy | 3110.0 |
| 10895556 | 220820 | Zombie Driver HD | ['Exor Studios'] | ['Exor Studios'] | ['Action', 'Indie', 'Racing'] | ['English', 'French', 'German', 'Italian', 'Po... | 2012-10-17 | 220820_kill_zombies_watercannon | It's not a weapon | Kill 100 zombies with a watercannon in one mis... | st | eur | 1.99 |
| 25278535 | 444930 | Zaccaria Pinball | ['Magic Pixel Kft.'] | ['Magic Pixel Kft.'] | ['Action', 'Casual', 'Free To Play', 'Indie', ... | ['English'] | 2016-06-16 | 444930_ACH_A_SPEEDKING_CHALLENGE_SILVER | Speed King - Challenge Silver | Collect 6 points in Challenge game mode on Spe... | st | rub | NaN |


In [127]:
tidy_game_info['title'].value_counts()

title
Zaccaria Pinball                          1084700
PAYDAY 2                                   299670
Assetto Corsa                              160775
The Binding of Isaac: Rebirth              146980
Idle Champions of the Forgotten Realms     141540
                                           ...   
Alice Sisters                                 405
Baldur's Gate: Dark Alliance                  375
Action SuperCross                             365
Star99                                        365
Skelattack                                    365
Name: count, Length: 3466, dtype: int64

#### Handling Multi-Valued Columns in Our Dataset [Used GPT-4o for tailoring grammar]

Our dataset contains multi-valued columns such as `developers`, `publishers`, `genres`, and `supported_languages`, which are stored as **list-like strings**. Instead of immediately expanding them into multiple rows (long format) or separate columns (one-hot encoding), we are **keeping them in their original format** for the following reasons:

1. **Data Integrity & Storage Efficiency**  
   - Expanding these fields would significantly increase row count, making storage and initial processing heavier.
   - Keeping them as list-like strings preserves all of their values within each existing row, without duplicating the other columns.

2. **Flexibility for Future Analysis**  
   - At later stages, we may **apply one-hot encoding** to `genres` and `supported_languages` for categorical analysis.  
   - Alternatively, we can derive **summary features** such as:
     - **Number of genres per game**
     - **Number of supported languages**
     - **Unique count of developers/publishers**
   - These derived features will allow for a more structured comparison across games.

By postponing transformation, we maintain efficiency while keeping the option open for structured feature extraction when needed.
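
When the transformation is eventually needed, the summary features and one-hot encoding described above could be derived roughly as follows. This is a sketch under assumptions: the toy rows and the helper name `parse_list` are hypothetical, and it assumes the list-like strings are valid Python literals that `ast.literal_eval` can read. Placeholder values such as "unknown" are treated as an empty list, while a missing `supported_languages` value keeps its count as NaN, in line with the decision not to impute it.

```python
import ast

import pandas as pd

# Toy rows standing in for tidy_game_info (titles and values are illustrative)
toy = pd.DataFrame({
    "title": ["Game A", "Game B", "Game C"],
    "genres": ["['Adventure', 'RPG', 'Strategy']", "['Action', 'Casual']", "unknown"],
    "supported_languages": ["['English', 'Japanese']", "['English']", None],
})


def parse_list(value):
    """Parse a list-like string; return [] for NaN, 'unknown', or unparseable text."""
    if not isinstance(value, str):
        return []
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []
    return list(parsed) if isinstance(parsed, (list, tuple)) else []


# Summary features: number of genres and number of supported languages
genre_lists = toy["genres"].map(parse_list)
toy["n_genres"] = genre_lists.map(len)
toy["n_languages"] = (
    toy["supported_languages"].map(parse_list).map(len)
    .where(toy["supported_languages"].notna())  # keep NaN where the list is missing
)

# One-hot encoding: explode to one genre per row, then collapse back per original row
genre_flags = pd.get_dummies(genre_lists.explode()).groupby(level=0).max()

print(toy[["title", "n_genres", "n_languages"]])
print(genre_flags)
```

Because the parsing happens only at analysis time, the stored dataset keeps its original row count and the expanded representation can be discarded after use.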


In [128]:
# drop() with inplace=True returns None, so assign the returned copy instead
# tidy_game_info_noids = tidy_game_info.drop(columns=['gameid', 'achievementid'])

## More notes on further merging

Our current **tidy dataset** focuses solely on game-related information. To enrich the analysis, we plan additional merges with player data to uncover insights about game popularity, completion rates, and player demographics.

### Planned Merges & Insights
1. **Game Popularity Analysis**  
   - Merge with `purchased_games.csv` (for each platform) to **count the number of players who own each game**.
   - This helps identify **best-selling games** and platform-specific purchase trends.

2. **Game Completion Rates**  
   - Merge with `history.csv` to **track when players earn end-game achievements**.
   - Calculate the **average time to completion** and **percentage of players finishing a game**.
   - Investigate if certain **genres have higher completion rates** (e.g., RPGs vs. casual games).
  
3. **Pricing & Sales Relationship**  
   - Merge `prices.csv` to analyze how **price fluctuations impact purchases**.

4. **Cross-Platform Player Behavior**  
   - Identify players who own **the same game across multiple platforms** to study cross-platform engagement.
   - Merge with `players.csv` and **group by country** to analyze **regional gaming preferences**.
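
As a rough illustration of the first planned merge, the sketch below counts owners per game. The schema of `purchased_games.csv` is an assumption here (one row per `playerid`/`gameid` pair), and both frames are toy stand-ins rather than the real files. Because `tidy_game_info` holds many rows per game, the game table is reduced to one row per `gameid` before merging; otherwise each owner would be counted once per achievement row.

```python
import pandas as pd

# Toy stand-in for tidy_game_info: several rows per game
game_rows = pd.DataFrame({
    "gameid": [1, 1, 1, 2, 2, 3],
    "title": ["Game A"] * 3 + ["Game B"] * 2 + ["Game C"],
})

# Hypothetical purchases table: one row per (playerid, gameid) pair
purchases = pd.DataFrame({
    "playerid": [10, 11, 12, 10],
    "gameid": [1, 1, 1, 2],
})

# One row per game, so the merge cannot inflate the counts
games = game_rows[["gameid", "title"]].drop_duplicates("gameid")

# Distinct owners per game
owners = (
    purchases.groupby("gameid")["playerid"]
    .nunique()
    .rename("n_owners")
    .reset_index()
)

# Left merge keeps games nobody owns; validate guards against duplicate keys
popularity = games.merge(owners, on="gameid", how="left", validate="one_to_one")
popularity["n_owners"] = popularity["n_owners"].fillna(0).astype(int)
popularity = popularity.sort_values("n_owners", ascending=False).reset_index(drop=True)

print(popularity)
```

The same pattern (aggregate first, then merge on a unique key) should carry over to the other planned merges, although the actual column names would need to be checked against each file.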


## [TODO] Data Visualization
> Be sure to include interpretations of your visualizations -- what patterns or anomalies do you see?


### Player Engagement Visualization: player-game-achievement

In [129]:
# Planned views: top 50 most popular games, active players,
# and top 50 most popular achievements.
# TODO: build player_game_achievement

# Count how many achievements each player unlocked per game
agg_data = (
    ps_player_info.groupby(["playerid", "gameid"])["achievementid"]
    .count()
    .reset_index(name="achievements_unlocked")
)

# Sample a subset (at most 10K rows) for visualization
sample_data = agg_data.sample(min(10000, len(agg_data)), random_state=42)