# SI 618 WN Project Part I

## Project Title:
> Provide a descriptive working title for your project.

**Game Sales, Popularity, and Achievements: A Comparative Analysis Across Platforms**

## Team Members
> List each team member and include their uniqname

* Yufeng Song (yfsong)
* Ziqi Wang (Venchy)
* Muyu Lin (linmuyu)

## Overview
> Give a high level description of your project

Our project explores key trends in the gaming industry by analyzing data on game sales, player achievements, and platform preferences. We aim to understand how the same game performs across different platforms, how purchasing preferences vary across countries, and how game popularity evolves over time. Additionally, we will investigate the relationship between supported languages and sales, as well as the impact of pricing on game popularity across multiple platforms. By analyzing these factors, we hope to uncover insights into player behavior, market trends, and factors influencing game success.

## Motivation
> Explain why you chose this particular topic for your project. Include the three "real-world" questions that you generated about the data, and be sure to explain what you hope to learn by answering them.

We chose this topic because gaming is a massive industry with a highly diverse audience, and understanding player preferences and market trends can provide valuable insights for developers, publishers, and gaming communities. Our project aims to answer the following real-world questions:

1. **How do in-game achievements compare across different platforms for the same game?**

  - By analyzing achievement data, we aim to understand whether players engage with a game differently depending on the platform they use. This could reveal differences in play styles, game difficulty adjustments, or platform-specific engagement trends.

2. **How do game purchase preferences vary by country and active playtime?**

  - We seek to determine whether purchasing behaviors differ based on regional preferences and player engagement levels. Understanding this can help developers tailor their marketing strategies and optimize game pricing for different audiences.

3. **How has the popularity of different game types changed over time?**

  - By tracking shifts in game genre popularity over time, we hope to uncover trends that indicate the rise and fall of specific genres. This insight could be useful for predicting future market demands and guiding game development strategies.

By answering these questions, we hope to gain a deeper understanding of the gaming landscape, helping stakeholders make data-driven decisions about game development, pricing, and distribution strategies.

## **[TODO]** Data Sources
> List the two (or more) sources of data that you'll be using. Provide URLs where appropriate. **Explain how the two (or more) datasets complement each other.**

1. https://www.kaggle.com/datasets/artyomkruglov/gaming-profiles-2025-steam-playstation-xbox

2. https://github.com/Smipe-a/gamestatshub

## **[TODO]** Data Description
> List the variables of interest, the size of the data sets, missing values, etc.

## Data Manipulation
> Mostly code in this section.  This is where you merge your data sets, as well as create new columns (if appropriate)


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

## PS

In [4]:
import os
import pandas as pd

PS_DATA_FOLDER = "data/ps"

dfs = {}

for root, dirs, files in os.walk(PS_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"  # e.g. "ps_games"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")

ps_purchased_games - shape: (46582, 2)
ps_history - shape: (19510083, 3)
ps_prices - shape: (62816, 7)
ps_players - shape: (356600, 3)
ps_games - shape: (23151, 8)
ps_achievements - shape: (846563, 5)


In [5]:
ps_achievements = dfs['ps_achievements']
ps_games = dfs['ps_games']
ps_history = dfs['ps_history']
ps_players = dfs['ps_players']
ps_prices = dfs['ps_prices']
ps_purchased_games = dfs['ps_purchased_games']
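
The same loading loop is repeated for each platform in this notebook. A minimal sketch of a reusable helper, assuming each platform folder holds CSV files named like `games.csv` and `prices.csv` (the function name `load_platform_csvs` is hypothetical and not part of the original notebook):

```python
from pathlib import Path

import pandas as pd


def load_platform_csvs(folder):
    """Read every CSV under `folder` into a dict keyed by '<parent dir>_<file stem>'."""
    frames = {}
    for path in sorted(Path(folder).rglob("*.csv")):
        # e.g. data/ps/games.csv -> "ps_games"
        frames[f"{path.parent.name}_{path.stem}"] = pd.read_csv(path)
    return frames


# Example call, following the folder layout used in this notebook:
# dfs = load_platform_csvs("data/ps")
```

Calling it once per platform folder would give the same `dfs` dictionaries as the loops in this section.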

### Player focus

In [6]:
ps_player_info = ps_players.merge(ps_history, on="playerid", how="left")

In [7]:
ps_player_info = ps_player_info.merge(ps_achievements, on="achievementid", how="left")

In [8]:
ps_player_info = ps_player_info.merge(ps_purchased_games, on="playerid", how="left")

In [10]:
ps_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19861683 entries, 0 to 19861682
Data columns (total 10 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   nickname       object 
 2   country        object 
 3   achievementid  object 
 4   date_acquired  object 
 5   gameid         float64
 6   title          object 
 7   description    object 
 8   rarity         object 
 9   library        object 
dtypes: float64(1), int64(1), object(8)
memory usage: 1.5+ GB


In [11]:
ps_player_info.sample(5)

Unnamed: 0,playerid,nickname,country,achievementid,date_acquired,gameid,title,description,rarity,library
5867413,4476380,zerokingx,United States,473901_4222207,2023-08-24 21:03:06,473901.0,Win! Win!,Win a prize at the slot machine.,Gold,"[669, 554080, 727624, 625558, 442772, 143078, ..."
11208587,1282585,chibihey,Japan,2926_44403,2012-08-24 13:38:19,2926.0,Affinity and Beyond,Raised a person's affinity to the maximum.,Bronze,"[6097, 334008, 167756, 9311, 193357, 455317, 1..."
1208046,4562316,RauL_BiGMan,Brazil,3726_53614,2013-08-11 01:04:48,3726.0,Defender,Successfully defend FOB Spectre from incursion.,Bronze,"[550305, 682408, 569681, 589203, 550824, 43352..."
7693172,346427,nenjahplz,United States,195514_3142469,2021-04-01 07:29:49,195514.0,The Elusive Elder Dragon,Earn the right to take on three-star master ra...,Bronze,"[661659, 546924, 622641, 417808, 5891, 574336,..."
14736934,348774,bart1202,United Kingdom,189_9833,2009-07-20 20:58:09,189.0,Awesome Trophy!,Eliminate 250 Decepticons - Autobot Campaign,Bronze,"[14465, 404720, 400289, 20394, 139705, 9459, 2..."


### Game focus
The `date_acquired` column is dropped because it records the timestamp at which the price information was extracted from the source databases and does not add useful insight to our analysis.

In [12]:
ps_game_info = ps_games.merge(ps_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [13]:
ps_game_info = ps_game_info.merge(ps_prices, on="gameid", how="left")

In [None]:
ps_game_info = ps_game_info.drop(columns=['date_acquired'])

In [46]:
ps_game_info.rename(columns={'platform': 'PS_platform'}, inplace=True)

In [74]:
ps_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [75]:
ps_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1289501 entries, 0 to 1289500
Data columns (total 17 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   gameid                   1289501 non-null  int64  
 1   title                    1289501 non-null  object 
 2   PS_platform              1289501 non-null  object 
 3   developers               1288417 non-null  object 
 4   publishers               1288833 non-null  object 
 5   genres                   1281054 non-null  object 
 6   supported_languages      655941 non-null   object 
 7   release_date             1289501 non-null  object 
 8   achievementid            1289501 non-null  object 
 9   achievement_title        1289493 non-null  object 
 10  achievement_description  1289469 non-null  object 
 11  rarity                   1289501 non-null  object 
 12  usd                      1101806 non-null  float64
 13  eur                      1047106 non-null 

PS-only cols: rarity, PS_platform

## Steam

In [None]:
import os
import pandas as pd

STEAM_DATA_FOLDER = "data/steam"

dfs = {}

for root, dirs, files in os.walk(STEAM_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"  # e.g. "steam_games"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")

steam_achievements = dfs['steam_achievements']
steam_games = dfs['steam_games']
steam_history = dfs['steam_history']
steam_players = dfs['steam_players']
steam_prices = dfs['steam_prices']
steam_purchased_games = dfs['steam_purchased_games']


steam_purchased_games - shape: (102548, 2)
steam_reviews - shape: (1204534, 8)
steam_history - shape: (10693879, 3)
steam_friends - shape: (424683, 2)
steam_prices - shape: (4414273, 7)
steam_players - shape: (424683, 3)
steam_games - shape: (98248, 7)
steam_private_steamids - shape: (227963, 1)
steam_achievements - shape: (1939027, 4)


### Player focus

In [18]:
st_player_info = steam_players.merge(steam_history, on="playerid", how="left")

In [19]:
st_player_info = st_player_info.merge(steam_achievements, on="achievementid", how="left")

In [20]:
st_player_info = st_player_info.merge(steam_purchased_games, on="playerid", how="left")

# # Save player-focused data
# st_player_info.to_csv('data/steam_players.csv', index=False)

In [21]:
st_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11113724 entries, 0 to 11113723
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   country        object 
 2   created        object 
 3   achievementid  object 
 4   date_acquired  object 
 5   gameid         float64
 6   title          object 
 7   description    object 
 8   library        object 
dtypes: float64(1), int64(1), object(7)
memory usage: 763.1+ MB


In [22]:
st_player_info.shape

(11113724, 9)

### Game focus

In [67]:
st_game_info = steam_games.merge(steam_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [68]:
st_game_info = st_game_info.merge(steam_prices, on="gameid", how="left")

In [70]:
st_game_info = st_game_info.drop(columns=['date_acquired'])

In [71]:
st_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [72]:
st_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 89149729 entries, 0 to 89149728
Data columns (total 15 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  usd                      float64
 11  eur                      float64
 12  gbp                      float64
 13  jpy                      float64
 14  rub                      float64
dtypes: float64(5), int64(1), object(9)
memory usage: 10.0+ GB


In [27]:
st_game_info.shape

(89149729, 15)
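
The jump to roughly 89 million rows appears to come from the second merge: `steam_prices` has about 4.4 million rows for 98,248 games, so a left merge on `gameid` repeats each achievement row once per price record of its game. If a single price per game is enough, one option is to keep only the most recent price record before merging. A minimal sketch on toy data, assuming the price table carries `gameid`, `usd`, and `date_acquired` columns (the toy frames and the name `latest_prices` are illustrative only):

```python
import pandas as pd

achievements = pd.DataFrame({
    "gameid": [1, 1, 2],
    "achievementid": ["1_a", "1_b", "2_a"],
})
prices = pd.DataFrame({
    "gameid": [1, 1, 1, 2],
    "usd": [19.99, 14.99, 9.99, 4.99],
    "date_acquired": ["2025-01-01", "2025-02-01", "2025-03-01", "2025-01-01"],
})

# Merging directly repeats each achievement once per price record of its game.
exploded = achievements.merge(prices, on="gameid", how="left")

# Keep only the most recent price record per game, then merge.
latest_prices = (
    prices.assign(date_acquired=pd.to_datetime(prices["date_acquired"]))
    .sort_values("date_acquired")
    .drop_duplicates(subset="gameid", keep="last")
)
# validate="many_to_one" raises if the price table still has duplicate gameids.
compact = achievements.merge(latest_prices, on="gameid", how="left", validate="many_to_one")

print(len(exploded), len(compact))
```

Whether the latest record is the right choice depends on the question: an analysis of price changes over time would need the full price history instead.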

## XBOX

In [28]:
import os
import pandas as pd

XBOX_DATA_FOLDER = "data/xbox"

dfs = {}

for root, dirs, files in os.walk(XBOX_DATA_FOLDER):
    for file in files:
        if file.endswith(".csv"):
            file_path = os.path.join(root, file)
            df_name = f"{os.path.basename(root)}_{os.path.splitext(file)[0]}"  # e.g. "xbox_games"
            dfs[df_name] = pd.read_csv(file_path)

for name, df in dfs.items():
    print(f"{name} - shape: {df.shape}")


xbox_purchased_games - shape: (46466, 2)
xbox_history - shape: (15275900, 3)
xbox_prices - shape: (22638, 7)
xbox_players - shape: (274450, 2)
xbox_games - shape: (10489, 7)
xbox_achievements - shape: (351111, 5)


In [29]:
xbox_achievements = dfs['xbox_achievements']
xbox_games = dfs['xbox_games']
xbox_history = dfs['xbox_history']
xbox_players = dfs['xbox_players']
xbox_prices = dfs['xbox_prices']
xbox_purchased_games = dfs['xbox_purchased_games']

### Player focus

In [30]:
xb_player_info = xbox_players.merge(xbox_history, on="playerid", how="left")

In [31]:
xb_player_info = xb_player_info.merge(xbox_achievements, on="achievementid", how="left")

In [32]:
xb_player_info = xb_player_info.merge(xbox_purchased_games, on="playerid", how="left")

In [33]:
xb_player_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15545366 entries, 0 to 15545365
Data columns (total 9 columns):
 #   Column         Dtype  
---  ------         -----  
 0   playerid       int64  
 1   nickname       object 
 2   achievementid  object 
 3   date_acquired  object 
 4   gameid         float64
 5   title          object 
 6   description    object 
 7   points         float64
 8   library        object 
dtypes: float64(2), int64(1), object(6)
memory usage: 1.0+ GB


In [34]:
xb_player_info.shape

(15545366, 9)

### Game focus

In [35]:
xb_game_info = xbox_games.merge(xbox_achievements.rename(columns={'title': 'achievement_title'}), on="gameid", how="left")

In [36]:
xb_game_info = xb_game_info.merge(xbox_prices, on="gameid", how="left")

In [37]:
xb_game_info = xb_game_info.drop(columns=['date_acquired', 'points'])

In [38]:
xb_game_info.rename(columns={'description': 'achievement_description'}, inplace=True)

In [56]:
xb_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 646076 entries, 0 to 646075
Data columns (total 15 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   gameid                   646076 non-null  int64  
 1   title                    646076 non-null  object 
 2   developers               608804 non-null  object 
 3   publishers               610118 non-null  object 
 4   genres                   608336 non-null  object 
 5   supported_languages      265610 non-null  object 
 6   release_date             646076 non-null  object 
 7   achievementid            646076 non-null  object 
 8   achievement_title        646074 non-null  object 
 9   achievement_description  645968 non-null  object 
 10  usd                      474444 non-null  float64
 11  eur                      453524 non-null  float64
 12  gbp                      469236 non-null  float64
 13  jpy                      0 non-null       float64
 14  rub 

## Merge game info

In [40]:
len(ps_game_info['gameid'].unique())

23151

In [41]:
len(st_game_info['gameid'].unique())

98248

In [42]:
len(xb_game_info['gameid'].unique())

10489

In [44]:
ps_titles = set(ps_game_info['title'].str.lower().str.strip())
st_titles = set(st_game_info['title'].str.lower().str.strip())
xb_titles = set(xb_game_info['title'].str.lower().str.strip())

shared_ps_st = ps_titles.intersection(st_titles)
shared_ps_xb = ps_titles.intersection(xb_titles)
shared_st_xb = st_titles.intersection(xb_titles)
shared_all = ps_titles.intersection(st_titles, xb_titles)

print(f"Shared titles between PS & ST: {len(shared_ps_st)}")
print(f"Shared titles between PS & XB: {len(shared_ps_xb)}")
print(f"Shared titles between ST & XB: {len(shared_st_xb)}")
print(f"Shared across all three: {len(shared_all)}")


Shared titles between PS & ST: 5571
Shared titles between PS & XB: 5360
Shared titles between ST & XB: 5540
Shared across all three: 3815


In [None]:
def clean_game_info(df, platform):
    df = df.copy()
    
    # drop platform-specific columns (ps_game_info's 'rarity' and 'PS_platform')
    drop_cols = ['date_acquired', 'rarity', 'PS_platform']
    df = df.drop(columns=[col for col in drop_cols if col in df.columns], errors='ignore')

    # add platform column
    df['platform'] = platform
    
    return df

In [None]:
ps_game_info_clean = clean_game_info(ps_game_info, 'ps')

In [78]:
ps_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [79]:
st_game_info_clean = clean_game_info(st_game_info, 'st')

In [80]:
st_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [81]:
xb_game_info_clean = clean_game_info(xb_game_info, 'xb')

In [82]:
xb_game_info_clean.columns

Index(['gameid', 'title', 'developers', 'publishers', 'genres',
       'supported_languages', 'release_date', 'achievementid',
       'achievement_title', 'achievement_description', 'usd', 'eur', 'gbp',
       'jpy', 'rub', 'platform'],
      dtype='object')

In [None]:
shared_games = set(ps_game_info_clean['title']) & set(st_game_info_clean['title']) & set(xb_game_info_clean['title'])

ps_game_info_clean = ps_game_info_clean[ps_game_info_clean['title'].isin(shared_games)]
st_game_info_clean = st_game_info_clean[st_game_info_clean['title'].isin(shared_games)]
xb_game_info_clean = xb_game_info_clean[xb_game_info_clean['title'].isin(shared_games)]

merged_game_info = pd.concat([ps_game_info_clean, st_game_info_clean, xb_game_info_clean], ignore_index=True)

print(f"Merged dataset shape: {merged_game_info.shape}")

Merged dataset shape: (5808510, 16)
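
Note that the filter above compares raw `title` strings, whereas the earlier overlap counts lower-cased and stripped the titles first. If the filter should match those counts, one option is to compare a normalized key instead. A minimal sketch on toy data (the helper `title_key` and the toy frames are hypothetical):

```python
import pandas as pd


def title_key(titles):
    """Normalize titles the same way as the overlap counts: lower-case and strip whitespace."""
    return titles.str.lower().str.strip()


ps = pd.DataFrame({"title": ["Celeste", "Hades "]})
st = pd.DataFrame({"title": ["celeste", "Hades", "Portal"]})
xb = pd.DataFrame({"title": ["CELESTE", "hades"]})

# Intersect the normalized keys, then filter each frame by its own normalized titles.
shared_keys = set(title_key(ps["title"])) & set(title_key(st["title"])) & set(title_key(xb["title"]))
st_shared = st[title_key(st["title"]).isin(shared_keys)]

print(sorted(shared_keys))
```
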


In [84]:
merged_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5808510 entries, 0 to 5808509
Data columns (total 16 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  usd                      float64
 11  eur                      float64
 12  gbp                      float64
 13  jpy                      float64
 14  rub                      float64
 15  platform                 object 
dtypes: float64(5), int64(1), object(10)
memory usage: 709.0+ MB


## Long and Tidy Form
A tidy dataset follows the principles:
1. Each column is a single variable.
2. Each row is a single observation.
3. Each cell contains a single value.

#### Issues with the current merged_game_info:
- **We decided to leave this unchanged, since our analysis is based on it**: The `platform` column is categorical, so the same game can appear multiple times (once per platform). This part of the dataset is already in long format.
- The price columns (`usd`, `eur`, etc.) are stored as separate columns rather than melted into a single column, which makes the dataset wide rather than tidy.

In [85]:
# convert price columns into a long format
tidy_game_info = merged_game_info.melt(
    id_vars=[
        "gameid", "title", "developers", "publishers", "genres",
        "supported_languages", "release_date", "achievementid",
        "achievement_title", "achievement_description", "platform"
    ],
    value_vars=["usd", "eur", "gbp", "jpy", "rub"],
    var_name="currency",
    value_name="price"
)

In [87]:
tidy_game_info.shape

(29042550, 13)

In [86]:
tidy_game_info.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29042550 entries, 0 to 29042549
Data columns (total 13 columns):
 #   Column                   Dtype  
---  ------                   -----  
 0   gameid                   int64  
 1   title                    object 
 2   developers               object 
 3   publishers               object 
 4   genres                   object 
 5   supported_languages      object 
 6   release_date             object 
 7   achievementid            object 
 8   achievement_title        object 
 9   achievement_description  object 
 10  platform                 object 
 11  currency                 object 
 12  price                    float64
dtypes: float64(1), int64(1), object(11)
memory usage: 2.8+ GB
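
Melting five price columns yields five rows per original row, which matches the shapes above (5,808,510 × 5 = 29,042,550). A toy example of the same reshaping, using two currencies and made-up values:

```python
import pandas as pd

wide = pd.DataFrame({
    "gameid": [1, 2],
    "platform": ["ps", "st"],
    "usd": [19.99, 9.99],
    "eur": [17.99, None],
})

long = wide.melt(
    id_vars=["gameid", "platform"],
    value_vars=["usd", "eur"],
    var_name="currency",
    value_name="price",
)

# One row per (game, platform, currency); missing prices stay as NaN.
print(long)
```
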


## Preprocessing

### 1. Check for missing values

In [90]:
# count missing values per column
null_counts = tidy_game_info.isnull().sum().reset_index()
null_counts.columns = ["col", "num_missing"]
null_counts["missing_pcnt"] = round((null_counts["num_missing"] / len(merged_game_info)) * 100, 2)

null_counts

Unnamed: 0,col,num_missing,missing_pcnt
0,gameid,0,0.0
1,title,0,0.0
2,developers,15850,0.27
3,publishers,79975,1.38
4,genres,24935,0.43
5,supported_languages,1588735,27.35
6,release_date,0,0.0
7,achievementid,40050,0.69
8,achievement_title,40325,0.69
9,achievement_description,4014060,69.11
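
One caveat on the percentages above: the counts come from `tidy_game_info` (29,042,550 rows) but are divided by `len(merged_game_info)` (5,808,510 rows), so each value is five times the share of missing rows in the tidy frame. Using `DataFrame.isna().mean()` avoids the mismatch, because it always divides by the frame's own length. A minimal sketch on toy data (`missing_summary` is a hypothetical helper name):

```python
import pandas as pd


def missing_summary(df):
    """Return the count and percentage of missing values per column."""
    return (
        pd.DataFrame({
            "num_missing": df.isna().sum(),
            "missing_pcnt": (df.isna().mean() * 100).round(2),
        })
        .rename_axis("col")
        .reset_index()
    )


toy = pd.DataFrame({
    "developers": ["a", None, "c", "d"],
    "price": [1.0, None, None, 4.0],
})
summary = missing_summary(toy)
print(summary)
```
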


#### Here’s how we decide to handle missing values effectively [Used GPT-4o for tailoring to table format]:

**Columns with Minor Missing Data (< 5%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **developers** | 0.27% | Fill with "unknown" to maintain consistency. |
| **publishers** | 1.38% | Fill with "unknown" for completeness. |
| **genres** | 0.43% | Fill with "unknown", since games without a genre classification are rare. |
| **achievementid** | 0.69% | Likely an error or missing achievements, can be kept as NaN. |
| **achievement_title** | 0.69% | Likely corresponds to missing achievementid, can be kept as NaN. |

- **Action: Fill developers, publishers, and genres with "unknown", and leave missing achievements as NaN.**

---

**Columns with Moderate Missing Data (5% - 30%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **supported_languages** | 27.35% | Keep as NaN, since the language list is not recorded for these games. |

- **Action: Keep supported_languages as NaN. The language list is simply unrecorded for these games, and filling it in would introduce incorrect data.**

---

**Columns with High Missing Data (> 50%)**
| Column | Missing % | Suggested Action |
|---------|----------|-----------------|
| **achievement_description** | 69.11% | Fill missing values with "unknown". |
| **price** | 71.58% | Keep as NaN—missing prices could indicate unavailable data, regional limitations, or discontinued games. |

**Action:**
- **Fill achievement_description with "unknown"** to avoid empty fields.
- **Keep price as NaN**, since forcing imputation could lead to inaccurate pricing.

In [97]:
# fill missing values for categorical text columns
# assign the result back instead of calling fillna(inplace=True) on a column selection,
# which is deprecated in recent pandas versions and may not modify the frame
text_cols = ["developers", "publishers", "genres", "achievement_description"]
tidy_game_info[text_cols] = tidy_game_info[text_cols].fillna("unknown")

### 2. Multi-valued columns

In [101]:
tidy_game_info.sample(5)

Unnamed: 0,title,developers,publishers,genres,supported_languages,release_date,achievement_title,achievement_description,platform,currency,price
24479197,Leisure Suit Larry - Wet Dreams Don't Dry,['CrazyBunch'],['Assemble Entertainment'],['Adventure'],"['English', 'German', 'Russian', 'Polish', 'Fr...",2018-11-07,Done!,Complete the game.,st,rub,799.0
20259080,Growtopia,['Ubisoft Abu Dhabi'],['Ubisoft'],"['Action', 'Adventure', 'Casual', 'Massively M...",['English'],2024-03-07,Expert Builder,"Earned for placing 100,000 blocks of any type.",st,jpy,
16533183,Quantum Replica,['ON3D Studios'],['PQube'],"['Action', 'Indie']","['English', 'Italian', 'Spanish - Spain', 'Fre...",2018-05-31,Call me Lightman,Complete a Level without getting detected,st,gbp,1.74
24734293,Gunfire Reborn,['Duoyi Games'],['Duoyi Games'],"['Action', 'Adventure', 'Indie', 'RPG']","['English', 'Simplified Chinese', 'Traditional...",2021-11-17,Deadly Strike,Defeat 500 enemies with Soul Strike.,st,rub,625.0
21674552,The Escapists 2,"['Team17', 'Mouldy Toof Studios']",['Team17'],"['Indie', 'Simulation', 'Strategy']","['English', 'French', 'German', 'Spanish - Spa...",2017-08-21,I'm The Daddy,Knock out every inmate at least once in a sing...,st,jpy,1980.0


#### Handling Multi-Valued Columns in Our Dataset [Used GPT-4o for tailoring grammar]

Our dataset contains multi-valued columns such as `developers`, `publishers`, `genres`, and `supported_languages`, which are stored as **list-like strings**. Instead of immediately expanding them into multiple rows (long format) or separate columns (one-hot encoding), we are **keeping them in their original format** for the following reasons:

1. **Data Integrity & Storage Efficiency**  
   - Expanding these fields would significantly increase row count, making storage and initial processing heavier.
   - Keeping them as lists preserves all information without adding rows.

2. **Flexibility for Future Analysis**  
   - At later stages, we may **apply one-hot encoding** to `genres` and `supported_languages` for categorical analysis.  
   - Alternatively, we can derive **summary features** such as:
     - **Number of genres per game**
     - **Number of supported languages**
     - **Unique count of developers/publishers**
   - These derived features will allow for a more structured comparison across games.

By postponing transformation, we maintain efficiency while keeping the option open for structured feature extraction when needed.
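
The summary features mentioned above can be derived without expanding the rows. A minimal sketch, assuming the columns hold Python-style list strings as in the sample output (the helper `parse_list` and the toy frame are hypothetical; values that are missing or not list-like, such as the `"unknown"` fill value, are treated as empty lists):

```python
import ast

import pandas as pd


def parse_list(value):
    """Parse a list-like string such as "['Action', 'Indie']"; return [] otherwise."""
    if pd.isna(value) or not str(value).startswith("["):
        return []
    return ast.literal_eval(value)


games = pd.DataFrame({
    "title": ["Quantum Replica", "Growtopia", "Example Game"],
    "genres": ["['Action', 'Indie']", None, "unknown"],
    "supported_languages": ["['English', 'Italian']", "['English']", None],
})

# Count list elements per row; missing or non-list values count as zero.
games["n_genres"] = games["genres"].apply(parse_list).str.len()
games["n_languages"] = games["supported_languages"].apply(parse_list).str.len()

print(games[["title", "n_genres", "n_languages"]])
```

On the full dataset it would likely be cheaper to compute these counts once per game (for example on the deduplicated games table) and merge them back, rather than parsing every row of the tidy frame.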


In [None]:
# tidy_game_info_noids = tidy_game_info.drop(columns=['gameid', 'achievementid'])  # no inplace=True: drop() would then return None

## More notes on further merging

Our current **tidy dataset** focuses solely on game-related information. To enrich the analysis, we plan additional merges with player data to uncover insights about game popularity, completion rates, and player demographics.

### Planned Merges & Insights
1. **Game Popularity Analysis**  
   - Merge with `purchased_games.csv` (for each platform) to **count the number of players who own each game**.
   - This helps identify **best-selling games** and platform-specific purchase trends.

2. **Game Completion Rates**  
   - Merge with `history.csv` to **track when players earn end-game achievements**.
   - Calculate the **average time to completion** and **percentage of players finishing a game**.
   - Investigate if certain **genres have higher completion rates** (e.g., RPGs vs. casual games).
  
3. **Pricing & Sales Relationship**  
   - Merge `prices.csv` to analyze how **price fluctuations impact purchases**.

4. **Cross-Platform Player Behavior**  
   - Identify players who own **the same game across multiple platforms** to study cross-platform engagement.
   - Merge with `players.csv` and **cluster by country** to analyze **regional gaming preferences**.
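
As an illustration of the first planned merge, the number of owners per game could be derived from the `library` column. This is a minimal sketch on toy data, assuming `library` holds a stringified list of game ids as in the player sample shown earlier (the frame and the column name `n_owners` are hypothetical):

```python
import ast

import pandas as pd

purchased = pd.DataFrame({
    "playerid": [10, 11, 12],
    "library": ["[669, 2926]", "[669]", None],
})

# Parse each library string into a list, then expand to one row per (player, game).
owned = (
    purchased.dropna(subset=["library"])
    .assign(gameid=lambda d: d["library"].apply(ast.literal_eval))
    .explode("gameid")
    .dropna(subset=["gameid"])  # empty libraries explode to NaN
    .astype({"gameid": "int64"})
)

# Count distinct owners per game.
owners_per_game = (
    owned.groupby("gameid")["playerid"].nunique().rename("n_owners").reset_index()
)
print(owners_per_game)
```

The resulting table has one row per `gameid`, so it can be merged onto the game-level data with a many-to-one join.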


## [TODO] Data Visualization
> Be sure to include interpretations of your visualizations -- what patterns or anomalies do you see?
