### Import Libraries

In [1]:
!pip install sentence-transformers

import pandas as pd
import re
from sklearn.preprocessing import StandardScaler
from sentence_transformers import SentenceTransformer






### Read CSV files

In [2]:
# Define file path
INFO_BASE_GAMES_PATH = '../../data/raw/info_base_games.csv'

info_base_games_df = pd.read_csv(INFO_BASE_GAMES_PATH)

info_base_games_df



Unnamed: 0,appid,name,metacritic,steam_achievements,steam_trading_cards,workshop_support,genres,achievements_total,release_date,supported_platforms
0,2574000,Femboy Burgers,,True,True,True,"Casual, Indie",,"Oct 9, 2023","['windows', 'mac', 'linux']"
1,2574120,PPA Pickleball Tour 2025,,True,True,True,"Indie, Simulation, Sports",18,"Jul 16, 2024","['windows', 'mac', 'linux']"
2,2573200,Squeaky Squad,,True,True,True,"Action, Adventure, Indie",27,"Mar 29, 2024","['windows', 'mac', 'linux']"
3,2573440,Paradox Metal,,True,True,True,"Action, Early Access",,Coming soon,"['windows', 'mac', 'linux']"
4,2569520,Naturpark Lillebælt VR,,True,True,True,"Action, Adventure",,"Sep 18, 2023","['windows', 'mac', 'linux']"
...,...,...,...,...,...,...,...,...,...,...
99162,1548850,Six Days in Fallujah,,True,False,False,"Action, Early Access",34.0,"Jun 22, 2023",['windows']
99163,2478130,PROJECT SURVIVAL #Working title,,False,False,False,"Action, Adventure, RPG",,Coming soon,['windows']
99164,3272980,Siren's Well,,True,False,False,"Action, Adventure",,Coming soon,"['windows', 'linux']"
99165,2054150,Tower Defender VR: Last Adventure,,False,False,False,"Casual, Indie, RPG, Strategy",,"Jul 8, 2022",['windows']


### Data Statistics

In [3]:
info_base_games_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 99167 entries, 0 to 99166
Data columns (total 10 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   appid                99167 non-null  object
 1   name                 99149 non-null  object
 2   metacritic           3019 non-null   object
 3   steam_achievements   99167 non-null  bool  
 4   steam_trading_cards  99167 non-null  bool  
 5   workshop_support     99167 non-null  bool  
 6   genres               94389 non-null  object
 7   achievements_total   38115 non-null  object
 8   release_date         98861 non-null  object
 9   supported_platforms  99167 non-null  object
dtypes: bool(3), object(7)
memory usage: 5.6+ MB


- `appid`, `steam_achievements`, `steam_trading_cards`, `workshop_support` and `supported_platforms` have no NULL values
- `name` has a few NULL values (99149 of 99167 rows are non-null)
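
The per-column gaps can be read straight off the `info()` output above. As a rough sketch, the snippet below recomputes the missing percentages from those reported non-null counts. The dictionary and variable names are illustrative; on the loaded frame, `info_base_games_df.isna().mean() * 100` should give the same figures.

```python
# Non-null counts copied from the info() output above.
TOTAL_ROWS = 99167
non_null_counts = {
    'appid': 99167,
    'name': 99149,
    'metacritic': 3019,
    'steam_achievements': 99167,
    'steam_trading_cards': 99167,
    'workshop_support': 99167,
    'genres': 94389,
    'achievements_total': 38115,
    'release_date': 98861,
    'supported_platforms': 99167,
}

# Percentage of missing values per column.
missing_pct = {
    column: (TOTAL_ROWS - count) / TOTAL_ROWS * 100
    for column, count in non_null_counts.items()
}

# Print the columns from most to least incomplete.
for column, pct in sorted(missing_pct.items(), key=lambda item: item[1], reverse=True):
    print(f"{column:<22}{pct:6.2f}% missing")
```

Read this way, `metacritic` and `achievements_total` are the sparse columns, while `name` is almost complete.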

### Calculate percentage of missing names:

In [4]:
missing_pct = info_base_games_df['name'].isna().mean() * 100
print(f"{missing_pct:.2f}% of games have no name")

0.02% of games have no name


- Since this percentage is very low, rows with missing names can be dropped with no real impact

### Drop rows with NULL names

In [5]:
original_length = len(info_base_games_df)
info_base_games_df.dropna(subset=['name'], inplace=True)

print(f"Dropped {original_length - len(info_base_games_df)} rows with missing names.")

Dropped 18 rows with missing names.


### Preprocess names and derive potentially useful features from them
#### Preprocess names
- Strip punctuation
- Convert to lowercase
#### Features derived from name
- Character count
- Word count
- Ratio of capital letters to total length (computed before lowercasing)
- Is a sequel
- Has useful keywords (vr, remaster, collector, collection, edition, bundle, playtest)

In [6]:
# Strip punctuation
info_base_games_df['name'] = info_base_games_df['name'].str.replace(r'[^\w\s]', '', regex=True)

# add character count and word count features
info_base_games_df['name_len'] = info_base_games_df['name'].str.len()
info_base_games_df['name_words'] = info_base_games_df['name'].str.split().str.len()

def cap_ratio(s):
    if not s:
        return 0
    upper_count = sum(1 for ch in s if ch.isupper())
    return upper_count / len(s)

# add caps ratio feature
info_base_games_df['name_cap_ratio'] = info_base_games_df['name'].apply(cap_ratio)

# transform all names to lowercase
info_base_games_df['name'] = info_base_games_df['name'].str.lower()

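# flag likely sequels: a standalone Roman numeral (i to x) or a single digit 2-9 in the lowercased name
# caveat: a standalone "i", "v" or "x" also matches, so a title containing the pronoun "I" is flagged too
roman_re = re.compile(r'\b(?:i{1,3}|iv|v|vi|vii|viii|ix|x)\b')
digit_re = re.compile(r'\b[2-9]\b')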

info_base_games_df['is_sequel'] = (
    info_base_games_df['name'].str.contains(roman_re) |
    info_base_games_df['name'].str.contains(digit_re)
).astype(int)

# add useful keyword features
keywords = ['vr', 'remaster', 'collector', 'collection', 'edition', 'bundle', 'playtest']
for kw in keywords:
    info_base_games_df[f'name_has_{kw}'] = (
        info_base_games_df['name']
          .str.contains(fr'\b{kw}\b')
          .astype(int)
    )

info_base_games_df

Unnamed: 0,appid,name,metacritic,steam_achievements,steam_trading_cards,workshop_support,genres,achievements_total,release_date,supported_platforms,...,name_words,name_cap_ratio,is_sequel,name_has_vr,name_has_remaster,name_has_collector,name_has_collection,name_has_edition,name_has_bundle,name_has_playtest
0,2574000,femboy burgers,,True,True,True,"Casual, Indie",,"Oct 9, 2023","['windows', 'mac', 'linux']",...,2,0.142857,0,0,0,0,0,0,0,0
1,2574120,ppa pickleball tour 2025,,True,True,True,"Indie, Simulation, Sports",18,"Jul 16, 2024","['windows', 'mac', 'linux']",...,4,0.208333,0,0,0,0,0,0,0,0
2,2573200,squeaky squad,,True,True,True,"Action, Adventure, Indie",27,"Mar 29, 2024","['windows', 'mac', 'linux']",...,2,0.153846,0,0,0,0,0,0,0,0
3,2573440,paradox metal,,True,True,True,"Action, Early Access",,Coming soon,"['windows', 'mac', 'linux']",...,2,0.153846,0,0,0,0,0,0,0,0
4,2569520,naturpark lillebælt vr,,True,True,True,"Action, Adventure",,"Sep 18, 2023","['windows', 'mac', 'linux']",...,3,0.181818,0,1,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99162,1548850,six days in fallujah,,True,False,False,"Action, Early Access",34.0,"Jun 22, 2023",['windows'],...,4,0.150000,0,0,0,0,0,0,0,0
99163,2478130,project survival working title,,False,False,False,"Action, Adventure, RPG",,Coming soon,['windows'],...,4,0.533333,0,0,0,0,0,0,0,0
99164,3272980,sirens well,,True,False,False,"Action, Adventure",,Coming soon,"['windows', 'linux']",...,2,0.181818,0,0,0,0,0,0,0,0
99165,2054150,tower defender vr last adventure,,False,False,False,"Casual, Indie, RPG, Strategy",,"Jul 8, 2022",['windows'],...,5,0.187500,0,1,0,0,0,0,0,0
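
To sanity-check the engineered columns, the same steps can be replayed on single strings with only the standard library. This is a sketch rather than the notebook's code: `name_features` is a hypothetical helper, and the extra titles passed to it at the end are illustrative strings, not rows taken from the dataset.

```python
import re

ROMAN_RE = re.compile(r'\b(?:i{1,3}|iv|v|vi|vii|viii|ix|x)\b')
DIGIT_RE = re.compile(r'\b[2-9]\b')
KEYWORDS = ['vr', 'remaster', 'collector', 'collection', 'edition', 'bundle', 'playtest']


def name_features(raw_name):
    """Replay the name preprocessing for one title (hypothetical helper)."""
    stripped = re.sub(r'[^\w\s]', '', raw_name)   # strip punctuation
    lowered = stripped.lower()                    # lowercase after counting capitals
    features = {
        'name': lowered,
        'name_len': len(stripped),
        'name_words': len(stripped.split()),
        'name_cap_ratio': sum(ch.isupper() for ch in stripped) / len(stripped) if stripped else 0,
        'is_sequel': int(bool(ROMAN_RE.search(lowered) or DIGIT_RE.search(lowered))),
    }
    for kw in KEYWORDS:
        features[f'name_has_{kw}'] = int(bool(re.search(fr'\b{kw}\b', lowered)))
    return features


# A title that appears in the table above.
row = name_features('Tower Defender VR: Last Adventure')
print(row['name_words'], row['name_cap_ratio'], row['name_has_vr'])

# Illustrative titles (assumptions, not dataset rows) that probe the sequel patterns.
sequel_flags = {
    title: name_features(title)['is_sequel']
    for title in ['Portal 2', 'I Am Bread', 'Final Fantasy XIV', 'PPA Pickleball Tour 2025']
}
print(sequel_flags)
```

The last call shows the limits of the sequel heuristic: the pronoun "I" is read as a Roman numeral, while numerals above ten and multi-digit numbers are not matched at all.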


### Apply pretrained embeddings on names

In [7]:
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(info_base_games_df['name'].tolist())

emb_dim = embeddings.shape[1]
emb_cols = [f'name_emb_{i}' for i in range(emb_dim)]
emb_df = pd.DataFrame(embeddings, columns=emb_cols, index=info_base_games_df.index)
info_base_games_df = pd.concat([info_base_games_df, emb_df], axis=1)
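
The `index=info_base_games_df.index` argument matters because dropping the nameless rows left gaps in the index. The sketch below illustrates this on a toy frame, using an arbitrary NumPy array as a stand-in for the output of `model.encode`; `toy_df` and `fake_embeddings` are made-up names and values.

```python
import numpy as np
import pandas as pd

# Toy frame whose index has a gap, as happens after dropna() removes a row.
toy_df = pd.DataFrame({'name': ['alpha', 'beta', 'gamma']}, index=[0, 2, 3])

# Stand-in for model.encode(...): one 4-dimensional vector per row (arbitrary values).
fake_embeddings = np.arange(12, dtype=np.float32).reshape(3, 4)
emb_cols = [f'name_emb_{i}' for i in range(fake_embeddings.shape[1])]

# Reusing the frame's index keeps every vector next to its own name.
aligned = pd.concat(
    [toy_df, pd.DataFrame(fake_embeddings, columns=emb_cols, index=toy_df.index)],
    axis=1,
)

# With a default 0..2 index, concat aligns on labels and takes the union of both indexes.
misaligned = pd.concat(
    [toy_df, pd.DataFrame(fake_embeddings, columns=emb_cols)],
    axis=1,
)

print(aligned.shape)
print(misaligned.shape)
print(misaligned.isna().sum().sum(), "NaN cells introduced by the misalignment")
```

Without the shared index, the result gains an extra row and the vectors end up attached to the wrong names.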

### Scale numeric features (optional depending on model's sensitivity to data scale)

In [8]:
scale_cols = [
    'name_len', 'name_words', 'name_cap_ratio', 'is_sequel'
] + [f'name_has_{kw}' for kw in keywords]

scaler = StandardScaler()
info_base_games_df[scale_cols] = scaler.fit_transform(info_base_games_df[scale_cols])
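
For reference, `StandardScaler` subtracts each column's mean and divides by its population standard deviation. The hand-rolled sketch below applies that formula to a made-up 0/1 flag column to show what happens to the binary features; `standardize` is a hypothetical helper, not part of scikit-learn.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return (x - mean) / std using the population standard deviation (ddof=0)."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant column: only centering is possible.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


flag = [0, 0, 0, 1]            # made-up binary flag column
scaled_flag = standardize(flag)
print([round(v, 4) for v in scaled_flag])
```

The scaled column still takes only two distinct values, but they are no longer 0 and 1, which is worth keeping in mind when the flags are inspected or interpreted later.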

### Turn boolean columns into numeric

In [9]:
info_base_games_df['steam_achievements'] = info_base_games_df['steam_achievements'].astype(int)
info_base_games_df['steam_trading_cards'] = info_base_games_df['steam_trading_cards'].astype(int)
info_base_games_df['workshop_support'] = info_base_games_df['workshop_support'].astype(int)

### Save the preprocessed data
- Save the preprocessed data for testing and validation purposes.
- Features added to `info_base_games_df`:

    - Character count
    - Word count
    - Ratio of capital letters to total length
    - Is a sequel
    - Has useful keywords (vr, remaster, collector, collection, edition, bundle, playtest)
    - Name embedding

In [10]:
# Left commented out: writing to INFO_BASE_GAMES_PATH would overwrite the raw CSV
#info_base_games_df.to_csv(INFO_BASE_GAMES_PATH, index=False)
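
Since the commented-out call targets `INFO_BASE_GAMES_PATH`, enabling it as written would replace the raw file. One possible alternative, sketched here on a toy frame inside a temporary directory, is to write to a separate file; the name `info_base_games_preprocessed.csv` is a hypothetical choice, not a path used by the notebook.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the preprocessed frame (values are made up).
toy_df = pd.DataFrame({'appid': ['10', '20'], 'name_len': [0.5, -0.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical output file, kept apart from the raw CSV.
    processed_path = Path(tmp_dir) / 'info_base_games_preprocessed.csv'

    # index=False keeps the row index out of the file, so no extra
    # "Unnamed: 0" column appears when the file is read back.
    toy_df.to_csv(processed_path, index=False)
    reloaded = pd.read_csv(processed_path)

print(list(reloaded.columns))
```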

### Statistics After Preprocessing

In [11]:
info_base_games_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 99149 entries, 0 to 99166
Columns: 405 entries, appid to name_emb_383
dtypes: float32(384), float64(11), int64(3), object(7)
memory usage: 161.9+ MB


### Summary and Conclusion

- **Imported & cleaned**  
  - Installed and loaded `sentence-transformers`, `pandas`, `re`, and `StandardScaler`  
  - Read in `info_base_games.csv` and inspected column completeness  

- **Handled missing titles**  
  - Calculated the missing-name percentage (0.02%) and dropped the 18 affected rows (low impact)  

- **Engineered name features**  
  - Normalized titles (lowercase, no punctuation)  
  - **Character count** (`name_len`)  
  - **Word count** (`name_words`)  
  - **Capital‑letter ratio** (`name_cap_ratio`)  
  - **Sequel flag** (`is_sequel`) identified from a standalone Roman numeral (I to X) or a single digit (2 to 9)  
  - **Keyword flags** (`name_has_vr`, `name_has_remaster`, etc.)

- **Applied embeddings**  
  - Generated embedding vectors from `all-MiniLM-L6-v2` for each cleaned name  

- **Scaled & encoded**  
  - Standardized numeric features (`name_len`, `name_words`, `name_cap_ratio`, `is_sequel`, keyword flags)  
  - Converted boolean columns (`steam_achievements`, `steam_trading_cards`, `workshop_support`) into numeric 0/1  

- **Saved & validated**  
  - (Optional) Exported the preprocessed DataFrame to CSV
  - Reviewed final schema and data types

- **Notes**

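  - Whether to scale the numeric features depends on the model: distance-based and gradient-based models are sensitive to feature scale, while tree-based models generally are not
  - `appid` will most likely be dropped, since it is an identifier and is unlikely to carry useful signal about sales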