# Topic Modeling of 2025 Videogames
In this notebook, I set up a pipeline for topic modeling of video game descriptions from the top 5 videogame genres, starting with the Indie genre.

## Pipeline Overview
1. Load the dataset
2. Filter for the Indie genre (the genre of interest)
3. Preprocess text (tokenization, lemmatization, stopword removal)
4. Apply BERTopic for thematic modeling
5. Visualize and interpret themes

## Install Necessary Packages
I installed the packages by running the following commands in the terminal (UMAP is published on PyPI as umap-learn, not umap):<br>
pip install pandas spacy bertopic scikit-learn umap-learn sentence-transformers <br>
python -m spacy download en_core_web_sm
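
Before running the rest of the notebook, it can help to confirm that the installed packages are importable. The snippet below is a minimal sketch using only the standard library; the helper name `missing_modules` and the `REQUIRED_MODULES` list are my own and are not part of any of the packages. Note that some import names differ from the names used with pip.

```python
from importlib.util import find_spec

# Import names, which differ from some PyPI names:
# scikit-learn -> sklearn, umap-learn -> umap,
# sentence-transformers -> sentence_transformers
REQUIRED_MODULES = [
    "pandas",
    "spacy",
    "bertopic",
    "sklearn",
    "umap",
    "sentence_transformers",
    "en_core_web_sm",  # the spaCy model is installed as a regular package
]


def missing_modules(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if find_spec(name) is None]


print("Missing:", missing_modules(REQUIRED_MODULES) or "none")
```

An empty result only shows that the modules can be located; it does not check versions or compatibility between them.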

## Load the Dataset
In this first step, I set 'path' to the directory where the video game datasets from Kaggle were already downloaded in videogames2025eda.ipynb.

In [4]:
import os
import pandas as pd

# Set the path to the dataset directory
path = r"C:\Users\sarah\.cache\kagglehub\datasets\artermiloff\steam-games-dataset\versions\2"

# List available files
print(os.listdir(path))

['games_march2025_cleaned.csv', 'games_march2025_full.csv', 'games_may2024_cleaned.csv', 'games_may2024_full.csv']
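
Because the dataset directory is specific to one machine, a small check before reading the file can make a wrong path easier to diagnose. The following is a sketch only; `resolve_csv` is a hypothetical helper, not something provided by pandas or kagglehub.

```python
from pathlib import Path


def resolve_csv(dataset_dir, filename):
    """Return the full path to filename, or raise an error listing the CSV files found."""
    csv_path = Path(dataset_dir) / filename
    if not csv_path.is_file():
        # Collect the CSV files that do exist so the error message is actionable
        available = sorted(p.name for p in Path(dataset_dir).glob("*.csv"))
        raise FileNotFoundError(
            f"{filename} not found; available CSV files: {available}"
        )
    return csv_path
```

With the notebook's variables, the result of `resolve_csv(path, 'games_march2025_cleaned.csv')` could be passed directly to `pd.read_csv`.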


In [5]:
# Read dataset into pandas dataframe
df = pd.read_csv(os.path.join(path, 'games_march2025_cleaned.csv'))
df.head()

Unnamed: 0,appid,name,release_date,required_age,price,dlc_count,detailed_description,about_the_game,short_description,reviews,...,average_playtime_2weeks,median_playtime_forever,median_playtime_2weeks,discount,peak_ccu,tags,pct_pos_total,num_reviews_total,pct_pos_recent,num_reviews_recent
0,730,Counter-Strike 2,2012-08-21,0,0.0,1,"For over two decades, Counter-Strike has offer...","For over two decades, Counter-Strike has offer...","For over two decades, Counter-Strike has offer...",,...,879,5174,350,0,1212356,"{'FPS': 90857, 'Shooter': 65397, 'Multiplayer'...",86,8632939,82,96473
1,578080,PUBG: BATTLEGROUNDS,2017-12-21,0,0.0,0,"LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...","LAND, LOOT, SURVIVE! Play PUBG: BATTLEGROUNDS ...",Play PUBG: BATTLEGROUNDS for free. Land on str...,,...,0,0,0,0,616738,"{'Survival': 14838, 'Shooter': 12727, 'Battle ...",59,2513842,68,16720
2,570,Dota 2,2013-07-09,0,0.0,2,"The most-played game on Steam. Every day, mill...","The most-played game on Steam. Every day, mill...","Every day, millions of players worldwide enter...",“A modern multiplayer masterpiece.” 9.5/10 – D...,...,1536,898,892,0,555977,"{'Free to Play': 59933, 'MOBA': 20158, 'Multip...",81,2452595,80,29366
3,271590,Grand Theft Auto V Legacy,2015-04-13,17,0.0,0,"When a young street hustler, a retired bank ro...","When a young street hustler, a retired bank ro...",Grand Theft Auto V for PC offers players the o...,,...,771,7101,74,0,117698,"{'Open World': 32644, 'Action': 23539, 'Multip...",87,1803832,92,17517
4,359550,Tom Clancy's Rainbow Six® Siege,2015-12-01,17,3.99,9,Edition Comparison Ultimate Edition The Tom Cl...,“One of the best first-person shooters ever ma...,"Tom Clancy's Rainbow Six® Siege is an elite, t...",,...,682,2434,306,80,89916,"{'FPS': 9831, 'PvP': 9162, 'e-sports': 9072, '...",84,1168020,76,12608


For this project, I'm interested in the main themes/topics within the most popular video game genre in the March 2025 dataset, which is Indie.
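
One way to check which genre is most common is to count how many games list each genre. The sketch below assumes that each cell of the 'genres' column holds a stringified Python list such as "['Action', 'Indie']"; that format is an assumption, since the column is hidden in the output above. It falls back to comma-separated text if parsing fails. The helper `count_genres` and the sample rows are illustrative only.

```python
import ast
from collections import Counter


def count_genres(genre_cells):
    """Count how many games list each genre."""
    counts = Counter()
    for cell in genre_cells:
        if not isinstance(cell, str):
            continue  # skip missing values such as NaN
        try:
            genres = ast.literal_eval(cell)  # parse "['Action', 'Indie']"
        except (ValueError, SyntaxError):
            # Fall back to plain comma-separated text
            genres = [g.strip() for g in cell.split(",")]
        counts.update(set(genres))  # count each genre once per game
    return counts


# Toy rows for illustration only, not taken from the dataset
sample = ["['Action', 'Indie']", "['Indie', 'RPG']", "['Action']", float("nan")]
print(count_genres(sample).most_common(2))
```

On the notebook's dataframe, the equivalent call would be `count_genres(df['genres']).most_common(5)`.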

In [7]:
# Filter for Indie games
indie_games = df[df['genres'].str.contains("Indie", case=False, na=False)]

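# Select relevant columns; .copy() makes an independent dataframe so that
# adding columns later does not trigger pandas' SettingWithCopyWarning
indie_games = indie_games[['name', 'short_description']].copy()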
indie_games.head()

Unnamed: 0,name,short_description
6,Terraria,"Dig, fight, explore, build! Nothing is impossi..."
7,Rust,The only aim in Rust is to survive. Everything...
8,Garry's Mod,Garry's Mod is a physics sandbox. There aren't...
13,Stardew Valley,You've inherited your grandfather's old farm p...
17,Euro Truck Simulator 2,"Travel across Europe as king of the road, a tr..."


## Preprocess Text
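In this step, I use spaCy to preprocess the text in the 'short_description' column: each description is lowercased, lemmatized, and stripped of stopwords and non-alphabetic tokens.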

In [None]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

def preprocess_text(text):
    # Missing descriptions are read as NaN (a float), so return an empty string for them
    if not isinstance(text, str):
        return ""
    doc = nlp(text.lower())  # Convert to lowercase
    # Keep the lemmas of alphabetic tokens that are not stopwords
    tokens = [token.lemma_ for token in doc if token.is_alpha and not token.is_stop]
    return " ".join(tokens)

# Apply preprocessing
indie_games['processed_description'] = indie_games['short_description'].apply(preprocess_text)
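
Calling the spaCy pipeline once per row can be slow on a large dataframe. A possible alternative, sketched here, is to pass all descriptions through `nlp.pipe`, which processes texts in batches and yields one `Doc` per input in the same order. The function name `preprocess_many` and the default batch size are my own choices, and the function expects the loaded pipeline to be passed in as `nlp`.

```python
from typing import Iterable, List


def preprocess_many(texts: Iterable, nlp, batch_size: int = 256) -> List[str]:
    """Lemmatize and filter many texts in one batched spaCy pass."""
    # Non-string entries (e.g. NaN for a missing description) become empty strings
    cleaned = (t.lower() if isinstance(t, str) else "" for t in texts)
    results = []
    # nlp.pipe yields one Doc per input text, in input order
    for doc in nlp.pipe(cleaned, batch_size=batch_size):
        lemmas = [tok.lemma_ for tok in doc if tok.is_alpha and not tok.is_stop]
        results.append(" ".join(lemmas))
    return results
```

With the objects defined above, this could be used as `indie_games['processed_description'] = preprocess_many(indie_games['short_description'], nlp)`. The filtering logic matches `preprocess_text`; only the batching differs.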