# Visual Analysis of the Steam Dataset

Please download the datasets from the links below and follow the instructions in `data/README.md`. Note the paths used to load the datasets and make sure they match the files in the `data/` directory.

https://www.gigasheet.com/sample-data/steam-game-reviews

https://www.kaggle.com/datasets/mohamedtarek01234/steam-games-reviews-and-rankings

# A. Data Card

This section describes what the datasets contain: their sources, shape, columns, missing values, and known quirks.

### Datasets

*Steam Game Reviews:* https://www.gigasheet.com/sample-data/steam-game-reviews ~464MB

This dataset consists of ~990K rows of reviews taken from SteamDB, a database for the video game vendor Steam.

*Steam Game Reviews and Rankings:* https://www.kaggle.com/datasets/mohamedtarek01234/steam-games-reviews-and-rankings ~1MB
 
We use this dataset as the source of game metadata to enrich each review, which gives better insight into the relationship between the kind of game and its reviews.

**Shape**

*Steam Game Reviews:* (992153, 8)

*Steam Game Reviews and Rankings:* (290, 13)

*Merged Working Dataset:* (992153, 24)

**Time**

The reviews span 2010-2024, with far more of them coming from recent years. (See the visuals for review counts by month and year.)
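
The skew toward recent years can be checked with a quick count per year or per month. The snippet below is a minimal sketch on a handful of made-up dates, not the real data; on the working dataset the same calls would be applied to the parsed `date` column.

```python
import pandas as pd

# Made-up dates standing in for the parsed `date` column.
dates = pd.to_datetime(pd.Series([
    "2010-05-01", "2023-11-20", "2023-12-02",
    "2024-01-15", "2024-01-20", "2024-09-01",
]))

# Number of reviews per calendar year, in chronological order.
per_year = dates.dt.year.value_counts().sort_index()

# Number of reviews per calendar month (e.g. 2024-01), in chronological order.
per_month = dates.dt.to_period("M").value_counts().sort_index()

print(per_year)
print(per_month)
```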

### Column Dictionary

**For the working merged dataset**

**Provided Columns**    
`review` - the user's review text for the game   
`hours_played` - the number of hours the user had played the game when they posted the review   
`helpful` - the number of 'helpful' votes the review received from other users   
`funny` - the number of 'funny' votes the review received from other users   
`recommendation` - whether the user recommends the game or not (0 - no, 1 - yes)   
`date` - the date when the user posted the review   
`game_name` - the name of the game the user reviewed   
`username` - the user's personal alias   
`short_description` - the shortened description of the game from SteamDB   
`long_description` - the full description of the game from SteamDB   
`genres` - the list of genres that the game belongs to (e.g. Action, Adventure, Horror...)   
`minimum_system_requirement` - the minimum hardware specs needed to run the game   
`recommend_system_requirement` - the hardware specs recommended by the developer to run the game smoothly   
`release_date` - the date when the game was released   
`developer` - the studio that developed the game   
`publisher` - the parent company that licensed the development of the game   
`overall_player_rating` - the overall rating of the game aggregated across all of its reviews (categorical)   
`number_of_reviews_from_purchased_people` - total number of reviews from players who purchased the game   
`number_of_english_reviews` - total number of reviews in English from players who purchased the game   
`link` - the link to the Steam page of the game   

**Created Columns by Us**     
`popular` - a boolean indicating whether the review has more than 50 'helpful' votes   
`cat_playtime` - the playtime hours bucketed into categories     
`rec_ratio` - the ratio of recommended reviews to total reviews for the game, computed from the reviews in this dataset     
`in_sale` - a boolean indicating whether the review was posted during a major sale period (Summer - July, Winter - January)    
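
The definitions above can be sketched in pandas as follows. This is an illustrative sketch on a toy frame rather than the notebook's actual feature code; in particular, the bucket edges and labels used for `cat_playtime` are hypothetical, since the real ones are not stated here.

```python
import pandas as pd

# Toy stand-in for the merged dataset (the real one has ~990K rows).
reviews = pd.DataFrame({
    "game_name": ["A", "A", "B", "B"],
    "helpful": [120, 3, 51, 50],
    "hours_played": [0.5, 12.0, 250.0, 40.0],
    "recommendation": [1, 0, 1, 1],
    "date": pd.to_datetime(["2023-07-04", "2022-03-15", "2024-01-02", "2021-10-30"]),
})

# popular: strictly more than 50 'helpful' votes.
reviews["popular"] = reviews["helpful"] > 50

# cat_playtime: hypothetical bucket edges and labels, chosen only for illustration.
bins = [0, 2, 20, 100, float("inf")]
labels = ["0-2h", "2-20h", "20-100h", "100h+"]
reviews["cat_playtime"] = pd.cut(
    reviews["hours_played"], bins=bins, labels=labels, include_lowest=True
)

# rec_ratio: share of recommending reviews per game, broadcast back to each row.
reviews["rec_ratio"] = reviews.groupby("game_name")["recommendation"].transform("mean")

# in_sale: review posted in July (summer sale) or January (winter sale).
reviews["in_sale"] = reviews["date"].dt.month.isin([1, 7])

print(reviews[["popular", "cat_playtime", "rec_ratio", "in_sale"]])
```

Using `transform("mean")` keeps one row per review, so `rec_ratio` can sit next to the other per-review columns without a second merge.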

### Missingness Snapshot

*Steam Reviews*   
503 missing reviews    
81 missing usernames    

*Steam Game Metadata*    
13 missing short descriptions (can be filled in manually)   

*Merged Working Dataset*

| Column | Missing values |
| --- | --- |
| `date` | 1775 |
| `short_description` | 66892 |
| `long_description` | 66892 |
| `genres` | 66892 |
| `minimum_system_requirement` | 66892 |
| `recommend_system_requirement` | 66892 |
| `release_date` | 76879 |
| `developer` | 66892 |
| `publisher` | 66892 |
| `overall_player_rating` | 66892 |
| `number_of_reviews_from_purchased_people` | 66892 |
| `number_of_english_reviews` | 66892 |
| `link` | 66892 |

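The missing metadata values in the merged dataset stem from the 11 games that the metadata dataset does not cover; these can be filled in manually. `release_date` has 9,987 more missing values than the other metadata columns (76,879 vs. 66,892), which suggests that some games with metadata also lack a release date.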
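
A snapshot like the one above, together with the list of games that have no metadata, can be produced along these lines. This is a sketch on toy frames (the game names and values are made up); `indicator=True` is a standard `merge` option that adds a `_merge` column marking rows with no match on the right side.

```python
import pandas as pd

# Toy frames: game "C" has reviews but no metadata row.
reviews = pd.DataFrame({
    "game_name": ["A", "A", "B", "C"],
    "review": ["good", None, "bad", "ok"],
})
metadata = pd.DataFrame({"name": ["A", "B"], "developer": ["Dev A", "Dev B"]})

merged = reviews.merge(
    metadata, left_on="game_name", right_on="name", how="left", indicator=True
)

# Missing values per column, keeping only columns that have any.
missing_counts = merged.drop(columns=["name", "_merge"]).isna().sum()
missing_counts = missing_counts[missing_counts > 0]

# Games whose reviews found no matching metadata row.
unmatched = merged["_merge"] == "left_only"
games_without_metadata = sorted(merged.loc[unmatched, "game_name"].unique())

print(missing_counts)
print(games_without_metadata)
```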

### Quirks

1. Recent reviews do not include the year, because Steam shows only the month and day for recent reviews. Since the data was collected in September 2024, we assign the year 2024 to every review that is missing a year.
2. A few usernames are poorly formatted and need to be fixed.
3. 11 games are missing metadata, which can be filled in manually.
4. If a publisher owns its own development studio, then the publisher and developer will be the same.
5. A few of the review fields contain the year, which needs to be removed.
6. `genres`, the system requirement columns, and `long_description` are all list objects.
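
Quirks 1 and 6 could be handled roughly as sketched below. Two things here are assumptions rather than facts about the data: the raw date strings are taken to look like `October 3, 2021` or `September 14`, and the list columns are taken to be stored in the CSV as Python-style list strings. The helper `add_missing_year` is hypothetical and would need adjusting to the real formats.

```python
import ast
import re

import pandas as pd

COLLECTION_YEAR = 2024  # the data was collected in September 2024


def add_missing_year(raw_date, year=COLLECTION_YEAR):
    """Append `year` when the date string contains no four-digit year."""
    if re.search(r"\b\d{4}\b", raw_date):
        return raw_date
    return f"{raw_date}, {year}"


# Assumed raw formats: one date with a year, one recent date without.
raw_dates = pd.Series(["October 3, 2021", "September 14"])
parsed = pd.to_datetime(raw_dates.map(add_missing_year), format="%B %d, %Y")

# Assumed storage: list columns read from CSV as strings such as "['Action', 'Adventure']".
genres_raw = pd.Series(["['Action', 'Adventure']", None])
genres = genres_raw.map(lambda s: ast.literal_eval(s) if isinstance(s, str) else [])

print(parsed)
print(genres)
```

`ast.literal_eval` only evaluates literals, so it is a safer choice than `eval` for turning these strings back into lists.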

# 2 - Setup

### 2.1 Imports

This cell handles the imports needed to run every cell in the notebook. If you haven't yet, please run `pip install -r requirements.txt` in a console from the project's main directory to install the necessary dependencies.

In [2]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### 2.2 Load Datasets

This will load in all datasets that accompany this project. If you have not already, please follow the instructions in `data/README.md` to download the datasets required.

In [1]:
reviews_path = '../data/Steam Game Reviews export 2025-09-22 03-26-24.csv' # Replace with the path to your reviews CSV file
game_metadata_path = '../data/games_description.csv' # Replace with the path to your game metadata CSV file; note that the path must go up to the parent directory


steamreviews = pd.read_csv(reviews_path)
gamemetadata = pd.read_csv(game_metadata_path)
steamreviews.head()

In [None]:
gamemetadata.head()
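
Because the loading cell depends on local paths, it can help to check that the files exist before calling `pd.read_csv`. The helper below is a hypothetical convenience, not part of the notebook; the two paths are the ones used in the loading cell and may differ on your machine.

```python
from pathlib import Path


def missing_files(paths):
    """Return the given paths that do not point to an existing file."""
    return [str(p) for p in map(Path, paths) if not p.is_file()]


# Paths as used in the loading cell; adjust them to your own download.
expected = [
    "../data/Steam Game Reviews export 2025-09-22 03-26-24.csv",
    "../data/games_description.csv",
]

missing = missing_files(expected)
if missing:
    print("Not found (see data/README.md):", missing)
```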

### 2.3 Merge Datasets

We can combine the review dataset and game metadata dataset to enrich each review.

In [None]:
# Left join (as in SQL) on the game name, so every review is kept even when no metadata matches
steamdataset = pd.merge(steamreviews, gamemetadata, left_on='game_name', right_on='name', how='left')
steamdataset = steamdataset.drop(columns=['name'])  # 'name' duplicates 'game_name' after the merge
steamdataset
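
A left merge should leave the number of reviews unchanged, which is consistent with the merged shape reported in the data card. One way to guard against accidental row duplication is sketched below on toy frames; `validate="many_to_one"` is a standard `merge` option that raises an error if a game name appears more than once in the metadata.

```python
import pandas as pd

# Toy frames: game "B" has no metadata row.
reviews = pd.DataFrame({"game_name": ["A", "A", "B"], "recommendation": [1, 0, 1]})
metadata = pd.DataFrame({"name": ["A"], "genres": ["['Action']"]})

merged = pd.merge(
    reviews,
    metadata,
    left_on="game_name",
    right_on="name",
    how="left",
    validate="many_to_one",  # fails loudly if metadata has duplicate game names
).drop(columns=["name"])

# Every review row survives the left join exactly once.
assert len(merged) == len(reviews)

# Share of reviews that found no metadata.
unmatched_share = merged["genres"].isna().mean()
print(f"{unmatched_share:.1%} of reviews have no metadata")
```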