# Steam Analysis

**Name(s)**: Zidane Ho

**Website Link**: (your website link)

**Abstract**: 

## Step 1: Introduction

In [None]:
# Question: What makes a game controversial on Steam?
'''
In game development, controversy can be both good and bad for developers. On one hand, a controversial game spreads widely across the Internet, which can increase revenue. On the other hand, controversy can hurt a studio in the long run, especially a large one, because players may decide that its products are no longer worth playing. This project aims to identify features common to controversial games so that developers can make better decisions when building their own. We define controversial games as games on Steam whose reviews are negative (0-40% positive) or mixed (40-69% positive). We use the Steam Games Dataset 2025.

Steam, developed by Valve, is the largest platform for distributing PC games. As in a library, customers can search for a game by genre, author, or title before purchasing it. Customers may also leave a review recommending the game (or not) to others. The dataset is current as of March 2025 and contains 89618 rows.

After identifying the games with mixed or negative reviews on Steam, I scraped https://store.steampowered.com/ for their reviews.

Relevant Columns
- appid
- game_title
- tags
- positive
- negative
- user_score
- review_text
- criticism_tags


'''
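
The definition of a controversial game given above can be sketched as a small helper. This is a minimal illustration, not the notebook's actual code: the names `review_category` and `is_controversial` are hypothetical, and the boundary handling (below 40% is negative, 40% up to but not including 70% is mixed) is an assumption, since the text gives the ranges only as 0-40% and 40-69%.

```python
from typing import Optional

# Assumed boundaries (not stated exactly in the text above).
NEGATIVE_MAX = 0.40  # scores below 40% positive count as "negative"
MIXED_MAX = 0.70     # scores from 40% up to (not including) 70% count as "mixed"


def review_category(positive: int, negative: int) -> Optional[str]:
    """Classify a game by its share of positive reviews.

    Returns None when the game has no reviews, because the share is undefined.
    """
    total = positive + negative
    if total == 0:
        return None
    score = positive / total
    if score < NEGATIVE_MAX:
        return "negative"
    if score < MIXED_MAX:
        return "mixed"
    return "positive"


def is_controversial(positive: int, negative: int) -> bool:
    """A game is controversial if its reviews are negative or mixed."""
    return review_category(positive, negative) in ("negative", "mixed")
```

Games with no reviews are treated as not controversial here; that is a design choice of the sketch, made so that unreviewed games do not inflate the controversial group.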



## Step 2: Data Cleaning and Exploratory Data Analysis

In [None]:
'''
To prepare the data for analysis, I began by keeping only the relevant columns: appid, game_title, tags, positive, and negative.

I replaced the user_score column, which contained NaN values. It represents the proportion of positive reviews of a game, so I recalculated it by dividing the positive column by the sum of the positive and negative columns.

I exploded the tags column so that each genre a game belongs to gets its own row.

Univariate Analysis:

I conducted a univariate analysis of the release years of all Steam games in the dataset. The bar chart reveals a unimodal, left-skewed distribution. The mean is 2019.53, the median is 2020, and the mode is 2020.

The large growth from 2010 to 2020 reflects a growing trend in game development. This is likely due in large part to game development becoming more accessible to the average person, with the release of game engines (Unity, Unreal Engine) and a growing library of assets.

The distribution shows a single peak at 2020, revealing that game releases reached an all-time high in 2020. This may be a consequence of COVID-19, as people became more active on the Internet during this time.

The number of games in 2025 is misleading because the dataset only runs through March 2025, so many games are still to be released that year. If games continue to become easier to develop, we can expect the number of releases to keep increasing.

Additionally, I analyzed the distribution of the types of criticism found in 10% of all controversial Steam games. The bar chart revealed a unimodal, right-skewed distribution. The median criticism type is mentioned 3292 times.

'stability' is the most common criticism in reviews. This suggests that players often leave reviews because they experience freezes, crashes, or bugs within the game. 'Content' is the next most common criticism, suggesting that reviewers find the game short on updates, boring, or repetitive.

'politics' is one of the least common types of criticism. This suggests that reviewers rarely bring up sensitive topics, captured by keywords such as 'woke' or 'gender', in their reviews.

Bivariate Analysis

Release Year vs. Price

The scatterplot reveals a positive relationship between the year the game was released and the price of the game.



Interesting Aggregates

I aggregated the types of criticism in reviews by genre. The heatmap reveals that every genre receives 'stability' criticism. However, the genres Retro, Visual Novel, and Pixel Graphics receive the least. One possible explanation is that games in these genres are less complex to build.

In contrast, the genres Multiplayer, Shooter, and Free to Play have the highest proportion of 'multiplayer' criticism. This is expected for the Multiplayer genre, and free-to-play and shooter games often support multiplayer or co-op as well, which makes the game more complex.

In the 'content' column, the genres Visual Novel, Story Rich, Puzzle, Anime, and Colorful have high proportions. This suggests that games in these genres are often criticized for having too little content or for being boring or repetitive.





'''
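
The cleaning steps described in this section can be sketched with pandas as follows. This is an illustration on hypothetical toy rows, not the notebook's actual code. It assumes that `tags` is stored as a comma-separated string, which the text does not confirm, and the DataFrame name `games` is made up for the example.

```python
import pandas as pd

# Hypothetical toy rows standing in for the real dataset.
games = pd.DataFrame({
    "appid": ["10", "20", "30"],
    "game_title": ["Game A", "Game B", "Game C"],
    "tags": ["Action,Shooter", "Puzzle", "Visual Novel,Anime"],
    "positive": [30, 0, 90],
    "negative": [70, 0, 10],
    "price": [9.99, 0.0, 19.99],  # extra column, dropped below
})

# Keep only the relevant columns.
relevant = ["appid", "game_title", "tags", "positive", "negative"]
games = games[relevant].copy()

# Recompute user_score as positive / (positive + negative).
# Games with no reviews get NaN instead of a division by zero.
total = games["positive"] + games["negative"]
games["user_score"] = games["positive"] / total.where(total > 0)

# Assumption: tags are comma-separated. Split them into lists,
# then explode so each (game, tag) pair gets its own row.
games["tags"] = games["tags"].str.split(",")
exploded = games.explode("tags").reset_index(drop=True)
```

After exploding, a game with several tags appears once per tag, so any per-game statistic (such as a mean of `user_score`) should be computed before this step or on de-duplicated `appid` values.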


## Step 3: Assessment of Missingness

In [None]:
'''
The column 'full_audio_languages' is likely NMAR because the chance of a value being missing depends on the game's development budget, which is unobserved. Developers sometimes lack the resources for voice acting, or the game does not need voice acting at all. For example, the older Pokemon games have no voice acting; speech is imitated through sound effects.

Missing values in the 'notes' column tell us that the game has no sensitive content. For example, games with any blood or gore should have a note saying so. This means that the column is likely NMAR, because its missingness depends on the contents of the game, which are unobserved. However, I suspect that the missingness of 'notes' also depends on the game's genre. For example, horror and action games are more likely to have blood and violence, and games tagged with sexual content are more likely to have nudity.

Missingness Dependency

Permutation Test 1: notes vs. tags

I performed a permutation test to determine whether the missingness of the column 'notes' depends on the column 'tags'.

Null Hypothesis: The distribution of 'tags' is the same whether or not 'notes' is missing
Alternate Hypothesis: The distribution of 'tags' is different whether or not 'notes' is missing
Significance Level: 0.1

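- Observed TVD: 1995.87 (a TVD between two probability distributions cannot exceed 1, so this value was presumably computed from raw counts rather than proportions)
- P-value: 0.1270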

Since the p-value is greater than the significance level, we cannot reject the null hypothesis. We do not have sufficient evidence that the missingness of 'notes' depends on 'tags'.

Permutation Test 2: notes vs. num_reviews_total

Null Hypothesis: The distribution of 'num_reviews_total' is the same whether or not 'notes' is missing
Alternate Hypothesis: The distribution of 'num_reviews_total' is different whether or not 'notes' is missing
Significance Level: 0.1

- Observed TVD: 6.201174803371181
- P-value: 1.0000

Because the p-value is greater than the significance level, we cannot reject the null hypothesis. We do not have sufficient evidence that the missingness of 'notes' depends on the total number of reviews.
'''
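
The permutation tests above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the notebook's actual code: the function names `tvd` and `missingness_permutation_test` are hypothetical, the toy data is made up, and the TVD is computed on proportions, so it always lies between 0 and 1.

```python
import random
from collections import Counter


def tvd(group_a, group_b):
    """Total variation distance between the category proportions of two groups."""
    counts_a, counts_b = Counter(group_a), Counter(group_b)
    n_a, n_b = len(group_a), len(group_b)
    categories = set(counts_a) | set(counts_b)
    return 0.5 * sum(abs(counts_a[c] / n_a - counts_b[c] / n_b) for c in categories)


def missingness_permutation_test(categories, is_missing, n_permutations=1000, seed=0):
    """Test whether the category distribution differs by missingness.

    Shuffles the missingness flags, recomputes the TVD each time, and returns
    (observed TVD, share of shuffles whose TVD is at least the observed one).
    """
    rng = random.Random(seed)

    def statistic(flags):
        missing = [c for c, m in zip(categories, flags) if m]
        present = [c for c, m in zip(categories, flags) if not m]
        return tvd(missing, present)

    observed = statistic(is_missing)
    flags = list(is_missing)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(flags)  # breaks any link between category and missingness
        if statistic(flags) >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_permutations


# Hypothetical toy data: 'notes' is mostly present for Horror, mostly missing for Puzzle.
toy_tags = ["Horror"] * 40 + ["Puzzle"] * 40
toy_missing = [False] * 35 + [True] * 5 + [True] * 35 + [False] * 5
observed_tvd, p_value = missingness_permutation_test(toy_tags, toy_missing)
```

For a numeric column such as 'num_reviews_total', a statistic like the absolute difference in group means or the Kolmogorov-Smirnov statistic is a more common choice than the TVD, which is defined on categorical distributions.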

Output (truncated): every column printed is of type <class 'str'>: appid, name, release_date, required_age, price, dlc_count, detailed_description, about_the_game, short_description, reviews, header_image, website, support_url, support_email, windows, mac, linux, metacritic_score, metacritic_url, achievements, recommendations, notes, supported_languages, full_audio_languages, packages, developers, publishers, catego

## Step 4: Hypothesis Testing

In [None]:
'''
Null Hypothesis: The mean stability ratio is the same across all genres.
Alternate Hypothesis: At least one genre has a different mean stability ratio.
Test Statistic: F-Statistic
Significance Level: 0.05

F-Stat: 0.818
P-value: 0.999

Since the p-value is greater than the significance level, we cannot reject the null hypothesis. We have no evidence that the mean stability ratio differs across genres. This is consistent with bugs and crashes being encountered at a similar rate, on average, across controversial games of all genres.

Null Hypothesis: The mean political ratio is the same across all genres.
Alternate Hypothesis: At least one genre has a different mean political ratio.

F-Stat: 1.01
P-value: 0.39

Because the p-value is greater than the significance level, we cannot reject the null hypothesis, so we have no evidence that the mean politics ratio differs across genres. However, the p-value might be lower if we restricted the release date to more recent years, because the keywords in the 'politics' category have come into frequent use only in recent years.

Null Hypothesis: Stability criticism is independent of the game's genre
Alternate Hypothesis: Stability criticism is associated with game genre

Chi-squared Statistic: 6831.61
Degrees of Freedom: 778
p-value: 0.0

Because the p-value is lower than the significance level, we can reject the null hypothesis. A game's genre is associated with stability criticism.
'''
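
The two test statistics used in this section can be sketched in plain Python to show what they measure. This is a minimal illustration on made-up numbers, not the notebook's actual code, and the function names `one_way_f` and `chi_squared` are hypothetical. It computes only the statistics; in practice, `scipy.stats.f_oneway` and `scipy.stats.chi2_contingency` return the corresponding p-values as well.

```python
def one_way_f(groups):
    """One-way ANOVA F-statistic: between-group variance over within-group variance."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def chi_squared(table):
    """Chi-squared statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts, for example one row per
    genre and one column per outcome (criticism present / absent).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n  # count expected under independence
            statistic += (observed - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# Hypothetical stability ratios for three genres.
f_stat = one_way_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])

# Hypothetical counts: rows are genres, columns are (stability criticism, no stability criticism).
chi2_stat, chi2_dof = chi_squared([[10, 20], [20, 10]])
```

The two tests answer different questions: the F-test compares per-genre means of a ratio, while the chi-squared test compares counts in a genre-by-criticism table, so they can disagree on the same data, especially when the number of observations is large.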

## Step 5: Framing a Prediction Problem

In [None]:
'''
Can we predict if a game contains bugs?


'''

## Step 6: Baseline Model

In [None]:
# TODO

## Step 7: Final Model

In [None]:
# TODO

## Step 8: Fairness Analysis

In [None]:
# TODO