# Steam's videogames platform 👾

 
## Company's description 📇
Steam is a video game digital distribution service and storefront from Valve. It was launched as a software client in September 2003 to provide game updates automatically for Valve's games, and expanded to distributing third-party titles in late 2005. Steam offers various features, like digital rights management (DRM), game server matchmaking with Valve Anti-Cheat measures, social networking, and game streaming services. Steam client's functions include game update automation, cloud storage for game progress, and community features such as direct messaging, in-game overlay functions and a virtual collectable marketplace.

## Project 🚧
You're working for Ubisoft, a French video game publisher. They'd like to release a revolutionary new video game! They asked you to conduct a global analysis of the games available on Steam's marketplace in order to better understand the video game ecosystem and today's trends.

## Goals 🎯
The ultimate goal of this project is to understand what factors affect the popularity or sales of a video game. But your boss asked you to take advantage of this opportunity to analyze the video game market globally.


In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Row, Column

Let's start!

In [0]:
data_file_path = "s3://full-stack-bigdata-datasets/Big_Data/Project_Steam/steam_game_output.json"

In [0]:
df = spark.read.format("json").load(data_file_path)
df.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- appid: long (nullable = true)
 |    |-- categories: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ccu: long (nullable = true)
 |    |-- developer: string (nullable = true)
 |    |-- discount: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- header_image: string (nullable = true)
 |    |-- initialprice: string (nullable = true)
 |    |-- languages: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- negative: long (nullable = true)
 |    |-- owners: string (nullable = true)
 |    |-- platforms: struct (nullable = true)
 |    |    |-- linux: boolean (nullable = true)
 |    |    |-- mac: boolean (nullable = true)
 |    |    |-- windows: boolean (nullable = true)
 |    |-- positive: long (nullable = true)
 |    |-- price: string (nullable = true)
 |    |-- publisher: string (nullable = true)
 |    |-- release_date: string (nullable = true)
 |    |-

In [0]:
df.count()

Out[8]: 55691

---

A small helper function to encapsulate some repetitive, unfriendly code:

In [0]:
def get_data_col(column_name: str) -> Column:
    return F.col("data").getField(column_name)

## Which publisher has released the most games on Steam?

In [0]:

# get data
publisher_df = df.groupBy(get_data_col("publisher")).count().orderBy(F.col("count").desc())

# visualisation
display(publisher_df.head(1))


data[publisher],count
Big Fish Games,422


Big Fish Games has published the most games on Steam (422).

## What are the best rated games?

Assuming there are only two types of rating (positive or negative), I chose to define "best rated" as having the highest number of positive ratings.
As a consequence, a game with 100% positive ratings will not rank first if a game with 50% positive ratings has more positive ratings in total.

In [0]:
top_rated_df = df.orderBy(get_data_col("positive").desc()).select(get_data_col("name"), get_data_col("positive"))

display(top_rated_df.head(10))


data.name,data.positive
Counter-Strike: Global Offensive,5943345
Dota 2,1534895
Grand Theft Auto V,1229265
PUBG: BATTLEGROUNDS,1185361
Terraria,1014711
Tom Clancy's Rainbow Six Siege,942910
Garry's Mod,861240
Team Fortress 2,846407
Rust,732513
Left 4 Dead 2,643836
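
The ranking above rewards sheer volume. As a hedged illustration of an alternative (not the method used in this notebook), the sketch below ranks games by the lower bound of the Wilson score interval, which balances the share of positive ratings against the number of ratings. The function name `wilson_lower_bound` and the three games are hypothetical and are not taken from the dataset.

```python
import math

def wilson_lower_bound(positive: int, negative: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for the share of positive ratings.

    z = 1.96 corresponds to a 95% confidence level.
    """
    total = positive + negative
    if total == 0:
        return 0.0
    p_hat = positive / total
    denominator = 1 + z * z / total
    centre = p_hat + z * z / (2 * total)
    margin = z * math.sqrt((p_hat * (1 - p_hat) + z * z / (4 * total)) / total)
    return (centre - margin) / denominator

# Hypothetical games (NOT from the dataset): (name, positive, negative)
games = [
    ("Huge but divisive", 500_000, 500_000),
    ("Small but loved", 95, 5),
    ("Tiny sample", 3, 0),
]

# Ranking used in this notebook: raw number of positive ratings
by_count = sorted(games, key=lambda g: g[1], reverse=True)

# Alternative ranking: Wilson lower bound on the positive share
by_wilson = sorted(games, key=lambda g: wilson_lower_bound(g[1], g[2]), reverse=True)

print([name for name, _, _ in by_count])
print([name for name, _, _ in by_wilson])
```

With this score, a game with three positive ratings and no negative ones no longer outranks a game with a large, mostly positive rating history. The same formula could be expressed with Spark column functions on `data.positive` and `data.negative`.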


## Are there years with more releases?

In [0]:
release_df = df.withColumn("release_year", F.split(get_data_col("release_date"),"/")[0])
release_df = release_df.groupBy("release_year").count().orderBy("release_year")

display(release_df)


release_year,count
,99
1997.0,2
1998.0,1
1999.0,3
2000.0,2
2001.0,4
2002.0,1
2003.0,3
2004.0,6
2005.0,6


There is a very sharp increase from 2014 onwards, leading up to 2021 with 8823 releases, before starting to decline in 2022.
Without data from 2023 and 2024, it's difficult to say whether the numerous releases of 2020 and 2021 were driven by COVID-19 or whether, on the contrary, there was a crisis in 2022 unrelated to the “post-COVID” period.

## How are the prices distributed?

In [0]:
# price is stored as a string, so cast it explicitly before aggregating
price_col = get_data_col("price").cast("double")
price_stats_df = df.select(F.mean(price_col).alias("average_price"),
                           F.stddev(price_col).alias("price_standard_deviation"))

display(price_stats_df)


average_price,price_standard_deviation
773.2849832109317,1093.13458272345
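
A standard deviation larger than the mean hints at a right-skewed distribution, which a mean alone describes poorly. The sketch below shows how quartiles and the share of free games could complement these two figures. It assumes that prices are stored as strings expressed in cents, which the magnitude of the average suggests but the source does not state. The sample values and the `summarize_prices` helper are hypothetical.

```python
import statistics

# Hypothetical prices (NOT from the dataset), assumed to be strings in cents
raw_prices = ["0", "99", "499", "999", "999", "1999", "5999"]

def summarize_prices(prices_in_cents):
    """Convert cent strings to currency units and compute summary statistics."""
    prices = [int(p) / 100 for p in prices_in_cents]
    q1, median, q3 = statistics.quantiles(prices, n=4)  # three quartile cut points
    return {
        "mean": statistics.mean(prices),
        "stdev": statistics.stdev(prices),
        "q1": q1,
        "median": median,
        "q3": q3,
        "free_share": sum(1 for p in prices if p == 0) / len(prices),
    }

summary = summarize_prices(raw_prices)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

In this toy sample the mean sits well above the median because a single expensive title pulls it up, which is the pattern the large standard deviation above hints at. In Spark, comparable quartiles could be obtained with `DataFrame.approxQuantile`.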


## Are there many games with a discount?

In [0]:
# discount is stored as a string, so cast it before comparing
discounted_games_count = df.filter(get_data_col("discount").cast("int") > 0).count()
total_games_count = df.count()

discount_percentage = (discounted_games_count / total_games_count) * 100
print(f"Percentage of games with a discount: {discount_percentage:.2f}%")

Percentage of games with a discount: 4.52%


## What are the most represented languages?

In [0]:
# get data
languages_df = df.withColumn("languages", get_data_col("languages"))

# pre-processing
languages_df = languages_df.withColumn("split_languages", F.split(F.col("languages"), r",\s*"))
languages_df = languages_df.withColumn("language", F.explode(F.col("split_languages")))

# processing
languages_df = languages_df.groupBy("language").count().orderBy(F.col("count").desc())

# visualisation
display(languages_df.head(10))

language,count
English,55116
German,14019
French,13426
Russian,12922
Simplified Chinese,12782
Spanish - Spain,12233
Japanese,10368
Italian,9304
Portuguese - Brazil,6750
Korean,6600


**English** is by far the most represented language (almost 4 times more than **German**, the second most represented language).
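
To put these counts in perspective, the short calculation below relates them to the 55691 games in the dataset. It assumes each game lists a given language at most once, so a count equals a number of games. The helper name `share_of_catalogue` is hypothetical.

```python
TOTAL_GAMES = 55691  # result of df.count() earlier in the notebook

# Counts copied from the table above (top three languages only)
language_counts = {"English": 55116, "German": 14019, "French": 13426}

def share_of_catalogue(count: int, total: int = TOTAL_GAMES) -> float:
    """Percentage of games in the catalogue that support a language."""
    return count / total * 100

shares = {lang: share_of_catalogue(count) for lang, count in language_counts.items()}
english_to_german = language_counts["English"] / language_counts["German"]

for lang, share in shares.items():
    print(f"{lang}: {share:.2f}% of games")
print(f"English/German ratio: {english_to_german:.2f}")
```

Under that assumption, English is supported by roughly 99% of games, while German, the second most represented language, is supported by about a quarter of them.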

## Are there many games prohibited for children under 16/18?

In [0]:
age_df = df.filter((get_data_col("required_age") == 16) | (get_data_col("required_age") == 18))

age_restricted_count = age_df.count()
percentage = (age_restricted_count / df.count()) * 100

print(f"Number of games restricted to ages 16+ or 18+: {age_restricted_count} ({percentage:.2f}% of the platform's games)")


Number of games restricted to ages 16+ or 18+: 261 (0.47% of the platform's games)


## Are most games available on Windows/Mac/Linux?

In [0]:
# get data
platform_df = df.withColumn("platforms", get_data_col("platforms"))\
    .withColumn("Linux", F.col("platforms.linux"))\
    .withColumn("Mac", F.col("platforms.mac"))\
    .withColumn("Windows", F.col("platforms.windows"))

# processing: the platform flags are booleans, so they can be used directly as filter conditions
count_mac = platform_df.filter(F.col("Mac")).count()
count_linux = platform_df.filter(F.col("Linux")).count()
count_windows = platform_df.filter(F.col("Windows")).count()
total_count = df.count()

mac_percentage = (count_mac / total_count) * 100
linux_percentage = (count_linux / total_count) * 100
windows_percentage = (count_windows / total_count) * 100

# visualisation
print(f"There are {mac_percentage:.2f}% games on MacOS, {linux_percentage:.2f}% games on Linux, and {windows_percentage:.2f}% games available on Windows.")


There are 22.93% games on MacOS, 15.19% games on Linux, and 99.97% games available on Windows.


## What are the most represented genres?

In [0]:
genre_df = df.withColumn("genres", get_data_col("genre"))
genre_df = genre_df.withColumn("split_genres", F.split(F.col("genres"), ", "))
genre_df = genre_df.withColumn("genre", F.explode(F.col("split_genres")))

genre_df = genre_df.groupBy("genre").count().orderBy(F.col("count").desc())

display(genre_df.head(10))


genre,count
Indie,39681
Action,23759
Casual,22086
Adventure,21431
Strategy,10895
Simulation,10836
RPG,9534
Early Access,6145
Free to Play,3393
Sports,2666


# Conclusion

The market remains dominated by **games available on Windows** (99.97%), with limited options for macOS (22.93%) and Linux (15.19%). **Indie, Action, Casual** and **Adventure genres are the most represented**, underlining their appeal to developers and gamers alike. However, **only 0.47% of games are age-restricted**, and **only 4.52% are discounted**, which could indicate untapped opportunities in terms of demographic targeting and pricing strategies.

Finally, **English is widely preferred as the main language**, indicating a strong orientation towards an international audience.