# Steam's videogames platform 👾

 
## Company's description 📇
Steam is a video game digital distribution service and storefront from Valve. It was launched as a software client in September 2003 to provide automatic updates for Valve's games, and expanded to distributing third-party titles in late 2005. Steam offers various features, such as digital rights management (DRM), game server matchmaking with Valve Anti-Cheat measures, social networking, and game streaming services. The Steam client's functions include game update automation, cloud storage for game progress, and community features such as direct messaging, in-game overlay functions, and a virtual collectable marketplace.

## Project 🚧
You're working for Ubisoft, a French video game publisher, which would like to release a revolutionary new video game. They have asked you to conduct a global analysis of the games available on Steam's marketplace in order to better understand the video game ecosystem and today's trends.

## Goals 🎯
The ultimate goal of this project is to understand which factors affect the popularity or sales of a video game. Your boss has also asked you to take this opportunity to analyze the video game market as a whole.


In [0]:
from pyspark.sql import functions as F
from pyspark.sql import Row
import pandas as pd



In [0]:
data_file_path = "s3://full-stack-bigdata-datasets/Big_Data/Project_Steam/steam_game_output.json"

In [0]:
df = spark.read.format("json").load(data_file_path)
df.printSchema()

root
 |-- data: struct (nullable = true)
 |    |-- appid: long (nullable = true)
 |    |-- categories: array (nullable = true)
 |    |    |-- element: string (containsNull = true)
 |    |-- ccu: long (nullable = true)
 |    |-- developer: string (nullable = true)
 |    |-- discount: string (nullable = true)
 |    |-- genre: string (nullable = true)
 |    |-- header_image: string (nullable = true)
 |    |-- initialprice: string (nullable = true)
 |    |-- languages: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- negative: long (nullable = true)
 |    |-- owners: string (nullable = true)
 |    |-- platforms: struct (nullable = true)
 |    |    |-- linux: boolean (nullable = true)
 |    |    |-- mac: boolean (nullable = true)
 |    |    |-- windows: boolean (nullable = true)
 |    |-- positive: long (nullable = true)
 |    |-- price: string (nullable = true)
 |    |-- publisher: string (nullable = true)
 |    |-- release_date: string (nullable = true)
 |    |-

In [0]:
df.count()

Out[5]: 55691

## Which publisher has released the most games on Steam?

In [0]:

# get data
publisher_df = df.withColumn("publisher", F.col("data").getField("publisher"))

# processing
publisher_df = publisher_df.groupBy("publisher").count().orderBy(F.col("count").desc())

# visualisation
display(publisher_df.head(1))


publisher,count
Big Fish Games,422


Big Fish Games is the publisher that has released the most games on Steam (422).

## What are the best rated games?
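
One possible way to approach this question, sketched in plain Python rather than Spark: the schema exposes `positive` and `negative` review counts, so a game's rating can be approximated as the share of positive reviews. The sample records, the `MIN_REVIEWS` threshold, and the helper names below are assumptions made for illustration only; they are not taken from the dataset or from the notebook's author.

```python
# Hypothetical records shaped like the `data` struct (name, positive, negative).
games = [
    {"name": "Game A", "positive": 950, "negative": 50},
    {"name": "Game B", "positive": 3, "negative": 0},
    {"name": "Game C", "positive": 800, "negative": 200},
]

# Assumed threshold: ignore games with too few reviews to be meaningful.
MIN_REVIEWS = 100


def positive_share(positive, negative):
    """Return the fraction of positive reviews, or None when there are no reviews."""
    total = positive + negative
    return positive / total if total else None


def rank_games(records, min_reviews=MIN_REVIEWS):
    """Keep games with enough reviews and sort them by positive share, best first."""
    eligible = [
        g for g in records if g["positive"] + g["negative"] >= min_reviews
    ]
    return sorted(
        eligible,
        key=lambda g: positive_share(g["positive"], g["negative"]),
        reverse=True,
    )


ranked = rank_games(games)
for game in ranked:
    print(game["name"], positive_share(game["positive"], game["negative"]))
```

The review threshold matters: without it, a game with three positive reviews and no negative ones would outrank every widely reviewed title. The same ratio could presumably be computed in Spark from `data.positive` and `data.negative`, but that variant has not been run against this dataset.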

## Are there years with more releases?

## How are the prices distributed?

## What are the most represented languages?

In [0]:
# get data
languages_df = df.withColumn("languages", F.col("data").getField("languages"))

# pre-processing
languages_df = languages_df.withColumn("split_languages", F.split(F.col("languages"), r",\s*"))
languages_df = languages_df.withColumn("language", F.explode(F.col("split_languages")))

# processing
languages_df = languages_df.groupBy("language").count().orderBy(F.col("count").desc())

# visualisation
display(languages_df.head(10))

language,count
English,55116
German,14019
French,13426
Russian,12922
Simplified Chinese,12782
Spanish - Spain,12233
Japanese,10368
Italian,9304
Portuguese - Brazil,6750
Korean,6600
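
To illustrate the split-and-count logic used above outside of Spark, here is a minimal plain-Python sketch. It applies the same pattern (a comma followed by optional whitespace) to a few made-up strings; the sample values and the `count_languages` helper are hypothetical and only meant to show why the whitespace part of the pattern matters.

```python
import re
from collections import Counter

# Hypothetical values shaped like the `languages` field (not taken from the dataset).
sample_languages = [
    "English, French,German",
    "English,  Japanese",
    "English",
]


def count_languages(rows):
    """Split each comma-separated string and count how often each language appears."""
    counts = Counter()
    for row in rows:
        # A comma followed by any amount of whitespace, as in the Spark cell.
        for language in re.split(r",\s*", row):
            if language:  # skip empty strings produced by blank values
                counts[language] += 1
    return counts


language_counts = count_languages(sample_languages)
print(language_counts.most_common())
```

Splitting on a bare comma would leave entries such as `" French"` with a leading space, which would then be counted separately from `"French"`.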


## Are there many games prohibited for children under 16/18?

## Are most games available on Windows/Mac/Linux?

In [0]:
# get data
platform_df = df.withColumn("platforms", F.col("data").getField("platforms"))\
    .withColumn("Linux", F.col("platforms.Linux"))\
    .withColumn("Mac",F.col("platforms.Mac"))\
    .withColumn("Windows",F.col("platforms.Windows"))\
    .withColumn("genre",F.col("data").getField("genre"))\
    .drop("id")  

# processing
# the platform columns are booleans, so they can be used directly as filter conditions
count_mac = platform_df.filter(F.col("Mac")).count()
count_linux = platform_df.filter(F.col("Linux")).count()
count_windows = platform_df.filter(F.col("Windows")).count()
total_count = df.count()

mac_percentage = int((count_mac / total_count) * 100)
linux_percentage = int((count_linux / total_count) * 100)
windows_percentage = int((count_windows / total_count) * 100)

# visualisation
print(f"There are {mac_percentage}% games on MacOS, {linux_percentage}% games on Linux, and {windows_percentage}% games available on Windows.")


There are 22% games on MacOS, 15% games on Linux, and 99% games available on Windows.
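
Note that `int()` truncates toward zero instead of rounding, so the figures above are rounded down to whole percents. The short sketch below contrasts truncation with rounding; the counts and the `platform_share` helper are hypothetical and chosen only for illustration, not derived from the dataset.

```python
def platform_share(count, total, ndigits=1):
    """Return `count` as a percentage of `total`, rounded to `ndigits` decimals."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(count / total * 100, ndigits)


# Hypothetical counts; they are not the real numbers behind the output above.
total_games = 1000
mac_games = 229

truncated = int(mac_games / total_games * 100)    # drops the fractional part
rounded = platform_share(mac_games, total_games)  # keeps one decimal place

print(truncated, rounded)
```

Whether whole percents are precise enough depends on how the figures will be used; for a quick overview, truncation is usually acceptable.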


## What are the most represented genres?

## What are the most lucrative genres?