In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('Raw Data/GAME DATA.csv')

# Clean Data

## Count nulls per column

In [3]:
data.shape

(65111, 14)

In [4]:
data.isna().sum()

App ID                     0
Title                      0
Reviews Total              0
Reviews Score Fancy        0
Release Date               0
Reviews D7             65111
Reviews D30            65111
Reviews D90            65111
Launch Price               0
Tags                       0
name_slug              65111
Revenue Estimated          0
Modified Tags              0
Steam Page                 0
dtype: int64

## Drop empty and unused columns

In [5]:
# The first four columns are entirely null; "Steam Page" is populated but dropped as well.
data = data.drop(columns=["Reviews D7", "Reviews D30", "Reviews D90", "name_slug", "Steam Page"])

In [6]:
data.columns

Index(['App ID', 'Title', 'Reviews Total', 'Reviews Score Fancy',
       'Release Date', 'Launch Price', 'Tags', 'Revenue Estimated',
       'Modified Tags'],
      dtype='object')
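The all-null columns can also be derived from the null counts instead of being typed by hand. The following is a minimal sketch on a small made-up frame (`toy` and its values are illustrative, not taken from the dataset). `dropna(axis=1, how="all")` removes only columns in which every value is missing, so a populated column such as `Steam Page` still needs an explicit `drop`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are illustrative only).
toy = pd.DataFrame({
    "App ID": [1, 2, 3],
    "Reviews D7": [np.nan, np.nan, np.nan],   # entirely null
    "name_slug": [np.nan, np.nan, np.nan],    # entirely null
    "Steam Page": ["a", "b", "c"],            # populated, but not needed
})

# Names of the columns where every value is missing.
all_null_cols = toy.columns[toy.isna().all()].tolist()

# Drop the all-null columns, then the populated column that is not needed.
cleaned = toy.dropna(axis=1, how="all").drop(columns=["Steam Page"])

print(all_null_cols)
print(cleaned.columns.tolist())
```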

## Clean column names

In [7]:
# Rename columns to snake_case; the dataset's own revenue estimate gets a
# "_dataset" suffix in its new name.
data.rename(columns={
    'App ID': 'app_id',
    'Title': 'title',
    'Reviews Total': 'reviews_total',
    'Reviews Score Fancy': 'reviews_score_fancy',
    'Release Date': 'release_date',
    'Launch Price': 'launch_price',
    'Tags': 'tags',
    'Revenue Estimated': 'revenue_estimated_dataset',
    'Modified Tags': 'modified_tags',
}, inplace=True)
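Most of these renames follow one rule: lowercase the label and replace spaces with underscores. As a sketch, that rule could be applied by passing a function to `rename`. The helper `to_snake_case` is hypothetical and not part of the notebook. It would turn `Revenue Estimated` into `revenue_estimated`, so the `_dataset` suffix would still need an explicit mapping.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Lowercase a label and replace runs of non-alphanumerics with '_'."""
    return re.sub(r"[^0-9a-zA-Z]+", "_", name.strip()).strip("_").lower()

# Empty frame with a few of the original column labels.
toy = pd.DataFrame(columns=["App ID", "Reviews Score Fancy", "Launch Price"])

# rename() accepts a callable and applies it to every column label.
toy = toy.rename(columns=to_snake_case)
print(toy.columns.tolist())
```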


## Drop games with too few reviews

### Review threshold justification:

The dataset will be joined with games that have reached the top 200 in Twitch viewership since 2016.

Games with fewer than 500 reviews are not expected to have reached the top 200 in Twitch viewership.

In [8]:
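review_threshold = 500
# Keep only games with at least `review_threshold` reviews.
# Note: the variable name says 5000, but the threshold applied is 500.
drop_lt_5000_reviews = data[data['reviews_total'] >= review_threshold]
print(drop_lt_5000_reviews.shape)
drop_lt_5000_reviews.tail(3)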

(6557, 9)


| |app_id|title|reviews_total|reviews_score_fancy|release_date|launch_price|tags|revenue_estimated_dataset|modified_tags|
|---|---|---|---|---|---|---|---|---|---|
|6554|403950|Conquest of Elysium 4|500|89%|2015-11-16|$24,99|Strategy, Indie, Turn Based, Fantasy, Turn Bas...|$12 495,00|Strategy_, Indie_, Turn Based_, Fantasy_, Turn...|
|6555|1604380|Hamidashi Creative|500|97%|2022-09-30|$29,99|Adventure, Casual, Visual Novel, Sexual Conten...|$14 995,00|Adventure_, Casual_, Visual Novel_, Sexual Con...|
|6556|567670|Serious Sam 3 VR: BFE|500|86%|2017-11-09|$39,99|Action, Indie, VR, Gore, FPS, First Person|$19 995,00|Action_, Indie_, VR_, Gore_, FPS_, First Person_|


## Clean column datatypes

In [9]:
drop_lt_5000_reviews.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6557 entries, 0 to 6556
Data columns (total 9 columns):
 #   Column                     Non-Null Count  Dtype 
---  ------                     --------------  ----- 
 0   app_id                     6557 non-null   int64 
 1   title                      6557 non-null   object
 2   reviews_total              6557 non-null   int64 
 3   reviews_score_fancy        6557 non-null   object
 4   release_date               6557 non-null   object
 5   launch_price               6557 non-null   object
 6   tags                       6557 non-null   object
 7   revenue_estimated_dataset  6557 non-null   object
 8   modified_tags              6557 non-null   object
dtypes: int64(2), object(7)
memory usage: 512.3+ KB
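The `info()` output shows that the score, date, price, and revenue columns are all stored as `object`. Below is a minimal sketch of one way to convert them, assuming every row follows the formats visible in the `tail(3)` output above (`89%`, `2015-11-16`, `$24,99`, `$12 495,00`). The helper `parse_money` is hypothetical, and the sample frame holds only those three rows.

```python
import pandas as pd

# Three rows copied from the tail(3) output above.
toy = pd.DataFrame({
    "reviews_score_fancy": ["89%", "97%", "86%"],
    "release_date": ["2015-11-16", "2022-09-30", "2017-11-09"],
    "launch_price": ["$24,99", "$29,99", "$39,99"],
    "revenue_estimated_dataset": ["$12 495,00", "$14 995,00", "$19 995,00"],
})

def parse_money(col: pd.Series) -> pd.Series:
    """Hypothetical helper: '$12 495,00' -> 12495.0 (comma is the decimal mark)."""
    cleaned = (
        col.str.replace(r"[^\d,]", "", regex=True)  # drop '$' and any spaces
           .str.replace(",", ".", regex=False)      # decimal comma -> point
    )
    return pd.to_numeric(cleaned)

# '89%' -> 89
toy["reviews_score_fancy"] = pd.to_numeric(toy["reviews_score_fancy"].str.rstrip("%"))
# ISO date strings -> datetime64
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%Y-%m-%d")
toy["launch_price"] = parse_money(toy["launch_price"])
toy["revenue_estimated_dataset"] = parse_money(toy["revenue_estimated_dataset"])

print(toy.dtypes)
```

When applying the same steps to a filtered frame such as `drop_lt_5000_reviews`, taking a `.copy()` first avoids pandas' chained-assignment warning.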



Revenue will be estimated with the following formula:

`total_revenue = reviews_total * launch_price * conversion_coefficient`

### Conversion coefficient justification:

The conversion coefficient changes from year to year. The VG Insights article covers 2013-2021. A dataframe is made to hold the conversion coefficient for each year.
When doing a visualization or comparison, the coefficient for the most recent year should be used.

A coefficient of 32 will be used, as it is the coefficient for the current year. Using a single static value prevents confusion when comparing across years and visualizations.
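As a worked example of the formula, the sketch below applies the coefficient of 32 to one row from the filtered data (Conquest of Elysium 4: 500 reviews, launch price 24.99). The function name `estimate_revenue` and the constant name are assumptions made for illustration.

```python
CONVERSION_COEFFICIENT = 32  # static value chosen in the text above

def estimate_revenue(reviews_total: int, launch_price: float,
                     coefficient: float = CONVERSION_COEFFICIENT) -> float:
    """Return reviews_total * launch_price * coefficient."""
    return reviews_total * launch_price * coefficient

# Conquest of Elysium 4: 500 reviews at a launch price of 24.99
estimate = estimate_revenue(500, 24.99)
print(round(estimate, 2))
```

For this row the estimate is 500 × 24.99 × 32 = 399,840. The dataset's own `revenue_estimated_dataset` value for the same row, $12 495,00, equals 500 × 24.99 with no coefficient applied.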

#### Sources
Article Title| Source with hyperlink
---|---
How to Estimate Steam Video Game Sales?|[VG Insights](https://vginsights.com/insights/article/how-to-estimate-steam-video-game-sales)
What 'Steam review count' tells us about your game| [GameDiscoverCo newsletter](https://newsletter.gamediscover.co/p/what-steam-review-count-tells-us)
How that game sold on Steam, using the 'NB number'.|[GameDiscoverCo newsletter](https://newsletter.gamediscover.co/p/how-that-game-sold-on-steam-using)