# BigData Final Project | Steam
## <font color = 'blue'> Notebook3 | Feature Engineering </font>
### Team Members: Jim Fang, WooJong Choi, Han Jeon, Tam Nguyen

June 2020
___

### Construct a comprehensive table for players
Join 6 tables together: game2_df, app_id_info, game_genre, friends, groups, player_summary

### <font color ='blue'> CHALLENGE</font>
The game2_df data alone has 100 million rows and is longitudinal. If we join that table with app_id_info, game_genre, ... we would end up with 7.6 billion rows (~ XXX Gb). We therefore need a different way to join the tables that keeps the size manageable and speeds up processing.

###  <font color ='blue'> SOLUTION</font>

> Use a column-wise approach: increase the number of columns, not the number of rows:
> - Reconstruct the tables before joining them, i.e. one-hot encode and aggregate game_genre, total play time, number of active and inactive years, ...
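
The difference between the two join strategies can be sketched on toy data. The block below is plain Python rather than Spark code, and the players, games and genres in it are made-up values used only for illustration. It shows that a row-wise join repeats each ownership row once per genre, while the column-wise approach keeps one record per player.

```python
from collections import defaultdict

# Toy data (hypothetical): one row per (player, game) and one row per (game, genre).
ownership = [("p1", 10), ("p1", 20), ("p2", 10)]
game_genre = [(10, "Action"), (10, "Indie"),
              (20, "Action"), (20, "RPG"), (20, "Indie")]

# Row-wise join: every ownership row is repeated once per genre of that game.
row_wise = [(player, game, genre)
            for player, game in ownership
            for genre_game, genre in game_genre
            if game == genre_game]

# Column-wise: first collapse the genres into one 0/1 record per game ...
genres = sorted({genre for _, genre in game_genre})
one_hot = defaultdict(lambda: dict.fromkeys(genres, 0))
for game, genre in game_genre:
    one_hot[game][genre] = 1

# ... then sum the indicator columns per player, giving one record per player.
per_player = defaultdict(lambda: dict.fromkeys(genres, 0))
for player, game in ownership:
    for genre in genres:
        per_player[player][genre] += one_hot[game][genre]

print("rows after row-wise join:", len(row_wise))
print("records after column-wise aggregation:", len(per_player))
print(dict(per_player))
```

The row-wise result grows with the number of genres per game, whereas the column-wise result grows only in width (one column per genre).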


---
## I. Import Libraries

In [1]:
from pyspark import SparkContext, SparkConf
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

from pyspark.sql import functions as F
import pyspark.sql.types as t
from pyspark.sql.functions import broadcast
from pyspark.sql.functions import regexp_replace
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import isnan, when, count, col, size
from pyspark.sql.functions import year, month, dayofmonth
from pyspark.sql.functions import length
from pyspark.sql import functions as sf
from pyspark.sql.functions import collect_set, collect_list, array_contains
from pyspark.sql.functions import substring
from functools import reduce

import pandas as pd
import matplotlib.pyplot as plt
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation


%matplotlib inline

In [2]:
spark = SparkSession.builder.enableHiveSupport().appName('FeatureEngineering').getOrCreate()
sc = spark.sparkContext

In [3]:
!hdfs dfs -ls /user/tamng/jwht/CleanData

Found 22 items
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:38 /user/tamng/jwht/CleanData/app_id_info.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 22:38 /user/tamng/jwht/CleanData/app_if_info_PosReview.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 13:42 /user/tamng/jwht/CleanData/friends.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 15:10 /user/tamng/jwht/CleanData/game2_df.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 14:12 /user/tamng/jwht/CleanData/game_dgp.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:41 /user/tamng/jwht/CleanData/games_developer.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:47 /user/tamng/jwht/CleanData/games_genres.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:49 /user/tamng/jwht/CleanData/games_publisher.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:55 /user/tamng/jwht/CleanData/groups.csv
drwxrwxrwx   - tamng tamng          0 2020-06-01 12:17 /user/tamng/jwht/CleanData/player_summary_to

---
## II. Create Function

In [4]:
def check_missing(df):
    ''' Check missing value'''
    df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

In [5]:
def rename_col(df, newColumns):
    ''' Rename all columns        
        Note: newColumns is a list of columns name '''
    oldColumns = df.schema.names
    df = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df)
    return df
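
The `reduce` pattern used in `rename_col` threads one object through a sequence of rename calls. The sketch below illustrates that fold without Spark; `FakeFrame` is a hypothetical stand-in that only tracks column names and is not part of PySpark.

```python
from functools import reduce


class FakeFrame:
    """Hypothetical stand-in for a DataFrame; it only tracks column names."""

    def __init__(self, columns):
        self.columns = list(columns)

    def withColumnRenamed(self, old, new):
        # Return a new object, mirroring the immutable DataFrame style.
        return FakeFrame([new if c == old else c for c in self.columns])


old_cols = ["a", "b", "c"]
new_cols = ["x", "y", "z"]
df = FakeFrame(old_cols)

# reduce passes the frame through one rename per column index.
renamed = reduce(
    lambda frame, idx: frame.withColumnRenamed(old_cols[idx], new_cols[idx]),
    range(len(old_cols)),
    df,
)
print(renamed.columns)
```

In PySpark itself, `df.toDF(*newColumns)` is a shorter way to rename every column in one call, and may be worth considering as an alternative.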

In [6]:
def basic_info(df):    
    '''
        Print a basic description of a table, including:
        1. Total rows / observations
        2. Missing values by column
        3. The first 3 rows
        4. Basic descriptive statistics
        5. Distinct count by column
    '''
    print('TOTAL ROWS:', df.count())
    print('\n')
    print('*-------------'*5)
    print('\n')
    print('MISSING VALUE:')
    check_missing(df)
    print('*-------------'*5)
    print('\n')
    print('PRINT OUT THE 1st 3 LINES:')
    df.show(3, truncate = True)
    print('*-------------'*5)
    print('\n')
    print('TABLE BASIC DESCRIPTION:')
    df.describe().show(10,truncate = True)
    print('*-------------'*5)
    distinct_count = []
    column_name = df.columns
    for i in column_name:
        distinct_count.append(df.select(col(i)).distinct().count())

    print('DISTINCT COUNT BY COLUMN:')
    print('\n')
    print(pd.DataFrame(zip(column_name,distinct_count)).\
      rename(columns={0:'column_name', 1:'distinct_count'}))

In [8]:
# Copy tag cols to new cols
def copy_col(df, copy_cols):
    '''
        df: data
        copy_cols: list of columns that need to copy
    '''
    df = reduce(lambda df, idx: df.withColumn(copy_cols[idx], df.genre), range(len(copy_cols)), df)
    return df

In [9]:
# Encode cols to 0 and 1
def encode_col(df, cols):
    '''
        Encode boolean indicator columns as numbers for one-hot encoding:
        if value == True >> 1.0, otherwise 0.0
    '''
    df = reduce(lambda df, idx: df.withColumn(cols[idx], F.when(F.col(cols[idx])== True, F.lit(1.0)).otherwise(F.lit(0.0))), range(len(cols)), df)
    return df

-----

## III. Import data

### 1. game2_df

In [7]:
game2_df = spark.read.csv('/user/tamng/jwht/CleanData/game2_df.csv', inferSchema = True, header = True)

In [8]:
game2_df.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- app_id: integer (nullable = true)
 |-- playtime_2weeks: integer (nullable = true)
 |-- playtime_forever: integer (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



### 2. app_id_info

In [9]:
app_id_info = spark.read.csv('/user/tamng/jwht/CleanData/app_if_info_PosReview.csv', inferSchema = True, header = True)

In [10]:
app_id_info.printSchema()

root
 |-- app_id: integer (nullable = true)
 |-- title: string (nullable = true)
 |-- type: string (nullable = true)
 |-- price: string (nullable = true)
 |-- releaseDate: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- ageRequirement: integer (nullable = true)
 |-- isMultiplayer: integer (nullable = true)
 |-- positiveReviewPercent: integer (nullable = true)



### 3. game_dgp

In [11]:
game_dgp = spark.read.csv('/user/tamng/jwht/CleanData/game_dgp.csv', inferSchema = True, header = True)

In [12]:
game_dgp.printSchema()

root
 |-- app_id: integer (nullable = true)
 |-- gamesDeveloper: string (nullable = true)
 |-- gamesGenre: string (nullable = true)
 |-- gamesPublisher: string (nullable = true)



### 4. groups

In [64]:
groups = spark.read.csv('/user/tamng/jwht/CleanData/groups.csv', inferSchema = True, header = True)

In [65]:
groups.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- group_id: integer (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



### 5. friends

In [173]:
friends = spark.read.csv('/user/tamng/jwht/CleanData/friends.csv', inferSchema = True, header = True)

In [174]:
friends.printSchema()

root
 |-- steam_id_a: long (nullable = true)
 |-- steam_id_b: long (nullable = true)
 |-- relationship: string (nullable = true)
 |-- friend_since: timestamp (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



### 6. player_summary

In [91]:
player_summary = spark.read.csv('/user/tamng/jwht/CleanData/player_summary_total.csv',\
                                inferSchema = True, header = True)

In [92]:
player_summary.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: integer (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



---
## IV. Run table description across all the data

In [22]:
# Create the name for each dataframe
game2_df.name = 'game2_df'
app_id_info.name = 'app_id_info'
game_dgp.name = 'game_dgp'
friends.name = 'friends'
groups.name = 'groups'

# In case we want to print out the basic information of all tables:
tables = [game2_df, app_id_info, game_dgp, friends, groups]
for tab in tables:
    print('TABLE NAME:', tab.name)
    basic_info(tab)  # prints its own report and returns None
    print('\n')

TABLE NAME: game2_df
TOTAL ROWS: 100000000


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+------+---------------+----------------+-------------+
|steam_id|app_id|playtime_2weeks|playtime_forever|dateretrieved|
+--------+------+---------------+----------------+-------------+
|       0|     0|              0|               0|            0|
+--------+------+---------------+----------------+-------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+-----------------+------+---------------+----------------+-------------------+
|         steam_id|app_id|playtime_2weeks|playtime_forever|      dateretrieved|
+-----------------+------+---------------+----------------+-------------------+
|76561197960265729|    10|              0|               0|2014-08-14 14:04:18|
|76561197960265729|    20|              0|               0|2014-08-14 14:04:18|
|76561197960265729|    30|              0

In [23]:
basic_info(player_summary)

TOTAL ROWS: 9881300


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+-----------+-----------+------------+--------------------------+-------------+-----------+------------------+---------+--------------+------------+-------+-------------+--------------+-------+------------+----------+
|steam_id|person_name|profile_url|person_state|community_visibility_state|profile_state|last_logoff|comment_permission|real_name|primary_clanid|time_created|game_id|gameserver_ip|game_extrainfo|city_id|country_code|state_code|
+--------+-----------+-----------+------------+--------------------------+-------------+-----------+------------------+---------+--------------+------------+-------+-------------+--------------+-------+------------+----------+
|       0|         50|          0|           0|                         0|            0|          0|                 0|       18|             0|           0|      0|            0|             0|      0|       

__Quick check of the `type` values in app_id_info__

In [25]:
app_id_info.groupBy("type").count().sort("count", ascending = False).show(20)

+--------------+-----+
|          type|count|
+--------------+-----+
|          game| 9085|
|           dlc| 7235|
|          demo| 1066|
|         video|  346|
|           mod|   38|
|      hardware|   12|
|no subtitles)"|    1|
+--------------+-----+



---
## V. Reconstruct table & Feature Engineering

### 1. game_app_info: combining game2_df and app_id_info

In [18]:
game_app_info = game2_df.join(broadcast(app_id_info), ["app_id"], how='left')
game_app_info.limit(10).toPandas()

Unnamed: 0,app_id,steam_id,playtime_2weeks,playtime_forever,dateretrieved,title,type,price,releaseDate,rating,ageRequirement,isMultiplayer,positiveReviewPercent
0,10,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike,game,9.99,11/1/2000 0:00,88,0,1,96
1,20,76561197960265729,0,0,2014-08-14 14:04:18,Team Fortress Classic,game,4.99,4/1/1999 0:00,-1,0,1,82
2,30,76561197960265729,0,0,2014-08-14 14:04:18,Day of Defeat,game,4.99,5/1/2003 0:00,79,0,1,86
3,40,76561197960265729,0,0,2014-08-14 14:04:18,Deathmatch Classic,game,4.99,6/1/2001 0:00,-1,0,1,79
4,50,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life: Opposing Force,game,4.99,11/1/1999 0:00,-1,0,1,95
5,60,76561197960265729,0,0,2014-08-14 14:04:18,Ricochet,game,4.99,11/1/2000 0:00,-1,0,1,78
6,70,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life,game,9.99,11/8/1998 0:00,96,0,1,96
7,80,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike: Condition Zero,game,9.99,3/1/2004 0:00,65,0,1,89
8,100,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike: Condition Zero,game,9.99,3/1/2004 0:00,65,0,1,999
9,130,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life: Blue Shift,game,4.99,6/1/2001 0:00,71,0,0,90


In [29]:
basic_info(game_app_info)

TOTAL ROWS: 100000000


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+------+--------+---------------+----------------+-------------+-------+-------+-------+-----------+-------+--------------+-------------+---------------------+
|app_id|steam_id|playtime_2weeks|playtime_forever|dateretrieved|  title|   type|  price|releaseDate| rating|ageRequirement|isMultiplayer|positiveReviewPercent|
+------+--------+---------------+----------------+-------------+-------+-------+-------+-----------+-------+--------------+-------------+---------------------+
|     0|       0|              0|               0|            0|6313519|6313519|6313519|    6313519|6313519|       6313519|      6313519|              6313519|
+------+--------+---------------+----------------+-------------+-------+-------+-------+-----------+-------+--------------+-------------+---------------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 

__app_ids with no matching record in app_id_info__

In [32]:
game_app_info.filter(F.col('title').isNull()).groupby('app_id').count().sort('count', ascending = False).show()

+------+------+
|app_id| count|
+------+------+
|223530|516271|
|205790|510435|
| 28050|267712|
|228200|241001|
|219540|200614|
|232210|131199|
|236830|121858|
| 12250|117372|
| 43160|113793|
| 12240|111835|
|212910|100987|
|201280|100239|
| 12230| 98588|
|205930| 95442|
| 50650| 94261|
| 57400| 90934|
|104320| 88390|
|224780| 88387|
|222900| 86639|
| 44320| 80520|
+------+------+
only showing top 20 rows



In [33]:
game_app_info.filter(F.col('title').isNull()).groupby('app_id').count().agg(F.sum('count')).show()

+----------+
|sum(count)|
+----------+
|   6313519|
+----------+



In [35]:
game_app_info.filter(F.col('title').isNull()).select('app_id').distinct().count()

649

__Drop rows whose app_id does not exist in the app_id_info table__

In [38]:
game_app_info = game_app_info.filter(~(F.col('title').isNull()))
game_app_info.count()

93686481

In [39]:
game_app_info.limit(5).toPandas()

Unnamed: 0,app_id,steam_id,playtime_2weeks,playtime_forever,dateretrieved,title,type,price,releaseDate,rating,ageRequirement,isMultiplayer,positiveReviewPercent
0,10,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike,game,9.99,11/1/2000 0:00,88,0,1,96
1,20,76561197960265729,0,0,2014-08-14 14:04:18,Team Fortress Classic,game,4.99,4/1/1999 0:00,-1,0,1,82
2,30,76561197960265729,0,0,2014-08-14 14:04:18,Day of Defeat,game,4.99,5/1/2003 0:00,79,0,1,86
3,40,76561197960265729,0,0,2014-08-14 14:04:18,Deathmatch Classic,game,4.99,6/1/2001 0:00,-1,0,1,79
4,50,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life: Opposing Force,game,4.99,11/1/1999 0:00,-1,0,1,95


In [19]:
game2_df.show(3,truncate = True)

+-----------------+------+---------------+----------------+-------------------+
|         steam_id|app_id|playtime_2weeks|playtime_forever|      dateretrieved|
+-----------------+------+---------------+----------------+-------------------+
|76561197960265729|    10|              0|               0|2014-08-14 14:04:18|
|76561197960265729|    20|              0|               0|2014-08-14 14:04:18|
|76561197960265729|    30|              0|               0|2014-08-14 14:04:18|
+-----------------+------+---------------+----------------+-------------------+
only showing top 3 rows



In [50]:
app_id_info.show(3,truncate = True)

+------+--------------------+----+-----+--------------+------+--------------+-------------+---------------------+
|app_id|               title|type|price|   releaseDate|rating|ageRequirement|isMultiplayer|positiveReviewPercent|
+------+--------------------+----+-----+--------------+------+--------------+-------------+---------------------+
|    10|      Counter-Strike|game| 9.99|11/1/2000 0:00|    88|             0|            1|                   96|
|    20|Team Fortress Cla...|game| 4.99| 4/1/1999 0:00|    -1|             0|            1|                   82|
|    30|       Day of Defeat|game| 4.99| 5/1/2003 0:00|    79|             0|            1|                   86|
+------+--------------------+----+-----+--------------+------+--------------+-------------+---------------------+
only showing top 3 rows



### 2. game_genre

In [109]:
games_genres = spark.read.csv('/user/tamng/jwht/CleanData/games_genres.csv', inferSchema = True, header = True)

In [110]:
games_genres.show(3,truncate = True)

+------+----------+
|app_id|gamesGenre|
+------+----------+
|    10|    Action|
|    20|    Action|
|    30|    Action|
+------+----------+
only showing top 3 rows



In [111]:
games_genres.count()

39669

In [55]:
basic_info(games_genres)

TOTAL ROWS: 39669


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+------+----------+
|app_id|gamesGenre|
+------+----------+
|     0|         0|
+------+----------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+------+----------+
|app_id|gamesGenre|
+------+----------+
|    10|    Action|
|    20|    Action|
|    30|    Action|
+------+----------+
only showing top 3 rows

*-------------*-------------*-------------*-------------*-------------


TABLE BASIC DESCRIPTION:
+-------+------------------+--------------+
|summary|            app_id|    gamesGenre|
+-------+------------------+--------------+
|  count|             39669|         39669|
|   mean|314814.62552622956|          null|
| stddev|109487.62831029741|          null|
|    min|                10|    Accounting|
|    max|            469850|Web Publishing|
+-------+------------------+--------------+

*-------------*-------------*--

__a. Get the list of game genres, used to create the dummy columns below__

In [113]:
genre_22 = games_genres.select('gamesGenre').distinct().limit(25).toPandas().gamesGenre.tolist()
genre_22

['Education',
 'Massively Multiplayer',
 'Adventure',
 'Sports',
 'Accounting',
 'Audio Production',
 'Video Production',
 'Animation & Modeling',
 'Racing',
 'Design & Illustration',
 'Software Training',
 'Photo Editing',
 'Web Publishing',
 'Utilities',
 'Early Access',
 'Casual',
 'Action',
 'Strategy',
 'Indie',
 'Free to Play',
 'RPG',
 'Simulation']

__b. Collect the genres of each game into an array, so that there is one row per unique app_id__

In [114]:
games_genres = games_genres.groupBy('app_id').agg(collect_list('gamesGenre').alias('genre'))
games_genres.show()

+------+--------------------+
|app_id|               genre|
+------+--------------------+
|  4900|     [Casual, Indie]|
|  7340|     [Casual, Indie]|
|  9900|[Free to Play, Ma...|
| 16861|          [Strategy]|
| 18800|[Action, Indie, R...|
| 22521|   [Indie, Strategy]|
| 32460|         [Adventure]|
| 73091|       [Action, RPG]|
|111300|[Action, Racing, ...|
|205270|         [Utilities]|
|205541|[Indie, Simulatio...|
|206144|[Casual, Simulation]|
|212010|[Action, Indie, S...|
|213312|            [Action]|
|215309|            [Action]|
|222543|        [Simulation]|
|222556|        [Simulation]|
|222730|[Indie, Simulatio...|
|225430|        [Simulation]|
|226193|          [Strategy]|
+------+--------------------+
only showing top 20 rows



__c. Create dummy columns for game_genre__

In [115]:
for i in genre_22:
    games_genres = games_genres.withColumn(i, array_contains(col("genre"), i))

In [116]:
games_genres.limit(10).toPandas()

Unnamed: 0,app_id,genre,Education,Massively Multiplayer,Adventure,Sports,Accounting,Audio Production,Video Production,Animation & Modeling,...,Web Publishing,Utilities,Early Access,Casual,Action,Strategy,Indie,Free to Play,RPG,Simulation
0,4900,"[Casual, Indie]",False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
1,7340,"[Casual, Indie]",False,False,False,False,False,False,False,False,...,False,False,False,True,False,False,True,False,False,False
2,9900,"[Free to Play, Massively Multiplayer, RPG]",False,True,False,False,False,False,False,False,...,False,False,False,False,False,False,False,True,True,False
3,16861,[Strategy],False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,False,False,False,False
4,18800,"[Action, Indie, Racing, Sports]",False,False,False,True,False,False,False,False,...,False,False,False,False,True,False,True,False,False,False
5,22521,"[Indie, Strategy]",False,False,False,False,False,False,False,False,...,False,False,False,False,False,True,True,False,False,False
6,32460,[Adventure],False,False,True,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
7,73091,"[Action, RPG]",False,False,False,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
8,111300,"[Action, Racing, Sports]",False,False,False,True,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
9,205270,[Utilities],False,False,False,False,False,False,False,False,...,False,True,False,False,False,False,False,False,False,False


__d. Turn `True`/`False` values into 1 or 0__

In [118]:
games_genres = encode_col(games_genres, genre_22)
games_genres.limit(5).toPandas()

Unnamed: 0,app_id,genre,Education,Massively Multiplayer,Adventure,Sports,Accounting,Audio Production,Video Production,Animation & Modeling,...,Web Publishing,Utilities,Early Access,Casual,Action,Strategy,Indie,Free to Play,RPG,Simulation
0,4900,"[Casual, Indie]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,7340,"[Casual, Indie]",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,9900,"[Free to Play, Massively Multiplayer, RPG]",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0
3,16861,[Strategy],0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
4,18800,"[Action, Indie, Racing, Sports]",0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0,0.0


_Double check number of distinct app-id vs. number of rows_

In [119]:
games_genres.count()

17195

In [120]:
games_genres.select('app_id').distinct().count()

17195

In [122]:
# drop genre column
games_genres = games_genres.drop('genre')

---
### 3. Join games_genres with game_app_info

In [125]:
merge_table = game_app_info.join(broadcast(games_genres), ["app_id"], how='left')
merge_table.limit(10).toPandas()

Unnamed: 0,app_id,steam_id,playtime_2weeks,playtime_forever,dateretrieved,title,type,price,releaseDate,rating,...,Web Publishing,Utilities,Early Access,Casual,Action,Strategy,Indie,Free to Play,RPG,Simulation
0,10,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike,game,9.99,11/1/2000 0:00,88,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1,20,76561197960265729,0,0,2014-08-14 14:04:18,Team Fortress Classic,game,4.99,4/1/1999 0:00,-1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
2,30,76561197960265729,0,0,2014-08-14 14:04:18,Day of Defeat,game,4.99,5/1/2003 0:00,79,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
3,40,76561197960265729,0,0,2014-08-14 14:04:18,Deathmatch Classic,game,4.99,6/1/2001 0:00,-1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
4,50,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life: Opposing Force,game,4.99,11/1/1999 0:00,-1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
5,60,76561197960265729,0,0,2014-08-14 14:04:18,Ricochet,game,4.99,11/1/2000 0:00,-1,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6,70,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life,game,9.99,11/8/1998 0:00,96,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
7,80,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike: Condition Zero,game,9.99,3/1/2004 0:00,65,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
8,100,76561197960265729,0,0,2014-08-14 14:04:18,Counter-Strike: Condition Zero,game,9.99,3/1/2004 0:00,65,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
9,130,76561197960265729,0,0,2014-08-14 14:04:18,Half-Life: Blue Shift,game,4.99,6/1/2001 0:00,71,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [127]:
merge_table.filter(F.col('app_id')==4900).limit(10).toPandas()[['steam_id','Education',
 'Massively Multiplayer',
 'Adventure',
 'Sports',
 'Accounting',
 'Audio Production',
 'Video Production',
 'Animation & Modeling',
 'Racing',
 'Design & Illustration',
 'Software Training',
 'Photo Editing',
 'Web Publishing',
 'Utilities',
 'Early Access',
 'Casual',
 'Action',
 'Strategy',
 'Indie',
 'Free to Play',
 'RPG',
 'Simulation']]

Unnamed: 0,steam_id,Education,Massively Multiplayer,Adventure,Sports,Accounting,Audio Production,Video Production,Animation & Modeling,Racing,...,Web Publishing,Utilities,Early Access,Casual,Action,Strategy,Indie,Free to Play,RPG,Simulation
0,76561197960265854,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,76561197960266247,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
2,76561197960267947,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
3,76561197960268251,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
4,76561197960269465,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
5,76561197960269594,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
6,76561197960270028,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
7,76561197960270280,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
8,76561197960270469,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
9,76561197960270811,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0


In [132]:
merge_table.filter(F.col('Simulation').isNull()).select('app_id').distinct().count()

19

In [133]:
merge_table.filter(F.col('Simulation').isNull()).groupby('app_id').count().agg(F.sum('count')).show()

+----------+
|sum(count)|
+----------+
|     87706|
+----------+



In [134]:
merge_table = merge_table.filter(~(F.col('Simulation').isNull()))
merge_table.count()

93598775

In [135]:
merge_table.groupby('type').count().show()

+-----+--------+
| type|   count|
+-----+--------+
|  dlc|  360428|
|video|   19107|
| demo|   71834|
| game|92345098|
|  mod|  802308|
+-----+--------+
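
The row counts reported so far can be cross-checked with simple arithmetic. The sketch below uses only numbers copied from the cell outputs above; the variable names are introduced here for illustration.

```python
# Counts copied from the cell outputs in this notebook.
total_rows = 100_000_000          # game2_df
no_app_info = 6_313_519           # rows with a null title after the first join
after_app_filter = 93_686_481     # game_app_info.count() after the filter
no_genre = 87_706                 # rows with null genre columns
after_genre_filter = 93_598_775   # merge_table.count() after the filter
by_type = {"dlc": 360_428, "video": 19_107, "demo": 71_834,
           "game": 92_345_098, "mod": 802_308}

# Each filter removes exactly the rows that had no match in the joined table.
assert total_rows - no_app_info == after_app_filter
assert after_app_filter - no_genre == after_genre_filter

# The per-type counts add up to the size of merge_table.
assert sum(by_type.values()) == after_genre_filter

print(f"share of original rows kept: {after_genre_filter / total_rows:.1%}")
print(f"share of rows with type 'game': {by_type['game'] / after_genre_filter:.1%}")
```

All three assertions hold, so the two filters and the `type` breakdown are consistent with each other.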



In [136]:
# Save merge_table
merge_table.write.csv("/user/tamng/jwht/EDA/merge_table.csv",header=True)

---
### 4. Construct `player_df1` table (playtime, number of games, ...)

In [137]:
# Aggregate by player & playtime
player_df1 = merge_table.groupby('steam_id').agg(
                    F.sum('playtime_2weeks').alias('playtime_2weeks'),
                    F.sum('playtime_forever').alias('total_playtime_forever'),
                    F.countDistinct('app_id').alias('total_games_owned'),
                    F.sum('price').alias('total_money_spend'),
                    F.sum('isMultiplayer').alias('total_game_multi_player'))

__Add column `total_game_single_player`__

In [138]:
player_df1 = player_df1.withColumn('total_game_single_player',
                                   F.col('total_games_owned') - F.col('total_game_multi_player'))

In [139]:
player_df1.limit(10).toPandas()

Unnamed: 0,steam_id,playtime_2weeks,total_playtime_forever,total_games_owned,total_money_spend,total_game_multi_player,total_game_single_player
0,76561197968325134,0,42553,105,1732.95,56,49
1,76561197968372359,0,85124,43,569.64,34,9
2,76561197968399150,34,100518,158,2069.48,65,93
3,76561197968425907,0,904,8,49.92,7,1
4,76561197968448710,1,109868,121,2179.81,45,76
5,76561197968451675,0,17,8,49.92,7,1
6,76561197968453660,0,382,8,49.92,7,1
7,76561197968454919,0,80532,20,159.82,16,4
8,76561197968463838,0,1286,15,112.86,10,5
9,76561197968473274,0,50226,219,3053.89,73,146


In [141]:
basic_info(player_df1)

TOTAL ROWS: 3263723


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+---------------+----------------------+-----------------+-----------------+-----------------------+------------------------+
|steam_id|playtime_2weeks|total_playtime_forever|total_games_owned|total_money_spend|total_game_multi_player|total_game_single_player|
+--------+---------------+----------------------+-----------------+-----------------+-----------------------+------------------------+
|       0|              0|                     0|                0|                0|                      0|                       0|
+--------+---------------+----------------------+-----------------+-----------------+-----------------------+------------------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+-----------------+---------------+----------------------+-----------------+------------------+-------------------

In [143]:
# Save player_df1
player_df1.write.csv("/user/tamng/jwht/EDA/player_df1.csv",header=True)

---

### 5. Construct `player_df2` table (genre)

In [142]:
genre_22

['Education',
 'Massively Multiplayer',
 'Adventure',
 'Sports',
 'Accounting',
 'Audio Production',
 'Video Production',
 'Animation & Modeling',
 'Racing',
 'Design & Illustration',
 'Software Training',
 'Photo Editing',
 'Web Publishing',
 'Utilities',
 'Early Access',
 'Casual',
 'Action',
 'Strategy',
 'Indie',
 'Free to Play',
 'RPG',
 'Simulation']

In [147]:
# Aggregate the genre indicator columns by player
player_df2 = merge_table.groupby('steam_id').agg(
                    F.sum('Education').alias('gr_education_total'),
                    F.sum('Massively Multiplayer').alias('gr_mutiplayer_total'),
                    F.sum('Adventure').alias('gr_adventure_total'),
                    F.sum('Sports').alias('gr_sports_total'),
                    F.sum('Accounting').alias('gr_accounting_total'),
                    F.sum('Audio Production').alias('gr_audioProduction_total'),
                    F.sum('Video Production').alias('gr_videoProduction_total'),
                    F.sum('Animation & Modeling').alias('gr_animationModeling_total'),
                    F.sum('Racing').alias('gr_racing_total'),
                    F.sum('Design & Illustration').alias('gr_designIllustration_total'),
                    F.sum('Software Training').alias('gr_softwareTraining_total'),
                    F.sum('Photo Editing').alias('gr_photoEditing_total'),
                    F.sum('Web Publishing').alias('gr_webPublishing_total'),
                    F.sum('Utilities').alias('gr_utility_total'),
                    F.sum('Early Access').alias('gr_earlyAccess_total'),
                    F.sum('Casual').alias('gr_casual_total'),
                    F.sum('Action').alias('gr_action_total'),
                    F.sum('Strategy').alias('gr_strategy_total'),
                    F.sum('Indie').alias('gr_indie_total'),
                    F.sum('Free to Play').alias('gr_freeplay_total'),
                    F.sum('RPG').alias('gr_RPG_total'),
                    F.sum('Simulation').alias('gr_simulation_total'))

In [148]:
player_df2.limit(10).toPandas()

Unnamed: 0,steam_id,gr_education_total,gr_mutiplayer_total,gr_adventure_total,gr_sports_total,gr_accounting_total,gr_audioProduction_total,gr_videoProduction_total,gr_animationModeling_total,gr_racing_total,...,gr_webPublishing_total,gr_utility_total,gr_earlyAccess_total,gr_casual_total,gr_action_total,gr_strategy_total,gr_indie_total,gr_freeplay_total,gr_RPG_total,gr_simulation_total
0,76561197969788367,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0
1,76561197969788434,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,9.0,0.0,0.0,0.0,0.0,0.0
2,76561197969793994,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,0.0,0.0
3,76561197969794603,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0
4,76561197969799828,0.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,0.0,0.0,24.0,0.0,0.0,2.0,4.0,2.0
5,76561197969800760,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,25.0,1.0,1.0,0.0,4.0,0.0
6,76561197969801271,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,0.0
7,76561197969802235,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,17.0,0.0,0.0,0.0,0.0,0.0
8,76561197969806248,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,5.0,0.0,0.0,1.0,0.0,0.0
9,76561197969808689,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,9.0,0.0,0.0,1.0,0.0,0.0


In [149]:
basic_info(player_df2)

TOTAL ROWS: 3263723


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+------------------+-------------------+------------------+---------------+-------------------+------------------------+------------------------+--------------------------+---------------+---------------------------+-------------------------+---------------------+----------------------+----------------+--------------------+---------------+---------------+-----------------+--------------+-----------------+------------+-------------------+
|steam_id|gr_education_total|gr_mutiplayer_total|gr_adventure_total|gr_sports_total|gr_accounting_total|gr_audioProduction_total|gr_videoProduction_total|gr_animationModeling_total|gr_racing_total|gr_designIllustration_total|gr_softwareTraining_total|gr_photoEditing_total|gr_webPublishing_total|gr_utility_total|gr_earlyAccess_total|gr_casual_total|gr_action_total|gr_strategy_total|gr_indie_total|gr_freeplay_total|gr_RPG_total|gr_simula

In [150]:
# Save player_df2
player_df2.write.csv("/user/tamng/jwht/EDA/player_df2.csv",header=True)

---
### 6. construct `playornot` table

In [57]:
# Number of owned games each player has never played (playtime_forever == 0)
game_notplayed = merge_table.filter(F.col('playtime_forever')== 0).groupby('steam_id').agg(F.count('app_id').alias('total_games_not_played'))

# Number of owned games each player has played at least once (playtime_forever != 0)
game_played = merge_table.filter(F.col('playtime_forever')!=0).groupby('steam_id').agg(F.count('app_id').alias('total_games_played'))

In [58]:
game_notplayed.count()

3243288

In [59]:
game_played.count()

2801590

In [61]:
playornot = game_notplayed.join(broadcast(game_played), ['steam_id'], how='full')
playornot.count()

3263723

In [62]:
check_missing(playornot)

+--------+----------------------+------------------+
|steam_id|total_games_not_played|total_games_played|
+--------+----------------------+------------------+
|       0|                 20435|            462133|
+--------+----------------------+------------------+



In [63]:
playornot.filter(F.col('total_games_played').isNull()).limit(10).toPandas()

Unnamed: 0,steam_id,total_games_not_played,total_games_played
0,76561197960269352,17,
1,76561197960269449,15,
2,76561197960271611,12,
3,76561197960272095,8,
4,76561197960272236,17,
5,76561197960274540,8,
6,76561197960279823,17,
7,76561197960280763,10,
8,76561197960290794,12,
9,76561197960309458,8,


In [64]:
playornot.filter(F.col('total_games_not_played').isNull()).limit(10).toPandas()

Unnamed: 0,steam_id,total_games_not_played,total_games_played
0,76561197960746981,,12
1,76561197961648138,,8
2,76561197963284625,,11
3,76561197963320144,,10
4,76561197964625286,,9
5,76561197965940586,,3
6,76561197966030631,,3
7,76561197966067466,,3
8,76561197966224567,,3
9,76561197966267518,,5


In [65]:
# replace null value with 0 for total_games_played
playornot = playornot.withColumn('total_games_played', \
                                 F.when(F.col('total_games_played').isNull(),F.lit(0)).otherwise(F.col('total_games_played')))

In [66]:
# replace null value with 0 for total_games_not_played
playornot = playornot.withColumn('total_games_not_played', \
                                 F.when(F.col('total_games_not_played').isNull(),F.lit(0)).otherwise(F.col('total_games_not_played')))

In [67]:
check_missing(playornot)

+--------+----------------------+------------------+
|steam_id|total_games_not_played|total_games_played|
+--------+----------------------+------------------+
|       0|                     0|                 0|
+--------+----------------------+------------------+



In [68]:
playornot.filter(F.col('total_games_played')==0).count()

462133

In [69]:
playornot.filter(F.col('total_games_not_played')==0).count()

20435
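
The full outer join and zero-fill used for `playornot` can be pictured on a toy example. The sketch below is plain Python rather than Spark, and the rows and the helper name `play_counts` are hypothetical. It only illustrates why a player who is missing from one side of the join ends up with a count of 0 on that side.

```python
from collections import Counter

def play_counts(rows):
    """rows: iterable of (steam_id, app_id, playtime_forever) tuples (toy data)."""
    played = Counter()
    not_played = Counter()
    for steam_id, _app_id, playtime in rows:
        if playtime == 0:
            not_played[steam_id] += 1
        else:
            played[steam_id] += 1
    # Full outer join on steam_id: union of the keys from both sides.
    # A Counter returns 0 for a missing key, which plays the role of the null fill.
    return {
        sid: {"total_games_not_played": not_played[sid],
              "total_games_played": played[sid]}
        for sid in sorted(set(played) | set(not_played))
    }

toy_rows = [(1, 10, 0), (1, 20, 35), (2, 10, 0), (2, 30, 0), (3, 20, 120)]
result = play_counts(toy_rows)
print(result)
```

In this toy data, player 2 has never played any game and player 3 has played everything they own, which mirrors the 462133 and 20435 players counted above.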

In [70]:
# Save playornot
playornot.write.csv("/user/tamng/jwht/EDA/playornot.csv",header=True)

---
### 7. construct `player_df3` table (friends)

In [11]:
friends_new = spark.read.csv('/user/tamng/jwht/SteamData/steamData_new/Friends_100mil.csv', \
                             inferSchema = True, header = False)

In [12]:
friends_new.printSchema()

root
 |-- _c0: long (nullable = true)
 |-- _c1: long (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: timestamp (nullable = true)
 |-- _c4: timestamp (nullable = true)
 |-- _c5: string (nullable = true)



In [14]:
# Drop col _c5
friends_new = friends_new.drop("_c5")

In [16]:
# Rename the header columns

newColumns = ["steam_id_a", "steam_id_b","relationship","friend_since", "dateretrieved"]
friends = rename_col(friends_new, newColumns)
friends.printSchema()

root
 |-- steam_id_a: long (nullable = true)
 |-- steam_id_b: long (nullable = true)
 |-- relationship: string (nullable = true)
 |-- friend_since: timestamp (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



In [21]:
basic_info(friends)

TOTAL ROWS: 100000000


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+----------+----------+------------+------------+-------------+
|steam_id_a|steam_id_b|relationship|friend_since|dateretrieved|
+----------+----------+------------+------------+-------------+
|         0|         0|           0|           0|            0|
+----------+----------+------------+------------+-------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+-----------------+-----------------+------------+-------------------+-------------------+
|       steam_id_a|       steam_id_b|relationship|       friend_since|      dateretrieved|
+-----------------+-----------------+------------+-------------------+-------------------+
|76561197960265729|76561197967144365|      friend|2012-07-12 14:56:57|2013-05-10 16:33:54|
|76561197960265730|76561197960265733|      friend|1969-12-31 17:00:00|2013-05-06 15:14:15|
|76561197960

In [24]:
friends.limit(5).toPandas()

Unnamed: 0,steam_id_a,steam_id_b,relationship,friend_since,dateretrieved
0,76561197960265729,76561197967144365,friend,2012-07-12 14:56:57,2013-05-10 16:33:54
1,76561197960265730,76561197960265733,friend,1969-12-31 17:00:00,2013-05-06 15:14:15
2,76561197960265730,76561197974593417,friend,2012-09-10 15:45:17,2013-05-06 02:09:27
3,76561197960265730,76561197984632295,friend,1969-12-31 17:00:00,2013-05-19 11:51:55
4,76561197960265730,76561197992219796,friend,2009-07-31 16:26:04,2013-05-27 17:06:57


In [28]:
friends.printSchema()

root
 |-- steam_id_a: long (nullable = true)
 |-- steam_id_b: long (nullable = true)
 |-- relationship: string (nullable = true)
 |-- friend_since: timestamp (nullable = true)
 |-- dateretrieved: timestamp (nullable = true)



In [37]:
friends = friends.withColumn('year_friendship', year('friend_since'))
friends.limit(5).toPandas()

Unnamed: 0,steam_id_a,steam_id_b,relationship,friend_since,dateretrieved,year_friendship
0,76561197960265729,76561197967144365,friend,2012-07-12 14:56:57,2013-05-10 16:33:54,2012
1,76561197960265730,76561197960265733,friend,1969-12-31 17:00:00,2013-05-06 15:14:15,1969
2,76561197960265730,76561197974593417,friend,2012-09-10 15:45:17,2013-05-06 02:09:27,2012
3,76561197960265730,76561197984632295,friend,1969-12-31 17:00:00,2013-05-19 11:51:55,1969
4,76561197960265730,76561197992219796,friend,2009-07-31 16:26:04,2013-05-27 17:06:57,2009


In [42]:
# Friendship length in calendar years, counting both the starting year and 2013 (the retrieval year)
friends = friends.withColumn('number_year_friendship', 2013 - F.col('year_friendship') + 1)
friends.limit(5).toPandas()

Unnamed: 0,steam_id_a,steam_id_b,relationship,friend_since,dateretrieved,year_friendship,number_year_friendship
0,76561197960265729,76561197967144365,friend,2012-07-12 14:56:57,2013-05-10 16:33:54,2012,2
1,76561197960265730,76561197960265733,friend,1969-12-31 17:00:00,2013-05-06 15:14:15,1969,45
2,76561197960265730,76561197974593417,friend,2012-09-10 15:45:17,2013-05-06 02:09:27,2012,2
3,76561197960265730,76561197984632295,friend,1969-12-31 17:00:00,2013-05-19 11:51:55,1969,45
4,76561197960265730,76561197992219796,friend,2009-07-31 16:26:04,2013-05-27 17:06:57,2009,5


In [45]:
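# Bucket friendships longer than 6 years as '6plus'.
# Mixing this string label with the numeric values turns the whole column into a string column.
friends = friends.withColumn('number_year_friendship', \
                             F.when((F.col('number_year_friendship')>6),F.lit('6plus')).otherwise(F.col('number_year_friendship')))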

friends.limit(5).toPandas()

Unnamed: 0,steam_id_a,steam_id_b,relationship,friend_since,dateretrieved,year_friendship,number_year_friendship
0,76561197960265729,76561197967144365,friend,2012-07-12 14:56:57,2013-05-10 16:33:54,2012,2
1,76561197960265730,76561197960265733,friend,1969-12-31 17:00:00,2013-05-06 15:14:15,1969,6plus
2,76561197960265730,76561197974593417,friend,2012-09-10 15:45:17,2013-05-06 02:09:27,2012,2
3,76561197960265730,76561197984632295,friend,1969-12-31 17:00:00,2013-05-19 11:51:55,1969,6plus
4,76561197960265730,76561197992219796,friend,2009-07-31 16:26:04,2013-05-27 17:06:57,2009,5
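
As a quick check of the bucketing rule, here is a plain-Python sketch (not Spark) that reproduces the three distinct cases in the sample rows above. The function name `friendship_bucket` is hypothetical. Note that `1969-12-31 17:00:00` is Unix epoch zero shown in a UTC-7 time zone, so those rows probably have no recorded `friend_since` date; under the rule used here they all fall into `6plus`.

```python
from datetime import datetime

REFERENCE_YEAR = 2013  # retrieval year used in the notebook

def friendship_bucket(friend_since):
    """Return the friendship length in calendar years as a string, capped at '6plus'."""
    years = REFERENCE_YEAR - friend_since.year + 1
    return "6plus" if years > 6 else str(years)

samples = [
    datetime(2012, 7, 12, 14, 56, 57),
    datetime(1969, 12, 31, 17, 0, 0),   # epoch zero in UTC-7, likely a missing date
    datetime(2009, 7, 31, 16, 26, 4),
]
buckets = [friendship_bucket(ts) for ts in samples]
print(buckets)  # ['2', '6plus', '5']
```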


In [46]:
friends_a = friends.groupBy('steam_id_a').agg(F.count('steam_id_b').alias('number_friends_a'))
friends_a.show(5)

+-----------------+----------------+
|       steam_id_a|number_friends_a|
+-----------------+----------------+
|76561197975589446|              37|
|76561197975590416|               1|
|76561197975594325|               3|
|76561197975601708|               1|
|76561197975601838|               5|
+-----------------+----------------+
only showing top 5 rows



In [51]:
friends_a.count()

6376017

In [47]:
friends_b = friends.groupBy('steam_id_b').agg(F.count('steam_id_a').alias('number_friends_b'))
friends_b.show(5)

+-----------------+----------------+
|       steam_id_b|number_friends_b|
+-----------------+----------------+
|76561197995802269|              64|
|76561197960714238|              42|
|76561197991725389|              67|
|76561197993108118|              41|
|76561198039105252|              22|
+-----------------+----------------+
only showing top 5 rows



In [52]:
friends_b.count()

16497444

In [48]:
friends_ab_join = friends_a.join(friends_b,friends_a.steam_id_a == friends_b.steam_id_b, how = 'full')
friends_ab_join.show(5)

+-----------------+----------------+-----------------+----------------+
|       steam_id_a|number_friends_a|       steam_id_b|number_friends_b|
+-----------------+----------------+-----------------+----------------+
|76561197960266870|               2|76561197960266870|               2|
|76561197960266879|              89|76561197960266879|              56|
|76561197960266911|              10|76561197960266911|               8|
|76561197960268400|               2|76561197960268400|               2|
|76561197960269031|              50|76561197960269031|              37|
+-----------------+----------------+-----------------+----------------+
only showing top 5 rows



In [49]:
friends_ab_join.filter(F.col('steam_id_a').isNull()).count()

10655958

In [50]:
friends_ab_join.filter(F.col('steam_id_b').isNull()).count()

534531

In [53]:
friends_ab_join.filter(F.col('steam_id_a').isNull()).limit(10).show()

+----------+----------------+-----------------+----------------+
|steam_id_a|number_friends_a|       steam_id_b|number_friends_b|
+----------+----------------+-----------------+----------------+
|      null|            null|76561197994129141|               5|
|      null|            null|76561197994129531|               2|
|      null|            null|76561197994130121|               1|
|      null|            null|76561197994130320|               7|
|      null|            null|76561197994131114|               8|
|      null|            null|76561197994132539|               5|
|      null|            null|76561197994132840|               1|
|      null|            null|76561197994133728|               1|
|      null|            null|76561197994135234|               1|
|      null|            null|76561197994136678|              15|
+----------+----------------+-----------------+----------------+



In [54]:
friends_ab_join.filter(F.col('steam_id_b').isNull()).limit(10).show()

+-----------------+----------------+----------+----------------+
|       steam_id_a|number_friends_a|steam_id_b|number_friends_b|
+-----------------+----------------+----------+----------------+
|76561197960289560|               2|      null|            null|
|76561197960289888|               1|      null|            null|
|76561197960301981|               2|      null|            null|
|76561197960320483|               2|      null|            null|
|76561197960323068|               1|      null|            null|
|76561197960331566|               1|      null|            null|
|76561197960343345|               4|      null|            null|
|76561197960357328|               1|      null|            null|
|76561197960368041|               3|      null|            null|
|76561197960370695|               9|      null|            null|
+-----------------+----------------+----------+----------------+



In [56]:
friends_ab_join = friends_ab_join.withColumn('steam_id_a', \
                           F.when(F.col('steam_id_a').isNull(),(F.col('steam_id_b'))).otherwise(F.col('steam_id_a')))

In [57]:
friends_ab_join = friends_ab_join.withColumn('number_friends_a', \
                           F.when(F.col('number_friends_a').isNull(),(F.col('number_friends_b'))).otherwise(F.col('number_friends_a')))

In [60]:
player_df3 = friends_ab_join.select('steam_id_a','number_friends_a')
player_df3.count()

17031975
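
The counts reported in the cells above are consistent with each other, which we can verify with a little arithmetic. The sketch below is plain Python. The `coalesce` helper is a hypothetical stand-in for the `when(...isNull())...otherwise(...)` pattern used to fill `steam_id_a` and `number_friends_a`.

```python
# Counts reported in the cells above
n_a = 6_376_017       # distinct steam_id_a
n_b = 16_497_444      # distinct steam_id_b
only_b = 10_655_958   # rows with steam_id_a null after the full join
only_a = 534_531      # rows with steam_id_b null after the full join

in_both = n_a - only_a
assert in_both == n_b - only_b          # both sides agree on the overlap
total = in_both + only_a + only_b
print(in_both, total)                   # 5841486 17031975

def coalesce(*values):
    """Return the first value that is not None (toy version of the null fill)."""
    for v in values:
        if v is not None:
            return v
    return None

# (number_friends_a, number_friends_b) pairs shaped like the join output
toy = [(89, 56), (None, 5), (2, None)]
merged = [coalesce(a, b) for a, b in toy]
print(merged)                           # [89, 5, 2]
```

When a player appears on both sides, this rule keeps the `a`-side count and ignores the `b`-side count (89 rather than 56). Whether that is the best choice depends on how the friendship edges are stored in the source file, which this notebook does not check.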

In [61]:
newColumns = ['steam_id', 'number_friends']
player_df3 = rename_col(player_df3, newColumns)
player_df3.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- number_friends: long (nullable = true)



In [62]:
# Save friends
friends.write.csv("/user/tamng/jwht/EDA/friends.csv",header=True)

In [63]:
# Save player_df3
player_df3.write.csv("/user/tamng/jwht/EDA/player_df3.csv",header=True)

---

### 8. construct `player_df4` table (groups)

In [66]:
groups.limit(5).toPandas()

Unnamed: 0,steam_id,group_id,dateretrieved
0,76561197960265730,4,2013-05-06 02:09:28
1,76561197960265730,5,2013-05-06 02:09:28
2,76561197960265730,83,2013-05-06 02:09:28
3,76561197960265730,132,2013-05-06 02:09:28
4,76561197960265730,4741,2013-05-06 02:09:28


In [86]:
basic_info(groups)

TOTAL ROWS: 81302317


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+--------+-------------+
|steam_id|group_id|dateretrieved|
+--------+--------+-------------+
|       0|       0|            0|
+--------+--------+-------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+-----------------+--------+-------------------+
|         steam_id|group_id|      dateretrieved|
+-----------------+--------+-------------------+
|76561197960265730|       4|2013-05-06 02:09:28|
|76561197960265730|       5|2013-05-06 02:09:28|
|76561197960265730|      83|2013-05-06 02:09:28|
+-----------------+--------+-------------------+
only showing top 3 rows

*-------------*-------------*-------------*-------------*-------------


TABLE BASIC DESCRIPTION:
+-------+--------------------+------------------+
|summary|            steam_id|          group_id|
+-------+--------------------+------------------+
| 

In [71]:
groups.groupBy('steam_id').agg(F.countDistinct('group_id').alias('count')).orderBy('count',ascending = True).show(50)

+-----------------+-----+
|         steam_id|count|
+-----------------+-----+
|76561198000459803|    1|
|76561197998070598|    1|
|76561198056082910|    1|
|76561197967931082|    1|
|76561198056149190|    1|
|76561197971715289|    1|
|76561198056149328|    1|
|76561198013290251|    1|
|76561198056658999|    1|
|76561198005131498|    1|
|76561198057713596|    1|
|76561198005657926|    1|
|76561198014087987|    1|
|76561198074044434|    1|
|76561198024011280|    1|
|76561198047762068|    1|
|76561198009303067|    1|
|76561198003635222|    1|
|76561198009834950|    1|
|76561198044433932|    1|
|76561198069610081|    1|
|76561198045001972|    1|
|76561198050605002|    1|
|76561197961544856|    1|
|76561197996438348|    1|
|76561198030952487|    1|
|76561197997521625|    1|
|76561198002152719|    1|
|76561197997550461|    1|
|76561198043405092|    1|
|76561197997899180|    1|
|76561198060637775|    1|
|76561198077874865|    1|
|76561198061031661|    1|
|76561198063337486|    1|
|76561198062

In [74]:
player_df4 = groups.groupBy('steam_id').agg(F.countDistinct('group_id').alias('number_groups'))

In [75]:
player_df4.limit(5).toPandas()

Unnamed: 0,steam_id,number_groups
0,76561198026979612,2
1,76561198027099854,20
2,76561198027108059,28
3,76561198027117572,163
4,76561198027171151,1


In [76]:
# Save player_df4
player_df4.write.csv("/user/tamng/jwht/EDA/player_df4.csv",header=True)

---
### 9. construct `player_df5` table (player_summary)

In [106]:
player_summary.limit(5).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197968588415,wafox26,http://steamcommunity.com/profiles/76561197968...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
1,76561197968591946,Lord,http://steamcommunity.com/profiles/76561197968...,0,1,1,2013-02-14 22:22:46,N,N,N,N,N,N,N,N,N,N
2,76561197968593186,chillwinston8,http://steamcommunity.com/profiles/76561197968...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
3,76561197968594430,dylan,http://steamcommunity.com/profiles/76561197968...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
4,76561197968594940,Kid Curry,http://steamcommunity.com/profiles/76561197968...,0,1,1,2013-02-17 11:54:58,N,N,N,N,N,N,N,N,N,N


In [93]:
player_summary.count()

9881300

In [94]:
player_summary = player_summary.filter(F.col('community_visibility_state')!=0)
player_summary.count()

9881206

In [95]:
player_summary = player_summary.filter(F.col('person_state')!=88)
player_summary.count()

9881199

In [109]:
player_summary.filter(F.col('community_visibility_state')==3).limit(5).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197968085891,michaenglam,http://steamcommunity.com/profiles/76561197968...,0,3,N,2006-10-05 22:29:41,N,N,103582791429521408,2004-08-07 20:27:18,N,N,N,N,N,N
1,76561197968088971,a1061004,http://steamcommunity.com/profiles/76561197968...,0,3,N,2007-10-12 20:37:18,N,N,103582791429521408,2004-08-07 23:27:56,N,N,N,N,N,N
2,76561197968090197,kasanova320,http://steamcommunity.com/profiles/76561197968...,0,3,N,2006-09-22 19:39:36,N,N,103582791429521408,2004-08-08 00:48:10,N,N,N,N,N,N
3,76561197968093732,g087221695,http://steamcommunity.com/profiles/76561197968...,0,3,N,2008-11-25 23:14:44,N,N,103582791429521408,2004-08-10 11:49:39,N,N,N,N,N,N
4,76561197968094218,psykokouak2,http://steamcommunity.com/profiles/76561197968...,0,3,N,2006-11-02 09:00:56,N,N,103582791429521408,2004-08-10 12:13:42,N,N,N,N,N,N


In [111]:
# player_summary = player_summary.withColumn('last_logoff', (player_summary.last_logoff).cast('timestamp'))

In [96]:
player_summary.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: integer (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [97]:
player_summary.filter(~(F.col('last_logoff').contains('N')) & ~(F.col('time_created').contains('N'))).count()

6206703

In [98]:
# Spark's substring is 1-based: take the first four characters (the year) of time_created
player_summary = player_summary.withColumn('year_created', substring(player_summary['time_created'], 1, 4))

In [99]:
player_summary.groupBy('year_created').count().show()

+------------+-------+
|year_created|  count|
+------------+-------+
|        2012|      1|
|        2005|1593817|
|        2009|     34|
|        2006|1848599|
|        2004|2005566|
|        2011|     19|
|        2008|     36|
|           N|3674321|
|        2007|     73|
|        2010|     12|
|        2003| 758721|
+------------+-------+



In [100]:
# Same extraction for last_logoff: the first four characters are the year
player_summary = player_summary.withColumn('year_logoff', substring(player_summary['last_logoff'], 1, 4))

In [101]:
player_summary.groupBy('year_logoff').count().show()

+-----------+-------+
|year_logoff|  count|
+-----------+-------+
|       2012| 996728|
|       2013|2265073|
|       2005|   2316|
|       2009| 510104|
|       2006| 964231|
|       2011| 490341|
|       2008| 613411|
|          N|3018995|
|       2007| 586297|
|          1|      2|
|       2010| 433698|
|          2|      3|
+-----------+-------+



In [102]:
player_summary = player_summary.filter(F.col('year_logoff')!='1')
player_summary.count()

9881197

In [103]:
player_summary = player_summary.filter(F.col('year_logoff')!='2')
player_summary.count()

9881194
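
The year extraction and the clean-up of malformed values such as `1` and `2` can also be expressed as a single validation step. The sketch below is plain Python, and `extract_year` is a hypothetical helper. Unlike the cells above, which drop the malformed rows and keep `N` as a string, it maps every placeholder or malformed value to `None`; this is an alternative design, not what the notebook does.

```python
import re

# A plausible year: four digits starting with 19 or 20
YEAR_RE = re.compile(r"^(19|20)\d{2}$")

def extract_year(raw):
    """Return the 4-digit year prefix as an int, or None if it is not a valid year."""
    prefix = raw[:4]
    return int(prefix) if YEAR_RE.match(prefix) else None

values = ["2013-02-14 22:22:46", "N", "1", "2", "2006-10-05 22:29:41"]
years = [extract_year(v) for v in values]
print(years)  # [2013, None, None, None, 2006]
```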

In [107]:
player_summary.withColumn('year_active', \
                          F.lit(F.col('year_logoff')-F.col('year_created')+1)).filter((F.col('year_created').contains('20'))&(F.col('year_logoff')=='N')).show()

+-----------------+----------------+--------------------+------------+--------------------------+-------------+-----------+------------------+---------+------------------+-------------------+-------+-------------+--------------+-------+------------+----------+------------+-----------+-----------+
|         steam_id|     person_name|         profile_url|person_state|community_visibility_state|profile_state|last_logoff|comment_permission|real_name|    primary_clanid|       time_created|game_id|gameserver_ip|game_extrainfo|city_id|country_code|state_code|year_created|year_logoff|year_active|
+-----------------+----------------+--------------------+------------+--------------------------+-------------+-----------+------------------+---------+------------------+-------------------+-------+-------------+--------------+-------+------------+----------+------------+-----------+-----------+
|76561197981566851|andreaslarsen_89|http://steamcommu...|           0|                         3|         

In [139]:
# Subtracting the string year columns casts them to double; the placeholder 'N' becomes null
player_summary = player_summary.withColumn('year_active', F.col('year_logoff') - F.col('year_created') + 1)

In [140]:
# Years since the last logoff, relative to 2013 (the retrieval year)
player_summary = player_summary.withColumn('year_inactive', 2013 - F.col('year_logoff'))

In [141]:
player_summary.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: integer (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)
 |-- year_created: string (nullable = true)
 |-- year_logoff: string (nullable = true)
 |-- year_active: double (nullable = true)
 |-- year_inactive: double (nullable = true)



In [143]:
player_summary.filter(F.col('year_active').isNull()).count()

3674491

In [144]:
player_summary.filter(F.col('year_inactive').isNull()).count()

3018995
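
The null counts above follow from how missing years propagate: `year_active` needs both years, while `year_inactive` only needs `year_logoff`. A plain-Python sketch of the same rules is shown below. The helper names `to_year` and `activity_features` are hypothetical, and the sketch returns integers where Spark returns doubles.

```python
REFERENCE_YEAR = 2013  # retrieval year used in the notebook

def to_year(value):
    """Parse a 4-digit year string; the placeholder 'N' (or anything else) gives None."""
    return int(value) if len(value) == 4 and value.isdigit() else None

def activity_features(year_created, year_logoff):
    """Return (year_active, year_inactive), with None where an input year is missing."""
    created, logoff = to_year(year_created), to_year(year_logoff)
    year_active = None
    if created is not None and logoff is not None:
        year_active = logoff - created + 1
    year_inactive = REFERENCE_YEAR - logoff if logoff is not None else None
    return year_active, year_inactive

print(activity_features("2004", "2006"))  # (3, 7)
print(activity_features("2004", "N"))     # (None, None)
print(activity_features("N", "2013"))     # (None, 0)
```

This is why the number of null `year_inactive` values (3018995) equals the number of `N` values in `year_logoff`, while `year_active` has more nulls because it is also missing whenever `year_created` is `N`.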

In [145]:
player_df5 = player_summary.select('steam_id', 'year_active', 'year_inactive', 'year_created', 'country_code')

In [146]:
# Save player_df5
player_df5.write.csv("/user/tamng/jwht/EDA/player_df5.csv",header=True)

---
## VI. Merge data

Tables to merge: `player_df1`, `player_df2`, `player_df3`, `player_df4`, `player_df5`, `playornot`

In [10]:
# player_df1
player_df1 = spark.read.csv('/user/tamng/jwht/EDA/player_df1.csv',inferSchema = True, header = True)

In [11]:
# player_df2
player_df2 = spark.read.csv('/user/tamng/jwht/EDA/player_df2.csv',inferSchema = True, header = True)

In [12]:
# player_df3
player_df3 = spark.read.csv('/user/tamng/jwht/EDA/player_df3.csv',inferSchema = True, header = True)

In [13]:
# player_df4
player_df4 = spark.read.csv('/user/tamng/jwht/EDA/player_df4.csv',inferSchema = True, header = True)

In [14]:
# player_df5
player_df5 = spark.read.csv('/user/tamng/jwht/EDA/player_df5.csv',inferSchema = True, header = True)

In [15]:
# playornot
playornot = spark.read.csv('/user/tamng/jwht/EDA/playornot.csv',inferSchema = True, header = True)

In [16]:
player_df1.limit(3).toPandas()

Unnamed: 0,steam_id,playtime_2weeks,total_playtime_forever,total_games_owned,total_money_spend,total_game_multi_player,total_game_single_player
0,76561197971697148,480,85537,118,1319.95,63,55
1,76561197971697538,1982,242018,163,2382.5,72,91
2,76561197971705974,3118,260412,264,4414.5,143,121


In [17]:
player_df2.limit(3).toPandas()

Unnamed: 0,steam_id,gr_education_total,gr_mutiplayer_total,gr_adventure_total,gr_sports_total,gr_accounting_total,gr_audioProduction_total,gr_videoProduction_total,gr_animationModeling_total,gr_racing_total,...,gr_webPublishing_total,gr_utility_total,gr_earlyAccess_total,gr_casual_total,gr_action_total,gr_strategy_total,gr_indie_total,gr_freeplay_total,gr_RPG_total,gr_simulation_total
0,76561197960545041,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,20.0,0.0,0.0,0.0,0.0,0.0
1,76561197960545706,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0
2,76561197960546062,0.0,4.0,56.0,2.0,0.0,0.0,0.0,0.0,3.0,...,0.0,0.0,4.0,13.0,120.0,9.0,57.0,12.0,25.0,6.0


In [18]:
player_df3.limit(3).toPandas()

Unnamed: 0,steam_id,number_friends
0,76561197960268652,3
1,76561197960269282,3
2,76561197960269470,1


In [19]:
player_df4.limit(3).toPandas()

Unnamed: 0,steam_id,number_groups
0,76561198055789259,1
1,76561198055833534,48
2,76561198055995390,1


In [20]:
player_df5.limit(3).toPandas()

Unnamed: 0,steam_id,year_active,year_inactive,year_created,country_code
0,76561197968085891,3.0,7.0,2004,N
1,76561197968088971,4.0,6.0,2004,N
2,76561197968090197,3.0,7.0,2004,N


In [21]:
playornot.limit(3).toPandas()

Unnamed: 0,steam_id,total_games_not_played,total_games_played
0,76561197960269356,9,6
1,76561197960270928,16,40
2,76561197960271038,67,6


In [126]:
player_final = player_df1.join(player_df2, ['steam_id'], how='full')
player_final = player_final.join(player_df3, ['steam_id'], how='left')
player_final = player_final.join(player_df4, ['steam_id'], how='left')
player_final = player_final.join(player_df5, ['steam_id'], how='left')
player_final = player_final.join(playornot, ['steam_id'], how='full')
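
The mix of `full` and `left` joins decides which players survive the merge. The toy sketch below uses plain Python dictionaries keyed by `steam_id`; the helpers `full_join` and `left_join` and the sample values are hypothetical. It assumes each table has at most one row per `steam_id`, which holds for our tables because each one comes from a `groupBy` on that key.

```python
def full_join(left, right):
    """Keep every key that appears in either table."""
    return {k: {**left.get(k, {}), **right.get(k, {})}
            for k in set(left) | set(right)}

def left_join(left, right):
    """Keep only the keys of the left table; attach right-hand columns when present."""
    return {k: {**row, **right.get(k, {})} for k, row in left.items()}

games = {1: {"total_games_owned": 118}, 2: {"total_games_owned": 163}}
genres = {1: {"gr_action_total": 8.0}, 2: {"gr_action_total": 9.0}}
friends = {2: {"number_friends": 3}, 9: {"number_friends": 40}}

player = left_join(full_join(games, genres), friends)
print(sorted(player))                    # [1, 2]
print(player[1].get("number_friends"))   # None
```

Player 9 appears only in the friends table, so the left join drops it, while player 1 keeps its row with a missing `number_friends`. The same mechanism explains why the friends, groups, and summary columns can contain nulls after the merge.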

In [129]:
player_final = player_final.withColumnRenamed('playtime_2weeks', 'total_playtime_2weeks')

In [130]:
player_final.printSchema()

root
 |-- steam_id: long (nullable = true)
 |-- total_playtime_2weeks: integer (nullable = true)
 |-- total_playtime_forever: integer (nullable = true)
 |-- total_games_owned: integer (nullable = true)
 |-- total_money_spend: double (nullable = true)
 |-- total_game_multi_player: integer (nullable = true)
 |-- total_game_single_player: integer (nullable = true)
 |-- gr_education_total: double (nullable = true)
 |-- gr_mutiplayer_total: double (nullable = true)
 |-- gr_adventure_total: double (nullable = true)
 |-- gr_sports_total: double (nullable = true)
 |-- gr_accounting_total: double (nullable = true)
 |-- gr_audioProduction_total: double (nullable = true)
 |-- gr_videoProduction_total: double (nullable = true)
 |-- gr_animationModeling_total: double (nullable = true)
 |-- gr_racing_total: double (nullable = true)
 |-- gr_designIllustration_total: double (nullable = true)
 |-- gr_softwareTraining_total: double (nullable = true)
 |-- gr_photoEditing_total: double (nullable = true)
 

_Print the columns_ 

In [131]:
player_final.columns

['steam_id',
 'total_playtime_2weeks',
 'total_playtime_forever',
 'total_games_owned',
 'total_money_spend',
 'total_game_multi_player',
 'total_game_single_player',
 'gr_education_total',
 'gr_mutiplayer_total',
 'gr_adventure_total',
 'gr_sports_total',
 'gr_accounting_total',
 'gr_audioProduction_total',
 'gr_videoProduction_total',
 'gr_animationModeling_total',
 'gr_racing_total',
 'gr_designIllustration_total',
 'gr_softwareTraining_total',
 'gr_photoEditing_total',
 'gr_webPublishing_total',
 'gr_utility_total',
 'gr_earlyAccess_total',
 'gr_casual_total',
 'gr_action_total',
 'gr_strategy_total',
 'gr_indie_total',
 'gr_freeplay_total',
 'gr_RPG_total',
 'gr_simulation_total',
 'number_friends',
 'number_groups',
 'year_active',
 'year_inactive',
 'year_created',
 'country_code',
 'total_games_not_played',
 'total_games_played']

---
_Basic description for total games..._

In [30]:
basic_info(player_final.select('steam_id','total_playtime_2weeks','total_playtime_forever',\
                               'total_games_owned','total_money_spend', 'total_game_multi_player',\
                               'total_game_single_player','total_games_not_played','total_games_played'))

TOTAL ROWS: 3263723


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+----------------------+------------------+
|steam_id|total_playtime_2weeks|total_playtime_forever|total_games_owned|total_money_spend|total_game_multi_player|total_game_single_player|total_games_not_played|total_games_played|
+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+----------------------+------------------+
|       0|                    0|                     0|                0|                0|                      0|                       0|                     0|                 0|
+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+------------------

---
_Basic description for friends, groups..._

In [31]:
basic_info(player_final.select('number_friends','number_groups','year_active','year_inactive','year_created','country_code'))

TOTAL ROWS: 3263723


*-------------*-------------*-------------*-------------*-------------


MISSING VALUE:
+--------------+-------------+-----------+-------------+------------+------------+
|number_friends|number_groups|year_active|year_inactive|year_created|country_code|
+--------------+-------------+-----------+-------------+------------+------------+
|       1206741|      2315016|     138024|        78612|       61705|       61705|
+--------------+-------------+-----------+-------------+------------+------------+

*-------------*-------------*-------------*-------------*-------------


PRINT OUT THE 1st 3 LINES:
+--------------+-------------+-----------+-------------+------------+------------+
|number_friends|number_groups|year_active|year_inactive|year_created|country_code|
+--------------+-------------+-----------+-------------+------------+------------+
|             2|         null|        6.0|          5.0|        2003|           N|
|            10|            2|       10.0|

---
### Filling missing values

#### a. number_groups

The Groups table is complete: we did not take a subset of the original table. Therefore, any steam_id that does not appear in that table has number_groups = 0, meaning that the player did not join any group.

In [132]:
player_final = player_final.withColumn('number_groups', \
                                           F.when((F.col('number_groups').isNull()), F.lit(0)).otherwise(F.col('number_groups')))

In [133]:
check_missing(player_final.select('number_groups'))

+-------------+
|number_groups|
+-------------+
|            0|
+-------------+



#### b. number_friends

In [134]:
player_final = player_final.withColumn('number_friends', \
                                           F.when((F.col('number_friends').isNull()), F.lit(0)).otherwise(F.col('number_friends')))

In [135]:
check_missing(player_final.select('number_friends'))

+--------------+
|number_friends|
+--------------+
|             0|
+--------------+



#### c. Drop rows with missing values

In [136]:
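# Keep rows whose year_created contains '20' (null rows fail this test) and whose year_inactive / year_active are not null
player_final = player_final.filter(F.col('year_created').contains('20')) \
                           .filter(F.col('year_inactive').isNotNull()) \
                           .filter(F.col('year_active').isNotNull())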

In [137]:
player_final.count()

3125699

In [138]:
player_final.groupby('year_created').count().orderBy('year_created', ascending = False).toPandas()

Unnamed: 0,year_created,count
0,2011,12
1,2010,2
2,2009,2
3,2008,5
4,2007,25
5,2006,12
6,2005,436956
7,2004,1951691
8,2003,736994


In [143]:
# Very few accounts were created in 2006 or later, so we keep only rows with year_created before 2006.
player_final = player_final.filter(F.col('year_created')<2006)

In [144]:
player_final.groupby('year_active').count().orderBy('year_active', ascending = False).toPandas()

Unnamed: 0,year_active,count
0,11.0,246454
1,10.0,786267
2,9.0,484074
3,8.0,265859
4,7.0,227265
5,6.0,252221
6,5.0,272644
7,4.0,285946
8,3.0,249704
9,2.0,54995


In [145]:
player_final.groupby('year_inactive').count().orderBy('year_inactive', ascending = False).toPandas()

Unnamed: 0,year_inactive,count
0,8.0,1606
1,7.0,328642
2,6.0,272248
3,5.0,292093
4,4.0,251869
5,3.0,217293
6,2.0,242283
7,1.0,477531
8,0.0,1042076


In [146]:
player_final.groupby('year_created').count().orderBy('year_created', ascending = False).toPandas()

Unnamed: 0,year_created,count
0,2005,436956
1,2004,1951691
2,2003,736994


In [147]:
player_final.limit(5).toPandas()

Unnamed: 0,steam_id,total_playtime_2weeks,total_playtime_forever,total_games_owned,total_money_spend,total_game_multi_player,total_game_single_player,gr_education_total,gr_mutiplayer_total,gr_adventure_total,...,gr_RPG_total,gr_simulation_total,number_friends,number_groups,year_active,year_inactive,year_created,country_code,total_games_not_played,total_games_played
0,76561197960266870,0,6,8,49.92,7,1,0.0,0.0,0.0,...,0.0,0.0,2,0,6.0,5.0,2003,N,7,1
1,76561197960266911,0,3266,15,119.86,11,4,0.0,0.0,0.0,...,0.0,0.0,10,2,10.0,1.0,2003,DE,12,3
2,76561197960268400,0,251,18,169.83,13,5,0.0,0.0,0.0,...,0.0,1.0,2,0,10.0,1.0,2003,GB,16,2
3,76561197960269173,73,23741,33,497.69,23,10,0.0,0.0,5.0,...,2.0,4.0,2,0,11.0,0.0,2003,N,8,25
4,76561197960269352,0,0,17,134.84,13,4,0.0,0.0,0.0,...,0.0,0.0,2,0,4.0,7.0,2003,N,17,0


---
### Check missing values before saving the final data

In [148]:
check_missing(player_final.select('number_friends','number_groups','year_active','year_inactive','year_created','country_code'))

+--------------+-------------+-----------+-------------+------------+------------+
|number_friends|number_groups|year_active|year_inactive|year_created|country_code|
+--------------+-------------+-----------+-------------+------------+------------+
|             0|            0|          0|            0|           0|           0|
+--------------+-------------+-----------+-------------+------------+------------+



In [149]:
check_missing(player_final.select('steam_id','total_playtime_2weeks','total_playtime_forever',\
                               'total_games_owned','total_money_spend', 'total_game_multi_player',\
                               'total_game_single_player','total_games_not_played','total_games_played'))

+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+----------------------+------------------+
|steam_id|total_playtime_2weeks|total_playtime_forever|total_games_owned|total_money_spend|total_game_multi_player|total_game_single_player|total_games_not_played|total_games_played|
+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+----------------------+------------------+
|       0|                    0|                     0|                0|                0|                      0|                       0|                     0|                 0|
+--------+---------------------+----------------------+-----------------+-----------------+-----------------------+------------------------+----------------------+------------------+



In [150]:
check_missing(player_final.select('gr_education_total','gr_mutiplayer_total','gr_adventure_total',
 'gr_sports_total','gr_accounting_total','gr_audioProduction_total','gr_videoProduction_total',
 'gr_animationModeling_total','gr_racing_total','gr_designIllustration_total',
 'gr_softwareTraining_total','gr_photoEditing_total','gr_webPublishing_total',
 'gr_utility_total','gr_earlyAccess_total','gr_casual_total','gr_action_total',
 'gr_strategy_total','gr_indie_total','gr_freeplay_total','gr_RPG_total','gr_simulation_total'))

+------------------+-------------------+------------------+---------------+-------------------+------------------------+------------------------+--------------------------+---------------+---------------------------+-------------------------+---------------------+----------------------+----------------+--------------------+---------------+---------------+-----------------+--------------+-----------------+------------+-------------------+
|gr_education_total|gr_mutiplayer_total|gr_adventure_total|gr_sports_total|gr_accounting_total|gr_audioProduction_total|gr_videoProduction_total|gr_animationModeling_total|gr_racing_total|gr_designIllustration_total|gr_softwareTraining_total|gr_photoEditing_total|gr_webPublishing_total|gr_utility_total|gr_earlyAccess_total|gr_casual_total|gr_action_total|gr_strategy_total|gr_indie_total|gr_freeplay_total|gr_RPG_total|gr_simulation_total|
+------------------+-------------------+------------------+---------------+-------------------+---------------------

---
### Save data

In [151]:
# Save player_final (Spark writes a directory of part files at this path)
player_final.write.csv("/user/tamng/jwht/EDA/player_final.csv",header=True)

---
__Correlation between total playtime and number of games owned__

In [27]:
corr_time_game =  agg_game_time.select("total_playtime_forever", "total_games_owned")
corr_time_game.show(2)

+----------------------+-----------------+
|total_playtime_forever|total_games_owned|
+----------------------+-----------------+
|                 78840|              168|
|                 39835|               70|
+----------------------+-----------------+
only showing top 2 rows



In [36]:
vector_col = "corr_features"

# Create a vector of total_playtime_forever & total_games_owned
assembler = VectorAssembler(inputCols=['total_playtime_forever','total_games_owned'], 
                            outputCol=vector_col)

# Apply the assembler to corr_time_game and keep only the vector column
corr_time_game_vector = assembler.transform(corr_time_game).select(vector_col)

cols = ['total_playtime_forever','total_games_owned']

# Compute the correlation matrix (Pearson by default) and convert it to a DataFrame
matrix = Correlation.corr(corr_time_game_vector, vector_col).collect()[0][0]
corrmatrix = matrix.toArray().tolist()
corr_time_game_df = spark.createDataFrame(corrmatrix,cols)
corr_time_game_df.show()

+----------------------+------------------+
|total_playtime_forever| total_games_owned|
+----------------------+------------------+
|                   1.0|0.4640277661095082|
|    0.4640277661095082|               1.0|
+----------------------+------------------+
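
To make the off-diagonal entry easier to interpret: `Correlation.corr` uses the Pearson method by default, which for two columns $x$ and $y$ is $r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2 \sum_i (y_i-\bar{y})^2}}$. The following is a minimal pure-Python sketch of that formula. The helper name `pearson` and the toy values are hypothetical and are not taken from the Steam data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator of r)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (under the square root in the denominator)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up toy values for illustration only
playtime = [10, 20, 30, 40]
games_owned = [1, 3, 2, 5]

r = pearson(playtime, games_owned)
print(round(r, 4))
```

A value near 1 indicates that the two quantities rise together almost linearly, while the 0.46 reported above suggests a moderate positive relationship between playtime and library size.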

