# BigData Final Project | Steam
## <font color = 'blue'> Notebook2 | Clean & Join Player_Summaries table </font>
### Team Members: Jim Fang, WooJong Choi, Han Jeon, Tam Nguyen

June 2020
___

### This notebook is dedicated to cleaning the Player_Summaries data (10 million rows, ~5 GB)

### <font color ='red'> PROBLEM</font>
When we unzipped the .sql file on GCP and exported it in CSV format, the resulting file:

> 1. has no column names,

> 2. has misaligned columns that follow no consistent structure (example below):


|_c5|_c6|_c7|_c8|_c9|_c10|_c11|_c12|_c13|_c14|_c15|_c16|_c17|_c18
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
|https://steamcommunity.com/id/caglarbey/|	0	|3	|1	|2012-10-24 12:40:31|	"N,"N	|103582791429521408	|"N,"N	|"N,"N	|"N,"N	|"N,"N	|2013-02-28 14:19:00|	None	| None
|http://media.steampowered.com/steamcommunity/p...|	1	|3	|1|	2013-02-15 11:57:54	|"N,"Alfred"|	103582791429521412|	2003-09-10 05:27:21	|"N,"N	|"N,"N	|US |IL	|"N,"2013-02-28 14:34:05"| None
|2	|	3 |	1|	2013-02-15 11:08:34|	1	|Erik Johnson	|103582791429521412|	2003-09-10 05:14:46	|"N,"N	|"N,"N|	JP|	"N	|2013-02-28 14:19:00 | None
|http://media.steampowered.com/steamcommunity/p...	|0	|3|	1	|2013-02-15 13:19:12	|"N, 103582791429521413,2003-09-10 05:27:13	|146.66.153.116:27055|2360|Dota2 |"N	|US|	WA	|3961|	2013-02-28 14:19:00	

### QUESTION

Does this table contain information worth keeping and cleaning? Which problems could this data help solve? If none, we should just drop it, since:
> A single column can hold up to 4-5 different kinds of values. Handling this with plain conditional logic would mean up to 4^16 ≈ 4.3 billion scenarios, far too many to cover with hand-written rules or functions.

This data would be helpful for building knowledge graphs or recommendation systems, and it is an opportunity to learn and appreciate the value of Big Data solutions and PySpark, so we decided to give it a shot.
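
To put the combinatorial estimate above in perspective, the figure can be computed directly. This is only a sketch of the arithmetic, assuming 16 messy columns with 4 possible layouts each, as stated in the question.

```python
# Assumption taken from the text: 16 messy columns, up to 4 layouts per column.
layouts_per_column = 4
messy_columns = 16

# Every column can independently take any layout, so the counts multiply.
combinations = layouts_per_column ** messy_columns
print(f"{combinations:,}")
```

Whatever the exact numbers, the count grows exponentially with the number of columns, which is why the notebook looks for shared patterns instead of enumerating cases.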

###  <font color ='green'> SOLUTION</font>

> The only way to solve this puzzle is to look at the maximum number of values contained in each column, then split the data based on the outliers of that distribution. Once the outliers are dealt with, the remaining rows follow a much more regular distribution.

> This way, we can split the 10 million rows into smaller subsets that share similar patterns, which significantly reduces the number of if/else conditions.

To do so, we created some functions to speed up this manual task.
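
As a rough illustration of "dealing with the outliers first", the sketch below takes the `nSplit_c10` counts that appear later in this notebook and flags the rare split counts. The 1% threshold and the variable names are hypothetical choices for this example, not something the project defines.

```python
# Counts copied from the nSplit_c10 distribution shown later in this notebook.
nsplit_c10_counts = {1: 3526806, 2: 6472629, 3: 564, 5: 1}

total_rows = sum(nsplit_c10_counts.values())

# Hypothetical cut-off: a split count covering under 1% of rows is an outlier.
RARE_SHARE = 0.01

rare_splits = sorted(
    n for n, rows in nsplit_c10_counts.items() if rows / total_rows < RARE_SHARE
)
common_splits = sorted(set(nsplit_c10_counts) - set(rare_splits))

print("total rows:", total_rows)
print("outlier split counts:", rare_splits)
print("common split counts:", common_splits)
```

Rows with a rare split count can then be handled (or dropped) separately, leaving only a couple of common layouts to write rules for.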

### <font color ='blue'> PROCESS </font>

After determining that _c11 should be used to split the data, every table cleaned and processed in this notebook follows the steps below:
- Check split distribution
- Choose the right column to split
- Add columns based on the condition (number of splits), then run & test
- Format:
    - Drop columns
    - Rename
    - Replace specific symbols
- Save data
- Merge all clean data together

Finally, we dropped some of the data along the way where formatting it would take too long and was not really necessary, ending up with the final data: __~9.9 million rows__


|Actual column | drop or keep|
|--|--|
|steam_id|keep
|personaname|keep	
|profileurl|keep
|avatar| drop
|avatarmedium| drop
|avatarfull| drop
|personastate|keep
|communityvisibilitystate|keep
|profilestate|keep
|lastlogoff|keep
|commentpermission|keep
|realname|keep
|primaryclanid|keep
|timecreated|keep
| gameid|keep
|gameserverip|keep
|gameextrainfo| keep
|cityid| keep
|loccountrycode| keep
|locstatecode| keep
|loccityid| drop
|dateretrieved| drop


---
## 1. Import Libraries

In [1]:
from functools import reduce

import pandas as pd
import matplotlib.pyplot as plt

from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql import functions as sf
import pyspark.sql.types as t
from pyspark.sql.types import IntegerType
from pyspark.sql.functions import (
    broadcast, col, count, dayofmonth, isnan, length, month,
    regexp_replace, size, when, year,
)
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.stat import Correlation

%matplotlib inline

In [2]:
spark = SparkSession.builder.enableHiveSupport().appName('CleanData').getOrCreate()
sc = spark.sparkContext

In [3]:
!hdfs dfs -ls /user/tamng/jwht/CleanData

Found 20 items
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:38 /user/tamng/jwht/CleanData/app_id_info.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 22:38 /user/tamng/jwht/CleanData/app_if_info_PosReview.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 13:42 /user/tamng/jwht/CleanData/friends.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 15:10 /user/tamng/jwht/CleanData/game2_df.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 14:12 /user/tamng/jwht/CleanData/game_dgp.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:41 /user/tamng/jwht/CleanData/games_developer.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:47 /user/tamng/jwht/CleanData/games_genres.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:49 /user/tamng/jwht/CleanData/games_publisher.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:55 /user/tamng/jwht/CleanData/groups.csv
drwxrwxrwx   - tamng tamng          0 2020-05-27 16:54 /user/tamng/jwht/CleanData/ps_t1.csv
drwxrwx

---
## 2. Create Function

_1. Check missing value_

In [4]:
def check_missing(df):
    ''' Check missing value'''
    df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()

_2. Rename columns_

In [5]:
def rename_col(df, newColumns):
    ''' Rename all columns
        Note: newColumns is a list of column names, in the same order as the existing columns '''
    oldColumns = df.schema.names
    df = reduce(lambda df, idx: df.withColumnRenamed(oldColumns[idx], newColumns[idx]), range(len(oldColumns)), df)
    return df

_3. Basic description about the data_

In [6]:
def basic_info(df):    
    '''
        Print out the basic description for each table, including:
        1. Total rows / observations
        2. Missing values by column
        3. The first 3 lines
        4. Basic description
        5. Distinct count by column
    '''
    print('TOTAL ROWS:', df.count())
    print('\n')
    print('*-------------'*5)
    print('\n')
    print('MISSING VALUE:')
    check_missing(df)
    print('*-------------'*5)
    print('\n')
    print('PRINT OUT THE 1st 3 LINES:')
    df.show(3, truncate = True)
    print('*-------------'*5)
    print('\n')
    print('TABLE BASIC DESCRIPTION:')
    df.describe().show(10,truncate = True)
    print('*-------------'*5)
    distinct_count = []
    column_name = df.columns
    for i in column_name:
        distinct_count.append(df.select(col(i)).distinct().count())

    print('DISTINCT COUNT BY COLUMN:')
    print('\n')
    print(pd.DataFrame(zip(column_name,distinct_count)).\
      rename(columns={0:'column_name', 1:'distinct_count'}))

_4. Check the maximum number of comma-separated values in a column_

In [7]:
def count_split_column(df, column):
    '''
    Return the maximum number of comma-separated values found in a column.
    df: data
    column: name of the column to check, e.g. '_c11'
    '''
    # Named max_split to avoid shadowing the imported `col` function
    max_split = df.withColumn(column, F.size(F.split(F.col(column), ','))).sort(column, ascending=False).\
    select(column).limit(1)
    return max_split

In [8]:
def add_count_split_column(df, column, tagname):
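    '''
    Add one split-count column per input column.
    column: list of columns to check, e.g. ['_c1','_c7','_c8']
    tagname: list of new column names, e.g. ['nSplit_c1','nSplit_c7','nSplit_c8']
    '''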
    df = reduce(lambda df, idx: df.withColumn(tagname[idx],  F.size(F.split(column[idx], ','))),range(len(column)), df)
    return df

_5. Split each predefined column into the maximum number of parts it needs_

In [9]:
def split_2_column(df, cols, newcols):
    '''
    Split each column in cols into 2 new columns on the ',' sign.
        df: data
        cols: list of columns to split
        newcols: list of new column names, 2 per column in cols, in order
    '''
    # Column i writes to newcols[2*i] and newcols[2*i+1], so names never collide when cols has several entries
    df = reduce(lambda df, idx: df.withColumn(newcols[2*idx],  F.split(cols[idx], ',').getItem(0)), range(len(cols)), df)
    df = reduce(lambda df, idx: df.withColumn(newcols[2*idx+1],  F.split(cols[idx], ',').getItem(1)), range(len(cols)), df)
    return df

In [10]:
def split_3_column(df, cols, newcols):
    '''
    Split each column in cols into 3 new columns on the ',' sign.
        df: data
        cols: list of columns to split
        newcols: list of new column names, 3 per column in cols, in order
    '''
    # Column i writes to newcols[3*i], newcols[3*i+1] and newcols[3*i+2], so names never collide
    df = reduce(lambda df, idx: df.withColumn(newcols[3*idx],  F.split(cols[idx], ',').getItem(0)), range(len(cols)), df)
    df = reduce(lambda df, idx: df.withColumn(newcols[3*idx+1],  F.split(cols[idx], ',').getItem(1)), range(len(cols)), df)
    df = reduce(lambda df, idx: df.withColumn(newcols[3*idx+2],  F.split(cols[idx], ',').getItem(2)), range(len(cols)), df)
    return df
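
The two split helpers differ only in how many parts they extract. As a plain-Python sketch of the same idea (not the notebook's PySpark code), a hypothetical `split_n` helper can pad missing parts with `None`, which is how the empty `c17_2`, `c18_3`, ... columns come out in the outputs further down.

```python
def split_n(cell, n, sep=","):
    """Split `cell` into exactly n parts; parts that do not exist become None.

    Hypothetical helper for illustration only. Extra parts beyond n are
    dropped, matching what split_2_column / split_3_column keep.
    """
    if cell is None:
        return [None] * n
    parts = cell.split(sep)
    return [parts[i] if i < len(parts) else None for i in range(n)]


print(split_n('"N,"N', 2))   # two parts, both present
print(split_n("US", 3))      # one part, padded with None
print(split_n(None, 2))      # missing cell, all None
```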

---
## 3. Import Data

__We need to define the schema explicitly when importing the data: inferSchema recognizes only 17 columns, while the file actually has 20.__

In [7]:
# Create a schema structure

player_summaries_schema = t.StructType([t.StructField('_c0', t.StringType(), False), 
                            t.StructField('_c1', t.StringType(), False),
                            t.StructField('_c2', t.StringType(), False),
                            t.StructField('_c3', t.StringType(), False),
                            t.StructField('_c4', t.StringType(), False),
                            t.StructField('_c5', t.StringType(), False),
                            t.StructField('_c6', t.StringType(), False), 
                            t.StructField('_c7', t.StringType(), False), 
                            t.StructField('_c8', t.StringType(), False), 
                            t.StructField('_c9', t.StringType(), False), 
                            t.StructField('_c10', t.StringType(), False), 
                            t.StructField('_c11', t.StringType(), False), 
                            t.StructField('_c12', t.StringType(), False), 
                            t.StructField('_c13', t.StringType(), False), 
                            t.StructField('_c14', t.StringType(), False), 
                            t.StructField('_c15', t.StringType(), False),
                            t.StructField('_c16', t.StringType(), False), 
                            t.StructField('_c17', t.StringType(), False), 
                            t.StructField('_c18', t.StringType(), False), 
                            t.StructField('_c19', t.StringType(), False)])

In [9]:
# Load the dataset: Player_Summaries (10 million rows)
player_summaries = spark.read.option('header', False).schema(player_summaries_schema).csv('/user/tamng/jwht/SteamData/steamData_new/Player_Summaries_10mil.csv')

# Print Schema
player_summaries.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)



In [10]:
# Total rows
player_summaries.count()

10000000

_Take a look at the data_

In [120]:
player_summaries.limit(5).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197960265729,rich,http://steamcommunity.com/profiles/76561197960...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-10-24 12:40:31,"""N,""N",103582791429521408,"""N,""N","""N,""N","""N,""N","""N,""N",2013-02-28 14:19:00,,,
1,76561197960265730,alfred,http://steamcommunity.com/id/zoe/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,1,2013-02-15 11:57:54,"""N,""Alfred""",103582791429521412,2003-09-10 05:27:21,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:05""",,,
2,76561197960265731,ErikJ,http://steamcommunity.com/id/erikj/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-15 11:08:34,1,Erik Johnson,103582791429521412,2003-09-10 05:14:46,"""N,""N","""N,""N",US,"""N,""N",2013-02-28 14:19:00,
3,76561197960265732,paulj,http://steamcommunity.com/id/OG2/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,2011-07-07 18:09:55,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:05""",,,,
4,76561197960265733,alfred,http://steamcommunity.com/id/alfredr/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-15 13:19:12,"""N,""N",103582791429521413,2003-09-10 05:27:13,"""N,""N","""N,""N",US,WA,3961,2013-02-28 14:19:00,


In [12]:
# Drop duplicate rows
player_summaries_drop = player_summaries.dropDuplicates() 
player_summaries_drop.count()

10000000

In [13]:
# Drop unnecessary columns
# player_summaries_drop = player_summaries_drop.drop('_c3', '_c4','_c5')

---
## 4. Check the column split distribution to find a suitable way to split the data into subsets with similar patterns to clean

---
_Now, add the split-count tags used later to check the column split distribution_

In [13]:
cols_check = ['_c1','_c7','_c8','_c9', '_c10', '_c11', '_c12', '_c13', '_c14','_c15','_c16','_c17','_c18','_c19']

tagname = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11', 'nSplit_c12', 'nSplit_c13', 'nSplit_c14',
           'nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

player_summaries_drop = add_count_split_column(player_summaries_drop, cols_check, tagname)

In [16]:
player_summaries_drop.limit(10).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197980012168,mR.*,http://steamcommunity.com/profiles/76561197980...,0,3,"""N,""2009-03-07 17:31:16""","""N,""N",103582791429521408,2005-12-26 23:15:59,"""N,""N",...,1,1,2,2,2,2,-1,-1,-1,-1
1,76561197980012472,seanmccotter,http://steamcommunity.com/profiles/76561197980...,0,3,"""N,""2013-01-17 17:21:14""","""N,""N",103582791429521408,2005-12-26 23:49:40,"""N,""N",...,1,1,2,2,2,2,-1,-1,-1,-1
2,76561197980012724,caglarbey,http://steamcommunity.com/id/caglarbey/,0,3,1,2013-02-15 01:25:53,"""N,""N",103582791429521408,2005-12-27 00:23:05,...,2,1,1,2,2,2,2,-1,-1,-1
3,76561197980012958,[nightvision]solid_snake,http://steamcommunity.com/profiles/76561197980...,0,3,"""N,""2006-08-04 00:43:37""","""N,""N",103582791429521408,2005-12-27 00:54:45,"""N,""N",...,1,1,2,2,2,2,-1,-1,-1,-1
4,76561197980013392,SHADESNAIL.,http://steamcommunity.com/id/SHADEEEEEEEEE/,0,3,1,2013-02-19 10:43:40,"""N,""=D""",103582791432130165,2005-12-27 01:39:46,...,2,1,1,2,2,1,2,1,-1,-1
5,76561197980015065,sascha.wag,http://steamcommunity.com/profiles/76561197980...,0,3,"""N,""2012-10-03 09:18:03""","""N,""N",103582791429521408,2005-12-28 12:58:59,"""N,""N",...,1,1,2,2,2,2,-1,-1,-1,-1
6,76561197980017138,Bk Come Back,http://steamcommunity.com/id/leplasticien/,0,3,1,2009-04-21 06:03:33,"""N,""Nicolas""",103582791430158228,2005-12-27 05:43:23,...,2,1,1,2,2,1,1,1,1,-1
7,76561197980017357,^4S^0ub-Zero,http://steamcommunity.com/id/88Sub-ZEro/,0,3,1,2013-02-19 13:21:53,1,Alex,103582791429763583,...,1,1,1,1,2,2,2,2,-1,-1
8,76561197980017455,juco,http://steamcommunity.com/profiles/76561197980...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
9,76561197980019720,Hurricane 2012,http://steamcommunity.com/profiles/76561197980...,0,3,1,2012-09-02 14:21:57,"""N,""Kai""",103582791430380703,2005-12-27 07:30:46,...,2,1,1,2,2,1,1,2,-1,-1


---
_Next, check the distribution of the number of splits for each predefined column_

|Number| Meaning| Number of commas
|--|--|--|
|-1| Null (no value)| None
| 1|1 value| None
| 2|2 values| 1 comma
|...|...|...
|n| n values| n-1 commas
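
The mapping in the table can be mirrored in a few lines of plain Python. This is only a sketch of the counting rule as it shows up in the notebook's outputs (a null cell is reported as -1); `n_split` is a hypothetical name, and the sample cells are copied from rows displayed in this notebook.

```python
def n_split(cell):
    """Number of comma-separated values in a cell; -1 marks a null cell."""
    return -1 if cell is None else len(cell.split(","))


examples = [
    None,                                              # null cell
    "1",                                               # 1 value, no comma
    '"N,"N',                                           # 2 values, 1 comma
    '"N,103582791429525616,"2004-12-05 08:10:48"',     # 3 values, 2 commas
]

for cell in examples:
    print(repr(cell), "->", n_split(cell))
```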

In [34]:
for i in tagname:
    print('Distinct value for each column:\n')
    # show() prints the table itself and returns None, which is why 'None' follows each table below
    print(player_summaries_drop.groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|       -1|      1|
|        1|9999606|
|        2|    393|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|3647547|
|        2|6352453|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|3638364|
|        3|     99|
|        2|6361537|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         1|3526806|
|         3|    564|
|         5|      1|
|         2|6472629|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|        -1|     27|
|         1|6266802|
|        13|      1|
|         6|      1|
|         3|  57439|
|         5|      7|
|         4|     16|
|         8|      1|
|         2|

In [35]:
player_summaries_drop.filter(player_summaries_drop.nSplit_c18 == -1).head(2)

[Row(_c0='76561197984327737', _c1='cmkee19', _c2='http://steamcommunity.com/profiles/76561197984327737/', _c6='0', _c7='3', _c8='"N,"2009-01-16 06:51:23"', _c9='"N,"N', _c10='103582791429521408', _c11='2006-08-16 21:39:48', _c12='"N,"N', _c13='"N,"N', _c14='"N,"N', _c15='"N,"2013-03-11 12:30:23"', _c16=None, _c17=None, _c18=None, _c19=None, nSplit_c7=1, nSplit_c8=2, nSplit_c9=2, nSplit_c10=1, nSplit_c11=1, nSplit_c12=2, nSplit_c13=2, nSplit_c14=2, nSplit_c15=2, nSplit_c16=-1, nSplit_c17=-1, nSplit_c18=-1, nSplit_c19=-1),
 Row(_c0='76561197984328196', _c1='realkilla69420', _c2='http://steamcommunity.com/profiles/76561197984328196/', _c6='0', _c7='3', _c8='"N,"2006-11-02 08:20:37"', _c9='"N,"N', _c10='103582791429521408', _c11='2006-08-13 20:49:59', _c12='"N,"N', _c13='"N,"N', _c14='"N,"N', _c15='"N,"2013-03-02 01:09:34"', _c16=None, _c17=None, _c18=None, _c19=None, nSplit_c7=1, nSplit_c8=2, nSplit_c9=2, nSplit_c10=1, nSplit_c11=1, nSplit_c12=2, nSplit_c13=2, nSplit_c14=2, nSplit_c15=2, 

In [36]:
player_summaries_drop.filter(player_summaries_drop.nSplit_c19 == -1).head(2)

[Row(_c0='76561197963206515', _c1='sushy78', _c2='http://steamcommunity.com/profiles/76561197963206515/', _c6='0', _c7='1', _c8='"N,"N', _c9='"N,"N', _c10='"N,"N', _c11='"N,"N', _c12='"N,"N', _c13='"N,"N', _c14='"N,"2013-02-28 14:13:44"', _c15=None, _c16=None, _c17=None, _c18=None, _c19=None, nSplit_c7=1, nSplit_c8=2, nSplit_c9=2, nSplit_c10=2, nSplit_c11=2, nSplit_c12=2, nSplit_c13=2, nSplit_c14=2, nSplit_c15=-1, nSplit_c16=-1, nSplit_c17=-1, nSplit_c18=-1, nSplit_c19=-1),
 Row(_c0='76561197963206693', _c1='jonalam', _c2='http://steamcommunity.com/profiles/76561197963206693/', _c6='0', _c7='3', _c8='"N,"2011-06-05 09:38:35"', _c9='"N,"N', _c10='103582791429521408', _c11='2003-12-05 05:51:59', _c12='"N,"N', _c13='"N,"N', _c14='"N,"N', _c15='"N,"2013-02-28 14:13:44"', _c16=None, _c17=None, _c18=None, _c19=None, nSplit_c7=1, nSplit_c8=2, nSplit_c9=2, nSplit_c10=1, nSplit_c11=1, nSplit_c12=2, nSplit_c13=2, nSplit_c14=2, nSplit_c15=2, nSplit_c16=-1, nSplit_c17=-1, nSplit_c18=-1, nSplit_c19

---
## 5. Process the split tables based on the patterns found above
> <font color = 'blue'> **_c11 has a wide range of split counts. If we deal with the outliers first, the rest follows a much more regular distribution, so the patterns become clearer and easier to define. We therefore take it as the first split point in the data.** </font>

---
## <font color = 'black'> 5.1 Table 1: nSplit_c11 = 3 & nSplit_c10 = 1 </font>

### 5.1.1 Filter data and check the number of splits

In [19]:
ps_t1 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 3) & (player_summaries_drop.nSplit_c10 == 1))
ps_t1.count()

57359

In [21]:
# Since this table has a small number of rows, convert it to pandas and use it in the split check below
nSplit_c11_3_c10_1 = ps_t1.toPandas()
nSplit_c11_3_c10_1

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197971533965,{DÛ·D}BlackDeathâ„¢[GEÐ¯],http://steamcommunity.com/profiles/76561197971...,0,3,1,2013-02-17 14:55:08,1,"""N,103582791429525616,""2004-12-05 08:10:48""","""N,""N",...,1,3,2,2,2,2,-1,-1,-1,-1
1,76561197971554822,reaperbait,http://steamcommunity.com/id/barryvanunendooms...,0,3,1,2013-02-17 15:14:32,1,"""N,103582791429524118,""2004-12-06 15:15:51""","""N,""N",...,1,3,2,2,2,2,-1,-1,-1,-1
2,76561197971639930,zolitary,http://steamcommunity.com/id/konni/,0,3,1,2012-09-29 12:04:10,1,"""N,103582791430183561,""2004-12-09 08:51:01""","""N,""N",...,1,3,2,2,1,1,2,-1,-1,-1
3,76561197971660499,sesame,http://steamcommunity.com/id/sesamebee/,0,3,1,2013-02-18 00:19:37,1,"""N,103582791429521408,""2004-12-09 02:31:04""","""N,""N",...,1,3,2,2,2,2,-1,-1,-1,-1
4,76561197971718998,NeK,http://steamcommunity.com/profiles/76561197971...,0,3,1,2012-10-26 06:15:48,1,"""N,103582791429548627,""2004-12-11 11:40:16""","""N,""N",...,1,3,2,2,1,1,1,1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57354,76561197964257589,EvilLinux..RAGE,http://steamcommunity.com/id/evillinux/,3,3,1,2013-02-16 08:52:47,2,"""N,103582791429534006,""2004-02-01 16:56:26""","""N,""N",...,1,3,2,2,1,1,1,1,-1,-1
57355,76561197964267999,skybert,http://steamcommunity.com/id/skybert88/,0,3,1,2013-02-12 20:12:35,1,"""N,103582791429521408,""2004-02-02 09:49:08""","""N,""N",...,1,3,2,2,2,2,-1,-1,-1,-1
57356,76561197964282159,Sgt. Cojones,http://steamcommunity.com/profiles/76561197964...,0,3,1,2013-01-23 15:05:17,2,"""N,103582791432740959,""2004-02-03 07:09:08""","""N,""N",...,1,3,2,2,1,1,1,1,-1,-1
57357,76561197964305064,1337,http://steamcommunity.com/id/pal4rzn/,0,3,1,2012-06-28 10:14:21,1,"""N,103582791430660190,""2004-02-05 07:17:37""","""N,""N",...,1,3,2,2,1,1,1,1,-1,-1


In [22]:
# Check the number of unique values in each split-count column
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_3_c10_1[i].nunique())

print(pd.DataFrame(zip(check_list,unique_col)))

             0  1
0    nSplit_c7  1
1    nSplit_c8  1
2    nSplit_c9  1
3   nSplit_c12  2
4   nSplit_c13  2
5   nSplit_c14  2
6   nSplit_c15  2
7   nSplit_c16  3
8   nSplit_c17  3
9   nSplit_c18  3
10  nSplit_c19  2


In [23]:
# Prepare column and number of split for each column
col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2','c11_3' ]

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2', 'c16_3']

col_c17 = ['_c17']
newcols_c17 = ['c17_1', 'c17_2', 'c17_3']

col_c18 = ['_c18']
newcols_c18 = ['c18_1', 'c18_2', 'c18_3']

col_c19 = ['_c19']
newcols_c19 = ['c19_1', 'c19_2']

# Add split columns
ps_t1 = split_3_column(ps_t1, col_c11, newcols_c11)
ps_t1 = split_2_column(ps_t1, col_c12, newcols_c12)
ps_t1 = split_2_column(ps_t1, col_c13, newcols_c13)
ps_t1 = split_2_column(ps_t1, col_c14, newcols_c14)
ps_t1 = split_2_column(ps_t1, col_c15, newcols_c15)
ps_t1 = split_3_column(ps_t1, col_c16, newcols_c16)
ps_t1 = split_3_column(ps_t1, col_c17, newcols_c17)
ps_t1 = split_3_column(ps_t1, col_c18, newcols_c18)
ps_t1 = split_2_column(ps_t1, col_c19, newcols_c19)

_Check if the split function works properly_

In [25]:
ps_t1.toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','nSplit_c12','_c13','nSplit_c13','_c14','nSplit_c14','_c15','nSplit_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,nSplit_c12,_c13,nSplit_c13,_c14,nSplit_c14,_c15,nSplit_c15,_c16,_c17,_c18,_c19
0,3,1,2013-02-17 11:05:13,1,"""N,103582791430504459,""2004-11-16 11:26:44""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-02-28 14:37:33""",2,,,,
1,3,1,2013-02-28 08:24:24,1,"""N,103582791432326615,""2004-11-16 14:50:05""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-03-06 18:09:17""",2,,,,
2,3,1,2013-02-05 01:05:43,1,"""N,103582791429521408,""2004-11-16 15:33:07""","""N,""N",2,"""N,""N",2,BE,1,06,1,6662,2013-02-28 14:27:17,,
3,3,1,2013-02-17 15:53:31,1,"""N,103582791431786684,""2004-11-17 09:38:59""","""N,""N",2,"""N,""N",2,FR,1,A8,1,15767,2013-02-28 14:27:21,,
4,3,1,2013-01-31 23:21:49,1,"""N,103582791430768475,""2004-11-17 15:40:09""","""N,""N",2,"""N,""N",2,CH,1,20,1,9391,2013-02-28 14:37:37,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57354,3,1,2013-02-28 22:51:05,1,"""N,103582791429521408,""2006-06-30 06:42:35""","""N,""N",2,"""N,""N",2,CZ,1,52,1,12260,2013-03-01 23:13:57,,
57355,3,1,2012-06-27 10:16:51,1,"""N,103582791429521408,""2006-07-01 18:52:20""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-03-01 23:17:52""",2,,,,
57356,3,1,2013-03-01 21:45:52,2,"""N,103582791433307803,""2006-07-02 12:58:33""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-03-01 23:19:31""",2,,,,
57357,3,1,2013-03-01 16:29:49,1,"""N,103582791433195577,""2006-07-08 01:31:56""",91310,1,"""N,""Dead Island""",2,"""N,""US""",2,"""N,""N",2,2013-03-01 23:32:53,,,


---
### 5.1.2 Adding columns and condition
From this step on, we start adding columns based on the nSplit column logic.

_Add columns:_
 > - realname
 > - primaryclanid
 > - timecreated

In [26]:
ps_t1 = ps_t1.withColumn('realname',F.when(F.col('_c11').contains('103'), F.col('c11_1')).otherwise(F.col('_c11')))
ps_t1 = ps_t1.withColumn('primaryclanid',F.when(F.col('_c11').contains('103'), F.col('c11_2')).otherwise(F.col('_c12')))
ps_t1 = ps_t1.withColumn('timecreated',F.when(F.col('_c11').contains('103'), F.col('c11_3')).otherwise(F.col('_c13')))
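
For this table, `_c11` packs three fields into one cell. The plain-Python sketch below shows the intended unpacking on one sample value copied from the output above. `parse_c11` is a hypothetical helper, and treating `"N` as the export's marker for a missing value is an assumption, not something the source file documents.

```python
def parse_c11(cell):
    """Unpack a three-part _c11 cell into (realname, primaryclanid, timecreated).

    Illustration only: assumes exactly three comma-separated parts and that
    a bare N (after stripping quotes) stands for a missing value.
    """
    realname, clanid, created = (part.strip('"') for part in cell.split(",", 2))
    return (None if realname == "N" else realname, clanid, created)


row = parse_c11('"N,103582791429525616,"2004-12-05 08:10:48"')
print(row)
```

Stripping the stray quotes here corresponds to the "Replace specific symbols" step in the process list at the top of the notebook.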

In [27]:
ps_t1.withColumn('realname',F.when(F.col('_c11').contains('103'), F.col('c11_1')).otherwise(F.col('_c11'))).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,c17_2,c17_3,c18_1,c18_2,c18_3,c19_1,c19_2,realname,primaryclanid,timecreated
0,76561197966739103,SAS - DND - NOT FREE FOR MIX,http://steamcommunity.com/id/sas-buc1/,3,3,1,2013-02-16 11:40:10,1,"""N,103582791429522361,""2004-05-28 15:07:03""","""N,""N",...,,,,,,,,"""N",103582791429522361,"""2004-05-28 15:07:03"""
1,76561197966931469,Klaus-BÃ¶rje das Obergefreiter,http://steamcommunity.com/id/straff/,0,3,1,2013-02-14 13:58:08,1,"""N,103582791430110941,""2004-06-08 17:14:18""","""N,""N",...,,,,,,,,"""N",103582791430110941,"""2004-06-08 17:14:18"""
2,76561197966955486,dd,http://steamcommunity.com/profiles/76561197966...,0,3,1,2012-12-09 09:34:21,1,"""N,103582791430911377,""2004-06-12 01:50:14""","""N,""N",...,,,,,,,,"""N",103582791430911377,"""2004-06-12 01:50:14"""
3,76561197967006879,L1nk2h,http://steamcommunity.com/id/il1nkk/,0,3,1,2013-02-16 17:34:01,1,"""N,103582791433599146,""2004-06-13 06:58:30""","""N,""N",...,,,,,,,,"""N",103582791433599146,"""2004-06-13 06:58:30"""
4,76561197967144937,Dr. Jerf,http://steamcommunity.com/id/TheJoff/,3,3,1,2013-02-15 19:05:41,1,"""N,103582791433049237,""2004-06-21 15:33:52""","""N,""N",...,,,,,,,,"""N",103582791433049237,"""2004-06-21 15:33:52"""
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57354,76561197980555631,gashanet,http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-01-17 13:52:36,1,"""N,103582791429521408,""2006-01-24 11:00:58""","""N,""N",...,,,,,,,,"""N",103582791429521408,"""2006-01-24 11:00:58"""
57355,76561197980662281,Lom1k,http://steamcommunity.com/id/nazar4ikkk/,0,3,1,2013-02-28 11:07:52,1,"""N,103582791430107095,""2006-01-29 22:43:52""","""N,""N",...,,,,,,,,"""N",103582791430107095,"""2006-01-29 22:43:52"""
57356,76561197980851839,carcinoGeneticist,http://steamcommunity.com/id/SoldiersStashes/,1,3,1,2013-03-01 12:31:21,1,"""N,103582791433689473,""2006-02-09 00:54:57""","""N,""N",...,,,,,,,,"""N",103582791433689473,"""2006-02-09 00:54:57"""
57357,76561197980853652,Stasioooo,http://steamcommunity.com/id/stasiooo/,0,3,1,2013-02-03 18:53:21,1,"""N,103582791429521408,""2006-02-07 08:38:52""","""N,""N",...,,,,,,,,"""N",103582791429521408,"""2006-02-07 08:38:52"""


In [28]:
ps_t1.toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,c17_2,c17_3,c18_1,c18_2,c18_3,c19_1,c19_2,realname,primaryclanid,timecreated
0,76561197983698967,MEGAA,http://steamcommunity.com/profiles/76561197983...,0,3,1,2012-12-25 07:13:27,2,"""N,103582791430868128,""2006-07-14 16:06:58""","""N,""N",...,,,,,,,,"""N",103582791430868128,"""2006-07-14 16:06:58"""
1,76561197983721471,My Purtty Pony,http://steamcommunity.com/profiles/76561197983...,1,3,1,2013-02-28 22:51:48,1,"""N,103582791433752041,""2006-07-15 21:57:56""","""N,""N",...,,,,,,,,"""N",103582791433752041,"""2006-07-15 21:57:56"""
2,76561197983764489,DAVIDOUKOI-,http://steamcommunity.com/id/gillsen/,0,3,1,2013-03-01 17:08:41,2,"""N,103582791429521408,""2006-07-18 09:33:37""","""N,""N",...,,,,,,,,"""N",103582791429521408,"""2006-07-18 09:33:37"""
3,76561197983891637,tiflokileur21,http://steamcommunity.com/profiles/76561197983...,0,3,1,2011-06-25 08:46:53,1,"""N,103582791430910999,""2006-07-25 11:36:11""","""N,""N",...,,,,,,,,"""N",103582791430910999,"""2006-07-25 11:36:11"""
4,76561197984025972,toni,http://steamcommunity.com/id/toni-_-/,1,3,1,2013-03-02 00:20:04,1,"""N,103582791433142277,""2006-07-30 09:36:11""","""N,""N",...,,,,,,,,"""N",103582791433142277,"""2006-07-30 09:36:11"""
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
57354,76561197971705524,Dalaliss,http://steamcommunity.com/profiles/76561197971...,0,3,1,2012-05-07 12:13:19,1,"""N,103582791429570994,""2004-12-11 06:14:36""","""N,""N",...,,,,,,,,"""N",103582791429570994,"""2004-12-11 06:14:36"""
57355,76561197971770449,ads.<3 x22ãƒ„,http://steamcommunity.com/profiles/76561197971...,1,3,1,2013-02-17 08:37:32,1,"""N,103582791432474651,""2004-12-12 03:47:35""","""N,""N",...,,,,,,,,"""N",103582791432474651,"""2004-12-12 03:47:35"""
57356,76561197971815878,sledgeham,http://steamcommunity.com/id/sledgeham/,0,3,1,2013-02-17 15:24:32,2,"""N,103582791432125620,""2004-12-14 12:11:44""","""N,""N",...,,,,,,,,"""N",103582791432125620,"""2004-12-14 12:11:44"""
57357,76561197971857330,arisk79,http://steamcommunity.com/id/arisk79/,3,3,1,2013-02-18 03:16:46,1,"""N,103582791431949543,""2004-12-16 00:38:43""","""N,""N",...,,,,,,,,"""N",103582791431949543,"""2004-12-16 00:38:43"""


_Check that `primaryclanid` and `timecreated` were parsed properly_

In [29]:
ps_t1.filter(F.col('primaryclanid').contains('103')).count()

57359

In [30]:
ps_t1.filter(F.col('timecreated').contains('20')).count()

57359
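
Substring checks such as `contains('103')` or `contains('20')` also match any value that merely happens to include those characters. A stricter check could look like the plain-Python sketch below. The 18-digit pattern starting with 103 is inferred from the sample ids in this notebook and is an assumption, as are the helper names.

```python
import re
from datetime import datetime

# Assumed shape of a clan id, inferred from samples: "103" followed by 15 digits.
CLAN_ID = re.compile(r"103\d{15}")


def is_clan_id(value):
    """True if the whole value looks like a primaryclanid."""
    return value is not None and CLAN_ID.fullmatch(value) is not None


def is_timestamp(value):
    """True if the value (optionally wrapped in quotes) parses as a timestamp."""
    if value is None:
        return False
    try:
        datetime.strptime(value.strip('"'), "%Y-%m-%d %H:%M:%S")
        return True
    except ValueError:
        return False


print(is_clan_id("103582791429521408"), is_timestamp('"2004-12-05 08:10:48"'))
print(is_clan_id("2013-02-28 14:19:00"), is_timestamp("103582791429521408"))
```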

In [31]:
ps_t1.select('realname').distinct().show(80)

+--------------------+
|            realname|
+--------------------+
|solariS , oKurves...|
|Brian, Benjamin, ...|
|The Drill, aka Bl...|
|Cunt, Emanuel/ Ka...|
|I am serious, do ...|
|San Diego, Califo...|
|h. mc carty, w.h....|
|         "RokÅ¡Ä— ;"|
|  Naw, man, I'm Dave|
|Slightly Sliced, ...|
|Garrett J. Morgan...|
|Gavin, but please...|
|BIG R, daughter L...|
|ÐžÐ¡Ð•ÐÐ¬, Ð›Ð˜Ð...|
|            "Luke ="|
|aka Banjo, Tru, W...|
|Stackar'N^ , Spax...|
|â•”â–º OnE sHoT, ...|
|Delerium, Fevin K...|
|ALIASES:  LIR, La...|
|Padova, Veneto, I...|
|SALIMZZ,-,GU!ZMO ...|
|    Mathias,George,l|
|Brett,Gavin,and T...|
|Steven D.,Kevin D...|
|Claytonious, Rule...|
| Live,Learn,Grow. =]|
|Max BÃ¶hl-Iggelhe...|
| DasK, DaPunZ, Felix|
|Bear, with Ice, s...|
|     "acc recovered"|
|          "/-TaFri-"|
|TUPAC, IMMORTAL T...|
|Servet, Sement, S...|
|Jeremy Willocq ,1...|
|           Me, I, it|
|"Aka: Burger, Toy...|
|Brad, or Bert, or...|
|"f a r s s a ! t ...|
|               .,.,.|
|Massakasch

---
_Add column:_
 > - gameid (_c14)

In [32]:
ps_t1 = ps_t1.withColumn('gameid',F.when(F.col('_c11').contains('103'), \
                                           F.col('c12_1')).otherwise(F.col('c14_1')))

In [33]:
ps_t1.select('gameid').count()

57359

In [34]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t1.select('gameid').distinct().toPandas()[201:220]

Unnamed: 0,gameid
201,22120
202,34030
203,3483
204,209690
205,48240
206,218230
207,218130
208,22100
209,17300
210,227300


---
_Add column:_
 > - gameserverip (_c15)

In [35]:
ps_t1 = ps_t1.withColumn('gameserverip',F.when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2), \
                                                F.col('c12_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1), \
                                                F.col('c13_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==2), \
                                                F.col('c14_2')).otherwise(F.col('c15_1')))

In [36]:
ps_t1.select('gameserverip').count()

57359

In [37]:
ps_t1.select('gameserverip').distinct().count()

815

In [38]:
ps_t1.select('gameserverip').distinct().toPandas()[640:660]

Unnamed: 0,gameserverip
640,146.66.153.97:27034
641,146.66.155.25:27021
642,208.78.165.10:27056
643,146.66.154.39:27079
644,146.66.156.235:27040
645,46.20.120.32:27015
646,46.20.40.20:1111
647,50.115.32.35:27015
648,146.66.153.116:27055
649,173.199.100.175:7707


---
_Add column:_
 > - gameextrainfo (_c16)

In [39]:
ps_t1 = ps_t1.withColumn('gameextrainfo',F.when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) , \
                                                F.col('c13_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), \
                                                F.col('c13_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), \
                                                F.col('c14_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')).otherwise(F.col('c15_1')))

In [40]:
ps_t1.select('gameextrainfo').distinct().count()

310

In [41]:
ps_t1.select('gameextrainfo').distinct().toPandas()[0:50]

Unnamed: 0,gameextrainfo
0,Dota 2
1,"""Black Mesa"""
2,"""Sniper Elite V2"""
3,"""Dungeonland"""
4,"""Magic: The Gathering - Duels of the Planeswal..."
5,Left 4 Dead 2
6,"""Dark Souls: Prepare to Die Edition"""
7,"""Anno 2070"""
8,"""Age of Wonders"""
9,"""Left 4 Dead 2"""


---
_Add column_
 > - cityid (_c17)

In [42]:
ps_t1 = ps_t1.withColumn('cityid',F.when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) ,\
                                                F.col('c13_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & ~(F.col('_c14').contains('000')), \
                                                F.col('c14_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1)& (F.col('nSplit_c14')==2), \
                                                F.col('c14_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), \
                                                F.col('c14_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000'))  & ~(F.col('_c15').contains('000')), \
                                                F.col('c15_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000')) & (F.col('_c15').contains('000')), \
                                                F.col('c16_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).otherwise(F.col('c15_1')))

In [43]:
ps_t1.select('cityid').distinct().count()

1

In [44]:
ps_t1.select('cityid').distinct().toPandas()[0:50]

Unnamed: 0,cityid
0,"""N"


---
_Add column_
 > - loccountrycode (_c18)

In [45]:
ps_t1 = ps_t1.withColumn('loccountrycode',F.when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) ,\
                                                F.col('c14_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & ~(F.col('_c14').contains('000')), \
                                                F.col('c14_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1)& (F.col('nSplit_c14')==2), \
                                                F.col('c15_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), \
                                                F.col('c14_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000'))  & ~(F.col('_c15').contains('000')), \
                                                F.col('c15_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000')) & (F.col('_c15').contains('000')), \
                                                F.col('c16_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==1), \
                                                F.col('c17_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).otherwise(F.col('c15_2')))

In [46]:
ps_t1 = ps_t1.withColumn("loccountrycode",regexp_replace("loccountrycode", '"', ""))
ps_t1.select('loccountrycode').distinct().count()

220

In [47]:
ps_t1.select('loccountrycode').distinct().toPandas()[200:250]

Unnamed: 0,loccountrycode
200,TF
201,YT
202,AR
203,CF
204,PW
205,PR
206,LU
207,NF
208,SZ
209,VN


In [48]:
ps_t1.filter(F.col('loccountrycode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'cityid']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17,cityid


---
_Add column_
 > - locstatecode (_c19)

In [50]:
ps_t1 = ps_t1.withColumn('locstatecode',F.when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2),\
                                                F.col('c14_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1),\
                                                F.col('c15_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & ~(F.col('_c14').contains('000')), \
                                                F.col('c15_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1)& (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1)& (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000'))  & ~(F.col('_c15').contains('000')), \
                                                F.col('c16_1')).when((F.col('_c11').contains('103')) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('_c14').contains('000')) & (F.col('_c15').contains('000')), \
                                                F.col('c17_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c18_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2)& (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2)& (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==1), \
                                                F.col('c19_1')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2)& (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==2), \
                                                F.col('c18_2')).when((F.col('_c12').contains('103')) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1)  & (F.col('nSplit_c16')==2)& (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).otherwise(F.col('c16_1')))

In [51]:
ps_t1.select('locstatecode').distinct().count()

334

In [59]:
ps_t1.select('locstatecode').distinct().toPandas()[300:350]

Unnamed: 0,locstatecode
300,Y9
301,94
302,Q4
303,91
304,72
305,Q6
306,74
307,E2
308,YT
309,N9


In [57]:
ps_t1.filter(F.col('locstatecode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'cityid']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17,cityid


----
#### We do not need the loccityid column, since we already have locstatecode and loccountrycode
#### dateretrieved is skipped as well: all values fall in the same date range, Feb - Mar 2013
---
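
Each `F.when(...).when(...).otherwise(...)` chain above behaves like an ordered rule list: the first condition that holds decides which split piece is used, and `otherwise` supplies the fallback. The plain-Python sketch below mirrors that evaluation order for the `gameserverip` rules from In [35]. The `first_match` and `has_clan` helpers and the dict-based row are hypothetical, and the sample row values are made up for illustration in the shape of the rows in this notebook.

```python
def first_match(row, rules, default):
    """Return row[col] for the first rule whose predicate holds, else row[default]."""
    for predicate, col in rules:
        if predicate(row):
            return row[col]
    return row[default]


def has_clan(row, col):
    # Mirrors F.col(col).contains('103'); a missing value never matches.
    return "103" in (row.get(col) or "")


# Same order as the when-chain for 'gameserverip'.
GAMESERVERIP_RULES = [
    (lambda r: has_clan(r, "_c11") and r["nSplit_c12"] == 2, "c12_2"),
    (lambda r: has_clan(r, "_c11") and r["nSplit_c12"] == 1, "c13_1"),
    (lambda r: has_clan(r, "_c12") and r["nSplit_c14"] == 2, "c14_2"),
]

# Illustrative row (hypothetical values).
row = {
    "_c11": '"N,103582791433463690,"2005-12-30 10:13:49"',
    "_c12": "570",
    "nSplit_c12": 1,
    "nSplit_c14": 1,
    "c12_2": None,
    "c13_1": "146.66.154.94:27017",
    "c14_2": None,
    "c15_1": '"N',
}

gameserverip = first_match(row, GAMESERVERIP_RULES, default="c15_1")
print(gameserverip)
```

Keeping the rules in a list like this makes the order explicit, which matters because several conditions in the longer chains overlap.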

### 5.1.3 Format, rename columns and save Table1

In [62]:
# Convert the first 50 rows to pandas for an overview of the table
ps_t1.limit(50).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,c19_2,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,76561197979750598,Entropy-,http://steamcommunity.com/profiles/76561197979...,0,3,1,2013-01-22 14:06:12,1,"""N,103582791431943412,""2005-12-15 12:06:01""","""N,""N",...,,"""N",103582791431943412,"""2005-12-15 12:06:01""","""N","""N","""N","""N",DK,"""N"
1,76561197979796496,H3chor,http://steamcommunity.com/profiles/76561197979...,0,3,1,2013-02-26 23:18:44,1,"""N,103582791430606952,""2005-12-17 17:38:33""","""N,""N",...,,"""N",103582791430606952,"""2005-12-17 17:38:33""","""N","""N","""N","""N",N,"""N"
2,76561197979873363,koolaid_sux764,http://steamcommunity.com/id/koolaidsux/,0,3,1,2012-09-23 10:38:59,1,"""N,103582791430274949,""2005-12-23 21:03:42""","""N,""N",...,,"""N",103582791430274949,"""2005-12-23 21:03:42""","""N","""N","""N","""N",N,"""N"
3,76561197979920078,HODJA#2914,http://steamcommunity.com/profiles/76561197979...,0,3,1,2013-02-28 10:19:52,1,"""N,103582791429678800,""2005-12-24 01:10:32""","""N,""N",...,,"""N",103582791429678800,"""2005-12-24 01:10:32""","""N","""N","""N","""N",RU,48
4,76561197979938761,Bubba,http://steamcommunity.com/id/neutronstar1/,0,3,1,2013-03-06 19:06:09,1,"""N,103582791429521408,""2005-12-25 21:46:02""","""N,""N",...,,"""N",103582791429521408,"""2005-12-25 21:46:02""","""N","""N","""N","""N",N,"""N"
5,76561197980001888,Simon the Cat,http://steamcommunity.com/id/mutsie/,0,3,1,2013-03-01 00:24:38,1,"""N,103582791429862347,""2005-12-26 14:19:12""","""N,""N",...,,"""N",103582791429862347,"""2005-12-26 14:19:12""","""N","""N","""N","""N",US,"""N"
6,76561197980055072,Jacky_Bull_1â„¢(AUT),http://steamcommunity.com/profiles/76561197980...,0,3,1,2012-02-10 05:54:06,1,"""N,103582791429831621,""2005-12-28 11:55:31""","""N,""N",...,,"""N",103582791429831621,"""2005-12-28 11:55:31""","""N","""N","""N","""N",AT,05
7,76561197980102002,Noor,http://steamcommunity.com/profiles/76561197980...,1,3,1,2013-03-01 11:02:54,1,"""N,103582791433463690,""2005-12-30 10:13:49""",570,...,,"""N",103582791433463690,"""2005-12-30 10:13:49""",570,146.66.154.94:27017,Dota 2,"""N",N,"""N"
8,76561197980104422,gfghfghfh,http://steamcommunity.com/id/turchenko1337/,0,3,1,2012-09-13 11:49:16,1,"""N,103582791431275365,""2005-12-30 11:49:50""","""N,""N",...,,"""N",103582791431275365,"""2005-12-30 11:49:50""","""N","""N","""N","""N",RU,65
9,76561197980166498,ROFLJOHNNY,http://steamcommunity.com/profiles/76561197980...,3,3,1,2013-02-28 09:12:16,1,"""N,103582791429878408,""2006-01-02 11:38:46""","""N,""N",...,,"""N",103582791429878408,"""2006-01-02 11:38:46""","""N","""N","""N","""N",N,"""N"


In [63]:
ps_t1.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [65]:
# Drop unnecessary columns
ps_t1 = ps_t1.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2', 'c11_3','c12_1','c12_2','c13_1','c13_2','c14_1',\
                    'c14_2','c15_1','c15_2','c16_1','c16_2','c16_3','c17_1','c17_2','c17_3','c18_1',\
                    'c18_2','c18_3','c19_1','c19_2','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19')

In [69]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t1 = rename_col(ps_t1, newColumns)

In [70]:
ps_t1.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [71]:
ps_t1.count()

57359

In [73]:
# Remove the double-quote (") characters from the data
col_list = ['real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t1 = ps_t1.withColumn(i, regexp_replace(i, '"', ""))
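
Once the quotes are removed, a bare `N` remains in many cells. If that `N` is the null marker left over from the SQL export (an assumption; the notebook does not state this), it could be mapped to a real null in the same pass. A minimal plain-Python sketch of the per-cell rule follows; `clean_cell` is a hypothetical helper, not part of the notebook.

```python
def clean_cell(value):
    """Strip double quotes; return None for the assumed null marker 'N'."""
    if value is None:
        return None
    stripped = value.replace('"', "")
    # Assumption: a cell that is exactly 'N' after stripping means "no value".
    return None if stripped == "N" else stripped


for raw in ['"N', '"2005-12-15 12:06:01"', "DK", '"Black Mesa"']:
    print(repr(raw), "->", repr(clean_cell(raw)))
```

In PySpark the same idea could be expressed with `F.when(F.col(c) == 'N', None).otherwise(F.col(c))` after the `regexp_replace` step, provided no column uses a literal `N` as real data.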

In [74]:
ps_t1.limit(50).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197963292609,Sir John,http://steamcommunity.com/profiles/76561197963...,0,3,1,2013-02-14 06:05:34,2,N,103582791431786684,2003-12-09 16:30:20,N,N,N,N,N,N
1,76561197963352036,Spindizzy,http://steamcommunity.com/id/spindizzy/,2,3,1,2013-02-15 21:25:46,1,N,103582791429524514,2003-12-13 13:43:42,N,N,N,N,N,N
2,76561197963646776,PQMarine,http://steamcommunity.com/id/PQMarine/,0,3,1,2013-02-16 16:27:42,1,N,103582791430168284,2003-12-30 12:55:54,N,N,N,N,DE,N
3,76561197963683024,blitzyuk,http://steamcommunity.com/id/blitzyuk/,1,3,1,2013-02-16 18:51:54,1,N,103582791429521900,2004-01-01 18:57:34,N,N,N,N,GB,43
4,76561197963685077,Kayzer,http://steamcommunity.com/profiles/76561197963...,0,3,1,2013-01-16 07:05:22,1,N,103582791432128861,2004-01-01 14:33:39,N,N,N,N,SE,14
5,76561197984967267,L3monT3a,http://steamcommunity.com/id/silvi4/,0,3,1,2012-05-26 08:13:31,2,N,103582791430384313,2006-09-18 03:55:19,N,N,N,N,N,N
6,76561197984973181,lorkhi,http://steamcommunity.com/id/LorkhiGer/,0,3,1,2013-03-01 18:43:36,1,N,103582791430010492,2006-09-18 10:42:32,N,N,N,N,N,N
7,76561197984979925,MechaSavage,http://steamcommunity.com/id/TheYoungLiar/,0,3,1,2013-02-28 18:24:11,1,N,103582791432866998,2006-09-18 19:05:07,N,N,N,N,N,N
8,76561197984989000,xRaven -,http://steamcommunity.com/id/XRAVENXXX/,0,3,1,2011-10-19 06:02:10,1,N,103582791431079627,2006-09-16 03:01:29,N,N,N,N,FR,N
9,76561197985212574,Qooya,http://steamcommunity.com/id/qooya/,0,3,1,2013-03-02 01:01:55,2,N,103582791429521408,2006-09-28 01:48:16,N,N,N,N,N,N


In [75]:
# Save TABLE 1
ps_t1.write.csv('/user/tamng/jwht/CleanData/ps_t1.csv', header = True)

---
## <font color = 'black'> 5.2 Table 2: nSplit_c11 = 3 & nSplit_c10 != 1 </font>

`_c10` should be split into two parts, the second of which is `realname`. The column layout shifts from `_c10` onward but stays consistent up to column `_c13`.
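
As a plain-Python illustration of that split, the sketch below separates a raw `_c10` value at its first comma and keeps the second piece as the real name. `split_c10` is a hypothetical helper, and splitting at the first comma only is an assumption, since a real name may itself contain commas.

```python
def split_c10(value):
    """Split a raw _c10 cell at the first comma into (first_part, realname)."""
    first, _, second = value.partition(",")
    # Remove the stray double quotes that surround each piece.
    return first.strip('"'), second.strip('"')


for raw in ['"N,"Ba', '"N,"butz bum', "N,"]:
    print(repr(raw), "->", split_c10(raw))
```

In PySpark the second piece could be taken with `F.split(F.col('_c10'), ',').getItem(1)`, which would leave only the name part in `realname`.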

---

### 5.2.1 Filter data and check number of Split

In [76]:
player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 3) & (player_summaries_drop.nSplit_c10 != 1)).count()

80

In [77]:
ps_t2 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 3) & (player_summaries_drop.nSplit_c10 != 1))

In [78]:
nSplit_c11_3_c10_n = ps_t2.toPandas()
nSplit_c11_3_c10_n

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197983433666,zfan350,http://steamcommunity.com/profiles/76561197983...,0,3,1,2012-04-29 09:27:17,"""N,""Ba",""",103582791432926187,""2006-06-28 11:31:34""","""N,""N",...,2,3,2,2,2,2,-1,-1,-1,-1
1,76561197970431714,butz,http://steamcommunity.com/profiles/76561197970...,0,3,1,2013-02-14 16:17:29,"""N,""butz bum",""",103582791430787496,""2004-11-16 13:14:29""","""N,""N",...,2,3,2,2,2,2,-1,-1,-1,-1
2,76561197972161391,NarCiuss,http://steamcommunity.com/profiles/76561197972...,0,3,1,2013-01-31 16:45:47,"""N,""Brandon G",""",103582791430699298,""2004-12-22 17:23:33""","""N,""N",...,2,3,2,2,1,1,2,-1,-1,-1
3,76561197970997287,ThTh | Koala,http://steamcommunity.com/id/thommes/,0,3,1,2011-03-09 06:49:23,"""N,""Th",""",103582791429521543,""2004-11-22 09:07:53""","""N,""N",...,2,3,2,2,1,1,2,-1,-1,-1
4,76561197980777799,Tramper (â—£ â—¢),http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-02-09 20:45:46,"""N,""KÃ¥re",""",103582791430380703,""2006-02-05 01:05:22""","""N,""N",...,2,3,2,2,1,2,1,-1,-1,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,76561197977155078,Sap,http://steamcommunity.com/profiles/76561197977...,0,3,1,2013-02-13 20:49:38,"""N,""Evan K",""",103582791431010006,""2005-07-23 01:26:28""","""N,""N",...,2,3,2,2,1,1,1,1,-1,-1
76,76561197960833489,spec,http://steamcommunity.com/profiles/76561197960...,0,3,1,2009-06-07 20:53:23,"N,",""",103582791429521408,""2003-09-16 10:23:10""","""N,""N",...,2,3,2,2,1,2,1,-1,-1,-1
77,76561197964247514,Frontier Psychiatrist,http://steamcommunity.com/id/siffus/,0,3,1,2013-02-16 00:15:54,"""N,""An assortment of ice creams and sorbets",""",103582791429813574,""2004-02-01 15:16:23""","""N,""N",...,2,3,2,2,1,2,1,-1,-1,-1
78,76561197969606017,PeterToshBuddy_-',http://steamcommunity.com/profiles/76561197969...,0,3,1,2013-01-04 19:57:52,"""N,""Brendon",""",103582791433331103,""2004-10-15 08:06:02""","""N,""N",...,2,3,2,2,1,1,1,1,-1,-1


In [81]:
# Every row has the primaryclanid in _c11, so _c11 is consistent across this subset
ps_t2.filter(F.col('_c11').contains('103')).count()

80

In [80]:
# Count the unique values in each nSplit column
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_3_c10_n[i].nunique())

print(pd.DataFrame(zip(check_list,unique_col)))

             0  1
0    nSplit_c7  1
1    nSplit_c8  1
2    nSplit_c9  1
3   nSplit_c10  1
4   nSplit_c11  1
5   nSplit_c12  2
6   nSplit_c13  1
7   nSplit_c14  2
8   nSplit_c15  2
9   nSplit_c16  3
10  nSplit_c17  2
11  nSplit_c18  1
12  nSplit_c19  1


___
#### Look at the split counts to decide whether each column needs to be split

In [82]:
nSplit_c11_3_c10_n.nSplit_c17.unique()

array([-1,  1])

In [83]:
nSplit_c11_3_c10_n.nSplit_c18.unique()

array([-1])

In [84]:
nSplit_c11_3_c10_n.nSplit_c19.unique()

array([-1])

In [104]:
nSplit_c11_3_c10_n[['_c13']]

Unnamed: 0,_c13
0,"""N,""N"
1,"""N,""N"
2,"""N,""N"
3,"""N,""N"
4,"""N,""N"
...,...
75,"""N,""N"
76,"""N,""N"
77,"""N,""N"
78,"""N,""N"


In [91]:
# Define the columns to split and the names of the resulting columns
col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2','c11_3' ]

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2', 'c16_3']


# Apply split function
ps_t2 = split_3_column(ps_t2, col_c11, newcols_c11)
ps_t2 = split_2_column(ps_t2, col_c12, newcols_c12)
ps_t2 = split_2_column(ps_t2, col_c13, newcols_c13)
ps_t2 = split_2_column(ps_t2, col_c14, newcols_c14)
ps_t2 = split_2_column(ps_t2, col_c15, newcols_c15)
ps_t2 = split_3_column(ps_t2, col_c16, newcols_c16)

In [95]:
ps_t2.toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','nSplit_c12','_c13','nSplit_c13','_c14','nSplit_c14','_c15','nSplit_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,nSplit_c12,_c13,nSplit_c13,_c14,nSplit_c14,_c15,nSplit_c15,_c16,_c17,_c18,_c19
0,3,1,2012-04-29 09:27:17,"""N,""Ba",""",103582791432926187,""2006-06-28 11:31:34""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-03-01 23:09:54""",2,,,,
1,3,1,2013-02-14 16:17:29,"""N,""butz bum",""",103582791430787496,""2004-11-16 13:14:29""","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""2013-02-28 14:37:33""",2,,,,
2,3,1,2013-01-31 16:45:47,"""N,""Brandon G",""",103582791430699298,""2004-12-22 17:23:33""","""N,""N",2,"""N,""N",2,US,1,CO,1,"""N,""2013-02-28 14:28:26""",,,
3,3,1,2011-03-09 06:49:23,"""N,""Th",""",103582791429521543,""2004-11-22 09:07:53""","""N,""N",2,"""N,""N",2,NL,1,05,1,"""N,""2013-02-28 14:27:37""",,,
4,3,1,2013-02-09 20:45:46,"""N,""KÃ¥re",""",103582791430380703,""2006-02-05 01:05:22""","""N,""N",2,"""N,""N",2,DK,1,"""N,""N",2,2013-03-11 12:28:42,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75,3,1,2013-02-13 20:49:38,"""N,""Evan K",""",103582791431010006,""2005-07-23 01:26:28""","""N,""N",2,"""N,""N",2,US,1,HI,1,1131,2013-02-28 14:39:44,,
76,3,1,2013-02-16 00:15:54,"""N,""An assortment of ice creams and sorbets",""",103582791429813574,""2004-02-01 15:16:23""","""N,""N",2,"""N,""N",2,US,1,"""N,""N",2,2013-02-28 14:35:21,,,
77,3,1,2009-06-07 20:53:23,"N,",""",103582791429521408,""2003-09-16 10:23:10""","""N,""N",2,"""N,""N",2,PT,1,"""N,""N",2,2013-02-28 14:19:23,,,
78,3,1,2013-01-04 19:57:52,"""N,""Brendon",""",103582791433331103,""2004-10-15 08:06:02""","""N,""N",2,"""N,""N",2,BR,1,05,1,7943,2013-02-28 14:26:18,,


---
### 5.2.2 Adding columns and condition
From this step on, we add columns based on the `nSplit_*` column logic.
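
For reference, the `nSplit_*` values used in the conditions below appear to count the comma-separated pieces in a raw column, with `-1` for a missing value. This reading is an assumption based on the outputs above (for example `nSplit_c11 = 3` for a three-part `_c11`); the function that computes these columns is defined earlier in the notebook and is not reproduced here. A hypothetical plain-Python equivalent:

```python
def n_split(value, sep=","):
    """Number of sep-separated pieces in value, or -1 when value is missing."""
    if value is None:
        return -1
    return len(value.split(sep))


samples = ['",103582791432926187,"2006-06-28 11:31:34"', '"N,"N', "US", None]
for raw in samples:
    print(repr(raw), "->", n_split(raw))
```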

_Add columns:_
 > - realname
 > - primaryclanid
 > - timecreated

In [97]:
ps_t2.filter(F.col('c11_2').contains('103')).count()

80

In [98]:
ps_t2 = ps_t2.withColumn('realname',F.when(F.col('_c11').contains('103'), F.col('_c10')))
ps_t2 = ps_t2.withColumn('primaryclanid',F.when(F.col('_c11').contains('103'), F.col('c11_2')))
ps_t2 = ps_t2.withColumn('timecreated',F.when(F.col('_c11').contains('103'), F.col('c11_3')))

In [99]:
ps_t2.filter(F.col('timecreated').contains('20')).count()

80

In [110]:
ps_t2.select('timecreated').distinct().show()

+--------------------+
|         timecreated|
+--------------------+
|"2004-07-07 00:08...|
|"2003-10-13 18:54...|
|"2006-02-01 09:16...|
|"2003-09-23 11:28...|
|"2005-10-08 16:56...|
|"2005-06-16 06:08...|
|"2004-10-15 08:06...|
|"2004-11-06 22:38...|
|"2005-07-23 01:26...|
|"2006-06-29 10:33...|
|"2004-01-30 09:44...|
|"2006-11-10 06:08...|
|"2004-12-29 21:48...|
|"2005-02-23 12:22...|
|"2006-05-13 10:11...|
|"2005-07-15 09:16...|
|"2005-02-25 15:25...|
|"2006-08-24 13:10...|
|"2004-07-28 09:00...|
|"2004-12-22 17:23...|
+--------------------+
only showing top 20 rows



---
_Add column_
 > - gameid (_c14)

In [111]:
nSplit_c11_3_c10_n.nSplit_c11.unique()

array([3])

In [100]:
ps_t2 = ps_t2.withColumn('gameid',F.when(F.col('_c11').contains('103'),F.col('c12_1')))

In [101]:
ps_t2.select('gameid').distinct().count()

2

In [103]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t2.select('gameid').distinct().toPandas()[:5]

Unnamed: 0,gameid
0,33930
1,"""N"


---
_Add column_
 > - gameserverip (_c15)

In [105]:
nSplit_c11_3_c10_n.nSplit_c12.unique()

array([2, 1])

In [106]:
ps_t2 = ps_t2.withColumn('gameserverip',F.when((F.col('nSplit_c12')==1), \
                                                F.col('c13_1')).when((F.col('nSplit_c12')==2), \
                                                F.col('c12_2')))

In [107]:
ps_t2.select('gameserverip').distinct().count()

1

In [108]:
# Use this code to check if the 'gameserverip' was parsed correctly.
ps_t2.select('gameserverip').distinct().toPandas()[:5]

Unnamed: 0,gameserverip
0,"""N"


---
_Add column_
 > - gameextrainfo (_c16)

In [113]:
nSplit_c11_3_c10_n.nSplit_c13.unique()

array([2])

In [114]:
ps_t2 = ps_t2.withColumn('gameextrainfo',F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), \
                                                F.col('c13_2')).when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), \
                                                F.col('c13_1')))

In [118]:
ps_t2.select('gameextrainfo').distinct().count()

2

In [116]:
# Use this code to check if the 'gameextrainfo' was parsed correctly.
ps_t2.select('gameextrainfo').distinct().toPandas()[:5]

Unnamed: 0,gameextrainfo
0,"""Arma 2: Operation Arrowhead"""
1,"""N"


In [None]:
# 'cityid' does not exist on ps_t2 yet at this point, so show 'gameextrainfo' instead
ps_t2.filter(F.col('gameextrainfo').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'gameextrainfo']]

---
_Add column:_
 > - cityid (_c17)

In [117]:
nSplit_c11_3_c10_n.nSplit_c14.unique()

array([2, 1])

In [120]:
ps_t2 = ps_t2.withColumn('cityid',F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), \
                                                F.col('c14_1')).when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), \
                                                F.col('c13_2')))

In [121]:
ps_t2.select('cityid').distinct().count()

1

In [122]:
# Use this code to check if the 'cityid' was parsed correctly.
ps_t2.select('cityid').distinct().toPandas()[:5]

Unnamed: 0,cityid
0,"""N"


In [123]:
ps_t2.filter(F.col('cityid').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'cityid']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17,cityid


---
_Add column_
 > - loccountrycode (_c18)

In [124]:
nSplit_c11_3_c10_n.nSplit_c15.unique()

array([2, 1])

In [125]:
ps_t2 = ps_t2.withColumn('loccountrycode',F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1), \
                                                F.col('c15_1')).when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), \
                                                F.col('c14_2')).when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), \
                                                F.col('c14_1')))

In [126]:
ps_t2.select('loccountrycode').distinct().count()

25

In [129]:
# Use this code to check if the 'loccountrycode' was parsed correctly.
ps_t2.select('loccountrycode').distinct().toPandas()[0:25]

Unnamed: 0,loccountrycode
0,NL
1,PL
2,RU
3,PT
4,AU
5,CA
6,GB
7,BR
8,DE
9,ES


In [128]:
ps_t2.filter(F.col('loccountrycode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'cityid']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17,cityid


---
_Add column_
 > - locstatecode (_c19)

In [130]:
nSplit_c11_3_c10_n.nSplit_c16.unique()

array([-1,  2,  1])

In [135]:
ps_t2 = ps_t2.withColumn('locstatecode',F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')).when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), \
                                                F.col('c15_1')).when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), \
                                                F.col('c14_2')).when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1), \
                                                F.col('c15_1')))

In [136]:
ps_t2.select('locstatecode').distinct().count()

30

In [139]:
# Check that 'locstatecode' was parsed correctly.
ps_t2.select('locstatecode').distinct().toPandas()[0:50]

Unnamed: 0,locstatecode
0,07
1,54
2,LA
3,A8
4,16
5,VA
6,B2
7,KY
8,18
9,MI


In [140]:
ps_t2.filter(F.col('locstatecode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17', 'cityid']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17,cityid


---
### 5.2.3 Format, rename columns and save Table2

In [141]:
ps_t2.limit(50).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,c16_3,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,76561197983433666,zfan350,http://steamcommunity.com/profiles/76561197983...,0,3,1,2012-04-29 09:27:17,"""N,""Ba",""",103582791432926187,""2006-06-28 11:31:34""","""N,""N",...,,"""N,""Ba",103582791432926187,"""2006-06-28 11:31:34""","""N","""N","""N","""N","""N","""N"
1,76561197970431714,butz,http://steamcommunity.com/profiles/76561197970...,0,3,1,2013-02-14 16:17:29,"""N,""butz bum",""",103582791430787496,""2004-11-16 13:14:29""","""N,""N",...,,"""N,""butz bum",103582791430787496,"""2004-11-16 13:14:29""","""N","""N","""N","""N","""N","""N"
2,76561197972161391,NarCiuss,http://steamcommunity.com/profiles/76561197972...,0,3,1,2013-01-31 16:45:47,"""N,""Brandon G",""",103582791430699298,""2004-12-22 17:23:33""","""N,""N",...,,"""N,""Brandon G",103582791430699298,"""2004-12-22 17:23:33""","""N","""N","""N","""N",US,CO
3,76561197970997287,ThTh | Koala,http://steamcommunity.com/id/thommes/,0,3,1,2011-03-09 06:49:23,"""N,""Th",""",103582791429521543,""2004-11-22 09:07:53""","""N,""N",...,,"""N,""Th",103582791429521543,"""2004-11-22 09:07:53""","""N","""N","""N","""N",NL,05
4,76561197980777799,Tramper (â—£ â—¢),http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-02-09 20:45:46,"""N,""KÃ¥re",""",103582791430380703,""2006-02-05 01:05:22""","""N,""N",...,,"""N,""KÃ¥re",103582791430380703,"""2006-02-05 01:05:22""","""N","""N","""N","""N",DK,"""N"
5,76561197980897422,Seth,http://steamcommunity.com/profiles/76561197980...,0,3,1,2009-08-21 10:13:24,"""N,""Steven C",""",103582791430535848,""2006-02-09 14:12:17""","""N,""N",...,,"""N,""Steven C",103582791430535848,"""2006-02-09 14:12:17""","""N","""N","""N","""N",DK,"""N"
6,76561197985003229,^eLower ãƒ„,http://steamcommunity.com/profiles/76561197985...,0,3,1,2013-03-01 09:57:52,"""N,""Marek",""",103582791429814614,""2006-09-20 08:49:00""","""N,""N",...,,"""N,""Marek",103582791429814614,"""2006-09-20 08:49:00""","""N","""N","""N","""N",SK,05
7,76561197982694259,Entoriel,http://steamcommunity.com/profiles/76561197982...,0,3,1,2010-11-22 10:39:29,"""N,""ok'",""",103582791430068767,""2006-05-21 10:49:19""","""N,""N",...,,"""N,""ok'",103582791430068767,"""2006-05-21 10:49:19""","""N","""N","""N","""N",BR,21
8,76561197972223670,Flipmode,http://steamcommunity.com/profiles/76561197972...,0,3,1,2008-03-10 13:00:03,"N,",""",103582791429530572,""2004-12-24 10:04:13""","""N,""N",...,,"N,",103582791429530572,"""2004-12-24 10:04:13""","""N","""N","""N","""N","""N","""N"
9,76561197970139452,Hydrix,http://steamcommunity.com/profiles/76561197970...,0,3,1,2011-11-13 16:31:58,"""N,""N.oÂ°||'~",""",103582791429521408,""2004-11-07 12:09:11""","""N,""N",...,,"""N,""N.oÂ°||'~",103582791429521408,"""2004-11-07 12:09:11""","""N","""N","""N","""N","""N","""N"


In [142]:
ps_t2.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [143]:
# Drop the helper columns that are no longer needed
ps_t2 = ps_t2.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2', 'c11_3','c12_1','c12_2','c13_1','c13_2','c14_1',\
                    'c14_2','c15_1','c15_2','c16_1','c16_2','c16_3','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19')

In [145]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t2 = rename_col(ps_t2, newColumns)

In [146]:
ps_t2.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [147]:
ps_t2.count()

80

In [150]:
ps_t2 = ps_t2.withColumn('comment_permission',F.split('comment_permission', ',').getItem(0))
ps_t2 = ps_t2.withColumn('real_name',F.split('real_name', ',').getItem(1))

In [156]:
# Remove the double-quote characters left over in the data
col_list = ['comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
     ps_t2 = ps_t2.withColumn(i,regexp_replace(i, '"', ""))
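
The split and quote-removal cells above can be traced on a single value in plain Python. This is only an illustrative sketch, not the Spark code: `split_merged` and `strip_quotes` are hypothetical helper names, and the two sample values are copied from the `_c10` column shown in the preview of Table 2.

```python
def split_merged(raw):
    """Split a merged 'comment_permission,real_name' value on the comma."""
    parts = raw.split(',')
    comment_permission = parts[0]
    # Mirror getItem(1): take the second piece, or None if there is none.
    real_name = parts[1] if len(parts) > 1 else None
    return comment_permission, real_name


def strip_quotes(value):
    """Remove every double-quote character, leaving None untouched."""
    return None if value is None else value.replace('"', '')


for raw in ['"N,"Ba', 'N,']:
    permission, name = (strip_quotes(v) for v in split_merged(raw))
    print(repr(permission), repr(name))
```

The second sample ends in a comma, so its real name comes out as an empty string, which is what row 8 of the cleaned table shows.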

In [157]:
ps_t2.limit(50).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197983433666,zfan350,http://steamcommunity.com/profiles/76561197983...,0,3,1,2012-04-29 09:27:17,N,Ba,103582791432926187,2006-06-28 11:31:34,N,N,N,N,N,N
1,76561197970431714,butz,http://steamcommunity.com/profiles/76561197970...,0,3,1,2013-02-14 16:17:29,N,butz bum,103582791430787496,2004-11-16 13:14:29,N,N,N,N,N,N
2,76561197972161391,NarCiuss,http://steamcommunity.com/profiles/76561197972...,0,3,1,2013-01-31 16:45:47,N,Brandon G,103582791430699298,2004-12-22 17:23:33,N,N,N,N,US,CO
3,76561197970997287,ThTh | Koala,http://steamcommunity.com/id/thommes/,0,3,1,2011-03-09 06:49:23,N,Th,103582791429521543,2004-11-22 09:07:53,N,N,N,N,NL,05
4,76561197980777799,Tramper (â—£ â—¢),http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-02-09 20:45:46,N,KÃ¥re,103582791430380703,2006-02-05 01:05:22,N,N,N,N,DK,N
5,76561197980897422,Seth,http://steamcommunity.com/profiles/76561197980...,0,3,1,2009-08-21 10:13:24,N,Steven C,103582791430535848,2006-02-09 14:12:17,N,N,N,N,DK,N
6,76561197985003229,^eLower ãƒ„,http://steamcommunity.com/profiles/76561197985...,0,3,1,2013-03-01 09:57:52,N,Marek,103582791429814614,2006-09-20 08:49:00,N,N,N,N,SK,05
7,76561197982694259,Entoriel,http://steamcommunity.com/profiles/76561197982...,0,3,1,2010-11-22 10:39:29,N,ok',103582791430068767,2006-05-21 10:49:19,N,N,N,N,BR,21
8,76561197972223670,Flipmode,http://steamcommunity.com/profiles/76561197972...,0,3,1,2008-03-10 13:00:03,N,,103582791429530572,2004-12-24 10:04:13,N,N,N,N,N,N
9,76561197970139452,Hydrix,http://steamcommunity.com/profiles/76561197970...,0,3,1,2011-11-13 16:31:58,N,N.oÂ°||'~,103582791429521408,2004-11-07 12:09:11,N,N,N,N,N,N


In [162]:
# Save TABLE 2
ps_t2.write.csv('/user/tamng/jwht/CleanData/ps_t2.csv', header = True)

---
## <font color ='black'> 5.3 Table 3: nSplit_c11 > 3 </font>

<br> </br>

<font color ='blue'> These players have multiple real names </font>

- Everything is consistent through _c13, so _c11 and _c12 do not need to be adjusted
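
The `nSplit_*` columns are computed earlier in the notebook. The rows shown in this section are consistent with counting the comma-separated pieces of a value, with -1 standing for a missing value. A plain-Python sketch under that assumption (`n_split` is a hypothetical name) shows why a real name that contains commas ends up with a count above 3:

```python
def n_split(value):
    """Count comma-separated pieces; -1 stands for a missing value."""
    if value is None:
        return -1
    return len(value.split(','))


print(n_split('Rick, Rich, Dick, Richard'))  # 4
print(n_split('"N,"N'))                      # 2
print(n_split(None))                         # -1
```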

### 5.3.1 Filter data and check number of Split

In [164]:
ps_t3 = player_summaries_drop.filter(player_summaries_drop.nSplit_c11 > 3)
ps_t3.count()

26

In [172]:
nSplit_c11_n = ps_t3.toPandas()
nSplit_c11_n[['_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18']]

Unnamed: 0,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18
0,"Tobi, (Twby, Kit, Keksi, KO-NE-KO-CHAN)",103582791429663584,2006-02-09 08:03:05,"""N,""N","""N,""N",DE,"""N,""N",2013-03-01 17:08:03
1,"Hruod Ruhm, Lant Land, Besitz bzw. Nand kÃ¼hn,...",103582791429726636,2005-11-14 13:33:26,"""N,""N","""N,""N",DE,01,12476
2,"Rick, Rich, Dick, Richard",103582791433531922,2006-07-28 22:01:56,"""N,""N","""N,""N",CA,ON,4511
3,",,,,,,,,,,,,",103582791429638424,2004-09-06 17:41:22,"""N,""N","""N,""N",ES,53,15088
4,"Veni, vidi, vici ich kam, ich sah, ich siegte'",103582791431349569,2005-11-03 05:35:02,80,"""N,""Counter-Strike: Condition Zero""","""N,""DE""",07,"""N,""2013-02-28 14:31:42"""
5,",,,",103582791430855102,2005-07-08 13:38:02,"""N,""N","""N,""N",FR,A8,"""N,""2013-02-28 14:30:49"""
6,"A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,2005-01-26 23:18:18,"""N,""N","""N,""N",FI,13,"""N,""2013-02-28 14:29:31"""
7,"Reborn,Cynical,Tumbo,Mero",103582791430065590,2003-09-13 11:41:47,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:12""",
8,".,!,,",103582791429604273,2004-10-04 23:55:13,"""N,""N","""N,""N",SE,"""N,""N",2013-02-28 14:26:09
9,"--,,,--",103582791432807346,2006-04-10 14:35:32,"""N,""N","""N,""N",DE,07,"""N,""2013-03-11 12:29:00"""


In [166]:
# Every row in _c12 holds a primaryclanid, so _c12 is consistent across this table
ps_t3.filter(F.col('_c12').contains('103')).count()

26
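
`contains('103')` also accepts any value that merely has `103` somewhere inside it. A stricter check is sketched below in plain Python. It assumes every clan id is an 18-digit number starting with `103`, as in the rows shown above; `looks_like_clan_id` is a hypothetical helper and `'4103'` is a made-up counterexample.

```python
import re

# Assumption: a clan id is '103' followed by exactly 15 more digits.
CLAN_ID = re.compile(r'103\d{15}')


def looks_like_clan_id(value):
    """True only when the whole value has the assumed clan-id shape."""
    return value is not None and CLAN_ID.fullmatch(value) is not None


print(looks_like_clan_id('103582791429663584'))
print(looks_like_clan_id('4103'))
```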

___
#### Look at the number of splits per column to decide which columns need to be split

In [169]:
# Check the unique values in each nSplit column
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_n[i].unique())

print(pd.DataFrame(zip(check_list,unique_col)))

             0                 1
0    nSplit_c7               [1]
1    nSplit_c8               [1]
2    nSplit_c9               [1]
3   nSplit_c10               [1]
4   nSplit_c11  [5, 4, 13, 6, 8]
5   nSplit_c12               [1]
6   nSplit_c13               [1]
7   nSplit_c14            [2, 1]
8   nSplit_c15            [2, 1]
9   nSplit_c16            [1, 2]
10  nSplit_c17            [2, 1]
11  nSplit_c18        [1, 2, -1]
12  nSplit_c19           [-1, 1]
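
Read against the table above, the rule is that a column needs splitting when at least one row holds two pieces in it. A small plain-Python sketch of that decision follows, with the observed counts copied from the output. `columns_to_split` is a hypothetical helper, and `_c11` is left out by hand because its extra commas belong to the real name rather than to merged fields.

```python
# Observed nSplit values per raw column, copied from the output above.
observed = {
    '_c12': [1], '_c13': [1], '_c14': [2, 1], '_c15': [2, 1],
    '_c16': [1, 2], '_c17': [2, 1], '_c18': [1, 2, -1], '_c19': [-1, 1],
}


def columns_to_split(counts):
    """Return the columns where some row holds more than one piece."""
    return [col for col, values in counts.items() if max(values) > 1]


print(columns_to_split(observed))
```

The result lists `_c14` through `_c18`, the same five columns that are passed to `split_2_column` in the next cell.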


In [174]:
# Define the columns to split and the names of the resulting columns
col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2']

col_c17 = ['_c17']
newcols_c17 = ['c17_1', 'c17_2']

col_c18 = ['_c18']
newcols_c18 = ['c18_1', 'c18_2']


# Apply split function
ps_t3 = split_2_column(ps_t3, col_c14, newcols_c14)
ps_t3 = split_2_column(ps_t3, col_c15, newcols_c15)
ps_t3 = split_2_column(ps_t3, col_c16, newcols_c16)
ps_t3 = split_2_column(ps_t3, col_c17, newcols_c17)
ps_t3 = split_2_column(ps_t3, col_c18, newcols_c18)

In [175]:
ps_t3.toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','nSplit_c12','_c13','nSplit_c13','_c14','nSplit_c14','_c15','nSplit_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,nSplit_c12,_c13,nSplit_c13,_c14,nSplit_c14,_c15,nSplit_c15,_c16,_c17,_c18,_c19
0,3,1,2013-01-30 09:28:37,1,"Tobi, (Twby, Kit, Keksi, KO-NE-KO-CHAN)",103582791429663584,1,2006-02-09 08:03:05,1,"""N,""N",2,"""N,""N",2,DE,"""N,""N",2013-03-01 17:08:03,
1,3,1,2013-02-19 09:38:37,1,"Hruod Ruhm, Lant Land, Besitz bzw. Nand kÃ¼hn,...",103582791429726636,1,2005-11-14 13:33:26,1,"""N,""N",2,"""N,""N",2,DE,01,12476,2013-02-28 14:31:45
2,3,1,2013-03-01 01:33:05,1,"Rick, Rich, Dick, Richard",103582791433531922,1,2006-07-28 22:01:56,1,"""N,""N",2,"""N,""N",2,CA,ON,4511,2013-03-14 13:16:31
3,3,1,2013-02-17 17:13:50,1,",,,,,,,,,,,,",103582791429638424,1,2004-09-06 17:41:22,1,"""N,""N",2,"""N,""N",2,ES,53,15088,2013-02-28 14:25:47
4,3,1,2013-02-16 19:55:30,1,"Veni, vidi, vici ich kam, ich sah, ich siegte'",103582791431349569,1,2005-11-03 05:35:02,1,80,1,"""N,""Counter-Strike: Condition Zero""",2,"""N,""DE""",07,"""N,""2013-02-28 14:31:42""",
5,3,1,2013-02-17 11:36:15,2,",,,",103582791430855102,1,2005-07-08 13:38:02,1,"""N,""N",2,"""N,""N",2,FR,A8,"""N,""2013-02-28 14:30:49""",
6,3,1,2011-03-08 02:34:07,1,"Reborn,Cynical,Tumbo,Mero",103582791430065590,1,2003-09-13 11:41:47,1,"""N,""N",2,"""N,""N",2,"""N,""N","""N,""2013-02-28 14:19:12""",,
7,3,1,2013-02-18 14:05:01,2,"A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,1,2005-01-26 23:18:18,1,"""N,""N",2,"""N,""N",2,FI,13,"""N,""2013-02-28 14:29:31""",
8,3,1,2013-02-17 14:41:21,2,".,!,,",103582791429604273,1,2004-10-04 23:55:13,1,"""N,""N",2,"""N,""N",2,SE,"""N,""N",2013-02-28 14:26:09,
9,3,1,2013-03-01 14:28:26,1,"--,,,--",103582791432807346,1,2006-04-10 14:35:32,1,"""N,""N",2,"""N,""N",2,DE,07,"""N,""2013-03-11 12:29:00""",


---
### 5.3.2 Adding columns and condition
From this step on, we add the target columns using conditions on the nSplit columns.

_Add columns:_
 > - realname
 > - primaryclanid
 > - timecreated

In [176]:
ps_t3 = ps_t3.withColumn('realname', F.col('_c11'))
ps_t3 = ps_t3.withColumn('primaryclanid', F.col('_c12'))
ps_t3 = ps_t3.withColumn('timecreated', F.col('_c13'))

In [177]:
ps_t3.filter(F.col('timecreated').contains('20')).count()

26
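
`contains('20')` is a loose test, since any string with `20` in it passes. A stricter variant is sketched here in plain Python, assuming the `YYYY-MM-DD HH:MM:SS` layout seen in the rows above; `is_timestamp` is a hypothetical helper.

```python
from datetime import datetime


def is_timestamp(value, fmt='%Y-%m-%d %H:%M:%S'):
    """True when the value parses completely with the given format."""
    if value is None:
        return False
    try:
        datetime.strptime(value, fmt)
    except ValueError:
        return False
    return True


print(is_timestamp('2006-02-09 08:03:05'))
print(is_timestamp('120'))
```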

In [None]:
#ps_t3.select('timecreated').distinct().count()

In [178]:
# Check that 'timecreated' was parsed correctly.
ps_t3.select('timecreated').distinct().toPandas()[0:50]

Unnamed: 0,timecreated
0,2006-07-29 23:15:27
1,2004-04-01 13:28:58
2,2003-09-12 14:23:59
3,2005-01-26 23:18:18
4,2005-07-08 13:38:02
5,2006-07-28 22:01:56
6,2006-02-09 08:03:05
7,2003-09-30 10:19:49
8,2005-01-26 08:43:17
9,2005-11-03 05:35:02


---
_Add column_
 > - gameid (_c14)

In [179]:
ps_t3 = ps_t3.withColumn('gameid',F.col('c14_1'))

In [180]:
ps_t3.select('gameid').distinct().count()

4

In [181]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t3.select('gameid').distinct().toPandas()[0:50]

Unnamed: 0,gameid
0,33930
1,4000
2,"""N"
3,80


In [183]:
ps_t3.filter(F.col('gameid').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17


---
_Add column_
 > - gameserverip (_c15)

In [184]:
ps_t3 = ps_t3.withColumn('gameserverip',F.when((F.col('nSplit_c14')==1), \
                                                F.col('c15_1')).when((F.col('nSplit_c14')==2), \
                                                F.col('c14_2')))

In [185]:
ps_t3.select('gameserverip').distinct().count()

2

In [186]:
# Check that 'gameserverip' was parsed correctly.
ps_t3.select('gameserverip').distinct().toPandas()[0:50]

Unnamed: 0,gameserverip
0,74.119.216.30:27015
1,"""N"


In [187]:
ps_t3.filter(F.col('gameserverip').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17
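
The empty result above matters because a chain of `F.when(...)` calls without a final `.otherwise(...)` yields null for any row that matches none of its conditions, so a null here would point to a combination the chain does not cover. The plain-Python sketch below mimics that first-match behaviour for the `gameserverip` rule; `first_match` and `gameserverip_rule` are hypothetical names, and the second row is a made-up case with an unexpected split count.

```python
def first_match(branches, row):
    """Return the value of the first branch whose condition holds, else None."""
    for condition, pick in branches:
        if condition(row):
            return pick(row)
    return None


# Same two branches as the gameserverip cell above.
gameserverip_rule = [
    (lambda r: r['nSplit_c14'] == 1, lambda r: r['c15_1']),
    (lambda r: r['nSplit_c14'] == 2, lambda r: r['c14_2']),
]

covered = {'nSplit_c14': 1, 'c14_2': None, 'c15_1': '74.119.216.30:27015'}
uncovered = {'nSplit_c14': 3, 'c14_2': None, 'c15_1': None}
print(first_match(gameserverip_rule, covered))
print(first_match(gameserverip_rule, uncovered))
```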


---
_Add column_
 > - gameextrainfo (_c16)

In [188]:
nSplit_c11_n.nSplit_c15.unique()

array([2, 1])

In [189]:
ps_t3 = ps_t3.withColumn('gameextrainfo',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('nSplit_c14')==2), \
                                                F.col('c15_1')))

In [190]:
ps_t3.select('gameextrainfo').distinct().count()

4

In [191]:
# Check that 'gameextrainfo' was parsed correctly.
ps_t3.select('gameextrainfo').distinct().toPandas()[0:50]

Unnamed: 0,gameextrainfo
0,"""Arma 2: Operation Arrowhead"""
1,Garry's Mod
2,"""N"
3,"""Counter-Strike: Condition Zero"""


In [192]:
ps_t3.filter(F.col('gameextrainfo').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17


---
_Add column_
 > - cityid (_c17)

In [193]:
nSplit_c11_n.nSplit_c16.unique()

array([1, 2])

In [195]:
ps_t3 = ps_t3.withColumn('cityid',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')))

In [196]:
ps_t3.select('cityid').distinct().count()

1

In [197]:
# Check that 'cityid' was parsed correctly.
ps_t3.select('cityid').distinct().toPandas()[0:50]

Unnamed: 0,cityid
0,"""N"


In [198]:
ps_t3.filter(F.col('cityid').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17


---
_Add column_
 > - loccountrycode (_c18)

In [199]:
nSplit_c11_n.nSplit_c17.unique()

array([2, 1])

In [200]:
ps_t3 = ps_t3.withColumn('loccountrycode',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')))

In [201]:
ps_t3.select('loccountrycode').distinct().count()

13

In [202]:
# Check that 'loccountrycode' was parsed correctly.
ps_t3.select('loccountrycode').distinct().toPandas()[0:50]

Unnamed: 0,loccountrycode
0,FI
1,PL
2,"""DE"""
3,SO
4,ID
5,CA
6,DE
7,ES
8,US
9,FR


In [203]:
ps_t3.filter(F.col('loccountrycode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17


---
_Add column_
 > - locstatecode (_c19)

In [204]:
nSplit_c11_n.nSplit_c18.unique()

array([ 1,  2, -1])

In [206]:
ps_t3 = ps_t3.withColumn('locstatecode',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==1), \
                                                F.col('_c19')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==2), \
                                                F.col('c18_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')))
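
Every branch above answers the same question: counting pieces from `_c14` onward, where does the sixth piece land? A plain-Python sketch of that positional view follows. It is not the notebook's method; `flatten_pieces`, `parse_tail` and `FIELDS` are hypothetical names, and it assumes no field value itself contains a comma (a game title with a comma would shift everything after it).

```python
FIELDS = ['gameid', 'gameserverip', 'gameextrainfo',
          'cityid', 'loccountrycode', 'locstatecode']


def flatten_pieces(raw_columns):
    """Concatenate the comma-separated pieces of _c14.._c19, skipping nulls."""
    pieces = []
    for raw in raw_columns:
        if raw is not None:
            pieces.extend(raw.split(','))
    return pieces


def parse_tail(raw_columns):
    """Assign the first six pieces to the six target fields by position."""
    pieces = flatten_pieces(raw_columns)
    return {name: (pieces[i] if i < len(pieces) else None)
            for i, name in enumerate(FIELDS)}


# _c14 .. _c19 of row 4 in the Table 3 preview (the Condition Zero player).
row = ['80', '"N,"Counter-Strike: Condition Zero"', '"N,"DE"', '07',
       '"N,"2013-02-28 14:31:42"', None]
print(parse_tail(row))
```

For this row the sketch gives the same six values as the `when` chains, quotes included, which can be compared against row 4 of the preview in 5.3.3.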

In [207]:
ps_t3.select('locstatecode').distinct().count()

17

In [208]:
# Check that 'locstatecode' was parsed correctly.
ps_t3.select('locstatecode').distinct().toPandas()[0:50]

Unnamed: 0,locstatecode
0,07
1,SC
2,AZ
3,29
4,01
5,A8
6,09
7,08
8,02
9,53


In [209]:
ps_t3.filter(F.col('locstatecode').isNull()).toPandas()[['_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c12,_c13,_c14,_c15,_c16,_c17


---
### 5.3.3 Format, rename columns and save Table3

In [210]:
# Convert the first 50 rows to pandas for an overview of the table
ps_t3.limit(50).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,c18_2,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,76561197980890406,Tensai-Baka,http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-01-30 09:28:37,1,"Tobi, (Twby, Kit, Keksi, KO-NE-KO-CHAN)",103582791429663584,...,,"Tobi, (Twby, Kit, Keksi, KO-NE-KO-CHAN)",103582791429663584,2006-02-09 08:03:05,"""N","""N","""N","""N",DE,"""N"
1,76561197979195177,sixfeetunder,http://steamcommunity.com/profiles/76561197979...,0,3,1,2013-02-19 09:38:37,1,"Hruod Ruhm, Lant Land, Besitz bzw. Nand kÃ¼hn,...",103582791429726636,...,,"Hruod Ruhm, Lant Land, Besitz bzw. Nand kÃ¼hn,...",103582791429726636,2005-11-14 13:33:26,"""N","""N","""N","""N",DE,01
2,76561197983953627,rIKKaNDRSN,http://steamcommunity.com/id/rikkandrsn/,1,3,1,2013-03-01 01:33:05,1,"Rick, Rich, Dick, Richard",103582791433531922,...,,"Rick, Rich, Dick, Richard",103582791433531922,2006-07-28 22:01:56,"""N","""N","""N","""N",CA,ON
3,76561197968812523,Java,http://steamcommunity.com/profiles/76561197968...,0,3,1,2013-02-17 17:13:50,1,",,,,,,,,,,,,",103582791429638424,...,,",,,,,,,,,,,,",103582791429638424,2004-09-06 17:41:22,"""N","""N","""N","""N",ES,53
4,76561197978986309,eXtreme PerformanÂ¢e GmbH,http://steamcommunity.com/profiles/76561197978...,1,3,1,2013-02-16 19:55:30,1,"Veni, vidi, vici ich kam, ich sah, ich siegte'",103582791431349569,...,"""2013-02-28 14:31:42""","Veni, vidi, vici ich kam, ich sah, ich siegte'",103582791431349569,2005-11-03 05:35:02,80,"""N","""Counter-Strike: Condition Zero""","""N","""DE""",07
5,76561197976870049,Lsd.K.....le net c'est chouette,http://steamcommunity.com/id/le_sadik/,0,3,1,2013-02-17 11:36:15,2,",,,",103582791430855102,...,"""2013-02-28 14:30:49""",",,,",103582791430855102,2005-07-08 13:38:02,"""N","""N","""N","""N",FR,A8
6,76561197960518379,KrumPZ,http://steamcommunity.com/profiles/76561197960...,0,3,1,2011-03-08 02:34:07,1,"Reborn,Cynical,Tumbo,Mero",103582791430065590,...,,"Reborn,Cynical,Tumbo,Mero",103582791430065590,2003-09-13 11:41:47,"""N","""N","""N","""N","""N","""N"
7,76561197973640033,Doble Es,http://steamcommunity.com/id/empresario7/,0,3,1,2013-02-18 14:05:01,2,"A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,...,"""2013-02-28 14:29:31""","A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,2005-01-26 23:18:18,"""N","""N","""N","""N",FI,13
8,76561197969346811,Padde,http://steamcommunity.com/profiles/76561197969...,0,3,1,2013-02-17 14:41:21,2,".,!,,",103582791429604273,...,,".,!,,",103582791429604273,2004-10-04 23:55:13,"""N","""N","""N","""N",SE,"""N"
9,76561197981996985,|1st| Salt,http://steamcommunity.com/id/ZE_Echelon/,0,3,1,2013-03-01 14:28:26,1,"--,,,--",103582791432807346,...,"""2013-03-11 12:29:00""","--,,,--",103582791432807346,2006-04-10 14:35:32,"""N","""N","""N","""N",DE,07


In [211]:
ps_t3.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [212]:
# Drop the helper columns that are no longer needed
ps_t3 = ps_t3.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c14_1','c14_2','c15_1','c15_2','c16_1','c16_2',\
                    'c17_1','c17_2','c18_1','c18_2','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19')

In [214]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t3 = rename_col(ps_t3, newColumns)

In [215]:
ps_t3.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [216]:
# Remove the double-quote characters left over in the data
col_list = ['comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
     ps_t3 = ps_t3.withColumn(i,regexp_replace(i, '"', ""))

In [217]:
ps_t3.limit(50).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197980890406,Tensai-Baka,http://steamcommunity.com/profiles/76561197980...,0,3,1,2013-01-30 09:28:37,1,"Tobi, (Twby, Kit, Keksi, KO-NE-KO-CHAN)",103582791429663584,2006-02-09 08:03:05,N,N,N,N,DE,N
1,76561197979195177,sixfeetunder,http://steamcommunity.com/profiles/76561197979...,0,3,1,2013-02-19 09:38:37,1,"Hruod Ruhm, Lant Land, Besitz bzw. Nand kÃ¼hn,...",103582791429726636,2005-11-14 13:33:26,N,N,N,N,DE,01
2,76561197983953627,rIKKaNDRSN,http://steamcommunity.com/id/rikkandrsn/,1,3,1,2013-03-01 01:33:05,1,"Rick, Rich, Dick, Richard",103582791433531922,2006-07-28 22:01:56,N,N,N,N,CA,ON
3,76561197968812523,Java,http://steamcommunity.com/profiles/76561197968...,0,3,1,2013-02-17 17:13:50,1,",,,,,,,,,,,,",103582791429638424,2004-09-06 17:41:22,N,N,N,N,ES,53
4,76561197978986309,eXtreme PerformanÂ¢e GmbH,http://steamcommunity.com/profiles/76561197978...,1,3,1,2013-02-16 19:55:30,1,"Veni, vidi, vici ich kam, ich sah, ich siegte'",103582791431349569,2005-11-03 05:35:02,80,N,Counter-Strike: Condition Zero,N,DE,07
5,76561197976870049,Lsd.K.....le net c'est chouette,http://steamcommunity.com/id/le_sadik/,0,3,1,2013-02-17 11:36:15,2,",,,",103582791430855102,2005-07-08 13:38:02,N,N,N,N,FR,A8
6,76561197973640033,Doble Es,http://steamcommunity.com/id/empresario7/,0,3,1,2013-02-18 14:05:01,2,"A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,2005-01-26 23:18:18,N,N,N,N,FI,13
7,76561197960518379,KrumPZ,http://steamcommunity.com/profiles/76561197960...,0,3,1,2011-03-08 02:34:07,1,"Reborn,Cynical,Tumbo,Mero",103582791430065590,2003-09-13 11:41:47,N,N,N,N,N,N
8,76561197969346811,Padde,http://steamcommunity.com/profiles/76561197969...,0,3,1,2013-02-17 14:41:21,2,".,!,,",103582791429604273,2004-10-04 23:55:13,N,N,N,N,SE,N
9,76561197981996985,|1st| Salt,http://steamcommunity.com/id/ZE_Echelon/,0,3,1,2013-03-01 14:28:26,1,"--,,,--",103582791432807346,2006-04-10 14:35:32,N,N,N,N,DE,07


In [218]:
# Save TABLE 3
ps_t3.write.csv('/user/tamng/jwht/CleanData/ps_t3.csv', header = True)

---
## <font color = 'black'> 5.4 Table 4: nSplit_c11 == 2 & nSplit_c10 == 1 </font>

### 5.4.1 Filter data and check number of Split

In [19]:
ps_t4 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 2) & (player_summaries_drop.nSplit_c10 == 1))
ps_t4.count()

121673

In [221]:
nSplit_c11_2_c10_1 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 2) & (player_summaries_drop.nSplit_c10 == 1)).toPandas()
nSplit_c11_2_c10_1.head(50)

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197976084252,PedoBear ^_^,http://steamcommunity.com/profiles/76561197976...,0,1,1,2012-12-15 13:12:03,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
1,76561197976148316,PEDAL' v POL!,http://steamcommunity.com/profiles/76561197976...,0,1,1,2011-05-02 01:46:41,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
2,76561197976152449,mieserTrollchummer,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-17 11:19:56,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
3,76561197976200426,AVROR,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-04 03:15:52,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
4,76561197976202587,Xong,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-14 16:47:22,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
5,76561197976293451,Schnapsi - Schatz,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-07 10:28:30,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
6,76561197976340921,United Satans of Arbitrary Inc.,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-03-06 17:43:15,1,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
7,76561197976364502,shrapnel,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-01-14 22:33:09,1,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
8,76561197976378009,furioN~,http://steamcommunity.com/profiles/76561197976...,0,1,1,2012-12-29 05:00:17,1,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1
9,76561197976402125,Mr.slendy,http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-16 18:28:15,2,"""N,""N","""N,""N",...,1,2,2,2,2,2,1,-1,-1,-1


In [224]:
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_2_c10_1[i].unique())

pd.DataFrame(zip(check_list,unique_col))

Unnamed: 0,0,1
0,nSplit_c7,"[1, 2]"
1,nSplit_c8,"[1, 2]"
2,nSplit_c9,"[1, 2]"
3,nSplit_c10,[1]
4,nSplit_c11,[2]
5,nSplit_c12,"[2, 1]"
6,nSplit_c13,"[2, 1]"
7,nSplit_c14,"[2, 1]"
8,nSplit_c15,"[2, -1, 1]"
9,nSplit_c16,"[1, -1, 2]"
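
The same "unique values per nSplit column" check is repeated for every sub-table in this section. As a minimal sketch, the check could be wrapped in one helper. The version below uses plain Python on a list of dict rows instead of pandas; the function name `unique_values_per_column` and the sample rows are ours for illustration and are not taken from the dataset.

```python
def unique_values_per_column(rows, columns):
    """Return {column: distinct values in first-seen order} for a list of dict rows."""
    summary = {col: [] for col in columns}
    for row in rows:
        for col in columns:
            value = row[col]
            if value not in summary[col]:
                summary[col].append(value)
    return summary


# Illustrative rows only (hypothetical, not real data)
sample_rows = [
    {"nSplit_c10": 1, "nSplit_c11": 2, "nSplit_c15": 2},
    {"nSplit_c10": 1, "nSplit_c11": 2, "nSplit_c15": -1},
    {"nSplit_c10": 1, "nSplit_c11": 2, "nSplit_c15": 1},
]

summary = unique_values_per_column(
    sample_rows, ["nSplit_c10", "nSplit_c11", "nSplit_c15"]
)
for name, values in summary.items():
    print(name, values)
```

A column with a single distinct value is already aligned; a column with several values is where the rows still need to be separated.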


---
### 4 scenarios:
- _c9 contains a `date`: 121175
- _c9 contains a `primaryclanid` (starts with 103...): 391
- _c9 contains `N` ('"N,N'): 57
- _c9 contains a value (1): 50
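
The four scenarios can be expressed as one classification rule. The sketch below is plain Python and only illustrates the idea; `classify_c9` and `DATE_RE` are hypothetical names, and the notebook itself filters with `contains(...)` in PySpark. A full timestamp pattern is stricter than a substring test: the clan id `103582791432035804`, which appears earlier in this notebook, also contains the substring `20`.

```python
import re

# Full "YYYY-MM-DD HH:MM:SS" timestamp, as seen in the last_logoff values
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}$")


def classify_c9(value):
    """Assign a raw _c9 value to one of the four scenarios."""
    text = value.strip().strip('"')
    if DATE_RE.match(text):
        return "date"
    if text.isdigit() and text.startswith("103"):
        return "primaryclanid"
    # Placeholder values look like '"N,"N': every comma-separated part is N
    tokens = [t.strip().strip('"') for t in value.split(",")]
    if all(t == "N" for t in tokens):
        return "placeholder"
    return "other"


for raw in ["2013-03-06 00:37:20", "103582791432035804", '"N,"N', "1"]:
    print(repr(raw), "->", classify_c9(raw))
```

In Spark, a comparable pattern test should be expressible with `Column.rlike`, but we have not rerun the counts above with it.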

In [230]:
ps_t4.limit(50).toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','nSplit_c12','_c13','nSplit_c13','_c14','nSplit_c14','_c15','nSplit_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,nSplit_c12,_c13,nSplit_c13,_c14,nSplit_c14,_c15,nSplit_c15,_c16,_c17,_c18,_c19
0,1,1,2013-03-06 00:37:20,1,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-03-06 14:00:47,,,
1,1,1,2013-02-17 15:41:52,1,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:37:09,,,
2,1,1,2013-02-15 13:20:00,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-03-06 17:49:10,,,
3,1,1,2013-02-17 16:45:06,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:26:25,,,
4,1,1,2009-04-27 09:44:39,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:26:26,,,
5,1,1,2012-12-22 05:37:40,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-03-06 17:58:01,,,
6,1,1,2013-02-16 23:07:24,1,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:37:13,,,
7,1,1,2013-02-15 16:47:37,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:37:13,,,
8,1,1,2013-02-16 03:31:03,1,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:26:32,,,
9,1,1,2012-11-20 12:18:34,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:37:13,,,


---
### Table 4_1: _c9 contains date

### 5.4.1a Filter data and check the number of splits

In [270]:
ps_t4_1 = ps_t4.filter(F.col('_c9').contains('20'))
ps_t4_1.count()

121175

In [233]:
nSplit_c11_2_c10_1_c9_d = ps_t4_1.toPandas()

In [234]:
# Check the number of unique values in each nSplit column
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_2_c10_1_c9_d[i].nunique())

print(pd.DataFrame(zip(check_list,unique_col)))

             0  1
0    nSplit_c7  1
1    nSplit_c8  1
2    nSplit_c9  1
3   nSplit_c10  1
4   nSplit_c11  1
5   nSplit_c12  2
6   nSplit_c13  2
7   nSplit_c14  2
8   nSplit_c15  2
9   nSplit_c16  2
10  nSplit_c17  3
11  nSplit_c18  3
12  nSplit_c19  2


In [235]:
# List the unique values in each nSplit column
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_2_c10_1_c9_d[i].unique())

print(pd.DataFrame(zip(check_list,unique_col)))

             0           1
0    nSplit_c7         [1]
1    nSplit_c8         [1]
2    nSplit_c9         [1]
3   nSplit_c10         [1]
4   nSplit_c11         [2]
5   nSplit_c12      [2, 1]
6   nSplit_c13      [2, 1]
7   nSplit_c14      [2, 1]
8   nSplit_c15      [2, 1]
9   nSplit_c16      [1, 2]
10  nSplit_c17  [-1, 2, 1]
11  nSplit_c18  [-1, 1, 2]
12  nSplit_c19     [-1, 1]


In [241]:
ps_t4_1.limit(100).toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','nSplit_c12','_c13','nSplit_c13','_c14','nSplit_c14','_c15','nSplit_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,nSplit_c12,_c13,nSplit_c13,_c14,nSplit_c14,_c15,nSplit_c15,_c16,_c17,_c18,_c19
0,1,1,2013-03-03 10:20:23,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-03-06 16:33:47,,,
1,1,1,2011-11-01 13:44:07,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-03-06 12:37:03,,,
2,1,1,2010-03-20 19:14:17,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:34:17,,,
3,1,1,2011-04-10 05:35:22,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:19:33,,,
4,1,1,2013-02-10 04:22:25,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:34:17,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,1,1,2013-02-17 15:08:14,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:27:07,,,
96,1,1,2012-06-19 14:27:55,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:27:08,,,
97,1,1,2013-02-17 15:38:24,1,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:37:32,,,
98,1,1,2011-04-10 03:23:18,2,"""N,""N","""N,""N",2,"""N,""N",2,"""N,""N",2,"""N,""N",2,2013-02-28 14:27:09,,,


In [258]:
nSplit_c11_2_c10_1_c9_d.nSplit_c11.unique()

array([2])

In [263]:
ps_t4_1.filter(F.col('_c11').contains('"N,')).toPandas()[['_c11','_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c11,_c12,_c13,_c14,_c15,_c16,_c17
0,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-03-11 12:30:24,
1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-03-02 01:21:43,
2,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-03-11 12:30:25,
3,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-03-02 01:22:05,
4,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-03-11 12:30:25,
...,...,...,...,...,...,...,...
120865,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-02-28 14:37:38,
120866,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-02-28 14:27:25,
120867,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-02-28 14:27:25,
120868,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N",2013-02-28 14:27:26,


In [271]:
ps_t4_1_1 = ps_t4_1.filter(F.col('_c11').contains('"N,'))

In [272]:
nSplit_c11_n = ps_t4_1.filter(F.col('_c11').contains('"N,')).toPandas()

In [None]:
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_n[i].unique())

pd.DataFrame(zip(check_list,unique_col))

In [274]:
# Define the new column names for each column that needs to be split

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']


# Apply split function
ps_t4_1_1 = split_2_column(ps_t4_1_1, col_c11, newcols_c11)
ps_t4_1_1 = split_2_column(ps_t4_1_1, col_c12, newcols_c12)
ps_t4_1_1 = split_2_column(ps_t4_1_1, col_c13, newcols_c13)
ps_t4_1_1 = split_2_column(ps_t4_1_1, col_c14, newcols_c14)
ps_t4_1_1 = split_2_column(ps_t4_1_1, col_c15, newcols_c15)
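
`split_2_column` is defined earlier in the notebook and is not repeated here. As a rough illustration of what a two-way split does to a single value, here is a plain-Python sketch. We assume the split happens at the first comma; `split_in_two` is a hypothetical stand-in, not the notebook's function.

```python
def split_in_two(value, sep=","):
    """Split on the first separator; the second part is None if no separator is found."""
    first, found, second = value.partition(sep)
    return first, (second if found else None)


print(split_in_two('"N,"N'))                # two merged placeholder fields
print(split_in_two("2013-03-06 14:00:47"))  # a single field, nothing to split
```

Note that the leading `"` stays attached to each part, which is why the quote characters are removed in a later step.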

---
### 5.4.2a Adding columns and condition
From this step on, we add the final columns based on the nSplit column logic.

_Add columns_

In [276]:
ps_t4_1_1 = ps_t4_1_1.withColumn('realname',F.col('c11_1'))
ps_t4_1_1 = ps_t4_1_1.withColumn('primaryclanid', F.col('c11_2'))
ps_t4_1_1 = ps_t4_1_1.withColumn('timecreated', F.col('c12_1'))
ps_t4_1_1 = ps_t4_1_1.withColumn('gameid', F.col('c12_2'))
ps_t4_1_1 = ps_t4_1_1.withColumn('gameserverip', F.col('c13_1'))
ps_t4_1_1 = ps_t4_1_1.withColumn('gameextrainfo', F.col('c13_2'))
ps_t4_1_1 = ps_t4_1_1.withColumn('cityid', F.col('c14_1'))
ps_t4_1_1 = ps_t4_1_1.withColumn('loccountrycode', F.col('c14_2'))
ps_t4_1_1 = ps_t4_1_1.withColumn('locstatecode', F.col('c15_1'))

In [277]:
check_missing(ps_t4_1_1.select('realname','primaryclanid','timecreated','gameid',\
                               'gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode'))

+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|realname|primaryclanid|timecreated|gameid|gameserverip|gameextrainfo|cityid|loccountrycode|locstatecode|
+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|       0|            0|          0|     0|           0|            0|     0|             0|           0|
+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
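
`check_missing` reports zero for every column, yet the preview of this table shows the placeholder `N` in each of these fields. Assuming `check_missing` counts nulls only (it is defined earlier in the notebook), a separate count of placeholder values gives a more honest picture of how much information a column carries. The helper name and sample rows below are hypothetical.

```python
def count_missing_and_placeholders(rows, columns, placeholder="N"):
    """Count nulls/empties and placeholder values separately for each column."""
    counts = {col: {"null": 0, "placeholder": 0} for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value == "":
                counts[col]["null"] += 1
            elif value.strip('"') == placeholder:
                counts[col]["placeholder"] += 1
    return counts


# Illustrative rows only (hypothetical, not real data)
rows = [
    {"realname": '"N', "gameid": '"N'},
    {"realname": "Alfred", "gameid": None},
]
counts = count_missing_and_placeholders(rows, ["realname", "gameid"])
print(counts)
```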



---
### 5.4.3a Format, rename columns and save Table

In [278]:
ps_t4_1_1.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [279]:
# Drop unnecessary columns
ps_t4_1_1 = ps_t4_1_1.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c12_1','c12_2','c13_1','c13_2',\
                    'c14_1','c14_2','c15_1','c15_2','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19')

In [283]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t4_1_1 = rename_col(ps_t4_1_1, newColumns)
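
`rename_col` is defined earlier in the notebook. Assuming it renames by position, the order of the remaining columns matters: the eight original columns come first, followed by the nine added ones. A small sketch of a positional mapping with a length check (the helper `build_rename_map` is hypothetical):

```python
def build_rename_map(old_names, new_names):
    """Pair old and new column names by position, refusing mismatched lengths."""
    if len(old_names) != len(new_names):
        raise ValueError(
            "expected %d names, got %d" % (len(old_names), len(new_names))
        )
    return dict(zip(old_names, new_names))


old_names = ['_c0', '_c1', '_c2', '_c6', '_c7', '_c8', '_c9', '_c10',
             'realname', 'primaryclanid', 'timecreated', 'gameid',
             'gameserverip', 'gameextrainfo', 'cityid', 'loccountrycode',
             'locstatecode']
new_names = ['steam_id', 'person_name', 'profile_url', 'person_state',
             'community_visibility_state', 'profile_state', 'last_logoff',
             'comment_permission', 'real_name', 'primary_clanid',
             'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo',
             'city_id', 'country_code', 'state_code']

mapping = build_rename_map(old_names, new_names)
print(mapping['_c9'], mapping['locstatecode'])
```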

In [284]:
ps_t4_1_1.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [285]:
# Replace " symbols in data
col_list = ['comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t4_1_1 = ps_t4_1_1.withColumn(i, regexp_replace(i, '"', ""))

In [286]:
ps_t4_1_1.limit(20).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197963341583,piteR,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-02-16 17:36:08,1,N,N,N,N,N,N,N,N,N
1,76561197963392875,thePyun,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-02-12 22:00:59,2,N,N,N,N,N,N,N,N,N
2,76561197963403209,Rocket,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-02-15 19:46:20,2,N,N,N,N,N,N,N,N,N
3,76561197963404757,King_McGEEzzy,http://steamcommunity.com/id/King_McGEEzzy/,0,1,1,2013-02-16 18:04:23,2,N,N,N,N,N,N,N,N,N
4,76561197963413078,iniesta,http://steamcommunity.com/profiles/76561197963...,0,1,1,2011-02-06 13:37:39,2,N,N,N,N,N,N,N,N,N
5,76561197963459764,WayneLyon,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-01-12 08:20:35,2,N,N,N,N,N,N,N,N,N
6,76561197963552074,afan,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-02-04 20:06:14,2,N,N,N,N,N,N,N,N,N
7,76561197963702074,Acceso,http://steamcommunity.com/profiles/76561197963...,0,1,1,2013-02-16 15:18:13,2,N,N,N,N,N,N,N,N,N
8,76561197984369705,cupcake:),http://steamcommunity.com/id/ladynotorious/,0,1,1,2013-01-13 21:59:52,1,N,N,N,N,N,N,N,N,N
9,76561197984414574,Poop™☂,http://steamcommunity.com/profiles/76561197984...,0,1,1,2013-03-01 16:20:22,2,N,N,N,N,N,N,N,N,N


In [287]:
# Save TABLE 4_1_1
ps_t4_1_1.write.csv('/user/tamng/jwht/CleanData/ps_t4_1_1.csv', header = True)

___
### Table 4_1_2: ~F.col('_c11').contains('"N,')

### 5.4.1b Filter data and check the number of splits

In [322]:
ps_t4_1_2 = ps_t4_1.filter(~F.col('_c11').contains('"N,'))
ps_t4_1_2.count()

305
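
In both sub-tables `nSplit_c11` equals 2, but for different reasons: either two empty fields were merged into `"N,"N`, or `_c11` holds a real name that itself contains a comma. The routing rule can be sketched in plain Python; `route_c11` is a hypothetical name, and the test mirrors the notebook's `contains('"N,')` filter.

```python
PLACEHOLDER_MARK = '"N,'


def route_c11(value):
    """Return the sub-table a row belongs to, based on its raw _c11 value."""
    return "ps_t4_1_1" if PLACEHOLDER_MARK in value else "ps_t4_1_2"


for value in ['"N,"N', "Josh, 19", "J. James Huffington, Esq."]:
    print(repr(value), "->", route_c11(value))
```

For the second group, `_c11` is kept whole as the real name instead of being split at the comma.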

In [289]:
nSplit_c11_v = ps_t4_1_2.toPandas()

In [323]:
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_v[i].unique())

pd.DataFrame(zip(check_list,unique_col))

Unnamed: 0,0,1
0,nSplit_c7,[1]
1,nSplit_c8,[1]
2,nSplit_c9,[1]
3,nSplit_c10,[1]
4,nSplit_c11,[2]
5,nSplit_c12,[1]
6,nSplit_c13,[1]
7,nSplit_c14,"[2, 1]"
8,nSplit_c15,"[2, 1]"
9,nSplit_c16,"[1, 2]"


In [None]:
# Define the new column names for each column that needs to be split

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2']

col_c17 = ['_c17']
newcols_c17 = ['c17_1', 'c17_2']

col_c18 = ['_c18']
newcols_c18 = ['c18_1', 'c18_2']


# Apply split function
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c11, newcols_c11)
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c14, newcols_c14)
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c15, newcols_c15)
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c16, newcols_c16)
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c17, newcols_c17)
ps_t4_1_2 = split_2_column(ps_t4_1_2, col_c18, newcols_c18)

In [292]:
ps_t4_1_2.toPandas()[['_c11','_c12','_c13','_c14','_c15','_c16','_c17']]

Unnamed: 0,_c11,_c12,_c13,_c14,_c15,_c16,_c17
0,"Don't get ripped, by the riptide",103582791431889810,2005-08-02 09:32:59,"""N,""N","""N,""N",GB,"""N,""N"
1,"Santus Dee Lupinus, Esq.",103582791432212622,2005-12-03 13:52:37,440,208.78.165.61:27019,Team Fortress 2,"""N,""US"""
2,"nick graber, tenyson graber",103582791429562546,2006-08-24 23:11:09,"""N,""N","""N,""N",US,OH
3,"J. James Huffington, Esq.",103582791429521408,2003-11-10 09:00:21,"""N,""N","""N,""N",SH,01
4,"Sorry Steam, but that shit's private.",103582791429776287,2005-09-17 15:55:42,"""N,""N","""N,""N","""N,""N","""N,""2013-03-01 05:34:16"""
...,...,...,...,...,...,...,...
300,"Josh, 19",103582791431053192,2006-07-14 08:19:46,"""N,""N","""N,""N",AU,QLD
301,"Brandon, My last name is of no consequence to ...",103582791433430435,2003-11-24 13:49:30,"""N,""N","""N,""N",US,CA
302,"Jakub, Piła",103582791430678167,2004-01-09 12:12:39,"""N,""N","""N,""N",PL,86
303,"^(o,O)^",103582791430038775,2006-08-19 02:35:55,"""N,""N","""N,""N",SV,10


---
### 5.4.2b Adding columns and condition
From this step on, we add the final columns based on the nSplit column logic.

_Add columns:_
 > - realname
 > - primaryclanid
 > - timecreated

In [326]:
ps_t4_1_2 = ps_t4_1_2.withColumn('realname',F.col('_c11'))
ps_t4_1_2 = ps_t4_1_2.withColumn('primaryclanid', F.col('_c12'))
ps_t4_1_2 = ps_t4_1_2.withColumn('timecreated', F.col('_c13'))

In [327]:
check_missing(ps_t4_1_2.select('realname','primaryclanid','timecreated'))

+--------+-------------+-----------+
|realname|primaryclanid|timecreated|
+--------+-------------+-----------+
|       0|            0|          0|
+--------+-------------+-----------+



---
_Add column_
 > - gameid (_c14)

In [328]:
nSplit_c11_v.nSplit_c14.unique()

array([2, 1])

In [329]:
ps_t4_1_2 = ps_t4_1_2.withColumn('gameid', F.col('c14_1'))

In [330]:
ps_t4_1_2.select('gameid').distinct().count()

14

In [331]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t4_1_2.select('gameid').distinct().toPandas()[0:50]

Unnamed: 0,gameid
0,33220
1,8500
2,570
3,440
4,16777286
5,33930
6,48700
7,4000
8,208480
9,33554432


In [332]:
check_missing(ps_t4_1_2.select('gameid'))

+------+
|gameid|
+------+
|     0|
+------+



---
_Add column_
 > - gameserverip (_c15)

In [333]:
nSplit_c11_v.nSplit_c15.unique()

array([2, 1])

In [337]:
ps_t4_1_2 = ps_t4_1_2.withColumn('gameserverip',F.when((F.col('nSplit_c14')==1), \
                                                F.col('c15_1')).when((F.col('nSplit_c14')==2), \
                                                F.col('c14_2')))

In [338]:
ps_t4_1_2.select('gameserverip').distinct().count()

6

In [339]:
# Use this code to check if the 'gameserverip' was parsed correctly.
ps_t4_1_2.select('gameserverip').distinct().toPandas()[0:50]

Unnamed: 0,gameserverip
0,208.78.165.61:27019
1,68.232.176.48:27015
2,50.88.38.3:27015
3,146.66.156.67:27051
4,83.170.71.174:27015
5,"""N"


In [345]:
check_missing(ps_t4_1_2.select('gameserverip'))

+------------+
|gameserverip|
+------------+
|           0|
+------------+



---
_Add column_
 > - gameextrainfo (_c16)

In [341]:
nSplit_c11_v.nSplit_c16.unique()

array([1, 2])

In [343]:
ps_t4_1_2 = ps_t4_1_2.withColumn('gameextrainfo',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('nSplit_c14')==2), \
                                                F.col('c15_1')))

In [344]:
ps_t4_1_2.select('gameextrainfo').distinct().count()

14

In [346]:
# Use this code to check if the 'gameextrainfo' was parsed correctly.
ps_t4_1_2.select('gameextrainfo').distinct().toPandas()[0:50]

Unnamed: 0,gameextrainfo
0,Dota 2
1,"""Eve Online: Inferno"""
2,Natural Selection
3,"""Arma 2: Operation Arrowhead"""
4,"""Mount & Blade: Warband"""
5,"""Tomb Raider III: Adventures of Lara Croft"""
6,"""Tom Clancy's Splinter Cell: Conviction"""
7,"""PlanetSide 2"""
8,"""League of Legends"""
9,Counter-Strike: Source


In [347]:
check_missing(ps_t4_1_2.select('gameextrainfo'))

+-------------+
|gameextrainfo|
+-------------+
|            0|
+-------------+



---
_Add column_
 > - cityid (_c17)

In [349]:
nSplit_c11_v.nSplit_c17.unique()

array([2, 1])

In [350]:
ps_t4_1_2 = ps_t4_1_2.withColumn('cityid',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c15_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), \
                                                F.col('c16_1')))

In [351]:
ps_t4_1_2.select('cityid').distinct().count()

1

In [352]:
# Use this code to check if the 'cityid' was parsed correctly.
ps_t4_1_2.select('cityid').distinct().toPandas()[0:50]

Unnamed: 0,cityid
0,"""N"


In [353]:
check_missing(ps_t4_1_2.select('cityid'))

+------+
|cityid|
+------+
|     0|
+------+



---
_Add column_
 > - loccountrycode (_c18)

In [355]:
nSplit_c11_v.nSplit_c18.unique()

array([ 1, -1,  2])

In [356]:
ps_t4_1_2 = ps_t4_1_2.withColumn('loccountrycode',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), \
                                                F.col('c16_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')))

In [357]:
ps_t4_1_2.select('loccountrycode').distinct().count()

52

In [361]:
# Use this code to check if the 'loccountrycode' was parsed correctly.
ps_t4_1_2.select('loccountrycode').distinct().toPandas()[0:20]

Unnamed: 0,loccountrycode
0,AZ
1,FI
2,RO
3,NL
4,PL
5,AM
6,MX
7,"""YE"""
8,UM
9,AT


In [359]:
check_missing(ps_t4_1_2.select('loccountrycode'))

+--------------+
|loccountrycode|
+--------------+
|             0|
+--------------+



---
_Add column_
 > - locstatecode (_c19)

In [360]:
ps_t4_1_2 = ps_t4_1_2.withColumn('locstatecode',F.when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==1), \
                                                F.col('_c19')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==2), \
                                                F.col('c18_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')).when((F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), \
                                                F.col('c16_2')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), \
                                                F.col('c17_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), \
                                                F.col('c18_1')).when((F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), \
                                                F.col('c17_2')))

In [362]:
ps_t4_1_2.select('locstatecode').distinct().count()

76

In [364]:
# Use this code to check if the 'locstatecode' was parsed correctly.
ps_t4_1_2.select('locstatecode').distinct().toPandas()[50:80]

Unnamed: 0,locstatecode
50,06
51,48
52,NY
53,ON
54,QLD
55,AB
56,67
57,79
58,TX
59,10


In [365]:
check_missing(ps_t4_1_2.select('locstatecode'))

+------------+
|locstatecode|
+------------+
|           0|
+------------+
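
The chained `when` conditions in this section all follow one rule: if the values of `_c14` onward are split at their commas and laid out in order, the six fields sit at positions 0 to 5. The plain-Python sketch below shows that rule on two rows modelled on the data of this table (the `MD` value for `_c18` is inferred from the cleaned result, since `_c18` is not displayed above). `assign_by_position` is a hypothetical helper, and it assumes that no field, such as a game title, contains a comma of its own.

```python
FIELD_NAMES = ["gameid", "gameserverip", "gameextrainfo",
               "cityid", "loccountrycode", "locstatecode"]


def assign_by_position(raw_values):
    """Flatten comma-merged values from _c14 onward and name the first six parts."""
    parts = []
    for value in raw_values:
        if value is None:
            continue
        parts.extend(value.split(","))
    # Drop surrounding whitespace and quote characters from each part
    parts = [p.strip().strip('"') for p in parts]
    return dict(zip(FIELD_NAMES, parts))


playing = ['440', '208.78.165.61:27019', 'Team Fortress 2', '"N,"US"', 'MD']
idle = ['"N,"N', '"N,"N', 'GB', '"N,"N']

print(assign_by_position(playing))
print(assign_by_position(idle))
```

Any parts beyond the sixth (for example a trailing timestamp) are ignored by `zip`, which matches the fact that the later columns are dropped in the next step.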



---
### 5.4.3b Format, rename columns and save Table

In [366]:
ps_t4_1_2.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [367]:
# Drop unnecessary columns
ps_t4_1_2 = ps_t4_1_2.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c18_1','c18_2','c16_1','c16_2',\
                    'c14_1','c14_2','c15_1','c15_2','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19','c17_1','c17_2')

In [369]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t4_1_2 = rename_col(ps_t4_1_2, newColumns)

In [370]:
ps_t4_1_2.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [371]:
# Replace " symbols in data
col_list = ['comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t4_1_2 = ps_t4_1_2.withColumn(i, regexp_replace(i, '"', ""))
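
The replacement above removes every `"` in the listed columns. That is fine for codes and ids, but a free-text field such as `real_name` could contain quotes that belong to the text. As a sketch of the difference, here are the two behaviours in plain Python; the helper names and the quoted nickname are hypothetical.

```python
def strip_all_quotes(value):
    """Remove every double quote, as the regexp_replace step does."""
    return value.replace('"', "")


def strip_edge_quotes(value):
    """Remove only leading and trailing double quotes."""
    return value.strip('"')


title = '"Eve Online: Inferno"'   # value seen in gameextrainfo
nickname = 'John "xli0z" Calma'   # hypothetical text with inner quotes

print(strip_all_quotes(title), "|", strip_edge_quotes(title))
print(strip_all_quotes(nickname), "|", strip_edge_quotes(nickname))
```

Both variants give the same result for values like `"N` or `"US"`; they differ only when quotes appear inside the text.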

In [372]:
ps_t4_1_2.limit(20).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197977303947,[TANG] Merple,http://steamcommunity.com/id/merple/,0,3,1,2013-02-18 19:56:53,1,"Don't get ripped, by the riptide",103582791431889810,2005-08-02 09:32:59,N,N,N,N,GB,N
1,76561197979540016,Santus Dee Lupinus,http://steamcommunity.com/profiles/76561197979...,1,3,1,2013-02-28 21:46:02,1,"Santus Dee Lupinus, Esq.",103582791432212622,2005-12-03 13:52:37,440,208.78.165.61:27019,Team Fortress 2,N,US,MD
2,76561197984489349,b0o-b0o,http://steamcommunity.com/profiles/76561197984...,0,3,1,2013-03-02 00:24:30,1,"nick graber, tenyson graber",103582791429562546,2006-08-24 23:11:09,N,N,N,N,US,OH
3,76561197962672069,JeeZeus,http://steamcommunity.com/id/jjhuffington/,0,3,1,2013-02-02 18:21:13,1,"J. James Huffington, Esq.",103582791429521408,2003-11-10 09:00:21,N,N,N,N,SH,01
4,76561197978163284,Brother Biscuit,http://steamcommunity.com/id/brotherbiscuit/,0,3,1,2013-02-28 23:53:35,2,"Sorry Steam, but that shit's private.",103582791429776287,2005-09-17 15:55:42,N,N,N,N,N,N
5,76561197969060456,John 'xli0z' Calma,http://steamcommunity.com/id/xlyfag/,0,3,1,2010-09-11 10:28:46,1,"thats me, i look like a urangutan huh?? LOL ye...",103582791429534759,2004-09-22 22:17:55,N,N,N,N,US,CA
6,76561197984554357,FLoppix,http://steamcommunity.com/profiles/76561197984...,0,3,1,2013-03-01 09:20:30,1,"No information given, yo.",103582791432689134,2006-08-28 04:15:33,N,N,N,N,DE,N
7,76561197964830530,fps_salem,http://steamcommunity.com/id/fpssalem/,0,3,1,2013-02-10 08:12:04,1,"Taked baby. Meet at later bar, night or day so...",103582791430628857,2004-03-05 08:53:28,N,N,N,N,N,N
8,76561197963354819,GOVNOED,http://steamcommunity.com/id/evptvtvtv/,0,3,1,2012-04-01 01:10:33,1,"If you spray , you are GOD!",103582791432405845,2003-12-13 12:14:01,N,N,N,N,RU,66
9,76561197970683704,suechtlers.toxi,http://steamcommunity.com/id/sue_toxiii/,0,3,1,2013-02-14 09:41:06,1,"Luca A,",103582791432581039,2004-11-18 10:12:56,N,N,N,N,DE,N


In [373]:
ps_t4_1_2.count()

305

In [374]:
# Save TABLE 4_1_2
ps_t4_1_2.write.csv('/user/tamng/jwht/CleanData/ps_t4_1_2.csv', header = True)

---
### Table 4_2: F.col('_c9').contains('103')

### 5.4.1c Filter data and check number of Split

In [18]:
ps_t4_2 = player_summaries.filter(F.col('_c9').contains('103'))
ps_t4_2.count()

391

In [19]:
ps_t4_2.limit(50).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197960294170,"""^6i|-|\//\//\//"",""http://steamcommunity.com/p...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2012-09-09 10:28:45""","""N,""N",103582791431922407,2003-09-12 03:40:04,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:05""",,,,,
1,76561197960306346,"""Lassssssi"",""http://steamcommunity.com/profile...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2012-12-16 12:07:24""","""N,""N",103582791429521408,2003-09-12 04:45:04,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:06""",,,,,
2,76561197960310409,"""\missing-name"",""http://steamcommunity.com/pro...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2011-07-13 01:17:27""","""N,""N",103582791429521408,2003-09-12 05:08:40,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:03""",,,,,
3,76561197960318323,"""kore"",""http://steamcommunity.com/profiles/765...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-07-02 09:20:48""","""N,""N",103582791429521408,2003-09-12 05:48:20,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:03""",,,,,
4,76561197960337332,"""lewz"",""http://steamcommunity.com/profiles/765...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-01-27 10:08:59""","""N,""N",103582791429521408,2003-09-12 07:08:48,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:06""",,,,,
5,76561197960458711,"""Felipe :"",""http://steamcommunity.com/profiles...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-07-24 14:34:39""","""N,""N",103582791429521408,2003-09-12 21:04:14,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:09""",,,,,
6,76561197960582631,"""\missing-name"",""http://steamcommunity.com/pro...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-08-28 14:00:14""","""N,""N",103582791429521408,2003-09-14 03:46:41,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:14""",,,,,
7,76561197960594684,""""",""http://steamcommunity.com/profiles/7656119...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-10-28 04:14:02""","""N,""N",103582791429521408,2003-09-14 06:09:29,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:10""",,,,,
8,76561197960720659,""""",""http://steamcommunity.com/profiles/7656119...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-11-01 09:23:43""","""N,""N",103582791429521408,2003-09-15 07:34:35,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:19""",,,,,
9,76561197960746171,"""reed"",""http://steamcommunity.com/profiles/765...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-10-29 15:07:14""","""N,""N",103582791429521408,2003-09-15 11:05:39,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:20""",,,,,


__Add the number of splits for each column in the data__

In [20]:
cols_check = ['_c1','_c7','_c8','_c9', '_c10', '_c11', '_c12', '_c13', '_c14','_c15','_c16','_c17','_c18','_c19']

tagname = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11', 'nSplit_c12', 'nSplit_c13', 'nSplit_c14',
           'nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

ps_t4_2 = add_count_split_column(ps_t4_2, cols_check, tagname)
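
To make the split counts concrete, here is a minimal plain-Python sketch of the counting idea. It is not the `add_count_split_column` helper called above. The function name `count_split` is hypothetical, and the sketch assumes the count is the number of comma-separated pieces in a cell, with -1 standing for a missing value (consistent with the -1 entries in the nSplit outputs in this section).

```python
from typing import Optional


def count_split(value: Optional[str], sep: str = ",") -> int:
    """Return how many pieces `value` has when split on `sep`; -1 if missing."""
    if value is None:
        return -1
    return len(value.split(sep))


# Values taken from the sample rows in this section
print(count_split('"N,"N'))               # two fields fused into one cell
print(count_split("103582791429521408"))  # a clean single field
print(count_split(None))                  # an empty trailing column
```

A count of 1 means the cell is already a single field, 2 means two fields were fused, and anything above 2 suggests the text itself contains commas.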

__Check the unique split counts in each nSplit column__

In [21]:
nSplit_c1_c9 = ps_t4_2.toPandas()

check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c1_c9[i].unique())

pd.DataFrame(zip(check_list,unique_col))

Unnamed: 0,0,1
0,nSplit_c1,"[2, 3, 1, 6]"
1,nSplit_c7,"[2, 1]"
2,nSplit_c8,[2]
3,nSplit_c9,[1]
4,nSplit_c10,[1]
5,nSplit_c11,[2]
6,nSplit_c12,[2]
7,nSplit_c13,"[2, 1]"
8,nSplit_c14,"[2, 1]"
9,nSplit_c15,"[-1, 1, 2]"


In [22]:
ps_t4_2.filter(F.col('nSplit_c1')>2).toPandas()[['_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9']]

Unnamed: 0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9
0,"""vKd | Chronicv bbvbbjihhb7u ,dskX ks 3"",""h...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-11-02 03:05:35""","""N,""N",103582791429521408
1,""",paiN-digiTaLl.CÃ³BrÃ¥/!"",""http://steamcommun...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-06-14 08:49:08""","""N,""N",103582791429521408
2,"""/^,,|*Freakynator*|,,^"",""http://steamcommunit...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2012-11-19 13:33:09""","""N,""N",103582791429521408
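
These three rows have more than two pieces because the display name itself contains commas. One possible way to separate the name from the URL in such rows is to split on the last `","` boundary instead of on every comma. This is a plain-Python sketch, not the notebook's `split_2_column` helper. The function name `split_name_url` and the example URL are hypothetical, and the sketch assumes the raw cell has the form `"name","url"`.

```python
from typing import Tuple


def split_name_url(cell: str) -> Tuple[str, str]:
    """Split a fused '"name","url"' cell on its last '","' boundary."""
    name, sep, url = cell.rpartition('","')
    if not sep:
        # No boundary found: treat the whole cell as the name
        return cell.strip('"'), ""
    # Drop the opening quote of the name and the closing quote of the URL
    return name.lstrip('"'), url.rstrip('"')


name, url = split_name_url('"/^,,|*Freakynator*|,,^","http://example.invalid/profile"')
print(name)
print(url)
```

Splitting from the right keeps commas inside the name intact, at the cost of assuming that the URL never contains the `","` sequence.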


In [23]:
ps_t4_2.filter(F.col('nSplit_c7')==1).toPandas()[['_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9']]

Unnamed: 0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9
0,"""<><{/""""\}jumper{/""""\}><>""http://steamcommunit...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-04-07 01:41:53,"""N,""N",103582791430346408


In [28]:
# Define the columns to split and the names of the resulting columns

col_c1 = ['_c1']
newcols_c1 = ['c1_1', 'c1_2']

col_c7 = ['_c7']
newcols_c7 = ['c7_1', 'c7_2']

col_c8 = ['_c8']
newcols_c8 = ['c8_1', 'c8_2']

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']


# Apply split function
ps_t4_2 = split_2_column(ps_t4_2, col_c1, newcols_c1)
ps_t4_2 = split_2_column(ps_t4_2, col_c7, newcols_c7)
ps_t4_2 = split_2_column(ps_t4_2, col_c8, newcols_c8)
ps_t4_2 = split_2_column(ps_t4_2, col_c11, newcols_c11)
ps_t4_2 = split_2_column(ps_t4_2, col_c12, newcols_c12)
ps_t4_2 = split_2_column(ps_t4_2, col_c13, newcols_c13)
ps_t4_2 = split_2_column(ps_t4_2, col_c14, newcols_c14)
ps_t4_2 = split_2_column(ps_t4_2, col_c15, newcols_c15)

---
### 5.4.2c Adding columns and condition
From this step on, we add the named columns using the nSplit column logic.

_Add columns_

In [29]:
ps_t4_2 = ps_t4_2.withColumn('_c1',F.col('c1_1'))
ps_t4_2 = ps_t4_2.withColumn('personastate', F.when((F.col('nSplit_c7')==2),F.col('_c5')).otherwise(F.col('_c4')))
ps_t4_2 = ps_t4_2.withColumn('communityvisibilitystate', F.when((F.col('nSplit_c7')==2),F.col('_c6')).otherwise(F.col('_c5')))
ps_t4_2 = ps_t4_2.withColumn('profilestate', F.when((F.col('nSplit_c7')==2),F.col('c7_1')).otherwise(F.col('_c6')))
ps_t4_2 = ps_t4_2.withColumn('lastlogoff', F.when((F.col('nSplit_c7')==2),F.col('c7_2')).otherwise(F.col('_c7')))
ps_t4_2 = ps_t4_2.withColumn('commentpermission', F.col('c8_1'))
ps_t4_2 = ps_t4_2.withColumn('realname', F.col('c8_2'))
ps_t4_2 = ps_t4_2.withColumn('primaryclanid', F.col('_c9'))
ps_t4_2 = ps_t4_2.withColumn('timecreated', F.col('_c10'))
ps_t4_2 = ps_t4_2.withColumn('gameid', F.col('c11_1'))
ps_t4_2 = ps_t4_2.withColumn('gameserverip', F.col('c11_2'))
ps_t4_2 = ps_t4_2.withColumn('gameextrainfo', F.col('c12_1'))
ps_t4_2 = ps_t4_2.withColumn('cityid', F.col('c12_2'))
ps_t4_2 = ps_t4_2.withColumn('loccountrycode', F.col('c13_1'))
ps_t4_2 = ps_t4_2.withColumn('locstatecode', F.when((F.col('nSplit_c13')==1),F.col('c14_1')).otherwise(F.col('c13_2')))
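
The conditional assignments above encode one rule: when `_c7` holds two fused values (`nSplit_c7 == 2`), the state fields sit in `_c5`, `_c6` and `_c7`; otherwise each field sits one column to the left. The plain-Python sketch below applies the same rule to a single row. The function name `align_state_fields` is hypothetical, and a dict stands in for a Spark row.

```python
def align_state_fields(row: dict) -> dict:
    """Pick the four state fields depending on whether _c7 is a fused pair."""
    pieces = row["_c7"].split(",")
    if len(pieces) == 2:
        # Usual layout: _c7 carries profilestate and lastlogoff together
        profilestate, lastlogoff = pieces
        return {
            "personastate": row["_c5"],
            "communityvisibilitystate": row["_c6"],
            "profilestate": profilestate,
            "lastlogoff": lastlogoff,
        }
    # Shifted layout: each field sits one column to the left
    return {
        "personastate": row["_c4"],
        "communityvisibilitystate": row["_c5"],
        "profilestate": row["_c6"],
        "lastlogoff": row["_c7"],
    }


# Field values copied from the two layouts shown in this section
usual = {"_c4": "http://...", "_c5": "0", "_c6": "3", "_c7": '"N,"2012-09-09 10:28:45"'}
shifted = {"_c4": "0", "_c5": "3", "_c6": "1", "_c7": "2009-04-07 01:41:53"}
print(align_state_fields(usual))
print(align_state_fields(shifted))
```

Both rows end up with the same four named fields, which is what lets the later steps treat the table uniformly.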

In [30]:
check_missing(ps_t4_2.select('_c1','personastate', 'communityvisibilitystate', 'profilestate', 'lastlogoff','commentpermission',\
                             'realname','primaryclanid','timecreated','gameid','gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode'))

+---+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|_c1|personastate|communityvisibilitystate|profilestate|lastlogoff|commentpermission|realname|primaryclanid|timecreated|gameid|gameserverip|gameextrainfo|cityid|loccountrycode|locstatecode|
+---+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|  0|           0|                       0|           0|         0|                0|       0|            0|          0|     0|           0|            0|     0|             0|           0|
+---+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+



In [31]:
ps_t4_2.toPandas()[['_c1','personastate', 'communityvisibilitystate', 'profilestate', 'lastlogoff','commentpermission',
                             'realname','primaryclanid','timecreated','gameid','gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode']]

Unnamed: 0,_c1,personastate,communityvisibilitystate,profilestate,lastlogoff,commentpermission,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,"""^6i|-|\//\//\//""",0,3,"""N","""2012-09-09 10:28:45""","""N","""N",103582791431922407,2003-09-12 03:40:04,"""N","""N","""N","""N","""N","""N"
1,"""Lassssssi""",0,3,"""N","""2012-12-16 12:07:24""","""N","""N",103582791429521408,2003-09-12 04:45:04,"""N","""N","""N","""N","""N","""N"
2,"""\missing-name""",0,3,"""N","""2011-07-13 01:17:27""","""N","""N",103582791429521408,2003-09-12 05:08:40,"""N","""N","""N","""N","""N","""N"
3,"""kore""",0,3,"""N","""2008-07-02 09:20:48""","""N","""N",103582791429521408,2003-09-12 05:48:20,"""N","""N","""N","""N","""N","""N"
4,"""lewz""",0,3,"""N","""2008-01-27 10:08:59""","""N","""N",103582791429521408,2003-09-12 07:08:48,"""N","""N","""N","""N","""N","""N"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
386,"""Scar[F]ace et M3 peu last /!""",0,3,"""N","""2007-05-26 12:00:50""","""N","""N",103582791429521408,2006-10-27 09:48:29,"""N","""N","""N","""N","""N","""N"
387,"""/'*(~Lambo~)*'""",0,3,"""N","""2010-11-14 09:44:35""","""N","""N",103582791429521408,2006-11-03 12:26:20,"""N","""N","""N","""N","""N","""N"
388,"""K ill""",0,3,"""N","""2009-12-09 16:58:07""","""N","""N",103582791429521408,2006-11-02 03:04:59,"""N","""N","""N","""N","""N","""N"
389,"""/-\un g@rs du 4o/-""",0,3,"""N","""2008-09-19 16:01:58""","""N","""N",103582791429521408,2006-11-02 10:58:58,"""N","""N","""N","""N",FR,97


---
### 5.4.3c Format, rename columns and save Table

In [32]:
ps_t4_2.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c1: integer (nullable = false)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer 

In [39]:
# Drop unnecessary columns
ps_t4_2 = ps_t4_2.drop('nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c1_1','c1_2','c7_1','c7_2','c8_1','c8_2','c11_1','c11_2',\
                    'c12_1','c12_2','c13_1','c13_2','c14_1','c14_2','c15_1','c15_2','_c6','_c7','_c10',\
                    '_c8','_c9','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19','_c3','_c4','_c5')

In [41]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t4_2 = rename_col(ps_t4_2, newColumns)
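
Since only a list of names is passed, the renaming appears to be positional, so the list needs exactly one name per remaining column, in the same order. The sketch below shows that check in plain Python. `rename_positionally` is a hypothetical stand-in, not the notebook's `rename_col` helper, whose implementation is not shown in this section.

```python
from typing import Dict, List


def rename_positionally(old: List[str], new: List[str]) -> Dict[str, str]:
    """Map old column names to new ones by position, refusing a length mismatch."""
    if len(old) != len(new):
        raise ValueError(f"expected {len(old)} names, got {len(new)}")
    return dict(zip(old, new))


# Columns left after the drop step, in order
old_cols = ['_c0', '_c1', '_c2', 'personastate', 'communityvisibilitystate',
            'profilestate', 'lastlogoff', 'commentpermission', 'realname',
            'primaryclanid', 'timecreated', 'gameid', 'gameserverip',
            'gameextrainfo', 'cityid', 'loccountrycode', 'locstatecode']
new_cols = ['steam_id', 'person_name', 'profile_url', 'person_state',
            'community_visibility_state', 'profile_state', 'last_logoff',
            'comment_permission', 'real_name', 'primary_clanid', 'time_created',
            'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id',
            'country_code', 'state_code']

mapping = rename_positionally(old_cols, new_cols)
print(mapping['_c0'], mapping['locstatecode'])
```

Printing the mapping before renaming is a cheap way to catch a column that was dropped or kept by mistake.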

In [42]:
ps_t4_2.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [44]:
# Remove the " characters from the data
col_list = ['person_name','profile_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','profile_url','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t4_2 = ps_t4_2.withColumn(i, regexp_replace(i, '"', ""))

In [45]:
ps_t4_2.limit(20).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197960294170,^6i|-|\//\//\//,http://media.steampowered.com/steamcommunity/p...,0,3,N,2012-09-09 10:28:45,N,N,103582791431922407,2003-09-12 03:40:04,N,N,N,N,N,N
1,76561197960306346,Lassssssi,http://media.steampowered.com/steamcommunity/p...,0,3,N,2012-12-16 12:07:24,N,N,103582791429521408,2003-09-12 04:45:04,N,N,N,N,N,N
2,76561197960310409,\missing-name,http://media.steampowered.com/steamcommunity/p...,0,3,N,2011-07-13 01:17:27,N,N,103582791429521408,2003-09-12 05:08:40,N,N,N,N,N,N
3,76561197960318323,kore,http://media.steampowered.com/steamcommunity/p...,0,3,N,2008-07-02 09:20:48,N,N,103582791429521408,2003-09-12 05:48:20,N,N,N,N,N,N
4,76561197960337332,lewz,http://media.steampowered.com/steamcommunity/p...,0,3,N,2008-01-27 10:08:59,N,N,103582791429521408,2003-09-12 07:08:48,N,N,N,N,N,N
5,76561197960458711,Felipe :,http://media.steampowered.com/steamcommunity/p...,0,3,N,2007-07-24 14:34:39,N,N,103582791429521408,2003-09-12 21:04:14,N,N,N,N,N,N
6,76561197960582631,\missing-name,http://media.steampowered.com/steamcommunity/p...,0,3,N,2006-08-28 14:00:14,N,N,103582791429521408,2003-09-14 03:46:41,N,N,N,N,N,N
7,76561197960594684,,http://media.steampowered.com/steamcommunity/p...,0,3,N,2007-10-28 04:14:02,N,N,103582791429521408,2003-09-14 06:09:29,N,N,N,N,N,N
8,76561197960720659,,http://media.steampowered.com/steamcommunity/p...,0,3,N,2006-11-01 09:23:43,N,N,103582791429521408,2003-09-15 07:34:35,N,N,N,N,N,N
9,76561197960746171,reed,http://media.steampowered.com/steamcommunity/p...,0,3,N,2008-10-29 15:07:14,N,N,103582791429521408,2003-09-15 11:05:39,N,N,N,N,N,N
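
After the quotes are removed, many fields hold a bare `N`, which is why the earlier missing-value check reports 0 even though little real data is present. If `N` is the export's marker for NULL (an assumption; it resembles the `\N` used in SQL dump files), it could be converted to a true missing value before analysis. A plain-Python sketch with a hypothetical helper `n_to_none`:

```python
from typing import Dict, Optional


def n_to_none(row: Dict[str, str], marker: str = "N") -> Dict[str, Optional[str]]:
    """Replace cells that are exactly the NULL marker with None."""
    return {key: (None if value == marker else value) for key, value in row.items()}


# Field values copied from one of the cleaned rows above
row = {
    "person_name": "kore",
    "real_name": "N",
    "country_code": "N",
    "last_logoff": "2008-07-02 09:20:48",
}
cleaned = n_to_none(row)
missing = sum(value is None for value in cleaned.values())
print(cleaned)
print(missing)
```

The comparison is an exact match, so a name that merely starts with `N` is left alone; a user whose name is literally `N` would still be treated as missing.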


In [46]:
ps_t4_2.count()

391

In [47]:
# Save TABLE 4_2
ps_t4_2.write.csv('/user/tamng/jwht/CleanData/ps_t4_2.csv', header = True)

---
### Table 4.3: F.col('_c9').contains('N')

### 5.4.1d Filter data and check number of Split

In [24]:
ps_t4_3 = ps_t4.filter(F.col('_c9').contains('N'))
ps_t4_3.limit(10).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197985764946,FN#25499218,http://steamcommunity.com/profiles/76561197985...,0,3,"""N,""2009-03-11 18:38:38""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
1,76561197978765238,FN#18499510,http://steamcommunity.com/profiles/76561197978...,0,3,"""N,""2009-05-10 18:59:20""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
2,76561197982892481,FN#22626753,http://steamcommunity.com/profiles/76561197982...,0,3,"""N,""2009-03-14 13:54:56""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
3,76561197983105185,FN#22839457,http://steamcommunity.com/profiles/76561197983...,0,3,"""N,""2009-05-03 13:36:33""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
4,76561197979038568,FN#18772840,http://steamcommunity.com/profiles/76561197979...,0,3,"""N,""2009-04-19 05:05:49""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
5,76561197960485742,FN#220014,http://steamcommunity.com/profiles/76561197960...,0,3,"""N,""2010-10-13 12:57:27""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
6,76561197980213913,FN#19948185,http://steamcommunity.com/profiles/76561197980...,0,3,"""N,""2009-05-03 09:29:17""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
7,76561197982601760,FN#22336032,http://steamcommunity.com/profiles/76561197982...,0,3,"""N,""2009-05-31 02:18:40""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
8,76561197984644524,FN#24378796,http://steamcommunity.com/profiles/76561197984...,0,3,"""N,""2009-03-11 17:31:07""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1
9,76561197984813531,FN#24547803,http://steamcommunity.com/profiles/76561197984...,0,3,"""N,""2009-03-14 22:40:43""","""N,""N",103582791429521408,"""N,""N","""N,""N",...,1,2,2,2,2,1,-1,-1,-1,-1


In [27]:
ps_t4_3.count()

57

In [25]:
nSplit_c11_2_c10_1_c9n = ps_t4.filter(F.col('_c9').contains('N')).toPandas()

In [26]:
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_2_c10_1_c9n[i].unique())

pd.DataFrame(zip(check_list,unique_col))

Unnamed: 0,0,1
0,nSplit_c7,[1]
1,nSplit_c8,[2]
2,nSplit_c9,[2]
3,nSplit_c10,[1]
4,nSplit_c11,[2]
5,nSplit_c12,[2]
6,nSplit_c13,[2]
7,nSplit_c14,[2]
8,nSplit_c15,[1]
9,nSplit_c16,[-1]


In [28]:
# Define the columns to split and the names of the resulting columns

col_c8 = ['_c8']
newcols_c8 = ['c8_1', 'c8_2']

col_c9 = ['_c9']
newcols_c9 = ['c9_1', 'c9_2']

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']


# Apply split function
ps_t4_3 = split_2_column(ps_t4_3, col_c8, newcols_c8)
ps_t4_3 = split_2_column(ps_t4_3, col_c9, newcols_c9)
ps_t4_3 = split_2_column(ps_t4_3, col_c11, newcols_c11)
ps_t4_3 = split_2_column(ps_t4_3, col_c12, newcols_c12)
ps_t4_3 = split_2_column(ps_t4_3, col_c13, newcols_c13)
ps_t4_3 = split_2_column(ps_t4_3, col_c14, newcols_c14)

---
### 5.4.2d Adding columns and condition
From this step on, we add the named columns using the nSplit column logic.

_Add columns_

In [29]:
ps_t4_3 = ps_t4_3.withColumn('profilestate', F.col('c8_1'))
ps_t4_3 = ps_t4_3.withColumn('lastlogoff', F.col('c8_2'))
ps_t4_3 = ps_t4_3.withColumn('commentpermission', F.col('c9_1'))
ps_t4_3 = ps_t4_3.withColumn('realname', F.col('c9_2'))
ps_t4_3 = ps_t4_3.withColumn('primaryclanid', F.col('_c10'))
ps_t4_3 = ps_t4_3.withColumn('timecreated', F.col('c11_1'))
ps_t4_3 = ps_t4_3.withColumn('gameid', F.col('c11_2'))
ps_t4_3 = ps_t4_3.withColumn('gameserverip', F.col('c12_1'))
ps_t4_3 = ps_t4_3.withColumn('gameextrainfo', F.col('c12_2'))
ps_t4_3 = ps_t4_3.withColumn('cityid', F.col('c13_1'))
ps_t4_3 = ps_t4_3.withColumn('loccountrycode', F.col('c13_2'))
ps_t4_3 = ps_t4_3.withColumn('locstatecode', F.col('c14_1'))

In [30]:
check_missing(ps_t4_3.select('profilestate', 'lastlogoff','commentpermission','realname','primaryclanid',\
                             'timecreated','gameid','gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode'))

+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|profilestate|lastlogoff|commentpermission|realname|primaryclanid|timecreated|gameid|gameserverip|gameextrainfo|cityid|loccountrycode|locstatecode|
+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|           0|         0|                0|       0|            0|          0|     0|           0|            0|     0|             0|           0|
+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+



In [31]:
ps_t4_3.limit(20).toPandas()[['profilestate', 'lastlogoff','commentpermission','realname','primaryclanid',\
                             'timecreated','gameid','gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode']]

Unnamed: 0,profilestate,lastlogoff,commentpermission,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,"""N","""2009-03-11 18:38:38""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
1,"""N","""2009-03-14 13:54:56""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
2,"""N","""2009-05-10 18:59:20""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
3,"""N","""2009-05-03 13:36:33""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
4,"""N","""2010-10-13 12:57:27""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
5,"""N","""2009-04-19 05:05:49""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
6,"""N","""2009-05-03 09:29:17""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
7,"""N","""2009-05-31 02:18:40""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
8,"""N","""2009-03-11 17:31:07""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"
9,"""N","""2009-03-14 22:40:43""","""N","""N",103582791429521408,"""N","""N","""N","""N","""N","""N","""N"


---
### 5.4.3d Format, rename columns and save Table

In [32]:
ps_t4_3.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [39]:
# Drop unnecessary columns
ps_t4_3 = ps_t4_3.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19','c8_1','c8_2','c9_1','c9_2','c11_1','c11_2',\
                    'c12_1','c12_2','c13_1','c13_2','c14_1','c14_2','_c10',\
                    '_c8','_c9','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19')

In [41]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t4_3 = rename_col(ps_t4_3, newColumns)

In [42]:
ps_t4_3.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [43]:
# Remove the " characters from the data
col_list = ['person_name','profile_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','profile_url','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t4_3 = ps_t4_3.withColumn(i, regexp_replace(i, '"', ""))

In [44]:
ps_t4_3.limit(10).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197985764946,FN#25499218,http://steamcommunity.com/profiles/76561197985...,0,3,N,2009-03-11 18:38:38,N,N,103582791429521408,N,N,N,N,N,N,N
1,76561197982892481,FN#22626753,http://steamcommunity.com/profiles/76561197982...,0,3,N,2009-03-14 13:54:56,N,N,103582791429521408,N,N,N,N,N,N,N
2,76561197978765238,FN#18499510,http://steamcommunity.com/profiles/76561197978...,0,3,N,2009-05-10 18:59:20,N,N,103582791429521408,N,N,N,N,N,N,N
3,76561197983105185,FN#22839457,http://steamcommunity.com/profiles/76561197983...,0,3,N,2009-05-03 13:36:33,N,N,103582791429521408,N,N,N,N,N,N,N
4,76561197979038568,FN#18772840,http://steamcommunity.com/profiles/76561197979...,0,3,N,2009-04-19 05:05:49,N,N,103582791429521408,N,N,N,N,N,N,N
5,76561197960485742,FN#220014,http://steamcommunity.com/profiles/76561197960...,0,3,N,2010-10-13 12:57:27,N,N,103582791429521408,N,N,N,N,N,N,N
6,76561197980213913,FN#19948185,http://steamcommunity.com/profiles/76561197980...,0,3,N,2009-05-03 09:29:17,N,N,103582791429521408,N,N,N,N,N,N,N
7,76561197982601760,FN#22336032,http://steamcommunity.com/profiles/76561197982...,0,3,N,2009-05-31 02:18:40,N,N,103582791429521408,N,N,N,N,N,N,N
8,76561197984644524,FN#24378796,http://steamcommunity.com/profiles/76561197984...,0,3,N,2009-03-11 17:31:07,N,N,103582791429521408,N,N,N,N,N,N,N
9,76561197984813531,FN#24547803,http://steamcommunity.com/profiles/76561197984...,0,3,N,2009-03-14 22:40:43,N,N,103582791429521408,N,N,N,N,N,N,N


In [45]:
# Save TABLE 4_3
ps_t4_3.write.csv('/user/tamng/jwht/CleanData/ps_t4_3.csv', header = True)

---
### Table 4.4: ['len_c9']==1

### 5.4.1e Filter data and check number of Split

In [46]:
nSplit_c11_2_c10_1['len_c9'] = nSplit_c11_2_c10_1._c9.str.len()
nSplit_c11_2_c10_1['len_c9'].unique()

array([19, 18,  1,  5])
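
The distinct lengths give a quick way to tell the row layouts apart: length 5 matches a fused `"N,"N` cell (Table 4.3) and length 1 matches a single digit (Table 4.4). As a plain-Python sketch, grouping values by length looks like the following. The helper name `bucket_by_length` is hypothetical, and the sample values are copied from rows shown in this section.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def bucket_by_length(values: Iterable[str]) -> Dict[int, List[str]]:
    """Group string values by their character length."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for value in values:
        buckets[len(value)].append(value)
    return dict(buckets)


samples = ["103582791429521408", '"N,"N', "1", "103582791431922407"]
buckets = bucket_by_length(samples)
for length in sorted(buckets):
    print(length, buckets[length])
```

Each bucket can then be inspected and cleaned with its own column mapping, which is the approach the sub-tables in this section follow.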

In [53]:
nSplit_c11_2_c10_1_c91 = nSplit_c11_2_c10_1[nSplit_c11_2_c10_1['len_c9']==1]

# Create a column that captures the length of _c9
ps_t4 = ps_t4.withColumn('len_c9', length(ps_t4._c9))

In [52]:
ps_t4_4 = ps_t4.filter(F.col('len_c9')<2)
ps_t4_4.count()

50

In [51]:
ps_t4.filter(F.col('len_c9')<2).toPandas()[['_c1','_c2','_c6','_c7','_c8','_c9','_c10','_c11','_c12','len_c9']]

Unnamed: 0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,len_c9
0,"frodo c""",")""",http://media.steampowered.com/steamcommunity/p...,0,1,1,2013-02-16 14:59:00,"""N,""N","""N,""N",1
1,"c""",")""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-12-14 13:26:58,"""N,""t0mppade""",103582791429521408,1
2,"""""""DUCHER","aragon e + alguÃ©m""""""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-06-06 05:16:34,"""N,""TÃ¡ no B.I.""",103582791429591926,1
3,"""<php? Echo """"h1","iam php user"""";>""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-01-05 16:02:03,"""N,""N",103582791429521543,1
4,"""_,.-'""""Â¯""""Xobu""""Â¯'""""-.","_""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-07 02:00:56,"""N,""N",103582791433560172,1
5,"""""""manuelSFC""""agregadme","nueva cuen""",http://media.steampowered.com/steamcommunity/p...,0,1,1,2011-03-24 14:13:43,"""N,""N","""N,""N",1
6,"JeaNakiS c""",")""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-11-26 13:23:43,"""N,""N",103582791429572341,1
7,"""sQeAk""""off","@rL <hf & gl>""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-12-20 18:38:04,"""N,""N",103582791429521408,1
8,"""Alphamale """"U know it","u feel it""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2011-03-25 13:08:49,"""N,""Niclas""",103582791430111996,1
9,"""""""#","Jaranko i 1 leming.""""""",http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-02-10 07:00:29,"""N,""Krzysztof Rozenberger""",103582791430584492,1


In [54]:
check_list = ['nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

unique_col = []
for i in check_list:
    unique_col.append(nSplit_c11_2_c10_1_c91[i].unique())

pd.DataFrame(zip(check_list,unique_col))

Unnamed: 0,0,1
0,nSplit_c7,[1]
1,nSplit_c8,[1]
2,nSplit_c9,[1]
3,nSplit_c10,[1]
4,nSplit_c11,[2]
5,nSplit_c12,"[2, 1]"
6,nSplit_c13,"[2, 1]"
7,nSplit_c14,"[2, 1]"
8,nSplit_c15,[2]
9,nSplit_c16,"[2, 1]"


In [56]:
# Define the columns to split and the names of the resulting columns

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2']

col_c17 = ['_c17']
newcols_c17 = ['c17_1', 'c17_2']

col_c18 = ['_c18']
newcols_c18 = ['c18_1', 'c18_2']

# Apply split function
ps_t4_4 = split_2_column(ps_t4_4, col_c11, newcols_c11)
ps_t4_4 = split_2_column(ps_t4_4, col_c12, newcols_c12)
ps_t4_4 = split_2_column(ps_t4_4, col_c13, newcols_c13)
ps_t4_4 = split_2_column(ps_t4_4, col_c14, newcols_c14)
ps_t4_4 = split_2_column(ps_t4_4, col_c15, newcols_c15)
ps_t4_4 = split_2_column(ps_t4_4, col_c16, newcols_c16)
ps_t4_4 = split_2_column(ps_t4_4, col_c17, newcols_c17)
ps_t4_4 = split_2_column(ps_t4_4, col_c18, newcols_c18)
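
The eight pairs of variables above follow one naming pattern, so they could also be generated in a loop. The sketch below is only an illustration in plain Python: `split_plan` is a hypothetical helper, not part of the notebook, and the call to `split_2_column` is left as a comment because its definition is outside this section.

```python
def split_plan(first, last, parts=2):
    """Build (source column list, new column names) pairs for _c<first>.._c<last>."""
    plan = []
    for n in range(first, last + 1):
        source = [f"_c{n}"]
        targets = [f"c{n}_{k}" for k in range(1, parts + 1)]
        plan.append((source, targets))
    return plan


plan = split_plan(11, 18)
for source, targets in plan:
    print(source, targets)
    # In the notebook, each pair would be passed on like this:
    # ps_t4_4 = split_2_column(ps_t4_4, source, targets)
```

This keeps the column names in one place, so changing the range or the number of parts does not require editing sixteen assignments.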

---
### 5.4.2e Adding columns and condition
From this step on, we add the output columns one by one, choosing the source of each value based on the nSplit column logic.

_Add columns_

In [81]:
ps_t4_4 = ps_t4_4.withColumn('personastate', F.col('_c7'))
ps_t4_4 = ps_t4_4.withColumn('communityvisibilitystate', F.col('_c8'))
ps_t4_4 = ps_t4_4.withColumn('profilestate', F.col('_c9'))
ps_t4_4 = ps_t4_4.withColumn('lastlogoff', F.col('_c10'))
ps_t4_4 = ps_t4_4.withColumn('commentpermission', F.col('c11_1'))
ps_t4_4 = ps_t4_4.withColumn('realname', F.col('c11_2'))
ps_t4_4 = ps_t4_4.withColumn('primaryclanid', F.when((F.col('_c12').contains('-.-')),F.col('_c13')).otherwise(F.col('c12_1')))

In [82]:
ps_t4_4.limit(10).toPandas()[['personastate','communityvisibilitystate','profilestate','lastlogoff','commentpermission','realname','primaryclanid']]

Unnamed: 0,personastate,communityvisibilitystate,profilestate,lastlogoff,commentpermission,realname,primaryclanid
0,0,1,1,2013-02-16 14:59:00,"""N","""N","""N"
1,0,3,1,2009-12-14 13:26:58,"""N","""t0mppade""",103582791429521408
2,0,3,1,2009-06-06 05:16:34,"""N","""TÃ¡ no B.I.""",103582791429591926
3,0,3,1,2013-01-05 16:02:03,"""N","""N",103582791429521543
4,0,3,1,2013-02-07 02:00:56,"""N","""N",103582791433560172
5,0,1,1,2011-03-24 14:13:43,"""N","""N","""N"
6,0,3,1,2012-11-26 13:23:43,"""N","""N",103582791429572341
7,0,3,1,2009-12-20 18:38:04,"""N","""N",103582791429521408
8,0,3,1,2011-03-25 13:08:49,"""N","""Niclas""",103582791430111996
9,0,3,1,2012-02-10 07:00:29,"""N","""Krzysztof Rozenberger""",103582791430584492


In [83]:
ps_t4_4 = ps_t4_4.withColumn('timecreated', F.when((F.col('nSplit_c12')==1) & (F.col('_c12').contains('-.-')), F.col('_c14')).\
                             when((F.col('nSplit_c12')==1) &(~F.col('_c12').contains('-.-')),F.col('c13_1')).otherwise(F.col('c12_2')))

In [84]:
ps_t4_4.select('timecreated').distinct().count()

44

In [85]:
# Use this code to check if the 'timecreated' was parsed correctly.
ps_t4_4.select('timecreated').distinct().toPandas()[35:50]

Unnamed: 0,timecreated
35,2004-07-12 08:33:27
36,2004-06-04 04:54:30
37,2006-08-08 12:11:56
38,"""N"
39,2005-10-18 18:47:48
40,2004-12-08 14:11:07
41,2004-05-29 21:43:13
42,2006-07-04 06:23:30
43,2006-10-04 03:23:05


In [86]:
check_missing(ps_t4_4.select('timecreated'))

+-----------+
|timecreated|
+-----------+
|          0|
+-----------+



---
_Add column_
 > - gameid (_c14)

In [90]:
ps_t4_4 = ps_t4_4.withColumn('gameid', F.when((F.col('nSplit_c12')==1) & (F.col('_c12').contains('-.-')), F.col('c15_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) &(~F.col('_c12').contains('-.-')),F.col('c14_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) &(~F.col('_c12').contains('-.-')),F.col('c13_2')).\
                             when((F.col('nSplit_c12')==2) &(~F.col('_c12').contains('-.-')),F.col('c14_1')))

In [91]:
ps_t4_4.filter(F.col('gameid').contains('2006')).toPandas()[['_c1','_c2','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14']]

Unnamed: 0,_c1,_c2,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14


In [92]:
ps_t4_4.select('gameid').distinct().count()

1

In [93]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t4_4.select('gameid').distinct().toPandas()[0:50]

Unnamed: 0,gameid
0,"""N"


In [94]:
check_missing(ps_t4_4.select('gameid'))

+------+
|gameid|
+------+
|     0|
+------+



---
_Add column_
 > - gameserverip (_c15)

In [88]:
nSplit_c11_2_c10_1_c91.nSplit_c15.unique()

array([2])

In [100]:
ps_t4_4 = ps_t4_4.withColumn('gameserverip', F.when((F.col('nSplit_c12')==1) & (F.col('_c12').contains('-.-')), F.col('c15_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) &(~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) &(~F.col('_c12').contains('-.-')),F.col('c14_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) &(~F.col('_c12').contains('-.-')),F.col('c14_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (~F.col('_c12').contains('-.-')),F.col('c14_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (~F.col('_c12').contains('-.-')),F.col('c14_2')))

In [101]:
ps_t4_4.select('gameserverip').distinct().count()

1

In [102]:
# Use this code to check if the 'gameserverip' was parsed correctly.
ps_t4_4.select('gameserverip').distinct().toPandas()[0:50]

Unnamed: 0,gameserverip
0,"""N"


In [103]:
check_missing(ps_t4_4.select('gameserverip'))

+------------+
|gameserverip|
+------------+
|           0|
+------------+



---
_Add column_
 > - gameextrainfo (_c16)

In [99]:
nSplit_c11_2_c10_1_c91.nSplit_c16.unique()

array([2, 1])

In [104]:
ps_t4_4 = ps_t4_4.withColumn('gameextrainfo', F.when((F.col('nSplit_c12')==1) & (F.col('_c12').contains('-.-')), F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) &(~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) &(~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) &(~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) &(~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) &(~F.col('_c12').contains('-.-')),F.col('c14_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (~F.col('_c12').contains('-.-')),F.col('c14_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_1')))

In [105]:
ps_t4_4.select('gameextrainfo').distinct().count()

1

In [106]:
# Use this code to check if the 'gameextrainfo' was parsed correctly.
ps_t4_4.select('gameextrainfo').distinct().toPandas()[0:50]

Unnamed: 0,gameextrainfo
0,"""N"


In [107]:
check_missing(ps_t4_4.select('gameextrainfo'))

+-------------+
|gameextrainfo|
+-------------+
|            0|
+-------------+



---
_Add column_
 > - cityid (_c17)

In [108]:
nSplit_c11_2_c10_1_c91.nSplit_c17.unique()

array([-1,  2,  1])

In [121]:
ps_t4_4 = ps_t4_4.withColumn('cityid', F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 2) & (F.col('_c12').contains('-.-')), F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('_c12').contains('-.-')), F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (~F.col('_c12').contains('-.-')),F.col('c15_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (~F.col('_c12').contains('-.-')),F.col('c14_2')))

In [122]:
ps_t4_4.select('cityid').distinct().count()

1

In [123]:
# Use this code to check if the 'cityid' was parsed correctly.
ps_t4_4.select('cityid').distinct().toPandas()[0:50]

Unnamed: 0,cityid
0,"""N"


In [124]:
check_missing(ps_t4_4.select('cityid'))

+------+
|cityid|
+------+
|     0|
+------+



---
_Add column_
 > - loccountrycode (_c18)

In [126]:
nSplit_c11_2_c10_1_c91.nSplit_c18.unique()

array([-1,  1,  2])

In [134]:
ps_t4_4 = ps_t4_4.withColumn('loccountrycode', F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 2) & (F.col('_c12').contains('-.-')), F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('nSplit_c17')== 1) & (F.col('_c12').contains('-.-')), F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('nSplit_c17')== 2) & (F.col('_c12').contains('-.-')), F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')== 1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')== 2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_1')))

In [135]:
ps_t4_4.select('loccountrycode').distinct().count()

14

In [136]:
# Use this code to check if the 'loccountrycode' was parsed correctly.
ps_t4_4.select('loccountrycode').distinct().toPandas()[0:50]

Unnamed: 0,loccountrycode
0,PL
1,RU
2,PT
3,DE
4,ES
5,TR
6,US
7,FR
8,SG
9,SE


In [137]:
check_missing(ps_t4_4.select('loccountrycode'))

+--------------+
|loccountrycode|
+--------------+
|             0|
+--------------+



---
_Add column_
 > - locstatecode (_c19)

In [138]:
nSplit_c11_2_c10_1_c91.nSplit_c19.unique()

array([-1,  1])

In [142]:
ps_t4_4 = ps_t4_4.withColumn('locstatecode', F.when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 2) & (F.col('nSplit_c17')== 1) & (F.col('_c12').contains('-.-')), F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 2) & (F.col('nSplit_c17')== 2) & (F.col('_c12').contains('-.-')), F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('nSplit_c17')== 1) & (F.col('nSplit_c18')== 1) & (F.col('_c12').contains('-.-')), F.col('_c19')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('nSplit_c17')== 1) & (F.col('nSplit_c18')== 2) & (F.col('_c12').contains('-.-')), F.col('c18_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c16')== 1) & (F.col('nSplit_c17')== 2) & (F.col('_c12').contains('-.-')), F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')== 1) & (F.col('nSplit_c18')== 1) & (~F.col('_c12').contains('-.-')),F.col('_c19')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')== 1) & (F.col('nSplit_c18')== 2) & (~F.col('_c12').contains('-.-')),F.col('c18_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')== 2) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (~F.col('_c12').contains('-.-')),F.col('c18_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (~F.col('_c12').contains('-.-')),F.col('c17_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_2')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (~F.col('_c12').contains('-.-')),F.col('c16_1')).\
                             when((F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (~F.col('_c12').contains('-.-')),F.col('c15_2')))
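
The `when` chains in this section grow long because the position of each field depends on how many comma-separated values the earlier cells hold. Most branches appear to follow one positional rule: split the cells from `_c12` onward on commas, lay the pieces end to end, and read the fields in order. The sketch below shows that rule in plain Python as an illustration only. It is not the notebook's method: `flatten_cells`, `assign_fields`, and the sample row are hypothetical, the field order is assumed from the columns added above, and rows whose `_c12` contains `-.-` are not handled.

```python
# Field order assumed from the columns assigned in this section.
FIELDS = ["primaryclanid", "timecreated", "gameid", "gameserverip",
          "gameextrainfo", "cityid", "loccountrycode", "locstatecode"]


def flatten_cells(cells):
    """Split every raw cell on commas and concatenate the pieces in order."""
    pieces = []
    for cell in cells:
        pieces.extend(cell.split(","))
    return pieces


def assign_fields(cells):
    """Map the flattened pieces onto the field names by position."""
    return dict(zip(FIELDS, flatten_cells(cells)))


# Hypothetical raw cells _c12.._c17: some hold one value, some hold two.
row = ['103582791429591926,2005-10-18 18:47:48', '"N', '"N,"N', '"N', 'PT', '10']
result = assign_fields(row)
for name, value in result.items():
    print(name, value)
```

Under this assumption, a field's position is determined by the flattened list alone, which is what each combination of `nSplit_*` values in the chains spells out by hand.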

In [143]:
ps_t4_4.select('locstatecode').distinct().count()

16

In [144]:
# Use this code to check if the 'locstatecode' was parsed correctly.
ps_t4_4.select('locstatecode').distinct().toPandas()[0:50]

Unnamed: 0,locstatecode
0,07
1,01
2,47
3,27
4,02
5,B9
6,58
7,56
8,C1
9,10


In [145]:
check_missing(ps_t4_4.select('locstatecode'))

+------------+
|locstatecode|
+------------+
|           0|
+------------+



---
### 5.4.3e Format, rename columns and save Table

In [146]:
ps_t4_4.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer (nullable = false)
 |-- nSplit_c14: integer (nullable = false)
 |-- nSplit_c15: integer (nullable = fals

In [152]:
# Drop unnecessary columns
ps_t4_4 = ps_t4_4.drop('nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','len_c9',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c12_1','c12_2','c13_1','c13_2',\
                    'c18_1','c18_2','c14_1','c14_2','c15_1','c15_2','c16_1','c16_2','c17_1','c17_2',\
                    '_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19')

In [154]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t4_4 = rename_col(ps_t4_4, newColumns)

In [155]:
ps_t4_4.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [156]:
# Remove the " symbols from the data
col_list = ['profile_state','community_visibility_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t4_4 = ps_t4_4.withColumn(i,regexp_replace(i, '"', ""))
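
To make the effect of this replacement concrete, here is a small plain-Python sketch of the same substitution applied to values like those shown earlier. `strip_quotes` and the sample dictionary are hypothetical and only mirror the pattern and replacement used in the cell above.

```python
import re


def strip_quotes(value):
    """Remove every double-quote character from a string value."""
    return re.sub('"', "", value)


# Hypothetical sample of raw values as they look before cleaning.
raw = {"comment_permission": '"N', "real_name": '"Niclas"', "country_code": "SE"}
cleaned = {name: strip_quotes(value) for name, value in raw.items()}
print(cleaned)
```

Values without quotes pass through unchanged, so applying the replacement to every listed column is safe even when only some of them carry stray quotes.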

In [158]:
ps_t4_4.toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197979717326,"c""",")""",0,3,1,2009-12-14 13:26:58,N,t0mppade,103582791429521408,2005-12-13 11:15:41,N,N,N,N,N,N
1,76561197962613920,"frodo c""",")""",0,1,1,2013-02-16 14:59:00,N,N,N,N,N,N,N,N,N,N
2,76561197978705527,"""""""DUCHER","aragon e + alguÃ©m""""""",0,3,1,2009-06-06 05:16:34,N,TÃ¡ no B.I.,103582791429591926,2005-10-18 18:47:48,N,N,N,N,PT,10
3,76561197984123334,"""<php? Echo """"h1","iam php user"""";>""",0,3,1,2013-01-05 16:02:03,N,N,103582791429521543,2006-08-04 04:20:12,N,N,N,N,RU,N
4,76561197966762701,"""_,.-'""""Â¯""""Xobu""""Â¯'""""-.","_""",0,3,1,2013-02-07 02:00:56,N,N,103582791433560172,2004-05-29 21:43:13,N,N,N,N,N,N
5,76561197983503638,"""""""manuelSFC""""agregadme","nueva cuen""",0,1,1,2011-03-24 14:13:43,N,N,N,N,N,N,N,N,N,N
6,76561197983008066,"JeaNakiS c""",")""",0,3,1,2012-11-26 13:23:43,N,N,103582791429572341,2006-06-05 14:55:29,N,N,N,N,N,N
7,76561197967429954,"""sQeAk""""off","@rL <hf & gl>""",0,3,1,2009-12-20 18:38:04,N,N,103582791429521408,2004-07-12 08:33:27,N,N,N,N,DE,N
8,76561197960986678,"""Alphamale """"U know it","u feel it""",0,3,1,2011-03-25 13:08:49,N,Niclas,103582791430111996,2003-09-18 10:34:12,N,N,N,N,SE,N
9,76561197960897998,"""""""#","Jaranko i 1 leming.""""""",0,3,1,2012-02-10 07:00:29,N,Krzysztof Rozenberger,103582791430584492,2003-09-17 06:58:25,N,N,N,N,PL,76


In [165]:
ps_t4_4 = ps_t4_4.withColumn('profile_url', sf.concat(sf.lit('http://steamcommunity.com/profiles/'), sf.col('steam_id')))

In [166]:
ps_t4_4.limit(10).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197979717326,"c""",http://steamcommunity.com/profiles/76561197979...,0,3,1,2009-12-14 13:26:58,N,t0mppade,103582791429521408,2005-12-13 11:15:41,N,N,N,N,N,N
1,76561197962613920,"frodo c""",http://steamcommunity.com/profiles/76561197962...,0,1,1,2013-02-16 14:59:00,N,N,N,N,N,N,N,N,N,N
2,76561197978705527,"""""""DUCHER",http://steamcommunity.com/profiles/76561197978...,0,3,1,2009-06-06 05:16:34,N,TÃ¡ no B.I.,103582791429591926,2005-10-18 18:47:48,N,N,N,N,PT,10
3,76561197984123334,"""<php? Echo """"h1",http://steamcommunity.com/profiles/76561197984...,0,3,1,2013-01-05 16:02:03,N,N,103582791429521543,2006-08-04 04:20:12,N,N,N,N,RU,N
4,76561197966762701,"""_,.-'""""Â¯""""Xobu""""Â¯'""""-.",http://steamcommunity.com/profiles/76561197966...,0,3,1,2013-02-07 02:00:56,N,N,103582791433560172,2004-05-29 21:43:13,N,N,N,N,N,N
5,76561197983503638,"""""""manuelSFC""""agregadme",http://steamcommunity.com/profiles/76561197983...,0,1,1,2011-03-24 14:13:43,N,N,N,N,N,N,N,N,N,N
6,76561197983008066,"JeaNakiS c""",http://steamcommunity.com/profiles/76561197983...,0,3,1,2012-11-26 13:23:43,N,N,103582791429572341,2006-06-05 14:55:29,N,N,N,N,N,N
7,76561197967429954,"""sQeAk""""off",http://steamcommunity.com/profiles/76561197967...,0,3,1,2009-12-20 18:38:04,N,N,103582791429521408,2004-07-12 08:33:27,N,N,N,N,DE,N
8,76561197960986678,"""Alphamale """"U know it",http://steamcommunity.com/profiles/76561197960...,0,3,1,2011-03-25 13:08:49,N,Niclas,103582791430111996,2003-09-18 10:34:12,N,N,N,N,SE,N
9,76561197960897998,"""""""#",http://steamcommunity.com/profiles/76561197960...,0,3,1,2012-02-10 07:00:29,N,Krzysztof Rozenberger,103582791430584492,2003-09-17 06:58:25,N,N,N,N,PL,76


In [167]:
ps_t4_4.count()

50

In [168]:
# Save TABLE 4_4
ps_t4_4.write.csv('/user/tamng/jwht/CleanData/ps_t4_4.csv', header = True)

__Check total number of rows that belong to BIG TABLE 4 condition__

In [169]:
# ps_t4_1_1, ps_t4_1_2, ps_t4_2, ps_t4_3, ps_t4_4
120870 + 305 + 391 + 57 + 50

121673

__Check total number of rows of TABLE(1+2+3+4)__

In [170]:
57359 + 80 +26 + 121673

179138
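
The two totals above can also be computed from named counts, which makes it harder to drop or mistype a term. The sketch below uses the row counts printed in this notebook; the dictionary names, and the labels `table_1` to `table_3` for the first three counts, are assumptions made for illustration only.

```python
# Row counts copied from the outputs above (sub-tables of BIG TABLE 4).
table4_parts = {
    "ps_t4_1_1": 120870,
    "ps_t4_1_2": 305,
    "ps_t4_2": 391,
    "ps_t4_3": 57,
    "ps_t4_4": 50,
}
table4_total = sum(table4_parts.values())

# Hypothetical labels for the counts of TABLE 1, 2 and 3.
all_tables = {"table_1": 57359, "table_2": 80, "table_3": 26, "table_4": table4_total}
grand_total = sum(all_tables.values())

print(table4_total)
print(grand_total)
```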

---
## <font color = 'black'> TABLE 5:  nSplit_c11 == 2 & nSplit_c10 != 1 </font>

### 5.5.1 Filter data and check number of Split

In [24]:
cols_check = ['_c1','_c7','_c8','_c9', '_c10', '_c11', '_c12', '_c13', '_c14','_c15','_c16','_c17','_c18','_c19']
tagname = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11', 'nSplit_c12', 'nSplit_c13', 'nSplit_c14',
           'nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']
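def add_count_split_column(df, column, tagname):
    # For every name in `column`, add a column (named by the matching entry in `tagname`)
    # that holds the number of comma-separated pieces found in that cell.
    df = reduce(lambda df, idx: df.withColumn(tagname[idx],  F.size(F.split(column[idx], ','))),range(len(column)), df)
    return df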

In [25]:
player_summaries_drop = add_count_split_column(player_summaries_drop, cols_check, tagname)
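
To make the counting step concrete, here is a minimal plain-Python sketch of what each `nSplit_*` value represents for a single cell. The helper name `count_split` is hypothetical, and returning `-1` for a missing cell simply mirrors the `-1` values that appear in the outputs of this notebook; it is not meant as a statement about Spark behaviour in every version.

```python
def count_split(cell, sep=","):
    """Return the number of `sep`-separated pieces in one cell.

    Hypothetical helper: imitates F.size(F.split(col, ',')) for one value,
    using -1 for a missing cell as seen in this notebook's output.
    """
    if cell is None:
        return -1
    return len(cell.split(sep))

print(count_split('3'))                          # one piece
print(count_split('"N,"N'))                      # two pieces
print(count_split('"N,"2011-09-22 14:51:12"'))   # two pieces
print(count_split(None))                         # missing cell
```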

In [19]:
ps_t5 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 2) &(player_summaries_drop.nSplit_c10 != 1))
ps_t5.count()

3554033

In [20]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t5.groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|       -1|     54|
|        1|3552870|
|       13|      1|
|        3|     87|
|        5|     19|
|        9|      1|
|        4|     13|
|        8|      2|
|        7|      4|
|       11|      1|
|        2|    981|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|3554030|
|        2|      3|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1| 464463|
|        2|3089570|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1| 455546|
|        3|     23|
|        2|3098464|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         3|    561|
|         2|3553472|
+----------+-------+

None

In [21]:
ps_t5.filter(F.col('nSplit_c7')==2).limit(20).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197973476315,"""Ecxpzo0"",""http://steamcommunity.com/profiles/...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""2011-09-22 14:51:12""","""N,""N","""N,""N",...,2,2,2,2,-1,-1,-1,-1,-1,-1
1,76561197966788690,"""T1shk0*|__Ð¯/<"",""http://steamcommunity.com/pr...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""2013-01-22 06:08:26""","""N,""N","""N,""N",...,2,2,2,2,-1,-1,-1,-1,-1,-1
2,76561197983363035,"""ec/\u_Tbl_men9_ybi/\_Tbl_vaFe/"",""http://steam...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""2012-12-10 06:42:50""","""N,""N","""N,""N",...,2,2,2,2,-1,-1,-1,-1,-1,-1


---
_If `nSplit_c7` == 1, does this column still contain different kinds of values? Let's check._

In [22]:
ps_t5.filter(F.col('nSplit_c7')==1).select('_c7').distinct().count()

3

In [23]:
ps_t5.filter(F.col('nSplit_c7')==1).select('_c7').distinct().show()

+--------------------+
|                 _c7|
+--------------------+
|                   3|
|http://media.stea...|
|                   1|
+--------------------+



In [24]:
ps_t5.filter((F.col('nSplit_c7')==1) & (F.col('_c7').contains('http'))).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197981841669,"""- ."""". '",,".""",http://steamcommunity.com/profiles/76561197981...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,...,2,2,1,1,2,2,2,2,-1,-1
1,76561197980060437,"""""""The",,"Maniac...""",http://steamcommunity.com/profiles/76561197980...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,...,2,2,1,1,2,2,2,2,-1,-1
2,76561197967614772,""""""" <3 chun",,"*å‚»å‚»åˆ†buæ¸…æ¥š""",http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,...,2,2,1,1,2,2,2,2,-1,-1


In [25]:
ps_t5.filter((F.col('nSplit_c7')==1) & (~F.col('_c7').contains('http'))).limit(20).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197967277772,vinvel,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
1,76561197967278962,matrix12,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
2,76561197967279045,alex,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
3,76561197967280352,gunman,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
4,76561197967281070,red_748,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
5,76561197967281426,hmaitree,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
6,76561197967284540,takenotemusic,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
7,76561197967286379,xclide20063,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
8,76561197967290061,blah,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
9,76561197967291144,bart_ss,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,""N",...,2,2,2,2,2,1,-1,-1,-1,-1


---
__Create two new DataFrames, filtered by nSplit_c1 > 1 and nSplit_c1 == 1__

In [36]:
# Create two new DataFrames: rows where nSplit_c1 > 1 and rows where nSplit_c1 == 1
ps_t5_1_1 = ps_t5_1.filter((F.col('nSplit_c1')>1))
ps_t5_1_2 = ps_t5_1.filter((F.col('nSplit_c1')==1))

### Table 5_1: (F.col('nSplit_c1')>1)
### 5.5.1a Filter data and check number of Split

In [37]:
for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t5_1_1.groupBy(i).count().show())

Distinct value for each column:

+---------+-----+
|nSplit_c1|count|
+---------+-----+
|       13|    1|
|        3|   87|
|        5|   19|
|        9|    1|
|        4|   13|
|        8|    2|
|        7|    4|
|       11|    1|
|        2|  978|
+---------+-----+

None
Distinct value for each column:

+---------+-----+
|nSplit_c7|count|
+---------+-----+
|        1| 1106|
+---------+-----+

None
Distinct value for each column:

+---------+-----+
|nSplit_c8|count|
+---------+-----+
|        1| 1075|
|        2|   31|
+---------+-----+

None
Distinct value for each column:

+---------+-----+
|nSplit_c9|count|
+---------+-----+
|        1|  980|
|        2|  126|
+---------+-----+

None
Distinct value for each column:

+----------+-----+
|nSplit_c10|count|
+----------+-----+
|         3|   21|
|         2| 1085|
+----------+-----+

None
Distinct value for each column:

+----------+-----+
|nSplit_c11|count|
+----------+-----+
|         2| 1106|
+----------+-----+

None
Distinct value fo

In [42]:
# Set new columns need to split for this data

col_c1 = ['_c1']
newcols_c1 = ['c1_1', 'c1_2']

col_c8 = ['_c8']
newcols_c8 = ['c8_1', 'c8_2']

col_c9 = ['_c9']
newcols_c9 = ['c9_1', 'c9_2']

col_c10 = ['_c10']
newcols_c10 = ['c10_1', 'c10_2', 'c10_3']

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

# Apply split function
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c1, newcols_c1)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c8, newcols_c8)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c9, newcols_c9)
ps_t5_1_1 = split_3_column(ps_t5_1_1, col_c10, newcols_c10)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c11, newcols_c11)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c12, newcols_c12)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c13, newcols_c13)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c14, newcols_c14)
ps_t5_1_1 = split_2_column(ps_t5_1_1, col_c15, newcols_c15)

---
### 5.5.2a Adding columns and condition
From this step on, we add the output columns one by one, based on the logic of the `nSplit_*` columns.

_Add columns_

In [44]:
ps_t5_1_1 = ps_t5_1_1.withColumn('personaname', F.when((~F.col('_c1').contains('http')), F.col('_c1')).\
                                 when((F.col('_c1').contains('http')), F.col('c1_1')))

In [45]:
ps_t5_1_1 = ps_t5_1_1.withColumn('profileurl', F.col('_c2'))
ps_t5_1_1 = ps_t5_1_1.withColumn('personastate', F.when((~F.col('_c1').contains('http')), F.col('_c6')).\
                                 when((F.col('_c1').contains('http')), F.col('_c5')))

In [46]:
ps_t5_1_1 = ps_t5_1_1.withColumn('communityvisibilitystate', F.when((~F.col('_c1').contains('http')), F.col('_c7')).\
                                 when((F.col('_c1').contains('http')), F.col('_c6')))

In [47]:
ps_t5_1_1 = ps_t5_1_1.withColumn('profilestate', F.when((~F.col('_c1').contains('http')), F.col('c8_1')).\
                                 when((F.col('_c1').contains('http')), F.col('_c7')))

In [48]:
check_missing(ps_t5_1_1.select('personaname','personastate', 'communityvisibilitystate', 'profilestate'))

+-----------+------------+------------------------+------------+
|personaname|personastate|communityvisibilitystate|profilestate|
+-----------+------------+------------------------+------------+
|          0|           0|                       0|           0|
+-----------+------------+------------------------+------------+



In [49]:
ps_t5_1_1.select('personastate').distinct().count()

3

In [50]:
# Use this code to check if 'personastate' was parsed correctly.
ps_t5_1_1.select('personastate').distinct().toPandas()[0:50]

Unnamed: 0,personastate
0,3
1,0
2,1


In [51]:
ps_t5_1_1.select('communityvisibilitystate').distinct().count()

2

In [52]:
# Use this code to check if 'communityvisibilitystate' was parsed correctly.
ps_t5_1_1.select('communityvisibilitystate').distinct().toPandas()[0:50]

Unnamed: 0,communityvisibilitystate
0,3
1,1


In [53]:
ps_t5_1_1.select('profilestate').distinct().count()

2

In [54]:
# Use this code to check if 'profilestate' was parsed correctly.
ps_t5_1_1.select('profilestate').distinct().toPandas()[0:50]

Unnamed: 0,profilestate
0,1
1,"""N"


---
_Add column_
> - lastlogoff

In [60]:
ps_t5_1_1 = ps_t5_1_1.withColumn('lastlogoff', F.when((~F.col('_c1').contains('http')), F.col('c9_1')).\
                                 when((F.col('_c1').contains('http')), F.col('c8_1')))

In [61]:
ps_t5_1_1.select('lastlogoff').distinct().count()

1078

In [68]:
# Use this code to check if 'lastlogoff' was parsed correctly.
ps_t5_1_1.select('lastlogoff').distinct().toPandas()[1050:1078]

Unnamed: 0,lastlogoff
1050,2013-02-16 16:13:18
1051,2007-09-18 09:35:08
1052,2012-11-15 23:50:40
1053,2008-10-11 16:38:04
1054,2011-09-03 17:25:23
1055,2013-02-14 09:50:41
1056,2013-02-18 17:07:03
1057,2012-07-18 09:26:16
1058,2013-02-18 09:55:32
1059,2011-08-25 11:51:28


In [67]:
check_missing(ps_t5_1_1.select('lastlogoff'))

+----------+
|lastlogoff|
+----------+
|         0|
+----------+



---
_Add column_
> - commentpermission

In [69]:
ps_t5_1_1 = ps_t5_1_1.withColumn('commentpermission', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1), F.col('c10_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2), F.col('c9_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1), F.col('c9_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2), F.col('c8_2')))
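
The branches above amount to a positional lookup: after `lastlogoff`, the next comma-separated piece is `commentpermission`, whichever raw cell it happens to sit in. A minimal plain-Python sketch of that idea follows, for rows where `_c1` does not contain `http` (so the sequence starts at `_c9`). The helper `flatten_cells` and the sample cells are hypothetical and only illustrate the logic; they are not part of the notebook's pipeline.

```python
def flatten_cells(cells, sep=","):
    """Split each non-missing raw cell on `sep` and concatenate the pieces in order."""
    pieces = []
    for cell in cells:
        if cell is not None:
            pieces.extend(cell.split(sep))
    return pieces

# Hypothetical raw cells _c9 and _c10 for two rows.
row_a = ['2013-02-15 11:57:54', '1,"N']    # nSplit_c9 == 1: the value sits in c10_1
row_b = ['2013-02-15 11:57:54,2', '"N']    # nSplit_c9 == 2: the value sits in c9_2

for row in (row_a, row_b):
    pieces = flatten_cells(row)
    lastlogoff, commentpermission = pieces[0], pieces[1]
    print(lastlogoff, "|", commentpermission)
```

In both rows the second piece is the comment permission, which is what the two `nSplit_c9` branches select by hand.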

In [70]:
ps_t5_1_1.select('commentpermission').distinct().count()

3

In [71]:
# Use this code to check if 'commentpermission' was parsed correctly.
ps_t5_1_1.select('commentpermission').distinct().toPandas()[0:50]

Unnamed: 0,commentpermission
0,1
1,"""N"
2,2


In [72]:
check_missing(ps_t5_1_1.select('commentpermission'))

+-----------------+
|commentpermission|
+-----------------+
|                0|
+-----------------+



---
_Add column_
> - realname

In [73]:
ps_t5_1_1 = ps_t5_1_1.withColumn('realname', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1), F.col('c11_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c10_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2), F.col('c10_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1), F.col('c10_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2), F.col('c9_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2), F.col('c9_1')))

In [74]:
ps_t5_1_1.select('realname').distinct().count()

1

In [75]:
# Use this code to check if 'realname' was parsed correctly.
ps_t5_1_1.select('realname').distinct().toPandas()[0:50]

Unnamed: 0,realname
0,"""N"


In [76]:
check_missing(ps_t5_1_1.select('realname'))

+--------+
|realname|
+--------+
|       0|
+--------+



---
_Add column_
> - primaryclanid

In [85]:
ps_t5_1_1 = ps_t5_1_1.withColumn('primaryclanid', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c11_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3), F.col('c10_3')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1), F.col('c11_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2), F.col('c10_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c10_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3), F.col('c10_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2), F.col('c10_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1), F.col('c10_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2), F.col('c9_2')))

In [86]:
ps_t5_1_1.select('primaryclanid').distinct().count()

20

In [87]:
# Use this code to check if 'primaryclanid' was parsed correctly.
ps_t5_1_1.select('primaryclanid').distinct().toPandas()[0:50]

Unnamed: 0,primaryclanid
0,103582791432583665
1,103582791433621205
2,103582791429554297
3,103582791432774642
4,103582791429622516
5,103582791430077014
6,103582791432817724
7,103582791432160547
8,103582791429521408
9,103582791432044304


In [88]:
check_missing(ps_t5_1_1.select('primaryclanid'))

+-------------+
|primaryclanid|
+-------------+
|            0|
+-------------+



---
_Add column_
> - timecreated

In [89]:
ps_t5_1_1 = ps_t5_1_1.withColumn('timecreated', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3), F.col('c11_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2), F.col('c11_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3), F.col('c10_3')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3), F.col('c10_3')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2), F.col('c10_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c10_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2), F.col('c10_1')))

In [91]:
ps_t5_1_1.select('timecreated').distinct().count()

22

In [92]:
# Use this code to check if 'timecreated' was parsed correctly.
ps_t5_1_1.select('timecreated').distinct().toPandas()[0:50]

Unnamed: 0,timecreated
0,"""2005-12-07 15:21:29"""
1,"""2004-12-10 04:25:41"""
2,"""2004-04-19 17:47:32"""
3,"""2003-11-25 20:23:38"""
4,"""2003-12-13 07:31:25"""
5,"""2003-09-13 13:39:58"""
6,"""2006-04-27 13:11:12"""
7,"""2003-09-16 09:27:39"""
8,"""2003-09-15 05:27:34"""
9,"""2004-12-18 07:27:12"""
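
The values above still carry the double quotes left over from the raw export, as do the `"N` placeholders seen in earlier checks. One possible cleanup, shown here as a hypothetical plain-Python helper rather than as a step of this notebook, is to strip the quotes before any date parsing. In PySpark a column-level equivalent could be built with `F.regexp_replace`, although that is an assumption about how one might proceed and not something this notebook does at this point.

```python
def strip_quotes(value):
    """Remove leading and trailing double quotes; leave missing values untouched."""
    if value is None:
        return None
    return value.strip('"')

print(strip_quotes('"2005-12-07 15:21:29"'))
print(strip_quotes('"N'))
```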


In [93]:
check_missing(ps_t5_1_1.select('timecreated'))

+-----------+
|timecreated|
+-----------+
|          0|
+-----------+



---
_Add column_
> - gameid

In [94]:
ps_t5_1_1 = ps_t5_1_1.withColumn('gameid', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2), F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) , F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')!=1) , F.col('c10_2')))
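
Chains this long are hard to review by eye. As a cross-check, the target of any branch can be derived from the split counts alone: walk through the columns in order and subtract each count until the wanted position falls inside a column. The sketch below is plain Python; the function `locate` and the position constant are hypothetical names, and positions are counted from `lastlogoff` (0) in the order in which this notebook adds the columns.

```python
def locate(split_counts, position, first_col=9):
    """Return the split-column name that holds the piece at `position` (0-based).

    split_counts: number of pieces in _c<first_col>, _c<first_col + 1>, ... in order.
    first_col: 9 when _c1 does not contain 'http', 8 when it does.
    """
    remaining = position
    for offset, n_pieces in enumerate(split_counts):
        if remaining < n_pieces:
            return f"c{first_col + offset}_{remaining + 1}"
        remaining -= n_pieces
    raise ValueError("position lies beyond the supplied columns")

# lastlogoff=0, commentpermission=1, realname=2, primaryclanid=3, timecreated=4, gameid=5
GAMEID = 5

print(locate([1, 3, 2], GAMEID))                # nSplit_c9=1, c10=3, c11=2
print(locate([2, 3, 2], GAMEID))                # nSplit_c9=2, c10=3
print(locate([2, 2, 2], GAMEID, first_col=8))   # 'http' rows start one column earlier
```

The three calls reproduce the targets of the corresponding `gameid` branches above (`c11_2`, `c11_1` and `c10_2`).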

In [95]:
ps_t5_1_1.select('gameid').distinct().count()

1

In [96]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t5_1_1.select('gameid').distinct().toPandas()[0:50]

Unnamed: 0,gameid
0,"""N"


In [97]:
check_missing(ps_t5_1_1.select('gameid'))

+------+
|gameid|
+------+
|     0|
+------+



---
_Add column_
> - gameserverip

In [98]:
ps_t5_1_1 = ps_t5_1_1.withColumn('gameserverip', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) , F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) , F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) , F.col('c11_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) , F.col('c10_3')))

In [99]:
ps_t5_1_1.select('gameserverip').distinct().count()

1

In [100]:
# Use this code to check if 'gameserverip' was parsed correctly.
ps_t5_1_1.select('gameserverip').distinct().toPandas()[0:50]

Unnamed: 0,gameserverip
0,"""N"


In [101]:
check_missing(ps_t5_1_1.select('gameserverip'))

+------------+
|gameserverip|
+------------+
|           0|
+------------+



---
_Add column_
> - gameextrainfo

In [103]:
ps_t5_1_1 = ps_t5_1_1.withColumn('gameextrainfo', F.when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), F.col('_c16')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)  & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((~F.col('_c1').contains('http')) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1)& (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==3) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==1) & (F.col('nSplit_c13')==2), F.col('c13_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==2), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) & (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==1), F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1)& (F.col('nSplit_c12')==2), F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2), F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1) , F.col('c13_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2) , F.col('c12_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2) , F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1) , F.col('c12_1')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2) , F.col('c11_2')).\
                                 when((F.col('_c1').contains('http')) & (F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==3) , F.col('c11_1')))

In [104]:
ps_t5_1_1.select('gameextrainfo').distinct().count()

1

In [105]:
# Use this code to check if 'gameextrainfo' was parsed correctly.
ps_t5_1_1.select('gameextrainfo').distinct().toPandas()[0:50]

Unnamed: 0,gameextrainfo
0,"""N"


In [106]:
check_missing(ps_t5_1_1.select('gameextrainfo'))

+-------------+
|gameextrainfo|
+-------------+
|            0|
+-------------+
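
The `when` chains for `gameserverip` and `gameextrainfo` enumerate one branch per combination of split counts. In the branches shown here, each chain appears to pick a fixed position in the comma-flattened row: the 7th value for `gameserverip` and the 8th for `gameextrainfo`, counting from `_c9` (or from `_c8` when `_c1` contains `http`). A minimal plain-Python sketch of that idea follows. The helper names `flatten_cells` and `pick_field` and the sample row are hypothetical, and the sketch assumes every comma in these cells is a field separator.

```python
from typing import List, Optional


def flatten_cells(cells: List[Optional[str]], sep: str = ",") -> List[str]:
    """Split every non-null cell on `sep` and concatenate the pieces in order."""
    parts: List[str] = []
    for cell in cells:
        if cell is None:
            continue  # a null cell contributes no values
        parts.extend(cell.split(sep))
    return parts


def pick_field(cells: List[Optional[str]], position: int) -> Optional[str]:
    """Return the value at 1-based `position` in the flattened row, or None."""
    parts = flatten_cells(cells)
    return parts[position - 1] if 0 < position <= len(parts) else None


# Hypothetical cells standing in for four consecutive raw columns of one row.
row = ['"N,"N', "103582791429521408", '"N,"N,"N', "2013-02-28 14:19:00"]
third_value = pick_field(row, 3)     # value after the two pieces of the first cell
missing_value = pick_field(row, 8)   # the row only flattens to 7 values
```

Indexing by position removes the need to list every split pattern, but it relies on the same assumption as the `nSplit_*` counts: a comma inside a real value would shift every later field.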



---
_Add column_
> - cityid

In [108]:
ps_t5_1_1 = ps_t5_1_1.withColumn('cityid', F.lit('N'))

In [109]:
# Use this code to check if 'cityid' was filled correctly.
ps_t5_1_1.select('cityid').distinct().toPandas()[0:50]

Unnamed: 0,cityid
0,N


In [110]:
ps_t5_1_1.count()

1106

---
_Check the pattern of _c16, _c17, _c18. If they are all None, then the remaining columns can be filled with Null_

In [116]:
ps_t5_1_1.toPandas()[['_c13','_c14','_c15','_c16', '_c17', '_c18','_c19']]

Unnamed: 0,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,"""N,""N","""N,""N","""N,""2013-03-06 17:32:16""",,,,
1,"""N,""N","""N,""N","""N,""2013-02-28 14:34:16""",,,,
2,"""N,""N","""N,""N","""N,""2013-02-28 14:34:18""",,,,
3,"""N,""N","""N,""N","""N,""2013-02-28 14:31:53""",,,,
4,"""N,""N","""N,""N","""N,""2013-02-28 14:31:54""",,,,
...,...,...,...,...,...,...,...
1101,"""N,""N","""N,""N","""N,""2013-03-02 02:20:04""",,,,
1102,"""N,""N","""N,""N","""N,""2013-03-14 13:16:10""",,,,
1103,"""N,""N","""N,""N","""N,""2013-02-28 14:26:30""",,,,
1104,"""N,""N","""N,""N","""N,""2013-02-28 14:26:12""",,,,
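
Eyeballing the pandas output works for 1,106 rows, but the same check can be expressed as a count of non-null values per column. The sketch below uses plain Python dictionaries in place of Spark rows. The helper name `non_null_counts` is hypothetical, and the two sample rows are hand-copied from the output above with empty cells assumed to be `None`.

```python
from typing import Dict, Iterable, List, Optional

Row = Dict[str, Optional[str]]


def non_null_counts(rows: Iterable[Row], columns: List[str]) -> Dict[str, int]:
    """Count, for each column, how many rows hold a non-null value."""
    counts = {c: 0 for c in columns}
    for row in rows:
        for c in columns:
            if row.get(c) is not None:
                counts[c] += 1
    return counts


columns = ["_c15", "_c16", "_c17", "_c18", "_c19"]
sample = [
    {"_c15": '"N,"2013-03-06 17:32:16"', "_c16": None, "_c17": None, "_c18": None, "_c19": None},
    {"_c15": '"N,"2013-02-28 14:34:16"', "_c16": None, "_c17": None, "_c18": None, "_c19": None},
]

counts = non_null_counts(sample, columns)
# Columns with zero non-null values carry no information for this subset.
empty_columns = [c for c in columns if counts[c] == 0]
```

If `empty_columns` covers `_c16` through `_c19`, filling the location columns with a constant, as the next cell does, loses nothing for this subset.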


In [117]:
ps_t5_1_1 = ps_t5_1_1.withColumn('loccountrycode',F.lit('N'))
ps_t5_1_1 = ps_t5_1_1.withColumn('locstatecode',F.lit('N'))                            

In [118]:
ps_t5_1_1.select('loccountrycode').distinct().count()

1

In [119]:
# Use this code to check if 'loccountrycode' was filled correctly.
ps_t5_1_1.select('loccountrycode').distinct().toPandas()[0:50]

Unnamed: 0,loccountrycode
0,N


---
### 5.5.3a Format, rename columns and save Table

In [120]:
ps_t5_1_1.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c1: integer (nullable = false)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer 

In [125]:
# Drop unnecessary columns
ps_t5_1_1 = ps_t5_1_1.drop('nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','len_c9',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c12_1','c12_2','c13_1','c13_2','_c4','_c5',\
                    'c1_1','c1_2','c14_1','c14_2','c15_1','c15_2','c8_1','c8_2','c9_1','c9_2','_c1','_c2','_c3',\
                    'c10_1','c10_2','c10_3','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19')

In [127]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t5_1_1 = rename_col(ps_t5_1_1, newColumns)

In [128]:
ps_t5_1_1.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = false)
 |-- country_code: string (nullable = false)
 |-- state_code: string (nullable = false)
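
`rename_col` is defined elsewhere in the notebook and is not shown in this section; it presumably renames columns by position. Positional renaming silently mislabels data if the column count or order differs from what `newColumns` expects, so a small guard can help. The sketch below is only an illustration: `build_rename_map` is a hypothetical helper, and the old names are an assumed subset of the columns present before the rename.

```python
from typing import Dict, List


def build_rename_map(old_names: List[str], new_names: List[str]) -> Dict[str, str]:
    """Pair old and new column names by position, refusing ambiguous input."""
    if len(old_names) != len(new_names):
        raise ValueError(
            f"expected {len(old_names)} new names, got {len(new_names)}"
        )
    if len(set(new_names)) != len(new_names):
        raise ValueError("new column names must be unique")
    return dict(zip(old_names, new_names))


# Assumed subset of the pre-rename columns, in DataFrame order.
old = ["_c0", "gameserverip", "gameextrainfo", "cityid", "loccountrycode", "locstatecode"]
new = ["steam_id", "gameserver_ip", "game_extrainfo", "city_id", "country_code", "state_code"]
rename_map = build_rename_map(old, new)
```

Each pair in `rename_map` could then be applied one at a time with `withColumnRenamed`, which makes the old-to-new correspondence explicit.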



In [129]:
# Remove the double-quote characters left over from the raw export
col_list = ['profile_state','community_visibility_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t5_1_1 = ps_t5_1_1.withColumn(i, F.regexp_replace(i, '"', ""))

In [130]:
ps_t5_1_1.limit(50).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197961041872,"tHe New Era' MacBook Pro 3,6GHz",http://steamcommunity.com/profiles/76561197961...,0,1,1,2013-01-24 17:27:53,N,N,N,N,N,N,N,N,N,N
1,76561197961165168,"冷蔵庫 `_^d3f (,;,;,)",http://steamcommunity.com/profiles/76561197961...,0,1,1,2013-02-14 03:31:15,N,N,N,N,N,N,N,N,N,N
2,76561197969633313,"Mom, I grow bandits ...",http://steamcommunity.com/profiles/76561197969...,0,1,1,2013-02-16 00:32:32,N,N,N,N,N,N,N,N,N,N
3,76561197980045879,"Lock, Fuck Life!",http://steamcommunity.com/id/Mnstrcat/,0,1,1,2013-02-19 03:36:45,N,N,N,N,N,N,N,N,N,N
4,76561197976133631,"69°N-28°E .`,",http://steamcommunity.com/profiles/76561197976...,0,1,1,2013-02-18 09:13:35,N,N,N,N,N,N,N,N,N,N
5,76561197976434449,"DEVIL>_<,.",http://steamcommunity.com/profiles/76561197976...,0,1,1,2010-09-25 11:38:49,N,N,N,N,N,N,N,N,N,N
6,76561197980707823,"Mami,s Kleiner Nuttenwurm",http://steamcommunity.com/profiles/76561197980...,0,1,1,2013-03-01 08:55:14,N,N,N,N,N,N,N,N,N,N
7,76561197980172071,"Bram,, |BRamBoNL|",http://steamcommunity.com/id/brambo90/,0,1,1,2012-09-14 14:22:14,N,N,N,N,N,N,N,N,N,N
8,76561197980194293,",@,e",http://steamcommunity.com/profiles/76561197980...,0,1,1,2013-03-05 23:32:35,N,N,N,N,N,N,N,N,N,N
9,76561197962799333,"Sleepa, Creepa",http://steamcommunity.com/profiles/76561197962...,0,1,1,2013-03-06 15:57:30,N,N,N,N,N,N,N,N,N,N
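
The quote-stripping step in `In [129]` removes every `"` character from the listed columns. Its effect on the raw values seen earlier (for example `"N` and `"2013-03-06 17:32:16"`) can be checked with a plain-Python equivalent. This is a sketch only: `strip_quotes` is a hypothetical helper, and returning `None` for a null input is an assumption about how the Spark function treats nulls.

```python
import re
from typing import Optional


def strip_quotes(value: Optional[str]) -> Optional[str]:
    """Remove every double-quote character; pass None through unchanged."""
    if value is None:
        return None
    return re.sub('"', "", value)


# Raw values of the kind shown in the earlier cell outputs.
raw_values = ['"N', '"2013-03-06 17:32:16"', "N"]
cleaned = [strip_quotes(v) for v in raw_values]
```

Because the pattern matches quotes anywhere in the string, a quote that legitimately belongs to a value such as `real_name` would also be removed.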


In [131]:
# Save TABLE 5_1_1
ps_t5_1_1.write.csv('/user/tamng/jwht/CleanData/ps_t5_1_1.csv', header = True)

---
### TABLE 5_1_2

ps_t5_1_2 = ps_t5_1.filter(F.col('nSplit_c1') == 1)

### 5.5.1b Filter data and check number of Split

In [132]:
ps_t5_1_2.count()

3552867

In [133]:
ps_t5_1_2.limit(20).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197970619367,iangel,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
1,76561197970619889,CrazyIvan75,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,2013-02-07 15:27:57,...,2,2,2,2,2,2,-1,-1,-1,-1
2,76561197970620385,usdsk8er2003,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
3,76561197970620458,fireintiger,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
4,76561197970620685,vampire_of_dusseldorf,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
5,76561197970621298,webgap,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
6,76561197970621362,timobeni,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
7,76561197970621876,Xyster,http://steamcommunity.com/id/xystro/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,2013-02-08 19:53:53,...,2,2,2,2,2,2,-1,-1,-1,-1
8,76561197970622727,graphics,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
9,76561197970622903,*><* DaNiJeL *><*,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,2013-02-02 09:25:35,...,2,2,2,2,2,2,-1,-1,-1,-1


In [134]:
for i in check_list:
    print('Distinct value for each column:\n')
    # show() prints the table and returns None, so it is not wrapped in print()
    ps_t5_1_2.groupBy(i).count().show()

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|        1|3552867|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|3552867|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1| 463385|
|        2|3089482|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1| 454563|
|        3|     23|
|        2|3098281|
+---------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         3|    540|
|         2|3552327|
+----------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|         2|3552867|
+----------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c12|  count|


In [135]:
ps_t5_1_2.filter(F.col('nSplit_c9')==3).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,nSplit_c10,nSplit_c11,nSplit_c12,nSplit_c13,nSplit_c14,nSplit_c15,nSplit_c16,nSplit_c17,nSplit_c18,nSplit_c19
0,76561197973080426,shN dno iz dna iz css,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
1,76561197970971266,quoc.hoang,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
2,76561197966052466,bram,http://steamcommunity.com/profiles/76561197966...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,1,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
3,76561197964008436,personne,http://steamcommunity.com/profiles/76561197964...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
4,76561197976070247,Fraktsiya,http://steamcommunity.com/profiles/76561197976...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
5,76561197965174845,mman71,http://steamcommunity.com/profiles/76561197965...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
6,76561197967599201,IV THA POVERTY,http://steamcommunity.com/profiles/76561197967...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,1,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
7,76561197974026086,forssi,http://steamcommunity.com/profiles/76561197974...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
8,76561197972799272,eigler,http://steamcommunity.com/profiles/76561197972...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,2,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1
9,76561197983055084,keberk0,http://steamcommunity.com/profiles/76561197983...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,"""N,1,""N",...,2,2,2,2,2,-1,-1,-1,-1,-1


In [137]:
ps_t5_1_2.filter(F.col('nSplit_c10')==3).toPandas()[['_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16', '_c17', '_c18','_c19']]

Unnamed: 0,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,3,1,"""N,""N","""N,103582791429521408,""2004-06-12 06:29:06""","""N,""N","""N,""N",DZ,"""N,""N",2013-02-28 14:36:28,,,,
1,3,1,"""N,""N","""N,103582791429521408,""2003-09-19 20:11:52""","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:17""",,,,,
2,3,"""N,""2013-02-12 11:51:09""",1,"""N,103582791429521408,""2004-06-07 02:50:15""","""N,""N","""N,""N",US,"""N,""N",2013-02-28 14:36:27,,,,
3,3,1,"""N,""N","""N,103582791429521408,""2005-01-23 04:19:41""","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:30""",,,,,
4,3,"""N,""2013-01-07 20:19:40""",1,"""N,103582791429521408,""2003-09-25 23:07:15""","""N,""N","""N,""N",US,TX,3620,2013-02-28 14:19:47,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
535,3,1,"""N,""N","""N,103582791429521408,""2005-01-25 13:40:03""","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:31""",,,,,
536,3,"""N,""2011-06-27 22:29:25""",1,"""N,103582791429521408,""2006-09-10 06:42:18""","""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:30:31""",,,,,
537,3,"""N,""2011-08-28 18:21:54""",1,"""N,103582791429521408,""2005-12-25 21:38:23""","""N,""N","""N,""N",CA,"""N,""N",2013-03-01 10:52:31,,,,
538,3,1,"""N,""N","""N,103582791429521408,""2003-10-18 14:50:04""","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:32""",,,,,


In [229]:
ps_t5_2 = ps_t5_1_2.filter((F.col('nSplit_c9')!=3) & (F.col('nSplit_c10')!=3))
ps_t5_2.count()

3552304
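
The counts above can be cross-checked with simple arithmetic. The filter removes 3552867 - 3552304 = 563 rows, and the earlier distinct-value tables report 23 rows with nSplit_c9 == 3 and 540 rows with nSplit_c10 == 3. Since 23 + 540 = 563, the two outlier groups do not overlap. A small sketch of that check, using only the numbers printed in this notebook:

```python
total_before = 3552867   # ps_t5_1_2.count()
total_after = 3552304    # ps_t5_2.count()
n_c9_eq_3 = 23           # rows with nSplit_c9 == 3
n_c10_eq_3 = 540         # rows with nSplit_c10 == 3

removed = total_before - total_after

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
overlap = n_c9_eq_3 + n_c10_eq_3 - removed

print(removed)   # 563
print(overlap)   # 0
```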

In [145]:
for i in check_list:
    print('Distinct value for each column:\n')
    # show() prints the table and returns None, so it is not wrapped in print()
    ps_t5_2.groupBy(i).count().show()

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|        1|3552304|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|3552304|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1| 463139|
|        2|3089165|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1| 454223|
|        2|3098081|
+---------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         2|3552304|
+----------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|         2|3552304|
+----------+-------+

Distinct value for each column:

+----------+-------+
|nSplit_c12|  count|
+----------+-------+
|         2|3552304|

In [230]:
# Define the columns to split and the names of the new columns each one produces

col_c8 = ['_c8']
newcols_c8 = ['c8_1', 'c8_2']

col_c9 = ['_c9']
newcols_c9 = ['c9_1', 'c9_2']

col_c10 = ['_c10']
newcols_c10 = ['c10_1', 'c10_2', 'c10_3']

col_c11 = ['_c11']
newcols_c11 = ['c11_1', 'c11_2']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

# Apply split function
ps_t5_2 = split_2_column(ps_t5_2, col_c8, newcols_c8)
ps_t5_2 = split_2_column(ps_t5_2, col_c9, newcols_c9)
ps_t5_2 = split_3_column(ps_t5_2, col_c10, newcols_c10)
ps_t5_2 = split_2_column(ps_t5_2, col_c11, newcols_c11)
ps_t5_2 = split_2_column(ps_t5_2, col_c12, newcols_c12)
ps_t5_2 = split_2_column(ps_t5_2, col_c13, newcols_c13)
ps_t5_2 = split_2_column(ps_t5_2, col_c14, newcols_c14)
ps_t5_2 = split_2_column(ps_t5_2, col_c15, newcols_c15)
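
The helpers `split_2_column` and `split_3_column` are defined earlier in the notebook and are not reproduced here. As a rough illustration of what a fixed-width split does to a single value, the hypothetical function `split_fixed` below splits on commas and pads with `None`, so every row yields the same number of new columns. Dropping any extra pieces is an assumption of this sketch, not a documented behavior of the notebook's helpers.

```python
def split_fixed(value, n_parts):
    """Split a comma-separated string into exactly n_parts pieces, padding with None."""
    if value is None:
        return [None] * n_parts
    pieces = value.split(',')
    # Pad short results; pieces beyond n_parts are dropped in this sketch.
    return (pieces + [None] * n_parts)[:n_parts]

print(split_fixed('"N,"2013-02-28 14:34:23"', 2))
print(split_fixed('1', 2))
print(split_fixed(None, 3))
```

This matches the pattern visible in the output that follows: a column holding two values fills both new columns, while a column holding one value leaves the second new column empty.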

In [147]:
ps_t5_2.limit(20).toPandas()

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,...,c11_1,c11_2,c12_1,c12_2,c13_1,c13_2,c14_1,c14_2,c15_1,c15_2
0,76561197961495942,sortelli,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:34:23""",,
1,76561197961496043,knollo111,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50""",,
2,76561197961496053,great_sim,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50""",,
3,76561197961496921,kellerdaughter,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50""",,
4,76561197961497573,wildgus,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50""",,
5,76561197961497817,nikkibroda,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50""",,
6,76561197961502157,w3x1n,http://steamcommunity.com/id/w3x1n/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,1,2012-01-30 21:24:10,...,"""N","""N","""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:50"""
7,76561197961507338,david_c21,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:34:23""",,
8,76561197961511362,kamasutra0878,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:34:23""",,
9,76561197961520021,l_brigham00,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N",...,"""N","""N","""N","""N","""N","""N","""N","""2013-02-28 14:19:51""",,


In [190]:
ps_t5_2.limit(50).toPandas()[['_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16']]

Unnamed: 0,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16
0,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:42""",,
1,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:42""",,
2,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""2013-02-18 17:50:47""","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:42""",,
3,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:38""",,
4,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:42""",,
5,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:42""",,
6,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:38""",,
7,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:38""",,
8,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-03-06 14:39:18""",,
9,http://media.steampowered.com/steamcommunity/p...,0,1,"""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:38""",,


---
### 5.5.2b Adding columns and condition
From this step on, we add the final columns one by one, using the nSplit columns to decide which split piece holds each field.

_Add columns_

In [231]:
ps_t5_2 = ps_t5_2.withColumn('profilestate', F.col('c8_1'))


ps_t5_2 = ps_t5_2.withColumn('lastlogoff', F.when((F.col('nSplit_c8')==1),F.col('c9_1')).\
                             when((F.col('nSplit_c8')==2),F.col('c8_2')))


ps_t5_2 = ps_t5_2.withColumn('commentpermission', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1),F.col('c10_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2),F.col('c9_2')).\
                             when((F.col('nSplit_c8')==2) , F.col('c9_1')))
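
The `when` chains above encode a positional rule: the fields always appear in the same order (profilestate, lastlogoff, commentpermission, and so on), but each raw column may hold one or two of them. One way to see this, sketched in plain Python under the assumption that pieces are separated by single commas, is to flatten the raw columns back into one list and name the pieces by position. `FIELD_ORDER` and `extract_fields` are illustrative names, not part of the notebook.

```python
FIELD_ORDER = ['profilestate', 'lastlogoff', 'commentpermission']

def extract_fields(raw_columns):
    """Flatten the comma-separated raw columns (_c8, _c9, ...) and name pieces by position."""
    pieces = []
    for value in raw_columns:
        if value is not None:
            pieces.extend(value.split(','))
    return dict(zip(FIELD_ORDER, pieces))

# nSplit_c8 == 2: _c8 holds profilestate and lastlogoff, _c9 starts with commentpermission.
row_a = ['"N,"2013-02-18 17:50:47"', '"N,"N', '"N,"N']

# nSplit_c8 == 1 and nSplit_c9 == 1: one field per raw column.
# The third value is made up for illustration.
row_b = ['1', '2013-02-07 15:27:57', '1']

print(extract_fields(row_a))
print(extract_fields(row_b))
```

In PySpark a similar effect could probably be obtained by joining the raw columns with `F.concat_ws(',', ...)`, splitting once with `F.split`, and reading pieces with `getItem`. Either way the approach breaks if a field such as a real name contains a comma itself, which is one reason the notebook checks the nSplit distributions first.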

In [232]:
# Drop rows where c8_1 contains '20' (likely a timestamp sitting in the profile-state slot)
ps_t5_2 = ps_t5_2.filter((~F.col('c8_1').contains('20')))
ps_t5_2.count()

3552302
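
The filter above drops rows whose `c8_1` contains the substring '20', which in this table appears to flag a timestamp that landed in the profile-state slot. A substring test would also match any other value containing '20', so a stricter pattern may be safer on other slices of the data. A minimal sketch with Python's `re` module follows; the pattern and the helper name are assumptions, not the notebook's code.

```python
import re

# Optional leading/trailing quote around 'YYYY-MM-DD HH:MM:SS'.
TIMESTAMP_RE = re.compile(r'^"?\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}"?$')

def looks_like_timestamp(value):
    """Return True only when the whole value is a timestamp, optionally quoted."""
    return value is not None and TIMESTAMP_RE.match(value) is not None

for v in ['1', '"N', '"2013-02-12 11:51:09"', '2013-02-07 15:27:57', '20']:
    print(repr(v), looks_like_timestamp(v))
```

On a Spark column, `Column.rlike` accepts a regular expression of this kind, so the same pattern could replace the `contains('20')` test if needed.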

In [203]:
check_missing(ps_t5_2.select('profilestate'))

+------------+
|profilestate|
+------------+
|           0|
+------------+



In [204]:
ps_t5_2.select('profilestate').distinct().count()

5

In [205]:
# Use this code to check if 'profilestate' was parsed correctly.
ps_t5_2.select('profilestate').distinct().toPandas()[0:50]

Unnamed: 0,profilestate
0,3
1,1
2,4
3,"""N"
4,2


In [206]:
ps_t5_2.select('lastlogoff').distinct().count()

486696

In [207]:
# Use this code to check if 'lastlogoff' was parsed correctly.
ps_t5_2.select('lastlogoff').distinct().toPandas()[0:50]

Unnamed: 0,lastlogoff
0,2013-02-02 14:45:11
1,"""2011-08-16 18:58:09"""
2,2013-02-26 00:09:48
3,2013-02-28 21:33:53
4,2013-02-18 19:06:40
5,2013-02-12 06:32:45
6,2011-06-26 03:57:35
7,2013-02-13 16:35:03
8,2013-01-23 06:01:03
9,"""2013-02-17 12:15:15"""


---
_Add column_
> - commentpermission

In [208]:
ps_t5_2.select('commentpermission').distinct().count()

3

In [209]:
# Use this code to check if 'commentpermission' was parsed correctly.
ps_t5_2.select('commentpermission').distinct().toPandas()[0:50]

Unnamed: 0,commentpermission
0,1
1,"""N"
2,2


---
_Add column_
> - realname

In [233]:
ps_t5_2 = ps_t5_2.withColumn('realname', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2),F.col('c10_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2),F.col('c10_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1),F.col('c10_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2),F.col('c9_2')))

In [211]:
ps_t5_2.select('realname').distinct().count()

1

In [212]:
# Use this code to check if 'realname' was parsed correctly.
ps_t5_2.select('realname').distinct().toPandas()[0:50]

Unnamed: 0,realname
0,"""N"


---
_Add column_
> - primaryclanid

In [234]:
ps_t5_2 = ps_t5_2.withColumn('primaryclanid', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1),F.col('c12_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2),F.col('c11_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2),F.col('c10_2')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2),F.col('c10_2')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2),F.col('c10_1')))

In [214]:
ps_t5_2.select('primaryclanid').distinct().count()

1

In [215]:
# Use this code to check if 'primaryclanid' was parsed correctly.
ps_t5_2.select('primaryclanid').distinct().toPandas()[0:50]

Unnamed: 0,primaryclanid
0,"""N"


In [216]:
check_missing(ps_t5_2.select('primaryclanid'))

+-------------+
|primaryclanid|
+-------------+
|            0|
+-------------+



---
_Add column_
> - timecreated

In [235]:
ps_t5_2 = ps_t5_2.withColumn('timecreated', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==1),F.col('c13_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1) & (F.col('nSplit_c12')==2),F.col('c12_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2),F.col('c12_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==1),F.col('c12_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2) & (F.col('nSplit_c11')==2),F.col('c11_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1),F.col('c12_1')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2),F.col('c11_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==1),F.col('c12_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==1) & (F.col('nSplit_c11')==2),F.col('c11_2')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==1) & (F.col('nSplit_c10')==2),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==1),F.col('c11_1')).\
                             when((F.col('nSplit_c8')==2) & (F.col('nSplit_c9')==2) & (F.col('nSplit_c10')==2),F.col('c10_2')))
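
With thirteen branches, the chain above is easy to get wrong by one column. A hedged way to check it offline is to write the branches down as a table and compare them, for every combination of split counts in {1, 2}, with the positional rule "timecreated is the sixth field counting from `_c8`". The names `BRANCHES`, `by_branches` and `by_position` are illustrative; the table is transcribed from the cell above.

```python
from itertools import product

# Required nSplit values for (_c8, _c9, _c10, _c11, _c12); None means "not tested".
BRANCHES = [
    ((1, 1, 1, 1, 1), 'c13_1'),
    ((1, 1, 1, 1, 2), 'c12_2'),
    ((1, 1, 1, 2, None), 'c12_1'),
    ((1, 1, 2, 1, None), 'c12_1'),
    ((1, 1, 2, 2, None), 'c11_2'),
    ((1, 2, 1, 1, None), 'c12_1'),
    ((1, 2, 1, 2, None), 'c11_2'),
    ((1, 2, 2, None, None), 'c11_1'),
    ((2, 1, 1, 1, None), 'c12_1'),
    ((2, 1, 1, 2, None), 'c11_2'),
    ((2, 1, 2, None, None), 'c11_1'),
    ((2, 2, 1, None, None), 'c11_1'),
    ((2, 2, 2, None, None), 'c10_2'),
]

def by_branches(counts):
    """Return the column picked by the first matching branch, like a when-chain."""
    for pattern, column in BRANCHES:
        if all(p is None or p == c for p, c in zip(pattern, counts)):
            return column
    return None

def by_position(counts, field_index=5):
    """Walk _c8, _c9, ... and locate the piece holding field_index (timecreated = 5)."""
    seen = 0
    for col_number, n in enumerate(counts, start=8):
        if field_index < seen + n:
            return f'c{col_number}_{field_index - seen + 1}'
        seen += n
    # All counted columns are used up; the field is a piece of the next column.
    return f'c{8 + len(counts)}_{field_index - seen + 1}'

mismatches = [c for c in product((1, 2), repeat=5)
              if by_branches(c) != by_position(c)]
print(mismatches)  # []
```

The empty list means the branch table and the positional rule agree on all 32 combinations. The check only covers split counts of 1 and 2, which is what remains in this table after the rows with three splits were filtered out.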

In [218]:
ps_t5_2.select('timecreated').distinct().count()

1

In [219]:
# Use this code to check if 'timecreated' was parsed correctly.
ps_t5_2.select('timecreated').distinct().toPandas()[0:50]

Unnamed: 0,timecreated
0,"""N"


In [220]:
check_missing(ps_t5_2.select('timecreated'))

+-----------+
|timecreated|
+-----------+
|          0|
+-----------+



---
_Add column_
> - gameid
> - gameserverip
> - gameextrainfo
> - cityid
> - loccountrycode
> - locstatecode

In [236]:
ps_t5_2 = ps_t5_2.withColumn('gameid',F.lit('N'))
ps_t5_2 = ps_t5_2.withColumn('gameserverip',F.lit('N'))
ps_t5_2 = ps_t5_2.withColumn('gameextrainfo',F.lit('N'))
ps_t5_2 = ps_t5_2.withColumn('cityid',F.lit('N'))
ps_t5_2 = ps_t5_2.withColumn('loccountrycode',F.lit('N'))
ps_t5_2 = ps_t5_2.withColumn('locstatecode',F.lit('N'))

---
### 5.5.3b Format, rename columns and save Table

In [237]:
ps_t5_2.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c1: integer (nullable = false)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer 

In [238]:
# Drop unnecessary columns
ps_t5_2 = ps_t5_2.drop('nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','len_c9',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c12_1','c12_2','c13_1','c13_2','_c4','_c5',\
                    'c14_1','c14_2','c15_1','c15_2','c8_1','c8_2','c9_1','c9_2','_c3',\
                    'c10_1','c10_2','c10_3','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19')

In [240]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t5_2 = rename_col(ps_t5_2, newColumns)
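
`rename_col` is defined earlier in the notebook. Renaming by position only works if the number and order of the remaining columns match the new names, so a small guard can catch a forgotten `drop`. The sketch below is plain Python with a hypothetical helper `rename_positionally`; the list of old names is what should remain after the drop in the previous cell.

```python
def rename_positionally(old_names, new_names):
    """Map old column names to new ones by position; refuse mismatched lengths."""
    if len(old_names) != len(new_names):
        raise ValueError(f'expected {len(old_names)} names, got {len(new_names)}')
    return dict(zip(old_names, new_names))

old_columns = ['_c0', '_c1', '_c2', '_c6', '_c7',
               'profilestate', 'lastlogoff', 'commentpermission', 'realname',
               'primaryclanid', 'timecreated', 'gameid', 'gameserverip',
               'gameextrainfo', 'cityid', 'loccountrycode', 'locstatecode']

new_columns = ['steam_id', 'person_name', 'profile_url', 'person_state',
               'community_visibility_state', 'profile_state', 'last_logoff',
               'comment_permission', 'real_name', 'primary_clanid',
               'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo',
               'city_id', 'country_code', 'state_code']

mapping = rename_positionally(old_columns, new_columns)
print(mapping['_c6'], mapping['locstatecode'])
```

PySpark's built-in `DataFrame.toDF(*names)` also renames by position, so it may serve the same purpose as a custom helper.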

In [241]:
ps_t5_2.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = false)
 |-- gameserver_ip: string (nullable = false)
 |-- game_extrainfo: string (nullable = false)
 |-- city_id: string (nullable = false)
 |-- country_code: string (nullable = false)
 |-- state_code: string (nullable = false)



In [243]:
# Remove stray double-quote characters from the parsed columns
col_list = ['profile_state','community_visibility_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',
            'last_logoff','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t5_2 = ps_t5_2.withColumn(i, regexp_replace(i, '"', ""))

In [244]:
ps_t5_2.limit(10).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197979722532,markkanitz,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
1,76561197979725205,acererack,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
2,76561197979726384,bss5552003,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
3,76561197979729455,silenthill2,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
4,76561197979730227,ouchmyknee,http://steamcommunity.com/profiles/76561197979...,0,1,N,2013-02-19 00:57:12,N,N,N,N,N,N,N,N,N,N
5,76561197979731226,suominen.hannu,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
6,76561197979731796,papuecas,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
7,76561197979732779,trailwalkers,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
8,76561197979732980,bender03,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
9,76561197979740173,dancehouse1,http://steamcommunity.com/profiles/76561197979...,0,1,N,N,N,N,N,N,N,N,N,N,N,N


In [246]:
ps_t5_2.count()

3552302

In [245]:
# Save TABLE 5_2
ps_t5_2.write.csv('/user/tamng/jwht/CleanData/ps_t5_2.csv', header = True)

---
## <font color = 'black'> TABLE 6: nSplit_c11 =1 & nSplit_c10 =1 </font>
<br>
nSplit_c11 =1 & nSplit_c10 =1

### 5.6.1 Filter data and check number of Split

In [30]:
ps_t6 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 1) & (player_summaries_drop.nSplit_c10 == 1))
ps_t6.count()

3347748

In [31]:
cols_check = ['_c1','_c7','_c8','_c9', '_c10', '_c11', '_c12', '_c13', '_c14','_c15','_c16','_c17','_c18','_c19']

tagname = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11', 'nSplit_c12', 'nSplit_c13', 'nSplit_c14',
           'nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

ps_t6 = add_count_split_column(ps_t6, cols_check, tagname)
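
`add_count_split_column` is defined earlier in the notebook. Judging from the outputs, each nSplit column holds the number of comma-separated pieces in the corresponding raw column, with -1 where the raw value is missing. That reading is an inference from the printed tables, and the helper `count_split` below is a hypothetical plain-Python sketch of it for a single row.

```python
def count_split(value, sep=','):
    """Number of sep-separated pieces in value; -1 stands in for a missing value."""
    if value is None:
        return -1
    return len(value.split(sep))

# Illustrative row built from values that appear in this notebook's outputs.
row = {
    '_c1': 'Mom, I grow bandits ...',
    '_c8': '"N,"2006-11-02 16:48:27"',
    '_c10': '103582791429521408',
    '_c16': None,
}

tags = {f'nSplit{col}': count_split(value) for col, value in row.items()}
print(tags)
```

A person name that contains commas inflates nSplit_c1, which is why that column shows values far above 2 in the distribution that follows.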

In [32]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    # show() prints the table and returns None, so it is not wrapped in print()
    ps_t6.groupBy(i).count().show()

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|       26|      1|
|       12|      3|
|        1|3344328|
|        6|      9|
|       16|      1|
|        3|    366|
|        5|     72|
|       15|      1|
|        9|     30|
|       17|      2|
|        4|     58|
|        8|      7|
|        7|     27|
|       10|      5|
|       45|      1|
|       11|      4|
|        2|   2833|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|       -1|      1|
|        1|3347747|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|  85316|
|        2|3262432|
+---------+-------+

Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|  84690|
|        3|     76|
|        2|3262982|
+---------+-------+

Distinct value for each column:

+----------

In [35]:
ps_t6.filter(F.col('nSplit_c1')==1).limit(50).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197973059911,KoS,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-11-02 16:48:27""","""N,""N",103582791429521408,2005-01-10 05:57:00,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:18""",,,,
1,76561197973060336,ehydrean,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2012-11-17 14:59:47""","""N,""N",103582791429521408,2005-01-10 17:35:44,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:28""",,,,
2,76561197973062097,mafiak2,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-11-02 15:13:13""","""N,""N",103582791429521408,2005-01-10 07:30:33,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:18""",,,,
3,76561197973063896,Krystal,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2011-10-21 01:12:29""","""N,""N",103582791429521408,2005-01-10 20:48:17,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:28""",,,,
4,76561197973065286,martin,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-09-06 22:40:28""","""N,""N",103582791429521408,2005-01-10 22:18:39,"""N,""N","""N,""N","""N,""N","""N,""2013-03-06 14:32:14""",,,,
5,76561197973065436,hivegaming14,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-09-06 17:36:05""","""N,""N",103582791429521408,2005-01-10 22:29:13,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:28""",,,,
6,76561197973066647,mugri,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2009-06-03 11:51:29""","""N,""N",103582791429521408,2005-01-10 10:05:22,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:19""",,,,
7,76561197973069313,mark,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-10-27 12:33:13""","""N,""N",103582791429521408,2005-01-10 11:22:25,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:19""",,,,
8,76561197973075516,jeffrey.bron,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2009-11-15 14:36:24""","""N,""N",103582791429521408,2005-01-11 09:03:33,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:28""",,,,
9,76561197973076304,rooobsn,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2010-06-18 03:56:56""","""N,""N",103582791429521408,2005-01-11 09:27:46,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:38:28""",,,,


---
### TABLE 6_1

nSplit_c1 == 1 & nSplit_c16 == -1

Total rows: 3235784

### 5.6.1a Filter data and check number of Split

In [43]:
ps_t6_1 = ps_t6.filter((F.col('nSplit_c1')==1) & (F.col('nSplit_c16')==-1))
ps_t6_1.count()

3235784

In [44]:
ps_t6_1.limit(50).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197961459420,manjamerda,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2009-01-18 04:38:36""","""N,""N",103582791429521408,2003-09-26 23:41:44,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:22""",,,,
1,76561197961462225,bhannie,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-05-05 20:12:31""","""N,""N",103582791429521408,2003-09-26 22:57:47,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:49""",,,,
2,76561197961465178,PCB|lehrbua,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-10-27 23:39:45""","""N,""N",103582791429521408,2003-09-27 03:24:06,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:22""",,,,
3,76561197961466888,falcon,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2012-06-09 12:57:03""","""N,""N",103582791429521408,2003-09-27 04:20:34,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:22""",,,,
4,76561197961469540,enossavior,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-09-29 13:35:15""","""N,""N",103582791429521408,2003-09-27 05:41:59,"""N,""N","""N,""N","""N,""N","""N,""2013-03-06 12:47:27""",,,,
5,76561197961470177,felipevmc,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-07-01 10:27:31""","""N,""N",103582791429521408,2003-09-27 03:55:42,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:49""",,,,
6,76561197961473372,Bitchmove0815,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2009-10-24 05:55:40""","""N,""N",103582791429521408,2003-09-27 07:33:08,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:22""",,,,
7,76561197961475070,cleansheet32,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-10-05 06:16:06""","""N,""N",103582791429521408,2003-09-27 08:14:04,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:34:22""",,,,
8,76561197961475659,-{BdS}- Papps,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-03-12 01:25:47""","""N,""N",103582791429521408,2003-09-27 06:33:57,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:49""",,,,
9,76561197961476241,vastavoima1,http://steamcommunity.com/profiles/76561197961...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2010-03-16 07:10:12""","""N,""N",103582791429521408,2003-09-27 06:52:00,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:49""",,,,


In [45]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t6_1.groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|        1|3235784|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|3235784|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|     79|
|        2|3235705|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        3|     73|
|        2|3235711|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         1|3235784|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|         1|3235784|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c12|  count|
+----------+-------+
|         2|3235784|

__Check how the rows with nSplit_c9 = 3 differ from the rest__

In [46]:
ps_t6_1.filter(F.col('nSplit_c9')==3).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197976977789,psykerer,http://steamcommunity.com/profiles/76561197976...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2005-07-14 21:46:42,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:30:50""",,,,
1,76561197972169545,ccd_882,http://steamcommunity.com/profiles/76561197972...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2004-12-22 21:51:01,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:28:27""",,,,
2,76561197977188742,allen,http://steamcommunity.com/profiles/76561197977...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,2,""N",103582791429521408,2005-07-24 20:30:45,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:39:44""",,,,
3,76561197979037870,minggao1,http://steamcommunity.com/profiles/76561197979...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2005-11-04 15:20:55,"""N,""N","""N,""N","""N,""N","""N,""2013-03-01 07:51:37""",,,,
4,76561197960405077,wolf,http://steamcommunity.com/profiles/76561197960...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2003-09-12 13:06:42,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:19:07""",,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
68,76561197972367471,dkaradean,http://steamcommunity.com/profiles/76561197972...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,2,""N",103582791429521408,2004-12-25 13:23:37,"""N,""N","""N,""N","""N,""N","""N,""2013-03-06 18:42:55""",,,,
69,76561197966311947,FULLGAMERZ,http://steamcommunity.com/profiles/76561197966...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,2,""N",103582791429521408,2004-05-05 06:18:58,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:23:54""",,,,
70,76561197973216731,l_eventreur,http://steamcommunity.com/profiles/76561197973...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2005-01-15 04:12:37,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:29:26""",,,,
71,76561197976553231,souza003,http://steamcommunity.com/profiles/76561197976...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,"""N,1,""N",103582791429521408,2005-06-19 19:54:32,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:30:46""",,,,


In [50]:
# Define the columns to split and the names of the new columns they produce

col_c8 = ['_c8']
newcols_c8 = ['c8_1', 'c8_2']


col_c9 = ['_c9']
newcols_c9 = ['c9_1', 'c9_2', 'c9_3']

col_c12 = ['_c12']
newcols_c12 = ['c12_1', 'c12_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']


# Apply split function
ps_t6_1 = split_2_column(ps_t6_1, col_c8, newcols_c8)
ps_t6_1 = split_3_column(ps_t6_1, col_c9, newcols_c9)
ps_t6_1 = split_2_column(ps_t6_1, col_c12, newcols_c12)
ps_t6_1 = split_2_column(ps_t6_1, col_c13, newcols_c13)
ps_t6_1 = split_2_column(ps_t6_1, col_c14, newcols_c14)
ps_t6_1 = split_2_column(ps_t6_1, col_c15, newcols_c15)
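The `split_2_column` and `split_3_column` helpers are defined elsewhere in the notebook and are not shown in this section. As a rough illustration of what such a split does to a single packed value, here is a plain-Python sketch. The function name `split_packed` and its padding behaviour are assumptions made for this example, not the notebook's actual implementation.

```python
def split_packed(value, n_parts, sep=","):
    """Split a packed field on `sep` and return exactly n_parts items.

    Hypothetical helper for illustration only: missing parts are padded
    with None, extra parts are dropped.
    """
    parts = [] if value is None else value.split(sep)
    # Pad with None so every row yields the same number of output columns.
    parts = parts + [None] * (n_parts - len(parts))
    return parts[:n_parts]


# Values copied from the sample rows above.
c8_1, c8_2 = split_packed('"N,"2006-11-02 16:48:27"', 2)
c9_1, c9_2, c9_3 = split_packed('"N,1,"N', 3)
```

Padding to a fixed width mirrors the idea that every row should end up with the same set of `c8_*` and `c9_*` columns, whatever its split count.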

---
### 5.6.2a Adding columns and condition
From this step on, we add the named columns, choosing the source column for each one based on the nSplit_* counts.

_Add columns_

In [62]:
ps_t6_1 = ps_t6_1.withColumn('personastate', F.when((F.col('_c1').contains('http')), F.col('_c5')).otherwise(F.col('_c6')))
ps_t6_1 = ps_t6_1.withColumn('communityvisibilitystate',F.when((F.col('_c1').contains('http')), F.col('_c6')).otherwise(F.col('_c7')))
ps_t6_1 = ps_t6_1.withColumn('profilestate', F.when((F.col('_c1').contains('http')), F.col('_c7')).otherwise(F.col('c8_1')))
ps_t6_1 = ps_t6_1.withColumn('lastlogoff', F.when((F.col('_c1').contains('http')), F.col('_c8')).\
                             when((F.col('nSplit_c8')==1),F.col('c9_1')).\
                             when((F.col('nSplit_c8')==2),F.col('c8_2')))
ps_t6_1 = ps_t6_1.withColumn('commentpermission', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==3),F.col('c9_2')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2),F.col('c9_1')).\
                             when((F.col('nSplit_c8')==2) , F.col('c9_1')))
ps_t6_1 = ps_t6_1.withColumn('realname', F.when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==3),F.col('c9_3')).\
                             when((F.col('nSplit_c8')==1) & (F.col('nSplit_c9')==2),F.col('c9_2')).\
                             when((F.col('nSplit_c8')==2) , F.col('c9_2')))
ps_t6_1 = ps_t6_1.withColumn('primaryclanid',F.col('_c10'))
ps_t6_1 = ps_t6_1.withColumn('timecreated',F.col('_c11'))
ps_t6_1 = ps_t6_1.withColumn('gameid',F.col('c12_1'))
ps_t6_1 = ps_t6_1.withColumn('gameserverip',F.col('c12_2'))
ps_t6_1 = ps_t6_1.withColumn('gameextrainfo',F.col('c13_1'))
ps_t6_1 = ps_t6_1.withColumn('cityid',F.col('c13_2'))
ps_t6_1 = ps_t6_1.withColumn('loccountrycode',F.col('c14_1'))
ps_t6_1 = ps_t6_1.withColumn('locstatecode',F.col('c14_2'))
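To make the branching above easier to follow, the sketch below replays the same rules on a single row in plain Python. It covers only the case where `_c1` is a name rather than a URL, and the function name `map_fields` and the helper `get` are assumptions introduced for this illustration.

```python
def get(parts, i):
    """Return parts[i], or None when the split produced fewer parts."""
    return parts[i] if i < len(parts) else None


def map_fields(c8, c9):
    """Replay the nSplit_c8 / nSplit_c9 rules for one row (non-URL _c1 case)."""
    c8_parts = c8.split(",")
    c9_parts = c9.split(",")
    n8, n9 = len(c8_parts), len(c9_parts)

    out = {
        "profilestate": c8_parts[0],  # c8_1
        "lastlogoff": None,
        "commentpermission": None,
        "realname": None,
    }
    if n8 == 2:
        # _c8 packs profilestate and lastlogoff; _c9 packs the next two fields.
        out["lastlogoff"] = get(c8_parts, 1)         # c8_2
        out["commentpermission"] = get(c9_parts, 0)  # c9_1
        out["realname"] = get(c9_parts, 1)           # c9_2
    elif n8 == 1 and n9 == 3:
        # _c8 holds profilestate alone; _c9 packs three fields.
        out["lastlogoff"] = get(c9_parts, 0)         # c9_1
        out["commentpermission"] = get(c9_parts, 1)  # c9_2
        out["realname"] = get(c9_parts, 2)           # c9_3
    elif n8 == 1 and n9 == 2:
        # Same column choices as the notebook's when() chain for this case.
        out["lastlogoff"] = get(c9_parts, 0)         # c9_1
        out["commentpermission"] = get(c9_parts, 0)  # c9_1
        out["realname"] = get(c9_parts, 1)           # c9_2
    return out


common = map_fields('"N,"2006-11-02 16:48:27"', '"N,"N')
rare = map_fields('1', '"N,1,"N')
```

Rows that match none of the branches keep `None`, which is also what a Spark `when` chain without `otherwise` produces.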

In [63]:
check_missing(ps_t6_1.select('personastate','communityvisibilitystate','profilestate','lastlogoff',\
                             'commentpermission','realname','primaryclanid','timecreated','gameid',\
                             'gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode'))

+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|personastate|communityvisibilitystate|profilestate|lastlogoff|commentpermission|realname|primaryclanid|timecreated|gameid|gameserverip|gameextrainfo|cityid|loccountrycode|locstatecode|
+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+
|           0|                       0|           0|         0|                0|       0|            0|          0|     0|           0|            0|     0|             0|           0|
+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+------------+-------------+------+--------------+------------+



In [64]:
ps_t6_1.limit(50).toPandas()[['personastate','communityvisibilitystate','profilestate','lastlogoff',\
                             'commentpermission','realname','primaryclanid','timecreated','gameid',\
                             'gameserverip','gameextrainfo','cityid','loccountrycode','locstatecode']]

Unnamed: 0,personastate,communityvisibilitystate,profilestate,lastlogoff,commentpermission,realname,primaryclanid,timecreated,gameid,gameserverip,gameextrainfo,cityid,loccountrycode,locstatecode
0,0,3,"""N","""2010-02-04 18:34:23""","""N","""N",103582791429521408,2004-03-28 11:41:21,"""N","""N","""N","""N","""N","""N"
1,0,3,"""N","""2012-10-26 13:04:09""","""N","""N",103582791429521408,2004-03-28 14:13:44,"""N","""N","""N","""N","""N","""N"
2,0,3,"""N","""2007-02-16 13:07:54""","""N","""N",103582791429521408,2004-03-28 21:42:25,"""N","""N","""N","""N","""N","""N"
3,0,3,"""N","""2007-11-17 03:08:17""","""N","""N",103582791429521408,2004-03-29 04:10:10,"""N","""N","""N","""N","""N","""N"
4,0,3,"""N","""2007-10-24 14:06:27""","""N","""N",103582791429521408,2004-03-29 08:32:50,"""N","""N","""N","""N","""N","""N"
5,0,3,"""N","""2009-02-10 15:17:53""","""N","""N",103582791429521408,2004-03-29 07:24:01,"""N","""N","""N","""N","""N","""N"
6,0,3,"""N","""2009-09-06 05:40:20""","""N","""N",103582791429521408,2004-03-29 09:00:22,"""N","""N","""N","""N","""N","""N"
7,0,3,"""N","""2012-05-24 15:32:14""","""N","""N",103582791429521408,2004-03-29 10:47:12,"""N","""N","""N","""N","""N","""N"
8,0,3,"""N","""2013-02-01 22:25:24""","""N","""N",103582791429521408,2004-03-29 11:00:54,"""N","""N","""N","""N","""N","""N"
9,0,3,"""N","""2007-03-10 11:22:18""","""N","""N",103582791429521408,2004-03-29 12:33:09,"""N","""N","""N","""N","""N","""N"


In [65]:
ps_t6_1.select('timecreated').distinct().count()

3123387

In [67]:
# Use this code to check if the 'timecreated' was parsed correctly.
ps_t6_1.select('timecreated').distinct().limit(50).toPandas()

Unnamed: 0,timecreated
0,2004-06-30 13:10:12
1,2004-07-14 00:22:01
2,2004-12-06 04:22:07
3,2005-01-15 08:54:17
4,2005-01-20 11:37:15
5,2004-12-30 19:43:49
6,2005-01-02 00:41:12
7,2003-09-22 12:28:19
8,2004-09-02 06:01:26
9,2004-09-05 12:51:17


In [68]:
ps_t6_1.select('gameid').distinct().count()

1

In [69]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t6_1.select('gameid').distinct().toPandas()[0:50]

Unnamed: 0,gameid
0,"""N"


In [70]:
# Use this code to check if the 'loccountrycode' was parsed correctly.
ps_t6_1.select('loccountrycode').distinct().toPandas()[0:50]

Unnamed: 0,loccountrycode
0,"""N"


---
### 5.6.3a Format, rename columns and save Table

In [71]:
ps_t6_1.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer (nullable = false)
 |-- nSplit_c13: integer

In [82]:
# Drop unnecessary columns
ps_t6_1 = ps_t6_1.drop('nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','len_c9',\
                    'nSplit_c18','nSplit_c19','c11_1','c11_2','c12_1','c12_2','c13_1','c13_2','_c4','_c5',\
                    'c14_1','c14_2','c15_1','c15_2','c8_1','c8_2','c9_1','c9_2','_c9_3','c9_3','_c3',\
                    '_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16','_c17','_c18','_c19')

In [80]:
from pyspark.sql import functions as sf
ps_t6_1 = ps_t6_1.withColumn('_c2', F.when(~(F.col('_c1').contains('http')), sf.concat(sf.lit('http://steamcommunity.com/profiles/'), sf.col('_c0'))).\
                             otherwise(F.col('_c2')))

In [84]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t6_1 = rename_col(ps_t6_1, newColumns)

In [85]:
ps_t6_1.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [87]:
# Replace " symbols in data
col_list = ['profile_state','community_visibility_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','game_extrainfo', 'city_id', 'country_code','state_code']

for i in col_list:
    ps_t6_1 = ps_t6_1.withColumn(i, F.regexp_replace(i, '"', ''))
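For readers who want to see the effect of the quote removal on a single record, here is a small plain-Python sketch. The function name `strip_quotes` and the dictionary-based row are assumptions made for this example; the notebook itself works on Spark DataFrame columns.

```python
def strip_quotes(row, columns):
    """Remove every double-quote character from the listed columns of one row."""
    cleaned = dict(row)
    for name in columns:
        value = cleaned.get(name)
        if value is not None:
            cleaned[name] = value.replace('"', '')
    return cleaned


# Sample values in the shape shown by the earlier output.
row = {
    "steam_id": "76561197960577009",
    "profile_state": '"N',
    "last_logoff": '"2006-10-10 17:52:40"',
    "real_name": None,
}
clean_row = strip_quotes(row, ["profile_state", "last_logoff", "real_name"])
```

Columns that are not listed, and values that are missing, pass through unchanged.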

In [88]:
ps_t6_1.limit(10).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197960577009,tisboishorty,http://steamcommunity.com/profiles/76561197960...,0,3,N,2006-10-10 17:52:40,N,N,103582791429521408,2003-09-14 02:33:40,N,N,N,N,N,N
1,76561197960578001,smokey_mcp0t42o,http://steamcommunity.com/profiles/76561197960...,0,3,N,2007-08-20 21:33:54,N,N,103582791429521408,2003-09-14 02:48:10,N,N,N,N,N,N
2,76561197960578142,caicai_a07,http://steamcommunity.com/profiles/76561197960...,0,3,N,2009-12-15 01:13:45,N,N,103582791429521408,2003-09-14 03:17:08,N,N,N,N,N,N
3,76561197960578543,popng_2003,http://steamcommunity.com/profiles/76561197960...,0,3,N,2007-07-08 19:00:55,N,N,103582791429521408,2003-09-14 02:55:48,N,N,N,N,N,N
4,76561197960578731,bsQ|Sprutludret,http://steamcommunity.com/profiles/76561197960...,0,3,N,2006-10-25 08:42:55,N,N,103582791429521408,2003-09-14 02:58:14,N,N,N,N,N,N
5,76561197960580301,flashmanrpr,http://steamcommunity.com/profiles/76561197960...,0,3,N,2013-02-14 20:12:26,N,N,103582791429521408,2003-09-14 03:19:23,N,N,N,N,N,N
6,76561197960581169,med1c,http://steamcommunity.com/profiles/76561197960...,0,3,N,2008-04-09 11:58:15,N,N,103582791429521408,2003-09-14 03:29:33,N,N,N,N,N,N
7,76561197960586538,R*( no team ),http://steamcommunity.com/profiles/76561197960...,0,3,N,2011-01-12 16:05:20,N,N,103582791429521408,2003-09-14 04:51:40,N,N,N,N,N,N
8,76561197960586771,lars28787,http://steamcommunity.com/profiles/76561197960...,0,3,N,2009-07-30 07:01:09,N,N,103582791429521408,2003-09-14 04:31:00,N,N,N,N,N,N
9,76561197960590798,der mächtige,http://steamcommunity.com/profiles/76561197960...,0,3,N,2007-09-21 13:17:38,N,N,103582791429521408,2003-09-14 05:33:38,N,N,N,N,N,N


In [89]:
# Save TABLE 6_1
ps_t6_1.write.csv('/user/tamng/jwht/CleanData/ps_t6_1.csv', header = True)

---
### TABLE 6_2

nSplit_c1 == 1 & nSplit_c16 != -1

Total rows: 108544

#### Save this table for later; it may or may not be used

In [92]:
ps_t6_2 = ps_t6.filter((F.col('nSplit_c1')==1) & (F.col('nSplit_c16')!=-1))
ps_t6_2.count()

108544

In [93]:
ps_t6_2.limit(20).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197974276104,®_SiN_©™,http://steamcommunity.com/id/blacksci/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-07-15 02:27:56,1,jeremy,103582791430996385,2005-02-17 22:39:29,"""N,""N","""N,""N",US,WA,4045,2013-02-28 14:38:54
1,76561197974277311,*Hellscar*,http://steamcommunity.com/id/hellscar54/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-17 16:51:34,1,alex,103582791430075728,2005-02-18 12:14:20,"""N,""N","""N,""N",UM,"""N,""N",2013-02-28 14:29:50,
2,76561197974288763,MDesh,http://steamcommunity.com/id/MDesh/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-06-16 14:10:20,1,MDesh,103582791429521408,2005-02-18 22:00:41,"""N,""N","""N,""N",RU,48,41460,2013-02-28 14:29:50
3,76561197974342163,Elpotapo,http://steamcommunity.com/profiles/76561197974...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2013-02-14 07:11:39""","""N,""N",103582791429521408,2005-02-20 15:55:46,"""N,""N","""N,""N",PL,"""N,""N",2013-02-28 14:29:51,,,
4,76561197974419239,dtd.RAINBOW WARRIOR,http://steamcommunity.com/id/phillipsinc/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,1,2013-02-18 16:43:52,1,Phillip,103582791431745514,2005-02-23 15:03:18,"""N,""N","""N,""N",US,NV,2373,2013-02-28 14:29:52
5,76561197974517519,Hyanar,http://steamcommunity.com/id/danscottbrown/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-18 15:06:06,2,Dan,103582791433947157,2005-02-27 12:14:01,"""N,""N","""N,""N",GB,G6,16924,2013-02-28 14:29:53
6,76561197974551258,Office Ninja,http://steamcommunity.com/id/mr_meyer/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,4,3,1,2013-02-18 19:15:59,1,John Miller,103582791429586353,2005-02-27 20:10:04,"""N,""N","""N,""N",CA,AB,"""N,""2013-02-28 14:38:58""",
7,76561197974662531,Heaten,http://steamcommunity.com/id/Heaten/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-18 13:48:52,1,Marek,103582791430649748,2005-03-05 13:09:09,"""N,""N","""N,""N",SK,"""N,""N",2013-02-28 14:29:55,
8,76561197974686132,Swift,http://steamcommunity.com/id/SwiftRyu/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-03-06 03:30:26,1,Bobby Mckenna,103582791431692618,2005-03-05 09:35:40,"""N,""N","""N,""N",GB,M3,16796,2013-03-06 14:42:03
9,76561197974710829,T-rav,http://steamcommunity.com/id/TravisC/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-13 17:03:02,1,Travis,103582791429523489,2005-03-07 15:41:19,"""N,""N","""N,""N",US,"""N,""N",2013-02-28 14:29:56,


In [111]:
# Save TABLE 6_2
ps_t6_2.write.csv('/user/tamng/jwht/SteamData/ps_t6_2.csv', header = True)

---
### TABLE 6_3

nSplit_c1 !=1

Total rows: 3420

#### Save this table for later; it may or may not be used

In [95]:
ps_t6_3 = ps_t6.filter(F.col('nSplit_c1')!=1)
ps_t6_3.count()

3420

In [96]:
ps_t6_3.limit(50).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16
0,76561197985868516,"Binladen (uu,) -DEJO EL CS AMIGOS :(",http://steamcommunity.com/profiles/76561197985...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-08-17 11:26:46""","""N,""N",103582791429521408,2006-10-30 12:58:42,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 05:53:20""",
1,76561197971188127,"Meddler, Comandante Molón",http://steamcommunity.com/profiles/76561197971...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2011-06-23 05:27:38""","""N,""N",103582791429521408,2004-11-26 13:40:46,"""N,""N","""N,""N","""N,""N","""N,""2013-03-06 18:24:28""",
2,76561197985486996,"WWW,LION,DE,",http://steamcommunity.com/profiles/76561197985...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-11-02 06:15:27""","""N,""N",103582791429521408,2006-10-11 17:02:49,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 04:22:16""",
3,76561197981849171,"..atopold,jajsz",http://steamcommunity.com/profiles/76561197981...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2007-09-28 05:50:41""","""N,""N",103582791429521408,2006-04-02 11:36:55,"""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:28:56""",
4,76561197984575330,"""[C$K]/\/\**(( [STIFLER] ))**/\/"",""http://stea...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-11-07 13:40:12,"""N,""N",103582791429619983,2006-08-26 09:08:49,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 01:43:50""",
5,76561197984772733,"BlacKBeaNZiNhHuU, FeiUnhU mAs XeroZiNhO",http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-10-31 13:32:05""","""N,""N",103582791429521408,2006-09-07 23:57:47,"""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:30:30""",
6,76561197979846532,"""crb"",""http://steamcommunity.com/profiles/7656...",http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-05-23 15:35:45,"""N,""NivleK""",103582791429698991,2005-12-20 11:05:48,"""N,""N","""N,""N","""N,""N","""N,""2013-03-01 10:30:04""",
7,76561197981204963,"new id, add: noobesser",http://steamcommunity.com/profiles/76561197981...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-04-12 09:57:57""","""N,""N",103582791429521408,2006-02-26 19:13:22,"""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:28:47""",
8,76561197975050736,"oops,headshoot",http://steamcommunity.com/profiles/76561197975...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2006-08-20 12:42:13""","""N,""N",103582791429521408,2005-03-22 15:43:11,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:39:06""",
9,76561197975772641,".,|.,d(-.-)b.,|.,",http://steamcommunity.com/profiles/76561197975...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,"""N,""2008-11-18 01:57:14""","""N,""N",103582791429521408,2005-05-03 23:40:24,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:30:08""",


In [110]:
# Save TABLE 6_3
ps_t6_3.write.csv('/user/tamng/jwht/SteamData/ps_t6_3.csv', header = True)

In [94]:
print('Total rows in TABLE 6:', 3235784 + 3420 + 108544)

Total rows in TABLE 6: 3347748
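The sum above only equals the size of TABLE 6 if the three filters are mutually exclusive and together cover every row. The sketch below states that partition as a single function and re-checks the arithmetic with the counts reported above. The function name `table_for` is an assumption made for this illustration, and the sketch assumes `nSplit_c1` and `nSplit_c16` are never null.

```python
from collections import Counter


def table_for(n_split_c1, n_split_c16):
    """Assign a row to exactly one sub-table using the filters shown above."""
    if n_split_c1 != 1:
        return "6_3"
    return "6_1" if n_split_c16 == -1 else "6_2"


# Each (nSplit_c1, nSplit_c16) pair lands in exactly one sub-table.
sample = [(1, -1), (1, 1), (1, 2), (2, -1), (3, 1)]
assignment = Counter(table_for(c1, c16) for c1, c16 in sample)

# Row counts reported in the cells above.
counts = {"6_1": 3235784, "6_2": 108544, "6_3": 3420}
total = sum(counts.values())
```

Because every row goes through a single if/else chain, no row can be counted twice or dropped, which is what makes the summed total meaningful.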


---
## <font color = 'black'> TABLE 7: nSplit_c11 = 1 & nSplit_c10 != 1 </font>
<br> </br>
nSplit_c11 =1 & nSplit_c10 !=1

### 5.7.1 Filter data and check number of Split

In [17]:
ps_t7 = player_summaries_drop.filter((player_summaries_drop.nSplit_c11 == 1) & (player_summaries_drop.nSplit_c10 != 1))
ps_t7.count()

2919054

In [18]:
cols_check = ['_c1','_c7','_c8','_c9', '_c10', '_c11', '_c12', '_c13', '_c14','_c15','_c16','_c17','_c18','_c19']

tagname = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11', 'nSplit_c12', 'nSplit_c13', 'nSplit_c14',
           'nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']

ps_t7 = add_count_split_column(ps_t7, cols_check, tagname)
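
`add_count_split_column` is defined earlier in the notebook and is not shown in this section. As a rough plain-Python sketch of what each `nSplit_*` value is assumed to represent (the number of comma-separated pieces in a cell, with -1 standing in for a missing cell), the hypothetical helper `count_split` below tallies a few sample cells in the same spirit as the `groupBy(...).count()` checks:

```python
from collections import Counter

def count_split(value, sep=","):
    """Return the number of sep-separated pieces in a cell (assumed meaning of nSplit_*)."""
    if value is None:
        return -1  # assumption: a missing cell is reported as -1
    return len(value.split(sep))

# Sample _c10 cells copied from the outputs in this notebook, plus one missing cell
sample_c10 = [
    '"N,"CatPhoenix"',
    '"N,"N',
    '"N,103582791429521408,"2004-11-16 21:21:06"',
    None,
]

n_split = [count_split(v) for v in sample_c10]
distribution = Counter(n_split)  # how many cells fall into each split count
print(n_split)
print(distribution)
```

Cells with an unusual count (3 or more here) are the outliers that get inspected by hand before the bulk pattern is processed.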

In [20]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t7.groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|       27|      2|
|       12|      1|
|        1|2913529|
|       13|      1|
|        6|      9|
|        3|    635|
|        5|     83|
|       15|      2|
|       43|      1|
|        9|     30|
|        4|     94|
|        8|      5|
|        7|     25|
|       10|      5|
|       25|      1|
|       21|      1|
|       11|      2|
|        2|   4628|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|2919054|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|2919051|
|        2|      3|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|2919020|
|        2|     34|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c1

Let's see what differs for the rows that have nSplit_c9 = 1

In [21]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t7.filter(F.col('nSplit_c9')==1).groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|       27|      2|
|       12|      1|
|        1|2913498|
|       13|      1|
|        6|      9|
|        3|    635|
|        5|     83|
|       15|      2|
|       43|      1|
|        9|     30|
|        4|     94|
|        8|      5|
|        7|     25|
|       10|      5|
|       25|      1|
|       21|      1|
|       11|      2|
|        2|   4625|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|2919020|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|2919017|
|        2|      3|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|2919020|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+--------

---
How about nSplit_c10 > 2?

In [100]:
ps_t7.filter(F.col('nSplit_c10')>2).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197970520668,Deafblindboy,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,"""N,""2013-02-17 12:18:31""",2,"""N,103582791429521408,""2004-11-16 21:21:06""",730,85.131.163.214:20025,Counter-Strike: Global Offensive,"""N,""N","""N,""N",2013-02-28 14:37:35,,,
1,76561197982120722,nym006900,http://steamcommunity.com/profiles/76561197982...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,"""N,""2013-02-27 07:43:00""",1,"""N,103582791429521408,""2006-04-15 02:23:16""",9900,"""N,""Star Trek Online""","""N,""N","""N,""N",2013-03-01 20:19:34,,,,
2,76561197964428587,tempus,http://steamcommunity.com/profiles/76561197964...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,4,3,"""N,""2013-02-16 19:19:06""",1,"""N,103582791429521408,""2004-02-11 17:15:15""",12220,"""N,""Grand Theft Auto: Episodes from Liberty City""","""N,""N","""N,""N",2013-02-28 14:22:23,,,,
3,76561197978589418,Wanderer,http://steamcommunity.com/profiles/76561197978...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-03-01 05:04:45,"N,""\,/_(>.<)_\,,/",103582791429674099,2005-10-11 05:47:59,"""N,""N","""N,""N",FI,"""N,""N",2013-03-01 06:37:56,,


---
How about nSplit_c12 == 1?

In [108]:
ps_t7.filter(F.col('nSplit_c12')==1).limit(20).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197970399529,CatPhoenix,http://steamcommunity.com/id/CatPhoenix/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,1,2013-02-15 17:22:06,"""N,""CatPhoenix""",103582791431392968,2004-11-16 09:52:27,49520,"""N,""N","""N,""US""",WV,4061,2013-02-28 14:27:12,
1,76561197970400068,Amunition121,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2010-03-05 19:37:39,"""N,""N",103582791429521408,2004-11-16 11:21:04,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:37:33""",,,
2,76561197970400355,Dr.Feelgood,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-08 15:37:29,"""N,""Juha""",103582791430111969,2004-11-16 09:56:14,"""N,""N","""N,""N",FI,"""N,""N",2013-02-28 14:27:12,,
3,76561197970400473,flonge,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-15 17:08:18,"""N,""N",103582791429521408,2004-11-16 09:56:24,"""N,""N","""N,""N",NO,"""N,""N",2013-02-28 14:27:12,,
4,76561197970402690,sarge_bozz,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-12 16:06:43,"""N,""N",103582791430751772,2004-11-16 11:29:36,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:37:33""",,,
5,76561197970403279,biochip_c,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-17 14:01:22,"""N,""N",103582791432850448,2004-11-16 10:08:46,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:27:12""",,,
6,76561197970404599,OutofSkillz,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-03-06 13:25:54,"""N,""N",103582791430878549,2004-11-16 10:12:16,"""N,""N","""N,""N","""N,""N","""N,""2013-03-06 18:08:23""",,,
7,76561197970405470,kilbane,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-09-02 03:10:18,"""N,""Laurie""",103582791429521408,2004-11-16 11:40:13,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:37:33""",,,
8,76561197970406759,beemer (>NUTTER<),http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-17 15:45:23,"""N,""N",103582791429968690,2004-11-16 10:22:54,"""N,""N","""N,""N",BE,"""N,""N",2013-02-28 14:27:12,,
9,76561197970406909,Faxe,http://steamcommunity.com/profiles/76561197970...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-17 17:08:44,"""N,""N",103582791429521408,2004-11-16 10:23:11,"""N,""N","""N,""N","""N,""N","""N,""2013-02-28 14:27:12""",,,


---
_Check the distribution for nSplit_c1 = 1 and nSplit_c10 = 2_

In [25]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t7.filter((F.col('nSplit_c1')==1) &(F.col('nSplit_c10')==2)).groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|        1|2913525|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|2913525|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|2913525|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|2913494|
|        2|     31|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         2|2913525|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|         1|2913525|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c12|  count|
+----------+-------+
|        -1|      3|
|         1|2913511

---
All right, we have found the right pattern to split on; the condition below isolates it

In [36]:
ps_t7_1 = ps_t7.filter((F.col('nSplit_c1')==1) &(F.col('nSplit_c9')!=2) &(F.col('nSplit_c10')==2) &(F.col('nSplit_c12')==1) &(F.col('nSplit_c13')!=3) &(F.col('nSplit_c14')!=3) &(F.col('nSplit_c15')!=3)  &(F.col('nSplit_c16')!=3))
ps_t7_1.count()

2913465
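
The condition above can be read as a predicate on a row's `nSplit_*` values. A minimal plain-Python sketch, where `matches_t7_1` and the sample dictionaries are hypothetical and only mirror the PySpark filter:

```python
def matches_t7_1(n):
    """Mirror of the ps_t7_1 filter; n maps an nSplit_* column name to its value."""
    return (
        n["nSplit_c1"] == 1
        and n["nSplit_c9"] != 2
        and n["nSplit_c10"] == 2
        and n["nSplit_c12"] == 1
        and all(n[f"nSplit_c{k}"] != 3 for k in (13, 14, 15, 16))
    )

# Hypothetical rows: the first follows the regular pattern,
# the second has a 3-way split in _c13 and is therefore excluded
regular = {"nSplit_c1": 1, "nSplit_c9": 1, "nSplit_c10": 2, "nSplit_c12": 1,
           "nSplit_c13": 2, "nSplit_c14": 2, "nSplit_c15": 1, "nSplit_c16": 1}
outlier = dict(regular, nSplit_c13=3)

print(matches_t7_1(regular), matches_t7_1(outlier))
```

Of the 2,919,054 rows in TABLE 7, the filter keeps 2,913,465, so 5,589 rows are set aside as outliers.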

In [37]:
ps_t7_1.filter(~F.col('_c5').contains('http')).count()

0

In [38]:
ps_t7_1.limit(20).toPandas()[['_c0','_c1','_c2','_c3','_c4','_c5','_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13', '_c14','_c15','_c16','_c17','_c18','_c19']]

Unnamed: 0,_c0,_c1,_c2,_c3,_c4,_c5,_c6,_c7,_c8,_c9,_c10,_c11,_c12,_c13,_c14,_c15,_c16,_c17,_c18,_c19
0,76561197984409743,CAPSGUY,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-05 12:07:53,"""N,""N",103582791429521408,2006-08-20 18:31:51,"""N,""N","""N,""N",CA,QC,4659,2013-03-11 12:30:25,
1,76561197984409754,-exdkr- Milan Jagdhund,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,1,3,1,2013-03-01 16:01:30,"""N,"" Marcel""",103582791429525029,2006-08-18 02:16:02,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 01:21:06""",,,
2,76561197984410981,the raccoon,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-06-16 13:32:46,"""N,""N",103582791429638306,2006-08-20 20:19:24,"""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:30:25""",,,
3,76561197984412618,(`Â´)|buRRo,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-08-30 16:12:51,"""N,""Tommy Karlsson""",103582791430622765,2006-08-18 06:13:11,"""N,""N","""N,""N",SE,"""N,""N",2013-03-02 01:21:27,,
4,76561197984412820,H^ xd,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2009-12-28 12:58:43,"""N,""N",103582791429521408,2006-08-18 06:27:26,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 01:21:29""",,,
5,76561197984414552,MasterShake,http://steamcommunity.com/id/mastershake193/,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-28 23:52:36,"""N,""Aaron""",103582791430921377,2006-08-18 08:09:43,"""N,""N","""N,""N",US,SD,3495,2013-03-02 01:21:43,
6,76561197984418128,DyÅ‹Ã cÃ²Ñe|áº¶VA,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2011-06-14 03:16:56,"""N,""N",103582791429521408,2006-08-18 10:50:56,"""N,""N","""N,""N","""N,""N","""N,""2013-03-02 01:22:12""",,,
7,76561197984422985,jkop,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-24 04:04:32,"""N,""N",103582791430084582,2006-08-21 11:20:40,"""N,""N","""N,""N",DE,"""N,""N",2013-03-11 12:30:25,,
8,76561197984423499,Chip,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2012-05-10 13:33:04,"""N,""N",103582791431945232,2006-08-21 11:42:42,"""N,""N","""N,""N","""N,""N","""N,""2013-03-11 12:30:25""",,,
9,76561197984424003,,http://steamcommunity.com/profiles/76561197984...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,http://media.steampowered.com/steamcommunity/p...,0,3,1,2013-02-28 13:39:01,"""N,""N",103582791429639803,2006-08-21 12:05:05,"""N,""N","""N,""N",FR,A1,15628,2013-03-11 12:30:25,


In [53]:
check_list = ['nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9', 'nSplit_c10', 'nSplit_c11','nSplit_c12','nSplit_c13', 'nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17','nSplit_c18','nSplit_c19']


for i in check_list:
    print('Distinct value for each column:\n')
    print(ps_t7_1.groupBy(i).count().show())

Distinct value for each column:

+---------+-------+
|nSplit_c1|  count|
+---------+-------+
|        1|2913465|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c7|  count|
+---------+-------+
|        1|2913465|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c8|  count|
+---------+-------+
|        1|2913465|
+---------+-------+

None
Distinct value for each column:

+---------+-------+
|nSplit_c9|  count|
+---------+-------+
|        1|2913465|
+---------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c10|  count|
+----------+-------+
|         2|2913465|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c11|  count|
+----------+-------+
|         1|2913465|
+----------+-------+

None
Distinct value for each column:

+----------+-------+
|nSplit_c12|  count|
+----------+-------+
|         1|2913465|
+----------+-------+

None
Distinct val

In [39]:
# Define the columns to split and the names of the resulting columns

col_c10 = ['_c10']
newcols_c10 = ['c10_1', 'c10_2']

col_c13 = ['_c13']
newcols_c13 = ['c13_1', 'c13_2']

col_c14 = ['_c14']
newcols_c14 = ['c14_1', 'c14_2']

col_c15 = ['_c15']
newcols_c15 = ['c15_1', 'c15_2']

col_c16 = ['_c16']
newcols_c16 = ['c16_1', 'c16_2']

col_c17 = ['_c17']
newcols_c17 = ['c17_1', 'c17_2']

col_c18 = ['_c18']
newcols_c18 = ['c18_1', 'c18_2']

col_c19 = ['_c19']
newcols_c19 = ['c19_1', 'c19_2']

# Apply split function
ps_t7_1 = split_2_column(ps_t7_1, col_c10, newcols_c10)
ps_t7_1 = split_2_column(ps_t7_1, col_c13, newcols_c13)
ps_t7_1 = split_2_column(ps_t7_1, col_c14, newcols_c14)
ps_t7_1 = split_2_column(ps_t7_1, col_c15, newcols_c15)
ps_t7_1 = split_2_column(ps_t7_1, col_c16, newcols_c16)
ps_t7_1 = split_2_column(ps_t7_1, col_c17, newcols_c17)
ps_t7_1 = split_2_column(ps_t7_1, col_c18, newcols_c18)
ps_t7_1 = split_2_column(ps_t7_1, col_c19, newcols_c19)
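
`split_2_column` is also defined earlier in the notebook. Assuming it splits a cell on a comma into two new columns, the idea looks roughly like this in plain Python; `split_two` is a hypothetical stand-in, and the loop shows how the near-identical calls could be driven from a list of column names:

```python
def split_two(value, sep=","):
    """Split a cell into (first, second); second is None when there is no separator."""
    if value is None:
        return None, None
    parts = value.split(sep, 1)
    return parts[0], (parts[1] if len(parts) == 2 else None)

# Sample cells in the shape seen in the ps_t7_1 output
row = {"_c10": '"N,"Aaron"', "_c13": '"N,"N', "_c15": "US"}

for src in ("_c10", "_c13", "_c15"):
    name = src.lstrip("_")  # '_c10' -> 'c10'
    row[f"{name}_1"], row[f"{name}_2"] = split_two(row[src])

print(row["c10_1"], row["c10_2"], row["c15_1"], row["c15_2"])
```

A cell with a single value, such as the country code here, fills only the `_1` column in this sketch.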

---
### 5.7.2 Adding columns and condition
From this step on, we add the named columns based on the nSplit_* column logic

_Add columns_

In [40]:
ps_t7_1 = ps_t7_1.withColumn('personastate', F.col('_c6'))
ps_t7_1 = ps_t7_1.withColumn('communityvisibilitystate', F.col('_c7'))
ps_t7_1 = ps_t7_1.withColumn('profilestate', F.col('_c8'))
ps_t7_1 = ps_t7_1.withColumn('lastlogoff', F.col('_c9'))
ps_t7_1 = ps_t7_1.withColumn('commentpermission', F.col('c10_1'))
ps_t7_1 = ps_t7_1.withColumn('realname', F.col('c10_2'))
ps_t7_1 = ps_t7_1.withColumn('primaryclanid', F.when((F.col('_c11').contains('103')), F.col('_c11')).\
                             when(~(F.col('_c11').contains('103')), F.col('_c12')))
ps_t7_1 = ps_t7_1.withColumn('timecreated', F.when((F.col('_c11').contains('103')), F.col('_c12')).\
                             when(~(F.col('_c11').contains('103')), F.col('c13_1')))                  
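
The two `when` branches above encode a one-column shift: when `_c11` does not hold the clan id, the clan id is taken from `_c12` and the creation time from the first piece of `_c13`. A plain-Python sketch of that rule; the function name and the second sample row are hypothetical, and the `'103'` substring test is kept as loose as it is in the notebook:

```python
def clan_and_created(row):
    """Return (primaryclanid, timecreated) following the '103' substring rule."""
    if "103" in row["_c11"]:
        return row["_c11"], row["_c12"]
    return row["_c12"], row["c13_1"]

# Values copied from the ps_t7_1 sample output
aligned = {"_c11": "103582791429521408", "_c12": "2006-08-20 18:31:51", "c13_1": '"N'}
# Hypothetical shifted row: an extra value pushed the clan id into _c12
shifted = {"_c11": "1", "_c12": "103582791429521408", "c13_1": "2006-08-20 18:31:51"}

print(clan_and_created(aligned))
print(clan_and_created(shifted))
```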
                             

In [None]:
ps_t7_1 = ps_t7_1.withColumn('gameid', F.when((F.col('_c11').contains('103')), F.col('c13_1')).\
                             when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1), F.col('c14_1')).\
                             when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2), F.col('c13_2')))

In [61]:
ps_t7_1 = ps_t7_1.withColumn('gameserverip', F.when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)), F.col('c14_1')).\
                             when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==2)), F.col('c13_2')).\
                             when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
                             when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                             when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2), F.col('c14_1')))  

In [62]:
ps_t7_1 = ps_t7_1.withColumn('gameextrainfo', F.when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2), F.col('c14_1')).\
when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), F.col('c16_1')).\
when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2), F.col('c15_1')).\
when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1), F.col('c15_1')).\
when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), F.col('c14_2'))) 

In [97]:
ps_t7_1 = ps_t7_1.withColumn('cityid', F.when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1)& (F.col('nSplit_c15')==1), F.col('c16_1')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1)& (F.col('nSplit_c15')==2), F.col('c15_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) , F.col('c15_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) , F.col('c15_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), F.col('c14_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) , F.col('c15_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1), F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), F.col('c15_1')).\
                            otherwise(F.col('c16_1'))) 

In [95]:
ps_t7_1.count()

2913465

In [96]:
ps_t7_1 = ps_t7_1.filter((F.col('_c11').contains('103')) | (F.col('_c12').contains('103')))
ps_t7_1.count()

2913156

In [100]:
ps_t7_1 = ps_t7_1.filter(F.col('cityid')=='"N')
ps_t7_1.count()

2912970
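
To keep track of how many rows each filter removes, the three counts above can be compared directly. A small sketch using only the numbers printed in this section (the step labels are illustrative):

```python
counts = [
    ("ps_t7_1 after pattern filter", 2913465),
    ("clan id found in _c11 or _c12", 2913156),
    ("cityid equals the placeholder", 2912970),
]

# Difference between each step and the one before it
dropped = [
    (label, prev - cur)
    for (_, prev), (label, cur) in zip(counts, counts[1:])
]
for label, n in dropped:
    print(f"{label}: dropped {n} rows")
```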

---
_Add column_
> - loccountrycode

In [None]:
ps_t7_1 = ps_t7_1.withColumn('loccountrycode', F.when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1)& (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) , F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) , F.col('c15_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) , F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2), F.col('c15_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) , F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), F.col('c16_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
                            otherwise(F.col('c16_2')))

---
_Add column_
> - locstatecode

In [111]:
ps_t7_1 = ps_t7_1.withColumn('locstatecode', F.when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1)& (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1)& (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c17_1')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c11').contains('103') & (F.col('nSplit_c13')==1)) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) , F.col('c17_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) , F.col('c16_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) , F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) , F.col('c17_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) , F.col('c16_2')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2), F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1), F.col('c16_1')).\
                            when((F.col('_c11').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), F.col('c15_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==1), F.col('c19_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1) & (F.col('nSplit_c18')==2), F.col('c18_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1) , F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==1) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==2) , F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==1), F.col('c18_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1) & (F.col('nSplit_c17')==2), F.col('c17_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==1) & (F.col('nSplit_c15')==2) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==1), F.col('c17_1')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==1) & (F.col('nSplit_c16')==2), F.col('c16_2')).\
                            when((F.col('_c12').contains('103')) & (F.col('nSplit_c13')==2) & (F.col('nSplit_c14')==2) & (F.col('nSplit_c15')==2), F.col('c16_1')).\
                            otherwise(F.col('c17_1')))
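
A chained `when(...)` is evaluated from top to bottom and returns the value of the first branch whose condition holds. A later branch that repeats an earlier condition therefore never fires: of the two branches above that test `nSplit_c15==2`, `nSplit_c16==1`, `nSplit_c17==2`, only the `c17_2` result is ever used. The sketch below restates the same first-match logic in plain Python, with the branches written as data so that repeats are easy to spot. `RULES`, `pick_source`, `shadowed_rules` and the `None` wildcard are illustrative names, not part of the notebook, and the table covers only the branches visible in this cell (earlier branches of the chain are not included).

```python
# Each rule: (nSplit_c13, nSplit_c14, nSplit_c15, nSplit_c16, nSplit_c17) -> source column.
# None means "any value". Every rule assumes _c12 contains '103'.
RULES = [
    ((1, 1, 2, 1, 1), "c18_1"),
    ((1, 1, 2, 1, 2), "c17_2"),
    ((1, 1, 2, 1, 2), "c18_1"),  # same pattern as the rule above: never reached
    ((1, 2, 1, 1, 1), "c18_1"),
    ((1, 2, 1, 1, 2), "c17_2"),
    ((1, 2, 1, 2, None), "c17_1"),
    ((1, 2, 2, 1, None), "c17_1"),
    ((1, 2, 2, 2, None), "c16_2"),
    ((2, 1, 1, 1, 1), "c18_1"),
    ((2, 1, 1, 1, 2), "c17_2"),
    ((2, 1, 1, 2, None), "c17_1"),
    ((2, 1, 2, 1, None), "c17_1"),
    ((2, 2, 1, 1, None), "c17_1"),
    ((2, 2, 1, 2, None), "c16_2"),
    ((2, 2, 2, None, None), "c16_1"),
]
DEFAULT = "c17_1"  # mirrors the otherwise(...) branch


def pick_source(counts, rules=RULES, default=DEFAULT):
    """Return the source column of the first rule matching the split counts."""
    for pattern, source in rules:
        if all(p is None or p == c for p, c in zip(pattern, counts)):
            return source
    return default


def shadowed_rules(rules=RULES):
    """Return indices of rules whose pattern exactly repeats an earlier rule."""
    seen = set()
    shadowed = []
    for index, (pattern, _source) in enumerate(rules):
        if pattern in seen:
            shadowed.append(index)
        seen.add(pattern)
    return shadowed
```

Calling `shadowed_rules()` on a table like this before translating it into `when(...)` branches is one way to catch unreachable branches early.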

_Check for missing values across all created columns_

In [112]:
check_missing(ps_t7_1.select('personastate','communityvisibilitystate','profilestate','lastlogoff','commentpermission',
                            'realname','primaryclanid','timecreated','gameid','gameextrainfo','cityid','loccountrycode','locstatecode'))

+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+-------------+------+--------------+------------+
|personastate|communityvisibilitystate|profilestate|lastlogoff|commentpermission|realname|primaryclanid|timecreated|gameid|gameextrainfo|cityid|loccountrycode|locstatecode|
+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+-------------+------+--------------+------------+
|           0|                       0|           0|         0|                0|       0|            0|          0|     0|            0|     0|             0|           0|
+------------+------------------------+------------+----------+-----------------+--------+-------------+-----------+------+-------------+------+--------------+------------+



In [114]:
ps_t7_1.select('timecreated').distinct().count()

2791591

In [115]:
# Use this code to check if 'timecreated' was parsed correctly.
ps_t7_1.select('timecreated').distinct().limit(50).toPandas()


Unnamed: 0,timecreated
0,2006-07-23 14:06:45
1,2006-07-27 06:10:36
2,2006-08-04 09:18:23
3,2006-09-23 21:45:52
4,2005-01-19 02:22:04
5,2005-12-26 19:36:16
6,2006-01-03 22:39:47
7,2006-01-17 10:57:42
8,2006-06-09 02:25:49
9,2006-06-13 09:57:09


In [52]:
# Use this code to check if the 'gameid' was parsed correctly.
ps_t7_1.select('gameid').distinct().limit(50).toPandas()

Unnamed: 0,gameid
0,108800
1,207060
2,45300
3,7650
4,211160
5,207210
6,1280
7,31290
8,212220
9,2700


In [116]:
# Use this code to check if 'gameserverip' was parsed correctly.
ps_t7_1.select('gameserverip').distinct().limit(50).toPandas()

Unnamed: 0,gameserverip
0,146.66.154.27:27034
1,146.66.152.96:27025
2,74.91.123.66:27015
3,146.66.153.88:27064
4,109.207.55.26:27015
5,63.210.145.203:27015
6,103.10.125.71:27063
7,27.50.71.209:27015
8,90.189.192.130:27016
9,146.66.153.125:27079


In [65]:
# Use this code to check if 'gameextrainfo' was parsed correctly.
ps_t7_1.select('gameextrainfo').distinct().limit(50).toPandas()

Unnamed: 0,gameextrainfo
0,Dota 2
1,"""Star Wars: The Force Unleashed Ultimate Sith ..."
2,"""Black Mesa"""
3,"""Street Fighter X Tekken"""
4,"""LEGO Batman: The Videogame"""
5,"""*Craft II"""
6,"""Call of Duty 2"""
7,"""Lego Star Wars 3: The Clone Wars"""
8,"""Client"""
9,"""PHANTASY STAR ONLINE 2"""


In [117]:
# Use this code to check if 'cityid' was parsed correctly.
ps_t7_1.select('cityid').distinct().limit(50).toPandas()

Unnamed: 0,cityid
0,"""N"


_loccountrycode_

In [119]:
ps_t7_1.select('loccountrycode').distinct().count()

427

In [118]:
# Use this code to check if 'loccountrycode' was parsed correctly.
ps_t7_1.select('loccountrycode').distinct().limit(50).toPandas()

Unnamed: 0,loccountrycode
0,DZ
1,LT
2,"""SG"""
3,MM
4,"""TJ"""
5,CI
6,TC
7,FI
8,AZ
9,SC
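
The sample above mixes quoted and unquoted codes (`"SG"` next to `DZ`). At this stage a distinct count treats `"SG"` and `SG` as two different values, so it can overstate the number of country codes. A small plain-Python illustration follows; the `sample` values and the `distinct_codes` helper are made up for the example.

```python
def distinct_codes(values, strip_quotes=False):
    """Return the set of distinct codes, optionally removing double quotes first."""
    if strip_quotes:
        values = (v.replace('"', '') for v in values)
    return set(values)


# Hypothetical values: the same code appears with and without quotes.
sample = ['DZ', '"SG"', 'SG', '"TJ"', 'TJ', 'FI']

raw_count = len(distinct_codes(sample))                        # quoted variants counted separately
clean_count = len(distinct_codes(sample, strip_quotes=True))   # variants collapse to one code
```

Here `raw_count` is 6 and `clean_count` is 4, which is the effect to keep in mind when reading distinct counts taken before the quotes are removed.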


In [122]:
ps_t7_1.select('locstatecode').distinct().count()

408

In [123]:
# Use this code to check if 'locstatecode' was parsed correctly.
ps_t7_1.select('locstatecode').distinct().limit(50).toPandas()

Unnamed: 0,locstatecode
0,07
1,51
2,C6
3,X9
4,CI
5,AZ
6,SC
7,A9
8,Q7
9,NS


---
### 5.7.2 Format Table

In [124]:
ps_t7_1.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)
 |-- _c2: string (nullable = true)
 |-- _c3: string (nullable = true)
 |-- _c4: string (nullable = true)
 |-- _c5: string (nullable = true)
 |-- _c6: string (nullable = true)
 |-- _c7: string (nullable = true)
 |-- _c8: string (nullable = true)
 |-- _c9: string (nullable = true)
 |-- _c10: string (nullable = true)
 |-- _c11: string (nullable = true)
 |-- _c12: string (nullable = true)
 |-- _c13: string (nullable = true)
 |-- _c14: string (nullable = true)
 |-- _c15: string (nullable = true)
 |-- _c16: string (nullable = true)
 |-- _c17: string (nullable = true)
 |-- _c18: string (nullable = true)
 |-- _c19: string (nullable = true)
 |-- nSplit_c1: integer (nullable = false)
 |-- nSplit_c7: integer (nullable = false)
 |-- nSplit_c8: integer (nullable = false)
 |-- nSplit_c9: integer (nullable = false)
 |-- nSplit_c10: integer (nullable = false)
 |-- nSplit_c11: integer (nullable = false)
 |-- nSplit_c12: integer 

In [125]:
# Drop unnecessary columns
ps_t7_1 = ps_t7_1.drop('nSplit_c1','nSplit_c7','nSplit_c8','nSplit_c9','nSplit_c10','nSplit_c11',\
                    'nSplit_c12','nSplit_c13','nSplit_c14','nSplit_c15','nSplit_c16','nSplit_c17',\
                    'nSplit_c18','nSplit_c19',\
                    'c10_1','c10_2', 'c11_3','c13_1','c13_2','c14_1','c14_2','c15_1','c15_2',\
                    'c16_1','c16_2','c17_1','c17_2','c18_1','c18_2','c19_1','c19_2',\
                    '_c6','_c7','_c8','_c9','_c10','_c11','_c12','_c13','_c14','_c15','_c16',\
                    '_c17','_c18','_c19','_c3','_c4','_c5',)

In [127]:
# Rename columns
newColumns = ['steam_id','person_name', 'profile_url','person_state', 'community_visibility_state',\
              'profile_state','last_logoff', 'comment_permission', 'real_name', 'primary_clanid',\
              'time_created', 'game_id', 'gameserver_ip', 'game_extrainfo', 'city_id', 'country_code',\
              'state_code']
ps_t7_1 = rename_col(ps_t7_1, newColumns)
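
`rename_col` is defined outside this section. If it assigns the new names by position (an assumption), the rename is only correct when the remaining columns are in exactly the order of `newColumns`. A minimal check that could be run before renaming is sketched below; `rename_plan` is a hypothetical helper, not part of the notebook.

```python
def rename_plan(old_columns, new_columns):
    """Pair old and new column names by position, refusing obviously bad input."""
    if len(old_columns) != len(new_columns):
        raise ValueError(
            f"expected {len(old_columns)} new names, got {len(new_columns)}"
        )
    if len(set(new_columns)) != len(new_columns):
        raise ValueError("new column names must be unique")
    return list(zip(old_columns, new_columns))


# Hypothetical first three columns, for illustration only.
plan = rename_plan(['_c0', '_c1', '_c2'],
                   ['steam_id', 'person_name', 'profile_url'])
```

In the notebook this could be used as `rename_plan(ps_t7_1.columns, newColumns)`, printing each pair to confirm by eye that, for example, the country column really maps to `country_code`.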

In [128]:
ps_t7_1.printSchema()

root
 |-- steam_id: string (nullable = true)
 |-- person_name: string (nullable = true)
 |-- profile_url: string (nullable = true)
 |-- person_state: string (nullable = true)
 |-- community_visibility_state: string (nullable = true)
 |-- profile_state: string (nullable = true)
 |-- last_logoff: string (nullable = true)
 |-- comment_permission: string (nullable = true)
 |-- real_name: string (nullable = true)
 |-- primary_clanid: string (nullable = true)
 |-- time_created: string (nullable = true)
 |-- game_id: string (nullable = true)
 |-- gameserver_ip: string (nullable = true)
 |-- game_extrainfo: string (nullable = true)
 |-- city_id: string (nullable = true)
 |-- country_code: string (nullable = true)
 |-- state_code: string (nullable = true)



In [130]:
# Remove double-quote characters from the data
# ('profile_state' was listed twice; each column only needs one pass)
col_list = ['profile_state','community_visibility_state','comment_permission','real_name', 'primary_clanid','time_created', 'game_id', 'gameserver_ip',\
            'last_logoff','game_extrainfo', 'city_id', 'country_code','state_code']

for col_name in col_list:
    ps_t7_1 = ps_t7_1.withColumn(col_name, regexp_replace(col_name, '"', ""))
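
Removing the quotes leaves missing values as a literal `N` (they show up as `"N` in the raw columns). This looks like the remainder of the `\N` marker that MySQL text exports use for NULL, although that origin is an assumption here. Because `N` is an ordinary string, a null count will not report these cells as missing. The plain-Python sketch below shows one way to normalize such values; `normalize` and `PLACEHOLDER` are illustrative names.

```python
PLACEHOLDER = "N"  # assumed missing-value marker left over from the export


def normalize(value, placeholder=PLACEHOLDER):
    """Strip double quotes and map the placeholder to None."""
    if value is None:
        return None
    cleaned = value.replace('"', '')
    return None if cleaned == placeholder else cleaned
```

The same rule would need care on free-text columns such as `real_name`, where a user could legitimately have entered `N`.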

In [131]:
ps_t7_1.limit(20).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197974615739,Warrior_Wax,http://steamcommunity.com/profiles/76561197974...,3,3,1,2013-03-06 18:19:24,N,N,103582791429521408,2005-03-03 20:33:42,N,N,N,N,N,N
1,76561197974616863,Brothaman Â±ihia,http://steamcommunity.com/profiles/76561197974...,0,3,1,2013-02-17 23:41:24,N,N,103582791429681794,2005-03-03 22:46:40,N,N,N,N,N,N
2,76561197974618852,TaNk,http://steamcommunity.com/profiles/76561197974...,0,3,1,2013-01-01 14:34:48,N,N,103582791429521408,2005-03-02 17:18:55,N,N,N,N,US,CA
3,76561197974624528,fjeppert,http://steamcommunity.com/profiles/76561197974...,0,3,1,2011-11-22 06:04:48,N,N,103582791429521408,2005-03-03 03:19:57,N,N,N,N,NL,07
4,76561197974626969,>EpoXy`,http://steamcommunity.com/profiles/76561197974...,0,3,1,2009-11-26 14:58:47,N,N,103582791429521408,2005-03-04 09:50:08,N,N,N,N,N,N
5,76561197974627729,jamallorock,http://steamcommunity.com/profiles/76561197974...,0,3,1,2012-09-09 14:14:37,N,N,103582791431480423,2005-03-04 10:20:32,N,N,N,N,PL,N
6,76561197974629794,NERO 91,http://steamcommunity.com/profiles/76561197974...,0,3,1,2012-07-01 07:03:57,N,N,103582791431465541,2005-03-03 09:06:52,N,N,N,N,N,N
7,76561197974631473,RRZ_lovs,http://steamcommunity.com/profiles/76561197974...,1,3,1,2012-10-22 11:30:03,N,N,103582791429639803,2005-03-04 12:48:01,N,N,N,N,N,N
8,76561197974639142,cheese,http://steamcommunity.com/profiles/76561197974...,0,3,1,2013-02-11 23:03:34,N,N,103582791429639803,2005-03-03 18:03:07,N,N,N,N,N,N
9,76561197974641428,Sorce,http://steamcommunity.com/profiles/76561197974...,1,3,1,2013-02-18 15:33:26,N,Sorce,103582791429574842,2005-03-03 18:42:10,N,N,N,N,N,N


In [132]:
# Save TABLE 7_1
ps_t7_1.write.csv('/user/tamng/jwht/CleanData/ps_t7_1.csv', header = True)

----
## 6. Merge all tables together

In [11]:
!hdfs dfs -ls /user/tamng/jwht/CleanData

Found 21 items
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:38 /user/tamng/jwht/CleanData/app_id_info.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 22:38 /user/tamng/jwht/CleanData/app_if_info_PosReview.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 13:42 /user/tamng/jwht/CleanData/friends.csv
drwxrwxrwx   - tamng tamng          0 2020-05-18 15:10 /user/tamng/jwht/CleanData/game2_df.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 14:12 /user/tamng/jwht/CleanData/game_dgp.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:41 /user/tamng/jwht/CleanData/games_developer.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:47 /user/tamng/jwht/CleanData/games_genres.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:49 /user/tamng/jwht/CleanData/games_publisher.csv
drwxrwxrwx   - tamng tamng          0 2020-05-22 13:55 /user/tamng/jwht/CleanData/groups.csv
drwxrwxrwx   - tamng tamng          0 2020-05-27 16:54 /user/tamng/jwht/CleanData/ps_t1.csv
drwxrwx

In [12]:
ps_t1 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t1.csv',inferSchema = True, header = True)
ps_t2 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t2.csv',inferSchema = True, header = True)
ps_t3 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t3.csv',inferSchema = True, header = True)
ps_t4_1_1 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t4_1_1.csv',inferSchema = True, header = True)
ps_t4_1_2 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t4_1_2.csv',inferSchema = True, header = True)
ps_t4_2 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t4_2.csv',inferSchema = True, header = True)
ps_t4_3 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t4_3.csv',inferSchema = True, header = True)
ps_t4_4 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t4_4.csv',inferSchema = True, header = True)
ps_t5_1_1 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t5_1_1.csv',inferSchema = True, header = True)
ps_t5_2 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t5_2.csv',inferSchema = True, header = True)
ps_t6_1 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t6_1.csv',inferSchema = True, header = True)
ps_t7_1 = spark.read.csv('/user/tamng/jwht/CleanData/ps_t7_1.csv',inferSchema = True, header = True)
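
The twelve reads differ only in the table name, so the paths can be generated from a list. This is a sketch under the assumption that every table lives at `<base>/<name>.csv`, as in the cell above; `table_paths`, `BASE_DIR` and `TABLE_NAMES` are illustrative names.

```python
BASE_DIR = '/user/tamng/jwht/CleanData'
TABLE_NAMES = ['ps_t1', 'ps_t2', 'ps_t3', 'ps_t4_1_1', 'ps_t4_1_2', 'ps_t4_2',
               'ps_t4_3', 'ps_t4_4', 'ps_t5_1_1', 'ps_t5_2', 'ps_t6_1', 'ps_t7_1']


def table_paths(names, base_dir=BASE_DIR):
    """Map each table name to its CSV path under base_dir."""
    return {name: f"{base_dir}/{name}.csv" for name in names}


paths = table_paths(TABLE_NAMES)

# With the notebook's SparkSession available as `spark`, the tables could then
# be loaded in one pass (not executed here):
# tables = {name: spark.read.csv(path, inferSchema=True, header=True)
#           for name, path in paths.items()}
```

Keeping the DataFrames in a dictionary also makes it easy to pass all of them to the union step without retyping each variable name.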

In [14]:
ps_t1.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197974936592,Crack Whorez,http://steamcommunity.com/profiles/76561197974...,0,3,1,2009-11-29 01:27:15,1,N,103582791429526766,2005-03-17 10:14:08,N,N,N,N,N,N
1,76561197974941277,Nymann,http://steamcommunity.com/id/b0xr/,0,3,1,2010-02-09 06:47:58,1,N,103582791429915259,2005-03-19 04:48:01,N,N,N,N,DK,01


In [16]:
ps_t2.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197984594581,joe #sQx,http://steamcommunity.com/profiles/76561197984...,0,3,1,2010-05-07 19:09:49,N,Joe Pardim,103582791430266409,2006-08-30 04:55:18,N,N,N,N,BR,16
1,76561197967762191,heiftzurdah,http://steamcommunity.com/id/Zurdah/,0,3,1,2013-02-11 10:19:16,N,Arnar,103582791429550710,2004-07-28 09:00:55,N,N,N,N,N,N


In [19]:
ps_t3.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197960518379,KrumPZ,http://steamcommunity.com/profiles/76561197960...,0,3,1,2011-03-08 02:34:07,1,"Reborn,Cynical,Tumbo,Mero",103582791430065590,2003-09-13 11:41:47,N,N,N,N,N,N
1,76561197973640033,Doble Es,http://steamcommunity.com/id/empresario7/,0,3,1,2013-02-18 14:05:01,2,"A.k.a Ozamu, Bisnesmies, Empresario, Sulttaani...",103582791432035804,2005-01-26 23:18:18,N,N,N,N,FI,13


In [21]:
ps_t4_1_1.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197961037440,Fukano,http://steamcommunity.com/profiles/76561197961...,0,1,1,2013-02-16 03:14:12,2,N,N,N,N,N,N,N,N,N
1,76561197961061498,Demonata,http://steamcommunity.com/profiles/76561197961...,0,1,1,2010-06-03 19:20:51,1,N,N,N,N,N,N,N,N,N


In [23]:
ps_t4_1_2.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197968251326,tpX- NiceDoggy,http://steamcommunity.com/id/Robwsimons/,0,3,1,2013-02-16 22:40:03,1,"Steven, Bentley",103582791431492925,2004-08-16 18:38:24,N,N,N,N,N,N
1,76561197979997346,Clappie -Something Awesome!,http://steamcommunity.com/id/Clappie/,1,3,1,2013-02-28 23:30:52,1,"Ehmm, You proberly allready know it.",103582791431867759,2005-12-26 11:51:35,N,N,N,N,DK,08


In [25]:
ps_t4_2.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197979744990,J |_| T | |_ /-,http://media.steampowered.com/steamcommunity/p...,0,3,N,2011-05-02 03:09:51,N,N,103582791429521408,2005-12-15 06:35:24,N,N,N,N,N,N
1,76561197979765627,,http://media.steampowered.com/steamcommunity/p...,0,3,N,2006-09-24 12:02:59,N,N,103582791429521408,2005-12-18 11:30:47,N,N,N,N,N,N


In [28]:
ps_t4_3.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197978765238,FN#18499510,http://steamcommunity.com/profiles/76561197978...,0,3,N,2009-05-10 18:59:20,N,N,103582791429521408,N,N,N,N,N,N,N
1,76561197982892481,FN#22626753,http://steamcommunity.com/profiles/76561197982...,0,3,N,2009-03-14 13:54:56,N,N,103582791429521408,N,N,N,N,N,N,N


In [30]:
ps_t4_4.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197963349080,"xiGeroNi c""",http://steamcommunity.com/profiles/76561197963...,0,3,1,2013-02-12 14:15:07,N,Steffen Z,103582791429559556,2003-12-13 10:52:54,N,N,N,N,RU,47
1,76561197985127080,"""""""Don Casimir",http://steamcommunity.com/profiles/76561197985...,0,1,1,2012-06-10 08:49:02,N,N,N,N,N,N,N,N,N,N


In [32]:
ps_t5_1_1.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197965776711,"dIzZu ^,-",http://steamcommunity.com/profiles/76561197965...,0,1,1,2013-02-16 04:32:05,N,N,N,N,N,N,N,N,N,N
1,76561197960888341,"Jodzin | too free, do uslyszenia",http://steamcommunity.com/profiles/76561197960...,0,1,1,2012-02-29 14:49:09,N,N,N,N,N,N,N,N,N,N


In [34]:
ps_t5_2.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197968588415,wafox26,http://steamcommunity.com/profiles/76561197968...,0,1,N,N,N,N,N,N,N,N,N,N,N,N
1,76561197968591946,Lord,http://steamcommunity.com/profiles/76561197968...,0,1,1,2013-02-14 22:22:46,N,N,N,N,N,N,N,N,N,N


In [36]:
ps_t6_1.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197968085891,michaenglam,http://steamcommunity.com/profiles/76561197968...,0,3,N,2006-10-05 22:29:41,N,N,103582791429521408,2004-08-07 20:27:18,N,N,N,N,N,N
1,76561197968088971,a1061004,http://steamcommunity.com/profiles/76561197968...,0,3,N,2007-10-12 20:37:18,N,N,103582791429521408,2004-08-07 23:27:56,N,N,N,N,N,N


In [38]:
ps_t7_1.limit(2).toPandas()

Unnamed: 0,steam_id,person_name,profile_url,person_state,community_visibility_state,profile_state,last_logoff,comment_permission,real_name,primary_clanid,time_created,game_id,gameserver_ip,game_extrainfo,city_id,country_code,state_code
0,76561197972507081,Peanuts Hucko,http://steamcommunity.com/profiles/76561197972...,0,3,1,2013-02-15 19:43:07,N,N,103582791429913068,2004-12-27 12:41:50,N,N,N,N,US,N
1,76561197972507183,IIIIkoolaidIIII,http://steamcommunity.com/id/ikoolaidi/,1,3,1,2013-02-16 01:22:04,N,Ben,103582791430274949,2004-12-27 12:43:37,N,N,N,N,N,N


In [39]:
from functools import reduce
from pyspark.sql import DataFrame

In [40]:
dfs = [ps_t1, ps_t2, ps_t3, ps_t4_1_1, ps_t4_1_2, ps_t4_2, ps_t4_3, ps_t4_4, ps_t5_1_1, ps_t5_2, ps_t6_1, ps_t7_1]

In [41]:
player_summary = reduce(DataFrame.unionAll, dfs)
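
A DataFrame union matches columns by position, not by name, so a table whose columns are in a different order would be merged silently into the wrong fields. The plain-Python sketch below compares the column lists before the union is run; `mismatched_tables` is a hypothetical helper and the example column lists are invented.

```python
def mismatched_tables(columns_by_table):
    """Return the names of tables whose column list differs from the first table's."""
    items = list(columns_by_table.items())
    if not items:
        return []
    _first_name, reference = items[0]
    return [name for name, cols in items[1:] if list(cols) != list(reference)]


# Invented example: the third table has two columns swapped.
example = {
    't1': ['steam_id', 'country_code', 'state_code'],
    't2': ['steam_id', 'country_code', 'state_code'],
    't3': ['steam_id', 'state_code', 'country_code'],
}
bad = mismatched_tables(example)
```

In the notebook the input could be built from each DataFrame's `.columns` attribute; an empty result means the positional union is safe with respect to column order.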

In [42]:
player_summary.count()

9881300

In [46]:
player_summary.select('country_code').distinct().count()

237

In [47]:
# Save TABLE
player_summary.write.csv('/user/tamng/jwht/CleanData/player_summary_total.csv', header = True)

### END
---