01-create-dataframe.ipynb
======================

We need **six datasets** for different purposes in this project.

1. Dataset that contains player (node) data
2. Dataset that contains raw telemetry data for general statistics
3. Dataset for cheater analysis
4. Dataset that contains the team IDs of players who took part in team-play matches
5. Dataset for analysing the observation-based mechanism
6. Dataset for estimating the start date of cheating and analysing the victimisation-based mechanism

In [1]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, TimestampType
import pubg_analysis as pubg

## 1. Create a dataset that contains player data.

In [2]:
# Define the structure of player data.
nodeSchema = StructType([StructField("id", StringType(), True),
                         StructField("pname", StringType(), True),
                         StructField("cheating_flag", IntegerType(), True),
                         StructField("ban_date", StringType(), True)])

PATH_TO_FILE = "s3://social-research-cheating/td_nodes.txt"

# Create a table of player data and store it in the S3 bucket.
players = spark.read.options(header='false', delimiter='\t').schema(nodeSchema).csv(PATH_TO_FILE)
players.write.parquet("s3://social-research-cheating/players.parquet")

# Show the top 10 rows of the dataset.
players.show(10)

+--------------------+---------------+-------------+--------+
|                  id|          pname|cheating_flag|ban_date|
+--------------------+---------------+-------------+--------+
|account.1d0281ff2...|      ulimnet10|            0|      NA|
|account.1c295c6c0...|       yoon9242|            0|      NA|
|account.a2b8791d5...|        meco001|            0|      NA|
|account.e3b1eb159...|         forsir|            0|      NA|
|account.65433d8ee...|      jimin0311|            0|      NA|
|account.74c0462cd...|namyoonwoo07074|            0|      NA|
|account.64d031587...|       wreu1234|            0|      NA|
|account.7f874085e...|        kbs4799|            0|      NA|
|account.5c8366a6b...|       ssabu110|            0|      NA|
|account.d89f4429c...|      gusrb0187|            0|      NA|
+--------------------+---------------+-------------+--------+
only showing top 10 rows
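
Note that `ban_date` is read as a string, and non-cheaters carry the literal value `NA` rather than a null. The following is a minimal sketch of how such rows could be normalised, using Python's built-in `csv` module instead of Spark; the sample rows and the `parse_players` helper are made up for illustration and are not part of `pubg_analysis`.

```python
import csv
import io

# Toy tab-separated rows in the same four-column layout (made-up values).
SAMPLE = (
    "account.aaa\tplayer_one\t0\tNA\n"
    "account.bbb\tplayer_two\t1\t2019-03-03\n"
)

def parse_players(text):
    """Parse tab-separated player rows, mapping 'NA' ban dates to None."""
    rows = []
    for pid, pname, flag, ban_date in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({
            "id": pid,
            "pname": pname,
            "cheating_flag": int(flag),
            "ban_date": None if ban_date == "NA" else ban_date,
        })
    return rows

players_sample = parse_players(SAMPLE)
print(players_sample)
```

Keeping `NA` as a plain string, as the notebook does, is also workable as long as later queries filter on `cheating_flag` rather than on `ban_date`.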



As shown below, there are 1,977,329 unique players and 6,161 cheaters in our dataset (no duplicates found).

In [4]:
# Count the number of players and check whether there are any duplicates.
players.createOrReplaceTempView("players")  # registerTempTable is deprecated since Spark 2.0
print(players.count())

test_players = spark.sql("SELECT COUNT(DISTINCT id) FROM players")
test_players.show()

# Count the number of cheaters and check whether there are any duplicates.
test_players = spark.sql("""SELECT COUNT(DISTINCT id) FROM players 
                            WHERE cheating_flag = 1""")
test_players.show()

cheaters = spark.sql("SELECT * FROM players WHERE cheating_flag = 1")
cheaters.createOrReplaceTempView("cheaters")
print(cheaters.count())

1977329
+------------------+
|count(DISTINCT id)|
+------------------+
|           1977329|
+------------------+

+------------------+
|count(DISTINCT id)|
+------------------+
|              6161|
+------------------+

6161
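
The duplicate check above relies on the total row count matching the number of distinct IDs. The same idea can be sketched on toy data with Python's built-in `sqlite3`, used here only as a stand-in for Spark SQL; the rows and the `has_duplicate_ids` helper are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE players (id TEXT, pname TEXT, cheating_flag INTEGER, ban_date TEXT)"
)
conn.executemany(
    "INSERT INTO players VALUES (?, ?, ?, ?)",
    [
        ("account.aaa", "player_one", 0, "NA"),
        ("account.bbb", "player_two", 1, "2019-03-03"),
        ("account.ccc", "player_three", 0, "NA"),
    ],
)

def has_duplicate_ids(connection):
    """Return True if some id appears in more than one row."""
    total, distinct = connection.execute(
        "SELECT COUNT(*), COUNT(DISTINCT id) FROM players"
    ).fetchone()
    return total != distinct

no_dupes = has_duplicate_ids(conn)

# Insert a repeated id to show the check firing.
conn.execute("INSERT INTO players VALUES ('account.aaa', 'player_one', 0, 'NA')")
with_dupes = has_duplicate_ids(conn)
print(no_dupes, with_dupes)
```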


In [5]:
# Count the number of cheaters by ban date.
num_of_cheaters = spark.sql("""SELECT ban_date, COUNT(*) AS num_of_cheaters 
                               FROM cheaters GROUP BY ban_date""")
num_of_cheaters.show()

# Store the table in the S3 bucket for later use (plotting general statistics).
num_of_cheaters.write.parquet("s3://social-research-cheating/general-stats/num_of_cheaters.parquet")

+----------+---------------+
|  ban_date|num_of_cheaters|
+----------+---------------+
|2019-03-03|            258|
|2019-03-11|            511|
|2019-03-28|             99|
|2019-03-07|            262|
|2019-03-20|            112|
|2019-03-19|            116|
|2019-03-01|            103|
|2019-03-23|            176|
|2019-03-30|             93|
|2019-03-16|            107|
|2019-03-05|            228|
|2019-03-29|            114|
|2019-03-25|             89|
|2019-03-31|             89|
|2019-03-14|            139|
|2019-03-15|            132|
|2019-03-10|            135|
|2019-03-17|            144|
|2019-03-22|            118|
|2019-03-26|            170|
+----------+---------------+
only showing top 20 rows



## 2. Create a raw dataset by combining multiple dataframes. 

This dataset will be used for general statistics. The total number of killings in the dataset is 98,319,451.

In [29]:
PATH_TO_RAW_DATA = "s3://social-research-cheating/raw_td.parquet"

pubg.combine_telemetry_data(31, 5, PATH_TO_RAW_DATA)
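
The internals of `pubg.combine_telemetry_data` are not shown in this notebook. One common way to combine many tables into one, and a plausible reason for the `reduce` import at the top, is to chain pairwise unions. The sketch below illustrates that pattern on plain Python lists; `daily_tables` and `union_all` are hypothetical names, and this is not necessarily how the helper is implemented.

```python
from functools import reduce

# Hypothetical per-file tables: each is a list of (mid, src, dst) rows.
daily_tables = [
    [("m1", "a", "b")],
    [("m2", "c", "d"), ("m2", "d", "e")],
    [("m3", "f", "g")],
]

def union_all(tables):
    """Fold a list of tables into one by repeated pairwise concatenation."""
    return reduce(lambda left, right: left + right, tables)

combined = union_all(daily_tables)
print(len(combined))
```

With Spark DataFrames, the analogous fold would call `DataFrame.union` on each pair instead of list concatenation.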

In [2]:
# Read the telemetry data stored in the S3 bucket.
raw_td = spark.read.parquet("s3://social-research-cheating/raw_td.parquet")
raw_td.createOrReplaceTempView("raw_td")

# Count the number of rows (= killings) in the dataframe.
print(raw_td.count())

# The number of killings including self-loops between March 1 and March 3 is 12,276,231.

98319451


In [4]:
raw_td.show(10)

+--------------------+--------------------+--------------------+--------------------+----------+
|                 mid|                 src|                 dst|                time|    m_date|
+--------------------+--------------------+--------------------+--------------------+----------+
|01fd8f35-01ff-48f...|account.f1ef62d78...|account.bf5a2bdf5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.8bd3cc440...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.52accebe5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.6961c79f1...|account.e28657d14...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.caa44db60...|account.749a9649f...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.6b9c75259...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.02fe9c7cb...|2019-03-03 14:20:...|2019-03-03|
+--------------------+--------------------+--------------------+--------------------+----------+

There are 1,146,941 unique matches played during the observation period. 

In [3]:
# Count the number of unique match IDs.
unique_mids = spark.sql("SELECT COUNT(DISTINCT mid) FROM raw_td")
unique_mids.show()

# Count the number of matches by date.
mids_by_date = spark.sql("""SELECT m_date, COUNT(DISTINCT mid) AS num_of_mids 
                            FROM raw_td GROUP BY m_date""")
mids_by_date.show()

# Store the table in the S3 bucket for later use (plotting general statistics).
mids_by_date.write.parquet("s3://social-research-cheating/general-stats/mids_by_date.parquet")

+-------------------+
|count(DISTINCT mid)|
+-------------------+
|            1146941|
+-------------------+

+----------+-----------+
|    m_date|num_of_mids|
+----------+-----------+
|2019-03-03|      45696|
|2019-03-11|      29363|
|2019-03-28|      24271|
|2019-03-07|      31267|
|2019-03-20|      29240|
|2019-03-19|      29523|
|2019-03-01|      48886|
|2019-03-23|      50375|
|2019-03-30|      49550|
|2019-03-16|      50550|
|2019-03-05|      30504|
|2019-03-29|      36189|
|2019-03-25|      29115|
|2019-03-31|      45487|
|2019-03-14|      29890|
|2019-03-15|      37090|
|2019-03-10|      46290|
|2019-03-17|      45816|
|2019-03-22|      36154|
|2019-03-26|      27491|
+----------+-----------+
only showing top 20 rows
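
Because each row of `raw_td` is one killing, a match ID repeats once per kill, which is why the query counts `DISTINCT mid` rather than rows. A minimal pure-Python sketch of the same per-date count, on made-up rows with a hypothetical `matches_by_date` helper:

```python
from collections import defaultdict

# Toy kill records: (mid, src, dst, m_date). Values are made up.
kills = [
    ("m1", "a", "b", "2019-03-01"),
    ("m1", "b", "c", "2019-03-01"),
    ("m2", "d", "e", "2019-03-01"),
    ("m3", "a", "a", "2019-03-02"),
]

def matches_by_date(rows):
    """Count distinct match IDs per date (not the number of kill rows)."""
    mids = defaultdict(set)
    for mid, _src, _dst, m_date in rows:
        mids[m_date].add(mid)
    return {date: len(ids) for date, ids in sorted(mids.items())}

per_date = matches_by_date(kills)
print(per_date)  # {'2019-03-01': 2, '2019-03-02': 1}
```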



## 3. Create a dataset for cheater analysis.

To compare cheaters and non-cheaters, we need to extract the records of matches played between March 1 and March 3.<br>
The number of killings without self-loops between March 1 and March 3 is 12,216,898.

In [5]:
# raw_td = spark.read.parquet("s3://social-research-cheating/raw_td.parquet")
# raw_td.registerTempTable("raw_td")

# Create a small dataset without self-loops.
td = spark.sql("SELECT * FROM raw_td WHERE m_date <= '2019-03-03' AND src != dst")
print(td.count())

# Store the data in the S3 bucket.
td.write.parquet("s3://social-research-cheating/cheater-analysis/data_for_cheater_analysis.parquet")

12216898
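
The filter works because `m_date` holds ISO-formatted dates, so a string comparison against `'2019-03-03'` orders them chronologically. The sketch below runs the same `WHERE` clause on toy rows with Python's built-in `sqlite3` (a stand-in for Spark SQL, with made-up data), and then derives the number of self-loops implied by the two counts reported for March 1 to March 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_td (mid TEXT, src TEXT, dst TEXT, m_date TEXT)")
conn.executemany(
    "INSERT INTO raw_td VALUES (?, ?, ?, ?)",
    [
        ("m1", "a", "b", "2019-03-01"),
        ("m1", "b", "b", "2019-03-01"),  # self-loop, dropped
        ("m2", "c", "d", "2019-03-03"),
        ("m3", "e", "f", "2019-03-04"),  # outside the window, dropped
    ],
)

kept = conn.execute(
    "SELECT mid, src, dst FROM raw_td "
    "WHERE m_date <= '2019-03-03' AND src != dst ORDER BY mid"
).fetchall()
print(kept)

# Counts reported in this notebook for March 1 to March 3.
WITH_SELF_LOOPS = 12_276_231
WITHOUT_SELF_LOOPS = 12_216_898
implied_self_loops = WITH_SELF_LOOPS - WITHOUT_SELF_LOOPS
print(implied_self_loops)
```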


## 4. Create a dataset that contains team membership information.

In [27]:
# Combine tables that contain the team membership information into one table.
PATH_TO_TEAM_DATA = "s3://social-research-cheating/team_data.parquet"

pubg.combine_team_data(31, 6, PATH_TO_TEAM_DATA)

In [3]:
# Read the data stored in the S3 bucket.
PATH_TO_TEAM_DATA = "s3://social-research-cheating/team_data.parquet"
test_data = spark.read.parquet(PATH_TO_TEAM_DATA)

# Show the top 10 rows of the dataset.
test_data.show(10)

# Count the number of rows in the dataframe.
print(test_data.count())
# The number of rows is 93,730,706.

+--------------------+--------------------+---+
|                 mid|                  id|tid|
+--------------------+--------------------+---+
|b6a091d4-2bdb-451...|account.9fbe4bbe5...|  1|
|24d0a877-2d20-43a...|account.9ad264163...| 17|
|866b5d75-0d8f-497...|account.4c10d9e9f...| 47|
|476c22d8-d929-46c...|account.74c896572...| 21|
|499aa106-272e-468...|account.bebee03c5...| 29|
|355aafa1-b7a2-45c...|account.289b29eda...| 13|
|4020041c-a4a6-46f...|account.4d93bc13f...| 35|
|450b9c1c-6bd0-4d7...|account.a8a2ff4b7...| 15|
|79ca6d6c-8f3a-485...|account.452fb2497...| 30|
|02c36bd8-de13-479...|account.1a3ac664c...| 14|
+--------------------+--------------------+---+
only showing top 10 rows

93730706


## 5. Create a dataset for analysing the observation-based mechanism.

The dataset for analysing the observation-based mechanism must keep self-loops, because a player who killed themselves (a self-loop) can no longer observe what happens in the match after dying.<br>
The number of edges is 19,109,055, of which 89,045 are self-loops.

In [2]:
PATH_TO_RAW_DATA = "s3://social-research-cheating/raw_td.parquet"

players = spark.read.parquet("s3://social-research-cheating/players.parquet")
players.createOrReplaceTempView("players")

In [3]:
pubg.create_data_for_obs_mech(PATH_TO_RAW_DATA, players)

In [2]:
# Read the data stored in the S3 bucket.
obs_mech_data = spark.read.parquet("s3://social-research-cheating/obs_mech_data.parquet")

# Count the number of rows in the dataframe.
print(obs_mech_data.count())

19109055


In [5]:
obs_mech_data.createOrReplaceTempView("obs_mech_data")

# Count the number of self-loops.
self_loops = spark.sql("SELECT * FROM obs_mech_data WHERE src = dst")
print(self_loops.count())

89045


## 6. Create a dataset for estimating the start date of cheating and analysing the victimisation-based mechanism.

We need the killing records of matches in which cheaters killed at least one player, with self-loops removed.<br>
We can simply reuse the dataset for the observation-based mechanism and remove the self-loops from it.<br>
The number of edges in this dataset is 19,020,010.

In [2]:
spark.read.parquet("s3://social-research-cheating/obs_mech_data.parquet").createOrReplaceTempView("raw_data")
    
# Remove self-loops and store the dataset in the S3 bucket.
cleaned_data = spark.sql("SELECT * FROM raw_data WHERE src != dst")
cleaned_data.write.parquet("s3://social-research-cheating/vic_mech_data.parquet")

In [3]:
vic_mech_data = spark.read.parquet("s3://social-research-cheating/vic_mech_data.parquet")
print(vic_mech_data.count())

19020010
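
As a quick sanity check, the three counts reported in sections 5 and 6 should be consistent: the edges in the observation-based dataset minus its self-loops should equal the edges in the victimisation-based dataset. A small sketch of that arithmetic (the constant and function names are chosen here for illustration):

```python
# Counts printed earlier in this notebook.
OBS_MECH_EDGES = 19_109_055
OBS_MECH_SELF_LOOPS = 89_045
VIC_MECH_EDGES = 19_020_010

def edges_without_self_loops(total_edges, self_loops):
    """Number of edges left after dropping every self-loop."""
    return total_edges - self_loops

remaining = edges_without_self_loops(OBS_MECH_EDGES, OBS_MECH_SELF_LOOPS)
print(remaining == VIC_MECH_EDGES)
```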
