01-create-dataframe.ipynb
======================

We need **six datasets** for different purposes in this project.

1. Dataset that contains player (node) data
2. Dataset that contains raw telemetry data for general statistics
3. Dataset for cheater analysis
4. Dataset that contains the team IDs of players who took part in teamplay matches
5. Dataset for analysing the observation-based mechanism
6. Dataset for estimating the start date of cheating and analysing the victimisation-based mechanism

## 1. Create a dataset that contains player data and store it in an S3 bucket.

In [1]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, TimestampType
import pubg_analysis as pubg

In [3]:
# Define the structure of player data.
nodeSchema = StructType([StructField("id", StringType(), True),
                         StructField("pname", StringType(), True),
                         StructField("cheating_flag", IntegerType(), True),
                         StructField("ban_date", StringType(), True)])

PATH_TO_FILE = "s3://jinny-capstone-data-test/td_nodes.txt"

# Create a table of player data and store it in the S3 bucket.
players = spark.read.options(header='false', delimiter='\t').schema(nodeSchema).csv(PATH_TO_FILE)
players.write.parquet("s3://jinny-capstone-data-test/players.parquet")

# Show the top 10 rows of the dataset.
players.show(10)

+--------------------+---------------+-------------+--------+
|                  id|          pname|cheating_flag|ban_date|
+--------------------+---------------+-------------+--------+
|account.1d0281ff2...|      ulimnet10|            0|      NA|
|account.1c295c6c0...|       yoon9242|            0|      NA|
|account.a2b8791d5...|        meco001|            0|      NA|
|account.e3b1eb159...|         forsir|            0|      NA|
|account.65433d8ee...|      jimin0311|            0|      NA|
|account.74c0462cd...|namyoonwoo07074|            0|      NA|
|account.64d031587...|       wreu1234|            0|      NA|
|account.7f874085e...|        kbs4799|            0|      NA|
|account.5c8366a6b...|       ssabu110|            0|      NA|
|account.d89f4429c...|      gusrb0187|            0|      NA|
+--------------------+---------------+-------------+--------+
only showing top 10 rows



## 2. Create a raw dataset by combining multiple dataframes. 

This dataset will be used for general statistics.
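
The implementation of `pubg.combine_telemetry_data` and the meaning of its arguments are not shown in this notebook. As a minimal sketch of the general idea only, several tables can be folded into one with `functools.reduce`. The helper name `combine_tables` and the sample rows are hypothetical, and plain lists of rows stand in for DataFrames.

```python
from functools import reduce


def combine_tables(tables, union):
    """Fold a non-empty list of tables into one using a binary `union` function."""
    return reduce(union, tables)


# Hypothetical per-day tables of (mid, src, dst) rows; lists stand in for DataFrames.
day_tables = [
    [("m1", "a", "b")],
    [("m2", "c", "d"), ("m2", "d", "e")],
    [("m3", "f", "g")],
]

# List concatenation plays the role of a DataFrame union here.
combined = combine_tables(day_tables, lambda left, right: left + right)
print(len(combined))
```

With PySpark DataFrames that share a schema, the same fold could take `DataFrame.union` as the `union` argument.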

In [33]:
pubg.combine_telemetry_data(3, 7)

In [34]:
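# Read telemetry data stored in my S3 bucket and register it as a temporary view for SQL queries.
# createOrReplaceTempView returns None, so keep a reference to the dataframe itself.
data_for_test = spark.read.parquet("s3://jinny-capstone-data-test/raw_td.parquet")
data_for_test.createOrReplaceTempView("data_for_test")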

# Count the number of rows in the dataframe.
print(data_for_test.count())

12276231


In [35]:
data_for_test.show(10)

+--------------------+--------------------+--------------------+--------------------+----------+
|                 mid|                 src|                 dst|                time|    m_date|
+--------------------+--------------------+--------------------+--------------------+----------+
|01fd8f35-01ff-48f...|account.f1ef62d78...|account.bf5a2bdf5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.8bd3cc440...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.52accebe5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.6961c79f1...|account.e28657d14...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.caa44db60...|account.749a9649f...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.6b9c75259...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.02fe9c7cb...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.

In [None]:
# Count the number of unique match IDs.
mids = spark.sql("SELECT DISTINCT mid FROM data_for_test")
print(mids.count())

## 3. Create a dataset for cheater analysis.

In [39]:
# To compare cheaters and non-cheaters, extract the records of matches played between March 1 and March 3.
spark.read.parquet("s3://jinny-capstone-data-test/raw_td.parquet").createOrReplaceTempView("raw_td")

# Create a small dataset without self-loops.
td = spark.sql("SELECT * FROM raw_td WHERE m_date <= '2019-03-03' AND src != dst")
# print(td.count())

td.write.parquet("s3://jinny-capstone-data-test/data_for_cheater_analysis.parquet")
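
The filter in the query above can be illustrated in plain Python. This is only a sketch on hypothetical rows, assuming that `m_date` holds ISO `YYYY-MM-DD` values (the column type is not shown here). Such dates sort lexicographically in chronological order, so a plain string comparison against the cut-off is enough.

```python
# Hypothetical rows of (mid, src, dst, m_date); not taken from the real dataset.
rows = [
    ("m1", "a", "b", "2019-03-01"),
    ("m1", "c", "c", "2019-03-01"),  # self-loop: src equals dst
    ("m2", "a", "d", "2019-03-03"),
    ("m3", "b", "a", "2019-03-04"),  # played after the cut-off date
]

CUTOFF = "2019-03-03"


def keep(row):
    """Mirror the WHERE clause: on or before the cut-off and not a self-loop."""
    _, src, dst, m_date = row
    return m_date <= CUTOFF and src != dst


filtered = [row for row in rows if keep(row)]
print(filtered)
```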

## 4. Create a dataset that contains team membership information.

In [55]:
# Combine tables that contain the team membership information into one table.
pubg.combine_team_data(3, 6)

In [44]:
# Read the data stored in the S3 bucket.
data_for_test = spark.read.parquet("s3://jinny-capstone-data-test/team_data.parquet")

# Count the number of rows in the dataframe.
# print(data_for_test.count())

# Show the top 10 rows of the dataset.
data_for_test.show(10)

+--------------------+--------------------+---+
|                 mid|                  id|tid|
+--------------------+--------------------+---+
|4828fb2b-dc29-47d...|account.3cedea336...| 16|
|f140304e-8141-4cd...|account.70e880e9e...| 23|
|f9bf6dd2-1c39-4d9...|account.dbc461b68...|  1|
|04797c91-e78a-490...|account.a1016974f...| 12|
|04797c91-e78a-490...|account.3306c7c07...| 19|
|ee5b9236-67cc-4c2...|account.b48012570...| 22|
|a4202a45-023f-4e7...|account.4841d206b...| 15|
|95e6084f-1aaf-455...|account.19b4bd6c3...|  6|
|711ba8f0-076e-426...|account.12c5b465a...|  5|
|711ba8f0-076e-426...|account.e5f6b3bd4...| 15|
+--------------------+--------------------+---+
only showing top 10 rows



## 5. Create a dataset for analysing the observation-based mechanism.

This dataset must keep self-loops (records in which a player killed themselves). Such a player is dead from that point on and cannot observe anything that happens later in the match, so dropping the record would lose the time of their death.
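
To make the role of self-loops concrete, here is a minimal sketch on hypothetical records. It assumes, for illustration only, that a player can observe a kill when it happens before their own death. The project's actual definition of an observation is not shown in this notebook, and the function names are made up.

```python
from datetime import datetime

# Hypothetical kill records for one match: (killer, victim, time of the kill).
kills = [
    ("A", "B", datetime(2019, 3, 3, 14, 19)),
    ("C", "C", datetime(2019, 3, 3, 14, 20)),  # self-loop: C killed themselves
    ("A", "D", datetime(2019, 3, 3, 14, 21)),
]


def death_times(records):
    """Map each victim to the time they died, self-loops included."""
    return {victim: time for _, victim, time in records}


def observable_kills(player, records):
    """Return the kills that happened while `player` was still alive."""
    died_at = death_times(records).get(player)
    return [r for r in records if died_at is None or r[2] < died_at]


# With self-loops kept, C's death time is known, so later kills are excluded.
with_loops = observable_kills("C", kills)

# With self-loops removed, C looks alive for the whole match.
no_loops = [r for r in kills if r[0] != r[1]]
without_loops = observable_kills("C", no_loops)

print(len(with_loops), len(without_loops))
```

In this toy match, removing the self-loop makes the later kill of D look observable to C, which is the error that keeping self-loops avoids.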

In [50]:
pubg.create_data_for_obs_mech("s3://jinny-capstone-data-test/raw_td.parquet", players)

In [51]:
# Read the data stored in the S3 bucket.
data_for_test = spark.read.parquet("s3://jinny-capstone-data-test/data_for_obs_mech.parquet")

# Count the number of rows in the dataframe.
print(data_for_test.count())

4381687


## 6. Create a dataset for estimating the start date of cheating and analysing the victimisation-based mechanism.

We need the killing records, excluding self-loops, of matches where cheaters killed at least one player.<br>
We can simply reuse the dataset for the observation-based mechanism by removing the self-loops from it.
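
The selection described above can be sketched in plain Python. The records, the cheater IDs, and the function names are hypothetical; this shows the filtering logic only, not the project's implementation.

```python
# Hypothetical kill records: (match ID, killer, victim).
records = [
    ("m1", "cheater_1", "p1"),
    ("m1", "p2", "p2"),  # self-loop
    ("m1", "p3", "p4"),
    ("m2", "p5", "p6"),  # no cheater killed anyone in m2
]

# Hypothetical set of cheater IDs.
cheaters = {"cheater_1"}


def matches_with_cheater_kills(rows, cheater_ids):
    """Return IDs of matches in which a cheater killed at least one other player."""
    return {mid for mid, src, dst in rows if src in cheater_ids and src != dst}


def victimisation_records(rows, cheater_ids):
    """Keep records from those matches and drop self-loops."""
    keep = matches_with_cheater_kills(rows, cheater_ids)
    return [(mid, src, dst) for mid, src, dst in rows if mid in keep and src != dst]


result = victimisation_records(records, cheaters)
print(result)
```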

In [52]:
spark.read.parquet("s3://jinny-capstone-data-test/data_for_obs_mech.parquet").createOrReplaceTempView("raw_data")
    
# Remove self-loops and store the dataset in the S3 bucket.
cleaned_data = spark.sql("SELECT * FROM raw_data WHERE src != dst")
cleaned_data.write.parquet("s3://jinny-capstone-data-test/data_for_vic_mech.parquet")

In [53]:
data_for_test = spark.read.parquet("s3://jinny-capstone-data-test/data_for_vic_mech.parquet")
print(data_for_test.count())

4360974
