### 01. Create a table for each dataset in Parquet format.

---
We use **six datasets** for different purposes in this project.

1. Dataset that contains player (node) data
2. Dataset that contains raw telemetry data for general statistics
3. Dataset for cheater analysis
4. Dataset that contains the team IDs of players who took part in teamplay matches
5. Dataset for estimating the start date of cheating and analysing the victimisation-based mechanism
6. Dataset for analysing the observation-based mechanism

In [1]:
from functools import reduce
from pyspark.sql.functions import col, lit, when
from pyspark.sql.types import StructField, StructType, StringType, IntegerType, TimestampType
import pubg_analysis as pubg
import pandas as pd

### 1. Create a dataset that contains player data.

The table below describes the variables in the player data:

| Variable   | Explanation   
|:-----------|:-------
| id         | ID of the player               
| pname      | nickname of the player  
| cheating_flag     | 1 if the player was banned, 0 otherwise
| ban_date   | date in the format YYYY-MM-DD when the cheater was banned 

As shown below, the dataset contains 1,977,329 unique players, 6,161 of whom are cheaters.

In [2]:
# Define the structure of player data.
nodeSchema = StructType([StructField("id", StringType(), True),
                         StructField("pname", StringType(), True),
                         StructField("cheating_flag", IntegerType(), True),
                         StructField("ban_date", StringType(), True)])

# Create a table of player data and store it in the S3 bucket.
PATH_TO_FILE = "s3://social-research-cheating/td_nodes.txt"

players = spark.read.options(header='false', delimiter='\t').schema(nodeSchema).csv(PATH_TO_FILE)
players.write.parquet("s3://social-research-cheating/players.parquet")
players.registerTempTable("players")

# Show the top 10 rows of the dataset.
players.show(10)

+--------------------+---------------+-------------+--------+
|                  id|          pname|cheating_flag|ban_date|
+--------------------+---------------+-------------+--------+
|account.1d0281ff2...|      ulimnet10|            0|      NA|
|account.1c295c6c0...|       yoon9242|            0|      NA|
|account.a2b8791d5...|        meco001|            0|      NA|
|account.e3b1eb159...|         forsir|            0|      NA|
|account.65433d8ee...|      jimin0311|            0|      NA|
|account.74c0462cd...|namyoonwoo07074|            0|      NA|
|account.64d031587...|       wreu1234|            0|      NA|
|account.7f874085e...|        kbs4799|            0|      NA|
|account.5c8366a6b...|       ssabu110|            0|      NA|
|account.d89f4429c...|      gusrb0187|            0|      NA|
+--------------------+---------------+-------------+--------+
only showing top 10 rows



In [4]:
# Count the number of players and check whether there are any duplicates.
print(players.count())

test_players = spark.sql("SELECT COUNT(DISTINCT id) FROM players")
test_players.show()

# Count the number of cheaters and check whether there are any duplicates.
test_players = spark.sql("""SELECT COUNT(DISTINCT id) FROM players 
                            WHERE cheating_flag = 1""")
test_players.show()

cheaters = spark.sql("SELECT * FROM players WHERE cheating_flag = 1")
cheaters.registerTempTable("cheaters")
print(cheaters.count())

1977329
+------------------+
|count(DISTINCT id)|
+------------------+
|           1977329|
+------------------+

+------------------+
|count(DISTINCT id)|
+------------------+
|              6161|
+------------------+

6161
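
The duplicate check above compares the total row count with the number of distinct IDs. A minimal plain-Python sketch of the same idea on toy rows follows; the sample IDs and names are invented for illustration, and this is not the Spark job itself.

```python
# Toy player rows: (id, pname, cheating_flag, ban_date). All values are invented.
rows = [
    ("account.a", "alice", 0, "NA"),
    ("account.b", "bob", 1, "2019-03-03"),
    ("account.c", "carol", 0, "NA"),
]


def has_duplicates(rows):
    """Return True if any player ID appears more than once."""
    ids = [r[0] for r in rows]
    return len(ids) != len(set(ids))


num_players = len(rows)
num_cheaters = sum(1 for r in rows if r[2] == 1)
print(num_players, num_cheaters, has_duplicates(rows))
```

If the row count and the distinct count agree, as they do in the output above (1,977,329 in both cases), each player appears exactly once.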


In [5]:
# Count the number of cheaters by ban date.
num_of_cheaters = spark.sql("""SELECT ban_date, COUNT(*) AS num_of_cheaters 
                               FROM cheaters GROUP BY ban_date""")
num_of_cheaters.show()

# Store the table in the S3 bucket for later use (plotting general statistics).
num_of_cheaters.write.parquet("s3://social-research-cheating/general-stats/num_of_cheaters.parquet")

+----------+---------------+
|  ban_date|num_of_cheaters|
+----------+---------------+
|2019-03-03|            258|
|2019-03-11|            511|
|2019-03-28|             99|
|2019-03-07|            262|
|2019-03-20|            112|
|2019-03-19|            116|
|2019-03-01|            103|
|2019-03-23|            176|
|2019-03-30|             93|
|2019-03-16|            107|
|2019-03-05|            228|
|2019-03-29|            114|
|2019-03-25|             89|
|2019-03-31|             89|
|2019-03-14|            139|
|2019-03-15|            132|
|2019-03-10|            135|
|2019-03-17|            144|
|2019-03-22|            118|
|2019-03-26|            170|
+----------+---------------+
only showing top 20 rows



### 2. Create a raw dataset that contains killings. 

This dataset will be used for general statistics.

The table below describes the variables in the telemetry data:

| Variable   | Explanation   
|:-----------|:-------
| mid         | ID of the match               
| src      | ID of the killer  
| dst     | ID of the victim 
| time   | time in the format YYYY-MM-DD HH:MM:SS.SSS Z when the attack (killing) happened
| m_date   | date in the format YYYY-MM-DD when the match was played 

There are 1,146,941 unique matches played during the observation period.<br>
The total number of killings (edges) including self-loops in the dataset is 98,319,451.

In [2]:
file_nums = [(1, 7), (2, 7), (3, 7), (4, 4), (5, 4), 
             (6, 4), (7, 4), (8, 5), (9, 7), (10, 6),
             (11, 4), (12, 4)]

In [3]:
PATH_TO_RAW_DATA = "s3://social-research-cheating/edges/raw_td.parquet"

for tup in file_nums:
    pubg.combine_telemetry_data(tup[0], tup[1], PATH_TO_RAW_DATA)

In [2]:
# Read telemetry data stored in my S3 bucket.
raw_td = spark.read.parquet("s3://social-research-cheating/raw_td.parquet")
raw_td.registerTempTable("raw_td")

# Count the number of rows (= killings) in the dataframe.
print(raw_td.count())

98319451


In [4]:
# Show the top 10 rows of the dataset.
raw_td.show(10)

+--------------------+--------------------+--------------------+--------------------+----------+
|                 mid|                 src|                 dst|                time|    m_date|
+--------------------+--------------------+--------------------+--------------------+----------+
|01fd8f35-01ff-48f...|account.f1ef62d78...|account.bf5a2bdf5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.8bd3cc440...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.e80a530e6...|account.52accebe5...|2019-03-03 14:19:...|2019-03-03|
|01fd8f35-01ff-48f...|account.6961c79f1...|account.e28657d14...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.caa44db60...|account.749a9649f...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.6b9c75259...|2019-03-03 14:20:...|2019-03-03|
|01fd8f35-01ff-48f...|account.749a9649f...|account.02fe9c7cb...|2019-03-03 14:20:...|2019-03-03|
+--------------------+--------------------+--------------------+--------------------+----------+
(remaining output truncated)

In [3]:
# Count the number of unique match IDs.
unique_mids = spark.sql("SELECT COUNT(DISTINCT mid) FROM raw_td")
unique_mids.show()

# Count the number of matches by date.
mids_by_date = spark.sql("""SELECT m_date, COUNT(DISTINCT mid) AS num_of_mids 
                            FROM raw_td GROUP BY m_date""")
mids_by_date.show()

# Store the table in the S3 bucket for later use (plotting general statistics).
mids_by_date.write.parquet("s3://social-research-cheating/general-stats/mids_by_date.parquet")

+-------------------+
|count(DISTINCT mid)|
+-------------------+
|            1146941|
+-------------------+

+----------+-----------+
|    m_date|num_of_mids|
+----------+-----------+
|2019-03-03|      45696|
|2019-03-11|      29363|
|2019-03-28|      24271|
|2019-03-07|      31267|
|2019-03-20|      29240|
|2019-03-19|      29523|
|2019-03-01|      48886|
|2019-03-23|      50375|
|2019-03-30|      49550|
|2019-03-16|      50550|
|2019-03-05|      30504|
|2019-03-29|      36189|
|2019-03-25|      29115|
|2019-03-31|      45487|
|2019-03-14|      29890|
|2019-03-15|      37090|
|2019-03-10|      46290|
|2019-03-17|      45816|
|2019-03-22|      36154|
|2019-03-26|      27491|
+----------+-----------+
only showing top 20 rows



### 3. Create a dataset for cheater analysis.

To compare cheaters and non-cheaters, we need to extract the records of matches played between March 1 and March 3.<br>
The number of killings without self-loops between March 1 and March 3 is 12,216,898.

The table below describes the variables in the data for cheater analysis:

| Variable   | Explanation   
|:-----------|:-------
| mid         | ID of the match               
| src      | ID of the killer  
| dst     | ID of the victim 
| time   | time in the format YYYY-MM-DD HH:MM:SS.SSS Z when the attack (killing) happened
| m_date   | date in the format YYYY-MM-DD when the match was played 

In [5]:
raw_td = spark.read.parquet("s3://social-research-cheating/raw_td.parquet")
raw_td.registerTempTable("raw_td")

# Create a small dataset without self-loops.
# Rows with a NULL src or dst (invalid edges) are dropped as well,
# because `src != dst` evaluates to NULL for them.
td = spark.sql("SELECT * FROM raw_td WHERE m_date <= '2019-03-03' AND src != dst")
print(td.count())

# Store the data in the S3 bucket.
td.write.parquet("s3://social-research-cheating/cheater-analysis/data_for_cheater_analysis.parquet")

12216898
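
The filter `src != dst` removes self-loops and, as a side effect, rows whose killer or victim is NULL, because a comparison with NULL is neither true nor false in SQL. The sketch below shows this behaviour on a toy table. It uses the standard-library `sqlite3` module purely as a stand-in for Spark SQL (an assumption made so that the example runs without a cluster), and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_td (mid TEXT, src TEXT, dst TEXT, m_date TEXT)")
conn.executemany(
    "INSERT INTO raw_td VALUES (?, ?, ?, ?)",
    [
        ("m1", "a", "b", "2019-03-01"),   # valid killing: kept
        ("m1", "c", "c", "2019-03-02"),   # self-loop: dropped
        ("m2", None, "d", "2019-03-03"),  # NULL killer: dropped
        ("m3", "e", "f", "2019-03-04"),   # after March 3: dropped
    ],
)

# Same WHERE clause as in the cell above.
kept = conn.execute(
    "SELECT mid, src, dst FROM raw_td "
    "WHERE m_date <= '2019-03-03' AND src != dst"
).fetchall()
print(kept)
conn.close()
```

Dates in the `YYYY-MM-DD` format sort lexicographically in chronological order, which is why the string comparison on `m_date` works as a date filter.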


### 4. Create a dataset that contains team membership information.

The table below describes the variables in the team membership data:

| Variable   | Explanation   
|:-----------|:-------
| mid        | ID of the match               
| id     | ID of the player  
| tid     | ID of the team

The number of teamplay matches is 1,022,520.

In [27]:
# Combine tables that contain the team membership information into one table.
PATH_TO_TEAM_DATA = "s3://social-research-cheating/team_data.parquet"

pubg.combine_team_data(31, 6, PATH_TO_TEAM_DATA)

In [3]:
# Read the data stored in the S3 bucket.
PATH_TO_TEAM_DATA = "s3://social-research-cheating/team_data.parquet"
team_data = spark.read.parquet(PATH_TO_TEAM_DATA)

# Show the top 10 rows of the dataset.
team_data.show(10)

# Count the number of rows in the dataframe.
print(team_data.count())
# The number of rows is 93,730,706.

+--------------------+--------------------+---+
|                 mid|                  id|tid|
+--------------------+--------------------+---+
|b6a091d4-2bdb-451...|account.9fbe4bbe5...|  1|
|24d0a877-2d20-43a...|account.9ad264163...| 17|
|866b5d75-0d8f-497...|account.4c10d9e9f...| 47|
|476c22d8-d929-46c...|account.74c896572...| 21|
|499aa106-272e-468...|account.bebee03c5...| 29|
|355aafa1-b7a2-45c...|account.289b29eda...| 13|
|4020041c-a4a6-46f...|account.4d93bc13f...| 35|
|450b9c1c-6bd0-4d7...|account.a8a2ff4b7...| 15|
|79ca6d6c-8f3a-485...|account.452fb2497...| 30|
|02c36bd8-de13-479...|account.1a3ac664c...| 14|
+--------------------+--------------------+---+
only showing top 10 rows

93730706


In [2]:
# Read the data stored in the S3 bucket. 
team_data = spark.read.parquet("s3://social-research-cheating/team_data.parquet")
team_data.registerTempTable("team_data")

raw_td = spark.read.parquet("s3://social-research-cheating/raw_td.parquet")
raw_td.registerTempTable("raw_td")

In [3]:
# Get unique match IDs in the raw data.
unique_mids = spark.sql("SELECT DISTINCT mid FROM raw_td")
unique_mids.registerTempTable("unique_mids")
unique_mids.write.parquet("s3://social-research-cheating/general-stats/unique_mids.parquet")

# Get unique match IDs in the team membership data.
unique_team_mids = spark.sql("SELECT DISTINCT mid FROM team_data")
unique_team_mids.registerTempTable("unique_team_mids")

In [5]:
# Get the match IDs that appear in both tables and store them.
team_mids = spark.sql("SELECT t.mid FROM unique_team_mids t JOIN unique_mids m ON t.mid = m.mid")
team_mids.write.parquet("s3://social-research-cheating/general-stats/unique_team_mids.parquet")

In [6]:
team_mids = spark.read.parquet("s3://social-research-cheating/general-stats/unique_team_mids.parquet")
team_mids.registerTempTable("team_mids")

# Count the number of unique match IDs in the team membership data.
team_mid_cnt = spark.sql("SELECT COUNT(DISTINCT mid) FROM team_mids")
team_mid_cnt.show()

+-------------------+
|count(DISTINCT mid)|
+-------------------+
|            1022520|
+-------------------+
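
The join between `unique_team_mids` and `unique_mids` amounts to a set intersection of match IDs. A toy sketch in plain Python, with invented IDs:

```python
# Match IDs seen in the telemetry data and in the team membership data (invented).
raw_mids = {"m1", "m2", "m3", "m4"}
all_team_mids = {"m2", "m3", "m5"}

# Teamplay matches are the IDs present in both sources.
team_mids = raw_mids & all_team_mids
print(len(team_mids))
```

Here `m5` is ignored because it has no telemetry, and `m1` and `m4` are ignored because they have no team membership records.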



In [4]:
# Create a small team dataset.
team_data = spark.read.parquet("s3://social-research-cheating/edges/small_team_data.parquet")
team_data.registerTempTable("team_data")

obs_data = spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet")
obs_data.registerTempTable("obs_data")

# Get a list of unique match IDs from 'obs_data'.
obs_mids = spark.sql("SELECT DISTINCT mid FROM obs_data")
obs_mids.registerTempTable("obs_mids")

# Keep the team membership rows of the matches that appear in 'obs_data'.
team_mids = spark.sql("SELECT t.mid, id, tid FROM team_data t JOIN obs_mids o ON t.mid = o.mid")
team_mids.write.parquet("s3://social-research-cheating/edges/tiny_team_data.parquet")

### 5. Create a dataset for analysing the observation-based mechanism.

The dataset for analysing the observation-based mechanism should contain self-loops because players who killed themselves (self-loops) cannot observe what happens in the match after they die.<br>
To reduce the amount of data, we extract the matches where at least one player was killed by a cheater.<br>
The number of unique match IDs in this dataset is 19,216.<br>

The table below describes the variables in the data for analysing the observation-based mechanism:

| Variable   | Explanation   
|:-----------|:-------
| mid         | ID of the match               
| src      | ID of the killer
| src_sd      | date in the format YYYY-MM-DD when the killer started cheating ('NA' if the player is a non-cheater)
| src_bd      | date in the format YYYY-MM-DD when the killer was banned ('NA' if the player is a non-cheater)
| src_curr_flag      | 1 if the killer was cheating on the date when the match was played
| src_flag      | 1 if the killer was banned, 0 otherwise
| dst     | ID of the victim
| dst_sd      | date in the format YYYY-MM-DD when the victim started cheating ('NA' if the player is a non-cheater)
| dst_bd      | date in the format YYYY-MM-DD when the victim was banned ('NA' if the player is a non-cheater)
| dst_curr_flag      | 1 if the victim was cheating on the date when the match was played
| dst_flag      | 1 if the victim was banned, 0 otherwise
| time   | time in the format YYYY-MM-DD HH:MM:SS.SSS Z when the attack (killing) happened
| m_date   | date in the format YYYY-MM-DD when the match was played

The number of edges is 1,693,699 and there are 7,522 self-loops in this dataset.
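
`pubg.get_obs_data` is not shown in this notebook, so the following is only a guess at the selection rule described above: keep every edge, self-loops included, of each match that has at least one killing by a cheater. The function name `select_obs_edges` and all IDs are hypothetical.

```python
# Toy kill log: (mid, src, dst). All IDs are invented for illustration.
edges = [
    ("m1", "cheater_1", "p2"),
    ("m1", "p3", "p3"),  # self-loop in a match with a cheater kill: kept
    ("m2", "p4", "p5"),  # no cheater kill in this match: dropped
]
cheaters = {"cheater_1"}


def select_obs_edges(edges, cheaters):
    """Keep all edges of matches that contain at least one kill by a cheater."""
    mids = {mid for mid, src, _ in edges if src in cheaters}
    return [e for e in edges if e[0] in mids]


obs_edges = select_obs_edges(edges, cheaters)
num_self_loops = sum(1 for _, src, dst in obs_edges if src == dst)
print(len(obs_edges), num_self_loops)
```

The real function may also take into account whether the killer was cheating on the match date (see `src_curr_flag`); that detail is not modelled here.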

In [2]:
PATH_TO_RAW_DATA = "s3://social-research-cheating/raw_td.parquet"
players = spark.read.parquet("s3://social-research-cheating/nodes.parquet")
players.registerTempTable("players")

# Get the logs of the matches in which at least one cheater took part.
pubg.get_obs_data(PATH_TO_RAW_DATA, players)

In [3]:
obs_data = spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet")
obs_data.registerTempTable("obs_data")

# Count the number of rows in the dataframe.
print(obs_data.count())

# Count the number of self-loops.
self_loops = spark.sql("SELECT * FROM obs_data WHERE src = dst")
print(self_loops.count())

# The number of edges is 1,693,699 and there are 7,522 self-loops in this dataset.

1693699
7522


### 6. Create a dataset for analysing the victimisation-based mechanism.

We need the killing records, excluding self-loops, of matches where cheaters killed at least one player.<br>
We can simply reuse the dataset for the observation-based mechanism after removing its self-loops.<br>
Thus, the number of edges should be 1,693,699 - 7,522 = 1,686,177.

The table below describes the variables in the data for analysing the victimisation-based mechanism:

| Variable   | Explanation   
|:-----------|:-------
| mid         | ID of the match               
| src      | ID of the killer
| src_sd      | date in the format YYYY-MM-DD when the killer started cheating ('NA' if the player is a non-cheater)
| src_bd      | date in the format YYYY-MM-DD when the killer was banned ('NA' if the player is a non-cheater)
| src_curr_flag      | 1 if the killer was cheating on the date when the match was played
| src_flag      | 1 if the killer was banned, 0 otherwise
| dst     | ID of the victim
| dst_sd      | date in the format YYYY-MM-DD when the victim started cheating ('NA' if the player is a non-cheater)
| dst_bd      | date in the format YYYY-MM-DD when the victim was banned ('NA' if the player is a non-cheater)
| dst_curr_flag      | 1 if the victim was cheating on the date when the match was played
| dst_flag      | 1 if the victim was banned, 0 otherwise
| time   | time in the format YYYY-MM-DD HH:MM:SS.SSS Z when the attack (killing) happened
| m_date   | date in the format YYYY-MM-DD when the match was played

The number of edges in this dataset is 1,686,177.

In [5]:
# Create a dataset for analysing the victimisation-based mechanism.
spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet").createOrReplaceTempView("raw_data")
cleaned_data = spark.sql("SELECT * FROM raw_data WHERE src != dst") # Remove self-loops.
cleaned_data.write.parquet("s3://social-research-cheating/edges/vic_data.parquet")

In [6]:
vic_data = spark.read.parquet("s3://social-research-cheating/edges/vic_data.parquet")
print(vic_data.count())

# The number of edges should be 1,693,699 - 7,522 = 1,686,177.

1686177
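
The expected edge count can be checked with simple arithmetic, and the same identity holds for any edge list: removing self-loops reduces the number of edges by exactly the number of self-loops. A small sketch (the toy edges are invented; the two totals are the ones reported above):

```python
# Totals reported for the observation dataset.
total_edges = 1_693_699
self_loops = 7_522
expected_vic_edges = total_edges - self_loops
print(expected_vic_edges)  # 1686177

# The same identity on a toy edge list of (src, dst) pairs.
toy_edges = [("a", "b"), ("c", "c"), ("d", "e"), ("f", "f")]
toy_no_loops = [(s, d) for s, d in toy_edges if s != d]
toy_loops = sum(1 for s, d in toy_edges if s == d)
print(len(toy_no_loops) == len(toy_edges) - toy_loops)
```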


### 7. Check the number of winners and test whether winners have the same team ID for each match.

In [2]:
# Read a table that contains killings.
obs_data = spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet")
obs_data.registerTempTable("obs_data")

# Read a table that contains team membership data.
team_info = spark.read.parquet("s3://social-research-cheating/edges/tiny_team_data.parquet")
team_info.registerTempTable("team_ids")

players = spark.read.parquet("s3://social-research-cheating/nodes.parquet")
players.registerTempTable("players")

In [4]:
# Get a list of mids and m_dates.
match_info = spark.sql("SELECT DISTINCT mid, m_date FROM obs_data")
match_info.registerTempTable("match_info")

# Get a list of victims for each match.
victims = spark.sql("SELECT DISTINCT mid, dst FROM obs_data")
victims.registerTempTable("victims")

# Get a list of winners for each match.
winners = spark.sql("""SELECT DISTINCT o.mid, src FROM obs_data o 
                       WHERE NOT EXISTS (SELECT mid, dst FROM victims v WHERE o.mid = v.mid AND o.src = v.dst)""")
winners.registerTempTable("winners")

# Add team information.
add_tids = spark.sql("""SELECT w.mid, src, CASE WHEN tid IS NULL THEN 'NA' ELSE tid END AS src_tid
                        FROM winners w LEFT JOIN team_ids t ON w.mid = t.mid AND w.src = t.id""")
add_tids.registerTempTable("add_tids")

# Add m_dates.
temp_tab = spark.sql("""SELECT a.mid, src, src_tid, m_date 
                        FROM add_tids a LEFT JOIN match_info m ON a.mid = m.mid""")
temp_tab.registerTempTable("temp_tab")

# Find the matches where at least one winner's team ID is 'NA'.
na_tids = spark.sql("SELECT DISTINCT mid FROM add_tids WHERE src_tid = 'NA'")
na_tids.registerTempTable("na_tids")

# Flag potential cheaters: banned players whose match date precedes the date they started cheating.
winners = spark.sql("""SELECT t.*, 
                       CASE WHEN cheating_flag = 1 AND m_date < start_date THEN 1 ELSE 0 END AS pot_flag 
                       FROM temp_tab t LEFT JOIN players p ON t.src = p.id""")
winners.registerTempTable("winners")

# Count the number of winners and that of unique teams for each match. 
cnt_tab = spark.sql("""SELECT mid, COUNT(src) AS winner_cnt, 
                       COUNT(DISTINCT src_tid) AS tid_cnt, SUM(pot_flag) AS pot_cnt 
                       FROM winners GROUP BY mid""")
cnt_tab.registerTempTable("cnt_tab")

summary_tab = spark.sql("""SELECT c.mid, winner_cnt, tid_cnt, pot_cnt, 
                           CASE WHEN n.mid IS NULL THEN 0 ELSE 1 END AS na_flag 
                           FROM cnt_tab c LEFT JOIN na_tids n ON c.mid = n.mid""")
summary_tab.registerTempTable("summary_tab")
summary_tab.show(10)

# summary_tab.write.parquet("s3://social-research-cheating/general-stats/sum_tab_of_winners.parquet")

+--------------------+----------+-------+-------+-------+
|                 mid|winner_cnt|tid_cnt|pot_cnt|na_flag|
+--------------------+----------+-------+-------+-------+
|0143e2da-14d2-4d8...|         9|      6|      0|      0|
|036a8903-186b-45f...|         4|      2|      0|      0|
|080d5622-6b94-4d7...|         3|      2|      0|      0|
|0c7d472e-5064-4d4...|         2|      2|      0|      0|
|0ef25288-88d3-476...|         2|      1|      0|      0|
|1203abce-50ec-40d...|         4|      4|      0|      0|
|1574a6bb-a63f-473...|         5|      2|      0|      0|
|16d6f605-4118-4de...|         4|      3|      0|      0|
|1773f8d7-b807-439...|         3|      2|      0|      0|
|194e1d81-b65c-4dc...|         4|      2|      0|      0|
+--------------------+----------+-------+-------+-------+
only showing top 10 rows
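
The `NOT EXISTS` query above treats a player as a winner if they appear as a killer in a match but never as a victim in that match. A plain-Python sketch of that rule on a toy kill log (all IDs invented; `find_winners` is a hypothetical helper, not part of `pubg_analysis`):

```python
from collections import defaultdict

# Toy kill log: (mid, src, dst).
kills = [
    ("m1", "a", "b"),
    ("m1", "d", "c"),
    ("m1", "a", "d"),
    ("m2", "x", "y"),
    ("m2", "z", "z"),  # self-loop: z is both killer and victim
]


def find_winners(kills):
    """Map each match ID to the killers who were never killed in that match."""
    killers = defaultdict(set)
    victims = defaultdict(set)
    for mid, src, dst in kills:
        killers[mid].add(src)
        victims[mid].add(dst)
    return {mid: killers[mid] - victims[mid] for mid in killers}


winners = find_winners(kills)
print(winners)
```

Because winners are drawn from the `src` column, a player who survived without killing anyone would not be picked up by this rule.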



In [2]:
import pandas as pd

summary_tab = spark.read.parquet("s3://social-research-cheating/general-stats/sum_tab_of_winners.parquet")
summary_tab.registerTempTable("summary_tab")

temp = spark.sql("SELECT * FROM summary_tab WHERE tid_cnt > 1 AND pot_cnt >= 1")
temp.show(10)
print(temp.count())

# Store a list of match IDs with multiple winners.
# temp_df = temp.toPandas()
# temp_df.to_csv('mids_multiple_winners.csv')

+--------------------+----------+-------+-------+-------+
|                 mid|winner_cnt|tid_cnt|pot_cnt|na_flag|
+--------------------+----------+-------+-------+-------+
|013caebc-8504-4d7...|         4|      4|      1|      0|
|0bd6149a-c6f5-4ed...|         9|      7|      1|      0|
|0c2c1334-9af0-41d...|        11|      6|      1|      0|
|2dc03f99-5d44-42e...|         7|      5|      1|      0|
|35866cf5-93de-48a...|         4|      2|      1|      0|
|391b03c1-3393-4af...|         9|      5|      1|      0|
|3bbd09e0-d4af-4ac...|         5|      3|      1|      0|
|456bc019-80ee-4c6...|         4|      3|      1|      0|
|86ef180f-da6b-4b2...|         5|      3|      1|      0|
|9c7144ce-008e-41d...|         5|      3|      1|      0|
+--------------------+----------+-------+-------+-------+
only showing top 10 rows

1964


In [2]:
summary_tab = spark.read.parquet("s3://social-research-cheating/general-stats/sum_tab_of_winners.parquet")
summary_tab.registerTempTable("summary_tab")

temp = spark.sql("SELECT * FROM summary_tab WHERE tid_cnt > 1 AND pot_cnt >= 1 AND na_flag = 1")
temp.show(10)
print(temp.count())

+--------------------+----------+-------+-------+-------+
|                 mid|winner_cnt|tid_cnt|pot_cnt|na_flag|
+--------------------+----------+-------+-------+-------+
|9d8edf15-f814-48f...|         5|      2|      1|      1|
|b2c7e5a4-f0f0-48d...|         5|      3|      1|      1|
|dfae8103-19b6-4c1...|         7|      6|      1|      1|
|b62ae865-af8e-4e3...|         3|      2|      2|      1|
|2da2cc0d-41d1-487...|         9|      5|      1|      1|
|12bcdfe5-34a4-473...|         4|      3|      1|      1|
|13c1ad12-8e12-4a1...|         9|      6|      1|      1|
|bb78c330-ea48-42a...|         4|      2|      1|      1|
|7b6c2381-afde-452...|        13|      9|      1|      1|
|f2f76e66-9fb7-40d...|         8|      6|      1|      1|
+--------------------+----------+-------+-------+-------+
only showing top 10 rows

73


In [8]:
uniq_mids = spark.sql("SELECT DISTINCT mid FROM obs_data")
print(uniq_mids.count())

19216


In [4]:
import pandas as pd

temp = spark.sql("SELECT * FROM summary_tab WHERE tid_cnt > 1 AND pot_cnt >= 1 AND na_flag = 1")
temp_df = temp.toPandas()
temp_df.to_csv('na_flags.csv')

### 8. Create a dataset that contains the ranks of teams (for teamplay matches).

The table below describes the variables in the team rank data:

| Variable   | Explanation   
|:-----------|:-------
| mid        | ID of the match               
| tid     | ID of the team
| mod     | game mode of the match
| rank     | rank of the team (integer)
| m_date     | date in the format YYYY-MM-DD when the match was played

In [2]:
# Create a dataframe that contains the ranks of teams for each teamplay match.
PATH_TO_DATA = "s3://social-research-cheating/edges/team_ranks.parquet"

team_data = pubg.get_team_ranks("md_day_1_1")
team_data.write.parquet(PATH_TO_DATA)
    
for i in range(2, 7):
    team_data = pubg.get_team_ranks("md_day_1_" + str(i))
    team_data.write.mode("append").parquet(PATH_TO_DATA)
    
file_nums = [(2, 6), (3, 6), (4, 5), (5, 5), 
             (6, 5), (7, 5), (8, 5), (9, 7), (10, 6),
             (11, 4), (12, 5), (13, 5), (14, 5), (15, 5), 
             (16, 6), (17, 6), (18, 4), (19, 4), (20, 4), 
             (21, 5), (22, 5), (23, 7), (24, 7), (25, 4), 
             (26, 4), (27, 4), (28, 3), (29, 4), (30, 6), (31, 6)]

for tup in file_nums:
    for i in range(1, tup[1] + 1):
        team_data = pubg.get_team_ranks("md_day_" + str(tup[0]) + "_" + str(i))
        team_data.write.mode("append").parquet(PATH_TO_DATA)
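
The nested loop above builds table names of the form `md_day_<day>_<part>` from `(day, number of parts)` pairs. A minimal sketch of that naming scheme as a generator (`table_names` is a hypothetical helper; the pairs in the example are shortened for illustration):

```python
def table_names(file_nums):
    """Yield 'md_day_<day>_<part>' for every part of every day."""
    for day, num_parts in file_nums:
        for part in range(1, num_parts + 1):
            yield f"md_day_{day}_{part}"


names = list(table_names([(1, 2), (2, 3)]))
print(names)
```

Listing the names up front makes it easy to check how many tables will be read before starting a long-running job.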

In [2]:
rank_data = spark.read.parquet("s3://social-research-cheating/edges/team_ranks.parquet")
rank_data.registerTempTable("rank_data")
rank_data.show(5)

+--------------------+---+-----+----+----------+
|                 mid|tid|  mod|rank|    m_date|
+--------------------+---+-----+----+----------+
|f905942d-149d-49d...| 38|  duo|   3|2019-03-17|
|a8f5eca6-cc65-480...| 15|squad|   7|2019-03-17|
|2b708e1f-5496-4fb...| 24|  duo|  29|2019-03-17|
|63514f97-098a-496...| 30|  duo|   9|2019-03-17|
|3b171f42-13c5-4df...| 26|squad|   5|2019-03-17|
+--------------------+---+-----+----+----------+
only showing top 5 rows



In [2]:
# Create a dataset that contains the ranks of teams for the teamplay matches
# where winners have different team IDs and at least one potential cheater exists as a winner.
# Run this cell only once.

summary_tab = spark.read.parquet("s3://social-research-cheating/general-stats/sum_tab_of_winners.parquet")
summary_tab.registerTempTable("summary_tab")

invalid_mids = spark.sql("SELECT DISTINCT mid FROM summary_tab WHERE tid_cnt > 1 AND pot_cnt >= 1")
invalid_mids.registerTempTable("invalid_mids")
print(invalid_mids.count())

1964


In [3]:
team_ranks = spark.read.parquet("s3://social-research-cheating/edges/team_ranks.parquet")
team_ranks.registerTempTable("team_ranks")

sampled_ranks = spark.sql("""SELECT t.* FROM team_ranks t JOIN invalid_mids i ON t.mid = i.mid 
                             ORDER BY mid, rank""")
sampled_ranks.write.parquet("s3://social-research-cheating/edges/sampled_ranks.parquet")

In [2]:
rank_data = spark.read.parquet("s3://social-research-cheating/edges/sampled_ranks.parquet")
rank_data.registerTempTable("rank_data")

temp = spark.sql("SELECT * FROM rank_data WHERE mid = '9d8edf15-f814-48fc-95ec-a7dc6ff24f41'")
temp_df = temp.toPandas()
temp_df.to_csv('rank_data.csv')

### 9. Add additional self-loops to 'obs_data'.

First, add self-loops for the cases where winners have different team IDs and no team has 'NA' as its team ID.

In [2]:
team_ids = spark.read.parquet("s3://social-research-cheating/edges/tiny_team_data.parquet")
team_ids.registerTempTable("team_ids")

obs_data = spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet")
obs_data.registerTempTable("obs_data")

# It contains the ranks of players for 1,964 teamplay matches. 
team_ranks = spark.read.parquet("s3://social-research-cheating/edges/ordered_ranks.parquet")
team_ranks.registerTempTable("team_ranks")

players = spark.read.parquet("s3://social-research-cheating/nodes.parquet")
players.registerTempTable("players")

In [26]:
temp = spark.sql("SELECT * FROM obs_data WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp_df = temp.toPandas()
temp_df.to_csv('obs_data.csv')

temp = spark.sql("SELECT * FROM team_ids WHERE mid = '0e85fbcc-0d91-4f03-942e-c92b5fae991f'")
temp_df = temp.toPandas()
temp_df.to_csv('team_ids.csv')

temp = spark.sql("SELECT * FROM team_ranks WHERE mid = '0e85fbcc-0d91-4f03-942e-c92b5fae991f'")
temp_df = temp.toPandas()
temp_df.to_csv('team_ranks.csv')

In [3]:
summary_tab = spark.read.parquet("s3://social-research-cheating/general-stats/sum_tab_of_winners.parquet")
summary_tab.registerTempTable("summary_tab")

invalid_mids = spark.sql("""SELECT DISTINCT mid FROM summary_tab 
                            WHERE tid_cnt > 1 AND pot_cnt >= 1""")
invalid_mids.registerTempTable("invalid_mids")

# It contains the killings of 1,891 teamplay matches.
sampled_obs = spark.sql("SELECT o.* FROM obs_data o JOIN invalid_mids i ON o.mid = i.mid")
sampled_obs.registerTempTable("sampled_obs")

# Add team IDs of killers.
add_src_tids = spark.sql("""SELECT s.*, CASE WHEN tid IS NULL THEN 'NA' ELSE tid END AS src_tid 
                            FROM sampled_obs s LEFT JOIN team_ids t ON s.mid = t.mid AND s.src = t.id""")
add_src_tids.registerTempTable("add_src_tids")

# Add team IDs of victims.
add_tids = spark.sql("""SELECT a.*, CASE WHEN tid IS NULL THEN 'NA' ELSE tid END AS dst_tid 
                        FROM add_src_tids a LEFT JOIN team_ids t ON a.mid = t.mid AND a.dst = t.id""")
add_tids.registerTempTable("add_tids")

In [4]:
# Get a list of victims for each match.
victims = spark.sql("SELECT mid, dst FROM sampled_obs")
victims.registerTempTable("victims")

# Get a list of winners for each match.
winners = spark.sql("""SELECT DISTINCT o.mid, src, src_tid, m_date FROM add_tids o 
                       WHERE NOT EXISTS (SELECT mid, dst FROM victims v WHERE o.mid = v.mid AND o.src = v.dst)""")
winners.registerTempTable("winners")

# Flag potential cheaters: banned players whose match date precedes the date they started cheating.
add_flags = spark.sql("""SELECT t.*, 
                         CASE WHEN cheating_flag = 1 AND m_date < start_date THEN 1 ELSE 0 END AS pot_flag 
                         FROM winners t LEFT JOIN players p ON t.src = p.id""")
add_flags.registerTempTable("add_flags")

# Get a list of winners who are potential cheaters.
pot_cheaters = spark.sql("SELECT * FROM add_flags WHERE pot_flag = 1 AND src_tid != 'NA'")
pot_cheaters.registerTempTable("pot_cheaters")
pot_cheaters.show(5)

+--------------------+--------------------+-------+----------+--------+
|                 mid|                 src|src_tid|    m_date|pot_flag|
+--------------------+--------------------+-------+----------+--------+
|21330d5b-0ba7-420...|account.175b7548e...|     23|2019-03-10|       1|
|44925719-4ae3-421...|account.175b7548e...|     14|2019-03-10|       1|
|0031e4e0-b475-46d...|account.175b7548e...|     20|2019-03-15|       1|
|7ce8d183-c8e9-42f...|account.175b7548e...|      3|2019-03-08|       1|
|c33acfa5-d4b9-428...|account.175b7548e...|     22|2019-03-10|       1|
+--------------------+--------------------+-------+----------+--------+
only showing top 5 rows
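
The `pot_flag` expression marks a winner as a potential cheater when the player was eventually banned and the match was played before `start_date`. The `start_date` column is not described in this section; the sketch below assumes it is a `YYYY-MM-DD` string giving the date the player started cheating, and `'NA'` for non-cheaters.

```python
def pot_flag(cheating_flag, m_date, start_date):
    """Return 1 for a banned player whose match predates start_date, else 0."""
    # ISO dates (YYYY-MM-DD) compare correctly as plain strings.
    return int(cheating_flag == 1 and m_date < start_date)


print(pot_flag(1, "2019-03-07", "2019-03-10"))  # match before start_date
print(pot_flag(1, "2019-03-12", "2019-03-10"))  # match after start_date
print(pot_flag(0, "2019-03-07", "NA"))          # never banned
```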



In [6]:
temp = spark.sql("SELECT * FROM add_flags WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp.show()

temp = spark.sql("SELECT * FROM add_flags WHERE pot_flag = 1 AND src_tid = 'NA'")
temp.show()
print(temp.count())

temp = spark.sql("SELECT * FROM pot_cheaters WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp.show()

+--------------------+--------------------+-------+----------+--------+
|                 mid|                 src|src_tid|    m_date|pot_flag|
+--------------------+--------------------+-------+----------+--------+
|013caebc-8504-4d7...|account.c9a9eaa2a...|     50|2019-03-07|       0|
|013caebc-8504-4d7...|account.3e5396b91...|      6|2019-03-07|       0|
|013caebc-8504-4d7...|account.577f76fe0...|     36|2019-03-07|       1|
|013caebc-8504-4d7...|account.0e2dd932a...|      9|2019-03-07|       0|
+--------------------+--------------------+-------+----------+--------+

+--------------------+--------------------+-------+----------+--------+
|                 mid|                 src|src_tid|    m_date|pot_flag|
+--------------------+--------------------+-------+----------+--------+
|f2f76e66-9fb7-40d...|account.57d64f776...|     NA|2019-03-27|       1|
|6283fdb3-c24d-413...|account.f24c22165...|     NA|2019-03-05|       1|
|bbe25e99-755d-4ca...|account.cdd20db96...|     NA|2019-03-04| 

In [5]:
# Add the team ranks of invalid winners (potential cheaters whose team did not finish first).
add_ranks = spark.sql("""SELECT w.mid, src, src_tid, 
                         CASE WHEN rank IS NULL THEN 'NA' ELSE rank END AS src_rank 
                         FROM pot_cheaters w JOIN team_ranks t ON w.mid = t.mid AND w.src_tid = t.tid 
                         WHERE rank != 1""")
add_ranks.registerTempTable("add_ranks")

temp = spark.sql("SELECT * FROM add_ranks WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp.show()

+--------------------+--------------------+-------+--------+
|                 mid|                 src|src_tid|src_rank|
+--------------------+--------------------+-------+--------+
|013caebc-8504-4d7...|account.577f76fe0...|     36|      13|
+--------------------+--------------------+-------+--------+



In [6]:
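# Look up the teams ranked immediately before (lag) and after (lead) each team.
# Note: the window is ordered by (mid, rank) without PARTITION BY, so the first and last
# teams of a match take their neighbour from the adjacent match.
temp_tab = spark.sql("""SELECT mid, tid, rank, 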
                        LAG(tid) OVER (ORDER BY mid, rank) AS lag_tid, 
                        LAG(rank) OVER (ORDER BY mid, rank) AS lag_rank, 
                        LEAD(tid) OVER (ORDER BY mid, rank) AS lead_tid, 
                        LEAD(rank) OVER (ORDER BY mid, rank) AS lead_rank 
                        FROM team_ranks""")
temp_tab.registerTempTable("temp_tab")

lag_lead_rows = spark.sql("""SELECT a.mid, src, src_tid, src_rank, lag_tid, lag_rank, lead_tid, lead_rank 
                             FROM add_ranks a JOIN temp_tab t ON a.mid = t.mid AND a.src_tid = t.tid""")
lag_lead_rows.registerTempTable("lag_lead_rows")

In [9]:
temp = spark.sql("SELECT * FROM lag_lead_rows WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp.show()

+--------------------+--------------------+-------+--------+-------+--------+--------+---------+
|                 mid|                 src|src_tid|src_rank|lag_tid|lag_rank|lead_tid|lead_rank|
+--------------------+--------------------+-------+--------+-------+--------+--------+---------+
|013caebc-8504-4d7...|account.577f76fe0...|     36|      13|      5|      12|       7|       13|
+--------------------+--------------------+-------+--------+-------+--------+--------+---------+
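
The LAG/LEAD lookup can be sketched in plain Python. This is a minimal illustration, not the notebook's implementation: it groups the rows by match before ordering them (whereas the SQL window orders all rows by mid and rank together), breaks ties in rank by team ID, and uses a hypothetical function name and made-up sample rows.

```python
from collections import defaultdict


def neighbours_by_rank(team_ranks):
    """Attach the previous and next team (by rank) within each match.

    team_ranks: iterable of (mid, tid, rank) tuples.
    Returns a dict keyed by (mid, tid); missing neighbours are None,
    which plays the role of SQL NULL.
    """
    by_match = defaultdict(list)
    for mid, tid, rank in team_ranks:
        by_match[mid].append((rank, tid))

    result = {}
    for mid, teams in by_match.items():
        teams.sort()  # order by rank; ties broken by tid (an assumption)
        for i, (rank, tid) in enumerate(teams):
            lag = teams[i - 1] if i > 0 else None
            lead = teams[i + 1] if i + 1 < len(teams) else None
            result[(mid, tid)] = {
                "rank": rank,
                "lag_tid": lag[1] if lag else None,
                "lag_rank": lag[0] if lag else None,
                "lead_tid": lead[1] if lead else None,
                "lead_rank": lead[0] if lead else None,
            }
    return result


# Made-up rows: three teams in match "m1" and a single team in match "m2".
rows = [("m1", 5, 12), ("m1", 36, 13), ("m1", 7, 14), ("m2", 9, 1)]
print(neighbours_by_rank(rows)[("m1", 36)])
```

Because each match is handled separately here, a team at either end of the ranking simply has no neighbour on that side.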



In [7]:
# For each match, add the time when the last member of the lag team died.
add_lag_time = spark.sql("""SELECT mid, src, src_tid, src_rank, 
                            lag_tid, lag_rank, lag_time, lead_tid, lead_rank 
                            FROM (SELECT l.*, time AS lag_time, 
                            ROW_NUMBER() OVER (PARTITION BY l.mid, l.src, l.src_tid ORDER BY time DESC) AS rownumber 
                            FROM lag_lead_rows l JOIN add_tids a 
                            ON l.lag_tid = a.dst_tid AND l.mid = a.mid) 
                            WHERE rownumber IN (1)""")
add_lag_time.registerTempTable("add_lag_time")

# Do the same for the lead team.
add_time = spark.sql("""SELECT mid, src, src_tid, src_rank, lag_tid, lag_rank, lag_time, 
                        lead_tid, lead_rank, lead_time 
                        FROM (SELECT l.*, time AS lead_time, 
                        ROW_NUMBER() OVER (PARTITION BY l.mid, l.src, l.src_tid ORDER BY time DESC) AS rownumber 
                        FROM add_lag_time l JOIN add_tids a ON l.lead_tid = a.dst_tid AND l.mid = a.mid) 
                        WHERE rownumber IN (1)""")
add_time.registerTempTable("add_time")

In [8]:
# Compute the gap in seconds between the lag team's and the lead team's elimination times.
add_tsdiff = spark.sql("""SELECT *, (UNIX_TIMESTAMP(lag_time) - UNIX_TIMESTAMP(lead_time)) AS tsdiff 
                          FROM add_time""")
add_tsdiff.registerTempTable("add_tsdiff")

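# Pick a timestamp for each self-loop: a random time between lead_time and lag_time
# when the gap is non-negative, otherwise a fixed fallback.
add_new_time = spark.sql("""SELECT *, 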
                            CASE WHEN lag_rank = 1 AND tsdiff < 0
                            THEN TO_TIMESTAMP(FROM_UNIXTIME(UNIX_TIMESTAMP(lead_time) + 1))
                            WHEN lead_rank = 1 THEN lag_time
                            WHEN lead_rank != 1 AND lag_rank != 1 AND tsdiff < 0
                            THEN TO_TIMESTAMP(FROM_UNIXTIME(UNIX_TIMESTAMP(lead_time) + 1))
                            ELSE TO_TIMESTAMP(FROM_UNIXTIME(UNIX_TIMESTAMP(lead_time) + FLOOR(0 + (RAND() * tsdiff)))) END 
                            AS new_time
                            FROM add_tsdiff""")
add_new_time.registerTempTable("add_new_time")
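
The CASE expression above can be read as a small decision rule. The sketch below restates it in plain Python under two assumptions: timestamps are whole-second `datetime` values, and SQL NULL handling is not modelled. The function name `pick_new_time` is illustrative only.

```python
import math
import random
from datetime import datetime, timedelta


def pick_new_time(lag_time, lead_time, lag_rank, lead_rank, rng=random):
    """Mirror the branches of the CASE expression that derives new_time."""
    tsdiff = int((lag_time - lead_time).total_seconds())
    one_second_after_lead = lead_time + timedelta(seconds=1)

    if lag_rank == 1 and tsdiff < 0:
        return one_second_after_lead
    if lead_rank == 1:
        return lag_time
    if lead_rank != 1 and lag_rank != 1 and tsdiff < 0:
        return one_second_after_lead
    # Every remaining case has tsdiff >= 0: add a uniform offset in [0, tsdiff).
    offset = math.floor(rng.random() * tsdiff)
    return lead_time + timedelta(seconds=offset)


lead = datetime(2019, 3, 7, 11, 40, 0)
lag = datetime(2019, 3, 7, 11, 50, 0)
print(pick_new_time(lag, lead, lag_rank=12, lead_rank=13, rng=random.Random(0)))
```

Note that the `lead_rank == 1` branch returns `lag_time` regardless of the sign of `tsdiff`, so it is the one branch that can yield a time earlier than `lead_time`.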

In [9]:
# Check that new_time never precedes lead_time; any rows returned here violate that.
temp = spark.sql("""SELECT * FROM add_new_time 
                    WHERE (UNIX_TIMESTAMP(new_time) - UNIX_TIMESTAMP(lead_time)) < 0""")
temp.show()
# temp_df = temp.toPandas()
# temp_df.to_csv('errors.csv')

+--------------------+--------------------+-------+--------+-------+--------+--------------------+--------+---------+--------------------+------+--------------------+
|                 mid|                 src|src_tid|src_rank|lag_tid|lag_rank|            lag_time|lead_tid|lead_rank|           lead_time|tsdiff|            new_time|
+--------------------+--------------------+-------+--------+-------+--------+--------------------+--------+---------+--------------------+------+--------------------+
|160b38b4-7bb1-4c2...|account.f19f20ca6...|     24|      27|     13|      27|2019-03-20 17:12:...|      23|        1|2019-03-20 17:26:...|  -851|2019-03-20 17:12:...|
|7e6c76cb-e8f5-450...|account.5b99b526c...|     44|      47|     36|      47|2019-03-03 07:21:...|      33|        1|2019-03-03 07:31:...|  -612|2019-03-03 07:21:...|
+--------------------+--------------------+-------+--------+-------+--------+--------------------+--------+---------+--------------------+------+--------------------+

In [10]:
# Export the generated timestamps for manual inspection.
temp_df = add_new_time.toPandas()
temp_df.to_csv('new_time.csv')

In [11]:
# Create a table that contains participant information.
player_info = spark.sql("""SELECT DISTINCT mid, src AS id, src_sd AS sd, src_bd AS bd, 
                           src_curr_flag AS curr_flag, src_flag AS flag, m_date 
                           FROM sampled_obs 
                           UNION 
                           SELECT DISTINCT mid, dst, dst_sd, dst_bd, 
                           dst_curr_flag, dst_flag, m_date 
                           FROM sampled_obs""")
player_info.registerTempTable("player_info")
player_info.show(5)

+--------------------+--------------------+----------+----------+---------+----+----------+
|                 mid|                  id|        sd|        bd|curr_flag|flag|    m_date|
+--------------------+--------------------+----------+----------+---------+----+----------+
|1d4f9928-93ba-451...|account.c8348adfe...|2019-03-03|2019-03-04|        1|   1|2019-03-03|
|9041b53d-ce14-448...|account.2c9a1b06b...|        NA|        NA|        0|   0|2019-03-01|
|cd295514-a3a9-469...|account.510342370...|        NA|        NA|        0|   0|2019-03-07|
|dffc21ac-c81c-475...|account.52bd2d26e...|        NA|        NA|        0|   0|2019-03-06|
|999ad874-66ab-4e1...|account.1de176fd4...|        NA|        NA|        0|   0|2019-03-09|
+--------------------+--------------------+----------+----------+---------+----+----------+
only showing top 5 rows



In [12]:
# Create self-loops.
self_loops = spark.sql("""SELECT a.mid, src, sd, bd, curr_flag, flag, 
                          src, sd, bd, curr_flag, flag, new_time AS time, m_date 
                          FROM add_new_time a JOIN player_info p ON a.mid = p.mid AND a.src = p.id""")
self_loops.registerTempTable("self_loops")

In [22]:
temp_loops = spark.sql("SELECT * FROM self_loops WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp_loops.show()

+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|                 mid|                 src|        sd|        bd|curr_flag|flag|                 src|        sd|        bd|curr_flag|flag|               time|    m_date|
+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|013caebc-8504-4d7...|account.577f76fe0...|2019-03-08|2019-03-09|        0|   1|account.577f76fe0...|2019-03-08|2019-03-09|        0|   1|2019-03-07 11:48:19|2019-03-07|
+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
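
A self-loop is an edge whose source and destination are the same player, so the player's attributes appear twice in the row. The sketch below builds such a row in plain Python; the function name and the sample values are hypothetical, and the column order is assumed to follow the output above.

```python
def make_self_loop(mid, player, new_time):
    """Build an edge row whose source and destination are the same player.

    player: dict with id, sd, bd, curr_flag, flag and m_date,
    mirroring the columns of player_info.
    """
    attrs = (player["id"], player["sd"], player["bd"],
             player["curr_flag"], player["flag"])
    # Layout: match ID, source attributes, destination attributes, time, match date.
    return (mid, *attrs, *attrs, new_time, player["m_date"])


# Made-up player record for illustration.
player = {"id": "account.x", "sd": "2019-03-08", "bd": "2019-03-09",
          "curr_flag": 0, "flag": 1, "m_date": "2019-03-07"}
edge = make_self_loop("match-1", player, "2019-03-07 11:48:19")
print(edge)
```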



Next, add self-loops for the cases where the winners of a match have different team IDs, including winners whose team ID is 'NA'.

In [14]:
na_pot_cheaters = spark.sql("SELECT * FROM add_flags WHERE pot_flag = 1 AND src_tid = 'NA'")
na_pot_cheaters.registerTempTable("na_pot_cheaters")
na_pot_cheaters.show(5)
print(na_pot_cheaters.count())

+--------------------+--------------------+-------+----------+--------+
|                 mid|                 src|src_tid|    m_date|pot_flag|
+--------------------+--------------------+-------+----------+--------+
|f2f76e66-9fb7-40d...|account.57d64f776...|     NA|2019-03-27|       1|
|6283fdb3-c24d-413...|account.f24c22165...|     NA|2019-03-05|       1|
|bbe25e99-755d-4ca...|account.cdd20db96...|     NA|2019-03-04|       1|
|99a629b2-f4e3-42e...|account.cdd20db96...|     NA|2019-03-03|       1|
|12bcdfe5-34a4-473...|account.81f027093...|     NA|2019-03-06|       1|
+--------------------+--------------------+-------+----------+--------+
only showing top 5 rows

17


In [15]:
# Get the last kill of each match.
last_kills = spark.sql("""SELECT * 
                          FROM (SELECT o.*, ROW_NUMBER() OVER (PARTITION BY mid ORDER BY time DESC) AS row_num 
                                FROM sampled_obs AS o) 
                          WHERE row_num = 1""")
last_kills.registerTempTable("last_kills")
last_kills.show(10)

+--------------------+--------------------+----------+----------+-------------+--------+--------------------+------+------+-------------+--------+--------------------+----------+-------+
|                 mid|                 src|    src_sd|    src_bd|src_curr_flag|src_flag|                 dst|dst_sd|dst_bd|dst_curr_flag|dst_flag|                time|    m_date|row_num|
+--------------------+--------------------+----------+----------+-------------+--------+--------------------+------+------+-------------+--------+--------------------+----------+-------+
|1de898e9-65f0-4d2...|account.bcbbd1d75...|2019-03-21|2019-03-23|            0|       1|account.588c4bb15...|    NA|    NA|            0|       0|2019-03-05 11:13:...|2019-03-05|      1|
|2450097b-8958-40e...|account.6b5cbec14...|        NA|        NA|            0|       0|account.d2a9ff1ee...|    NA|    NA|            0|       0|2019-03-05 16:54:...|2019-03-05|      1|
|2f183c4d-ca43-41f...|account.421182046...|2019-03-10|2019-03-11|
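
The `ROW_NUMBER() ... ORDER BY time DESC` pattern with `row_num = 1` keeps the latest row in each group. A minimal plain-Python sketch of the same idea follows; the helper name and the sample kills are made up, and timestamps are assumed to be ISO-formatted strings so that string comparison matches chronological order.

```python
def latest_per_group(rows, key, time_field="time"):
    """Keep, for each key, the row with the greatest timestamp.

    On ties the first row seen is kept; ROW_NUMBER() makes no such promise.
    """
    latest = {}
    for row in rows:
        k = key(row)
        if k not in latest or row[time_field] > latest[k][time_field]:
            latest[k] = row
    return list(latest.values())


# Made-up kill records: two kills in match "m1", one in match "m2".
kills = [
    {"mid": "m1", "src": "a", "dst": "b", "time": "2019-03-05 11:10:00"},
    {"mid": "m1", "src": "c", "dst": "a", "time": "2019-03-05 11:13:00"},
    {"mid": "m2", "src": "d", "dst": "e", "time": "2019-03-05 16:54:00"},
]
last_kills = latest_per_group(kills, key=lambda r: r["mid"])
print(last_kills)
```

Grouping by `(mid, src)` instead of `mid` alone gives the last kill of each player within a match.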

In [16]:
# Find the last kill of each winner whose team ID is 'NA'.
na_player_kills = spark.sql("""SELECT * 
                               FROM (SELECT o.*, 
                                     ROW_NUMBER() OVER (PARTITION BY w.mid, w.src ORDER BY time DESC) AS row_num 
                                     FROM sampled_obs o JOIN na_pot_cheaters w 
                                     ON o.mid = w.mid AND o.src = w.src) 
                              WHERE row_num = 1""")
na_player_kills.registerTempTable("na_player_kills")
na_player_kills.show()
print(na_player_kills.count()) # The result should be 17.

+--------------------+--------------------+----------+----------+-------------+--------+--------------------+------+------+-------------+--------+--------------------+----------+-------+
|                 mid|                 src|    src_sd|    src_bd|src_curr_flag|src_flag|                 dst|dst_sd|dst_bd|dst_curr_flag|dst_flag|                time|    m_date|row_num|
+--------------------+--------------------+----------+----------+-------------+--------+--------------------+------+------+-------------+--------+--------------------+----------+-------+
|be2c6092-f9d3-4a2...|account.c6f71a3c5...|2019-03-18|2019-03-30|            0|       1|account.a18547acf...|    NA|    NA|            0|       0|2019-03-04 18:50:...|2019-03-04|      1|
|bac58a82-62ca-485...|account.d8fc8cfc9...|2019-03-05|2019-03-06|            0|       1|account.0c3829c84...|    NA|    NA|            0|       0|2019-03-03 17:42:...|2019-03-03|      1|
|093eb054-176b-4b9...|account.216ad15bd...|2019-03-02|2019-03-03|

In [26]:
# Compute the gap in seconds between each winner's last kill and the last kill of the match.
cal_tsdiff = spark.sql("""SELECT n.mid, n.src, n.time,  
                          (UNIX_TIMESTAMP(l.time) - UNIX_TIMESTAMP(n.time)) AS tsdiff
                          FROM na_player_kills n JOIN last_kills l ON n.mid = l.mid""")
cal_tsdiff.registerTempTable("cal_tsdiff")
cal_tsdiff.show()

+--------------------+--------------------+-------------+--------------------+----------+------+
|                 mid|                 src|src_curr_flag|                time|    m_date|tsdiff|
+--------------------+--------------------+-------------+--------------------+----------+------+
|f2f76e66-9fb7-40d...|account.57d64f776...|            0|2019-03-27 08:41:...|2019-03-27|  1676|
|0e85fbcc-0d91-4f0...|account.9ccecb41a...|            0|2019-03-06 10:15:...|2019-03-06|     0|
|b2c7e5a4-f0f0-48d...|account.ac666be40...|            0|2019-03-03 20:26:...|2019-03-03|   374|
|bac58a82-62ca-485...|account.d8fc8cfc9...|            0|2019-03-03 17:42:...|2019-03-03|    44|
|f54ab324-6b31-474...|account.44b0bd971...|            0|2019-03-04 15:46:...|2019-03-04|     0|
|483a0e46-2d62-444...|account.4d2951657...|            0|2019-03-13 08:19:...|2019-03-13|     0|
|bbe25e99-755d-4ca...|account.cdd20db96...|            0|2019-03-04 12:33:...|2019-03-04|  1497|
|cb84a1ce-cd19-427...|account.

In [27]:
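# Draw a random time between the winner's last kill and the last kill of the match.
# When the two coincide (tsdiff = 0), leave new_time as NULL.
add_rand_time = spark.sql("""SELECT c.*,  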
                             CASE WHEN tsdiff = 0 THEN NULL
                             ELSE TO_TIMESTAMP(FROM_UNIXTIME(UNIX_TIMESTAMP(time) + FLOOR(0 + (RAND() * tsdiff)))) END 
                             AS new_time
                             FROM cal_tsdiff AS c""")
add_rand_time.registerTempTable("add_rand_time")
add_rand_time.show()

+--------------------+--------------------+-------------+--------------------+----------+------+-------------------+
|                 mid|                 src|src_curr_flag|                time|    m_date|tsdiff|           new_time|
+--------------------+--------------------+-------------+--------------------+----------+------+-------------------+
|f2f76e66-9fb7-40d...|account.57d64f776...|            0|2019-03-27 08:41:...|2019-03-27|  1676|2019-03-27 09:05:07|
|0e85fbcc-0d91-4f0...|account.9ccecb41a...|            0|2019-03-06 10:15:...|2019-03-06|     0|               null|
|b2c7e5a4-f0f0-48d...|account.ac666be40...|            0|2019-03-03 20:26:...|2019-03-03|   374|2019-03-03 20:28:39|
|bac58a82-62ca-485...|account.d8fc8cfc9...|            0|2019-03-03 17:42:...|2019-03-03|    44|2019-03-03 17:42:09|
|f54ab324-6b31-474...|account.44b0bd971...|            0|2019-03-04 15:46:...|2019-03-04|     0|               null|
|483a0e46-2d62-444...|account.4d2951657...|            0|2019-03

In [28]:
# Remove the rows with NULL values.
add_rand_time = spark.sql("SELECT * FROM add_rand_time WHERE new_time IS NOT NULL")
add_rand_time.registerTempTable("add_rand_time")
add_rand_time.show()

+--------------------+--------------------+-------------+--------------------+----------+------+-------------------+
|                 mid|                 src|src_curr_flag|                time|    m_date|tsdiff|           new_time|
+--------------------+--------------------+-------------+--------------------+----------+------+-------------------+
|f2f76e66-9fb7-40d...|account.57d64f776...|            0|2019-03-27 08:41:...|2019-03-27|  1676|2019-03-27 09:05:07|
|b2c7e5a4-f0f0-48d...|account.ac666be40...|            0|2019-03-03 20:26:...|2019-03-03|   374|2019-03-03 20:28:39|
|bac58a82-62ca-485...|account.d8fc8cfc9...|            0|2019-03-03 17:42:...|2019-03-03|    44|2019-03-03 17:42:09|
|bbe25e99-755d-4ca...|account.cdd20db96...|            0|2019-03-04 12:33:...|2019-03-04|  1497|2019-03-04 12:56:16|
|cb84a1ce-cd19-427...|account.88cca8d42...|            0|2019-03-04 21:40:...|2019-03-04|   871|2019-03-04 21:46:54|
|6283fdb3-c24d-413...|account.f24c22165...|            0|2019-03

In [29]:
# Create self-loops.
rand_self_loops = spark.sql("""SELECT a.mid, src, sd, bd, curr_flag, flag, 
                               src, sd, bd, curr_flag, flag, new_time AS time, m_date 
                               FROM add_rand_time a JOIN player_info p 
                               ON a.mid = p.mid AND a.src = p.id""")
rand_self_loops.registerTempTable("rand_self_loops")
rand_self_loops.show()

+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|                 mid|                 src|        sd|        bd|curr_flag|flag|                 src|        sd|        bd|curr_flag|flag|               time|    m_date|
+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|be2c6092-f9d3-4a2...|account.c6f71a3c5...|2019-03-18|2019-03-30|        0|   1|account.c6f71a3c5...|2019-03-18|2019-03-30|        0|   1|2019-03-04 18:54:04|2019-03-04|
|bac58a82-62ca-485...|account.d8fc8cfc9...|2019-03-05|2019-03-06|        0|   1|account.d8fc8cfc9...|2019-03-05|2019-03-06|        0|   1|2019-03-03 17:42:09|2019-03-03|
|093eb054-176b-4b9...|account.216ad15bd...|2019-03-02|2019-03-03|        0|   1|account.216ad15bd...|2019-03-02|2019-03-03|        0|   1|2019-03-01 1

In [30]:
# Combine two sets of self-loops.
full_self_loops = spark.sql("SELECT * FROM self_loops UNION SELECT * FROM rand_self_loops")
full_self_loops.registerTempTable("full_self_loops")
print(full_self_loops.count())

+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|                 mid|                 src|        sd|        bd|curr_flag|flag|                 src|        sd|        bd|curr_flag|flag|               time|    m_date|
+--------------------+--------------------+----------+----------+---------+----+--------------------+----------+----------+---------+----+-------------------+----------+
|1b9c6a48-b81f-4f6...|account.78f9c3700...|2019-03-08|2019-03-09|        0|   1|account.78f9c3700...|2019-03-08|2019-03-09|        0|   1|2019-03-05 15:40:59|2019-03-05|
|11a69c34-647b-4fa...|account.57d44119f...|2019-03-08|2019-03-09|        0|   1|account.57d44119f...|2019-03-08|2019-03-09|        0|   1|2019-03-02 16:55:43|2019-03-02|
|28bca936-a425-492...|account.078af237a...|2019-03-25|2019-03-26|        0|   1|account.078af237a...|2019-03-25|2019-03-26|        0|   1|2019-03-17 1

In [31]:
# Add self-loops into the original 'obs_data' dataset.
obs_data = spark.read.parquet("s3://social-research-cheating/edges/obs_data.parquet")
obs_data.registerTempTable("obs_data")

rev_obs = spark.sql("SELECT * FROM obs_data UNION SELECT * FROM full_self_loops ORDER BY mid, time")
rev_obs.registerTempTable("rev_obs")
rev_obs.write.parquet("s3://social-research-cheating/edges/rev_obs_data.parquet")
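
Keep in mind that `UNION` in Spark SQL removes exact duplicate rows, whereas `UNION ALL` keeps them. Whether `obs_data` contains exact duplicates is not shown here, so the sketch below only illustrates the semantics with made-up tuples and a hypothetical helper.

```python
def union_distinct(*tables):
    """Concatenate tables of tuples, dropping exact duplicate rows."""
    seen = set()
    out = []
    for table in tables:
        for row in table:
            if row not in seen:
                seen.add(row)
                out.append(row)
    return out


# Made-up rows in the form (mid, src, dst, time); the first two are identical.
obs = [("m1", "a", "b", "t1"), ("m1", "a", "b", "t1")]
loops = [("m1", "a", "a", "t2")]

# Sort by match ID and time, as the ORDER BY mid, time clause does.
rev = sorted(union_distinct(obs, loops), key=lambda r: (r[0], r[3]))
print(len(rev))
```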

Test the new dataset with extra self-loops.

In [2]:
rev_obs = spark.read.parquet("s3://social-research-cheating/edges/rev_obs_data.parquet")
rev_obs.registerTempTable("rev_obs")

# Create a file for testing.
temp = spark.sql("SELECT * FROM rev_obs WHERE mid = '013caebc-8504-4d71-be02-a082ddccda9a'")
temp_df = temp.toPandas()
temp_df.to_csv('test_data.csv')

In [3]:
team_ids = spark.read.parquet("s3://social-research-cheating/edges/tiny_team_data.parquet")
team_ids.registerTempTable("team_ids")

players = spark.read.parquet("s3://social-research-cheating/nodes.parquet")
players.registerTempTable("players")

# Get a list of mids and m_dates.
match_info = spark.sql("SELECT DISTINCT mid, m_date FROM rev_obs")
match_info.registerTempTable("match_info")

# Get a list of victims for each match.
victims = spark.sql("SELECT DISTINCT mid, dst FROM rev_obs")
victims.registerTempTable("victims")

# Get a list of winners for each match.
winners = spark.sql("""SELECT DISTINCT o.mid, src FROM rev_obs o 
                       WHERE NOT EXISTS (SELECT mid, dst FROM victims v WHERE o.mid = v.mid AND o.src = v.dst)""")
winners.registerTempTable("winners")

# Add team information.
add_tids = spark.sql("""SELECT w.mid, src, CASE WHEN tid IS NULL THEN 'NA' ELSE tid END AS src_tid
                        FROM winners w LEFT JOIN team_ids t ON w.mid = t.mid AND w.src = t.id""")
add_tids.registerTempTable("add_tids")

# Add m_dates.
temp_tab = spark.sql("""SELECT a.mid, src, src_tid, m_date 
                        FROM add_tids a LEFT JOIN match_info m ON a.mid = m.mid""")
temp_tab.registerTempTable("temp_tab")

# Find the matches where at least one winner's team ID is 'NA'.
na_tids = spark.sql("SELECT DISTINCT mid FROM add_tids WHERE src_tid = 'NA'")
na_tids.registerTempTable("na_tids")

# Flag potential cheaters: banned players whose match took place before their cheating start date.
winners = spark.sql("""SELECT t.*, 
                       CASE WHEN cheating_flag = 1 AND m_date < start_date THEN 1 ELSE 0 END AS pot_flag 
                       FROM temp_tab t LEFT JOIN players p ON t.src = p.id""")
winners.registerTempTable("winners")

# Count the winners, the unique team IDs, and the potential cheaters for each match.
cnt_tab = spark.sql("""SELECT mid, COUNT(src) AS winner_cnt, 
                       COUNT(DISTINCT src_tid) AS tid_cnt, SUM(pot_flag) AS pot_cnt 
                       FROM winners GROUP BY mid""")
cnt_tab.registerTempTable("cnt_tab")

summary_tab = spark.sql("""SELECT c.mid, winner_cnt, tid_cnt, pot_cnt, 
                           CASE WHEN n.mid IS NULL THEN 0 ELSE 1 END AS na_flag 
                           FROM cnt_tab c LEFT JOIN na_tids n ON c.mid = n.mid""")
summary_tab.registerTempTable("summary_tab")
summary_tab.show(10)

+--------------------+----------+-------+-------+-------+
|                 mid|winner_cnt|tid_cnt|pot_cnt|na_flag|
+--------------------+----------+-------+-------+-------+
|0143e2da-14d2-4d8...|         9|      6|      0|      0|
|036a8903-186b-45f...|         4|      2|      0|      0|
|080d5622-6b94-4d7...|         3|      2|      0|      0|
|0c7d472e-5064-4d4...|         2|      2|      0|      0|
|0ef25288-88d3-476...|         2|      1|      0|      0|
|1203abce-50ec-40d...|         4|      4|      0|      0|
|1574a6bb-a63f-473...|         5|      2|      0|      0|
|16d6f605-4118-4de...|         4|      3|      0|      0|
|1773f8d7-b807-439...|         3|      2|      0|      0|
|194e1d81-b65c-4dc...|         4|      2|      0|      0|
+--------------------+----------+-------+-------+-------+
only showing top 10 rows
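
The winner definition used in this test (a killer who never appears as a victim in the same match) is an anti-join. Below is a minimal plain-Python sketch with hypothetical names and made-up edges of the form `(mid, src, dst)`. Under this definition a player with a self-loop counts as their own victim and drops out of the winner set, which may be what the test relies on.

```python
from collections import defaultdict


def find_winners(edges):
    """Return (mid, src) pairs for killers who are never a victim in the same match."""
    victims = {(mid, dst) for mid, _src, dst in edges}
    return {(mid, src) for mid, src, _dst in edges if (mid, src) not in victims}


def summarise(winners, team_of):
    """Per match: number of winners, distinct team IDs, and an 'NA' team flag.

    team_of maps (mid, player) to a team ID; missing entries become 'NA'.
    """
    tids = defaultdict(list)
    for mid, src in winners:
        tids[mid].append(team_of.get((mid, src), "NA"))
    return {
        mid: {"winner_cnt": len(t), "tid_cnt": len(set(t)), "na_flag": int("NA" in t)}
        for mid, t in tids.items()
    }


# Made-up edges: "d" has a self-loop, "a" is killed by "c".
edges = [("m1", "a", "b"), ("m1", "c", "a"), ("m1", "d", "d"), ("m2", "x", "y")]
winners = find_winners(edges)
print(summarise(winners, {("m1", "c"): "7"}))
```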

