In [None]:
import numpy as np 
import pandas as pd 
import seaborn as sns
import matplotlib.pyplot as plt
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

## **Objective**
The objective of this kernel is not to predict the players' placements but rather to identify cheaters within the dataset. As I explore the data and find potential culprits, I will explain the reasoning behind each suspicion. Enjoy!
***
## **Background**
#### **What is PUBG?**
PUBG, or PLAYERUNKNOWN'S BATTLEGROUNDS, is an online multiplayer game where (at most) 100 players drop from a plane onto an island, scavenge for loot (weapons, items, etc.), and kill each other until only one person, the winner, is left alive. While all of this happens, a ring appears on the map and shrinks as the match goes on, making it hard to hide for the whole match. Depending on the game mode, players may also be playing in groups of 2, 3, or 4, in which case the game goes on until only one group is left alive. Games like PUBG are classified as battle-royale games.
#### **Cheaters in Online Games**
In every online game (especially those involving player versus player combat) there are cheaters who use some type of program to gain unfair advantages over others. The types of cheats that players use are listed in the next section. These people are breaking the rules of the game and need to be punished in order to keep any competitive game competitive and healthy. An influx of cheaters in a game can make people lose interest in the game and damage the competitive integrity of the game. For this reason, catching and punishing cheaters is important for game companies. 

#### **Types of Cheats**
These are the 3 types of cheats that I think are most relevant in PUBG and also identifiable with the given data. Of course, there are tons of other types of cheats, but some of them are impossible to find using the given data. For this kernel we will focus on these 3 cheats:
***
**Aim Bot**: An aim-assisting cheat which automatically aims at enemies. Some aim bots are blatant (100% headshot accuracy) but some will intentionally miss to deceive anti-cheats/spectators in an effort to avoid getting banned. The features headshotKills and roadKills could help us identify aim bot users. A high headshot or road kill rate can be a reason for suspicion, as both are difficult (but certainly not impossible) to achieve.
***
**ESP**: ESP, or extra sensory perception, is a type of cheat that gives the player information that they shouldn't have access to. For example, an ESP could allow a player to see where enemies or items are positioned through walls. Depending on how experienced the cheater is, it can be difficult to detect ESP cheats, as the player can just act clueless. For now, I think that the weaponsAcquired feature might give us some hints because ESP users will likely look for the best items available and go straight for those. This is a wild guess and we'll see if we can find any patterns during our EDA.
***
**Movement**: Allows players to control their character in an impossible (to non-cheaters) way. Some examples: increased movement speed and flying. In this dataset, there are features that describe how much the player moved such as walkDistance, rideDistance, and swimDistance. These features could give us some hints. 
 

Lastly, the way cheats work is very fascinating! I will include some information about how certain cheats work later in the kernel. 

Let's begin by importing the training dataset and looking at an overview of its content.

In [None]:
PATH = '/kaggle/input/pubg-finish-placement-prediction/'
train = pd.read_csv(PATH + 'train_V2.csv')
test = pd.read_csv(PATH + 'test_V2.csv')
df = train

In [None]:
DEFAULT_PLOT_SIZE = (12, 4)


def scatterplot(df, x, y, size=DEFAULT_PLOT_SIZE):
    plt.figure(figsize=size)
    sns.scatterplot(data=df, x=x, y=y).set_title(f"{y} vs {x}")
    plt.show()
    
    
def countplot(df, x, size=DEFAULT_PLOT_SIZE):
    plt.figure(figsize=size)
    sns.countplot(data=df, x=x).set_title(x)
    plt.show()

    
def distplot(df, x, bins=50, size=DEFAULT_PLOT_SIZE):
    plt.figure(figsize=size)
    # sns.distplot is deprecated; sns.histplot is its histogram replacement.
    sns.histplot(data=df, x=x, bins=bins).set_title(x)
    plt.show()
    

def boxplot(df, x, y, size=DEFAULT_PLOT_SIZE):
    plt.figure(figsize=size)
    sns.boxplot(data = df, x=x, y=y).set_title(f"{y} vs {x}")
    plt.show()     


## **Feature Descriptions**
Okay that's a lot of features so let's copy down the provided descriptions and get a feel for what the dataset is giving us. 
* **DBNOs** - Number of enemy players knocked.  
* **assists** - Number of enemy players this player damaged that were killed by teammates.  
* **boosts** - Number of boost items used.  
* **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.  
* **headshotKills** - Number of enemy players killed with headshots.  
* **heals** - Number of healing items used.  
* **Id** - Player’s Id  
* **killPlace** - Ranking in match of number of enemy players killed.  
* **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.  
* **killStreaks** - Max number of enemy players killed in a short amount of time.  
* **kills** - Number of enemy players killed.  
* **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.  
* **matchDuration** - Duration of match in seconds.  
* **matchId** - ID to identify match. There are no matches that are in both the training and testing set.  
* **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.  
* **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.  
* **revives** - Number of times this player revived teammates.  
* **rideDistance** - Total distance traveled in vehicles measured in meters.  
* **roadKills** - Number of kills while in a vehicle.  
* **swimDistance** - Total distance traveled by swimming measured in meters.  
* **teamKills** - Number of times this player killed a teammate.  
* **vehicleDestroys** - Number of vehicles destroyed.  
* **walkDistance** - Total distance traveled on foot measured in meters.  
* **weaponsAcquired** - Number of weapons picked up.  
* **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.  
* **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.  
* **numGroups** - Number of groups we have data for in the match.  
* **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.  
* **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
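
The descriptions of killPoints, winPoints, and rankPoints say that some values are placeholders for "None". Below is a minimal sketch of how those placeholders could be turned into NaN before any analysis that uses the rating columns. The helper name `mask_placeholder_points` and the toy values are made up for illustration; only the column names and the placeholder rules come from the descriptions above.

```python
import numpy as np
import pandas as pd


def mask_placeholder_points(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which placeholder rating values are replaced by NaN."""
    out = frame.copy()
    # Cast to float first so the columns can hold NaN.
    for col in ["rankPoints", "killPoints", "winPoints"]:
        out[col] = out[col].astype(float)
    has_rank = out["rankPoints"] != -1
    # A 0 in killPoints/winPoints means "None" whenever rankPoints is populated.
    for col in ["killPoints", "winPoints"]:
        out.loc[has_rank & (out[col] == 0), col] = np.nan
    # A -1 in rankPoints also stands for "None".
    out.loc[~has_rank, "rankPoints"] = np.nan
    return out


# Toy rows (hypothetical values), not taken from the dataset.
toy_points = pd.DataFrame({
    "rankPoints": [-1, 1500, 1432],
    "killPoints": [1200, 0, 0],
    "winPoints": [1450, 0, 0],
})
cleaned_points = mask_placeholder_points(toy_points)
print(cleaned_points)
```

Working on a copy keeps the original frame intact, so the raw values stay available if the placeholder rules turn out to be wrong for some matches.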

Cool, so the dataset provides information about how players/groups in certain matches performed. It looks like we need to filter the data because it features matches from a variety of different game modes.

## Data Cleaning

In [None]:
df['matchType'].value_counts()

These are all the game modes in this dataset. We want to only look at the battle-royale game modes so let's filter the data. 


In [None]:
# Keep only the six standard battle-royale modes.
STANDARD_MODES = ['solo', 'solo-fpp', 'duo', 'duo-fpp', 'squad', 'squad-fpp']
df = df[df['matchType'].isin(STANDARD_MODES)]
print(f"There are {df.matchId.nunique()} total matches and {df.Id.nunique()} unique players featured in the dataset.")

In [None]:
countplot(df, 'numGroups', (22, 5))

The graph shows how many player rows come from matches with a given number of groups. There are three separate humps because the data includes three different group sizes (solos, duos, and squads). For duos (max 2 players per group), the number of groups should be 50 or fewer since the max number of players in a lobby is 100. For squads (max 4 players per group), the number of groups is going to be even smaller. However, it should almost never be the case that a match has only a few groups (1-15): even with 4-person groups, 15 groups would only be a 60-person lobby. Also, the range from 51 to 80 should not be counted because that is too many groups for duos but not enough for solos. Let's drop any match with group counts between 51 and 80 as well.
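
Since every row in the dataset is one player, a countplot on numGroups counts player rows rather than matches. Here is a minimal sketch of how the same distribution could be counted per match instead, by keeping one row per matchId. The helper name `groups_per_match` and the toy rows are assumptions for illustration.

```python
import pandas as pd


def groups_per_match(frame: pd.DataFrame) -> pd.Series:
    """Number of matches for each numGroups value (one row per match)."""
    per_match = frame.drop_duplicates("matchId")
    return per_match["numGroups"].value_counts().sort_index()


# Toy rows (hypothetical): three players in match "a", two in "b", one in "c".
toy_matches = pd.DataFrame({
    "matchId": ["a", "a", "a", "b", "b", "c"],
    "numGroups": [48, 48, 48, 27, 27, 48],
})
match_counts = groups_per_match(toy_matches)
print(match_counts)
```

The shape of the humps should look similar either way, but the per-match version is not inflated by lobbies that simply contain more players.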

In [None]:
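# Keep matches with at least 15 groups and drop those with 51-80 groups (inclusive).
# .copy() avoids SettingWithCopyWarning when new columns are added later.
df = df[(df['numGroups'] >= 15) & (~df['numGroups'].isin(range(51, 81)))].copy()
# drop() returns a new DataFrame, so the result has to be assigned back.
df = df.drop('groupId', axis=1)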
countplot(df, 'numGroups', (15, 5))

# EDA - Finding cheaters
It might be hard to focus on one cheat to look for because we may find suspicious things that we weren't necessarily searching for. I'm just going to explore the data based on my experience with battle-royale games.

First, I will create some features that I think will be useful in finding cheaters.

In [None]:
df['totalDistance'] = df['walkDistance'] + df['swimDistance'] + df['rideDistance']

df['headshotRate'] = df['headshotKills']/df['kills']
df['headshotRate'] = df['headshotRate'].fillna(0)

df['totalHeals'] = df['heals'] + df['boosts']

df[['totalDistance', 'headshotRate', 'totalHeals']].describe()

In [None]:
print(f"There are {df['matchId'].nunique()} matches and {df['Id'].nunique()} players in the dataset ")

In [None]:
boxplot(df, 'kills', 'totalHeals')

Right off the bat, we see someone using 80 total healing items without getting a single kill. This may be a result of a custom game (spawning in items?) or some type of looting hack. Whether or not they were hacking, this is abnormal and they definitely weren't trying to win! Perhaps there was an achievement that rewards players for using healing items?  
On average, players use around 10 healing items per match. More than that is generally not needed and does not help in getting more kills, so 40+ heals is really not normal. Let's zoom in a little bit and focus on players who used a reasonable number of heals (< 20).

In [None]:
boxplot(df[df['totalHeals']<20], 'kills', 'totalHeals')

When I see a player getting 10+ kills without healing or boosting ONCE, it becomes suspicious. Either this person is a professional playing in a new-player lobby or something fishy is going on. This person may be using some sort of cheat (not necessarily just one type). There are cheats that allow you to have unlimited health. My guess is that the player is using a combination of hacks so that they can get kills fast (aim bot) and without taking much damage from others (movement speed). Of course, I cannot say this with 100% confidence due to the lack of features in this dataset as well as the possibility of the player simply being good at the game.

In general, it takes time to get lots of kills, and the longer you stay alive, the more prone you are to taking damage and needing to heal, as you can see from the general trend of more kills = more heals up to around 13 kills. This is why it might be suspicious when players are getting double-digit kills without ever healing. I wish there was a feature in this dataset that told us how much damage the player took and how long they were alive during the match.
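
To go from the boxplot to actual rows, the suspicion above can be written as a filter. This is a minimal sketch: the helper name `flag_no_heal_high_kill` and the 10-kill cutoff are my own choices for illustration, and a match only means "worth a closer look", not "cheater".

```python
import pandas as pd


def flag_no_heal_high_kill(frame: pd.DataFrame, min_kills: int = 10) -> pd.DataFrame:
    """Rows with at least `min_kills` kills and no heal or boost items used."""
    total_heals = frame["heals"] + frame["boosts"]
    mask = (frame["kills"] >= min_kills) & (total_heals == 0)
    return frame.loc[mask, ["Id", "kills", "heals", "boosts"]]


# Toy rows (hypothetical values), not taken from the dataset.
toy_players = pd.DataFrame({
    "Id": ["p1", "p2", "p3", "p4"],
    "kills": [12, 11, 2, 10],
    "heals": [0, 3, 0, 0],
    "boosts": [0, 1, 0, 2],
})
suspects = flag_no_heal_high_kill(toy_players)
print(suspects)
```

Only `p1` is flagged here: `p2` and `p4` used at least one item, and `p3` is below the kill cutoff.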

In [None]:
scatterplot(df, 'totalDistance', 'kills')

Next, we are looking at how the distance traveled by the player affects their kills. In general, up to around 6000m, players get more kills as they travel more. This trend can be easily explained: the longer you stay alive, the more chances you have to get kills, and the more you'll be moving around chasing after players / finding loot / running away. However, past a certain point (~6000m), the distance traveled actually hurts the player's kill count. Why is this?  
Check out [this heatmap](https://www.reddit.com/r/PUBATTLEGROUNDS/comments/69janc/heatmap_of_deaths_in_game_squad_gamemode/) made by a reddit user, which shows which spots on the map are most popular based on the number of people dying there. There are more maps in PUBG, but on every map there are popular and unpopular spots.

<img src="https://external-preview.redd.it/WV0H7qK657MHYQyMvHkbV-HeW3oyIaizXnk90ZmOPWw.jpg?width=960&crop=smart&auto=webp&s=19ed024bfda221f7f8705f2517a1c59f0f76ac29" width="400">
My best guess is that players who drop on unpopular parts of the map get fewer kills (because fewer people are around them) and also need to travel more due to the ring's tendency to converge towards the center of the map. I believe this graph tells us that there is a sweet spot in terms of how much you want to travel. More traveling = more risk of being attacked and dying. However, to get kills, you need to travel a certain amount. It seems that ~6000m is the sweet spot.  
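
The ~6000m figure is read off the scatterplot by eye. One way to check it would be to bin players by distance and look at the mean kills per bin. The sketch below assumes a frame with the totalDistance and kills columns; the helper name `mean_kills_by_distance`, the 1000m bin width, and the toy rows are all made up for illustration.

```python
import pandas as pd


def mean_kills_by_distance(frame: pd.DataFrame, bin_width: float = 1000.0) -> pd.DataFrame:
    """Mean kills and player count per distance bin (bins labeled by lower edge)."""
    # Floor each distance to the lower edge of its bin.
    bin_start = (frame["totalDistance"] // bin_width) * bin_width
    summary = frame.groupby(bin_start)["kills"].agg(["mean", "count"])
    summary.index.name = "bin_start_m"
    return summary


# Toy rows (hypothetical values), not taken from the dataset.
toy_travel = pd.DataFrame({
    "totalDistance": [100.0, 900.0, 1500.0, 6200.0, 6800.0],
    "kills": [0, 2, 1, 5, 3],
})
kills_by_bin = mean_kills_by_distance(toy_travel)
print(kills_by_bin)
```

On the real data, the `count` column matters as much as the mean: bins far out in the tail hold very few players, so their averages are noisy.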

Now let's get back on track. Can we spot any cheaters from the data?  
We can clearly see that there are some people traveling an insane amount (35000+m). They don't seem to be getting any kills, either. This is similar to how some people were using 80 heal items but getting 0 kills. Perhaps it's achievement farming? We can check for movement speed cheats by finding the maximum distance a player can travel in a certain amount of time.  

* Max Vehicle Speed: 42.2 m/s  
* Max Running Speed (w/ boost): 6.7 m/s  
* Max Swimming Speed: 2.9 m/s  

In [None]:
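# Upper bounds on speed, in m/s.
max_ride_speed = 42.2
max_walk_speed = 6.3  # the list above gives 6.7 m/s with boost; 6.3 is the stricter bound used here
max_swim_speed = 2.9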

Let's check to see if there are any impossible values in each distance measurement feature. If matchDuration (s) * max speed (m/s) is less than the distance (m), then the distance is invalid.
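
The per-feature check treats each distance as if the player had the whole match for it. A stricter variant, which is my own extension and not part of the original checks, adds up the minimum time each distance would need at top speed and compares the sum to matchDuration. The helper name `min_travel_time` and the toy rows are hypothetical; the speeds passed in are the ones from the list above.

```python
import pandas as pd


def min_travel_time(frame: pd.DataFrame, ride_speed: float,
                    walk_speed: float, swim_speed: float) -> pd.Series:
    """Least time (s) needed to cover the recorded distances at top speed."""
    return (frame["rideDistance"] / ride_speed
            + frame["walkDistance"] / walk_speed
            + frame["swimDistance"] / swim_speed)


# Toy rows (hypothetical values), not taken from the dataset.
toy_moves = pd.DataFrame({
    "Id": ["p1", "p2"],
    "matchDuration": [1800, 1800],
    "rideDistance": [42200.0, 0.0],    # about 1000 s at 42.2 m/s
    "walkDistance": [6700.0, 3000.0],  # about 1000 s at 6.7 m/s for p1
    "swimDistance": [0.0, 0.0],
})
needed = min_travel_time(toy_moves, ride_speed=42.2, walk_speed=6.7, swim_speed=2.9)
impossible = toy_moves[needed > toy_moves["matchDuration"]]
print(impossible[["Id", "matchDuration"]])
```

In the toy data, `p1` passes each single-feature check but would need about 2000 s of movement in an 1800 s match, so only the combined check flags them. matchDuration is still a generous bound, since most players die well before the match ends.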

In [None]:
df[(df['matchDuration'] * max_swim_speed < df['swimDistance'])][['Id', 'matchDuration', 'swimDistance']]

In [None]:
df[(df['matchDuration'] * max_ride_speed < df['rideDistance'])][['Id', 'matchDuration', 'rideDistance']]

In [None]:
df[(df['matchDuration'] * max_walk_speed < df['walkDistance'])][['Id', 'matchDuration', 'walkDistance']]

We found that 56 players have invalid walkDistance values. I'm not exactly sure how walkDistance was measured (does falling distance count? does jumping while sprinting count?) and I also don't know if there are tricks that people use to go faster without cheating, so I can't say these players are 100% cheating. However, if there are only 56 invalid walkDistance values out of over 4 million players, there is a high chance that these players are using some type of movement cheat that allows them to travel faster. Also, we saw that all rideDistance and swimDistance values are valid. This is most likely because movement hacks don't work in vehicles and there isn't that much swimming that needs to take place on PUBG maps. It may even be the case that cheaters just run on top of the water!  
Anyway, let's move on from the distance stuff. 

In [None]:
distplot(df, 'weaponsAcquired')

Huh?? 200+ weapons??

In [None]:
print(f"The average player acquires {df['weaponsAcquired'].mean():.2f} weapons. 99% of players acquire no more than {df['weaponsAcquired'].quantile(0.99)} weapons.")
df.sort_values('weaponsAcquired', ascending=False)[:25][['Id', 'matchId', 'weaponsAcquired']]

These are ridiculous numbers! 99% of players pick up fewer than 10 weapons, but there are players picking up 100+ weapons. This is similar to the instances where people were using 70+ healing items or running 35000+m with barely any kills. Again, is there some sort of achievement that you can get for picking up weapons? I'm not even sure if there are 236 guns that spawn on each map. Let's check out the match ID for the match where the player acquired 236 weapons.

In [None]:
df[df['matchId'] == '1df2560f0937ab'].sort_values('winPlacePerc', ascending=False)[['Id', 'kills', 'matchDuration', 'weaponsAcquired']]

Seems like a pretty normal game to me. How did this player acquire 236 weapons? My thought process is this: 99% of players pick up fewer than 10 weapons in a match, and 1% of the players in this dataset is around 40,000 players, so I couldn't confidently say that a player is cheating just because they're acquiring a bunch of weapons. But 100+ weapons?? What if weaponsAcquired means how many times the person picked up a weapon and not how many unique weapons they picked up? Could they just drop their weapon 236 times and pick it up each time? If not, then I would be confident that this player used some type of looting cheat.
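
To make "a bunch of weapons" a bit more concrete, each extreme count can be put next to what was typical in the same match. This is a minimal sketch: the helper name `weapon_outliers`, the quantile cutoff, and the toy rows are assumptions for illustration, and it cannot tell repeated pick-ups of one weapon apart from a looting cheat.

```python
import pandas as pd


def weapon_outliers(frame: pd.DataFrame, quantile: float = 0.99) -> pd.DataFrame:
    """Rows above a global weaponsAcquired quantile, with the match median attached."""
    threshold = frame["weaponsAcquired"].quantile(quantile)
    # Median weapon count among all players in the same match.
    match_median = frame.groupby("matchId")["weaponsAcquired"].transform("median")
    out = frame.assign(matchMedianWeapons=match_median)
    return out[out["weaponsAcquired"] > threshold]


# Toy rows (hypothetical values), not taken from the dataset.
toy_weapons = pd.DataFrame({
    "Id": ["a1", "a2", "a3", "a4", "b1", "b2", "b3", "b4"],
    "matchId": ["m1", "m1", "m1", "m1", "m2", "m2", "m2", "m2"],
    "weaponsAcquired": [3, 4, 5, 236, 2, 3, 4, 6],
})
flagged = weapon_outliers(toy_weapons, quantile=0.9)
print(flagged)
```

If the match median is also unusually high, the whole lobby is probably odd (an event or custom mode); a single extreme player in an otherwise ordinary match is the more suspicious pattern.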

In [None]:
distplot(df,'roadKills')