# Introduction

PlayerUnknown's Battlegrounds, also known as PUBG, is a multiplayer video game. Each match starts with up to 100 players aboard a plane that follows a predetermined route. Players jump whenever they choose and parachute to any location on the island. The goal is to be the last player (or team) alive, eliminating opponents with weapons found around the map.

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
plt.style.use("seaborn-whitegrid")  # on matplotlib >= 3.6 this style is named "seaborn-v0_8-whitegrid"
import seaborn as sns 
from collections import Counter
import warnings
warnings.filterwarnings("ignore")
# List every file under the read-only input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load Data
Read the training set from its comma-separated values (CSV) file into a DataFrame.

In [None]:
df = pd.read_csv('/kaggle/input/pubg-finish-placement-prediction/train_V2.csv')

# Variables

In [None]:
df.info()

* groupId - Integer ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* matchId - Integer ID to identify a match. No match appears in both the training and testing sets.
* assists - Number of enemy players this player damaged that were killed by teammates.
* boosts - Number of boost items used.
* damageDealt - Total damage dealt. Note: Self-inflicted damage is subtracted.
* DBNOs - Number of enemy players knocked down.
* headshotKills - Number of enemy players killed with headshots.
* heals - Number of healing items used.
* killPlace - Ranking in match of number of enemy players killed.
* killPoints - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.)
* kills - Number of enemy players killed.
* killStreaks - Max number of enemy players killed in a short amount of time.
* longestKill - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* maxPlace - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* numGroups - Number of groups we have data for in the match.
* revives - Number of times this player revived teammates.
* rideDistance - Total distance traveled in vehicles measured in meters.
* roadKills - Number of kills while in a vehicle.
* swimDistance - Total distance traveled by swimming measured in meters.
* teamKills - Number of times this player killed a teammate.
* vehicleDestroys - Number of vehicles destroyed.
* walkDistance - Total distance traveled on foot measured in meters.
* weaponsAcquired - Number of weapons picked up.
* winPoints - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.)
* winPlacePerc - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
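
The description of winPlacePerc above does not spell out its formula. As a minimal sketch, one linear mapping consistent with it (an assumption, not the confirmed competition formula) sends 1st place to 1 and the placement equal to maxPlace to 0. The function name `win_place_perc` and the argument `place` are hypothetical and do not exist in the dataset.

```python
def win_place_perc(place: int, max_place: int) -> float:
    """Assumed linear mapping: place 1 -> 1.0, place == max_place -> 0.0."""
    if max_place <= 1:
        # Assumption: with a single recorded placement there is no spread to scale.
        return 0.0
    return (max_place - place) / (max_place - 1)


# A toy match whose worst recorded placement (maxPlace) is 5
percentiles = [win_place_perc(p, 5) for p in range(1, 6)]
print(percentiles)
```

Because the denominator depends on maxPlace rather than numGroups, a match that skips placements would leave gaps in the values actually observed.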

For the analysis to be sound, outliers and missing values should be found and handled first.

## Outlier Detection
In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. An outlier can cause serious problems in statistical analyses.

In [None]:
def detect_outliers(df,features):
    outlier_indices = []
    
    for c in features:
        # 1st quartile
        Q1 = np.percentile(df[c],25)
        # 3rd quartile
        Q3 = np.percentile(df[c],75)
        # IQR
        IQR = Q3 - Q1
        # Outlier step
        outlier_step = IQR * 1.5
        # detect outliers and their indices
        outlier_list_col = df[(df[c] < Q1 - outlier_step) | (df[c] > Q3 + outlier_step)].index
        # store indices
        outlier_indices.extend(outlier_list_col)
    
    # count how many features flag each row; keep rows flagged in more than 3 features
    outlier_indices = Counter(outlier_indices)
    multiple_outliers = list(i for i, v in outlier_indices.items() if v > 3)
    
    return multiple_outliers
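
The function above applies the 1.5 × IQR rule to each feature and returns only the rows flagged in more than three features. To make the per-feature rule concrete, here is a small worked example on made-up numbers (the array `values` is purely illustrative and not taken from the dataset).

```python
import numpy as np

# Toy sample with one obviously extreme value
values = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5, 40])

q1, q3 = np.percentile(values, [25, 75])  # 1st and 3rd quartiles
iqr = q3 - q1                             # interquartile range
lower = q1 - 1.5 * iqr                    # lower fence
upper = q3 + 1.5 * iqr                    # upper fence

# Anything outside the fences counts as an outlier for this feature
outliers = values[(values < lower) | (values > upper)]
print(lower, upper, outliers)
```

Here only the value 40 falls outside the fences. In `detect_outliers`, a row is dropped only if this happens for more than three of its features, which makes the filter more conservative than the single-feature rule.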

In [None]:
numeric_features = list(df.columns)
# remove non-numeric features
numeric_features.remove("Id")
numeric_features.remove("groupId")
numeric_features.remove("matchId")
numeric_features.remove("matchType")
df.loc[detect_outliers(df,numeric_features)]

In [None]:
# drop outliers
df = df.drop(detect_outliers(df,numeric_features),axis = 0).reset_index(drop = True)

## Missing Values
In statistics, missing data, or missing values, occur when no data value is stored for the variable in an observation. Missing data are a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

In [None]:
# Find missing values: columns that contain NaN, then the NaN count per column
print(df.columns[df.isnull().any()])
df.isnull().sum()

In [None]:
df[df["winPlacePerc"].isnull()]

In [None]:
# Fill the missing value with the column mean
df["winPlacePerc"] = df["winPlacePerc"].fillna(np.mean(df["winPlacePerc"]))
# Verify: this should now return no rows
df[df["winPlacePerc"].isnull()]
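
Filling with the mean keeps every row, but when the missing column is the prediction target, dropping the affected rows is a common alternative. The sketch below compares the two options on a made-up frame (`toy` is illustrative only); it is not a claim about which choice is better for this dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing target value
toy = pd.DataFrame({
    "kills": [0, 2, 1, 0],
    "winPlacePerc": [0.10, 0.90, np.nan, 0.20],
})

# Option 1: mean imputation, as done above
filled = toy.assign(winPlacePerc=toy["winPlacePerc"].fillna(toy["winPlacePerc"].mean()))

# Option 2: drop rows whose target is missing
dropped = toy.dropna(subset=["winPlacePerc"]).reset_index(drop=True)

print(filled)
print(dropped)
```

With only a handful of missing rows out of millions, both options should give nearly identical downstream results.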

Now that the outliers have been removed and no missing values remain, we can start the data analysis.

# Basic Data Analysis

In [None]:
df.head()

In [None]:
def bar_plot(variable):
    var = df[variable]
    varValue = var.value_counts()
    
    # visualize
    plt.figure(figsize=(10,5))
    plt.bar(varValue.index, varValue)
    plt.xticks(varValue.index, varValue.index.values)
    plt.ylabel("Quantity (million)")
    plt.title(variable)
    plt.show()
    print("{}: \n {}".format(variable,varValue))

Let's start with kill scores.

In [None]:
bar_plot("kills")

* 65% of the players did not get a single kill.
* Interestingly, 15,515 players from this group still won the game.

In [None]:
# Winners with zero kills
len(df[(df.kills == 0) & (df.winPlacePerc == 1)])

In [None]:
# Of those, matches with more than 50 groups (used here as a proxy for solo play)
len(df[(df.kills == 0) & (df.winPlacePerc == 1) & (df.numGroups > 50)])

* However, only 3 of these players were in matches with more than 50 groups, which we take as a sign of solo play. For the rest, their teammates most likely did all the work.
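
The `numGroups > 50` filter is only a heuristic for solo play. Since the data also has a `matchType` column, a more direct check is possible. The sketch below assumes that solo modes are labeled with strings containing "solo" (for example "solo" and "solo-fpp"); those values and the toy frame are assumptions for illustration and should be checked against `df.matchType.unique()` first.

```python
import pandas as pd

# Toy frame; the matchType values are assumed, not taken from the source
toy = pd.DataFrame({
    "matchType": ["solo", "squad-fpp", "solo-fpp", "duo", "squad"],
    "kills": [0, 0, 3, 0, 0],
    "winPlacePerc": [1.0, 1.0, 1.0, 0.5, 1.0],
})

is_solo = toy["matchType"].str.contains("solo")
zero_kill_winner = (toy["kills"] == 0) & (toy["winPlacePerc"] == 1)

solo_zero_kill_winners = toy[is_solo & zero_kill_winner]
print(len(solo_zero_kill_winners))
```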

In [None]:
# top 5 killers
df.sort_values(by=['kills'], ascending=False).head(5)

In [None]:
print("{:.2f}% of kills are headshot kills.".format(df.headshotKills.sum()/df.kills.sum() * 100))

In [None]:
def plot_hist(df, variable):
    plt.figure(figsize = (10,5))
    plt.hist(df[variable], bins = 50)
    plt.xlabel(variable)
    plt.ylabel("Quantity (million)")
    plt.title("{} distribution with hist".format(variable))
    plt.show()

Let's take a look at the damage dealt by players with zero kills.

In [None]:
plot_hist(df[df.kills == 0], "damageDealt")

* Most of them dealt no damage at all.
* 16% of those who dealt zero damage never picked up a weapon, which suggests they were killed at the very beginning of the game.

In [None]:
# Players with zero kills who also dealt zero damage
len(df[(df.damageDealt == 0) & (df.kills == 0)])

In [None]:
len(df[(df["weaponsAcquired"] == 0) & (df.damageDealt == 0) & (df.kills == 0)])

In [None]:
df[(df["weaponsAcquired"] == 0) & (df.damageDealt == 0) & (df.kills == 0)].winPlacePerc.mean()

The result above supports our assumption: the mean winPlacePerc of these players is close to 0.

In [None]:
kills = df.copy()

kills['killsCategories'] = pd.cut(kills['kills'], [-1, 0, 2, 5, 10, 60], labels=['0_kills','1-2_kills', '3-5_kills', '6-10_kills', '10+_kills'])

plt.figure(figsize=(15,8))
sns.boxplot(x="killsCategories", y="winPlacePerc", data=kills)
plt.show()

As the boxplot above shows, winPlacePerc tends to rise with the number of kills, so killing and winning are positively related.
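
A boxplot shows the trend visually; a rank correlation puts a number on it. The sketch below computes Spearman's coefficient as the Pearson correlation of the ranks, on made-up values (`toy` is illustrative only). On the real data the same two lines could be applied to `df[["kills", "winPlacePerc"]]`.

```python
import pandas as pd

# Toy data: more kills loosely go with a better placement
toy = pd.DataFrame({
    "kills": [0, 0, 1, 2, 3, 5, 8],
    "winPlacePerc": [0.05, 0.30, 0.25, 0.60, 0.55, 0.90, 1.00],
})

# Spearman correlation = Pearson correlation computed on the ranks
spearman = toy.rank().corr().loc["kills", "winPlacePerc"]
print(round(spearman, 3))
```

Rank correlation is used here because kill counts are heavily skewed, so a measure that depends only on ordering is less sensitive to a few extreme players.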

Now let's examine vehicle usage.

In [None]:
plot_hist(df, "rideDistance")

In [None]:
print("{:.2f}% of the players did not drive.".format(len(df[df.rideDistance == 0])/len(df) * 100.0))
print("The average player drives {:.2f} meters.".format(df.rideDistance.mean()))

In [None]:
distances = df.copy()

distances['distanceCategories'] = pd.cut(distances['rideDistance'], [-1, 0, 1000, 5000, 10000, 50000], labels=['0','1-1000', '1001-5000', '5001-10000', '10000+'])

plt.figure(figsize=(15,8))
sns.boxplot(x="distanceCategories", y="winPlacePerc", data=distances)
plt.show()

The boxplot suggests a weak positive relationship between rideDistance and winPlacePerc.
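
To read the same binning numerically instead of from a plot, the categories can be grouped and summarized. This is a minimal sketch on made-up values (`toy` is illustrative only) that reuses the bin edges and labels from the cell above.

```python
import pandas as pd

# Toy data, not taken from the dataset
toy = pd.DataFrame({
    "rideDistance": [0, 0, 500, 800, 2500, 4000, 7000, 12000],
    "winPlacePerc": [0.10, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90],
})

bins = [-1, 0, 1000, 5000, 10000, 50000]
labels = ["0", "1-1000", "1001-5000", "5001-10000", "10000+"]

# pd.cut uses right-inclusive intervals, so 0 falls in (-1, 0]
toy["distanceCategories"] = pd.cut(toy["rideDistance"], bins, labels=labels)

# Row count and median placement per distance category
summary = (
    toy.groupby("distanceCategories", observed=True)["winPlacePerc"]
    .agg(["count", "median"])
)
print(summary)
```

Note that values above the last bin edge (50000 here) would be assigned NaN by `pd.cut` and silently drop out of the summary.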

In [None]:
# Top 5 drivers
print("average winPlacePerc of the top 5 drivers is {}".format(df.sort_values(by=['rideDistance'], ascending=False).head(5).winPlacePerc.mean()))
df.sort_values(by=['rideDistance'], ascending=False).head(5)

Now let's examine walking distances and see whether they relate to winning.

In [None]:
plot_hist(df, "walkDistance")
print("max walking distance is {}".format(df.walkDistance.max()))

In [None]:
distances2 = df.copy()

distances2['distanceCategories'] = pd.cut(distances2['walkDistance'], [-1, 0, 1000, 5000, 10000, 50000], labels=['0','1-1000', '1001-5000', '5001-10000', '10000+'])

plt.figure(figsize=(15,8))
sns.boxplot(x="distanceCategories", y="winPlacePerc", data=distances2)
plt.show()

The boxplot shows that walkDistance is positively related to winPlacePerc: players who walk farther tend to place higher.