In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

## Introduction

This kernel-based competition started and ended a while ago, so it already has some great notebooks to look at. As a Kaggle novice myself, I'll try to illustrate the basic procedure of data science with this interesting dataset, going from EDA (Exploratory Data Analysis) through feature engineering to modelling. To briefly explain,

* EDA : looking for patterns within the given columns and correlations between them
* Feature Engineering : creating and manipulating columns based on your observations from EDA
* Modelling : choosing the best features from what you've got so far, and coming up with the best model to use them

Let's begin with EDA first!

## EDA

The goal of EDA is to get a general view of the dataset. This includes finding out what columns there are, what data types they have, how the values in each column are distributed, and how one column correlates with another. Here, by column I generally mean a feature, which is the more data-science-ish term.
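
As a minimal sketch of those four checks, here is how they might look on a tiny made-up DataFrame. The name `toy` and all of its values are hypothetical and only borrow column names from the competition for flavor.

```python
import pandas as pd

# Hypothetical toy data, only for illustration (not the competition dataset)
toy = pd.DataFrame({
    "kills": [0, 1, 0, 3, 2],
    "walkDistance": [120.0, 800.5, 50.0, 2500.0, 1900.0],
    "matchType": ["solo", "duo", "squad", "solo", "duo"],
})

print(toy.columns.tolist())   # what columns there are
print(toy.dtypes)             # what data type each column has
print(toy.describe())         # summary of the numeric distributions

# Pairwise correlation between the numeric columns only
corr = toy[["kills", "walkDistance"]].corr()
print(corr)
```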

Since the human brain is better suited to visual patterns than to text or numbers, a good EDA often comes with good visualization. So we'll need some decent libraries to help with visualization.

In [None]:
# The classic visualization tool in Python. Old, but still powerful.
import matplotlib.pyplot as plt 
# An easy-to-use tool based on matplotlib
import seaborn as sns
# A handy library for missing value detection
import missingno as msno

Now it's time to load our dataset! It consists of train, test and sample submission files, which is the simplest form of dataset for a Kaggle competition. Train comes with a label that you can use for supervised learning. Test comes without the label, so predicting it with your model is your job. The sample submission shows what your submission csv file should look like: what the column names should be, whether you should include the index, etc.

I'll bring in train and test for now.

In [None]:
train = pd.read_csv('/kaggle/input/pubg-finish-placement-prediction/train_V2.csv')
test = pd.read_csv('/kaggle/input/pubg-finish-placement-prediction/test_V2.csv')
train.head()

I like to call pd.DataFrame.head() as my first action after loading a dataset. It gives you a general idea of the columns, the kind of values they hold and their format (e.g. % or decimal). However, this dataset comes with so many columns that you can't see them all at once. In this case, the pd.DataFrame.columns attribute comes in handy.

In [None]:
print("Number of Columns: {}".format(len(train.columns)))
train.columns
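
Alternatively, if you would rather have `head()` itself show every column, pandas has a display option for that. A small sketch on a hypothetical wide DataFrame (`wide` and its column names are made up for illustration):

```python
import pandas as pd

# Hypothetical wide frame with 40 columns, just to illustrate
wide = pd.DataFrame([[0] * 40], columns=["col_{}".format(i) for i in range(40)])

# By default pandas truncates wide output; None removes the column limit
pd.set_option("display.max_columns", None)
print(wide.head())

# The column names are always available in full as a plain list
column_names = wide.columns.tolist()
print(len(column_names))
```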

In [None]:
train.dtypes.sort_values(ascending=False)

Okay. We have four columns of object type (roughly equivalent to string) and all the others are numeric. And since three of the four are IDs, there won't be much need for label encoding, a type of preprocessing that turns categorical data into numbers. Lucky for us!
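
For readers who have not met label encoding yet, here is a minimal sketch of the idea using `pd.factorize`. The series `match_type` and its values are hypothetical stand-ins for a categorical column, and this is only one of several ways to encode categories.

```python
import pandas as pd

# Hypothetical categorical values (not read from the competition files)
match_type = pd.Series(["solo", "duo", "squad", "duo", "solo"])

# pd.factorize assigns an integer code to each distinct value,
# in order of first appearance
codes, uniques = pd.factorize(match_type)
print(codes)          # integer code per row
print(list(uniques))  # the category each code stands for
```

The integer codes carry no real ordering, so tree-based models usually cope with them better than linear ones.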

And I think the column 'winPlacePerc' would be our label, but let's check to be safe.

In [None]:
[x for x in train.columns if x not in test.columns]

So, the column 'winPlacePerc' is in train but not in test, which means it must be our label.

Before anything else, I recommend checking for missing values so that your code doesn't throw errors later on, which can be really frustrating. I'll use the missingno library for this.

### Handling Missing Values

In [None]:
msno.matrix(train)

In [None]:
train.isna().sum()

Wow. This really is a nice dataset! There is only one missing value, in the winPlacePerc column. Let's see what happened there.

In [None]:
train[np.isnan(train['winPlacePerc'])]

Since this player has no assists, boosts, damageDealt, rideDistance, swimDistance, walkDistance, etc. whatsoever, I think I can call this row invalid. Just to make sure, let's look at the whole match it belongs to.

In [None]:
train[train['matchId']=='224a123c53e008']

All right. That player was the only one in the match, so we can call it a non-game. We'll just drop this row, and finally begin our EDA.

In [None]:
train.dropna(inplace=True)
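
The single-player check above can also be done for every match at once instead of one matchId at a time. A sketch on made-up rows, assuming one row per player with `Id` and `matchId` columns; the frame `toy` and the helper column `playersInMatch` are hypothetical names.

```python
import pandas as pd

# Hypothetical rows: match "m2" has a single player
toy = pd.DataFrame({
    "Id": ["a", "b", "c", "d"],
    "matchId": ["m1", "m1", "m1", "m2"],
})

# Number of rows (players) sharing each matchId, aligned to every row
toy["playersInMatch"] = toy.groupby("matchId")["Id"].transform("count")

# Rows that belong to a one-player match
lonely = toy[toy["playersInMatch"] == 1]
print(lonely)
```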

Now it's time for the actual EDA. Let's check the general distributions of values first.

Since I'll want to plot many distributions, I'll define a function for that visualization to keep things handy.

### Overall Distributions

In [None]:
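def dist_plot(col, data=train):
    """Plot the distribution of a single column of `data`."""
    plt.figure(figsize=(10,6.5))
    # Note: sns.distplot is deprecated since seaborn 0.11;
    # on newer versions, sns.histplot(data[col], kde=True) is a close replacement.
    sns.distplot(data[col])
    plt.title("Distribution_{}".format(col))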

In [None]:
dist_plot('winPlacePerc')

# Rather evenly distributed, as it ought to be.

In [None]:
dist_plot('kills')

In [None]:
dist_plot('assists')

In [None]:
dist_plot('heals')

In [None]:
dist_plot('damageDealt')

In [None]:
dist_plot('walkDistance')

We could do this a lot more, but I'm sure you've got the point. If you have to make similar visualizations repeatedly, wrapping them in a function makes it a lot easier.

Insights from Distribution Visualization :
* Most in-game indicators show a long tail (naturally, since it's a battle royale game with fewer and fewer survivors as it goes on)
* Most people get to walk a fair distance, but only the chosen few get to shoot at others much (note the especially uneven distributions of kills and damageDealt)
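
If you want a number to go with the "long tail" impression, one option is the sample skewness plus the share of zeros per column. A sketch on hypothetical values (the frame `toy` is made up, and skewness is my own choice of summary here, not something the plots above compute):

```python
import pandas as pd

# Hypothetical values: kills has a long right tail, walkDistance is symmetric
toy = pd.DataFrame({
    "kills": [0, 0, 0, 0, 0, 0, 1, 1, 2, 10],
    "walkDistance": [100.0, 200.0, 300.0, 400.0, 500.0,
                     600.0, 700.0, 800.0, 900.0, 1000.0],
})

skewness = toy.skew()            # clearly positive means a long right tail
zero_share = (toy == 0).mean()   # fraction of rows equal to zero, per column
print(skewness)
print(zero_share)
```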

The distplot is great at giving you the general picture, yet you need a little more effort to get real 'insights'. A lot of the time, that requires coming up with some hypotheses. In my case, I chose to focus on four areas: kills (damage), rides, walks and number of weapons acquired.

### Hypothesis 1 : Kills and Win Place

The number of kills has got to have some relationship with the final rank of the player. How many kills do people get in the first place?

In [None]:
train['kills'].describe(percentiles=[0.1*x for x in range(10)])
# This prints the threshold value at every 10th percentile (0% through 90%)

Over 50% of the players get no kills at all, and almost 80% get fewer than 2. At the same time, one guy slayed 72. (John Wick, is that you?)
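
Percentile thresholds only bracket those shares. To get the exact fraction of players at or below each kill count, a cumulative share works well. A sketch with hypothetical kill counts (not the real data):

```python
import pandas as pd

# Hypothetical kill counts, not the real data
kills = pd.Series([0, 0, 0, 0, 0, 0, 1, 1, 2, 10])

# Share of players at each kill count, then accumulated from 0 kills upward
share = kills.value_counts(normalize=True).sort_index()
cumulative = share.cumsum()
print(cumulative)

# cumulative.loc[k] is the fraction of players with at most k kills
at_most_one = cumulative.loc[1]
```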

In [None]:
temp = train.groupby('kills')['winPlacePerc'].mean()

plt.figure(figsize=(10, 7))
plt.bar(temp.index, temp, color='peachpuff')
plt.plot(temp, color='chocolate')
plt.plot(temp.index, np.ones(len(temp)), ls='--', alpha=0.5, color='k')
plt.title("Kills and Win Place")
plt.xlabel("# Kills")
plt.ylabel("Win Place")

How do you like this visualization? It's a very simple one using matplotlib bar plots and line plots, but I think it captures what we wanted to see effectively.

Also, this is just a personal hobby of mine, but if you visit the URL below, you can check out some charming colors that matplotlib lets you use by their simple English names. Try out some cool color combinations of your own.

https://matplotlib.org/3.1.0/gallery/color/named_colors.html

Back to our EDA: the chart shows that up to about 8 kills, each additional kill greatly lifts your Win Place Percentage (almost 6% per kill).

However, after a certain point, more kills don't necessarily guarantee a higher rank. The average win place even slides down at some of the higher kill counts. Perhaps, as they say, knowing when to fight and when not to is also crucial to winning a war!

So to sum up, a certain level of aiming skill is definitely required to survive long, but beyond that it's strategy and positioning (+ luck!) rather than simple shooting and killing that count.
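
To read the per-kill gain off numbers rather than off the chart, one option is to difference the grouped means. It also helps to keep the group sizes next to them, since the mean for a rare kill count may rest on very few players. A sketch on made-up rows (the frame `toy` and the column name `gain` are hypothetical):

```python
import pandas as pd

# Hypothetical rows, not the real data
toy = pd.DataFrame({
    "kills":        [0, 0, 0, 1, 1, 2, 2, 3],
    "winPlacePerc": [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.9, 0.5],
})

# Mean win place and number of players for each kill count
summary = toy.groupby("kills")["winPlacePerc"].agg(["mean", "count"])

# Change in mean winPlacePerc when moving from k-1 kills to k kills
summary["gain"] = summary["mean"].diff()
print(summary)
```

A negative `gain` backed by a tiny `count` is weak evidence on its own, so it is worth checking both columns before reading much into a dip.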



In [None]:
plt.figure(figsize=(12,8))
plt.scatter(train['kills'], train['winPlacePerc'], s=1, color='plum')
plt.title('kills-winPlace')