In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

**1. Load Data**


In [None]:
train = pd.read_csv("../input/train_V2.csv")
test = pd.read_csv("../input/test_V2.csv")

#Check the shape of data
print(train.shape)
print(test.shape)
train.dtypes

All the features are numerical, except the ID columns (Id, groupId, matchId) and matchType, which is a string.

Check for missing values in the dataset:

In [None]:
train.isnull().sum()

In [None]:
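# Assign the result back instead of calling fillna(inplace=True) on an attribute-accessed column,
# which is not guaranteed to modify `train` in recent pandas versions.
train["winPlacePerc"] = train["winPlacePerc"].fillna(train["winPlacePerc"].mean())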
train.winPlacePerc.isnull().sum()
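
The cell above fills the missing target with the column mean. As a hedged alternative (not what this notebook does), rows with a missing target could simply be dropped, since an imputed label adds no real information. The sketch below compares both options on a tiny made-up frame; `toy` and its values are hypothetical and only stand in for `train`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "kills": [0, 2, 5, 1],
    "winPlacePerc": [0.10, 0.55, np.nan, 0.30],
})

# Option A (what the notebook does): fill the missing target with the mean.
filled = toy.assign(
    winPlacePerc=toy["winPlacePerc"].fillna(toy["winPlacePerc"].mean())
)

# Option B (alternative, an assumption): drop rows whose target is missing.
dropped = toy.dropna(subset=["winPlacePerc"]).reset_index(drop=True)

print(filled["winPlacePerc"].isnull().sum(), len(dropped))
```

Either way, the check `isnull().sum()` on the target should come back as zero afterwards.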

**2. Exploratory Data Analysis**

Columns Description:
* **DBNOs** - Number of enemy players knocked.
* **assists** - Number of enemy players this player damaged that were killed by teammates.
* **boosts** - Number of boost items used.
* **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.
* **headshotKills** - Number of enemy players killed with headshots.
* **heals** - Number of healing items used.
* **Id** - Player’s Id
* **killPlace** - Ranking in match of number of enemy players killed.
* **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* **killStreaks** - Max number of enemy players killed in a short amount of time.
* **kills** - Number of enemy players killed.
* **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* **matchDuration** - Duration of match in seconds.
* **matchId** - ID to identify match. There are no matches that are in both the training and testing set.
* **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* **revives** - Number of times this player revived teammates.
* **rideDistance** - Total distance traveled in vehicles measured in meters.
* **roadKills** - Number of kills while in a vehicle.
* **swimDistance** - Total distance traveled by swimming measured in meters.
* **teamKills** - Number of times this player killed a teammate.
* **vehicleDestroys** - Number of vehicles destroyed.
* **walkDistance** - Total distance traveled on foot measured in meters.
* **weaponsAcquired** - Number of weapons picked up.
* **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* **numGroups** - Number of groups we have data for in the match.
* **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
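
The descriptions of killPoints, winPoints and rankPoints mention sentinel values that stand for "None". A minimal sketch of how those sentinels could be turned into real missing values is given below. The frame `toy` and its numbers are made up for illustration, and this step is not part of the analysis that follows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the values are made up for illustration.
toy = pd.DataFrame({
    "rankPoints": [-1, 1500, 1480, -1],
    "killPoints": [1200, 0, 0, 0],
    "winPoints": [1400, 0, 0, 1500],
})

# Where rankPoints is present (not -1), a 0 in killPoints/winPoints means "None".
has_rank = toy["rankPoints"] != -1
for col in ["killPoints", "winPoints"]:
    toy[col] = toy[col].astype("float64")  # float dtype so the column can hold NaN
    toy.loc[has_rank & (toy[col] == 0), col] = np.nan

# rankPoints itself uses -1 as its "None" marker.
toy["rankPoints"] = toy["rankPoints"].replace(-1, np.nan)

print(toy)
```

Note that the last row keeps its killPoints of 0, because rankPoints is -1 there and the rule only applies when rankPoints holds a real value.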

From the descriptions above, we take the columns below as the features of most interest:

In [None]:
features = list(train.columns)
print(features)

#Based on player, so drop "Id", "groupId", "matchId", "winPlacePerc"
for i in ["Id","groupId","matchId","winPlacePerc"]:
    features.remove(i)
    
print(features)

Based on the definitions of the columns, we can divide these features into the following groups:

kills:[damageDealt, DBNOs, headshotKills, killPlace, killPoints, kills, killStreaks, longestKill, roadKills]

teamwork: [assists, revives]

med: [boosts, heals]

movement: [rideDistance, swimDistance, walkDistance]

Others: [maxPlace, numGroups, teamKills, vehicleDestroys, weaponsAcquired, winPoints]

In [None]:
kills = ["damageDealt", "DBNOs", "headshotKills", "killPlace", "killPoints", "kills", "killStreaks", "longestKill", "roadKills"]
teamwork = ["assists", "revives"]
med = ["boosts", "heals"]
movement = ["rideDistance", "swimDistance", "walkDistance"]
Others = ["maxPlace", "numGroups", "teamKills", "vehicleDestroys", "weaponsAcquired", "winPoints"]

print("kills: "+str(len(kills)))
print("teamwork: "+str(len(teamwork)))
print("med: "+str(len(med)))
print("movement: "+str(len(movement)))
print("Others: "+str(len(Others)))
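
A quick sanity check on the grouping: the sketch below verifies that no feature sits in two groups and lists the features that are in none of them. The column names are copied from the description list above (minus the three IDs and the target), so the block runs without the dataset; the dictionary name `groups` is introduced here only for illustration.

```python
# Column names from the description list, minus Id, groupId, matchId, winPlacePerc.
features = [
    "DBNOs", "assists", "boosts", "damageDealt", "headshotKills", "heals",
    "killPlace", "killPoints", "killStreaks", "kills", "longestKill",
    "matchDuration", "matchType", "rankPoints", "revives", "rideDistance",
    "roadKills", "swimDistance", "teamKills", "vehicleDestroys",
    "walkDistance", "weaponsAcquired", "winPoints", "numGroups", "maxPlace",
]

groups = {
    "kills": ["damageDealt", "DBNOs", "headshotKills", "killPlace", "killPoints",
              "kills", "killStreaks", "longestKill", "roadKills"],
    "teamwork": ["assists", "revives"],
    "med": ["boosts", "heals"],
    "movement": ["rideDistance", "swimDistance", "walkDistance"],
    "Others": ["maxPlace", "numGroups", "teamKills", "vehicleDestroys",
               "weaponsAcquired", "winPoints"],
}

# Flatten the groups and make sure no feature appears twice.
grouped = [col for cols in groups.values() for col in cols]
assert len(grouped) == len(set(grouped))

# Features that were not assigned to any group.
ungrouped = sorted(set(features) - set(grouped))
print(ungrouped)  # ['matchDuration', 'matchType', 'rankPoints']
```

So the five groups cover 22 of the 25 features; matchDuration, matchType and rankPoints are left out of the grouping.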

**2-1 Kills**

Let's first see the relationship between Kills columns and winPlacePerc!

In [None]:
data = train.copy()
for col in kills:
    print(data[col].describe().drop("count"))
    print()

For kills data, the first thing that comes to mind is how many players I killed or knocked down. It is also one of the most direct measures of performance in the game, so I will look at kills and DBNOs first.

From the summary above, we can see that some players have very high kill counts (like 60), which may distort the plots and the analysis. To avoid this, I cap all values larger than 10 at 10.

In [None]:
data.loc[data.kills > 10,"kills"] = 10
data.loc[data.DBNOs > 10,"DBNOs"] = 10 
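
The same capping can be written with `clip`, which may read more directly than the two `.loc` assignments. This is only an equivalent sketch on made-up values; `toy` is a hypothetical stand-in for `data`.

```python
import pandas as pd

# Hypothetical toy values; 12, 60 and 11 exceed the cap of 10.
toy = pd.DataFrame({"kills": [0, 3, 12, 60], "DBNOs": [1, 10, 11, 0]})

# Cap both columns at 10; values at or below 10 are left unchanged.
capped = toy.clip(upper=10)

print(capped)
```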

In [None]:
nrows=3
ncols=3
fig,axes = plt.subplots(nrows=nrows,ncols=ncols,figsize=(ncols*6,nrows*4))
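# Iterate row by row so the flat index into `kills` matches the axes grid.
for i in range(nrows):
    for j in range(ncols):
        idx = i * ncols + j
        # Note: distplot is deprecated since seaborn 0.11 (histplot/displot replace it).
        sns.distplot(data[kills[idx]], kde=True, ax=axes[i][j])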

Boxplots of kills and DBNOs:

In [None]:
sns.boxplot(x="kills",y="winPlacePerc",data=data)        

In [None]:
sns.boxplot(x="DBNOs",y="winPlacePerc",data=data)

In [None]:
sns.jointplot(x="winPlacePerc",y="damageDealt",data=data)

Apparently, kills, DBNOs and damageDealt all show a positive trend. ***More kills/DBNOs/damageDealt, more likely to win.***

This is easy to understand, because these stats reflect a player's personal fighting and shooting skills. Usually, better players are more likely to survive to the end!

However, personal shooting skill is not everything. Teamwork and strategy also matter, as we will discuss later.
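
The positive trend read off the plots could also be quantified with a rank correlation. Spearman correlation is swapped in here as an assumption, since the relationship looks monotonic rather than strictly linear. The sketch runs on a small made-up frame, so its numbers say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up to illustrate the call.
toy = pd.DataFrame({
    "kills": [0, 1, 2, 5, 8],
    "damageDealt": [0.0, 150.0, 90.0, 400.0, 700.0],
    "winPlacePerc": [0.1, 0.3, 0.5, 0.8, 1.0],
})

# Spearman correlation of every column with the target.
rho = toy.corr(method="spearman")["winPlacePerc"].drop("winPlacePerc")

print(rho)
```

On the real data, the same call would be `data[kills + ["winPlacePerc"]].corr(method="spearman")["winPlacePerc"]`.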

Before moving on to the teamwork data, I also noticed an interesting feature: longestKill. What does it tell us?


In [None]:
sns.distplot(data.longestKill,bins=100)

In [None]:
plt.scatter(x="winPlacePerc",y="longestKill",data=data)

This column may be misleading, as downing a player and driving away may lead to a large longestKill stat. But a low winPlacePerc almost always comes with a short longestKill, **which may indicate the importance of *sniper rifles and high-magnification scopes*.**

**2-2 Medicine Items**

Besides killing, staying alive and moving fast are key to winning. There are two main kinds of medicine in the game: heal items and boost items. One restores health, and the other boosts movement. Let's see how they affect the probability of winning.

In [None]:
data = train.copy()
data = data[data['heals'] < data['heals'].quantile(0.99)]
data = data[data['boosts'] < data['boosts'].quantile(0.99)]
print(data["heals"].describe().drop("count"))
print(data["boosts"].describe().drop("count"))
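
To see how the quantile filter above behaves, and how the filtered data could be related to the target, here is a small sketch on made-up numbers. The frame `toy` is hypothetical, and averaging winPlacePerc per boost count is just one possible way to look at the relationship.

```python
import pandas as pd

# Hypothetical toy data; the single value 20 plays the role of an outlier.
toy = pd.DataFrame({
    "boosts": [0, 0, 1, 2, 3, 3, 20],
    "winPlacePerc": [0.1, 0.2, 0.4, 0.6, 0.7, 0.9, 1.0],
})

# Keep rows strictly below the 99th percentile, as in the cell above.
trimmed = toy[toy["boosts"] < toy["boosts"].quantile(0.99)]

# Average target per number of boosts used.
mean_by_boosts = trimmed.groupby("boosts")["winPlacePerc"].mean()

print(mean_by_boosts)
```

One caveat of the strict `<` comparison: if the 99th percentile equals the column minimum (for example, a column that is almost all zeros), the filter removes every row.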

**2-3 Teamwork**

Knowing how to kill is not enough to win; teamwork is also important. In this section, I will mainly discuss two features: "assists" and "revives".

In [None]:
for col in teamwork:
    print(data[col].describe().drop("count"))
    print()

To study teamwork, we need to aggregate the dataset by group.

In [None]:
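# numeric_only=True skips string columns (Id, matchId, matchType), which would raise in pandas >= 2.0.
team = train.groupby("groupId").mean(numeric_only=True)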
print(team.shape)
team.head()
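
As a small illustration of the group-level aggregation, the sketch below averages a made-up frame by groupId. The frame `toy` and the extra `size` column are hypothetical additions for illustration; `numeric_only=True` is passed so that string columns such as matchType are skipped rather than averaged.

```python
import pandas as pd

# Hypothetical toy data: two groups with two and three players.
toy = pd.DataFrame({
    "groupId": ["g1", "g1", "g2", "g2", "g2"],
    "matchType": ["duo", "duo", "squad", "squad", "squad"],
    "assists": [1, 3, 0, 0, 3],
    "revives": [0, 2, 1, 1, 1],
    "winPlacePerc": [0.8, 0.8, 0.25, 0.25, 0.25],
})

# One row per group, holding the mean of each numeric column.
team_toy = toy.groupby("groupId").mean(numeric_only=True)

# Number of players per group, which the mean alone does not show.
team_toy["size"] = toy.groupby("groupId").size()

print(team_toy)
```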

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(12,6))
sns.distplot(team[teamwork[0]],kde=False,ax=axes[0],bins=100)
sns.distplot(team[teamwork[1]],kde=False,ax=axes[1],bins=100)

In [None]:
fig,axes=plt.subplots(nrows=1,ncols=2,figsize=(18,6))
sns.boxplot(x=round(team.assists),y=team.winPlacePerc,ax=axes[0])
sns.boxplot(x=round(team.revives),y=team.winPlacePerc,ax=axes[1])

These two variables behave in interesting ways.

"Assists" shows that **more assists usually lead to a better rank and a higher lower bound.** Why? I think more assists means more team battles and kills, and teamwork is usually more powerful than fighting alone.

"Revives" shows a different result: more revives can lead to decent ranks, but not to first place in most games. A moderate number of revives is best. In my understanding, too many revives means teammates were knocked down several times, which is not good for the endgame. ***Avoiding needless deaths while being ready to save teammates at any time should be the best way to win.***

**2-4 Movement**

Now we will discuss movement in the game, including swimming, running and riding.

In [None]:
#Riding
ride = team.copy()
ride = ride[ride["rideDistance"]<ride["rideDistance"].quantile(0.99)]
print(ride["rideDistance"].describe().drop("count"))
sns.distplot(ride["rideDistance"])
sns.jointplot(x="winPlacePerc",y="rideDistance",data=team)

In [None]:
#Running
run = team.copy()
run = run[run["walkDistance"]<run["walkDistance"].quantile(0.99)]
print(run["walkDistance"].describe().drop("count"))
sns.distplot(run["walkDistance"])
sns.jointplot(x="winPlacePerc",y="walkDistance",data=team)

In [None]:
swim = team.copy()
swim = swim[swim["swimDistance"]<swim["swimDistance"].quantile(0.99)]
print(swim["swimDistance"].describe().drop("count"))
sns.distplot(swim["swimDistance"],kde=False)

In [None]:
swim['swimDistance'] = pd.cut(swim['swimDistance'], [-1, 0, 5, 20, 5286], labels=['0m','1-5m', '6-20m', '20m+'])
sns.boxplot(x="swimDistance",y="winPlacePerc",data=swim)
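
The binning step can be checked on a few hand-picked values. In this sketch the last bin edge is `np.inf` rather than the hard-coded 5286, which is an assumption made here so that the bins do not depend on the maximum of a particular dataset; `swim_m` is a made-up series.

```python
import numpy as np
import pandas as pd

# Hypothetical swim distances in meters.
swim_m = pd.Series([0.0, 3.2, 5.0, 12.0, 20.0, 150.0])

# Bins are right-closed: (-1, 0], (0, 5], (5, 20], (20, inf).
binned = pd.cut(
    swim_m,
    bins=[-1, 0, 5, 20, np.inf],
    labels=["0m", "1-5m", "6-20m", "20m+"],
)

print(binned.tolist())  # ['0m', '1-5m', '1-5m', '6-20m', '6-20m', '20m+']
```

Because the intervals are closed on the right, a distance of exactly 5 m falls into "1-5m" and exactly 20 m into "6-20m".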

Whether running, riding or swimming, most players cover only short distances.
However, the charts show a positive correlation between distance and winPlacePerc.
In the game, the active area keeps shrinking, so players need to keep moving, and they must also find good spots to ambush or shoot others. That may be why more movement goes with a better winPlacePerc.