# Table of contents

[0.Disclaimer](#0)

[I. Define the problem](#I)
* [1. Problem description](#I1)
* [2. Methodology](#I2)
* [3. Tools importing](#I3)

[II. Gather the data](#II)

[III. Wrangle, cleanse and Prepare Data for Consumption](#III)
* [0. The 4 C's of Data Cleaning: Correcting, Completing, Creating, and Converting](#III0)
* [1. Saving time and memory with big datasets](#III1)
* [2. Descriptive analysis of the data](#III2)
* [3. Dropping features](#III3)
* [4. Dropping irrelevant or marginal data](#III4)
* [5. Completing missing values for each feature](#III5)
* [6. Grouping features by group/match and getting their mean/size/min/max](#III6)
* [7. Creating new potentially relevant values for feature engineering](#III7)

[IV. Perform Exploratory Analysis and visualize the data](#IV)
* [1. Visualize relation between each feature and the mean of the target](#IV1)
* [2. Feature selection](#IV2)

[V. Model data, attempt n°1: LASSO Regression (Outdated)](#V)
* [1. Initial thoughts](#V1)
* [2. Creating a new array with polynomial features](#V2)
* [3. Lasso regression analysis method](#V3)
* [4. Forward search for feature selection](#V4)  

[VI. Model data, attempt n°2: Random Forest (Outdated)](#VI)
* [1. Creating and normalizing matrices for our model](#VI1)
* [2. Modeling](#VI2)

[VII. Model data, attempt n°3: Light GBM](#VII)
* [1. Creating and normalizing matrices for our model](#VII1)
* [2. Modeling](#VII2)

[VIII. Model Submission](#VIII)

## **<div id="0">0. Disclaimer</div>**

If you want to see the clean and short version of my work, **go here : https://www.kaggle.com/toldo171/pubg-top-35-with-lgbm**
This kernel contains my whole thinking process (thus it is also a mess :-) ).

## **<div id="I">I. Define the problem</div>**

### **<div id="I1">1. Problem description</div>**

Battle Royale-style video games have taken the world by storm. 100 players are dropped onto an island empty-handed and must explore, scavenge, and eliminate other players until only one is left standing, all while the play zone continues to shrink.

PlayerUnknown's BattleGrounds (PUBG) has enjoyed massive popularity. With over 50 million copies sold, it's the fifth best selling game of all time, and has millions of active monthly players.

The team at PUBG has made official game data available for the public to explore and scavenge outside of "The Blue Circle." This competition is not an official or affiliated PUBG site - Kaggle collected data made possible through the PUBG Developer API.

You are given over 65,000 games' worth of anonymized player data, split into training and testing sets, and asked to predict final placement from final in-game stats and initial player ratings.

What's the best strategy to win in PUBG? Should you sit in one spot and hide your way into victory, or do you need to be the top shot? Let's let the data do the talking!

In a PUBG game, up to 100 players start in each match (matchId). Players can be on teams (groupId) which get ranked at the end of the game (winPlacePerc) based on how many other teams are still alive when they are eliminated. In game, players can pick up different munitions, revive downed-but-not-out (knocked) teammates, drive vehicles, swim, run, shoot, and experience all of the consequences -- such as falling too far or running themselves over and eliminating themselves.

You are provided with a large number of anonymized PUBG game stats, formatted so that each row contains one player's post-game stats. The data comes from matches of all types: solos, duos, squads, and custom; there is no guarantee of there being 100 players per match, nor at most 4 players per group.

You must create a model which predicts players' finishing placement based on their final stats, on a scale from 1 (first place) to 0 (last place). 

#### **Data fields**

* **DBNOs** - Number of enemy players knocked.
* **assists** - Number of enemy players this player damaged that were killed by teammates.
* **boosts** - Number of boost items used.
* **damageDealt** - Total damage dealt. Note: Self inflicted damage is subtracted.
* **headshotKills** - Number of enemy players killed with headshots.
* **heals** - Number of healing items used.
* **Id** - Player’s Id
* **killPlace** - Ranking in match of number of enemy players killed.
* **killPoints** - Kills-based external ranking of player. (Think of this as an Elo ranking where only kills matter.) If there is a value other than -1 in rankPoints, then any 0 in killPoints should be treated as a “None”.
* **killStreaks** - Max number of enemy players killed in a short amount of time.
* **kills** - Number of enemy players killed.
* **longestKill** - Longest distance between player and player killed at time of death. This may be misleading, as downing a player and driving away may lead to a large longestKill stat.
* **matchDuration** - Duration of match in seconds.
* **matchId** - ID to identify match. There are no matches that are in both the training and testing set.
* **matchType** - String identifying the game mode that the data comes from. The standard modes are “solo”, “duo”, “squad”, “solo-fpp”, “duo-fpp”, and “squad-fpp”; other modes are from events or custom matches.
* **rankPoints** - Elo-like ranking of player. This ranking is inconsistent and is being deprecated in the API’s next version, so use with caution. Value of -1 takes place of “None”.
* **revives** - Number of times this player revived teammates.
* **rideDistance** - Total distance traveled in vehicles measured in meters.
* **roadKills** - Number of kills while in a vehicle.
* **swimDistance** - Total distance traveled by swimming measured in meters.
* **teamKills** - Number of times this player killed a teammate.
* **vehicleDestroys** - Number of vehicles destroyed.
* **walkDistance** - Total distance traveled on foot measured in meters.
* **weaponsAcquired** - Number of weapons picked up.
* **winPoints** - Win-based external ranking of player. (Think of this as an Elo ranking where only winning matters.) If there is a value other than -1 in rankPoints, then any 0 in winPoints should be treated as a “None”.
* **groupId** - ID to identify a group within a match. If the same group of players plays in different matches, they will have a different groupId each time.
* **numGroups** - Number of groups we have data for in the match.
* **maxPlace** - Worst placement we have data for in the match. This may not match with numGroups, as sometimes the data skips over placements.
* **winPlacePerc** - The target of prediction. This is a percentile winning placement, where 1 corresponds to 1st place, and 0 corresponds to last place in the match. It is calculated off of maxPlace, not numGroups, so it is possible to have missing chunks in a match.
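
To make the target concrete, here is a minimal sketch of how a percentile placement could be derived from a final rank. The formula `(maxPlace - place) / (maxPlace - 1)` is an assumption that is consistent with the description above (1 for first place, 0 for last, computed from maxPlace); it is not confirmed by the competition documentation, and the helper name `win_place_perc` is hypothetical.

```python
import math


def win_place_perc(place, max_place):
    """Hypothetical helper: map a final rank to a [0, 1] percentile (assumed formula)."""
    if max_place <= 1:
        # With a single placement the formula would be 0/0, so no percentile is defined.
        return math.nan
    return (max_place - place) / (max_place - 1)


print(win_place_perc(1, 50))   # first place out of 50
print(win_place_perc(50, 50))  # last place out of 50
print(win_place_perc(25, 50))  # somewhere in the middle
```

Under this assumption, two matches with different maxPlace values put the same rank at different percentiles, which is one reason maxPlace is worth keeping as a feature.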


### **<div id="I2">2. Methodology</div>**

* **1. Define the Problem**: If data science, big data, machine learning, predictive analytics, business intelligence, or any other buzzword is the solution, then what is the problem? As the saying goes, don't put the cart before the horse. Problems before requirements, requirements before solutions, solutions before design, and design before technology. Too often we are quick to jump on the new shiny technology, tool, or algorithm before determining the actual problem we are trying to solve.
* **2. Gather the Data**: John Naisbitt wrote in his 1984 (yes, 1984) book Megatrends, we are “drowning in data, yet starving for knowledge." So, chances are, the dataset(s) already exist somewhere, in some format. It may be external or internal, structured or unstructured, static or streamed, objective or subjective, etc. As the saying goes, you don't have to reinvent the wheel, you just have to know where to find it. In the next step, we will worry about transforming "dirty data" to "clean data."
* **3. Prepare Data for Consumption**: This step is often referred to as data wrangling, a required process to turn “wild” data into “manageable” data. Data wrangling includes implementing data architectures for storage and processing, developing data governance standards for quality and control, data extraction (i.e. ETL and web scraping), and data cleaning to identify aberrant, missing, or outlier data points.
* **4. Perform Exploratory Analysis**: Anybody who has ever worked with data knows, garbage-in, garbage-out (GIGO). Therefore, it is important to deploy descriptive and graphical statistics to look for potential problems, patterns, classifications, correlations and comparisons in the dataset. In addition, data categorization (i.e. qualitative vs quantitative) is also important to understand and select the correct hypothesis test or data model.
* **5. Model Data**: Like descriptive and inferential statistics, data modeling can either summarize the data or predict future outcomes. Your dataset and expected results will determine the algorithms available for use. It's important to remember that algorithms are tools and not magic wands or silver bullets. You must still be the master crafts(wo)man who knows how to select the right tool for the job. An analogy would be asking someone to hand you a Phillips screwdriver, and they hand you a flathead screwdriver or, worse, a hammer. At best, it shows a complete lack of understanding. At worst, it makes completing the project impossible. The same is true in data modelling. The wrong model can lead to poor performance at best and the wrong conclusion (that’s used as actionable intelligence) at worst.
* **6. Validate and Implement Data Model**: After you've trained your model on a subset of your data, it's time to test it. This helps ensure you haven't overfit your model, or made it so specific to the selected subset that it does not accurately fit another subset from the same dataset. In this step we determine whether our model overfits, generalizes, or underfits our dataset.
* **7. Optimize and Strategize**: This is the "bionic man" step, where you iterate back through the process to make it better...stronger...faster than it was before. As a data scientist, your strategy should be to outsource developer operations and application plumbing, so you have more time to focus on recommendations and design. Once you're able to package your ideas, this becomes your “currency exchange" rate.

### **<div id="I3">3. Tools importing</div>**

Here we are importing every useful tool needed during our research process.

In [None]:
# Data analysis and wrangling
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Machine learning
import sklearn as skl
import lightgbm as lgb
from sklearn import preprocessing
from sklearn import model_selection
from sklearn.metrics import mean_absolute_error
#from sklearn.preprocessing import PolynomialFeatures
#from sklearn.linear_model import Lasso
#from sklearn.linear_model import LinearRegression
#from sklearn.metrics import normalized_mutual_info_score
#from sklearn.ensemble import RandomForestRegressor
#from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# File handling
import os
import gc
gc.enable()
print(os.listdir("../input"))

## **<div id="II">II. Gather the data</div>**

We start by acquiring the training and testing datasets into Pandas DataFrames.

In [None]:
training_df = pd.read_csv("../input/train_V2.csv")
testing_df = pd.read_csv("../input/test_V2.csv")

## **<div id="III">III. Wrangle, cleanse and prepare Data for Consumption</div>**

### **<div id="III0">0. Reminder: the 4 C's of data cleaning</div>**

* **Correcting**: Review the data for aberrant or non-acceptable inputs and for potential outliers. In this dataset, a few features (kills, DBNOs, weaponsAcquired) contain extreme values that are handled in section 4 below. Clearly unreasonable values are probably safe to fix or remove right away. However, we want to use caution when we modify data from its original value, because it may be necessary to create an accurate model.
* **Completing**: Missing values can be bad, because some algorithms don't know how to handle null values and will fail, while others, like some tree-based implementations, can handle them. Thus, it's important to fix them before we start modeling, because we will compare and contrast several models. There are two common methods: either delete the record or populate the missing value using a reasonable input. It is not recommended to delete records, especially a large percentage of them, unless they truly represent incomplete records. Instead, it's best to impute missing values. A basic methodology for qualitative data is to impute using the mode. A basic methodology for quantitative data is to impute using the mean, the median, or the mean + a randomized standard deviation. An intermediate methodology is to apply the basic methodology within specific groups, like the average of a feature within each category of another feature. There are more complex methodologies; however, before deploying one, it should be compared to the base model to determine whether the complexity truly adds value. Here there is only one missing value (in winPlacePerc), so we simply delete that record in section 5.
* **Creating**: Feature engineering is when we use existing features to create new features to determine if they provide new signals to predict our outcome.
* **Converting**: Last, but certainly not least, we'll deal with formatting. There are no date or currency formats, but datatype formats. Our categorical data imported as objects, which makes it difficult for mathematical calculations.

### **<div id="III1">1. Saving time and memory with big datasets</div>**

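The dataset is pretty big. A script that shrinks the dataframes' memory footprint can save us a lot of time. Integer columns are downcast without loss, but float columns may end up stored as float16, which keeps only about three significant decimal digits, so a little precision is traded for memory. I did not create this script; the credit goes to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage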

This script uses the following approach:
* Iterate over every column
* Determine if the column is numeric
* Determine if the column can be represented by an integer
* Find the min and the max value
* Determine and apply the smallest datatype that can fit the range of values


In [None]:
# Memory saving function credit to https://www.kaggle.com/gemartin/load-data-reduce-memory-usage
def reduce_mem_usage(df):
    """ iterate through all the columns of a dataframe and modify the data type
        to reduce memory usage.        
    """
    start_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage of dataframe is {:.2f} MB'.format(start_mem))

    for col in df.columns:
        col_type = df[col].dtype

        if col_type != object:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float16).min and c_max < np.finfo(np.float16).max:
                    df[col] = df[col].astype(np.float16)
                elif c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)

    end_mem = df.memory_usage().sum() / 1024**2
    print('Memory usage after optimization is: {:.2f} MB'.format(end_mem))
    print('Decreased by {:.1f}%'.format(100 * (start_mem - end_mem) / start_mem))

    return df

Now let's apply the script to our training and testing dataframes. (The actual call is made at the end of section 6, once the match and group features have been added.)
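
As a small illustration of the idea behind the script, the sketch below downcasts a toy dataframe with `pd.to_numeric(..., downcast=...)`. This is a swapped-in pandas facility, not the credited script, and the toy column values are made up. It also shows why float16 deserves caution: a value like 1523.5 cannot be stored exactly at that precision.

```python
import numpy as np
import pandas as pd

# Toy data (made up) using the default 64-bit dtypes.
toy_df = pd.DataFrame({
    "kills": np.array([0, 3, 12], dtype=np.int64),
    "walkDistance": np.array([0.0, 1523.5, 244.8], dtype=np.float64),
})
before = toy_df.memory_usage(deep=True).sum()

# Let pandas pick the smallest integer type, and float32 for floats.
toy_df["kills"] = pd.to_numeric(toy_df["kills"], downcast="integer")
toy_df["walkDistance"] = pd.to_numeric(toy_df["walkDistance"], downcast="float")
after = toy_df.memory_usage(deep=True).sum()

print(toy_df.dtypes)
print(before, "->", after, "bytes")

# float16 rounds values of this magnitude to whole numbers.
print(float(np.float16(1523.5)))
```

`pd.to_numeric` never goes below float32 for floats, which is a more conservative choice than the float16 branch of the script above.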

### **<div id="III2">2. Descriptive analysis of the data</div>**

Let's:
* display the first rows of the training dataset, in order to have an overview of the parameters,
* display the data type of each feature,
* display a quick descriptive representation of the dataset

In [None]:
training_df.head()

In [None]:
training_df.info(verbose=True)
print('_'*40)
testing_df.info(verbose=True)


In [None]:
training_df.describe()

### **<div id="III3">3. Dropping features</div>**

First of all, we want to drop some features which may not be relevant for our problem:
* **Id**: a per-row player identifier. In the future, we may want to treat our data set as a "cluster" of many games (grouping by matches or by player ID), but for this first iteration let's just drop it. (**matchId** and **groupId** are kept for now: they are needed to build the match and group features in section 6, and are dropped right after.)
* **longestKill, rankPoints**: the description flags longestKill as potentially misleading and rankPoints as inconsistent, so let's just drop them.
* **numGroups**: this feature is very similar to maxPlace. Let's drop it.
* **matchType**: this feature is somewhat redundant with maxPlace, since for example, in a duo-type game, maxPlace ~ 50. Let's drop it.

In [None]:
training_df = training_df.drop(['Id', 'longestKill', 'rankPoints', 'numGroups', 'matchType'], axis=1)
testing_df = testing_df.drop(['Id', 'longestKill', 'rankPoints', 'numGroups', 'matchType'], axis=1)

### **<div id="III4">4. Dropping irrelevant or marginal data</div>**

In PUBG, rows with more than 20 kills represent less than 0.01% of the data. Basically, if you score more than 20 kills in a game, you are either the best in the world or a cheater. Let's remove rows with more than 20 kills.

In [None]:
training_df['kills'].value_counts().tail(10)

In [None]:
training_df = training_df.drop(training_df[training_df.kills > 20].index)
training_df['kills'].value_counts()

Let's go through the same process for some other features : 
* **weaponsAcquired** > 30
* **DBNOs** > 15

In [None]:
training_df = training_df.drop(training_df[training_df.DBNOs > 15].index)
training_df['DBNOs'].value_counts()

In [None]:
training_df = training_df.drop(training_df[training_df.weaponsAcquired > 30].index)
training_df['weaponsAcquired'].value_counts()

In [None]:
training_df.info(verbose=True)
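
The three filters above follow the same pattern, so they can also be written as one pass over a dictionary of thresholds. This is only a sketch on a made-up toy dataframe; the helper name `drop_above` is hypothetical, and the thresholds are the ones chosen in this section.

```python
import pandas as pd


def drop_above(df, thresholds):
    """Hypothetical helper: keep rows where every listed column is <= its threshold."""
    keep = pd.Series(True, index=df.index)
    for column, limit in thresholds.items():
        keep &= df[column] <= limit
    return df[keep]


# Toy data (made up): rows 1, 2 and 3 each break exactly one threshold.
toy_df = pd.DataFrame({
    "kills": [0, 5, 25, 2],
    "DBNOs": [0, 16, 1, 3],
    "weaponsAcquired": [3, 4, 5, 31],
})

# Share of rows that would be removed by the kills filter alone.
print((toy_df["kills"] > 20).mean())

filtered = drop_above(toy_df, {"kills": 20, "DBNOs": 15, "weaponsAcquired": 30})
print(filtered)
```

Checking the share of affected rows first, as in the `value_counts()` cell above, helps confirm that a threshold only removes marginal data.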

### **<div id="III5">5. Completing missing values for each feature</div>**

There is only one missing value in the entire dataset (winPlacePerc column). Here, we choose to remove its associated row.

In [None]:
training_df.isnull().sum()
training_df = training_df.dropna(how='any',axis=0) 
training_df.winPlacePerc.isnull().sum()

### **<div id="III6">6. Grouping features by group/match and getting their mean/size/min/max</div>**

The match and group features seem interesting. One way to extract information from them is to compute the mean of each feature per match, the size of each match and of each group, and the max/min of each feature per group together with its rank within the match.

In [None]:
features = list(training_df.columns)
features.remove("matchId")
features.remove("groupId")
features.remove("winPlacePerc")

#Getting match mean features
print("get match mean feature")
match_mean = training_df.groupby(['matchId'])[features].agg('mean').reset_index()
training_df = training_df.merge(match_mean, suffixes=["", "_match_mean"], how='left', on=['matchId'])
match_mean = testing_df.groupby(['matchId'])[features].agg('mean').reset_index()
testing_df = testing_df.merge(match_mean, suffixes=["", "_match_mean"], how='left', on=['matchId'])

#Getting match size features
print("get match size feature")
match_size = training_df.groupby(['matchId']).size().reset_index(name='match_size')
training_df = training_df.merge(match_size, how='left', on=['matchId'])
match_size = testing_df.groupby(['matchId']).size().reset_index(name='match_size')
testing_df = testing_df.merge(match_size, how='left', on=['matchId'])

del match_mean, match_size
gc.collect()

In [None]:
print("get group size feature")
group_size = training_df.groupby(['matchId','groupId']).size().reset_index(name='group_size')
training_df = training_df.merge(group_size, how='left', on=['matchId', 'groupId'])
group_size = testing_df.groupby(['matchId','groupId']).size().reset_index(name='group_size')
testing_df = testing_df.merge(group_size, how='left', on=['matchId', 'groupId'])

#print("get group mean feature")
#group_mean = training_df.groupby(['matchId','groupId'])[features].agg('mean')
#group_mean_rank = group_mean.groupby('matchId')[features].rank(pct=True).reset_index()
#training_df = training_df.merge(group_mean.reset_index(), suffixes=["", "_mean"], how='left', on=['matchId', 'groupId'])
#training_df = training_df.merge(group_mean_rank, suffixes=["", "_mean_rank"], how='left', on=['matchId', 'groupId'])
#group_mean = testing_df.groupby(['matchId','groupId'])[features].agg('mean')
#group_mean_rank = group_mean.groupby('matchId')[features].rank(pct=True).reset_index()
#testing_df = testing_df.merge(group_mean.reset_index(), suffixes=["", "_mean"], how='left', on=['matchId', 'groupId'])
#testing_df = testing_df.merge(group_mean_rank, suffixes=["", "_mean_rank"], how='left', on=['matchId', 'groupId'])

print("get group max feature")
group_max = training_df.groupby(['matchId','groupId'])[features].agg('max')
group_max_rank = group_max.groupby('matchId')[features].rank(pct=True).reset_index()
training_df = training_df.merge(group_max.reset_index(), suffixes=["", "_max"], how='left', on=['matchId', 'groupId'])
training_df = training_df.merge(group_max_rank, suffixes=["", "_max_rank"], how='left', on=['matchId', 'groupId'])
group_max = testing_df.groupby(['matchId','groupId'])[features].agg('max')
group_max_rank = group_max.groupby('matchId')[features].rank(pct=True).reset_index()
testing_df = testing_df.merge(group_max.reset_index(), suffixes=["", "_max"], how='left', on=['matchId', 'groupId'])
testing_df = testing_df.merge(group_max_rank, suffixes=["", "_max_rank"], how='left', on=['matchId', 'groupId'])

print("get group min feature")
group_min = training_df.groupby(['matchId','groupId'])[features].agg('min')
group_min_rank = group_min.groupby('matchId')[features].rank(pct=True).reset_index()
training_df = training_df.merge(group_min.reset_index(), suffixes=["", "_min"], how='left', on=['matchId', 'groupId'])
training_df = training_df.merge(group_min_rank, suffixes=["", "_min_rank"], how='left', on=['matchId', 'groupId'])
group_min = testing_df.groupby(['matchId','groupId'])[features].agg('min')
group_min_rank = group_min.groupby('matchId')[features].rank(pct=True).reset_index()
testing_df = testing_df.merge(group_min.reset_index(), suffixes=["", "_min"], how='left', on=['matchId', 'groupId'])
testing_df = testing_df.merge(group_min_rank, suffixes=["", "_min_rank"], how='left', on=['matchId', 'groupId'])

del group_size, group_max, group_max_rank, group_min, group_min_rank
gc.collect()
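
To see what the `_max` and `_max_rank` columns contain, here is the same aggregate-then-merge pattern on a made-up toy dataframe with a single feature. The toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data (made up): match m1 has three groups, match m2 has two.
toy_df = pd.DataFrame({
    "matchId": ["m1", "m1", "m1", "m1", "m2", "m2"],
    "groupId": ["g1", "g1", "g2", "g3", "g4", "g5"],
    "kills": [1, 3, 0, 2, 5, 1],
})
toy_features = ["kills"]

# Best value of each feature inside every group.
group_max = toy_df.groupby(["matchId", "groupId"])[toy_features].agg("max")
# Percentile rank of that group value among the groups of the same match.
group_max_rank = group_max.groupby("matchId")[toy_features].rank(pct=True).reset_index()

toy_df = toy_df.merge(group_max.reset_index(), suffixes=["", "_max"],
                      how="left", on=["matchId", "groupId"])
toy_df = toy_df.merge(group_max_rank, suffixes=["", "_max_rank"],
                      how="left", on=["matchId", "groupId"])
print(toy_df)
```

Every player of a group receives the same `kills_max` and `kills_max_rank`, which matches the fact that placement is decided per group rather than per player.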

In [None]:
#print("get group mean feature")
#group_mean = training_df.groupby(['matchId','groupId'])[features].agg('mean')
#group_mean_rank = group_mean.groupby('matchId')[features].rank(pct=True).reset_index()
#training_df = training_df.merge(group_mean.reset_index(), suffixes=["", "_mean"], how='left', on=['matchId', 'groupId'])
#training_df = training_df.merge(group_mean_rank, suffixes=["", "_mean_rank"], how='left', on=['matchId', 'groupId'])

In [None]:
#training_df = reduce_mem_usage(training_df)
#print('_'*40)
#testing_df = reduce_mem_usage(testing_df)

In [None]:
#group_mean = testing_df.groupby(['matchId','groupId'])[features].agg('mean')
#group_mean_rank = group_mean.groupby('matchId')[features].rank(pct=True).reset_index()
#testing_df = testing_df.merge(group_mean.reset_index(), suffixes=["", "_mean"], how='left', on=['matchId', 'groupId'])
#testing_df = testing_df.merge(group_mean_rank, suffixes=["", "_mean_rank"], how='left', on=['matchId', 'groupId'])

#We don't need matchId and groupId anymore
training_df.drop(["matchId", "groupId"], axis=1, inplace=True)
testing_df.drop(["matchId", "groupId"], axis=1, inplace=True)

In [None]:
training_df = reduce_mem_usage(training_df)
print('_'*40)
testing_df = reduce_mem_usage(testing_df)

### **<div id="III7">7. Creating new potentially relevant values for feature engineering</div>**

Here we want to create new features to determine if they can potentially provide new signals to predict our outcome. For this dataset, a few ideas come to mind:
* **headshotRate** = headshotKills / kills
* **totalDistance** = rideDistance + swimDistance + walkDistance
* **items** = heals + boosts
* **healsPerWalkDistance** = heals / walkDistance
* **killsPerWalkDistance** = kills / walkDistance

In [None]:
training_df['headshotRate'] = training_df['headshotKills'] / training_df['kills']
training_df['headshotRate'].fillna(0, inplace=True)
training_df['headshotRate'].replace(np.inf, 0, inplace=True)
testing_df['headshotRate'] = testing_df['headshotKills'] / testing_df['kills']  # divide by the test set's own kills
testing_df['headshotRate'].fillna(0, inplace=True)
testing_df['headshotRate'].replace(np.inf, 0, inplace=True)

training_df['totalDistance'] = training_df['rideDistance'] + training_df['swimDistance'] + training_df['walkDistance']
testing_df['totalDistance'] = testing_df['rideDistance'] + testing_df['swimDistance'] + testing_df['walkDistance']

training_df['items'] = training_df['heals'] + training_df['boosts']
testing_df['items'] = testing_df['heals'] + testing_df['boosts']

training_df['healsPerWalkDistance'] = training_df['heals'] / training_df['walkDistance']
training_df['healsPerWalkDistance'].fillna(0, inplace=True)
training_df['healsPerWalkDistance'].replace(np.inf, 0, inplace=True)
testing_df['healsPerWalkDistance'] = testing_df['heals'] / testing_df['walkDistance']
testing_df['healsPerWalkDistance'].fillna(0, inplace=True)
testing_df['healsPerWalkDistance'].replace(np.inf, 0, inplace=True)

training_df['killsPerWalkDistance'] = training_df['kills'] / training_df['walkDistance']
training_df['killsPerWalkDistance'].fillna(0, inplace=True)
training_df['killsPerWalkDistance'].replace(np.inf, 0, inplace=True)
testing_df['killsPerWalkDistance'] = testing_df['kills'] / testing_df['walkDistance']
testing_df['killsPerWalkDistance'].fillna(0, inplace=True)
testing_df['killsPerWalkDistance'].replace(np.inf, 0, inplace=True)

training_df.head()
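
The ratio features above all repeat the same three steps (divide, replace infinities, fill missing values). A small helper keeps the training and testing code symmetric and makes it harder to mix up the two dataframes. The name `safe_ratio` is hypothetical, and the toy series are made up.

```python
import numpy as np
import pandas as pd


def safe_ratio(numerator, denominator):
    """Hypothetical helper: element-wise ratio with 0 wherever the result is undefined."""
    ratio = numerator / denominator
    # x/0 gives +/-inf and 0/0 gives NaN; both are mapped to 0 as in the cells above.
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)


# Toy data (made up): one normal row, one 0/0 row, one x/0 row.
headshot_kills = pd.Series([1, 0, 2])
kills = pd.Series([2, 0, 0])
print(safe_ratio(headshot_kills, kills).tolist())
```

With such a helper, each feature becomes one line per dataframe, for example `df['headshotRate'] = safe_ratio(df['headshotKills'], df['kills'])`.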

## **<div id="IV">IV. Perform Exploratory Analysis and visualize the data</div>**

### **<div id="IV1">1. Visualize relation between each feature and the mean of the target</div>**

First of all, I want to see how each feature is related to the target value. For each feature, I draw a scatter plot of the mean target value against the feature value.

In [None]:
# every feature with low number of discrete values (<100). 
feature_comparison = [
    'assists',
    'boosts',
    'DBNOs',
    'headshotKills',
    'heals',
    'killPlace',
    'kills',
    'killStreaks',
    'maxPlace',
    'revives',
    'roadKills',
    'teamKills',
    'vehicleDestroys',
    'weaponsAcquired',
    'items'
    ]

#We will store every comparison table in this list
table_comparison = []
row_axis = 0
column_axis = 0

#graph individual features
fig, saxis = plt.subplots(4, 4,figsize=(16,12))

#Creating the comparison dataframes with two columns : feature, and the mean of the winplace percentage
for feature in feature_comparison:
    table_comparison.append(training_df[[feature, 'winPlacePerc']].groupby([feature], as_index=False).mean().sort_values(by=feature, ascending=True))    

#Plotting the win place percentage as a function of each feature
for table in table_comparison: 
    sns.scatterplot(x = table.iloc[:,0], y = table.winPlacePerc, ax = saxis[row_axis,column_axis])
    row_axis += 1
    if row_axis > 3:
        row_axis = 0
        column_axis += 1

In [None]:
# every feature with continuous value. 
feature_comparison_2 = [
    'damageDealt',
    'killPoints',
    'rideDistance',
    'swimDistance',
    'walkDistance',
    'winPoints',
    'headshotRate',
    'totalDistance',
    'healsPerWalkDistance',
    'killsPerWalkDistance'
    ]

#We will store every comparison table in this list
table_comparison_2 = []
row_axis = 0
column_axis = 0

#graph individual features
fig, saxis = plt.subplots(4, 3,figsize=(16,12))

#Creating the comparison dataframes with two columns : feature, and the mean of the winplace percentage
for feature in feature_comparison_2:
    table_comparison_2.append(training_df[[feature, 'winPlacePerc']].groupby([feature], as_index=False).mean().sort_values(by=feature, ascending=True))  
    table_comparison_2[-1][feature + '_binned'] = pd.cut(table_comparison_2[-1][feature], bins = 100, labels=False)
    table_comparison_2[-1] = table_comparison_2[-1].groupby([feature + '_binned'], as_index=False).mean().sort_values(by=feature + '_binned', ascending=True)

#Plotting the win place percentage as a function of each feature
for table in table_comparison_2: 
    sns.scatterplot(x = table.iloc[:,1], y = table.winPlacePerc, ax = saxis[row_axis,column_axis])
    row_axis += 1
    if row_axis > 3:
        row_axis = 0
        column_axis += 1
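
For continuous features, the key step is the binning that happens before plotting. The sketch below isolates it on synthetic data (the relation between `walkDistance` and `winPlacePerc` here is made up). Unlike the cell above, it bins the rows directly, so each bin mean is weighted by the number of rows it contains; that is a slight variation, not the exact computation used in this kernel.

```python
import numpy as np
import pandas as pd

# Synthetic data (made up): the target grows with distance, plus noise.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({"walkDistance": rng.uniform(0, 4000, size=1000)})
noise = rng.normal(0, 0.1, size=1000)
toy_df["winPlacePerc"] = np.clip(toy_df["walkDistance"] / 4000 + noise, 0, 1)

# Cut the feature range into 10 equal-width bins, identified by integers.
toy_df["walkDistance_binned"] = pd.cut(toy_df["walkDistance"], bins=10, labels=False)

# One row per bin: mean feature value and mean target value.
binned = toy_df.groupby("walkDistance_binned", as_index=False)[
    ["walkDistance", "winPlacePerc"]
].mean()
print(binned)
```

Plotting `binned["walkDistance"]` against `binned["winPlacePerc"]` then gives one point per bin instead of one point per distinct feature value.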

### **<div id="IV2">2. Feature selection</div>**

#### **a. Mutual information**

Feature selection is a critical component of a data scientist’s workflow. When presented with very high-dimensional data, models usually struggle because:
* Training time grows with the number of features (for some algorithms much faster than linearly).
* The risk of overfitting increases with the number of features.

Feature selection methods help with these problems by reducing the number of dimensions without much loss of total information. They also help make sense of the features and their importance.

Here we are going to use a **filter method, mutual information**, to select our features.

**Mutual information** between two variables measures how much knowing one variable tells us about the other. If X and Y are two variables:
* If X and Y are independent, then no information about Y can be obtained by knowing X or vice versa. Hence their mutual information is 0.
* If X is a deterministic, invertible function of Y, then we can determine X from Y and Y from X, and their normalized mutual information is 1.
   
We can then select our features from feature space by ranking their mutual information with the target variable.

An advantage of mutual information is that it also captures non-linear relationships between a feature and the target variable.
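
As a rough illustration of these properties, the sketch below estimates mutual information from a 2D histogram with NumPy. This is a simple plug-in estimator chosen for the example, not the `normalized_mutual_info_score` call used in this kernel; the function name, the bin count and the synthetic data are all assumptions.

```python
import numpy as np


def mutual_information(x, y, bins=20):
    """Histogram-based estimate of the mutual information (in nats) between x and y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()              # joint distribution over the bins
    px = pxy.sum(axis=1, keepdims=True)    # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)    # marginal of y, shape (1, bins)
    nonzero = pxy > 0                      # empty cells contribute nothing
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))


# Synthetic data (made up).
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=5000)
noise = rng.uniform(-1, 1, size=5000)

mi_dependent = mutual_information(x, x ** 2)   # non-linear but deterministic relation
mi_independent = mutual_information(x, noise)  # no relation at all
print(mi_dependent, mi_independent)
```

The estimate for the independent pair is small but not exactly 0, because a histogram estimator has a positive bias on finite samples.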

In [None]:
feature_MI = [
    'assists',
    'boosts',
    'damageDealt',
    'DBNOs',
    'headshotKills',
    'heals',
    'killPlace',
    'killPoints',
    'kills',
    'killStreaks',
    'matchDuration',
    'maxPlace',
    'revives',
    'rideDistance',
    'roadKills',
    'swimDistance',
    'teamKills',
    'vehicleDestroys',
    'walkDistance',
    'weaponsAcquired',
    'winPoints',
    'winPlacePerc',
    'headshotRate',
    'totalDistance',
    'items',
    'healsPerWalkDistance',
    'killsPerWalkDistance'
    ]

mutual_info_df = training_df.truncate(after=-1)

#for feature in feature_MI:
    #mutual_info_df.loc[feature] = pd.Series([np.nan])

#for feature1 in feature_MI:
    #for feature2 in feature_MI:
        #mutual_info = normalized_mutual_info_score(training_df[feature1], training_df[feature2], average_method='arithmetic')
        #if mutual_info == 1:
            #print('OK')
        #mutual_info_df[feature1][feature2] = mutual_info

In [None]:
#plt.figure(figsize=(9,7))
#sns.heatmap(
#    mutual_info_df,
#    xticklabels=mutual_info_df.columns.values,
#    yticklabels=mutual_info_df.columns.values,
#    linecolor='white',
#    linewidths=0.1,
#    cmap="RdBu"
#)
#plt.show()

Now we can plot the mutual information between each feature and the target, and see which features are the most dependent on the target value.

In [None]:
#mutual_info_target_df = abs(mutual_info_df[['winPlacePerc']])
#mutual_info_target_df = mutual_info_target_df.drop(['winPlacePerc'])
#mutual_info_target_df['feature'] = mutual_info_target_df.index

#plt.figure(figsize=(10, 6))
#sns.barplot(x='winPlacePerc', y='feature', data=mutual_info_target_df.sort_values(by="winPlacePerc", ascending=False))
#plt.title('Mutual Information between each feature and the target value')
#plt.tight_layout()

According to **mutual information**, the features most dependent on the target are:
* maxPlace,
* killPlace,
* walkDistance,
* totalDistance, killsPerWalkDistance, healsPerWalkDistance, rideDistance (but they are obviously highly dependent on walkDistance),
* damageDealt,
* boosts,
* weaponsAcquired.


#### **b. Pearson correlation**

Computing mutual information is very time-consuming: it took about 30 minutes to build the mutual information dataframe. Let's do the same work with Pearson's correlation and compare the results, to see whether computing the mutual information dataframe is worth it.

**Pearson correlation** is a measure of the linear correlation between two variables X and Y. It has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation.
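As a quick illustration of the three reference values (+1, 0, −1), here is a minimal hand-written sketch of the Pearson coefficient. The `pearson` helper and the toy lists are assumptions made for the example; the notebook itself uses `training_df.corr()`.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data
x = [-2, -1, 0, 1, 2]
r_increasing = pearson(x, [2 * v + 1 for v in x])   # perfectly linear, increasing
r_decreasing = pearson(x, [-3 * v for v in x])      # perfectly linear, decreasing
r_square = pearson(x, [v ** 2 for v in x])          # fully dependent on x, but not linearly
print(r_increasing, r_decreasing, r_square)
```

The last case shows the limit of this measure: a variable can depend entirely on another one and still have no linear correlation with it.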

In [None]:
#corr_df = training_df.corr()

#plt.figure(figsize=(9,7))
#sns.heatmap(
    #corr_df,
    #xticklabels=corr_df.columns.values,
    #yticklabels=corr_df.columns.values,
    #linecolor='white',
    #linewidths=0.1,
    #cmap="RdBu"
#)
#plt.show()

In [None]:
#corr_target_df = abs(corr_df[['winPlacePerc']])
#corr_target_df = corr_target_df.drop(['winPlacePerc'])
#corr_target_df['feature'] = corr_target_df.index

#plt.figure(figsize=(10, 6))
#sns.barplot(x='winPlacePerc', y='feature', data=corr_target_df.sort_values(by="winPlacePerc", ascending=False))
#plt.title('Pearson Correlation between each feature and the target value')
#plt.tight_layout()

According to **Pearson correlation**, the features most correlated with the target are:
* walkDistance,
* killPlace,
* totalDistance (but it is obviously highly correlated to walkDistance),
* boosts,
* weaponsAcquired,
* items (but it is obviously highly correlated to boosts),
* damageDealt.

Except for *maxPlace*, the results are very similar to those obtained with mutual information. One interpretation of this similarity is that this dataset has no features that are highly dependent on the target without being linearly correlated with it.

#### **c. Conclusion**

Here are the features we will be using for our first model. Then, we may add more features and see how they impact the performance of the model (wrapper method):
* **walkDistance**,
* **killPlace**,
* **damageDealt**,
* **boosts**,
* **weaponsAcquired**.

## **<div id="V">V. Model data, attempt n°1: LASSO Regression (Outdated)</div>**

### **<div id="V1">1. Initial thoughts</div>**

Resource: https://towardsdatascience.com/machine-learning-with-python-easy-and-robust-method-to-fit-nonlinear-data-19e8a1ddbd49

Now that we have acquired, analyzed and prepared the data, we are ready to train a model and predict the required solution. There are many predictive modelling algorithms to choose from. We must understand the type of problem and the solution requirements to narrow the choice down to a few models that we can evaluate. Here we are performing supervised learning, and our problem is a regression problem.
Here we are facing two challenges:
* Our dataset is very large, thus we need our model to scale to big datasets,
* Our dataset is high-dimensional, and the relations between features are highly non-linear. One solution is to fit a model with polynomial terms, and potentially cross-coupled features. This leads us to a few questions:
    * How do we decide which polynomial degrees are necessary?
    * When do we stop if we incorporate 1st-degree, 2nd-degree, 3rd-degree terms one by one?
    * How do we decide whether any of the cross-coupled terms are important (for example, X1², X2³, X1.X2, X1².X3...)?

### **<div id="V2">2. Creating a new array with polynomial features</div>**

First, let's truncate our training dataframe, to speed up computing. Then, we want to create a new array based on our training dataframe, containing every polynomial feature up to the 2nd degree (for now). We allow every cross-coupled term. For degree 2 and 5 features, we will have 21 features in our new array:
* our 5 initial features,
* 5 squared features,
* 4+3+2+1 = 10 cross-coupled features,
* one last constant feature, where every feature is raised to the power zero.
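The count above can be checked by listing the monomials explicitly. This is a minimal sketch using only the standard library: the `polynomial_terms` helper is an assumption made for the illustration, while the notebook itself builds the array with scikit-learn's `PolynomialFeatures`.

```python
from itertools import combinations_with_replacement
from math import comb

features = ['walkDistance', 'killPlace', 'damageDealt', 'boosts', 'weaponsAcquired']

def polynomial_terms(names, degree):
    """List every monomial of total degree 0..degree as a tuple of feature names."""
    terms = []
    for d in range(degree + 1):
        terms.extend(combinations_with_replacement(names, d))
    return terms

terms = polynomial_terms(features, 2)
squares = [t for t in terms if len(t) == 2 and t[0] == t[1]]   # e.g. boosts * boosts
crosses = [t for t in terms if len(t) == 2 and t[0] != t[1]]   # e.g. boosts * killPlace
print(len(terms), len(squares), len(crosses))

# Closed form: C(n + degree, degree) terms for n features.
term_counts = {degree: comb(len(features) + degree, degree) for degree in range(1, 6)}
print(term_counts)
```

The closed form also shows how fast the array grows with the degree, which matters for the fit times measured later.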

In [None]:
#training_df_truncated = training_df.truncate(before=50000,after=60000)
#X_train_truncated = np.asarray(training_df_truncated[['walkDistance','killPlace', 'damageDealt', 'boosts', 'weaponsAcquired']])
#X_train_truncated = np.float32(X_train_truncated)
#X_train_truncated = PolynomialFeatures(2, interaction_only=False).fit_transform(X_train_truncated)
#X_train_truncated[0:5]

In [None]:
#y_train_truncated = np.asarray(training_df_truncated[['winPlacePerc']])
#y_train_truncated[0:5]

In [None]:
#print ('Train set truncated:', X_train_truncated.shape,y_train_truncated.shape)

### **<div id="V3">3. Lasso regression analysis method</div>**

LASSO is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the statistical model it produces. The basic idea is to penalize the model coefficients such that they don’t grow too big and overfit the data. Using LASSO regression, we are essentially eliminating the higher-order terms in the more complex models. 

**Basically, LASSO regression is similar to linear regression, but with an L1 penalty on the coefficients added to the loss function, which drives the coefficients of the least important terms to zero.**

Here, we want to evaluate the best model complexity (polynomial degree) for our LASSO regression model. Do we need 7th-degree terms to reach the best accuracy, or is 2nd degree enough? Let's see.
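To see how the penalty eliminates terms, here is a minimal sketch of LASSO fitted by cyclic coordinate descent on a tiny dataset. Everything in it is an assumption made for illustration (the helper names, the toy data, the absence of an intercept); the notebook itself uses scikit-learn's `Lasso`, and this sketch minimizes an objective of the same form, (1 / (2n)) * ||y - Xw||² + alpha * ||w||₁.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return exactly 0 if |value| <= threshold."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1, without intercept."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            scale = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / scale
    return w

# Hypothetical toy data: y depends strongly on the first column, weakly on the second.
X = [[-2, 1], [-1, -2], [0, 0], [1, 2], [2, -1]]
y = [3 * a + 0.2 * b for a, b in X]

w_unpenalized = lasso_coordinate_descent(X, y, alpha=0.0)   # plain least squares
w_lasso = lasso_coordinate_descent(X, y, alpha=1.0)         # with the L1 penalty
print(w_unpenalized, w_lasso)
```

With the penalty, the weak coefficient is set exactly to zero and the strong one is shrunk, which is the behaviour we rely on to discard useless polynomial terms.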

In [None]:
#Split the model
cross_validation_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .3, train_size = .6, random_state = 0)
#Create dataframe to store results according to degree of polynomial features.
lasso_results = pd.DataFrame(data = {'degree': [], 'test_score_mean': [], 'fit_time_mean': []})
#lasso_results = pd.DataFrame(data = {'degree': [], 'test_score_mean': [], 'fit_time_mean': [], 'mean_absolute_error': []})

#Evaluate the model for different dataframes. Each step increases the degree of the PolynomialFeatures function and outputs the accuracy of the model. 
for degree in range (1,6):
    X_train_truncated = np.asarray(training_df_truncated[['walkDistance','killPlace', 'damageDealt', 'boosts', 'weaponsAcquired']])
    X_train_truncated = np.float32(X_train_truncated)
    X_train_truncated = PolynomialFeatures(degree, interaction_only=False).fit_transform(X_train_truncated)
    #Evaluate the model
    cross_validation_results = model_selection.cross_validate(Lasso(alpha = 0.00001, max_iter=10000, normalize=True), X_train_truncated, y_train_truncated, cv = cross_validation_split, return_train_score = True)
    #The line below is here if you want to see the effect of LASSO compared to a classic linear regression.
    #cross_validation_results = model_selection.cross_validate(LinearRegression(), X_train_truncated, y_train_truncated, cv = cross_validation_split, return_train_score = True)
    #Predicts the target value
    #y_hat_truncated = Lasso(alpha=0.00001, max_iter=10000, normalize=True).fit(X_train_truncated, y_train_truncated).predict(X_train_truncated)
    lasso_results = lasso_results.append({'degree' : degree,
                                          'test_score_mean' : cross_validation_results['test_score'].mean(), 
                                          'fit_time_mean' : cross_validation_results['fit_time'].mean()}, ignore_index=True) 
                                          #'mean_absolute_error' : mean_absolute_error(y_train_truncated, y_hat_truncated)}, ignore_index=True)
    print('OK degree ' + str(degree))
        
sns.pointplot(x = lasso_results.degree, y = lasso_results.test_score_mean)
#sns.pointplot(x = lasso_results.degree, y = lasso_results.fit_time_mean)

#This part was here to find a good value of alpha where the test_score converge.
#It showed that alpha = 0.00001 is a good value, in terms of convergence and fit time
#------------------------------------------------------------------------------------------------------
#lasso_results = pd.DataFrame(data = {'1 / alpha': [], 'test_score_mean': [], 'fit_time_mean': []})
#lasso_alpha = 1
#denominator = 1

#for i in range (1,7):
    #cross_validation_results = model_selection.cross_validate(Lasso(alpha = (lasso_alpha / denominator), max_iter=10000, normalize=True), X_train_truncated, y_train_truncated, cv = cross_validation_split, return_train_score = True)
    #lasso_results = lasso_results.append({'1 / alpha' : (denominator), 'test_score_mean' : cross_validation_results['test_score'].mean(), 'fit_time_mean' : cross_validation_results['fit_time'].mean()}, ignore_index=True)
    #i += 1
    #denominator *= 10
#------------------------------------------------------------------------------------------------------

**Results**: terms of 4th degree and higher don't seem to improve the accuracy of our model much. Let's stick to polynomial terms of 3rd degree and below.

### **<div id="V4">4. Forward search for feature selection</div>**

**Forward search** is a **wrapper method** for feature selection. It evaluates the model's performance with each candidate feature and adds the best candidates to the feature subset one after the other.

For data with n features:
* In the first round, ‘n’ models are created, each with a single feature, and the most predictive feature is selected.
* In the second round, ‘n-1’ models are created, each combining the previously selected feature with one of the remaining features.
* This is repeated until the best subset of ‘m’ features is selected.
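The rounds described above can be sketched as a small greedy loop. This is only an illustration under stated assumptions: `forward_search`, `toy_error` and the `gain` numbers are made up for the example, with `toy_error` standing in for "fit the model on this subset and return its mean absolute error".

```python
def forward_search(candidates, score, n_select, selected=()):
    """Greedily add the candidate that gives the lowest score (e.g. a validation error)."""
    selected = list(selected)
    remaining = [f for f in candidates if f not in selected]
    history = []
    for _ in range(min(n_select, len(remaining))):
        # One model per remaining candidate: current subset plus that candidate.
        trial_scores = {f: score(selected + [f]) for f in remaining}
        best = min(trial_scores, key=trial_scores.get)
        selected.append(best)
        remaining.remove(best)          # a feature can only be added once
        history.append((best, trial_scores[best]))
    return selected, history

# Hypothetical, made-up error reductions used instead of a real model fit.
gain = {'kills': 0.020, 'matchDuration': 0.015, 'maxPlace': 0.012, 'assists': 0.001}

def toy_error(subset):
    return 0.10 - sum(gain.get(f, 0.0) for f in subset)

selected, history = forward_search(list(gain), toy_error, n_select=3,
                                   selected=['walkDistance'])
print(selected)
```

Each round only scores the features that are still available, and the scores of one round are never mixed with those of the previous one.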

In [None]:
feature_FS = [
    'assists',
    'DBNOs',
    'headshotKills',
    'heals',
    'killPoints',
    'kills',
    'killStreaks',
    'matchDuration',
    'maxPlace',
    'revives',
    'rideDistance',
    'roadKills',
    'swimDistance',
    'teamKills',
    'vehicleDestroys',
    'winPoints',
    'headshotRate',
    'totalDistance',
    'items',
    'healsPerWalkDistance',
    'killsPerWalkDistance'
    ]

#Split the model
cross_validation_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .3, train_size = .6, random_state = 0)
#Create dataframe to store results according to degree of polynomial features.
FS_results = pd.DataFrame(data = {'feature': [], 'test_score_mean': [], 'fit_time_mean': [], 'mean_absolute_error': []})
#Create X_train_truncated. The best new features will be appended to this array
X_train_truncated = np.asarray(training_df_truncated[['walkDistance','killPlace', 
                                                      'damageDealt', 'boosts', 
                                                      'weaponsAcquired']])
X_train_truncated = np.float32(X_train_truncated)
#Number of feature we want to add in the model
features_to_add = 6

#Loop for adding new feature into the model
#for i in range(1,features_to_add + 1):
    #Resets the results so that each round only ranks the candidates evaluated in that round
    #FS_results = FS_results.iloc[0:0]
    #Loops through each remaining feature and computes cross_validation test score with LASSO regression model.
    #for feature in feature_FS:
        #Creates a temporary array
        #X_temp = X_train_truncated
        #Add a new feature to the temporary array, and apply PolynomialFeatures function
        #added_feat = np.asarray(training_df_truncated[[feature]])
        #X_temp = np.append(X_temp, added_feat, axis = 1)
        #X_temp = PolynomialFeatures(3, interaction_only=False).fit_transform(X_temp)
        #Evaluate the model
        #cross_validation_results = model_selection.cross_validate(Lasso(alpha = 0.00001, max_iter=10000, normalize=True), X_temp, y_train_truncated, cv = cross_validation_split, return_train_score = True)
        #Predicts the target value
        #y_hat_truncated = Lasso(alpha=0.00001, max_iter=10000, normalize=True).fit(X_temp, y_train_truncated).predict(X_temp)
        #FS_results = FS_results.append({'feature' : feature, 
                                        #'test_score_mean' : cross_validation_results['test_score'].mean(), 
                                        #'fit_time_mean' : cross_validation_results['fit_time'].mean(), 
                                        #'mean_absolute_error' : mean_absolute_error(y_train_truncated, y_hat_truncated)}, ignore_index=True)
        #print('OK for ' + feature)
    
    #Store the results into a dataframe, sort it, and choose the best feature to add to the model.
    #FS_results = FS_results.sort_values(by='mean_absolute_error', ascending=True)
    #new_feat = FS_results.feature.iloc[0]
    #new_score = FS_results.test_score_mean.iloc[0]
    #new_MAE = FS_results.mean_absolute_error.iloc[0]
    #new_fit_time = FS_results.fit_time_mean.iloc[0]
    #X_train_truncated = np.append(X_train_truncated, np.asarray(training_df_truncated[[new_feat]]), axis = 1)
    #Removes the selected feature from the candidates, so that it cannot be added twice
    #feature_FS.remove(new_feat)
    #print(new_feat + ' feature has been added to the model. Test score mean is now ' + str(new_score) + '. Mean absolute error is now ' + str(new_MAE) + '. Fit time mean is now ' + str(new_fit_time) + '.')

According to this forward search algorithm, we choose to add to our model the following features:
* **kills** (killsPerWalkDistance if sorting according to min Mean Absolute Error value)
* **matchDuration** (maxPlace if sorting according to MAE)
* **maxPlace** (matchDuration if sorting according to MAE)
* **totalDistance** (totalDistance if sorting according to MAE)
* **killsPerWalkDistance** (kills if sorting according to MAE)
* **killStreaks** (assists if sorting according to MAE)

When submitting this model to Kaggle, we get a score of **0.09127**. This is not very good, and we can't use more than 8-9 features in our model without making the computing time skyrocket (each feature we add to the model brings dozens of new 3rd-degree polynomial features, and the model then takes forever to compute).
Plus, it seems that adding more features won't take us much further, since the relations between features in our problem seem complex and can't be modelled with only linear and low-degree polynomial functions.

Let's try something else!

## **<div id="VI">VI. Model data, attempt n°2: Random Forest (Outdated)</div>**

### **<div id="VI1">1. Creating and normalizing matrices for our model</div>**

First, we need to prepare and normalize our train and test matrices, which we will then use for our models.

Normalizing the data has two purposes:
* Making training less sensitive to the scale of features. If we don't normalize the data when features have different scales (for example, age and house price), our ML algorithms might give too much weight to features with large scales.
* Accelerating optimization. Many machine learning models are fitted with gradient descent or a variant thereof, and the speed of convergence depends on the scaling of features. Normalization makes the problem better conditioned, improving the convergence rate of gradient descent.
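Here is a minimal sketch of what this normalization does to two features on very different scales. The `standardize` helper and the numbers are assumptions made for the example (they echo the age and house price case above); in the notebook, the corresponding step is scikit-learn's `preprocessing.StandardScaler`.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) standard deviation."""
    mean, std = fmean(column), pstdev(column)
    if std == 0:
        return [0.0 for _ in column]   # constant column: nothing to scale
    return [(v - mean) / std for v in column]

# Hypothetical values on very different scales.
age = [25, 35, 45, 55]
house_price = [150_000, 250_000, 350_000, 450_000]

age_scaled = standardize(age)
price_scaled = standardize(house_price)
print(age_scaled)
print(price_scaled)
```

After scaling, both columns take the same values, so neither dominates the other just because of its unit.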


In [None]:
training_df_truncated = training_df.truncate(after=100000)
X_train_truncated = np.asarray(training_df_truncated.drop(['winPlacePerc'], axis = 1))
X_train_truncated[0:1]

In [None]:
#X_train = np.asarray(training_df.drop(['winPlacePerc'], axis = 1))
#X_train[0:1]

In [None]:
y_train_truncated = np.asarray(training_df_truncated[['winPlacePerc']])
y_train_truncated[0:5]

In [None]:
#y_train = np.asarray(training_df[['winPlacePerc']])
#y_train[0:5]

In [None]:
del training_df
gc.collect()

In [None]:
X_test = np.asarray(testing_df)
X_test[0:1]

In [None]:
del testing_df
gc.collect()

In [None]:
print ('Train set truncated:', X_train_truncated.shape,y_train_truncated.shape)
#X_train and y_train only exist if the full training set cells above are uncommented
#print ('Train set:', X_train.shape,y_train.shape)
print ('Test set:', X_test.shape)

### **<div id="VI2">2. Modeling</div>**

A random forest is an ensemble of decision trees: each tree is trained on a bootstrap sample of the training set, and the prediction of the forest is the average of the predictions of its trees. It can model non-linear relationships between the features and the target, which is what the polynomial LASSO model above struggled with.

#### **a. Cross validation**

For a prediction problem, a model is generally provided with a data set of known data, called the training data set, and a set of unknown data against which the model is tested, known as the test data set. The goal is to hold out part of the data to test the model during the training phase, which gives insight into how the model generalizes to an independent data set. A round of cross-validation partitions the data into complementary subsets, fits the model on one subset (the training set), and validates it on the other (the testing set). To reduce variability, many rounds of cross-validation are performed using different partitions, and the results are averaged. **Cross-validation is a powerful technique for estimating model performance.**

Here, we are using the ShuffleSplit class from scikit-learn. ShuffleSplit randomly samples the entire dataset during each iteration to generate a training set and a test set. The test_size and train_size parameters control how large the test and training sets should be for each iteration. Since we sample from the entire dataset during each iteration, values selected during one iteration could be selected again during another iteration.

The main difference between ShuffleSplit and KFold is that in KFold, each round uses one fold as the test set and all the remaining folds as the training set, whereas in ShuffleSplit, round n only uses the training and test sets drawn for iteration n. **As the dataset grows, cross-validation time increases, making ShuffleSplit a more attractive alternative.** If you can train your algorithm on a fixed percentage of your data, as opposed to using all k-1 folds, ShuffleSplit is an attractive option.
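The difference between the two strategies is easier to see on indices. The two generators below are hand-written sketches, not scikit-learn's implementation: their names, the rounding of the set sizes and the 20-sample example are assumptions made for the illustration.

```python
import random

def shuffle_split(n_samples, n_splits, test_size, train_size, seed=0):
    """Yield (train, test) index lists; each split reshuffles the whole dataset."""
    rng = random.Random(seed)
    n_test = round(n_samples * test_size)     # rounded here for simplicity
    n_train = round(n_samples * train_size)
    for _ in range(n_splits):
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield indices[n_test:n_test + n_train], indices[:n_test]

def k_fold(n_samples, n_splits):
    """Yield (train, test) index lists; the test folds partition the dataset."""
    indices = list(range(n_samples))
    fold_size = n_samples // n_splits         # assumes n_samples is divisible by n_splits
    for k in range(n_splits):
        test = indices[k * fold_size:(k + 1) * fold_size]
        train = indices[:k * fold_size] + indices[(k + 1) * fold_size:]
        yield train, test

shuffle_splits = list(shuffle_split(20, n_splits=5, test_size=0.3, train_size=0.6))
k_fold_splits = list(k_fold(20, n_splits=5))

for train, test in shuffle_splits:
    print('ShuffleSplit', len(train), len(test), sorted(test))
for train, test in k_fold_splits:
    print('KFold', len(train), len(test), test)
```

With test_size = .3 and train_size = .6, each shuffle split leaves 10% of the rows unused, and the same row can land in several test sets. With k-fold, every row is tested exactly once, but each model is trained on all the other folds.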

In [None]:
models = [ 
    RandomForestRegressor(n_estimators=10, criterion = 'mse', oob_score = True, random_state = 1)
    ]

model_results = pd.DataFrame(data = {'test_score_mean': [], 'fit_time_mean': [], 'mean_absolute_error': []})

# Splitting the data
cross_validation_split = model_selection.ShuffleSplit(n_splits = 5, test_size = .3, train_size = .6, random_state = 0 )
# Performing shufflesplit cross validation, with the whole training set (the cross_validate function coupled with ShuffleSplit takes care of splitting the training set) 
#for model in models:
    #cross_validation_results = model_selection.cross_validate(model, X_train_truncated, y_train_truncated, 
                                                              #cv= cross_validation_split, return_train_score=True)
    #Predicts the target value on the whole training set
    #y_hat = model.fit(X_train_truncated, y_train_truncated).predict(X_train)    
    # Checking the mean of test scores for each iteration of the validation    
    #model_results = model_results.append({'test_score_mean' : cross_validation_results['test_score'].mean(), 
                                          #'fit_time_mean' : cross_validation_results['fit_time'].mean(), 
                                          #'mean_absolute_error' : mean_absolute_error(y_train, y_hat)}, ignore_index=True) 
 
model_results

#### **b. Tune Model with Hyper-Parameters**

In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins. By contrast, the values of other parameters are derived via training.

Different model training algorithms require different hyperparameters; some simple algorithms (such as ordinary least squares regression) require none. Given these hyperparameters, the training algorithm learns the parameters from the data. Model hyperparameters are set manually and are used in processes to help estimate model parameters.
**Tuning a hyperparameter means getting as close as possible to its best value, in order to maximize the accuracy of the model** (for larger datasets, computing time may also be taken into account for parameter tuning).

First, we are using **RandomizedSearchCV**. We need to create a parameter grid to sample from during fitting. On each iteration, the algorithm will choose a different combination of hyperparameter values. Altogether, there are 6x6x6x3 = 648 settings! However, the benefit of a random search is that we are not trying every combination, but selecting at random to sample a wide range of values.
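The size of the grid, and what sampling from it means, can be checked with the standard library alone. This is a sketch of the idea, not of scikit-learn's internals: the grid is the one used in the cell below, while `all_settings`, `sampled_settings` and the seed are names and values chosen for the illustration.

```python
import random
from itertools import product

param_grid = {'min_samples_leaf': [5, 10, 20, 40, 70, 100],
              'min_samples_split': [10, 20, 40, 60, 80, 100],
              'max_depth': [10, 20, 30, 40, 50, None],
              'n_estimators': [30, 40, 50]}

# Every combination of the grid, as a list of {hyperparameter: value} dicts.
all_settings = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
print(len(all_settings))

# A random search only evaluates a subset of them, drawn here without replacement.
rng = random.Random(0)
sampled_settings = rng.sample(all_settings, k=100)

# Number of model fits with 5 cross-validation splits.
fits_random_search = len(sampled_settings) * 5
fits_full_grid = len(all_settings) * 5
print(fits_random_search, fits_full_grid)
```

With n_iter = 100 and 5 splits, the random search costs 500 fits, against 3240 for an exhaustive search over the same grid.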

In [None]:
#A first iteration (see below) gave these results: 
#0.9099475158875073
#{'n_estimators': 30, 'min_samples_split': 20, 'min_samples_leaf': 10, 'max_depth': 30}
#--------------------------------------------------------------------------------------------------
#RFR = RandomForestRegressor(criterion = 'mse', oob_score = True, random_state = 1)
#param_grid = {'min_samples_leaf' : [1, 10, 50, 100, 500, 1000], 
              #'min_samples_split' : [2, 20, 100, 200, 1000, 2000], 
              #'max_depth': [10, 20, 30, 40, 50, None],
              #'n_estimators': [10, 20, 30]}

#RS = RandomizedSearchCV(estimator = RFR, 
                        #param_distributions = param_grid, 
                        #n_iter = 100, 
                        #cv = cross_validation_split, verbose = 5, random_state = 0, n_jobs = -1)

#RS = RS.fit(X_train_truncated, y_train_truncated)

#print(RS.best_score_)
#print(RS.best_params_)
#--------------------------------------------------------------------------------------------------

RFR = RandomForestRegressor(criterion = 'mse', oob_score = True, random_state = 1)
param_grid = {'min_samples_leaf' : [5, 10, 20, 40, 70, 100], 
              'min_samples_split' : [10, 20, 40, 60, 80, 100], 
              'max_depth': [10, 20, 30, 40, 50, None],
              'n_estimators': [30, 40, 50]}

#RS = RandomizedSearchCV(estimator = RFR, 
                        #param_distributions = param_grid, 
                        #n_iter = 100, 
                        #cv = cross_validation_split, verbose = 5, random_state = 0, n_jobs = -1)

#RS = RS.fit(X_train_truncated, y_train_truncated)

#print(RS.best_score_)
#print(RS.best_params_)

We get these results:
* **best score**: 0.9111606731899772
* **best params**: {'n_estimators': 40, 'min_samples_split': 10, 'min_samples_leaf': 5, 'max_depth': 40}

Random search allowed us to narrow down the range for each hyperparameter. Now that we know where to concentrate our search, we can explicitly specify every combination of settings to try. We do this with **GridSearchCV**, a method that, instead of sampling randomly from a distribution, evaluates all combinations we define. To use Grid Search, we make another grid based on the best values provided by random search:

In [None]:
param_grid = {'min_samples_leaf' : [5, 10, 15], 
              'min_samples_split' : [10, 15, 20], 
              'max_depth': [30, 35, 40, None],
              'n_estimators': [40]}

#GS = GridSearchCV(estimator = RFR, param_grid = param_grid, cv = cross_validation_split, verbose = 5, n_jobs = -1)
#GS = GS.fit(X_train_truncated, y_train_truncated)

#print(GS.best_score_)
#print(GS.best_params_)

We get these results:
* **best score**: 0.9112120273422432
* **best params**: {'max_depth': 30, 'min_samples_leaf': 5, 'min_samples_split': 15, 'n_estimators': 40}

In [None]:
best_model = RandomForestRegressor(n_estimators=40, 
                                   oob_score = True,
                                   min_samples_leaf = 5,
                                   min_samples_split = 15,
                                   max_depth = 30,
                                   random_state = 1).fit(X_train_truncated, y_train_truncated.ravel())
print("%.4f" % best_model.oob_score_)
#The MAE below needs X_train and y_train (the full training set), which are commented out above
#y_hat = best_model.predict(X_train)
#print ("%.4f" % mean_absolute_error(y_train, y_hat))
importance_df = pd.concat((pd.DataFrame(training_df_truncated.drop(['winPlacePerc'], axis=1).columns, columns = ['variable']), 
           pd.DataFrame(best_model.feature_importances_, columns = ['importance'])), 
           axis = 1).sort_values(by='importance', ascending = False)
importance_df

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='variable', data=importance_df.sort_values(by="importance", ascending=False))
plt.title('Feature Importance')
plt.tight_layout()

When submitting this model to Kaggle, we get a score of **0.06454**. This is way better! Now let's try LightGBM.

## **<div id="VII">VII. Model data, attempt n°3: Light GBM</div>**

**LightGBM** is a relatively new algorithm. It is a gradient boosting framework that uses tree-based learning algorithms. LightGBM grows trees leaf-wise (vertically), while many other boosting algorithms grow trees level-wise (horizontally): at each step, it splits the leaf with the largest loss reduction. For the same number of leaves, leaf-wise growth can reduce the loss more than level-wise growth. LightGBM is prefixed as ‘Light’ because of its high speed: it can handle large datasets and needs less memory to run. It is also popular because it reaches good accuracy, and it supports GPU learning, which is why data scientists use it widely.

Useful resources: 
* https://medium.com/@pushkarmandot/https-medium-com-pushkarmandot-what-is-lightgbm-how-to-implement-it-how-to-fine-tune-the-parameters-60347819b7fc
* https://lightgbm.readthedocs.io/en/latest/Python-API.html#scikit-learn-api
* https://lightgbm.readthedocs.io/en/latest/Parameters.html

### **<div id="VII1">1. Creating and normalizing matrices for our model</div>**

In [None]:
#training_df_truncated = training_df.truncate(after=10000)
#X_train_truncated = np.asarray(training_df_truncated.drop(['winPlacePerc'], axis = 1))
#X_train_truncated = np.float32(X_train_truncated)
#X_train_truncated = preprocessing.StandardScaler().fit(X_train_truncated).transform(X_train_truncated)
#X_train_truncated[0:1]

In [None]:
#y_train_truncated = np.asarray(training_df_truncated[['winPlacePerc']])
#y_train_truncated[0:5]

In [None]:
print ('Train set truncated:', X_train_truncated.shape,y_train_truncated.shape)
#X_train and y_train only exist if the full training set cells above are uncommented
#print ('Train set:', X_train.shape,y_train.shape)
print ('Test set:', X_test.shape)

### **<div id="VII2">2. Modeling</div>**

#### **a. Cross validation**

In [None]:
models = [ 
    #lgb.LGBMRegressor(boosting_type='gbdt', n_estimators=1000, learning_rate=0.05, bagging_fraction = 0.9, max_bin = 127, metric = 'mae', n_jobs=-1, 
                      #max_depth=-1, num_leaves=200, min_data_in_leaf = 100),
    #lgb.LGBMRegressor(boosting_type='gbdt', n_estimators=50, learning_rate=0.003, metric = 'mae', n_jobs=-1, 
                      #max_depth=-1, num_leaves=200, min_data_in_leaf = 100),
    #lgb.LGBMRegressor(boosting_type='gbdt', n_estimators=100, learning_rate=0.003, metric = 'mae', n_jobs=-1, 
                      #max_depth=-1, num_leaves=200, min_data_in_leaf = 100)
    ]

model_results = pd.DataFrame(data = {'test_score_mean': [], 'fit_time_mean': [], 'mean_absolute_error': []})

# Splitting the data
cross_validation_split = model_selection.ShuffleSplit(n_splits = 4, test_size = .3, train_size = .6, random_state = 0 )
# Performing shufflesplit cross validation, with the whole training set (the cross_validate function coupled with ShuffleSplit takes care of splitting the training set) 
for model in models:
    cross_validation_results = model_selection.cross_validate(model, X_train_truncated, y_train_truncated, cv= cross_validation_split, 
                                                              scoring = 'neg_mean_absolute_error', return_train_score=True)
    #Predicts the target value on the whole training set
    y_hat = model.fit(X_train_truncated, y_train_truncated).predict(X_train)    
    # Checking the mean of test scores for each iteration of the validation    
    model_results = model_results.append({'test_score_mean' : cross_validation_results['test_score'].mean(), 
                                          'fit_time_mean' : cross_validation_results['fit_time'].mean(), 
                                          'mean_absolute_error' : mean_absolute_error(y_train, y_hat)}, ignore_index=True) 
 
model_results

#### **b. Tune Model with Hyper-Parameters**

In [None]:
LGBM = lgb.LGBMRegressor(learning_rate=0.003, metric = 'mae', n_estimators = 100, n_jobs=-1)
#early_stopping_rounds = 100, 
param_grid = {'boosting_type' : ['gbdt', 'dart', 'goss'],
              'max_depth' : [10, 20, 30, -1],
              'min_data_in_leaf' : [10, 50, 100, 500, 1000],
              'num_leaves' : [50, 100, 200, 500, 1000]
             }

#RS = RandomizedSearchCV(estimator = LGBM, param_distributions = param_grid, 
                        #n_iter = 50, scoring = 'neg_mean_absolute_error',
                        #cv = cross_validation_split, verbose = 10, random_state = 0, n_jobs = -1)

#RS = RS.fit(X_train_truncated, y_train_truncated)

#print(RS.best_score_)
#print(RS.best_params_)

We get these results:
* **best score**: -0.07856967910566438
* **best params**: {'num_leaves': 500, 'min_data_in_leaf': 10, 'max_depth': 30, 'boosting_type': 'goss'}

In [None]:
param_grid = {'boosting_type' : ['goss'],
              'max_depth' : [20, 30, 40, -1],
              'min_data_in_leaf' : [10, 20, 50, 100],
              'num_leaves' : [400, 500, 600]}

#GS = GridSearchCV(estimator = LGBM, param_grid = param_grid, cv = cross_validation_split, verbose = 10, scoring = 'neg_mean_absolute_error', n_jobs = -1)
#GS = GS.fit(X_train_truncated, y_train_truncated)

#print(GS.best_score_)
#print(GS.best_params_)

In [None]:
best_model = lgb.LGBMRegressor(learning_rate=0.003, metric = 'mae', n_estimators = 2000, n_jobs=-1,
                               boosting_type = 'gbdt',
                               max_depth = 30,
                               min_data_in_leaf = 10,
                               num_leaves = 500).fit(X_train_truncated, y_train_truncated.ravel())

#The MAE below needs X_train and y_train (the full training set), which are commented out above
#y_hat = best_model.predict(X_train)
#print ("%.4f" % mean_absolute_error(y_train, y_hat))
importance_df = pd.concat((pd.DataFrame(training_df_truncated.drop(['winPlacePerc'], axis=1).columns, columns = ['variable']), 
           pd.DataFrame(best_model.feature_importances_, columns = ['importance'])), 
           axis = 1).sort_values(by='importance', ascending = False)
importance_df

In [None]:
plt.figure(figsize=(10, 6))
sns.barplot(x='importance', y='variable', data=importance_df.sort_values(by="importance", ascending=False))
plt.title('Feature Importance')
plt.tight_layout()

## **<div id="VIII">VIII. Model Submission</div>**

In [None]:
# Predicting the results of the testing set with the model
yhat_test = lgb.LGBMRegressor(learning_rate=0.05, bagging_fraction = 0.9, max_bin = 127, metric = 'mae', n_estimators = 1000, n_jobs=-1,
                              boosting_type = 'gbdt',
                              max_depth = 30,
                              min_data_in_leaf = 10,
                              num_leaves = 200).fit(X_train_truncated, y_train_truncated).predict(X_test)
# Submitting
testing_df = pd.read_csv("../input/test_V2.csv")
submission = testing_df.copy()
submission['winPlacePerc'] = yhat_test
submission.to_csv('submission.csv', columns=['Id', 'winPlacePerc'], index=False)
submission[['Id', 'winPlacePerc']].head()