# Take a first look at the data
________
The first thing we'll need to do is load in the libraries and datasets we'll be using. For today, I'll be using a dataset of sale events that occurred at auction for Fast Iron, and we'll predict the price of their products based on attributes such as usage.

> **Important!** Make sure you run this cell yourself or the rest of your code won't work!

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from IPython.display import display

from sklearn import metrics
import os
print(os.listdir("../input"))

# Any results you write to the current directory are saved as output.

In [None]:
data = pd.read_csv('../input/train/Train.csv',low_memory=False, parse_dates=["saledate"])

In [None]:
data.saledate

The first thing I do when I get a new dataset is take a look at some of it. This lets me see that it all read in correctly and get an idea of what's going on with the data. In this case, I'm looking for missing values, which will be represented with `NaN` or `None`.

We need to build a function that lets us see all the rows and columns of the data when we call `.head()` or `.tail()`, instead of the truncated default view.

In [None]:
def display_all(df):
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000): 
        display(df)
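To see why the `with` block is handy here, this is a small sketch showing that `pd.option_context` only changes the display options temporarily and puts them back afterwards (the variable names are just for illustration):

```python
import pandas as pd

# Remember the current setting so we can compare later
default_rows = pd.get_option("display.max_rows")

with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
    # Inside the block the larger limits are active
    inside_rows = pd.get_option("display.max_rows")

# Outside the block the original setting is restored
after_rows = pd.get_option("display.max_rows")
print(default_rows, inside_rows, after_rows)
```

So `display_all` can show everything without changing how every other cell in the notebook prints.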

We are going to look at a sample of our data to check for missing values. I added `.T` (transpose) because we have too many columns, and this makes it easier to see the instances for each column.

In [None]:
display_all(data.sample(10).T)

In [None]:
display_all(data.isnull().sum().sort_values(ascending=False)/len(data))

Well, it seems that a bunch of columns have around 90% missing values, so we are throwing away the columns that have more than 60 percent missing data.

In [None]:
data_needed = data[['SalesID',
                    'state',
                    'fiProductClassDesc',
                    'fiBaseModel',
                    'fiModelDesc',
                    'ProductGroup',
                    'saledate',
                    'datasource',
                    'ModelID',
                    'MachineID',
                    'SalePrice',
                    'YearMade',
                    'ProductGroupDesc',
                    'Enclosure',
                    'auctioneerID',
                    'Hydraulics',
                    'fiSecondaryDesc',
                    'Coupler',
                    'Forks',
                    'ProductSize',
                    'Transmission']].copy()  # copy so later in-place edits don't warn about modifying a slice of `data`
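Instead of typing the column names by hand, the same kind of selection could be derived from the missing-value fractions. Here is a minimal sketch on a toy frame; the `toy` data and the `threshold` name are made up for illustration and are not part of the competition data:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data` (values are invented)
toy = pd.DataFrame({
    "SalesID": [1, 2, 3, 4, 5],
    "Coupler": ["Manual", np.nan, "Hydraulic", np.nan, "Manual"],   # 40% missing
    "Tip_Control": [np.nan, np.nan, np.nan, np.nan, "Sideshift"],   # 80% missing
})

threshold = 0.6  # drop columns with more than 60% missing values

# isnull().mean() gives the fraction of missing values per column
missing_fraction = toy.isnull().mean()
keep_cols = missing_fraction[missing_fraction <= threshold].index.tolist()

toy_needed = toy[keep_cols].copy()
print(missing_fraction)
print(keep_cols)
```

This keeps the threshold in one place, so trying 50% or 70% instead is a one-line change.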

If you go to Overview and then Evaluation, you will find how Kaggle is going to measure the performance of your model:

The evaluation metric for this competition is the RMSLE (root mean squared log error) between the actual and predicted auction prices.

Sample submission files can be downloaded from the data page. Submission files should be formatted as follows:

- Have a header: "SalesID,SalePrice"
- Contain two columns
    - SalesID: SalesID for the validation set in sorted order
    - SalePrice: Your predicted price of the sale

So we need to build a score function to measure our model's performance against. First, we change the sale price to its log, so that a plain RMSE on the transformed values corresponds to the RMSLE on the original prices.

In [None]:
data_needed.SalePrice = np.log(data_needed.SalePrice)
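To make the link between the log transform and the metric concrete, here is a small sketch of RMSLE written as RMSE on log prices. It uses a plain `log`, matching the cell above; some definitions of RMSLE use `log(1 + x)` instead, so treat this as an assumption. The prices are made up:

```python
import numpy as np

def rmse(pred, actual):
    """Root mean squared error."""
    return np.sqrt(np.mean((pred - actual) ** 2))

def rmsle(pred_price, actual_price):
    """RMSE between the logs of the predicted and actual prices."""
    return rmse(np.log(pred_price), np.log(actual_price))

# Invented prices, for illustration only
actual = np.array([10000.0, 25000.0, 60000.0])
pred = np.array([12000.0, 20000.0, 66000.0])

print(rmsle(pred, actual))

# A prediction that is off by a factor of e everywhere has an RMSLE of 1
off_by_e = rmsle(actual * np.e, actual)
print(off_by_e)
```

Because the error is measured on logs, being off by 10% on a cheap item costs the same as being off by 10% on an expensive one.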

In [None]:
data_needed.head()

Let's make use of the date here and split it into useful pieces of information, like day of year and day of month. This is useful because factors such as whether the sale happened on a working day or a holiday could affect the purchasing rate of any product.

In [None]:
add_datepart(data_needed, 'saledate',drop=False)
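To give an idea of the kind of columns this produces, here is a rough sketch using plain pandas `.dt` accessors. It is not the fastai implementation, only an approximation of a few of its outputs, and the column names are my assumption of how the generated names look:

```python
import pandas as pd

# Two hand-picked dates, for illustration only
dates = pd.DataFrame({"saledate": pd.to_datetime(["2020-01-01", "2020-02-29"])})

d = dates["saledate"].dt
dates["saleYear"] = d.year
dates["saleMonth"] = d.month
dates["saleDay"] = d.day
dates["saleDayofweek"] = d.dayofweek        # Monday=0 ... Sunday=6
dates["saleDayofyear"] = d.dayofyear
dates["saleIs_month_end"] = d.is_month_end

print(dates.T)
```

Each of these becomes an ordinary numeric or boolean column that a tree can split on.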

We sort the data by date so that later we can train on earlier purchasing events and then predict on later ones.

In [None]:
data_needed.sort_values('saledate',inplace=True)

In [None]:
data_needed.head(20)

In [None]:
data_needed.drop('saledate',axis=1,inplace=True)

In [None]:
display_all(data_needed.head())

**We have solved all our problems with the data so far, but!**
How can we train our model with all this string data? We need something that converts these columns to numerical values. The next line of code iterates through the data by column name (n) and values (c), converts each string column to a category type, and assigns an integer code to each category behind the scenes, for example 0 for High, 1 for Low, and so on.

In [None]:
train_cats(data_needed)
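Here is a small sketch of the idea in plain pandas, on a made-up column. It is not the fastai code itself, just the same category mechanism it relies on:

```python
import pandas as pd

# Invented column for illustration
toy = pd.DataFrame({"UsageBand": ["High", "Low", None, "Medium", "Low"]})

toy["UsageBand"] = toy["UsageBand"].astype("category")
print(toy["UsageBand"].cat.categories)   # sorted alphabetically by default

codes = toy["UsageBand"].cat.codes       # missing values get the code -1
print(codes.tolist())

# The order can be set by hand when it carries meaning
ordered = toy["UsageBand"].cat.set_categories(["High", "Medium", "Low"], ordered=True)
ordered_codes = ordered.cat.codes
print(ordered_codes.tolist())
```

The column still prints as text, which is why the frame looks unchanged; only the underlying codes are integers.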

In [None]:
data_needed.dtypes

In [None]:
#let's see our data with the changes we have made so far
display_all(data_needed.head(100))

**No changes!!!!**
Don't worry, it works behind the scenes, trust me. But if you don't :D you can look at the integer codes of any categorical column, for example
data_needed.ProductSize.cat.codes
and see it with your own eyes.

Now I want to see the categories of one of the most intuitive columns, let's say `state`:

In [None]:
data_needed.state.cat.categories

And the integer code assigned to the state in each row:

In [None]:
data_needed.state.cat.codes.sort_index()

**Handling missing values**

In [None]:
data_needed.isnull().sum().sort_values(ascending=False)/len(data_needed)*100

In [None]:
#let's look at the Transmission column and try to build a better intuition about why data is missing in this column
data_needed.Transmission

It seems these are transmission types of the machines, so I won't be able to guess them or fill in sensible values. I will drop that column.

In [None]:
data_needed.drop('Transmission',axis=1,inplace=True)

In [None]:
#let's see the next column
data_needed.ProductSize       

In [None]:
#these are the sizes of the product, so I think we can play around here: with 'bfill', each NaN is filled
#with the next valid value below it, e.g. if x[0] is NaN it will be filled with x[1]
data_needed.ProductSize = data_needed.ProductSize.fillna(method = 'bfill', axis=0)  # assign back, fillna returns a new Series

In [None]:
#coupler
data_needed.Coupler                 

In [None]:
#as before, fill each NaN with the next valid value
data_needed.Coupler = data_needed.Coupler.fillna(method = 'bfill', axis=0)  # assign back, fillna returns a new Series
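Since the direction of the fill is easy to mix up, here is a tiny sketch on a made-up Series comparing forward fill and backward fill:

```python
import numpy as np
import pandas as pd

# Invented values for illustration
s = pd.Series(["Small", np.nan, np.nan, "Large", np.nan])

forward = s.ffill()    # each NaN takes the last valid value above it
backward = s.bfill()   # each NaN takes the next valid value below it

print(forward.tolist())
print(backward.tolist())
```

Note that a backward fill leaves a trailing NaN untouched, because there is no later value to copy from.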

In [None]:
data_needed.fiSecondaryDesc         

In [None]:
data_needed.fiSecondaryDesc = data_needed.fiSecondaryDesc.fillna(method = 'bfill', axis=0)

In [None]:
data_needed.Hydraulics = data_needed.Hydraulics.fillna(method = 'bfill', axis=0)

In [None]:
data_needed.auctioneerID             

In [None]:
#since we finally have a numerical column, we fill the NaNs with the median of the column
data_needed.auctioneerID = data_needed.auctioneerID.fillna(data_needed.auctioneerID.median())

In [None]:
#small number of NaNs, so we just drop those rows
data_needed.dropna(subset=['Enclosure'], inplace=True)

Now let's split the data into training data and labels, as x and y respectively.

To do that we apply `proc_df`, a function introduced by fastai. It handles the NaN values we couldn't handle ourselves, converts every column in our dataset to a numeric type so that it can be used in the learning process, and returns the features and the `SalePrice` labels separately.

In [None]:
df, y, nas = proc_df(data_needed, 'SalePrice')
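As a rough idea of what happens inside, here is a simplified sketch in plain pandas. It is my approximation of the main steps (median fill with an indicator column for numeric NaNs, integer codes for categories), not the actual fastai source, and the toy values are made up:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `data_needed` (values are invented)
toy = pd.DataFrame({
    "auctioneerID": [1.0, np.nan, 3.0, 2.0],
    "Enclosure": pd.Categorical(["EROPS", None, "OROPS", "EROPS"]),
    "SalePrice": [9.2, 10.1, 9.8, 10.5],
})

y = toy["SalePrice"].values            # labels
X = toy.drop("SalePrice", axis=1)      # features

for name in list(X.columns):
    col = X[name]
    if pd.api.types.is_numeric_dtype(col):
        if col.isnull().any():
            X[name + "_na"] = col.isnull()         # remember which rows were missing
            X[name] = col.fillna(col.median())     # fill with the column median
    else:
        X[name] = col.cat.codes + 1                # NaN (-1) becomes 0, categories start at 1

print(X)
```

After this step every column is numeric and free of NaNs, which is what the random forest needs.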

https://youtu.be/zvUOpbgtW3c Here's a video that demonstrates a regression tree, which gives you the basic idea of what is going on. So what is the difference? A random forest is just many trees (their number is set by the parameter n_estimators, which we will be using later), and it outputs the average of their predictions.

In [None]:
#see the source code of the proc fn
??proc_df

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(df, y)
m.score(df,y)

Because we are scoring the model on the same data it was trained on, the accuracy will come out very high, but this says nothing about overfitting. Please see the image below to understand what I am after:
https://raw.githubusercontent.com/fastai/fastai/6ccb0f4e6c7ad88279dcf678da2b605e8e32aea8/courses/ml1/images/overfitting2.png


So now we will split our data into training and validation sets in order to test the model we build.

In [None]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 12000  # same as Kaggle's test set size
n_trn = len(df)-n_valid
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

The functions below implement the score our model will be judged on: RMSE on the log prices for the training and validation sets, plus the R^2 of each.

In [None]:
def rmse(x,y): return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train), rmse(m.predict(X_valid), y_valid),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)

In [None]:
m = RandomForestRegressor(n_jobs=-1)
m.fit(X_train, y_train)
print_score(m)

That took quite a while, so one option is to take just a subset of the whole data: cut down the training data while leaving the old validation data as it is.

We also increase the number of trees so that the forest averages over more of them, and add a new argument, oob_score, which controls whether to use out-of-bag samples to estimate the R^2 on unseen data. Each tree is scored on the rows that were left out of its own bootstrap sample, which gives us a kind of extra validation set for free.


In [None]:
m = RandomForestRegressor(n_estimators=40,n_jobs=-1,oob_score=True)
m.fit(X_train, y_train)
print_score(m)
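To see the out-of-bag score on its own, here is a small sketch on synthetic data (the data and the sizes are made up, nothing here comes from the auction dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression problem, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)

m = RandomForestRegressor(n_estimators=40, oob_score=True, random_state=0)
m.fit(X, y)

train_r2 = m.score(X, y)   # R^2 on rows the trees have seen
oob_r2 = m.oob_score_      # R^2 using, for each row, only trees that did not train on it
print(train_r2, oob_r2)
```

The out-of-bag R^2 is typically lower than the training R^2, which makes it a more honest estimate when no separate validation set is available.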

Stack the predictions of each tree (40 in this case), then print the individual predictions for the first validation row, their mean, and the actual value.

In [None]:
preds = np.stack([t.predict(X_valid) for t in m.estimators_])

In [None]:
preds[:,0], np.mean(preds[:,0]), y_valid[0]
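As a sanity check of the "forest is the average of its trees" idea, this sketch on synthetic data compares the mean of the per-tree predictions with what the forest itself predicts (data and sizes are invented):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))
y = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

m = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# One row of predictions per tree: shape (n_trees, n_rows)
preds = np.stack([t.predict(X) for t in m.estimators_])
forest_pred = m.predict(X)

print(preds.shape)
print(np.allclose(preds.mean(axis=0), forest_pred))
```

The individual trees can disagree quite a bit on a single row; it is the averaging that smooths them out.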

Let's try a different approach. Instead of training on just one subsample of the data, we can let each tree train on a different subsample of the whole data. This way the forest as a whole gets to see all of the data, while training still doesn't take much time and overfitting is mitigated.

In [None]:
set_rf_samples(80000)

In [None]:
m = RandomForestRegressor(n_estimators=80, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)
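`set_rf_samples` comes from the old fastai library. In more recent scikit-learn versions (0.22 and later, as far as I know) a similar effect can be had with the `max_samples` argument, which caps the number of rows drawn for each tree. This is a swapped-in alternative rather than what the cells above do; a minimal sketch on synthetic data with made-up sizes:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(size=(1000, 3))
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)

# Each tree is trained on a bootstrap sample of at most 200 rows,
# drawn independently from all 1000 rows
m = RandomForestRegressor(n_estimators=20, max_samples=200,
                          n_jobs=-1, random_state=0)
m.fit(X, y)

print(len(m.estimators_), m.score(X, y))
```

Unlike a global setting, `max_samples` belongs to the model object, so there is nothing to reset afterwards.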


To revert to using a full bootstrap sample, in order to show the impact of other over-fitting avoidance methods, uncomment and run the call below.

In [None]:
#reset_rf_samples()

In [None]:
m = RandomForestRegressor(n_estimators=80, n_jobs=-1, min_samples_leaf=3, oob_score=True)
m.fit(X_train, y_train)
print_score(m)



Another way to reduce over-fitting is to grow our trees less deeply. We do this by specifying (with min_samples_leaf) that we require some minimum number of rows in every leaf node. This has two benefits:

- There are fewer decision rules for each leaf node; simpler models should generalize better
- The predictions are made by averaging more rows in the leaf node, resulting in less volatility
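To see the effect on tree size, here is a small sketch with a single decision tree on synthetic data. A single tree is used only to keep the illustration simple, and the data and the value 25 are made up:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data, for illustration only
rng = np.random.RandomState(0)
X = rng.uniform(size=(400, 2))
y = X[:, 0] + rng.normal(scale=0.2, size=400)

# Fully grown tree: keeps splitting until leaves are pure
deep = DecisionTreeRegressor(min_samples_leaf=1, random_state=0).fit(X, y)
# Restricted tree: every leaf must hold at least 25 rows
shallow = DecisionTreeRegressor(min_samples_leaf=25, random_state=0).fit(X, y)

print(deep.tree_.node_count, shallow.tree_.node_count)
```

The restricted tree ends up with far fewer nodes, and each of its leaf predictions is an average over at least 25 rows instead of possibly a single one.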



max_features : int, float, string or None, optional (default="auto")

The number of features to consider when looking for the best split:

- If int, then consider max_features features at each split.
- If float, then max_features is a percentage and int(max_features * n_features) features are considered at each split.
- If "auto", then max_features=n_features.
- If "sqrt", then max_features=sqrt(n_features).
- If "log2", then max_features=log2(n_features).
- If None, then max_features=n_features.

Note: the search for a split does not stop until at least one valid partition of the node samples is found, even if it requires to effectively inspect more than max_features features.


In [None]:
m = RandomForestRegressor(n_estimators=80, min_samples_leaf=3, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)

In [None]:
m = RandomForestRegressor(n_estimators=80, min_samples_leaf=4, max_features=0.5, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)
print_score(m)