# Naive XGBoost

Now we build a basic XGBoost model without any hyperparameter tuning.

First, we carry out the necessary preprocessing of the data.

In [1]:
# Suppress unnecessary warnings so that the presentation looks clean
import warnings
warnings.filterwarnings('ignore')

# Import the necessary modules
import pandas                                      # to read and manipulate data
import zipfile                                     # to extract data
import numpy as np                                 # for matrix operations
# The rest will be imported as and when required

# Extract the training data from its zip file
with zipfile.ZipFile("train.csv.zip", 'r') as zip_ref:
    zip_ref.extractall()

train_data = pandas.read_csv("train.csv")

# Hold out every row from index 150000 onward as a local test set
test_data = train_data.iloc[150000:].copy()
train_data = train_data.iloc[:150000].copy()

y_true = test_data['loss']

ids = test_data['id']

target = train_data['loss']

# Drop the id and loss columns from both the train and test sets
train_data.drop(['id', 'loss'], axis=1, inplace=True)
test_data.drop(['id', 'loss'], axis=1, inplace=True)

# Train on log(loss + shift); predictions are mapped back later with exp(...) - shift
shift = 200
target = np.log(target + shift)

# Merge both datasets into a single joined dataset
joined = pandas.concat([train_data, test_data], ignore_index=True)
del train_data, test_data                                         # delete the originals to save memory

cat_feature = [n for n in joined.columns if n.startswith('cat')]  # names of all categorical features

# Factorize them (replace each category with an integer code)
for column in cat_feature:
    joined[column] = pandas.factorize(joined[column].values, sort=True)[0]
        
del cat_feature

# Split the joined data back into the training and test sets
train_data = joined.iloc[:150000,:]
test_data = joined.iloc[150000:,:]
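
Because the factorization runs on the joined frame, each category receives the same integer code in both splits. A minimal sketch of what `pandas.factorize(..., sort=True)` does, using a made-up toy column (the values are illustrative and not taken from the dataset):

```python
import pandas

# Toy column standing in for one of the 'cat*' features (values are made up).
toy = pandas.DataFrame({"cat1": ["B", "A", "C", "A", "B"]})

# sort=True assigns codes in sorted order of the unique values: A -> 0, B -> 1, C -> 2.
codes, uniques = pandas.factorize(toy["cat1"].values, sort=True)
toy["cat1"] = codes

print(toy["cat1"].tolist())   # [1, 0, 2, 0, 1]
print(list(uniques))          # ['A', 'B', 'C']
```

Note that these codes impose an ordering on the categories; tree-based models can still split on them, which is presumably why no one-hot encoding is used here.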

## Importing the model

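Now we import the model and the metric, and define a custom evaluation function that computes the MAE on the exponentiated predictions and labels.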

In [2]:
from sklearn.metrics import mean_absolute_error
import xgboost as xgb

def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'mae', mean_absolute_error(np.exp(preds), np.exp(labels))
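
`evalerror` exponentiates predictions and labels without subtracting `shift`. That is harmless for MAE, because the same constant would be subtracted from both sides and cancels in the difference. A small sketch with made-up numbers (the arrays below are illustrative, not real losses):

```python
import numpy as np

shift = 200                                     # same offset as in the preprocessing cell
loss = np.array([500.0, 1200.0, 3000.0])        # made-up raw losses
pred_loss = np.array([650.0, 1000.0, 3300.0])   # made-up predictions on the raw scale

labels = np.log(loss + shift)                   # what the model is trained on
preds = np.log(pred_loss + shift)               # what the model would output

# MAE as evalerror computes it: exp() on both sides, no shift removed.
mae_exp_only = np.mean(np.abs(np.exp(preds) - np.exp(labels)))

# MAE on the original loss scale.
mae_raw = np.mean(np.abs(pred_loss - loss))

print(np.isclose(mae_exp_only, mae_raw))        # True
```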

We use the starting parameter values suggested in this guide: https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/

In [3]:
RANDOM_STATE = 2016
params = {
        'min_child_weight': 1,
        'eta': 0.1,
        'colsample_bytree': 0.8,
        'max_depth': 5,
        'subsample': 0.8,
        'alpha': 1,
        'gamma': 0,
        'silent': 1,
        'seed': RANDOM_STATE,
        'eval_metric': 'mae',
        'verbose_eval': 2,
}

## Training the model

In [4]:
xgtrain = xgb.DMatrix(train_data, label=target)                   #training matrix
xgtest = xgb.DMatrix(test_data)                                   #testing matrix
model = xgb.train(params, xgtrain, 3000, feval=evalerror)
# 3000 rounds was chosen intuitively (after watching the iterations during fine-tuning)
prediction = np.exp(model.predict(xgtest)) - shift

Parameters: { "silent", "verbose_eval" } might not be used.

  This may not be accurate due to some parameters are only used in language bindings but
  passed down to XGBoost core.  Or some parameters are not used but slip through this
  verification. Please open an issue if you find above cases.




## Finding MAE

In [5]:
from sklearn.metrics import mean_absolute_error

error = mean_absolute_error(y_true, prediction)

print(error)

1149.7521575055487


The naive XGBoost model gives an MAE of 1149.75 on the held-out rows. It's time to fine-tune it.