# XGBoost with a single feature: X0

## Scenario
We will try to use only one feature: X0. It is the categorical feature with the highest cardinality, so it is likely related in some way to the car model. The problem is that some categories of X0 are present in the test dataset but absent from the training dataset.

To overcome this issue, we will transform X0 into a continuous feature, hoping that the model can then interpolate the missing categories. All other features will be ignored for this experiment.

The regression uses XGBoost and is evaluated locally with 10-fold cross validation.

## Load the data

In [1]:
import pandas as pd

def load_data(file):
    return pd.read_csv(file, index_col='ID')

In [2]:
train_df = load_data('../input/train.csv')
print("Train dataset has {} samples.".format(len(train_df)))
test_df = load_data('../input/test.csv')
print("Test dataset has {} samples.".format(len(test_df)))
train_df.head()

## Encoding X0 as a continuous feature

X0 is originally a categorical feature:

In [None]:
print(train_df['X0'].unique())

Since XGBoost, as used here, works only with numerical data, the categorical column cannot be passed in directly. We will use a transformation that maps the codes used by Mercedes-Benz to integers, which the model then treats as a continuous feature. We could have used a LabelEncoder, but it would have to be fitted on the combination of the train and test datasets (remember: some categories do not exist in the training set).

In [4]:
def mercedes_code_to_int(code):
    # Codes in order: 'a'..'z', then 'aa'..'az', then 'ba'..'bh'.
    vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j',
             'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't',
             'u', 'v', 'w', 'x', 'y', 'z', 'aa', 'ab', 'ac', 'ad',
             'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an',
             'ao', 'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax',
             'ay', 'az', 'ba', 'bb', 'bc', 'bd', 'be', 'bf', 'bg', 'bh',
    ]
    # The position in the list is the numeric value; unknown codes raise ValueError.
    return vocab.index(code)
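
The hand-written vocabulary follows the same ordering as spreadsheet column names (a to z, then aa, ab, and so on), so the index could also be computed arithmetically. The following is a minimal sketch under that assumption; `code_to_int_base26` is a hypothetical helper that is not part of the original notebook, and it assumes the codes contain only lowercase letters.

```python
def code_to_int_base26(code):
    """Map 'a' -> 0, ..., 'z' -> 25, 'aa' -> 26, ... (bijective base 26)."""
    value = 0
    for ch in code:
        # Each letter is a digit in 1..26; earlier digits shift one place left.
        value = value * 26 + (ord(ch) - ord('a') + 1)
    return value - 1  # zero-based, like list.index


for code in ['a', 'z', 'aa', 'az', 'ba', 'bh']:
    print(code, code_to_int_base26(code))
```

Unlike `vocab.index`, this sketch does not raise an error for codes beyond 'bh', which may or may not be desirable.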

In [5]:
def extract_feature_matrix(df):
    return df['X0'].apply(mercedes_code_to_int).values.reshape(-1, 1)

In [6]:
import numpy as np
train_X = extract_feature_matrix(train_df)
print(np.unique(train_X))

train_y = train_df['y'].values
print(train_X.shape)
print(train_y.shape)

Good! X0 was converted to a continuous feature. Do you see the gaps at 30, 32, 33, 39, 43, 47 and 53? Some of these values occur only in the test dataset, and others are absent from both datasets.
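
One way to tell the two kinds of gaps apart is to compare the sets of encoded values. The sketch below uses small made-up arrays (`toy_train_codes`, `toy_test_codes` and the range of 10 codes are illustrative, not the real data); with the notebook's data, the encoded train and test matrices would take their place.

```python
import numpy as np

# Made-up encoded values standing in for the train and test columns.
toy_train_codes = np.array([0, 1, 2, 5, 6, 9])
toy_test_codes = np.array([0, 2, 3, 5, 9])
all_codes = np.arange(10)  # every value the encoding could produce (toy range)

# Values the model never saw during training but must predict for.
only_in_test = np.setdiff1d(toy_test_codes, toy_train_codes)

# Values that appear in neither dataset.
seen_anywhere = np.union1d(toy_train_codes, toy_test_codes)
absent_from_both = np.setdiff1d(all_codes, seen_anywhere)

print("only in test:", only_in_test)
print("absent from both:", absent_from_both)
```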

## Creating a Model
We will be using XGBoost.

In [7]:
import xgboost as xgb

The parameters max_depth=2 and eta=0.1 are not the defaults (which are 6 and 0.3, respectively). The lower max_depth is intended to prevent overfitting, and the lower eta produces smaller steps, to get closer to the optimal result.

In [8]:
dtrain = xgb.DMatrix(data=train_X, label=train_y)
param = {'objective':'reg:linear', 'max_depth': 2, 'eta': 0.1}

We will use the R² score for local evaluation of the XGBoost model. It is the same metric Kaggle uses to evaluate submissions in this competition.

In [None]:
from sklearn.metrics import r2_score
def kaggle_eror_eval(preds, dtrain):
    return 'r^2', r2_score(y_pred=preds, y_true=dtrain.get_label())
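
For reference, R² compares the model's squared error with that of always predicting the mean of the targets: R² = 1 - SS_res / SS_tot, so higher is better and 1 is a perfect fit. The following is a minimal NumPy sketch of that definition; `r2_by_hand` is a hypothetical name, and `r2_score` from scikit-learn remains what the notebook uses.

```python
import numpy as np


def r2_by_hand(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of predicting the mean
    return 1.0 - ss_res / ss_tot


y = [90.0, 100.0, 110.0]
r2_perfect = r2_by_hand(y, y)                       # predictions equal the targets
r2_mean = r2_by_hand(y, [100.0, 100.0, 100.0])      # always predict the mean
print(r2_perfect, r2_mean)
```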

Evaluation will be based on a 10-fold cross validation. It will take a while because of the small steps we are taking (see eta above). It can be accelerated by increasing eta, at the expense of the evaluation score.

In [10]:
# R² is a score (higher is better), so early stopping must maximize it.
cv_results = xgb.cv(param, dtrain, 1000, nfold=10, verbose_eval=False, feval=kaggle_eror_eval,
                    maximize=True, early_stopping_rounds=20, seed=42, as_pandas=True)
cv_results.tail()
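
As a rough illustration of what 10-fold cross validation does with the row indices, the sketch below splits a toy index range into 10 folds, each used once for validation while the other nine are used for training. This is only a sketch: `xgb.cv` handles the splitting internally and its exact shuffling may differ, and `n_samples = 25` is an arbitrary toy size.

```python
import numpy as np

n_samples = 25  # toy size, not the real dataset
n_folds = 10

# Shuffle the row indices once, then cut them into 10 nearly equal folds.
rng = np.random.RandomState(42)
shuffled = rng.permutation(n_samples)
folds = np.array_split(shuffled, n_folds)

for i, valid_idx in enumerate(folds):
    # Train on every fold except the i-th, validate on the i-th.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print("fold", i, "train size", len(train_idx), "validation size", len(valid_idx))
```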

Now that we have a reference for the number of rounds and the R² score, we will train using all the available training data.

In [11]:
bst = xgb.train(param, dtrain, num_boost_round=len(cv_results))
bst

## Create a Submission
After training our model, we need to create a final submission file based on the test dataset.

Let's start by extracting the features from the dataframe, using the same approach we used for the training dataset.

In [12]:
test_X = extract_feature_matrix(test_df)
print(np.unique(test_X))
print(test_X.shape)

Now that we have the features, we will just make the predictions using the previously trained model.

In [13]:
dtest = xgb.DMatrix(data=test_X)
predictions = bst.predict(dtest)

The predictions are stored in a pandas DataFrame, which can later be used to generate the submission file.

In [None]:
submission = pd.DataFrame(index=test_df.index,
                          data={'y': predictions})
submission
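
Generating the file is then a single `to_csv` call. The sketch below uses a tiny stand-in DataFrame with made-up values (`toy_submission` is illustrative); the `ID` index name mirrors `index_col='ID'` in `load_data`.

```python
import pandas as pd

# Stand-in for the real submission DataFrame: ID index, one 'y' column.
toy_submission = pd.DataFrame(index=pd.Index([1, 2, 3], name='ID'),
                              data={'y': [100.5, 98.2, 101.0]})

# Without a path, to_csv returns the CSV text, which is handy for a quick check.
csv_text = toy_submission.to_csv()
print(csv_text)
```

Passing a path instead, for example `submission.to_csv('submission.csv')` with a file name of your choosing, writes the same content to disk.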