*Author: BingQing Wei*

**Introduction**
I first tried labeling the data with regular expressions ([my notebook](https://www.kaggle.com/alphasis/regex-label-data)), but I later found that I just couldn't make that approach perfect.
So I turned once again to machine learning.
I decided to use XGBoost as a starting point, because I think decision trees may be well suited to classifying this kind of data.

**How I train it**
I didn't use data labeled 'PLAIN', 'VERBATIM', 'LETTERS' or 'PUNCT' to train the model.
'VERBATIM' and 'PUNCT' data can be labeled trivially.
'PLAIN' and 'LETTERS' data can only be classified given their context, which is something an RNN is good at but XGBoost is not.
So I trained the XGBoost model on the remaining 12 classes, and used only the first 200,000 samples (`max_size` in the code) since I didn't want to be kept waiting for too long.

**Results**
Accuracy is 97.9% on the validation data.
The output of this script consists of 3 files:

*xgb_model*: the saved model that we trained
*pred.csv*: the validation data with the model's predictions and the true labels
*errors.csv*: the validation rows that the model predicted wrongly

If you look into errors.csv, you will find that the 0.021 error rate is reasonable:
*some special 'CARDINAL' data are classified as 'DATE'*.
Again, this is something beyond XGBoost's ability.

We begin by loading the data and dropping the 'PLAIN', 'VERBATIM', 'LETTERS' and 'PUNCT' rows.

In [None]:
import pandas as pd
import numpy as np
import os
import pickle

max_num_features = 20

out_path = r'.'
df = pd.read_csv(r'../input/en_train.csv')
exclude_classes = ['PLAIN', 'VERBATIM', 'LETTERS', 'PUNCT']
# Keep only the rows whose class is not in the excluded list
df = df.loc[~df['class'].isin(exclude_classes)]

To convert strings into numbers, I simply take each character's ASCII value and subtract the value of 'a'.
I think subtracting 'a' is important,
because **it distinguishes letters from digits and some symbols**.
*Since most letters in the data are lower-case, we don't consider upper-case here.*
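
As a small illustration of this encoding, here is a minimal sketch in plain Python. The helper name `encode` and the sample tokens are made up for this example; the width of 20 matches `max_num_features` in the notebook.

```python
MAX_NUM_FEATURES = 20  # same width as max_num_features in the notebook


def encode(token, width=MAX_NUM_FEATURES):
    """Map each character to ord(ch) - ord('a'), zero-padded to `width`."""
    row = [0] * width
    # Characters beyond `width` are dropped; unused positions stay 0
    for i, ch in enumerate(str(token)[:width]):
        row[i] = ord(ch) - ord('a')
    return row


letters = encode("abz")   # lower-case letters land in 0..25
digits = encode("2017")   # digits land below zero
symbols = encode("$5")    # '$' lands below zero as well

print(letters[:4])  # [0, 1, 25, 0]
print(digits[:4])   # [-47, -49, -48, -42]
print(symbols[:4])  # [-61, -44, 0, 0]
```

With this shift, lower-case letters are non-negative while digits and ASCII punctuation such as '$' or '/' are negative, so a tree can separate them with a split near zero. Upper-case letters also come out negative ('A' maps to -32), which fits the choice above not to treat them specially.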

In [None]:
max_size = 200000
x_data = []
y_data = pd.factorize(df['class'])
labels = y_data[1]
y_data = y_data[0]
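for x in df['before'].values:
    # Fixed-width row; positions past the end of the token stay 0
    x_row = np.zeros(max_num_features, dtype=int)
    # Encode at most the first max_num_features characters
    for i, xi in enumerate(str(x)[:max_num_features]):
        x_row[i] = ord(xi) - ord('a')
    x_data.append(x_row)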

print('Total number of samples:', len(x_data))
print('Use: ', max_size)
#x_data = np.array(x_data)
#y_data = np.array(y_data)
x_data = np.array(x_data[:max_size])
y_data = np.array(y_data[:max_size])

print('x_data sample:')
print(x_data[0])
print('y_data sample:')
print(y_data[0])
print('labels:')
print(labels)

del df

Next we begin training the model.

In [None]:
import xgboost as xgb
import numpy as np
import pickle
import os
import re
import pandas as pd
from sklearn.model_selection import train_test_split

out_path = r'.'

x_train = x_data
y_train = y_data
del x_data
del y_data

x_train, x_valid, y_train, y_valid= train_test_split(x_train, y_train,
                                                      test_size=0.1, random_state=2017)
num_class = len(labels)
dtrain = xgb.DMatrix(x_train, label=y_train)
dvalid = xgb.DMatrix(x_valid, label=y_valid)
# Early stopping monitors the LAST entry of the watchlist, so the validation set goes last
watchlist = [(dtrain, 'train'), (dvalid, 'valid')]

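# multi:softmax makes predict() return the class index directly
# Note: newer XGBoost versions use 'verbosity' in place of 'silent'
param = {'objective':'multi:softmax',
         'eta':0.3, 'max_depth':10,
         'silent':1, 'nthread':-1,
         'num_class':num_class,
         'eval_metric':'merror'}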
model = xgb.train(param, dtrain, 60, watchlist, early_stopping_rounds=20,
                  verbose_eval=10)

Next we take the model's predictions on the validation data and save them to CSV files.

In [None]:
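pred = model.predict(dvalid)
# Map class indices back to class names
pred = [labels[int(x)] for x in pred]
y_valid = [labels[x] for x in y_valid]
# Decode each feature row back into text (inverse of the encoding step)
x_valid = [''.join(chr(v + ord('a')) for v in row) for row in x_valid]
# Strip the zero padding, which decodes to trailing 'a' characters
x_valid = [re.sub('a+$', '', x) for x in x_valid]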

df_pred = pd.DataFrame(columns=['data', 'predict', 'target'])
df_pred['data'] = x_valid
df_pred['predict'] = pred
df_pred['target'] = y_valid
df_pred.to_csv(os.path.join(out_path, 'pred.csv'))

df_errors = df_pred.loc[df_pred['predict'] != df_pred['target']]
df_errors.to_csv(os.path.join(out_path, 'errors.csv'))

model.save_model(os.path.join(out_path, 'xgb_model'))
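
One thing to keep in mind when reading the `data` column of pred.csv: the padding value 0 is also the code for 'a', so stripping trailing 'a' characters can shorten tokens that really end in 'a'. The sketch below replays the encode and decode steps in plain Python; the helper names `encode` and `decode` and the sample tokens are made up for this illustration.

```python
import re

WIDTH = 20  # same width as max_num_features in the notebook


def encode(token, width=WIDTH):
    """Encode characters as ord(ch) - ord('a'), zero-padded to `width`."""
    row = [0] * width
    for i, ch in enumerate(str(token)[:width]):
        row[i] = ord(ch) - ord('a')
    return row


def decode(row):
    """Invert encode(), then strip the padding (which decodes to 'a')."""
    text = ''.join(chr(v + ord('a')) for v in row)
    return re.sub('a+$', '', text)


print(decode(encode("2017")))  # 2017
print(decode(encode("data")))  # dat  (the real trailing 'a' is stripped too)
```

This only affects how the text is displayed in the CSV files. It also means that, in the features themselves, a trailing 'a' looks the same as padding to the model.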

Since Kaggle Notebook doesn't show the contents of the output files, I'm going to display the first few rows here instead.

In [None]:
df_pred[:10]

In [None]:
df_errors[:10]
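
To see where the errors concentrate, one option is to count (target, predict) pairs. The sketch below uses a few made-up rows as a stand-in for errors.csv: the rows and the names `targets`, `predicts` and `confusions` are hypothetical, not real results. On the real data, the same count could be run on `df_errors['target']` and `df_errors['predict']`.

```python
from collections import Counter

# Hypothetical rows standing in for errors.csv (NOT real results)
targets = ['CARDINAL', 'CARDINAL', 'DATE', 'MEASURE']
predicts = ['DATE', 'DATE', 'CARDINAL', 'DECIMAL']

# Count how often each (true class, predicted class) pair occurs
confusions = Counter(zip(targets, predicts))

# Most frequent confusions first
for (target, predict), count in confusions.most_common():
    print(target, '->', predict, ':', count)
```

If 'CARDINAL' predicted as 'DATE' comes out on top for the real errors.csv, that would back up the observation made in the Results section.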