In my previous notebook I covered a very simple approach: I converted the categorical variables to numbers using only the index of each category level, and made one further alteration to the output to make it symmetric:
[Just an easy solution][1]

In this solution I will try to do a better job with the categorical variables. This can matter a lot, because XGBoost (which I will use again) splits on relational operators: at each step it divides the dataset into "<" and ">=" of a given value. If the target variable is unrelated to the numeric order of a feature's values, the internal decision trees may not find the relevant rules easily and will approach the goal in small steps, just cutting off the tails of the feature. So let's help the decision trees and give them meaningful inputs!
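
To make the point about "<" / ">=" splits concrete, here is a minimal sketch on made-up data (the `toy` frame and the `best_split_sse` helper are hypothetical and not part of the competition data or of XGBoost). It compares the best single threshold split on arbitrary integer codes with the best split on per-category target means:

```python
import numpy as np
import pandas as pd

def best_split_sse(x, y):
    """Lowest total squared error reachable with one '<' / '>=' split on x."""
    best = np.inf
    for t in np.unique(x)[1:]:  # every threshold that leaves both sides non-empty
        left, right = y[x < t], y[x >= t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = min(best, sse)
    return best

# Toy data: categories A and C have a low target, B and D a high one
toy = pd.DataFrame({
    "cat": ["A", "B", "C", "D"] * 25,
    "y":   [1.0, 5.0, 1.0, 5.0] * 25,
})
y = toy["y"].to_numpy()

# Encoding 1: arbitrary integer codes (A=0, B=1, C=2, D=3)
codes = toy["cat"].astype("category").cat.codes.to_numpy()
# Encoding 2: each category replaced by its mean target value
means = toy["cat"].map(toy.groupby("cat")["y"].mean()).to_numpy()

sse_codes = best_split_sse(codes, y)
sse_means = best_split_sse(means, y)
print("best single split, integer codes:", sse_codes)
print("best single split, target means: ", sse_means)
```

With the integer codes, no single threshold can separate {A, C} from {B, D}, so a tree needs several splits. With the mean encoding, the two groups sit on opposite sides of one threshold.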


  [1]: https://www.kaggle.com/guyko81/allstate-claims-severity/just-an-easy-solution

The first part will be the same as in [Just an easy solution][1].

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import xgboost as xgb # XGBoost implementation

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

# read data
train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

features = [x for x in train.columns if x not in ['id','loss']]
#print(features)

cat_features = [x for x in train.select_dtypes(include=['object']).columns if x not in ['id','loss']]
num_features = [x for x in train.select_dtypes(exclude=['object']).columns if x not in ['id','loss']]
print(cat_features)
print(num_features)

train['log_loss'] = np.log(train['loss'])

I'm going to replace each category with the average value of the target (log_loss in our case) for that category. Let's see how it changes the 'cat1' variable:

In [None]:
# Work on a copy so the original `train` frame stays untouched
train_x = train[features].copy()
# Mean of log_loss per level of cat1, indexed by level
cat1_means = train.groupby('cat1')['log_loss'].mean()
# Replace each level with its mean target value
train_x['cat1'] = train_x['cat1'].map(cat1_means)
train_x.head(n=20)

Nice, just perfect! Hopefully it will help and be worth the work. 
Let's do it for all of the variables! 
(don't forget the test dataset)

In [None]:
train_x = train[features].copy()
test_x = test[features].copy()
for c in cat_features:
    # Per-level mean of log_loss, learned on the training data only
    level_means = train.groupby(c)['log_loss'].mean()
    train_x[c] = train_x[c].map(level_means)
    # Levels that appear only in test have no mean and become NaN,
    # which XGBoost treats as missing values
    test_x[c] = test_x[c].map(level_means)

train_x.head(n=20)
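
One detail about the test dataset: the per-category means are learned on the training data, so a level that occurs only in the test data has no mean to look up and comes out as NaN. The sketch below uses toy frames and a hypothetical helper, `mean_encode`, to show one possible way to handle this, by filling such levels with the overall training mean. That fallback is an assumption for illustration, not what the cell above does.

```python
import numpy as np
import pandas as pd

def mean_encode(train_col, target, test_col):
    """Replace category labels with the train-set mean of `target` per label.

    Labels that occur only in `test_col` have no train mean; they are filled
    with the overall train mean here (an assumption, not the notebook's choice).
    """
    lookup = target.groupby(train_col).mean()   # label -> mean target
    fallback = target.mean()                    # used for unseen labels
    return train_col.map(lookup), test_col.map(lookup).fillna(fallback)

toy_train = pd.DataFrame({"cat": ["A", "A", "B", "B"],
                          "y": [1.0, 3.0, 10.0, 20.0]})
toy_test = pd.DataFrame({"cat": ["B", "A", "Z"]})  # "Z" never appears in train

enc_train, enc_test = mean_encode(toy_train["cat"], toy_train["y"], toy_test["cat"])
print(enc_train.tolist())
print(enc_test.tolist())
```

Whether a fallback value or a plain NaN works better is something to check on a validation split rather than assume.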

Come XGBoost, do it for us :)

In [None]:
xgdmat = xgb.DMatrix(train_x, train['log_loss']) # Create our DMatrix to make XGBoost more efficient

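# 'reg:squarederror' is the current name of the squared-error objective
# (older XGBoost releases called it 'reg:linear')
params = {'eta': 0.01, 'seed': 0, 'subsample': 0.5, 'colsample_bytree': 0.5,
          'objective': 'reg:squarederror', 'max_depth': 6, 'min_child_weight': 3}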

num_rounds = 1000
bst = xgb.train(params, xgdmat, num_boost_round = num_rounds)

And now the feature importance. Will it differ from the previous model's?

In [None]:
import matplotlib.pyplot as plt
import operator

def create_feature_map(features):
    # Write an XGBoost feature map: one "<index>\t<name>\tq" line per feature
    with open('xgb.fmap', 'w') as outfile:
        for i, feat in enumerate(features):
            outfile.write('{0}\t{1}\tq\n'.format(i, feat))

create_feature_map(features)

importance = bst.get_fscore(fmap='xgb.fmap')
importance = sorted(importance.items(), key=operator.itemgetter(1))

df = pd.DataFrame(importance, columns=['feature', 'fscore'])
df['fscore'] = df['fscore'] / df['fscore'].sum()

# Horizontal bar chart of the normalized scores, least important at the bottom
df.plot(kind='barh', x='feature', y='fscore', legend=False, figsize=(6, 10))
plt.title('XGBoost Feature Importance')
plt.xlabel('relative importance')
plt.gcf().savefig('feature_importance_xgb.png')

df

It definitely differs. It seems that more categorical variables reached the top 10 of the importance list: 5 vs. 2 in the previous model.

Ok, model prediction again.

In [None]:
test_xgb = xgb.DMatrix(test_x)
submission = pd.read_csv("../input/sample_submission.csv")
submission.iloc[:, 1] = np.exp(bst.predict(test_xgb))
submission.to_csv('xgb_starter.cat_mean.csv', index=False)

Hmm, the result is better, but not by much: 1138.56

My guess is that 1000 trees can largely compensate for inputs that are not meaningfully ordered. But at least we made some progress. 