# A quick and dirty training and submission

This is my first kernel, and I'm trying to get used to what Kaggle has to offer. The aim here is to quickly train a regression model and create a submission. Nothing fancy.

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from subprocess import check_output
import matplotlib.pyplot as plt
import seaborn as sns
import xgboost
print(check_output(["ls", "../input"]).decode("utf8"))

# Data Wrangling

In this kernel, I am only going to use the visitor data and the day-of-week/holiday information for training and prediction.

In [None]:
DATA_HOME_DIR = '../input'

In [None]:
air_visit = pd.read_csv(os.path.join(DATA_HOME_DIR, 'air_visit_data.csv'))
date_info = pd.read_csv(os.path.join(DATA_HOME_DIR, 'date_info.csv'))

We will merge the `air_visit` and `date_info` tables, matching `visit_date` against `calendar_date`. The `day_of_week` column will be one-hot encoded.

In [None]:
air_visit_date = pd.merge(air_visit, date_info, how='left', left_on='visit_date', right_on='calendar_date')
one_hot = pd.get_dummies(air_visit_date['day_of_week'])
X_train_all = air_visit_date[['holiday_flg']].join(one_hot)
y_train_all = air_visit_date['visitors']
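
To make the merge and the one-hot encoding concrete, here is a minimal sketch on toy data. The two small frames and their values are made up for illustration (only the column names match the real files), so treat it as a picture of the shape of the result rather than the actual data.

```python
import pandas as pd

# Toy stand-ins for air_visit_data.csv and date_info.csv (values are invented)
visits = pd.DataFrame({
    "air_store_id": ["air_a", "air_a", "air_b"],
    "visit_date": ["2016-01-01", "2016-01-02", "2016-01-01"],
    "visitors": [10, 25, 7],
})
calendar = pd.DataFrame({
    "calendar_date": ["2016-01-01", "2016-01-02"],
    "day_of_week": ["Friday", "Saturday"],
    "holiday_flg": [1, 0],
})

# Left merge keeps every visit row and attaches the calendar info for its date
merged = pd.merge(visits, calendar, how="left",
                  left_on="visit_date", right_on="calendar_date")

# One indicator column per day name that appears in the data
one_hot = pd.get_dummies(merged["day_of_week"])
features = merged[["holiday_flg"]].join(one_hot)
print(features)
```

Each row ends up with the holiday flag plus one indicator column per day name, which is exactly the layout `X_train_all` has.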

In [None]:
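# hold out roughly 10% of the rows, picked at random, for validation
validation = 0.1
mask = np.random.rand(len(X_train_all)) < validation
# rows where mask is False go to training, rows where it is True go to validation
X_train = X_train_all[~mask]
y_train = y_train_all[~mask]
X_validation = X_train_all[mask]
y_validation = y_train_all[mask]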

# Train
OK, we will train a model using only `day_of_week` and `holiday_flg`. I really don't expect this model to perform well, but it will be interesting to test. We will use `XGBRegressor` to build the regression model.

In [None]:
# we will simply use the default hyperparameters
xgb = xgboost.XGBRegressor()

In [None]:
xgb.fit(X_train, y_train)

# Validation

In [None]:
y_test = xgb.predict(X_validation)

In [None]:
# RMSLE: take the difference of the logs first, then square, average and take the root
rmsle = np.sqrt(np.mean((np.log1p(y_test) - np.log1p(y_validation))**2))
print(rmsle)
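
To see what the metric measures, here is a small standalone sketch of RMSLE. The helper name `rmsle` and the toy numbers are mine, and it uses the standard `math` module instead of NumPy only to stay self-contained. The key point is that the log difference is taken per prediction before squaring.

```python
import math

def rmsle(predicted, actual):
    """Root mean squared logarithmic error over two equal-length sequences."""
    squared_log_errors = [
        (math.log1p(p) - math.log1p(a)) ** 2
        for p, a in zip(predicted, actual)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))

# Perfect predictions give an error of zero
perfect = rmsle([10, 20, 30], [10, 20, 30])

# Here every (prediction + 1) is exactly twice (actual + 1),
# so the error equals log(2), about 0.693
doubled = rmsle([21, 41], [10, 20])

print(perfect, doubled)
```

Because the error lives in log space, it depends on the ratio between prediction and actual value, not on their absolute difference.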

OK, we have an RMSLE score of 0.98. Because the metric uses the natural log, this roughly means my predictions are typically off by a factor of about e^0.98 ≈ 2.7 (too big or too small) relative to the actual value. Let's look at the scatter plot.

In [None]:
plt.scatter(y_validation, y_test)
plt.xlabel("Visitor (actual)")
plt.ylabel("Visitor (predicted)")
plt.show()

In general, the model only predicted between 16 and 28 visitors, while the actual visitor counts range from 0 to 200 (and in one case 500). I will leave it at that and come back later for a more detailed analysis.
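
One likely reason for the narrow range (this is my reading, not something checked in the kernel): the features are just a holiday flag and seven one-hot day columns, so there are at most 14 distinct feature rows. A deterministic model gives the same prediction for identical rows, so it can output at most 14 different values. A quick sketch that counts the combinations:

```python
from itertools import product

days = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]
holiday_flags = [0, 1]

# Build every possible feature row: (holiday_flg, one-hot day indicators...)
distinct_rows = set()
for day, flag in product(days, holiday_flags):
    row = (flag,) + tuple(int(day == d) for d in days)
    distinct_rows.add(row)

print(len(distinct_rows))  # 14
```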

# Test
Now we will create the predictions and the submission file.

In [None]:
# retrain from scratch on all rows, including the ones held out for validation
xgb = xgboost.XGBRegressor()
xgb.fit(X_train_all, y_train_all)

In [None]:
submission = pd.read_csv(os.path.join(DATA_HOME_DIR, 'sample_submission.csv'))
# each id is <air_store_id>_<visit_date>; the store id itself contains one underscore
air_store_id = ['_'.join(sub_id.split('_')[:2]) for sub_id in submission['id']]
visit_date = [sub_id.split('_')[2] for sub_id in submission['id']]
air_visit_test = pd.DataFrame({'air_store_id': air_store_id, 'visit_date': visit_date})
# build the same features as for training: holiday flag plus one-hot day of week
air_visit_date_test = pd.merge(air_visit_test, date_info, how='left', left_on='visit_date', right_on='calendar_date')
one_hot = pd.get_dummies(air_visit_date_test['day_of_week'])
X_test = air_visit_date_test[['holiday_flg']].join(one_hot)
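
The id parsing above can be sketched on a single value. The helper name `split_submission_id` and the example id are only for illustration, and I am assuming every id has the form `<air_store_id>_<visit_date>`. Splitting once from the right avoids having to count the underscores inside the store id.

```python
def split_submission_id(submission_id):
    """Split '<air_store_id>_<visit_date>' at the last underscore."""
    store_id, visit_date = submission_id.rsplit("_", 1)
    return store_id, visit_date

# Illustrative id in the assumed format
example_id = "air_00a91d42b08b08d9_2017-04-23"
store_id, visit_date = split_submission_id(example_id)

print(store_id)    # air_00a91d42b08b08d9
print(visit_date)  # 2017-04-23
```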

In [None]:
y_test = xgb.predict(X_test)
# overwrite the placeholder visitors column with the predictions
submission['visitors'] = y_test

In [None]:
submission.to_csv('submission.csv', index=False)