# A simple linear baseline for the Walmart challenge
This notebook shows how to load the data, prepare it for use with Keras, and create a submission file. The model is a simple linear regression.

In [1]:
import pandas as pd
import numpy as np

## Loading the data
On Kaggle, data that a Kernel can access is stored under ``../input/``.
From there we can load it with pandas:

In [2]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

We are going to do some data preparation. It is easiest to do this on the training and test sets combined, so that every step has to be done only once. We need to know where to split the combined set afterwards, though, so we check the sizes first.

In [3]:
len(train) # Get number of training examples

282451

In [4]:
len(test) # Get number of test examples

139119

In [5]:
df = pd.concat([train,test],axis=0) # Join train and test
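The combine-then-split pattern can be sketched on toy data. This is only an illustration: the frames `toy_train` and `toy_test` and their values are made up, and `ignore_index=True` is an optional choice that the notebook itself does not use.

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (hypothetical values)
toy_train = pd.DataFrame({'Size': [100, 200, 300],
                          'Weekly_Sales': [1.0, 2.0, 3.0]})
toy_test = pd.DataFrame({'Size': [400, 500]})

# Remember the split point before joining
n_train = len(toy_train)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# ... shared preprocessing would go here ...

# Positional slicing recovers the two parts
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]
```

Because the test rows have no `Weekly_Sales`, pandas fills that column with `NaN` for them, which is where the 139119 missing target values in the real data come from.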

In [6]:
df.head() # Get an overview of the data

Unnamed: 0,CPI,Date,Dept,Fuel_Price,IsHoliday,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,Size,Store,Temperature,Type,Unemployment,Weekly_Sales
0,136.213613,2011-08-26,92,3.796,False,,,,,,152513,26,61.1,A,7.767,87235.57
1,128.616064,2011-03-25,22,3.48,False,,,,,,158114,34,53.11,A,10.398,5945.97
2,211.265543,2010-12-03,28,2.708,False,,,,,,140167,21,50.43,B,8.163,1219.89
3,214.878556,2010-09-17,9,2.582,False,,,,,,155078,8,75.32,A,6.315,11972.71
4,138.106581,2012-05-18,55,4.029,False,12613.98,,11.5,1705.28,3600.79,203819,19,58.81,A,8.15,8271.82


In [7]:
df.describe()

Unnamed: 0,CPI,Dept,Fuel_Price,MarkDown1,MarkDown2,MarkDown3,MarkDown4,MarkDown5,Size,Store,Temperature,Unemployment,Weekly_Sales
count,421570.0,421570.0,421570.0,150681.0,111248.0,137091.0,134967.0,151432.0,421570.0,421570.0,421570.0,421570.0,282451.0
mean,171.201947,44.260317,3.361027,7246.420196,3334.628621,1439.421384,3383.168256,4628.975079,136727.915739,22.200546,60.090059,7.960289,15983.429692
std,39.159276,30.492054,0.458515,8291.221345,9475.357325,9623.07829,6292.384031,5962.887455,60980.583328,12.785297,18.447931,1.863296,22661.092494
min,126.064,1.0,2.472,0.27,-265.76,-29.1,0.22,135.16,34875.0,1.0,-2.06,3.879,-4988.94
25%,132.022667,18.0,2.933,2240.27,41.6,5.08,504.22,1878.44,93638.0,11.0,46.68,6.891,2079.33
50%,182.31878,37.0,3.452,5347.45,192.0,24.6,1481.31,3359.45,140167.0,22.0,62.09,7.866,7616.55
75%,212.416993,74.0,3.738,9210.9,1926.94,103.99,3595.04,5563.8,202505.0,33.0,74.28,8.572,20245.745
max,227.232807,99.0,4.468,88646.76,104519.54,141630.61,67474.85,108519.28,219622.0,45.0,100.14,14.313,693099.36


The counts differ between columns, so some values are missing. We have to make sure to deal with them before feeding anything into the network.

In [8]:
df.isnull().sum()

CPI                  0
Date                 0
Dept                 0
Fuel_Price           0
IsHoliday            0
MarkDown1       270889
MarkDown2       310322
MarkDown3       284479
MarkDown4       286603
MarkDown5       270138
Size                 0
Store                0
Temperature          0
Type                 0
Unemployment         0
Weekly_Sales    139119
dtype: int64

We will do a bit of very basic feature engineering here by creating a feature which indicates whether a certain markdown was active at all.

In [9]:
df = df.assign(md1_present = df.MarkDown1.notnull())
df = df.assign(md2_present = df.MarkDown2.notnull())
df = df.assign(md3_present = df.MarkDown3.notnull())
df = df.assign(md4_present = df.MarkDown4.notnull())
df = df.assign(md5_present = df.MarkDown5.notnull())
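The same flags could also be built in a loop. The following is a minimal sketch on a made-up frame `toy` with only two markdown columns; it is meant to show the idea, not to replace the cell above.

```python
import numpy as np
import pandas as pd

# Toy frame with two markdown columns (hypothetical values)
toy = pd.DataFrame({
    'MarkDown1': [np.nan, 12.5, np.nan],
    'MarkDown2': [3.0, np.nan, np.nan],
})

markdown_cols = ['MarkDown1', 'MarkDown2']
for i, col in enumerate(markdown_cols, start=1):
    # True where a markdown value was recorded, False where it is missing
    toy['md{}_present'.format(i)] = toy[col].notnull()
```

Note that the flags have to be created while the missing values are still `NaN`. Once the gaps are filled with a number, `notnull()` is true everywhere.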

In [10]:
df.isnull().sum()

CPI                  0
Date                 0
Dept                 0
Fuel_Price           0
IsHoliday            0
MarkDown1       270889
MarkDown2       310322
MarkDown3       284479
MarkDown4       286603
MarkDown5       270138
Size                 0
Store                0
Temperature          0
Type                 0
Unemployment         0
Weekly_Sales    139119
md1_present          0
md2_present          0
md3_present          0
md4_present          0
md5_present          0
dtype: int64

We can probably safely fill all missing values with zero. For the markdowns this means that there was no markdown. For the weekly sales, the missing values are the ones we have to predict, so it does not really matter what we fill in there.

In [11]:
df.fillna(0, inplace=True)
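If we wanted to leave the target column untouched, `fillna` also accepts a dictionary that maps column names to fill values. This is an alternative sketch on a made-up frame, not what the notebook does:

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values); the NaN in Weekly_Sales marks a test row
toy = pd.DataFrame({
    'MarkDown1': [np.nan, 12.5],
    'MarkDown2': [3.0, np.nan],
    'Weekly_Sales': [100.0, np.nan],
})

# Only the listed columns are filled; all others keep their NaNs
fill_values = {'MarkDown1': 0, 'MarkDown2': 0}
filled = toy.fillna(fill_values)
```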

In [12]:
df.dtypes

CPI             float64
Date             object
Dept              int64
Fuel_Price      float64
IsHoliday          bool
MarkDown1       float64
MarkDown2       float64
MarkDown3       float64
MarkDown4       float64
MarkDown5       float64
Size              int64
Store             int64
Temperature     float64
Type             object
Unemployment    float64
Weekly_Sales    float64
md1_present        bool
md2_present        bool
md3_present        bool
md4_present        bool
md5_present        bool
dtype: object

Now we have to create some dummy variables for the categorical data.

In [13]:
# Make sure we can later recognize what a dummy once belonged to
df['Type'] = 'Type_' + df['Type'].map(str)
df['Store'] = 'Store_' + df['Store'].map(str)
df['Dept'] = 'Dept_' + df['Dept'].map(str)

In [14]:
# Create dummies
type_dummies = pd.get_dummies(df['Type'])
store_dummies = pd.get_dummies(df['Store'])
dept_dummies = pd.get_dummies(df['Dept'])

In [15]:
# Add dummies
df = pd.concat([df,type_dummies,store_dummies,dept_dummies],axis=1)

In [16]:
# Remove originals
del df['Type']
del df['Store']
del df['Dept']
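The four steps above (prefix, create dummies, add them, remove the originals) can also be done in a single `pd.get_dummies` call by passing the `columns` argument. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (hypothetical values)
toy = pd.DataFrame({
    'Type': ['A', 'B', 'A'],
    'Store': [1, 2, 1],
    'Size': [100, 200, 100],
})

# The listed columns are replaced by dummies named <column>_<value>;
# columns that are not listed (here Size) are passed through unchanged
encoded = pd.get_dummies(toy, columns=['Type', 'Store'])
```

The resulting column names follow the same `Type_A`, `Store_1` pattern as the manual approach. The dtype of the dummy columns depends on the pandas version.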

In [17]:
del df['Date']

In [18]:
df.dtypes

CPI             float64
Fuel_Price      float64
IsHoliday          bool
MarkDown1       float64
MarkDown2       float64
MarkDown3       float64
MarkDown4       float64
MarkDown5       float64
Size              int64
Temperature     float64
Unemployment    float64
Weekly_Sales    float64
md1_present        bool
md2_present        bool
md3_present        bool
md4_present        bool
md5_present        bool
Type_A            uint8
Type_B            uint8
Type_C            uint8
Store_1           uint8
Store_10          uint8
Store_11          uint8
Store_12          uint8
Store_13          uint8
Store_14          uint8
Store_15          uint8
Store_16          uint8
Store_17          uint8
Store_18          uint8
                 ...   
Dept_59           uint8
Dept_6            uint8
Dept_60           uint8
Dept_65           uint8
Dept_67           uint8
Dept_7            uint8
Dept_71           uint8
Dept_72           uint8
Dept_74           uint8
Dept_77           uint8
Dept_78         

Now we can split the data into train and test sets again.

In [19]:
train = df.iloc[:282451] # The first 282451 rows are the training examples
test = df.iloc[282451:] # The remaining rows are the test examples

In [29]:
test = test.drop('Weekly_Sales',axis=1) # Drop the placeholder zeros we filled in for the test targets

To get NumPy arrays out of the pandas DataFrame, we can ask for the values of a single column or of the whole DataFrame.

In [21]:
y = train['Weekly_Sales'].values

In [22]:
X = train.drop('Weekly_Sales',axis=1).values

In [23]:
X.shape

(282451, 145)
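Our frame mixes boolean, integer and float columns, so depending on the pandas version `.values` may come back as a generic `object` array. If that causes trouble, an explicit cast gives a plain float array. A minimal sketch on a made-up frame `toy` (the cast is an optional addition, not part of the cells above):

```python
import numpy as np
import pandas as pd

# Toy frame mixing float, bool and uint8 columns (hypothetical values)
toy = pd.DataFrame({
    'CPI': [136.2, 128.6],
    'IsHoliday': [False, True],
    'Type_A': np.array([1, 0], dtype='uint8'),
    'Weekly_Sales': [87235.57, 5945.97],
})

# Cast to float32 so that booleans become 0.0 / 1.0
y_toy = toy['Weekly_Sales'].values.astype('float32')
X_toy = toy.drop('Weekly_Sales', axis=1).values.astype('float32')
```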

Now we create the baseline model: a single dense unit without an activation function, which is a linear regression.

In [24]:
from keras.layers import Dense
from keras.models import Sequential

Using TensorFlow backend.


In [25]:
model = Sequential()
model.add(Dense(1,input_dim=145))
model.compile(optimizer='adam', loss='mae')
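To see why this is a linear regression, here is what one dense unit without activation computes, and how the `'mae'` loss is calculated, written out in plain NumPy. The weights `w`, the bias `b` and the data are made up for illustration; in the real model Keras learns them during training.

```python
import numpy as np

# Toy inputs and targets (hypothetical values)
X_toy = np.array([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])
y_toy = np.array([5.0, 11.0, 20.0])

# Made-up parameters: one weight per input feature plus one bias
w = np.array([1.0, 2.0])
b = 0.0

# Dense(1) without activation: a weighted sum of the features plus the bias
y_hat = X_toy @ w + b

# Mean absolute error between targets and predictions
mae = np.mean(np.abs(y_toy - y_hat))
```

Here the predictions are 5, 11 and 17, so the absolute errors are 0, 0 and 3 and the MAE is 1.0.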

In [26]:
model.fit(X,y,batch_size=2048,epochs=5)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5


<keras.callbacks.History at 0x7f3f5f5eaa58>

Once the model is trained, we can use it to make predictions on the test set.

In [31]:
X_test = test.values

In [32]:
y_pred = model.predict(X_test,batch_size=2048)

To create the ids required for the submission, we need the original test file one more time.

In [35]:
testfile = pd.read_csv('../input/test.csv')

Now we create the submission. Once you run the kernel you can download the submission from its outputs and upload it to the Kaggle InClass competition page.

In [38]:
submission = pd.DataFrame({'id':testfile['Store'].map(str) + '_' + testfile['Dept'].map(str) + '_' + testfile['Date'].map(str),
                          'Weekly_Sales':y_pred.flatten()})

In [None]:
submission.to_csv('submission.csv',index=False)
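Before uploading, it can be worth checking that the submission has one row per test example and that the ids are unique. The following sketch repeats the id construction on a made-up two-row test file; the names `toy_testfile` and `toy_pred` and all values are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the test file and the model output (hypothetical values)
toy_testfile = pd.DataFrame({
    'Store': [1, 1],
    'Dept': [3, 4],
    'Date': ['2012-11-02', '2012-11-02'],
})
toy_pred = np.array([[100.0], [200.0]])  # predictions with shape (n, 1)

# The id has the form <Store>_<Dept>_<Date>
ids = (toy_testfile['Store'].astype(str) + '_'
       + toy_testfile['Dept'].astype(str) + '_'
       + toy_testfile['Date'].astype(str))

# flatten() turns the (n, 1) predictions into a 1-D array
toy_submission = pd.DataFrame({'id': ids,
                               'Weekly_Sales': toy_pred.flatten()})

# Basic sanity checks before writing the file
assert len(toy_submission) == len(toy_testfile)
assert toy_submission['id'].is_unique
```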