# A simple linear baseline for the Walmart challenge
This notebook shows how to load the data, prepare it for use with Keras, and create a submission file. The model is a simple linear regression.

## Data fields
* Store - the store number
* Dept - the department number
* Date - the week
* Weekly_Sales - weekly sales for the given department in the given store (train set only)
* IsHoliday - whether the week is a special holiday week
* Temperature - average temperature in the region
* Fuel_Price - cost of fuel in the region
* MarkDown1-5 - anonymized data related to promotional markdowns that Walmart is running. MarkDown data is only available after Nov 2011, and is not available for all stores all the time. Any missing value is marked with an NA.
* CPI - the consumer price index
* Unemployment - the unemployment rate
* Type - an anonymized description of the store type

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Loading the data
In Kaggle, data that a Kernel can access is stored under ``../input/``.
From there we can load it with pandas:

In [None]:
train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

We are going to do some data preparation. It is easiest to do this for the training and test sets combined, so that we have to do each step only once. We need to remember where to split the combined set afterwards, though!

In [None]:
len(train) # Get number of training examples

In [None]:
len(test) # Get number of test examples

In [None]:
df = pd.concat([train,test],axis=0) # Join train and test
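The combine-then-split pattern can be sketched on toy data as follows. The frames `toy_train` and `toy_test` are made-up stand-ins for the real sets, and `ignore_index=True` is an optional extra that this notebook does not use; it simply gives the combined frame a unique index.

```python
import pandas as pd

# Toy stand-ins for the real train and test sets (values are made up)
toy_train = pd.DataFrame({'Store': [1, 1, 2], 'Weekly_Sales': [10.0, 12.0, 7.5]})
toy_test = pd.DataFrame({'Store': [1, 2]})  # the test set has no target column

n_train = len(toy_train)  # remember where the training rows end

# Stack the two frames; columns missing on one side are filled with NaN
combined = pd.concat([toy_train, toy_test], axis=0, ignore_index=True)

# ... shared preprocessing would go here ...

# Split by position again and drop the placeholder target from the test part
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:].drop('Weekly_Sales', axis=1)
```

Keeping the row count in a variable such as `n_train` avoids hard-coding the split position.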

In [None]:
df.head() # Get an overview of the data

In [None]:
df.describe()

There seem to be some missing values in the data. We have to make sure to deal with them before feeding anything into the network.

In [None]:
df.isnull().sum()

We will do a bit of very basic feature engineering here by creating a feature which indicates whether a certain markdown was active at all.

In [None]:
df.isnull().sum()
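A markdown indicator of the kind described above could look like the following sketch. It runs on a made-up frame, and the `_active` column suffix is a hypothetical naming choice. The indicators have to be built before the missing values are filled, since afterwards a missing markdown can no longer be told apart from a recorded zero.

```python
import numpy as np
import pandas as pd

# Made-up data: NaN means no markdown was recorded for that week
toy = pd.DataFrame({'MarkDown1': [np.nan, 250.0, np.nan],
                    'MarkDown2': [30.0, np.nan, np.nan]})

markdown_cols = ['MarkDown1', 'MarkDown2']
for col in markdown_cols:
    # 1 where a markdown value was recorded, 0 where it was missing
    toy[col + '_active'] = toy[col].notnull().astype(int)
```

Adding columns like these to `df` would increase the number of input features the model has to accept.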

We can probably safely fill all missing values with zero. For the markdowns this means that there was no markdown. For the weekly sales, the missing values are the ones we have to predict, so it does not really matter what we fill in there.

In [None]:
df.fillna(0, inplace=True)

In [None]:
df.isnull().sum()


In [None]:
df.dtypes

Now we have to create some dummy variables for categorical data.

In [None]:
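def get_holiday_feature(date):
    # Returns a one-hot vector [christmas, thanksgiving, labor day, super bowl];
    # all zeros means the date is none of these holiday weeks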
    super_bowl = ['2010-02-12','2011-02-11','2012-02-10','2013-02-08']
    labor = ['2010-09-10','2011-09-09','2012-09-07','2013-09-06']
    thanksgiving = ['2010-11-26','2011-11-25','2012-11-23','2013-11-29']
    christmas = ['2010-12-31','2011-12-30','2012-12-28','2013-12-27']
    if date in super_bowl:
        return [0,0,0,1]
    elif date in labor:
        return [0,0,1,0]
    elif date in thanksgiving:
        return [0,1,0,0]
    elif date in christmas:
        return [1,0,0,0]
    else:
        return [0,0,0,0]

In [None]:
def dates(datelist):
    # One holiday indicator vector per date
    return [get_holiday_feature(date) for date in datelist]

In [None]:
x = dates(df['Date'])
x[:100]
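A list of indicator vectors like `x` could be turned into named columns as sketched below. The column names are hypothetical; their order follows the vectors returned by `get_holiday_feature`. The example uses hand-written vectors rather than the real data.

```python
import pandas as pd

# Hypothetical column names, in the order used by the indicator vectors:
# [christmas, thanksgiving, labor day, super bowl]
holiday_cols = ['Christmas', 'Thanksgiving', 'Labor_Day', 'Super_Bowl']

# Hand-written vectors in the same format as the output of dates()
vectors = [[0, 0, 0, 1],   # a Super Bowl week
           [0, 0, 0, 0],   # an ordinary week
           [0, 1, 0, 0]]   # a Thanksgiving week

holiday_df = pd.DataFrame(vectors, columns=holiday_cols)
```

A frame like `holiday_df` could then be joined to the main data column-wise, which would again change the number of input features.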

In [None]:
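dates_parsed = pd.to_datetime(df['Date'])
# .dt.week is deprecated and was removed in pandas 2.0; isocalendar() yields the ISO week number
df['Week'] = dates_parsed.dt.isocalendar().week.astype(int)
df['Year'] = dates_parsed.dt.year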

In [None]:
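# Lag feature 'Prev': the sales of the preceding row once the data is
# ordered by store, department and date
order = df.reset_index(drop=True).sort_values(by = ['Store', 'Dept', 'Date']).index.values
sales = df['Weekly_Sales'].values[order]
prev = np.empty(len(sales))
prev[0] = df['Weekly_Sales'].mean() # The first row has no predecessor
prev[1:] = sales[:-1]
for j in range(1,len(prev)):
    if prev[j] == 0: # Zero marks a filled-in value: carry the last known value forward
        prev[j] = prev[j-1]
lag = np.empty(len(df))
lag[order] = prev # Write back in the original row order so the positional train/test split below stays valid
df['Prev'] = lag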
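A lag feature of this kind can also be sketched with pandas' `groupby(...).shift(1)`, which keeps the lag inside each store and department instead of letting it run across group boundaries. This is an alternative formulation on made-up data, not the notebook's own code; the column name `Prev` is borrowed from the cell above.

```python
import pandas as pd

# Made-up weekly sales for two stores with one department each
toy = pd.DataFrame({
    'Store': [1, 1, 1, 2, 2],
    'Dept': [1, 1, 1, 1, 1],
    'Date': ['2010-02-05', '2010-02-12', '2010-02-19', '2010-02-05', '2010-02-12'],
    'Weekly_Sales': [100.0, 110.0, 90.0, 40.0, 45.0],
})

# ISO-formatted date strings sort chronologically
toy = toy.sort_values(by=['Store', 'Dept', 'Date'])

# Previous week's sales within the same store and department;
# the first week of each group has no predecessor and becomes NaN
toy['Prev'] = toy.groupby(['Store', 'Dept'])['Weekly_Sales'].shift(1)
```

The NaN values at the start of each group would still need to be filled, for example with the group mean, before the data is fed to a model.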

In [None]:
# Make sure we can later recognize what a dummy once belonged to
df['Type'] = 'Type_' + df['Type'].map(str)
df['Store'] = 'Store_' + df['Store'].map(str)
df['Dept'] = 'Dept_' + df['Dept'].map(str)
df['Week'] = 'Week_' + df['Week'].map(str)
df['Year'] = 'Year_' + df['Year'].map(str)

In [None]:
# Create dummies
type_dummies = pd.get_dummies(df['Type'])
store_dummies = pd.get_dummies(df['Store'])
dept_dummies = pd.get_dummies(df['Dept'])
week_dummies = pd.get_dummies(df['Week'])
year_dummies = pd.get_dummies(df['Year'])

In [None]:
# Add dummies
df = pd.concat([df,type_dummies,store_dummies,dept_dummies, week_dummies, year_dummies],axis=1)
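The prefix, dummy and concat steps can be condensed with the `prefix` argument of `pd.get_dummies`. The sketch below shows this on a made-up frame; it is an alternative to prepending the prefix by hand, not a change to the cells above.

```python
import pandas as pd

# Made-up data with one categorical and one numeric column
toy = pd.DataFrame({'Type': ['A', 'B', 'A'], 'Size': [150, 90, 200]})

# One indicator column per category; the prefix keeps the origin recognizable
type_dummies = pd.get_dummies(toy['Type'], prefix='Type')

# Replace the original column with its dummies
toy = pd.concat([toy.drop('Type', axis=1), type_dummies], axis=1)
```

The resulting columns are `Size`, `Type_A` and `Type_B`.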

In [None]:
# Remove originals
del df['Type']
del df['Store']
del df['Dept']
del df['Week']
del df['Year']
del df['Date']
#del df['CPI']
#del df['Fuel_Price']
#del df['MarkDown1']
#del df['MarkDown2']
#del df['MarkDown3']
#del df['MarkDown4']
#del df['MarkDown5']
#del df['Size']
#del df['Temperature']
#del df['Unemployment']
#del df['IsHoliday']

In [None]:
df.head()

Now we can split the data into train and test sets again.

In [None]:
# Full split: the first 282451 rows form the training set, the rest the test set
train = df.iloc[:282451]
test = df.iloc[282451:]

# Smaller subsets, just to try out different models quickly
train_fake = df.iloc[:15000]
test_fake = df.iloc[15000:20000]

In [None]:
test = test.drop('Weekly_Sales',axis=1) # We should remove the nonsense values from test

To get NumPy arrays out of a pandas data frame, we can ask for the ``values`` of a single column or of the whole data frame.

In [None]:
y = train['Weekly_Sales'].values
y.shape

In [None]:
X = train.drop('Weekly_Sales',axis=1).values.astype('float32') # Cast so boolean and numeric columns end up in one numeric array
X.shape
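The extraction of target and feature arrays can be tried out on a small made-up frame. When a frame mixes boolean and numeric columns, `.values` may come back as a generic object array, so this sketch casts the features to `float32` explicitly; the cast is a precaution assumed here, and the frame and its values are invented.

```python
import pandas as pd

# Made-up frame with a float target, a boolean flag and an integer feature
toy = pd.DataFrame({'Weekly_Sales': [10.0, 12.0],
                    'IsHoliday': [False, True],
                    'Size': [150, 90]})

# Target vector: the values of a single column
y_toy = toy['Weekly_Sales'].values

# Feature matrix: everything except the target, cast to one numeric dtype
X_toy = toy.drop('Weekly_Sales', axis=1).values.astype('float32')
```

After the cast, `True` and `False` appear as `1.0` and `0.0` in the feature matrix.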

Now we create the baseline model.

In [None]:
from keras.layers import Dense, Activation
from keras.models import Sequential
from keras import regularizers

# Testing of different models starts here

We will train this model with mini-batch gradient descent: the Adam optimizer processes 2048 training examples per update step. We can afford large batches since we do not have very many training examples and each individual example is quite small, just 196 numbers per row. If you have a computer with little RAM, you might consider using a smaller batch size.

In [None]:
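model = Sequential()
# A single output unit on top of all input features;
# X.shape[1] avoids hard-coding the feature count (196 here)
model.add(Dense(1,input_dim=X.shape[1],
                # Note: ReLU clips predictions at zero; activation=None would give a plain linear regression
                activation ='relu',
                kernel_regularizer= regularizers.l2(0.01)))
model.compile(optimizer='adam', loss='mae')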
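For intuition, a single `Dense(1)` unit computes a weighted sum of its inputs plus a bias, followed by the activation. The NumPy sketch below reproduces that forward pass, the MAE loss and the L2 penalty on made-up numbers. It illustrates the arithmetic only and is not how Keras implements it internally.

```python
import numpy as np

X_toy = np.array([[1.0, 2.0], [3.0, -1.0]])  # two examples, two features
y_toy = np.array([3.0, 0.5])                 # made-up targets
w = np.array([0.5, 1.0])                     # made-up weights
b = 0.25                                     # made-up bias

z = X_toy @ w + b                    # linear part of the unit
pred = np.maximum(z, 0.0)            # ReLU activation

mae = np.mean(np.abs(y_toy - pred))  # mean absolute error
l2_penalty = 0.01 * np.sum(w ** 2)   # L2 regularization with factor 0.01
loss = mae + l2_penalty
```

With these numbers the linear part is `[2.75, 0.75]`, the MAE is `0.25`, and the total loss is `0.2625`.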

In [None]:
model.fit(X, y, epochs=5, batch_size= 2048)
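The number of update steps per epoch follows directly from the training set size and the batch size. The short calculation below uses the 282451 training rows from the split above and the batch size passed to `fit`.

```python
import math

n_examples = 282451  # training rows, as used in the split above
batch_size = 2048    # batch size passed to model.fit

# Every batch triggers one weight update; the last batch may be smaller
steps_per_epoch = math.ceil(n_examples / batch_size)
last_batch_size = n_examples - (steps_per_epoch - 1) * batch_size
```

This gives 138 steps per epoch, the last of which covers the remaining 1875 examples.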

In [None]:
model.evaluate(x=X,y=y)

After we have trained our model, we can use it to make predictions on the test set.

In [None]:
y_pred = model.predict(test.values.astype('float32'), batch_size = X.shape[0])
y_pred[:10]

In [None]:
X_test = test.values.astype('float32') # Same numeric cast as for the training features

In [None]:
y_pred = model.predict(X_test,batch_size=2048)

To create the ids required for the submission, we need the original test file one more time.

In [None]:
testfile = pd.read_csv('../input/test.csv')

Now we create the submission. Once you run the kernel you can download the submission from its outputs and upload it to the Kaggle InClass competition page.

In [None]:
submission = pd.DataFrame({'id':testfile['Store'].map(str) + '_' + testfile['Dept'].map(str) + '_' + testfile['Date'].map(str),
                         'Weekly_Sales':y_pred.flatten()})

In [None]:
submission.to_csv('submission.csv',index=False)
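The construction of the submission ids can be checked on a small made-up example. The frame and predictions below are invented; only the id format `Store_Dept_Date` is taken from the cell above.

```python
import numpy as np
import pandas as pd

# Made-up test rows and predictions
toy_test = pd.DataFrame({'Store': [1, 1],
                         'Dept': [1, 2],
                         'Date': ['2012-11-02', '2012-11-02']})
toy_pred = np.array([[1500.0], [320.5]])  # shape (n, 1), like the output of model.predict

# Build ids of the form Store_Dept_Date
ids = (toy_test['Store'].map(str) + '_'
       + toy_test['Dept'].map(str) + '_'
       + toy_test['Date'].map(str))

# flatten() turns the (n, 1) predictions into a plain 1-D column
toy_submission = pd.DataFrame({'id': ids, 'Weekly_Sales': toy_pred.flatten()})
```

A quick check that the submission has as many rows as the test file helps catch a misaligned split before uploading.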