# Capstone Project - Fang Hong


# Title of Project: Rossmann Store Sales
Kaggle link: https://www.kaggle.com/c/rossmann-store-sales

## Description of the Project:
Rossmann operates over 3,000 drug stores in 7 European countries. Currently, Rossmann store managers are tasked with predicting their daily sales for up to six weeks in advance. Store sales are influenced by many factors, including promotions, competition, school and state holidays, seasonality, and locality. With thousands of individual managers predicting sales based on their unique circumstances, the accuracy of results can be quite varied.

In their first Kaggle competition, Rossmann is challenging you to predict 6 weeks of daily sales for 1,115 stores located across Germany. Reliable sales forecasts enable store managers to create effective staff schedules that increase productivity and motivation. By helping Rossmann create a robust prediction model, you will help store managers stay focused on what’s most important to them: their customers and their teams! 

## Aims of the Project:
- EDA: explore the relationships among the features and sales
- Fit models with different sets of features
- Compare model fit across feature sets and find out which models perform well

## Steps to take:
- Step 1: Import the data and examine the descriptives of variables of interest; broadly examine the relationship between sales and features.
- Step 2: Run a linear regression model to predict sales, using a train/test split and K-fold cross-validation.
- Step 3: Run a decision tree model to predict sales.
- Step 4: Create the submission file and submit it to the website to get an accuracy score.

### Step 1: Import the data and examine the descriptives of variables of interest; broadly examine the relationship between sales and features.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
%config InlineBackend.figure_format = 'retina'
%matplotlib inline

from datetime import datetime as dt
from sklearn.model_selection import train_test_split


from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor


In [4]:
# Load the three dataframes: test, train, store
test=pd.read_csv('test.csv')
test.head(5)

FileNotFoundError: File b'test.csv' does not exist

In [None]:
train=pd.read_csv('train.csv')
train.head(5)

In [None]:
train.dtypes

In [None]:
store=pd.read_csv('store.csv')
store.head(5)

In [None]:
store.dtypes

In [None]:
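# Note: dropna() returns a new dataframe. Without assigning the result,
# these calls leave train, store, and test unchanged.
train.dropna()
store.dropna()
test.dropna()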
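
Because `dropna()` does not modify a dataframe in place, it can help to count missing values first and then drop only the rows that matter for a given model. The following is a minimal sketch on a made-up toy frame (the name `toy_store` and its values are assumptions for illustration, not the real `store.csv`):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for store.csv (values are invented for illustration)
toy_store = pd.DataFrame({
    "Store": [1, 2, 3],
    "CompetitionDistance": [1270.0, np.nan, 14130.0],
})

# dropna() returns a new frame; the original is untouched unless reassigned
toy_store.dropna()
rows_without_assignment = len(toy_store)

# Count missing values per column before deciding what to drop
missing_per_column = toy_store.isna().sum()

# Drop only rows missing the column that a model will actually use
toy_store_clean = toy_store.dropna(subset=["CompetitionDistance"])

print(rows_without_assignment)
print(missing_per_column)
print(len(toy_store_clean))
```

Restricting `subset` to the modelled columns avoids discarding rows that are only missing values in columns the model never sees.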

-- Since train and store both share the key column Store (the store number), let's merge them to get a dataframe that has all features.

In [None]:
train_store=pd.merge(train, store, on='Store')
train_store.head(5)
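
`pd.merge` defaults to an inner join, which silently drops rows whose key has no match. As a hedged sketch on invented toy data (the frames `toy_train` and `toy_store` are assumptions, not the competition files), the `how`, `validate`, and `indicator` parameters can make the join behaviour explicit:

```python
import pandas as pd

# Toy frames standing in for train.csv and store.csv (values are invented)
toy_train = pd.DataFrame({
    "Store": [1, 1, 2, 3],
    "Sales": [5263, 5020, 6064, 8314],
})
toy_store = pd.DataFrame({
    "Store": [1, 2],
    "StoreType": ["c", "a"],
})

# Default inner join: the row for store 3 disappears without warning
inner = pd.merge(toy_train, toy_store, on="Store")

# Left join keeps every training row; validate raises if a store number
# appears more than once in toy_store; indicator marks unmatched rows
merged = pd.merge(
    toy_train,
    toy_store,
    on="Store",
    how="left",
    validate="many_to_one",
    indicator=True,
)
unmatched = merged[merged["_merge"] == "left_only"]

print(len(inner), len(merged))
print(unmatched)
```

If `unmatched` is empty on the real data, the inner and left joins give the same rows and the simpler call is sufficient.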

In [None]:
# Descriptives for each store type
#train_store.groupby('StoreType').Sales.mean()
sns.barplot(data=train_store, x='StoreType', y='Sales', ci=None)

-- Store type b has the highest mean sales. Thus, StoreType might be a good feature to predict sales.

In [None]:
# Examine whether sales are affected by assortment for each type of store.
train_store.groupby(['StoreType', 'Assortment']).Sales.mean()

-- Mean sales differ across assortment types within each store type, so Assortment might also be a good feature.

In [None]:
# Examine whether day of week impacts sales for each type of store.
examine2=train_store.groupby(['StoreType', 'DayOfWeek']).Sales.mean()
#examine2.plot(kind='bar', figsize=(10, 8))
examine2

It seems people shopped less (sales are lower) on days 6 and 7 than on other days of the week. DayOfWeek should be a good feature.

In [None]:
# Examine whether the SchoolHoliday (school closed during this holiday) affects sales
train_store.groupby(['StoreType', 'SchoolHoliday']).Sales.mean()

-- School holidays increase sales for store types a, c, and d, but decrease sales for store type b.

In [None]:
# Examine the association between sales and CompetitionDistance
train_store.plot.scatter(x='CompetitionDistance', y='Sales', legend=True, figsize=(10, 8))

-- The relationship is not very clear. Generally, the smaller the distance to the nearest competitor, the higher the sales. A possible reason is that stores cluster in good locations with more people/visitors, so nearby competition tends to coincide with higher sales.
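
When a scatter plot is too noisy to read, binning the distance into quantiles and comparing mean sales per bin can make a weak trend easier to see. The sketch below uses synthetic data generated with an assumed negative trend (the frame `toy`, the bin labels, and all numbers are invented for illustration and say nothing about the real Rossmann data):

```python
import numpy as np
import pandas as pd

# Synthetic data with an assumed negative trend plus noise (not real data)
rng = np.random.default_rng(0)
toy = pd.DataFrame({"CompetitionDistance": rng.uniform(20, 20000, size=400)})
toy["Sales"] = 9000 - 0.2 * toy["CompetitionDistance"] + rng.normal(0, 1500, size=400)

# Split distance into four equally sized bins and compare mean sales per bin
toy["DistanceBin"] = pd.qcut(
    toy["CompetitionDistance"],
    q=4,
    labels=["nearest", "near", "far", "farthest"],
)
mean_sales_by_bin = toy.groupby("DistanceBin", observed=True)["Sales"].mean()

# Spearman rank correlation, computed as the Pearson correlation of the ranks
rank_corr = toy["CompetitionDistance"].rank().corr(toy["Sales"].rank())

print(mean_sales_by_bin)
print(rank_corr)
```

On the real data the same two summaries could be applied to `train_store`, after handling any missing `CompetitionDistance` values.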

In [None]:
# Examine the association between number of customers and sales
#train_store.plot.scatter(x='Customers', y='Sales', legend=True, figsize=(10, 8))

# Recent seaborn versions require x and y to be passed as keyword arguments
sns.regplot(x='Customers', y='Sales', data=train_store, line_kws={"color":"r", "lw":3})


In [None]:
# Examine the correlation among different features.
plt.subplots(figsize=(8,6))
cmap = sns.diverging_palette(220, 10, as_cmap=True)
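# numeric_only=True skips string columns such as StoreType, which corr() cannot handle
sns.heatmap(train_store.corr(numeric_only=True), vmin=0, vmax=1, cmap=cmap, linewidths=.5)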

In [None]:
#Multiple scatterplots in Seaborn
sns.set_style('darkgrid')
feature_cols1=['DayOfWeek', 'Customers', 'StoreType', 'Assortment', 'CompetitionDistance']
sns.pairplot(train_store, x_vars=feature_cols1, y_vars='Sales', kind='scatter', palette="Set2")

-- A very clear relationship: the larger the number of customers, the higher the sales.

In general, possible good features include StoreType (categorical), Assortment (categorical), DayOfWeek (categorical), SchoolHoliday (categorical), and Customers.

### Step 2: Run linear regression model to predict sales. -- Use train/test split and K-fold cross-validation

### Model 1: Use CompetitionDistance to Predict Sales

In [None]:
# Merge the test and store dataframe
test_store=pd.merge(test, store, on='Store')
test_store.head(5)

In [None]:
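# Run a simple linear regression with a single numeric feature
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
lr=LinearRegression()

# Create X and y. LinearRegression cannot handle NaN, so drop rows that are
# missing the feature or the target (a no-op if nothing is missing).
feature_cols2 = ['CompetitionDistance']
model_data = train_store.dropna(subset=feature_cols2 + ['Sales'])
X = model_data[feature_cols2]
y = model_data.Sales

# Hold out part of the training data (25% by default) and fit on the rest
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)
lr.fit(X_train, y_train)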

In [None]:
from sklearn import metrics

y_pred = lr.predict(X_test)

# Mean squared error on the training split, then on the held-out split
print(metrics.mean_squared_error(y_train, lr.predict(X_train)))
print(metrics.mean_squared_error(y_test, y_pred))
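
The single train/test split above gives one error estimate. K-fold cross-validation, named in the Step 2 heading, averages over several splits. A minimal sketch on synthetic data (`X_toy`, `y_toy`, and the generating formula are assumptions for illustration; on the real data they would be replaced by the feature matrix and sales column):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic single-feature data (invented, not the Rossmann data)
rng = np.random.default_rng(99)
X_toy = rng.uniform(20, 20000, size=(500, 1))
y_toy = 7000 - 0.05 * X_toy[:, 0] + rng.normal(0, 1000, size=500)

# Five shuffled folds; each fold serves once as the held-out set
kf = KFold(n_splits=5, shuffle=True, random_state=99)

# scikit-learn reports negated MSE so that larger scores are always better
neg_mse = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=kf, scoring="neg_mean_squared_error"
)
rmse_per_fold = np.sqrt(-neg_mse)

print(rmse_per_fold)
print(rmse_per_fold.mean(), rmse_per_fold.std())
```

The spread of `rmse_per_fold` gives a rough sense of how much the error estimate depends on which rows happen to be held out.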

### Convert all categorical features.

In [None]:
# convert the categorical variable (StoreType) to dummy codes
train_store_type1=pd.get_dummies(train_store.StoreType, prefix='StoreType')

# drop the first column (the reference category)
train_store_type1.drop(train_store_type1.columns[0], axis=1, inplace=True)

# concatenate the original train_store dataframe and the dummy dataframe
train_store1=pd.concat([train_store, train_store_type1], axis=1)

In [None]:
# convert the categorical variable (Assortment) to dummy codes
train_store_assortment=pd.get_dummies(train_store.Assortment, prefix='Assortment')

# drop the first column (the reference category)
train_store_assortment.drop(train_store_assortment.columns[0], axis=1, inplace=True)

# concatenate the previous dataframe and the Assortment dummy dataframe
train_store2=pd.concat([train_store1, train_store_assortment], axis=1)

In [None]:
# convert the categorical variable (DayOfWeek) to dummy codes
train_store_dayofweek=pd.get_dummies(train_store.DayOfWeek, prefix='DayOfWeek')

# drop the first column (the reference category)
train_store_dayofweek.drop(train_store_dayofweek.columns[0], axis=1, inplace=True)

# concatenate the previous dataframe and the DayOfWeek dummy dataframe
train_store3=pd.concat([train_store2, train_store_dayofweek], axis=1)

In [None]:
# convert the categorical variable (SchoolHoliday) to dummy codes
train_store_SchoolHoliday=pd.get_dummies(train_store.SchoolHoliday, prefix='SchoolHoliday')

# drop the first column (the reference category)
train_store_SchoolHoliday.drop(train_store_SchoolHoliday.columns[0], axis=1, inplace=True)

# concatenate the previous dataframe and the SchoolHoliday dummy dataframe
# (concatenating the DayOfWeek dummies again here would duplicate those columns)
train_store4=pd.concat([train_store3, train_store_SchoolHoliday], axis=1)
train_store4.head(5)

In [None]:
train_store4.columns

### Run multiple linear regression

In [None]:
# Run the linear regression with dummy variables included
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
lr=LinearRegression()

# Each feature is listed once; SchoolHoliday_1 is the dummy created above
feature_cols=['CompetitionDistance', 'StoreType_b', 'StoreType_c', 'StoreType_d', 'Assortment_b', 'Assortment_c', 'DayOfWeek_2', 'DayOfWeek_3', 'DayOfWeek_4', 'DayOfWeek_5', 'DayOfWeek_6', 'DayOfWeek_7', 'SchoolHoliday_1']

# LinearRegression cannot handle NaN, so drop rows missing any feature or the target
model_data=train_store4.dropna(subset=feature_cols + ['Sales'])
x=model_data[feature_cols]
y=model_data.Sales

lr.fit(x, y)
list(zip(feature_cols, lr.coef_))
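
The feature matrix for this regression is built by dummy coding one column at a time and concatenating the results. As a possible alternative, `pd.get_dummies` can encode several columns in one call with `columns=` and `drop_first=True`, which leaves less room for concatenating the wrong frame. A minimal sketch on invented toy data (the frame `toy` and its values are assumptions for illustration):

```python
import pandas as pd

# Toy frame standing in for train_store (values are invented)
toy = pd.DataFrame({
    "StoreType": ["a", "b", "c", "d"],
    "Assortment": ["a", "c", "b", "a"],
    "DayOfWeek": [1, 2, 7, 1],
    "Sales": [5263, 6064, 8314, 4822],
})

# Encode all three categorical columns at once; drop_first removes the
# first level of each column so it acts as the reference category
encoded = pd.get_dummies(
    toy,
    columns=["StoreType", "Assortment", "DayOfWeek"],
    drop_first=True,
)

print(list(encoded.columns))
print(encoded.columns.is_unique)
```

Checking `columns.is_unique` before selecting features is a cheap guard against duplicated dummy columns.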