##  Data Preparation 
In this section,
+ irrelevant and canceled orders are removed from the data set;
+ missing values (i.e. rest days and holidays) are analyzed and filled with the appropriate value (i.e. 0);
+ trends are analyzed, with outlier dates detected and a seasonal decomposition conducted;
+ a time series analysis is conducted, showing that the series is not stationary and has obvious yearly and seasonal cycles;
+ daily sales records are obtained (see the last part of the notebook).

In [None]:
# import packages, read the Excel workbook and combine data from multiple sheets
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb 

data = pd.read_excel('../input/online-retail-ii-data-set-from-ml-repository/online_retail_II.xlsx',sheet_name=[0,1])
data = pd.concat([data[0],data[1]],axis=0)
data

In [None]:
data.shape

In [None]:
# Delete canceled orders, whose invoice number starts with 'C'
is_canceled = data['Invoice'].astype(str).str.startswith('C')
data = data[~is_canceled]

In [None]:
# Keep only rows with a positive quantity (drops replenishing orders)
data = data[data['Quantity'] > 0]

In [None]:
# Truncate InvoiceDate to day resolution (yyyy-mm-dd)
data['InvoiceDate'] = data['InvoiceDate'].dt.normalize()

data.head(1)

In [None]:
# Delete orders on debt (irrelevant to sales)
data = data[data['Price'] > 0]

In [None]:
# Obtain daily sales 
data['TotalPrice'] = data['Quantity']*data['Price']

grp_date = data[['Quantity','InvoiceDate','Price','TotalPrice']].groupby('InvoiceDate')
grp_date = grp_date.sum()

sale = grp_date[['TotalPrice']]
sale.to_csv('DailySalesTrending.csv')
sale

---

### Missing Values 

In [None]:
# read daily sales' records
data = pd.read_csv('DailySalesTrending.csv')

data['InvoiceDate'] = data['InvoiceDate'].astype('datetime64[ns]')
data.info()

In [None]:
# set the index to the invoice date and find missing dates
data.set_index(data['InvoiceDate'],drop=False,inplace=True)

missing_date = pd.date_range(start ='2009-12-01', end ='2011-12-09').difference(data.index)

In [None]:
# print dates with missing values (year, month, day, day of week with Monday = 0)
for date in missing_date:
    print(date.year,date.month,date.day,date.dayofweek)

Several dates without sales are listed above. A brief analysis shows that:
1. the shop appears to be closed on most Saturdays and on holidays;
1. the shop closes for the Christmas holiday from 12-24 to 01-03 each year.
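
The weekday pattern behind the first observation can be checked by tabulating the missing dates by day of week. The snippet below is a minimal sketch on hypothetical toy data (`toy_sales` is not the notebook's data set), assuming only that the daily records are indexed by date.

```python
import pandas as pd

# Hypothetical toy data: two weeks of daily sales with both Saturdays absent.
full_range = pd.date_range(start='2010-01-04', end='2010-01-17')  # Mon .. Sun
open_days = full_range[full_range.dayofweek != 5]                  # 5 = Saturday
toy_sales = pd.Series(100.0, index=open_days, name='TotalPrice')

# Dates in the full calendar range that have no sales record.
missing = full_range.difference(toy_sales.index)

# Count the missing dates per weekday name.
missing_by_weekday = pd.Series(missing.day_name()).value_counts()
print(missing_by_weekday)
```

Applied to the real `missing_date` index, a count dominated by a single weekday would support the closed-on-Saturdays reading.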

In [None]:
# add missing dates and fill them 
data = data.reindex(pd.date_range(start ='2009-12-01', end ='2011-12-09'))
data.fillna(0,inplace=True)
data['InvoiceDate'] = data.index

In [None]:
# add calendar features derived from the invoice date
data['year'] = data['InvoiceDate'].dt.year
data['month'] = data['InvoiceDate'].dt.month
data['day'] = data['InvoiceDate'].dt.day
data['week'] = data['InvoiceDate'].dt.isocalendar().week.astype(int)
data['weekday'] = data['InvoiceDate'].dt.weekday
data['dayofyear'] = data['InvoiceDate'].dt.dayofyear

## Exploratory Analysis 

### Overall Trending 

In [None]:
# Sales trending 
f = plt.figure(figsize=(20,6))
sb.lineplot(x=data.index,y='TotalPrice',data=data).set_title('Sales Trending')

In [None]:
# seasonal decomposition    Ref: https://machinelearningmastery.com/decompose-time-series-data-trend-seasonality/
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(data['TotalPrice'], model='additive')
result.plot()
plt.show()

Both a yearly cycle and a seasonal cycle are observed.
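
To make the additive model concrete, the sketch below decomposes a hypothetical toy series by hand into trend, seasonal and residual parts. It is a simplified version of the classical procedure, not necessarily what `seasonal_decompose` does internally, and it assumes a period of 7 days (to my understanding, the period that statsmodels infers for a daily index when `period` is not passed).

```python
import numpy as np
import pandas as pd

# Hypothetical toy series: linear trend plus a zero-mean weekly pattern.
idx = pd.date_range('2010-01-04', periods=70, freq='D')  # starts on a Monday
weekly_pattern = np.array([3.0, 1.0, 0.0, -1.0, 2.0, -5.0, 0.0])  # sums to 0
toy = pd.Series(0.5 * np.arange(70) + np.tile(weekly_pattern, 10), index=idx)

period = 7
# Trend: centred moving average over one full period.
trend = toy.rolling(window=period, center=True).mean()
# Seasonal: mean detrended value for each weekday, re-centred to average zero.
detrended = toy - trend
weekday_means = detrended.groupby(detrended.index.dayofweek).mean()
weekday_means = weekday_means - weekday_means.mean()
seasonal = pd.Series(weekday_means.loc[toy.index.dayofweek].to_numpy(), index=idx)
# Residual: whatever the trend and seasonal components leave unexplained.
resid = toy - trend - seasonal

print(weekday_means.round(3))
```

With a 7-day period the seasonal component captures the weekly pattern only; a yearly cycle would show up in the trend component instead.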

### Yearly Trending 

In [None]:
# yearly trending 
sb.boxplot(x='year',y='TotalPrice',data=data)

There is no obvious year-over-year trend. The only exception is 2009, which contains far less data because the records start on 2009-12-01.

### Outliers 

In [None]:
# large outliers within each year
for year, df in data.groupby('year'):
    q1, q3 = df['TotalPrice'].quantile(0.25), df['TotalPrice'].quantile(0.75)
    IQR = q3 - q1
    median = df['TotalPrice'].median()
    # flag days whose sales exceed the yearly median by more than 1.5 * IQR
    large_outliers = df[df['TotalPrice'] > median + 1.5*IQR]
    print(large_outliers)

+ **From October to 10 December (Weeks 40~49)**, there is a selling peak.
+ Other selling peaks: 29 March, one day in Weeks 23~24 in June, and the end of September.

In the line plot of the sales trend, peaks are visible around 10 October and particularly before Christmas.
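
The cell above flags values more than 1.5 × IQR above the median, which is stricter than the conventional box-plot fence of Q3 + 1.5 × IQR. The sketch below compares the two rules on hypothetical toy numbers (not the notebook's data) to show how the choice changes which days are flagged.

```python
import pandas as pd

# Hypothetical toy daily sales for one year (not the notebook's data).
toy = pd.Series([10, 12, 11, 13, 12, 14, 11, 17, 40], name='TotalPrice', dtype=float)

q1, q3 = toy.quantile(0.25), toy.quantile(0.75)
iqr = q3 - q1

median_fence = toy.median() + 1.5 * iqr   # rule used in the cell above
tukey_fence = q3 + 1.5 * iqr              # conventional box-plot upper fence

median_outliers = toy[toy > median_fence].tolist()
tukey_outliers = toy[toy > tukey_fence].tolist()
print('median-based:', median_outliers)
print('Tukey:       ', tukey_outliers)
```

Because the median is never above Q3, the median-based rule always flags at least as many days as the Tukey fence.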

### Monthly Trending 

In [None]:
# monthly trending (sum sales only; the datetime column cannot be summed)
month_sum = data.groupby('month')[['TotalPrice']].sum()

f = plt.figure(figsize = (12,4))
ax = sb.lineplot(x=month_sum.index,y='TotalPrice',data=month_sum)
ax.set_title('Sales\' Monthly Trending')

f,axes = plt.subplots(1,2,figsize = (12,5))
sb.violinplot(x='month',y='TotalPrice',data=data,ax=axes[0])
sb.boxplot(x='month',y='TotalPrice',data=data,ax=axes[1])

In [None]:
# sales and day in month 
f,axes = plt.subplots(1,2,figsize = (20,5))
sb.violinplot(x='day',y='TotalPrice',data=data,ax=axes[0])
sb.boxplot(x='day',y='TotalPrice',data=data,ax=axes[1])

In [None]:
# generate heatmap of sales 
data['NormalizedPrice'] = (data['TotalPrice'] - data['TotalPrice'].mean())/data['TotalPrice'].std()

f, axes = plt.subplots(1,3,figsize=(10*3,5))
for i, (year, group) in enumerate(data.groupby('year')):
    hd = group.pivot_table(values='NormalizedPrice', index='weekday', columns='week')
    sb.heatmap(hd,ax=axes[i])

data = data.drop('NormalizedPrice',axis=1)

**Detailed analysis of the sales peaks**:
Sales are highest in Week 49, around 2 weeks before Christmas.
There is another peak around Week 39, which might be related to a holiday.
Holidays are therefore one of the important factors.

PS. It can be verified that the shop doesn't operate on most Saturdays. 
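
To relate the week numbers above to calendar dates, the sketch below converts Weeks 39 and 49 into date ranges using only the standard library. It assumes the `week` column follows ISO 8601 week numbering; `iso_week_span` is a hypothetical helper introduced here for illustration.

```python
from datetime import date

def iso_week_span(year, week):
    """Return the Monday and Sunday of the given ISO week."""
    return date.fromisocalendar(year, week, 1), date.fromisocalendar(year, week, 7)

for year in (2010, 2011):
    for week in (39, 49):
        monday, sunday = iso_week_span(year, week)
        print(year, week, monday, sunday)

# ISO week that contains Christmas Day, for comparison with Week 49.
christmas_weeks = [date(y, 12, 25).isocalendar()[1] for y in (2010, 2011)]
print(christmas_weeks)
```

Under that assumption, Week 49 starts on 6 December 2010 and 5 December 2011, and Christmas Day falls in Week 51 in both years, which matches the "2 weeks before Christmas" reading.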

### Time Series Analysis 

In [None]:
# display auto-correlation graph 
from pandas.plotting import autocorrelation_plot as auto_p
plt.figure(figsize=(20,5))
f = auto_p(data['TotalPrice'])

In [None]:
'''
Check whether the time series is stationary with the augmented Dickey-Fuller test.
ref: https://machinelearningmastery.com/time-series-data-stationary-python/
'''
from statsmodels.tsa.stattools import adfuller

X = data['TotalPrice'].values
result = adfuller(X)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
	print('\t%s: %.3f' % (key, value))

The time series is concluded to be non-stationary.
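
For reference, the decision rule behind such a conclusion can be written down explicitly. The null hypothesis of the augmented Dickey-Fuller test is that the series has a unit root, so a p-value above the chosen significance level means the null cannot be rejected. The helper `adf_verdict` below is a hypothetical sketch, and the numbers passed to it are made up for illustration; they are not the notebook's output.

```python
def adf_verdict(adf_stat, p_value, critical_values, alpha=0.05):
    """Summarise an augmented Dickey-Fuller result.

    The null hypothesis is that the series has a unit root (non-stationary);
    it is rejected when the p-value is below alpha.
    """
    level = f'{round(alpha * 100)}%'   # e.g. 0.05 -> '5%'
    reject = p_value < alpha
    return {
        'reject_unit_root': reject,
        'stat_below_critical': adf_stat < critical_values[level],
        # Failing to reject is treated as non-stationary for practical purposes.
        'verdict': 'stationary' if reject else 'non-stationary',
    }

# Hypothetical numbers for illustration only (not the notebook's output).
example = adf_verdict(-1.9, 0.33, {'1%': -3.44, '5%': -2.87, '10%': -2.57})
print(example)
```

The `critical_values` argument uses the same `'1%'`, `'5%'`, `'10%'` keys as the dictionary printed in the cell above, so `result[0]`, `result[1]` and `result[4]` could be passed in directly.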

---

In [None]:
data.to_csv('DataSet.csv')
data