# Capstone 2 Data Wrangling

**The Data Science Method**  

1.   Problem Identification 

2.   **Data Wrangling** 
  * Data Collection
  * Data Organization
  * Data Definition
  * Data Cleaning

3.   Exploratory Data Analysis
  * Build data profile tables and plots
    - Outliers & Anomalies
  * Explore data relationships
  * Identification and creation of features

4.   Pre-processing and Training Data Development
  * Create dummy or indicator features for categorical variables
  * Standardize the magnitude of numeric features
  * Split into testing and training datasets
  * Apply scaler to the testing set

5.   Modeling 
  * Fit Models with Training Data Set
  * Review Model Outcomes — Iterate over additional models as needed.
  * Identify the Final Model

6.   Documentation
  * Review the Results
  * Present and share your findings - storytelling
  * Finalize Code 
  * Finalize Documentation

## Data Collection

In [1]:
#load python packages
import os
import pandas as pd
import datetime
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline


In [2]:
#check current working directory
os.getcwd()

'C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two'

In [3]:
#need to change working directory to data, where the data are saved
path="C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data"
os.chdir(path)
#check and see what files are in data folder
os.listdir()

['OceanHourlySales2016.xlsx', 'OceanHourlySales2017.xlsx', 'processed']
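
The absolute Windows path above only works on one machine. A minimal sketch of a more portable alternative, assuming a project layout with a `data` folder under some base directory; `list_data_files` is an illustrative helper and not part of the original notebook:

```python
from pathlib import Path


def list_data_files(base_dir, pattern="*.xlsx"):
    """Return the sorted file names under base_dir/data that match pattern."""
    data_dir = Path(base_dir) / "data"   # assumed layout: <base_dir>/data
    return sorted(p.name for p in data_dir.glob(pattern))


# Hypothetical usage: list the workbooks without changing the working directory
# list_data_files(Path.cwd())
```

Building paths from a base directory also avoids `os.chdir`, so later cells do not depend on which folder happens to be current.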

<font color='teal'> **I'll use both the 2016 and 2017 Excel files. Each one holds hourly sales by month, in sheets named Jan to Dec.**</font>

In [4]:
#read the excels into xls
xls2016 = pd.ExcelFile('OceanHourlySales2016.xlsx')
xls2017 = pd.ExcelFile('OceanHourlySales2017.xlsx')

#read one sheet from 2016 and see its components
df = pd.read_excel(xls2016, 'Dec')
df.head()

Unnamed: 0.1,Unnamed: 0,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday,AVG
0,Time,2016-12-05 00:00:00,2016-12-06 00:00:00,2016-12-07 00:00:00,2016-12-08 00:00:00,2016-12-09 00:00:00,2016-12-10 00:00:00,2016-12-11 00:00:00,
1,11:00:00,68.43,41.83,36.4,53.84,24.05,15.2,0,34.25
2,12:00:00,48.85,95.09,79.42,36.44,65.03,46.26,69.93,63.0029
3,13:00:00,101.34,115.09,41.12,80.89,71.51,54.59,74.43,76.9957
4,14:00:00,129.75,60.77,67.68,63.36,80.83,69.26,155.25,89.5571


## Data Definition

In [5]:
#check its info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 93 entries, 0 to 92
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  85 non-null     object
 1   Monday      83 non-null     object
 2   Tuesday     83 non-null     object
 3   Wednesday   83 non-null     object
 4   Thursday    83 non-null     object
 5   Friday      88 non-null     object
 6   Saturday    75 non-null     object
 7   Sunday      83 non-null     object
 8   AVG         84 non-null     object
dtypes: object(9)
memory usage: 6.7+ KB


In [6]:
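#we're going to need the sales from hour to hour, so we'll remove the AVG column
df.drop(columns=['AVG'], inplace=True)
#let's also rename the first column (Unnamed: 0) to Time
#rename() is used because writing into df.columns.values edits the index internals in place
df = df.rename(columns={df.columns[0]: 'Time'})
df.head()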

Unnamed: 0,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
0,Time,2016-12-05 00:00:00,2016-12-06 00:00:00,2016-12-07 00:00:00,2016-12-08 00:00:00,2016-12-09 00:00:00,2016-12-10 00:00:00,2016-12-11 00:00:00
1,11:00:00,68.43,41.83,36.4,53.84,24.05,15.2,0
2,12:00:00,48.85,95.09,79.42,36.44,65.03,46.26,69.93
3,13:00:00,101.34,115.09,41.12,80.89,71.51,54.59,74.43
4,14:00:00,129.75,60.77,67.68,63.36,80.83,69.26,155.25


In [7]:
#let's check the values we have for the column 'Time' (plan to use it as index)
df['Time'].value_counts()

12:00:00    5
AM          5
13:00:00    5
17:00:00    5
16:00:00    5
Time        5
20:00:00    5
14:00:00    5
21:00:00    5
Total       5
22:00:00    5
PM          5
19:00:00    5
11:00:00    5
18:00:00    5
23:00:00    5
15:00:00    5
Name: Time, dtype: int64

In [8]:
#we don't need the AM, PM, and Total rows
df = df[~df['Time'].isin(['AM', 'PM', 'Total'])]
#let's check again
df['Time'].value_counts()

12:00:00    5
23:00:00    5
14:00:00    5
11:00:00    5
19:00:00    5
22:00:00    5
21:00:00    5
18:00:00    5
20:00:00    5
Time        5
16:00:00    5
17:00:00    5
13:00:00    5
15:00:00    5
Name: Time, dtype: int64

In [9]:
#checking df's last 15 rows, since there should be 4 weeks in a month, yet there are 5 values for each hour
df.tail(15)

Unnamed: 0,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
75,,,,,,,,
76,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
77,11:00:00,59.345,40.9275,40.43,51.7425,46.2925,39.965,20.045
78,12:00:00,54.03,73.095,61.2575,47.8225,78.42,136.53,84.345
79,13:00:00,85.6525,109.755,89.2975,78.3975,74.37,104.655,124.775
80,14:00:00,106.422,88.945,71.08,79.175,82.97,157.65,135.968
81,15:00:00,105.845,106.527,91.8625,89.93,98.09,147.985,203.648
82,16:00:00,139.3,104.838,91.5275,148.685,118.43,120.882,130.868
83,17:00:00,87.195,63.935,95.0225,59.4675,67.4725,104.493,112.668
84,18:00:00,72.57,92.3675,85.5425,76.61,78.7875,74.335,120.463


In [10]:
#remove empty rows by dropping rows where the Time column is NaN
#(reassigning instead of inplace=True avoids modifying a filtered slice)
df = df.dropna(subset=['Time'])
df.tail(15)

Unnamed: 0,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
70,23:00:00,,,,,35.21,,
76,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
77,11:00:00,59.345,40.9275,40.43,51.7425,46.2925,39.965,20.045
78,12:00:00,54.03,73.095,61.2575,47.8225,78.42,136.53,84.345
79,13:00:00,85.6525,109.755,89.2975,78.3975,74.37,104.655,124.775
80,14:00:00,106.422,88.945,71.08,79.175,82.97,157.65,135.968
81,15:00:00,105.845,106.527,91.8625,89.93,98.09,147.985,203.648
82,16:00:00,139.3,104.838,91.5275,148.685,118.43,120.882,130.868
83,17:00:00,87.195,63.935,95.0225,59.4675,67.4725,104.493,112.668
84,18:00:00,72.57,92.3675,85.5425,76.61,78.7875,74.335,120.463


In [11]:
#the last block is the weekly average, so we'll remove it by dropping the last 14 rows
#to be safe, we only do so if the header row of that block (row -14) has 'Monday' in the Monday column
#(real weeks have a date there; only the average block has the weekday name)
if df.iloc[-14]['Monday'] == 'Monday':
    df = df[:-14]
df.tail(15)

Unnamed: 0,Time,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday
51,23:00:00,,,,,90.48,,
57,Time,2016-12-26 00:00:00,2016-12-27 00:00:00,2016-12-28 00:00:00,2016-12-29 00:00:00,2016-12-30 00:00:00,2016-12-31 00:00:00,2017-01-01 00:00:00
58,11:00:00,22.27,38.19,11.91,48.54,30.98,34.22,35.7
59,12:00:00,16.3,64.58,62.75,44.9,69.24,66.94,78.57
60,13:00:00,70.9,130.33,88.9,121.44,118.86,129.39,150.28
61,14:00:00,96.56,71.25,71.52,85.99,74.31,226.19,137.23
62,15:00:00,151.16,122.81,91.35,140.9,126.38,147.41,231.21
63,16:00:00,94.81,69.89,127.88,174.19,52.98,179.84,127.24
64,17:00:00,161.21,113.45,98.31,61.1,64.58,,108.06
65,18:00:00,86.37,97.48,119.95,96.84,66.92,,128.6


In [12]:
#we need to make it into a wide table, so we'll merge the weeks from left to right to make a month
#before that, let's reset the index
df.reset_index(drop=True, inplace=True)

In [13]:
#week 1 is rows 0~13 (df[0:14]), week 2 is 14~27, week 3 is 28~41, week 4 is 42~55
#occasionally there is a week 5 (rows 56~69), which we add if len(df) > 60
weeks = [df[:14], df[14:28], df[28:42], df[42:56]]
if len(df) > 60:
    weeks.append(df[56:70])

#let's merge them, need reduce function
#(a list of weeks avoids testing `df5 == True`, which raises an error when df5 is a DataFrame)
from functools import reduce
dfmerged = reduce(lambda left, right: pd.merge(left, right, on=['Time'], how='outer'), weeks)
dfmerged.head(15)

Unnamed: 0,Time,Monday_x,Tuesday_x,Wednesday_x,Thursday_x,Friday_x,Saturday_x,Sunday_x,Monday_y,Tuesday_y,...,Friday_x.1,Saturday_x.1,Sunday_x.1,Monday_y.1,Tuesday_y.1,Wednesday_y,Thursday_y,Friday_y,Saturday_y,Sunday_y
0,Time,2016-12-05 00:00:00,2016-12-06 00:00:00,2016-12-07 00:00:00,2016-12-08 00:00:00,2016-12-09 00:00:00,2016-12-10 00:00:00,2016-12-11 00:00:00,2016-12-12 00:00:00,2016-12-13 00:00:00,...,2016-12-23 00:00:00,2016-12-24 00:00:00,2016-12-25 00:00:00,2016-12-26 00:00:00,2016-12-27 00:00:00,2016-12-28 00:00:00,2016-12-29 00:00:00,2016-12-30 00:00:00,2016-12-31 00:00:00,2017-01-01 00:00:00
1,11:00:00,68.43,41.83,36.4,53.84,24.05,15.2,0,77.04,48.58,...,58.57,37.85,14.65,22.27,38.19,11.91,48.54,30.98,34.22,35.7
2,12:00:00,48.85,95.09,79.42,36.44,65.03,46.26,69.93,67.85,74.24,...,76.66,26.5,110.04,16.3,64.58,62.75,44.9,69.24,66.94,78.57
3,13:00:00,101.34,115.09,41.12,80.89,71.51,54.59,74.43,62.54,163.14,...,16.19,114.4,146.68,70.9,130.33,88.9,121.44,118.86,129.39,150.28
4,14:00:00,129.75,60.77,67.68,63.36,80.83,69.26,155.25,78.38,142.5,...,93.8,164.75,143.11,96.56,71.25,71.52,85.99,74.31,226.19,137.23
5,15:00:00,73.37,105.77,80.8,58.94,112.93,109.83,195.62,114.08,88.82,...,96.19,206.62,267.09,151.16,122.81,91.35,140.9,126.38,147.41,231.21
6,16:00:00,131.18,177,68,123.91,168.33,94.69,103.45,104.48,70.36,...,132.33,106.04,152.77,94.81,69.89,127.88,174.19,52.98,179.84,127.24
7,17:00:00,70.78,41.1,86.02,50.59,106.57,77.23,107.46,54.52,40.4,...,56.84,141.52,107.64,161.21,113.45,98.31,61.1,64.58,,108.06
8,18:00:00,81.61,62.75,58.55,54.24,86.65,59.64,87.2,93.28,119.72,...,54.31,,141.98,86.37,97.48,119.95,96.84,66.92,,128.6
9,19:00:00,59.58,90.57,87.37,23.7,98.62,72.9,64.97,104.46,53.67,...,68.17,,91.05,63.45,91.44,72.52,151.06,120.79,,85.69
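
Because `pd.merge` only has two suffixes to hand out, chaining it over four or five weeks yields repeated column names (the `Friday_x.1` style labels in the output above), and newer pandas releases may refuse such merges outright. A minimal sketch of an alternative, assuming every weekly block starts with a header row whose `Time` cell is the string `'Time'`; `split_weeks` and `weeks_side_by_side` are hypothetical helpers, not the notebook's own code:

```python
import pandas as pd


def split_weeks(df):
    """Split a stacked sheet into weekly blocks, one per 'Time' header row."""
    # every header row starts a new block; cumsum gives each row its block id
    block_id = (df["Time"] == "Time").cumsum()
    return [block.reset_index(drop=True) for _, block in df.groupby(block_id)]


def weeks_side_by_side(df):
    """Join the weekly blocks on 'Time' so that each day becomes one column."""
    weeks = [w.set_index("Time") for w in split_weeks(df)]
    # keys= labels each block with its week number, so weekday names cannot collide
    return pd.concat(weeks, axis=1, keys=list(range(1, len(weeks) + 1)))


# toy sheet: two weeks, one weekday column, two opening hours
toy = pd.DataFrame({
    "Time": ["Time", "11:00:00", "12:00:00", "Time", "11:00:00", "12:00:00"],
    "Monday": ["2016-12-05", 1.0, 2.0, "2016-12-12", 3.0, 4.0],
})
wide = weeks_side_by_side(toy)
```

Splitting on header rows rather than fixed 14-row offsets also means a sheet with a fifth week needs no special case.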


In [14]:
#we'll now make it into a long table (one row per day) by transposing the df
dft = dfmerged.T
dft.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
Time,Time,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00
Monday_x,2016-12-05 00:00:00,68.43,48.85,101.34,129.75,73.37,131.18,70.78,81.61,59.58,36.2,81.02,49.03,
Tuesday_x,2016-12-06 00:00:00,41.83,95.09,115.09,60.77,105.77,177,41.1,62.75,90.57,81.64,65.03,20.1,
Wednesday_x,2016-12-07 00:00:00,36.4,79.42,41.12,67.68,80.8,68,86.02,58.55,87.37,19.75,19.64,28.65,
Thursday_x,2016-12-08 00:00:00,53.84,36.44,80.89,63.36,58.94,123.91,50.59,54.24,23.7,38.58,103.92,53.35,


In [15]:
# let's fill the NaNs in columns 1~12 with the column mean
for i in range(1, 13):
    dft[i] = dft[i].fillna(dft.iloc[1:, i].mean())
# the store closes at 23:00, so we'll fill 0 for the NaNs in that column
dft[13] = dft[13].fillna(0)
dft.isnull().any().any()

False
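
The two filling rules can be tried on a small numeric frame first. This is a sketch on made-up numbers, assuming the hourly columns are already numeric; the `sales` and `filled` names are illustrative only:

```python
import numpy as np
import pandas as pd

# toy stand-in for the transposed sheet: rows are days, columns are hours
sales = pd.DataFrame(
    {"11:00:00": [10.0, np.nan, 30.0], "23:00:00": [np.nan, 5.0, np.nan]}
)

filled = sales.copy()
# a gap at 23:00 means the store was already closed, so it becomes 0
filled["23:00:00"] = filled["23:00:00"].fillna(0)
# any remaining gap takes the mean of its own column (NaNs are skipped by mean())
filled = filled.fillna(filled.mean())
```

Passing the Series of column means to `fillna` fills each column with its own mean, which replaces the explicit loop over column positions.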

In [16]:
#rename the columns names using first row's values
dft.columns = dft.iloc[0]
dft.head()

Time,Time.1,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00
Time,Time,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00
Monday_x,2016-12-05 00:00:00,68.43,48.85,101.34,129.75,73.37,131.18,70.78,81.61,59.58,36.2,81.02,49.03,0
Tuesday_x,2016-12-06 00:00:00,41.83,95.09,115.09,60.77,105.77,177,41.1,62.75,90.57,81.64,65.03,20.1,0
Wednesday_x,2016-12-07 00:00:00,36.4,79.42,41.12,67.68,80.8,68,86.02,58.55,87.37,19.75,19.64,28.65,0
Thursday_x,2016-12-08 00:00:00,53.84,36.44,80.89,63.36,58.94,123.91,50.59,54.24,23.7,38.58,103.92,53.35,0


In [17]:
# changing the 1st column index to Date to set it as row index later
dft = dft.rename(columns={'Time': 'Date'})
# setting new row index
dft = dft.set_index('Date')
# now that we have the index, let's remove the first row (Time)
dft = dft.iloc[1:]
dft.head()

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00
2016-12-05 00:00:00,68.43,48.85,101.34,129.75,73.37,131.18,70.78,81.61,59.58,36.2,81.02,49.03,0.0
2016-12-06 00:00:00,41.83,95.09,115.09,60.77,105.77,177.0,41.1,62.75,90.57,81.64,65.03,20.1,0.0
2016-12-07 00:00:00,36.4,79.42,41.12,67.68,80.8,68.0,86.02,58.55,87.37,19.75,19.64,28.65,0.0
2016-12-08 00:00:00,53.84,36.44,80.89,63.36,58.94,123.91,50.59,54.24,23.7,38.58,103.92,53.35,0.0
2016-12-09 00:00:00,24.05,65.03,71.51,80.83,112.93,168.33,106.57,86.65,98.62,28.81,94.93,80.51,57.1


## Defining a Function to Clean the Data
The goal is to compile all of 2016's sales in one df and all of 2017's in another, with both starting on a Monday and no missing days in between. That way we can check their covariance/correlation and see whether the day of the week or the time of year matters.

In [18]:
# defining a function that takes in a raw monthly sheet and returns a cleaned dataframe
from functools import reduce

def clean_data(df):
    df = df.iloc[:, :8].copy()    # keep the first 8 columns, which drops the AVG column
    df = df.rename(columns={df.columns[0]: 'Time'})    # rename Unnamed: 0 to Time
    df = df[~df['Time'].isin(['AM', 'PM', 'Total'])]    # remove the AM, PM, and Total rows

    # remove empty rows by dropping rows where the Time column is NaN
    df = df.dropna(subset=['Time'])

    # the last block is the weekly average, so we drop its 14 rows, but only if the header row
    # of that block (row -14) has 'Monday' in the Monday column
    # (real weeks have a date there; only the average block has the weekday name)
    if df.iloc[-14]['Monday'] == 'Monday':
        df = df[:-14]

    # reset the index
    df = df.reset_index(drop=True)

    # make a wide table by merging the weeks from left to right to make a month
    # week 1 is rows 0~13 (df[0:14]), week 2 is 14~27, week 3 is 28~41, week 4 is 42~55
    # occasionally there is a week 5 (rows 56~69), which we add if len(df) > 60
    weeks = [df[:14], df[14:28], df[28:42], df[42:56]]
    if len(df) > 60:
        weeks.append(df[56:70])
    dfmerged = reduce(lambda left, right: pd.merge(left, right, on=['Time'], how='outer'), weeks)

    dft = dfmerged.T    # transpose so that each row is one day
    # fill the NaNs in columns 1~12 with the column mean
    for i in range(1, 13):
        dft[i] = dft[i].fillna(dft.iloc[1:, i].mean())
    # the store closes at 23:00, so we'll fill 0 for the NaNs in that column
    dft[13] = dft[13].fillna(0)
    dft.columns = dft.iloc[0]    # rename the columns using the first row
    dft = dft.rename(columns={'Time': 'Date'})    # rename the 1st column to Date to set it as row index later
    dft = dft.set_index('Date')    # set the row index
    dft = dft.iloc[1:]    # remove the first row (the old header)
    return dft

In [19]:
#read the rest of the sheets into dataframes
#make a month list to iterate
month_list = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

#dict comprehension
df_2016raw = {month:pd.read_excel(xls2016, month) for month in month_list}
df_2017raw = {month:pd.read_excel(xls2017, month) for month in month_list}

In [20]:
#combine all the dfs by year, but we have to clean them first
df2016 = pd.concat([clean_data(df_2016raw[month]) for month in month_list])
df2017 = pd.concat([clean_data(df_2017raw[month]) for month in month_list])

#then combine both years to df24 for 24 months, so we can check both df at once
df24 = pd.concat([df2016, df2017])
df24.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 728 entries, 2016-01-04 to 2017-12-31
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   11:00:00  728 non-null    object
 1   12:00:00  728 non-null    object
 2   13:00:00  728 non-null    object
 3   14:00:00  728 non-null    object
 4   15:00:00  728 non-null    object
 5   16:00:00  728 non-null    object
 6   17:00:00  728 non-null    object
 7   18:00:00  728 non-null    object
 8   19:00:00  728 non-null    object
 9   20:00:00  728 non-null    object
 10  21:00:00  728 non-null    object
 11  22:00:00  728 non-null    object
 12  23:00:00  728 non-null    object
dtypes: object(13)
memory usage: 79.6+ KB
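
The 728 entries are consistent with an unbroken run of days from 2016-01-04 (a Monday) to 2017-12-31. A minimal sketch of how that could be checked explicitly, assuming the index is a `DatetimeIndex`; `missing_days` is a hypothetical helper shown here on a toy index:

```python
import pandas as pd


def missing_days(index):
    """Return the calendar days between index.min() and index.max() that are absent."""
    full = pd.date_range(index.min(), index.max(), freq="D")
    return full.difference(index)


# toy index with 2016-01-06 left out on purpose
idx = pd.DatetimeIndex(["2016-01-04", "2016-01-05", "2016-01-07"])
gaps = missing_days(idx)
first_weekday = idx[0].day_name()
```

Applied to the compiled frame, `missing_days(df24.index)` being empty would confirm that no day was lost while stitching the months together.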


## Check the compiled df for errors, NaNs, and outliers

In [21]:
#change the Dtype to float
df2016 = df2016.apply(pd.to_numeric, errors='ignore')
df2017 = df2017.apply(pd.to_numeric, errors='ignore')
df24 = df24.apply(pd.to_numeric, errors='ignore')

#check df info again
df24.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 728 entries, 2016-01-04 to 2017-12-31
Data columns (total 13 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   11:00:00  728 non-null    float64
 1   12:00:00  728 non-null    float64
 2   13:00:00  728 non-null    float64
 3   14:00:00  728 non-null    float64
 4   15:00:00  728 non-null    float64
 5   16:00:00  728 non-null    float64
 6   17:00:00  728 non-null    float64
 7   18:00:00  728 non-null    float64
 8   19:00:00  728 non-null    float64
 9   20:00:00  728 non-null    float64
 10  21:00:00  728 non-null    float64
 11  22:00:00  728 non-null    float64
 12  23:00:00  728 non-null    float64
dtypes: float64(13)
memory usage: 79.6 KB
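
Here every column converted cleanly to `float64`. If a sheet ever contained a stray text cell, the column would silently stay `object`. A minimal sketch of one way to surface such cells, on a made-up frame; `raw`, `numeric`, and `bad_cells` are illustrative names:

```python
import pandas as pd

# toy frame with one cell that is not a number
raw = pd.DataFrame({"11:00:00": ["16.39", "22.45"], "12:00:00": ["36.27", "n/a"]})

# errors="coerce" turns anything unparseable into NaN instead of keeping the text
numeric = raw.apply(pd.to_numeric, errors="coerce")
# cells that were present before but are NaN now are the unparseable ones
bad_cells = numeric.isna() & raw.notna()
```

Inspecting `bad_cells` before filling or summing makes it clear whether a NaN came from the source sheet or from the conversion.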


In [22]:
# take a look at df24's .describe() to see if anything is unusual
df24.describe()

Time,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00
count,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0,728.0
mean,60.302299,91.899589,107.594729,124.13953,140.26731,138.627479,107.208178,94.913269,86.115463,92.958345,95.344627,69.273088,13.102898
std,30.832246,41.127044,42.789526,49.137206,49.455942,48.378577,37.37561,35.619215,33.676715,33.638262,34.705013,31.954989,24.577494
min,0.0,11.87,7.25,12.15,23.81,28.44,20.68,7.0,14.48,19.65,15.5,4.0,0.0
25%,39.1725,65.0525,75.4275,87.705,106.5075,103.325,80.7075,70.4775,60.72,69.4975,71.285,47.4475,0.0
50%,56.145,86.54,104.225,119.36,135.765,133.93,104.865,91.525,82.9,89.99,92.218148,65.855,0.0
75%,76.8025,114.6775,133.5575,153.24,166.1025,170.045,130.18,115.1525,105.95,111.945,116.9675,87.84,17.3325
max,226.18,406.42,254.23,301.34,358.54,292.96,264.7,256.9,271.32,272.32,215.14,237.25,177.46


In [23]:
#there are no unusual numbers, and the gaps were already filled in clean_data (0 at 23:00 means the store was closed)
#let's add a new column named 'Daily' that holds each day's total sales
df24['Daily'] = df24.sum(axis=1)
df2016['Daily'] = df2016.sum(axis=1)
df2017['Daily'] = df2017.sum(axis=1)
df24.head()

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00,Daily
2016-01-04,16.39,36.27,78.68,51.44,57.7,148.93,74.55,50.34,56.02,79.35,58.34,31.68,0.0,739.69
2016-01-05,22.45,27.75,7.25,30.64,100.67,149.72,43.14,68.53,93.65,75.9,27.45,31.7,0.0,678.85
2016-01-06,33.59,41.48,111.56,92.0,94.83,94.27,45.26,70.35,57.22,52.53,46.87,49.12,0.0,789.08
2016-01-07,8.4,23.3,54.49,42.28,116.13,101.65,52.04,47.96,128.0,77.01,91.02,75.42,0.0,817.7
2016-01-08,27.25,86.0,48.34,65.21,186.2,158.67,93.76,117.3,143.23,105.7,182.96,89.71,64.12,1368.45


In [24]:
#let's use describe to see the min/max and 75%
df24['Daily'].describe()

count     728.000000
mean     1221.746804
std       267.557733
min       465.070000
25%      1042.882500
50%      1213.970000
75%      1377.530000
max      2795.950000
Name: Daily, dtype: float64

In [25]:
#let's check the days whose daily sales are more than 3 standard deviations above the mean
df24.loc[df24['Daily'] > (df24['Daily'].mean()+3*df24['Daily'].std())]

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00,Daily
2017-08-31,150.21,228.03,142.35,176.46,321.18,256.74,150.99,238.12,105.22,131.27,192.93,92.6,0.0,2186.1
2017-09-01,224.96,264.04,210.29,213.52,253.81,216.95,230.17,210.36,271.32,272.32,205.51,162.31,60.39,2795.95
2017-09-02,173.97,282.15,182.63,242.13,312.23,292.96,187.25,256.9,150.05,166.69,149.25,176.22,85.13,2657.56
2017-10-06,141.3,216.29,167.76,260.65,282.04,204.99,181.84,106.83,142.5,162.91,128.98,93.37,58.04,2147.5
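
With the figures from `describe()` above, the cutoff works out to roughly 1221.75 + 3 × 267.56 ≈ 2024.42, which all four listed days exceed. The same rule can be wrapped in a small function; this is a sketch on made-up totals, and `high_outliers` is a hypothetical helper rather than part of the notebook:

```python
import pandas as pd


def high_outliers(series, k=3.0):
    """Return the entries lying more than k sample standard deviations above the mean."""
    threshold = series.mean() + k * series.std()
    return series[series > threshold]


# toy daily totals: twenty ordinary days and one spike
daily = pd.Series([1000.0] * 10 + [1200.0] * 10 + [5000.0])
spikes = high_outliers(daily)
```

Note that the mean and standard deviation are themselves pulled upward by the spikes, so a second pass after handling the first batch can reveal further candidates.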


In [26]:
# finding out the row index number, 241~243 is 8/31~9/2/2017.
df2017.iloc[240:245]

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00,Daily
2017-08-30,69.46,142.09,134.13,143.99,105.96,130.27,162.98,130.28,134.01,63.92,73.85,126.28,0.0,1417.22
2017-08-31,150.21,228.03,142.35,176.46,321.18,256.74,150.99,238.12,105.22,131.27,192.93,92.6,0.0,2186.1
2017-09-01,224.96,264.04,210.29,213.52,253.81,216.95,230.17,210.36,271.32,272.32,205.51,162.31,60.39,2795.95
2017-09-02,173.97,282.15,182.63,242.13,312.23,292.96,187.25,256.9,150.05,166.69,149.25,176.22,85.13,2657.56
2017-09-03,53.4,89.13,196.68,208.27,185.51,125.76,105.49,154.65,148.46,148.99,91.57,141.13,0.0,1649.04


In [27]:
# SF had Comic Con on 8/31~9/2, so far more people were in the city than usual;
# we'll replace those 3 days' hourly sales with the yearly hourly averages
# let's compute the average of each of the 13 hourly columns
hr_avg_2017 = df2017.iloc[:, :13].mean()

# 8/31 is a Thursday, so we'll only replace the values up to the 22:00 hour
df2017.iloc[241, :12] = hr_avg_2017.iloc[:12].values

# 9/1 and 9/2 are a Friday and a Saturday, when the store stays open until midnight,
# so all 13 values need to be replaced
df2017.iloc[242:244, :13] = hr_avg_2017.values

# we need to update the Daily column for those 3 rows as well
df2017.iloc[241:244, 13] = df2017.iloc[241:244, :13].sum(axis=1).values
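
Note that the yearly averages used above still include the three event days themselves. A minimal sketch of a variant that computes the baseline from ordinary days only, on made-up numbers; the toy frame, `event_days`, and `baseline` are illustrative, and this is not the notebook's method:

```python
import pandas as pd

# toy stand-in: five days, two hourly columns; the 2nd and 3rd days are event days
sales = pd.DataFrame(
    {"11:00:00": [10.0, 100.0, 120.0, 20.0, 30.0],
     "12:00:00": [40.0, 200.0, 220.0, 50.0, 60.0]},
    index=pd.date_range("2017-08-30", periods=5, freq="D"),
)
event_days = pd.to_datetime(["2017-08-31", "2017-09-01"])

# average each hour over the ordinary days only
baseline = sales.drop(index=event_days).mean()
# overwrite the event rows with that baseline, then compute the daily totals
sales.loc[event_days, :] = baseline.values
sales["Daily"] = sales.sum(axis=1)
```

Selecting rows by date rather than by position also keeps the replacement correct if rows are ever added or removed earlier in the year.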

In [28]:
# re-concatenate both years and check for outliers again
df24 = pd.concat([df2016, df2017])
df24.loc[df24['Daily'] > (df24['Daily'].mean()+3*df24['Daily'].std())]

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00,Daily
2016-05-01,77.55,123.77,185.81,221.38,220.77,259.26,187.89,147.74,157.63,180.57,134.91,84.68,0.0,1981.96
2017-09-26,184.95,190.8,162.99,197.43,190.33,287.42,119.08,190.94,159.76,107.04,124.28,95.07,0.0,2010.09
2017-10-06,141.3,216.29,167.76,260.65,282.04,204.99,181.84,106.83,142.5,162.91,128.98,93.37,58.04,2147.5


Unlike the event dates above, these three days are not consecutive, so the spikes may simply reflect unusually hot weather:\
5/01 2016 (Sun): SF high of 81F\
9/26 2017 (Tue): SF high of 86F\
10/6 2017 (Fri): SF high of 81F

In [29]:
df24.head()

Date,11:00:00,12:00:00,13:00:00,14:00:00,15:00:00,16:00:00,17:00:00,18:00:00,19:00:00,20:00:00,21:00:00,22:00:00,23:00:00,Daily
2016-01-04,16.39,36.27,78.68,51.44,57.7,148.93,74.55,50.34,56.02,79.35,58.34,31.68,0.0,739.69
2016-01-05,22.45,27.75,7.25,30.64,100.67,149.72,43.14,68.53,93.65,75.9,27.45,31.7,0.0,678.85
2016-01-06,33.59,41.48,111.56,92.0,94.83,94.27,45.26,70.35,57.22,52.53,46.87,49.12,0.0,789.08
2016-01-07,8.4,23.3,54.49,42.28,116.13,101.65,52.04,47.96,128.0,77.01,91.02,75.42,0.0,817.7
2016-01-08,27.25,86.0,48.34,65.21,186.2,158.67,93.76,117.3,143.23,105.7,182.96,89.71,64.12,1368.45


In [30]:
# output the dfs to csv files in data/processed folder
df24.to_csv('C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data\\processed\\df24.csv')
df2016.to_csv('C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data\\processed\\df2016.csv')
df2017.to_csv('C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data\\processed\\df2017.csv')

In [31]:
# let's build an hour-by-hour time series of sales from the compiled df
df2 = pd.read_csv('C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data\\processed\\df24.csv')
df2.drop(columns='Daily', inplace=True)
df3 = pd.melt(df2, id_vars=['Date'])

In [32]:
# build a full timestamp from the date and the hour label in 'variable' (e.g. '11:00:00')
# (the original format='%Y%m%d%h' is dropped: '%h' is not a valid hour directive)
df3['Date'] = pd.to_datetime(df3['Date'] + ' ' + df3['variable'])
# now that the timestamp is complete, let's drop the 'variable' column
df3.drop(columns=['variable'], inplace=True)
# set the Date column as index
df3.set_index('Date', inplace=True)
# change 'value' to sales
df3.rename({'value': 'sales'}, axis=1, inplace=True)
# sort index
df3 = df3.sort_index()
# since it's hour by hour now, let's drop the rows with 0 sales, when the store is closed
df3 = df3[df3['sales'] != 0]
display(df3.min())
df3.head()

sales    3.9
dtype: float64

Date,sales
2016-01-04 11:00:00,16.39
2016-01-04 12:00:00,36.27
2016-01-04 13:00:00,78.68
2016-01-04 14:00:00,51.44
2016-01-04 15:00:00,57.7
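
The wide-to-long reshaping can be checked on a tiny frame before running it on all 24 months. This is a sketch on made-up rows, assuming the `Date` column holds `YYYY-MM-DD` strings and the other column names are hour labels; `wide`, `long`, and `hourly` are illustrative names:

```python
import pandas as pd

# toy wide frame: one row per day, one column per opening hour
wide = pd.DataFrame(
    {"Date": ["2016-01-04", "2016-01-05"],
     "11:00:00": [16.39, 22.45],
     "23:00:00": [0.0, 64.12]}
)

# melt to one row per (day, hour), then combine both parts into a timestamp
long = wide.melt(id_vars="Date", var_name="hour", value_name="sales")
long["Date"] = pd.to_datetime(long["Date"] + " " + long["hour"])
hourly = long.set_index("Date")["sales"].sort_index()
# hours with zero sales are treated as closed and dropped
hourly = hourly[hourly != 0]
```

One caveat with the zero filter: an hour when the store was open but sold nothing is dropped as well, so it cannot be told apart from a closed hour afterwards.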


In [33]:
# finally export the file
df3.to_csv('C:\\Users\\tc18f\\Desktop\\springboard\\Capstone Two\\data\\processed\\dfts.csv')