Line 1: Import pandas (a Python package for working with data), seaborn (a Python graphing library), matplotlib (a Python plotting library), and datetime (used to format dates). The csv and numpy packages are imported as well. Seaborn generates warnings that we don't want to see, so a call to the "warnings" module suppresses them.

In [None]:
import csv
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import warnings 
warnings.filterwarnings("ignore")
import seaborn as sns
import datetime as dt

Line 2: Import the Excel data, saved as a CSV file, into the variable ‘data’. For working on my computer, the code is:
data = pd.read_csv('Datamart-Export_DY_WK100-500 Pound Barrel Cheddar Cheese Prices, Sales, and Moisture Content_20170829_122601.csv')
But once uploaded to Kaggle it needs to be:
data = pd.read_csv('../input/Datamart-Export_DY_WK100-500 Pound Barrel Cheddar Cheese Prices, Sales, and Moisture Content_20170829_122601.csv')

In [None]:
data = pd.read_csv('../input/Datamart-Export_DY_WK100-500 Pound Barrel Cheddar Cheese Prices, Sales, and Moisture Content_20170829_122601.csv')

Line 3: The next few lines are ‘standard’ ways to check the data & take a quick look at it. First the ‘shape’: the number of rows & columns.

In [None]:
data.shape

Line 4: The ".head" command shows the first few rows of the data. The 10 specifies the number of rows. If the () is blank, the default is 5 rows.

In [None]:
data.head(10)

Line 5: Verify that we are working with a data frame. We ‘know’ we have a data frame; type(data) verifies that the data was read in as one.

In [None]:
type(data)

Line 6: This is new code I found: ‘.dtypes’. It returns the data type of each column in the data frame. This tells me the 3 ‘date’ fields aren’t dates & that the sales field is not an integer (or float).

In [None]:
data.dtypes

Line 7: See if we have any missing data by asking whether any column contains null values. If a column had null values we would get ‘True’ for it, but since everything is ‘False’ we don’t have any null values. However, we do know there are a handful of zero values (which are probably missing data).

In [None]:
data.isnull().any()
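
Since zeros are not counted as nulls, a separate count can show how many there are. This is only a sketch on made-up rows (the values below are not from the real file); the column names follow the notebook.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the notebook.
df = pd.DataFrame({
    'Weighted Price': [1.52, 0.0, 1.61, 1.58],
    'Moisture Content': [34.8, 0.0, 34.9, 35.1],
})

null_counts = df.isnull().sum()  # missing values per column
zero_counts = df.eq(0).sum()     # literal zeros per column

print(null_counts)
print(zero_counts)
```

On the real data the same two lines would be run on ‘data’ instead of the toy frame.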

Line 8: Use data.describe() to see an overview of the data (mean, median, max, min, etc.) for each column; the median appears as the 50% row. This gives us information on the numeric columns, but the columns stored as text (the 3 date columns & the sales column) are not included. This is an interesting feature that gives us some statistical information about our data. We see that zero is listed as a minimum (from the zeros in our data).

In [None]:
data.describe()
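
If a summary of the text columns is wanted too, describe() accepts include='all'. A small sketch on a made-up frame (the values are invented; only the column names come from the notebook):

```python
import pandas as pd

# Made-up rows; 'Date' and 'Sales' are text, as in the raw file.
df = pd.DataFrame({
    'Date': ['08/26', '08/19', '08/12'],
    'Sales': ['1,000', '2,500', '1,000'],   # text because of the commas
    'Weighted Price': [1.52, 1.61, 1.58],
})

numeric_only = df.describe()             # default: numeric columns only
everything = df.describe(include='all')  # adds count/unique/top/freq for text columns

print(numeric_only)
print(everything)
```

For text columns the extra rows are count, unique, top and freq; the numeric statistics are left blank (NaN) for them.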

Line 9: Here is where I begin working to get usable dates. (Again, the ‘ugly’ files show many of the attempts I made that didn’t work.) First I took the ‘Date’ column, saved it as a new ‘mo_day’ column, and added 1999 to the end (for a full date format). data.head() then shows us the data.

In [None]:
data['mo_day']=data['Date'].astype(str)+'-1999'
data.head()

Line 10: Extract the year from the Week Ending Date.

In [None]:
data['Week Ending Date'] = pd.to_datetime(data['Week Ending Date'])

data['year'] = data['Week Ending Date'].dt.year
data.head(5)
data.tail(5)

Line 11: Split the date into the month and day from the date format we forced into ‘mo_day’.

In [None]:
data['mo_day'] = pd.to_datetime(data['mo_day'])
data['month'],data['day'] = data['mo_day'].dt.month, data['mo_day'].dt.day 
data.head(10)
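
pd.to_datetime can also be told the exact layout of the text with the format argument, which avoids any guessing about which part is the month. The sketch below assumes the ‘Date’ values look like ‘MM/DD’; that layout is my assumption for illustration, and the sample values are made up.

```python
import pandas as pd

# Hypothetical sample values; the real 'Date' column may use another layout.
dates = pd.Series(['08/26', '12/31', '01/07'])

# Append the placeholder year, then parse with an explicit format string.
mo_day = pd.to_datetime(dates + '-1999', format='%m/%d-%Y')
month = mo_day.dt.month
day = mo_day.dt.day

print(month.tolist(), day.tolist())
```

One thing to watch: 1999 is not a leap year, so a ‘02/29’ entry would fail to parse with this placeholder. A leap year such as 2000 would accept it.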

Line 12: Look at ‘.dtypes’ again to see what fields we have been able to modify.

In [None]:
data.dtypes

Line 13: Put the year, month & day back together again as a new column, ‘Date_yr’. Other than changing the ‘type’ of data, we haven’t changed any of the original data.

In [None]:
data['Date_yr'] = data['year'].map(str)+'-' + data['month'].map(str) +'-'+ data['day'].map(str)     
data.head(5)

Line 14: Convert our new ‘Date_yr’ into a date format.

In [None]:

data['Date_yr'] = pd.to_datetime(data['Date_yr'])
data['Week Ending Date'] = pd.to_datetime(data['Week Ending Date'])
data.head(5)
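
As a possible shortcut, pd.to_datetime can assemble a date straight from columns named year, month and day, which skips the string-joining step. A minimal sketch on made-up values:

```python
import pandas as pd

# Made-up values; the column names match the ones created in the notebook.
df = pd.DataFrame({'year': [2017, 2016], 'month': [8, 12], 'day': [26, 31]})

# to_datetime recognises the 'year', 'month' and 'day' column names.
df['Date_yr'] = pd.to_datetime(df[['year', 'month', 'day']])

print(df)
```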


Line 15: Again use  ‘.dtypes’ to check our field types.

In [None]:
data.dtypes

Line 16: Convert sales to an integer. First we must remove the commas (,) from the field.

In [None]:
data['Sales'] = data['Sales'].str.replace(',', '')
data['Sales'] = pd.to_numeric(data['Sales'])          
data.dtypes
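
An alternative, sketched here on an in-memory CSV with made-up numbers, is to let read_csv strip the commas while loading by passing thousands=','. The two-column layout is only for illustration.

```python
import io
import pandas as pd

# Made-up CSV text standing in for the real file.
csv_text = 'Date,Sales\n08/26,"1,234,567"\n08/19,"987,654"\n'

# thousands=',' tells the parser to treat commas inside numbers as separators.
df = pd.read_csv(io.StringIO(csv_text), thousands=',')

print(df.dtypes)
print(df['Sales'].tolist())
```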

Line 17: Now that we have a date format in both the ‘Week Ending Date’ and ‘Date_yr’ columns, we can take the difference to find the age of the cheese. This actually took quite a bit of searching for the right code. Since I was subtracting 2 dates, the result came back as a time difference (‘0 days’ or ‘7 days’, etc.). But I needed this number to be an integer, which is what ‘.dt.days’ provides.

In [None]:
data['age']=(data['Week Ending Date']-data['Date_yr']).dt.days
data.head(10)
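
To see the two steps separately, here is a small sketch on made-up dates: the subtraction gives a timedelta column, and ‘.dt.days’ turns it into plain integers.

```python
import pandas as pd

# Made-up dates; column names follow the notebook.
df = pd.DataFrame({
    'Week Ending Date': pd.to_datetime(['2017-08-26', '2017-08-26']),
    'Date_yr': pd.to_datetime(['2017-08-26', '2017-08-12']),
})

diff = df['Week Ending Date'] - df['Date_yr']  # timedelta values such as '14 days'
df['age'] = diff.dt.days                       # whole days as integers

print(diff.dtype)
print(df['age'].tolist())
```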

Line 18: Once again, use ‘.dtypes’ to check our field types.

In [None]:
data.dtypes

Line 19: We want to look at our data (specifically, January). Since we had to force a year based on the week ending date into the ‘prior week(s)’ data for the age, the December dates reported in January should belong to the prior year. We see where this results in a negative value for the age.

In [None]:
data.iloc[140:170,]

Line 20: To correct these we’ll add 365 back in. The code to modify the dates and arrive at the age looks really simple (now), but it took me over 3 hours to just figure out how to get the age. The lesson: Don’t despair when working with different methods to clean the data.

In [None]:
# .loc updates the column directly and avoids chained assignment
data.loc[data['age'] < 0, 'age'] += 365
data.iloc[140:170,]
#finally 3 hours!
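
A flat 365 is one day short whenever the forced year is a leap year, which might be related to the odd ages that show up in the 2016 data (I have not confirmed that against the real file). A possible alternative, sketched on made-up dates, is to move any ‘Date_yr’ that lands after the week ending date back by one calendar year and only then compute the age.

```python
import pandas as pd

# Made-up rows; the middle one falls in a leap year (2016).
df = pd.DataFrame({
    'Week Ending Date': pd.to_datetime(['2017-01-07', '2016-01-02', '2016-01-02']),
    'Date_yr': pd.to_datetime(['2017-12-31', '2016-12-26', '2016-01-02']),
})

# Flat correction as in the notebook: add 365 to negative ages.
naive = (df['Week Ending Date'] - df['Date_yr']).dt.days
naive = naive.where(naive >= 0, naive + 365)

# Alternative: shift dates that fall after the week ending date back one year.
in_future = df['Date_yr'] > df['Week Ending Date']
df.loc[in_future, 'Date_yr'] = df.loc[in_future, 'Date_yr'] - pd.DateOffset(years=1)
df['age'] = (df['Week Ending Date'] - df['Date_yr']).dt.days

print(naive.tolist())
print(df['age'].tolist())
```

On these toy rows the flat correction gives 7, 6 and 0 days, while the calendar-year shift gives 7, 7 and 0.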

Line 21: We can drop the month and day columns we created; we have the date format we needed.

In [None]:
df1=data
del df1['month']
del df1['day']
df1.head(5)


Line 22: Index the data by the year.

In [None]:
years = df1.set_index("year")
years.head()
years.tail()

Line 23: Separate the data into smaller year ‘chunks’ so we can take a look at it.

In [None]:
#2017 data
df2017=df1.iloc[0:165,]  #line 165 is 2016 so we need 1 more than line 164 ie 165
#df2017.head()
df2017.tail()

Line 24: Separate the 2016 data. I look at both the head and the tail. A cell only displays its last command, so I comment out either head or tail with a ‘#’ to see the other one.

In [None]:
df2016=df1.iloc[165:430,]  #line 165 is the first 2016 row; line 430 is 2015 so we need 1 more than line 429 ie 430
#df2016.head()
df2016.tail(7)

Line 25:  The 2015 data.

In [None]:
df2015=df1.iloc[430:690,]  #line 690 is 2014 so we need 1 more than line 689 ie 690
#df2015.head()
df2015.tail(7)

Line 26:  The 2014 data

In [None]:
df2014=df1.iloc[690:950,]  #line 950 is 2013
#df2014.head()
df2014.tail(7)
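
Counting row positions by hand is easy to get wrong by one. A possible alternative, sketched on a made-up frame, is to filter on the ‘year’ column instead, or to let groupby split every year at once. Note that ‘year’ here comes from the week ending date, so the chunks may not match the hand-picked slices row for row.

```python
import pandas as pd

# Made-up rows; 'year' and 'Sales' follow the notebook's column names.
df = pd.DataFrame({
    'year': [2017, 2017, 2016, 2016, 2015],
    'Sales': [10, 12, 9, 11, 8],
})

# One year at a time, using a boolean filter instead of row positions.
df2017 = df[df['year'] == 2017].copy()

# Or every year in one go, as a dictionary of data frames keyed by year.
by_year = {year: chunk for year, chunk in df.groupby('year')}

print(df2017)
print(sorted(by_year))
```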

Line 27: We have the data broken into a few smaller chunks so we can take a look at the data.  This first graph shows why we needed to separate out the ‘age.’  The multiple prices tied to the Week Ending Date don’t give us a concise graph.

In [None]:
test = df2017.copy()  # copy so the original 2017 slice is left untouched
test['Week Ending Date'] = pd.to_datetime(test['Week Ending Date'])
test = test.set_index('Week Ending Date')
ax = test.plot(title='sales per week')  # plot the columns against the date index
plt.show()

Line 28: The 2017 data based on sales; showing all 5 age lines.

In [None]:
fig, ax = plt.subplots()
for name, group in df2017.groupby('age'):
    group.plot('Week Ending Date', y='Sales', ax=ax, label=name)
    ax.set_title('Sales 2017')
plt.show()
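
Another way to get one line per age, sketched here on made-up numbers, is to pivot the data so each age becomes its own column. The plot call itself is left out of the sketch so it runs without a display.

```python
import pandas as pd

# Made-up rows: two weeks, two ages each.
df = pd.DataFrame({
    'Week Ending Date': pd.to_datetime(
        ['2017-08-19', '2017-08-19', '2017-08-26', '2017-08-26']),
    'age': [0, 7, 0, 7],
    'Sales': [100, 110, 120, 130],
})

# Rows become weeks, columns become ages, cells hold the Sales value.
wide = df.pivot(index='Week Ending Date', columns='age', values='Sales')

print(wide)
```

Calling wide.plot(title='Sales 2017') on the result should then draw one line per age column. pivot needs each week/age pair to appear only once, which I am assuming holds for the real data.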

Line 29: The 2017 data based on moisture content; showing all 5 age lines.

In [None]:
fig, ax = plt.subplots()
for name, group in df2017.groupby('age'):
    group.plot('Week Ending Date', y='Moisture Content', ax=ax, label=name)
    ax.set_title('Moisture Content 2017')
plt.show()

Line 30: The 2017 data based on price; showing all 5 age lines. If we compare these 3 preliminary graphs, we notice that sales increase when the price is lowest.

In [None]:
fig, ax = plt.subplots()
for name, group in df2017.groupby('age'):
    group.plot('Week Ending Date', y='Weighted Price', ax=ax, label=name)
    ax.set_title('Weighted Price per week 2017')
plt.show()
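
Reading this off the graphs is a judgment call. One way to put a number on it is the correlation between price and sales. The sketch below uses made-up values purely to show the call; it says nothing about the real 2017 figures.

```python
import pandas as pd

# Made-up values chosen so that sales fall as price rises.
df = pd.DataFrame({
    'Weighted Price': [1.50, 1.60, 1.70, 1.80],
    'Sales': [400, 380, 350, 300],
})

# Pearson correlation; a value near -1 means sales drop as price goes up.
r = df['Weighted Price'].corr(df['Sales'])

print(r)
```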

Line 31: The 2016 graph shows some problems with the age data. We’re going to skip over this 2016 data for now; it needs more data cleaning, which is left for another project.

In [None]:
fig, ax = plt.subplots()
for name, group in df2016.groupby('age'):
    group.plot('Week Ending Date', y='Weighted Price', ax=ax, label=name)
    ax.set_title('Weighted Price per week 2016')
plt.show()

Line 32: The 2015 data based on sales; showing all 5 age lines.  This data has quite a few swings, so we’ll increase our figure size to get a better look.

In [None]:
fig, ax = plt.subplots()
for name, group in df2015.groupby('age'):
    group.plot('Week Ending Date', y='Sales', ax=ax, label=name, figsize=(12,4))
    ax.set_title('Sales 2015')
plt.show()

Line 33: The 2015 data based on price; showing all 5 age lines.  Unlike the 2017 data, we don’t see an increase in demand at a lower price.   

In [None]:
fig, ax = plt.subplots()
for name, group in df2015.groupby('age'):
    group.plot('Week Ending Date', y='Weighted Price', ax=ax, label=name, figsize=(12,4))
    ax.set_title('Weighted Price per week 2015')
plt.show()

Line 34: Let’s take the same look at the 2014 data, starting with the 2014 sales.

In [None]:
fig, ax = plt.subplots()
for name, group in df2014.groupby('age'):
    group.plot('Week Ending Date', y='Sales', ax=ax, label=name, figsize=(12,4))
    ax.set_title('Sales 2014')
plt.show()

Line 36: Unlike the 2017 price data, we don’t see an increase in demand at a lower price. The above graphs also show that the pricing (regardless of the age) follows the same trajectory.

In [None]:
fig, ax = plt.subplots()
for name, group in df2014.groupby('age'):
    group.plot('Week Ending Date', y='Weighted Price', ax=ax, label=name, figsize=(12,4))
    ax.set_title('Weighted Price per week 2014')
plt.show()

Line 37: We will take a different direction and look at the data by the age. We will look at the 14 day old cheese data: index by the age and pull out all the rows with an age of 14. Again, this code looks pretty simple, but it entailed many attempts to figure out how to get this.

In [None]:
df1.set_index(keys=['age'], drop=False,inplace=True)
ages=df1['age'].unique().tolist()
df1_14 = df1.loc[df1.age==14]               
df1_14.head(5)
#df1_14.tail(5)

Line 38: We’ll check the shape of this data.

In [None]:
df1_14.shape


Line 39: This graph compares all the 14 day prices in our data set. We see how those ‘zero’ prices affect our graph.

In [None]:
fig, ax = plt.subplots()
for name, group in df1_14.groupby('year'):
    group.plot('Date', y='Weighted Price', ax=ax, label=name,figsize=(12,4))
    ax.set_title('Weighted price 14 day age cheddar')
plt.show()
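
If the zero prices really are missing values, one option is to blank them out before plotting, so the line shows a gap instead of a drop to zero. A sketch on a made-up series:

```python
import pandas as pd

# Made-up prices with one zero standing in for a missing value.
prices = pd.Series([1.52, 0.0, 1.61], name='Weighted Price')

# mask() replaces the entries where the condition is True with NaN.
cleaned = prices.mask(prices == 0)

print(cleaned)
```

Whether the zeros should be treated as missing is a judgment about the data, not something the code can decide.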

Line 40: When we compare the sales for the 14 day cheddar we see a relatively steady demand over time.

In [None]:
fig, ax = plt.subplots()
for name, group in df1_14.groupby('year'):
    group.plot('Date', y='Sales', ax=ax, label=name,figsize=(12,4))
    ax.set_title('Sales 14 day age cheddar')
plt.show()

Line 41: Let’s see if the 28-day cheddar shows us the same information. We’ll create a 28 day dataset.

In [None]:
df1.set_index(keys=['age'], drop=False,inplace=True)
ages=df1['age'].unique().tolist()
df1_28 = df1.loc[df1.age==28]        
df1_28.head(5)
#df1_28.tail(5)

Line 42: This graph compares all the 28 day prices in our data set. Again we see the drop with the ‘zero’ prices.

In [None]:
fig, ax = plt.subplots()
for name, group in df1_28.groupby('year'):
    group.plot('Date', y='Weighted Price', ax=ax, label=name,figsize=(12,4))
    ax.set_title('Weighted price 28 day age cheddar')
plt.show()

Line 43: This shows us what we would expect: the same relatively steady demand over time for 28 day cheddar.

In [None]:
fig, ax = plt.subplots()
for name, group in df1_28.groupby('year'):
    group.plot('Date', y='Sales', ax=ax, label=name,figsize=(12,4))
    ax.set_title('Sales 28 day age cheddar')
plt.show()

Line 44:  We can also compare a limited number of years.  Here is a data set for 14 day cheddar  for 2013 to 2015.

In [None]:
# Filter on the 'year' column directly; also making 'year' the index would make groupby('year') ambiguous in current pandas
years_14=df1_14['year'].unique().tolist()
df1_14_2013_2015 = df1_14.loc[(df1_14.year>=2013) & (df1_14.year<=2015)]
df1_14_2013_2015.head(5)
df1_14_2013_2015.tail(5)

Line 45: With fewer data points (only 3 years), this graph is much less congested and easier to understand.

In [None]:
fig, ax = plt.subplots()
for name, group in df1_14_2013_2015.groupby('year'):
    group.plot('Date', y='Weighted Price', ax=ax, label=name,figsize=(12,4))
    ax.set_title('Weighted price 2013 vs 2015')
plt.show()


Line 46:  This data is best viewed as time series (line) graphs.  The Sales data can be plotted as a bar graph.

In [None]:
bar_width = 0.4
# Sum only the Sales column; the date and text columns cannot be totalled
bar = df1_14_2013_2015.groupby("year")["Sales"].sum().plot(kind='bar', width=bar_width)
bar.set_xlabel("year")
bar.set_ylabel("Sales")
#plt.legend()
#plt.legend.remove()
plt.legend().set_visible(False)
plt.title('Total Sales')
plt.show()