# SALES FORECASTING - EDA

# DATA SOURCE

This dataset comes from a top Brazilian retailer and covers many SKUs and many stores.

# PROBLEM STATEMENT

Our main objective is to predict a store's sales for a given week. Because the dataset provides size- and time-related features, we analyze whether sales are affected by time-based and space-based factors.

Most importantly, we want to see how holidays falling within a week boost the store's sales.

# WORK FLOW

1. Gathering relevant data.
2. Data Cleaning 
3. Data Exploration
4. Relationship between Variables
5. Data Visualization

# Import Libraries and Data

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
sales = pd.read_csv('retailSaleForecasting.csv')
sales.head()


In [None]:
sales = sales.rename(columns={'data' : 'Date', 'venda' : 'Sales', 'estoque' : 'Stock', 'preco' : 'Price'})
print(sales.shape)
sales.head()

# Data Cleaning

Let's make sure that there aren't any null values, or any unusable data.

In [None]:
sales.isnull().sum()
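
A minimal sketch of what such a check might cover, using a small made-up frame rather than the real file. The column names follow the renamed columns above; the sample values, and the rule that negative values count as unusable, are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing value and one negative price (illustrative only)
sample = pd.DataFrame({
    'Sales': [10, 0, 5, np.nan],
    'Stock': [100, 90, 85, 80],
    'Price': [1.29, 1.29, -1.0, 1.39],
})

# Missing values per column
null_counts = sample.isnull().sum()

# Rows with any negative value, assumed here to be unusable
negative_rows = (sample[['Sales', 'Stock', 'Price']] < 0).any(axis=1)

print(null_counts)
print('Rows with negative values:', int(negative_rows.sum()))

# Keep only complete, non-negative rows
clean = sample[~negative_rows].dropna()
```

Whether to drop or repair such rows depends on the data; dropping is only the simplest option.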

In [None]:
sales.dtypes

Convert Date from object to datetime.

In [None]:
sales['Date'] = pd.to_datetime(sales['Date'], format='%d-%m-%Y')
sales.dtypes
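
To illustrate why the explicit format matters, here is a small sketch on made-up date strings. It assumes the dates are stored day-month-year, as the format string in the cell above implies; the real file may differ.

```python
import pandas as pd

# Made-up date strings, assumed to be day-month-year like the format above
raw = pd.Series(['01-02-2014', '15-02-2014', '03-12-2015'])

parsed = pd.to_datetime(raw, format='%d-%m-%Y')

# '01-02-2014' is read as 1 February 2014, not 2 January 2014
print(parsed.dt.month.tolist())
print(parsed.dt.day.tolist())
```

Passing the format explicitly avoids ambiguous day/month guessing and raises an error if a value does not match.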

In [None]:
sales = sales.assign(Day = sales.Date.dt.day,
               Month = sales.Date.dt.month,
               Year = sales.Date.dt.year)
sales['Revenue'] = sales['Sales'] * sales['Price']
sales.dtypes

In [None]:
sales.Year.value_counts()

# Data Exploration

In [None]:
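# numeric_only=True skips the datetime Date column (needed on recent pandas versions)
sales.corr(numeric_only=True)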

1. Sales shows some positive correlation with Year and a slight positive correlation with Stock. More stock appears to go along with more sales, and sales have increased year over year.

2. Stock shows some negative correlation with Month and a slight negative correlation with Year. Stock seems to diminish as the year goes on, and each year has less stock available than the previous one.

3. Price correlates noticeably only with Year: it rises moderately as time goes by, perhaps in connection with the lower stock levels in later years.

4. Revenue, unsurprisingly, is very highly correlated with Sales: you can't make money if you don't sell product. It is also moderately correlated with Price, but price matters far less than the number of sales in determining revenue.
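
The revenue observation can be reproduced on synthetic data. The sketch below is not the retailer's data: the Poisson sales and near-constant price are assumptions chosen so that sales vary much more, in relative terms, than price does.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 365

# Synthetic stand-in for the real data (all values are made up)
demo = pd.DataFrame({
    'Sales': rng.poisson(lam=90, size=n),
    'Price': rng.normal(loc=1.5, scale=0.05, size=n).round(2),
})
demo['Revenue'] = demo['Sales'] * demo['Price']

# Pearson correlation of every numeric column with Revenue, strongest first
corr_with_revenue = (
    demo.corr(numeric_only=True)['Revenue']
    .drop('Revenue')
    .sort_values(ascending=False)
)
print(corr_with_revenue)
```

When one factor of a product varies far more than the other, the product tracks that factor closely, which is consistent with revenue following sales more than price.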



--> Changes in Variables by Date

In [None]:
fig, axs = plt.subplots(4, figsize = [14, 14])
fig.suptitle('Variables per Year', fontsize = 18)
axs[0].plot(sales.Date, sales.Price, color = 'gold')
axs[0].set_title('Price', fontsize = 14)
axs[1].plot(sales.Date, sales.Stock, color = 'red')
axs[1].set_title('Stock', fontsize = 14)
axs[2].plot(sales.Date, sales.Sales, color = 'black')
axs[2].set_title('Sales', fontsize = 14)
axs[3].plot(sales.Date, sales.Revenue, color = 'green')
axs[3].set_title('Revenue', fontsize = 14)

1. Price is the most stable of the variables, with few sharp peaks and troughs; it is better described as a series of plateaus.

2. Stock shows nothing extraordinary: it climbs steeply when large shipments or orders are received, then dwindles as sales are made.

3. Sales is the most volatile of the variables. It will be worth looking into specific days of the week to see which days are the most popular.

From the above plots, there appears to be no clear trend, but there is seasonality.
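
The day-of-week check suggested in point 3 could be sketched as follows. The series here is synthetic, with a weekend bump built in on purpose, so the numbers say nothing about the real stores; only the grouping pattern is the point.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic daily series with a built-in weekend bump (made-up numbers)
dates = pd.date_range('2014-01-01', periods=364, freq='D')
weekend = dates.dayofweek >= 5
demo = pd.DataFrame({
    'Date': dates,
    'Sales': rng.poisson(lam=80, size=len(dates)) + np.where(weekend, 40, 0),
})

# Average sales per weekday, ordered Monday to Sunday
order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
by_weekday = (
    demo.groupby(demo['Date'].dt.day_name())['Sales']
    .mean()
    .reindex(order)
)
print(by_weekday.round(1))
```

On the real data, the same grouping applied to `sales` would show whether any weekday stands out.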

In [None]:
# Plot ACF and PACF

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

In [None]:
plot_acf(sales['Sales'], lags = 30, title='Autocorrelation for Sales', zero=False, auto_ylims=True)

# The ACF plot shows 10 significant lags

In [None]:
plot_pacf(sales['Sales'], lags = 30, title='Partial Autocorrelation for Sales', zero=False, auto_ylims=True)


# The PACF plot shows 1 significant lag
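
As a rough illustration of what "significant lag" means, the sketch below simulates an AR(1) series and compares its sample autocorrelations with the common white-noise rule of thumb, ±1.96/√N. The coefficient `phi` and the series are assumptions, `Series.autocorr` is a plain Pearson correlation with the shifted series, and the shaded band drawn by statsmodels may be computed differently.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000
phi = 0.7  # assumed AR(1) coefficient for the synthetic series

# Simulate an AR(1) process: x[t] = phi * x[t-1] + noise
noise = rng.normal(size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + noise[t]
series = pd.Series(x)

# Sample autocorrelation at lags 1..10
acf = {lag: series.autocorr(lag=lag) for lag in range(1, 11)}

# Approximate 95% bound for a white-noise series
bound = 1.96 / np.sqrt(n)
significant = [lag for lag, r in acf.items() if abs(r) > bound]
print(significant)
```

For an AR(1) process the autocorrelation decays gradually over several lags while the partial autocorrelation cuts off after lag 1, which is the same qualitative pattern noted in the two comments above.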

--> Data by Days

In [None]:
# Average each variable by day of the month, selecting the columns before aggregating
dsales = sales.groupby('Day')[['Sales', 'Stock', 'Price', 'Revenue']].mean()

In [None]:
fig, axs = plt.subplots(2, 2 , figsize = [14, 14])
fig.suptitle('Variables Over the Month', fontsize = 18)
axs[0, 0].plot(dsales.index, dsales.Sales, color = 'black')
axs[0, 0].set_title('Sales', fontsize = 14)
axs[0, 1].plot(dsales.index, dsales.Stock, color = 'red')
axs[0, 1].set_title('Stock', fontsize = 14)
axs[1, 0].plot(dsales.index, dsales.Price, color = 'gold')
axs[1, 0].set_title('Price', fontsize = 14)
axs[1, 1].plot(dsales.index, dsales.Revenue, color = 'green')
axs[1, 1].set_title('Revenue', fontsize = 14)

In [None]:
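# Work on a copy so the rescaling below does not modify dsales
dsm = dsales.copy()
# Hand-picked multipliers that bring the series to a comparable scale for plotting
dsm['Price'] = dsm['Price'] * 1000
dsm['Sales'] = dsm['Sales'] * 15
dsm['Revenue'] = dsm['Revenue'] * 10
dsm = dsm.reset_index()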

In [None]:
plt.figure(figsize=(16,8))
plt.title('Relationship Between Variables', fontsize = 18)
plt.plot('Day', 'Sales', data = dsm, color='black', linewidth=2, label = "Sales")
plt.plot('Day', 'Price', data = dsm, color='gold', linewidth=2, label = "Price")
plt.plot('Day', 'Stock', data = dsm, color='red', linewidth=2, label="Stock")
plt.plot('Day', 'Revenue', data = dsm, color='green', linewidth=2, label="Revenue")
plt.yticks([])
plt.legend(fontsize = 12)
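
The multipliers used above to put the four series on one axis are hand-picked. One alternative, sketched here on made-up numbers, is min-max scaling, which maps every column onto [0, 1] without per-column constants. This is a suggestion, not what the notebook does.

```python
import pandas as pd

# Made-up daily averages on very different scales (illustrative only)
demo = pd.DataFrame({
    'Day': [1, 2, 3, 4],
    'Sales': [120.0, 95.0, 80.0, 90.0],
    'Stock': [1500.0, 1400.0, 1650.0, 1700.0],
    'Price': [1.45, 1.40, 1.38, 1.42],
}).set_index('Day')

# Min-max scaling: map each column onto [0, 1] independently
scaled = (demo - demo.min()) / (demo.max() - demo.min())
print(scaled.round(2))
```

A column that never changes would produce a division by zero (NaN values) and would need separate handling.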

1. Sales (and, therefore, revenue) peak at the beginning of the month, with a minor uptick later in the month.

2. Stock tends to be depleted towards the beginning of the month, in two noticeable waves, before slowly refilling throughout the rest of the month.

3. Price tends to be slightly higher at the beginning and end of the month, with a slump towards the middle.

--> Data by Month

In [None]:
# Average each variable by month, selecting the columns before aggregating
msales = sales.groupby('Month')[['Sales', 'Stock', 'Price', 'Revenue']].mean()

In [None]:
fig, axs = plt.subplots(2, 2 , figsize = [14, 14])
fig.suptitle('Variables by the Month', fontsize = 18)
axs[0, 0].plot(msales.index, msales.Sales, color = 'black')
axs[0, 0].set_title('Sales', fontsize = 14)
axs[0, 1].plot(msales.index, msales.Stock, color = 'red')
axs[0, 1].set_title('Stock', fontsize = 14)
axs[1, 0].plot(msales.index, msales.Price, color = 'gold')
axs[1, 0].set_title('Price', fontsize = 14)
axs[1, 1].plot(msales.index, msales.Revenue, color = 'green')
axs[1, 1].set_title('Revenue', fontsize = 14)

1. Sales peak in June and bottom out in August, both winter months in the Southern Hemisphere.

2. Stock, much like the pattern within a month, starts the year with a much larger reserve and dwindles as the year continues.

3. Price has two noticeable peaks, at the start and end of the year. These may be much closer together on a linear timeline than this month-by-month breakdown suggests.

4. Revenue climbs more gradually towards its peak than sales does and lacks much of a second peak late in the year, due to the lower price.

--> Top 5 Most Common Price Values

In [None]:
# Show only the 5 most common prices
sales.Price.value_counts().head().plot(kind='bar', color = 'gold', figsize = (14, 8))

# CONCLUSION

Based on the data, the 7th of June should be expected to bring in the most money of the year, whereas the 24th of August should be expected to bring in the least.

Stock should hit its lowest points of the year in early November and climb to its highest points from the middle through late February.

Price should hit its lowest point around the 15th of August, while peaking around the start and end of February.
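
One way claims about specific calendar days could be checked is to average by (month, day) across years and look up the extremes. The sketch below uses synthetic revenue with a spike planted on 7 June, so its output only demonstrates the mechanics and does not confirm the dates above.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

# Synthetic two-year daily revenue with a planted spike on 7 June (made up)
dates = pd.date_range('2014-01-01', '2015-12-31', freq='D')
revenue = rng.normal(loc=100, scale=5, size=len(dates))
revenue[(dates.month == 6) & (dates.day == 7)] += 200

demo = pd.DataFrame({'Date': dates, 'Revenue': revenue})
demo['Month'] = demo['Date'].dt.month
demo['Day'] = demo['Date'].dt.day

# Average revenue for each (month, day) pair across years
by_calendar_day = demo.groupby(['Month', 'Day'])['Revenue'].mean()

best = by_calendar_day.idxmax()    # (month, day) with the highest average
worst = by_calendar_day.idxmin()   # (month, day) with the lowest average
print('Highest average revenue on (month, day):', best)
```

With only a few years of data, each calendar day is averaged over very few observations, so single-day extremes should be read with caution.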