# Introduction

This notebook examines a data set of online business sales and visualizes key parts of it to gain insight into sales patterns. Two CSV files are provided. The first contains sales information by product type: net quantity, gross sales, discounts, returns, and net sales. The second contains monthly information: month, year, total orders, gross sales, net sales, and shipping. Together, the two files offer different views of this online business's sales. The visualizations are split into two sections, product insights and sales insights.

To answer the task set by the post, I also include a section on projections for Oct-Dec 2020 based on data from previous years. I will try to produce a list of sales by product type. From this list, we can see the demand for each product type and suggest discounts for the near future.

# Loading the Sales Data

First, load the notebook libraries. I chose seaborn and matplotlib specifically to create visualizations for analysis. 

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt # visualization
%matplotlib inline
import seaborn as sns # data visualization

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import os
for dirname, _, filenames in os.walk('/kaggle/input/'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Second, I read in the CSV files and assign them to variables: reta_data for the first CSV file and reta_data2 for the second.

In [None]:
# Paths of the files to read
reta_filepath = "../input/retail-business-sales-20172019/business.retailsales.csv"
reta_filepath2 = "../input/retail-business-sales-20172019/business.retailsales2.csv"

# Read the files into the variables reta_data and reta_data2
reta_data = pd.read_csv(reta_filepath)
reta_data2 = pd.read_csv(reta_filepath2)
print("Files read")


# Product Insights

This section focuses on the products, using the data from the first CSV file. To start, we can look at the first and last five entries to see what we are working with. There are 1,775 rows of data with product type, net quantity, gross sales, discounts, returns, and net sales.

In [None]:
reta_data.head()

In [None]:
reta_data.tail()
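Beyond the first and last rows, a quick per-column summary can show missing values and the number of distinct product types. The following is a minimal sketch; `summarize_frame` is a hypothetical helper name, and the toy frame only stands in for `reta_data` with made-up values.

```python
import pandas as pd

def summarize_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing-value count, and unique-value count."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })

# Toy stand-in for reta_data; the values are made up for illustration only
toy = pd.DataFrame({
    "Product Type": ["Basket", "Basket", None, "Jewelry"],
    "Net Quantity": [3, 1, 2, 5],
    "Gross Sales": [30.0, 10.0, 20.0, 75.0],
})

print(toy.shape)  # (rows, columns)
summary = summarize_frame(toy)
print(summary)
```

In the notebook, the same call would be `summarize_frame(reta_data)`.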

First, we can graph the gross sales of each product type. Gross sales are graphed first so we can see the sales before discounts and returns are applied.

The product types with the most gross sales are (1) art & sculpture, (2) baskets and (3) skin care. The lowest-grossing product types are (1) gift baskets, (2) Easter and (3) kids.

In [None]:
plt.figure(figsize=(10,6))
plt.title("Gross Sales by Product Type")
sns.barplot(x=reta_data['Gross Sales'], y=reta_data['Product Type'])
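Note that seaborn's `barplot` plots the mean of each category by default, so the bars above show the average gross sales per row for each product type rather than a total. If a ranking by total is wanted, the values can be summed first. The sketch below assumes the column names used above; `total_by_product_type` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd

def total_by_product_type(df: pd.DataFrame, value_col: str) -> pd.Series:
    """Sum value_col for each product type and sort from largest to smallest."""
    return (
        df.groupby("Product Type")[value_col]
        .sum()
        .sort_values(ascending=False)
    )

# Toy stand-in for reta_data; the values are made up for illustration only
toy = pd.DataFrame({
    "Product Type": ["Basket", "Basket", "Jewelry", "Kids"],
    "Gross Sales": [30.0, 10.0, 75.0, 5.0],
})

totals = total_by_product_type(toy, "Gross Sales")
print(totals.head(3))  # highest-grossing product types
print(totals.tail(3))  # lowest-grossing product types
```

The same helper works for the other columns, for example `total_by_product_type(reta_data, 'Total Net Sales')`.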

Next, we can consider the discounts for each product type. The product types that receive the most discounts are (1) furniture, (2) art & sculpture, and (3) baskets. The product types that receive the least are (1) soapstone, (2) kids and (3) fair trade gifts.

In [None]:
plt.figure(figsize=(10,6))
plt.title("Discount by Product Type")
discounts1 = reta_data['Discounts'] * -1  # flip the sign of the discount values
sns.barplot(x=discounts1, y=reta_data['Product Type'])

Now, we can look at the total net sales of each product type. The top three product types are (1) art & sculpture, (2) baskets and (3) skin care. The lowest-selling product types are (1) gift baskets, (2) Easter and (3) kids.

In [None]:
plt.figure(figsize=(10,6))
plt.title("Net Sales by Product Type")
sns.barplot(x=reta_data['Total Net Sales'], y=reta_data['Product Type'])

The product types that are returned the most are (1) Christmas, (2) art & sculpture and (3) baskets.

In [None]:
plt.figure(figsize=(10,6))
plt.title("Returns by Product Type")
returns1 = reta_data['Returns'] * -1  # flip the sign of the return values
sns.barplot(x=returns1, y=reta_data['Product Type'])

Next is the total quantity sold of each product type.

In [None]:
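# Sum the net quantity for each product type, then plot the totals
quantity_by_type = reta_data.groupby('Product Type')['Net Quantity'].sum().sort_values(ascending=False)
plt.figure(figsize=(10,6))
plt.title("Quantity Sold by Product Type")
sns.barplot(x=quantity_by_type.values, y=quantity_by_type.index)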

Next, we look at the gross sales of each product type against the quantity sold per purchase. From this, we can see both the quantity sold and the gross sales for each product type. The best-selling and highest-grossing product types are (1) art & sculpture, (2) jewelry and (3) baskets.

In [None]:
reta_data['Product Type']

In [None]:
plt.figure(figsize=(10,9))
plt.title("Gross Sales of Quantity of Each Product")
sns.scatterplot(x=reta_data['Gross Sales'], y=reta_data['Net Quantity'], hue=reta_data['Product Type'])

# Sales Insights

This section will focus on the sales using the data from the second csv file. 

The data in this file runs from January 2017 to December 2019 and lists the total orders, gross sales, discounts, returns, net sales, shipping and total sales for each month. We can see this by looking at the first and last five entries of the data.

In [None]:
reta_data2.head()

In [None]:
reta_data2.tail()

First, let's look at the net sales across the three years, or 36 months.

In [None]:
plt.figure(figsize=(8,6))
plt.title("Net Sales from 2017 to 2019")
sns.lineplot(data = reta_data2['Net Sales'], label = "Net Sales")
plt.ylabel("Dollars");
plt.xlabel("Months");
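The x-axis in the plot above is the row index of the file, so months appear as plain numbers. If calendar dates on the axis are preferred, the `Month` and `Year` columns can be combined into a datetime index. This sketch assumes that `Month` holds full English month names such as "January", which is not confirmed by the output shown here; `add_period_column` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd

def add_period_column(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with a 'Date' column built from the 'Month' and 'Year' columns."""
    out = df.copy()
    # Assumption: 'Month' contains full month names, e.g. "January"
    out["Date"] = pd.to_datetime(
        out["Month"].astype(str) + " " + out["Year"].astype(str),
        format="%B %Y",
    )
    return out

# Toy stand-in for reta_data2; the values are made up for illustration only
toy = pd.DataFrame({
    "Month": ["November", "December", "January"],
    "Year": [2018, 2018, 2019],
    "Net Sales": [9000.0, 12000.0, 6000.0],
})

dated = add_period_column(toy).set_index("Date")
print(dated["Net Sales"])
```

Passing `dated['Net Sales']` to `sns.lineplot` would then label the x-axis with dates instead of row numbers.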

We see that net sales in the first 24 months consistently fluctuate between 5,000 and 10,000. Months 25 to 32 behave the same way, until month 33, where a large spike begins and continues to the end of 2019.

We can compare the three years in the graph below. Net sales increase progressively each year, and the increase in 2019 is slightly larger than in the previous years.

In [None]:
plt.figure(figsize=(12,8))
plt.title("Net Sales Each Year")
sns.barplot(x=reta_data2['Year'], y=reta_data2['Net Sales'])
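Because `barplot` shows the mean by default, the bars above are the average monthly net sales for each year. To put numbers on the year-over-year growth, the yearly totals and their percentage change can be computed directly. This is a minimal sketch with a hypothetical helper name and made-up toy values.

```python
import pandas as pd

def yearly_net_sales(df: pd.DataFrame) -> pd.DataFrame:
    """Total net sales per year and the percentage change from the previous year."""
    totals = df.groupby("Year")["Net Sales"].sum()
    return pd.DataFrame({
        "total": totals,
        "pct_change": totals.pct_change() * 100,  # NaN for the first year
    })

# Toy stand-in for reta_data2; the values are made up for illustration only
toy = pd.DataFrame({
    "Year": [2017, 2017, 2018, 2018, 2019, 2019],
    "Net Sales": [100.0, 100.0, 110.0, 110.0, 150.0, 180.0],
})

yearly = yearly_net_sales(toy)
print(yearly)
```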

Net sales can also be viewed by month. We can see that (1) December, (2) November and (3) June have the most net sales, while (1) October, (2) February and (3) May have the least.

In [None]:
plt.figure(figsize=(12,8))
plt.title("Net Sales Each Month")
#sns.barplot(x=reta_data2.index, y=reta_data2['Net Sales'])
sns.barplot(x=reta_data2['Month'], y=reta_data2['Net Sales'])
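The month ranking can also be read off numerically rather than from the bar heights. The sketch below averages net sales for each calendar month across the years and then sorts them. It assumes that `Month` holds full English month names; `mean_net_sales_by_month` is a hypothetical helper, and the toy values are made up.

```python
import calendar
import pandas as pd

# Full month names in calendar order ("January" ... "December")
MONTH_ORDER = list(calendar.month_name)[1:]

def mean_net_sales_by_month(df: pd.DataFrame) -> pd.Series:
    """Average net sales per calendar month, returned in calendar order."""
    means = df.groupby("Month")["Net Sales"].mean()
    return means.reindex([m for m in MONTH_ORDER if m in means.index])

# Toy stand-in for reta_data2; the values are made up for illustration only
toy = pd.DataFrame({
    "Month": ["January", "February", "December"] * 2,
    "Year": [2018, 2018, 2018, 2019, 2019, 2019],
    "Net Sales": [100.0, 50.0, 300.0, 200.0, 70.0, 500.0],
})

by_month = mean_net_sales_by_month(toy)
ranked = by_month.sort_values(ascending=False)
print(ranked)
```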

We can also look at net sales and gross sales to see how discounts are impacting the net sales. 

In [None]:
plt.figure(figsize=(8,6))
plt.title("Net Sales V.S. Gross Sales")
sns.lineplot(data = reta_data2['Gross Sales'], label = "Gross Sales")
sns.lineplot(data = reta_data2['Discounts'], label = "Discounts")
sns.lineplot(data = reta_data2['Net Sales'], label = "Net Sales")
plt.ylabel("Dollars");
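To see the impact of discounts more directly than by comparing lines, the discount can be expressed as a percentage of gross sales for each month. The sketch below takes the absolute value of `Discounts` because the sign convention of that column in the second file is an assumption here. `discount_rate` is a hypothetical helper, and the toy values are made up.

```python
import pandas as pd

def discount_rate(df: pd.DataFrame) -> pd.Series:
    """Discounts as a percentage of gross sales for each row."""
    # abs() makes the result independent of whether discounts are stored as negatives
    return df["Discounts"].abs() / df["Gross Sales"] * 100

# Toy stand-in for reta_data2; the values are made up for illustration only
toy = pd.DataFrame({
    "Gross Sales": [1000.0, 2000.0],
    "Discounts": [-50.0, -300.0],
})

rates = discount_rate(toy)
print(rates)
```

In the notebook, `discount_rate(reta_data2)` could be plotted with `sns.lineplot` in the same way as the series above.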

# 2020 Q4 Sales Predictions

The goal of this section is to output a list of sales predictions per product type for Oct-Dec 2020. There are many regression techniques we can use to try to predict the trends of each product type. However, the data is split into two separate files, and the product types file lacks date information, which poses a hurdle to predicting future sales per product type. We can try to work around this by still predicting future sales using the first file.
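One possible workaround, sketched here as an illustration and not as a settled method, is to project total Q4 net sales from the monthly file and then split that total across product types by their historical share of net sales in the first file. The sketch fits a straight line through each Q4 month's net sales across the available years with `np.polyfit`. It assumes that `Month` holds full English month names and that the product mix in Q4 matches the overall mix. With only three data points per month, the projection is very rough. The helper names and toy values are made up.

```python
import numpy as np
import pandas as pd

Q4_MONTHS = ["October", "November", "December"]

def project_month(df: pd.DataFrame, month: str, target_year: int) -> float:
    """Fit a straight line through one calendar month's net sales and extrapolate."""
    rows = df[df["Month"] == month].sort_values("Year")
    first_year = rows["Year"].min()
    # Offset the years so the fit works with small numbers (0, 1, 2, ...)
    x = (rows["Year"] - first_year).to_numpy(dtype=float)
    y = rows["Net Sales"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    return float(slope * (target_year - first_year) + intercept)

def project_q4_by_product(monthly: pd.DataFrame, products: pd.DataFrame,
                          target_year: int = 2020) -> pd.Series:
    """Project total Q4 net sales, then split it by each product type's historical share."""
    q4_total = sum(project_month(monthly, m, target_year) for m in Q4_MONTHS)
    by_type = products.groupby("Product Type")["Total Net Sales"].sum()
    shares = by_type / by_type.sum()
    return (shares * q4_total).sort_values(ascending=False)

# Toy stand-ins for reta_data2 and reta_data; the values are made up for illustration only
monthly = pd.DataFrame({
    "Month": Q4_MONTHS * 3,
    "Year": [2017] * 3 + [2018] * 3 + [2019] * 3,
    "Net Sales": [100.0, 200.0, 300.0, 200.0, 200.0, 400.0, 300.0, 200.0, 500.0],
})
products = pd.DataFrame({
    "Product Type": ["Basket", "Basket", "Jewelry"],
    "Total Net Sales": [200.0, 100.0, 100.0],
})

projection = project_q4_by_product(monthly, products)
print(projection)
```

In the notebook, the call would be `project_q4_by_product(reta_data2, reta_data)`, and the resulting list could inform which product types to consider for discounts.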