#Exploratory Analysis of Sales Data
###Investigating comparisons through visualizations
***

This dataset provides details of products sold at a chain of stores across various regions. 
The dataset contains 730 observations with 13 characteristics:

* Order ID : A specific ID given to each product (This characteristic was not included in the majority of comparisons)
* Order Priority : Shipping priority of the product
* Order Quantity: Number of items sold
* Sales: Amount received for the purchase
* Ship Mode: Divided into two categories: Express Air and Regular Air
* Profit: Profit earned from the sale
* Customer Name: Name of the customer purchasing the products (This characteristic was not included in the majority of comparisons)
* Region: Region to which the customer is assigned by location
* Customer Segment: Divided as per the size of business
* Product Category: Divided according to the usage of the product
* Product Sub-Category: Divided according to the usage of the product
* Product Name: Name of the product (This characteristic was not included in the majority of comparisons)
* Product Container: Type of container in which the product is shipped



In [None]:

# Import the libraries to work with
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# load the dataset in
sales = pd.read_csv('../input/sales-store-product-details/Salesstore.csv')

# Get an idea of what the dataset contains
sales.head()

In [None]:
# How big is the data set?
sales.shape

In [None]:
# What are the columns labeled as?
sales.columns

In [None]:
# Count the unique values in each column
n = sales.nunique(axis=0)

print("Number of unique values in each column:\n", n)

In [None]:
# Check for any nulls
sales.isnull().sum()

In [None]:
# Find just numeric values in the set
numerics = sales.select_dtypes(include=np.number)
print(numerics.columns)

In [None]:
# Describe what the numeric data shows us
numerics.describe()

In [None]:
# Find the categorical data columns by excluding the numerics
categorical = sales.drop(columns=numerics.columns)
print(categorical.columns)

In [None]:
# Check each of the categorical counts
# Order Priority
sns.set_theme(style="darkgrid")
ax = sns.countplot(y=sales.Order_Priority, data=sales, order=sales.Order_Priority.value_counts().index)
# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

In [None]:
# There does not seem to be a major difference between the priority levels so far.

In [None]:
# Ship Mode
ax = sns.countplot(y=sales.Ship_Mode, data=sales, order=sales.Ship_Mode.value_counts().index)

# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

In [None]:
# What percentage of each is recorded?
# normalize=True returns each ship mode's fraction of all orders
print(sales.Ship_Mode.value_counts(normalize=True))

###79% of orders ship by Regular Air.

In [None]:
# Customer name may not be relevant here, so skipping over this.
# Region
ax = sns.countplot(y=sales.Region, data=sales, order=sales.Region.value_counts().index)
# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

In [None]:
# Customer Segment
ax = sns.countplot(y=sales.Customer_Segment, data=sales)

# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

###Corporate accounts for the most orders, while Small Business accounts for the fewest.
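
To check whether a leading group is a true majority (more than half of all orders) or only the largest share, the counts can be turned into proportions. The sketch below uses a small made-up frame in place of `sales`; the helper name `category_shares` and the toy values are assumptions for illustration, not part of the dataset.

```python
import pandas as pd

# Toy stand-in for the sales data (made-up values, for illustration only)
toy = pd.DataFrame({
    "Customer_Segment": ["Corporate", "Corporate", "Corporate",
                         "Home Office", "Consumer", "Small Business"]
})

def category_shares(df, column):
    """Return counts and fractional shares for one categorical column."""
    counts = df[column].value_counts()
    shares = df[column].value_counts(normalize=True)
    return pd.DataFrame({"count": counts, "share": shares})

shares = category_shares(toy, "Customer_Segment")
print(shares)

# A group is a strict majority only if its share exceeds 0.5
top_share = shares["share"].max()
print("Top group holds a majority:", top_share > 0.5)
```

In this toy frame the largest group holds exactly half of the rows, so it leads without being a strict majority. With the real data, the same helper could be called as `category_shares(sales, "Customer_Segment")`.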

In [None]:
# Product Category

ax = sns.countplot(y=sales.Product_Category, data=sales, order=sales.Product_Category.value_counts().index)

# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

###Office Supplies is the most-sold category, and Furniture the least.

In [None]:
# Product sub category
# Attribute access does not work for this column because of the dash in its name, so use bracket indexing instead
sub = sales['Product_Sub-Category']
ax = sns.countplot(y=sub, data=sales, order=sub.value_counts().index)

# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

In [None]:
# Product Names - too many to really delve into with a simple figure, may need to examine this further later.
# Product Container
ax = sns.countplot(y=sales.Product_Container, data=sales, order=sales.Product_Container.value_counts().index)

# annotate
ax.bar_label(ax.containers[0], label_type='edge')

# pad the spacing between the number and the edge of the figure
ax.margins(y=0.1)

In [None]:
# Quantitative exploration

# Order quantity
sales.Order_Quantity.describe()


In [None]:
# Box plot of order quantity with the individual orders overlaid as points
ax = sns.boxplot(data=sales, y="Order_Quantity")
ax = sns.stripplot(data=sales, y="Order_Quantity", color=sns.color_palette("Dark2")[0], linewidth=1, ax=ax)

In [None]:
# Sales
sales.Sales.describe()

In [None]:

# Box plot of sales with the individual orders overlaid as points
ax = sns.boxplot(data=sales, y="Sales")
ax = sns.stripplot(data=sales, y="Sales", color=sns.color_palette("Dark2")[0], linewidth=1, ax=ax)

In [None]:
# Profit
sales.Profit.describe()

In [None]:

# Box plot of profit with the individual orders overlaid as points
ax = sns.boxplot(data=sales, y="Profit")
ax = sns.stripplot(data=sales, y="Profit", color=sns.color_palette("Dark2")[0], linewidth=1, ax=ax)
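
The box plots above draw their whiskers at 1.5 times the interquartile range by default, so points beyond them are the ones shown as outliers. A minimal sketch of that rule follows, using made-up numbers in place of the real columns; the helper name `iqr_fences` is hypothetical.

```python
import pandas as pd

# Made-up values standing in for a numeric column such as sales.Profit
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 95])

def iqr_fences(series, whis=1.5):
    """Return the (lower, upper) fences of the whis * IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - whis * iqr, q3 + whis * iqr

lower, upper = iqr_fences(values)

# Values outside the fences would be drawn as outlier points
outliers = values[(values < lower) | (values > upper)]
print(f"fences: ({lower}, {upper})")
print(outliers)
```

With the real data, the same helper could be applied to `sales.Order_Quantity`, `sales.Sales`, and `sales.Profit` in turn.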

In [None]:
#Numeric Comparisons
#Order Quantity vs. Sales
ax = sns.jointplot(data=sales, x=sales.Order_Quantity, y=sales.Sales, height=8, ratio=2, marginal_ticks=True, kind="reg")

In [None]:
#Order Quantity vs Profit
ax = sns.jointplot(data=sales, x=sales.Order_Quantity, y=sales.Profit, height=8, ratio=2, marginal_ticks=True, kind="reg")

In [None]:
# Sales vs. Profit
ax = sns.jointplot(data=sales, x=sales.Sales, y=sales.Profit, height=8, ratio=2, marginal_ticks=True, kind="reg")
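
The regression lines in the joint plots can be backed by a number. The sketch below computes the Pearson correlation and a least-squares line on a small made-up frame. The values are illustrative only; with the real data the same calls could be run on the `Sales` and `Profit` columns of `sales`.

```python
import numpy as np
import pandas as pd

# Made-up values, for illustration only
toy = pd.DataFrame({
    "Sales":  [100.0, 250.0, 400.0, 550.0, 900.0],
    "Profit": [10.0, 40.0, 35.0, 80.0, 120.0],
})

# Pearson correlation matrix of the numeric columns
corr = toy.corr()
print(corr)

# Least-squares line: Profit = slope * Sales + intercept
slope, intercept = np.polyfit(toy["Sales"], toy["Profit"], deg=1)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

A correlation near 1 or -1 indicates a strong linear relationship, while a value near 0 suggests the fitted line explains little of the spread.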

In [None]:
plotnum = numerics.drop(["Order_ID"], axis=1)
ax = sns.pairplot(plotnum)

In [None]:
# Numeric vs. Categorical Comparisons
# Order Quantity vs Order Priority

ax = sns.displot(
    sales, y=sales.Order_Quantity, hue=sales.Order_Priority,
    binwidth=4, height=8, facet_kws=dict(margin_titles=True), multiple="stack")

In [None]:
#Order Quantity vs. Ship Mode
ax = sns.displot(
    sales, y=sales.Order_Quantity, hue=sales.Ship_Mode,
    binwidth=4, height=8, facet_kws=dict(margin_titles=True))

In [None]:
#Order Quantity vs. Region

ax = sns.displot(
    sales, x=sales.Order_Quantity, hue=sales.Region,
    binwidth=8, height=10, facet_kws=dict(margin_titles=True), multiple="dodge", rug=True)

###The West region accounts for the largest share of orders.

In [None]:
#Order Quantity vs. Customer Segment

ax = sns.displot(
    sales, y=sales.Order_Quantity, hue=sales.Customer_Segment,
    binwidth=5, height=10, facet_kws=dict(margin_titles=True), multiple="stack")

###Corporate customers placed the largest share of orders.

In [None]:
#Order Quantity vs. Product Category
ax = sns.displot(
    sales, x=sales.Order_Quantity, hue=sales.Product_Category,
    binwidth=8, height=10, facet_kws=dict(margin_titles=True), multiple="dodge")

###Office Supplies accounts for the largest share of orders.

In [None]:
#Order Quantity vs. Product Sub Category
sub = sales["Product_Sub-Category"]
ax = sns.displot(
    sales, y=sales.Order_Quantity, hue=sub,
    binwidth=8, height=10, facet_kws=dict(margin_titles=True), multiple="dodge", rug=True)

###Binders and Paper are the most common sub-categories.

In [None]:
#Order Quantity vs. Product Container
ax = sns.relplot(x=sales.Product_Container, y=sales.Order_Quantity, data=sales, kind = 'line')

In [None]:
#Sales vs. Order Priority
ax = sns.relplot(x=sales.Order_Priority, y=sales.Sales, data=sales, kind = 'line')

In [None]:
#Sales vs. Ship Mode
ax = sns.relplot(x=sales.Ship_Mode, y=sales.Sales, data=sales, kind = 'line')

In [None]:
#Sales Vs. Region
ax = sns.relplot(x=sales.Region, y=sales.Sales, data=sales, kind = 'line', height=4, aspect=2)

In [None]:
#Sales Vs. Customer Segment
ax = sns.relplot(x=sales.Customer_Segment, y=sales.Sales, data=sales, kind = 'line')

In [None]:
#Sales Vs. Product Category
ax = sns.relplot(x=sales.Product_Category, y=sales.Sales, data=sales, kind = 'line')

In [None]:
#Sales vs. Product Sub Category
sub=sales['Product_Sub-Category']
ax = sns.jointplot(data=sales, x=sales.Sales, y=sub, kind="hist")

In [None]:
#Sales Vs. Product Container
ax = sns.relplot(x=sales.Product_Container, y=sales.Sales, data=sales, kind = 'line')

In [None]:
#Profit vs. Order Priority
ax = sns.relplot(x=sales.Order_Priority, y=sales.Profit, data=sales, kind = 'line')

In [None]:
#Profit vs. Ship Mode
ax = sns.relplot(x=sales.Ship_Mode, y=sales.Profit, data=sales, kind = 'line')

In [None]:
#Profit Vs. Region

ax = sns.relplot(x=sales.Region, y=sales.Profit, data=sales, kind = 'line', height=4, aspect=2)

In [None]:
#Highest profits were seen from Northwest Territories
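
A line plot drawn with `relplot(kind='line')` aggregates repeated x values with the mean by default, so the regional comparison above reflects average profit per order rather than total profit. The two rankings can differ, as this sketch on made-up data shows (all values are placeholders, not figures from the dataset):

```python
import pandas as pd

# Made-up orders, for illustration only
toy = pd.DataFrame({
    "Region": ["West", "West", "West", "West",
               "Northwest Territories", "Northwest Territories"],
    "Profit": [50.0, 60.0, 40.0, 70.0, 90.0, 100.0],
})

# Mean, total, and number of orders per region, sorted by mean profit
by_region = (
    toy.groupby("Region")["Profit"]
       .agg(["mean", "sum", "count"])
       .sort_values("mean", ascending=False)
)
print(by_region)

print("Highest mean profit: ", by_region["mean"].idxmax())
print("Highest total profit:", by_region["sum"].idxmax())
```

In the toy frame, one region leads on mean profit while the other leads on total profit because it has more orders, so it helps to state which measure an observation refers to.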

In [None]:
#Profit vs. Customer Segment
ax = sns.relplot(x=sales.Customer_Segment, y=sales.Profit, data=sales, kind = 'line')

In [None]:
#Profit Vs. Product Category
ax = sns.relplot(x=sales.Product_Category, y=sales.Profit, data=sales, kind = 'line')

In [None]:
#Profit Vs.Product Sub Category
sub=sales['Product_Sub-Category']
ax = sns.jointplot(data=sales, x=sales.Profit, y=sub, kind="hist")

In [None]:
#Profit Vs. Product Container
ax = sns.relplot(x=sales.Product_Container, y=sales.Profit, data=sales, kind = 'line')

##Observations so far:
* 79% of orders ship by Regular Air.
* The West region accounts for the largest share of orders.
* Corporate customers placed the largest share of orders.
* Office Supplies accounts for the largest share of orders.
* Binders and Paper are the most common sub-categories.
* Small boxes are the most common product container.
* The highest profits were seen in the Northwest Territories.
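
Several of these observations can be checked in one call: `describe` on the non-numeric columns reports the most frequent value (`top`) and its count (`freq`) for each column. The sketch below runs on made-up rows, not the real data.

```python
import numpy as np
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Region": ["West", "West", "West", "Northwest Territories"],
    "Ship_Mode": ["Regular Air", "Regular Air", "Regular Air", "Express Air"],
    "Profit": [12.5, -3.0, 40.0, 88.0],
})

# count, unique, top (most frequent value), and freq for each non-numeric column
summary = toy.describe(exclude=[np.number])
print(summary)
```

With the real data, `sales.describe(exclude=[np.number])` would give the same summary for every categorical column at once.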

In [None]:
# 79% of the shipmode is Regular Air.
#Pairwise Ship Mode
pairnums = sales.drop(columns=['Order_ID'])
ax = sns.pairplot(data=pairnums, kind="hist", hue="Ship_Mode", height=3, aspect=1.5)

In [None]:
#The majority of orders come from the West Region
#Highest profits were seen from Northwest Territories
#Pairwise Region
ax = sns.pairplot(data=pairnums, hue="Region", height=3, aspect=1.5)

In [None]:

#Profit vs Region by Product Category 
ax = sns.catplot(x ='Region', y ='Profit', data=sales, kind='bar', col="Product_Category", col_wrap=2, height=5, aspect=1.5)
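
The bar heights in a `catplot(kind='bar')` are group means by default, so the same comparison can be read off as a table. The sketch below uses `pivot_table` on made-up rows; the values are placeholders.

```python
import pandas as pd

# Made-up rows, for illustration only
toy = pd.DataFrame({
    "Region": ["West", "West", "West", "Northwest Territories"],
    "Product_Category": ["Office Supplies", "Office Supplies", "Furniture", "Furniture"],
    "Profit": [10.0, 30.0, -5.0, 25.0],
})

# Mean profit for each Region x Product_Category cell; empty cells become NaN
mean_profit = toy.pivot_table(
    values="Profit", index="Region", columns="Product_Category", aggfunc="mean"
)
print(mean_profit)
```

A `NaN` cell marks a region with no orders in that category, which a bar chart would show as a missing bar.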

In [None]:
#Pairwise by sub category
ax = sns.pairplot(data=pairnums, hue="Product_Sub-Category", kind="hist", height=3, aspect=1.5)

In [None]:
#The majority of orders come from the West Region
#Profit vs Region by Product Sub-Category 
ax = sns.catplot(x ='Region', y ='Profit', data=sales, kind='violin', col="Product_Sub-Category", col_wrap=2, height=5, aspect=1.5)

In [None]:
#Majority of Orders were from Corporate
#Pairwise Customer Segment
ax = sns.pairplot(data=pairnums, hue="Customer_Segment", kind="hist", height=3, aspect=1.5)

In [None]:
#Profit vs Region by Customer Segment
ax = sns.catplot(x ='Region', y ='Profit', data=sales, kind='violin', col="Customer_Segment", col_wrap=2, height=5, aspect=1.5)

In [None]:
#Majority of orders were for Office Supplies
#Pairwise Product Categories
ax = sns.pairplot(data=pairnums, hue="Product_Category", kind='hist')

In [None]:
#Highest profits were seen from Northwest Territories
#Northwest pairwise
north = sales.loc[sales['Region'] == 'Northwest Territories'].drop(["Order_ID"], axis=1)
ax = sns.pairplot(data=north, kind="hist")

In [None]:
#The majority of orders come from the West Region
#West  pairwise
west = sales.loc[sales['Region'] == 'West'].drop(["Order_ID"], axis=1)
ax = sns.pairplot(data=west)