## Analysis Objectives

* Q1. What is the overall sales trend?
* Q2. What are the top 10 products by sales?
* Q3. What are the best-selling products?
* Q4. What is the preferred Shipping Mode?
* Q5. What are the Most Profitable Category and Sub-Category?

### Import required libraries

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# openpyxl must be installed to read Excel files (e.g. `pip install openpyxl`)

# data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Import the dataset

In [2]:
# NOTE: reading Excel files requires `openpyxl` as a dependency
# read dataset into a pandas dataframe
df = pd.read_excel("eds_superstore_sales.xlsx")

### Data audit and overview

In [None]:
# let's see what the dataset looks like
df.head()

In [None]:
df.tail()

In [None]:
# what is the shape of our dataframe
df.shape

In [None]:
# list the columns
df.columns

In [None]:
# which types the data is stored in
df.dtypes

In [None]:
# a concise summary of the above information
df.info()

The `df.info()` output above shows no missing values, so we can skip the missing-value handling step.
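A more direct check than reading the `df.info()` summary is to count missing values per column. The following is a minimal sketch on a small made-up frame; the `sample` data is hypothetical and only stands in for the real dataset:

```python
import pandas as pd
import numpy as np

# Hypothetical stand-in for the real dataset, with one missing value injected.
sample = pd.DataFrame({
    "order_id": ["A-1", "A-2", "A-3"],
    "sales": [100.0, np.nan, 250.5],
    "quantity": [2, 1, 4],
})

# Count missing values per column; all zeros means nothing to impute or drop.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Single boolean answer: does the frame contain any missing value at all?
has_missing = bool(sample.isna().any().any())
print(has_missing)
```

Running the same two expressions on `df` should confirm the conclusion drawn from `df.info()`.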

In [None]:
# get basic descriptive statistics of the dataset
df.describe()

### Data Cleaning

* Some of the columns are not critical to our analysis, so we're going to drop them.

In [3]:
df = df.drop(columns=["customer_name", "segment", "state", "country", "market", "region", "discount", "order_priority", "year"])

In [None]:
df.head()

### Exploratory Data Analysis

#### 1) What is the overall sales trend?

In [None]:
# first let's take a look at our `order_date` column
df["order_date"]

We can see that the records run from Jan 1, 2011 to Dec 31, 2014.<br />
The dates are stored in year-month-day format. Since we want monthly totals, let's reduce them to year-month (enterprise standard format).

In [4]:
# to do that we're going to use the .apply() method with a lambda on the `order_date` column
# the year-month format is %Y-%m
df["order_date"] = df["order_date"].apply(lambda x: x.strftime("%Y-%m"))

In [None]:
# the result is as follows
df["order_date"]
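As an aside, the same year-month key can be produced without a Python-level lambda. The sketch below uses hypothetical dates (the `dates` series is made up) to compare the vectorized `.dt.strftime` accessor with `.dt.to_period("M")`, which keeps calendar semantics instead of plain strings:

```python
import pandas as pd

# Hypothetical order dates standing in for the `order_date` column.
dates = pd.Series(pd.to_datetime(["2011-01-01", "2011-01-15", "2011-02-03"]))

# Vectorized equivalent of the .apply(lambda ...) call: yields plain strings.
as_text = dates.dt.strftime("%Y-%m")

# Period values keep calendar semantics (sorting, arithmetic) instead of strings.
as_period = dates.dt.to_period("M")

print(as_text.tolist())
print(as_period.astype(str).tolist())
```

Either form groups rows by month; the string form is what the rest of this notebook relies on.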

In [5]:
# so far so good; now we group the data by order date and sum the sales
# selecting "sales" before .sum() avoids aggregating non-numeric columns
df_sales_trend = df.groupby("order_date")["sales"].sum().reset_index()
df_sales_trend

| | order_date | sales |
|---|---|---|
| 0 | 2011-01 | 98898.48886 |
| 1 | 2011-02 | 91152.15698 |
| 2 | 2011-03 | 145729.36736 |
| 3 | 2011-04 | 116915.76418 |
| 4 | 2011-05 | 146747.8361 |
| 5 | 2011-06 | 215207.38022 |
| 6 | 2011-07 | 115510.41912 |
| 7 | 2011-08 | 207581.49122 |
| 8 | 2011-09 | 290214.45534 |
| 9 | 2011-10 | 199071.26404 |


Let's visualize this information:

In [None]:
# plt.style.use("seaborn")
plt.figure(figsize=(15, 6))
plt.grid(True, axis="both")    # the first positional argument is the on/off flag, not the axis
plt.plot(df_sales_trend["order_date"], df_sales_trend["sales"])
plt.xlabel("Order Date in Months", labelpad=8)
plt.ylabel("Sales", labelpad=8)
plt.xticks(rotation="vertical")
plt.show()

- Observation: Sales tend to dip in February and July. On the other hand, the end of the year usually brings the strongest figures. We can also observe a sharp jump in the May-June period.
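One way to back this reading of the chart with numbers is to average sales per calendar month across years. The sketch below assumes a frame shaped like `df_sales_trend`; the `trend` values are hypothetical and chosen only to illustrate the technique:

```python
import pandas as pd

# Hypothetical monthly totals in the same shape as `df_sales_trend`.
trend = pd.DataFrame({
    "order_date": ["2011-01", "2011-02", "2011-12", "2012-01", "2012-02", "2012-12"],
    "sales": [100.0, 80.0, 300.0, 120.0, 90.0, 340.0],
})

# Extract the calendar month ("01".."12") from the "YYYY-MM" key.
trend["month"] = trend["order_date"].str[-2:]

# Average sales per calendar month across all years.
monthly_avg = trend.groupby("month")["sales"].mean()

print(monthly_avg)
print(monthly_avg.idxmin(), monthly_avg.idxmax())
```

Applied to the real `df_sales_trend`, `idxmin()` and `idxmax()` would name the weakest and strongest months on average.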

In [6]:
# list the distinct product categories
categories = set(df["category"].values.tolist())
categories

{'Furniture', 'Office Supplies', 'Technology'}

#### 2) What are the top ten (10) products by sales?

In [None]:
# let's see the many different products in the dataset
df["product_name"]

In [7]:
# let's group by product name and sum the sales
# double brackets keep the result as a DataFrame with a single "sales" column
df_top_sales = df.groupby("product_name")[["sales"]].sum()

In [8]:
# let's sort it by sales in descending order
df_top_sales = df_top_sales.sort_values("sales", ascending=False)

In [9]:
df_top_sales.head(10)

| product_name | sales |
|---|---|
| Apple Smart Phone, Full Size | 86935.7786 |
| Cisco Smart Phone, Full Size | 76441.5306 |
| Motorola Smart Phone, Full Size | 73156.303 |
| Nokia Smart Phone, Full Size | 71904.5555 |
| Canon imageCLASS 2200 Advanced Copier | 61599.824 |
| Hon Executive Leather Armchair, Adjustable | 58193.4841 |
| Office Star Executive Leather Armchair, Adjustable | 50661.684 |
| Harbour Creations Executive Leather Armchair, Adjustable | 50121.516 |
| Samsung Smart Phone, Cordless | 48653.46 |
| Nokia Smart Phone, with Caller ID | 47877.7857 |


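- Observation: Smart phones dominate the top 10 by sales, taking six of the ten spots, including the top four. The remaining entries are three executive leather armchairs and one copier.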
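The group, sort, and `head(10)` steps can also be collapsed into a single chain with `nlargest`. A minimal sketch, assuming a frame with `product_name` and `sales` columns; the `orders` rows are made up for illustration:

```python
import pandas as pd

# Hypothetical order lines; product names and amounts are made up.
orders = pd.DataFrame({
    "product_name": ["Phone", "Chair", "Phone", "Copier", "Chair"],
    "sales": [500.0, 200.0, 300.0, 700.0, 150.0],
})

# Sum sales per product, then keep the n largest totals in descending order.
top_products = orders.groupby("product_name")["sales"].sum().nlargest(2)
print(top_products)
```

With the real data, `nlargest(10)` in place of `nlargest(2)` should give the same ranking as the sorted table.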

#### 3) What are the best-selling products?

In [None]:
# let's take a look at the quantity column
df["quantity"]

In [None]:
# let's group by product name and sum the quantity sold
df_most_selling = df.groupby("product_name")[["quantity"]].sum()

In [None]:
# now let's sort by quantity in descending order
df_most_selling = df_most_selling.sort_values("quantity", ascending=False)
df_most_selling.head(10)    # top 10

- Observation:

#### 4) What is the preferred shipping mode?

In [None]:
df

We can use `Counter` from the `collections` module for this task, just for a change of approach.

In [None]:
from collections import Counter

In [None]:
# let's take a look at the ship mode column
df["ship_mode"]

In [None]:
# all different shipping methods
shipping_methods = set(df["ship_mode"].values.tolist())
shipping_methods

In [None]:
# let's figure out which method is the most often used
count = Counter(df["ship_mode"].values.tolist())
most_used_ship = count.most_common()
most_used_ship

In [None]:
# total number of rows (order lines) in the dataset
total_orders = len(df["order_id"])
total_orders

In [None]:
# additionally, let's see the percentage share of the most used mode
percent_ship = round((most_used_ship[0][-1] / total_orders * 100), 2)
print(f"{percent_ship}%")
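For comparison, pandas can produce the share of every shipping mode in one step with `value_counts(normalize=True)`. This is a sketch on hypothetical values; the `ship_mode` series below is made up and does not reflect the real distribution:

```python
import pandas as pd

# Hypothetical shipping modes standing in for the `ship_mode` column.
ship_mode = pd.Series(
    ["Standard Class", "Standard Class", "Standard Class", "Second Class"]
)

# Share of each mode as a percentage of all rows, largest first.
share = (ship_mode.value_counts(normalize=True) * 100).round(2)
print(share)
```

The first entry of `share` plays the same role as `percent_ship` above, and the remaining entries give the shares of the other modes.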

Or, even easier, we can use `seaborn` to both compute the counts and visualize them.

In [None]:
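plt.figure(figsize=(10, 7))
plt.grid(axis="y")    # horizontal grid lines only
# recent seaborn versions expect the column to be passed by keyword
sns.countplot(x=df["ship_mode"])
plt.show()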

- Observation: "Standard Class" is the preferred shipping mode; at 60.0%, most buyers use it to ship their products.

#### 5) Which are the most profitable Category and Sub-Category?

In [10]:
# let's group by category and sub_category and sum the profit
df_catg_profit = df.groupby(["category", "sub_category"])[["profit"]].sum()

In [11]:
df_catg_profit = df_catg_profit.sort_values(["category", "profit"], ascending=False)
df_catg_profit

| category | sub_category | profit |
|---|---|---|
| Technology | Copiers | 258567.54818 |
| Technology | Phones | 216717.0058 |
| Technology | Accessories | 129626.3062 |
| Technology | Machines | 58867.873 |
| Office Supplies | Appliances | 141680.5894 |
| Office Supplies | Storage | 108461.4898 |
| Office Supplies | Binders | 72449.846 |
| Office Supplies | Paper | 59207.6827 |
| Office Supplies | Art | 57953.9109 |
| Office Supplies | Envelopes | 29601.1163 |
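To read off the single most profitable sub-category within each category, one option is `idxmax` on the grouped totals. The sketch below is only an illustration under assumptions: the `profit` frame and its values are hypothetical, not taken from the dataset.

```python
import pandas as pd

# Hypothetical profit rows; the real values come from the dataset.
profit = pd.DataFrame({
    "category": ["Technology", "Technology", "Furniture", "Furniture"],
    "sub_category": ["Copiers", "Phones", "Chairs", "Tables"],
    "profit": [250.0, 200.0, 140.0, -60.0],
})

# Total profit per (category, sub_category) pair.
totals = profit.groupby(["category", "sub_category"])["profit"].sum()

# For each category, idxmax returns the (category, sub_category) label of the largest total.
best_labels = totals.groupby(level="category").idxmax()
best = totals.loc[best_labels.tolist()]
print(best)
```

The same two lines applied to the grouped profit totals of the real data would return one row per category.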
