#####  SCENARIO:

> __You are working as a Data Analyst/Scientist at Rohkorp Consolidated. The CEO wants you to have a look at the commercial data for this year & to present your findings.__
___

# Import Libraries & Load Dataset

### Imports

Version:
* Python 3.8.5
* autoviz==0.0.81
* numpy==1.19.3
* openpyxl==3.0.5
* pandas==1.2.0
* pandas-profiling==2.9.0
* plotly==4.14.1
* plotly-express==0.4.1
* xlrd==2.0.1

In [None]:
# Imports:
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np
import plotly
import plotly_express as px
import plotly.graph_objects as go
import plotly.offline as pyo
from autoviz.AutoViz_Class import AutoViz_Class
%matplotlib inline

#### Plotly Template Settings

In [None]:
# -- Settings Plotly template
#      Reference Link:
#      https://plotly.com/python/templates/
#      Try other themes: 'plotly_dark', 'plotly_white', 'ggplot2', 'seaborn', 'simple_white'
template_style = "plotly_dark"

### Load DataFrame

**Load DataFrame and store it in a variable called "df"**

In [None]:
dataset = "../cif_sales_analysis/data/data.xlsx"
df = pd.read_excel(dataset, index_col=False)

**Inspect first 5 rows of the DataFrame**

In [None]:
df = df.reset_index(drop=True)
# df.head()

# Explore Dataset

## Traditionally

In [None]:
# Basic Info about DataFrame
# df.info()

We can already see that there are no missing values in the dataset.

In [None]:
# Describe Method
# df.describe()

In [None]:
# Get a view of unique values in column, e.g. 'Ship Mode'
# ship_modes = set(df["Ship Mode"].tolist())
# ship_modes
# df["Ship Mode"].unique()

In [None]:
# NaN count for each column
# df.isna().sum()

### Clean Data

I'm going to drop the following columns:

In [None]:
df = df.drop(columns=["Row ID", "Order ID", "Customer ID", "Postal Code", "Country", "Product ID"])

## Automated Reports

#### Pandas Profiling Report

In [None]:
# # Generate Pandas Profiling Report
# profile = ProfileReport(df, title="Sales Profiling Report")

# # View in Notebook
# profile.to_widgets()

In [None]:
# # Export Pandas Profiling Report to HTML
# profile.to_file(f"{directory}/sales_profiling_report.html")

#### Auto Viz Report

In [None]:
# # initiate AutoViz class
# AV = AutoViz_Class()
# # create AutoViz object for the DF
# df_autoviz = AV.AutoViz(dataset, chart_format="bokeh")

# Data Preparation & Analysis

### 🚩 TASKS:
- What was the highest Sale in 2020?
- What is the average discount rate of chairs?
- Add extra columns to separate Year & Month from the Order Date
- Add a new column to calculate the Profit Margin for each sales record
- Export manipulated dataframe to Excel
- Create a new dataframe to reflect total Profit & Sales by Sub-Category
- Develop a function, to return a dataframe which is grouped by a particular column (as an input)

**What was the highest Sale?**

In [None]:
# Highest Sale (Note that all sales are in 2020)
highest_sale = df.nlargest(1, "Sales")
highest_sale

**What is the average Discount on chairs?**

In [None]:
# Create Boolean mask (True for every row whose Sub-Category is "Chairs")
mask = df["Sub-Category"] == "Chairs"

# Use Boolean mask to filter dataframe
mean_discount = round(df.loc[mask, "Discount"].mean(), 2)
f"On average, we give a discount rate of {mean_discount} on chairs."

**Add an extra column for "Order Month" & "Order Year"**

In [None]:
# Order Month (as integer)
df["Order Month"] = df["Order Date"].dt.month
# or, as zero-padded string: df["Order Month"] = df["Order Date"].apply(lambda x: x.strftime("%m"))
# Order Year (as integer)
df["Order Year"] = df["Order Date"].dt.year
# or, as string: df["Order Year"] = df["Order Date"].apply(lambda x: x.strftime("%Y"))

**Add a new column to calculate the Profit Margin for each sales record**

In [None]:
# Profit Margin
df["Profit Margin"] = df["Profit"] / df["Sales"]
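
As a worked example of the formula above, here is a minimal sketch on made-up numbers (the `toy` frame is hypothetical, not taken from the dataset). It also assumes that a record with zero Sales should get a missing margin rather than a division artefact; whether such records exist in the real data has not been checked here.

```python
import numpy as np
import pandas as pd

# Toy records (made-up values for illustration only)
toy = pd.DataFrame({"Sales": [200.0, 50.0, 0.0],
                    "Profit": [50.0, -10.0, 0.0]})

# Profit Margin = Profit / Sales; zero Sales is mapped to NaN first
# so that the result is NaN instead of inf or an undefined 0/0
toy["Profit Margin"] = toy["Profit"] / toy["Sales"].replace(0, np.nan)
print(toy)
```

The first record has a margin of 50 / 200 = 0.25, the second a negative margin because the sale was made at a loss.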

**Export manipulated dataframe back to Excel**

Round numerical data

In [None]:
for col in df.select_dtypes(include=['float64']).columns:
    df[col] = df[col].round(2)
# print(df.select_dtypes(include=['float64']))
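
The cell above only rounds the values; the export itself is not shown. A minimal sketch of how it could look is given below. The helper name `export_rounded` and the output path in the usage comment are assumptions for illustration, and writing `.xlsx` files relies on `openpyxl` being installed (it is in the version list at the top).

```python
from pathlib import Path

import pandas as pd


def export_rounded(frame: pd.DataFrame, path: str, decimals: int = 2) -> pd.DataFrame:
    """Round numeric columns, write the result to an Excel file and return it.

    Hypothetical helper, not part of the original notebook.
    """
    rounded = frame.round(decimals)            # non-numeric columns are left untouched
    Path(path).parent.mkdir(parents=True, exist_ok=True)  # create the folder if needed
    rounded.to_excel(path, index=False)        # index=False keeps the row index out of the sheet
    return rounded


# Usage (hypothetical path):
# export_rounded(df, "output/sales_manipulated.xlsx")
```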

#### Total Profit & Sales by Sub-Category

In [None]:
# Group By Sub-Category [SUM of Sales & Profit]
sub_category_group = df.groupby("Sub-Category")[["Sales", "Profit"]].sum()

# Reset Index
sub_category_group.reset_index(inplace=True)
sub_category_group

#### Develop a function, to return a dataframe which is grouped by a particular column (as an input)

In [None]:
# Groupby as a function
def grouped_data(column: str) -> pd.DataFrame:
    """
    Group the DataFrame by a column and sum its numeric columns.
    :param column: name of the column to group by
    :return: grouped DataFrame with a fresh integer index
    """
    # numeric_only=True skips text and date columns, which cannot be summed meaningfully
    return df.groupby(column).sum(numeric_only=True).reset_index()

# Group DataFrame by Segment
grouped_data("Segment")

# Further Deep Dive & Visualization

### 🚩 Objective: 
- Further Analysis/Deep Dive using various kinds of charts
- Prepare/Refactor Dataframe for different chart types
- Generate & Export 'Ready-To-Present' Charts: Clean & Interactive
-----
#### 📊 Chart Types:
- [x] Histogram
- [x] Boxplot
- [x] Various Barplots
- [x] Scatterplot
- [x] Linechart

In [None]:
# directory to export figures
directory = "/home/rohkoder29/Documents/year2022/python/data_science/cif_sales_analysis/output"

**Distribution Sales [Histogram]**

In [None]:
# Quick Stats Overview for Sales
df["Sales"].describe()

In [None]:
# Create Chart (with plotly_express)
fig1 = px.histogram(df,
                    x="Sales",
                    template=template_style)
# Plot Chart
fig1.show()
# Export Chart to HTML
pyo.plot(fig1, filename=f"{directory}/df_fig1.html", auto_open=False)

**Show the distribution and skewness of Sales [Boxplot]**

In [None]:
# Create Chart
fig2 = px.box(df,
              y="Sales",
              range_y=[0, 1000],
              template=template_style)
# Plot Chart
fig2.show()
# Export Chart to HTML
pyo.plot(fig2, filename=f"{directory}/df_fig2.html", auto_open=False)
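
The boxplot shows skewness visually; it can also be put into numbers. The sketch below uses a made-up `sales` series (not the real data) to compute the sample skewness and the upper whisker limit of the common 1.5 × IQR rule. Note that plotly may compute quartiles slightly differently from pandas, so the fence is only an approximation of what the chart draws.

```python
import pandas as pd

# Toy series (made-up values): four small sales and one very large one
sales = pd.Series([10.0, 12.0, 11.0, 13.0, 500.0], name="Sales")

# Positive skewness means a long tail towards high values
skewness = sales.skew()

# Upper fence of the 1.5 * IQR rule; values above it are treated as outliers
q1, q3 = sales.quantile([0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = sales[sales > upper_fence]

print(f"skewness: {skewness:.2f}, upper fence: {upper_fence}")
print(outliers)
```

On the real data, the same calls would be applied to `df["Sales"]`.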

**Plot Sales by Sub-Category [Bar]**

In [None]:
# Create Dataframe
df_sub_cat = grouped_data("Sub-Category")
df_sub_cat

In [None]:
# Create Chart
fig3 = px.bar(df_sub_cat, 
              x="Sub-Category", 
              y="Sales",
              title="<b>Sales by Sub-Category</b>",
              template=template_style)

# Display Plot
fig3.show()

# Export Chart to HTML
pyo.plot(fig3, filename=f"{directory}/sub_cat_sales_fig3.html", auto_open=False)

**Plot Profit by Sub-Category**

In [None]:
# Create Chart
fig4 = px.bar(df_sub_cat,
              x="Sub-Category",
              y="Profit",
              title="<b>Profit by Sub-Category</b>",
              template=template_style)

# Display Plot
fig4.show()

# Export Chart to HTML
pyo.plot(fig4, filename=f"{directory}/df_sub_cat_fig4.html", auto_open=False)

**Plot Sales & Profit by Sub-Category**

In [None]:
# Create Chart
fig5 = px.bar(df_sub_cat,
              x="Sub-Category",
              y="Sales",
              color="Profit",
              color_continuous_scale=["red", "yellow", "green"],
              title="<b>Sales & Profit by Sub-Category</b>",
              template=template_style)

# Display Plot
fig5.show()

# Export Chart to HTML
pyo.plot(fig5, filename=f"{directory}/df_sub_cat_fig5.html", auto_open=False)

#### Inspect Negative Profit of Tables

Is there any linear correlation between Sales/Profit & Discount? [Scatterplot]

In [None]:
# Create Chart
fig6 = px.scatter(df,
                  x="Sales",
                  y="Profit",
                  color="Discount",
                  title="<b>Scatterplot Sales/Profit by Discount</b>",
                  template=template_style)

# Display Plot
fig6.show()

# Export Chart to HTML
pyo.plot(fig6, filename=f"{directory}/df_fig6.html", auto_open=False)

We can see that higher discount rates tend to go along with larger losses.
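
The question about linear correlation can also be answered numerically with the Pearson correlation matrix. The sketch below runs on a made-up `toy` frame, so the resulting numbers say nothing about the real dataset; on the notebook data the same call would be `df[["Sales", "Profit", "Discount"]].corr()`.

```python
import pandas as pd

# Toy records (made-up values): profit falls as the discount rises
toy = pd.DataFrame({
    "Sales": [100.0, 120.0, 90.0, 110.0, 95.0],
    "Profit": [40.0, 30.0, 10.0, -20.0, -60.0],
    "Discount": [0.0, 0.1, 0.2, 0.5, 0.8],
})

# Pearson correlation (the default method) between every pair of columns
corr = toy.corr()
print(corr.round(2))

# A value close to -1 indicates a strong negative linear relationship
discount_vs_profit = corr.loc["Discount", "Profit"]
```

A correlation coefficient only describes a linear association; it does not by itself show that discounts cause the losses.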

**Check mean Discount by Sub-Category**

In [None]:
# Create new dataframe: Group by 'Sub-Category' then aggregate the mean of 'Discount' and sum of 'Profit'
df_disc_subcat = df.groupby("Sub-Category").agg({"Discount":"mean",
                                                 "Profit":"sum"})

# Display first 5 rows of new dataframe
df_disc_subcat.head()

**Plot Mean Discount by Sub-Category**

In [None]:
# Create Chart
fig7 = px.bar(df_disc_subcat,
              x=df_disc_subcat.index,
              y="Discount",
              color="Profit",
              color_continuous_scale=['red', "yellow", "green"],
              title="<b>Mean Discount by Sub-Category</b>",
              template=template_style)

# Display Plot
fig7.show()

# Export Chart to HTML
pyo.plot(fig7, filename=f"{directory}/df_disc_subcat_fig7.html", auto_open=False)

**Plot Sales & Profit Development for the year 2020**

In [None]:
# Sort Values by Order Date
df_sorted_date = df.sort_values(["Order Date"]).reset_index(drop=True)

# Add cumulative Sales & Profit (new columns)
df_sorted_date["Cumulative Sales"] = df_sorted_date["Sales"].cumsum()
df_sorted_date["Cumulative Profit"] = df_sorted_date["Profit"].cumsum()

# Display the last rows of the sorted dataframe
df_sorted_date.tail()

In [None]:
# # validation
# round(df["Sales"].sum(), 2)  # must equal "Cumulative Sales" in the last row of df_sorted_date
# round(df["Profit"].sum(), 2)  # must equal "Cumulative Profit" in the last row of df_sorted_date

In [None]:
# Create Chart
fig8 = px.line(df_sorted_date,
               x="Order Date",
               y=["Cumulative Sales", "Cumulative Profit"],
               title="<b>Sales/Profit Development</b>",
               template=template_style)

# Display Plot
fig8.show()

# Export Chart to HTML
pyo.plot(fig8, filename=f"{directory}/df_sorted_date_fig8.html", auto_open=False)
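
Because several orders can share the same Order Date, the line above contains one point per order. One possible refinement, sketched here on a made-up `toy` frame, is to sum per day first and then accumulate, which gives exactly one point per date. This is an alternative, not what the cells above do.

```python
import pandas as pd

# Toy orders (made-up values), deliberately unsorted and with a repeated date
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-03", "2020-01-02"]),
    "Sales": [30.0, 10.0, 5.0, 20.0],
    "Profit": [3.0, 1.0, -2.0, 4.0],
})

# Daily totals, ordered by date
daily = toy.groupby("Order Date")[["Sales", "Profit"]].sum().sort_index()

# Running totals, renamed to "Cumulative Sales" / "Cumulative Profit"
cumulative = daily.cumsum().add_prefix("Cumulative ")
print(cumulative)

# Same validation idea as in the notebook: the last row must equal the overall sum
assert cumulative["Cumulative Sales"].iloc[-1] == toy["Sales"].sum()
```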

### **Personal stuff**

In [None]:
df.head(1)

Let's create a new DF as a copy of the current one.

In [None]:
df_me = df.copy()

In [None]:
df_me.head(1)

**What is the total sales by region?**

In [None]:
df_me.groupby("Region")["Sales"].sum().reset_index()

In [None]:
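# Aggregate first so that each region is drawn as a single bar
df_region = df_me.groupby("Region", as_index=False)["Sales"].sum()
fig9 = px.bar(df_region,
              x="Region",
              y="Sales",
              title="<b>Sales by Region</b>",
              template=template_style)
fig9.show()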

In [None]:
fig9_1 = px.pie(df_me,
                values="Sales",
                names="Region",
                color="Region",
                hole=0,
                title="<b>Sales Distribution by Region</b>",
                template=template_style)
fig9_1.update_traces(textposition="inside",
                     textinfo="percent+label",
                     marker=dict(line=dict(color="#000000", width=1.25)),
                     pull=[0, 0, 0.15, 0], opacity=.9, rotation=0)
fig9_1.show()

**Which state has the most units sold in each quarter?**

Let's create a new column "Quarter" for this purpose

In [None]:
# function to get quarter from month
def month_to_quarter(month: int) -> str:
    quarters = {
        "1st": [1, 2, 3],
        "2nd": [4, 5, 6],
        "3rd": [7, 8, 9],
        "4th": [10, 11, 12],
    }

    for label, months in quarters.items():
        if month in months:
            return label
    # Fail loudly instead of silently returning None for an invalid month
    raise ValueError(f"Invalid month: {month}")

In [None]:
# new column
df_me["Quarter"] = df_me["Order Month"].apply(month_to_quarter)
df_me.head(1)

In [None]:
# now let's sum the quantity per quarter and state
quarter_state = df_me.groupby(["Quarter", "State"])[["Quantity"]].sum()
quarter_state.sort_values(["Quarter", "Quantity"], ascending=False).head()

In [None]:
df_qs = df_me.groupby(["Quarter", "State"]).sum(numeric_only=True)

In [None]:
df_qs.head().sort_values(["Quantity"], ascending=False)

In [None]:
df_qs.sort_values(["Quantity"], ascending=False)

Question not yet resolved
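
One possible way to resolve it is to sum the quantity per quarter and state and then pick, within each quarter, the row with the largest total via `idxmax`. The sketch below uses a made-up `toy` frame with the same column names; the states and quantities are invented for illustration.

```python
import pandas as pd

# Toy records (made-up values) with the same column names as df_me
toy = pd.DataFrame({
    "Quarter": ["1st", "1st", "1st", "2nd", "2nd"],
    "State": ["Texas", "Ohio", "Texas", "Ohio", "Utah"],
    "Quantity": [3, 4, 2, 1, 6],
})

# Total units per (Quarter, State), kept as regular columns
totals = toy.groupby(["Quarter", "State"], as_index=False)["Quantity"].sum()

# Row label of the largest total within each quarter
top_idx = totals.groupby("Quarter")["Quantity"].idxmax()

# One row per quarter: the state with the most units sold
top_states = totals.loc[top_idx].reset_index(drop=True)
print(top_states)
```

If two states tie within a quarter, `idxmax` keeps only the first one it encounters, so ties would need separate handling.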

**What is the Sales Trend in the different Regions?**
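
A possible starting point for this question, sketched on a made-up `toy` frame rather than the real data, is to total Sales per month and region and lay the regions out side by side. Using monthly granularity is an assumption; weekly or quarterly totals would follow the same pattern.

```python
import pandas as pd

# Toy orders (made-up values) with the column names used in this notebook
toy = pd.DataFrame({
    "Order Date": pd.to_datetime(["2020-01-05", "2020-01-20", "2020-02-10", "2020-02-11"]),
    "Region": ["East", "West", "East", "East"],
    "Sales": [100.0, 50.0, 30.0, 20.0],
})
toy["Order Month"] = toy["Order Date"].dt.month

# Monthly Sales per region: one row per month, one column per region;
# months without sales in a region are filled with 0
monthly = (toy.groupby(["Order Month", "Region"])["Sales"]
              .sum()
              .unstack(fill_value=0))
print(monthly)
```

For a chart, the long-form result of the `groupby(...).sum().reset_index()` step could be passed to `px.line` with `x="Order Month"`, `y="Sales"` and `color="Region"`.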