### Given Datasets:
#### 1. Sample - Superstore_Orders.csv - Superstore orders from Jan 2017 to Dec 2020.
#### 2. Sales Target (US)_Full Data.csv - Sales Target Data by Product Category and Segment from Jan 2017 to Dec 2020.

In [None]:
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import math
from IPython.display import Image

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

In [None]:
orders = pd.read_csv('/kaggle/input/superstore-orders-sales-target-data-2017-to-2020/Sample - Superstore_Orders.csv')

In [None]:
orders.head()

In [None]:
orders.info() 
orders.shape 

#### Checking for duplicates.

In [None]:
orders.duplicated().sum()

In [None]:
# Order ID and Product ID together appear to act as the primary key; count the distinct pairs

orders[['Order ID','Product ID']].drop_duplicates().shape

In [None]:
orders[['Order ID','Product ID']].duplicated().sum()

In [None]:
orders.loc[orders[['Order ID','Product ID']].duplicated()]

In [None]:
orders.loc[(orders['Order ID'] == 'CA-2019-129714') & (orders['Product ID'] == 'OFF-PA-10001970')]

# The duplicate most likely comes from a data collection or reporting error; we could delete it, correct it, or leave it as it is.
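
The check above shows only the repeated rows. A minimal sketch, on an invented toy table rather than the real orders data, of how `keep=False` lists every member of a duplicated key group so the conflicting entries can be compared side by side:

```python
import pandas as pd

# Toy stand-in for the orders table; all values are invented for illustration.
toy = pd.DataFrame({
    'Order ID':   ['A-1', 'A-1', 'A-2', 'A-3'],
    'Product ID': ['P-1', 'P-1', 'P-1', 'P-2'],
    'Quantity':   [2, 4, 1, 3],
})

key = ['Order ID', 'Product ID']

# keep=False flags every row of a duplicated group, not only the later repeats.
dupes = toy[toy.duplicated(subset=key, keep=False)]
print(dupes)
```

Seeing both rows together makes it easier to judge whether the later entry looks like a correction of the earlier one.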


### Data screening & cleaning steps for orders data:

#### 1. Postal Code has missing values and should be converted to string.
#### 2. Remove special characters from the Sales, Discount, Profit and Sales Forecast columns and convert them to float64.
#### 3. Convert Ship Date & Order Date to datetime and sort the dataset by Order Date.
#### 4. Keep the latest entry among duplicates and remove the others (Assumption: corrections are allowed after the customer has ordered).
#### 5. Rearrange columns in a meaningful order and drop the Row ID column.

In [None]:
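# Assumes Postal Code was read as a numeric column: fill the missing codes with 5401,
# then zero-pad to five digits so that '05401' keeps its leading zero and no '.0' suffix appears.
orders['Postal Code'] = orders['Postal Code'].fillna(5401).astype(int).astype(str).str.zfill(5)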

In [None]:
orders['Sales'] = orders['Sales'].apply(lambda x: x.replace('$','').replace(',','')).astype('float64')
orders['Sales Forecast'] = orders['Sales Forecast'].apply(lambda x: x.replace('$','').replace(',','')).astype('float64')
orders['Discount'] = orders['Discount'].apply(lambda x: x.strip('%')).astype('float64')
orders['Profit'] = orders['Profit'].apply(lambda x: x.replace('(','-').replace(')', '').replace('$','').replace(',','')).astype('float64')
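
The currency clean-up can also be written as one reusable function. This is a sketch on made-up strings; `parse_currency` is a hypothetical helper name, and the accepted formats (`$1,234.50` and accounting-style `($12.00)`) are assumed from the replacements used above:

```python
import pandas as pd

def parse_currency(s: pd.Series) -> pd.Series:
    """Convert strings such as '$1,234.50' or '($12.00)' to float64."""
    cleaned = (
        s.str.replace(r'[$,]', '', regex=True)              # drop '$' and thousands separators
         .str.replace(r'^\((.*)\)$', r'-\1', regex=True)    # accounting negatives: (x) -> -x
    )
    return cleaned.astype('float64')

# Invented sample values, not taken from the dataset.
sample = pd.Series(['$1,234.50', '($12.00)', '$0.99'])
parsed = parse_currency(sample)
print(parsed)
```

Using the vectorised `.str` accessor avoids a Python-level `lambda` per row and passes missing values through as NaN instead of raising.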

In [None]:
orders['Order Date'] = pd.to_datetime(orders['Order Date'])
orders['Ship Date'] = pd.to_datetime(orders['Ship Date'])
orders = orders.sort_values(by = 'Order Date')

In [None]:
orders.head()

In [None]:
orders.drop_duplicates(subset = ['Order ID','Product ID'], keep = 'last', inplace = True)
# Assumption: keep the last duplicated row, since corrections are allowed after the customer has ordered.
# Older pandas versions do not accept 'ignore_index' in 'drop_duplicates', so the index is reset explicitly.
orders = orders.reset_index(drop=True)

In [None]:
orders.shape

In [None]:
rearranged_col = ['Order ID','Order Date','Ship Date','Ship Mode','Days to Ship Actual','Days to Ship Scheduled','Ship Status','Customer ID','Customer Name','Segment','Country/Region','City','State','Postal Code','Region','Product ID','Category','Sub-Category','Product Name','Sales','Quantity','Discount','Profit','Sales Forecast']
# 'Row ID' is not in the list, so reindex drops it.
orders_data = orders.reindex(columns = rearranged_col)

In [None]:
orders_data.head()

#### Importing Sales Target Data

In [None]:
sales_target = pd.read_csv('/kaggle/input/superstore-orders-sales-target-data-2017-to-2020/Sales Target (US)_Full Data.csv')

In [None]:
sales_target.head()

In [None]:
sales_target.info()
sales_target.shape

#### Checking for duplicates

In [None]:
sales_target.duplicated().sum()

In [None]:
# Category, Order Date and Segment together appear to act as the primary key; count the distinct combinations

sales_target[['Category','Order Date','Segment']].drop_duplicates().shape

In [None]:
sales_target[['Category','Order Date','Segment']].duplicated().sum()

In [None]:
sales_target.loc[sales_target[['Category','Order Date','Segment']].duplicated()]

In [None]:
sales_target.loc[(sales_target['Category'] == 'Furniture') & (sales_target['Order Date'] == '3/1/2018') & (sales_target['Segment'] == 'Corporate')]

### Data screening & cleaning steps for sales target data:

#### 1. Order Date to be converted to datetime and rearrange dataset by Order Date.
#### 2. Keep the latest entry in the duplicates, remove other duplicates (Assumption : Correction is allowed in Sales Target Data).

In [None]:
sales_target['Order Date'] = pd.to_datetime(sales_target['Order Date'])

In [None]:
sales_target.drop_duplicates(subset = ['Category','Order Date','Segment'], keep = 'last', inplace = True)
# Assumption: keep the last duplicated row, since corrections are allowed in the Sales Target data.
# Older pandas versions do not accept 'ignore_index' in 'drop_duplicates', so the index is reset explicitly.
sales_target = sales_target.reset_index(drop=True)

In [None]:
sales_target.head()

In [None]:
sales_target.shape

In [None]:
orders_data.shape

### Preparing the final data.

In [None]:
final_data = pd.merge(orders_data,sales_target,how='left',on=['Category','Order Date','Segment'])
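
A left merge like this can silently multiply rows if the target table still holds duplicate keys, and it leaves NaN targets where no key matches. A hedged sketch on invented toy frames showing how the `validate` and `indicator` arguments of `pd.merge` can guard against both:

```python
import pandas as pd

# Toy frames with invented values; only the key column names mirror the notebook.
orders_toy = pd.DataFrame({
    'Category':   ['Furniture', 'Furniture', 'Technology'],
    'Order Date': pd.to_datetime(['2018-03-01', '2018-03-01', '2018-03-02']),
    'Segment':    ['Corporate', 'Corporate', 'Consumer'],
    'Sales':      [100.0, 250.0, 80.0],
})
targets_toy = pd.DataFrame({
    'Category':     ['Furniture'],
    'Order Date':   pd.to_datetime(['2018-03-01']),
    'Segment':      ['Corporate'],
    'Sales Target': [400.0],
})

keys = ['Category', 'Order Date', 'Segment']

# validate raises MergeError if the right-hand keys are not unique;
# indicator adds a '_merge' column recording where each row found a match.
merged = pd.merge(orders_toy, targets_toy, how='left', on=keys,
                  validate='many_to_one', indicator=True)
unmatched = (merged['_merge'] == 'left_only').sum()
print(len(merged), unmatched)
```

Note that in a many-to-one merge the same target value is repeated on every order line that shares the key.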

In [None]:
final_data.head()

In [None]:
final_data.info()
final_data.shape

### Additional information we could pull out from the dataset:

#### 1. Price of Product
#### 2. Profit Ratio
#### 3. Sales Target status
#### 4. Forecast bias
#### 5. Sales Forecast status

In [None]:
final_data['Price'] = round(final_data['Sales']/final_data['Quantity'],2)

In [None]:
final_data['Profit Ratio'] = round(final_data['Profit']*100/final_data['Sales'],2)
final_data.loc[(final_data['Sales'] == 0),'Profit Ratio'] = 0

In [None]:
final_data.loc[((final_data['Sales Target'] - final_data['Sales'])<=0),'Sales Target Status'] = 'Target Achieved'
final_data.loc[((final_data['Sales Target'] - final_data['Sales'])>0),'Sales Target Status'] = 'Target Not Achieved'

In [None]:
final_data['forecast_bias'] = final_data['Sales Forecast'] - final_data['Sales']

In [None]:
final_data.loc[(final_data['forecast_bias']>0),'Sales Forecast Status'] = 'Over Forecast'
final_data.loc[(final_data['forecast_bias']== 0),'Sales Forecast Status'] = 'Accurate Forecast'
final_data.loc[(final_data['forecast_bias']<0),'Sales Forecast Status'] = 'Under Forecast'
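
The three-way labelling can also be expressed in a single step with `np.select`. The sketch below uses an invented bias series; the `'Unknown'` default for missing bias values is an assumption of this example, not something the notebook defines:

```python
import numpy as np
import pandas as pd

# Invented forecast_bias values, including a missing one.
bias = pd.Series([12.5, 0.0, -3.0, np.nan])

conditions = [bias > 0, bias == 0, bias < 0]
labels = ['Over Forecast', 'Accurate Forecast', 'Under Forecast']

# Rows matching no condition (here: NaN bias) receive the default label.
status = pd.Series(np.select(conditions, labels, default='Unknown'), index=bias.index)
print(status)
```

With the separate `.loc` assignments used above, the same unmatched rows would be left as NaN instead.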

In [None]:
#final_data.to_csv('final_data.csv',index=False)

#### Descriptive statistics of the dataset:

In [None]:
final_data.describe().T

#### Insights for categorical variables:



In [None]:
cat_var = ['Ship Mode','Ship Status','Customer ID','Customer Name','Segment','Country/Region','City','State','Postal Code','Region','Product ID','Category','Sub-Category','Product Name','Sales Target Status','Sales Forecast Status']

uniq_cat = pd.DataFrame()
uniq_cat['variables'] = cat_var

for x in cat_var:
    uniq_cat.loc[(uniq_cat['variables'] == x),'no. of unique values'] = final_data[x].nunique()

uniq_cat['no. of unique values'] = uniq_cat['no. of unique values'].astype(int)
    
new_cols = uniq_cat.loc[(uniq_cat['no. of unique values'] < 10),'variables'].to_list()

uniq_cat.loc[(uniq_cat['no. of unique values'] >= 10),'unique values'] = '-'

for x in new_cols:
    uniq_cat.loc[(uniq_cat['variables'] == x),'unique values'] = str(final_data[x].unique())

uniq_cat
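
The same summary can be built without row-by-row `.loc` assignments. A minimal sketch on an invented two-column frame; `max_listed` is a hypothetical name for the cut-off (the notebook uses 10):

```python
import pandas as pd

# Invented categorical data for illustration.
toy = pd.DataFrame({
    'Segment': ['Consumer', 'Corporate', 'Consumer', 'Home Office'],
    'City':    ['Austin', 'Boston', 'Chicago', 'Denver'],
})

max_listed = 4  # list the values only when a column has fewer unique values than this

cols = list(toy.columns)
summary = pd.DataFrame({
    'variables': cols,
    'no. of unique values': [toy[c].nunique() for c in cols],
})
summary['unique values'] = [
    ', '.join(map(str, toy[c].unique())) if toy[c].nunique() < max_listed else '-'
    for c in cols
]
print(summary)
```

Because the list and the `'-'` placeholder come from a single condition, no column can fall between the two thresholds.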

In [None]:
plt.figure(figsize=(16, 6))
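# compute the correlation matrix once, over the numeric columns only
corr = final_data.select_dtypes(include='number').corr()
# mask the upper triangle so that each pair is shown once (np.bool was removed from NumPy; use bool)
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='BrBG')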
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':18}, pad=16);

#### Insights from correlation matrix:

 - In the heatmap, Discount and Profit Ratio show the strongest negative correlation, which suggests that higher discounts go with lower profit margins.
 - Profit and Price are strongly positively correlated: as Price increases, Profit tends to increase too.
 - Sales, Sales Forecast, Price and forecast_bias are positively correlated with one another, as expected, since Price and forecast_bias are derived from Sales.
 - Sales Target and Profit are positively correlated with Sales: higher sales go with higher profit, and the sales target rises with sales as well.

### Insights:

In [None]:
Image(filename="../input/analysisfiles/sales_univariate_quarter.PNG", width= "800", height="400")

- At the quarterly aggregate level, the forecast model always over-forecasts and the sales target is always above actual sales.

- Sales peak in Q4 of every year and drop significantly in Q1. One likely reason is that Q4 is the holiday season.
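
The chart itself is loaded from an image file, so the aggregation behind it is not shown. A sketch, under the assumption that it is a plain sum per calendar quarter, of how such a quarterly view could be computed in pandas (toy values, invented for illustration):

```python
import pandas as pd

# Invented order lines; column names follow the notebook.
toy = pd.DataFrame({
    'Order Date': pd.to_datetime(['2017-01-15', '2017-02-20', '2017-11-05', '2017-12-24']),
    'Sales': [100.0, 150.0, 400.0, 350.0],
    'Sales Forecast': [120.0, 160.0, 420.0, 380.0],
})

# Group order lines by calendar quarter and total sales and forecast.
quarterly = (
    toy.groupby(toy['Order Date'].dt.to_period('Q'))[['Sales', 'Sales Forecast']]
       .sum()
)
# A positive bias at this level means the quarter was over-forecast.
quarterly['forecast_bias'] = quarterly['Sales Forecast'] - quarterly['Sales']
print(quarterly)
```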


In [None]:
Image(filename="../input/analysisfiles/yearly_sales.PNG", width= "800", height="400")

- Sales dip in October of every year; the reason should be investigated.


In [None]:
Image(filename="../input/analysisfiles/Product_Category_Stats.PNG", width= "800", height="400")

- Sales are highest for Chairs, Storage and Phones within their respective product categories.
- Profit is highest for Chairs, Paper and Copiers within their respective product categories.
- Tables and Supplies have the lowest profit, which could be due to high shipping costs.

In [None]:
Image(filename="../input/analysisfiles/Regional_Sales.PNG", width= "800", height="400")

- California has the highest sales, which suggests an opportunity to open another store there. North Dakota has the lowest sales.


In [None]:
Image(filename="../input/analysisfiles/Regional_Profit.PNG", width= "800", height="400")

- California & New Jersey have the highest profit, while Texas & Illinois have the lowest. There is an opportunity to open another store in New Jersey as well.


In [None]:
Image(filename="../input/analysisfiles/Regional_Profit_Ratio.PNG", width= "800", height="400")

- District of Columbia & Iowa have the highest profit ratios of all the states. Illinois & Texas have the lowest.


In [None]:
Image(filename="../input/analysisfiles/Regional_Discount.PNG", width= "800", height="400")

- The highest discounts are recorded in Illinois & Texas, which could explain why these two states have the lowest profits. Reducing discounts there could increase profit.
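
To check the discount and profit link numerically rather than from the charts, the state-level figures can be placed side by side. A sketch on invented rows; the aggregations (mean discount, summed sales and profit) are assumptions about how the charts were built:

```python
import pandas as pd

# Invented order lines; Discount is in percent, as in the cleaned orders data.
toy = pd.DataFrame({
    'State': ['Texas', 'Texas', 'California', 'California'],
    'Discount': [40.0, 20.0, 0.0, 10.0],
    'Sales': [100.0, 200.0, 300.0, 100.0],
    'Profit': [-30.0, 10.0, 90.0, 20.0],
})

by_state = toy.groupby('State').agg(
    avg_discount=('Discount', 'mean'),
    total_sales=('Sales', 'sum'),
    total_profit=('Profit', 'sum'),
)
# Profit ratio recomputed from state totals, not averaged over order lines.
by_state['profit_ratio'] = (by_state['total_profit'] * 100 / by_state['total_sales']).round(2)
print(by_state.sort_values('avg_discount', ascending=False))
```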


In [None]:
Image(filename="../input/analysisfiles/shipment_status.PNG", width= "800", height="400")

- In every year, orders that shipped early account for the highest sales and profit.


In [None]:
Image(filename="../input/analysisfiles/Sales_Target_Stats.PNG", width= "800", height="400")

- Sales missed the target for Office Supplies in 2017, Furniture in 2018, Technology in 2019, and Office Supplies again in 2020.


In [None]:
Image(filename="../input/analysisfiles/Ship_Mode_Stats.PNG", width= "800", height="400")

- For Office Supplies and Tables, profit is very low when items are shipped by Standard Class. Shifting demand for these products to the other ship modes could improve profit; the organization could consider removing the Standard Class option for them.


In [None]:
Image(filename="../input/analysisfiles/Executive_Overview.PNG", width= "800", height="400")

### Executive Dashboard

- Over the evaluation period from 2017 to 2020, the organization should focus on improving sales & profit in Illinois and Texas. It should also consider opening a new store in California to increase profit.



### Forecast model KPIs and insights.

#### Additional information required for model accuracy calculation:

#### 1. dollarized weight
#### 2. MAPE
#### 3. WMAPE

#### In order to get forecast model accuracy at the global level, we first need to aggregate the data to the SKU * Order Date level

In [None]:
# calculating dollarized weight

# each row's share of the total sales on its Order Date, computed in one vectorised step
daily_sales = final_data.groupby('Order Date')['Sales'].transform('sum')
final_data['dollarized_wgt'] = (final_data['Sales'] / daily_sales).round(2)

In [None]:
# Row-level absolute percentage error; rows with zero sales and a positive forecast are set to 100.
has_sales = final_data['Sales'] > 0
final_data['MAPE'] = 0.0  # float, so the percentage values below fit the column dtype
final_data.loc[(final_data['Sales'] == 0) & (final_data['Sales Forecast'] > 0), 'MAPE'] = 100.0
final_data.loc[has_sales, 'MAPE'] = (final_data.loc[has_sales, 'Sales'] - final_data.loc[has_sales, 'Sales Forecast']).abs() / final_data.loc[has_sales, 'Sales'] * 100
final_data['MAPE'] = final_data['MAPE'].round(2)

In [None]:
final_data['WMAPE'] = final_data['dollarized_wgt'] * final_data['MAPE']

In [None]:
accuracy_data = final_data.groupby(['Product ID','Order Date'])['WMAPE'].sum().reset_index()

In [None]:
accuracy = 100 - accuracy_data['WMAPE'].mean()

In [None]:
accuracy
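
As a cross-check, the figure above can be compared with the conventional sales-weighted WMAPE, defined as the sum of absolute errors divided by the sum of actual sales. This is a different calculation from the per-date weighting used in this notebook, shown here on invented numbers only:

```python
import pandas as pd

# Invented actuals and forecasts for illustration.
toy = pd.DataFrame({
    'Sales': [100.0, 50.0, 200.0],
    'Sales Forecast': [110.0, 40.0, 230.0],
})

abs_err = (toy['Sales'] - toy['Sales Forecast']).abs()

# Conventional WMAPE in percent: total absolute error relative to total actual sales.
wmape = abs_err.sum() / toy['Sales'].sum() * 100
accuracy_check = 100 - wmape
print(round(wmape, 2), round(accuracy_check, 2))
```

If the two accuracy figures differ substantially on the real data, the weighting and aggregation steps are the first place to look.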

### Forecast model insights:

 - The forecast model never under-forecasts.
 - The forecast model has 93.83% global forecast accuracy for the evaluation period from Jan 2017 to Dec 2020.