# Author : Akash Kothare

Data Science & Business Analytics Intern (Batch - Dec'20)

## Task 3: Exploratory Data Analysis - Retail


In this EDA task, we clean the data and visualize it in several ways to derive insights that are useful for the business.

## Importing Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
#from chart_studio.plotly import __version__
import cufflinks as cf

In [None]:
from plotly.offline import download_plotlyjs,init_notebook_mode,plot, iplot

init_notebook_mode(connected=True)

cf.go_offline()

## Loading the Dataset

In [None]:
df=pd.read_csv('../input/tsf-datasets/SampleSuperstore.csv')

In [None]:
df.head()

In [None]:
df.tail()

In [None]:
df.info()

In [None]:
#checking shape of the whole dataset
df.shape

In [None]:
#displaying column names
df.columns

In [None]:
#checking null values
df.isnull().sum()

In [None]:
#displaying datatype of columns
df.dtypes

In [None]:
#checking if any duplicate row is present in the dataset
print("There are {0} duplicated rows in the data!".format(df.duplicated().sum()))

In [None]:
#removing the 17 duplicated rows to avoid errors in further calculations
df.drop_duplicates(inplace =  True)

In [None]:
#checking shape again after removing those 17 duplicated rows
df.shape

In [None]:
df['Country'].nunique()

In [None]:
df['Postal Code']

There is only one country in the whole dataset, so dropping the Country column will not affect further analysis. Postal Code is not used in the analysis either, so it is dropped along with Country.
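
The check for single-valued columns can be generalised. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the dataset) to list every column that holds only one distinct value and drop them together.

```python
import pandas as pd

# Toy frame standing in for the retail data (values are made up for illustration)
toy = pd.DataFrame({
    "Country": ["United States", "United States", "United States"],
    "Region": ["West", "East", "West"],
    "Sales": [10.0, 20.5, 7.25],
})

# A column with a single distinct value carries no information for the analysis
constant_cols = [col for col in toy.columns if toy[col].nunique() == 1]
print(constant_cols)

# Drop all such columns in one call
trimmed = toy.drop(columns=constant_cols)
print(trimmed.columns.tolist())
```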

In [None]:
#removing the unimportant columns
df = df.drop(['Country', 'Postal Code'], axis =1)

In [None]:
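#checking for correlation between the numeric columns visually
plt.subplots(figsize = (12, 8))
#restrict to numeric columns; recent pandas versions raise an error if text columns are passed to corr()
sns.heatmap(df.select_dtypes(include = [np.number]).corr(), annot = True, cmap = 'magma', linewidths = 8, linecolor = 'white')
plt.show()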

## Observation : No strong correlations are found between the numeric columns.

In [None]:
df_num = df.select_dtypes(include = [np.number])

In [None]:
#BoxPlot

plt.figure(figsize = [12, 8])
sns.set(style = 'whitegrid')
sns.boxplot(x = 'variable', y = 'value', data = pd.melt(df_num), width = 1)
plt.show()

In [None]:
df_num.iplot(kind='box')

From the above plot, the outliers in Sales and Profit are clearly visible. As the dataset is large, we can remove the rows containing outliers to improve our results.

In [None]:
#Removal of Outliers
def remove_outlier(dataset, k = 3.33):
    """Return the rows whose numeric values all lie within k standard deviations of the column mean."""
    #start by keeping every row, then narrow the mask column by column
    keep = pd.Series(True, index = dataset.index)
    for col in dataset.select_dtypes(include = [np.number]).columns:
        mean = dataset[col].mean()
        std = dataset[col].std()
        #a row survives only if it passes the check for every numeric column
        keep &= (dataset[col] > mean - k * std) & (dataset[col] < mean + k * std)
    return dataset.loc[keep]

In [None]:
ds = remove_outlier(df, k = 3.33)

In [None]:
ds_num = ds.select_dtypes(include = [np.number])

In [None]:
#lets check if the outliers are removed or not
#BoxPlot(After removing outliers)

plt.figure(figsize = [12, 8])
sns.set(style = 'whitegrid')
sns.boxplot(x = 'variable', y = 'value', data = pd.melt(ds_num), width = 1)
plt.show()

In [None]:
ds_num.iplot(kind='box')

From the above BoxPlot, it is visible that most of the outliers are removed and thus we will use this dataset for EDA.
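
One reason some points can still appear beyond the whiskers: seaborn's default boxplot marks points more than 1.5 times the interquartile range from the quartiles, whereas `remove_outlier` keeps everything within k standard deviations of the mean. The sketch below compares the two rules on a made-up series (the values are purely illustrative).

```python
import pandas as pd

# Made-up values: a tight cluster, two moderately large values and one extreme value
values = pd.Series([10, 11, 12, 13, 14] * 6 + [30, 35, 500], dtype=float)

# Rule used by the filter above: flag values more than k standard deviations from the mean
k = 3.33
mean, std = values.mean(), values.std()
sigma_outliers = values[(values < mean - k * std) | (values > mean + k * std)]

# Rule used by a default boxplot: flag values beyond 1.5 * IQR from the quartiles
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("k-sigma rule flags:", sigma_outliers.tolist())
print("1.5*IQR rule flags:", iqr_outliers.tolist())
```

In this toy series the standard-deviation rule flags only the extreme value, while the boxplot rule also flags the two moderately large ones.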

## Exploratory Data Analysis

In [None]:
ds.shape

In [None]:
ds.head()

In [None]:
#display basic information
ds.info()

In [None]:
ds.describe()

In [None]:
#display number of unique entries in the Categorical Columns
for col in ds.columns:
    if ds[col].dtype == 'object':
        print("Number of unique entries in", col + " are", ds[col].nunique())
        print("================================================")
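
The same counts can also be obtained without an explicit loop. This is a minimal sketch on a made-up frame (`toy` is illustrative only); it selects the non-numeric columns and calls `nunique()` on them in one step.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column (made-up values)
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer", "Home Office"],
    "Region": ["West", "East", "West", "South"],
    "Sales": [10.0, 20.5, 7.25, 3.0],
})

# Keep only the non-numeric columns and count distinct values in each
unique_counts = toy.select_dtypes(exclude="number").nunique()
print(unique_counts)
```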

## Data Visualization

In [None]:
ds.iplot(x = 'Region', y = 'Sales', kind = 'bar', title = 'Region vs Sales', xTitle = 'Region', yTitle = 'Sales')

## Observation : The West region is leading the sales followed by the East, Central and South.
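
A ranking like this can also be checked numerically instead of being read off the chart. The sketch below assumes a frame with `Region` and `Sales` columns; the rows are made up for illustration and do not come from the dataset.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    "Region": ["West", "East", "West", "Central", "South", "East"],
    "Sales": [100.0, 80.0, 50.0, 60.0, 40.0, 30.0],
})

# Total sales per region, largest first
region_totals = toy.groupby("Region")["Sales"].sum().sort_values(ascending=False)
print(region_totals)
```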

In [None]:
#Category wise sales in each region
plt.figure(figsize = [12, 8])
ax = sns.barplot(x = "Region", y = "Sales", hue = "Category", data = ds, palette = 'Greens')

## Observation : In each and every region, sales for 'Office Supplies' are very poor. Furniture and Technology are well ahead.

In [None]:
#Segment wise count of the ship modes
ax = sns.catplot(x = 'Ship Mode', hue = "Segment", data = ds, kind = 'count', aspect = 1.5, palette = "Set1")

## Observation : Not many surprises here. The Consumer count is the highest for every ship mode, and customers generally prefer 'Standard Class'.

In [None]:
#Segment wise sales in each region
plt.figure(figsize = [12, 8])
ax = sns.barplot(x = 'Region', y = 'Sales', hue = "Segment", data = ds, palette = "Set1")

## Observation : In terms of sales, little difference is seen between the Segments in any region. Overall, 'Corporate' leads slightly.

In [None]:
#Sub-Category vs Sales
ds.iplot(x = 'Sub-Category', y = 'Sales', kind = 'bar', colors = 'pink', title = 'Sub-Category VS Sales', xTitle = 'Sub-Category', yTitle = 'Sales')

## Observation : Sales of Sub Categories such as Chairs and Phones are much higher than any other item.

In [None]:
#Aggregated views from pairplot
sns.set_palette('dark')
ax = sns.pairplot(ds)

## Observation : No strong relations are found between the columns of the dataset.

=================================================================================================================

## Interesting Insights using Stats

In [None]:
#Based on Cities
grouped = ds.groupby('City')

In [None]:
#aggregated sale per city
agg_sales = grouped['Sales'].agg('sum').sort_values(ascending = False).reset_index()

In [None]:
#Cities with highest total sales
agg_sales.head()

## Observation : New York City has the highest total sales, followed by Los Angeles and San Francisco.

In [None]:
#aggregated Profit per city
agg_profit = grouped['Profit'].agg('sum').sort_values(ascending = False).reset_index()

In [None]:
#Cities with Highest total Profit
agg_profit.head()

## Observation : Similarly most profit is earned from New York City followed by Los Angeles and Seattle.

In [None]:
#Aggregate Discount per city
agg_dist = grouped['Discount'].agg('sum').sort_values(ascending = False).reset_index()

In [None]:
#Cities with highest aggregated Discount
agg_dist.head()

## Observation : Interestingly, the highest total discount is for Philadelphia, followed by Houston and Chicago. Shouldn't they lead the Sales and Profit tables as well?
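
One thing to keep in mind when reading a total of discount rates: it grows with the number of orders, so a city with many modestly discounted orders can outrank a city with a few heavily discounted ones. The sketch below illustrates this with made-up cities `A` and `B` (hypothetical names and values).

```python
import pandas as pd

# City A: five orders at 20% discount; City B: two orders at 40% discount (made-up values)
toy = pd.DataFrame({
    "City": ["A"] * 5 + ["B"] * 2,
    "Discount": [0.2] * 5 + [0.4] * 2,
})

# Total, average and number of orders side by side
summary = toy.groupby("City")["Discount"].agg(total="sum", average="mean", orders="count")
print(summary)
```

Here `A` has the larger total only because it has more orders, while `B` has the higher average discount.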

In [None]:
#Average Sales per city
avg_sales = grouped['Sales'].agg('mean').sort_values(ascending = False).reset_index()

In [None]:
#Cities with highest Average sales
avg_sales.head()

In [None]:
#Cities with lowest Average sales
avg_sales.tail()

In [None]:
#Average Profit per city
avg_profit = grouped['Profit'].agg('mean').sort_values(ascending = False).reset_index()

In [None]:
#Cities with highest Average profit
avg_profit.head()

In [None]:
#Cities with lowest Average profit
avg_profit.tail()

In [None]:
#Average Discount per city
avg_dist = grouped['Discount'].agg('mean').sort_values(ascending = False).reset_index()

In [None]:
#Cities with highest Average discount
avg_dist.head()

In [None]:
#Cities with lowest Average Discount
avg_dist.tail()

## Observation : Interestingly, the cities that topped total Sales, total Profit or total Discount do not lead any of these average rankings. The averages nevertheless give an overall picture of how each city performs per order.

In [None]:
#Cities having High Average Discounts
high_dist = avg_dist[avg_dist['Discount'] >= 0.7]

#Cities having Low Average Discounts
low_dist = avg_dist[avg_dist['Discount'] == 0]

#Cities having High Average Sales
high_sales = avg_sales[avg_sales['Sales'] > 500]

#Cities having low Average Sales
low_sales = avg_sales[avg_sales['Sales'] < 50]

#Cities having High Average Profit
high_profit = avg_profit[avg_profit['Profit'] > 100]

#Cities having low Average profit
low_profit = avg_profit[avg_profit['Profit'] < 0]

#Cities with High-Average-Discounts but Low-Average-Sales
merged = pd.merge(high_dist, low_sales, on = ['City'], how = 'inner')
merged


## Important Insight #1 : Here we can see 7 cities where the company gives high discounts but sales are very low. Since discounts are already high, increasing them further is not an option, so the investment in these cities is not paying off.
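
The pattern used for these insights is to filter two per-city tables separately and then keep only the cities that appear in both, which is what an inner merge on `City` does. A minimal sketch with made-up cities and values (all names here are illustrative):

```python
import pandas as pd

# Made-up per-city averages
avg_discount = pd.DataFrame({"City": ["A", "B", "C"], "Discount": [0.75, 0.80, 0.10]})
avg_sales = pd.DataFrame({"City": ["A", "B", "C"], "Sales": [20.0, 600.0, 30.0]})

# Filter each table with its own threshold
high_discount = avg_discount[avg_discount["Discount"] >= 0.7]  # cities A and B
low_sales = avg_sales[avg_sales["Sales"] < 50]                 # cities A and C

# The inner merge keeps only cities present in both filtered tables
both = pd.merge(high_discount, low_sales, on="City", how="inner")
print(both)
```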

In [None]:
#Cities with high Average Sales as well as Average Profit
merged2 = pd.merge(high_sales, high_profit, on = ['City'], how = 'inner')
merged2

## Important Insight #2 : These stats are encouraging. In all 15 of these cities, both sales and profit are quite good. Investing more in these cities (in discounts and other aspects) could grow the business further. These can be termed the Hot-Spots.

In [None]:
#Cities where Average Discount is less but Average Sales is High
merged3 = pd.merge(low_dist, high_sales, on = 'City', how = 'inner')
merged3

## Important Insight #3 : These 10 cities generate high average sales in spite of zero discount. If investment in these cities is increased, they could drive large sales as well as large profits. These can be termed the Dark-Horses.

In [None]:
#Cities with high Average sales but low Average profit
merged4 = pd.merge(high_sales, low_profit, on = 'City', how = 'inner')
merged4

## Important Insight #4 : In Richardson, sales are good but the company is making a loss. Either the focus should shift away from this city or the reasons for the loss should be investigated.

In [None]:
#Cities with high Average discount but low Average profit
merged5 = pd.merge(high_dist, low_profit, on = 'City', how = 'inner')
merged5

## Important Insight #5 : The 8 cities above get the highest average discounts, yet the business is making a loss there. Either these cities need close attention to find the cause, or the discounts should be withdrawn to recover the loss.

In [None]:
#Cities with low Average discount but High Average profit
merged6 = pd.merge(low_dist, high_profit, on  = 'City', how =  'inner')
merged6

## Important Insight #6 : In these 18 cities the company offers no discount at all, yet they generate a good amount of profit. These Hot-Spots deserve extra care and investment.

## Some visuals with profit :

In [None]:
#State-wise Profit
plt.figure(figsize = [24, 15])
ax = sns.barplot(x = 'State', y = 'Profit', data  = ds, palette = 'Set1')
plt.xticks(rotation = 90, fontsize = 16)
plt.yticks(fontsize = 16)
plt.title("State VS Profit", fontsize = 24)
plt.xlabel("States", fontsize = 20)
plt.ylabel("Profit", fontsize = 20)
plt.tight_layout()

## Observation : The plot shows that 'District of Columbia', 'Vermont' and 'Wyoming' have the highest average profit (the bar height in a seaborn barplot is the mean by default). States like 'Texas', 'Pennsylvania', 'Illinois', 'Arizona', 'Oregon', 'Colorado' and 'Ohio' are making losses, so these states need attention.
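
Because `sns.barplot` draws the mean of `y` per category by default, a state with a few very profitable orders can top the chart without contributing the most profit overall. The sketch below contrasts the two views on made-up states `X` and `Y` (hypothetical names and values).

```python
import pandas as pd

# State X: four ordinary orders; State Y: a single very profitable order (made-up values)
toy = pd.DataFrame({
    "State": ["X", "X", "X", "X", "Y"],
    "Profit": [50.0, 60.0, 40.0, 50.0, 120.0],
})

# Mean profit is what the barplot shows; total profit is the overall contribution
by_state = toy.groupby("State")["Profit"].agg(mean_profit="mean", total_profit="sum")
print(by_state)
```

In this toy example `Y` leads on mean profit while `X` leads on total profit, so it can be worth looking at both before ranking states.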

In [None]:
#Category Wise profit in the whole country
ds.iplot(kind = 'bar', x = "Category", y = "Profit", title = "Category VS Profit", xTitle = "Category", yTitle = 'Profit', colors = 'magenta')

In [None]:
#Category Wise profit in the whole country
plt.figure(figsize = [12,8])
ax = sns.barplot(x = "Category", y = "Profit", data = ds, palette = "Set2")

## Observation : The highest profit comes from the 'Technology' category. 'Furniture' is at the bottom of the list.

In [None]:
#Category wise Profit in Each Region
plt.figure(figsize = [12,8])
ax = sns.barplot(x = "Region", y = "Profit", hue = "Category", data = ds, palette = "Set2")

## Observation : 'Technology' generates the highest profit in every region, and 'Furniture' lags in every region except South. Most importantly, 'Furniture' is making a loss in the Central region. These points should be noted and addressed.

In [None]:
#Subcategory wise profit
plt.figure(figsize = [10, 8])
ax = sns.barplot(x = "Sub-Category", y = "Profit", data = ds, palette = 'magma')
plt.xlabel("SubCategory", fontsize = 15)
plt.ylabel("Profit", fontsize = 15)
plt.xticks(rotation = 90)
plt.show()

## Observation : As we can see, 'Copiers' earn a very high profit and 'Accessories' are also doing well, but 'Tables' and 'Bookcases' are making a loss. Steps should be taken to improve the business in these Sub-Categories.

As we saw, Profit is highest for the 'Technology' category, so we explore it further.

In [None]:
#Entries with Category=Technology
ds_tech = ds[(ds['Category'] == "Technology")]

In [None]:
ds_tech.head()

In [None]:
#lets get the Sales of each Sub-Category under Technology
plt.figure(figsize = [12, 8])
ax = sns.barplot(x = "Sub-Category", y = "Sales", data = ds_tech, palette = 'magma')
plt.xlabel("SubCategory", fontsize = 15)
plt.ylabel("Sales", fontsize = 15)
plt.show()

## Observation : Within the 'Technology' category, the 'Copiers' sub-category has the highest sales and 'Accessories' has the lowest. The 'Machines' sub-category is also performing well.

In [None]:
#lets get the Profit of each Sub-Category under Technology
plt.figure(figsize = [12,8])
ax = sns.barplot(x = "Sub-Category", y = "Profit", data = ds_tech, palette = 'viridis')
plt.xlabel("SubCategory")
plt.ylabel("Profit")
plt.show()

## Observation : The plot confirms that 'Copiers' generates the highest profit, as mentioned earlier. 'Machines', however, lags on profit in spite of its good sales shown previously.
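
A gap between sales and profit like this can be quantified as a profit margin, i.e. total Profit divided by total Sales per sub-category. The sketch below uses made-up figures (not taken from the dataset) and a hypothetical `Margin` column to show the calculation.

```python
import pandas as pd

# Made-up orders: Machines sell well but earn little or lose money
toy = pd.DataFrame({
    "Sub-Category": ["Copiers", "Machines", "Machines", "Copiers"],
    "Sales": [1000.0, 900.0, 1100.0, 500.0],
    "Profit": [400.0, 20.0, -60.0, 200.0],
})

# Sum sales and profit per sub-category, then compute the margin
totals = toy.groupby("Sub-Category")[["Sales", "Profit"]].sum()
totals["Margin"] = totals["Profit"] / totals["Sales"]
print(totals)
```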

In [None]:
#lets get the Profit of each Sub-Category under Technology for each Region
plt.figure(figsize = [12,8])
ax = sns.barplot(x = "Sub-Category", y = "Profit", hue = "Region", data = ds_tech, palette = 'viridis')
plt.xlabel("SubCategory")
plt.ylabel("Profit")
plt.show()

## Observation : In terms of Profit, 'Copiers' is ahead in every region. 'Machines' is roughly at a loss in all regions except West. The other two sub-categories perform at an average level in each region.

In [None]:
#Profit of each SubCategory as per the Ship-mode
plt.figure(figsize = [12, 8])
ax = sns.barplot(x = "Sub-Category", y = "Profit", hue = "Ship Mode", data = ds_tech, palette = 'viridis')
plt.xlabel("SubCategory", fontsize = 15)
plt.ylabel("Profit", fontsize = 15)
plt.show()

## Observation : The same picture appears here. 'Copiers' is well ahead of the others, while 'Machines' performs poorly, especially in 'Second Class', where the Consumer count is large, as seen earlier. Further steps should be taken accordingly.

# Conclusion :

### The overall picture shows that large cities such as New York, Los Angeles, Seattle and San Francisco generate the highest Sales as well as Profit. Among Categories, Technology leads in both Sales and Profit. Some useful insights also emerged: in some cities the company gives large discounts but generates very little sales and profit, while in others discounts are exactly 0 yet profits are high. Both cases deserve close attention. If the points above are addressed, the efficiency of the business can be improved.