## Exploratory Data Analysis - Retail 

### Problem Statement:

#### *  Perform ‘Exploratory Data Analysis’ on dataset ‘SampleSuperstore’ 

#### *  As a business manager, try to find out the weak areas where you can work to make more profit. 

#### *  What business problems can you derive by exploring the data? 


In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')

In [None]:
# Loading Data
data = pd.read_csv('../input/sample-supermarket-dataset/SampleSuperstore.csv')
data.head()

In [None]:
# Structure of Dataset
data.shape

In [None]:
# Checking for Null Values
data.isna().sum().to_frame('Null Values')

In [None]:
# Checking for Duplicates
data.duplicated().sum()

In [None]:
# Removing Duplicates
data = data.drop_duplicates()
data.head()

In [None]:
# Unique values in Country Column
data.Country.unique()

In [None]:
# Dropping Unnecessary columns from the Dataset
data = data.drop(columns=['Postal Code', 'Country'])
data.head()

In [None]:
# Statistics of Data
data.describe()

# DATA EXPLORATION

In [None]:
# Plotting Correlation between Variables (numeric columns only)
plt.figure(figsize=(5,5))
sns.heatmap(data.corr(numeric_only=True), annot=True, cbar=False, annot_kws={'size':14})
plt.show()

In [None]:
# Plotting Overall Sales Summary
summary = pd.DataFrame({'Profit':data.Quantity[data.Profit > 0].sum(), 
                        'No-Profit':data.Quantity[data.Profit == 0].sum(), 
                        'Loss':data.Quantity[data.Profit < 0].sum()},
                       index=['Count']).T
plt.title('Overall Sales Summary', fontsize=20)
summary.Count.plot.pie(autopct='%1.2f%%',figsize=(7,7), label='Percentage', 
                       textprops = {"fontsize":15}, shadow=True, explode=(0.08,0.05,0))
plt.show()

## Overall Profit Analysis

In [None]:
# Plotting Shipmode, Segment, and Region-wise profit
prof_S = data.groupby('Ship Mode').sum(numeric_only=True).sort_values('Profit')
prof_G = data.groupby('Segment').sum(numeric_only=True).sort_values('Profit')
prof_R = data.groupby('Region').sum(numeric_only=True).sort_values('Profit')
fig, ax = plt.subplots(1,3, figsize=(15,6))
ax[0].set_title('SHIP MODE', fontsize=12)
ax[1].set_title('SEGMENT', fontsize=12)
ax[2].set_title('REGION', fontsize=12)
prof_S.Profit.plot.pie(autopct='%1.2f%%', textprops = {"fontsize":12}, shadow=True, ax=ax[0])
prof_G.Profit.plot.pie(autopct='%1.2f%%', textprops = {"fontsize":12}, shadow=True, ax=ax[1])
prof_R.Profit.plot.pie(autopct='%1.2f%%', textprops = {"fontsize":12}, shadow=True, ax=ax[2])
plt.show()

In [None]:
# Total profit per Category, largest first
cat = data.groupby('Category').sum(numeric_only=True).round(2).sort_values('Profit', ascending=False)
plt.title('Overall Profit - By Category', fontsize=14)
cat.Profit.plot.pie(autopct='%1.2f%%', figsize=(7,7), label='Profit Percentage', 
                    textprops = {"fontsize":15}, explode=(0, 0, 0.2), shadow=True)
plt.show()

In [None]:
# Number of items sold (Quantity) per Category and per Sub-Category
cat = data.groupby('Category')['Quantity'].sum().sort_values().to_frame('Count')
sub = data.groupby('Sub-Category')['Quantity'].sum().sort_values().to_frame('Count')
print(f'Total items Sold: {data.Quantity.sum()}')
fig, ax = plt.subplots(1,2, figsize=(15,10))
ax[0].set_title('By Category', fontsize=15)
ax[1].set_title('By Sub-Category', fontsize=15)
cat.Count.plot.pie(autopct='%1.2f%%', label='Percentage', radius=1, shadow=True, ax=ax[0])
sub.Count.plot.pie(autopct='%1.2f%%', label='Percentage', radius=1, shadow=True, ax=ax[1])
plt.show()

In [None]:
fig, ax = plt.subplots(1,2, figsize=(15,7))
ax[0].set_title('Items Sold: By Category', fontsize=15)
ax[1].set_title('Items Sold: By Sub-Category', fontsize=15)
sns.heatmap(cat, ax=ax[0], cbar=False, annot=True, cmap='crest', fmt='2', annot_kws={'size':18})
sns.heatmap(sub, ax=ax[1], annot=True, cbar=False, cmap='crest', fmt='2', annot_kws={'size':14})
plt.show()

In [None]:
# Count profitable, break-even and loss-making orders within one category
def category_summary(category):
    rows = data[data.Category == category]
    counts = pd.Series({'Profit': (rows.Profit > 0).sum(),
                        'No-Profit': (rows.Profit == 0).sum(),
                        'Loss': (rows.Profit < 0).sum()}, name='Percentage')
    return counts.sort_values().to_frame()

summ_offc = category_summary('Office Supplies')
summ_furn = category_summary('Furniture')
summ_tech = category_summary('Technology')
fig, ax = plt.subplots(1,3, figsize=(15,5))
print('Category-wise Summary')
ax[0].set_title('FURNITURE', fontsize=14)
ax[1].set_title('OFFICE SUPPLIES', fontsize=14)
ax[2].set_title('TECHNOLOGY', fontsize=14)
summ_furn.Percentage.plot.pie(autopct='%1.2f%%', radius= 1.2, explode=(0.1, 0, 0), 
                              textprops = {"fontsize":13}, shadow=True, ax=ax[0])
summ_offc.Percentage.plot.pie(autopct='%1.2f%%', radius= 1.2, explode=(0.1, 0, 0), 
                              textprops = {"fontsize":13}, shadow=True, ax=ax[1])
summ_tech.Percentage.plot.pie(autopct='%1.2f%%', radius= 1.2, explode=(0.1, 0, 0), 
                              textprops = {"fontsize":13}, shadow=True, ax=ax[2])
plt.show()

In [None]:
# Sales Summary by Sub-Category
sub = data.groupby('Sub-Category').sum(numeric_only=True).round(2).sort_values('Profit', ascending=False)
plt.figure(figsize=(10,6))
plt.title('Profit - by Sub-Category', fontsize=14)
sns.barplot(x=sub.Profit, y=sub.index)
plt.xticks(rotation=45)
plt.show()

## State-wise Analysis

In [None]:
state_P = data.groupby('State').sum(numeric_only=True).sort_values('Profit', ascending=False)
plt.figure(figsize=(18,7))
plt.title("States' position in overall Sales", fontsize=15)
sns.barplot(x=state_P.index, y=state_P.Profit)
plt.xticks(rotation=75)
plt.show()

## Region-wise Analysis

In [None]:
# Count profitable, break-even and loss-making orders within one region
def region_summary(region):
    rows = data[data.Region == region]
    counts = pd.Series({'Profit': (rows.Profit > 0).sum(),
                        'No-Profit': (rows.Profit == 0).sum(),
                        'Loss': (rows.Profit < 0).sum()}, name='Percentage')
    return counts.sort_values().to_frame()

summ_cent = region_summary('Central')
summ_sou = region_summary('South')
summ_west = region_summary('West')
summ_east = region_summary('East')
fig, ax = plt.subplots(1,4, figsize=(15,5))
print('REGION-WISE SALES SUMMARY')
ax[0].set_title('Central')
ax[1].set_title('South')
ax[2].set_title('West')
ax[3].set_title('East')
summ_cent.Percentage.plot.pie(autopct='%1.2f%%', ax=ax[0], explode=(0,0.1,0), shadow=True)
summ_sou.Percentage.plot.pie(autopct='%1.2f%%', ax=ax[1], explode=(0,0.1,0), shadow=True)
summ_west.Percentage.plot.pie(autopct='%1.2f%%', ax=ax[2], explode=(0,0.1,0), shadow=True)
summ_east.Percentage.plot.pie(autopct='%1.2f%%', ax=ax[3], explode=(0,0.1,0), shadow=True)
plt.show()

In [None]:
# Analysis on Sales Loss
# Keep only the loss-making orders (negative Profit), worst first
loss = data[data.Profit < 0].round(2).sort_values('Profit')

In [None]:
# Sales Loss in each Region
reg_loss = loss.groupby('Region').sum(numeric_only=True)
sns.barplot(x=reg_loss.index, y=reg_loss.Profit)
plt.show()

In [None]:
print(f'States with Sales Loss in each Region\n{"-"*37}')
print(f'Central\t:{loss[loss.Region == "Central"].State.unique().tolist()}')
print(f'South\t:{loss[loss.Region == "South"].State.unique().tolist()}')
print(f'West\t:{loss[loss.Region == "West"].State.unique().tolist()}')
print(f'East\t:{loss[loss.Region == "East"].State.unique().tolist()}')

## Sales Analysis - FURNITURE

In [None]:
# Total profit of each Furniture sub-category, largest first
furn = (data[data.Category == 'Furniture']
        .groupby('Sub-Category')[['Profit']].sum()
        .sort_values('Profit', ascending=False))
plt.figure(figsize=(10,4))
plt.title('Profit - Furniture')
sns.barplot(x=furn.Profit, y=furn.index)
plt.show()

### Chairs & Furnishings 

In [None]:
fc = data[(data['Sub-Category'] == 'Chairs') | (data['Sub-Category'] == 'Furnishings')].sort_values('Profit')
plt.figure(figsize=(15,5))
plt.title('Overall Profit - Chairs & Furnishings', fontsize=15)
sns.barplot(x=fc['State'], y=fc.Profit)
plt.xticks(rotation=75)
plt.show()

### Tables & Bookcases

In [None]:
tb = data[(data['Sub-Category'] == 'Bookcases') | (data['Sub-Category'] == 'Tables')].sort_values('Profit')
plt.figure(figsize=(15,5))
plt.title('Overall Profit - Tables & Bookcases', fontsize=15)
sns.barplot(x=tb['State'], y=tb.Profit)
plt.xticks(rotation=75)
plt.show()

## Texas-Illinois Sales Analysis

In [None]:
# Plotting Central Region Sales Loss
central = loss[loss.Region == 'Central']
plt.figure(figsize=(15,7))
plt.title('Total Sales - Texas & Illinois', fontsize=14)
sns.barplot(x=central['Sub-Category'], y=central['Quantity'], hue=central['State'])
plt.xticks(rotation=45)
plt.show()

In [None]:
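# Select columns by name rather than position (assumes the standard SampleSuperstore column layout)
tex_ill = data.loc[data.State.isin(['Texas', 'Illinois']),
                   ['City', 'State', 'Category', 'Sub-Category', 'Discount', 'Profit']].sort_values('State')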

In [None]:
tex_f = tex_ill[tex_ill.Category == 'Furniture']
tex_o = tex_ill[tex_ill.Category == 'Office Supplies']
tex_t = tex_ill[tex_ill.Category == 'Technology']
print('TEXAS & ILLINOIS - Sales Analysis by Category')
fig, ax = plt.subplots(1,3, figsize=(15,7))
ax[0].set_title('FURNITURE', fontsize=12)
ax[1].set_title('OFFICE SUPPLIES', fontsize=12)
ax[2].set_title('TECHNOLOGY', fontsize=12)
sns.barplot(x=tex_f.Profit, y=tex_f.State, hue=tex_f['Sub-Category'], ax=ax[0])
sns.barplot(x=tex_o.Profit, y=tex_o.State, hue=tex_o['Sub-Category'], ax=ax[1])
sns.barplot(x=tex_t.Profit, y=tex_t.State, hue=tex_t['Sub-Category'], ax=ax[2])
plt.show()

In [None]:
# Net profit of Texas & Illinois, and what it would be without the loss-making orders
texas_profit = tex_ill.Profit.sum()
texas_loss = tex_ill.Profit[tex_ill.Profit < 0].sum()
print(f'Sales Profit of Texas & Illinois: {np.round(texas_profit, decimals=2)}')
print(f'Profit if loss-making items were avoided: {np.round(texas_profit - texas_loss, decimals=2)}')

## Quantity, Discount & Profit

In [None]:
# Pairplot showing dependency of variables
sns.pairplot(data=data[['Quantity', 'Discount', 'Profit']], kind='reg')
plt.show()

In [None]:
# Analysis on Discount, Quantity & Profit
fig, ax = plt.subplots(1,2, figsize=(15,5))
ax[0].set_title('Discount vs Profit', fontsize=15)
ax[1].set_title('Quantity vs Profit', fontsize=15)
sns.lineplot(x=data.Discount, y=data.Profit, color='green', label='Profit Change', ax=ax[0])
sns.lineplot(x=data.Quantity, y=data.Profit, color='red', label='Profit Change', ax=ax[1])
plt.show()

In [None]:
# Plotting Profit change with Discount 
plt.figure(figsize=(10,7))
sns.scatterplot(x=data.Discount, y=data.Profit, hue=data.Profit, s=100)
plt.show()

### Findings:

* Profit is in a healthy range when the discount is small or zero.

* As the discount increases, the sales loss increases, and vice versa.

* The Central region faces more sales loss than the other regions.

* Texas & Illinois are the states where overall sales run at a loss, particularly for Furniture.

* Supplying Furniture results in high losses, especially Tables & Bookcases.

* Texas & Illinois also make a loss on some Office Supplies: Binders, Appliances & Storage (plus Supplies in Texas).

* All states except Pennsylvania, Texas & Illinois make a profit on sales of Chairs & Furnishings.
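
The discount-related findings above are read off the plots. As a rough numeric cross-check, the sketch below tabulates, for each discount level, the number of orders, the mean profit and the share of loss-making orders. The helper name `profit_by_discount` and the `toy` frame are assumptions made for illustration; the toy values are invented and are not taken from SampleSuperstore.

```python
import pandas as pd

def profit_by_discount(df: pd.DataFrame) -> pd.DataFrame:
    """Summarise Profit for each distinct Discount level (hypothetical helper)."""
    return (
        df.assign(is_loss=df["Profit"] < 0)        # flag loss-making rows
          .groupby("Discount")
          .agg(orders=("Profit", "size"),           # rows per discount level
               mean_profit=("Profit", "mean"),      # average profit per row
               loss_share=("is_loss", "mean"))      # fraction of rows with Profit < 0
          .sort_index()
    )

# Toy rows, invented purely to show the output shape (not real Superstore data).
toy = pd.DataFrame({
    "Discount": [0.0, 0.0, 0.2, 0.2, 0.5, 0.5],
    "Profit":   [10.0, 30.0, 5.0, -5.0, -20.0, -40.0],
})
print(profit_by_discount(toy))
```

In the notebook, the same helper could be called as `profit_by_discount(data)`, assuming `data` still has its `Discount` and `Profit` columns.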

## Conclusion

* Products should be sold with a low discount or no discount to be most profitable.

* Minimize the supply of Furniture (Tables & Bookcases) and of items in other categories that result in a loss.

* Texas & Illinois should drop the supply of Furniture; items in Technology (especially Copiers) will enhance their profit.
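
One way to turn these recommendations into a concrete list is to total the profit per state and sub-category and keep only the negative totals. The sketch below does that for a chosen set of states. The function name `loss_making_subcategories` and the `toy` rows are assumptions for illustration only; the values are invented and say nothing about the real dataset.

```python
import pandas as pd

def loss_making_subcategories(df: pd.DataFrame, states: list) -> pd.Series:
    """Total Profit per (State, Sub-Category), keeping only net losses (hypothetical helper)."""
    subset = df[df["State"].isin(states)]
    totals = subset.groupby(["State", "Sub-Category"])["Profit"].sum()
    # Keep negative totals only, worst loss first
    return totals[totals < 0].sort_values()

# Toy rows, invented for illustration (not real Superstore data).
toy = pd.DataFrame({
    "State":        ["Texas", "Texas", "Texas", "Illinois", "Illinois", "Ohio"],
    "Sub-Category": ["Tables", "Tables", "Copiers", "Binders", "Copiers", "Tables"],
    "Profit":       [-50.0, 10.0, 80.0, -30.0, 40.0, -99.0],
})
print(loss_making_subcategories(toy, ["Texas", "Illinois"]))
```

In the notebook this could be run as `loss_making_subcategories(data, ['Texas', 'Illinois'])` to see which sub-categories are net loss-makers in those two states.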

# Thank You!