# The Sparks Foundation - GRIP
## Data Science and Business Analytics

## Exploratory Data Analysis - Retail
##     by Shashank Raghupatro



### Objectives

+ Perform EDA on the dataset "SampleSuperstore"
+ Find out weak areas where we need to work in order to increase profit.
+ What business problems do you encounter on exploring the data? What is your approach to solve them?

### Importing necessary packages

In [None]:
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # plotting

### Download the data

We can use the SampleSuperstore dataset, which has already been uploaded by Aakash Kothare.

We need to load the data from the CSV file into a pandas DataFrame to be able to work on it.

In [None]:
df = pd.read_csv('../input/tsf-datasets/SampleSuperstore.csv')
df.head()

In [None]:
df.info()

In [None]:
df.describe()

### Data Preparation and Cleaning

In [None]:
df.isna().sum()

We can see that no column has null values, so no missing-value handling is needed.
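A null check does not catch exact duplicate rows, which can also distort totals. Below is a minimal sketch of that extra check. It uses a small made-up frame (`toy`, a hypothetical stand-in for `df`) so that it runs on its own; the values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for df; the values are invented for illustration.
toy = pd.DataFrame({
    'Ship Mode': ['Standard Class', 'Standard Class', 'First Class'],
    'Sales': [10.0, 10.0, 25.5],
    'Profit': [2.0, 2.0, -1.5],
})

# Count rows that are exact copies of an earlier row.
n_duplicates = int(toy.duplicated().sum())

# Keep only the first occurrence of each row.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, len(deduplicated))
```

Whether to drop such rows is a judgment call, since identical rows may be genuine repeat orders.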

In [None]:
df.nunique(axis=0)

## Perform Exploratory Analysis

We can analyse the following columns to draw inferences from the data and to understand how to increase sales and profit.

+ Ship Mode
+ Segment
+ State
+ Category
+ Sales
+ Discount
+ Profit

### Ship Mode

In [None]:
df['Ship Mode'].value_counts().plot(kind='pie',figsize=(8,8))
plt.legend()
plt.title('Shipping Modes')

**From the above pie chart, we can infer that most orders are shipped using Standard Class.**

**This suggests that improving delivery times for Standard Class shipping would improve the experience for the largest group of customers, which could in turn increase sales.**
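A pie chart makes the ranking visible but not the exact shares. As a rough sketch, the percentages can be read directly from `value_counts(normalize=True)`. The series below (`ship_mode`) is a hypothetical sample with invented counts, not the real column.

```python
import pandas as pd

# Hypothetical sample of the 'Ship Mode' column; the counts are invented.
ship_mode = pd.Series(
    ['Standard Class'] * 6 + ['Second Class'] * 2 + ['First Class', 'Same Day'],
    name='Ship Mode',
)

# normalize=True returns fractions instead of raw counts; convert to percent.
share = ship_mode.value_counts(normalize=True).mul(100).round(1)
print(share)
```

On the real data the same call would be `df['Ship Mode'].value_counts(normalize=True)`.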

### Segment

In [None]:
df['Segment'].value_counts().plot(kind='pie',figsize=(8,8))
plt.legend()
plt.title('Segment')

**The above pie chart shows that roughly half of all orders come from the Consumer segment.**

**Assuming that the Corporate segment has a higher spending capacity than the average consumer, we should target ads at the Corporate segment to increase the number of purchases it makes and thereby increase overall revenue.**
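The claim about spending capacity is an assumption that the data can confirm or refute. One possible check is to compare the number of order lines, total sales and average sales per order line across segments. The frame `toy` below is hypothetical and its amounts are invented, so the output says nothing about the real dataset.

```python
import pandas as pd

# Hypothetical rows; the amounts are invented for illustration.
toy = pd.DataFrame({
    'Segment': ['Consumer', 'Consumer', 'Corporate', 'Corporate', 'Home Office'],
    'Sales': [100.0, 200.0, 400.0, 200.0, 150.0],
})

# Number of order lines, total sales and average sales per order line.
segment_summary = toy.groupby('Segment')['Sales'].agg(['count', 'sum', 'mean'])
print(segment_summary)
```

If the `mean` column for Corporate is not clearly higher on the real data, the case for targeting that segment is weaker.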

### State wise Profit

In [None]:
df.groupby('State')['Profit'].sum().sort_values(ascending=False)

In [None]:
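# Select 'Profit' before summing so that only that column is aggregated
state_wise_profit = df.groupby('State')['Profit'].sum().sort_values(ascending=False).to_frame()
state_wise_profit.tail(10)  # the ten states with the lowest total profit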

In [None]:
state_wise_profit.plot(kind='bar',figsize=(20,8))
plt.legend()
plt.ylabel('Profit')
plt.title('State-wise Profits')

**From the above chart we can see that some states generate high profit, while others cause a loss for the business.**

In [None]:
state_wise_profit[-10:].plot(kind='bar',figsize=(15,8),color='r')
plt.ylabel('Profit')
plt.title('Worst performing states')

**We should take a serious look at the operations in these 10 worst-performing states, and either eliminate the problem or stop sales in the states that only cause losses to the business.**
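A fixed cut-off of ten states may include states that are still profitable, or miss some that are not. A minimal sketch of selecting by sign instead is shown here. The frame `toy` and its state labels are hypothetical, and the profits are invented.

```python
import pandas as pd

# Hypothetical rows; the state labels and profits are invented.
toy = pd.DataFrame({
    'State': ['State A', 'State A', 'State B', 'State C', 'State D'],
    'Profit': [-50.0, 20.0, -10.0, 80.0, 40.0],
})

profit_by_state = toy.groupby('State')['Profit'].sum()

# Keep only states whose total profit is below zero, worst first.
loss_making = profit_by_state[profit_by_state < 0].sort_values()
print(loss_making)
```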

### Category wise Sales vs Profit

In [None]:
# Total profit and sales per category, summing only the two columns needed
profit_vs_sales = df.groupby('Category')[['Profit', 'Sales']].sum()

In [None]:
fig = plt.figure(figsize=(10,8))
ax1=plt.subplot(1,1,1)
ax2=ax1.twinx()
ax1.bar(profit_vs_sales.index,profit_vs_sales.Sales,width=0.2,label='Sales',alpha=0.7)  # bars show sales, not quantity
ax2.plot(profit_vs_sales.index,profit_vs_sales.Profit,color='g',label='Profit')
ax1.set_ylabel('Total Sales')
ax2.set_ylabel('Profit')
plt.title('Category wise Sales vs Profit')
fig.legend()  # figure-level legend collects the labels from both axes

**We can see that Technology is the most profitable category and also has the highest total sales.**

**Furniture has the highest sales-to-profit ratio, that is, the lowest profit per dollar of sales. We need to either raise the Furniture margin or reduce furniture inventory and invest in Technology inventory instead.**
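Profit per dollar of sales is easier to compare as an explicit margin column than by eye on a dual-axis chart. The sketch below computes it as `Profit / Sales`. The frame `profit_vs_sales_toy` is a hypothetical stand-in for `profit_vs_sales`, and the totals are invented.

```python
import pandas as pd

# Hypothetical category totals; the numbers are invented for illustration.
profit_vs_sales_toy = pd.DataFrame(
    {'Profit': [20.0, 90.0, 150.0], 'Sales': [800.0, 600.0, 750.0]},
    index=['Furniture', 'Office Supplies', 'Technology'],
)

# Profit margin: profit earned per dollar of sales.
margin = (profit_vs_sales_toy['Profit'] / profit_vs_sales_toy['Sales']).sort_values()

# Category with the lowest margin.
weakest = margin.idxmin()
print(margin)
print(weakest)
```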

### Discount

In [None]:
df.groupby('Discount')['Sales'].count().plot(kind='barh',figsize=(15,8))
plt.ylabel('Discount')
plt.xlabel('Number of Products')
plt.grid()
plt.title('Discounts on Products')

**We can observe that most products are sold with no discount, followed by products sold at a 20% discount.**
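The chart above shows how often each discount level occurs, not how it relates to profit. One way to look at that relationship, sketched here under the assumption that average profit per order line is a useful measure, is to aggregate `Profit` by `Discount`. The frame `toy` is hypothetical and its values are invented.

```python
import pandas as pd

# Hypothetical rows; the discounts and profits are invented for illustration.
toy = pd.DataFrame({
    'Discount': [0.0, 0.0, 0.2, 0.2, 0.5],
    'Profit': [30.0, 10.0, 5.0, -1.0, -40.0],
})

# Number of order lines and average profit at each discount level.
discount_summary = toy.groupby('Discount')['Profit'].agg(['count', 'mean'])
print(discount_summary)
```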

### Profit

In [None]:
df.groupby('Sub-Category')['Profit'].sum().plot(kind='barh',figsize=(15,8))
plt.ylabel('Sub-Category')  # barh puts the sub-categories on the y-axis
plt.xlabel('Total Profit')
plt.grid()
plt.title('Profit for each Sub-Category')

**We can see that the sub-categories "Tables", "Supplies", and "Bookcases" make an overall loss, so we should either increase their profit margin or stop selling these products altogether.**

**Copiers generate the most profit, followed by Phones and Accessories.**

# Thank You