# Exploratory Data Analysis (EDA) on the Superstore Dataset
---
This notebook follows a structured 7-step approach:
1. Business Understanding
2. Importing & Inspecting Data
3. Handling Missing Data
4. Exploring Data Characteristics
5. Visualizing Relationships
6. Handling Outliers
7. Communicating Findings

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

# Load dataset
df = pd.read_csv('Superstore.csv', encoding='latin1')
print('Dataset shape:', df.shape)
df.head()

In [None]:
# Convert date columns
for col in ['Order Date', 'Ship Date']:
    if col in df.columns:
        df[col] = pd.to_datetime(df[col], errors='coerce')

# Missing data summary
missing = df.isnull().sum()
print(missing[missing > 0])

In [None]:
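# Impute missing data: median for numeric columns, mode for everything else
num_cols = df.select_dtypes(include=np.number).columns
for col in num_cols:
    # Assign the result back: fillna(inplace=True) on a column selection is
    # deprecated and may leave df unchanged in recent pandas versions
    df[col] = df[col].fillna(df[col].median())
# Note: exclude=np.number also covers the datetime columns, so dates that
# failed to parse (NaT) are filled with the most common date
cat_cols = df.select_dtypes(exclude=np.number).columns
for col in cat_cols:
    modes = df[col].mode()
    if not modes.empty:  # mode() is empty when the whole column is missing
        df[col] = df[col].fillna(modes.iloc[0])
print('Missing data imputed.')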

In [None]:
# Descriptive statistics
desc = df[num_cols].describe().T
desc['skewness'] = df[num_cols].skew()
desc['kurtosis'] = df[num_cols].kurtosis()
desc

In [None]:
# Histograms
df[num_cols].hist(bins=30, figsize=(12,8))
plt.suptitle('Histograms of Numeric Features')
plt.show()

In [None]:
# Correlation heatmap
plt.figure(figsize=(8,6))
sns.heatmap(df[num_cols].corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Outlier detection using IQR
def outlier_summary(col):
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower = Q1 - 1.5*IQR
    upper = Q3 + 1.5*IQR
    return len(df[(df[col] < lower) | (df[col] > upper)])

for col in num_cols:
    print(col, 'Outliers:', outlier_summary(col))
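
The cell above only counts IQR outliers; step 6 of the outline is about handling them. One possible option, sketched below, is to cap values at the same 1.5×IQR fences and keep the result in a separate copy so the raw values stay available. The helper names `iqr_bounds` and `cap_outliers` and the toy values are assumptions made for illustration; they are not part of the original notebook or of pandas.

```python
import pandas as pd


def iqr_bounds(series, k=1.5):
    """Return the (lower, upper) IQR fences for a numeric Series."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def cap_outliers(frame, cols, k=1.5):
    """Return a copy of `frame` with each column in `cols` clipped to its fences."""
    capped = frame.copy()
    for col in cols:
        lower, upper = iqr_bounds(capped[col], k)
        capped[col] = capped[col].clip(lower=lower, upper=upper)
    return capped


# Toy values for illustration only (not taken from Superstore.csv)
toy = pd.DataFrame({'Sales': [10.0, 12.0, 11.0, 13.0, 12.0, 500.0]})
toy_capped = cap_outliers(toy, ['Sales'])
print(toy_capped)
```

In the notebook this could be called as `df_capped = cap_outliers(df, ['Sales', 'Profit'])`, assuming those columns exist. Because large transactions in this kind of data are often genuine, capping may be more appropriate for plots and robust summaries than for totals, which is why the sketch leaves `df` itself unchanged.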

### Key Insights
- Sales and Profit are positively correlated, but many transactions yield low or negative profit.
- Higher discounts are associated with markedly lower profit.
- A few categories and regions account for most of the profit.
- The numeric features are skewed, and the outliers reflect real large transactions rather than data errors.
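
The category, region, and discount insights above are not computed directly by the earlier cells, so a grouped summary is one way to check them. The sketch below assumes the dataset has `Sales`, `Profit`, and `Discount` columns plus a grouping column such as `Category` or `Region`; the helper names and the discount band edges are hypothetical choices, not something taken from the notebook.

```python
import pandas as pd


def profit_by_group(frame, key):
    """Total sales, total profit, and profit margin per value of `key`."""
    summary = frame.groupby(key)[['Sales', 'Profit']].sum()
    summary['Margin'] = summary['Profit'] / summary['Sales']
    return summary.sort_values('Profit', ascending=False)


def profit_by_discount_band(frame, edges=(0.0, 0.1, 0.2, 0.3, 1.0)):
    """Mean profit and share of loss-making rows per discount band.

    The band edges are an arbitrary illustrative choice.
    """
    bands = pd.cut(frame['Discount'], bins=list(edges), include_lowest=True)
    grouped = frame.groupby(bands, observed=True)['Profit']
    return pd.DataFrame({
        'mean_profit': grouped.mean(),
        'loss_share': grouped.apply(lambda p: (p < 0).mean()),
        'rows': grouped.size(),
    })
```

With the notebook's `df`, `profit_by_group(df, 'Category')` and `profit_by_group(df, 'Region')` would show how profit is distributed across groups, and `profit_by_discount_band(df)` would show whether mean profit falls and the share of loss-making rows rises as discounts increase.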