# Exploratory Data Analysis Using Azure Blob Storage

This notebook performs exploratory data analysis (EDA) on CSV files stored in the Azure Blob Storage account `globalmartmlsa` under the `source` container. The analysis covers sales, inventory, customer behavior, and competitor data, and demonstrates how to connect to Azure Blob Storage using access keys.

## 1. Import Required Libraries

Import the necessary libraries for data handling, visualization, and Azure Blob Storage access.

In [None]:
# Import Required Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from azure.storage.blob import BlobServiceClient
import io

# Set visualization style
sns.set_theme(style='whitegrid')  # set_theme is the current name; sns.set is an alias for it

## 2. Connect to Azure Blob Storage and Load Data

Set up the connection to Azure Blob Storage using the storage account name and access key. Define a function to load CSV files from the `source` container and load the required datasets.

In [None]:
# Azure Blob Storage credentials
account_name = "globalmartmlsa"
account_key = "<YOUR_ACCESS_KEY>"  # Replace with your actual access key
container_name = "source"

# Set up BlobServiceClient
blob_service_client = BlobServiceClient(
    f"https://{account_name}.blob.core.windows.net",
    credential=account_key
)
container_client = blob_service_client.get_container_client(container_name)

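def load_csv_from_blob(blob_name):
    """Download a blob from the container and parse it as CSV into a DataFrame."""
    blob_client = container_client.get_blob_client(blob_name)
    # readall() returns the blob content as bytes; wrap it so pandas can read it
    blob_bytes = blob_client.download_blob().readall()
    return pd.read_csv(io.BytesIO(blob_bytes))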

# Load datasets
sales_data = load_csv_from_blob("sales_data_dictionary.csv")
inventory_data = load_csv_from_blob("inventory_data_dictionary.csv")
customer_behavior_data = load_csv_from_blob("customer_behavior_data_dictionary.csv")
competitor_data = load_csv_from_blob("competitor_data_dictionary.csv")
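
Keeping the access key as a literal in the notebook makes it easy to leak through version control. One option is to read it from the environment instead. The following is a minimal sketch, assuming the key has been exported beforehand; the variable name `AZURE_STORAGE_KEY` and the helper `get_account_key` are illustrative choices and not part of the original notebook.

```python
import os

def get_account_key(env_var="AZURE_STORAGE_KEY"):
    """Return the storage access key from the environment, or raise a clear error.

    The default variable name is an assumption; use whatever name your setup exports.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(
            f"Environment variable {env_var} is not set; "
            "export the storage account access key before running the notebook."
        )
    return key
```

With this in place, `account_key = get_account_key()` could replace the `"<YOUR_ACCESS_KEY>"` placeholder in the cell above.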

## 3. Display First Few Rows of Each Dataset

Inspect the first few rows of each loaded DataFrame to understand the data structure.

In [None]:
print("Sales Data:")
display(sales_data.head())
print("Inventory Data:")
display(inventory_data.head())
print("Customer Behavior Data:")
display(customer_behavior_data.head())
print("Competitor Data:")
display(competitor_data.head())
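
Alongside `head()`, a compact size overview can help when comparing several datasets at once. The sketch below uses small toy frames in place of the loaded data; the helper `dataset_overview` and the `StockLevel` column are hypothetical names used only for illustration.

```python
import pandas as pd

def dataset_overview(frames):
    """Build a one-row-per-dataset table with row and column counts."""
    records = [
        {"dataset": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in frames.items()
    ]
    return pd.DataFrame(records).set_index("dataset")

# Toy frames standing in for the loaded datasets (column names are assumptions)
demo_frames = {
    "sales": pd.DataFrame({
        "Date": ["2024-01-01", "2024-01-02"],
        "SellingPrice": [9.5, 10.0],
    }),
    "inventory": pd.DataFrame({"Date": ["2024-01-01"], "StockLevel": [120]}),
}
print(dataset_overview(demo_frames))
```

In the notebook, the same helper could be called with a dictionary such as `{"sales": sales_data, "inventory": inventory_data}`.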

## 4. Summary Statistics for Each Dataset

Generate summary statistics for each dataset with `describe(include='all')`, which reports on non-numeric columns (count, unique, top, freq) as well as numeric ones.

In [None]:
print("Sales Data Summary Statistics:")
display(sales_data.describe(include='all'))
print("Inventory Data Summary Statistics:")
display(inventory_data.describe(include='all'))
print("Customer Behavior Data Summary Statistics:")
display(customer_behavior_data.describe(include='all'))
print("Competitor Data Summary Statistics:")
display(competitor_data.describe(include='all'))

## 5. Check for Missing Values

Check and print the number of missing values in each column for all datasets.

In [None]:
print("Sales Data missing values:")
print(sales_data.isnull().sum())
print("\nInventory Data missing values:")
print(inventory_data.isnull().sum())
print("\nCustomer Behavior Data missing values:")
print(customer_behavior_data.isnull().sum())
print("\nCompetitor Data missing values:")
print(competitor_data.isnull().sum())
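
Raw counts are hard to compare across datasets of different sizes, so it can help to report the share of missing values as well. The following is a minimal sketch on a toy frame; the helper name `missing_summary` and the demo columns are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return per-column missing counts and percentages, most affected columns first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Toy data standing in for one of the loaded datasets
demo = pd.DataFrame({
    "SellingPrice": [9.5, np.nan, 10.0, np.nan],
    "Units": [3, 4, np.nan, 5],
    "Date": ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"],
})
print(missing_summary(demo))
```

Columns without missing values are dropped from the result, which keeps the output short for wide tables.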

## 6. Visualize Selling Price Distribution

Plot a histogram of the `SellingPrice` column from the sales data.

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(sales_data['SellingPrice'], bins=30, kde=True)
plt.title('Selling Price Distribution of Tide Products')
plt.xlabel('Selling Price')
plt.ylabel('Frequency')
plt.show()
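
A histogram shows the shape of the distribution, and a few numbers can back up what the plot suggests. The sketch below computes quartiles, skewness, and a count of outliers using the common 1.5 × IQR rule. The helper `distribution_summary` and the toy prices are illustrative assumptions, and the IQR rule is one convention among several.

```python
import pandas as pd

def distribution_summary(series):
    """Summarize a numeric column: quartiles, skewness, and IQR-based outlier count."""
    values = series.dropna()
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    # Points beyond 1.5 * IQR from the quartiles are flagged as outliers
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = values[(values < lower) | (values > upper)]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "skew": values.skew(),
        "n_outliers": int(outliers.count()),
    }

# Toy prices with one unusually high value
demo_prices = pd.Series([9.0, 9.5, 10.0, 10.5, 11.0, 30.0])
print(distribution_summary(demo_prices))
```

In the notebook this could be applied as `distribution_summary(sales_data['SellingPrice'])`.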

## 7. Correlation Heatmap for Sales Data

Generate a correlation heatmap for numeric columns in the sales data.

In [None]:
plt.figure(figsize=(8, 6))
corr = sales_data.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap - Sales Data')
plt.show()
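
When the heatmap has many cells, the strongest relationships can be easier to read as a ranked list. The following sketch extracts the column pairs with the largest absolute Pearson correlation. The helper `top_correlations` and the demo columns (`UnitsSold`, `Discount`, `Region`) are hypothetical and serve only as an example.

```python
import numpy as np
import pandas as pd

def top_correlations(df, n=5):
    """Return the n column pairs with the largest absolute Pearson correlation."""
    corr = df.select_dtypes(include=[np.number]).corr()
    # Keep the strict upper triangle so each pair appears once and the diagonal is dropped
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    # Rank by magnitude so strong negative correlations are not buried
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False).head(n)

# Toy data: price and units move in opposite directions
demo = pd.DataFrame({
    "SellingPrice": [1, 2, 3, 4, 5],
    "UnitsSold": [10, 8, 6, 4, 2],
    "Discount": [0, 1, 0, 1, 0],
    "Region": list("abcde"),  # non-numeric, ignored by select_dtypes
})
print(top_correlations(demo, n=3))
```

The result is a Series indexed by column pairs, so `top_correlations(sales_data)` could be displayed directly under the heatmap.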

## 8. Standardize and Merge Datasets

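Standardize the date column name across datasets (renaming `date` to `Date` where present), then left-merge the competitor, customer behavior, and inventory data onto the sales data using `Date` as the key.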

In [None]:
# Standardize date columns for merging
for df in [sales_data, inventory_data, customer_behavior_data, competitor_data]:
    if 'date' in df.columns:
        df.rename(columns={'date': 'Date'}, inplace=True)

# Left-merge on 'Date' so that every sales row is kept
final_df = (
    sales_data
    .merge(competitor_data, on='Date', how='left')
    .merge(customer_behavior_data, on='Date', how='left')
    .merge(inventory_data, on='Date', how='left')
)
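
A left merge on `Date` multiplies rows whenever the right-hand table has more than one row per date, and date strings in different formats will silently fail to match. The sketch below shows one way to guard against both, assuming each right-hand table is expected to hold a single row per date. The helper `safe_left_merge`, the suffix scheme, and the `CompetitorPrice` column are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

def safe_left_merge(left, right, key="Date", suffix="_right"):
    """Left-merge `right` onto `left`, failing fast if `key` is duplicated in `right`."""
    left = left.copy()
    right = right.copy()
    # Parse the key to datetime on both sides so equal dates compare equal
    left[key] = pd.to_datetime(left[key])
    right[key] = pd.to_datetime(right[key])
    # validate='many_to_one' raises pandas.errors.MergeError if `right` repeats a key
    return left.merge(
        right, on=key, how="left",
        suffixes=("", suffix), validate="many_to_one",
    )

# Toy frames standing in for the sales and competitor datasets
sales = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-01", "2024-01-02"],
    "SellingPrice": [9.5, 9.7, 10.0],
})
competitor = pd.DataFrame({
    "Date": ["2024-01-01", "2024-01-02"],
    "CompetitorPrice": [9.0, 9.9],
})
merged = safe_left_merge(sales, competitor, suffix="_competitor")
print(merged)
```

If the check fails, the right-hand table likely needs to be aggregated to one row per date (or merged on additional keys) before joining. Explicit suffixes also make it clearer which dataset an overlapping column came from than the default `_x` and `_y`.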

## 9. Summary Statistics and Missing Values for Merged Data

Display summary statistics and check for missing values in the merged DataFrame.

In [None]:
print("Merged DataFrame Summary Statistics:")
display(final_df.describe(include='all'))

print("Merged DataFrame missing values:")
print(final_df.isnull().sum())

## 10. Visualize Selling Price Distribution in Merged Data

Plot a histogram of the `SellingPrice` column from the merged DataFrame.

In [None]:
if 'SellingPrice' in final_df.columns:
    plt.figure(figsize=(10, 6))
    sns.histplot(final_df['SellingPrice'].dropna(), bins=30, kde=True)
    plt.title('Selling Price Distribution (Merged Data)')
    plt.xlabel('Selling Price')
    plt.ylabel('Frequency')
    plt.show()

## 11. Correlation Heatmap for Merged Data

Generate a correlation heatmap for numeric columns in the merged DataFrame.

In [None]:
plt.figure(figsize=(12, 8))
corr_matrix = final_df.select_dtypes(include=[np.number]).corr()
sns.heatmap(corr_matrix, annot=True, fmt='.2f', cmap='coolwarm')
plt.title('Correlation Heatmap (Merged Data)')
plt.show()

---

This concludes the exploratory data analysis using data loaded directly from Azure Blob Storage. The next steps may involve feature engineering and model development based on these insights.