# Amazon Sales Analytics: Business Context

## Welcome to Your Role as a Senior Data Scientist at Amazon!

In this notebook, you'll learn about the business context and your role in Amazon's sales analytics team. This is the foundation for understanding how machine learning models add value to real business operations.

### Learning Objectives
- Understand Amazon's sales process and key metrics
- Learn about business KPIs and stakeholder communication
- Generate and explore sample sales data
- Perform initial exploratory data analysis

## 1. Business Context: Amazon Sales Analytics

### Your Role
As a Senior Data Scientist in Amazon's sales analytics team, you are responsible for:

- **Revenue Forecasting**: Predicting daily, weekly, and monthly sales
- **Demand Planning**: Optimizing inventory levels across warehouses
- **Marketing ROI**: Measuring the effectiveness of marketing campaigns
- **Pricing Strategy**: Understanding price elasticity and optimal pricing
- **Performance Monitoring**: Detecting when models need retraining

### Key Business Metrics
- **Revenue**: Total sales value and growth trends
- **Conversion Rate**: Percentage of visitors who make purchases
- **Average Order Value (AOV)**: Revenue per transaction
- **Customer Lifetime Value (CLV)**: Long-term customer value
- **Marketing ROI**: Return on marketing investment
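
To make these definitions concrete, here is a minimal sketch that computes three of the metrics from made-up numbers. The figures and variable names are hypothetical and are not taken from real Amazon data. ROI is computed here as net return over spend, which is one common definition, and it assumes (unrealistically) that all revenue is attributable to marketing. CLV is left out because it needs customer-level history.

```python
# Hypothetical one-day figures, chosen only to illustrate the formulas
visitors = 120_000          # site visitors
orders = 3_600              # completed purchases
revenue = 252_000.0         # total sales value ($)
marketing_spend = 60_000.0  # marketing cost ($)

conversion_rate = orders / visitors      # share of visitors who purchase
average_order_value = revenue / orders   # revenue per transaction
# Net return per marketing dollar (assumes all revenue is driven by marketing)
marketing_roi = (revenue - marketing_spend) / marketing_spend

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"AOV: ${average_order_value:,.2f}")
print(f"Marketing ROI: {marketing_roi:.2f}x")
```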

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

print("Libraries imported successfully!")

## 2. Sample Data Generation

Let's generate synthetic sales data with realistic patterns (seasonality, weekend effects, and holiday spikes) to work with throughout this course.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Generate 1000 days of sales data
n_days = 1000
dates = pd.date_range(start='2021-01-01', periods=n_days, freq='D')

# Create realistic sales features
data = {
    'date': dates,
    'marketing_spend': np.random.uniform(10000, 100000, n_days),
    'website_traffic': np.random.uniform(50000, 200000, n_days),
    'avg_product_price': np.random.uniform(20, 200, n_days),
    'customer_reviews': np.random.uniform(3.5, 5.0, n_days),
    'inventory_level': np.random.uniform(0.3, 1.0, n_days),
    'competitor_price_ratio': np.random.uniform(0.8, 1.2, n_days)
}

# Add seasonal patterns
day_of_year = np.array([d.timetuple().tm_yday for d in dates])
seasonal_factor = np.sin(2 * np.pi * day_of_year / 365) * 0.3

# Add weekly patterns (higher sales on weekends; weekday() returns 5 and 6 for Saturday and Sunday)
weekday = np.array([d.weekday() for d in dates])
weekend_boost = np.where(weekday >= 5, 0.2, 0)

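# Add holiday effects: a 0.5 boost from 2 days before to 2 days after each listed date.
# Only these six dates are boosted; other holidays in the date range (e.g. 2023-07-04) are not.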
holiday_dates = ['2021-11-26', '2021-12-25', '2022-01-01', '2022-07-04', '2022-11-25', '2022-12-25']
holiday_boost = np.zeros(n_days)
for holiday in holiday_dates:
    holiday_date = pd.to_datetime(holiday)
    if holiday_date in dates:
        idx = dates.get_loc(holiday_date)
        holiday_boost[max(0, idx-2):min(n_days, idx+3)] = 0.5

# Generate revenue with realistic relationships
revenue = (
    50000 +                                    # Base revenue
    0.8 * data['marketing_spend'] +            # Marketing impact
    0.3 * data['website_traffic'] +            # Traffic impact
    100 * data['avg_product_price'] +          # Price impact
    20000 * data['customer_reviews'] +         # Reviews impact
    15000 * data['inventory_level'] +          # Inventory impact
    -10000 * data['competitor_price_ratio'] +  # Competition impact
    30000 * seasonal_factor +                  # Seasonal impact
    20000 * weekend_boost +                    # Weekend boost
    40000 * holiday_boost +                    # Holiday boost
    np.random.normal(0, 8000, n_days)         # Random noise
)

# Create DataFrame
data['revenue'] = revenue
sales_df = pd.DataFrame(data)

print(f"Generated {len(sales_df)} days of sales data")
print(f"Date range: {sales_df['date'].min()} to {sales_df['date'].max()}")
print(f"Revenue range: ${sales_df['revenue'].min():,.0f} to ${sales_df['revenue'].max():,.0f}")
print(f"Average daily revenue: ${sales_df['revenue'].mean():,.0f}")
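
It helps to know where the seasonal term used above actually peaks before reading the monthly plots. The sketch below evaluates the same sine expression over one non-leap year and reports the dates of its maximum and minimum. The names `days`, `peak_day`, and `trough_day` are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

# Same seasonal expression as the generator above, evaluated over one non-leap year
days = pd.date_range(start='2021-01-01', end='2021-12-31', freq='D')
day_of_year = days.dayofyear.to_numpy()
seasonal_factor = np.sin(2 * np.pi * day_of_year / 365) * 0.3

peak_day = days[seasonal_factor.argmax()]
trough_day = days[seasonal_factor.argmin()]
print(f"Seasonal peak:   {peak_day:%B %d} (factor {seasonal_factor.max():+.2f})")
print(f"Seasonal trough: {trough_day:%B %d} (factor {seasonal_factor.min():+.2f})")
```

The sine term peaks in early April and bottoms out in early October, so in this synthetic data any year-end lift comes from the separate holiday boost rather than from the seasonal factor.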

## 3. Exploratory Data Analysis

Let's explore our sales data to understand patterns and relationships.

In [None]:
# Display basic information about the dataset
print("Dataset Shape:", sales_df.shape)
print("\nColumn Information:")
sales_df.info()  # info() prints directly and returns None, so it is not wrapped in print()
print("\nFirst few rows:")
sales_df.head()

In [None]:
# Statistical summary
print("Statistical Summary:")
sales_df.describe()

In [None]:
# Revenue trends over time
plt.figure(figsize=(15, 8))

# Revenue trend
plt.subplot(2, 2, 1)
plt.plot(sales_df['date'], sales_df['revenue'], alpha=0.7)
plt.title('Daily Revenue Trend')
plt.xlabel('Date')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)

# Monthly revenue
monthly_revenue = sales_df.set_index('date').resample('M')['revenue'].sum()
plt.subplot(2, 2, 2)
plt.plot(monthly_revenue.index, monthly_revenue.values, marker='o')
plt.title('Monthly Revenue')
plt.xlabel('Month')
plt.ylabel('Revenue ($)')
plt.xticks(rotation=45)

# Revenue distribution
plt.subplot(2, 2, 3)
plt.hist(sales_df['revenue'], bins=50, alpha=0.7)
plt.title('Revenue Distribution')
plt.xlabel('Revenue ($)')
plt.ylabel('Frequency')

# Box plot by day of week
sales_df['day_of_week'] = sales_df['date'].dt.day_name()
plt.subplot(2, 2, 4)
day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
sns.boxplot(data=sales_df, x='day_of_week', y='revenue', order=day_order)
plt.title('Revenue by Day of Week')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()
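
One caveat when reading the monthly totals: the 1000-day range starting on 2021-01-01 ends on 2023-09-27, so the final month is incomplete and its sum will look artificially low. The sketch below shows one way to keep only complete months before comparing totals. It uses a stand-in frame called `demo` with a constant placeholder revenue, so it runs on its own; in the notebook the same steps would apply to `sales_df`.

```python
import numpy as np
import pandas as pd

# Stand-in frame with the same 1000-day date range; revenue is a constant placeholder
dates = pd.date_range(start='2021-01-01', periods=1000, freq='D')
demo = pd.DataFrame({'date': dates, 'revenue': 100.0})

# Group by calendar month and count how many days of data each month has
month = demo['date'].dt.to_period('M')
monthly = demo.groupby(month)['revenue'].agg(['sum', 'count'])
monthly['days_in_month'] = np.asarray(monthly.index.days_in_month)

# Keep only months where every calendar day is present
is_complete = monthly['count'] == monthly['days_in_month']
complete_months = monthly[is_complete]

print(f"All months: {len(monthly)}, complete months: {len(complete_months)}")
print("Incomplete:", [str(p) for p in monthly.index[~is_complete]])
```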

## 4. Correlation Analysis

Let's examine relationships between different variables and revenue.

In [None]:
# Correlation matrix
correlation_cols = ['marketing_spend', 'website_traffic', 'avg_product_price', 
                   'customer_reviews', 'inventory_level', 'competitor_price_ratio', 'revenue']
correlation_matrix = sales_df[correlation_cols].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0, 
            square=True, linewidths=0.5)
plt.title('Correlation Matrix: Sales Variables')
plt.show()

# Print correlation with revenue
print("Correlation with Revenue:")
revenue_corr = correlation_matrix['revenue'].drop('revenue').sort_values(key=abs, ascending=False)
for var, corr in revenue_corr.items():
    print(f"{var:<25}: {corr:>6.3f}")
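
A correlation is not the same thing as a coefficient in the revenue formula. When the drivers are roughly independent, the correlation of a driver with revenue is approximately its coefficient times std(x) / std(y), so a driver with a small coefficient but a wide range can still correlate strongly. The sketch below checks this on a hypothetical two-driver model (`toy`) that only loosely mirrors the generator; it is an illustration under that independence assumption, not a reproduction of the full dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical two-driver model with independent uniform inputs
toy = pd.DataFrame({
    'marketing_spend': rng.uniform(10_000, 100_000, n),
    'website_traffic': rng.uniform(50_000, 200_000, n),
})
coefs = {'marketing_spend': 0.8, 'website_traffic': 0.3}
toy['revenue'] = (
    coefs['marketing_spend'] * toy['marketing_spend']
    + coefs['website_traffic'] * toy['website_traffic']
    + rng.normal(0, 8_000, n)  # random noise
)

# Compare the observed correlation with coef * std(x) / std(y)
for col, coef in coefs.items():
    observed = toy[col].corr(toy['revenue'])
    predicted = coef * toy[col].std() / toy['revenue'].std()
    print(f"{col:<16} observed r = {observed:.3f}, coef * std(x) / std(y) = {predicted:.3f}")
```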

In [None]:
# Scatter plots for key relationships
fig, axes = plt.subplots(2, 3, figsize=(18, 12))
axes = axes.flatten()

key_vars = ['marketing_spend', 'website_traffic', 'avg_product_price', 
           'customer_reviews', 'inventory_level', 'competitor_price_ratio']

for i, var in enumerate(key_vars):
    axes[i].scatter(sales_df[var], sales_df['revenue'], alpha=0.5)
    axes[i].set_xlabel(var.replace('_', ' ').title())
    axes[i].set_ylabel('Revenue ($)')
    axes[i].set_title(f'Revenue vs {var.replace("_", " ").title()}')
    
    # Add trend line
    z = np.polyfit(sales_df[var], sales_df['revenue'], 1)
    p = np.poly1d(z)
    axes[i].plot(sales_df[var], p(sales_df[var]), "r--", alpha=0.8)

plt.tight_layout()
plt.show()

## 5. Business Insights

Based on our exploratory analysis, let's identify key business insights.

In [None]:
# Business insights
print("=== BUSINESS INSIGHTS ===")
print("\n1. Revenue Patterns:")
print(f"   • Average daily revenue: ${sales_df['revenue'].mean():,.0f}")
print(f"   • Revenue volatility (std): ${sales_df['revenue'].std():,.0f}")
print(f"   • Peak revenue day: {sales_df.loc[sales_df['revenue'].idxmax(), 'date'].strftime('%Y-%m-%d')}")
print(f"   • Peak revenue amount: ${sales_df['revenue'].max():,.0f}")

print("\n2. Key Drivers (Correlation with Revenue):")
for var, corr in revenue_corr.head(3).items():
    print(f"   • {var.replace('_', ' ').title()}: {corr:.3f}")

print("\n3. Seasonal Patterns:")
monthly_avg = sales_df.set_index('date').resample('M')['revenue'].mean()
peak_month = monthly_avg.idxmax().strftime('%B %Y')
low_month = monthly_avg.idxmin().strftime('%B %Y')
print(f"   • Peak month: {peak_month} (${monthly_avg.max():,.0f} avg)")
print(f"   • Low month: {low_month} (${monthly_avg.min():,.0f} avg)")

print("\n4. Day-of-Week Effects:")
dow_avg = sales_df.groupby('day_of_week')['revenue'].mean()
dow_avg = dow_avg.reindex(day_order)
best_day = dow_avg.idxmax()
worst_day = dow_avg.idxmin()
print(f"   • Best performing day: {best_day} (${dow_avg[best_day]:,.0f} avg)")
print(f"   • Worst performing day: {worst_day} (${dow_avg[worst_day]:,.0f} avg)")

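print("\n5. Marketing Efficiency:")
# Note: revenue / spend is a revenue-to-spend ratio, not a true ROI.
# ROI would be (revenue - spend) / spend, and not all revenue is driven by marketing.
revenue_per_marketing_dollar = sales_df['revenue'] / sales_df['marketing_spend']
print(f"   • Average revenue per marketing dollar: {revenue_per_marketing_dollar.mean():.2f}x")
print(f"   • Best revenue per marketing dollar: {revenue_per_marketing_dollar.max():.2f}x")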
print(f"   • Marketing spend range: ${sales_df['marketing_spend'].min():,.0f} - ${sales_df['marketing_spend'].max():,.0f}")

## 6. Data Quality Assessment

Let's check for data quality issues that might affect our machine learning models.

In [None]:
# Data quality checks
print("=== DATA QUALITY ASSESSMENT ===")

print("\n1. Missing Values:")
missing_values = sales_df.isnull().sum()
if missing_values.sum() == 0:
    print("   ✓ No missing values found")
else:
    print(missing_values[missing_values > 0])

print("\n2. Duplicate Records:")
duplicates = sales_df.duplicated().sum()
if duplicates == 0:
    print("   ✓ No duplicate records found")
else:
    print(f"   ⚠ {duplicates} duplicate records found")

print("\n3. Outlier Detection (IQR method):")
numeric_cols = ['marketing_spend', 'website_traffic', 'avg_product_price', 'revenue']
for col in numeric_cols:
    Q1 = sales_df[col].quantile(0.25)
    Q3 = sales_df[col].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    outliers = ((sales_df[col] < lower_bound) | (sales_df[col] > upper_bound)).sum()
    print(f"   • {col}: {outliers} outliers ({outliers/len(sales_df)*100:.1f}%)")

print("\n4. Data Types:")
print(sales_df.dtypes)

print("\n5. Value Ranges:")
for col in numeric_cols:
    print(f"   • {col}: {sales_df[col].min():.0f} to {sales_df[col].max():.0f}")
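
A note on interpreting the IQR check: for a uniformly distributed column, the fences Q1 - 1.5 * IQR and Q3 + 1.5 * IQR lie well outside the range the values can take, so the check cannot flag anything in those columns. The sketch below wraps the same rule in a small helper (`iqr_outlier_mask`, a name introduced here for illustration) and confirms that it finds nothing in uniform data but does catch an injected extreme value.

```python
import numpy as np
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True where a value lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

rng = np.random.default_rng(42)
uniform_col = pd.Series(rng.uniform(10_000, 100_000, 1_000))
n_uniform = int(iqr_outlier_mask(uniform_col).sum())
print("Outliers in uniform data:", n_uniform)

# Inject one extreme value (a hypothetical data-entry error) to confirm detection
with_error = pd.concat([uniform_col, pd.Series([1_000_000.0])], ignore_index=True)
n_with_error = int(iqr_outlier_mask(with_error).sum())
print("Outliers after injecting an error:", n_with_error)
```

On real sales data, which is rarely uniform, the same helper would be expected to flag genuine spikes such as promotions or holidays, so flagged rows deserve a look before being dropped.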

## 7. Summary and Next Steps

### Key Takeaways from Business Context Analysis:

1. **Data Quality**: The dataset has no missing values or duplicates, as expected for synthetically generated data
2. **Strong Relationships**: Marketing spend shows the strongest positive correlation with revenue, followed by website traffic
3. **Seasonal Patterns**: Seasonal, weekend, and holiday effects are all present in sales
4. **Business Impact**: Understanding these patterns can help optimize marketing spend and inventory

### Next Steps:
1. **Notebook 2**: Learn linear regression fundamentals
2. **Notebook 3**: Build multiple linear regression models
3. **Notebook 4**: Evaluate model performance
4. **Notebook 5**: Deploy models to production

### Business Questions to Answer:
- How much revenue can we expect from a given marketing spend?
- What is the optimal marketing budget allocation?
- How do seasonal factors affect our sales forecasts?
- Which variables are most important for predicting revenue?

In [None]:
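# Save the dataset for use in subsequent notebooks
import os
os.makedirs('../data', exist_ok=True)  # create the output folder if it does not exist yet
sales_df.to_csv('../data/amazon_sales_data.csv', index=False)
print("Dataset saved to '../data/amazon_sales_data.csv'")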
print("\nReady to move to the next notebook: 02_linear_regression_fundamentals.ipynb")
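
When the saved file is loaded in a later notebook, keep in mind that CSV stores dates as text. The sketch below shows a round trip on a small stand-in frame, written to a temporary directory so it does not touch the real file. The frame `demo` and the file name `demo_sales.csv` are hypothetical; the point is the `parse_dates` argument of `pd.read_csv`.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in frame; in the notebook this would be sales_df
demo = pd.DataFrame({
    'date': pd.date_range('2021-01-01', periods=3, freq='D'),
    'revenue': [100.0, 110.0, 120.0],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = Path(tmp_dir) / 'demo_sales.csv'  # hypothetical file name
    demo.to_csv(csv_path, index=False)

    as_text = pd.read_csv(csv_path)                         # 'date' is read back as strings
    as_dates = pd.read_csv(csv_path, parse_dates=['date'])  # 'date' is parsed to datetime64

print(as_text['date'].dtype, '->', as_dates['date'].dtype)
```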