# Amazon Sales Analytics: Business Context & Data Science Role

## Learning Objectives

In this notebook, you'll learn:
- **Amazon's Sales Process**: Understanding the complete customer journey
- **Key Business Metrics**: Revenue, conversion rates, customer lifetime value
- **Data Scientist Role**: How ML creates business value at Amazon
- **Sample Data Generation**: Creating realistic Amazon sales data
- **Exploratory Data Analysis**: Visualizing sales patterns and trends

## Business Context: Why This Matters

As Amazon's sales data scientist, you're responsible for:
- **Revenue Forecasting**: Predicting daily, weekly, and monthly sales
- **Demand Planning**: Optimizing inventory levels across warehouses
- **Marketing ROI**: Measuring the effectiveness of marketing campaigns
- **Pricing Strategy**: Understanding price elasticity and optimal pricing
- **Performance Monitoring**: Detecting when models need retraining

### **Business Impact**
- **5-15% Revenue Increase** through better demand forecasting
- **10-20% Cost Reduction** in inventory management
- **Improved Customer Satisfaction** with better product availability
- **Data-Driven Decisions** replacing guesswork in strategy

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set style for better visualizations
plt.style.use('default')
sns.set_palette("husl")

# Configure pandas for better display
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)
pd.set_option('display.max_rows', 20)

print("Libraries imported successfully!")
print(f"Working with data on: {datetime.now().strftime('%B %d, %Y')}")

## Amazon's Sales Process Overview

### **1. Sales Funnel Stages**

Amazon's customer journey follows these key stages:

1. **Awareness**: Product discovery through search, recommendations, ads
2. **Consideration**: Product page views, reviews, comparison shopping
3. **Purchase**: Add to cart, checkout, payment processing
4. **Retention**: Post-purchase support, re-engagement, loyalty programs

### **2. Key Sales Metrics**

**Revenue Metrics:**
- **Revenue**: Total sales value (Gross Merchandise Value - GMV)
- **Units Sold**: Number of products sold
- **Average Order Value (AOV)**: Revenue per transaction

**Performance Metrics:**
- **Conversion Rate**: Share of visitors who make a purchase
- **Customer Acquisition Cost (CAC)**: Cost to acquire new customers
- **Customer Lifetime Value (CLV)**: Total value from a customer
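
As a rough illustration, the metrics above can be computed from a handful of aggregate figures. Every number below is made up for the example, and the CLV formula (average order value × orders per year × years retained) is one simple approximation chosen here, not a definition taken from this notebook.

```python
# Hypothetical monthly aggregates (illustrative numbers only)
visitors = 500_000          # unique visitors
orders = 10_000             # completed orders
revenue = 450_000.0         # gross merchandise value in dollars
marketing_cost = 60_000.0   # spend attributed to acquiring new customers
new_customers = 2_400       # first-time buyers

conversion_rate = orders / visitors        # share of visitors who buy
avg_order_value = revenue / orders         # revenue per transaction
cac = marketing_cost / new_customers       # cost per new customer

# Simple CLV approximation (both inputs are assumptions)
orders_per_year = 4
retention_years = 3
clv = avg_order_value * orders_per_year * retention_years

print(f"Conversion rate: {conversion_rate:.2%}")
print(f"Average order value: ${avg_order_value:,.2f}")
print(f"Customer acquisition cost: ${cac:,.2f}")
print(f"Customer lifetime value: ${clv:,.2f}")
print(f"CLV / CAC ratio: {clv / cac:.1f}")
```

Comparing CLV with CAC is a common sanity check: acquiring a customer only pays off when the ratio is comfortably above 1.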

### **3. Sales Challenges**

Amazon faces several key challenges:
- **Seasonal Fluctuations**: Holiday seasons, weather patterns
- **Inventory Optimization**: Stockout prevention vs. overstock
- **Dynamic Pricing**: Competitive positioning and price elasticity
- **Regional Variations**: Market-specific patterns and preferences

In [None]:
# Generate realistic Amazon sales data
np.random.seed(42)

# Create sample size
n_days = 365  # One year of data
dates = pd.date_range('2023-01-01', periods=n_days, freq='D')

# Generate base features with realistic patterns
base_revenue = 1000000  # Base daily revenue
seasonal_factor = np.sin(2 * np.pi * np.arange(n_days) / 365) * 0.3  # Seasonal variation: ±30% annual cycle, peaking in early April
trend_factor = np.arange(n_days) * 1000  # Upward trend
noise = np.random.normal(0, 50000, n_days)  # Random noise

# Generate sales data with realistic relationships
daily_revenue = base_revenue + seasonal_factor * base_revenue + trend_factor + noise
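daily_units = (daily_revenue / 50) + np.random.normal(0, 1000, n_days)  # Assume $50 average price
# 2% base conversion, floored at 0.5% so the visitor estimate stays finite and positive
conversion_rate = np.clip(0.02 + np.random.normal(0, 0.005, n_days), 0.005, None)
avg_order_value = daily_revenue / daily_units  # Simplification: one unit per order, so this is revenue per unit
visitors = daily_units / conversion_rate  # Implied visitors under the same one-unit-per-order assumption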

# Create DataFrame
sales_data = pd.DataFrame({
    'date': dates,
    'revenue': daily_revenue,
    'units_sold': daily_units,
    'conversion_rate': conversion_rate,
    'avg_order_value': avg_order_value,
    'visitors': visitors
})

print("Amazon Sales Dataset Created:")
print(f"• Time Period: {sales_data['date'].min().strftime('%B %d, %Y')} to {sales_data['date'].max().strftime('%B %d, %Y')}")
print(f"• Total Days: {len(sales_data)} days")
print(f"• Revenue Range: ${sales_data['revenue'].min():,.0f} - ${sales_data['revenue'].max():,.0f}")
print(f"• Average Daily Revenue: ${sales_data['revenue'].mean():,.0f}")
print(f"• Total Annual Revenue: ${sales_data['revenue'].sum():,.0f}")

print("\nSample Data (First 10 days):")
print(sales_data.head(10).round(2))

## Understanding Sales Patterns

Let's visualize the key sales metrics to understand patterns and trends:

In [None]:
# Create comprehensive sales dashboard
fig = make_subplots(
    rows=3, cols=2,
    subplot_titles=('Daily Revenue', 'Units Sold', 'Conversion Rate', 
                   'Average Order Value', 'Monthly Revenue Trend', 'Revenue Distribution'),
    specs=[[{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}],
           [{"secondary_y": False}, {"secondary_y": False}]]
)

# 1. Daily Revenue
fig.add_trace(
    go.Scatter(x=sales_data['date'], y=sales_data['revenue'], 
               mode='lines', name='Revenue', line=dict(color='#FF9900')),
    row=1, col=1
)

# 2. Units Sold
fig.add_trace(
    go.Scatter(x=sales_data['date'], y=sales_data['units_sold'], 
               mode='lines', name='Units', line=dict(color='#232F3E')),
    row=1, col=2
)

# 3. Conversion Rate
fig.add_trace(
    go.Scatter(x=sales_data['date'], y=sales_data['conversion_rate'], 
               mode='lines', name='Conversion Rate', line=dict(color='#146EB4')),
    row=2, col=1
)

# 4. Average Order Value
fig.add_trace(
    go.Scatter(x=sales_data['date'], y=sales_data['avg_order_value'], 
               mode='lines', name='AOV', line=dict(color='#FF6B6B')),
    row=2, col=2
)

# 5. Monthly Revenue Trend
monthly_revenue = sales_data.groupby(sales_data['date'].dt.to_period('M'))['revenue'].sum().reset_index()
monthly_revenue['date'] = monthly_revenue['date'].astype(str)
fig.add_trace(
    go.Bar(x=monthly_revenue['date'], y=monthly_revenue['revenue'], 
           name='Monthly Revenue', marker_color='#FF9900'),
    row=3, col=1
)

# 6. Revenue Distribution
fig.add_trace(
    go.Histogram(x=sales_data['revenue'], nbinsx=30, name='Revenue Distribution', 
                 marker_color='#232F3E'),
    row=3, col=2
)

fig.update_layout(
    title='Amazon Sales Analytics Dashboard',
    height=800,
    showlegend=False,
    template='plotly_white'
)

fig.show()

## Key Business Insights from the Data

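### **Patterns Identified:**
1. **Seasonal Trends**: Revenue follows a smooth annual cycle. In this synthetic dataset the sine term peaks in spring and bottoms out in autumn; real retail data would more likely peak around the Q4 holidays
2. **Growth Trajectory**: An underlying upward trend of about $1,000 per day, partly masked by the seasonal swing
3. **Conversion Stability**: Conversion rate fluctuates randomly around 2% with no trend
4. **Order Value Variation**: AOV stays close to the assumed $50 price; its variation here is random noise, whereas in real data it would reflect product mix and promotions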

### **Business Questions We Need to Answer:**
1. **Forecasting**: Can we predict next month's revenue?
2. **Seasonality**: How do holidays affect sales?
3. **Growth**: What factors drive sales growth?
4. **Optimization**: How can we improve conversion rates?

In [None]:
# Calculate key business metrics
print("Key Business Metrics:")
print("=" * 50)

# Overall metrics
total_revenue = sales_data['revenue'].sum()
total_units = sales_data['units_sold'].sum()
avg_conversion = sales_data['conversion_rate'].mean()
avg_aov = sales_data['avg_order_value'].mean()

print(f"Total Annual Revenue: ${total_revenue:,.2f}")
print(f"Total Units Sold: {total_units:,.0f}")
print(f"Average Conversion Rate: {avg_conversion:.3%}")
print(f"Average Order Value: ${avg_aov:.2f}")

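# Growth metrics
# Note: comparing Q4 with Q1 mixes the seasonal cycle with the underlying trend,
# so this figure can be negative even though trend_factor is positive.
q1_revenue = sales_data[sales_data['date'].dt.quarter == 1]['revenue'].sum()
q4_revenue = sales_data[sales_data['date'].dt.quarter == 4]['revenue'].sum()
growth_rate = ((q4_revenue - q1_revenue) / q1_revenue) * 100

print("\nGrowth Analysis:")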
print(f"• Q1 Revenue: ${q1_revenue:,.2f}")
print(f"• Q4 Revenue: ${q4_revenue:,.2f}")
print(f"• Growth Rate: {growth_rate:.1f}%")

# Seasonal analysis
monthly_analysis = sales_data.groupby(sales_data['date'].dt.month).agg({
    'revenue': 'sum',
    'units_sold': 'sum',
    'conversion_rate': 'mean'
}).round(2)

print("\nMonthly Performance:")
print(monthly_analysis)

## Your Mission as Amazon's Data Scientist

### **Primary Objectives:**

1. **Sales Forecasting**: Predict future revenue based on historical data
2. **Demand Planning**: Optimize inventory levels across warehouses
3. **Pricing Strategy**: Develop dynamic pricing models
4. **Performance Analysis**: Identify factors driving sales success
5. **Regional Insights**: Understand market-specific patterns

### **Business Impact:**
- **Revenue Optimization**: 5-15% increase through better forecasting
- **Cost Reduction**: 10-20% reduction in inventory costs
- **Customer Satisfaction**: Improved product availability
- **Competitive Advantage**: Data-driven decision making

### **Key Challenges You'll Address:**
- **Seasonality**: Sales patterns vary by season and holidays
- **External Factors**: Economic conditions, competition, market changes
- **Data Quality**: Missing data, outliers, inconsistent reporting
- **Model Drift**: Changing customer behavior over time
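
Model drift in particular lends itself to a simple automated check. The sketch below is one possible approach, not a method prescribed by this notebook: it tracks the mean absolute percentage error (MAPE) over a trailing window and flags drift when the latest window is noticeably worse than a baseline. The function names, the 30-day window, and the 1.5× tolerance are all assumptions made for the example.

```python
import numpy as np

def rolling_mape(actual, predicted, window=30):
    """MAPE over each trailing window of `window` observations."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.abs((actual - predicted) / actual)  # assumes no zero actuals
    # Moving average via convolution; yields len(actual) - window + 1 values
    kernel = np.ones(window) / window
    return np.convolve(ape, kernel, mode="valid")

def drift_detected(actual, predicted, baseline_mape, window=30, tolerance=1.5):
    """Flag drift when the latest window's MAPE exceeds tolerance x baseline."""
    latest = rolling_mape(actual, predicted, window)[-1]
    return bool(latest > tolerance * baseline_mape)

# Hypothetical scenario: forecasts are 2% off for 90 days, then 10% off
actual = np.full(120, 1_000_000.0)
predicted = actual * 1.02
predicted[90:] = actual[90:] * 1.10

baseline = rolling_mape(actual[:90], predicted[:90])[0]
print(f"Baseline MAPE: {baseline:.2%}")
print(f"Drift detected: {drift_detected(actual, predicted, baseline)}")
```

In practice the baseline would come from a validation period, and a flagged window would prompt a closer look before any retraining decision.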

In [None]:
# Visualize seasonal patterns and trends
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Amazon Sales Analysis: Patterns and Trends', fontsize=16, fontweight='bold')

# 1. Revenue by Month (Seasonal Pattern)
monthly_revenue = sales_data.groupby(sales_data['date'].dt.month)['revenue'].mean()
axes[0, 0].bar(monthly_revenue.index, monthly_revenue.values, color='#FF9900', alpha=0.7)
axes[0, 0].set_xlabel('Month')
axes[0, 0].set_ylabel('Average Revenue ($)')
axes[0, 0].set_title('Seasonal Revenue Pattern')
axes[0, 0].set_xticks(range(1, 13))
axes[0, 0].grid(True, alpha=0.3)

# 2. Revenue Trend Over Time
axes[0, 1].plot(sales_data['date'], sales_data['revenue'], color='#232F3E', linewidth=1)
axes[0, 1].set_xlabel('Date')
axes[0, 1].set_ylabel('Revenue ($)')
axes[0, 1].set_title('Revenue Trend Over Time')
axes[0, 1].grid(True, alpha=0.3)

# 3. Conversion Rate Distribution
axes[1, 0].hist(sales_data['conversion_rate'], bins=30, color='#146EB4', alpha=0.7, edgecolor='black')
axes[1, 0].set_xlabel('Conversion Rate')
axes[1, 0].set_ylabel('Frequency')
axes[1, 0].set_title('Conversion Rate Distribution')
axes[1, 0].grid(True, alpha=0.3)

# 4. Average Order Value vs Revenue
axes[1, 1].scatter(sales_data['avg_order_value'], sales_data['revenue'], 
                   alpha=0.6, color='#FF6B6B', s=20)
axes[1, 1].set_xlabel('Average Order Value ($)')
axes[1, 1].set_ylabel('Revenue ($)')
axes[1, 1].set_title('AOV vs Revenue Relationship')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Business insights summary
print("Key Business Insights:")
print("=" * 50)
print(f"• Peak Sales Month: {monthly_revenue.idxmax()} (${monthly_revenue.max():,.0f} avg daily revenue)")
print(f"• Lowest Sales Month: {monthly_revenue.idxmin()} (${monthly_revenue.min():,.0f} avg daily revenue)")
print(f"• Revenue Volatility: ${sales_data['revenue'].std():,.0f} (standard deviation)")
print(f"• Conversion Rate Stability: {sales_data['conversion_rate'].std():.4f} (low variance = stable)")

## Data Science Challenges in Sales Analytics

### **1. Linear Regression Applications:**

Linear regression is a simple, interpretable starting point for several Amazon sales analytics tasks:
- **Revenue Forecasting**: Predict future sales based on historical data
- **Demand Prediction**: Estimate product demand for inventory planning
- **Pricing Analysis**: Understand price elasticity and optimal pricing
- **Marketing ROI**: Measure impact of marketing campaigns on sales
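
To make the revenue forecasting case concrete, here is a minimal sketch that fits an ordinary least squares model with a linear trend and one sine/cosine pair for the annual cycle. It regenerates synthetic revenue using the same base, seasonal amplitude, trend, and noise level as the data cell earlier in this notebook, but with its own random generator, so the numbers will not match `sales_data` exactly. The feature choice is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
n_days = 365
t = np.arange(n_days)

# Synthetic revenue: base + annual sine cycle + linear trend + noise
revenue = (1_000_000
           + 300_000 * np.sin(2 * np.pi * t / 365)
           + 1_000 * t
           + rng.normal(0, 50_000, n_days))

def design_matrix(day_index):
    """Intercept, day index, and a sine/cosine pair for the annual cycle."""
    angle = 2 * np.pi * day_index / 365
    return np.column_stack([
        np.ones(len(day_index)),
        day_index,
        np.sin(angle),
        np.cos(angle),
    ])

# Ordinary least squares fit
X = design_matrix(t)
coef, *_ = np.linalg.lstsq(X, revenue, rcond=None)
intercept, trend, sin_amp, cos_amp = coef

fitted = X @ coef
ss_res = np.sum((revenue - fitted) ** 2)
ss_tot = np.sum((revenue - revenue.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# Forecast the next 30 days by extending the same features
t_future = np.arange(n_days, n_days + 30)
forecast = design_matrix(t_future) @ coef

print(f"Estimated daily trend: ${trend:,.0f} per day")
print(f"Estimated seasonal amplitude (sine term): ${sin_amp:,.0f}")
print(f"R squared: {r_squared:.3f}")
print(f"Mean forecast, next 30 days: ${forecast.mean():,.0f}")
```

With only one year of data the trend and the seasonal cycle are hard to separate, so the estimated coefficients should be read as rough; several years of history would pin them down far better.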

### **2. Key Challenges:**

**Data Challenges:**
- **Seasonality**: Sales patterns vary by season and holidays
- **External Factors**: Economic conditions, competition, market changes
- **Data Quality**: Missing data, outliers, inconsistent reporting
- **Model Drift**: Changing customer behavior over time
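
A basic data quality pass can be sketched in a few lines. The helper below is hypothetical (its name, return format, and the 1.5 × IQR rule are choices made for this example): it counts missing values and flags outliers in a numeric series.

```python
import numpy as np
import pandas as pd

def quality_report(series: pd.Series, k: float = 1.5) -> dict:
    """Count missing values and IQR-rule outliers in a numeric series."""
    # quantile() skips NaN values by default
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons with NaN are False, so missing values are never flagged
    outlier_mask = (series < lower) | (series > upper)
    return {
        "missing": int(series.isna().sum()),
        "outliers": int(outlier_mask.sum()),
        "lower_bound": float(lower),
        "upper_bound": float(upper),
    }

# Hypothetical daily unit counts with one gap and one reporting spike
daily_units = pd.Series([100.0, 102.0, 98.0, np.nan, 101.0,
                         99.0, 500.0, 103.0, 97.0, 100.0])
report = quality_report(daily_units)
print(report)
```

Whether a flagged point is an error or a genuine event (a promotion, for example) is a business judgment; the rule only points at candidates.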

**Technical Challenges:**
- **Feature Engineering**: Creating meaningful variables from raw data
- **Multicollinearity**: Handling correlated features
- **Model Validation**: Ensuring reliable predictions
- **Production Deployment**: Scaling models for business use
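
Multicollinearity is commonly diagnosed with the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 − R²). The sketch below implements that with NumPy only. The function name and the synthetic features are assumptions for illustration; units are deliberately derived from revenue, as in this notebook's data cell, so those two columns should show high VIFs.

```python
import numpy as np

def variance_inflation_factors(X):
    """VIF per column of X (n_samples x n_features).

    Each feature is regressed on all the others (plus an intercept);
    VIF = 1 / (1 - R^2) = total variance / residual variance.
    Constant columns are not handled.
    """
    X = np.asarray(X, dtype=float)
    # Standardise for numerical stability; VIF is unaffected by scaling
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    n, p = X.shape
    vifs = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        vifs.append(y.var() / resid.var())
    return np.array(vifs)

# Synthetic features (illustrative values)
rng = np.random.default_rng(0)
revenue = rng.normal(1_000_000, 200_000, 365)
units = revenue / 50 + rng.normal(0, 1_000, 365)   # tied to revenue
conversion = rng.normal(0.02, 0.005, 365)          # independent

vifs = variance_inflation_factors(np.column_stack([revenue, units, conversion]))
for name, vif in zip(["revenue", "units", "conversion"], vifs):
    print(f"{name:>10}: VIF = {vif:.2f}")
```

A frequently quoted rule of thumb treats VIF values above roughly 5 to 10 as a sign that a feature is largely redundant with the others.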

### **3. Success Metrics:**

**Model Performance:**
- **Forecast Accuracy**: How close are our predictions?
- **Model Stability**: Consistent performance over time
- **Business Impact**: Revenue increase, cost reduction

**Business Metrics:**
- **Revenue Growth**: Measurable increase in sales
- **Cost Reduction**: Lower inventory and operational costs
- **Customer Satisfaction**: Improved product availability
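
Forecast accuracy is usually summarised with a few standard error metrics. The helper below is a small sketch (the function name and sample numbers are made up) computing mean absolute error, root mean squared error, and mean absolute percentage error.

```python
import numpy as np

def forecast_metrics(actual, predicted):
    """Return MAE, RMSE, and MAPE for paired actual/predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    errors = actual - predicted
    return {
        "mae": float(np.mean(np.abs(errors))),
        "rmse": float(np.sqrt(np.mean(errors ** 2))),
        # MAPE is undefined when any actual value is zero
        "mape": float(np.mean(np.abs(errors / actual))),
    }

# Hypothetical daily revenue (in thousands of dollars)
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
metrics = forecast_metrics(actual, predicted)

print(f"MAE:  {metrics['mae']:.2f}")
print(f"RMSE: {metrics['rmse']:.2f}")
print(f"MAPE: {metrics['mape']:.2%}")
```

MAE and RMSE are in the same units as revenue, which makes them easy to explain to stakeholders, while MAPE allows comparison across products with very different sales volumes.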

In [None]:
# Demonstrate correlation analysis
print("Correlation Analysis: Understanding Relationships")
print("=" * 60)

# Calculate correlations
correlation_matrix = sales_data[['revenue', 'units_sold', 'conversion_rate', 'avg_order_value', 'visitors']].corr()

# Create correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5, fmt='.3f')
plt.title('Correlation Matrix: Amazon Sales Metrics', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Business interpretation
print("Business Interpretation:")
print("=" * 40)
print(f"• Revenue vs Units Sold: {correlation_matrix.loc['revenue', 'units_sold']:.3f} (Strong positive)")
print(f"• Revenue vs Conversion Rate: {correlation_matrix.loc['revenue', 'conversion_rate']:.3f}")
print(f"• Revenue vs Average Order Value: {correlation_matrix.loc['revenue', 'avg_order_value']:.3f}")
print(f"• Revenue vs Visitors: {correlation_matrix.loc['revenue', 'visitors']:.3f}")

print("\nKey Insights:")
print("• Visitor count is positively correlated with revenue, though noise in the conversion rate weakens the link")
print("• Conversion rate is generated independently of revenue here, so expect a correlation near zero")
print("• Average order value shows weak correlation with total revenue")
print("• Units sold is strongly correlated with revenue (expected, since units are derived from revenue)")

## Business Questions We'll Answer with Linear Regression

### **1. Sales Forecasting**
- **Question**: Can we predict next month's revenue?
- **Approach**: Use historical data to build regression models
- **Features**: Previous sales, seasonality, marketing spend, external factors
- **Business Value**: Better inventory planning and resource allocation

### **2. Marketing ROI Analysis**
- **Question**: How effective are our marketing campaigns?
- **Approach**: Regression analysis of marketing spend vs. revenue
- **Features**: Marketing budget, channel performance, campaign timing
- **Business Value**: Optimize marketing budget allocation
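
A minimal sketch of this analysis, assuming a single spend variable and synthetic data generated with a true slope of 0.8 (the same relationship used in the scenario cell later in this notebook, but with its own random generator): the fitted slope estimates the incremental revenue associated with each additional marketing dollar.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic campaigns: revenue = 50,000 + 0.8 * spend + noise (assumed values)
spend = rng.uniform(10_000, 100_000, 100)
revenue = 50_000 + 0.8 * spend + rng.normal(0, 5_000, 100)

# np.polyfit returns coefficients from highest degree down: [slope, intercept]
slope, intercept = np.polyfit(spend, revenue, deg=1)

print(f"Baseline revenue with zero spend: ${intercept:,.0f}")
print(f"Incremental revenue per marketing dollar: ${slope:.2f}")
```

A slope below 1 would mean an extra dollar of spend is associated with less than a dollar of extra revenue. With observational data the slope describes association only; attributing it to the campaigns requires controlled experiments or careful handling of confounders.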

### **3. Pricing Strategy**
- **Question**: What's the optimal price for maximum revenue?
- **Approach**: Price elasticity analysis using regression
- **Features**: Product prices, competitor prices, sales volume
- **Business Value**: Maximize revenue through optimal pricing
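
For a linear demand curve the arithmetic can be worked through directly. If demand is $Q = a + bP$ with $b < 0$, then revenue $R = P(a + bP)$ is maximised at $P^* = -a / (2b)$, and the point elasticity at a price $P_0$ is $b P_0 / (a + b P_0)$. The sketch below fits $a$ and $b$ on synthetic data; the linear form and all numbers are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic demand: 1000 - 2 * price + noise (assumed values)
price = rng.uniform(20, 200, 100)
demand = 1000 - 2 * price + rng.normal(0, 50, 100)

# Fit demand = a + b * price; polyfit returns [b, a]
b, a = np.polyfit(price, demand, deg=1)

# Revenue p * (a + b * p) peaks where its derivative a + 2 * b * p is zero
optimal_price = -a / (2 * b)

# Point elasticity at a reference price: (dQ/dP) * (P / Q)
p0 = 100.0
elasticity = b * p0 / (a + b * p0)

print(f"Fitted demand curve: Q = {a:.1f} + ({b:.2f}) * P")
print(f"Revenue-maximising price: ${optimal_price:.2f}")
print(f"Price elasticity at ${p0:.0f}: {elasticity:.2f}")
print(f"Observed price range: ${price.min():.2f} to ${price.max():.2f}")
```

With a true curve of 1000 − 2P the revenue-maximising price is around $250, which lies above every observed price. That makes it an extrapolation: the fitted line says nothing reliable about demand at prices that were never tried.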

### **4. Demand Planning**
- **Question**: How much inventory should we stock?
- **Approach**: Predict demand using regression models
- **Features**: Historical demand, seasonal patterns, promotional events
- **Business Value**: Reduce stockouts and overstock costs

In [None]:
# Create sample data for different business scenarios
print("Sample Business Scenarios for Linear Regression:")
print("=" * 60)

# Scenario 1: Marketing Spend vs Revenue
marketing_spend = np.random.uniform(10000, 100000, 100)
revenue_from_marketing = 50000 + 0.8 * marketing_spend + np.random.normal(0, 5000, 100)

marketing_df = pd.DataFrame({
    'marketing_spend': marketing_spend,
    'revenue': revenue_from_marketing
})

print("1. Marketing ROI Analysis:")
print(f"   • Marketing spend range: ${marketing_spend.min():,.0f} - ${marketing_spend.max():,.0f}")
print(f"   • Revenue range: ${revenue_from_marketing.min():,.0f} - ${revenue_from_marketing.max():,.0f}")
print(f"   • Correlation: {marketing_df.corr().iloc[0,1]:.3f}")

# Scenario 2: Price vs Demand
product_price = np.random.uniform(20, 200, 100)
demand = 1000 - 2 * product_price + np.random.normal(0, 50, 100)

pricing_df = pd.DataFrame({
    'price': product_price,
    'demand': demand
})

print("\n2. Pricing Strategy Analysis:")
print(f"   • Price range: ${product_price.min():.2f} - ${product_price.max():.2f}")
print(f"   • Demand range: {demand.min():.0f} - {demand.max():.0f} units")
print(f"   • Correlation: {pricing_df.corr().iloc[0,1]:.3f}")

# Visualize scenarios
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 6))

# Marketing ROI
ax1.scatter(marketing_df['marketing_spend'], marketing_df['revenue'], alpha=0.6, color='#FF9900')
ax1.set_xlabel('Marketing Spend ($)')
ax1.set_ylabel('Revenue ($)')
ax1.set_title('Marketing ROI Analysis')
ax1.grid(True, alpha=0.3)

# Pricing Strategy
ax2.scatter(pricing_df['price'], pricing_df['demand'], alpha=0.6, color='#232F3E')
ax2.set_xlabel('Product Price ($)')
ax2.set_ylabel('Demand (Units)')
ax2.set_title('Price Elasticity Analysis')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## Next Steps: Your Data Science Journey

### **Notebook 2**: Linear Regression Fundamentals
- **Theory**: Mathematical foundations of linear regression
- **Implementation**: Building models from scratch and with libraries
- **Business Application**: Predicting Amazon sales using linear regression
- **Model Validation**: Checking assumptions and evaluating performance

### **Notebook 3**: Multiple Linear Regression
- **Feature Engineering**: Creating meaningful variables from raw data
- **Multicollinearity**: Detecting and handling correlated features
- **Model Interpretation**: Understanding the impact of each factor
- **Advanced Diagnostics**: Comprehensive model validation

### **Notebook 4**: Model Evaluation & Metrics
- **Evaluation Metrics**: MSE, MAE, MAPE, R², Adjusted R²
- **Cross-Validation**: Robust model validation techniques
- **Business Interpretation**: Translating metrics to business value
- **Model Comparison**: Choosing the best model for business needs

### **Notebook 5**: Advanced Topics & Production
- **Regularization**: Ridge, Lasso, Elastic Net for overfitting prevention
- **Hyperparameter Tuning**: Optimizing model parameters
- **Model Drift**: Monitoring performance over time
- **AWS Deployment**: Production deployment strategies

### **Business Questions We'll Answer:**
1. Which factors have the strongest impact on sales?
2. How do different marketing channels interact?
3. What's the optimal mix of marketing spend and pricing?
4. How do seasonal patterns affect our predictions?

---

**Ready to dive into the world of linear regression for sales analytics?**

Your role as Amazon's data scientist is crucial for driving business growth through data-driven insights!