# 📊 Linear Regression Fundamentals: Amazon Sales Analytics

## 🎯 Learning Objectives

In this notebook, you'll learn:
- **Theory**: Mathematical foundations of linear regression
- **Implementation**: Building models from scratch and with libraries
- **Business Application**: Predicting Amazon sales using linear regression
- **Model Validation**: Checking assumptions and evaluating performance

## 🏢 Business Context: Sales Forecasting

As Amazon's data scientist, you need to predict:
- **Daily Revenue**: Based on marketing spend, seasonality, and external factors
- **Product Demand**: To optimize inventory levels
- **Customer Behavior**: To improve conversion rates

Linear regression is your first tool for these predictions!

In [None]:
# Import libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import statsmodels.api as sm
from scipy import stats

# Set random seed for reproducibility
np.random.seed(42)

print("✅ Libraries imported successfully!")

## 📚 Linear Regression Theory

### **Mathematical Foundation**

Linear regression models the relationship between a dependent variable (Y) and one or more independent variables (X) using a linear function:

**Simple Linear Regression:**
$$Y = \beta_0 + \beta_1 X + \epsilon$$

**Multiple Linear Regression:**
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n + \epsilon$$

Where:
- $\beta_0$ = Intercept (baseline value)
- $\beta_1, \beta_2, \dots$ = Coefficients (slopes)
- $\epsilon$ = Error term (unobserved noise; the residuals of a fitted model are estimates of it)
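
For reference, the standard ordinary least squares (OLS) estimates choose the coefficients that minimize the sum of squared residuals. The hat notation ($\hat{\beta}$), the sample means ($\bar{X}$, $\bar{Y}$), and the number of observations $m$ are introduced here for illustration only. For simple linear regression the textbook closed form is:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{m} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{m} (X_i - \bar{X})^2}, \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

For multiple linear regression, stacking the observations into a matrix $\mathbf{X}$ of shape $m \times (n + 1)$ (with a leading column of ones for the intercept) gives, assuming $\mathbf{X}^\top \mathbf{X}$ is invertible:

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y}$$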

### **Business Interpretation**
- **Intercept ($\beta_0$)**: Predicted sales when all factors are zero (often an extrapolation outside the observed data)
- **Coefficients ($\beta_i$)**: Expected change in sales for each one-unit change in a factor, holding the other factors constant
- **R²**: Proportion of the variance in sales that the model explains
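
To make the theory concrete, here is a minimal from-scratch sketch of simple linear regression using only NumPy. It applies the textbook least-squares formulas (slope = covariance of X and Y divided by variance of X; intercept = mean of Y minus slope times mean of X). The helper names `fit_simple_ols` and `r_squared` and the toy data are illustrative assumptions, not part of the sales dataset used in this notebook.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Closed-form least-squares estimates for y = b0 + b1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x
    b1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: forces the fitted line through the point of means
    b0 = y_mean - b1 * x_mean
    return b0, b1

def r_squared(y, y_hat):
    """Proportion of variance in y explained by the predictions y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    ss_res = np.sum((y - y_hat) ** 2)      # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data: true intercept 50, true slope 0.8, plus noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(10, 100, 200)
y_demo = 50 + 0.8 * x_demo + rng.normal(0, 5, 200)

b0_hat, b1_hat = fit_simple_ols(x_demo, y_demo)
r2_demo = r_squared(y_demo, b0_hat + b1_hat * x_demo)
print(f"intercept ≈ {b0_hat:.2f}, slope ≈ {b1_hat:.3f}, R² = {r2_demo:.3f}")
```

The estimates should land close to the true values used to generate the toy data, and they should match what a library fit returns on the same inputs.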

In [None]:
# Generate realistic Amazon sales data with multiple features
np.random.seed(42)

# Create sample size
n_samples = 1000

# Generate features (independent variables)
marketing_spend = np.random.uniform(10000, 100000, n_samples)  # Daily marketing budget
website_traffic = np.random.uniform(50000, 200000, n_samples)  # Daily visitors
avg_product_price = np.random.uniform(20, 200, n_samples)      # Average product price
seasonal_factor = np.sin(2 * np.pi * np.arange(n_samples) / 365) * 0.3  # Seasonal effect
promotion_active = np.random.choice([0, 1], n_samples, p=[0.7, 0.3])  # Promotion days

# Generate target variable (revenue) with realistic relationships
revenue = (
    50000 +                    # Base revenue
    0.8 * marketing_spend +    # Marketing impact
    0.3 * website_traffic +    # Traffic impact
    100 * avg_product_price +  # Price impact
    20000 * seasonal_factor +  # Seasonal impact
    15000 * promotion_active + # Promotion impact
    np.random.normal(0, 5000, n_samples)  # Random noise
)

# Create DataFrame
sales_df = pd.DataFrame({
    'marketing_spend': marketing_spend,
    'website_traffic': website_traffic,
    'avg_product_price': avg_product_price,
    'seasonal_factor': seasonal_factor,
    'promotion_active': promotion_active,
    'revenue': revenue
})

print("📊 Amazon Sales Dataset:")
print(sales_df.head())
print(f"\n📈 Dataset Shape: {sales_df.shape}")
print(f"💰 Revenue Range: ${sales_df['revenue'].min():,.0f} - ${sales_df['revenue'].max():,.0f}")

## 🔍 Exploratory Data Analysis (EDA)

Before building models, let's understand our data relationships:

In [None]:
# Correlation analysis
correlation_matrix = sales_df.corr()

plt.figure(figsize=(12, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0,
            square=True, linewidths=0.5)
plt.title('Correlation Matrix: Amazon Sales Features', fontsize=16, fontweight='bold')
plt.tight_layout()
plt.show()

# Key insights
print("🔍 Key Insights from Correlation Analysis:")
print("=" * 50)
print(f"Marketing Spend → Revenue: {correlation_matrix.loc['marketing_spend', 'revenue']:.3f}")
print(f"Website Traffic → Revenue: {correlation_matrix.loc['website_traffic', 'revenue']:.3f}")
print(f"Product Price → Revenue: {correlation_matrix.loc['avg_product_price', 'revenue']:.3f}")
print(f"Promotion → Revenue: {correlation_matrix.loc['promotion_active', 'revenue']:.3f}")

In [None]:
# Visualize relationships with revenue
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Feature Relationships with Revenue', fontsize=16, fontweight='bold')

# Marketing Spend vs Revenue
axes[0, 0].scatter(sales_df['marketing_spend'], sales_df['revenue'], alpha=0.6, color='#FF9900')
axes[0, 0].set_xlabel('Marketing Spend ($)')
axes[0, 0].set_ylabel('Revenue ($)')
axes[0, 0].set_title('Marketing Spend vs Revenue')

# Website Traffic vs Revenue
axes[0, 1].scatter(sales_df['website_traffic'], sales_df['revenue'], alpha=0.6, color='#232F3E')
axes[0, 1].set_xlabel('Website Traffic')
axes[0, 1].set_ylabel('Revenue ($)')
axes[0, 1].set_title('Website Traffic vs Revenue')

# Product Price vs Revenue
axes[1, 0].scatter(sales_df['avg_product_price'], sales_df['revenue'], alpha=0.6, color='#146EB4')
axes[1, 0].set_xlabel('Average Product Price ($)')
axes[1, 0].set_ylabel('Revenue ($)')
axes[1, 0].set_title('Product Price vs Revenue')

# Promotion Impact
promotion_data = sales_df.groupby('promotion_active')['revenue'].mean()
axes[1, 1].bar(['No Promotion', 'Promotion Active'], promotion_data.values, 
               color=['#FF6B6B', '#4ECDC4'])
axes[1, 1].set_ylabel('Average Revenue ($)')
axes[1, 1].set_title('Promotion Impact on Revenue')

plt.tight_layout()
plt.show()

## 🎯 Simple Linear Regression: Marketing Spend → Revenue

Let's start with the simplest case: predicting revenue based on marketing spend.

In [None]:
# Prepare data for simple linear regression
X_simple = sales_df[['marketing_spend']]
y = sales_df['revenue']

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X_simple, y, test_size=0.2, random_state=42
)

# Build simple linear regression model
simple_model = LinearRegression()
simple_model.fit(X_train, y_train)

# Make predictions
y_pred_train = simple_model.predict(X_train)
y_pred_test = simple_model.predict(X_test)

# Model coefficients
intercept = simple_model.intercept_
coefficient = simple_model.coef_[0]

print("🎯 Simple Linear Regression Results:")
print("=" * 50)
print(f"Intercept (β₀): ${intercept:,.2f}")
print(f"Coefficient (β₁): {coefficient:.4f}")
print(f"\n📊 Business Interpretation:")
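# The intercept is an extrapolation to $0 spend; it also absorbs the average effect of omitted factors
print(f"• Predicted revenue at $0 marketing: ${intercept:,.2f}")
print(f"• Revenue increase per $1 marketing: ${coefficient:.4f}")
# Net return = revenue gained per dollar minus the dollar spent
print(f"• Net return per $1 marketing: {(coefficient - 1):.1%}")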

In [None]:
# Visualize the simple linear regression
plt.figure(figsize=(12, 5))

# Training data
plt.subplot(1, 2, 1)
plt.scatter(X_train, y_train, alpha=0.6, color='#FF9900', label='Actual Data')
plt.plot(X_train, y_pred_train, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Marketing Spend ($)')
plt.ylabel('Revenue ($)')
plt.title('Training Data: Marketing Spend vs Revenue')
plt.legend()
plt.grid(True, alpha=0.3)

# Test data
plt.subplot(1, 2, 2)
plt.scatter(X_test, y_test, alpha=0.6, color='#232F3E', label='Actual Data')
plt.plot(X_test, y_pred_test, color='red', linewidth=2, label='Predicted Line')
plt.xlabel('Marketing Spend ($)')
plt.ylabel('Revenue ($)')
plt.title('Test Data: Marketing Spend vs Revenue')
plt.legend()
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

## 📊 Model Evaluation Metrics

Let's evaluate our model using multiple metrics:
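
For reference, the metrics computed in this section have the following standard definitions. The symbols are introduced here for illustration: $y_i$ is the actual revenue, $\hat{y}_i$ the predicted revenue, $\bar{y}$ the mean actual revenue, and $m$ the number of observations being evaluated.

$$\text{MSE} = \frac{1}{m} \sum_{i=1}^{m} (y_i - \hat{y}_i)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}, \qquad \text{MAE} = \frac{1}{m} \sum_{i=1}^{m} |y_i - \hat{y}_i|$$

$$\text{MAPE} = \frac{100}{m} \sum_{i=1}^{m} \left| \frac{y_i - \hat{y}_i}{y_i} \right|, \qquad R^2 = 1 - \frac{\sum_{i=1}^{m} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{m} (y_i - \bar{y})^2}$$

MAPE is undefined when any $y_i$ is zero, which is not a concern for the revenue values generated in this notebook.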

In [None]:
# Calculate evaluation metrics
def calculate_metrics(y_true, y_pred, model_name):
    """Calculate comprehensive model evaluation metrics"""
    
    # Basic metrics
    mse = mean_squared_error(y_true, y_pred)
    rmse = np.sqrt(mse)
    mae = mean_absolute_error(y_true, y_pred)
    r2 = r2_score(y_true, y_pred)
    
    # MAPE (Mean Absolute Percentage Error)
    mape = np.mean(np.abs((y_true - y_pred) / y_true)) * 100
    
    # Business metrics
    avg_revenue = np.mean(y_true)
    error_percentage = (rmse / avg_revenue) * 100
    
    print(f"📊 {model_name} Evaluation Metrics:")
    print("=" * 50)
    print(f"Mean Squared Error (MSE): {mse:,.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:,.2f}")
    print(f"Mean Absolute Error (MAE): {mae:,.2f}")
    print(f"Mean Absolute Percentage Error (MAPE): {mape:.2f}%")
    print(f"R-squared (R²): {r2:.4f}")
    print(f"\n💼 Business Interpretation:")
    print(f"• Average Revenue: ${avg_revenue:,.2f}")
    print(f"• Typical Prediction Error (RMSE): ${rmse:,.2f} ({error_percentage:.1f}% of average revenue)")
    print(f"• Variance Explained (R²): {r2*100:.1f}%")
    
    return {
        'mse': mse, 'rmse': rmse, 'mae': mae, 
        'mape': mape, 'r2': r2, 'error_percentage': error_percentage
    }

# Evaluate simple linear regression
simple_metrics = calculate_metrics(y_test, y_pred_test, "Simple Linear Regression")

## 🔍 Model Assumptions Check

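Linear regression rests on four key assumptions: a linear relationship, independent errors, constant error variance (homoscedasticity), and normally distributed errors. Let's check each one: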
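
As a numeric complement to the visual independence check in this section, one common option is the Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation, and values toward 4 suggest negative autocorrelation. The sketch below computes it with NumPy on two hypothetical residual series; the function name `durbin_watson_stat` and the simulated data are assumptions for illustration.

```python
import numpy as np

def durbin_watson_stat(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(0)

# Hypothetical residuals with no serial correlation
independent = rng.normal(0, 1, 500)

# Hypothetical residuals where each value carries over 90% of the previous one
correlated = np.zeros(500)
for t in range(1, 500):
    correlated[t] = 0.9 * correlated[t - 1] + rng.normal(0, 1)

dw_independent = durbin_watson_stat(independent)
dw_correlated = durbin_watson_stat(correlated)
print(f"Independent residuals: DW = {dw_independent:.2f}")
print(f"Correlated residuals:  DW = {dw_correlated:.2f}")
```

The statistic only makes sense when residuals are in their original time order. Because `train_test_split` shuffles rows, residuals from a shuffled split would need to be sorted back by their original index before applying it.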

In [None]:
# Calculate residuals
residuals = y_test - y_pred_test

# Create diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(15, 12))
fig.suptitle('Linear Regression Assumptions Check', fontsize=16, fontweight='bold')

# 1. Linearity: Predicted vs Actual
axes[0, 0].scatter(y_pred_test, y_test, alpha=0.6, color='#FF9900')
axes[0, 0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 
                'r--', lw=2, label='Perfect Prediction')
axes[0, 0].set_xlabel('Predicted Revenue')
axes[0, 0].set_ylabel('Actual Revenue')
axes[0, 0].set_title('Linearity Check: Predicted vs Actual')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Normality: Residuals histogram
axes[0, 1].hist(residuals, bins=30, alpha=0.7, color='#232F3E', edgecolor='black')
axes[0, 1].set_xlabel('Residuals')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].set_title('Normality Check: Residuals Distribution')
axes[0, 1].grid(True, alpha=0.3)

# 3. Homoscedasticity: Residuals vs Predicted
axes[1, 0].scatter(y_pred_test, residuals, alpha=0.6, color='#146EB4')
axes[1, 0].axhline(y=0, color='red', linestyle='--', alpha=0.8)
axes[1, 0].set_xlabel('Predicted Revenue')
axes[1, 0].set_ylabel('Residuals')
axes[1, 0].set_title('Homoscedasticity Check: Residuals vs Predicted')
axes[1, 0].grid(True, alpha=0.3)

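# 4. Independence: Residuals in original observation order
# (train_test_split shuffles rows, so restore the original order before plotting)
residuals_ordered = residuals.sort_index()
axes[1, 1].plot(residuals_ordered.index, residuals_ordered.values, alpha=0.7, color='#FF6B6B')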
axes[1, 1].axhline(y=0, color='red', linestyle='--', alpha=0.8)
axes[1, 1].set_xlabel('Observation Index')
axes[1, 1].set_ylabel('Residuals')
axes[1, 1].set_title('Independence Check: Residuals vs Index')
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Statistical tests for normality
from scipy.stats import shapiro, jarque_bera

print("🔍 Statistical Tests for Model Assumptions:")
print("=" * 50)

# Shapiro-Wilk test for normality
shapiro_stat, shapiro_p = shapiro(residuals)
print(f"Shapiro-Wilk Test (Normality):")
print(f"  Statistic: {shapiro_stat:.4f}")
print(f"  P-value: {shapiro_p:.4f}")
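# A p-value above 0.05 means we fail to reject normality; it does not prove it
print(f"  Consistent with normality (p > 0.05)? {'Yes' if shapiro_p > 0.05 else 'No'}")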

# Jarque-Bera test for normality
jb_stat, jb_p = jarque_bera(residuals)
print(f"\nJarque-Bera Test (Normality):")
print(f"  Statistic: {jb_stat:.4f}")
print(f"  P-value: {jb_p:.4f}")
print(f"  Consistent with normality (p > 0.05)? {'Yes' if jb_p > 0.05 else 'No'}")

## 🎯 Business Insights from Simple Linear Regression

### **Key Findings:**
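1. **Marketing ROI**: The fitted `coefficient` printed above is the revenue associated with each additional $1 of marketing spend (the synthetic data was generated with a true value of 0.8)
2. **Base Revenue**: The fitted `intercept` printed above is the predicted revenue with no marketing spend; because the other factors are omitted, it also absorbs their average contribution
3. **Model Performance**: `simple_metrics['r2']` is the share of revenue variation explained by marketing spend alone
4. **Prediction Accuracy**: `simple_metrics['rmse']` is the typical prediction error in dollars, and `simple_metrics['error_percentage']` expresses it as a percentage of average revenue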

### **Business Recommendations:**
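- **Marketing Investment**: A positive coefficient means revenue rises with marketing spend; a coefficient below 1 means each extra dollar returns less than a dollar of same-day revenue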
- **Budget Planning**: Use the model to forecast revenue for different marketing budgets
- **Performance Monitoring**: Track actual vs predicted revenue to detect changes
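
As a rough illustration of the budget-planning idea, the sketch below fits a line to hypothetical spend and revenue data and forecasts revenue for a few candidate budgets. It uses `np.polyfit` so that it runs on its own; the data, the `forecast_revenue` helper, and the budget values are all assumptions for illustration, not results from this notebook's model.

```python
import numpy as np

# Hypothetical history of daily marketing spend and revenue
rng = np.random.default_rng(42)
spend = rng.uniform(10_000, 100_000, 500)
revenue = 100_000 + 0.8 * spend + rng.normal(0, 5_000, 500)

# Degree-1 polyfit returns [slope, intercept]
slope, intercept = np.polyfit(spend, revenue, 1)

def forecast_revenue(budgets):
    """Predicted revenue for each marketing budget under the fitted line."""
    return intercept + slope * np.asarray(budgets, dtype=float)

budgets = np.array([25_000, 50_000, 75_000, 150_000])
forecasts = forecast_revenue(budgets)

# Flag budgets outside the observed range: those forecasts are extrapolations
in_range = (budgets >= spend.min()) & (budgets <= spend.max())

for budget, forecast, ok in zip(budgets, forecasts, in_range):
    note = "" if ok else "  (outside observed spend range, treat with caution)"
    print(f"Budget ${budget:>9,.0f} -> forecast revenue ${forecast:>11,.0f}{note}")
```

Flagging out-of-range budgets is a deliberate choice: a linear fit says nothing reliable about spend levels it has never seen.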

### **Limitations:**
- **Single Factor**: Only considers marketing spend, ignoring other important factors
- **Linear Assumption**: Assumes linear relationship (may not hold at extremes)
- **Missing Variables**: Website traffic, seasonality, and promotions not considered
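
To give a feel for the cost of leaving variables out, the following sketch compares a marketing-only fit with a fit that also includes traffic and promotions. It regenerates a small hypothetical dataset with a structure similar to this notebook's (without the price and seasonal terms) and solves least squares with `np.linalg.lstsq`; the helper `ols_fit` is an assumption for illustration, and the R² values are in-sample.

```python
import numpy as np

# Hypothetical data with a structure similar to the notebook's sales dataset
rng = np.random.default_rng(42)
n = 1000
marketing = rng.uniform(10_000, 100_000, n)
traffic = rng.uniform(50_000, 200_000, n)
promo = rng.choice([0, 1], n, p=[0.7, 0.3])
revenue = (50_000 + 0.8 * marketing + 0.3 * traffic
           + 15_000 * promo + rng.normal(0, 5_000, n))

def ols_fit(features, y):
    """Least-squares coefficients (intercept first) and in-sample R²."""
    X = np.column_stack([np.ones(len(y))] + list(features))
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return beta, r2

beta_simple, r2_simple = ols_fit([marketing], revenue)
beta_full, r2_full = ols_fit([marketing, traffic, promo], revenue)

print(f"Marketing only: intercept = {beta_simple[0]:,.0f}, R² = {r2_simple:.3f}")
print(f"All features:   intercept = {beta_full[0]:,.0f}, R² = {r2_full:.3f}")
```

In this setup the marketing-only intercept should come out well above the true base revenue, because it absorbs the average contribution of the omitted traffic and promotion terms. In-sample R² can never decrease when features are added, so a held-out comparison would be needed before concluding the larger model generalizes better.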

## 🚀 Next Steps: Multiple Linear Regression

In the next notebook, we'll build a more sophisticated model that considers:
- **Multiple Features**: Marketing spend, website traffic, product price, promotions
- **Feature Engineering**: Creating new meaningful variables
- **Multicollinearity**: Detecting and handling correlated features
- **Model Interpretation**: Understanding the impact of each factor

### **Business Questions We'll Answer:**
1. Which factors have the strongest impact on sales?
2. How do different marketing channels interact?
3. What's the optimal mix of marketing spend and pricing?
4. How do seasonal patterns affect our predictions?

---

**Ready to build a more comprehensive sales prediction model?** 🎯