# Dataset Download and Initial Load

**Project**: Customer Purchase Behavior Analysis  
**Phase**: 1 - Data Acquisition  
**Date**: November 7, 2025  

---

## Objective
Download the e-commerce dataset, load it into a pandas DataFrame, and run a first inspection before analysis.

## Dataset Options
1. **Online Retail Dataset (UCI)** - Recommended
2. **Kaggle Datasets** - Requires API setup
3. **Synthetic Generated Data** - Custom creation

## Setup and Imports

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import os
from pathlib import Path
import warnings
# Suppress all warnings to keep cell output tidy (note: this also hides deprecation notices)
warnings.filterwarnings('ignore')

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.float_format', lambda x: '%.2f' % x)

print("Libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

## Define File Paths

In [None]:
# Define paths
RAW_DATA_PATH = Path('../data/raw')
PROCESSED_DATA_PATH = Path('../data/processed')

# Create directories if they don't exist
RAW_DATA_PATH.mkdir(parents=True, exist_ok=True)
PROCESSED_DATA_PATH.mkdir(parents=True, exist_ok=True)

print(f"Raw data path: {RAW_DATA_PATH.absolute()}")
print(f"Processed data path: {PROCESSED_DATA_PATH.absolute()}")

## Option 1: Download Online Retail Dataset (UCI)

**Best for beginners** - No account needed!

In [None]:
# Download Online Retail Dataset
import urllib.request

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00352/Online%20Retail.xlsx"
output_file = RAW_DATA_PATH / "Online_Retail.xlsx"

if not output_file.exists():
    print("Downloading dataset... This may take a few minutes.")
    urllib.request.urlretrieve(url, output_file)
    print(f"✅ Dataset downloaded successfully to: {output_file}")
else:
    print(f"✅ Dataset already exists at: {output_file}")
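
The cell above skips the download whenever the file already exists, so an interrupted download could leave a truncated file that later runs treat as complete. One way to guard against that is sketched below, assuming a hypothetical helper named `download_file` (not part of the original notebook) that writes to a temporary `.part` file and renames it only after the transfer finishes. The 60-second timeout is an arbitrary illustrative value.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download_file(url: str, dest: Path, timeout: float = 60.0) -> Path:
    """Download `url` to `dest` via a temporary file (hypothetical helper)."""
    dest = Path(dest)
    if dest.exists():
        return dest  # reuse the existing copy, as the cell above does
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Create the temporary file next to the destination so the final
    # rename stays on the same filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=dest.parent, suffix=".part")
    tmp_path = Path(tmp_name)
    try:
        with open(fd, "wb") as out, urllib.request.urlopen(url, timeout=timeout) as response:
            shutil.copyfileobj(response, out)
        tmp_path.replace(dest)  # the final name appears only after a full download
    except Exception:
        tmp_path.unlink(missing_ok=True)  # discard the partial download
        raise
    return dest
```

With this helper, the download step might become `download_file(url, output_file)`, reusing the `url` and `output_file` variables defined in the cell above.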

## Load the Dataset

In [None]:
# Load the dataset (reading .xlsx files with pandas requires the openpyxl package)
print("Loading dataset...")
df = pd.read_excel(RAW_DATA_PATH / "Online_Retail.xlsx")
print("✅ Dataset loaded successfully!")
print(f"Shape: {df.shape}")
print(f"Rows: {df.shape[0]:,}")
print(f"Columns: {df.shape[1]}")

## Initial Data Inspection

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Display basic information
print("Dataset Information:")
df.info()

In [None]:
# Check column names
print("Column Names:")
print(df.columns.tolist())

In [None]:
# Basic statistics
print("Basic Statistics:")
df.describe()

In [None]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing_Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing_Count'] > 0])
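
The missing-value check above could also be wrapped in a small function so it can be rerun after each cleaning step. The sketch below assumes a hypothetical helper named `missing_summary` and uses a toy DataFrame made up for illustration. It is not drawn from the real dataset.

```python
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column (hypothetical helper)."""
    missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "Missing_Count": missing,
        "Percentage": missing / len(frame) * 100,
    })
    # Keep only columns that actually have gaps, worst first.
    summary = summary[summary["Missing_Count"] > 0]
    return summary.sort_values("Missing_Count", ascending=False)


# Toy data for illustration only.
toy = pd.DataFrame({
    "a": [1, None, 3, 4],
    "b": ["x", "y", None, None],
    "c": [1.0, 2.0, 3.0, 4.0],
})
print(missing_summary(toy))
```

In the notebook, the same call would be `missing_summary(df)`.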

In [None]:
# Check data types
print("Data Types:")
print(df.dtypes)

## Save Sample and CSV Copy

In [None]:
# Save the first 1,000 rows as a sample for quick testing
sample_df = df.head(1000)
sample_df.to_csv(RAW_DATA_PATH / "sample_data.csv", index=False)
print("✅ Sample data saved for quick testing")

# Save full dataset as CSV for easier loading
df.to_csv(RAW_DATA_PATH / "online_retail_raw.csv", index=False)
print("✅ Full dataset saved as CSV")
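
One caveat when reloading the CSV copy later: CSV stores no type information, so pandas infers dtypes again on read. The sketch below illustrates this on a toy frame using an in-memory buffer. The column names `InvoiceNo`, `InvoiceDate`, and `Quantity` and all values are assumptions chosen for illustration, so check them against `df.columns` before reusing the `dtype` and `parse_dates` arguments.

```python
import io

import pandas as pd

# Toy frame for illustration; names and values are assumptions, not real data.
toy = pd.DataFrame({
    "InvoiceNo": ["000123", "000456"],
    "InvoiceDate": pd.to_datetime(["2025-01-05 08:26", "2025-01-06 09:00"]),
    "Quantity": [6, -1],
})

# Write to an in-memory buffer instead of disk to keep the example self-contained.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Naive reload: identifiers are parsed as integers (leading zeros are lost)
# and dates come back as plain text.
buffer.seek(0)
naive = pd.read_csv(buffer)

# Reload with explicit types to preserve identifiers and timestamps.
buffer.seek(0)
typed = pd.read_csv(buffer, dtype={"InvoiceNo": str}, parse_dates=["InvoiceDate"])

print(naive.dtypes)
print(typed.dtypes)
```

The same arguments could be passed when reading `online_retail_raw.csv` in later notebooks, once the actual column names are confirmed.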

## Summary

**Dataset loaded successfully!**

Progress and next steps:
1. ✅ Dataset acquired and loaded
2. ⏭️ Move to notebook: `01_data_exploration.ipynb`
3. ⏭️ Perform detailed data profiling
4. ⏭️ Start data cleaning process