# Data Engineering Interview Exercise Notebook
 
---
 
## 📖 Scenario
 
You are a Data Engineer at an international e-commerce company. Your task is to build a data pipeline that processes daily transaction data, enriches it with customer information, and produces an aggregated report to identify top-spending customers.
 
You have been provided with two CSV files:
- `transactions.csv`: Transaction details (`transaction_id`, `customer_id`, `product_id`, `quantity`, `price`, `date`).
- `customers.csv`: Customer details (`customer_id`, `name`, `email`, `join_date`).
 
---

## 🛠️ Step-by-Step Instructions
 
Complete each step below. You may choose either Pandas or PySpark for this exercise.
 
### ⚙️ Step 1: Data Ingestion
- Load both CSV files into DataFrames.
 
**Hint**: 
- Pandas: `pd.read_csv()`
- PySpark: `spark.read.csv()`

In [None]:
# Import for Pandas users
import pandas as pd

In [None]:
# Imports for PySpark users
from pyspark.sql import SparkSession
# Alias Spark's sum so it does not shadow Python's built-in sum()
from pyspark.sql.functions import col, sum as spark_sum

# getOrCreate() is required: without it, `spark` is only a Builder, not a session
spark = SparkSession.builder.appName("EcommerceETL").getOrCreate()

In [None]:
# Your code here
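
A minimal Pandas sketch of the ingestion step, for reference. The sample rows are hypothetical and stand in for the real files so the cell runs on its own; in the exercise you would pass the file paths (`"transactions.csv"`, `"customers.csv"`) to `pd.read_csv()` instead. Parsing the date columns up front is an assumption about what later steps may need, not a requirement of the task.

```python
import io

import pandas as pd

# Hypothetical sample rows (not the real data), matching the column
# lists given in the scenario.
TRANSACTIONS_CSV = """transaction_id,customer_id,product_id,quantity,price,date
1,101,P1,2,10.0,2024-01-01
2,102,P2,1,25.5,2024-01-01
3,101,P3,3,5.0,2024-01-02
"""
CUSTOMERS_CSV = """customer_id,name,email,join_date
101,Alice,alice@example.com,2023-05-01
102,Bob,bob@example.com,2023-06-15
"""

# With the real files: pd.read_csv("transactions.csv", parse_dates=["date"])
transactions = pd.read_csv(io.StringIO(TRANSACTIONS_CSV), parse_dates=["date"])
customers = pd.read_csv(io.StringIO(CUSTOMERS_CSV), parse_dates=["join_date"])

# Quick look at the first rows and the inferred column types
print(transactions.head())
print(customers.dtypes)
```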

### Check the output
 
**Hint**: 
- Pandas: `head()`
- PySpark: `show()`


In [None]:
# Your code here

### 🧹 Step 2: Data Cleaning
- Remove duplicates from both datasets.
 
**Hint**: 
- Pandas: `drop_duplicates()`
- PySpark: `dropDuplicates()`

In [None]:
# Your code here
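
One possible Pandas approach to the cleaning step, shown on hypothetical in-memory data. Exact duplicate rows are dropped from the transactions. For customers, the sketch additionally assumes that `customer_id` should be unique and keeps the first row per ID; whether that rule is right depends on the real data and is worth stating in an interview.

```python
import pandas as pd

# Hypothetical data with one repeated transaction row and one repeated customer
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 2, 3],
    "customer_id": [101, 102, 102, 101],
    "product_id": ["P1", "P2", "P2", "P3"],
    "quantity": [2, 1, 1, 3],
    "price": [10.0, 25.5, 25.5, 5.0],
    "date": ["2024-01-01", "2024-01-01", "2024-01-01", "2024-01-02"],
})
customers = pd.DataFrame({
    "customer_id": [101, 102, 102],
    "name": ["Alice", "Bob", "Bob"],
    "email": ["alice@example.com", "bob@example.com", "bob@example.com"],
    "join_date": ["2023-05-01", "2023-06-15", "2023-06-15"],
})

# Drop rows that are identical across every column
transactions_clean = transactions.drop_duplicates().reset_index(drop=True)

# Assumption: one row per customer_id; keep the first occurrence
customers_clean = (
    customers.drop_duplicates(subset="customer_id", keep="first")
    .reset_index(drop=True)
)

print(len(transactions), "->", len(transactions_clean))
print(len(customers), "->", len(customers_clean))
```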

### 🔗 Step 3: Data Joining
- Join transaction data with customer data on `customer_id`.
 
**Hint**:
- Pandas: `merge()`
- PySpark: `.join()`

In [None]:
# Your code here
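
A sketch of the join in Pandas, again on hypothetical data. It uses a left join so that transactions without a matching customer are kept rather than silently dropped; an inner join is an equally valid choice if unmatched transactions should be excluded. The `validate="many_to_one"` argument makes `merge()` raise an error if `customer_id` is not unique in the customer table.

```python
import pandas as pd

# Hypothetical data: customer 103 has a transaction but no customer record
transactions = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "customer_id": [101, 102, 101, 103],
    "quantity": [2, 1, 3, 1],
    "price": [10.0, 25.5, 5.0, 8.0],
})
customers = pd.DataFrame({
    "customer_id": [101, 102],
    "name": ["Alice", "Bob"],
    "email": ["alice@example.com", "bob@example.com"],
})

# Left join keeps every transaction; validate guards against duplicate keys
# on the customer side, which would otherwise multiply rows.
enriched = transactions.merge(
    customers,
    on="customer_id",
    how="left",
    validate="many_to_one",
)

# Transactions with no matching customer show up with a missing name
unmatched = enriched[enriched["name"].isna()]
print(enriched)
print("Unmatched transactions:", len(unmatched))
```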

### 📊 Step 4: Data Aggregation
- Calculate the total amount spent (`quantity * price`) per customer.
- Find the top 5 customers based on total spending.
 
**Hint**:
- Pandas: Use `groupby()` and aggregate with `.sum()`
- PySpark: Use `groupBy()` and aggregation functions (`sum()`)

In [None]:
# Your code here
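
The aggregation can be sketched in Pandas as follows. The input is a hypothetical already-joined table with six customers, so that the top-5 cut actually removes one. The column name `total_spent` is an illustrative choice, not something the exercise prescribes.

```python
import pandas as pd

# Hypothetical joined data (transactions enriched with customer names)
enriched = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 104, 105, 106],
    "name": ["Alice", "Alice", "Bob", "Cara", "Dan", "Eve", "Finn"],
    "quantity": [2, 3, 1, 4, 1, 2, 5],
    "price": [10.0, 5.0, 25.5, 20.0, 3.0, 50.0, 9.0],
})

# Amount per transaction line
enriched["total_spent"] = enriched["quantity"] * enriched["price"]

# Sum per customer, sort descending, keep the five largest totals
top_customers = (
    enriched.groupby(["customer_id", "name"], as_index=False)["total_spent"]
    .sum()
    .sort_values("total_spent", ascending=False)
    .head(5)
    .reset_index(drop=True)
)

print(top_customers)
```

Grouping by both `customer_id` and `name` keeps the name in the report while still aggregating per customer, assuming each ID maps to a single name.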

### 📁 Step 5: Export Results
- Export the aggregated results as a CSV file named `top_customers.csv`.
 
**Hint**:
- Pandas: `to_csv()`
- PySpark: `write.csv()`

In [None]:
# Your code here
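
A short Pandas sketch of the export step. The data is hypothetical, and the file is written to a temporary directory only so the example leaves nothing behind; in the exercise you would write `top_customers.csv` to your working directory. Reading the file back is an optional check, not part of the task.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical aggregated result
top_customers = pd.DataFrame({
    "customer_id": [105, 103, 106],
    "name": ["Eve", "Cara", "Finn"],
    "total_spent": [100.0, 80.0, 45.0],
})

# Temporary directory used for illustration only
out_path = Path(tempfile.mkdtemp()) / "top_customers.csv"

# index=False avoids writing the DataFrame index as an extra column
top_customers.to_csv(out_path, index=False)

# Optional check: read the file back and inspect it
round_trip = pd.read_csv(out_path)
print(round_trip)
```

Note that PySpark's `write.csv()` behaves differently: it writes a directory containing one or more part files rather than a single CSV file.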

In [None]:
import pandas as pd
import pyspark
from pyspark.sql import SparkSession

# Start Spark session
spark = SparkSession.builder.appName("InterviewApp").getOrCreate()

# Example DataFrame
df = spark.createDataFrame([
    (1, "Alice", 100),
    (2, "Bob", 200)
], ["id", "name", "amount"])

df.show()