# Unit 1 Lab: Exploring Real-World Data (UK & US)

Welcome to your first hands-on lab. In this notebook you will:

- Load real CSV datasets from UK retail and US e-commerce.
- Explore the structure of the data (columns, types, missing values).
- Calculate simple summaries and compare UK vs US patterns.
- Answer short written questions, as you would in an exam or a job task.

You do **not** need any advanced Python yet. We will use simple commands and
focus on understanding *what the data is telling you*.

---

## 1. Setup

Run the cell below to import the libraries we need and define the dataset paths.

```python
import pandas as pd
from pathlib import Path

# Set base path relative to this notebook location
BASE_PATH = Path("..") / "datasets"

UK_RETAIL_PATH = BASE_PATH / "uk_retail_sales.csv"
US_ECOM_PATH = BASE_PATH / "us_ecommerce_orders.csv"

UK_RETAIL_PATH, US_ECOM_PATH
```

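**Task 1.1** – Run the cell and check that it displays the two file paths. If there is an error, read it carefully and fix it. Note that this cell only builds the paths and does not open the files, so a missing file will not raise an error until you load the data in Section 2.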

- If the path is wrong, check that the `data_science_pathway1/datasets/`
  folder exists and contains the two CSV files.
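
If you want to confirm the files are present before loading them, here is a minimal sketch. The helper name `missing_files` is made up for this example and is not part of the lab; it relies only on `Path.exists()` from the standard library.

```python
from pathlib import Path


def missing_files(paths):
    """Return the paths from `paths` that do not exist on disk."""
    return [p for p in paths if not Path(p).exists()]


# Same locations as in the setup cell above
base_path = Path("..") / "datasets"
expected = [
    base_path / "uk_retail_sales.csv",
    base_path / "us_ecommerce_orders.csv",
]

missing = missing_files(expected)
if missing:
    print("Missing files:", [str(p) for p in missing])
else:
    print("Both CSV files were found.")
```

If a file is reported as missing, `Path.cwd()` shows which folder the notebook is running from, which is often the cause.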

---

## 2. Load the datasets

### 2.1 Load UK retail sales

```python
uk_retail = pd.read_csv(UK_RETAIL_PATH)
uk_retail.head()
```

**Task 2.1** – After running the cell:

- How many **rows** and **columns** are there in `uk_retail`?
- In your own words, what does **one row** represent?

Write your answers here:

```python
# TODO: Run this cell, then complete the comment at the bottom
uk_rows, uk_cols = uk_retail.shape
print("Rows:", uk_rows, "Columns:", uk_cols)

# One row represents: ... (describe in plain English)
```

### 2.2 Load US e-commerce orders

```python
us_ecom = pd.read_csv(US_ECOM_PATH)
us_ecom.head()
```

**Task 2.2** – Answer similar questions for the US dataset:

- How many rows and columns?
- What does one row represent?
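
You can reuse the same pattern as in Task 2.1. The sketch below wraps it in a small helper so it works for any DataFrame. The helper name `shape_summary`, the `order_id` column and the toy values are assumptions made for illustration; in the notebook you would pass `us_ecom` instead of the toy table.

```python
import pandas as pd


def shape_summary(df, name):
    """Return a one-line description of a DataFrame's size."""
    rows, cols = df.shape
    return f"{name}: {rows} rows, {cols} columns"


# Tiny stand-in table with made-up values (not the real dataset)
toy_orders = pd.DataFrame(
    {
        "order_id": [1, 2, 3],
        "channel": ["Web", "Mobile", "Web"],
        "order_value": [20.0, 35.5, 12.25],
    }
)

print(shape_summary(toy_orders, "toy_orders"))
```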

---

## 3. Basic data understanding

### 3.1 Column types and missing values

```python
uk_retail.info()
```

```python
us_ecom.info()
```

**Task 3.1** – Look at the output and answer:

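1. Which columns are numbers, which are text, and which are dates? (Date columns often load as text, because `pd.read_csv` does not parse dates unless you ask it to.)
2. Do you see any obvious missing values or strange types? (Compare each column's `Non-Null Count` with the total number of entries.)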

Write a short summary:

```python
# TODO: Write a short summary in comments
# Example:
# - In uk_retail, sales_amount is float, transactions is int...
# - In us_ecom, order_value is float, etc.
```
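
As an optional cross-check on what `info()` shows, the sketch below counts missing values per column and converts a text column to a real date type. It uses a toy table with made-up values; the `order_date` column name is an assumption, so check the actual column names in your data first.

```python
import pandas as pd

# Toy table with made-up values and two deliberate gaps
toy = pd.DataFrame(
    {
        "order_date": ["2024-01-05", "2024-01-06", None],
        "channel": ["Web", "Mobile", "Web"],
        "order_value": [20.0, None, 12.25],
    }
)

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Dates stored as text are not treated as dates until converted
print(toy.dtypes)
toy["order_date"] = pd.to_datetime(toy["order_date"])
print(toy.dtypes)
```

After the conversion, the missing date is stored as `NaT` (pandas' marker for a missing timestamp).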

---

## 4. Simple summaries and comparisons

### 4.1 Total and average sales

```python
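# describe() gives count, mean, std, min, quartiles and max, but not the total
uk_summary = uk_retail["sales_amount"].describe()
us_summary = us_ecom["order_value"].describe()
uk_total = uk_retail["sales_amount"].sum()  # totals need sum()
us_total = us_ecom["order_value"].sum()

uk_summary, us_summary, uk_total, us_total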
```

**Task 4.1** – Based on the output, compare UK vs US:

- Which dataset has the higher **average** order/sales value?
- Which one is more **variable** (look at std / min / max)?

Write your interpretation in plain language, as if explaining to a manager.

```python
# TODO: Describe the comparison in 3–5 sentences.
```
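
A note on comparing variability: standard deviation depends on the scale of the values, so the raw `std` of two datasets can mislead if they differ in currency or typical order size (whether that applies here is something to check in the data). One common option is the coefficient of variation, which is the standard deviation divided by the mean. The sketch below uses made-up numbers, and the helper name `coefficient_of_variation` is an assumption for this example.

```python
import pandas as pd


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (a unitless measure of spread)."""
    return values.std() / values.mean()


# Made-up values: the second series is the first multiplied by 10
small_orders = pd.Series([10.0, 20.0, 30.0])
large_orders = pd.Series([100.0, 200.0, 300.0])

# The raw std differs by a factor of 10, but the relative spread is the same
print("std:", small_orders.std(), large_orders.std())
print(
    "CV: ",
    coefficient_of_variation(small_orders),
    coefficient_of_variation(large_orders),
)
```

In the notebook you could call the helper on `uk_retail["sales_amount"]` and `us_ecom["order_value"]`.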

### 4.2 Grouped summaries

Next, total the sales within each group: product category for the UK data and channel for the US data.

```python
# UK: revenue by product category
uk_by_category = uk_retail.groupby("product_category")["sales_amount"].sum().reset_index()
uk_by_category
```

```python
# US: revenue by channel
us_by_channel = us_ecom.groupby("channel")["order_value"].sum().reset_index()
us_by_channel
```

**Task 4.2** – Answer:

- In the UK dataset, which product category has the highest total sales?
- In the US dataset, which channel (Web/Mobile/etc.) has the highest revenue?
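
With only a few groups you can read the answer off the table, but sorting makes it easier to see. Here is a minimal sketch on a toy table (the category names and amounts are made up) that sorts the grouped totals and picks out the top group with `idxmax()`.

```python
import pandas as pd

# Toy data with made-up categories and amounts
toy_uk = pd.DataFrame(
    {
        "product_category": ["Toys", "Books", "Toys", "Garden"],
        "sales_amount": [120.0, 80.0, 60.0, 150.0],
    }
)

# Total per category, largest first
by_category = (
    toy_uk.groupby("product_category")["sales_amount"]
    .sum()
    .sort_values(ascending=False)
)
print(by_category)

# Label of the category with the highest total
top_category = by_category.idxmax()
print("Top category:", top_category)
```

The same pattern should work for the US data by swapping in `us_ecom`, `"channel"` and `"order_value"`.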

---

## 5. Short written reflection (exam-style)

Answer the questions below using **full sentences**. Imagine you are
preparing for an exam or a job interview where the assessor cares about how
clearly you communicate.

1. Why is it important to understand *what one row represents* in a dataset?
2. Give one practical UK example and one US example of a business question
   you could answer using these datasets.
3. What would you ask the business stakeholder **before** doing deeper
   analysis on this data?

You can write your answers in a Markdown cell or in comments inside a code
cell – choose whichever you prefer.

---

## 6. Optional extension

If you have time, try to:

- Create a new column for UK data: `average_sale_per_transaction`.
- Filter the US data to show only `Mobile` channel orders.
- Save a filtered version of each dataset to a new CSV file.
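
If you get stuck, the sketch below shows one possible approach on toy tables with made-up values. It assumes the UK data has `sales_amount` and `transactions` columns and that the US channel is labelled exactly `Mobile`; the output file name is also made up. The file is written to a temporary folder so nothing is left behind, whereas in the notebook you would choose your own path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy UK data: new column = sales divided by number of transactions
toy_uk = pd.DataFrame({"sales_amount": [200.0, 90.0], "transactions": [4, 3]})
toy_uk["average_sale_per_transaction"] = (
    toy_uk["sales_amount"] / toy_uk["transactions"]
)

# Toy US data: keep only the rows where channel is "Mobile"
toy_us = pd.DataFrame(
    {"channel": ["Web", "Mobile", "Mobile"], "order_value": [20.0, 35.5, 12.25]}
)
mobile_only = toy_us[toy_us["channel"] == "Mobile"]

# Save the filtered rows to CSV, then read them back to confirm the result
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "us_mobile_orders.csv"
    mobile_only.to_csv(out_path, index=False)  # index=False omits the row labels
    reloaded = pd.read_csv(out_path)

print(toy_uk)
print(reloaded)
```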

These are the kinds of small, practical steps you will perform repeatedly
throughout your data science journey.