# Unit 3 Lab: Cleaning & Exploring UK Retail Data with Pandas

In this lab you will:

- Load the **UK retail** dataset from CSV.
- Inspect and clean the data (missing values, duplicates, formats).
- Create grouped summaries for business questions.
- Prepare a clean table ready for visualisation.

You will use `pandas` throughout: loading, cleaning and summarising tables like this one is routine, day-to-day work for data analysts.

---

## 1. Load the dataset

```python
import pandas as pd
from pathlib import Path

BASE_PATH = Path("..") / "datasets"
UK_RETAIL_PATH = BASE_PATH / "uk_retail_sales.csv"

uk = pd.read_csv(UK_RETAIL_PATH)
uk.head()
```

**Task 1.1** – Run the cell and then:

- Print `uk.shape`.
- Describe in words what one row represents.

---

## 2. Basic data quality checks

```python
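# Column names, dtypes and non-null counts (printed directly)
uk.info()
# Summary statistics for numeric columns (shown as the cell's output)
uk.describe()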
```

**Task 2.1** – Check:

- Are any columns clearly numeric but stored as text?
- Are there any obvious missing values?

Write a short Markdown cell with your observations.
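
One way to check for numbers stored as text is to try parsing each text column and see how much of it converts. The sketch below runs on a small made-up table, not the lab dataset; the helper name `numeric_like_text_columns` and the 0.5 threshold are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy frame; the real dataset's columns and values may differ.
toy = pd.DataFrame({
    "region": ["North", "South", "North"],
    "sales_amount": ["1200.50", "980", "n/a"],  # numbers stored as text
    "transactions": [10, 8, 12],
})


def numeric_like_text_columns(df, threshold=0.5):
    """Return text columns where at least `threshold` of the
    non-missing values can be parsed as numbers."""
    flagged = []
    for col in df.columns:
        is_text = (
            pd.api.types.is_object_dtype(df[col])
            or pd.api.types.is_string_dtype(df[col])
        )
        if not is_text:
            continue  # already numeric, datetime, etc.
        non_missing = df[col].dropna()
        if non_missing.empty:
            continue
        # Values that cannot be parsed become NaN instead of raising.
        parsed = pd.to_numeric(non_missing, errors="coerce")
        if parsed.notna().mean() >= threshold:
            flagged.append(col)
    return flagged


flagged = numeric_like_text_columns(toy)
print(flagged)
```

Calling `numeric_like_text_columns(uk)` on the lab data would give you a list of candidate columns to inspect by hand before converting anything.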

---

## 3. Handling missing values & duplicates

```python
# Count missing values per column
uk.isna().sum()
```

**Task 3.1** – Decide how to handle missing values:

- For numeric columns like `sales_amount`, decide whether to:
  - Drop rows, or
  - Fill with 0, or
  - Fill with a sensible value (e.g. median).

Implement your choice and briefly justify it in comments.

```python
# Example pattern (adapt as needed)
# uk["sales_amount"] = uk["sales_amount"].fillna(uk["sales_amount"].median())
```
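
To help justify your choice, it can be useful to see how each option changes the summary numbers. The following is a minimal sketch on invented values (not taken from the lab dataset) that compares dropping, filling with 0, and filling with the median.

```python
import pandas as pd

# Hypothetical values, including one missing entry and one large outlier.
sales = pd.Series([100.0, 120.0, None, 110.0, 900.0], name="sales_amount")

dropped = sales.dropna()
filled_zero = sales.fillna(0)
filled_median = sales.fillna(sales.median())

# One row per strategy, so the effects can be compared side by side.
comparison = pd.DataFrame(
    {
        "rows": [len(dropped), len(filled_zero), len(filled_median)],
        "mean": [dropped.mean(), filled_zero.mean(), filled_median.mean()],
        "total": [dropped.sum(), filled_zero.sum(), filled_median.sum()],
    },
    index=["drop", "fill_0", "fill_median"],
)
print(comparison)
```

In this toy case, dropping and filling with 0 give the same total but different means, while filling with the median raises the total. Which effect is acceptable depends on the business question you are answering.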

Then:

```python
# Remove any exact duplicate rows
before = len(uk)
uk = uk.drop_duplicates()
after = len(uk)
print("Removed", before - after, "duplicate rows")
```
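
Before dropping duplicates it is often worth looking at them. The sketch below uses a made-up table and assumes, purely for illustration, that `store_id` plus `date` should identify a row; the lab dataset may use a different key.

```python
import pandas as pd

# Hypothetical toy data: rows 0 and 1 are exact copies,
# rows 2 and 3 share a store and date but disagree on sales.
toy = pd.DataFrame({
    "store_id": [1, 1, 2, 2],
    "date": ["2023-01-01", "2023-01-01", "2023-01-01", "2023-01-01"],
    "sales_amount": [100.0, 100.0, 250.0, 260.0],
})

# keep=False marks every member of a duplicate group, not just the repeats.
exact_dupes = toy[toy.duplicated(keep=False)]

# subset=... compares only the assumed key columns.
key_dupes = toy[toy.duplicated(subset=["store_id", "date"], keep=False)]

deduped = toy.drop_duplicates()

print(len(exact_dupes), "rows are part of an exact duplicate group")
print(len(key_dupes), "rows share a store_id and date")
print(len(deduped), "rows remain after drop_duplicates()")
```

Rows that share a key but differ in value are not removed by `drop_duplicates()`; they usually need a decision from someone who knows the data.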

---

## 4. Creating useful features

**Task 4.1** – Create new columns such as:

- `year` from the `date` column.
- `month` from the `date` column.

Hint: convert `date` to datetime first.

```python
uk["date"] = pd.to_datetime(uk["date"])
uk["year"] = uk["date"].dt.year
uk["month"] = uk["date"].dt.month
uk.head()
```
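
UK data is often recorded day-first (for example `03/04/2023` meaning 3 April), which is worth checking before trusting `year` and `month`. The lab does not state how `date` is formatted, so the sketch below assumes `DD/MM/YYYY` text on invented values; adapt the format string to what you actually see in `uk["date"]`.

```python
import pandas as pd

# Hypothetical raw values, assuming day/month/year text.
raw = pd.Series(["03/04/2023", "25/12/2023", "not a date"])

# An explicit format avoids day/month ambiguity;
# errors="coerce" turns unparseable entries into NaT instead of raising.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")

n_failed = parsed.isna().sum() - raw.isna().sum()
print(parsed)
print("Entries that failed to parse:", n_failed)
```

Counting the failures tells you how many rows would end up without a `year` or `month`, so you can decide whether to fix or drop them.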

---

## 5. Business-focused summaries

Answer questions a UK retail manager might ask.

### 5.1 Sales by product category

```python
sales_by_cat = (
    uk.groupby("product_category")["sales_amount"]
      .sum()
      .reset_index()
      .sort_values("sales_amount", ascending=False)
)
sales_by_cat
```

**Task 5.1** – In words:

- Which category performs best?
- Which one might need attention?
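
A total on its own can hide whether a category is large because of many small sales or a few big ones. As a possible extension, the sketch below (on invented data, with illustrative column names such as `share_of_total`) computes several statistics per category in one pass.

```python
import pandas as pd

# Hypothetical toy data, not the lab dataset.
toy = pd.DataFrame({
    "product_category": ["Food", "Food", "Toys", "Clothing"],
    "sales_amount": [200.0, 300.0, 100.0, 400.0],
})

summary = (
    toy.groupby("product_category")["sales_amount"]
       # Named aggregation: new_column_name="aggregation function"
       .agg(total_sales="sum", mean_sale="mean", n_rows="count")
       .sort_values("total_sales", ascending=False)
       .reset_index()
)

# Each category's fraction of overall sales.
summary["share_of_total"] = summary["total_sales"] / summary["total_sales"].sum()
print(summary)
```

The same pattern should work on `uk` by swapping `toy` for `uk`, provided the column names match.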

### 5.2 Sales by region

```python
sales_by_region = (
    uk.groupby("region")["sales_amount"]
      .sum()
      .reset_index()
      .sort_values("sales_amount", ascending=False)
)
sales_by_region
```

**Task 5.2** – Suggest one possible reason why a particular region might have
higher or lower total sales than the others (you can invent a realistic story).

---

## 6. Preparing a clean table

Create a **final cleaned dataset** with only the columns you need for
visualisation:

```python
cols = [
    "country", "region", "store_id", "date",
    "year", "month", "product_category", "sales_amount", "transactions",
]

uk_clean = uk[cols].copy()
uk_clean.head()
```

Save it to a new CSV:

```python
OUTPUT_PATH = BASE_PATH / "uk_retail_sales_clean.csv"
uk_clean.to_csv(OUTPUT_PATH, index=False)
OUTPUT_PATH
```

This file will be useful again in Unit 6 (visualisation) and in your
capstone projects.
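
CSV files do not store column types, so when you reload the file later the `date` column comes back as text unless you ask pandas to parse it. The sketch below illustrates this with a tiny made-up table written to a temporary folder; the file name `toy_clean.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the cleaned table.
toy_clean = pd.DataFrame({
    "date": pd.to_datetime(["2023-01-01", "2023-01-02"]),
    "region": ["North", "South"],
    "sales_amount": [100.0, 250.5],
})

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "toy_clean.csv"
    toy_clean.to_csv(path, index=False)

    # Without parse_dates the date column is read back as text.
    plain = pd.read_csv(path)
    # With parse_dates it is converted to datetime on load.
    reloaded = pd.read_csv(path, parse_dates=["date"])

print("plain:   ", plain["date"].dtype)
print("reloaded:", reloaded["date"].dtype)
```

When you load the cleaned file in a later unit, passing `parse_dates=["date"]` to `pd.read_csv` should save you from repeating the conversion step.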

---

## 7. Reflection

In a short Markdown cell, answer:

1. Which cleaning step do you think is **most important** for trustworthy
   analysis, and why?
2. How would you explain to a non-technical UK retail manager what you did
   in this notebook?
