# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [1]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
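
The difference between full and partial matching in step 2 is easy to miss, so here is a minimal sketch of it. The pattern is the one given above; the `check` helper and the sample strings are illustrative assumptions, not part of the required solution.

```python
import re

# Same pattern as in step 2 of the task list.
EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def check(raw: str) -> tuple[bool, bool]:
    """Hypothetical helper: return (partial_match, full_match) for the stripped input."""
    text = raw.strip()
    partial = EMAIL_REGEX.search(text) is not None     # pattern found anywhere in the string
    full = EMAIL_REGEX.fullmatch(text) is not None     # the whole string must be an email
    return partial, full


# Illustrative inputs only; the last one embeds a valid email inside other text.
candidates = [
    "alice@example.com",
    "  sara@mail.co  ",
    "bob[at]example.com",
    "x alice@example.com y",
]

results = {raw: check(raw) for raw in candidates}
for raw, (partial, full) in results.items():
    print(f"{raw!r}: search={partial}, fullmatch={full}")
```

The last candidate passes `search` but fails `fullmatch`, which is why the task asks for full matching after `strip()`.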

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [3]:
import re
import csv
from pathlib import Path

EMAIL_REGEX = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# ------------------------------
# 1️⃣ Phone normalization
# ------------------------------
def normalize_phone(raw: str) -> str:
    """Keep only digits; return last 10 digits if >=10, else empty."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""

# ------------------------------
# 2️⃣ Email cleaning/validation
# ------------------------------
def clean_email(email: str) -> str:
    """Strip whitespace and angle brackets; return the email if valid, else ""."""
    # Validate only: malformed addresses such as 'bob[at]example.com' are rejected, not repaired.
    email = email.strip().strip("<>").strip()
    return email if EMAIL_REGEX.fullmatch(email) else ""

# ------------------------------
# 3️⃣ Parse a single line into structured data
# ------------------------------
def parse_line(line: str) -> dict | None:
    """
    Parse a line like 'Name <email>, phone' into a dict.
    Returns None if email is invalid.
    """
    parts = [p.strip() for p in line.split(",") if p.strip()]
    if len(parts) < 2:
        return None

    # Extract name and email
    if "<" in parts[0] and ">" in parts[0]:
        name = parts[0][:parts[0].find("<")].strip().strip('"')
        email = parts[0][parts[0].find("<")+1 : parts[0].find(">")]
    else:
        name = parts[0].strip().strip('"')
        email = parts[1]

    email = clean_email(email)
    if not email:
        return None

    # Extract phone (last part)
    phone = normalize_phone(parts[-1])

    return {"name": name, "email": email, "phone": phone}

# ------------------------------
# 4️⃣ Deduplicate rows by email (case-insensitive)
# ------------------------------
def deduplicate_rows(rows: list[dict]) -> list[dict]:
    """Drop later rows whose email repeats (case-insensitive); keep input order."""
    seen = set()
    result = []
    for row in rows:
        key = row["email"].casefold()
        if key not in seen:
            seen.add(key)
            result.append(row)
    return result

# ------------------------------
# 5️⃣ Parse multiple lines of text
# ------------------------------
def parse_text(text: str) -> list[dict]:
    """Parse every line, skip rows with invalid emails, then deduplicate by email."""
    rows = [parse_line(line) for line in text.splitlines()]
    rows = [row for row in rows if row is not None]
    return deduplicate_rows(rows)

# ------------------------------
# 6️⃣ Write cleaned rows to CSV
# ------------------------------
def write_csv(rows: list[dict], output_file: Path) -> None:
    """Write rows to a UTF-8 CSV with the header name,email,phone."""
    with output_file.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(rows)

# ------------------------------
# Optional: convenience function to read file + process
# ------------------------------
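def process_file(input_file: Path, output_file: Path) -> None:
    """Read the raw file, clean it, and write the CSV; print a message if the input is missing."""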
    try:
        text = input_file.read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"⚠️ File not found: {input_file}")
        return

    rows = parse_text(text)
    write_csv(rows, output_file)
    print(f"✅ Cleaned data written to {output_file}")

process_file(Path("contacts_raw.txt"), Path("contacts_clean.csv"))


✅ Cleaned data written to contacts_clean.csv
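
One way to sanity-check the output format is to read the CSV back with `csv.DictReader` and inspect the header and row order. The sketch below is self-contained and uses a temporary directory with hypothetical sample rows (`sample_rows`), so it does not depend on the file written above.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical sample rows, shaped like the cleaned output described in Q1.
sample_rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "contacts_clean.csv"

    # Write the rows with the exact column order required by the task.
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(sample_rows)

    # Read the file back to inspect the header and the row order.
    with csv_path.open(newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        header = reader.fieldnames
        rows_back = list(reader)

print(header)
print(rows_back == sample_rows)
```

Opening the file with `newline=""` in both directions follows the `csv` module's recommendation and avoids blank lines between rows on Windows.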


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
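
The list above is a minimum. As an optional extra, the graceful missing-file handling from Q1 can also be tested without touching real data. The sketch below assumes a hypothetical helper named `read_contacts` (not part of the required solution) and uses `tempfile` plus `contextlib.redirect_stdout` to capture the friendly message.

```python
import io
import tempfile
import unittest
from contextlib import redirect_stdout
from pathlib import Path
from typing import Optional


def read_contacts(path: Path) -> Optional[str]:
    """Hypothetical helper: return the file text, or None if the file is missing."""
    try:
        return path.read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"File not found: {path}")
        return None


class TestReadContacts(unittest.TestCase):
    def test_missing_file_returns_none(self):
        with tempfile.TemporaryDirectory() as tmp:
            missing = Path(tmp) / "does_not_exist.txt"
            buffer = io.StringIO()
            # Capture printed output so the test can check the message.
            with redirect_stdout(buffer):
                result = read_contacts(missing)
        self.assertIsNone(result)
        self.assertIn("File not found", buffer.getvalue())

    def test_existing_file_is_read(self):
        line = "Sara M. , sara@mail.co , 214 555 8888\n"
        with tempfile.TemporaryDirectory() as tmp:
            path = Path(tmp) / "contacts_raw.txt"
            path.write_text(line, encoding="utf-8")
            self.assertEqual(read_contacts(path), line)


if __name__ == "__main__":
    unittest.main(argv=["first-arg-is-ignored"], exit=False)
```

Working inside a temporary directory keeps the tests independent of whatever files happen to sit beside the notebook.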

In [None]:
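import unittest

# clean_email, normalize_phone, parse_text, and deduplicate_rows are defined in the Q1 cell above.
# In a standalone test_crm_cleanup.py, import them from q1_crm_cleanup instead.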

class TestCRMCleanup(unittest.TestCase):

    # -------------------------------------------------------------
    # 1️⃣ Email validation tests
    # -------------------------------------------------------------
    def test_valid_emails(self):
        valid_emails = [
            "alice@example.com",
            "bob.smith@company.co",
            "user+alias@domain.org",
            "UPPERCASE@EXAMPLE.IO",
        ]
        for email in valid_emails:
            with self.subTest(email=email):
                self.assertEqual(clean_email(email), email)

    def test_invalid_emails(self):
        invalid_emails = [
            "no-at-symbol.com",
            "user@.com",
            "bad@email",
            "another@@example.com",
            "email@domain,com",
        ]
        for email in invalid_emails:
            with self.subTest(email=email):
                self.assertEqual(clean_email(email), "")

    # -------------------------------------------------------------
    # 2️⃣ Phone normalization tests
    # -------------------------------------------------------------
    def test_phone_normalization(self):
        cases = {
            "(469) 555-1234": "4695551234",
            "+1-972-555-7777": "9725557777",
            "214.555.8888": "2145558888",
            "972 777 2121": "9727772121",
            "555-999": "",  # too short
        }
        for raw, expected in cases.items():
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)

    # -------------------------------------------------------------
    # 3️⃣ Parsing logic from multi-line string
    # -------------------------------------------------------------
    def test_parsing_and_structure(self):
        text = """Alice Johnson <alice@example.com> , +1 (469) 555-1234
Sara M. , sara@mail.co , 214 555 8888
invalid_user , user@invalid , 555-000-0000
duplicate <Alice@Example.com> , 469 555 1234
"Mehdi A." <mehdi.ay@example.org> , (469)555-9999
"""
        rows = parse_text(text)
        expected = [
            {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
            {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
            {"name": "Mehdi A.", "email": "mehdi.ay@example.org", "phone": "4695559999"},
        ]
        self.assertEqual(rows, expected)

    # -------------------------------------------------------------
    # 4️⃣ De-duplication by case-insensitive email
    # -------------------------------------------------------------
    def test_deduplication_case_insensitive(self):
        data = [
            {"name": "Alice", "email": "Alice@Example.com", "phone": "123"},
            {"name": "Duplicate", "email": "alice@example.COM", "phone": "999"},
            {"name": "Bob", "email": "bob@example.com", "phone": "111"},
        ]
        deduped = deduplicate_rows(data)
        expected = [
            {"name": "Alice", "email": "Alice@Example.com", "phone": "123"},
            {"name": "Bob", "email": "bob@example.com", "phone": "111"},
        ]
        self.assertEqual(deduped, expected)


# Run tests inside Jupyter or normal Python
if __name__ == "__main__":
    unittest.main(argv=["first-arg-is-ignored"], exit=False, verbosity=2)


test_deduplication_case_insensitive (__main__.TestCRMCleanup.test_deduplication_case_insensitive) ... ok
test_invalid_emails (__main__.TestCRMCleanup.test_invalid_emails) ... ok
test_parsing_and_structure (__main__.TestCRMCleanup.test_parsing_and_structure) ... ok
test_phone_normalization (__main__.TestCRMCleanup.test_phone_normalization) ... ok
test_valid_emails (__main__.TestCRMCleanup.test_valid_emails) ... ok

----------------------------------------------------------------------
Ran 5 tests in 0.005s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
