
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [None]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
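
Items 2 and 3 above hinge on two small details: full versus partial matching, and the "last 10 digits" rule. The snippet below is a minimal sketch of both, not the required solution. The helper names `is_full_email` and `last_ten_digits` are made up for illustration; only the regex and the `re.sub(r"\D", "", raw)` call come from the task description.

```python
import re

# Pattern quoted from the task description.
EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def is_full_email(text: str) -> bool:
    """Hypothetical helper: True only if the whole stripped string is an email."""
    return re.fullmatch(EMAIL_PATTERN, text.strip()) is not None


def last_ten_digits(raw: str) -> str:
    """Hypothetical helper: drop non-digits, keep the last 10, else return ''."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


candidate = "alice@example.com, extra"
# re.search accepts the string because an email appears somewhere inside it.
print(re.search(EMAIL_PATTERN, candidate) is not None)  # True
# re.fullmatch rejects it because the trailing ", extra" is not part of an email.
print(is_full_email(candidate))                         # False
print(is_full_email("  alice@example.com  "))           # True (whitespace stripped)

print(last_ten_digits("+1 (469) 555-1234"))             # 4695551234
print(repr(last_ten_digits("972-555-777")))             # '' (only 9 digits)
```

Using `re.fullmatch` avoids having to add `^` and `$` anchors to the pattern by hand.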

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
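
The first rubric item asks for `pathlib` plus graceful handling of a missing file. One possible shape for that is sketched below. The function name `read_lines_or_empty` and the wording of the message are assumptions made for this example, and a temporary directory is used so the sketch does not touch the real `contacts_raw.txt`.

```python
import tempfile
from pathlib import Path


def read_lines_or_empty(path_like) -> list[str]:
    """Hypothetical helper: return the file's lines, or [] if it is missing."""
    path = Path(path_like)
    try:
        with path.open(encoding="utf-8") as f:
            return [line.rstrip("\n") for line in f]
    except FileNotFoundError:
        # Friendly message instead of a traceback; the caller gets an empty list.
        print(f"Could not find {path.name}; nothing to clean.")
        return []


with tempfile.TemporaryDirectory() as tmp:
    missing = Path(tmp) / "no_such_file.txt"
    present = Path(tmp) / "contacts_raw.txt"
    present.write_text("Sara M. , sara@mail.co , 214 555 8888\n", encoding="utf-8")

    missing_lines = read_lines_or_empty(missing)   # prints the friendly message
    present_lines = read_lines_or_empty(present)

print(missing_lines)
print(present_lines)
```

Returning an empty list (rather than `None`) lets the rest of a pipeline loop over the result without an extra check; either choice is reasonable as long as the program does not crash.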

In [None]:
# Write your answer here
# q1_crm_cleanup.py
"""
CRM Cleanup @ DalaShop
Reads a messy contacts_raw.txt file, validates emails, normalizes phones,
removes duplicates (by email, case-insensitive), and writes cleaned CSV.
"""

import re
import csv
from pathlib import Path


def is_valid_email(email: str) -> bool:
    """Return True if email matches the given regex pattern exactly."""
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    email = email.strip()
    return re.fullmatch(pattern, email) is not None


def normalize_phone(raw: str) -> str:
    """Remove all non-digit characters and keep last 10 digits if length ≥ 10."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""


def parse_contact_line(line: str):
    """Parse 'Name <email> , phone' or 'Name , email , phone'.

    Returns a dict with name/email/phone, or None if the line cannot be
    parsed or the email is invalid.
    """
    line = line.strip()
    # Form 1: Name <email> , phone  (the name may be wrapped in quotes)
    match = re.fullmatch(r"(.*?)<([^<>]*)>\s*,\s*(.*)", line)
    if match:
        name, email, phone = match.groups()
    else:
        # Form 2: Name , email , phone
        parts = line.split(",", 2)
        if len(parts) < 3:
            return None
        name, email, phone = parts

    name = name.strip().strip('"').strip()
    email = email.strip()
    if not is_valid_email(email):
        return None

    return {"name": name, "email": email, "phone": normalize_phone(phone)}


def clean_contacts(input_path="contacts_raw.txt", output_path="contacts_clean.csv"):
    """Main cleanup: read raw file, filter, de-dup, and write clean CSV."""
    path = Path(input_path)
    try:
        lines = path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print(f"❌ File not found: {input_path}")
        return

    seen_emails = set()
    cleaned_rows = []

    for line in lines:
        contact = parse_contact_line(line)
        if not contact:
            continue
        # Case-insensitive key; the first occurrence wins.
        email_key = contact["email"].casefold()
        if email_key in seen_emails:
            continue
        seen_emails.add(email_key)
        cleaned_rows.append(contact)

    # Write cleaned CSV (rows stay in first-appearance order)
    with Path(output_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(cleaned_rows)

    print(f"✅ Cleaned {len(cleaned_rows)} contacts written to {output_path}")


if __name__ == "__main__":
    clean_contacts()

✅ Cleaned 5 contacts written to contacts_clean.csv


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
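
Items 3 and 4 can be tested without touching the file system. The sketch below is one way to do that and is not the required test file. To stay self-contained it uses a deliberately simplified stand-in parser, `parse_simple_line`, that only understands `name, email, phone` lines, and a hypothetical `dedupe_by_email` helper. Neither name is part of the assignment; in `test_crm_cleanup.py` you would import your own functions from `q1_crm_cleanup` instead.

```python
import re
import unittest

EMAIL_PATTERN = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"


def parse_simple_line(line):
    """Hypothetical stand-in parser for 'name, email, phone' lines only."""
    parts = [p.strip() for p in line.split(",", 2)]
    if len(parts) < 3 or not re.fullmatch(EMAIL_PATTERN, parts[1]):
        return None
    digits = re.sub(r"\D", "", parts[2])
    phone = digits[-10:] if len(digits) >= 10 else ""
    return {"name": parts[0], "email": parts[1], "phone": phone}


def dedupe_by_email(rows):
    """Hypothetical helper: keep the first row for each case-folded email."""
    seen, unique = set(), []
    for row in rows:
        key = row["email"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


# Small multi-line input: one valid row, one invalid email, one case-variant duplicate.
RAW = """\
Alice, alice@example.com, +1 (469) 555-1234
Bob, bob[at]example.com, 972-555-777
Dup, ALICE@EXAMPLE.COM, 469 555 1234
"""


class TestParsingAndDedup(unittest.TestCase):
    def setUp(self):
        parsed = (parse_simple_line(line) for line in RAW.splitlines())
        self.rows = [row for row in parsed if row is not None]

    def test_rows_from_multiline_string(self):
        expected = [
            {"name": "Alice", "email": "alice@example.com", "phone": "4695551234"},
            {"name": "Dup", "email": "ALICE@EXAMPLE.COM", "phone": "4695551234"},
        ]
        self.assertEqual(self.rows, expected)

    def test_case_variant_duplicate_is_dropped(self):
        names = [row["name"] for row in dedupe_by_email(self.rows)]
        self.assertEqual(names, ["Alice"])


# Build and run the suite explicitly so the sketch also works inside a notebook.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestParsingAndDedup)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Pulling de-duplication into its own function means the test exercises the real logic rather than a copy of the loop pasted into the test body.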




In [17]:
# === Q2: Unit Testing (test_crm_cleanup.py) ===
import unittest
from q1_crm_cleanup import is_valid_email, normalize_phone, parse_contact_line

class TestCRMHelpers(unittest.TestCase):
    def test_valid_emails(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("me.hdi-22@domain.org"))
        self.assertTrue(is_valid_email("A_B@exa.co"))

    def test_invalid_emails(self):
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("sara@mail"))
        self.assertFalse(is_valid_email("justtext"))

    def test_phone_normalization(self):
        self.assertEqual(normalize_phone("+1 (469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("972.555.7777"), "9725557777")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
        self.assertEqual(normalize_phone("4695559"), "")  # too short

    def test_parse_contact_line_valid(self):
        line = "Alice Johnson, alice@example.com, +1 (469) 555-1234"
        parsed = parse_contact_line(line)
        expected = {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"}
        self.assertEqual(parsed, expected)

    def test_parse_contact_line_invalid_email(self):
        line = "Bob Roberts, bob[at]example.com, 972-555-777"
        # parse_contact_line validates the email and returns None for a
        # row whose email does not fully match the pattern.
        self.assertIsNone(parse_contact_line(line))

    def test_deduplication_case_insensitive(self):
        lines = [
            "Alice, alice@example.com, 111-111-1111",
            "Duplicate, ALICE@EXAMPLE.COM, 222-222-2222",
        ]
        parsed = [parse_contact_line(line) for line in lines]
        seen = set()
        cleaned = []
        for c in parsed:
            if not c or not c["email"]:
                continue
            key = c["email"].casefold()
            if key in seen:
                continue
            seen.add(key)
            cleaned.append(c)
        self.assertEqual(len(cleaned), 1)
        self.assertEqual(cleaned[0]["name"], "Alice")

if __name__ == "__main__":
    # Makes tests run correctly inside notebooks (no kernel exit)
    unittest.main(argv=[""], exit=False)


In [16]:
%%writefile q1_crm_cleanup.py
# q1_crm_cleanup.py
"""
CRM Cleanup @ DalaShop
Reads a messy contacts_raw.txt file, validates emails, normalizes phones,
removes duplicates (by email, case-insensitive), and writes cleaned CSV.
"""

import re
import csv
from pathlib import Path


def is_valid_email(email: str) -> bool:
    """Return True if email matches the given regex pattern exactly."""
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    email = email.strip()
    return re.fullmatch(pattern, email) is not None


def normalize_phone(raw: str) -> str:
    """Remove all non-digit characters and keep last 10 digits if length ≥ 10."""
    digits = re.sub(r"\D", "", raw)
    if len(digits) >= 10:
        return digits[-10:]
    return ""


def parse_contact_line(line: str):
    """Parse 'Name <email> , phone' or 'Name , email , phone'.

    Returns a dict with name/email/phone, or None if the line cannot be
    parsed or the email is invalid.
    """
    line = line.strip()
    # Form 1: Name <email> , phone  (the name may be wrapped in quotes)
    match = re.fullmatch(r"(.*?)<([^<>]*)>\s*,\s*(.*)", line)
    if match:
        name, email, phone = match.groups()
    else:
        # Form 2: Name , email , phone
        parts = line.split(",", 2)
        if len(parts) < 3:
            return None
        name, email, phone = parts

    name = name.strip().strip('"').strip()
    email = email.strip()
    if not is_valid_email(email):
        return None

    return {"name": name, "email": email, "phone": normalize_phone(phone)}


def clean_contacts(input_path="contacts_raw.txt", output_path="contacts_clean.csv"):
    """Main cleanup: read raw file, filter, de-dup, and write clean CSV."""
    path = Path(input_path)
    try:
        lines = path.read_text(encoding="utf-8").splitlines()
    except FileNotFoundError:
        print(f"❌ File not found: {input_path}")
        return

    seen_emails = set()
    cleaned_rows = []

    for line in lines:
        contact = parse_contact_line(line)
        if not contact:
            continue
        email_key = contact["email"].casefold()
        if email_key in seen_emails:
            continue
        seen_emails.add(email_key)
        cleaned_rows.append(contact)

    # Write cleaned CSV (rows stay in first-appearance order)
    with Path(output_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(cleaned_rows)

    print(f"✅ Cleaned {len(cleaned_rows)} contacts written to {output_path}")


if __name__ == "__main__":
    clean_contacts()

Overwriting q1_crm_cleanup.py


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
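
For the coverage and readability items under Q2, a table of inputs combined with `subTest` keeps many cases in one readable test while still reporting each failing case separately. The sketch below assumes a `normalize_phone` with the behavior described in Q1 (copied here so the block runs on its own). The class name and the case table are illustrative choices, not part of the assignment.

```python
import re
import unittest


def normalize_phone(raw: str) -> str:
    """Remove non-digits; keep the last 10 digits if there are at least 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


# (raw input, expected output) pairs covering the formats in the sample data.
PHONE_CASES = [
    ("+1 (469) 555-1234", "4695551234"),  # country code, parentheses, dash
    ("(469)555-9999", "4695559999"),      # no spaces
    ("972.777.2121", "9727772121"),       # dots
    ("+1-972-777-2121", "9727772121"),    # dashes with country code
    ("972-555-777", ""),                  # only 9 digits: too short
    ("", ""),                             # empty input
]


class TestPhoneTable(unittest.TestCase):
    def test_normalize_phone_cases(self):
        for raw, expected in PHONE_CASES:
            # subTest labels each case, so a failure shows which input broke.
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestPhoneTable)
phone_result = unittest.TextTestRunner(verbosity=2).run(suite)
```

The same pattern works for email validation with a list of `(email, is_valid)` pairs.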
