# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [1]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

Wrote contacts_raw.txt with sample DalaShop data.


## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
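
As an illustration of steps 2 and 3, here is a minimal sketch. The helper names `is_valid_email` and `normalize_phone` are hypothetical (the assignment does not prescribe them); the regex is the one given in step 2.

```python
import re

# Pattern copied from step 2, compiled once so it can be reused.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def is_valid_email(raw):
    """Hypothetical helper: strip whitespace, then require a full match."""
    return EMAIL_RE.fullmatch(raw.strip()) is not None


def normalize_phone(raw):
    """Hypothetical helper: keep the last 10 digits, or "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


print(is_valid_email("  sara@mail.co "))       # True
print(is_valid_email("bob[at]example.com"))    # False
print(normalize_phone("+1 (469) 555-1234"))    # 4695551234
print(repr(normalize_phone("972-555-777")))    # '' (only 9 digits)
```

`fullmatch` is what makes the check "full": `re.search` or `re.match` would accept a string that merely contains or starts with something email-like.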

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
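
The file-handling and de-duplication items in the rubric could look roughly like the sketch below. Both function names (`read_lines_or_empty`, `dedupe_by_email`) and the row layout (a dict with `name`, `email`, `phone` keys) are assumptions made for illustration only.

```python
from pathlib import Path


def read_lines_or_empty(path):
    """Hypothetical helper: return the file's lines, or [] if the file is missing."""
    try:
        with Path(path).open("r", encoding="utf-8") as f:
            return f.read().splitlines()
    except FileNotFoundError:
        print(f"Could not find {path}; nothing to clean.")
        return []


def dedupe_by_email(rows):
    """Hypothetical helper: keep the first row per case-folded email, in input order."""
    seen = set()
    unique = []
    for row in rows:
        key = row["email"].casefold()
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    {"name": "Nima", "email": "NIMA@example.io", "phone": "9727772121"},
    {"name": "duplicate", "email": "Alice@Example.com", "phone": "4695551234"},
]
print([r["name"] for r in dedupe_by_email(rows)])  # ['Alice Johnson', 'Nima']
print(read_lines_or_empty("no_such_file.txt"))     # prints a message, then []
```

Appending to a list while tracking seen keys in a set keeps the first occurrence and preserves input order without any sorting.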

In [8]:
# Write your answer here

from pathlib import Path
import re
import csv

EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"


def clean_phone(phone):
    """Return the last 10 digits of phone, or "" if there are fewer than 10."""
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    return ""


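def parse_line(line):
    """Split one raw line into (name, email, phone); email is "" when none is found."""
    parts = [p.strip() for p in line.split(",") if p.strip()]
    if not parts:
        return "", "", ""
    email = ""
    email_index = None
    for i, part in enumerate(parts):
        # In "Name <email>" chunks the brackets wrap the email, so treat them
        # as separators and pick the token that contains "@".
        for token in part.replace("<", " ").replace(">", " ").split():
            if "@" in token:
                email, email_index = token, i
                break
        if email:
            break
    # The name is whatever is left of the first chunk once the email is removed.
    first_tokens = parts[0].replace("<", " ").replace(">", " ").split()
    name = " ".join(t for t in first_tokens if t != email)
    # The phone is the last chunk other than the one holding the email.
    phone = ""
    for i in range(len(parts) - 1, -1, -1):
        if i != email_index:
            phone = parts[i]
            break

    return name, email, phone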


def main():
    file_path = Path("contacts_raw.txt")

    try:
        with file_path.open("r", encoding="utf-8") as f:
            lines = f.readlines()
    except FileNotFoundError:
        print("File 'contacts_raw.txt' not found. Make sure it’s in the same folder.")
        return
    contacts = []
    seen_emails = set()
    for line in lines:
        if not line.strip():
            continue

        name, email, phone = parse_line(line)
        if not re.fullmatch(EMAIL_PATTERN, email):
            continue
        phone = clean_phone(phone)
        email_key = email.casefold()  # case-insensitive key for de-duplication
        if email_key in seen_emails:
            continue
        seen_emails.add(email_key)
        contacts.append({"name": name, "email": email, "phone": phone})
    with Path("contacts_clean.csv").open("w", encoding="utf-8", newline="") as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=["name", "email", "phone"])
        writer.writeheader()
        writer.writerows(contacts)

    print("✅ contacts_clean.csv created successfully!")

if __name__ == "__main__":
    main()


✅ contacts_clean.csv created successfully!


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
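
One way to cover several phone formats in a single readable test is `unittest`'s `subTest`, sketched below. The function `normalize_phone` is a stand-in for whatever helper you wrote in Q1 (an assumed name, not a required one), and the test class name is likewise illustrative.

```python
import re
import unittest


def normalize_phone(raw):
    """Stand-in for the function under test (same rule as Q1 step 3)."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


class TestNormalizePhone(unittest.TestCase):
    def test_formats_and_short_inputs(self):
        cases = [
            ("(469)555-9999", "4695559999"),    # parentheses and dash
            ("214 555 8888", "2145558888"),     # spaces
            ("+1-972-777-2121", "9727772121"),  # country code is dropped
            ("972-555-777", ""),                # only 9 digits: too short
        ]
        for raw, expected in cases:
            # subTest reports each failing case separately instead of
            # stopping at the first failed assertion.
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)


if __name__ == "__main__":
    # argv and exit=False keep unittest from reading notebook arguments
    # or shutting down the kernel.
    unittest.main(argv=[""], exit=False)
```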




In [7]:
import unittest
import re

EMAIL_PATTERN = r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$"

def clean_phone(phone):
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    else:
        return ""

def parse_line(line):
    parts = [p.strip() for p in line.split(",") if p.strip()]
    if not parts:
        return "", "", ""
    email = ""
    email_part_index = None
    for i, chunk in enumerate(parts):
        cleaned_chunk = chunk.replace("<", " ").replace(">", " ")
        for token in cleaned_chunk.split():
            if "@" in token:
                email = token.strip()
                email_part_index = i
                break
        if email:
            break
    first_chunk_clean = parts[0].replace("<", " ").replace(">", " ")
    first_tokens = first_chunk_clean.split()
    name_tokens = [t for t in first_tokens if t.strip() != email]
    name = " ".join(name_tokens).strip()
    if not name:
        name = first_chunk_clean.strip()
    phone = ""
    for j in range(len(parts) - 1, -1, -1):
        if j != email_part_index:
            phone = parts[j]
            break
    return name, email, phone

def run_pipeline_on_lines(lines):
    cleaned_contacts = []
    seen_emails = set()
    for raw_line in lines:
        line = raw_line.strip()
        if not line:
            continue
        name, email, phone = parse_line(line)
        if not re.fullmatch(EMAIL_PATTERN, email):
            continue
        phone = clean_phone(phone)
        key = email.casefold()
        if key in seen_emails:
            continue
        seen_emails.add(key)
        cleaned_contacts.append({
            "name": name,
            "email": email,
            "phone": phone,
        })
    return cleaned_contacts

class TestCRMCleanup(unittest.TestCase):
    def test_email_validation(self):
        valid_emails = [
            "alice@example.com",
            "mehdi.ay@example.org",
            "NIMA@example.io",
            "first.last+tag@sub.domain.co",
        ]
        invalid_emails = [
            "bob[at]example.com",
            "noatsymbol.com",
            "bad@",
            "@bad.com",
            "a@b",
        ]
        for email in valid_emails:
            self.assertIsNotNone(re.fullmatch(EMAIL_PATTERN, email))
        for email in invalid_emails:
            self.assertIsNone(re.fullmatch(EMAIL_PATTERN, email))

    def test_clean_phone(self):
        self.assertEqual(clean_phone("(469) 555-1234"), "4695551234")
        self.assertEqual(clean_phone("2145558888"), "2145558888")
        self.assertEqual(clean_phone("+1-972-777-2121"), "9727772121")
        self.assertEqual(clean_phone("972.777.2121"), "9727772121")
        self.assertEqual(clean_phone("555-12"), "")

    def test_parsing_basic_rows(self):
        sample_lines = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'Sara M. , sara@mail.co , 214 555 8888',
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999',
        ]
        cleaned_list = run_pipeline_on_lines(sample_lines)
        self.assertEqual(len(cleaned_list), 3)
        self.assertEqual(cleaned_list[0]["name"], "Alice Johnson")
        self.assertEqual(cleaned_list[0]["email"], "alice@example.com")
        self.assertEqual(cleaned_list[0]["phone"], "4695551234")
        self.assertEqual(cleaned_list[1]["name"], "Sara M.")
        self.assertEqual(cleaned_list[1]["email"], "sara@mail.co")
        self.assertEqual(cleaned_list[1]["phone"], "2145558888")
        self.assertEqual(cleaned_list[2]["name"], '"Mehdi A."')
        self.assertEqual(cleaned_list[2]["email"], "mehdi.ay@example.org")
        self.assertEqual(cleaned_list[2]["phone"], "4695559999")

    def test_full_contacts_behavior(self):
        contacts_lines = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'Bob Roberts <bob[at]example.com> , 972-555-777',
            'Sara M. , sara@mail.co , 214 555 8888',
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999',
            'Delaram <delaram@example.io>, +1-972-777-2121',
            'Nima <NIMA@example.io> , 972.777.2121',
            'duplicate <Alice@Example.com> , 469 555 1234',
        ]
        cleaned_list = run_pipeline_on_lines(contacts_lines)
        self.assertEqual(len(cleaned_list), 5)
        self.assertEqual(cleaned_list[0]["name"], "Alice Johnson")
        self.assertEqual(cleaned_list[0]["email"], "alice@example.com")
        self.assertEqual(cleaned_list[0]["phone"], "4695551234")
        self.assertEqual(cleaned_list[1]["name"], "Sara M.")
        self.assertEqual(cleaned_list[1]["email"], "sara@mail.co")
        self.assertEqual(cleaned_list[1]["phone"], "2145558888")
        self.assertEqual(cleaned_list[2]["name"], '"Mehdi A."')
        self.assertEqual(cleaned_list[2]["email"], "mehdi.ay@example.org")
        self.assertEqual(cleaned_list[2]["phone"], "4695559999")
        self.assertEqual(cleaned_list[3]["name"], "Delaram")
        self.assertEqual(cleaned_list[3]["email"], "delaram@example.io")
        self.assertEqual(cleaned_list[3]["phone"], "9727772121")
        self.assertEqual(cleaned_list[4]["name"], "Nima")
        self.assertEqual(cleaned_list[4]["email"], "NIMA@example.io")
        self.assertEqual(cleaned_list[4]["phone"], "9727772121")

if __name__ == "__main__":
    unittest.main(argv=[""], verbosity=2, exit=False)


test_clean_phone (__main__.TestCRMCleanup.test_clean_phone) ... ok
test_email_validation (__main__.TestCRMCleanup.test_email_validation) ... ok
test_full_contacts_behavior (__main__.TestCRMCleanup.test_full_contacts_behavior) ... ok
test_parsing_basic_rows (__main__.TestCRMCleanup.test_parsing_basic_rows) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.008s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
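
To illustrate what "assert exact expected rows" can mean for the parsing and de-dup tests, the sketch below feeds a small multi-line string through a compact pipeline and compares the whole result in one `assertEqual`. `clean_rows` is a hypothetical stand-in written only for this example; it is not the reference solution, and your own parser may be structured differently.

```python
import re
import unittest

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

SAMPLE = """\
Alice Johnson <alice@example.com> , +1 (469) 555-1234
Sara M. , sara@mail.co , 214 555 8888
duplicate <Alice@Example.com> , 469 555 1234
"""


def clean_rows(text):
    """Hypothetical stand-in pipeline: parse, validate, normalize, de-duplicate."""
    rows, seen = [], set()
    for line in text.splitlines():
        # Everything after the last comma is treated as the phone field.
        name_part, _, phone_part = line.rpartition(",")
        tokens = name_part.replace("<", " ").replace(">", " ").replace(",", " ").split()
        email = next((t for t in tokens if "@" in t), "")
        if not EMAIL_RE.fullmatch(email) or email.casefold() in seen:
            continue
        seen.add(email.casefold())
        digits = re.sub(r"\D", "", phone_part)
        rows.append({
            "name": " ".join(t for t in tokens if t != email),
            "email": email,
            "phone": digits[-10:] if len(digits) >= 10 else "",
        })
    return rows


class TestCleanRows(unittest.TestCase):
    def test_exact_rows_and_case_insensitive_dedup(self):
        expected = [
            {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
            {"name": "Sara M.", "email": "sara@mail.co", "phone": "2145558888"},
        ]
        self.assertEqual(clean_rows(SAMPLE), expected)


if __name__ == "__main__":
    unittest.main(argv=[""], exit=False)
```

Comparing the full list at once checks row count, order, and every field together, so a dropped or reordered row fails the test even if each individual field looks right.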
