
# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [None]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
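
The rules in steps 2, 3, and 5 can be illustrated with a short sketch. The helper names `matches_partially`, `matches_fully`, and `last_ten_digits` are hypothetical and exist only for this illustration; the regex is the one given in step 2. It shows why full matching matters (a partial search accepts a string that merely contains an email), how the last-10-digits rule behaves, and how `casefold()` makes two differently cased emails compare equal.

```python
import re

# Pattern copied from step 2; the helper names below are illustrative only.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def matches_partially(text):
    """True if an email-like substring appears anywhere in text."""
    return EMAIL_PATTERN.search(text) is not None


def matches_fully(text):
    """True only if the whole stripped string is an email."""
    return EMAIL_PATTERN.fullmatch(text.strip()) is not None


def last_ten_digits(raw):
    """Strip non-digits; keep the last 10 digits, or return '' if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""


noisy = "  alice@example.com and more  "
print(matches_partially(noisy))                 # True: a substring matches
print(matches_fully(noisy))                     # False: extra text remains
print(matches_fully("  alice@example.com  "))   # True: only whitespace was trimmed

print(last_ten_digits("+1 (469) 555-1234"))     # 4695551234
print(repr(last_ten_digits("972-555-777")))     # '' (only 9 digits)

# Case-insensitive comparison for de-duplication (step 5)
print("Alice@Example.com".casefold() == "alice@example.com".casefold())  # True
```
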

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments

In [1]:
# q1_crm_cleanup.py – DalaShop CRM Cleanup (Files + Exceptions + Regex)
import csv
import re
from pathlib import Path

# 1️⃣  Create the sample input file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write("""Alice Johnson <alice@example.com> , +1 (469) 555-1234
Bob Roberts <bob[at]example.com> , 972-555-777
Sara M. , sara@mail.co , 214 555 8888
"Mehdi A." <mehdi.ay@example.org> , (469)555-9999
Delaram <delaram@example.io>, +1-972-777-2121
Nima <NIMA@example.io> , 972.777.2121
duplicate <Alice@Example.com> , 469 555 1234""")

# 2️⃣  Email regex
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}$")

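def is_valid_email(raw):
    """Return True if the stripped string is, in full, a valid email."""
    return bool(raw and EMAIL_RE.fullmatch(raw.strip()))

def normalize_phone(raw):
    """Keep the last 10 digits; return "" if fewer than 10 digits remain."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else ""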

def parse_line(line):
    """Parse one line into a name/email/phone dict; return None for a blank line."""
    line = line.strip()
    if not line:
        return None
    m = re.match(r'^\s*(?P<name>.*?)\s*<(?P<email>[^>]+)>\s*,\s*(?P<phone>.+?)\s*$', line)
    if m:
        return {"name": m["name"].strip('" '), "email": m["email"].strip(), "phone": m["phone"].strip()}
    m2 = re.match(r'^\s*(?P<name>[^,]+)\s*,\s*(?P<email>[^,]+)\s*,\s*(?P<phone>.+?)\s*$', line)
    if m2:
        return {"name": m2["name"].strip('" '), "email": m2["email"].strip(), "phone": m2["phone"].strip()}
    return {"name": line, "email": "", "phone": ""}

def dedupe_by_email(rows):
    seen, out = set(), []
    for r in rows:
        key = r["email"].casefold()
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def clean_rows_from_text(text):
    rows = []
    for line in text.splitlines():
        rec = parse_line(line)
        if rec and is_valid_email(rec["email"]):
            rows.append({
                "name": rec["name"],
                "email": rec["email"],
                "phone": normalize_phone(rec["phone"]),
            })
    return dedupe_by_email(rows)

def read_and_clean(input_path, output_path):
    try:
        txt = Path(input_path).read_text(encoding="utf-8")
    except FileNotFoundError:
        print(f"File not found: {input_path}")
        return []
    rows = clean_rows_from_text(txt)
    with Path(output_path).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name","email","phone"])
        writer.writeheader()
        writer.writerows(rows)
    print(f"Wrote {len(rows)} cleaned contacts → {output_path}")
    return rows

# 3️⃣  Run cleanup
rows = read_and_clean("contacts_raw.txt", "contacts_clean.csv")

# 4️⃣  Preview cleaned file
import pandas as pd
pd.read_csv("contacts_clean.csv")


Wrote 5 cleaned contacts → contacts_clean.csv


| | name | email | phone |
|---|---|---|---|
| 0 | Alice Johnson | alice@example.com | 4695551234 |
| 1 | Sara M. | sara@mail.co | 2145558888 |
| 2 | Mehdi A. | mehdi.ay@example.org | 4695559999 |
| 3 | Delaram | delaram@example.io | 9727772121 |
| 4 | Nima | NIMA@example.io | 9727772121 |


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
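
One way to cover many input formats without writing one test method per case is a table-driven test with `subTest`. The sketch below is only an illustration: `PhoneNormalizationTable` is a hypothetical class name, and `normalize_phone` is a stand-in copy of the Q1 rule so the example runs on its own (in the real test file it would be imported from `q1_crm_cleanup`). The suite is run explicitly so the snippet also works inside a notebook cell.

```python
import re
import unittest


def normalize_phone(raw):
    """Stand-in copy of the Q1 rule so this sketch runs on its own."""
    digits = re.sub(r"\D", "", raw or "")
    return digits[-10:] if len(digits) >= 10 else ""


class PhoneNormalizationTable(unittest.TestCase):
    # Hypothetical test class; each tuple is (raw input, expected output).
    CASES = [
        ("(469) 555-1234", "4695551234"),    # parentheses and dash
        ("+1-972-777-2121", "9727772121"),   # country code is dropped
        ("972.777.2121", "9727772121"),      # dots
        ("214 555 8888", "2145558888"),      # spaces
        ("972-555-777", ""),                 # too short: 9 digits
        ("", ""),                            # empty input
    ]

    def test_table(self):
        for raw, expected in self.CASES:
            # subTest reports each failing case separately instead of
            # stopping at the first failed assertion.
            with self.subTest(raw=raw):
                self.assertEqual(normalize_phone(raw), expected)


# Run the suite directly instead of calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(PhoneNormalizationTable)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

With `subTest`, a failure in one row names the offending `raw` value, which makes it easier to see which format broke.
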




In [5]:
%%bash
cat > test_crm_cleanup.py << 'PY'
import unittest
from q1_crm_cleanup import (
    is_valid_email, normalize_phone, clean_rows_from_text, parse_line
)

class TestEmailValidation(unittest.TestCase):
    def test_valid(self):
        self.assertTrue(is_valid_email("alice@example.com"))
        self.assertTrue(is_valid_email("mehdi.ay@example.org"))
        self.assertTrue(is_valid_email("user+tag@sub.example.co.uk"))
    def test_invalid(self):
        self.assertFalse(is_valid_email("bob[at]example.com"))
        self.assertFalse(is_valid_email("bad@@example..com"))
        self.assertFalse(is_valid_email("user@domain"))  # no TLD

class TestPhoneNormalization(unittest.TestCase):
    def test_formats(self):
        self.assertEqual(normalize_phone("(469) 555-1234"), "4695551234")
        self.assertEqual(normalize_phone("+1-972-777-2121"), "9727772121")
        self.assertEqual(normalize_phone("972.777.2121"), "9727772121")
        self.assertEqual(normalize_phone("214 555 8888"), "2145558888")
    def test_short(self):
        self.assertEqual(normalize_phone("555-777"), "")

class TestParsingAndDedup(unittest.TestCase):
    def test_parsing_variants(self):
        a = parse_line('Alice Johnson <alice@example.com> , +1 (469) 555-1234')
        b = parse_line('Sara M. , sara@mail.co , 214 555 8888')
        c = parse_line('"Mehdi A." <mehdi.ay@example.org> , (469)555-9999')
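        # Assert the full structured row; parse_line keeps the phone as written.
        self.assertEqual(a, {"name": "Alice Johnson", "email": "alice@example.com", "phone": "+1 (469) 555-1234"})
        self.assertEqual(b, {"name": "Sara M.", "email": "sara@mail.co", "phone": "214 555 8888"})
        self.assertEqual(c, {"name": "Mehdi A.", "email": "mehdi.ay@example.org", "phone": "(469)555-9999"})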

    def test_clean_and_dedupe(self):
        text = """Alice <alice@example.com> , 469 555 1234
duplicate <Alice@Example.com> , 972 555 0000
Bob , bob[at]example.com , 111 111 1111"""
        rows = clean_rows_from_text(text)
        expected = [{"name":"Alice","email":"alice@example.com","phone":"4695551234"}]
        self.assertEqual(rows, expected)

if __name__ == "__main__":
    unittest.main(verbosity=2)
PY

python -m unittest -v test_crm_cleanup.py


test_invalid (test_crm_cleanup.TestEmailValidation.test_invalid) ... ok
test_valid (test_crm_cleanup.TestEmailValidation.test_valid) ... ok
test_clean_and_dedupe (test_crm_cleanup.TestParsingAndDedup.test_clean_and_dedupe) ... ok
test_parsing_variants (test_crm_cleanup.TestParsingAndDedup.test_parsing_variants) ... ok
test_formats (test_crm_cleanup.TestPhoneNormalization.test_formats) ... ok
test_short (test_crm_cleanup.TestPhoneNormalization.test_short) ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.001s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
