<a href="https://colab.research.google.com/github/syed-irtiza7/SyedIrtiza_DTSC3020_Fall2025/blob/main/Assignment5_Ch_10%2611.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 5: CRM Cleanup @ **DalaShop**
*Files (Ch.10), Exceptions (Ch.10), Unit Tests (Ch.11), and Regular Expressions*  

**Total: 3 points**  (Two questions, 1.5 pts each)  

> This assignment is scenario-based and aligned with Python Crash Course Ch.10 (files & exceptions), Ch.11 (unit testing with `unittest`), and Regular Expressions.

## Scenario
You are a data intern at an online retailer called **DalaShop**.  
Sales exported a **raw contacts** file from the CRM. It contains customer names, emails, and phone numbers, but the formatting is messy and some emails are invalid.  
Your tasks:

1. **Clean** the contacts (Files + Exceptions + Regex).  
2. **Write unit tests** to make sure your helper functions work correctly and keep working in the future.

## Data file (given by the company): `contacts_raw.txt`
Use this exact sample data (you may extend it for your own testing, but do **not** change it when submitting).  
Run the next cell once to create the file beside your notebook.

In [None]:
# Create the provided company dataset file
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Wrote contacts_raw.txt with sample DalaShop data.")

## Q1 (1.5 pts) — CRM cleanup with Files, Exceptions, and Regex
Implement `q1_crm_cleanup.py` to:

1. **Read** `contacts_raw.txt` using `pathlib` and `with`. If the file is missing, **handle** it gracefully with `try/except FileNotFoundError` (print a friendly message; do not crash).
2. **Validate emails** with a simple regex (`r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"`).  
   - Trim whitespace with `strip()` before checking.  
   - Use **full** matching (not partial).
3. **Normalize phone numbers:** remove all non-digits (e.g., with `re.sub(r"\D", "", raw)`).  
   - If the result has **≥ 10 digits**, keep the **last 10 digits**.  
   - Otherwise, return an **empty string** (`""`).
4. **Filter rows:** keep **only** rows with a valid email.
5. **Deduplicate:** remove duplicates by **email** using **case-insensitive** comparison (e.g., `email.casefold()`). **Keep the first occurrence** and drop later duplicates.
6. **Output CSV:** write to `contacts_clean.csv` with **columns exactly** `name,email,phone` (UTF-8).  
7. **Preserve input order:** the order of rows in `contacts_clean.csv` must match the **first appearance** order from the input file. **Do not sort** the rows.
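
As a rough illustration of steps 2 and 3 (not a required solution), the sketch below contrasts partial and full matching for the email pattern and applies the keep-the-last-10-digits rule. The names `EMAIL_RE`, `is_valid_email`, and `normalize_phone` are placeholders chosen for this example; the assignment does not prescribe them.

```python
import re

# Pattern copied from step 2; the constant and function names are illustrative only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

def is_valid_email(raw):
    """Return True only if the whole stripped string matches the pattern."""
    return EMAIL_RE.fullmatch(raw.strip()) is not None

def normalize_phone(raw):
    """Strip non-digits; keep the last 10 digits, or return "" if fewer than 10."""
    digits = re.sub(r"\D", "", raw)
    return digits[-10:] if len(digits) >= 10 else ""

# A partial match accepts junk around a valid-looking address; a full match does not.
noisy = "see alice@example.com!!"
print(EMAIL_RE.search(noisy) is not None)        # True
print(is_valid_email(noisy))                     # False
print(is_valid_email("  alice@example.com  "))   # True

print(normalize_phone("+1 (469) 555-1234"))      # 4695551234
print(repr(normalize_phone("972-555-777")))      # ''
```

Using `fullmatch` means the pattern needs no `^` or `$` anchors of its own.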

**Grading rubric (1.5 pts):**
- (0.4) File read/write via `pathlib` + graceful `FileNotFoundError` handling  
- (0.5) Correct email regex validation + filtering  
- (0.4) Phone normalization + case-insensitive de-dup (keep first)  
- (0.2) Clean code, clear names, minimal docstrings/comments
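
The keep-first, case-insensitive de-duplication from steps 5 and 7 can be sketched with a set of seen keys and a list that preserves input order. `dedupe_by_email` and the sample rows are hypothetical and used only for illustration.

```python
def dedupe_by_email(rows):
    """Keep the first row for each email (compared via casefold), in input order."""
    seen = set()
    unique = []
    for row in rows:
        key = row["email"].casefold()
        if key in seen:
            continue  # later duplicate: drop it
        seen.add(key)
        unique.append(row)
    return unique

rows = [
    {"name": "Alice Johnson", "email": "alice@example.com", "phone": "4695551234"},
    {"name": "Nima", "email": "NIMA@example.io", "phone": "9727772121"},
    {"name": "duplicate", "email": "Alice@Example.com", "phone": "4695551234"},
]
result = dedupe_by_email(rows)
print([r["name"] for r in result])  # ['Alice Johnson', 'Nima']
```

Only the comparison key is case-folded; the stored email keeps its original casing, and nothing is sorted.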

In [1]:
# Write your answer here
# First let's create the data file as given in the assignment
with open("contacts_raw.txt", "w", encoding="utf-8") as f:
    f.write('Alice Johnson <alice@example.com> , +1 (469) 555-1234\nBob Roberts <bob[at]example.com> , 972-555-777\nSara M. , sara@mail.co , 214 555 8888\n"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\nDelaram <delaram@example.io>, +1-972-777-2121\nNima <NIMA@example.io> , 972.777.2121\nduplicate <Alice@Example.com> , 469 555 1234')
print("Created contacts_raw.txt with sample data")

# Now the actual cleanup code
import re
import csv
from pathlib import Path

def check_email(email):
    """Return True if the stripped email fully matches the assignment's pattern."""
    email = email.strip()
    # re.fullmatch already requires the whole string to match, so ^ and $ are not needed
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    return bool(re.fullmatch(pattern, email))

def clean_phone(phone):
    """Clean up phone number - keep only digits, take last 10 if enough"""
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

def get_parts_from_line(line):
    """Split one raw line into (name, email, raw phone); return (None, None, None) if it cannot be parsed."""
    line = line.strip()

    # Look for email in <brackets> style
    email_match = re.search(r'<([^>]+)>', line)
    if email_match:
        email = email_match.group(1).strip()
        name_part = line[:email_match.start()].strip()
        name = name_part.strip('"')
        phone_part = line[email_match.end():].strip().lstrip(',').strip()
    else:
        # Try comma separated style
        parts = [part.strip() for part in line.split(',')]
        if len(parts) >= 2:
            name = parts[0].strip('"')
            email = parts[1]
            phone_part = parts[2] if len(parts) > 2 else ""
        else:
            return None, None, None

    return name, email, phone_part

def clean_up_contacts():
    """Read contacts_raw.txt, keep valid unique contacts, and write contacts_clean.csv."""
    try:
        # Read the raw file
        file_path = Path("contacts_raw.txt")
        content = file_path.read_text(encoding="utf-8")

    except FileNotFoundError:
        print("Oops! contacts_raw.txt file is missing.")
        print("Make sure you run the cell that creates the file first.")
        return

    lines = content.splitlines()
    good_contacts = []
    emails_seen = set()

    for line in lines:
        if not line.strip():
            continue

        name, email, raw_phone = get_parts_from_line(line)

        if not name or not email:
            continue

        # Skip if email doesn't look right
        if not check_email(email):
            continue

        # Clean up the phone number
        phone = clean_phone(raw_phone) if raw_phone else ""

        # Check for duplicates (case-insensitive; casefold() as the assignment suggests)
        email_key = email.casefold()
        if email_key not in emails_seen:
            emails_seen.add(email_key)
            good_contacts.append({
                'name': name,
                'email': email,
                'phone': phone
            })

    # Save the clean data to CSV
    output_file = Path("contacts_clean.csv")
    with output_file.open('w', newline='', encoding='utf-8') as csvfile:
        columns = ['name', 'email', 'phone']
        writer = csv.DictWriter(csvfile, fieldnames=columns)

        writer.writeheader()
        for contact in good_contacts:
            writer.writerow(contact)

    print(f"All done! Cleaned up {len(good_contacts)} contacts.")
    print("Output saved to contacts_clean.csv")

# Run the cleanup
clean_up_contacts()

Created contacts_raw.txt with sample data
All done! Cleaned up 5 contacts.
Output saved to contacts_clean.csv


## Q2 (1.5 pts) — Unit testing with `unittest`
Create tests in `test_crm_cleanup.py` that cover at least:

1. **Email validation**: valid/invalid variations.  
2. **Phone normalization**: parentheses, dashes, spaces, country code; too-short cases.  
3. **Parsing**: from a small multi-line string (not from a file), assert the exact structured rows (name/email/phone).  
4. **De-duplication**: demonstrate that a case-variant duplicate email is dropped (first occurrence kept).
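
One possible shape for the parsing test in item 3 is sketched below. `parse_contacts` is a simplified stand-in written for this example (it handles only the `Name <email> , phone` layout and leaves the phone unnormalized); in a real test file you would import your own helper from `q1_crm_cleanup.py` instead.

```python
import re
import unittest

def parse_contacts(text):
    """Hypothetical stand-in parser: handles only 'Name <email> , phone' lines."""
    rows = []
    for line in text.splitlines():
        match = re.search(r"<([^>]+)>", line)
        if not match:
            continue  # skip lines without an <email> part
        name = line[:match.start()].strip().strip('"')
        phone = line[match.end():].strip().lstrip(",").strip()
        rows.append({"name": name, "email": match.group(1).strip(), "phone": phone})
    return rows

class TestParseContacts(unittest.TestCase):
    def test_parses_exact_rows_from_multiline_string(self):
        text = (
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234\n'
            '"Mehdi A." <mehdi.ay@example.org> , (469)555-9999\n'
        )
        expected = [
            {"name": "Alice Johnson", "email": "alice@example.com",
             "phone": "+1 (469) 555-1234"},
            {"name": "Mehdi A.", "email": "mehdi.ay@example.org",
             "phone": "(469)555-9999"},
        ]
        # Compare the full list so a wrong field or wrong order fails the test
        self.assertEqual(parse_contacts(text), expected)

if __name__ == "__main__":
    # argv and exit are set so the runner also works inside a notebook cell
    unittest.main(argv=[""], exit=False)
```

Comparing the whole list of dictionaries checks names, emails, phones, and row order in a single assertion.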




In [2]:
# Write your answer here
import unittest
import re

# Re-define the Q1 helpers here so this cell runs on its own.
# In test_crm_cleanup.py, import them from q1_crm_cleanup instead.
def check_email(email):
    email = email.strip()
    pattern = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
    return bool(re.fullmatch(pattern, email))

def clean_phone(phone):
    digits = re.sub(r"\D", "", phone)
    if len(digits) >= 10:
        return digits[-10:]
    return ""

class TestEmailStuff(unittest.TestCase):
    """Test if email checking works right"""

    def test_good_emails(self):
        """These emails should all pass"""
        good_emails = [
            "alice@example.com",
            "bob.smith@company.co.uk",
            "test_user+tag@domain.org",
            "mehdi.ay@example.org"
        ]

        for email in good_emails:
            self.assertTrue(check_email(email), f"This should work: {email}")

    def test_bad_emails(self):
        """These emails should all fail"""
        bad_emails = [
            "invalid",
            "missing@domain",
            "spaces in@email.com",
            "bob[at]example.com",  # from our data
            "@missinglocal.com",
            "missingdomain@"
        ]

        for email in bad_emails:
            self.assertFalse(check_email(email), f"This should fail: {email}")

    def test_email_with_spaces(self):
        """Test emails with spaces around them"""
        self.assertTrue(check_email("  alice@example.com  "))
        self.assertFalse(check_email("  test @example.com  "))

class TestPhoneStuff(unittest.TestCase):
    """Test if phone cleaning works right"""

    def test_different_phone_formats(self):
        """Test phones with different formatting styles"""
        test_phones = [
            ("+1 (469) 555-1234", "4695551234"),
            ("972-555-7777", "9725557777"),
            ("214 555 8888", "2145558888"),
            ("(469)555-9999", "4695559999"),
            ("+1-972-777-2121", "9727772121"),
            ("972.777.2121", "9727772121")
        ]

        for raw_phone, expected in test_phones:
            self.assertEqual(clean_phone(raw_phone), expected)

    def test_bad_phones(self):
        """Test phones that are too short or empty"""
        bad_phones = [
            "972-555-777",  # 9 digits
            "555-1234",     # 7 digits
            "123456789",    # 9 digits
            ""              # empty
        ]

        for phone in bad_phones:
            self.assertEqual(clean_phone(phone), "")

class TestFullProcess(unittest.TestCase):
    """Test the whole parsing and dedup process"""

    def test_parsing_and_dedup(self):
        """Test that we parse lines correctly and remove duplicates"""
        # Some test data with different formats and a duplicate
        test_lines = [
            'Alice Johnson <alice@example.com> , +1 (469) 555-1234',
            'Bob Roberts <bob@example.com> , 972-555-7777',
            'Sara M. , sara@mail.co , 214 555 8888',
            'Duplicate Person <ALICE@example.com> , 4695559999'  # duplicate of first
        ]

        # Simulate what our cleanup function does
        good_contacts = []
        emails_seen = set()

        for line in test_lines:
            # Simple parsing for test
            if '<' in line and '>' in line:
                name_part, rest = line.split('<', 1)
                email, phone_part = rest.split('>', 1)
                name = name_part.strip()
                email = email.strip()
                phone_part = phone_part.strip().lstrip(',').strip()
            else:
                parts = line.split(',')
                if len(parts) >= 2:
                    name = parts[0].strip()
                    email = parts[1].strip()
                    phone_part = parts[2].strip() if len(parts) > 2 else ""
                else:
                    continue

            if not check_email(email):
                continue

            phone = clean_phone(phone_part)

            email_key = email.casefold()
            if email_key not in emails_seen:
                emails_seen.add(email_key)
                good_contacts.append({
                    'name': name,
                    'email': email,
                    'phone': phone
                })

        # The case-variant duplicate is dropped, so exactly three rows remain,
        # in input order, with the first Alice entry kept.
        expected = [
            {'name': 'Alice Johnson', 'email': 'alice@example.com', 'phone': '4695551234'},
            {'name': 'Bob Roberts', 'email': 'bob@example.com', 'phone': '9725557777'},
            {'name': 'Sara M.', 'email': 'sara@mail.co', 'phone': '2145558888'},
        ]
        self.assertEqual(good_contacts, expected)

# Run the tests
if __name__ == '__main__':
    unittest.main(argv=[''], verbosity=2, exit=False)

test_bad_emails (__main__.TestEmailStuff.test_bad_emails)
These emails should all fail ... ok
test_email_with_spaces (__main__.TestEmailStuff.test_email_with_spaces)
Test emails with spaces around them ... ok
test_good_emails (__main__.TestEmailStuff.test_good_emails)
These emails should all pass ... ok
test_parsing_and_dedup (__main__.TestFullProcess.test_parsing_and_dedup)
Test that we parse lines correctly and remove duplicates ... ok
test_bad_phones (__main__.TestPhoneStuff.test_bad_phones)
Test phones that are too short or empty ... ok
test_different_phone_formats (__main__.TestPhoneStuff.test_different_phone_formats)
Test phones with different formatting styles ... ok

----------------------------------------------------------------------
Ran 6 tests in 0.013s

OK


## Grading rubric (total 3 pts)
- **Q1 (1.5 pts)**  
  - (0.4) File I/O with `pathlib` + graceful `FileNotFoundError` handling  
  - (0.5) Email validation (regex + strip + full match) and filtering  
  - (0.4) Phone normalization and **case-insensitive** de-duplication (keep first)  
  - (0.2) Code clarity (names, minimal docstrings/comments)
- **Q2 (1.5 pts)**  
  - (0.6) Meaningful coverage for email/phone functions (valid & invalid)  
  - (0.6) Parsing & de-dup tests that assert exact expected rows  
  - (0.3) Standard `unittest` structure and readable test names
