# 🧹 Mini-Project: Email List Cleaner

Welcome to your third project choice! The goal is to build a tool that reads a file containing a list of email addresses, cleans the list by validating formats and removing duplicates, and then saves the clean list to a new file.

This project will test your skills with **file I/O**, **functions**, **string methods**, **conditional logic**, and the use of **sets** for efficiency.

--- 
### Step 0: Create a Sample Email List File

First, we need some data to clean. Run the cell below to create a sample `email_list.txt` file. This file contains a mix of valid emails, invalid formats, and duplicates.

In [None]:
%%writefile email_list.txt
contact@example.com
user.name@domain.co
another_user@gmail.com
invalid-email
user@.com
 contact@example.com 
duplicate@email.com
duplicate@email.com
test@sub.domain.org

--- 
### Step 1: Create a Function to Read the File

As with the other projects, we'll start with a modular function to handle reading the file. This separates our file I/O logic from our cleaning logic.

**Your Task:**
Create a function called `read_email_file(filepath)` that:
1.  Takes one argument: `filepath` (the path to the email list file).
2.  Uses a **`try-except`** block to handle a `FileNotFoundError`.
3.  If the file is found, it should read all the lines and return them as a **list of strings**.
4.  If the file is not found, it should print an error message and return an empty list `[]`.
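
If you get stuck, the four requirements above can be sketched roughly as follows. This is one possible approach, not the only correct answer; the `encoding="utf-8"` argument and the exact wording of the error message are assumptions that the task does not specify.

```python
def read_email_file(filepath):
    """Return the lines of the file at `filepath`, or [] if it is missing."""
    try:
        # encoding is an assumption; the sample file is plain ASCII
        with open(filepath, "r", encoding="utf-8") as f:
            # readlines() keeps the trailing "\n" on each line
            return f.readlines()
    except FileNotFoundError:
        print(f"Error: the file '{filepath}' was not found.")
        return []
```

Because `readlines()` keeps the newline at the end of each line, the whitespace cleanup in Step 2 is what removes it.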

In [None]:
# Write the read_email_file function here


# --- Test your function ---
email_lines = read_email_file('email_list.txt')
if email_lines:
    print(f"Successfully read {len(email_lines)} lines.")

# Test the error handling
non_existent_lines = read_email_file('non_existent_emails.txt')
print(f"Reading a non-existent file returned: {non_existent_lines}")

--- 
### Step 2: Create a Function to Validate and Clean an Email

This function will take a single email string and check if it's in a valid format. A simple validation for this project is:
* The string must contain exactly one `@` symbol.
* The string must contain at least one `.` after the `@` symbol.
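* We should also clean up any leading/trailing whitespace before applying these checks.

These rules are deliberately loose: `user@.com` from the sample file has exactly one `@` and a `.` after it, so it counts as valid here.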

**Your Task:**
Create a function called `validate_and_clean_email(email)` that:
1.  Takes one argument: `email` (a string).
2.  Uses the `.strip()` method to remove any leading or trailing whitespace.
3.  Uses **conditional logic** (`if` statements) and **string methods** (`.count()`, `.find()`) to check if the cleaned email is valid based on the rules above.
4.  If the email is valid, the function should return the cleaned email string.
5.  If the email is invalid, the function should return `None`.
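
As a hint, here is a minimal sketch of how `.strip()`, `.count()`, and `.find()` might fit together. It implements only the two simple rules from this step, so treat it as an illustration under those assumptions, not as a full email validator.

```python
def validate_and_clean_email(email):
    """Return the stripped email if it passes the simple rules, else None."""
    cleaned = email.strip()

    # Rule 1: exactly one "@" symbol
    if cleaned.count("@") != 1:
        return None

    # Rule 2: at least one "." somewhere after the "@"
    at_index = cleaned.find("@")
    if cleaned.find(".", at_index + 1) == -1:
        return None

    return cleaned
```

The second argument to `.find()` tells it where to start searching, which is how the check is limited to the part after the `@`.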

In [None]:
# Write the validate_and_clean_email function here


# --- Test your function ---
print(f"Valid: {validate_and_clean_email(' test@example.com ')}")
print(f"Invalid (no @): {validate_and_clean_email('invalid-email')}")
print(f"Invalid (multiple @): {validate_and_clean_email('test@@example.com')}")
print(f"Invalid (no . after @): {validate_and_clean_email('test@examplecom')}")

--- 
### Step 3: Create a Function to Process the Email List

This function will be the core of our cleaner. It will loop through all the raw email lines, use our validation function, and sort them into valid and invalid lists. It will also handle duplicates.

**Your Task:**
Create a function called `process_emails(email_lines)` that:
1.  Takes one argument: `email_lines` (the list of strings).
2.  Initializes three empty data structures:
    * `valid_emails = set()` (A **set** is perfect for automatically handling duplicates!)
    * `invalid_emails = []`
    * `duplicate_emails = []`
3.  **Loops** through each `line` in `email_lines`.
4.  Inside the loop, calls `validate_and_clean_email()` on the line.
5.  If the result is valid (not `None`):
    * Check if the email is already in the `valid_emails` set. If it is, add it to the `duplicate_emails` list.
    * If it's not a duplicate, add it to the `valid_emails` set.
6.  If the result is invalid (`None`), add the original (stripped) line to the `invalid_emails` list.
7.  The function should return the three collections: `valid_emails`, `invalid_emails`, and `duplicate_emails`.
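
The steps above could be sketched like this. To keep the example runnable on its own, it includes a compact stand-in for `validate_and_clean_email()`; in your notebook you would use the version you wrote in Step 2 instead.

```python
def validate_and_clean_email(email):
    # Stand-in for the Step 2 function (same two simple rules)
    cleaned = email.strip()
    if cleaned.count("@") == 1 and "." in cleaned.split("@")[1]:
        return cleaned
    return None


def process_emails(email_lines):
    """Sort raw lines into valid, invalid, and duplicate collections."""
    valid_emails = set()
    invalid_emails = []
    duplicate_emails = []

    for line in email_lines:
        cleaned = validate_and_clean_email(line)
        if cleaned is None:
            # Keep the stripped original so the report is readable
            invalid_emails.append(line.strip())
        elif cleaned in valid_emails:
            # Already seen: record it as a duplicate
            duplicate_emails.append(cleaned)
        else:
            valid_emails.add(cleaned)

    return valid_emails, invalid_emails, duplicate_emails
```

Checking `cleaned in valid_emails` on a set stays fast even for large lists, which is the main reason to prefer a set over a list here.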

In [None]:
# Write the process_emails function here


# --- Test your function ---
test_lines = [
    'good@email.com',
    ' bademail ',
    'good@email.com',
    ' another@good.com '
]
valid, invalid, duplicates = process_emails(test_lines)
print(f"Valid emails: {valid}")
print(f"Invalid emails: {invalid}")
print(f"Duplicate emails: {duplicates}")

--- 
### Step 4: Create a Function to Write the Cleaned List

Now we need a function to save our results. This function will write the final, clean list of emails to a new file.

**Your Task:**
Create a function called `write_clean_emails(filepath, valid_emails)` that:
1.  Takes two arguments: `filepath` for the output file and the `valid_emails` set.
2.  Opens the specified `filepath` in **write mode** (`'w'`).
3.  Loops through the `valid_emails` set (sets are unordered, so the emails may be written in any order).
4.  For each email, it writes the email followed by a newline character (`\n`) to the file.
5.  After writing, it should print a confirmation message, like `"Cleaned email list saved to clean_emails.txt"`.
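
One way to approach this is sketched below. Two details are assumptions that go beyond the task description: the `encoding="utf-8"` argument, and the use of `sorted()` so the output file has a predictable order. Looping over the set directly also satisfies the task.

```python
def write_clean_emails(filepath, valid_emails):
    """Write each valid email on its own line to `filepath`."""
    with open(filepath, "w", encoding="utf-8") as f:
        # sorted() is optional; it just makes the output order repeatable
        for email in sorted(valid_emails):
            f.write(email + "\n")
    print(f"Cleaned email list saved to {filepath}")
```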

In [None]:
# Write the write_clean_emails function here


# --- Test your function ---
test_valid_emails = {'test1@example.com', 'test2@example.com'}
write_clean_emails('test_clean_list.txt', test_valid_emails)

# You can check your directory for 'test_clean_list.txt' to verify it worked!

--- 
### Step 5: Putting It All Together 🧩

Let's combine all our modular functions to create the full email cleaning tool.

**Your Task:**
1.  Define two variables for your file paths: `input_file` and `output_file`.
2.  Call `read_email_file()` to get the raw lines.
3.  If the file was read successfully, pass the lines to `process_emails()` to get your `valid`, `invalid`, and `duplicates` collections.
4.  Print a summary report to the screen showing how many valid, invalid, and duplicate emails were found.
5.  Call `write_clean_emails()` to save the `valid` set to your `output_file`.

In [None]:
# Main script execution

def main():
    input_file = 'email_list.txt'
    output_file = 'clean_email_list.txt'
    
    # Read the raw lines (function from Step 1)
    lines = read_email_file(input_file)
    
    if lines:
        # Validate and de-duplicate the emails (function from Step 3)
        valid, invalid, duplicates = process_emails(lines)
        
        # Print a summary report
        print("--- Email Cleaning Report ---")
        print(f"Valid emails found: {len(valid)}")
        print(f"Invalid emails found: {len(invalid)}")
        print(f"Duplicate emails found: {len(duplicates)}")
        print("---------------------------")
        
        # Write the clean list to a new file (function from Step 4)
        write_clean_emails(output_file, valid)

# Run the main function
main()