# `re` Module

## Basic concepts tutorial

The site https://regexone.com/ lets you review all the basic concepts of regular expressions step by step.
Go through the tutorial before continuing with these exercises.

You can also use https://regex101.com/, which helps you test and understand regular expressions.

## Data extraction

### Getting the data

In [None]:
!git clone https://github.com/nzmonzmp/dataset-spam-kaggle.git
!ls dataset-spam-kaggle

### Data analysis

This corpus is a collection of more than 4000 fraudulent emails (spam, phishing, …) from 1998 to 2007.

All emails are stored in a single file, `fradulent_emails.txt`; the file `debug_file.txt` contains only 2 emails.

Each email has the following headers:
- `Return-Path`: address the email was sent from
- `X-Sieve`: the X-Sieve host (always cmu-sieve 2.0)
- `Message-Id`: a unique identifier for each message
- `From`: the message sender (sometimes blank)
- `Reply-To`: the email address to which replies will be sent
- `To`: the email address to which the email was originally sent (some are truncated for anonymity)
- `Date`: date the email was sent
- `Subject`: subject line of the email
- `X-Mailer`: the platform the email was sent from
- `MIME-Version`: the Multipurpose Internet Mail Extension version
- `Content-Type`: type of content and character encoding
- `Content-Transfer-Encoding`: encoding in bits
- `X-MIME-Autoconverted`: the type of autoconversion done
- `Status`: r (read) and o (opened)
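
To make the header layout above concrete, here is a minimal sketch that splits a block of headers into name/value pairs. The `sample_email` text and the `header_pattern` name are hypothetical and only illustrate the `Name: value` shape; real emails in the corpus may contain multi-line headers that this simple pattern does not handle.

```python
import re

# Hypothetical sample, not taken from the corpus.
sample_email = (
    "Return-Path: <alice@example.com>\n"
    'From: "Alice Example" <alice@example.com>\n'
    "Subject: Urgent business proposal\n"
    "Status: RO\n"
)

# One header per line: a name made of letters and hyphens, a colon, then the value.
header_pattern = re.compile(r"^([A-Za-z-]+):[ \t]*(.*)$", re.MULTILINE)

# findall returns (name, value) tuples, which dict() turns into a mapping.
headers = dict(header_pattern.findall(sample_email))
print(headers["Subject"])
```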

Start by storing the text from each file in variables (for example `debug_text` and `work_text`).  
Note that the files use the `windows-1252` encoding.

In [None]:
# Your code here

#### Solution

In [None]:
import pathlib


debug_file = pathlib.Path("dataset-spam-kaggle/debug_file.txt")
work_file = pathlib.Path("dataset-spam-kaggle/fradulent_emails.txt")

debug_text = debug_file.read_text(encoding="windows-1252")
work_text = work_file.read_text(encoding="windows-1252")

In [None]:
print(work_text[0:1000])
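
The encoding argument matters because some `windows-1252` bytes are not valid UTF-8. The following sketch, using a made-up string rather than corpus data, shows a byte sequence that decodes correctly as `windows-1252` but fails as UTF-8.

```python
# Hypothetical text ending with an ellipsis (U+2026), which windows-1252
# stores as the single byte 0x85.
raw = "Dear friend\u2026".encode("windows-1252")

# Decoding with the matching encoding restores the original text.
decoded = raw.decode("windows-1252")
print(decoded)

# A lone 0x85 byte is not valid UTF-8, so decoding as UTF-8 raises an error.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print("valid UTF-8:", utf8_ok)
```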

### Displaying the sender

Using a regular expression, display the first 10 lines that start with `From:`.

In [None]:
# Your code here

#### Solution

In [None]:
import itertools
import re


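# Variant 1: findall builds the full list of matches, then we slice it.
for line in re.findall(r"^From:.*$", work_text, re.MULTILINE)[0:10]:
    print(line)

print("***")


# Variant 2: finditer is lazy; stop manually after 10 matches.
for i, matched in enumerate(re.finditer(r"^From:.*$", work_text, re.MULTILINE)):
    if i >= 10:
        break
    print(matched.group(0))

print("***")


# Variant 3: islice takes the first 10 matches from the lazy iterator.
for matched in itertools.islice(
    re.finditer(r"^From:.*$", work_text, re.MULTILINE), 10
):
    print(matched.group(0))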
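
All three variants rely on `re.MULTILINE`. As a small illustration on a made-up string (the `sample` text is hypothetical), the sketch below compares the same pattern with and without the flag.

```python
import re

# Hypothetical sample, not taken from the corpus.
sample = "From: alice@example.com\nSubject: hello\nFrom: bob@example.com\n"

# Without re.MULTILINE, ^ matches only at the very start of the string
# and $ only at the very end (or just before a final newline).
without_flag = re.findall(r"^From:.*$", sample)

# With re.MULTILINE, ^ and $ also match at the start and end of each line.
with_flag = re.findall(r"^From:.*$", sample, re.MULTILINE)

print(without_flag)
print(with_flag)
```

Without the flag the pattern finds nothing here, because the first line is not followed by the end of the string.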

### Removing quotes

Implement a function that:
- takes a string as input
- returns the string with:
  - all `'` and `"` characters removed
  - spaces removed at the beginning and at the end

In [None]:
# Your code here

#### Solution

In [None]:
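def process_name(name: str) -> str:
    # Remove every single and double quote.
    out = re.sub(r"""['"]""", "", name)
    # Keep only the text between leading and trailing whitespace.
    out = re.sub(r"^\s*(.*?)\s*$", r"\1", out)
    return out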
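
As a quick check, the sketch below runs `process_name` on a few made-up inputs and compares it with a regex-free variant. The function `process_name_no_regex` is a hypothetical alternative added for comparison only; it is not part of the exercise.

```python
import re


def process_name(name: str) -> str:
    # Same logic as the solution, repeated so this cell runs on its own.
    out = re.sub(r"""['"]""", "", name)
    out = re.sub(r"^\s*(.*?)\s*$", r"\1", out)
    return out


def process_name_no_regex(name: str) -> str:
    # Hypothetical alternative: str.translate deletes both quote characters,
    # then str.strip trims leading and trailing whitespace.
    return name.translate(str.maketrans("", "", "'\"")).strip()


# Made-up inputs, not taken from the corpus.
for raw_name in [' "Mr. John Doe" ', "  'Jane'  ", "plain"]:
    print(repr(process_name(raw_name)), repr(process_name_no_regex(raw_name)))
```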

### Extracting the sender

Using a regular expression, build the dictionary `senders` where:

- the key is the sender's name, with quotes and unnecessary spaces removed
- the value is the set of email addresses (making sure that the email address is valid)

In [None]:
# Your code here

#### Solution

In [None]:
senders = {}
# Group 1 is the display name, group 2 the address between angle brackets.
for matched in re.finditer(
    r"^From:(.*)<(\w\S+@\S+\.\w+)>$", work_text, re.MULTILINE
):
    name, email = matched.groups()
    processed_name = process_name(name)
    # Create the set the first time a name is seen, then add the address to it.
    senders.setdefault(processed_name, set()).add(email)
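
To see what the pattern accepts and rejects, here is a self-contained sketch on a few made-up `From:` lines. The `sample` text and `sample_senders` name are hypothetical, the name clean-up is a simplified stand-in for `process_name`, and `collections.defaultdict` is just one possible way to build the sets.

```python
import collections
import re

# Hypothetical sample lines, not taken from the corpus.
sample = (
    'From: "John Doe" <john@example.com>\n'
    "From: John Doe <jdoe@example.org>\n"
    "From: not-an-address\n"
    "From: 'Jane Roe' <jane@example.com>\n"
)

pattern = re.compile(r"^From:(.*)<(\w\S+@\S+\.\w+)>$", re.MULTILINE)

# defaultdict(set) creates an empty set the first time a name is seen.
sample_senders = collections.defaultdict(set)
for matched in pattern.finditer(sample):
    name, email = matched.groups()
    # Simplified clean-up: drop quotes, then trim whitespace.
    cleaned = re.sub(r"""['"]""", "", name).strip()
    sample_senders[cleaned].add(email)

print(dict(sample_senders))
```

The line without an address between angle brackets is skipped, and both spellings of the first sender end up under the same key.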

### Displaying the names of people who have multiple email addresses

In [None]:
# Your code here

#### Solution

In [None]:
for name, emails in senders.items():
    if len(emails) > 1:
        print(name)
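
The same filter can be written as a comprehension. The sketch below uses a small made-up dictionary shaped like `senders` (the `sample_senders` data is hypothetical) and sorts the names so the output order is stable.

```python
# Hypothetical dictionary shaped like `senders`.
sample_senders = {
    "John Doe": {"john@example.com", "jdoe@example.org"},
    "Jane Roe": {"jane@example.com"},
}

# Keep only names with at least two distinct addresses; sort for stable output.
multiple = sorted(
    name for name, emails in sample_senders.items() if len(emails) > 1
)
print(multiple)
```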