# ISYS613 - Data Sourcing and Quality
## Assignment 1
## String Operations and Regular Expressions

**Note:** Do not use concepts not yet covered in our course to answer the assignment questions.
Similarly, do not use concepts you do not fully understand.  Translation: using a search engine
to construct your solutions is tedious, error prone, and, most importantly,
does not help you learn to solve problems!

## Question_1
You work for SPAM Marketing Co.  Your supervisor has just received a text file (MonsterJobsData_variations.txt,
a UTF-8 encoded file) containing what she believes will be pure SPAM marketing gold: previously
un-spammed email addresses.

You have been asked to author a Python script and the necessary regular expressions to extract all
of the email addresses from that file.

Since email addresses are generally treated as case insensitive, your RE patterns should be case insensitive.

### Requirements Overview
1.	Open the file for reading (already completed).
2.	For each line in the file, use a RE to scan the line and capture any email address of the types shown below.
You may assume there can be at most one email address (of any type) per line.
3.	Print all captured email addresses.

### Additional Information
#### Obscured Email Addresses
Regarding emails, alas, some folks are wise to your “spammer” tactics and obscure their
email addresses in ingenious ways.  Specifically, you are to recognize and retain the following email variations:
 
| Variation No. | Variation Example |
| :---- | :---- |
| 1. | a valid email address in RFC 5322 format (see below) |
| 2. | robertsj1503 at duq dot edu |
| 3. | robertsj1503_at_duq_dot_edu |
| 4. | jeff dot roberts at duq dot edu |
| 5. | ```<!-- -->jungcon1<!-- -->@<!-- -->hanmail<!-- -->net<!-- -->``` <br/>(note/hint: the address is embedded within HTML comment tags) |
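
One possible way to approach variation 5 is to remove the HTML comment tags before looking for an address. The sketch below is only an illustration: the `sample` line and the helper name `strip_html_comments` are hypothetical and do not come from the data file. Because the assignment asks you to print addresses as found, a cleaned copy like this would be used for detection or for the optional challenge, not as the required output.

```python
import re

# Hypothetical sample line modeled on variation 5 (not taken from the data file).
sample = "contact: user<!-- -->@<!-- -->example<!-- -->.<!-- -->com today"

# Non-greedy match, so each <!-- ... --> comment is removed on its own.
HTML_COMMENT = re.compile(r"<!--.*?-->")


def strip_html_comments(text):
    """Return text with every HTML comment removed."""
    return HTML_COMMENT.sub("", text)


print(strip_html_comments(sample))
```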

#### Valid Email Addresses According to the RFC 5322
The local-part of a valid email address (the part before the @) may use any of these ASCII characters:

- uppercase and lowercase Latin letters A to Z and a to z
- digits 0 to 9
- the following printable characters
```
.!#$%&'*+-/=?^_`{|}~
```

The domain-part of a valid email address (the part after the @) may be composed of several subdomains
separated by a dot (.).  Each subdomain’s element may contain only:

- uppercase and lowercase ASCII letters A to Z and a to z
- digits 0 to 9
- \- (hyphen or dash)
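
The character rules above can be sketched as a regular expression. This is a simplified illustration, not a complete RFC 5322 validator: the names `LOCAL`, `LABEL`, `EMAIL_RE`, and `find_email` are hypothetical, and the sketch assumes a domain has at least two subdomains separated by a dot.

```python
import re

# Local-part characters from the list above; '-' is escaped so that it is a
# literal hyphen rather than a range operator inside the character class.
LOCAL = r"[a-z0-9.!#$%&'*+\-/=?^_`{|}~]+"

# One subdomain: letters, digits, and hyphen.
LABEL = r"[a-z0-9\-]+"

# Assumption of this sketch: two or more subdomains separated by dots.
EMAIL_RE = re.compile(rf"{LOCAL}@{LABEL}(?:\.{LABEL})+", re.IGNORECASE)


def find_email(line):
    """Return the first RFC-style address in line, or None if there is none."""
    match = EMAIL_RE.search(line)
    return match.group() if match else None


print(find_email("send to applications@plan.ca."))
```

The trailing period in the usage line is not captured, because a dot only matches when another subdomain follows it.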

#### Input / Output

**Input:** MonsterJobsData_variations.txt – a UTF-8 encoded file.

**Output:** Print the list of email addresses (one email per line). For this assignment, do not change
or otherwise re-interpret the non-standard emails into RFC 5322 compliant email addresses.  Simply
output the emails as you found them.
```
Example Output: robertsj1503 at duq dot edu
```
### Challenge
Want a challenge (not a requirement)?  The idea is to "spam" people, right?  So really we need legit email addresses.
To that end, author the conversion code to convert all of your found emails to RFC
5322 compliant email addresses.  If you implement the challenge, please
output both the original email text and the converted email text on the same
line, separated by a "pipe" ( | ).
```
Challenged Example Output: robertsj1503 at duq dot edu | robertsj1503@duq.edu
```
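
A minimal sketch of the challenge conversion for variations 2 through 4, assuming the words "at" and "dot" are always delimited by whitespace or underscores. The names `AT_RE`, `DOT_RE`, and `to_rfc5322` are hypothetical. Note the limitation: a local-part that legitimately contains `_at_` or `_dot_` would be rewritten as well.

```python
import re

# "at" / "dot" surrounded by whitespace or underscores, in any letter case.
AT_RE = re.compile(r"(?:\s+|_)at(?:\s+|_)", re.IGNORECASE)
DOT_RE = re.compile(r"(?:\s+|_)dot(?:\s+|_)", re.IGNORECASE)


def to_rfc5322(found):
    """Replace obscured 'at' and 'dot' words with '@' and '.'."""
    converted = AT_RE.sub("@", found)
    return DOT_RE.sub(".", converted)


for found in ("robertsj1503 at duq dot edu",
              "robertsj1503_at_duq_dot_edu",
              "jeff dot roberts at duq dot edu"):
    # Original text, pipe separator, converted text.
    print(f"{found} | {to_rfc5322(found)}")
```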

In [1]:
# TEST DATA
import re 

in_file = 'MonsterJobsData_variations.txt'


# Regular expression patterns for capturing email variations.
# Raw strings keep \s and \. intact; \s* allows optional whitespace.
# Inside a character class, '-' is escaped so it is a literal, not a range.

# Obscured form: local-part, "at", domain, "dot", top-level domain
# (separated by whitespace or underscores)
email_pattern1 = r"[a-z0-9.!#$%&'*+\-/=?^_`{|}~]+\s*at\s*[a-z0-9._\-]+\s*dot\s*[a-z_\-]{2,}"

# Standard form: local-part, "@", domain, ".", top-level domain
email_pattern2 = r"[a-z0-9.!#$%&'*+\-/=?^_`{|}~]+\s*@\s*[a-z0-9.\-]+\s*\.\s*[a-z\-]{2,}"

# re.IGNORECASE lets the a-z ranges match uppercase letters as well
rec1 = re.compile(email_pattern1, re.IGNORECASE)
rec2 = re.compile(email_pattern2, re.IGNORECASE)

# Function to convert email address variations to RFC 5322 compliant format
def convert_to_rfc5322(email):
    # Replace common variations with RFC 5322 compliant characters
    email = email.replace(' at ', '@').replace(' dot ', '.').replace('_at_', '@').replace('_dot_', '.')
    return email
    
# open the file using a context manager
# read each line from the file
with open(in_file, mode='r', encoding='utf-8') as jfh:
    for line in jfh:
        # strip() removes leading and trailing whitespace, including the newline character.
        line = line.strip()

        # skip empty lines
        if len(line) == 0:
            continue

        # BEGIN YOUR CODE HERE
        
        for email in rec1.findall(line):
            rfc5322_email = convert_to_rfc5322(email)
            print(f"Challenged Example Output   : {email}                       |            {rfc5322_email}")
            
        for email in rec2.findall(line):
            print(f'Example Output              :                                             {email}')
        
        
        

Challenged Example Output   : servicesresearch at yahoo dot ca                       |            servicesresearch@yahoo.ca
Example Output              :                                             applications@plan.ca
Example Output              :                                             nwwf@shawbiz.ca
Challenged Example Output   : lawfirm2000 at hotmail dot com                       |            lawfirm2000@hotmail.com
Example Output              :                                             job_wgkva-2047163377@craigslist.org
Example Output              :                                             anna_norrstrom@swimrecruiting.com
Example Output              :                                             anna_norrstrom@swimrecruiting.com
Challenged Example Output   : info_at_pwccommunitycollege_dot_com                       |            info@pwccommunitycollege.com
Challenged Example Output   : ndshlolibe at yahoo dot com                       |            ndshlolibe@yahoo.com

## Question_2
It turns out Julius Caesar was more paranoid than first thought.  He had another cipher which
was a combination of Pig Latin and step cipher.  You are tasked with implementing the so-called
Latin Pig Cipher (LPC).  You may use any combination of RE or string operations you wish in
constructing your solution.

The latin_pig_cipher function receives a word (string) and a step value as parameters. The objective of the
function is to convert an incoming string to its LPC equivalent and return the result.

Note: **be sure you have solved the Caesar cipher and Pig Latin lab problems before attempting this problem!**

### Requirements Overview
1. A word is a consecutive sequence of letters (a-z, A-Z) and numeric digits (0-9). You may assume
that the string input to the function will only be a single "word". Examples: Zebra29, apple85, etc.

2. If a word ends with a digit, or a series of consecutive digits, the minimum digit in the series
becomes the cipher step value, overriding (replacing) the step input parameter.  Note: a series
composed of a single digit has a minimum value equal to that digit.

3. If the first letter of a word is the letter 'y', the 'y' should be treated as a consonant, **unless** it is
followed by a numeric digit.  If the first letter of a word is a 'y' followed by a digit,
it is to be treated as a vowel, as are any other occurrences of 'y'.

4. If a word starts with a vowel, the LPC version is the original word with the string ŵāŷ (Latin small letter
W with circumflex (Unicode code point \u0175) + Latin small letter A with macron (Unicode code point \u0101) +
Latin small letter Y with circumflex (Unicode code point \u0177)) appended to the end.

5. If a word starts with a consonant, or a series of consecutive consonants, the LPC version transfers
ALL cipher shifted consonants up to the first vowel to the end of the word, and adds the string "āŷ"
(Latin small letter A with macron (Unicode code point \u0101) + Latin small letter Y with
circumflex (Unicode code point \u0177)) to the end.
NOTE: A cipher shifted letter comes from a substitution cipher in which each letter in the plaintext
is replaced by a letter some fixed number of step positions down the alphabet. For example, with a step of 2,
D would be replaced by F and E would become G.  Numeric digits are not cipher shifted.
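
The step override and the letter shift described in the requirements can be sketched as two small helpers. The names `effective_step` and `shift_letter` are hypothetical, and the sketch assumes that the shift wraps around from the end of the alphabet back to the start, as the expected result for "yellow" implies.

```python
import re
import string


def effective_step(word, step):
    """Return the minimum trailing digit if word ends in digits, else step."""
    trailing = re.search(r"[0-9]+$", word)
    if trailing is None:
        return step
    return min(int(digit) for digit in trailing.group())


def shift_letter(char, step):
    """Shift one ASCII letter step positions down the alphabet, wrapping around.

    Any other character (such as a digit) is returned unchanged.
    """
    for alphabet in (string.ascii_lowercase, string.ascii_uppercase):
        if char in alphabet:
            return alphabet[(alphabet.index(char) + step) % 26]
    return char


print(effective_step("football415", 2))
print(shift_letter("D", 2), shift_letter("E", 2), shift_letter("4", 2))
```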


### Function Output
Print the results of your algorithm against each of the following test data values.
For the following test data, the cipher step parameter is 2. (football415 ends in digits, so its step becomes 1, the minimum of 4, 1, and 5.)

Test Word | Latin Pig Cipher
----------|------------------
football415 | ootball415gāŷ
pittsburgh | ittsburghrāŷ
y2ellow | y2ellowŵāŷ
yellow | ellowaāŷ
yttrium | iumavvtāŷ


## Assignment Deliverables
Once you have completed the tasks above, upload a single Jupyter Notebook file (.ipynb) containing
your solution to the Blackboard assignment item.  You may upload your JNB as many times as you would like;
however, only your final submission will be graded.

In [5]:
r'''
UNICODE examples for your consideration

In [12]: encoded_bytes = 'ŵāŷ'.encode('utf8')
In [13]: print(encoded_bytes)
Out[13]: b'\xc5\xb5\xc4\x81\xc5\xb7'
In [14]: decoded_bytes = encoded_bytes.decode('utf8')
In [15]: print(decoded_bytes)
Out[15]: 'ŵāŷ'

In[1]:  s='\u0175\u0101\u0177'
In[2]:  print(s)
Out[2]: 'ŵāŷ'

# The following {names} are the official (and unique) names of Unicode characters.
# The Python \N{name} escape inserts the character that has the given
# Unicode name; the lookup is by name and does not depend on UTF-8.
In[1]:  s="\N{latin small letter W with circumflex}\N{latin small letter A with macron}\N{latin small letter Y with circumflex}"
In[2]:  print(s)
Out[2]: 'ŵāŷ'

'''

# TEST DATA
# football415, cipher shift is 1 because the word ends in 415 and 1 is the min
# All remaining, cipher shift is 2 (words do not end in digit(s))
# Each line of test data is in the form:
# <word to be converted to LPC>,<Result of conversion to LPC>
TEST_DATA1='''\
football415,ootball415gāŷ
pittsburgh,ittsburghrāŷ
y2ellow,y2ellowŵāŷ
yellow,ellowaāŷ
yttrium,iumavvtāŷ\
'''

# This produces identical TEST DATA
TEST_DATA2='''\
football415,ootball415g\u0101\u0177
pittsburgh,ittsburghr\u0101\u0177
y2ellow,y2ellow\u0175\u0101\u0177
yellow,ellowa\u0101\u0177
yttrium,iumavvt\u0101\u0177\
'''

import re

VOWELS = 'aeiou'
SUFFIX_VOWEL = '\u0175\u0101\u0177'  # ŵāŷ, appended to words that start with a vowel
SUFFIX_CONSONANT = '\u0101\u0177'    # āŷ, appended after the moved consonants


def shift_letter(char, step):
    # Shift a lowercase letter `step` positions down the alphabet, wrapping at 'z'
    return chr((ord(char) - ord('a') + step) % 26 + ord('a'))


def latin_pig_cipher(word, step=2):
    if len(word) == 0:
        raise ValueError("Input must contain characters.")

    # Normalize the word to lowercase
    word = word.lower()

    # Trailing digits override the step with their minimum single digit
    trailing = re.search(r'[0-9]+$', word)
    if trailing:
        step = min(int(digit) for digit in trailing.group())

    # A leading 'y' counts as a vowel only when it is followed by a digit
    if word[0] in VOWELS or re.match(r'y[0-9]', word):
        return word + SUFFIX_VOWEL

    # Count the leading consonants; a 'y' after the first letter is a vowel.
    # (Behavior for a word that starts with a digit is not specified.)
    i = 0
    while (i < len(word) and 'a' <= word[i] <= 'z'
           and word[i] not in VOWELS and not (i > 0 and word[i] == 'y')):
        i += 1

    # Shift the leading consonants, move them to the end, and add the suffix
    shifted_consonants = ''.join(shift_letter(char, step) for char in word[:i])
    return word[i:] + shifted_consonants + SUFFIX_CONSONANT


def run_tests(test_data, step=2):
    # Each line has the form <word>,<expected LPC result>
    for test_case in test_data.strip().split('\n'):
        word, expected_result = test_case.split(',')
        result = latin_pig_cipher(word, step)
        print(f"{word}, {result} (Expected: {expected_result})")


# Test the Latin Pig Cipher function against both sets of test data
run_tests(TEST_DATA1)
print('---------------------------------------------------------')
run_tests(TEST_DATA2)



football415, ootball415gāŷ (Expected: ootball415gāŷ)
pittsburgh, ittsburghrāŷ (Expected: ittsburghrāŷ)
y2ellow, y2ellowŵāŷ (Expected: y2ellowŵāŷ)
yellow, ellowaāŷ (Expected: ellowaāŷ)
yttrium, iumavvtāŷ (Expected: iumavvtāŷ)
---------------------------------------------------------
football415, ootball415gāŷ (Expected: ootball415gāŷ)
pittsburgh, ittsburghrāŷ (Expected: ittsburghrāŷ)
y2ellow, y2ellowŵāŷ (Expected: y2ellowŵāŷ)
yellow, ellowaāŷ (Expected: ellowaāŷ)
yttrium, iumavvtāŷ (Expected: iumavvtāŷ)
