To get started, fork this notebook so you have copies of all of the problem strings.

### Warmup problems:
You can test each problem in this section with the strings named like `PROBLEM_1`, `PROBLEM_2` etc. Try writing your first draft regex for each problem without looking at the problem input; this will help you practice for cases where the inputs are too large to review the edge cases manually.
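
One way to check a draft pattern against a problem string is a small helper like the sketch below. The name `show_matches` and the `SAMPLE` string are made up for illustration; swap in `PROBLEM_1`, `PROBLEM_2`, and so on once you have forked the notebook.

```python
import re

def show_matches(pattern, text):
    """Return every non-overlapping match of `pattern` in `text` as a list of strings."""
    compiled = re.compile(pattern)
    # finditer yields match objects; group(0) is the full matched text,
    # so the result does not change when the pattern contains capture groups.
    return [match.group(0) for match in compiled.finditer(text)]

# SAMPLE is a hypothetical stand-in for one of the PROBLEM_n strings.
SAMPLE = "cat, cart, coat"
print(show_matches(r"c\w*t", SAMPLE))  # ['cat', 'cart', 'coat']
```

Returning the full match rather than the captured groups makes it easier to see exactly how much text a draft pattern consumes.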

1. Extract the domain name from these simple URLs, which always start with `http` and end with `.com`.
    - Hint: the `match any character` metacharacter will be very helpful here. 
- Write a pattern that will return numbers containing no zeros or ones.
    - Hint: you can solve this with a pattern just seven characters long with the word boundary special character and a custom character class.
- Write a pattern to count the number of sentences that end with a word ending in 'ing' or 'ings'.
    - Hint: if you find that you're matching more items than expected, try a regex tester like [Pythex](https://pythex.org) so you can visualize exactly what's going wrong.
- Count the number of words in this sentence with at least five characters.
- Extract the two phone numbers from this sentence.
- Re-write the pattern '\d{3}(?=\d{7})' so that it returns everything in the phone number except for the area code.
- Write a pattern to extract the domain name from an email address in a string. For this sentence, notreal@notmail.com should return 'notmail'.
- Identify all of the words that look like names in the sentence. In other words, those which are capitalized but aren't the first word in the sentence.
- Find the valid urls that use http instead of https.
- Tidy up the weird whitespace problems with the problem's sentence.
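
The phone number rewrite problem turns on the difference between a lookahead, which checks the text after the current position, and a lookbehind, which checks the text before it. The sketch below contrasts the two on a made-up ten-digit string (`NUMBER` is illustrative, not one of the notebook's inputs); it is one possible approach, not the only valid answer.

```python
import re

# NUMBER is an illustrative value, not one of the notebook's problem strings.
NUMBER = "5551234567"

# Lookahead: match three digits only if seven more digits follow them.
area_code = re.findall(r"\d{3}(?=\d{7})", NUMBER)

# Lookbehind: match seven digits only if three digits precede them.
# Python requires lookbehind patterns to have a fixed width, which \d{3} does.
rest = re.findall(r"(?<=\d{3})\d{7}", NUMBER)

print(area_code)  # ['555']
print(rest)       # ['1234567']
```

Neither lookaround consumes the text it checks, so only the digits outside the parentheses end up in the result.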


# Warm Up

In [None]:
import re

In [None]:
PROBLEM_1 = 'https://www.kaggle.com https://www.google.com https://www.wikipedia.com'
# Escape the literal dots and capture only the domain name.
pattern = re.compile(r"https?://(?:www\.)?(\w+)\.com")
matches = pattern.finditer(PROBLEM_1)
for match in matches:
    print(match.group(1))

In [None]:
PROBLEM_2 = '123, 012410, 01010, , 000, 111, 3495873, 3, not a number!, ...!@$,.'
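# Word boundaries stop the pattern from matching part of a longer number, such as '23' in '123'.
pattern = re.compile(r"\b[2-9]+\b")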
matches = pattern.findall(PROBLEM_2)
matches

In [None]:
PROBLEM_3 = 'Looking for many endings? You should only be seeing one match.'
# 'ings?' matches 'ing' or 'ings'; [ing|ings] would be a character class, not an alternation.
pattern = re.compile(r"\w*ings?[.!?]")
matches = pattern.findall(PROBLEM_3)
len(matches)

In [None]:
PROBLEM_4 = 'Count the number of words in this sentence with at least five characters.'
# A '+' inside a character class is a literal plus sign, so use \w on its own.
pattern = re.compile(r"\b\w{5,}\b")
matches = pattern.findall(PROBLEM_4)
for match in matches:
    print(match)

In [None]:
PROBLEM_5 = 'Extract these two normally formatted phone numbers from this sentence: (123) 456 7890, 123-456-7890.'
# Allow an optional closing parenthesis and a single space or hyphen between digit groups.
pattern = re.compile(r"\(?\d{3}\)?[\s-]\d{3}[\s-]\d{4}")
matches = pattern.findall(PROBLEM_5)
for match in matches:
    print(match)

In [None]:
PROBLEM_6 = '1234567890'
pattern = re.compile(r"\d{3}(\d{7})")
matches = pattern.findall(PROBLEM_6)
for match in matches:
    print(match)

In [None]:
PROBLEM_7 = "An email address (imaginaryperson@imaginarymail.edu) in a sentence. Don't match Invalid_email@invalid."
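# Require a dot and a top-level domain after the captured name so 'Invalid_email@invalid.' is skipped.
pattern = re.compile(r"\w+@(\w+)\.\w+")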
matches = pattern.findall(PROBLEM_7)
for match in matches:
    print(match)

In [None]:
PROBLEM_8 = "This is not a name, but Harry is. So is Susy. Sam should be missed as it's the first word in the sentence."
pattern = re.compile(r"(?<!^)(?<!\. )[A-Z][a-z]+")
matches = pattern.findall(PROBLEM_8)
for match in matches:
    print(match)

In [None]:
PROBLEM_9 = "https://www.kaggle.com https://www.google.com https://www.wikipedia.com http://phishing.com not.a.url gibberish41411 http https www.com"
pattern = re.compile(r"http://[a-zA-Z0-9]+\.\w+")
matches = pattern.findall(PROBLEM_9)
for match in matches:
    print(match)

In [None]:
PROBLEM_10 = "Weird whitespace           issues\t\t\t can be\n\n annoying."
re.sub(r"\s{2,}", " ", PROBLEM_10)

# Advanced Exercises

### Advanced Exercises
1. Extract all of the valid phone numbers from the string PHONE_FIELD_ENTRIES. You should get one phone number for each of the numbers 1-9. This one may be easier if you tackle it in stages.
- Extract the date ranges from the description field in [this dataset's documentation file](https://www.kaggle.com/sohier/nber-macrohistory-database/data). 
- Identify the people who contributed to books in this [library collections dataset](https://www.kaggle.com/seattle-public-library/seattle-library-checkout-records/data). You can find the relevant data in the Title column of the `Library_Collection_Inventory.csv`.
- Identify all imports used [in this notebook](https://www.kaggle.com/sohier/static-copy-of-recommendation-engine-notebook). Then, count the uses of each of those libraries. Note that the author used several different import styles! You can read the notebook's source code into Python with: 

```
import json
import pandas as pd

with open("../input/static-copy-of-recommendation-engine-notebook/recommendation_engine.ipynb", "r") as f_open:
    df = pd.DataFrame(json.load(f_open)['cells'])
```
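
Because the imports exercise involves several import styles, it may help to handle `from X import Y` and `import A, B as b` separately. The sketch below covers only the first half of the exercise (identifying the imported packages) and runs on a made-up `SOURCE` snippet rather than the real notebook; the variable names are illustrative.

```python
import re
from collections import Counter

# SOURCE is a made-up snippet standing in for the notebook's code cells.
SOURCE = """import numpy as np
import os, sys
from sklearn.cluster import KMeans
from collections import Counter
import matplotlib.pyplot as plt
"""

# 'from X import ...' lines: capture the top-level package X.
from_pattern = re.compile(r"^\s*from\s+(\w+)", re.MULTILINE)
# 'import A, B as b' lines: capture everything after the keyword, split later.
import_pattern = re.compile(r"^\s*import\s+(.+)$", re.MULTILINE)

modules = from_pattern.findall(SOURCE)
for names in import_pattern.findall(SOURCE):
    for name in names.split(","):
        # Drop any 'as alias' suffix and keep only the top-level package.
        base = re.split(r"\s+as\s+", name.strip())[0]
        modules.append(base.split(".")[0])

print(Counter(modules))
```

Counting how often each library is then used would need a second pass that also tracks the aliases (`np`, `plt`), which this sketch deliberately discards.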

5\. Take one of your own kernels and do a style analysis. What's the shortest variable name you used? Do your function names follow [PEP8](https://www.python.org/dev/peps/pep-0008/)?
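
A rough starting point for the style analysis is sketched below. It assumes the kernel's source is already available as one string (`SOURCE` here is invented), and it only catches simple one-name assignments at the start of a line, so treat it as a first pass rather than a full parser.

```python
import re

# SOURCE is a made-up snippet standing in for one of your own kernels.
SOURCE = """def loadData(path):
    df = read(path)
    return df

def clean_rows(df):
    n = len(df)
    total_rows = n
    return total_rows
"""

# Function names: the identifier that follows 'def'.
function_names = re.findall(r"^\s*def\s+(\w+)", SOURCE, flags=re.MULTILINE)
# Simple assignments: an identifier at the start of a line followed by '=' but not '=='.
variable_names = re.findall(r"^\s*(\w+)\s*=(?!=)", SOURCE, flags=re.MULTILINE)

# PEP 8 recommends lowercase words separated by underscores for function names.
snake_case = re.compile(r"[a-z_][a-z0-9_]*")
non_pep8 = [name for name in function_names if not snake_case.fullmatch(name)]

shortest = min(variable_names, key=len)
print(shortest)  # n
print(non_pep8)  # ['loadData']
```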

In [None]:
# AE 1
PHONE_FIELD_ENTRIES = '\n\n'.join([
    "1111111111",
    "222 222 2222",
    "333.333.3333",
    "(444) 444-4444",
    "Whitespace duplications can be hard to spot manually  555  555  5555 ",
    "Weird whitespace formats are still valid 666\t666\t6666",
    "Two separate phone numbers in one field 777.777.7777, 888 888 8888",
    "A common typo plus the US country code +1 999..999.9999",
    "Not a phone number, too many digits 1234567891011",
    "Not a phone number, too few digits 123.456",
    "Not a phone number, nine digits (123) 456-789",
                                   ])

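# Separators are limited to spaces, tabs, dots and hyphens so a match cannot span the newlines between entries.
pattern = re.compile(r"\(?\b\d{3}\)?[ \t.-]*\d{3}[ \t.-]*\d{4}\b")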
matches = pattern.findall(PHONE_FIELD_ENTRIES)
for match in matches:
    print(match)

In [None]:
import os
os.listdir("../input/")

In [None]:
import csv
pattern = re.compile(r"\d{4}\-\d{4}")
count = 0
with open("../input/documentation/documentation.csv", "r") as in_file: 
    csv_reader = csv.reader(in_file)
    for line in csv_reader:
        count += 1
        matches = pattern.findall(line[2])
        for match in matches:
            print("file_name: ", line[0], "\tYears: ", match)   
        if count == 100:
            break

In [None]:
count

In [None]:
# first 250 results
import re
import csv
count = 0
pattern = re.compile(r"([A-Z][a-z.-]+ (?:[A-Z][A-Za-z.] +?)?[A-Z][A-Za-z-']+)")
with open("../input/seattle-library-collection-inventory/library-collection-inventory.csv", "r") as in_file: 
    csv_reader = csv.reader(in_file)
    next(csv_reader)
    for line in csv_reader:
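        # Assumes the title field has the form "title / contributors"; keep the part after the slash.
        parts = line[1].split("/")
        if len(parts) == 2:
            matches = pattern.findall(parts[1])
            for match in matches:
                print(match)
                count += 1
            # Use >= because one title can add several matches and step past exactly 250.
            if count >= 250:
                break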

In [None]:
count