In [1]:
import re

## Matching Email Addresses

In [2]:
emails = """
RishavSharma@gmail.com
rv.sharma@du.edu
rv-23-sharma@my-company.net
"""

#### To match the first email address, note that the part before the @ consists of one or more upper or lower case letters (the character class also allows a dot). After the @ come one or more lower case letters, then a literal dot, and then com.

In [3]:
pattern1 = re.compile(r'[a-zA-Z.]+@[a-z]+\.com')

matches = pattern1.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='RishavSharma@gmail.com'>


#### To also match the next email address, which ends in .edu instead of .com, we can group com and edu together using alternation (the | operator).

In [4]:
pattern2 = re.compile(r'[a-zA-Z.]+@[a-z]+\.(com|edu)')

matches = pattern2.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='RishavSharma@gmail.com'>
<re.Match object; span=(24, 40), match='rv.sharma@du.edu'>


#### To get the last email address, we also need to allow digits in the part before the @. Additionally, we need to allow hyphens, which appear both before the @ and in the domain.

In [5]:
pattern3 = re.compile(r'[a-zA-Z0-9.-]+@[a-z-]+\.(edu|net|com)')

matches = pattern3.finditer(emails)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='RishavSharma@gmail.com'>
<re.Match object; span=(24, 40), match='rv.sharma@du.edu'>
<re.Match object; span=(41, 68), match='rv-23-sharma@my-company.net'>
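
To see the progression at a glance, the three patterns can be run side by side. This is only a sketch: the `patterns` and `results` dictionaries are illustrative names, not part of the cells above, and none of these patterns is a complete email validator.

```python
import re

emails = """
RishavSharma@gmail.com
rv.sharma@du.edu
rv-23-sharma@my-company.net
"""

# The three patterns from the cells above, keyed by name (illustrative dict)
patterns = {
    "pattern1": r'[a-zA-Z.]+@[a-z]+\.com',
    "pattern2": r'[a-zA-Z.]+@[a-z]+\.(com|edu)',
    "pattern3": r'[a-zA-Z0-9.-]+@[a-z-]+\.(edu|net|com)',
}

# group(0) is the full text of each match
results = {
    name: [m.group(0) for m in re.finditer(pat, emails)]
    for name, pat in patterns.items()
}

for name, found in results.items():
    print(name, len(found), found)
```

Each pattern matches one more address than the previous one, which is the effect of widening the character classes and the alternation.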


## Capturing information from groups

In [6]:
urls = """
https://www.google.com
http://rvs.com
https://youtube.com
https://www.isro.gov.in
"""

##### The URLs are inconsistent: some use http and others https, and only some include www. Let's say that for each of these URLs we want to grab the domain name followed by the top level domain.
ex: isro.gov.in<br>
How do we do it?

#### First let's write an expression that matches these URLs

In [7]:
pattern = re.compile(r'https?://(www\.)?\w+\.\w+\.?(in)?')

matches = pattern.finditer(urls)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 38), match='http://rvs.com'>
<re.Match object; span=(39, 58), match='https://youtube.com'>
<re.Match object; span=(59, 82), match='https://www.isro.gov.in'>


#### In order to capture the domain names, we can put them in a group by surrounding them with parentheses.

In [8]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+\.?(in)?)')

matches = pattern.finditer(urls)

for match in matches:
    print(match)

<re.Match object; span=(1, 23), match='https://www.google.com'>
<re.Match object; span=(24, 38), match='http://rvs.com'>
<re.Match object; span=(39, 58), match='https://youtube.com'>
<re.Match object; span=(59, 82), match='https://www.isro.gov.in'>


#### Now we have four groups in the pattern:

<pre>
i) The first group is the optional www.
ii) The second group is the run of word characters that make up the domain name.
iii) The third group is the top level domain (.com, .gov.in)
iv) The fourth group, nested inside the third, is the optional trailing in.

There's also a group 0 that captures the entire URL. Now, we can print out the group that we need.
</pre>
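
To check which text lands in which group, we can print `match.groups()` for every URL. A small sketch (the `all_groups` list is an illustrative name); note that each tuple has four entries, because the nested `(in)?` is itself a capturing group, and a group that did not take part in the match shows up as `None`.

```python
import re

urls = """
https://www.google.com
http://rvs.com
https://youtube.com
https://www.isro.gov.in
"""

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+\.?(in)?)')

# groups() returns groups 1..n as a tuple; group 0 (the whole match) is not included
all_groups = [m.groups() for m in pattern.finditer(urls)]

for groups in all_groups:
    print(groups)
```

For the last URL the tuple is `('www.', 'isro', '.gov.in', 'in')`, while for `http://rvs.com` it is `(None, 'rvs', '.com', None)`.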

In [9]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+\.?(in)?)')

matches = pattern.finditer(urls)

for match in matches:
    print(match.group(2))   # will give the domain name

google
rvs
youtube
isro


#### We will use the pattern to replace the entire URL with just the domain and top level domain. We can do that using the sub() function.

In [10]:
pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+\.?(in)?)')

subbed_urls = pattern.sub(r'\2\3', urls)  # groups 2 and 3 are the groups that contain what we want

print(subbed_urls)


google.com
rvs.com
youtube.com
isro.gov.in
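
Counting parentheses to find the right group number gets error-prone as patterns grow. One alternative is to name the groups we care about and make the others non-capturing. This is a sketch, and the group names `domain` and `tld` are illustrative assumptions rather than anything used in the cells above.

```python
import re

urls = """
https://www.google.com
http://rvs.com
https://youtube.com
https://www.isro.gov.in
"""

# (?:...) groups without capturing; (?P<name>...) captures under a name
named = re.compile(r'https?://(?:www\.)?(?P<domain>\w+)(?P<tld>\.\w+\.?(?:in)?)')

# \g<name> refers to a named group in the replacement string
subbed_urls = named.sub(r'\g<domain>\g<tld>', urls)

print(subbed_urls)
```

The result is the same as with `r'\2\3'`, but the replacement string no longer depends on the order of the parentheses.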



#### Instead of using the finditer() method throughout, we can also use methods like:

<pre>
i) findall() - finditer returns match objects with extra information such as the location of each match, but findall just returns the matches in the form of a list. If the pattern contains groups, findall returns the groups instead of the whole match.

ii) match() - This determines whether a string STARTS with a particular pattern. It only matches at the beginning of the string, and it returns a single match object (or None) rather than an iterable.

iii) search() - This returns the first occurrence of a particular pattern anywhere in the string. Returns None if there is no match.
</pre>
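
A short sketch of the three methods on the same URL pattern and string. The variable names `first_found`, `at_start`, `at_start_stripped` and `first_hit` are illustrative.

```python
import re

urls = """
https://www.google.com
http://rvs.com
https://youtube.com
https://www.isro.gov.in
"""

pattern = re.compile(r'https?://(www\.)?(\w+)(\.\w+\.?(in)?)')

# findall(): the pattern has groups, so each item is a tuple of the groups,
# with '' for a group that did not take part in the match
first_found = pattern.findall(urls)[0]
print(first_found)

# match(): only looks at the beginning of the string; urls starts with a newline
at_start = pattern.match(urls)
print(at_start)

# After stripping the leading newline, the string starts with a URL
at_start_stripped = pattern.match(urls.strip())
print(at_start_stripped.group(0))

# search(): scans the string and returns the first match it finds
first_hit = pattern.search(urls)
print(first_hit.group(0))
```

Here `match()` returns `None` on the original string only because of the leading newline, while `search()` finds the first URL regardless of where it starts.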

## We can use flags to make our lives easier.

In [11]:
sen = "Dear Marie, tell me what it was I used to be."

#### Let's say that the start of the sentence can be written in any mix of lower and upper case characters. Writing a regular expression that covers every combination is a pain, so instead we can use a flag.

In [12]:
pattern = re.compile(r'DeAr', re.IGNORECASE)  # re.I is a shorthand which can also be used.

matches = pattern.search(sen)

print(matches)

<re.Match object; span=(0, 4), match='Dear'>
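
As a quick check, the sketch below compares the search with and without the flag, uses the `re.I` shorthand, and combines it with a second flag using `|`. The `text` string and the use of `re.MULTILINE` are illustrative additions, not part of the example above.

```python
import re

sen = "Dear Marie, tell me what it was I used to be."

# Without the flag, the mixed-case pattern does not match
no_flag = re.search(r'DeAr', sen)
print(no_flag)

# re.I is shorthand for re.IGNORECASE
with_flag = re.search(r'DeAr', sen, re.I)
print(with_flag.group(0))

# Flags can be combined with |; re.MULTILINE (re.M) lets ^ match at the start of every line
text = "dear one\nDEAR two"
line_starts = re.findall(r'^dear', text, re.I | re.M)
print(line_starts)
```

Note that the match object reports the text as it appears in the string (`Dear`), not as it was written in the pattern.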
