## Regular Expressions in Python

In Python, regular expressions are supported by the re module. 

In [0]:
import re

## Basic Patterns: Ordinary Characters

Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

In [0]:
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
    print("Match!")
else:
    print("Not a match!")

Match!


The match() function returns a match object if the text matches the pattern. Otherwise it returns None. The re module also contains several other functions and you will learn some of them later on in the tutorial. 
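Because match() can return None, calling .group() directly on its result raises an AttributeError when nothing matches. Here is a minimal sketch of guarding against that. The helper name `first_match` is hypothetical and not part of the re module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern does not match."""
    m = re.match(pattern, text)
    return m.group() if m is not None else None

print(first_match(r"Cookie", "Cookie"))  # Cookie
print(first_match(r"Cookie", "Cake"))    # None
```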

For now, though, let's focus on ordinary characters! Do you notice the r at the start of the pattern Cookie? 

This is called a raw string literal. The prefix changes how the string literal is interpreted: its characters are stored exactly as they appear.

For example, in a string prefixed with r, \ is just a backslash rather than the start of an escape sequence. You will see what this means with special characters. Regular expression syntax often involves backslash-escaped characters, and the raw r prefix prevents Python from interpreting them as escape sequences. You don't actually need it for this example, but it is good practice to use it for consistency.
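A small sketch of the difference, using made-up strings: without the r prefix Python turns \n into a single newline character, while with it the backslash and the letter n are kept as two characters.

```python
import re

plain = "\n"   # one character: a newline
raw = r"\n"    # two characters: a backslash followed by 'n'
print(len(plain), len(raw))  # 1 2

# With a raw string, the regex engine receives \d unchanged.
digits = re.search(r"\d+", "Order 66").group()
print(digits)  # 66
```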

#### Wild Card Characters: Special Characters

Special characters do not match themselves literally. Instead, they have a special meaning when used in a regular expression.

The most widely used special characters are:

# . - A period. Matches any single character except a newline.

In [0]:
re.search(r'Co.k.e', 'Cookie').group()

'Cookie'

The group() function returns the string matched by the regular expression. You will see this function in more detail later.

In [0]:
re.search(r'l..' ,'Man lived a century ago').group()

'liv'

#### Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. 

# \w - Lowercase w. Matches any single letter, digit or underscore.

In [0]:
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

In [0]:
re.search(r'I\W.....' ,'Today is I@gmail.com match').group()

'I@gmail'

In [0]:
re.search(r'ce\w..' ,'Man lived a century ago').group()

'centu'

# \W - Uppercase w. Matches any character not part of \w (lowercase w).

In [0]:
re.search(r'C\Wke', 'C@ke').group()

'C@ke'

In [0]:
re.search(r'ce\W.' ,'Man lived a century ago').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [0]:
re.search(r'ce\W.' ,'Man lived a ce%tury ago').group()

'ce%t'

In [0]:
re.search(r'ce\w.' ,'Man lived a ce%tury ago').group()

AttributeError: 'NoneType' object has no attribute 'group'

# \s - Lowercase s. Matches a single whitespace character such as a space, newline, tab, or carriage return.

In [0]:
re.search(r'Eat\scake', 'Eat cake').group()

'Eat cake'

##### \S - Uppercase s. Matches any character not part of \s (lowercase s).

In [0]:
re.search(r'shut\Stoday', 'BSE, NSE shut today as Mumbai goes to polls').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [0]:
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

In [0]:
re.search(r'shut\stoday', 'BSE, NSE shut@today as Mumbai goes to polls').group()

AttributeError: 'NoneType' object has no attribute 'group'

In [0]:
re.search(r'shut\Stoday', 'BSE, NSE shut@today as Mumbai goes to polls').group()

'shut@today'

\n - Lowercase n. Matches newline.

\r - Lowercase r. Matches a carriage return.

\d - Lowercase d. Matches decimal digit 0-9.


In [0]:
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

# ^ - Caret. Matches a pattern at the start of the string.

In [0]:
re.search(r'^Eat', 'Eat cake').group()

'Eat'

# $ - Dollar. Matches a pattern at the end of the string.

In [0]:
re.search(r'cake$', 'Eat everyday cake').group()

'cake'

[abc] - Matches a or b or c.

### [a-zA-Z0-9] - Matches any single character in the ranges a to z, A to Z, or 0 to 9. Characters outside a set can be matched by complementing the set: if the first character of the set is ^, any character that is not in the set will be matched.

In [0]:
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'

In [0]:
re.search(r'[0-9]', ' This is my 5 st car').group()

'5'

In [0]:
# Matches any character except 5
re.search(r'Number: [^5]', 'Number: 0').group()

'Number: 0'

In [0]:
re.search(r'[^5]', ' virat scored 22 runs').group()

' '

# \A - Uppercase a. Matches only at the start of the string. Unlike ^, it still matches only at the very start of the string in multi-line mode.

In [0]:
re.search(r'\A[A-E]ookie', 'Cookie').group()

'Cookie'
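To see how \A differs from ^, here is a hedged sketch using a made-up two-line string and the re.MULTILINE flag. In multi-line mode ^ also matches at the start of each line, whereas \A does not.

```python
import re

text = "first line\nCookie jar"

# With re.MULTILINE, ^ also matches right after each newline...
caret = re.search(r"^Cookie", text, flags=re.MULTILINE)
# ...while \A still matches only at the very start of the string.
start_only = re.search(r"\ACookie", text, flags=re.MULTILINE)

print(caret.group())  # Cookie
print(start_only)     # None
```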

# \b - Lowercase b. Matches only at a word boundary, that is, at the beginning or end of a word.

In [0]:
re.search(r'\b[a-z]umbai', 'mumbai').group()

'mumbai'

\ - Backslash. If the character following the backslash is a recognized escape character, the sequence takes on its special meaning. For example, \n is treated as a newline. However, if the character following the \ is not a recognized escape character, the \ is treated like any other character and passed through. To match a literal backslash, escape it with a second backslash (\\).

In [0]:
# The escaped backslash '\\' matches a literal '\' in the string, so '\s' is not read as the whitespace class
re.search(r'Back\\stail', r'Back\stail').group()

'Back\\stail'

In [0]:
# Here '\s' is the whitespace class, because its backslash is not itself escaped
re.search(r'Back\stail', 'Back tail').group()


'Back tail'

# Repetitions

Writing out a long pattern character by character quickly becomes tedious. Fortunately, the re module handles repetitions using the following special characters:

# + - Checks for one or more occurrences of the character to its left.

In [0]:
re.search(r'Co+kie', 'Cooookie').group()

'Cooookie'

# * - Checks for zero or more occurrences of the character to its left.

In [0]:
# Matches 'C', then zero or more 'a', then zero or more 'o', then 'kie'
re.search(r'Ca*o*kie', 'Caokie').group()

'Caokie'

In [0]:
result=re.findall(r'\w+','AV is largest Analytics community of India')
print (result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [0]:
result=re.findall(r'\w*','AV is largest Analytics community of India')
print (result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '']


# ? - Checks for exactly zero or one occurrence of the character to its left.

In [0]:
# Matches with or without the optional 'u'
re.search(r'Colou?r', 'Color').group()

'Color'

## But what if you want to check for an exact number of repetitions?

For example, checking the validity of a phone number in an application. The re module handles this gracefully as well, using the following expressions:

{x} - Repeat exactly x number of times.

{x,} - Repeat at least x times or more.

{x,y} - Repeat at least x times but no more than y times. Write it without a space after the comma.

In [0]:
re.search(r'\d{9,10}', '9920996342').group()

'9920996342'
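Note that search() with \d{9,10} also finds a run of digits inside a longer string. For validation, one option is re.fullmatch(), which requires the whole string to match. This is only a sketch: the 10-digit format and the helper name `looks_like_phone` are assumptions made for illustration.

```python
import re

def looks_like_phone(number):
    """True when the whole string is exactly 10 digits (assumed format)."""
    return re.fullmatch(r"\d{10}", number) is not None

print(looks_like_phone("9920996342"))    # True
print(looks_like_phone("99209963421"))   # False: 11 digits
print(looks_like_phone("992-099-6342"))  # False: contains hyphens
```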

# The + and * quantifiers are said to be greedy: they match as many characters as possible.

In [0]:
email_address = 'Please contact us at: support@gmail.com'
re.search(r'([\w\.-]+)@([\w\.-]+)', email_address).group()

'support@gmail.com'
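The sketch below illustrates greediness on a made-up HTML-like string, and contrasts it with the non-greedy form obtained by appending ? to the quantifier. It also shows that the parentheses in the e-mail pattern create groups, which group(1) and group(2) return individually.

```python
import re

html = "<b>bold</b> and <i>italic</i>"
greedy = re.search(r"<.*>", html).group()   # .* grabs as much as it can
lazy = re.search(r"<.*?>", html).group()    # .*? stops at the first '>'
print(greedy)  # <b>bold</b> and <i>italic</i>
print(lazy)    # <b>

m = re.search(r"([\w\.-]+)@([\w\.-]+)", "Please contact us at: support@gmail.com")
print(m.group(1))  # support
print(m.group(2))  # gmail.com
```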

# search() versus match()

The match() function checks for a match only at the beginning of the string (by default) whereas the search() function checks for a match anywhere in the string.
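A short sketch of the difference, reusing the 'Eat cake' string from earlier:

```python
import re

text = "Eat cake"

at_start = re.match(r"cake", text)    # 'cake' is not at the beginning
anywhere = re.search(r"cake", text)   # search() scans the whole string
prefix = re.match(r"Eat", text)       # 'Eat' is at the beginning

print(at_start)          # None
print(anywhere.group())  # cake
print(prefix.group())    # Eat
```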

# findall(pattern, string, flags=0)

Finds all non-overlapping matches of the pattern in the entire sequence and returns them as a list. When the pattern contains no groups, each item in the list is a string that represents one match.

In [0]:
email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses: 
    print(address)

support@datacamp.com
xyz@datacamp.com
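When the pattern contains groups, findall() returns the groups rather than the whole match: one tuple per match if there is more than one group. A sketch using the same sentence, with the user and domain parts wrapped in parentheses:

```python
import re

email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# Two groups in the pattern, so each match becomes a (user, domain) tuple
pairs = re.findall(r"([\w\.-]+)@([\w\.-]+)", email_address)
print(pairs)  # [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```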


## sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found then the string is returned unchanged.

In [0]:
email_address = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)

Please contact us at: support@datacamp.com
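Two further features of sub() are sketched below on made-up example.com addresses. The count argument limits how many occurrences are replaced, and a backreference such as \1 in repl reuses the text captured by the first group.

```python
import re

text = "a@example.com, b@example.com, c@example.com"

# count=1 replaces only the leftmost occurrence
first_only = re.sub(r"[\w\.-]+@[\w\.-]+", "hidden", text, count=1)
print(first_only)  # hidden, b@example.com, c@example.com

# \1 refers to the first group (the part before the @)
new_domain = re.sub(r"([\w\.-]+)@([\w\.-]+)", r"\1@example.org", text)
print(new_domain)  # a@example.org, b@example.org, c@example.org
```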


# Case Study: Working with Regular Expressions

In [0]:
import re
import requests

In [0]:
#! pip install requests

In [0]:
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

In [0]:
the_idiot_url

'https://www.gutenberg.org/files/2638/2638-0.txt'

In [0]:
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Cuts the text at the first occurrence of "II", which also discards the metadata at the end of the book
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces every run of characters other than letters, digits and '.' with a space, then lowercases
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)


 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou

## Find the number of occurrences of the article "the" in the corpus. Hint: use the len() function.

In [0]:
len(re.findall(r'the', processed_book))

1
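Keep in mind that the pattern r'the' also matches inside longer words such as "they" or "other". If only the stand-alone word should be counted, the \b word boundary can help. A sketch on a made-up sample sentence rather than the book text:

```python
import re

sample = "the other day they met the author"

anywhere = len(re.findall(r"the", sample))        # also counts 'other' and 'they'
whole_word = len(re.findall(r"\bthe\b", sample))  # only the stand-alone word

print(anywhere)    # 4
print(whole_word)  # 2
```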

## Try to convert every stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change any 'i' occurring within a word:

In [0]:
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

 produced by martin adamson david widger with corrections by andrew sly the idiot by fyodor dostoyevsky translated by eva martin part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages fou
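The \si\s pattern only matches an 'i' that has whitespace on both sides, which is why the 'i.' in the output above stays lowercase. It also consumes the surrounding spaces, so in a run like ' i i ' the second 'i' is skipped. A possible alternative is the \b word boundary, sketched here on a made-up sample string:

```python
import re

sample = "i said i. then i i left"

with_spaces = re.sub(r"\si\s", " I ", sample)
with_boundaries = re.sub(r"\bi\b", "I", sample)

print(with_spaces)      # i said i. then I i left
print(with_boundaries)  # I said I. then I I left
```

Whether 'i.' should be capitalized depends on the text. In the book output above it is a chapter numeral, so \b is not automatically the better choice.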

### Find the number of times anyone was quoted ("") in the corpus. 

In [0]:
# Counts the closing quotation marks; the backslash before ” is not needed
len(re.findall(r'”', book))

96

### What are the words connected by '--' in the corpus?

In [0]:
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il