# Problem
You want to parse text data using regular expressions.
# Solution
The simplest way to do this in Python is with the built-in re module.

# How It Works
Let’s look at some of the ways we can use regular expressions for our tasks.

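Basic flags: the basic flags are I, L, M, S, U, and X:

     • re.I: This flag makes matching ignore case.
     • re.L: This flag makes \w, \W, \b, \B, and case-insensitive matching depend on the current locale. In Python 3 it can be used only with bytes patterns.
     • re.M: This flag makes ^ and $ match at the start and end of every line, not only at the start and end of the whole string.
     • re.S: This flag makes the dot (.) match any character, including a newline.
     • re.U: This flag enables Unicode matching. In Python 3 this is already the default for string patterns.
     • re.X: This flag allows whitespace and comments inside the pattern, so the regex can be written in a more readable format.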
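
To make the flags above concrete, here is a minimal sketch of four of them in action. The sample strings and variable names are made up for illustration and are not part of the recipe.

```python
import re

text = "Hello World\nhello again"

# re.I: case-insensitive matching finds both "Hello" and "hello".
ignore_case = re.findall(r"hello", text, flags=re.I)

# re.M: ^ matches at the start of every line, not just the start of the string.
line_starts = re.findall(r"^\w+", text, flags=re.M)

# re.S: the dot also matches the newline, so the match can span both lines.
across_lines = re.search(r"World.*hello", text, flags=re.S)

# re.X: whitespace and comments inside the pattern are ignored.
number = re.compile(
    r"""
    \d+      # integer part
    \.       # decimal point
    \d+      # fractional part
    """,
    flags=re.X,
)
price = number.search("total: 42.50 USD")

print(ignore_case)           # ['Hello', 'hello']
print(line_starts)           # ['Hello', 'hello']
print(price.group())         # 42.50
```

Without re.S, the third search would return None, because the dot stops at the newline between the two lines.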

# Regular expressions’ functionality:

• Match a single character that is either a or b:

    Regex: [ab]

• Match any single character except a and b:

    Regex: [^ab]

• Match any single character in the range a to z:

    Regex: [a-z]

• Match any single character outside the range a to z:

    Regex: [^a-z]

• Match any single character in the range a to z or A to Z:

    Regex: [a-zA-Z]

• Any single character except a newline:

    Regex: .

• Any whitespace character:
     
     Regex: \s

• Any non-whitespace character:

    Regex: \S

• Any digit:
    
    Regex: \d

• Any non-digit:

    Regex: \D

• Any non-word character:

    Regex: \W

• Any word character (letter, digit, or underscore):

    Regex: \w

• Either match a or b:

    Regex: (a|b)

• The occurrence of a is either zero or one:

    Regex: a? ; ? matches zero or one occurrence, but not more than one

• The occurrence of a is zero or more times:

    Regex: a* ; * matches zero or more occurrences

• The occurrence of a is one or more times:

    Regex: a+ ; + matches one or more occurrences

• Match exactly three consecutive occurrences of a:

    Regex: a{3}

• Match three or more consecutive occurrences of a:

    Regex: a{3,}

• Match between three and six consecutive occurrences of a:

    Regex: a{3,6}

• Start of the string:

    Regex: ^

• End of the string:

    Regex: $

• Word boundary:

    Regex: \b

• Non-word boundary:

    Regex: \B
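
The patterns listed above are easier to remember once you have seen them run. The following sketch tries a handful of them with re.findall; the sample strings are invented for illustration.

```python
import re

sample = "cab 42 aaaa b_9!"

# Character classes
only_ab = re.findall(r"[ab]", "cab")        # ['a', 'b']
not_ab = re.findall(r"[^ab]", "cab")        # ['c']
digits = re.findall(r"\d", sample)          # ['4', '2', '9']
non_word = re.findall(r"\W", sample)        # [' ', ' ', ' ', '!']

# Quantifiers
exactly_three = re.findall(r"a{3}", "aaaa")      # ['aaa']
three_or_more = re.findall(r"a{3,}", "aaaa")     # ['aaaa']
optional_a = re.findall(r"ba?", "b ba baa")      # ['b', 'ba', 'ba']

# Anchors and boundaries
first_word = re.findall(r"^\w+", sample)                 # ['cab']
last_word = re.findall(r"\w+$", "last word")             # ['word']
whole_cat = re.findall(r"\bcat\b", "cat concat cat")     # ['cat', 'cat']

print(only_ab, not_ab, digits, non_word)
print(exactly_three, three_or_more, optional_a)
print(first_word, last_word, whole_cat)
```

Note that the underscore counts as a word character, so \W does not report it, and that \bcat\b skips the "cat" inside "concat" because there is no word boundary in front of it.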

# re.match() and re.search() functions are used to find the patterns

Let’s look at the differences between re.match() and re.search():
    
    • re.match(): This checks for a match only at the beginning of the string. If the pattern is found at the beginning of the input string, it returns a match object; otherwise, it returns None.

    • re.search(): This checks for a match anywhere in the string. It returns a match object for the first occurrence of the pattern, or None if the pattern does not occur. To get every occurrence, use re.findall() or re.finditer().
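
A small sketch of the difference, using a made-up sentence. re.findall is included only for comparison, since it is the function that returns every occurrence rather than a single match object.

```python
import re

text = "like this book"

at_start = re.match(r"like", text)       # match object: pattern is at position 0
not_at_start = re.match(r"book", text)   # None: "book" is not at the beginning
anywhere = re.search(r"book", text)      # match object: first occurrence anywhere
all_hits = re.findall(r"o", text)        # list of every non-overlapping match

print(at_start.group())      # like
print(not_at_start)          # None
print(anywhere.span())       # (10, 14)
print(all_hits)              # ['o', 'o']
```

Because both functions can return None, it is safest to check the result before calling .group() or .span() on it.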

Now let’s look at a few of the examples using these regular expressions.

# Tokenizing
You want to split the sentence into words – tokenize. One of the ways to do
this is by using re.split.

In [1]:
import re

In [2]:
re.split(r"\s+", "i like this book")

['i', 'like', 'this', 'book']
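
Splitting on whitespace leaves punctuation attached to the words. As a possible alternative (not the recipe's method), the sketch below matches runs of word characters with re.findall instead; the sentence is invented for illustration.

```python
import re

sentence = "I like this book, really!"

# Splitting on whitespace keeps punctuation attached to the words.
by_whitespace = re.split(r"\s+", sentence)

# Matching runs of word characters drops the punctuation instead.
by_word_chars = re.findall(r"\w+", sentence)

print(by_whitespace)   # ['I', 'like', 'this', 'book,', 'really!']
print(by_word_chars)   # ['I', 'like', 'this', 'book', 'really']
```

Which one is preferable depends on whether the punctuation matters for the later steps of your pipeline.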

# Extracting email IDs

In [3]:
doc = "For more details please mail us at: xyz@abc.com, pqr@mno.com"

In [4]:
re.findall(r"[\w\.-]+@[\w\.-]+",doc)

['xyz@abc.com', 'pqr@mno.com']

# Replacing email IDs
Here we replace email IDs in sentences or documents with another
email ID. The simplest way to do this is by using re.sub.

In [5]:
doc = "For more details please mail us at xyz@abc.com"

In [6]:
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)',r'pqr@mno.com', doc)

In [7]:
new_email_address

'For more details please mail us at pqr@mno.com'
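
The pattern above captures the user name and the domain in two groups, although the replacement does not use them. As an illustrative variation, the sketch below refers back to the second group with \2 to keep the domain and mask only the user name; the masking string is an arbitrary choice.

```python
import re

doc = "For more details please mail us at xyz@abc.com"

# \2 inserts the text captured by the second group (the domain),
# while the user name captured by the first group is replaced by "***".
masked = re.sub(r"([\w\.-]+)@([\w\.-]+)", r"***@\2", doc)

print(masked)   # For more details please mail us at ***@abc.com
```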

# Extract data from the ebook and perform regex
Let’s solve this case study by using the techniques learned so far.

In [8]:
import re
import requests

In [15]:
url = 'https://www.gutenberg.org/files/2638/2638-0.txt'

In [103]:
len(requests.get(url).text)

1427675

In [104]:
# Function to extract the body of the book
def get_book(url):
    # Send an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discard the metadata at the beginning of the book
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*\*\*\*", raw).end()
    # Discard the metadata at the end of the book (assumes the file carries
    # the matching "*** END OF THE PROJECT GUTENBERG EBOOK" marker)
    stop = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK", raw).start()
    # Keep only the text between the two markers
    text = raw[start:stop]
    return text

In [105]:
book=get_book(url)
book
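
If you want to try the marker-stripping idea without a network call, the sketch below runs it on a tiny in-memory string. The helper name strip_gutenberg_markers and the sample text are hypothetical, and the END marker format is an assumption about how the file is laid out.

```python
import re

# Hypothetical stand-in for a downloaded Project Gutenberg file.
raw = (
    "Header metadata\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK THE IDIOT ***\n"
    "Part I. Towards the end of November...\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK THE IDIOT ***\n"
    "License text\n"
)

def strip_gutenberg_markers(raw):
    """Return the text between the START and END marker lines, if both exist."""
    start = re.search(r"\*\*\* START OF THE PROJECT GUTENBERG EBOOK .*\*\*\*", raw)
    stop = re.search(r"\*\*\* END OF THE PROJECT GUTENBERG EBOOK .*\*\*\*", raw)
    if start is None or stop is None:
        return raw  # markers not found: fall back to the full text
    return raw[start.end():stop.start()]

body = strip_gutenberg_markers(raw)
print(body.strip())   # Part I. Towards the end of November...
```

Checking for None before slicing avoids an AttributeError when a file does not contain the expected markers.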



# Preprocessing

In [68]:
def preprocess(sentence):
    return re.sub('[^A-Za-z0-9.]+' , ' ', sentence).lower()

In [90]:
#calling the above function
book = get_book(url)
processed_book = preprocess(book)
print(processed_book)

 the idiot by fyodor dostoyevsky translated by eva martin contents part i part ii part iii part iv part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages found themselves opposite each ot

In [91]:
processed_book

' the idiot by fyodor dostoyevsky translated by eva martin contents part i part ii part iii part iv part i i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages found themselves opposite each o

# 2. Perform some exploratory data analysis on this data using regex

In [92]:
# Count the number of times "the" appears in the book
len(re.findall(r"the",processed_book))

53
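
Keep in mind that the pattern "the" also matches inside longer words such as "they" or "other". If you want to count only the standalone word, one option is to wrap the pattern in word boundaries, as in this sketch on a made-up sentence:

```python
import re

sample = "the other day they met the author"

# Plain substring match: also counts the "the" inside "other" and "they".
substring_hits = len(re.findall(r"the", sample))

# Word boundaries restrict the match to the standalone word "the".
whole_word_hits = len(re.findall(r"\bthe\b", sample))

print(substring_hits)    # 4
print(whole_word_hits)   # 2
```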

In [98]:
#Replace "i" with "I"
re.sub(r"\si\s"," I ",processed_book)

' the idiot by fyodor dostoyevsky translated by eva martin contents part I part ii part iii part iv part I i. towards the end of november during a thaw at nine o clock one morning a train on the warsaw and petersburg railway was approaching the latter city at full speed. the morning was so damp and misty that it was only with great difficulty that the day succeeded in breaking and it was impossible to distinguish anything more than a few yards away from the carriage windows. some of the passengers by this particular train were returning from abroad but the third class carriages were the best filled chiefly with insignificant persons of various occupations and degrees picked up at the different stations nearer town. all of them seemed weary and most of them had sleepy eyes and a shivering expression while their complexions generally appeared to have taken on the colour of the fog outside. when day dawned two passengers in one of the third class carriages found themselves opposite each o

# Find all occurrences of text in the format "abc--xyz"

In [106]:
re.findall(r"[a-zA-Z0-9]*--[a-zA-Z0-9]*",book)

['one--the', 'away--you']
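
Because * allows zero characters, the pattern above can also match a bare "--" or a dash pair with text on only one side. If you need text on both sides, a possible variation is to use + instead, as sketched here on an invented string:

```python
import re

sample = "one--the other -- stray--"

# * permits empty sides, so "--" and "stray--" are matched too.
loose = re.findall(r"[a-zA-Z0-9]*--[a-zA-Z0-9]*", sample)

# + requires at least one character on each side of the dashes.
strict = re.findall(r"[a-zA-Z0-9]+--[a-zA-Z0-9]+", sample)

print(loose)    # ['one--the', '--', 'stray--']
print(strict)   # ['one--the']
```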