## Regular Expressions in Python

In Python, regular expressions are supported by the re module. 

In [2]:
import re

## Basic Patterns: Ordinary Characters

Ordinary characters are the simplest regular expressions. They match themselves exactly and have no special meaning in regular expression syntax.

In [2]:
pattern = r"Cookie"
sequence = "Cookie"
if re.match(pattern, sequence):
  print("Match!")
else: print("Not a match!")

Match!


The regex search returns an object with several methods that give details about it.
These methods include group which returns the string matched, start and end which return the start and ending positions of the first match, and span which returns the start and end positions of the first match as a tuple.

In [4]:
pattern = r"pam"

match = re.search(pattern, "eggspamsausage")
if match:
   print(match.group())
   print(match.start())
   print(match.end())
   print(match.span())

pam
4
7
(4, 7)


In [3]:
re.match(r'\n', '\\n\n\n')  # returns None: the text starts with a backslash, not a newline

In [4]:
re.match(r'\n', '\n\n\n')  # returns a Match object, not True

<re.Match object; span=(0, 1), match='\n'>

The match() function returns a match object if the text matches the pattern. Otherwise it returns None. The re module also contains several other functions and you will learn some of them later on in the tutorial. 

For now, though, let's focus on ordinary characters! Do you notice the r at the start of the pattern Cookie? 

This is called a raw string literal. It changes how the string literal is interpreted. Such literals are stored as they appear.

For example, \ is just a backslash when the string is prefixed with an r, rather than the start of an escape sequence. You will see what this means with special characters. Regular expression syntax often involves backslash-escaped characters, and the raw r prefix prevents Python from interpreting them as string escape sequences. You don't actually need it for this example, but it is good practice to use it for consistency.

## Metacharacters

A more understandable explanation:

Metacharacters are what make regular expressions more powerful than normal string methods.
They allow you to create regular expressions to represent concepts like "one or more repetitions of a vowel".

The existence of metacharacters poses a problem if you want to create a regular expression (or regex) that matches a literal metacharacter, such as "$". You can do this by escaping the metacharacters by putting a backslash in front of them.
However, this can cause problems, since backslashes also have an escaping function in normal Python strings. This can mean putting three or four backslashes in a row to do all the escaping.
To avoid this, you can use a raw string, which is a normal string literal with an "r" in front of it. We saw raw strings in use in the previous section.
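As a minimal sketch of the escaping described above, the cell below matches a literal "$" and "." three ways: with a raw string, with doubled backslashes in a normal string, and with `re.escape()`. The sample text and variable names are made up for illustration.

```python
import re

price_text = "Total: $5.00 (approx.)"

# Escape the metacharacters by hand inside a raw string.
dollar = re.search(r"\$\d+\.\d+", price_text)
print(dollar.group())  # $5.00

# Without a raw string, every backslash has to be doubled.
same = re.search("\\$\\d+\\.\\d+", price_text)
print(same.group() == dollar.group())  # True

# re.escape() builds an escaped pattern from an arbitrary literal string.
escaped = re.escape("(approx.)")
literal_match = re.search(escaped, price_text)
print(literal_match.group())  # (approx.)
```

`re.escape()` is convenient when the literal text comes from user input and may contain metacharacters you did not anticipate.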

#### Wild Card Characters: Special Characters

##### Special characters are characters which do not match themselves as seen but actually have a special meaning when used in a regular expression. 

The most widely used special characters are:

# . - A period. Matches any single character except newline character.

In [6]:
pattern = r"gr.y"

if re.match(pattern, "grey"):
   print("Match 1")

if re.match(pattern, "gray"):
   print("Match 2")

if re.match(pattern, "blue"):
   print("Match 3")

Match 1
Match 2


In [5]:
re.search(r'Co.k.e', 'Co kie').group()

'Co kie'

The group() function returns the string matched by the re. You will see this function in more detail later.

In [5]:
re.search(r'l..' ,'Man lived a century ago').group()

'liv'

#### Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. 

# \w - Lowercase w. Matches any single letter, digit or underscore.

\w - single character alpha-numeric, or underscore

In [8]:
re.search(r'Co\wk\we', 'Cookie').group()

'Cookie'

In [7]:
re.search(r'I\W.....' ,'Today is I@gmail.com match').group()

'I@gmail'

In [9]:
re.search(r'ce\w..' ,'Man lived a ce_T!ry ago ce_90').group()  # Man lived a century ago
# search() returns the leftmost part of the string that satisfies the pattern

'ce_T!'

In [33]:
re.search(r'ce\w..' ,'Man lived a ce;!ury ago ce_90').group()  # Man lived a century ago

'ce_90'

In [10]:
re.search(r'ce\w..' ,'Man lived a ce;Tury ago ce_9')   # .group()
# Man lived a century ago
# 'ce_9' has only one character after '_', so the two dots cannot both match
# The '.' placeholder also cannot match '\n'

# If nothing satisfies the pattern, None is returned; calling group() on it raises AttributeError because NoneType has no such method
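A small sketch of how to guard against the `None` case described above. The helper name `first_match` is hypothetical and not part of the `re` module.

```python
import re

def first_match(pattern, text):
    """Return the matched text, or None when the pattern is not found."""
    match = re.search(pattern, text)
    if match is None:
        return None
    return match.group()

found = first_match(r'ce\w..', 'Man lived a ce_T!ry ago')
missing = first_match(r'ce\w..', 'Man lived a ce;Tury ago')
print(found)    # ce_T!
print(missing)  # None
```

Checking the result before calling `group()` avoids the AttributeError without needing a try/except block.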

# \W - Uppercase w. Matches any character not part of \w (lowercase w).

The opposite of \w: matches any character that is not alphanumeric or an underscore.

In [42]:
re.search(r'C\Wke', 'C@ke')

<re.Match object; span=(0, 4), match='C@ke'>

In [11]:
re.search(r'C\Wke', 'C@ke').group()

'C@ke'

In [18]:
re.search(r'ce\W.' ,'Man lived a century ago')  # None

In [12]:
re.search(r'ce\W.' ,'Man lived a ce%tury ago').group()

'ce%t'

In [13]:
re.search(r'ce\w.' ,'Man lived a ce%tury ago')  # None

# \s - Lowercase s. Matches a single whitespace character like: space, newline, tab, return.

In [14]:
re.search(r'shut\stoday', 'BSE, NSE shut@today as Mumbai goes to polls')  # None

# \S - Uppercase s. Matches any character not part of \s (lowercase s).

In [15]:
re.search(r'Cook\Se', 'Cookie').group()

'Cookie'

In [16]:
re.search(r'shut\Stoday', 'BSE, NSE shut today as Mumbai goes to polls')  # None

In [17]:
re.search(r'shut\Stoday', 'BSE, NSE shut@today as Mumbai goes to polls').group()

'shut@today'

\n - Lowercase n. Matches newline.

\r - Lowercase r. Matches a carriage return.

\d - Lowercase d. Matches decimal digit 0-9.


In [21]:
re.search(r'c\d\dkie', 'c00kie').group()

'c00kie'

# ^ - Caret. Matches a pattern at the start of the string.

In [19]:
re.search(r'^Eat', 'Eat cake').group()

'Eat'

Don't place ^ at the end of the pattern, like this:

In [20]:
re.search(r'Eat^', 'cake Eat')  # None

# $ - Matches a pattern at the end of string.

In [22]:
re.search(r'cake$', 'Eat everyday cake').group()

'cake'

Don't place $ at the start of the pattern, like this:

In [23]:
re.search(r'$cake', 'cake Eat everyday')  # None

In [7]:
pattern = r"^gr.y$"

if re.match(pattern, "grey"):
   print("Match 1")

if re.match(pattern, "gray"):
   print("Match 2")

if re.match(pattern, "stingray"):
   print("Match 3")

Match 1
Match 2


# Character Classes

Character classes provide a way to match only one of a specific set of characters.
A character class is created by putting the characters it matches inside square brackets.

\[abc\] - Matches a or b or c.

### \[a-zA-Z0-9\] - Matches any single character in the ranges a to z, A to Z, or 0 to 9. Characters outside a set can be matched by complementing it: if the first character of the set is ^, every character that is not in the set is matched.

In [8]:
pattern = r"[aeiou]"

if re.search(pattern, "grey"):
   print("Match 1")

if re.search(pattern, "qwertyuiop"):
   print("Match 2")

if re.search(pattern, "rhythm myths"):
   print("Match 3")

Match 1
Match 2


In [24]:
re.search(r'Number: [0-6]', 'Number: 5').group()

'Number: 5'

In [25]:
re.search(r'[0-9]', ' This is my 5 st car').group()

'5'

In [9]:
pattern = r"[A-Z][A-Z][0-9]"

if re.search(pattern, "LS8"):
   print("Match 1")

if re.search(pattern, "E3"):
   print("Match 2")

if re.search(pattern, "1ab"):
   print("Match 3")

Match 1


Place a ^ at the start of a character class to invert it.
This causes it to match any character other than the ones included.
Other metacharacters, such as $ and ., have no special meaning within character classes.
The metacharacter ^ has no meaning unless it is the first character in a class.

In [27]:
# Matches any character except 5
re.search(r'Number: [^5]', 'Number: 09').group()

'Number: 0'

In [26]:
re.search(r'Number: [^5]', 'Number: 5 0')  # None

In [28]:
re.search(r'[^5]', ' virat scored 22 runs').group()

' '
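The character class rules above can be sketched with a few `findall` calls. The sample strings are invented for illustration.

```python
import re

# Inside a class, '.' and '$' are literal characters.
literals = re.findall(r'[.$]', 'cost: $4.50')
print(literals)      # ['$', '.']

# '^' negates the class only when it comes first.
negated = re.findall(r'[^0-9]', 'a1b2')
print(negated)       # ['a', 'b']

# Anywhere else in the class, '^' is just a caret.
caret = re.findall(r'[0-9^]', 'a1^2')
print(caret)         # ['1', '^', '2']
```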

# \A - Uppercase a. Matches only at the start of the string. Unlike ^, this stays true in MULTILINE mode.

In [29]:
re.search(r'\A[A-E]ookie', 'cookie Cookie')  # None

In [55]:
re.search(r'\A[A-E]ookie', 'Cookie').group()

'Cookie'
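To see how \A differs from ^, here is a short sketch that uses the `re.MULTILINE` flag on a made-up two-line string.

```python
import re

text = "first line\nsecond line"

# With re.MULTILINE, ^ also matches right after each newline...
caret_hits = re.findall(r'^\w+', text, flags=re.MULTILINE)
print(caret_hits)   # ['first', 'second']

# ...but \A still matches only at the very start of the string.
anchor_hits = re.findall(r'\A\w+', text, flags=re.MULTILINE)
print(anchor_hits)  # ['first']
```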

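# \b - Lowercase b. Matches only at the beginning or end of a word.

\b marks the boundary between a '\w' and a '\W' character, or between a '\w' character and the start or end of the string.

e.g. r'\bfoo\b' matches 'foo', 'foo.' and 'bar foo baz', but not 'pfoo' or 'foobar'.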

In [30]:
re.search(r'\b[a-z]umbai', 'mumbait bumbai').group()

'mumbai'

In [31]:
re.search(r'\b[a-z]umbai', 'Mumbai mumbai').group()

'mumbai'

In [32]:
re.search(r'\b[a-z]umbai', 'Mumbai mumbai Mumbai').group()

'mumbai'

In [34]:
re.search(r'\bmumba\b', 'Mumbai mumba Mumbai').group()

'mumba'

In [35]:
re.search(r'mumba\b[a-z]', 'Mumbai mumbai Mumbai')  # None

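This returns None because \b needs a word character on one side and a non-word character (or the edge of the string) on the other, whereas this pattern requires word characters ('a' and [a-z]) on both sides.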

## The cell below illustrates how raw strings behave

In [36]:
print(r'\\n', r'\n', '\n', str(r'\n'), repr('\n'), str(str(r'\n')), str(str(r'\\n')))
print(repr(2), type(repr('\n')), type(r'\n'))
print(len('\n'), len(r'\n'), len('\\n'))
print(re.match(r'\\n', '\n'),  re.match(r'\n', '\n'), re.match(r'\n', '\\n'))

s = """l
    k"""
print(
    s,
    repr(s),
    repr(repr(s)),
    repr(repr(repr(s))),
    sep='\t'
)

\\n \n 
 \n '\n' \n \\n
2 <class 'str'> <class 'str'>
1 2 2
None <re.Match object; span=(0, 1), match='\n'> None
l
    k	'l\n    k'	"'l\\n    k'"	'"\'l\\\\n    k\'"'


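\ - Backslash. In a regular expression, a backslash either introduces a special sequence (for example, \n matches a newline and \s matches whitespace) or removes the special meaning of the metacharacter that follows it (for example, \\ matches a literal backslash and \? matches a literal question mark). Python string literals apply their own backslash rules first: in a normal string, a recognized escape such as \n is converted before the re module sees it, while an unrecognized one such as \s is passed through unchanged. Raw strings avoid this ambiguity.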

In [37]:
# \\ matches a literal backslash, so the 's' that follows is an ordinary 's', not part of \s
re.search(r'Back\\stail', r'Back\stail').group()

'Back\\stail'

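Any metacharacter can be escaped with a backslash in the same way; \\ is simply the case where the escaped character is the backslash itself.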

In [43]:
# Here \s is the whitespace class, because its backslash is not itself escaped
re.search(r'Back\stail\?', 'Back tail?').group()
# \? matches a literal question mark

'Back tail?'

In [38]:
# Here \s is the whitespace class, because its backslash is not itself escaped
re.search(r'Back\stail?', 'Back tail?').group()
# The unescaped ? is a quantifier that makes the preceding 'l' optional

'Back tail'

In [39]:
# Here \s is the whitespace class, because its backslash is not itself escaped
re.search(r'Back\stail\\?', r'Back tail\?').group()
# \\? is an optional literal backslash; the '?' in the text is not part of the match

'Back tail\\'

In [40]:
# \\ matches a literal backslash and the \s after it matches the space
re.search(r'Back\\\stail', 'Back\\ tail').group()

'Back\\ tail'

# Repetitions

It becomes quite tedious if you are looking to find long patterns in a sequence. Fortunately, the re module handles repetitions using the following special characters:

# + - Matches one or more repetitions of the element to its left.

In [11]:
pattern = r"g+"

if re.match(pattern, "g"):
   print("Match 1")

if re.match(pattern, "gggggggggggggg"):
   print("Match 2")

if re.match(pattern, "abc"):
   print("Match 3")

Match 1
Match 2


In [41]:
re.search(r'Co+kie', '4 Cooookie 5 Coooookie').group()

'Cooookie'

# * - Matches zero or more repetitions of the element to its left.

In [12]:
# Note: this example uses ? (zero or one repetition, described later) rather than *
pattern = r"ice(-)?cream"

if re.match(pattern, "ice-cream"):
   print("Match 1")

if re.match(pattern, "icecream"):
   print("Match 2")

if re.match(pattern, "sausages"):
   print("Match 3")

if re.match(pattern, "ice--ice"):
   print("Match 4")

Match 1
Match 2


In [44]:
# Matches 'C', then zero or more 'a', then zero or more 'o', then 'kie'
re.search(r'Ca*o*kie', 'Cokie Caokie').group()

'Cokie'

In [45]:
result = re.findall(r'\w+', r'AV is largest Analytics community of India\.')
print(result)

['AV', 'is', 'largest', 'Analytics', 'community', 'of', 'India']


In [46]:
result = re.findall(r'\w*', r'AV is largest Analytics community of India\.')
print(result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '', '', '']


In [49]:
result=re.findall(r'\w*','AV is largest Analytics community of India.')
print (result)

['AV', '', 'is', '', 'largest', '', 'Analytics', '', 'community', '', 'of', '', 'India', '', '']


# ? - Matches zero or one repetition of the element to its left.

In [47]:
# Matches 'Colo', then zero or one 'u', then 'r'
re.search(r'Colou?r', 'Color').group()

'Color'

## But what if you want to check for an exact number of repetitions?

# Curly Braces

Curly braces can be used to represent the number of repetitions between two numbers.
The regex {x,y} means "between x and y repetitions of something".
Hence {0,1} is the same thing as ?.
If the first number is missing, it is taken to be zero. If the second number is missing, it is taken to be infinity.

For example, consider checking the validity of a phone number in an application. The re module handles this gracefully using the following quantifiers:

{x} - Repeat exactly x number of times.

{x,} - Repeat at least x times or more.

{x,y} - Repeat at least x times but no more than y times. Do not put a space after the comma: {x, y} is treated as literal text.

In [48]:
re.search(r'\d{9,10}', '9920996342').group()

'9920996342'

In [15]:
re.search(r'\d{,}', '9920996342').group()

# {,} means zero or more repetitions, like *; see the summary of quantifiers below

'9920996342'

In [13]:
pattern = r"9{1,3}$"

if re.match(pattern, "9"):
   print("Match 1")

if re.match(pattern, "999"):
   print("Match 2")

if re.match(pattern, "9999"):
   print("Match 3")

Match 1
Match 2


Some more metacharacters are *, +, ?, { and }.
These specify numbers of repetitions.
The metacharacter * means "zero or more repetitions of the previous thing". It tries to match as many repetitions as possible. The "previous thing" can be a single character, a class, or a group of characters in parentheses.

In [10]:
pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
   print("Match 1")

if re.match(pattern, "eggspamspamegg"):
   print("Match 2")

if re.match(pattern, "spam"):
   print("Match 3")

Match 1
Match 2


# The + and * qualifiers are said to be greedy.

In [62]:
email_address = 'Please contact us at: sup.p.o-rt@g-.mail.com'
# Inside a character class, . and \. both mean a literal dot, so either spelling works
re.search(r'([\w.-]+)@([\w\.-]+)', email_address).group()

'sup.p.o-rt@g-.mail.com'
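To make "greedy" concrete, here is a minimal sketch on an invented HTML-like string. It also shows the non-greedy form `*?`, which is not covered elsewhere in this notebook and is included only for contrast.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* grabs as much as it can, so the match runs to the last '>'.
greedy = re.search(r'<.*>', html).group()
print(greedy)   # <b>bold</b> and <i>italic</i>

# Non-greedy: .*? stops at the first '>' that lets the pattern succeed.
lazy = re.search(r'<.*?>', html).group()
print(lazy)     # <b>
```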

## Minor summary for +, *, ?, {}

+ {min_reps,max_reps}  `# both inclusive; integers, written without spaces`

* {,3} ==> {0,3} # 0 is the default for the first argument
* {1,} ==> {1,inf} # infinity is the default for the second argument


*  \+ ==> {1,}
*  \* ==> {,}
*  ? ==> {,1}
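The equivalences in the summary above can be checked with a short sketch. The sample string is arbitrary, and the last two lines illustrate that braces containing a space are matched as literal text rather than as a quantifier.

```python
import re

sample = 'a aa aaa'

# Each shorthand quantifier paired with its curly-brace spelling.
pairs = [
    (r'a+', r'a{1,}'),
    (r'a*', r'a{,}'),
    (r'a?', r'a{,1}'),
]
results = [re.findall(short, sample) == re.findall(braces, sample)
           for short, braces in pairs]
print(results)  # [True, True, True]

# With a space inside the braces, the whole thing is literal text.
spaced_on_digits = re.search(r'a{1, 2}', 'aa')
spaced_on_literal = re.search(r'a{1, 2}', 'a{1, 2}')
print(spaced_on_digits)            # None
print(spaced_on_literal.group())   # a{1, 2}
```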

# Groups

A group can be created by surrounding part of a regular expression with parentheses.
This means that a group can be given as an argument to metacharacters such as * and ?.

In [16]:
pattern = r"egg(spam)*"

if re.match(pattern, "egg"):
   print("Match 1")

if re.match(pattern, "eggspamspamspamegg"):
   print("Match 2")

if re.match(pattern, "spam"):
   print("Match 3")

Match 1
Match 2


The content of groups in a match can be accessed using the group function.
A call of group(0) or group() returns the whole match.
A call of group(n), where n is greater than 0, returns the nth group from the left.
The method groups() returns all groups from 1 upward as a tuple.

In [19]:
pattern = r"a(bc)(de)(f(g)h)i(j)k"

match = re.match(pattern, "abcdefghijklmnop")
if match:
   print(match.group())
   print(match.group(0))
   print(match.group(1))
   print(match.group(2))
   print(match.groups())
# Passing -1 to groups() does not change the output: the argument is only the default value for groups that did not take part in the match

abcdefghijk
abcdefghijk
bc
de
('bc', 'de', 'fgh', 'g', 'j')


There are several kinds of special groups.
Two useful ones are named groups and non-capturing groups.
Named groups have the format (?P<name>...), where name is the name of the group, and ... is the content. They behave exactly the same as normal groups, except that they can be accessed by group(name) in addition to their number.
Non-capturing groups have the format (?:...). They are not accessible by the group method, so they can be added to an existing regular expression without breaking the numbering.

In [20]:
pattern = r"(?P<first>abc)(?:def)(ghi)"

match = re.match(pattern, "abcdefghi")
if match:
   print(match.group("first"))
   print(match.groups())

abc
('abc', 'ghi')
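As a further sketch of named groups, the cell below parses a made-up date string and reads the groups by name, by number, and as a dictionary through the match object's `groupdict()` method. The group names are chosen for illustration.

```python
import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
match = re.match(pattern, "2024-03-15")
if match:
    print(match.group("year"))   # 2024
    print(match.group(1))        # 2024 (named groups are still numbered)
    print(match.groupdict())     # {'year': '2024', 'month': '03', 'day': '15'}
```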


# | - used as OR

Another important metacharacter is |.
This means "or", so red|blue matches either "red" or "blue".

In [21]:
pattern = r"gr(a|e)y"

match = re.match(pattern, "gray")
if match:
   print ("Match 1")

match = re.match(pattern, "grey")
if match:
   print ("Match 2")    

match = re.match(pattern, "griy")
if match:
    print ("Match 3")

Match 1
Match 2


# Special Sequences

## In-short

There are various special sequences you can use in regular expressions. They are written as a backslash followed by another character.
One useful special sequence is a backslash and a number between 1 and 99, e.g., \1 or \17. This matches the same text that the group with that number matched.

In [22]:
pattern = r"(.+) \1"

match = re.match(pattern, "word word")
if match:
   print ("Match 1")

match = re.match(pattern, "?! ?!")
if match:
   print ("Match 2")    

match = re.match(pattern, "abc cde")
if match:
   print ("Match 3")

Match 1
Match 2
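One possible use of a backreference, sketched on an invented sentence, is to find words that were accidentally typed twice in a row.

```python
import re

text = "This is is a test of the the backreference."

# (\w+) captures a word; \1 requires the same word again after one space.
doubled = re.findall(r'\b(\w+) \1\b', text)
print(doubled)  # ['is', 'the']
```

The \b anchors keep the pattern from matching inside longer words, such as the "is" at the end of "This".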


More useful special sequences are \d, \s, and \w.
These match digits, whitespace, and word characters respectively.
In ASCII mode they are equivalent to \[0-9\], \[ \t\n\r\f\v\], and \[a-zA-Z0-9_\].

+ \d ==> \[0-9\]
+ \w ==> \[a-zA-Z0-9_\]
+ \s ==> \[ \t\n\r\f\v\]  `# the leading space is part of the set`

In Unicode mode they match certain other characters, as well. For instance, \w matches letters with accents.

Versions of these special sequences with upper case letters - \D, \S, and \W - mean the opposite to the lower-case versions. For instance, \D matches anything that isn't a digit.

In [23]:
pattern = r"(\D+\d)"

match = re.match(pattern, "Hi 999!")

if match:
   print("Match 1")

match = re.match(pattern, "1, 23, 456!")
if match:
   print("Match 2")

match = re.match(pattern, " ! $?")
if match:
    print("Match 3")

Match 1


Additional special sequences are \A, \Z, and \b.
The sequences \A and \Z match the beginning and end of a string, respectively.
The sequence \b matches the empty string between \w and \W characters, or \w characters and the beginning or end of the string. Informally, it represents the boundary between words.
The sequence \B matches the empty string anywhere else.

In [24]:
pattern = r"\b(cat)\b"

match = re.search(pattern, "The cat sat!")
if match:
   print ("Match 1")

match = re.search(pattern, "We s>cat<tered?")
if match:
   print ("Match 2")

match = re.search(pattern, "We scattered.")
if match:
   print ("Match 3")

Match 1
Match 2
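The difference between \b and \B can be sketched on a short made-up string that contains "cat" as a whole word, inside a word, and at the end of a word.

```python
import re

text = "cat scattered bobcat"

# \b...\b: 'cat' as a whole word only.
whole_words = re.findall(r'\bcat\b', text)
print(whole_words)   # ['cat']

# \B...\B: 'cat' with word characters on both sides.
inner_starts = [m.start() for m in re.finditer(r'\Bcat\B', text)]
print(inner_starts)  # [5]
```

Only the "cat" inside "scattered" satisfies \B on both sides; the one in "bobcat" touches the end of the string, which counts as a boundary.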


# Email Extraction

To demonstrate a sample usage of regular expressions, let's create a program to extract email addresses from a string.
Suppose we have a text that contains an email address:

`text = "Please contact info@sololearn.com for assistance"`

Our goal is to extract the substring "info@sololearn.com".
A basic email address consists of a word and may include dots or dashes. This is followed by the @ sign and the domain name (the name, a dot, and the domain name suffix).
This is the basis for building our regular expression.

`pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"`

**\[\w\.-\]+** matches one or more word characters, dots or dashes.
The regex above says that the string should contain a word (with dots and dashes allowed), followed by the @ sign, then another similar word, then a dot and another word.

  Our regex contains three groups:
*  1 - first part of the email address.
*  2 - domain name without the suffix.
*  3 - the domain suffix.

In [25]:
pattern = r"([\w\.-]+)@([\w\.-]+)(\.[\w\.]+)"
# Named 'text' rather than 'str' to avoid shadowing the built-in str type
text = "Please contact info@sololearn.com for assistance"

match = re.search(pattern, text)
if match:
   print(match.group())

info@sololearn.com


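In case the string contains multiple email addresses, we could use the re.findall function instead of re.search to extract all of them. Because this pattern contains groups, re.findall would return a tuple of the three groups for each address rather than the full address.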

The regex in this example is for demonstration purposes only.
A much more complex regex is required to fully validate an email address.

# Gen Funcs

## Search() vs Match()

The match() function checks for a match only at the beginning of the string (by default) whereas the search() function checks for a match anywhere in the string.
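A minimal side-by-side sketch of the two functions on an invented string:

```python
import re

text = "eggs and spam"

match_spam = re.match(r'spam', text)      # 'spam' is not at the start
search_spam = re.search(r'spam', text)    # found further along the string
match_eggs = re.match(r'eggs', text)      # 'eggs' is at the start

print(match_spam)           # None
print(search_spam.span())   # (9, 13)
print(match_eggs.span())    # (0, 4)

# search() with ^ behaves like match() on a single-line string.
anchored = re.search(r'^spam', text)
print(anchored)             # None
```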

## findall(pattern, string, flags=0)

Finds all non-overlapping matches in the entire sequence and returns them as a list. If the pattern contains no groups, each item is the matched string; if it contains groups, each item is the group (or a tuple of groups) instead.

In [50]:
email_address = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# 'addresses' is a list that stores all the matches
addresses = re.findall(r'[\w\.-]+@[\w\.-]+', email_address)
for address in addresses: 
    print(address)

support@datacamp.com
xyz@datacamp.com
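Adding parentheses to the pattern changes what `findall` returns. The sketch below reuses the sample text from the cell above with and without groups.

```python
import re

text = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

# No groups: each item is the full match.
full = re.findall(r'[\w.-]+@[\w.-]+', text)
print(full)
# ['support@datacamp.com', 'xyz@datacamp.com']

# Two groups: each item is a (user, domain) tuple.
parts = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(parts)
# [('support', 'datacamp.com'), ('xyz', 'datacamp.com')]
```

If you need grouping for a quantifier but still want full matches back, a non-capturing group (?:...) keeps `findall` returning strings.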


In [51]:
match = re.search(r'[\w.-]+@[\w.-]+', 'fcggf: a?.wlice-b@google.com, gvj@kgh.com')
if match:
  print(match.group())  # '.wlice-b@google.com': '?' is not in the class, so the match starts after it

.wlice-b@google.com


In [54]:
re.search(r'[\.w-]+@[\w.-]+', 'fcggf: alice-b@google.com, gvj@kgh.com')  # None: [\.w-] is the set of '.', 'w' and '-' (the backslash escapes the dot), not \w plus '.' and '-'

In [52]:
re.search(r'\.+', ',  . ..').group()

'.'

The function **re.finditer** works like *re.findall*, except that it returns an **iterator** of match objects rather than a *list* of strings.
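A short sketch of `re.finditer` on the same sample text used earlier. Because each item is a match object, positions are available as well as the matched text.

```python
import re

text = "Please contact us at: support@datacamp.com, xyz@datacamp.com"

found = []
for match in re.finditer(r'[\w.-]+@[\w.-]+', text):
    # Each item is a match object, so group() and span() both work.
    found.append((match.group(), match.span()))
    print(match.group(), match.span())
```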

## sub(pattern, repl, string, count=0, flags=0)

This is the substitute function. It returns the string obtained by replacing or substituting the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern is not found then the string is returned unchanged.

In [3]:
# Named 'text' rather than 'str' to avoid shadowing the built-in str type
text = "My name is David. Hi David."
pattern = r"David"
newstr = re.sub(pattern, "Amy", text)
print(newstr)

My name is Amy. Hi Amy.


In [53]:
email_address = "Please contact us at: xyz@datacamp.com"
new_email_address = re.sub(r'([\w\.-]+)@([\w\.-]+)', r'support@datacamp.com', email_address)
print(new_email_address)

Please contact us at: support@datacamp.com
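The replacement string can also refer back to groups, and `count` limits how many matches are replaced. This is a sketch with invented addresses; `example.org` is only a placeholder domain.

```python
import re

text = "Contact xyz@datacamp.com or abc@datacamp.com"
pattern = r'([\w.-]+)@([\w.-]+)'

# \1 in the replacement reuses the user name captured by group 1.
all_replaced = re.sub(pattern, r'\1@example.org', text)
print(all_replaced)
# Contact xyz@example.org or abc@example.org

# count=1 replaces only the leftmost match.
first_replaced = re.sub(pattern, r'\1@example.org', text, count=1)
print(first_replaced)
# Contact xyz@example.org or abc@datacamp.com
```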


# Case Study: Working with Regular Expressions

In [67]:
#! pip install requests

In [56]:
# import re
import requests

In [57]:
the_idiot_url = 'https://www.gutenberg.org/files/2638/2638-0.txt'
the_idiot_url

'https://www.gutenberg.org/files/2638/2638-0.txt'

In [58]:
def get_book(url):
    # Sends an HTTP request to get the text from Project Gutenberg
    raw = requests.get(url).text
    # Discards the metadata from the beginning of the book
    # (re.search returns None if the marker is missing, which makes .end() fail)
    start = re.search(r"\*\*\* START OF THIS PROJECT GUTENBERG EBOOK .* \*\*\*", raw).end()
    # Stops at the first occurrence of "II", so only the text before it is kept
    stop = re.search(r"II", raw).start()
    # Keeps the relevant text
    text = raw[start:stop]
    return text

def preprocess(sentence):
    # Replaces every run of characters other than letters, digits and periods
    # with a single space, then lowercases the result
    return re.sub(r'[^A-Za-z0-9.]+', ' ', sentence).lower()

book = get_book(the_idiot_url)
processed_book = preprocess(book)
print(processed_book)


this cloak was a young fellow also of about twenty six or twenty seven years of age slightly above the middle height very fair with a thin pointed and very light coloured beard his eyes were large and blue and had an intent look about them yet that heavy expression which some people affirm to be a peculiarity as well as evidence of an epileptic subject. his face was decidedly a pleasant one for all that refined but quite colourless except for the circumstance that at this moment it was blue with cold. he held a bundle made up of an old faded silk handkerchief that apparently contained all his travelling wardrobe and wore thick shoes and gaiters his whole appearance being very un russian. his black haired neighbour inspected these peculiarities having nothing better to do and at length remarked with that rude enjoyment of the discomforts of others which the common classes so often show cold very said his neighbour readily and this is a thaw too. fancy if it had been a hard frost i never

## Find the number of occurrences of the article "the" in the corpus. Hint: use the len() function. 

In [59]:
len(re.findall(r'the', processed_book))

302

This pattern also counts words that merely contain "the", such as the ones below.

In [60]:
len(re.findall(r'the', ' the them their theory'))

4

So, add a space on each side of the word:

In [61]:
len(re.findall(r' the ', processed_book))
# r' [Tt]he ' isn't needed: the text was lowercased, so 'T' makes no difference
# r'[\s.][Tt]he ' isn't needed either: in English a period is followed by a space

187

This count is closer to the real one. A quick check of the spaced pattern on a small string:

In [63]:
len(re.findall(r' [Tt]he ', 'Their the. The '))

1
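An alternative to surrounding the word with spaces, offered here as a sketch rather than as the notebook's own approach, is to use word boundaries. The sample string is made up to include the cases the spaced pattern misses.

```python
import re

sample = 'the them their theory. the, other the'

substring_count = len(re.findall(r'the', sample))      # every occurrence of the letters
spaced_count = len(re.findall(r' the ', sample))       # needs a space on both sides
boundary_count = len(re.findall(r'\bthe\b', sample))   # whole word only

print(substring_count, spaced_count, boundary_count)   # 7 0 3
```

The \b version also counts "the" at the very start or end of the text and next to punctuation, which the spaced pattern skips.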

## Try to convert every single stand-alone instance of 'i' to 'I' in the corpus. Make sure not to change the 'i' occurring inside a word:

In [64]:
processed_book = re.sub(r'\si\s', " I ", processed_book)
print(processed_book)

this cloak was a young fellow also of about twenty six or twenty seven years of age slightly above the middle height very fair with a thin pointed and very light coloured beard his eyes were large and blue and had an intent look about them yet that heavy expression which some people affirm to be a peculiarity as well as evidence of an epileptic subject. his face was decidedly a pleasant one for all that refined but quite colourless except for the circumstance that at this moment it was blue with cold. he held a bundle made up of an old faded silk handkerchief that apparently contained all his travelling wardrobe and wore thick shoes and gaiters his whole appearance being very un russian. his black haired neighbour inspected these peculiarities having nothing better to do and at length remarked with that rude enjoyment of the discomforts of others which the common classes so often show cold very said his neighbour readily and this is a thaw too. fancy if it had been a hard frost I never

### Find the number of times anyone was quoted ("") in the corpus. 

In [66]:
len(re.findall(r'\”', book))

0

##### Q) What are the words connected by '--' in the corpus?

In [65]:
re.findall(r'[a-zA-Z0-9]*--[a-zA-Z0-9]*', book)

['ironical--it',
 'malicious--smile',
 'fur--or',
 'astrachan--overcoat',
 'it--the',
 'Italy--was',
 'malady--a',
 'money--and',
 'little--to',
 'No--Mr',
 'is--where',
 'I--I',
 'I--',
 '--though',
 'crime--we',
 'or--judge',
 'gaiters--still',
 '--if',
 'through--well',
 'say--through',
 'however--and',
 'Epanchin--oh',
 'too--at',
 'was--and',
 'Andreevitch--that',
 'everyone--that',
 'reduce--or',
 'raise--to',
 'listen--and',
 'history--but',
 'individual--one',
 'yes--I',
 'but--',
 't--not',
 'me--then',
 'perhaps--',
 'Yes--those',
 'me--is',
 'servility--if',
 'Rogojin--hereditary',
 'citizen--who',
 'least--goodness',
 'memory--but',
 'latter--since',
 'Rogojin--hung',
 'him--I',
 'anything--she',
 'old--and',
 'you--scarecrow',
 'certainly--certainly',
 'father--I',
 'Barashkoff--I',
 'see--and',
 'everything--Lebedeff',
 'about--he',
 'now--I',
 'Lihachof--',
 'Zaleshoff--looking',
 'old--fifty',
 'so--and',
 'this--do',
 'day--not',
 'that--',
 'do--by',
 'know--my',
 'il

###### The END
