# **Regular Expression**
*   A regular expression (RegEx) is a special sequence of characters that helps you match or find other strings or sets of strings, using a specialized syntax held in a pattern.
*   RegEx can be used to check if a string contains the specified search pattern.
*   Python has a built-in package called **re**, which can be used to work with Regular Expressions.

# **Basics @** https://docs.python.org/3/howto/regex.html


In [None]:
import re   # regular expression module

# A regular expression is a small specialised language for searching text within a document precisely and efficiently.
# pattern --> compiled into bytecode --> executed by a matching engine written in C
# Usage: matching characters

r'''
Most characters simply match themselves (the pattern abc matches the start of abcdefgh).
Exception --> metacharacters
They don't match themselves; they signal that something special should be matched.
Complete list of metacharacters  -->         . ^ $ * + ? { } [ ] \ | ( )

'''
# The first metacharacters we will look at are --->     [ ]   (character class)
'''
Square brackets specify a character class: a set of characters you wish to match.

For example, the regex

            [xyz]

matches any single x, y or z character.

A range can also be given with a hyphen:

            [x-z]           --- equivalent to ---              [xyz]

Examples:
    'b' matches [abcdef], which can be shortened to [a-f]
    '4' matches [12345], which can be shortened to [1-5]
'''

In [None]:
#26-compile+function+and+character+class

import re

# re.compile(pattern)       -- returns a regex object

# the two patterns below do not use a character class
#regex = re.compile('a')
#regex = re.compile('ab')  # regex.match('a') returns None: the pattern needs 'a' followed by 'b'

#below is used with character class
#regex = re.compile('[abc]')
#regex = re.compile('[a-z]')
#regex = re.compile('[a-zA-Z]')
regex = re.compile('[^a-zA-Z]')  # matches any single character other than a-z or A-Z
#regex = re.compile('[+]')

# regex.match(string to match) -- returns None if no match else returns a match object
# character class

#print(regex.match('a')) # <re.Match object; span=(0, 1), match='a'>
#print(regex.match('b'))
#print(regex.match('c'))
#print(regex.match('d')) # returns None: no match

#print(regex.match('g'))  # matches with regex = re.compile('[a-h]'), a range of characters
#print(regex.match('G'))  # returns None with regex = re.compile('[a-z]'): matching is case sensitive

# complement the set [^pattern]
print(regex.match('1')) # regex = re.compile('[^a-zA-Z]') returns match
#print(regex.match('+'))  # regex = re.compile('[^a-zA-Z]') returns match


# Most metacharacters lose their special meaning inside a character class
# (exceptions: ^ as the first character, - between two characters, ] and \).
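
The character-class behaviour described in this cell can be checked with a short sketch. The sample patterns and inputs are hypothetical, picked only for illustration; `match()` tries to match only at the start of the string.

```python
import re

# Hypothetical sample patterns, chosen only for illustration.
letters = re.compile('[a-zA-Z]')        # any single ASCII letter
non_letters = re.compile('[^a-zA-Z]')   # any single character that is NOT an ASCII letter
plus_sign = re.compile('[+]')           # a literal '+': it is not special inside a class

letter_hit = letters.match('g')         # match object for 'g'
digit_miss = letters.match('1')         # None: '1' is not a letter
digit_hit = non_letters.match('1')      # match object for '1'
plus_hit = plus_sign.match('+')         # match object for '+'

print(letter_hit.group())   # g
print(digit_miss)           # None
print(digit_hit.group())    # 1
print(plus_hit.group())     # +
```

Calling `.group()` on a match object returns the text that was matched, which is easier to read than the full match-object repr.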

In [None]:
#27-special+sequences
# A Good reference @ https://www.w3schools.com/python/python_regex.asp

import re

# special sequences are commonly used character classes
# the patterns are written as raw strings (r'...') so that Python leaves the backslash alone
# the bracketed forms are the ASCII equivalents


# \d        -- matches any decimal digit --     [0-9]

regex = re.compile(r'\d')


# \D        -- matches any non-digit character  -- [^0-9]

regex = re.compile(r'\D')

# \s        -- matches any whitespace character -- [ \t\n\r\f\v]

regex = re.compile(r'\s')

# \S        -- matches any non-whitespace character -- [^ \t\n\r\f\v]

regex = re.compile(r'\S')

# \w        -- matches any alphanumeric character or underscore -- [a-zA-Z0-9_]

regex = re.compile(r'\w')

# \W        -- matches any character that is not alphanumeric or underscore -- [^a-zA-Z0-9_]

regex = re.compile(r'\W')
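
To see what each special sequence picks out, here is a small sketch that runs them over one sample string. The sample text is made up, and `re.findall` (which returns every non-overlapping match as a list) is used here only to keep the output compact.

```python
import re

sample = 'Room 42\t_ok!'   # hypothetical sample text containing a space and a tab

digits = re.findall(r'\d', sample)       # decimal digits
non_digits = re.findall(r'\D', sample)   # everything except digits
spaces = re.findall(r'\s', sample)       # whitespace characters
word_chars = re.findall(r'\w', sample)   # letters, digits and underscore
non_word = re.findall(r'\W', sample)     # everything \w does not match

print(digits)      # ['4', '2']
print(spaces)      # [' ', '\t']
print(non_word)    # [' ', '\t', '!']
print(''.join(word_chars))   # Room42_ok
```

Each upper-case sequence is the complement of its lower-case partner, so the `\d` and `\D` results together account for every character in the sample.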


In [None]:
#28-asterisk+repeating+things
# How to write powerful regular expressions that handle repeated characters

import re

# * character - this specifies that the previous character (or character class) can be matched zero or more times, instead of exactly once.

#regex = re.compile('a') # single character 'a'
#print(regex.match('a'))

# to match exactly five 'a' characters:
#regex = re.compile('aaaaa') # five 'a' characters
#print(regex.match('aaaaa'))


#regex = re.compile('[a-c]*')       # -- lower limit is 0 and there is no upper limit

#print(regex.match('caaaaaaaaaaabcaaaaaddddd'))  # matches everything up to the first 'd'
#print(regex.match('ddddd'))                     # matches the empty string: span=(0, 0), match=''


# **\***  character --> this specifies that the **previous character** can be matched **zero or more times**, instead of exactly once.

In [None]:
# -++repeating+thing

import re

# the code below only matches an exact sequence of characters

#regex = re.compile('a')
#print(regex.match('a'))

# say I want to match five a's: matching the five-character pattern against a single 'a' returns None
#regex = re.compile('aaaaa')
#print(regex.match('a'))

#regex = re.compile('aaaaa')
#print(regex.match('aaaaa'))

# *  character -- this specifies that the previous character can be matched zero or more times, instead of exactly once.
#regex = re.compile('a*') # lower limit is 0 and there is no upper limit
#print(regex.match('aaa'))  # returns match 'aaa'

#regex = re.compile('c*')
#print(regex.match('aaa')) # returns the empty match '' because the lower limit is 0

#regex = re.compile('[a-c]*') # matches any run of a, b and c characters in any order
#print(regex.match('aaaaabbcccaaaa'))  # returns match 'aaaaabbcccaaaa'

#regex = re.compile('[a-c]*') # matches any run of a, b and c characters in any order
#print(regex.match('aaaaabbcccgggggaaaa'))   # returns match 'aaaaabbccc', a substring: matching stops at the first 'g'
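
The three cases from this cell (a full match, a prefix match and an empty match) can be run side by side. This is only a sketch with shortened sample strings; the point to notice is that `*` never returns `None` from `match()`, because zero repetitions already count as a match.

```python
import re

star = re.compile('[a-c]*')   # zero or more characters from the set a, b, c

full = star.match('aaaaabbccc')            # every character belongs to the class
prefix = star.match('aaaaabbcccgggaaaa')   # matching stops at the first 'g'
empty = star.match('ddddd')                # zero repetitions: an empty match, not None

print(full.group())     # aaaaabbccc
print(prefix.group())   # aaaaabbccc
print(empty.span())     # (0, 0)
print(empty is None)    # False
```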



# **\+**  character --> this specifies that the **previous character** can be matched **one or more** times.

In [None]:
# +  character -- this specifies that the previous character can be matched one or more times

# difference from * --> 0 to infinity         + --> 1 to infinity

import re

#regex = re.compile('a+')
#print(regex.match(''))      # returns None because at least one 'a' is required

#regex = re.compile('a+')
#print(regex.match('a'))      # returns <re.Match object; span=(0, 1), match='a'>

#regex = re.compile('a+')
#print(regex.match('aaaaaaaaa'))  # returns match 'aaaaaaaaa'

# using character classes

#regex = re.compile('[a-c]+')
#print(regex.match(''))         # returns None

#regex = re.compile('[a-c]+')
#print(regex.match('aaaabbbccccaaaccc'))         # returns match 'aaaabbbccccaaaccc'

regex = re.compile('[a-c]+')
print(regex.match('aaaabdddddbbccccaaaccc'))         # returns match 'aaaab': matching stops at the first 'd'
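
A quick way to see the difference between `*` and `+` is to run both against the same inputs and record whether each one matched. The input strings below are made up for the comparison.

```python
import re

star = re.compile('a*')   # 0 to infinity
plus = re.compile('a+')   # 1 to infinity

# For each sample, record (did 'a*' match?, did 'a+' match?).
results = {}
for text in ['', 'a', 'aaa', 'baa']:
    results[text] = (star.match(text) is not None, plus.match(text) is not None)
    print(repr(text), results[text])

# ''    (True, False)
# 'a'   (True, True)
# 'aaa' (True, True)
# 'baa' (True, False)
```

The two only disagree when the string does not start with an `'a'`: `*` is satisfied by zero repetitions, `+` is not.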


# **\?** question mark --> says the previous character can appear **once or not at all**.

# **?** --> **min = 0       max = 1**

# **\{m,n}**    m and n are integer values   -- This quantifier means there must be **at least m repetitions**, and **at most n**.

In [None]:
#? +and+{m,n}+repeating+things

import re

# ? question mark --> says the previous character can appear once or not at all

regex = re.compile('a?b')        # min - 0       max - 1

#print(regex.match('ab'))  # returns match 'ab': 'a' appears once
#print(regex.match('aab'))  # returns None: 'a' appears more than once before 'b'
#print(regex.match('b'))  # returns match 'b': 'a' appears zero times
#print(regex.match('a'))  # returns None: the required 'b' is missing


# {m,n}    m and n are integer values   -- This quantifier means there must be at least m repetitions, and at most n

regex = re.compile('a{2,4}')            # accepts: aa aaa aaaa
'''
#print(regex.match('a'))    # returns None because at least two 'a' characters are expected
#print(regex.match('aa'))  # returns match 'aa'
#print(regex.match('aaa')) # returns match 'aaa'
#print(regex.match('aaaa')) # returns match 'aaaa'
#print(regex.match('aaaaa')) # returns match 'aaaa'. Note: at most four are consumed, so the last 'a' is left out
'''


# *  is equivalent to {0,}: the minimum is 0 and the omitted maximum means there is no upper limit
# in {m,n}, omitting m makes the lower limit 0 and omitting n leaves the upper limit unbounded

regex = re.compile('a{0,}')

# every print statement below returns a match under the condition above

#print(regex.match(''))
#print(regex.match('a'))
#print(regex.match('aaaaaaaa'))


# +  is equivalent to {1,}

# ?  is equivalent to {0,1}

# assignment: write these expressions and demonstrate them
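
One possible way to approach the assignment is sketched below: each shorthand is paired with its `{m,n}` spelling and both are run over the same inputs. The helper `matched_text` and the sample strings are hypothetical names made up for this sketch.

```python
import re

# Each shorthand quantifier paired with its {m,n} spelling.
pairs = [('a*', 'a{0,}'), ('a+', 'a{1,}'), ('a?', 'a{0,1}')]
samples = ['', 'a', 'aa', 'aaaaa', 'b']

def matched_text(pattern, text):
    """Return the text matched at the start of `text`, or None if there is no match."""
    m = re.match(pattern, text)
    return m.group() if m else None

all_agree = True
for short, braces in pairs:
    for text in samples:
        same = matched_text(short, text) == matched_text(braces, text)
        all_agree = all_agree and same
    print(short, 'behaves like', braces, 'on these samples:', all_agree)

bounded = matched_text('a{2,4}', 'aaaaa')    # at most four are consumed
too_short = matched_text('a{2,4}', 'a')      # fewer than two: no match

print(bounded)     # aaaa
print(too_short)   # None
```

Agreement on a handful of samples is a demonstration rather than a proof, but it makes the equivalences easy to check by eye.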


In [None]:
#metacharacters 2
# A Good reference @ https://www.w3schools.com/python/python_regex.asp

import re

# ^ (hat) character   --> matches at the start of the string; here the string should start with 'abc'
regex = re.compile('^abc')


# | (pipe) character --> is the or operator; here the pattern matches either 'a' or 'b'

regex = re.compile('a|b')

# $ character -- matches at the end of the line; here the string should end with 'abc'

regex = re.compile('abc$')


# assignment: tweak the code snippets above and write print statements from them for different conditions.
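
As a starting point for the assignment, here is a sketch that exercises the three patterns. The sample strings are made up. It uses `search()`, which scans the whole string, next to `match()`, which only looks at the start; that choice matters for `$` and `|`.

```python
import re

starts_abc = re.compile('^abc')
ends_abc = re.compile('abc$')
a_or_b = re.compile('a|b')

start_hit = starts_abc.search('abcdef')   # 'abc' is at the start
start_miss = starts_abc.search('xabc')    # None: 'abc' is not at the start
end_hit = ends_abc.search('xyzabc')       # 'abc' is at the end
end_miss = ends_abc.search('abcx')        # None: 'abc' is not at the end
either_hit = a_or_b.match('banana')       # starts with 'b'
either_miss = a_or_b.match('cab')         # None: match() only checks the first character here
either_found = a_or_b.search('cab')       # search() finds the 'a' at index 1

print(start_hit.group(), start_miss)      # abc None
print(end_hit.span(), end_miss)           # (3, 6) None
print(either_hit.group(), either_miss)    # b None
print(either_found.span())                # (1, 2)
```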

# **Searching the Parse Tree Using Beautiful Soup**

In [None]:
#Intro to Searching 
#Note: Use three_sisters.html

from bs4 import BeautifulSoup
import re

def read_file():
    # the with block closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# most popular methods

# find()
# find_all()        -- to keep it simple for now, it takes the tag name as parameter



# find() and find_all() accept filters as parameters; a filter decides which tags are retrieved.
# A filter can be a string, a regular expression, a list or a function.

# string

#print(soup.find_all('a')) # prints every 'a' tag in the HTML file
#print(soup.find_all('b')) # prints every 'b' tag in the HTML file

# regular expression

# tag names start with b
regex = re.compile('^b')

for tag in soup.find_all(regex):
    #print(tag.name) # prints the name of every tag whose name starts with 'b'
    pass

'''
This shows how a regex can be used to find tags. Far more complex regular expressions
can be written to find different kinds of tags, or to narrow the search down by class
or by other tag attributes.
'''

'''
The code snippet below returns every tag whose name contains 't'.
In this case those are html and title.
'''

regex = re.compile('t')

for tag in soup.find_all(regex):
    #print(tag.name)
    pass

'''
We can pass a list of tag names: here, all a and b tags.
The code snippet below returns every tag whose name is exactly 'a' or 'b'.
'''

for tag in soup.find_all(['a','b']):
    #print(tag.name)
    pass


'''
function: find_all() can also accept a function as its parameter.
This is just an example; we'll discuss it more when we implement find_all.
'''

'''
We will write a function which accepts a tag as its parameter
and returns True for those tags which have a class attribute.

In this document the a and p tags have a class.
'''

def has_class(tag):
    return tag.has_attr('class') # returns true or false

for tag in soup.find_all(has_class):
    print(tag.name)
    pass
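
The four kinds of filter can be tried without the file on disk. The sketch below makes two assumptions: `html_doc` is only a stand-in for `three_sisters.html` (its real contents are not shown in these notes), and the built-in `'html.parser'` is used instead of `'lxml'` so that nothing besides `bs4` has to be installed.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in for three_sisters.html; the real file may differ.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

def has_class(tag):
    return tag.has_attr('class')   # True for tags that carry a class attribute

by_string = [tag.name for tag in soup.find_all('a')]                # exact tag name
by_regex = [tag.name for tag in soup.find_all(re.compile('^b'))]    # names starting with 'b'
by_list = [tag.name for tag in soup.find_all(['a', 'b'])]           # name is 'a' or 'b'
by_function = [tag.name for tag in soup.find_all(has_class)]        # function returns True

print(by_string)     # ['a', 'a', 'a']
print(by_regex)      # ['body', 'b']
print(by_list)       # ['b', 'a', 'a', 'a']
print(by_function)   # ['p', 'p', 'a', 'a', 'a']
```

Results come back in document order, which is why `b` precedes the `a` tags in the list-filter output.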


# **find_all() introduction**

# **Signature**: find_all(name, attrs, recursive, string, limit, **kwargs) 

In [None]:
#find_all() introduction

from bs4 import BeautifulSoup

def read_file():
    # the with block closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

'''
The name parameter takes a string, a regex object, a list, a function or True.
'''
a_tags = soup.find_all('a') # find all 'a' tags
#print(a_tags)


'''
attrs parameter: a dictionary.
Each tag can carry attributes.
Ex: an 'a' tag has attributes like class, href, id.
'''

'''
Let's print the first 'a' tag, which has the attributes 'class':'sister','id':'link1'
'''

attr = {'class':'sister','id':'link1'}
first_a = soup.find_all('a', attrs=attr)
#print(first_a)


'''
Let's print all tags whose class is sister
'''

attr = {'class':'sister'}
first_a = soup.find_all('a', attrs=attr)
#print(first_a)


attr = {'class':'sister'}
first_a = soup.find_all(attrs=attr)
#print(first_a)

'''
Both code snippets above print the same output because
only 'a' tags have the class sister.
'''

'''
The code snippet below returns only the p tags whose class is story.
'''

attr = {'class':'story'}
first_a = soup.find_all(attrs=attr)
#print(first_a)

'''
limit parameter: limits the number of results returned.
In our HTML there are 3 'a' tags
'''

a_tags = soup.find_all('a') # returns all 3 'a' tags
#print(a_tags)

a_tags = soup.find_all('a',limit=1) # returns only the first 'a' tag
#print(a_tags)

a_tags = soup.find_all('a',limit=2) # returns the first 2 'a' tags
#print(a_tags)
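
The `attrs` and `limit` parameters can be checked with a self-contained sketch. As an assumption, `html_doc` is a reduced stand-in for `three_sisters.html`, and the built-in `'html.parser'` replaces `'lxml'` so no extra parser is needed.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for three_sisters.html; the real file may differ.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# attrs: every key/value pair in the dictionary must match.
first_link = soup.find_all('a', attrs={'class': 'sister', 'id': 'link1'})
sisters = soup.find_all(attrs={'class': 'sister'})   # no tag name given: any tag qualifies

# limit: stop searching once this many results have been collected.
first_two = soup.find_all('a', limit=2)

print(len(first_link))                   # 1
print(len(sisters))                      # 3
print([tag['id'] for tag in first_two])  # ['link1', 'link2']
```

Even with `limit=1`, `find_all` still returns a list, so the single result has to be taken out with an index.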

In [None]:
#find_all more parameters

from bs4 import BeautifulSoup
import re

def read_file():
    # the with block closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

# Signature: find_all(name, attrs, recursive, string, limit, **kwargs)

'''
string parameter: expects a string or a regex object as its input.
It searches the whole document and returns the NavigableString objects
whose text matches the given string or pattern.
'''

'''
We want to know whether any NavigableString contains the word "Elsie".
'''

regex = re.compile('Elsie')
tag = soup.find_all(string=regex)
#print(tag)

'''
We want to know whether any NavigableString contains the word "story".
'''

regex = re.compile('story')
tag = soup.find_all(string=regex)
#print(tag)

'''
**kwargs: attribute filters can be passed as keyword arguments.
class is a reserved word in Python, so the class attribute is written as class_.

usecase1: you want to find all the tags with class sister
usecase2: you want to find all the tags with class sister and id link1
'''

#tags = soup.find_all(class_='sister')  #usecase1
#tags = soup.find_all(class_='sister', id='link1') #usecase2
#for tag in tags:
#    print(tag)

'''
We want to print all tags with class story
'''
tags = soup.find_all(class_='story')
#print(len(tags))
for tag in tags:
    #print(tag)
    print() # keeps the loop body non-empty while the line above is commented out

'''
recursive parameter: controls whether the whole parse tree is searched recursively.

Scenario: let's say you search for "a" tags. The search starts from the "html" tag and
recursively walks the whole parse tree. It goes to "head" and head's children and checks
whether "head" contains any "a" tag, then goes to "body" and body's children: the first
"p" tag and its child, the "b" tag, and then the next "p" tag, where it finds the "a" tags.
It collects all of them and returns.
'''

title = soup.find_all('title',recursive=False) # Beautiful Soup only looks at the direct children of soup (the top-level html tag) and does not descend into their children
#print(title) # returns an empty list

title = soup.find_all('title',recursive=True) # recursive=True is the default
print(title)
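
The `string`, `class_` and `recursive` parameters from this cell can be tried together. As before, this is only a sketch: `html_doc` is an assumed stand-in for `three_sisters.html`, and `'html.parser'` is used in place of `'lxml'`.

```python
import re
from bs4 import BeautifulSoup

# Assumed stand-in for three_sisters.html; the real file may differ.
html_doc = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

# string: returns the matching NavigableString objects, not the tags around them.
elsie_strings = soup.find_all(string=re.compile('Elsie'))
story_strings = soup.find_all(string=re.compile('story'))

# **kwargs: class_ stands for the class attribute; other attributes keep their own names.
lacie = soup.find_all(class_='sister', id='link2')

# recursive=False: only direct children of the object being searched are examined.
top_level = soup.find_all('title', recursive=False)         # title is not a direct child of soup
in_head = soup.head.find_all('title', recursive=False)      # title is a direct child of head

print(elsie_strings)                  # ['Elsie']
print(len(story_strings))             # 2
print(lacie[0].get_text())            # Lacie
print(len(top_level), len(in_head))   # 0 1
```

With this stand-in document, "story" appears in two strings: once inside `title` and once inside `b`.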


# **find function**

# **Signature:** find(name, attrs, recursive, string, **kwargs)     - there is no limit parameter

find function: returns a single object if found; when several objects match, it returns the first one it finds. If nothing matches, it returns None.


In [None]:
#36- find function

from bs4 import BeautifulSoup
import re

def read_file():
    # the with block closes the file even if reading fails
    with open('three_sisters.html') as file:
        return file.read()

soup = BeautifulSoup(read_file(),'lxml')

'''
Signature: find(name, attrs, recursive, string, **kwargs)     - there is no limit parameter

find function: returns a single object if found      -- when several objects match, it returns the first one it finds

'''
#tag = soup.find_all('a') # returns a list of every matching object in the whole parse tree
tag = soup.find('a')      # returns only the first matching object
print(tag)

<a class="sister" href="http://example.com/elsie" id="link1">
                Elsie
            </a>
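
The difference between `find` and `find_all` shows most clearly when nothing matches. The sketch below assumes a reduced stand-in for `three_sisters.html` and the built-in `'html.parser'`; the `table` tag is simply a tag name that does not occur in the stand-in document.

```python
from bs4 import BeautifulSoup

# Assumed stand-in for three_sisters.html; the real file may differ.
html_doc = """<html><body>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>"""

soup = BeautifulSoup(html_doc, 'html.parser')

first = soup.find('a')                  # a single Tag: the first match
everything = soup.find_all('a')         # a list of all matches
limited = soup.find_all('a', limit=1)   # a one-element list holding the same first tag

missing_one = soup.find('table')        # None when nothing matches
missing_all = soup.find_all('table')    # an empty list when nothing matches

print(first['id'])            # link1
print(len(everything))        # 3
print(limited[0] is first)    # True
print(missing_one)            # None
print(len(missing_all))       # 0
```

Because `find` can return `None`, it is worth checking the result before accessing attributes such as `tag['id']`.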
