## Intro to Regex

### Why do we need regular expressions?

Regular expressions, at the most basic level, allow computer users and developers to find desired pieces
of text and, often, to replace those pieces of text with something that is preferred. On websites, regular
expressions are used to test whether a sequence of characters that might be intended to be a credit card
number, email address or a telephone number has an allowed pattern of characters. 

Whether it’s finding existing sequences of characters or testing sequences of characters for their suitability (or not) for storage, the key aspect of regular expressions is matching a pattern against a sequence of characters.
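
As a minimal sketch of both ideas, finding and replacing, the snippet below uses Python's built-in re module on a made-up sentence (the sentence and the variable names are illustrative assumptions, not part of the original material):

```python
import re

# Hypothetical example text, invented for illustration
text = "Call me on 07799 343821 tomorrow"

# Find: re.search returns the first match of the pattern (one or more digits)
found = re.search(r"\d+", text)
print(found.group())

# Replace: re.sub swaps every match of the pattern (a single digit) for "#"
masked = re.sub(r"\d", "#", text)
print(masked)
```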


### What can data folks do with regex? 

+ Retrieve useful information from the noise, e.g. numbers from text
+ Find strings or numbers that match a known pattern
+ Split strings into usable, very specific portions
+ Find repeated words, count instances of the same pattern, or find near-miss words (typos)



---- 
Of course, if the text you are working with is long and simple pattern matching isn't enough, you should consider an NLP package instead, e.g. NLTK.

In [1]:
#Import regex library
import re 

### Extracting numeric information from text

Recipe extract : 
" Cream 140g softened butter and 140g caster sugar until light and fluffy, then slowly add 2 beaten large eggs with a little of the flour. Fold in the remaining flour, 1 tsp baking powder and 2 mashed bananas"

+ Task: using regex, we want to extract the quantities from the text
+ findall is a regex function that searches the text and returns every match as a list, e.g. re.findall(pattern, text)
+ \d matches a single numeric character (a digit); \d+ matches a run of one or more digits
+ Find this and other useful regex characters on [this link](https://cheatography.com/davechild/cheat-sheets/regular-expressions/)

In [2]:
recipe="Cream 140g softened butter and 140g caster sugar until light and  fluffy then slowly add 2 beaten large eggs with a little of the flour.  Fold in the remaining flour, 1 tsp baking powder and 2 mashed bananas"

# raw string (r'...') so that \d reaches the regex engine unchanged
matches = re.findall(r'\d+', recipe)
matches

['140', '140', '2', '1', '2']
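
The list above also contains the egg and banana counts. As a hedged sketch of one way to keep only the measured quantities, the pattern below captures a number together with its unit. It assumes the only units of interest are g and tsp, which is true of this extract but would need extending for other recipes:

```python
import re

recipe = ("Cream 140g softened butter and 140g caster sugar until light and fluffy, "
          "then slowly add 2 beaten large eggs with a little of the flour. "
          "Fold in the remaining flour, 1 tsp baking powder and 2 mashed bananas")

# (\d+)    captures the number
# \s?      allows an optional space between number and unit
# (g|tsp)  captures the unit; \b stops "g" matching the start of a longer word
quantities = re.findall(r"(\d+)\s?(g|tsp)\b", recipe)
print(quantities)
```

With more than one group in the pattern, findall returns a list of tuples, one tuple per match.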

### Checking a telephone number input is correct 

Phone number structures vary by country. For example, in the UK a valid phone number would be +441176654432 or +447799343821, but sometimes the number is entered without the +44, starting with 0 instead, and some people still give the number starting with '0044'. In each case the prefix is followed by 10 digits, which keeps the pattern simple.

You will find regex patterns online for validating telephone numbers in your own country; here is one pattern for the UK:

> ^(\\+44\s?\d{10}|0044\s?\d{10}|0\s?\d{10})?$

In [3]:
number = '+447799343821'

In [4]:
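# raw string, so the backslashes reach the regex engine unchanged
# note: the trailing ? makes the whole group optional, so an empty string also passes
pattern = r'^(\+44\s?\d{10}|0044\s?\d{10}|0\s?\d{10})?$'
result = re.match(pattern, number)

if result:
    print("valid phone number")
else:
    print("not a valid phone number")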

valid phone number
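
The UK pattern shown earlier ends with an optional group, so an empty string would also count as valid. The sketch below is one possible stricter variant, not the only way to validate UK numbers: it assumes the same three prefixes followed by ten digits, and uses re.fullmatch so that the ^ and $ anchors are not needed. The names UK_PHONE and is_valid_uk_number are made up for this example.

```python
import re

# One prefix (+44, 0044 or 0), an optional space, then exactly ten digits
UK_PHONE = re.compile(r"(\+44|0044|0)\s?\d{10}")

def is_valid_uk_number(number):
    """Return True if the whole string looks like a UK phone number."""
    return UK_PHONE.fullmatch(number) is not None

# Hypothetical inputs, including a missing prefix and an empty string
samples = ["+447799343821", "0044 1176654432", "07799343821", "7799343821", ""]
results = {s: is_valid_uk_number(s) for s in samples}

for s, ok in results.items():
    print(repr(s), "valid" if ok else "not valid")
```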


### Checking an email address input is correct

We could use the same approach to check any string whose format we know in advance, e.g. email addresses, websites, credit card numbers, passport numbers, bank account numbers. You have probably all interacted with a website that does this kind of pattern matching.

+ One example regex for email addresses is

> ^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]+)$

+ While some unusual but valid email addresses will not match this pattern, it covers the majority

In [5]:
email = 'sian.davies@ironhack.com'

In [6]:
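# raw string; the last class is [a-zA-Z] (a lowercase z in A-z would also admit punctuation such as [ and ^)
pattern = r'^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]+)$'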

In [7]:
result = re.match(pattern, email)

if result:
    print("valid email address")
else:
    print("not a valid email address")

valid email address
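
Because the pattern wraps each part of the address in parentheses, a successful match also gives access to those parts. A small sketch, reusing the example address; the names EMAIL, user, domain and tld are chosen here for illustration:

```python
import re

EMAIL = re.compile(r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]+)$")

m = EMAIL.match("sian.davies@ironhack.com")

# groups() returns the text captured by each pair of parentheses, in order
user, domain, tld = m.groups()
print(user, domain, tld)

# A string without an @ does not match, so match() returns None
no_match = EMAIL.match("not-an-email")
print(no_match)
```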


If you want to write your own pattern for expected text, this is a useful place to start:

https://www.programiz.com/python-programming/regex 

##### Metacharacters - these are fairly generic and are useful to know for regex, Python and SQL (plus %)

> . ^ $ * + ? { } [ ] \ | ( )


##### Special sequences - these are specific to regex

+ \d
Matches any decimal digit; this is equivalent to the class [0-9].

+ \D
Matches any non-digit character; this is equivalent to the class [^0-9].

+ \s
Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].

+ \S
Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].

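+ \w
Matches any alphanumeric character or underscore; this is equivalent to the class [a-zA-Z0-9_].

+ \W
Matches any character that is not alphanumeric or an underscore; this is equivalent to the class [^a-zA-Z0-9_].

Note: these equivalences hold for ASCII text. In Python 3, \d, \s and \w also match Unicode digits, whitespace and letters unless the re.ASCII flag is set.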
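
To see what each special sequence picks out, here is a short sketch that runs them over one made-up string (the string is an assumption chosen to contain letters, digits, a space and a tab):

```python
import re

# Hypothetical sample: words, a number, spaces and one tab character
sample = "Room 42\tis free"

digits = re.findall(r"\d", sample)        # every single digit
non_digits = re.findall(r"\D+", sample)   # runs of anything that is not a digit
spaces = re.findall(r"\s", sample)        # every whitespace character
words = re.findall(r"\w+", sample)        # runs of word characters
non_words = re.findall(r"\W+", sample)    # runs of non-word characters

print(digits, non_digits, spaces, words, non_words, sep="\n")
```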

### Split strings of text in useful ways

You have learnt how to use the Python split and rsplit functions, and these will probably cover many cases, e.g. splitting on whitespace, on specific characters, or by index position.

However, what about splitting around certain words or phrases?
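
One way to split around words rather than single characters is re.split. A minimal sketch, using an invented sentence and assuming we want to break it wherever the words 'and' or 'or' appear:

```python
import re

# Hypothetical sentence, invented for illustration
text = "apples and pears or plums and cherries"

# \s+ swallows the spaces either side; (?:and|or) matches either word
# without capturing it, so the separators are left out of the result
parts = re.split(r"\s+(?:and|or)\s+", text)
print(parts)
```

If the group were a capturing one, (and|or), re.split would include the separators in the returned list as well.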

In [9]:
# read the whole file into one string; the with block closes the file afterwards
with open("sianemail.txt", "r") as f:
    email = f.read()

In [None]:
email

In [10]:
# I want to get the email address from the message, and I can see From: is at the start of that line
# so I will use regular Python first, splitting the text into lines on the newline character

for line in email.split("\n"):
    if "From:" in line:
        print(line)

From: Sian <sian.davies@ironhack.com>


In [11]:
# with regex 

for line in re.findall("From:.*", email):
    print(line)

From: Sian <sian.davies@ironhack.com>


In [None]:
# use regex to get the part of the email address before the @ (the match includes the @ itself)

match = re.findall("From:.*", email)
for line in match:
    print(re.findall(r"\w\S*@", line))

In [None]:
# use regex to get the part of the email address after the @ (the match runs from the @ to the end of the line)
for line in match:
    print(re.findall("@.*", line))

In [None]:
# use regex to retrieve specific information 
match = re.search("password is :.*", email)
print(type(match))
print(type(match.group()))
print(match)
print(match.group())

Once we start using pandas in Python, you will be working with multiple rows of data. You can see how you might use regex to retrieve specific information from the string in each row.
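
As a rough sketch of the row-by-row idea without pandas, the loop below treats a plain Python list as a stand-in for a column of strings. The rows and the names address and extracted are invented for illustration, and the pattern is a simplified assumption that the address sits between angle brackets:

```python
import re

# Hypothetical rows standing in for a column of text data
rows = [
    "From: Sian <sian.davies@ironhack.com>",
    "From: no address given",
    "From: Alex <alex@example.org>",
]

# Capture whatever sits between < and >, as long as it contains one @
address = re.compile(r"<([^<>@\s]+@[^<>@\s]+)>")

extracted = []
for row in rows:
    m = address.search(row)
    # search returns None when there is no match, so guard before calling group
    extracted.append(m.group(1) if m else None)

print(extracted)
```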


### Find matching strings, count those strings, deal with typos in matches

#### Counting matches 

Sometimes we want to know how often a certain word is mentioned in a body of text. For example, I have noticed that many young people use the word 'like' a lot, and it is very easy to fall into the habit of doing so myself. Can we use regex to count the instances of the word 'like'?

In [None]:
text = "If you too have been infected with the like epidemic, do not despair; there is hope.  Before we can change a habit, whether it is a speech sound or another verbal speech pattern, we have to be able to hear it.  Record yourself telling a story as if you are talking to a friend or while you are on the phone with a friend, record your half of the conversation.  Go back, and listen to the recording and count all of the times you used the word “like”.  Now, while listening to the recording, say the same sentences again, without using the word “like”.  Stop the recording every two to three sentences and reproduce the same part of the story without the use of like.  After you have completed this process, try telling the same story again, from beginning to end, without relying on the filler like.  Record your narration so that you will be able to hear the new, improved, like-less version.  You have begun to train your ear to scan for the word like and now, more aware of it, you can begin to curtail using it in everyday, conversational speech."

In [None]:
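patterns = [r'like']

for p in patterns:
    # findall returns a list with one entry per match
    match = re.findall(p, text)
    print(match)

# the number of matches for the last pattern in the list
length = len(match)

print(length)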
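
A plain pattern such as like only finds the lowercase spelling, and it also finds those letters inside longer words. The sketch below shows one hedged way to tighten the count, using word boundaries (\b) and the re.IGNORECASE flag on a short invented sentence:

```python
import re

# Hypothetical sentence, invented to show the difference
sample = "Like it or not, I like saying like. Like, it is unlikely to stop."

# Loose: lowercase only, and also matches the letters inside "unlikely"
loose = re.findall(r"like", sample)

# Strict: whole word only (\b marks a word boundary), any capitalisation
strict = re.findall(r"\blike\b", sample, flags=re.IGNORECASE)

print(len(loose), loose)
print(len(strict), strict)
```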

#### Dodgy matches - i.e. someone has spelt the words slightly wrong

We still want to find those words and count them.

+ Note: you don't have to use regex, as there are built-in tools in Python for wildcard string searches, but the bigger your task (the messier your data, or the longer your text or file), the more time regex may save you in the long run

In [12]:
typos = "in matplotlib and seaborn they spell colour like color but in the uk we spell it colour. So I am constantly typing color as colour and my function doesnt work. I am not sure whether colour or color is correct but I think we should all use the same spelling "

In [13]:
patterns= [r'col\w+r']

for p in patterns:
    match= re.findall(p, typos)
    print(match)

length= len(match)

print(length)

['colour', 'color', 'colour', 'color', 'colour', 'colour', 'color']
7
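
The pattern col\w+r is deliberately loose, so it will also pick up unrelated words that happen to start with col and end in r. If only the two spellings are wanted, one alternative (a sketch, with an invented sentence) is to make the u optional with ?:

```python
import re

# Hypothetical sentence containing both spellings plus two unrelated words
sample = "The colour collector chose a colder color."

# Loose: col, then one or more word characters, then r
loose = re.findall(r"col\w+r", sample)

# Strict: "colo", an optional "u", then "r", as a whole word
strict = re.findall(r"\bcolou?r\b", sample)

print(loose)
print(strict)
```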


#### Extending matches with grouping - what's the next word?

Regular expressions allow us not just to match text but also to extract information for further processing. This is done by defining groups of characters and capturing them using the parentheses metacharacters ( and ).

+ In this example I will disregard the varied spelling of colour, and return the next word as long as it starts with 'p'


In [16]:
list = ['colour wheel', 'colouring pens', 'color blind', 'the color purple - book by A Walker', 'colour palette', 'coloured pink']

In [17]:
for element in list: 
    z = re.match(r"(col\w+)\W(p\w+)", element)
    
    if z:
        print((z.groups()))

('colouring', 'pens')
('colour', 'palette')
('coloured', 'pink')


+ Note that the fabulous book 'The Color Purple' by Alice Walker was not matched. re.match only finds matches at the start of the string, so you can either write a pattern that allows for other words at the start of the string or, much more easily, use re.search

> re.match() checks for a match only at the beginning of the string. If the pattern matches there, it returns a match object; otherwise it returns None, even if the pattern occurs later in the string. re.search() scans the whole string and returns the first occurrence.

In [18]:
for element in list: 
    z = re.search(r"(col\w+)\W(p\w+)", element)
    
    if z:
        print((z.groups()))

('colouring', 'pens')
('color', 'purple')
('colour', 'palette')
('coloured', 'pink')
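
Groups can also be given names, which makes the extracted pieces easier to refer to than positions in a tuple. A minimal sketch using the same pattern; the group names colour and next_word are chosen here for illustration:

```python
import re

phrase = "the color purple - book by A Walker"

# (?P<name>...) is Python's syntax for a named capturing group
m = re.search(r"(?P<colour>col\w+)\W(?P<next_word>p\w+)", phrase)

first = m.group("colour")
second = m.group("next_word")
named = m.groupdict()

print(first, second)
print(named)
```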


### TIP - Using compile with regex - for later and frequent usage

In [19]:
pattern= re.compile(r"\w+(?=\slooks)")

string = "Now I think I could have a go with regex, regex looks awesome!"

In [20]:
result = pattern.search(string)

In [21]:
print(result)

<re.Match object; span=(42, 47), match='regex'>


In [22]:
pattern = re.compile(r"\w+(?=,)")
result = pattern.findall(string)
print(result)

['regex']
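
The (?=...) part of these patterns is a lookahead: it checks what follows the word without including it in the match. The sketch below compares it with an ordinary pattern and reuses one compiled pattern across several strings, which is the main reason to compile. The sentences are invented for illustration:

```python
import re

# Compile once, reuse many times
before_looks = re.compile(r"\w+(?=\slooks)")

# Hypothetical sentences; the last one contains no "looks"
sentences = ["regex looks awesome!", "that cake looks tasty", "nothing to see here"]

found = []
for s in sentences:
    m = before_looks.search(s)
    found.append(m.group() if m else None)

print(found)

# Without the lookahead, " looks" becomes part of the matched text
plain = re.search(r"\w+\slooks", sentences[0]).group()
print(plain)
```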
