### Day 28! Regex and all of its glory!

First, we look at some standard but powerful [string methods](https://docs.python.org/3/library/stdtypes.html).

In [1]:
text = 'Some text that you want to test'

In [2]:
text.startswith('Some')

True

In [3]:
text.endswith('test')

True

In [4]:
'text' in text

True

---
For a case-insensitive check, it's good practice to convert the text to a single case before testing membership.

In [5]:
'that' in text.lower()

True

---
The replace method is very handy, but note that, like the lower method, it returns a new string and leaves the original unchanged. The result must be assigned back to the variable to stick!

In [6]:
text.replace('Some text', 'Some Awesome text')

'Some Awesome text that you want to test'

In [7]:
'awesome' in text.lower()

False

In [8]:
text

'Some text that you want to test'

In [9]:
text = text.replace('Some text', 'Some Awesome text')

In [10]:
'awesome' in text.lower()

True

---
Now, let's look at the [re](https://docs.python.org/3/library/re.html) module.

In [11]:
import re

In [13]:
re.search('some awesome', text.lower())

<_sre.SRE_Match object; span=(0, 12), match='some awesome'>

In [14]:
re.match('some awesome', text.lower())

<_sre.SRE_Match object; span=(0, 12), match='some awesome'>
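The two calls above return the same match only because the pattern sits at the very start of the string. Here is a small sketch of where they differ, assuming `text` holds the value it was given in cell 9; the names `found_anywhere` and `found_at_start` are just for illustration.

```python
import re

# Assumed to be the value of `text` after the replace in cell 9
text = 'Some Awesome text that you want to test'

# search scans the whole string and returns the first match it finds
found_anywhere = re.search('awesome', text.lower())

# match only tries at position 0, so a pattern that starts later gives None
found_at_start = re.match('awesome', text.lower())

print(found_anywhere)
print(found_at_start)
```

Because a failed lookup gives back `None`, it is worth checking the result before calling methods on it.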

---
Wrapping part of the pattern in parentheses creates a capture group. After a successful search, the groups method on the match object returns the captured strings in a tuple that we can access.

In [15]:
hundred = 'Awesome, I am doing the #100DaysOfCode challenge'
two_hundred = 'Awesome, I am doing the #200DaysOfCode challenge'

In [16]:
m = re.search(r'(#\d+DaysOfCode)', hundred)
m.groups()

('#100DaysOfCode',)

In [17]:
m.groups()[0]

'#100DaysOfCode'
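Indexing into `m.groups()` works, but `m.group(1)` reads the first capture group directly, and `re.search` returns `None` when nothing matches. A minimal sketch that handles both cases follows; the helper `find_tag` and the string `no_tag` are made up for this example and are not part of the notebook.

```python
import re

hundred = 'Awesome, I am doing the #100DaysOfCode challenge'
no_tag = 'Awesome, I am doing a challenge'  # hypothetical input without a tag

def find_tag(tweet):
    """Return the first #<digits>DaysOfCode tag, or None if there is none."""
    m = re.search(r'(#\d+DaysOfCode)', tweet)
    # group(1) is the first parenthesised group; guard against a failed search
    return m.group(1) if m else None

tag_found = find_tag(hundred)
tag_missing = find_tag(no_tag)

print(tag_found)
print(tag_missing)
```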

---
The findall method is extremely handy! It returns a list of every non-overlapping match, which is similar to what the split method gives us, but the pattern lets us add rules that exceed the scope of split.

In [18]:
text = """Lorem Ipsum is simply dummy text of the printing and typesetting industry. Lorem Ipsum has been 
the industry's standard dummy text ever since the 1500s, when an unknown printer took a galley of type and 
scrambled it to make a type specimen book. It has survived not only five centuries, but also the leap into 
electronic typesetting, remaining essentially unchanged. It was popularised in the 1960s with the release of
Letraset sheets containing Lorem Ipsum passages, and more recently with desktop publishing software like Aldus
PageMaker including versions of Lorem Ipsum"""

In [19]:
aList = re.findall(r'\w+', text)

In [20]:
aList[:5]

['Lorem', 'Ipsum', 'is', 'simply', 'dummy']

In [21]:
splitList = text.split()

In [22]:
splitList[:5]

['Lorem', 'Ipsum', 'is', 'simply', 'dummy']
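The first five items happen to be identical, but the two approaches diverge as soon as punctuation shows up. Here is a short sketch on a fragment of the same text; `sample`, `words`, and `tokens` are illustrative names.

```python
import re

# A fragment of the Lorem Ipsum text that contains an apostrophe and a comma
sample = "the industry's standard dummy text ever since the 1500s, when"

# \w+ keeps only runs of letters, digits and underscores, so punctuation
# is dropped and the apostrophe splits "industry's" into two matches
words = re.findall(r'\w+', sample)

# split() only breaks on whitespace, so punctuation stays attached
tokens = sample.split()

print(words)
print(tokens)
```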

---
Let's find all words that begin with an upper case character. [A-Z] matches a single upper case letter, and [a-z0-9] matches a lower case letter or digit after it. The '+' matches 1 or more of these, and if we wanted 0 or more we would use '*'. Note that 'PageMaker' comes back as 'Page' and 'Maker', because the second capital letter starts a new match.

In [23]:
re.findall(r'[A-Z][a-z0-9]+', text)

['Lorem',
 'Ipsum',
 'Lorem',
 'Ipsum',
 'It',
 'It',
 'Letraset',
 'Lorem',
 'Ipsum',
 'Aldus',
 'Page',
 'Maker',
 'Lorem',
 'Ipsum']
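To see what switching '+' for '*' changes, here is a small sketch on a made-up sentence (`sample` is not from the notebook). The third pattern, using `\b` and `\w*`, is one possible way to keep a mixed-case word like 'PageMaker' in one piece; treat it as an option rather than the approach the course uses.

```python
import re

# Hypothetical sentence with a lone capital, capitalised words and a mixed-case name
sample = 'A galley of Type, by Aldus PageMaker'

# '+' needs at least one lower case letter or digit after the capital,
# so the single-letter word 'A' is skipped
one_or_more = re.findall(r'[A-Z][a-z0-9]+', sample)

# '*' also accepts a capital with nothing after it
zero_or_more = re.findall(r'[A-Z][a-z0-9]*', sample)

# \b anchors at a word boundary and \w* also consumes later capitals
whole_words = re.findall(r'\b[A-Z]\w*', sample)

print(one_or_more)
print(zero_or_more)
print(whole_words)
```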

---
Now, let's wrap this with Counter and find the most common words in the string that match our regular expression! 

In [24]:
from collections import Counter

In [25]:
cnt = Counter(re.findall(r'[A-Z][a-z0-9]+', text))

In [26]:
cnt.most_common(5)

[('Lorem', 4), ('Ipsum', 4), ('It', 2), ('Letraset', 1), ('Aldus', 1)]

---
[Link](https://github.com/talkpython/100daysofcode-with-python-course/blob/master/days/28-30-regex/regex.ipynb) to TalkPython notebook.

In [27]:
movies = '''1. Citizen Kane (1941)
2. The Godfather (1972)
3. Casablanca (1942)
4. Raging Bull (1980)
5. Singin' in the Rain (1952)
6. Gone with the Wind (1939)
7. Lawrence of Arabia (1962)
8. Schindler's List (1993)
9. Vertigo (1958)
10. The Wizard of Oz (1939)'''.split('\n')