# Data cleaning and text analysis

<img src="https://miro.medium.com/max/2392/0*1-i9w0e4kklVQl5B.jpg">

It is estimated that around 80% of all data is **unstructured**.

And a large share of that **unstructured** data is **text data**!

Text is present in every major business process, from support tickets to product feedback and customer interactions.

There is no doubt that text analysis has a broad range of business applications and use cases:
* Understanding customers
* Risk management
* Prediction and prevention of crime
* Personalized advertising
* ...

## Is text data used in your company? 
<br>
<br>

## How do you use it?


### Social Media Monitoring
* Let's say you work for NIKE and you want to know what users say about the company. Every day there are thousands of tweets that can provide really interesting insights; however, this volume of data cannot be analyzed manually. Some questions:
    * Is the sentiment about the company positive or negative?
    * What are they complaining about?
    * What do they say about the new AIR MAX shoes?

$\bullet$ **Objectives**: 
    * Write scripts for data cleaning (data formatting, text categorization, etc.). 
    * Extract features from unstructured data. Understand and use tools for representing natural language data. 
    * Show examples for sentiment analysis tasks.

$\bullet$ Topics: Data cleaning; Regular Expression; How to represent natural language: tf-idf, n-grams; Tools for NLP (NLTK, Pattern).

<code>Regular expressions and examples
Textual data cleaning/processing:
    Tokenizing - convert sentences to words
    Removing unnecessary punctuation, tags
    Removing stop words - frequent words such as "the", "is", etc. that carry little specific meaning
    Stemming and Lemmatizing - words are reduced to a root by removing inflection
Text representation
        TF-IDF: Term frequencies (counter)
        Vector normalization
        Feature weighting (Inverse Document Frequency)	
        Sklearn implementation
Learning text representations
        Stopwords
        Bag of Words
        n-grams
        Training a (naive Bayes) Classifier with NLTK: film critiques example
        Training a (naive Bayes) Classifier with TextBlob: A Tweet Sentiment Analyzer
        Pattern module</code>

## Regular expressions - Regex

The concept of regular expressions arose in the 1950s, when the American mathematician Stephen Kleene formalized the description of a regular language. It came into common use with the Unix text-processing utilities ed, an editor, and grep (global regular expression print), a filter.
<img src="regular.png">

A regular expression processor translates a regular expression into a nondeterministic finite automaton (NFA) (where several states can be the output of a given state and symbol), which is then made deterministic (only one possible state transition for a particular symbol) and run on the target text string to recognize substrings that match the regular expression. 

In other words: a regular expression is a sequence of characters and symbols that is used to find a specific pattern in some text.
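
To make the automaton idea concrete, here is a minimal sketch of a deterministic automaton for the pattern ab*, written by hand. The state names and the `dfa_accepts` helper are invented for this illustration. Note also that Python's own re module works by backtracking rather than by building such an automaton, so this only models the classical construction described above.

```python
# Hand-built deterministic automaton for ab* (an 'a' followed by zero or more 'b's).
# The states and transition table are hypothetical and written by hand;
# a real engine would derive them from the pattern.
TRANSITIONS = {
    ("start", "a"): "seen_a",
    ("seen_a", "b"): "seen_a",
}
ACCEPTING = {"seen_a"}


def dfa_accepts(text):
    """Return True if the whole text is accepted by the automaton."""
    state = "start"
    for char in text:
        state = TRANSITIONS.get((state, char))
        if state is None:  # no transition defined for this symbol: reject
            return False
    return state in ACCEPTING


for candidate in ["a", "abbb", "ba", ""]:
    print(repr(candidate), dfa_accepts(candidate))
```

Each (state, symbol) pair has at most one successor, which is what makes the automaton deterministic.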


Reference: Kleene, S. C. (1956). "Representation of Events in Nerve Nets and Finite Automata", in Automata Studies, Claude Shannon and John McCarthy, eds.
<img src="Automat.png">

You write regular expressions (regex) to match patterns in strings. When you are processing text, you may want to extract a substring of some predictable structure: a phone number, an email address, or something more specific to your research or task. You may also want to clean your text of some kind of junk: maybe there are repetitive formatting errors due to some transcription process that you need to remove.

In these cases and in many others like them, writing the right regex will be a good choice.
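
As a first taste of both uses, the sketch below extracts substrings and cleans up whitespace. The sample text and the patterns are made up for illustration; in particular, the email pattern is deliberately simple and is not a complete validator.

```python
import re

# Hypothetical sample text
text = "Contact: jane.doe@example.com, phone 555-0142.   Thanks!!"

# Extract: a deliberately simple email pattern
emails = re.findall(r'[\w.-]+@[\w-]+\.[a-z]+', text)

# Extract: three digits, a dash, four digits
phones = re.findall(r'\d{3}-\d{4}', text)

# Clean: collapse every run of whitespace into a single space
cleaned = re.sub(r'\s+', ' ', text)

print(emails)
print(phones)
print(cleaned)
```

The functions re.findall and re.sub belong to the standard re module; re.findall is covered in detail further down.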

<img src='https://res.cloudinary.com/practicaldev/image/fetch/s--_iE0KvdT--/c_imagga_scale,f_auto,fl_progressive,h_900,q_auto,w_1600/https://dev-to-uploads.s3.amazonaws.com/i/zpek00ubevoxvn458b01.png'>

In [1]:
import re #the regex module in the python standard library

#strings to be searched for matching regex patterns
str1 = "varks Aard belíng to the Captain"
str2 = "Albert's famous equation, E = mc^2."
str3 = "Located at 455 Serra Mall."
str4 = "Beware of the shape-shifters!"

test_strings = [str1, str2, str3, str4] #created a list of strings

In [2]:
for test_string in test_strings:
    print ('\nThe test string is "' + test_string + '"')
    match = re.search('[í]', test_string)
    # r'' raw strings, do not interpret inside special characters, such as \
    # u'' unicode
    if match:
        print ('- The first possible match is: ' + match.group())
    else:
        print ('- ** no match. **')


The test string is "varks Aard belíng to the Captain"
- The first possible match is: í

The test string is "Albert's famous equation, E = mc^2."
- ** no match. **

The test string is "Located at 455 Serra Mall."
- ** no match. **

The test string is "Beware of the shape-shifters!"
- ** no match. **


Let's go through the code above line by line:

<code>
for test_string in test_strings:
</code>

test_strings is a list, and so it is iterable in a for loop. Every element in this list is a string. So for the rest of the for loop, we will be referring to the current element as test_string.

<code>
print ('\nThe test string is "' + test_string + '"')
</code>

This just prints out the current string we're iterating over.

<code>
match = re.search('[í]', test_string)
</code>

You give a searcher (in this case, the function re.search()) a pattern and a string in which to find matches. That's exactly what this line does. re.search() returns either a match object (of type re.Match in current Python 3; older versions call it SRE_Match) or None. 

<code>
if match:
    print ('- The first possible match is: ' + match.group())
else:
    print ('- ** no match. **')
</code>    

match has two possible values: a match object or None. None evaluates to false in a logical test, while a match object evaluates to true. So the if statement checks whether match is None or not. If it isn't, we print a string plus match.group(). group() is a method of match objects. It returns the substring that matched the pattern.
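
A short sketch of what a match object carries, using the first test string and an uppercase-letter pattern chosen for illustration. Besides group(), match objects also expose span(), which gives the position of the match in the string.

```python
import re

sentence = "varks Aard belíng to the Captain"

match = re.search(r'[A-Z]', sentence)
print(type(match).__name__)  # class name of the match object
print(match.group())         # the matched substring
print(match.span())          # (start, end) indices of the match

no_match = re.search(r'[0-9]', sentence)
print(no_match is None)      # there are no digits, so search returns None
```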


Note that since we are using re.search, only a single character is returned. That's because of the following:

<ol>
<li>We only defined a single character pattern and</li>
<li>re.search finds the first possible match and then doesn't look for any more.</li>
</ol>


If you want to find all possible matches in a string, you can use re.findall(), which will return a list of all matches:

In [3]:
for string in test_strings:
    print(string)
    print ("-" , re.findall(r'[A-Z]', string),"\n")

varks Aard belíng to the Captain
- ['A', 'C'] 

Albert's famous equation, E = mc^2.
- ['A', 'E'] 

Located at 455 Serra Mall.
- ['L', 'S', 'M'] 

Beware of the shape-shifters!
- ['B'] 



You can also compile your regex ahead of time with re.compile(). This creates compiled pattern objects (re.Pattern in current Python 3; older versions call them SRE_Pattern). Compiling can improve performance when the same pattern is reused many times. Additionally, you can create lists of these objects and iterate over both strings and patterns more easily. Here's an example:

In [4]:
patterns = [re.compile(r'[ABC]'),
re.compile(r'[^ABC]'),
re.compile(r'[ABC^]'),
re.compile(r'[0123456789]'),
re.compile(r'[0-9]'),
re.compile(r'[0-4]'),
re.compile(r'[A-Z]'),
re.compile(r'[A-Za-z]'),
re.compile(r'[A-Za-z0-9]'),
re.compile(r'[-a-z]'),
re.compile(r'[- a-z]')]

def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
    
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
    print("In: \""+ test_string+"\"")
    for pattern in patterns:
        print (' - The first potential match for "' + pattern.pattern + ' \t is: ' + matches[patterns.index(pattern)])

In: "varks Aard belíng to the Captain"
 - The first potential match for "[ABC] 	 is: A
 - The first potential match for "[^ABC] 	 is: v
 - The first potential match for "[ABC^] 	 is: A
 - The first potential match for "[0123456789] 	 is: no match.
 - The first potential match for "[0-9] 	 is: no match.
 - The first potential match for "[0-4] 	 is: no match.
 - The first potential match for "[A-Z] 	 is: A
 - The first potential match for "[A-Za-z] 	 is: v
 - The first potential match for "[A-Za-z0-9] 	 is: v
 - The first potential match for "[-a-z] 	 is: v
 - The first potential match for "[- a-z] 	 is: v
In: "Albert's famous equation, E = mc^2."
 - The first potential match for "[ABC] 	 is: A
 - The first potential match for "[^ABC] 	 is: l
 - The first potential match for "[ABC^] 	 is: A
 - The first potential match for "[0123456789] 	 is: 2
 - The first potential match for "[0-9] 	 is: 2
 - The first potential match for "[0-4] 	 is: 2
 - The first potential match for "[A-Z] 	 is: A
 

Let's go over this code line by line: 

<code>
patterns = [re.compile(r'[ABC]'),
re.compile(r'[^ABC]'),
re.compile(r'[ABC^]'),
re.compile(r'[0123456789]'),
re.compile(r'[0-9]'),
re.compile(r'[0-4]'),
re.compile(r'[A-Z]'),
re.compile(r'[A-Za-z]'),
re.compile(r'[A-Za-z0-9]'),
re.compile(r'[-a-z]'),
re.compile(r'[- a-z]')]
</code>

This creates a list of compiled pattern objects (re.Pattern in current Python 3; older versions call them SRE_Pattern).

In [5]:
print (patterns)

[re.compile('[ABC]'), re.compile('[^ABC]'), re.compile('[ABC^]'), re.compile('[0123456789]'), re.compile('[0-9]'), re.compile('[0-4]'), re.compile('[A-Z]'), re.compile('[A-Za-z]'), re.compile('[A-Za-z0-9]'), re.compile('[-a-z]'), re.compile('[- a-z]')]


In [6]:
print (patterns[1])
print (patterns[1].pattern)

re.compile('[^ABC]')
[^ABC]


<code>
def find_match(pattern, string):
    match = re.search(pattern, string)
    if match:
        return match.group()
    else:
        return 'no match.'
</code>

We defined a function find_match that expects two arguments, pattern and string. Notice that this function is very similar to the logical condition testing from the code above. Note also that this function returns either match.group() or the string 'no match.'

<code>
for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns]
</code>

By defining the find_match() function above, I can then call it from within a list comprehension. In words, for each string test_string that is in test_strings, we compare against the list of patterns and return matches. The resulting list of matches should be the same length as patterns; one match per pattern tested. 

<code>
for pattern in patterns:
    print ('The first potential match for "' + pattern.pattern + '" in "' + test_string + '" is: ' + matches[patterns.index(pattern)])
</code>

Because we wanted to print some diagnostic output, we need to iterate over each pattern in patterns (a list and thus iterable) and print it out, along with the test string. If you want to get the pattern out of a compiled pattern object, you can read its .pattern attribute, which holds the regex pattern as a string. Since we are nesting this loop within the bigger loop above, this loop will go over every pattern in the patterns list for each string, and then repeat for the next string in the list test_strings.

However, note that we are dynamically referring to the index of the matches list:

<code>
matches[patterns.index(pattern)]
</code>
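
patterns.index(pattern) searches the list for the current pattern on every iteration and returns the position of the first equal element. One possible alternative, sketched here with a shortened pattern list chosen for illustration, is to pair each pattern with its result using zip, so that no index lookup is needed.

```python
import re

# Shortened, illustrative pattern list
patterns = [re.compile(r'[ABC]'), re.compile(r'[0-9]'), re.compile(r'[a-z]')]
test_string = "Located at 455 Serra Mall."


def find_match(pattern, string):
    match = re.search(pattern, string)
    return match.group() if match else 'no match.'


matches = [find_match(pattern, test_string) for pattern in patterns]

# zip walks both lists in parallel: first pattern with first result, and so on
for pattern, result in zip(patterns, matches):
    print(' - The first potential match for "' + pattern.pattern + '" is: ' + result)
```

This relies on matches having been built in the same order as patterns, which the list comprehension guarantees.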

### Compiling regular expressions

<p>Compiling regular expressions as in the previous example can improve performance if you are using the same regular expression multiple times.</p>
<pre><code>compiled_re = re.compile(r'some_regexpr')    
for word in text:
    match = compiled_re.search(word)
    # do something with the match
</code></pre><p><strong>E.g., if we want to check if a string ends with a substring:</strong></p>

In [7]:
import re

needle = 'needlers'

# Python approach
print (any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))

# On-the-fly Regular expression in Python
print(bool(re.search(r'(ly|ed|ing|ers)$', needle)))

# Compiled Regular expression in Python
comp = re.compile(r'(ly|ed|ing|ers)$') 
print(bool(comp.search(needle)))

True
True
True


In [8]:
%timeit -n 1000 -r 50 bool(any([needle.endswith(e) for e in ('ly', 'ed', 'ing', 'ers')]))
%timeit -n 1000 -r 50 bool(re.search(r'(ly|ed|ing|ers)$', needle))
%timeit -n 1000 -r 50 bool(comp.search(needle))

608 ns ± 23.3 ns per loop (mean ± std. dev. of 50 runs, 1000 loops each)
660 ns ± 14.2 ns per loop (mean ± std. dev. of 50 runs, 1000 loops each)
356 ns ± 48.5 ns per loop (mean ± std. dev. of 50 runs, 1000 loops each)


### Summary of terms for regular expressions

<p><strong><span style="color:red">'[ ]'</span></strong> - character class: exactly one of the characters inside has to match.</p>
<p><strong><span style="color:red">'|'</span></strong> - alternation: matches either the expression on its left or the one on its right.</p>
<p><strong><span style="color:red">'( )'</span></strong> - group: everything inside has to be matched, in order, and the matched text is captured.</p>
<p><strong><span style="color:red">'{ }'</span></strong> - sets an exact number or a range of repetitions, e.g. a{3} or a{2,4}.</p>
<p><strong><span style="color:red">'\'</span></strong> - escape: treats the next character as a literal character and not as a regular expression symbol.</p>
<p><strong><span style="color:red">'.' (Dot.)</span></strong> - In the default mode, this matches any character except a newline. </p>
<p><strong><span style="color:red">'^' (Caret.)</span></strong> - Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.</p>
<p><strong><span style="color:red">'<code>$</code>'</span></strong> - Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. </p>
<p><strong><span style="color:red">'\*'</span></strong> - Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match 'a', 'ab', or 'a' followed by any number of 'b's.</p>
<p><strong><span style="color:red">'<code>+</code>'</span></strong> - Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match 'a' followed by any non-zero number of 'b's; it will not match just 'a'.</p>
<p><strong><span style="color:red">'?'</span></strong> - Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either 'a' or 'ab'.</p>
<ul>
<li><span style="color:red">\d </span>- Matches any decimal digit; for ASCII text this is equivalent to the class [0-9].</li>
<li><span style="color:red">\D </span>- Matches any non-digit character; for ASCII text this is equivalent to the class [^0-9].</li>
<li><span style="color:red">\s </span>- Matches any whitespace character; for ASCII text this is equivalent to the class [ \t\n\r\f\v].</li>
<li><span style="color:red">\S </span>- Matches any non-whitespace character; for ASCII text this is equivalent to the class [^ \t\n\r\f\v].</li>
<li><span style="color:red">\w </span>- Matches any alphanumeric character or underscore; for ASCII text this is equivalent to the class [a-zA-Z0-9_]. In Python 3, with str patterns, it also matches non-ASCII letters such as í unless the re.ASCII flag is set.</li>
<li><span style="color:red">\W </span>- Matches any character that \w does not match; for ASCII text this is equivalent to the class [^a-zA-Z0-9_].</li>
</ul>
<p>For more comprehensive and complete documentation, ref: <a href="http://docs.python.org/2/library/re.html#re-syntax">http://docs.python.org/2/library/re.html#re-syntax</a></p>
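
The sketch below runs a few of these symbols against small sample strings. The strings are invented for illustration; each dictionary entry pairs a pattern with the text it is applied to.

```python
import re

# pattern -> hypothetical sample text
samples = {
    r'ab*':     'a ab abbb',      # zero or more 'b'
    r'ab+':     'a ab abbb',      # one or more 'b'
    r'ab?':     'a ab abbb',      # zero or one 'b'
    r'\d{2,3}': '7 42 123 9999',  # two to three digits
    r'^\w+':    'first second',   # word characters at the start only
    r'\w+$':    'first second',   # word characters at the end only
    r'3\.14':   '3.14 and 3x14',  # escaped dot matches only a literal '.'
    r'3.14':    '3.14 and 3x14',  # unescaped dot matches any character
}

results = {pattern: re.findall(pattern, text) for pattern, text in samples.items()}
for pattern, found in results.items():
    print(pattern, '->', found)
```

Comparing the last two entries shows why the backslash matters: without it, the dot also accepts the x in 3x14.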

#### Pre-defined character classes example

In [9]:
patterns2 = [
re.compile(r'\w'),
re.compile(r'\W'),
re.compile(r'\d'),
re.compile(r'\D'),
re.compile(r'\s'),
re.compile(r'\S')]

test_strings.append('Aardvarks belong to the Captain, capt_hook')

for test_string in test_strings:
    matches = [find_match(pattern, test_string) for pattern in patterns2]
    print('In: "' + test_string +'"')
    for pattern in patterns2:
        print (' - The first potential match for "' + pattern.pattern + '" is: ' + matches[patterns2.index(pattern)])

In: "varks Aard belíng to the Captain"
 - The first potential match for "\w" is: v
 - The first potential match for "\W" is:  
 - The first potential match for "\d" is: no match.
 - The first potential match for "\D" is: v
 - The first potential match for "\s" is:  
 - The first potential match for "\S" is: v
In: "Albert's famous equation, E = mc^2."
 - The first potential match for "\w" is: A
 - The first potential match for "\W" is: '
 - The first potential match for "\d" is: 2
 - The first potential match for "\D" is: A
 - The first potential match for "\s" is:  
 - The first potential match for "\S" is: A
In: "Located at 455 Serra Mall."
 - The first potential match for "\w" is: L
 - The first potential match for "\W" is:  
 - The first potential match for "\d" is: 4
 - The first potential match for "\D" is: L
 - The first potential match for "\s" is:  
 - The first potential match for "\S" is: L
In: "Beware of the shape-shifters!"
 - The first potential match for "\w" is: B
 - The

#### Examples matching sequences with regular expressions

In [10]:
test_strings2 = ["The Aardvarks belong to the Captain.",
                 "Bitter butter won't make the batter better.",
                 "Hark, the pitter patter of little feet!"]

patterns3 = [re.compile(r'(Aa)'),
re.compile(r'[Aa][Aa]'),
re.compile(r'[aeiou][aeiou]'),
re.compile(r'[AaEeIiOoUu][aeiou]'),
re.compile(r'[Tt]he'),
re.compile(r'^[Tt]he'),
re.compile(r'n.'),
re.compile(r'n.$'),
re.compile(r'\W\w'),
re.compile(r'\w[aeiou]tter'),
re.compile(r'..tt..')]

for test_string in test_strings2:
    matches = [find_match(pattern, test_string) for pattern in patterns3]
    print('In: "' + test_string +'"')
    for pattern in patterns3:
        print (' - The first potential match for "' + pattern.pattern + '" is: ' + matches[patterns3.index(pattern)])

In: "The Aardvarks belong to the Captain."
 - The first potential match for "(Aa)" is: Aa
 - The first potential match for "[Aa][Aa]" is: Aa
 - The first potential match for "[aeiou][aeiou]" is: ai
 - The first potential match for "[AaEeIiOoUu][aeiou]" is: Aa
 - The first potential match for "[Tt]he" is: The
 - The first potential match for "^[Tt]he" is: The
 - The first potential match for "n." is: ng
 - The first potential match for "n.$" is: n.
 - The first potential match for "\W\w" is:  A
 - The first potential match for "\w[aeiou]tter" is: no match.
 - The first potential match for "..tt.." is: no match.
In: "Bitter butter won't make the batter better."
 - The first potential match for "(Aa)" is: no match.
 - The first potential match for "[Aa][Aa]" is: no match.
 - The first potential match for "[aeiou][aeiou]" is: no match.
 - The first potential match for "[AaEeIiOoUu][aeiou]" is: no match.
 - The first potential match for "[Tt]he" is: the
 - The first potential match for "^[T

In [11]:
def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None

for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]
    print('In: "' + test_string +'"')
    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
            
            print (' - All potential matches for "' + pattern.pattern + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)]))
        else:
            print (' - There were no matches for "' + pattern.pattern + '.')
    print('\n')

In: "The Aardvarks belong to the Captain."
 - All potential matches for "(Aa)" is/are: Aa
 - All potential matches for "[Aa][Aa]" is/are: Aa
 - All potential matches for "[aeiou][aeiou]" is/are: ai
 - All potential matches for "[AaEeIiOoUu][aeiou]" is/are: Aa, ai
 - All potential matches for "[Tt]he" is/are: The, the
 - All potential matches for "^[Tt]he" is/are: The
 - All potential matches for "n." is/are: ng, n.
 - All potential matches for "n.$" is/are: n.
 - All potential matches for "\W\w" is/are:  A,  b,  t,  t,  C
 - There were no matches for "\w[aeiou]tter.
 - There were no matches for "..tt...


In: "Bitter butter won't make the batter better."
 - There were no matches for "(Aa).
 - There were no matches for "[Aa][Aa].
 - There were no matches for "[aeiou][aeiou].
 - There were no matches for "[AaEeIiOoUu][aeiou].
 - All potential matches for "[Tt]he" is/are: the
 - There were no matches for "^[Tt]he.
 - All potential matches for "n." is/are: n'
 - There were no matches for "

We have a new function and some new code. Let's go over it:

First, we wrote a function called find_all_matches:

<code>
def find_all_matches(pattern, string):
    matches = re.findall(pattern, string)
    if matches:
        return matches
    else:
        return None
</code>

There are only two differences between find_match and find_all_matches. First, find_all_matches uses re.findall, not re.search, so matches is a list of all non-overlapping matches rather than a single match object. Second, instead of returning a single string in either condition, find_all_matches returns either a list of strings or None.

<code>
for test_string in test_strings2:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns3]

    for pattern in patterns3:
        if matches[patterns3.index(pattern)]:
</code>

Remember the use of .index() from the previous code walkthrough. Also, remember that None evaluates to false in a logical condition test. In this if statement, we are testing to see if there were any matches for the current pattern in the loop. If there were any matches, the code will execute the next line. Otherwise, it will go to the else block. 

<code>
print ('All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns3.index(pattern)]))
</code>

If matches at the index of the current pattern is not None, it will be a list of strings. Because we are printing these results, we wanted to nicely format them for diagnostic purposes. So we use the standard list-to-string Python expression separator.join(list). In this case, we wanted the results to be comma-separated, so the separator is ', '. 

In [12]:
test_strings3 = ['Now Mr. N said, "Nooooooo!"',
                 'Then she told him he had to be quiet.']

patterns4 = [re.compile(r'No*'),
re.compile(r'No+'),
re.compile(r'No?'),
re.compile(r'No{7}'),
re.compile(r's?he'),
re.compile(r'she|he')]

for test_string in test_strings3:
    matches = [find_all_matches(pattern, test_string) for pattern in patterns4]
    
    for pattern in patterns4:
        if matches[patterns4.index(pattern)]:
            print ('All potential matches for "' + pattern.pattern + '" in "' + test_string + '" is/are: ' + ', '.join(matches[patterns4.index(pattern)]))
        else:
            print ('There were no matches for "' + pattern.pattern + '" in "' + test_string + '".')

All potential matches for "No*" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, Nooooooo
All potential matches for "No+" in "Now Mr. N said, "Nooooooo!"" is/are: No, Nooooooo
All potential matches for "No?" in "Now Mr. N said, "Nooooooo!"" is/are: No, N, No
All potential matches for "No{7}" in "Now Mr. N said, "Nooooooo!"" is/are: Nooooooo
There were no matches for "s?he" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "she|he" in "Now Mr. N said, "Nooooooo!"".
There were no matches for "No*" in "Then she told him he had to be quiet.".
There were no matches for "No+" in "Then she told him he had to be quiet.".
There were no matches for "No?" in "Then she told him he had to be quiet.".
There were no matches for "No{7}" in "Then she told him he had to be quiet.".
All potential matches for "s?he" in "Then she told him he had to be quiet." is/are: he, she, he
All potential matches for "she|he" in "Then she told him he had to be quiet." is/are: he, she, he


#### Capturing groups

In Python, match objects have .groups() and .group() methods. These correspond to the capturing groups established in the regex, if you chose to indicate groups. Group 0 is always the entire match to the whole regex. To access the result for a capturing group, you pass the capturing group index to the .group() method. 

In [52]:
test_strings4 = ['The benefit is being held for Mr. Kite and Mr. Henderson.',
                 'Tickets cost $5.00 for adults, $3.50 for children.',
                 'Over 9000 attendees are expected, up from 900 attendees last year.',
                 'Over 9,000 attendees are expected, up from 900 attendees last year.']

patterns5 = [re.compile(r'Mr\. (\w+)'),
             re.compile(r'\$(\d+\.\d\d)'),
             re.compile(r'(\d+) attendees'),
             re.compile(r'((\d+,)*\d+) attendees')]

In [53]:
# simple example
print (patterns5[3].pattern)
print (test_strings4[3])

matches = re.search(patterns5[3], test_strings4[3])
print ("\nMatches:  " + matches.group() + "\n")
print ('Group 0: ' + matches.group(0))
print ('Group 1: ' + matches.group(1))
print ('Group 2: ' + matches.group(2))
# print ('Group 3: ' + matches.group(3)) # what happens if you uncomment this?


((\d+,)*\d+) attendees
Over 9,000 attendees are expected, up from 900 attendees last year.

Matches:  9,000 attendees

Group 0: 9,000 attendees
Group 1: 9,000
Group 2: 9,


This example searched for r'((\d+,)*\d+) attendees' in the string "Over 9,000 attendees are expected, up from 900 attendees last year." There are two groups, one nested inside the other. Groups are numbered by the position of their opening parenthesis, counting from the left, so the outer group comes first. This is why Group 1 is 9,000 and Group 2 is 9,.
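
A small sketch of the other ways to read groups from the same match: groups() returns every capturing group at once, and asking for a group number that does not exist raises an IndexError. The variable names are chosen for illustration.

```python
import re

text = 'Over 9,000 attendees are expected, up from 900 attendees last year.'
match = re.search(r'((\d+,)*\d+) attendees', text)

print(match.group(0))     # the whole match
print(match.groups())     # all capturing groups, starting at group 1
print(match.group(1, 2))  # several groups at once, returned as a tuple
print(match.re.groups)    # number of capturing groups in the pattern

try:
    match.group(3)        # the pattern only defines groups 1 and 2
except IndexError as err:
    print('IndexError:', err)
```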

In [68]:
for test_string in test_strings4:
    print("")
    print(test_string)
    for pattern in patterns5:
        print(pattern)
        for result in re.finditer(pattern, test_string): # iterator finding a new group each time instead of a list (findall)
            for i in range(pattern.groups+1):
                print (' the group ' +str(i)+ ' match is ' + str(result.group(i)))


The benefit is being held for Mr. Kite and Mr. Henderson.
re.compile('Mr\\. (\\w+)')
 the group 0 match is Mr. Kite
 the group 1 match is Kite
 the group 0 match is Mr. Henderson
 the group 1 match is Henderson
re.compile('\\$(\\d+\\.\\d\\d)')
re.compile('(\\d+) attendees')
re.compile('((\\d+,)*\\d+) attendees')

Tickets cost $5.00 for adults, $3.50 for children.
re.compile('Mr\\. (\\w+)')
re.compile('\\$(\\d+\\.\\d\\d)')
 the group 0 match is $5.00
 the group 1 match is 5.00
 the group 0 match is $3.50
 the group 1 match is 3.50
re.compile('(\\d+) attendees')
re.compile('((\\d+,)*\\d+) attendees')

Over 9000 attendees are expected, up from 900 attendees last year.
re.compile('Mr\\. (\\w+)')
re.compile('\\$(\\d+\\.\\d\\d)')
re.compile('(\\d+) attendees')
 the group 0 match is 9000 attendees
 the group 1 match is 9000
 the group 0 match is 900 attendees
 the group 1 match is 900
re.compile('((\\d+,)*\\d+) attendees')
 the group 0 match is 9000 attendees
 the group 1 match is 9000
 the 

Before we go over this code block, let's establish the purpose of the code. We wanted to return all the matches for each group. But there are a few concerns:

<ul>
<li>The number of groups is different for each pattern. </li>
<li><code>.findall</code> returns a list of matches, and if there is more than one capturing group, it returns a list of tuples, where the length of each tuple is the number of capturing groups.</li>
</ul>

In [16]:
print (patterns5[3].pattern)
print (test_strings4[3])
matches = re.findall(patterns5[3], test_strings4[3])
print("\nMatches: ")
print (matches)
matches = re.search(patterns5[3], test_strings4[3])
print ('\nGroup 0: ' + matches.group(0))
print ('Group 1: ' + matches.group(1))
print ('Group 2: ' + matches.group(2))


((\d+,)*\d+) attendees
Over 9,000 attendees are expected, up from 900 attendees last year.

Matches: 
[('9,000', '9,'), ('900', '')]

Group 0: 9,000 attendees
Group 1: 9,000
Group 2: 9,
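
If the tuples are not wanted, one option is to make the inner group non-capturing with the (?:...) syntax, which the notebook has not introduced so far. A minimal sketch on the same string:

```python
import re

text = 'Over 9,000 attendees are expected, up from 900 attendees last year.'

# Two capturing groups: findall returns one tuple per match
with_groups = re.findall(r'((\d+,)*\d+) attendees', text)

# Inner group made non-capturing: one group remains, so findall returns plain strings
flat = re.findall(r'((?:\d+,)*\d+) attendees', text)

print(with_groups)
print(flat)
```

With exactly one capturing group, re.findall returns the text of that group for each match instead of a tuple.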


But there are other ways of constructing this kind of loop.

<code>
for test_string in test_strings4:
    for pattern in patterns5:
        for result in re.finditer(pattern, test_string):
</code>

re.finditer returns an iterator. This loop means that for every pattern and for each string we're testing, instead of creating a list of matches, we're going to create an iterator that yields one match object per match.

<code>
for i in range(pattern.groups+1):
</code>

The .groups attribute of a compiled pattern holds the number of capturing groups in the regular expression. range is a function that produces a sequence of integers from a start value up to a stop value, by a step value. If you just give it one int, by default it will treat that value as the stop value and start from 0. Now, we add 1 to this value because the end point is omitted in range. If we want to return all the groups, including group 0, we have to add that end point back.

<code>
print ('In "' + test_string + '", '  + 'given pattern "' + pattern.pattern + '", the group ' +str(i)+ ' match is ' + str(result.group(i)))
</code>

Because i runs over the group numbers from 0 to pattern.groups, we can use i as the index of the group we'd like to return from the current match produced by the iterator. That's why we can call result.group(i). 
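
The same loop can also be written without range, by putting group 0 in front of the tuple returned by groups() and numbering the values with enumerate. This is only a sketch: the pattern below is a two-group variation on the price pattern, and the rows list is an illustrative name.

```python
import re

# Variation with two capturing groups: whole units and cents
pattern = re.compile(r'\$(\d+)\.(\d\d)')
text = 'Tickets cost $5.00 for adults, $3.50 for children.'

rows = []
for result in pattern.finditer(text):
    # group(0) is the whole match; groups() holds groups 1..n
    for i, value in enumerate((result.group(0),) + result.groups()):
        rows.append((i, value))
        print(' the group ' + str(i) + ' match is ' + value)
```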

#### Summary of useful functions

<ul>
<li><code>re.match()</code>  : Determine if the RE matches at the beginning of the string.</li>
<li><code>re.search()</code> : Scan through a string, looking for any location where this RE matches.</li>
<li><code>re.findall()</code> : Find all substrings where the RE matches, and returns them as a list.</li>
<li><code>re.finditer()</code> : Find all substrings where the RE matches, and returns them as an iterator.</li>
</ul>
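
The four functions can be compared side by side on one short string, invented here for illustration:

```python
import re

text = "cat bat cat"

at_start = re.match(r'bat', text)                        # None: 'bat' is not at the start
first_bat = re.search(r'bat', text).span()               # position of the first 'bat'
all_cats = re.findall(r'cat', text)                      # list of matching substrings
cat_starts = [m.start() for m in re.finditer(r'cat', text)]  # one match object per match

print(at_start)
print(first_bat)
print(all_cats)
print(cat_starts)
```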

### Use case examples using regular expressions

#### Identify files via file extensions

<p>A regular expression to check for file extensions.  </p>

In [17]:
import re
pattern = r'(?i)(\w+)\.(jpeg|jpg|png|gif|tif|svg)$'

# remove `(?i)` to make regexpr case-sensitive

str_true = ('test.gif', 
            'image.jpeg', 
            'image.jpg',
            'image.TIF'
            )

str_false = ('test',
             'test.pdf',
             'test.gif.gif',
             )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s correct' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s invalid' %f)

test.gif correct
image.jpeg correct
image.jpg correct
image.TIF correct
test invalid
test.pdf invalid
test.gif.gif invalid


#### Username validation

<p>Checking for a valid user name that has a certain minimum and maximum length.</p>
<p>Allowed characters:</p>
<ul>
<li>letters (upper- and lower-case)</li>
<li>numbers</li>
<li>dashes</li>
<li>underscores</li>
</ul>

In [18]:
min_len = 5 # minimum length for a valid username
max_len = 15 # maximum length for a valid username

pattern = r"(?i)^[a-z0-9_-]{%s,%s}$" %(min_len, max_len)  # global flags such as (?i) belong at the start of the pattern

# remove `(?i)` to only allow lower-case letters



str_true = ('user123', '123_user', 'Username')
            
str_false = ('user', 'username1234_is-way-too-long', 'user$34354')

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is valid' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is invalid' %f)

user123 is valid
123_user is valid
Username is valid
user is invalid
username1234_is-way-too-long is invalid
user$34354 is invalid
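
An alternative sketch of the same check passes the flag as an argument instead of writing it inline, and uses fullmatch so that the ^ and $ anchors are not needed. The helper name is_valid_username is made up for this example.

```python
import re

min_len, max_len = 5, 15

# re.IGNORECASE replaces the inline (?i); fullmatch replaces the ^ and $ anchors
username_re = re.compile(r'[a-z0-9_-]{%d,%d}' % (min_len, max_len), re.IGNORECASE)


def is_valid_username(name):
    """Return True if the whole name consists of 5 to 15 allowed characters."""
    return username_re.fullmatch(name) is not None


for name in ('user123', '123_user', 'Username', 'user', 'user$34354'):
    print(name, is_valid_username(name))
```

Passing the flag separately keeps the pattern itself free of mode switches, which some readers find easier to scan.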




#### Checking for valid email addresses

A regular expression that captures most email addresses.

In [19]:
pattern = r"(?i)(^(\w+\.|\w+-)*\w+@(\w+\.|\w+-)*\w+\.[a-z]{2,3}$)"  # global flags such as (?i) belong at the start of the pattern

str_true = ('l-l.l@mail.Aom.PP',)
            
str_false = ('testmail.com','test@mail.com.', '@testmail.com', 'test@mailcom')

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is valid' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is invalid' %f)

l-l.l@mail.Aom.PP is valid
testmail.com is invalid
test@mail.com. is invalid
@testmail.com is invalid
test@mailcom is invalid




#### Checking whether a string is a URL
<p>The string is accepted if it ...</p>
<ul>
<li>starts with <code>https://</code>, or <code>http://</code>, or <code>www.</code></li>
<li>and ends with a dot extension</li>
</ul>

In [20]:
pattern = r'^((https?://)|www\.)([\dA-Za-z.-]+)\.([A-Za-z.-]*)$'

str_true = ('https://githuB.com', 
            'http://github.com',
            'www.github.com',
            'https://www.github.com'
            )
            
str_false = ('//testmail.com', 'http:testmailcom', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is valid' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is invalid' %f)

https://githuB.com is valid
http://github.com is valid
www.github.com is valid
https://www.github.com is valid
//testmail.com is invalid
http:testmailcom is invalid


In [21]:
# EXERCISE: IMPROVE THE PREVIOUS PATTERN. THINK OF ADDITIONAL CONSTRAINTS
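
One possible direction for this exercise is to let the standard library split the URL and then apply constraints to the pieces. The sketch below is only a starting point: the constraints (scheme must be `http` or `https`, host must be dotted and end in an alphabetic suffix of two or more letters) are assumptions for illustration, and `looks_like_url` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

def looks_like_url(text):
    """Rough URL check: http(s) scheme (or a leading 'www.') plus a dotted host name."""
    # Assumption: a bare 'www.' prefix is treated as shorthand for 'http://www.'
    if text.startswith('www.'):
        text = 'http://' + text
    parts = urlparse(text)
    if parts.scheme not in ('http', 'https'):
        return False
    # urlparse lower-cases the host name; it is None when there is no '//host' part.
    host = parts.hostname or ''
    # Labels of letters/digits/hyphens separated by dots, ending in a 2+ letter suffix.
    return re.fullmatch(r'([a-z\d-]+\.)+[a-z]{2,}', host) is not None

for candidate in ('https://githuB.com', 'www.github.com', 'http://github.', 'http:testmailcom'):
    print(candidate, looks_like_url(candidate))
```

Unlike the regular expression above, this version rejects `http://github.` (empty extension) and tolerates a path or query string after the host.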


### Checking for numbers

##### Positive integers

In [22]:
pattern = r'^\d+$'

str_true = ('123', '1', )
            
str_false = ('abc', '1.1', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a positive integer' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a positive integer' %f)

123 is a positive integer
1 is a positive integer
abc is not a positive integer
1.1 is not a positive integer


##### Negative integers

In [23]:
pattern = r'^-\d+$'

str_true = ('-123', '-1', )
            
str_false = ('123', '-abc', '-1.1', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a negative integer' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a negative integer' %f)

-123 is a negative integer
-1 is a negative integer
123 is not a negative integer
-abc is not a negative integer
-1.1 is not a negative integer


##### All integers

In [24]:
pattern = r'^-{0,1}\d+$'

str_true = ('-123', '-1', '1', '123',)
            
str_false = ('123.0', '-abc', '-1.1', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is an integer' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not an integer' %f)

-123 is an integer
-1 is an integer
1 is an integer
123 is an integer
123.0 is not an integer
-abc is not an integer
-1.1 is not an integer


##### Positive numbers

In [25]:
pattern = r'^\d*\.{0,1}\d+$'
str_true = ('1', '123', '1.234','0.2','.2')
            
str_false = ('-abc', '-123', '-123.0')
print("PATTERN:",pattern)
for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a positive number' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a positive number' %f)
print()
pattern = r'^\d+([.]\d+)?$'
print("PATTERN:",pattern)
for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a positive number' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a positive number' %f)

PATTERN: ^\d*\.{0,1}\d+$
1 is a positive number
123 is a positive number
1.234 is a positive number
0.2 is a positive number
.2 is a positive number
-abc is not a positive number
-123 is not a positive number
-123.0 is not a positive number

PATTERN: ^\d+([.]\d+)?$
1 is a positive number
123 is a positive number
1.234 is a positive number
0.2 is a positive number
-abc is not a positive number
-123 is not a positive number
-123.0 is not a positive number


##### Negative numbers

In [26]:
pattern = r'^-\d*\.{0,1}\d+$'

str_true = ('-1', '-123', '-123.0', )
            
str_false = ('-abc', '1', '123', '1.234', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a negative number' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a negative number' %f)

-1 is a negative number
-123 is a negative number
-123.0 is a negative number
-abc is not a negative number
1 is not a negative number
123 is not a negative number
1.234 is not a negative number


##### All numbers

In [27]:
pattern = r'^-{0,1}\d*\.{0,1}\d+$'

str_true = ('1', '123', '1.234', '-123', '-123.0')
            
str_false = ('a','-abc')

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a number' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a number' %f)

1 is a number
123 is a number
1.234 is a number
-123 is a number
-123.0 is a number
a is not a number
-abc is not a number
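
A complementary way to decide whether a string is a number is to let Python try to parse it. The sketch below compares the two approaches; the helper names `is_number_regex` and `is_number_float` are hypothetical, and the pattern is the "all numbers" pattern above written with `?` and `fullmatch`.

```python
import re

# Equivalent to '^-{0,1}\d*\.{0,1}\d+$', with fullmatch doing the anchoring.
NUMBER = re.compile(r'-?\d*\.?\d+')

def is_number_regex(text):
    """True if `text` has the shape accepted by the pattern above."""
    return NUMBER.fullmatch(text) is not None

def is_number_float(text):
    """True if Python's float() can parse `text`."""
    try:
        float(text)
        return True
    except ValueError:
        return False

for s in ('1.234', '-123.0', '.2', '1.', '1e3', 'abc'):
    print(s, is_number_regex(s), is_number_float(s))
```

The two checks agree on most inputs but not all: `float()` accepts `'1.'` and scientific notation such as `'1e3'`, while the pattern rejects both. Which behavior is right depends on what the cleaned data is supposed to contain.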


### Validating dates and time

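Validates dates in mm/dd/yyyy format. Note: the pattern only checks the shape of the string, so some implausible dates are not rejected; for example, a date in the year 2080 or a nonexistent day such as 02/31/2014 is still accepted.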

In [28]:
pattern = r'^(0[1-9]|1[0-2])/(0[1-9]|1\d|2\d|3[01])/(19|20)\d{2}$'

str_true = ('01/08/2014', '12/30/2014', )
            
str_false = ('22/08/2014', '-123', '1/8/2014', '1/08/2014', '01/8/2014')

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a valid date' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a valid date' %f)

01/08/2014 is a valid date
12/30/2014 is a valid date
22/08/2014 is not a valid date
-123 is not a valid date
1/8/2014 is not a valid date
1/08/2014 is not a valid date
01/8/2014 is not a valid date
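
Because the pattern checks only the format, a second step is needed to reject days that do not exist. A minimal sketch, assuming the standard library's `datetime.strptime` is acceptable for the calendar check; `is_real_date` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Same format as the pattern above (zero-padded mm/dd, years 1900-2099).
DATE_FORMAT = re.compile(r'(0[1-9]|1[0-2])/(0[1-9]|[12]\d|3[01])/(19|20)\d{2}')

def is_real_date(text):
    """True only if `text` is mm/dd/yyyy AND names a day that exists."""
    # The regex enforces zero padding; strptime alone would accept '1/8/2014'.
    if DATE_FORMAT.fullmatch(text) is None:
        return False
    try:
        datetime.strptime(text, '%m/%d/%Y')
        return True
    except ValueError:
        # Right shape, impossible day (e.g. February 31st).
        return False

for candidate in ('01/08/2014', '02/29/2012', '02/31/2014', '1/8/2014'):
    print(candidate, is_real_date(candidate))
```

The regular expression and the parser do different jobs here: the first pins down the exact textual format, the second knows month lengths and leap years.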


#### 12-Hour format

`\s` matches any whitespace character (space, tab, newline, ...); here it allows an optional space before am/pm.

In [29]:
pattern = r'(?i)^(1[012]|[1-9]):[0-5][0-9](\s)?(am|pm)$'

str_true = ('2:00pm', '7:30 AM', '12:05 am', )
            
str_false = ('22:00pm', '14:00', '3:12', '03:12pm', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a valid 12-hour format' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a valid 12-hour format' %f)

2:00pm is a valid 12-hour format
7:30 AM is a valid 12-hour format
12:05 am is a valid 12-hour format
22:00pm is not a valid 12-hour format
14:00 is not a valid 12-hour format
3:12 is not a valid 12-hour format
03:12pm is not a valid 12-hour format




#### 24-Hour format

In [30]:
pattern = r'^([01][0-9]|2[0-3]):[0-5][0-9]$'

str_true = ('14:00', '00:30', )
            
str_false = ('22:00pm', '4:00', )

for t in str_true:
    if (bool(re.match(pattern, t)) == True):
        print ('%s is a valid 24-hour format' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False):
        print ('%s is not a valid 24-hour format' %f)

14:00 is a valid 24-hour format
00:30 is a valid 24-hour format
22:00pm is not a valid 24-hour format
4:00 is not a valid 24-hour format
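
Beyond validation, the capture groups of such a pattern can be used to normalize times during data cleaning. The following sketch converts 12-hour strings to the 24-hour format; `to_24_hour` is a hypothetical helper, and returning `None` for invalid input is an assumption made for illustration.

```python
import re

# 12-hour pattern from above, with groups for hour, minutes and am/pm.
TIME_12H = re.compile(r'(1[012]|[1-9]):([0-5][0-9])\s?(am|pm)', re.IGNORECASE)

def to_24_hour(text):
    """Convert e.g. '7:30 PM' to '19:30'; return None if `text` is not a valid 12-hour time."""
    m = TIME_12H.fullmatch(text)
    if m is None:
        return None
    hour, minute, meridiem = int(m.group(1)), m.group(2), m.group(3).lower()
    hour = hour % 12          # 12 becomes 0, so '12:05 am' maps to 00:05
    if meridiem == 'pm':
        hour += 12            # afternoon hours; '12:00 pm' ends up as 12:00
    return '%02d:%s' % (hour, minute)

for candidate in ('2:00pm', '7:30 AM', '12:05 am', '14:00'):
    print(candidate, '->', to_24_hour(candidate))
```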


### Checking for HTML/XML, etc. tags (a very simple approach)

In [31]:
pattern = r"""</?(\w+|\w+\s*[\w+="]*)/?>"""

str_true = ('<a>', '<a href="somethinG">', '</a>', '<img src>')
            
str_false = ('a>', '<a ', '< a >')

for t in str_true:
    if (bool(re.match(pattern, t)) == True): 
        print ('%s is a valid tag text' %t)
for f in str_false:
    if (bool(re.match(pattern, f)) == False): 
        print ('%s is not a valid tag text' %f)

<a> is a valid tag text
<a href="somethinG"> is a valid tag text
</a> is a valid tag text
<img src> is a valid tag text
a> is not a valid tag text
<a  is not a valid tag text
< a > is not a valid tag text
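
In a cleaning pipeline the usual goal is to remove tags rather than to validate them. The sketch below uses `re.sub` with a slightly different, deliberately simple tag pattern; both the pattern and the helper name `strip_tags` are assumptions for illustration, and for real-world HTML a proper parser (for example the standard library's `html.parser`) is the more robust choice.

```python
import re

# '<', optional '/', a tag name, then anything up to the next '>'.
TAG = re.compile(r'</?\w+[^>]*>')

def strip_tags(text):
    """Remove anything that looks like an HTML/XML tag and collapse leftover whitespace."""
    no_tags = TAG.sub('', text)
    return re.sub(r'\s+', ' ', no_tags).strip()

print(strip_tags('<p>Great <b>shoes</b>, see <a href="x.html">here</a>!</p>'))
```

Replacing tags with the empty string keeps punctuation attached to the preceding word, at the cost of gluing together words that were separated only by a tag such as `<br>`.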


#### E-mail address?

In [32]:
# What is a valid email address?


#### ID/Passport/NIF