# Modern Data Science 
**(Module 00: Programming Python)**

---
- Materials in this module include resources collected from various open-source online repositories.
- You are free to use, change and distribute this package.

Prepared by and for 
**Student Members** |
2006-2018 [TULIP Lab](http://www.tulip.org.au), Australia

---


# Session D - Text processing


## Introduction


**Strings** are among the most popular types in *Python*. Python provides a wide range 
of string methods and other services to support text processing through string 
manipulation. We will look at the most common string methods for text manipulation 
and the use of **regular expressions** to extract data from **HTML** files. 

For more information, please refer to the documentation for [Python's built-in text sequence type (str)](https://docs.python.org/3/library/stdtypes.html#textseq).


## Table of Contents

1. [Work with string](#cell_string)

2. [Regular expression](#cell_regular)


<a id = "cell_string"></a>

## 1. Work with string 


### String methods
We will look at some of the most common string methods in this section. A method is 
like a function, but it runs "**on**" an object. 
 
For example, **upper** is a method that can be invoked on any **string** 
object to create a new string in which all the characters are in uppercase. **lower**
works in a similar fashion, changing all characters in the string to lowercase. Note that 
the original string **ss** remains unchanged and a new string **tt** is created. 

In [None]:
ss = "Hello, World"
print(ss.upper())

tt = ss.lower()
print(tt)

In addition to **upper** and **lower**, the following  table provides a summary 
of some other useful string methods. 

<table width="304" border="1">
  <tr>
    <th width="50" scope="col">Method</th>
    <th width="250" scope="col">Description</th>
  </tr>
  <tr>
    <td>upper</td>
    <td>Returns a string in all uppercase</td>
  </tr>
  
   <tr>
    <td>lower</td>
    <td>Returns a string in all lowercase </td>
  </tr>
  
   <tr>
    <td>strip </td>
    <td>Returns a string with the leading and trailing whitespace removed
</td>
   </tr>


   <tr>
    <td>lstrip</td>
    <td>Returns a string with the leading whitespace removed</td>
   </tr>

   <tr>
    <td>rstrip</td>
    <td>Returns a string with the trailing whitespace removed
    </td>
   </tr>

   <tr>
    <td>replace</td>
    <td>Replaces all occurrences of the old substring with the new one</td>
   </tr>

   <tr>
    <td>center</td>
    <td>Returns a string centered in a field of width spaces </td>
   </tr>
   
   <tr>
    <td>ljust</td>
    <td>Returns a string left justified in a field of width spaces  </td>
   </tr>
   
   <tr>
    <td>rjust</td>
    <td>Returns a string right justified in a field of width spaces </td>
   </tr>
   
   <tr>
    <td>find</td>
    <td>Returns the leftmost index where the substring item is found </td>
   </tr>
   
     <tr>
    <td>rfind</td>
    <td>Returns the rightmost index where the substring item is found </td>
   </tr>
   
  

</table>



The following script provides  examples that illustrate use of these methods. Please 
type in and run the scripts to try them out. 

**Example 1:**

In [None]:
ss = "    Hello, World    "

els = ss.count("l")
print(els)

In [None]:
print("***" + ss.strip() + "***")

In [None]:
print("***" + ss.lstrip() + "***")

In [None]:
print("***" + ss.rstrip() + "***")

In [None]:
news = ss.replace("o", "***")
print(news)

**Example 2:**

In [None]:
food = "banana bread"
print(food.capitalize())

In [None]:
print("*" + food.center(25) + "*")

In [None]:
print("*" + food.ljust(25) + "*")     # stars added to show bounds

In [None]:
print("*" + food.rjust(25) + "*")

In [None]:
print(food.find("e"))

In [None]:
print(food.find("na"))

In [None]:
print(food.find("b"))

In [None]:
print(food.rfind("e"))

In [None]:
print(food.rfind("na"))

In [None]:
print(food.rfind("b"))

In [None]:
print(food.index("e"))

You can also make up your own examples to gain more understanding. Note once again 
that the methods that return strings do not change the original. 
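
The examples above call both **find** and **index** with a substring that is present, so they print the same result. The short sketch below shows where the two differ, and repeats the point that string methods leave the original string alone. The variable names `shouted`, `missing_find` and `index_raised` are made up for this illustration.

```python
food = "banana bread"

# Methods that return strings build a new string; the original is untouched.
shouted = food.upper()
print(shouted)        # BANANA BREAD
print(food)           # banana bread

# find() reports a missing substring by returning -1 ...
missing_find = food.find("z")
print(missing_find)   # -1

# ... while index() raises ValueError for the same input.
try:
    food.index("z")
    index_raised = False
except ValueError:
    index_raised = True
print(index_raised)   # True
```

If you want to test for a missing substring with an **if** statement, **find** is usually the more convenient of the two.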

### The **in** operator ###
The **in** operator tests whether one string is a substring of another. Think about
what the output of each of the following statements should be, then try them out and check your answers.

In [None]:
print('p' in 'apple')

In [None]:
print('i' in 'apple')

In [None]:
print('ap' in 'apple')

In [None]:
print('pa' in 'apple')

Note that a string is a substring of itself, and the empty string is a 
substring of any other string. See the following examples. Also note that you often need 
to consider these edge cases very carefully so that your programs run smoothly on all
  possible inputs. 
 

In [None]:
print('a' in 'a')

In [None]:
print('apple' in 'apple')

In [None]:
print('' in 'a')

In [None]:
print('' in 'apple')

 The **not in** operator returns the logical opposite result of **in**. 
 

In [None]:
print('x' not in 'apple')

### String and List ###

Two of the most useful methods on strings involve lists of strings. The **split** 
method breaks a string into a list of words. By default, any number of whitespace 
characters is considered a word boundary. 

In [None]:
song = "The rain in Spain..."
wds = song.split()
print(wds)

An optional argument called a **delimiter** can be used to specify which string 
to use as the word boundary. Notice that the delimiter does not appear in the result.

In [None]:
song = "The rain in Spain..."
wds = song.split('ai')
print(wds)

The inverse of the **split** method is **join**. You choose a desired separator string 
(often called the **glue**) and join the list with the glue between each of the elements.

In [None]:
wds = ["red", "blue", "green"]
glue = ';'
s = glue.join(wds)
print(s)

In [None]:
print("***".join(wds))

In [None]:
print(" ".join(wds))
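
One way to see **split** and **join** working as a pair is to tidy up irregular spacing. This is only a sketch, and the sample strings and the names `messy`, `tidy`, `line` and `parts` are invented for the example.

```python
messy = "  The   rain in\tSpain  "

# split() with no argument treats any run of whitespace as one boundary
# and ignores leading and trailing whitespace.
words = messy.split()
print(words)    # ['The', 'rain', 'in', 'Spain']

# join() glues the pieces back together with exactly one space between them.
tidy = " ".join(words)
print(tidy)     # The rain in Spain

# With an explicit delimiter, split followed by join restores the original.
line = "red;blue;green"
parts = line.split(";")
print(";".join(parts) == line)   # True
```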

<a id = "cell_regular"></a>

## 2. Regular Expression ##


*Regular expressions* are a powerful language for matching text patterns.
Remember that you learned regular expressions with the **grep** command in 
Prac *Regular Expression and GREP*. In this prac, you will be given a basic 
introduction to how regular expressions work in Python. Support for regular 
expressions is provided by the **re** module. 

In Python, a regular expression search is typically written as **match = re.search(pat, str)**.

The **re.search()** function takes a regular expression pattern and a string as 
parameters, and searches for that pattern within the string. If the search is 
successful, **search()** returns a match object; otherwise, it returns **None**.

To start using regular expression in your Python code, import the "re" module. 

In [None]:
import re

The following script provides a template to test whether the search succeeded and print the
matching text. This example searches for the pattern **word:** followed by 
a 3-letter word. We will use this "search" template to test different regular expression examples later.

In [None]:
str = 'an example word:cat!!'
match = re.search(r'word:\w\w\w', str)

# If-statement after search() tests whether it succeeded
if match:                      
    print('found', match.group())  ## 'found word:cat'
else:
    print('did not find')


The code **match = re.search(pat, str)** stores the search result in 
**match**. Then the **if** statement tests **match**: if it is truthy, 
the search succeeded and **match.group()** returns the matching text. Otherwise, 
**match** is **None** (which counts as false), the search did not succeed, and 
there is no matching text. 
 
The "**r**" at the start of the pattern string designates a Python "raw" string.
A raw string passes all backslashes through without change, which is very handy for 
regular expressions. I recommend that you always write pattern strings with the "**r**". 
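
To see why the raw string matters, the sketch below compares the same pattern written with and without the "**r**". The sample text and the names `plain` and `raw` are made up for this illustration.

```python
import re

# In a normal string, '\n' is one character (a newline);
# in a raw string, r'\n' is two characters: a backslash and an 'n'.
print(len('\n'))     # 1
print(len(r'\n'))    # 2

# '\b' in a normal string is the backspace character, so the first pattern
# looks for backspaces. r'\b' reaches the re module unchanged and means
# "word boundary".
text = 'a cat sat'
plain = re.search('\bcat\b', text)
raw = re.search(r'\bcat\b', text)
print(plain)         # None
print(raw.group())   # cat
```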

 

### Basic patterns ###
The power of regular expressions is that they can specify patterns, not just fixed 
characters. Here are the most *basic patterns* which match single chars:

	


- **a, X, 9, <**: ordinary characters just match themselves exactly.
 The meta-characters, which do not match themselves because they have special meanings, 
 are: **. ^ $ \* + ? { } [ ] \ | ( )** (details below)

- **.** (a period): matches any single character except newline **\n**

- **\w** (lowercase w): matches a "word" character: a letter or digit 
or underscore **[a-zA-Z0-9_]**. Note that although "word" is the mnemonic for this, 
it only matches a single word char, not a whole word. 
- **\W** (upper case W): matches any non-word character.

- **\b**: boundary between word and non-word

- **\s** (lowercase s): matches a single whitespace character -- space, 
newline, return, tab, form feed **[ \n\r\t\f]**. **\S** (upper case S): matches any non-whitespace character.

- **\t, \n, \r**: Tab, newline, return

- **\d**: Decimal digit [0-9] 

- **^ = start, $ = end**:  match the start or end of the string

Now let us look at some examples.

Before we start, here is a joke: what do you call a pig with three eyes? **piiig**!

The basic rules of regular expression search for a pattern within a string are:


- The search proceeds through the string from start to end, stopping at the first match 
found
- The whole pattern must be matched, but not necessarily the whole string
- If **match = re.search(pat, str)** is successful, match is not **None**
and in particular **match.group()** is the matching text.



Read through the following examples, and try them out. Please note that you should not run the following code as a script. Instead, replace the first two statements in  the previous "search" template  with each  individual statement and run the modified code. 

In [None]:
## Search for pattern 'iii' in string 'piiig'.
## All of the pattern must match, but it may appear anywhere.
## On success, match.group() is matched text.

match = re.search(r'iii', 'piiig')  # =>  found, match.group() == "iii"
match = re.search(r'igs', 'piiig')  # =>  not found, match == None

## . = any char but \n
match = re.search(r'..g', 'piiig') # =>  found, match.group() == "iig"

## \d = digit char, \w = word char
match = re.search(r'\d\d\d', 'p123g')  #=>  found, match.group() == "123"
match = re.search(r'\w\w\w', '@@abcd!!') #=>  found, match.group() == "abc"
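
The cell above does not exercise **\b**, **\s**, **\S**, **^** or **$**. The following self-contained sketch covers them in the same style. The sample strings and the names `boundary`, `spaced`, `starts_pi`, `ends_ig` and `starts_ig` are invented for the example.

```python
import re

# \b = word boundary: 'cat' only matches as a whole word,
# so the 'cat' inside 'concat' is skipped.
boundary = re.search(r'\bcat\b', 'concat a cat')
print(boundary.group(), boundary.start())   # cat 9

# \S = non-whitespace char, \s = whitespace char
spaced = re.search(r'\S\s\S', 'ab cd')
print(spaced.group())                       # b c

# ^ and $ anchor the pattern to the start and end of the string
starts_pi = re.search(r'^pi', 'piiig') is not None
ends_ig = re.search(r'ig$', 'piiig') is not None
starts_ig = re.search(r'^ig', 'piiig') is not None
print(starts_pi, ends_ig, starts_ig)        # True True False
```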

### Repetition
Things get more interesting when you use + and * to specify *repetition* in the 
pattern.

- **+**  1 or more occurrences of the pattern to its left, e.g. **i+** = one or 
more **i**'s

- **\***  0 or more occurrences of the pattern to its left

- **?**  match 0 or 1 occurrences of the pattern to its left

Here are examples to demonstrate repetition in the pattern. Again, use them to modify the "search" template and check the result. 

In [None]:
## i+ = one or more i's, as many as possible.
match = re.search(r'pi+', 'piiig')      #=>  found, match.group() == "piii"

## Finds the first/leftmost solution, and within it drives the +
## as far as possible (aka 'leftmost and largest').
## In this example, note that it does not get to the second set of i's.
match = re.search(r'i+', 'piigiiii')    # =>  found, match.group() == "ii"

## \s* = zero or more whitespace chars
## Here look for 3 digits, possibly separated by whitespace.
match = re.search(r'\d\s*\d\s*\d', 'xx1 2   3xx')    #=>  found, match.group() == "1 2   3"
match = re.search(r'\d\s*\d\s*\d', 'xx12  3xx')      #=>  found, match.group() == "12  3"
match = re.search(r'\d\s*\d\s*\d', 'xx123xx')        #=>  found, match.group() == "123"

## ^ = matches the start of string, so this fails:
match = re.search(r'^b\w+', 'foobar')    #=>  not found, match == None
## but without the ^ it succeeds:
match = re.search(r'b\w+', 'foobar')     #=>  found, match.group() == "bar"
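
The examples above cover **+** and **\***, but not **?**. The sketch below adds it and contrasts the "zero allowed" behaviour of **\*** with **+**. The sample words and the names `american`, `british`, `zero_i`, `many_i` and `needs_i` are made up for this illustration.

```python
import re

# u? = zero or one 'u', so both spellings match
american = re.search(r'colou?r', 'my color')
british = re.search(r'colou?r', 'my colour')
print(american.group())   # color
print(british.group())    # colour

# i* = zero or more i's; zero is allowed, so 'pg' still matches
zero_i = re.search(r'pi*g', 'pg')
many_i = re.search(r'pi*g', 'piiig')
print(zero_i.group())     # pg
print(many_i.group())     # piiig

# i+ needs at least one 'i', so 'pg' does not match
needs_i = re.search(r'pi+g', 'pg')
print(needs_i)            # None
```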

Here is an email example. Suppose you want to find the email address inside a string. 
We might use the pattern **r'\w+@\w+'** to match 
multiple characters before and after **@**. Please run the following script:

In [None]:
import re
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'\w+@\w+', str)
if match:
    print(match.group())  ## 'b@google'

The search does not get the whole email address in this case because 
**\w** does not match the **-** or **.** in 
the address. We will fix this using another regular expression feature: square 
brackets.

Square brackets can be used to indicate a set of chars, so **[abc]** 
matches **a** or **b** or **c**. The codes, such as **\w**
and **\s**, work inside square brackets too. The one exception 
is that a dot (.) inside square brackets just means a literal dot. 

For the emails problem, the square brackets are an easy way to add **.** and **-** 
to the set of chars which can appear around the **@**.   The pattern 
**r'[\w.-]+@[\w.-]+'**  is used to get the whole email address:

In [None]:
import re
match = re.search(r'[\w.-]+@[\w.-]+', str)
if match:
    print(match.group()) 
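
The rules for square brackets can be checked one at a time. This is a minimal sketch with invented sample strings, and the names `one_of`, `any_char`, `literal_dot_miss`, `literal_dot_hit` and `digits_spaces` are hypothetical.

```python
import re

# [abc] matches exactly one character from the set
one_of = re.search(r'[abc]', 'xyzb')
print(one_of.group())            # b

# Outside brackets, . matches any character ...
any_char = re.search(r'a.b', 'axb')
print(any_char.group())          # axb

# ... but inside brackets it is a literal dot
literal_dot_miss = re.search(r'a[.]b', 'axb')
literal_dot_hit = re.search(r'a[.]b', 'a.b')
print(literal_dot_miss)          # None
print(literal_dot_hit.group())   # a.b

# Codes such as \d and \s keep their meaning inside brackets
digits_spaces = re.search(r'[\d\s]+', 'ab12 34cd')
print(digits_spaces.group())     # 12 34
```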

### Group extraction ###
The "group" feature of a regular expression allows you to pick out parts of the matching
text. Suppose for the emails problem that we want to extract the username and host separately.
To do this, add parentheses ( ) around the username and host in the pattern, like this:
**r'([\w.-]+)@([\w.-]+)'**. In this case, the parentheses
do not change what the pattern will match; instead they establish logical "groups" inside
the match text. On a successful search, **match.group(1)** is the match text
corresponding to the 1st left parenthesis, and **match.group(2)** is the text
corresponding to the 2nd left parenthesis. The plain **match.group()** is still 
the whole match text as usual.

In [None]:
str = 'purple alice-b@google.com monkey dishwasher'
match = re.search(r'([\w.-]+)@([\w.-]+)', str)
if match:
    print(match.group())   ## 'alice-b@google.com' (the whole match)
    print(match.group(1))  ## 'alice-b' (the username, group 1)
    print(match.group(2)) ## 'google.com' (the host, group 2)

A common workflow with regular expressions is that you write a pattern for the thing 
you are looking for, adding parenthesis groups to extract the parts you want. 
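
As one more illustration of this workflow, the sketch below pulls the year, month and day out of a date. The log line is made up for the example, and `line` is a hypothetical name. The `groups()` method of a match object returns all the groups at once as a tuple.

```python
import re

# Hypothetical log line, invented for this example
line = 'backup finished on 2018-03-27 at 14:05'

# Step 1: write a pattern for the whole date.
# Step 2: wrap the parts you want (year, month, day) in parentheses.
match = re.search(r'(\d\d\d\d)-(\d\d)-(\d\d)', line)
if match:
    print(match.group())    # 2018-03-27 (the whole match)
    print(match.group(1))   # 2018 (the year, group 1)
    print(match.group(2))   # 03 (the month, group 2)
    print(match.group(3))   # 27 (the day, group 3)
    print(match.groups())   # ('2018', '03', '27')
```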

### The **findall** function

Above we used **re.search()** to find the first match for a pattern. **findall()** 
finds *all* the matches and returns them as a list of strings, with each string
representing one match.

In [None]:
import re
## Suppose we have a text with many email addresses
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

## Here re.findall() returns a list of all the found email strings
emails = re.findall(r'[\w\.-]+@[\w\.-]+', str) ## ['alice@google.com', 'bob@abc.com']
for email in emails:
    # do something with each found email string
    print(email)

The parenthesis **( )** group mechanism can be combined with **findall()**. If the 
pattern includes 2 or more parenthesis groups, then instead of returning a list of 
strings, **findall()** returns a list of *tuples*. Each tuple represents 
one match of the pattern, and inside the tuple is the **group(1), group(2) ..** 
data.

In [None]:
import re
str = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
tuples = re.findall(r'([\w\.-]+)@([\w\.-]+)', str)
print(tuples)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
for username, host in tuples:
    print(username)  ## username
    print(host)      ## host

Once you have the list of tuples, you can loop over it to do some computation for each 
tuple. If the pattern includes no parentheses, then **findall()** returns a list of
found strings as in earlier examples. If the pattern includes a single set of parentheses,
then **findall()** returns a list of strings corresponding to that single group.
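
The three cases can be compared side by side on the same text. This is only a sketch, and the names `text`, `no_groups`, `one_group` and `two_groups` are chosen for the illustration.

```python
import re

text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

# No parentheses: a list of the whole matches
no_groups = re.findall(r'[\w.-]+@[\w.-]+', text)
print(no_groups)    # ['alice@google.com', 'bob@abc.com']

# One group: a list of strings holding just that group (here, the host)
one_group = re.findall(r'[\w.-]+@([\w.-]+)', text)
print(one_group)    # ['google.com', 'abc.com']

# Two groups: a list of tuples, one tuple per match
two_groups = re.findall(r'([\w.-]+)@([\w.-]+)', text)
print(two_groups)   # [('alice', 'google.com'), ('bob', 'abc.com')]
```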

<a id = "cell_project"></a>