# Data Analysis with Python

## Outline
* Classes
* Files
* Scripts
* String Handling and Processing

## Classes
Python has the following built-in classes of objects:

* int
* float
* str
* list
* set
* tuple
* dict

Python is an **object-oriented programming language**.

While this course will not present OOP in depth, it is necessary to understand the basics of objects and classes.

A **class** is a template.  It consists of a set of variables and functions.  An **object** is an instance of a class.  Class/object variables are referred to as **attributes** while class/object functions are referred to as **methods**.  Attributes and methods may belong to the object or the class.

Consider the following toy example:

## Example 1

In [None]:
class Animal():
    '''This is the documentation'''
    
    animal_list=[]                        # class attribute, shared by all instances
    
    def print_animals():                  # no self parameter: it is called on the class itself (Python 3 only)
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):       # __init__ runs each time the constructor Animal(...) creates a new instance
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):                                 # this has self so it's an object method
        print("A %s says %s" % (self.species, self.sound))

* `Animal` is the **class**
* `animal_list` is a **class attribute**
* `print_animals` is a **class method**
* `species` and `sound` are each an **object attribute**
* `say` is an **object method**
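Because `print_animals` takes no parameters, it can only be called as `Animal.print_animals()` in Python 3. A more conventional way to attach a function to the class rather than to its objects is the built-in `@classmethod` decorator, which works in both Python 2 and Python 3. The following is a minimal sketch; the names `TrackedAnimal` and `list_animals` are hypothetical and are not part of the course code.

```python
class TrackedAnimal(object):
    """Hypothetical variant of Animal, used only for this sketch."""

    animal_list = []  # class attribute shared by every instance

    def __init__(self, species, sound):
        TrackedAnimal.animal_list.append(species)
        self.species = species
        self.sound = sound

    @classmethod
    def list_animals(cls):
        # cls is the class itself; @classmethod passes it automatically
        return list(cls.animal_list)


TrackedAnimal("tiger", "roar")
teddy = TrackedAnimal("bear", "grrr")

from_class = TrackedAnimal.list_animals()   # called on the class
from_object = teddy.list_animals()          # also reachable from an instance
print(from_class)
```

Returning the list instead of printing it is a choice made for this sketch so that the result can be inspected.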

In [None]:
tigger=Animal("tiger","roar")  # here we are using the constructor 
teddy=Animal("bear","grrr")

* `tigger` and `teddy` are each **objects**.  They are instances of the `Animal` **class**.
* The method `Animal()` is known as a **constructor**.  Constructors return an object instance.

In [None]:
Animal.animal_list

In [None]:
Animal.print_animals()   #doesn't work in python 2

In [None]:
teddy.species

In [None]:
tigger.say()

In [None]:
print(type(tigger))

In [None]:
isinstance(tigger,Animal)

In [None]:
isinstance(tigger, int)

In [None]:
tigger.__doc__

In [None]:
tigger  #tigger itself is just a reference to an object (notice the memory address it is giving us)

In [None]:
tigger2 = tigger
tigger2.sound='purrrr'
tigger.say()
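The last cell shows that assignment does not copy an object: both names refer to the same instance. The sketch below makes this explicit with the `is` operator and contrasts it with `copy.copy`, which creates a separate object. The stripped-down `Animal` class and the variable names are assumptions made so the block runs on its own.

```python
import copy


class Animal(object):
    """Minimal stand-in for the Animal class above."""

    def __init__(self, species, sound):
        self.species = species
        self.sound = sound


tigger = Animal("tiger", "roar")
alias = tigger              # a second name for the same object
clone = copy.copy(tigger)   # a new object with the same attribute values

alias.sound = "purrrr"      # changes the one object shared by alias and tigger

same_object = alias is tigger
tigger_sound = tigger.sound
clone_sound = clone.sound   # the copy is unaffected
print(same_object, tigger_sound, clone_sound)
```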

## Example 2
Create an instance of a list:

In [None]:
li = [1, 2, 3, 4]

Checking its type:

In [None]:
type(li)

In [None]:
isinstance(li,list)

Checking its attributes

In [None]:
print li.__doc__

Check its behavior when passed to the function `len()`

In [None]:
print li.__len__()
print len(li)

Check the number of times the value 1 appears in the list

In [None]:
print  li.count(1) 

In [None]:
li.append(3)   #object method
li

In [None]:
list.append(li,5)  #class method
li

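In Python, every object method can also be called through its class, with the object passed explicitly as the first argument: `li.append(5)` and `list.append(li, 5)` do the same thing.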

## Example 3

### Inheritance

* Classes can inherit attributes and methods from other classes. In this case, `Pet` **inherits** from the `Animal` class.
* `Pet` is said to be a subclass of `Animal`, `Animal` is a superclass of `Pet`.


The class declaration for a subclass uses the superclass as an argument while the constructor for the subclass passes parameters to the superclass constructor, as follows: 

In [None]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        #super().__init__(species,sound)   # this doesn't work in python 2
        Animal.__init__(self,species,sound)  
        self.name = name
              

In [None]:
pup=Pet('canine','woof','Spot')

In [None]:
pup.say()
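In Python 3, the call to the superclass constructor is usually written with `super()`, as in the commented-out line above. A minimal sketch, assuming Python 3 and a simplified `Animal` whose `say` returns the sentence instead of printing it:

```python
class Animal:
    """Simplified stand-in for the Animal class above."""

    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def say(self):
        return "A %s says %s" % (self.species, self.sound)


class Pet(Animal):
    def __init__(self, species, sound, name):
        # super() finds the superclass, so Animal is not named explicitly
        super().__init__(species, sound)
        self.name = name


pup = Pet("canine", "woof", "Spot")
sentence = pup.say()   # inherited from Animal
print(sentence)
```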

The subclass can override inherited attributes and methods of the superclass.

In [None]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        Animal.__init__(self,species,sound)
        self.name = name
        
    def say(self):
        print("%s the %s says %s" % (self.name, self.species, self.sound))

In [None]:
kit=Pet('cat','meow','Tom')
kit.say()

In [None]:
Pet.say(kit)  # calling the method through the class

In [None]:
Animal.say(kit)  # here I am accessing the old say method associated with the Animal class (super class version of say)

`kit` is an instance of both the subclass `Pet` and the superclass `Animal`:

In [None]:
isinstance(kit,Animal)

In [None]:
isinstance(kit,Pet)

In [None]:
isinstance(tigger,Pet)

Consider what happens when the object itself is printed, as follows:

In [None]:
print tigger

In [None]:
tigger

In [None]:
str(tigger)

The behavior of an object in the `str` or `print` function can be changed using the `__str__` method.

In [None]:
class Animal():
    '''This is the documentation'''
    
    animal_list=[]
    
    def print_animals():
        for elem in Animal.animal_list:
            print(elem)
    
    def __init__(self,a_species,a_sound):
        Animal.animal_list.append(a_species)
        self.species=a_species
        self.sound=a_sound
        
    def say(self):
        print("A %s says %s" % (self.species, self.sound))
        
    def __str__(self):    # we can change the behavior of the print function 
        return "This %s is one of %i animals."  %(self.species, len(Animal.animal_list))

In [None]:
tigger=Animal("tiger","roar")
teddy=Animal("bear","grrr")
print tigger  # now, printing the object gives us a meaningful output

In [None]:
tigger
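Evaluating `tigger` on its own still shows the default representation, because the notebook displays the result of `repr`, not `str`. Defining `__repr__` changes that as well. The sketch below uses a simplified `Animal`; the exact strings it returns are illustrative choices, not part of the course code.

```python
class Animal(object):
    """Simplified stand-in for the Animal class above."""

    def __init__(self, species, sound):
        self.species = species
        self.sound = sound

    def __str__(self):
        # used by print() and str()
        return "A %s that says %s" % (self.species, self.sound)

    def __repr__(self):
        # used when the object is echoed at the prompt or shown inside a list
        return "Animal(%r, %r)" % (self.species, self.sound)


tigger = Animal("tiger", "roar")
as_str = str(tigger)
as_repr = repr(tigger)
print(as_str)
print(as_repr)
```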

In [None]:
class Pet(Animal): 
    
    def __init__(self, species, sound, name):
        Animal.__init__(self,species,sound)
        self.name = name
        
    def say(self):
        print("%s the %s says %s" % (self.name, self.species, self.sound))

In [None]:
kit=Pet('cat','meow','Tom')
print kit   # this is the power of object oriented programming .. we changed the super class and all subclasses inherited 

## Exercise

* Create a `Dog` class which is a subclass of `Pet`
* Give it the `species` value "canine" and the `sound` value "woof"
* Give it an additional object attribute `age`
* Give it an additional object method `dog_age` which returns its age times seven
* Override the `__str__` method so it returns a string stating the name and that it is an old dog if it is over 10, or a puppy if it is under 3.
* Create an instance of `Dog` and test its methods.


In [None]:
# your code here

class Dog(Pet): 
    
    def __init__(self, name, age):
        # species and sound are the same for every dog
        Pet.__init__(self, 'canine', 'woof', name)
        self.age = age
        
    def dog_age(self):
        # return the value (rather than printing it) so callers can reuse it
        return self.age * 7
        
    def __str__(self):
        if self.age > 10:
            return self.name + " is an old dog."
        elif self.age < 3:
            return self.name + " is a puppy."
        else:
            return self.name + " is a dog."

In [None]:
dg1=Dog('Scooby', 13)
dg2=Dog('Pika', 7)
dg3=Dog('Fido',1)
print(dg1)
print(dg2)
print(dg3)

In [None]:
print(dg1.dog_age())
print(dg2.dog_age())
print(dg3.dog_age())

## Files

Working with files is fairly straightforward in Python.

In [None]:
f = open('fooz.txt', 'w') 

In [None]:
print type(f)


In [None]:
print f.__doc__

In [None]:
!dir

When opening a file, the mode can be:

* 'r': reading (default)
* 'w': writing
* 'a': appending
* 'r+': opens the file for both reading and writing

In this case the mode is 'w', so we either overwrite the file "fooz.txt" or create a new file "fooz.txt".
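The difference between the 'w' and 'a' modes can be checked directly. The sketch below writes to a scratch file in a temporary directory (the file name `modes_demo.txt` is an arbitrary choice) so that "fooz.txt" is left untouched.

```python
import os
import tempfile

# scratch file in a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "modes_demo.txt")

with open(path, "w") as f:      # 'w' creates the file
    f.write("first\n")

with open(path, "a") as f:      # 'a' keeps the contents and adds to the end
    f.write("second\n")

with open(path, "r") as f:
    after_append = f.read()

with open(path, "w") as f:      # 'w' again discards the old contents
    f.write("replaced\n")

with open(path, "r") as f:
    after_write = f.read()

print(repr(after_append))
print(repr(after_write))
```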

## Writing Files

In [None]:
f.write('this is some text.\nThis is another line.\n')
f.close()

**Do not forget to close the file after using it.** Until the file is closed, buffered writes may not have reached the disk, and on some systems other applications cannot open the file.

Print out the contents by using a shell command, as follows:

In [None]:
!more fooz.txt

## Reading Files

The `read` method returns the file contents as a string.

In [None]:
f = open('fooz.txt', 'r') 
f.read()

Running `f.read` a second time returns an empty string, because the first call already moved the file position to the end of the file.

In [None]:
f.read()

In [None]:
f.close()
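Instead of closing and reopening the file, the position can be moved back to the start with the `seek` method. A minimal sketch on a scratch file (the file name is an arbitrary choice):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "seek_demo.txt")
with open(path, "w") as f:
    f.write("this is some text.\n")

with open(path, "r") as f:
    first = f.read()    # reads the whole file
    second = f.read()   # nothing left to read: empty string
    f.seek(0)           # move the file position back to the beginning
    third = f.read()    # reads the whole file again

print(repr(first), repr(second), repr(third))
```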

## Reading part of the file

In [None]:
f = open('fooz.txt', 'r')

In [None]:
f.read(5) # the first 5 bytes

In [None]:
f.read(5) # the next 5 bytes

In [None]:
f.read() # the remaining contents

In [None]:
f.close()
help(f.read)

## Reading Files
Another way to read files is the method **readlines**.  This method returns a list in which each element is one line of the file, including its newline character.

In [None]:
f = open('fooz.txt', 'r')
lines=f.readlines()
print lines
f.close()

In [None]:
help(f.readline)

## Iterating a file object
The file object is iterable.

In [None]:
lines

In [None]:
f = open('fooz.txt', 'r')
for i in f:    # the file object itself is also iterable 
    print(i)
f.close()

Add two more lines into the file "fooz.txt" by using the following shell command.

In [None]:
!echo 'Third line.\nFourth line.' >> fooz.txt

In [None]:
!more fooz.txt

In [None]:
# this reads the even lines
f = open('fooz.txt', 'r')
content = []
num = 0
for i in f:
    num += 1
    if num % 2 == 0:
        content.append(i)
        print num, ':', i
f.close()
print content
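The manual counter in the cell above can be replaced by the built-in `enumerate`. The sketch below assumes Python 3 and uses `io.StringIO` as a stand-in for the opened file, since it can be iterated line by line in the same way; the four lines mirror the contents written to "fooz.txt".

```python
import io

# in-memory stand-in for open('fooz.txt', 'r')
f = io.StringIO(
    "this is some text.\n"
    "This is another line.\n"
    "Third line.\n"
    "Fourth line.\n"
)

# enumerate numbers the lines from 1; keep only the even-numbered ones
even_lines = [line for num, line in enumerate(f, start=1) if num % 2 == 0]
print(even_lines)
```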

## Another way to deal with files
Another way to open a file is to use the `with` statement. It automatically closes the file when the block ends.

In [None]:
with open('fooz.txt', 'r') as f: #open the file for reading
    data = f.read()
    
data # the file is automatically closed now

In [None]:
f.readlines()  # this gives me an error because f is closed.

In [None]:
f=open('fooz.txt', 'r')  #open the file for reading
data = f.read()
f.close()
    
data # the file is closed now

Since the file was closed, we can not read its contents any more.
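Whether a file object is closed can be checked through its `closed` attribute, and reading from a closed file raises a `ValueError`. A small sketch on a scratch file (file name chosen arbitrarily):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "closed_demo.txt")
with open(path, "w") as f:
    f.write("some text\n")

with open(path, "r") as f:
    data = f.read()

closed_after_with = f.closed    # the with block has already closed f

try:
    f.read()
    error_name = None
except ValueError as err:       # raised for I/O on a closed file
    error_name = type(err).__name__

print(closed_after_with, error_name)
```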

## More operations on files
Run the following command to see the other operations available on file objects:

In [None]:
help(type(f))

In [None]:
type(f)

## Preliminary Topics for the Next Object/Class Example

In this case, we create a new class **myfile** which is a subclass of the existing file handling class.   It will add some more methods, such as `wordCountSort` method used to sort the frequencies of the words.

Before that, it's necessary to know how to sort a dictionary.

### Sort a dictionary

A dictionary cannot be sorted directly, so first convert it to a list of (key, value) tuples with the `.items()` method.

Then the function **sorted** can be used to sort the list.

In [16]:
d = {'a': 5, 'b': 3, 'c':2}
d = d.items() # convert to a list
print d

[('a', 5), ('c', 2), ('b', 3)]


In [17]:
sorted(d)  # sorts the tuples by key

[('a', 5), ('b', 3), ('c', 2)]

In [18]:
sorted(d, key = lambda x: x[1])

[('c', 2), ('b', 3), ('a', 5)]

* The argument `key` of `sorted` expects a callable, that is, a function.
* Here `lambda x: x[1]` is an anonymous function, which is similar to a function defined with the keyword `def`.
* The whole command sorts the list by the second element of each tuple.
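As a variation on the cells above, the standard library's `operator.itemgetter` can stand in for the lambda, and `reverse=True` sorts in descending order. A minimal sketch using the same dictionary:

```python
from operator import itemgetter

d = {'a': 5, 'b': 3, 'c': 2}

# itemgetter(1) plays the same role as lambda x: x[1]
by_value = sorted(d.items(), key=itemgetter(1))

# reverse=True gives the largest value first
by_value_desc = sorted(d.items(), key=lambda pair: pair[1], reverse=True)

print(by_value)
print(by_value_desc)
```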

The final subtopic before the next example is a quick review of lambda functions.

### Lambda Functions

In [None]:
square1 = lambda x: x**2
def square2(x):
    return x ** 2

print square1(2)
print square2(2)

Functions defined with a `lambda` expression are called anonymous functions, because they do not need a name.

In [None]:
sorted(d, key = lambda x: x[1])

## Example 4

For the following example there are two classes given.  One for Python2 and one for Python3.

In [None]:
#python2.7 has file object

import string
class myfile(file):
    def __init__(self, name, mode):
        file.__init__(self, name, mode)
        
    def __str__(self):
        return "Opening file %s" %self.name
    
    def wordCount(self, punctuation='\n', ignoreCase = True):
        '''
        punctuation: punctuations to remove
    
        returns: a dict mapping each word to its frequency
        '''
        ## read contents and convert to lower
        try:
            raw_string = self.read()
            if ignoreCase:
                raw_string = raw_string.lower()
        except:
            raise Exception("Can't read file %s"%self.name)
            
        ## replace all the punctuation with spaces
        for i in string.punctuation:
            raw_string = raw_string.replace(i, ' ')
        
        if punctuation != None:
            for i in punctuation:
                raw_string = raw_string.replace(i, ' ')
        
        ## split by space, count each word
        raw_list = raw_string.split(' ')
        result = {}
        for word in raw_list:
            if word in result.keys():
                result[word] += 1
            else:
                result[word] = 1
    
        # remove null character
        # len('') is 0
        result = {key:value for (key, value) in result.items() if len(key) != 0}
        return result
    
    def wordCountSort(self, descend = True, punctuation='\n', ignoreCase = True):
        '''
        return the sorted word frequency
        '''
        result = self.wordCount(punctuation, ignoreCase = ignoreCase)
        result = sorted(result.items(), key = lambda x: x[1] , reverse = descend)
        return result
    
    def mostCommonWord(self, num=5, punctuation='\n', descend = True, ignoreCase = True):
        '''
        return the most common words
        '''
        result = self.wordCountSort(punctuation=punctuation, ignoreCase = ignoreCase, descend = descend)
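        if num > len(result):
            import warnings
            # Warning(...) alone only builds an exception object; warnings.warn reports it
            warnings.warn('There are only %s words' % len(result))
            return result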
        else:
            return result[:num]

In [None]:
#python3 has no file object

import string
class myfile():
    def __init__(self, name, mode):
        #wrap around return object of open()
        self.file=open(name, mode)
        
        
    def __str__(self):
        return "Opening file %s" %self.file.name
    
    def wordCount(self, punctuation='\n', ignoreCase = True):
        '''
        punctuation: punctuations to remove
    
        returns: a dict mapping each word to its frequency
        '''
        ## read contents and convert to lower
        try:
            raw_string = self.file.read()
            if ignoreCase:
                raw_string = raw_string.lower()
        except:
            raise Exception("Can't read file %s"%self.file.name)
            
        ## replace all the punctuation with spaces
        for i in string.punctuation:
            raw_string = raw_string.replace(i, ' ')
        
        if punctuation != None:
            for i in punctuation:
                raw_string = raw_string.replace(i, ' ')
        
        ## split by space, count each word
        raw_list = raw_string.split(' ')
        result = {}
        for word in raw_list:
            if word in result.keys():
                result[word] += 1
            else:
                result[word] = 1
    
        # remove null character
        # len('') is 0
        result = {key:value for (key, value) in result.items() if len(key) != 0}
        return result
    
    def wordCountSort(self, descend = True, punctuation='\n', ignoreCase = True):
        '''
        return the sorted word frequency
        '''
        result = self.wordCount(punctuation, ignoreCase = ignoreCase)
        result = sorted(result.items(), key = lambda x: x[1] , reverse = descend)
        return result
    
    def mostCommonWord(self, num=5, punctuation='\n', descend = True, ignoreCase = True):
        '''
        return the most common words
        '''
        result = self.wordCountSort(punctuation=punctuation, ignoreCase = ignoreCase, descend = descend)
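        if num > len(result):
            import warnings
            # Warning(...) alone only builds an exception object; warnings.warn reports it
            warnings.warn('There are only %s words' % len(result))
            return result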
        else:
            return result[:num]

In [None]:
f=myfile('fooz.txt', 'r')
print f, '...'
print f.wordCount()

In [None]:
f = myfile('fooz.txt', 'r') 
print f, '...'
print f.wordCountSort() # sort

In [None]:
f= myfile('fooz.txt', 'r')
print f, '...'
print f.mostCommonWord(num = 2)
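For comparison, the counting and ranking steps of `myfile` can also be handled by `collections.Counter` from the standard library. This is a sketch of an alternative, not a replacement for the class above; the function name `word_counts` and its parameter are hypothetical, and the sample text mirrors the first two lines of "fooz.txt".

```python
import string
from collections import Counter


def word_counts(text, ignore_case=True):
    """Count words in text after replacing punctuation with spaces."""
    if ignore_case:
        text = text.lower()
    for ch in string.punctuation:
        text = text.replace(ch, ' ')
    # split() with no argument splits on any whitespace and drops empty strings
    return Counter(text.split())


counts = word_counts("this is some text.\nThis is another line.\n")
top_two = counts.most_common(2)   # the two most frequent (word, count) pairs
print(top_two)
```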

### Test Case
Next, test the code on a larger data file, which is 376 KB.

In [None]:
!dir -s abalone.data

The first 5 words with highest frequencies:

In [None]:
f = myfile('abalone.data', 'r')
print f, '...'
print f.mostCommonWord(num = 5)

Change to case sensitive:

In [None]:
f=myfile('abalone.data', 'r')
print f, '...'
print f.mostCommonWord(num = 5, ignoreCase=False)

The 5 words with the lowest frequencies:

In [None]:
f=myfile('abalone.data', 'r')
print('%s ...' % f)
print(f.mostCommonWord(num = 5, ignoreCase=False, descend=False))

## Scripts
### Running Python scripts
Python is a scripting language, which means we can run a script as a shell command.
Here is a simple example: write the line `print('1 + 1 = %s' % (1+1))` into the file "script1.py":

In [None]:
!echo print ('1 + 1 = %s' %(1+1)) > script1.py

In [None]:
!more script1.py

In [None]:
!python script1.py

## Run Python scripts
In practice, we usually need to interact with scripts.

- Take in some inputs.
- Run the script.
- Return some outputs.

`input` is a function for getting input from users in python3.

`raw_input` is the equivalent function in python2.

In [3]:
#use raw_input for python2
age = raw_input('how old are you? ')

how old are you? 12


In [4]:
age

'12'

In [5]:
script2 = open('script2.py', 'w')
script2.write('name = input("What is your name?\\n")\n')
script2.write('age = input("How old are you?\\n")\n')
script2.write('print("You are %s, %s years old." % (name, age))\n')
script2.close()

In [None]:
!python script2.py   # this hangs in the notebook: the script waits for keyboard input

In [6]:
open('script3.py', 'w').write('''from sys import argv
script, name, age = argv
print "The script is called:", script
print "My name is %s." %name
print "Im %s years old." %age''')

In [7]:
f=open('script3.py', 'r') 
num = 1
for i in f:
    print (num, ':', i)
    num += 1

(1, ':', 'from sys import argv\n')
(2, ':', 'script, name, age = argv\n')
(3, ':', 'print "The script is called:", script\n')
(4, ':', 'print "My name is %s." %name\n')
(5, ':', 'print "Im %s years old." %age')


In [8]:
!more script3.py

from sys import argv
script, name, age = argv
print "The script is called:", script
print "My name is %s." %name
print "Im %s years old." %age


In [9]:
!python script3.py jack 18

The script is called: script3.py
My name is jack.
Im 18 years old.


In [6]:
!python script3.py Lucy 20

The script is called: script3.py
My name is Lucy.
Im 20 years old.
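Unpacking `argv` directly fails with a `ValueError` when the number of arguments is wrong. The standard library's `argparse` module is a common alternative that checks the arguments and converts their types. The sketch below is one possible version of "script3.py", not the course's script; the function name `parse_args` is hypothetical, and the argument list is passed in explicitly so the block can run inside a notebook.

```python
import argparse


def parse_args(argv):
    """Parse a name and an age from a list of command-line arguments."""
    parser = argparse.ArgumentParser(description="Introduce a person.")
    parser.add_argument("name", help="the person's name")
    parser.add_argument("age", type=int, help="the person's age in years")
    return parser.parse_args(argv)


# in a real script this would be parse_args(sys.argv[1:])
args = parse_args(["jack", "18"])
message = "My name is %s. I'm %d years old." % (args.name, args.age)
print(message)
```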


In [None]:
!ls

## wordcounter.py

In [7]:
!python wordcounter.py wordcounter.py 5

        Word:Frequency
-----------------------
      result:      16
        self:      11
         raw:      10
      string:      10
 punctuation:      10


In [8]:
!more wordcounter.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-  
# Author: NYC data science <http://nycdatascience.com/>

### part 1
import string
from sys import argv
script, target, number = argv

### part 2
class myfile(file):
    def __init__(self, name, mode):
        file.__init__(self, name, mode)
        
    def __str__(self):
        return "Opening file %s" %self.name
    
    def wordCount(self, punctuation='\n', ignoreCase = True):
        '''
        punctuation: punctuations to remove
    
        returns: a dict contains each word and it's corresponding frequency
        '''
        ## read contents and convert to lower
        try:
            raw_string = self.read()
            if ignoreCase:
                raw_string = raw_string.lower()
        except:
            raise Exception("Can't read file %s"%self.name)
            
        ## reaplce all the punctuations with space
        for i in string.punctuation:
            raw_string = raw_string.replace(i, ' ')
        
        if

## String Processing and Handling
### Creating strings

Here are examples of strings.  Note the following:

* Use of single, double and triple quotation marks.
* Use of \n newline character
* Use of \ escape character
* Use of the `r` prefix, indicating "raw" characters

In [17]:
print "I'm Jack."
print 'Jack says "My name is Jack".'
print "Jack says \"Hi, I'm  Jack\""
print 'Jack says "Hi, I\'m  Jack"'
print '''Jack says "Hi, I'm  Jack"'''
print """Jack says 'Hi, I'm  Jack'"""
print '''Jack says:
    "Hi, I'm  Jack"'''
print 'Hi, I do not want a \new line.'
print r'Hi, I do not want a \n new line.'

I'm Jack.
Jack says "My name is Jack".
Jack says "Hi, I'm  Jack"
Jack says "Hi, I'm  Jack"
Jack says "Hi, I'm  Jack"
Jack says 'Hi, I'm  Jack'
Jack says:
    "Hi, I'm  Jack"
Hi, I do not want a 
ew line.
Hi, I do not want a \n new line.


## Basic String Manipulations

Case conversion is done as follows:

In [18]:
'ABcd'.lower()

'abcd'

In [19]:
'ABcd'.upper() # convert to upper case

'ABCD'

In [20]:
'ABcd'.swapcase() # swap case(lower -> upper, upper -> lower)

'abCD'

In [21]:
'acd acd'.title()

'Acd Acd'

String objects have the following object methods:

* split
* replace
* count
* join

In [24]:
'a,b,c,d'.split(',') # split by ','

['a', 'b', 'c', 'd']

In [23]:
'a b c d'.split() # with no argument, split on any whitespace

['a', 'b', 'c', 'd']

In [25]:
'a b c d'.replace(' ', '>') # replace ' ' with '>'

'a>b>c>d'

In [26]:
'a b c d'.count(' ') # count how many times ' ' appears

3

In [27]:
' '.join(['a', 'b', 'c'])

'a b c'

In [28]:
''.join(['a', 'b', 'c'])

'abc'

In [29]:
'/'.join(['a', 'b', 'c'])

'a/b/c'

In [30]:
str.join('X',['a','b','c','d'])  # this is equivalent to 'X'.join(['a','b','c','d'])

'aXbXcXd'

## Advanced String Manipulations: Regular Expressions
Basic and intermediate functions for working with strings have been covered.

To fully unleash the power of string manipulation, it's necessary to learn regular expressions.

### Concept
A regular expression is a special text string for describing a set of strings. This "special string" is formally called a **pattern**. Hence, a regular expression is a pattern that describes a set of strings.

The goal of using a regular expression is to find or extract specific pieces of text by describing their pattern.

### Pattern
For example, both **gray** and **grey** match the pattern **gr.y**, in which the dot . stands for any single character.

<code>
Identifiers:

\d = any digit
\D = anything but a digit
\s = any whitespace character
\S = anything but a whitespace character
\w = any word character (letter, digit or underscore)
\W = anything but a word character
. = any character, except for a new line
\b = boundary between a word character and a non-word character
\\. = period. must use backslash, because . normally means any character.

Modifiers:

{1,3} = match 1 to 3 repetitions of the preceding item
\+ = match 1 or more repetitions
? = match 0 or 1 repetitions
\* = match 0 or more repetitions
$ = matches at the end of the string
^ = matches at the start of the string
| = matches either/or. Example: x|y will match either x or y
[] = set of characters, any one of which may match
{x} = expect exactly x repetitions of the preceding item
{x,y} = expect x to y repetitions of the preceding item

White Space Characters:

\n = new line
\s = any whitespace character
\t = tab
\e = escape (supported by some regex engines, but not by Python's re module)
\f = form feed
\r = carriage return

Characters to REMEMBER TO ESCAPE IF USED!

. + * ? [ ] $ ^ ( ) { } | \

Brackets:

quant[ia]tative = will find either quantitative or quantatative
[a-z] = matches any lowercase letter a-z
[1-5a-qA-Z] = matches any digit 1-5, lowercase letter a-q or uppercase letter A-Z
</code>

## re
The library **re** is used to implement regular expressions in python.

In [10]:
import re  #regular expressions library

In [None]:
help(re.search)

In [11]:
raw_string = 'Hi, how are you today?'
print re.search('Hi', raw_string)

<_sre.SRE_Match object at 0x0000000002A25920>


In [33]:
print re.search('Hello', raw_string)

None


In [14]:
s = re.search('Hi', raw_string)
s != None

True

In [35]:
print s.start() # the starting position of the matched string
print s.end()  # the ending position of the matched string
print s.span()  # a tuple containing the (start, end) positions of the matched string

0
2
(0, 2)


In [None]:
print( s.group() )# the matched string
print( raw_string[s.start():s.end()] )# same

## Using Identifiers

In [36]:
#     . = any character, except for a new line

print (re.search('a.', 'aa') != None)
print (re.search('a.', 'ab') != None)
print (re.search('a.', 'a1') != None)
print (re.search('a.', 'a#') != None)
print (re.search('a.', 'a') != None)
print (re.search('a.', 'a\n') != None)

True
True
True
True
False
False


In [37]:
#      ? = match 0 or 1 repetitions.

print (re.search('ba?b', 'bb') != None )   # match
print (re.search('ba?b', 'bab') != None )  # match
print (re.search('ba?b', 'baab') != None ) # does not match
print (re.search('ba?b', 'abab') != None ) # match: 'bab' is found inside 'abab'

True
True
False
True


In [None]:
#     + = match 1 or more

print( re.search('ba+b', 'bb') != None  )  # does not match
print( re.search('ba+b', 'bab') != None   )# match
print( re.search('ba+b', 'baab') != None  )# match
print( re.search('ba+b', 'baaaab') != None  )# match
print( re.search('ba+b', 'baaaaaab') != None ) # match

In [None]:
#      * = match 0 or more repetitions

print (re.search('ba*b', 'bb') != None )   # match
print (re.search('ba*b', 'bab') != None )  # match
print (re.search('ba*b', 'baaaaaab') != None ) # match
print (re.search('ba*b', 'baaagb') != None )   # does not match: the g breaks the run of a's

In [None]:
print( re.search('ba{1,3}b', 'bab') != None    )# match
print( re.search('ba{1,3}b', 'baab') != None  ) # match
print( re.search('ba{1,3}b', 'baaab') != None)  # match
print( re.search('ba{1,3}b', 'bb') != None     )# does not match
print( re.search('ba{1,3}b', 'baaaab') != None) # does not match


In [15]:
#      ^ = matches start of a string

print( re.search('^a', 'abc') != None )   # match
print( re.search('^a', 'abcde') != None)  # match
print( re.search('^a', ' abcde') != None) # does not match

True
True
False


In [None]:
#      $ = matches at the end of string

print (re.search('a$', 'aba') != None )   # match
print (re.search('a$', 'abcba') != None)  # match
print (re.search('a$', ' aba ') != None ) # does not match

In [None]:
print (re.search('^a.a$', 'aba') != None )   # match
print (re.search('^a.a$', 'a a') != None  )  # match
print (re.search('^a.a$', 'a.a') != None   ) # match

print (re.search('^a.a$', 'bba') != None  )  # does not match
print (re.search('^a.a$', 'abba') != None  )       # does not match: . stands for exactly one character
print (re.search('^a.*a$', 'abba') != None  )      # match
print (re.search('^a.?a$', 'abba') != None  )      # does not match: .? allows at most one character
print (re.search('^a.{1,3}a$', 'abba') != None  )  # match: .{1,3} covers 'bb'

### brackets

[] is used for specifying a set of characters to match. For example, [123abc] will match any of the characters 1, 2, 3, a, b, or c; this is the same as [1-3a-c], which uses ranges to express the same set of characters. Furthermore, [a-z] matches all the lowercase letters, while [0-9] matches all the digits.

In [None]:
print (re.search('[123abc]', 'defg')  != None)   # does not match
print (re.search('[123abc]', '1defg') != None )  # match
print (re.search('[1-3a-c]', '2defg') != None  ) # match
print (re.search('[123abc]', 'adefg') != None )  # match
print (re.search('[1-3a-c]', 'bdefg') != None )  # match

### parentheses

Parentheses () work much as they do in mathematics: they group together the expressions contained inside them, and the contents of a group can be repeated with a repetition modifier.
For example, the pattern (abc){2,3} matches abc repeated 2 or 3 times.

In [None]:
print (re.search('(abc){2,3}', 'abc')  != None  )       # does not match
print (re.search('(abc){2,3}', 'abcabc')  != None)      # match
print (re.search('(abc){2,3}', 'abcabcabc')  != None )  # match

In [None]:
re.search('(abc){2,3}', 4*'abc')  != None

In [None]:
re.search('^(abc){2,3}$', 4*'abc')  != None
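Besides grouping, parentheses also capture the text they matched, which can be read back from the match object with `group` and `groups`. A short sketch; the sample strings are made up for illustration:

```python
import re

m = re.search(r'(\d+)\.(\d+)', 'added in Python 1.5')

whole = m.group(0)    # the entire match
major = m.group(1)    # text captured by the first pair of parentheses
minor = m.group(2)    # text captured by the second pair
both = m.groups()     # all captured groups as a tuple

# a repeated group is greedy: it takes as many repetitions as allowed
repeated = re.search(r'(abc){2,3}', 4 * 'abc').group(0)

print(whole, major, minor, both, repeated)
```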

In [None]:
print(re.search('[ab]', 'a') != None)   # match
print(re.search('[ab]', 'b') != None)   # match
print(re.search('[ab]', 'c') != None)  # does not match

### vertical bar
`|` is the alternation (or) operator. For example, `a|b` matches `a` or `b`, which is equivalent to `[ab]`.

`abc|123` matches `abc` or `123`, while `[abc123]` matches any single character among `a, b, c, 1, 2, 3`.

In [None]:
print (re.search('abc|123', 'a') != None)   # does not match
print (re.search('abc|123', '1') != None )  # does not match
print (re.search('abc|123', '123') != None) # match
print (re.search('abc|123', 'abc') != None) # match

### backslash
To match a literal ?, it is necessary to put a backslash in front of it: \?.

Otherwise, the character ? will be treated as a modifier: ? matches the preceding character (or group) either once or zero times.

In [None]:
print (re.search(r'\?', 'Hi, how are you today?') != None)

In [None]:
print (re.search('?', 'Hi, how are you today?') != None)  # this raises an error: ? is a modifier with nothing before it to repeat
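When the text to search for comes from elsewhere and may contain special characters, `re.escape` adds the necessary backslashes. The sketch below also shows that the unescaped pattern fails to compile; the variable names are arbitrary.

```python
import re

raw_string = 'Hi, how are you today?'

# re.escape puts a backslash in front of the special character ?
literal = re.escape('today?')
found = re.search(literal, raw_string) is not None

# a bare ? has nothing to repeat, so compiling it raises re.error
try:
    re.compile('?')
    compiled = True
except re.error:
    compiled = False

print(literal, found, compiled)
```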

## Useful functions
* **re.split(pattern, string)**: Split the string into a list by the pattern.
* **re.sub(pattern, replace, string)**: Replace the substrings in the string that match the pattern with the argument replace.
* **re.findall(pattern, string)**: Find all substrings where the pattern matches, and returns them as a list.

In the base library, strings already have methods such as `str.split` and `str.replace` that do similar work.

`str.split` is similar to `re.split`, and `str.replace` is similar to `re.sub`.

Because `re.split` and `re.sub` accept regular expressions as patterns, they are much more powerful!
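A small side-by-side comparison illustrates the difference. The sample string is made up for this sketch:

```python
import re

text = 'a,b;c  d'

# str.split handles one fixed separator at a time
plain = text.split(',')

# re.split can split on any run of commas, semicolons or spaces
regex = re.split(r'[,; ]+', text)

# re.sub can replace several different characters in one call
replaced = re.sub(r'[,;]', ' ', text)

print(plain)
print(regex)
print(replaced)
```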

## re.sub

In [None]:
s = '''The re module was added in Python 1.5, 
and provides Perl-style regular expression patterns. 
Earlier versions of Python came with the regex module, 
which provided Emacs-style patterns. 
The regex module was removed completely in Python 2.5.'''

In [None]:
s

#### Problem
Suppose the goal is to split this text into a list in which each element is a word. The separators are dot (.), dash (-), comma (,) and blank space ( ).

#### Solution
How to solve the problem in the base library?

- Since `str.split` cannot split a string by multiple separators, an alternative is to replace all the separators with blank spaces.
- Then split the resulting text by blank space.

This technique was used in the script wordcounter.py.

In [None]:
s2 = s
for i in [',', '.', '-', '\n']:
    s2 = s2.replace(i, ' ')
s2.split(' ')

Using regular expression, all the separators can be replaced at the same time.

In [None]:
s3 = s
s3 = re.sub('[\n,.-]', ' ', s3)
print (s3)
re.split(' +', s3) 
# a single-space split would leave empty strings in the result,
# so we split on one or more blank spaces

### re.split
A simpler way is to use a regular expression to split the text by multiple separators directly.

In [None]:
re.split(r'[\n ,.-]+', s)

### re.findall
Similar to **`re.split`**, **`re.findall`** also works well in this case.

Select letters in the string s by using **`re.findall`**.

In [None]:
re.findall('[a-zA-Z]+', s) # if you want number too, run re.findall('[a-zA-Z0-9]+', s)

### Special sequences in regular expressions

There are sequences that have special meaning in regular expressions.

* \d: Matches any decimal digit; this is equivalent to the class [0-9].
* \D: Matches any non-digit character; this is equivalent to the class [^0-9].
* \w: Matches any alphanumeric character; this is equivalent to the class [a-zA-Z0-9_].
* \W: Matches any non-alphanumeric character; this is equivalent to the class [^a-zA-Z0-9_].
* \s: Matches any whitespace character; this is equivalent to the class [ \t\n\r\f\v].
* \S: Matches any non-whitespace character; this is equivalent to the class [^ \t\n\r\f\v].
* \t: tab,
* \v: vertical tab.
* \r: Carriage return. Move to the leading end (left) of the current line.
* \n: Line Feed. Move to next line, staying in the same column. Prior to Unix, usually used only after CR or LF.
* \f: Form feed. Feed paper to a pre-established position on the form, usually top of the page.

Reference: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

The simplest way to solve the problem is as follows:

In [None]:
re.findall(r'\w+', s) # for ASCII text, same as re.findall('[a-zA-Z0-9_]+', s)

In [None]:
re.findall(r'\b[a-z]+', s)

### wordCount

Rewrite the function `wordCount`.

In [None]:
import re
def wordCount(x, number=False):
    '''
    x: string to count
    number: whether to count the numbers
    '''
    ## convert to lower case and find words
    x = x.lower()
    if number:
        word_list = re.findall(r'\w+', x)
    else:
        word_list = re.findall('[a-zA-Z]+', x)
    ## count and return
    result = {}
    for word in word_list:
        if word in result:
            result[word] += 1
        else:
            result[word] = 1
    return result 

In [None]:
wordCount(s)
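
The counting loop can also be delegated to the standard library. The sketch below is an alternative, not the lecture's `wordCount`: it swaps the dictionary loop for `collections.Counter`, and the name `word_count_counter` and the sample sentence are made up for illustration.

```python
import re
from collections import Counter

def word_count_counter(text, number=False):
    '''Hypothetical variant of wordCount that lets Counter do the tallying.'''
    pattern = r'\w+' if number else r'[a-zA-Z]+'
    # Counter is a dict subclass that maps each word to its number of occurrences.
    return Counter(re.findall(pattern, text.lower()))

counts = word_count_counter("The cat saw the other cat. 2 cats!")
print(counts)
print(counts.most_common(2))   # the two most frequent words
```

Because `Counter` is a dictionary subclass, the result can be indexed like the dictionary returned by `wordCount`.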

### Is it an e-mail address?

Write a function to test whether an e-mail address is valid.

Usually, e-mail addresses take one of the following forms, or a close variant:
```
somename9@gmail.com
some_name@yahoo.com
contact@supstat.com.cn
some.name@an-email.com
some.name@an.email.com
some_name@163.com
```
Think about how to test whether a string is a valid e-mail address.

Generally, an address can be split into three parts:

- user name (somename9, some_name)
- @
- domain name (gmail.com, yahoo.com, supstat.com.cn)

The goal is to write regular expressions to describe the pattern.

#### user name

Try to write a regular expression to match the following user names.
```
somename9
some_name
contact
some.name
some.name
some_name
```
The first part, the user name pattern, can be expressed as:

`^[a-z0-9]+[_\.]?[a-z0-9]+`

* `^[a-z0-9]+`: begins with letters or digits
* `[_\.]?`: may contain one underscore (_) or dot (.)
* `[a-z0-9]+`: followed by more letters or digits

In [None]:
users = ['a','@#$%','somename9', 'some_name', 'contact', 'some.name', 'some.name', 'some_name', 'some.name.name', 'somme']
for i in users:
    # re.search returns a match object, or None when the pattern is not found
    if re.search(r'^[a-z0-9]+[_\.]?[a-z0-9]+', i) is not None:
        print ("%s: Match!" % i)
    else:
        print ("%s: Does not match!" % i)
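
The user name pattern has a `^` anchor but no `$` anchor, so `re.search` only checks that the string *starts* with a valid user name. The sketch below compares it with a hypothetical stricter variant that appends `$`; the names `prefix_pattern`, `whole_pattern` and `results` are made up for this illustration.

```python
import re

prefix_pattern = r'^[a-z0-9]+[_\.]?[a-z0-9]+'   # the pattern from the text
whole_pattern = prefix_pattern + '$'            # hypothetical variant: must match the whole string

results = {}
for name in ['a', 'some.name', 'some.name.name']:
    prefix_ok = re.search(prefix_pattern, name) is not None
    whole_ok = re.search(whole_pattern, name) is not None
    results[name] = (prefix_ok, whole_ok)
    print('%-15s prefix match: %-5s whole match: %s' % (name, prefix_ok, whole_ok))
```

`a` fails both tests because the pattern needs at least two characters, while `some.name.name` passes only the prefix test, since its first part `some.name` is enough to satisfy the unanchored pattern.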

#### domain name

Try to write your regular expression to match the following domain names.
```
gmail.com
yahoo.com
supstat.com.cn
an-email.com
an.email.com
163.com
```

The domain name pattern can be defined as:

`[a-z0-9]+([-\.]?[a-z]+){1,3}$`

* `[a-z0-9]+`: starts with letters or digits
* `[-\.]?[a-z]+`: at most one - or . followed by one or more letters
* `([-\.]?[a-z]+){1,3}`: repeat this group one to three times. `supstat.com.cn` ends with `.com.cn`, which repeats the group twice (`.com`, `.cn`).

In [None]:
domain = ['gmail.com', 'yahoo.com', 'supstat.com.cn', 'an-email.com', 'an.email.com', 'om']
for i in domain:
    if re.search(r'[a-z0-9]+([-\.]?[a-z]+){1,3}$', i) is not None:
        print ("Match!")
    else:
        print ("Does not match!")
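
The list of test domains includes `om`, which is not a real domain name but is still accepted: the separator is optional and there is no `^` anchor, so any string that ends in two letters satisfies the pattern. The sketch below contrasts a pattern of that shape with a stricter one. The `strict` pattern is an assumption made for illustration, not the lecture's pattern; it requires at least one dot and anchors both ends.

```python
import re

loose = r'[a-z0-9]+([-\.]?[a-z]+){1,3}$'               # same shape as the pattern in the text
strict = r'^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z]+){1,3}$'    # hypothetical stricter variant

domains = ['gmail.com', 'supstat.com.cn', 'an-email.com', 'an.email.com', '163.com', 'om']

loose_hits = [d for d in domains if re.search(loose, d)]
strict_hits = [d for d in domains if re.search(strict, d)]

print(loose_hits)    # every entry, including 'om'
print(strict_hits)   # every entry except 'om'
```

The stricter form reads as: a label of letters or digits, optional hyphenated continuations, then one to three dot-separated parts made of letters.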

## Is it an email address?

In [None]:
import re
def isEmail(x):
    x = x.lower() # case insensitive
    emailPattern = r'^[a-z0-9]+[_\.]?[a-z0-9]+@[a-z0-9]+([-\.]?[a-z]+){1,3}$'
    result = re.search(emailPattern, x) is not None  # True when the whole address fits the pattern
    return result

In [None]:
emails = ['some&name9@gmail.com', 'some_name@yahoo.com', 'contact@supstat.com.cn',\
          'some.n@ame@an-email.com', 'some.name@an.email.com', 'some_name@163.com']
for i in emails:
    print ('%25s is a valid e-mail address: %s'%(i, isEmail(i)))

## Extract the domain names

After validating an e-mail address, we can easily extract its domain name.

The domain name is everything after the @.

In [None]:
domainPattern = '@(.*)'

In [None]:
re.findall(domainPattern, 'somename9@gmail.com')

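The same task can be solved with `re.sub`, which replaces every match of a pattern. As a warm-up, the next cell deletes every character that is not a digit: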

In [None]:
re.sub('[^0-9]', '', '1a2b3c')

To get the domain name the same way, replace the user name and the @ with an empty string.
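
A minimal sketch of that idea, using one of the example addresses from this section. The pattern `^.*@` and the variable names are assumptions chosen for illustration.

```python
import re

address = 'somename9@gmail.com'

# Delete everything from the start of the string up to and including the @.
domain_from_sub = re.sub(r'^.*@', '', address)

# For comparison: capture everything after the @ with findall.
domain_from_findall = re.findall(r'@(.*)', address)

print(domain_from_sub)       # a string
print(domain_from_findall)   # a list with one element
```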

Note that `re.findall` returns a list while `re.sub` returns a string. Sometimes a pattern matches more than one part of a string; in those cases `re.findall` returns all of the parts, which is why it is called "find all".

In this case, `[0-9]` matches a single digit, and `1a2b3c` contains three digits. `re.findall` returns a list in which each element is a single digit.

In [None]:
re.findall('[0-9]', '1a2b3c')
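
What `re.findall` puts in the list depends on the capturing groups in the pattern. The sketch below uses made-up addresses (`text` is an assumption, not lecture data) to show the three cases: no group, one group, and several groups.

```python
import re

text = 'alice@example.com, bob@test.org'   # hypothetical input

# No group: each element is the whole match.
whole = re.findall(r'\w+@[\w.]+', text)

# One group: each element is the text captured by that group.
domains = re.findall(r'@([\w.]+)', text)

# Two groups: each element is a tuple with one entry per group.
pairs = re.findall(r'(\w+)@([\w.]+)', text)

print(whole)
print(domains)
print(pairs)
```

The tuple form is what a pattern with several groups returns, which is why results from such patterns are often joined back into a single string.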

## Extract the domain names

In [None]:
import re
def domainName(x):
    '''
    Given an e-mail address, return its domain name (as a one-element list, because re.findall returns a list).
    '''
    domainPattern = '@(.*)'
    if isEmail(x):
        return re.findall(domainPattern, x)
    else:
        raise Exception("This e-mail address is invalid!")


In [None]:
emails = ['somename9@gmail.com', 'some_name@yahoo.com', 'contact@supstat.com.cn',\
          'some.name@an-email.com', 'some.name@an.email.com', 'some_name@163.com']
for i in emails:
    print ('The domain name of %s is: %s'%(i, domainName(i)))


In [None]:
domainName('notEmail_at.com')

The remaining content of this lecture uses these tools to develop a method for extracting data from the web, a process called scraping. It is left as an exercise for interested students, because it is beyond the time constraints of the class.

## Extract information from the web
The upcoming events from the Python User Group Calendar are listed at the address below. We are going to extract the time, location and title of each upcoming event.

https://www.python.org/events/python-user-group

In [None]:
# Run these commands in a terminal, not in Python. They build and install wget.

# Get files
curl -O http://ftp.gnu.org/gnu/wget/wget-1.13.4.tar.gz

# Extract it
tar -xzf wget-1.13.4.tar.gz

# Configure wget and install 
cd wget-1.13.4
./configure --with-ssl=openssl
make
sudo make install

In [None]:
!wget https://www.python.org/events/python-user-group/ -O data/python-event.html

In [None]:
!cat data/python-event.html

In [None]:
import re 
timePattern = r'<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
raw_time = '<time datetime="2015-01-20T17:00:00+00:00">20 Jan.<span class="say-no-more"> 2015</span> 5pm UTC – 7pm UTC</time>'
time = re.findall(timePattern, raw_time)
print (time)
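
Long patterns like this one are easier to read when written with the `re.VERBOSE` flag, which ignores whitespace in the pattern and allows comments. The sketch below is a reformatted copy of the time pattern, not a different technique. The literal spaces are escaped as `\ ` because verbose mode would otherwise drop them. The name `time_pattern` and the ASCII hyphen in `sample` are choices made for this illustration.

```python
import re

time_pattern = re.compile(r'''
    <time\ datetime="[\w:+-]+">     # opening tag; the attribute holds a machine-readable timestamp
    (.+)                            # group 1: day and month, e.g. "20 Jan."
    <span\ class="say-no-more">     # span that wraps the year
    ([\d ]+)                        # group 2: the year, including its leading space
    </span>
    (.*)                            # group 3: the time range
    </time>
''', re.VERBOSE)

sample = ('<time datetime="2015-01-20T17:00:00+00:00">20 Jan.'
          '<span class="say-no-more"> 2015</span> 5pm UTC - 7pm UTC</time>')

parts = time_pattern.findall(sample)   # one tuple of three groups per match
joined = ''.join(parts[0])             # glue the groups back into one string
print(parts)
print(joined)
```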

In [None]:
raw_location = '<span class="event-location">Bürgerhaus im Stadtteilzentrum Bilk, Raum 1, 2. OG, Bachstr. 145, 40217 Düsseldorf, Germany</span>'
locationPattern = '<span class="event-location">(.*)</span>'
location = re.findall(locationPattern, raw_location)
print (location)

In [None]:
print (location[0])

In [None]:
raw_title = '<h3 class="event-title"><a href="/events/python-user-group/264/">Cape Town Python User Group (CTPUG) Meeting</a></h3>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
title = re.findall(titlePattern, raw_title)
print (title)

In [None]:
with open('data/python-event.html') as f:   # the with block closes the file automatically
    event = f.read()

In [None]:
timePattern = r'<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
time = re.findall(timePattern, event)
for i in time:
    print (''.join(i))

In [None]:
locationPattern = '<span class="event-location">(.*)</span>'
location = re.findall(locationPattern, event)
for i in location:
    print (i)

In [None]:
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'
title = re.findall(titlePattern, event)
for i in title:
    print (i)
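
These patterns rely on each tag sitting on its own line, because `.` does not cross newlines and `.*` is greedy: it takes as much text as it can. If two tags ever share a line, a greedy group swallows everything between the first opening tag and the last closing tag. The sketch below shows the difference with the lazy form `.*?` on a made-up one-line snippet (`html` is an assumption for illustration).

```python
import re

# Hypothetical snippet with two locations on a single line.
html = ('<span class="event-location">Berlin</span>'
        '<span class="event-location">Paris</span>')

greedy = re.findall(r'<span class="event-location">(.*)</span>', html)
lazy = re.findall(r'<span class="event-location">(.*?)</span>', html)

print(greedy)   # one long match that spans both tags
print(lazy)     # one match per tag
```

For anything beyond a quick exercise, an HTML parser is usually a more robust choice than regular expressions.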

In [None]:
import datetime
class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall(r'\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self):
        return self.status() + ' Event: %s' %self.title

In [None]:
event1 = event('Python Meeting Düsseldorf', '20 Jan. 2015 5pm UTC – 7pm UTC', \
          'Bürgerhaus im Stadtteilzentrum Bilk, Raum 1, 2. OG, Bachstr. 145, 40217 Düsseldorf, Germany')
print (event1.day())
print (event1)

In [None]:
raw_time = '20 Jan. 2015 5pm UTC – 7pm UTC'
day = re.findall(r'\w+', raw_time)
print (day)

In [None]:
day = ' '.join(day[:3])
day

In [None]:
import datetime
day= datetime.datetime.strptime(day, "%d %b %Y")
print( day)
print (type(day))

In [None]:
print (datetime.datetime.now())

In [None]:
import requests
text = requests.get('https://www.python.org/events/python-user-group/').text
text


In [None]:
import requests
import datetime
import re

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = r'<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print (i)
    print ('    Time:  %s' %i.time)
    print ('    Location: %s' %i.location)

In [None]:
#!/usr/bin/env python
# -*- coding: utf-8 -*-  
# Author: NYC data science <http://nycdatascience.com/>

import requests
import datetime
import re

class event(object):
    def __init__(self, title, time, location):
        self.title = title
        self.time = time
        self.location = location
    
    def day(self):
        try:
            day = re.findall(r'\w+', self.time)[:3]
            day = ' '.join(day)
            try: 
                return datetime.datetime.strptime(day, "%d %b %Y")
            except ValueError:
                return datetime.datetime.strptime(day, "%d %B %Y")
        except ValueError:
            return self.time
    
    def status(self):
        if isinstance(self.day(), datetime.datetime):
            now = datetime.datetime.now()
            if now < self.day():
                return 'Upcoming'
            elif now - self.day() < datetime.timedelta(days=1):
                return 'Today'
            else:
                return 'Missed'
        else:
            return 'Unknown'
        
    def __str__(self):
        return self.status() + ' Event: %s' %self.title

    

text = requests.get('https://www.python.org/events/python-user-group/').text
timePattern = r'<time datetime="[\w:+-]+">(.+)<span class="say-no-more">([\d ]+)</span>(.*)</time>'
locationPattern = '<span class="event-location">(.*)</span>'
titlePattern = '<h3 class="event-title"><a href=".+">(.*)</a></h3>'

time = re.findall(timePattern, text)
time = [''.join(i) for i in time]
location = re.findall(locationPattern, text)
title = re.findall(titlePattern, text)

events = [event(title[i], time[i], location[i]) for i in range(len(title))]

for i in events:
    print (30*'-')
    print (i)
    print ('    Time    :  %s' %i.time)
    print ('    Location: %s' %i.location)


In [None]:
!python script/event.py


<code>
Identifiers:

\d = any digit
\D = anything but a digit
\s = any whitespace character (space, tab, newline, ...)
\S = anything but whitespace
\w = any word character (letter, digit or underscore)
\W = anything but a word character
. = any character, except for a new line
\b = word boundary (the position at the edge of a whole word)
\. = period. must use backslash, because . normally means any character.

Modifiers:

{1,3} = between 1 and 3 repetitions of the preceding item (for example, 1-3 digits)
\+ = match 1 or more
? = match 0 or 1 repetitions.
\* = match 0 or MORE repetitions
$ = matches at the end of string
^ = matches start of a string
| = matches either/or. Example x|y = will match either x or y
[] = character set: matches any one of the characters listed inside
{x} = expect to see exactly this amount of the preceding item
{x,y} = expect to see between x and y of the preceding item

White Space Characters:

\n = new line
\s = any whitespace
\t = tab
\e = escape (not supported by Python's re module; use \x1b instead)
\f = form feed
\r = carriage return

Characters to REMEMBER TO ESCAPE IF USED!

. + * ? [ ] $ ^ ( ) { } | \

Brackets:

quant[ia]tative = will find either quantitative or quantatative
[a-z] = matches any one lowercase letter a-z
[1-5a-qA-Z] = matches any one digit 1-5, lowercase letter a-q or uppercase letter A-Z
</code>
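
The modifiers in the cheat sheet differ mainly in how many repetitions they allow. The sketch below applies four of them to a short made-up string (`sample` is an assumption for illustration). Note that `*` and `?` are allowed to match nothing at all, so `re.findall` also returns empty strings for them.

```python
import re

sample = 'ab 12'   # hypothetical input

plus_matches = re.findall(r'[a-z]+', sample)       # one or more letters
star_matches = re.findall(r'[a-z]*', sample)       # zero or more: includes empty matches
optional_matches = re.findall(r'[a-z]?', sample)   # zero or one letter at a time
counted_matches = re.findall(r'\d{1,3}', sample)   # between one and three digits

print(plus_matches)
print(star_matches)
print(optional_matches)
print(counted_matches)
```

In practice `+` is usually the right choice for extracting words, precisely because it never produces empty matches.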

The code:

So, we have the string we intend to search. The ages in it are integers two or three digits long. We could also expect single-digit ages, for anyone under 10 years old. We probably won't be seeing any four-digit ages, unless we're talking about biblical times or something.


In [None]:
import re

exampleString = ''' 
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102. 
'''

Now we define the regular expressions, using `re.findall` to find every occurrence of the pattern given as the first argument within the string given as the second argument.


In [None]:
ages = re.findall(r'\d{1,3}',exampleString)
names = re.findall(r'[A-Z][a-z]*',exampleString)

print(ages)
print(names)
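
One caveat about `\d{1,3}`: it limits the length of each match, not the length of the number in the text. A four-digit number is simply cut into pieces. The sketch below, on a made-up sentence (`text` is an assumption), shows this and one possible remedy using the word boundary `\b`.

```python
import re

text = 'Oscar is 102, born in 1913.'   # hypothetical input

# Without boundaries, 1913 is split into a three-digit and a one-digit match.
loose = re.findall(r'\d{1,3}', text)

# With \b on both sides, only numbers that are 1-3 digits long as a whole are kept.
bounded = re.findall(r'\b\d{1,3}\b', text)

print(loose)
print(bounded)
```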

In [None]:
words = re.findall(r'[a-z]',exampleString)
print(words)

In [None]:
words = re.findall(r'[a-z]*',exampleString)
print(words)

In [None]:
words = re.findall(r'[a-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'[a-z]?',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]+[a-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z][a-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]?[a-z]+',exampleString)
print(words)

In [None]:
import re

exampleString = ''' AAA I think.
Jessica is 15 years old, and Daniel is 27 years old.
Edward is 97 years old, and his grandfather, Oscar, is 102. 
'''

In [None]:
words = re.findall(r'[A-Z]?[a-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]{0,4}[a-z]*',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Za-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'\w',exampleString)
print(words)

In [None]:
words = re.findall(r'\w+',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]{3,3}',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]{1,3}',exampleString)
print(words)

In [None]:
words = re.findall(r'[A-Z]{1,3}[a-z]*',exampleString)
print(words)

In [None]:
words = re.findall(r'^[a-z]',exampleString)
print(words)

In [None]:
words = re.split(r'\n',exampleString)
print(words)

In [None]:
words = re.split(r'\.',exampleString)
print(words)

In [None]:
words = re.findall(r'\b[a-z]+',exampleString)
print(words)

In [None]:
words = re.findall(r'\w',exampleString)
print(words)

In [None]:
words = re.findall(r'\b\w',exampleString)
print(words)

In [None]:
words = re.findall(r'\b\w+',exampleString)
print(words)

In [None]:
exampleString2='ksjhfgjkshdfgjhksd <code> asFLKJASKLJFHKASJD</code> shjfjhsdGFJHKSd'

words = re.findall(r'<code>.*</code>',exampleString2)
print(words)

In [None]:
words = re.findall(r'</code>',exampleString2)
print(words)

In [None]:
words = re.findall(r'(code>)',exampleString2)
print(words)

In [None]:
words = re.findall(r'(\w*a)',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'(\w*d\b)',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'\b[Oo]\w*',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'\b[Aa]\w*',exampleString,re.M)
print(words)

In [None]:

words = re.findall(r'^J\w+',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'\w*[.,]',exampleString,re.M)
print(words)

In [None]:
exampleString

In [None]:
words = re.findall(r'^[J]\w+',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'^[JE]\w+',exampleString,re.M)
print(words)

In [None]:
words = re.findall(r'^[JD]\w+',exampleString,re.M)

print(words)
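
A common slip is to write `[J|D]` when "J or D" is meant. Inside square brackets `|` is an ordinary character, so that set matches `J`, `D` and a literal `|`. The sketch below uses a made-up string (`text` is an assumption) to compare the set with and without the pipe, and shows a non-capturing group `(?:...)` for alternatives that are longer than one character.

```python
import re

text = 'J|D Jessica Daniel'   # hypothetical input that contains a literal pipe

with_pipe = re.findall(r'[J|D]', text)      # the pipe is matched as a character
without_pipe = re.findall(r'[JD]', text)    # only J and D

# Alternation between longer alternatives needs a group, not a character set.
alternation = re.findall(r'\b(?:Jess|Dan)\w*', text)

print(with_pipe)
print(without_pipe)
print(alternation)
```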

<code>
Pattern	Description
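^	Matches beginning of string; with the re.M flag, also matches at the beginning of each line.
$	Matches end of string (or just before a trailing newline); with the re.M flag, also matches at the end of each line.
.	Matches any single character except newline. Using the re.S (DOTALL) flag allows it to match newline as well.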
[...]	Matches any single character in brackets.
[^...]	Matches any single character not in brackets
re*	Matches 0 or more occurrences of preceding expression.
re+	Matches 1 or more occurrence of preceding expression.
re?	Matches 0 or 1 occurrence of preceding expression.
re{ n}	Matches exactly n number of occurrences of preceding expression.
re{ n,}	Matches n or more occurrences of preceding expression.
re{ n, m}	Matches at least n and at most m occurrences of preceding expression.
a| b	Matches either a or b.
(re)	Groups regular expressions and remembers matched text.
(?imx)	Temporarily toggles on i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?-imx)	Temporarily toggles off i, m, or x options within a regular expression. If in parentheses, only that area is affected.
(?: re)	Groups regular expressions without remembering matched text.
(?imx: re)	Temporarily toggles on i, m, or x options within parentheses.
(?-imx: re)	Temporarily toggles off i, m, or x options within parentheses.
(?#...)	Comment.
(?= re)	Specifies position using a pattern. Doesn't have a range.
(?! re)	Specifies position using pattern negation. Doesn't have a range.
(?> re)	Matches independent pattern without backtracking (atomic group; available in Python's re module from version 3.11).
\w	Matches word characters.
\W	Matches nonword characters.
\s	Matches whitespace. Equivalent to [ \t\n\r\f\v].
\S	Matches nonwhitespace.
\d	Matches digits. Equivalent to [0-9].
\D	Matches nondigits.
\A	Matches beginning of string.
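\Z	Matches end of string. In Perl and Ruby it also matches just before a final newline; in Python's re module it matches only at the very end.
\z	Matches end of string (Perl and Ruby syntax; in Python use \Z).
\G	Matches point where last match finished (Perl and Ruby; not supported by Python's re module).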
\b	Matches word boundaries when outside brackets. Matches backspace (0x08) when inside brackets.
\B	Matches nonword boundaries.
\n, \t, etc.	Matches newlines, carriage returns, tabs, etc.
\1...\9	Matches nth grouped subexpression.
\10	Matches nth grouped subexpression if it matched already. Otherwise refers to the octal representation of a character code.
</code>


In [None]:
words = re.findall(r'\b[JD]\w+',exampleString,re.M)
print(words)
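
The table above lists several grouping constructs that the cells in this section do not exercise. The sketch below tries three of them on the first sentence of the example string: a capturing group `(re)`, a non-capturing group `(?: re)`, and a lookahead `(?= re)`. The variable names are made up for this illustration.

```python
import re

text = 'Jessica is 15 years old, and Daniel is 27 years old.'

# Capturing groups: findall returns one tuple of captured texts per match.
captured = re.findall(r'(\w+) is (\d+)', text)

# Non-capturing group: groups the pattern but findall returns the whole match.
uncaptured = re.findall(r'(?:\w+) is \d+', text)

# Lookahead: " is" must follow the word but is not part of the match.
before_is = re.findall(r'\w+(?= is)', text)

print(captured)
print(uncaptured)
print(before_is)
```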

In [None]:
exampleString3='''hi Grok
date hey
hello day
ho ho ho holly'''

In [None]:
words = re.findall(r'^.*',exampleString3,re.M)
print(words)

In [None]:
words = re.findall(r'^.*',exampleString3)
print(words)

In [None]:
words = re.findall(r'y$',exampleString3,re.M)
print(words)

In [None]:
words = re.findall(r'\w*y$',exampleString3,re.M)
print(words)

In [None]:
words = re.findall(r'\bh.*',exampleString3,re.M)
print(words)

In [None]:
words = re.findall(r'\w+',exampleString3,re.M)
print(words)

In [None]:
words = re.findall(r'\w*[aeiou][y]\b',exampleString3,re.M)
print(words)