In [1]:
%autosave 0

Autosave disabled


# Accessing functions in a different .py file

Writing modular functions is a great way to increase code reuse.


In [2]:
# The function myfunc1 is in file myfunc.py
#
# In general, the function name is independent of/different from the file name

# Use triple quotes to create a multi-line string

x = '''
# file myfunc.py

def times10(x):
    y = 10*x
    return y
    
def times100(x):
    y = 100*x
    return y
'''
f = open('myfunc.py', 'w')
f.write(x)
f.close()

!cat myfunc.py


# file myfunc.py

def times10(x):
    y = 10*x
    return y
    
def times100(x):
    y = 100*x
    return y


There are different ways to import functions from another file. In Python, we call the importable item a "module."

In [3]:
# import the entire module `myfunc`; any of its functions can be used.
# All of the functions live in the namespace "myfunc".

import myfunc

y = myfunc.times10(20)
print(y)

200


In [4]:
# import only function times100 from myfunc.py
# Note the imported function now exists in the current namespace

from myfunc import times100

y = times100(99)
print(y)

# we have no access to times10() because it is not imported. 
# Calling times10() results in an error.
# times10(123)

9900


In [5]:
# import all functions in myfunc.py

from myfunc import *

y1 = times10(3.14159)
y2 = times100(1.27)
print('y1 = {}, y2 = {}'.format(y1,y2))

y1 = 31.4159, y2 = 127.0


In [6]:
! rm -f myfunc.py  # clean up

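## A word about re-importing a module

After a module has been imported, later changes to its source file are not picked up by another `import` statement; the module has to be reloaded explicitly. To reload, use:
```
from importlib import reload
reload(myfunc)
```
instead of simply `import`. (Older code uses `from imp import reload`; the `imp` module is deprecated and was removed in Python 3.12.)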

In [7]:
s = '''
def func1():
    print('original func')
'''
with open('myfunc2.py','w') as f:
    f.write(s)
    
!cat myfunc2.py


def func1():
    print('original func')


In [8]:
import myfunc2
myfunc2.func1()

original func


In [9]:
# make changes to myfunc2

s = '''
def func1():
    print('new func')
'''
with open('myfunc2.py','w') as f:
    f.write(s)
    
!cat myfunc2.py


def func1():
    print('new func')


In [10]:
# import does not load the new version - still use the original function!
import myfunc2
myfunc2.func1()

original func


In [11]:
# must use reload from module "importlib" (the older "imp" module is deprecated)
from importlib import reload
reload(myfunc2)
myfunc2.func1()       # now we have loaded the new version

new func


In [12]:
! rm -f myfunc2.py   # clean up

# Python package

A Python package contains one or more modules, organized in a directory tree.

```
mymodule/__init__.py
         moduleA.py
         moduleB.py
         /dir1/__init__.py
               moduleX.py
               moduleY.py
         /dir2/__init__.py
               moduleP.py
               moduleQ.py
```

Because of the `__init__.py` files, Python treats the entire directory tree under `mymodule` as one hierarchical package. There are different ways to access the functions:

Import the whole package (the submodules are reachable this way only if `mymodule/__init__.py` imports them, or after an explicit `import mymodule.moduleA`):

```
import mymodule
mymodule.moduleA.func1()
mymodule.moduleB.func2()
```

Import a sub-module:
```
from mymodule import moduleB
moduleB.func1()
```

Import just a single function from a submodule:
```
from mymodule.moduleA import func1
func1()

```
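The import styles above can be tried out with a throwaway package. This is a minimal sketch: the package name `demo_pkg`, its single submodule, and the temporary directory are all hypothetical stand-ins for the `mymodule` layout. It also shows that a plain `import` of a package with an empty `__init__.py` does not load the submodules by itself.

```python
import importlib
import os
import sys
import tempfile

# Build a throwaway package on disk; all names here are hypothetical.
root = tempfile.mkdtemp()
pkg_dir = os.path.join(root, 'demo_pkg')
os.makedirs(pkg_dir)

with open(os.path.join(pkg_dir, '__init__.py'), 'w') as f:
    f.write('')                      # empty: submodules are not auto-imported
with open(os.path.join(pkg_dir, 'moduleA.py'), 'w') as f:
    f.write('def func1():\n    return "func1 from moduleA"\n')

sys.path.insert(0, root)             # make the package importable
importlib.invalidate_caches()

import demo_pkg
loaded_before = hasattr(demo_pkg, 'moduleA')   # False: nothing imported it yet

import demo_pkg.moduleA                        # explicitly import the submodule
result1 = demo_pkg.moduleA.func1()

from demo_pkg.moduleA import func1             # import a single function
result2 = func1()

print(loaded_before, result1, result2)
```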

## Name collision -- two functions of the same name

Suppose we have two .py files. Each file contains a function named `magic_func()`:

In [13]:
s1 = '''
def magic_func():
    print('magic_func from file file1.py')
'''
s2 = '''
def magic_func():
    print('magic_func from file file2.py')
'''

with open('file1.py', 'w') as f: f.write(s1)
with open('file2.py', 'w') as f: f.write(s2)
    
! cat file1.py
! cat file2.py
! sleep 1


def magic_func():
    print('magic_func from file file1.py')

def magic_func():
    print('magic_func from file file2.py')


When we use the `import *` approach, both `magic_func()` functions are loaded into the same namespace. The one loaded later silently replaces the earlier one, which can cause confusion or bugs in the code. Avoid this!

In [14]:
# name collision

from file1 import *
from file2 import *
magic_func()

magic_func from file file2.py


Because `file2` was imported after `file1`, the name `magic_func` now refers to the version in `file2`; the version from `file1` is no longer reachable by that name.
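One way to keep both functions reachable is to avoid `import *`. The sketch below recreates two small modules in a temporary directory; the module names `collide1` and `collide2` are hypothetical and chosen so they do not clash with the files used above. It shows two options: calling each function through its module name, and renaming on import with `as`.

```python
import importlib
import os
import sys
import tempfile

# Recreate two small modules in a temporary directory (hypothetical names).
tmp = tempfile.mkdtemp()
for name in ('collide1', 'collide2'):
    with open(os.path.join(tmp, name + '.py'), 'w') as f:
        f.write("def magic_func():\n    return 'magic_func from {}'\n".format(name))

sys.path.insert(0, tmp)
importlib.invalidate_caches()

# Option 1: keep each function behind its module name.
import collide1
import collide2
a = collide1.magic_func()
b = collide2.magic_func()

# Option 2: give the imported functions distinct local names.
from collide1 import magic_func as magic1
from collide2 import magic_func as magic2

print(a, '|', b)
print(magic1(), '|', magic2())
```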

In [15]:
! rm -f file1.py file2.py   # clean up

# Regular expression

Search and find particular text patterns in a string.

In [16]:
# example

data = '''
From MAILER-DAEMON Fri Jul  8 12:08:34 2011
From: Author <author1@example1.com>
To: Recipient <recipient@example.com>
Subject: Sample message 1

This is the body.
>From (should be escaped).
There are 3 lines.

From MAILER-DAEMON Fri Jul  8 12:08:34 2011
From: Author <author2@example2.com>
To: Recipient <recipient@example.com>
Subject: Sample message 2

This is the second body.
'''

In [17]:
import re
import io
f = io.StringIO(data) # convert string to file object
for line in f.readlines():
    line = line.strip()
    if re.search('From:', line):
        print(line)

From: Author <author1@example1.com>
From: Author <author2@example2.com>


`re.search` returns a match object if the pattern is found; otherwise it returns `None`.

In [18]:
x = re.search('From:', 'From: Author <author2@example2.com>')
x

<_sre.SRE_Match object; span=(0, 5), match='From:'>

## A word about non-Boolean expression in control flow statements

In the context of Boolean operations, and also when expressions are used by control flow statements, the following values are interpreted as false: False, None, numeric zero of all types, and empty strings and containers (including strings, tuples, lists, dictionaries, sets and frozensets). All other values are interpreted as true.

Reference: https://docs.python.org/3/reference/expressions.html#boolean-operations

In [19]:
# example - empty dict is evaluated to "False"
d = {}
if d:
    print("empty dict is true")
else:
    print("empty dict is false")

empty dict is false


In [20]:
# example - empty string is evaluated to "False"
d = ""
if d:
    print("empty str is true")
else:
    print("empty str is false")

empty str is false


If a match is found in `re.search()`, a match object is returned, and a match object is interpreted as true, so the `if re.search(...)` condition succeeds. Otherwise, when there is no match, `None` is returned and the condition is interpreted as false.
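A small sketch of this behavior, using two lines taken from the sample mailbox data. The variable names are only illustrative.

```python
import re

lines = ['From: Author <author1@example1.com>', 'Subject: Sample message 1']

for line in lines:
    m = re.search('From:', line)
    if m:                       # a match object is truthy
        print('match at', m.span(), 'in', repr(line))
    else:                       # None is falsy
        print('no match in', repr(line))

# bool() makes the interpretation explicit
results = [bool(re.search('From:', line)) for line in lines]
print(results)
```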

## Character matching 

Some common examples (see regular expression documentation for more):

```
  Syntax | Explanation
---------|----------------------------------
 ^       | match the beginning of the line
 $       | match the end of the line
 .       | match any character (except a newline, by default)
 \s      | match any whitespace character
 \S      | match any non-whitespace character
 *       | match zero or more of the preceding character
 *?      | match zero or more of the preceding character in "non-greedy mode"
 +       | match one or more of the preceding character
 +?      | match one or more of the preceding character in "non-greedy mode"
 [a-z]   | match a single character in the range
 [^a-z]  | match a single character not in the range
 [aei]   | match a single character from those listed
 ?       | match zero or one of the preceding character
 {m}     | match exactly m repetitions of the preceding character
```

To use the Python regular expression module, use:
```
    import re
```

One of the commonly used functions is `re.search`; it returns a match object if the pattern is found, otherwise it returns `None`.
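As a quick illustration of three table entries (`$`, `?` and `{m}`), here is a minimal sketch. The sample strings are made up for this example, and raw strings (`r'...'`) are used so that backslashes and other special characters reach the `re` module unchanged.

```python
import re

samples = ['color', 'colour', 'colouur', '90095', '900951', 'hello world']

# ? : the 'u' is optional (0 or 1 repetition)
optional_u = [s for s in samples if re.search(r'^colou?r$', s)]

# {m} with ^ and $ : exactly five digits and nothing else on the line
five_digits = [s for s in samples if re.search(r'^[0-9]{5}$', s)]

# $ : the line must end with 'world'
ends_world = [s for s in samples if re.search(r'world$', s)]

print(optional_u, five_digits, ends_world)
```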

In [21]:
import re

txt = ['from: author@example.com',
       'From: author@example.com',
       '  from: author@example.com',
       '  From: author@example.com',
       'from  author@example.com',
       'frommm xxx yyy',
       '90095-1557',
       '90025-1234',
       '91125-1234',
       '91225-1234',
       'zip code 90095-1557',
       'zip code 90095 in Los Angeles',
       'hello',
       'hallo',
       'hxllo',
       'ab ab',
       'abc abc',
       'abccx abccx',
       'abcccx abcccx'
]

In [22]:
for line in txt:
    if re.search('F..m:', line):
        print(line)

From: author@example.com
  From: author@example.com


In [23]:
for line in txt:
    if re.search('^F..m:', line):
        print(line)

From: author@example.com


In [24]:
for line in txt:
    if re.search('^f..m:', line):
        print(line)

from: author@example.com


In [25]:
for line in txt:
    if re.search('f..m', line):
        print(line)

from: author@example.com
  from: author@example.com
from  author@example.com
frommm xxx yyy


In [26]:
for line in txt:
    if re.search(r'f..m\s', line):   # raw string keeps the backslash intact
        print(line)

from  author@example.com


In [27]:
for line in txt:
    if re.search(r'f..m\S', line):   # raw string keeps the backslash intact
        print(line)

from: author@example.com
  from: author@example.com
frommm xxx yyy


In [28]:
for line in txt:
    if re.search('900[93]5-', line):
        print(line)

90095-1557
zip code 90095-1557


In [29]:
for line in txt:
    if re.search('^90{2}[0-9]5', line):
        print(line)

90095-1557
90025-1234


In [30]:
for line in txt:
    if re.search('h[ea]llo', line):
        print(line)

hello
hallo


In [31]:
for line in txt:
    if re.search('^abc*', line):
        print(line)

ab ab
abc abc
abccx abccx
abcccx abcccx


In [32]:
for line in txt:
    if re.search('^abc+', line):
        print(line)

abc abc
abccx abccx
abcccx abcccx


In [33]:
# example of wild-card matching: one or more characters in between

s = 'From: stephen.smith@foo.bar.com'

if re.search('^From:.+@', s):
    print("match!", s)

match! From: stephen.smith@foo.bar.com


In [34]:
s = 'abc'
re.search('ab*', s)   # 'a' followed by zero or more 'b'; matches 'ab' here

<_sre.SRE_Match object; span=(0, 2), match='ab'>

## Extracting data from texts using `re`


`re.findall`: returns all found patterns in a list

In [35]:
# the pattern of an email address

import re
s = 'Hello from <author1@foo1.com> to reader2@bar.com about the meeting @4pm'
re.findall(r'\S+@\S+', s)

['<author1@foo1.com>', 'reader2@bar.com']

Q: Why is @4pm not matched in the example above?

In [36]:
s = 'ab'
re.findall('a*', s)

['a', '', '']

## greedy vs non-greedy mode

In [37]:
# greedy matching
s = '<h1>This is a title</h1>'
re.findall('<.*>', s)

['<h1>This is a title</h1>']

In [38]:
# non-greedy matching
re.findall('<.*?>', s)

['<h1>', '</h1>']
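A common alternative to the non-greedy form is a negated character class, which cannot run past the closing `>` at all. The sketch below compares the three patterns on the same string; it is only an illustration, not a recommendation for parsing real HTML.

```python
import re

s = '<h1>This is a title</h1>'

greedy = re.findall(r'<.*>', s)        # runs to the last '>'
lazy = re.findall(r'<.*?>', s)         # stops at the first '>'
negated = re.findall(r'<[^>]*>', s)    # cannot cross a '>' at all

print(greedy)
print(lazy)
print(negated)
```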

## Searching and extracting

Use parentheses `(` `)` to capture only a portion of the matched substring

In [39]:
s = '<h1>This is a title</h1>'

In [40]:
re.findall('<h1>(.*?)</h1>', s)

['This is a title']

In [41]:
s = '''
X-SPAM-SCORE: 0.7654
X-SPAM-PROBABILITY: 0.05
'''

In [42]:
import re, io
f = io.StringIO(s)
for line in f.readlines():
    line = line.strip()
    x = re.findall(r'^X\S*: ([0-9.]+)', line)
    if len(x) > 0:
        print(x)

['0.7654']
['0.05']


In [43]:
# extracting the number from a URL

s = 'http://foo.bar.com/view/?view=rev&rev=12345'

re.findall('^http:.*rev=([0-9]+)', s)

['12345']

In [44]:
# capture the hour

s = 'From mr.anderson@foo.bar.com Tue Apr 25 14:30:25 2017'

re.findall('^From .* ([0-9][0-9]):', s)

['14']

In [45]:
re.findall('^From .* ([0-9]{2}):', s)   # {2}: repeat 2 times

['14']
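When a pattern has several groups, `re.findall` returns one tuple per match, and `re.search` exposes the groups through the match object's `group()` method. A minimal sketch on the same string; the variable names are illustrative.

```python
import re

s = 'From mr.anderson@foo.bar.com Tue Apr 25 14:30:25 2017'

# Three groups: findall returns one tuple per match.
parts = re.findall(r'([0-9]{2}):([0-9]{2}):([0-9]{2})', s)

# re.search gives a match object; group(n) returns the n-th group.
m = re.search(r'^From (\S+)@(\S+)', s)
user, host = m.group(1), m.group(2)

print(parts)
print(user, host)
```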

# Database

- A database is a file organized for storing data
- It supports queries (reads)
- It supports updates (writes)
- It follows a standard format
- The data have types

Examples: MySQL, PostgreSQL, SQLite, etc.

The `sqlite3` module is part of the Python standard library, so we will use SQLite for the examples. Using other database systems from Python is similar.

To use SQLite in Python,
```
    import sqlite3
```

## SQL basics

This is not a formal introduction to SQL; you should take a database class or read a book if you want to know more about it.

Some working knowledge:

- database table
- rows and columns
- each column has a data type
- relational databases

Basic SQL commands:
- select: query
- insert: add a row in a table
- create: create a new table
- drop: delete a table

## SQL basics

Commonly used SQL commands:

`CREATE TABLE table_name (column_name1 type1, column_name2 type2, ...)`

`INSERT INTO table_name (column_name1, column_name2, ...) VALUES (v1,v2,...)`

`SELECT * FROM table_name WHERE <condition>`

`UPDATE table_name SET column_name=XXX WHERE <condition>`

`DELETE FROM table_name WHERE <condition>`
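The five commands can be tried together on an in-memory SQLite database, so nothing is written to disk. This is a minimal sketch; the table name `demo` and its rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')     # in-memory database, nothing written to disk
cur = conn.cursor()

cur.execute('CREATE TABLE demo (name TEXT, score INTEGER)')
cur.execute('INSERT INTO demo (name, score) VALUES (?, ?)', ('Ann', 50))
cur.execute('INSERT INTO demo (name, score) VALUES (?, ?)', ('Bob', 80))

cur.execute('UPDATE demo SET score = ? WHERE name = ?', (55, 'Ann'))
cur.execute('DELETE FROM demo WHERE name = ?', ('Bob',))

cur.execute('SELECT * FROM demo WHERE score > ?', (0,))
rows = cur.fetchall()
print(rows)
conn.close()
```

Passing values through `?` placeholders, as in the `WHERE` clauses here, avoids having to quote string values inside the SQL text.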


## Creating a database

In [46]:
import sqlite3

conn = sqlite3.connect('mydb.sqlite')
cur = conn.cursor()            # similar to a file handle, i.e. f = open(...)

cur.execute('DROP TABLE IF EXISTS mytable')                      # SQL command
cur.execute('CREATE TABLE mytable (name TEXT, score INTEGER)')   # SQL command

# insert some records

cur.execute('INSERT INTO mytable (name, score) VALUES (?,?)', ('John', 60))
cur.execute('INSERT INTO mytable (name, score) VALUES (?,?)', ('Jane', 70))
cur.execute('INSERT INTO mytable (name, score) VALUES (?,?)', ('Jim', 90))

conn.commit()       # don't forget to commit!
conn.close()

Batch processing - prepare the data in a list of tuples, then call `executemany`:

In [47]:
import sqlite3

conn = sqlite3.connect('mydb.sqlite')
cur = conn.cursor()            # similar to a file handle, i.e. f = open(...)

cur.execute('DROP TABLE IF EXISTS mytable')                      # SQL command
cur.execute('CREATE TABLE mytable (name TEXT, score INTEGER)')   # SQL command

d = [('John', 66), ('Jane', 77), ('Jim', 99)]

cur.executemany('INSERT INTO mytable (name, score) VALUES (?,?)', d)

conn.commit()       # don't forget to commit!
conn.close()

## Query a table

In [48]:
import sqlite3

conn = sqlite3.connect('mydb.sqlite')
cur = conn.cursor()

cur.execute('SELECT * from mytable')
for row in cur:
    print(row)

('John', 66)
('Jane', 77)
('Jim', 99)


In [49]:
cur.execute('SELECT * from mytable where score > 60')
for row in cur:
    print(row)

('John', 66)
('Jane', 77)
('Jim', 99)


In [50]:
cur.execute("SELECT * from mytable where name = 'John'")   # SQL string literals use single quotes
for row in cur:
    print(row)

('John', 66)


## Update a row

In [51]:
# update the row where name = "John"

cur.execute("UPDATE mytable SET score=99 where name = 'John'")
conn.commit()

cur.execute("SELECT * from mytable where name = 'John'")
for row in cur:
    print(row)

('John', 99)


## Delete a row

In [52]:
# delete the "John" row

cur.execute("DELETE FROM mytable WHERE name = 'John'")
conn.commit()

cur.execute('SELECT * from mytable')
for row in cur:
    print(row)

('Jane', 77)
('Jim', 99)


In [53]:
! rm -f mydb.sqlite  # cleanup

# Datetime

```
import datetime
```

In [54]:
import datetime

datetime.datetime.now()

datetime.datetime(2017, 7, 19, 15, 58, 30, 988458)

Since some other packages may also provide a name `datetime`, it is probably a good idea not to use "`from datetime import *`" if there is a chance of confusion.
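Two explicit import styles that keep the namespace clear are sketched below. The alias `dt` is only a common convention, not something this notebook uses elsewhere.

```python
import datetime as dt                      # short alias, keeps the namespace explicit
from datetime import datetime, timedelta   # or import only the names you need

a = dt.datetime(2017, 4, 25, 23, 0)
b = datetime(2017, 4, 25, 23, 0)
same = (a == b)                            # both names refer to the same class

one_hour_earlier = b - timedelta(hours=1)
print(same, one_hour_earlier)
```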

In [55]:
t1 = datetime.datetime.now()
t2 = t1 - datetime.timedelta(hours=1)    # t2 = t1 - delta_t
print('t1 = {}\nt2 = {}'.format(t1,t2))

t1 = 2017-07-19 15:58:30.995599
t2 = 2017-07-19 14:58:30.995599


In [56]:
# format the time string

x = datetime.datetime(2017, 4, 25, 23, 57, 19)
print('ISO time format = {}'.format(x.strftime('%Y-%m-%d %H:%M:%S')))

ISO time format = 2017-04-25 23:57:19


In [57]:
# hour, minute, second and microsecond default to zero if not specified

x = datetime.datetime(2017, 4, 25, 23)  # does not specify minute and second
x.strftime('%Y-%m-%d %H:%M:%S')

'2017-04-25 23:00:00'
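The reverse direction, turning a formatted string back into a `datetime` object, uses `strptime` with the same format codes. A minimal sketch:

```python
import datetime

fmt = '%Y-%m-%d %H:%M:%S'
text = '2017-04-25 23:57:19'

parsed = datetime.datetime.strptime(text, fmt)   # str -> datetime
round_trip = parsed.strftime(fmt)                # datetime -> str
iso = parsed.isoformat()                         # ISO 8601 with a 'T' separator

print(parsed, round_trip, iso)
```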

In [58]:
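# write timestamps to a database
# Note: passing datetime objects directly relies on sqlite3's default adapter,
# which is deprecated as of Python 3.12; a formatted string (third insert) avoids it.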

import sqlite3
import datetime

conn = sqlite3.connect('timestamp.sqlite')
cur = conn.cursor()
cur.execute('DROP TABLE IF EXISTS ts ')
cur.execute('CREATE TABLE ts (timestamp datetime, what text)')

cur.execute('INSERT INTO ts (timestamp, what) VALUES (?,?)', 
            (datetime.datetime.now(), 'I ate a cake.'))
cur.execute('INSERT INTO ts (timestamp, what) VALUES (?,?)', 
            (datetime.datetime(2017, 4, 25, 23, 0), 'I did my homework.'))
cur.execute('INSERT INTO ts (timestamp, what) VALUES (?,?)', 
            (datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'), 
             'I wrote a Python function.'))
cur.execute('INSERT INTO ts (timestamp, what) VALUES (?,?)', 
            (datetime.datetime(2016, 4, 25, 23, 0), 'last year.'))

conn.commit()
conn.close()

In [59]:
# query example

conn = sqlite3.connect('timestamp.sqlite')
cur = conn.cursor()
cur.execute("SELECT * from ts where timestamp > '2017-01-01'")
for row in cur:
    print(row)
conn.close()

('2017-07-19 15:58:31.042132', 'I ate a cake.')
('2017-04-25 23:00:00', 'I did my homework.')
('2017-07-19 15:58:31', 'I wrote a Python function.')
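The comparison in this query is an ordinary string comparison: SQLite has no dedicated date/time storage type, and the timestamps here are kept as text. It works because year-first, zero-padded timestamps sort in chronological order. A minimal sketch with an in-memory table and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE ts (timestamp TEXT, what TEXT)')
cur.executemany('INSERT INTO ts (timestamp, what) VALUES (?, ?)', [
    ('2017-07-19 15:58:31', 'this year'),
    ('2016-04-25 23:00:00', 'last year'),
])

# Plain string comparison: '2017-07-19 ...' > '2017-01-01' > '2016-04-25 ...'
cur.execute('SELECT what FROM ts WHERE timestamp > ?', ('2017-01-01',))
recent = [row[0] for row in cur]
print(recent)
conn.close()
```

A format such as `04/25/2017` would not sort this way, which is one reason to store timestamps in the year-first form.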


In [60]:
!rm -f timestamp.sqlite   # clean up

# Accessing the Internet

```
import urllib.request
```

Each web page is treated like a file.
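To see the file-like behavior without a network connection, the sketch below opens a `data:` URL, which carries its own content. The URL is made up for illustration; a real `http://` address is read in the same way.

```python
import urllib.request

# A data: URL carries its own content, so no network access is needed.
url = 'data:text/html;charset=utf-8,%3Ch1%3EHello%3C%2Fh1%3E'

f = urllib.request.urlopen(url)
raw = f.read()                # bytes, as with a file opened in binary mode
f.close()

text = raw.decode('utf-8')    # convert bytes to a character string
print(type(raw).__name__, text)
```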

In [61]:
# Example

import urllib.request   # "import urllib" alone does not load the request submodule

f = urllib.request.urlopen('http://www.ucla.edu')

In [62]:
type(f)

http.client.HTTPResponse

In [63]:
f.geturl()

'http://www.ucla.edu'

In [64]:
# HTTP response headers
type(f.getheaders())

list

In [65]:
for field in f.getheaders():
    print(field)

('Content-Type', 'text/html; charset=UTF-8')
('Vary', 'Accept-Encoding')
('Set-Cookie', 'PHPSESSID=nlisr3vl1c7dckdffkcq11ppk6; path=/')
('Expires', 'Thu, 19 Nov 1981 08:52:00 GMT')
('Cache-Control', 'no-store, no-cache, must-revalidate, post-check=0, pre-check=0, max-age=86400, public, must-revalidate, proxy-revalidate')
('Pragma', 'no-cache')
('Transfer-Encoding', 'chunked')
('Date', 'Wed, 19 Jul 2017 22:58:31 GMT')
('Age', '24')
('Connection', 'close')
('X-Cache', 'HIT')


In [66]:
# Check the header of another web site

f2 = urllib.request.urlopen('http://www.latimes.com')
f2.getheaders()

[('Content-Type', 'text/html;charset=UTF-8'),
 ('Httpd-Identifier', 'web-c5a4a157c69161201c18e2fbd3e4b80f'),
 ('Server', 'Apache-Coyote/1.1'),
 ('x-Instance-Name', 'i13prod-bf8dfa6-0-207.1'),
 ('X-UA-Compatible', 'IE=Edge'),
 ('Vary', 'Accept-Encoding'),
 ('Cache-Control', 'public, max-age=43'),
 ('Date', 'Wed, 19 Jul 2017 22:58:31 GMT'),
 ('Transfer-Encoding', 'chunked'),
 ('Connection', 'close'),
 ('Connection', 'Transfer-Encoding')]

In [67]:
# Read the contents of the page
f = urllib.request.urlopen('http://www.ucla.edu')
html = f.read()
type(html)

bytes

In [68]:
# convert bytes to character string

html = html.decode()
type(html)

str

In [69]:
# word count in a web page

import urllib.request
import re

url = 'http://www.nytimes.com'
keyword = 'education'              # what about (E)ducation?

r = urllib.request.urlopen(url)
html = r.read()
html = html.decode()

x = re.findall(keyword, html)
#x = re.findall(keyword.lower(), html.lower())

print('The word "{}" appears {} times in {}'.format(keyword, len(x), url))

The word "education" appears 6 times in http://www.nytimes.com
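To count the keyword regardless of capitalization, one option is the `re.IGNORECASE` flag. The sketch below uses a short made-up string in place of a downloaded page so the counts are easy to verify by eye.

```python
import re

html = 'Education news. Higher education costs. EDUCATION policy.'
keyword = 'education'

exact = re.findall(keyword, html)                       # case-sensitive
any_case = re.findall(keyword, html, re.IGNORECASE)     # ignores case

print(len(exact), len(any_case))
```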


In [70]:
# print out all the http:// links in a page

import re
import urllib.request
url = 'http://www.nytimes.com'
r = urllib.request.urlopen(url)
html = r.read().decode()
links = re.findall('href="(http://.*?)"', html)
for link in links[0:10]:          # we are printing only the first few
    print(link)

http://www.nytimes.com/services/xml/rss/nyt/HomePage.xml
http://mobile.nytimes.com
http://mobile.nytimes.com
http://www.nytimes.com/content/help/site/ie9-support.html
http://cn.nytimes.com
http://www.nytimes.com/pages/todayspaper/index.html
http://www.stitcher.com/podcast/the-new-york-times/the-daily-10
http://www.nytimes.com/spotlight/times-tips?contentCollection=smarter-living
http://www.nytimes.com/video/the-daily-360
http://www.nytimes.com/pages/opinion/index.html


## Parsing HTML yourself vs. using a dedicated library

While it is possible to use regular expressions to parse HTML pages, the code can quickly get messy. A dedicated parser such as the BeautifulSoup library is usually a better choice.

Installing BeautifulSoup on your computer is similar to installing `spyder`. In a terminal, type the command:
```
conda install beautifulsoup4
```
and answer "Y(es)" to the question. (Don't miss the "4" in "beautifulsoup4")


In [71]:
import bs4
import urllib.request

url = 'http://www.nytimes.com'
html = urllib.request.urlopen(url).read().decode()
soup = bs4.BeautifulSoup(html, "html.parser")
tags = soup('a')
for tag in tags[0:10]:            # we are printing only the first few
    txt = tag.get('href')
    print(txt)

http://www.nytimes.com/content/help/site/ie9-support.html
#top-news
#site-index-navigation
http://cn.nytimes.com
https://www.nytimes.com/es/
https://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
https://www.nytimes.com/video
https://www.nytimes.com/pages/world/index.html
https://www.nytimes.com/pages/national/index.html


In [72]:
import bs4
import urllib.request

url = 'http://www.nytimes.com'
html = urllib.request.urlopen(url).read().decode()
soup = bs4.BeautifulSoup(html, "html.parser")
tags = soup('a')
for tag in tags[0:10]:            # we are printing only the first few
    txt = tag.get('href')             # None if the tag has no href attribute
    if txt and not txt.startswith('#'):
        print(txt)

http://www.nytimes.com/content/help/site/ie9-support.html
http://cn.nytimes.com
https://www.nytimes.com/es/
https://www.nytimes.com/
http://www.nytimes.com/pages/todayspaper/index.html
https://www.nytimes.com/video
https://www.nytimes.com/pages/world/index.html
https://www.nytimes.com/pages/national/index.html
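The same link filtering can be sketched offline. To keep the example free of third-party packages, it swaps in `html.parser.HTMLParser` from the standard library in place of BeautifulSoup. The class name `LinkCollector` and the sample HTML are hypothetical. It also shows why a missing `href` has to be handled: an `<a>` tag without that attribute yields `None`.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags, skipping missing and '#' links."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href')       # None when the tag has no href
        if href and not href.startswith('#'):
            self.links.append(href)

page = '''
<a href="#top-news">Top</a>
<a name="anchor-only">No href</a>
<a href="https://www.example.com/world">World</a>
'''

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```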
