# Pattern
Pattern is a python web mining module with tools for data mining, natural language processing, machine learning and network analysis. In the whole process of web mining, data scientists usually first collect data, then parse and store data, then extract information and apply machine learning techniques to gain insights. Pattern contains several modules each of which takes care one of those processes in the web mining pipeline. You can view **pattern** as a collection of commonly used tools in web mining packed together into a library. Basically, **pattern** is more user-friendly in the sense that it provides a more high-level api for doing the usual tasks in web mining than existing libraries.

This tutorial will introduce you some basic usage of **pattern** on collecting data from the web and storing data.
Due the length restriction, this tutorial is only involved with modules listed below.
## Modules
- [pattern.web](http://www.clips.ua.ac.be/pages/pattern-web)
- [pattern.db](http://www.clips.ua.ac.be/pages/pattern-db)

The following topics will be covered in the tutorial.
- [Installation](#Installation)
- [Collect Data](#Collect-Data) with **pattern**
- [Store Data](#Store-Data) with **pattern**

# Installation

Pattern can be easily installed with **pip**:

    $ pip install pattern
    
Note that **pattern** is only available in Python 2.5+ (no support for Python 3 yet).

In [2]:
import pattern

# Collect Data
This section involves several basic usages of **pattern** web module for collecting data on the web.

- [API call](#API-call)
- [HTML DOM Parser](#HTML-DOM-Parser)
- [Crawler](#Crawler)

## API call
The **pattern** web module provides API calls that can invoke asynchronuous requests to API of useful sources of data on the web: Twitter, Google, Facebook, Wikipedia, and etc, and it also contains APIs for the web services from those data sources, such as Google search, Google translate, and etc. Also, after receiving data from the data sources, **pattern** web module automatically parses and stores the data into *result* objects that are easy to use. Basically, a *result* object contains the following attributes.

From the official site:
```Python
result.url                  # URL of content associated with the given query.
result.title                # Content title.
result.text                 # Content summary.
result.language             # Content language.
result.author               # Content author. (if there is one)
result.date                 # Content date. (if there is one)
```

We will see several examples of using the **pattern** web module below.

## Example: Twitter



In [3]:
from pattern.web import Twitter, hashtags

To get the most recent tweets based on some keywords from Twitter, we first initialize a *Twitter* object, which contains a useful method called *search*. The *search* function takes several argument:
- *query* (str): the keyword for querying
- *start* (int): the id of tweet started on
- *count* (int): number of results to be returned
- *cached* (bool): whether to cache the results or not

The results return from calling *search* will be a list of *result* objects mentioned above.

In [121]:
tweets = []

for tweet in Twitter().search("#Trump OR #Clinton", start=prev, count=10, cached=False):
    print tweet.url
    print tweet.author
    print tweet.text
    print tweet.like
    print tweet.date
    print hashtags(tweet.text) # get the list of # keywords in a tweet
    tweets.append(tweet)

https://twitter.com/WhiteStorm14ws/status/794676621345050624
WhiteStorm14ws
#Trump  #Trump2016 #MAGA  #BuildTheWall #MakeAmericaGreatAgain https://t.co/Hsl3vHIsGT

Fri Nov 04 23:04:05 +0000 2016
[u'#Trump', u'#Trump2016', u'#MAGA', u'#BuildTheWall', u'#MakeAmericaGreatAgain']
https://twitter.com/Outofnames/status/794676615200591872
Outofnames
RT @SandraTXAS: 'Hillary Clinton &amp; ISIS funded by same money' - Assange 
#WikiLeaks
#SpiritCooking
#Hillary #ImWithHer not!  
#MAGA #Trump…

Fri Nov 04 23:04:04 +0000 2016
[u'#WikiLeaks', u'#SpiritCooking', u'#Hillary', u'#ImWithHer', u'#MAGA', u'#Trump']
https://twitter.com/hopefulfornow/status/794676614319796226
hopefulfornow
RT @_AnimalAdvocate: #Retweet if this gives you that #FridayFeeling!
We've got to land a knockout blow to #Trump on #VotingDay
or he'll be…

Fri Nov 04 23:04:03 +0000 2016
[u'#Retweet', u'#FridayFeeling', u'#Trump', u'#VotingDay']
https://twitter.com/KnucklDraginSam/status/794676613132722177
KnucklDraginSam
RT @WDFx2EU7

## Example: Google

In [5]:
from pattern.web import Google

We may also use the Google search engine directly, in a way similar to using Twitter above. Note that to perform unlimited search using Google requires a license, otherwise the daily allowed number of queries is 100.

In [114]:
engine = Google(license=None, language='en')

for result in engine.search('Trump', count=5, cached=False):
    print result.url
    print result.text

https://www.donaldjtrump.com/
Official campaign site of the Republican nominee in 2016 for U.S. President.
https://twitter.com/realdonaldtrump?lang=en
33.9K tweets • 1859 photos/videos • 12.9M followers. Check out the latest Tweets <br>
from Donald J. <b>Trump</b> (@realDonaldTrump)
https://en.wikipedia.org/wiki/Donald_Trump
Donald John <b>Trump</b> (born June 14, 1946) is an American businessman, <br>
television producer, and politician who is the Republican Party nominee for <br>
President of&nbsp;...
https://www.facebook.com/DonaldTrump/?fref=ts
This is the official Facebook page for Donald J. <b>Trump</b>.
http://www.trump.com/
<b>Trump</b> Luxury Real Estate redefines what is meant by luxury living, built to be the <br>
absolute best in the world.


## HTML DOM Parser

In [12]:
from pattern.web import URL, DOM

The HTML DOM parser  is helpful and user-friendly to get data in a html string. First, we initialize a *URL* object with a url string, which contains a *download* method. By invoking the *download* method, **pattern** sends a GET request to the given url and download the whole html file in the form of a string. Then, we take the html string and initialize a *Element* object by calling the *DOM* function. An *Element* object represents the current hierarchy of the html string, and it contains several useful attributes and methods:

From the official website:
```Python
element.tag                 # Tag name.
element.attrs               # Dictionary of attributes, e.g. {'class':'comment'}.
element.id                  # Value for id attribute (or None).
element.source              # HTML source code in string
element.content             # HTML source minus open and close tag.
element.by_id(str)          # First nested Element with given id.
element.by_tag(str)         # List of nested Elements with given tag name.
element.by_class(str)       # List of nested Elements with given class.
element.by_attr(**kwargs)   # List of nested Elements with given attribute.
element(selector)           # List of nested Elements matching a CSS selector.
```

Sadly, there are some information that the results from *Twitter search* do not record. For example, the full name of a twitter user is not recorded in a *result* object. Now, we try to extract the full name of a twitter user. Actually, the html file of a twitter page is quite complex. For simplicity, we extract the fullname of the last user appear in the page.

In [112]:
twitters = []

def extract_fullname(url):
    dom = DOM(URL(url).download()) # download the html string
    fullname = dom.body.by_class("fullname")[-1].content # the last element with class "fullname"
    username = dom.body.by_class("username")[-1].by_tag("b")[0].content # the last element with class "username"
    return username, fullname
    
for tweet in tweets:
    username, fullname = extract_fullname(tweet.url)
    twitters.append((username, fullname))
    print fullname, "is", username

Brook Harty is IronWolve
Derinda simpson is noseygirl63
KMA is KMA0111
Correct The Record is CorrectRecord
Correct The Record is CorrectRecord
KMA is KMA0111
BbayBop is BbayBop
Thersa is Thersas1
Thersa is Thersas1
JanneW10_Messi is _JanneW10


## Crawler


In [11]:
from pattern.web import Crawler, LIFO

A web crawler is program that systematically visit web pages. Basically, it starts from certain web pages, looks for url links inside those web pages and then go to those web pages behind the links, and repeat this process. The **pattern** web package contains the *Crawler* class that allows us to write crawler programs.

A *crawler* object maintains a list of url links to be visited. When the *crawl* function is called, it will visit the first link in the list. When a link is visited, the function *visit* will be called, otherwise *fail* will be called, both of which a *link* object is passed as an argument. 

A *link* object describes the url it is linked to and the referrer url it is from.

from the official site:
```
crawler = Crawler(links=[], domains=[], delay=20.0, sort=FIFO)
```

```Python
crawler.history                       # Dictionary of (domain, time last visited)-items.
crawler.visited                       # Dictionary of URLs visited.
crawler.crawl(method=DEPTH)           # DEPTH (depth first search) | BREADTH (breath first search) | None.
crawler.visit(link, source=None)
crawler.fail(link)
```

```Python
link.url                    # Parsed from <a href=''> attribute.
link.referrer               # Parent web page URL.
```

For illustration purpose, we create a simple crawler that can crawl tweet pages and print out the username of that page. To do so, we create a subclass of *Crawler* and rewrite the *visit* and *fail* function. Note that the url of a tweet page looks like https://twitter.com/username/status/tweet_id, so we need to extract the *username* inside. 

In [119]:
class TwitterCrawler(Crawler):
    
    def visit(self, link, source=None):
        if "status" in link.url and "status" in link.referrer:
            visiting_id = link.url.split('/')[3] # extract username
            if link.referrer:
                from_id = link.referrer.split('/')[3]
            else:
                from_id = ''
            print "in user:", visiting_id, ", from user:", from_id
            print "in url:", link.url, ", from url:", link.referrer
    
    def fail(self, link):
        print "failed:", link.url

In [120]:
for tweet_url in tweet_urls[:5]: # just crawl from first 5 urls
    crawler = TwitterCrawler(links=[tweet_url], delay=0.0, sort=LIFO) 
    # LIFO: the new links are visited in the order of last-in-first-out

    print "CRAWLing " + "-" * 50
    while len(crawler.visited) < 5: # crawl 5 pages
        crawler.crawl(cached=False)

    print crawler.visited
    print crawler.history

CRAWLing --------------------------------------------------
in user: imcosta1 , from user: IngridBush
in url: https://twitter.com/imcosta1/status/794533506739224576 , from url: https://twitter.com/IngridBush/status/794401601368760320
in user: imcosta1 , from user: imcosta1
in url: https://twitter.com/imcosta1/status/794532866319417344 , from url: https://twitter.com/imcosta1/status/794533506739224576
in user: MaddowBlog , from user: imcosta1
in url: https://twitter.com/MaddowBlog/status/794353514080190464 , from url: https://twitter.com/imcosta1/status/794532866319417344
in user: Angel5202006 , from user: MaddowBlog
in url: https://twitter.com/Angel5202006/status/794543373654822913 , from url: https://twitter.com/MaddowBlog/status/794353514080190464
{u'https://twitter.com/Angel5202006/status/794543373654822913': True, u'https://twitter.com/IngridBush/status/794401601368760320': True, u'https://twitter.com/MaddowBlog/status/794353514080190464': True, u'https://twitter.com/imcosta1/statu

# Store Data
This section involves some basic method of storing data with **pattern**:

- [Create Database](#Create-Database)
- [Create Table](#Create-Table)
- [Insert Data](#Insert-Data)
- [Data Query](#Data-Query)

The pattern.db module contains wrappers for databases (SQLite, MySQL). It is convenient to work with tabular data, collected from **pattern** web module.

In [98]:
from pattern.db import Database, SQLITE, pd, assoc, INNER
from pattern.db import field, pk, STRING, INTEGER, rel

## Create Database

We can create or access a SQLite very easily by calling *Database*, where the first argument is the directory of the database. If there is already a database, then **pattern** will connect to the database, otherwise it will create one. The function *pd* stands for parent directory of the script.

In [89]:
db = Database(pd("tweet.db"), type=SQLITE)

## Create Table

We can define the schema of a table in an expressive way using *pk*, and *field*. We create two tables for the twitter data collected from above.

In [90]:
if not "tweet" in db:
    schema = (
        pk(), # Auto-incremental id.
        field("author", STRING(50)), # name, type
        field("text", STRING(256)),
        field("like", INTEGER)
    )
    db.create("tweet", schema) # create the table
    
if not "twitter" in db:
    schema = (
        pk(),
        field("username", STRING(50)),
        field("fullname", STRING(50))
    )
    db.create("twitter", schema)

Once the table is created, we can access the table with *db.table_name*.

## Insert Data
The *append* function allows easy insertion of a data row, in which the argument names correspond to the field names of the table. We insert the tweet data collected from *Twitter.search* and twitter data collected using the HTML DOM parser above.

In [123]:
tweet_set = set()

for tweet in tweets:
    if tweet.id not in tweet_set:
        if tweet.like:
            db.tweet.append(author=tweet.author, text=tweet.text, like=tweet.like)
        else:
            db.tweet.append(author=tweet.author, text=tweet.text, like=0)
        tweet_set.add(tweet.id)
        

In [122]:
twitter_set = set()

for username, fullname in twitters:
    if username not in twitter_set:
        db.twitter.append(username=username, fullname=fullname)
        twitter_set.add(username)

The function *assoc* will convert a data table into a generator of each row represent as a Python dictionary. Let's print the first 10 rows of the tables.

In [93]:
print "tweet table:" + "-" * 50
for index, row in enumerate(assoc(db.tweet)):
    if index < 10:
        print row # represented by a dictionary {'name': 'value'}
    else:
        break
print

print "twitter table:" + "-" * 50
for index, row in enumerate(assoc(db.twitter)):
    if index < 10:
        print row
    else:
        break

tweet table:--------------------------------------------------
{u'text': u'This is America. NO ONE "ALLOWS" US ANYTHING! With #Trump leading us #WeThePeople WILL #DrainTheSwamp! @realfirearms @jorgenseptember', u'like': 0, u'id': 1, u'author': u'WalkingSeaWater'}
{u'text': u'RT @balleryna: .@heuteshow&lt;=Staatsfunk wei\xdf von Verstrickung der kriminellen #Clinton in #Epstein- &amp; #Weiner-Verbrechen=Mitschuld!:( #Wikil\u2026', u'like': 0, u'id': 2, u'author': u'AndrewRoussak'}
{u'text': u"#Trump\n\nOh. And he hasn't paid any #federaltax in  20 years.\nZero contribution. That is sick. When millions of poor people with 2 jobs+ pay!", u'like': 0, u'id': 3, u'author': u'DubaiDiaries'}
{u'text': u'#trump likes attachment beneficial awesome benevolent', u'like': 0, u'id': 4, u'author': u'nicolasabanner2'}
{u'text': u'RT @joelpollak: Panoramic shot of crowded #Trump rally in #Atkinson #NewHampshire - at least 1,000 packed into hall, wall to wall https://t\u2026', u'like': 0, u'id': 5, u'au

## Data Query
The **pattern** db module contains a high level function for executing SQL data query. Suppose we want to search from the 'tweet' table for the tweet of the twitter recorded in the 'tweeter' table and list them with the full name instead of the user name. The *relations* attribute is a collection of *rel* object which defines the relations between tables. Specifically,
```Python
rel("fieldname1", "fieldname2", "tablename", join=LEFT)  # LEFT or INNER
```
where 'fieldname2' of 'tablename' table is join with 'fieldname1' of the current table.

In [127]:
q = db.twitter.search( # search from the "twitter" table
    fields = (
       "tweet.text",
       "twitter.fullname"
    ),
    relations = (
      rel("username", "tweet.author", "tweet", join=INNER) # inner join from the tweet table
    )
)
print q.sql() # print out the query in SQL 

select tweet.text, twitter.fullname from `twitter` inner join `tweet` on twitter.username=tweet.author;


We print out the result of the query

In [100]:
for row in assoc(q):
    print row 

{'tweet.text': u'This is America. NO ONE "ALLOWS" US ANYTHING! With #Trump leading us #WeThePeople WILL #DrainTheSwamp! @realfirearms @jorgenseptember', 'twitter.fullname': u'WalkingSeaWater'}
{'tweet.text': u"#Trump\n\nOh. And he hasn't paid any #federaltax in  20 years.\nZero contribution. That is sick. When millions of poor people with 2 jobs+ pay!", 'twitter.fullname': u'Bibi'}
{'tweet.text': u'#trump likes attachment beneficial awesome benevolent', 'twitter.fullname': u'Nicolasa Bannerman'}
{'tweet.text': u'The latest A Look Down the Rabbit Hole! https://t.co/BdoLuGvclq Thanks to @AnonBruja @PupperJ @SensiblySecular #nevertrump #trump', 'twitter.fullname': u'Alice King'}
{'tweet.text': u'@LindaSuhler @RSBNetwork my sister is there with her boyfriend #Lucky \u2728Both in PA &amp;  voted for #Trump \u2728Trump signs all over PA \u2728#VoteTrump', 'twitter.fullname': u'Deplorable Jen'}
{'tweet.text': u'#trump treat benevolent yays agreeable ok', 'twitter.fullname': u'Nannette Markowi

# Summary and Referrence
Basically, as mentioned in the introduction, **pattern** is useful in the sense that it is a bundle of a lot of common used tools in web mining. Besides the web and db modules, there are other useful modules that deal with other common tasks in data science. If you are interested, take a look at the official site.

- The official site: http://www.clips.ua.ac.be/pages/pattern
- The **pattern.web** package: http://www.clips.ua.ac.be/pages/pattern-web
- The **pattern.db** package: http://www.clips.ua.ac.be/pages/pattern-db
- The Github directory: https://github.com/clips/pattern
- A Case Study: http://www.clips.ua.ac.be/sites/default/files/modeling-creativity.pdf