<small><i>August 2014 - This notebook was created by [Oriol Pujol Vila](http://www.maia.ub.es/~oriol). Source and license info are in the folder.</i></small>

# Data hunting and gathering (part 1 - student notebook)

<img style = "border-radius:20px;" src = "http://unadocenade.com/wp-content/uploads/2012/09/cavalls-de-valltorta.jpg">

# Contents and Requirements 

**SESSION 1:**

    + Warm-up and Crawling
    + MongoDB basics (writing and reading)
    + APIs:
        + API wrappers
        + direct API programming
    
SOFTWARE REQUIREMENTS FOR SESSION 1
    
    + mongoDB (>=4.0.9) (https://www.mongodb.com/download-center/community , download from the Server tab)
    
ADDITIONAL PYTHON LIBRARIES

    + pymongo #pip install pymongo

    

<div class = "alert alert-danger" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;">
**DISCLAIMER AND USER AGREEMENT:** Ensure you are allowed to use these tools for retrieving data and be respectful with web pages and apps. Ethical use of these tools is mandatory. The content provided by this notebook is for educational purposes only. 
<p>

THE NOTEBOOK/SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE NOTEBOOK CONTENTS OR SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
</div>

# 1. Introduction

Data is the basis of this course. We usually find it in well-structured formats, such as a spreadsheet resulting from our last experiment or a collection of company records in a classical relational database. With the advent of the internet, however, new information sources have to be taken into account, and these sources are home to mostly unstructured data. This lecture presents several methods for retrieving and storing such data.

Let us first introduce the big picture guiding this lecture. Whenever we want to retrieve data from a web site, we should first ask whether the site provides a simple way to do so. Many large sites such as Google, Facebook, Twitter, etc. provide an **Application Programming Interface (API)** that can make data hunting easier. However, most web sites do not have such an interface and, even when they do, the API may not provide the desired information. In those cases we have to use **scraping** techniques. This means dealing with the raw information as it is provided to the web browser and coding our own data finding methods.  

<img style="border-radius:20px;" src="./files/big_picture.jpg">

Let us start by connecting to the net and checking how to retrieve a basic page. We will start with the `urllib.request` module.

In [None]:
from urllib.request import urlopen
source = urlopen("http://www.google.com/")
print(source)

In [None]:
#Let us check what is in
source

In [None]:
#Hurray, we got a response object. It behaves like a file, so let us read() the "file"
something = source.read().decode('latin-1')

In [None]:
#Check on something
print(something)

In [None]:
#What!!!!
#Let us read more
print(source.read())

In [None]:
#Ooooppss nothing else.
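
The empty result of the second `read()` is expected: the response is a file-like byte stream, and once it has been consumed there is nothing left to read. A minimal sketch of the same behaviour, using an in-memory `io.BytesIO` stream as a stand-in for the network response (the HTML content is made up for illustration):

```python
import io

# Stand-in for the object returned by urlopen(): a file-like byte stream.
# The content below is invented so that the example runs without a network.
stream = io.BytesIO(b"<html><body>Hello</body></html>")

first = stream.read().decode("latin-1")   # consumes the whole stream
second = stream.read()                    # nothing is left to read

print(repr(first))
print(repr(second))
```

If the content is needed more than once, it is simplest to keep the decoded string in a variable, as done with `something` above.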

Ok, hands on!!! Some first warm up exercises:

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM UP EXERCISES**
<ol>
<li>Is there the word python in pyladies.org?</li>
<li>Does http://google.com contain an image? (hint: < img > TAG )</li>
<li>What are the first ten characters of python.org?</li>
</ol>
</div>

In [None]:
#EX 1. Write your code here

In [None]:
#EX 2. Write your code here

In [None]:
#EX 3. Write your code here

We are retrieving data from a URL! So we are done! 

# Crawling and Scraping

Scraping and **crawling** are two closely related techniques. While scraping is used for retrieving data from a web page, crawling is used to retrieve the web pages themselves. Both are found at the core of search engines. Scraping is used to get keywords and to analyze and extract useful information from the web pages, so that the engine can return related results for a given user query. Crawling, on the other hand, retrieves the actual pages and uses scraping to get the links in each web site. This makes it possible to build a graph of the connections among web sites, and this information can be used to order the results of a query.

In general, we might want not only to get data from a single page but probably retrieve from several related pages. In those cases crawling is the way to go. 

<div class = "alert alert-success" style = "border-radius:10px;border-width:3px;border-color:darkgreen;font-family:Verdana,sans-serif;font-size:16px;">**WARM-UP PROJECT:** Let us build a very simple spider. The basic functionality of a spider is to crawl web pages and store all their data. In this simple project we will take care of a single site. 

<ol>
<li>A crawler must recognize the links to crawl. Take a minute and think how to retrieve the links of a web site.</li>
<li>Let us start the project by creating a Spider class. The constructor will have the following parameters: starting_url, crawl_domain, and max_iter. crawl_domain will be the domain that validates if an absolute link will be considered or not. max_iter is the maximum amount of web items to crawl.</li>
<li>The main method can be Spider.run(). Enumerate the big functionalities/building blocks of the crawler.</li>
</ol>
</div>
    
    

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError


import time

def getLinks(html, max_links=10):
    url = []
    cursor = 0
    nlinks=0
    while (cursor>=0 and nlinks<max_links):
        start_link = html.find("a href",cursor)
        if start_link==-1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote+1
        nlinks = nlinks +1
    return url

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
        self.collection=[]
        
    def retrieveHtml(self):
        try:
            socket = urlopen(self.url)
            self.html = socket.read().decode('latin-1')
            return 0
        except HTTPError:
            # Most probably an url not found 404, possibly due to malformating of the links in retrieveAndValidateLinks
            return -1
             
    def run(self):
        while (len(self.links_to_crawl)>0 and len(self.collection)<self.max_iter):
            self.url = self.links_to_crawl.pop(0)
            print (self.links_to_crawl)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
    
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if self.crawl_domain in item:
                tmpList.append(item)
            if ":" not in item: #Take care of http:// https:// and mailto:
                tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        self.collection.append(doc)
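
The string search in `getLinks` only recognizes anchors written exactly as `<a href="...">`. As a possible alternative, and only as a sketch, the standard library's `html.parser` can collect `href` attributes regardless of attribute order or quoting style. The names `LinkParser` and `get_links_with_parser` are hypothetical and not part of the original project, and the sample HTML is made up:

```python
from html.parser import HTMLParser


class LinkParser(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the current tag
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def get_links_with_parser(html, max_links=10):
    """Return at most max_links href values found in the html string."""
    parser = LinkParser()
    parser.feed(html)
    return parser.links[:max_links]


# Invented sample: an extra attribute before href and single quotes
sample = ('<p><a class="x" href="/about">About</a> '
          "<a href='http://hunch.net/?p=1'>Post</a></p>")
print(get_links_with_parser(sample))
```

Such a function could replace `getLinks` inside `retrieveAndValidateLinks` without changing the rest of the class.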
       


Let us validate the crawler with the following code: 

In [None]:
spider = Spider('http://www.ub.edu/datascience/postgraduate/','http://www.ub.edu/datascience/postgraduate/',20)
spider.run()

In [None]:
#How many elements does our collection have?
len(spider.collection)


In [None]:
spider.collection[0]

In [None]:
#Enumerate the urls retrieved
[doc['url'] for doc in spider.collection]

Let us go for a more complex web site. Run the code on http://hunch.net (a machine learning blog by John Langford).

In [None]:
spider = Spider('http://hunch.net','http://hunch.net/',10)
spider.run()

In [None]:
# And check the urls retrieved
[doc['url'] for doc in spider.collection]

It seems that the simple crawler more or less works as expected. There are still many functionalities to work on, such as checking for valid domains, valid urls, etc. One important issue to consider is **persistence**, or how to store the retrieved data for further analysis. In this basic scraping tutorial we use MongoDB, a NoSQL database, for persistence purposes. 
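
As an illustration of what validating domains and urls could look like, here is a minimal sketch based on `urllib.parse`. It resolves relative links against the page they were found on, drops non-HTTP schemes such as `mailto:`, and keeps only links on one host. The function name `normalize_link` and its parameters are assumptions made for this example, not part of the Spider class above:

```python
from urllib.parse import urljoin, urlparse, urldefrag


def normalize_link(base_url, href, allowed_netloc):
    """Return an absolute URL inside allowed_netloc, or None if the link is rejected."""
    absolute = urljoin(base_url, href)          # resolve relative links
    parts = urlparse(absolute)
    if parts.scheme not in ("http", "https"):   # skip mailto:, javascript:, ...
        return None
    if parts.netloc != allowed_netloc:          # stay on the crawled site
        return None
    return urldefrag(absolute).url              # drop the #fragment part


print(normalize_link("http://hunch.net/", "?p=12", "hunch.net"))
print(normalize_link("http://hunch.net/a/b.html", "../c.html#top", "hunch.net"))
print(normalize_link("http://hunch.net/", "mailto:someone@example.com", "hunch.net"))
```

Comparing the host name exactly is a deliberate simplification; a real crawler would probably also decide how to treat subdomains such as `www.`.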

## 1.1 Introduction to MongoDB
<small>This introduction is partially inspired by the notes on Alberto Negron's [blog](http://altons.github.io/python/2013/01/21/gentle-introduction-to-mongodb-using-pymongo/)</small>

MongoDB is a document-oriented database, part of the NoSQL family of database systems. MongoDB stores structured data as JSON-like structures. From a pythonic point of view it is like storing dictionary data structures. One of its main features is that it is schema-less, i.e. it supports dynamic schemas. A schema in a relational database informally refers to the structure of the data it stores, i.e. what kind of data, which tables, which relations, etc.

Let us change the Spider class to support MongoDB persistence.

First of all let us configure the MongoDB system.

+ Download mongoDB.
+ Rename the folder to mongodb.
+ Add a `data` directory and a `log` directory in your working project directory.
+ Check that the server works 

        `mongod --dbpath . --nojournal` (use `./mongod --dbpath . --nojournal &` in linux based systems)
        
+ Check the connection to the server: in another terminal write mongo, check that it does not raise any error and exit the console.
+ Close the mongo daemon (mongod). You may have to kill mongod with kill -9 and remove the lock on the daemon, mongod.lock.
+ Let us configure the database a little by setting the paths of the data storage and log files. Create a [mongo.conf](./mongodb/data/mongo.conf) file such as the one provided and start the server using the following command:

        mongod --config=./mongodb/data/mongo.conf --nojournal &
        
+ Bonus: we can check the database status using  http://127.0.0.1:27017/

### Connect to a MongoDB database

In [None]:
import pymongo

# Connection to MongoDB
try:
    conn = pymongo.MongoClient()
    # MongoClient connects lazily, so force a round trip to the server
    conn.admin.command('ping')
    print ("Connected successfully!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )
conn


In [None]:
import pymongo
conn = pymongo.MongoClient()

We can **create** a database using attribute access <span style = "font-family:Courier;"> db = conn.name_db</span> or dictionary access <span style = "font-family:Courier;"> db = conn['name_db']</span>.

In [None]:
#Create a database using db = conn.name_db or dictionary access db = conn['name_db']
db = conn['datascienceUB_Octubre_2018']
print (db)
conn.list_database_names()
#Empty databases do not show

A database stores **collections**. A collection is a group of documents stored in MongoDB, and can be thought of as the equivalent of a table in a relational database. Getting a collection in PyMongo works the same as getting a database:

In [None]:
collection = db['Hola']
db.list_collection_names()
#Empty collections do not show

In [None]:
#The database has a collection, thus ...
conn.list_database_names()

### Insert documents

MongoDB stores structured data as JSON-like (JavaScript Object Notation) documents, in a binary format called BSON, using dynamic schemas rather than predefined ones. An element of data is called a document, and documents are stored in collections. One collection may have any number of documents.

Compared to relational databases, we could say collections are like tables, and documents are like records. But there is one big difference: every record in a table has the same fields (with, usually, differing values) in the same order, while each document in a collection can have completely different fields from the other documents.

All you really need to know when you're using Python, however, is that documents are Python dictionaries that have strings as keys and can contain various primitive types (int, float, str, datetime) as well as other documents (Python dicts) and arrays (Python lists).
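
To make this concrete, the following plain-Python sketch builds two documents with completely different fields, including a nested document and a list. No database is involved here: the list `collection_like` only mimics a collection, and all field names and values apart from the course URL are invented for illustration:

```python
from datetime import datetime

# A document describing a downloaded page
page_doc = {
    "url": "http://www.ub.edu/datascience/postgraduate/",
    "date": datetime(2014, 8, 1),
    "html": "<html></html>",
}

# A document with different fields, a nested document and a list
user_doc = {
    "user": 42,
    "followers": [1, 2, 3],
    "profile": {"lang": "ca", "verified": False},
}

# Both could live in the same collection because no schema is enforced
collection_like = [page_doc, user_doc]
for d in collection_like:
    print(sorted(d.keys()))
```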

To insert some data into MongoDB, all we need to do is create a dict and call .insert_one() on the collection object. Let us exemplify this process by downloading a URL and storing it in the collection.

In [None]:
from urllib.request import urlopen
import time
# dd/mm/yyyy format
print (time.strftime("%d/%m/%Y"))
url = 'http://www.ub.edu/datascience/postgraduate/'
html = urlopen(url).read().decode('latin-1')

#Create a dictionary/document to store
doc = {}
doc['url'] = url
doc['date'] = time.strftime("%d/%m/%Y")
doc['html'] = html
doc['adios'] = 'this is another test'

In [None]:
doc['url']

In [None]:
#insert the document in the collection
collection.insert_one(doc)

In [None]:
#Check that we have a non empty collection.
db.list_collection_names()

To recap, we have databases containing collections. A collection is made up of documents. Each document is made up of fields.

### Retrieving documents

In [None]:
collection.find_one() #Returns a single document (the first match in natural order), or None if the collection is empty

To get more than a single document as the result of a query we use the find() method. find() returns a Cursor instance, which allows us to iterate over all matching documents.


In [None]:
collection.find()

In [None]:
for d in collection.find():
    print(d)

### Retrieving filtered documents

A very naive way to filter is to iterate over all the documents and keep only the ones we want. Thus, a programmatic way to filter is:

In [None]:
for d in collection.find():
    try:
        if "datascience" in d["url"]:
            print(d["url"])
    except KeyError:
        print("ERROR")
   

However, we can also pass the filter directly to .find() in pymongo:

In [None]:
for i in collection.find({"attribute":"attribute value"}):
    print(i)

Observe that it finds exact matches. MongoDB also provides query operators, written with a dollar prefix, such as `$gt` (greater than), `$gte` (greater than or equal), `$lt` (less than), `$lte` (less than or equal), `$ne` (not equal), `$nin` (not in a list), `$regex` (regular expression), `$exists`, `$not`, `$or`, `$and`, etc. Let us see some examples:

In [None]:
collection.count_documents({"date":{"$gte":"01/01/2014"}})
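
One caveat with this query: the `date` field is stored as a `dd/mm/YYYY` string, so the comparison is made on text, character by character, rather than chronologically. The following plain-Python sketch (no database needed) shows the pitfall and two possible workarounds, assuming we are free to choose how dates are stored. The helper `to_iso` is a hypothetical name:

```python
from datetime import datetime

later = "01/02/2015"    # 1 February 2015
earlier = "15/01/2014"  # 15 January 2014

# Text comparison of dd/mm/YYYY strings: "0" sorts before "1", so this is False
print(later >= earlier)


def to_iso(date_str):
    """Convert dd/mm/YYYY to YYYY-mm-dd, which sorts chronologically as text."""
    return datetime.strptime(date_str, "%d/%m/%Y").strftime("%Y-%m-%d")


# Workaround 1: store ISO formatted strings
print(to_iso(later) >= to_iso(earlier))

# Workaround 2: store real datetime objects
print(datetime.strptime(later, "%d/%m/%Y") >= datetime.strptime(earlier, "%d/%m/%Y"))
```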

However, the most powerful way to filter directly is to use **regular expressions**, as follows:

In [None]:
substring = "datascience"
reg = substring
collection.count_documents({"html":{"$regex":reg}})

In [None]:
for item in collection.find({"html":{"$regex":"datascience"}}):
    print (item['html'])

Regular expressions are usually not the answer due to the fragility of html pages on the internet today -- common mistakes like missing end tags, mismatched tags, forgetting to close an attribute quote, would all derail a perfectly good regular expression.
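
A small sketch of this fragility, using Python's `re` module on two made-up HTML fragments. The pattern is a deliberately naive assumption of how one might try to extract links:

```python
import re

# Naive pattern: expects href to be the first attribute, in double quotes
pattern = re.compile(r'<a href="([^"]*)"')

clean = '<a href="http://hunch.net/">home</a>'
messy = "<a class='nav' href=http://hunch.net/>home</a>"  # extra attribute, no quotes

print(pattern.findall(clean))
print(pattern.findall(messy))
```

The first fragment yields the link, while the second, which a browser would still render, yields nothing.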

And finally close the connection with the database.

In [None]:
conn.close()

## 1.2 Finishing the warm up project with MongoDB storage

In [None]:
from urllib.request import urlopen
from urllib.error import HTTPError

import time
#Import pymongo
import pymongo


def getLinks(html, max_links=10):
    url = []
    cursor = 0
    nlinks=0
    while (cursor>=0 and nlinks<max_links):
        start_link = html.find("a href",cursor)
        if start_link==-1:
            return url
        start_quote = html.find('"', start_link)
        end_quote = html.find('"', start_quote + 1)
        url.append(html[start_quote + 1: end_quote])
        cursor = end_quote+1
        nlinks = nlinks +1
    return url

class Spider:
    def __init__(self,starting_url,crawl_domain,max_iter):
        self.crawl_domain = crawl_domain
        self.max_iter = max_iter
        self.links_to_crawl=[]
        self.links_to_crawl.append(starting_url)
        self.links_visited=[]
        self.collection=[]
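        # Create the connection to MongoDB
        try:
            self.conn = pymongo.MongoClient()
            # MongoClient connects lazily, so force a round trip to the server
            self.conn.admin.command('ping')
            print ("Connection to Mongo Daemon successful!!!")
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB: %s" % e )
        # Use the client created above, not a global connection
        self.db = self.conn['crawlerDB2019']
        self.collection = self.db[starting_url+'DB']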

        
    def retrieveHtml(self):
        try:
            socket = urlopen(self.url)
            self.html = socket.read().decode('latin-1')
            return 0
        except HTTPError:
            # Most probably an url not found 404, possibly due to malformating of the links in retrieveAndValidateLinks
            return -1
             
    def run(self):
        #Count the documents stored in the MongoDB collection instead of the list length
        while (len(self.links_to_crawl)>0 and self.collection.count_documents({})<self.max_iter):
            self.url = self.links_to_crawl.pop(0)
            print (self.links_to_crawl)
            self.links_visited.append(self.url)
            if self.retrieveHtml()>=0:
                self.storeHtml()
                self.retrieveAndValidateLinks()
        self.conn.close()
    
    def retrieveAndValidateLinks(self):
        tmpList=[]
        items = getLinks(self.html,max_links=50)
        # Check the validity of a link
        for item in items:
            item = item.strip('"')
            if '.pdf' not in item:
                if self.crawl_domain in item:
                    tmpList.append(item)
                else:
                    if ":" not in item: #Take care of http:// https:// and mailto:
                        tmpList.append(self.crawl_domain+item)
        # Check that the link has not been previously retrieved or is currently on the links_to_crawl list
        for item in tmpList:
            if item not in self.links_visited:
                if item not in self.links_to_crawl:
                    self.links_to_crawl.append(item)
                    print ('Adding: '+item)
                
    def storeHtml(self):
        doc = {}
        doc['url'] = self.url
        doc['date'] = time.strftime("%d/%m/%Y")
        doc['html'] = self.html
        #Insert in the collection
        self.collection.insert_one(doc)

In [None]:
spider = Spider('http://hunch.net','http://hunch.net/',20)
spider.run()


In [None]:
conn = pymongo.MongoClient()


In [None]:
print (conn.list_database_names())
db = conn['crawlerDB2019']

In [None]:
db.list_collection_names()

In [None]:
collection = db['http://hunch.netDB']
collection.estimated_document_count()

In [None]:
for doc in collection.find():
    print (doc['url'])
    print (doc['date'])

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**PROS and CONS:**
<p>
**MongoDB** querying is powerful but based on basic string operations. This tells us that storing full HTML pages is not going to be efficient for retrieval. In fact, we will see that it is important to break the information into the pieces we really want. However, storing the raw pages is a good starting point before post-processing if we are not sure what we are going to do with the data or if further scraping is going to take long. </p>
</div>

In the next section we will see more efficient ways of dealing with web based data.

<div class = "alert alert-info" style = "border-radius:10px;border-width:3px;border-color:darkblue;font-family:Verdana,sans-serif;font-size:16px;">

**URLLIB** is good for getting simple things. In the end you are left with a large HTML string that you want to do something with. 
So the next thing you want to do is to parse the data, and you want to do it in the same way you interact with the web page. You see a menu, a frame on the left side, a nice colorful block where the price for your flight is. So **you want to parse data the way you see data in the webpage so that you can target it**.
</div>

# 2. Using the API

Recall the **big picture**. If we are targeting specific data, we should check whether the web site has a programmatic interface for querying. If it has one, we can use it.



<img style="border-radius:20px;" src="./files/big_picture.jpg">

A standard way of communicating programmatically with a web service is to use its API (Application Programming Interface) whenever one is provided. 


For example, Twitter provides several APIs. The two most important ones are the RESTful API for static queries (e.g. user's friends and followers, check timelines, etc) and the Streaming API for retrieving live data. The REST API identifies Twitter applications and users using OAuth; responses are available in JSON. The Streaming API should not need authentication.

Ex. 

https://api.twitter.com/oauth/authenticate?oauth_token=XXXXXXXXXXXXXX

https://api.twitter.com/1.1/followers/ids.json?cursor=-1&screen_name=my_user_name&count=5000

Building these queries is not always easy, thus we may use a wrapper around the API. This is what, for example, **tweepy** does.

Using the API with authentication (needed for the RESTful API)

From wikipedia:

>"Web service APIs that adhere to the architectural constraints are called RESTful. HTTP based RESTful APIs are defined with these aspects:

> <ul><li>base URI (Uniform Resource Identifier), such as http://example.com/resources/</li>
<li>an Internet media type for the data. This is often JSON but can be any other valid Internet media type (e.g. XML, Atom, microformats, images, etc.)</li>
<li>standard HTTP methods (e.g., GET [retrieve], PUT[idempotent update/create], POST[update/create], or DELETE)</li>
<li>hypertext links to reference state</li>
<li>hypertext links to reference related resources"</li>
</ul>


## 2.1 Making your own API Interface using Python



Using standard tools we can query the API directly by building the request URL ourselves and sending it with `urllib.request`. 

Reference for the NASA API: https://api.nasa.gov/api.html#imagery

**From NASA web-site:**
    

This endpoint retrieves the Landsat 8 image for the supplied location and date. The response will include the date and URL to the image that is closest to the supplied date. The requested resource may not be available for the exact date in the request. You can retrieve a list of available resources through the assets endpoint.

The cloud score is an optional calculation that returns the percentage of the queried image that is covered by clouds. If False is supplied to the cloud_score parameter, then no keypair is returned. If True is supplied, then a keypair will always be returned, even if the backend algorithm is not able to calculate a score. Note that this is a rough calculation, mainly used to filter out exceedingly cloudy images.

HTTP REQUEST
GET https://api.nasa.gov/planetary/earth/imagery

QUERY PARAMETERS

| Parameter | Type | Default | Description |
|---|---|---|---|
| lat | float | n/a | Latitude |
| lon | float | n/a | Longitude |
| dim | float | 0.025 | width and height of image in degrees |
| date | YYYY-MM-DD | today | date of image; if not supplied, then the most recent image (i.e., closest to today) is returned |
| cloud_score | bool | False | calculate the percentage of the image covered by clouds |
| api_key | string | DEMO_KEY | api.nasa.gov key for expanded usage |

EXAMPLE QUERY
https://api.nasa.gov/planetary/earth/imagery?lon=100.75&lat=1.5&date=2014-02-01&cloud_score=True&api_key=DEMO_KEY
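
The query string of such a request is just a list of `name=value` pairs joined by `&`. As a sketch, `urllib.parse.urlencode` can assemble it from a dictionary, which avoids mistakes when concatenating strings by hand. The variable names below are chosen for this example only:

```python
from urllib.parse import urlencode

base_url = "https://api.nasa.gov/planetary/earth/imagery"

# Parameters of the example query above, in the same order
params = {
    "lon": 100.75,
    "lat": 1.5,
    "date": "2014-02-01",
    "cloud_score": True,
    "api_key": "DEMO_KEY",
}

# urlencode converts each value to text and escapes special characters
query_url = base_url + "?" + urlencode(params)
print(query_url)
```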

The DEMO_KEY only allows a few requests, so we may need to request our own key at api.nasa.gov.

Let's start with the simplest application: just replicate the URL.

In [None]:
#### WARNING: YOU WILL NEED AN API KEY!!!! ##########

import urllib.request

url = "https://api.nasa.gov/planetary/earth/imagery/?lat=41.386792&lon=2.163628&date=2015-02-01&dim=0.3&cloud_score=True&api_key=XXX"
response = urllib.request.urlopen(url)
response.read()

Let us get the answer in a better format

In [None]:
import urllib.request
import json

url = "https://api.nasa.gov/planetary/earth/imagery/?lat=41.386792&lon=2.163628&date=2015-02-01&dim=0.3&cloud_score=True&api_key=XXXXX"
response = urllib.request.urlopen(url)
json_response = json.loads(response.read())
json_response


And now get some data:

In [None]:
f = open('scraped_image.bmp','wb')
data = urllib.request.urlopen(json_response['url']).read()
f.write(data)
f.close()
%matplotlib inline
import matplotlib.pyplot as plt
im=plt.imread('scraped_image.bmp')
plt.imshow(im,interpolation='nearest')

And now, let's go for a more programmatic way of doing this:

In [None]:
import urllib.request
import json

earth_url = 'https://api.nasa.gov/planetary/earth/imagery'

def get_earth_photo(lon, lat, dim=0.1, date = '2015-6-6', api_key='DEMO_KEY'):
    params = { 'lon': lon, 'lat':lat, 'dim':dim, 'api_key': api_key }
    str_params = "/?lat="+str(params['lat'])+"&lon="+str(params['lon'])+"&dim="+str(params['dim'])+"&date="+date+"&api_key="+params['api_key']
    
    response = urllib.request.urlopen(earth_url+str_params)
    json_response = json.loads(response.read())
    return json_response['url']

##### CHANGE api_key with correct ID #############
print(get_earth_photo(2.163628,41.386792,api_key="XXXXXX"))

f = open('scraped_image.bmp','wb')
data = urllib.request.urlopen(get_earth_photo(2.163628,41.386792,dim=0.3,date="2015-02-01",api_key="XXXXX")).read()
f.write(data)
f.close()
%matplotlib inline
import matplotlib.pyplot as plt
im=plt.imread('scraped_image.bmp')
plt.imshow(im,interpolation='nearest')


STATUS VALUES:

+ 200 — everything went okay, and the result has been returned (if any)
+ 301 — the server is redirecting you to a different endpoint. This can happen when a company switches domain names, or an endpoint name is changed.
+ 400 — the server thinks you made a bad request. This can happen when you don’t send along the right data, among other things.
+ 401 — the server thinks you’re not authenticated. This happens when you don’t send the right credentials to access an API (we’ll talk about authentication in a later post).
+ 403 — the resource you’re trying to access is forbidden — you don’t have the right permissions to see it.
+ 404 — the resource you tried to access wasn’t found on the server.
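
In Python, `urlopen` raises an `HTTPError` for status codes such as 401 or 404, and the code is available in the exception's `code` attribute. The following is a minimal sketch of how these status values could be handled; the function names `describe_status` and `safe_fetch` are hypothetical, and only `describe_status` is exercised here so that the example runs without a network connection:

```python
from http import HTTPStatus
from urllib.error import HTTPError
from urllib.request import urlopen


def describe_status(code):
    """Return the status code together with its standard reason phrase."""
    try:
        return "%d %s" % (code, HTTPStatus(code).phrase)
    except ValueError:
        return "%d (unknown status)" % code


def safe_fetch(url):
    """Return (status, body); body is None when the server answers with an error."""
    try:
        with urlopen(url) as response:
            return response.status, response.read()
    except HTTPError as err:
        return err.code, None


for code in (200, 301, 400, 401, 403, 404):
    print(describe_status(code))
```

A caller could then test the returned status before trying to parse the body as JSON.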

## BONUS MATERIAL: Scraping Twitter data with an API wrapper


If we want to use the RESTful API in Twitter we have to follow these steps:
<ul>
<li>From your twitter account we want to generate a token: https://apps.twitter.com</li>
<li>Create a new App. This will create the API keys (consumer keys)</li>
<li>Go to API Keys and generate a token. (access keys)</li>
</ul>

In [None]:
import json
import pymongo
import tweepy

consumer_key = "XXXX"
consumer_secret = "XXXX"

access_key = "XXXX"
access_secret = "XXXXX"

#Authentication
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_key, access_secret)
api = tweepy.API(auth)

#Do something
USER_NAME = "espavilat"
user = api.get_user(id=USER_NAME)

We can access some basic information about the user

In [None]:
user._json

In [None]:
user._json['id']

In [None]:
user.created_at

In [None]:
user.friends_count

In [None]:
user.followers_count

>JSON (JavaScript Object Notation), is an open standard format that uses human-readable text to transmit data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and web application, as an alternative to XML. JSON is a way to encode complicated information in a platform-independent way.  It could be considered the lingua franca of information exchange on the Internet. 
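
As a quick illustration, the standard `json` module converts between JSON text and Python dictionaries. The payload below is invented for this example and only loosely resembles a user record:

```python
import json

# Made-up JSON text, as it could arrive from a web service
raw = '{"screen_name": "example_user", "followers_count": 3, "entities": {"urls": []}}'

parsed = json.loads(raw)              # JSON text -> Python dict
print(parsed["screen_name"])
print(parsed["entities"]["urls"])     # nested objects become nested dicts and lists

# Python dict -> nicely indented JSON text
print(json.dumps(parsed, indent=4, sort_keys=True))
```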

In [None]:
#We can access the full JSON
user._json['created_at']

We can access all the information as if it were a dictionary structure.

In [None]:
juser = user._json
print (juser['created_at'])

We can apply our basic scraping knowledge and use `urllib.request` to retrieve more interesting information, such as the profile image.

In [None]:
img_url = juser['profile_image_url']
print (img_url)

In [None]:
from urllib.request import urlopen

f = open('scraped_image.bmp','wb')
im_str=urlopen(img_url).read()
f.write(im_str)
f.close()
%matplotlib inline
import matplotlib.pyplot as plt
im=plt.imread('scraped_image.bmp')
plt.imshow(im,interpolation='nearest')
plt.title(juser['screen_name'],size=16)

Now we want to retrieve the list of follower ids. There are two ways of doing so, and both use the `api.followers_ids` function. Each call returns a limited number of ids (up to 5000 in the Twitter REST API v1.1), so to get all of them we use a pagination variable, `cursor`. This can be managed directly in the call `api.followers_ids(id, cursor)` or by using a `Cursor` object with the `pages` method, which handles the cursor implicitly. This second method is illustrated in the following lines:

In [None]:
#Retrieving all the followers
import time
ids = []
for page in tweepy.Cursor(api.followers_ids, screen_name=USER_NAME).pages():
    ids.extend(page)
    time.sleep(60)  #Pause between pages to stay within the API rate limits

Notice the `sleep` command. This is needed to respect the rate limits of the Twitter API. 
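
To see what the `Cursor` object does behind the scenes, here is a sketch of manual cursor pagination against a simulated endpoint. Nothing below calls Twitter: `fake_followers_ids`, the follower ids and the page size are all invented, and the convention that `-1` requests the first page and `0` signals the last one is an assumption of this simulation, modelled on the `cursor=-1` parameter of the example URL shown earlier:

```python
FAKE_FOLLOWERS = list(range(1, 12))  # 11 made-up follower ids
PAGE_SIZE = 5


def fake_followers_ids(cursor=-1):
    """Simulated endpoint: returns (page_of_ids, next_cursor)."""
    start = 0 if cursor == -1 else cursor
    page = FAKE_FOLLOWERS[start:start + PAGE_SIZE]
    end = start + len(page)
    # A next_cursor of 0 tells the caller that there are no more pages
    next_cursor = end if end < len(FAKE_FOLLOWERS) else 0
    return page, next_cursor


all_ids = []
cursor = -1
while cursor != 0:
    page, cursor = fake_followers_ids(cursor)
    all_ids.extend(page)
    # With the real API, a pause would go here to respect the rate limits

print(all_ids)
```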

In [None]:
#friends (screen_name) or follower_ids
ids

In [None]:
document={}
document['user'] = user.id
document['followers'] = ids[:]

# Create the connection to MongoDB
try:
    conn = pymongo.MongoClient()
    # MongoClient connects lazily, so force a round trip to the server
    conn.admin.command('ping')
    print ("Connection to Mongo Daemon successful!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )
db = conn['twitter']
collection = db['twitter_users']
collection.insert_one(document)

In [None]:
for doc in collection.find():
    print (doc)

In [None]:
doc['user']

In [None]:
doc['followers']

<div class = "alert alert-error" style = "border-radius:10px;border-width:3px;border-color:darkred;font-family:Verdana,sans-serif;font-size:16px;"> **TAKE HOME EXERCISE:** Given a starting user ID, retrieve the user ids of the followers up to two depth levels, that is, the followers of the followers of the named user. This information creates a network of influence that will be used in upcoming sessions.
</div>

The **Streaming API** works by making a request for a specific type of data — filtered by keyword, user, geographic area, or a random sample — and then keeping the connection open as long as there are no errors in the connection. The data you get back will be encoded in JSON. 

One of the main use cases of tweepy is monitoring tweets and performing actions when some event happens. A key component for this is the `StreamListener` object, which monitors tweets in real time and captures them.

If we check the official Twitter streaming API documentation, we see that there are several modifiers for filtering the stream, e.g. `track` (filter by keyword), `locations` (filter by geographic location), etc.

StreamListener has several methods, with on_data() and on_status() being the most useful ones. Here is a sample program which implements this behavior:

In [None]:
import json
from tweepy import Stream, StreamListener

class listener(StreamListener):
    def on_data(self, data):
        # Pretty-print the raw JSON message
        parsed = json.loads(data)
        print (json.dumps(parsed, indent=4, sort_keys=True))
        return True
    def on_error(self, status):
        print ('ERROR')
        print (status)

Get the Twitter data filtered by location, restricted to the following bounding box (coordinates obtained with http://boundingbox.klokantech.com):

<img style = "border-radius:10px;" src="./files/ub_location.png">
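
To make the format of the `locations` filter concrete: according to the Twitter documentation, a bounding box is given as longitude, latitude pairs, with the southwest corner first and the northeast corner second. The sketch below checks whether a point falls inside such a box. The helper `in_bounding_box` is an illustration written for this notebook, not part of tweepy, and the two test points are arbitrary examples.

```python
def in_bounding_box(lon, lat, box):
    """Return True if (lon, lat) lies inside box = [west, south, east, north]."""
    west, south, east, north = box
    return west <= lon <= east and south <= lat <= north

# Barcelona bounding box, in the same order expected by the `locations` filter
BARCELONA = [2.0504377635, 41.2787636541, 2.3045074059, 41.4725622346]

print(in_bounding_box(2.1635, 41.3867, BARCELONA))    # a point in central Barcelona
print(in_bounding_box(-3.7038, 40.4168, BARCELONA))   # a point in Madrid
```

The order matters: swapping longitude and latitude is a common mistake that makes the filter match a completely different region.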

In [None]:
twitterStream = Stream(auth, listener()) 
twitterStream.filter(locations=[2.1622322352,41.385987385,2.1651827408,41.3877173586])

In [None]:
# Other examples
twitterStream = Stream(auth, listener()) 
#twitterStream.filter(track=["datascience"])
#Use http://boundingbox.klokantech.com to get the Barcelona bounding box
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])

In [None]:
from tweepy import Stream,StreamListener

class listener(StreamListener):
    def on_data(self, status):
        json_data=json.loads(status)
        print (str(json_data["user"]["screen_name"])+' : ' + json_data["text"])
        return True
    
    def on_error(self, status):
        print ('Error')
        print (status)
        
# Catch all tweets in Barcelona area and print them
twitterStream = Stream(auth, listener()) 
#twitterStream.filter(locations=[2.1622322352,41.385987385,2.1651827408,41.3877173586])
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])

Let us fill the class in order to capture and store the data in a MongoDB database.

In [None]:
from tweepy import Stream,StreamListener

class listener(StreamListener):
    def __init__(self):
        # Initialize the parent StreamListener
        super(listener, self).__init__()
        try:
            self.conn=pymongo.MongoClient()
            print ("Connection to Mongo Daemon successful!!!")
        except pymongo.errors.ConnectionFailure as e:
            print ("Could not connect to MongoDB: %s" % e )
        # Use the connection owned by this listener, not a global variable
        self.db = self.conn['twitter_stream']
        self.collection = self.db['tweets']
    
    def on_data(self, status):
        jdata = json.loads(status)
        if 'android' in jdata["source"]:
            device = "android"
        else:
            device = "apple"
        document={'text':jdata["text"], 'created':jdata["created_at"], 'screen_name':jdata["user"]["screen_name"], 'device':device}        
        self.collection.insert_one(document)
        print (document)
        return True
    
    def on_error(self, status):
        print ('ERROR')
        print (status)

# Catch all tweets in the Barcelona area, store them in MongoDB and print them
twitterStream = Stream(auth, listener()) 
twitterStream.filter(locations=[2.0504377635,41.2787636541,2.3045074059,41.4725622346])
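
The stream can also deliver messages that are not tweets (for example limit notices), and these lack fields such as `text` or `source`. As a hedged sketch, the document-building step of `on_data` can be isolated in a plain function that returns `None` for such messages. The function name `tweet_to_document` and the sample messages are invented for illustration. Unlike the class above, this variant labels non-Android sources as `"other"` instead of assuming `"apple"`.

```python
import json

def tweet_to_document(raw):
    """Parse one raw stream message; return a document dict, or None if it is not a tweet."""
    jdata = json.loads(raw)
    if "text" not in jdata or "user" not in jdata:
        return None   # not a tweet, e.g. a limit notice
    source = jdata.get("source") or ""
    device = "android" if "android" in source.lower() else "other"
    return {
        "text": jdata["text"],
        "created": jdata.get("created_at"),
        "screen_name": jdata["user"]["screen_name"],
        "device": device,
    }

# Invented sample messages, shaped like the fields used in this notebook
sample = json.dumps({
    "text": "Hello Barcelona",
    "created_at": "Mon Sep 01 10:00:00 +0000 2014",
    "user": {"screen_name": "example_user"},
    "source": '<a href="http://twitter.com/download/android">Twitter for Android</a>',
})
notice = json.dumps({"limit": {"track": 5}})

print(tweet_to_document(sample))
print(tweet_to_document(notice))
```

Inside the listener, `on_data` would call this function and insert the result only when it is not `None`.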

In [None]:
#Check captured data
try:
    conn=pymongo.MongoClient()
    print ("Connection to Mongo Daemon successful!!!")
except pymongo.errors.ConnectionFailure as e:
    print ("Could not connect to MongoDB: %s" % e )

db = conn['twitter_stream']
collection = db['tweets']
print (collection.count_documents({}))
for doc in collection.find():
    print (doc)

In [None]:
print (conn.list_database_names())
db = conn['twitter_stream']
coll = db.tweets
for item in coll.find():
    print (item['device'])

APIs are nice. Most large web sites provide useful APIs, e.g. Google, OpenStreetMap, Facebook, etc., subject to some terms of use. However, most web sites do not provide any kind of access to their data. What to do then?