**<center>Accessing data from the web</center>**
***<center>Crawling, scraping, and APIs</center>***

<center>Snorre Ralund</center>

**Today's message**
* Use the data sources around you (e.g. job data and crime data).
* Know how to create your own custom datasets by pulling information from many different sources.
* You should know all the tricks, but use them with care.

**Agenda**

* The basics of web scraping
    * Connecting, crawling, parsing, storing, logging.
* Hacks: backdoors, URL construction, and analysis of a webpage.
* Reliability of your data collection.
* Screen scraping: automated browsing
    * Interactions:
        * Logging in, scrolling, pressing buttons.
* APIs 
    * Authentication
    * Building queries

## Ethics / Legal Issues
* If a regular user can't access it, we shouldn't try to get it. [That is considered hacking](https://www.dr.dk/nyheder/penge/gjorde-opmaerksom-paa-cpr-hul-nu-bliver-han-politianmeldt-hacking).
* Don't hit the site too fast: that is essentially a denial-of-service (DoS) attack. [Again considered hacking](https://www.dr.dk/nyheder/indland/folketingets-hjemmeside-ramt-af-hacker-angreb).
* Add headers stating your name and email to your requests to ensure transparency.
* Be careful with copyrighted material.
* Respect fair use (don't take everything).
* If you monetize the data, be careful not to compete directly with those you are taking the data from.

<img src="https://github.com/snorreralund/images/raw/master/Sk%C3%A6rmbillede%202017-08-03%2014.46.32.png"/>

## Setting up the essentials:
Good practices:
* Transparency
* Rate limiting
* Reliability

In [38]:
# Transparent scraping
import requests
#response = requests.get('https://www.google.com') # the url: the address of the site and instructions on where you want to go.
session = requests.Session()
session.headers['email'] = 'youremail' 
session.headers['name'] = 'name'
session.headers # Who you are, what format you want, and authentication.

{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'email': 'youremail', 'name': 'name'}

A quick tip: you can change the user agent to that of a cellphone to get more simply formatted HTML.
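
A minimal sketch of that tip, assuming the site actually serves a lighter page to phone browsers (not all sites do). The User-Agent string is only an illustrative example of a phone browser, and `build_headers` is a hypothetical helper, not part of `requests`.

```python
# Illustrative phone User-Agent string (an assumption; any real phone browser string should work)
MOBILE_USER_AGENT = (
    'Mozilla/5.0 (iPhone; CPU iPhone OS 11_0 like Mac OS X) '
    'AppleWebKit/604.1.38 (KHTML, like Gecko) Version/11.0 Mobile/15A372 Safari/604.1'
)

def build_headers(name, email, mobile=False):
    """Hypothetical helper: headers that identify you, optionally with a phone User-Agent."""
    headers = {'name': name, 'email': email}
    if mobile:
        headers['User-Agent'] = MOBILE_USER_AGENT
    return headers

headers = build_headers('name', 'youremail', mobile=True)
# Apply to the session defined above with: session.headers.update(headers)
```

Keeping your name and email in the headers means the request stays transparent even though the user agent has changed.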

In [39]:
# Control the pace of your calls
import time
def ratelimit():
    "A function that handles the rate of your calls."
    time.sleep(1)
# Reliable requests
def get(url,iterations=10,exceptions=(Exception,)):
    """This function ensures that your script does not crash from connection errors.
        iterations : Number of attempts before giving up. 
        exceptions: Tuple of exceptions to catch and retry on; the default catches all. 
    """
    for iteration in range(iterations):
        try:
            ratelimit() # wait before every call, including retries
            response = session.get(url)
            return response # if successful, the loop ends here
        except exceptions as e: # find specific exceptions in requests.exceptions
            print(e) # print or log the exception message.
    return None # all attempts failed; check for None before using the result, or your code will crash later.
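
Because `get` returns `None` when every attempt fails, the result needs checking before use. One possible sketch, assuming a response counts as usable when it exists and has status code 200; `check_response` is a hypothetical name, and you may want to accept other status codes.

```python
def check_response(response):
    """Return True if the call produced a usable response.

    Assumes `response` is either None (all retries failed) or an object
    with a `status_code` attribute, like a requests.Response.
    """
    if response is None:
        return False
    return response.status_code == 200

class FakeResponse:
    """Stand-in for requests.Response, only used to demonstrate the check offline."""
    def __init__(self, status_code):
        self.status_code = status_code

print(check_response(None))               # False
print(check_response(FakeResponse(200)))  # True
print(check_response(FakeResponse(429)))  # False
```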

In [None]:
# Interactive browsing - more on this later
from selenium import webdriver
path2gecko = '/Users/axelengbergpallesen/Downloads/geckodriver' # define path to your geckodriver
browser = webdriver.Firefox(executable_path=path2gecko)
browser.get('https://www.google.com')

In [None]:
response = get('https://www.google.com') # get() requires a url; this is the example address used above

**HTML is a mess.**

For now, let's look at how to collect well-structured data.

# APIs
For fast, efficient and ***reliable*** data collection.

The only catch is that the provider controls how much data you can collect and which endpoints you can collect it from.

They also **change**.
    - e.g. Facebook cancelled 
        - querying friendship relations (without having users sign up to your app), 
        - group activity without admin rights, 
        - and most recently the ability to trace public activity (likes and comments) without admin rights.
    - Twitter (and more recently Facebook) will not let you collect all historic activity, so you need to collect it as streaming data.

Working with an API begins with reading the docs:
- getting authentication: creating apps, getting and renewing tokens,
- building queries,
- rate limiting and pagination.
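
Building a query mostly means encoding parameters into the URL. A small sketch using only the standard library, assuming the Twitter search endpoint used later in this notebook; `q` and `count` are parameters of that endpoint, but check the docs for current limits. `build_query_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_query_url(endpoint, **params):
    """Hypothetical helper: append URL-encoded query parameters to an endpoint."""
    return endpoint + '?' + urlencode(params)

url = build_query_url('https://api.twitter.com/1.1/search/tweets.json', q='#dkpol', count=100)
print(url)  # https://api.twitter.com/1.1/search/tweets.json?q=%23dkpol&count=100
```

`requests` does the same encoding for you if you pass a dictionary as `params=` to `get`, which is usually less error-prone than formatting strings by hand.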

Responses often come in JSON format, which parses into nested dictionaries and lists.
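
To illustrate what nested dictionaries and lists look like in practice, here is a sketch with a made-up payload that only mimics the shape of a search response; the field values are invented.

```python
import json

# Made-up payload that only mimics the shape of a search response
raw = '{"statuses": [{"id": 1, "text": "hello", "user": {"screen_name": "alice"}}], "search_metadata": {"count": 1}}'

data = json.loads(raw)                      # JSON object -> dict, JSON array -> list
first_status = data['statuses'][0]          # index into the list of statuses
print(first_status['user']['screen_name'])  # alice
print(data['search_metadata']['count'])     # 1
```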

Example: Explore the Facebook API here: https://developers.facebook.com/tools/explorer/

## APIs: Collect data from Twitter

In [17]:
# XXX: Go to http://dev.twitter.com/apps/new to create an app and get values
# for these credentials that you'll need to provide in place of these
# empty string values that are defined as placeholders.
# See https://dev.twitter.com/docs/auth/oauth for more information 
# on Twitter's OAuth implementation

CONSUMER_KEY=""
CONSUMER_SECRET=""
OAUTH_TOKEN=""
OAUTH_TOKEN_SECRET=""

In [20]:
import pickle
#pickle.dump([CONSUMER_KEY,CONSUMER_SECRET,OAUTH_TOKEN,OAUTH_TOKEN_SECRET],open('twitter_credentials.pkl','wb'))
with open('twitter_credentials.pkl','rb') as f: # load credentials saved earlier with the line above
    CONSUMER_KEY,CONSUMER_SECRET,OAUTH_TOKEN,OAUTH_TOKEN_SECRET = pickle.load(f)

In [87]:
# How to use this authentication?
# answer from https://stackoverflow.com/questions/33308634/how-to-perform-oauth-when-doing-twitter-scrapping-with-python-requests
from requests_oauthlib import OAuth1

url ='https://api.twitter.com/1.1/search/tweets.json?q=bitcoin' 
auth = OAuth1(CONSUMER_KEY, CONSUMER_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
response = requests.get(url, auth=auth)

In [None]:
# See the rate limits by looking at the response headers
response.headers
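
Twitter's v1.1 API reports rate limits in the `x-rate-limit-remaining` and `x-rate-limit-reset` headers (the latter is a Unix timestamp). A sketch of how you might turn them into a waiting time; `seconds_to_wait` is a hypothetical helper and the header values below are invented for illustration.

```python
import time

def seconds_to_wait(headers, now=None):
    """Hypothetical helper: how long to sleep before the next call.

    Returns 0 while calls remain in the current window, otherwise the
    number of seconds until the window resets.
    """
    now = time.time() if now is None else now
    remaining = int(headers.get('x-rate-limit-remaining', 1))
    reset = int(headers.get('x-rate-limit-reset', 0))
    if remaining > 0:
        return 0
    return max(0, reset - now)

# Invented header values; in practice pass response.headers
example = {'x-rate-limit-remaining': '0', 'x-rate-limit-reset': '1500000900'}
print(seconds_to_wait(example, now=1500000000))  # 900
```

A value like this could replace the fixed `time.sleep(1)` in `ratelimit` once you know the API's window.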

In [40]:
# We want to include the authentication in all calls to Twitter
session.auth = auth

In [89]:
response_json = response.json()

In [84]:
# Now let's inspect the response
import pandas as pd
status_df = pd.DataFrame(response_json['statuses']) 
#response_json['statuses'][4]

# Case: Collecting data about Danish politics 

## searching for hashtags

In [83]:
url ='https://api.twitter.com/1.1/search/tweets.json?q=dkpol' 
response = get(url) # make the call; session.auth was set above
response_json = response.json()
#response_json['statuses'][4].keys()

In [91]:
#response_json['statuses'][12]['retweeted_status']#['retweet_count']


### paging results

In [92]:
#! mkdir  # create directory for the data

In [None]:
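base_path = '' # directory for the raw data
search_query = 'dkpol' # the query used in the search call above
search_url = 'https://api.twitter.com/1.1/search/tweets.json'
for i in range(5): # get the next 5 pages of results
    metadata = response_json['search_metadata']
    if 'next_results' not in metadata: # no more pages to fetch
        break
    # next_results is only a query string ('?max_id=...&q=...'), so prepend the endpoint
    next_link = search_url + metadata['next_results']
    max_id = metadata['max_id']
    # make another call
    response = get(next_link)
    # dump the raw data
    filename = base_path+'search_%s_%d'%(search_query,max_id)
    with open(filename,'w') as f:
        f.write(response.text)
    response_json = response.json() # update so the next iteration follows the new paging link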

    

## getting retweets

In [74]:
tweet_id = response_json['statuses'][12]['retweeted_status']['id']

In [79]:

# get retweets
url = 'https://api.twitter.com/1.1/statuses/retweets/%d.json'%tweet_id
response = get(url)

In [95]:
# dump the data as network data.
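
One way to fill in the cell above, sketched under the assumption that each retweet object carries the retweeter under `user` and the original tweet under `retweeted_status`, as in Twitter's v1.1 tweet objects. The sample data and the file name `retweet_edges.csv` are invented for illustration.

```python
import csv

def retweet_edges(retweets):
    """Turn a list of retweet objects into (retweeter, original_author) pairs."""
    edges = []
    for tweet in retweets:
        source = tweet['user']['screen_name']
        target = tweet['retweeted_status']['user']['screen_name']
        edges.append((source, target))
    return edges

# Invented sample; in practice use response.json() from the retweets call
sample = [
    {'user': {'screen_name': 'alice'}, 'retweeted_status': {'user': {'screen_name': 'carol'}}},
    {'user': {'screen_name': 'bob'}, 'retweeted_status': {'user': {'screen_name': 'carol'}}},
]
edges = retweet_edges(sample)

# Write an edge list that network tools can read
with open('retweet_edges.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['source', 'target'])
    writer.writerows(edges)
```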

## Exercise: Monitoring Danish politicians on Twitter
A Twitter user with the handle *politikere* has been kind enough to curate a list of Danish politicians.

**1)** Figure out how to construct an API call to collect who this user follows. (Look in the API reference index)

**2)** Next you should construct an API call to retrieve the statuses of those politicians.

**Extra**
- Make a loop to collect who each politician follows (beware of rate limits).
- Construct a network of politicians using the NetworkX package.