In [28]:
from notebook.services.config import ConfigManager
cm = ConfigManager()
cm.update('livereveal', {
        'width': int(1024*1.2),
        'height': 768,
        'scroll':True
})

{'height': 768, 'scroll': True, 'width': 1228}

**<center>Accessing data from the web</center>**
***<center>Crawling, scraping, and APIs</center>***

<center>Snorre Ralund</center>

**Today's message**
* Utilize the data sources around you (job data and crime).
* Know how to create your own custom datasets by pulling information from many different sources.
* You should know all the tricks, but use them with care.

**Agenda**

* The basics of webscraping
    * Connecting, Crawling, Parsing, Storing, Logging.
* Hacks: Backdoors, URL construction, and analysis of a webpage.
* Reliability of your data collection.
* Screen-scraping - Automated browsing
    * Interactions:
        * Logging in, scrolling, pressing buttons.
* APIs 
    * Authentication
    * Building queries

## Ethics / Legal Issues
* If a regular user can't access it, we shouldn't try to get it. [That is considered hacking](https://www.dr.dk/nyheder/penge/gjorde-opmaerksom-paa-cpr-hul-nu-bliver-han-politianmeldt-hacking).
* Don't hit a site too fast: that is essentially a denial-of-service (DoS) attack. [Again considered hacking](https://www.dr.dk/nyheder/indland/folketingets-hjemmeside-ramt-af-hacker-angreb).
* Add headers stating your name and email to your requests to ensure transparency.
* Be careful with copyrighted material.
* Fair use (don't take everything).
* If you monetize the data, be careful not to be in direct competition with the site you are taking the data from.

<img src="https://github.com/snorreralund/images/raw/master/Sk%C3%A6rmbillede%202017-08-03%2014.46.32.png"/>

## Setting up the essentials:
Good practices:
* Transparency
* Ratelimiting
* Reliability

In [2]:
# Transparent scraping
import requests
#response = requests.get('https://www.google.com')
session = requests.session()
session.headers['email'] = 'youremail' 
session.headers['name'] = 'name'
session.headers

{'User-Agent': 'python-requests/2.18.4', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive', 'email': 'youremail', 'name': 'name'}

A quick tip: you can change the user agent to that of a cellphone to obtain simpler HTML formatting.

In [3]:
# Control the pace of your calls
import time
def ratelimit():
    "A function that handles the rate of your calls."
    time.sleep(1)
# Reliable requests
def request(url,iterations=10,exceptions=(Exception,)):
    """This function ensures that your script does not crash from connection errors.
        iterations : Number of attempts before giving up.
        exceptions : Tuple of exceptions to catch and retry on; the default catches all.
    """
    for iteration in range(iterations):
        try:
            ratelimit() # pause before every call
            response = session.get(url)
            return response # if successful, the loop ends here
        except exceptions as e: # more specific exceptions are found in requests.exceptions
            print(e) # print or log the exception message.
    return None # all attempts failed; check for None before using the result, or your code will crash later.
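
The `request` function returns `None` when every attempt fails, so downstream code needs a check before it touches `response.text`. Below is a minimal sketch of such a check. The name `check_response` and the rule that only status 200 with a non-empty body counts as success are assumptions made for illustration, not part of the original notebook.

```python
from types import SimpleNamespace


def check_response(response):
    """Return True if the response looks usable, False otherwise.

    Hypothetical helper: works on any object exposing `status_code`
    and `text`, such as the objects returned by `requests`.
    """
    if response is None:  # request() gave up after all iterations
        return False
    if response.status_code != 200:  # assumption: only 200 counts as success
        return False
    return len(response.text) > 0  # an empty body is treated as a failure


# Stand-ins for real responses, so the sketch runs without network access.
ok = SimpleNamespace(status_code=200, text='<html>...</html>')
missing = SimpleNamespace(status_code=404, text='Not found')

print(check_response(ok))
print(check_response(missing))
print(check_response(None))
```

In a scraping loop you would skip (and log) any url for which the check fails instead of parsing it.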

In [84]:
# Interactive browsing
from selenium import webdriver
path2gecko = '/Users/axelengbergpallesen/Downloads/geckodriver' # define path to your geckodriver
browser = webdriver.Firefox(executable_path=path2gecko)
browser.get('https://www.google.com')

Now we are ready to get some data 

# Collecting data from the web: a quick example

So let's get some data.

Let's say we wanted to obtain high-frequency, geolocated information about the Danish job market. Then we might build a scraper for [jobfinder.dk](https://www.jobfinder.dk/), [jobindex.dk](https://www.jobindex.dk/) or [jobnet.dk](https://job.jobnet.dk/CV/FindWork/).


In [16]:
url = 'https://www.jobfinder.dk/jobs/category/it-konsulent'
base = 'https://www.jobfinder.dk'
response = request(url)

In [32]:
response.text.split('<a ')[100].split('href="')[1].split('"')[0]

'/job/business-analytics-consultants-41676'

In [11]:
#print(response.text) # inspect the response to see if it is what you expected.

In [35]:
html = response.text
link_nodes = html.split('user-jobs__content-item')[1:]


In [38]:
for node in link_nodes[0:5]:
    link_node = node.split('h2')[1]
    link = link_node.split('href="')[1]
    print(base+link.split('"')[0])
    

https://www.jobfinder.dk/job/serviceminded-it-supportkonsulent-43311
https://www.jobfinder.dk/job/erfarne-it-konsulenter-med-ambitioner-40763
https://www.jobfinder.dk/job/make-impact-it-consultant-33146
https://www.jobfinder.dk/job/start-din-karriere-som-information-management-konsulent-33672
https://www.jobfinder.dk/job/dygtige-testkonsulenter-38875


In [25]:
# define path
base_path = 'scraping_examples/'
filename = base_path+'jobfinder_data'
f = open(filename,'w')

## Loop over the links and dump the data

In [26]:
# make directory
#! mkdir scraping_examples
#! ls 
import json
data = {'data':[],'meta_data':[]}
json.dumps(data)

'{"data": [], "meta_data": []}'

In [27]:
 
import json
# make loop
html = response.text
link_nodes = html.split('user-jobs__content-item')[1:]
for node in link_nodes[0:5]:
    link_node = node.split('h2')[1]
    link = link_node.split('href="')[1].split('"')[0] # keep only the href value
    new_url = base+link
    response = request(new_url)
    html = response.text
    data = {'time': time.time(), 'url':new_url,'raw_data':html}
    
    f.write(json.dumps(data))
    f.write('\n') # one json document per line
f.close()
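
Reading the dump back is the mirror image of writing it. The sketch below writes a couple of records as one JSON document per line and loads them again. The records and the temporary path are made up for illustration; in the notebook you would point `path` at your own `jobfinder_data` file.

```python
import json
import os
import tempfile
import time

# Hypothetical records standing in for the scraped pages.
records = [
    {'time': time.time(), 'url': 'https://example.com/job/1', 'raw_data': '<html>1</html>'},
    {'time': time.time(), 'url': 'https://example.com/job/2', 'raw_data': '<html>2</html>'},
]

path = os.path.join(tempfile.mkdtemp(), 'jobfinder_data')

# Write one JSON document per line. json.dumps escapes newlines inside
# strings, so a plain '\n' is a safe record separator.
with open(path, 'w') as f:
    for record in records:
        f.write(json.dumps(record) + '\n')

# Read the file back line by line, skipping empty lines.
with open(path) as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(len(loaded), loaded[0]['url'])
```

Using `with open(...)` closes the file even if a request fails halfway through the loop.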

## Exercise 1
Get links to the articles listed on [this search page](https://www.dr.dk/search/Result?filter_facet_universe=Nyheder&query=kv17&sort=Newest).

Loop over the links and save them to a file.

**Paging extra**
* Use selenium to click on more results, or figure out the paging mechanism by looking at the activity in the network monitor of your browser.

**Reliability extra**
* When writing the file, record the time of the request (import the time module), the link, and any other relevant information.

In [39]:

url = 'https://www.dr.dk/search/Result?filter_facet_universe=Nyheder&query=kv17&sort=Newest' 
response = request(url)
html = response.text
sep = 'item image-2'
link_items = html.split(sep)[1:]
for item in link_items:
    link = item.split('href="')[1]
    print(link.split('"')[0])

http://dr.dk/nyheder/politik/kv17/alternativet-faar-sin-foerste-borgmesterpost-fanoe-bliver-groen
http://dr.dk/nyheder/politik/kv17/vordingborg/romantisk-byraadsmoede-borgmester-faar-baade-kaede-og-kone
http://dr.dk/nyheder/politik/kv17/slagelse/buhraab-under-byraadsmoede-i-slagelse-saa-er-boernehaven-i-gang
http://dr.dk/nyheder/politik/kv17/slagelse/vraget-venstre-borgmester-i-slagelse-min-person-stod-i-vejen
http://dr.dk/nyheder/politik/kv17/slagelse/demonstration-forud-omstridt-moede-i-slagelse-det-er-noget-ged
http://dr.dk/nyheder/politik/kv17/analyse-25-ting-vi-laerte-af-kommunalvalget-2017
http://dr.dk/nyheder/politik/kv17/aalborg/trods-stort-valgnederlag-i-aalborg-fortsat-opbakning-til-v-frontfigur
http://dr.dk/nyheder/politik/kv17/slagelse/folk-i-slagelse-er-rystede-borgmesterkamp-er-barnlig-og-pinlig
http://dr.dk/nyheder/politik/kv17/slagelse/rod-i-aftale-i-slagelse-de-blaa-siger-nej-og-enhedslisten-vil-ikke
http://dr.dk/nyheder/politik/kv17/nordjylland/ulla-astman-efter-glipp

# Tricks: Backdoors, pseudo-apis, and url construction. 


## Static or dynamic html pages
Your browser will either receive a set of instructions (in JavaScript) on how to build a page and how interactions will change it, or it will receive a pre-compiled HTML document.

Let's inspect another job site: [jobnet.dk](https://job.jobnet.dk/CV/FindWork/).

In [33]:
# Another example
url = 'https://job.jobnet.dk/CV/FindWork/' # define the link
# get the raw html
response = request(url)
html = response.text

In [34]:
html # inspect to see if it matches



This looks a lot shorter, which points to the fact that the page is built dynamically.

Now we can search for the instructions and the data that the page uses to construct the front page.


In [40]:
# get data through the backdoor
# backdoor link
url = 'https://job.jobnet.dk/CV/FindWork/Search?offset=20'
response = request(url)

In [42]:

for offset in range(0,16520,20)[0:5]:
    url = 'https://job.jobnet.dk/CV/FindWork/Search?offset=%d'%offset
    print(url)

https://job.jobnet.dk/CV/FindWork/Search?offset=0
https://job.jobnet.dk/CV/FindWork/Search?offset=20
https://job.jobnet.dk/CV/FindWork/Search?offset=40
https://job.jobnet.dk/CV/FindWork/Search?offset=60
https://job.jobnet.dk/CV/FindWork/Search?offset=80


In [48]:
import json
data = json.loads(response.text)
data['JobPositionPostings'][0]['Location']
#.json() 

{'Latitude': 55.6726, 'Longitude': 12.5541}

In [49]:
n_links = 103575
for page_num in range(int(n_links/20))[0:5]:
    page_num +=1
    link = 'https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=%d&jobage=archive'%page_num
    print(link)

https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=0&jobage=archive
https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=1&jobage=archive
https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=2&jobage=archive
https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=3&jobage=archive
https://it.jobindex.dk/jobsoegning?maxdate=20171112&mindate=20170601&page=4&jobage=archive


## Construction of URLs
A nice trick is to understand how URLs are constructed to communicate with a server.

Let's look at how [jobindex.dk](https://www.jobindex.dk/) does it.

This can help you navigate the page without having to parse information from the HTML or click any buttons.

* / works like folders on your computer.
* ? marks the start of the parameters.
* = assigns a value to a parameter: e.g. pageid=1000, offset=100 or showNumber=20.
* & separates different parameters.
* \+ is the URL encoding of whitespace.
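
The rules above can be tried out with the standard library's `urllib.parse`. The sketch reuses the jobindex parameters from the earlier cell. Building the URL with `urlencode` instead of `%` string formatting is a choice made here for illustration.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Build a url from a base path and a dictionary of parameters.
base = 'https://it.jobindex.dk/jobsoegning'
params = {'maxdate': '20171112', 'mindate': '20170601', 'page': 2, 'jobage': 'archive'}
url = base + '?' + urlencode(params)
print(url)

# Take an existing url apart again.
parts = urlsplit(url)
print(parts.path)             # the "folder" part after the domain
print(parse_qs(parts.query))  # parameters as a dictionary of lists

# Whitespace in a parameter value is encoded as '+'.
print(urlencode({'query': 'candidate kv17'}))
```

Looping over `page` or `offset` values in `params` gives you every results page without clicking a single button.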

## Exercise 2
Find the backdoor to the polling data behind the visualization at b.dk (https://www.b.dk/politiko/barometeret). 
* Follow the above link and look at the network activity for data-related responses: .csv, .xml, .json
* Download the data and parse it using xmltodict (conda install -c conda-forge xmltodict):

`data = xmltodict.parse(text)`


**extra**

Find links to all historical data (either by analyzing the path or by locating the master file in the network activity), and set up a loop collecting them all.

**extra 2**

Download precompiled polling data from GitHub:
* ***link to a compilation of Danish polling [data](https://github.com/erikgahner/polls)***
    

In [70]:
# lets look at the results
#data.text
url = 'https://www.b.dk/upload/webred/bmsandbox/opinion_poll/2017/10.xml'
#url = 'https://www.b.dk/upload/webred/bmsandbox/opinion_poll/overview.xml'
res = requests.get(url)
import xmltodict
data = xmltodict.parse(res.text)

In [83]:
poll['entries']['entry'][0]

OrderedDict([('party',
              OrderedDict([('id', '1'),
                           ('name', 'Venstre'),
                           ('shortname', 'Venstre'),
                           ('letter', 'V')])),
             ('percent', '18.8'),
             ('mandates', '34'),
             ('supports', '9'),
             ('uncertainty', '1.1')])

In [78]:
for poll in data['result']['polls']['poll']:
    break
#data['result']['institute'][0]['xmls']['xml'][5]

# Navigation and Parsing

##  HTML Scraping
Web scraping: automated downloading of web pages and extraction of specific information from them.

Snowball crawling: find links, follow links, find more links, follow those links.
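
The "find links" step of a crawl can be sketched with nothing but the standard library. The example below uses `html.parser.HTMLParser` rather than the string splitting used earlier or BeautifulSoup, purely so that it runs without extra installs. The class name `LinkExtractor` and the HTML snippet are made up for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from every <a href="..."> tag."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # urljoin turns relative links into absolute ones
                self.links.append(urljoin(self.base_url, value))


# Hypothetical snippet, loosely modelled on the job listing above.
html = '''
<div class="user-jobs__content-item">
  <h2><a href="/job/example-job-1">Example job 1</a></h2>
  <h2><a href="https://example.com/other">Elsewhere</a></h2>
</div>
'''

parser = LinkExtractor('https://www.jobfinder.dk')
parser.feed(html)
print(parser.links)
```

A snowball crawler would push these links onto a queue, fetch each one, and run the extractor again, keeping a set of visited urls to avoid loops.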


## Parsing and extracting information
Retrieving target information from unstructured text.

Parsing is used both while crawling and navigating the domain you are scraping from.


## HTML data
- Tree structure.
    - Children, siblings, parents - descendants. 
        - Ids and attributes

<img src="http://www.openbookproject.net/tutorials/getdown/css/images/lesson4/HTMLDOMTree.png"/>


How do we find our way around this tree?

* selectors
* splitting and regex
* traversing html trees (beautifulsoup)

Let's download a module that can parse this tree for us.

`conda install -c asmeurer beautiful-soup`

or 
`pip install beautifulsoup4`

## Selectors 
- Quick, but they have to be hardcoded and are therefore more likely to break.


In [86]:
# selenium example
browser.get('https://www.facebook.com')
# find login button.


In [87]:
sel = '#email'
element = browser.find_element_by_css_selector(sel)

In [89]:
element.send_keys('snorreralund@gmail.com')

In [90]:
sel = '#pass'
element = browser.find_element_by_css_selector(sel)
element.send_keys('thereyou go fuck my life')

In [91]:
sel = '#u_0_2'
element = browser.find_element_by_css_selector(sel)
element.click()

## Splitting and searching with Regex 

Regex is a general language for searching strings. 

It is extremely useful for extracting information from all sorts of unstructured text data.

... however, you have to learn it. 


### Custom parsing using Regex
We won't go into the details here, but when working with text this will come in very handy. 
Play around with it [here](http://regexr.com/) or in this notebook.

* \+ = 1 or more times
* \* = 0 or more times
* {3} = exactly three times
* ? = once or none
* \\ = escape character, used to match characters that have a special meaning in regex: e.g. \+ \*
* () = the parentheses enclose the pattern I care about.
* [] = set of characters
* ^ = applied within a set, it becomes the inverse of the set defined.
* . = any character except line break
* | = or statement. p|b finds the character p or b.

In [97]:
import re
s = 'abcdefg123456789'
re.split('[a-z][0-9]',s)# splitting by a complex pattern


['abcdef', '23456789']
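
A few more patterns, one for each of the symbols listed above. The example string is made up, modelled on the candidate facts scraped later in the notebook.

```python
import re

text = 'Valgt i 2009, 2013. Alder: 56 år.'

# {4}: exactly four digits in a row
years = re.findall('[0-9]{4}', text)
print(years)

# (): only the part inside the parentheses is returned; +: one or more digits
age = re.search('Alder: ([0-9]+)', text).group(1)
print(age)

# ^ inside a set: everything except digits, spaces and punctuation
words = re.findall('[^0-9 ,.:]+', text)
print(words)

# ?: the preceding character is optional
print(re.findall('colou?r', 'color colour'))

# |: either p or b
print(re.findall('p|b', 'pub'))

# \: escape the plus sign to match it literally
print(re.split(r'\+', '1+2+3'))
```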

### Advanced parsing using Beautifulsoup
BeautifulSoup is a powerful parsing library that sits on top of an HTML parser (here lxml) and transforms the raw HTML into a traversable tree of objects.

Here I want to show you how to use it for grabbing data about the KV17 election:
https://www.altinget.dk/kandidater/kv17/stemmeseddel.aspx

Let us analyze the page.

In [98]:
url = 'https://www.altinget.dk/kandidater/kv17/stemmeseddel.aspx?kommune=13'
response = request(url)
#for i in range(1,99):
#    url = 'www.altinget.dk/kandidater/kv17/stemmeseddel.aspx?kommune=%d'%i


In [99]:
html = response.text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')

In [100]:
node = soup.find('dl',{'class':'candidates-list-parties'}) # find, findall 

In [117]:
party_node = list(node.children)[0]
# children, siblings, parents. 
party_node.get_text()
candidate_node = list(node.children)[1]
candidate_node.find('a').get_text()
candidate_node.find('a')['href']

'/kandidater/kv17/andreas-bennetzen'

In [123]:
candidates = []
base = 'https://www.altinget.dk'
for child in list(node.children):
    if child['class'][0] == "candidates-party-name":
        party = child.text
        continue
    name = child.a.text
    link = base+child.a['href']
    print(party,name,link)
    row = [party,name,link]
    candidates.append(row)

02. Christiania-Listen Andreas Bennetzen https://www.altinget.dk/kandidater/kv17/andreas-bennetzen
02. Christiania-Listen Britta Lillesøe https://www.altinget.dk/kandidater/kv17/Britta-Lillesoee
02. Christiania-Listen Klaus Haase https://www.altinget.dk/kandidater/kv17/Klaus-Haase
02. Christiania-Listen Lis Brandstrup https://www.altinget.dk/kandidater/kv17/Lis-Brandstrup
02. Christiania-Listen Carl Oskar Strange https://www.altinget.dk/kandidater/kv17/carl-oskar-strange
02. Christiania-Listen Kirsten Larsen Mhoja https://www.altinget.dk/kandidater/kv17/Kirsten-Larsen-Mhoja
02. Christiania-Listen Nils Vest https://www.altinget.dk/kandidater/kv17/Nils-Vest
02. Christiania-Listen Lars Fenger https://www.altinget.dk/kandidater/kv17/Lars-Hauch-Fenger
02. Christiania-Listen Hulda Mader https://www.altinget.dk/kandidater/kv17/Hulda-Mader
02. Christiania-Listen Pia Liljenbøl https://www.altinget.dk/kandidater/kv17/Pia-Liljenboel
02. Christiania-Listen Amanda Otilie Ibsen https://www.altinget.

In [124]:
#if child.name =='dt':
 #   party = child.get_text()

In [125]:
import pandas as pd
df = pd.DataFrame(candidates,columns = ['party','name','link'])
df.head()

Unnamed: 0,party,name,link
0,02. Christiania-Listen,Andreas Bennetzen,https://www.altinget.dk/kandidater/kv17/andrea...
1,02. Christiania-Listen,Britta Lillesøe,https://www.altinget.dk/kandidater/kv17/Britta...
2,02. Christiania-Listen,Klaus Haase,https://www.altinget.dk/kandidater/kv17/Klaus-...
3,02. Christiania-Listen,Lis Brandstrup,https://www.altinget.dk/kandidater/kv17/Lis-Br...
4,02. Christiania-Listen,Carl Oskar Strange,https://www.altinget.dk/kandidater/kv17/carl-o...


In [126]:
# Getting information about each candidate

url = 'https://www.altinget.dk/kandidater/kv17/3010-frank-jensen'
response = request(url)
html = response.text
soup = BeautifulSoup(html,'lxml')

In [128]:
meta_nodes = soup.findAll('h3',{'class':'subsection-title h4'})
meta_nodes

[<h3 class="subsection-title h4">Fakta</h3>,
 <h3 class="subsection-title h4">Mere information</h3>]

In [129]:
for meta_node in meta_nodes:
    if meta_node.text == 'Fakta':
        break
    

In [130]:
# search for siblings.

<h3 class="subsection-title h4">Fakta</h3>

In [131]:
node = meta_node.find_next('dl')
node

<dl>
<dt>Valgt i</dt><dd>2009, 2013</dd>
<dt>Uddannelse</dt><dd>Kandidatuddannelse</dd>
<dt>Beskæftigelse</dt><dd>Borgmester</dd>
<dt>Alder</dt><dd>56 år</dd>
</dl>

In [156]:
candidate_data = {}
for cat in node.findAll('dt'):
    category_title = cat.get_text()
    category_point = cat.find_next_sibling().get_text()
    print(category_title,category_point)
    candidate_data[category_title] = category_point

Valgt i 2009, 2013
Uddannelse Kandidatuddannelse
Beskæftigelse Borgmester
Alder 56 år


In [138]:
cat.get_text(),cat.find_next_sibling().get_text().split(',')[-1].strip()

('Valgt i', '2013')

In [153]:
answers = soup.findAll('h3',{'class':'panel-title'})#soup.findAll('div',{'class':'poll-candidates-question panel panel-default'})

In [147]:
#answers[0]

In [158]:

for answer in answers:
    title = answer.text
    #print(title)
    candidate_answer = answer.find_next('div',{'class':'answer-candidate tooltip in top'}).previous
    candidate_data[title] = candidate_answer

In [159]:
data = [candidate_data]
df = pd.DataFrame(data)
df

Unnamed: 0,Alder,BESKÆFTIGELSE,Beskæftigelse,BÆREDYGTIGHED,DAGINSTITUTIONER,ENERGI,ERHVERV,FLYGTNINGE,FOLKESKOLE,FOLKESUNDHED,...,KLIMA,KOMMUNESKAT,KULTUR,MILJØ,TRANSPORT,Uddannelse,Valgt i,ÆLDRE,ÆLDREPLEJE,ØKONOMI
0,56 år,Delvist enig,Borgmester,Delvist enig,Delvist enig,Helt enig,Delvist uenig,Delvist uenig,Delvist uenig,Hverken/eller,...,Helt enig,Delvist uenig,Helt uenig,Helt enig,Helt enig,Kandidatuddannelse,"2009, 2013",Helt enig,Delvist uenig,Delvist uenig


In [150]:
answer.find_next('div',{'class':'answer-candidate tooltip in top'}).previous

'Delvist uenig'

In [160]:
#candidate_data

**Who is the vaguest politician?**

## Exercise 3: Find the Facebook ids of all candidates in the KV17 election
* Loop through the first 5 municipalities and collect all candidates.

* Get the Facebook link from each candidate's page.

- Parse all social media links in the section (Mere information), without knowing the classes in advance.

extra: 
    Parse all information on the page, including answers to the candidate test, and count how many 'delvist' answers each politician has. Who is the vaguest politician in Denmark?

extra:
    Search for the candidate and kv17 on dr.dk (https://www.dr.dk/search/Result?filter_facet_universe=Nyheder&query=candidate+kv17)
    

In [4]:
url = 'https://www.altinget.dk/kandidater/kv17/jacob-bundsgaard'

response = request(url)

In [5]:
html = response.text

In [6]:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')

In [7]:
nodes = soup.findAll('h3',{'class':'subsection-title h4'})

In [8]:
node = [node for node in nodes if node.text=='Mere information'][0]
node

<h3 class="subsection-title h4">Mere information</h3>

In [9]:
node = node.find_next('dl')


In [15]:
for link_node in node.find_all_next('a'):
    break
    #print(link_node)

In [12]:
node

<dl>
<dd><a href="https://www.instagram.com/jacobbundsgaard/">Instagram </a></dd><dd><a href="https://twitter.com/1836">Twitter</a></dd><dd><a href="https://www.facebook.com/100002941361500">Facebook</a></dd>
</dl>

In [169]:
SoMe = {}
for link_node in node.find_all_next('dd'):
    link_node = link_node.a
    name = link_node.text.strip()
    link = link_node['href']
    SoMe[name] = link


In [170]:
if 'Facebook' in SoMe:
    print(SoMe['Facebook'])

https://www.facebook.com/100002941361500


### Storing data
Storing processed data and logging activity:
* in csv,
* in json,
* as a pickle file,
* as a dataframe.

Install a csv reader and writer:

`conda install -c anaconda unicodecsv `

or 

`pip install unicodecsv`
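
On Python 3 the built-in `csv` module handles unicode text on its own, so the following sketch uses it instead of `unicodecsv`. The two rows are taken from the candidate list printed earlier, and the temporary file path is an assumption for illustration.

```python
import csv
import os
import tempfile

candidates = [
    ['02. Christiania-Listen', 'Andreas Bennetzen', 'https://www.altinget.dk/kandidater/kv17/andreas-bennetzen'],
    ['02. Christiania-Listen', 'Britta Lillesøe', 'https://www.altinget.dk/kandidater/kv17/Britta-Lillesoee'],
]
path = os.path.join(tempfile.mkdtemp(), 'candidates.csv')

# newline='' lets the csv module control line endings itself
with open(path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['party', 'name', 'link'])  # header row
    writer.writerows(candidates)

# DictReader uses the header row as keys
with open(path, newline='', encoding='utf-8') as f:
    rows = list(csv.DictReader(f))

print(rows[1]['name'])
```

If the data is already in a dataframe, `df.to_csv(path)` does the same job in one line.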

## Now let's try parsing more structured data: Tables.
If you are not sure, always ask Google: "how to parse html table"
### [Hit the following link](https://www.basketball-reference.com/leagues/NBA_2017.html). 



In [124]:
import requests
response = requests.get('https://www.basketball-reference.com/leagues/NBA_2017.html') # get the html 

In [121]:
from bs4 import BeautifulSoup # import parser

bsobj = BeautifulSoup(response.text,'lxml') # parse the response using beautifulsoup, naming the parser explicitly

In [122]:
tables = bsobj.findAll('table') # using the tag to locate tables
table = bsobj.select('#confs_standings_E > thead > tr > th.poptip.sort_default_asc.left') # using a specific css selector to locate specific table

### Let's look at the tables

In [144]:
for table in tables[0:1]: # iterating through the tables
    header = table.findAll('thead')[0] # thead is standard notation for table header
    data = table.findAll('tbody')[0] # tbody is standard notation for the data
    
    head = []
    for row in header.findAll('th'): # collect data from header.
        #print(row['aria-label'])#.attrs)#,row.text)
        head.append(row['aria-label'])
 #   print(head)
    tbody = [] # container for the data
    for row in data.findAll('tr'): # tr is standard notation for table rows.
#        print row
        tempoary_row = []
        for cell in row.children: # think of children as indentation in python.
            print(cell.get_text())
            tempoary_row.append(cell.get_text())
        tbody.append(tempoary_row)
#        break
tbody

Boston Celtics* (1) 
53
29
.646
—
108.0
105.4
2.25
Cleveland Cavaliers* (2) 
51
31
.622
2.0
110.3
107.2
2.87
Toronto Raptors* (3) 
51
31
.622
2.0
106.9
102.6
3.65
Washington Wizards* (4) 
49
33
.598
4.0
109.2
107.4
1.36
Atlanta Hawks* (5) 
43
39
.524
10.0
103.2
104.0
-1.23
Milwaukee Bucks* (6) 
42
40
.512
11.0
103.6
103.8
-0.45
Indiana Pacers* (7) 
42
40
.512
11.0
105.1
105.3
-0.64
Chicago Bulls* (8) 
41
41
.500
12.0
102.9
102.4
0.03
Miami Heat (9) 
41
41
.500
12.0
103.2
102.1
0.77
Detroit Pistons (10) 
37
45
.451
16.0
101.3
102.5
-1.29
Charlotte Hornets (11) 
36
46
.439
17.0
104.9
104.7
-0.07
New York Knicks (12) 
31
51
.378
22.0
104.3
108.0
-3.87
Orlando Magic (13) 
29
53
.354
24.0
101.1
107.6
-6.61
Philadelphia 76ers (14) 
28
54
.341
25.0
102.4
108.1
-5.83
Brooklyn Nets (15) 
20
62
.244
33.0
105.8
112.5
-6.74


[[u'Boston Celtics*\xa0(1)\xa0',
  u'53',
  u'29',
  u'.646',
  u'\u2014',
  u'108.0',
  u'105.4',
  u'2.25'],
 [u'Cleveland Cavaliers*\xa0(2)\xa0',
  u'51',
  u'31',
  u'.622',
  u'2.0',
  u'110.3',
  u'107.2',
  u'2.87'],
 [u'Toronto Raptors*\xa0(3)\xa0',
  u'51',
  u'31',
  u'.622',
  u'2.0',
  u'106.9',
  u'102.6',
  u'3.65'],
 [u'Washington Wizards*\xa0(4)\xa0',
  u'49',
  u'33',
  u'.598',
  u'4.0',
  u'109.2',
  u'107.4',
  u'1.36'],
 [u'Atlanta Hawks*\xa0(5)\xa0',
  u'43',
  u'39',
  u'.524',
  u'10.0',
  u'103.2',
  u'104.0',
  u'-1.23'],
 [u'Milwaukee Bucks*\xa0(6)\xa0',
  u'42',
  u'40',
  u'.512',
  u'11.0',
  u'103.6',
  u'103.8',
  u'-0.45'],
 [u'Indiana Pacers*\xa0(7)\xa0',
  u'42',
  u'40',
  u'.512',
  u'11.0',
  u'105.1',
  u'105.3',
  u'-0.64'],
 [u'Chicago Bulls*\xa0(8)\xa0',
  u'41',
  u'41',
  u'.500',
  u'12.0',
  u'102.9',
  u'102.4',
  u'0.03'],
 [u'Miami Heat\xa0(9)\xa0',
  u'41',
  u'41',
  u'.500',
  u'12.0',
  u'103.2',
  u'102.1',
  u'0.77'],
 [u'Detroit P

In [147]:

for table in tables[0:1]: # pick only the first table
    header = table.findAll('thead')[0]
    data = table.findAll('tbody')[0]
    head = [] # define container for the header
    for column in header.findAll('th'): # collect data from header.
        head.append(column['aria-label']) # append to header container
    rows = [] # define container for the rows
    for row in data.findAll('tr'):
        temp_row = [] # define temporary row container
        for cell in row.children: # think of children as indentation in python.
            val = cell.get_text()
            try:    
                val = float(val)
            except ValueError: # keep non-numeric cells as text
                pass
            temp_row.append(val) # append to temporary row
        rows.append(temp_row) # append temporary row to row container
      
import pandas as pd
df = pd.DataFrame(columns=head,data=rows) # define the dataframe

In [148]:
df # look at the dataframe

Unnamed: 0,Eastern Conference,Wins,Losses,Win-Loss Percentage,Games Behind,Points Per Game,Opponent Points Per Game,Simple Rating System
0,Boston Celtics* (1),53.0,29.0,0.646,—,108.0,105.4,2.25
1,Cleveland Cavaliers* (2),51.0,31.0,0.622,2,110.3,107.2,2.87
2,Toronto Raptors* (3),51.0,31.0,0.622,2,106.9,102.6,3.65
3,Washington Wizards* (4),49.0,33.0,0.598,4,109.2,107.4,1.36
4,Atlanta Hawks* (5),43.0,39.0,0.524,10,103.2,104.0,-1.23
5,Milwaukee Bucks* (6),42.0,40.0,0.512,11,103.6,103.8,-0.45
6,Indiana Pacers* (7),42.0,40.0,0.512,11,105.1,105.3,-0.64
7,Chicago Bulls* (8),41.0,41.0,0.5,12,102.9,102.4,0.03
8,Miami Heat (9),41.0,41.0,0.5,12,103.2,102.1,0.77
9,Detroit Pistons (10),37.0,45.0,0.451,16,101.3,102.5,-1.29


In [150]:
import pandas as pd
dfs = []
for table in tables: # iterate over all tables
    header = table.findAll('thead')[0]
    data = table.findAll('tbody')[0]
    head = [] # define container for the header
    for column in header.findAll('th'): # collect data from header.
        head.append(column.text) # append to header container
    rows = [] # define container for the rows
    for row in data.findAll('tr'):
        temp_row = [] # define temporary row container
        for cell in row.children: # think of children as indentation in python.
            val = cell.get_text()
            try:    
                val = float(val)
            except ValueError: # keep non-numeric cells as text
                pass
            temp_row.append(val) # append to temporary row
        rows.append(temp_row) # append temporary row to row container
        

    df = pd.DataFrame(columns=head,data=rows) # define the dataframe
    dfs.append(df)

In [152]:
#dfs[-1]

# Screen-scraping and automated interactions
* Logging in
* Scrolling
* Pressing buttons

Sometimes this is easier than detective work in the JavaScript, but it has RELIABILITY ISSUES.

## Installing (and updating every time something stops working) selenium
Selenium is a programmatic interface (API) to your browser. 

You need to have Firefox installed and a version of the Marionette driver, geckodriver ([download the latest version here](https://github.com/mozilla/geckodriver/releases)).
Then install selenium:

`pip install selenium`

or 

`conda install -c conda-forge selenium`

If used on a server with no connection to a desktop screen: also install PyDisplay.

In [153]:
from selenium import webdriver
path2gecko = '/Users/axelengbergpallesen/Downloads/geckodriver' # define path to your geckodriver
browser = webdriver.Firefox(executable_path=path2gecko)


## Login

In [169]:
#url = 'https://www.facebook.com'
#browser.get(url) # go to the page
#sel = '#email'# find css selector for the name field
#element = browser.find_element_by_css_selector(sel)# locate element
#name = 'robot_trespassing@ofir.dk'# define name
#element.send_keys(name)# send keys

# find password field
#sel = '#pass'
#element = browser.find_element_by_css_selector(sel)
# locate element
#password = 'thereyougo'
# define password
#element.send_keys(password)
# send keys

# find the login button
sel = '#loginbutton'
# locate element
element = browser.find_element_by_css_selector(sel)
# click button
element.click()

## Scrolling
Some pages load more results when you reach the bottom of the page. Here is a simple script that does this for you.

In [None]:
browser.execute_script("window.scrollTo(0, document.body.scrollHeight);") # executing a simple javascript.
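
One scroll is rarely enough. A possible way to keep scrolling until nothing new loads is sketched below. The function name `scroll_to_bottom`, the pause length, and the `FakeBrowser` stand-in are assumptions for illustration; with selenium you would pass in your real `browser` object, whose `execute_script` method is the only thing the function relies on.

```python
import time


def scroll_to_bottom(browser, pause=1.0, max_rounds=20):
    """Scroll until the page height stops growing or max_rounds is reached.

    Returns the number of scrolls performed.
    """
    last_height = browser.execute_script("return document.body.scrollHeight")
    for round_number in range(1, max_rounds + 1):
        browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more results
        new_height = browser.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # nothing new was loaded
            return round_number
        last_height = new_height
    return max_rounds


class FakeBrowser:
    """Stand-in for a selenium driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith('return'):
            return self.heights[self.index]
        # any other script is treated as a scroll that may load more content
        self.index = min(self.index + 1, len(self.heights) - 1)


print(scroll_to_bottom(FakeBrowser(), pause=0))
```

The `max_rounds` cap protects you against pages that never stop loading.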

# Advanced scraping methods
... ethics
... hacking

* detective work in the javascript.
* changing headers to access mobile content that might be simpler.
    * use https://www.whatsmybrowser.org from your phone to find the right headers you need.
* using proxies, ssh tunneling.
* Simulating human-like behaviour to avoid getting caught by anomaly detection methods.

In [None]:
session = requests.session()
mobile_agent = 'Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53'
session.headers['User-Agent'] = mobile_agent
session.proxies

# APIs


* No parsing. Nice
* Fast and efficient.
* You can't get all that you can see.
* They change.
* Ratelimits


## Through the API, domains control what they want to share. 
## APIs can be for collecting data or for accessing some service: e.g. tracking users of a website by directing traffic to Google Analytics, or getting geolocation data from the [Google Geocoding API](http://maps.googleapis.com/maps/api/geocode/json?language=da&address=1600 Pennsylvania Ave NW, Washington, DC 20500, USA). 

## Collecting data from APIs
Choosing an API: Twitter, Facebook, Reddit, Wikipedia, Statistics Denmark, etc.

Get to know the commands, rate limits, and error codes by experimenting or simply reading the documentation.

Today we will look at the Facebook API [https://developers.facebook.com/docs](https://developers.facebook.com/docs) 

Instead of looking at the documentation, we can go right to experimenting by using their graph explorer app [https://developers.facebook.com/tools/explorer/](https://developers.facebook.com/tools/explorer/). 


### Authentication
Basic authentication: telling the service who you are, with a unique identifier so that it can track you.

Specific permissions: some queries are only allowed with special permissions.
    * Application-specific permissions, gathering user permission. 

### Building a query

In [174]:
# get an ID. 
print(SoMe['Facebook'])
page_id = 'frank.jensen'#SoMe['Facebook'].split('/')[-1] # use the facebooklink collected earlier

https://www.facebook.com/100002941361500


In [175]:
# Authentication! set your access_token
base_url = 'https://graph.facebook.com/v2.10/%s'%page_id# define baseurl 

fields = '?fields=feed.limit(10){message,id,to,from,comments,likes}'# define fields
token = 'EAACEdEose0cBAGlmaRDaP8t7JFXce3gLphDSiFSioWyvUPFpSMbhKQy3sT0JPEcsn2uFd7PQa1PLpXPBkYG80MFGnuOixLkbV0VJbE2fZCwaOLEY2uJXHmGNYz37zJhU2z4lBKumhT90J5ZATLHHZB19W2NiFwYkBKZAsVFGxPqXMYRvwjeOc6yb6kbZBIsMZD'
authentication = '&access_token=%s'%(token)
q = base_url+fields+authentication
print(q)
response = request(q)

https://graph.facebook.com/v2.10/frank.jensen?fields=feed.limit(10){message,id,to,from,comments,likes}&access_token=EAACEdEose0cBAGlmaRDaP8t7JFXce3gLphDSiFSioWyvUPFpSMbhKQy3sT0JPEcsn2uFd7PQa1PLpXPBkYG80MFGnuOixLkbV0VJbE2fZCwaOLEY2uJXHmGNYz37zJhU2z4lBKumhT90J5ZATLHHZB19W2NiFwYkBKZAsVFGxPqXMYRvwjeOc6yb6kbZBIsMZD


In [179]:
data = response.json()
data['feed']['data'][0]['message']

'I København skal det være slut med forældre, der afleverer deres børn i institutionen med ondt i maven eller henter dem med hamrende hjerte ti minutter efter lukketid.\n\nJeg går til valg med en ny plan, der skal lette hverdagen for travle småbørnsforældre i Københavns Kommune.\n\n1. Bedre bemanding af pædagoger i alle daginstitutioner i morgen og eftermiddagstimerne. \n\n2. Kommune skal have ti daginstitutioner – en i hver bydel – der åbner allerede klokken 6:30 og lukker tidligst klokken 18:30 for at tilgodese forældre i brancher med skæve arbejdstider.'

In [180]:
import json

d = response.json()
next_link = d['feed']['paging']['next']

def grab_next(response):
    # return the link to the next page, or False if the response has none
    try:
        link = response['feed']['paging']['next']
        return link
    except (KeyError, TypeError):
        return False
responses = []
while True:
    print('-')
    response = request(next_link).json() 
    responses.append(response)
    next_link = grab_next(response)
    if not next_link:
        print('no more paging')
        break
    

-
no more paging
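The loop above stops after a single page. One possible reason, stated here as an assumption rather than a verified fact about this API version, is that follow-up pages carry `paging` at the top level of the response instead of under `feed`. The sketch below handles both shapes; `next_link` and `collect_pages` are hypothetical names, and the `fetch` argument stands in for something like `lambda url: request(url).json()`.

```python
def next_link(page):
    """Return the URL of the next page, or None when there is none."""
    # first response: paging sits under the expanded 'feed' field;
    # follow-up responses (assumption): paging sits at the top level
    paging = page.get('feed', page).get('paging', {})
    return paging.get('next')

def collect_pages(first_page, fetch, max_pages=100):
    """Follow 'next' links until they run out or max_pages is reached."""
    pages = [first_page]
    link = next_link(first_page)
    while link and len(pages) < max_pages:
        page = fetch(link)
        pages.append(page)
        link = next_link(page)
    return pages

# a fake API, so the example runs without network access
fake_api = {
    'url-2': {'data': [{'id': '3'}], 'paging': {'next': 'url-3'}},
    'url-3': {'data': [{'id': '4'}], 'paging': {}},
}
first = {'feed': {'data': [{'id': '1'}], 'paging': {'next': 'url-2'}}}
pages = collect_pages(first, fake_api.get)
print(len(pages))
```

The `max_pages` cap is a safety net against endless loops if the API keeps returning a `next` link.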


In [168]:
## parsing json generic

## parsing json with the requests module

### Rate limiting
APIs have rate limits. To ensure reliable access, make sure you do not exceed them.

Rate limits are not always stated explicitly, so you need to test before launching your program, or write a program that adapts.

In [None]:
# waiting using the time module

# Intelligent rate limiting. 
import time
logsize = 360
timestamps = [time.time()]*logsize
requestrate = 1 # minimum average number of seconds per call. If rate-limiting errors are thrown, it is increased.
ratelimiterrors = 0 # number of rate-limiting errors seen since the last adjustment
def ratelimit():
    global timestamps
    global requestrate
    global ratelimiterrors
    servertime = time.time()
    timestamps.append(servertime)
    # average number of seconds between the last `logsize` calls
    average_request_rate = (servertime-timestamps[0])/logsize
    if ratelimiterrors>0: # adopt a slower rate
        time.sleep(1800)
        requestrate+=0.01
        print('updating requestrate to %r'%requestrate)
        ratelimiterrors = 0
    if average_request_rate < requestrate: # check if we are calling too fast
        time.sleep(1)
    timestamps.pop(0) # remove first timestamp
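The same idea can be written without global variables. This is a minimal sketch, not the notebook's method: `RateLimiter` is a hypothetical class that enforces a minimum interval between calls and lets you slow down after a rate-limiting error. The clock and sleep functions are injectable so the example runs instantly.

```python
import time

class RateLimiter:
    """Enforce a minimum number of seconds between calls (hypothetical helper)."""
    def __init__(self, min_interval=1.0, clock=time.time, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.last_call = None

    def wait(self):
        now = self.clock()
        if self.last_call is not None:
            remaining = self.min_interval - (now - self.last_call)
            if remaining > 0:       # we are calling too fast
                self.sleep(remaining)
                now = self.clock()
        self.last_call = now

    def back_off(self, factor=2.0):
        """Call this when the API reports a rate-limiting error."""
        self.min_interval *= factor

# usage with a fake clock, so nothing actually sleeps
fake_now = [0.0]
slept = []
def fake_sleep(seconds):
    slept.append(seconds)
    fake_now[0] += seconds

limiter = RateLimiter(min_interval=1.0, clock=lambda: fake_now[0], sleep=fake_sleep)
limiter.wait()      # first call passes straight through
limiter.wait()      # second call has to wait one interval
limiter.back_off()  # pretend the API complained
```

In real use you would create `RateLimiter()` with the defaults and call `limiter.wait()` before every request.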



If you want reliable access, you need to create an app. This will allow you to obtain access tokens that last for a month.
Go to developers.facebook.com and create an app.

Exchanging the short-lived access token for a long-lived one is done by requesting the URL built in the following code snippet:

```url = 'https://graph.facebook.com/oauth/access_token?client_id=%s&client_secret=%s&grant_type=fb_exchange_token&fb_exchange_token=%s'%(app_id,app_secret,access_token)```

Make sure the access token is specific to your app. 
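A minimal sketch of building that exchange URL with proper escaping, written for Python 3. `build_exchange_url` is a hypothetical helper, and `APP_ID`, `APP_SECRET` and `SHORT_TOKEN` are placeholders for your own values; the parameter names are the ones from the snippet above.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def build_exchange_url(app_id, app_secret, short_token):
    """Build the token-exchange URL from the snippet above (hypothetical helper)."""
    params = {
        'client_id': app_id,
        'client_secret': app_secret,
        'grant_type': 'fb_exchange_token',
        'fb_exchange_token': short_token,
    }
    return 'https://graph.facebook.com/oauth/access_token?' + urlencode(params)

exchange_url = build_exchange_url('APP_ID', 'APP_SECRET', 'SHORT_TOKEN')
# parse the URL again to check that every parameter survived the round trip
parsed_params = parse_qs(urlparse(exchange_url).query)
print(parsed_params['grant_type'])
```

The app secret should never be committed to a public notebook; reading it from an environment variable is one common option.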

## Exercise 4
Collecting Facebook data from the candidates.

First extract the ids from the Facebook link of each candidate:
* Beware of all the variations in the URL.
* Some links have names and not ids. If this is the case, convert the name using the query frank.jensen?fields=id,name
* Some are pages and others are profiles; you can only get data from the pages.
    

# Reliability!
When using found data, you are the curator, and you are responsible for establishing trust in the compiled data.

Reliability is ensured by an iterative process of inspection, error detection, and error handling.

Build your scraper around making this process easy by:
* Logging information about the collection (e.g. server time, size of each response, number of calls per day), so that you can plot response size over time, spot weird behavior, and detect holes in your data.
* Storing raw data (before parsing it) to be able to backtrack problems without having to wait for the error to come up again.  
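The two points above can be sketched as one small function. This is an illustration under assumptions, not a prescribed setup: `log_and_store`, the directory name and the log file name are all hypothetical, and the log format (one JSON object per line) is just one convenient choice.

```python
import json
import os
import tempfile
import time

def log_and_store(name, raw_text, raw_dir='raw_data', logfile='collection_log.jsonl'):
    """Store the unparsed response and append one log line describing it."""
    os.makedirs(raw_dir, exist_ok=True)
    raw_path = os.path.join(raw_dir, name)
    with open(raw_path, 'w', encoding='utf-8') as f:
        f.write(raw_text)               # raw data first, parsing comes later
    entry = {'name': name, 'time': time.time(),
             'size': len(raw_text), 'path': raw_path}
    with open(logfile, 'a', encoding='utf-8') as f:
        f.write(json.dumps(entry) + '\n')
    return entry

# usage in a temporary directory, so the example leaves your project untouched
workdir = tempfile.mkdtemp()
entry = log_and_store('page-1.json', '{"data": []}',
                      raw_dir=os.path.join(workdir, 'raw'),
                      logfile=os.path.join(workdir, 'log.jsonl'))
print(entry['size'])
```

Plotting the `size` column of such a log over time is a quick way to spot empty or truncated responses.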

Thanks!

Questions? 

Exercises below...

Scraping allows you to collect data from a variety of different sources to build a new, rich dataset. The following exercises are meant as an example of this. One thing we didn't cover explicitly is merging data from different sources, which means matching names and so on. This you will need to figure out using standard Python (hint: dictionaries can serve as translators) and, unfortunately, some hardcoding.
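To illustrate the hint about dictionaries as translators, here is a minimal sketch. The spellings in `party_translator` are made-up examples of how two sources might write the same party, and `normalize` and `translate` are hypothetical helper names.

```python
# hypothetical spellings of the same party in two different sources
party_translator = {
    'socialistisk-folkeparti': 'Socialistisk Folkeparti',
    'sf': 'Socialistisk Folkeparti',
}

def normalize(name):
    """Lowercase and collapse whitespace so lookups are forgiving."""
    return ' '.join(name.lower().split())

def translate(name, translator):
    # fall back to the input when no translation is known
    return translator.get(normalize(name), name)

print(translate(' SF ', party_translator))
print(translate('Venstre', party_translator))
```

Names without an entry pass through unchanged, which makes it easy to list the ones that still need hardcoding.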

You are not expected to complete all exercises, but you might be able to if you share the tasks within the group. *But make sure you get some data from the first page!*

We want you to collect data about the Danish politicians from this page: http://www.folketingsvalg-2015.dk/
* Make sure you get as much metadata as possible, but most importantly get the Facebook name (note that it is either a page or a profile).
     * If you want Twitter ids and more stats, check out this page: https://www.danskepolitikere.dk

Use the Facebook names to collect the 100 latest Facebook posts or feed (what is the difference?) from the API.
* Make sure you get summary statistics of likes, comments and shares.

Aggregate some statistics (average number of likes, posts a day, number of posts from others) for each politician and merge it with the other dataset.

Scrape data about the parliament members from ft.dk (they are not getting off the hook, **but remember to hit them slowly!**)
* An example looks like this: http://www.ft.dk/da/medlemmer/folketingetsmedlemmer/mette-frederiksen-(s) <-- include the closing parenthesis, which Jupyter Notebook does not parse as part of the link.
    * Hint: Links to the members are easily obtained through this page: http://www.ft.dk/da/partier
    * Parse Jobs and Highest education from the cv section and whatever you find interesting.
* Merge this data with the other two sources.



        

# SOLUTIONS
**Collecting Facebook names from folketingsvalg-2015**

In [1]:
#### LOAD DEPENDENCIES ####
import requests,time
from bs4 import BeautifulSoup
import re
####################
## Grabbing links ## 
####################

## First grab the party links ##
url = 'http://www.folketingsvalg-2015.dk'
response = requests.get(url)
## I will do this using regex
pattern = r'href="(/parti/[a-z\-]+)"' # raw string, so the backslash reaches the regex engine untouched
links = [url+i for i in re.findall(pattern,response.text)]
print(links,len(set(links)))
## Grabbing links to the parties using BeautifulSoup ##
soup = BeautifulSoup(response.text,'lxml') # name the parser explicitly to avoid the bs4 warning
obj = soup.findAll('h3',text='Partier')[0] # the first match is the 'Partier' heading
links = [] 

for node in obj.find_next_siblings('p'):
    link = url+node.a['href'] # href only contains the relative url so we add it to the baseurl
    links.append(link)
print(links,len(set(links)))

([u'http://www.folketingsvalg-2015.dk/parti/alternativet', u'http://www.folketingsvalg-2015.dk/parti/alternativet', u'http://www.folketingsvalg-2015.dk/parti/venstre', u'http://www.folketingsvalg-2015.dk/parti/venstre', u'http://www.folketingsvalg-2015.dk/parti/socialdemokraterne', u'http://www.folketingsvalg-2015.dk/parti/socialdemokraterne', u'http://www.folketingsvalg-2015.dk/parti/dansk-folkeparti', u'http://www.folketingsvalg-2015.dk/parti/dansk-folkeparti', u'http://www.folketingsvalg-2015.dk/parti/radikale-venstre', u'http://www.folketingsvalg-2015.dk/parti/radikale-venstre', u'http://www.folketingsvalg-2015.dk/parti/socialistisk-folkeparti', u'http://www.folketingsvalg-2015.dk/parti/socialistisk-folkeparti', u'http://www.folketingsvalg-2015.dk/parti/enhedslisten', u'http://www.folketingsvalg-2015.dk/parti/enhedslisten', u'http://www.folketingsvalg-2015.dk/parti/liberal-alliance', u'http://www.folketingsvalg-2015.dk/parti/liberal-alliance', u'http://www.folketingsvalg-2015.dk/pa





In [2]:
links = set(links) # there are duplicate links so we will change it to a set.

In [3]:
## Grabbing the links to the Politicians using BeautifulSoup## 
party2links = {}
for party_link in links:
    party_links = []
    party = party_link.split('/')[-1] # lets keep the meta data so we don't have to scrape it later.
    time.sleep(0.5)
    response = requests.get(party_link)
    # parse with beautifulsoup
    soup = BeautifulSoup(response.text,'lxml')
    politicians = soup.findAll('p')[0:-2] # all p nodes contain politicians, except the last two, which contain links to open knowledge
    for pol in politicians:
        party_links.append(url+pol.a['href'])
    party2links[party] = party_links 
    ## I am keeping it in a dictionary instead of one long list,
    ## to be able to track their origin(this might be important meta data)    
        

In [85]:
link = 'http://www.folketingsvalg-2015.dk/bjarne-fog-corydon'
response = requests.get(link)
soup = BeautifulSoup(response.text,'lxml')

In [119]:
import bs4
def get_attribute_split(child,separator=':'):
    string = child.get_text()
    key,value = string.split(separator) # unpack the two-element list; raises ValueError otherwise
    return key.strip(),value.strip() # strip whitespace from start and end of the string
def get_metadata(response):
    
    soup = BeautifulSoup(response.text,'lxml')
    temp_metadata = {}
    name = soup.find('h1').get_text()
    party = soup.find('h2').get_text()
    temp_metadata['name'] = name.strip()
    temp_metadata['party'] = party.strip()
    section = soup.find('section',{'class':'center'})
    
    children = list(section.findAll('p'))
    for child in children:
        try:
            # parsing using the split command, splitting by ':'
            key,value = get_attribute_split(child)
        except ValueError:
            try:
                # parsing using the split command, splitting by ' i '
                key,value = get_attribute_split(child,separator=' i ')
            except ValueError:
                continue # neither separator splits this paragraph in two
        temp_metadata[key] = value
    # parsing the two social media accounts using the href of the a tags
    # (first draft: the position of this paragraph is hardcoded)
    for atag in children[7].findAll('a'):
        key,value = atag.get_text(),atag['href']
        temp_metadata[key] = value
    return temp_metadata
temp_metadata = {}
section = soup.find('section',{'class':'center'})
children = list(section.findAll('p'))
for num,child in enumerate(children):
    # try the two split methods first
    # this is the child separated by ':'
    try:
        key,value = get_attribute_split(child)
        value = value.strip()
      #  print(key,value,child)
        temp_metadata[key] = value
        continue
    except:
        pass
    # find atags
    try:
        atags = child.findAll('a')
    except:
        atags = []
    if len(atags)>0:
        # this is the child that contains links
        for subchild in child.children:
            try:
                key,val = subchild.text,subchild['href']
               # print(key,val,child)
                if ' hos ' in key:
                    key = key.split('hos ')[1]
                if 'Wikipedia' in key:
                    key = 'Wikipedia'
                temp_metadata[key] = val
                continue
            except:
                pass
    # this is the child separated by ' i ' 
    try:
        key,value = get_attribute_split(child,separator=' i ')
      #  print(child,num,key,value)
        temp_metadata[key] = value.strip()
        continue
    except:
        pass
            
#    atags = child.findAll('a')
 #   if len(atags)>0:
  #      break

In [115]:
#section = soup.find('section',{'class':'center'})
#children = list(section.findAll('p'))
#for child in children:
#    print(child)
children[-3].findAll('a')

[<a href="https://www.twitter.com/-">Twitter</a>,
 <a href="https://www.facebook.com/corydonbjarne">Facebook</a>]

In [122]:
import bs4
def get_attribute_split(child,separator=':'):
    string = child.get_text()
    key,value = string.split(separator) # unpack the two-element list; raises ValueError otherwise
    return key.strip(),value.strip() # strip whitespace from start and end of the string
def get_metadata(response):
    
    soup = BeautifulSoup(response.text,'lxml')
    temp_metadata = {}
    name = soup.find('h1').get_text()
    party = soup.find('h2').get_text()
    temp_metadata['name'] = name.strip()
    temp_metadata['party'] = party.strip()
    section = soup.find('section',{'class':'center'})
    
    children = list(section.findAll('p'))
    for child in children:
        # try the two split methods first
        # this is the child separated by ':'
        try:
            key,value = get_attribute_split(child)
            value = value.strip()
            temp_metadata[key] = value
            continue
        except:
            pass
        # find atags
        try:
            atags = child.findAll('a')
        except:
            atags = []
        if len(atags)>0:
            # this is the child that contains links
            for subchild in child.children:
                try:
                    key,val = subchild.text,subchild['href']
                   # print(key,val,child)
                    if ' hos ' in key:
                        key = key.split('hos ')[1]
                    if 'Wikipedia' in key:
                        key = 'Wikipedia'
                    temp_metadata[key] = val
                    continue
                except:
                    pass
        # this is the child separated by ' i ' 
        try:
            key,value = get_attribute_split(child,separator=' i ')
            temp_metadata[key] = value.strip()
            continue
        except:
            pass
    return temp_metadata
datadict = {}
count = 0 # make an index instead of links
for party in party2links:
    links = party2links[party]
    for link in links:
        print('-'),
        time.sleep(0.5)
        response = requests.get(link)
        meta_data = get_metadata(response)
        meta_data['link'] = link
        datadict[count] = meta_data
        count+=1 # incrementing the counter



- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 

In [123]:
import pandas as pd
df = pd.DataFrame(datadict).T
df.to_pickle('folketingsvalg-2015_scraped.pkl')
df.head()

Unnamed: 0,Alder,Altinget,Bopæl,Bornholms storkreds,Danmarks Radio,DanskePolitikere.dk,Erik Lund,Facebook,Fyns storkreds,Google+,...,Søren Gade hos DanskePolitikere.dk Søren Gade,Titel,Twitter,Twitter Facebook Snapchat,Vestjyllands storkreds,Wikipedia,link,name,party,Østjyllands storkreds
0,Ukendt,http://www.altinget.dk/kandidater/ft15/Birgitt...,3770 Allinge,/kreds/bornholms-storkreds,,,,https://www.facebook.com/pages/Birgitte-Kjølle...,,,...,,Cand.mag.,https://www.twitter.com/BirgitteKP,,,,http://www.folketingsvalg-2015.dk/birgitte-kjo...,Birgitte Kjøller Pedersen,Socialistisk Folkeparti,
1,53,http://www.altinget.dk/kandidater/ft15/Leif-Olsen,3770 Allinge,/kreds/bornholms-storkreds,,,,,,,...,,Lektor,,,,,http://www.folketingsvalg-2015.dk/leif-olsen,Leif Olsen,Socialistisk Folkeparti,
2,59,http://www.altinget.dk/kandidater/ft15/Poul-Ov...,3700 Rønne,/kreds/bornholms-storkreds,,,,,,,...,,Konsulent,,,,,http://www.folketingsvalg-2015.dk/poul-overlun...,Poul Overlund-Sørensen,Socialistisk Folkeparti,
3,57,http://www.altinget.dk/kandidater/ft15/Annette...,5300 Kerteminde,,,http://www.danskepolitikere.dk/annette-vilhelmsen,,,/kreds/fyns-storkreds,,...,,Fhv. minister,https://www.twitter.com/a_vilhelmsen,,,http://da.wikipedia.org/wiki/Annette_Vilhelmsen,http://www.folketingsvalg-2015.dk/annette-vilh...,Annette Vilhelmsen,Socialistisk Folkeparti,
4,Ukendt,http://www.altinget.dk/kandidater/ft15/Carl-Va...,2450 København NV,,,,,https://www.facebook.com/CarlValentinSF,/kreds/fyns-storkreds,,...,,Phoner,https://www.twitter.com/Carl__Valentin,,,,http://www.folketingsvalg-2015.dk/carl-valentin,Carl Valentin,Socialistisk Folkeparti,


**Collect Facebook data from the politicians.**

In [124]:
len(df.Facebook.dropna()) # used to be 382

480

In [125]:
## First we need to grab the name or id from the facebook link. 
## Here we need to filter profiles from pages, since we can only
## access content from pages, as far as I know at least.

# Lets inspect the links 
df.Facebook.dropna().values

# So there are some patterns. 
# Some pages can be detected by /pages/ 
# Pages with /pages/ have an id we can grab directly
# If there is an id, it comes one forward slash / after the name: Benedikte-Ask-Skotte/536682579767128
# We need to make sure not to grab ?fref=ts

# Some private profiles can be detected by /people/, others by profile.php?
# profile_browser is not indicative of a profile, so don't use profile as split point.
# If we wanted to split by 'www.facebook.com/' we would miss this variation: da-dk.facebook.com

## So lets try to put it together
import re
import numpy as np
def get_ID(facebook_link):
    # first we check if it is a NaN value (missing values are floats)
    typ = type(facebook_link)
    if typ==float:
        return 'no-profile'
    # then we check for the /pages/
    if '/pages/' in facebook_link:
        pattern = '/([0-9]+)'
        # re.findall returns all matches; we keep the first one
        page_id = re.findall(pattern,facebook_link)[0]
        return page_id
    elif '/people/' in facebook_link:
        # private profiles cannot be queried, so we label them
        return 'private-profile'
    else: # then we will grab the name directly after facebook.com
        try:
            page_name = facebook_link.split('facebook.com/')[1].split('/')[0]
        except IndexError:
            print(facebook_link)
            return 'no-profile' # the link does not contain 'facebook.com/'
        return page_name
df['facebook_identifier'] = df.Facebook.apply(get_ID)
df.head()
#df[df['Facebook']=='http://-']

http://-


Unnamed: 0,Alder,Altinget,Bopæl,Bornholms storkreds,Danmarks Radio,DanskePolitikere.dk,Erik Lund,Facebook,Fyns storkreds,Google+,...,Titel,Twitter,Twitter Facebook Snapchat,Vestjyllands storkreds,Wikipedia,link,name,party,Østjyllands storkreds,facebook_identifier
0,Ukendt,http://www.altinget.dk/kandidater/ft15/Birgitt...,3770 Allinge,/kreds/bornholms-storkreds,,,,https://www.facebook.com/pages/Birgitte-Kjølle...,,,...,Cand.mag.,https://www.twitter.com/BirgitteKP,,,,http://www.folketingsvalg-2015.dk/birgitte-kjo...,Birgitte Kjøller Pedersen,Socialistisk Folkeparti,,123520157707593
1,53,http://www.altinget.dk/kandidater/ft15/Leif-Olsen,3770 Allinge,/kreds/bornholms-storkreds,,,,,,,...,Lektor,,,,,http://www.folketingsvalg-2015.dk/leif-olsen,Leif Olsen,Socialistisk Folkeparti,,no-profile
2,59,http://www.altinget.dk/kandidater/ft15/Poul-Ov...,3700 Rønne,/kreds/bornholms-storkreds,,,,,,,...,Konsulent,,,,,http://www.folketingsvalg-2015.dk/poul-overlun...,Poul Overlund-Sørensen,Socialistisk Folkeparti,,no-profile
3,57,http://www.altinget.dk/kandidater/ft15/Annette...,5300 Kerteminde,,,http://www.danskepolitikere.dk/annette-vilhelmsen,,,/kreds/fyns-storkreds,,...,Fhv. minister,https://www.twitter.com/a_vilhelmsen,,,http://da.wikipedia.org/wiki/Annette_Vilhelmsen,http://www.folketingsvalg-2015.dk/annette-vilh...,Annette Vilhelmsen,Socialistisk Folkeparti,,no-profile
4,Ukendt,http://www.altinget.dk/kandidater/ft15/Carl-Va...,2450 København NV,,,,,https://www.facebook.com/CarlValentinSF,/kreds/fyns-storkreds,,...,Phoner,https://www.twitter.com/Carl__Valentin,,,,http://www.folketingsvalg-2015.dk/carl-valentin,Carl Valentin,Socialistisk Folkeparti,,CarlValentinSF
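The patterns listed in the comments above (query strings such as `?fref=ts`, `profile.php?`, subdomains such as `da-dk.facebook.com`) can also be handled by parsing the URL instead of splitting strings. This is an alternative sketch for Python 3, not the function used in the rest of the notebook; `extract_identifier` is a hypothetical name, and the example links are composed from the fragments mentioned above.

```python
from urllib.parse import urlparse

def extract_identifier(link):
    """Return a page id or name from a Facebook link, or a label when none is found."""
    if not isinstance(link, str):
        return 'no-profile'                 # e.g. NaN for missing values
    parsed = urlparse(link)
    if not parsed.netloc.endswith('facebook.com'):
        return 'no-profile'                 # e.g. 'http://-'
    # the path never contains the query string, so '?fref=ts' is dropped here
    parts = [p for p in parsed.path.split('/') if p]
    if not parts:
        return 'no-profile'
    if parts[0] in ('people', 'profile.php'):
        return 'private-profile'
    if parts[0] == 'pages':
        ids = [p for p in parts[1:] if p.isdigit()]
        return ids[-1] if ids else 'no-profile'
    return parts[0]

examples = [
    'https://www.facebook.com/CarlValentinSF',
    'https://www.facebook.com/frank.jensen?fref=ts',
    'https://da-dk.facebook.com/pages/Benedikte-Ask-Skotte/536682579767128',
    'https://www.facebook.com/profile.php?id=100002941361500',
    'http://-',
    float('nan'),
]
results = [extract_identifier(link) for link in examples]
print(results)
```

Because it treats `profile.php` links as private profiles and strips query strings, this version would produce slightly different identifiers than `get_ID`, so the counts further down would change if you swapped it in.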


In [47]:
pageid = df.facebook_identifier[4]
pageid

'CarlValentinSF'

In [126]:
# Define the token
token = 'EAACEdEose0cBADqAKPmf28ZAOdZAkpyPZBOoHZCheG0ud8d9XfZCXbdZBBI6v9iZAeKawwsoZApBTeu9F6iOhSxaoBkPfZBCLSGeETC0zcjmn0CfDZB8ZBsuoVVC39fDuHLmZBoMHXCT7uonhGD78yBr8V1HkFfggVjRFiBWwKF5kxjJYkJwBIZAYNjpJ1NEPyYG9TxZB2LEytjZC2AGMe80Ai9QazMehn6WuT5YjxCsRtd9YuAIQZDZD'
## Then we need to build a query
base_url = 'https://graph.facebook.com/v2.10/%s' # insert pageid using % pageid
## Here we defined the target properties of the page.
metadata_fields = 'name,id,website,were_here_count,category,description,about,fan_count,bio,general_info,is_community_page,location'
# here we define the properties of the central node: Feed. 
data_fields = 'feed.limit(100){from,status_type,message_tags,to,message,type,story,link,shares,updated_time,created_time,likes.limit(1).summary(true),comments.limit(1).summary(true),reactions.limit(1).summary(true)}'
# Here we combine the two field lists
fields = '?fields=%s,%s'%(metadata_fields,data_fields)
# Authentication
authentification = '&access_token=%s'%token


# Now lets combine the different elements into an example query
pageid = df.facebook_identifier[0]
q = base_url%pageid+fields+authentification
print(q)



https://graph.facebook.com/v2.10/123520157707593?fields=name,id,website,were_here_count,category,description,about,fan_count,bio,general_info,is_community_page,location,feed.limit(100){from,status_type,message_tags,to,message,type,story,link,shares,updated_time,created_time,likes.limit(1).summary(true),comments.limit(1).summary(true),reactions.limit(1).summary(true)}&access_token=EAACEdEose0cBADqAKPmf28ZAOdZAkpyPZBOoHZCheG0ud8d9XfZCXbdZBBI6v9iZAeKawwsoZApBTeu9F6iOhSxaoBkPfZBCLSGeETC0zcjmn0CfDZB8ZBsuoVVC39fDuHLmZBoMHXCT7uonhGD78yBr8V1HkFfggVjRFiBWwKF5kxjJYkJwBIZAYNjpJ1NEPyYG9TxZB2LEytjZC2AGMe80Ai9QazMehn6WuT5YjxCsRtd9YuAIQZDZD
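Long field strings like the one above are easy to mistype. A minimal sketch of composing them from lists follows; `expand` and `summary` are hypothetical helpers that only build strings in the field-expansion syntax used in this notebook, they do not call the API.

```python
def expand(edge, subfields, limit=None):
    """Build a field-expansion string such as 'feed.limit(10){message,id}'."""
    modifier = '.limit(%d)' % limit if limit is not None else ''
    return '%s%s{%s}' % (edge, modifier, ','.join(subfields))

def summary(edge):
    """Ask only for the summary counts of an edge, not for its content."""
    return '%s.limit(1).summary(true)' % edge

feed = expand('feed',
              ['message', 'created_time', summary('likes'), summary('comments')],
              limit=100)
composed_fields = ','.join(['name', 'id', 'fan_count', feed])
print(composed_fields)
```

Keeping the subfields in a list also makes it simple to drop a field that the API rejects for a given page.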


In [127]:
# Now collecting the raw data
# Lets initialize the ratelimiting function first
#############################
# Intelligent rate limiting. 
import time
logsize = 360
timestamps = [time.time()]*logsize
requestrate = 1 # minimum average number of seconds per call. If rate-limiting errors are thrown, it is increased.
ratelimiterrors = 0 # number of rate-limiting errors seen since the last adjustment
def ratelimit():
    global timestamps
    global requestrate
    global ratelimiterrors
    servertime = time.time()
    timestamps.append(servertime)
    # average number of seconds between the last `logsize` calls
    average_request_rate = (servertime-timestamps[0])/logsize
    if ratelimiterrors>0: # adopt a slower rate
        time.sleep(1800)
        requestrate+=0.01
        print('updating requestrate to %r'%requestrate)
        ratelimiterrors = 0
    if average_request_rate < requestrate: # check if we are calling too fast
        time.sleep(1)
    timestamps.pop(0) # remove first timestamp

#############################
# And our reliable get function
def get(url,iterations=10,exceptions=(Exception,)): # possible exceptions: HTTPError,ConnectionError
    """ iterations : Define number of iterations before giving up. 
        exceptions: Define which exceptions you accept, default is all. 
    """
    for iteration in range(iterations):
        try:
            # add ratelimit function call here
            ratelimit() # !!
            response = requests.get(url).json()
            return response # if successful it will end the iterations here
        except exceptions as e: #  find exceptions in the request library requests.exceptions
            print(e) # print or log the exception message.
    return None # callers must check for None, otherwise the code will crash later on.
#############################
# Create a directory for the raw data and initialize simple logging.

retries = [] # For connection errors
done = set() # For logging what we have already collected
errors = {} # For logging which returned error messages
import os
path = 'raw_facebookdata/'
if not os.path.exists(path): # check if directory exists
    os.makedirs(path) # if not make directory
done = set(os.listdir(path))    
# Filter the ones we do not have profile ids on.    
subdf = df[df.facebook_identifier.apply(lambda x: '-profile' not in x)]
print(len(subdf))
# Import module to dump the raw data with
import json

474
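The `get` function above retries immediately after a failure. A common variation is to wait longer after each failed attempt. The sketch below shows that idea under assumptions: `get_with_retry` is a hypothetical helper, the `fetch` argument stands in for something like `lambda url: requests.get(url).json()`, and the doubling delay is one choice among many.

```python
import time

def get_with_retry(fetch, url, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url) until it succeeds; wait longer after each failure."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except Exception as e:          # narrow this to the errors you expect
            print('attempt %d failed: %s' % (attempt + 1, e))
            sleep(base_delay * 2 ** attempt)
    return None                         # the caller has to check for None

# a fake fetch function that fails twice, so the example runs offline
calls = []
def flaky_fetch(url):
    calls.append(url)
    if len(calls) < 3:
        raise IOError('temporary failure')
    return {'ok': True}

delays = []                             # record the waits instead of sleeping
result = get_with_retry(flaky_fetch, 'http://example.com', sleep=delays.append)
print(result, delays)
```

Passing `sleep` as an argument is only there to make the example testable; in real use the default `time.sleep` applies.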


In [128]:

for num,pageid in enumerate(subdf.facebook_identifier):
    if pageid in done or pageid in errors: # skip what we already have if we were to run it more than once.
        continue
    q = base_url%pageid+fields+authentification
    response = get(q)
    if response is None: # connection errors
        retries.append(pageid) 
        continue # Go to the next pageid
    if 'error' in response: # error message from facebook
        errors[pageid] = response # keep the error message
        print(response['error'])
        continue # Go to the next pageid
    filename = path+pageid
    with open(filename,'w') as f: # close the file once the raw data is written
        json.dump(response,f)
    done.add(pageid)
    if num%5==0:
        print('Now %d pages left to collect.\n%d Errors\t%d Done'%(len(subdf)-num,len(errors),len(done)))

    

{u'message': u"Unsupported get request. Object with ID '123520157707593' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api", u'code': 100, u'type': u'GraphMethodException', u'fbtrace_id': u'AgoHNN64dDO'}
{u'message': u'(#803) Cannot query users by their username (arly.eskildsen)', u'code': 803, u'type': u'OAuthException', u'fbtrace_id': u'EB32qEkWMpq'}
{u'message': u"Unsupported get request. Object with ID '469380419835798' does not exist, cannot be loaded due to missing permissions, or does not support this operation. Please read the Graph API documentation at https://developers.facebook.com/docs/graph-api", u'code': 100, u'type': u'GraphMethodException', u'fbtrace_id': u'Arifq74K/9m'}
{u'message': u'(#803) Cannot query users by their username (dorete.dandanell)', u'code': 803, u'type': u'OAuthException', u'fbtrace_id': u'G/guHianxxc'}
{u'message': u

In [129]:
print('%d Errors\t%d Done'%(len(errors),len(done)))

262 Errors	214 Done
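With this many errors it helps to see which kinds dominate before deciding what to fix. A minimal sketch, assuming the `errors` dictionary has the shape built above (`{pageid: response}` with an `'error'` entry); `tally_errors` is a hypothetical helper and `sample_errors` is a shortened stand-in based on the messages printed earlier.

```python
from collections import Counter

def tally_errors(errors):
    """Count the error codes in a dict of {pageid: response}."""
    return Counter(resp['error'].get('code') for resp in errors.values())

# shortened stand-in for the real `errors` dictionary
sample_errors = {
    '123520157707593': {'error': {'code': 100, 'type': 'GraphMethodException'}},
    'arly.eskildsen': {'error': {'code': 803, 'type': 'OAuthException'}},
    'dorete.dandanell': {'error': {'code': 803, 'type': 'OAuthException'}},
}
counts = tally_errors(sample_errors)
print(counts.most_common())
```

In the output above, code 803 goes with the message "Cannot query users by their username", which points at profiles rather than pages.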


In [None]:
# find the migrated ones in the error message

# find the ids that didn't work, 

In [130]:
ids = {}
for i in done:
    filename = path+i
    try:
        with open(filename,'r') as f:
            response = json.load(f)
    except (IOError,ValueError): # file missing or not valid JSON
        print('-'),
        continue
    
    ids[i] = response['id']


In [131]:
def fb_id(identifier):
    # look up the numeric id; identifiers we could not collect become NaN
    return ids.get(identifier,np.nan)
df['fb_id'] = df.facebook_identifier.apply(fb_id)
df.head()

Unnamed: 0,Alder,Altinget,Bopæl,Bornholms storkreds,Danmarks Radio,DanskePolitikere.dk,Erik Lund,Facebook,Fyns storkreds,Google+,...,Twitter,Twitter Facebook Snapchat,Vestjyllands storkreds,Wikipedia,link,name,party,Østjyllands storkreds,facebook_identifier,fb_id
0,Ukendt,http://www.altinget.dk/kandidater/ft15/Birgitt...,3770 Allinge,/kreds/bornholms-storkreds,,,,https://www.facebook.com/pages/Birgitte-Kjølle...,,,...,https://www.twitter.com/BirgitteKP,,,,http://www.folketingsvalg-2015.dk/birgitte-kjo...,Birgitte Kjøller Pedersen,Socialistisk Folkeparti,,123520157707593,
1,53,http://www.altinget.dk/kandidater/ft15/Leif-Olsen,3770 Allinge,/kreds/bornholms-storkreds,,,,,,,...,,,,,http://www.folketingsvalg-2015.dk/leif-olsen,Leif Olsen,Socialistisk Folkeparti,,no-profile,
2,59,http://www.altinget.dk/kandidater/ft15/Poul-Ov...,3700 Rønne,/kreds/bornholms-storkreds,,,,,,,...,,,,,http://www.folketingsvalg-2015.dk/poul-overlun...,Poul Overlund-Sørensen,Socialistisk Folkeparti,,no-profile,
3,57,http://www.altinget.dk/kandidater/ft15/Annette...,5300 Kerteminde,,,http://www.danskepolitikere.dk/annette-vilhelmsen,,,/kreds/fyns-storkreds,,...,https://www.twitter.com/a_vilhelmsen,,,http://da.wikipedia.org/wiki/Annette_Vilhelmsen,http://www.folketingsvalg-2015.dk/annette-vilh...,Annette Vilhelmsen,Socialistisk Folkeparti,,no-profile,
4,Ukendt,http://www.altinget.dk/kandidater/ft15/Carl-Va...,2450 København NV,,,,,https://www.facebook.com/CarlValentinSF,/kreds/fyns-storkreds,,...,https://www.twitter.com/Carl__Valentin,,,,http://www.folketingsvalg-2015.dk/carl-valentin,Carl Valentin,Socialistisk Folkeparti,,CarlValentinSF,323782467803321.0


In [132]:
df.to_pickle('folketingsvalg-2015_scraped.pkl')

In [133]:
df.to_csv('folketingsvalg-2015_scraped.csv',index=False, encoding='utf-8')

In [134]:
print(len(ids))
with open('actual_politician_ids','w') as f: # the file is closed automatically
    f.write(','.join(ids.values()))

214


In [None]:
import pandas as pd
df = pd.DataFrame(list(zip([str(i) for i in range(10)],range(10))))
df

In [1]:
import twitter

In [2]:
CONSUMER_KEY="mDHKLc1ldQig7Llj5BJkA"
CONSUMER_SECRET="MPCft5fjuVOb50XUNlzD6IIRgDhlfJz3QZTYacQVE"
OAUTH_TOKEN="1500643860-kD2jn5dQNmV5CCp7kLIOnmsHpE9FQMh52VB1fej"
OAUTH_TOKEN_SECRET="3ETsun20hwJQVKbDLflxDXUioVb5jEDYzLFNAdyCLw"




In [17]:
api.GetFollowersPaged(screen_name='singularity_net',cursor=-1,)

(1592124152979842452,
 0,
 [User(ID=3277520330, ScreenName=rajpootankush1),
  User(ID=276426853, ScreenName=fashionizers),
  User(ID=749896990318792704, ScreenName=KolesCoinNews),
  User(ID=1403190542, ScreenName=divars777),
  User(ID=1396351814, ScreenName=hbakhtiyor),
  User(ID=961916428931629056, ScreenName=W0eaPA9SQVTUFgN),
  User(ID=950883720885686272, ScreenName=DrHeath3),
  User(ID=3502985008, ScreenName=mjzerah1),
  User(ID=963366229430996992, ScreenName=ciamadziara),
  User(ID=811556723639123968, ScreenName=CryptonomosICO),
  User(ID=602791035, ScreenName=investor229),
  User(ID=952819104947224576, ScreenName=codingzhou),
  User(ID=2370259905, ScreenName=TradeGortat),
  User(ID=944513490903031808, ScreenName=UnitedFans_co),
  User(ID=897861355411079169, ScreenName=karannagpal788),
  User(ID=946484362467520513, ScreenName=CoinAzo),
  User(ID=884778625710645248, ScreenName=ant1_dur),
  User(ID=940448897813458945, ScreenName=zigi190315),
  User(ID=950999482501656576, ScreenName=K

In [3]:
api = twitter.Api(consumer_key=CONSUMER_KEY,
                      consumer_secret=CONSUMER_SECRET,
                      access_token_key=OAUTH_TOKEN,
                      access_token_secret=OAUTH_TOKEN_SECRET)

In [13]:
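import time
responses = []
cursor = -1 # Twitter cursoring starts at -1
failures = 0
while failures < 10: # give up after repeated errors instead of looping forever
    time.sleep(1)
    try:
        cursor,previous,response = api.GetFollowerIDsPaged(screen_name='singularity_net',cursor=cursor)
    except Exception as e:
        print(e)
        failures += 1
        continue
    responses+=response
    if cursor==0: # a next cursor of 0 means there are no more pages
        break
    else:
        print(cursor)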
    

NameError: name 'TwitterError' is not defined
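Cursor-based paging follows the same pattern regardless of the API, so it can be wrapped in a small generator. This is a sketch under assumptions: `iter_cursor_pages` is a hypothetical helper, and it expects a `fetch_page(cursor)` function returning `(next_cursor, previous_cursor, items)`, which is the shape of the tuple shown in the output above. The stop condition (a next cursor of 0) is how Twitter's cursoring is documented, but check it against the library version you use.

```python
def iter_cursor_pages(fetch_page, start_cursor=-1, max_pages=1000):
    """Yield the items of each page until the next cursor is 0."""
    cursor = start_cursor
    for _ in range(max_pages):
        cursor, _previous, items = fetch_page(cursor)
        yield items
        if cursor == 0:         # no more pages
            break

# a fake API keyed by cursor, so the example runs without credentials
fake_pages = {
    -1: (10, 0, [1, 2]),
    10: (20, -10, [3]),
    20: (0, -20, [4, 5]),
}
follower_ids = [i for page in iter_cursor_pages(fake_pages.get) for i in page]
print(follower_ids)
```

With the real client, `fetch_page` could be something like `lambda c: api.GetFollowerIDsPaged(screen_name='singularity_net', cursor=c)`, combined with a sleep long enough to respect the rate limit.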