# Scrapy
- Scrapy is a Python library that handles much of the complexity of finding and evaluating links on a website, crawling domains or lists of domains with ease.

In [None]:
 pip install scrapy

### What’s the difference between an API and a regular website? 

- Despite the hype around APIs, the answer is often: not much. APIs function via HTTP, the same protocol used to fetch data for websites, download a file, and do almost anything else on the Internet.

-  The only thing that makes an API an API is the highly regulated syntax it uses, and the fact that APIs present their data as JSON or XML rather than HTML.

### Methods
There are four ways to request information from a web server via HTTP:

  - <b>GET</b> : GET is what you use when you visit a website through the address bar in your browser. You can think of GET as saying, “Hey, web server, please get me this information.”

  - <b>POST</b> : POST is what you use when you fill out a form, or submit information, presumably to a backend script on the server. Every time you log into a website, you are making a POST request with your username and (hopefully) encrypted password. <br>
      
      If you are making a POST request with an API, you are saying, “Please store this information in your database.”

  - <b>PUT </b> :  PUT is less commonly used when interacting with websites, but is used from time to time in APIs. A PUT request is used to update an object or information. <br>
  
      An API might require a POST request to create a new user, for example, but it might need a PUT request if you want to update that user’s email address.


  - <b>DELETE</b> : DELETE is straightforward; it is used to delete an object. For instance, if I send a DELETE request to http://myapi.com/user/23, it will delete the user with the ID 23. 
  
  DELETE methods are not often encountered in public APIs, which are primarily created to disseminate information rather than allow random users to remove that information from their databases. However, like the PUT method, it’s a good one to know about.
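
The four methods can be sketched with the standard library's `urllib.request.Request`, which accepts a `method` argument. This is only an illustration: the `myapi.com` endpoint is the hypothetical one from the DELETE example, the payloads are made up, and the requests are built but never sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, borrowed from the DELETE example above.
BASE_URL = "http://myapi.com/user"

def build_request(method, url, payload=None):
    """Build (but do not send) an HTTP request that uses the given method."""
    data = None
    headers = {}
    if payload is not None:
        # A request body must be bytes; here the payload is encoded as JSON.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(url, data=data, headers=headers, method=method)

requests_to_send = [
    build_request("GET", BASE_URL + "/23"),                               # read user 23
    build_request("POST", BASE_URL, {"name": "Ada"}),                     # create a user
    build_request("PUT", BASE_URL + "/23", {"email": "ada@example.com"}), # update user 23
    build_request("DELETE", BASE_URL + "/23"),                            # delete user 23
]

for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data)
```

Passing any of these objects to `urllib.request.urlopen` would actually send it, which only makes sense against a real API that you are allowed to modify.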


### Responses

- An important feature of APIs is that they have well-formatted responses. The most common types of response formatting are eXtensible Markup Language (XML) and JavaScript Object Notation (JSON).

- In recent years, JSON has become vastly more popular than XML for a couple of major reasons. First, JSON files are generally smaller than well-designed XML files.

- Another reason JSON is quickly becoming more popular than XML is simply due to a shift in web technologies. In the past, it was more common for a server-side script such as PHP or .NET to be on the receiving end of an API.

- Nowadays, it is likely that a framework, such as Angular or Backbone, will be sending and receiving API calls. 

- Server-side technologies are somewhat agnostic as to the form in which their data comes. But JavaScript libraries like Backbone find JSON easier to handle.

- Although most APIs still support XML output, we will be using JSON examples in this book. Regardless, it is a good idea to familiarize yourself with both if you haven’t already — they are unlikely to go away any time soon.
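
To make the size and handling difference concrete, here is a small sketch that encodes one record both ways and parses each with the standard library. The record itself is hypothetical and chosen only for illustration.

```python
import json
import xml.etree.ElementTree as ET

# The same hypothetical record, encoded as JSON and as XML.
json_text = '{"user":{"id":23,"name":"Ada","email":"ada@example.com"}}'
xml_text = "<user><id>23</id><name>Ada</name><email>ada@example.com</email></user>"

print(len(json_text), "characters of JSON")
print(len(xml_text), "characters of XML")

# JSON maps directly onto Python types: the id arrives as an int.
from_json = json.loads(json_text)["user"]

# XML carries no type information: every element's text is a string.
root = ET.fromstring(xml_text)
from_xml = {child.tag: child.text for child in root}

print(from_json)
print(from_xml)
```

The XML version repeats every field name in a closing tag, which is where most of the extra length comes from.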

### Echo Nest

- The Echo Nest is a fantastic example of a company that is built on web scrapers. Although some music-based companies, such as Pandora, depend on human intervention to categorize and annotate music, The Echo Nest relies on automated intelligence and information scraped from blogs and news articles to categorize musical artists, songs, and albums.

- Even better, this API is freely available for noncommercial use. You cannot use the API without a key, but you can obtain a key by going to The Echo Nest “Create an Account” page and registering with a name, email address, and username.

#### A Few Examples

- The Echo Nest API is built around several basic content types: artists, songs, tracks, and genres. Except for genres, these content types all have unique IDs, which are used to retrieve information about them in various forms, through API calls. 

- For example, if I wanted to retrieve a list of songs performed by Monty Python, I would make the following call to retrieve their ID (remember to replace <your api key> with your own API key):


- http://developer.echonest.com/api/v4/artist/search?api_key=<your api key>&name=monty%20python
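
Rather than escaping the query string by hand, the same URL can be assembled with `urllib.parse`. This is a minimal sketch: `artist_search_url` is a hypothetical helper name, `YOUR_API_KEY` is a placeholder, and the code only builds the URL without calling the service.

```python
from urllib.parse import urlencode, quote

ARTIST_SEARCH = "http://developer.echonest.com/api/v4/artist/search"

def artist_search_url(api_key, name):
    """Build the artist search URL for the given key and artist name."""
    params = {"api_key": api_key, "name": name}
    # quote_via=quote encodes spaces as %20; the default (quote_plus) uses "+".
    return ARTIST_SEARCH + "?" + urlencode(params, quote_via=quote)

url = artist_search_url("YOUR_API_KEY", "monty python")
print(url)
```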


### Twitter

- Twitter is notoriously protective of its API and rightfully so. With over 230 million active users and a revenue of over $100 million a month, the company is hesitant to let just anyone come along and have any data they want.

- Twitter’s rate limits (the number of calls it allows each user to make) fall into two categories: 15 calls per 15-minute period, and 180 calls per 15-minute period, depending on the type of call. For instance, you can make up to 12 calls a minute to retrieve basic information about Twitter users, but only one call a minute to retrieve lists of those users’ Twitter followers.
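
The per-minute figures follow directly from the two window sizes. The arithmetic, plus the minimum delay a polite client would need between calls, can be sketched as follows (the function names are illustrative, not part of any Twitter library):

```python
WINDOW_MINUTES = 15

def calls_per_minute(calls_per_window, window_minutes=WINDOW_MINUTES):
    """Average number of calls allowed per minute within one window."""
    return calls_per_window / window_minutes

def min_seconds_between_calls(calls_per_window, window_minutes=WINDOW_MINUTES):
    """Delay between evenly spaced calls that just stays inside the limit."""
    return window_minutes * 60 / calls_per_window

for limit in (15, 180):
    print(limit, "calls per window:",
          calls_per_minute(limit), "per minute,",
          min_seconds_between_calls(limit), "seconds apart")
```

A scraper could call `time.sleep` with the second value between requests to stay under the limit without having to track the window itself.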

### Google API

- Google has one of the most comprehensive, easy-to-use collections of APIs on the Web today. Whenever you’re dealing with some sort of basic subject, such as language translation, geolocation, calendars, or even genomics, Google has an API for it. 

- Google also has APIs for many of its popular apps, such as Gmail, YouTube, and Blogger.

#### A Few Examples

- Google’s most popular (and in my opinion most interesting) APIs can be found in its collection of Maps APIs. You might be familiar with this feature through the embeddable Google Maps found on many websites. 

- However, the Maps APIs go far beyond embedding maps — you can resolve street addresses to latitude/longitude coordinates, get the elevation of any point on Earth, create a variety of location-based visualizations, and get time zone information for an arbitrary location, among other bits of information.


- https://maps.googleapis.com/maps/api/geocode/json?address=1+Science+Park+Boston+MA+02114&key=<your API key>

- To get the time zone information for our newly found latitude and longitude, you can use the Time zone API:

- https://maps.googleapis.com/maps/api/timezone/json?location=42.3677994,-71.0708078&timestamp=1412649030&key=<your API key>
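
Both URLs can be generated from their parts, which avoids stray spaces and makes the `timestamp` parameter easier to read. The sketch below assumes the timestamp is a count of seconds since the Unix epoch; `geocode_url` and `timezone_url` are hypothetical helper names, `YOUR_API_KEY` is a placeholder, and nothing is sent to Google.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

MAPS_API = "https://maps.googleapis.com/maps/api"

def geocode_url(address, key):
    # urlencode's default quoting turns spaces into "+", as in the URL above.
    return MAPS_API + "/geocode/json?" + urlencode({"address": address, "key": key})

def timezone_url(lat, lng, timestamp, key):
    params = {"location": f"{lat},{lng}", "timestamp": timestamp, "key": key}
    # safe="," keeps the comma between latitude and longitude unescaped.
    return MAPS_API + "/timezone/json?" + urlencode(params, safe=",")

print(geocode_url("1 Science Park Boston MA 02114", "YOUR_API_KEY"))
print(timezone_url(42.3677994, -71.0708078, 1412649030, "YOUR_API_KEY"))

# The timestamp from the example URL, shown as a UTC date and time.
print(datetime.fromtimestamp(1412649030, tz=timezone.utc).isoformat())
```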


## Parsing JSON

In [None]:
import json
from urllib.request import urlopen

def getCountry(ipAddress):

  response = urlopen("https://freegeoip.app/json/"+ipAddress).read()\
                                                             .decode('utf-8') 
  responseJson = json.loads(response)

  return responseJson.get("country_code")

In [None]:
print(getCountry("50.78.253.58"))

US


- The JSON parsing library used is part of Python’s core library. Just add `import json` at the top, and you’re all set! 

- Unlike many languages that might parse JSON into a special JSON object or JSON node, Python uses a more flexible approach and turns JSON objects into dictionaries, JSON arrays into lists, JSON strings into strings, and so forth. 

- This makes it extremely easy to access and manipulate values stored in JSON.

In [None]:
import json

jsonString = '''{"arrayOfNums":[{"number":0},{"number":1},{"number":2}],
                "arrayOfFruits":[{"fruit":"apple"},{"fruit":"banana"},
                                 {"fruit":"pear"}]}'''


jsonObj = json.loads(jsonString)


In [None]:
print(jsonObj.get("arrayOfNums")) 
print(jsonObj.get("arrayOfNums")[1]) 
print(jsonObj.get("arrayOfNums")[1].get("number")+
      jsonObj.get("arrayOfNums")[2].get("number")) 

print(jsonObj.get("arrayOfFruits")[2].get("fruit"))

[{'number': 0}, {'number': 1}, {'number': 2}]
{'number': 1}
3
pear
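
The type mapping described above can be checked directly. This short sketch parses one document containing each JSON value type and prints the Python type it becomes; the key names are arbitrary.

```python
import json

doc = json.loads('{"obj": {"a": 1}, "arr": [1, 2], "str": "hi", '
                 '"int": 3, "float": 2.5, "bool": true, "null": null}')

# Each JSON value type lands on a plain built-in Python type.
for key, value in doc.items():
    print(key, "->", type(value).__name__)

# Because the result is ordinary Python data, it serializes straight back.
round_trip = json.loads(json.dumps(doc))
print(round_trip == doc)
```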


### Bringing It All Back Home

- Although the raison d'être of many modern web applications is to take existing data and format it in a more appealing way, I would argue that this isn’t a very interesting thing to do in most instances. 

- If you’re using an API as your only data source, the best you can do is merely copy someone else’s database that already exists, and which is, essentially, already published. 

- What can be far more interesting is to take two or more data sources and combine them in a novel way, or use an API as a tool to look at scraped data from a new perspective.

#### Creating a basic script that crawls Wikipedia, looks for revision history pages, and then looks for IP addresses on those revision history pages

In [27]:
from urllib.request import urlopen 
from bs4 import BeautifulSoup 
import datetime
import random
import json
import re

# random.seed() rejects datetime objects on Python 3.11+, so pass a float timestamp
random.seed(datetime.datetime.now().timestamp()) 

def getLinks(articleUrl):
  html = urlopen("http://en.wikipedia.org"+articleUrl)
  bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly

  return bsObj.find("div", {"id":"bodyContent"}).findAll("a",
                                      href=re.compile("^(/wiki/)((?!:).)*$"))

In [28]:
def getHistoryIPs(pageUrl):

  #Format of revision history pages is: 
  #http://en.wikipedia.org/w/index.php?title=Title_in_URL&action=history 

  pageUrl = pageUrl.replace("/wiki/", "")
  historyUrl = "http://en.wikipedia.org/w/index.php?title="+ \
                                                      pageUrl+"&action=history" 
  print("history url is: "+historyUrl)

  html = urlopen(historyUrl)
  bsObj = BeautifulSoup(html, "html.parser")  # name the parser explicitly

  #finds only the links with class "mw-anonuserlink" which has IP addresses 
  #instead of usernames
  ipAddresses = bsObj.findAll("a", {"class":"mw-anonuserlink"}) 
  addressList = set()

  for ipAddress in ipAddresses:
    addressList.add(ipAddress.get_text())

  return addressList

In [None]:
links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0): 
  for link in links:
    print("-------------------")
    historyIPs = getHistoryIPs(link.attrs["href"]) 
    for historyIP in historyIPs:
      print(historyIP)

  newLink = links[random.randint(0, len(links)-1)].attrs["href"]
  links = getLinks(newLink)

- This code also uses a somewhat arbitrary (yet effective for the purposes of this example) search pattern to look for articles from which to retrieve revision histories. 

- It starts by retrieving the histories of all Wikipedia articles linked to by the starting page (in this case, the article on the Python programming language). 

- Afterward, it selects a new starting page randomly, and retrieves all revision history pages of articles linked to by that page. It will continue until it hits a page with no links.
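
The two string-handling steps in the crawler, filtering for article links and building the revision history URL, can be tried without touching the network. In the sketch below the candidate hrefs are made-up examples and `history_url` is a hypothetical helper that mirrors the logic inside `getHistoryIPs`.

```python
import re

# Same pattern as getLinks: paths under /wiki/ that contain no colon.
ARTICLE_LINK = re.compile("^(/wiki/)((?!:).)*$")

def history_url(page_url):
    """Turn an article path such as /wiki/Title into its revision history URL."""
    title = page_url.replace("/wiki/", "")
    return "http://en.wikipedia.org/w/index.php?title=" + title + "&action=history"

candidates = [
    "/wiki/Guido_van_Rossum",      # article link: kept
    "/wiki/Category:Programming",  # contains a colon: skipped
    "/wiki/File:Python_logo.svg",  # contains a colon: skipped
    "//www.python.org/",           # does not start with /wiki/: skipped
]

articles = [href for href in candidates if ARTICLE_LINK.match(href)]
for href in articles:
    print(history_url(href))
```

The colon check is what keeps the crawler away from namespaced pages such as categories and files, which do not behave like ordinary articles.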

In [32]:
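from urllib.error import HTTPError  # needed for the except clause below

def getCountry(ipAddress): 
  try:
    response = urlopen("https://freegeoip.app/json/"+ipAddress).read()\
                                                                .decode('utf-8')
  except HTTPError: 
    # the lookup failed, so report that no country is known for this address
    return None
  
  responseJson = json.loads(response) 
  return responseJson.get("country_name")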

In [None]:
links = getLinks("/wiki/Python_(programming_language)")

while(len(links) > 0): 
  for link in links:
    print("-------------------")
    historyIPs = getHistoryIPs(link.attrs["href"]) 

    for historyIP in historyIPs:
      country = getCountry(historyIP) 
      if country is not None:
        print(historyIP+" is from "+country)

  newLink = links[random.randint(0, len(links)-1)].attrs["href"]
  links = getLinks(newLink)
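
One possible next step is to count edits per country instead of printing each address. The sketch below is an assumption about how that might look, not part of the original script: `tally_countries` is a hypothetical helper, and the lookup table stands in for `getCountry` so the example runs offline. The addresses come from ranges reserved for documentation and the countries assigned to them are made up.

```python
from collections import Counter

def tally_countries(ip_addresses, lookup):
    """Count how many addresses map to each country, skipping failed lookups."""
    counts = Counter()
    for ip in ip_addresses:
        country = lookup(ip)
        if country is not None:
            counts[country] += 1
    return counts

# Offline stand-in for getCountry: made-up countries for reserved addresses.
fake_lookup = {
    "192.0.2.1": "Canada",
    "192.0.2.2": "Canada",
    "198.51.100.7": "Japan",
}.get

counts = tally_countries(
    ["192.0.2.1", "192.0.2.2", "198.51.100.7", "not-an-ip"], fake_lookup)
print(counts.most_common())
```

In the crawler, `lookup` would be `getCountry` and `ip_addresses` would be the set returned by `getHistoryIPs`.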