# Web Scraping

Web scraping is a valuable tool for gathering information from the internet for data analysis projects.

Today, I will demonstrate gathering information from the [International Space Station API at Open Notify](http://open-notify.org/) using Python's `requests` library and JSON.

I will also be gathering data from [GitHub](http://www.github.com) through its API.

## Web Scraping Status Codes

Every response carries an HTTP status code, which is helpful for debugging. Here are the most common codes:


    200 - Everything went okay, and the server returned a result (if any).
    201 - The request has been fulfilled and has resulted in one or more resources being created.
    204 - The server has successfully fulfilled the request and there is no additional content to send in the response body.
    301 - The server is redirecting you to a different endpoint. This can happen when a company switches domain names, or when an endpoint's name has changed.
    400 - The server thinks you made a bad request. This can happen when you don't send the information the API requires to process your request (among other things).
    401 - The server thinks you're not authenticated. This happens when you don't send the right credentials to access an API.
    403 - The resource you're trying to access is forbidden, and you don't have the right permissions to see it.
    404 - The server didn't find the resource you tried to access.
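
The codes listed above can also be looked up programmatically. Here is a minimal sketch using Python's standard `http.HTTPStatus` enum; the helper names `describe_status` and `is_success` are my own and are not part of `requests`.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short label such as '404 Not Found' for a status code."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    return f"{status.value} {status.phrase}"


def is_success(code):
    """Codes in the 2xx range mean the request was accepted."""
    return 200 <= code < 300


for code in (200, 201, 204, 301, 400, 401, 403, 404):
    print(describe_status(code), "| success:", is_success(code))
```

A check like `is_success(response.status_code)` is a simple guard to run before trying to parse a response body.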


## Web Scraping the International Space Station

In [1]:
import requests
import time

In [2]:
response = requests.get("http://api.open-notify.org/iss-now.json")
response.status_code

200

In [3]:
requests.get('http://api.open-notify.org/iss-pass.json').status_code

400

The `iss-pass.json` endpoint requires two parameters, `lat` and `lon`, which is why the request above returned a 400. [The website](http://open-notify.org/Open-Notify-API/ISS-Pass-Times/) documents this.

In [4]:
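# Pass the latitude and longitude as query parameters.
# Note: St. Louis sits near latitude 38.627; the 36.270 below is the value behind the output shown.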
st_louis = {'lat': 36.270, 'lon': -90.1994}
response1 = requests.get('http://api.open-notify.org/iss-pass.json', params=st_louis)
response1.status_code

200
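
The `params` dictionary ends up in the URL's query string. As a rough illustration (using the standard library's `urlencode`, not the internals of `requests`), the request above is equivalent to fetching the URL built below. The helper name `build_url` is hypothetical.

```python
from urllib.parse import urlencode


def build_url(base, params):
    """Append URL-encoded query parameters to a base URL."""
    return f"{base}?{urlencode(params)}"


# Same coordinates as in the cell above.
st_louis = {'lat': 36.270, 'lon': -90.1994}
url = build_url('http://api.open-notify.org/iss-pass.json', st_louis)
print(url)
```

Passing a dictionary through `params` is usually preferable to building the string by hand, because special characters get escaped for you.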

In [5]:
response1.content

b'{\n  "message": "success", \n  "request": {\n    "altitude": 100, \n    "datetime": 1617881095, \n    "latitude": 36.27, \n    "longitude": -90.1994, \n    "passes": 5\n  }, \n  "response": [\n    {\n      "duration": 612, \n      "risetime": 1617897786\n    }, \n    {\n      "duration": 635, \n      "risetime": 1617903578\n    }, \n    {\n      "duration": 499, \n      "risetime": 1617909507\n    }, \n    {\n      "duration": 448, \n      "risetime": 1617915431\n    }, \n    {\n      "duration": 580, \n      "risetime": 1617921249\n    }\n  ]\n}\n'

In [6]:
type(response1.content)

bytes
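
Because the body arrives as bytes, it can also be decoded by hand. The sketch below uses the standard `json` module on a shortened copy of the payload shown above; it illustrates the idea and is not a claim about how `requests` works internally.

```python
import json

# A shortened copy of the bytes payload shown above.
raw = b'{"message": "success", "response": [{"duration": 612, "risetime": 1617897786}]}'

# json.loads accepts bytes as well as str.
data = json.loads(raw)

print(type(data))
print(data['response'][0]['duration'])
```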

The `.json()` method parses the bytes content into a Python dictionary:

In [7]:
response1.json()

{'message': 'success',
 'request': {'altitude': 100,
  'datetime': 1617881095,
  'latitude': 36.27,
  'longitude': -90.1994,
  'passes': 5},
 'response': [{'duration': 612, 'risetime': 1617897786},
  {'duration': 635, 'risetime': 1617903578},
  {'duration': 499, 'risetime': 1617909507},
  {'duration': 448, 'risetime': 1617915431},
  {'duration': 580, 'risetime': 1617921249}]}

In [8]:
type(response1.json())

dict
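
Going by the API documentation linked above, `risetime` is a Unix timestamp and `duration` is in seconds. A minimal sketch for making the passes readable follows, with `format_passes` as a hypothetical helper name and the sample data copied from the output above.

```python
from datetime import datetime, timezone


def format_passes(payload):
    """Return (UTC ISO time, duration in minutes) for each predicted pass."""
    passes = []
    for item in payload['response']:
        rise = datetime.fromtimestamp(item['risetime'], tz=timezone.utc)
        passes.append((rise.isoformat(), round(item['duration'] / 60, 1)))
    return passes


# First two passes copied from the output above.
sample = {'response': [{'duration': 612, 'risetime': 1617897786},
                       {'duration': 635, 'risetime': 1617903578}]}

for rise, minutes in format_passes(sample):
    print(rise, minutes, 'min')
```

Keeping the times in UTC avoids guessing the reader's time zone; converting to local time would be a separate step.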

## Realtime Observations

The API reports the station's position in real time. Notice that the `'timestamp'` value increases by about 1 second with each request:

In [9]:
for i in range(5):
    time.sleep(1)
    response = requests.get("http://api.open-notify.org/iss-now.json")
    content = response.json()
    print(content)

{'message': 'success', 'iss_position': {'latitude': '45.3559', 'longitude': '12.4882'}, 'timestamp': 1617881096}
{'message': 'success', 'iss_position': {'latitude': '45.3862', 'longitude': '12.5655'}, 'timestamp': 1617881097}
{'message': 'success', 'iss_position': {'latitude': '45.4164', 'longitude': '12.6428'}, 'timestamp': 1617881098}
{'message': 'success', 'iss_position': {'latitude': '45.4618', 'longitude': '12.7590'}, 'timestamp': 1617881100}
{'message': 'success', 'iss_position': {'latitude': '45.4919', 'longitude': '12.8366'}, 'timestamp': 1617881101}
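
Note that `latitude` and `longitude` arrive as strings. Here is a small sketch, using the samples printed above, that converts them to floats and measures the gap between consecutive timestamps. `to_point` and `gaps` are hypothetical helpers.

```python
def to_point(sample):
    """Convert one iss-now payload into (timestamp, lat, lon) with numeric values."""
    pos = sample['iss_position']
    return sample['timestamp'], float(pos['latitude']), float(pos['longitude'])


def gaps(timestamps):
    """Seconds elapsed between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]


# Samples copied from the output above.
samples = [
    {'iss_position': {'latitude': '45.3559', 'longitude': '12.4882'}, 'timestamp': 1617881096},
    {'iss_position': {'latitude': '45.3862', 'longitude': '12.5655'}, 'timestamp': 1617881097},
    {'iss_position': {'latitude': '45.4164', 'longitude': '12.6428'}, 'timestamp': 1617881098},
    {'iss_position': {'latitude': '45.4618', 'longitude': '12.7590'}, 'timestamp': 1617881100},
    {'iss_position': {'latitude': '45.4919', 'longitude': '12.8366'}, 'timestamp': 1617881101},
]

points = [to_point(s) for s in samples]
print(gaps([t for t, _, _ in points]))
```

The gaps are not all exactly one second because each request also takes some time on top of the `time.sleep(1)` call.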


The `headers` property holds useful metadata about the response. The `Content-Type` key in the `headers` dictionary is one of the most important because it tells us how to decode the body:

In [10]:
for key, value in response1.headers.items():
    print(key, ':', value)

Server : nginx/1.10.3
Date : Thu, 08 Apr 2021 11:24:55 GMT
Content-Type : application/json
Content-Length : 522
Connection : keep-alive
Via : 1.1 vegur


As expected, the content type is `application/json`.
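
To make the role of `Content-Type` concrete, here is a minimal sketch of choosing a decoder from the header value. `decode_body` is a hypothetical helper, and treating `text/*` bodies as UTF-8 is an assumption made for this example rather than something the header guarantees.

```python
import json


def decode_body(content_type, body):
    """Decode a response body according to its Content-Type header value."""
    # The header may carry parameters, e.g. "application/json; charset=utf-8".
    media_type = content_type.split(';')[0].strip().lower()
    if media_type == 'application/json':
        return json.loads(body)
    if media_type.startswith('text/'):
        return body.decode('utf-8')  # assumption: text bodies are UTF-8
    return body  # unknown type: leave the raw bytes alone


print(decode_body('application/json', b'{"message": "success"}'))
print(decode_body('text/plain; charset=utf-8', b'hello'))
```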

`astros.json` tells us how many astronauts are currently in space:

In [11]:
astros = requests.get('http://api.open-notify.org/astros.json')
astros.json()

{'message': 'success',
 'number': 7,
 'people': [{'craft': 'ISS', 'name': 'Sergey Ryzhikov'},
  {'craft': 'ISS', 'name': 'Kate Rubins'},
  {'craft': 'ISS', 'name': 'Sergey Kud-Sverchkov'},
  {'craft': 'ISS', 'name': 'Mike Hopkins'},
  {'craft': 'ISS', 'name': 'Victor Glover'},
  {'craft': 'ISS', 'name': 'Shannon Walker'},
  {'craft': 'ISS', 'name': 'Soichi Noguchi'}]}

There are 7 people in space as of this writing.
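
Each entry in `people` also names a `craft`, so the crew can be tallied per craft. Below is a short sketch on a shortened copy of the response; `crew_by_craft` is a hypothetical helper name.

```python
from collections import Counter


def crew_by_craft(payload):
    """Count how many people are aboard each craft."""
    return Counter(person['craft'] for person in payload['people'])


# Shortened copy of the response shown above (three of the seven entries).
sample = {'number': 3,
          'people': [{'craft': 'ISS', 'name': 'Kate Rubins'},
                     {'craft': 'ISS', 'name': 'Mike Hopkins'},
                     {'craft': 'ISS', 'name': 'Soichi Noguchi'}]}

counts = crew_by_craft(sample)
print(counts)
# Sanity check: the per-craft counts should add up to the 'number' field.
print(sum(counts.values()) == sample['number'])
```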

## Web Scraping GitHub

In [12]:
headers = {'Authorization': 'token ###'}
response = requests.get('https://api.github.com/users/stephentaul22', headers=headers)
response.status_code

200
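
The cell above hard-codes the token (redacted here as `###`). One common alternative is to read it from an environment variable so it never lands in the notebook. This is a sketch only: the variable name `GITHUB_TOKEN` and the helper `build_auth_headers` are assumptions for illustration.

```python
import os


def build_auth_headers(token):
    """Build the Authorization header in the same 'token ...' form used above."""
    if not token:
        raise ValueError("no token supplied")
    return {'Authorization': f'token {token}'}


# GITHUB_TOKEN is an assumed variable name; the placeholder keeps the sketch runnable.
token = os.environ.get('GITHUB_TOKEN') or 'placeholder-token'
headers = build_auth_headers(token)
print(sorted(headers))
```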

In [13]:
response.json()

{'login': 'stephentaul22',
 'id': 52689411,
 'node_id': 'MDQ6VXNlcjUyNjg5NDEx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/52689411?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/stephentaul22',
 'html_url': 'https://github.com/stephentaul22',
 'followers_url': 'https://api.github.com/users/stephentaul22/followers',
 'following_url': 'https://api.github.com/users/stephentaul22/following{/other_user}',
 'gists_url': 'https://api.github.com/users/stephentaul22/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/stephentaul22/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/stephentaul22/subscriptions',
 'organizations_url': 'https://api.github.com/users/stephentaul22/orgs',
 'repos_url': 'https://api.github.com/users/stephentaul22/repos',
 'events_url': 'https://api.github.com/users/stephentaul22/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/stephentaul22/received_events',
 'type': 'User',
 ...

## Viewing my Repositories

In [14]:
repos = requests.get('https://api.github.com/users/stephentaul22/repos', headers=headers).json()
for repo in repos:
    print(repo['name'])

Analyzing-CIA-Factbook-Data-Using-SQL
Answering-Business-Questions-using-SQL
api-post-test2
App-Profile-Recommendation
Building-a-Database-for-Crime-Reports
Building-a-Spam-Filter-with-Naive-Bayes-Algorithm
Building-Fast-Queries-on-a-CSV
Cleaning-and-Analyzing-Employee-Exit-Surveys
Exploring-Ebay-Car-Sale-Data
Exploring-Hacker-News-Posts
Exploring-The-World-Happiness-Reports-by-Region
Finding-the-Best-Markets-to-Advertise-In
hello-world
Investigating-COVID-19-Virus-Trends
Investigating-Movie-Ratings
Mobile-App-for-Lottery-Addiction
Predicting-Car-Prices
Star-Wars-Survey
Visualizing-Earnings-Based-on-College-Majors
Visualizing-the-Gender-Gap-in-College-Degrees
Winning-Jeopardy
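
GitHub's REST API returns list endpoints in pages (30 items per page by default, up to 100 with the `per_page` parameter), so a single request can miss repositories on larger accounts. A minimal sketch of a paging loop follows. `collect_pages` and `fetch_page` are hypothetical names, and the fake fetcher stands in for a real `requests.get(...)` call that passes `page` and `per_page` through `params`.

```python
def collect_pages(fetch_page, per_page=100):
    """Call fetch_page(page, per_page) until a short page signals the end."""
    items = []
    page = 1
    while True:
        batch = fetch_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:
            return items
        page += 1


# Stand-in for the network call: slices a local list the way the API pages results.
fake_repos = [{'name': f'repo-{n}'} for n in range(250)]


def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return fake_repos[start:start + per_page]


names = [repo['name'] for repo in collect_pages(fake_fetch)]
print(len(names), names[0], names[-1])
```

Stopping on a short page keeps the loop simple; when the total is an exact multiple of `per_page`, it costs one extra request that returns an empty list.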


## Viewing Starred Repositories

In [15]:
starred = requests.get('https://api.github.com/users/stephentaul22/starred', headers=headers).json()
for repo in starred:
    print(repo['name'])

Building-a-Database-for-Crime-Reports
Building-Fast-Queries-on-a-CSV
Predicting-Car-Prices
Exploring-The-World-Happiness-Reports-by-Region
Climate-ChangePrediction
Answering-Business-Questions-using-SQL
Winning-Jeopardy
Building-a-Spam-Filter-with-Naive-Bayes-Algorithm
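
Starred repositories can belong to anyone, so it can be handy to see which starred names do not appear in your own list. This sketch compares names only, which is an assumption that ignores two owners having a repository with the same name. `starred_elsewhere` is a hypothetical helper.

```python
def starred_elsewhere(own_names, starred_names):
    """Return starred repository names that are not in the user's own list."""
    own = set(own_names)
    return [name for name in starred_names if name not in own]


# A few names copied from the two listings above.
own_names = ['Predicting-Car-Prices', 'Winning-Jeopardy', 'hello-world']
starred_names = ['Predicting-Car-Prices', 'Climate-ChangePrediction', 'Winning-Jeopardy']

print(starred_elsewhere(own_names, starred_names))
```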


## Token Trick

The username is not needed in the URL when a token is supplied, because the `/user` endpoint returns the account that the token belongs to:

In [16]:
requests.get('https://api.github.com/user', headers=headers).json()

{'login': 'stephentaul22',
 'id': 52689411,
 'node_id': 'MDQ6VXNlcjUyNjg5NDEx',
 'avatar_url': 'https://avatars.githubusercontent.com/u/52689411?v=4',
 'gravatar_id': '',
 'url': 'https://api.github.com/users/stephentaul22',
 'html_url': 'https://github.com/stephentaul22',
 'followers_url': 'https://api.github.com/users/stephentaul22/followers',
 'following_url': 'https://api.github.com/users/stephentaul22/following{/other_user}',
 'gists_url': 'https://api.github.com/users/stephentaul22/gists{/gist_id}',
 'starred_url': 'https://api.github.com/users/stephentaul22/starred{/owner}{/repo}',
 'subscriptions_url': 'https://api.github.com/users/stephentaul22/subscriptions',
 'organizations_url': 'https://api.github.com/users/stephentaul22/orgs',
 'repos_url': 'https://api.github.com/users/stephentaul22/repos',
 'events_url': 'https://api.github.com/users/stephentaul22/events{/privacy}',
 'received_events_url': 'https://api.github.com/users/stephentaul22/received_events',
 'type': 'User',
 ...

## `post` requests

`post` requests send data to the server, typically to create something, whereas `get` requests retrieve it. Here, a `post` creates a new repository:

In [17]:
response = requests.post('https://api.github.com/user/repos', headers=headers, json={'name': 'api-post-test'})
response.status_code

201
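
The `json=` argument used above serializes the dictionary into a JSON request body. As an illustration with the standard `json` module (not a trace of what `requests` does internally), the body for this call looks roughly like this:

```python
import json

payload = {'name': 'api-post-test'}

# Serialize the dictionary and encode it to bytes, as an HTTP body must be.
body = json.dumps(payload).encode('utf-8')

print(body)
print(json.loads(body) == payload)  # the receiver can recover the same dictionary
```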

## `patch` requests

`patch` requests update part of an existing resource, here the repository's description:

In [18]:
payload = {'name': 'api-post-test',
           'description': 'Api Patch'}
response = requests.patch('https://api.github.com/repos/stephentaul22/api-post-test', headers=headers, json=payload)
response.status_code

404
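
Since a `patch` request only needs the fields being changed, the payload can be trimmed to the differences. This is a sketch under assumptions: `changed_fields` is a hypothetical helper, and the `current` dictionary stands in for values that would come from an earlier `get` on the repository.

```python
def changed_fields(current, desired):
    """Return only the keys in `desired` whose values differ from `current`."""
    return {key: value for key, value in desired.items() if current.get(key) != value}


# Assumed current state of a freshly created repository with no description.
current = {'name': 'api-post-test', 'description': None}
desired = {'name': 'api-post-test', 'description': 'Api Patch'}

print(changed_fields(current, desired))
```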

## `delete` requests

`delete` requests remove a resource from the server:

In [19]:
response = requests.delete('https://api.github.com/repos/stephentaul22/api-post-test', headers=headers)
response.status_code

204

Verifying that the repo is deleted:

In [20]:
response = requests.get('https://api.github.com/repos/stephentaul22/api-post-test', headers=headers)
response.status_code

200
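
For a scripted version of this walkthrough, each step can be compared against the status code GitHub documents for it; a repository that has been deleted is expected to answer a later `get` with 404. The names `EXPECTED` and `check` are hypothetical, and this is a sketch rather than a full test harness.

```python
# Status codes GitHub documents for the repository calls used in this walkthrough.
EXPECTED = {
    'create': 201,   # POST /user/repos
    'update': 200,   # PATCH /repos/{owner}/{repo}
    'delete': 204,   # DELETE /repos/{owner}/{repo}
    'missing': 404,  # GET on a repository that no longer exists
}


def check(step, status_code):
    """Return True when a step produced the status code expected for it."""
    return status_code == EXPECTED[step]


print(check('delete', 204))
print(check('missing', 200))
```

Calling `check('missing', response.status_code)` after a delete gives a clear pass or fail instead of relying on reading the number by eye.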

## Conclusion

In this project I demonstrated a few useful ways to perform web scraping using the Python `requests` library. This is a simple introduction, and the techniques here can be applied to larger, more detailed projects.