# Web Scraping and Application Program Interface (API): Part II

## Controlling the crawl-rate

Controlling the crawl rate benefits both us and the website we are scraping. If 
we send tens of requests per second to the server, we are much more likely to get 
our IP address banned. 

We’ll control the loop’s rate by using the sleep() function from Python’s 
[time](https://docs.python.org/3/library/time.html) module. sleep() will 
pause the execution of the loop for a specified number of seconds.

To mimic human behavior, we’ll vary the waiting time between requests 
by using the randint() function from Python’s 
[random](https://docs.python.org/3/library/random.html) module. randint() randomly 
generates integers within a specified interval (both endpoints included).

In [None]:
import time
import random

for i in range(0, 5):
    print('connected!')
    time.sleep(random.randint(1, 10))

### <font color=red> Question: Can we use other distributions for waiting times?</font> 

In [None]:
import time
import random

for i in range(0, 5):
    print('connected!')
    time.sleep(abs(random.gauss(1, 10)))
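
Yes. Any distribution that produces non-negative values can serve as a waiting time. The sketch below compares three options from the random module: uniform, exponential, and a Gaussian draw clamped at a minimum delay. The helper name `random_delay` and all parameter values are assumptions made for illustration; they are not part of the original notebook.

```python
import random

def random_delay(kind="uniform", min_delay=0.5):
    """Return a waiting time in seconds drawn from the chosen distribution.

    Hypothetical helper for illustration; the parameters are arbitrary.
    """
    if kind == "uniform":
        delay = random.uniform(1, 10)        # any float between 1 and 10
    elif kind == "exponential":
        delay = random.expovariate(1 / 3)    # mean of 3 seconds
    elif kind == "gauss":
        delay = random.gauss(5, 2)           # mean 5, standard deviation 2
    else:
        raise ValueError("unknown distribution: {}".format(kind))
    # Never wait less than min_delay, so a small or negative draw
    # cannot turn into back-to-back requests.
    return max(min_delay, delay)

for kind in ("uniform", "exponential", "gauss"):
    print(kind, round(random_delay(kind), 2))
```

In a scraping loop, the result would be passed straight to the pause, for example `time.sleep(random_delay("exponential"))`.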

### <font color=red> Question: What is the difference between from time import sleep and import time? </font> 

In [None]:
from time import sleep
from random import randint

for i in range(0, 5):
    print('connected!')
    sleep(randint(1, 10))

In [None]:
from time import sleep
from random import gauss

for i in range(0, 5):
    print('connected!')
    sleep(abs(gauss(1, 5)))   
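
Both forms load the same module; they differ only in which name becomes available in the current namespace. A small check, as a sketch:

```python
import time
from time import sleep

# `import time` binds the module object to the name `time`;
# its functions are reached through the module prefix.
print(type(time).__name__)    # module

# `from time import sleep` binds only the function to the name `sleep`.
# Both names refer to the same function object.
print(time.sleep is sleep)    # True
```

The prefixed form makes it obvious where a function comes from and avoids clashes with your own names; the unprefixed form is shorter to type.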

## Monitoring the loop as it’s still going

It would be nice if we could find a way to monitor the scraping process 
as it’s still going. This feature is definitely optional, but it can be 
very helpful in the testing and debugging process. Also, the greater 
the number of pages, the more helpful the monitoring becomes. If you are 
going to scrape hundreds or thousands of web pages in a single code run, 
I would say that this feature becomes a must.

For our script, we’ll make use of this feature, and monitor the following 
parameters:

-  The frequency (speed) of requests, so we make sure our program is 
   not overloading the server.

-  The number of requests, so we can halt the loop in case the number of 
   expected requests is exceeded.

-  The status code of our requests, so we make sure the server is sending 
   back the proper responses.

To get a frequency value we’ll divide the number of requests by the time 
elapsed since the first request. This is similar to computing the speed 
of a car – we divide the distance by the time taken to cover that distance. 
Let’s experiment with this monitoring technique at a small scale first. In 
the following code cell we will:

-- Set a starting time using the time() function from the time module, 
   and assign the value to start_time.
    
-- Assign 0 to the variable requests which we’ll use to count the 
   number of requests.
    
-- Start a loop, and then with each iteration:
   -  Simulate a request.
   -  Increment the number of requests by 1.
   -  Pause the loop for a time interval between 8 and 15 seconds.
   -  Calculate the elapsed time since the first request, and assign 
      the value to elapsed_time.
   -  Print the number of requests and the frequency.

In [12]:
import time
import random

start_time = time.time()
requests = 0

for i in range(5):
    # A request would go here
    requests += 1
    time.sleep(random.randint(8, 15))
    elapsed_time = time.time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))

Request: 1; Frequency: 0.08331090815497073 requests/s
Request: 2; Frequency: 0.07405083343080417 requests/s
Request: 3; Frequency: 0.07140739940530721 requests/s
Request: 4; Frequency: 0.07840388366069614 requests/s
Request: 5; Frequency: 0.08471534564175459 requests/s


When you make many requests, the output accumulates and your work starts to 
look untidy. To avoid that, we’ll clear the output after each 
iteration, and replace it with information about the most recent request. 
To do that we’ll use the clear_output() function from 
IPython’s display module. We’ll set the wait parameter of clear_output() to 
True, so that the current output is not cleared until new output is available to replace it.

In [13]:
import time
import random
from IPython.display import clear_output

start_time = time.time()
requests = 0

for i in range(5):
    # A request would go here
    requests += 1
    time.sleep(random.randint(8, 15))
    elapsed_time = time.time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait=True)

Request: 5; Frequency: 0.09085763424873643 requests/s
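
The loops above track only the frequency. A sketch that also covers the other two monitored parameters, the status code and the number of requests, might look like the following. The request is simulated by a hypothetical `fake_request()` function, the limit of 3 expected requests is an arbitrary assumption, and the pause is shortened so the example finishes quickly.

```python
import time
import random
import warnings

def fake_request():
    """Stand-in for requests.get(); returns a simulated status code."""
    return random.choice([200, 200, 200, 404])

max_requests = 3      # assumed number of expected requests
start_time = time.time()
n_requests = 0
status_codes = []

for i in range(10):
    status = fake_request()
    n_requests += 1
    status_codes.append(status)
    time.sleep(random.uniform(0.1, 0.3))   # short pause for the demo

    elapsed_time = time.time() - start_time
    print('Request: {}; Frequency: {:.2f} requests/s'.format(
        n_requests, n_requests / elapsed_time))

    # Warn, but keep going, when the server does not answer with 200
    if status != 200:
        warnings.warn('Request: {}; Status code: {}'.format(n_requests, status))

    # Halt the loop when the number of expected requests is exceeded
    if n_requests > max_requests:
        warnings.warn('Number of requests was greater than expected.')
        break
```

The counter is named `n_requests` rather than `requests` so that it does not shadow the requests library when both are used in the same session.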


In [21]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

import time
import random
from IPython.display import clear_output

base_url = 'http://www.imdb.com/search/title?release_date=2018-01-01,2018-12-31&sort=num_votes,desc'
current_page = 1    ## first page

start_time = time.time()

names = []
years = []
imdb_ratings = []
metascores = []
votes = []

while current_page < 11:   ## suppose we want to get the first ten pages
    print('\n')
    print('Page ', current_page)
    start = (current_page-1)*50 + 1  ## starting number; start=1 for page 1 and start=51 for page 2
    url = base_url + "&start=" + str(start)
    page = requests.get(url)

    if (page.status_code // 10**2) == 2 :   ## any 2xx status code means success
        print('successfully connected!')
    else :
        print('connection failed!')

    ## Pause the loop
    time.sleep(random.randint(1, 5))
    
    ## Monitor the requests
    elapsed_time = time.time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(current_page, current_page/elapsed_time))
    clear_output(wait=True)

    soup = BeautifulSoup(page.text, 'html.parser')
    movie_containers = soup.find_all('div', class_ = 'lister-item mode-advanced')

    for container in movie_containers:  
        # The name 
        name = container.h3.a.text
        names.append(name)
        
        # The year
        year = container.h3.find('span', class_ = 'lister-item-year').text
        years.append(year)
        
        # The IMDB rating
        imdb = float(container.strong.text)
        imdb_ratings.append(imdb)
        
        # The Metascore
        try:
            m_score = container.find('span', class_ = 'metascore').text
        except AttributeError:   # find() returned None: no Metascore for this title
            m_score = 'None'
        metascores.append(m_score)
        
        # The number of votes
        vote = container.find('span', attrs = {'name':'nv'})['data-value']
        votes.append(int(vote))

    del page     ## delete the current web page
    del soup     ## delete the current soup
            
    current_page += 1   ## move to next page

    
## Merge the data into a pandas DataFrame.
movie_ratings = pd.DataFrame({'movie': names,
                       'year': years,
                       'imdb': imdb_ratings,
                       'metascore': metascores,
                       'votes': votes
})

print('\n')
print(movie_ratings.info())

movie_ratings.head(10)  ## Show the first 10 movies



<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
movie        500 non-null object
year         500 non-null object
imdb         500 non-null float64
metascore    500 non-null object
votes        500 non-null int64
dtypes: float64(1), int64(1), object(3)
memory usage: 19.6+ KB
None


|   | movie | year | imdb | metascore | votes |
|---|---|---|---|---|---|
| 0 | Avengers: Infinity War | (2018) | 8.5 | 68 | 619791 |
| 1 | Black Panther | (2018) | 7.3 | 88 | 492605 |
| 2 | Deadpool 2 | (2018) | 7.8 | 66 | 373269 |
| 3 | Bohemian Rhapsody | (2018) | 8.1 | 49 | 322778 |
| 4 | A Quiet Place | (2018) | 7.6 | 82 | 292629 |
| 5 | Ready Player One | (2018) | 7.5 | 64 | 292611 |
| 6 | Venom | (2018) | 6.8 | 35 | 258646 |
| 7 | A Star Is Born | (2018) | 7.8 | 88 | 232475 |
| 8 | Aquaman | (2018) | 7.2 | 55 | 230735 |
| 9 | Mission: Impossible - Fallout | (2018) | 7.8 | 86 | 225091 |


## Cleaning the scraped data

We would like to clean the year column and convert the values to integers.

Note that all the values in the year column are of the object type. To 
avoid ValueErrors upon conversion, we want the values to consist 
only of the digits 0 to 9.

Let’s examine the unique values of the year column. This helps us to 
get an idea of what we could do to make the conversions we want. To 
see all the unique values, we’ll use the unique() method:

In [24]:
movie_ratings['year'].unique()

array(['(2018)', '(I) (2018)', '(2018– )', '(III) (2018)', '(I) (2018– )',
       '(II) (2018)', '(2016– )', '(2018 Video Game)', '(2010– )',
       '(2018 TV Special)', '(2018 Video)', '(2018–2019)', '(2005– )',
       '(2015– )', '(2014– )', '(2015–2018)', '(2013–2018)',
       '(2018 TV Movie)'], dtype=object)

Each value contains a four-digit year, sometimes alongside other text. 
We can use a regular expression with the pattern '[0-9]+' to extract these years. 
We’ll also convert the result to an integer.
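
To see what the pattern returns, here is a sketch that applies it to a few of the unique values listed above. re.findall() returns every run of consecutive digits, so taking the first match keeps the starting year when a range such as '(2015–2018)' is present.

```python
import re

samples = ['(2018)', '(I) (2018)', '(2018– )', '(2018 Video Game)', '(2015–2018)']

for s in samples:
    matches = re.findall('[0-9]+', s)   # every run of consecutive digits
    print(s, '->', matches, '->', int(matches[0]))
```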

In [34]:
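import re

# Extract the first run of digits from each row's value and convert it to an
# integer, so that every row gets its own year.
movie_ratings['year_num'] = movie_ratings['year'].apply(
    lambda year: int(re.findall('[0-9]+', year)[0]))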

movie_ratings.head(100)    

|   | movie | year | imdb | metascore | votes | year_num |
|---|---|---|---|---|---|---|
| 0 | Avengers: Infinity War | (2018) | 8.5 | 68 | 619791 | 2018 |
| 1 | Black Panther | (2018) | 7.3 | 88 | 492605 | 2018 |
| 2 | Deadpool 2 | (2018) | 7.8 | 66 | 373269 | 2018 |
| 3 | Bohemian Rhapsody | (2018) | 8.1 | 49 | 322778 | 2018 |
| 4 | A Quiet Place | (2018) | 7.6 | 82 | 292629 | 2018 |
| 5 | Ready Player One | (2018) | 7.5 | 64 | 292611 | 2018 |
| 6 | Venom | (2018) | 6.8 | 35 | 258646 | 2018 |
| 7 | A Star Is Born | (2018) | 7.8 | 88 | 232475 | 2018 |
| 8 | Aquaman | (2018) | 7.2 | 55 | 230735 | 2018 |
| 9 | Mission: Impossible - Fallout | (2018) | 7.8 | 86 | 225091 | 2018 |
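
The metascore column is also of the object type, because missing scores were stored as the string 'None' during scraping. One way to clean it, sketched here on a small hypothetical DataFrame rather than on the scraped data, is pd.to_numeric() with errors='coerce', which turns anything that cannot be parsed into NaN. The sample values, including the trailing whitespace, are assumptions for illustration.

```python
import pandas as pd

# Small stand-in for the scraped table (values are illustrative)
sample = pd.DataFrame({'movie': ['A', 'B', 'C'],
                       'metascore': ['68        ', 'None', '88        ']})

# Strip surrounding whitespace, then convert; 'None' becomes NaN
sample['metascore_num'] = pd.to_numeric(sample['metascore'].str.strip(),
                                        errors='coerce')

print(sample['metascore_num'].tolist())
print(sample['metascore_num'].isna().sum(), 'missing value(s)')
```

Because NaN is a floating-point value, the resulting column has the float64 type even though the scores themselves are whole numbers.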


## Application Program Interface (API)

### Example 1: [Quandl API](https://docs.quandl.com/)

The Quandl API is a financial data API that lets you access millions of financial 
and economic datasets from hundreds of publishers through a single free API.

#### INSTALLATION
You can download the Quandl Python package from PyPI or from GitHub. 
Follow the installation instructions below.

NOTE: Installation of the Quandl Python package varies depending on your system.

On most systems, the following commands will initiate installation:
```shell
pip install quandl
```
On some systems, you may need this command instead:
```shell
pip3 install quandl
```

Additionally, you can find detailed installation instructions for Python 
modules 
here: [Python 3.x](https://docs.python.org/3/installing/index.html#installing-index) 
and [Python 2.7x](https://docs.python.org/2/installing/index.html#installing-index).

#### AUTHENTICATION
The Quandl Python module is free but you must have a Quandl API key in order 
to download data. To get your own API key, you will need to create a free 
Quandl account and set your API key.

After importing the Quandl module, you can set your API key with the 
following command: 
```python
quandl.ApiConfig.api_key = "YOURAPIKEY"
```

I save my API key in the file "Quandl_Settings.py", which is in the 
current working directory. 

In [49]:
import quandl
from Quandl_Settings import my_API_key

quandl.ApiConfig.api_key = my_API_key
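
The contents of Quandl_Settings.py are not shown here; presumably it holds a single assignment such as `my_API_key = "YOURAPIKEY"`. An alternative that keeps the key out of any source file is to read it from an environment variable. The sketch below assumes a variable named QUANDL_API_KEY, which is a hypothetical name chosen for this example and not something Quandl defines; `load_api_key` is likewise an illustrative helper.

```python
import os

def load_api_key(env_var='QUANDL_API_KEY'):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError('Set the {} environment variable first.'.format(env_var))
    return key

# Usage (requires the quandl package and the variable to be set):
# quandl.ApiConfig.api_key = load_api_key()
```

Either approach keeps the key out of the notebook itself, which matters when the notebook is shared or committed to version control.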

In [55]:
data = quandl.get("EIA/PET_RWTC_D")

In [56]:
data.head(5)

| Date | Value |
|---|---|
| 1986-01-02 | 25.56 |
| 1986-01-03 | 26.0 |
| 1986-01-06 | 26.53 |
| 1986-01-07 | 25.85 |
| 1986-01-08 | 25.87 |


See more at https://docs.quandl.com/docs/python-time-series.

### Example 2: [Yelp API](https://www.yelp.com/developers)

See more at https://www.yelp.com/developers/documentation/v3/authentication.

### Example 3: [Twitter API](https://developer.twitter.com/en.html)

In [61]:
import tweepy  # https://github.com/tweepy/tweepy
import csv

from Twitter_Settings import API_key, API_secret_key, Access_token, Access_token_secret

# Twitter API credentials are loaded from Twitter_Settings.py
# rather than hard-coded in the notebook.

consumer_key = API_key
consumer_secret = API_secret_key
access_token = Access_token
access_token_secret = Access_token_secret

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

public_tweets = api.home_timeline()
for tweet in public_tweets:
    print(tweet.text)