# Scraping Kickstarter project pages

**Goal: Load parsed data from the Web Robots database, which contains URLs and other attributes of Kickstarter projects, scrape each project page, and save the results in a table.**

Note: 2000 scraped pages take up ~1 GB of storage. In addition, the scraping rate used in this notebook converges to about 0.28 pages/second.
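
To put these figures in perspective, here is a rough back-of-the-envelope estimate. It assumes that both storage and run time scale linearly with the number of pages, which may not hold exactly in practice. The function name `estimate_scrape_cost` and its parameters are illustrative and not part of the original notebook.

```python
def estimate_scrape_cost(n_pages, pages_per_second=0.28, gb_per_2000_pages=1.0):
    """Return (hours, gigabytes) for scraping n_pages, assuming linear scaling."""
    # Time: pages divided by the observed rate, converted from seconds to hours
    hours = n_pages / pages_per_second / 3600
    # Storage: scale the observed ~1 GB per 2000 pages
    gigabytes = n_pages * gb_per_2000_pages / 2000
    return hours, gigabytes


hours, gigabytes = estimate_scrape_cost(24000)
print('Estimated run time: {:.1f} hours'.format(hours))
print('Estimated storage: {:.1f} GB'.format(gigabytes))
```

Under these assumptions, scraping 24,000 pages would take roughly a day and about 12 GB of disk space, so it is worth planning the run in batches.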

In [1]:
# Load required libraries
import requests
import numpy as np
import pandas as pd
import joblib  # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
import time
import random
from IPython.display import clear_output

Load the table containing already-parsed Web Robots data.

In [2]:
# Load table containing Web Robots data
df = joblib.load('data/web_robots_data/web_robots_data_to_06-2017.pkl')

The parsed data includes:

- `name` - project's name
- `category` - project's category as defined by Kickstarter
- `hyperlink` - project's web page URL
- `currency` - type of currency used for fundraising
- `pledged` - total amount of money pledged by backers over the course of the project
- `goal` - funding goal set by the creator
- `location` - creator's location information

Let's select only the projects that raise funds in U.S. dollars. This makes it likely, though not certain, that we're working with projects written in American English.

In [3]:
# Keep only projects funded in U.S. dollars (a proxy for American English)
df_USD = df[df['currency'] == 'USD']

Since the Web Robots data contains nearly 200,000 projects, we'll select a random sample for scraping.

In [4]:
# Take a random sample of the Web Robots data, fixing the random state to
# ensure repeatability
df_sample = df_USD.sample(50000, random_state=42)

During the scraping process, we'll monitor the overall progress and measure how fast we're scraping to avoid overloading the Kickstarter server. Afterwards, we'll report the total run time, average scraping speed and the total number of scraped project pages. We'll also keep track of the position of the last scraped project page, in case the scraper halts for any reason, to note where we left off.

In [5]:
# Initialize an empty DataFrame to store scraped HTML
scraped_collection = pd.DataFrame(columns=['scraped_HTML'])

# Record the start time
start_time = time.time()

# Initialize the number of requests
request_count = 0

# Select which projects to scrape by their position in the sample. This
# allows starting at a position other than the beginning in case the scraper
# stopped unexpectedly.
starting_point = 0
ending_point = 24000

for index, row in df_sample[starting_point:ending_point].iterrows():
    # Perform a request and timeout after 20 seconds since some pages may take
    # longer to scrape
    scraped_html = requests.get(row['hyperlink'], timeout=20)
    
    # Pause the loop for a random amount of time
    time.sleep(random.uniform(2, 4))
    
    # Monitor the requests by clearing the output and displaying current 
    # progress
    elapsed_time = time.time() - start_time
    clear_output(wait=True)
    print(
        'Request: {}; Row ID: {}; Frequency: {} requests/sec'.format(
            request_count + starting_point,
            index,
            (request_count + 1) / elapsed_time
        )
    )
    request_count += 1
    
    # Record scraped HTML
    scraped_collection.loc[index, 'scraped_HTML'] = scraped_html
    
# Display the overall time, average scraping speed and total number of scraped
# project pages
run_time = time.time() - start_time
print()
print('Run time:', run_time)
print('Average rate:', len(scraped_collection) / run_time)
print('# of projects scraped:', len(scraped_collection))

Request: 1; Row ID: 190378; Frequency: 0.31547747899263323 requests/sec

Run time: 6.344876289367676
Average rate: 0.3152149717010981
# of projects scraped: 2
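
The loop above stops at the first request that raises an exception, such as a timeout. One possible way to make it more resilient is sketched below: a small retry helper that is not part of the original notebook. The name `fetch_with_retries` and all of its parameters are hypothetical, and the linearly growing wait between attempts is just one reasonable choice.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, backoff=2.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Call fetch(url), retrying on failure; return None if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except retry_on:
            if attempt == max_attempts:
                # Give up and let the caller decide how to handle the gap
                return None
            # Wait longer after each failed attempt: backoff, 2 * backoff, ...
            sleep(backoff * attempt)
    return None
```

Inside the scraping loop, this could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=20), row['hyperlink'], retry_on=(requests.exceptions.RequestException,))`. A return value of `None` then marks a page that could not be fetched, which can be skipped or recorded for a later pass.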


Let's save the collection of scraped HTML and label the filename with the positions of the projects scraped, to keep track of how far into the random sample we've scraped. This way, it's easy to restart the scraper where we left off.

In [6]:
# Serialize the data table containing the scraped HTML for each project
joblib.dump(
    scraped_collection, 'scraped_collection_{}-{}.pkl'.format(
        starting_point,
        ending_point - 1
    )
)

['scraped_collection_0-1.pkl']
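
As a sketch of how the filename convention can be used to resume, the helper below reads the end position out of saved filenames and returns the next `starting_point`. The function `next_starting_point` is a hypothetical addition, and it assumes every saved file follows the `scraped_collection_{start}-{end}.pkl` pattern used above.

```python
import re


def next_starting_point(filenames):
    """Return the position after the highest end index found in the filenames."""
    pattern = re.compile(r'scraped_collection_(\d+)-(\d+)\.pkl$')
    end_indices = [
        int(match.group(2))
        for match in map(pattern.search, filenames)
        if match  # ignore files that do not follow the naming pattern
    ]
    # With no saved files, start from the beginning of the sample
    return max(end_indices) + 1 if end_indices else 0


print(next_starting_point(['scraped_collection_0-1.pkl']))  # prints 2
```

The list of filenames could come from `glob.glob('scraped_collection_*.pkl')`, so the value of `starting_point` does not have to be tracked by hand.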