# Kickstarter project web scraper

This script will use parsed Web Robots data, which contains URLs for Kickstarter projects, to scrape and store the HTML content of each page.

In [1]:
# Load required libraries
import requests
import numpy as np
import pandas as pd
import joblib
import time
import random
from IPython.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup

Load DataFrame containing parsed Web Robots data.

In [2]:
# Load master file
df = joblib.load('data/web_robots_data/web_robots_data_to_06-2017.pkl')

The parsed features include:

- `name` - project's name
- `category` - project's category as defined by Kickstarter
- `hyperlink` - project's web page URL
- `currency` - type of currency used for fundraising
- `pledged` - total amount of money pledged by backers over the course of the project
- `goal` - funding goal set by the creator
- `location` - creator's location information

Let's select only the projects funded in US dollars. This focuses the analysis on American projects, whose web pages are written in English.

In [3]:
# Select US projects
df_USD = df[df['currency'] == 'USD']

Since the Web Robots data contains 196,000+ projects, let's select a random sample that we'll aim to scrape.

In [4]:
# Take a random sample of the Web Robots data using a seed value to ensure
# repeatability
np.random.seed(42)
df_sample = df_USD.sample(50000)

# Display the first three rows
df_sample.head(3)

| | name | category | hyperlink | currency | pledged | goal | location | funded |
|---|---|---|---|---|---|---|---|---|
| 88389 | Help me start my cottage industry ... Bakesale... | Small Batch | https://www.kickstarter.com/projects/138529431... | USD | 0.0 | 10000.0 | Cape Coral, FL | False |
| 190378 | The Sock Who Lost His Mate at NY Children's Th... | Musical | https://www.kickstarter.com/projects/987315242... | USD | 2600.0 | 7000.0 | Greenwich Village, Manhattan, NY | False |
| 21028 | The 4 Disciples | Comic Books | https://www.kickstarter.com/projects/the4disci... | USD | 165.0 | 2200.0 | Rahway, NJ | False |


While scraping, let's monitor progress and measure the request rate so that we avoid overloading the Kickstarter server. Afterwards, we'll report the total run time, the average scraping speed and the total number of scraped pages. We'll also keep track of the position of the last scraped page so that, if the scraper breaks or stops for any reason, we can see where we left off.

In [5]:
# Initialize an empty DataFrame to store scraped HTML
scraped_collection = pd.DataFrame(columns=['scraped_HTML'])

# Record the start time
start_time = time.time()

# Initialize the number of requests
request_count = 0

# Select which projects to scrape by their position in the sample. This makes
# it possible to start at a position other than the beginning if the scraper
# stopped unexpectedly. Note that ending_point is NOT inclusive!
starting_point = 2
ending_point = 5

# Perform web scraping
for index, row in df_sample[starting_point:ending_point].iterrows():
    # Request the page and give up if the server takes longer than 6 seconds
    # to respond
    scraped_html = requests.get(row['hyperlink'], timeout=6)
    
    # Pause the loop for a random amount of time
    time.sleep(random.uniform(2, 4))
    
    # Monitor the requests and display current progress
    elapsed_time = time.time() - start_time
    clear_output(wait=True)
    print(
        'Request: {}; Row ID: {}; Frequency: {} requests/sec'.format(
            request_count + starting_point,
            index,
            (request_count + 1) / elapsed_time
        )
    )
    request_count += 1
    
    # Store the response for the current project (its .text attribute holds
    # the page HTML)
    scraped_collection.loc[index, 'scraped_HTML'] = scraped_html
    
# Display the overall time, average scraping speed and total number of scraped
# pages
run_time = time.time() - start_time
print()
print('Run time:', run_time)
print('Average rate:', len(scraped_collection) / run_time)
print('# of projects scraped:', len(scraped_collection))

Request: 4; Row ID: 46223; Frequency: 0.3008067836360539 requests/sec

Run time: 9.980847835540771
Average rate: 0.3005756674615666
# of projects scraped: 3
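
The loop above stops at the first request that raises an exception, such as a timeout. One way to make it more forgiving is to retry a failed request a few times before giving up. The following is a minimal sketch rather than part of the original scraper: the helper name `fetch_with_retries` and its parameters (`max_attempts`, `pause`, `sleep`) are hypothetical, and `fetch` is assumed to be any callable that returns a response or raises an exception.

```python
import time


def fetch_with_retries(fetch, url, max_attempts=3, pause=2.0, sleep=time.sleep):
    """Call fetch(url) up to max_attempts times; return its result or None.

    All names here are illustrative assumptions, not part of the notebook.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except Exception as error:
            # Catching Exception keeps the sketch library-agnostic; with
            # requests, requests.exceptions.RequestException would be narrower.
            print('Attempt {} failed for {}: {}'.format(attempt, url, error))
            if attempt < max_attempts:
                # Wait before the next attempt
                sleep(pause)
    # Every attempt failed
    return None
```

Inside the scraping loop, this could be called as `fetch_with_retries(lambda u: requests.get(u, timeout=6), row['hyperlink'])`, skipping any row for which it returns `None`.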


Let's save the scraped HTML for each project page and label the filename with the positions of the first and last projects scraped.

In [6]:
# Serialize the data table containing the scraped HTML for each project
joblib.dump(
    scraped_collection, 'scraped_collection_{}-{}.pkl'.format(
        starting_point,
        ending_point - 1
    )
)

['scraped_collection_2-4.pkl']
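
Because each filename records the first and last positions scraped, the next `starting_point` can be recovered from the saved files after an interruption. The sketch below assumes every batch is saved with the `scraped_collection_<first>-<last>.pkl` naming used above; the function name `next_starting_point` is hypothetical.

```python
import re

# Assumed naming scheme: 'scraped_collection_<first>-<last>.pkl'
FILENAME_PATTERN = re.compile(r'^scraped_collection_(\d+)-(\d+)\.pkl$')


def next_starting_point(filenames):
    """Return the position right after the last scraped one (0 if none)."""
    # Collect the <last> position from every filename that matches the scheme
    last_positions = [
        int(match.group(2))
        for match in map(FILENAME_PATTERN.match, filenames)
        if match
    ]
    return max(last_positions) + 1 if last_positions else 0


print(next_starting_point(['scraped_collection_2-4.pkl']))  # prints 5
```

The list of filenames could come from `os.listdir()` on the directory where the batches are saved. Files that do not match the pattern are ignored.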