# Kickstarter project scraper

This script uses the extracted Web Robots data, which contains the URLs of Kickstarter project pages, to scrape each page's HTML and store it in a DataFrame.

In [1]:
# Load required libraries
import requests
import numpy as np
import pandas as pd
import joblib
import time
import random
from IPython.display import clear_output
from warnings import warn
from bs4 import BeautifulSoup

Load file containing extracted Web Robots data.

In [2]:
# Load master file
df = joblib.load('data/web_robots_data/web_robots_data_to_06-2017.pkl')

Keep only the projects funded in US dollars, to focus on American projects written in English.

In [3]:
# Pull out US projects
df_USD = df[df['currency'] == 'USD']

Since the Web Robots data contains 196,000+ projects, let's select a random sample that we'll aim to scrape.

In [4]:
# Take a random sample of the projects, using a fixed seed for
# repeatability
df_sample = df_USD.sample(50000, random_state=42)

# Display the first three rows
df_sample.head(3)

Unnamed: 0,name,category,hyperlink,currency,pledged,goal,location,funded
88389,Help me start my cottage industry ... Bakesale...,Small Batch,https://www.kickstarter.com/projects/138529431...,USD,0.0,10000.0,"Cape Coral, FL",False
190378,The Sock Who Lost His Mate at NY Children's Th...,Musical,https://www.kickstarter.com/projects/987315242...,USD,2600.0,7000.0,"Greenwich Village, Manhattan, NY",False
21028,The 4 Disciples,Comic Books,https://www.kickstarter.com/projects/the4disci...,USD,165.0,2200.0,"Rahway, NJ",False


Begin the scraping process. During the run, we monitor progress and the request rate to avoid overloading Kickstarter's servers. At the end, we report the total run time, the average scraping rate, and the number of projects scraped. The progress line shows the last request number and its row ID, so if the scraper breaks for any reason, we know where we left off.

In [5]:
# Initialize an empty DataFrame to store scraped HTML
scraped_collection = pd.DataFrame(columns=['scraped_HTML'])

# Record the start time
start_time = time.time()

# Initialize the number of requests
request_count = 0

# Select which projects to scrape. Note: ending_point is NOT inclusive!
starting_point = 2
ending_point = 5

for index, row in df_sample[starting_point:ending_point].iterrows():
    # Request the page; give up if the server does not respond within 2 seconds
    scraped_html = requests.get(row['hyperlink'], timeout=2)
    
    # Pause the loop for a random amount of time
    time.sleep(random.uniform(2, 4))
    
    # Monitor the requests and display current progress
    elapsed_time = time.time() - start_time
    clear_output(wait = True)
    print(
        'Request: {}; Row ID: {}; Frequency: {} requests/sec'.format(
            request_count + starting_point,
            index,
            (request_count + 1) / elapsed_time
        )
    )
    
    request_count += 1
    # Add scraped HTML to the DataFrame
    scraped_collection.loc[index, 'scraped_HTML'] = scraped_html
    
# Display the overall time and average scraping speed
run_time = time.time() - start_time
print()
print('Run time:', run_time)
print('Average rate:', len(scraped_collection) / run_time)
print('# of projects scraped:', len(scraped_collection))

Request: 4; Row ID: 46223; Frequency: 0.3008067836360539 requests/sec

Run time: 9.980847835540771
Average rate: 0.3005756674615666
# of projects scraped: 3
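
The loop above stops at the first request that raises an exception (a timeout, for example), and everything scraped up to that point exists only in memory. One possible way to keep going is sketched below, assuming that failed rows should be retried, then recorded and skipped. The helper name `scrape_rows` and its `fetch`, `pause`, and `max_attempts` parameters are hypothetical and not part of the original notebook; `fetch` stands in for the HTTP call so the logic can be tried without a network connection.

```python
import time


def scrape_rows(rows, fetch, pause=lambda: 0.0, max_attempts=2):
    """Fetch each (row_id, url) pair, retrying failures instead of aborting.

    Hypothetical helper: `fetch` is any callable that takes a URL and
    returns the page (or raises); `pause` returns the number of seconds
    to sleep after each attempt.
    """
    scraped = {}  # row_id -> fetched result
    failed = []   # row_ids that never succeeded
    for row_id, url in rows:
        for attempt in range(max_attempts):
            try:
                scraped[row_id] = fetch(url)
                break
            except Exception as error:
                print('Row ID {} attempt {} failed: {}'.format(
                    row_id, attempt + 1, error))
            finally:
                # Throttle after every attempt, successful or not
                time.sleep(pause())
        else:
            # No attempt succeeded, so remember the row for a later run
            failed.append(row_id)
    return scraped, failed
```

In this notebook, `fetch` could be `lambda url: requests.get(url, timeout=2)`, `pause` could be `lambda: random.uniform(2, 4)`, and `rows` could be `df_sample['hyperlink'][starting_point:ending_point].items()`. Catching `Exception` is deliberately broad for the sketch; narrowing it to `requests.exceptions.RequestException` would likely be preferable in practice.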


Dump the DataFrame containing all scraped HTML, labeling the filename with the range of rows scraped.

In [6]:
joblib.dump(
    scraped_collection, 'scraped_collection_{}-{}.pkl'.format(
        starting_point,
        ending_point - 1
    )
)

['scraped_collection_2-4.pkl']
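
Because `ending_point` is exclusive while the filename uses `ending_point - 1`, scraping the full sample in pieces takes some bookkeeping. The sketch below shows one way to plan it, assuming fixed-size chunks. `plan_chunks` and `chunk_size` are hypothetical names, and the seconds-per-request figure is simply the run time divided by the number of requests in the three-request test run, so the time estimates are rough.

```python
def plan_chunks(total_rows, chunk_size, seconds_per_request):
    """Return (starting_point, ending_point, filename, est_hours) tuples.

    ending_point is exclusive, matching the slicing in the scraping loop;
    the filename uses the last row actually included (ending_point - 1).
    """
    plan = []
    for starting_point in range(0, total_rows, chunk_size):
        ending_point = min(starting_point + chunk_size, total_rows)
        filename = 'scraped_collection_{}-{}.pkl'.format(
            starting_point, ending_point - 1)
        est_hours = (ending_point - starting_point) * seconds_per_request / 3600
        plan.append((starting_point, ending_point, filename, est_hours))
    return plan


# Roughly 3.33 s per request in the test run (9.98 s for 3 requests)
for chunk in plan_chunks(50000, 10000, 9.98 / 3)[:2]:
    print(chunk)
```

At that pace the full 50,000-project sample would take on the order of 46 hours, which is one reason to scrape in chunks and dump each chunk to its own file.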

Also dump the DataFrame containing the sampled projects.

In [7]:
joblib.dump(df_sample, 'sampled_web_robots_data.pkl')

['sampled_web_robots_data.pkl']