# Prototyping: Scraping and extracting content from a Kickstarter project page

**Goal: Develop and test a pipeline for scraping and parsing content from a Kickstarter project page, and then extract the two main sections of a campaign: "About This Project" and "Risks and Challenges".**

In [1]:
# Load required libraries
import requests
from bs4 import BeautifulSoup
import lxml  # not called directly; importing it confirms the parser is installed

Let's begin by selecting a hyperlink to test.

In [2]:
# Select a Kickstarter project page
hyperlink = (
    'https://www.kickstarter.com/projects/1799891707/'
    'ghost-hunting-team-and-equipment?ref=recommended'
)

Next, let's scrape the HTML content from the project page and then parse it. I chose the `lxml` parser and passed the response's `.text` attribute (the decoded string) rather than `.content` (the raw bytes), since both choices made parsing faster.

In [3]:
# Scrape the project page
scraped_html = requests.get(hyperlink)

# Parse the HTML content using an lxml parser
soup = BeautifulSoup(scraped_html.text, 'lxml')
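The request above assumes the page loads successfully. As a minimal sketch, the download step could be wrapped in a small helper that sets a timeout and raises on HTTP errors, so a failed request is not silently parsed as if it were a project page. The name `fetch_html` and the 10-second default are illustrative assumptions, not part of the original pipeline.

```python
import requests


def fetch_html(url, timeout=10):
    """Hypothetical helper: download a page and return its HTML as text.

    Args:
        url (str): address of the page to download
        timeout (int or float): seconds to wait before giving up (assumed
            default, adjust as needed)

    Returns:
        a string containing the decoded HTML of the page"""

    response = requests.get(url, timeout=timeout)

    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an
    # error page
    response.raise_for_status()

    return response.text


# Example usage (requires network access):
# soup = BeautifulSoup(fetch_html(hyperlink), 'lxml')
```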

Next, let's define functions to a) extract the two campaign sections and b) clean up the text.

In [4]:
def clean_up(messy_text):        
    """Clean up the text of a campaign section by removing unnecessary and
    extraneous content
    
    Args:
        messy_text (str): the raw text from a campaign section
    
    Returns:
        a string containing the cleaned text"""
    
    # Remove line breaks and leading/trailing whitespace, and compress all
    # runs of whitespace to a single space
    clean_text = ' '.join(messy_text.split())
    
    # Remove the HTML5 warning for videos, then trim any space left behind
    # when the warning sits at the start or end of the text
    return clean_text.replace(
        "You'll need an HTML5 capable browser to see this content. "
        "Play Replay with sound Play with sound 00:00 00:00",
        ''
    ).strip()
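To see the cleaning steps in isolation, here is a quick check on a made-up string. The sketch restates the function in compact form (with the warning held in a constant, `VIDEO_WARNING`, a name chosen for illustration) and trims the result after removing the warning, so no stray leading space remains when a video sits at the top of a section. The sample text is invented for the test.

```python
VIDEO_WARNING = (
    "You'll need an HTML5 capable browser to see this content. "
    "Play Replay with sound Play with sound 00:00 00:00"
)


def clean_up(messy_text):
    """Collapse whitespace, drop the video warning, and trim the result."""
    clean_text = ' '.join(messy_text.split())
    return clean_text.replace(VIDEO_WARNING, '').strip()


# Invented sample: a video warning followed by text with irregular spacing
sample = (
    "  You'll need an HTML5 capable browser to see this content.\n"
    "Play\n Replay with sound\tPlay with sound 00:00 00:00\n"
    "Hello,   my name is\ncayden.  "
)

cleaned = clean_up(sample)
print(repr(cleaned))  # 'Hello, my name is cayden.'
```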

In [5]:
def get_campaign(soup):
    """Extract the two campaign sections, "About this project" and "Risks and
    challenges", of a Kickstarter project
    
    Args:
        soup (BeautifulSoup): parsed HTML content of a Kickstarter project page
    
    Returns:
        a dictionary of 2 strings containing each campaign section"""
    
    # Collect the "About this project" section if available
    try:
        section1 = soup.find(
            'div',
            class_='full-description js-full-description responsive-media ' + \
                'formatted-lists'
        ).get_text(' ')
    except AttributeError:
        section1 = 'section_not_found'
    
    # Collect the "Risks and challenges" section if available, and remove
    # unnecessary text
    try:
        section2 = soup.find(
            'div', 
            class_='mb3 mb10-sm mb3 js-risks'
        ) \
            .get_text(' ') \
            .replace('Risks and challenges', '') \
            .replace('Learn about accountability on Kickstarter', '')
    except AttributeError:
        section2 = 'section_not_found'
    
    # Clean up both sections and return them in a dictionary
    return {'about': clean_up(section1), 'risks': clean_up(section2)}
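One caveat with the lookups above: passing a multi-class string to `class_` matches only when the element's `class` attribute equals that exact string, so a change in class order on the page would make `find` return `None`. A possible alternative, sketched below on made-up HTML, is to select on a single distinctive class with a CSS selector. The class name `js-risks` is taken from the function above; whether it is stable across Kickstarter pages is an assumption.

```python
from bs4 import BeautifulSoup

# Made-up HTML whose class order differs from the string used in `find`
html = '''
<div class="js-risks mb3 mb10-sm">
  <h3>Risks and challenges</h3>
  <p>Shipping may be delayed.</p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: fails because the order differs
exact = soup.find('div', class_='mb3 mb10-sm mb3 js-risks')

# CSS selector on one class: matches regardless of the other classes
by_selector = soup.select_one('div.js-risks')

print(exact)  # None
print(by_selector.get_text(' ', strip=True))
```

The trade-off is that a single class is a looser match, so it is worth confirming that only one element on the page carries it.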

Finally, let's test the extraction function on the parsed HTML.

In [6]:
# Display the `About this project` section
campaign = get_campaign(soup)
campaign['about']

'Hello, my name is cayden. I started this kickstarter thing so me and my friend could follow our dreams and be ghost hunters. THIS IS NOT A SCAM. We are serious about this im not gonna bullshit you guys, we already have some evps recorded and some haunted places to go to, we just need better film equipment like cameras and recorders etc. When we meet our goal we will start a youtube channel where we can give back to who gave to us, we will put of quality content. So please consider donating even a dollar either way were gonna continue to do this reguardless of funding. Most people get on here like oh yea hey gimmie money for some bs. We are serious about this and would love to have more people come with us!'

In [7]:
# Display the `Risks and challenges` section
campaign['risks']

"There aren't any bro lbvs. like either we get the $ and can start making videos or we don't either way we still going to try and do this, this is what we really wanna do!"

It looks like this scraping and parsing strategy is good to go!