# Scraping and extracting content from a Kickstarter project page

**The purpose of this notebook is to develop and test a pipeline for scraping and parsing a Kickstarter project page, and then extracting the two main sections of a campaign: "About This Project" and "Risks and Challenges".**

In [1]:
# Load required libraries
import requests
from bs4 import BeautifulSoup
import lxml  # not called directly; confirms the parser backend is installed

In [2]:
# Select a Kickstarter page to process
hyperlink = 'https://www.kickstarter.com/projects/1385294316/help-me-' + \
    'start-my-cottage-industry-bakesalecom?ref=category_newest'

Let's begin by scraping the HTML content from the project page and then parsing it. I chose the `lxml` parser and passed `response.text` (the decoded string) rather than `response.content` (the raw bytes) to Beautiful Soup, since this combination gave faster parsing.

In [3]:
# Scrape the project page
scraped_html = requests.get(hyperlink)

# Parse the HTML
soup = BeautifulSoup(scraped_html.text, 'lxml')
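
The cell above assumes the request succeeds. As a minimal sketch of a more defensive variant, the helper below adds a timeout and checks the HTTP status before parsing. The name `fetch_soup` and the 10-second timeout are assumptions made for illustration and are not part of the original pipeline.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and parse it, or return None if the request fails.

    Hypothetical helper: the name and the default timeout are assumptions.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise an HTTPError for 4xx/5xx responses instead of parsing them
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f'Request failed: {error}')
        return None

    # Same parser and input type as the cell above
    return BeautifulSoup(response.text, 'lxml')
```

Calling `fetch_soup(hyperlink)` would then return either a parsed page or `None`, which lets later steps skip pages that could not be downloaded.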

Next, let's define two functions: one to clean up the text and one to extract the two campaign sections.

In [4]:
def clean_up(messy_text):
    # Collapse line breaks and runs of whitespace into single spaces;
    # split() with no arguments also drops leading and trailing whitespace
    clean_text = ' '.join(messy_text.split())

    # Remove the HTML5 warning for videos
    clean_text = clean_text.replace(
        "You'll need an HTML5 capable browser to see this content. "
        "Play Replay with sound Play with sound 00:00 00:00",
        ''
    )

    # Collapse the gap left behind where the warning was removed
    return ' '.join(clean_text.split())
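
To see what the whitespace idiom does on its own, here is a small sketch on a made-up string. `NOTICE` and the sample text are hypothetical stand-ins, not content from a real project page. It also shows that deleting a phrase from the middle of a string leaves a doubled space, which a second collapse pass tidies up.

```python
# Hypothetical stand-in for the video warning text
NOTICE = "Play with sound 00:00 00:00"

# Made-up input with line breaks, tabs, and repeated spaces
raw = "  I love\n\n to   bake! " + NOTICE + "\tVisit my   shop.\n"

# split() with no arguments splits on any run of whitespace and drops
# leading and trailing whitespace, so join() yields single spaces only
collapsed = ' '.join(raw.split())

# Removing the notice leaves the spaces that surrounded it
removed = collapsed.replace(NOTICE, '')

# A second pass collapses the doubled space
tidy = ' '.join(removed.split())

print(repr(collapsed))
print(repr(removed))
print(repr(tidy))
```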

In [5]:
def get_campaign(soup):
    # Collect the 'About this project' section if available
    try:
        section1 = soup.find(
            'div',
            class_='full-description js-full-description responsive-media ' + \
                'formatted-lists'
        ).get_text(' ')
    except AttributeError:
        section1 = 'section_not_found'
    
    # Collect the 'Risks and challenges' section if available
    try:
        section2 = soup.find(
            'div', 
            class_='mb3 mb10-sm mb3 js-risks'
        ) \
            .get_text(' ') \
            .replace('Risks and challenges', '') \
            .replace('Learn about accountability on Kickstarter', '')
    except AttributeError:
        section2 = 'section_not_found'
    
    # Clean up both sections and return them in a dict
    return {'about': clean_up(section1), 'risks': clean_up(section2)}
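
One caveat about the lookups above: when `class_` is given a string containing several class names, Beautiful Soup compares it against the full `class` attribute exactly, so a reordered or extended class list no longer matches. The sketch below uses made-up markup, not the real Kickstarter page, to show a possible alternative that selects on a single distinctive class with a CSS selector. Whether `js-full-description` and `js-risks` remain stable on the live site is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that reuses the class names from get_campaign,
# but in a different order and with fewer utility classes
sample_html = """
<div class="formatted-lists full-description js-full-description">
  <p>We bake   bread.</p>
</div>
<div class="mb3 js-risks">
  <h3>Risks and challenges</h3>
  <p>Ovens can break.</p>
</div>
"""

sample_soup = BeautifulSoup(sample_html, 'html.parser')

# Exact-string matching fails here because the class order differs
exact = sample_soup.find(
    'div',
    class_='full-description js-full-description formatted-lists'
)

# A CSS selector on one class matches regardless of order or extra classes
about_div = sample_soup.select_one('div.js-full-description')
risks_div = sample_soup.select_one('div.js-risks')

about_text = ' '.join(about_div.get_text(' ').split())
risks_text = ' '.join(
    risks_div.get_text(' ').replace('Risks and challenges', '').split()
)

print(exact)
print(about_text)
print(risks_text)
```

The trade-off is that a single-class selector is more tolerant of layout changes but could also match an unintended element if the site reuses that class elsewhere.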

Finally, let's test the extraction function on the parsed HTML.

In [6]:
# Display the `About this project` section
campaign = get_campaign(soup)
campaign['about']

"I am a 51 year old woman who was a general manager for restaurants for years. I moved to Florida 3 years ago and found unemployment at 11%. Without being able to find a good job I started my own business as a house cleaner. I have been very successful cleaning two houses a day during season (or city rely's on snowbirds) and right now there is little work. House cleaning is a very hard physical job, that I like, but I don't love. And I'm finding that my body only has a few more years left to be doing this kind of work. What will I do, when I can't do this anymore?????? So I sat down and started listing the things that I love. And my list kept coming back to. I love to bake! After sitting down and making a business plan, if I could get kickstarted with a few thousand dollars I could rent a small space and equip it with what I need. It would help me with licensing and insurance. And if anything was left I could get a professional website done, and have a strong base to give my last try a

In [7]:
# Display the `Risks and challenges` section
campaign['risks']

'Challenges will be getting my name out there and getting a website done. Being a restaurant manager will help my endeavors, but actually setting up a facility will require some expertise outside my knowledge, I am planning on working with the small business administration for help. I have already attended workshops with them.'