# Scraping and extracting content from a Kickstarter project page

**The purpose of this notebook is to develop and test a pipeline for scraping and parsing a Kickstarter project page and then extracting the two main sections of a campaign: "About This Project" and "Risks and Challenges".**

In [1]:
# Load required libraries
import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

In [2]:
# Select a Kickstarter page to process
hyperlink = 'https://www.kickstarter.com/projects/sbf/sculpto-the-worlds-' + \
    'most-user-friendly-desktop-3d-p?ref=discovery'

Let's begin by scraping the HTML content from the project page and then parsing it. I elected to use the `lxml` parser and to pass `response.text` rather than `response.content`, as these choices yield faster parsing.

In [3]:
# Scrape the project page
scraped_html = requests.get(hyperlink)

# Parse the HTML
soup = BeautifulSoup(scraped_html.text, 'lxml')
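
The request above assumes the page loads successfully. A slightly more defensive variant might look like the sketch below. The `fetch_html` helper, its `get` parameter, and the 10-second timeout are hypothetical names and values chosen for illustration; they are not part of the original pipeline.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page's HTML as text, raising on HTTP errors.

    `get` defaults to `requests.get`; it is a parameter only so that the
    helper can be exercised without network access (an assumption of this
    sketch, not something the notebook requires).
    """
    response = get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With this helper, the parsing step would become `soup = BeautifulSoup(fetch_html(hyperlink), 'lxml')`.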

Next, let's define functions to a) extract the two campaign sections and b) clean up the text.

In [4]:
def clean_up(messy_text):
    # Remove line breaks, leading and trailing whitespace, and compress all
    # whitespace to a single space
    clean_text = ' '.join(messy_text.split())

    # Remove the HTML5 warning for videos
    return clean_text.replace(
        "You'll need an HTML5 capable browser to see this content. "
        "Play Replay with sound Play with sound 00:00 00:00",
        ''
    )
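
To see what the whitespace step does in isolation, here is a small illustration on a made-up string. The sample text is invented for this sketch and was not scraped from the project page.

```python
# A made-up snippet with the kinds of whitespace get_text() tends to return
messy = "\n  Sculpto+ is   a small\t3D printer.\n\n It only weighs 2.7 KG. \n"

# str.split() with no arguments splits on any run of whitespace and drops
# empty strings, so joining on a single space normalizes the text
normalized = ' '.join(messy.split())
print(normalized)
# prints: Sculpto+ is a small 3D printer. It only weighs 2.7 KG.
```

Because `split()` discards empty strings at both ends, the joined result has no leading or trailing whitespace.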

In [5]:
def get_campaign(soup):
    # Collect the 'About this project' section if available
    try:
        section1 = soup.find(
            'div',
            class_='full-description js-full-description responsive-media '
                   'formatted-lists'
        ).get_text(' ')
    except AttributeError:
        section1 = 'section_not_found'

    # Collect the 'Risks and challenges' section if available
    try:
        section2 = (
            soup.find('div', class_='mb3 mb10-sm mb3 js-risks')
            .get_text(' ')
            .replace('Risks and challenges', '')
            .replace('Learn about accountability on Kickstarter', '')
        )
    except AttributeError:
        section2 = 'section_not_found'

    # Clean up both sections and return them in a dict
    return {'about': clean_up(section1), 'risks': clean_up(section2)}
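
One caveat with the lookups above: when `class_` is given a string containing several classes, Beautiful Soup matches it against the exact value of the `class` attribute, so the search stops matching if the page adds, removes, or reorders a class. The sketch below contrasts that with a CSS selector on a single class. The markup is invented for illustration, and it is an assumption of this sketch that `js-risks` alone would be enough to identify the section on a real page.

```python
from bs4 import BeautifulSoup

# Invented markup; the class order differs from the string in get_campaign
html = (
    '<div class="js-risks mb3 mb10-sm">'
    '<h3>Risks and challenges</h3><p>Shipping may slip.</p>'
    '</div>'
)
# html.parser is used here only to keep the sketch free of extra dependencies
demo_soup = BeautifulSoup(html, 'html.parser')

# Exact-string match on the class attribute: no match, the order differs
exact = demo_soup.find('div', class_='mb3 mb10-sm mb3 js-risks')

# CSS selector on one class: matches whatever other classes are present
by_selector = demo_soup.select_one('div.js-risks')

print(exact is None)
print(by_selector.get_text(' '))
```

In this example the exact-string lookup returns `None`, which is the case that `get_campaign` reports as `'section_not_found'`.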

Finally, let's test the extraction function on the parsed HTML.

In [6]:
# Display the `About this project` section
campaign = get_campaign(soup)
campaign['about']

"3D printers are one of the coolest and most efficient ways for you to make your ideas go from the drawing board and into real life. Two years ago we launched a new type of 3D printer on Kickstarter. We wanted to make the amazing world of 3D printing available to everyone - not just engineers and tech-savvy people. We believe everyone should be able to bring their ideas to life. We started delivering that dream a year ago and opened a world of 3D printing for regular people and schools all over Denmark. Since then we have refined our production and software while developing Sculpto+ The result is a small 3D printer with one of the biggest print areas seen on a plug'n'play printer. We have maximized printing performance while eliminating the noise and inconveniences of 3D printing - we have made the perfect desktop 3D printer. And the best part: Everyone can use it! Sculpto+ is a small 3D printer. It only weighs 2.7 KG and is about the size of your household coffeemaker, but has a print

In [7]:
# Display the `Risks and challenges` section
campaign['risks']

"The development of the Sculpto+ is almost finalized. Due to the fact that we are already manufacturing our previous Sculpto 3D printer - with a lot similarities to the Sculpto+ and have printers operating in private homes, schools and institutions all over Denmark we are very confident we can bring the Sculpto+ to production. The past 4 months we have been testing 3 stages of different prototypes and are now testing 10 production prototypes that are manufactured in a similar way to the final product. Our results have been amazing and we truly believe this 3D printer and the upgraded app will be a game-changer within 3D printing for consumers and educational use. Take a look at our 'prototype gallery' to see more. We have already started lining the production up and believe it is realistic to deliver all or a big part (Depending on the Kickstarters outcome) of the Sculpto+ 3D printers before Christmas."