# Preview Scraper
### This notebook scrapes all campaign previews (or "cards") in each category on Indiegogo. (Example: [Audio](https://www.indiegogo.com/explore/audio?project_type=campaign&project_timing=all&sort=trending))

In general, the scraping approach follows a two-step procedure. The first step is carried out in this notebook and involves scraping all available campaign preview cards on Indiegogo. These cards contain some key information about each campaign as well as the url of the actual campaign page. The second step is carried out in the notebook B_CampaignScraper.ipynb and involves scraping the campaign pages for more details, based on the links gathered in step one.

In [79]:
# Setup
import time
import os
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.common.exceptions import ElementClickInterceptedException
from selenium.common.exceptions import ElementNotInteractableException
from selenium.common.exceptions import TimeoutException
from bs4 import BeautifulSoup

### 1) Defining Functions

When a link to an Indiegogo category is opened, the page only shows a few campaign previews at once. In order to scrape all campaign previews available in a category, the "show more" button must be clicked repeatedly until all previews are shown. This is done by the load_all_cards function.

In [80]:
def load_all_cards(driver):
    """ Repeatedly clicks the 'show more' button to load all cards in category and returns html page as soup. """
    print('- Loading all cards by repeatedly clicking "show more" button..')

    page_source = driver.page_source # Get page_source

    nr_clicks = 0
    while True:
        try:  # Find and click 'show more' button
            WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.XPATH, '//a[text() = "show more"]'))).click()
        except ElementClickInterceptedException:  # most likely obscured by cookies banner
            print('- "Show more" button obscured. Accepting cookie banner..')
            html = driver.find_element(By.TAG_NAME, 'html') # scroll to bottom
            html.send_keys(Keys.END)
            time.sleep(5)
            try: # accept cookie banner
                WebDriverWait(driver, 30).until(EC.element_to_be_clickable((By.ID, 'CybotCookiebotDialogBodyButtonAccept'))).click()
                print('- Successful')
            except ElementNotInteractableException:
                print('- Cookie banner not found. Trying different method..')
                cookie_banner = driver.find_element(By.ID, 'CybotCookiebotDialogBodyButtonAccept')
                driver.execute_script("arguments[0].scrollIntoView();", cookie_banner)
                cookie_banner.click()
                print('- Successful')
        except (TimeoutException, NoSuchElementException) as e:  # show more button doesn't exist
            print('- Could not find "show more" button on page. ({})'.format(type(e).__name__))
            break
        except Exception as e:
            print('Unanticipated Error: {}:{}'.format(type(e).__name__, e))
            break
        
        # Check if page contains more information (sometimes bug results in blank page, then do not update page source)
        page_source_updated = driver.page_source
        if len(page_source_updated) < 0.5*len(page_source):
            print('- Error: "Show more" reduced page content massively (Website bug removes all content).')
            print('- Source length 1: {} Source length 2: {}'.format(len(page_source), len(page_source_updated)))
            break
        else:
            page_source = page_source_updated  # update page source
            
        nr_clicks += 1
        time.sleep(1)

    # Get html source and nr of cards
    soup = BeautifulSoup(page_source, 'html.parser')
    nr_cards = len(soup.find_all('div', class_='discoverableCard'))
    print('- Loading finished')
    print('- "Show more" clicks: {}'.format(nr_clicks))
    print('- Loaded cards: {}'.format(nr_cards))
    return soup
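
The blank-page safeguard inside load_all_cards can be looked at in isolation. The sketch below uses a hypothetical helper name, page_collapsed, and reuses the 0.5 ratio from the function above. It only compares string lengths, so it can be tried without a browser; the example strings are made up and merely stand in for driver.page_source.

```python
def page_collapsed(previous_source, updated_source, min_ratio=0.5):
    """Return True if the updated source shrank below min_ratio of the previous one.

    Hypothetical helper (not part of the notebook); mirrors the length check
    that load_all_cards performs after every click.
    """
    return len(updated_source) < min_ratio * len(previous_source)


# Illustrative strings standing in for driver.page_source before and after a click
before = '<div class="discoverableCard"></div>' * 100
after_ok = before + '<div class="discoverableCard"></div>' * 12
after_bug = '<html></html>'

print(page_collapsed(before, after_ok))   # page grew after the click
print(page_collapsed(before, after_bug))  # page lost most of its content
```

Keeping the last good page source, as load_all_cards does, means the cards loaded before the collapse are still parsed.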

The indiegogo_parser function parses the html source of the page and returns the relevant information of every card as a data frame with one row per card.

In [81]:
def indiegogo_parser(soup):
    ''' Parser extracts infos from indiegogo.com search results '''

    # find all cards on page
    cards = soup.find_all('div', class_='discoverableCard')

    df = pd.DataFrame()
    i = 0
    for i, card in enumerate(cards):
        #print('************ {} **************'.format(i))
        try:
            project_link = card.find('a')['href']
            project_link = 'https://www.indiegogo.com' + project_link
        except Exception as e:
            project_link = None
            print('- Error extracting project_link (card {}/{}): {}'.format(i, len(cards)-1, e))

        try:
            title = card.find('div', {'class': 'discoverableCard-title'}).get_text().strip()
        except Exception as e:
            title = None
            print('- Error extracting title (card {}/{}): {}'.format(i, len(cards)-1, e))

        try:
            description = card.find('div', {'class': 'discoverableCard-description'}).get_text().strip()
        except Exception as e:
            description = None
            print('- Error extracting description (card {}/{}): {}'.format(i, len(cards)-1, e))

        try:
            card_category = card.find('div', {'class': 'discoverableCard-category'}).get_text().strip()
        except Exception as e:
            card_category = None
            print('- Error extracting card_category (card {}/{}): {}'.format(i, len(cards)-1, e))
        
        try:
            days_left = card.find('span', {'class': 'discoverableCard-formattedDate'}).get_text().strip()
        except Exception as e:
            days_left = None

        try:
            launching_soon = card.find('span', {'class': 'discoverableCard-LaunchingSoonLabel'}).get_text().strip()
        except Exception as e:
            launching_soon = None
        
        try:
            amt_raised = card.find('div', {'class': 'discoverableCard-raised'})
            amt_raised = ' '.join(amt_raised.get_text().split())  # removes whitespaces etc.
        except Exception as e:
            amt_raised = None
            if launching_soon is None:  # If campaign has launching soon tag, it's not an error
                print('- Error extracting amt_raised (card {}/{}): {}'.format(i, len(cards)-1, e))
        # Append card as row to data frame (pd.concat also works with pandas >= 2.0)
        row = pd.DataFrame([{'project_link': project_link,
                             'title': title,
                             'description': description,
                             'card_category': card_category,
                             'amt_raised': amt_raised,
                             'days_left': days_left,
                             'launching_soon': launching_soon}])
        df = pd.concat([df, row], ignore_index=True)
    print('- Parsed cards: {}'.format(len(cards)))  # also correct if no cards were found
    return df
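
Two small string operations do most of the cleaning in indiegogo_parser: prefixing the relative href with the site root and collapsing whitespace in the raised amount. A minimal sketch of both follows. The helper names and the input values are assumptions for illustration only, not scraped data.

```python
BASE_URL = 'https://www.indiegogo.com'


def build_project_link(href):
    """Prefix a relative card href with the site root, as the parser does."""
    return BASE_URL + href


def collapse_whitespace(text):
    """Replace every run of spaces, tabs and newlines with a single space."""
    return ' '.join(text.split())


# Hypothetical raw values as they might come out of a preview card
raw_href = '/projects/example-campaign'
raw_amount = '\n   $12,345 USD\n      raised  \n'

print(build_project_link(raw_href))
print(repr(collapse_whitespace(raw_amount)))
```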

In [83]:
def initialize_browser():
    """ Initialize browser for scraping """
    opts = Options()
    opts.headless = False  # set to True to run the browser without a visible window
    driver = webdriver.Firefox(options=opts, executable_path='geckodriver.exe')
    return driver

A manual list of all the categories as they appear in the url is created. From this, the url that opens all campaign previews in a given category can be constructed.

### 2) Scrape Campaign Previews

In [82]:
# Define Indiegogo categories
tech_innovation = ["audio", "camera-gear", "education", "energy-green-tech", "fashion-wearables", "food-beverages", 
                    "health-fitness", "home", "phones-accessories", "productivity", "transportation", "travel-outdoors"]
creative_works = ["art", "comics", "dance-theater", "film", "music", "photography", "podcasts-blogs-vlogs",
                  "tabletop-games", "video-games", "web-series-tv-shows", "writing-publishing"]
community_projects = ["culture", "environment", "human-rights", "local-businesses", "wellness"]

# Combine all categories
categories = tech_innovation + creative_works + community_projects
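
As a quick illustration of how a category slug turns into an explore url, the sketch below builds the query string with urllib.parse.urlencode instead of a hard-coded string. The function name category_url and the short sample list are assumptions for illustration. The query parameters are the ones from the example link at the top of this notebook.

```python
from urllib.parse import urlencode


def category_url(category):
    """Build the explore url for one category slug (hypothetical helper)."""
    query = urlencode({'project_type': 'campaign',
                       'project_timing': 'all',
                       'sort': 'trending'})
    return 'https://www.indiegogo.com/explore/{}?{}'.format(category, query)


# Two slugs taken from the tech_innovation list
sample_categories = ['audio', 'camera-gear']
for slug in sample_categories:
    print(category_url(slug))
```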

The main loop is the core of the script. It iterates over all categories and opens the url of each category in a Selenium session. The driver is then passed to the load_all_cards function, which loads all campaign previews and returns the page source as soup. The soup is then parsed by the indiegogo_parser function, and the resulting data frame is saved as a csv file in the downloads folder. Because of rare errors that could not be fixed, each category is attempted up to three times, which makes it very likely that all campaign previews are collected even if such an error occurs.

In [84]:
# Main loop (iterate over all categories)
os.makedirs('downloads', exist_ok=True)  # make sure the output folder exists
max_tries = 3
for category in categories:
    url = 'https://www.indiegogo.com/explore/{}?project_type=campaign&project_timing=all&sort=trending'.format(category)

    print('\nCategory: {} - {}'.format(category, time.strftime('%H:%M')))

    for try_count in range(1, max_tries + 1):  # start at 1
        driver = None
        try:
            # Init browser and load page
            driver = initialize_browser()
            driver.get(url)
            # Extract data
            soup = load_all_cards(driver)  # load all cards on page
            df = indiegogo_parser(soup)  # parse all cards on page
            df.to_csv('downloads/{}.csv'.format(category))
            print('- Saved data frame as csv')
            break
        except Exception as e:  # if error, try again (some sporadic server-side error seems unavoidable)
            print('- Scraper ran into unexpected error on try {}: {}:{}'.format(try_count, type(e).__name__, e))
            if try_count == max_tries:  # if max tries reached, skip to next category
                print('- Max tries reached, skipping to next category..')
            else:
                print('- Retrying...')
        finally:
            if driver is not None:
                driver.quit()  # close the browser so that sessions do not pile up
    




Category: audio - 20:29
- Loading all cards by repeatedly clicking "show more" button..
- "Show more" button obscured. Accepting cookie banner..
- Successful
- Error: "Show more" reduced page content massively (Website bug removes all content).
- Source length 1: 6140966 Source length 2: 998987
- Loading finished
- "Show more" clicks: 81
- Loaded cards: 961
- Parsed cards: 961
- Saved data frame as csv

Category: camera-gear - 20:31
- Loading all cards by repeatedly clicking "show more" button..
- "Show more" button obscured. Accepting cookie banner..
- Successful
- Could not find "show more" button on page. (TimeoutException)
- Loading finished
- "Show more" clicks: 53
- Loaded cards: 624
- Parsed cards: 624
- Saved data frame as csv

Category: education - 20:34
- Loading all cards by repeatedly clicking "show more" button..
- "Show more" button obscured. Accepting cookie banner..
- Successful
- Error: "Show more" reduced page content massively (Website bug removes all content).
- So
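
The retry logic of the main loop can also be written as a small stand-alone helper. The following is a sketch only: run_with_retries and flaky_task are hypothetical names, and the simulated failure stands in for the sporadic errors mentioned above.

```python
def run_with_retries(task, max_tries=3):
    """Call task() up to max_tries times.

    Returns (result, tries_used) on success and (None, max_tries) if every
    attempt raised an exception.
    """
    for try_count in range(1, max_tries + 1):
        try:
            return task(), try_count
        except Exception as e:
            print('- Error on try {}: {}:{}'.format(try_count, type(e).__name__, e))
    return None, max_tries


calls = {'n': 0}


def flaky_task():
    """Fails on the first two calls, then succeeds (simulated error)."""
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('simulated server-side error')
    return 'ok'


result, tries = run_with_retries(flaky_task)
print(result, tries)
```

In the scraper, task would wrap the browser start, load_all_cards, and indiegogo_parser for a single category.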

Lastly, all data frames in the downloads folder are combined and pickled as a single file called raw_preview_data.pkl.

In [None]:
def combine_files(directory):
    ''' Combines all data frames stored as csv files in directory (including subdirectories). '''
    # list all files in directory
    file_paths = []
    for path, subdirs, filenames in os.walk(directory):
        for name in filenames:
            file_paths.append(os.path.join(path, name))

    # Read every file (the first column holds the index that was saved with to_csv)
    frames = [pd.read_csv(file, index_col=0) for file in file_paths]
    if not frames:  # nothing to combine
        return pd.DataFrame()
    # Combine all files into one data frame
    return pd.concat(frames, ignore_index=True)
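
combine_files reads every file that os.walk finds, so a stray non-csv file in the folder (for example a hidden system file) may cause pd.read_csv to fail or add unwanted rows. One possible guard, sketched here with a hypothetical helper list_csv_files, is to keep only paths ending in .csv. The demonstration runs in a temporary folder with empty placeholder files.

```python
import os
import tempfile


def list_csv_files(directory):
    """Return the sorted paths of all .csv files below directory (hypothetical helper)."""
    csv_paths = []
    for path, subdirs, filenames in os.walk(directory):
        for name in filenames:
            if name.lower().endswith('.csv'):
                csv_paths.append(os.path.join(path, name))
    return sorted(csv_paths)


# Demonstration in a temporary folder with two csv files and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ['audio.csv', 'film.csv', 'notes.txt']:
        with open(os.path.join(tmp, name), 'w') as f:
            f.write('')
    found = [os.path.basename(p) for p in list_csv_files(tmp)]
    print(found)
```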

In [None]:
# Construct combined data frame
df = combine_files('downloads')

# Save combined data frame
df.to_pickle('data/raw_preview_data.pkl')