# Campaign Scraper
### This notebook will scrape the detailed data from each campaign website. (Example: [Center Cam](https://www.indiegogo.com/projects/center-cam-finally-a-middle-screen-webcam--2#/))

This constitutes the second part of the scraping process, where each campaign website is opened individually to scrape additional information like the campaign story or meta info about the author. Because campaigns cannot be opened in parallel, this part is the most time-consuming.

In [5]:
# Setup
import os
import time
import pandas as pd
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
from bs4 import BeautifulSoup

### 1) Define functions

The campaign parser is the most important function, as it extracts all relevant information from the campaign page and outputs it as a data frame.

In [6]:
def campaign_parser(soup):
    ''' Parser extracts the story and owner details from a single indiegogo.com campaign page '''
    story = soup.find('div', {'class': 'routerContentStory-storyBody'})
    if story is None:  # raise error if no story found
        raise Exception('Story is None')

    story_txt = story.get_text().strip()

    # find_all always returns a list (empty if nothing matches), so a count of 0
    # simply means the story has no images or videos and is not an error
    img_len = len(story.find_all('img'))
    video_len = len(story.find_all('div', {'class': 'campaignStoryVideoWrapper'}))

    campaign_count = soup.find('div', {'class': 'basicsCampaignOwner-details-count'})
    if campaign_count is None:  # raise error if no campaign count found
        raise Exception('Campaign count is None')
    campaign_count = campaign_count.get_text().strip()

    campaign_city = soup.find('div', {'class': 'basicsCampaignOwner-details-city'})
    if campaign_city is None:  # raise error if no city found
        raise Exception('Campaign city is None')
    campaign_city = campaign_city.get_text().strip()

    # Return the campaign as a one-row data frame
    # (DataFrame.append was removed in pandas 2.0, so the row is built directly)
    df = pd.DataFrame([{'story': str(story),  # raw story HTML as a string
                        'story_txt': story_txt,
                        'img_len': img_len,
                        'video_len': video_len,
                        'campaign_count': campaign_count,
                        'campaign_city': campaign_city}])
    return df
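
To illustrate why the parser treats a missing story as an error but a missing image or video as a valid result, here is a minimal sketch on a hypothetical, heavily simplified page. The HTML snippet and its contents are invented for illustration; only the class names are taken from the parser above.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified campaign page (not real Indiegogo markup)
html = '''
<div class="routerContentStory-storyBody">
  <p>Our story.</p>
  <img src="a.png"><img src="b.png">
</div>
<div class="basicsCampaignOwner-details-city">Zurich, Switzerland</div>
'''
soup = BeautifulSoup(html, 'html.parser')

story = soup.find('div', {'class': 'routerContentStory-storyBody'})
story_txt = story.get_text().strip()

# find_all returns a (possibly empty) list, never None
n_images = len(story.find_all('img'))
n_videos = len(story.find_all('div', {'class': 'campaignStoryVideoWrapper'}))

# find returns None when the element is missing, which the parser treats as an error
owner_count = soup.find('div', {'class': 'basicsCampaignOwner-details-count'})
city = soup.find('div', {'class': 'basicsCampaignOwner-details-city'}).get_text().strip()

print(story_txt, n_images, n_videos, owner_count, city)
```

In this sketch the video count is simply zero, while the missing owner count shows up as `None` and would make the parser raise, which in turn triggers a retry in the main loop.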

In [7]:
def initialize_browser():
    """ Initialize browser for scraping """
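    opts = Options()
    opts.add_argument('-headless')  # run Firefox without a visible window
    # Note: Selenium 4.10+ no longer accepts executable_path; there, pass
    # service=Service('geckodriver.exe') from selenium.webdriver.firefox.service instead
    driver = webdriver.Firefox(options=opts, executable_path='geckodriver.exe')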
    return driver

### 2) Scrape Individual Campaigns

The following loop is the heart of the second scraping script. It opens the URL of each campaign from the preview data and extracts all information from the page. The "save_every" variable defines how many campaigns are merged into one data frame before it is saved to disk. As in the first part, the "max_tries" parameter specifies how many times the scraper tries to extract a single campaign if a random error occurs. The first 12,800 campaigns are skipped in this version of the notebook because I accidentally closed the script before it finished and did not want to scrape them again. The error log shows that the scraper encountered only a handful of errors despite the high number of campaigns, and that all errors were handled accordingly.
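
The retry logic relies on Python's `while ... else` construct: the `else` branch runs only when the loop ends without a `break`, that is, when every try failed. The following is a minimal sketch of that pattern in isolation. The helper `fetch_with_retries` and the two fake fetch functions are hypothetical stand-ins for the browser call, not part of the scraper.

```python
import time

def fetch_with_retries(fetch, url, max_tries=3, wait=0.0):
    """Call fetch(url) up to max_tries times.

    Returns (result, tries_used); result is None if every try raised.
    """
    try_count = 0
    while try_count < max_tries:
        try_count += 1  # start at 1
        try:
            result = fetch(url)
            break  # success: leave the loop and skip the else branch
        except Exception as e:
            print('- Try {}/{} failed. {}: {}'.format(
                try_count, max_tries, type(e).__name__, e))
            time.sleep(wait)
    else:
        # Reached only if the loop ended without break (all tries failed)
        result = None
    return result, try_count

# Hypothetical fetch function that fails twice, then succeeds
calls = {'n': 0}
def flaky_fetch(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise RuntimeError('random error')
    return 'page for ' + url

# Hypothetical fetch function that never succeeds
def always_failing(url):
    raise RuntimeError('site down')

ok = fetch_with_retries(flaky_fetch, 'https://example.com/a')
failed = fetch_with_retries(always_failing, 'https://example.com/b')
print(ok, failed)
```

The scraper below uses the same structure, except that a failed try also restarts the browser before the next attempt.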

In [8]:
# Main loop (iterate over all urls)
skip = 12800
data = pd.read_csv('raw_preview_data.csv') # Load data
df_campaign = pd.DataFrame()
driver = initialize_browser()  # Init browser
for i, url in enumerate(data.project_link):
    if i <= skip: # skip first n iterations
        continue
    max_tries = 3
    try_count = 0
    save_every = 100  # save every x campaigns
    while try_count < max_tries:
        try_count += 1  # start at 1
        try:
            if i % save_every == 0:
                print('Scraping campaigns: {}-{}/{} - {}'.format(i, i+save_every, len(data.project_link), time.strftime('%H:%M:%S %d/%m/%Y')))
            # Load page
            driver.get(url)
            page_source = driver.page_source
            soup = BeautifulSoup(page_source, 'html.parser')

            # Extract data
            row = campaign_parser(soup)
            row['url'] = url
            df_campaign = pd.concat([df_campaign, row])  # DataFrame.append was removed in pandas 2.0
            if i % save_every == 0 and i != 0:  # save once every save_every campaigns
                df_campaign.to_csv('downloads2/campaign_{}.csv'.format(i)) # save dataframe
                df_campaign = pd.DataFrame()  # reset to empty dataframe
            break
        # Retry if error
        except Exception as e:
            try:
                driver.quit()
                driver = initialize_browser()
            except Exception as e:
                print('- Restart browser failed! {}: {}'.format(type(e).__name__, e))
                continue
        time.sleep(0.5)
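    # Skip to next url if all tries failed (while loop ended without break)
    else:
        print('- Failed on try: {}/{}. Url: {}\n- Skipped to next url.'.format(try_count, max_tries, url))

# Save the campaigns collected since the last checkpoint and close the browser
if not df_campaign.empty:
    df_campaign.to_csv('downloads2/campaign_{}.csv'.format(i))
driver.quit()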

Scraping campaigns: 12900-13000/118816 - 23:50:21 12/11/2021
Scraping campaigns: 13000-13100/118816 - 23:55:56 12/11/2021
Scraping campaigns: 13100-13200/118816 - 00:01:20 13/11/2021
Scraping campaigns: 13200-13300/118816 - 00:06:38 13/11/2021
Scraping campaigns: 13200-13300/118816 - 00:06:47 13/11/2021
Scraping campaigns: 13300-13400/118816 - 00:12:06 13/11/2021
Scraping campaigns: 13400-13500/118816 - 00:17:28 13/11/2021
Scraping campaigns: 13500-13600/118816 - 00:23:04 13/11/2021
Scraping campaigns: 13600-13700/118816 - 00:28:18 13/11/2021
Scraping campaigns: 13700-13800/118816 - 00:33:49 13/11/2021
Scraping campaigns: 13800-13900/118816 - 00:39:13 13/11/2021
Scraping campaigns: 13900-14000/118816 - 00:44:31 13/11/2021
Scraping campaigns: 14000-14100/118816 - 00:49:41 13/11/2021
Scraping campaigns: 14000-14100/118816 - 00:49:50 13/11/2021
Scraping campaigns: 14100-14200/118816 - 00:55:08 13/11/2021
Scraping campaigns: 14200-14300/118816 - 01:00:25 13/11/2021
Scraping campaigns: 1430
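
The timestamps in this log give a rough idea of how long the full run takes. The sketch below is a back-of-the-envelope estimate based only on the first and the last complete log line above, assuming the scraping speed stays constant and that the first number in each line is the index of the campaign being processed at that moment.

```python
from datetime import datetime

fmt = '%H:%M:%S %d/%m/%Y'

# First and last complete log lines from the output above
start_idx, start_time = 12900, datetime.strptime('23:50:21 12/11/2021', fmt)
end_idx, end_time = 14200, datetime.strptime('01:00:25 13/11/2021', fmt)
total = 118816  # total number of campaigns according to the log

elapsed = (end_time - start_time).total_seconds()
sec_per_campaign = elapsed / (end_idx - start_idx)
remaining_hours = (total - end_idx) * sec_per_campaign / 3600

print('{:.2f} s per campaign, about {:.0f} h remaining'.format(
    sec_per_campaign, remaining_hours))
```

At roughly 3.2 seconds per campaign, the remaining campaigns take on the order of 94 hours, which is why this part is the most time-consuming step.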

Lastly, all files in the download folder are combined and saved as a single pickle file.

In [None]:
def combine_files(directory):
    ''' Combines all CSV files in directory (including subdirectories) into one data frame. '''
    # list all files in directory
    file_paths = []
    for path, subdirs, filenames in os.walk(directory):
        for name in filenames:
            file_paths.append(os.path.join(path, name))

    # Read every file, then concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0)
    frames = [pd.read_csv(file, index_col=0) for file in sorted(file_paths)]
    if not frames:  # empty directory
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)

In [None]:
# Construct combined data frame
df = combine_files('downloads2')

# Save combined data frame
df.to_pickle('data/raw_campaign_data.pkl')
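
As a quick check of the combining step, the following sketch writes two small batch files into a temporary directory and merges them with a single `pd.concat` call. The file names follow the `campaign_{i}.csv` pattern used above; the column values are made up for illustration.

```python
import os
import tempfile
import pandas as pd

with tempfile.TemporaryDirectory() as tmp:
    # Two hypothetical batch files, as the main loop would write them
    pd.DataFrame({'url': ['a', 'b'], 'img_len': [1, 2]}).to_csv(
        os.path.join(tmp, 'campaign_100.csv'))
    pd.DataFrame({'url': ['c'], 'img_len': [3]}).to_csv(
        os.path.join(tmp, 'campaign_200.csv'))

    # Read all batches and concatenate them once
    paths = sorted(os.path.join(tmp, name) for name in os.listdir(tmp))
    frames = [pd.read_csv(path, index_col=0) for path in paths]
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Collecting the frames in a list and concatenating once avoids copying the growing data frame for every file, which matters with more than a thousand batch files.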