# Feature engineering for a single Kickstarter project

This notebook engineers the features that will eventually be extracted from the scraped HTML of each project page.

## Table of contents
1. [Loading scraping and extraction functions](#cell1)
2. [Normalizing the campaign sections](#cell2)
3. [Defining functions to compute features](#cell3)
4. [Extracting all features for a single project](#cell4)

<a id="cell1"></a>
## 1. Loading scraping and extraction functions

In [1]:
# Load required libraries
import nltk
import requests
from bs4 import BeautifulSoup
import re
import lxml
import pandas as pd
import numpy as np
# sklearn.externals.joblib was removed in scikit-learn 0.23; use joblib directly
import joblib

Let's load the scraping, parsing, and extraction functions.

In [2]:
def scrape(hyperlink):
    # Scrape the website
    return requests.get(hyperlink)

def parse(scraped_html):
    # Parse the HTML content
    return BeautifulSoup(scraped_html.text, 'lxml')

def clean_up(messy_text):    
    # Remove line breaks, leading and trailing whitespace, and compress all
    # whitespace to a single space
    clean_text = ' '.join(messy_text.split()).strip()
    
    # Remove the HTML5 warning for videos
    return clean_text.replace(
        "You'll need an HTML5 capable browser to see this content. " + \
        "Play Replay with sound Play with sound 00:00 00:00",
        ''
    )

def get_campaign(soup):
    # Extract the 'About this project' section if available
    try:
        section1 = soup.find(
            'div',
            class_='full-description js-full-description responsive-media ' + \
                'formatted-lists'
        ).get_text(' ')
    except AttributeError:
        section1 = 'section_not_found'
    
    # Extract the 'Risks and challenges' section if available
    try:
        section2 = soup.find(
            'div', 
            class_='mb3 mb10-sm mb3 js-risks'
        ) \
            .get_text(' ') \
            .replace('Risks and challenges', '') \
            .replace('Learn about accountability on Kickstarter', '')
    except AttributeError:
        section2 = 'section_not_found'
    
    # Clean both sections and return them in a dict
    return {'about': clean_up(section1), 'risks': clean_up(section2)}
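
The whitespace handling in `clean_up` can be checked without scraping anything. The sketch below reproduces only that step on a made-up string (the sample text is an assumption, not taken from a real project page).

```python
def clean_up(messy_text):
    # Same whitespace logic as the notebook's clean_up: split() with no
    # arguments splits on any run of whitespace and drops empty strings,
    # so joining with a single space also trims both ends
    return ' '.join(messy_text.split())

# Made-up sample with line breaks, tabs, and padding
sample = '  Pebble 2\n\n   Time 2\t and   Core  '
print(clean_up(sample))
# Pebble 2 Time 2 and Core

# The 'section_not_found' sentinel passes through unchanged, which is why
# later cells can still compare against it after cleaning
print(clean_up('section_not_found'))
# section_not_found
```

Because `split()` already discards leading and trailing whitespace, the trailing `.strip()` in the notebook's version has no additional effect.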

Let's begin by selecting a project page and its URL.

In [3]:
hyperlink = 'https://www.kickstarter.com/projects/getpebble/pebble-2-time-2-and-core-an-entirely-new-3g-ultra'
#hyperlink = 'https://www.kickstarter.com/projects/sbf/sculpto-the-worlds-most-user-friendly-desktop-3d-p?ref=discovery'
#hyperlink = 'https://www.kickstarter.com/projects/getpebble/pebble-e-paper-watch-for-iphone-and-android'
#hyperlink = 'https://www.kickstarter.com/projects/1683069409/the-new-york-sorta-marathon?ref=discovery'
#hyperlink = 'https://www.kickstarter.com/projects/dinobytelabs/midli-a-dark-and-mystical-tale-of-letting-go?ref=category'
#hyperlink = 'https://www.kickstarter.com/projects/1385294316/help-me-start-my-cottage-industry-bakesalecom?ref=category_newest'
scraped_html = scrape(hyperlink)

Next, let's parse the HTML and extract the campaign sections.

In [4]:
soup = parse(scraped_html)
campaign = get_campaign(soup)

<a id="cell2"></a>
## 2. Normalizing the campaign sections

Some projects contain sections with email addresses and hyperlinks. Since these terms shouldn't be counted as words, let's tag them so we can avoid them later.

In [5]:
def normalize(text):
    # Tag email addresses
    normalized = re.sub(
        r'\b[\w\-.]+?@\w+?\.\w{2,4}\b',
        'emailaddr',
        text
    )
    
    # Tag hyperlinks
    normalized = re.sub(
        r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)',
        'httpaddr',
        normalized
    )
    
    return normalized

In [6]:
# Normalize campaign sections
campaign['about'] = normalize(campaign['about'])
campaign['risks'] = normalize(campaign['risks'])
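
To see what the two substitutions do, here is a small sketch that applies the same patterns to made-up sentences (the sample strings and `example.com` addresses are assumptions for illustration).

```python
import re

def normalize(text):
    # Same two substitutions as the notebook: emails first, then hyperlinks
    text = re.sub(r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr', text)
    return re.sub(r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr', text)

print(normalize(
    'Email us at hello@example.com or visit https://example.com/faq today.'
))
# Email us at emailaddr or visit httpaddr today.

# The second hyperlink alternative is a heuristic: anything shaped like
# word.ext is tagged, including file names, and trailing punctuation is
# swallowed by \S*
print(normalize('Save it as notes.txt.'))
# Save it as httpaddr
```

The order of the two steps matters. If hyperlinks were tagged first, `example.com` inside the email address would be replaced and the result would be `hello@httpaddr`, which the email pattern no longer matches.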

<a id="cell3"></a>
## 3. Defining functions to compute features

Let's create a function for each feature we want to extract and test it on the campaign's *About this project* section. If the campaign section is missing at any time, we'll assign `NaN` to every feature for that section.

### Count # of sentences

In [7]:
def get_sentences(text):
    # Tokenizes text into sentences and returns them in a list
    return nltk.sent_tokenize(text)

In [8]:
# If the campaign section is missing, assign NaN to the feature's value 
if campaign['about'] == 'section_not_found':
    num_sents = np.nan
else:
    num_sents = len(get_sentences(campaign['about']))
num_sents

182

### Count # of words

In [9]:
def remove_punc(text):
    # Returns the text with punctuation removed
    return re.sub(r'[^\w\d\s]', ' ', text)

In [10]:
def get_words(text):
    # Tokenizes text into words and returns them in a list excluding tags
    return [word for word in nltk.word_tokenize(remove_punc(text)) \
            if word not in ('emailaddr', 'httpaddr')]
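
Replacing punctuation with spaces affects the word count, because contractions and hyphenated terms are split into several tokens. The sketch below illustrates this on a made-up sentence; `str.split` stands in for `nltk.word_tokenize` so that the example runs without NLTK data, which is an assumption about equivalence for this simple input only.

```python
import re

def remove_punc(text):
    # Same pattern as the notebook; \d is already covered by \w, and \w
    # also keeps underscores
    return re.sub(r'[^\w\d\s]', ' ', text)

sample = "Don't miss our state-of-the-art e-paper watch!"
stripped = remove_punc(sample)

print(stripped.split())
# ['Don', 't', 'miss', 'our', 'state', 'of', 'the', 'art', 'e', 'paper', 'watch']

# 6 whitespace-separated words become 11 tokens once punctuation is removed
print(len(sample.split()), len(stripped.split()))
# 6 11
```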

In [11]:
if campaign['about'] == 'section_not_found':
    num_words = np.nan
else:
    num_words = len(get_words(campaign['about']))

# If the section contains no words, assign NaN to num_words to avoid potential
# division by zero
if num_words == 0:
    num_words = np.nan 
num_words

3549
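
Several later features divide a count by `num_words`. One possible way to keep the zero-word guard in a single place is a small helper like the one sketched below; `safe_ratio` is a hypothetical name and is not part of the notebook.

```python
import math

def safe_ratio(count, total):
    # Hypothetical helper: return NaN instead of raising ZeroDivisionError
    # when the denominator is zero, and propagate NaN denominators
    if total == 0 or math.isnan(total):
        return math.nan
    return count / total

print(safe_ratio(3, 0))                    # nan
print(safe_ratio(3, math.nan))             # nan
print(safe_ratio(26, 3549) == 26 / 3549)   # True
```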

### Count # of all-caps words and compute %

In [12]:
def identify_allcaps(text):
    # Returns a list of the all-caps words found in the text
    return re.findall(r'\b[A-Z]{2,}', text)

In [13]:
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        len(identify_allcaps(campaign['about'])),
        len(identify_allcaps(campaign['about'])) / num_words
    )

26 0.007326007326007326
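
The pattern in `identify_allcaps` has no closing word boundary, so it also matches the capitalized prefix of a mixed-case word. The sketch below compares it with a stricter variant on a made-up sentence; the variant is only an illustration, and adopting it would change the counts reported above.

```python
import re

# Made-up sample containing true all-caps words and mixed-case words
text = 'The NEW Pebble uses USB, but HTml and WiFi are not all-caps.'

# Pattern used in the notebook: 'HT' from 'HTml' is counted
print(re.findall(r'\b[A-Z]{2,}', text))
# ['NEW', 'USB', 'HT']

# Stricter variant: the whole word must consist of capital letters
print(re.findall(r'\b[A-Z]{2,}\b', text))
# ['NEW', 'USB']
```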


### Count # of exclamation marks and compute %

In [14]:
def count_exclamations(text):
    # Counts the number of exclamation marks present in the text
    return text.count('!')

In [15]:
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_exclamations(campaign['about']),
        count_exclamations(campaign['about']) / num_words
    )

12 0.0033812341504649195


### Count # of Apple adjectives and %

In [16]:
def count_apple_words(text):
    # Define a set of Apple adjectives
    apple_words = frozenset(
        ['revolutionary', 'breakthrough', 'beautiful', 'magical', 
        'gorgeous', 'amazing', 'incredible', 'awesome']
    )
    
    return sum(
        1 for word in get_words(text) if word in apple_words
    )

In [17]:
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_apple_words(campaign['about']),
        count_apple_words(campaign['about']) / num_words
    )

6 0.0016906170752324597
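
Note that the membership test in `count_apple_words` is case-sensitive, so capitalized adjectives at the start of a sentence or in all caps are not counted. The sketch below shows the difference on a made-up phrase, using a simple `split` in place of `get_words` (an assumption made to keep the example free of NLTK).

```python
apple_words = frozenset(
    ['revolutionary', 'breakthrough', 'beautiful', 'magical',
     'gorgeous', 'amazing', 'incredible', 'awesome']
)

# Made-up sample; the comma is replaced so split() yields clean tokens
tokens = 'Amazing battery life and a gorgeous, AWESOME display' \
    .replace(',', ' ').split()

# Exact matching, as in the notebook: only 'gorgeous' is counted
exact = sum(1 for word in tokens if word in apple_words)

# Case-folded matching also counts 'Amazing' and 'AWESOME'
folded = sum(1 for word in tokens if word.lower() in apple_words)

print(exact, folded)
# 1 3
```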


### Compute the average # of words per sentence

In [18]:
def compute_avg_words(text):
    # Compute the mean number of words in each sentence
    return pd.Series(
        [len(get_words(sentence)) for sentence in \
         get_sentences(text)]
    ).mean()

In [19]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_words(campaign['about']))

19.5


### Count the # of paragraphs

In [20]:
def count_paragraphs(soup, section):    
    # Use tree parsing to compute number of paragraphs
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('p'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('p'))

In [21]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_paragraphs(soup, 'about'))

133
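
The tag-counting functions in this section all repeat the same two `soup.find` lookups. As a rough sketch of how that lookup could be shared, the hypothetical helper below (`count_tags` and `SECTION_CLASSES` are names invented here) takes any BeautifulSoup object, finds the section container once, and returns `NaN` when the container is missing instead of raising `AttributeError`.

```python
import math

# CSS class strings copied from the notebook's functions
SECTION_CLASSES = {
    'about': 'full-description js-full-description responsive-media '
             'formatted-lists',
    'risks': 'mb3 mb10-sm mb3 js-risks',
}

def count_tags(soup, section, tag, **kwargs):
    # Hypothetical helper: locate the section container once, then count
    # the matching descendant tags; extra keyword arguments are passed
    # through to find_all (for example class_='video-player')
    container = soup.find('div', class_=SECTION_CLASSES[section])
    if container is None:
        return math.nan
    return len(container.find_all(tag, **kwargs))

# Possible usage with the notebook's soup object:
#   count_tags(soup, 'about', 'p')                          # paragraphs
#   count_tags(soup, 'about', 'img')                        # images
#   count_tags(soup, 'about', 'div', class_='video-player') # embedded videos
```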


### Compute the average # of sentences per paragraph

In [22]:
def compute_avg_sents_paragraph(soup, section):
    # Use tree parsing to identify all paragraphs
    if section == 'about':
        paragraphs = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('p')
    elif section == 'risks':
        paragraphs = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('p')
    
    # Compute the mean number of sentences in each paragraph
    return pd.Series(
        [len(get_sentences(paragraph.get_text(' '))) for paragraph in \
         paragraphs]
    ).mean()

In [23]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_sents_paragraph(soup, 'about'))

1.53383458647


### Compute the average # of words per paragraph

In [24]:
def compute_avg_words_paragraph(soup, section):
    # Use tree parsing to identify all paragraphs
    if section == 'about':
        paragraphs = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('p')
    elif section == 'risks':
        paragraphs = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('p')
    
    # Compute the mean number of words in each paragraph
    return pd.Series(
        [len(get_words(paragraph.get_text(' '))) for paragraph in paragraphs]
    ).mean()

In [25]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_words_paragraph(soup, 'about'))

21.5112781955


### Count # of images

In [26]:
def count_images(soup, section):    
    # Use tree parsing to compute number of images
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('img'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('img'))

In [27]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_images(soup, 'about'))

25


### Count # of embedded videos

In [28]:
def count_videos(soup, section):    
    # Use tree parsing to compute number of non-YouTube videos
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('div', class_='video-player'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('div', class_='video-player'))

In [29]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_videos(soup, 'about'))

0


### Count # of YouTube videos

In [30]:
def count_youtube(soup, section):    
    # Initialize total number of YouTube videos
    youtube_count = 0

    # Use tree parsing to select all iframe tags
    if section == 'about':
        iframes = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
            '-media formatted-lists'
        ).find_all('iframe')
    elif section == 'risks':
        iframes = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('iframe')
    
    # Since YouTube videos are contained in iframe tags, determine which
    # iframe tags contain YouTube videos and count them
    for iframe in iframes:
        try:
            if 'youtube' in iframe.get('src'):
                youtube_count += 1
        except TypeError:
            pass
    
    return youtube_count

In [31]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_youtube(soup, 'about'))

1


### Count # of GIFs

In [32]:
def count_gifs(soup, section):    
    # Initialize total number of GIFs
    gif_count = 0

    # Use tree parsing to select all image tags
    if section == 'about':
        images = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
            '-media formatted-lists'
        ).find_all('img')
    elif section == 'risks':
        images = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('img')
    
    # Since GIFs are contained in image tags, determine which image tags
    # contain GIFs and count them
    for image in images:
        try:
            if 'gif' in image.get('data-src'):
                gif_count += 1
        except TypeError:
            pass
    
    return gif_count

In [33]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_gifs(soup, 'about'))

0
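
The substring test `'gif' in image.get('data-src')` also matches URLs that merely contain those three letters somewhere. A stricter check, sketched below under the assumption that GIF URLs end in a `.gif` path, parses the URL and inspects only its path; `is_gif` and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_gif(url):
    # Hypothetical stricter check: True only when the URL path ends in
    # .gif, ignoring any query string; a missing attribute counts as False
    if not url:
        return False
    return urlparse(url).path.lower().endswith('.gif')

urls = [
    'https://example.com/assets/demo.gif?v=2',
    'https://example.com/gift-guide/photo.jpg',  # contains 'gif' but is a JPEG
    None,                                        # <img> without data-src
]
print([is_gif(url) for url in urls])
# [True, False, False]
```

Handling `None` inside the check would also make the `try`/`except TypeError` around the substring test unnecessary.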


### Count # of hyperlinks

In [34]:
def count_hyperlinks(soup, section):    
    # Use tree parsing to compute number of hyperlinks
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('a'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('a'))

In [35]:
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_hyperlinks(soup, 'about'))

35


### Count # of bold tags and compute %

In [36]:
def count_bolded(soup, section):    
    # Use tree parsing to compute number of bold tags
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
            ).find_all('b'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
            ).find_all('b'))

In [37]:
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_bolded(soup, 'about'),
        count_bolded(soup, 'about') / num_words
    )

57 0.016060862214708368


<a id="cell4"></a>
## 4. Extracting all features for a single project

Let's run all of the feature extraction functions on a project page.

In [38]:
# Extract all features for the given section. If the section isn't available,
# then return np.nan for each feature.
section = 'about'
if campaign[section] == 'section_not_found':
    print([np.nan] * 19)
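else:
    # Recompute the word count for this section so the ratios stay correct
    # when section is 'risks'; use NaN for an empty section to avoid
    # division by zero
    num_words = len(get_words(campaign[section])) or np.nan
    row = (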
        len(get_sentences(campaign[section])),
        len(get_words(campaign[section])),
        len(identify_allcaps(campaign[section])),
        len(identify_allcaps(campaign[section])) / num_words,
        count_exclamations(campaign[section]),
        count_exclamations(campaign[section]) / num_words,
        count_apple_words(campaign[section]),
        count_apple_words(campaign[section]) / num_words,
        compute_avg_words(campaign[section]),
        count_paragraphs(soup, section),
        compute_avg_sents_paragraph(soup, section),
        compute_avg_words_paragraph(soup, section),
        count_images(soup, section),
        count_videos(soup, section),
        count_youtube(soup, section),
        count_gifs(soup, section),
        count_hyperlinks(soup, section),
        count_bolded(soup, section),
        count_bolded(soup, section) / num_words
    )
    
    print(row)

(182, 3549, 26, 0.007326007326007326, 12, 0.0033812341504649195, 6, 0.0016906170752324597, 19.5, 133, 1.5338345864661653, 21.511278195488721, 25, 0, 1, 0, 35, 57, 0.016060862214708368)
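
A bare tuple of 19 numbers is hard to read. One way to make the result self-describing is to pair each value with a name, as sketched below. The feature names are hypothetical labels chosen for this example, and the values are copied from the output above.

```python
# Hypothetical feature names, listed in the same order as the tuple
feature_names = [
    'num_sents', 'num_words', 'num_all_caps', 'percent_all_caps',
    'num_exclms', 'percent_exclms', 'num_apple_words',
    'percent_apple_words', 'avg_words_per_sent', 'num_paragraphs',
    'avg_sents_per_paragraph', 'avg_words_per_paragraph', 'num_images',
    'num_videos', 'num_youtubes', 'num_gifs', 'num_hyperlinks',
    'num_bolded', 'percent_bolded',
]

# Values copied from the notebook output for the 'about' section
row = (182, 3549, 26, 0.007326007326007326, 12, 0.0033812341504649195, 6,
       0.0016906170752324597, 19.5, 133, 1.5338345864661653,
       21.511278195488721, 25, 0, 1, 0, 35, 57, 0.016060862214708368)

# Pair each name with its value
features = dict(zip(feature_names, row))

print(features['num_words'], features['avg_words_per_sent'])
# 3549 19.5
```

A dict like this can be passed to `pd.Series` or collected into a list of rows for a `pd.DataFrame` once more than one project is processed.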
