# Engineering meta features for a Kickstarter project

**Goal: Develop a strategy for engineering meta features from the content scraped from a Kickstarter project page.**

## Table of contents
1. [Loading scraping and extraction functions](#cell1)
2. [Normalizing the campaign sections](#cell2)
3. [Defining functions to compute meta features](#cell3)
4. [Extracting all meta features for a project](#cell4)

<a id="cell1"></a>
## 1. Loading scraping and extraction functions

In [1]:
# Load required libraries
import nltk
import requests
from bs4 import BeautifulSoup
import re
import lxml
import pandas as pd
import numpy as np
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23

Let's load the scraping, parsing, and extraction functions.

In [2]:
def scrape(hyperlink):
    # Scrape the website
    return requests.get(hyperlink)

def parse(scraped_html):
    # Parse the HTML content using an lxml parser
    return BeautifulSoup(scraped_html.text, 'lxml')

def clean_up(messy_text):
    # Remove line breaks, leading and trailing whitespace, and compress all
    # whitespace to a single space
    clean_text = ' '.join(messy_text.split())

    # Remove the HTML5 warning for videos
    clean_text = clean_text.replace(
        "You'll need an HTML5 capable browser to see this content. "
        "Play Replay with sound Play with sound 00:00 00:00",
        ''
    )

    # Compress whitespace again, since removing the warning can leave a double
    # space or a leading/trailing space behind
    return ' '.join(clean_text.split())

def get_campaign(soup):
    # Collect the "About this project" section if available
    try:
        section1 = soup.find(
            'div',
            class_='full-description js-full-description responsive-media ' + \
                'formatted-lists'
        ).get_text(' ')
    except AttributeError:
        section1 = 'section_not_found'
    
    # Collect the "Risks and challenges" section if available, and remove all
    # unnecessary text
    try:
        section2 = soup.find(
            'div', 
            class_='mb3 mb10-sm mb3 js-risks'
        ) \
            .get_text(' ') \
            .replace('Risks and challenges', '') \
            .replace('Learn about accountability on Kickstarter', '')
    except AttributeError:
        section2 = 'section_not_found'
    
    # Clean both sections and return them in a dictionary
    return {'about': clean_up(section1), 'risks': clean_up(section2)}

Let's begin by selecting a project page and its URL.

In [3]:
# Select a test hyperlink
hyperlink = 'https://www.kickstarter.com/projects/1385294316/help-me-start' + \
    '-my-cottage-industry-bakesalecom?ref=category_newest'
scraped_html = scrape(hyperlink)

Next, let's parse the HTML and extract the campaign sections.

In [4]:
# Parse the scraped HTML and collect the campaign sections
soup = parse(scraped_html)
campaign = get_campaign(soup)

<a id="cell2"></a>
## 2. Normalizing the campaign sections

Some projects contain email addresses, phone numbers, URLs, money amounts, percentages, or plain numbers. Let's replace them with a tag so they aren't identified as unique words and don't inflate the word count.

In [5]:
def normalize(text):
    # Tag email addresses
    normalized = re.sub(
        r'\b[\w\-.]+?@\w+?\.\w{2,4}\b',
        'emailaddr',
        text
    )
    
    # Tag hyperlinks
    normalized = re.sub(
        r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)',
        'httpaddr',
        normalized
    )
    
    # Tag money amounts
    normalized = re.sub(r'\$\d+(\.\d+)?', 'dollramt', normalized)
    
    # Tag percentages
    normalized = re.sub(r'\d+(\.\d+)?\%', 'percntg', normalized)
    
    # Tag phone numbers
    normalized = re.sub(
        r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
        'phonenumbr',
        normalized
    )
    
    # Tag plain numbers
    return re.sub(r'\d+(\.\d+)?', 'numbr', normalized)
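
The order of the substitutions matters: email addresses are tagged before URLs, and money amounts and percentages before plain numbers. The sketch below applies the same patterns in the same order to a made-up sentence (the sample text and the `tag_all` helper are illustrative assumptions, not part of the notebook), then shows what happens to a dollar amount if plain numbers are tagged first.

```python
import re

# Patterns copied from normalize(), applied in the same order. Each tuple is
# (regular expression, replacement tag).
TAG_PATTERNS = [
    (r'\b[\w\-.]+?@\w+?\.\w{2,4}\b', 'emailaddr'),
    (r'(http[s]?\S+)|(\w+\.[A-Za-z]{2,4}\S*)', 'httpaddr'),
    (r'\$\d+(\.\d+)?', 'dollramt'),
    (r'\d+(\.\d+)?\%', 'percntg'),
    (r'\b(\+\d{1,2}\s)?\d?[\-(.]?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b',
     'phonenumbr'),
    (r'\d+(\.\d+)?', 'numbr'),
]

def tag_all(text):
    # Hypothetical helper: run every substitution in sequence
    for pattern, tag in TAG_PATTERNS:
        text = re.sub(pattern, tag, text)
    return text

sample = ('Email me at jane.doe@example.com or visit '
          'https://example.org/shop for details. '
          'Pledge $25.50 to get 10% off; only 300 left!')
tagged = tag_all(sample)
print(tagged)
# Email me at emailaddr or visit httpaddr for details. Pledge dollramt to
# get percntg off; only numbr left!  (printed on one line)

# Tagging plain numbers first strips the digits but leaves the "$" behind
numbers_first = re.sub(r'\d+(\.\d+)?', 'numbr', 'Pledge $25.50')
print(numbers_first)  # Pledge $numbr
```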

Next, let's normalize each campaign section.

In [6]:
# Normalize campaign sections
campaign['about'] = normalize(campaign['about'])
campaign['risks'] = normalize(campaign['risks'])

<a id="cell3"></a>
## 3. Defining functions to compute meta features

Let's define a function for each meta feature we want to extract and test it on the campaign's "About this project" section. If a campaign section is missing at any time, we'll assign `NaN` to every feature for that section to flag it.
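
The same guard against the `'section_not_found'` sentinel recurs throughout this section. As a minimal sketch, it could be expressed once as a helper; `feature_or_nan` is a hypothetical name, and `math.nan` stands in for `np.nan` here (both are ordinary float NaN values).

```python
import math

# Sentinel string that get_campaign() stores when a section is missing
SECTION_NOT_FOUND = 'section_not_found'

def feature_or_nan(text, feature_func):
    # Hypothetical helper: return NaN for a missing section, otherwise the
    # value computed by feature_func
    if text == SECTION_NOT_FOUND:
        return math.nan
    return feature_func(text)

def count_marks(text):
    # Toy feature used only for this illustration
    return text.count('!')

found = feature_or_nan('Back us today!', count_marks)
missing = feature_or_nan(SECTION_NOT_FOUND, count_marks)
print(found, missing)  # 1 nan
```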

### Count # of sentences

In [7]:
def get_sentences(text):
    # Tokenize text into sentences and return them in a list
    return nltk.sent_tokenize(text)

In [8]:
# If the campaign section is missing, assign NaN to 'num_sents', otherwise
# count and display the number of sentences
if campaign['about'] == 'section_not_found':
    num_sents = np.nan
else:
    num_sents = len(get_sentences(campaign['about']))
num_sents

15

### Count # of words

In [9]:
def remove_punc(text):
    # Return text with punctuation removed
    return re.sub(r'[^\w\d\s]|\_', ' ', text)

In [10]:
def get_words(text):
    # Tokenize text into words and return them in a list
    return nltk.word_tokenize(remove_punc(text))
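
To see what punctuation removal does to the word count, here is a small sketch on a made-up sentence. It uses `str.split()` as a simple stand-in for `nltk.word_tokenize`, which is an assumption made to keep the example dependency-free; the real tokenizer may behave differently on other inputs.

```python
import re

def remove_punc(text):
    # Same pattern as above: replace punctuation and underscores with spaces
    return re.sub(r'[^\w\d\s]|\_', ' ', text)

sample = "Don't wait: back_us today!"
tokens = remove_punc(sample).split()  # stand-in for nltk.word_tokenize
print(tokens)  # ['Don', 't', 'wait', 'back', 'us', 'today']
```

Note that the contraction is split into two tokens because its apostrophe is replaced by a space, so contractions count as two words under this scheme.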

In [11]:
# If the campaign section is missing, assign NaN to 'num_words', otherwise
# count and display the number of words
if campaign['about'] == 'section_not_found':
    num_words = np.nan
else:
    num_words = len(get_words(campaign['about']))

# If the section contains no words, assign NaN to num_words to catch potential
# division by zero errors
if num_words == 0:
    num_words = np.nan 
num_words

223
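
The reason for swapping a zero word count for `NaN` can be shown in a few lines. This sketch uses `math.nan` in place of `np.nan` (both are plain float NaN values) and a made-up exclamation count.

```python
import math

exclamations = 3

# With a zero word count, the percentage cannot be computed at all
try:
    exclamations / 0
    outcome = 'no error'
except ZeroDivisionError:
    outcome = 'ZeroDivisionError'

# With NaN as the word count, the division succeeds and propagates NaN
ratio = exclamations / math.nan
print(outcome, ratio)  # ZeroDivisionError nan
```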

### Count # of all-caps words and compute %

In [12]:
def identify_allcaps(text):
    # Return a list of all-caps words (two or more consecutive capital
    # letters with a word boundary on both sides)
    return re.findall(r'\b[A-Z]{2,}\b', text)

In [13]:
# Display 'NaN' if the section isn't found, otherwise display the # of all-caps
# words and its percentage
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        len(identify_allcaps(campaign['about'])),
        len(identify_allcaps(campaign['about'])) / num_words
    )

0 0.0
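
A small sketch of how a trailing word boundary changes what counts as an all-caps word. The sample sentence is made up; the two patterns differ only in the final `\b`.

```python
import re

sample = 'NASA and the FBI tested USBsticks and an iPhone'

# Without a trailing boundary, the capitalized prefix of a longer word matches
prefix_matches = re.findall(r'\b[A-Z]{2,}', sample)

# With a trailing boundary, only words made up entirely of capitals match
whole_word_matches = re.findall(r'\b[A-Z]{2,}\b', sample)

print(prefix_matches)      # ['NASA', 'FBI', 'USB']
print(whole_word_matches)  # ['NASA', 'FBI']
```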


### Count # of exclamation marks and compute %

In [14]:
def count_exclamations(text):
    # Count the number of exclamation marks in the text
    return text.count('!')

In [15]:
# Display 'NaN' if the section isn't found, otherwise display the # of ! marks
# and its percentage
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_exclamations(campaign['about']),
        count_exclamations(campaign['about']) / num_words
    )

3 0.013452914798206279


### Count # of Apple adjectives and compute %

In [16]:
def count_apple_words(text):
    # Define a set of Apple adjectives
    apple_words = frozenset(
        ['revolutionary', 'breakthrough', 'beautiful', 'magical', 
        'gorgeous', 'amazing', 'incredible', 'awesome']
    )
    
    # Count total number of Apple adjectives in the text, ignoring case
    return sum(1 for word in get_words(text) if word.lower() in apple_words)

In [17]:
# Display 'NaN' if the section isn't found, otherwise display the # of Apple
# adjectives and its percentage
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_apple_words(campaign['about']),
        count_apple_words(campaign['about']) / num_words
    )

0 0.0
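
Because the adjective set is stored in lowercase, letter case affects the count. The sketch below compares exact matching with case-insensitive matching on a made-up phrase, using `str.split()` as a stand-in for the tokenizer.

```python
APPLE_WORDS = frozenset(
    ['revolutionary', 'breakthrough', 'beautiful', 'magical',
     'gorgeous', 'amazing', 'incredible', 'awesome']
)

tokens = 'Amazing bread with an awesome AWESOME crust'.split()

# Exact matching only finds the lowercase occurrence
exact = sum(1 for word in tokens if word in APPLE_WORDS)

# Lowercasing each token first also finds 'Amazing' and 'AWESOME'
ignoring_case = sum(1 for word in tokens if word.lower() in APPLE_WORDS)

print(exact, ignoring_case)  # 1 3
```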


### Compute the average # of words per sentence

In [18]:
def compute_avg_words(text):
    # Compute the average number of words in each sentence
    return pd.Series(
        [len(get_words(sentence)) for sentence in \
         get_sentences(text)]
    ).mean()

In [19]:
# Display 'NaN' if the section isn't found, otherwise display the average # of
# words per sentence
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_words(campaign['about']))

14.8666666667
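
The average is simply the mean of the per-sentence word counts. Here is a dependency-free sketch of that calculation on a made-up text; `split_sentences` and `count_words` are naive, hypothetical stand-ins for `get_sentences` and `get_words`, so they will not match NLTK on harder inputs such as abbreviations.

```python
import re
from statistics import mean

def split_sentences(text):
    # Naive stand-in for nltk.sent_tokenize: split after '.', '!' or '?'
    return re.split(r'(?<=[.!?])\s+', text.strip())

def count_words(sentence):
    # Stand-in for len(get_words(sentence)): drop punctuation, then split
    return len(re.sub(r'[^\w\s]|_', ' ', sentence).split())

sample = 'We bake. Every loaf is made by hand! Will you help us grow?'
lengths = [count_words(sentence) for sentence in split_sentences(sample)]
average = mean(lengths)
print(lengths)  # [2, 6, 5]
```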


### Count the # of paragraphs

In [20]:
def count_paragraphs(soup, section):    
    # Use tree parsing to compute number of paragraphs
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('p'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('p'))

In [21]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# paragraphs
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_paragraphs(soup, 'about'))

9
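
The `find(...).find_all(tag)` pattern counts every matching tag nested anywhere inside one container. To illustrate the idea without the scraped page, here is a sketch that uses the standard library's `html.parser` instead of BeautifulSoup. The `SectionTagCounter` class and the sample HTML are hypothetical, and unlike the lookup above it matches on a single class name rather than the full class string.

```python
from html.parser import HTMLParser

class SectionTagCounter(HTMLParser):
    """Count the tags nested inside the first <div> that has a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0         # <div> nesting depth inside the target section
        self.finished = False  # set once the target section has been closed
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        if self.finished:
            return
        if self.depth == 0:
            # Still outside the section: only look for its opening <div>
            classes = (dict(attrs).get('class') or '').split()
            if tag == 'div' and self.target_class in classes:
                self.depth = 1
            return
        if tag == 'div':
            self.depth += 1
        self.counts[tag] = self.counts.get(tag, 0) + 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag == 'div':
            self.depth -= 1
            if self.depth == 0:
                self.finished = True

sample_html = '''
<div class="full-description js-full-description">
  <p>First paragraph with a <a href="#">link</a>.</p>
  <div class="template"><img src="a.png"><p>Second paragraph.</p></div>
  <p>Third paragraph.</p>
</div>
<p>Outside the section.</p>
'''

counter = SectionTagCounter('js-full-description')
counter.feed(sample_html)
print(counter.counts['p'], counter.counts['img'], counter.counts['a'])  # 3 1 1
```

The paragraph nested inside the inner `div` is counted, while the paragraph outside the section is not.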


### Compute the average # of sentences per paragraph

In [22]:
def compute_avg_sents_paragraph(soup, section):
    # Use tree parsing to identify all paragraphs
    if section == 'about':
        paragraphs = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('p')
    elif section == 'risks':
        paragraphs = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('p')
    
    # Compute the average number of sentences in each paragraph
    return pd.Series(
        [len(get_sentences(paragraph.get_text(' '))) for paragraph in \
         paragraphs]
    ).mean()

In [23]:
# Display 'NaN' if the section isn't found, otherwise display the average # of
# sentences per paragraph
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_sents_paragraph(soup, 'about'))

1.88888888889


### Compute the average # of words per paragraph

In [24]:
def compute_avg_words_paragraph(soup, section):
    # Use tree parsing to identify all paragraphs
    if section == 'about':
        paragraphs = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('p')
    elif section == 'risks':
        paragraphs = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('p')
    
    # Compute the average number of words in each paragraph
    return pd.Series(
        [len(get_words(paragraph.get_text(' '))) for paragraph in paragraphs]
    ).mean()

In [25]:
# Display 'NaN' if the section isn't found, otherwise display the average # of
# words per paragraph
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(compute_avg_words_paragraph(soup, 'about'))

24.7777777778


### Count # of images

In [26]:
def count_images(soup, section):    
    # Use tree parsing to compute number of images
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('img'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('img'))

In [27]:
# Display 'NaN' if the section isn't found, otherwise display the # of images
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_images(soup, 'about'))

0


### Count # of embedded videos

In [28]:
def count_videos(soup, section):    
    # Use tree parsing to compute number of non-YouTube videos
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('div', class_='video-player'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('div', class_='video-player'))

In [29]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# non-YouTube videos
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_videos(soup, 'about'))

0


### Count # of YouTube videos

In [30]:
def count_youtube(soup, section):    
    # Initialize total number of YouTube videos
    youtube_count = 0

    # Use tree parsing to select all iframe tags
    if section == 'about':
        iframes = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
            '-media formatted-lists'
        ).find_all('iframe')
    elif section == 'risks':
        iframes = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('iframe')
    
    # Since YouTube videos are contained only in iframe tags, determine which
    # iframe tags contain YouTube videos and count them
    for iframe in iframes:
        # Catch any iframes that fail to include a YouTube source link
        try:
            if 'youtube' in iframe.get('src'):
                youtube_count += 1
        except TypeError:
            pass
    
    return youtube_count

In [31]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# YouTube videos
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_youtube(soup, 'about'))

0
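
The `TypeError` handler exists because `.get('src')` returns `None` when the attribute is absent, and `'youtube' in None` is not a valid test. The sketch below reproduces that with plain dictionaries standing in for iframe tags (an assumption made for illustration; their `.get` also returns `None` for a missing key) and shows an alternative that substitutes an empty string.

```python
# Dictionaries stand in for iframe tags; the URLs are made up
iframes = [
    {'src': 'https://www.youtube.com/embed/abc123'},
    {'src': 'https://player.vimeo.com/video/1'},
    {},  # an iframe without a src attribute
]

# Testing membership against None raises a TypeError
try:
    'youtube' in iframes[2].get('src')
    missing_src = 'no error'
except TypeError:
    missing_src = 'TypeError'

# Falling back to an empty string avoids the exception altogether
youtube_count = sum(
    1 for iframe in iframes if 'youtube' in (iframe.get('src') or '')
)
print(missing_src, youtube_count)  # TypeError 1
```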


### Count # of GIFs

In [32]:
def count_gifs(soup, section):    
    # Initialize total number of GIFs
    gif_count = 0

    # Use tree parsing to select all image tags
    if section == 'about':
        images = soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
            '-media formatted-lists'
        ).find_all('img')
    elif section == 'risks':
        images = soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('img')
    
    # Since GIFs are contained in image tags, determine which image tags
    # contain GIFs and count them
    for image in images:
        # Catch any images that fail to include an image source link
        try:
            if 'gif' in image.get('data-src'):
                gif_count += 1
        except TypeError:
            pass
    
    return gif_count

In [33]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# GIFs
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_gifs(soup, 'about'))

0
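
A substring test for `'gif'` is case-sensitive and also matches file names that merely contain those letters. As a hedged alternative, the sketch below checks the file extension of the URL path instead; `is_gif` and both URLs are hypothetical.

```python
from urllib.parse import urlparse

def is_gif(url):
    # Hypothetical helper: compare the extension of the URL path, ignoring
    # case and any query string
    return urlparse(url).path.lower().endswith('.gif')

urls = [
    'https://example.com/assets/demo.GIF?w=680',
    'https://example.com/assets/gift-box.png',
]

substring_hits = ['gif' in url for url in urls]
extension_hits = [is_gif(url) for url in urls]
print(substring_hits)  # [False, True]
print(extension_hits)  # [True, False]
```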


### Count # of hyperlinks

In [34]:
def count_hyperlinks(soup, section):    
    # Use tree parsing to compute number of hyperlinks
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('a'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('a'))

In [35]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# hyperlinks
if campaign['about'] == 'section_not_found':
    print(np.nan)
else:
    print(count_hyperlinks(soup, 'about'))

0


### Count # of bolded text tags and compute %

In [36]:
def count_bolded(soup, section):    
    # Use tree parsing to compute number of bolded text tags
    if section == 'about':
        return len(soup.find(
            'div',
            class_='full-description js-full-description responsive' + \
                '-media formatted-lists'
        ).find_all('b'))
    elif section == 'risks':
        return len(soup.find(
            'div',
            class_='mb3 mb10-sm mb3 js-risks'
        ).find_all('b'))

In [37]:
# Display 'NaN' if the section isn't found, otherwise display the # of
# bolded text tags and its percentage
if campaign['about'] == 'section_not_found':
    print(np.nan, np.nan)
else:
    print(
        count_bolded(soup, 'about'),
        count_bolded(soup, 'about') / num_words
    )

0 0.0


<a id="cell4"></a>
## 4. Extracting all meta features for a project

Let's run every meta feature extraction function on a project page and build a feature vector for that project that contains both the meta features and the normalized text.
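
A 20-position vector is easier to read when each position has a name. The sketch below pairs values with labels in the order the features are extracted; apart from `num_sents` and `num_words`, every name in `FEATURE_NAMES` is a hypothetical label chosen for illustration, as is the `label_features` helper.

```python
import math

# Labels follow the order of extraction: 19 meta features plus the text
FEATURE_NAMES = [
    'num_sents', 'num_words', 'num_all_caps', 'percent_all_caps',
    'num_exclms', 'percent_exclms', 'num_apple_words', 'percent_apple_words',
    'avg_words_per_sent', 'num_paragraphs', 'avg_sents_per_paragraph',
    'avg_words_per_paragraph', 'num_images', 'num_videos', 'num_youtubes',
    'num_gifs', 'num_hyperlinks', 'num_bolded', 'percent_bolded',
    'normalized_text',
]

def label_features(row):
    # Hypothetical helper: refuse rows of the wrong length, then pair each
    # value with its label
    if len(row) != len(FEATURE_NAMES):
        raise ValueError('expected %d values' % len(FEATURE_NAMES))
    return dict(zip(FEATURE_NAMES, row))

# A missing section yields NaN in every position
missing_section = label_features([math.nan] * len(FEATURE_NAMES))
print(len(missing_section))  # 20
```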

In [38]:
# Extract all features for the given section. If the section isn't 
# available, then return 'NaN' for each feature.
section = 'about'
if campaign[section] == 'section_not_found':
    print([np.nan] * 20)
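else:
    # Recompute the word count for the selected section so the percentages
    # below don't reuse a count taken from a different section; fall back to
    # NaN for an empty section to avoid division by zero
    num_words = len(get_words(campaign[section])) or np.nan
    row = (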
        len(get_sentences(campaign[section])),
        len(get_words(campaign[section])),
        len(identify_allcaps(campaign[section])),
        len(identify_allcaps(campaign[section])) / num_words,
        count_exclamations(campaign[section]),
        count_exclamations(campaign[section]) / num_words,
        count_apple_words(campaign[section]),
        count_apple_words(campaign[section]) / num_words,
        compute_avg_words(campaign[section]),
        count_paragraphs(soup, section),
        compute_avg_sents_paragraph(soup, section),
        compute_avg_words_paragraph(soup, section),
        count_images(soup, section),
        count_videos(soup, section),
        count_youtube(soup, section),
        count_gifs(soup, section),
        count_hyperlinks(soup, section),
        count_bolded(soup, section),
        count_bolded(soup, section) / num_words,
        campaign[section]
    )
    
    print(pd.Series(row))

0                                                    15
1                                                   223
2                                                     0
3                                                     0
4                                                     3
5                                             0.0134529
6                                                     0
7                                                     0
8                                               14.8667
9                                                     9
10                                              1.88889
11                                              24.7778
12                                                    0
13                                                    0
14                                                    0
15                                                    0
16                                                    0
17                                              