# Parsing the Web Robots database

**Goal: Load the Web Robots database for Kickstarter projects, parse it to extract relevant information, and save the results in a table.**

Since Kickstarter projects are given a maximum of 60 days for fundraising, I chose to parse only the database snapshot containing projects from June 2017 and earlier, and to collect the following pieces of data for every project page.

- `name` - project's title
- `category` - project's category as set by Kickstarter
- `hyperlink` - project's web page URL
- `currency` - type of currency used for fundraising
- `pledged` - total amount of money pledged by backers over the course of the project
- `goal` - funding goal set by the creator
- `location` - creator's location information

The Web Robots database can be found [here](https://webrobots.io/kickstarter-datasets/).

In [1]:
# Load required libraries
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import json
import time

In [2]:
# Initialize an empty DataFrame with labels of data to be collected
df = pd.DataFrame(
    columns=['name', 'category', 'hyperlink', 'currency', 'pledged', 'goal',
             'location']
)

In [3]:
# Select JSON streaming file containing Web Robots data
filename = 'data/Kickstarter_2017-06-15T22_20_03_059Z.json'

Since the database is in JSON streaming format (i.e., each line represents a single project, with its data stored as an individual JSON object), let's read in one line at a time, decode the JSON object, and store it in a dictionary. Next, we'll extract the data by indexing into the dictionary and store it in a DataFrame.
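To make the nested indexing concrete, here is a minimal sketch that decodes a single line. The sample record is invented for illustration and only mirrors the key layout used in this notebook; real Web Robots records contain many more fields.

```python
import json

# Hypothetical one-line record mimicking the nesting used in this notebook
line = ('{"data": {"name": "Example project", '
        '"category": {"name": "Art"}, '
        '"urls": {"web": {"project": "https://example.com/project"}}, '
        '"currency": "USD", "pledged": 120.0, "goal": 100.0, '
        '"location": {"displayable_name": "Boston, MA"}}}')

# Decode the JSON object into nested dictionaries
record = json.loads(line)

# Walk the nested keys to reach individual values
name = record['data']['name']
category = record['data']['category']['name']
hyperlink = record['data']['urls']['web']['project']
print(name, category, hyperlink)
```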

In [4]:
# Record start time
start = time.time()

# Open JSON streaming file
with open(filename, encoding='utf8') as file:
    for index, line in enumerate(file):
        # Read each line and record data in a dictionary
        json_obj = json.loads(line)
        
        # Catch any typos or missing keys that raise a KeyError; `continue`
        # skips the remaining fields of that record
        try:
            df.loc[index, 'name'] = json_obj['data']['name']
        except KeyError:
            continue 
        
        try:
            df.loc[index, 'category'] = json_obj['data']['category']['name']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'hyperlink'] = \
                json_obj['data']['urls']['web']['project']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'currency'] = json_obj['data']['currency']
        except KeyError:
            continue
            
        try:
            df.loc[index, 'pledged'] = json_obj['data']['pledged']
        except KeyError:
            continue
            
        try:
            df.loc[index, 'goal'] = json_obj['data']['goal']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'location'] = \
                json_obj['data']['location']['displayable_name']
        except KeyError:
            continue
            
# Report elapsed time in seconds
time.time() - start

19261.93027496338
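The elapsed time above works out to about 5.35 hours. Much of that is likely spent growing the DataFrame one cell at a time with `df.loc`. A common alternative is to collect each record as a plain dictionary and build the DataFrame once at the end. The sketch below is not the code that produced the results in this notebook: the helper names `extract_fields` and `parse_stream` are hypothetical, the two sample records are invented, and unlike the loop above it keeps a record with missing keys and stores `None` for those fields.

```python
import io
import json

import pandas as pd

# Key paths for each column, following the nesting used in this notebook
FIELDS = {
    'name': ('data', 'name'),
    'category': ('data', 'category', 'name'),
    'hyperlink': ('data', 'urls', 'web', 'project'),
    'currency': ('data', 'currency'),
    'pledged': ('data', 'pledged'),
    'goal': ('data', 'goal'),
    'location': ('data', 'location', 'displayable_name'),
}


def extract_fields(json_obj):
    """Return one row as a dict; missing or null paths become None."""
    row = {}
    for column, path in FIELDS.items():
        value = json_obj
        try:
            for key in path:
                value = value[key]
        except (KeyError, TypeError):
            value = None
        row[column] = value
    return row


def parse_stream(lines):
    """Build a DataFrame from an iterable of JSON lines in one step."""
    rows = [extract_fields(json.loads(line)) for line in lines if line.strip()]
    return pd.DataFrame(rows, columns=list(FIELDS))


# Two invented records stand in for the real file
sample = io.StringIO(
    '{"data": {"name": "A", "category": {"name": "Art"}, '
    '"urls": {"web": {"project": "https://example.com/a"}}, '
    '"currency": "USD", "pledged": 120.0, "goal": 100.0, '
    '"location": {"displayable_name": "Boston, MA"}}}\n'
    '{"data": {"name": "B", "category": {"name": "Games"}, '
    '"urls": {"web": {"project": "https://example.com/b"}}, '
    '"currency": "GBP", "pledged": 5.0, "goal": 50.0, "location": null}}\n'
)
sample_df = parse_stream(sample)
print(sample_df)
```

With the real file, the same function could be called as `parse_stream(file)` inside the `with open(filename, encoding='utf8') as file:` block.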

Since the Web Robots database doesn't tell us whether a project was funded or not, let's define this ourselves as a Boolean that is true when `pledged > goal`.

In [5]:
# Convert 'pledged' and 'goal' columns from strings to numeric variables
df['pledged'] = pd.to_numeric(df['pledged'])
df['goal'] = pd.to_numeric(df['goal'])

# Define a new column called 'funded' that identifies whether the project was 
# funded or not
df['funded'] = df['pledged'] > df['goal']

# Display collected data information
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194818 entries, 0 to 194817
Data columns (total 8 columns):
name         194818 non-null object
category     194818 non-null object
hyperlink    194818 non-null object
currency     194818 non-null object
pledged      194818 non-null float64
goal         194818 non-null float64
location     194020 non-null object
funded       194818 non-null bool
dtypes: bool(1), float64(2), object(5)
memory usage: 17.1+ MB
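As a small illustration of the comparison in the cell above, the sketch below applies it to three invented rows. Note that with a strict `>`, a project that raises exactly its goal is labelled as not funded, whereas `>=` would count it as funded. The values are made up and are not taken from the dataset.

```python
import pandas as pd

# Invented amounts: above the goal, exactly at the goal, below the goal
toy = pd.DataFrame({'pledged': ['150', '100', '40'],
                    'goal': ['100', '100', '100']})

# Convert to numeric, as in the cell above
toy['pledged'] = pd.to_numeric(toy['pledged'])
toy['goal'] = pd.to_numeric(toy['goal'])

# Strict comparison, matching the definition used in this notebook
toy['funded'] = toy['pledged'] > toy['goal']

# Inclusive comparison, shown only for contrast
toy['funded_inclusive'] = toy['pledged'] >= toy['goal']
print(toy)
```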


Finally, let's save the data.

In [7]:
# Serialize the collected data
joblib.dump(df, 'web_robots_data_to_06-2017')

['2017-09-10_full_extracted_data.pkl']
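
To read the table back in a later session, `joblib.load` can be called with the path that was passed to `joblib.dump`. The sketch below round-trips a small invented DataFrame through a temporary file so that it runs on its own; the file name `example.pkl` is a placeholder, and it assumes the standalone `joblib` package is installed.

```python
import os
import tempfile

import joblib
import pandas as pd

# Small invented table standing in for the collected data
example = pd.DataFrame({'name': ['A', 'B'], 'funded': [True, False]})

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.pkl')  # placeholder file name

    # dump returns the list of file names the data was written to
    written = joblib.dump(example, path)

    # load restores the original object
    restored = joblib.load(path)

print(restored.equals(example))
```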