# Parsing Web Robots' Kickstarter data

Web Robots is a web scraping company that collects selected fields from Kickstarter project pages every month and publishes the data freely as a streaming JSON file.

## Goal

I'll be working with all available data from June 2017 and earlier so as to avoid any project that is still running. The goal is to extract the following features, then use the project hyperlinks to scrape the campaign pages.

- `name` - the project's name
- `category` - the project's category as defined by Kickstarter
- `hyperlink` - the project's URL
- `currency` - the type of currency used for fundraising, linked to `location`
- `pledged` - total amount of money pledged by backers over the course of the project
- `goal` - the funding goal set by the creator
- `location` - the creator's location information

In [1]:
# Load the required libraries
import pandas as pd
import numpy as np
import joblib  # standalone package; sklearn.externals.joblib was removed from newer scikit-learn
import json
import time

In [2]:
# Initialize an empty DataFrame with features of interest
df = pd.DataFrame(
    columns=[
        'name',
        'category',
        'hyperlink',
        'currency',
        'pledged',
        'goal',
        'location'
    ]
)

The raw Web Robots data can be found [here](https://webrobots.io/kickstarter-datasets/).

In [3]:
# Select the JSON Lines file containing the data
filename = 'Kickstarter_2017-06-15T22_20_03_059Z.json'

Since the data file is in JSON streaming format (i.e., each line represents a single project and its data stored as a single JSON object), we can read one line at a time, use the `json` library to decode the object into a Python `dict`, extract the features using indexing, and store the results in the DataFrame created above.
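
To make the per-line decoding concrete, here is a minimal sketch on a single made-up record. The sample values are hypothetical; only the key paths (`data`, `category` then `name`, `urls` then `web` then `project`, `location` then `displayable_name`) come from the extraction code in this notebook.

```python
import json

# A hypothetical line from the file; the values are invented for illustration.
sample_line = (
    '{"data": {"name": "Example Project", '
    '"category": {"name": "Art"}, '
    '"urls": {"web": {"project": "https://www.kickstarter.com/projects/example"}}, '
    '"currency": "USD", "pledged": 150.0, "goal": 100.0}}'
)

# Decode the JSON object into nested Python dicts
json_obj = json.loads(sample_line)

# Nested values are reached by chaining keys
name = json_obj['data']['name']
category = json_obj['data']['category']['name']
hyperlink = json_obj['data']['urls']['web']['project']

# A missing key raises KeyError, which is why each lookup is guarded
try:
    location = json_obj['data']['location']['displayable_name']
except KeyError:
    location = None
```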

In [4]:
# Record start time 
start = time.time()

# Open jsonlines file
with open(filename, encoding='utf8') as file:
    for index, line in enumerate(file):
        # Read each line and record features in a DataFrame
        json_obj = json.loads(line)
        
        try:
            df.loc[index, 'name'] = json_obj['data']['name']
        except KeyError:
            continue 
        
        try:
            df.loc[index, 'category'] = json_obj['data']['category']['name']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'hyperlink'] = \
                json_obj['data']['urls']['web']['project']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'currency'] = json_obj['data']['currency']
        except KeyError:
            continue
            
        try:
            df.loc[index, 'pledged'] = json_obj['data']['pledged']
        except KeyError:
            continue
            
        try:
            df.loc[index, 'goal'] = json_obj['data']['goal']
        except KeyError:
            continue
        
        try:
            df.loc[index, 'location'] = \
                json_obj['data']['location']['displayable_name']
        except KeyError:
            continue
            
# Report elapsed time in seconds
time.time() - start

19261.93027496338
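
The cell above took roughly 19,262 seconds (over five hours), most likely because assigning one cell at a time with `df.loc` forces pandas to grow the DataFrame row by row. A possible alternative, sketched here under the assumption that all records fit in memory, collects plain dicts in a list and builds the DataFrame once. The names `FEATURE_PATHS`, `extract_features`, and `load_projects` are hypothetical, and unlike the original loop this version records `None` for any missing feature instead of skipping the remaining ones.

```python
import io
import json

import pandas as pd

# Key paths for each feature, taken from the extraction loop above
FEATURE_PATHS = {
    'name': ('name',),
    'category': ('category', 'name'),
    'hyperlink': ('urls', 'web', 'project'),
    'currency': ('currency',),
    'pledged': ('pledged',),
    'goal': ('goal',),
    'location': ('location', 'displayable_name'),
}


def extract_features(json_obj):
    """Return a dict of features, with None where a key path is missing."""
    record = {}
    for feature, path in FEATURE_PATHS.items():
        value = json_obj
        try:
            for key in ('data',) + path:
                value = value[key]
        except (KeyError, TypeError):
            # TypeError covers an intermediate value that is null in the JSON
            value = None
        record[feature] = value
    return record


def load_projects(lines):
    """Decode an iterable of JSON lines and build the DataFrame in one step."""
    records = [extract_features(json.loads(line)) for line in lines if line.strip()]
    return pd.DataFrame(records, columns=list(FEATURE_PATHS))


# Two hypothetical lines standing in for the real file
demo = io.StringIO(
    '{"data": {"name": "A", "category": {"name": "Art"}, "pledged": 5, "goal": 10}}\n'
    '{"data": {"name": "B", "location": null, "pledged": 20, "goal": 10}}\n'
)
demo_df = load_projects(demo)
```

With the real file, the call would look like `with open(filename, encoding='utf8') as file: df = load_projects(file)`.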

Since the JSON data doesn't tell us whether a project was funded, we label a project as funded when `pledged > goal`. Note that the strict inequality leaves out any project whose pledges exactly equal its goal.

In [5]:
# Convert pledged and goal columns from strings to numeric variables
df['pledged'] = pd.to_numeric(df['pledged'])
df['goal'] = pd.to_numeric(df['goal'])

# Define a new column that determines whether the project was funded
df['funded'] = df['pledged'] > df['goal']
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 194818 entries, 0 to 194817
Data columns (total 8 columns):
name         194818 non-null object
category     194818 non-null object
hyperlink    194818 non-null object
currency     194818 non-null object
pledged      194818 non-null float64
goal         194818 non-null float64
location     194020 non-null object
funded       194818 non-null bool
dtypes: bool(1), float64(2), object(5)
memory usage: 17.1+ MB


In [6]:
df['funded'].value_counts()

False    113976
True      80842
Name: funded, dtype: int64
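
As a quick sanity check on these counts, the funded share can be worked out directly. This is simple arithmetic on the numbers printed above, not an additional result.

```python
# Counts copied from the value_counts() output above
funded = 80842
not_funded = 113976

# The total should match the 194818 rows reported by df.info()
total = funded + not_funded

# Fraction of projects labeled as funded, roughly 0.415 (about 41.5%)
funded_share = funded / total
```

On the DataFrame itself, `df['funded'].value_counts(normalize=True)` reports the same proportions.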

Dump the data as a pickle file.

In [7]:
joblib.dump(df, 'web_robots_data_to_06-2017')

['2017-09-10_full_extracted_data.pkl']
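
To read the data back in a later session, `joblib.load` reverses `joblib.dump`. The sketch below round-trips a tiny stand-in DataFrame through a temporary directory; the file name `example.pkl` and the sample rows are hypothetical.

```python
import os
import tempfile

import joblib
import pandas as pd

# A tiny stand-in for the extracted data
small_df = pd.DataFrame(
    {'name': ['A', 'B'], 'pledged': [5.0, 20.0], 'goal': [10.0, 10.0]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'example.pkl')
    # joblib.dump returns the list of file names it wrote
    written = joblib.dump(small_df, path)
    # joblib.load rebuilds the original object from that file
    restored = joblib.load(path)
```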