**Load the Web Robots database for Kickstarter projects, parse it to extract relevant information, and save the results in a table.**

**Table Contents**

**name** - project's title

**category** - project's category as set by Kickstarter

**hyperlink** - project's web page URL

**currency** - type of currency used for fundraising

**pledged** - total amount of money pledged by backers over the course of the project

**goal** - funding goal set by the creator

**location** - creator's location information

In [1]:
import pandas as pd
import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23
import json
import time

In [2]:
# Creating an empty DataFrame with labels of data to be collected
df_project = pd.DataFrame(
    columns=['name', 'category', 'hyperlink', 'currency', 'pledged', 'goal',
             'location']
)

In [4]:
# Path to the file containing Web Robots data (one JSON object per line)
filename = '/Users/shwetapai/Downloads/test5k.json'

In [7]:
with open(filename, encoding='utf8') as file:
    for index, line in enumerate(file):
        # Read each line and record data in a dictionary
        json_obj = json.loads(line)
json_obj

{'table_id': 'Kickstarter',
 'robot_id': 'Kickstarter',
 'run_id': 'Kickstarter_2018-08-16T03_20_13_856Z',
 'data': {'id': 245053912,
  'photo': {'key': 'assets/016/825/284/f207e0d059e52c9bb9d165fd1b995cf6_original.jpg',
   'full': 'https://ksr-ugc.imgix.net/assets/016/825/284/f207e0d059e52c9bb9d165fd1b995cf6_original.jpg?crop=faces&w=560&h=315&fit=crop&v=1496243693&auto=format&q=92&s=acf2d45e7eedc05fdea13838cb68352a',
   'ed': 'https://ksr-ugc.imgix.net/assets/016/825/284/f207e0d059e52c9bb9d165fd1b995cf6_original.jpg?crop=faces&w=352&h=198&fit=crop&v=1496243693&auto=format&q=92&s=eb03a2923edd4d8daf928f7f5294705d',
   'med': 'https://ksr-ugc.imgix.net/assets/016/825/284/f207e0d059e52c9bb9d165fd1b995cf6_original.jpg?crop=faces&w=272&h=153&fit=crop&v=1496243693&auto=format&q=92&s=a8bffc489f40a15c20eb8876862958df',
   'little': 'https://ksr-ugc.imgix.net/assets/016/825/284/f207e0d059e52c9bb9d165fd1b995cf6_original.jpg?crop=faces&w=208&h=117&fit=crop&v=1496243693&auto=format&q=92&s=99aa083

We read the file one line at a time and decode each line's JSON object into a dictionary. Next, we extract the fields of interest by indexing into that dictionary and store them in a DataFrame.
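As a minimal sketch of the line-by-line decoding described above, the snippet below parses two made-up JSON lines from an in-memory buffer instead of the real Web Robots file. The sample records and the names `sample_lines`, `records` and `df_sample` are assumptions for illustration; only the nested key layout mirrors the notebook.

```python
import io
import json

import pandas as pd

# Two made-up records standing in for lines of the Web Robots file (assumption)
sample_lines = io.StringIO(
    '{"data": {"name": "Project A", "category": {"name": "Art"}}}\n'
    '{"data": {"name": "Project B", "category": {"name": "Games"}}}\n'
)

records = []
for line in sample_lines:
    # Each line holds one complete JSON document
    json_obj = json.loads(line)
    records.append({
        'name': json_obj['data']['name'],
        'category': json_obj['data']['category']['name'],
    })

# Build the DataFrame once from the collected rows
df_sample = pd.DataFrame(records, columns=['name', 'category'])
print(df_sample)
```

Collecting rows in a list and building the DataFrame once is usually faster than growing it cell by cell with `.loc`, though this was not timed against the notebook's data.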

In [8]:
# Record start time
start = time.time()

# Open JSON streaming file
with open(filename, encoding='utf8') as file:
    for index, line in enumerate(file):
        # Read each line and record data in a dictionary
        json_obj = json.loads(line)
        
        # Catch any potential typos or missing keys that can raise a KeyError.
        # Note: `continue` moves on to the next line of the file, so the
        # remaining fields of the current record are left unset.
        try:
            df_project.loc[index, 'name'] = json_obj['data']['name']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'category'] = \
                json_obj['data']['category']['name']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'hyperlink'] = \
                json_obj['data']['urls']['web']['project']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'currency'] = json_obj['data']['currency']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'pledged'] = json_obj['data']['pledged']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'goal'] = json_obj['data']['goal']
        except KeyError:
            continue

        try:
            df_project.loc[index, 'location'] = \
                json_obj['data']['location']['displayable_name']
        except KeyError:
            continue
            
# Report elapsed time in seconds
time.time() - start

12.326646089553833
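One thing to be aware of in the loop above: `continue` jumps to the next line of the file, so a record missing an early key also loses its later fields. A possible alternative, sketched here with a hypothetical `get_nested` helper (not part of the notebook or of any library), returns `None` for a missing path so that each field is looked up independently. The sample record is made up.

```python
def get_nested(obj, *keys):
    """Follow keys through nested dicts; return None if any step is missing."""
    for key in keys:
        if not isinstance(obj, dict) or key not in obj:
            return None
        obj = obj[key]
    return obj


# Made-up record with no 'location' entry (assumption for illustration)
json_obj = {'data': {'name': 'Project A', 'goal': '5000'}}

row = {
    'name': get_nested(json_obj, 'data', 'name'),
    'goal': get_nested(json_obj, 'data', 'goal'),
    'location': get_nested(json_obj, 'data', 'location', 'displayable_name'),
}
print(row)
# {'name': 'Project A', 'goal': '5000', 'location': None}
```

The `isinstance` check also covers the case where an intermediate value such as `location` is present but not a dictionary.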

**Converting 'pledged' and 'goal' from strings to numeric**

In [9]:
# Convert 'pledged' and 'goal' columns from strings to numeric variables
df_project['pledged'] = pd.to_numeric(df_project['pledged'])
df_project['goal'] = pd.to_numeric(df_project['goal'])

# Define a new column called 'funded' that identifies whether the project was
# funded or not (a project that exactly meets its goal counts as funded)
df_project['funded'] = df_project['pledged'] >= df_project['goal']

# Display collected data information
df_project.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7884 entries, 0 to 7883
Data columns (total 8 columns):
name         7884 non-null object
category     7884 non-null object
hyperlink    7884 non-null object
currency     7884 non-null object
pledged      7884 non-null float64
goal         7884 non-null float64
location     7861 non-null object
funded       7884 non-null bool
dtypes: bool(1), float64(2), object(5)
memory usage: 820.4+ KB
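By default `pd.to_numeric` raises an error when a value cannot be parsed. A small sketch on made-up values, assuming unparseable entries should become `NaN` (via `errors='coerce'`) rather than stop the run; `df_demo` and its contents are illustrative only:

```python
import pandas as pd

# Made-up values; 'n/a' stands in for a malformed entry (assumption)
df_demo = pd.DataFrame({'pledged': ['120.5', 'n/a', '300'],
                        'goal': ['100', '200', '300']})

# errors='coerce' turns unparseable strings into NaN instead of raising
df_demo['pledged'] = pd.to_numeric(df_demo['pledged'], errors='coerce')
df_demo['goal'] = pd.to_numeric(df_demo['goal'], errors='coerce')

# Count how many entries could not be converted
print(df_demo['pledged'].isna().sum())
```

Checking the `NaN` count after conversion is one way to see how many rows were affected before deciding whether to drop or repair them.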


In [10]:
# Pickling the collected data
joblib.dump(df_project, 'testing_1.pk')

['testing_1.pk']
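To read the saved table back in a later session, `joblib.load` is the counterpart of `joblib.dump`. The round trip below is a sketch on a made-up one-row DataFrame written to a temporary directory; the file name `demo.pk` and the variable names are assumptions, not the notebook's actual output file.

```python
import os
import tempfile

import joblib
import pandas as pd

# Made-up stand-in for df_project (assumption)
df_demo = pd.DataFrame({'name': ['Project A'], 'goal': [5000.0]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo.pk')
    joblib.dump(df_demo, path)     # write the DataFrame to disk
    df_loaded = joblib.load(path)  # read it back

# True if the loaded copy matches the original
print(df_loaded.equals(df_demo))
```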