# Predicting Kickstarter project success

Authors: [Su Leen Wong](https://github.com/suleenwong) and [Erick Cantu](https://github.com/eaunaicr97)

![](assets/intro.png)

Image from: [http://clipart-library.com/](http://clipart-library.com/)

## Introduction

Kickstarter, founded in 2009, is a crowdfunding platform where project creators can raise money from the public, circumventing traditional avenues of investment. It has an all-or-nothing funding model, whereby a project is only funded if it meets its goal amount; otherwise no funds are collected.

A huge variety of factors contribute to the success or failure of a project on Kickstarter. Some of these factors can be quantified or categorized, which makes it possible to build a model that attempts to predict whether a project will succeed.

The goal of this project is to predict whether a Kickstarter project will succeed or fail, using exploratory data analysis and supervised machine learning models.

More generally, the aim is to help potential project creators as well as potential investors assess what their chances of success on Kickstarter will be.

## About the data

The dataset contains data on all projects hosted on Kickstarter from the company’s launch in April 2009 until the date of the web scrape on 14 March 2019. The raw data consists of 56 separate .csv files. This notebook combines all the .csv files into a single file called ```"Kickstarter.csv"```. 

## Column names and descriptions 

- **backers_count** - number of backers of the project
- **blurb** - short description about the project
- **category** - contains category and subcategory information for the project
- **converted_pledged_amount** - amount of money pledged, converted to the currency in the 'current_currency' column
- **country** - country of project creator 
- **created_at** - date and time in Unix time of when the project was initially created
- **creator** - contains name of the project creator
- **currency** - original currency of project goal
- **currency_symbol** - symbol of the original currency the project goal was denominated in
- **currency_trailing_code** - code of the original currency the project goal was denominated in
- **current_currency** - currency the project goal was converted to
- **deadline** - date and time of when the project will close for donations
- **disable_communication** - whether or not a project owner disabled communication with their backers
- **friends** - unclear (null or empty list)
- **fx_rate** - foreign exchange rate between the original currency and the current_currency
- **goal** - funding goal in the currency denominated by 'currency'
- **id** - id number of the project
- **is_backing** - unclear (null or false)
- **is_starrable** - whether or not a project can be starred by users
- **is_starred** - whether or not a project has been starred by users
- **launched_at** - date and time of when the project was launched
- **location** - contains city or state information of the project creator
- **name** - name of the project
- **permissions** - unclear (null or empty)
- **photo** - link and information to the project photo
- **pledged** - amount pledged in the original currency (the 'currency' column)
- **profile** - contains information about the project's profile and some visualization parameters
- **slug** - name of the project in lowercase with dashes instead of spaces
- **source_url** - url for the project's category
- **spotlight** - whether a project was spotlighted on the Kickstarter website after it succeeded
- **staff_pick** - whether a project was highlighted as a staff pick when it was launched/live
- **state** - target variable, has 5 unique values: [successful, failed, live, canceled, suspended]
- **state_changed_at** - date and time of when a project's status was changed (same as the deadline for successful and failed projects)
- **static_usd_rate** - conversion rate between the original currency and USD
- **urls** - url link to the project's page
- **usd_pledged** - amount pledged in USD
- **usd_type** - unclear (domestic or international)
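
Several of the columns above (`created_at`, `deadline`, `launched_at`, `state_changed_at`) are stored as Unix time, that is, seconds elapsed since 1970-01-01 00:00:00 UTC. As a minimal sketch of how such a value maps to a calendar date, the snippet below converts the `created_at` value of the first project in the dataset; the variable names are chosen here for illustration only.

```python
import pandas as pd

# Unix time counts seconds since 1970-01-01 00:00:00 UTC.
# This is the created_at value of the first row shown by df.head().
created_at = 1387659690

# unit='s' tells pandas that the integer is a number of seconds
created_ts = pd.to_datetime(created_at, unit='s')
print(created_ts)  # 2013-12-21 21:01:30
```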

## Importing packages

In [147]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

## Combining the data

The raw data consists of 56 separate .csv files. The .csv files are combined into a single file called ```"Kickstarter.csv"```. 

In [148]:
# Read every .csv file into a list of DataFrames
frames = []
for i in range(56):
    filename = f'data/Kickstarter{i:03}.csv'
    print("Reading:", filename)
    frames.append(pd.read_csv(filename))

# Concatenate once at the end (cheaper than growing a DataFrame inside the loop)
df = pd.concat(frames, axis=0, ignore_index=True)

# Write to a single .csv file
df.to_csv('data/Kickstarter.csv', index=False)

Reading: data/Kickstarter000.csv
Reading: data/Kickstarter001.csv
Reading: data/Kickstarter002.csv
Reading: data/Kickstarter003.csv
Reading: data/Kickstarter004.csv
Reading: data/Kickstarter005.csv
Reading: data/Kickstarter006.csv
Reading: data/Kickstarter007.csv
Reading: data/Kickstarter008.csv
Reading: data/Kickstarter009.csv
Reading: data/Kickstarter010.csv
Reading: data/Kickstarter011.csv
Reading: data/Kickstarter012.csv
Reading: data/Kickstarter013.csv
Reading: data/Kickstarter014.csv
Reading: data/Kickstarter015.csv
Reading: data/Kickstarter016.csv
Reading: data/Kickstarter017.csv
Reading: data/Kickstarter018.csv
Reading: data/Kickstarter019.csv
Reading: data/Kickstarter020.csv
Reading: data/Kickstarter021.csv
Reading: data/Kickstarter022.csv
Reading: data/Kickstarter023.csv
Reading: data/Kickstarter024.csv
Reading: data/Kickstarter025.csv
Reading: data/Kickstarter026.csv
Reading: data/Kickstarter027.csv
Reading: data/Kickstarter028.csv
Reading: data/Kickstarter029.csv
Reading: d

In [149]:
# Clear df variable
del df
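
The combining step above hard-codes the number of files. As an alternative sketch, the files could be discovered with a glob pattern so the code keeps working if the number of parts changes. The helper name `combine_csvs` and the pattern are assumptions made for this illustration; the pattern assumes the three-digit file names seen in the output above.

```python
from pathlib import Path

import pandas as pd


def combine_csvs(data_dir, pattern="Kickstarter[0-9][0-9][0-9].csv"):
    """Read every file in data_dir matching pattern and stack the rows.

    Hypothetical helper: the pattern matches Kickstarter000.csv,
    Kickstarter001.csv, ... but not the combined Kickstarter.csv.
    """
    paths = sorted(Path(data_dir).glob(pattern))
    if not paths:
        raise FileNotFoundError(f"No files matching {pattern!r} in {data_dir}")

    frames = [pd.read_csv(path) for path in paths]
    combined = pd.concat(frames, axis=0, ignore_index=True)

    # Sanity check: no rows were lost or added while concatenating
    assert len(combined) == sum(len(frame) for frame in frames)
    return combined
```

With the directory layout used in this notebook, the call would be `df = combine_csvs('data')`.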

## Loading the data

In [150]:
# Reload data file and check if it was combined correctly 
df = pd.read_csv("data/Kickstarter.csv", sep=',')

## Data Exploration

In [151]:
# Examine the first 5 rows
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,21,2006 was almost 7 years ago.... Can you believ...,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",802,US,1387659690,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,$,True,...,new-final-round-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1391899046,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",802.0,international
1,97,An adorable fantasy enamel pin series of princ...,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",2259,US,1549659768,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,$,True,...,princess-pals-enamel-pin-series,https://www.kickstarter.com/discover/categorie...,True,False,successful,1551801611,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2259.0,international
2,88,Helping a community come together to set the s...,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",29638,US,1477242384,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,$,True,...,their-life-through-their-lens-the-amish-and-me...,https://www.kickstarter.com/discover/categorie...,True,True,successful,1480607932,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",29638.0,international
3,193,Every revolution starts from the bottom and we...,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",49158,IT,1540369920,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,€,False,...,wao-the-eco-effect-shoes,https://www.kickstarter.com/discover/categorie...,True,False,successful,1544309940,1.136525,"{""web"":{""project"":""https://www.kickstarter.com...",49075.15252,international
4,20,Learn to build 10+ Applications in this comple...,"{""id"":51,""name"":""Software"",""slug"":""technology/...",549,US,1425706517,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,$,True,...,apple-watch-development-course,https://www.kickstarter.com/discover/categorie...,False,False,failed,1428511019,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",549.0,domestic


In [152]:
# Examine last 5 rows
df.tail()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
209217,57,Steam Hollow is a Veteran owned Craft Brewery ...,"{""id"":307,""name"":""Drinks"",""slug"":""food/drinks""...",10320,US,1487697908,"{""id"":86849824,""name"":""Blane White"",""slug"":""st...",USD,$,True,...,steam-hollow-brewing-co,https://www.kickstarter.com/discover/categorie...,True,False,successful,1492171945,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",10320.0,international
209218,11,Over 250 healthy recipes free from ANY added S...,"{""id"":306,""name"":""Cookbooks"",""slug"":""food/cook...",305,AU,1450753265,"{""id"":1899017630,""name"":""Alan Wichert"",""is_reg...",AUD,$,True,...,fusion-detox-chef-alan-wichert-creates-the-health,https://www.kickstarter.com/discover/categorie...,False,False,failed,1453713310,0.727575,"{""web"":{""project"":""https://www.kickstarter.com...",316.495068,domestic
209219,0,"Give your baby style and flair with ""Gorgeous ...","{""id"":264,""name"":""Childrenswear"",""slug"":""fashi...",0,US,1470991682,"{""id"":1589905505,""name"":""T. Simms"",""slug"":""1pr...",USD,$,True,...,gorgeous-princess-cheetah-collection,https://www.kickstarter.com/discover/categorie...,False,False,failed,1474837755,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",0.0,international
209220,11,"""The Lion & the Lyceum"" is a children's book a...","{""id"":46,""name"":""Children's Books"",""slug"":""pub...",1400,US,1424101898,"{""id"":1468696154,""name"":""Alex Beene"",""is_regis...",USD,$,True,...,the-lion-and-the-lyceum-childrens-book-based-o...,https://www.kickstarter.com/discover/categorie...,True,False,successful,1426741214,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",1400.0,domestic
209221,49,Award-winning aviation artist Philip E West's ...,"{""id"":23,""name"":""Painting"",""slug"":""art/paintin...",4867,GB,1519817658,"{""id"":230724301,""name"":""CPS Books / G2 Enterta...",GBP,£,False,...,philip-e-west-aviation-masterworks,https://www.kickstarter.com/discover/categorie...,True,False,successful,1529851297,1.394768,"{""web"":{""project"":""https://www.kickstarter.com...",5118.799367,international


In [153]:
# Checking the number of rows and columns
df.shape

(209222, 37)

In [154]:
# Summary statistics
df.describe()

| | backers_count | converted_pledged_amount | created_at | deadline | fx_rate | goal | id | launched_at | pledged | state_changed_at | static_usd_rate | usd_pledged |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 | 209222.0 |
| mean | 145.419057 | 12892.9 | 1456089000.0 | 1463033000.0 | 0.994857 | 49176.04 | 1073222000.0 | 1460206000.0 | 18814.03 | 1462838000.0 | 1.010757 | 12892.13 |
| std | 885.967976 | 88894.14 | 63397110.0 | 63056180.0 | 0.211654 | 1179427.0 | 619805100.0 | 63090290.0 | 322959.6 | 62904210.0 | 0.231893 | 88901.24 |
| min | 0.0 | 0.0 | 1240366000.0 | 1241334000.0 | 0.008966 | 0.01 | 8624.0 | 1240603000.0 | 0.0 | 1241334000.0 | 0.008771 | 0.0 |
| 25% | 4.0 | 106.0 | 1413317000.0 | 1420607000.0 | 1.0 | 1500.0 | 535105400.0 | 1417639000.0 | 110.0 | 1420485000.0 | 1.0 | 106.0014 |
| 50% | 27.0 | 1537.0 | 1457895000.0 | 1464754000.0 | 1.0 | 5000.0 | 1074579000.0 | 1461924000.0 | 1556.0 | 1464709000.0 | 1.0 | 1537.358 |
| 75% | 89.0 | 6548.0 | 1511595000.0 | 1519437000.0 | 1.0 | 15000.0 | 1609369000.0 | 1516694000.0 | 6887.2 | 1519366000.0 | 1.0 | 6550.0 |
| max | 105857.0 | 8596474.0 | 1552527000.0 | 1557721000.0 | 1.876033 | 100000000.0 | 2147476000.0 | 1552537000.0 | 81030740.0 | 1552537000.0 | 1.716408 | 8596475.0 |


In [155]:
# Checking for null values
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
 #   Column                    Non-Null Count   Dtype  
---  ------                    --------------   -----  
 0   backers_count             209222 non-null  int64  
 1   blurb                     209214 non-null  object 
 2   category                  209222 non-null  object 
 3   converted_pledged_amount  209222 non-null  int64  
 4   country                   209222 non-null  object 
 5   created_at                209222 non-null  int64  
 6   creator                   209222 non-null  object 
 7   currency                  209222 non-null  object 
 8   currency_symbol           209222 non-null  object 
 9   currency_trailing_code    209222 non-null  bool   
 10  current_currency          209222 non-null  object 
 11  deadline                  209222 non-null  int64  
 12  disable_communication     209222 non-null  bool   
 13  friends                   300 non-null     o

## Data cleaning

### Selecting columns of interest

Examining the column names and descriptions, we conclude that the target variable is ```state```. There are five different outcomes for ```state```, namely: successful, failed, live, canceled and suspended.

In [157]:
df['state'].unique()

array(['successful', 'failed', 'live', 'canceled', 'suspended'],
      dtype=object)

In [158]:
# Count of each value in 'state'
df['state'].value_counts()

successful    117465
failed         75199
canceled        8624
live            7311
suspended        623
Name: state, dtype: int64
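
The counts above also show how balanced the two outcomes of interest are. As a rough sketch, the snippet below re-enters the printed counts by hand (rather than reading them from `df`) and computes the share of successful and failed projects among the finished ones; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({
    'successful': 117465,
    'failed': 75199,
    'canceled': 8624,
    'live': 7311,
    'suspended': 623,
})

# Keep only the two outcomes that will be modelled
finished = counts[['successful', 'failed']]

# Share of each outcome among finished projects, rounded to 3 decimals
shares = (finished / finished.sum()).round(3)
print(finished.sum())  # 192664
print(shares)
```

Roughly 61% of the finished projects are successful, so a model that always predicts success would already be right about 61% of the time. That figure is a useful baseline when judging classifier accuracy later.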

After analyzing the column names and descriptions, we keep the following columns of interest:

In [159]:
col_names = ['id', 'backers_count', 'blurb', 'category', 'country', 'deadline',
             'goal', 'launched_at', 'name', 'staff_pick', 'state',
             'static_usd_rate', 'usd_pledged']
df = df[col_names]

### Filtering for successful and failed projects

In [160]:
# Select rows only where state is successful or failed
df = df.query("state in ['successful', 'failed']")

This leaves us with 192664 rows:

In [161]:
df.shape

(192664, 13)

Converting ```state``` to 1 for successful and 0 for failed:

In [162]:
# Vectorized comparison: True/False cast to 1/0
df['target'] = (df['state'] == 'successful').astype(int)
df = df.drop('state', axis=1)

### Removing duplicate rows

Identifying duplicate rows:

In [163]:
duplicate_id = df[df.duplicated('id')].sort_values('id')
duplicate_id['duplicate'] = True
duplicate_id = duplicate_id[['id', 'duplicate']]

In [164]:
duplicate_id.shape

(23685, 2)

**There are 23685 rows whose id duplicates that of an earlier row.**
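
To make the counting explicit: `duplicated('id')` flags every occurrence of an id except the first, so the number above is the number of rows that will be removed, not the number of distinct ids that repeat. A minimal sketch on a made-up toy frame (the data and names are hypothetical):

```python
import pandas as pd

# Toy data: id 2 appears twice, id 3 appears three times
toy = pd.DataFrame({
    'id':   [1, 2, 2, 3, 3, 3],
    'goal': [10, 20, 20, 30, 30, 30],
})

# Every occurrence after the first is flagged as a duplicate
n_extra = toy.duplicated('id').sum()
print(n_extra)  # 3

# Keeping the first occurrence of each id removes exactly those rows
deduped = toy.drop_duplicates('id', keep='first').reset_index(drop=True)
print(len(toy) - n_extra == len(deduped))  # True
```

The same arithmetic applies to the real data: 192664 rows minus 23685 flagged rows leaves 168979 rows.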

In [165]:
# Drop duplicate rows
df = df.drop_duplicates('id', keep='first').reset_index(drop=True)

In [166]:
# Drop id column
if 'id' in df.columns:
    df.drop('id', axis=1, inplace=True)

In [167]:
df.shape

(168979, 12)

After filtering for successful and failed projects and removing duplicate rows, we end up with **168979** unique rows.

### Blurb length and name length

The ```blurb``` column had 8 null values in the raw data (209214 non-null entries out of 209222 rows). After filtering and removing duplicates, only 2 remain:

In [168]:
# Number of null values in the blurb column
df['blurb'].isna().sum()

2

The ```blurb``` column is a short description of the project and ```name``` is the name of the project. Rather than using the raw text, we replace each with its length in words.

In [169]:
# Count the words in the blurb; null blurbs get a length of 0
df['blurb_length'] = df['blurb'].str.split().str.len().fillna(0)

# Count the words in the name
df['name_length'] = df['name'].str.split().str.len()

In [170]:
# Convert blurb and name lengths to integer
df['blurb_length'] = df['blurb_length'].astype(int)
df['name_length'] = df['name_length'].astype(int)
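
Note that the lengths computed above are word counts, not character counts, because the text is split on whitespace before `len` is taken. A small sketch on made-up blurbs (the toy data is hypothetical) shows the behaviour, including the null case:

```python
import pandas as pd

# Toy data: one five-word blurb and one missing blurb
toy = pd.DataFrame({'blurb': ['An adorable enamel pin series', None]})

# split() breaks on whitespace, len() counts the pieces;
# a missing blurb yields NaN, which is replaced by 0
toy['blurb_length'] = (
    toy['blurb'].str.split().str.len().fillna(0).astype(int)
)
print(toy['blurb_length'].tolist())  # [5, 0]
```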

In [171]:
# Drop blurb column
if 'blurb' in df.columns:
    df.drop('blurb', axis=1, inplace=True)

# Drop name column
if 'name' in df.columns:
    df.drop('name', axis=1, inplace=True)

### Extracting the category and subcategory

In [172]:
# Extracting the relevant subcategory section from the string
f = lambda x: x['category'].split('/')[1].split('","position')[0]
df['subcategory'] = df.apply(f, axis=1)

# Extracting the relevant category section from the string, and replacing the original category variable
f = lambda x: x['category'].split('"slug":"')[1].split('/')[0]
df['category'] = df.apply(f, axis=1)

# Some categories do not have a sub-category, so do not have a '/' to split with
f = lambda x: x['category'].split('","position"')[0] 
df['category'] = df.apply(f, axis=1)
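
The string splitting above relies on the exact layout of the raw text. Since the `category` values look like JSON objects with a `slug` field of the form `category/subcategory`, an alternative sketch is to parse them with the standard `json` module. The helper `split_slug` and the toy records are hypothetical; the second record illustrates a slug without a subcategory.

```python
import json

import pandas as pd


def split_slug(category_json):
    """Return (category, subcategory) from a JSON string with a 'slug' key.

    Hypothetical helper: subcategory is None when the slug has no '/'.
    """
    slug = json.loads(category_json)['slug']
    category, _, subcategory = slug.partition('/')
    return category, (subcategory or None)


# Toy records shaped like the raw 'category' column (fields abbreviated)
toy = pd.DataFrame({'category': [
    '{"id":43,"name":"Rock","slug":"music/rock"}',
    '{"name":"Comics","slug":"comics"}',
]})

pairs = [split_slug(raw) for raw in toy['category']]
toy['subcategory'] = [sub for _, sub in pairs]
toy['category'] = [cat for cat, _ in pairs]
print(toy)
```

Parsing the JSON avoids depending on the order of the keys in the raw string, at the cost of being somewhat slower than plain string splitting on a large frame.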

In [173]:
df.head()

| | backers_count | category | country | deadline | goal | launched_at | staff_pick | static_usd_rate | usd_pledged | target | blurb_length | name_length | subcategory |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | music | US | 1391899046 | 200.0 | 1388011046 | False | 1.0 | 802.0 | 1 | 26 | 4 | rock |
| 1 | 97 | art | US | 1551801611 | 400.0 | 1550073611 | False | 1.0 | 2259.0 | 1 | 9 | 5 | mixed media |
| 2 | 88 | photography | US | 1480607930 | 27224.0 | 1478012330 | True | 1.0 | 29638.0 | 1 | 25 | 9 | photobooks |
| 3 | 193 | fashion | IT | 1544309940 | 40000.0 | 1540684582 | False | 1.136525 | 49075.15252 | 1 | 13 | 5 | footwear |
| 4 | 20 | technology | US | 1428511017 | 1000.0 | 1425919017 | False | 1.0 | 549.0 | 0 | 22 | 4 | software |


### Convert columns containing time information to datetime format

In [174]:
# Convert to datetime
df['launched_at'] = pd.to_datetime(df['launched_at'], origin='unix', unit='s')
df['deadline'] = pd.to_datetime(df['deadline'], origin='unix', unit='s')


In [175]:
# Calculate duration of kickstarter campaign in days
df['duration'] = df['deadline'] - df['launched_at']
df['duration_days'] = df['duration'].dt.days
df['year'] = pd.DatetimeIndex(df['launched_at']).year
df['month'] = pd.DatetimeIndex(df['launched_at']).month
df['year'] = df['year'].astype('category')
df['month'] = df['month'].astype('category')
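
As a worked example of the duration calculation, the sketch below applies the same steps to the `launched_at` and `deadline` values of the first project shown earlier. The toy frame and variable names are for illustration only.

```python
import pandas as pd

# launched_at and deadline of the first project, in Unix seconds
toy = pd.DataFrame({'launched_at': [1388011046], 'deadline': [1391899046]})

launched = pd.to_datetime(toy['launched_at'], unit='s')
deadline = pd.to_datetime(toy['deadline'], unit='s')

# Subtracting two datetime columns gives a timedelta column;
# .dt.days keeps only the whole days of each difference
toy['duration_days'] = (deadline - launched).dt.days
toy['year'] = launched.dt.year
toy['month'] = launched.dt.month

print(toy[['duration_days', 'year', 'month']])
# duration_days = 45, year = 2013, month = 12
```

The difference is 3888000 seconds, which is exactly 45 days. Because `.dt.days` discards any partial day, a campaign lasting 29 days and 23 hours would be recorded as 29.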


In [176]:
# Drop unneeded columns
drop_cols = ['launched_at', 'deadline', 'duration']
for col in drop_cols:
    if col in df.columns:
        df.drop(col, axis=1, inplace=True)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168979 entries, 0 to 168978
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype   
---  ------           --------------   -----   
 0   backers_count    168979 non-null  int64   
 1   category         168979 non-null  object  
 2   country          168979 non-null  object  
 3   goal             168979 non-null  float64 
 4   staff_pick       168979 non-null  bool    
 5   static_usd_rate  168979 non-null  float64 
 6   usd_pledged      168979 non-null  float64 
 7   target           168979 non-null  int64   
 8   blurb_length     168979 non-null  int64   
 9   name_length      168979 non-null  int64   
 10  subcategory      168979 non-null  object  
 11  duration_days    168979 non-null  int64   
 12  year             168979 non-null  category
 13  month            168979 non-null  category
dtypes: bool(1), category(2), float64(3), int64(5), object(3)
memory usage: 14.7+ MB


### Converting 'goal' to USD

The ```converted_pledged_amount``` and ```usd_pledged``` columns hold similar values, but they are not exactly the same in some cases. When the currency is not USD, ```usd_pledged = pledged * static_usd_rate```. We therefore keep only ```usd_pledged```, which has already been converted to USD, and leave out ```converted_pledged_amount``` and ```pledged```.

The goal amount on the website is specified by the project owner in their country's currency, so we convert ```goal``` to USD in the same way, by multiplying it by ```static_usd_rate```.

In [177]:
df['goal'] = (df['goal'] * df['static_usd_rate']).round(2)
df = df.drop(['static_usd_rate'], axis=1)
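
As a quick check of the conversion, the sketch below applies it to the goal and exchange rate of two projects from the earlier output: the EUR-denominated footwear project and a USD project. The toy frame and the `goal_usd` column name are for illustration only.

```python
import pandas as pd

# goal and static_usd_rate of two projects shown earlier
toy = pd.DataFrame({
    'goal': [40000.0, 200.0],
    'static_usd_rate': [1.136525, 1.0],
})

# Multiply by the static rate and round to cents
toy['goal_usd'] = (toy['goal'] * toy['static_usd_rate']).round(2)
print(toy['goal_usd'].tolist())  # [45461.0, 200.0]
```

A goal of 40000 EUR at a rate of 1.136525 becomes 45461 USD, while USD goals are unchanged because their rate is 1.0.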

In [178]:
df.head()

| | backers_count | category | country | goal | staff_pick | usd_pledged | target | blurb_length | name_length | subcategory | duration_days | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | music | US | 200.0 | False | 802.0 | 1 | 26 | 4 | rock | 45 | 2013 | 12 |
| 1 | 97 | art | US | 400.0 | False | 2259.0 | 1 | 9 | 5 | mixed media | 20 | 2019 | 2 |
| 2 | 88 | photography | US | 27224.0 | True | 29638.0 | 1 | 25 | 9 | photobooks | 30 | 2016 | 11 |
| 3 | 193 | fashion | IT | 45461.0 | False | 49075.15252 | 1 | 13 | 5 | footwear | 41 | 2018 | 10 |
| 4 | 20 | technology | US | 1000.0 | False | 549.0 | 0 | 22 | 4 | software | 30 | 2015 | 3 |


### Converting 'staff_pick' to 0 or 1 values

In [184]:
# Cast the boolean column to 1 (True) / 0 (False)
df['staff_pick'] = df['staff_pick'].astype(int)

In [185]:
# Examine first 5 rows
df.head()

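| | backers_count | category | country | goal | staff_pick | usd_pledged | target | blurb_length | name_length | subcategory | duration_days | year | month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | music | US | 200.0 | 0 | 802.0 | 1 | 26 | 4 | rock | 45 | 2013 | 12 |
| 1 | 97 | art | US | 400.0 | 0 | 2259.0 | 1 | 9 | 5 | mixed media | 20 | 2019 | 2 |
| 2 | 88 | photography | US | 27224.0 | 1 | 29638.0 | 1 | 25 | 9 | photobooks | 30 | 2016 | 11 |
| 3 | 193 | fashion | IT | 45461.0 | 0 | 49075.15252 | 1 | 13 | 5 | footwear | 41 | 2018 | 10 |
| 4 | 20 | technology | US | 1000.0 | 0 | 549.0 | 0 | 22 | 4 | software | 30 | 2015 | 3 |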


In [186]:
# Check to see if the data types are correct
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 168979 entries, 0 to 168978
Data columns (total 13 columns):
 #   Column         Non-Null Count   Dtype   
---  ------         --------------   -----   
 0   backers_count  168979 non-null  int64   
 1   category       168979 non-null  object  
 2   country        168979 non-null  object  
 3   goal           168979 non-null  float64 
 4   staff_pick     168979 non-null  int64   
 5   usd_pledged    168979 non-null  float64 
 6   target         168979 non-null  int64   
 7   blurb_length   168979 non-null  int64   
 8   name_length    168979 non-null  int64   
 9   subcategory    168979 non-null  object  
 10  duration_days  168979 non-null  int64   
 11  year           168979 non-null  category
 12  month          168979 non-null  category
dtypes: category(2), float64(2), int64(6), object(3)
memory usage: 14.5+ MB


## Write clean data to .csv file

Since the original data has many unnecessary features, we save the cleaned DataFrame to ```Kickstarter_clean.csv```, which can be loaded later for exploratory data analysis and for training the machine learning models.

In [187]:
df.to_csv('data/Kickstarter_clean.csv', sep=',', index=False)
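
One thing to keep in mind when loading the cleaned file later: a .csv file stores only text, so the `category` dtype assigned to `year` and `month` is not preserved and has to be restored after reading. The sketch below shows this on a toy frame, using an in-memory buffer in place of a file on disk; all names are illustrative.

```python
import io

import pandas as pd

# Toy frame with categorical year and month, as in the cleaned data
toy = pd.DataFrame({'year': [2013, 2019], 'month': [12, 2]}).astype('category')

# Round-trip through CSV using an in-memory buffer instead of a file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# The categorical dtype is lost: both columns come back as integers
print(reloaded.dtypes)

# Restore the intended dtypes after loading
restored = reloaded.astype({'year': 'category', 'month': 'category'})
print(restored.dtypes)
```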