# Machine Learning Project - Kickstarter Data Set
*Contributors: Max Langer, René Ebrecht, Jens Reich*

This is our very first project in which we build a machine learning model from scratch on an unfamiliar dataset.
The dataset contains Kickstarter projects from the years 2009 to 2019.
Our goal is to provide our (fictional) stakeholder, PPC Consultants, with a model that predicts whether a Kickstarter project will be successful.
PPC Consultants advises potential project creators (PPCs) on how to get their projects off the ground as successfully as possible.
The value of our data product (the predictive model) is therefore to reveal opportunities, save time, and ultimately make money for both PPC Consultants and the PPCs.

In [5]:
# Import the organization modules
import pandas as pd
import numpy as np
# Import the plot modules
import matplotlib.pyplot as plt
import seaborn as sns
# Import own scripts
from scripts.data_cleaning import read_all_csvs, create_csv, get_nan_cols, convert_dates, calculate_time_periodes

In [14]:
# Create data frame from all single CSV files
df = read_all_csvs()
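
The helper `read_all_csvs` lives in `scripts/data_cleaning.py` and is not shown in this notebook. As a rough illustration only, a helper of this kind could look like the sketch below. The function name `read_all_csvs_sketch`, the `data_dir` parameter, and the default folder `data` are assumptions made for this example, not the actual implementation.

```python
from pathlib import Path

import pandas as pd


def read_all_csvs_sketch(data_dir="data"):
    """Hypothetical stand-in for read_all_csvs: concatenate every CSV in data_dir."""
    csv_files = sorted(Path(data_dir).glob("*.csv"))
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found in {data_dir!r}")
    frames = [pd.read_csv(path) for path in csv_files]
    # Each file keeps its own row index, so index values repeat across files.
    return pd.concat(frames)
```

Calling `read_all_csvs_sketch("data")` would return one data frame with the rows of all files stacked on top of each other. Passing `ignore_index=True` to `pd.concat` would give a continuous index instead.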

In [4]:
# Take a look at the first rows
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,21,2006 was almost 7 years ago.... Can you believ...,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",802,US,1387659690,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,$,True,...,new-final-round-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1391899046,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",802.0,international
1,97,An adorable fantasy enamel pin series of princ...,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",2259,US,1549659768,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,$,True,...,princess-pals-enamel-pin-series,https://www.kickstarter.com/discover/categorie...,True,False,successful,1551801611,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2259.0,international
2,88,Helping a community come together to set the s...,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",29638,US,1477242384,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,$,True,...,their-life-through-their-lens-the-amish-and-me...,https://www.kickstarter.com/discover/categorie...,True,True,successful,1480607932,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",29638.0,international
3,193,Every revolution starts from the bottom and we...,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",49158,IT,1540369920,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,€,False,...,wao-the-eco-effect-shoes,https://www.kickstarter.com/discover/categorie...,True,False,successful,1544309940,1.136525,"{""web"":{""project"":""https://www.kickstarter.com...",49075.15252,international
4,20,Learn to build 10+ Applications in this comple...,"{""id"":51,""name"":""Software"",""slug"":""technology/...",549,US,1425706517,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,$,True,...,apple-watch-development-course,https://www.kickstarter.com/discover/categorie...,False,False,failed,1428511019,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",549.0,domestic


In [6]:
# Prints out the columns with NaNs.
get_nan_cols(df)

Number of NaNs per column:
blurb:8
friends:208922
is_backing:208922
is_starred:208922
location:226
permissions:208922
usd_type:480
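
`get_nan_cols` is also imported from our own scripts and not shown here. A minimal sketch of how such a helper might produce the output above, assuming it simply counts missing values per column and reports the columns that have any. The name `get_nan_cols_sketch` and the toy data frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def get_nan_cols_sketch(df):
    """Hypothetical stand-in for get_nan_cols: report NaN counts for columns that have any."""
    nan_counts = df.isna().sum()
    nan_counts = nan_counts[nan_counts > 0]
    print("Number of NaNs per column:")
    for column, count in nan_counts.items():
        print(f"{column}:{count}")
    return nan_counts


# Toy data (not taken from the Kickstarter dataset).
toy = pd.DataFrame({
    "blurb": ["a", None, "c"],
    "goal": [1.0, 2.0, 3.0],
    "friends": [np.nan, np.nan, np.nan],
})
nan_counts = get_nan_cols_sketch(toy)
```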


We can drop some columns that will probably have no impact on the model later on. The dropped columns are: `blurb`, `converted_pledged_amount`, `currency_symbol`, `currency_trailing_code`, `friends`, `fx_rate`, `is_backing`, `is_starred`, `permissions`, `photo`, `profile`, `slug`, `source_url`, `static_usd_rate`, `urls`.

In particular, `friends`, `is_backing`, `is_starred`, and `permissions` are almost completely empty (208,922 NaNs each), so dropping them loses no information.

`location` and `usd_type` have only 226 and 480 missing values, respectively. These are so few that we can drop the affected rows as well.

We also rename `currency` to `original_currency`, since this describes its contents better.

In [16]:
# Rename the currency column.
df.rename(columns={'currency':'original_currency'}, inplace=True)
# Drop the listed columns.
df.drop([
    'blurb', 
    'converted_pledged_amount',
    'currency_symbol', 
    'currency_trailing_code', 
    'friends', 
    'fx_rate', 
    'is_backing',
    'is_starred',
    'permissions',
    'photo',
    'profile',
    'slug',
    'source_url', 
    'static_usd_rate',
    'urls'
    ], axis=1, inplace=True)
# Drop the remaining rows that contain NaN values.
df.dropna(axis=0, inplace=True)

In [19]:
# Prints out the columns with NaNs.
get_nan_cols(df)

Number of NaNs per column:


We then convert `created_at`, `state_changed_at`, and `deadline` to datetime types.
With these, we can calculate the time periods from the start of a project to its state change (`days_till_change`) and to its deadline (`days_total`).

In [20]:
# Convert the time columns to datetime types.
df = convert_dates(df)
# Calculate the time periods.
df = calculate_time_periodes(df)
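
Both helpers come from `scripts/data_cleaning.py` and are not shown here. The sketch below is one plausible way to implement them, assuming the raw columns hold Unix timestamps in seconds and that both periods are measured from `created_at`. The function names ending in `_sketch` are hypothetical. The toy row uses the `created_at` and `state_changed_at` values of the first project in the preview above. Its raw `deadline` is not visible there, so the value used here is chosen to match the converted deadline displayed further down.

```python
import pandas as pd


def convert_dates_sketch(df, columns=("created_at", "state_changed_at", "deadline")):
    """Hypothetical stand-in for convert_dates: Unix seconds -> datetime64."""
    out = df.copy()
    for column in columns:
        out[column] = pd.to_datetime(out[column], unit="s")
    return out


def calculate_time_periods_sketch(df):
    """Hypothetical stand-in for calculate_time_periodes: whole days since creation."""
    out = df.copy()
    out["days_till_change"] = (out["state_changed_at"] - out["created_at"]).dt.days
    out["days_total"] = (out["deadline"] - out["created_at"]).dt.days
    return out


toy = pd.DataFrame({
    "created_at": [1387659690],
    "state_changed_at": [1391899046],
    "deadline": [1391899046],  # assumed raw value, see the note above
})
toy = calculate_time_periods_sketch(convert_dates_sketch(toy))
print(toy[["days_till_change", "days_total"]])
```

For this row both periods come out as 49 days, which agrees with the values shown for the first project later in the notebook.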

In [37]:

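# Split the deadline into year, month, and day features.
df['deadline_year'] = df['deadline'].dt.year
df['deadline_month'] = df['deadline'].dt.month
df['deadline_day'] = df['deadline'].dt.day

# Preview the frame without the raw timestamp columns.
# The result is not assigned back, since later cells still use these columns.
df.drop(columns=['created_at', 'state_changed_at', 'deadline'])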

In [22]:
# Keep only the date part of created_at.
df['created_at'] = df['created_at'].dt.date

array([2014, 2019, 2016, 2018, 2015, 2012, 2013, 2017, 2010, 2011, 2009])

In [38]:
df.head()

Unnamed: 0,backers_count,category,country,created_at,creator,original_currency,current_currency,deadline,disable_communication,goal,...,staff_pick,state,state_changed_at,usd_pledged,usd_type,days_till_change,days_total,deadline_year,deadline_month,deadline_day
0,21,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",US,2013-12-21,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,USD,2014-02-08 22:37:26,False,200.0,...,False,successful,2014-02-08 22:37:26,802.0,international,49,49,2014,2,8
1,97,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",US,2019-02-08,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,USD,2019-03-05 16:00:11,False,400.0,...,False,successful,2019-03-05 16:00:11,2259.0,international,25,25,2019,3,5
2,88,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",US,2016-10-23,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,USD,2016-12-01 15:58:50,False,27224.0,...,True,successful,2016-12-01 15:58:52,29638.0,international,39,39,2016,12,1
3,193,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",IT,2018-10-24,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,USD,2018-12-08 22:59:00,False,40000.0,...,False,successful,2018-12-08 22:59:00,49075.15252,international,45,45,2018,12,8
4,20,"{""id"":51,""name"":""Software"",""slug"":""technology/...",US,2015-03-07,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,USD,2015-04-08 16:36:57,False,1000.0,...,False,failed,2015-04-08 16:36:59,549.0,domestic,32,32,2015,4,8


In [93]:
# Count the projects whose state change and deadline lie the same number of days after creation.
df[df['days_till_change']==df['days_total']]['id'].count()

192653

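A lot of projects (192,653) reach their final state the same number of days after creation as their deadline, i.e. `days_till_change` equals `days_total`. Maybe they were only marked as finished at the deadline, even if the goal was reached earlier.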
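
One way to probe this observation would be to break the share of such projects down by `state`. The sketch below shows the idea on a toy data frame. The first three rows mirror the day counts of projects shown in the preview above, the `canceled` row is invented for illustration.

```python
import pandas as pd

# Toy data: three rows modeled on the preview, one invented canceled project.
toy = pd.DataFrame({
    "state": ["successful", "successful", "failed", "canceled"],
    "days_till_change": [49, 25, 32, 10],
    "days_total": [49, 25, 32, 30],
})

# True where the state change falls on the same day count as the deadline.
ends_at_deadline = toy["days_till_change"] == toy["days_total"]

# Share of such projects within each state.
share_by_state = ends_at_deadline.groupby(toy["state"]).mean()
print(share_by_state)
```

On the real data, a share close to 1 for both `successful` and `failed` would suggest that the state is generally set at the deadline, independent of when the goal was reached.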

In [96]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208516 entries, 0 to 964
Data columns (total 24 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   backers_count          208516 non-null  int64         
 1   category               208516 non-null  object        
 2   country                208516 non-null  object        
 3   created_at             208516 non-null  datetime64[ns]
 4   creator                208516 non-null  object        
 5   original_currency      208516 non-null  object        
 6   current_currency       208516 non-null  object        
 7   deadline               208516 non-null  datetime64[ns]
 8   disable_communication  208516 non-null  bool          
 9   goal                   208516 non-null  float64       
 10  id                     208516 non-null  int64         
 11  is_starrable           208516 non-null  bool          
 12  launched_at            208516 non-null  int64  

In [97]:
df.nunique()

backers_count              3237
category                    169
country                      22
created_at               181898
creator                  207859
original_currency            14
current_currency              1
deadline                 170611
disable_communication         2
goal                       5103
id                       182004
is_starrable                  2
launched_at              181849
location                  15235
name                     181421
pledged                   44293
spotlight                     2
staff_pick                    2
state                         5
state_changed_at         171788
usd_pledged               79114
usd_type                      2
days_till_change           1501
days_total                 1497
dtype: int64