# Machine Learning Project - Kickstarter Data Set
*Contributors: Max Langer, René Ebrecht, Jens Reich*

This is the very first project where we build a machine learning model from scratch based on an unknown dataset.
The dataset includes data from Kickstarter projects from the years 2009 to 2019.

Our goal is to help our (fictional) stakeholder, PPC Consultants, with a model that predicts whether a Kickstarter project will be successful or not.
PPC Consultants advises potential project creators (PPCs) on how to get their projects off the ground as successfully as possible.
Therefore, the value of our data product (the predictive model) is to show opportunities, save time, and in the end make money for both PPC Consultants and the PPCs.

In [1]:
# Import the organization modules
import pandas as pd
import numpy as np
# Import the plot modules
import matplotlib.pyplot as plt
import seaborn as sns
# Import own scripts
from scripts.data_cleaning import read_all_csvs, create_csv, get_nan_cols, convert_to_datetime, calculate_time_periods, get_year_month_day

In [27]:
# Create data frame from all single CSV files
df = read_all_csvs()
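
`read_all_csvs` comes from our own `scripts.data_cleaning` module and its code is not shown in this notebook. A minimal sketch of what such a helper might look like is given below. The function name `read_all_csvs_sketch` and the `folder` parameter are hypothetical, and the real helper may differ.

```python
from pathlib import Path

import pandas as pd


def read_all_csvs_sketch(folder):
    """Read every CSV file in `folder` and stack them into one data frame.

    Hypothetical stand-in for scripts.data_cleaning.read_all_csvs.
    """
    # Sort the paths so the row order is reproducible between runs.
    paths = sorted(Path(folder).glob("*.csv"))
    frames = [pd.read_csv(path) for path in paths]
    # Each file keeps its own row index, so index labels can repeat.
    return pd.concat(frames)
```

Passing `ignore_index=True` to `pd.concat` would instead produce one continuous index across all files.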

In [28]:
# Take a look at the first rows
df.head()

Unnamed: 0,backers_count,blurb,category,converted_pledged_amount,country,created_at,creator,currency,currency_symbol,currency_trailing_code,...,slug,source_url,spotlight,staff_pick,state,state_changed_at,static_usd_rate,urls,usd_pledged,usd_type
0,21,2006 was almost 7 years ago.... Can you believ...,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",802,US,1387659690,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,$,True,...,new-final-round-album,https://www.kickstarter.com/discover/categorie...,True,False,successful,1391899046,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",802.0,international
1,97,An adorable fantasy enamel pin series of princ...,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",2259,US,1549659768,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,$,True,...,princess-pals-enamel-pin-series,https://www.kickstarter.com/discover/categorie...,True,False,successful,1551801611,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",2259.0,international
2,88,Helping a community come together to set the s...,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",29638,US,1477242384,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,$,True,...,their-life-through-their-lens-the-amish-and-me...,https://www.kickstarter.com/discover/categorie...,True,True,successful,1480607932,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",29638.0,international
3,193,Every revolution starts from the bottom and we...,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",49158,IT,1540369920,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,€,False,...,wao-the-eco-effect-shoes,https://www.kickstarter.com/discover/categorie...,True,False,successful,1544309940,1.136525,"{""web"":{""project"":""https://www.kickstarter.com...",49075.15252,international
4,20,Learn to build 10+ Applications in this comple...,"{""id"":51,""name"":""Software"",""slug"":""technology/...",549,US,1425706517,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,$,True,...,apple-watch-development-course,https://www.kickstarter.com/discover/categorie...,False,False,failed,1428511019,1.0,"{""web"":{""project"":""https://www.kickstarter.com...",549.0,domestic


In [29]:
# Prints out the columns with NaNs.
get_nan_cols(df)

Number of NaNs per column:
blurb:8
friends:208922
is_backing:208922
is_starred:208922
location:226
permissions:208922
usd_type:480
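
`get_nan_cols` is also imported from `scripts.data_cleaning`, so its implementation is not visible here. The sketch below shows one way to produce the same kind of report with plain pandas. The name `get_nan_cols_sketch` and the toy data are assumptions for illustration only.

```python
import numpy as np
import pandas as pd


def get_nan_cols_sketch(df):
    """Print and return the NaN count of every column that has at least one NaN.

    Hypothetical stand-in for scripts.data_cleaning.get_nan_cols.
    """
    nan_counts = df.isna().sum()
    nan_counts = nan_counts[nan_counts > 0]
    print("Number of NaNs per column:")
    for column, count in nan_counts.items():
        print(f"{column}:{count}")
    return nan_counts


# Toy data, not taken from the Kickstarter files.
toy = pd.DataFrame({
    "blurb": ["a", None, "c"],
    "goal": [1.0, 2.0, 3.0],
    "friends": [np.nan] * 3,
})
result = get_nan_cols_sketch(toy)
```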


We can get rid of some columns that will probably have no impact on the model later on.
In particular, `friends`, `is_backing`, `is_starred` and `permissions` are almost completely empty (208,922 missing values each), far above the 50% missing-value threshold we use for dropping columns. So there is no information loss here.

Moreover, the following columns don't seem to contain any valuable information for the predictions either: `blurb`, `converted_pledged_amount`, `currency_symbol`, `currency_trailing_code`, `fx_rate`, `photo`, `profile`, `slug`, `source_url`, `static_usd_rate`, `urls`.

The 226 and 480 missing observations for `location` and `usd_type` are so few that we can drop these rows as well.

We also rename `currency` to `original_currency`, as this name says more about its content.

In [32]:
# Remember the original columns so we can report which ones were dropped.
columns = df.columns.to_list()
# Keep only columns where at least 50% of the values are present.
df = df[[column for column in df if df[column].count() / len(df) >= 0.5]]
dropped = [col for col in columns if col not in df.columns]
print("Dropped columns:", " ".join(dropped))

Dropped columns: friends is_backing is_starred permissions
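
The same 50% rule can also be expressed with the `thresh` argument of `DataFrame.dropna`, which keeps a column only if it has at least that many non-missing values. The sketch below uses a small toy frame rather than the Kickstarter data, and the column contents are made up.

```python
import math

import numpy as np
import pandas as pd

# Toy data, not taken from the Kickstarter files.
toy = pd.DataFrame({
    "goal": [100.0, 200.0, 300.0, 400.0],
    "location": ["a", None, "c", "d"],         # 25% missing, kept
    "friends": [np.nan, np.nan, np.nan, "x"],  # 75% missing, dropped
})

# Keep columns with at least 50% non-missing values.
min_non_null = math.ceil(0.5 * len(toy))
kept = toy.dropna(axis=1, thresh=min_non_null)

dropped = [col for col in toy.columns if col not in kept.columns]
print("Dropped columns:", " ".join(dropped))
```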

In [33]:
# Drop the listed columns.
df.drop([
    'blurb', 
    'converted_pledged_amount',
    'currency_symbol', 
    'currency_trailing_code', 
    'fx_rate',
    'photo',
    'profile',
    'slug',
    'source_url', 
    'static_usd_rate',
    'urls'
    ], axis=1, inplace=True)
# Drop the remaining rows that contain NaN values.
df.dropna(axis=0, inplace=True)
# Rename the currency column.
df.rename(columns={'currency':'original_currency'}, inplace=True)

In [34]:
# Prints out the columns with NaNs.
get_nan_cols(df)

Number of NaNs per column:


Now we have no NaNs left in our data set, which is another step towards a cleaned data set.

The columns `created_at`, `state_changed_at` and `deadline` are dates stored as epoch time (Unix timestamps in seconds). We convert them into datetime objects to obtain actual dates.

With these we can calculate two time periods, both counted from the beginning of the project: the days until its state changed (`days_till_change`) and the days until its deadline (`days_total`).

Moreover, from these dates we can then get the year, month and day of the three columns.

In [7]:
# Convert the time columns to datetime types.
df = convert_to_datetime(df)
# Calculate the time periods.
df = calculate_time_periods(df)
# Get the years, months and days as separate columns.
df = get_year_month_day(df)
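
The three helpers above live in `scripts.data_cleaning` and are not shown here. The sketch below illustrates the underlying pandas steps on the `created_at` and `state_changed_at` values of the first two projects from the table further up. Counting whole calendar days from `created_at` is an assumption about how the original helpers work, not a confirmed detail.

```python
import pandas as pd

# Epoch values (seconds) of the first two rows shown in df.head().
toy = pd.DataFrame({
    "created_at": [1387659690, 1549659768],
    "state_changed_at": [1391899046, 1551801611],
})
time_cols = ["created_at", "state_changed_at"]

# Convert epoch seconds to datetime objects (interpreted as UTC).
for col in time_cols:
    toy[col] = pd.to_datetime(toy[col], unit="s")

# Assumption: periods are counted in calendar days, so the time of day
# is cut off with normalize() before subtracting.
toy["days_till_change"] = (
    toy["state_changed_at"].dt.normalize() - toy["created_at"].dt.normalize()
).dt.days

# Split each date into separate year, month and day columns.
for col in time_cols:
    toy[f"{col}_year"] = toy[col].dt.year
    toy[f"{col}_month"] = toy[col].dt.month
    toy[f"{col}_day"] = toy[col].dt.day

print(toy[["days_till_change", "created_at_year", "created_at_month", "created_at_day"]])
```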

In [8]:
df.head()

Unnamed: 0,backers_count,category,country,creator,original_currency,current_currency,disable_communication,goal,id,is_starrable,...,days_total,created_at_year,created_at_month,created_at_day,state_changed_at_year,state_changed_at_month,state_changed_at_day,deadline_year,deadline_month,deadline_day
0,21,"{""id"":43,""name"":""Rock"",""slug"":""music/rock"",""po...",US,"{""id"":1495925645,""name"":""Daniel"",""is_registere...",USD,USD,False,200.0,287514992,False,...,49,2013,12,21,2014,2,8,2014,2,8
1,97,"{""id"":54,""name"":""Mixed Media"",""slug"":""art/mixe...",US,"{""id"":1175589980,""name"":""Katherine"",""slug"":""fr...",USD,USD,False,400.0,385129759,False,...,25,2019,2,8,2019,3,5,2019,3,5
2,88,"{""id"":280,""name"":""Photobooks"",""slug"":""photogra...",US,"{""id"":1196856269,""name"":""MelissaThomas"",""is_re...",USD,USD,False,27224.0,681033598,False,...,39,2016,10,23,2016,12,1,2016,12,1
3,193,"{""id"":266,""name"":""Footwear"",""slug"":""fashion/fo...",IT,"{""id"":1569700626,""name"":""WAO"",""slug"":""wearewao...",EUR,USD,False,40000.0,1031782682,False,...,45,2018,10,24,2018,12,8,2018,12,8
4,20,"{""id"":51,""name"":""Software"",""slug"":""technology/...",US,"{""id"":1870845385,""name"":""Kalpit Jain"",""is_regi...",USD,USD,False,1000.0,904085819,False,...,32,2015,3,7,2015,4,8,2015,4,8


In [10]:
df.shape

(208516, 30)

In [22]:
print("Percentage of the data where the project duration is the same as the time period to reach the goal:")
print((df[df['days_till_change']==df['days_total']].shape[0] / len(df))*100)

Percentage of the data where the project duration is the same as the time period to reach the goal:
92.39243031709798


For about 92% of the projects, the number of days until the state changed equals the total number of days until the deadline. Perhaps their state was simply updated at the deadline rather than at the moment the target was reached.
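
A share like this can also be computed as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The sketch below uses made-up toy values, not the Kickstarter data.

```python
import pandas as pd

# Toy data: three of the four rows have identical periods.
toy = pd.DataFrame({
    "days_till_change": [49, 25, 10, 45],
    "days_total": [49, 25, 39, 45],
})

# Boolean mask: True where both periods are equal.
same_duration = toy["days_till_change"] == toy["days_total"]
# The mean of a boolean Series is the fraction of True values.
share = same_duration.mean() * 100
print(f"{share:.1f}% of the rows have identical periods")
```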

Let's look further into the data types of the other columns.

In [23]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 208516 entries, 0 to 964
Data columns (total 30 columns):
 #   Column                  Non-Null Count   Dtype  
---  ------                  --------------   -----  
 0   backers_count           208516 non-null  int64  
 1   category                208516 non-null  object 
 2   country                 208516 non-null  object 
 3   creator                 208516 non-null  object 
 4   original_currency       208516 non-null  object 
 5   current_currency        208516 non-null  object 
 6   disable_communication   208516 non-null  bool   
 7   goal                    208516 non-null  float64
 8   id                      208516 non-null  int64  
 9   is_starrable            208516 non-null  bool   
 10  launched_at             208516 non-null  int64  
 11  location                208516 non-null  object 
 12  name                    208516 non-null  object 
 13  pledged                 208516 non-null  float64
 14  spotlight              

In [43]:
print(df['category'].iloc[0], '\n')
print(df['creator'].iloc[0], '\n')
print(df['location'].iloc[0], '\n')

{"id":43,"name":"Rock","slug":"music/rock","position":17,"parent_id":14,"color":10878931,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/music/rock"}}} 

{"id":1495925645,"name":"Daniel","is_registered":null,"chosen_currency":null,"avatar":{"thumb":"https://ksr-ugc.imgix.net/assets/006/041/047/c44d1a95c2139ae46af635c7c6e7ea76_original.jpg?ixlib=rb-1.1.0&w=40&h=40&fit=crop&v=1461362658&auto=format&frame=1&q=92&s=3d655afafac9dbb59c1e675adfa87082","small":"https://ksr-ugc.imgix.net/assets/006/041/047/c44d1a95c2139ae46af635c7c6e7ea76_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1461362658&auto=format&frame=1&q=92&s=3973d24f5c3db1ed1d5c84cec8af1d6d","medium":"https://ksr-ugc.imgix.net/assets/006/041/047/c44d1a95c2139ae46af635c7c6e7ea76_original.jpg?ixlib=rb-1.1.0&w=160&h=160&fit=crop&v=1461362658&auto=format&frame=1&q=92&s=3973d24f5c3db1ed1d5c84cec8af1d6d"},"urls":{"web":{"user":"https://www.kickstarter.com/profile/1495925645"},"api":{"user":"https://api.kick

The object-type columns `category`, `creator` and `location` are oddly formatted: they appear to hold JSON-encoded strings rather than plain values.
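
Such strings can be decoded with Python's `json` module. The sketch below parses the `category` value printed above and pulls out two fields. The new column names `category_name` and `category_parent` are hypothetical, and whether every row decodes this cleanly is an assumption.

```python
import json

import pandas as pd

# The category string of the first project, as printed above.
raw = (
    '{"id":43,"name":"Rock","slug":"music/rock","position":17,'
    '"parent_id":14,"color":10878931,"urls":{"web":{"discover":'
    '"http://www.kickstarter.com/discover/categories/music/rock"}}}'
)
toy = pd.DataFrame({"category": [raw]})

# Decode each JSON string into a Python dictionary.
parsed = toy["category"].apply(json.loads)

# Extract the sub-category name and the parent category from the slug.
toy["category_name"] = parsed.apply(lambda d: d["name"])
toy["category_parent"] = parsed.apply(lambda d: d["slug"].split("/")[0])

print(toy[["category_name", "category_parent"]])
```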

In [97]:
df.nunique()

backers_count              3237
category                    169
country                      22
created_at               181898
creator                  207859
original_currency            14
current_currency              1
deadline                 170611
disable_communication         2
goal                       5103
id                       182004
is_starrable                  2
launched_at              181849
location                  15235
name                     181421
pledged                   44293
spotlight                     2
staff_pick                    2
state                         5
state_changed_at         171788
usd_pledged               79114
usd_type                      2
days_till_change           1501
days_total                 1497
dtype: int64
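
Two things stand out in this output: `current_currency` has only one unique value, and `id` has fewer unique values (182004) than the data frame has rows (208516), so some ids occur more than once. A sketch of how one might inspect both cases is shown below on made-up toy data. Whether such columns or rows should be removed is a separate decision.

```python
import pandas as pd

# Toy data, not taken from the Kickstarter files.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "current_currency": ["USD", "USD", "USD", "USD"],
    "goal": [200.0, 400.0, 400.0, 1000.0],
})

# Columns with a single unique value carry no information for a model.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print("Constant columns:", constant_cols)

# Rows whose id occurs more than once (all occurrences are marked).
repeated_ids = toy[toy.duplicated(subset="id", keep=False)]
print("Rows with repeated ids:", len(repeated_ids))
```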