Python Packages

In [1]:
import pandas as pd
import seaborn as sns
import sklearn
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

<b>PROBLEM</b>

* Investing in a company carries the risk that the company fails to deliver what it promised, wasting the investor's capital.

<b>GOAL</b>

* To build a model that both investors and individual backers can use, whether they want to back a single product or invest a large sum in companies listed on Kickstarter, by predicting a project's chance of success from Kickstarter data.


"Scrubbing" is the term I use for cleaning the rough edges off the data.
This notebook consists of:

1. Null handling<br>
2. Dropping problematic and duplicate columns<br>
3. Unifying the date and datetime columns, and adding a days-available feature that measures the time between launch and deadline
4. Removing the Live and Canceled states and recategorizing Suspended and Undefined as failed, leaving only successful and failed, because those two are the states we want to predict

In [2]:
df = pd.read_csv('ks-projects-201801.csv',index_col=0)

In [3]:
df

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.00
1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.00
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.00
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.00
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,USD,2014-10-17,50000.0,2014-09-17 02:35:30,25.0,canceled,1,US,25.0,25.0,50000.00
999977640,The Tribe,Narrative Film,Film & Video,USD,2011-07-19,1500.0,2011-06-22 03:35:14,155.0,failed,5,US,155.0,155.0,1500.00
999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,USD,2010-08-16,15000.0,2010-07-01 19:40:30,20.0,failed,1,US,20.0,20.0,15000.00
999987933,BioDefense Education Kit,Technology,Technology,USD,2016-02-13,15000.0,2016-01-13 18:13:53,200.0,failed,6,US,200.0,200.0,15000.00


<b>TYPE CHECKING</b>

In [4]:
df.dtypes

name                 object
category             object
main_category        object
currency             object
deadline             object
goal                float64
launched             object
pledged             float64
state                object
backers               int64
country              object
usd pledged         float64
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

<b>NULL CHECKING</b>

In [5]:
df.isnull().sum()

name                   4
category               0
main_category          0
currency               0
deadline               0
goal                   0
launched               0
pledged                0
state                  0
backers                0
country                0
usd pledged         3797
usd_pledged_real       0
usd_goal_real          0
dtype: int64

<b>DROPPING 'usd pledged'</b>

Column descriptions on Kaggle:<br>
* usd pledged = Pledged amount in USD (conversion made by KS)<br>
* usd_pledged_real = Pledged amount in USD (conversion made by fixer.io api)

'usd pledged' serves the same purpose as usd_pledged_real but is more problematic: it has 3797 missing values, while usd_pledged_real has none.
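
The comparison between the two conversion columns can be checked directly. The following is a minimal sketch on a small made-up frame (the name `toy` and its values are illustrative, not the real dataset); it counts missing values per column and the rows where both conversions are present but disagree.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the two conversion columns; values are for illustration only.
toy = pd.DataFrame({
    'usd pledged': [0.0, 100.0, np.nan, 1283.0],
    'usd_pledged_real': [0.0, 2421.0, 50.0, 1283.0],
})

# Missing values per column.
null_counts = toy.isnull().sum()

# Rows where both values are present but the two conversions disagree.
both_present = toy.notna().all(axis=1)
disagree = both_present & ~np.isclose(toy['usd pledged'], toy['usd_pledged_real'])

print(null_counts)
print('rows that disagree:', int(disagree.sum()))
```

Running the same two checks on `df` before the drop would show how often the columns differ in the actual data.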

<b>DROPPING 'pledged'</b>

'pledged' is still in the project's local currency. Since we're going to use usd_pledged_real, we'll drop 'pledged'.

<b>DROPPING 'goal'</b>

* 'goal' is the original goal in the local currency. To keep everything consistent, we'll use the USD-converted value in the usd_goal_real column instead.

<b>DROPPING 'currency'</b>

* 'currency' tells us little beyond the project's country of origin. Since everything is converted to USD, currency is no longer relevant.

https://www.kaggle.com/kemical/kickstarter-projects?select=ks-projects-201801.csv

In [6]:
# Drop the local-currency and redundant columns in a single call
df.drop(columns=['usd pledged', 'goal', 'currency', 'pledged'], inplace=True)

In [7]:
df

ID,name,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.00
1000004038,Where is Hank?,Narrative Film,Film & Video,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.00
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.00
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.00
...,...,...,...,...,...,...,...,...,...,...
999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,2014-10-17,2014-09-17 02:35:30,canceled,1,US,25.0,50000.00
999977640,The Tribe,Narrative Film,Film & Video,2011-07-19,2011-06-22 03:35:14,failed,5,US,155.0,1500.00
999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,2010-08-16,2010-07-01 19:40:30,failed,1,US,20.0,15000.00
999987933,BioDefense Education Kit,Technology,Technology,2016-02-13,2016-01-13 18:13:53,failed,6,US,200.0,15000.00


In [8]:
df.isnull().sum()

name                4
category            0
main_category       0
deadline            0
launched            0
state               0
backers             0
country             0
usd_pledged_real    0
usd_goal_real       0
dtype: int64

<b>Deleting the remaining nulls in the 'name' feature</b>

* The only NaNs left are in the name column. Since there are just four of them, they can be dropped without hurting the rest of the dataset.

In [9]:
null = df[df.isna().any(axis=1)]

In [10]:
null

ID,name,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
1848699072,,Narrative Film,Film & Video,2012-02-29,2012-01-01 12:35:31,failed,1,US,100.0,200000.0
634871725,,Video Games,Games,2013-01-06,2012-12-19 23:57:48,failed,12,GB,316.05,3224.97
648853978,,Product Design,Design,2016-07-18,2016-06-18 05:01:47,suspended,0,US,0.0,2500.0
796533179,,Painting,Art,2011-12-05,2011-11-06 23:55:55,failed,5,US,220.0,35000.0


Because the remaining nulls are all in the name column, and none of the four projects was successfully funded (three failed and one was suspended), I will drop the four rows above.
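
A slightly more defensive variant of this step restricts the drop to the `name` column, so that a null appearing in some other column would not silently remove rows. This is only a sketch on a made-up frame (`toy`, `missing_name`, and `cleaned` are illustrative names), not the code used in this notebook:

```python
import numpy as np
import pandas as pd

# Toy frame with one missing name; values are for illustration only.
toy = pd.DataFrame({
    'name': ['Project A', np.nan, 'Project C'],
    'state': ['failed', 'suspended', 'successful'],
    'backers': [1, 0, 224],
})

# Inspect the rows that would be removed before removing them.
missing_name = toy[toy['name'].isna()]

# Drop only the rows whose name is missing.
cleaned = toy.dropna(subset=['name'])

print(len(missing_name), 'row(s) without a name')
print(cleaned)
```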

In [11]:
df.dropna(inplace=True)

In [12]:
df.isnull().sum()

name                0
category            0
main_category       0
deadline            0
launched            0
state               0
backers             0
country             0
usd_pledged_real    0
usd_goal_real       0
dtype: int64

Unifying the dates in 'launched' and 'deadline' (convert both to datetime and remove the timestamp), plus adding a new feature that counts the days available between launch and deadline

In [13]:
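# Parse both columns as datetimes (they were read as plain strings)
df[['launched','deadline']] = df[['launched','deadline']].apply(pd.to_datetime)
# Whole days between launch and deadline, computed while 'launched' still has its time of day
df['time_avail'] = (df['deadline'] - df['launched']).dt.days
# Keep only the calendar date of the launch
df['launched'] = df['launched'].dt.date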
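
The order of these statements matters: `time_avail` is computed while `launched` still carries its time of day, and `.dt.days` keeps only the whole-day component of the difference. A minimal sketch using the dates of the first row shown earlier (the variable names are illustrative):

```python
import pandas as pd

deadline = pd.Series(pd.to_datetime(['2015-10-09']))
launched = pd.Series(pd.to_datetime(['2015-08-11 12:12:28']))

# Launch time of day kept: the gap is 58 days 11:47:32, so whole days = 58.
with_time = (deadline - launched).dt.days

# Launch time of day dropped first: the gap is exactly 59 days.
date_only = (deadline - launched.dt.normalize()).dt.days

print(with_time.iloc[0], date_only.iloc[0])  # 58 59
```

So a project launched partway through a day is credited with one day less than the calendar-date difference, which matches the value of 58 in the table.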

In [14]:
df

ID,name,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_avail
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,2015-10-09,2015-08-11,failed,0,GB,0.0,1533.95,58
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,2017-11-01,2017-09-02,failed,15,US,2421.0,30000.00,59
1000004038,Where is Hank?,Narrative Film,Film & Video,2013-02-26,2013-01-12,failed,3,US,220.0,45000.00,44
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,2012-04-16,2012-03-17,failed,1,US,1.0,5000.00,29
1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,2015-08-29,2015-07-04,canceled,14,US,1283.0,19500.00,55
...,...,...,...,...,...,...,...,...,...,...,...
999976400,ChknTruk Nationwide Charity Drive 2014 (Canceled),Documentary,Film & Video,2014-10-17,2014-09-17,canceled,1,US,25.0,50000.00,29
999977640,The Tribe,Narrative Film,Film & Video,2011-07-19,2011-06-22,failed,5,US,155.0,1500.00,26
999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,2010-08-16,2010-07-01,failed,1,US,20.0,15000.00,45
999987933,BioDefense Education Kit,Technology,Technology,2016-02-13,2016-01-13,failed,6,US,200.0,15000.00,30


In [15]:
df.to_csv('non_null.csv')

In [16]:
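# Remove projects that are still running or were canceled; .copy() avoids
# a SettingWithCopyWarning when 'state' is reassigned in the next cell
df = df[~df['state'].isin(['live', 'canceled'])].copy()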

In [17]:
df['state'] = df['state'].replace(['suspended','undefined'],'failed')
df['state'].unique()

array(['failed', 'successful'], dtype=object)
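
The filtering and recoding can be sanity-checked on a small made-up frame. This sketch assumes the same six state labels as the dataset (`toy` and `kept` are illustrative names) and uses `isin` to remove both unwanted states in one pass:

```python
import pandas as pd

# One row per state label; for illustration only.
toy = pd.DataFrame({
    'state': ['failed', 'successful', 'live', 'canceled', 'suspended', 'undefined'],
})

# Drop projects that are still running or were canceled.
kept = toy[~toy['state'].isin(['live', 'canceled'])].copy()

# Fold the remaining non-success states into 'failed'.
kept['state'] = kept['state'].replace(['suspended', 'undefined'], 'failed')

print(kept['state'].value_counts())
```

Calling `value_counts()` on the real `df['state']` in the same way would also show how imbalanced the two remaining classes are.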

In [18]:
df

ID,name,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real,time_avail
1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,2015-10-09,2015-08-11,failed,0,GB,0.0,1533.95,58
1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,2017-11-01,2017-09-02,failed,15,US,2421.0,30000.00,59
1000004038,Where is Hank?,Narrative Film,Film & Video,2013-02-26,2013-01-12,failed,3,US,220.0,45000.00,44
1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,2012-04-16,2012-03-17,failed,1,US,1.0,5000.00,29
1000014025,Monarch Espresso Bar,Restaurants,Food,2016-04-01,2016-02-26,successful,224,US,52375.0,50000.00,34
...,...,...,...,...,...,...,...,...,...,...,...
999975836,"Homemade fresh dog food, Cleveland OH",Small Batch,Food,2017-04-19,2017-03-20,failed,4,US,154.0,6500.00,29
999977640,The Tribe,Narrative Film,Film & Video,2011-07-19,2011-06-22,failed,5,US,155.0,1500.00,26
999986353,Walls of Remedy- New lesbian Romantic Comedy f...,Narrative Film,Film & Video,2010-08-16,2010-07-01,failed,1,US,20.0,15000.00,45
999987933,BioDefense Education Kit,Technology,Technology,2016-02-13,2016-01-13,failed,6,US,200.0,15000.00,30


In [19]:
df.to_csv('2state.csv')