# Kickstarter

What makes a project on Kickstarter successful?

## Import stuff

In [156]:
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
import seaborn as sns

plt.style.use('fivethirtyeight')

%matplotlib inline

## Load the data

In [157]:
data = pd.read_csv('./DSI_kickstarterscrape_dataset.csv')

## Look at the data

Take an initial look at the data, especially the columns.

In [158]:
data.head(3)

Unnamed: 0,project id,name,url,category,subcategory,location,status,goal,pledged,funded percentage,backers,funded date,levels,reward levels,updates,comments,duration
0,39409,WHILE THE TREES SLEEP,http://www.kickstarter.com/projects/emiliesaba...,Film & Video,Short Film,"Columbia, MO",successful,10500.0,11545.0,1.099524,66,"Fri, 19 Aug 2011 19:28:17 -0000",7,"$25,$50,$100,$250,$500,$1,000,$2,500",10,2,30.0
1,126581,Educational Online Trading Card Game,http://www.kickstarter.com/projects/972789543/...,Games,Board & Card Games,"Maplewood, NJ",failed,4000.0,20.0,0.005,2,"Mon, 02 Aug 2010 03:59:00 -0000",5,"$1,$5,$10,$25,$50",6,0,47.18
2,138119,STRUM,http://www.kickstarter.com/projects/185476022/...,Film & Video,Animation,"Los Angeles, CA",live,20000.0,56.0,0.0028,3,"Fri, 08 Jun 2012 00:00:31 -0000",10,"$1,$10,$25,$40,$50,$100,$250,$1,000,$1,337,$9,001",1,0,28.0


How many rows and columns?

In [159]:
data.shape

(45957, 17)

Get more info about the data

In [160]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45957 entries, 0 to 45956
Data columns (total 17 columns):
project id           45957 non-null int64
name                 45957 non-null object
url                  45957 non-null object
category             45957 non-null object
subcategory          45957 non-null object
location             44635 non-null object
status               45957 non-null object
goal                 45957 non-null float64
pledged              45945 non-null float64
funded percentage    45957 non-null float64
backers              45957 non-null int64
funded date          45957 non-null object
levels               45957 non-null int64
reward levels        45898 non-null object
updates              45957 non-null int64
comments             45957 non-null int64
duration             45957 non-null float64
dtypes: float64(4), int64(5), object(8)
memory usage: 6.0+ MB


Get more info about missing data

In [161]:
data.isnull().sum()

project id              0
name                    0
url                     0
category                0
subcategory             0
location             1322
status                  0
goal                    0
pledged                12
funded percentage       0
backers                 0
funded date             0
levels                  0
reward levels          59
updates                 0
comments                0
duration                0
dtype: int64
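
In the sample rows shown earlier, `funded percentage` equals `pledged / goal` (for example, 11545 / 10500 ≈ 1.0995). One possible way to fill the 12 missing `pledged` values is therefore to reconstruct them from the other two columns. The sketch below uses a small toy frame whose third row is invented. It assumes the relation holds for every row, which would need to be checked on the real data first. Filling missing locations with `"Unknown"` is also only one option among several.

```python
import numpy as np
import pandas as pd

# Toy frame: the first two rows mirror the sample above, the third is invented.
toy = pd.DataFrame({
    "goal": [10500.0, 4000.0, 2000.0],
    "pledged": [11545.0, 20.0, np.nan],
    "funded percentage": [1.099524, 0.005, 0.5],
    "location": ["Columbia, MO", "Maplewood, NJ", None],
})

# Assumption: funded percentage == pledged / goal for every row.
reconstructed = toy["goal"] * toy["funded percentage"]
toy["pledged"] = toy["pledged"].fillna(reconstructed)

# A missing location cannot be derived from other columns, so label it explicitly.
toy["location"] = toy["location"].fillna("Unknown")

print(toy)
```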

Look for duplicates

In [162]:
data.duplicated().sum()

89
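
A minimal sketch of how the duplicated rows could be inspected and dropped, shown on an invented toy frame rather than the real data. Whether the 89 duplicates should simply be dropped is a judgment call this sketch does not settle.

```python
import pandas as pd

# Invented rows; the second and third are exact duplicates.
toy = pd.DataFrame({
    "project id": [1, 2, 2, 3],
    "name": ["a", "b", "b", "c"],
})

# keep=False marks every member of a duplicate group, which helps when inspecting them.
dupes = toy[toy.duplicated(keep=False)]

# drop_duplicates keeps the first occurrence of each row by default.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(len(dupes), len(deduped))
```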

Get statistical info about the data (might need to be redone after other things are fixed with the data). Here we can also see signs of outliers.

In [193]:
data.describe()

Unnamed: 0,project id,goal,pledged,funded percentage,backers,levels,updates,comments,duration,weekday_nr,nr_in_month,month_nr,year
count,45957.0,45957.0,45945.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0
mean,1080800000.0,11942.71,4980.75,1.850129,69.973192,8.004939,4.08508,8.379529,39.995547,4.126488,14.915203,6.087255,2011.199513
std,621805700.0,188758.3,56741.62,88.492706,688.628479,4.233907,6.43922,174.015737,17.414458,2.019067,9.118437,3.185649,0.75283
min,39409.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2009.0
25%,543896200.0,1800.0,196.0,0.044,5.0,5.0,0.0,0.0,30.0,2.0,7.0,4.0,2011.0
50%,1078345000.0,4000.0,1310.0,1.0,23.0,7.0,2.0,0.0,32.0,4.0,15.0,6.0,2011.0
75%,1621596000.0,9862.0,4165.0,1.11564,59.0,10.0,6.0,3.0,48.39,6.0,23.0,9.0,2012.0
max,2147460000.0,21474840.0,10266840.0,15066.0,87142.0,80.0,149.0,19311.0,91.96,7.0,31.0,12.0,2012.0


### What I need to fix

- Category column: make dummy variables out of the categories, or use group by?
- Subcategory column: make dummy variables out of the subcategories, or use group by?
- Location column: split into 2 new columns (city and state), then make dummies or use group by?
- Status column: encode the status as numbers (1 = successful, 2 = failed, 3 = live).
- Reward levels column: make one column for each level (dummy?) and make the values numeric.
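
Some of the items in the list above could be approached roughly as follows. This is a sketch on a toy frame built from the three sample rows, not the final cleaning code. The new column names `status_nr`, `city` and `state` are chosen here for illustration. Any status not present in the mapping becomes `NaN`.

```python
import pandas as pd

toy = pd.DataFrame({
    "category": ["Film & Video", "Games", "Film & Video"],
    "location": ["Columbia, MO", "Maplewood, NJ", "Los Angeles, CA"],
    "status": ["successful", "failed", "live"],
})

# Status as numbers, using the coding from the list above.
status_codes = {"successful": 1, "failed": 2, "live": 3}
toy["status_nr"] = toy["status"].map(status_codes)

# Split "City, ST" on the last ", " so a city name containing a comma stays intact.
parts = toy["location"].str.rsplit(", ", n=1, expand=True)
toy["city"] = parts[0]
toy["state"] = parts[1]

# One indicator column per category.
dummies = pd.get_dummies(toy["category"], prefix="category")
toy = pd.concat([toy, dummies], axis=1)

print(toy.columns.tolist())
```

Dummies suit a model that needs numeric input, while `groupby` on the original column is usually enough for comparing success rates.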


Handle missing values.<BR/>
Look for duplicates.<BR/>
Look for outliers.<BR/>
...

## Work the data (EDA/Munging)

### Duplicates

#### Handle

### Missing values

### Check for outliers

In [194]:
data.describe()

Unnamed: 0,project id,goal,pledged,funded percentage,backers,levels,updates,comments,duration,weekday_nr,nr_in_month,month_nr,year
count,45957.0,45957.0,45945.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0,45957.0
mean,1080800000.0,11942.71,4980.75,1.850129,69.973192,8.004939,4.08508,8.379529,39.995547,4.126488,14.915203,6.087255,2011.199513
std,621805700.0,188758.3,56741.62,88.492706,688.628479,4.233907,6.43922,174.015737,17.414458,2.019067,9.118437,3.185649,0.75283
min,39409.0,0.01,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,2009.0
25%,543896200.0,1800.0,196.0,0.044,5.0,5.0,0.0,0.0,30.0,2.0,7.0,4.0,2011.0
50%,1078345000.0,4000.0,1310.0,1.0,23.0,7.0,2.0,0.0,32.0,4.0,15.0,6.0,2011.0
75%,1621596000.0,9862.0,4165.0,1.11564,59.0,10.0,6.0,3.0,48.39,6.0,23.0,9.0,2012.0
max,2147460000.0,21474840.0,10266840.0,15066.0,87142.0,80.0,149.0,19311.0,91.96,7.0,31.0,12.0,2012.0
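
The gap between the 75th percentile and the maximum in columns such as `goal`, `pledged` and `comments` suggests extreme values. One common way to flag them is the 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask` and the sample values are made up for illustration, and the rule itself is only one of several reasonable choices for data this skewed.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return True for values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Illustrative goals: four modest values and one extreme value.
goals = pd.Series([1800.0, 4000.0, 5000.0, 9862.0, 21474840.0])
mask = iqr_outlier_mask(goals)

print(goals[mask])
```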


### funded date column: Split column into different columns

In [164]:
data['funded date'].head()

0    Fri, 19 Aug 2011 19:28:17 -0000
1    Mon, 02 Aug 2010 03:59:00 -0000
2    Fri, 08 Jun 2012 00:00:31 -0000
3    Sun, 08 Apr 2012 02:14:00 -0000
4    Wed, 01 Jun 2011 15:25:39 -0000
Name: funded date, dtype: object

In [165]:
new_date = data['funded date'].str.split(", ", n=1, expand=True)
data["weekday"]= new_date[0] 
data["date"]= new_date[1] 

In [166]:
new_date2 = data['date'].str.split(" ", n=1, expand=True)
data["nr_in_month"]= new_date2[0] 
data["date2"]= new_date2[1] 

In [167]:
new_date3 = data['date2'].str.split(" ", n=1, expand=True)
data["month"]= new_date3[0] 
data["date3"]= new_date3[1] 

In [168]:
new_date4 = data['date3'].str.split(" ", n=1, expand=True)
data["year"]= new_date4[0] 
data["time"]= new_date4[1] 

In [169]:
# remove columns not needed
data = data.drop(['date', 'date2', 'date3'], axis=1)

In [170]:
# Convert the weekday to a number (1 = Monday, 2 = Tuesday, etc.)
data['weekday_nr'] = data['weekday'].replace({'Mon': 1, 'Tue': 2, 'Wed': 3, 'Thu': 4, 'Fri': 5, 'Sat': 6, 'Sun':7})

In [171]:
# Convert the month abbreviation to a zero-padded number string (Jan = '01', Feb = '02', etc.)
data['month_nr'] = data['month'].replace({'Jan': '01', 'Feb': '02', 'Mar': '03', 'Apr': '04', 'May': '05', 
                                          'Jun': '06', 'Jul':'07', 'Aug': '08', 'Sep': '09', 'Oct': '10', 
                                          'Nov': '11', 'Dec': '12'})

In [172]:
# Combine month_nr, nr_in_month and year (all strings) into one date string

In [173]:
data['date'] = data['month_nr'] + '/' + data['nr_in_month'] + '/' + data['year']

In [174]:
data.head(3)

Unnamed: 0,project id,name,url,category,subcategory,location,status,goal,pledged,funded percentage,...,comments,duration,weekday,nr_in_month,month,year,time,weekday_nr,month_nr,date
0,39409,WHILE THE TREES SLEEP,http://www.kickstarter.com/projects/emiliesaba...,Film & Video,Short Film,"Columbia, MO",successful,10500.0,11545.0,1.099524,...,2,30.0,Fri,19,Aug,2011,19:28:17 -0000,5,8,08/19/2011
1,126581,Educational Online Trading Card Game,http://www.kickstarter.com/projects/972789543/...,Games,Board & Card Games,"Maplewood, NJ",failed,4000.0,20.0,0.005,...,0,47.18,Mon,2,Aug,2010,03:59:00 -0000,1,8,08/02/2010
2,138119,STRUM,http://www.kickstarter.com/projects/185476022/...,Film & Video,Animation,"Los Angeles, CA",live,20000.0,56.0,0.0028,...,0,28.0,Fri,8,Jun,2012,00:00:31 -0000,5,6,06/08/2012


In [175]:
# Convert the date string to datetime format

In [176]:
data['date'] = pd.to_datetime(data['date'], format='%m/%d/%Y')
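
As an alternative to the repeated string splits, the original `funded date` strings could probably be parsed in a single step, with the parts read from the `.dt` accessor. This is a sketch on the three sample values, not a drop-in replacement. It assumes every row follows the same pattern and that the day and month abbreviations are in English, since `%a` and `%b` are locale-dependent.

```python
import pandas as pd

raw = pd.Series([
    "Fri, 19 Aug 2011 19:28:17 -0000",
    "Mon, 02 Aug 2010 03:59:00 -0000",
    "Fri, 08 Jun 2012 00:00:31 -0000",
])

# Parse weekday, day, month, year, time and UTC offset in one step.
parsed = pd.to_datetime(raw, format="%a, %d %b %Y %H:%M:%S %z", utc=True)

parts = pd.DataFrame({
    "weekday_nr": parsed.dt.dayofweek + 1,  # Monday = 1 ... Sunday = 7
    "nr_in_month": parsed.dt.day,
    "month_nr": parsed.dt.month,
    "year": parsed.dt.year,
    "hour": parsed.dt.hour,
})

print(parts)
```

The resulting columns are already numeric, so no separate `pd.to_numeric` step is needed.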

In [178]:
# Reorder the columns of the DataFrame
data = data[['project id', 'name', 'url', 'category', 'subcategory', 'location', 'status', 'goal', 'pledged', 
    'funded percentage', 'backers', 'funded date', 'levels', 'reward levels', 'updates', 'comments', 'duration', 
    'weekday', 'weekday_nr', 'nr_in_month', 'month', 'month_nr', 'year', 'time', 'date']]

data.head(3)

Unnamed: 0,project id,name,url,category,subcategory,location,status,goal,pledged,funded percentage,...,comments,duration,weekday,weekday_nr,nr_in_month,month,month_nr,year,time,date
0,39409,WHILE THE TREES SLEEP,http://www.kickstarter.com/projects/emiliesaba...,Film & Video,Short Film,"Columbia, MO",successful,10500.0,11545.0,1.099524,...,2,30.0,Fri,5,19,Aug,8,2011,19:28:17 -0000,2011-08-19
1,126581,Educational Online Trading Card Game,http://www.kickstarter.com/projects/972789543/...,Games,Board & Card Games,"Maplewood, NJ",failed,4000.0,20.0,0.005,...,0,47.18,Mon,1,2,Aug,8,2010,03:59:00 -0000,2010-08-02
2,138119,STRUM,http://www.kickstarter.com/projects/185476022/...,Film & Video,Animation,"Los Angeles, CA",live,20000.0,56.0,0.0028,...,0,28.0,Fri,5,8,Jun,6,2012,00:00:31 -0000,2012-06-08


In [183]:
# Convert the date parts from strings to numbers
data['nr_in_month'] = pd.to_numeric(data['nr_in_month'])
data['year'] = pd.to_numeric(data['year'])
data['month_nr'] = pd.to_numeric(data['month_nr'])

In [190]:
# remove dollar signs ('$' is a regex anchor, so match it as a literal character)
data['reward levels'] = data['reward levels'].str.replace('$', '', regex=False)

In [None]:
# need to split into different cells if I'm using
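
The reward levels strings use commas both as separators and as thousands separators (`$1,000`), so a plain split on `,` would break the larger amounts apart. One way around this, sketched below, is to split only on commas that are directly followed by `$`. That means it has to run on the original strings, before the dollar signs are removed. The helper name `parse_reward_levels` is made up for this sketch.

```python
import re

def parse_reward_levels(text):
    """Split a reward-levels string into a list of dollar amounts as floats."""
    if not isinstance(text, str):  # missing values arrive as NaN
        return []
    # Split only on commas that are followed by "$", so "$1,000" stays whole.
    pieces = re.split(r",(?=\$)", text)
    return [float(p.replace("$", "").replace(",", "")) for p in pieces]

levels = parse_reward_levels("$25,$50,$100,$250,$500,$1,000,$2,500")
print(levels)
```

For the first sample row this yields 7 amounts, which matches the value in the `levels` column and gives a simple consistency check. From the resulting lists it would be easy to derive numeric features such as the lowest or highest reward level.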

In [191]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45957 entries, 0 to 45956
Data columns (total 25 columns):
project id           45957 non-null int64
name                 45957 non-null object
url                  45957 non-null object
category             45957 non-null object
subcategory          45957 non-null object
location             44635 non-null object
status               45957 non-null object
goal                 45957 non-null float64
pledged              45945 non-null float64
funded percentage    45957 non-null float64
backers              45957 non-null int64
funded date          45957 non-null object
levels               45957 non-null int64
reward levels        45898 non-null object
updates              45957 non-null int64
comments             45957 non-null int64
duration             45957 non-null float64
weekday              45957 non-null object
weekday_nr           45957 non-null int64
nr_in_month          45957 non-null int64
month                45957 non-nul

### name column: Turn all letters to lower case

In [192]:
data['name'] = data['name'].str.lower()
data.head(3)

Unnamed: 0,project id,name,url,category,subcategory,location,status,goal,pledged,funded percentage,...,comments,duration,weekday,weekday_nr,nr_in_month,month,month_nr,year,time,date
0,39409,while the trees sleep,http://www.kickstarter.com/projects/emiliesaba...,Film & Video,Short Film,"Columbia, MO",successful,10500.0,11545.0,1.099524,...,2,30.0,Fri,5,19,Aug,8,2011,19:28:17 -0000,2011-08-19
1,126581,educational online trading card game,http://www.kickstarter.com/projects/972789543/...,Games,Board & Card Games,"Maplewood, NJ",failed,4000.0,20.0,0.005,...,0,47.18,Mon,1,2,Aug,8,2010,03:59:00 -0000,2010-08-02
2,138119,strum,http://www.kickstarter.com/projects/185476022/...,Film & Video,Animation,"Los Angeles, CA",live,20000.0,56.0,0.0028,...,0,28.0,Fri,5,8,Jun,6,2012,00:00:31 -0000,2012-06-08


## Were the initial findings themselves correct?

#### Best length for a campaign?
Goal of more than $6,000: 35 days

Goal of less than $6,000: a shorter duration is better
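
A claim like this could be checked by comparing success rates per goal bracket and duration bin. The sketch below uses invented rows and arbitrary bin edges, so its numbers say nothing about the real data. It only shows the shape of the computation, and it assumes live projects are excluded beforehand.

```python
import pandas as pd

# Invented rows for illustration; only finished projects are included.
toy = pd.DataFrame({
    "goal": [3000, 3000, 5000, 8000, 8000, 20000],
    "duration": [20, 60, 25, 35, 80, 33],
    "status": ["successful", "failed", "successful",
               "successful", "failed", "successful"],
})

toy["success"] = toy["status"].eq("successful").astype(int)
toy["goal_bracket"] = toy["goal"].gt(6000).map({False: "<= $6,000", True: "> $6,000"})
toy["duration_bin"] = pd.cut(
    toy["duration"], bins=[0, 30, 45, 92], labels=["<=30", "30-45", ">45"]
)

# Share of successful projects in each (goal bracket, duration bin) combination.
rates = toy.groupby(["goal_bracket", "duration_bin"], observed=True)["success"].mean()

print(rates)
```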

#### Different pledge goals at low and high amounts
For both intervals, a lower pledge goal seems to be more successful. (No surprise.)

#### Some types of campaign are more successful than others
<u>Category</u>

Dance, theater, music (most successful)

Fashion, publishing, technology (fashion is by far the least successful)


<u>Subcategories of the Music category</u>

Indie rock, country & folk, jazz, classical music (most successful)

Hip hop, electronic music, world music (hip hop is by far the least successful)


#### When to launch?

Month: February

Day: Monday

Time: 9:00 am (EST)

(The variables were studied individually, so a Monday in February at 9:00 am is not necessarily the best time.)
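
The caveat about studying the variables individually can be illustrated with a small invented example. Here February and Monday each have the highest success rate on their own, yet the February-Monday combination does not. The counts are made up to show the effect and are not taken from the Kickstarter data.

```python
import pandas as pd

# Invented counts: (month_nr, weekday_nr, success)
rows = (
    [(2, 1, 1)] * 1 + [(2, 1, 0)] * 1    # February, Monday: 1 of 2 succeed
    + [(2, 5, 1)] * 4                    # February, Friday: 4 of 4
    + [(6, 1, 1)] * 4                    # June, Monday: 4 of 4
    + [(6, 5, 0)] * 4                    # June, Friday: 0 of 4
)
toy = pd.DataFrame(rows, columns=["month_nr", "weekday_nr", "success"])

# Marginal rates: each variable on its own.
by_month = toy.groupby("month_nr")["success"].mean()
by_weekday = toy.groupby("weekday_nr")["success"].mean()

# Joint rates: month and weekday together.
joint = toy.groupby(["month_nr", "weekday_nr"])["success"].mean()

print(by_month, by_weekday, joint, sep="\n\n")
```

Grouping by both columns at once, as in `joint`, is one way to check whether the individually best values still hold up in combination.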

#### More comments, more success

Double the success rate with 1-2 comments

Triple the success rate with 10 comments
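
A rough way to check a claim like this is to bucket projects by comment count and compare each bucket's success rate with the zero-comment bucket. The rows and bucket edges below are invented for illustration, so the resulting ratios are not findings about the real data.

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "comments": [0, 0, 0, 0, 1, 2, 2, 10, 12, 40],
    "status": ["failed", "failed", "failed", "successful", "successful",
               "failed", "successful", "successful", "successful", "successful"],
})
toy["success"] = toy["status"].eq("successful").astype(int)

# Buckets: exactly 0, 1-2, 3-9, and 10 or more comments.
bins = [-1, 0, 2, 9, float("inf")]
labels = ["0", "1-2", "3-9", "10+"]
toy["comment_bucket"] = pd.cut(toy["comments"], bins=bins, labels=labels)

# Success rate per bucket, then relative to projects without comments.
rate = toy.groupby("comment_bucket", observed=True)["success"].mean()
ratio = rate / rate.loc["0"]

print(ratio)
```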