# Team Project - Kickstarter Project

* Authors: Julia Hammerer, Vanessa Mai
* Last Update: 15.06.2018



## Project Description
In this project we look at a list of crowdfunding projects pulled from the Kickstarter website in 2018. The analysis is mainly exploratory and may include, among other things:
* comparing successful and failed projects per country and category
* looking at the size (funding amount) of projects
* the value of successful projects, and whether it differs from that of failed ones
* time series analysis
* how much people donate to projects on average

Here we import all packages required for our analysis.

In [31]:
import pandas as pd
from datetime import datetime

First, we import our CSV file as a pandas DataFrame.


In [32]:
ks=pd.read_csv("ks-projects-201801.csv")

In [33]:
ks

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.00,failed,0,GB,0.00,0.00,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.00,failed,15,US,100.00,2421.00,30000.00
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.00,failed,3,US,220.00,220.00,45000.00
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.00,failed,1,US,1.00,1.00,5000.00
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.00,canceled,14,US,1283.00,1283.00,19500.00
5,1000014025,Monarch Espresso Bar,Restaurants,Food,USD,2016-04-01,50000.0,2016-02-26 13:38:27,52375.00,successful,224,US,52375.00,52375.00,50000.00
6,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,USD,2014-12-21,1000.0,2014-12-01 18:30:44,1205.00,successful,16,US,1205.00,1205.00,1000.00
7,1000030581,Chaser Strips. Our Strips make Shots their B*tch!,Drinks,Food,USD,2016-03-17,25000.0,2016-02-01 20:05:12,453.00,failed,40,US,453.00,453.00,25000.00
8,1000034518,SPIN - Premium Retractable In-Ear Headphones w...,Product Design,Design,USD,2014-05-29,125000.0,2014-04-24 18:14:43,8233.00,canceled,58,US,8233.00,8233.00,125000.00
9,100004195,STUDIO IN THE SKY - A Documentary Feature Film...,Documentary,Film & Video,USD,2014-08-10,65000.0,2014-07-11 21:55:48,6240.57,canceled,43,US,6240.57,6240.57,65000.00


Looking at the data, we see several fields that hold monetary amounts. The fields "pledged" and "goal" are in the project's original currency. There are also "usd pledged", "usd_pledged_real" and "usd_goal_real". "usd pledged" is the pledged amount converted to US dollars by Kickstarter. According to the dataset description on Kaggle, "usd_pledged_real" and "usd_goal_real" were converted using fixer.io by tonyplaysguitar.
We'll use these two fields, since they give us both the pledged amount and the goal in US dollars, and remove the other amount columns.


In [34]:
# remove unused columns, we don't need name and id, and the mentioned amounts
# we also do not need currency, as we have everything in US-Dollar
ks=ks.drop(["ID", "name", "goal", "pledged", "usd pledged", "currency"], axis=1)

# we take a look at the datatypes, to look if we need to convert any fields to the appropriate data type
ks.dtypes 

category             object
main_category        object
deadline             object
launched             object
state                object
backers               int64
country              object
usd_pledged_real    float64
usd_goal_real       float64
dtype: object

We see that all the fields we need in numeric form have already been detected as numeric by pandas.
However, to work with the dates correctly, we need to convert "launched" and "deadline" to a datetime data type.

In [35]:
# infer_datetime_format is deprecated since pandas 2.0; format inference is now the default
ks["launched"] = pd.to_datetime(ks["launched"])
ks["deadline"] = pd.to_datetime(ks["deadline"])
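
If automatic format detection ever turns out to be slow or ambiguous, the formats can also be stated explicitly. The following is a minimal sketch on a two-row toy frame; the frame `demo` is made up for illustration, and the format strings are an assumption based on the sample rows shown above.

```python
import pandas as pd

# Toy rows in the same layout as the "launched" and "deadline" columns (illustrative only)
demo = pd.DataFrame({
    "launched": ["2015-08-11 12:12:28", "2017-09-02 04:43:57"],
    "deadline": ["2015-10-09", "2017-11-01"],
})

# Explicit formats avoid guessing; errors="coerce" turns unparsable values into NaT
demo["launched"] = pd.to_datetime(demo["launched"], format="%Y-%m-%d %H:%M:%S", errors="coerce")
demo["deadline"] = pd.to_datetime(demo["deadline"], format="%Y-%m-%d", errors="coerce")

# Campaign duration in whole days, as a quick plausibility check
demo["duration_days"] = (demo["deadline"] - demo["launched"]).dt.days
print(demo.dtypes)
print(demo["duration_days"].tolist())
```

Counting the `NaT` values afterwards (for example with `isna().sum()`) would show how many entries did not match the assumed format.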

In [46]:
# check for open projects: compare each deadline with its launch date (True = deadline after launch)
closed=ks["deadline"]>ks["launched"]
closed.value_counts()

True    378661
dtype: int64

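We see that every deadline lies after the corresponding launch date. Note that this comparison only confirms that the dates are consistent; it does not tell us whether a project has already ended, and the "state" column still contains a small number of projects marked as live. As a prospect, we could use open projects to predict whether a project will be successful or not. The next step is to look at the data more closely and get some basic information about it. For this we use the package pandas-profiling. See the documentation [here](https://github.com/pandas-profiling/pandas-profiling).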
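
The "state" column offers a second way to check for open projects: campaigns that are still running carry the label `live` there. A minimal sketch on toy data follows; the frame `demo` and its rows are invented for illustration, only the state labels are taken from the dataset.

```python
import pandas as pd

# Toy rows (illustrative only); the state labels are the ones used in the dataset
demo = pd.DataFrame({
    "state": ["failed", "successful", "live", "canceled", "live"],
    "deadline": pd.to_datetime(
        ["2015-10-09", "2016-04-01", "2018-02-15", "2014-05-29", "2018-03-03"]
    ),
})

# Count projects per state; "live" marks campaigns that were still running
state_counts = demo["state"].value_counts()
print(state_counts)

# Boolean mask for open projects and a frame restricted to finished ones
is_open = demo["state"] == "live"
finished = demo[~is_open]
print(is_open.sum(), "open,", len(finished), "finished")
```

Applied to `ks`, the same mask could be used to set open projects aside before comparing successful and failed campaigns.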


In [50]:
import pandas_profiling

In [51]:
pandas_profiling.ProfileReport(ks)

Overview: dataset info
Statistic,Value
Number of variables,9
Number of observations,378661
Total Missing (%),0.0%
Total size in memory,26.0 MiB
Average record size in memory,72.0 B

Overview: variable types
Type,Count
Numeric,3
Categorical,4
Boolean,0
Date,2
Text (Unique),0
Rejected,0
Unsupported,0

Variable: backers (Numeric)
Statistic,Value
Distinct count,3963
Unique (%),1.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,105.62
Minimum,0
Maximum,219382
Zeros (%),14.7%

backers: quantile statistics
Statistic,Value
Minimum,0
5-th percentile,0
Q1,2
Median,12
Q3,56
95-th percentile,334
Maximum,219382
Range,219382
Interquartile range,54

backers: descriptive statistics
Statistic,Value
Standard deviation,907.19
Coef of variation,8.5893
Kurtosis,13955
Mean,105.62
MAD,146.7
Skewness,86.763
Sum,39993219
Variance,822980
Memory size,2.9 MiB

backers: common values
Value,Count,Frequency (%)
0,55609,14.7%
1,34869,9.2%
2,23196,6.1%
3,16063,4.2%
4,12068,3.2%
5,9716,2.6%
6,8137,2.1%
7,7014,1.9%
8,6198,1.6%
9,5553,1.5%

backers: minimum 5 values
Value,Count,Frequency (%)
0,55609,14.7%
1,34869,9.2%
2,23196,6.1%
3,16063,4.2%
4,12068,3.2%

backers: maximum 5 values
Value,Count,Frequency (%)
87142,1,0.0%
91585,1,0.0%
105857,1,0.0%
154926,1,0.0%
219382,1,0.0%
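
To make the summary figures above easier to interpret, here is a minimal sketch of how several of them can be computed with plain pandas. The series below is a small made-up sample standing in for the "backers" column, and the dictionary keys are illustrative names, not the report's internals.

```python
import pandas as pd

# Small made-up sample standing in for the "backers" column
backers = pd.Series([0, 0, 1, 2, 12, 56, 334], name="backers")

q1, median, q3 = backers.quantile([0.25, 0.5, 0.75])
summary = {
    "mean": backers.mean(),
    "median": median,
    "iqr": q3 - q1,                                        # interquartile range = Q3 - Q1
    "coef_of_variation": backers.std() / backers.mean(),   # standard deviation / mean
    "zeros_pct": (backers == 0).mean() * 100,              # share of zero values in percent
    "skewness": backers.skew(),
    "range": backers.max() - backers.min(),
}
for name, value in summary.items():
    print(f"{name}: {float(value):.2f}")
```

A mean far above the median, together with a large positive skewness, indicates a long right tail: a few projects attract a very large number of backers.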

Variable: category (Categorical)
Statistic,Value
Distinct count,159
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

category: top values
Value,Count
Product Design,22314
Documentary,16139
Music,15727
Other values (156),324481

category: common values
Value,Count,Frequency (%)
Product Design,22314,5.9%
Documentary,16139,4.3%
Music,15727,4.2%
Tabletop Games,14180,3.7%
Shorts,12357,3.3%
Video Games,11830,3.1%
Food,11493,3.0%
Film & Video,10108,2.7%
Fiction,9169,2.4%
Fashion,8554,2.3%

Variable: country (Categorical)
Statistic,Value
Distinct count,23
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

country: top values
Value,Count
US,292627
GB,33672
CA,14756
Other values (20),37606

country: common values
Value,Count,Frequency (%)
US,292627,77.3%
GB,33672,8.9%
CA,14756,3.9%
AU,7839,2.1%
DE,4171,1.1%
"N,0""",3797,1.0%
FR,2939,0.8%
IT,2878,0.8%
NL,2868,0.8%
ES,2276,0.6%

Variable: deadline (Date)
Statistic,Value
Distinct count,3164
Unique (%),0.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Minimum,2009-05-03 00:00:00
Maximum,2018-03-03 00:00:00

Variable: launched (Date)
Statistic,Value
Distinct count,378089
Unique (%),99.8%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Minimum,1970-01-01 01:00:00
Maximum,2018-01-02 15:02:31

Variable: main_category (Categorical)
Statistic,Value
Distinct count,15
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

main_category: top values
Value,Count
Film & Video,63585
Music,51918
Publishing,39874
Other values (12),223284

main_category: common values
Value,Count,Frequency (%)
Film & Video,63585,16.8%
Music,51918,13.7%
Publishing,39874,10.5%
Games,35231,9.3%
Technology,32569,8.6%
Design,30070,7.9%
Art,28153,7.4%
Food,24602,6.5%
Fashion,22816,6.0%
Theater,10913,2.9%

Variable: state (Categorical)
Statistic,Value
Distinct count,6
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

state: top values
Value,Count
failed,197719
successful,133956
canceled,38779
Other values (3),8207

state: common values
Value,Count,Frequency (%)
failed,197719,52.2%
successful,133956,35.4%
canceled,38779,10.2%
undefined,3562,0.9%
live,2799,0.7%
suspended,1846,0.5%

Variable: usd_goal_real (Numeric)
Statistic,Value
Distinct count,50339
Unique (%),13.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,45454
Minimum,0.01
Maximum,166360000
Zeros (%),0.0%

usd_goal_real: quantile statistics
Statistic,Value
Minimum,0.01
5-th percentile,400.0
Q1,2000.0
Median,5500.0
Q3,15500.0
95-th percentile,80000.0
Maximum,166360000.0
Range,166360000.0
Interquartile range,13500.0

usd_goal_real: descriptive statistics
Statistic,Value
Standard deviation,1153000
Coef of variation,25.365
Kurtosis,7082.9
Mean,45454
MAD,66705
Skewness,78.221
Sum,17212000000
Variance,1329300000000
Memory size,2.9 MiB

usd_goal_real: common values
Value,Count,Frequency (%)
5000.0,24173,6.4%
10000.0,20786,5.5%
1000.0,13029,3.4%
3000.0,12699,3.4%
2000.0,11915,3.1%
15000.0,11374,3.0%
20000.0,10121,2.7%
2500.0,9849,2.6%
500.0,8588,2.3%
25000.0,8364,2.2%

usd_goal_real: minimum 5 values
Value,Count,Frequency (%)
0.01,2,0.0%
0.15,1,0.0%
0.49,1,0.0%
0.5,1,0.0%
0.55,1,0.0%

usd_goal_real: maximum 5 values
Value,Count,Frequency (%)
104057189.83,1,0.0%
107369867.72,1,0.0%
110169771.62,1,0.0%
151395869.92,1,0.0%
166361390.71,1,0.0%

Variable: usd_pledged_real (Numeric)
Statistic,Value
Distinct count,106065
Unique (%),28.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

Statistic,Value
Mean,9058.9
Minimum,0
Maximum,20339000
Zeros (%),13.9%

usd_pledged_real: quantile statistics
Statistic,Value
Minimum,0.0
5-th percentile,0.0
Q1,31.0
Median,624.33
Q3,4050.0
95-th percentile,28090.0
Maximum,20339000.0
Range,20339000.0
Interquartile range,4019.0

usd_pledged_real: descriptive statistics
Statistic,Value
Standard deviation,90973
Coef of variation,10.042
Kurtosis,11796
Mean,9058.9
MAD,13153
Skewness,82.188
Sum,3430300000
Variance,8276100000
Memory size,2.9 MiB

usd_pledged_real: common values
Value,Count,Frequency (%)
0.0,52527,13.9%
1.0,6678,1.8%
10.0,3633,1.0%
25.0,3455,0.9%
50.0,2937,0.8%
5.0,2584,0.7%
100.0,2461,0.6%
20.0,2354,0.6%
2.0,1700,0.4%
30.0,1655,0.4%

usd_pledged_real: minimum 5 values
Value,Count,Frequency (%)
0.0,52527,13.9%
0.45,1,0.0%
0.47,1,0.0%
0.48,2,0.0%
0.49,5,0.0%

usd_pledged_real: maximum 5 values
Value,Count,Frequency (%)
10266845.74,1,0.0%
12393139.69,1,0.0%
12779843.49,1,0.0%
13285226.36,1,0.0%
20338986.27,1,0.0%

Sample
Unnamed: 0,category,main_category,deadline,launched,state,backers,country,usd_pledged_real,usd_goal_real
0,Poetry,Publishing,2015-10-09,2015-08-11 12:12:28,failed,0,GB,0.0,1533.95
1,Narrative Film,Film & Video,2017-11-01,2017-09-02 04:43:57,failed,15,US,2421.0,30000.0
2,Narrative Film,Film & Video,2013-02-26,2013-01-12 00:20:50,failed,3,US,220.0,45000.0
3,Music,Music,2012-04-16,2012-03-17 03:24:11,failed,1,US,1.0,5000.0
4,Film & Video,Film & Video,2015-08-29,2015-07-04 08:35:03,canceled,14,US,1283.0,19500.0
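
Two of the questions from the project description, the share of successful projects per category and the average donation per backer, could be approached along the following lines. This is only a sketch on a toy frame: the rows in `demo` are illustrative and not taken from the dataset, and restricting the comparison to "successful" and "failed" projects is an assumption.

```python
import pandas as pd

# Toy rows shaped like the cleaned frame (values are illustrative only)
demo = pd.DataFrame({
    "main_category": ["Food", "Food", "Music", "Music", "Music"],
    "state": ["successful", "failed", "failed", "successful", "successful"],
    "backers": [224, 40, 0, 16, 10],
    "usd_pledged_real": [52375.0, 453.0, 0.0, 1205.0, 500.0],
})

# Keep only campaigns with a final outcome, then compute the share of successes per category
finished = demo[demo["state"].isin(["successful", "failed"])]
success_rate = (finished["state"] == "successful").groupby(finished["main_category"]).mean()
print(success_rate)

# Average pledge per backer over all projects; projects without backers are excluded
with_backers = demo[demo["backers"] > 0]
avg_pledge = with_backers["usd_pledged_real"].sum() / with_backers["backers"].sum()
print(round(avg_pledge, 2))
```

Grouping by "country" instead of "main_category" gives the per-country comparison in the same way.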
