# Kickstarter Project Dataset

Sven XXX & Christoph Blickle - neueFische GmbH Camp Cologne 2020

[Wikipedia Article says](https://en.wikipedia.org/wiki/Kickstarter):<br>
"Kickstarter is an American public benefit corporation based in Brooklyn, New York, that maintains a global crowdfunding platform focused on creativity.<br>
The company's stated mission is to "help bring creative projects to life".
As of December 2019, Kickstarter has received more than 4.6 billion dollars in pledges from 17.2 million backers to fund 445,000 projects, such as films, music, stage shows, comics, journalism, video games, technology, publishing, and food-related projects.
People who back Kickstarter projects are offered tangible rewards or experiences in exchange for their pledges.
This model traces its roots to subscription model of arts patronage, where artists would go directly to their audiences to fund their work."
<br>
## Goal of this notebook
In this Notebook we will take a look at a Kickstarter Project Dataset (\*This is a strongly modified version of a widely used dataset, which was used for practice in a Data Science Course\*). It contains a variety of parameters and possible factors, which influence the outcome of a Project. <br>

What we need to do:

- [ ] Import the data, which is split into 56 individual csv-files
- [ ] Clean the data
- [ ] Save and export
- [ ] Exploratory Data Analysis
- [ ] Try at least 3 different machine learning algorithms
- [ ] Give recommendations base upon findings

### Data Import

In [42]:
# Import modules
import pandas as pd
import numpy as np

In [3]:
# Import the data and combine into one Dataframe
li = [] # create empty list
for i in range(0,56): # 56 individual files
    li.append(pd.read_csv(f'data/Kickstarter0{str(i).zfill(2)}.csv')) # Datasetname = 'Kickstarter000.csv'
df = pd.concat(li, axis=0, ignore_index = True) # Put them all together into one dataframe

***
- [x] Import the data, which is split into 56 individual csv-files
- [ ] Clean the data
- [ ] Save and export
- [ ] Exploratory Data Analysis
- [ ] Try at least 3 different machine learning algorithms
- [ ] Give recommendations base upon findings
***

### Data Cleaning

In [4]:
# Overview of the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 209222 entries, 0 to 209221
Data columns (total 37 columns):
backers_count               209222 non-null int64
blurb                       209214 non-null object
category                    209222 non-null object
converted_pledged_amount    209222 non-null int64
country                     209222 non-null object
created_at                  209222 non-null int64
creator                     209222 non-null object
currency                    209222 non-null object
currency_symbol             209222 non-null object
currency_trailing_code      209222 non-null bool
current_currency            209222 non-null object
deadline                    209222 non-null int64
disable_communication       209222 non-null bool
friends                     300 non-null object
fx_rate                     209222 non-null float64
goal                        209222 non-null float64
id                          209222 non-null int64
is_backing                  300 

*** 
This is quite a big dataset with 20922 rows and 37 columns. We can already see that some columns are missing a lot of data. <br>
The next step is to remove unneeded or unusable features. 

**Features to be dropped:**

- 'permissions', 'is_backing', 'is_starred', 'friends':   
    -> only 300 values and most of them are NaNs 
    
- 'slug', 'source_url', 'url':  
    -> contained information can also be found in 'category' and 'name'
    
- 'creator', 'id', 'profile':  
    -> information about the creator is useless for us 
    
- 'currency_symbol', 'currency_trailing_code', 'current_currency', 'usd_type', 'static_usd_rate':   
    -> redundant information about used currencies and rates. We will only use the fx rate to convert the goal into usd 
    
- 'disable_communication', 'is_starrable', 'photo', 'location', 'pledged', 'spotlight':  
    -> redundant information 
    
- 'state_changed_at': <br>
    -> same as deadline
***

In [5]:
# Drop features
df.drop(['permissions', 'slug', 'source_url', 'urls', 'creator', 'currency_symbol', 'currency_trailing_code', 
        'current_currency', 'usd_pledged', 'state_changed_at'
        'disable_communication', 'id', 'is_starrable', 'photo', 'location', 'pledged',
        'profile', 'spotlight', 'static_usd_rate', 'usd_type', 
        'is_backing', 'is_starred', 'friends'], axis=1, inplace=True)

***
Let's look at the state of the projects.
***

In [6]:
print(df.state.unique())

['successful' 'failed' 'live' 'canceled' 'suspended']


***
Only successful or failed states are interesting to us for this specific project. We decided to include the 'canceled' state as fail.
***

In [7]:
# Drop all projects that are anything but successful or failed/canceled
df = df.query("state == 'successful' or state == 'failed' or state == 'canceled'")
df.reset_index(inplace=True, drop=True) # reset the index

In [8]:
# Let's look at the Dataset again
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 201288 entries, 0 to 201287
Data columns (total 15 columns):
backers_count               201288 non-null int64
blurb                       201280 non-null object
category                    201288 non-null object
converted_pledged_amount    201288 non-null int64
country                     201288 non-null object
created_at                  201288 non-null int64
currency                    201288 non-null object
deadline                    201288 non-null int64
fx_rate                     201288 non-null float64
goal                        201288 non-null float64
launched_at                 201288 non-null int64
name                        201288 non-null object
staff_pick                  201288 non-null bool
state                       201288 non-null object
state_changed_at            201288 non-null int64
dtypes: bool(1), float64(2), int64(6), object(6)
memory usage: 21.7+ MB


***
We can look for remaining NaNs in the dataset.
***

In [17]:
# Find NaNs 
print(f'There are {df.isnull().sum().sum()} NaNs left in this dataset')

There are 8 NaNs left in this dataset


In [19]:
# We can safely drop these and proceed
df.dropna(inplace=True)

### Data extraction and some feature engineering
We have now removed all unnecessary data.
Next up let's look at the format of some of the data and improve it for interpretation. <br>
We will start by extracting the category.

In [32]:
# Let's look at the category variable
print(df.category[0])

# Looks like a dictionary, but wait! It's actually a string
print(type(df.category[0]))

{"id":43,"name":"Rock","slug":"music/rock","position":17,"parent_id":14,"color":10878931,"urls":{"web":{"discover":"http://www.kickstarter.com/discover/categories/music/rock"}}}
<class 'str'>


***
If we look at the dictionary we can find our category, in this case 'music', in the slug key. Before we can access and split the string we need to create an actual dictionary
***

In [20]:
# We can use the eval() function to extract the whole dict from the string
df.category = df.category.map(lambda x: eval(x))

# Then we split the 'slug' string and take only the first entry, which will be the main category the porject was posted under
df.category = df.category.map(lambda x: x['slug'].partition('/')[0])

In [23]:
# Let's see if it worked
print(df.category.unique())

['music' 'art' 'photography' 'fashion' 'technology' 'publishing' 'games'
 'food' 'theater' 'dance' 'crafts' 'journalism' 'film & video' 'comics'
 'design']


***
Let's look at the timeformat.
***

In [26]:
# Print time related variables
print(df.deadline[0])
print(df.created_at[0])
print(df.launched_at[0])

1391899046
1387659690
1388011046


***
Not very readable. Let's change that.
***

In [27]:
# We use the datetime to convert the timestamps into actual dates
# Deadline
df.deadline = pd.to_datetime(df.deadline, unit='s')
df.deadline = df.deadline.dt.date

# Created at
df.created_at = pd.to_datetime(df.created_at, unit='s')
df.created_at = df.created_at.dt.date

# Launched at 
df.launched_at = pd.to_datetime(df.launched_at, unit='s')
df.launched_at = df.launched_at.dt.date

***
While we're on it, we can quickly create some new variables, that give us clear information about how long a project ran for and when it was actually launched after it's creation on the site
***

In [28]:
#create new features fot total runtime and forerun time
df['days_until_launch'] = df.launched_at - df.created_at
df['days_total'] = df.deadline - df.launched_at 

In [29]:
#transform to integer
df['days_total'] = df['days_total'].dt.days.astype('int16')
df['days_until_launch'] = df['days_until_launch'].dt.days.astype('int16')

***
We still need to convert the goal into usd.
***

In [34]:
# goal is given in the native currency, so we have to multiply it with the conversion rate for usd
df['converted_goal_amount'] = df.goal * df.fx_rate
df.converted_goal_amount = df.converted_goal_amount.astype('int64')

***
Then we also still need to change the 'state' variables into 1s and 0s. While we're at it we will also change the 'staff_pick' into 1s and 0s.
***

In [35]:
# Convert state to 0s and 1s
df.replace(to_replace=['canceled', 'failed', 'successful'], value=[0, 0, 1], inplace=True)

# Staff_pick is already given as True and False so we simply have to make them integers
df.staff_pick = df.staff_pick.astype('int16')

***
We don't want to work with plane text at this point in time. So we will change the 'blurb' and 'name' into length variables. (The blurb is the quick description of the project. Like a covertext for a book)
***

In [36]:
# Create new features blurb_length and name_length
df['blurb_length'] = df.blurb.map(lambda x: len(x))
df['name_length'] = df.name.map(lambda x: len(x))
df.drop(['blurb', 'name', 'goal', 'fx_rate'], axis=1, inplace=True)

***
We're finished we the cleaning and extraction, as well as a little bit of feature engineering. Let's take one last look at the set and then export it.

- [x] Import the data, which is split into 56 individual csv-files
- [x] Clean the data
- [ ] Save and export
- [ ] Exploratory Data Analysis
- [ ] Try at least 3 different machine learning algorithms
- [ ] Give recommendations base upon findings
***

In [37]:
df.head()

Unnamed: 0,backers_count,category,converted_pledged_amount,country,created_at,currency,deadline,launched_at,staff_pick,state,state_changed_at,days_until_launch,days_total,converted_goal_amount,blurb_length,name_length
0,21,music,802,US,2013-12-21,USD,2014-02-08,2013-12-25,0,1,1391899046,4,45,200,134,21
1,97,art,2259,US,2019-02-08,USD,2019-03-05,2019-02-13,0,1,1551801611,5,20,400,55,31
2,88,photography,29638,US,2016-10-23,USD,2016-12-01,2016-11-01,1,1,1480607932,9,30,27224,135,60
3,193,fashion,49158,IT,2018-10-24,EUR,2018-12-08,2018-10-27,0,1,1544309940,3,42,45137,75,25
4,20,technology,549,US,2015-03-07,USD,2015-04-08,2015-03-09,0,0,1428511019,2,30,1000,133,30


## Save and export the  data

We will export 3 different files. First the whole dataset as one. Then we will split it into a Training and validation set for the machine learning algorithms.

In [38]:
# Save dataframe as a whole
df.to_csv('KickstarterData_full.csv')

In [39]:
# Split the dataframe
from sklearn.model_selection import train_test_split
X = df.drop('state', axis=1)
y = df.state
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42, stratify=y) # We chose to stratify here as there is a slight imbalance in the classes

In [40]:
# Combine features and target value back to together
Trainset = pd.concat([X_train, y_train], axis=1)
Trainset = Trainset.reset_index(drop=True)
Testset = pd.concat([X_test, y_test], axis=1)
Testset = Testset.reset_index(drop=True)

In [41]:
# Export datasets
Trainset.to_csv('Kickstarter_Train.csv', index=False)
Testset.to_csv('Kickstarter_Validation.csv', index=False)

***
- [x] Import the data, which is split into 56 individual csv-files
- [x] Clean the data
- [x] Save and export
- [ ] Exploratory Data Analysis
- [ ] Try at least 3 different machine learning algorithms
- [ ] Give recommendations base upon findings
***