# Final Project : Kickstarter Marketing 

In this project we are looking at a dataset called "Kickstarter Project" that looks at data for over 300,000 kickstarter projects. The data was collected from the Kickstarter Platform by Michael Mouille. The data was posted [here].(https://www.kaggle.com/kemical/kickstarter-projects)

In our analysis, we will seek to answer the following questions: 

 - Who should we market? (meaning, who is worth our time and money to push our marketing efforts in order to have a successful crowdsourcing campaign?)
- Why should we market them?


In [111]:
# import the libraries that will be necessary for our work with this dataset 
import pandas as pd
import numpy as np
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


In [112]:
# read in the data set we will be working with 
market_df = pd.read_csv('datasets/Kickstartercombo.csv')
market_df.head()

Unnamed: 0.1,Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers,usd pledged,usd_goal_real,usd_pledged_real
0,0,1000002330,The Songs of Adelaide & Abullah,Publishing,Poetry,GB,failed,1000.0,0.0,GBP,2015-08-11 12:12:00,2015-10-09 11:36:00,0,0.0,,
1,1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3,220.0,,
2,2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1,1.0,,
3,3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14,1283.0,,
4,4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224,52375.0,,


In [113]:
#look at the column headers that we have for our dataset 
market_df.columns

Index(['Unnamed: 0', 'ID', 'name', 'main_category', 'category', 'country',
       'state', 'goal', 'pledged', 'currency', 'launched', 'deadline',
       'backers', 'usd pledged', 'usd_goal_real', 'usd_pledged_real'],
      dtype='object')

In [114]:
# assess how large the dataset is 
market_df.shape

(702411, 16)

In [115]:
# find out if the dataset contains any missing values 
market_df.isnull().sum()

Unnamed: 0               0
ID                       0
name                     8
main_category            0
category                 0
country                  0
state                    0
goal                     0
pledged                  0
currency                 0
launched                 0
deadline                 0
backers                  0
usd pledged           7594
usd_goal_real       323750
usd_pledged_real    323750
dtype: int64

We can see that markdown_df is missing values in a few rows: 
- 'name' : 8 missing values 
- 'usd pledged' : 7594 missing values 
- 'usd_goal_real' : 323750 missing values 
- 'usd_pledged_real' : 323750 missing values 

We will need to decide how to handle these missing values. The rest of the columns do not have any missing values. 

In [116]:
# it may be useful to focus just on the data that is presented in USD 

just_usd = market_df[market_df['currency'] == 'USD']
just_usd.head()

Unnamed: 0.1,Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers,usd pledged,usd_goal_real,usd_pledged_real
1,1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3,220.0,,
2,2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1,1.0,,
3,3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14,1283.0,,
4,4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224,52375.0,,
5,5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,US,successful,1000.0,1205.0,USD,2014-12-01 18:30:00,2014-12-21 18:30:00,16,1205.0,,


In [117]:
#lets see how many rows of data we have left with just projects funded in USD 
just_usd.shape

(556182, 16)

In [118]:
#since all of our data is now in USD, we can drop the columns that contain conversion values: 'usd pledged','usd_goal_real', and 'usd_pledged_real'

just_usd = just_usd.drop(['usd pledged','usd_goal_real', 'usd_pledged_real'], axis = 1)
just_usd.head()

Unnamed: 0.1,Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers
1,1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3
2,2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1
3,3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14
4,4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224
5,5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,US,successful,1000.0,1205.0,USD,2014-12-01 18:30:00,2014-12-21 18:30:00,16


In [119]:
# We may not need the ID column info for our analysis, but lets make sure to eliminate duplicate rows (if any) before dropping the ID column. 

dup = just_usd['ID'].duplicated()
dup.head()

1    False
2    False
3    False
4    False
5    False
Name: ID, dtype: bool

In [120]:
just_usd['duplicate'] = dup
just_usd.head()

Unnamed: 0.1,Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers,duplicate
1,1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3,False
2,2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1,False
3,3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14,False
4,4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224,False
5,5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,US,successful,1000.0,1205.0,USD,2014-12-01 18:30:00,2014-12-21 18:30:00,16,False


In [121]:
#dup_dict = {"False" : 0, "True": 1}
#just_usd['duplicate'] = just_usd["duplicate"].map({"False" : '0', "True": '1'})

In [122]:
just_usd = just_usd.drop(['Unnamed: 0'], axis = 1)
just_usd.head()

Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3,False
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1,False
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14,False
4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224,False
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,US,successful,1000.0,1205.0,USD,2014-12-01 18:30:00,2014-12-21 18:30:00,16,False


In [123]:
# lets fill in the missing values in the name column with "unknown"
just_usd['name'] = just_usd['name'].fillna('unknown')
just_usd.head()

Unnamed: 0,ID,name,main_category,category,country,state,goal,pledged,currency,launched,deadline,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,Narrative Film,US,failed,45000.0,220.0,USD,2013-01-12 00:20:00,2013-02-26 00:20:00,3,False
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,US,failed,5000.0,1.0,USD,2012-03-17 03:24:00,2012-04-16 04:24:00,1,False
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,US,canceled,19500.0,1283.0,USD,2015-07-04 08:35:00,2015-08-29 01:00:00,14,False
4,1000014025,Monarch Espresso Bar,Food,Restaurants,US,successful,50000.0,52375.0,USD,2016-02-26 13:38:00,2016-04-01 13:38:00,224,False
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,Food,US,successful,1000.0,1205.0,USD,2014-12-01 18:30:00,2014-12-21 18:30:00,16,False


In [124]:
#lets make sure we're not missing any other values 
just_usd.isnull().sum()

ID               0
name             0
main_category    0
category         0
country          0
state            0
goal             0
pledged          0
currency         0
launched         0
deadline         0
backers          0
duplicate        0
dtype: int64

In [125]:
#Let's look at the values distribution grouped by their 'state'(whether or not the project succeeded)

just_usd['state'].value_counts()

failed        286129
successful    205359
canceled       52876
undefined       5140
live            4457
suspended       2221
Name: state, dtype: int64

In [126]:
# view our data by states, to see which states have the most duplicate IDs
table = just_usd['duplicate'].groupby(just_usd['state']).sum()
print(table)

state
canceled       24714.0
failed        135568.0
live               2.0
successful     96949.0
suspended       1014.0
undefined       2570.0
Name: duplicate, dtype: float64


In [127]:
# view our data 
#table2 = just_usd['state'].groupby(just_usd['state', 'main_category']).count()
#print(table2)

In [128]:
reg_vars = just_usd.drop(['category', 'currency', 'launched', 'deadline'], axis = 1)
reg_vars.head()

Unnamed: 0,ID,name,main_category,country,state,goal,pledged,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,US,failed,45000.0,220.0,3,False
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,US,failed,5000.0,1.0,1,False
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,US,canceled,19500.0,1283.0,14,False
4,1000014025,Monarch Espresso Bar,Food,US,successful,50000.0,52375.0,224,False
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,US,successful,1000.0,1205.0,16,False


In [129]:
reg_vars['country'].unique()

array(['US', 'N,"0', 'N,0"'], dtype=object)

In [130]:
#
def usa(x):
    if x == 'US': 
        return 1
    else:
        return 0
reg_vars['country'] = reg_vars['country'].apply(usa)
reg_vars.head()

Unnamed: 0,ID,name,main_category,country,state,goal,pledged,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,1,failed,45000.0,220.0,3,False
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,1,failed,5000.0,1.0,1,False
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,1,canceled,19500.0,1283.0,14,False
4,1000014025,Monarch Espresso Bar,Food,1,successful,50000.0,52375.0,224,False
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,1,successful,1000.0,1205.0,16,False


In [131]:
def dup(x):
    if x == 'True': 
        return 1
    else:
        return 0
reg_vars['duplicate'] = reg_vars['duplicate'].apply(dup)
reg_vars.head()

Unnamed: 0,ID,name,main_category,country,state,goal,pledged,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,1,failed,45000.0,220.0,3,0
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,1,failed,5000.0,1.0,1,0
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,1,canceled,19500.0,1283.0,14,0
4,1000014025,Monarch Espresso Bar,Food,1,successful,50000.0,52375.0,224,0
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,1,successful,1000.0,1205.0,16,0


In [132]:
def state(x):
    if x == 'successful': 
        return 1
    else:
        return 0
reg_vars['state'] = reg_vars['state'].apply(state)
reg_vars.head()

Unnamed: 0,ID,name,main_category,country,state,goal,pledged,backers,duplicate
1,1000004038,Where is Hank?,Film & Video,1,0,45000.0,220.0,3,0
2,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,1,0,5000.0,1.0,1,0
3,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,1,0,19500.0,1283.0,14,0
4,1000014025,Monarch Espresso Bar,Food,1,1,50000.0,52375.0,224,0
5,1000023410,Support Solar Roasted Coffee & Green Energy! ...,Food,1,1,1000.0,1205.0,16,0


In [133]:
reg_vars = reg_vars.drop(['name','ID'], axis = 1)
reg_vars.head()

Unnamed: 0,main_category,country,state,goal,pledged,backers,duplicate
1,Film & Video,1,0,45000.0,220.0,3,0
2,Music,1,0,5000.0,1.0,1,0
3,Film & Video,1,0,19500.0,1283.0,14,0
4,Food,1,1,50000.0,52375.0,224,0
5,Food,1,1,1000.0,1205.0,16,0


In [134]:
reg_vars = pd.get_dummies(data=reg_vars, columns=['main_category'])
reg_vars.head()

Unnamed: 0,country,state,goal,pledged,backers,duplicate,main_category_Art,main_category_Comics,main_category_Crafts,main_category_Dance,...,main_category_Film & Video,main_category_Food,main_category_Games,main_category_Graphic Novels,main_category_Journalism,main_category_Music,main_category_Photography,main_category_Publishing,main_category_Technology,main_category_Theater
1,1,0,45000.0,220.0,3,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
2,1,0,5000.0,1.0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
3,1,0,19500.0,1283.0,14,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,1,1,50000.0,52375.0,224,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,1,1,1000.0,1205.0,16,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0


In [135]:
y = reg_vars['state']

In [136]:
x = reg_vars.drop(['state'], axis = 1)

In [137]:
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=15)

In [138]:
LogReg = LogisticRegression()
LogReg.fit(X_train, y_train)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=None, solver='warn',
          tol=0.0001, verbose=0, warm_start=False)

In [139]:
LogReg.score(X_train, y_train)

0.9832608524649114

In [140]:
y_pred = LogReg.predict(X_test)

In [142]:
cm = pd.DataFrame(
    confusion_matrix(y_test, y_pred),
    columns=['Predicted Failure', 'Predicted Success'],
    index=['True Failure', 'True Success']
)

cm

Unnamed: 0,Predicted Failure,Predicted Success
True Failure,68471,1849
True Success,42,40875


In [143]:
just_usd['main_category'].value_counts()


Film & Video      101343
Music              86352
Publishing         60454
Games              45060
Art                42004
Design             39824
Technology         39514
Food               37460
Fashion            30562
Theater            16857
Comics             16314
Photography        15396
Crafts             12226
Journalism          6646
Dance               6169
Graphic Novels         1
Name: main_category, dtype: int64

In [159]:
film = just_usd[just_usd['main_category'] == "Film & Video"]
num_film = len(film)
film.head()
print(num_film)

101343


In [161]:
suc_film = len(film[film['state'] == 'successful'])

In [164]:
percent_film_success = suc_film/num_film
percent_film_success


0.37704626861253365

In [168]:
music = just_usd[just_usd['main_category'] == "Music"]
suc_music = len(music[music['state'] == 'successful'])
percent_suc_music = suc_music/(len(music))
percent_suc_music

0.48425050954233834

In [169]:
publish = just_usd[just_usd['main_category'] == "Publishing"]
suc_pub = len(publish[publish['state'] == 'successful'])
percent_suc_pub = suc_pub/(len(publish))
percent_suc_pub

0.3073245773646078

In [170]:
design = just_usd[just_usd['main_category'] == "Design"]
suc_des = len(design[design['state'] == 'successful'])
percent_suc_des = suc_des/(len(design))
percent_suc_des

0.34793089594214544

In [171]:
games = just_usd[just_usd['main_category'] == "Games"]
suc_game = len(games[games['state'] == 'successful'])
percent_suc_game = suc_game/(len(games))
percent_suc_game

0.37303595206391477