# Problem Statement
Predict, before its deadline, whether a Kickstarter project will succeed or fail, and identify the factors that determine a project's chance of success.


# Solution Notebook
This notebook has four steps (modules):
    1. Data Understanding (EDA) and Preprocessing
    2. Feature Engineering and heuristic feature selection
    3. Model Building
        3A. Logistic Regression with grid search
        3B. XGBoost
        3C. Random Forest
    4. Feature importance
    
The best accuracy obtained on the test data was about 69%, from XGBoost.

## Setting up the required libraries and packages

In [1]:
# Libraries
import numpy as np
import pandas as pd
import os
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import roc_auc_score
import string
#import itertools
#from itertools import product

## Importing a dataset

In [2]:
# read in data
kickstarters_2017 = pd.read_csv("../input/ks-projects-201801.csv")
kickstarters_2017.head()

Unnamed: 0,ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd pledged,usd_pledged_real,usd_goal_real
0,1000002330,The Songs of Adelaide & Abullah,Poetry,Publishing,GBP,2015-10-09,1000.0,2015-08-11 12:12:28,0.0,failed,0,GB,0.0,0.0,1533.95
1,1000003930,Greeting From Earth: ZGAC Arts Capsule For ET,Narrative Film,Film & Video,USD,2017-11-01,30000.0,2017-09-02 04:43:57,2421.0,failed,15,US,100.0,2421.0,30000.0
2,1000004038,Where is Hank?,Narrative Film,Film & Video,USD,2013-02-26,45000.0,2013-01-12 00:20:50,220.0,failed,3,US,220.0,220.0,45000.0
3,1000007540,ToshiCapital Rekordz Needs Help to Complete Album,Music,Music,USD,2012-04-16,5000.0,2012-03-17 03:24:11,1.0,failed,1,US,1.0,1.0,5000.0
4,1000011046,Community Film Project: The Art of Neighborhoo...,Film & Video,Film & Video,USD,2015-08-29,19500.0,2015-07-04 08:35:03,1283.0,canceled,14,US,1283.0,1283.0,19500.0


## Basic Tests and EDA on input data

In [3]:
#printing all summary of the kickstarter data
#this will give the dimensions of data set : (rows, columns)
print(kickstarters_2017.shape)
#columns and data types
print(kickstarters_2017.info())
#basic stats of columns
print(kickstarters_2017.describe())
#number of unique values in all columns
print(kickstarters_2017.nunique())

(378661, 15)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 378661 entries, 0 to 378660
Data columns (total 15 columns):
ID                  378661 non-null int64
name                378657 non-null object
category            378661 non-null object
main_category       378661 non-null object
currency            378661 non-null object
deadline            378661 non-null object
goal                378661 non-null float64
launched            378661 non-null object
pledged             378661 non-null float64
state               378661 non-null object
backers             378661 non-null int64
country             378661 non-null object
usd pledged         374864 non-null float64
usd_pledged_real    378661 non-null float64
usd_goal_real       378661 non-null float64
dtypes: float64(5), int64(2), object(8)
memory usage: 43.3+ MB
None
                 ID      ...        usd_goal_real
count  3.786610e+05      ...         3.786610e+05
mean   1.074731e+09      ...         4.545440e+04
std    6.1

The summary statistics above support the following conclusions:
1. The data is at ID level (the number of unique IDs equals the number of rows).
2. The numerical fields are goal, pledged, backers, usd_pledged, usd_pledged_real and usd_goal_real.
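
Both conclusions can also be checked in code rather than read off the printed summary. The following is a minimal sketch on a toy frame; the three rows are illustrative stand-ins, and in the notebook the same calls would be applied to `kickstarters_2017`.

```python
import pandas as pd

# Toy stand-in for kickstarters_2017 (illustrative rows only).
toy = pd.DataFrame({
    "ID": [1000002330, 1000003930, 1000004038],
    "goal": [1000.0, 30000.0, 45000.0],
    "state": ["failed", "failed", "failed"],
})

# The data is at ID level when every ID occurs exactly once.
id_is_unique = toy["ID"].is_unique

# Numeric columns can be listed by dtype instead of by hand (ID is an identifier, so it is excluded).
numeric_cols = toy.drop(columns="ID").select_dtypes(include="number").columns.tolist()

print(id_is_unique, numeric_cols)
```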

#### Understanding Variables in the Dataset

The dataset has 15 variables including ID. Since ID is the level of the dataset, we can set it as the index of the data later. Variables like name, currency, deadline, launched date and country are self-explanatory. Explanations of some key variables are as follows:

Main_Category: There are 15 main categories for the project. These main categories broadly classify projects based on topic and genre they belong to.

Category: Main categories are further subdivided into categories that give a more specific idea of the project. For example, the main category “Technology” has 15 categories such as Gadgets, Web, Apps and Software. There are 159 categories in total.

Goal: This is the amount the company needs to raise to start its project. The goal is an important variable: if it is too high, the project may fail to raise that amount and be unsuccessful; if it is too low, the project may reach its goal quickly and backers may lose interest in pledging more.

Pledged: This is the amount raised by the company from its backers. On Kickstarter, if the total amount pledged is lower than the goal, the project is unsuccessful and the company receives no funds. If the pledged amount reaches or exceeds the goal, the project is considered successful. The variable “usd pledged” is the amount raised, in US dollars.

Number of Backers: The number of people who have supported the project by pledging some amount.

In [4]:
#Distribution of data across state
percent_success = round(kickstarters_2017["state"].value_counts() / len(kickstarters_2017["state"]) * 100,2)

print("State Percent: ")
print(percent_success)

State Percent: 
failed        52.22
successful    35.38
canceled      10.24
undefined      0.94
live           0.74
suspended      0.49
Name: state, dtype: float64
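
Only the failed and successful states are kept for modelling later in the notebook, so the class balance that matters is the one within those two states. A small sketch of that arithmetic, using the two percentages printed above (copied by hand):

```python
# Percentages of all projects, taken from the output above.
state_pct = {"failed": 52.22, "successful": 35.38}

# Share of the full dataset that survives the failed/successful filter.
kept_pct = state_pct["failed"] + state_pct["successful"]

# Share of successful projects among the rows that are kept.
success_share = state_pct["successful"] / kept_pct

print(round(kept_pct, 2), round(success_share, 3))
```

Roughly two in five of the retained projects are successful, so the two classes are imbalanced but not severely so.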


In [5]:
#renaming column usd_pledged as there is no '_' in the actual dataset variable name
col_names_prev=list(kickstarters_2017)
col_names_new= ['ID',
 'name',
 'category',
 'main_category',
 'currency',
 'deadline',
 'goal',
 'launched',
 'pledged',
 'state',
 'backers',
 'country',
 'usd_pledged',
 'usd_pledged_real',
 'usd_goal_real']
kickstarters_2017.columns= col_names_new

In [6]:
#segregating the variables as categorical and continuous
cat_vars=[ 'category', 'main_category', 'currency','country']
cont_vars=['goal', 'pledged', 'backers','usd_pledged','usd_pledged_real','usd_goal_real']

In [7]:
#correlation of continuous variables
kickstarters_2017[cont_vars].corr()

Unnamed: 0,goal,pledged,backers,usd_pledged,usd_pledged_real,usd_goal_real
goal,1.0,0.007358,0.004012,0.005534,0.005104,0.942692
pledged,0.007358,1.0,0.717079,0.85737,0.952843,0.005024
backers,0.004012,0.717079,1.0,0.697426,0.752539,0.004517
usd_pledged,0.005534,0.85737,0.697426,1.0,0.907743,0.006172
usd_pledged_real,0.005104,0.952843,0.752539,0.907743,1.0,0.005596
usd_goal_real,0.942692,0.005024,0.004517,0.006172,0.005596,1.0


In [8]:
#setting unique ID as index of the table
#this is because the ID column will not be used in the algorithm. yet it is needed to identify the project
df_kick= kickstarters_2017.set_index('ID')

In [9]:
# Filtering only for successful and failed projects
kick_projects = df_kick[(df_kick['state'] == 'failed') | (df_kick['state'] == 'successful')]
#converting 'successful' state to 1 and failed to 0
kick_projects['state'] = (kick_projects['state'] =='successful').astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  after removing the cwd from sys.path.
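
The `SettingWithCopyWarning` above appears because `kick_projects` is a boolean-mask slice of `df_kick`, and pandas cannot tell whether the assignment is meant to reach the parent frame. One common way to avoid it is to take an explicit copy before assigning. A minimal sketch on a toy frame (the data is made up for illustration):

```python
import pandas as pd

df_toy = pd.DataFrame(
    {"state": ["failed", "successful", "canceled", "successful"]},
    index=pd.Index([11, 12, 13, 14], name="ID"),
)

# Keep only the two states of interest; .copy() makes the result an independent frame.
kept = df_toy[df_toy["state"].isin(["failed", "successful"])].copy()

# Assigning to the copy raises no warning and leaves df_toy untouched.
kept["state"] = (kept["state"] == "successful").astype(int)

print(kept)
```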


In [10]:
#checking distribution of projects across various main categories
kick_projects.groupby(['main_category','state']).size()
#kick_projects.groupby(['category','state']).size()

main_category  state
Art            0        14131
               1        11510
Comics         0         4036
               1         5842
Crafts         0         5703
               1         2115
Dance          0         1235
               1         2338
Design         0        14814
               1        10550
Fashion        0        14182
               1         5593
Film & Video   0        32904
               1        23623
Food           0        15969
               1         6085
Games          0        16003
               1        12518
Journalism     0         3137
               1         1012
Music          0        21752
               1        24197
Photography    0         6384
               1         3305
Publishing     0        23145
               1        12300
Technology     0        20616
               1         6434
Theater        0         3708
               1         6534
dtype: int64

In [11]:
#correlation of continuous variables with the dependent variable
kick_projects[['goal', 'pledged', 'backers','usd_pledged','usd_pledged_real','usd_goal_real','state']].corr()

Unnamed: 0,goal,pledged,backers,usd_pledged,usd_pledged_real,usd_goal_real,state
goal,1.0,0.007965,0.004794,0.006416,0.005955,0.952614,-0.025099
pledged,0.007965,1.0,0.717316,0.857966,0.953571,0.005722,0.109507
backers,0.004794,0.717316,1.0,0.697493,0.752291,0.005208,0.12579
usd_pledged,0.006416,0.857966,0.697493,1.0,0.907713,0.006965,0.095658
usd_pledged_real,0.005955,0.953571,0.752291,0.907713,1.0,0.006354,0.108298
usd_goal_real,0.952614,0.005722,0.005208,0.006965,0.006354,1.0,-0.023735
state,-0.025099,0.109507,0.12579,0.095658,0.108298,-0.023735,1.0


## Feature Engineering

In [12]:
#creating derived metrics/ features

#converting the date columns from string to datetime
#they are used below to derive the duration of the project
kick_projects['launched_date'] = pd.to_datetime(kick_projects['launched'], format='%Y-%m-%d %H:%M:%S')
#deadline holds a date only (e.g. 2015-10-09), so its format has no time part
kick_projects['deadline_date'] = pd.to_datetime(kick_projects['deadline'], format='%Y-%m-%d')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  


In [13]:
kick_projects= kick_projects.sort_values('launched_date',ascending=True)

In [14]:
kick_projects

ID,name,category,main_category,currency,deadline,goal,launched,pledged,state,backers,country,usd_pledged,usd_pledged_real,usd_goal_real,launched_date,deadline_date
1860890148,Grace Jones Does Not Give A F$#% T-Shirt (limi...,Fashion,Fashion,USD,2009-05-31,1000.0,2009-04-21 21:02:48,625.00,0,30,US,625.00,625.00,1000.00,2009-04-21 21:02:48,2009-05-31
709707365,CRYSTAL ANTLERS UNTITLED MOVIE,Shorts,Film & Video,USD,2009-07-20,80000.0,2009-04-23 00:07:53,22.00,0,3,US,22.00,22.00,80000.00,2009-04-23 00:07:53,2009-07-20
1703704063,drawing for dollars,Illustration,Art,USD,2009-05-03,20.0,2009-04-24 21:52:03,35.00,1,3,US,35.00,35.00,20.00,2009-04-24 21:52:03,2009-05-03
727286,Offline Wikipedia iPhone app,Software,Technology,USD,2009-07-14,99.0,2009-04-25 17:36:21,145.00,1,25,US,145.00,145.00,99.00,2009-04-25 17:36:21,2009-07-14
1622952265,Pantshirts,Fashion,Fashion,USD,2009-05-26,1900.0,2009-04-27 14:10:39,387.00,0,10,US,387.00,387.00,1900.00,2009-04-27 14:10:39,2009-05-26
2089078683,New York Makes a Book!!,Journalism,Journalism,USD,2009-05-16,3000.0,2009-04-28 13:55:41,3329.00,1,110,US,3329.00,3329.00,3000.00,2009-04-28 13:55:41,2009-05-16
830477146,Web Site for Short Horror Film,Shorts,Film & Video,USD,2009-05-29,200.0,2009-04-29 02:04:21,41.00,0,3,US,41.00,41.00,200.00,2009-04-29 02:04:21,2009-05-29
266044220,Help me write my second novel.,Fiction,Publishing,USD,2009-05-29,500.0,2009-04-29 02:58:50,563.00,1,18,US,563.00,563.00,500.00,2009-04-29 02:58:50,2009-05-29
813230527,Sponsor Dereck Blackburn (Lostwars) Artist in ...,Rock,Music,USD,2009-05-16,300.0,2009-04-29 05:26:32,15.00,0,2,US,15.00,15.00,300.00,2009-04-29 05:26:32,2009-05-16
469734648,kicey to iceland,Photography,Photography,USD,2009-06-17,350.0,2009-04-29 06:43:44,1630.00,1,31,US,1630.00,1630.00,350.00,2009-04-29 06:43:44,2009-06-17


In [15]:
#creating features from the project name

#length of name
kick_projects['name_len'] = kick_projects.name.str.len()

# name ends with '!'
kick_projects['name_exclaim'] = (kick_projects.name.str[-1] == '!').astype(int)

# name ends with '?'
kick_projects['name_question'] = (kick_projects.name.str[-1] == '?').astype(int)

# number of words in the name
kick_projects['name_words'] = kick_projects.name.apply(lambda x: len(str(x).split(' ')))

# if name is uppercase
kick_projects['name_is_upper'] = kick_projects.name.str.isupper().astype(float)

In [16]:
# normalizing goal by applying log
kick_projects['goal_log'] = np.log1p(kick_projects.goal)
#creating goal features to check what range goal lies in
kick_projects['Goal_10'] = kick_projects.goal.apply(lambda x: x // 10)
kick_projects['Goal_1000'] = kick_projects.goal.apply(lambda x: x // 1000)
kick_projects['Goal_100'] = kick_projects.goal.apply(lambda x: x // 100)
kick_projects['Goal_500'] = kick_projects.goal.apply(lambda x: x // 500)

In [17]:
#features from date column
kick_projects['duration']=(kick_projects['deadline_date']-kick_projects['launched_date']).dt.days
#the idea for deriving launched quarter month year is that perhaps projects launched in a particular year/ quarter/ month might have a low success rate
kick_projects['launched_quarter']= kick_projects['launched_date'].dt.quarter
kick_projects['launched_month']= kick_projects['launched_date'].dt.month
kick_projects['launched_year']= kick_projects['launched_date'].dt.year
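
To make the date features concrete, here is a small sketch that recomputes them for the first two rows of the sorted table shown earlier. The launch and deadline strings are copied from that table; everything else is a standalone illustration.

```python
import pandas as pd

# Launch timestamps and deadlines of the two earliest projects in the table above.
launched = pd.to_datetime(
    pd.Series(["2009-04-21 21:02:48", "2009-04-23 00:07:53"]),
    format="%Y-%m-%d %H:%M:%S",
)
deadline = pd.to_datetime(pd.Series(["2009-05-31", "2009-07-20"]), format="%Y-%m-%d")

# .dt.days keeps only whole days, so a partial final day is dropped.
duration = (deadline - launched).dt.days
quarter = launched.dt.quarter

print(duration.tolist(), quarter.tolist())
```

Because the deadline carries no time of day, the duration is measured up to midnight at the start of the deadline date.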

In [18]:
#additional features from goal, pledge and backers columns
kick_projects.loc[:,'goal_reached'] = kick_projects['pledged'] / kick_projects['goal'] # Pledged amount as a fraction of the goal (1.0 = goal exactly met).
#The above field will be used to compute another metric
# In backers column, impute 0 with 1 to prevent undefined division.
kick_projects.loc[kick_projects['backers'] == 0, 'backers'] = 1 
kick_projects.loc[:,'pledge_per_backer'] = kick_projects['pledged'] / kick_projects['backers'] # Pledged amount per backer.

In [19]:
#will create percentile buckets for the goal amount in a category
kick_projects['goal_cat_perc'] =  kick_projects.groupby(['category'])['goal'].transform(
                     lambda x: pd.qcut(x, [0, .35, .70, 1.0], labels =[1,2,3]))
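
The point of bucketing inside each category is that a goal is ranked against its peers, not against the whole dataset. The sketch below illustrates this on made-up data. It is a variation on the cell above, not the notebook's exact call: it uses `labels=False` (integer bin codes) plus 1 instead of `labels=[1,2,3]`, and the column name `goal_bucket` is hypothetical.

```python
import pandas as pd

# Two toy categories with very different goal scales.
toy = pd.DataFrame({
    "category": ["A"] * 6 + ["B"] * 6,
    "goal": [10, 20, 30, 40, 50, 60, 1000, 2000, 3000, 4000, 5000, 6000],
})

# Within each category: bottom 35% -> 1, next 35% -> 2, top 30% -> 3.
toy["goal_bucket"] = toy.groupby("category")["goal"].transform(
    lambda x: pd.qcut(x, [0, 0.35, 0.70, 1.0], labels=False) + 1
)

print(toy)
```

A goal of 60 lands in the top bucket of category A, while a goal of 1000 lands in the bottom bucket of category B.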

In [20]:
#creating a metric to see number of competitors for a given project
#number of participants in a given category, that launched in the same year and quarter and in the same goal bucket
ks_particpants=kick_projects.groupby(['category','launched_year','launched_quarter','goal_cat_perc']).count()
ks_particpants=ks_particpants[['name']]
#since the above table has all group by columns created as index, converting them into columns
ks_particpants.reset_index(inplace=True)

In [21]:
#renaming columns of the derived table
colmns=['category', 'launched_year', 'launched_quarter', 'goal_cat_perc', 'participants']
ks_particpants.columns=colmns

In [22]:
#merging the particpants column into the base table
kick_projects = pd.merge(kick_projects, ks_particpants, on = ['category', 'launched_year', 'launched_quarter','goal_cat_perc'], how = 'left')
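
The count, rename and merge steps above can also be written as a single `groupby(...).transform("size")`. This is an alternative sketch on toy data, not the notebook's method; note that `size` counts every row in a group, whereas the `count()` of `name` used above skips rows whose name is null.

```python
import pandas as pd

toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "category": ["Fashion", "Fashion", "Fashion", "Shorts"],
    "launched_year": [2009, 2009, 2009, 2009],
    "launched_quarter": [2, 2, 2, 2],
    "goal_cat_perc": [1, 1, 3, 3],
})

keys = ["category", "launched_year", "launched_quarter", "goal_cat_perc"]

# Each row receives the size of the group it belongs to, aligned to the original index.
toy["participants"] = toy.groupby(keys)["name"].transform("size")

print(toy[["category", "goal_cat_perc", "participants"]])
```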

In [23]:
#creating 2 metrics per category, launch year and goal bucket:
#  avg_ppb: the average pledge per backer (pledge_per_backer, computed earlier)
#  avg_success_rate: the average pledged-to-goal ratio (goal_reached)
#a list of columns (double brackets) is required to select several columns from a groupby
ks_ppb=pd.DataFrame(kick_projects.groupby(['category','launched_year','goal_cat_perc'])[['pledge_per_backer','goal_reached']].mean())
#since the above table has all group by columns created as index, converting them into columns
ks_ppb.reset_index(inplace=True)
#renaming column
ks_ppb.columns= ['category','launched_year','goal_cat_perc','avg_ppb','avg_success_rate']
ks_ppb[:2]

Unnamed: 0,category,launched_year,goal_cat_perc,avg_ppb,avg_success_rate
0,3D Printing,2013,1.0,299.627488,14.359722
1,3D Printing,2013,2.0,386.287518,3.153138


In [24]:
#merging the avg_ppb and avg_success_rate columns into the base table
kick_projects = pd.merge(kick_projects, ks_ppb, on = ['category', 'launched_year','goal_cat_perc'], how = 'left')

In [25]:
# replacing all 'N,0"' values in the country column with 'NZERO' to avoid discrepancies while one hot encoding
kick_projects = kick_projects.replace({'country': 'N,0"'}, {'country': 'NZERO'}, regex=True)

In [26]:
list(kick_projects)

['name',
 'category',
 'main_category',
 'currency',
 'deadline',
 'goal',
 'launched',
 'pledged',
 'state',
 'backers',
 'country',
 'usd_pledged',
 'usd_pledged_real',
 'usd_goal_real',
 'launched_date',
 'deadline_date',
 'name_len',
 'name_exclaim',
 'name_question',
 'name_words',
 'name_is_upper',
 'goal_log',
 'Goal_10',
 'Goal_1000',
 'Goal_100',
 'Goal_500',
 'duration',
 'launched_quarter',
 'launched_month',
 'launched_year',
 'goal_reached',
 'pledge_per_backer',
 'goal_cat_perc',
 'participants',
 'avg_ppb',
 'avg_success_rate']

In [27]:
#selecting the needed fields only
#this will lead to the final features list
kick_projects=kick_projects[['category',
 'main_category',
 'currency',
 'goal',
 'state',
 'country',
 'usd_goal_real',
 'name_len',
 'name_exclaim',
 'name_question',
 'name_words',
 'name_is_upper',
 'goal_log',
 'Goal_10',
 'Goal_1000',
 'Goal_100',
 'Goal_500',
 'duration',
 'launched_quarter',
 'launched_month',
 'launched_year',
 'goal_cat_perc',
 'participants',
 'avg_ppb',
 'avg_success_rate']]

In [28]:
#these functions will be used on the textual column entries to replace '&', '-' and white spaces
#non-string values are returned unchanged
def replace_ampersand(val):
    if isinstance(val, str):
        return val.replace('&', 'and')
    else:
        return val

def replace_hyphen(val):
    if isinstance(val, str):
        return val.replace('-', '_')
    else:
        return val

def remove_extraspace(val):
    if isinstance(val, str):
        return val.strip()
    else:
        return val

def replace_space(val):
    if isinstance(val, str):
        return val.replace(' ', '_')
    else:
        return val

In [29]:
#apply those functions to the category and main_category columns
#this removes special characters from these text columns
#since these fields will be one-hot encoded, the derived column names should be compatible with the required format
kick_projects['category'] = kick_projects['category'].apply(remove_extraspace)
kick_projects['category'] = kick_projects['category'].apply(replace_ampersand)
kick_projects['category'] = kick_projects['category'].apply(replace_hyphen)
kick_projects['category'] = kick_projects['category'].apply(replace_space)

kick_projects['main_category'] = kick_projects['main_category'].apply(remove_extraspace)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_ampersand)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_hyphen)
kick_projects['main_category'] = kick_projects['main_category'].apply(replace_space)

In [30]:
#missing value treatment
# Check for nulls.
kick_projects.isnull().sum()

category            0
main_category       0
currency            0
goal                0
state               0
country             0
usd_goal_real       0
name_len            3
name_exclaim        0
name_question       0
name_words          0
name_is_upper       3
goal_log            0
Goal_10             0
Goal_1000           0
Goal_100            0
Goal_500            0
duration            0
launched_quarter    0
launched_month      0
launched_year       0
goal_cat_perc       0
participants        0
avg_ppb             0
avg_success_rate    0
dtype: int64

There are only 3 rows with nulls, and the rows with nulls have no names. These rows can be removed.

In [31]:
#dropping all rows that have any nulls
kick_projects=kick_projects.dropna() 

In [32]:
# Check for nulls again.
kick_projects.isnull().sum()

category            0
main_category       0
currency            0
goal                0
state               0
country             0
usd_goal_real       0
name_len            0
name_exclaim        0
name_question       0
name_words          0
name_is_upper       0
goal_log            0
Goal_10             0
Goal_1000           0
Goal_100            0
Goal_500            0
duration            0
launched_quarter    0
launched_month      0
launched_year       0
goal_cat_perc       0
participants        0
avg_ppb             0
avg_success_rate    0
dtype: int64

No nulls, we are good to go

In [33]:
kick_projects.head()

Unnamed: 0,category,main_category,currency,goal,state,country,usd_goal_real,name_len,name_exclaim,name_question,name_words,name_is_upper,goal_log,Goal_10,Goal_1000,Goal_100,Goal_500,duration,launched_quarter,launched_month,launched_year,goal_cat_perc,participants,avg_ppb,avg_success_rate
0,Fashion,Fashion,USD,1000.0,0,US,1000.0,59.0,0,0,11,0.0,6.908755,100.0,1.0,10.0,2.0,39,2,4,2009,1.0,3,40.982361,0.325542
1,Shorts,Film_and_Video,USD,80000.0,0,US,80000.0,30.0,0,0,4,1.0,11.289794,8000.0,80.0,800.0,160.0,87,2,4,2009,3.0,1,65.203511,0.274317
2,Illustration,Art,USD,20.0,1,US,20.0,19.0,0,0,3,0.0,3.044522,2.0,0.0,0.0,0.0,8,2,4,2009,1.0,3,13.095238,0.5525
3,Software,Technology,USD,99.0,1,US,99.0,28.0,0,0,4,0.0,4.60517,9.0,0.0,0.0,0.0,79,2,4,2009,1.0,7,36.765524,0.572958
4,Fashion,Fashion,USD,1900.0,0,US,1900.0,10.0,0,0,1,0.0,7.550135,190.0,1.0,19.0,3.0,28,2,4,2009,1.0,3,40.982361,0.325542


In [34]:
# One-Hot encoding to convert categorical columns to numeric
print('start one-hot encoding')

kick_projects_ip = pd.get_dummies(kick_projects, prefix = [ 'category', 'main_category', 'currency','country'],
                             columns = [ 'category', 'main_category', 'currency','country'])
    
#this will have created 1-0 flag columns (like a sparse matrix)    
print('ADS dummy columns made')

start one-hot encoding
ADS dummy columns made
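
For readers unfamiliar with `pd.get_dummies`, the following toy sketch shows what the call does to one categorical column. The three rows are made up; the `prefix` and `columns` arguments are used the same way as in the cell above.

```python
import pandas as pd

toy = pd.DataFrame({"goal": [1000.0, 20.0, 99.0], "currency": ["USD", "GBP", "USD"]})

# Each distinct currency becomes its own indicator column; other columns pass through unchanged.
encoded = pd.get_dummies(toy, prefix=["currency"], columns=["currency"])

print(encoded.columns.tolist())
```

The original `currency` column is dropped and replaced by one flag column per value, which is how 4 categorical columns expand into the long feature list printed further down.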


In [35]:
#creating 2 arrays: features and response

#features will have all independent variables
features=list(kick_projects_ip)
features.remove('state')
#response has the target variable
response= ['state']

In [36]:
features

['goal',
 'usd_goal_real',
 'name_len',
 'name_exclaim',
 'name_question',
 'name_words',
 'name_is_upper',
 'goal_log',
 'Goal_10',
 'Goal_1000',
 'Goal_100',
 'Goal_500',
 'duration',
 'launched_quarter',
 'launched_month',
 'launched_year',
 'goal_cat_perc',
 'participants',
 'avg_ppb',
 'avg_success_rate',
 'category_3D_Printing',
 'category_Academic',
 'category_Accessories',
 'category_Action',
 'category_Animals',
 'category_Animation',
 'category_Anthologies',
 'category_Apparel',
 'category_Apps',
 'category_Architecture',
 'category_Art',
 'category_Art_Books',
 'category_Audio',
 'category_Bacon',
 'category_Blues',
 'category_Calendars',
 'category_Camera_Equipment',
 'category_Candles',
 'category_Ceramics',
 "category_Children's_Books",
 'category_Childrenswear',
 'category_Chiptune',
 'category_Civic_Design',
 'category_Classical_Music',
 'category_Comedy',
 'category_Comic_Books',
 'category_Comics',
 'category_Community_Gardens',
 'category_Conceptual_Art',
 'category_

In [37]:
#creating a backup copy of the input dataset
kick_projects_ip_copy= kick_projects_ip.copy()

In [38]:
kick_projects_ip[features].shape

(331672, 231)

In [39]:
# scale each row (sample) to unit L2 norm; preprocessing.normalize works row-wise, not per feature
kick_projects_ip_scaled_ftrs = pd.DataFrame(preprocessing.normalize(kick_projects_ip[features]))
kick_projects_ip_scaled_ftrs.columns=list(kick_projects_ip[features])
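
It is worth being explicit about what `preprocessing.normalize` does: with its defaults it rescales each row to unit L2 norm, so within a row the largest raw values (here typically `goal`, `usd_goal_real` and `launched_year`) take most of the weight. Per-feature scaling is a different operation. The sketch below contrasts the two on a made-up array; `StandardScaler` is shown only for comparison and is not what the notebook uses.

```python
import numpy as np
from sklearn import preprocessing

X = np.array([[3.0, 4.0],
              [30.0, 40.0],
              [1.0, 0.0]])

# Row-wise: every sample is divided by its own L2 norm (defaults: norm="l2", axis=1).
row_normed = preprocessing.normalize(X)

# Column-wise: every feature is shifted to mean 0 and scaled to unit variance.
col_scaled = preprocessing.StandardScaler().fit_transform(X)

print(row_normed)
print(col_scaled)
```

After row-wise normalization the first two samples become identical, because they differ only in overall magnitude.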

In [40]:
kick_projects_ip_scaled_ftrs[:3]
#kick_projects_ip[features].shape

Unnamed: 0,goal,usd_goal_real,name_len,name_exclaim,name_question,name_words,name_is_upper,goal_log,Goal_10,Goal_1000,Goal_100,Goal_500,duration,launched_quarter,launched_month,launched_year,goal_cat_perc,participants,avg_ppb,avg_success_rate,category_3D_Printing,category_Academic,category_Accessories,category_Action,category_Animals,category_Animation,category_Anthologies,category_Apparel,category_Apps,category_Architecture,category_Art,category_Art_Books,category_Audio,category_Bacon,category_Blues,category_Calendars,category_Camera_Equipment,category_Candles,category_Ceramics,category_Children's_Books,...,main_category_Publishing,main_category_Technology,main_category_Theater,currency_AUD,currency_CAD,currency_CHF,currency_DKK,currency_EUR,currency_GBP,currency_HKD,currency_JPY,currency_MXN,currency_NOK,currency_NZD,currency_SEK,currency_SGD,currency_USD,country_AT,country_AU,country_BE,country_CA,country_CH,country_DE,country_DK,country_ES,country_FR,country_GB,country_HK,country_IE,country_IT,country_JP,country_LU,country_MX,country_NL,country_NO,country_NZ,country_NZERO,country_SE,country_SG,country_US
0,0.406455,0.406455,0.023981,0.0,0.0,0.004471,0.0,0.002808,0.040645,0.000406,0.004065,0.000813,0.015852,0.000813,0.001626,0.816567,0.000406,0.001219,0.016657,0.000132,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000406,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000406
1,0.705216,0.705216,0.000264,0.0,0.0,3.5e-05,9e-06,0.0001,0.070522,0.000705,0.007052,0.00141,0.000767,1.8e-05,3.5e-05,0.01771,2.6e-05,9e-06,0.000575,2e-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9e-06,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,9e-06
2,0.009953,0.009953,0.009456,0.0,0.0,0.001493,0.0,0.001515,0.000995,0.0,0.0,0.0,0.003981,0.000995,0.001991,0.99982,0.000498,0.001493,0.006517,0.000275,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000498,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000498


## Model Building

In [41]:
#creating test and train dependent and independent variables
#Split the data into test and train (30-70: random sampling)
#will be using the scaled dataset to split 
train_ind, test_ind, train_dep, test_dep = train_test_split(kick_projects_ip_scaled_ftrs, kick_projects_ip[response], test_size=0.3, random_state=0)

### XGBoost classifier

In [42]:
from xgboost import XGBClassifier

In [43]:
xgb_model = XGBClassifier(
 learning_rate =0.1,
 n_estimators=1000,
 max_depth=5,
 min_child_weight=1,
 gamma=0,
 subsample=0.8,
 colsample_bytree=0.8,
 objective= 'binary:logistic',
 nthread=4,
 scale_pos_weight=1,
 seed=27)

In [44]:
xgb_model=xgb_model.fit(train_ind[features], train_dep[response])

  y = column_or_1d(y, warn=True)
  y = column_or_1d(y, warn=True)
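
The `column_or_1d` message above is a warning that the target was passed as a single-column DataFrame where a 1-D array was expected. A common remedy is to flatten the target before fitting. The sketch below only shows the shape difference on a toy frame (`train_dep_toy` is a hypothetical stand-in for `train_dep`).

```python
import pandas as pd

train_dep_toy = pd.DataFrame({"state": [0, 1, 1, 0]})
response = ["state"]

# Selecting with a list keeps a 2-D, single-column frame.
y_2d = train_dep_toy[response]

# .values.ravel() flattens it to the 1-D array that estimators expect.
y_1d = train_dep_toy[response].values.ravel()

print(y_2d.shape, y_1d.shape)
```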


#### Prediction XGB

In [45]:
# Predict on the test data
test_ind["Pred_state_XGB"] = xgb_model.predict(test_ind[features])

# Predict on the train data
train_ind["Pred_state_XGB"] = xgb_model.predict(train_ind[features])

# Predict on the complete data (train and test together)
kick_projects_ip["Pred_state_XGB"] = xgb_model.predict(kick_projects_ip_scaled_ftrs)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


#### Evaluating XGB classifier

In [46]:
print ("Test Accuracy :: ",accuracy_score(test_dep[response], xgb_model.predict(test_ind[features])))
print ("Train Accuracy :: ",accuracy_score(train_dep[response], xgb_model.predict(train_ind[features])))
print ("Complete Accuracy  :: ",accuracy_score(kick_projects_ip[response], xgb_model.predict(kick_projects_ip_scaled_ftrs)))
print (" Confusion matrix of complete data is", confusion_matrix(kick_projects_ip[response],kick_projects_ip["Pred_state_XGB"]))

Test Accuracy ::  0.6921468915197685
Train Accuracy ::  0.7326097256320799
Complete Accuracy  ::  0.7204708265997732
 Confusion matrix of complete data is [[162064  35652]
 [ 57060  76896]]
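
Accuracy alone hides how the errors are split between the two classes. As a sketch, precision and recall for the successful class can be derived from the confusion matrix printed above; the four counts are copied from that output, and the layout assumed is scikit-learn's (rows are true labels, columns are predicted labels, in the order 0, 1).

```python
import numpy as np

# Confusion matrix of the complete data, copied from the output above.
cm = np.array([[162064, 35652],
               [57060, 76896]])

# For binary labels 0/1, ravel() yields TN, FP, FN, TP in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of the projects predicted successful, how many were
recall = tp / (tp + fn)     # of the successful projects, how many were found

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```

The accuracy reproduces the "Complete Accuracy" figure, while recall for successful projects is noticeably lower than precision: the model misses a sizeable share of the projects that succeeded.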


#### Deriving important features for predicting state of kickstarter projects

In [47]:
## Feature importances
ftr_imp=zip(features,xgb_model.feature_importances_)

In [48]:
for values in ftr_imp:
    print(values)

('goal', 0.03331005)
('usd_goal_real', 0.036802586)
('name_len', 0.06784248)
('name_exclaim', 0.008469396)
('name_question', 0.0010477604)
('name_words', 0.06648913)
('name_is_upper', 0.005151489)
('goal_log', 0.02946826)
('Goal_10', 0.007814546)
('Goal_1000', 0.018772375)
('Goal_100', 0.013489915)
('Goal_500', 0.016022002)
('duration', 0.08993277)
('launched_quarter', 0.046668995)
('launched_month', 0.05631712)
('launched_year', 0.025451846)
('goal_cat_perc', 0.023880206)
('participants', 0.06775517)
('avg_ppb', 0.0542216)
('avg_success_rate', 0.064786516)
('category_3D_Printing', 0.0006548503)
('category_Academic', 0.000829477)
('category_Accessories', 0.0013533572)
('category_Action', 0.0006548503)
('category_Animals', 0.00034925347)
('category_Animation', 0.0016152973)
('category_Anthologies', 0.0012223872)
('category_Apparel', 0.0019645507)
('category_Apps', 0.0013097005)
('category_Architecture', 0.0003055968)
('category_Art', 0.0021391774)
('category_Art_Books', 0.0015279839)
('

In [49]:
feature_imp=pd.DataFrame(list(zip(features,xgb_model.feature_importances_)))
column_names= ['features','XGB_imp']
feature_imp.columns= column_names

In [50]:
feature_imp= feature_imp.sort_values('XGB_imp',ascending=False)

In [51]:
feature_imp[:15]

Unnamed: 0,features,XGB_imp
12,duration,0.089933
2,name_len,0.067842
17,participants,0.067755
5,name_words,0.066489
19,avg_success_rate,0.064787
14,launched_month,0.056317
18,avg_ppb,0.054222
13,launched_quarter,0.046669
1,usd_goal_real,0.036803
0,goal,0.03331


In [52]:
kick_projects_ip.head()

Unnamed: 0,goal,state,usd_goal_real,name_len,name_exclaim,name_question,name_words,name_is_upper,goal_log,Goal_10,Goal_1000,Goal_100,Goal_500,duration,launched_quarter,launched_month,launched_year,goal_cat_perc,participants,avg_ppb,avg_success_rate,category_3D_Printing,category_Academic,category_Accessories,category_Action,category_Animals,category_Animation,category_Anthologies,category_Apparel,category_Apps,category_Architecture,category_Art,category_Art_Books,category_Audio,category_Bacon,category_Blues,category_Calendars,category_Camera_Equipment,category_Candles,category_Ceramics,...,main_category_Technology,main_category_Theater,currency_AUD,currency_CAD,currency_CHF,currency_DKK,currency_EUR,currency_GBP,currency_HKD,currency_JPY,currency_MXN,currency_NOK,currency_NZD,currency_SEK,currency_SGD,currency_USD,country_AT,country_AU,country_BE,country_CA,country_CH,country_DE,country_DK,country_ES,country_FR,country_GB,country_HK,country_IE,country_IT,country_JP,country_LU,country_MX,country_NL,country_NO,country_NZ,country_NZERO,country_SE,country_SG,country_US,Pred_state_XGB
0,1000.0,0,1000.0,59.0,0,0,11,0.0,6.908755,100.0,1.0,10.0,2.0,39,2,4,2009,1.0,3,40.982361,0.325542,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
1,80000.0,0,80000.0,30.0,0,0,4,1.0,11.289794,8000.0,80.0,800.0,160.0,87,2,4,2009,3.0,1,65.203511,0.274317,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0
2,20.0,1,20.0,19.0,0,0,3,0.0,3.044522,2.0,0.0,0.0,0.0,8,2,4,2009,1.0,3,13.095238,0.5525,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
3,99.0,1,99.0,28.0,0,0,4,0.0,4.60517,9.0,0.0,0.0,0.0,79,2,4,2009,1.0,7,36.765524,0.572958,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1
4,1900.0,0,1900.0,10.0,0,0,1,0.0,7.550135,190.0,1.0,19.0,3.0,28,2,4,2009,1.0,3,40.982361,0.325542,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0


### Random Forest Classifier

In [53]:
from sklearn.ensemble import RandomForestClassifier
import math

In [54]:
features_count = train_ind.shape[1]

# Candidate grid for a hyperparameter search. Note that it is not passed to the
# classifier below, which uses fixed settings (max_features=2).
parameters_rf = {'n_estimators': [50], 'max_depth': [20],
                 'max_features': [math.floor(np.sqrt(features_count)),
                                  math.floor(features_count / 3)]}

def random_forest_classifier(features, target):
    """
    Train a random forest classifier on the given features and target.
    :param features: DataFrame of independent variables
    :param target: target labels (project state, 0 or 1)
    :return: trained random forest classifier
    """
    clf = RandomForestClassifier(n_estimators=50, criterion='gini', max_depth=20, max_features=2)
    clf.fit(features, target)
    return clf
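The `max_features` candidates defined above could be evaluated with a cross-validated grid search. The sketch below is only an illustration on synthetic data: `X_demo`, `y_demo`, `param_grid_demo`, the 3-fold setting and the accuracy scoring are assumptions, not values taken from this notebook.

```python
import math

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for train_ind[features] / train_dep[response] (assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, n_informative=5, random_state=0
)

# Same two max_features heuristics as the grid above: sqrt(p) and p/3.
features_count_demo = X_demo.shape[1]
param_grid_demo = {
    "n_estimators": [50],
    "max_depth": [20],
    "max_features": [
        math.floor(np.sqrt(features_count_demo)),
        math.floor(features_count_demo / 3),
    ],
}

# Cross-validated search over the candidate settings.
search_demo = GridSearchCV(
    RandomForestClassifier(criterion="gini", random_state=0),
    param_grid_demo,
    cv=3,
    scoring="accuracy",
)
search_demo.fit(X_demo, y_demo)

print("Best parameters:", search_demo.best_params_)
print("Best CV accuracy:", search_demo.best_score_)
```

With 12 synthetic features the two candidates are 3 and 4. On the real data the candidates depend on `features_count`.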

In [55]:
trained_model_RF= random_forest_classifier(train_ind[features], train_dep[response])

  


#### Predictions using RF

In [56]:
# Predict on the test data
test_ind["Pred_state_RF"] = trained_model_RF.predict(test_ind[features])

# Predict on the train data
train_ind["Pred_state_RF"] = trained_model_RF.predict(train_ind[features])

# Predict on the complete data
kick_projects_ip["Pred_state_RF"] = trained_model_RF.predict(kick_projects_ip_scaled_ftrs)
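Assigning a new column to a frame that was derived from another DataFrame can trigger pandas' `SettingWithCopyWarning`. A minimal sketch of one common way to avoid it follows, assuming the frame was produced by slicing: take an explicit copy first and assign through `.loc`. The names `df_demo`, `train_demo` and `Pred_state_demo` are hypothetical.

```python
import pandas as pd

# Small hypothetical frame standing in for the project data.
df_demo = pd.DataFrame(
    {"goal": [1000.0, 80000.0, 20.0, 99.0], "state": [0, 0, 1, 1]}
)

# An explicit copy makes the subset an independent frame, so later
# assignments cannot be mistaken for writes to a view of df_demo.
train_demo = df_demo[df_demo["goal"] < 50000].copy()

# Assign the new column through .loc on the copied frame.
train_demo.loc[:, "Pred_state_demo"] = [0, 1, 1]

print(train_demo)
```

The original frame is left untouched, and the prediction column exists only on the copy.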



#### Accuracies of RF

In [57]:
# Train and Test Accuracy
print ("Train Accuracy :: ", accuracy_score(train_dep[response], trained_model_RF.predict(train_ind[features])))
print ("Test Accuracy  :: ", accuracy_score(test_dep[response], trained_model_RF.predict(test_ind[features])))
print ("Complete Accuracy  :: ", accuracy_score(kick_projects_ip[response], trained_model_RF.predict(kick_projects_ip_scaled_ftrs)))
print (" Confusion matrix of complete data is", confusion_matrix(kick_projects_ip[response],kick_projects_ip["Pred_state_RF"]))

Train Accuracy ::  0.6962269027006073
Test Accuracy  ::  0.660599786938956
Complete Accuracy  ::  0.685538725005427
 Confusion matrix of complete data is [[175328  22388]
 [ 81910  52046]]
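The confusion matrix above can be broken down into per-class metrics. The sketch below re-derives accuracy, precision and recall from the printed counts. It assumes scikit-learn's layout (rows are true labels, columns are predicted labels) and that class 1 denotes a successful project.

```python
import numpy as np

# Counts copied from the confusion matrix printed above.
cm_demo = np.array([[175328, 22388],
                    [81910, 52046]])

# For binary labels 0/1, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm_demo.ravel()

accuracy = (tn + tp) / cm_demo.sum()
precision_1 = tp / (tp + fp)   # share of predicted class 1 that is truly class 1
recall_1 = tp / (tp + fn)      # share of true class 1 that the model finds

print("Accuracy  :", accuracy)
print("Precision :", precision_1)
print("Recall    :", recall_1)
```

The accuracy reproduces the complete-data figure reported above, while recall for class 1 is only about 0.39: the model misses most of the projects labelled 1, which overall accuracy alone does not show.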


#### Key drivers from Random Forest

In [58]:
## Feature importances
ftr_imp_rf=zip(features,trained_model_RF.feature_importances_)
for values in ftr_imp_rf:
    print(values)

('goal', 0.03412417650993668)
('usd_goal_real', 0.044378602045221066)
('name_len', 0.04192523527480494)
('name_exclaim', 0.009747174218191495)
('name_question', 0.0007034058160194616)
('name_words', 0.07584053744201033)
('name_is_upper', 0.0031984765560943013)
('goal_log', 0.050464013597184464)
('Goal_10', 0.02756743238953033)
('Goal_1000', 0.025786540778261625)
('Goal_100', 0.04638058139959456)
('Goal_500', 0.01662458100327375)
('duration', 0.038955554097796836)
('launched_quarter', 0.04904811684281092)
('launched_month', 0.028613738383699395)
('launched_year', 0.03694144537784187)
('goal_cat_perc', 0.034876342537808865)
('participants', 0.029709925574358607)
('avg_ppb', 0.04438796584833301)
('avg_success_rate', 0.061733661450459595)
('category_3D_Printing', 0.00034724441887888397)
('category_Academic', 0.0003833432787463374)
('category_Accessories', 0.0007894250504031265)
('category_Action', 0.0005555100802733251)
('category_Animals', 6.53549567190293e-05)
('category_Animation', 0.00

In [59]:
feature_imp_RF=pd.DataFrame(list(zip(features,trained_model_RF.feature_importances_)))
column_names_RF= ['features','RF_imp']
feature_imp_RF.columns= column_names_RF

In [60]:
feature_imp_RF= feature_imp_RF.sort_values('RF_imp',ascending=False)
feature_imp_RF[:15]

Unnamed: 0,features,RF_imp
5,name_words,0.075841
19,avg_success_rate,0.061734
7,goal_log,0.050464
13,launched_quarter,0.049048
10,Goal_100,0.046381
18,avg_ppb,0.044388
1,usd_goal_real,0.044379
2,name_len,0.041925
12,duration,0.038956
15,launched_year,0.036941
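To compare the key drivers from the two models side by side, the importance tables can be merged on the feature name. The sketch below uses four rows copied from the XGBoost and Random Forest tables above. The frame names and the `mean_imp` column are hypothetical, and averaging importances across models is a rough heuristic rather than a method used in this notebook.

```python
import pandas as pd

# A few rows copied from the XGBoost importance table.
xgb_imp_demo = pd.DataFrame({
    "features": ["duration", "name_len", "name_words", "avg_success_rate"],
    "XGB_imp": [0.089933, 0.067842, 0.066489, 0.064787],
})

# The same features from the Random Forest importance table.
rf_imp_demo = pd.DataFrame({
    "features": ["name_words", "avg_success_rate", "name_len", "duration"],
    "RF_imp": [0.075841, 0.061734, 0.041925, 0.038956],
})

# An outer merge keeps features that appear in only one of the tables.
combined_demo = xgb_imp_demo.merge(rf_imp_demo, on="features", how="outer")

# Simple average of the two importances, used only for ranking here.
combined_demo["mean_imp"] = combined_demo[["XGB_imp", "RF_imp"]].mean(axis=1)
combined_demo = combined_demo.sort_values("mean_imp", ascending=False).reset_index(drop=True)

print(combined_demo)
```

In the notebook itself the same merge could be applied to `feature_imp` and `feature_imp_RF`, which share the `features` column.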


## Ensemble Classifiers

### Simple Ensemble: Average Probabilities

In [61]:
from sklearn import tree
from sklearn import neighbors
import math

In [62]:
model_dtc_g = tree.DecisionTreeClassifier()
model_dtc_e = tree.DecisionTreeClassifier(criterion="entropy")
model_knn = neighbors.KNeighborsClassifier()
model_lr= LogisticRegression(penalty='l1',solver='saga')

In [63]:
model_dtc_g.fit(train_ind[features], train_dep[response])
model_dtc_e.fit(train_ind[features], train_dep[response])
model_knn.fit(train_ind[features], train_dep[response])
model_lr.fit(train_ind[features], train_dep[response])

pred_dtc_g=model_dtc_g.predict_proba(test_ind[features])
pred_dtc_e=model_dtc_e.predict_proba(test_ind[features])
pred_knn=model_knn.predict_proba(test_ind[features])
pred_lr=model_lr.predict_proba(test_ind[features])

finalpred=(pred_dtc_g+pred_dtc_e+pred_knn+pred_lr)/4



In [64]:
pred_proba_avg=pd.DataFrame(finalpred)
col_names=['prob_0','prob_1']
pred_proba_avg.columns=col_names

In [65]:
def final_state(c):
    if c['prob_0'] >c['prob_1']:
        return 0
    else:
        return 1
    
pred_proba_avg['final_state_avg'] = pred_proba_avg.apply(final_state, axis=1)
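The row-wise `apply` above can be slow on a large test set. A vectorized comparison gives the same labels, including the tie rule of `final_state` (a tie goes to class 1). The sketch uses a small hypothetical frame, `proba_demo`, to illustrate.

```python
import pandas as pd

# Hypothetical averaged probabilities, including one exact tie.
proba_demo = pd.DataFrame({"prob_0": [0.8, 0.5, 0.3],
                           "prob_1": [0.2, 0.5, 0.7]})

def final_state_demo(c):
    # Same rule as final_state above: class 0 only if prob_0 is strictly larger.
    return 0 if c["prob_0"] > c["prob_1"] else 1

# Row-wise version, as in the cell above.
rowwise = proba_demo.apply(final_state_demo, axis=1)

# Vectorized version: class 1 whenever prob_1 >= prob_0.
vectorized = (proba_demo["prob_1"] >= proba_demo["prob_0"]).astype(int)

print(vectorized.tolist())
```

Using `>=` rather than `numpy.argmax` matters here, because `argmax` would send an exact tie to class 0.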

In [66]:
test_ind = test_ind.reset_index(drop=True)
pred_proba_avg = pred_proba_avg.reset_index(drop=True)
test_ind=pd.concat([test_ind,pred_proba_avg],axis=1)

In [67]:
print ("Test Accuracy  :: ", accuracy_score(test_dep[response],test_ind['final_state_avg']))

Test Accuracy  ::  0.6362384675684911
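Averaging class probabilities by hand is what scikit-learn calls soft voting, so the same ensemble can be expressed with `VotingClassifier(voting="soft")`. The sketch below runs on synthetic data. The `_demo` names are hypothetical, and the logistic regression uses default settings with a higher `max_iter` rather than the `l1`/`saga` configuration above, purely to keep the example quick.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the train/test split used in the notebook (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# The same four model families as above, combined by soft voting.
voter_demo = VotingClassifier(
    estimators=[
        ("dtc_g", DecisionTreeClassifier(random_state=0)),
        ("dtc_e", DecisionTreeClassifier(criterion="entropy", random_state=0)),
        ("knn", KNeighborsClassifier()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
)
voter_demo.fit(X_tr, y_tr)

# Soft voting returns the unweighted mean of the members' probabilities.
proba_vote = voter_demo.predict_proba(X_te)
proba_manual = np.mean([est.predict_proba(X_te) for est in voter_demo.estimators_], axis=0)

print("Matches manual average:", np.allclose(proba_vote, proba_manual))
print("Test accuracy:", accuracy_score(y_te, voter_demo.predict(X_te)))
```

One difference from the manual version: `VotingClassifier.predict` takes the argmax of the averaged probabilities, so an exact tie goes to class 0, whereas `final_state` assigns ties to class 1.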


### Boosting

In [68]:
from sklearn.ensemble import AdaBoostClassifier

In [69]:
model_ada = AdaBoostClassifier(random_state=1)
model_ada.fit(train_ind[features], train_dep[response])
model_ada.score(test_ind[features],test_dep[response])



0.6741975035677675
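To see how the AdaBoost score evolves as weak learners are added, `staged_score` yields the accuracy after each boosting round. The sketch below is on synthetic data. The `_demo` names and `n_estimators=50` are assumptions for illustration and do not reproduce the result above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's train/test data (assumption).
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=1)

model_ada_demo = AdaBoostClassifier(n_estimators=50, random_state=1)
model_ada_demo.fit(X_tr, y_tr)

# One test accuracy per boosting round; the last equals model.score().
staged_demo = list(model_ada_demo.staged_score(X_te, y_te))

best_round = max(range(len(staged_demo)), key=staged_demo.__getitem__) + 1
print("Rounds fitted      :", len(staged_demo))
print("Best round         :", best_round)
print("Final test accuracy:", staged_demo[-1])
```

If the best round comes well before the last one, a smaller `n_estimators` may generalize as well at lower cost. Choosing it on the test set would leak information, so a separate validation split is the safer place for that comparison.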