# Gathering Insight from Kickstarter Data: Machine Learning

_A project by Team Apple (Data Mining & Machine Learning, HEC Lausanne, Fall 2019)_

This notebook is dedicated to training machine learning models on the cleaned Kickstarter dataset, in order to find out which model (if any at all), and which features, can accurately predict the success or failure of a project.

**Contents**

1. [Imports](#imports)
2. [Machine learning models](#ml)
    1. [Logistic regression](#logr)
    2. [Decision tree and random forest](#dtrf)
    3. [k-nearest neighbors](#knn)
    4. [Neural network](#nn)
    5. [Linear regression](#linr)
3. [Conclusion](#conclusion)

## 1. Imports & installations<a name="imports"></a>

In [1]:
!pip install keras



In [2]:
!pip install --upgrade tensorflow==1.14.0

Requirement already up-to-date: tensorflow==1.14.0 in c:\programdata\anaconda3\lib\site-packages (1.14.0)


In [1]:
from cleaning import df

import numpy as np
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline
plt.rcParams['figure.figsize'] = (5,5)

from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

from keras import optimizers
from keras.utils import np_utils
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation

Using TensorFlow backend.


## 2. Machine learning models<a name="ml"></a>

 The classes to be predicted are either **1** (= successful) or **0** (= failed).
 
As a reminder, the base rates are 36.3% (successful) and 63.7% (failed) respectively.
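
For reference, base rates like these can be read directly from the label column. The sketch below uses a made-up `state` series rather than the real dataset, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical toy labels standing in for df["state"] (1 = successful, 0 = failed).
state = pd.Series([0, 0, 1, 0, 1, 0, 0, 1, 0, 0], name="state")

# Share of each class in the data.
class_shares = state.value_counts(normalize=True)

# Accuracy obtained by always predicting the majority class.
base_rate = class_shares.max()

print(class_shares)
print(f"Base rate: {base_rate:.3f}")
```

Any model worth keeping has to beat this majority-class accuracy.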
 
 We start our research using the first model for classification that we've seen in the course, logistic regression. We will experiment with other classification models to find the most accurate ones.

Numeric features are normalized between 0 and 1 in order to have a better basis for comparison.

In [157]:
# Rescale the numeric features to the [0, 1] range (run only when normalization is needed)

scaler = MinMaxScaler()
# Assign plain arrays so the values are matched by position, not by index label
df["usd_goal_real"] = scaler.fit_transform(df[["usd_goal_real"]]).ravel()
df["elapsed_time"] = scaler.fit_transform(pd.to_numeric(df["elapsed_time"]).to_numpy().reshape(-1, 1)).ravel()

We are setting the random state to an arbitrarily chosen number to enable comparison between models and their parameters.

In [3]:
RSEED = 10  # arbitrary seed, reused wherever a random_state is accepted
np.random.seed(RSEED)  # seeds NumPy's global generator, which scikit-learn falls back on
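
One way to make the comparison between models reproducible is to pass the seed explicitly as `random_state` to every scikit-learn call that accepts it. The sketch below is an illustration on made-up arrays; `SEED`, `X_demo` and `y_demo` are hypothetical names, not part of the notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split

SEED = 10  # assumed value, matching the arbitrary number chosen above

# Made-up data: 10 samples, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([0, 1] * 5)

# Two independent calls with the same seed.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=SEED)

# Every returned array (X_train, X_test, y_train, y_test) should be identical.
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)
```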

### A. Logistic regression<a name="logr"></a>

First, to check that our dataset is coherent, we create and test a model that predicts success or failure from the goal and the amount of money pledged. If the dataset is coherent, its accuracy should be very close to 100%, and indeed we reach near-perfection.
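
A similar coherence check can be sketched without training a model, by comparing the recorded state with the rule "pledged amount reaches the goal". The rows below are made up, and the column name `usd_pledged_real` is an assumption that does not appear in this notebook.

```python
import pandas as pd

# Hypothetical toy rows; `usd_pledged_real` is an assumed column name.
toy = pd.DataFrame({
    "usd_goal_real": [1000.0, 5000.0, 250.0, 10000.0],
    "usd_pledged_real": [1200.0, 300.0, 250.0, 9999.0],
    "state": [1, 0, 1, 0],
})

# Rule: a project counts as successful when pledges reach the goal.
rule_prediction = (toy["usd_pledged_real"] >= toy["usd_goal_real"]).astype(int)

# Share of rows where the rule agrees with the recorded state.
agreement = (rule_prediction == toy["state"]).mean()
print(f"Agreement with recorded state: {agreement:.2%}")
```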

Moving beyond the obvious, we start including other features such as the category and the country. We remove the amount pledged, since this is not a variable that the project creator has direct control over. Since the category and the country are categorical data, we need a one-hot encoder so that the regression model can work with them.
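
To illustrate what the two encoders produce, here is a minimal sketch on a made-up `main_category` column. Label encoding maps each category to one integer, which a linear model reads as an ordering; one-hot encoding creates one 0/1 column per category instead.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Made-up values, for illustration only.
toy = pd.DataFrame({"main_category": ["Music", "Games", "Film & Video", "Games"]})

# Label encoding: one integer per category (classes are sorted alphabetically).
label_codes = LabelEncoder().fit_transform(toy["main_category"])

# One-hot encoding: one indicator column per category, no ordering implied.
encoder = OneHotEncoder()
one_hot_matrix = encoder.fit_transform(toy[["main_category"]]).toarray()

print(label_codes)
print(encoder.categories_[0])
print(one_hot_matrix)
```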

Base rate = 0.6365718669220111

We first use logistic regression (the first method seen in class) to explore how different features affect accuracy.

After experimenting with different feature sets and training/testing multiple times, the specific category turns out to bring the highest marginal increase in accuracy. Because it has many distinct values, runtime is slow; using only the main category is faster, but less accurate. We also find that one-hot encoding performs much better than label encoding.

Simply using the category and the main category together is the most effective way to increase accuracy.

The elapsed time must be normalized to increase accuracy. Even so, the goal and the elapsed time do not improve accuracy: on their own they score roughly at the base rate, and they actually decrease accuracy when combined with more relevant features.

We therefore decide to keep only the category and the main category, which seem to work well together. Adding the country gives similar scores, so we leave it out to speed up runtime.

Best accuracy for logistic regression: 0.6698584625357895
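
The feature comparison described above can be sketched with a small helper that scores one feature subset at a time. This is an illustration on synthetic data, not the code used to obtain the figures in this section: `accuracy_for` is a hypothetical helper, and plain `LogisticRegression` stands in for `LogisticRegressionCV` to keep the example fast.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder


def accuracy_for(frame, categorical_cols, target="state", seed=10):
    """One-hot encode the given columns, fit a logistic regression, return test accuracy."""
    X = OneHotEncoder(handle_unknown="ignore").fit_transform(frame[categorical_cols])
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, frame[target], test_size=0.2, random_state=seed
    )
    model = LogisticRegression(solver="lbfgs", max_iter=1000)
    return model.fit(X_tr, y_tr).score(X_te, y_te)


# Synthetic data: success depends on main_category only, country is pure noise.
rng = np.random.RandomState(10)
toy = pd.DataFrame({
    "main_category": rng.choice(["Music", "Games", "Art"], size=500),
    "country": rng.choice(["US", "GB", "CH"], size=500),
})
success_prob = toy["main_category"].map({"Music": 0.7, "Games": 0.3, "Art": 0.5})
toy["state"] = (rng.uniform(size=500) < success_prob.to_numpy()).astype(int)

# Score each candidate feature subset with the same split and the same model.
results = {
    "main_category": accuracy_for(toy, ["main_category"]),
    "main_category + country": accuracy_for(toy, ["main_category", "country"]),
}
print(results)
```

Keeping the split seed fixed inside the helper means that differences between subsets come from the features rather than from a different train/test partition.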

In [4]:
# Numeric features and target
X = df[["usd_goal_real", "elapsed_time"]]
y = df["state"]

# One-hot encode the categorical features and append them to the numeric ones
one_hot = OneHotEncoder()
cat_to_onehot = pd.DataFrame(
    one_hot.fit_transform(df[["category", "main_category", "country"]]).toarray(),
    index=df.index,  # keep rows aligned with X when concatenating
)
X = pd.concat((X, cat_to_onehot), axis=1)

# Label-encoding alternative, kept for reference (it scored worse than one-hot encoding):
# le = LabelEncoder()
# for col in ["category", "main_category"]:
#     encoded = pd.DataFrame(le.fit_transform(df[col]), columns=[col])
#     X = pd.concat((X, encoded), axis=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [103]:
LR = LogisticRegressionCV(solver="lbfgs", cv=5, max_iter=1000)
LR.fit(X_train, y_train)
LR.score(X_test, y_test)

0.6661984765814921

In [None]:
classification_report(y_test, LR.predict(X_test), output_dict=True)
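
The dictionary returned with `output_dict=True` is easier to read as a table. A minimal sketch on made-up labels and predictions (not the model's actual output):

```python
import pandas as pd
from sklearn.metrics import classification_report

# Made-up labels and predictions (1 = successful, 0 = failed).
y_true = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 0, 1, 0]

# Keys of the report are the class labels as strings, plus the averages.
report = classification_report(y_true, y_pred, output_dict=True)

# One row per class or average, one column per metric.
report_table = pd.DataFrame(report).transpose()
print(report_table.round(3))
```

Per-class precision and recall matter here because the classes are imbalanced, so accuracy alone can hide a model that rarely predicts success.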

### B. Decision tree and random forest<a name="dtrf"></a>

We also tried the numerical features with decision trees and random forests. Both reach a maximum accuracy of 65.9% (the tree is more consistent across values of depth, but the maximum is the same).

The Gini and entropy criteria perform similarly.

With the same features that we kept for logistic regression (category and main category), we reach a score of 66.57%, the same as logistic regression, but that score is reached faster.

Accuracy by model:

LR: 0.6657392901518017

DT: 0.6657122791853493

RF: 0.6645778185943493
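
A comparison of the two split criteria can be sketched as follows. The data comes from `make_classification`, so the scores it prints say nothing about the Kickstarter dataset; the depth and seed are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification problem, for illustration only.
X_demo, y_demo = make_classification(n_samples=1000, n_features=10, random_state=10)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=10
)

# Fit one tree per criterion on the same split and record the test accuracy.
criterion_scores = {}
for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=10)
    criterion_scores[criterion] = tree.fit(X_tr, y_tr).score(X_te, y_te)

print(criterion_scores)
```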

In [77]:
scores = {}

for d in range(18,19):
    DT = DecisionTreeClassifier(criterion="entropy", max_depth=d)
    DT.fit(X_train, y_train)
    scores[d] = DT.score(X_test, y_test)

scores


{18: 0.6657122791853493}

In [71]:
scores = {}
for d in range(22,26):
    RF = RandomForestClassifier(criterion="entropy", n_estimators=15, max_depth=d, random_state=RSEED)
    RF.fit(X_train, y_train)
    scores[d] = RF.score(X_test, y_test)
scores

{22: 0.6614040300361947,
 23: 0.6614040300361947,
 24: 0.6645778185943493,
 25: 0.6621738425800875}

### C. k-nearest neighbors<a name="knn"></a>

In [None]:
scores = []
kMax = 0
for k in range(1, 100):
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)  # evaluate once per k; scoring is the slow step for KNN
    scores.append(score)
    if score >= max(scores):  # on ties, keep the larger k
        kMax = k
plt.plot(range(1, 100), scores)
plt.ylabel('accuracy', fontsize=15)
plt.xlabel('$k$', fontsize=15)
print("kMax: ", kMax)
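
Once the scores are collected, the best `k` can also be read off with NumPy instead of being tracked inside the loop. The sketch below uses made-up accuracies; note that `np.argmax` returns the first best `k`, whereas the loop above keeps the last one when several values tie.

```python
import numpy as np

# Made-up accuracies for k = 1..6, for illustration only.
k_values = np.arange(1, 7)
scores = np.array([0.61, 0.64, 0.66, 0.65, 0.66, 0.63])

best_score = scores.max()

# First k reaching the best score.
smallest_best_k = k_values[np.argmax(scores)]

# Last k reaching the best score (the tie-breaking used by the loop above).
largest_best_k = k_values[scores == best_score][-1]

print(smallest_best_k, largest_best_k)
```

Choosing the larger `k` among ties gives a smoother decision boundary, while the smaller one is cheaper at prediction time.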

### D. Neural network<a name="nn"></a>

The highest accuracy, 67.12%, is reached with the category and the main category only. Adding the country gives 66.67%.

With the numerical features, accuracy stays low.

With all features, accuracy is about 67%.

In [159]:
Y_train = np_utils.to_categorical(y_train, 2)
Y_test = np_utils.to_categorical(y_test, 2)

NN = Sequential()
NN.add(Dense(512, input_shape=(X.shape[1],)))
NN.add(Activation("relu"))
NN.add(Dropout(0.2))
NN.add(Dense(2))
NN.add(Activation("softmax"))

optimizer = optimizers.SGD(lr=0.0001, decay=1e-6, momentum=0.9, nesterov=True)
NN.compile(loss="categorical_crossentropy", optimizer=optimizer, metrics=["accuracy"])

NN.summary()

Model: "sequential_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
dense_3 (Dense)              (None, 512)               89600     
_________________________________________________________________
activation_3 (Activation)    (None, 512)               0         
_________________________________________________________________
dropout_2 (Dropout)          (None, 512)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 2)                 1026      
_________________________________________________________________
activation_4 (Activation)    (None, 2)                 0         
Total params: 90,626
Trainable params: 90,626
Non-trainable params: 0
_________________________________________________________________
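
The parameter counts in this summary can be checked by hand: a dense layer has one weight per input-unit pair plus one bias per unit. In the sketch below, `dense_params` is a hypothetical helper, and the input width of 174 is inferred from the summary (89600 / 512 - 1), not stated elsewhere in the notebook.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


n_features = 174  # inferred from the summary: 89600 / 512 - 1

hidden = dense_params(n_features, 512)  # first Dense layer
output = dense_params(512, 2)           # softmax output layer

print(hidden, output, hidden + output)  # 89600 1026 90626
```

Activation and dropout layers add no trainable parameters, which is why they show 0 in the summary.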


In [160]:
model_hist = NN.fit(X_train, Y_train, batch_size=64, epochs=30, verbose=1, validation_split=0.2)

Train on 236940 samples, validate on 59235 samples
Epoch 1/30 through Epoch 30/30 (per-epoch loss and accuracy output not preserved)


### E. Linear regression<a name="linr"></a>

## 3. Conclusion<a name="conclusion"></a>

