This file 'GradientBoosting.ipynb' describes the Gradient Boosting analysis.
outcome: gname, 63 categories, each having at least 150 entries
features: 35, listed in final_data_list.xlsx
rows (entries): 38251 total entries

In [None]:
# This code fits the Gradient Boosting classifier with the best settings found
# (n_estimators = 200 and max_depth = 5). See the section below for a summary of the results.

import math
import numpy as np
import re
import string
import scipy
import pandas as pd
import sklearn
import matplotlib.pyplot as plt
import datetime as dt
import sys

from sklearn.ensemble import GradientBoostingClassifier
# Imputer was removed in scikit-learn 0.22; newer releases provide sklearn.impute.SimpleImputer.
from sklearn.preprocessing import Imputer
from sklearn import metrics   # additional scikit-learn functions

t1 = dt.datetime.now() 

home = r"C:\Users\ibshi\Desktop\startup.ml\challenge 2\global terrorism\data"
infile = home + r"\final_data.xlsx"
final_data = pd.read_excel(infile)

infile = home + r"\Y_out.xlsx"
Y = pd.read_excel(infile)

infile = home + r"\Y_in.csv"
Y1 = pd.read_csv(infile,names=['gname'])
                 
# Make the categorical variables into a series of dummy variables. 

catlist = ['country',
'region',
'attacktype1',
'attacktype2',
'targtype1',
'targsubtype1',
'claimmode',
'weaptype1',
'WeapRecode1',
'weaptype2',
'WeapRecode2',
'hostkidoutcome']

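for cat in catlist:
    # Prefix each dummy column with its source variable so that names stay unique
    # (e.g. attacktype1 and attacktype2 share the same category codes).
    hold = pd.get_dummies(final_data[cat], prefix=cat)
    final_data = pd.concat([final_data, hold], axis=1)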

final_data.drop(catlist,inplace=True,axis=1)

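# Impute NaN values (replace with the mean).
# Note: in the legacy Imputer, axis=1 imputes along rows, so each NaN is replaced
# by the mean of its own row; axis=0 would use the column (feature) mean instead.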

imp = Imputer(missing_values='NaN', strategy='mean', axis=1)
imputed_data = pd.DataFrame(imp.fit_transform(final_data))

# Select entries randomly for test and training sets

prop_train = 0.5
numtrain = int(prop_train * len(Y1))

# Mark exactly numtrain rows as training (1), then shuffle the marks.
sorter = pd.DataFrame(data=np.zeros(len(Y1)), columns=['sorter'])
sorter.iloc[:numtrain, 0] = 1
sorter['sorter'] = np.random.permutation(sorter['sorter'].values)

t_data = pd.concat([sorter,imputed_data],axis=1)
t_Y = pd.concat([sorter,Y1],axis=1)

# Split on the marker column and drop it in the same step; this avoids
# in-place modification of a slice (SettingWithCopyWarning).
train = t_data[t_data.sorter == 1].drop('sorter', axis=1)
trainY = t_Y[t_Y.sorter == 1].drop('sorter', axis=1)

test = t_data[t_data.sorter == 0].drop('sorter', axis=1)
testY = t_Y[t_Y.sorter == 0].drop('sorter', axis=1)

# put Y in proper structure for gradient boosting

trainYa = np.ravel(trainY)
testYa = np.ravel(testY)

# run Gradient Boosting

GB = GradientBoostingClassifier(n_estimators=200, max_depth=5)

GB.fit(train,trainYa)
out = GB.predict(test)
outDF = pd.DataFrame(out)
sc_test = GB.score(test, testYa)
print("accuracy on test set")
print(sc_test)
sc_train = GB.score(train, trainYa)
print("accuracy on train set")
print(sc_train)

# report the elapsed run time in seconds
t2 = dt.datetime.now()
tdiff = t2 - t1
print(tdiff.total_seconds())
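
The cell above targets an older scikit-learn release. As a rough sketch only, the same three steps (dummy coding, imputation, 50/50 split) might look like this in a recent release. The toy table, the nkill column, and the label values are hypothetical stand-ins and not the real data; SimpleImputer fills with column means, which differs from the row-wise axis=1 setting used above; and the stratified split is a swapped-in choice rather than the random marker column used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Toy stand-in for final_data and Y1 (hypothetical values, not the real data).
toy = pd.DataFrame({
    "region": [1, 2, 1, 3, 2, 3, 1, 2],
    "nkill": [0.0, 2.0, np.nan, 5.0, 1.0, np.nan, 3.0, 0.0],
})
labels = pd.Series(["A", "B", "A", "B", "A", "B", "A", "B"], name="gname")

# One-hot encode the categorical column; dummy names become region_1, region_2, ...
encoded = pd.get_dummies(toy, columns=["region"], dtype=float)

# Replace each NaN with the mean of its column (feature).
imputer = SimpleImputer(strategy="mean")
imputed = pd.DataFrame(imputer.fit_transform(encoded), columns=encoded.columns)

# 50/50 split, stratified so that both halves contain every group.
X_train, X_test, y_train, y_test = train_test_split(
    imputed, labels, test_size=0.5, stratify=labels, random_state=0
)

print(imputed.shape, len(X_train), len(X_test))
```

With 63 groups of at least 150 entries each, stratifying keeps every group represented in both halves, which a purely random marker does not guarantee.
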

The results of the Gradient Boosting are found in <results gradient boosting.xlsx>.
With the default settings, test set accuracy = 0.913002562 and
training set accuracy = 0.968207488.

Some tuning was attempted. Changing learning_rate from the default 0.1 to either
0.05 or 0.2 did not improve the fit, so the default learning rate was
kept for all subsequent runs.

Setting n_estimators = 200 gave somewhat better fits than 100 (the default) or 400,
so that value was used subsequently.

Varying subsample, min_samples_split, and min_samples_leaf did not seem to have large
effects on the fits compared to the default values.
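
The tuning described above was done by hand. One possible way to automate such a comparison is a cross-validated grid search, sketched here under stated assumptions: the data are synthetic (make_classification), the grid is scaled down so that it runs quickly, and GridSearchCV is a swapped-in technique, not the procedure that produced the reported numbers.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the real features and gname labels.
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Small illustrative grid: the learning rates match those tried in the text,
# the estimator counts are reduced to keep the run short.
param_grid = {
    "learning_rate": [0.05, 0.1, 0.2],
    "n_estimators": [10, 20],
}

search = GridSearchCV(
    GradientBoostingClassifier(max_depth=3, random_state=0),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data only
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Because the settings are chosen by cross-validation, the test set stays untouched until the final evaluation.
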

Varying max_depth did have an effect (see 'GB max depth.jpg'). At max_depth = 5,
both test and training accuracy appear to level off. Thus, these settings
(n_estimators = 200 and max_depth = 5) were judged to give the best model.

test set accuracy = 0.91562
training set accuracy = 0.99885
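
A depth comparison of the kind summarized above could be reproduced along these lines. This is a minimal sketch on synthetic data with a reduced number of estimators, so its numbers will not match the figures reported here; the depth values and dataset sizes are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real data, split 50/50 as in the notebook.
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# Fit one model per depth and record (training accuracy, test accuracy).
scores = {}
for depth in (1, 3, 5):
    model = GradientBoostingClassifier(n_estimators=30, max_depth=depth,
                                       random_state=0)
    model.fit(X_train, y_train)
    scores[depth] = (model.score(X_train, y_train),
                     model.score(X_test, y_test))

for depth, (train_acc, test_acc) in sorted(scores.items()):
    print("max_depth=%d  train=%.3f  test=%.3f  gap=%.3f"
          % (depth, train_acc, test_acc, train_acc - test_acc))
```

The gap column is the quantity to watch: training accuracy that keeps rising while test accuracy stays flat indicates that extra depth is only fitting noise.
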
