## Predicting Group Responsible - Initial Model Evaluation


### Description

The database records terrorist attacks worldwide from 1970 to 2016. For each attack it includes a description of the attack, the terrorist groups involved, the weapons used, the attack type, and so on.

However, certain incidents have not been attributed to any particular terrorist group. This model attempts to predict the terrorist group responsible for such attacks.

Information about the various features of the database has been detailed at: https://www.start.umd.edu/gtd/downloads/Codebook.pdf
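
As a rough illustration of the task, unattributed incidents carry the group name `Unknown` in the `gname` column (it appears in the output of cell 15). The sketch below uses a small made-up table, not the GTD file, to show how attributed and unattributed rows could be separated. The cells in this notebook keep `Unknown` as one of the encoded labels, so the split shown here is an assumption of the sketch rather than a step the notebook performs.

```python
import pandas as pd

# Toy stand-in for the GTD table (made-up rows); column names mirror the notebook.
toy = pd.DataFrame({
    "attacktype1": [3, 2, 3, 7],
    "weaptype1": [6, 5, 6, 8],
    "gname": ["Anarchists", "Unknown", "Punjabi Taliban", "Unknown"],
})

# Rows with a named group could serve as labelled training data.
attributed = toy[toy["gname"] != "Unknown"]
# Rows labelled 'Unknown' are the ones a trained model would be asked to predict.
unattributed = toy[toy["gname"] == "Unknown"]

print("attributed rows: %d, unattributed rows: %d"
      % (len(attributed), len(unattributed)))
```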


In [1]:
%matplotlib inline
import pandas as pd
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_excel("gtd_95to12_0617dist.xlsx")
display(data.head(n=1))

Unnamed: 0,eventid,iyear,imonth,iday,approxdate,extended,resolution,country,country_txt,region,...,addnotes,scite1,scite2,scite3,dbsource,INT_LOG,INT_IDEO,INT_MISC,INT_ANY,related
0,199501000001,1995,1,0,,0,NaT,217,United States,1,...,,,,,PGIS,-9,-9,0,-9,"199501000001, 199501000002, 199501000003"


In [3]:
#info about dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55055 entries, 0 to 55054
Columns: 135 entries, eventid to related
dtypes: datetime64[ns](1), float64(53), int64(24), object(57)
memory usage: 56.7+ MB


In [4]:
#total number of rows and columns present in data
print("Total no of rows and columns: %s" % (data.shape,))

Total no of rows and columns: (55055, 135)


In [5]:
#Removing columns in which at least min_threshold percent of the values are null
def remove_columns_missing_values(data, min_threshold):
    for col in data.columns:
        #percentage of null values in this column
        rate = data[col].isnull().sum()/float(len(data)) * 100
        if rate >= min_threshold:
            data = data.drop(col, axis=1)
    return data
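
The loop above can also be written without iterating over columns. The following is a minimal sketch of one vectorised alternative, run on a made-up frame. The helper name `drop_sparse_columns` and the toy data are hypothetical and are not part of this notebook.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(df, min_threshold):
    """Drop columns whose percentage of nulls is >= min_threshold."""
    # isnull() yields booleans; their column-wise mean is the fraction of nulls.
    null_pct = df.isnull().mean() * 100
    return df.loc[:, null_pct < min_threshold]

# Made-up data: 'b' is entirely null, 'c' is 20% null.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [np.nan, np.nan, np.nan, np.nan, np.nan],
    "c": [np.nan, 2.0, 3.0, 4.0, 5.0],
})

kept = drop_sparse_columns(toy, 80)
print(list(kept.columns))
```

Both versions use the same `>=` comparison against the threshold, so they should drop the same columns.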

In [7]:
data = remove_columns_missing_values(data , 80)
print("Total no of features available: %d" % len(data.columns))

Total no of features available: 66


In [8]:
columns_to_drop = ['INT_LOG' , 'INT_MISC', 'INT_ANY', 'INT_IDEO',
                   'eventid','extended','summary', 'scite1' , 'scite2' , 'scite3' , 'dbsource' , 
                   'provstate', 'location',  'city','nwoundte','propextent','nkillter', 
                   'guncertain1', 'nperpcap','nwoundus','nkillus','latitude','longitude',
                   'propcomment', 'weapdetail', 'corp1', 'motive', 'target1']
data = data.drop(columns_to_drop,axis = 1)

In [9]:
#Columns present after removing the columns with at least 80% null values and the columns listed above
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55055 entries, 0 to 55054
Data columns (total 38 columns):
iyear               55055 non-null int64
imonth              55055 non-null int64
iday                55055 non-null int64
country             55055 non-null int64
country_txt         55055 non-null object
region              55055 non-null int64
region_txt          55055 non-null object
specificity         55051 non-null float64
vicinity            55055 non-null int64
crit1               55055 non-null int64
crit2               55055 non-null int64
crit3               55055 non-null int64
doubtterr           55055 non-null int64
multiple            55055 non-null int64
success             55055 non-null int64
suicide             55055 non-null int64
attacktype1         55055 non-null int64
attacktype1_txt     55055 non-null object
targtype1           55055 non-null int64
targtype1_txt       55055 non-null object
targsubtype1        52031 non-null float64
targsubtype1_txt    52

In [10]:
#Removing columns with redundant, noisy and irrelevant data
columns_to_drop = ['country_txt','region_txt','crit1','crit2','crit3','propextent_txt','weapsubtype1_txt','weaptype1_txt',
                  'natlty1_txt','ransom','nperps','targsubtype1','weapsubtype1','specificity','nwound','nkill','targtype1_txt','targsubtype1_txt','attacktype1_txt']
data = data.drop(columns_to_drop,axis = 1)

In [11]:
#features after removing redundant, noisy and irrelevant data
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55055 entries, 0 to 55054
Data columns (total 19 columns):
iyear          55055 non-null int64
imonth         55055 non-null int64
iday           55055 non-null int64
country        55055 non-null int64
region         55055 non-null int64
vicinity       55055 non-null int64
doubtterr      55055 non-null int64
multiple       55055 non-null int64
success        55055 non-null int64
suicide        55055 non-null int64
attacktype1    55055 non-null int64
targtype1      55055 non-null int64
natlty1        54628 non-null float64
gname          55055 non-null object
individual     55055 non-null int64
claimed        45795 non-null float64
weaptype1      55055 non-null int64
property       55055 non-null int64
ishostkid      55052 non-null float64
dtypes: float64(3), int64(15), object(1)
memory usage: 8.0+ MB


In [12]:
#Total number of null values present in data
print("Total no of null values in data: %d" % data.isnull().values.sum())

Total no of null values in data: 9690


In [13]:
#filling null values with the column medians (numeric columns only; gname is text and has no nulls)
features = data.fillna(data.median(numeric_only=True))

In [14]:
#Verifying that no null values remain after filling
features.isnull().values.sum()

0
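
To make the fill step concrete, here is a small sketch on made-up values showing what median imputation does to numeric columns while leaving a text column untouched. The toy values are assumptions for illustration. For columns that hold category codes, filling with the most frequent value might be a more natural choice, but that is a design alternative and not something this notebook does.

```python
import numpy as np
import pandas as pd

# Made-up rows; column names mirror the notebook, values do not.
toy = pd.DataFrame({
    "claimed": [0.0, 1.0, np.nan, 0.0],
    "natlty1": [217.0, np.nan, 153.0, 153.0],
    "gname": ["A", "B", "A", "C"],   # text column, has no median
})

# numeric_only=True skips the text column when computing medians.
medians = toy.median(numeric_only=True)
# fillna with a Series fills each column with the value indexed by its name.
filled = toy.fillna(medians)

print(filled)
```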

In [15]:
#listing the unique values of gname to see which terrorist organisations are present in data
features["gname"].unique()

array([u'Anti-Abortion extremists', u'Unknown', u'Anarchists', ...,
       u'Militant Minority (Greece)', u'Punjabi Taliban',
       u'Biswabhumi Sena Bishal Nepal'], dtype=object)

In [16]:
#Encoding terrorist organisation names as integer labels so the model can be trained on them
features_new = features.drop("gname" , axis=1)
gname = features["gname"]
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(gname)
gname_encoded = le.transform(gname)
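
Because the classifier works with integer labels, predictions need to be mapped back to group names to be readable. A minimal sketch with a few example names follows. `inverse_transform` and `classes_` are part of scikit-learn's `LabelEncoder`; the short list of names is a made-up input.

```python
from sklearn import preprocessing

# A few example group names (made-up input for this sketch).
names = ["Unknown", "Anarchists", "Punjabi Taliban", "Unknown"]

le = preprocessing.LabelEncoder()
codes = le.fit_transform(names)

# classes_ holds the sorted unique labels; each code is an index into it.
print(list(le.classes_))
# inverse_transform maps integer codes back to the original names.
decoded = list(le.inverse_transform(codes))
print(decoded)
```

In the notebook, the same call on `pred` would turn predicted codes back into group names.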

In [18]:
#Splitting data into training and testing sets to validate the trained model
#80% training data and 20% testing data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(features_new, 
                                                    gname_encoded, 
                                                    test_size = 0.2, 
                                                    random_state = 0)

In [19]:
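#Training a random forest with default parameters on all features
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
clf = RandomForestClassifier()
clf = clf.fit(X_train, y_train)
pred = clf.predict(X_test)
#feature importances are used in the next cell to select the top features
important_features = clf.feature_importances_
acc = accuracy_score(y_test , pred)
print(acc)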

0.740986286441


In [20]:
#Keeping only the 9 most important features according to the model trained above
top_columns = X_train.columns.values[(np.argsort(important_features)[::-1])[:9]]
X_train_reduced = X_train[top_columns]
X_test_reduced = X_test[top_columns]
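
The selection above relies on `np.argsort`, which returns indices in ascending order of value, so the result is reversed before the first entries are taken. The sketch below walks through the same idea with made-up importances and `k = 3` instead of 9. The importance values are hypothetical and are not results from this notebook.

```python
import numpy as np

# Hypothetical importances for five column names (not results from the notebook).
columns = np.array(["country", "iyear", "suicide", "region", "iday"])
importances = np.array([0.40, 0.25, 0.05, 0.20, 0.10])

k = 3
# argsort is ascending, so reverse it and keep the first k positions.
top_idx = np.argsort(importances)[::-1][:k]
top_columns = columns[top_idx].tolist()

print(top_columns)
```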

In [27]:
#Using random forest to train, test and check the accuracy of the trained model with reduced features
#(RandomForestClassifier and accuracy_score were imported in the previous model cell)
clf = RandomForestClassifier(n_estimators=100 ,criterion='entropy', random_state=10)
clf = clf.fit(X_train_reduced, y_train)
pred = clf.predict(X_test_reduced)
acc = accuracy_score(y_test , pred)
print(acc)

0.743801652893


### Final evaluation


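After reducing the number of features to the nine most important ones and tuning the parameters of the random forest classifier (`n_estimators=100`, `criterion='entropy'`), the accuracy score improved only marginally, from about 0.741 to about 0.744.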
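
Accuracy figures like these are easier to interpret next to a trivial baseline, particularly if one label is much more common than the others. This notebook does not report such a baseline. The sketch below shows one way it could be computed, using scikit-learn's `DummyClassifier` on synthetic data. The data, the class balance and the forest settings are all assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption); these are not GTD features.
rng = np.random.RandomState(0)
X = rng.randint(0, 10, size=(500, 4))
y = (X[:, 0] > 6).astype(int)   # roughly 30% of rows fall in class 1

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Baseline: always predict the most frequent training label.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_acc = accuracy_score(y_test, baseline.predict(X_test))

# A random forest with hypothetical settings chosen for this toy data.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest = forest.fit(X_train, y_train)
forest_acc = accuracy_score(y_test, forest.predict(X_test))

print("baseline accuracy: %.3f" % baseline_acc)
print("random forest accuracy: %.3f" % forest_acc)
```

Applied to the notebook's own `X_train`, `y_train`, `X_test` and `y_test`, the same baseline would show how much of the reported accuracy goes beyond always guessing the most common label.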