# Feature Selection

Selecting only the attributes (features) that contribute to predicting the target variable is necessary.
Unnecessary features can degrade the performance of the predictive model and lower its accuracy.
Only the relevant features should be used to train the model.

In [1]:
import pandas as pd
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


In [2]:
data = pd.read_csv("A:/MinorProjectData/GlobalTerrorCleanPartTwo.csv")

In [3]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165744 entries, 0 to 165743
Data columns (total 48 columns):
eventid             165744 non-null float64
iyear               165744 non-null int64
imonth              165744 non-null int64
iday                165744 non-null int64
extended            165744 non-null int64
country             165744 non-null int64
country_txt         165744 non-null object
region              165744 non-null int64
region_txt          165744 non-null object
provstate           165744 non-null object
city                165744 non-null object
latitude            165744 non-null float64
longitude           165744 non-null float64
specificity         165744 non-null int64
vicinity            165744 non-null int64
crit1               165744 non-null int64
crit2               165744 non-null int64
crit3               165744 non-null int64
doubtterr           165744 non-null int64
multiple            165744 non-null int64
success             165744 non-null int

In [4]:
# Convert the categorical columns to the pandas 'category' dtype

category_att = ['country_txt', 'region_txt', 'specificity',
                'attacktype1_txt', 'targtype1_txt',
                'targsubtype1_txt', 'natlty1_txt',
                'weaptype1_txt', 'weapsubtype1_txt']

for col in category_att:
    data[col] = data[col].astype('category')

In [99]:
data.info(verbose=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165744 entries, 0 to 165743
Data columns (total 48 columns):
eventid             165744 non-null float64
iyear               165744 non-null int64
imonth              165744 non-null int64
iday                165744 non-null int64
extended            165744 non-null int64
country             165744 non-null int64
country_txt         165744 non-null category
region              165744 non-null int64
region_txt          165744 non-null category
provstate           165744 non-null object
city                165744 non-null object
latitude            165744 non-null float64
longitude           165744 non-null float64
specificity         165744 non-null category
vicinity            165744 non-null int64
crit1               165744 non-null int64
crit2               165744 non-null int64
crit3               165744 non-null int64
doubtterr           165744 non-null int64
multiple            165744 non-null int64
success             165744 non-n

Checking how strongly each candidate column relates to the target variable, using scikit-learn's built-in feature selection methods (chi-squared scores computed through SelectKBest)

In [5]:
#Value used for prediction
X = pd.get_dummies(data['country_txt'])

#Value To Predict
Y = data['weaptype1_txt']

In [6]:
feature_test = SelectKBest(score_func=chi2, k=3)

In [7]:
sel_features = feature_test.fit(X , Y)

In [8]:
print(max(sel_features.scores_))
min(sel_features.scores_)

6139.626388483619


0.5277589965964216
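
The maximum and minimum above do not show which dummy columns the scores belong to. A minimal sketch of how the scores could be mapped back to column names, assuming a small hypothetical `toy` frame in place of the real data (the category values `A`, `B`, `C` and the `dtype=int` argument are illustrative choices, not part of the original notebook):

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical toy data standing in for the notebook's DataFrame.
toy = pd.DataFrame({
    "country_txt": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "weaptype1_txt": ["Firearms", "Firearms", "Explosives", "Explosives",
                      "Firearms", "Explosives", "Firearms", "Explosives"],
})

# One-hot encode the predictor; dtype=int keeps the matrix numeric.
X = pd.get_dummies(toy["country_txt"], dtype=int)
Y = toy["weaptype1_txt"]

selector = SelectKBest(score_func=chi2, k=2).fit(X, Y)

# Pair every dummy column with its chi-squared score and p-value.
scores = pd.DataFrame(
    {"score": selector.scores_, "p_value": selector.pvalues_},
    index=X.columns,
).sort_values("score", ascending=False)
print(scores)

# Names of the k columns the selector would keep.
kept = list(X.columns[selector.get_support()])
print(kept)
```

In this toy data, `A` and `B` each occur with only one weapon type, while `C` is split evenly, so `C` receives the lowest score and is dropped.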

In [9]:
#Value used for prediction
X = pd.get_dummies(data['region_txt'])

sel_features = feature_test.fit(X , Y)

print(max(sel_features.scores_))
min(sel_features.scores_)

4250.926157843302


38.45490794691512

In [10]:
#Value used for prediction
X = pd.get_dummies(data['attacktype1_txt'])

sel_features = feature_test.fit(X , Y)

print(max(sel_features.scores_))
min(sel_features.scores_)

93437.00489459957


616.5008918704337
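
The three cells above repeat the same steps for each candidate column. One way to collect the results in a single table is sketched below. It calls `chi2` directly, which is the function `SelectKBest` uses to fill `scores_`. The helper name `chi2_score_range` and the `toy` frame are hypothetical, introduced only for illustration.

```python
import pandas as pd
from sklearn.feature_selection import chi2


def chi2_score_range(frame, column, target):
    """Return (max, min) chi-squared score over the one-hot columns of `column`."""
    dummies = pd.get_dummies(frame[column], dtype=int)
    scores, _p_values = chi2(dummies, frame[target])
    return scores.max(), scores.min()


# Hypothetical toy data standing in for the notebook's DataFrame.
toy = pd.DataFrame({
    "country_txt": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "attacktype1_txt": ["Assault", "Assault", "Bombing", "Bombing",
                        "Assault", "Bombing", "Assault", "Bombing"],
    "weaptype1_txt": ["Firearms", "Firearms", "Explosives", "Explosives",
                      "Firearms", "Explosives", "Firearms", "Explosives"],
})

rows = []
for col in ["country_txt", "attacktype1_txt"]:
    highest, lowest = chi2_score_range(toy, col, "weaptype1_txt")
    rows.append({"feature": col, "max_score": highest, "min_score": lowest})

summary = pd.DataFrame(rows).set_index("feature")
print(summary)
```

Comparing the maximum score across columns is only a rough heuristic, since the statistic grows with the number of rows in each category; it is probably best read alongside the p-values.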

FINAL LIST OF PREDICTOR VARIABLES

In [17]:
X = pd.get_dummies(data[['country_txt','attacktype1_txt','region_txt']])
X['nkill'] = data['nkill']
X['nwound'] = data['nwound']

In [18]:
Y  = data['weaptype1_txt']
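
To make the shape of the predictor matrix concrete, the sketch below applies the same two steps to a hypothetical three-row frame (the values are made up). It shows that `pd.get_dummies` on several columns prefixes each dummy column with its source column name, and that the numeric columns are appended unchanged.

```python
import pandas as pd

# Hypothetical rows; only the column names come from the notebook.
toy = pd.DataFrame({
    "country_txt": ["Iraq", "India", "Iraq"],
    "attacktype1_txt": ["Bombing", "Assault", "Bombing"],
    "nkill": [3, 0, 1],
})

# One-hot encode the categorical columns, then attach the numeric one.
X_toy = pd.get_dummies(toy[["country_txt", "attacktype1_txt"]], dtype=int)
X_toy["nkill"] = toy["nkill"]

print(list(X_toy.columns))
print(X_toy.shape)
```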

Using a decision tree to rank how relevant each column is for our modeling

In [89]:
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()
# Integer-coded columns used as candidate predictors.
# Note: 'targtype1' is listed twice, so it also appears twice in the importances printed below.
X = data[['country', 'attacktype1', 'nkill',
          'nwound', 'region', 'targtype1',
          'iyear', 'imonth', 'iday', 'extended', 'targtype1', 'crit1',
          'property', 'targsubtype1', 'natlty1', 'success', 'guncertain1',
          'specificity', 'vicinity', 'doubtterr', 'individual', 'multiple']]
clf = clf.fit(X, Y)

In [90]:
# Print each column name next to its impurity-based importance
for colname, importance in zip(X.columns, clf.feature_importances_):
    print(colname, importance)

country 0.02112874198830816
attacktype1 0.7091105785244533
nkill 0.018443057586382663
nwound 0.014275293713675929
region 0.01431645611673498
targtype1 0.00766005012397148
iyear 0.0370851110979855
imonth 0.030887138208750907
iday 0.0401299644052795
extended 0.0017874119197285505
targtype1 0.0064943633939106345
crit1 0.0010635452653146896
property 0.01445133814195001
targsubtype1 0.033427389147072155
natlty1 0.020267935002572332
success 0.0035031797265917654
guncertain1 0.003472141001441419
specificity 0.007541528944307166
vicinity 0.003060299872051006
doubtterr 0.006850266747872304
individual 0.0007356374542667407
multiple 0.004308571617378555
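
The importances are easier to read when sorted. A minimal sketch using a pandas Series, with a handful of the values printed above copied in by hand; in the notebook itself the Series would presumably be built as `pd.Series(clf.feature_importances_, index=X.columns)`.

```python
import pandas as pd

# A subset of the importances printed above, copied by hand for illustration.
importances = pd.Series({
    "country": 0.02112874198830816,
    "attacktype1": 0.7091105785244533,
    "nkill": 0.018443057586382663,
    "iyear": 0.0370851110979855,
    "iday": 0.0401299644052795,
    "individual": 0.0007356374542667407,
})

# Sort from most to least important.
ranked = importances.sort_values(ascending=False)
print(ranked)
```

Because `DecisionTreeClassifier()` is created without a `random_state`, the exact values may differ slightly between runs.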


In [96]:
# Keep the features whose importance is at or above the 70th percentile
percentile_values = np.percentile(clf.feature_importances_, 70)
selected_features = list()
for idx, importance in enumerate(clf.feature_importances_):
    if importance >= percentile_values:
        print(X.columns[idx], importance)
        # Stores the positional index of the column, not its name
        selected_features.append(idx)

country 0.02112874198830816
attacktype1 0.7091105785244533
iyear 0.0370851110979855
imonth 0.030887138208750907
iday 0.0401299644052795
targsubtype1 0.033427389147072155
natlty1 0.020267935002572332
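
The same percentile cut can be written with a boolean mask instead of a loop. The sketch below uses the 22 importances printed earlier, copied by hand, so it runs without the fitted `clf`. The names `threshold`, `mask`, and `selected_names` are illustrative and do not appear in the notebook.

```python
import numpy as np

# Column names and importances copied from the output above.
columns = ["country", "attacktype1", "nkill", "nwound", "region", "targtype1",
           "iyear", "imonth", "iday", "extended", "targtype1", "crit1",
           "property", "targsubtype1", "natlty1", "success", "guncertain1",
           "specificity", "vicinity", "doubtterr", "individual", "multiple"]
importances = np.array([
    0.02112874198830816, 0.7091105785244533, 0.018443057586382663,
    0.014275293713675929, 0.01431645611673498, 0.00766005012397148,
    0.0370851110979855, 0.030887138208750907, 0.0401299644052795,
    0.0017874119197285505, 0.0064943633939106345, 0.0010635452653146896,
    0.01445133814195001, 0.033427389147072155, 0.020267935002572332,
    0.0035031797265917654, 0.003472141001441419, 0.007541528944307166,
    0.003060299872051006, 0.006850266747872304, 0.0007356374542667407,
    0.004308571617378555,
])

# 70th percentile of the importances (linear interpolation by default).
threshold = np.percentile(importances, 70)

# True for every feature at or above the threshold.
mask = importances >= threshold
selected_names = [name for name, keep in zip(columns, mask) if keep]
print(threshold)
print(selected_names)
```

This variant collects column names rather than positional indices, which may be more convenient for selecting columns with `data[selected_names]`.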
