# SC1015 Data Science Project - Ryan Yu & Vaishob 
---

# Introduction to topic

Terrorism is defined as political violence in an asymmetrical conflict that is designed to induce terror and psychic fear (sometimes indiscriminate) through the violent victimization and destruction of noncombatant targets (sometimes iconic symbols).

The Global Terrorism Database (GTD)™ is the most comprehensive unclassified database of terrorist attacks in the world. It is an open-source database, which provides information on domestic and international terrorist attacks around the world since 1970, and now includes more than 200,000 events. For each event, a wide range of information is available, including the date and location of the incident, the weapons used, nature of the target, the number of casualties, and – when identifiable – the group or individual responsible.

## Practical Motivation

 ---
In the past, intelligence agencies have warned governments about many terror attacks in advance. For varying reasons, however, governments have sometimes disregarded these warnings, in some cases at the cost of many lives. By building a model that can accurately predict the success of future terror attacks, we can help governments allocate resources more wisely and, more importantly, save more lives.


## Data Preparation & Cleaning

 ---
### Import Libraries

In [5]:
# Basic Libraries
import datetime
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt # we only need pyplot
import matplotlib.patches as mpatches 
import matplotlib.font_manager as font_manager
sb.set() # set the default Seaborn style for graphics

# Import essential models and functions from sklearn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score, confusion_matrix, classification_report
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier

from sklearn import datasets
from sklearn import tree
import plotly.express as px
import folium
from folium.plugins import HeatMapWithTime

from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator
from DataSynthesizer.ModelInspector import ModelInspector
from DataSynthesizer.lib.utils import read_json_file, display_bayesian_network

### Import dataset

In [None]:
# read file using .read_excel
df = pd.read_excel("globalterrorismdb_0221dist.xlsx")
df 

### Exploring dataset

In [None]:
df.isnull().sum()

## Data cleaning 

In [None]:
# Features required for prediction; these are used by our RandomForestClassifier model
feature = [
    'iyear',        # year in which the event occurred
    'extended',     # did the event last more than 24 hours?
    'vicinity',     # did the attack occur in the vicinity of a city?
    'doubtterr',    # is there any doubt that it was a terrorist attack?
    'multiple',     # is the event linked to other attacks?
    'suicide',      # was this a suicide attack?
    'claimed',      # has any terrorist group claimed the attack?
    'property',     # was there any property damage?
    'ishostkid',    # were there any hostages or kidnappings?
    'country',      # country code
    'region',       # region code
    'attacktype1',  # method of attack used
    'targtype1',    # type of target
    'weaptype1',    # weapon used
    'longitude',
    'latitude',
]

##### Label encoding: map each numeric category code to its text label

In [None]:
# Build lookup dictionaries from each numeric code to its text label
cntry = dict(zip(df.country,df.country_txt))
weapx = dict(zip(df.weaptype1,df.weaptype1_txt))
atk = dict(zip(df.attacktype1,df.attacktype1_txt))
targ = dict(zip(df.targtype1,df.targtype1_txt))
print("Country \n",cntry)
print("\n Weapon type \n",weapx)
print("\n Attack type \n",atk)
print("\n Target type \n",targ)

#### Isolate and clean the required data by dropping all null values to obtain a dataframe, `frame`

In [None]:
feat = [
    'iyear',        # year in which the event occurred
    'imonth',       # month in which the event occurred
    'extended',     # did the event last more than 24 hours?
    'vicinity',     # did the attack occur in the vicinity of a city?
    'doubtterr',    # is there any doubt that it was a terrorist attack?
    'multiple',     # is the event linked to other attacks?
    'suicide',      # was this a suicide attack?
    'claimed',      # has any terrorist group claimed the attack?
    'property',     # was there any property damage?
    'ishostkid',    # were there any hostages or kidnappings?
    'country',      # country code
    'region',       # region code
    'attacktype1',  # method of attack used
    'targtype1',    # type of target
    'weaptype1',    # weapon used
    'longitude',
    'latitude',
    'success',      # response variable: was the attack successful?
]
frame = df[feat]
# drop all rows that contain NaN values
frame = frame.dropna()
frame

## Exploratory Data Analysis & Data Visualisation

 ---

##### Initially we use the original dataframe, `df`. Subsequently we will use the `frame` dataframe that was created

**Uni-variate Visualization**

Identify patterns occurring in the dataset using univariate visualization

**What proportion of past terrorist attacks was successful?**

In [None]:
df_copy = df.copy()
df_copy["success"] = df["success"].replace({1:"Successful Operation", 0:"Failed Operation"})
df_copy["success"].value_counts().plot(kind="pie", figsize=(20,10), autopct="%1.1f%%", cmap="summer")
plt.title("Successful & Failed Operations")

**What is the trend of number of terrorist attacks over the years?**

In [None]:
# Series.value_counts() replaces the deprecated top-level pd.value_counts()
barplot = (
    df['iyear'].value_counts()
    .sort_index()
    .plot.bar(width=0.8, figsize=(16, 8), align='center', title="Yearly count of terrorist attacks")
)

**What is the trend of terrorist activities by region?**

In [None]:
# Rename on a copy instead of renaming the original column in place
terror_region = pd.crosstab(df['iyear'], df['region_txt'].rename("Region"))
terror_region.plot(color=sb.color_palette('Set1', 12))
fig = plt.gcf()
fig.set_size_inches(18, 6)
plt.show()


**Here we note a distinct spike in the number of attacks in the 2012-2014 period. This is reflected across all major regions, namely the Middle East, North Africa, Sub-Saharan Africa, South Asia, Australasia & Oceania and Eastern Europe.**

**Which Terrorist Group has carried out the largest number of attacks?**

In [None]:
# dropna() returns a new Series, so the result has to be assigned
group_name = df_copy["gname"].dropna()
# Terrorist groups in the dataset
print("Number of terrorist groups:", group_name.nunique())

# Number of attacks per terrorist group
print(group_name.value_counts())
#sb.catplot(y = "gname", data = df_copy['gname'], kind = "count")

While there is a vast number of terrorist groups, it is interesting to note the number of attacks without a designated perpetrator (89231 in total). This shows how sparse information can be, even in an area that concerns national security.
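
To put the number of unattributed attacks in context, it can help to express it as a share of all attacks. The following is a minimal sketch on a toy Series; `gname_demo` is a hypothetical stand-in for the real `gname` column and its values are invented for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for df["gname"]; the real values come from the GTD file
gname_demo = pd.Series(["Unknown", "Taliban", "Unknown", "Al-Shabaab", "Unknown"])

# A boolean mask averages to the fraction of rows that are True
unknown_share = (gname_demo == "Unknown").mean()
print(f"Attacks without a designated perpetrator: {unknown_share:.1%}")
```

On the real data, the same expression applied to `df["gname"]` would give the share of attacks recorded as `Unknown`.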

In [None]:
# Drop all 'Unknown' entries (no valid terrorist group name)
# replace() and dropna() both return new Series, so we assign the result
gname = group_name.replace('Unknown', np.nan).dropna()

print(gname.value_counts())

#df['gname'].value_counts()


**Distribution of the top 5 terrorist groups in terms of number of attacks**

In [None]:
data = {
    'Group Name': ['Taliban', 'Islamic State of Iraq and the Levant (ISIL)', 'Shining Path (SL)', 'Al-Shabaab', 'FMLN'],
    'Values': [10094, 6864, 4563, 4126, 3351]
}

group_prop = pd.DataFrame(data)
group_prop.reset_index(drop=True, inplace=True)

Groups = group_prop['Group Name']
Values = group_prop['Values']

# compute the proportion of each group with respect to the total
total_values = Values.sum()
category_proportions = [value / total_values for value in Values]

print(category_proportions)

width = 50    # width of chart 
height = 10    # height of chart

total_num_tiles = width * height # total number of tiles

tiles_per_category = [round(proportion * total_num_tiles) for proportion in category_proportions]

# initialize the waffle chart as an empty matrix
waffle_chart = np.zeros((height, width))

# define indices to loop through waffle chart
category_index = 0
tile_index = 0

# populate the waffle chart
for col in range(width):
    for row in range(height):
        tile_index +=1
        
        # if the number of tiles populated for the current category is equal to its allocated tiles...
        if tile_index > sum(tiles_per_category[0:category_index]):
            # ... proceed to the next category
            category_index += 1
        
        # set the class value to an integer, which increases with class
        waffle_chart[row, col] = category_index

# instantiate figure object
fig = plt.figure()

# use matshow to display the waffle chart
colormap = plt.cm.viridis
plt.matshow(waffle_chart, cmap=colormap)
#plt.colorbar()

# get the axis 
ax = plt.gca()

# set minor ticks
ax.set_xticks(np.arange(-.5, (width), 1), minor=True)
ax.set_yticks(np.arange(-.5, (height), 1), minor=True)

# add gridlines based on minor ticks
ax.grid(which='minor', color='w', linestyle='-', linewidth=2)
ax.set_title('Proportion of attacks by top 5 terrorist groups', fontdict={'fontsize': 22, 'fontweight': 'medium'})

plt.xticks([])
plt.yticks([])

# compute cumulative sum of individual categories
values_cumsum = np.cumsum(data['Values'])
total_values = values_cumsum[len(values_cumsum) - 1]

# create legend
legend_handles = []
for indx, name in enumerate(Groups):
    label_text = f'{name}: ({Values[indx]})'
    # matshow scales the tile values 1..5 onto 0..1, so use the same scale for the legend colours
    color_value = colormap(indx / (len(Groups) - 1))
    legend_handles.append(mpatches.Patch(color=color_value, label=label_text))

font = font_manager.FontProperties(
    family='Arial',
    weight='bold',
    style='normal',
    size=14,
)

# add legend to chart
plt.legend(handles=legend_handles,
          loc='lower center', 
          ncol=len(Groups),
          prop=font, 
          bbox_to_anchor=(0., -0.3, 0.95, .1)
          )

In [None]:
# Count the attacks per country.
# rename_axis/reset_index(name=...) gives the same column names across pandas versions
adf = df["country_txt"].value_counts().rename_axis("country").reset_index(name="count")
print(adf.head(n=10).to_string(index=False))

**Where do these attacks occur?**

In [None]:
plt.subplots(figsize=(16,8))
# Recent seaborn versions require the column to be passed by keyword (x=...)
sb.countplot(x='country_txt', data=df, palette='inferno', order=df["country_txt"].value_counts().index[:10])
plt.xticks(rotation=90)
plt.title('Countries Most Affected By Terrorism')
plt.xlabel("Country")
plt.grid(axis="y")
plt.show()

###### It appears that the Middle East and South Asia are the biggest hotspots for terror attacks. They are followed by South American nations, the Philippines and the United Kingdom.

**Which Types of Attacks are Common?**

In [None]:
# Rename on a copy instead of renaming the original column in place
attack_df = df['attacktype1_txt'].rename("Attack Type")
attack_df.value_counts().plot(kind='pie', figsize=[16,12], autopct='%1.1f%%')
plt.title("Most Common Attack Methods")

###### Bombing/Explosion comes out on top, comprising almost half of all terrorist attacks. It is followed by Armed Assault, with 23.6% of attacks. 

**Popular targets of terrorists?**

In [None]:
targ_df = df['targtype1_txt'].rename("Target Type")
plt.subplots(figsize=(15,6))
# Recent seaborn versions require the data to be passed by keyword (x=...)
sb.countplot(x=targ_df, palette='inferno', order=targ_df.value_counts().index)
plt.xticks(rotation=90)
plt.title('Popular Targets')
plt.show()


##### Private citizens and property seem to be most commonly targeted by terrorist groups. This is followed by the Military, Police, Government and Businesses


**Common Types of Attacks by Region?**

In [None]:
pd.crosstab(df.region_txt,df.attacktype1_txt).plot.barh(stacked=True,width=1,color=sb.color_palette('RdBu',9))
fig=plt.gcf()
fig.set_size_inches(16,8)
plt.ylabel("Region")
plt.show()

In most regions, we see that **bombing/explosion** is the most commonly used method of attack. It is worth noting that in **Sub-Saharan Africa**, as well as **Central America & Caribbean**, **armed assault** is more prevalent than bombing.
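
The dominant attack type per region can also be read off the crosstab directly rather than from the chart. This is a small sketch on invented rows; `demo` is a hypothetical stand-in for `df`, and only the column names `region_txt` and `attacktype1_txt` are taken from the dataset.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook would use df instead
demo = pd.DataFrame({
    "region_txt": ["South Asia", "South Asia", "South Asia",
                   "Sub-Saharan Africa", "Sub-Saharan Africa", "Sub-Saharan Africa"],
    "attacktype1_txt": ["Bombing/Explosion", "Bombing/Explosion", "Armed Assault",
                        "Armed Assault", "Armed Assault", "Bombing/Explosion"],
})

# Rows are regions, columns are attack types, cells are attack counts
counts = pd.crosstab(demo["region_txt"], demo["attacktype1_txt"])

# idxmax(axis=1) returns, for each region, the column with the largest count
dominant = counts.idxmax(axis=1)
print(dominant)
```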

## Analytic Visualization
---
### In order to find patterns, we need to find better visualization techniques...

##### Is there a way to incorporate past-year attacks on the world map? 


In [None]:
coords = frame[['latitude','longitude','success']]
coords


In [None]:
from collections import defaultdict, OrderedDict
data=defaultdict(list)
for x in frame.itertuples():
    if (x.success): #add only if it is successful
        data[x.iyear].append([x.latitude,x.longitude])

data = OrderedDict(sorted(data.items(), key=lambda t: t[0]))
print(data)
    

In [None]:
#Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'


##### Interactive Folium heatmap to visualise distribution of attacks over the years

In [None]:
init_map = folium.Map(location=[0,0], zoom_start=2,tiles="cartodbpositron",max_bounds=True)
hm = HeatMapWithTime(data=list(data.values()),
                     index=list(data.keys()), 
                     radius=12,
                     auto_play=True,
                     max_opacity=0.3)
hm.add_to(init_map)
init_map.save("index.html")

init_map



  Time: There is a stark contrast in distribution of terrorist attacks between **pre-1998** and **post-1998**. In the latter period the number of attacks has increased with the distribution not changing significantly. 
  
  Question: Would it be advisable to use the data from pre-1998 even though the distribution is clearly significantly different from current and possibly future attacks? 
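
One way to approach this question is to compare the regional share of attacks in the two periods and, if they differ strongly, to train on the later period only. The sketch below uses invented rows; `demo` is a hypothetical stand-in for `frame` joined with `region_txt`, and the 1998 cut-off is the one discussed above.

```python
import pandas as pd

# Hypothetical toy rows for illustration only
demo = pd.DataFrame({
    "iyear": [1985, 1990, 1995, 2005, 2010, 2015, 2016, 2017],
    "region_txt": ["South America", "South America", "Western Europe",
                   "South Asia", "South Asia", "Middle East & North Africa",
                   "South Asia", "Middle East & North Africa"],
})

# Label each attack by period, then compute the regional shares within each period
period = demo["iyear"].lt(1998).map({True: "pre-1998", False: "1998 onwards"})
shares = pd.crosstab(period, demo["region_txt"], normalize="index")
print(shares.round(2))

# If the two rows differ a lot, one option is to keep only the later period
recent = demo[demo["iyear"] >= 1998]
```

Whether dropping the earlier data actually helps would still need to be checked against the model's test accuracy.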

## Problem: 
## *To predict the success of terrorist attacks in 2021 with the highest possible accuracy using Machine Learning models.*

Can the success of an attack be predicted from GTD-exclusive features?

## Machine Learning

 ---
 ### Here we use RandomForestClassifier as our prediction model

In [None]:
X = df[feature].fillna(0) # Assign chosen features to X.
y = df.success

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)



model = RandomForestClassifier(n_estimators=10)

model.fit(X_train, y_train)

In [None]:
X

#### Accuracy score of prediction model

In [None]:
#Accuracy score

y_pred_test = model.predict(X_test)

accuracy_score(y_test,y_pred_test)

#### Confusion Matrix of Random Forest Model

In [None]:
#View CONFUSION matrix
#In percentages and in numbers.

c = confusion_matrix(y_test,y_pred_test)
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(c/np.sum(c),  annot = True, annot_kws={"size": 18} ,fmt='.2%',ax=axes[0])
sb.heatmap(c,  annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

print(classification_report(y_test, y_pred_test))
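
When one class dominates the response, accuracy alone can look better than it is, so it is worth comparing it with a majority-class baseline. This is a minimal sketch with invented labels; `y_demo` is a hypothetical stand-in for `y_test`.

```python
import numpy as np

# Hypothetical labels: 1 = successful attack, 0 = failed attack
y_demo = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])

# A model that always predicts the most frequent class achieves this accuracy
majority_class = np.bincount(y_demo).argmax()
baseline_accuracy = np.mean(y_demo == majority_class)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.2f}")
```

A classifier is only adding value to the extent that its test accuracy exceeds this baseline.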


### Comparing Between Random Forest And Decision Tree 

In [None]:
# Re-create the feature matrix and response from the full dataset
X = df[feature].fillna(0) # Assign chosen features to X.
y = df.success


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


# Decision Tree using Train Data
dectree = DecisionTreeClassifier(max_depth = 2)  # create the decision tree object
dectree.fit(X_train, y_train)                    # train the decision tree model


y_train_pred = dectree.predict(X_train)
y_test_pred = dectree.predict(X_test)

print("Goodness of Fit of Model \tTrain Dataset")
print("Classification Accuracy \t:", dectree.score(X_train, y_train))
print()

# Check the Goodness of Fit (on Test Data)
print("Goodness of Fit of Model \tTest Dataset")
print("Classification Accuracy \t:", dectree.score(X_test, y_test))
print()
# Plot the Confusion Matrix for Train and Test
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(confusion_matrix(y_train, y_train_pred),
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[0])
sb.heatmap(confusion_matrix(y_test, y_test_pred), 
           annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])

#### Fine Tuning Using Gini Importance

In [None]:
# Fine-tuning: check the importance of each feature
important_features = model.feature_importances_
forest_importances = pd.Series(important_features, index=feature)
# Error bars: spread of each feature's importance across the individual trees
importance_std = np.std([est.feature_importances_ for est in model.estimators_], axis=0)
fig, ax = plt.subplots(figsize=(16,10))
forest_importances.plot.bar(yerr=importance_std, ax=ax)
ax.set_title("Feature importances")
ax.set_ylabel("Gini Importance")
# The higher the value, the more important the feature

#### Here we attempt to further optimise accuracy of our model by omitting less-important features (with respect to Gini importance)
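
The subset of features can also be chosen programmatically rather than by eye. The following is a sketch on synthetic data, not the selection rule used in this project: the column names, the data and the mean-importance threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: one informative column and two pure-noise columns
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.integers(0, 2, 500),
    "noise_a": rng.normal(size=500),
    "noise_b": rng.normal(size=500),
})
y_demo = X_demo["signal"]  # the label is fully determined by "signal"

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_demo)
importances = pd.Series(forest.feature_importances_, index=X_demo.columns)

# Hypothetical rule: keep features whose Gini importance is above the mean importance
threshold = importances.mean()
selected = importances[importances > threshold].index.tolist()
print(importances.sort_values(ascending=False))
print("Selected features:", selected)
```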

In [None]:
# The plot above shows the importance of each feature in predicting success.
# We keep only the most important features and drop the rest.
feature2= ['iyear',
                'property',
                'country',
                'attacktype1',
                'targtype1',
                'weaptype1',
]


X = df[feature2].fillna(0) # Assign chosen features to X.
y = df.success

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=0)


# Use a random forest again (RandomForestClassifier was already imported above)


model = RandomForestClassifier(n_estimators=10)

model.fit(X_train, y_train)

#Accuracy score

y_pred_test = model.predict(X_test)

print(accuracy_score(y_test,y_pred_test))

  
#View CONFUSION matrix
#In percentages and in numbers.

c = confusion_matrix(y_test,y_pred_test)
f, axes = plt.subplots(1, 2, figsize=(12, 4))
sb.heatmap(c/np.sum(c),  annot = True, annot_kws={"size": 18} ,fmt='.2%',ax=axes[0])
sb.heatmap(c,  annot = True, fmt=".0f", annot_kws={"size": 18}, ax = axes[1])


### Using our classification model to predict the 2021 terrorism events
How accurately can our model predict them?
#### Here we collect sample data of 2021 terrorist attacks from Wikipedia.org

In [None]:
# List of 2021 attacks: can we predict them?
html_data = pd.read_html('https://en.wikipedia.org/wiki/List_of_terrorist_incidents_in_2021')
print(len(html_data))  # number of tables found on the page

In [None]:
html_data[0]

In [None]:
data2021 = pd.read_excel("2021terrorattacks.xlsx")
data2021

In [None]:
predicted = model.predict(data2021)

results= pd.DataFrame(predicted,columns = ["Predicted_Terrorism"])

summary = pd.concat([data2021,results],axis=1)
summary

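Here we witness a **100% hit rate**, as our model predicted **all 18 terrorism acts in 2021** (as listed on Wikipedia) as successful. Because every incident in this sample was a successful attack, this result only shows that the model recognises successful attacks; it does not test how well the model identifies failed ones.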
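
To see why a sample that contains only successful attacks is a weak test, consider a trivial rule that always predicts "successful". This is an illustrative sketch with made-up arrays, not our trained model.

```python
import numpy as np

# 18 incidents, all of them successful (label 1), mirroring the sample size above
y_true = np.ones(18, dtype=int)

# Hypothetical trivial rule: always predict 1, regardless of the features
always_success = np.ones_like(y_true)

hit_rate = np.mean(y_true == always_success)
print(f"Hit rate of the trivial rule: {hit_rate:.0%}")
```

A sample that also includes failed attacks would be needed to distinguish our model from such a rule.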

### Using our classification model to predict success of future terrorist attacks in 2022
In order to predict future events, we need to generate a set of synthetic data first. Here we use `Linear Regression` to predict how many attacks to expect in 2022, and a `Bayesian Network` to generate that many synthetic attack records.
#### We use the year to predict terrorism count. We create a new dataframe, `lin`, that stores the year and corresponding terrorism count. Then we use `year` to predict `count`

In [None]:
## Linear regression model:
# Use the year to predict the terrorism count.
# Create a new DataFrame holding each year and its attack count.
from sklearn.linear_model import LinearRegression
# rename_axis/reset_index(name=...) gives the same column names across pandas versions
lin = df['iyear'].value_counts().rename_axis('year').reset_index(name='count')
lin

#### Train Linear Regression Model using the Train set
#### Use `lin_year` as *Predictor* and `lin_count` as *Response*.

In [None]:
# Split the Dataset into Train and Test
lin_year = pd.DataFrame(lin['year'])
lin_count = pd.DataFrame(lin['count'])

X_train, X_test, y_train, y_test = train_test_split(lin_year,lin_count, test_size = 0.25)

# Linear Regression using Train Data
linreg = LinearRegression()         # create the linear regression object
linreg.fit(X_train, y_train)        # train the linear regression model

# Use our linear regression model to predict the attack count for 2022
d = {'year': [2022]}
predict = pd.DataFrame(d)
predicted_count = linreg.predict(predict)
pc = int(predicted_count[0][0])
print(pc)
# Predicted count in our run: 9768




# Coefficients of the Linear Regression line
print('Intercept \t: b = ', linreg.intercept_)
print('Coefficients \t: a = ', linreg.coef_)

#### Plot the regression line by prediction using the model

In [None]:
# Predict the attack count for each year in the train set
y_train_pred = linreg.predict(X_train)

# Plot the actual counts together with the fitted regression values (red)
f = plt.figure(figsize=(16, 8))
plt.scatter(X_train, y_train)
plt.scatter(X_train, y_train_pred, color = "r")
plt.show()

#### Goodness of Fit of the Model on TRAIN Set

In [None]:
#goodness of fit of model ON TRAIN SET.

# Explained Variance (R^2)
print("Explained Variance (R^2) \t:", linreg.score(X_train, y_train))

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))


# Compare the actual counts (y_train) with the predictions, not the predictor (X_train)
mse = mean_sq_err(y_train, y_train_pred)
print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))


#### Test the Linear Regression model `linreg` using the Test Set

In [None]:
#Goodness of fit of model on TEST Set.
# Predict Total count corresponding to year_Test
y_test_pred = linreg.predict(X_test)

# Plot the Predictions
f = plt.figure(figsize=(16, 8))
plt.scatter(X_test, y_test, color = "green")
plt.scatter(X_test, y_test_pred, color = "red")
plt.show()

#### Check how good the predictions are on the Test Set    
#### **Metrics :** Explained Variance and Mean Squared Error

In [None]:
# Explained Variance (R^2) ON TEST DATA
print("Explained Variance (R^2) \t:", linreg.score(X_test, y_test))

# Mean Squared Error (MSE)
def mean_sq_err(actual, predicted):
    '''Returns the Mean Squared Error of actual and predicted values'''
    return np.mean(np.square(np.array(actual) - np.array(predicted)))

mse = mean_sq_err(y_test, y_test_pred)
print("Mean Squared Error (MSE) \t:", mse)
print("Root Mean Squared Error (RMSE) \t:", np.sqrt(mse))
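
The two metrics are closely related: R^2 equals one minus the MSE divided by the variance of the actual values. The sketch below checks this on a small invented example; `actual` and `predicted` are hypothetical arrays, not our model's output.

```python
import numpy as np

# Hypothetical actual and predicted counts
actual = np.array([100.0, 200.0, 300.0, 400.0])
predicted = np.array([110.0, 190.0, 310.0, 390.0])

mse = np.mean((actual - predicted) ** 2)   # average squared error
rmse = np.sqrt(mse)                        # same units as the response
r2 = 1 - mse / np.var(actual)              # share of variance explained

print("MSE :", mse)
print("RMSE:", rmse)
print("R^2 :", r2)
```

RMSE is often the easiest of the three to interpret here, since it is expressed in attacks per year.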

### We use Linear Regression model to predict count of terrorist attacks in 2022, and store it as `pc`

In [None]:
# Use our linear regression model to predict the attack count for 2022
d = {'year': [2022]}
predict = pd.DataFrame(d)
predicted_count = linreg.predict(predict)
pc = int(predicted_count[0][0])
print(pc)
# Predicted count in our run: 9768

In [None]:
# input dataset
input_data = 'terrorismdb.csv'
# DataSynthesizer mode used below
mode = 'correlated_attribute_mode'
# locations of the two output files
description_file = 'description.json'
synthetic_data = 'sythetic_data.csv'

### Here we use Bayesian Network to generate a synthetic dataset representing 2022's data 

In [None]:
# An attribute is categorical if its domain size is less than this threshold.
# Adjust the threshold to suit the domain sizes of the attributes in the input dataset.
threshold_value = 20

# specify categorical attributes
categorical_attributes = {'country': True, 'attacktype1':True,'targtype1':True,'weaptype1':True,'property':True}

# A parameter in Differential Privacy. It roughly means that removing a row in the input dataset will not 
# change the probability of getting the same output more than a multiplicative difference of exp(epsilon).
# Increase epsilon value to reduce the injected noises. Set epsilon=0 to turn off differential privacy.
epsilon = 1

# The maximum number of parents in Bayesian network, i.e., the maximum number of incoming edges.
degree_of_bayesian_network = 2

# Number of tuples generated in synthetic dataset.
num_tuples_to_generate = pc #We use OUR PREDICTED VALUE OF 2022, FROM OUR LINREG MODEL

In [None]:
describer = DataDescriber(category_threshold=threshold_value)
describer.describe_dataset_in_correlated_attribute_mode(dataset_file=input_data, 
                                                        epsilon=epsilon, 
                                                        k=degree_of_bayesian_network,
                                                        attribute_to_is_categorical=categorical_attributes)
describer.save_dataset_description_to_file(description_file)

In [None]:
generator = DataGenerator()
generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
generator.save_synthetic_data(synthetic_data)

#### After saving the synthetically-generated dataset, we then read both datasets to verify that they are similar 

In [None]:
# Read both datasets using Pandas.
input_df = pd.read_csv(input_data, skipinitialspace=True)
synthetic_df = pd.read_csv(synthetic_data)
# Read attribute description from the dataset description file.
attribute_description = read_json_file(description_file)['attribute_description']

inspector = ModelInspector(input_df, synthetic_df, attribute_description)

#### Comparing Histograms (Left-hand side is Real data, Right-hand side is Synthetic)

In [None]:
for attribute in synthetic_df.columns:
    inspector.compare_histograms(attribute)

Here we note that for each column field, the general distribution of values across the bins is similar between the **real data** and the **synthesized data**. This shows that our synthetic dataset is similar in nature to past real-life data and we can go ahead and use it to predict using our model.
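
The visual comparison can be backed by a number. One simple option, which is our own assumption and not part of DataSynthesizer, is the total variation distance between the real and synthetic value frequencies of a column: 0 means identical distributions and 1 means no overlap. `real` and `synthetic` below are hypothetical toy columns.

```python
import pandas as pd

def total_variation(a: pd.Series, b: pd.Series) -> float:
    """Half the summed absolute difference between two value-frequency tables."""
    p = a.value_counts(normalize=True)
    q = b.value_counts(normalize=True)
    # Align on the union of categories; a category missing on one side gets frequency 0
    p, q = p.align(q, fill_value=0)
    return 0.5 * (p - q).abs().sum()

# Hypothetical toy columns standing in for one attribute of input_df and synthetic_df
real = pd.Series([1, 1, 2, 3, 3, 3])
synthetic = pd.Series([1, 2, 2, 3, 3, 3])

tvd = total_variation(real, synthetic)
print(f"Total variation distance: {tvd:.3f}")
```

Applied column by column, for example `total_variation(input_df[col], synthetic_df[col])`, this would flag attributes whose synthetic distribution drifts from the real one.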

#### Comparing Heatmaps

In [None]:
inspector.mutual_information_heatmap()

Here we note that the respective correlations among the six variables are generally similar across both the real data (**Private**) and the newly-predicted data (**Synthetic**)
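
For readers unfamiliar with this kind of heatmap, the sketch below builds a pairwise dependence matrix by hand using scikit-learn's `normalized_mutual_info_score`. It is an illustration on invented data and we do not claim it reproduces the inspector's exact computation; `demo` and its values are hypothetical.

```python
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical toy data using three of our column names
demo = pd.DataFrame({
    "attacktype1": [1, 1, 2, 2, 3, 3],
    "weaptype1":   [5, 5, 6, 6, 7, 7],  # perfectly tied to attacktype1
    "property":    [0, 1, 0, 1, 0, 1],  # independent of attacktype1 in this toy data
})

# Score every pair of columns: 1 means one column determines the other, 0 means independence
cols = demo.columns
nmi = pd.DataFrame(
    [[normalized_mutual_info_score(demo[a], demo[b]) for b in cols] for a in cols],
    index=cols, columns=cols,
)
print(nmi.round(2))
```

Computing such a matrix for both the real and the synthetic data, and comparing them cell by cell, mirrors what the two heatmaps show side by side.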

#### We read our synthesised data into a dataframe, `syndata`, to carry out our final prediction

In [None]:
syndata = pd.read_csv('sythetic_data.csv')
syndata['iyear']=2022 #set all year to 2022
# Map negative codes (unknown) to 0 and leave 0 and 1 unchanged
syndata['property'] = syndata['property'].clip(lower=0)
syndata

### Use RandomForestClassifier model to predict success of terrorist events based on synthetic dataset

In [None]:
#Use our classification Model to predict successful attacks of the generated dataset
predicted = model.predict(syndata)

results= pd.DataFrame(predicted,columns = ["Predicted_Terrorism"])

summary = pd.concat([syndata,results],axis=1)
summary


Above, `Predicted_Terrorism` refers to whether or not a terrorist attack is successful 

1 --> **successful**

0 --> **unsuccessful**
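
The 0/1 codes can be turned into readable labels and summarised as shares. This is a minimal sketch; `summary_demo` is a hypothetical stand-in for `summary` with invented predictions.

```python
import pandas as pd

# Hypothetical predictions for four synthetic attacks
summary_demo = pd.DataFrame({"Predicted_Terrorism": [1, 1, 0, 1]})

# Translate the codes into labels, then compute the share of each outcome
labels = summary_demo["Predicted_Terrorism"].map({1: "successful", 0: "unsuccessful"})
share = labels.value_counts(normalize=True)
print(share)
```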

In [None]:
summary.describe()

In [None]:
y.describe()

In [None]:
print("Country \n",cntry)
print("\n Weapon type \n",weapx)
print("\n Attack type \n",atk)
print("\n Target type \n",targ)

In [None]:
newdf = summary.replace({"country":cntry})
newdf

### Information Presentation
How can we better visualise our prediction findings?

In [None]:
#Setting up the world countries data URL
url = 'https://raw.githubusercontent.com/python-visualization/folium/master/examples/data'
country_shapes = f'{url}/world-countries.json'
# Count the predicted successful attacks per country

# Keep only the rows predicted as successful (drop Predicted_Terrorism == 0)
clean = newdf[newdf.Predicted_Terrorism != 0]
# rename_axis/reset_index(name=...) gives the same column names across pandas versions
adf = clean["country"].value_counts().rename_axis("country").reset_index(name="count")
print(adf.head(n=10).to_string(index=False))


## Interactive Choropleth map showing distribution of predicted attacks

In [None]:
m=folium.Map()
folium.Choropleth(
    #The GeoJSON data to represent the world country
    geo_data=country_shapes,
    name='Predicted Terrorism events in 2022',
    data=adf,
    #The column accepting list with 2 value; The country name and  the numerical value
    columns=['country','count'],
    key_on='feature.properties.name',
    fill_color='PuRd',
    nan_fill_color='white'
).add_to(m)
m

## Statistical Inference

 ---
Based on the choropleth map above, we note that:
1) Countries such as `Iraq` are at **High Risk** of being successfully attacked

2) `Afghanistan` and `Pakistan` are at **Moderate to High Risk**

3) `Colombia`, `Peru`, `El Salvador`, `Nigeria`, `Somalia`, `Yemen`, `Turkey`, `India`, `Thailand` and `Philippines` are at **Moderate Risk**

4) Other countries are at **Relatively Low Risk**
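
The tiers above were read off the map by eye. They could also be assigned from the predicted counts with explicit cut-offs, as in the sketch below. The country names, counts and thresholds are all hypothetical and chosen only to illustrate the binning.

```python
import pandas as pd

# Hypothetical predicted counts per country (stand-in for adf)
adf_demo = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "count": [1500, 600, 200, 20],
})

# Hypothetical cut-offs; each bin is open on the left and closed on the right
bins = [0, 100, 500, 1000, float("inf")]
tiers = ["Relatively Low", "Moderate", "Moderate to High", "High"]
adf_demo["risk"] = pd.cut(adf_demo["count"], bins=bins, labels=tiers)
print(adf_demo)
```

Fixing the thresholds in code makes the tiering reproducible when the synthetic data is regenerated.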

## Ethical Consideration

 ---
While our model can be of great use to national security teams of the above countries as they prepare to monitor future threats, we must also take into consideration that such information can easily fall into the wrong hands. If future terrorists get their hands on such prediction insights it will make it easier for them to 'escape the radar', which leads to more damage in the long run.

Governments may benefit from using this prediction model as part of their arsenal of anti-terrorism measures, but never in place of them. Nothing can replace human judgement in deciding whether to heed or ignore a terrorism warning report.

## Intelligent Decision

 ---
**Recommendations:** With the ethical considerations above in mind, our data-driven insights are as follows:

**1)** With the map as a reference, countries in the moderate- to high-risk categories, as well as countries that share borders with them (e.g. Iran, Syria, Oman), need to ramp up their monitoring regimes. Although this prediction is not definitive and fails to account for 2020's drop in terrorist activity, it can give good insight into what terrorists' next move(s) could be.

**2)** Rather than serving as a standalone predictor of terrorism success, this model can supplement the existing intelligence strategies of security teams. They can use it as a tool to allocate the available resources for surveillance and defence as and when needed. Prevention is always better: governments can additionally choose to monitor the levels of terrorist recruitment taking place in their respective countries. This data can add further value to future anti-terrorist campaigns.

**3)** With the advent of new technologies and the Covid-19 pandemic, the world has seen a rise in terrorist activity linked to the Internet and online resources. Therefore, a model that predicts future activity by region based on a population's online activity (e.g. by monitoring Google search history) could be a possibility, although it raises questions about privacy. Nevertheless, it is important to develop different dimensions of protection in order to guard against potential terrorism on all fronts.