# GDELT Demo: Classification

Instructions: [README.md](../README.md) in project root.

Starting point for Jupyter notebooks: [/Start_here.ipynb](../Start_here.ipynb) in project root.


This notebook breaks out some of the classification tasks associated with my GDELT analysis. For this initial stage I will address questions like: 

* Can I use various aggregate stats of "relationships" (as actor 1, actor 2, or both) to identify developed versus developing (or high- vs. low-GDP, or high- vs. low-HDI) countries?

## Findings and visualizations

**NOTE**: As with the regression analysis, these demos fall back to sample data when the bigger datasets are unavailable, so that they can run without placing multiple gigabytes in the repo. Specific metrics may therefore not agree with what I report.

In [2]:
import sys
import os
import importlib

# project imports (modules live in the parent directory)
sys.path.insert(0, os.path.join(os.getcwd(), ".."))
import classification
import pandas_gdelt_helper

In [3]:
importlib.reload(classification)
importlib.reload(pandas_gdelt_helper)

from classification import GdeltClassificationTask as Task
task = Task()
task.do_decision_tree()
task.do_svm()
task.do_random_forest()

         name actor1_relationships  actor2_relationships
code                                                    
BOL   Bolivia                    1                     4
CAN    Canada                   12                     7
ESP     Spain                    5                     6
EUR    Europe                    9                    13
FRA    France                   13                    15

*****        SVM          *****
Classifier coef for SVM:
[[-0.9090892 ]
 [-0.20512788]
 [ 0.10256559]
 [ 0.41025349]]

*****    RANDOM FOREST    *****
[1. 1.]


### Classification by level of development

For this initial version, I just divide the world into developed and developing countries based on GDP.
This classification isn't inherently earthshaking (we can simply look up GDP and don't need GDELT to predict it), but it serves as a sanity check on my classification strategies. More generally, predicting development from a country's 'signature' in GDELT might tell us something nontrivial about world events.
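The developed/developing split can be sketched as a simple threshold on GDP. This is a minimal illustration, not the labeling used in `classification.py`: the function name `label_by_gdp`, the median default, and the country codes and values below are all assumptions, and the numbers are placeholders rather than real GDP figures.

```python
import pandas as pd


def label_by_gdp(gdp: pd.Series, threshold=None) -> pd.Series:
    """Return 1 ('developed') where GDP is at or above the threshold, else 0.

    Hypothetical helper: if no threshold is given, the median of the
    supplied values is used as the cut-off.
    """
    if threshold is None:
        threshold = gdp.median()
    return (gdp >= threshold).astype(int).rename("developed")


# Placeholder values for illustration only; these are NOT real GDP figures.
placeholder_gdp = pd.Series(
    {"AAA": 1_000.0, "BBB": 5_000.0, "CCC": 20_000.0, "DDD": 40_000.0}
)

labels = label_by_gdp(placeholder_gdp)
print(labels)
```

A median split guarantees balanced classes, which is convenient for a sanity check; a fixed threshold (or an HDI cut-off) could be passed instead via `threshold`.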

Features will be things like "number of different relationships," "ratio of actor 1 to actor 2 relationships," specific CAMEO codes (eventcode families) in those relationships, etc., where *relationships* means events linking actor 1 to actor 2.
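As a rough sketch of how such features might be derived, the snippet below starts from the first five rows of the sample `get_country_features()` output shown in this notebook and adds two derived columns. The column names `total_relationships` and `actor1_to_actor2_ratio` are hypothetical and do not come from `pandas_gdelt_helper`.

```python
import pandas as pd

# First five rows of the sample output shown in this notebook.
features = pd.DataFrame(
    {
        "name": ["Bolivia", "Canada", "Spain", "Europe", "France"],
        "actor1_relationships": [1, 12, 5, 9, 13],
        "actor2_relationships": [4, 7, 6, 13, 15],
    },
    index=pd.Index(["BOL", "CAN", "ESP", "EUR", "FRA"], name="code"),
)

# Hypothetical derived features (names are illustrative only).
features["total_relationships"] = (
    features["actor1_relationships"] + features["actor2_relationships"]
)
features["actor1_to_actor2_ratio"] = (
    features["actor1_relationships"] / features["actor2_relationships"]
)

print(features[["total_relationships", "actor1_to_actor2_ratio"]])
```

Note that the ratio becomes infinite for a country with zero actor 2 relationships, so real data would need a guard (for example, adding a small constant or dropping such rows).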

In [4]:
importlib.reload(classification)
importlib.reload(pandas_gdelt_helper)
from pandas_gdelt_helper import get_country_features, get_country_external
get_country_features()


                name  actor1_relationships  actor2_relationships
code
BOL          Bolivia                     1                     4
CAN           Canada                    12                     7
ESP            Spain                     5                     6
EUR           Europe                     9                    13
FRA           France                    13                    15
GBR   United Kingdom                    24                    18
GHA            Ghana                    14                    12
ISR           Israel                    28                    31
ITA            Italy                    17                    18
LBY            Libya                    14                    10


### Using classifiers to predict violent events

The sanity-check version of this asks whether the number of violent event codes at times *t-1* back to some *t-N* predicts violent events at time *t*. Violent events will need to be defined based on CAMEO codes.
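The sanity check could be set up by turning a per-period count of violent events into lagged features. The sketch below is one possible framing, not code from this project: treating CAMEO root codes 18, 19, and 20 (assault, fight, unconventional mass violence) as "violent" is an assumption, the helper names are hypothetical, and the counts are synthetic.

```python
import pandas as pd

# Assumption: CAMEO root codes 18, 19 and 20 are the "violent" families.
VIOLENT_ROOT_CODES = {"18", "19", "20"}


def is_violent(event_code: str) -> bool:
    """Classify an event by its two-digit CAMEO root code.

    Codes should be kept as strings so that leading zeros survive.
    """
    return str(event_code)[:2] in VIOLENT_ROOT_CODES


def lagged_counts(violent_counts: pd.Series, n_lags: int) -> pd.DataFrame:
    """Build columns for t-1 .. t-N plus the target count at time t."""
    lags = {
        f"violent_t-{k}": violent_counts.shift(k) for k in range(1, n_lags + 1)
    }
    frame = pd.DataFrame(lags)
    frame["violent_t"] = violent_counts
    # The first n_lags periods have incomplete history, so drop them.
    return frame.dropna()


# Synthetic per-period counts, for illustration only.
counts = pd.Series([0, 2, 1, 3, 0, 4])
frame = lagged_counts(counts, n_lags=2)
print(frame)
```

The lag columns would serve as the features and `violent_t` (or a thresholded version of it) as the target for any of the classifiers above.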

The more interesting question is whether nonviolent patterns at earlier times *t'* < *t* can predict violence at time *t*.