# GDELT Demo project showcase

This is a simple presentation of my findings, with a bit about the skills involved in getting there. For a more general intro, a review of the skills demonstrated here, and setup instructions, see [README.md](https://github.com/reed9999/gdelt-demo/blob/master/README.md) and children.

**AT PRESENT ALL FINDINGS ARE EXPLORATORY**

## Skills demonstrated

I've moved this to the main README.md.

## Findings and visualizations

This project is still in its early stages, but I'll post small-scale findings incrementally. So far I'm just running a few simple descriptives and regressions to get a feel for various aspects of the data.

**NOTE**: To allow these demos to run without placing multiple gigabytes in the repo, I am engineering them to fall back to sample data when bigger datasets (either my local copy, which is still a sample but a bigger one, or the full dataset) are unavailable. The specific metrics you get may therefore not agree with what I report here.

```python
import importlib
import analysis.goldsteinscale_avgtone as ga
importlib.reload(ga)
# This seemed to make things very slow but perhaps it's just browser slowness.
# %reload_ext autoreload
# %autoreload 1

regr = ga.GARegression()
regr.prepare_data()
```




```python
importlib.reload(ga)
regr.report_descriptives()
```

```
count     4.581512e+06
unique             NaN
top                NaN
freq               NaN
mean      3.816873e-01
std       4.886306e+00
min      -1.000000e+01
25%      -2.000000e+00
50%       1.000000e+00
75%       3.400000e+00
max       1.000000e+01
Name: goldsteinscale, dtype: float64
count     4.581512e+06
unique             NaN
top                NaN
freq               NaN
mean      3.662871e+00
std       3.697936e+00
min      -3.035714e+01
25%       2.061856e+00
50%       4.090558e+00
75%       5.954198e+00
max       3.333333e+01
Name: avgtone, dtype: float64
```



### International stability and media tone

The Goldstein score measures the propensity of each *type* of event to promote stability; avgtone is the average tone across all items (mostly media stories, I think) referencing the event. My intuition is that the two should be positively correlated, because the media reflects the public's interest in stability.

#### Descriptives and simple plots
A good place to start, before I do inferential stats, is to get to know the data. I jumped ahead a bit but now am coming back to this.

**COMING NEXT**: Means, stdev, plots -- maybe box plots with interquartile range, maybe histograms 
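
A minimal sketch of the descriptives planned here, using only Python's standard library. The `describe` helper and the toy scores are hypothetical and made up for illustration; they are not part of this repo and are not GDELT data. The interquartile range it reports is the quantity a box plot would draw.

```python
import statistics


def describe(values):
    """Return basic descriptives for a sequence of numbers (hypothetical helper)."""
    # method="inclusive" interpolates linearly between the observed data points.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
        "iqr": q3 - q1,  # interquartile range
    }


# Made-up Goldstein-style scores, for illustration only.
toy_goldstein = [-10.0, -2.0, 0.0, 1.0, 1.0, 3.4, 7.0, 10.0]

summary = describe(toy_goldstein)
for name, value in summary.items():
    print(f"{name:>5}: {value:.3f}")
```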

#### Regression analysis

The most interesting story I've found so far is that the tone of coverage trends downward over time, controlling for the Goldstein score of each kind of event:

```python
import analysis.goldsteinscale_avgtone as ga
regr = ga.GARegression()
regr.go()

# Coefficients are for date (in fractions of a year) and Goldstein score.
# Dependent variable is the average tone of documents, i.e. coverage.
regr.print_output()
```



```
Coefficients on the regression: [[-0.22229802  0.13551692]]
MSRE: 7.835794230966895
r2: 0.569243224924088
```


Results from running on my local "mini" database (bigger than the sample data in this repo, which of course will generate different results):

```
Coefficients on the regression:

    fractiondate: -0.2022747908947874
    goldsteinscale: 0.13080667992903958
MSRE: 8.334688464736924
r2: 0.5204274326466958
```

#### Interpretation
Each year is associated with a -0.20 change in average tone, holding constant the theoretical propensity of the events to promote stability (i.e. holding constant the Goldstein score). Each 1.0 point increase in the Goldstein score is associated with a +0.13 change in average tone, holding constant the date.
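
As a worked example of reading these coefficients, here is a small sketch that plugs the "mini" database values from the results block above into the linear model. The function name `expected_tone_change` is hypothetical; it only compares two imagined events, so the intercept (which the output does not report) cancels out.

```python
# Coefficients copied from the "mini" database results reported above.
COEF_FRACTIONDATE = -0.2022747908947874
COEF_GOLDSTEINSCALE = 0.13080667992903958


def expected_tone_change(years_elapsed, goldstein_change):
    """Predicted difference in avgtone between two events (hypothetical helper).

    Because this is a difference between two predictions from the same
    linear model, the intercept drops out.
    """
    return (COEF_FRACTIONDATE * years_elapsed
            + COEF_GOLDSTEINSCALE * goldstein_change)


# Same kind of event, five years later.
five_years_later = expected_tone_change(5.0, 0.0)
# Same date, an event type scoring three Goldstein points higher.
three_points_higher = expected_tone_change(0.0, 3.0)

print(f"{five_years_later:+.2f}")     # -1.01
print(f"{three_points_higher:+.2f}")  # +0.39
```

Put another way, under this fit roughly two thirds of a year of elapsed time offsets a one-point increase in Goldstein score.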

In isolation, all we can tell from the MSRE is that the typical size of a prediction error, taken as its square root, is sqrt(8.3) ~= 2.9. In principle avgtone can run from -100 to 100, but in practice it seems to usually be between 0 and 10. So missing estimates by about 2.9 seems not-great; however, the metric is more useful for comparing against other analyses than in isolation. The r-squared looks pretty nice for a first attempt at specifying a regression with two independent variables, but I should be cautious about this metric as I run more regressions and risk overfitting.
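
The arithmetic behind that paragraph, as a short sketch. It treats the reported MSRE as a mean squared error, and it assumes the r2 and the MSRE were computed on the same rows, in which case r2 = 1 - MSE / Var(avgtone) lets us back out the spread of avgtone as a naive baseline. Both assumptions are mine, not something the output confirms.

```python
import math

# Values copied from the "mini" database results reported above.
msre = 8.334688464736924
r2 = 0.5204274326466958

# Typical size of a prediction error, in avgtone units.
typical_error = math.sqrt(msre)

# ASSUMPTION: r2 = 1 - MSE / variance of avgtone on the same rows.
# The implied standard deviation is the error of always guessing the mean.
implied_std = math.sqrt(msre / (1.0 - r2))

print(f"{typical_error:.2f}")  # 2.89
print(f"{implied_std:.2f}")
```

Under those assumptions, the model's typical error is noticeably smaller than the error of always predicting the mean tone, which is just another way of stating the r-squared.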

I have not yet run this on all the data (e.g., with Spark on AWS Elastic MapReduce), just a subset of six data files.

### Research questions
To get interesting answers, I need good questions. This is what makes this project more than just an extended SQL or PySpark tutorial. In this kind of research there's going to be some iteration in RQs, because things you think *a priori* might be interesting turn out not to be, and vice versa. So this is largely a running list of ideas.

1. **Media tone and stability** - The Goldstein score is a measure of how each kind of action promotes or erodes stability. The tone score (avgtone) is the average tone of the media coverage of the event. See above.

1. **Actor affinity by dyad** - Do some actor dyads consistently produce higher scores (Goldstein or media tone) than others? I'd think this would be a trivial "Yes" because relationships between allies should produce more positive news than those between adversaries. So this is exploring the obvious, but it's a good sanity check to make sure I'm understanding the nature of this data.

   1. **Asymmetrical relationships** - Just brainstorming. If X threatens Y more than Y threatens X that's also significant and interesting.
  
1. **Longitudinal time series questions** - This dataset is so rich that it could support a lot of this.

1. **Clustering countries or other actors** - If I want to polish up those *k*-means clustering skills, there's probably all kinds of variables I could derive ("number of threats made", "number of overtures to negotiation", "general tone of media coverage", etc.).

1. **Network graph analysis** - I don't actually know much about this topic, but clearly this data would support a lot of inferences about how actors build their networks. 
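
The actor-affinity and asymmetry questions above could start from something like the following sketch. Everything in it is hypothetical: the actor codes and scores are made up, the row layout `(actor1, actor2, goldsteinscale)` is an assumption about how the events would be extracted, and the helper names are mine. It uses only the standard library so it runs without the dataset.

```python
from collections import defaultdict
from statistics import fmean

# Made-up event rows: (actor1, actor2, goldsteinscale). Not GDELT data.
toy_events = [
    ("AAA", "BBB", 3.4),
    ("AAA", "BBB", 1.0),
    ("BBB", "AAA", -2.0),
    ("AAA", "CCC", -10.0),
    ("CCC", "AAA", -5.0),
    ("CCC", "AAA", -4.0),
]


def mean_score_by_directed_dyad(events):
    """Average score for each ordered (actor1, actor2) pair."""
    scores = defaultdict(list)
    for actor1, actor2, score in events:
        scores[(actor1, actor2)].append(score)
    return {dyad: fmean(values) for dyad, values in scores.items()}


def asymmetry(dyad_means):
    """For pairs seen in both directions, mean(X->Y) minus mean(Y->X)."""
    gaps = {}
    for (a, b), mean_ab in dyad_means.items():
        # a < b keeps one entry per unordered pair.
        if a < b and (b, a) in dyad_means:
            gaps[(a, b)] = mean_ab - dyad_means[(b, a)]
    return gaps


dyad_means = mean_score_by_directed_dyad(toy_events)
gaps = asymmetry(dyad_means)

for dyad, gap in sorted(gaps.items()):
    print(dyad, round(gap, 2))
```

A gap far from zero would flag a dyad where one actor's actions toward the other score very differently from the reverse direction. On the real data the same grouping would more naturally be a SQL or PySpark `GROUP BY` over the two actor columns.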



## A note on data engineering and AWS Elastic MapReduce
It's a bit of a gray area whether learning how to move files around to AWS EMR, get Spark to run my Python scripts without hanging, etc. is properly part of data science or whether it's actually data engineering. I find it valuable to learn these skills even though they don't lead directly to findings above. See the README.md for the skills developed in this subset of the project.
