## GDELT Databuilder

In [0]:
%scala
import com.aamend.spark.gdelt._ 
val raw_gkg = spark.read.format("delta").load("s3a://osint-gdelt-reado/GDELT/delta/bronze/v1/gkg/").as[GKGEventV1]

### 1. Initial data extraction and filtering (GKG)

- We define a desired time period.
- In the same command, we stack several search strings related to our topic of interest and keep every news item whose URL contains the keywords (any of them with `OR`, all of them with `AND`). The URL often reflects the article headline.
- Then drop the unwanted columns.
- Write to parquet.
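
The selection logic can be sketched in plain Python, independent of Spark. The sample records and the `matches` helper are hypothetical and only illustrate how a strict date window combines with substring matching on the URL; they are not part of the pipeline.

```python
from datetime import datetime

# Hypothetical sample records: (publishDate, first source URL).
records = [
    (datetime(2021, 10, 5), "https://example.com/news/squid-game-breaks-records"),
    (datetime(2021, 10, 6), "https://example.com/food/fried-squid-recipe"),
    (datetime(2020, 12, 31), "https://example.com/tv/squid-game-preview"),
]

start = datetime(2021, 1, 1)
end = datetime(2021, 12, 31)

def matches(url, keywords, require_all):
    """Mimic `like '%kw%'` clauses joined by AND (require_all) or OR."""
    hits = [kw in url for kw in keywords]
    return all(hits) if require_all else any(hits)

# Strict bounds, as in `publishDate > start_date && publishDate < end_date`.
selected = [
    url
    for published, url in records
    if start < published < end and matches(url, ["squid", "game"], require_all=True)
]
print(selected)
```

Only the first record survives: the second URL lacks `game`, and the third falls before the start date. With strict bounds and `end_date` set to midnight on 31 December, items stamped at or after that instant also fall outside the window. The substring match is case-sensitive.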

In [0]:
%scala
// OUR TIMEFRAME ----------------------
val start_date = "2021-01-01 00:00:00"
val end_date = "2021-12-31 00:00:00"
// ------------------------------------

// SOME SEARCH EXAMPLES

//val filtered_gkg = raw_gkg.filter($"publishDate">start_date && $"publishDate"<end_date).filter("sourceUrls[0] like '%covid%' OR sourceUrls[0] like '%vaccine%' OR sourceUrls[0] like '%vaxx%'")

//val filtered_gkg = raw_gkg.filter($"publishDate">start_date && $"publishDate"<end_date).filter("sourceUrls[0] like '%sweden%'")

//val filtered_gkg = raw_gkg.filter("sourceUrls[0] like '%sweden%'") // search without timeframe, in all of gdelt

val filtered_gkg = raw_gkg.filter($"publishDate">start_date && $"publishDate"<end_date).filter("sourceUrls[0] like '%squid%' AND sourceUrls[0] like '%game%'")


// DROP COLUMNS AND WRITE FILE
val filtered_gkg_clean = filtered_gkg.drop("numArticles", "counts", "hash", "errors", "date")
filtered_gkg_clean.write.format("parquet").mode("overwrite").save("dbfs:/FileStore/temp.parquet")

### Optional: Add more data based on theme(s) (GKG)

GDELT's theme categories (e.g. `TAX_ETHNICITY_KOREAN` or `WB_694_BROADCAST_AND_MEDIA`) are very broad and imprecise, which makes them difficult to use for sampling. But when the phenomenon we study has its own category (e.g. `TAX_DISEASE_CORONAVIRUS`), it may be useful to collect that whole theme, and not only the data sampled through the URLs.

<br>

- Read the parquet with `koalas` in Python.
- Get the _top themes_ from this initial dataset.
- Perform a manual inspection of this list to see if any specific enough theme(s) are there.
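
As a rough illustration of what "top themes" means here, the sketch below counts themes with the standard library only. The sample `items` list is hypothetical; in the notebook the same idea is applied to the `themes` column of the parquet file.

```python
from collections import Counter

# Hypothetical sample: each item carries a list of GDELT themes.
items = [
    ["TAX_ETHNICITY_KOREAN", "WB_694_BROADCAST_AND_MEDIA"],
    ["TAX_ETHNICITY_KOREAN"],
    ["WB_694_BROADCAST_AND_MEDIA", "TAX_DISEASE_CORONAVIRUS"],
    ["TAX_ETHNICITY_KOREAN"],
]

# "Explode" the lists into one theme per entry, then count occurrences.
theme_counts = Counter(theme for themes in items for theme in themes)

# Keep the most frequent themes for manual inspection.
top_themes = [theme for theme, _ in theme_counts.most_common(2)]
print(top_themes)
```

A theme that ranks high in this list and is specific to the topic is a candidate for collecting in full.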

In [0]:
%python
import databricks.koalas as ks
dataDF = ks.read_parquet("dbfs:/FileStore/temp.parquet")
topthemesDF = ks.DataFrame(dataDF['themes'].explode().value_counts())
topthemesDF.reset_index(level=0, inplace=True)
topthemes = topthemesDF['index'][:19].to_numpy()
print(list(topthemes))

If we want to collect a full theme, we run the code below, _inserting the relevant theme(s)_, and then add the result to the already collected data. To stack several themes, use `||` for OR and `&&` for AND.
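
The difference between the two ways of stacking themes can be sketched in plain Python. The `items` mapping is hypothetical sample data; `any` plays the role of `||` and `all` the role of `&&`.

```python
# Hypothetical sample: item id -> list of themes.
items = {
    "a": ["TAX_ETHNICITY_KOREAN", "TAX_WORLDLANGUAGES_KOREAN"],
    "b": ["TAX_ETHNICITY_KOREAN"],
    "c": ["TAX_DISEASE_CORONAVIRUS"],
}
wanted = ["TAX_ETHNICITY_KOREAN", "TAX_WORLDLANGUAGES_KOREAN"]

# OR: at least one wanted theme is present.
either = sorted(k for k, themes in items.items() if any(t in themes for t in wanted))

# AND: every wanted theme is present.
both = sorted(k for k, themes in items.items() if all(t in themes for t in wanted))

print(either, both)
```

OR widens the sample (items `a` and `b`), while AND narrows it to items tagged with every listed theme (item `a` only).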

In [0]:
%scala

// GET DATA FOR THEMES
val gkg_themes = raw_gkg.filter($"publishDate">start_date && $"publishDate"<end_date).filter(c => c.themes.contains("TAX_ETHNICITY_KOREAN") || c.themes.contains("TAX_WORLDLANGUAGES_KOREAN"))

// DROP COLUMNS AND COMBINE DATASETS
val gkg_themes_clean = gkg_themes.drop("numArticles", "counts", "hash", "errors", "date")
val all_gkg = filtered_gkg_clean.union(gkg_themes_clean).distinct()

NOTE: If you skip the optional step above, run the code below to rename the first dataset for continued processing.

In [0]:
%scala
val all_gkg = filtered_gkg_clean

### 2. Enrich our data (EVENTS) – more columns
- Look up the `eventIds` identified in the GKG dataset in the EVENTS data.
- Explode the GKG data on `eventIds` to get one `eventId` per row, which prepares it for the inner join.
- Join the dataframes into one, and clean up columns.
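
The explode-then-join pattern can be sketched without Spark. Everything in the block is hypothetical sample data (including the `payload` field); it only shows how rows are multiplied by the explode and then pruned by the inner join.

```python
# Hypothetical GKG items, each with a list of event ids.
gkg = [
    {"url": "u1", "eventIds": [1, 2]},
    {"url": "u2", "eventIds": []},   # no events: disappears on explode
    {"url": "u3", "eventIds": [9]},  # event 9 has no match in `events`
]
# Hypothetical EVENTS rows keyed by event id.
events = {1: {"payload": "first"}, 2: {"payload": "second"}}

# Explode: one row per (item, eventId) pair.
exploded = [
    {"url": item["url"], "eventId": event_id}
    for item in gkg
    for event_id in item["eventIds"]
]

# Inner join: keep only rows whose eventId exists in `events`.
joined = [
    {**row, **events[row["eventId"]]}
    for row in exploded
    if row["eventId"] in events
]
print(len(gkg), len(exploded), len(joined))
```

Item `u1` yields two joined rows, while `u2` and `u3` drop out, so the row count after the join can differ from the number of GKG items in either direction.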

In [0]:
%scala
// Load the GDELT `events` dataset for the same time period as for the GKG above
val raw_events = spark.read.format("delta").load("s3a://osint-gdelt-reado/GDELT/delta/bronze/v1/events").as[EventV1] // now we read the events table, rather than gkg
val filtered_events = raw_events.filter($"date">start_date && $"date"<end_date).distinct()
val filtered_events_clean = filtered_events.drop("actor1Code", "actor2Code", "isRoot", "cameoEventBaseCode", "cameoEventRootCode", "quadClass", "actor1Geo", "actor2Geo", "dateAdded", "hash", "errors", "date")

// Explode the gkg on eventIds to create an eventId column for inner join
import org.apache.spark.sql.functions.explode
val all_gkg_prep = all_gkg.withColumn("eventId", explode($"eventIds")).drop("eventIds")

// Perform the inner join
val joinedDF = all_gkg_prep.join(filtered_events_clean, Seq("eventId")) // This method avoids duplicate columns after join

// Column clean-up
val finalDF = joinedDF
  // Flatten the nested geo struct into top-level columns
  .withColumn("geoName", $"eventGeo.geoName")
  .withColumn("latitude", $"eventGeo.geoPoint.latitude")
  .withColumn("longitude", $"eventGeo.geoPoint.longitude")
  .withColumn("countryCode", $"eventGeo.countryCode")
  // Flatten the nested tone struct; `ttone` is a temporary name while the `tone` struct still exists
  .withColumn("ttone", $"tone.tone")
  .withColumn("positiveScore", $"tone.positiveScore")
  .withColumn("negativeScore", $"tone.negativeScore")
  .withColumn("polarity", $"tone.polarity")
  .withColumn("activityReferenceDensity", $"tone.activityReferenceDensity")
  .withColumn("selfGroupReferenceDensity", $"tone.selfGroupReferenceDensity")
  // Drop the original nested columns, then restore the flat tone score under the name `tone`
  .drop("locations", "eventGeo", "tone", "avgTone")
  .withColumn("tone", $"ttone")
  .drop("ttone")
  .distinct()

Because the explode and inner join keep only items that have an `eventId` with a matching event in the selected period, the cell above may shrink the dataset. Compare the two counts below.

In [0]:
%scala
all_gkg.count

In [0]:
%scala
finalDF.count

### Final dataset
- Write to parquet file.

In [0]:
%scala
finalDF.write.format("parquet").mode("overwrite").save("dbfs:/FileStore/dataset.parquet")