# 1 Introduction

#### Author: John Brandt

Scrapes the full text of the news articles associated with each geolocated event in [GDELT](https://www.gdeltproject.org) for a given country, year, and month.

Creates the following output:

* `{country}/metadata/{year}/{month}.csv`: `pd.DataFrame` with one row per candidate event and attached GDELT metadata;
* `{country}/metadata/{year}/{month}.pkl`: `Dict` mapping each `unique_id` in `/text/` to one or more rows of `{month}.csv`;
* `{country}/text/{year}/{month}/{unique_id}.pkl`: pickled NewsPlease article object holding the article full text.
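
The pickled outputs listed above can be read back with the standard `pickle` module. The sketch below is a minimal round trip under stated assumptions: `load_obj` is a hypothetical helper that is not part of this notebook, and the mapping is made-up data. Unpickling the article files would additionally need `news-please` installed, since they hold the object returned by `NewsPlease.from_url`.

```python
import os
import pickle
import tempfile
from typing import Any


def load_obj(name: str, folder: str) -> Any:
    """Hypothetical helper: load `<folder>/<name>.pkl` written with pickle."""
    with open(os.path.join(folder, name + ".pkl"), "rb") as f:
        return pickle.load(f)


# Round trip with an illustrative mapping: unique_id -> rows of {month}.csv
example_mapping = {0: [0, 2], 1: [1]}
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "05.pkl"), "wb") as f:
        pickle.dump(example_mapping, f, pickle.HIGHEST_PROTOCOL)
    restored = load_obj("05", tmp)
```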

## 1.0 Package imports

[NewsPlease](https://github.com/fhamborg/news-please/tree/master/newsplease) is used to scrape news articles. TensorFlow is imported for its convenient I/O with Google Cloud Storage (GCS).

In [52]:
from newsplease import NewsPlease
import pandas as pd
import nltk
from tqdm import tnrange
import re
import multiprocessing
import pickle
import os
import tensorflow as tf
from typing import List, Any

# 2 Data Download

## 2.0 Constants

In [53]:
relevant_words = ['land', 'forest', 'agriculture', 
                  'farm', 'farmer', 'plantation', 'agrarian',
                  'smallholder', 'grazing', 'development', 'habitat', 
                  'resource', 'cattle', 'dispute', 'strife', 'peat',
                  'rice', 'palm oil', 'sugarcane', 'cassava', 'coconut',
                  'corn', 'mango', 'orange', 'maize', 'wheat', 'sorghum',
                  'bananas', 'tomatoes', 'citrus',
                  'livestock', 'kill', 'dead', 'airport',
                  'aluminum', 'mining', 'agro', 'dam',
                  'road', 'infrastructure', 'transmission', 
                  'conservation', 'settlement', 'displace',
                  'exile', 'caste', 'conflict', 'relocation',
                  'village', 'encroach', 'fertilizer', 'mine',
                  'illegal mining', 'malnutrition', 'contamination',
                  'mangrove', 'water', 'cow', 'appropriation', 
                  'appropriated', 'protest', 'environmental', 'pollution',
                  'copper', 'iron', 'timber', 'acre', 'hectare', ]

days_per_month = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]
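
The fixed `days_per_month` table assumes a 28-day February, which holds for 2017 to 2019 but not for leap years. As a hedged alternative, the standard library's `calendar.monthrange` can supply the month length; `days_in_month` and `daily_stamps` below are hypothetical helpers, not part of this notebook, and the `YYYYMMDD` stamp format is taken from the file names built in `load_month`.

```python
import calendar
from typing import List


def days_in_month(year: int, month: int) -> int:
    """Hypothetical helper: number of days in a 1-indexed calendar month."""
    return calendar.monthrange(year, month)[1]


def daily_stamps(year: int, month: int) -> List[str]:
    """Hypothetical helper: YYYYMMDD stamps for every day of the month."""
    return ["{}{:02d}{:02d}".format(year, month, day)
            for day in range(1, days_in_month(year, month) + 1)]
```

For example, `daily_stamps(2019, 5)` runs from `20190501` to `20190531`.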

## 2.1 Parameters

In [79]:
year = 2019
country = 'brazil'
month = 5

input_folder = "../data/{}/raw/{}/".format(country, str(year))
metadata_folder = "../data/{}/metadata/{}/".format(country, str(year))
text_output_folder = "../data/{}/text/{}/{}/".format(country, str(year), str(month).zfill(2))

for folder in [input_folder, metadata_folder, text_output_folder]:
    os.makedirs(folder, exist_ok=True)

## 2.2 Function definitions

In [83]:
def load_month(country: str,
               year: int,
               month: int) -> List[pd.DataFrame]:
    
    '''
    Loads the data downloaded in 1-download-gdelt for each day in input
    month and year for input country, returns a list of dataframes.
    Files are read from the global input_folder.
    
    Parameters
     country (str): one of 'brazil', 'indonesia', 'mexico'
     year (int): one of [2017, 2018, 2019]
     month (int): calendar month as integer (1-12)
     
    Returns
     dfs (list): list of pandas DataFrames, one per daily file found
    '''
    
    dfs = []
    # days_per_month is zero-indexed; days run from 1 to the month length
    for day in range(1, days_per_month[month - 1] + 1): 
        file = (input_folder + str(year) + str(month).zfill(2) 
                + str(day).zfill(2) + ".csv")
        if os.path.exists(file):
            dfs.append(pd.read_csv(file))
    return dfs

def find_link_to_scrape(df: pd.DataFrame) -> pd.DataFrame:
    '''
    Extracts a title from each SOURCEURL and flags the links whose
    title contains a relevant word, for a single dataframe
    '''
    
    df['to_scrape'] = ''
    df['title'] = ''
    for i in range(len(df)):
        links = df['SOURCEURL'][i]
        # Hyphenated runs in the URL; the longest is taken as the title
        l = re.findall(r'\w+(?:-\w+)+', links)
        if l:
            title = max(l, key = len)
            title = title.replace('-', ' ')
            # .at avoids chained assignment, which may write to a copy
            df.at[i, 'title'] = title
            if any(word in title for word in relevant_words):
                df.at[i, 'to_scrape'] = str(links)
    return df

def combine_days(dfs: List[pd.DataFrame]) -> pd.DataFrame:
    '''
    Flags the links to scrape in each daily dataframe, concatenates the
    results, and returns the flagged rows with a reset index
    '''
    df_parsed = [find_link_to_scrape(dfs[x]) for x in tnrange(len(dfs))]
    df_month = pd.concat(df_parsed)
    df_subs = df_month[df_month['to_scrape'] != '']
    df_subs = df_subs.reset_index()
    return df_subs

def save_obj(obj: Any, name: str, folder: str) -> None:
    'Helper function using pickle to save objects'
    with open(folder + name + '.pkl', 'wb') as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
        
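def download_url(url: int) -> int:
    'Scrapes the article at urls[url] and pickles it; returns 1 on success, 0 on failure'
    try:
        # url is an integer position in the global urls array
        article = NewsPlease.from_url(urls[url])
        save_obj(article, str(url).zfill(5), text_output_folder)
        return 1
    except Exception as ex:
        print(url, ex)
        return 0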
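
To make the title heuristic in `find_link_to_scrape` concrete, the sketch below applies the same regular expression to two made-up URLs. The helper name `title_from_url`, the URLs, and the short `keywords` list are assumptions for illustration only.

```python
import re
from typing import Optional

keywords = ['land', 'protest', 'dam']  # small illustrative subset


def title_from_url(url: str) -> Optional[str]:
    """Return the longest hyphenated run in the URL with hyphens as spaces."""
    slugs = re.findall(r'\w+(?:-\w+)+', url)
    if not slugs:
        return None
    return max(slugs, key=len).replace('-', ' ')


# Hypothetical URLs for illustration only
relevant = title_from_url(
    "https://example.com/2019/05/farmers-protest-land-dispute.html")
no_slug = title_from_url("https://example.com/article?id=123")

# Substring matching also fires inside longer words: 'dam' matches 'damage'
false_positive = any(w in "storm damage report" for w in keywords)
```

Because the check is a plain substring test, short keywords such as `dam` or `mine` can flag unrelated titles; matching on whole words would be stricter, at the cost of missing plurals and inflections.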

## 2.3 Function execution

Load the daily files for the input month as a list of dataframes, then combine them into a single dataframe in which each row corresponds to a candidate news event referencing environmental conflict. Save this dataframe as a csv in the metadata folder for the year, named by the month.

In [None]:
df = load_month(country, year, month)
df = combine_days(df)
df.to_csv(metadata_folder + str(month).zfill(2) + ".csv")

Because each news article may reference more than one event (e.g. "Farmers protested because of a recent death due to malnutrition" contains both a protest event and a malnutrition event), a dictionary is needed to keep track of which rows refer to which article.

`mapping_dictionary` is made such that the keys are the `unique_id`, and the values are a list of `pd.Index` items for the attached `pd.DataFrame`. This dictionary gets saved as a `.pkl` file in the `metadata_folder`.

In [82]:
urls = df['to_scrape'].unique()
mapping_dictionary = {}
for i, val in enumerate(urls):
    match = df.index[df['to_scrape'] == val].tolist()
    mapping_dictionary[i] = match 
    
save_obj(mapping_dictionary, str(month).zfill(2), metadata_folder)
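
The loop above rescans the whole dataframe once per unique URL. As a sketch of an alternative, the same mapping can be built in a single pass with a `defaultdict`; the toy dataframe and the names `toy`, `rows_by_url`, and `toy_mapping` are illustrative assumptions, not part of this notebook.

```python
from collections import defaultdict

import pandas as pd

# Toy stand-in for df: two rows share the same article URL
toy = pd.DataFrame({"to_scrape": ["url_a", "url_b", "url_a", "url_c"]})

# Single pass: collect the row labels that reference each URL
rows_by_url = defaultdict(list)
for row_label, url in toy["to_scrape"].items():
    rows_by_url[url].append(row_label)

# Dicts keep insertion order, so ids follow first appearance,
# matching the order returned by Series.unique()
toy_urls = list(rows_by_url)
toy_mapping = {i: rows for i, rows in enumerate(rows_by_url.values())}
```

Here `url_a` receives id 0 and maps to rows 0 and 2, mirroring the one-article-to-many-events case described above.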

Download the full text for each unique url in `df`, saving it to a `.pkl` file in `text_output_folder` with a name corresponding to its `mapping_dictionary` key. Articles that already have a saved file are skipped.

In [None]:
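# IDs already saved as zero-padded .pkl files, e.g. "00012.pkl" -> 12
existing_texts = {int(x.split('.')[0]) for x in os.listdir(text_output_folder)
                  if x.endswith('.pkl')}
to_download = [x for x in range(len(urls)) if x not in existing_texts]

pool = multiprocessing.Pool(16)
# download_url returns 1 on success and 0 on failure
results = pool.map(download_url, to_download)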

In [None]:
pool.close()
pool.join()
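
The skip-already-downloaded step can be checked in isolation. The sketch below, assuming the zero-padded `{unique_id}.pkl` naming used by `download_url`, creates a few empty files in a temporary folder and computes which ids remain; `pending_ids` is a hypothetical helper, not part of this notebook.

```python
import os
import tempfile
from typing import List


def pending_ids(n_urls: int, folder: str) -> List[int]:
    """Hypothetical helper: ids in range(n_urls) with no saved .pkl yet."""
    done = {int(os.path.splitext(name)[0])
            for name in os.listdir(folder)
            if name.endswith(".pkl")}
    return [i for i in range(n_urls) if i not in done]


with tempfile.TemporaryDirectory() as tmp:
    # Pretend ids 0 and 2 were scraped already; .DS_Store must be ignored
    for name in ["00000.pkl", "00002.pkl", ".DS_Store"]:
        open(os.path.join(tmp, name), "w").close()
    remaining = pending_ids(4, tmp)
```

Comparing integer ids rather than file names matters here: a file name such as `00000.pkl` never equals the integer `0`, so a direct membership test would re-download everything.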