# Article Content Extraction 


**Input:** A GDELT dataframe populated with online articles \
**Execution description:** Scrapes each article (one per row) for its content and adds the content as a list of words to the input dataframe \
**Output:** The original dataframe with a column appended containing a list of words from the article.

*The output for this notebook creates the input for the unsupervised topic model.*
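The description above speaks of a list of words per article. As a minimal sketch of how extracted article text could be turned into such a list, the snippet below uses a simple regular-expression tokenizer. The helper name `to_word_list` and the tokenization rule are assumptions for illustration, not part of this notebook.

```python
import re

def to_word_list(text):
    """Lowercase the text and return its alphabetic tokens as a list."""
    # Hypothetical helper: keeps letters and inner apostrophes, drops digits and punctuation.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

words = to_word_list("The cat's hat, the CAT.")
print(words)
```

A topic model would typically need further steps such as stop-word removal, which this sketch leaves out.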

### Runtime Configuration Settings
- `NUM_ARTICLES`: maximum number of articles to process
- `SOURCE_CSV`: path or URL of the input CSV
- `OUTPUT_CSV`: name of the output CSV
- `BATCH_ROWS`: number of rows per batch
- `BATCH_DELAY`: number of seconds to wait between batches
- `BEFORE_DATE`: upper bound of the date range; only articles dated before it are kept. Format: YYYY-MM-DD HH:MM:SS.000
- `AFTER_DATE`: lower bound of the date range; only articles dated after it are kept, for example '2019-01-01 00:00:00.000'. Format: YYYY-MM-DD HH:MM:SS.000


## Package Installation

In [1]:
# This will run if goose3 is not already installed.
%pip install goose3

## Imports

In [2]:
import warnings
warnings.filterwarnings("ignore")

import pandas as pd
from datetime import datetime
from goose3 import Goose
import csv
import time


## Get current timestamp

In [3]:
now = datetime.now()
now = now.strftime("%Y-%m-%d %H:%M:%S.000")

## Runtime configuration


In [10]:
NUM_ARTICLES = 10
SOURCE_CSV = '2020.csv'
OUTPUT_CSV = 'UpennBox_SL_2020_content_extracted.csv'
BATCH_ROWS = 100 # number of rows per batch
BATCH_DELAY = 0 # number of seconds between batches
BEFORE_DATE = now # YYYY-MM-DD HH:MM:SS.000
AFTER_DATE = '2019-01-01 00:00:00.000' # YYYY-MM-DD HH:MM:SS.000
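Because the date bounds are plain strings, a typo in them would go unnoticed until the filter silently returns the wrong rows. The following is a minimal sketch of a sanity check that could run right after the settings are defined. The function `check_config` and the constant `DATE_FORMAT` are hypothetical names introduced here, not part of the notebook.

```python
from datetime import datetime

# Assumed format, matching strings such as '2019-01-01 00:00:00.000'.
DATE_FORMAT = "%Y-%m-%d %H:%M:%S.%f"

def check_config(num_articles, batch_rows, batch_delay, after_date, before_date):
    """Raise ValueError on inconsistent settings; return the parsed date bounds."""
    if num_articles < 1 or batch_rows < 1:
        raise ValueError("NUM_ARTICLES and BATCH_ROWS must be at least 1")
    if batch_delay < 0:
        raise ValueError("BATCH_DELAY cannot be negative")
    # strptime raises ValueError itself if a string does not match the format.
    start = datetime.strptime(after_date, DATE_FORMAT)
    end = datetime.strptime(before_date, DATE_FORMAT)
    if start >= end:
        raise ValueError("AFTER_DATE must be earlier than BEFORE_DATE")
    return start, end

start, end = check_config(10, 100, 0,
                          "2019-01-01 00:00:00.000",
                          "2020-12-31 23:59:59.000")
```

In the notebook this would be called with the configuration variables themselves, for example `check_config(NUM_ARTICLES, BATCH_ROWS, BATCH_DELAY, AFTER_DATE, BEFORE_DATE)`.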

## Load input into a dataframe


In [None]:
df = pd.read_csv(SOURCE_CSV)
df.head()

## Restrict by date


In [20]:
period_df = df[(df['date'] < BEFORE_DATE) & (df['date'] > AFTER_DATE)]
period_df.shape

(36000, 16)
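The filter above compares the `date` column with the bounds as they are stored. If the column holds strings, that ordering is only reliable when every value uses the same zero-padded format as the bounds. A more defensive variant, sketched here on toy data, parses the column first and compares timestamps. The sample rows and the assumption that `date` is parseable by `pd.to_datetime` are illustrative only.

```python
import pandas as pd

# Toy rows standing in for the GDELT dataframe (values are illustrative only).
df = pd.DataFrame({
    "date": [
        "2018-12-31 23:59:59.000",
        "2019-06-01 12:00:00.000",
        "2021-03-15 08:30:00.000",
    ],
    "documentidentifier": [
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
    ],
})

AFTER_DATE = "2019-01-01 00:00:00.000"
BEFORE_DATE = "2020-12-31 23:59:59.000"

# Parse once, then compare timestamps instead of raw strings.
dates = pd.to_datetime(df["date"])
in_period = (dates > pd.Timestamp(AFTER_DATE)) & (dates < pd.Timestamp(BEFORE_DATE))
period_df = df[in_period]
```

Only the middle row falls strictly between the two bounds, so `period_df` keeps that one article.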

## Restrict by number of rows


In [27]:
# Keep at most NUM_ARTICLES rows; head() returns everything when fewer rows exist.
period_df = period_df.head(NUM_ARTICLES)




(10, 16)

## Run dataframe rows through Goose3 extraction

In [28]:
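g = Goose()
# Work on an explicit copy so that adding a column does not raise SettingWithCopyWarning.
small_df_copy = period_df[['gkgrecordid', 'date', 'documentidentifier']].copy()
small_df_copy['content'] = ''
batch = 0

for index, row in small_df_copy.iterrows():

    # Pause for BATCH_DELAY seconds after every BATCH_ROWS rows.
    batch = batch + 1
    if batch >= BATCH_ROWS:
        time.sleep(BATCH_DELAY)
        batch = 0

    url = row['documentidentifier']

    try:
        # Fetch the page once and keep only the cleaned article text.
        article = g.extract(url=url)
        small_df_copy.loc[index, 'content'] = article.cleaned_text
    except Exception:
        # Skip articles that cannot be fetched or parsed; their content stays empty.
        pass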

small_df_copy.to_csv(OUTPUT_CSV, 
                     index=False, quoting=csv.QUOTE_ALL)
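The loop above discards failures silently, which makes it hard to tell afterwards which URLs produced no content. The sketch below shows one possible way to factor the batching logic into a function that also records failed URLs. `extract_in_batches`, `fake_extract`, and the `sleep_fn` parameter are hypothetical names introduced for illustration; the extractor is passed in so the example runs without network access.

```python
import time

def extract_in_batches(urls, extract_fn, batch_rows=100, batch_delay=0.0,
                       sleep_fn=time.sleep):
    """Apply extract_fn to each URL, pausing after every batch_rows URLs.

    Returns (contents, failed): contents holds one string per URL ('' on
    failure) and failed lists the URLs whose extraction raised an exception.
    """
    contents, failed = [], []
    for position, url in enumerate(urls, start=1):
        try:
            contents.append(extract_fn(url))
        except Exception:
            contents.append("")
            failed.append(url)
        # Pause between batches, but not after the final URL.
        if position % batch_rows == 0 and position < len(urls):
            sleep_fn(batch_delay)
    return contents, failed

# Stand-in extractor so the sketch runs offline. With goose3 one might pass
# lambda url: g.extract(url=url).cleaned_text, where g = Goose().
def fake_extract(url):
    if "bad" in url:
        raise ValueError("unreachable")
    return "text from " + url

pauses = []  # records requested delays instead of actually sleeping
contents, failed = extract_in_batches(
    ["https://example.com/1", "https://example.com/bad", "https://example.com/3"],
    fake_extract, batch_rows=2, batch_delay=0.5, sleep_fn=pauses.append,
)
```

The returned `contents` list lines up with the input URLs, so it could be assigned directly to the `content` column, while `failed` can be logged or retried later.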
