# Data Preparation

The purpose of this notebook is to compile, clean, and organize the data from my Lexis Nexis bulk download into separate text files that can be used in subsequent analyses. Since this project draws on 5 different news publications, the goal is to separate the compiled articles into 5 folders, one per source. Adjustments are made along the way as needed.

In [1]:
import os
from collections import Counter
import string
import matplotlib.pyplot as plt
import seaborn as sn
import zipfile
import pandas as pd
import json
import shutil
import datetime as dt

First, I'm going to set the path to the bulk download zip file, create an output folder, and unzip the data.

In [2]:
## name of zip file containing all raw data from bulk download
ZIPFILE_NAME='../data/composite.zip'

In [3]:
## create the output folder ../data/text if it does not already exist
os.makedirs('../data/text', exist_ok=True)

In [4]:
## extract the article text files and collect the CSV manifests from the zip
manifest_data=[]
text_file_cnt=0

with zipfile.ZipFile(ZIPFILE_NAME) as zf:
    for f in zf.filelist:
        fn=os.path.basename(f.filename)
        # keep plaintext .txt articles; skip hidden files and the wish-magazine source
        # (the '.' prefix check also covers '._wish-magazine' files)
        if 'plaintext' in f.filename and fn.endswith('txt') and not fn.startswith(('.','wish-magazine')):
            print('Extracting', fn)
            with zf.open(f) as src, open(os.path.join('..','data','text',fn), 'wb') as out:
                shutil.copyfileobj(src,out)
            text_file_cnt+=1
        if f.filename.endswith('.csv'):
            # each CSV in the download is a manifest of article metadata
            with zf.open(f) as src:
                manifest_data.append(pd.read_csv(src))

Extracting china-daily---us-edition-us-shown-up-530dda30-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-secret-fort-in-531d7d78-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-public-health-scholars-531c67f8-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-trump-criticized-for-530f30ec-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-cdc-journal_-covid-19-52f21ae8-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-former-envoy-warns-52ece0e6-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-online-attacks-against-53199d0c-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-vice-fm_-don_t-use-53182cc4-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-us-urged-to-531b2c80-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-beijing-says-pompeo-53050310-9e68-11eb-abe9-0242ac160002.txt
Extracting china-daily---us-edition-ex-us-en

Extracting the-new-york-times-the-thanksgiving-myth-4e425b66-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-february_-the-whole-4d348154-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-citing-virus_-putin-50cf02bc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-young-people-have-50236952-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-the-best-movies-4ceccb84-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-searching-for-our-50e80e38-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-infection-numbers-spike-4de75676-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-working-girl-4f9aac98-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-racism-and-sexism-4cd04400-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-when-did-the-4fbf4e36-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-how-plagues-shape-50b18160-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-

Extracting the-new-york-times-can-we-end-50c010f4-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-as-restrictions-lift_-504bfcf0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-trump-defends-china-5043175c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-icelandΓÇÖs-ΓÇÿtest-everyoneΓÇÖ-4f2fb91a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-a-vice-president-4e648a42-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-the-virus_-the-4fbc1a7c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-teaching-ideas-and-50c5053c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-inoculation-gap-persists-4e9db9a2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-japanΓÇÖs-journey-to-50dce1b6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-medical-schools-have-4debde08-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-floyd-cardozΓÇÖs-memory-4fa0c97a-9e68-11eb-abe9-0242ac160

Extracting the-new-york-times-i_m-finally-an-50f2376e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-the-best-movies-4d0cb87c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-local-officials-hid-4d362d06-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-coronavirus-gloom_-and-5096e08a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-_diego-belongs-to-5036c13c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-housekeepers-face-a-4fa23936-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-ΓÇÿwe-need-helpΓÇÖ_-4dc338d6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-what-students-are-4e49bef6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-python-pants-and-4e1a6e76-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-louisiana-orders-bars-4d73869c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-how-humanity-unleashed-4cb10234-9e68-11eb-abe9-0242ac160002.txt
Extract

Extracting the-new-york-times-what-have-we-4df79dce-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-the-coronavirus-is-4cc3793c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-french-muslims-face-4e383528-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-virus-as-metaphor-4f296d58-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-black-americans-in-50954af4-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-coronavirus-fears-in-5018eca2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-ratcliffe-vows-_unvarnished-50e51f98-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-tracing-and-sampling-4ee7473e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-california-death-expands-4f09815a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-as-south-korea-4f1a5822-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-coronavirus-briefing_-what-4cee792a-9e68-11eb-abe9-0242ac1600

Extracting the-new-york-times-a-guide-to-4e8fb690-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-trumpΓÇÖs-remarks-prompt-4efb5058-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-a-shadow-medical-4e2d831c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-partners-in-misinformation-4e7f0408-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-i-watched-months-4e5d5506-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-lesser-powers-link-4e77c580-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-how-racism-and-4eff0cde-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-defiance-in-belarus-50570514-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-guitar-center-files-4eec7dee-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-coronavirus_-tornadoes_-n.f.l.-507c5b70-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-w.h.o.-failed-to-4ed0cde2-9e68-11eb-abe9-0242ac160002.txt


Extracting the-new-york-times-the-gambling-company-4e2a6ac4-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-coronavirus-live-updates_-4d1b7bbe-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-covid-19-arrived-in-4fde8198-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-a-failed-ebola-4e18daac-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-republican-senate-panel-5091ca64-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-venice-tourism-may-4f7fb1ae-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-u.s.-and-chinese-4f751c3a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-corrections-4f9fb936-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-who-helps-out-4d91832c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-25-days-that-4e6a9a18-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-york-times-how-plagues-shape-50ef44be-9e68-11eb-abe9-0242ac160002.txt
Extracting the-new-y

Extracting the-daily-telegraph-(australia)-nrl-star_-_i-517cea94-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-after-season-like-52255094-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-the-short-bites-51bbcec6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-todd-knows-origin-5199fba2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-china-in-cover-up-51ce8e12-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-and-the-winner-52570c10-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-tszyu_-gallen-to-51b5eccc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-market-mayhem-5146ccfc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-masters_-mates-&-5137a272-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-what_s-the-buzz-529f7d1a-9e68-11eb-abe9-0242ac160002.tx

Extracting the-daily-telegraph-(australia)-beaches-relieved-but-5205f92e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-uk-pollie_s-call-52ac0cba-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-china_s-war-on-512469d2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-our-chinese-takeaway-52af4010-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-halt-joint-virus-5173f1c8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-hack-a-giant-514e1174-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-buy-together-or-5154cbd6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-daine_s-great-test-517af590-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-what_s-the-buzz-51a1c062-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-broncos-not-keen-522bd61c-9e68-11eb-abe9-0242a

Extracting the-daily-telegraph-(australia)-the-gus-plan-5238c02a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-what_s-the-buzz-51d36306-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-golden-opportunity-513cbe24-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-twiggy_s-big-kowtow-51aa6118-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-right-to-take-516d42b0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-american-official-revives-519d98e8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-labor-tightens-belt-513b412a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-this-is-truly-52203e60-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-china-army-in-524b3110-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-families-cut-out-5244edf0-9e68-11eb-abe9-0242

Extracting the-daily-telegraph-(australia)-players_-time-to-515d3528-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-nicole_s-nine-perfect-51ff2c5c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-bulls-rule-in-51fc4960-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-origin-in-the-52c30de8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-gal-v-hunt-519f5264-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-crowds-back-for-51d73b98-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-tv-detective-51668eb6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-lab-leaks-happen_-51574276-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-origin-is-shaping-52136104-9e68-11eb-abe9-0242ac160002.txt
Extracting the-daily-telegraph-(australia)-make-a-run-5235d004-9e68-11eb-abe9-0242ac160002.txt
Extrac

Extracting the-guardian-(london)-victoria_-nsw-and-42df2dee-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-daniel-andrews-announces-408bdc0e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-who-experts_-covid-47d45e64-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-return-to-local-3e3958a0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-jeff-bezos_-the-4241a290-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-amy-coney-barrett-418c3234-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-live_-portugal_s-4224940c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-four-more-deaths-460bf59c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-joe-biden-and-3dd85348-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-australian-trade-minister-46bf6334-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-chief-medical-officer-467df5f

Extracting the-guardian-(london)-uk-police-contradict-42790e38-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-australia-live-3e9179cc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-morning-mail_-borders-41a4cc04-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-the-silence-by-4b7cdbb8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-_i-am-starving__-45a3d980-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-australia-launches-covid-19-474772b0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-us-live_-3edfc410-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-us-death-toll-4ad3adfe-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-global-report_-new-406f8914-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-live-news_-3fbb6ca4-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-a-political-populism

Extracting the-guardian-(london)-fbi-confirms-it-49b41684-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-the-two-meetings-4459de12-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-_i_m-scared__-top-496eb512-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-friday-briefing_-lockdown-4a9151a2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-why-has-china-43fdd266-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-how-did-coronavirus-3f2a2744-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-is-the-_challengeaccepted-4bfd6706-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-daniel-andrews-defiant-44e1d7f4-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-what-does-more-4b015ede-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-accuses-australia-421c75ba-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-california_-stat

Extracting the-guardian-(london)-kenya-issues-ultimatum-44669e72-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-russians-hacked-liam-45de6a50-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-global-alliance-formed-4bbce47e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-mcconnell-proposes-delaying-4b3f78cc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-australia-latest_-419a76fa-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-cerrie-burnell_-_disabled-4137184e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-border-agency-reports-44b11952-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-sydney-aged-care-3e57e46e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-behind-cambridge-analytica-46ccb99e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-victoria-reports-51-4adeaf88-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian

Extracting the-guardian-(london)-from-a-very-47937016-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-karm-gilespie_-government-4384ed2e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-sport-v-live-45d9eef8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-backs-_comprehensive-47fe807c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-trump-and-biden-45834972-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-live-news_-4a26ea60-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-a-disgraced-scientist-4785cd3a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-uk_-shapps-4b465912-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-new-zealand-delays-494f44e8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus_-_worrying_-rise-44c4180e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china_s-coercive-

Extracting the-guardian-(london)-for-millions_-lockdown-4ab609f2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-class-war-in-4c1ddd1a-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-states-begin-easing-478db7ca-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-no-more-hotspots-441193c8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-brazil-death-toll-4416936e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-victoria-records-113-41e2f650-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-tasmania-announces-_travel-441d22b0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-austria-to-mass-test-40462e48-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-alert-issued-after-44a86f8c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-global-report_-japan-4bf7d7d2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-victoria_s-corruption-watch

Extracting the-guardian-(london)-senate-estimates-told-4bd38b84-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-_we-poked-the-3ee6a1cc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-australia-relaxed-over-4b553c66-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-surnames-dictionary-goes-48965474-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-friday-briefing_-nhs-47615f54-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-elijah-moshinsky-obituary-3eda143e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-australia-latest_-4819828c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-records-first-454e46be-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-after-this-crisis_-433f9486-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-origin-seeks-fossil-44d4fc28-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-nrl-say

Extracting the-guardian-(london)-coronavirus-australia-latest_-4a667a7c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-there_s-a-hidden-48677668-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-hard-border-leads-403868da-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-top-chinese-diplomat-4312ae6c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-coronavirus-cases-42efdbf8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-oil-drops-back-45c83cd0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-from-the-pyramid-4ac830f0-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-who-scrambling-to-4959fc26-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-live-news_-3ddf1f48-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-unsolved-mystery_-what-4854c4d2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-latin-america-and-426

Extracting the-guardian-(london)-why-do-we-444ecbc6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-demand-for-makeup-454becfc-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-scott-morrison-to-4407ae9e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-cambridge-colleges-criticised-3f94cca2-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-wa-requires-new-4b9d7198-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-black-lives-matter-45335f0c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-reportedly-orders-429d01ee-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-morrison-urges-australians-411020fe-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-four-cruise-ship-43048b52-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-icymi_-australian-news-427a4fc8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-greece-reports-spike-4619

Extracting the-guardian-(london)-coronavirus-live-news_-458950f6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-sees-biggest-42a07c8e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-us-capitol-on-49fc3478-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-met-uses-software-400fcf2e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-mal-meninga-joins-40698f3c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-this-year_-we-4a937f68-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-we-lived-the-42059066-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-black-lives-matter_s-3f3bb3ce-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-global-report_-tokyo-4552e214-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-15-may_-4b56982c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-solitary-citizens_-the-486a9776-9e68-11eb-abe9

Extracting the-guardian-(london)-up-to-100-414e393e-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-only-23_-of-406d327c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-coronavirus-live-news_-49232a34-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-island-dreams-by-4b615898-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-does-vitamin-d-47a1864c-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-how-us-cities-4a4b4a36-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-black-people-four-4a47bfd8-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-trump-suggests-more-498d61f6-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-queensland-and-brisbane-4b52a582-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-australia-coronavirus-live_-42d64530-9e68-11eb-abe9-0242ac160002.txt
Extracting the-guardian-(london)-china-hits-back-4ab72454-9e68-11eb-abe9-0242ac

Extracting hindustan-times-transmission-of-coronavirus-3ccda782-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-terrorists-may-exploit-3ab9d740-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-beijing-must-answer-3b88a034-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-man-tests-covid-19-3b8faf96-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-joe-biden-vows-3aceacd8-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-trump-calls-covid-19-3b45cce6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-kozhikode-plane-crash-3c14f21e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-uk-economy-faces-39fa4e84-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-indian-origin-malhotra_-dhesi-3ab446d6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19-virus-may-3c3636fe-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19-pandemic_-timeline-3b6ac4b0-9e68-11eb-abe9-0242ac160002.txt
Extracting

Extracting hindustan-times-trump-says-china-3d69ed5e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-1_140-manipuris-stranded-3a313386-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-celebrating-sunshine-in-3aa22348-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-timely-lockdown-helped-3b676978-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-indian-sailors-stuck-3a385756-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19_-death-toll-3bef1a30-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19-has-sharpened-3a5d842c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19_-an-opportunity-3d005dda-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-coronavirus-live-updates_-3ca45972-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-alarm-bells-in-3cddb83e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-economic-stimulus-by-39fd2a8c-9e68-11eb-abe9-0242ac160002.txt
Extracting 

Extracting hindustan-times-maninder-sidhu-named-3b7554ac-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19_-mea-reaches-3d19b76c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-coronavirus-live_-say-3d5e6d9e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-south-asian-origin-39e824c0-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-when-gandhi-battled-3b493750-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-news-updates-from-3a45738c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-blast-near-cremation-3aa6042c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-facebook-coo_-sheryl-3b2a628a-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-italy-seeks-independent-3a6d1716-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-40_300-cyber-attacks-3d7bf60c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-punjab-origin-youngster-dies-3af824f0-9e68-11eb-abe9-0242ac160002.txt
Extracting hin

Extracting hindustan-times-assured-of-speedy-3c904162-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-metro-says-it-3bc3aac6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-pm-modi-speaks-3d92cb52-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19_-indians-in-3d335172-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-africans-harassed-in-3b28f972-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-uk-indian-doctors-3bd37be0-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-8-uk-returnees-3ce021e6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-suez-crisis-and-3ceae0fe-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-not-the-3c32b736-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-trace_-track-and-3a27277e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-covid-19_-why-it-3a9ffab4-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-twitter-takes-down-3c57c652-9e68-11eb-a

Extracting hindustan-times-govt-recommended-drug-for-39f0349e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-chinese-aggression-against-3c53b6de-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-what-can-explain-3ac62004-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-sharma_-arora_-rathod_-3d997aa6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-bihar-reports-7th-39fbb3fa-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-60_-olive-ridley-3a793172-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-goa-anticipates-huge-3ac4f1de-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-3-new-covid-19-3d269edc-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-canada_-armed-forces-3ae6e9e2-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-italy-wants-independent-3cf82778-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-another-_much-valued_-3baa22a4-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-

Extracting hindustan-times-reduce-dependence-on-3a49e82c-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-with-positivity-rate-3c4693fa-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-pm-narendra-modi-3cdf0324-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-live_-with-223-3d20fbe4-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-us-lawmakers-back-3a76dd00-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-icmr-says-cases-3cc97860-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-india-wary-of-3d761d5e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-uk-seeks-non-white-3b69215a-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-more-covid-19-positive-3bfcef66-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-move-over-originals_-3a7aa30e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-canadian-mps-urge-3af94ad8-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-community-transmission

Extracting hindustan-times-indian-origin-girl-invents-3b6077c6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-train-crushes-migrants-39e3a742-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-news-updates-from-3bd1b760-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-eam-jaishankar-begins-3b248c8e-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-recognise-the-centrality-3b563cf2-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-live_-foreign-investors-3b2b6bc6-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-pm-modi-speaks-3d57af36-9e68-11eb-abe9-0242ac160002.txt
Extracting hindustan-times-centre-sets-up-39e067f8-9e68-11eb-abe9-0242ac160002.txt


There are a lot of files in the zip file, so I'm going to take a quick look at how the data is organized in manifest_data before continuing.

In [5]:
# manifest_data[0] consists of all the China Daily publications
manifest_data[0]

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,china-daily---us-edition-former-envoy-warns-52...,China Daily - US Edition,,2020-03-14,Former envoy warns of US-China 'decoupling',,5YDT-8741-JDJN-60RF-00000-00
1,china-daily---us-edition-at-forum_-concern-52e...,China Daily - US Edition,,2020-04-22,"At forum, concern over state of US-China ties",,5YR5-8D21-F11P-X48D-00000-00
2,china-daily---us-edition-sanctions-imposed-on-...,China Daily - US Edition,,2020-07-14,"Sanctions imposed on US entity, officials over...",,60BX-5C61-F11P-X2M1-00000-00
3,china-daily---us-edition-amazon-area-gets-52f0...,China Daily - US Edition,,2021-02-25,Amazon area gets relief with Sinovac doses,China Daily Global,6232-DSB1-JDJN-6352-00000-00
4,china-daily---us-edition-cdc-journal_-covid-19...,China Daily - US Edition,,2020-12-10,CDC journal: COVID-19 circulating in Italy in ...,chinadaily.com.cn,61GP-0DB1-F11P-X1RX-00000-00
5,china-daily---us-edition-ministry_-_enemy-is-5...,China Daily - US Edition,,2020-04-21,Ministry: 'Enemy is the virus not China',,5YR1-8991-F11P-X4RH-00000-00
6,china-daily---us-edition-top-us-general-52f4cb...,China Daily - US Edition,,2020-05-06,"Top US general says ""we don't know"" where coro...",,5YV5-HBC1-JDJN-62KV-00000-00
7,china-daily---us-edition-rebound-sought-for-52...,China Daily - US Edition,,2020-07-22,Rebound sought for Ecuador shrimps,,60DK-XV51-F11P-X38G-00000-00
8,china-daily---us-edition-ex-us-envoy-decries-5...,China Daily - US Edition,,2020-03-25,Ex-US envoy decries political focus on coronav...,,5YHD-WGD1-JDJN-63P0-00000-00
9,china-daily---us-edition-us-right-wing-media-5...,China Daily - US Edition,,2020-05-02,US right-wing media fan virus origin rumor,,5YT9-MH51-F11P-X2WY-00000-00


Interestingly, the bulk download returned only 32 China Daily publications. I'm going to have to revisit Lexis Nexis to see whether additional publications were missed in the download.

In [6]:
# manifest_data[2] consists of all the NY Times publications
manifest_data[2]

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,the-new-york-times-how-democrats-win-4caca0a4-...,The New York Times,OPINION,2020-05-18,How Democrats Win in My Red State (and They Do...,Sarah Vowell,5YXH-HN21-JBG3-6015-00000-00
1,the-new-york-times-jimmy-kimmel-finds-4cae27c6...,The New York Times,ARTS,2020-05-15,Jimmy Kimmel Finds a Silver Lining in Whistle-...,Trish Bendix,5YX2-9NV1-DXY4-X2TR-00000-00
2,the-new-york-times-terroir_-travel-and-4caf5b5...,The New York Times,Section D,2020-12-02,"Terroir, Travel and Trauma",By Eric Asimov,61DY-32F1-DXY4-X42Y-00000-00
3,the-new-york-times-how-humanity-unleashed-4cb1...,The New York Times,MAGAZINE,2020-06-17,How Humanity Unleashed a Flood of New Diseases...,Ferris Jabr,6053-MSV1-JBG3-62C2-00000-00
4,the-new-york-times-house-hunting-on-4cb2ae9a-9...,The New York Times,REALESTATE,2020-09-02,House Hunting on Cyprus: Your Own Little Water...,Alison Gregor,60RJ-4KB1-JBG3-62YP-00000-00
...,...,...,...,...,...,...,...
544,the-new-york-times-racism_s-hidden-toll-50f131...,The New York Times,Section SR,2020-08-16,Racism's Hidden Toll,By Gus Wezerek,60KX-FS81-JBG3-62JY-00000-00
545,the-new-york-times-i_m-finally-an-50f2376e-9e6...,The New York Times,Section SR,2020-06-07,I'm Finally an Angry Black Man,By Issac Bailey,6030-GTN1-JBG3-6044-00000-00
546,the-new-york-times-the-race-for-50f48a14-9e68-...,The New York Times,HEALTH,2020-10-12,The Race for a Super-Antibody Against the Coro...,Apoorva Mandavilli,6123-RRT1-JBG3-64PM-00000-00
547,the-new-york-times-coronavirus_-oil-prices_-50...,The New York Times,BRIEFING,2020-04-21,"Coronavirus, Oil Prices, U.S. Immigration: You...",Isabella Kwai,5YPX-JR71-JBG3-61R4-00000-00


In [7]:
# manifest_data[4] consists of all the Daily Telegraph publications
manifest_data[4]

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,the-daily-telegraph-(australia)-infected-punte...,The Daily Telegraph (Australia),NEWS,2020-08-07,Infected punters on pub crawls,ANGIRA BHARADWAJ & JAMES O'DOHERTY,60HY-14M1-F0JP-W01W-00000-00
1,the-daily-telegraph-(australia)-no-fifita-feud...,The Daily Telegraph (Australia),SPORT,2020-07-29,No Fifita feud at Broncos,PETER BADEL,60G1-BF91-F0JP-W0BY-00000-00
2,the-daily-telegraph-(australia)-china_s-war-on...,The Daily Telegraph (Australia),NEWS,2021-03-08,China's war on the BBC,"James Morrow, EXCLUSIVE",625B-MJD1-F0JP-W483-00000-00
3,the-daily-telegraph-(australia)-chinese-might-...,The Daily Telegraph (Australia),NEWS,2020-12-04,Chinese might at museum,ANGIE RAPHAEL & CAMPBELL GELLIE,61F9-C6T1-F0JP-W038-00000-00
4,the-daily-telegraph-(australia)-and-the-winner...,The Daily Telegraph (Australia),SPORT,2020-04-08,AND THE WINNER IS ... SYDNEY,PAUL KENT,5YM2-33T1-JD3N-53SY-00000-00
...,...,...,...,...,...,...,...
417,the-daily-telegraph-(australia)-letters-to-the...,The Daily Telegraph (Australia),LETTERS,2021-02-12,LETTERS TO THE EDITOR,,6207-B3F1-JD3N-5031-00000-00
418,the-daily-telegraph-(australia)-shock-study_-5...,The Daily Telegraph (Australia),NEWS,2020-12-31,"Shock study: 500,000 infected in Wuhan",MERRYN JOHNS,61N2-K451-JD3N-54Y2-00000-00
419,the-daily-telegraph-(australia)-twig-links-put...,The Daily Telegraph (Australia),NEWS,2020-05-01,TWIG LINKS PUT TO TEST,EXCLUSIVE MIRANDA DEVINE,5YT1-XYW1-JD3N-552V-00000-00
420,the-daily-telegraph-(australia)-read-_em-&-52c...,The Daily Telegraph (Australia),NEWS,2020-12-30,READ 'EM & WHEAT AS CHINA HITS AGAIN,JARED LYNCH,61MV-P4J1-F0JP-W21M-00000-00


In [8]:
# manifest_data[6] consists of all the Guardian publications
manifest_data[6]

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,the-guardian-(london)-coronavirus-live-news_-3...,The Guardian (London),WORLD NEWS,2020-04-01,Coronavirus live news: US deaths could reach 2...,Helen Sullivan,5YJN-2M11-F021-62MG-00000-00
1,the-guardian-(london)-cyprus-to-allow-3dd4bd32...,The Guardian (London),WORLD NEWS,2021-03-03,Cyprus to allow fully vaccinated British touri...,"Lucy Campbell (now); Mattha Busby, Sarah Marsh...",624G-MV61-DY4H-K48R-00000-00
2,the-guardian-(london)-nrl-grand-final-3dd5d78a...,The Guardian (London),SPORT,2020-09-17,"NRL grand final set for 40,000 crowd after Cov...",Mike Hytner,60VP-FVW1-JCJY-G2P2-00000-00
3,the-guardian-(london)-joe-biden-and-3dd85348-9...,The Guardian (London),US NEWS,2020-10-15,Joe Biden and Kamala Harris both flew on fligh...,"Maanvi Singh (now), Joan E Greve, Jessica Glen...",612P-TXH1-DY4H-K536-00000-00
4,the-guardian-(london)-thursday-briefing_-_give...,The Guardian (London),WORLD NEWS,2020-04-16,Thursday briefing: 'Give loved ones the chance...,Warren Murray,5YNW-4D01-F021-60S4-00000-00
...,...,...,...,...,...,...,...
940,the-guardian-(london)-australia-records-its-4c...,The Guardian (London),AUSTRALIA NEWS,2020-08-11,Australia records its highest overnight corona...,Michael McGowan (now) and Amy Remeikis (earlier),60JX-6H11-JCJY-G55W-00000-00
941,the-guardian-(london)-the-best-birthday-4c3bc5...,The Guardian (London),SOCIETY,2020-07-03,The best birthday present for the NHS? An end ...,JS Bamrah and Kailash Chand,608H-H8N1-F021-64TT-00000-00
942,the-guardian-(london)-planes_-ships-and-4c3cdb...,The Guardian (London),AUSTRALIA NEWS,2020-10-24,"Planes, ships and hotel quarantine: how Austra...",Melissa Davey,614N-VKX1-DY4H-K0WR-00000-00
943,the-guardian-(london)-tommy-devito-obituary-4c...,The Guardian (London),MUSIC,2020-09-29,Tommy DeVito obituary,Garth Cartwright,60Y9-SPM1-JBNF-W1X8-00000-00


In [9]:
# manifest_data[8] consists of all the Hindustan Times publications
manifest_data[8]

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,hindustan-times-uk_-leicester-town-39de0440-9e...,Hindustan Times,,2020-06-29,UK: Leicester town is facing lockdown,,607M-G5S1-JDKC-R459-00000-00
1,hindustan-times-amid-suspected-wuhan-39df6178-...,Hindustan Times,,2020-06-11,"Amid suspected Wuhan wet market link, Centre i...",,603N-N5N1-F12F-F4CW-00000-00
2,hindustan-times-centre-sets-up-39e067f8-9e68-1...,Hindustan Times,,2020-05-17,"Centre sets up an online database to monitor, ...",,5YXF-JNC1-JDKC-R0VS-00000-00
3,hindustan-times-metro-stations-in-39e16fa4-9e6...,Hindustan Times,,2020-08-30,Metro stations in red zones not likely to open...,,60PY-M6J1-JDKC-R157-00000-00
4,hindustan-times-coronavirus-live_-china-39e25f...,Hindustan Times,,2020-03-12,Coronavirus LIVE: China coronavirus adviser ex...,,5YDF-4NB1-JDKC-R4CK-00000-00
...,...,...,...,...,...,...,...
650,hindustan-times-news-updates-from-3d9b8e2c-9e6...,Hindustan Times,,2020-09-16,News updates from Hindustan Times: Former PM M...,,60VH-0N41-F12F-F3MM-00000-00
651,hindustan-times-coronavirus-live_-3-3d9ceaba-9...,Hindustan Times,,2020-03-12,Coronavirus LIVE: 3 planes to be sent to Iran ...,,5YDF-4NB1-JDKC-R4CV-00000-00
652,hindustan-times-who_s-draft-recommendation-3d9...,Hindustan Times,,2021-02-21,WHO's draft recommendation on Wuhan Covid-19 p...,,6227-6J21-F12F-F165-00000-00
653,hindustan-times-status-of-covid-19-3da082ba-9e...,Hindustan Times,,2021-02-07,Status of Covid-19 vaccine passports around th...,,61Y7-SK31-F12F-F0YM-00000-00


I already have a substantial corpus: 32 publications from the China Daily, 549 from the NY Times, 422 from the Daily Telegraph, 945 from the Guardian, and 655 from the Hindustan Times. However, as previously mentioned, a manual Lexis Nexis search using the same keywords and filters turned up additional China Daily publications, which I have downloaded and added to my data folder as a separate RTF file.
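
The per-source counts can also be tabulated directly rather than read off each display. Below is a minimal sketch, assuming each entry in manifest_data has a Publication column as shown in the tables above. The helper name summarize_manifests and the toy frames are hypothetical and stand in for the real manifests.

```python
import pandas as pd

def summarize_manifests(manifests):
    """Return one row per manifest: list position, publication name(s), article count."""
    rows = []
    for position, mdf in enumerate(manifests):
        publications = sorted(mdf['Publication'].dropna().unique())
        rows.append({'position': position,
                     'publication': '; '.join(publications),
                     'articles': len(mdf)})
    return pd.DataFrame(rows)

# toy stand-ins for manifest_data (hypothetical, not the real manifests)
toy_manifests = [
    pd.DataFrame({'Publication': ['China Daily - US Edition'] * 2}),
    pd.DataFrame({'Publication': ['The New York Times'] * 3}),
]
summary = summarize_manifests(toy_manifests)
print(summary)
```

Calling summarize_manifests(manifest_data) in the notebook would show which list position holds which source, which is handy because the sources are later selected by position.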

For this additional sample of China Daily publications, I will go through a similar cleaning process. Unlike the zip file, a manual download from Lexis Nexis does not include a manifest file, so I will have to build the equivalent of manifest_data myself.

In [10]:
## additional sample of China Daily publications
with open('../data/china_daily_second_sample.RTF') as f:
    cd_add_sample = f.read()

In [11]:
cd = cd_add_sample.split('End of Document')
cd=cd[1:429]

In [12]:
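## defining some functions here that extract the title and the date from a document
def get_title(doc):
    # the title is the first line of the document; lowercase it and hyphenate the spaces
    lines = doc.strip().split('\n')
    title = lines[0].lower().replace(' ', '-')
    return title

def get_date(doc):
    # the date is on the third line; keep the month, day, and year tokens
    lines = doc.strip().split('\n')
    date = lines[2].replace(',', '').split()
    date = date[:3]
    # convert the month name to its number and shorten the year to two digits
    date[0] = str(dt.datetime.strptime(date[0][:3], '%b').month)
    date[2] = date[2][-2:]
    date = "/".join(date)
    return date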
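
Note that get_date returns dates in M/D/YY form, while the bulk manifests store them as YYYY-MM-DD. If the two need to agree, a variant along the following lines could be used. This is only a sketch: the function name get_iso_date is hypothetical, and the example date line is an assumed format inferred from how get_date parses it, not copied from the RTF file.

```python
import datetime as dt

def get_iso_date(date_line):
    """Convert a date line (assumed form: 'April 21, 2021 Wednesday') to YYYY-MM-DD."""
    # keep the month, day, and year tokens, as get_date does
    parts = date_line.replace(',', '').split()[:3]
    month = dt.datetime.strptime(parts[0][:3], '%b').month
    return dt.date(int(parts[2]), month, int(parts[1])).isoformat()

print(get_iso_date('April 21, 2021 Wednesday'))
```

Like the original function, this relies on English month abbreviations being recognized by strptime.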

Here, I will create a dataframe that matches the one in the manifest_data for the bulk data download which will allow me to easily append this separate dataset with the bulk one. Afterwards, I will create the filepaths with the same file naming format as those of the bulk download.

In [13]:
## writing out a filepath to match the format of the bulk data download and creating a dataframe that matches that in the manifest_data
rows = []
for doc in cd:
    title = get_title(doc)
    date = get_date(doc)
    filename = 'china-daily---us-edition-{}.txt'.format(title)
    rows.append({'Filename': filename,
            'Publication':'China Daily - US Edition',
            'Section':' ',
            'Date': date,
            'Title': title,
            'Author':'China Daily',
            'LNID':' '})
    filepath = "../data/text/{}".format(filename)
    print('Creating', filepath)
    with open(filepath, 'w') as out:
        out.write(doc)
## DataFrame.append was removed in pandas 2.0, so the dataframe is built once from the collected rows
cd_df = pd.DataFrame(rows)

Creating ../data/text/china-daily---us-edition-research-team-member-expresses-surprise,-consternation-over-who-director's-remarks.txt
Creating ../data/text/china-daily---us-edition-origins-of-political-virus-put-in-spotlight-by-report:-china-daily-editorial.txt
Creating ../data/text/china-daily---us-edition-origins-of-covid-19-are-natural,-insists-who.txt
Creating ../data/text/china-daily---us-edition-fighting-covid-19:-china-in-action.txt
Creating ../data/text/china-daily---us-edition-who-experts-to-travel-to-china-to-research-origin-of-covid-19.txt
Creating ../data/text/china-daily---us-edition-covid-19-origin-not-from-lab,-say-foreign-experts.txt
Creating ../data/text/china-daily---us-edition-who-team:-probe-of-virus'-origin-should-not-be-'geographically-bound'.txt
Creating ../data/text/china-daily---us-edition-china,-who-discussing-expert-visit-details.txt
Creating ../data/text/china-daily---us-edition-trans-china-railway-presents-golden-opportunity-for-global-supply-chain.txt
Crea

Creating ../data/text/china-daily---us-edition-hurun-report-details-covid-19-impact-on-world's-top-billionaires.txt
Creating ../data/text/china-daily---us-edition-research-well-underway-into-covid-19-vaccines-and-drugs,-premier-says.txt
Creating ../data/text/china-daily---us-edition-us-hews-to-outdated-zero-sum-mindset-in-a-win-win-world.txt
Creating ../data/text/china-daily---us-edition-natural-fashion-is-the-style-in-nz-show.txt
Creating ../data/text/china-daily---us-edition-four-questions-the-us-must-answer-concerning-covid-19.txt
Creating ../data/text/china-daily---us-edition-scientists-in-cambodia-find-close-match-for-covid-19-pathogen-in-2010-samples.txt
Creating ../data/text/china-daily---us-edition-it's-time-for-nations-to-combat-covid-19-together.txt
Creating ../data/text/china-daily---us-edition-four-questions-the-us-must-answer-concerning-covid-19.txt
Creating ../data/text/china-daily---us-edition-moving-moments-in-the-fight-against-covid-19.txt
Creating ../data/text/china-d

Creating ../data/text/china-daily---us-edition-experts-say-it's-groundless-to-hold-china-accountable-for-covid-19.txt
Creating ../data/text/china-daily---us-edition-peru-gets-tips-from-chinese-medical-team-on-fighting-covid-19.txt
Creating ../data/text/china-daily---us-edition-a-call-for-unity,-action:-covid-19-and-principles-of-responsibility,-hope.txt
Creating ../data/text/china-daily---us-edition-who-declares-covid-19-a-pandemic.txt
Creating ../data/text/china-daily---us-edition-cold-chain-goods-major-cause-of-infections.txt
Creating ../data/text/china-daily---us-edition-virus-similar-to-pathogen-behind-covid-19-found-in-malaysian-pangolins.txt
Creating ../data/text/china-daily---us-edition-ministry-defends-handling-of-initial-stage-of-outbreak.txt
Creating ../data/text/china-daily---us-edition-facilities-in-wuhan-get-visit-from-who-team.txt
Creating ../data/text/china-daily---us-edition-facilities-in-wuhan-get-visit-from-who-team.txt
Creating ../data/text/china-daily---us-edition-u

Creating ../data/text/china-daily---us-edition-latest-on-the-novel-coronavirus-outbreak.txt
Creating ../data/text/china-daily---us-edition-latest-on-the-novel-coronavirus-outbreak.txt
Creating ../data/text/china-daily---us-edition-latest-on-the-novel-coronavirus-outbreak.txt
Creating ../data/text/china-daily---us-edition-latest-on-the-novel-coronavirus-outbreak.txt
Creating ../data/text/china-daily---us-edition-latest-on-the-novel-coronavirus-outbreak.txt
Creating ../data/text/china-daily---us-edition-curbs-to-be-extended-by-a-week-with-jump-in-untraceable-cases.txt
Creating ../data/text/china-daily---us-edition-many-factors-will-push-mncs-to-bet-on-china.txt
Creating ../data/text/china-daily---us-edition-leung:-tighter-measures-may-be-needed-to-contain-the-virus.txt
Creating ../data/text/china-daily---us-edition-neither-'wuhan-virus'-nor-'los-angeles-virus'.txt
Creating ../data/text/china-daily---us-edition-us-told-to-stop-politicization-of-pandemic.txt
Creating ../data/text/china-dai

Creating ../data/text/china-daily---us-edition-naming-of-virus-receives-mixed-review-by-scientists.txt
Creating ../data/text/china-daily---us-edition-china's-contagion-response-deserves-support,-experts-say.txt
Creating ../data/text/china-daily---us-edition-envoy:-china-us-mutual-support-needed-right-now.txt
Creating ../data/text/china-daily---us-edition-eu-urged-to-remove-market-obstructions.txt
Creating ../data/text/china-daily---us-edition-experts-hail-beijing's-strategy-to-fight-virus.txt
Creating ../data/text/china-daily---us-edition-how-to-win-battle-against-the-pandemic.txt
Creating ../data/text/china-daily---us-edition-resource-center-helps-virology-researchers.txt
Creating ../data/text/china-daily---us-edition-russian-party-member-hails-efforts-by-the-cpc.txt
Creating ../data/text/china-daily---us-edition-scholars-call-for-calmness-to-fight-virus.txt
Creating ../data/text/china-daily---us-edition-smearing-china-won't-protect-us,-ministry-says.txt
Creating ../data/text/china-da
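
The output above shows some filenames more than once (for example latest-on-the-novel-coronavirus-outbreak.txt). Because each file is opened in 'w' mode, a later document with the same title overwrites the earlier one. The titles also keep commas, colons, and apostrophes, which some filesystems reject. One possible way around both issues is sketched below; safe_slug and unique_filenames are hypothetical helpers, and numbering the repeats is my own assumption rather than the naming scheme of the bulk download.

```python
import re
from collections import Counter

def safe_slug(title):
    """Lowercase a title and keep only letters, digits, and hyphens."""
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower())
    return slug.strip('-')

def unique_filenames(titles, prefix='china-daily---us-edition-'):
    """Build one filename per title, numbering repeats so none is overwritten."""
    seen = Counter()
    names = []
    for title in titles:
        slug = safe_slug(title)
        seen[slug] += 1
        # the first occurrence keeps the plain slug; repeats get -2, -3, ...
        suffix = '' if seen[slug] == 1 else '-{}'.format(seen[slug])
        names.append('{}{}{}.txt'.format(prefix, slug, suffix))
    return names

titles = ['Latest on the novel coronavirus outbreak',
          'Latest on the novel coronavirus outbreak',
          "Neither 'Wuhan virus' nor 'Los Angeles virus'"]
names = unique_filenames(titles)
for name in names:
    print(name)
```

This simple numbering could still collide with a title that already ends in a number, so it is a starting point rather than a complete solution.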

In [14]:
cd_df

Unnamed: 0,Filename,Publication,Section,Date,Title,Author,LNID
0,china-daily---us-edition-research-team-member-...,China Daily - US Edition,,4/21/21,"research-team-member-expresses-surprise,-const...",China Daily,
1,china-daily---us-edition-origins-of-political-...,China Daily - US Edition,,3/31/21,origins-of-political-virus-put-in-spotlight-by...,China Daily,
2,china-daily---us-edition-origins-of-covid-19-a...,China Daily - US Edition,,5/5/20,"origins-of-covid-19-are-natural,-insists-who",China Daily,
3,china-daily---us-edition-fighting-covid-19:-ch...,China Daily - US Edition,,6/8/20,fighting-covid-19:-china-in-action,China Daily,
4,china-daily---us-edition-who-experts-to-travel...,China Daily - US Edition,,7/8/20,who-experts-to-travel-to-china-to-research-ori...,China Daily,
...,...,...,...,...,...,...,...
406,china-daily---us-edition-tweeted-smear-simply-...,China Daily - US Edition,,3/23/20,tweeted-smear-simply-amplifies-poor-judgment,China Daily,
407,china-daily---us-edition-virus-fight-ramps-up-...,China Daily - US Edition,,3/11/20,virus-fight-ramps-up-in-asia,China Daily,
408,china-daily---us-edition-a-time-for-solidarity...,China Daily - US Edition,,2/6/20,"a-time-for-solidarity,-not-stigma",China Daily,
409,china-daily---us-edition-trade-of-wild-animal-...,China Daily - US Edition,,1/28/20,trade-of-wild-animal-banned-to-curb-virus,China Daily,


While the bulk data download only provided me with 32 publications, I have now identified an additional 411 texts to expand my corpus. I will now append the manifest data for these 411 to that of the original 32.

In [15]:
## append manifest_data of additional sample with bulk sample for China Daily
## (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
manifest_data[0]=pd.concat([manifest_data[0], cd_df], ignore_index=True)

In [16]:
## write out a json file all_corpus_index (contains ALL)
mdf_comp=pd.concat(manifest_data, ignore_index=True)
mdf_comp.to_json('../data/all_corpus_index.json', orient='records')

The code cell above writes a JSON file named all_corpus_index to the data folder, containing ALL the publications. However, because my folders and notebooks are organized by source, the following steps create a folder for each of the five sources and move each publication into the appropriate folder based on its filename.

In [17]:
## organizing data folder based on source name
## exist_ok=True makes the cell safe to re-run when the folders already exist
for folder in ['china_daily', 'hindustan_times', 'daily_telegraph', 'guardian', 'nyt']:
    os.makedirs(os.path.join('..', 'data', 'text', folder), exist_ok=True)

In [18]:
## moving each text to appropriate folder
source = '../data/text/'
cd_destination = '../data/text/china_daily'
ht_destination = '../data/text/hindustan_times'
dt_destination = '../data/text/daily_telegraph'
g_destination = '../data/text/guardian'
nyt_destination = '../data/text/nyt'

for f in os.listdir(source):
    if f.startswith('china-daily'):
        shutil.move(source + f, cd_destination)
    if f.startswith('hindustan-times'):
        shutil.move(source + f, ht_destination)
    if f.startswith('the-daily-telegraph'):
        shutil.move(source + f, dt_destination)
    if f.startswith('the-guardian'):
        shutil.move(source + f, g_destination)
    if f.startswith('the-new-york'):
        shutil.move(source + f, nyt_destination)
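
The same prefix-based sorting can be written with a single mapping from filename prefix to folder, which keeps the five rules in one place. The sketch below runs in a throwaway temporary directory so it does not touch the real data folder. PREFIX_TO_FOLDER and sort_into_folders are hypothetical names, and the demo filenames are made up.

```python
import os
import shutil
import tempfile

# filename prefix -> destination subfolder (same pairs as in the cell above)
PREFIX_TO_FOLDER = {
    'china-daily': 'china_daily',
    'hindustan-times': 'hindustan_times',
    'the-daily-telegraph': 'daily_telegraph',
    'the-guardian': 'guardian',
    'the-new-york': 'nyt',
}

def sort_into_folders(source):
    """Move each file in `source` into the subfolder matching its prefix."""
    moved = {}
    for name in sorted(os.listdir(source)):
        path = os.path.join(source, name)
        if not os.path.isfile(path):
            continue  # skip the source subfolders themselves
        for prefix, folder in PREFIX_TO_FOLDER.items():
            if name.startswith(prefix):
                target_dir = os.path.join(source, folder)
                os.makedirs(target_dir, exist_ok=True)
                shutil.move(path, os.path.join(target_dir, name))
                moved[name] = folder
                break
    return moved

# demonstration with made-up files in a temporary directory
demo_dir = tempfile.mkdtemp()
for name in ['china-daily---us-edition-a.txt', 'the-guardian-(london)-b.txt', 'notes.txt']:
    with open(os.path.join(demo_dir, name), 'w') as out:
        out.write('placeholder')
moved = sort_into_folders(demo_dir)
print(moved)
```

Files that match no prefix (notes.txt in the demo) are left where they are.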

Rather than always using the composite corpus index json, I will create five separate json files, one for each source, all located in their appropriate folders.

In [19]:
## filtering the manifest_data by source
cd_filter = manifest_data[0]
nyt_filter = manifest_data[2]
dt_filter = manifest_data[4]
g_filter = manifest_data[6]
ht_filter = manifest_data[8]

In [20]:
## read in the composite corpus json file
with open('../data/all_corpus_index.json') as f:
    all_corpus = json.load(f)

In [21]:
## writing out five separate corpus index json files, one per source, each saved in its source folder
## each index is written once; looping over every article would rewrite the same five files repeatedly
cd_filter.to_json('../data/text/china_daily/cd_corpus_index.json', orient='records')
ht_filter.to_json('../data/text/hindustan_times/ht_corpus_index.json', orient='records')
dt_filter.to_json('../data/text/daily_telegraph/dt_corpus_index.json', orient='records')
g_filter.to_json('../data/text/guardian/guardian_corpus_index.json', orient='records')
nyt_filter.to_json('../data/text/nyt/nyt_corpus_index.json', orient='records')
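
Selecting each source by its position in manifest_data (0, 2, 4, 6, 8) depends on the order in which the CSV files appear in the zip. An alternative that does not rely on position is to group the composite index by its Publication column. This is a sketch on a toy frame; split_by_publication is a hypothetical helper and the filenames are made up.

```python
import pandas as pd

def split_by_publication(composite):
    """Return a dict mapping each publication name to its own index DataFrame."""
    return {publication: group.reset_index(drop=True)
            for publication, group in composite.groupby('Publication')}

# toy stand-in for the composite index (hypothetical rows)
composite = pd.DataFrame({
    'Filename': ['china-daily---us-edition-a.txt',
                 'the-guardian-(london)-b.txt',
                 'china-daily---us-edition-c.txt'],
    'Publication': ['China Daily - US Edition',
                    'The Guardian (London)',
                    'China Daily - US Edition'],
})
per_source = split_by_publication(composite)
for publication, frame in per_source.items():
    # frame.to_json(..., orient='records') would write this source's index
    print(publication, len(frame))
```

In the notebook, the composite frame would be mdf_comp, and each resulting frame could be written to the matching source folder.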

My corpus is now complete! All individual publications and texts are organized in their appropriate source folders in the data folder. Now we can begin data analysis on the texts within each source.