# Data Analysis - The Guardian

In this notebook, I first run a general analysis of all publications collected from the Guardian, cleaning the data further where necessary. I then generate frequency lists of commonly used words and phrases, followed by a KWIC (keyword-in-context) analysis to examine the contexts surrounding a selection of words of interest.

- general analysis
- frequency lists
- KWIC analysis


In [1]:
import os
import json
import random
import shutil
import pandas as pd

import datetime as dt
import calendar

from collections import Counter

In [2]:
## run the functions notebook here
%run functions.ipynb

In [3]:
## open the corpus_index file that only contains publications from the Guardian
with open('../data/text/guardian/guardian_corpus_index.json') as f:
    g_corp = json.load(f)

In [4]:
len(g_corp)

945

The corpus contains 945 publications from the Guardian. The first general analysis looks at publication dates, to see whether output was higher in particular months.

In [5]:
## extract year-month and append to list dates
dates=[]

for article in g_corp:
    date=article['Date'][:7]
    dates.append(date)  

In [6]:
## counter and 5 most common times of publication
dates_dist= Counter(dates)
dates_dist.most_common(5)

[('2020-05', 167),
 ('2020-04', 117),
 ('2020-06', 95),
 ('2021-01', 65),
 ('2020-12', 59)]

In [7]:
date_df = pd.DataFrame.from_records(list(dates_dist.items()), columns=['date','article'])
date_df.sort_values('date').plot.bar(x='date',y='article', figsize=(12,6), color='salmon',ylabel='# Articles',title='The Guardian')

<AxesSubplot:title={'center':'The Guardian'}, xlabel='date', ylabel='# Articles'>

In [8]:
## punctuation passed to tokenize() as strip_chars
characters_to_remove = '!,.()[]|"'

In [9]:
## extract text from each publication corresponding to the Guardian and add to dictionary
## note this step may get stuck if you are repeating this because of the presence of special characters in the file name
### if this occurs, rename the files with special characters accordingly
for article in g_corp:
    filename = article['Filename']
    with open('../data/text/guardian/{}'.format(filename)) as f:
        article['text'] = f.read()

In [10]:
## tokenize the text and add 'tokens', 'token_cnt', and 'type_cnt' to dictionary 
for article in g_corp:
    if article.get('text'):
        g_tokens = tokenize(article['text'], lowercase=True, strip_chars=characters_to_remove)
        article['tokens'] = g_tokens 
        article['token_cnt'] = len(g_tokens)
        article['type_cnt'] = len(set(g_tokens))
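
The `tokenize` helper used above is defined in functions.ipynb and is not shown in this notebook. As a rough illustration only, the sketch below shows one way such a helper could work. The name `tokenize_sketch` is hypothetical, and it assumes whitespace splitting with the given characters stripped from the edges of each token; the real helper may behave differently.

```python
# Hypothetical stand-in for the notebook's tokenize(); the real helper lives
# in functions.ipynb and may behave differently.
def tokenize_sketch(text, lowercase=False, strip_chars=''):
    """Split on whitespace, optionally lowercase, and strip edge characters."""
    if lowercase:
        text = text.lower()
    tokens = []
    for raw in text.split():
        token = raw.strip(strip_chars)
        if token:  # drop tokens that consisted only of stripped characters
            tokens.append(token)
    return tokens

sample_tokens = tokenize_sketch('The lab (in Wuhan), he said.', lowercase=True,
                                strip_chars='!,.()[]|"')
print(sample_tokens)
```

Under this assumption, `token_cnt` is the length of the returned list and `type_cnt` is the size of its set, as in the cell above.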

In [11]:
g_token_dist=Counter()
g_text_dist= Counter()
g_bigram_dist=Counter()
g_trigram_dist=Counter()

for article in g_corp:
    if article.get('tokens'):
        g_tokens = article['tokens']
        g_token_dist.update(g_tokens)
        g_text_dist.update(set(g_tokens))
        
        g_bigrams = get_ngram_tokens(g_tokens,2)
        g_trigrams = get_ngram_tokens(g_tokens,3)
        
        g_bigram_dist.update(g_bigrams)
        g_trigram_dist.update(g_trigrams)
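
`get_ngram_tokens` also comes from functions.ipynb. The sketch below is a guess at its behavior, not the actual implementation: the name `ngram_tokens_sketch` is hypothetical, and it assumes n-grams are returned as space-joined strings, which is consistent with keys such as 'of the' in the bigram output.

```python
from collections import Counter

# Hypothetical stand-in for get_ngram_tokens(); assumes n-grams are
# space-joined strings built from consecutive tokens.
def ngram_tokens_sketch(tokens, n):
    """Return every run of n consecutive tokens as one space-joined string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_tokens = ['the', 'spread', 'of', 'the', 'virus']
toy_bigram_dist = Counter(ngram_tokens_sketch(toy_tokens, 2))
toy_trigram_dist = Counter(ngram_tokens_sketch(toy_tokens, 3))
print(toy_bigram_dist.most_common(2))
```

Because the n-grams are built per article, no n-gram spans the boundary between two publications.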

In [12]:
## top 50 most common words and frequencies
g_token_dist.most_common(50)

[('the', 559468),
 ('to', 271272),
 ('of', 237387),
 ('in', 200805),
 ('and', 193956),
 ('a', 177738),
 ('that', 94405),
 ('on', 94087),
 ('for', 85154),
 ('is', 84265),
 ('has', 66647),
 ('at', 63051),
 ('have', 60849),
 ('with', 56340),
 ('as', 55870),
 ('said', 55263),
 ('from', 54181),
 ('it', 53206),
 ('be', 53157),
 ('are', 50431),
 ('will', 44636),
 ('coronavirus', 39481),
 ('was', 39402),
 ('by', 39377),
 ('we', 38086),
 ('this', 36750),
 ('people', 35641),
 ('cases', 34161),
 ('not', 33656),
 ('been', 33275),
 ('new', 33254),
 ('he', 33139),
 ('block-time', 32366),
 ('but', 30715),
 ('health', 29578),
 ('-', 28804),
 ('an', 28264),
 ('bst', 27867),
 ('more', 27414),
 ('who', 26767),
 ('covid-19', 26542),
 ('they', 24963),
 ('had', 23685),
 ('were', 23330),
 ('its', 22399),
 ('published-time', 21493),
 ('after', 21269),
 ('i', 20970),
 ('their', 20762),
 ('which', 20631)]

In [13]:
g_bigram_dist.most_common(50)

[('of the', 59010),
 ('in the', 47160),
 ('to the', 25400),
 ('block-time published-time', 21493),
 ('on the', 17226),
 ('for the', 17185),
 ('to be', 14599),
 ('in a', 13856),
 ('at the', 13728),
 ('will be', 13419),
 ('and the', 13164),
 ('from the', 12794),
 ('have been', 12734),
 ('that the', 11876),
 ('with the', 11863),
 ('number of', 11857),
 ('the coronavirus', 11694),
 ('the virus', 11595),
 ('by the', 10942),
 ('block-time updated-timeupdated', 10873),
 ('updated-timeupdated at', 10873),
 ('more than', 10750),
 ('has been', 10308),
 ('the country', 9384),
 ('said the', 9338),
 ('the government', 9039),
 ('it is', 8966),
 ('the us', 8847),
 ('according to', 8251),
 ('of a', 8060),
 ('the first', 7899),
 ('the world', 7717),
 ('as the', 7527),
 ('the pandemic', 7419),
 ('bst block-time', 6931),
 ('to a', 6722),
 ('the uk', 6494),
 ('is a', 6394),
 ('as a', 6273),
 ('new cases', 6137),
 ('over the', 6127),
 ('the number', 6076),
 ('we have', 5774),
 ('it was', 5537),
 ('he said'

In [14]:
g_trigram_dist.most_common(50)

[('block-time updated-timeupdated at', 10873),
 ('bst block-time published-time', 6931),
 ('the number of', 5780),
 ('gmt block-time published-time', 3810),
 ('the spread of', 3480),
 ('of the coronavirus', 3424),
 ('of the virus', 3368),
 ('one of the', 3361),
 ('the end of', 3174),
 ('according to the', 2845),
 ('the coronavirus pandemic', 2828),
 ('tested positive for', 2465),
 ('the world health', 2449),
 ('spread of the', 2256),
 ('said in a', 2212),
 ('the prime minister', 2181),
 ('in the uk', 2166),
 ('be able to', 2164),
 ('world health organization', 2151),
 ('the united states', 2139),
 ('in the us', 2104),
 ('as well as', 2080),
 ('in the country', 2063),
 ('of the pandemic', 2062),
 ('the white house', 2039),
 ('in a statement', 1908),
 ('in the past', 1837),
 ('total number of', 1700),
 ('cases in the', 1631),
 ('the total number', 1570),
 ('the first time', 1541),
 ('new south wales', 1479),
 ('the health ministry', 1468),
 ('in the last', 1451),
 ('number of cases', 143

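The frequency lists above are similar to the ones from the other sources, and nothing in the vocabulary itself stands out. The lists do contain tokens such as block-time, published-time, updated-timeupdated, bst, and gmt, which look like timestamp markup from live-blog pages rather than article text.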
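
Several high-ranking entries in the lists above (block-time, published-time, updated-timeupdated, bst, gmt) look like page markup rather than prose. If further cleaning is wanted, one option is to drop them before counting. This is a minimal sketch under that assumption; `METADATA_TOKENS` and `drop_metadata_tokens` are hypothetical names, and the token list is read off the output above and is not exhaustive.

```python
from collections import Counter

# Assumption: these tokens are page markup rather than article prose. The list
# is read off the frequency output above and is not exhaustive.
METADATA_TOKENS = {'block-time', 'published-time', 'updated-timeupdated', 'bst', 'gmt'}

def drop_metadata_tokens(tokens, unwanted=METADATA_TOKENS):
    """Return the tokens with the suspected markup tokens removed."""
    return [tok for tok in tokens if tok not in unwanted]

toy_tokens = ['block-time', 'published-time', 'new', 'cases', 'bst', 'reported']
clean_dist = Counter(drop_metadata_tokens(toy_tokens))
print(clean_dist.most_common())
```

Two caveats: bst and gmt can also occur in genuine sentences about time, and removing tokens before building n-grams makes words adjacent that were not adjacent in the original text.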

It is useful to be able to refer to one document that contains all the text, so I will quickly write out a file that consists of only the text portion of each publication.

In [15]:
## write out one doc (g_doc) that consists solely of the texts from each publication
g_single_text_list=[]

for article in g_corp:
    if article.get('tokens'):
        g_single_text_list.append(article['text'])

g_doc = '\n---\n'.join(g_single_text_list)
with open('../data/text/guardian/g_composite_text.txt','w') as out:
    out.write(g_doc)

In [16]:
g_comp_toks = tokenize(g_doc, lowercase=True, strip_chars=characters_to_remove)
g_comp_toks_dist = Counter(g_comp_toks)

In [17]:
## print the raw count and the frequency per 10,000 tokens for each origin term
print("{: <20}{: <6}\t{}".format('term','The Guardian','Norm Freq'))
print("="*62)
origin_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory','imported']
for term in origin_terms:
    print("{: <20}{: <6}\t{}".format(term, g_comp_toks_dist[term], g_comp_toks_dist[term]/len(g_comp_toks)*10000))

term                The Guardian	Norm Freq
laboratory          438   	0.47853236091478785
lab                 565   	0.6172848947873405
bioweapon           7     	0.00764777745754227
market              2080  	2.272482444526846
military            898   	0.9811005938389942
cold-chain          14    	0.01529555491508454
conspiracy          519   	0.5670280714949197
army                324   	0.353982842320528
detrick             1     	0.001092539636791753
transparency        360   	0.3933142692450311
origins             1089  	1.189775664466219
wuhan               2165  	2.3653483136541453
theory              403   	0.44029347362707644


All of the origin terms shown above occur in the Guardian texts. Wuhan (2,165) and market (2,080) are the most frequent and nearly tied, which is not the case in the other sources. Origins (1,089), lab (565), and conspiracy (519) are also frequent; lab is of interest because of the lab-leak theory. Bioweapon (7), cold-chain (14), and Detrick (1) are rare, but each does appear. Norm Freq is the count per 10,000 tokens.
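
The normalized frequency in the table is the raw count divided by the total number of tokens, multiplied by 10,000, so that sources of different sizes can be compared. A small worked example with toy numbers (not corpus values) is sketched below; `per_10k` is a hypothetical helper name.

```python
def per_10k(count, total_tokens):
    """Frequency of a term per 10,000 tokens of running text."""
    return count / total_tokens * 10000

# Toy numbers, not taken from the corpus.
toy_total = 50000
toy_counts = {'wuhan': 25, 'market': 20, 'bioweapon': 1}
for term, count in toy_counts.items():
    print("{: <12}{: <6}{:.2f}".format(term, count, per_10k(count, toy_total)))
```

Rounding to two decimals, as here, keeps the column readable; the notebook prints the unrounded float.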

In [18]:
print_kwic(make_kwic('laboratory',g_comp_toks))

              virus including a possible  laboratory  leak in geneva who
                      have leaked from a  laboratory  in a press conference
              virus including a possible  laboratory  leak in geneva who
                      have leaked from a  laboratory  in a press conference
             from a coronavirus research  laboratory  in the chinese city
             from a coronavirus research  laboratory  in the chinese city
                 from a chinese research  laboratory  intelligence sources have told
                     this came from that  laboratory  in wuhan no evidence
                      the attempt by the  laboratory  to refute the claims
                          view of the p4  laboratory  at the wuhan institute
                      the attempt by the  laboratory  to refute the claims
                      the attempt by the  laboratory  to refute the claims
                      the attempt by the  laboratory  to refute the claims
                 

In [19]:
print_kwic(make_kwic('lab',g_comp_toks))

             visit to glasgow lighthouse  lab  he said: we will
             visit to glasgow lighthouse  lab  he said: we will
                  ever having symptoms a  lab  experiment suggests scientists who
              deliberately infected in a  lab  does not mean that
               spy chief dismisses wuhan  lab  conspiracy theory in the
               spy chief dismisses wuhan  lab  conspiracy theory in the
                     that workers in the  lab  may not have always
                         a leak from the  lab  could have caused the
                   focusing on the wuhan  lab  for several days culminating
                  that could support any  lab  theory a report in
                  high security work the  lab  itself says it is
               tenuous suggestion of the  lab  theory and noted the
                       not match any her  lab  had previously studied charles
                   a research and design  lab  in copenhagen mentally it's
                

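Most of these lines reference the lab-leak theory, though a few are unrelated (for example, the Glasgow Lighthouse lab and a research and design lab in Copenhagen).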
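
The KWIC listings in this notebook come from `make_kwic` and `print_kwic` in functions.ipynb, which are not shown here. The sketch below illustrates the general technique only. The names `kwic_sketch` and `print_kwic_sketch` are hypothetical, and the four-token window on each side is an assumption based on the output above.

```python
# Hypothetical stand-ins for make_kwic() and print_kwic(); the real helpers
# in functions.ipynb may use a different window size or layout.
def kwic_sketch(keyword, tokens, window=4):
    """Return (left context, keyword, right context) for every hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

def print_kwic_sketch(hits, left_width=40):
    """Right-align the left context so the keywords line up in one column."""
    for left, keyword, right in hits:
        print("{: >{w}}  {}  {}".format(left, keyword, right, w=left_width))

toy_tokens = 'a leak from the lab could have caused the outbreak'.split()
print_kwic_sketch(kwic_sketch('lab', toy_tokens))
```

Because the search runs over the composite token list, an article that appears more than once in the corpus produces identical KWIC lines, which may explain the repeated rows in the output.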

In [20]:
print_kwic(make_kwic('bioweapon',g_comp_toks))

                     that covid-19 was a  bioweapon  engineered by china the
               10-day period pushing the  bioweapon  conspiracy theory which were
                   have observed for the  bioweapon  conspiracy are orchestrated by
                covid-19 is a human-made  bioweapon  produced by the chinese
                      was a north korean  bioweapon  --- when the outbreak
                was a deliberate chinese  bioweapon  attack though he has
                     was engineered as a  bioweapon  in fact epidemiologists say


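Here we see the bioweapon theory: the claim that COVID-19 was a bioweapon engineered by China. One line attributes it to North Korea instead, and several lines describe the claim as a conspiracy theory.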

In [22]:
print_kwic(make_kwic('cold-chain',g_comp_toks))

                      not related to the  cold-chain  and logistical management of
           distribution to regions where  cold-chain  storage is not an
           distribution to regions where  cold-chain  storage is not an
                    do not have advanced  cold-chain  storage networks it also
                    do not have advanced  cold-chain  storage networks it also
               the mrna vaccines require  cold-chain  storage at extremely low
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-chain  food products was possible
                said that spread through  cold-cha

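Some of these lines reflect the cold-chain theory, that the virus spread through cold-chain food products. The others refer to cold-chain storage and logistics for vaccine distribution.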

In [23]:
print_kwic(make_kwic('conspiracy',g_comp_toks))

      believing a baseless qanon-related  conspiracy  theory that the online
               she believed the debunked  conspiracy  theory while continuing to
                  in a global pedophilia  conspiracy  she replied: you know
               the far right antisemitic  conspiracy  theory she has almost
      believing a baseless qanon-related  conspiracy  theory that the online
               she believed the debunked  conspiracy  theory while continuing to
                  in a global pedophilia  conspiracy  she replied: you know
               the far right antisemitic  conspiracy  theory she has almost
              ratcheted up his obamagate  conspiracy  theory to implicate joe
      related: trump deepens 'obamagate'  conspiracy  theory with biden unmasking
               chief dismisses wuhan lab  conspiracy  theory in the uk
               chief dismisses wuhan lab  conspiracy  theory in the uk
              of receiving such vaccines  conspiracy  theories and misinformat

A theory I had not come across before also turns up in these results: the 5G coronavirus conspiracy theory. Conspiracy is used more broadly here than in the other sources; some instances concern wider political conspiracies (QAnon, Obamagate) rather than the origins of COVID-19.

In [24]:
print_kwic(make_kwic('army',g_comp_toks))

                for treatment the former  army  captain has seen his
                for treatment the former  army  captain has seen his
                affected by the slowdown  army  medics test soliders arriving
                   regiment the new york  army  national guard's 1st battalion
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                   august at new delhi's  army  hospital research and referral
                    world war one french  army  uniform wear protective f

In [25]:
print_kwic(make_kwic('origins',g_comp_toks))

 organization team investigating covid's  origins  is planning to scrap
 organization team investigating covid's  origins  is planning to scrap
                 interim report on virus  origins  - report the wall
 organization team investigating covid's  origins  is planning to scrap
                who-led team probing the  origins  of the pandemic dominic
              all hypotheses on covid-19  origins  still being investigated says
 organization team investigating covid's  origins  is planning to scrap
 organization team investigating covid's  origins  is planning to scrap
                who-led team probing the  origins  of the pandemic dominic
                  to questions about the  origins  of the materials included
                  to questions about the  origins  of the materials included
                    casting doubt on the  origins  of the virus and
                 global inquiry into the  origins  and early handling of
                     global study of the  ori

These lines mention both independent and larger institutional investigations into the origins, including the WHO-led team.

In [26]:
print_kwic(make_kwic('transparency',g_comp_toks))

            has again questioned china's  transparency  over the coronavirus outbreak
                        on areas such as  transparency  turkey has recorded 235
                        on areas such as  transparency  turkey has recorded 235
                  for climate action and  transparency  at oil and gas
                    moment he also vowed  transparency  and accountability on the
                     who is committed to  transparency  accountability and continuous improvement
                    moment he also vowed  transparency  and accountability on the
                     who is committed to  transparency  accountability and continuous improvement
                   cover-ups and lack of  transparency  the two top economies
                   cover-ups and lack of  transparency  the two top economies
             an assault on international  transparency  that cost tens of
             an assault on international  transparency  that cost tens of
                  h

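China's transparency over the outbreak is questioned, and a lack of transparency is criticized more generally. Some uses are unrelated to COVID-19 origins (for example, transparency on climate action).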

In [29]:
print_kwic(make_kwic('theory',g_comp_toks))

     a baseless qanon-related conspiracy  theory  that the online furniture
        believed the debunked conspiracy  theory  while continuing to deny
        far right antisemitic conspiracy  theory  she has almost no
     a baseless qanon-related conspiracy  theory  that the online furniture
        believed the debunked conspiracy  theory  while continuing to deny
        far right antisemitic conspiracy  theory  she has almost no
             up his obamagate conspiracy  theory  to implicate joe biden
    trump deepens 'obamagate' conspiracy  theory  with biden unmasking move
          dismisses wuhan lab conspiracy  theory  in the uk the
          dismisses wuhan lab conspiracy  theory  in the uk the
                    - and the prevailing  theory  is that the virus
                   could support any lab  theory  a report in the
                   suggestion of the lab  theory  and noted the genetic
                 mike pompeo brought the  theory  into the mainstream by
      

Theory is used in a variety of ways, most of which name a specific conspiracy theory or an investigation into it. As with conspiracy, some uses belong to broader politics (antisemitic conspiracy theories, Trump, Obama) rather than to COVID-19 origins.

In [30]:
print_kwic(make_kwic('detrick',g_comp_toks))

                  biological lab at fort  detrick  give more transparency to


In [54]:
# specifically texts containing selected origin terms
origin_txt_g=[]
for word in origin_terms:
    for article in g_corp:
        if article in origin_txt_g:
            continue
        elif word in article.get('tokens', []):
            origin_txt_g.append(article)

In [55]:
len(origin_txt_g)

767

In [57]:
# percentage of Guardian articles that contain at least one origin term
round(len(origin_txt_g)/len(g_corp)*100,2)

81.16
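
The same selection can be expressed as a single pass over the corpus using a set intersection. This is a sketch on toy data, not the notebook's method; `articles_with_terms` is a hypothetical name, and articles without a 'tokens' key are assumed to count as non-matching.

```python
def articles_with_terms(corpus, terms):
    """Keep articles whose token list contains at least one of the terms."""
    term_set = set(terms)
    return [art for art in corpus if term_set & set(art.get('tokens', []))]

# Toy corpus, not real data; the third entry has no 'tokens' key at all.
toy_corpus = [
    {'Filename': 'a.txt', 'tokens': ['the', 'wuhan', 'market']},
    {'Filename': 'b.txt', 'tokens': ['new', 'cases', 'today']},
    {'Filename': 'c.txt'},
]
matched = articles_with_terms(toy_corpus, ['wuhan', 'lab'])
print(len(matched), round(len(matched) / len(toy_corpus) * 100, 2))
```

This version returns articles in corpus order, whereas the nested loop above orders them by the first origin term that matched, so any file written from the result would list the texts in a different order.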

In [58]:
# write out
terms_only_txt=[]
for article in origin_txt_g:
    terms_only_txt.append(article['text'])
    
terms_only_doc = '\n---\n'.join(terms_only_txt)
with open('../data/text/guardian/g_terms_only_texts.txt','w') as out:
    out.write(terms_only_doc)