# Data Analysis - The Daily Telegraph

In this notebook, I first conduct a general analysis of all publications collected for the Daily Telegraph, cleaning the data further where necessary. I then generate frequency lists of commonly used words and phrases, followed by a KWIC (keyword in context) analysis to investigate the contexts surrounding a selection of words of interest.

- general analysis
- frequency lists
- KWIC analysis

In [1]:
import os
import json
import random
import shutil

import datetime as dt
import calendar

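from collections import Counter

## pandas is used for the bar plot below (it may also be imported by functions.ipynb)
import pandas as pd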

In [2]:
## run the functions notebook here
%run functions.ipynb

In [3]:
## open the corpus_index file that contains only publications from the Daily Telegraph
with open('../data/text/daily_telegraph/dt_corpus_index.json') as f:
    dt_corp = json.load(f)

In [4]:
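## drop Wish Magazine entries by building a new list; removing items while
## iterating over the same list would skip the entry after each removal
dt_corp = [article for article in dt_corp
           if not article['Filename'].startswith('wish-magazine')]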

In [5]:
len(dt_corp)

415
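A filter like the one above is easy to double-check by splitting the index into kept and dropped entries, so the number of removed items can be compared against what was expected. The following is a minimal sketch on a made-up index; the filenames and the helper name `split_by_prefix` are illustrative assumptions, not part of the corpus or of functions.ipynb.

```python
# Toy corpus index; these filenames are hypothetical examples.
sample_index = [
    {'Filename': 'wish-magazine-001.txt'},
    {'Filename': 'wish-magazine-002.txt'},
    {'Filename': 'daily-telegraph-001.txt'},
]

def split_by_prefix(index, prefix):
    """Return (kept, dropped) lists based on whether 'Filename' starts with prefix."""
    kept, dropped = [], []
    for entry in index:
        target = dropped if entry['Filename'].startswith(prefix) else kept
        target.append(entry)
    return kept, dropped

kept, dropped = split_by_prefix(sample_index, 'wish-magazine')
print(len(kept), len(dropped))
```

Because both lists are built fresh, consecutive unwanted entries cannot be skipped, which can happen when items are removed from a list during iteration.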

The corpus contains a total of 415 publications from the Daily Telegraph of Australia. The bulk download oddly included 7 entries from Wish Magazine, which is unrelated to this project, so I filtered them out of the corpus. The first general analysis looks at the date of publication to see whether output was greater at particular times.

In [6]:
## extract year-month and append to list dates
dates=[]

for article in dt_corp:
    date=article['Date'][:7]
    dates.append(date)

In [7]:
## counter and 5 most common times of publication
dates_dist= Counter(dates)
dates_dist.most_common(5)

[('2020-05', 96),
 ('2020-04', 54),
 ('2020-08', 46),
 ('2020-12', 38),
 ('2020-06', 31)]

As with the China Daily and the NY Times, the largest number of publications appeared in May 2020. Unlike the China Daily, but like the NY Times, the next most common month was April rather than November. To visualize the data, I will plot the distribution as a bar graph.
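Before plotting, it can help to list the counts in chronological order rather than by frequency. The sketch below uses only the standard library; the helper name `month_range` is my own, and the sample counter contains only the five months reported above, so a zero printed for any other month reflects missing sample data here, not the real corpus.

```python
from collections import Counter

def month_range(start, end):
    """Yield 'YYYY-MM' strings from start to end, inclusive."""
    year, month = map(int, start.split('-'))
    end_year, end_month = map(int, end.split('-'))
    while (year, month) <= (end_year, end_month):
        yield '{:04d}-{:02d}'.format(year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1

# Only the five most common months shown above; other months are unknown here.
sample_dist = Counter({'2020-05': 96, '2020-04': 54, '2020-08': 46,
                       '2020-12': 38, '2020-06': 31})

months = sorted(sample_dist)
for ym in month_range(months[0], months[-1]):
    print(ym, sample_dist.get(ym, 0))
```

Run against the full `dates_dist`, the same loop would make any month with no articles visible as an explicit zero.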

In [8]:
date_df = pd.DataFrame.from_records(list(dates_dist.items()), columns=['date','article'])
date_df.sort_values('date').plot.bar(x='date',y='article', figsize=(12,6), color='salmon',title='The Daily Telegraph',ylabel='# Articles')

<AxesSubplot:title={'center':'The Daily Telegraph'}, xlabel='date', ylabel='# Articles'>

In [9]:
characters_to_remove = '!,.()[]"'

In [10]:
## extract text from each publication corresponding to the Daily Telegraph and add to dictionary
for article in dt_corp:
    filename = article['Filename']
    ## use a context manager so each file is closed after reading
    with open('../data/text/daily_telegraph/{}'.format(filename)) as f:
        article['text'] = f.read()

In [11]:
## tokenize the text and add 'tokens', 'token_cnt', and 'type_cnt' to dictionary 
for article in dt_corp:
    if article.get('text'):
        dt_tokens = tokenize(article['text'], lowercase=True, strip_chars=characters_to_remove)
        article['tokens'] = dt_tokens 
        article['token_cnt'] = len(dt_tokens)
        article['type_cnt'] = len(set(dt_tokens))
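The `tokenize` helper comes from functions.ipynb and is not shown in this notebook. As a rough illustration of what a call with `lowercase=True` and `strip_chars` might do, here is a minimal sketch that assumes whitespace splitting and deletion of the listed characters; the real helper may behave differently, and `simple_tokenize` is a hypothetical name.

```python
def simple_tokenize(text, lowercase=False, strip_chars=''):
    """Split on whitespace, optionally lowercase, and delete the given characters."""
    if lowercase:
        text = text.lower()
    # str.maketrans with a third argument maps each listed character to None
    table = str.maketrans('', '', strip_chars)
    tokens = (tok.translate(table) for tok in text.split())
    # drop tokens that became empty after character removal
    return [tok for tok in tokens if tok]

tokens = simple_tokenize('The virus, he said, "came from (Wuhan)."',
                         lowercase=True, strip_chars='!,.()[]"')
print(tokens)
print(len(tokens), len(set(tokens)))  # token count and type count
```

The last line mirrors the `token_cnt` and `type_cnt` fields stored per article: the first counts every token, the second counts distinct word forms.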

In [12]:
## counters for tokens of individual words, bigrams, and trigrams
dt_token_dist=Counter()
dt_text_dist= Counter()
dt_bigram_dist=Counter()
dt_trigram_dist=Counter()

for article in dt_corp:
    if article.get('tokens'):
        dt_tokens = article['tokens']
        dt_token_dist.update(dt_tokens)
        dt_text_dist.update(set(dt_tokens))
        
        dt_bigrams=get_ngram_tokens(dt_tokens,2)
        dt_trigrams=get_ngram_tokens(dt_tokens,3)
        
        dt_bigram_dist.update(dt_bigrams)
        dt_trigram_dist.update(dt_trigrams) 
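The cell above keeps two kinds of word counts: `dt_token_dist` counts every occurrence, while `dt_text_dist` is updated with a set and therefore counts the number of articles in which a word appears. The sketch below illustrates the difference on two toy documents. The `ngrams` function is a hypothetical stand-in for `get_ngram_tokens`, assuming it returns space-joined strings as the outputs below suggest.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return space-joined n-grams from a token list (empty if the list is too short)."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

docs = [
    ['the', 'origins', 'of', 'the', 'virus'],
    ['the', 'origins', 'of', 'covid-19'],
]

token_freq = Counter()    # total occurrences across all documents
doc_freq = Counter()      # number of documents containing each type
trigram_freq = Counter()

for toks in docs:
    token_freq.update(toks)
    doc_freq.update(set(toks))
    trigram_freq.update(ngrams(toks, 3))

print(token_freq['the'], doc_freq['the'])
print(trigram_freq.most_common(1))
```

A word with a high token frequency but a low document frequency is concentrated in a few articles, which is worth checking before reading too much into a raw count.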

In [13]:
## top 50 most common words and frequencies
dt_token_dist.most_common(50)

[('the', 18811),
 ('to', 8707),
 ('of', 7400),
 ('and', 7133),
 ('a', 6310),
 ('in', 5696),
 ('is', 3174),
 ('for', 3054),
 ('that', 2516),
 ('on', 2426),
 ('with', 2241),
 ('it', 2237),
 ('was', 2016),
 ('be', 1957),
 ('have', 1762),
 ('as', 1756),
 ('has', 1705),
 ('he', 1600),
 ('at', 1568),
 ('from', 1453),
 ('this', 1440),
 ('his', 1375),
 ('are', 1338),
 ('but', 1320),
 ('by', 1206),
 ('will', 1169),
 ('an', 1156),
 ('not', 1145),
 ('we', 1120),
 ('-', 1119),
 ('said', 1096),
 ('they', 1045),
 ('who', 1001),
 ('been', 967),
 ('i', 950),
 ('their', 942),
 ('all', 779),
 ('china', 775),
 ('covid-19', 697),
 ('there', 697),
 ('nrl', 694),
 ('one', 691),
 ('when', 682),
 ('or', 676),
 ('had', 672),
 ('up', 672),
 ('our', 670),
 ('australia', 662),
 ('if', 656),
 ('about', 655)]

In [14]:
## top 50 most common bigrams
dt_bigram_dist.most_common(50)

[('of the', 1828),
 ('in the', 1339),
 ('to the', 813),
 ('for the', 673),
 ('on the', 625),
 ('to be', 604),
 ('at the', 571),
 ('and the', 525),
 ('with the', 497),
 ('the nrl', 406),
 ('will be', 374),
 ('has been', 360),
 ('in a', 351),
 ('the world', 348),
 ('it was', 343),
 ('from the', 326),
 ('it is', 314),
 ('have been', 308),
 ('by the', 304),
 ('the virus', 301),
 ('per cent', 288),
 ('of a', 281),
 ('is a', 270),
 ('into the', 269),
 ('that the', 265),
 ('is the', 261),
 ('as a', 255),
 ('one of', 251),
 ('he said', 244),
 ('the chinese', 235),
 ('the game', 233),
 ('the first', 229),
 ('as the', 220),
 ('for a', 210),
 ('of origin', 208),
 ('with a', 204),
 ('rugby league', 202),
 ('state of', 200),
 ('would be', 193),
 ('the coronavirus', 189),
 ('to get', 177),
 ('the australian', 175),
 ('about the', 174),
 ('there is', 173),
 ('the covid-19', 173),
 ('more than', 172),
 ('this year', 165),
 ('said the', 163),
 ('to a', 163),
 ('he was', 161)]

In [15]:
## top 50 most common trigrams
dt_trigram_dist.most_common(50)

[('state of origin', 182),
 ('one of the', 148),
 ('the daily telegraph', 132),
 ('the origins of', 125),
 ('wuhan institute of', 89),
 ('the end of', 86),
 ('institute of virology', 81),
 ('into the origins', 80),
 ('origins of the', 79),
 ('around the world', 74),
 ('of the virus', 71),
 ('the world health', 66),
 ('of the coronavirus', 66),
 ('the chinese government', 66),
 ('the rest of', 64),
 ('the wuhan institute', 64),
 ('of the year', 64),
 ('the covid-19 pandemic', 60),
 ('a lot of', 59),
 ('world health organisation', 57),
 ('there is no', 57),
 ('source of power:', 57),
 ('the first time', 56),
 ('inquiry into the', 55),
 ('the chinese communist', 52),
 ('rest of the', 51),
 ('chinese communist party', 50),
 ('some of the', 48),
 ('of the world', 48),
 ('to be a', 48),
 ('out of the', 48),
 ('we need to', 47),
 ('per cent of', 47),
 ('at the end', 46),
 ('need to be', 45),
 ('for the first', 45),
 ('as well as', 44),
 ('in the world', 43),
 ('end of the', 42),
 ('part of th

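It's interesting to see the explicit mention of the Chinese government and the Chinese Communist Party. Additionally, trigrams such as "the origins of" and "into the origins" directly reference the origins examined in this project. The most frequent trigram, "state of origin", most likely does not: together with the frequent "the nrl" and "rugby league" bigrams, it points to coverage of the State of Origin rugby league series rather than the origins of the virus. Finally, the Wuhan Institute of Virology appears, which suggests that the Daily Telegraph may also highlight the lab-leak theory.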

It is useful to be able to refer to one document that contains all the text, so I will quickly write out a file that consists of only the text portion of each publication.

In [16]:
## write out one doc titled dt_doc that consists of solely the texts from each publication
dt_single_text_list=[]

for article in dt_corp:
    if article.get('tokens'):
        dt_single_text_list.append(article['text'])

dt_doc = '\n---\n'.join(dt_single_text_list)
with open('../data/text/daily_telegraph/dt_composite_text.txt','w') as out:
    out.write(dt_doc)

In [17]:
dt_comp_toks = tokenize(dt_doc, lowercase=True, strip_chars=characters_to_remove)
dt_comp_toks_dist = Counter(dt_comp_toks)

In [18]:
print("{: <20}{: <6}\t{}".format('term','The Daily Telegraph', 'Norm Freq'))
print("="*42)
origin_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory','imported']
for term in origin_terms:
    ## normalized frequency = occurrences per 10,000 tokens
    print("{: <20}{: <6}\t{}".format(term, dt_comp_toks_dist[term], round((dt_comp_toks_dist[term]/len(dt_comp_toks)*10000), 2)))

term                The Daily Telegraph	Norm Freq
laboratory          173   	5.67
lab                 90    	2.95
bioweapon           0     	0.0
market              156   	5.11
military            85    	2.78
cold-chain          0     	0.0
conspiracy          23    	0.75
army                34    	1.11
detrick             0     	0.0
transparency        30    	0.98
origins             176   	5.77
wuhan               357   	11.69
theory              29    	0.95


As with the China Daily and the NY Times, Wuhan has the highest frequency of all the origin terms selected. Next come origins, laboratory, and market, all fairly general words. While the lab is mentioned, the terms tied to other theories (bioweapon, cold-chain, and Detrick, for Fort Detrick) do not occur at all.
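The "Norm Freq" column is the raw count scaled to occurrences per 10,000 tokens, which is what allows comparison with corpora of different sizes. A minimal sketch of that calculation follows; the function name `per_10k` and the corpus sizes are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def per_10k(count, total_tokens):
    """Occurrences per 10,000 tokens, rounded to two decimals."""
    if total_tokens == 0:
        return 0.0
    return round(count / total_tokens * 10000, 2)

# Hypothetical corpus sizes, for illustration only.
small = per_10k(50, 100000)
large = per_10k(50, 250000)
print(small, large)
```

The same raw count of 50 yields a rate two and a half times higher in the smaller corpus, which is why the normalized column, not the raw one, should be used when comparing newspapers.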

In [19]:
print_kwic(make_kwic('laboratory',dt_comp_toks))

                    have originated in a  laboratory  the former boss of
                   was manufactured in a  laboratory  and he believes the
                       been created in a  laboratory  a number of european
                        was created in a  laboratory  the official government language
                      have leaked from a  laboratory  he said: politics it
                     leaked from a wuhan  laboratory  is akin to their
                   originated in a wuhan  laboratory  and could be making
                         came out of the  laboratory  rather than the wet
                 virus originated from a  laboratory  instead it centres on
                      of the virus wuhan  laboratory  has nothing to do
                  escaped from a chinese  laboratory  the research facility is
                     a biosafety level 4  laboratory  that is able to
                     leaked from a wuhan  laboratory  the origins of coronavirus
                   

In [20]:
print_kwic(make_kwic('lab',dt_comp_toks))

               in an american biowarfare  lab  some 110 of them
                   more evidence for the  lab  theory by the day
             competition --- fears china  lab  virus origin theory could
                      created in a wuhan  lab  is precisely why an
                 racism k baker engadine  lab  security is an issue
          of sensitive biological agents  lab  leaks happen: former spy
                     was also working on  lab  ventures he and his
                  link to china military  lab  china's people's liberation army
   coronavirus american official revives  lab  theory; 4/1 i was
                 reports about the wuhan  lab  … they were public
               was an infectious disease  lab  in wuhan the epicentre
                    pandemic to a secret  lab  in wuhan in his
             us operatives revealing the  lab  as the pandemic's origin
                      be done of csiro's  lab  the cyber security of
                    escaped from a lo

As expected, the instances of lab again reference the high-security Wuhan lab in a variety of forms (military lab, Chinese lab, Wuhan lab, BSL-4 lab), many of which refer to the potential creation of the virus in the lab and its escape from it. Most of these point to the lab-leak theory, which appears to be the most frequently mentioned theory of the lot.
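The `make_kwic` and `print_kwic` helpers are defined in functions.ipynb and are not reproduced here. For readers unfamiliar with the technique, the sketch below shows one way a KWIC listing could be built, assuming a window of four tokens on each side (which matches the outputs above). The function names and the sample sentence are my own.

```python
def kwic_lines(keyword, tokens, window=4):
    """Return (left, keyword, right) context tuples for each occurrence of keyword."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

def format_kwic(hits, width=40):
    """Right-align the left context so the keywords line up in one column."""
    return ['{:>{w}}  {}  {}'.format(left, kw, right, w=width)
            for left, kw, right in hits]

sample = 'the virus may have leaked from a wuhan lab in late 2019'.split()
hits = kwic_lines('lab', sample)
for line in format_kwic(hits):
    print(line)
```

Note that when the KWIC is run over a single composite token list, a context window can cross the boundary between two articles, so the `---` separators seen in some lines mark where one text ends and the next begins.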

In [21]:
print_kwic(make_kwic('military',dt_comp_toks))

     refrain from purchasing unnecessary  military  equipment and upgrading our
                       has to adjust its  military  capabilities and strategy in
                  in response to china's  military  build-up under xi jinping's
         announcement of modernising our  military  capability is for china
                  media and also through  military  medical channels and high-level
                     the covid files top  military  scientist has ties with
                    the pla's academy of  military  medical sciences dr tu
                     at the institute of  military  veterinary medicine of the
                chinese pla's academy of  military  medical sciences tu changchun
                     with the academy of  military  medical sciences often involves
                  that come from animals  military  research the academy of
                 research the academy of  military  sciences is the highest-level
                       sure there are no  mil

In [22]:
print_kwic(make_kwic('conspiracy',dt_comp_toks))

          that amplified covid-19 origin  conspiracy  theories and those that
                 as a donald trump-style  conspiracy  theory when the daily
                    to the pandemic with  conspiracy  theories and fake miracle
                          the story as a  conspiracy  theory and -citing condemnations
                   high and mighty about  conspiracy  theories recall that the
               cash pushing the greatest  conspiracy  theory of all still
               and russia have -targeted  conspiracy  narratives to shift the
                    to the pandemic with  conspiracy  theories and fake miracle
                media posts and -bizarre  conspiracy  theories as a friend
                     again for pushing a  conspiracy  -theory that covid-19 came
            obsession of popular youtube  conspiracy  theorist cum political commentator
                      with a giant media  conspiracy  to shut him up
                    zhao even pushed the  conspiracy  

Although the word conspiracy is less frequent than some of the other terms, its usage is notably varied. There are the expected mentions of COVID-19 origin conspiracy theories, but there are also mentions of a "Donald Trump-style conspiracy theory", a phrase not encountered so far.

In [24]:
print_kwic(make_kwic('origins',dt_comp_toks))

                     an inquiry into the  origins  of covid-19 in return
                   continue to probe the  origins  of the virus a
               bottom of the -pandemic's  origins  --- cordner says players
            independent inquiry into the  origins  and the handling of
       investigation into the pandemic's  origins  is thorough and conclusive
            independent inquiry into the  origins  of covid-19 chinese newspaper
                 inquiry into the virus'  origins  is needed labor remains
          reading sharri markson's virus  origins  story where it was
            independent inquiry into the  origins  of the coronavirus crisis
                  deflect blame over the  origins  of the virus wuhan
            independent inquiry into the  origins  of the coronavirus it
         thorough investigation into the  origins  of the covid-19 pandemic
               an investigation into the  origins  of covid-19 the cancellation
                     an inquiry in

              spread from its quarantine  origins  neal and the inquiry
                 struggling to trace the  origins  in australia companies including
                      for 50 years their  origins  and evolution prove to
          australian government into the  origins  of the coronavirus which
                 probe into the pandemic  origins  us intelligence officials are
                    traced back to their  origins  to identify others unknowingly
                 an investigation of the  origins  of the covid-19 pandemic
           team's investigation into the  origins  of covid-19 pandemic the
                 hurried report into the  origins  of the coronavirus was
              the theory that covid-19's  origins  lie behind the gates
                         it comes to the  origins  of covid-19 the question
                 an investigation of the  origins  of the wuhan virus
              for investigation into the  origins  of covid-19 and that
               

Nothing too surprising here: there are many mentions of inquiries, investigations, research, and probes into the origins. Some interesting ones include "deflect blame" and a seemingly heavy emphasis on the international nature of the investigation, which is more explicit than in the China Daily and the NY Times.

In [25]:
print_kwic(make_kwic('transparency',dt_comp_toks))

                left-wing has shown some  transparency  on this subject if
            its assault on international  transparency  as it sought to
                 marise payne inset said  transparency  - particularly from the
                    and death rates lack  transparency  ■ china not counting
                    a shake-up to ensure  transparency  the 16 clubs should
                         that we get the  transparency  that the world gets
                      the world gets the  transparency  it needs mr pompeo
                   are simply asking for  transparency  and co-operation the australian
                   support the calls for  transparency  and accountability from australia
             an assault on international  transparency  and to the endangerment
                            - we get the  transparency  that the world gets
                      the world gets the  transparency  it needs mr pompeo
                   about china's lack of  transparency  ove

Here we see a somewhat confusing mixture regarding transparency. Some instances refer to China's lack of transparency, while others quote people who praise China's transparency.

In [27]:
print_kwic(make_kwic('theory',dt_comp_toks))

                    evidence for the lab  theory  by the day it's
                  china lab virus origin  theory  could be wmd 20
               institute of virology the  theory  is being investigated by
                   undue emphasis on the  theory  the virus originated in
               share which supported the  theory  the coronavirus was an
                   undue emphasis on the  theory  the virus originated in
               share which supported the  theory  the coronavirus was an
                 evidence to support the  theory  that the origin was
               unknown animal source the  theory  that the virus may
                    the who's report the  theory  that covid-19's origins lie
         a donald trump-style conspiracy  theory  when the daily telegraph
            experts have discredited the  theory  the world health organisation
               said referring to another  theory  the virus was transmitted
              agreed with donald trump's  theory  t

Observations:
Interestingly, unlike in the China Daily and the NY Times, where theory was largely preceded by "conspiracy", that is not always the case here. Instead, the specific theories are spelled out after the word, although the "Donald Trump-style conspiracy" notion appears once more.
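The impression that "theory" is not always preceded by "conspiracy" can be checked by counting the token immediately to the left of each occurrence. The following is a minimal sketch on a made-up token list; `left_neighbors` is a hypothetical helper, and on the real data it would be called with `dt_comp_toks`.

```python
from collections import Counter

def left_neighbors(keyword, tokens):
    """Count the token immediately preceding each occurrence of keyword."""
    return Counter(tokens[i - 1] for i, tok in enumerate(tokens)
                   if tok == keyword and i > 0)

sample = ('a conspiracy theory about the lab theory and '
          'another conspiracy theory').split()
neighbors = left_neighbors('theory', sample)
print(neighbors.most_common())
```

Comparing the share of "conspiracy" among the left neighbors across the three newspapers would put a number on the difference described above.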

#### Additional code for me

In [48]:
# select texts that contain at least one of the origin terms
# (.get avoids a KeyError for articles that were never tokenized)
origin_txt_dt=[]
for word in origin_terms:
    for article in dt_corp:
        if article in origin_txt_dt:
            continue
        elif word in article.get('tokens', []):
            origin_txt_dt.append(article)

In [49]:
# number of texts narrowed down
len(origin_txt_dt)

195

In [51]:
# percentage of all dt_texts
round(len(origin_txt_dt)/len(dt_corp)*100,2)

46.99
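The same selection and percentage can also be expressed with a set intersection, which tests all terms against an article in one step. This is a sketch on toy data, with a hypothetical helper name; unlike the nested loop above, it keeps the articles in their original corpus order rather than grouping them by the first matching term.

```python
def select_with_terms(articles, terms):
    """Keep articles whose token list contains at least one of the terms."""
    term_set = set(terms)
    return [a for a in articles if term_set.intersection(a.get('tokens', []))]

sample_articles = [
    {'Filename': 'a.txt', 'tokens': ['wuhan', 'lab', 'leak']},
    {'Filename': 'b.txt', 'tokens': ['nrl', 'grand', 'final']},
    {'Filename': 'c.txt'},  # no tokens, e.g. an empty text file
]

selected = select_with_terms(sample_articles, ['lab', 'market', 'wuhan'])
share = round(len(selected) / len(sample_articles) * 100, 2)
print(len(selected), share)
```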

In [52]:
# write out as doc
terms_only_txt=[]
for article in origin_txt_dt:
    terms_only_txt.append(article['text'])
    
terms_only_doc = '\n---\n'.join(terms_only_txt)
with open('../data/text/daily_telegraph/dt_terms_only_texts.txt','w') as out:
    out.write(terms_only_doc)