# Data Analysis - The New York Times

In this data analysis notebook, I will first conduct a general analysis of all publications collected for the New York Times, cleaning the data further where necessary. Specifically, I will generate frequency lists of commonly used words and phrases, followed by a KWIC analysis to investigate the contexts surrounding a small set of words of interest. Following the KWIC, I will conduct a collocate analysis and a sentiment analysis.

- general analysis
- frequency lists
- KWIC analysis
- collocate analysis
- sentiment analysis

In [1]:
import os
import json
import random
import shutil

import datetime as dt
import calendar

from collections import Counter

import pandas as pd  # used for the date plot below (may also be imported by functions.ipynb)

In [2]:
## run the functions notebook here
%run functions.ipynb

In [3]:
## open the corpus_index file that only contains publications from the New York Times
with open('../data/text/nyt/nyt_corpus_index.json') as f:
    nyt_corp = json.load(f)

In [4]:
len(nyt_corp)

549

This confirms that there are 549 publications from the New York Times in my corpus. The first general analysis looks at the date of publication to see whether there are any trends in when output was greater.

In [5]:
## extract year-month here and append to list dates
dates=[]
for article in nyt_corp:
    date=article['Date'][:7]
    dates.append(date)

In [6]:
## counter and 5 most common times of publication
dates_dist= Counter(dates)
dates_dist.most_common(5)

[('2020-05', 84),
 ('2020-04', 60),
 ('2020-06', 54),
 ('2021-03', 44),
 ('2020-12', 41)]

As with the China Daily, the most publications were released in May 2020. Unlike the China Daily, however, the next most common month was April rather than November. To visualize the data, I will plot the distribution as a bar graph.

In [7]:
date_df = pd.DataFrame.from_records(list(dates_dist.items()), columns=['date','article'])
date_df.sort_values('date').plot.bar(x='date',y='article', figsize=(12,6), color='salmon',title='NY Times', ylabel='# Articles')

<AxesSubplot:title={'center':'NY Times'}, xlabel='date', ylabel='# Articles'>

The distribution is slightly less bimodal than the China Daily's, but we do see the same surge around May 2020.

In [8]:
characters_to_remove = '!,.()[]|"'

In [9]:
## extract just the text and add to dictionary
## note: this step may get stuck on a rerun if a file name contains special characters
### if this occurs, rename the files with special characters accordingly
for article in nyt_corp:
    filename = article['Filename']
    # use a context manager so each file is closed after reading
    with open('../data/text/nyt/{}'.format(filename)) as f:
        article['text'] = f.read()

In [10]:
## tokenize the text and add 'tokens', 'token_cnt', and 'type_cnt' to dictionary 
for article in nyt_corp:
    if article.get('text'):
        nyt_tokens = tokenize(article['text'], lowercase=True, strip_chars=characters_to_remove)
        article['tokens'] = nyt_tokens 
        article['token_cnt'] = len(nyt_tokens)
        article['type_cnt'] = len(set(nyt_tokens))
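
The `tokenize` helper comes from `functions.ipynb` and is not shown in this notebook. As a rough sketch only, assuming it splits on whitespace, optionally lowercases, and deletes every character listed in `strip_chars`, a stand-in could look like the following. The name `tokenize_sketch` is hypothetical and the real helper may behave differently.

```python
def tokenize_sketch(text, lowercase=True, strip_chars=''):
    """Hypothetical stand-in for tokenize(); the real helper may differ."""
    if lowercase:
        text = text.lower()
    # Assumption: listed characters are deleted wherever they occur in a token
    table = str.maketrans('', '', strip_chars)
    tokens = [tok.translate(table) for tok in text.split()]
    # Drop tokens that are empty once the characters are removed
    return [tok for tok in tokens if tok]


print(tokenize_sketch('Mr. Trump said, "Hello!"', strip_chars='!,.()[]|"'))
```

Under these assumptions, characters that are not in `characters_to_remove` (colons, curly quotes, the long dash) stay attached to tokens, which is consistent with tokens such as `covid-19:` in the KWIC output further down.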

In [11]:
## counters for tokens of individual words, bigrams, and trigrams
nyt_token_dist=Counter()
nyt_text_dist= Counter()
nyt_bigram_dist=Counter()
nyt_trigram_dist=Counter()

for article in nyt_corp:
    if article.get('tokens'):
        nyt_tokens = article['tokens']
        nyt_token_dist.update(nyt_tokens)
        nyt_text_dist.update(set(nyt_tokens))
        
        nyt_bigrams=get_ngram_tokens(nyt_tokens,2)
        nyt_trigrams=get_ngram_tokens(nyt_tokens,3)
        
        nyt_bigram_dist.update(nyt_bigrams)
        nyt_trigram_dist.update(nyt_trigrams) 

In [12]:
## top 50 most common tokens
nyt_token_dist.most_common(50)

[('the', 68965),
 ('of', 32875),
 ('to', 30913),
 ('and', 28526),
 ('a', 27806),
 ('in', 25360),
 ('that', 14833),
 ('for', 10957),
 ('on', 9618),
 ('is', 8524),
 ('with', 7348),
 ('as', 7214),
 ('was', 7113),
 ('it', 6963),
 ('said', 6763),
 ('have', 6303),
 ('from', 6092),
 ('at', 6091),
 ('by', 5980),
 ('are', 5522),
 ('he', 5492),
 ('has', 5308),
 ('but', 4788),
 ('an', 4577),
 ('be', 4489),
 ('who', 4244),
 ('not', 4109),
 ('had', 4019),
 ('they', 3950),
 ('new', 3710),
 ('mr', 3572),
 ('more', 3571),
 ('his', 3563),
 ('people', 3557),
 ('this', 3450),
 ('about', 3374),
 ('or', 3294),
 ('their', 3286),
 ('were', 3256),
 ('been', 3197),
 ('—', 2893),
 ('virus', 2857),
 ('i', 2851),
 ('its', 2676),
 ('coronavirus', 2671),
 ('one', 2622),
 ('she', 2508),
 ('than', 2492),
 ('health', 2398),
 ('we', 2394)]

One interesting thing to note is that the word "global" was used frequently in the China Daily, but it does not appear among the 50 most common tokens here in the NY Times. It might be that the NY Times is more focused on reporting within the United States and individual states, and less on the global perspective that the China Daily tends to reference.
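
Rather than scanning the top-50 list by eye, a word's count and rank can be looked up directly in the counter. Below is a minimal sketch on a toy counter; the helper name `rank_of` is hypothetical, and in this notebook it would be called on `nyt_token_dist`.

```python
from collections import Counter


def rank_of(word, dist):
    """Return (rank, count) for word in a Counter; rank 1 is the most frequent.

    Returns (None, 0) if the word never occurs.
    """
    for rank, (tok, cnt) in enumerate(dist.most_common(), start=1):
        if tok == word:
            return rank, cnt
    return None, 0


# Toy data for illustration only, not the NYT corpus
toy_dist = Counter("the virus spread and the global response to the virus".split())
print(rank_of('the', toy_dist))
print(rank_of('global', toy_dist))
print(rank_of('lockdown', toy_dist))
```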

In [13]:
## top 50 most common bigrams
nyt_bigram_dist.most_common(50)

[('of the', 7715),
 ('in the', 5537),
 ('to the', 2942),
 ('for the', 2352),
 ('the virus', 2239),
 ('on the', 1973),
 ('at the', 1964),
 ('and the', 1905),
 ('that the', 1856),
 ('in a', 1780),
 ('the coronavirus', 1554),
 ('to be', 1475),
 ('from the', 1426),
 ('of a', 1415),
 ('by the', 1393),
 ('with the', 1298),
 ('more than', 1287),
 ('new york', 1254),
 ('the united', 1183),
 ('the pandemic', 1176),
 ('as a', 1167),
 ('he said', 1110),
 ('united states', 1043),
 ('the new', 1019),
 ('have been', 983),
 ('it was', 969),
 ('the world', 945),
 ('the first', 883),
 ('has been', 882),
 ('to a', 881),
 ('as the', 859),
 ('one of', 856),
 ('about the', 826),
 ('is a', 782),
 ('for a', 774),
 ('according to', 771),
 ('the country', 751),
 ('with a', 743),
 ('and a', 711),
 ('it is', 684),
 ('had been', 653),
 ('mr trump', 640),
 ('york times', 579),
 ('at a', 562),
 ('said that', 551),
 ('said the', 546),
 ('that it', 539),
 ('but the', 533),
 ('is the', 530),
 ('was a', 528)]

In [14]:
## top 50 most common trigrams
nyt_trigram_dist.most_common(50)

[('the united states', 1001),
 ('the new york', 580),
 ('new york times', 579),
 ('one of the', 515),
 ('of the virus', 440),
 ('in the united', 384),
 ('for the new', 345),
 ('of the coronavirus', 339),
 ('as well as', 281),
 ('the university of', 270),
 ('of the pandemic', 258),
 ('the white house', 235),
 ('the end of', 231),
 ('the number of', 223),
 ('that the virus', 222),
 ('the spread of', 219),
 ('around the world', 218),
 ('at the university', 213),
 ('world health organization', 208),
 ('the world health', 207),
 ('in new york', 204),
 ('the coronavirus pandemic', 201),
 ('some of the', 196),
 ('a lot of', 195),
 ('percent of the', 195),
 ('said in a', 195),
 ('the origins of', 192),
 ('director of the', 191),
 ('according to the', 190),
 ('spread of the', 187),
 ('the trump administration', 167),
 ('the virus and', 167),
 ('origins of the', 164),
 ('according to a', 157),
 ('for the virus', 157),
 ('the first time', 154),
 ('part of the', 150),
 ('for disease control', 146)

The most commonly used trigram is "the united states", which was not seen among the China Daily trigrams. The list also includes "the white house", "world health organization", and "the trump administration". This is very US-centric, which is to be expected. "origins of the" is also included in this most common list, as is the Chinese government.
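
The bigram and trigram counts above come from `get_ngram_tokens`, which is defined in `functions.ipynb` and not shown here. As a minimal sketch, assuming it slides a window of size `n` over the token list and joins each window with a space, a stand-in could look like this (the name `ngram_tokens_sketch` is hypothetical):

```python
def ngram_tokens_sketch(tokens, n):
    """Hypothetical stand-in for get_ngram_tokens(); the real helper may differ."""
    # One n-gram per starting position; empty list if there are fewer than n tokens
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


print(ngram_tokens_sketch(['in', 'the', 'united', 'states'], 3))
```

Because the counting loop calls the helper once per article, n-grams in this notebook never span two different articles.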

It is useful to be able to refer to one document that contains all the text, so I will quickly write out a file that consists of only the text portion of each publication.

In [15]:
## write out one doc titled nyt_doc that consists of solely the texts from each publication
nyt_single_text_list=[]

for article in nyt_corp:
    if article.get('tokens'):
        nyt_single_text_list.append(article['text'])

nyt_doc = '\n---\n'.join(nyt_single_text_list)
with open('../data/text/nyt/nyt_composite_text.txt','w') as out:
    out.write(nyt_doc)

In [16]:
## tokenize the single doc and use counter
nyt_comp_toks = tokenize(nyt_doc, lowercase=True, strip_chars=characters_to_remove)
nyt_comp_toks_dist = Counter(nyt_comp_toks)

For the sake of analysis, I will select a handful of terms that could relate to questions surrounding the origins of COVID-19 and conspiracy theories. These words include laboratory, lab, bioweapon, market, military, cold-chain, conspiracy, army, detrick, transparency, origins, wuhan, and theory. I also added the word imported because of its high prevalence in the token frequency list previously generated.

In [17]:
print("{: <20}{: <6}\t{}".format('term','New York Times','Normalized Freq'))
print("="*62)
origin_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory','imported']
for term in origin_terms:
    print("{: <20}{: <6}\t\t{}".format(term, nyt_comp_toks_dist[term], round((nyt_comp_toks_dist[term]/len(nyt_comp_toks)*10000),3)))

term                New York Times	Normalized Freq
laboratory          114   		0.984
lab                 247   		2.133
bioweapon           10    		0.086
market              418   		3.609
military            228   		1.969
cold-chain          0     		0.0
conspiracy          139   		1.2
army                55    		0.475
detrick             0     		0.0
transparency        39    		0.337
origins             396   		3.419
wuhan               629   		5.431
theory              197   		1.701
imported            35    		0.302


Of the origin terms, "wuhan" is the most frequent, which is to be expected given the numerous contexts in which Wuhan can be used. It is followed by "market" and "origins". Unsurprisingly, the frequency of "laboratory" and "lab" (likely in reference to the lab in China and the lab-leak theory) is noticeably high. But what stands out to me the most at this point are the two terms that aren't mentioned whatsoever: "detrick" (in relation to Fort Detrick in the US) and "cold-chain" (the cold-chain theory).
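
The normalized frequency in the table is the raw count divided by the total number of tokens in the composite document, scaled to occurrences per 10,000 tokens and rounded to three decimals. A minimal sketch of that arithmetic follows; the function name `per_10k` is hypothetical, and the numbers are made up rather than taken from the corpus.

```python
def per_10k(count, total_tokens, ndigits=3):
    """Occurrences per 10,000 tokens, rounded the same way as the table above."""
    return round(count / total_tokens * 10000, ndigits)


# Made-up example: 50 hits in a 200,000-token corpus
print(per_10k(50, 200000))
```

Normalizing this way makes the NY Times figures comparable with those from a corpus of a different size, such as the China Daily one.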

I am now interested in understanding the context surrounding the use of these words. Using a KWIC analysis, I will explore many of the origin terms seen in the chart above.
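
`make_kwic` and `print_kwic` also come from `functions.ipynb`. The outputs below show four tokens of context on each side of the keyword, so a minimal sketch under that assumption might look like the following. Both function names and the `win` and `width` parameters are hypothetical.

```python
def make_kwic_sketch(node, tokens, win=4):
    """Hypothetical stand-in for make_kwic(): (left, node, right) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - win):i]      # up to `win` tokens before the hit
            right = tokens[i + 1:i + 1 + win]     # up to `win` tokens after the hit
            hits.append((' '.join(left), tok, ' '.join(right)))
    return hits


def print_kwic_sketch(hits, width=40):
    """Right-align the left context so the keywords line up in one column."""
    for left, node, right in hits:
        print("{: >{w}}  {}  {}".format(left, node, right, w=width))


toy_tokens = "the virus came from a laboratory accident in wuhan the".split()
print_kwic_sketch(make_kwic_sketch('laboratory', toy_tokens))
```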

In [18]:
print_kwic(make_kwic('laboratory',nyt_comp_toks))

                 virus from a government  laboratory  but mr pottinger continued
                       virus came from a  laboratory  accident in wuhan the
           buddhist retreat part science  laboratory  it’s not outdoor decorating
                theory that a government  laboratory  in wuhan china was
                        an accident at a  laboratory  in wuhan” intelligence agencies
            coronavirus originated “in a  laboratory  either intentionally or by
                   or the jet propulsion  laboratory  but like many children
             accidentally from a chinese  laboratory  as “extremely unlikely” critics
                  emerged from a chinese  laboratory  some members of the
                    have emerged after a  laboratory  accident in china while
         which houses a state-of-the-art  laboratory  known for its research
                   noted that a separate  laboratory  run by the wuhan
                theory that a government  laboratory  in w

In [19]:
print_kwic(make_kwic('lab',nyt_comp_toks))

                     at the fort collins  lab  “i got halfway through
                             a link to a  lab  can be found and
                      theory linked to a  lab  and anthony ruggiero the
                    outbreak came from a  lab  the chinese government has
                     virus leaked from a  lab  while pushing disinformation on
               accidentally from a wuhan  lab  in response to a
                         to a history of  lab  accidents infecting researchers to
             officials to reconsider the  lab  theory the precise nature
          any information supporting the  lab  theory to set the
                repeated emphasis of the  lab  theory as “conclusion shopping”
                 evidence to bolster the  lab  theory according to current
                   getting access to the  lab  itself and the virus
                          of a theory of  lab  origin officials said senior
                   hammer china over the  lab  on wednesday

Both "laboratory" and "lab" are used in the context of the lab-leak theory, which holds that COVID-19 was released from a lab in Wuhan.

In [20]:
print_kwic(make_kwic('bioweapon',nyt_comp_toks))

                      was concocted as a  bioweapon  and they agree that
                      was concocted as a  bioweapon  and they agree that
                        was created as a  bioweapon  by the chinese government
                      was concocted as a  bioweapon  they agree that it
                        it is an escaped  bioweapon  misinformation about the virus
                      was concocted as a  bioweapon  and they agree that
                        it is an escaped  bioweapon  misinformation about the virus
                      was concocted as a  bioweapon  they agree that it
                      was concocted as a  bioweapon  and they agree that
                        was created as a  bioweapon  by the chinese government


A word that was found in the NY Times but not in the China Daily is "bioweapon", which relates to the theory that the virus was created by the Chinese government as a potential bioweapon. Its absence from the China Daily and presence in the NY Times is particularly telling of how a source's perspective frames the information being presented.

In [21]:
print_kwic(make_kwic('military',nyt_comp_toks))

                   to stay home israel’s  military  said that two soldiers
              with cabinet officials and  military  personnel who then meet
                       health care c the  military  d the workplace 2
                     of belarus staged a  military  parade on saturday to
                       and settled for a  military  flyby over moscow’s mostly
                      they served in the  military  in korea and how
                   korean war brides and  military  wives'' dr park said
                  that the united states  military  could have brought the
             with a subscription china’s  military  is tied to new
                   linked to the chinese  military  according to a report
                   traced to the chinese  military  its targets include the
                 that the government and  military  do not engage in
            suggesting that the american  military  created it any american
                    and police reform on  milit

                he is readying belarus's  military  to repel invaders the
                 generals in the belarus  military  have for years had
                 the 1970s when brazil’s  military  dictatorship opened the perimetral
            tuesday protests against the  military  coup that deposed daw
                        with ties to the  military  as he denounced mr
                  biden with voters from  military  and veteran households in
          challenges facing veterans and  military  families including mental health
                   best to veterans “our  military  is the greatest fighting
                 his transition from the  military  to intelligence work in
                         was ousted in a  military  coup two weeks ago
                  weeks during which the  military  stripped away civil liberties
                   enemy for the israeli  military  in israel brig gen
                 him from taking further  military  action against iran without
   

In [22]:
print_kwic(make_kwic('conspiracy',nyt_comp_toks))

                     publicly spun a new  conspiracy  about the origins of
                        this was an evil  conspiracy  of the enemy” mr
         chinese officials openly spread  conspiracy  theories of their own
                   only further fuel the  conspiracy  theories” he said as
                        a history of the  conspiracy  theory and concluding that
                believing in ugly racist  conspiracy  theories my disappointment was
  dysfunction widespread complacency and  conspiracy  theories on wednesday the
              the kagame government this  conspiracy  theory has become known
               so many stupid ridiculous  conspiracy  theories about black people
                       the form of those  conspiracy  theories and low-information rumors
        depravity by prominent alt-right  conspiracy  theorists like jack posobiec
          american websites that promote  conspiracy  theories one such story
               duran to popular american  cons

A few observations stand out here. Most clearly, "conspiracy" is usually followed by "theory" or "theories". What precedes it, however, is quite varied. We see instances of anti-American conspiracy falsehoods and conspiracy claims regarding the virus's origins, spread by "Chinese officials". We also see that many of these instances include terms relating to politics, something that was not seen whatsoever in the China Daily: mentions of democratic, right-wing, alt-right, pro-Trump, and so on tie these conspiracies to the political realm.

In [24]:
print_kwic(make_kwic('origins',nyt_comp_toks))

                    don’t know about the  origins  of the ongoing pandemic
                 listener loses track of  origins  and with it any
              to investigating the virus  origins  from the outset” tarik
                new conspiracy about the  origins  of covid-19: that it
             owner acknowledged that its  origins  were ''based on a
               beard-winning chef on the  origins  of soul food “while
              failure to investigate the  origins  of the coronavirus which
                 asian americans and the  origins  of the model minority''
               give falsehoods about the  origins  of the virus the
                         to pin down the  origins  of a virus that
              schools as “prisons” whose  origins  lay in capitalists’ desire
                    episode is about the  origins  of the coronavirus outbreak
           organization inquiry into the  origins  of the virus arguing
           organization inquiry into the  origins  of the

                     more aware of their  origins  what role you play
                    — mostly people with  origins  in the former colonies
                    up in exchange?” the  origins  of six feet the
                 global inquiry into the  origins  of the coronavirus pandemic
              fueled suspicion about the  origins  of the virus an
             about the new coronavirus's  origins  ''dr yan is one
      independent investigation into the  origins  of covid-19 a ban
             providing insights into the  origins  and spread of the
                to trace the infection's  origins  random testing in schools
             in other developments: •the  origins  of the coronavirus remain
                 about new mutations the  origins  of the coronavirus remain
            future goldstein unpicks the  origins  of bitcoin a new
         prompting speculation about its  origins  cdc scientists were asked
                 arendt’s 1951 book “the  origins  of totalita

A majority of these refer to tracing, identifying, or investigating, all of which are actions aimed at pinpointing the origins of the virus.

In [25]:
print_kwic(make_kwic('transparency',nyt_comp_toks))

                       a sign of china’s  transparency  but several hours after
              played out with remarkable  transparency  as part of his
                  as evidence of china’s  transparency  mr xu said that
                        i tend to prefer  transparency  over artificially low expectations
                 has lobbied for greater  transparency  from nursing homes “if
                widely varying levels of  transparency  when it comes to
       praising the chinese government’s  transparency  but pushing for more
                    there is very little  transparency  the reopening the last
                   will be revealed when  transparency  increases” lease signings are
                  can be quickly exposed  transparency  an official in taiwan
                        and you have the  transparency  that scientists are desperately
                       its first test of  transparency  and ethics 3 a
                        as a defender of  transparency 

With "transparency", many of the instances cite China's lack of transparency or demand greater transparency from the Chinese government, and are ultimately very critical in nature. This is quite different from the almost laudatory tone of the China Daily.

In [27]:
print_kwic(make_kwic('theory',nyt_comp_toks))

                  that might bolster his  theory  they didn’t have any
                     doubled down on the  theory  “more and more evidence
                         seem to share a  theory  of freedom reminiscent of
               history of the conspiracy  theory  and concluding that despite
                 prove an italian origin  theory  for pot au feu
           to support an unsubstantiated  theory  that a government laboratory
                might support any origin  theory  linked to a lab
                       to get behind any  theory  of the outbreak’s origin
                      to support any one  theory  with high confidence at
          in intelligence supporting the  theory  the virus emerged accidentally
                   to reconsider the lab  theory  the precise nature of
          information supporting the lab  theory  to set the stage
                     emphasis of the lab  theory  as “conclusion shopping” a
                      to bolster the lab  the

Finally, with "theory", we unsurprisingly see mentions of conspiracy theories, but the instances also specifically name the lab-leak theory, and there are two mentions of the frozen-food theory.

The following collocation analysis will provide further support for some of the patterns seen with the KWIC.
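
The `collocates` helper is defined in `functions.ipynb` and not shown here. Judging from the outputs below, `win=[4,0]` returns one flat list with the four tokens to the left of each hit and none to the right. A minimal sketch under that assumption follows; the name `collocates_sketch` is hypothetical, and the default window of the real helper is unknown, so the default used here is only a placeholder.

```python
from collections import Counter


def collocates_sketch(tokens, node, win=(4, 4)):
    """Hypothetical stand-in for collocates(): flat list of tokens near each hit.

    win = (tokens to the left, tokens to the right); the default is an assumption.
    """
    left_n, right_n = win
    found = []
    for i, tok in enumerate(tokens):
        if tok == node:
            found.extend(tokens[max(0, i - left_n):i])
            found.extend(tokens[i + 1:i + 1 + right_n])
    return found


toy_tokens = "it was concocted as a bioweapon and they agree that".split()
left_only = collocates_sketch(toy_tokens, 'bioweapon', win=(4, 0))
print(left_only)
print(Counter(left_only).most_common(2))
```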

In [28]:
origins_colls= Counter()
origins_colls.update(collocates(nyt_comp_toks, 'origins'))

In [29]:
collocates(nyt_comp_toks,'origins',win=[4,0])[:25]

['don’t',
 'know',
 'about',
 'the',
 'listener',
 'loses',
 'track',
 'of',
 'to',
 'investigating',
 'the',
 'virus',
 'new',
 'conspiracy',
 'about',
 'the',
 'owner',
 'acknowledged',
 'that',
 'its',
 'beard-winning',
 'chef',
 'on',
 'the',
 'failure']

In [30]:
for coll in ['conspiracy', 'theories', 'theory', 'investigating', 'insulting','virological','acknowledged','asian','falsehoods']:
    print("{: >20}{: >10}{: >10}".format(coll, origins_colls.get(coll), nyt_comp_toks_dist.get(coll)))

          conspiracy         2       139
            theories         1       143
              theory         3       197
       investigating         4        51
           insulting         2         8
         virological         2         6
        acknowledged         1        88
               asian         2       105
          falsehoods         2        27


In [31]:
bweapon_colls= Counter()
bweapon_colls.update(collocates(nyt_comp_toks, 'bioweapon'))

In [32]:
collocates(nyt_comp_toks,'bioweapon',win=[4,0])[:100]

['was',
 'concocted',
 'as',
 'a',
 'was',
 'concocted',
 'as',
 'a',
 'was',
 'created',
 'as',
 'a',
 'was',
 'concocted',
 'as',
 'a',
 'it',
 'is',
 'an',
 'escaped',
 'was',
 'concocted',
 'as',
 'a',
 'it',
 'is',
 'an',
 'escaped',
 'was',
 'concocted',
 'as',
 'a',
 'was',
 'concocted',
 'as',
 'a',
 'was',
 'created',
 'as',
 'a']

In [33]:
for coll in ['concocted', 'created', 'escaped']:
    print("{: >20}{: >10}{: >10}".format(coll, bweapon_colls.get(coll), nyt_comp_toks_dist.get(coll)))

           concocted         6         6
             created         2       189
             escaped         2        30


In [34]:
military_colls= Counter()
military_colls.update(collocates(nyt_comp_toks, 'military'))

In [35]:
collocates(nyt_comp_toks,'military',win=[4,0])[:25]

['to',
 'stay',
 'home',
 'israel’s',
 'with',
 'cabinet',
 'officials',
 'and',
 'health',
 'care',
 'c',
 'the',
 'of',
 'belarus',
 'staged',
 'a',
 'and',
 'settled',
 'for',
 'a',
 'they',
 'served',
 'in',
 'the',
 'korean']

In [36]:
for coll in ['government','chinese','war','traced','staged','settled','linked']:
    print("{: >20}{: >10}{: >10}".format(coll, military_colls.get(coll), nyt_comp_toks_dist.get(coll)))

          government         1      1295
             chinese         4      1162
                 war         2       246
              traced         1        43
              staged         1         8
             settled         1        32
              linked         1       125


In [37]:
lab_colls= Counter()
lab_colls.update(collocates(nyt_comp_toks, 'lab'))

In [38]:
collocates(nyt_comp_toks,'lab',win=[4,0])[:25]

['at',
 'the',
 'fort',
 'collins',
 'a',
 'link',
 'to',
 'a',
 'theory',
 'linked',
 'to',
 'a',
 'outbreak',
 'came',
 'from',
 'a',
 'virus',
 'leaked',
 'from',
 'a',
 'accidentally',
 'from',
 'a',
 'wuhan',
 'to']

In [39]:
for coll in ['out','found','came','from','originated','escape','wuhan','accidentally','made','theory','link','linked','leaked']:
    print("{: >20}{: >10}{: >10}".format(coll, lab_colls.get(coll), nyt_comp_toks_dist.get(coll)))

                 out         1      1734
               found         9       662
                came         6       494
                from        27      6092
          originated        10       115
              escape         2        47
               wuhan        31       629
        accidentally         8        19
                made         7       836
              theory        26       197
                link         6        76
              linked         3       125
              leaked         5        22


In [40]:
conspiracy_colls= Counter()
conspiracy_colls.update(collocates(nyt_comp_toks, 'conspiracy'))

In [41]:
collocates(nyt_comp_toks,'conspiracy',win=[4,0])[:25]

['publicly',
 'spun',
 'a',
 'new',
 'this',
 'was',
 'an',
 'evil',
 'chinese',
 'officials',
 'openly',
 'spread',
 'only',
 'further',
 'fuel',
 'the',
 'a',
 'history',
 'of',
 'the',
 'believing',
 'in',
 'ugly',
 'racist',
 'dysfunction']

In [42]:
for coll in ['publicly','spun','officials','government','stupid','ridiculous','depravity','alt-right','promote','websites','spread','evil','ugly','racist','dysfunction']:
    print("{: >20}{: >10}{: >10}".format(coll, conspiracy_colls.get(coll), nyt_comp_toks_dist.get(coll)))

            publicly         2       144
                spun         2         7
           officials         2      1537
          government         2      1295
              stupid         3         5
          ridiculous         3        12
           depravity         2         2
           alt-right         2         8
             promote         4        32
            websites         2        46
              spread         9       824
                evil         1         8
                ugly         2        16
              racist         2        69
         dysfunction         1         4


In [43]:
theory_colls= Counter()
theory_colls.update(collocates(nyt_comp_toks, 'theory'))

In [44]:
collocates(nyt_comp_toks,'theory',win=[4,0])[:25]

['that',
 'might',
 'bolster',
 'his',
 'doubled',
 'down',
 'on',
 'the',
 'seem',
 'to',
 'share',
 'a',
 'history',
 'of',
 'the',
 'conspiracy',
 'prove',
 'an',
 'italian',
 'origin',
 'to',
 'support',
 'an',
 'unsubstantiated',
 'might']

In [45]:
for coll in ['bolster','prove','share','unsubstantiated','origin','support','history','intelligence','conspiracy']:
    print("{: >20}{: >10}{: >10}".format(coll, theory_colls.get(coll), nyt_comp_toks_dist.get(coll)))

             bolster         4        16
               prove         2        58
               share         2       161
     unsubstantiated        14        32
              origin         9       342
             support        21       294
             history         2       371
        intelligence         2       344
          conspiracy        30       139


In [46]:
transparency_colls= Counter()
transparency_colls.update(collocates(nyt_comp_toks, 'transparency'))

In [47]:
collocates(nyt_comp_toks,'transparency',win=[4,0])[:25]

['a',
 'sign',
 'of',
 'china’s',
 'played',
 'out',
 'with',
 'remarkable',
 'as',
 'evidence',
 'of',
 'china’s',
 'i',
 'tend',
 'to',
 'prefer',
 'has',
 'lobbied',
 'for',
 'greater',
 'widely',
 'varying',
 'levels',
 'of',
 'praising']

In [48]:
for coll in ["china's",'government','evidence','revealed','little','exposed','praising','greater']:
    print("{: >20}{: >10}{: >10}".format(coll, transparency_colls.get(coll), nyt_comp_toks_dist.get(coll)))

             china's         3       143
          government         1      1295
            evidence         2       403
            revealed         1        66
              little         1       387
             exposed         1        88
            praising         2         8
             greater         1       126


I am finding issues here similar to those in the China Daily collocate analysis, so I'm less inclined to use it at this point in time.
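
One rough way to read the two columns of the tables above together is the share of a word's total occurrences that fall inside the window around the node word. This is an illustrative sketch only, not something computed elsewhere in this notebook, and the helper name `window_share` is hypothetical. The counts are copied from the `bioweapon` table above.

```python
def window_share(coll_count, corpus_count):
    """Fraction of a word's corpus occurrences that fall in the node word's window."""
    if not corpus_count:
        return 0.0
    return coll_count / corpus_count


# (collocate count, corpus count) pairs from the bioweapon table
bioweapon_rows = {'concocted': (6, 6), 'created': (2, 189), 'escaped': (2, 30)}
for word, (coll, total) in bioweapon_rows.items():
    print("{: >12}{: >8.3f}".format(word, window_share(coll, total)))
```

By this measure every occurrence of "concocted" in the corpus sits next to "bioweapon", while "created" mostly appears elsewhere. With overlapping windows a token can be counted more than once, so the share should be treated as a rough indicator rather than a proper association score.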

#### Additional Code

In [51]:
# select texts that contain at least one origin term
# .get() avoids a KeyError for any article that was never tokenized
origin_txt_nyt=[]
for word in origin_terms:
    for article in nyt_corp:
        if article in origin_txt_nyt:
            continue
        elif word in article.get('tokens', []):
            origin_txt_nyt.append(article)

In [52]:
# number of narrowed down texts
len(origin_txt_nyt)

439

In [54]:
# percentage of all nyt texts
round(len(origin_txt_nyt)/len(nyt_corp)*100,2)

79.96

In [55]:
# write out as doc
terms_only_txt=[]
for article in origin_txt_nyt:
    terms_only_txt.append(article['text'])
    
terms_only_doc = '\n---\n'.join(terms_only_txt)
with open('../data/text/nyt/nyt_terms_only_texts.txt','w') as out:
    out.write(terms_only_doc)