# Data Analysis - The Hindustan Times

In this data analysis notebook, I first conduct a general analysis of all publications collected from the Hindustan Times, cleaning the data further where necessary. I then generate frequency lists of commonly used words and phrases, followed by a KWIC (keyword in context) analysis to investigate the contexts surrounding a selection of words of interest.

- general analysis
- frequency lists
- KWIC analysis


In [1]:
import os
import json
import random
import shutil

import datetime as dt
import calendar

from collections import Counter

import pandas as pd  # needed for the bar chart below; may already be imported by functions.ipynb

In [2]:
## run the functions notebook here
%run functions.ipynb

In [3]:
## open the corpus index file, which contains only publications from the Hindustan Times
with open('../data/text/hindustan_times/ht_corpus_index.json') as index_file:
    ht_corp = json.load(index_file)

In [4]:
len(ht_corp)

655

There are 655 publications from the Hindustan Times in total. The first general analysis looks at the date of publication to see whether output was greater in particular periods.

In [5]:
## extract year-month and append to list dates
dates=[]

for article in ht_corp:
    date=article['Date'][:7]
    dates.append(date)

In [6]:
## count articles per year-month and show the 5 most common months
dates_dist = Counter(dates)
dates_dist.most_common(5)

[('2020-05', 145),
 ('2020-03', 112),
 ('2020-04', 100),
 ('2020-06', 58),
 ('2020-12', 34)]

As with the other four sources, the largest number of publications appeared in May 2020, followed here by March and April. To visualize the data, I will plot the distribution as a bar graph.

In [7]:
date_df = pd.DataFrame.from_records(list(dates_dist.items()), columns=['date','article'])
date_df.sort_values('date').plot.bar(x='date',y='article', figsize=(12,6), color='salmon',title='Hindustan Times',ylabel='# Articles')

<AxesSubplot:title={'center':'Hindustan Times'}, xlabel='date', ylabel='# Articles'>
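
A `Counter` only holds months that have at least one article, so a bar chart built from it skips empty months and the x-axis is not continuous. The following is a minimal sketch of one way to fill those gaps with zeros before plotting. It assumes the keys are `'YYYY-MM'` strings, as produced by `article['Date'][:7]`; `month_range` and the toy counts are hypothetical and not part of the notebook.

```python
from collections import Counter

def month_range(first, last):
    """Return every 'YYYY-MM' string from first to last, inclusive."""
    year, month = (int(part) for part in first.split('-'))
    end_year, end_month = (int(part) for part in last.split('-'))
    months = []
    while (year, month) <= (end_year, end_month):
        months.append('{:04d}-{:02d}'.format(year, month))
        # roll over to January of the next year after December
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

# toy counts (not the real corpus): 2020-02 is missing on purpose
toy_dist = Counter({'2019-12': 2, '2020-01': 5, '2020-03': 7})
all_months = month_range(min(toy_dist), max(toy_dist))
# a Counter returns 0 for keys it has never seen
filled = [(m, toy_dist[m]) for m in all_months]
print(filled)
```

The resulting list of pairs could be passed to `pd.DataFrame.from_records` in the same way as `dates_dist.items()`.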

In [8]:
characters_to_remove = '!,.()[]"'

In [9]:
## read the text of each Hindustan Times publication and add it to the article's dictionary
for article in ht_corp:
    filename = article['Filename']
    with open('../data/text/hindustan_times/{}'.format(filename)) as text_file:
        article['text'] = text_file.read()

In [10]:
## tokenize the text and add 'tokens', 'token_cnt', and 'type_cnt' to dictionary 
for article in ht_corp:
    if article.get('text'):
        ht_tokens = tokenize(article['text'], lowercase=True, strip_chars=characters_to_remove)
        article['tokens'] = ht_tokens 
        article['token_cnt'] = len(ht_tokens)
        article['type_cnt'] = len(set(ht_tokens))
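
The `tokenize` helper comes from `functions.ipynb` and is not reproduced in this notebook. As a rough illustration only, a tokenizer with the same call signature might look like the sketch below. It is an assumption, not the actual implementation: `simple_tokenize` is a hypothetical name, and splitting on whitespace after deleting the listed characters is a guess. That guess is at least consistent with tokens such as `contentservices@htlivecom` in the n-gram lists further down, where the period inside the address has been removed.

```python
def simple_tokenize(text, lowercase=True, strip_chars=''):
    """Hypothetical stand-in for tokenize(): delete strip_chars, then split on whitespace."""
    if lowercase:
        text = text.lower()
    # the third argument of str.maketrans lists characters to delete
    text = text.translate(str.maketrans('', '', strip_chars))
    return text.split()

# invented sample sentence, not taken from the corpus
sample = 'Please contact Editor at someone@example.com (today)!'
tokens = simple_tokenize(sample, strip_chars='!,.()[]"')
print(tokens)
print(len(tokens), len(set(tokens)))  # token count and type count
```

A side effect of deleting characters everywhere, rather than only at token edges, is that periods inside addresses and decimal numbers disappear too.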

In [11]:
## counters for tokens of individual words, bigrams, and trigrams
ht_token_dist=Counter()
ht_text_dist= Counter()
ht_bigram_dist=Counter()
ht_trigram_dist=Counter()

for article in ht_corp:
    if article.get('tokens'):
        ht_tokens = article['tokens']
        ht_token_dist.update(ht_tokens)
        ht_text_dist.update(set(ht_tokens))
        
        ht_bigrams=get_ngram_tokens(ht_tokens,2)
        ht_trigrams=get_ngram_tokens(ht_tokens,3)
        
        ht_bigram_dist.update(ht_bigrams)
        ht_trigram_dist.update(ht_trigrams) 

In [12]:
ht_token_dist.most_common(50)

[('the', 24895),
 ('to', 11042),
 ('of', 11005),
 ('and', 9607),
 ('in', 8419),
 ('a', 6554),
 ('for', 4096),
 ('with', 3831),
 ('on', 3661),
 ('is', 3603),
 ('that', 3599),
 ('from', 3046),
 ('by', 2665),
 ('said', 2502),
 ('at', 2393),
 ('has', 2350),
 ('have', 2267),
 ('as', 2200),
 ('are', 2033),
 ('it', 2016),
 ('india', 1995),
 ('this', 1968),
 ('be', 1895),
 ('was', 1823),
 ('any', 1644),
 ('covid-19', 1608),
 ('or', 1577),
 ('will', 1493),
 ('content', 1339),
 ('other', 1330),
 ('who', 1323),
 ('an', 1302),
 ('been', 1299),
 ('not', 1278),
 ('people', 1250),
 ('we', 1219),
 ('also', 1162),
 ('he', 1087),
 ('china', 1049),
 ('more', 1018),
 ('health', 996),
 ('were', 961),
 ('its', 959),
 ('which', 943),
 ('their', 933),
 ('virus', 874),
 ('had', 866),
 ('contact', 859),
 ('times', 854),
 ('coronavirus', 852)]

In [13]:
ht_bigram_dist.most_common(50)

[('of the', 3021),
 ('in the', 2008),
 ('to the', 1174),
 ('hindustan times', 732),
 ('on the', 726),
 ('and the', 705),
 ('from hindustan', 684),
 ('to this', 684),
 ('for any', 679),
 ('any other', 675),
 ('with respect', 669),
 ('or any', 668),
 ('by the', 667),
 ('respect to', 667),
 ('published by', 665),
 ('by ht', 663),
 ('services with', 657),
 ('this article', 656),
 ('ht digital', 655),
 ('digital content', 655),
 ('content services', 655),
 ('with permission', 655),
 ('permission from', 655),
 ('times for', 655),
 ('any query', 655),
 ('query with', 655),
 ('article or', 655),
 ('other content', 655),
 ('content requirement', 655),
 ('requirement please', 655),
 ('please contact', 655),
 ('contact editor', 655),
 ('editor at', 655),
 ('at contentservices@htlivecom', 655),
 ('the world', 624),
 ('that the', 618),
 ('for the', 616),
 ('to be', 603),
 ('the virus', 580),
 ('at the', 563),
 ('have been', 547),
 ('with the', 546),
 ('has been', 499),
 ('in a', 498),
 ('from the',

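Here we see mentions of digital content and content services. Many of these bigrams occur exactly 655 times, once for each of the 655 articles, which suggests that they come from a notice attached to every article rather than from the article text itself. Perhaps a focal point for articles in the Hindustan Times will still be how the theories are spread? Just a thought at this time.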
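
Many of the bigrams above have a count of exactly 655, the number of articles in the corpus. One hedged way to check whether an n-gram is per-article boilerplate is to count how many documents contain it, rather than how often it occurs overall. The sketch below does this on invented toy documents; `ngrams` and `boilerplate_ngrams` are hypothetical helpers, not functions from `functions.ipynb`.

```python
from collections import Counter

def ngrams(tokens, n):
    """Join each run of n consecutive tokens into one space-separated string."""
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def boilerplate_ngrams(docs, n=2, min_share=1.0):
    """Return n-grams found in at least min_share of the documents."""
    doc_freq = Counter()
    for tokens in docs:
        # a set counts each n-gram once per document
        doc_freq.update(set(ngrams(tokens, n)))
    threshold = min_share * len(docs)
    return sorted(gram for gram, count in doc_freq.items() if count >= threshold)

# toy documents (invented for illustration) that share one closing phrase
toy_docs = [
    'the virus spread quickly please contact editor'.split(),
    'officials said the market closed please contact editor'.split(),
    'a new study appeared please contact editor'.split(),
]
print(boilerplate_ngrams(toy_docs))
```

N-grams flagged this way could then be excluded from the frequency lists so that article content is easier to see. Lowering `min_share` slightly would tolerate articles where the notice is missing or worded differently.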

In [14]:
ht_trigram_dist.most_common(50)

[('from hindustan times', 684),
 ('with respect to', 667),
 ('or any other', 659),
 ('published by ht', 655),
 ('by ht digital', 655),
 ('ht digital content', 655),
 ('digital content services', 655),
 ('content services with', 655),
 ('services with permission', 655),
 ('with permission from', 655),
 ('permission from hindustan', 655),
 ('hindustan times for', 655),
 ('times for any', 655),
 ('for any query', 655),
 ('any query with', 655),
 ('query with respect', 655),
 ('respect to this', 655),
 ('to this article', 655),
 ('this article or', 655),
 ('article or any', 655),
 ('any other content', 655),
 ('other content requirement', 655),
 ('content requirement please', 655),
 ('requirement please contact', 655),
 ('please contact editor', 655),
 ('contact editor at', 655),
 ('editor at contentservices@htlivecom', 655),
 ('the world health', 212),
 ('of the virus', 189),
 ('the number of', 170),
 ('the united states', 158),
 ('the covid-19 pandemic', 156),
 ('world health organizatio

A unique trigram here is the mention of Prime Minister Narendra Modi. At this time I know of no conspiracy theory in which he is heavily involved or mentioned, so a KWIC analysis will be helpful here.

It is useful to be able to refer to one document that contains all the text, so I will quickly write out a file that consists of only the text portion of each publication.

In [15]:
## write out one doc, ht_doc, that consists solely of the texts from each publication
ht_single_text_list=[]

for article in ht_corp:
    if article.get('tokens'):
        ht_single_text_list.append(article['text'])

ht_doc = '\n---\n'.join(ht_single_text_list)
with open('../data/text/hindustan_times/ht_composite_text.txt','w') as out:
    out.write(ht_doc)

In [16]:
ht_comp_toks = tokenize(ht_doc, lowercase=True, strip_chars=characters_to_remove)
ht_comp_toks_dist = Counter(ht_comp_toks)

In [17]:
print("{: <20}{: <6}\t{}".format('term','Hindustan Times Freq', 'Norm Freq'))
print("="*62)
origin_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory','narendra']
for term in origin_terms:
    print("{: <20}{: <6}\t\t\t{}".format(term, ht_token_dist[term], round((ht_token_dist[term]/len(ht_comp_toks)*10000),2)))

term                Hindustan Times Freq	Norm Freq
laboratory          74    			1.94
lab                 102   			2.68
bioweapon           0     			0.0
market              105   			2.76
military            62    			1.63
cold-chain          3     			0.08
conspiracy          25    			0.66
army                39    			1.02
detrick             3     			0.08
transparency        50    			1.31
origins             174   			4.57
wuhan               322   			8.45
theory              33    			0.87
narendra            109   			2.86
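
The "Norm Freq" column above is the raw count divided by the total number of tokens and multiplied by 10,000, so it reads as occurrences per 10,000 tokens. A minimal sketch of that calculation on a toy token list follows; `per_10k` is a hypothetical helper name and the tokens are invented.

```python
from collections import Counter

def per_10k(count, total_tokens):
    """Occurrences per 10,000 tokens, rounded to two decimals."""
    return round(count / total_tokens * 10000, 2)

# toy token list (invented), 9 tokens in total
toy_tokens = 'the lab in wuhan and the market in wuhan'.split()
toy_dist = Counter(toy_tokens)

for term in ['wuhan', 'lab', 'bioweapon']:
    # a Counter returns 0 for terms that never occur
    print('{: <12}{: <6}{}'.format(term, toy_dist[term],
                                   per_10k(toy_dist[term], len(toy_tokens))))
```

Normalizing this way makes frequencies comparable across sources whose corpora differ in size.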


As with the other four sources, Wuhan has the highest frequency among the origin terms. It is followed, unsurprisingly, by origins, and then by market and lab. The Hindustan Times does mention cold-chain and Fort Detrick, albeit only 3 times each (likely just one article). There is no mention of the bioweapon theory, but there is heavy emphasis on the lab theory. Interestingly, the prime minister is mentioned frequently, which, although likely unrelated to any conspiracy theory, is worth further investigation.

In [18]:
print_kwic(make_kwic('laboratory',ht_comp_toks))

                       virus came from a  laboratory  more than 38 million
              probably originated from a  laboratory  in wuhan authorities shut
             authorities shut a shanghai  laboratory  a day after its
              nccs and national chemical  laboratory  csir-ncl also conducted their
                      been leaked from a  laboratory  we share the need
                   covid-19 began from a  laboratory  leak and also rebuked
                    has concluded that a  laboratory  leak is the least
                       been created in a  laboratory  and leaked from a
                          could not be a  laboratory  construct or purposefully manipulated
                          it came from a  laboratory  researching bats in wuhan
                     have emerged from a  laboratory  at the fort detrick
                         role in the mrc  laboratory  of molecular biology in
                        wuhan was from a  laboratory  you know it escaped
 

In [19]:
print_kwic(make_kwic('lab',ht_comp_toks))

         originated from a high-security  lab  in wuhan we have
               mike pompeo the dangerous  lab  research in wuhan may
        coronavirus sample destroyed its  lab  samples just think -
                     virus leaked from a  lab  although the team has
                     us is probing china  lab  link us president donald
            escaped the chinese virology  lab  in wuhan robert r
                  been reports that some  lab  workers at the wuhan
                     came from the wuhan  lab  i am a virologist;
                  leaked from a virology  lab  in wuhan - and
                     based on that after  lab  modification becomes a novel
                  accident at a virology  lab  in wuhan officials of
  high-containment laboratory though the  lab  in wuhan couldstudysars-likecoronaviruses extracted
the trump administrationhasn'tpushedthe wuhan  lab  origin theory much but
                        human cells in a  lab  lead researcher nikolai petrovs

There is heavy emphasis on the lab leak theory, with repeated mentions of the Wuhan virology lab. Some instances appear to denounce the theory, others less so.
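
The `make_kwic` and `print_kwic` helpers used in these cells are defined in `functions.ipynb` and are not shown here. For readers unfamiliar with KWIC, the sketch below illustrates the general idea under assumptions of my own: `kwic_lines` and `show_kwic` are hypothetical names, and the window size and layout are guesses rather than the actual implementation.

```python
def kwic_lines(node, tokens, window=4):
    """Hypothetical KWIC helper: (left context, node, right context) for every hit."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = ' '.join(tokens[max(0, i - window):i])
            right = ' '.join(tokens[i + 1:i + 1 + window])
            hits.append((left, tok, right))
    return hits

def show_kwic(hits, width=40):
    """Print hits with the left context right-aligned so the node words line up."""
    for left, node, right in hits:
        print('{:>{w}}  {}  {}'.format(left, node, right, w=width))

# toy sentence (invented), not taken from the corpus
toy = 'some say the virus leaked from a lab in wuhan but the lab denies it'.split()
hits = kwic_lines('lab', toy, window=3)
show_kwic(hits)
```

Right-aligning the left context is what produces the centered column of node words seen in the outputs above.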

In [20]:
print_kwic(make_kwic('military',ht_comp_toks))

                novel coronavirus has us  military  origins by citing a
                   followers that the us  military  might be behind the
                           us and the us  military  or imported frozen packaged
                     that accused the us  military  of starting the coronavirus
                for reciprocal access to  military  logisticsduringavirtualsummit last month australia
                     the backdrop of the  military  standoff with china that
                for reciprocal access to  military  logistics facilities and other
                     that accused the us  military  of starting the coronavirus
                for reciprocal access to  military  logistics facilities and other
                      based on the china  military  institute that discovered and
               power politics - building  military  infrastructure conducting naval exercises
           defence cooperation by closer  military  to military cooperation rajnath
          

In [21]:
print_kwic(make_kwic('conspiracy',ht_comp_toks))

                website known to promote  conspiracy  theories to bolster his
          ministry officials even shared  conspiracy  theories that accused the
                    last year - plugging  conspiracy  theories about the origin
          ministry officials even shared  conspiracy  theories that accused the
               epidemic and for criminal  conspiracy  union home minister amit
                       which grew into a  conspiracy  theory was that the
                  to murder and criminal  conspiracy  were added to the
          the centre of covid-19-related  conspiracy  theories because of her
                remove false content and  conspiracy  theories about covid-19 that
             and some celebrities spread  conspiracy  theories about its origins
          ministry officials even shared  conspiracy  theories that accused the
                          is the rise of  conspiracy  theories around the issue
          ministry officials even shared  conspiracy  t

Most of these refer to conspiracy theories in general, but there are interesting mentions of sources. For example, there is a "website known to promote conspiracy theories", "some celebrities spread conspiracy theories about its origins", and also a Nature Medicine study "dismissing the theories". Unlike the other sources, the Hindustan Times mentions how these theories are propagating.

In [22]:
print_kwic(make_kwic('army',ht_comp_toks))

                          it might be us  army  who brought the epidemic
                     at the fort detrick  army  medical command in the
         chinese pla people's liberation  army  has escalated border tensions--we
          contested border twenty indian  army  soldiers and an undisclosed
                  samant goel and indian  army  chief gen mm naravane
                      are fighting as an  army  vipin krishnan who has
                    you are a nonviolent  army  gandhi said in his
                         the team at the  army  camp however the army
                   army camp however the  army  will be taking care
                    to assist the soviet  army  therefore it is a
        when china's people's liberation  army  pla soldiers adopted an
                  out of kalapani indian  army  chief general mm navrane's
                 detail at the three-day  army  commanders' conference that began
              level leadership of indian  army  will brai

In [23]:
print_kwic(make_kwic('origins',ht_comp_toks))

            international probe into the  origins  of the pandemic and
            international probe into the  origins  of the pathogen and
               year hashimoto's name has  origins  in olympic flame her
             coronavirus has us military  origins  by citing a story
          international inquiry into the  origins  and spread of the
                      china has said the  origins  of the virus are
             chinese scientists into the  origins  of the virus? it
            independent inquiry into the  origins  and spread of the
       international assessment into the  origins  of the pandemic the
               whilst masking the funds'  origins  fatf says in the
                      and china over the  origins  and handling of the
               whilst masking the funds'  origins  fatf said in the
         to comprehensively evaluate the  origins  of covid-19 after eu
                       who report on the  origins  of the coronavirus disease
              l

There is a notable dichotomy between the independent investigations of the origins and the joint international investigations. There are also mentions of China's probe into the origins.

In [24]:
print_kwic(make_kwic('transparency',ht_comp_toks))

              germany have urged greater  transparency  from china until now
                   also offered the same  transparency  that marked the rest
                         to the need for  transparency  and accountability china which
                   effort to bring about  transparency  and accountability for the
            thursday called for complete  transparency  in the reporting of
                   the need for complete  transparency  and the timely sharing
                         to the need for  transparency  and accountability china which
                   effort to bring about  transparency  and accountability for the
         countries on accountability and  transparency  on the origin of
                       need to stress on  transparency  and accountability for the
                 disease the emphasis on  transparency  and accountability at this
      inclusive indo-pacific respect for  transparency  strengthening and diversifying supply
                  

There are notable mentions of the need for transparency and accountability from China, with accountability frequently accompanying transparency. A majority reference the lack of openness and transparency.
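
The observation that one word often accompanies another can be checked by counting collocates, that is, the tokens that fall within a few positions of the node word. The following is a minimal sketch on an invented token list. `collocates` is a hypothetical helper, and the window size and stopword list are arbitrary choices for illustration.

```python
from collections import Counter

def collocates(node, tokens, window=3, stopwords=frozenset()):
    """Count tokens within `window` positions of each occurrence of node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        counts.update(t for t in neighbours if t not in stopwords)
    return counts

# toy tokens (invented), loosely modelled on the KWIC lines above
toy = ('the need for transparency and accountability from china '
       'calls for transparency and accountability at this stage').split()
top = collocates('transparency', toy, window=2,
                 stopwords={'the', 'and', 'for', 'at'})
print(top.most_common(3))
```

Applied to `ht_comp_toks`, a count like this would put a number on how often accountability appears next to transparency, instead of relying on reading the KWIC lines by eye.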

In [26]:
print_kwic(make_kwic('theory',ht_comp_toks))

              to change the virus-origin  theory  and shift the blame
            further investigation of the  theory  that covid-19 began from
            was transmitted the lab-leak  theory  was viewed as extremely
administrationhasn'tpushedthe wuhan lab origin  theory  much but it has
             the findings strengthen the  theory  that pangolins could be
                   march 17 debunked the  theory  that the sars-cov-2 virus
                  grew into a conspiracy  theory  was that the virus
                    sort of outbreak one  theory  is that the chinese
                      have proposed is a  theory  backed by data and
                 resulted in a plausible  theory  about its bio-engineered origins
              the researchers made their  theory  public in february 2020
              to change the virus-origin  theory  and shift the blame
          the bat-hosted coronavirus one  theory  - so far unproven
                 disprove the lab origin  theory  of the

There are specific mentions of the lab-leak theory and a natural origin theory; otherwise there is nothing too significant or out of the ordinary.

In [27]:
print_kwic(make_kwic('cold-chain',ht_comp_toks))

                    had any contact with  cold-chain  products or with wild
                through the packaging of  cold-chain  food products is unlikely
               staff and people handling  cold-chain  imported products over 1


The Hindustan Times articles mention cold-chain only 3 times in total, and all three refer to frozen imported food products.

In [28]:
print_kwic(make_kwic('detrick',ht_comp_toks))

                  laboratory at the fort  detrick  army medical command in
                     please open up fort  detrick  and make public more
               the hashtag american's ft  detrick  started by the communist


Fort Detrick is mentioned 3 times in total, with one instance seeming to demand transparency from the US.

In [29]:
print_kwic(make_kwic('narendra',ht_comp_toks))

       district collector prime minister  narendra  modi expressed anguish over
       district collector prime minister  narendra  modi expressed anguish over
       in 2014 indian-american physicist  narendra  singh kapany who is
                letter to prime minister  narendra  modi and external affairs
             into account prime minister  narendra  modi who on tuesday
                    25 -- prime minister  narendra  modi wants the g-20
                   out in prime minister  narendra  modi's sagar vision or
                    28 -- prime minister  narendra  modi on saturday said
             hours before prime minister  narendra  modi addresses the nation
           president with prime minister  narendra  modi from their joint
           summit between prime minister  narendra  modi and his australian
             farmers' agitation with the  narendra  modi government read more
               aircraft carrier with the  narendra  modi government calling it
         vid

Returning to the frequent mentions of the prime minister, we see that they do not specifically reference the conspiracy theories; rather, he is mentioned in the broader context of the state of the pandemic.

In [50]:
# select texts with origin terms
origin_txt_ht=[]
orig_terms = ['laboratory','lab','bioweapon','market','military','cold-chain','conspiracy','army','detrick', 'transparency','origins','wuhan','theory']

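## .get() skips articles that never received a 'tokens' key (e.g. empty text) instead of raising KeyError
for word in orig_terms:
    for article in ht_corp:
        if article in origin_txt_ht:
            continue
        elif word in article.get('tokens', []):
            origin_txt_ht.append(article)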

In [51]:
# number narrowed down
len(origin_txt_ht)

305

In [53]:
# percentage of all ht texts
round(len(origin_txt_ht)/len(ht_corp)*100,2)

46.56
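
The selection above loops over every term and every article. An alternative, sketched here under my own assumptions, is to make a single pass over the corpus and test each article's tokens against a set of terms. `select_articles` is a hypothetical helper and the toy corpus is invented. Note that this version keeps articles in corpus order, whereas the loop above orders them by the first matching term, so a file written from the result would list the same articles in a different order.

```python
def select_articles(corpus, terms):
    """Keep articles whose token list contains at least one of the terms."""
    wanted = set(terms)
    # isdisjoint returns False as soon as one shared token is found
    return [a for a in corpus if not wanted.isdisjoint(a.get('tokens', []))]

# toy corpus (invented); the third entry has no tokens at all
toy_corpus = [
    {'Filename': 'a.txt', 'tokens': ['the', 'lab', 'in', 'wuhan']},
    {'Filename': 'b.txt', 'tokens': ['cricket', 'scores', 'today']},
    {'Filename': 'c.txt'},
    {'Filename': 'd.txt', 'tokens': ['a', 'conspiracy', 'theory']},
]
selected = select_articles(toy_corpus, ['lab', 'wuhan', 'theory'])
share = round(len(selected) / len(toy_corpus) * 100, 2)
print([a['Filename'] for a in selected], share)
```

Because each article is visited once, there is no need to check whether it has already been added.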

In [54]:
# write out as doc
terms_only_txt=[]
for article in origin_txt_ht:
    terms_only_txt.append(article['text'])
    
terms_only_doc = '\n---\n'.join(terms_only_txt)
with open('../data/text/hindustan_times/ht_terms_only_texts.txt','w') as out:
    out.write(terms_only_doc)