## Demo of the King, Lam, and Roberts algorithm for finding sets of keywords

You can read the [paper](https://gking.harvard.edu/files/gking/files/ajps12291_final.pdf) and download their [code and replication data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMJDCD).

The code is stored in `keywords.py`, with minor modifications to update out-of-date code.


In [110]:
# Start by reading in some data
import os
import pandas as pd
from sociallistening.io import S3TweetReader
import dateutil.parser  # import the submodule explicitly so dateutil.parser.parse is available

start_datetime_str = "2017-06-19"
end_datetime_str = "2017-06-25"

start_dt = dateutil.parser.parse(
    start_datetime_str, default=dateutil.parser.parse("00:00Z"))

end_dt = dateutil.parser.parse(
    end_datetime_str, default=dateutil.parser.parse("00:00Z"))

local=True
gnip_tag='politics_sample'
temp_dir='tweets/'+gnip_tag
os.makedirs(temp_dir, exist_ok=True)

In [None]:
reader = S3TweetReader(temp_dir=temp_dir,
                       start_dt=start_dt,
                       end_dt=end_dt,
                       local=False,
                       remove_replies=True,
                       gnip_tag=gnip_tag)

In [22]:
tweets = [x for x in reader.read_tweets()]

### Extract the text from each tweet; we can ignore the other fields here

In [23]:
texts = [x.text for x in tweets]

In [139]:
len(texts)

106660

## Defining the reference set $R$ and the search set $S$

## Specifying initial query, $Q_R$

I am using the keyword "AHCA", matched case-insensitively and with or without a leading hashtag.

In [1]:
import re
Q_r=re.compile(r"(#?)AHCA", re.IGNORECASE)
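Because `Q_r` is used with `search`, it matches "AHCA" anywhere in a tweet, including inside a longer token such as a user handle. The sketch below illustrates this on a few made-up sample strings and compares it with a hypothetical stricter pattern, `Q_r_strict`, which is an assumption for illustration and not part of the original analysis.

```python
import re

# The query from the cell above.
Q_r = re.compile(r"(#?)AHCA", re.IGNORECASE)

# Hypothetical stricter variant (not in the original notebook): "AHCA",
# optionally preceded by "#", must not directly follow a letter, digit,
# or underscore.
Q_r_strict = re.compile(r"(?<![A-Za-z0-9_])#?AHCA", re.IGNORECASE)

# Made-up sample strings for illustration only.
samples = [
    "Vote no on the #AHCA",
    "ahca is back in the Senate",
    "@WahcaMia nice try",
    "#ahcakills",
    "nothing about health care",
]

loose = [Q_r.search(s) is not None for s in samples]
strict = [Q_r_strict.search(s) is not None for s in samples]

for s, a, b in zip(samples, loose, strict):
    print(f"{s!r}: loose={a}, strict={b}")
```

The stricter pattern drops the handle match but would also drop hashtags such as `#votenoahca`, where "ahca" follows other letters, so it trades recall for precision.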

In [99]:
R = [x for x in texts if Q_r.search(x) is not None]

In [100]:
R_set = set(R)  # set membership is much faster than scanning the list for every tweet
S = [x for x in texts if x not in R_set]

In [140]:
print(len(R))
print(len(S))

1797
104801
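The reference set $R$ and the search set $S$ partition the corpus: every text that matches the query goes to $R$ and everything else goes to $S$. As a minimal sketch, the same split can be done in a single pass. The helper name `split_reference_search` and the sample texts are assumptions for illustration.

```python
import re


def split_reference_search(texts, query):
    """Split texts into (reference, search) lists using a compiled regex."""
    reference, search = [], []
    for text in texts:
        if query.search(text) is not None:
            reference.append(text)
        else:
            search.append(text)
    return reference, search


# Made-up sample texts for illustration only.
texts = [
    "Vote no on the #AHCA",
    "come back to chicago please!",
    "The ahca hurts seniors",
    "good morning",
]

R, S = split_reference_search(texts, re.compile(r"(#?)AHCA", re.IGNORECASE))
print(len(R), len(S))
```

Each text is tested against the query exactly once, so the two lists always add up to the size of the corpus.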


## Now we can specify a model to find new keywords

In [2]:
import keywords



The existing codebase reads its data from CSV files, so the easiest approach is to dump our data to CSV and read it back in, rather than modify the code.

In [127]:
R_ = pd.DataFrame(R)
R_.columns = ['text']
R_['id'] = list(R_.index)  # Index.asi8 is deprecated for non-datetime indexes in newer pandas
S_ = pd.DataFrame(S)
S_.columns = ['text']
S_['id'] = list(S_.index)
R_.to_csv('R.csv')
S_.to_csv('S.csv')
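Tweets often contain newlines, commas, and quotation marks, so the CSV round trip only works if those fields are quoted. The sketch below checks this with the standard-library `csv` module rather than pandas, purely so that it runs without extra dependencies. The rows are made up for illustration.

```python
import csv
import io

# Made-up rows; real tweets often contain newlines, commas, and quotes.
rows = [
    {"id": 0, "text": "#JonesvKelly\n\nMSM is dead!"},
    {"id": 1, "text": 'He said "no", then left'},
]

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["id", "text"])
writer.writeheader()
writer.writerows(rows)

# Read the same buffer back and confirm the text survived unchanged.
buffer.seek(0)
loaded = list(csv.DictReader(buffer))
round_trip_ok = [r["text"] for r in loaded] == [r["text"] for r in rows]
print(round_trip_ok)
```

Note that the ids come back as strings after the round trip. In pandas, writing with `to_csv(..., index=False)` avoids the extra `Unnamed: 0` column that appears when the file is read back.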

In [128]:
R_.head()

Unnamed: 0,text,id
0,@grandmapurse @WahcaMia @MeGminor All crazines...,0
1,@grandmapurse @WahcaMia @MeGminor Nice try Mad...,1
2,@scolderscholar @conradhackett If you don't li...,2
3,Opinion | Trump predictably abandons the AHCA ...,3
4,@PattyMurray @SenatorCantwell #votearama strat...,4


In [129]:
S_.head()

Unnamed: 0,text,id
0,#JonesvKelly\n\nMSM is dead!\n\nHalelujah!!,0
1,@ArthurSchwartz @PreetBharara Worst troll atte...,1
2,@drose come back to chicago please!,2
3,My little brother is acting like a total greml...,3
4,Proverbs 21:2\nAll a man's ways seem right to ...,4


In [11]:
S = pd.read_csv('S.csv')  # the 'rU' file mode is deprecated; read_csv can open the path itself


In [17]:
S=S.sample(10000)

In [18]:
S.to_csv('S_.csv')

In [153]:
import importlib
importlib.reload(keywords)

<module 'keywords' from '/Users/tdavidson/automated-keyword-discovery/keywords.py'>

In [19]:
ahca = keywords.Keywords()
ahca.ReferenceSet(data='R.csv', text_colname='text', id_colname='id')
ahca.SearchSet(data='S_.csv', text_colname='text', id_colname='id')
ahca.ProcessData(remove_wordlist=[], keep_twitter_symbols=True)

Keyword object initialized.
Loaded reference set of size 802 in 0.01 seconds.
Loaded search set of size 9997 in 0.07 seconds.
Time to process corpus: 2.91 seconds


In [20]:
ahca.ReferenceKeywords()


2717 reference set keywords found.


In [21]:
ahca.ClassifyDocs(algorithms=['nbayes', 'logit'])


Document Term Matrix: 10799 by 25036 with 85584 nonzero elements

Time to get document-term matrix: 0.18 seconds

Ref training size: 265; Search training size: 3299; Training size: 3564; Test size: 9997

Time for Naive Bayes: 0.0 seconds
Time for Logit: 0.01 seconds


In [22]:
ahca.FindTargetSet()

162 documents in target set
9835 documents in non-target set


In [23]:
ahca.FindKeywords()
ahca.PrintKeywords()

72 target set keywords found
1155 non-target set keywords found
   Reference                  Target                        Non-target
   ----------                 ----------                    ----------
1. #ahca                      que                           @realdonaldtrump
2. ahca                       medicaid                      like
3. senat                      throw                         get
4. bill                       @senwarren                    amp
5. amp                        #trumpcare                    peopl
6. vote                       california                    one
7. know                       aca                           whi
8. peopl                      #healthcarebill               would
9. #trumpcare                 easili                        make
10. #ahcakills                file                          good
11. #fullrepeal               @leahr77                      think
12. health                    huh                           want
13.

Although some of the top terms in the target set $T$ appear to be noise, `#trumpcare` and `#healthcarebill` rank highly. We can therefore use terms like these to redefine the reference set.
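To give some intuition for the Target and Non-target columns, here is a toy sketch that ranks words by how much more often they appear in target documents than in non-target documents. It uses a smoothed log-ratio of document frequencies, which is a simple stand-in chosen for illustration and not necessarily the statistic used in `keywords.py` or in the paper. The function names and documents are made up.

```python
import math
from collections import Counter


def doc_freq(docs):
    """Count, for each word, the number of documents that contain it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def rank_keywords(target_docs, nontarget_docs, top_n=3):
    """Rank words by a smoothed log-ratio of document frequencies."""
    df_t = doc_freq(target_docs)
    df_n = doc_freq(nontarget_docs)
    n_t, n_n = len(target_docs), len(nontarget_docs)
    scores = {}
    for word in set(df_t) | set(df_n):
        # Add-one smoothing keeps the ratio finite for unseen words.
        p_t = (df_t[word] + 1) / (n_t + 2)
        p_n = (df_n[word] + 1) / (n_n + 2)
        scores[word] = math.log(p_t / p_n)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


# Made-up documents for illustration only.
target = ["repeal the #trumpcare bill", "#trumpcare cuts medicaid", "medicaid cuts hurt"]
nontarget = ["come back to chicago", "the game was good", "good morning chicago"]

top_target = rank_keywords(target, nontarget)
print(top_target)
```

Swapping the two arguments ranks the words that are most characteristic of the non-target documents instead.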

In [131]:
kw = ['healthcarebill', 'trumpcare', 'medicaid', ' ACA ',
      'ahcakills', 'fullrepeal', 'votenoahca', 'noahca',
      'ahca']

In [117]:
T = [x for x in list(S.text) if any(i in x for i in kw) ]

In [118]:
len(T)

14

In [119]:
T

["@SenJohnBarrasso  U are a liar U say Obamacare every other word (code 2 base 'the black guy') IT'S ACA &amp; Republicans are gutting subsidies!",
 '@foxnewspolitics ACA repeal on the fast track? How is that possible considering that only 13 senators have any idea what is in it?',
 '@PizzazicUrge @Corrynmb @realDonaldTrump @rjfbobb @ICEgov And that "everyone" would be covered by the ACA replacement.',
 "@FoxNews @POTUS Well that's not covered under #trumpcare",
 "@realDonaldTrump There are rumors the new 'health bill' wants to take away medicaid from the sick &amp; old, needless t… https://t.co/IbLfIYkZFc",
 "@SherryTerm @mkhammer That's funny the ACA in California covered mammograms and my daughter's therapy for autism",
 "The #HealthcareBill is not a #healthcarebill; it's a tax cut for the wealthy that cuts health care for everyone else",
 '@realDonaldTrump The best solution is to fix the ACA not get rid of it.',
 '@NBCPolitics Remember when we made concession after concession on AC

In [111]:
reader = S3TweetReader(temp_dir=temp_dir,
                       start_dt=start_dt,
                       end_dt=end_dt,
                       local=True,
                       remove_replies=True,
                       gnip_tag=gnip_tag)

In [112]:
tweets = [x for x in reader.read_tweets()]

In [113]:
len(tweets)

106660

In [121]:
texts = [x.text for x in tweets]

In [132]:
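R = []
for t in kw:
    # Escape each keyword so it is matched literally, ignoring case
    Q_rt = re.compile(re.escape(t), re.IGNORECASE)
    relevant = [x for x in texts if Q_rt.search(x) is not None]
    R.extend(relevant)
R = list(set(R))  # drop duplicates, since a tweet can match several keywords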
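The keyword loop above scans the corpus once per keyword. As a minimal sketch, the same reference set can be built in one pass by joining the escaped keywords into a single alternation. The helper name `expand_reference_set` and the sample texts are assumptions for illustration.

```python
import re

kw = ['healthcarebill', 'trumpcare', 'medicaid', ' ACA ',
      'ahcakills', 'fullrepeal', 'votenoahca', 'noahca',
      'ahca']

# One case-insensitive pattern that matches any of the keywords literally.
pattern = re.compile("|".join(re.escape(k) for k in kw), re.IGNORECASE)


def expand_reference_set(texts, pattern):
    """Return the unique matching texts, in order of first appearance."""
    seen = set()
    matched = []
    for text in texts:
        if text not in seen and pattern.search(text) is not None:
            seen.add(text)
            matched.append(text)
    return matched


# Made-up sample texts for illustration only.
sample_texts = [
    "Fix the ACA instead",
    "#Trumpcare is a tax cut",
    "#Trumpcare is a tax cut",
    "come back to chicago",
    "Medicaid cuts and #AHCA",
]

R_new = expand_reference_set(sample_texts, pattern)
print(R_new)
```

Unlike `list(set(R))`, this keeps the tweets in their original order. The keywords `ahcakills`, `votenoahca`, and `noahca` all contain `ahca`, so they add no matches beyond those of `ahca` itself.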

In [133]:
len(R)

1797

In [134]:
R_set = set(R)  # set membership is much faster than scanning the list for every tweet
S = [x for x in texts if x not in R_set]

In [135]:
len(S)

104801

In [136]:
R_ = pd.DataFrame(R)
R_.columns = ['text']
R_['id'] = list(R_.index)  # Index.asi8 is deprecated for non-datetime indexes in newer pandas
S_ = pd.DataFrame(S)
S_.columns = ['text']
S_['id'] = list(S_.index)
S_ = S_.sample(10000)
R_.to_csv('R_2.csv')
S_.to_csv('S_2.csv')

## Now that the new sets have been defined, we can re-run the algorithm.

In [137]:
ahca = keywords.Keywords()
ahca.ReferenceSet(data='R_2.csv', text_colname='text', id_colname='id')
ahca.SearchSet(data='S_2.csv', text_colname='text', id_colname='id')
ahca.ProcessData(remove_wordlist=[], keep_twitter_symbols=True)

Keyword object initialized.
Loaded reference set of size 1797 in 0.02 seconds.
Loaded search set of size 9989 in 0.09 seconds.
Time to process corpus: 3.48 seconds


In [138]:
ahca.ReferenceKeywords()
ahca.ClassifyDocs(algorithms=['nbayes', 'logit'])
ahca.FindTargetSet()
ahca.FindKeywords()
ahca.PrintKeywords()


4506 reference set keywords found.

Document Term Matrix: 11786 by 24171 with 95119 nonzero elements

Time to get document-term matrix: 0.16 seconds

Ref training size: 593; Search training size: 3296; Training size: 3889; Test size: 9989

Time for Naive Bayes: 0.0 seconds
Time for Logit: 0.02 seconds
245 documents in target set
9744 documents in non-target set
246 target set keywords found
1081 non-target set keywords found
   Reference                  Target                        Non-target
   ----------                 ----------                    ----------
1. medicaid                   health                        trump
2. senat                      bill                          like
3. #ahca                      senat                         get
4. #healthcarebill            care                          know
5. #trumpcare                 repeal                        louisemensch
6. trumpcar                   tax                           good
7. health                     