## Demo of the King, Lam, and Roberts algorithm for finding sets of keywords

You can read the [paper](https://gking.harvard.edu/files/gking/files/ajps12291_final.pdf) and download their [code and replication data](https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/FMJDCD).

The code is stored in `keywords.py`. I made some minor modifications to update out-of-date code.


In [2]:
# Load in a list of tweets from the API (not provided here)
len(tweets)

### Extract the text from each tweet; the other fields are not needed here

In [145]:
texts = [x.text for x in tweets]

In [146]:
len(texts)

106660

## Defining the reference set $R$ and the search set $S$

## Specifying initial query, $Q_R$

I am using the keyword "AHCA", case insensitive and potentially with hashtags.

In [147]:
import re
Q_r=re.compile(r"(#?)AHCA", re.IGNORECASE)

In [148]:
R = [x for x in texts if Q_r.search(x) is not None]

In [149]:
R_set = set(R)  # set membership is O(1), versus a linear scan of the list
S = [x for x in texts if x not in R_set]

In [150]:
print(len(R))
print(len(S))

480
106180
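
As a small illustration of the partition above, the sketch below applies the same kind of case-insensitive pattern to a few made-up tweets. The `toy_texts` list and its contents are invented for the example and are not from the dataset. Because `search` matches anywhere in the string, the optional `#` does not change which tweets match, so the sketch drops it.

```python
import re

# Hypothetical example tweets (not from the real dataset).
toy_texts = [
    "Call your senators about the #AHCA today",
    "ahca vote delayed again",
    "Great game last night!",
    "Medicaid cuts would hurt rural hospitals",
]

# Case-insensitive match on the bare keyword; "#AHCA" also contains "AHCA".
q_r = re.compile(r"AHCA", re.IGNORECASE)

# Reference set: tweets that match the query. Search set: everything else.
toy_R = [t for t in toy_texts if q_r.search(t)]
toy_S = [t for t in toy_texts if not q_r.search(t)]

print(len(toy_R), len(toy_S))
```

Note that the Medicaid tweet lands in the search set even though it is on topic, which is the kind of document the algorithm is meant to surface.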


## Now we can specify a model to find new keywords

In [151]:
import keywords

The existing codebase reads its data from CSV files, so the easiest approach is to dump our data to CSV and read it back in, rather than modify the code.

In [159]:
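import pandas as pd

# Build one-column DataFrames and use the row index as the document id.
# (Index.asi8 is not available for integer indexes in recent pandas.)
R_ = pd.DataFrame(R, columns=['text'])
R_['id'] = R_.index
S_ = pd.DataFrame(S, columns=['text'])
S_['id'] = S_.index
R_.to_csv('R.csv')
# Downsample the search set to 10,000 tweets before writing it out
S_ = S_.sample(10000)
S_.to_csv('S.csv')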
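
The CSV round trip can also be sketched without pandas. The version below is a minimal illustration using only the standard library; the helper name `write_set`, the toy rows, and the fixed seed are assumptions made for the example. As in the cell above, ids are assigned before sampling, so each sampled row keeps its original position as its id.

```python
import csv
import io
import random


def write_set(rows, handle):
    """Write (id, text) pairs as CSV with 'text' and 'id' columns."""
    writer = csv.DictWriter(handle, fieldnames=["text", "id"])
    writer.writeheader()
    for doc_id, text in rows:
        writer.writerow({"text": text, "id": doc_id})


# Hypothetical search-set texts standing in for the real tweets.
toy_S = [f"tweet number {i}" for i in range(100)]

# Pair each text with its position, then draw a seeded sample so that
# rerunning the code gives the same subset.
rng = random.Random(0)
sample = rng.sample(list(enumerate(toy_S)), 10)

# An in-memory buffer stands in for a file such as 'S.csv'.
buffer = io.StringIO()
write_set(sample, buffer)

# Read the rows back to confirm both columns survive the round trip.
buffer.seek(0)
rows = list(csv.DictReader(buffer))
print(len(rows), list(rows[0].keys()))
```

In pandas, `DataFrame.sample(n, random_state=...)` and `DataFrame.to_csv(path, index=False)` play similar roles: the first makes the sample reproducible, and the second omits the extra unnamed index column from the file.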

In [160]:
R_.head()

| | text | id |
|---|---|---|
| 0 | Opinion \| Trump predictably abandons the AHCA ... | 0 |
| 1 | This is a LOL cartoon about #SteveScalise, #AH... | 1 |
| 2 | Demand from your Senators a Public Hearing on ... | 2 |
| 3 | you can disagree all you want about the charac... | 3 |
| 4 | My three things: (a) contact senators re AHCA,... | 4 |


In [161]:
S_.head()

| | text | id |
|---|---|---|
| 40838 | USA TOADY misses another one. Must be the Russ... | 40838 |
| 9338 | What was humming @ the Great Northern? The e-l... | 9338 |
| 10372 | Nice to see #Vermont's @SeventhGen quoted in t... | 10372 |
| 83874 | Tell them you don't support their death bill. ... | 83874 |
| 85995 | I saw @samirawiley name pop up on episode 6 Se... | 85995 |


In [153]:
##import importlib
##importlib.reload(keywords)

<module 'keywords' from '/Users/tdavidson/automated-keyword-discovery/keywords.py'>

In [162]:
ahca = keywords.Keywords()
ahca.ReferenceSet(data='R.csv', text_colname='text', id_colname='id')
ahca.SearchSet(data='S.csv', text_colname='text', id_colname='id')
ahca.ProcessData(remove_wordlist=[], keep_twitter_symbols=True)

Keyword object initialized.
Loaded reference set of size 478 in 0.01 seconds.
Loaded search set of size 9983 in 0.06 seconds.
Time to process corpus: 2.96 seconds


In [163]:
ahca.ReferenceKeywords()


1856 reference set keywords found.


In [164]:
ahca.ClassifyDocs(algorithms=['nbayes', 'logit'])


Document Term Matrix: 10461 by 22351 with 82678 nonzero elements

Time to get document-term matrix: 0.22 seconds

Ref training size: 158; Search training size: 3294; Training size: 3452; Test size: 9983

Time for Naive Bayes: 0.0 seconds
Time for Logit: 0.02 seconds


In [165]:
ahca.FindTargetSet()

87 documents in target set
9896 documents in non-target set


In [166]:
ahca.FindKeywords()
ahca.PrintKeywords()

52 target set keywords found
1310 non-target set keywords found
   Reference                  Target                        Non-target
   ----------                 ----------                    ----------
1. #ahca                      que                           trump
2. ahca                       #fullrepeal                   via
3. senat                      los                           #1
4. bill                       expand                        eric
5. amp                        con                           decis
6. #fullrepeal                por                           deport
7. #trumpcare                 medicaid                      develop
8. medicaid                   aim                           dick
9. peopl                      @senatemajldr                 disappoint
10. call                      agenc                         econom
11. health                    noth                          economi
12. vote                      protect                       effo

While some of the top terms in $T$ appear to be noise, terms such as trumpcare and healthcare bill rank high. We can use these to redefine the reference set.
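
To give a rough sense of how words can be ranked by how well they separate the target set from the rest, here is a simplified sketch. It is not the scoring used in `keywords.py` or in the paper; it simply ranks each word by the difference between the share of target documents and the share of non-target documents that contain it. The function names `doc_frequency` and `rank_keywords` and the toy documents are assumptions for illustration.

```python
from collections import Counter


def doc_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    counts = Counter()
    for doc in docs:
        counts.update(set(doc.lower().split()))
    return counts


def rank_keywords(target_docs, other_docs, top_n=3):
    """Rank words by (share of target docs) minus (share of other docs)."""
    target_df = doc_frequency(target_docs)
    other_df = doc_frequency(other_docs)
    scores = {
        word: target_df[word] / len(target_docs) - other_df[word] / len(other_docs)
        for word in target_df
    }
    # Sort by score descending, breaking ties alphabetically for stable output.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


# Hypothetical documents standing in for the target and non-target sets.
toy_target = [
    "medicaid cuts in the senate bill",
    "protect medicaid now",
    "the senate bill is bad",
]
toy_other = [
    "the game was great",
    "trump tweets about the economy",
    "what a great day",
]

print(rank_keywords(toy_target, toy_other))
```

On the toy data, a common word such as "the" scores zero because it appears in the same share of documents on both sides, which is why frequent but uninformative words do not dominate this kind of ranking.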

In [167]:
# ' ACA ' is padded with spaces so that it only matches the standalone acronym
kw = ['healthcarebill', 'trumpcare', 'medicaid', ' ACA ',
      'ahcakills', 'fullrepeal', 'ahca']

In [169]:
# Case-sensitive substring match against the sampled search set
T = [x for x in S_['text'] if any(i in x for i in kw)]

In [170]:
len(T)

11

In [171]:
T

["Tell them you don't support their death bill. Don't cut medicaid, Medicare. Go back to the drawing board. This bill… https://t.co/VYdF4CG7Uv",
 "Funny, after all the year of swearing ACA wasn't about a single-payer system, they all openly admit it now. ACA WAS… https://t.co/7riXvO0beo",
 'FROM THE PEOPLE WHO BROUGHT YOU, "NO CHILD LEFT BEHIND," BRINGS YOU, #trumpcare https://t.co/aKhS4mLTii',
 'Cowardly Senate Republicans Running Away When Asked About Health Care Bill https://t.co/iFElTfWEUX #healthcarebill',
 "Destroyed it for us that's for sure. Can't get shit now with ACA in place w/out paying through the nose and opening… https://t.co/vprT2OsFGw",
 'Fuck the Turtle!! \n#Resist #trumpcare',
 'Your health care &lt; their tax breaks. This is all you need to remember next November. #healthcarebill https://t.co/inZmuGgcVk',
 '(Mitch and Ryan Got Theirs! Screw YOU!!!!!) Mitch McConnell, Healthcare, and the ACA https://t.co/KiC5vCyoIY',
 "don't listen to anything America First Policies 

In [172]:
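# Rebuild the reference set: every tweet matching any keyword, case-insensitively
R = []
for t in kw:
    # re.escape treats each keyword as literal text rather than as a pattern
    Q_rt = re.compile(re.escape(t), re.IGNORECASE)
    R.extend(x for x in texts if Q_rt.search(x) is not None)
R = list(set(R))  # drop duplicates, e.g. tweets matched by several keywords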

In [173]:
len(R)

1797

In [174]:
R_set = set(R)  # set membership is O(1), versus a linear scan of the list
S = [x for x in texts if x not in R_set]

In [175]:
len(S)

104801

In [176]:
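# Same export as before, now for the redefined sets.
# (Index.asi8 is not available for integer indexes in recent pandas.)
R_ = pd.DataFrame(R, columns=['text'])
R_['id'] = R_.index
S_ = pd.DataFrame(S, columns=['text'])
S_['id'] = S_.index
S_ = S_.sample(10000)
R_.to_csv('R_2.csv')
S_.to_csv('S_2.csv')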

## Now that the new sets have been defined, we can re-run the algorithm

In [177]:
ahca = keywords.Keywords()
ahca.ReferenceSet(data='R_2.csv', text_colname='text', id_colname='id')
ahca.SearchSet(data='S_2.csv', text_colname='text', id_colname='id')
ahca.ProcessData(remove_wordlist=[], keep_twitter_symbols=True)

Keyword object initialized.
Loaded reference set of size 1797 in 0.02 seconds.
Loaded search set of size 9975 in 0.06 seconds.
Time to process corpus: 3.21 seconds


In [178]:
ahca.ReferenceKeywords()
ahca.ClassifyDocs(algorithms=['nbayes', 'logit'])
ahca.FindTargetSet()
ahca.FindKeywords()
ahca.PrintKeywords()


4506 reference set keywords found.

Document Term Matrix: 11772 by 23767 with 95381 nonzero elements

Time to get document-term matrix: 0.15 seconds

Ref training size: 593; Search training size: 3292; Training size: 3885; Test size: 9975

Time for Naive Bayes: 0.0 seconds
Time for Logit: 0.01 seconds
220 documents in target set
9755 documents in non-target set
203 target set keywords found
1120 non-target set keywords found
   Reference                  Target                        Non-target
   ----------                 ----------                    ----------
1. medicaid                   senat                         get
2. senat                      bill                          say
3. #ahca                      health                        need
4. #healthcarebill            petit                         louisemensch
5. #trumpcare                 care                          like
6. trumpcar                   sign                          trump
7. health                     t

The top discriminating words now appear to be much more closely related to the topic of interest. A new query, $Q_{RT}$, could now be built from keywords in both $R$ and $T$.
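
One way to form such a query is sketched below: keywords drawn from both lists are merged, deduplicated, escaped, and joined into a single case-insensitive alternation. The keyword lists are hand-picked for illustration from the printed tables, and the helper name `build_query` is hypothetical.

```python
import re


def build_query(*keyword_lists):
    """Combine keyword lists into one case-insensitive regex alternation."""
    seen = []
    for kw_list in keyword_lists:
        for kw in kw_list:
            kw = kw.lower()
            if kw not in seen:  # keep the first occurrence, preserve order
                seen.append(kw)
    # re.escape guards against keywords containing regex metacharacters.
    pattern = "|".join(re.escape(kw) for kw in seen)
    return re.compile(pattern, re.IGNORECASE)


# Hypothetical picks from the Reference and Target columns.
from_R = ["#ahca", "#trumpcare", "medicaid"]
from_T = ["medicaid", "#fullrepeal"]

q_rt = build_query(from_R, from_T)

# Made-up tweets used only to exercise the query.
toy_texts = [
    "Don't cut Medicaid",
    "#FullRepeal is the only option",
    "Nice weather today",
]
matches = [t for t in toy_texts if q_rt.search(t)]
print(matches)
```

Compiling one alternation means each tweet is scanned once, rather than once per keyword. Stemmed entries from the tables, such as `senat` or `trumpcar`, would match as substrings under this scheme, so whether to include them is a judgment call.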