# Names in the News

Find the most frequently mentioned names, week by week, in a sample of news stories from 1996.

Feb 15, 2018  
Ling 583

In [1]:
import pandas as pd
import spacy
from spacy import displacy
import re
from collections import Counter

pd.set_option('display.max_rows', 20)

In [2]:
nlp = spacy.load('en', disable=['parser'])

Load a pre-processed sample of articles from [Reuters Corpus Volume 1 (RCV1)](http://www.daviddlewis.com/resources/testcollections/rcv1/):

In [3]:
df = pd.read_csv('http://bulba.sdsu.edu/rcv1.csv', na_filter=True)

As an example, look for named entities in the first article:

In [4]:
doc = nlp(df['text'][0])
displacy.render(doc, style='ent', jupyter=True)

Now let's do that for every article in the dataset:

In [5]:
df['doc'] = list(nlp.pipe(df['text']))

We need a function that takes a collection of Docs and returns the most frequently mentioned personal names. We could use the pandas `value_counts()` method, but in this case the non-pandas Python `Counter()` is probably simpler:

In [6]:
def get_top_names(docs):
    name = Counter(e.orth_ for d in docs for e in d.ents if e.label_=='PERSON')
    return name.most_common(10)
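To see the counting step in isolation, here is a minimal sketch that runs without spaCy. The `toy_docs` list and the `toy_top_names()` helper are hypothetical stand-ins: each "document" is just a list of (text, label) pairs rather than a real `Doc`, and the entries are made up for illustration.

```python
from collections import Counter

# Hypothetical stand-in for spaCy Docs: each "doc" is a list of
# (entity_text, entity_label) pairs. This is not real corpus data.
toy_docs = [
    [('Clinton', 'PERSON'), ('Reuters', 'ORG'), ('Dole', 'PERSON')],
    [('Clinton', 'PERSON'), ('Moscow', 'GPE'), ('Yeltsin', 'PERSON')],
    [('Clinton', 'PERSON'), ('Dole', 'PERSON')],
]

def toy_top_names(docs, n=2):
    # Same nested generator shape as get_top_names(): walk every entity
    # in every doc, keep only PERSON entities, and count their text.
    counts = Counter(text for doc in docs
                     for text, label in doc
                     if label == 'PERSON')
    return counts.most_common(n)

print(toy_top_names(toy_docs))  # [('Clinton', 3), ('Dole', 2)]
```

`Counter` accepts any iterable, so the generator expression can feed it directly without building an intermediate list.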

In [7]:
get_top_names(df['doc'])

[('Clinton', 361),
 ('Dole', 184),
 ('\t\t', 180),
 ('\t     ', 149),
 ('\n     ', 127),
 ('Yeltsin', 127),
 ('Netanyahu', 110),
 ('Bill Clinton', 109),
 ('Lebed', 62),
 ('\t     FORECAST', 61)]

So, that's not so great. spaCy's NER labeler is, for some reason, identifying runs of whitespace (tabs, newlines, and spaces) as personal names. There are a few ways to help with that, but for now let's take a simple one: we'll strip leading and trailing whitespace from each name and then ignore any that don't start with a letter:

In [8]:
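def valid_name(name):
    # Truthy only if the string starts with an ASCII letter
    return re.search(r'^[A-Za-z]', name)

def get_top_names(docs):
    # Strip whitespace from each PERSON entity, then drop invalid names
    name = Counter(filter(valid_name,
                          (e.orth_.strip() for d in docs for e in d.ents if e.label_=='PERSON')))
    return name.most_common(10)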

In [9]:
get_top_names(df['doc'])

[('Clinton', 364),
 ('Dole', 190),
 ('Yeltsin', 143),
 ('Netanyahu', 114),
 ('Bill Clinton', 109),
 ('FORECAST', 72),
 ('Lebed', 66),
 ('Bhutto', 63),
 ('Arafat', 63),
 ('M4', 61)]

That's better. There are still errors (like *FORECAST* and *M4*) but for now we'll just live with them.
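The following sketch shows why *FORECAST* and *M4* survive the filter: both start with a letter once the whitespace is stripped. The `raw` list holds strings of the kind seen in the outputs above. The `stricter_name()` helper is a hypothetical variation, not part of the original notebook, and it would also drop any legitimate name written in all capitals.

```python
import re

def valid_name(name):
    # Same test as in the notebook: must start with an ASCII letter
    return re.search(r'^[A-Za-z]', name)

def stricter_name(name):
    # Hypothetical extra check: also reject strings with no lowercase
    # letters, such as 'FORECAST' or 'M4'
    return bool(valid_name(name)) and not name.isupper()

# Strings of the kind that appeared in the outputs above
raw = ['Clinton', '\t\t', '\t     FORECAST', 'M4', '\n     ']

cleaned = [s.strip() for s in raw]
kept = [s for s in cleaned if valid_name(s)]
kept_strict = [s for s in cleaned if stricter_name(s)]

print(kept)         # ['Clinton', 'FORECAST', 'M4']
print(kept_strict)  # ['Clinton']
```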

Once we've got a way to count names in a batch of articles, the final steps are to group articles by week and apply our `get_top_names()` function to each week's articles: 

In [10]:
df['date'] = pd.to_datetime(df['date'])
df['week'] = df['date'].dt.to_period('W')
df.groupby('week')['doc'].apply(get_top_names)

week
1996-08-19/1996-08-25    [(Lebed, 26), (Clinton, 17), (Dole, 14), (Yelt...
1996-08-26/1996-09-01    [(Clinton, 30), (Dole, 16), (Chun, 13), (Harir...
1996-09-02/1996-09-08    [(Mao, 27), (Graf, 12), (Tyson, 10), (Zigun, 1...
1996-09-09/1996-09-15    [(Clinton, 17), (Yeltsin, 16), (Netanyahu, 15)...
1996-09-16/1996-09-22    [(Clinton, 18), (Yeltsin, 15), (Bossi, 13), (B...
1996-09-23/1996-09-29    [(Tyson, 16), (Netanyahu, 15), (Arafat, 15), (...
1996-09-30/1996-10-06    [(Clinton, 69), (Dole, 47), (Yeltsin, 26), (Ne...
1996-10-07/1996-10-13    [(Clinton, 35), (Dole, 32), (Kemp, 20), (Bill ...
1996-10-14/1996-10-20    [(Dole, 19), (Clinton, 18), (Lebed, 12), (Neta...
1996-10-21/1996-10-27    [(Clinton, 29), (Dole, 25), (Lee, 17), (A1, 13...
1996-10-28/1996-11-03    [(Clinton, 34), (Dole, 24), (M4, 13), (Jewell,...
1996-11-04/1996-11-10    [(Bhutto, 23), (Clinton, 21), (Leghari, 14), (...
1996-11-11/1996-11-17    [(Clinton, 20), (FORECAST, 13), (Winterthur, 1...
1996-11-18/1996-11-2