# EDA: Named Entity Recognition

Named entity recognition is the process of identifying particular elements in text, such as names, places, quantities, percentages, and times or dates. Identifying and quantifying the general types of content an article contains seems like a good predictor of what type of article it is. World news articles, for example, might mention more places than opinion articles, and business articles might have more percentages or dates than other sections. For each article, I'll count the total mentions of people and places in the title, as well as the unique mentions of people and places in the body.

The Stanford NLP group has published three [Named-Entity Recognizers](http://nlp.stanford.edu/software/CRF-NER.shtml). The three-class model recognizes locations, persons, and organizations, and at least for now, this is the one I'll be using. Although the recognizers are written in Java, there is the Pyner interface for Python, as well as an NLTK wrapper (which I'll be using).
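As a rough sketch of the counting step, assume the tagger returns a list of `(token, tag)` pairs with the tags `PERSON`, `LOCATION`, `ORGANIZATION`, and `O` (not an entity). The helper names `count_total` and `count_unique` are hypothetical and are not the functions in the `articledata` module. The tagged list is hand-written for illustration rather than real tagger output.

```python
# Hand-written (token, tag) pairs in the shape a three-class tagger returns.
# "O" marks tokens that are not entities. These tags are illustrative only.
tagged_tokens = [
    ("Obama", "PERSON"), ("meets", "O"), ("Merkel", "PERSON"),
    ("in", "O"), ("Berlin", "LOCATION"), ("while", "O"),
    ("Obama", "PERSON"), ("tours", "O"), ("Berlin", "LOCATION"),
]


def count_total(tagged, tag):
    """Count every token carrying the given tag, repeats included."""
    return sum(1 for _, t in tagged if t == tag)


def count_unique(tagged, tag):
    """Count distinct tokens carrying the given tag."""
    return len({token for token, t in tagged if t == tag})


print(count_total(tagged_tokens, "PERSON"), count_unique(tagged_tokens, "PERSON"))
print(count_total(tagged_tokens, "LOCATION"), count_unique(tagged_tokens, "LOCATION"))
```

Total counts suit short titles, where a repeated name is rare. Unique counts keep a long article body from being dominated by one frequently repeated name.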

Although state-of-the-art taggers can achieve near-human levels of accuracy, this one does make a few mistakes. One obvious flaw is that if I feed the tagger unigram terms, a two-part name such as "Michael Jordan" will count as two separate entities: ("Michael", "PERSON") and ("Jordan", "PERSON"). I can roughly correct for this by dividing my average named-entity count by two if need be. Additionally, the tagger sometimes mis-tags people or places. For instance, it failed to recognize "Cameroon" as a location, but tagged the word "Heartbreak" in the article title "A Personal Trainer for Heartbreak" as a person.
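An alternative to dividing by two is to merge runs of adjacent tokens that share an entity tag before counting. The sketch below is not what this notebook does. It assumes the same `(token, tag)` output shape as the tagger, and `merge_adjacent` is a hypothetical helper name.

```python
from itertools import groupby


def merge_adjacent(tagged):
    """Collapse runs of consecutive tokens sharing an entity tag into one entity."""
    merged = []
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        tokens = [token for token, _ in group]
        if tag == "O":
            # Non-entity tokens are kept as separate items.
            merged.extend((token, tag) for token in tokens)
        else:
            # A run such as "Michael", "Jordan" becomes a single entity.
            merged.append((" ".join(tokens), tag))
    return merged


tagged = [
    ("Michael", "PERSON"), ("Jordan", "PERSON"),
    ("visits", "O"), ("Chicago", "LOCATION"),
]
print(merge_adjacent(tagged))
```

This heuristic has its own failure mode: two different people or places mentioned back to back with no word between them would be merged into one entity.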

In [1]:
import articledata # importing the module I've written to format
import pandas as pd



In [2]:
data = pd.read_pickle('/Users/teresaborcuch/capstone_project/notebooks/pickled_data.pkl')

In [3]:
del data['SA_body']
del data['SA_title']
del data['SA_diff']

In [4]:
test_data = data.iloc[:5]

In [5]:
articledata.get_sent_scores(test_data)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data['SA_body'] = [compute_score(x) for x in data['body']]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data['SA_title'] = [compute_score(x) for x in data['title']]
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data['SA_diff'] = abs(data['SA_title'] - data['SA_body'])


| Unnamed: 0 | title | date | body | section | source | condensed_section | SA_body | SA_title | SA_diff |
|---|---|---|---|---|---|---|---|---|---|
| 0 | $5 Million for a Super Bowl Ad. Another Millio... | 2017-01-29 | This month, Anheuser-Busch InBev hosted a doze... | business | NYT | business | 0.01624 | -0.023148 | 0.039388 |
| 1 | $60,000 in Tuition, and My Son Wants to Become... | 2017-01-12 | My wife and I are spending a fortune to send o... | fashion | NYT | entertainment | 0.020668 | 0.041667 | 0.020999 |
| 2 | 1 Patient, 7 Tumors and 100 Billion Cells Equa... | 2016-12-07 | The remarkable recovery of a woman with advanc... | health | NYT | sci_health | 0.000946 | -0.034722 | 0.035668 |
| 5 | 15 of the Best Journals by Our Reporters Aroun... | 2016-12-30 | Our foreign correspondents wrote about dozens ... | world | NYT | world | -0.020845 | 0.052083 | 0.072928 |
| 6 | 2 Arrested in Central Germany on Suspicion of ... | 2017-02-09 | BERLIN — An Algerian man and a Nigerian man we... | world | NYT | world | -0.001084 | -0.007639 | 0.006555 |
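The `SettingWithCopyWarning` messages in the output above most likely appear because `test_data` was created as a slice (`data.iloc[:5]`) and new columns were then assigned to it. A minimal sketch of one way to avoid the warning is to take an explicit copy before adding columns. The toy DataFrame and the `title_length` column are hypothetical stand-ins, not part of the article data.

```python
import pandas as pd

# Toy frame standing in for the pickled article data (hypothetical values).
data = pd.DataFrame({"title": ["a", "bb", "ccc"], "body": ["x", "y", "z"]})

# .copy() makes test_data an independent DataFrame rather than a slice of data,
# so assigning new columns to it is unambiguous.
test_data = data.iloc[:2].copy()
test_data["title_length"] = [len(t) for t in test_data["title"]]

print(test_data)
```

With the explicit copy, the new column lands only in `test_data` and the original `data` is left unchanged.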


In [6]:
articledata.count_entities(data = test_data, section = 'title')

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data['total_persons'] = persons
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  data['total_places'] = places


In [7]:
test_data

| Unnamed: 0 | title | date | body | section | source | condensed_section | SA_body | SA_title | SA_diff | total_persons | total_places |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | $5 Million for a Super Bowl Ad. Another Millio... | 2017-01-29 | This month, Anheuser-Busch InBev hosted a doze... | business | NYT | business | 0.01624 | -0.023148 | 0.039388 | 0 | 0 |
| 1 | $60,000 in Tuition, and My Son Wants to Become... | 2017-01-12 | My wife and I are spending a fortune to send o... | fashion | NYT | entertainment | 0.020668 | 0.041667 | 0.020999 | 0 | 0 |
| 2 | 1 Patient, 7 Tumors and 100 Billion Cells Equa... | 2016-12-07 | The remarkable recovery of a woman with advanc... | health | NYT | sci_health | 0.000946 | -0.034722 | 0.035668 | 0 | 0 |
| 5 | 15 of the Best Journals by Our Reporters Aroun... | 2016-12-30 | Our foreign correspondents wrote about dozens ... | world | NYT | world | -0.020845 | 0.052083 | 0.072928 | 0 | 0 |
| 6 | 2 Arrested in Central Germany on Suspicion of ... | 2017-02-09 | BERLIN — An Algerian man and a Nigerian man we... | world | NYT | world | -0.001084 | -0.007639 | 0.006555 | 0 | 2 |
