### Feature Engineering Tutorial: Expanding the Text Content Using Named Entities

In this notebook, I show how to expand the given text content with additional text, which is featurized afterwards. One content expansion method is to detect named entities in the text and extend the original text with information about each named entity from Wikipedia. <br>

I came up with this idea because I thought each author might repeat the same locations, full person names, and organizations. With named entities set for each author, I scraped <br>

There are several approaches to the Named Entity Classification task; here `MITIE` (an SVM approach) is used. MITIE provides state-of-the-art information extraction tools, along with tools for training custom extractors and relation detectors (https://github.com/mit-nlp/MITIE).

I'll work through the following tasks:

* Loading the train data into a DataFrame
* Extracting features from the train data
* Expanding the content by scraping relevant named entities
* Featurizing
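
As a rough illustration of the expansion step listed above, the sketch below appends a description to a sentence for every detected entity that has one. The `expand_text` helper and the `descriptions` dictionary are hypothetical: the dictionary is a hand-written stand-in for text scraped from Wikipedia, and this is not the scraping code used in the notebook.

```python
def expand_text(text, entities, descriptions):
    """Append the description of each known entity to the original text."""
    # dict.fromkeys drops duplicate entity names while keeping their order
    extra = [descriptions[name] for name in dict.fromkeys(entities)
             if name in descriptions]
    return " ".join([text] + extra)


# Hypothetical stand-in for scraped Wikipedia summaries
descriptions = {
    "London": "London is the capital city of England.",
}

sentence = "We travelled to London in the spring."
# "Raymond" has no description, so it contributes nothing
expanded = expand_text(sentence, ["London", "Raymond", "London"], descriptions)
print(expanded)
```

Entities without a description are skipped, so the expanded text never gets shorter than the original.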

In [77]:
import pandas as pd
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

In [13]:
train = pd.read_csv('./input/train.csv')
train.head()

Unnamed: 0,id,text,author
0,id26305,"This process, however, afforded me no means of...",EAP
1,id17569,It never once occurred to me that the fumbling...,HPL
2,id11008,"In his left hand was a gold snuff box, from wh...",EAP
3,id27763,How lovely is spring As we looked from Windsor...,MWS
4,id12958,"Finding nothing else, not even gold, the Super...",HPL


In [31]:
train_EAP = train[train['author']=='EAP'].text
train_HPL = train[train['author']=='HPL'].text
train_MWS = train[train['author']=='MWS'].text

In [39]:
import itertools 
tokenized_train_EAP = list(itertools.chain(*[word_tokenize(sentence) for sentence in train_EAP]))
tokenized_train_HPL = list(itertools.chain(*[word_tokenize(sentence) for sentence in train_HPL]))
tokenized_train_MWS = list(itertools.chain(*[word_tokenize(sentence) for sentence in train_MWS]))
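
The flattening step above turns a list of per-sentence token lists into one long token list per author. A minimal sketch of the same idea on toy data, using `itertools.chain.from_iterable`, which avoids unpacking the whole list with `*`; `str.split` is only a stand-in for `word_tokenize` so that the example runs without NLTK.

```python
from itertools import chain

sentences = ["It was a dark night", "The wind howled"]

# str.split stands in for nltk's word_tokenize in this sketch
tokens = list(chain.from_iterable(s.split() for s in sentences))
print(tokens)
```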

In [84]:
from string import punctuation
# A set makes each membership test below O(1) instead of a scan over the list
cachedStopWords = set(stopwords.words("english")) | set(punctuation)
tokenized_train_EAP = [x for x in tokenized_train_EAP if x not in cachedStopWords]
tokenized_train_HPL = [x for x in tokenized_train_HPL if x not in cachedStopWords]
tokenized_train_MWS = [x for x in tokenized_train_MWS if x not in cachedStopWords]
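
One detail of the filter above is worth seeing on toy data: NLTK's English stop word list is lowercase, so an exact-match filter keeps capitalized tokens such as "The". The sketch below uses a tiny hypothetical stop word set in place of the NLTK list and compares exact matching with a lowercased comparison. Note also that dropping stop words and punctuation makes previously separated words adjacent, which may affect the entity spans detected later.

```python
from string import punctuation

# Tiny hypothetical stop word set standing in for stopwords.words("english")
stop_words = {"the", "a", "of", "in"}
drop = stop_words | set(punctuation)

tokens = ["The", "house", "of", "Usher", ",", "in", "ruins", "."]

# Exact match, as in the notebook: "The" survives because of its capital letter
case_sensitive = [t for t in tokens if t not in drop]
# Lowercased comparison: "The" is removed as well
case_insensitive = [t for t in tokens if t.lower() not in drop]

print(case_sensitive)
print(case_insensitive)
```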

In [3]:
from mitie import *
from collections import defaultdict

In [10]:
print("loading NER model .. ")
ner = named_entity_extractor('../MITIE-models/english/ner_model.dat')
print("Tags output by this NER model: {} ".format(ner.get_possible_ner_tags()))

loading NER model .. 
Tags output by this NER model: ['PERSON', 'LOCATION', 'ORGANIZATION', 'MISC'] 


In [85]:
# Print only the first tokens; the full list is far too long to display
print("Tokenized input example: {}".format(tokenized_train_EAP[:20]))



In [86]:
entities = []
for tokens_by_author in [tokenized_train_EAP, tokenized_train_HPL, tokenized_train_MWS]:
    # Each entity is unpacked below as a (range, tag, score) tuple over the token list
    entities_by_author = ner.extract_entities(tokens_by_author)
    print("\nNumber of entities detected: {}".format(len(entities_by_author)))
    # Always append so that entities[0], [1], [2] stay aligned with EAP, HPL, MWS
    entities.append(entities_by_author)


Number of entities detected: 2504

Number of entities detected: 2951

Number of entities detected: 2159


In [145]:
df_ner_EAP = pd.DataFrame.from_records(entities[0], columns=["Range", "Tag", "Score"])
# Join the tokens covered by each entity's index range into one string
df_ner_EAP['Token'] = [" ".join(tokenized_train_EAP[j] for j in rng) for rng, tag, score in entities[0]]
df_ner_EAP.sort_values(by='Score', inplace=True, ascending=False)
df_ner_EAP.head(10)

Unnamed: 0,Range,Tag,Score,Token
1878,"(83439, 83440, 83441, 83442, 83443, 83444, 834...",ORGANIZATION,3.328084,Philadelphia Regular Exchange Tea Total Young ...
1392,"(61852, 61853, 61854, 61855, 61856, 61857, 618...",ORGANIZATION,3.28982,Philadelphia Regular Exchange Tea Total Young ...
2465,(107844),LOCATION,1.767325,Charlestown
2283,"(99512, 99513)",PERSON,1.652745,John Smith
897,"(41000, 41001)",PERSON,1.59443,David Brewster
999,"(45649, 45650)",PERSON,1.59398,John Smith
24,"(918, 919)",PERSON,1.572509,John Smith
2078,"(92570, 92571)",PERSON,1.558129,John Neal
2112,(94084),LOCATION,1.554519,Charlottesville
308,"(15117, 15118, 15119, 15120, 15121)",ORGANIZATION,1.542165,Bogs Hogs Logs Frogs Company
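
To make the table above easier to read: each detected entity is handled as a (range, tag, score) tuple, where the range indexes into the author's token list. The sketch below reproduces the token lookup and the sort by score on mock data, without MITIE or pandas. The tokens, scores, and the `entities_to_rows` helper are made up for illustration.

```python
tokens = ["I", "met", "John", "Smith", "near", "Charlestown", "yesterday"]

# Mock entities in the (range, tag, score) shape used in the notebook
entities = [
    (range(2, 4), "PERSON", 1.65),
    (range(5, 6), "LOCATION", 1.77),
]


def entities_to_rows(tokens, entities):
    """Resolve each index range to its text and sort by descending score."""
    rows = [
        {"Token": " ".join(tokens[i] for i in span), "Tag": tag, "Score": score}
        for span, tag, score in entities
    ]
    return sorted(rows, key=lambda row: row["Score"], reverse=True)


rows = entities_to_rows(tokens, entities)
for row in rows:
    print(row)
```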


In [148]:
import numpy as np
import plotly.plotly as py  # plotly < 4 only; in plotly 4+ this module moved to chart_studio.plotly
import plotly.graph_objs as go
from plotly.offline import iplot, init_notebook_mode
init_notebook_mode(connected=True)

In [146]:
df_ner_HPL = pd.DataFrame.from_records(entities[1], columns=["Range", "Tag", "Score"])
# Join the tokens covered by each entity's index range into one string
df_ner_HPL['Token'] = [" ".join(tokenized_train_HPL[j] for j in rng) for rng, tag, score in entities[1]]
df_ner_HPL.sort_values(by='Score', inplace=True, ascending=False)
df_ner_HPL.head(10)

Unnamed: 0,Range,Tag,Score,Token
2025,"(60805, 60806)",PERSON,1.760086,Joel Manton
1459,"(43340, 43341, 43342, 43343, 43344, 43345, 433...",ORGANIZATION,1.745268,Providence Gazette Country Journal April Daily...
1030,"(30047, 30048, 30049, 30050, 30051, 30052, 300...",ORGANIZATION,1.720886,Bakery squalid Rifkin School Modern Economics ...
2274,"(68274, 68275)",PERSON,1.695395,John Hawkins
470,"(13732, 13733)",PERSON,1.646287,Earl Sawyer
1934,"(57725, 57726)",PERSON,1.636166,Angell Dombrowski
2072,(62302),PERSON,1.617597,Gilman
420,"(12395, 12396, 12397, 12398, 12399, 12400, 124...",ORGANIZATION,1.587331,Bakery Rifkin School Modern Economics Circle S...
724,"(21692, 21693)",PERSON,1.581991,Allan Halsey
1795,"(54054, 54055)",PERSON,1.581931,John Hawkins


In [147]:
df_ner_MWS = pd.DataFrame.from_records(entities[2], columns=["Range", "Tag", "Score"])
# Join the tokens covered by each entity's index range into one string
df_ner_MWS['Token'] = [" ".join(tokenized_train_MWS[j] for j in rng) for rng, tag, score in entities[2]]
df_ner_MWS.sort_values(by='Score', inplace=True, ascending=False)
df_ner_MWS.head(10)

Unnamed: 0,Range,Tag,Score,Token
1066,(45341),LOCATION,1.775354,Austria
1672,"(69143, 69144)",PERSON,1.64901,M. Waldman
1508,(62866),LOCATION,1.564823,England
1082,"(45861, 45862)",PERSON,1.535843,Lionel Verney
342,(14448),PERSON,1.477535,Raymond
366,"(15306, 15307)",PERSON,1.468387,M. Waldman
1020,"(43785, 43786)",PERSON,1.455855,M. Waldman
679,(30125),MISC,1.449786,English
793,(34736),LOCATION,1.447024,London
349,"(14612, 14613)",PERSON,1.446623,M. Waldman
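
The tables above already hint at the repetition that motivated this approach: the same names show up several times for one author. A small sketch of how that repetition could be counted with `collections.Counter`, using only the ten (token, tag) pairs from the MWS table above rather than the full entity list.

```python
from collections import Counter

# (token, tag) pairs copied from the top-10 MWS table above
top_mws = [
    ("Austria", "LOCATION"), ("M. Waldman", "PERSON"), ("England", "LOCATION"),
    ("Lionel Verney", "PERSON"), ("Raymond", "PERSON"), ("M. Waldman", "PERSON"),
    ("M. Waldman", "PERSON"), ("English", "MISC"), ("London", "LOCATION"),
    ("M. Waldman", "PERSON"),
]

counts = Counter(top_mws)
# Keep only the entities that occur more than once
repeated = [(entity, n) for entity, n in counts.most_common() if n > 1]
print(repeated)
```

Applied to the full `Token` and `Tag` columns of each author's DataFrame, the same count would give one candidate list of entities to look up per author.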
