<a href="https://colab.research.google.com/github/rahiakela/natural-language-processing-research-and-practice/blob/main/getting-started-with-nlp/11-named-entity-recognition/02_named_entity_practical_applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Practical Applications of NER

Consider the following scenario: it is widely known that certain events influence stock price movements. Specifically, you can extract relevant facts from the news and then use these facts to predict company stock prices.

Suppose you have access to a large collection of news articles. Your task is to extract the relevant events and facts
that can be linked to the stock market in a downstream application (stock price prediction).

How will you do that?

One approach is to apply NER to the news texts as one of the preprocessing steps.
You can then focus only on the texts and sentences that are relevant to your task:
for instance, if you are interested in recent events involving a particular company
(e.g., “Apple”), you can easily identify such texts, sentences, and contexts.
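
The filtering step described above can be sketched roughly as follows. This is a minimal illustration rather than the notebook's own code: `sentences_mentioning` is a hypothetical helper name, and it assumes a spaCy `Doc` produced by a pipeline that sets both sentence boundaries and named entities (such as `en_core_web_md`).

```python
def sentences_mentioning(doc, name, label="ORG"):
    """Return the text of each sentence in `doc` that contains an entity
    whose text equals `name` and whose label equals `label`.

    Assumption: `doc` is a spaCy Doc with sentence boundaries and entities set.
    `sentences_mentioning` is a hypothetical helper, not part of spaCy.
    """
    return [
        sent.text
        for sent in doc.sents
        if any(ent.text == name and ent.label_ == label for ent in sent.ents)
    ]


# Example usage (requires spaCy and the en_core_web_md model):
# import spacy
# nlp = spacy.load("en_core_web_md")
# doc = nlp("Apple announces a new iPad Pro. An apple a day keeps a doctor away.")
# print(sentences_mentioning(doc, "Apple"))
```

Matching on both the entity text and its label is a deliberate choice here: it keeps sentences about the company and skips sentences where the same word is used in another sense.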

<img src='images/1.png' width='600'/>

##Setup

In [None]:
!pip install spacy

In [None]:
!python -m spacy download en_core_web_md 

After installation, restart the Colab runtime.

In [None]:
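import spacy
from spacy import displacy

import pandas as pd

# Load the medium English pipeline downloaded above; later cells call it as `nlp`
nlp = spacy.load("en_core_web_md")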

Let's download the dataset from Kaggle.

In [None]:
from google.colab import files
files.upload() # upload kaggle.json file

In [2]:
%%shell

mkdir -p ~/.kaggle
mv kaggle.json ~/.kaggle/
ls ~/.kaggle
chmod 600 /root/.kaggle/kaggle.json

# Download the dataset from Kaggle. URL: https://www.kaggle.com/datasets/snapcrack/all-the-news?select=articles1.csv
kaggle datasets download -d snapcrack/all-the-news
unzip -qq all-the-news.zip
rm -rf all-the-news.zip

kaggle.json
Downloading all-the-news.zip to /content
 94% 229M/244M [00:03<00:00, 55.5MB/s]
100% 244M/244M [00:04<00:00, 63.5MB/s]




##Data Loading and Exploration

We are going to use data that has already been extracted from a range of news portals: the
dataset called “All the news”, hosted on the Kaggle website. It consists of
143,000 articles scraped from 15 news websites, including The New York Times, CNN,
Business Insider, The Washington Post, etc.

In [4]:
news_df = pd.read_csv("articles1.csv")
news_df.head()

| | Unnamed: 0 | id | title | publication | author | date | year | month | url | content |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 17283 | House Republicans Fret About Winning Their Hea... | New York Times | Carl Hulse | 2016-12-31 | 2016.0 | 12.0 | | WASHINGTON — Congressional Republicans have... |
| 1 | 1 | 17284 | Rift Between Officers and Residents as Killing... | New York Times | Benjamin Mueller and Al Baker | 2017-06-19 | 2017.0 | 6.0 | | After the bullet shells get counted, the blood... |
| 2 | 2 | 17285 | Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ... | New York Times | Margalit Fox | 2017-01-06 | 2017.0 | 1.0 | | When Walt Disney’s “Bambi” opened in 1942, cri... |
| 3 | 3 | 17286 | Among Deaths in 2016, a Heavy Toll in Pop Musi... | New York Times | William McDonald | 2017-04-10 | 2017.0 | 4.0 | | Death may be the great equalizer, but it isn’t... |
| 4 | 4 | 17287 | Kim Jong-un Says North Korea Is Preparing to T... | New York Times | Choe Sang-Hun | 2017-01-02 | 2017.0 | 1.0 | | SEOUL, South Korea — North Korea’s leader, ... |


In [5]:
news_df.shape

(50000, 10)

Since the data from the 15 news sources is split across several .csv files, let’s find out which sources are covered in `articles1.csv`.

In [7]:
source = news_df["publication"].unique()
print(source)

['New York Times' 'Breitbart' 'CNN' 'Business Insider' 'Atlantic']
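
To see how the articles are distributed across these sources, one option is to count the rows per publication. The sketch below uses a tiny stand-in DataFrame so that it runs on its own; `count_by_publication` is a hypothetical helper name, and the sample rows are illustrative rather than taken from the dataset. With the real data you would pass `news_df` instead.

```python
import pandas as pd


def count_by_publication(df):
    """Count articles per publication, most frequent first."""
    return df["publication"].value_counts()


# Tiny stand-in for news_df (illustrative rows, not taken from the dataset)
sample_df = pd.DataFrame(
    {"publication": ["CNN", "Breitbart", "CNN", "Atlantic", "CNN"]}
)
counts = count_by_publication(sample_df)
print(counts)
```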


##Word Ambiguity

In [None]:
doc = nlp("An apple a day keeps a doctor away.")
displacy.render(doc, style="ent", jupyter=True)


In [None]:
doc = nlp("Apple announces a new iPad Pro.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("Turkey is the main dish served at Thanksgiving.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("Turkey is a country with amazing landscapes.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("The tiger is the largest living cat species.")
displacy.render(doc, style="ent", jupyter=True)


In [None]:
doc = nlp("Tiger Woods is an American professional golfer.")
displacy.render(doc, style="ent", jupyter=True)

Finally, ambiguity in NER poses a
challenge not only when the algorithm needs to decide whether a word or a phrase is
a named entity, but also when it needs to determine the type of that entity.

In [None]:
doc = nlp("Washington was born into slavery on the farm of James Burroughs.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("Washington went up 2 games to 1 in the four-game series.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("Blair arrived in Washington for what may well be his last state visit.")
displacy.render(doc, style="ent", jupyter=True)

In [None]:
doc = nlp("In June, Washington passed a primary seatbelt law.")
displacy.render(doc, style="ent", jupyter=True)

Note that in all these examples Washington is a named entity, but in each case, it is a
named entity of a different type, as is clear from the surrounding context.
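
Because the displaCy output is visual, it can help to compare the assigned labels programmatically as well. The sketch below is one possible way to do that, under the assumption that `nlp` is a loaded spaCy pipeline; `labels_in_context` is a hypothetical helper, not part of spaCy. It records which labels the pipeline assigns to a given surface form in each sentence. An empty list means the word was not recognized as an entity in that sentence, and the actual labels depend on the model and its version.

```python
def labels_in_context(nlp, sentences, surface):
    """Map each sentence to the entity labels the pipeline assigns to `surface`.

    Assumption: `nlp` is a callable returning an object with an `ents`
    attribute, as a spaCy pipeline does. Hypothetical helper, not spaCy API.
    """
    return {
        sentence: [ent.label_ for ent in nlp(sentence).ents if ent.text == surface]
        for sentence in sentences
    }


# Example usage (requires spaCy and the en_core_web_md model):
# import spacy
# nlp = spacy.load("en_core_web_md")
# sentences = [
#     "Washington was born into slavery on the farm of James Burroughs.",
#     "Blair arrived in Washington for what may well be his last state visit.",
# ]
# for sentence, labels in labels_in_context(nlp, sentences, "Washington").items():
#     print(labels, "<-", sentence)
```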