<a href="https://colab.research.google.com/github/rskrisel/NER_workshop/blob/main/NER_workshop.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Named Entity Recognition

In this workshop, we are going to learn how to transform large amounts of text into a database using Named Entity Recognition (NER). NER can computationally identify people, places, laws, events, dates, and other elements in a text or collection of texts.

## What is Named Entity Recognition?
*Explanation borrowed from Melanie Walsh's [Introduction to Cultural Analytics & Python](https://melaniewalsh.github.io/Intro-Cultural-Analytics/05-Text-Analysis/12-Named-Entity-Recognition.html)*
<br>
<br>
Named Entity Recognition is a fundamental task in the field of natural language processing (NLP). NLP is an interdisciplinary field that blends linguistics, statistics, and computer science. The heart of NLP is to understand human language with statistics and computers. Applications of NLP are all around us. Have you ever heard of a little thing called spellcheck? How about autocomplete, Google Translate, chatbots, or Siri? These are all examples of NLP in action!

Thanks to recent advances in machine learning and to increasing amounts of available text data on the web, NLP has grown by leaps and bounds in the last decade. NLP models that generate texts and images are now getting eerily good.

Open-source NLP tools are getting very good, too. We’re going to use one of these open-source tools, the Python library spaCy, for our Named Entity Recognition tasks in this lesson.

## What is spaCy?
In this workshop, we are using the spaCy library to run the NER. spaCy relies on machine learning models that were trained on a large amount of carefully labeled texts. These texts were, in fact, often labeled and corrected by hand. The English-language spaCy model that we’re going to use in this lesson was trained on an annotated corpus called “OntoNotes”: 2 million+ words drawn from “news, broadcast, talk shows, weblogs, usenet newsgroups, and conversational telephone speech,” which were meticulously tagged by a group of researchers and professionals for people’s names and places, for nouns and verbs, for subjects and objects, and much more. Like a lot of other major machine learning projects, OntoNotes was also sponsored by the Defense Advanced Research Projects Agency (DARPA), the branch of the Defense Department that develops technology for the U.S. military.

When spaCy identifies people and places in a text or collection of texts, the NLP model is actually making predictions about the text based on what it has learned about how people and places function in English-language sentences.

### spaCy Named Entities
Below is a Named Entities chart for English-language spaCy taken from [its website](https://spacy.io/api/annotation#named-entities). This chart shows the different named entities that spaCy can identify as well as their corresponding type labels.

|Type Label|Description|
|:---:|:---:|
|PERSON|People, including fictional.|
|NORP|Nationalities or religious or political groups.|
|FAC|Buildings, airports, highways, bridges, etc.|
|ORG|Companies, agencies, institutions, etc.|
|GPE|Countries, cities, states.|
|LOC|Non-GPE locations, mountain ranges, bodies of water.|
|PRODUCT|Objects, vehicles, foods, etc. (Not services.)|
|EVENT|Named hurricanes, battles, wars, sports events, etc.|
|WORK_OF_ART|Titles of books, songs, etc.|
|LAW|Named documents made into laws.|
|LANGUAGE|Any named language.|
|DATE|Absolute or relative dates or periods.|
|TIME|Times smaller than a day.|
|PERCENT|Percentage, including “%”.|
|MONEY|Monetary values, including unit.|
|QUANTITY|Measurements, as of weight or distance.|
|ORDINAL|“first”, “second”, etc.|
|CARDINAL|Numerals that do not fall under another type.|


### Install spaCy:

In [64]:
# !pip install -U spacy

### Download the spaCy Language Model
Next we need to download the English-language model (en_core_web_sm), which will be processing and making predictions about our texts. This is the model that was trained on the annotated “OntoNotes” corpus. You can download the en_core_web_sm model by running the cell below:

In [65]:
# !python -m spacy download en_core_web_sm

*Note: spaCy offers models for other languages including Chinese, German, French, Spanish, Portuguese, Russian, Italian, Dutch, Greek, Norwegian, and Lithuanian.*

*spaCy offers language and tokenization support for other languages via external dependencies, such as PyviKonlpy for Korean.*

## Import all relevant libraries for collecting data and processing the NER

We will import:
- spaCy and displaCy to run the NER and visualize our results
- en_core_web_sm to load the spaCy language model
- Counter (from the standard library's collections module) to count items
- pandas for organizing and displaying data
- Requests to get data from an API and also to web scrape
- pprint to make our JSON results readable
- Beautiful Soup to make our HTML results readable


In [66]:
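import spacy
from spacy import displacy
import en_core_web_sm
from collections import Counter
import pandas as pd
import requests  # used below for the API call and for web scraping
import pprint  # used below to print the JSON response readably
from bs4 import BeautifulSoup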

## Load the spaCy language model

In [67]:
nlp = en_core_web_sm.load()

The `en_core_web_sm` model is a small, general-purpose English model that includes parts of speech, dependencies, and named entities.

### Comparison of spaCy's Small, Medium, and Large Models

spaCy offers English models in small, medium, and large sizes (`en_core_web_sm`, `en_core_web_md`, and `en_core_web_lg`). These models vary in size, accuracy, and features. Here’s a breakdown of their differences:

| Aspect             | Small (`en_core_web_sm`)                    | Medium (`en_core_web_md`)                  | Large (`en_core_web_lg`)                   |
|--------------------|---------------------------------------------|--------------------------------------------|--------------------------------------------|
| **Size & Speed**   | Smallest and fastest. Low memory usage, suitable for quick processing. | Balanced size and speed. Slower than small, but more accurate. | Largest and slowest. Requires high memory, best for nuanced analysis. |
| **Word Vectors**   | Limited or no word vectors. Basic similarity tasks only. | Includes more extensive word vectors. Better for similarity comparisons. | Most extensive word vectors. Best for capturing semantic relationships. |
| **Accuracy**       | Basic accuracy for part-of-speech tagging, dependencies, and named entity recognition. | Improved accuracy in named entity recognition and dependency parsing. | Highest accuracy across all tasks, especially beneficial for deep NLP applications. |
| **Use Case**       | Prototyping, applications needing speed, or lightweight NLP tasks. | Most general NLP applications needing a balance of accuracy, memory, and speed. | High-stakes applications where accuracy is critical and resources are ample. |

### Recommendations
- **Small**: Ideal for prototyping or applications requiring speed over accuracy.
- **Medium**: A good balance for most NLP tasks, providing reasonable accuracy without high memory demands.
- **Large**: Best for applications that prioritize accuracy and can handle the memory and processing requirements.




## Collect your Data: Combining APIs and Web Scraping

In this workshop, we are going to collect data from news articles in two steps. First, we connect to the NewsAPI and gather a collection of URLs related to a specific news topic. Next, we web scrape those URLs to save the articles as text files. For detailed instructions on working with the NewsAPI, please refer to this ["Working with APIs" tutorial](https://gist.github.com/rskrisel/4ff9629df9f9d6bf5a638b8ba6c13a68), and for detailed instructions on how to web scrape a list of URLs, please refer to the ["Web Scraping Media URLs in Python"](https://github.com/rskrisel/web_scraping_workshop) tutorial.

### Install the News API

In [68]:
# !pip install newsapi-python

### Store your secret key

In [69]:
secret = 'YOUR_API_KEY'  # paste your own NewsAPI key here; avoid sharing notebooks that contain a real key
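
A key typed directly into a cell is easy to leak when the notebook is shared. Here is a minimal sketch of one alternative, assuming the key has been stored in an environment variable beforehand. The variable name `NEWSAPI_KEY` and the helper `load_api_key` are hypothetical choices for this example, not part of the workshop or of NewsAPI.

```python
import os


def load_api_key(env_var="NEWSAPI_KEY"):
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Example use in the notebook: secret = load_api_key()
```

Raising an error when the variable is missing means the problem shows up here rather than later as a failed API request.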

### Define your endpoint

In [70]:
url = 'https://newsapi.org/v2/everything?'

### Define your query parameters

In [71]:
parameters = {
    'q': 'drought',
    'searchIn':'title',
    'pageSize': 20,
    'language' : 'en',
    'apiKey': secret
    }
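
To see roughly what happens to this dictionary when it is sent, the sketch below builds a query string by hand with the standard library's `urllib.parse.urlencode`. It is only an illustration: `build_query_url` is a hypothetical helper, and the `apiKey` value is a placeholder.

```python
from urllib.parse import urlencode


def build_query_url(endpoint, params):
    """Join an endpoint and a dict of query parameters into one URL."""
    base = endpoint.rstrip("?")
    return base + "?" + urlencode(params)


example_params = {
    "q": "drought",
    "searchIn": "title",
    "pageSize": 20,
    "language": "en",
    "apiKey": "YOUR_API_KEY",  # placeholder, not a real key
}
print(build_query_url("https://newsapi.org/v2/everything?", example_params))
```

Each key and value becomes a `key=value` pair joined by `&`, which is why the key ends up visible in the full URL and should be treated as a secret.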

### Make your data request

In [72]:
response = requests.get(url, params=parameters)

### Visualize your JSON results

In [73]:
response_json = response.json()
pprint.pprint(response_json)

{'articles': [{'author': None,
               'content': 'The area of land surface affected by drought has '
                          'trebled since the 1980s, a new report into the '
                          'effects of climate change has revealed.\r\n'
                          'Forty-eight per cent of the Earths land surface had '
                          'at least o… [+5752 chars]',
               'description': 'Forty-eight percent of the world went through '
                              'at least one month of extreme drought in 2023.',
               'publishedAt': '2024-10-30T00:15:58Z',
               'source': {'id': None, 'name': 'BBC News'},
               'title': 'Drought areas have trebled in size since 1980s, study '
                        'finds',
               'url': 'https://www.bbc.com/news/articles/clyvje458rvo',
               'urlToImage': 'https://ichef.bbci.co.uk/news/1024/branded_news/0768/live/40f1a250-9612-11ef-a9c1-65d004933863.jpg'},
              {'a

### Check what keys exist in your JSON data

In [74]:
response_json.keys()

dict_keys(['status', 'totalResults', 'articles'])
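
Since the response carries a `status` key, it can be worth checking it before working with `articles`. The helper below is a hedged sketch: `extract_articles` is a hypothetical name, and it assumes that a failed request comes back with a `status` other than `'ok'` and, possibly, a `message` field describing the problem.

```python
def extract_articles(payload):
    """Return the list of articles from a parsed response dictionary.

    Raises ValueError when the 'status' field is anything other than 'ok'.
    """
    if payload.get("status") != "ok":
        # 'message' is assumed to be present on error responses
        detail = payload.get("message", "no message provided")
        raise ValueError(f"Request failed: {detail}")
    return payload.get("articles", [])


# A made-up response with the same three keys as the real one
sample = {"status": "ok", "totalResults": 1, "articles": [{"title": "Example"}]}
print(len(extract_articles(sample)))
```

In the notebook this would be called as `extract_articles(response_json)`.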

### See the data stored in each key

In [75]:
print(response_json['status'])
print(response_json['totalResults'])
print(response_json['articles'])

ok
187


### Check the datatype for each key

In [76]:
print(type(response_json['status']))
print(type(response_json['totalResults']))
print(type(response_json['articles']))

<class 'str'>
<class 'int'>
<class 'list'>


### Check that each item in the list is a dictionary

In [77]:
type(response_json['articles'][0])

dict
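
Each article is a dictionary, and its `source` value is itself a dictionary with `id` and `name` keys. If a flat table is easier to work with, the nested value can be unpacked first. This is a small sketch; `flatten_article` and the column names `source_id` and `source_name` are hypothetical.

```python
def flatten_article(article):
    """Copy an article dict, replacing the nested 'source' dict with flat keys."""
    flat = {key: value for key, value in article.items() if key != "source"}
    source = article.get("source") or {}
    flat["source_id"] = source.get("id")
    flat["source_name"] = source.get("name")
    return flat


# Values copied from the first article in the response shown earlier
sample_article = {
    "source": {"id": None, "name": "BBC News"},
    "title": "Drought areas have trebled in size since 1980s, study finds",
}
print(flatten_article(sample_article)["source_name"])
```

A list such as `[flatten_article(a) for a in response_json['articles']]` can then be passed to `pd.DataFrame` in the same way as the original list.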

### Convert the JSON key into a Pandas Dataframe

In [78]:
df_articles = pd.DataFrame(response_json['articles'])
df_articles

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content
0,"{'id': None, 'name': 'BBC News'}",,"Drought areas have trebled in size since 1980s, study finds",Forty-eight percent of the world went through at least one month of extreme drought in 2023.,https://www.bbc.com/news/articles/clyvje458rvo,https://ichef.bbci.co.uk/news/1024/branded_news/0768/live/40f1a250-9612-11ef-a9c1-65d004933863.jpg,2024-10-30T00:15:58Z,"The area of land surface affected by drought has trebled since the 1980s, a new report into the effects of climate change has revealed.\r\nForty-eight per cent of the Earths land surface had at least o… [+5752 chars]"
1,"{'id': None, 'name': 'Science Daily'}",,Combining satellite methods provides drought detection from space,"Observing sites like the Amazon basin from space has underscored the capability of satellites to better detect signs of drought, according to a new study. The researchers combined Global Positioning System (GPS) and the Gravity Recovery and Climate Experiment…",https://www.sciencedaily.com/releases/2024/10/241021123324.htm,https://www.sciencedaily.com/images/scidaily-icon.png,2024-10-21T16:33:24Z,"Observing sites like the Amazon basin from space has underscored the capability of satellites to better detect signs of drought, according to a new study.Led by Military University of Technology Pola… [+3670 chars]"
2,"{'id': None, 'name': 'Jalopnik'}",Andy Kalmowitz,206-Year-Old Bridge Uncovered In Pennsylvania Because Of Drought,"If you live in the northeastern-ish United States, you’ve probably realized it’s barely rained in a very long time. Well, you’re not imagining things because it has been so dry in western Pennsylvania that a 200-year-old bridge under the Youghiogheny River La…",https://jalopnik.com/206-year-old-bridge-uncovered-in-pennsylvania-because-o-1851687377,"https://i.kinja-img.com/image/upload/c_fill,h_675,pg_1,q_80,w_1200/a15298fcf961384d340d6102727bbe3b.jpg",2024-11-01T16:30:00Z,"If you live in the northeastern-ish United States, youve probably realized its barely rained in a very long time. Well, youre not imagining things because it has been so dry in western Pennsylvania t… [+3047 chars]"
3,"{'id': None, 'name': 'Phys.Org'}",Science X,Eight dead as heavy rain thrashes Brazil after long drought,"At least eight people died after heavy rains in Brazil, authorities said Saturday, as storms swept parts of the country following a severe drought that fueled a record wave of wildfires.",https://phys.org/news/2024-10-dead-heavy-thrashes-brazil-drought.html,https://scx2.b-cdn.net/gfx/news/2024/brazil-has-in-recent-m.jpg,2024-10-13T13:36:16Z,"At least eight people died after heavy rains in Brazil, authorities said Saturday, as storms swept parts of the country following a severe drought that fueled a record wave of wildfires.\r\nCentral and… [+1759 chars]"
4,"{'id': None, 'name': 'Phys.Org'}",Vassilis KYRIAKOULIS,Cracked earth in Greece's saffron heartland as drought takes toll,"At a field outside Kozani, northern Greece, the strikingly blue-and-purple petals of saffron give off an intoxicating scent that underscores the value of one of the country's most lucrative crops.",https://phys.org/news/2024-11-earth-greece-saffron-heartland-drought.html,https://scx2.b-cdn.net/gfx/news/2024/greek-saffron-yields-a.jpg,2024-11-08T09:40:01Z,"At a field outside Kozani, northern Greece, the strikingly blue-and-purple petals of saffron give off an intoxicating scent that underscores the value of one of the country's most lucrative crops.\r\nB… [+3862 chars]"
5,"{'id': None, 'name': 'Yahoo Entertainment'}",Reuters,Panama Canal's net income rose to $3.45 billion in fiscal year despite drought,The Panama Canal's profit increased 9.5% in the fiscal year ended in September to $3.45 billion despite a severe drought that reduced the number of vessels...,https://finance.yahoo.com/news/panama-canals-net-income-rose-191009003.html,https://media.zenfs.com/en/reuters-finance.com/b109388d8ed1f10d0ef85961e4c68995,2024-10-25T19:10:09Z,PANAMA CITY (Reuters) - The Panama Canal's profit increased 9.5% in the fiscal year ended in September to $3.45 billion despite a severe drought that reduced the number of vessels that passed through… [+297 chars]
6,"{'id': 'al-jazeera-english', 'name': 'Al Jazeera English'}",Al Jazeera,Worst drought in century devastates Southern Africa with millions at risk,"Over 27 million lives affected by worst drought in a century, with 21 million children malnourished, says WFP.",https://www.aljazeera.com/news/2024/10/15/worst-drought-in-century-devastates-southern-africa-with-millions-at-risk,https://www.aljazeera.com/wp-content/uploads/2024/08/2024-07-03T082525Z_666080128_RC2ZM8AADCCG_RTRMADP_3_ZIMBABWE-EL-NINO-DROUGHT-1723902046.jpg?resize=1920%2C1440,2024-10-15T17:27:42Z,"Millions of people across Southern Africa are going hungry due to a historic drought, risking a full-scale humanitarian catastrophe, the United Nations has warned.\r\nLesotho, Malawi, Namibia, Zambia, … [+2224 chars]"
7,"{'id': 'al-jazeera-english', 'name': 'Al Jazeera English'}",Al Jazeera,"More than 420,000 children affected by record drought in Amazon: UN",UNICEF chief urges leaders at the upcoming COP29 summit in Azerbaijan to increase climate financing for children.,https://www.aljazeera.com/news/2024/11/7/more-than-420000-children-affected-by-record-drought-in-amazon-un,https://www.aljazeera.com/wp-content/uploads/2024/11/2024-09-19T122603Z_1160265155_RC293AA62LDV_RTRMADP_3_BRAZIL-ENVIRONMENT-AMAZON-1730959636.jpg?resize=1920%2C1440,2024-11-07T06:51:06Z,"More than 420,000 children in the Amazon basin have been affected by dangerous levels of water scarcity and drought in three countries, according to the United Nations.\r\nThe record-breaking drought, … [+2857 chars]"
8,"{'id': None, 'name': 'Phys.Org'}",Matt Shipman,Q&A: A faster way to identify drought-resistant plants for crop breeding research,Climate change is making droughts more common and more severe—which makes research into developing drought-resistant crops more important than ever. Now researchers have developed a new framework that should expedite this important research.,https://phys.org/news/2024-10-qa-faster-drought-resistant-crop.html,https://scx2.b-cdn.net/gfx/news/hires/2024/qa-researcher-discusse-4.jpg,2024-10-18T16:44:04Z,Climate change is making droughts more common and more severewhich makes research into developing drought-resistant crops more important than ever. Now researchers have developed a new framework that… [+4886 chars]
9,"{'id': None, 'name': 'Phys.Org'}",Liu Jia,"Compound drought–heat wave events under-recognized in global soils, finds study","Soil is essential for life and plays a crucial role in the Earth's ecosystem, providing support for plant roots and hosting countless microorganisms. In a warming world, it is important to understand how soil hydrothermal conditions, particularly the dry-hot …",https://phys.org/news/2024-10-compound-droughtheat-events-global-soils.html,https://scx2.b-cdn.net/gfx/news/2024/compound-droughtheat-w.jpg,2024-10-14T15:51:03Z,"Soil is essential for life and plays a crucial role in the Earth's ecosystem, providing support for plant roots and hosting countless microorganisms. In a warming world, it is important to understand… [+3770 chars]"


### Define a function to web scrape text from the list of URLs in the Dataframe

In [79]:
def scrape_article(url):
    response = requests.get(url)
    response.encoding = 'utf-8'
    html_string = response.text
    return html_string
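
Note that `requests.get` does not raise an error when a site answers with a status such as 400 or 403; the error page is simply returned as text. The variant below is a sketch of one way to guard against that. The name `scrape_article_safely`, the timeout, and the User-Agent string are assumptions for this example, and a custom User-Agent may or may not be enough for a site that blocks automated requests.

```python
import requests


def scrape_article_safely(url, timeout=10):
    """Return the page HTML, or an empty string if the request fails."""
    # The User-Agent value is an arbitrary example, not a required string
    headers = {"User-Agent": "Mozilla/5.0 (compatible; ner-workshop-example)"}
    try:
        response = requests.get(url, headers=headers, timeout=timeout)
        # Turn 4xx/5xx responses into exceptions instead of returning error pages
        response.raise_for_status()
    except requests.RequestException:
        return ""
    response.encoding = "utf-8"
    return response.text
```

Returning an empty string keeps the column a column of strings, so the later cleaning step still runs on every row.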

### Apply the function to the Dataframe and store the results in a new column

In [80]:
df_articles['scraped_text'] = df_articles['url'].apply(scrape_article)

In [81]:
df_articles

Unnamed: 0,source,author,title,description,url,urlToImage,publishedAt,content,scraped_text
0,"{'id': None, 'name': 'BBC News'}",,"Drought areas have trebled in size since 1980s, study finds",Forty-eight percent of the world went through at least one month of extreme drought in 2023.,https://www.bbc.com/news/articles/clyvje458rvo,https://ichef.bbci.co.uk/news/1024/branded_news/0768/live/40f1a250-9612-11ef-a9c1-65d004933863.jpg,2024-10-30T00:15:58Z,"The area of land surface affected by drought has trebled since the 1980s, a new report into the effects of climate change has revealed.\r\nForty-eight per cent of the Earths land surface had at least o… [+5752 chars]","<!DOCTYPE html><html lang=""en-GB""><head><meta charSet=""utf-8""/><meta name=""viewport"" content=""width=device-width""/><title>Three times more land in drought than in 1980s, study finds</title><meta property=""og:title"" content=""Three times more land in drought than in 1980s, study finds""/><meta name=""twitter:title"" content=""Three times more land in drought than in 1980s, study finds""/><meta name=""..."
1,"{'id': None, 'name': 'Science Daily'}",,Combining satellite methods provides drought detection from space,"Observing sites like the Amazon basin from space has underscored the capability of satellites to better detect signs of drought, according to a new study. The researchers combined Global Positioning System (GPS) and the Gravity Recovery and Climate Experiment…",https://www.sciencedaily.com/releases/2024/10/241021123324.htm,https://www.sciencedaily.com/images/scidaily-icon.png,2024-10-21T16:33:24Z,"Observing sites like the Amazon basin from space has underscored the capability of satellites to better detect signs of drought, according to a new study.Led by Military University of Technology Pola… [+3670 chars]",
2,"{'id': None, 'name': 'Jalopnik'}",Andy Kalmowitz,206-Year-Old Bridge Uncovered In Pennsylvania Because Of Drought,"If you live in the northeastern-ish United States, you’ve probably realized it’s barely rained in a very long time. Well, you’re not imagining things because it has been so dry in western Pennsylvania that a 200-year-old bridge under the Youghiogheny River La…",https://jalopnik.com/206-year-old-bridge-uncovered-in-pennsylvania-because-o-1851687377,"https://i.kinja-img.com/image/upload/c_fill,h_675,pg_1,q_80,w_1200/a15298fcf961384d340d6102727bbe3b.jpg",2024-11-01T16:30:00Z,"If you live in the northeastern-ish United States, youve probably realized its barely rained in a very long time. Well, youre not imagining things because it has been so dry in western Pennsylvania t… [+3047 chars]","<!DOCTYPE html><html lang=""en"" style=""scroll-behavior:smooth"" data-reactroot=""""><head><meta name=""google-site-verification"" content=""TKMuVW6pEGpNnTDe1eH7tb7YWn4jSmYz1DaSnitFNyA""/><meta name=""google-site-verification"" content=""QDPLbDJXTQNT0n69mvNADCeRmwnbkYyL20OKJAVCKq8""/><meta name=""ir-site-verification-token"" content=""-1270174611""/><meta name=""viewport"" content=""width=device-width, initial-sc..."
3,"{'id': None, 'name': 'Phys.Org'}",Science X,Eight dead as heavy rain thrashes Brazil after long drought,"At least eight people died after heavy rains in Brazil, authorities said Saturday, as storms swept parts of the country following a severe drought that fueled a record wave of wildfires.",https://phys.org/news/2024-10-dead-heavy-thrashes-brazil-drought.html,https://scx2.b-cdn.net/gfx/news/2024/brazil-has-in-recent-m.jpg,2024-10-13T13:36:16Z,"At least eight people died after heavy rains in Brazil, authorities said Saturday, as storms swept parts of the country following a severe drought that fueled a record wave of wildfires.\r\nCentral and… [+1759 chars]","<!DOCTYPE html>\r\n<html lang=""en"">\r\n<head>\r\n <meta charset=""UTF-8"">\r\n <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">\r\n <title>400 Bad Request</title>\r\n <style>\r\n body {\r\n font-family: Arial, sans-serif;\r\n text-align: center;\r\n margin: 0;\r\n padding: 50px;\r\n background-color: ..."
4,"{'id': None, 'name': 'Phys.Org'}",Vassilis KYRIAKOULIS,Cracked earth in Greece's saffron heartland as drought takes toll,"At a field outside Kozani, northern Greece, the strikingly blue-and-purple petals of saffron give off an intoxicating scent that underscores the value of one of the country's most lucrative crops.",https://phys.org/news/2024-11-earth-greece-saffron-heartland-drought.html,https://scx2.b-cdn.net/gfx/news/2024/greek-saffron-yields-a.jpg,2024-11-08T09:40:01Z,"At a field outside Kozani, northern Greece, the strikingly blue-and-purple petals of saffron give off an intoxicating scent that underscores the value of one of the country's most lucrative crops.\r\nB… [+3862 chars]","<!DOCTYPE html>\r\n<html lang=""en"">\r\n<head>\r\n <meta charset=""UTF-8"">\r\n <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">\r\n <title>400 Bad Request</title>\r\n <style>\r\n body {\r\n font-family: Arial, sans-serif;\r\n text-align: center;\r\n margin: 0;\r\n padding: 50px;\r\n background-color: ..."
5,"{'id': None, 'name': 'Yahoo Entertainment'}",Reuters,Panama Canal's net income rose to $3.45 billion in fiscal year despite drought,The Panama Canal's profit increased 9.5% in the fiscal year ended in September to $3.45 billion despite a severe drought that reduced the number of vessels...,https://finance.yahoo.com/news/panama-canals-net-income-rose-191009003.html,https://media.zenfs.com/en/reuters-finance.com/b109388d8ed1f10d0ef85961e4c68995,2024-10-25T19:10:09Z,PANAMA CITY (Reuters) - The Panama Canal's profit increased 9.5% in the fiscal year ended in September to $3.45 billion despite a severe drought that reduced the number of vessels that passed through… [+297 chars],"<!doctype html>\n<html lang=""en-US"" theme=""light"" data-color-scheme=""light"" class=""desktop neo-green dock-upscale failsafe"">\n <head>\n <meta charset=""utf-8"">\n <meta name=""oath:guce:consent-host"" content=""guce.yahoo.com"">\n <link rel=""preconnect"" href=""//s.yimg.com"" crossorigin=""anonymous""><link rel=""preconnect"" href=""//geo.yahoo.com""><link rel=""preconnect"" href=""//que..."
6,"{'id': 'al-jazeera-english', 'name': 'Al Jazeera English'}",Al Jazeera,Worst drought in century devastates Southern Africa with millions at risk,"Over 27 million lives affected by worst drought in a century, with 21 million children malnourished, says WFP.",https://www.aljazeera.com/news/2024/10/15/worst-drought-in-century-devastates-southern-africa-with-millions-at-risk,https://www.aljazeera.com/wp-content/uploads/2024/08/2024-07-03T082525Z_666080128_RC2ZM8AADCCG_RTRMADP_3_ZIMBABWE-EL-NINO-DROUGHT-1723902046.jpg?resize=1920%2C1440,2024-10-15T17:27:42Z,"Millions of people across Southern Africa are going hungry due to a historic drought, risking a full-scale humanitarian catastrophe, the United Nations has warned.\r\nLesotho, Malawi, Namibia, Zambia, … [+2224 chars]","<!doctype html><html lang=""en"" dir=""ltr"" class=""theme-aje""><head><meta charset=""utf-8""/><meta name=""viewport"" content=""width=device-width,initial-scale=1,shrink-to-fit=no""/><meta http-equiv=""Content-Type"" content=""text/html;charset=utf-8""><link rel=""shortcut icon"" href=""/favicon_aje.ico""><title data-rh=""true"" data-reactroot="""">Worst drought in century devastates Southern Africa, millions at ri..."
7,"{'id': 'al-jazeera-english', 'name': 'Al Jazeera English'}",Al Jazeera,"More than 420,000 children affected by record drought in Amazon: UN",UNICEF chief urges leaders at the upcoming COP29 summit in Azerbaijan to increase climate financing for children.,https://www.aljazeera.com/news/2024/11/7/more-than-420000-children-affected-by-record-drought-in-amazon-un,https://www.aljazeera.com/wp-content/uploads/2024/11/2024-09-19T122603Z_1160265155_RC293AA62LDV_RTRMADP_3_BRAZIL-ENVIRONMENT-AMAZON-1730959636.jpg?resize=1920%2C1440,2024-11-07T06:51:06Z,"More than 420,000 children in the Amazon basin have been affected by dangerous levels of water scarcity and drought in three countries, according to the United Nations.\r\nThe record-breaking drought, … [+2857 chars]","<!doctype html><html lang=""en"" dir=""ltr"" class=""theme-aje""><head><meta charset=""utf-8""/><meta name=""viewport"" content=""width=device-width,initial-scale=1,shrink-to-fit=no""/><meta http-equiv=""Content-Type"" content=""text/html;charset=utf-8""><link rel=""shortcut icon"" href=""/favicon_aje.ico""><title data-rh=""true"" data-reactroot="""">More than 420,000 children affected by record drought in Amazon: UN..."
8,"{'id': None, 'name': 'Phys.Org'}",Matt Shipman,Q&A: A faster way to identify drought-resistant plants for crop breeding research,Climate change is making droughts more common and more severe—which makes research into developing drought-resistant crops more important than ever. Now researchers have developed a new framework that should expedite this important research.,https://phys.org/news/2024-10-qa-faster-drought-resistant-crop.html,https://scx2.b-cdn.net/gfx/news/hires/2024/qa-researcher-discusse-4.jpg,2024-10-18T16:44:04Z,Climate change is making droughts more common and more severewhich makes research into developing drought-resistant crops more important than ever. Now researchers have developed a new framework that… [+4886 chars],"<!DOCTYPE html>\r\n<html lang=""en"">\r\n<head>\r\n <meta charset=""UTF-8"">\r\n <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">\r\n <title>400 Bad Request</title>\r\n <style>\r\n body {\r\n font-family: Arial, sans-serif;\r\n text-align: center;\r\n margin: 0;\r\n padding: 50px;\r\n background-color: ..."
9,"{'id': None, 'name': 'Phys.Org'}",Liu Jia,"Compound drought–heat wave events under-recognized in global soils, finds study","Soil is essential for life and plays a crucial role in the Earth's ecosystem, providing support for plant roots and hosting countless microorganisms. In a warming world, it is important to understand how soil hydrothermal conditions, particularly the dry-hot …",https://phys.org/news/2024-10-compound-droughtheat-events-global-soils.html,https://scx2.b-cdn.net/gfx/news/2024/compound-droughtheat-w.jpg,2024-10-14T15:51:03Z,"Soil is essential for life and plays a crucial role in the Earth's ecosystem, providing support for plant roots and hosting countless microorganisms. In a warming world, it is important to understand… [+3770 chars]","<!DOCTYPE html>\r\n<html lang=""en"">\r\n<head>\r\n <meta charset=""UTF-8"">\r\n <meta name=""viewport"" content=""width=device-width, initial-scale=1.0"">\r\n <title>400 Bad Request</title>\r\n <style>\r\n body {\r\n font-family: Arial, sans-serif;\r\n text-align: center;\r\n margin: 0;\r\n padding: 50px;\r\n background-color: ..."


### Use the Beautiful Soup library to make the scraped HTML text legible and save the output in a new `cleaned_text` column

In [82]:
# Create a new column 'cleaned_text' by applying the cleaning function to each row in 'scraped_text'
df_articles['cleaned_text'] = df_articles['scraped_text'].apply(lambda text: BeautifulSoup(text, "html.parser").get_text())
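
Called without arguments, `get_text()` joins the text of neighbouring elements with nothing in between, so menu links and headings can run together into one long string. A possible refinement, sketched below, passes a separator and drops script and style blocks. `html_to_text` is a hypothetical helper and the sample HTML is made up.

```python
from bs4 import BeautifulSoup


def html_to_text(html):
    """Extract visible text, putting a space between neighbouring elements."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style blocks, whose contents are not article text
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


sample_html = "<nav><a>Home</a><a>News</a></nav><p>Drought hit South Sudan.</p>"
print(html_to_text(sample_html))
```

This does not remove navigation text, but it keeps separate words separate, which matters because the NER model works on tokens.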

In [83]:
df_articles[['cleaned_text']]

Unnamed: 0,cleaned_text
0,"Three times more land in drought than in 1980s, study findsSkip to contentBritish Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar in UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN. Ireland PoliticsScotlandScotland PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesB..."
1,
2,206-Year-Old Bridge Uncovered In Pennsylvania Because Of Drought\n\n\nObsessed With The Culture Of CarsHomeLatestReviewsUnpavedBuyingTechRacingCultureTrucksWrenchingBeyond CarsEditionsEspañolDeutschFrançaisDiscoverHomeLatestReviewsUnpavedBuyingTechRacingCultureTrucksWrenchingBeyond CarsEditionsEspañolDeutschFrançaisMoreLog In / Sign UpSend us a Tip!SubscribeExtraAboutJalopnik AdvisorPrivacyTer...
3,"\n\n\n\n\n400 Bad Request\n\n\n\n\n400 Bad Request\nYour request has been blocked by our server's security policies.\n\nIf you believe this is an error, please contact our support team.\n\n\n"
4,"\n\n\n\n\n400 Bad Request\n\n\n\n\n400 Bad Request\nYour request has been blocked by our server's security policies.\n\nIf you believe this is an error, please contact our support team.\n\n\n"
5,\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nPanama Canal's net income rose to $3.45 billion in fiscal year despite drought \n\n\n\n News Today's news US Politics World Tech Reviews and deals Audio Computing Gaming Health Home Phones ...
6,"Worst drought in century devastates Southern Africa, millions at risk | Climate News | Al Jazeera\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip linksSkip to Contentplay Live Show navigation menuNavigation menuNewsShow more news sectionsMiddle EastAfricaAsiaUS & CanadaLatin AmericaEuropeAsia PacificWar on GazaUS ElectionOpinionSportVideoMoreShow more sectionsFeaturesUkraine warEcono..."
7,"More than 420,000 children affected by record drought in Amazon: UN | Climate Crisis News | Al Jazeera\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nSkip linksSkip to Contentplay Live Show navigation menuNavigation menuNewsShow more news sectionsMiddle EastAfricaAsiaUS & CanadaLatin AmericaEuropeAsia PacificWar on GazaUS ElectionOpinionSportVideoMoreShow more sectionsFeaturesUkraine war..."
8,"\n\n\n\n\n400 Bad Request\n\n\n\n\n400 Bad Request\nYour request has been blocked by our server's security policies.\n\nIf you believe this is an error, please contact our support team.\n\n\n"
9,"\n\n\n\n\n400 Bad Request\n\n\n\n\n400 Bad Request\nYour request has been blocked by our server's security policies.\n\nIf you believe this is an error, please contact our support team.\n\n\n"
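
Several rows above contain an empty string or a short "400 Bad Request" page rather than an article. Running NER on those rows only produces noise, so it may help to flag them first. The check below is a rough heuristic; the name `looks_like_failed_scrape`, the 200-character threshold, and the marker string are assumptions chosen for this example.

```python
def looks_like_failed_scrape(text, min_chars=200):
    """Heuristic check for empty pages and short error pages."""
    stripped = text.strip()
    # Very short results are unlikely to be full articles
    if len(stripped) < min_chars:
        return True
    # An error title near the top of the page suggests a blocked request
    return "400 Bad Request" in stripped[:min_chars]


blocked_page = (
    "\n\n400 Bad Request\n\n400 Bad Request\n"
    "Your request has been blocked by our server's security policies.\n"
)
print(looks_like_failed_scrape(blocked_page))
```

One way to apply it is `df_articles[~df_articles['cleaned_text'].apply(looks_like_failed_scrape)]`, which keeps only the rows that pass the check.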


### Let's run the NER across the `cleaned_text` column:

In [84]:
# Apply the NLP pipeline to each row in the 'cleaned_text' column and store results in 'processed_doc'
df_articles['processed_doc'] = df_articles['cleaned_text'].apply(nlp)
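
Several rows in the preview above contain only a "400 Bad Request" message rather than an article, so any entities found there are noise. One possible way to flag such rows before running NER is sketched below. The helper name `looks_like_error_page` and its `markers` parameter are hypothetical and not part of the workshop code; the marker string is simply the one that appears in the scraped output.

```python
def looks_like_error_page(text, markers=("400 Bad Request",)):
    """Return True when the text is empty or contains a known error-page marker."""
    stripped = text.strip()
    if not stripped:
        # Nothing but whitespace was scraped
        return True
    # Assumption: a genuine article does not contain any of the marker strings
    return any(marker in stripped for marker in markers)


# Small stand-in for the 'cleaned_text' column
sample_rows = [
    "\n\n400 Bad Request\n\nYour request has been blocked by our server's security policies.\n",
    "Worst drought in century devastates Southern Africa, millions at risk",
    "   \n\n",
]
flags = [looks_like_error_page(t) for t in sample_rows]
print(flags)
```

To try this on the workshop data, something like `df_articles[~df_articles['cleaned_text'].apply(looks_like_error_page)]` should keep only the rows that pass the check. The outputs shown below come from the unfiltered data, so their row numbers would no longer line up.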

### Let's use displaCy to visualize our results

In [85]:
# Specify the index of the row you want to visualize
row_index = 0  # Change this to the desired row index

# Select the NLP processed document at the specified index
doc = df_articles['processed_doc'].iloc[row_index]

# Render the entities for the selected document
displacy.render(doc, style="ent")

### Let's see a list of the identified entities

In [86]:
doc.ents

(Three,
 1980s,
 Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar,
 UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN.,
 Ireland,
 PoliticsScotlandScotland,
 PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC,
 HarrisDonald,
 VanceTim,
 BusinessFuture of BusinessInnovationTechnologyScience & HealthArtificial,
 & TVMusicArt & DesignStyleBooksEntertainment,
 MotionTravelDestinationsAfricaAntarcticaAsiaAustralia,
 PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s,
 TableCulture & ExperiencesAdventuresThe SpeciaListEarthNatural WondersWeather & ScienceClimate SolutionsSustainable BusinessGreen LivingVideoLiveLive,
 NewsLive,
 80s,
 Sunday,
 South Sudan,
 the 1980s,
 Forty-eight,
 Earth,
 at least one month,
 last year,
 the Lancet Countdown on Health and Climate Change - up,
 15%,
 the 1980s,
 Almost a third,
 30%,
 thre

### Let's add the entity label next to each entity:

In [87]:
for named_entity in doc.ents:
    print(named_entity, named_entity.label_)

Three CARDINAL
1980s DATE
Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar PERSON
UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN. ORG
Ireland GPE
PoliticsScotlandScotland GPE
PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC PERSON
HarrisDonald ORG
VanceTim ORG
BusinessFuture of BusinessInnovationTechnologyScience & HealthArtificial ORG
& TVMusicArt & DesignStyleBooksEntertainment ORG
MotionTravelDestinationsAfricaAntarcticaAsiaAustralia GPE
PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s ORG
TableCulture & ExperiencesAdventuresThe SpeciaListEarthNatural WondersWeather & ScienceClimate SolutionsSustainable BusinessGreen LivingVideoLiveLive ORG
NewsLive ORG
80s DATE
Sunday DATE
South Sudan GPE
the 1980s DATE
Forty-eight CARDINAL
Earth LOC
at least one month DATE
last year DATE
the Lancet Countdown on Health and 
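
Once each entity has a label, a quick tally of how often each label occurs can help us decide which entity types are worth filtering on. The sketch below uses `collections.Counter` on a handful of stand-in objects that expose `.text` and `.label_` the way spaCy entity spans do. The `sample_ents` list is illustrative data copied from the output above, not a real `doc.ents`.

```python
from collections import Counter
from types import SimpleNamespace

# Stand-in objects with .text and .label_ attributes, mimicking spaCy entity spans
sample_ents = [
    SimpleNamespace(text="Three", label_="CARDINAL"),
    SimpleNamespace(text="1980s", label_="DATE"),
    SimpleNamespace(text="Ireland", label_="GPE"),
    SimpleNamespace(text="Sunday", label_="DATE"),
    SimpleNamespace(text="South Sudan", label_="GPE"),
    SimpleNamespace(text="Earth", label_="LOC"),
]

# Count how many entities carry each label
label_counts = Counter(ent.label_ for ent in sample_ents)
print(label_counts.most_common())
```

With a processed document, the same line should work as `Counter(ent.label_ for ent in doc.ents)`.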

### Let's filter the results to see all entities labelled as "PERSON":

In [88]:
for named_entity in doc.ents:
    if named_entity.label_ == "PERSON":
        print(named_entity)

Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar
PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC
Lancet Countdown
Marina Romanello
Osman Gaddo
Nyakuma
Nyakuma
hardens
Romanello
Romanello
Justin Rowlatt
confused’16 Oct
agoScotland5
COP29 hostAzerbaijan's


### Let's filter the results to see all entities labelled as "NORP":

In [89]:
for named_entity in doc.ents:
    if named_entity.label_ == "NORP":
        print(named_entity)

Greek
agoPolitics3
Australian


### Let's filter the results to see all entities labelled as "GPE":

In [90]:
for named_entity in doc.ents:
    if named_entity.label_ == "GPE":
        print(named_entity)

Ireland
PoliticsScotlandScotland
MotionTravelDestinationsAfricaAntarcticaAsiaAustralia
South Sudan
malaria
Syria
Iraq
Hasakah
South Sudan
UK
UK


### Let's filter the results to see all entities labelled as "LOC":

In [91]:
for named_entity in doc.ents:
    if named_entity.label_ == "LOC":
        print(named_entity)

Earth
South America
the Middle East
the Horn of Africa
South America's
West Nile
the Khabor River
Africa


### Let's filter the results to see all entities labelled as "FAC":

In [92]:
for named_entity in doc.ents:
    if named_entity.label_ == "FAC":
        print(named_entity)

### Let's filter the results to see all entities labelled as "ORG":

In [93]:
for named_entity in doc.ents:
    if named_entity.label_ == "ORG":
        print(named_entity)

UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN.
HarrisDonald
VanceTim
BusinessFuture of BusinessInnovationTechnologyScience & HealthArtificial
& TVMusicArt & DesignStyleBooksEntertainment
PacificCaribbean & BermudaCentral AmericaEuropeMiddle EastNorth AmericaSouth AmericaWorld’s
TableCulture & ExperiencesAdventuresThe SpeciaListEarthNatural WondersWeather & ScienceClimate SolutionsSustainable BusinessGreen LivingVideoLiveLive
NewsLive
the Lancet Countdown on Health and Climate Change - up
Amazon
Amazon
the Lancet Countdown
BBC World Service
Hasakah
Hasakah City Water Board
Hasakah
BBC
Future Earth
BBC
Climate
agoScience &
Amazon
agoLights
Rosebank
Rosebank
agoScience & Environment5
agoOil
BBC
BBCAdvertise
BBC
BBC
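
The filtering cells above repeat the same loop with a different label each time, and the output often lists the same entity more than once. A small helper can cover both points. This is a minimal sketch: `texts_with_label` and its `unique` flag are hypothetical names, and the stand-in objects below only imitate the `.text` and `.label_` attributes of spaCy entity spans.

```python
from types import SimpleNamespace


def texts_with_label(ents, label, unique=False):
    """Collect the text of every entity whose label_ equals `label`."""
    texts = [ent.text for ent in ents if ent.label_ == label]
    if unique:
        # dict.fromkeys drops duplicates while keeping first-seen order
        texts = list(dict.fromkeys(texts))
    return texts


# Stand-in entities based on the output above
sample_ents = [
    SimpleNamespace(text="Amazon", label_="ORG"),
    SimpleNamespace(text="Syria", label_="GPE"),
    SimpleNamespace(text="BBC", label_="ORG"),
    SimpleNamespace(text="Amazon", label_="ORG"),
]

org_texts = texts_with_label(sample_ents, "ORG")
unique_org_texts = texts_with_label(sample_ents, "ORG", unique=True)
print(org_texts)
print(unique_org_texts)
```

In the notebook, calling `texts_with_label(doc.ents, "ORG", unique=True)` should give the ORG list above without the repeated entries.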


### Let's collect all the entities in our document and save the output as a dictionary:

In [94]:
# Collect the label and text of every entity in the document
entity_type = []
entity_identified = []
for named_entity in doc.ents:
    entity_type.append(named_entity.label_)
    entity_identified.append(named_entity.text)

# Build the dictionary once, after the loop, so the list holds a single entry
entity_dict = {'Entity_type': entity_type, 'Entity_identified': entity_identified}
entities = [entity_dict]
print(entities)

[{'Entity_type': ['CARDINAL', 'DATE', 'PERSON', 'ORG', 'GPE', 'GPE', 'PERSON', 'ORG', 'ORG', 'ORG', 'ORG', 'GPE', 'ORG', 'ORG', 'ORG', 'DATE', 'DATE', 'GPE', 'DATE', 'CARDINAL', 'LOC', 'DATE', 'DATE', 'ORG', 'PERCENT', 'DATE', 'CARDINAL', 'PERCENT', 'DATE', 'DATE', 'DATE', 'ORG', 'NORP', 'DATE', 'LOC', 'LOC', 'LOC', 'LOC', 'ORG', 'DATE', 'PERCENT', 'DATE', 'PERSON', 'CARDINAL', 'DATE', 'DATE', 'PERCENT', 'DATE', 'GPE', 'LOC', 'CARDINAL', 'PERSON', 'ORG', 'QUANTITY', 'ORG', 'DATE', 'GPE', 'GPE', 'GPE', 'DATE', 'ORG', 'CARDINAL', 'DATE', 'LOC', 'DATE', 'PERSON', 'ORG', 'ORG', 'QUANTITY', 'CARDINAL', 'ORG', 'GPE', 'PERCENT', 'DATE', 'DATE', 'CARDINAL', 'DATE', 'CARDINAL', 'PERSON', 'DATE', 'DATE', 'DATE', 'PERSON', 'PERSON', 'PERSON', 'DATE', 'DATE', 'PERSON', 'LOC', 'ORG', 'ORG', 'ORG', 'PERSON', 'DATE', 'GPE', 'DATE', 'ORG', 'PERSON', 'ORG', 'GPE', 'PERCENT', 'TIME', 'NORP', 'ORG', 'NORP', 'ORG', 'ORG', 'PRODUCT', 'PERSON', 'CARDINAL', 'CARDINAL', 'DATE', 'ORG', 'ORG', 'PERSON', 'ORG', 

### Let's build on this code to run the process across our entire collection of texts:

In [95]:
# Initialize a list to store entity data for each article
all_entities = []

# Iterate over each row in the 'processed_doc' column of df_articles
for idx, doc in enumerate(df_articles['processed_doc']):
    # Collect entity types and texts for each document
    entity_type = [ent.label_ for ent in doc.ents]
    entity_identified = [ent.text for ent in doc.ents]

    # Create a dictionary with the document index as the identifier
    ent_dict = {
        'Doc_index': idx,  # Use the row index as an identifier
        'Entity_type': entity_type,
        'Entity_identified': entity_identified
    }

    # Append the dictionary to the all_entities list
    all_entities.append(ent_dict)

# Print the list of dictionaries
print(all_entities)

[{'Doc_index': 0, 'Entity_type': ['CARDINAL', 'DATE', 'PERSON', 'ORG', 'GPE', 'GPE', 'PERSON', 'ORG', 'ORG', 'ORG', 'ORG', 'GPE', 'ORG', 'ORG', 'ORG', 'DATE', 'DATE', 'GPE', 'DATE', 'CARDINAL', 'LOC', 'DATE', 'DATE', 'ORG', 'PERCENT', 'DATE', 'CARDINAL', 'PERCENT', 'DATE', 'DATE', 'DATE', 'ORG', 'NORP', 'DATE', 'LOC', 'LOC', 'LOC', 'LOC', 'ORG', 'DATE', 'PERCENT', 'DATE', 'PERSON', 'CARDINAL', 'DATE', 'DATE', 'PERCENT', 'DATE', 'GPE', 'LOC', 'CARDINAL', 'PERSON', 'ORG', 'QUANTITY', 'ORG', 'DATE', 'GPE', 'GPE', 'GPE', 'DATE', 'ORG', 'CARDINAL', 'DATE', 'LOC', 'DATE', 'PERSON', 'ORG', 'ORG', 'QUANTITY', 'CARDINAL', 'ORG', 'GPE', 'PERCENT', 'DATE', 'DATE', 'CARDINAL', 'DATE', 'CARDINAL', 'PERSON', 'DATE', 'DATE', 'DATE', 'PERSON', 'PERSON', 'PERSON', 'DATE', 'DATE', 'PERSON', 'LOC', 'ORG', 'ORG', 'ORG', 'PERSON', 'DATE', 'GPE', 'DATE', 'ORG', 'PERSON', 'ORG', 'GPE', 'PERCENT', 'TIME', 'NORP', 'ORG', 'NORP', 'ORG', 'ORG', 'PRODUCT', 'PERSON', 'CARDINAL', 'CARDINAL', 'DATE', 'ORG', 'ORG', '

### Let's view our results in a Pandas DataFrame sorted by document index

In [96]:
df_NER = pd.DataFrame(all_entities)
df_NER = df_NER.sort_values(by='Doc_index', ascending=True)
df_NER

Unnamed: 0,Doc_index,Entity_type,Entity_identified
0,0,"[CARDINAL, DATE, PERSON, ORG, GPE, GPE, PERSON, ORG, ORG, ORG, ORG, GPE, ORG, ORG, ORG, DATE, DATE, GPE, DATE, CARDINAL, LOC, DATE, DATE, ORG, PERCENT, DATE, CARDINAL, PERCENT, DATE, DATE, DATE, ORG, NORP, DATE, LOC, LOC, LOC, LOC, ORG, DATE, PERCENT, DATE, PERSON, CARDINAL, DATE, DATE, PERCENT, DATE, GPE, LOC, CARDINAL, PERSON, ORG, QUANTITY, ORG, DATE, GPE, GPE, GPE, DATE, ORG, CARDINAL, DAT...","[Three, 1980s, Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar, UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN., Ireland, PoliticsScotlandScotland, PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC, HarrisDonald, VanceTim, BusinessFuture of BusinessInnovationTe..."
1,1,[],[]
2,2,"[DATE, DATE, GPE, DATE, PERSON, CARDINAL, CARDINAL, GPE, GPE, DATE, LOC, DATE, FAC, MONEY, WORK_OF_ART, GPE, CARDINAL, ORG, WORK_OF_ART, GPE, ORG, ORG, ORG, ORG, MONEY, WORK_OF_ART, GPE, CARDINAL, ORG, WORK_OF_ART, GPE, ORG, ORG, GPE, ORG, DATE, ORG, ORG, LOC, ORG, DATE, LOC, FAC, QUANTITY, DATE, ORDINAL, CARDINAL, DATE, ORG, ORG, ORG, CARDINAL, DATE, PERSON, LOC, CARDINAL, DATE, DATE, PERSON,...","[206-Year-Old, CarsBridgelopnik206-Year-Old, Pennsylvania, over 25 years, ByAndy KalmowitzPublishedNovember 1, 2024Comments, 19)We, United States, Pennsylvania, 200-year-old, the Youghiogheny River Lake, the past 80 years, Great Crossing Bridge, 8,000, Suzuki, India, 5, Playoff Championship Illegitimate, A Bunch Of Bull***, U.S., Jalopinions, CCShare, VideoFacebookTwitterEmailRedditLinkview, J..."
3,3,"[CARDINAL, PRODUCT, CARDINAL, WORK_OF_ART]","[400, Bad Request, 400, Bad Request]"
4,4,"[CARDINAL, PRODUCT, CARDINAL, WORK_OF_ART]","[400, Bad Request, 400, Bad Request]"
5,5,"[ORG, MONEY, DATE, DATE, GPE, ORG, DATE, PERSON, ORG, WORK_OF_ART, PERSON, PERSON, ORG, ORG, ORG, PERSON, GPE, ORG, ORG, WORK_OF_ART, PRODUCT, GPE, GPE, LOC, ORG, EVENT, ORG, ORG, GPE, GPE, ORG, ORG, ORG, PERSON, ORG, PERSON, EVENT, PERSON, PERSON, ORG, EVENT, ORG, GPE, ORG, EVENT, ORG, PERSON, DATE, WORK_OF_ART, ORG, GPE, NORP, NORP, NORP, GPE, PERSON, ORG, NORP, NORP, NORP, WORK_OF_ART, DATE...","[Panama Canal's, $3.45 billion, fiscal year, Today, US, Audio Computing Gaming Health Home Phones Science, 2024, Autos Gift, Latest News Stock Market , The Morning Brief Premium News Economics Housing Earnings Tech, Crypto Biden Economy Markets Stocks, Crypto Top ETFs Top Mutual Funds Options: Highest Open Interest Options:, Basic Materials Communication Services ..."
6,6,"[DATE, GPE, CARDINAL, ORG, ORG, WORK_OF_ART, ORG, GPE, DATE, GPE, CARDINAL, CARDINAL, DATE, CARDINAL, TIME, CARDINAL, CARDINAL, LOC, ORG, PERSON, PERSON, GPE, GPE, GPE, DATE, ORG, ORG, ORG, ORG, DATE, DATE, DATE, CARDINAL, ORG, ORG, CARDINAL, DATE, GPE, DATE, DATE, DATE, DATE, CARDINAL, MONEY, DATE, ORG, DATE, ORG, DATE, PERCENT, GPE, PERCENT, GPE, ORG, GPE, PERSON, GPE, GPE, GPE, GPE, LOC, CA...","[century, Southern Africa, millions, Climate News, Al Jazeera, Contentplay Live Show, PacificWar, GazaUS, century, Southern Africa, millions, 27 million, a century, 21 million, 09 seconds, 15, 2024Millions, Southern Africa, the United Nations, Lesotho, Malawi, Namibia, Zambia, Zimbabwe, the past months, Mozambique, UN, World Food Programme, WFP, Tuesday, March, April 2025, more than 27 million..."
7,7,"[CARDINAL, ORG, ORG, ORG, ORG, WORK_OF_ART, ORG, GPE, CARDINAL, ORG, ORG, GPE, FAC, ORG, GPE, GPE, GPE, PERSON, DATE, CARDINAL, ORG, CARDINAL, ORG, DATE, GPE, GPE, GPE, ORG, GPE, GPE, ORG, PERSON, DATE, DATE, ORG, ORG, ORG, GPE, ORG, CARDINAL, CARDINAL, GPE, ORG, CARDINAL, GPE, CARDINAL, ORG, MONEY, DATE, CARDINAL, GPE, ORG, GPE, ORG, ORG, ORG, DATE, DATE, ORG, LOC, GPE, GPE, GPE, GPE, GPE, GP...","[More than 420,000, Amazon, UN, Climate Crisis News, Al Jazeera, Contentplay Live Show, PacificWar, GazaUS, 420,000, Amazon, UNUNICEF, Azerbaijan, Lake Tefe, Amazon, Tefe, Amazonas, Brazil, Jorge Silva, 7 Nov 20247 Nov, 2024More than 420,000, Amazon, three, the United Nations, last year, Brazil, Colombia, Peru, the UN Children’s Fund, Baku, Azerbaijan, UNICEF, Catherine Russell, Thursday, toda..."
8,8,"[CARDINAL, PRODUCT, CARDINAL, WORK_OF_ART]","[400, Bad Request, 400, Bad Request]"
9,9,"[CARDINAL, PRODUCT, CARDINAL, WORK_OF_ART]","[400, Bad Request, 400, Bad Request]"


### Let's explode our DataFrame so we have just one entity per row, tied to its document index

In [97]:
df_NER = df_NER.set_index(['Doc_index'])
df_NER = df_NER.apply(pd.Series.explode).reset_index()
df_NER[:25]

Unnamed: 0,Doc_index,Entity_type,Entity_identified
0,0,CARDINAL,Three
1,0,DATE,1980s
2,0,PERSON,Broadcasting CorporationWatchHomeNewsUS ElectionSportBusinessInnovationCultureArtsTravelEarthVideoLiveHomeNewsIsrael-Gaza WarWar
3,0,ORG,UkraineUS & CanadaUKUK PoliticsEnglandN. IrelandN.
4,0,GPE,Ireland
5,0,GPE,PoliticsScotlandScotland
6,0,PERSON,PoliticsWalesWales PoliticsAfricaAsiaChinaIndiaAustraliaEuropeLatin AmericaMiddle EastIn PicturesBBC
7,0,ORG,HarrisDonald
8,0,ORG,VanceTim
9,0,ORG,BusinessFuture of BusinessInnovationTechnologyScience & HealthArtificial
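
To make the explode step less of a black box, here is a plain-Python sketch of the same reshaping: each pair of parallel lists is unpacked into one row per entity. The function name `explode_entities` is hypothetical, and the `sample` records are a tiny stand-in for `all_entities`.

```python
def explode_entities(all_entities):
    """Turn one record per document into one row per entity."""
    rows = []
    for record in all_entities:
        # Pair each label with the entity text at the same position
        pairs = zip(record["Entity_type"], record["Entity_identified"])
        for entity_type, entity_text in pairs:
            rows.append(
                {
                    "Doc_index": record["Doc_index"],
                    "Entity_type": entity_type,
                    "Entity_identified": entity_text,
                }
            )
    return rows


# Tiny stand-in with the same keys as all_entities
sample = [
    {"Doc_index": 0, "Entity_type": ["CARDINAL", "DATE"], "Entity_identified": ["Three", "1980s"]},
    {"Doc_index": 1, "Entity_type": [], "Entity_identified": []},
    {"Doc_index": 3, "Entity_type": ["CARDINAL"], "Entity_identified": ["400"]},
]
rows = explode_entities(sample)
for row in rows:
    print(row)
```

One difference worth knowing: this sketch silently drops documents with no entities, whereas pandas' `explode` keeps a row for an empty list and fills it with `NaN`.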


### Let's filter our results by GPE

In [98]:
df_NER[df_NER['Entity_type'] == 'GPE'][:15]

Unnamed: 0,Doc_index,Entity_type,Entity_identified
4,0,GPE,Ireland
5,0,GPE,PoliticsScotlandScotland
11,0,GPE,MotionTravelDestinationsAfricaAntarcticaAsiaAustralia
17,0,GPE,South Sudan
48,0,GPE,malaria
56,0,GPE,Syria
57,0,GPE,Iraq
58,0,GPE,Hasakah
71,0,GPE,South Sudan
94,0,GPE,UK
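
Some of these "places" are really navigation-menu residue in which several words were glued together during scraping, such as `PoliticsScotlandScotland`. A rough heuristic for flagging them is to look for a lowercase letter immediately followed by an uppercase one. This is only a sketch: `looks_run_together` is a hypothetical helper, it would also flag legitimate names such as "McDonald" or "iPhone", and it cannot catch mislabels like `malaria`.

```python
import re

# A lowercase letter directly followed by an uppercase letter, as in "sS" or "sD"
_RUN_TOGETHER = re.compile(r"[a-z][A-Z]")


def looks_run_together(entity_text):
    """Return True when the text appears to contain words glued together."""
    return bool(_RUN_TOGETHER.search(entity_text))


# Entity strings copied from the outputs above
samples = ["Ireland", "PoliticsScotlandScotland", "South Sudan", "HarrisDonald", "UK"]
flagged = [s for s in samples if looks_run_together(s)]
print(flagged)
```

Applied to the exploded DataFrame, a mask such as `df_NER['Entity_identified'].astype(str).apply(looks_run_together)` could be used to inspect or drop the suspicious rows. Cleaning the scraped text before running NER would likely be the more reliable fix.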


### Let's filter our results by LAW

In [99]:
df_NER[df_NER['Entity_type'] == 'LAW'][:15]

Unnamed: 0,Doc_index,Entity_type,Entity_identified
1116,13,LAW,"ESTUpdated Nov 3, 2024"
1570,14,LAW,Constitution
1759,14,LAW,COP 29
1891,14,LAW,These Constitutional Amendments Didn’t Quite Make
2506,15,LAW,the Euro 2024
2509,15,LAW,UEFA Euro 2024


### Let's filter our results by MONEY

In [100]:
df_NER[df_NER['Entity_type'] == 'MONEY'][:15]

Unnamed: 0,Doc_index,Entity_type,Entity_identified
134,2,MONEY,8000
145,2,MONEY,8000
212,5,MONEY,$3.45 billion
298,5,MONEY,$3.45 billion
311,5,MONEY,$3.45 billion
314,5,MONEY,$18 million to $4.99 billion
359,5,MONEY,141.96 +1.55
418,6,MONEY,Tens of millions
502,7,MONEY,10
919,10,MONEY,250000
