# NLP Class Assignment 5 -- Named Entity Recognition (NER) and Location Extraction

Richard Yang


1. Identify the company name by looking at the entity distributions across both tweets and news articles
2. Identify what other companies are most frequently mentioned along with your primary company
    - Analyze what companies are most frequently mentioned within the same document (tweet and news article)
    - While analyzing news articles, extract separate entities from titles and texts
3. Identify the most frequent locations of events by extracting appropriate named entities
    - Locations may include countries, states, cities, regions, etc.
 

In order to complete this analysis:

- Discard non-English results
- Apply appropriate text cleaning methods
- Within your Jupyter notebook:
    - Show a table or chart with your top-20 companies (sorted in descending order)
    - You are welcome to use separate tables for titles and texts of the news articles
- Use a couple of different NER packages and options (e.g., both NLTK and spaCy, with and without sentence segmentation). This way you can evaluate which model gives you the best results
    - Your top-20 list should only be based on your most accurate results from the best performing NER package


## Data Preparation

In [1]:
import os
import re
import sys
import warnings

import nltk
import nltk.corpus
import pandas as pd
import requests
import spacy
from nltk.text import Text
from spacy import displacy

warnings.filterwarnings('ignore')
# load the small English pipeline used for all spaCy NER below
nlp = spacy.load("en_core_web_sm")


In [4]:
import pandas as pd

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

#### Read news data

In [5]:
news_path = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_5_news.json'
news_df = pd.read_json(news_path, orient='records', lines=True)

print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

Sample contains 10,012 news articles


Unnamed: 0,url,date,language,title,text
0,http://kokomoperspective.com/obituaries/jon-w-horton/article_b6ba8e1e-cb9c-11eb-9868-fb11b88b9778.html,2021-06-13,en,Jon W. Horton | Obituaries | kokomoperspective.com,Jon W. Horton | Obituaries | kokomoperspective.comYou have permission to edit this article. EditCloseSign Up Log In Dashboard LogoutMy Account Dashboard Profile Saved items LogoutCOVID-19Click here for the latest local news on COVID-19HomeAbout UsContact UsNewsLocalOpinionPoliticsNationalStateAgricultureLifestylesEngagements/Anniversaries/WeddingsAutosEntertainmentHealthHomesOutdoorsSportsNFLNCAAVitalsObituariesAutomotivee-EditionCouponsGalleries74°...
1,https://auto.economictimes.indiatimes.com/news/auto-components/birla-precision-to-ramp-up-capacity-to-tap-emerging-opportunities-in-india/81254902,2021-02-28,en,"Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto","Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto We have updated our terms and conditions and privacy policy Click ""Continue"" to accept and continue with ET AutoAccept the updated privacy & cookie policyDear user, ET Auto privacy and cookie policy has been updated to align with the new data regulations in European Union. Please review and accept these changes below to continue using the website.You can see our privacy policy & our cookie ..."


In [6]:
news_df.shape

(10012, 5)

#### Read Tweets data

In [7]:
tweets_path = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_5_tweets.json'
tweets_df = pd.read_json(tweets_path, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)

Sample contains 10,105 tweets


Unnamed: 0,id,lang,date,name,retweeted,text
0,1534565117614084096,en,2022-06-08,Low Orbit Tourist 🌍📷,,"Body &amp; Assembly - Halewood - United Kingdom\n🌍53.3504,-2.8352296,402m\n\nHalewood Body &amp; Assembly is a Jaguar Land Rover factory in Halewood, England, and forms the major part of the Halewood complex which is shared with Ford who manufacture transmissions at the site. [Wikipedia] https://t.co/LPmCnZIaVt"
1,1534565743429394439,en,2022-06-08,CompleteCar.ie,RT,"Land Rover Ireland has announced that the new Range Rover Sport starts at €114,150, now on @completecar:\n\nhttps://t.co/TjGUkL3FYr https://t.co/QdVaEiJkjO"


## Solution

## Tweets Named Entity Recognition (NER) and Location Extraction


In this part, I use spaCy and NLTK, each with and without sentence segmentation, to extract company names and locations from the tweets.

Tweets_Spacy_Without_Sentence_Segmentation

In [127]:
# define a function to check if a token is in English
# note: token.lang_ reports the language of the loaded pipeline ('en' for en_core_web_sm),
# not a per-token detection, so this check keeps every token
def is_english(token):
    return token.lang_ == 'en'

# tokenize each tweet with the language model, then drop single-character tokens and URLs
def filter_non_english(text):
    doc = nlp(text)
    english_tokens = [token.text for token in doc if is_english(token)]
    # remove single-character tokens
    english_tokens = [token for token in english_tokens if len(token) > 1]
    # remove urls
    english_tokens = [token for token in english_tokens if not token.startswith('http')]
    return ' '.join(english_tokens)

# apply the function to the 'text' column and store the result in a new column 'text_english'
tweets_df['text_english'] = tweets_df['text'].apply(filter_non_english)
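
The cleaning step above can also be approached without a language model. The sketch below is one possible variant, not the method used in this notebook: it assumes the `lang` field shown in the tweet sample marks each tweet's language, and `keep_english` and `clean_tweet` are hypothetical helper names. It also decodes HTML entities such as `&amp;`, which appear in the raw tweets.

```python
import html
import re

URL_RE = re.compile(r"https?://\S+")
WHITESPACE_RE = re.compile(r"\s+")


def keep_english(records, lang_key="lang"):
    """Keep only records whose language field equals 'en'."""
    return [r for r in records if r.get(lang_key) == "en"]


def clean_tweet(text):
    """Decode HTML entities such as '&amp;', drop URLs, collapse whitespace."""
    text = html.unescape(text)
    text = URL_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()


# Toy records standing in for rows of tweets_df
sample = [
    {"lang": "en", "text": "Body &amp; Assembly - Halewood\n\nhttps://t.co/LPmCnZIaVt"},
    {"lang": "es", "text": "Hola"},
]
cleaned = [clean_tweet(r["text"]) for r in keep_english(sample)]
print(cleaned)
```

On the DataFrame itself, the same language filter would be a boolean mask on the `lang` column (or `language` for the news data), applied before any NER is run.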


In [148]:
# extract entities from the 'text_english' column
entities = []
labels = []
position_start = []
position_end = []

for text in tweets_df['text_english']:
    doc = nlp(text)
    for ent in doc.ents:
        entities.append(ent.text)
        labels.append(ent.label_)
        position_start.append(ent.start_char)
        position_end.append(ent.end_char)

# create a dataframe of entities
df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end})

# keep organizations (ORG), and separately keep locations (LOC) and geopolitical entities (GPE)
df_org = df[df['Labels'] == 'ORG']
df_org_loc = df[(df['Labels'] == 'LOC') | (df['Labels'] == 'GPE')]

# count each unique entity in descending order and keep the 20 most frequent
tweets_spacy_noss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
tweets_spacy_noss_loc = pd.DataFrame(df_org_loc['Entities'].value_counts().sort_values(ascending=False).head(20))
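
The filter-then-count pattern in this cell is repeated for every model and data source below. As a minimal sketch (an alternative to the pandas code, using only the standard library), it can be expressed as one function over `(entity, label)` pairs. The name `top_entities` and the toy input are assumptions for illustration.

```python
from collections import Counter


def top_entities(pairs, wanted_labels, n=20):
    """Return the n most frequent entity strings whose label is in wanted_labels.

    pairs is an iterable of (entity_text, label) tuples.
    """
    counts = Counter(text for text, label in pairs if label in wanted_labels)
    return counts.most_common(n)


# Toy input standing in for the (ent.text, ent.label_) pairs collected above
pairs = [
    ("Land Rover", "ORG"), ("Land Rover", "ORG"), ("eBay", "ORG"),
    ("Russia", "GPE"), ("UK", "GPE"), ("Russia", "GPE"), ("Sahara", "LOC"),
]
top_orgs = top_entities(pairs, {"ORG"})
top_places = top_entities(pairs, {"GPE", "LOC"})
print(top_orgs)
print(top_places)
```

For the NLTK cells the same function would be called with `{"ORGANIZATION"}` and `{"GPE", "LOCATION"}`, since NLTK uses different label names.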


Tweets_Spacy_With_Sentence_Segmentation

In [149]:
entities = []
labels = []
position_start = []
position_end = []
sentences = []

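# note: NER still runs once on the whole tweet; iterating over doc.sents only regroups
# the entities already found, which is why the counts match the unsegmented run
for text in tweets_df['text_english']:
    doc = nlp(text)
    for sent in doc.sents: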
        for ent in sent.ents:
            entities.append(ent.text)
            labels.append(ent.label_)
            position_start.append(ent.start_char)
            position_end.append(ent.end_char)
            sentences.append(sent.text)

# create a dataframe of entities
df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end, 'Sentence':sentences})

# keep organizations (ORG), and separately keep locations (LOC) and geopolitical entities (GPE)

df_org = df[df['Labels'] == 'ORG']
df_org_loc = df[(df['Labels'] == 'LOC') | (df['Labels'] == 'GPE')]

# count each unique entity in descending order and keep the 20 most frequent
tweets_spacy_ss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
tweets_spacy_ss_loc = pd.DataFrame(df_org_loc['Entities'].value_counts().sort_values(ascending=False).head(20))

Tweets_NLTK_Without_Sentence_Segmentation

In [146]:
entities = []
labels = []
for text in tweets_df['text_english']:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk)) #Add space as between multi-token entities
            labels.append(chunk.label())

entities_labels = list(zip(entities, labels)) # zip the two lists together
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
# keep organizations and geopolitical entities (NLTK uses the labels 'ORGANIZATION' and 'GPE')
df_org = entities_df[entities_df['Labels'] == 'ORGANIZATION']
df_gpe_loc_NLTK = entities_df[(entities_df['Labels'] == 'GPE')]

# get the frequency of each unique entity in the column 'Entities' in the df_org dataframe, and sort the values in descending order
tweets_nltk_noss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
tweets_nltk_noss_loc = pd.DataFrame(df_gpe_loc_NLTK['Entities'].value_counts().sort_values(ascending=False).head(20))


Tweets_NLTK_With_Sentence_Segmentation

In [147]:
# extract NLTK named-entity chunks sentence by sentence
entities = []
labels = []

for text in tweets_df['text_english']:
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False):
            if hasattr(chunk, 'label'):
                entities.append(' '.join(c[0] for c in chunk)) #Add space as between multi-token entities
                labels.append(chunk.label())

entities_labels = list(zip(entities, labels)) # zip the two lists together
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]

# count the number of rows
df_org = entities_df[entities_df['Labels'] == 'ORGANIZATION']
df_gpe_loc_NLTK = entities_df[(entities_df['Labels'] == 'GPE')]

# get the frequency of each unique entity
tweets_NLTK_ss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
tweets_NLTK_ss_loc = pd.DataFrame(df_gpe_loc_NLTK['Entities'].value_counts().sort_values(ascending=False).head(20))


## Results Comparison for Tweets



In [143]:
# Result Comparison for NER
tweets_NLTK_ss_show = tweets_NLTK_ss.reset_index()
tweets_NLTK_ss_show.columns = ['Entities_NLTK_SS', 'Frequency']
tweets_nltk_noss_show = tweets_nltk_noss.reset_index()
tweets_nltk_noss_show.columns = ['Entities_NLTK_NOSS', 'Frequency']
tweets_spacy_ss_show = tweets_spacy_ss.reset_index()
tweets_spacy_ss_show.columns = ['Entities_Spacy_SS', 'Frequency']
tweets_spacy_noss_show = tweets_spacy_noss.reset_index()
tweets_spacy_noss_show.columns = ['Entities_Spacy_NOSS', 'Frequency']

# concatenate the dataframes
tweets_ner_show = pd.concat([tweets_spacy_ss_show, tweets_spacy_noss_show, tweets_NLTK_ss_show, tweets_nltk_noss_show], axis=1)
tweets_ner_show


Unnamed: 0,Entities_Spacy_SS,Frequency,Entities_Spacy_NOSS,Frequency.1,Entities_NLTK_SS,Frequency.2,Entities_NLTK_NOSS,Frequency.3
0,Land Rover,1196,Land Rover,1196,Land Rover,995,Land Rover,995
1,Jaguar Land Rover,647,Jaguar Land Rover,647,Land,521,Land,521
2,eBay,409,eBay,409,General Motors,283,General Motors,283
3,General Motors,284,General Motors,284,LAND,227,LAND,227
4,Jaguar Land Rover BMW Mercedes Benz Citroen,241,Jaguar Land Rover BMW Mercedes Benz Citroen,241,eBay,202,eBay,202
5,Ford,104,Ford,104,NigelAndArron,190,NigelAndArron,190
6,Volvo,87,Volvo,87,SUV,182,SUV,182
7,Jaguar,76,Jaguar,76,Duke,178,Duke,178
8,BMW,72,BMW,72,Jaguar Land Rover,163,Jaguar Land Rover,163
9,n’t,69,n’t,69,UK,156,UK,156
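
The table above contains fragments ("Land", "LAND", "SUV") and tokenizer debris ("n’t") alongside real company names. One way to reduce this noise is to normalize entity strings and drop known junk before counting. The sketch below is an illustration only: the `JUNK` list is a hypothetical stoplist assembled from the table, and `normalize_entity` and `count_normalized` are assumed names, not part of the notebook.

```python
import re
from collections import Counter

# Hypothetical stoplist built from the comparison table; extend as needed
JUNK = {"n't", "land", "suv", "new", "car"}


def normalize_entity(text):
    """Build a counting key: straighten apostrophes, collapse spaces, ignore case."""
    text = text.replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().casefold()


def count_normalized(entities):
    """Count entities by normalized key, skipping junk; report the first-seen spelling."""
    counts = Counter()
    display = {}
    for ent in entities:
        key = normalize_entity(ent)
        if key in JUNK or len(key) < 2:
            continue
        counts[key] += 1
        display.setdefault(key, ent.strip())
    return [(display[key], count) for key, count in counts.most_common()]


result = count_normalized(["Land Rover", "LAND ROVER", "Land", "LAND", "n\u2019t", "eBay"])
print(result)
```

Case-folding merges variants such as "LAND ROVER" and "Land Rover", at the cost of also merging any entities that differ only by case.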


In [150]:
# Result Comparison for Location
tweets_NLTK_ss_loc_show = tweets_NLTK_ss_loc.reset_index()
tweets_NLTK_ss_loc_show.columns = ['Entities_NLTK_SS', 'Frequency']
tweets_nltk_noss_loc_show = tweets_nltk_noss_loc.reset_index()
tweets_nltk_noss_loc_show.columns = ['Entities_NLTK_NOSS', 'Frequency']
tweets_spacy_ss_loc_show = tweets_spacy_ss_loc.reset_index()
tweets_spacy_ss_loc_show.columns = ['Entities_Spacy_SS', 'Frequency']
tweets_spacy_noss_loc_show = tweets_spacy_noss_loc.reset_index()
tweets_spacy_noss_loc_show.columns = ['Entities_Spacy_NOSS', 'Frequency']

# concatenate the dataframes
tweets_ner_loc_show = pd.concat([tweets_spacy_ss_loc_show, tweets_spacy_noss_loc_show, tweets_NLTK_ss_loc_show, tweets_nltk_noss_loc_show], axis=1)
tweets_ner_loc_show

Unnamed: 0,Entities_Spacy_SS,Frequency,Entities_Spacy_NOSS,Frequency.1,Entities_NLTK_SS,Frequency.2,Entities_NLTK_NOSS,Frequency.3
0,Russia,471,Russia,471,Land,1553,Land,1553
1,UK,365,UK,365,Russia,464,Russia,464
2,n’t,208,n’t,208,British,188,British,188
3,India,88,India,88,Sussex,119,Sussex,119
4,Kibaki,76,Kibaki,76,India,87,India,87
5,Meghan,72,Meghan,72,Russian,84,Russian,84
6,Jamaica,66,Jamaica,66,New,79,New,79
7,Britain,60,Britain,60,Zimbabwe,75,Zimbabwe,75
8,Zimbabwe,41,Zimbabwe,41,LAND,71,LAND,71
9,London,39,London,39,Car,68,Car,68


## News Title Named Entity Recognition (NER) and Location Extraction

A news title is usually a single sentence, so sentence segmentation is unnecessary. I therefore use only spaCy and NLTK without sentence segmentation to extract company names and locations from the news titles.

In [20]:
# define a function to check if a token is in English
# note: token.lang_ reports the language of the loaded pipeline, so this check keeps every token
def is_english(token):
    return token.lang_ == 'en'

# tokenize each title with the language model, then drop single-character tokens and URLs
def clean_text(text):
    doc = nlp(text)
    english_tokens = [token.text for token in doc if is_english(token)]
    # remove single-character tokens
    english_tokens = [token for token in english_tokens if len(token) > 1]
    # remove urls
    english_tokens = [token for token in english_tokens if not token.startswith('http')]
    return ' '.join(english_tokens)

# apply the function to the 'title' column and store the result in a new column 'title_new'
news_df['title_new'] = news_df['title'].apply(clean_text)


News_Title_Spacy_Without_Sentence_Segmentation

In [151]:
# extract entities from the 'title' column
entities = []
labels = []
position_start = []
position_end = []

for text in news_df['title']:
    doc = nlp(text)
    for ent in doc.ents:
        entities.append(ent.text)
        labels.append(ent.label_)
        position_start.append(ent.start_char)
        position_end.append(ent.end_char)

# create a dataframe of entities
df = pd.DataFrame({'Entities':entities,'Labels':labels,'Position_Start':position_start, 'Position_End':position_end})

# keep organizations (ORG), and separately keep locations (LOC) and geopolitical entities (GPE)

df_org = df[df['Labels'] == 'ORG']
df_org_loc = df[(df['Labels'] == 'LOC') | (df['Labels'] == 'GPE')]


# get the frequency of each unique entity in the column 'Entities' in the df_org dataframe, and sort the values in descending order
news_title_spacy_noss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
news_title_spacy_noss_loc = pd.DataFrame(df_org_loc['Entities'].value_counts().sort_values(ascending=False).head(20))

News_Title_NLTK_Without_Sentence_Segmentation

In [152]:
entities = []
labels = []
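# iterate over the news titles; the original run looped over tweets_df['text_english'],
# so the NLTK columns in the outputs shown for this section repeat the tweet results
for text in news_df['title']: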
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk)) #Add space as between multi-token entities
            labels.append(chunk.label())

entities_labels = list(zip(entities, labels)) # zip the two lists together
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]
# keep organizations and geopolitical entities (NLTK uses the labels 'ORGANIZATION' and 'GPE')
df_org = entities_df[entities_df['Labels'] == 'ORGANIZATION']
df_gpe_loc_NLTK = entities_df[(entities_df['Labels'] == 'GPE')]


# get the frequency of each unique entity in the column 'Entities' in the df_org dataframe, and sort the values in descending order
news_title_nltk_noss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))
news_title_nltk_noss_loc = pd.DataFrame(df_gpe_loc_NLTK['Entities'].value_counts().sort_values(ascending=False).head(20))


## Result Comparison for News Title

In [153]:
# Result Comparison for News Title NER
news_title_spacy_noss_show = news_title_spacy_noss.reset_index()
news_title_spacy_noss_show.columns = ['Entities_Spacy_NOSS', 'Frequency']
news_title_nltk_noss_show = news_title_nltk_noss.reset_index()
news_title_nltk_noss_show.columns = ['Entities_NLTK_NOSS', 'Frequency']

# concatenate the dataframes
news_title_ner_show = pd.concat([news_title_spacy_noss_show, news_title_nltk_noss_show], axis=1)
news_title_ner_show

Unnamed: 0,Entities_Spacy_NOSS,Frequency,Entities_NLTK_NOSS,Frequency.1
0,Ford,270,Land Rover,995
1,Daily Mail Online,212,Land,521
2,Star News,209,General Motors,283
3,Hyundai,205,LAND,227
4,Toyota,162,eBay,202
5,Chevrolet,160,NigelAndArron,190
6,Honda,146,SUV,182
7,Shropshire Star,126,Duke,178
8,Express & Star,120,Jaguar Land Rover,163
9,Automotive News,108,UK,156


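In the spaCy results, Ford is the most frequently mentioned company in news titles. The NLTK column is identical to the NLTK tweet results above (Land Rover 995, Land 521, General Motors 283), so it reflects the tweets rather than the titles and is not a valid title-level comparison.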
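
Several of the top spaCy entities for titles ("Daily Mail Online", "Star News", "Shropshire Star") are publisher names, which suggests they come from the site suffix appended to each title, as in the sample row "Jon W. Horton | Obituaries | kokomoperspective.com". A minimal sketch of stripping that suffix before NER follows. It assumes the suffix is separated by `" | "`, which will not hold for every site (titles ending in ", Auto News, ET Auto" are not handled), and `strip_site_suffix` is a hypothetical helper name.

```python
def strip_site_suffix(title, separator=" | "):
    """Keep only the part of a title before the first separator."""
    return title.split(separator)[0].strip()


titles = [
    "Jon W. Horton | Obituaries | kokomoperspective.com",
    "Plain headline",
]
headlines = [strip_site_suffix(t) for t in titles]
print(headlines)
```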

In [154]:
# Result Comparison for News Title Location
news_title_spacy_noss_loc_show = news_title_spacy_noss_loc.reset_index()
news_title_spacy_noss_loc_show.columns = ['Entities_Spacy_NOSS', 'Frequency']
news_title_nltk_noss_loc_show = news_title_nltk_noss_loc.reset_index()
news_title_nltk_noss_loc_show.columns = ['Entities_NLTK_NOSS', 'Frequency']

# concatenate the dataframes
news_title_ner_loc_show = pd.concat([news_title_spacy_noss_loc_show, news_title_nltk_noss_loc_show], axis=1)
news_title_ner_loc_show

Unnamed: 0,Entities_Spacy_NOSS,Frequency,Entities_NLTK_NOSS,Frequency.1
0,Carpages.ca,1962,Land,1553
1,Ontario,1316,Russia,464
2,British Columbia,198,British,188
3,UK,194,Sussex,119
4,Manitoba,181,India,87
5,Winnipeg,137,Russian,84
6,India,121,New,79
7,Toronto,118,Zimbabwe,75
8,Alberta,116,LAND,71
9,London,112,Car,68


## News Text Named Entity Recognition (NER) and Location Extraction

In this part, since spaCy gave the cleanest results above and sentence segmentation did not change its output, I use spaCy without sentence segmentation and NLTK (with and without sentence segmentation) to extract company names and locations from the news text.

In [13]:
import re

# define a regular expression to match URLs
url_pattern = re.compile(r'https?://\S+')

# strip URLs from the text and return the cleaned string
# (this rebinds clean_text, replacing the token-based function defined for the titles)
clean_text = lambda text: url_pattern.sub('', text)

# apply the function to the 'text' column and store the result in a new column 'text_new'
news_df['text_new'] = news_df['text'].apply(clean_text)

In [None]:
text_entities = []
index = []
labels = []
for doc in nlp.pipe(
    news_df["text_new"], 
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
    batch_size=100,
    n_process=2
):
    for ent in doc.ents:
        text_entities.append(ent.text)
        labels.append(ent.label_)


In [None]:
# collect the extracted entities and their labels in a dataframe
news_text_spacy_noss_df = pd.DataFrame({'Entities':text_entities,'Labels':labels})

# keep the rows labeled 'ORG'
news_text_spacy_noss = news_text_spacy_noss_df[news_text_spacy_noss_df['Labels'] == 'ORG']
# count each unique organization in descending order and keep the 20 most frequent
news_text_spacy_noss = pd.DataFrame(news_text_spacy_noss['Entities'].value_counts().sort_values(ascending=False).head(20))

# keep the rows labeled 'GPE' or 'LOC'
news_text_spacy_noss_loc = news_text_spacy_noss_df[(news_text_spacy_noss_df['Labels'] == 'GPE') | (news_text_spacy_noss_df['Labels'] == 'LOC')]
# count each unique location in descending order and keep the 20 most frequent
news_text_spacy_noss_loc = pd.DataFrame(news_text_spacy_noss_loc['Entities'].value_counts().sort_values(ascending=False).head(20))

NLTK_News_Text without sentence segmentation

NLTK is too computationally expensive to run on the full corpus: even with parallel processing and multiprocessing, my local machine could not finish the job. I therefore had no choice but to run NLTK on a random sample of 2,000 news articles.

In [40]:
news_df_sample = news_df.sample(n=2000, random_state=1)
news_df_sample.shape

(2000, 6)

In [41]:
%%time
entities = []
labels = []
for text in news_df_sample['text_new']:
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)), binary = False):
        if hasattr(chunk, 'label'):
            entities.append(' '.join(c[0] for c in chunk)) #Add space as between multi-token entities
            labels.append(chunk.label())

entities_labels = list(zip(entities, labels)) # zip the two lists together
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]


CPU times: total: 27min 45s
Wall time: 27min 49s


In [None]:
# keep the rows labeled 'ORGANIZATION'
df_org = entities_df[entities_df['Labels'] == 'ORGANIZATION']
# get the frequency of each unique entity in the column 'Entities' in the df_org dataframe, and sort the values in descending order
news_text_nltk_noss = pd.DataFrame(df_org['Entities'].value_counts().sort_values(ascending=False).head(20))

# create a subset of the dataframe where column 'Labels' = 'GPE' or 'LOCATION' in entities_df
df_loc = entities_df[(entities_df['Labels'] == 'GPE') | (entities_df['Labels'] == 'LOCATION')]
# get the frequency of each unique entity in the column 'Entities' in the df_loc dataframe, and sort the values in descending order
news_text_nltk_noss_loc = pd.DataFrame(df_loc['Entities'].value_counts().sort_values(ascending=False).head(20))

NLTK_News_Text with sentence segmentation

In [53]:
%%time
# extract NLTK named-entity chunks sentence by sentence
entities = []
labels = []

for text in news_df_sample['text_new']:
    for sent in nltk.sent_tokenize(text):
        for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary = False):
            if hasattr(chunk, 'label'):
                entities.append(' '.join(c[0] for c in chunk)) #Add space as between multi-token entities
                labels.append(chunk.label())

entities_labels = list(zip(entities, labels)) # zip the two lists together
entities_df = pd.DataFrame(entities_labels)
entities_df.columns = ["Entities", "Labels"]

CPU times: total: 29min 20s
Wall time: 29min 25s


In [None]:
# keep the rows labeled 'ORGANIZATION'
news_text_nltk_ss = entities_df[entities_df['Labels'] == 'ORGANIZATION']
# get the frequency of each unique entity in the column 'Entities' in the df_org dataframe, and sort the values in descending order
news_text_nltk_ss = pd.DataFrame(news_text_nltk_ss['Entities'].value_counts().sort_values(ascending=False).head(20))
# create a subset of the dataframe where column 'Labels' = 'GPE' or 'LOCATION' in entities_df
df_loc = entities_df[(entities_df['Labels'] == 'GPE') | (entities_df['Labels'] == 'LOCATION')]
# get the frequency of each unique entity in the column 'Entities' in the df_loc dataframe, and sort the values in descending order
news_text_nltk_ss_loc = pd.DataFrame(df_loc['Entities'].value_counts().sort_values(ascending=False).head(20))

## Result Comparison for News Text

In [116]:
# move the entity names out of the index and give each model its own pair of column names
news_text_spacy_noss_show = news_text_spacy_noss.reset_index()
news_text_spacy_noss_show.columns = ['spacy_noss', 'Frequency1']
news_text_nltk_ss_show = news_text_nltk_ss.reset_index()
news_text_nltk_ss_show.columns = ['nltk_ss', 'Frequency2']
news_text_nltk_noss_show = news_text_nltk_noss.reset_index()
news_text_nltk_noss_show.columns = ['nltk_noss', 'Frequency3']

# concat the three dataframes side by side
news_text_combine = pd.concat([news_text_spacy_noss_show, news_text_nltk_ss_show, news_text_nltk_noss_show], axis=1)
news_text_combine

Unnamed: 0,spacy_noss,Frequency1,nltk_ss,Frequency2,nltk_noss,Frequency3
0,MailOnline,8849,NYC,2090,NYC,2090
1,COVID-19,6374,MailOnline,1886,MailOnline,1886
2,Ford,5624,VERY,1576,VERY,1578
3,Toyota,5475,Duke,1227,Duke,1222
4,Hyundai,4333,LA,1177,LA,1178
5,Instagram,4023,Queen,1106,Queen,1106
6,Trump,3877,UK,982,UK,981
7,Honda,3786,COVID,945,COVID,945
8,BMW,3783,Conditions,881,Conditions,881
9,Amazon,3595,THE,738,THE,779


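As the table shows, spaCy without sentence segmentation gives the cleanest results for news text; the NLTK lists are dominated by fragments such as "VERY", "THE", and "Conditions". The most frequently mentioned car company is "Ford"; the two higher-ranked entries, "MailOnline" and "COVID-19", are a news site name and a disease name.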
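
The raw frequencies in this table are not directly comparable across models, because spaCy ran on all 10,012 articles while NLTK ran on the 2,000-article sample. A simple way to put them on one scale is to convert each count to mentions per 1,000 documents. The sketch below uses counts from the table; `per_thousand_docs` is a hypothetical helper name.

```python
def per_thousand_docs(count, n_docs):
    """Scale a raw entity count to mentions per 1,000 documents."""
    if n_docs <= 0:
        raise ValueError("n_docs must be positive")
    return count * 1000 / n_docs


# "MailOnline" counts taken from the table above
spacy_mailonline = per_thousand_docs(8849, 10012)  # spaCy: full corpus
nltk_mailonline = per_thousand_docs(1886, 2000)    # NLTK: 2,000-article sample
print(round(spacy_mailonline, 1), round(nltk_mailonline, 1))
```

After scaling, the two models can be compared on how often they report the same entity per document rather than on totals that depend on corpus size.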

In [157]:
news_text_spacy_noss_loc_show = news_text_spacy_noss_loc.reset_index()
news_text_spacy_noss_loc_show.columns = ['spacy_noss_location', 'Frequency']

news_text_nltk_ss_loc_show = news_text_nltk_ss_loc.reset_index()
news_text_nltk_ss_loc_show.columns = ['nltk_ss_location', 'Frequency']

news_text_nltk_noss_loc_show = news_text_nltk_noss_loc.reset_index()
news_text_nltk_noss_loc_show.columns = ['nltk_noss_location', 'Frequency']

news_text_combine_loc_show = pd.concat([news_text_spacy_noss_loc_show, news_text_nltk_ss_loc_show, news_text_nltk_noss_loc_show], axis=1)
news_text_combine_loc_show.head(20)

Unnamed: 0,spacy_noss_location,Frequency,nltk_ss_location,Frequency.1,nltk_noss_location,Frequency.2
0,LA,18382,Los Angeles,2194,Los Angeles,2194
1,NYC,11070,London,2071,London,2070
2,UK,10682,New York City,1615,New York City,1615
3,London,10504,British,1322,British,1322
4,Los Angeles,9930,New York,1275,New York,1275
5,US,9855,West,1089,West,1083
6,New York City,7231,California,970,California,968
7,Hollywood,5907,India,955,India,951
8,Australia,5104,Miami,921,Miami,921
9,India,5015,Australia,902,Australia,895


### Summary of NER:

1. In this project, spaCy without sentence segmentation is the best approach for named entity recognition: its top entities contain fewer fragments (such as "Land", "SUV", and "VERY") than NLTK's
2. For tweets, the most frequently mentioned company under the best model is "Land Rover"
3. For news titles, the most frequently mentioned company under the best model is "Ford"
4. For news text, the most frequently mentioned car company under the best model is "Ford"
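
The ranking of models above rests on reading the top-entity tables by eye. If a small hand-labeled sample were available, the comparison could be quantified with precision and recall. The sketch below is illustrative only: the gold labels and both model outputs are invented for the example, and `precision_recall` is a hypothetical helper.

```python
def precision_recall(predicted, gold):
    """Compare a set of predicted entity strings against a hand-labeled gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical hand-labeled companies for one document, and two invented model outputs
gold = {"Land Rover", "Ford"}
spacy_pred = {"Land Rover", "Ford", "n't"}
nltk_pred = {"Land", "Ford"}

spacy_scores = precision_recall(spacy_pred, gold)
nltk_scores = precision_recall(nltk_pred, gold)
print(spacy_scores, nltk_scores)
```

Exact string matching is strict: a partial span such as "Land" counts as both a false positive and a missed entity.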


### 2. Identify what other companies are most frequently mentioned along with your primary company
- Analyze what companies are most frequently mentioned within the same document (tweet and news article)
- While analyzing news articles, extract separate entities from titles and texts

Since spaCy without sentence segmentation was the best NER model, I use spaCy for this analysis.

In [None]:
%%time

indexlist = []
entities = []
labels = []

docs = nlp.pipe(
    news_df['text_new'].tolist(),
    disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"],
    batch_size=100,
    n_process=2
)

for i, doc in enumerate(docs):
    index = news_df.index[i]
    for ent in doc.ents:
        indexlist.append(index)
        entities.append(ent.text)
        labels.append(ent.label_)

# create a dataframe to store the entities and labels and index
entities_labels_index = pd.DataFrame({'Index': indexlist, 'Entities': entities, 'Labels': labels})

In [98]:
NER_byindex = entities_labels_index.groupby(['Index', 'Entities', 'Labels']).size().reset_index(name='Occurences')
NER_byindex = NER_byindex[NER_byindex['Labels'] == 'ORG']
NER_byindex.sort_values(by=['Index', 'Entities', 'Labels'], ascending=False, inplace=True)
NER_byindex.head(3)

Unnamed: 0,Index,Entities,Labels,Occurences
2498249,10011,the YearStreet Machine,ORG,2
2498248,10011,the Product Safety Australia,ORG,1
2498246,10011,the Nissan Customer Service Centre,ORG,1


For tweets, "Land Rover" is the most frequently mentioned company. The cell below finds the organizations that appear in the same news articles as "Land Rover".

In [165]:
# If the most frequently mentioned company is Land Rover, get the top 20 entities mentioned alongside it
most_freq_company = 'Land Rover' 
print(most_freq_company)

# ids of the articles that mention the company, then all ORG entities found in those articles
news_id_toyota = NER_byindex[NER_byindex.Entities == most_freq_company]['Index'].unique().tolist()
NER_byindex_toyota = NER_byindex[NER_byindex['Index'].isin(news_id_toyota)]

# drop label (reassign instead of inplace to avoid modifying a slice)
NER_byindex_toyota = NER_byindex_toyota.drop('Labels', axis=1)
# one row per (article, entity) pair
NER_byindex_toyota_count_news = NER_byindex_toyota.groupby(['Index', 'Entities']).count()
# count the number of articles in which each entity appears, in descending order
NER_byindex_toyota_gb = NER_byindex_toyota_count_news.groupby('Entities').count().sort_values(by='Occurences', ascending=False)
NER_byindex_toyota_gb.head(20)

Land Rover


Entities,Occurences
Land Rover,1290
PrintsOur,564
TeamAdvertise,554
the Daily Mail,553
ArchiveTopics,551
Associated Newspapers LtdPart,551
LocationPublished,551
The Mail,551
COVID-19,545
Instagram,519
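
The co-mention logic amounts to counting, for each entity, the number of documents in which it appears together with the primary company. The sketch below restates that with the standard library and adds an optional exclusion set for page boilerplate such as "PrintsOur" and "TeamAdvertise". The function name `co_mentions`, the `exclude` parameter, and the toy documents are assumptions for illustration, not the notebook's code.

```python
from collections import Counter


def co_mentions(doc_entities, primary, exclude=frozenset(), n=20):
    """Count in how many documents each entity appears alongside `primary`.

    doc_entities maps a document id to the list of entity strings found in it.
    """
    counts = Counter()
    for ents in doc_entities.values():
        unique = set(ents)  # count each entity at most once per document
        if primary in unique:
            counts.update(unique - {primary} - set(exclude))
    return counts.most_common(n)


# Toy documents standing in for the per-article ORG entities
docs = {
    0: ["Land Rover", "Jaguar", "Jaguar", "PrintsOur"],
    1: ["Land Rover", "Jaguar", "BMW"],
    2: ["Ford", "Toyota"],
}
partners = co_mentions(docs, "Land Rover", exclude={"PrintsOur"})
print(partners)
```

Counting documents rather than raw mentions keeps one long article from dominating the ranking.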


For both news text and news titles, "Ford" is the most frequently mentioned company.

In [163]:
# If the most frequently mentioned company is Ford, get the top 20 entities mentioned alongside it
most_freq_company = 'Ford' 
print(most_freq_company)

# ids of the articles that mention the company, then all ORG entities found in those articles
news_id_toyota = NER_byindex[NER_byindex.Entities == most_freq_company]['Index'].unique().tolist()
NER_byindex_toyota = NER_byindex[NER_byindex['Index'].isin(news_id_toyota)]

# drop label (reassign instead of inplace to avoid modifying a slice)
NER_byindex_toyota = NER_byindex_toyota.drop('Labels', axis=1)
# one row per (article, entity) pair
NER_byindex_toyota_count_news = NER_byindex_toyota.groupby(['Index', 'Entities']).count()
# count the number of articles in which each entity appears, in descending order
NER_byindex_toyota_gb = NER_byindex_toyota_count_news.groupby('Entities').count().sort_values(by='Occurences', ascending=False)
NER_byindex_toyota_gb.head(20)

Ford


Entities,Occurences
Ford,1881
Toyota,838
Honda,680
Hyundai,599
Autopath Technologies Inc.,576
COVID-19,569
BMW,556
the Carpages.ca Terms & Conditions,435
Chevrolet,421
Nissan,406


### Summary of Other Associated Companies Analysis

'Land Rover' and 'Ford' are the most frequently mentioned companies in the tweets and the news, respectively.

Other frequently co-mentioned companies are car brands such as Toyota and Honda, along with media companies such as the Daily Mail and the BBC.

### 3. Identify most frequent locations of events, by extracting appropriate named entities
- Locations may include countries, states, cities, regions, etc.

Since the best model is spaCy without sentence segmentation, I use spaCy for this analysis.

In [204]:
# rename the column 1 as 'Tweets_spacy_noss' for tweets_spacy_noss_loc_show
tweets_spacy_noss_loc_show = tweets_spacy_noss_loc.reset_index()
tweets_spacy_noss_loc_show.columns = ['Tweets_spacy_noss', 'Frequency']

# rename the column 1 as 'news_title_spacy_noss' for news_title_spacy_noss_loc_show
news_title_spacy_noss_loc_show = news_title_spacy_noss_loc.reset_index()
news_title_spacy_noss_loc_show.columns = ['News_title_spacy_noss', 'Frequency']

# rename the column 1 as 'news_text_spacy_noss' for news_text_spacy_noss_loc_show
news_text_spacy_noss_loc_show = news_text_spacy_noss_loc.reset_index()
news_text_spacy_noss_loc_show.columns = ['News_text_spacy_noss', 'Frequency']
# concatenate tweets_spacy_noss_loc_show, news_title_spacy_noss_loc_show, and news_text_spacy_noss_loc_show side by side
df_loc = pd.concat([tweets_spacy_noss_loc_show, news_title_spacy_noss_loc_show, news_text_spacy_noss_loc_show], axis=1)
df_loc

Unnamed: 0,Tweets_spacy_noss,Frequency,News_title_spacy_noss,Frequency.1,News_text_spacy_noss,Frequency.2
0,Russia,471,Carpages.ca,1962,LA,18382
1,UK,365,Ontario,1316,NYC,11070
2,n’t,208,British Columbia,198,UK,10682
3,India,88,UK,194,London,10504
4,Kibaki,76,Manitoba,181,Los Angeles,9930
5,Meghan,72,Winnipeg,137,US,9855
6,Jamaica,66,India,121,New York City,7231
7,Britain,60,Toronto,118,Hollywood,5907
8,Zimbabwe,41,Alberta,116,Australia,5104
9,London,39,London,112,India,5015


### Summary of Location Analysis

For Tweets, we have **"Russia"** as the most frequently mentioned location

For News Title, we have **"Ontario"** as the most frequently mentioned location (the top-ranked "Carpages.ca" is a website that spaCy mislabels as a location)

For News Text, we have **"LA"** as the most frequently mentioned location
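
Several rows in the location table refer to the same place under different names ("LA" and "Los Angeles", "NYC" and "New York City"). Merging such aliases before ranking gives a clearer picture of the top locations. The sketch below uses counts from the news text column; the alias map is a hypothetical, hand-written assumption, and `merge_aliases` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical alias map; each key is folded into its canonical name
ALIASES = {"LA": "Los Angeles", "NYC": "New York City", "Britain": "UK"}


def merge_aliases(counts, aliases=ALIASES):
    """Sum the counts of aliased names under one canonical name."""
    merged = Counter()
    for name, count in counts.items():
        merged[aliases.get(name, name)] += count
    return merged


# Counts copied from the news text column of the table above
news_text_counts = {
    "LA": 18382, "NYC": 11070, "UK": 10682,
    "London": 10504, "Los Angeles": 9930, "New York City": 7231,
}
merged = merge_aliases(news_text_counts)
top = merged.most_common(3)
print(top)
```

The alias map has to be curated by hand, since abbreviations such as "LA" can be ambiguous in other corpora.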

In [205]:
import datetime
import pytz

time = datetime.datetime.now(pytz.timezone('US/Central')).strftime("%a, %d %B %Y %H:%M:%S")
sign = 'Richard Yang'

print(f'Created at: {time} by {sign}')

Created at: Tue, 25 April 2023 21:17:04 by Richard Yang
