## NLP Named Entity Recognition Project

### Project Summary
This project applies named entity recognition (NER) to a collection of tweets and a collection of news articles about one particular company. From this text, I identified the primary company, the other companies most frequently mentioned alongside it, and the most frequently mentioned event locations.

### Data
The data has 10,012 news articles and 10,105 tweets.

### Project Sections
1. Data Import

2. Text Cleaning

3. Experiment with Various NER Packages & Sentence Segmentation Choices
 - NLTK with sentence segmentation
 - NLTK without sentence segmentation
 - spaCy with sentence segmentation
 - spaCy without sentence segmentation

4. NER Analysis with Best Practice: spaCy with Sentence Segmentation on Entire Datasets

### Author & Platform
Yezi Liu conducted this project independently in Visual Studio Code.

## Load Packages

In [None]:
# Install third-party packages before importing them
!pip install pandarallel
!pip install -U spacy
!python -m spacy download en_core_web_md

import multiprocessing
import os
import re
import sys
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import nltk.corpus
import pandas as pd
import requests
import seaborn as sns
import spacy
from nltk.chunk import ne_chunk
from nltk.tag import pos_tag
from nltk.text import Text
from nltk.tokenize import TweetTokenizer, word_tokenize
from pandarallel import pandarallel

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)

In [None]:
num_processors = multiprocessing.cpu_count()
print(f'Available CPUs: {num_processors}')
pandarallel.initialize(nb_workers=num_processors-1, use_memory_fs=False)

Available CPUs: 2
INFO: Pandarallel will run on 1 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.


## Set Up Data Paths

In [None]:
NEWS_PATH = 'https://storage.googleapis.com/msca-bdp-data-open/news/nlp_a_5_news.json'
TWEETS_PATH = 'https://storage.googleapis.com/msca-bdp-data-open/tweets/nlp_a_5_tweets.json'

## Data Import

In [None]:
news_df = pd.read_json(NEWS_PATH, orient='records', lines=True)
print(f'Sample contains {news_df.shape[0]:,.0f} news articles')
news_df.head(2)

Sample contains 10,012 news articles


Unnamed: 0,url,date,language,title,text
0,http://kokomoperspective.com/obituaries/jon-w-horton/article_b6ba8e1e-cb9c-11eb-9868-fb11b88b9778.html,2021-06-13,en,Jon W. Horton | Obituaries | kokomoperspective.com,Jon W. Horton | Obituaries | kokomoperspective.comYou have permission to edit this article. EditCloseSign Up Log In Dashboard LogoutMy Account Dashboard Profile Saved items LogoutCOVID-19Click here for the latest local news on COVID-19HomeAbout UsContact UsNewsLocalOpinionPoliticsNationalStateAgricultureLifestylesEngagements/Anniversaries/WeddingsAutosEntertainmentHealthHomesOutdoorsSportsNFLNCAAVitalsObituariesAutomotivee-EditionCouponsGalleries74°...
1,https://auto.economictimes.indiatimes.com/news/auto-components/birla-precision-to-ramp-up-capacity-to-tap-emerging-opportunities-in-india/81254902,2021-02-28,en,"Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto","Birla Precision to ramp up capacity to tap emerging opportunities in India, Auto News, ET Auto We have updated our terms and conditions and privacy policy Click ""Continue"" to accept and continue with ET AutoAccept the updated privacy & cookie policyDear user, ET Auto privacy and cookie policy has been updated to align with the new data regulations in European Union. Please review and accept these changes below to continue using the website.You can see our privacy policy & our cookie ..."


In [None]:
tweets_df = pd.read_json(TWEETS_PATH, orient='records', lines=True)
print(f'Sample contains {tweets_df.shape[0]:,.0f} tweets')
tweets_df.head(2)

Sample contains 10,105 tweets


Unnamed: 0,id,lang,date,name,retweeted,text
0,1534565117614084096,en,2022-06-08,Low Orbit Tourist 🌍📷,,"Body &amp; Assembly - Halewood - United Kingdom\n🌍53.3504,-2.8352296,402m\n\nHalewood Body &amp; Assembly is a Jaguar Land Rover factory in Halewood, England, and forms the major part of the Halewood complex which is shared with Ford who manufacture transmissions at the site. [Wikipedia] https://t.co/LPmCnZIaVt"
1,1534565743429394439,en,2022-06-08,CompleteCar.ie,RT,"Land Rover Ireland has announced that the new Range Rover Sport starts at €114,150, now on @completecar:\n\nhttps://t.co/TjGUkL3FYr https://t.co/QdVaEiJkjO"
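
Both files are read with `lines=True`, which tells pandas to treat the input as newline-delimited JSON, one record per line. The following is a minimal stdlib-only sketch of that idea, not the pandas implementation; the two records are hypothetical and only mimic the shape of the real data.

```python
import io
import json

# Two hypothetical records in newline-delimited JSON (one object per line).
raw = io.StringIO(
    '{"id": 1, "lang": "en", "text": "first record"}\n'
    '{"id": 2, "lang": "fr", "text": "second record"}\n'
)

# Parse each non-empty line independently, which is conceptually
# what lines=True asks pandas to do.
records = [json.loads(line) for line in raw if line.strip()]

print(len(records))        # number of parsed records
print(records[0]["text"])  # field access on the first record
```

Because each line is parsed on its own, a file in this format can also be streamed or sampled line by line without loading it whole.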


## Text Cleaning

### Functions Used

In [None]:
stop_words = set(nltk.corpus.stopwords.words('english'))

def cleaned_news(text, max_length=20):
    """
    Clean news text and news titles.
    To avoid aggressive cleaning, it removes only stop words, numbers, and
    unusually long tokens left over from web scraping.
    """
    text = re.sub(r'\d+', '', text)
    tokens = nltk.tokenize.word_tokenize(text)
    return ' '.join([token for token in tokens
        if token.lower() not in stop_words
        and not token.isnumeric()
        and len(token) <= max_length])

In [None]:
tweet_tokenizer = TweetTokenizer(preserve_case=True, strip_handles=True, reduce_len=True)

def cleaned_tweets(text):
    """
    Clean tweet text.
    Removes URLs, @handles, newlines, emojis, numbers, stop words, and the
    leading # of hashtags.
    """
    # Drop @handles, URLs, and anything starting with "www"
    text = re.sub(r'(?:@|https?://|www)\S+', '', text)
    text = re.sub(r'\n', '', text)

    emoji_pattern = re.compile(
        "["
        "\U0001F600-\U0001F64F"  # emoticons
        "\U0001F300-\U0001F5FF"  # symbols & pictographs
        "\U0001F680-\U0001F6FF"  # transport & map symbols
        "\U0001F1E0-\U0001F1FF"  # flags (iOS)
        "\U00002702-\U000027B0"
        "\U000024C2-\U0001F251"
        "]+",
        flags=re.UNICODE)
    text = emoji_pattern.sub('', text)
    text = re.sub(r'\d+', '', text)

    tokens = tweet_tokenizer.tokenize(text)

    return ' '.join([token.lstrip('#@') for token in tokens
            if token.lower().lstrip('#@') not in stop_words
            ])

### Cleaning

In [None]:
# Discard non-English results; .copy() keeps later column assignments from raising SettingWithCopyWarning
news_df = news_df[news_df['language'] == 'en'].copy()
tweets_df = tweets_df[tweets_df['lang'] == 'en'].copy()

In [None]:
news_df['cleaned_news_text'] = news_df['text'].apply(cleaned_news)
news_df['cleaned_news_title'] = news_df['title'].apply(cleaned_news)
tweets_df['cleaned_tokens'] = tweets_df['text'].apply(cleaned_tweets)

For both news and tweets, I took a conservative approach to text cleaning: preserve as much meaningful information as possible while removing noise and unnecessary computational cost. I kept the original casing and punctuation so that the NER models can still recognize and extract company entities correctly.
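
As a rough illustration of this conservative approach, the sketch below drops digits, stop words, and over-long tokens while leaving casing and punctuation untouched. It is a simplified stand-in, not the notebook's function: `conservative_clean` and `DEMO_STOP_WORDS` are hypothetical names, the stop-word set is a tiny placeholder for NLTK's English list, and a whitespace split replaces `word_tokenize`.

```python
import re

# Hypothetical, tiny stop-word set for illustration only; the notebook
# uses NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"the", "a", "is", "at", "of", "and", "in"}

def conservative_clean(text, max_length=20):
    """Drop digits, stop words and over-long tokens; keep case and punctuation."""
    text = re.sub(r"\d+", "", text)
    # A whitespace split stands in for nltk.word_tokenize in this sketch.
    tokens = text.split()
    return " ".join(
        tok for tok in tokens
        if tok.lower() not in DEMO_STOP_WORDS and len(tok) <= max_length
    )

sample = "The Range Rover Sport starts at 114150 euros in Ireland."
cleaned = conservative_clean(sample)
print(cleaned)
```

The capitalized tokens "Range Rover Sport" and "Ireland." survive unchanged, which is the property the NER step relies on.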

## Experiment with Various NER Packages & Sentence Segmentation Choices on Sampled News Articles

### Functions Used

In [None]:
def extract_org_with_seg(text):
    """
    Extract ORGANIZATION named entities with NLTK, WITH sentence segmentation.
    """
    organizations = []

    for sent in nltk.sent_tokenize(text):
        for chunk in ne_chunk(pos_tag(word_tokenize(sent)), binary = False):
            if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
                organizations.append(' '.join(c[0] for c in chunk))

    return organizations

In [None]:
def extract_org_without_seg(text):
    """
    Extract ORGANIZATION named entities with NLTK, WITHOUT sentence segmentation.
    """
    organizations = []
    for chunk in ne_chunk(pos_tag(word_tokenize(text)), binary = False):
        if hasattr(chunk, 'label') and chunk.label() == 'ORGANIZATION':
            organizations.append(' '.join(c[0] for c in chunk))
    return organizations

In [None]:
def extract_orgs_with_seg_spacy(text):
    """
    Extract ORG named entities with spaCy, WITH sentence segmentation.
    """
    doc = nlp(text)
    return [ent.text for sent in doc.sents for ent in sent.ents if ent.label_ == 'ORG']

In [None]:
def extract_orgs_without_seg_spacy(text):
    """
    Extract ORG named entities with spaCy, WITHOUT sentence segmentation.
    """
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ == 'ORG']

In [None]:
def count_and_sort(org_list):
    """
    This function counts the occurrences of each element in the input list and returns
    a sorted list of tuples. Each tuple contains an element and its count,
    sorted in descending order by count.
    """
    count = Counter(org_list)
    return sorted(count.items(), key=lambda x: x[1], reverse=True)
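
The cells that follow all use the same pattern: flatten a column of per-document entity lists, then pass the flat list to `count_and_sort`. Here is a small self-contained sketch of that pattern; the `per_doc_orgs` data is hypothetical, and the function is restated so the block runs on its own.

```python
from collections import Counter

def count_and_sort(items):
    """Return (item, count) pairs sorted by descending count."""
    return sorted(Counter(items).items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical per-document entity lists, shaped like the DataFrame columns.
per_doc_orgs = [["Ford", "Toyota"], ["Ford"], [], ["Honda", "Ford"]]

# Flatten the list of lists, then count.
flat = [org for doc in per_doc_orgs for org in doc]
ranked = count_and_sort(flat)
print(ranked[0])

# Counter.most_common() is the built-in equivalent of the sort above.
print(Counter(flat).most_common(1))
```

`Counter.most_common()` could replace the explicit `sorted` call if a shorter helper is preferred.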

### NLTK

In [None]:
# Sample 20% of news article dataset to experiment with different NER packages and options:
# both NLTK and SpaCy, also with and without sentence segmentation.
sampled_news_df = news_df.sample(frac=0.2, random_state=1)

In [None]:
# Try nltk with and without sentence segmentation on sampled article news text

sampled_news_df['news_text_orgs_with_seg'] = sampled_news_df['cleaned_news_text'].apply(extract_org_with_seg)
sampled_news_df['news_text_orgs_without_seg'] = sampled_news_df['cleaned_news_text'].apply(extract_org_without_seg)

news_text_orgs_with_seg_counts = count_and_sort([org for sublist in sampled_news_df['news_text_orgs_with_seg'] for org in sublist])
news_text_orgs_without_seg_counts = count_and_sort([org for sublist in sampled_news_df['news_text_orgs_without_seg'] for org in sublist])

df_news_text_with_seg = pd.DataFrame(news_text_orgs_with_seg_counts, columns=['Organization/Company', 'Count'])
df_news_text_without_seg = pd.DataFrame(news_text_orgs_without_seg_counts, columns=['Organization/Company', 'Count'])


In [None]:
# nltk results WITH sentence segmentation on sampled news article text
df_news_text_with_seg.head(10)

Unnamed: 0,Organization/Company,Count
0,MailOnline,1877
1,NYC,1561
2,LA,1079
3,COVID,1034
4,Conditions,815
5,Princess Diana,646
6,SUV,586
7,LACMA,578
8,US,575
9,Princess,507


In [None]:
# nltk results WITHOUT sentence segmentation on sampled news article text
df_news_text_without_seg.head(10)

Unnamed: 0,Organization/Company,Count
0,MailOnline,1877
1,NYC,1532
2,LA,1086
3,COVID,1035
4,Conditions,815
5,Princess Diana,662
6,SUV,592
7,LACMA,578
8,US,575
9,UK,501


Neither NLTK variant, with or without sentence segmentation, produced good results. Most of the top 10 "organizations" in both tables are not companies at all: NYC, LA, COVID, Conditions, US, and so on. Both variants yield many false positives and their counts barely differ, so neither is a good approach.
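
To put a number on how little the two NLTK variants differ, the sketch below compares their top-10 lists with a simple set overlap. The entity names are copied from the two tables above; the Jaccard measure is my own choice of summary and not part of the original analysis.

```python
# Top-10 NLTK entities copied from the two tables above.
with_seg = ["MailOnline", "NYC", "LA", "COVID", "Conditions",
            "Princess Diana", "SUV", "LACMA", "US", "Princess"]
without_seg = ["MailOnline", "NYC", "LA", "COVID", "Conditions",
               "Princess Diana", "SUV", "LACMA", "US", "UK"]

shared = set(with_seg) & set(without_seg)
union = set(with_seg) | set(without_seg)
jaccard = len(shared) / len(union)

print(f"shared entities: {len(shared)} of 10")
print(f"Jaccard overlap: {jaccard:.2f}")
print("only with segmentation:", sorted(set(with_seg) - set(without_seg)))
print("only without segmentation:", sorted(set(without_seg) - set(with_seg)))
```

Nine of the ten entries are shared, and the single differing entry in each list ("Princess" and "UK") is not a company either.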


### spaCy

In [None]:
nlp = spacy.load("en_core_web_md", disable = ["tagger", "lemmatizer", "textcat", "attribute_ruler", "tok2vec"])
nlp.pipe_names

['parser', 'ner']

In [None]:
with nlp.select_pipes(enable=["parser", "ner"]):
    print(nlp.pipe_names)
    nlp.analyze_pipes(pretty=True)

['parser', 'ner']

#   Component   Assigns               Requires   Scores             Retokenizes
-   ---------   -------------------   --------   ----------------   -----------
0   parser      token.dep                        dep_uas            False      
                token.head                       dep_las                       
                token.is_sent_start              dep_las_per_type              
                doc.sents                        sents_p                       
                                                 sents_r                       
                                                 sents_f                       
                                                                               
1   ner         doc.ents                         ents_f             False      
                token.ent_iob                    ents_p                        
                token.ent_type                   ents_r                        
                

In [None]:
sampled_news_df['news_text_orgs_with_seg_spacy'] = sampled_news_df['cleaned_news_text'].apply(extract_orgs_with_seg_spacy)
sampled_news_df['news_text_orgs_without_seg_spacy'] = sampled_news_df['cleaned_news_text'].apply(extract_orgs_without_seg_spacy)

In [None]:
news_text_orgs_with_seg_counts_spacy = count_and_sort([org for sublist in sampled_news_df['news_text_orgs_with_seg_spacy'] for org in sublist])
news_text_orgs_without_seg_counts_spacy = count_and_sort([org for sublist in sampled_news_df['news_text_orgs_without_seg_spacy'] for org in sublist])

df_news_text_with_seg_spacy = pd.DataFrame(news_text_orgs_with_seg_counts_spacy, columns=['Organization/Company', 'Count'])
df_news_text_without_seg_spacy = pd.DataFrame(news_text_orgs_without_seg_counts_spacy, columns=['Organization/Company', 'Count'])

In [None]:
# Spacy results WITH sentence segmentation on sampled news article text
df_news_text_with_seg_spacy.head(10)

Unnamed: 0,Organization/Company,Count
0,Netflix,1586
1,Facebook,1280
2,Ford,1240
3,resize=,1056
4,EV,996
5,Amazon,832
6,Hyundai,830
7,Toyota,822
8,Honda,771
9,Royal,594


In [None]:
# Spacy results WITHOUT sentence segmentation on sampled news article text
df_news_text_without_seg_spacy.head(10)

Unnamed: 0,Organization/Company,Count
0,Netflix,1586
1,Facebook,1280
2,Ford,1240
3,resize=,1056
4,EV,996
5,Amazon,832
6,Hyundai,830
7,Toyota,822
8,Honda,771
9,Royal,594


Both spaCy variants, with and without sentence segmentation, perform much better than the two NLTK variants. They produce fewer false positives, and the top 10 entities are almost all company names; the exceptions are "resize=" and "EV". The two variants return the same top 10 with identical counts, most likely because both run the same pipeline on the full text and iterating over `doc.sents` only regroups the entities already in `doc.ents`. Either one works; I chose the variant with sentence segmentation.
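
To see why iterating sentence by sentence need not change the extracted list, here is a toy, stdlib-only model. It is not spaCy's implementation: `Entity`, the token offsets, and the sentence ranges are all hypothetical. It only shows that when entities are detected once over the whole document and no entity crosses a sentence boundary, grouping them by sentence returns the same items in the same order.

```python
from typing import List, NamedTuple, Tuple

class Entity(NamedTuple):
    start: int  # index of the first token
    end: int    # index one past the last token
    text: str
    label: str

# Hypothetical document: entities found once over the whole text...
doc_ents = [
    Entity(0, 1, "Ford", "ORG"),
    Entity(5, 6, "Detroit", "GPE"),
    Entity(9, 10, "Toyota", "ORG"),
]
# ...and sentence boundaries as (start, end) token ranges.
sentences: List[Tuple[int, int]] = [(0, 7), (7, 12)]

def orgs_without_seg(ents):
    return [e.text for e in ents if e.label == "ORG"]

def orgs_with_seg(ents, sents):
    # Regroup the same entities by the sentence that contains them.
    return [
        e.text
        for s_start, s_end in sents
        for e in ents
        if s_start <= e.start and e.end <= s_end and e.label == "ORG"
    ]

flat_result = orgs_without_seg(doc_ents)
grouped_result = orgs_with_seg(doc_ents, sentences)
print(flat_result)
print(grouped_result)
```

A real difference between the two variants would only appear if sentences were split first and each sentence were passed to the model separately.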

## NER Analysis with Best Practice: spaCy with Sentence Segmentation on Entire Datasets

In [None]:
# Functions Used
def extract_location_with_seg_spacy(text):
    """
    This function extracts geographic location entities using Spacy with sentence segmentation.
    """
    doc = nlp(text)
    return [ent.text for sent in doc.sents for ent in sent.ents if ent.label_ == 'GPE']

In [None]:
news_df['news_text_orgs_with_seg_spacy'] = news_df['cleaned_news_text'].apply(extract_orgs_with_seg_spacy)
news_df['news_title_orgs_with_seg_spacy'] = news_df['cleaned_news_title'].apply(extract_orgs_with_seg_spacy)
tweets_df['tweets_orgs_with_seg_spacy'] = tweets_df['cleaned_tokens'].apply(extract_orgs_with_seg_spacy)

In [None]:
total_news_text_orgs_with_seg_counts_spacy = count_and_sort([org for sublist in news_df['news_text_orgs_with_seg_spacy'] for org in sublist])
total_news_title_orgs_with_seg_counts_spacy = count_and_sort([org for sublist in news_df['news_title_orgs_with_seg_spacy'] for org in sublist])
total_tweets_orgs_with_seg_counts_spacy = count_and_sort([org for sublist in tweets_df['tweets_orgs_with_seg_spacy'] for org in sublist])

df_total_news_text_orgs = pd.DataFrame(total_news_text_orgs_with_seg_counts_spacy, columns=['Company From News Text', 'Count'])
df_total_news_title_orgs = pd.DataFrame(total_news_title_orgs_with_seg_counts_spacy, columns=['Company From News Title', 'Count'])
df_total_tweets_orgs = pd.DataFrame(total_tweets_orgs_with_seg_counts_spacy, columns=['Company From Tweets', 'Count'])

In [None]:
# Table Results of Top-20 Sorted Companies From News Article Text
df_total_news_text_orgs.head(20)

Unnamed: 0,Company From News Text,Count
0,Netflix,6585
1,Facebook,6074
2,Ford,6057
3,Toyota,4901
4,EV,4394
5,Hyundai,4188
6,Honda,3923
7,Amazon,3689
8,resize=,3229
9,Land Rover,2743


The primary company from news article text is Netflix with a count of 6585.

In [None]:
# Table Results of Top-20 Sorted Companies From News Article Title
df_total_news_title_orgs.head(20)

Unnamed: 0,Company From News Title,Count
0,| Daily Mail,1140
1,Ontario | Carpages.ca,1108
2,Ford,287
3,Toyota,203
4,Chevrolet,198
5,British Columbia |,198
6,Hyundai,177
7,Honda,152
8,| Star News,125
9,Alberta | Carpages.ca,114


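The most frequent entity in news article titles is "| Daily Mail", with a count of 1,140. It and several other top entries, such as "Ontario | Carpages.ca" and "| Star News", are site-name suffixes from page titles rather than companies discussed in the articles. The most frequent actual companies in titles are Ford (287), Toyota (203), and Chevrolet (198).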
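
Many scraped titles end with site-name segments separated by "|", as in the sample rows shown in the Data Import section. One possible mitigation, sketched below, is to keep only the text before the first separator before running NER. `strip_site_suffix` is a hypothetical helper that is not part of the notebook, and it assumes the headline comes first, which will not hold for every site.

```python
import re

def strip_site_suffix(title):
    """Keep the text before the first '|' separator (hypothetical helper)."""
    return re.split(r"\s*\|\s*", title, maxsplit=1)[0].strip()

# Titles copied from the two sample news rows shown earlier.
titles = [
    "Jon W. Horton | Obituaries | kokomoperspective.com",
    "Birla Precision to ramp up capacity to tap emerging opportunities "
    "in India, Auto News, ET Auto",
]
stripped = [strip_site_suffix(t) for t in titles]
for s in stripped:
    print(s)
```

Titles without a "|" pass through unchanged, so comma-separated suffixes like "Auto News, ET Auto" would need a separate rule.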

In [None]:
# Table Results of Top-20 Sorted Companies From Tweets
df_total_tweets_orgs.head(20)

Unnamed: 0,Company From Tweets,Count
0,Land Rover,1774
1,Jaguar Land Rover,1019
2,Land Rover Defender,430
3,BMW,366
4,Audi,339
5,Mercedes-Benz,302
6,Citroen General Motors,277
7,eBay,181
8,Jaguar,156
9,Ford,108


The primary company from tweets is Land Rover with a count of 1774.

In [None]:
# Now examine what other companies are most frequently mentioned along with Netflix in news article text
news_text_primary_company = "Netflix"
# .copy() makes the filtered frame independent, so adding a column does not trigger SettingWithCopyWarning
news_texts_with_netflix = news_df[news_df['cleaned_news_text'].str.contains(news_text_primary_company, case=False)].copy()
news_texts_with_netflix['other_companies_with_netflix'] = news_texts_with_netflix['cleaned_news_text'].apply(extract_orgs_with_seg_spacy)

other_companies_with_netflix = count_and_sort([org for sublist in news_texts_with_netflix['other_companies_with_netflix']
                                    for org in sublist if org.lower() != news_text_primary_company.lower()])

df_other_companies_with_netflix = pd.DataFrame(other_companies_with_netflix, columns=['Other Companies with Netflix From News Text', 'Count'])


In [None]:
# Other companies that are most frequently mentioned along with Netflix in news article texts
df_other_companies_with_netflix.head(10)

Unnamed: 0,Other Companies with Netflix From News Text,Count
0,Facebook,5211
1,Amazon,3203
2,Royal,2533
3,Facebook Timeline,2444
4,House,2413
5,Instagram,2216
6,BBC,2003
7,White House,1908
8,IndexMobile,1761
9,PrintsOur PapersTop,1761


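The table above shows the 10 entities most frequently mentioned alongside the primary company, Netflix, in the news article texts. Genuine companies include Facebook, Amazon, Instagram, and BBC, while entries such as "IndexMobile" and "PrintsOur PapersTop" are web-scraping residue rather than companies.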
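
The co-mention logic can be sketched in plain Python on toy data. Note one deliberate simplification: the notebook selects articles whose cleaned text contains the primary name as a substring, whereas this sketch selects documents whose extracted entity list contains it. `co_mentions` and the `docs` lists are hypothetical.

```python
from collections import Counter

def co_mentions(docs_entities, primary):
    """Count entities in documents that mention `primary`, excluding it."""
    primary_lc = primary.lower()
    counts = Counter()
    for ents in docs_entities:
        # Keep only documents where the primary company was extracted.
        if any(e.lower() == primary_lc for e in ents):
            counts.update(e for e in ents if e.lower() != primary_lc)
    return counts.most_common()

# Hypothetical per-article ORG lists.
docs = [
    ["Netflix", "Amazon", "Facebook"],
    ["Ford", "Toyota"],
    ["netflix", "Facebook", "Facebook"],
]
result = co_mentions(docs, "Netflix")
print(result)
```

The second document never mentions the primary company, so Ford and Toyota do not contribute to the counts.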

In [None]:
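# Now examine what other companies are most frequently mentioned along with Land Rover in tweets
tweets_primary_company = "Land Rover"

# .copy() makes the filtered frame independent, so adding a column does not trigger SettingWithCopyWarning
tweets_with_land_rover = tweets_df[tweets_df['cleaned_tokens'].str.contains(tweets_primary_company, case=False)].copy()
tweets_with_land_rover['other_companies_with_land_rover'] = tweets_with_land_rover['cleaned_tokens'].apply(extract_orgs_with_seg_spacy)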

other_companies_with_land_rover= count_and_sort([org for sublist in tweets_with_land_rover['other_companies_with_land_rover']
                                    for org in sublist if org.lower() != tweets_primary_company.lower()])

df_other_companies_with_land_rover = pd.DataFrame(other_companies_with_land_rover, columns=['Other Companies with Land Rover From Tweets', 'Count'])

In [None]:
# Other companies that are most frequently mentioned along with Land Rover in tweets
df_other_companies_with_land_rover.head(10)

Unnamed: 0,Other Companies with Land Rover From Tweets,Count
0,Jaguar Land Rover,1019
1,Land Rover Defender,430
2,BMW,366
3,Audi,339
4,Mercedes-Benz,302
5,Citroen General Motors,277
6,eBay,181
7,Jaguar,156
8,Ford,108
9,Tesla,99


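The table above shows the 10 entities most frequently mentioned alongside the primary company, Land Rover, in the tweets. "Jaguar Land Rover" and "Land Rover Defender" are variants of the primary company itself; the other companies include BMW, Audi, Mercedes-Benz, eBay, Ford, and Tesla.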
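
Because the exclusion filter compares whole entity strings, longer names that contain "Land Rover" stay in the ranking. A minimal sketch of separating such variants with a substring check follows; the counts are copied from the table above, and the substring rule is an assumption of mine that may be too coarse for other company names.

```python
# (entity, count) pairs copied from the table above.
co_mentioned = [
    ("Jaguar Land Rover", 1019), ("Land Rover Defender", 430), ("BMW", 366),
    ("Audi", 339), ("Mercedes-Benz", 302), ("Citroen General Motors", 277),
    ("eBay", 181), ("Jaguar", 156), ("Ford", 108), ("Tesla", 99),
]
primary = "Land Rover"

# Treat any entity containing the primary name as a variant of it.
variants = [(e, c) for e, c in co_mentioned if primary.lower() in e.lower()]
others = [(e, c) for e, c in co_mentioned if primary.lower() not in e.lower()]

print("variants:", variants)
print("top other company:", others[0])
```

With the variants set aside, BMW becomes the most frequently co-mentioned company in this top-10 list; "Jaguar" on its own is left in because it does not contain the primary name.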

In [None]:
news_df['news_text_loc_with_seg_spacy'] = news_df['cleaned_news_text'].apply(extract_location_with_seg_spacy)
news_df['news_title_loc_with_seg_spacy'] = news_df['cleaned_news_title'].apply(extract_location_with_seg_spacy)
tweets_df['tweets_loc_with_seg_spacy'] = tweets_df['cleaned_tokens'].apply(extract_location_with_seg_spacy)

total_news_text_loc_with_seg_counts_spacy = count_and_sort([loc for sublist in news_df['news_text_loc_with_seg_spacy'] for loc in sublist])
total_news_title_loc_with_seg_counts_spacy = count_and_sort([loc for sublist in news_df['news_title_loc_with_seg_spacy'] for loc in sublist])
total_tweets_loc_with_seg_counts_spacy = count_and_sort([loc for sublist in tweets_df['tweets_loc_with_seg_spacy'] for loc in sublist])

df_total_news_text_loc = pd.DataFrame(total_news_text_loc_with_seg_counts_spacy, columns=['Locations From News Text', 'Count'])
df_total_news_title_loc = pd.DataFrame(total_news_title_loc_with_seg_counts_spacy, columns=['Locations From News Title', 'Count'])
df_total_tweets_loc = pd.DataFrame(total_tweets_loc_with_seg_counts_spacy, columns=['Locations From Tweets', 'Count'])

In [None]:
# Top 10 most frequent locations of events from news article text
df_total_news_text_loc.head(10)

Unnamed: 0,Locations From News Text,Count
0,LA,15742
1,UK,10308
2,US,9436
3,London,8125
4,NYC,7750
5,Los Angeles,6829
6,New York City,6580
7,Hollywood,6040
8,Australia,5217
9,Miami,4897


In [None]:
# Top 10 most frequent locations of events from news article title
df_total_news_title_loc.head(10)

Unnamed: 0,Locations From News Title,Count
0,Manitoba,180
1,UK,168
2,India,81
3,US,79
4,Cambridge,77
5,North York,62
6,U.S.,60
7,London,58
8,Toronto,55
9,China,54


In [None]:
# Top 10 most frequent locations of events from tweets
df_total_tweets_loc.head(10)

Unnamed: 0,Locations From Tweets,Count
0,Russia,472
1,UK,306
2,NigelAndArron,190
3,Zimbabwe,87
4,India,85
5,Cambridge,69
6,Britain,59
7,Jamaica,51
8,Netherlands,47
9,weekChase,40
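
Several rows in the location tables appear to name the same place under different spellings, for example "LA" and "Los Angeles" or "NYC" and "New York City" in the news text results. The sketch below merges such aliases before ranking. The counts are copied from the news-text location table; the `ALIASES` map is a hypothetical assumption and would need checking against the data, since "LA" could in principle also stand for Louisiana.

```python
from collections import Counter

# Counts copied from the news-text location table above.
location_counts = {
    "LA": 15742, "UK": 10308, "US": 9436, "London": 8125, "NYC": 7750,
    "Los Angeles": 6829, "New York City": 6580, "Hollywood": 6040,
    "Australia": 5217, "Miami": 4897,
}

# Hypothetical alias map; extend it after inspecting real output.
ALIASES = {"LA": "Los Angeles", "NYC": "New York City", "U.S.": "US"}

merged = Counter()
for name, count in location_counts.items():
    # Map each spelling to its canonical name, or keep it as is.
    merged[ALIASES.get(name, name)] += count

top3 = merged.most_common(3)
print(top3)
```

The same map could be applied to the per-document lists before calling `count_and_sort`, which would also fold "U.S." into "US" in the title results.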
