<img src="https://www.nlplanet.org/images/NLP_tasks.png" width="800">


# Named Entity Recognition (NER)
## Example: 2019 Mergers and Acquisitions (M&A)
<img src="https://www.netscribes.com/wp-content/uploads/2021/06/MA-Analysis.jpg" width="700">

In [0]:
%sh
python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.5.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.5.0/en_core_web_sm-3.5.0-py3-none-any.whl (12.8 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 40.3 MB/s eta 0:00:00
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.5.0



✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [0]:
import pandas as pd
import spacy
nlp = spacy.load("en_core_web_sm")

In [0]:
ma_data_2019 = [{"id":1, "description":"In the pharmaceutical sector, Bristol-Myers Squibb, led by CEO Giovanni Caforio, acquired Celgene, and AbbVie, under CEO Richard A. Gonzalez, acquired Allergan, creating major healthcare entities. These deals were based in New York, USA, and New Jersey, USA, respectively."},
{"id":2, "description":"The aerospace and defense industry saw a merger between Raytheon and United Technologies, led by CEOs Thomas A. Kennedy and Gregory J. Hayes. The resulting Raytheon Technologies Corporation operates out of Massachusetts, USA, and Connecticut, USA."},
{"id":3, "description":"In the telecommunications field, T-Mobile and Sprint merged to form the 'New T-Mobile.' Key figures included John Legere and Marcelo Claure. The companies are based in Washington, USA, and Kansas, USA."},
{"id":4, "description":"Fidelity National Information Services (FIS) acquired Worldpay, with CEO Gary Norcross at the helm. FIS is located in Florida, USA, while Worldpay operates from Georgia, USA."},
{"id":5, "description":"Occidental Petroleum, led by CEO Vicki Hollub, acquired Anadarko Petroleum in the energy sector. Both companies are headquartered in Texas, USA."},
{"id":6, "description":"Aptiv and Hyundai formed an autonomous driving joint venture called 'Motional,' with Karl Iagnemma as President. Aptiv is based in Ireland, with headquarters in the UK and the USA, while Hyundai Motor Group is in South Korea."},
{"id":7, "description":"In the luxury goods industry, LVMH, headed by CEO Bernard Arnault, acquired Tiffany & Co. LVMH is based in France, and Tiffany & Co. is headquartered in New York, USA."},
{"id":8, "description":"Charles Schwab acquired TD Ameritrade, a major brokerage firm, with Walt Bettinger II as CEO. Charles Schwab operates from California, USA, and TD Ameritrade from Nebraska, USA."},
{"id":9, "description":"Eldorado Resorts, led by CEO Thomas Reeg, acquired Caesars Entertainment, creating a major casino and entertainment company. Both companies are based in Nevada, USA."},
{"id":10, "description":"Chevron, with CEO Michael K. Wirth, acquired Anadarko Petroleum in the oil and gas industry. Chevron is headquartered in California, USA, and Anadarko Petroleum in Texas, USA."},
{"id":11, "description":"Morgan Stanley, under CEO James P. Gorman, acquired ETRADE Financial Corporation in the online brokerage and financial services sector. Morgan Stanley operates out of New York, USA, and ETRADE from New York, USA."},
{"id":12, "description":"Fiserv, with Jeffery W. Yabuki as CEO, acquired First Data Corporation, a payment technology provider. Fiserv is based in Wisconsin, USA, and First Data Corporation in Georgia, USA."},
{"id":13, "description":"Amgen, led by CEO Robert A. Bradway, acquired Otezla from Celgene. Amgen is headquartered in California, USA, and Celgene in New Jersey, USA."},
{"id":14, "description":"Marriott Vacations Worldwide, with CEO Stephen P. Weisz, acquired ILG, a provider of vacation experiences and services. Marriott Vacations Worldwide operates out of Florida, USA, and ILG from Florida, USA."},
{"id":15, "description":"UnitedHealth Group, under CEO David S. Wichmann, acquired DaVita Medical Group, a large medical group and outpatient care provider. UnitedHealth Group is based in Minnesota, USA, and DaVita Medical Group in California, USA."},
{"id":16, "description":"Roche, with CEO Severin Schwan, acquired Spark Therapeutics, a biotechnology company specializing in gene therapies. Roche is headquartered in Switzerland, and Spark Therapeutics operates from Pennsylvania, USA."},
{"id":17, "description":"Barrick Gold, led by CEO Mark Bristow, merged with Randgold Resources, creating the world's largest gold mining company, Barrick, based in Canada and the Jersey, Channel Islands."},
{"id":18, "description":"Xerox Holdings, with CEO John Visentin, made an unsolicited takeover bid to acquire HP Inc. Xerox Holdings operates out of Connecticut, USA, while HP Inc. is based in California, USA."},
{"id":19, "description":"Nestlé, under CEO Mark Schneider, acquired Aimmune Therapeutics, a biopharmaceutical company specializing in food allergy treatments. Nestlé is headquartered in Switzerland, and Aimmune Therapeutics is in California, USA."}]

In [0]:
# Select the named entity categories that we want to recognize:
# PERSON (people), ORG (organizations), GPE (countries, cities, states).
ner_categories = ['PERSON', 'ORG', 'GPE']

In [0]:
df = pd.DataFrame(ma_data_2019)
print(df.info())
df.head(5)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19 entries, 0 to 18
Data columns (total 2 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   id           19 non-null     int64 
 1   description  19 non-null     object
dtypes: int64(1), object(1)
memory usage: 432.0+ bytes
None


Unnamed: 0,id,description
0,1,"In the pharmaceutical sector, Bristol-Myers Sq..."
1,2,The aerospace and defense industry saw a merge...
2,3,"In the telecommunications field, T-Mobile and ..."
3,4,Fidelity National Information Services (FIS) a...
4,5,"Occidental Petroleum, led by CEO Vicki Hollub,..."


# Named Entity Recognition and Visualization

In [0]:
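def ner(row, visual=True):
  """
  Run spaCy NER on row['description'] and return the row with three new
  list-valued fields: person_ent, org_ent, and gpe_ent.
  When visual is True, also display the text with its entities highlighted.
  """
  doc = nlp(row['description'])
  # Keep only the entity types listed in ner_categories.
  entities = [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in ner_categories]
  # Split the (text, label) pairs into one list per category.
  row['person_ent'] = [text for text, label in entities if label == 'PERSON']
  row['org_ent'] = [text for text, label in entities if label == 'ORG']
  row['gpe_ent'] = [text for text, label in entities if label == 'GPE']
  if visual:
    # jupyter=False makes render() return the HTML string;
    # displayHTML is a Databricks notebook built-in.
    html = spacy.displacy.render(doc, style="ent", jupyter=False)
    displayHTML(html)
  return row

# Apply row by row; each call adds the three entity columns.
df_ent = df.apply(ner, axis=1)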
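
A description often mentions the same company or place more than once, so the same entity text can land in a list several times. If repeats are not wanted, the grouping step can drop them while keeping first-seen order. The following is a minimal sketch under that assumption: `group_entities` is a hypothetical helper (not part of spaCy), and it works on plain `(text, label)` tuples like the ones built inside `ner`, so it runs without a model.

```python
def group_entities(entities, categories=("PERSON", "ORG", "GPE")):
    """Group (text, label) pairs by label, keeping each text once per label."""
    grouped = {label: [] for label in categories}
    for text, label in entities:
        # Ignore labels outside the selected categories and repeated texts.
        if label in grouped and text not in grouped[label]:
            grouped[label].append(text)
    return grouped


# Illustrative input: hand-written pairs, not real model output.
sample = [
    ("Chevron", "ORG"),
    ("Michael K. Wirth", "PERSON"),
    ("Chevron", "ORG"),
    ("Anadarko Petroleum", "ORG"),
    ("California", "GPE"),
    ("USA", "GPE"),
    ("Texas", "GPE"),
    ("USA", "GPE"),
    ("2019", "DATE"),
]

grouped = group_entities(sample)
print(grouped["PERSON"])
print(grouped["ORG"])
print(grouped["GPE"])
```

Whether to deduplicate depends on the goal: mention counts are lost, which matters if frequency is used later as a signal.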

In [0]:
df_ent

Unnamed: 0,id,description,person_ent,org_ent,gpe_ent
0,1,"In the pharmaceutical sector, Bristol-Myers Sq...","[Giovanni Caforio, Celgene, Richard A. Gonzalez]",[Bristol-Myers Squibb],"[Allergan, New York, USA, New Jersey, USA]"
1,2,The aerospace and defense industry saw a merge...,"[Thomas A. Kennedy, Gregory J. Hayes]","[Raytheon, United Technologies, Raytheon Techn...","[Massachusetts, USA, Connecticut, USA]"
2,3,"In the telecommunications field, T-Mobile and ...",[John Legere],[Sprint],"[Washington, USA, Kansas, USA]"
3,4,Fidelity National Information Services (FIS) a...,[Gary Norcross],"[Fidelity National Information Services, FIS, ...","[Florida, USA, Georgia, USA]"
4,5,"Occidental Petroleum, led by CEO Vicki Hollub,...",[Vicki Hollub],[Occidental Petroleum],"[Anadarko, Texas, USA]"
5,6,Aptiv and Hyundai formed an autonomous driving...,"[Aptiv, Karl Iagnemma, Aptiv]","[Hyundai, Hyundai Motor Group]","[Ireland, UK, USA, South Korea]"
6,7,"In the luxury goods industry, LVMH, headed by ...",[Bernard Arnault],"[Tiffany & Co., Tiffany & Co.]","[France, New York, USA]"
7,8,"Charles Schwab acquired TD Ameritrade, a major...","[Charles Schwab, Walt Bettinger II, Charles Sc...",[TD Ameritrade],"[TD Ameritrade, California, USA, Nebraska, USA]"
8,9,"Eldorado Resorts, led by CEO Thomas Reeg, acqu...","[Eldorado Resorts, Thomas Reeg]",[Caesars Entertainment],"[Nevada, USA]"
9,10,"Chevron, with CEO Michael K. Wirth, acquired A...",[Michael K. Wirth],"[Chevron, Chevron, Anadarko Petroleum]","[Anadarko, California, USA, Texas, USA]"
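
Several labels in the table above are wrong for this domain: `Celgene`, `Aptiv`, and `Eldorado Resorts` are tagged `PERSON`, and `Allergan` and `Anadarko` are tagged `GPE`, although all of them are companies. One lightweight way to handle known cases is a lookup applied to the `(text, label)` pairs before they are grouped. This is only a sketch: `KNOWN_ORGS` and `relabel` are hypothetical names, and the set of companies is an assumption drawn from the rows shown here.

```python
# Hypothetical override set: entity texts known to be companies in this dataset.
KNOWN_ORGS = {"Celgene", "Allergan", "Anadarko", "Aptiv", "Eldorado Resorts"}


def relabel(entities, known_orgs=KNOWN_ORGS):
    """Return (text, label) pairs, forcing the label to ORG for known companies."""
    return [
        (text, "ORG" if text in known_orgs else label)
        for text, label in entities
    ]


# Labels copied from the first row of the table above (partial list).
row0 = [
    ("Bristol-Myers Squibb", "ORG"),
    ("Giovanni Caforio", "PERSON"),
    ("Celgene", "PERSON"),
    ("Richard A. Gonzalez", "PERSON"),
    ("Allergan", "GPE"),
    ("New York", "GPE"),
]

fixed = relabel(row0)
for text, label in fixed:
    print(f"{label:7} {text}")
```

A fixed lookup only corrects names that are already known; it does not help with unseen companies, for which a larger model or custom rules would be needed.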


# Save Output

In [0]:
file_path = "/dbfs/FileStore/tables/ent.json"
df_ent.to_json(file_path, orient='records')
print(f'DataFrame saved to "{file_path}".')

DataFrame saved to "/dbfs/FileStore/tables/ent.json".
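
With `orient='records'`, `to_json` writes a single JSON array containing one object per row, so the saved file can be checked without pandas. The sketch below assumes that layout. It writes two illustrative records to a temporary file (standing in for `ent.json`, with entity values copied from rows shown above and the `description` field omitted) and reads them back with the standard `json` module.

```python
import json
import os
import tempfile

# Illustrative records in the shape produced by orient='records'.
records = [
    {"id": 5, "person_ent": ["Vicki Hollub"],
     "org_ent": ["Occidental Petroleum"],
     "gpe_ent": ["Anadarko", "Texas", "USA"]},
    {"id": 9, "person_ent": ["Eldorado Resorts", "Thomas Reeg"],
     "org_ent": ["Caesars Entertainment"],
     "gpe_ent": ["Nevada", "USA"]},
]

# Temporary path used here instead of the DBFS location.
path = os.path.join(tempfile.mkdtemp(), "ent.json")
with open(path, "w", encoding="utf-8") as f:
    json.dump(records, f)

# Read the array back: one dict per row, list columns stay lists.
with open(path, encoding="utf-8") as f:
    loaded = json.load(f)

for rec in loaded:
    print(rec["id"], rec["org_ent"])
```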


In [0]:
text = df.iloc[0]['description']  # first row, selected by column name rather than position
doc = nlp(text)
entities = []
for ent in doc.ents:
  if ent.label_ in ner_categories:
    entities.append((ent.text, ent.label_))
html = spacy.displacy.render(doc, style = "ent")
displayHTML(html)

In [0]:
spacy.displacy.render(doc, style = "ent")

'<div class="entities" style="line-height: 2.5; direction: ltr">In the pharmaceutical sector, \n<mark class="entity" style="background: #7aecec; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Bristol-Myers Squibb\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">ORG</span>\n</mark>\n, led by CEO \n<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Giovanni Caforio\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem">PERSON</span>\n</mark>\n, acquired \n<mark class="entity" style="background: #aa9cfc; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;">\n    Celgene\n    <span style="font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; verti