<a href="https://colab.research.google.com/github/yuvraj-singh/Codes/blob/master/NLP_Problem.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Problem:
In financial analysis, fully understanding a company's business requires knowing all of its subsidiaries.
In Europe, companies must report all of their subsidiaries in a document, which makes this task easier. However, a company reports every legal entity of a subsidiary, while you may only need the main business unit. For example, Microsoft will report `Linkedin US LLC`, `Linkedin UK Ltd`, and `Linkedin India Pvt Ltd`, whereas all we want to know is that `Linkedin` is a subsidiary of Microsoft.
    
#### To do: 
The subsidiary list of `Rolls Royce`, a UK-based airline parts manufacturer, can be found on pages 1-5 of this document: https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/annual-report/rr-ar2016-other-information.pdf
You have to create a pipeline that takes the raw PDF as input and outputs the main subsidiary names of Rolls Royce. Some suggested steps:
- Download pdf
- Extract the tables into a text/tabular format
- Extract all subsidiary legal entities from the table
- Condense similar subsidiary legal entities into a single subsidiary name

For example, rows 5-13 of the table would be extracted as:
- Bergen Engines AS 
- Bergen Engines Bangladesh Private Limited 
- Bergen Engines BV 
- Bergen Engines Denmark A/S 
- Bergen Engines India Private Limited 
- Bergen Engines Limited 
- Bergen Engines PropertyCo AS 
- Bergen Engines S.L. 
- Bergen Engines S.r.l. 

which should condense into a single subsidiary, `Bergen Engines`
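
One way to read "condense" here is to keep the longest run of leading words that every legal entity in a group shares. The sketch below illustrates that idea on the Bergen Engines rows; the function name `common_word_prefix` is hypothetical, and it assumes the names have already been grouped.

```python
def common_word_prefix(names):
    """Return the longest run of leading words shared by every name."""
    if not names:
        return ""
    token_lists = [name.split() for name in names]
    prefix = []
    # zip stops at the shortest name, so we never index past its end
    for words in zip(*token_lists):
        if len(set(words)) != 1:
            break
        prefix.append(words[0])
    return " ".join(prefix)


legal_entities = [
    "Bergen Engines AS",
    "Bergen Engines Bangladesh Private Limited",
    "Bergen Engines BV",
    "Bergen Engines Denmark A/S",
    "Bergen Engines India Private Limited",
    "Bergen Engines Limited",
    "Bergen Engines PropertyCo AS",
    "Bergen Engines S.L.",
    "Bergen Engines S.r.l.",
]
print(common_word_prefix(legal_entities))  # Bergen Engines
```

Note that the result is only as good as the grouping: two unrelated companies that happen to share a first word would collapse to that word.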

#### Output: 
The output of your pipeline should be a list of subsidiary names of Rolls Royce. A sample output (for the data on page 1) would look like:

```['Rolls-Royce', 'A. F. C. Wultex', 'A.P.E. – Allen Gears', 'Allen Power Engineering', 'Amalgamated Power Engineering',  'Bergen Engines', 'Bristol Siddeley Engines Limited', 'Brooks Inspection Solutions', 'Brown Brothers & Company', 'C.A. Parsons & Company', 'Composite Technology and Applications', 'Croydon Energy Limited', 'Data Systems & Solutions', 'Deeside Titanium', 'Derby Cogeneration', 'Derby Specialist Fabrications', 'Europea Microfusioni Aerospaziali', 'Exeter Power', 'Fluid Mechanics', 'Heartlands Power', 'Heaton Power', 'Kalvet Engineering', 'Karl Maybach-Hilfe', 'Mansfield Holdings Limited', 'Kamewa', 'MTU', "L'Orange", 'John Thompson'] ```


### Bonus Problem:
Is this model generalisable? Let's try a new company, British American Tobacco. The subsidiaries are on pages 241-250 of this document: https://www.annualreports.com/HostedData/AnnualReports/PDF/LSE_BATS_2019.pdf

Try the pipeline just on page 241. What is your output? Are you happy with this output? 

In [None]:
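# installing PyPDF2 library
# (pinned below 3.0 because this notebook uses the older PdfFileReader / getPage / extractText API)
!pip install "PyPDF2<3"
import PyPDF2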


# Downloading pdf files
!wget  https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/annual-report/rr-ar2016-other-information.pdf
file_name = 'rr-ar2016-other-information.pdf'
ground_truth = ['Rolls-Royce', 'A. F. C. Wultex', 'A.P.E. – Allen Gears', 'Allen Power Engineering', 'Amalgamated Power Engineering', 'Bergen Engines', 
                'Bristol Siddeley Engines', 'Brooks Inspection Solutions', 'Brown Brothers & Company', 'C.A. Parsons & Company', 
                'Composite Technology and Applications', 'Croydon Energy', 'Data Systems & Solutions', 'Deeside Titanium', 'Derby Cogeneration', 
                'Derby Specialist Fabrications', 'Europea Microfusioni Aerospaziali', 'Exeter Power', 'Fluid Mechanics', 'Heartlands Power',
                'Heaton Power', 'Kalvet Engineering', 'Karl Maybach-Hilfe' , 'Mansfield Holdings Limited', 'Kamewa', 'MTU', 'L\'Orange', 'John Thompson']

# !wget https://www.bat.com/ar/2019/pdf/BAT_Annual_Report_and_Form_20-F_2019.pdf
# file_name = 'BAT_Annual_Report_and_Form_20-F_2019.pdf'


# reading pdf file
reader = PyPDF2.PdfFileReader(file_name)

--2021-09-21 12:47:53--  https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/annual-report/rr-ar2016-other-information.pdf
Resolving www.rolls-royce.com (www.rolls-royce.com)... 23.66.114.64, 23.66.114.66
Connecting to www.rolls-royce.com (www.rolls-royce.com)|23.66.114.64|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 363607 (355K) [application/pdf]
Saving to: ‘rr-ar2016-other-information.pdf.7’


2021-09-21 12:47:55 (4.36 MB/s) - ‘rr-ar2016-other-information.pdf.7’ saved [363607/363607]



In [None]:
################ Provide input pages ##########################
list_pages = [0,1]

In [None]:
pages_raw_data = []

for i in list_pages:
  # reading pages one by one
  data = reader.getPage(i).extractText()
  pages_raw_data.append(data)

# splitting each page's text into lines; despite the variable names below, each item is a line (a table cell or text fragment), not a single word
for i in range(len(pages_raw_data)):
  pages_raw_data[i] = pages_raw_data[i].split('\n')

# merging the lists of lines for all pages into one
complete_list_words = [y for x in pages_raw_data for y in x]
complete_list_words
reader.getPage(0).extractText()

"Subsidiaries\n* Dormant entity.\n1 Moor Lane, Derby, DE24 8BJ, England.\n2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.\n3 62 Buckingham Gate, London, SW1E 6AT, England.\nCompany name\nAddress\nClass \n of shares\n% of \n class held\nA. F. C. Wultex Limited*\nDerby\n 1Ordinary\n90\nA.P.E. Œ Allen Gears Limited*\nDerby\n 1Ordinary\n100\nAllen Power Engineering Limited*\nDerby\n 1Ordinary\n100\nAmalgamated Power Engineering Limited*\nDerby\n 1Deferred \n100\nOrdinary\n100\nBergen Engines AS\nHordvikneset 125, N-5108, Hordvik, Bergen 1201, Norway\nOrdinary\n100\nBergen Engines Bangladesh Private Limited\nGreen Granduer, 6th Floor, Plot n.58 E, Kamal Ataturk Avenue Banani, C/A \nDhaka, 1213, Bangladesh\nOrdinary\n100\nBergen Engines BV\nWerfd˜k 2, 3195HV Pernis, Rotterdam, Netherlands\nOrdinary\n100\nBergen Engines Denmark A/S\nVærftsvej 23, 9000 Ålborg, Denmark\nOrdinary\n100\nBergen Engines India Private Limited\n52-b, 2nd Floor, Okh

In [None]:
# importing spacy and loading spacy's NER model
import spacy
!python -m spacy download en_core_web_lg
!python -m spacy download en
# !python -m spacy link en_core_web_lg en
NER = spacy.load("en_core_web_lg")

Collecting en_core_web_lg==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-2.2.5/en_core_web_lg-2.2.5.tar.gz (827.9 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_lg')
Collecting en_core_web_sm==2.2.5
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz (12.0 MB)
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')
✔ Linking successful
/usr/local/lib/python3.7/dist-packages/en_core_web_sm -->
/usr/local/lib/python3.7/dist-packages/spacy/data/en
You can now load the model via spacy.load('en')


In [None]:
# filtering out words by entity tags and keeping only words with 'ORG' tag
subsidiary_list = []
for word in complete_list_words:
  result  = NER(word)
  for ent in result.ents:
    if ent.label_ == 'ORG':
      subsidiary_list.append(ent.text)
subsidiary_list.sort()
subsidiary_list

!pip install cleanco
from cleanco import cleanco


# keeping only the lines in which cleanco detects a legal-entity suffix (Limited, GmbH, LLC, ...),
# with the suffix stripped; note that this replaces the NER-based subsidiary_list built above
import copy
import re
temp = copy.deepcopy(complete_list_words)
subsidiary_list = []
for word in temp:
  # if re.search(" limited| gmbh| company| s\.p\.a\.| llc|proprietary| inc| ltd| limitada| ab", word,  re.IGNORECASE):
  x = cleanco(word)
  if x.type():
    subsidiary_list.append(x.clean_name())



# sorting the list of organizational entities
subsidiary_list.sort()
subsidiary_list
# List of all the suffixes to be used later for improvement
# https://en.wikipedia.org/wiki/List_of_legal_entity_types_by_country



['2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.',
 '2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.',
 '75NEI Combustion Engineering',
 'A. F. C. Wultex',
 'A.P.E. Œ Allen Gears',
 'Allen Power Engineering',
 'Amalgamated Power Engineering',
 'Bergen Engines',
 'Bergen Engines',
 'Bergen Engines',
 'Bergen Engines',
 'Bergen Engines Bangladesh',
 'Bergen Engines Denmark',
 'Bergen Engines India',
 'Bergen Engines PropertyCo',
 'Bristol Siddeley Engines',
 'Brooks Inspection Solutions',
 'Brown Brothers & Company',
 'C.A. Parsons & Company',
 'Company name',
 'Company name',
 'Composite Technology and Applications',
 'Corporation Service Company, 100 North Main Street, Suite 2, Barre, VT',
 'Croydon Energy',
 'Data Systems & Solutions',
 'Deeside Titanium',
 'Derby Cogeneration',
 'Derby Specialist Fabrications',
 'Europea Microfusioni Aerospaziali',
 'Exeter Power',
 'Fluid Mecha
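
The commented-out `re.search` line in the cell above hints at a regex-only alternative to cleanco. The following is a minimal sketch of that idea, not a replacement for cleanco: the suffix list `LEGAL_SUFFIXES` is a small, deliberately incomplete assumption, and `strip_legal_suffix` is a hypothetical helper name.

```python
import re

# Illustrative suffix list (an assumption, far from complete)
LEGAL_SUFFIXES = ["private limited", "limited", "ltd", "llc", "inc",
                  "gmbh", "as", "bv", "a/s", "ab"]

# A suffix must be preceded by whitespace and sit at the end of the line,
# optionally followed by '.' and the '*' dormant-entity marker.
SUFFIX_RE = re.compile(
    r"\s+(?:" + "|".join(re.escape(s) for s in LEGAL_SUFFIXES) + r")\.?\*?$",
    re.IGNORECASE,
)


def strip_legal_suffix(line):
    """Return the name without its trailing legal suffix, or None if no suffix is found."""
    text = line.strip()
    match = SUFFIX_RE.search(text)
    if match is None:
        return None
    return text[: match.start()]


for line in ["Bergen Engines India Private Limited", "A. F. C. Wultex Limited*", "Derby", "Ordinary"]:
    print(repr(line), "->", strip_legal_suffix(line))
```

Because the suffix has to end the line, address lines that merely contain a company name in the middle are rejected, although anything that ends in a listed suffix would still pass.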

In [None]:
# !pip install cleanco
# from cleanco import cleanco
# x = cleanco("Linkedin limited")
# x.type()
subsidiary_list

In [None]:
# Method to create clusters of all legal entities of a subsidiary based on the same first word.
# Expects a sorted list, so that entities sharing a first word are adjacent.
def get_clusters(subsidiary_list):
  cluster_list = []
  for name in subsidiary_list:
    first_word = name.split(' ')[0]
    if cluster_list and cluster_list[-1][0].split(' ')[0] == first_word:
      cluster_list[-1].append(name)
    else:
      # a new first word starts a new cluster (this also keeps a trailing single-entry cluster)
      cluster_list.append([name])
  return cluster_list

# Method to pick the business unit of a cluster: the longest run of leading words shared by all of its legal names
def get_business_unit(cluster_of_legal_entities):
  # split into words without modifying the caller's cluster
  tokenised = [name.split(' ') for name in cluster_of_legal_entities]
  min_len = min(len(words) for words in tokenised)

  res = tokenised[0][0]
  for i in range(1, min_len):
    word = tokenised[0][i]
    for words in tokenised:
      if words[i] != word:
        return res
    res += " {}".format(word)
  return res


final_subsidiary_list = set()
# getting list of clusters from the complete list of subsidiaries
cluster_list = get_clusters(subsidiary_list)
for cluster in cluster_list:
  # choosing the main business unit out of all the legal entities in the cluster
  print(cluster)
  subsidiary = cluster[0] if len(cluster)==1 else get_business_unit(cluster)
  final_subsidiary_list.add(subsidiary)
final_subsidiary_list

['2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.', '2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.']
['75NEI Combustion Engineering']
['A. F. C. Wultex']
['A.P.E. Œ Allen Gears']
['Allen Power Engineering']
['Amalgamated Power Engineering']
['Bergen Engines', 'Bergen Engines', 'Bergen Engines', 'Bergen Engines', 'Bergen Engines Bangladesh', 'Bergen Engines Denmark', 'Bergen Engines India', 'Bergen Engines PropertyCo']
['Bristol Siddeley Engines']
['Brooks Inspection Solutions']
['Brown Brothers & Company']
['C.A. Parsons & Company']
['Company name', 'Company name']
['Composite Technology and Applications']
['Corporation Service Company, 100 North Main Street, Suite 2, Barre, VT']
['Croydon Energy']
['Data Systems & Solutions']
['Deeside Titanium']
['Derby Cogeneration', 'Derby Specialist Fabrications']
['Europea Microfusioni Aerospaziali']
['Exeter Power']
['Fluid Mechanics']
['H

{'2 Corporation Service Company, 2711 Centerville Road, Suite 400, Wilmington, DE19808, United States.',
 '75NEI Combustion Engineering',
 'A. F. C. Wultex',
 'A.P.E. Œ Allen Gears',
 'Allen Power Engineering',
 'Amalgamated Power Engineering',
 'Bergen Engines',
 'Bristol Siddeley Engines',
 'Brooks Inspection Solutions',
 'Brown Brothers & Company',
 'C.A. Parsons & Company',
 'Company name',
 'Composite Technology and Applications',
 'Corporation Service Company, 100 North Main Street, Suite 2, Barre, VT',
 'Croydon Energy',
 'Data Systems & Solutions',
 'Deeside Titanium',
 'Derby',
 'Europea Microfusioni Aerospaziali',
 'Exeter Power',
 'Fluid Mechanics',
 'Heartlands Power',
 'Heaton Power',
 'John Thompson',
 'Kalvet Engineering (Proprietary',
 'Kamewa',
 'Karl Maybach-Hilfe',
 "L'Orange",
 'MTU',
 'Mans˚eld Holdings',
 'NEI',
 'Navis Consult',
 'Nightingale Insurance',
 'Optimized Systems and Solutions (US',
 'PKMJ Technical Services',
 'PT',
 'Power˚eld',
 'Prokura Diesel Serv
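
Clustering on the first word alone merges `Derby Cogeneration` and `Derby Specialist Fabrications` into `Derby`, as the output above shows. A stricter variant, sketched below under the assumption that the bare business-unit name itself appears in the list (as `Bergen Engines` does), only merges a name into an earlier one when that earlier name is a whole-word prefix of it. The function name `condense_by_prefix` is hypothetical, and this is not the method used in this notebook.

```python
def condense_by_prefix(names):
    """Drop every name that merely extends a previously kept name by extra words."""
    kept = []  # each kept name is stored as a list of words
    # After sorting, all word-extensions of a name follow it directly.
    for name in sorted(set(names)):
        words = name.split()
        if kept and words[: len(kept[-1])] == kept[-1]:
            continue  # same business unit as the last kept name
        kept.append(words)
    return [" ".join(words) for words in kept]


sample = [
    "Bergen Engines",
    "Bergen Engines",
    "Bergen Engines Bangladesh",
    "Bergen Engines Denmark",
    "Derby Cogeneration",
    "Derby Specialist Fabrications",
]
print(condense_by_prefix(sample))
# ['Bergen Engines', 'Derby Cogeneration', 'Derby Specialist Fabrications']
```

The trade-off is that groups whose members all carry an extra word (for example only `X Bangladesh` and `X Denmark`, with no bare `X`) are left unmerged.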

In [None]:
# Desired output in form of a list of subsidiaries
final_subsidiary_list = list(final_subsidiary_list)
final_subsidiary_list.sort()


# Applying NER once again in  order to check if it is still tagged as an 'ORG'

# import copy
# temp = copy.deepcopy(final_subsidiary_list)

# for word in temp:
#   result  = NER(word)
#   if len(result.ents) == 0:
#     final_subsidiary_list.remove(word)

#     for ent in result.ents:
#       # print(ent.text,ent.label_)
#       if ent.label_ != 'ORG':
#         final_subsidiary_list.remove(word)

# using regex to remove subsidiaries which have digits in their name
import copy
import re
temp = copy.deepcopy(final_subsidiary_list)

for word in temp:
  if re.search(r"\d", word):
    final_subsidiary_list.remove(word)

# using regex to remove unwanted characters and suffix words like limited, llc, private etc.
# (raw strings, with the literal dots in "inc." and "ltd." escaped)
for i in range(len(final_subsidiary_list)):
  final_subsidiary_list[i] = re.sub(r'[()]', '', final_subsidiary_list[i])
  final_subsidiary_list[i] = re.sub(r' limited($| )| llc($| )| private($| )| proprietary($| )| ltd($| )| inc\.($| )| ltd\.($| )| ab($| )| gmbh($| )| s\.p\.a\.($| )', '', final_subsidiary_list[i], flags=re.I)
  final_subsidiary_list[i] = final_subsidiary_list[i].strip()

final_subsidiary_list


['A. F. C. Wultex',
 'A.P.E. Œ Allen Gears',
 'Allen Power Engineering',
 'Amalgamated Power Engineering',
 'Bergen Engines',
 'Bristol Siddeley Engines',
 'Brooks Inspection Solutions',
 'Brown Brothers & Company',
 'C.A. Parsons & Company',
 'Company name',
 'Composite Technology and Applications',
 'Croydon Energy',
 'Data Systems & Solutions',
 'Deeside Titanium',
 'Derby',
 'Europea Microfusioni Aerospaziali',
 'Exeter Power',
 'Fluid Mechanics',
 'Heartlands Power',
 'Heaton Power',
 'John Thompson',
 'Kalvet Engineering',
 'Kamewa',
 'Karl Maybach-Hilfe',
 "L'Orange",
 'MTU',
 'Mans˚eld Holdings',
 'NEI',
 'Navis Consult',
 'Nightingale Insurance',
 'Optimized Systems and Solutions US',
 'PKMJ Technical Services',
 'PT',
 'Power˚eld',
 'Prokura Diesel Services',
 'R.O.V. Technologies',
 'Rallyswift',
 'Reyrolle Belmos',
 'Rolls-Royce']
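
The final list still carries mis-encoded characters from the PDF text extraction: `Œ` appears where the ground truth has an en dash (`A.P.E. – Allen Gears`), and `˚` appears where the letters `fi` belong (`Mans˚eld`, `Power˚eld`). A small repair table could normalise these before evaluation. The sketch below is an assumption inferred only from the mismatches visible in this notebook; `CHAR_REPAIRS` and `repair_text` are hypothetical names, and other PDFs may need a different table.

```python
# Hypothetical repair table, inferred from this notebook's outputs only
CHAR_REPAIRS = {
    "Œ": "–",   # seen where the ground truth has an en dash
    "˚": "fi",  # seen where the letters 'fi' belong (Mans˚eld)
}


def repair_text(text):
    """Replace each known mis-encoded character with its intended text."""
    for bad, good in CHAR_REPAIRS.items():
        text = text.replace(bad, good)
    return text


for name in ["A.P.E. Œ Allen Gears", "Mans˚eld Holdings", "Power˚eld"]:
    print(repair_text(name))
```

Applied to every entry of the list before scoring, this would let `A.P.E. – Allen Gears` match its ground-truth spelling.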

In [None]:
# calculating precision 
true_positives, false_positives = 0, 0
true_positives_list, false_negatives_list, false_positives_list = [], [], []

for subsidiary in final_subsidiary_list:
  if subsidiary in ground_truth:
    true_positives += 1
    true_positives_list.append(subsidiary)
  else:
    false_positives +=1
    false_positives_list.append(subsidiary)
precision =  true_positives /(false_positives + true_positives)
print("Precision is {}".format(precision))


# calculating recall
true_positives, false_negatives  = 0, 0

for subsidiary in ground_truth:
  if subsidiary in final_subsidiary_list:
    true_positives += 1
  else:
    false_negatives +=1
    false_negatives_list.append(subsidiary)
recall =  true_positives /(false_negatives + true_positives)
print("Recall is {}".format(recall))


# calculating F1 score

f1_score = (2*(precision * recall))/(precision + recall)
print("F1 score is {}".format(f1_score))

print("TP: ",true_positives_list)
print("FN: ",false_negatives_list)
print("FP: ",false_positives_list)

# Results for all variants of code
#  https://gist.github.com/yuvraj-singh/47927c082d8725489aded05c36abddae

Precision is 0.6153846153846154
Recall is 0.8571428571428571
F1 score is 0.7164179104477612
TP:  ['A. F. C. Wultex', 'Allen Power Engineering', 'Amalgamated Power Engineering', 'Bergen Engines', 'Bristol Siddeley Engines', 'Brooks Inspection Solutions', 'Brown Brothers & Company', 'C.A. Parsons & Company', 'Composite Technology and Applications', 'Croydon Energy', 'Data Systems & Solutions', 'Deeside Titanium', 'Europea Microfusioni Aerospaziali', 'Exeter Power', 'Fluid Mechanics', 'Heartlands Power', 'Heaton Power', 'John Thompson', 'Kalvet Engineering', 'Kamewa', 'Karl Maybach-Hilfe', "L'Orange", 'MTU', 'Rolls-Royce']
FN:  ['A.P.E. – Allen Gears', 'Derby Cogeneration', 'Derby Specialist Fabrications', 'Mansfield Holdings Limited']
FP:  ['A.P.E. Œ Allen Gears', 'Company name', 'Derby', 'Mans˚eld Holdings', 'NEI', 'Navis Consult', 'Nightingale Insurance', 'Optimized Systems and Solutions US', 'PKMJ Technical Services', 'PT', 'Power˚eld', 'Prokura Diesel Services', 'R.O.V. Technologies'
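
The same metrics can be computed more compactly with set operations. This is a minimal sketch with a hypothetical function name, `precision_recall_f1`; it assumes exact string matching, as the cell above does, and returns 0.0 instead of dividing by zero when a list is empty.

```python
def precision_recall_f1(predicted, expected):
    """Compute precision, recall and F1 for exact-match name lists."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


precision, recall, f1 = precision_recall_f1(
    ["Bergen Engines", "Derby", "Kamewa"],
    ["Bergen Engines", "Kamewa", "Derby Cogeneration", "Derby Specialist Fabrications"],
)
print("Precision is {}".format(precision))
print("Recall is {}".format(recall))
print("F1 score is {}".format(f1))
```

For the run above, 24 true positives out of 39 predicted names and 28 ground-truth names correspond to the printed precision (24/39) and recall (24/28).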