# Task 2: Company Name Matcher

## Loading Data

In [4]:
# Load the Pandas libraries with alias 'pd' 
import pandas as pd 

# Read company names from the database 
company_names_database = pd.read_csv("data/company_name_matcher_data_esg_db.csv") 

# Read file you want to compare with
company_names_uploaded = pd.read_csv("data/company_name_matcher_data_3.csv")

## Setting parameter for algorithm

Here you can define how many similar company names to retrieve from the ESG database for each search.

In [5]:
# Number of similar company names to retrieve from the database
n = 2

## Enter a company name and compare to ESG database

In this section you can change the value of the variable 'your_company_name' and run the algorithm to search for similar company names.

In [6]:
#enter the company name
your_company_name = "Adobe Inclusive"

The search-and-compare step uses the Python library fuzzywuzzy, which relies on Levenshtein distance to measure the difference between two sequences of characters. This is a form of approximate (fuzzy) string matching: it finds strings that match a pattern approximately rather than exactly.
Each comparison returns a similarity score from 0 to 100; the higher the score, the more similar the two strings. The score is printed with every result below to make the comparison easier to follow.
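To make the idea of an edit-distance-based score concrete, here is a minimal pure-Python sketch. It is only an illustration: the helper names `levenshtein_distance` and `similarity_score` are hypothetical, and the 0-100 scaling shown here is a simple normalization, not necessarily the formula fuzzywuzzy applies internally.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop to limit row size.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity_score(a: str, b: str) -> int:
    """Hypothetical helper: map the edit distance to a 0-100 scale
    by normalizing with the length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100  # two empty strings are identical
    return round(100 * (1 - levenshtein_distance(a, b) / longest))


# Compare a query with a candidate name, ignoring case.
print(similarity_score("adobe inclusive", "adobe inc."))
```

A score computed this way will generally differ from the one fuzzywuzzy reports, but it follows the same principle: fewer edits relative to the string length give a higher score.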

In [7]:
#Search for similar company names
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

output = process.extract(your_company_name, company_names_database['name_'].to_list(),limit = n)
print('Your company name: ' + '\033[1m' + your_company_name + '\033[0m')
print('Number of most similar names you want to search for: ' + '\033[1m' + str(n) + '\033[0m')
print('Similar company names found in ESG database: ' + '\033[1m' + ', '.join(map(str, output)) + '\033[0m')

Your company name: Adobe Inclusive
Number of most similar names you want to search for: 2
Similar company names found in ESG database: ('ADOBE INC.', 90), ('ADOBE SYSTEMS INCORPORATED', 86)


## Compare company names from .csv file to ESG database

In [8]:
# Preview the first 5 lines of the loaded data - to see name of column with company names
company_names_uploaded.head()

| | Company Name | ISIN | Target Qualification | SME? | Business ambitions 1.5 | Country | Region | Sector | Status | Date | Date Explanation | Target | Target Classification |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A1 Telekom Austria Group | AT0000720008 | 1.5°C | | | Austria | Europe | Telecommunication Services | Targets Set | 2020-09-01 00:00:00 | | A1 Telekom Austria Group commits to reduce abs... | The targets covering greenhouse gas emissions ... |
| 1 | A2A S.p.A. | IT0001233417 | 2°C | | | Italy | Europe | Electric Utilities and Independent Power Produ... | Targets Set | 2020-03-01 00:00:00 | | Italian multi-utility company A2A S.p.A commit... | The targets covering greenhouse gas emissions ... |
| 2 | Aardvark Certification Ltd | | Well-below 2°C | 1.0 | | United Kingdom (UK) | Europe | Professional Services | Targets Set | 2020-07-01 00:00:00 | | Aardvark Certification Ltd. commits to reduce ... | The targets covering greenhouse gas emissions ... |
| 3 | AB InBev | BE0974293251 | 1.5°C | | | Belgium | Europe | Food and Beverage Processing | Targets Set | 2018-03-01 00:00:00 | | Global Brewer AB InBev commits to reduce absol... | The targets covering greenhouse gas emissions ... |
| 4 | ABB | CH0012221716 | 1.5°C | | 1.0 | Switzerland | Europe | Electrical Equipment and Machinery | Targets Set | 2021-06-01 00:00:00 | | ABB commits to reduce absolute scope 1 and 2 G... | The targets covering greenhouse gas emissions ... |


Note the name of the column in your file that contains the company names.

In [9]:
#enter the name of this column
column_company_name = "Company Name"

For performance reasons, let's run the algorithm on only the first 5 rows of the file.

In [10]:
# Take a small sample (first 5 rows) of the uploaded file for performance reasons
company_names = company_names_uploaded.head(5)

# Column names for the result dataframe
cols = ['company_name', 'similar_names_from_database']
result = []

# Convert the database names to a list once, outside the loop
database_names = company_names_database['name_'].to_list()

# Loop through the sample and look up the n most similar names in the ESG database
for name in company_names[column_company_name]:
    similar_names = process.extract(name, database_names, limit=n)
    result.append([name, similar_names])

# Show the result in a dataframe
comparison = pd.DataFrame(result, columns=cols)
pd.set_option('display.max_colwidth', None)
comparison.head()


| | company_name | similar_names_from_database |
|---|---|---|
| 0 | A1 Telekom Austria Group | [(Monex Group, Inc., 86), (Airbus Group SE, 86)] |
| 1 | A2A S.p.A. | [(A2A S.P.A., 100), (A2A S.p.A., 100)] |
| 2 | Aardvark Certification Ltd | [(VALQUA, LTD., 86), (Standard Life Investments Property Inc Trust Ltd, 86)] |
| 3 | AB InBev | [(B&B Tools AB, 86), (ANHEUSER-BUSCH INBEV S.A., 86)] |
| 4 | ABB | [(ABBOTT INDIA LIMITED, 90), (Speed Rabbit Pizza SA, 90)] |


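As the results show, a high score does not guarantee a correct match. For ABB, both candidates score 90, yet neither is the same company. A likely reason is that the default scorer rewards partial matches, so a short name such as "ABB" scores highly against any longer name that contains those letters ("ABBOTT", "Rabbit"). For other names, such as A1 Telekom Austria Group, no similar name is found at all.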
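One simple way to make weak matches easier to spot is to keep only candidates at or above a minimum score. The sketch below works on plain lists of `(name, score)` tuples, the same shape as the results shown above. The helper `filter_matches` and the threshold of 95 are assumptions chosen for illustration, not part of fuzzywuzzy or of the original notebook.

```python
from typing import Dict, List, Tuple

Match = Tuple[str, int]  # (candidate name, similarity score 0-100)


def filter_matches(matches: List[Match], min_score: int = 95) -> List[Match]:
    """Hypothetical helper: keep only candidates whose score is at least min_score."""
    return [(name, score) for name, score in matches if score >= min_score]


# Results copied from the comparison table above.
results: Dict[str, List[Match]] = {
    "A1 Telekom Austria Group": [("Monex Group, Inc.", 86), ("Airbus Group SE", 86)],
    "A2A S.p.A.": [("A2A S.P.A.", 100), ("A2A S.p.A.", 100)],
    "ABB": [("ABBOTT INDIA LIMITED", 90), ("Speed Rabbit Pizza SA", 90)],
}

for company, matches in results.items():
    kept = filter_matches(matches)
    # An empty list signals that no sufficiently similar name was found.
    print(company, "->", kept if kept else "no confident match")
```

With this threshold only A2A S.p.A. keeps its candidates. A suitable cutoff depends on the data and would need to be tuned against names whose correct match is known.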

## Export comparison dataframe as csv file

In [None]:
# Export the comparison dataframe created above
comparison.to_csv('data/comparison.csv')