## Primer on fuzzy name matching

<br/>
**TL;DR: Each algorithm solves a slightly different problem, and none of them is exactly the problem we have.**
<br/><br/>
**Fuzzy matching:** Several variants exist, but most of them break each name into tokens and then score how many tokens match exactly and whether they appear in the same order in both names. This is great for names that differ only in a prefix or suffix: 'microsoft corp.' and 'microsoft inc.' get a high, if not perfect, similarity. However, it is bad for cases where part of the firm name matches exactly for unrelated reasons, e.g. 'johnson & johnson' vs. 'johnson smith kline', or worse, 'zoom inc.' and 'zoom tech'. <br/>
**Hamming and other distance measures:** These measure how many character changes are required to turn one name into the other. Hamming distance counts substitutions only and is defined for strings of equal length, so for 'microsoft' and 'micrasoft' it is 1. Edit distances such as Levenshtein also count insertions and deletions, so for 'micro' and 'microsoft' the distance is 4. This is great for catching spelling errors.
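To make the token idea concrete, here is a minimal sketch of a token-overlap score (Jaccard similarity on whitespace tokens). It is a simplified stand-in and not the scoring used by any particular fuzzy-matching library; `token_jaccard` is a hypothetical helper name. The point is the failure mode described above: shared tokens alone cannot tell a harmless suffix difference from two unrelated firms.

```python
def token_jaccard(a: str, b: str) -> float:
    """Share of distinct whitespace tokens that two names have in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Same firm, different legal suffix: one of three distinct tokens is shared
same_firm = token_jaccard("microsoft corp.", "microsoft inc.")
# Different firms that happen to share a token
different_firms = token_jaccard("zoom inc.", "zoom tech")

print(same_firm, different_firms)  # both pairs receive the same score
```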
<br/>
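A minimal sketch of the two character-level distances, written in plain Python for illustration. Hamming distance is only defined for strings of equal length, so the 'micro' vs. 'microsoft' case needs an edit (Levenshtein) distance, which also counts insertions and deletions.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


print(hamming("microsoft", "micrasoft"))   # 1
print(levenshtein("micro", "microsoft"))   # 4
```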
**Soundex methods:** These try to measure how the names are pronounced using pre-specified character-/n-gram-based dictionaries. The most common algorithms are Soundex (the most basic and coarse), Metaphone (more granular) and Double Metaphone (very granular). They work well for single-token names but become complicated with multiple tokens. Their main purpose is to capture different spellings of the same name, and we are unlikely to have many spelling errors.
<br/>
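For intuition, here is a sketch of the classic American Soundex encoding in plain Python. It is an illustration only; Metaphone and Double Metaphone use much richer rule sets and are not shown. Note how a whole name collapses into four characters, which is why multi-token firm names are awkward for this family of methods.

```python
# Letter groups that share a Soundex digit; vowels, y, h and w carry no digit
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name: str) -> str:
    """Four-character Soundex code: first letter plus up to three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    previous = CODES.get(letters[0], "")
    for ch in letters[1:]:
        digit = CODES.get(ch, "")
        if digit and digit != previous:
            code += digit
        if ch not in "hw":  # h and w do not separate equal codes; vowels do
            previous = digit
    return (code + "000")[:4]


print(soundex("Robert"), soundex("Rupert"))        # R163 R163
print(soundex("microsoft"), soundex("mycrosoft"))  # M262 M262
```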
**Statistical methods:** We manually label a subset of candidate matches and then train a model on them. This is labor intensive, and it is unclear whether it would be much more accurate.
<br/>
**Word embedding methods:** This is likely our best bet, as it's specifically designed to pick up on similarities between words like 'inc.' and 'comp.'.
<br/>
<br/>
**Conclusion:** Perhaps we can combine methods: start with the embedding approach and then see whether the other methods refine the results further.

### TF-IDF Approach 
<br/>
Source of text below: https://github.com/black-tea/data-projects/blob/master/string-matching-at-scale/String%20Matching%20at%20Scale.ipynb
<br/><br/>
The TF-IDF calculations typically consist of the following steps:
<br/>

**Pre-processing & Tokenization:** Perform any cleaning on the data (case conversion, removal of stopwords & punctuation) and convert each document into tokens. Although tokenization is typically performed at the word level, we have the flexibility to define a token at a lower level, such as an n-gram, which is more useful for short string matching since we might only have a few words in each string.<br/><br/>
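As a small illustration of n-gram tokenization, the sketch below splits a name into overlapping character trigrams. `char_ngrams` is a hypothetical helper; it is simpler than scikit-learn's `char_wb` analyzer used in the class further down, which builds n-grams only inside word boundaries and pads word edges with spaces.

```python
def char_ngrams(text: str, n: int = 3) -> list:
    """Lower-case the string and return its overlapping character n-grams."""
    cleaned = text.lower().strip()
    return [cleaned[i:i + n] for i in range(len(cleaned) - n + 1)]


trigrams = char_ngrams("Zoom Inc")
print(trigrams)  # ['zoo', 'oom', 'om ', 'm i', ' in', 'inc']
```

Even a two-word name yields six tokens this way, which gives the later TF-IDF step more to work with than two word tokens would.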

**Calculate the Term Frequency:** The purpose of this step is to determine which words define the document; words that appear more frequently are indicative of the document's subject matter. For each document (a string in our case), calculate the frequency of each term (token) in the document and divide by the total number of terms in the document. If we define a token as an n-gram, we will calculate the frequency of each n-gram in our string.<br/><br/>
$$TF(t) = \frac{\text{number of times term } t \text{ appears in the document}}{\text{total number of terms in the document}}$$
<br/>
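A minimal sketch of the term-frequency step on a single tokenized name, using word tokens for readability (the same function works on n-grams). `term_frequency` is a hypothetical helper name.

```python
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each token within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {token: count / total for token, count in counts.items()}


tf = term_frequency(["johnson", "&", "johnson"])
print(tf)  # 'johnson' makes up two of the three tokens, '&' one of three
```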

**Calculate the Inverse Document Frequency:** The purpose of this step is to calculate the appropriate weight for each term, depending on how often it appears across all documents. A term that appears in all documents gets a lower weight than a term that appears in only one of them. The idea is that a token that appears in every document is less descriptive of any particular document than a token that appears in only one.<br/>
$$IDF(t) = \ln\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
<br/>
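The sketch below applies the IDF formula to three toy names. It implements the plain logarithmic form given above; scikit-learn's `TfidfVectorizer` uses a smoothed variant by default, so its numbers will differ. The toy documents and the helper name are made up for illustration.

```python
import math


def inverse_document_frequency(documents):
    """documents: list of token lists. Returns ln(N / df(t)) for every token t."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for token in set(tokens):  # count each token once per document
            doc_freq[token] = doc_freq.get(token, 0) + 1
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


docs = [["zoom", "inc"], ["microsoft", "inc"], ["accenture", "inc"]]
idf = inverse_document_frequency(docs)
print(idf["inc"])   # 0.0, because 'inc' appears in every document
print(idf["zoom"])  # ln(3), because 'zoom' appears in one of three documents
```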
**Calculate the TF-IDF Weights for each token:** Multiply the term frequency by the inverse document frequency.
<br/><br/>
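Putting the two previous steps together, a self-contained sketch of the weight calculation on a toy corpus might look as follows. The corpus and the name `tfidf_weights` are illustrative assumptions, and the unsmoothed formulas from this section are used rather than any library default.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {token: TF * IDF} dictionary per tokenized document."""
    n_docs = len(documents)
    doc_freq = Counter(token for tokens in documents for token in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        weights.append({
            token: (count / len(tokens)) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


docs = [["zoom", "inc"], ["zoom", "tech"], ["microsoft", "inc"]]
weights = tfidf_weights(docs)
print(weights[1])  # 'tech' is rarer than 'zoom', so it gets the larger weight
```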
**Calculate the Cosine Similarity:** Cosine similarity is often used to compare two vectors (in this case, vectors of TF-IDF values). As described by Chris van den Berg, data scientists at ING developed a custom library that computes cosine similarities faster than the built-in scikit-learn `cosine_similarity` function, and we use that library here.
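For reference, cosine similarity itself is simple to write down. The sketch below works on sparse vectors stored as token-to-weight dictionaries; the example weights are made up. When every vector is scaled to unit length first (the default `norm='l2'` in `TfidfVectorizer`), the cosine reduces to a plain dot product, which is why a fast sparse matrix product is enough for the matching.

```python
import math


def cosine_similarity(u: dict, v: dict) -> float:
    """Cosine of the angle between two sparse vectors stored as token -> weight."""
    dot = sum(weight * v.get(token, 0.0) for token, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Made-up trigram weights for two names that share their first word
a = {"zoo": 0.5, "oom": 0.5, "inc": 0.1}
b = {"zoo": 0.5, "oom": 0.5, "tec": 0.4, "ech": 0.4}
score = cosine_similarity(a, b)
print(score)
```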

In [1]:
import os
import pandas as pd
import numpy as np
from fuzzywuzzy import process, fuzz
import timeit
import multiprocessing as mp
from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
from metaphone import doublemetaphone
from cleanco import cleanco
import re
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from scipy.sparse import csr_matrix
import sparse_dot_topn.sparse_dot_topn as ct
from datetime import datetime


In [2]:
# Set path to BG data
os.chdir('/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016')
path_bg='/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016'
# Set path to glassdoor data
path_gl='/Users/victoriasevcenko/Dropbox (Personal)/Burning Glass/Data/Glassdoor M&A Sample/BG_M&ASample_AriannaJMP.csv'
# Set path to output
path_out='/Users/victoriasevcenko/Dropbox (Personal)/Burning Glass/Analysis/glassdoor_bg_merge/2_Results'

In [3]:
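# Clean data function
def clean_data(df):
    df = df.drop_duplicates().dropna()
    df = df.reset_index(drop=True)
    df['name'] = df['name'].str.lower()
    # Remove parenthesised text; this needs a regex replace, because the
    # built-in str.replace would search for the pattern literally
    df['name'] = df['name'].str.replace(r"\(.*\)", "", regex=True)
    df['name'] = df['name'].map(lambda x: x.replace(',', '').replace(' - ', ' ')
                                           .replace(' and ', ' & ').strip())
    # regex=False so that '.' is matched as a literal dot
    df['name'] = df['name'].str.replace(' inc.', '', regex=False)
    df['name'] = df['name'].str.replace(' co.', '', regex=False)
    df['name'] = df['name'].apply(lambda x: cleanco(x).clean_name() if isinstance(x, str) else x)
    df['name'] = df['name'].str.replace('.', '', regex=False)
    # Second pass catches legal suffixes that only match once the dots are gone
    df['name'] = df['name'].apply(lambda x: cleanco(x).clean_name() if isinstance(x, str) else x)

    return df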
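
One detail worth checking in a cleaning step like this: Python's built-in `str.replace` matches its first argument literally, so a pattern such as `\(.*\)` only removes parenthesised text when it goes through a regex function (`re.sub`, or pandas' `.str.replace(..., regex=True)`). A minimal demonstration, using a made-up firm name:

```python
import re

name = "acme holdings (uk)"  # made-up example name
pattern = r"\(.*\)"

# str.replace looks for the literal characters of the pattern, which are not in the name
literal_result = name.replace(pattern, "")
# re.sub interprets the pattern as a regular expression and removes the parenthesised part
regex_result = re.sub(pattern, "", name).strip()

print(repr(literal_result))  # 'acme holdings (uk)'
print(repr(regex_result))    # 'acme holdings'
```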

In [4]:
# Get CSV list in directory function
def get_csv_files(path):
    csv_list = []
    for filename in os.listdir(path):
        if filename.endswith(".csv"):
            csv_list.append(os.path.join(path, filename))
    return csv_list

In [5]:
class StringMatch():
    
    def __init__(self, source_names, target_names):
        self.source_names = source_names
        self.target_names = target_names
        self.ct_vect      = None
        self.tfidf_vect   = None
        self.vocab        = None
        self.sprse_mtx    = None
        
        
    def tokenize(self, analyzer='char_wb', n=3):
        '''
        Tokenizes the list of strings, based on the selected analyzer

        :param str analyzer: Type of analyzer ('char_wb', 'word'). Default is 'char_wb'
        :param int n: The n-gram length. Default is 3 (trigrams)
        '''
        # Create initial count vectorizer & fit it on both lists to get vocab
        self.ct_vect = CountVectorizer(analyzer=analyzer, ngram_range=(n, n))
        self.vocab   = self.ct_vect.fit(self.source_names + self.target_names).vocabulary_
        
        # Create tf-idf vectorizer
        self.tfidf_vect  = TfidfVectorizer(vocabulary=self.vocab, analyzer=analyzer, ngram_range=(n, n))
        
        
    def match(self, ntop=1, lower_bound=0, output_fmt='df'):
        '''
        Main match function. Default settings return only the top candidate for every source string.
        
        :param int ntop: The number of top-n candidates that should be returned
        :param float lower_bound: The lower-bound threshold for keeping a candidate, between 0-1.
                                   Default set to 0, so consider all candidates
        :param str output_fmt: The output format. Either dataframe ('df') or dict ('dict')
        '''
        self._awesome_cossim_top(ntop, lower_bound)
        
        if output_fmt == 'df':
            match_output = self._make_matchdf()
        elif output_fmt == 'dict':
            match_output = self._make_matchdict()
        else:
            raise ValueError("output_fmt must be 'df' or 'dict'")

        return match_output
        
        
    def _awesome_cossim_top(self, ntop, lower_bound):
        ''' https://gist.github.com/ymwdalex/5c363ddc1af447a9ff0b58ba14828fd6#file-awesome_sparse_dot_top-py '''
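        # Vectorize both lists and convert to CSR format. Note that fit_transform
        # refits the IDF weights on each list separately; the fixed vocabulary
        # keeps the columns of A and the rows of B aligned.
        A = self.tfidf_vect.fit_transform(self.source_names).tocsr()
        B = self.tfidf_vect.fit_transform(self.target_names).transpose().tocsr()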
        M, _ = A.shape
        _, N = B.shape

        idx_dtype = np.int32

        nnz_max = M * ntop

        indptr = np.zeros(M+1, dtype=idx_dtype)
        indices = np.zeros(nnz_max, dtype=idx_dtype)
        data = np.zeros(nnz_max, dtype=A.dtype)

        ct.sparse_dot_topn(
            M, N, np.asarray(A.indptr, dtype=idx_dtype),
            np.asarray(A.indices, dtype=idx_dtype),
            A.data,
            np.asarray(B.indptr, dtype=idx_dtype),
            np.asarray(B.indices, dtype=idx_dtype),
            B.data,
            ntop,
            lower_bound,
            indptr, indices, data)

        self.sprse_mtx = csr_matrix((data,indices,indptr), shape=(M,N))
    
    
    def _make_matchdf(self):
        ''' Build dataframe for result return '''
        # CSR matrix -> COO matrix
        cx = self.sprse_mtx.tocoo()

        # COO matrix to list of tuples
        match_list = []
        for row,col,val in zip(cx.row, cx.col, cx.data):
            match_list.append((row, self.source_names[row], col, self.target_names[col], val))

        # List of tuples to dataframe
        colnames = ['Row Idx', 'Title', 'Candidate Idx', 'Candidate Title', 'Score']
        match_df = pd.DataFrame(match_list, columns=colnames)

        return match_df

    
    def _make_matchdict(self):
        ''' Build dictionary for result return '''
        # CSR matrix -> COO matrix
        cx = self.sprse_mtx.tocoo()

        # dict value should be tuple of values
        match_dict = {}
        for row,col,val in zip(cx.row, cx.col, cx.data):
            if match_dict.get(row):
                match_dict[row].append((col,val))
            else:
                match_dict[row] = [(col, val)]

        return match_dict
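
To picture what the `_awesome_cossim_top` step computes, here is a dense, pure-Python sketch of the same idea: for every source vector, take the dot product with every target vector, drop scores that do not exceed the lower bound, and keep the `ntop` best. This is an illustration of the concept only, not the library's implementation; the function name, the strict `>` comparison and the toy vectors are assumptions.

```python
def top_n_dot(A, B, ntop=1, lower_bound=0.0):
    """For each row of A, keep the ntop largest dot products with the rows of B.

    A and B are lists of equal-length lists (one row per name).
    Returns {row index in A: [(row index in B, score), ...]}.
    """
    result = {}
    for i, row_a in enumerate(A):
        scores = [(j, sum(x * y for x, y in zip(row_a, row_b)))
                  for j, row_b in enumerate(B)]
        kept = [(j, s) for j, s in scores if s > lower_bound]  # assumed strict threshold
        kept.sort(key=lambda pair: pair[1], reverse=True)
        result[i] = kept[:ntop]
    return result


# Unit-length toy vectors, so each dot product is a cosine similarity
A = [[1.0, 0.0], [0.6, 0.8]]
B = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
best = top_n_dot(A, B, ntop=1, lower_bound=0.5)
print(best)
```

The real library avoids building the full score table, which is what makes it practical for long name lists.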

Now that we have the StringMatch class, we can run the matching algorithm using just a few lines of code (with default arguments):

    titlematch = StringMatch(source_titles, target_titles)
    titlematch.tokenize()
    match_df = titlematch.match()

Let's take a look at how well it performs.

### First, load & clean Glassdoor data

In [14]:
# Load glassdoor data
glass=pd.read_csv(path_gl)
glass_t=glass[['T_glassdoorname_stemmed']].drop_duplicates()
# Change firm names to 'name' so that it's easy to append
glass_t.rename(columns = {'T_glassdoorname_stemmed':'name'}, inplace = True)
glass_t['target'] = 1
glass_a=glass[['A_glassdoorname_stemmed']].drop_duplicates()
glass_a['target'] = 0
glass_a.rename(columns = {'A_glassdoorname_stemmed':'name'}, inplace = True)
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way
glass=pd.concat([glass_a, glass_t])
glass=clean_data(glass)
glass.head()

| | name | target |
|---|---|---|
| 0 | 21st century fox | 0 |
| 1 | abb | 0 |
| 2 | aecom | 0 |
| 3 | agco | 0 |
| 4 | amec foster wheeler | 0 |


In [15]:
# Convert glass names to list:
glass_list = glass.name.tolist()

### Now, write a loop: 
#### For each week of BG data:
    1. Load it
    2. Clean it
    3. Find a match in Glassdoor
    4. Send output to csv

In [8]:
# Get list of all CSVs in a given directory (now using: 2016)
csv_list=get_csv_files(path_bg)

In [9]:
# Allow output to be set to 1000 rows
pd.set_option('display.max_rows', 1000)

In [None]:
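import concurrent.futures

def clean_stuff(file):
    pass  # placeholder; the working version is defined in the next cell


# Sketch: run the per-file step in parallel over the weekly files.
# csv_list stands in for the undefined list_of_data placeholder.
with concurrent.futures.ProcessPoolExecutor() as executor:
    result = executor.map(clean_stuff, csv_list)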
    

In [17]:
# Match one weekly BG file against the Glassdoor names and write the result to csv
def clean_stuff(file):
    bg_data = pd.read_csv(file)
    bg_data = bg_data[['CanonEmployer']]
    bg_data = bg_data.rename(columns={'CanonEmployer': 'name'})
    bg_data = clean_data(bg_data)
    # Convert BG names to list
    bg_list = bg_data.name.tolist()

    # Match the BG names to Glass names (and time it)
    t0 = datetime.now()
    namematch = StringMatch(glass_list, bg_list)
    namematch.tokenize()
    match_df = namematch.match()
    t1 = datetime.now()
    full_time_tfidf = (t1 - t0).total_seconds()
    print(file)
    print(full_time_tfidf)

    # Organize output
    match_df = match_df.rename(columns={"Title": "Glass_Name", "Candidate Title": "BG_Name"})
    match_df = match_df.sort_values(by='Score', ascending=False)
    match_df = pd.merge(match_df, glass, left_on='Glass_Name', right_on='name')
    match_df = match_df.drop(columns=['name', 'Row Idx', 'Candidate Idx'])

    # Send output to csv, reusing the input file name
    file_csv = os.path.basename(file)
    outpath = os.path.join(path_out, file_csv)
    match_df.to_csv(outpath, index=False)


# Run matching function for all of 2016
for file in csv_list:
    clean_stuff(file)

/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160101_20160107.csv
1.092471
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160108_20160114.csv
1.137398
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160115_20160121.csv
1.008466
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160122_20160128.csv
1.032929
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160129_20160204.csv
1.043863
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160205_20160211.csv
1.186096
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160212_20160218.csv
1.134086
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160219_20160225.csv
1.033389
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160226_20160303.csv
1.019084
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160304_20160310.csv
1.150529
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160311_20160317.csv
1.024458
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160318_20160324.csv
1.104026
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160325_20160331.csv
1.080976
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160401_20160407.csv
1.09746
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160408_20160414.csv
1.070913
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160415_20160421.csv
1.115718
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160422_20160428.csv
1.062578
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160429_20160505.csv
1.003676
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160506_20160512.csv
1.020764
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160513_20160519.csv
1.003327
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160520_20160526.csv
0.785524
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160527_20160602.csv
1.07476
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160603_20160609.csv
1.183776
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160610_20160616.csv
0.988463
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160617_20160623.csv
1.077146
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160624_20160630.csv
0.748559
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160701_20160707.csv
0.91154
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160708_20160714.csv
0.980816
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160715_20160721.csv
0.932676
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160722_20160728.csv
0.833474
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160729_20160804.csv
1.010125
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160805_20160811.csv
0.940592
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160812_20160818.csv
0.816716
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160819_20160825.csv
0.777403
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160826_20160901.csv
0.63628
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160902_20160908.csv
1.005971
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160909_20160915.csv
0.958622
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160916_20160922.csv
0.995291
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160923_20160929.csv
0.94874
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20160930_20161006.csv
0.969322
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161007_20161013.csv
0.812361
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161014_20161020.csv
0.975102
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161021_20161027.csv
0.843253
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161028_20161103.csv
1.532479
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161104_20161110.csv
0.995818
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161111_20161117.csv
0.8476
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161118_20161124.csv
0.911555
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161125_20161201.csv
0.775756
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161202_20161208.csv
0.849789
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161209_20161215.csv
0.969316
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161216_20161222.csv
0.844431
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161223_20161229.csv
0.643523
/Users/victoriasevcenko/Dropbox (INSEAD)/RS_RA_INSEAD/0_Data/Output/2016/AddFeed_20161230_20161231.csv
0.341566


In [60]:
# Append all results together
# (DataFrame.append was removed in pandas 2.0, so the files are stacked with pd.concat)
csv_list2 = get_csv_files(path_out)
all2016 = pd.concat([pd.read_csv(file) for file in csv_list2])

In [61]:
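# The index columns are already dropped before the weekly files are written, so ignore them if absent
all2016=all2016.drop_duplicates(subset=['Glass_Name','BG_Name'], keep="last").drop(columns=['Row Idx','Candidate Idx'], errors='ignore')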
all2016=all2016[all2016['Score'] > 0.74]  
all2016.shape

(637, 4)

In [66]:
all2016.sort_values(by=['Glass_Name','Score'], ascending=[True,False]).head(300)

| | Glass_Name | BG_Name | Score | target |
|---|---|---|---|---|
| 192 | 21st century fox | 21st century fox | 0.969647 | 1 |
| 270 | 21st century fox | 21st century media | 0.765244 | 1 |
| 285 | 21st century fox | 21st century auto | 0.756468 | 1 |
| 113 | 3i infotech | 3i infotech | 0.984677 | 1 |
| 198 | 3i infotech | 3i infotec | 0.864776 | 1 |
| 198 | 3i infotech | infotech | 0.842419 | 1 |
| 263 | 3i infotech | eng infotech | 0.748641 | 1 |
| 58 | abbvie | abbvie | 0.987977 | 0 |
| 37 | accelops | accelops | 0.994131 | 1 |
| 72 | accenture | accenture | 0.985878 | 1 |


### Final Result for 2016 merge:

In [79]:
# Number of unique Target & Acquirer names in Glassdoor that have been matched to at least one BG name
temp = all2016.groupby('target')['Glass_Name'].nunique()
temp2 = glass.groupby('target')['name'].nunique()
print("Number of Glass buyers in total:" ,temp2[0])
print("Number of Glass buyers found in BG:",temp[0])
print("")
print("Number of Glass targets in total:" ,temp2[1])
print("Number of Glass targets found in BG:",temp[1])
print("")
print("Number of Glass firms in total:" ,glass.shape[0])
print("Number of Glass firms found in BG:",temp[1]+temp[0])

Number of Glass buyers in total: 247
Number of Glass buyers found in BG: 159

Number of Glass targets in total: 318
Number of Glass targets found in BG: 213

Number of Glass firms in total: 566
Number of Glass firms found in BG: 372
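
The counts above are easier to read as match rates. A small sketch that converts them into percentages, with the numbers copied from the printed summary (the dictionary names are illustrative):

```python
# Counts copied from the printed summary above
totals = {"buyers": 247, "targets": 318, "all firms": 566}
matched = {"buyers": 159, "targets": 213, "all firms": 372}

match_rates = {group: matched[group] / totals[group] for group in totals}
for group, rate in match_rates.items():
    print(f"{group}: {rate:.1%}")
# buyers: 64.4%
# targets: 67.0%
# all firms: 65.7%
```

So roughly two thirds of the Glassdoor names have at least one BG candidate with a score above 0.74.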
