# Deduplication & Record Linkage

This notebook shows how to use TF-IDF and fuzzy string matching (FuzzyWuzzy) to both deduplicate and match records at scale, and how the K Nearest Neighbours algorithm can serve as an alternative closeness measure.


Data in the real world is messy. Dealing with messy data sets is painful and burns through time which could be spent analysing the data itself.

![https://www.acronis.com/en-us/articles/deduplication/](https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTc_jlg2hrSRYqdenJdv7p_4Xo6Uj-qqCPpx4ANHI2hNkA8TJQPJQ&s)

- **Deduplication**. Aligning similar categories or entities in a data set (for example, we may need to combine ‘D J Trump’, ‘D. Trump’ and ‘Donald Trump’ into the same entity).
- **Record Linkage**. Joining data sets on a particular entity (for example, joining records of ‘D J Trump’ to the URL of his Wikipedia page).


In [None]:
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)

#### An important talk, presented at PyBay2018


In [None]:
from IPython.display import HTML
HTML('<iframe width="1280" height="720" src="https://www.youtube.com/embed/McsTWXeURhA" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>')


**Record Deduplication**, or more generally **Record Linkage**, is the task of finding which records refer to the same entity, such as a person or a company. It is used mainly when records lack a unique identifier, such as a Social Security Number for US citizens ([Dedupe.io](https://dedupe.io)).





# Import libs

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import time

pd.set_option('display.max_colwidth', None)  # show full column contents; -1 is no longer accepted by recent pandas

## Read in Data

In [None]:
import os
print(os.listdir("../input/sec-edgar-companies-list/"))

In [None]:
root = '../input/sec-edgar-companies-list/'

data = pd.read_csv(root + 'sec__edgar_company_info.csv',encoding='latin')


## Glimpse of Data

In [None]:
print('Size of data ',data.shape)

In [None]:
data.head()

In [None]:
data.select_dtypes('object').apply(pd.Series.nunique, axis=0)

## FuzzyWuzzy

In computer science, fuzzy string matching is the technique of finding strings that match a pattern approximately (rather than exactly). In other words, fuzzy string matching is a type of search that finds matches even when users misspell words or enter only partial words. It is also known as approximate string matching.


- FuzzyWuzzy is a Python library that uses **Levenshtein Distance** to calculate the differences between sequences in a simple-to-use package.
- Installation: `!pip install fuzzywuzzy`; imports: `from fuzzywuzzy import fuzz` and `from fuzzywuzzy import process`.


In [None]:
!pip install fuzzywuzzy
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In [None]:
data.tail()

**ratio** compares the similarity of the two entire strings, in order.

In [None]:
fuzz.ratio('ZZ GLOBAL LLC', 'ZZLL INFORMATION TECHNOLOGY, INC')

This tells us that the pair 'ZZ GLOBAL LLC' and 'ZZLL INFORMATION TECHNOLOGY, INC' is about **36%** similar.

In [None]:
fuzz.ratio('ZZ GLOBAL LLC', 'ZZX, LLC')

This tells us that the pair 'ZZ GLOBAL LLC' and 'ZZX, LLC' is about **57%** similar.
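
To make these scores less of a black box, here is a minimal standard-library sketch of a ratio-style similarity. It assumes the 2·M/T formulation used by Python's `difflib.SequenceMatcher` (M = matched characters, T = total characters in both strings). `simple_ratio` is a hypothetical helper for illustration, and FuzzyWuzzy's exact scores may differ depending on the backend it has installed.

```python
from difflib import SequenceMatcher

def simple_ratio(a: str, b: str) -> int:
    """Return a 0-100 similarity score based on matching character blocks."""
    # SequenceMatcher.ratio() returns 2*M/T, a float in [0, 1].
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))

print(simple_ratio('ZZ GLOBAL LLC', 'ZZ GLOBAL LLC'))  # identical strings score 100
print(simple_ratio('ZZ GLOBAL LLC', 'ZZX, LLC'))       # partial overlap scores lower
```

Because the comparison is order-sensitive, reordering the words of a name lowers the score even when every word is the same.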

**partial_ratio** compares partial string similarity.

- We are still using the same data pairs.

In [None]:
fuzz.partial_ratio('ZZ GLOBAL LLC', 'ZZLL INFORMATION TECHNOLOGY, INC')

In [None]:
fuzz.partial_ratio('ZZ GLOBAL LLC', 'ZZX, LLC')

**token_sort_ratio** ignores word order.

In [None]:
fuzz.token_sort_ratio('ZZ GLOBAL LLC', 'ZZLL INFORMATION TECHNOLOGY, INC')

In [None]:
fuzz.token_sort_ratio('ZZ GLOBAL LLC', 'ZZX, LLC')
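
The idea of ignoring word order can be sketched with the standard library alone: split each string into tokens, sort them, rejoin, and only then compare. This is an illustrative approximation, not FuzzyWuzzy's actual code; `simple_token_sort_ratio` and its normalisation rule (lowercase, non-alphanumerics treated as spaces) are assumptions.

```python
import re
from difflib import SequenceMatcher

def simple_token_sort_ratio(a: str, b: str) -> int:
    """Score two strings 0-100 after sorting their tokens alphabetically."""
    def normalise(s: str) -> str:
        # Lowercase, treat anything non-alphanumeric as a separator, sort tokens.
        tokens = re.sub(r'[^0-9a-z]+', ' ', s.lower()).split()
        return ' '.join(sorted(tokens))
    matcher = SequenceMatcher(None, normalise(a), normalise(b))
    return int(round(100 * matcher.ratio()))

# Same words in a different order compare as identical after sorting.
print(simple_token_sort_ratio('LLC ZZ GLOBAL', 'ZZ GLOBAL, LLC'))
```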

**token_set_ratio** ignores duplicated words. It is similar to token_sort_ratio, but a little more flexible.

In [None]:
fuzz.token_set_ratio('ZZ GLOBAL LLC', 'ZZLL INFORMATION TECHNOLOGY, INC')

In [None]:
fuzz.token_set_ratio('ZZ GLOBAL LLC', 'ZZX, LLC')

## TF-IDF & N-Grams

**TF-IDF** is a method for generating features from text. It multiplies the frequency of a term (usually a word) in a document (the Term Frequency, or TF) by the importance of the same term across the entire corpus (the Inverse Document Frequency, or IDF). The IDF factor weights down common, less informative words (e.g. "the", "it", "and") and weights up words that occur rarely. IDF is calculated as:



$$\mathrm{IDF}(t) = \log_e\left(\frac{\text{total number of documents}}{\text{number of documents containing term } t}\right)$$
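
As a small worked example of the IDF formula above, the sketch below computes IDF for a toy corpus of three made-up company names (the corpus and the `idf` helper are illustrative assumptions, not part of the dataset used in this notebook).

```python
import math

# Toy corpus: each "document" is a tokenised company name (made-up data).
docs = [
    ['acme', 'holdings', 'inc'],
    ['zenith', 'capital', 'inc'],
    ['acme', 'capital', 'inc'],
]

def idf(term, documents):
    """IDF(t) = ln(total documents / documents containing t)."""
    n_docs = len(documents)
    n_with_term = sum(1 for doc in documents if term in doc)
    # Assumes the term occurs in at least one document (otherwise division by zero).
    return math.log(n_docs / n_with_term)

for term in ['inc', 'acme', 'zenith']:
    print(term, round(idf(term, docs), 3))
# inc 0.0      (appears in every document, so it carries no weight)
# acme 0.405   (appears in 2 of 3 documents)
# zenith 1.099 (appears in 1 of 3 documents)
```

Note that scikit-learn's `TfidfVectorizer` uses a smoothed variant by default (`smooth_idf=True`), so its weights will not match this plain formula exactly.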

### N-Grams  & De-Duplication

While the terms in **TF-IDF** are usually words, this is not a necessity. In our case using words as terms wouldn’t help us much, as most company names only contain one or two words. This is why we will use n-grams: sequences of N contiguous items, in this case characters. The following function cleans a string and generates all n-grams in this string:

In [None]:
!pip install ftfy # amazing text cleaning for decode issues..

In [None]:
import re
from ftfy import fix_text

def ngrams(string, n=3):
    """Clean a company name and return its list of character n-grams."""
    string = fix_text(string)  # repair mojibake and other encoding issues
    string = string.encode("ascii", errors="ignore").decode()  # drop non-ASCII characters
    string = string.lower()
    chars_to_remove = [")", "(", ".", "|", "[", "]", "{", "}", "'"]
    rx = '[' + re.escape(''.join(chars_to_remove)) + ']'
    string = re.sub(rx, '', string)
    string = string.replace('&', 'and')
    string = string.replace(',', ' ')
    string = string.replace('-', ' ')
    string = string.title()  # normalise case: capital at the start of each word
    string = re.sub(' +', ' ', string).strip()  # collapse repeated spaces into one
    string = ' ' + string + ' '  # pad the name so word boundaries form n-grams too
    string = re.sub(r'[,\-./]|\sBD', '', string)  # drop leftover punctuation and any literal " BD"
    grams = zip(*[string[i:] for i in range(n)])  # n staggered copies give sliding windows
    return [''.join(gram) for gram in grams]

In [None]:
print('All 3-grams in "McDonalds":')
ngrams('McDonalds')
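
The `zip(*[string[i:] for i in range(n)])` idiom in the function above can be hard to read at first. The stripped-down sketch below isolates it, without any of the cleaning steps; `char_ngrams` is a hypothetical helper used only for illustration.

```python
def char_ngrams(text, n=3):
    """Return all character n-grams of text using the staggered-zip idiom."""
    # text[0:], text[1:], ..., text[n-1:] are the same string shifted by one
    # character each; zipping them yields every window of n consecutive characters.
    shifted = [text[i:] for i in range(n)]
    return [''.join(chars) for chars in zip(*shifted)]

print(char_ngrams(' Mcdonalds '))
# [' Mc', 'Mcd', 'cdo', 'don', 'ona', 'nal', 'ald', 'lds', 'ds ']
```

`zip` stops at the shortest input, which is why no partial n-grams appear at the end.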

> The code to generate the matrix of TF-IDF values for each company name is shown below.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

company_names = data['Company Name'].unique()
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(company_names)

The resulting matrix is very sparse as most terms in the corpus will not appear in most company names. Scikit-learn deals with this nicely by returning a sparse CSR matrix.

You can see the first row (**“!J INC”**) contains three terms for the columns 11, 16196, and 15541.

In [None]:
data.head()

In [None]:
print( tf_idf_matrix.shape, tf_idf_matrix[5] )
# Check if this makes sense:

ngrams('#1 PAINTBALL CORP')

> The last term (**‘ORP’**) has a relatively low value, **0.22892**, which makes sense as this term will appear often in the corpus, thus receiving a lower IDF weight.

In [None]:
t1 = time.time()
print(process.extractOne('Ministry of Justice', company_names[0:999]))  # best fuzzy match among the first 999 company names
t = time.time() - t1
print("SELFTIMED:", t)
print("Estimated hours to match 999 names against this sample:", (t * len(company_names[0:999])) / 60 / 60)

## Record linkage and a different approach
> In the section below we will see how record linkage is achieved, using the K Nearest Neighbours algorithm as an alternative closeness measure.

The dataset we would like to join on is a set of ‘clean’ organization names created by the Office for National Statistics (ONS):

![](https://miro.medium.com/max/1014/1*k45HFixH1Q-qxxH1i2rsxQ.png)

As the code below shows, the only difference in this approach is that the messy data set is transformed using the tf-idf matrix learned on the clean data set.

The **‘getNearestN’** function then uses scikit-learn’s implementation of K Nearest Neighbours to find the closest matches in the clean data set:

In [None]:
# Load the clean ONS organization names
from sklearn.feature_extraction.text import TfidfVectorizer
root2 = '../input/gov-names/'
clean_org_names = pd.read_excel(root2 + 'Gov Orgs ONS.xlsx')
clean_org_names = clean_org_names.iloc[:, 0:6]


In [None]:

org_name_clean = clean_org_names['Institutions'].unique()

print('Vectorizing the data - this could take a few minutes for large datasets...')
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams, lowercase=False)
tfidf = vectorizer.fit_transform(org_name_clean)
print('Vectorizing completed...')

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=1, n_jobs=-1).fit(tfidf)

org_column = 'Company Name' #column to match against in the messy data
unique_org = set(data[org_column].values) # set used for increased performance


In [None]:
### Matching query:
def getNearestN(query):
    # Project the messy names into the TF-IDF space learned on the clean names
    queryTFIDF_ = vectorizer.transform(query)
    distances, indices = nbrs.kneighbors(queryTFIDF_)
    return distances, indices

unique_org = list(unique_org)  # fix the order before querying so results line up with names

t1 = time.time()
print('Getting nearest n...')
distances, indices = getNearestN(unique_org)
t = time.time() - t1
print("COMPLETED IN:", t)

print('Finding matches...')
matches = []
for i, j in enumerate(indices):
    # indices point into org_name_clean, the array the vectorizer was fitted on
    temp = [round(distances[i][0], 2), org_name_clean[j[0]], unique_org[i]]
    matches.append(temp)

print('Building data frame...')
matches = pd.DataFrame(matches, columns=['Match confidence (lower is better)', 'Matched name', 'Original name'])
print('Done')

In [None]:
matches.head(10)

### Finding close matches through getNearestN

In [None]:
matches.sort_values('Match confidence (lower is better)')
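
The "lower is better" score is a distance: `NearestNeighbors` uses Euclidean distance by default, and because `TfidfVectorizer` L2-normalises each row by default, that distance is a monotone function of cosine similarity (for unit vectors, d² = 2 − 2·cos). The pure-Python sketch below illustrates the same nearest-neighbour idea on a few names. It is a simplification under stated assumptions: it uses raw character 3-gram counts instead of TF-IDF weights, a brute-force search instead of scikit-learn, and hypothetical helper names throughout.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams of a lowercased, space-padded string."""
    padded = ' ' + text.lower() + ' '
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

def unit_vector(text):
    """Sparse n-gram count vector scaled to unit length (assumes non-empty text)."""
    counts = Counter(char_ngrams(text))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {gram: v / norm for gram, v in counts.items()}

def euclidean(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def nearest(query, candidates):
    """Brute-force 1-nearest-neighbour: return (distance, best candidate)."""
    q = unit_vector(query)
    return min((euclidean(q, unit_vector(c)), c) for c in candidates)

clean = ['Ministry of Justice', 'Ministry of Defence', 'Home Office']
distance, match = nearest('Min. of Justice', clean)
print(round(distance, 2), match)
```

A distance of 0 means the n-gram profiles are identical, while values approaching √2 ≈ 1.41 mean the two names share almost no n-grams, which is why sorting by this column surfaces the most trustworthy matches first.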

In summary, TF-IDF can be a highly effective and performant way of cleaning, deduplicating and matching data when dealing with larger record counts.

**References**

- http://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49
- https://towardsdatascience.com/fuzzy-matching-at-scale-84f2bfd0c536
- https://bergvca.github.io/2017/10/14/super-fast-string-matching.html



