dupandas is a Python package to perform data deduplication on columns of a pandas DataFrame using flexible text matching. It is compatible with both Python 2.x and 3.x. dupandas can find duplicates among any kind of text records in pandas data. It ships with sophisticated Matchers that can handle spelling differences and phonetics, and with several Cleaners that can strip the noise commonly found in text data, such as punctuation, digits and casing.
For fast computation, dupandas uses Lucene-based text indexing. If "indexing" is set to True in the input_config, the dataset is indexed in a RAMDirectory, which is then used to identify and search similar strings. Check out the instructions for installing PyLucene below.
The beautiful part of dupandas is that its Matchers, Cleaners and Indexing functions can also be used standalone while working with text data.
The following Python modules are required to use dupandas: pandas, fuzzy, python-levenshtein. They can be installed with pip:
pip install dupandas pandas fuzzy python-levenshtein
OR if dependencies are already installed:
pip install dupandas
OPTIONAL: For faster execution, using dupandas with the indexing feature is recommended. dupandas uses PyLucene for data indexing.
PyLucene Installation: Please note that Lucene indexing requires Java to be installed; Java 8 is recommended. Refer to this link
sudo apt-get update
sudo apt-get install pylucene
After installation, edit the ~/.bashrc file and add the following line at the end:
export LD_LIBRARY_PATH=/usr/lib/jvm/java_folder_name/jre/lib/amd64/server
example: export LD_LIBRARY_PATH=/usr/lib/jvm/java-8-oracle/jre/lib/amd64/server
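To verify the setup, the JVM can be started from Python; a quick check, assuming PyLucene is installed as the lucene module:
import lucene
lucene.initVM()          # fails if LD_LIBRARY_PATH does not point at the JVM's server directory
print(lucene.VERSION)    # prints the bundled Lucene version if everything is wired up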
Note: Using indexing can reduce the overall computation time to roughly one third of the original.
dupandas using the default Matcher and Cleaner (the defaults are Exact Match and No Cleaning, respectively):
from dupandas import Dedupe
dupe = Dedupe()
input_config = {
    'input_data' : pandas_dataframe,
    'column' : 'column_name_to_deduplicate',
    '_id' : 'unique_id_column_of_dataset',
}
results = dupe.dedupe(input_config)
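For illustration, a minimal end-to-end sketch with a toy DataFrame; the column names 'id' and 'name' below are made up for the example:
import pandas as pd
from dupandas import Dedupe

df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['New Delhi', 'New Delhi', 'Mumbai']
})

dupe = Dedupe()   # defaults: Exact Match, No Cleaning
input_config = {
    'input_data': df,
    'column': 'name',
    '_id': 'id',
}
results = dupe.dedupe(input_config)
print(results)    # with the defaults, only identical strings are expected to pair up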
dupandas using custom Cleaner and Matcher configs
from dupandas import Dedupe
clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }
match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}
dupe = Dedupe(clean_config = clean_config, match_config = match_config)
input_config = {
    'input_data' : pandas_dataframe,
    'column' : 'column_name_to_deduplicate',
    '_id' : 'unique_id_column_of_dataset',
}
results = dupe.dedupe(input_config)
Other options in input_config
input_config = {
    'input_data' : pandas_dataframe,
    'column' : 'column_name_to_deduplicate',
    '_id' : 'unique_id_column_of_dataset',
    'score_column' : 'name_of_the_column_for_confidence_score',
    'threshold' : 0.75, # float value of threshold
    'unique_pairs' : True, # boolean to get unique (A=B) or duplicate (A=B and B=A) results
    'indexing' : False # Boolean to set lucene indexing = True / False, Default: False
}
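The results can then be post-processed with standard pandas operations; a sketch, assuming the returned results form a pandas DataFrame and score_column was set to 'confidence':
high_confidence = results[results['confidence'] >= 0.9]   # keep only strong matches
print(high_confidence.sort_values('confidence', ascending=False))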
from dupandas import Cleaner
clean_config = { 'lower' : True, 'punctuation' : True, 'whitespace' : True, 'digit' : True }
clean = Cleaner(clean_config)
clean.clean_text("new Delhi 3#! 34 ")
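Since clean_text operates on plain strings, the same Cleaner can be mapped over a whole pandas column; a small sketch (the DataFrame below is made up for illustration):
import pandas as pd

df = pd.DataFrame({'city': ['new Delhi 3#! 34 ', 'MUMBAI  12', 'Chennai!!']})
# run every value in the column through the configured Cleaner
df['city_clean'] = df['city'].apply(clean.clean_text)
print(df)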
from dupandas import Matcher
match_config = { 'exact' : False, 'levenshtein' : True, 'soundex' : False, 'nysiis' : False}
match = Matcher(match_config)
match.match_elements("new delhi", "newdeli")
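match_elements compares two strings at a time, so scoring every candidate pair from a small list only needs itertools; a sketch reusing the match object above:
from itertools import combinations

cities = ['new delhi', 'newdeli', 'mumbai']
# score each unique pair with the configured Matcher
for a, b in combinations(cities, 2):
    print(a, b, match.match_elements(a, b))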
Thanks for checking out this work. Of course there is scope for improvement; feel free to submit issues and enhancement requests.
- V2: Add Support for multi column match
- V2: Add Matchers, Cleaners
- V2: Remove Library Dependencies
- V2: Handle Longer Texts, Command Line Arguments
- Fork the repo on GitHub
- Clone the project to your own machine
- Commit changes to your own branch
- Push your work back up to your fork
- Submit a Pull request