Skip to content

Robust CSV dialect detection methodology for Python that outperforms existing state of the art solutions by 8.35% in terms of their F1 scores, using only built-in Python modules.

License

Notifications You must be signed in to change notification settings

ws-garcia/P-CSVsniffer

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

27 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CSVsniffer

DOI

A Python implementation of the method described in the paper:

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference (PDF)

by W. GarcĂ­a.

Introduction

The results presented here can be reproduced by running the scripts from the src/run_tests.py file and are stored in the python/tests results/ folder.

Data

The CSV folder contains the files copied from the Pollock framework and other collected test files. Also the dataset used for the CSV wrangling research is available in the CSV_Wranglin folder. Note that only link to the files can be provided, in this last case,due to the authors holds the copyright. A dataset from w3c, CSVW project, is available in the W3C-CSVW folder.

The expect configuration for each tested CSV is saved in the Dialect_annotations.txt, Manual_dialect_annotation.txt and W3C-CSVW-Dialect_annotations.txt files. All of them are stored in the ground truth folder.

Results

In this section, the results after running tests with the Beta Python implementation of the Table Uniformity method are presented. In order to obtain more representative performance metrics, it was decided to run the tests on a system running Linux.

The table below shows the dialect detection success ratio for P-CSVsniffer, CleverCSV and the built-in Python csv.Sniffer class module. Note that the accuracy has been measured using only those files that do not produce a failure when attempting to infer CSV dialects

Data set P-CSVsniffer CleverCSV csv.Sniffer
POLLOCK 96.5517% 95.1724% 96.3504%
CSV Wrangling 90.5660% 84.3137% 80.5556%
CSV Wrangling filtered CODEC 89.4737% 84.2520% 80.0000%
CSV Wrangling MESSY 78.1955% 71.6535% 66.6667%
W3C-CSVW 95.3917% 61.1111% 97.6923%

The table below shows the failure ratio for each tool.

Data set P-CSVsniffer CleverCSV csv.Sniffer
POLLOCK [148 files] 2.0270% 2.0270% 7.4324%
CSV Wrangling [179 files] 11.1732% 14.5251% 19.5531%
CSV Wrangling filtered CODEC [142 files] 6.3380% 10.5634% 15.4930%
CSV Wrangling MESSY [126 files] 6.3380% 10.5634% 15.4930%
W3C-CSVW [221 files] 1.8100% 2.2624% 41.1765%

The following table shows the average success and failure ratio for selected tools. The higher the number of errors obtained, the lower the reliability for detection.

Tool Success ratio (SR) Failure ratio (FR)
P-CSVsniffer 90.04% 5.54%
CleverCSV 79.30% 7.99%
csv.Sniffer 84.25% 19.83%

As a complementary metric, the table below shows the average reliability factor for CSV dialect detection. This value is computed as: $$RF=SR\times (1-FR)$$.

Tool Reliability factor (RF)
P-CSVsniffer 85.05%
CleverCSV 72.96%
csv.Sniffer 67.54%

The below table shows the execution times obtained. In this one we can see that the Python module, reading 6144 characters from the CSV files, is incredibly efficient, easily outperforming the other tools.

Tool Run-time
csv.Sniffer 0.98 sec.
CleverCSV 6.54 sec.
P-CSVsniffer 17.00 sec.

Accuracy analysis

For dialect detection, we have defined True Positive (TP) as the number of CSV files where the dialect was correctly detected. By its way, False Positive (FP) is defined as the number of CSV files where the dialect was incorrectly identified as a specific dialect when it was actually a different dialect. False Negatives (FN) is defined as the number of CSV files where the specific dialect was present but not detected.

The next table shows the precision (P), which measures the accuracy of dialect detection when predicting a specific dialect. The metric is calculated as follows

$$P=\frac{TP}{TP+FP}$$

Data set P-CSVsniffer CleverCSV csv.Sniffer
POLLOCK 0.9655 0.9517 0.9635
CSV Wrangling 0.9057 0.8431 0.8056
CSV Wrangling filtered CODEC 0.8947 0.8425 0.8000
CSV Wrangling MESSY 0.7820 0.7165 0.6667
W3C-CSVW 0.9539 0.6111 0.9769

The following table shows the recall (R), which measures the ability of the method to detect the specific dialect when it is actually present. The metric is calculated as follows

$$R=\frac{TP}{TP+FN}$$

Data set P-CSVsniffer CleverCSV csv.Sniffer
POLLOCK 0.9790 0.9787 0.9231
CSV Wrangling 0.8780 0.8323 0.7682
CSV Wrangling filtered CODEC 0.9297 0.8770 0.8136
CSV Wrangling MESSY 0.9204 0.8585 0.7843
W3C-CSVW 0.9810 0.9635 0.5826

The below table shows the F1 score, which is the most polished measure of dialect detection accuracy. The metric is calculated as follows

$$F1=2 \times \frac{P \times R}{P+R}$$

Data set P-CSVsniffer CleverCSV csv.Sniffer
POLLOCK 0.9722 0.9650 0.9429
CSV Wrangling 0.8916 0.8377 0.7865
CSV Wrangling filtered CODEC 0.9119 0.8594 0.8067
CSV Wrangling MESSY 0.8456 0.7811 0.7207
W3C-CSVW 0.9673 0.7479 0.7299

Thus, the True Positive (TP) weighted F1 score for each tool is computed as

$$F1_{Weighted} Score = \frac{\sum_{i=1}^{n} TP_i \times F1_{Score}i}{\sum_{i=1}^{n} TP_i}$$

where

  • $\text{TP}_i$: The number of True Positive instances of dataset $i$.
  • $F1_{Score}i$: The F1 score for dataset $i$.

The computations are given in the below table.

Tool F1 score
P-CSVsniffer 0.9260
CleverCSV 0.8425
csv.Sniffer 0.8049

Conclusions

By studying the last table it is concluded that the Table Uniformity method is able to predict and determine the dialects of CSV files with an accuracy of 92.60% using a sample of 10 records, while CleverCSV can reach 84.25% accuracy by loading 6144 characters.

The proposed methodology shows an improvement of 8.35% over CleverCSV using the same source code for data type detection. A substantial improvement could derive from a stricter data detection, reducing the number of false positives detected in cells. On the other hand, CleverCSV doesn't shows significant accuracy improvements when reading all the data from the CSV files. This unexpected result helps to reaffirm that dialect detection does not always require reading all the information from the CSV files.

So we can conclude that the Table Uniformity method is the ideal candidate to be implemented as the default dialect detection method of the Python csv module, despite the fact that the increased accuracy leads to an increase in execution time, as shown in a previous table. An alternative would be to use Pandas to avoid the overhead of using the file iterator to pre-filter the lines to be used in the creation of the tables to be evaluated.

Requirements

Below are the requirements for reproducing the experiments.

  • Microsoft Office Excel.
  • Python v3
  • CleverCSV and all its dependencies.

Credits

Many of the CSV files used in this research were recovered from different repositories. Below you can review the list.

  • franciscom/testlink-code-playground/
  • boryn/yii_demo/
  • dengkeaway/kunagi/
  • austinrfnd/arethedodgersplayingtonight.com/
  • okfn/messytables/
  • kmugglet/Reporting/
  • jankvak/Schedule-of-pain/
  • kamilklw/onlineDR/
  • mavcunha/dojosp/
  • evenwestvang/skoolgate/
  • shalomb/dotfiles/
  • ockam/php-csv/
  • code34/war-in-takistan/
  • thejesusbr/GPU-Color2Gray/
  • zardosht/clustering/
  • bryanburgers/personal-site/
  • godlessendeavor/MultiDB/
  • vofp/Relation-Browser-Ruby/
  • javierfdr/Endrov-collaboration/
  • nmklong/limesurvey-cdio3/
  • jkamenik/Rails-Profiler/
  • philogb/philogb.github.com/
  • jaredcohe/sec_scrape/
  • andrewxhill/MOL/
  • abelhegedus/Magic-Collection-Builder/
  • Momus/Multipass/
  • christhorpe/gacscraper/
  • ikharlampenkov/gkh/
  • jekozyra/ngo-project/
  • dynamicpacket-public/opensips/
  • ryzhov/ATS0/
  • pejo751/sube751/
  • markkimsal/agent-ohm/
  • ericpaulbishop/gargoyle/
  • Dopi/JetPlatform/
  • jburks/OpenDidj/
  • matsubo/emoji-sprite/
  • 19maps/getWeather/
  • athomason/perl-Memcached-libmemcached/
  • nadavoid/ReadySiteBase/
  • lagenar/rivadavia/
  • DAISUKEICHIKAWA/pred/
  • mhausenblas/schema-org-rdf/
  • mhausenblas/omnidator/
  • shinichinomura/zipcode_jp-python/
  • bashofmann/opensocial_demo_game/
  • rajkosto/mxoemu/
  • pino1068/riskman/
  • rocksolidwebdesign/AweCMS/
  • duckduckgo/zeroclickinfo-goodies/
  • jeffreyhorner/coloring/
  • ivanistheone/Latent-Dirichlet-Allocation/
  • mvilrokx/BPAD/
  • navid/Nav/
  • lboaretto/Stratos/
  • robertogds/ReallyLateBooking/
  • regdog/goodlife/
  • danux/django-holding-page/
  • popdevelop/snapplr/
  • MichaelMarner/Half-Real-Scenes/
  • bjpop/website/
  • dvydra/coupon-rails/
  • aaronzhang/kunagi/
  • tbielawa/py-prtstat/
  • tricycle/electrodrive-market-analysis/
  • darpified/kunagi/
  • jgarciagarrido/SegoviaOpenTeam/
  • datenspiel/is_csv_importable/
  • cgueret/eRDF/
  • zikula-modules/Eternizer/
  • azizmb/TWSS/
  • kms/ds1052e-measurements/
  • pphetra/fresh-odhd/
  • levivya/investmarketkz/
  • TeamImba/MDAS/
  • saghul/sipp-scenarios/
  • wliao008/mysteryleague/
  • mgius/cpe458/
  • ecell/ecell3-spatiocyte/
  • gnorsilva/frontlinesms-core/
  • tonytian33/shoppinglist/
  • Airead/excise/
  • UncleCJ/alexastuff/
  • colinmollenhour/magento-lite/
  • barbie/test-xhtml/
  • kyr0/ExtZF/
  • petewarden/dstkdata/
  • lacimol/kunagi/
  • apslab/ap-manager/
  • noosamad/mxp-consolidate/
  • practicalparticipation/IKMLinkedResearch/
  • cwegrzyn/RHIPE/
  • nadineproject/nadine/
  • samsonjs/samhuri.net/
  • kcampos/Kuali-Sakai-Functional-Test-Automation-Framework/
  • max7255/FPGA-Analyzer/
  • LATC/sandbox/
  • jeffkit/autoforms/
  • d8agroup/metaLayer-Gateway/
  • hmgaspar/eesddp/
  • fadzril/freelovr/
  • fguillen/simplecov-csv/
  • aminin/google-api-adwords-php/
  • michaeltyson/potionstore/9
  • srveit/distance/
  • zarma/baghdad01/
  • hippiefahrzeug/jeannie/
  • digitalfox/py10n/
  • JanHoralik/jh-prototypes/
  • alexmajy/flashcards/
  • onlytiancai/codesnip/
  • vofp/relation-browser/
  • Error-331/Axis-modules/
  • CarnegieLearning/MathFluency/
  • i-dcc/allele_image/
  • hogi/kunagi/
  • BarbaraEMac/TrivBot/
  • Wisser/Jailer/
  • jasonlong/benfords-law/
  • soccermetrics/marcotti-sql/
  • kennym/itools/
  • rkoeppl/skeinforge_settings/
  • aalzola/prueba/
  • cmheisel/kardboard/
  • gfalav/newcar/
  • PabloCastellano/pablog-scripts/
  • elle/music-library/
  • oppian/xerobis/
  • twidi/satchmo/
  • 34/T/
  • apache/bigtop/
  • ciriarte/laundry/
  • romanchyla/montysolr/
  • alibezz/SNRails_Research/
  • SQLServerIO/IometerParser/
  • renduples/alibtob/
  • evanleonard/kunagi/
  • yuvadm/ayalon-dat/
  • marplatense/erlang/
  • spazm/Iron-Munger/
  • crschmidt/haitibrowser/
  • eLBirador/AllAboutCity/
  • Coalas/apsilesia/
  • emelleme/All-Drag-Racing/
  • despesapublica/site/
  • apkennedy/Emmaus-Silverstripe/
  • Coalas/kolakowski/
  • notioncollective/
  • emelleme/G8-life/
  • mgc544/sscustom-tra/
  • andyinabox/Beyond-the-Debt-Ceiling/
  • emelleme/PhillyOpen/
  • sponsoredlinx/Total-Care-Asphalting/
  • davidmontgomery/Wireframe/
  • frappe/frappe/
  • davedash/SUMO-issues/
  • Habrok/HelloWorld/
  • gaubert/java-balivernes/
  • befair/gasistafelice/
  • komagata/cloister/
  • ryannscy/Unemployment-Chart/
  • guylhem/PerlMSI/
  • lightspeedretail/webstore/
  • naderman/symfony/
  • orchestra-io/sample-symfony2/
  • biow0lf/prometheus2.0/
  • sctape/GrinnellPlans/
  • joshuabenuck/eshell/
  • cmcginn/Common/
  • ruby-rdf/sparql-client/
  • doubi/Text-CSV_XS/
  • cpf-se/citrus/
  • abhishekkr/
  • NeoGeographyToolkit/kraken/
  • denisoid/homebank_import_scripts/
  • Jazzinghen/spamOSEK/
  • DiUS/java-api-bindings/
  • ikko/hazaitop/
  • blancavg/ggplot2-basics/
  • learnstream/ls_atomic/
  • pdawczak/onko/
  • peterkrenn/spit-generator/
  • TasteeWheat/mxoemu1/
  • digitarald/redracer/
  • Ezku/xi-framework/
  • item/sugarcrm_dev/
  • candidasa/silverstripe-phpunit-3.4/
  • EHER/phpunit-all-in-one/
  • gsdevme-archive/CURL/
  • vivid-planet/library/
  • psawaya/CS34-Bus-Routing/
  • headius/jruby-cdc/
  • sbourdeauducq/milkymist-ruby/
  • hamhei/hamcolor/
  • johnl/deb-ruby1.9.1/
  • darealcaffeine/first_app/
  • srirammca53/update_status/
  • nikuuchi/oreore_ruby/
  • Pettrov/Asynchronous-Operations-Module--PHP-/
  • TeamRocketScience/Claroline-TRS-Edition/
  • khjgbkbk/CRMS/
  • milandobrota/WPSocialNetwork/
  • cciechad/brlcad/
  • mattmccray/gumdrop/
  • irace/irace.github.com/
  • ronny/kodepos/
  • Juuro/Dreamapp-Website/
  • Belarus/Windows.OmegaT/
  • tuxnani/pyrhmn/
  • cesine/ToolsForFieldLinguistics/
  • mathics/Mathics/
  • jwiegley/CSV2Ledger/
  • jmlegr/GestComp/
  • mchelen/data_gov_catalog_data/
  • jasherai/maatkit/
  • cwarden/kettle/
  • purpleKarrot/wowmapview/
  • egonz/old_sql/
  • johnantoni/old.tutorials/
  • code-mangler/my-emacs-package/
  • calle/fn/
  • matthewfarrell/gargoyle/
  • e2thex/hackunteers.org/
  • andrewjpage/freezer_tracking/
  • fbacall/simple-spreadsheet-extractor/
  • simonvh/gimmemotifs/
  • disnet/contracts.js/
  • eckes/wandern/
  • zxvf/--2/
  • tyage/town-cake/
  • whiteshark/kunagi/
  • Surgeon/Watir/
  • cawka/DSMS_NBC/
  • dptww1/WW1AirMap2/
  • jonibo/ledgersmb/
  • dlc/ttcsbrandon/
  • akariv/obudget2/
  • meloun/py_ewitis/
  • semantic-im/sim-rf/
  • sankroh/django-form-manager/
  • JudoWill/pyMutF/
  • jou4/parsure/
  • shanx/djangocon.eu/
  • paulgessinger/fermi/
  • ajpalkovic/b2010/
  • OhTu-IDDQD/scheduler3000/
  • OlliD/DtwSequenceCompare/
  • lukeorland/phonography/
  • r0ck3y3/AAI/
  • glenbot/beerapp/
  • StephanHoyer/magento-lucene/

About

Robust CSV dialect detection methodology for Python that outperforms existing state of the art solutions by 8.35% in terms of their F1 scores, using only built-in Python modules.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages