A Python implementation of the method described in the paper:
Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference (PDF)
by W. GarcĂa.
The results presented here can be reproduced by running the scripts from the src/run_tests.py
file and are stored in the python/tests results/
folder.
The CSV
folder contains the files copied from the Pollock framework and other collected test files. Also the dataset used for the CSV wrangling research is available in the CSV_Wranglin
folder. Note that only link to the files can be provided, in this last case,due to the authors holds the copyright. A dataset from w3c, CSVW project, is available in the W3C-CSVW
folder.
The expect configuration for each tested CSV is saved in the Dialect_annotations.txt
, Manual_dialect_annotation.txt
and W3C-CSVW-Dialect_annotations.txt
files. All of them are stored in the ground truth
folder.
In this section, the results after running tests with the Beta Python implementation of the Table Uniformity method are presented. In order to obtain more representative performance metrics, it was decided to run the tests on a system running Linux.
The table below shows the dialect detection success ratio for P-CSVsniffer
, CleverCSV
and the built-in Python csv.Sniffer
class module. Note that the accuracy has been measured using only those files that do not produce a failure when attempting to infer CSV dialects
Data set | P-CSVsniffer |
CleverCSV |
csv.Sniffer |
---|---|---|---|
POLLOCK | 96.5517% | 95.1724% | 96.3504% |
CSV Wrangling | 90.5660% | 84.3137% | 80.5556% |
CSV Wrangling filtered CODEC | 89.4737% | 84.2520% | 80.0000% |
CSV Wrangling MESSY | 78.1955% | 71.6535% | 66.6667% |
W3C-CSVW | 95.3917% | 61.1111% | 97.6923% |
The table below shows the failure ratio for each tool.
Data set | P-CSVsniffer |
CleverCSV |
csv.Sniffer |
---|---|---|---|
POLLOCK [148 files] | 2.0270% | 2.0270% | 7.4324% |
CSV Wrangling [179 files] | 11.1732% | 14.5251% | 19.5531% |
CSV Wrangling filtered CODEC [142 files] | 6.3380% | 10.5634% | 15.4930% |
CSV Wrangling MESSY [126 files] | 6.3380% | 10.5634% | 15.4930% |
W3C-CSVW [221 files] | 1.8100% | 2.2624% | 41.1765% |
The following table shows the average success and failure ratio for selected tools. The higher the number of errors obtained, the lower the reliability for detection.
Tool | Success ratio (SR) | Failure ratio (FR) |
---|---|---|
P-CSVsniffer |
90.04% | 5.54% |
CleverCSV |
79.30% | 7.99% |
csv.Sniffer |
84.25% | 19.83% |
As a complementary metric, the table below shows the average reliability factor for CSV dialect detection. This value is computed as:
Tool | Reliability factor (RF) |
---|---|
P-CSVsniffer |
85.05% |
CleverCSV |
72.96% |
csv.Sniffer |
67.54% |
The below table shows the execution times obtained. In this one we can see that the Python module, reading 6144 characters from the CSV files, is incredibly efficient, easily outperforming the other tools.
Tool | Run-time |
---|---|
csv.Sniffer |
0.98 sec. |
CleverCSV |
6.54 sec. |
P-CSVsniffer |
17.00 sec. |
For dialect detection, we have defined True Positive (TP) as the number of CSV files where the dialect was correctly detected. By its way, False Positive (FP) is defined as the number of CSV files where the dialect was incorrectly identified as a specific dialect when it was actually a different dialect. False Negatives (FN) is defined as the number of CSV files where the specific dialect was present but not detected.
The next table shows the precision (P), which measures the accuracy of dialect detection when predicting a specific dialect. The metric is calculated as follows
Data set | P-CSVsniffer |
CleverCSV |
csv.Sniffer |
---|---|---|---|
POLLOCK | 0.9655 | 0.9517 | 0.9635 |
CSV Wrangling | 0.9057 | 0.8431 | 0.8056 |
CSV Wrangling filtered CODEC | 0.8947 | 0.8425 | 0.8000 |
CSV Wrangling MESSY | 0.7820 | 0.7165 | 0.6667 |
W3C-CSVW | 0.9539 | 0.6111 | 0.9769 |
The following table shows the recall (R), which measures the ability of the method to detect the specific dialect when it is actually present. The metric is calculated as follows
Data set | P-CSVsniffer |
CleverCSV |
csv.Sniffer |
---|---|---|---|
POLLOCK | 0.9790 | 0.9787 | 0.9231 |
CSV Wrangling | 0.8780 | 0.8323 | 0.7682 |
CSV Wrangling filtered CODEC | 0.9297 | 0.8770 | 0.8136 |
CSV Wrangling MESSY | 0.9204 | 0.8585 | 0.7843 |
W3C-CSVW | 0.9810 | 0.9635 | 0.5826 |
The below table shows the F1 score, which is the most polished measure of dialect detection accuracy. The metric is calculated as follows
Data set | P-CSVsniffer |
CleverCSV |
csv.Sniffer |
---|---|---|---|
POLLOCK | 0.9722 | 0.9650 | 0.9429 |
CSV Wrangling | 0.8916 | 0.8377 | 0.7865 |
CSV Wrangling filtered CODEC | 0.9119 | 0.8594 | 0.8067 |
CSV Wrangling MESSY | 0.8456 | 0.7811 | 0.7207 |
W3C-CSVW | 0.9673 | 0.7479 | 0.7299 |
Thus, the True Positive (TP) weighted F1 score for each tool is computed as
where
-
$\text{TP}_i$ : The number of True Positive instances of dataset$i$ . -
$F1_{Score}i$ : The F1 score for dataset$i$ .
The computations are given in the below table.
Tool | F1 score |
---|---|
P-CSVsniffer |
0.9260 |
CleverCSV |
0.8425 |
csv.Sniffer |
0.8049 |
By studying the last table it is concluded that the Table Uniformity method is able to predict and determine the dialects of CSV files with an accuracy of 92.60% using a sample of 10 records, while CleverCSV
can reach 84.25% accuracy by loading 6144 characters.
The proposed methodology shows an improvement of 8.35% over CleverCSV
using the same source code for data type detection. A substantial improvement could derive from a stricter data detection, reducing the number of false positives detected in cells. On the other hand, CleverCSV
doesn't shows significant accuracy improvements when reading all the data from the CSV files. This unexpected result helps to reaffirm that dialect detection does not always require reading all the information from the CSV files.
So we can conclude that the Table Uniformity method is the ideal candidate to be implemented as the default dialect detection method of the Python csv
module, despite the fact that the increased accuracy leads to an increase in execution time, as shown in a previous table. An alternative would be to use Pandas
to avoid the overhead of using the file iterator to pre-filter the lines to be used in the creation of the tables to be evaluated.
Below are the requirements for reproducing the experiments.
- Microsoft Office Excel.
- Python v3
- CleverCSV and all its dependencies.
Many of the CSV files used in this research were recovered from different repositories. Below you can review the list.
- franciscom/testlink-code-playground/
- boryn/yii_demo/
- dengkeaway/kunagi/
- austinrfnd/arethedodgersplayingtonight.com/
- okfn/messytables/
- kmugglet/Reporting/
- jankvak/Schedule-of-pain/
- kamilklw/onlineDR/
- mavcunha/dojosp/
- evenwestvang/skoolgate/
- shalomb/dotfiles/
- ockam/php-csv/
- code34/war-in-takistan/
- thejesusbr/GPU-Color2Gray/
- zardosht/clustering/
- bryanburgers/personal-site/
- godlessendeavor/MultiDB/
- vofp/Relation-Browser-Ruby/
- javierfdr/Endrov-collaboration/
- nmklong/limesurvey-cdio3/
- jkamenik/Rails-Profiler/
- philogb/philogb.github.com/
- jaredcohe/sec_scrape/
- andrewxhill/MOL/
- abelhegedus/Magic-Collection-Builder/
- Momus/Multipass/
- christhorpe/gacscraper/
- ikharlampenkov/gkh/
- jekozyra/ngo-project/
- dynamicpacket-public/opensips/
- ryzhov/ATS0/
- pejo751/sube751/
- markkimsal/agent-ohm/
- ericpaulbishop/gargoyle/
- Dopi/JetPlatform/
- jburks/OpenDidj/
- matsubo/emoji-sprite/
- 19maps/getWeather/
- athomason/perl-Memcached-libmemcached/
- nadavoid/ReadySiteBase/
- lagenar/rivadavia/
- DAISUKEICHIKAWA/pred/
- mhausenblas/schema-org-rdf/
- mhausenblas/omnidator/
- shinichinomura/zipcode_jp-python/
- bashofmann/opensocial_demo_game/
- rajkosto/mxoemu/
- pino1068/riskman/
- rocksolidwebdesign/AweCMS/
- duckduckgo/zeroclickinfo-goodies/
- jeffreyhorner/coloring/
- ivanistheone/Latent-Dirichlet-Allocation/
- mvilrokx/BPAD/
- navid/Nav/
- lboaretto/Stratos/
- robertogds/ReallyLateBooking/
- regdog/goodlife/
- danux/django-holding-page/
- popdevelop/snapplr/
- MichaelMarner/Half-Real-Scenes/
- bjpop/website/
- dvydra/coupon-rails/
- aaronzhang/kunagi/
- tbielawa/py-prtstat/
- tricycle/electrodrive-market-analysis/
- darpified/kunagi/
- jgarciagarrido/SegoviaOpenTeam/
- datenspiel/is_csv_importable/
- cgueret/eRDF/
- zikula-modules/Eternizer/
- azizmb/TWSS/
- kms/ds1052e-measurements/
- pphetra/fresh-odhd/
- levivya/investmarketkz/
- TeamImba/MDAS/
- saghul/sipp-scenarios/
- wliao008/mysteryleague/
- mgius/cpe458/
- ecell/ecell3-spatiocyte/
- gnorsilva/frontlinesms-core/
- tonytian33/shoppinglist/
- Airead/excise/
- UncleCJ/alexastuff/
- colinmollenhour/magento-lite/
- barbie/test-xhtml/
- kyr0/ExtZF/
- petewarden/dstkdata/
- lacimol/kunagi/
- apslab/ap-manager/
- noosamad/mxp-consolidate/
- practicalparticipation/IKMLinkedResearch/
- cwegrzyn/RHIPE/
- nadineproject/nadine/
- samsonjs/samhuri.net/
- kcampos/Kuali-Sakai-Functional-Test-Automation-Framework/
- max7255/FPGA-Analyzer/
- LATC/sandbox/
- jeffkit/autoforms/
- d8agroup/metaLayer-Gateway/
- hmgaspar/eesddp/
- fadzril/freelovr/
- fguillen/simplecov-csv/
- aminin/google-api-adwords-php/
- michaeltyson/potionstore/9
- srveit/distance/
- zarma/baghdad01/
- hippiefahrzeug/jeannie/
- digitalfox/py10n/
- JanHoralik/jh-prototypes/
- alexmajy/flashcards/
- onlytiancai/codesnip/
- vofp/relation-browser/
- Error-331/Axis-modules/
- CarnegieLearning/MathFluency/
- i-dcc/allele_image/
- hogi/kunagi/
- BarbaraEMac/TrivBot/
- Wisser/Jailer/
- jasonlong/benfords-law/
- soccermetrics/marcotti-sql/
- kennym/itools/
- rkoeppl/skeinforge_settings/
- aalzola/prueba/
- cmheisel/kardboard/
- gfalav/newcar/
- PabloCastellano/pablog-scripts/
- elle/music-library/
- oppian/xerobis/
- twidi/satchmo/
- 34/T/
- apache/bigtop/
- ciriarte/laundry/
- romanchyla/montysolr/
- alibezz/SNRails_Research/
- SQLServerIO/IometerParser/
- renduples/alibtob/
- evanleonard/kunagi/
- yuvadm/ayalon-dat/
- marplatense/erlang/
- spazm/Iron-Munger/
- crschmidt/haitibrowser/
- eLBirador/AllAboutCity/
- Coalas/apsilesia/
- emelleme/All-Drag-Racing/
- despesapublica/site/
- apkennedy/Emmaus-Silverstripe/
- Coalas/kolakowski/
- notioncollective/
- emelleme/G8-life/
- mgc544/sscustom-tra/
- andyinabox/Beyond-the-Debt-Ceiling/
- emelleme/PhillyOpen/
- sponsoredlinx/Total-Care-Asphalting/
- davidmontgomery/Wireframe/
- frappe/frappe/
- davedash/SUMO-issues/
- Habrok/HelloWorld/
- gaubert/java-balivernes/
- befair/gasistafelice/
- komagata/cloister/
- ryannscy/Unemployment-Chart/
- guylhem/PerlMSI/
- lightspeedretail/webstore/
- naderman/symfony/
- orchestra-io/sample-symfony2/
- biow0lf/prometheus2.0/
- sctape/GrinnellPlans/
- joshuabenuck/eshell/
- cmcginn/Common/
- ruby-rdf/sparql-client/
- doubi/Text-CSV_XS/
- cpf-se/citrus/
- abhishekkr/
- NeoGeographyToolkit/kraken/
- denisoid/homebank_import_scripts/
- Jazzinghen/spamOSEK/
- DiUS/java-api-bindings/
- ikko/hazaitop/
- blancavg/ggplot2-basics/
- learnstream/ls_atomic/
- pdawczak/onko/
- peterkrenn/spit-generator/
- TasteeWheat/mxoemu1/
- digitarald/redracer/
- Ezku/xi-framework/
- item/sugarcrm_dev/
- candidasa/silverstripe-phpunit-3.4/
- EHER/phpunit-all-in-one/
- gsdevme-archive/CURL/
- vivid-planet/library/
- psawaya/CS34-Bus-Routing/
- headius/jruby-cdc/
- sbourdeauducq/milkymist-ruby/
- hamhei/hamcolor/
- johnl/deb-ruby1.9.1/
- darealcaffeine/first_app/
- srirammca53/update_status/
- nikuuchi/oreore_ruby/
- Pettrov/Asynchronous-Operations-Module--PHP-/
- TeamRocketScience/Claroline-TRS-Edition/
- khjgbkbk/CRMS/
- milandobrota/WPSocialNetwork/
- cciechad/brlcad/
- mattmccray/gumdrop/
- irace/irace.github.com/
- ronny/kodepos/
- Juuro/Dreamapp-Website/
- Belarus/Windows.OmegaT/
- tuxnani/pyrhmn/
- cesine/ToolsForFieldLinguistics/
- mathics/Mathics/
- jwiegley/CSV2Ledger/
- jmlegr/GestComp/
- mchelen/data_gov_catalog_data/
- jasherai/maatkit/
- cwarden/kettle/
- purpleKarrot/wowmapview/
- egonz/old_sql/
- johnantoni/old.tutorials/
- code-mangler/my-emacs-package/
- calle/fn/
- matthewfarrell/gargoyle/
- e2thex/hackunteers.org/
- andrewjpage/freezer_tracking/
- fbacall/simple-spreadsheet-extractor/
- simonvh/gimmemotifs/
- disnet/contracts.js/
- eckes/wandern/
- zxvf/--2/
- tyage/town-cake/
- whiteshark/kunagi/
- Surgeon/Watir/
- cawka/DSMS_NBC/
- dptww1/WW1AirMap2/
- jonibo/ledgersmb/
- dlc/ttcsbrandon/
- akariv/obudget2/
- meloun/py_ewitis/
- semantic-im/sim-rf/
- sankroh/django-form-manager/
- JudoWill/pyMutF/
- jou4/parsure/
- shanx/djangocon.eu/
- paulgessinger/fermi/
- ajpalkovic/b2010/
- OhTu-IDDQD/scheduler3000/
- OlliD/DtwSequenceCompare/
- lukeorland/phonography/
- r0ck3y3/AAI/
- glenbot/beerapp/
- StephanHoyer/magento-lucene/