CSVsniffer

A Python implementation of the method described in the paper:

Detecting CSV File Dialects by Table Uniformity Measurement and Data Type Inference (PDF)

Introduction

The results presented here can be reproduced by running the src/run_tests.py script; the outputs are stored in the python/tests results/ folder.

Data

The CSV folder contains the files copied from the Pollock framework and other collected test files. The dataset used for the CSV wrangling research is also available in the CSV_Wranglin folder. Note that, in this last case, only links to the files can be provided because the authors hold the copyright. A dataset from the W3C CSVW project is available in the W3C-CSVW folder.

The expected configuration for each tested CSV file is saved in the Dialect_annotations.txt, Manual_dialect_annotation.txt and W3C-CSVW-Dialect_annotations.txt files, all of which are stored in the ground truth folder.

Results

This section presents the results of running the tests with the beta Python implementation of the Table Uniformity method. To obtain more representative performance metrics, the tests were run on a Linux system.

The table below shows the dialect detection success ratio for P-CSVsniffer, CleverCSV and Python's built-in csv.Sniffer class. Note that the accuracy has been measured using only those files that do not produce a failure when attempting to infer the CSV dialect.

Data set	`P-CSVsniffer`	`CleverCSV`	`csv.Sniffer`
POLLOCK	96.5517%	95.1724%	96.3504%
CSV Wrangling	90.5660%	84.3137%	80.5556%
CSV Wrangling filtered CODEC	89.4737%	84.2520%	80.0000%
CSV Wrangling MESSY	78.1955%	71.6535%	66.6667%
W3C-CSVW	95.3917%	61.1111%	97.6923%

The table below shows the failure ratio for each tool.

Data set	`P-CSVsniffer`	`CleverCSV`	`csv.Sniffer`
POLLOCK [148 files]	2.0270%	2.0270%	7.4324%
CSV Wrangling [179 files]	11.1732%	14.5251%	19.5531%
CSV Wrangling filtered CODEC [142 files]	6.3380%	10.5634%	15.4930%
CSV Wrangling MESSY [126 files]	6.3380%	10.5634%	15.4930%
W3C-CSVW [221 files]	1.8100%	2.2624%	41.1765%
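
The relation between the two ratios can be sketched in a few lines. The sketch assumes, following the note above, that the success ratio is computed over the files that did not fail, while the failure ratio is computed over all files. The function name `detection_ratios` is hypothetical, and the counts in the example (140 correct detections and 3 failures out of 148 POLLOCK files) are an assumption: they were back-computed from the published percentages, not taken from the test logs.

```python
def detection_ratios(total_files, failures, correct):
    """Return (success_ratio, failure_ratio) as fractions in [0, 1].

    Assumption: success is measured only over files that did not fail,
    failure is measured over every file in the data set.
    """
    if total_files <= 0 or not 0 <= failures <= total_files:
        raise ValueError("inconsistent file counts")
    processed = total_files - failures
    failure_ratio = failures / total_files
    success_ratio = correct / processed if processed else 0.0
    return success_ratio, failure_ratio


# Assumed counts for P-CSVsniffer on POLLOCK (back-computed, see above).
sr, fr = detection_ratios(total_files=148, failures=3, correct=140)
print(f"SR = {sr:.4%}, FR = {fr:.4%}")
```

With these assumed counts the sketch reproduces the POLLOCK values reported for `P-CSVsniffer` in both tables.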

The following table shows the success and failure ratios of each tool, averaged over the five data sets above. The higher the failure ratio, the lower the reliability of the detection.

Tool	Success ratio (SR)	Failure ratio (FR)
`P-CSVsniffer`	90.04%	5.54%
`CleverCSV`	79.30%	7.99%
`csv.Sniffer`	84.25%	19.83%

As a complementary metric, the table below shows the average reliability factor for CSV dialect detection. This value is computed as: $$RF=SR\times (1-FR)$$.

Tool	Reliability factor (RF)
`P-CSVsniffer`	85.05%
`CleverCSV`	72.96%
`csv.Sniffer`	67.54%
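
As a quick check of the formula, the following sketch recomputes RF from the rounded average SR and FR values listed in the previous table. The function and variable names are illustrative only.

```python
def reliability_factor(success_ratio, failure_ratio):
    """RF = SR * (1 - FR), with both ratios given as fractions in [0, 1]."""
    return success_ratio * (1.0 - failure_ratio)


# Average (SR, FR) per tool in percent, copied from the table of averages.
averages = {
    "P-CSVsniffer": (90.04, 5.54),
    "CleverCSV": (79.30, 7.99),
    "csv.Sniffer": (84.25, 19.83),
}

reliability = {
    tool: reliability_factor(sr / 100, fr / 100)
    for tool, (sr, fr) in averages.items()
}

for tool, rf in reliability.items():
    print(f"{tool}: RF = {rf:.2%}")
```

Note how `csv.Sniffer` drops below `CleverCSV` once its high failure ratio is taken into account, even though its average success ratio is higher.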

The table below shows the execution times obtained. The built-in Python module (csv.Sniffer), which reads 6144 characters from each CSV file, is by far the fastest tool: it runs about 6.7 times faster than CleverCSV and about 17 times faster than P-CSVsniffer.

Tool	Run-time
`csv.Sniffer`	0.98 sec.
`CleverCSV`	6.54 sec.
`P-CSVsniffer`	17.00 sec.
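
For reference, a minimal sketch of how the standard library sniffer can be called on a 6144-character sample. The helper name `sniff_dialect` and the sample text are illustrative and are not part of the test scripts; returning `None` when `csv.Error` is raised is one possible way to record a detection failure.

```python
import csv

SAMPLE_SIZE = 6144  # number of characters passed to the sniffer


def sniff_dialect(text, sample_size=SAMPLE_SIZE):
    """Return the dialect detected by csv.Sniffer, or None on failure."""
    try:
        return csv.Sniffer().sniff(text[:sample_size])
    except csv.Error:
        # csv.Sniffer raises csv.Error when it cannot determine a dialect.
        return None


sample = "id;name;price\n1;apple;0.50\n2;pear;0.75\n3;plum;1.20\n"
dialect = sniff_dialect(sample)
print(dialect.delimiter if dialect else "detection failed")
```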

Accuracy analysis

For dialect detection, True Positive (TP) is defined as the number of CSV files where the dialect was correctly detected. Likewise, False Positive (FP) is defined as the number of CSV files where a specific dialect was reported but the actual dialect was different. False Negative (FN) is defined as the number of CSV files where the specific dialect was present but not detected.

The next table shows the precision (P), which measures the accuracy of dialect detection when predicting a specific dialect. The metric is calculated as follows:

$$P=\frac{TP}{TP+FP}$$

Data set	`P-CSVsniffer`	`CleverCSV`	`csv.Sniffer`
POLLOCK	0.9655	0.9517	0.9635
CSV Wrangling	0.9057	0.8431	0.8056
CSV Wrangling filtered CODEC	0.8947	0.8425	0.8000
CSV Wrangling MESSY	0.7820	0.7165	0.6667
W3C-CSVW	0.9539	0.6111	0.9769

The following table shows the recall (R), which measures the ability of the method to detect the specific dialect when it is actually present. The metric is calculated as follows:

$$R=\frac{TP}{TP+FN}$$

Data set	`P-CSVsniffer`	`CleverCSV`	`csv.Sniffer`
POLLOCK	0.9790	0.9787	0.9231
CSV Wrangling	0.8780	0.8323	0.7682
CSV Wrangling filtered CODEC	0.9297	0.8770	0.8136
CSV Wrangling MESSY	0.9204	0.8585	0.7843
W3C-CSVW	0.9810	0.9635	0.5826

The table below shows the F1 score, the harmonic mean of precision and recall, which gives the most balanced measure of dialect detection accuracy. The metric is calculated as follows:

$$F1=2 \times \frac{P \times R}{P+R}$$

Data set	`P-CSVsniffer`	`CleverCSV`	`csv.Sniffer`
POLLOCK	0.9722	0.9650	0.9429
CSV Wrangling	0.8916	0.8377	0.7865
CSV Wrangling filtered CODEC	0.9119	0.8594	0.8067
CSV Wrangling MESSY	0.8456	0.7811	0.7207
W3C-CSVW	0.9673	0.7479	0.7299
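
The three metrics above can be sketched as follows. The counts used in the example (TP = 140, FP = 5, FN = 3 for `P-CSVsniffer` on POLLOCK) are an assumption: they were back-computed from the published ratios and are not taken from the test logs.

```python
def precision(tp, fp):
    """P = TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """R = TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """F1 = 2 * P * R / (P + R), the harmonic mean of P and R."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Assumed counts (back-computed from the published percentages).
tp, fp, fn = 140, 5, 3
p = precision(tp, fp)
r = recall(tp, fn)
print(f"P = {p:.4f}, R = {r:.4f}, F1 = {f1_score(p, r):.4f}")
# prints: P = 0.9655, R = 0.9790, F1 = 0.9722
```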

Thus, the True Positive (TP) weighted F1 score for each tool is computed as:

$$F1_{Weighted} = \frac{\sum_{i=1}^{n} TP_i \times F1_i}{\sum_{i=1}^{n} TP_i}$$

where

$TP_i$: The number of True Positive instances of dataset $i$.
$F1_i$: The F1 score for dataset $i$.
$n$: The number of datasets.

The results are given in the table below.

Tool	F1 score
`P-CSVsniffer`	0.9260
`CleverCSV`	0.8425
`csv.Sniffer`	0.8049
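
The weighting can be illustrated with a short sketch. The F1 values are the `P-CSVsniffer` column of the F1 table; the per-dataset TP counts are hypothetical, back-computed from the published success and failure ratios rather than read from the test results.

```python
def weighted_f1(tp_counts, f1_scores):
    """TP-weighted mean of per-dataset F1 scores."""
    if len(tp_counts) != len(f1_scores):
        raise ValueError("one TP count is required per F1 score")
    total_tp = sum(tp_counts)
    if total_tp == 0:
        raise ValueError("at least one true positive is required")
    return sum(tp * f1 for tp, f1 in zip(tp_counts, f1_scores)) / total_tp


# P-CSVsniffer F1 per dataset, in the order used by the tables above.
f1_scores = [0.9722, 0.8916, 0.9119, 0.8456, 0.9673]
# Hypothetical TP counts (back-computed, not from the test logs).
tp_counts = [140, 144, 119, 104, 207]

print(f"{weighted_f1(tp_counts, f1_scores):.4f}")
```

Under these assumed counts the sketch yields 0.9260, which agrees with the value reported for `P-CSVsniffer`.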

Conclusions

From the last table it can be concluded that the Table Uniformity method determines the dialects of CSV files with a weighted F1 score of 92.60% using a sample of 10 records, while CleverCSV reaches 84.25% by loading 6144 characters.
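
To make the sample size concrete, the following sketch takes the first 10 records from a line iterator. It is only an illustration: the helper name `sample_records` is hypothetical, and skipping blank lines is an assumption about the pre-filtering step, not a description of the actual implementation.

```python
import io
from itertools import islice


def sample_records(lines, max_records=10):
    """Return up to max_records non-blank lines from an iterable of lines."""
    # Assumption: blank lines are skipped and do not count as records.
    non_blank = (line.rstrip("\r\n") for line in lines if line.strip())
    return list(islice(non_blank, max_records))


# An in-memory file with a header, one blank line and 20 data rows.
source = io.StringIO("a,b\n\n" + "".join(f"{i},{i * 2}\n" for i in range(20)))
records = sample_records(source)
print(len(records), records[:3])
```

Because `islice` stops after the tenth record, the rest of the file is never read.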

The proposed methodology improves on CleverCSV by 8.35 percentage points while using the same source code for data type detection. A further substantial improvement could come from stricter data type detection, which would reduce the number of false positives detected in cells. On the other hand, CleverCSV does not show significant accuracy improvements when reading all the data from the CSV files. This unexpected result reaffirms that dialect detection does not always require reading all the information in a CSV file.

We can therefore conclude that the Table Uniformity method is the ideal candidate to be implemented as the default dialect detection method of the Python csv module, even though the increased accuracy comes with a longer execution time, as shown in the run-time table above. One way to reduce the execution time would be to use Pandas, avoiding the overhead of using the file iterator to pre-filter the lines used to build the tables to be evaluated.

Requirements

Below are the requirements for reproducing the experiments.

Microsoft Office Excel.
Python v3.
CleverCSV and all its dependencies.

Credits

Many of the CSV files used in this research were recovered from different repositories. Below you can review the list.

