Autodetect file delimiters by scanning the first ten lines #2409

jwnacnud · 2024-05-21T00:31:51Z

I deal with comma-, tab-, pipe-, and even tilde-separated text files, and I swap back and forth. In Python scripts that I have written, I scan the first ten lines to find the frequency of common delimiters and use that to interpret the file. It's correct 99% of the time. Can there be an option to autodetect the delimiter, not based on the file extension?

Here's what I use to autodetect:

import sys
import subprocess

def detect_delimiter(filename, num_lines=10):
    delimiters = {'|': 0, ',': 0, '\t': 0}

    # Read the first few lines and count occurrences of each potential delimiter
    with open(filename, 'r') as file:
        for _ in range(num_lines):
            line = file.readline()
            if not line:
                break
            for delimiter in delimiters:
                delimiters[delimiter] += line.count(delimiter)

    # Determine the most common delimiter
    max_delimiter = max(delimiters, key=delimiters.get)
    if delimiters[max_delimiter] == 0:
        # No candidate delimiter appeared in the sampled lines
        return None
    if max_delimiter == '\t':
        # Escaped form: a literal backslash followed by 't'
        return "\\t"
    return max_delimiter

I'm sure there's a more elegant way to do it.

saulpw · 2024-05-21T01:03:20Z

Hi @jwnacnud, you can do this since v3.0 with a guess_ function: https://www.visidata.org/docs/api/loaders#guessing-filetypes

You should be able to port the above snippet into a function in your visidatarc and it should be used automatically.
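A rough sketch of what that port might look like in a visidatarc follows. It is not tested against VisiData itself. The counting helper `sniff_delimiter` and the name `guess_delimited` are made up for illustration. The `@VisiData.api` decorator, the `p.open(...)` call, and the shape of the returned dict (a `filetype` plus option names such as `delimiter`) are assumptions based on the linked docs, so check them there before relying on this.

```python
from collections import Counter
from itertools import islice

# Candidate delimiters; tilde added since the original post mentions it.
CANDIDATES = ("|", ",", "\t", "~")


def sniff_delimiter(lines, candidates=CANDIDATES):
    """Return the most frequent candidate delimiter in lines, or None if none occur."""
    counts = Counter()
    for line in lines:
        for delim in candidates:
            counts[delim] += line.count(delim)
    if not counts:
        return None  # no lines were sampled
    best, n = counts.most_common(1)[0]
    return best if n > 0 else None


try:
    from visidata import VisiData
except ImportError:
    # Lets the helper above be run and tested outside VisiData.
    VisiData = None

if VisiData is not None:
    @VisiData.api
    def guess_delimited(vd, p):
        # Assumption: p.open() yields text lines, as in the docs' guess_ examples.
        with p.open(encoding=vd.options.encoding) as fp:
            delim = sniff_delimiter(islice(fp, 10))
        if delim is not None:
            # Assumption: keys other than filetype/_likelihood are applied as options.
            return dict(filetype="tsv", delimiter=delim, _likelihood=1)
```

The helper takes any iterable of strings, so it can be tried on its own, for example `sniff_delimiter(["a|b|c\n", "1|2|3\n"])`. Returning `filetype="tsv"` for every delimiter is a simplification. Comma-separated files with quoted fields would probably be better served by the csv loader.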

jwnacnud added the wishlist label May 21, 2024

saulpw closed this as completed May 21, 2024

anjakefala added the wish granted label May 21, 2024
