# Problems with text as tables (draft)

## Introduction

This applies to all TSV, CSV, etc.

The pandas [`read_csv`](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html) function has many features, but getting the settings right is quite a pain.

I look after a corporate data analysis platform and found that, over the course of 76,000 engineering hours, the largest waste of time came from figuring out how to import the data.

To back this with some numbers, here is what I find in 1.5 TB of zipped CSV data (billions of rows):

- Reattempts by engineers to upload the same file with a different encoding: 4.1x (Poisson distributed).
- CSV imports that are accepted by the engineer: 95/105
- Correct inference of datetime: 2 / 396 columns named date, time or datetime (in 17 languages)

***Reading CSV files has been treated as an encoding problem. It evidently isn't. It's a pattern recognition problem.***

Existing CSV readers and file sniffers fail because they:
- don't detect the encoding correctly, or don't work on badly encoded input.
- typically only read the first line after the headers.
- don't handle text escapes during column detection.
- rely on parsers such as `dateutil`, which evaluates a single value at a time and will always say that 1.2.2012 is mm.dd.yyyy, even though the next value might be 29.2.2012 (I've got plenty of tests with that problem).
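The last point can be illustrated with a minimal sketch that checks every value of a column against each candidate format instead of trusting a single value. The `CANDIDATES` table and the `infer_date_format` helper are assumptions made for this illustration; they are not part of pandas or `dateutil`.

```python
from datetime import datetime

# Illustrative candidates only; a real analyser would need many more formats.
CANDIDATES = {
    "dd.mm.yyyy": "%d.%m.%Y",
    "mm.dd.yyyy": "%m.%d.%Y",
}

def infer_date_format(values):
    """Return the names of all candidate formats that parse every value."""
    surviving = []
    for name, fmt in CANDIDATES.items():
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # one failing value rules the format out for the column
        surviving.append(name)
    return surviving

print(infer_date_format(["1.2.2012"]))               # ambiguous: both survive
print(infer_date_format(["1.2.2012", "29.2.2012"]))  # only day-first survives
```

A single value leaves the format ambiguous; the second value removes the ambiguity, which is why the whole column has to be inspected.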

Here's an ugly example:

`Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"`

That should be (including some text escape):

|Birthdate| (Family<br>names)| Father | Mother | Child | known for |
|---|---|---|---|---|---|
|1879-4-14| Einstein| Hermann| Pauline| Albert| General relativity,<br>Special relativity,<br>Photoelectric effect|
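To show why this input is hard, here is what Python's standard `csv` module makes of it. This is only a demonstration with the stdlib reader, not pandas; `skipinitialspace=True` is set so that the spaces after the commas are dropped.

```python
import csv
import io

text = (
    "Birthdate, (Family\nnames), Father, Mother, Child, known for\n"
    '1879-4-14, Einstein, Hermann, Pauline, Albert,"\n'
    'General relativity,\nSpecial relativity,\nPhotoelectric effect"'
)

# newline="" hands line breaks to the csv module untouched.
reader = csv.reader(io.StringIO(text, newline=""), skipinitialspace=True)
rows = list(reader)

for row in rows:
    print(len(row), row)
```

The quoted multi-line field survives as one value, but the unquoted line break inside the brackets splits the header into two records, so the column counts no longer line up with the data row.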


A non-breaking way to add this to pandas would be to let `read_csv` accept a new keyword, `analyse`, which makes it ignore the default keywords.

When `pandas.read_csv(somepath, analyse=True)` is called, the keyword triggers a function at the top of the file reader that will:

- detect the encoding using one of the ~300 Python-accepted [encodings](https://github.com/Ousret/charset_normalizer/blob/51af624b59a7f1a1aaa36f9ae71bee4364e39409/charset_normalizer/constant.py#L310),
- analyse the file,
- return the correct formats to pandas as `**kwargs`.
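The encoding step could start from something like the sketch below, which tries a handful of candidate encodings and records how far each one decodes before failing. The function name `probe_encodings`, the candidate list and the sample bytes are assumptions for illustration. Offsets are counted in bytes, not characters.

```python
def probe_encodings(data: bytes, candidates=("utf-8", "utf_8_sig", "cp855", "latin_1")):
    """For each candidate, report how many bytes decode before the first error."""
    report = {"depth": len(data)}
    for name in candidates:
        try:
            data.decode(name)
            report[name] = len(data)   # the whole sample decodes without error
        except UnicodeDecodeError as err:
            report[name] = err.start   # byte offset of the first failure
    return report

# Hypothetical sample: a header written in Latin-1 by some export tool.
sample = "Birthdate;Fødselsdag\n".encode("latin_1")
print(probe_encodings(sample))
```

A clean decode is necessary but not sufficient: `latin_1` decodes any byte sequence, so the surviving candidates still have to be ranked by how plausible the decoded text looks. That ranking is the pattern recognition part.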

If the function is external to pandas, it could be used as:

```
import pandas as pd
from csv_analyze import csv_analyze  # the proposed external package

df = pd.read_csv(somepath, **csv_analyze(somepath))
```

I can imagine additional output that pandas could react to, such as more detailed information from the analysis:
```
d = csv_analyze(path)
d
    {"encoding": {"depth": 10043,  # characters checked.
                  'cp855': 10043,  # meaning 10043/10043 = 100% characters match.
                  'utf_8_sig': 12,  # meaning decode error after 12th character.
                  'utf-8': 5,
                  "....."},
     "text_escape": [False, True, False, False, False, True],  # one entry per column
     "column_names": ["Birthdate", "(Family\nnames)", "Father", "Mother", "Child", "known for"],
     "datatypes": ["date", "str", "str", "str", "str", "str"],
     "metadata": [  # in order like columns, with list of probabilities for each type.
         {"date": {'yyyy-mm-dd': (100, 100),
                   'mm-dd-yyyy': (0, 100),
                   "....": (0, 0)},
          "time": 0,
          "datetime": 0,
          "str": "pass",
          "int": 0,
          "float": 0},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError},
         {"date": ValueError, "time": ValueError, "datetime": ValueError, "str": (100, 100), "int": ValueError,
          "float": ValueError}
     ]
     }
```

#### API breaking implications

An additional keyword will probably not break much. `read_csv` already has 52 keywords.

#### Describe alternatives you've considered

Create a package and have the option for pandas:

```
import pandas as pd
from csv_analyze import csv_analyze  # the proposed external package

d = csv_analyze(somepath)
df = pd.read_csv(somepath, **d['pandas'])
```

Keeping the package external becomes a pain point if the `pd.read_csv` keyword arguments and the dict that `csv_analyze` returns drift apart.

#### Additional context

The strings I've encountered in csv data are:

SEPARATORS

`#,###.#####` and `#.###,#####`: the last non-digit character indicates the decimal mark; any different characters preceding it are thousands separators.

Examples:
```
4 294 967 295,000  Canadian (English and French), Danish, Finnish, French, German
4.294.967.295,000  Italian, Norwegian, Spanish
4 294 967 295,000  Swedish
4,294,967,295.000  GB-English, US-English, Thai
```
(full treatment at: https://en.wikipedia.org/wiki/Decimal_separator)

However, note that Hindi uses 2-digit grouping, except for the last three digits before the decimal mark: 12,34,56,789.00
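The "last non-digit character is the decimal mark" rule can be sketched as follows. `parse_localized_number` is a hypothetical helper that handles unsigned values only. It applies the rule to a single string, so a value such as `1,234` is read as 1.234; resolving that ambiguity needs evidence from the whole column.

```python
import re

def parse_localized_number(text: str) -> float:
    """Treat the last non-digit character as the decimal mark (unsigned input)."""
    text = text.strip()
    separators = re.findall(r"\D", text)
    if not separators:
        return float(text)  # plain digits, no separators at all
    decimal_mark = separators[-1]
    integer_part, _, fraction = text.rpartition(decimal_mark)
    integer_digits = re.sub(r"\D", "", integer_part)  # drop grouping characters
    return float(f"{integer_digits}.{fraction}")

for raw in ["4 294 967 295,000", "4.294.967.295,000",
            "4,294,967,295.000", "12,34,56,789.00"]:
    print(raw, "->", parse_localized_number(raw))
```

Because grouping characters are simply discarded, the 2-digit grouping needs no special case.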

SCIENTIFIC NOTATION

- `###E###`, `###e###`: integer before and after E
- `#.####E###`, `#.####e###`: floating point before, integer after E
- `###.###N`: a float of up to 3 digits, followed by a suffix N from (K, M, G/B, T, P, E) for kilo, mega, giga/billion, tera, peta, exa.
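A minimal sketch of how both notations could be parsed. The `SUFFIXES` table and `parse_scaled_number` are assumptions for illustration; only upper-case suffixes are accepted here, and a trailing `E` with nothing after it is read as exa rather than as an exponent.

```python
import re

# "B" is read as billion, i.e. the same multiplier as giga.
SUFFIXES = {"K": 1e3, "M": 1e6, "G": 1e9, "B": 1e9, "T": 1e12, "P": 1e15, "E": 1e18}

def parse_scaled_number(text: str) -> float:
    """Parse plain, scientific ('1.5e3') and suffixed ('1.5K') numbers."""
    text = text.strip()
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([KMGBTPE])", text)
    if match:
        mantissa, suffix = match.groups()
        return float(mantissa) * SUFFIXES[suffix]
    return float(text)  # float() already understands 1E3 and 1.5e-3

for raw in ["1E3", "1.5e3", "1.5K", "2M"]:
    print(raw, "->", parse_scaled_number(raw))
```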

POSITIVE/NEGATIVE

Negative numbers can have a trailing minus, or be wrapped in parentheses or brackets:
```
527-
-527
(527)
[527]
```

ADDITIONAL SIGNS

The same applies to percentages: 98%, 98 %, 98 pct, %98

And to currencies: $100.00, kr100,00
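The sign and percentage conventions listed above could be normalised roughly like this. Both helpers are hypothetical and only cover the spellings shown in this section; currencies are left out because the symbol list is open-ended.

```python
import re

def parse_signed(text: str) -> float:
    """Handle -527, 527-, (527) and [527]."""
    text = text.strip()
    negative = False
    if text[:1] + text[-1:] in ("()", "[]"):
        negative, text = True, text[1:-1]
    elif text.endswith("-"):
        negative, text = True, text[:-1]
    elif text.startswith("-"):
        negative, text = True, text[1:]
    value = float(text)
    return -value if negative else value

def parse_percent(text: str) -> float:
    """Handle 98%, 98 %, 98 pct and %98; returns the fraction, e.g. 0.98."""
    text = text.strip()
    match = re.fullmatch(r"%?\s*(\d+(?:\.\d+)?)\s*(?:%|pct)?", text)
    if match is None or not ("%" in text or "pct" in text):
        raise ValueError(f"not a percentage: {text!r}")
    return float(match.group(1)) / 100

print([parse_signed(s) for s in ["527-", "-527", "(527)", "[527]"]])
print([parse_percent(s) for s in ["98%", "98 %", "98 pct", "%98"]])
```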

NON-LATIN NUMBERS

| Script | Digits used |
|---|---|
| Latin | 0 1 2 3 4 5 6 7 8 9 |
| Arabic | ٠ ١ ٢ ٣ ٤ ٥ ٦ ٧ ٨ ٩ |
| Chinese / Japanese | 〇 一 二 三 四 五 六 七 八 九 十… |
| Hebrew | א ,ב ,ג, ד, ה, ו, ז, ח ,ט… |
| Korean (Sino-Korean) | 일 이 삼 사 오 육 칠 팔 구… |
| Korean (native) | 하나 둘 셋 넷 다섯 여섯 일곱 여덟 아홉… |
| Bengali | ০ ১ ২ ৩ ৪ ৫ ৬ ৭ ৮ ৯ |
| Devanagari (script used to write Hindi, Marathi, and other languages) | ० १ २ ३ ४ ५ ६ ७ ८ ९ |
| Gujarati | ૦ ૧ ૨ ૩ ૪ ૫ ૬ ૭ ૮ ૯ |
| Gurmukhi (one of the scripts used to write Punjabi) | ੦ ੧ ੨ ੩ ੪ ੫ ੬ ੭ ੮ ੯ |
| Kannada | ೦ ೧ ೨ ೩ ೪ ೫ ೬ ೭ ೮ ೯ |
| Malayalam | ൦ ൧ ൨ ൩ ൪ ൫ ൬ ൭ ൮ ൯ |
| Odia | ୦ ୧ ୨ ୩ ୪ ୫ ୬ ୭ ୮ ୯ |
| Tamil | ௦ ௧ ௨ ௩ ௪ ௫ ௬ ௭ ௮ ௯ |
| Telugu | ౦ ౧ ౨ ౩ ౪ ౫ ౬ ౭ ౮ ౯ |
| Thai | ๐ ๑ ๒ ๓ ๔ ๕ ๖ ๗ ๘ ๙ |
| Tibetan | ༠ ༡ ༢ ༣ ༤ ༥ ༦ ༧ ༨ ༩ |

Korean regularly uses both a Sino-Korean system and a native Korean system. Everything that can be counted will use one of the two systems, but seldom both.
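Part of this is already handled by Python itself: `int()` and `float()` accept any Unicode decimal digit, and the standard `unicodedata` module exposes per-character values. The sketch below shows that behaviour; `to_ascii_digits` is a hypothetical helper. Number words and letter-based numerals (Hebrew letters, the Korean words above) are not decimal digits and would need their own lookup tables.

```python
import unicodedata

print(int("๑๒๓"))                 # Thai digits
print(int("१२३"))                 # Devanagari digits
print(int("\u0661\u0662\u0663"))  # Arabic-Indic digits one, two, three

def to_ascii_digits(text: str) -> str:
    """Replace every Unicode decimal digit with its ASCII counterpart."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)  # None for non-digit characters
        out.append(ch if value is None else str(value))
    return "".join(out)

print(to_ascii_digits("१२,३४५"))

# CJK numerals are not decimal digits, so int("三") raises ValueError,
# but unicodedata.numeric() still knows the value of a single character.
print(unicodedata.numeric("三"))
```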

FLOATING POINT
The floating-point precision issue, which in particular haunts anyone who reads long barcodes that aren't imported as integers:
```
val = "0.3066101993807095471566981359501369297504425048828125"
print(float(val))  # prints 0.30661019938070955, the remaining digits are lost
```
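One way to avoid the loss is to keep such values out of `float` altogether. The sketch below uses only the standard library; the barcode value is made up for the example.

```python
from decimal import Decimal

val = "0.3066101993807095471566981359501369297504425048828125"
exact = Decimal(val)                 # Decimal keeps every digit of the text
print(str(exact) == val)             # the text round-trips unchanged
print(str(float(val)) == val)        # float() does not round-trip

barcode = "12345678901234567890"     # made-up 20-digit code
print(int(float(barcode)) == int(barcode))  # a float cannot hold 20 digits exactly
print(int(barcode))                         # Python ints have arbitrary precision
```

Identifiers with leading zeros are safest kept as strings; `read_csv` accepts `dtype=str` for that purpose.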

NOT A NUMBER
```
"", "#N/A", "#N/A N/A", "#NA", "-1.#IND", "-1.#QNAN", "-NaN", "-nan", "1.#IND", "1.#QNAN", "<NA>", "N/A", "NA", "NULL", "NaN", "n/a", "nan", "null", "-", "--", "###"
```

Similar variation appears in datetime locale:

```
yyyy-mm-dd  Canadian (English and French), Danish, German, Swedish
dd.mm.yyyy  Finnish
dd.mm.yy    Italian, Norwegian
dd-mm-yy    Spanish
dd/mm/yy    GB-English
mm-dd-yy    US-English
dd/mm/yyyy  Thai
```

The same goes for time:
```
23:59      Canadian
23.59      Finnish
23.59 Uhr  German
Kl 23.59   Norwegian
11:59 PM   Thai
11.59 PM   UK English
```
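The time spellings above can be covered by one tolerant pattern. `TIME_RE` and `parse_time` are assumptions for illustration and only handle the variants listed here. Note that `23.59` on its own is also a valid float, so the decision between time and number again has to be made per column.

```python
import re
from datetime import time

# Optional "Kl" prefix, ":" or "." between hours and minutes,
# optional AM/PM/Uhr marker at the end.
TIME_RE = re.compile(
    r"^(?:kl\.?\s*)?(\d{1,2})[:.](\d{2})\s*(am|pm|uhr)?$", re.IGNORECASE
)

def parse_time(text: str) -> time:
    match = TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised time: {text!r}")
    hour, minute = int(match.group(1)), int(match.group(2))
    marker = (match.group(3) or "").lower()
    if marker == "pm" and hour < 12:
        hour += 12
    elif marker == "am" and hour == 12:
        hour = 0
    return time(hour, minute)  # raises ValueError for out-of-range values

for raw in ["23:59", "23.59", "23.59 Uhr", "Kl 23.59", "11:59 PM", "11.59 PM"]:
    print(raw, "->", parse_time(raw))
```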

TEXT ESCAPE
Finally, we also see newline characters in headers, which the CSV reader cannot deal with. To detect the correct format of the example below, multiple lines have to be read and the interplay between newlines and separators has to be analysed. The fields will also have to be text and bracket escaped.

`Birthdate, (Family\nnames), Father, Mother, Child, known for\n1879-4-14, Einstein, Hermann, Pauline, Albert,"\nGeneral relativity,\nSpecial relativity,\nPhotoelectric effect"`

See the table in the introduction (the "ugly example").
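A minimal sketch of a splitter that treats both quotes and brackets as escapes, so that separators and newlines inside them do not end a field or a record. `split_records` and `BRACKETS` are hypothetical names. The sketch assumes brackets are balanced and does not handle doubled quotes (`""`), so it illustrates the idea rather than replacing a CSV parser.

```python
BRACKETS = {"(": ")", "[": "]"}

def split_records(text, sep=",", quote='"'):
    rows, row, field = [], [], []
    closers = []        # stack of closing brackets we are waiting for
    in_quotes = False
    for ch in text:
        if in_quotes:
            if ch == quote:
                in_quotes = False
            else:
                field.append(ch)
        elif ch == quote:
            in_quotes = True
        elif closers:
            field.append(ch)            # inside brackets everything is data
            if ch == closers[-1]:
                closers.pop()
            elif ch in BRACKETS:
                closers.append(BRACKETS[ch])
        elif ch in BRACKETS:
            field.append(ch)
            closers.append(BRACKETS[ch])
        elif ch == sep:
            row.append("".join(field).strip())
            field = []
        elif ch == "\n":
            row.append("".join(field).strip())
            rows.append(row)
            row, field = [], []
        else:
            field.append(ch)
    if field or row:                    # flush the last record
        row.append("".join(field).strip())
        rows.append(row)
    return rows

text = (
    "Birthdate, (Family\nnames), Father, Mother, Child, known for\n"
    '1879-4-14, Einstein, Hermann, Pauline, Albert,"\n'
    'General relativity,\nSpecial relativity,\nPhotoelectric effect"'
)
for record in split_records(text):
    print(record)
```

With both escapes respected, the example yields one header record and one data record of six fields each.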

LINEBREAK
Very old files can use `\r`, `\r\n` or `\n` as line breaks.
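Which convention a file uses can be estimated by counting the terminators in the raw bytes. `count_line_endings` is a hypothetical helper; it assumes an ASCII-compatible encoding (so not UTF-16) and counts line breaks inside quoted fields too.

```python
def count_line_endings(data: bytes) -> dict:
    """Count \\r\\n pairs, lone \\r and lone \\n in raw bytes."""
    crlf = data.count(b"\r\n")
    return {
        "\r\n": crlf,
        "\r": data.count(b"\r") - crlf,  # carriage returns not followed by \n
        "\n": data.count(b"\n") - crlf,  # line feeds not preceded by \r
    }

print(count_line_endings(b"a,b\r\n1,2\r\n3,4\n5,6\r7,8"))
```

A file with mixed counts is a hint that line breaks inside fields differ from the record terminator.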




