rewrite ExcelParser to use openpyxl
 - rename old xlrd wrapper to OldExcelParser
 - remove remaining Python 2 code
sheppard committed Nov 16, 2021
1 parent 556ade5 commit 475eb6c
Showing 28 changed files with 157 additions and 129 deletions.
18 changes: 8 additions & 10 deletions README.md
@@ -3,7 +3,7 @@
 ```python
 from itertable import load_file
 
-for row in load_file("example.xls"):
+for row in load_file("example.xlsx"):
     print(row.date, row.name)
 ```
 
@@ -39,21 +39,21 @@ python3 -m pip install itertable
 # GIS support (Fiona & Shapely)
 python3 -m pip install itertable[gis]
 
-# Excel write support
-python3 -m pip install itertable[write]
-# (xls/xlsx read support is enabled by default)
+# Excel 97-2003 (.xls) support
+python3 -m pip install itertable[oldexcel]
+# (xlsx support is enabled by default)
 
 # Pandas integration
 python3 -m pip install itertable[pandas]
 ```
 
 ## Overview
 
-IterTable provides a general purpose API for loading, iterating over, and writing tabular datasets. The goal is to avoid needing to remember the unique usage of e.g. [csv], [xlrd], or [xml.etree] every time one needs to work with external data. Instead, IterTable abstracts these libraries into a consistent interface that works as an [iterable] of [namedtuples]. Whenever possible, the field names for a dataset are automatically determined from the source file, e.g. the column headers in an Excel spreadsheet.
+IterTable provides a general purpose API for loading, iterating over, and writing tabular datasets. The goal is to avoid needing to remember the unique usage of e.g. [csv], [openpyxl], or [xml.etree] every time one needs to work with external data. Instead, IterTable abstracts these libraries into a consistent interface that works as an [iterable] of [namedtuples]. Whenever possible, the field names for a dataset are automatically determined from the source file, e.g. the column headers in an Excel spreadsheet.
 
 ```python
 from itertable import ExcelFileIter
-data = ExcelFileIter(filename='example.xls')
+data = ExcelFileIter(filename='example.xlsx')
 for row in data:
     print(row.name, row.date)
 ```
@@ -67,7 +67,7 @@ for row in data:
     print(row.name, row.date)
 ```
 
-All of the included `*FileIter` classes support both reading and writing to external files, though write support for Excel files requires `itertable[write]` (which installs `xlwt` and `xlswriter`).
+All of the included `*FileIter` classes support both reading and writing to external files.
 
 ### Network Client
 
@@ -126,13 +126,11 @@ It is straightforward to [extend IterTable][custom] to support arbitrary formats
 
 [wq framework]: https://wq.io/
 [csv]: https://docs.python.org/3/library/csv.html
-[xlrd]: http://www.python-excel.org/
+[openpyxl]: https://openpyxl.readthedocs.io/en/stable/
 [xml.etree]: https://docs.python.org/3/library/xml.etree.elementtree.html
 [iterable]: https://docs.python.org/3/glossary.html#term-iterable
 [namedtuples]: https://docs.python.org/3/library/collections.html#collections.namedtuple
 [requests]: http://python-requests.org/
-[xlwt]: http://www.python-excel.org/
-[xlsxwriter]: https://xlsxwriter.readthedocs.org/
 [Pandas]: http://pandas.pydata.org/
 [DataFrame]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
 [Fiona]: https://github.com/Toblerity/Fiona
2 changes: 1 addition & 1 deletion docs/mappers.md
@@ -10,7 +10,7 @@ Mappers
 from itertable import ExcelFileIter
 
 # Loader and Parser do their work here
-instance = ExcelFileIter(filename='example.xls')
+instance = ExcelFileIter(filename='example.xlsx')
 
 # Mapper does its work here
 for row in instance:
8 changes: 5 additions & 3 deletions docs/parsers.md
@@ -94,7 +94,9 @@ quotechar | Quotation character for text values containing spaces or delimiters,
 reader_class() | Method returning an uninstantiated `DictReader` class for use in parsing the data. The default method returns a subclass of `SkipPreludeReader` that passes along the `max_header_row` option.
 
 #### [ExcelParser (`WorkbookParser`)[itertable.parsers.xls]
-`ExcelParser` provides support for both "old" (.xls) and "new" (.xlsx) files via the [xlrd] module. Write support can be enabled by installing the [xlwt] and/or [xlsxwrite] modules. `ExcelParser` extends a somewhat more generic `WorkbookParser`, with the idea that the latter could eventually be extended to other "workbook" style formats like ODS.
+`ExcelParser` provides support for `.xlsx` files via the [openpyxl] module, while `OldExcelParser` supports `.xls` if [xlrd] and [xlwt] are installed. `ExcelParser` and `OldExcelParser` extend a somewhat more generic `WorkbookParser`, with the idea that the latter could eventually be extended to other "workbook" style formats like ODS.
+
+> Note: In previous versions of itertable, `ExcelParser` relied on [xlrd] to support both `.xlsx` and `.xls` formats. Now that xlrd has dropped `.xlsx` support, `ExcelParser` has been rewritten to use [openpyxl], which only supports `.xlsx` files. The old `xlrd`-based `ExcelParser` class has been renamed to `OldExcelParser`. (In most cases you can just use `itertable.load_file()` which automatically determines whether to use `ExcelParser` or `OldExcelParser`).
 
 ##### Properties
 name | purpose
@@ -110,7 +112,7 @@ name | purpose
 `parse_row(row)` | Convert the given row object into a dict, usually by mapping the column header to the value in each cell
 `get_value(cell)` | Retrieve the actual value from the cell.
 
-The methods listed above are called in turn by `parse()`, which is defined by `WorkbookParser`. Working implementations of the methods are defined in `ExcelParser`.
+The methods listed above are called in turn by `parse()`, which is defined by `WorkbookParser`. Working implementations of the methods are defined in `ExcelParser` and `OldExcelParser`.
 
 [itertable.parsers]: https://github.com/wq/itertable/blob/master/itertable/parsers/
 [itertable.parsers.base]: https://github.com/wq/itertable/blob/master/itertable/parsers/base.py
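
The arrangement described in this hunk, where `parse()` lives in the base class and calls hook methods supplied by subclasses, is a template-method pattern. The sketch below illustrates the idea with plain lists standing in for a worksheet. It is not itertable's code: `SketchWorkbookParser`, `ListSheetParser`, `rows`, and `column_names` are hypothetical names, and only `parse()`, `parse_row()`, and `get_value()` are taken from the table above.

```python
class SketchWorkbookParser:
    """Hypothetical base class: parse() drives the hook methods below."""

    def __init__(self, rows):
        self.rows = rows          # raw rows; the first one holds the headers
        self.column_names = []
        self.data = []

    def parse(self):
        header, *body = self.rows
        self.column_names = [self.get_value(cell) for cell in header]
        self.data = [self.parse_row(row) for row in body]

    def parse_row(self, row):
        # Map each column header to the value in the matching cell.
        return {
            name: self.get_value(cell)
            for name, cell in zip(self.column_names, row)
        }

    def get_value(self, cell):
        # Format-specific subclasses decide how to unwrap a cell.
        raise NotImplementedError


class ListSheetParser(SketchWorkbookParser):
    """Hypothetical subclass: each cell is a (type, value) pair."""

    def get_value(self, cell):
        cell_type, value = cell
        return value.strip() if cell_type == "text" else value


parser = ListSheetParser([
    [("text", "name "), ("text", "count")],
    [("text", "alpha"), ("number", 3)],
])
parser.parse()
print(parser.data)
```

Under this design, supporting another workbook format only requires a new subclass that knows how to read that format's cells; the row-to-dict logic stays in one place.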
@@ -133,4 +135,4 @@ The methods listed above are called in turn by `parse()`, which is defined by `W
 [Django Data Wizard]: https://github.com/wq/django-data-wizard
 [DictReader]: https://docs.python.org/3/library/csv.html#csv.DictReader
 [xlwt]: http://www.python-excel.org/
-[xlsxwrite]: https://xlsxwriter.readthedocs.org/
+[openpyxl]: https://openpyxl.readthedocs.io/en/stable/
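
The note added to docs/parsers.md says that `itertable.load_file()` chooses between `ExcelParser` and `OldExcelParser` automatically. This diff does not show how that choice is made, so the following is only a sketch of one plausible approach, dispatching on the file extension. `PARSER_BY_EXTENSION` and `pick_parser_name` are hypothetical names, and the library's real detection logic may differ.

```python
import os

# Hypothetical lookup table; itertable's real detection may work differently.
PARSER_BY_EXTENSION = {
    ".xlsx": "ExcelParser",     # openpyxl-based, enabled by default
    ".xls": "OldExcelParser",   # xlrd-based, needs itertable[oldexcel]
}


def pick_parser_name(filename):
    """Return the name of the parser class suited to the file's extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"No workbook parser registered for {ext!r}")


print(pick_parser_name("example.xlsx"))
print(pick_parser_name("legacy/REPORT.XLS"))
```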
5 changes: 5 additions & 0 deletions itertable/__init__.py
@@ -14,6 +14,7 @@
     CsvParser,
     JsonParser,
     XmlParser,
+    OldExcelParser,
     ExcelParser,
 )
 
@@ -52,6 +53,7 @@
     'JsonParser',
     'XmlParser',
     'ExcelParser',
+    'OldExcelParser',
 
     'BaseMapper',
     'DictMapper',
@@ -80,6 +82,7 @@
     'XmlNetIter',
     'XmlStringIter',
 
+    'OldExcelFileIter',
     'ExcelFileIter',
 )
 
@@ -96,7 +99,9 @@
 XmlNetIter = make_iter(NetLoader, XmlParser)
 XmlStringIter = make_iter(StringLoader, XmlParser)
 
+OldExcelFileIter = make_iter(FileLoader, OldExcelParser)
 ExcelFileIter = make_iter(FileLoader, ExcelParser)
+OldExcelNetIter = make_iter(NetLoader, OldExcelParser)
 ExcelNetIter = make_iter(NetLoader, ExcelParser)
 
 try:
8 changes: 1 addition & 7 deletions itertable/loaders.py
@@ -1,12 +1,6 @@
 from __future__ import print_function
 import requests
-try:
-    # Python 2 (uses str)
-    from StringIO import StringIO
-except ImportError:
-    # Python 3 (Python 2 equivalent uses unicode)
-    from io import StringIO
-from io import BytesIO
+from io import StringIO, BytesIO
 from .version import VERSION
 from .exceptions import LoadFailed
 from zipfile import ZipFile
10 changes: 3 additions & 7 deletions itertable/mappers.py
@@ -88,13 +88,9 @@ def field_map(self):
     def tuple_field_name(self, field):
         field = self.clean_field_name(field)
         field = re.sub(r'\W', '', field.lower())
-        try:
-            # normalize identifiers for consistency with namedtuple
-            # http://bugs.python.org/issue23091
-            field = normalize('NFKC', field)
-        except TypeError:
-            # normalize doesn't work on Python 2 str() instances
-            pass
+        # normalize identifiers for consistency with namedtuple
+        # http://bugs.python.org/issue23091
+        field = normalize('NFKC', field)
         return field
 
     def clean_field_name(self, field):
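
To illustrate what the simplified `tuple_field_name()` does, the sketch below applies the same two steps (strip non-word characters, then NFKC-normalize) as a standalone function. It skips `clean_field_name()`, whose behavior is not shown in this diff, so `sketch_field_name` is a hypothetical stand-in rather than the library method, and the sample header is made up.

```python
import re
from collections import namedtuple
from unicodedata import normalize


def sketch_field_name(field):
    # Lowercase and drop anything that is not a word character.
    field = re.sub(r'\W', '', field.lower())
    # NFKC folds compatibility characters (here the U+FB01 "fi" ligature)
    # into the form Python itself uses when it parses identifiers.
    return normalize('NFKC', field)


# "\ufb01" is the single-character "fi" ligature.
name = sketch_field_name("Pro\ufb01le Name")
Row = namedtuple("Row", [name])
print(name, Row._fields)
```

With the Python 2 fallback gone, the function always receives a `str`, so the `try`/`except TypeError` wrapper around `normalize()` is no longer needed.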
3 changes: 2 additions & 1 deletion itertable/parsers/__init__.py
@@ -1,10 +1,11 @@
 from .text import CsvParser, JsonParser, XmlParser
-from .xls import ExcelParser
+from .xls import OldExcelParser, ExcelParser
 
 
 __all__ = (
     'CsvParser',
     'JsonParser',
     'XmlParser',
+    'OldExcelParser',
     'ExcelParser',
 )
23 changes: 5 additions & 18 deletions itertable/parsers/readers.py
@@ -1,21 +1,7 @@
-try:
-    import unicodecsv as csv
-    UNICODE_CSV = True
-except ImportError:
-    import csv
-    UNICODE_CSV = False
+import csv
 
 
-if issubclass(csv.DictReader, object):
-    # Python 3
-    DictReader = csv.DictReader
-else:
-    # Python 2
-    class DictReader(object, csv.DictReader):
-        pass
-
-
-class SkipPreludeReader(DictReader):
+class SkipPreludeReader(csv.DictReader):
     """
     A specialized version of DictReader that attempts to find where the "real"
     CSV data is in a file that may contain a prelude of non-CSV text.
@@ -32,8 +18,9 @@ def __init__(self, f, fieldnames=None, restkey=None, restval=None,
         readeropts = [f, dialect]
         readeropts.extend(args)
         self._readeropts = (readeropts, kwds)
-        csv.DictReader.__init__(self, f, fieldnames, restkey, restval,
-                                dialect, *args, **kwds)
+        super().__init__(
+            f, fieldnames, restkey, restval, dialect, *args, **kwds
+        )
 
     @property
     def fieldnames(self):
16 changes: 9 additions & 7 deletions itertable/parsers/text.py
@@ -1,5 +1,6 @@
+import csv
 import json
-from .readers import csv, UNICODE_CSV, SkipPreludeReader
+from .readers import SkipPreludeReader
 from xml.etree import ElementTree as ET
 
 from .base import BaseParser, TableParser
@@ -10,7 +11,7 @@ class CsvParser(TableParser):
     delimiter = ","
     quotechar = '"'
     no_pickle_parser = ['csvdata']
-    binary = UNICODE_CSV
+    binary = False
 
     def parse(self):
         # Like DictReader, assume explicit field definition means CSV does not
@@ -48,11 +49,12 @@ class Reader(SkipPreludeReader):
     def dump(self, file=None):
         if file is None:
             file = self.file
-        args = file, self.get_field_names()
-        kwargs = {'encoding': 'utf-8'} if UNICODE_CSV else {}
-        kwargs['delimiter'] = self.delimiter
-        kwargs['quotechar'] = self.quotechar
-        csvout = csv.DictWriter(*args, **kwargs)
+        csvout = csv.DictWriter(
+            file,
+            self.get_field_names(),
+            delimiter=self.delimiter,
+            quotechar=self.quotechar
+        )
         csvout.writeheader()
         for row in self.data:
             csvout.writerow(row)
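
The rewritten `dump()` relies on Python 3's `csv` module writing to text streams directly, which is why the `unicodecsv` fallback and its `encoding` argument could be dropped. A minimal standalone sketch of the same `csv.DictWriter` call follows, using an in-memory `io.StringIO` in place of a file; the field names and the row are made up for illustration.

```python
import csv
import io

field_names = ["name", "date"]  # illustrative field names
rows = [{"name": "Zo\u00eb, Jr.", "date": "2021-11-16"}]

out = io.StringIO()
csvout = csv.DictWriter(out, field_names, delimiter=",", quotechar='"')
csvout.writeheader()
for row in rows:
    csvout.writerow(row)

# The non-ASCII name is written as-is, and the value containing the
# delimiter is wrapped in the quote character.
print(out.getvalue())
```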