rewrite ExcelParser to use openpyxl

- rename old xlrd wrapper to OldExcelParser
- remove remaining Python 2 code
wq · Nov 16, 2021 · 475eb6c
1 parent 556ade5

Showing 28 changed files with 157 additions and 129 deletions.
diff --git a/README.md b/README.md
@@ -3,7 +3,7 @@
 ```python
 from itertable import load_file
 
-for row in load_file("example.xls"):
+for row in load_file("example.xlsx"):
     print(row.date, row.name)
 ```
 
@@ -39,21 +39,21 @@ python3 -m pip install itertable
 # GIS support (Fiona & Shapely)
 python3 -m pip install itertable[gis]
 
-# Excel write support
-python3 -m pip install itertable[write]
-# (xls/xlsx read support is enabled by default)
+# Excel 97-2003 (.xls) support
+python3 -m pip install itertable[oldexcel]
+# (xlsx support is enabled by default)
 
 # Pandas integration
 python3 -m pip install itertable[pandas]
 ```
 
 ## Overview
 
-IterTable provides a general purpose API for loading, iterating over, and writing tabular datasets.  The goal is to avoid needing to remember the unique usage of e.g. [csv], [xlrd], or [xml.etree] every time one needs to work with external data.  Instead, IterTable abstracts these libraries into a consistent interface that works as an [iterable] of [namedtuples].  Whenever possible, the field names for a dataset are automatically determined from the source file, e.g. the column headers in an Excel spreadsheet.
+IterTable provides a general purpose API for loading, iterating over, and writing tabular datasets.  The goal is to avoid needing to remember the unique usage of e.g. [csv], [openpyxl], or [xml.etree] every time one needs to work with external data.  Instead, IterTable abstracts these libraries into a consistent interface that works as an [iterable] of [namedtuples].  Whenever possible, the field names for a dataset are automatically determined from the source file, e.g. the column headers in an Excel spreadsheet.
 
 ```python
 from itertable import ExcelFileIter
-data = ExcelFileIter(filename='example.xls')
+data = ExcelFileIter(filename='example.xlsx')
 for row in data:
     print(row.name, row.date)
 ```
@@ -67,7 +67,7 @@ for row in data:
     print(row.name, row.date)
 ```
 
-All of the included `*FileIter` classes support both reading and writing to external files, though write support for Excel files requires `itertable[write]` (which installs `xlwt` and `xlswriter`).
+All of the included `*FileIter` classes support both reading and writing to external files.
 
 ### Network Client
 
@@ -126,13 +126,11 @@ It is straightforward to [extend IterTable][custom] to support arbitrary formats
 
 [wq framework]: https://wq.io/
 [csv]: https://docs.python.org/3/library/csv.html
-[xlrd]: http://www.python-excel.org/
+[openpyxl]: https://openpyxl.readthedocs.io/en/stable/
 [xml.etree]: https://docs.python.org/3/library/xml.etree.elementtree.html
 [iterable]: https://docs.python.org/3/glossary.html#term-iterable
 [namedtuples]: https://docs.python.org/3/library/collections.html#collections.namedtuple
 [requests]: http://python-requests.org/
-[xlwt]: http://www.python-excel.org/
-[xlsxwriter]: https://xlsxwriter.readthedocs.org/
 [Pandas]: http://pandas.pydata.org/
 [DataFrame]: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html
 [Fiona]: https://github.com/Toblerity/Fiona

diff --git a/docs/mappers.md b/docs/mappers.md
@@ -10,7 +10,7 @@ Mappers
 from itertable import ExcelFileIter
 
 # Loader and Parser do their work here
-instance = ExcelFileIter(filename='example.xls')
+instance = ExcelFileIter(filename='example.xlsx')
 
 # Mapper does its work here
 for row in instance:

diff --git a/docs/parsers.md b/docs/parsers.md
@@ -94,7 +94,9 @@ quotechar | Quotation character for text values containing spaces or delimiters,
 reader_class() | Method returning an uninstantiated `DictReader` class for use in parsing the data.  The default method returns a subclass of `SkipPreludeReader` that passes along the `max_header_row` option.
 
 #### [ExcelParser (`WorkbookParser`)[itertable.parsers.xls]
-`ExcelParser` provides support for both "old" (.xls) and "new" (.xlsx) files via the [xlrd] module.  Write support can be enabled by installing the [xlwt] and/or [xlsxwrite] modules.  `ExcelParser` extends a somewhat more generic `WorkbookParser`, with the idea that the latter could eventually be extended to other "workbook" style formats like ODS.
+`ExcelParser` provides support for `.xlsx` files via the [openpyxl] module, while `OldExcelParser` supports `.xls` if [xlrd] and [xlwt] are installed.  `ExcelParser` and `OldExcelParser` extend a somewhat more generic `WorkbookParser`, with the idea that the latter could eventually be extended to other "workbook" style formats like ODS.
+
+> Note: In previous versions of itertable, `ExcelParser` relied on [xlrd] to support both `.xlsx` and `.xls` formats.  Now that xlrd has dropped `.xlsx` support, `ExcelParser` has been rewritten to use [openpyxl], which only supports `.xlsx` files.  The old `xlrd`-based `ExcelParser` class has been renamed to `OldExcelParser`.  (In most cases you can just use `itertable.load_file()` which automatically determines whether to use `ExcelParser` or `OldExcelParser`).
 
 ##### Properties
 name | purpose
@@ -110,7 +112,7 @@ name | purpose
 `parse_row(row)` | Convert the given row object into a dict, usually by mapping the column header to the value in each cell
 `get_value(cell)` | Retrieve the actual value from the cell.
 
-The methods listed above are called in turn by `parse()`, which is defined by `WorkbookParser`.  Working implementations of the methods are defined in `ExcelParser`.
+The methods listed above are called in turn by `parse()`, which is defined by `WorkbookParser`.  Working implementations of the methods are defined in `ExcelParser` and `OldExcelParser`.
 
 [itertable.parsers]: https://github.com/wq/itertable/blob/master/itertable/parsers/
 [itertable.parsers.base]: https://github.com/wq/itertable/blob/master/itertable/parsers/base.py
@@ -133,4 +135,4 @@ The methods listed above are called in turn by `parse()`, which is defined by `W
 [Django Data Wizard]: https://github.com/wq/django-data-wizard
 [DictReader]: https://docs.python.org/3/library/csv.html#csv.DictReader
 [xlwt]: http://www.python-excel.org/
-[xlsxwrite]: https://xlsxwriter.readthedocs.org/
+[openpyxl]: https://openpyxl.readthedocs.io/en/stable/
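The note added to `docs/parsers.md` says `itertable.load_file()` decides between `ExcelParser` and `OldExcelParser` automatically. This commit's hunks do not show how that decision is made, so the following is only a sketch of one plausible approach: dispatching on the file extension. `pick_excel_parser` and `PARSER_BY_EXTENSION` are hypothetical names used for illustration, not part of itertable's API.

```python
import os

# Hypothetical mapping for illustration; the real lookup inside
# itertable.load_file() is not shown in this commit.
PARSER_BY_EXTENSION = {
    ".xlsx": "ExcelParser",     # openpyxl-based parser
    ".xls": "OldExcelParser",   # xlrd-based parser (itertable[oldexcel])
}


def pick_excel_parser(filename):
    """Return the parser class name for an Excel filename, by extension."""
    ext = os.path.splitext(filename)[1].lower()
    try:
        return PARSER_BY_EXTENSION[ext]
    except KeyError:
        raise ValueError(f"not an Excel file: {filename!r}")
```

Under this assumption, `pick_excel_parser("example.xlsx")` selects the openpyxl-based parser, while `pick_excel_parser("legacy.XLS")` falls back to the xlrd-based one.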
diff --git a/itertable/__init__.py b/itertable/__init__.py
@@ -14,6 +14,7 @@
     CsvParser,
     JsonParser,
     XmlParser,
+    OldExcelParser,
     ExcelParser,
 )
 
@@ -52,6 +53,7 @@
     'JsonParser',
     'XmlParser',
     'ExcelParser',
+    'OldExcelParser',
 
     'BaseMapper',
     'DictMapper',
@@ -80,6 +82,7 @@
     'XmlNetIter',
     'XmlStringIter',
 
+    'OldExcelFileIter',
     'ExcelFileIter',
 )
 
@@ -96,7 +99,9 @@
 XmlNetIter = make_iter(NetLoader, XmlParser)
 XmlStringIter = make_iter(StringLoader, XmlParser)
 
+OldExcelFileIter = make_iter(FileLoader, OldExcelParser)
 ExcelFileIter = make_iter(FileLoader, ExcelParser)
+OldExcelNetIter = make_iter(NetLoader, OldExcelParser)
 ExcelNetIter = make_iter(NetLoader, ExcelParser)
 
 try:

diff --git a/itertable/loaders.py b/itertable/loaders.py
@@ -1,12 +1,6 @@
 from __future__ import print_function
 import requests
-try:
-    # Python 2 (uses str)
-    from StringIO import StringIO
-except ImportError:
-    # Python 3 (Python 2 equivalent uses unicode)
-    from io import StringIO
-from io import BytesIO
+from io import StringIO, BytesIO
 from .version import VERSION
 from .exceptions import LoadFailed
 from zipfile import ZipFile

diff --git a/itertable/mappers.py b/itertable/mappers.py
@@ -88,13 +88,9 @@ def field_map(self):
     def tuple_field_name(self, field):
         field = self.clean_field_name(field)
         field = re.sub(r'\W', '', field.lower())
-        try:
-            # normalize identifiers for consistency with namedtuple
-            # http://bugs.python.org/issue23091
-            field = normalize('NFKC', field)
-        except TypeError:
-            # normalize doesn't work on Python 2 str() instances
-            pass
+        # normalize identifiers for consistency with namedtuple
+        # http://bugs.python.org/issue23091
+        field = normalize('NFKC', field)
         return field
 
     def clean_field_name(self, field):
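With the Python 2 fallback gone, `normalize('NFKC', field)` now runs unconditionally. Python applies NFKC to identifiers in source code, so field names need the same treatment to be reachable as attributes. A minimal standalone sketch of the two steps visible in the hunk, assuming `normalize` is `unicodedata.normalize` and leaving out `clean_field_name()`, whose body is not shown:

```python
import re
from unicodedata import normalize


def tuple_field_name(field):
    # Same two steps as the method in the hunk above, minus
    # clean_field_name(), which is not shown in this diff.
    field = re.sub(r'\W', '', field.lower())
    # NFKC folds compatibility characters (e.g. ligatures) to plain letters.
    return normalize('NFKC', field)


# "\ufb01" is the single-character "fi" ligature.
print(tuple_field_name("\ufb01eld Name"))  # prints "fieldname"
```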

diff --git a/itertable/parsers/__init__.py b/itertable/parsers/__init__.py
@@ -1,10 +1,11 @@
 from .text import CsvParser, JsonParser, XmlParser
-from .xls import ExcelParser
+from .xls import OldExcelParser, ExcelParser
 
 
 __all__ = (
     'CsvParser',
     'JsonParser',
     'XmlParser',
+    'OldExcelParser',
     'ExcelParser',
 )
diff --git a/itertable/parsers/readers.py b/itertable/parsers/readers.py
@@ -1,21 +1,7 @@
-try:
-    import unicodecsv as csv
-    UNICODE_CSV = True
-except ImportError:
-    import csv
-    UNICODE_CSV = False
+import csv
 
 
-if issubclass(csv.DictReader, object):
-    # Python 3
-    DictReader = csv.DictReader
-else:
-    # Python 2
-    class DictReader(object, csv.DictReader):
-        pass
-
-
-class SkipPreludeReader(DictReader):
+class SkipPreludeReader(csv.DictReader):
     """
     A specialized version of DictReader that attempts to find where the "real"
     CSV data is in a file that may contain a prelude of non-CSV text.
@@ -32,8 +18,9 @@ def __init__(self, f, fieldnames=None, restkey=None, restval=None,
         readeropts = [f, dialect]
         readeropts.extend(args)
         self._readeropts = (readeropts, kwds)
-        csv.DictReader.__init__(self, f, fieldnames, restkey, restval,
-                                dialect, *args, **kwds)
+        super().__init__(
+            f, fieldnames, restkey, restval, dialect, *args, **kwds
+        )
 
     @property
     def fieldnames(self):
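On Python 3, `csv.DictReader` is already a new-style class, which is why the `DictReader` shim can be dropped and `super().__init__()` works directly. A small sketch of the same subclassing pattern, using a hypothetical `UpperKeyReader` class that is not part of itertable:

```python
import csv
from io import StringIO


class UpperKeyReader(csv.DictReader):
    """Hypothetical DictReader subclass, for illustration only."""

    def __init__(self, f, fieldnames=None, *args, **kwds):
        # No explicit csv.DictReader.__init__(self, ...) call is needed.
        super().__init__(f, fieldnames, *args, **kwds)

    def __next__(self):
        row = super().__next__()
        return {key.upper(): value for key, value in row.items()}


rows = list(UpperKeyReader(StringIO("name,date\nalpha,2021-11-16\n")))
print(rows)  # prints [{'NAME': 'alpha', 'DATE': '2021-11-16'}]
```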

diff --git a/itertable/parsers/text.py b/itertable/parsers/text.py
@@ -1,5 +1,6 @@
+import csv
 import json
-from .readers import csv, UNICODE_CSV, SkipPreludeReader
+from .readers import SkipPreludeReader
 from xml.etree import ElementTree as ET
 
 from .base import BaseParser, TableParser
@@ -10,7 +11,7 @@ class CsvParser(TableParser):
     delimiter = ","
     quotechar = '"'
     no_pickle_parser = ['csvdata']
-    binary = UNICODE_CSV
+    binary = False
 
     def parse(self):
         # Like DictReader, assume explicit field definition means CSV does not
@@ -48,11 +49,12 @@ class Reader(SkipPreludeReader):
     def dump(self, file=None):
         if file is None:
             file = self.file
-        args = file, self.get_field_names()
-        kwargs = {'encoding': 'utf-8'} if UNICODE_CSV else {}
-        kwargs['delimiter'] = self.delimiter
-        kwargs['quotechar'] = self.quotechar
-        csvout = csv.DictWriter(*args, **kwargs)
+        csvout = csv.DictWriter(
+            file,
+            self.get_field_names(),
+            delimiter=self.delimiter,
+            quotechar=self.quotechar
+        )
         csvout.writeheader()
         for row in self.data:
             csvout.writerow(row)
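With `unicodecsv` removed, `dump()` relies on the standard library `csv.DictWriter`, which writes to a text stream and takes no `encoding` argument. A standalone sketch of the same call shape, where `dump_rows` is a hypothetical helper rather than the `CsvParser.dump()` method itself:

```python
import csv
from io import StringIO


def dump_rows(rows, field_names, delimiter=",", quotechar='"'):
    """Write dict rows to CSV text, mirroring the DictWriter call in the diff."""
    out = StringIO()
    csvout = csv.DictWriter(
        out,
        field_names,
        delimiter=delimiter,
        quotechar=quotechar,
    )
    csvout.writeheader()
    for row in rows:
        csvout.writerow(row)
    return out.getvalue()


text = dump_rows([{"name": "a,b", "date": "2021-11-16"}], ["name", "date"])
print(repr(text))  # prints 'name,date\r\n"a,b",2021-11-16\r\n'
```

When writing to a real file instead of `StringIO`, the `csv` documentation recommends opening it with `newline=''`, and any encoding is chosen in `open()` rather than passed to the writer.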