Skip to content

Commit

Permalink
Merge pull request #218 from pyexcel/dev
Browse files Browse the repository at this point in the history
release 0.6.3
  • Loading branch information
chfw committed Jul 31, 2020
2 parents 12688ed + bd60888 commit f0e0613
Show file tree
Hide file tree
Showing 76 changed files with 811 additions and 315 deletions.
4 changes: 2 additions & 2 deletions .github/PULL_REQUEST_TEMPLATE.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,10 +3,10 @@ With your PR, here is a check list:
- [ ] Has Test cases written
- [ ] Has all code lines tested
- [ ] Has `make format` been run?
- [ ] Has `moban` been run?
- [ ] Please update CHANGELOG.yml
- [ ] Passes all Travis CI builds
- [ ] Has fair amount of documentation if your change is complex
- [ ] run 'make format' so as to confirm the pyexcel organisation's coding style
- [ ] Please update CHANGELOG.rst
- [ ] Please add yourself to CONTRIBUTORS.rst
- [ ] Agree on NEW BSD License for your contribution
- [ ] Has `moban` been run?
10 changes: 10 additions & 0 deletions .isort.cfg
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
[settings]
line_length=79
known_first_party=
known_third_party=mock,nose,pyexcel_io
indent=' '
multi_line_output=3
length_sort=1
default_section=FIRSTPARTY
no_lines_before=LOCALFOLDER
sections=FUTURE,STDLIB,FIRSTPARTY,THIRDPARTY,LOCALFOLDER
8 changes: 8 additions & 0 deletions .moban.d/CONTRIBUTORS.rst.jj2
Original file line number Diff line number Diff line change
@@ -0,0 +1,8 @@
Contributors
================================================================================

In alphabetical order:

{% for contributor in contributors|sort(attribute='name') %}
* `{{ contributor.name }} <{{ contributor.url }}>`_
{% endfor %}
4 changes: 4 additions & 0 deletions .moban.d/custom_setup.py.jj2
Original file line number Diff line number Diff line change
Expand Up @@ -13,3 +13,7 @@
{% block additional_classifiers %}
'Development Status :: 3 - Alpha',
{% endblock %}}

{%- block morefiles %}
"CONTRIBUTORS.rst",
{%- endblock %}
6 changes: 6 additions & 0 deletions .moban.d/docs/source/bigdata.rst.jj2
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
================================================================================
Read partial data
================================================================================


{% include "partial-data.rst.jj2" %}
4 changes: 2 additions & 2 deletions .moban.d/docs/source/pyexcel-index.rst.jj2
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,7 @@ Design
architecture

New tutorial
----------
--------------
.. toctree::

quickstart
Expand All @@ -276,7 +276,7 @@ New tutorial
database

Old tutorial
----------
--------------
.. toctree::

tutorial_file
Expand Down
12 changes: 3 additions & 9 deletions .moban.d/one-liners.rst.jj2
Original file line number Diff line number Diff line change
Expand Up @@ -65,11 +65,8 @@ And let's check what do we have:

.. code-block:: python

>>> for record in records:
... print("%s of %s has %s mg" % (
... record['Serving Size'],
... record['Coffees'],
... record['Caffeine (mg)']))
>>> for r in records:
... print(f"{r['Serving Size']} of {r['Coffees']} has {r['Caffeine (mg)']} mg")
venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
Expand All @@ -84,10 +81,7 @@ Instead, what if you have to use `pyexcel.get_array` to do the same:
.. code-block:: python

>>> for row in p.get_array(file_name="your_file.xls", start_row=1):
... print("%s of %s has %s mg" % (
... row[1],
... row[0],
... row[2]))
... print(f"{row[1]} of {row[0]} has {row[2]} mg")
venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
Expand Down
187 changes: 187 additions & 0 deletions .moban.d/partial-data.rst.jj2
Original file line number Diff line number Diff line change
@@ -0,0 +1,187 @@

When you are dealing with huge amount of data, e.g. 64GB, obviously you would not
like to fill up your memory with those data. What you may want to do is, record
data from Nth line, take M records and stop. And you only want to use your memory
for the M records, not for beginning part nor for the tail part.

Hence partial read feature is developed to read partial data into memory for
processing.

You can paginate by row, by column and by both, hence you dictate what portion of the
data to read back. But remember only row limit features help you save memory. Let's
you use this feature to record data from Nth column, take M number of columns and skip
the rest. You are not going to reduce your memory footprint.

Why did not I see above benefit?
--------------------------------------------------------------------------------

This feature depends heavily on the implementation details.

`pyexcel-xls`_ (xlrd), `pyexcel-xlsx`_ (openpyxl), `pyexcel-ods`_ (odfpy) and
`pyexcel-ods3`_ (pyexcel-ezodf) will read all data into memory. Because xls,
xlsx and ods file are effective a zipped folder, all four will unzip the folder
and read the content in xml format in **full**, so as to make sense of all details.

Hence, during the partial data is been returned, the memory consumption won't
differ from reading the whole data back. Only after the partial
data is returned, the memory comsumption curve shall jump the cliff. So pagination
code here only limits the data returned to your program.

With that said, `pyexcel-xlsxr`_, `pyexcel-odsr`_ and `pyexcel-htmlr`_ DOES read
partial data into memory. Those three are implemented in such a way that they
consume the xml(html) when needed. When they have read designated portion of the
data, they stop, even if they are half way through.

In addition, pyexcel's csv readers can read partial data into memory too.

{% if sphinx %}

.. testcode::
:hide:

>>> import sys
>>> if sys.version_info[0] < 3:
... from StringIO import StringIO
... else:
... from io import StringIO
>>> from pyexcel_io._compact import OrderedDict

{% endif %}

Let's assume the following file is a huge csv file:

.. code-block:: python

>>> import datetime
>>> import pyexcel as pe
>>> data = [
... [1, 21, 31],
... [2, 22, 32],
... [3, 23, 33],
... [4, 24, 34],
... [5, 25, 35],
... [6, 26, 36]
... ]
>>> pe.save_as(array=data, dest_file_name="your_file.csv")


And let's pretend to read partial data:


.. code-block:: python

>>> pe.get_sheet(file_name="your_file.csv", start_row=2, row_limit=3)
your_file.csv:
+---+----+----+
| 3 | 23 | 33 |
+---+----+----+
| 4 | 24 | 34 |
+---+----+----+
| 5 | 25 | 35 |
+---+----+----+

And you could as well do the same for columns:

.. code-block:: python

>>> pe.get_sheet(file_name="your_file.csv", start_column=1, column_limit=2)
your_file.csv:
+----+----+
| 21 | 31 |
+----+----+
| 22 | 32 |
+----+----+
| 23 | 33 |
+----+----+
| 24 | 34 |
+----+----+
| 25 | 35 |
+----+----+
| 26 | 36 |
+----+----+

Obvious, you could do both at the same time:

.. code-block:: python

>>> pe.get_sheet(file_name="your_file.csv",
... start_row=2, row_limit=3,
... start_column=1, column_limit=2)
your_file.csv:
+----+----+
| 23 | 33 |
+----+----+
| 24 | 34 |
+----+----+
| 25 | 35 |
+----+----+


The pagination support is available across all pyexcel plugins.

.. note::

No column pagination support for query sets as data source.


Formatting while transcoding a big data file
--------------------------------------------------------------------------------

If you are transcoding a big data set, conventional formatting method would not
help unless a on-demand free RAM is available. However, there is a way to minimize
the memory footprint of pyexcel while the formatting is performed.

Let's continue from previous example. Suppose we want to transcode "your_file.csv"
to "your_file.xls" but increase each element by 1.

What we can do is to define a row renderer function as the following:

.. code-block:: python

>>> def increment_by_one(row):
... for element in row:
... yield element + 1

Then pass it onto save_as function using row_renderer:

.. code-block:: python

>>> pe.isave_as(file_name="your_file.csv",
... row_renderer=increment_by_one,
... dest_file_name="your_file.xlsx")


.. note::

If the data content is from a generator, isave_as has to be used.

We can verify if it was done correctly:

.. code-block:: python

>>> pe.get_sheet(file_name="your_file.xlsx")
your_file.csv:
+---+----+----+
| 2 | 22 | 32 |
+---+----+----+
| 3 | 23 | 33 |
+---+----+----+
| 4 | 24 | 34 |
+---+----+----+
| 5 | 25 | 35 |
+---+----+----+
| 6 | 26 | 36 |
+---+----+----+
| 7 | 27 | 37 |
+---+----+----+

{% if sphinx %}

.. testcode::
:hide:

>>> import os
>>> os.unlink("your_file.csv")
>>> os.unlink("your_file.xlsx")

{% endif %}
13 changes: 10 additions & 3 deletions .moban.d/pyexcel-README.rst.jj2
Original file line number Diff line number Diff line change
Expand Up @@ -15,22 +15,29 @@ Feature Highlights
* SQLAlchemy table
* Django Model
* Python data structures: dictionary, records and array

2. One API to read and write data in various excel file formats.
3. For large data sets, data streaming are supported. A genenerator can be returned to you. Checkout iget_records, iget_array, isave_as and isave_book_as.

{% endblock %}

{%block usage%}

{%include "one-liners.rst.jj2"%}
{%include "one-liners.rst.jj2" %}

Hidden feature: partial read
===============================================

Most pyexcel users do not know, but other library users were requesting `the similar features <https://github.com/jazzband/tablib/issues/467>`_

{%include "two-liners.rst.jj2"%}
{%include "partial-data.rst.jj2" %}

{%include "two-liners.rst.jj2" %}

Available Plugins
=================

{% include "plugins-list.rst.jj2"%}
{% include "plugins-list.rst.jj2" %}


Acknowledgement
Expand Down
18 changes: 7 additions & 11 deletions .moban.d/two-liners.rst.jj2
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,8 @@ Stream APIs for big file : A set of two liners
This section shows you how to get data from your **BIG** excel files and how to
export data to excel files in **two lines** at most.

Please use dedicated readers to gain the extra memory savings.


Two liners for get data from big excel files
--------------------------------------------------------------------------------
Expand Down Expand Up @@ -66,17 +68,14 @@ And let's check what do we have:

.. code-block:: python

>>> for record in records:
... print("%s of %s has %s mg" % (
... record['Serving Size'],
... record['Coffees'],
... record['Caffeine (mg)']))
>>> for r in records:
... print(f"{r['Serving Size']} of {r['Coffees']} has {r['Caffeine (mg)']} mg")
venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
regular(16 oz.) of Panera Coffee Light Roast has 300 mg

Please do not forgot the second line:
Please do not forgot the second line to close the opened file handle:

.. code-block:: python

Expand All @@ -90,10 +89,7 @@ Instead, what if you have to use `pyexcel.get_array` to do the same:
.. code-block:: python

>>> for row in p.iget_array(file_name="your_file.xls", start_row=1):
... print("%s of %s has %s mg" % (
... row[1],
... row[0],
... row[2]))
... print(f"{row[1]} of {row[0]} has {row[2]} mg")
venti(20 oz) of Starbucks Coffee Blonde Roast has 475 mg
large(20 oz.) of Dunkin' Donuts Coffee with Turbo Shot has 398 mg
grande(16 oz.) of Starbucks Coffee Pike Place Roast has 310 mg
Expand Down Expand Up @@ -326,7 +322,7 @@ Again it is really simple. Let's verify what we have gotten:
| Smith | 4.2 | 12/11/14 |
+-------+--------+----------+

.. NOTE::
.. note::

Please note that csv(comma separate value) file is pure text file. Formula, charts, images and formatting in xls file will disappear no matter which transcoding tool you use. Hence, pyexcel is a quick alternative for this transcoding job.

Expand Down
4 changes: 3 additions & 1 deletion .moban.yml
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,7 @@ targets:
- "docs/source/conf.py": "docs/source/custom_conf.py.jj2"
- "docs/source/quickstart.rst": "docs/source/quickstart.rst.jj2"
- "docs/source/two-liners.rst": "docs/source/two-liners.rst.jj2"
- "docs/source/bigdata.rst": "docs/source/bigdata.rst.jj2"
- .travis.yml: pyexcel-travis.yml.jj2
- MANIFEST.in: CUSTOM_MANIFEST.in.jj2
- README.rst: pyexcel-README.rst.jj2
Expand All @@ -14,4 +15,5 @@ targets:
- "pyexcel/__version__.py": version.txt
- "docs/source/index.rst": "docs/source/pyexcel-index.rst.jj2"
- "tests/requirements.txt": "tests/custom_requirements.txt.jj2"
- "min_requirements.txt": "minimum_requirements.txt.jj2"
- "min_requirements.txt": "minimum_requirements.txt.jj2"
- CONTRIBUTORS.rst: CONTRIBUTORS.rst.jj2
Loading

0 comments on commit f0e0613

Please sign in to comment.