Merge pull request #218 from pyexcel/dev
release 0.6.3
Showing 76 changed files with 811 additions and 315 deletions.
@@ -0,0 +1,10 @@
[settings]
line_length=79
known_first_party=
known_third_party=mock,nose,pyexcel_io
indent='    '
multi_line_output=3
length_sort=1
default_section=FIRSTPARTY
no_lines_before=LOCALFOLDER
sections=FUTURE,STDLIB,FIRSTPARTY,THIRDPARTY,LOCALFOLDER
@@ -0,0 +1,8 @@
Contributors
================================================================================

In alphabetical order:

{% for contributor in contributors|sort(attribute='name') %}
* `{{ contributor.name }} <{{ contributor.url }}>`_
{% endfor %}
@@ -0,0 +1,6 @@
================================================================================
Read partial data
================================================================================

{% include "partial-data.rst.jj2" %}
@@ -0,0 +1,187 @@

When you are dealing with a huge amount of data, e.g. 64GB, you obviously do
not want to fill up your memory with all of it. What you may want instead is to
record data from the Nth line, take M records and stop, using your memory only
for those M records, not for the head nor the tail of the file.

Hence the partial read feature was developed, to read only part of the data
into memory for processing.

You can paginate by row, by column or by both, hence you dictate what portion
of the data to read back. But remember that only the row limit features help
you save memory. If you use this feature to record data from the Nth column,
take M columns and skip the rest, you are not going to reduce your memory
footprint.
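The reason column pagination alone cannot save memory can be sketched with the standard library (an illustration of the principle only, not pyexcel's actual implementation): each row still has to be read and parsed in full before any columns can be discarded.

```python
import csv
import io

# Illustration only, not pyexcel internals: every row is parsed in full
# before the unwanted columns are dropped, so slicing columns afterwards
# does not shrink the upstream memory footprint.
raw = "1,21,31\n2,22,32\n3,23,33\n"

def column_slice(line_source, start_column, column_limit):
    for row in csv.reader(line_source):
        # `row` is already a fully materialised list at this point
        yield row[start_column:start_column + column_limit]

sliced = list(column_slice(io.StringIO(raw), 1, 2))
print(sliced)  # [['21', '31'], ['22', '32'], ['23', '33']]
```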

Why did I not see the above benefit?
--------------------------------------------------------------------------------

This feature depends heavily on the implementation details of each plugin.

`pyexcel-xls`_ (xlrd), `pyexcel-xlsx`_ (openpyxl), `pyexcel-ods`_ (odfpy) and
`pyexcel-ods3`_ (pyexcel-ezodf) will read all data into memory. Because xls,
xlsx and ods files are effectively zipped folders, all four will unzip the
folder and read the xml content in **full**, so as to make sense of all the
details.

Hence, while the partial data is being returned, the memory consumption will
not differ from reading the whole data back. Only after the partial data has
been returned will the memory consumption curve drop off a cliff. So the
pagination code here only limits the data returned to your program.

With that said, `pyexcel-xlsxr`_, `pyexcel-odsr`_ and `pyexcel-htmlr`_ DO read
partial data into memory. Those three are implemented in such a way that they
consume the xml (or html) only as needed: once they have read the designated
portion of the data, they stop, even if they are half way through the file.

In addition, pyexcel's csv readers can read partial data into memory too.
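The streaming idea behind a partial csv read can be sketched with the standard library alone (a simplified model, not pyexcel's actual reader): `itertools.islice` stops pulling rows from the csv reader once the limit is reached, so rows outside the requested window are never accumulated.

```python
import csv
import io
from itertools import islice

# Simplified model of a partial csv read: rows before start_row are read
# and discarded one at a time, and reading stops as soon as row_limit
# rows have been yielded, so at most one row is held at any moment.
raw = "\n".join("%d,%d,%d" % (i, i + 20, i + 30) for i in range(1, 7))

def partial_rows(line_source, start_row, row_limit):
    reader = csv.reader(line_source)
    return islice(reader, start_row, start_row + row_limit)

rows = [[int(cell) for cell in row]
        for row in partial_rows(io.StringIO(raw), 2, 3)]
print(rows)  # [[3, 23, 33], [4, 24, 34], [5, 25, 35]]
```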

{% if sphinx %}

.. testcode::
   :hide:

   >>> import sys
   >>> if sys.version_info[0] < 3:
   ...     from StringIO import StringIO
   ... else:
   ...     from io import StringIO
   >>> from pyexcel_io._compact import OrderedDict

{% endif %}

Let's assume the following file is a huge csv file:

.. code-block:: python

   >>> import datetime
   >>> import pyexcel as pe
   >>> data = [
   ...     [1, 21, 31],
   ...     [2, 22, 32],
   ...     [3, 23, 33],
   ...     [4, 24, 34],
   ...     [5, 25, 35],
   ...     [6, 26, 36]
   ... ]
   >>> pe.save_as(array=data, dest_file_name="your_file.csv")

And let's pretend to read partial data:

.. code-block:: python

   >>> pe.get_sheet(file_name="your_file.csv", start_row=2, row_limit=3)
   your_file.csv:
   +---+----+----+
   | 3 | 23 | 33 |
   +---+----+----+
   | 4 | 24 | 34 |
   +---+----+----+
   | 5 | 25 | 35 |
   +---+----+----+

You can do the same for columns as well:

.. code-block:: python

   >>> pe.get_sheet(file_name="your_file.csv", start_column=1, column_limit=2)
   your_file.csv:
   +----+----+
   | 21 | 31 |
   +----+----+
   | 22 | 32 |
   +----+----+
   | 23 | 33 |
   +----+----+
   | 24 | 34 |
   +----+----+
   | 25 | 35 |
   +----+----+
   | 26 | 36 |
   +----+----+

Obviously, you can do both at the same time:

.. code-block:: python

   >>> pe.get_sheet(file_name="your_file.csv",
   ...              start_row=2, row_limit=3,
   ...              start_column=1, column_limit=2)
   your_file.csv:
   +----+----+
   | 23 | 33 |
   +----+----+
   | 24 | 34 |
   +----+----+
   | 25 | 35 |
   +----+----+
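In terms of plain Python lists, the combined pagination above has simple slice semantics (the helper below is illustrative, not part of the pyexcel API):

```python
# Slice semantics of combined row and column pagination; `paginate` is an
# illustrative helper, not a pyexcel function.
data = [
    [1, 21, 31],
    [2, 22, 32],
    [3, 23, 33],
    [4, 24, 34],
    [5, 25, 35],
    [6, 26, 36],
]

def paginate(rows, start_row, row_limit, start_column, column_limit):
    window = rows[start_row:start_row + row_limit]
    return [row[start_column:start_column + column_limit] for row in window]

page = paginate(data, 2, 3, 1, 2)
print(page)  # [[23, 33], [24, 34], [25, 35]]
```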

The pagination support is available across all pyexcel plugins.

.. note::

   No column pagination support for query sets as a data source.

Formatting while transcoding a big data file
--------------------------------------------------------------------------------

If you are transcoding a big data set, the conventional formatting methods will
not help unless plenty of free RAM is available on demand. However, there is a
way to minimize the memory footprint of pyexcel while the formatting is
performed.

Let's continue from the previous example. Suppose we want to transcode
"your_file.csv" to "your_file.xlsx", but increase each element by 1.

What we can do is define a row renderer function, as follows:

.. code-block:: python

   >>> def increment_by_one(row):
   ...     for element in row:
   ...         yield element + 1

Then pass it on to the isave_as function via the row_renderer parameter:

.. code-block:: python

   >>> pe.isave_as(file_name="your_file.csv",
   ...             row_renderer=increment_by_one,
   ...             dest_file_name="your_file.xlsx")
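Conceptually, isave_as with a row_renderer behaves like the stdlib sketch below: rows are read, transformed and written one at a time, so memory stays bounded by a single row. The `transcode` helper and in-memory file objects here are illustrative assumptions; pyexcel handles the actual file formats.

```python
import csv
import io

# Stdlib sketch of a streaming transcode with a row renderer: each row is
# read, transformed and written before the next one is touched.
def increment_by_one(row):
    for element in row:
        yield element + 1

def transcode(src, dst, row_renderer):
    # `transcode` is an illustrative helper, not a pyexcel function
    writer = csv.writer(dst, lineterminator="\n")
    for row in csv.reader(src):
        writer.writerow(list(row_renderer(int(cell) for cell in row)))

src = io.StringIO("1,21,31\n2,22,32\n")
dst = io.StringIO()
transcode(src, dst, increment_by_one)
print(dst.getvalue())  # 2,22,32 / 3,23,33
```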

.. note::

   If the data content is from a generator, isave_as has to be used.

We can verify that it was done correctly:

.. code-block:: python

   >>> pe.get_sheet(file_name="your_file.xlsx")
   your_file.csv:
   +---+----+----+
   | 2 | 22 | 32 |
   +---+----+----+
   | 3 | 23 | 33 |
   +---+----+----+
   | 4 | 24 | 34 |
   +---+----+----+
   | 5 | 25 | 35 |
   +---+----+----+
   | 6 | 26 | 36 |
   +---+----+----+
   | 7 | 27 | 37 |
   +---+----+----+

{% if sphinx %}

.. testcode::
   :hide:

   >>> import os
   >>> os.unlink("your_file.csv")
   >>> os.unlink("your_file.xlsx")

{% endif %}