-
Notifications
You must be signed in to change notification settings - Fork 501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Excelx Memory and Performance Optimization #434
Excelx Memory and Performance Optimization #434
Conversation
@@ -40,19 +40,24 @@ def type | |||
end | |||
|
|||
def self.create_cell(type, *values) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've avoided using create_cell as calling this method will create intermediate array object for values. Generally, I'll be against doing this but this method is invoked once per cell which can result in noticeable overhead.
require 'forwardable' | ||
require 'roo/excelx/extractor' | ||
|
||
module Roo | ||
class Excelx | ||
class SheetDoc < Excelx::Extractor | ||
extend Forwardable | ||
delegate [:styles, :workbook, :shared_strings, :base_date] => :@shared |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again avoided using delegated method(which are called almost once per cell) because it will result in creating a blank array for *args. I'm open to creating helper method like delegate_without_argument if that is more acceptable.
lib/roo/excelx/sheet_doc.rb
Outdated
return Excelx::Cell.create_cell(:string, content_arr.join(''), formula, style, hyperlink, coordinate) | ||
when 'is'.freeze | ||
content = "" | ||
cell.children.each do |inline_str| |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Iterating over cell.children rather than using cell.search resulted in a major improvement.
@tgturner Some changes in this PR are very low-level optimization but since they are in the critical path they do make difference. If you have a better approach for those optimizations let me know. Also don't just merge yet I'll like to test Excelx::Cell::DateTime#create_datetime a bit more to ensure that it does not break on some edge case |
I'll give this a look over today! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of this I would definitely need to look at further to fully understand the need, but I trust your judgement. This is really good work however.
My one takeaway though is that adding in some constants could really improve readability. No one likes magic strings!
lib/roo/excelx/cell/datetime.rb
Outdated
t = round_datetime(datetime_string) | ||
def create_datetime(base_timestamp, value) | ||
timestamp = (base_timestamp + (value.to_f.round(6) * SECONDS_IN_DAY)).round(0) | ||
t = ::Time.at(timestamp).utc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just call Time.at(timestamp).utc.to_datetime
here?
lib/roo/excelx/shared_strings.rb
Outdated
@@ -49,11 +49,11 @@ def extract_shared_strings | |||
shared_string = '' | |||
si.children.each do |elem| | |||
case elem.name | |||
when 'r' | |||
when 'r'.freeze |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we use a constant for these instead of freezing them in place? This comment applies to all of the strings frozen in match statements.
lib/roo/excelx/sheet_doc.rb
Outdated
@@ -41,7 +41,7 @@ def each_cell(row_xml) | |||
row_xml.children.each do |cell_element| | |||
# If you're sure you're not going to need this hyperlinks you can discard it | |||
hyperlinks = unless @options[:no_hyperlinks] | |||
key = ::Roo::Utils.ref_to_key(cell_element['r']) | |||
key = ::Roo::Utils.ref_to_key(cell_element['r'.freeze]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant? Applies to all the other use of these single letter strings.
lib/roo/excelx/sheet_doc.rb
Outdated
end | ||
when 'f' | ||
unless content.empty? | ||
return Excelx::Cell.cell_class(:string).new(content, formula, style, hyperlink, coordinate) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why use cell_class
here instead of Excelx::Cell::String
?
lib/roo/excelx/sheet_doc.rb
Outdated
when 'v' | ||
return create_cell_from_value(value_type, cell, formula, format, style, hyperlink, base_date, coordinate) | ||
when 'v'.freeze | ||
format = style_format(style) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we cache format
and value_type
for reuse on subsequent iterations? Considering they operate on the parent cell_xml
and style
there appears to be reuse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No we don't need to cache format and value_type as we use return statement in next line
Use Time#to_datetime over Datetime.civil
cedfb54
to
699ff68
Compare
@tgturner I've rebased with master as most recent commit had a conflict |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small style fixes, otherwise these look great. Excited to get this stuff into Master. Are you still looking into adding more test cases for this?
lib/roo/utils.rb
Outdated
y = number | ||
[y, x] | ||
def extract_coord(s) | ||
num =letter_num = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing space after =
|
||
::Time.new(yyyy, mm, dd, hh, mi, ss.to_r).round(0) | ||
def create_datetime(base_timestamp, value) | ||
timestamp = (base_timestamp + (value.to_f.round(6) * SECONDS_IN_DAY)).round(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've compared the o/p of new and old implementation.
I/P value edge case (if your timezone is IST) :
2193.0..2193.00613 # 8min 20sec long
15250.0..15250.04166 # 1 hour long
15585.0..15585.04166 # 1 hour long
The above range of values are overlap of timezone's (IST) offset changes.
At the moment I'm not sure if the new implementation is right or older one. I'll have to test a bit more to verify that. If it is the older implementation which is wrong we can go ahead with this PR otherwise need to find some workaround
@tgturner I've added relevant test case for timezone offset change overlap bug. On master this test case will fail for ENV['TZ'] = "Asia/Calcutta". New implementation of Excelx::Cell::DateTime#create_datetime handles this edge case properly. I think we can go ahead with this now. |
LGTM! |
1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate 2. Roo::Excelx::Sheet#present_cells was only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash. Memory Profile Script ``` require 'roo' require 'memory_profiler' MemoryProfiler.report do excel = Roo::Excelx.new('test/files/Bibelbund.xlsx') (1..excel.last_row).each{|a| excel.row a} end.pretty_print ``` Master ``` Total allocated: 42820054 bytes (545014 objects) Total retained: 12833905 bytes (185969 objects) allocated memory by gem ----------------------------------- 24082522 roo/lib 12173047 nokogiri-1.8.4 6563085 rubyzip-1.2.2 1184 tmpdir 216 other retained memory by gem ----------------------------------- 9059622 roo/lib 3773019 nokogiri-1.8.4 792 rubyzip-1.2.2 296 tmpdir 176 other ``` Modified ``` Total allocated: 38800821 bytes (517025 objects) Total retained: 9879455 bytes (157986 objects) allocated memory by gem ----------------------------------- 20951802 roo/lib 11053807 nokogiri-1.8.4 6793812 rubyzip-1.2.2 1184 tmpdir 216 other retained memory by gem ----------------------------------- 7224422 roo/lib 2653779 nokogiri-1.8.4 782 rubyzip-1.2.2 296 tmpdir 176 other ``` This PR also improves preformance by 5-12%
1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate 2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash. Memory Profile Script ``` require 'roo' require 'memory_profiler' MemoryProfiler.report do excel = Roo::Excelx.new('test/files/Bibelbund.xlsx') (1..excel.last_row).each{|a| excel.row a} end.pretty_print ``` Master ``` Total allocated: 42820054 bytes (545014 objects) Total retained: 12833905 bytes (185969 objects) allocated memory by gem ----------------------------------- 24082522 roo/lib 12173047 nokogiri-1.8.4 6563085 rubyzip-1.2.2 1184 tmpdir 216 other retained memory by gem ----------------------------------- 9059622 roo/lib 3773019 nokogiri-1.8.4 792 rubyzip-1.2.2 296 tmpdir 176 other ``` Modified ``` Total allocated: 38800821 bytes (517025 objects) Total retained: 9879455 bytes (157986 objects) allocated memory by gem ----------------------------------- 20951802 roo/lib 11053807 nokogiri-1.8.4 6793812 rubyzip-1.2.2 1184 tmpdir 216 other retained memory by gem ----------------------------------- 7224422 roo/lib 2653779 nokogiri-1.8.4 782 rubyzip-1.2.2 296 tmpdir 176 other ``` This PR also improves performance by 5-12%
* Remove unnecessary cache hash 1. After #434 we no longer need to cache Roo::Utils.extract_coordinate 2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.
pkgsrc change: add "USE_LANGUAGES= # none". ## [2.8.0] 2019-01-18 ### Fixed - Fixed inconsistent column length for CSV [375](roo-rb/roo#375) - Fixed formatted_value with `%` for Excelx [416](roo-rb/roo#416) - Improved Memory consumption and performance [434](roo-rb/roo#434) [449](roo-rb/roo#449) [454](roo-rb/roo#454) [456](roo-rb/roo#456) [458](roo-rb/roo#458) [462](roo-rb/roo#462) [466](roo-rb/roo#466) - Accept both Transitional and Strict Type for Excelx's worksheets [441](roo-rb/roo#441) - Fixed ruby warnings [442](roo-rb/roo#442) [476](roo-rb/roo#476) - Restore support for URL as file identifier for CSV [462](roo-rb/roo#462) - Fixed missing location for Excelx's links [482](roo-rb/roo#482) ### Changed / Added - Drop support for ruby 2.2.x and lower - Updated rubyzip version for fixing security issue. Now minimal version is 1.2.1 - Roo::Excelx::Coordinate now inherits Array [458](roo-rb/roo#458) - Improved Roo::HeaderRowNotFoundError exception's message [461](roo-rb/roo#461) - Added `empty_cell` option which by default disable allocation for Roo::Excelx::Cell::Empty [464](roo-rb/roo#464) - Added support for variable number of decimals for Excelx's formatted_value [387](roo-rb/roo#387) - Added `disable_html_injection` option to disable html injection for shared string in `Roo::Excelx` [392](roo-rb/roo#392) - Added image extraction for Excelx [414](roo-rb/roo#414) [397](roo-rb/roo#397) - Added support for `1e6` as scientific notation for Excelx [433](roo-rb/roo#433) - Added support for Integer as 0 based index for Excelx's `sheet_for` [455](roo-rb/roo#455) - Extended `no_hyperlinks` option for non streaming Excelx methods [459](roo-rb/roo#459) - Added `empty_cell` option to disable Roo::Excelx::Cell::Empty allocation for Excelx [464](roo-rb/roo#464) - Added support for Integer with leading zero for Roo:Excelx [479](roo-rb/roo#479) - Refactored Excelx code [453](roo-rb/roo#453) [477](roo-rb/roo#477) [483](roo-rb/roo#483) [484](roo-rb/roo#484) ### Deprecations - Roo::Excelx::Sheet#present_cells is deprecated [454](roo-rb/roo#454) - Roo::Utils.split_coordinate is deprecated [458](roo-rb/roo#458) - Roo::Excelx::Cell::Base#link is deprecated [457](roo-rb/roo#457)
* Excelx Memory and Performance Optimization * test case fixup * Use magic string literal comment over String#freeze Use Time#to_datetime over Datetime.civil * Cosmetic fixup * Added Test Case for Datetime timezone offset change overlap
* Remove unnecessary cache hash 1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate 2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.
Although not much difference for retained memory, this PR focuses on reducing allocated object/memory. Some of the changes are really a low-level optimization but they helped in reducing memory usage.
The script used for benchmarking
Results for master:
Results after changes:
This PR improves both memory usage and CPU performance by around 60%.
But this result will majorly vary based on xlsx file used for benchmarking (Thus my result may be biased).
I'll suggest creating a common big xlsx file which contains all supported data types and re-benchmark on that file.