Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excelx Memory and Performance Optimization #434

Merged

Conversation

chopraanmol1
Copy link
Member

@chopraanmol1 chopraanmol1 commented Jul 2, 2018

Although not much difference for retained memory, this PR focuses on reducing allocated object/memory. Some of the changes are really a low-level optimization but they helped in reducing memory usage.

The script used for benchmarking

MemoryProfiler.report{Roo::Excelx.new('Some File Name.xlsx').tap{|x|(2..x.last_row).each{|i| x.row(i)}}}.pretty_print

Benchmark.measure{ Roo::Excelx.new('Some File Name.xlsx').tap{|x|(2..x.last_row).each{|i| x.row(i)}} }

Results for master:

Total allocated: 46177542 bytes (712001 objects)
Total retained:  5272998 bytes (59447 objects)

allocated memory by gem
-----------------------------------
  25878133  roo/lib
  15386849  nokogiri-1.8.3
   4909416  rubyzip-1.2.1
      1488  racc
      1400  tmpdir
       256  other

  0.810000   0.020000   0.830000 (  0.831865)

Results after changes:

Total allocated: 18303433 bytes (201238 objects)
Total retained:  4705998 bytes (59431 objects)

allocated memory by gem
-----------------------------------
   9875103  roo/lib
   4909416  rubyzip-1.2.1
   3516378  nokogiri-1.8.3
      1400  tmpdir
       880  racc
       256  other

  0.310000   0.010000   0.320000 (  0.309184)

This PR improves both memory usage and CPU performance by around 60%.
But this result will majorly vary based on xlsx file used for benchmarking (Thus my result may be biased).

I'll suggest creating a common big xlsx file which contains all supported data types and re-benchmark on that file.

@@ -40,19 +40,24 @@ def type
end

def self.create_cell(type, *values)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've avoided using create_cell as calling this method will create intermediate array object for values. Generally, I'll be against doing this but this method is invoked once per cell which can result in noticeable overhead.

@coveralls
Copy link

coveralls commented Jul 2, 2018

Coverage Status

Coverage increased (+0.3%) to 94.001% when pulling ead58f3 on chopraanmol1:excelx_memory_performace_optimization into 6f703f0 on roo-rb:master.

require 'forwardable'
require 'roo/excelx/extractor'

module Roo
class Excelx
class SheetDoc < Excelx::Extractor
extend Forwardable
delegate [:styles, :workbook, :shared_strings, :base_date] => :@shared
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again avoided using delegated method(which are called almost once per cell) because it will result in creating a blank array for *args. I'm open to creating helper method like delegate_without_argument if that is more acceptable.

return Excelx::Cell.create_cell(:string, content_arr.join(''), formula, style, hyperlink, coordinate)
when 'is'.freeze
content = ""
cell.children.each do |inline_str|
Copy link
Member Author

@chopraanmol1 chopraanmol1 Jul 2, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Iterating over cell.children rather than using cell.search resulted in a major improvement.

@chopraanmol1
Copy link
Member Author

@tgturner Some changes in this PR are very low-level optimization but since they are in the critical path they do make difference. If you have a better approach for those optimizations let me know.

Also don't just merge yet I'll like to test Excelx::Cell::DateTime#create_datetime a bit more to ensure that it does not break on some edge case

@tgturner
Copy link
Contributor

I'll give this a look over today!

Copy link
Contributor

@tgturner tgturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of this I would definitely need to look at further to fully understand the need, but I trust your judgement. This is really good work however.

My one takeaway though is that adding in some constants could really improve readability. No one likes magic strings!

t = round_datetime(datetime_string)
def create_datetime(base_timestamp, value)
timestamp = (base_timestamp + (value.to_f.round(6) * SECONDS_IN_DAY)).round(0)
t = ::Time.at(timestamp).utc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lib/roo/excelx/cell/number.rb Show resolved Hide resolved
lib/roo/excelx/format.rb Show resolved Hide resolved
@@ -49,11 +49,11 @@ def extract_shared_strings
shared_string = ''
si.children.each do |elem|
case elem.name
when 'r'
when 'r'.freeze
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we use a constant for these instead of freezing them in place? This comment applies to all of the strings frozen in match statements.

@@ -41,7 +41,7 @@ def each_cell(row_xml)
row_xml.children.each do |cell_element|
# If you're sure you're not going to need this hyperlinks you can discard it
hyperlinks = unless @options[:no_hyperlinks]
key = ::Roo::Utils.ref_to_key(cell_element['r'])
key = ::Roo::Utils.ref_to_key(cell_element['r'.freeze])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constant? Applies to all the other use of these single letter strings.

end
when 'f'
unless content.empty?
return Excelx::Cell.cell_class(:string).new(content, formula, style, hyperlink, coordinate)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use cell_class here instead of Excelx::Cell::String?

when 'v'
return create_cell_from_value(value_type, cell, formula, format, style, hyperlink, base_date, coordinate)
when 'v'.freeze
format = style_format(style)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we cache format and value_type for reuse on subsequent iterations? Considering they operate on the parent cell_xml and style there appears to be reuse.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No we don't need to cache format and value_type as we use return statement in next line

lib/roo/excelx/sheet_doc.rb Show resolved Hide resolved
lib/roo/utils.rb Show resolved Hide resolved
@chopraanmol1 chopraanmol1 force-pushed the excelx_memory_performace_optimization branch from cedfb54 to 699ff68 Compare September 13, 2018 12:08
@chopraanmol1
Copy link
Member Author

@tgturner I've rebased with master as most recent commit had a conflict

Copy link
Contributor

@tgturner tgturner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small style fixes, otherwise these look great. Excited to get this stuff into Master. Are you still looking into adding more test cases for this?

lib/roo/excelx/cell/boolean.rb Show resolved Hide resolved
lib/roo/utils.rb Outdated
y = number
[y, x]
def extract_coord(s)
num =letter_num = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing space after =


::Time.new(yyyy, mm, dd, hh, mi, ss.to_r).round(0)
def create_datetime(base_timestamp, value)
timestamp = (base_timestamp + (value.to_f.round(6) * SECONDS_IN_DAY)).round(0)
Copy link
Member Author

@chopraanmol1 chopraanmol1 Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've compared the o/p of new and old implementation.

I/P value edge case (if your timezone is IST) :

2193.0..2193.00613  # 8min 20sec long
15250.0..15250.04166 # 1 hour long
15585.0..15585.04166 # 1 hour long

The above range of values are overlap of timezone's (IST) offset changes.

At the moment I'm not sure if the new implementation is right or older one. I'll have to test a bit more to verify that. If it is the older implementation which is wrong we can go ahead with this PR otherwise need to find some workaround

@chopraanmol1
Copy link
Member Author

@tgturner I've added relevant test case for timezone offset change overlap bug. On master this test case will fail for ENV['TZ'] = "Asia/Calcutta". New implementation of Excelx::Cell::DateTime#create_datetime handles this edge case properly. I think we can go ahead with this now.

@tgturner
Copy link
Contributor

LGTM!

@tgturner tgturner merged commit 4f9b166 into roo-rb:master Sep 15, 2018
chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Sep 16, 2018
1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate

2. Roo::Excelx::Sheet#present_cells was only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.

Memory Profile Script
```
require 'roo'
require 'memory_profiler'

MemoryProfiler.report do
  excel = Roo::Excelx.new('test/files/Bibelbund.xlsx')
  (1..excel.last_row).each{|a| excel.row a}
end.pretty_print
```

Master
```
Total allocated: 42820054 bytes (545014 objects)
Total retained:  12833905 bytes (185969 objects)

allocated memory by gem
-----------------------------------
  24082522  roo/lib
  12173047  nokogiri-1.8.4
   6563085  rubyzip-1.2.2
      1184  tmpdir
       216  other

retained memory by gem
-----------------------------------
   9059622  roo/lib
   3773019  nokogiri-1.8.4
       792  rubyzip-1.2.2
       296  tmpdir
       176  other
```

Modified
```
Total allocated: 38800821 bytes (517025 objects)
Total retained:  9879455 bytes (157986 objects)

allocated memory by gem
-----------------------------------
  20951802  roo/lib
  11053807  nokogiri-1.8.4
   6793812  rubyzip-1.2.2
      1184  tmpdir
       216  other

retained memory by gem
-----------------------------------
   7224422  roo/lib
   2653779  nokogiri-1.8.4
       782  rubyzip-1.2.2
       296  tmpdir
       176  other
```

This PR also improves preformance by 5-12%
chopraanmol1 added a commit to chopraanmol1/roo that referenced this pull request Sep 16, 2018
1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate

2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.

Memory Profile Script
```
require 'roo'
require 'memory_profiler'

MemoryProfiler.report do
  excel = Roo::Excelx.new('test/files/Bibelbund.xlsx')
  (1..excel.last_row).each{|a| excel.row a}
end.pretty_print
```

Master
```
Total allocated: 42820054 bytes (545014 objects)
Total retained:  12833905 bytes (185969 objects)

allocated memory by gem
-----------------------------------
  24082522  roo/lib
  12173047  nokogiri-1.8.4
   6563085  rubyzip-1.2.2
      1184  tmpdir
       216  other

retained memory by gem
-----------------------------------
   9059622  roo/lib
   3773019  nokogiri-1.8.4
       792  rubyzip-1.2.2
       296  tmpdir
       176  other
```

Modified
```
Total allocated: 38800821 bytes (517025 objects)
Total retained:  9879455 bytes (157986 objects)

allocated memory by gem
-----------------------------------
  20951802  roo/lib
  11053807  nokogiri-1.8.4
   6793812  rubyzip-1.2.2
      1184  tmpdir
       216  other

retained memory by gem
-----------------------------------
   7224422  roo/lib
   2653779  nokogiri-1.8.4
       782  rubyzip-1.2.2
       296  tmpdir
       176  other
```

This PR also improves performance by 5-12%
tgturner pushed a commit that referenced this pull request Sep 17, 2018
* Remove unnecessary cache hash

1. After #434 we no longer need to cache Roo::Utils.extract_coordinate

2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.
netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this pull request Jan 20, 2019
pkgsrc change: add "USE_LANGUAGES= # none".

##  [2.8.0] 2019-01-18
### Fixed
- Fixed inconsistent column length for CSV [375](roo-rb/roo#375)
- Fixed formatted_value with `%` for Excelx [416](roo-rb/roo#416)
- Improved Memory consumption and performance [434](roo-rb/roo#434) [449](roo-rb/roo#449) [454](roo-rb/roo#454) [456](roo-rb/roo#456) [458](roo-rb/roo#458) [462](roo-rb/roo#462) [466](roo-rb/roo#466)
- Accept both Transitional and Strict Type for Excelx's worksheets [441](roo-rb/roo#441)
- Fixed ruby warnings [442](roo-rb/roo#442) [476](roo-rb/roo#476)
- Restore support for URL as file identifier for CSV [462](roo-rb/roo#462)
- Fixed missing location for Excelx's links [482](roo-rb/roo#482)

### Changed / Added
- Drop support for ruby 2.2.x and lower
- Updated rubyzip version for fixing security issue. Now minimal version is 1.2.1
- Roo::Excelx::Coordinate now inherits Array [458](roo-rb/roo#458)
- Improved Roo::HeaderRowNotFoundError exception's message [461](roo-rb/roo#461)
- Added `empty_cell` option which by default disable allocation for Roo::Excelx::Cell::Empty [464](roo-rb/roo#464)
- Added support for variable number of decimals for Excelx's formatted_value [387](roo-rb/roo#387)
- Added `disable_html_injection` option to disable html injection for shared string in `Roo::Excelx` [392](roo-rb/roo#392)
- Added image extraction for Excelx [414](roo-rb/roo#414) [397](roo-rb/roo#397)
- Added support for `1e6` as scientific notation for Excelx [433](roo-rb/roo#433)
- Added support for Integer as 0 based index for Excelx's `sheet_for` [455](roo-rb/roo#455)
- Extended `no_hyperlinks` option for non streaming Excelx methods [459](roo-rb/roo#459)
- Added `empty_cell` option to disable Roo::Excelx::Cell::Empty allocation for Excelx [464](roo-rb/roo#464)
- Added support for Integer with leading zero for Roo:Excelx [479](roo-rb/roo#479)
- Refactored Excelx code [453](roo-rb/roo#453) [477](roo-rb/roo#477) [483](roo-rb/roo#483) [484](roo-rb/roo#484)

### Deprecations
- Roo::Excelx::Sheet#present_cells is deprecated [454](roo-rb/roo#454)
- Roo::Utils.split_coordinate is deprecated [458](roo-rb/roo#458)
- Roo::Excelx::Cell::Base#link is deprecated [457](roo-rb/roo#457)
aravindm pushed a commit to chobiwa/roo that referenced this pull request Jun 18, 2019
* Excelx Memory and Performance Optimization

* test case fixup

* Use magic string literal comment over String#freeze

Use Time#to_datetime over Datetime.civil

* Cosmetic fixup

* Added Test Case for Datetime timezone offset change overlap
aravindm pushed a commit to chobiwa/roo that referenced this pull request Jun 18, 2019
* Remove unnecessary cache hash

1. After roo-rb#434 we no longer need to cache Roo::Utils.extract_coordinate

2. Roo::Excelx::Sheet#present_cells is only used to calculate first/last row/column. Refactored code to avoid creation of present_cells hash.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants