Replace DataFrames's default `_repr_html_` (closes #76) #175

tcurvelo · 2019-10-20T22:34:37Z

~~I added a subclass for DataFrame in order to override its `to_html()`, allowing us to define some default styling, like the clickable URLs from #76.~~

I changed my approach to this feature. The approach I tried previously doesn't work on new DataFrames created by common pandas operations (e.g. `df.head()`, `df[df['url'].notna()]`).
Now I'm replacing DataFrame's default `_repr_html_()` method.

Let me know your thoughts on this one.

codecov · 2019-10-20T22:35:29Z

Codecov Report

Merging #175 into master will increase coverage by 0.2%.
The diff coverage is 92.85%.

@@            Coverage Diff            @@
##           master     #175     +/-   ##
=========================================
+ Coverage      81%   81.21%   +0.2%     
=========================================
  Files          24       25      +1     
  Lines        1606     1634     +28     
  Branches      279      281      +2     
=========================================
+ Hits         1301     1327     +26     
- Misses        251      252      +1     
- Partials       54       55      +1

| Impacted Files | Coverage Δ | |
|---|---|---|
| src/arche/__init__.py | `100% <100%> (ø)` | ⬆️ |
| src/arche/tools/dataframe.py | `91.66% <91.66%> (ø)` | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 476a9dd...0655121.

manycoding

Awesome.

Can you time it with a big enough dataset (>100k items) to see if there's a difference?
What about _key which is df.index?

src/arche/readers/items.py

tcurvelo · 2019-11-18T19:11:51Z

Here is a simple benchmark I ran to measure the runtime.
The script below loads a dataset of 100K+ items and renders its HTML representation. I forced it to display 100_000 rows instead of truncating them.

# render_links_benchmark.py
import time
import arche
import pandas as pd

df = pd.read_json("./327565_39_252_items.jl", lines=True)

with pd.option_context("display.min_rows", 100_000, "display.max_rows", 100_000):
    t = time.process_time()
    out = df._repr_html_()
    print(f"Time expended on `_repr_html_`: {time.process_time() - t}")
    print(f"Len: {len(out)}")

Executing it:

$ for branch in master clickable_urls; do git checkout $branch; ./render_links_benchmark.py; done
Already on 'master'
Time expended on `_repr_html_`: 164.999099792
Len: 276061183
Switched to branch 'clickable_urls'
Time expended on `_repr_html_`: 208.39284144
Len: 322665630

It turns out that, for that dataset, rendering links generates about 17% more data and takes about 26% longer to complete.
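
For reference, the percentages can be recomputed from the raw output above. This is only a small arithmetic sketch; the helper name `relative_increase` is made up for illustration.

```python
# Figures copied from the benchmark output above.
master_seconds = 164.999099792
branch_seconds = 208.39284144
master_len = 276_061_183
branch_len = 322_665_630


def relative_increase(before, after):
    """Return the increase from `before` to `after` as a percentage."""
    return (after - before) / before * 100


time_overhead = relative_increase(master_seconds, branch_seconds)
size_overhead = relative_increase(master_len, branch_len)
print(f"time: +{time_overhead:.1f}%, size: +{size_overhead:.1f}%")
# prints "time: +26.3%, size: +16.9%"
```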

manycoding

I don't think we care much about full rendering performance, since people wouldn't want to render the whole dataset.
I checked the normal cases myself (https://jupyter.scrapinghub.com/user/v/lab/tree/shared/Experiments/clickable_urls.ipynb) and there's no difference.
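
As a rough illustration of why the normal case is cheap (assuming "normal" means the default, truncated display), the sketch below renders a synthetic 100,000-row frame with default options and counts the rows that actually reach the HTML. The data and column names are invented for the example and are not the dataset from the benchmark.

```python
import time

import pandas as pd

# Synthetic stand-in for a large dataset; the column names are illustrative.
n_rows = 100_000
df = pd.DataFrame(
    {"col1": range(n_rows), "url": [f"http://foo.com/{i}" for i in range(n_rows)]}
)

start = time.perf_counter()
html = df._repr_html_()  # default display options, so the output is truncated
elapsed = time.perf_counter() - start

# Only the head and tail of the frame are rendered, not all 100,000 rows.
rendered_rows = html.count("<tr")
print(f"rows rendered: {rendered_rows}, seconds: {elapsed:.4f}")
```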

manycoding · 2019-11-18T20:59:10Z

src/arche/tools/dataframe.py

+            render_links=render_links,
+        )
+        formatter.to_html(notebook=True)
+        return formatter.buf.getvalue()


I compared this to the pandas source, trying to find an alternative to this hack, and noticed this.

Why is `buf.getvalue()` returned here instead of being formatted as in the source (https://github.com/pandas-dev/pandas/blob/c23649143781c658f792e8f7a5b4368ed01f719c/pandas/core/frame.py#L724)?

manycoding · 2019-11-18T21:00:15Z

tests/tools/test_dataframe_html_output.py

+@pytest.fixture()
+def df_with_urls():
+    pd.set_option("display.notebook_repr_html", True)
+    data = {"col1": [1, 2], "col2": ["http://foo.com", "https://bar.com"]}


Let's add index here too, that's the main use case.
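
One possible shape for such test data, as a sketch only: the helper name `make_df_with_urls` and the `example.com` index values are hypothetical stand-ins, and the index is named `_key` on the assumption that it mirrors the `_key` index mentioned earlier in the review. The body could be moved into the `df_with_urls` fixture as-is.

```python
import pandas as pd


def make_df_with_urls():
    """Build a small frame whose index and one column both hold URLs."""
    data = {"col1": [1, 2], "col2": ["http://foo.com", "https://bar.com"]}
    # Hypothetical index values standing in for item key URLs.
    index = pd.Index(
        ["https://example.com/item/0", "https://example.com/item/1"],
        name="_key",
    )
    return pd.DataFrame(data, index=index)


df = make_df_with_urls()
# Public pandas API; the PR's own formatter may behave differently.
html = df.to_html(render_links=True)
```

The sketch deliberately makes no claim about how index cells are rendered; a test for that would assert on `html` against the PR's actual `_repr_html_`.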

manycoding · 2019-11-18T21:04:19Z

tests/tools/test_dataframe_html_output.py

@@ -0,0 +1,36 @@
+import pandas as pd


To be consistent with our test file naming, this should be `tests/tools/test_dataframe.py`.

manycoding · 2019-11-18T21:06:59Z

tests/tools/test_dataframe_html_output.py

+    return pd.DataFrame(data)
+
+
+def test_df_has_clickable_urls(df_with_urls):


The convention we use is `test_<function or class>_<description>`, for example `test_df_repr_html_true`.

Does that look better? It's mainly done for quick search.

manycoding · 2019-11-18T21:07:47Z

tests/tools/test_dataframe_html_output.py

+    assert "<a href=" not in html
+
+
+def test_large_repr(df_with_urls):


I am not sure what this test does. Could you please clarify?

tcurvelo requested a review from manycoding October 20, 2019 22:34

tcurvelo changed the title ~~Wrap DataFrames to allow custom styles (closes #76)~~ Wrap DataFrame to allow styling it (closes #76) Oct 20, 2019

manycoding suggested changes Oct 22, 2019

Replace DataFrames's default _repr_html_ (closes #76)

e3c106b

tcurvelo force-pushed the clickable_urls branch from 69b9401 to e3c106b Compare October 24, 2019 23:46

tcurvelo changed the title ~~Wrap DataFrame to allow styling it (closes #76)~~ Replace DataFrames's default _repr_html_ (closes #76) Oct 24, 2019

Merge branch 'master' into clickable_urls

4c756b9

tcurvelo changed the title ~~Replace DataFrames's default _repr_html_ (closes #76)~~ [WIP] Replace DataFrames's default _repr_html_ (closes #76) Oct 25, 2019

tcurvelo added 3 commits November 3, 2019 21:52

Copy pandas.DataFrame._repr_html_ as it is

3329217

Register display.render_links as a pandas option

6b8b2f7

Avoid breaking rendered links by increase the columns width

0655121

tcurvelo changed the title ~~[WIP] Replace DataFrames's default _repr_html_ (closes #76)~~ Replace DataFrames's default _repr_html_ (closes #76) Nov 18, 2019

tcurvelo requested a review from manycoding November 18, 2019 19:15

manycoding suggested changes Nov 18, 2019
