Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load HTML and/or Consider Refactoring #10

Closed
danieldjewell opened this issue Feb 18, 2021 · 10 comments
Closed

Load HTML and/or Consider Refactoring #10

danieldjewell opened this issue Feb 18, 2021 · 10 comments

Comments

@danieldjewell
Copy link

Just stumbled across tablite. Really cool, definitely can agree with the use case (pandas and numpy are amazing, but also super massive when you just want to deal with a 5 column 20 row table.)

What are your thoughts on:

  1. Loading HTML tables - Writing a parser from scratch seems duplicative - I was thinking of something along the lines of being able to pass in a bs4.element.Tag (the result of page.select_one('table#tabid') from BeautifulSoup)? (Although I try to avoid suggesting adding dependencies, BeautifulSoup seems to be popular enough that making it an optional dependency for parsing HTML tables seems like a reasonable compromise to avoid having to parse manually - or with html5lib/lxml)
  2. Refactoring some of the code to multiple files? While it's convenient to have everything in one file, the source is up to 2700+ lines. As new features are added, it's only going to be more work to rearrange. (I get that the goal is to keep things lightweight - that's the whole point of tablite - but that doesn't mean that multiple source files is bad. It helps with organziation and also would offer an opportunity to review for more fine-grained refactoring to see if there are any duplicated sections that could be optimized.)
  3. Renaming the installed package name from table to tablite to match the project name?
  • This one isn't great because it's a major breaking change... which sucks
  • This stands out for a couple of reasons:
    • When installing a package, I usually assume that after installing, I'm going to do an import «packagename» - which in this case, obviously won't work
    • I'm not sure if there's a PEP with guidance on this, but I think it's a common enough practice that when the installed module name is different from the project/package name, it is a surprise
    • "table" is super common. The chances of someone having a table.py file somewhere in a project (or the current directory) are much higher than someone having a tablite.py file -- this would create an import conflict and would break tablite from loading. (Again, not sure if there's PEP guidance on this, but as a general rule, I think avoiding naming packages with common words is a good idea. For submodules/packages this isn't an issue if the main/parent name is completely unique.)
@root-11
Copy link
Owner

root-11 commented Feb 18, 2021

Renaming the installed package name from table to tablite to match the project name?

Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.

@root-11
Copy link
Owner

root-11 commented Feb 18, 2021

In regard to HTML import. I'm totally okay with that. If you want to write the code and make a pull request I can merge it.

@root-11
Copy link
Owner

root-11 commented Feb 18, 2021

I've just pushed a refactor of the code in the latest commit. Could you have a look and tell me if it's easier to navigate?

@root-11
Copy link
Owner

root-11 commented Feb 18, 2021

Summary thus far:

  1. Great idea.
  2. Done.
  3. Done.

@root-11
Copy link
Owner

root-11 commented Feb 18, 2021

You can have the latest version from pip as I've just packaged it.

https://pypi.org/project/tablite/2021.2.18.60263/

@root-11
Copy link
Owner

root-11 commented Feb 19, 2021

@danieldjewell reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

?

@danieldjewell
Copy link
Author

Funny. I've just had the same conversation with another user. We decided the rename about 25 minutes ago.

Hehe. It is. Must be thinking on the same wavelength. 😁 🍻

pushed the refactor

Yeah. Both the rename and refactoring helps loads. (The fact that the rename is a breaking change sucks, but I guess it's kinda like "Well, let's get this one over with because it's just going to be harder in the future." 😁 )

Is it better

I would suggest moving Table out of __init__.py as well - that way init can be used to import what you want to be publicly viewable (from what I've seen and done, to streamline & remove clutter).

For example -- if you do a:

dir(tablite)

... the result is nearly (?) everything in the package.

Imports

Semi Long Philosophical Paragraph Follows

Side note/philosophical note/disclaimer: I come from a pretty strong object oriented background (PHP/Java/C#/C++) - For whatever reason, I never really liked the Pythonic way of importing specific classes/functions from a package (e.g. from tablite import Table). For me, it's somehow like a deadly sin to pollute the global namespace of a script like that. That's not meant to be a criticism of tablite (even though the code does have some of that 😁 ) ... but more of an observation on the way I tend to use things - I personally very much like when I can do a one line import and keep everything nice and tidy: e.g. import numpy as np, or import xarray as xr, or import pandas as pd ... and not have to mess around with secondary imports.

That said, when a package imports nothing (see pyca/cryptography) I find it rather infuriating. (I personally do a lot of prototyping using either JupyterLab or more often an ipython session in a terminal - having to run multiple help() commands to find the contents of a package is really a pain... (And I don't always have a GUI web browser to open API docs easily...)


tl;dr;: I recognize that my opinions may not be representative of what other Python developers might say. (The Python community seems to have lots of differing opinions on lots of things... )

That said, have a look at pandas' init.py and xarray's init.py (or even numpy's but that one has a lot more going on in it than is necessary to illustrate my point). Basically, everything that's needed for a user/developer to use/interact with the package is imported in to the root namespace. So after a import xarray as xr you pretty much have every class/function/enum/whatever you'll need (and if there are more, they are usually static/class methods within some of the imported classes). The top of the cake is the __all__ list/tuple that defines what to import.

So, specifically for tablite... Table, GroupBy, file_reader definitely need to be in the "root" namespace. Not sure about the others. It looks like GroupBy already has imports setup -- which is great!

tablite/tablite/__init__.py

Lines 744 to 754 in 2072e8e

max = Max # shortcuts to avoid having to type a long list of imports.
min = Min
sum = Sum
first = First
last = Last
count = Count
count_unique = CountUnique
avg = Average
stdev = StandardDeviation
median = Median
mode = Mode

One interesting other way to do it would be to put the entire "from tablite.groupby_utils ..." line as the first line inside of the GroupBy class declaration... That way they're local... something like:

class GroupBy(object):
    from tablite.groupby_utils import GroupbyFunction, Max, Min, Sum, First, Last, Count, CountUnique, Average, StandardDeviation, Median, Mode

Of course this would be breaking if anyone is using tablite.Max directly ... The other way to do it could be to leave it inside of GroupBy and add almost like an short-hand accessor

import tablite as tl 
tab = tl.Table(...) 
g = tab.groupby(keys=['a','b'],
               functions=[('f', tl.gb.Max)])

## if "gb" is just an alias for the GroupBy class and the GroupBy functions (Max,Min, etc.) are imported into that class, this becomes possible: 

g = tab.gb(keys=['a', 'b'], 
        .... )

### OR if you do something like
import tablite.GroupBy as gb 
## then 
g = tab.groupby(keys=['a','b'],
               functions=[('f', gb.Max)])

### This is what I'm referring to when I mentioned the way I like to personally import things and not import things into the main global namespace 

It might also be interesting to handle groupby a bit like pandas -- at least just the syntax -- e.g. df.groupby(['key1','key2]).agg(['min', 'mean', 'max'])

HTML Tables

reg. the HTML import: Is it something like this you wanted to do:
https://www.thepythoncode.com/article/convert-html-tables-into-csv-files-in-python

Kinda. That particular code example is the general idea... And having sort of a 2-stage option would be really nice I think:

  • The "lower level" interface could accept a bs4.element.Tag object that has been pre-processed to be (hopefully) just one HTML table object. (You'd still need options to help guide headers - or even overriding column names with a separate list, rowskip, and definitely the ability to specify datatypes for casting... kinda the same general requirements as for a CSV)
  • A higher-level "get all the tables" could accept a file pointer object (e.g. StringIO/BytesIO), a filename, or a str/bytes object with the raw HTML. Some preliminary parsing could attempt to find all of the tables and then essentially loop through using the "lower level" interface (above) and return a collection of some kind of tablite.Table() objects

It just now occurred to me that pandas actually has a read_html() function ... The trouble always is that there are so many ways to write an HTML table. Colspan? Rowspan? (And I'm not even going to go to that place where some tables are built using <dl> and <dt> ... sheesh)

[Sorry this got kinda long... I started writing and then it was like "oh! one more idea! oh! can't forget to mention this other thing..." 😁 ]

@root-11
Copy link
Owner

root-11 commented Feb 22, 2021

Hi Daniel,

TL;DR: I think you've made a good case. Let's get it right.

I've refactored into the branch name_space_review for you to review. All tests pass.

Imports

As the only things the user really needs are Table and GroupBy this should do:

__all__ = [Table, Groupby]

For pycharm users, the helper will give a compact presentation as shown below:

image

and for jupyter users dir is compact and unpolluted:

>>> import tablite as tl
>>> dir(tl)
['GroupBy', 'Table', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__', 'core', 'datatypes', 'file_reader_utils', 'groupby_utils', 'stored_list']

There are dependencies between file_reader and Table, so moving the file readers away from core.py will cause cyclic imports. However as file_reader isn't in __all__ it won't appear and users can use the class method as always:

Table.from_file(path)

which uses the file_readers


Shorthand will also work fine: I've updated groupby_tests.py, but am unsure if there is a better way than this:

from tablite import Table, GroupBy
gb = GroupBy

g = GroupBy(keys=['a', 'b'],
functions=[('f', gb.max),
('f', gb.min),
('f', gb.sum),
('f', gb.first),
('f', gb.last),
('f', gb.count),
('f', gb.count_unique),
('f', gb.avg),
('f', gb.stdev),
('a', gb.stdev),
('f', gb.median),
('f', gb.mode),
('g', gb.median)])

https://github.com/root-11/tablite/blob/name_space_review/tests/groupby_tests.py#L34

HTML reader.

As you want to extend the file_readers with your own html_reader you can use:

>>> def html_reader(path):   # define the reader.
>>>     # do magic
>>>     return 1

>>> from tablite.core import readers

>>> readers['my_html_reader']= [html_reader, {}]

>>>for kv in readers.items():
>>>    print(kv)
    
csv [<function text_reader at 0x0000020FFF373C18>, {}]
tsv [<function text_reader at 0x0000020FFF373C18>, {}]
txt [<function text_reader at 0x0000020FFF373C18>, {}]
xls [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsx [<function excel_reader at 0x0000020FFF299DC8>, {}]
xlsm [<function excel_reader at 0x0000020FFF299DC8>, {}]
ods [<function ods_reader at 0x0000020FFF299E58>, {}]
zip [<function zip_reader at 0x0000020FFF299EE8>, {}]
log [<function log_reader at 0x0000020FFF299F78>, {'sep': False}]
my_html_reader [<function html_reader at 0x0000020FFF3828B8>, {}]  # <---------

....

Longer answer too, but in the face of ambiguity ten extra words may save ten days of work ;-)

@root-11
Copy link
Owner

root-11 commented Mar 1, 2021

@danieldjewell - Hi - I'm just picking up this thread. Are you still reviewing?

@root-11
Copy link
Owner

root-11 commented Mar 3, 2021

@danieldjewell
name_space_review has been merged into master. closing ticket.

@root-11 root-11 closed this as completed Mar 3, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants