Select starting sheet in html/xls/sqlite from command-line #214

Closed
aborruso opened this issue Nov 24, 2018 · 11 comments

Comments

@aborruso
Contributor

Hi,
if I run vd -b t.html -o html.csv, the CSV contains the table listing below rather than the HTML table inside my file.

tag,id,nrows,ncols,classes
table,,72,4,wikitable sortable

Here, table is the HTML table inside my t.html file. Is there a way to pass the sheet name on the command line? Something like vd -b t.html -sheet table -o html.csv

Thank you

@aborruso changed the title from "Ho to convert to HTML without visualization?" to "Ho to convert from HTML without visualization?" on Nov 24, 2018
@saulpw
Owner

saulpw commented Nov 24, 2018

Hi @aborruso, there's not an easy way yet. I've wanted something like this myself at times. Let me see if I can come up with something. Thanks for the suggestion!

@aborruso changed the title from "Ho to convert from HTML without visualization?" to "How to convert from HTML without visualization?" on Dec 2, 2018
@saulpw changed the title from "How to convert from HTML without visualization?" to "[Feature request] select starting sheet in html/xls/sqlite from command-line" on Dec 26, 2018
@saulpw changed the title from "[Feature request] select starting sheet in html/xls/sqlite from command-line" to "Select starting sheet in html/xls/sqlite from command-line" on Jan 11, 2019
@aborruso
Contributor Author

Hi @saulpw, is there a way to open the table directly when there is only one?

My final goal is to use VisiData as an HTML to CSV converter, with something like the command below, in which I use an XPath query to extract only one table:

curl "http://example.com/page.html" | myScrapeUtilty -xpathRule '//table[count(tr/td)>7]' | vd -b  -f html -o out.csv

But even with only one table, VisiData asks me to choose, and it saves the sheets sheet as CSV instead of the table itself.

Thank you

@saulpw
Owner

saulpw commented Jun 18, 2019

Hi @aborruso, try adding -p dive.vd with the attached small .vd script.

sheet	col	row	longname	input	keystrokes	comment
			open-file	-	o	
-		0	dive-row		^J	

The first command opens the input from stdin (-), and the second command dives into the first row (0).

You can get this .vd yourself with:

  1. the same command you have but without -b
  2. press Enter and do any other manual steps
  3. press Shift+D to go to the commandlog
  4. finally, press Ctrl+S to save to dive.vd, which you can use with your pipeline.

dive.vd.txt
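
For anyone who prefers to create the script without the attachment, the following is a minimal sketch that writes the same two commands to dive.vd. It assumes the file is plain tab-separated text with exactly the seven columns shown above; vd_row is a hypothetical helper name, not part of VisiData.

```shell
#!/bin/sh
# Sketch: write the two-command dive.vd from this comment as a tab-separated file.
set -eu

# Hypothetical helper: print one command-log row with the seven columns
# sheet, col, row, longname, input, keystrokes, comment.
vd_row() {
    printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' "$@"
}

{
    vd_row sheet col row longname input keystrokes comment  # header row
    vd_row ''  '' ''  open-file '-' o    ''                 # open the input from stdin
    vd_row '-' '' '0' dive-row  ''  '^J' ''                 # dive into row 0
} > dive.vd
```

The resulting file could then be added to the pipeline from the earlier comment, for example: curl "http://example.com/page.html" | myScrapeUtilty -xpathRule '//table[count(tr/td)>7]' | vd -b -f html -p dive.vd -o out.csv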

@aborruso
Contributor Author

@saulpw you are really brilliant. I'm impressed: VisiData is a kind of magic.

@aborruso
Contributor Author

@saulpw I have added a recipe in my VisiData Italian guide https://github.com/ondata/guidaVisiData/blob/master/testo/README.md#Salvare-una-tabella-HTML-in-CSV-a-partire-da-una-pagina-web

Thank you again

@saulpw
Owner

saulpw commented Aug 21, 2019

Fixed for html loader in f55de38; requires changes in other loaders with a sheet index.

anjakefala pushed a commit that referenced this issue Aug 22, 2019
This permits indexing into sub-sheets through CLI (`+toplevel:subsheet::`).
@saulpw
Owner

saulpw commented Oct 5, 2019

To-do to resolve this issue:

  1. Fix loaders with sheet index to have rowdef sheets.
  2. Write above requirement into book/loaders.md.
  3. Improve startup with large files to remove sync(); file should load sync, cursor should jump after load completes (including ^C), or after sheet/row/col is available, if possible.

@anjakefala
Collaborator

anjakefala commented Nov 10, 2019

The IndexSheet has been developed (see visidata/sheets.py). It has the attribute rowtype = 'sheets' by default.

Loaders to be ported:

  • html
  • xls
  • xlsx
  • xlsb
  • hdf5
  • sqlite
  • postgres

Misc:

  • requirements need to be added to loaders.md

@saulpw
Owner

saulpw commented Nov 10, 2019

CLI syntax is +:<sheet>:<row>:<col>.

  • +:subsheet:: to ignore row/col
  • can name toplevel source index if more than one: +toplevel:subsheet::
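
To make the field order concrete, here is a small sketch that assembles the argument from its parts, assuming only the +toplevel:subsheet:row:col layout described above. vd_start_arg is a hypothetical helper for illustration, not a VisiData command, and the sheet names are placeholders.

```shell
#!/bin/sh
# Sketch: build a start-position argument of the form +toplevel:subsheet:row:col.
set -eu

# Hypothetical helper. Usage: vd_start_arg SUBSHEET [ROW] [COL] [TOPLEVEL]
# Omitted or empty fields are left blank, so the colons are always emitted.
vd_start_arg() {
    printf '+%s:%s:%s:%s\n' "${4:-}" "$1" "${2:-}" "${3:-}"
}

vd_start_arg subsheet                  # prints +:subsheet::
vd_start_arg subsheet 1 1              # prints +:subsheet:1:1
vd_start_arg subsheet '' '' toplevel   # prints +toplevel:subsheet::
```

The printed string would be passed to vd as a single argument alongside the input file.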

saulpw added a commit that referenced this issue Nov 12, 2019
@saulpw saulpw closed this as completed Nov 14, 2019
@aborruso
Contributor Author

Hi @saulpw, if I run

curl -L "https://en.wikipedia.org/wiki/Olympic_medal" | vd -f html +:table_2:1:1

vd does not open table_2. What's wrong with my command?

vd 2 is really great!

@anjakefala
Collaborator

Hey @aborruso!

Can you please open a bug report, and link to this issue?

There is not a good way for me to remember to check up on this potential bug, otherwise. 😅
