This repository has been archived by the owner on Aug 30, 2022. It is now read-only.

Where to place data #714

Closed
pradeepgaur opened this issue May 20, 2021 · 7 comments

@pradeepgaur

I have two questions:

  1. When I am installing Hillview on a single machine, where should I place data during the installation process?
  2. When I am installing it on a cluster, how should I divide the data?

Any articles or guidance would be appreciated.

@mihaibudiu
Contributor

The diagram here: https://github.com/vmware/hillview/blob/master/docs/userManual.md#11-system-architecture shows the system architecture. Only workers read data. If you read data from files, the files should be on the same machines where the workers run. On a single machine you can load files from that machine. On a cluster it is easiest to divide the data among the machines where the workers are, placing all files that should be analyzed together in the same directory. If you have a more concrete use case, we can discuss it specifically.
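
A minimal sketch of one way to plan such a split, assuming a simple round-robin assignment. `assignFilesToWorkers`, the host names, and the file names are hypothetical and not part of Hillview. The helper only decides which worker machine should hold which file; copying each file into the same directory on its worker is left to whatever tool you already use.

```typescript
// Hypothetical helper, not part of Hillview: plans which worker machine
// should store which file of a dataset.
function assignFilesToWorkers(files: string[], workers: string[]): Map<string, string[]> {
    if (workers.length === 0) {
        throw new Error("at least one worker is required");
    }
    const plan = new Map<string, string[]>();
    workers.forEach((w) => plan.set(w, []));
    files.forEach((file, i) => {
        // Round-robin: file i goes to worker (i mod number of workers).
        const worker = workers[i % workers.length];
        plan.get(worker)!.push(file);
    });
    return plan;
}

// Example with made-up file and host names. Every worker keeps its share
// under the same relative directory (here data/ontime/).
const plan = assignFilesToWorkers(
    ["data/ontime/2016_1.csv", "data/ontime/2016_2.csv", "data/ontime/2016_3.csv"],
    ["worker1.example.com", "worker2.example.com"]);
plan.forEach((assigned, worker) => console.log(worker, assigned));
```

Round-robin is only one choice; splitting by file size would balance the load better when the files differ a lot in size.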

@mihaibudiu
Contributor

But ideally you should not need to move any of your data when using Hillview. If you already have the data stored in a distributed system, e.g. a set of logs on some machines, the ideal case is to deploy a Hillview worker on each machine which stores some of the data. Many data lakes look like this.

@pradeepgaur
Author

Is there a way to simply browse the data? I get the attached view, which requires double-clicking individual columns to load data. I think that a few days ago, on the hosted demo's "flights csv" dataset, I was able to just browse the data.

image

@mihaibudiu
Contributor

mihaibudiu commented May 20, 2021

I think that you have loaded the data correctly. The issue is that your table has many columns, so it opens in a "Schema" view instead of a "Table" view. The schema view shows all columns and lets you choose which ones to display in a table view. So you have selected 9 columns and displayed them as a table. If you want to see a table with all columns, select all of them (click on the first row, then shift-click on the last row) and use the menu "view selected columns." I could also add menu buttons "select all columns" or "view all columns as table" to make this easier.

@mihaibudiu
Contributor

The reason we show a schema view for wide tables is that they are too wide to fit nicely on the screen, so we give you the option to select only a subset of the columns.

@pradeepgaur
Author

Thanks for your prompt responses. I would like to know the following:

  1. "View Selected Columns" gives me some kind of aggregated view, so each row can be an aggregation of multiple rows. I just want to look at the raw data without aggregation.
  2. "Demo Dataset" main menu item: can I add my own menu item to load a frequently used dataset? How?

I really appreciate your responses, and I see good potential for Hillview in my project.

@mihaibudiu
Contributor

Hillview will always aggregate the displayed data in some form, because most data does not fit on the screen.
To see the data in all columns you can select all columns and then click "view/show". But even then the view you see will be aggregated. There is no easy way to see the rows of the original file in the order in which they appear in the file. This is because we assume the data is split between multiple files and there is no clear ordering of the files. We could add an option to include a column holding the line number, and then you could sort on that column.
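
To illustrate the line-number idea, here is a small sketch. It is not Hillview code: `TaggedRow`, `tagRows`, and `inFileOrder` are hypothetical names, and ordering the files by name is an assumption, since the files themselves have no inherent order.

```typescript
// Hypothetical types and helpers, not part of Hillview.
interface TaggedRow {
    file: string;   // file the row came from
    line: number;   // 1-based position of the row within that file
    text: string;   // raw row content
}

// Attach the file name and line number to every row of one file.
function tagRows(file: string, lines: string[]): TaggedRow[] {
    return lines.map((text, i) => ({ file, line: i + 1, text }));
}

// Sort by file name first (an assumed ordering), then by line number,
// which restores the original order of rows within each file.
function inFileOrder(rows: TaggedRow[]): TaggedRow[] {
    return rows.slice().sort((a, b) =>
        a.file < b.file ? -1 : a.file > b.file ? 1 : a.line - b.line);
}
```

With such a column, sorting on the pair (file, line) recovers the original order of the rows inside each file, even though the order between files remains a convention.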

For question 2, the only solution right now is to edit the code, in the file loadView.ts.

But it is a good idea to let you create a JSON file listing the sets of files to load. I will file a separate issue for that. If you look at the code in loadView.ts, it looks like this:

```typescript
testitems.push(
    { text: "Flights (15 columns, CSV)",
        action: () => {
            const files: FileSetDescription = {
                fileNamePattern: "data/ontime/????_*.csv*",
                schemaFile: "short.schema",
                schema: null,
                headerRow: true,
                name: "Flights (15 columns)",
                fileKind: "csv",
            };
            this.init.loadFiles(files, this.page);
        },
        help: "The US flights dataset.",
    },
...
```

So this is in fact just a JSON object. We could read this object from a file. But I will need to document the schema of the JSON.
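
A rough sketch of how such a JSON file could be turned into menu items, under stated assumptions. The field names of `FileSetDescription` are copied from the excerpt above, but their types are guesses. The JSON layout (`DatasetEntry`), `parseDatasetEntries`, `toMenuItems`, and the `load` callback are all hypothetical; in Hillview the callback would presumably wrap the `this.init.loadFiles(files, this.page)` call shown above.

```typescript
// Field names taken from the loadView.ts excerpt; the types are assumptions.
interface FileSetDescription {
    fileNamePattern: string;
    schemaFile: string | null;
    schema: unknown;          // null in the excerpt
    headerRow: boolean;
    name: string;
    fileKind: string;
}

// Hypothetical layout of one entry in the JSON file.
interface DatasetEntry { text: string; help: string; files: FileSetDescription; }

// Hypothetical shape of a menu item, mirroring the excerpt.
interface MenuItem { text: string; help: string; action: () => void; }

// Parse and lightly validate a JSON array of dataset entries.
function parseDatasetEntries(json: string): DatasetEntry[] {
    const parsed: unknown = JSON.parse(json);
    if (!Array.isArray(parsed)) {
        throw new Error("expected a JSON array of dataset entries");
    }
    return parsed.map((entry: any, i: number): DatasetEntry => {
        const files = entry && entry.files;
        if (!files || typeof entry.text !== "string" ||
            typeof files.fileNamePattern !== "string" ||
            typeof files.fileKind !== "string") {
            throw new Error(`entry ${i} is missing a required field`);
        }
        return {
            text: entry.text,
            help: typeof entry.help === "string" ? entry.help : "",
            files: {
                fileNamePattern: files.fileNamePattern,
                schemaFile: typeof files.schemaFile === "string" ? files.schemaFile : null,
                schema: null,
                headerRow: files.headerRow === true,
                name: typeof files.name === "string" ? files.name : entry.text,
                fileKind: files.fileKind,
            },
        };
    });
}

// Build one menu item per entry; `load` stands in for the real loader call.
function toMenuItems(entries: DatasetEntry[],
                     load: (files: FileSetDescription) => void): MenuItem[] {
    return entries.map((e) => ({
        text: e.text,
        help: e.help,
        action: () => load(e.files),
    }));
}
```

Passing the loader in as a callback keeps the sketch independent of Hillview's internal classes, which are not shown in this thread.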
