Where to place data #714

pradeepgaur · 2021-05-20T19:34:05Z

I have two questions:

1. When I install Hillview on a single machine, where should I place the data during installation?
2. When I install it on a cluster, how should I divide the data?

Any articles or guidance would be appreciated.

mihaibudiu · 2021-05-20T20:02:45Z

The diagram here: https://github.com/vmware/hillview/blob/master/docs/userManual.md#11-system-architecture shows the system architecture. Only workers read data. If you read data from files, the files should be on the same machines where the workers reside. On a single machine you can load files from that machine. On a cluster it is easiest to divide the data among the machines where the workers are, placing all files that should be analyzed together in the same directory on each machine. If you have a more concrete use case, we can discuss it specifically.
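As a rough illustration of dividing files among worker machines, here is a minimal sketch that assigns files to workers in round-robin order. The function `assignFilesToWorkers` and the host and file names are hypothetical and are not part of Hillview; the sketch only computes a placement plan, and copying the files is left to whatever tool you already use.

```typescript
// Hypothetical helper, not part of Hillview: computes which files
// should be copied to which worker machine, in round-robin order.
function assignFilesToWorkers(files: string[], workers: string[]): Record<string, string[]> {
    if (workers.length === 0) {
        throw new Error("at least one worker is required");
    }
    const placement: Record<string, string[]> = {};
    for (const worker of workers) {
        placement[worker] = [];
    }
    files.forEach((file, index) => {
        // File i goes to worker (i mod number of workers).
        placement[workers[index % workers.length]].push(file);
    });
    return placement;
}

// Example with made-up host and file names.
const placement = assignFilesToWorkers(
    ["2016_1.csv", "2016_2.csv", "2016_3.csv", "2016_4.csv", "2016_5.csv"],
    ["worker1", "worker2"]);
Object.keys(placement).forEach((worker) => {
    console.log(worker + " -> " + placement[worker].join(", "));
});
```

Each worker would then hold its share of the files in the same directory path, so that one file name pattern matches the whole dataset on every machine.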

mihaibudiu · 2021-05-20T20:14:41Z

But ideally you should not need to move any of your data when using Hillview. If you already have the data stored in a distributed system, e.g. a set of logs on some machines, the ideal case is to deploy a Hillview worker on each machine which stores some of the data. Many data lakes look like this.

pradeepgaur · 2021-05-20T21:02:32Z

Is there a way to simply browse the data? I get the attached view, where I have to double-click individual columns to load their data. I think that a few days ago, on the hosted demo "flights csv" dataset, I was able to just browse the data.

mihaibudiu · 2021-05-20T21:09:16Z

I think that you have loaded the data correctly. The issue is that your table has lots of columns, and thus it starts in a "schema" view instead of a "Table" view. In the schema view you are shown all columns and you can choose which ones to see in a table view. So you have selected 9 columns and displayed these as a table. If you want to see a table with all columns, just select all of them (click on the first row, then shift-click on the last row) and use the menu "view selected columns". I could also add menu buttons "select all columns" or "view all columns as table" to make this easier.

mihaibudiu · 2021-05-20T21:09:51Z

The reason we show a schema view for wide tables is that very wide tables do not fit nicely on the screen, so we give you the option to select only a subset of the columns.

pradeepgaur · 2021-05-20T21:27:08Z

Thanks for your prompt responses. I would like to know the following:

1. "View Selected Columns" gives me some kind of aggregated view, so each row can be an aggregation of multiple rows. I just want to take a look at the raw data without aggregation.
2. "Demo Dataset" main menu item: can I add my own menu item to load a frequently used dataset? How?

I really appreciate your responses, and I see good potential of Hillview on my project.

mihaibudiu · 2021-05-20T21:42:58Z

Hillview will always aggregate the displayed data in some form, because most data does not fit on the screen.
To see data in all columns you can select all columns and then click "view/show". But even then the view you will see will be aggregated. There is no easy way to see the rows of the original file in the order they are in the file. This is because we assume the data is split between multiple files and there is no clear ordering of the files. We could add an option to also have a column which is the line number, and then you could sort on that column.
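To make the aggregation concrete, here is a minimal sketch of one way a view over a subset of columns can collapse several raw rows into one displayed row with a count. This is only an illustration under that assumption, not Hillview's actual implementation; `aggregateRows`, the column names, and the sample values are made up.

```typescript
type Row = Record<string, string | number>;

interface AggregatedRow {
    values: Array<string | number>;  // values of the visible columns
    count: number;                   // number of raw rows with these values
}

// Illustrative only: rows that agree on all visible columns are merged.
function aggregateRows(rows: Row[], columns: string[]): AggregatedRow[] {
    const groups: Record<string, AggregatedRow> = {};
    const order: string[] = [];  // remembers first-seen order of the groups
    for (const row of rows) {
        const values = columns.map((c) => row[c]);
        const key = JSON.stringify(values);
        if (groups[key] === undefined) {
            groups[key] = { values: values, count: 0 };
            order.push(key);
        }
        groups[key].count += 1;
    }
    return order.map((k) => groups[k]);
}

// Made-up sample data.
const flights: Row[] = [
    { Carrier: "AA", Origin: "SFO", DepDelay: 5 },
    { Carrier: "AA", Origin: "SFO", DepDelay: 12 },
    { Carrier: "UA", Origin: "JFK", DepDelay: 0 },
];
// With two visible columns, the two AA/SFO rows merge into one.
console.log(aggregateRows(flights, ["Carrier", "Origin"]));
// With all columns visible, no two rows are equal, so nothing merges.
console.log(aggregateRows(flights, ["Carrier", "Origin", "DepDelay"]));
```

In this sketch, selecting more columns makes rows less likely to coincide, which is why a view with all columns looks closer to the raw data even though it is still computed as an aggregate.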

For question 2, the only solution right now is to edit the code in the file loadView.ts.

But this is a good idea: to give you the possibility of creating a JSON file with a set of files to load. I will file a separate issue for that. If you look at the code in loadView.ts, it looks like this:

```typescript
testitems.push(
    { text: "Flights (15 columns, CSV)",
        action: () => {
            const files: FileSetDescription = {
                fileNamePattern: "data/ontime/????_*.csv*",
                schemaFile: "short.schema",
                schema: null,
                headerRow: true,
                name: "Flights (15 columns)",
                fileKind: "csv",
            };
            this.init.loadFiles(files, this.page);
        },
        help: "The US flights dataset.",
    },
...
```

So the file description is in fact just a JSON object. We could read this object from a file. But I will need to document the schema of the JSON.
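As a sketch of what reading such entries from a file might look like, the code below parses a JSON array of menu entries and checks the fields it needs. The field names of `FileSetDescription` are copied from the loadView.ts snippet above; the JSON layout, the `MenuEntry` type, and the `parseMenuEntries` function are assumptions made for illustration, since the schema is not documented yet.

```typescript
// Local copy of the fields shown in the loadView.ts snippet.
interface FileSetDescription {
    fileNamePattern: string;
    schemaFile: string | null;
    schema: object | null;
    headerRow: boolean;
    name: string;
    fileKind: string;
}

// Hypothetical shape of one entry in the JSON file.
interface MenuEntry {
    text: string;
    help: string;
    files: FileSetDescription;
}

// Hypothetical parser: validates the required fields and fills defaults.
function parseMenuEntries(json: string): MenuEntry[] {
    const parsed: unknown = JSON.parse(json);
    if (!Array.isArray(parsed)) {
        throw new Error("expected a JSON array of menu entries");
    }
    return parsed.map((e: any, i: number) => {
        const f = e !== null && typeof e === "object" ? e.files : undefined;
        if (typeof e.text !== "string" || f === null || typeof f !== "object" ||
            typeof f.fileNamePattern !== "string" || typeof f.fileKind !== "string") {
            throw new Error("menu entry " + i + " is missing required fields");
        }
        const files: FileSetDescription = {
            fileNamePattern: f.fileNamePattern,
            schemaFile: typeof f.schemaFile === "string" ? f.schemaFile : null,
            schema: null,  // inline schemas are not handled in this sketch
            headerRow: f.headerRow === true,
            name: typeof f.name === "string" ? f.name : e.text,
            fileKind: f.fileKind,
        };
        return { text: e.text, help: typeof e.help === "string" ? e.help : "", files: files };
    });
}

// The flights entry from the snippet, written as JSON.
const sampleJson = `[{
    "text": "Flights (15 columns, CSV)",
    "help": "The US flights dataset.",
    "files": {
        "fileNamePattern": "data/ontime/????_*.csv*",
        "schemaFile": "short.schema",
        "headerRow": true,
        "name": "Flights (15 columns)",
        "fileKind": "csv"
    }
}]`;
console.log(parseMenuEntries(sampleJson)[0].files.fileNamePattern);
```

Each parsed entry could then be turned into a menu item whose action passes `entry.files` to `this.init.loadFiles`, in the same way the hard-coded entry does.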

mihaibudiu added the question label May 20, 2021

pradeepgaur closed this as completed May 24, 2021
