eda.create_report: page design prototype #171

Closed
eutialia opened this issue May 28, 2020 · 12 comments · Fixed by #202
Labels: module: EDA, type: enhancement (New feature or request)

Comments

@eutialia
Member

eutialia commented May 28, 2020

We have added stats info to our plot function, so we can now use all of that information to generate an HTML page for our users.

I prototyped this layout without adding any actual plots, so we can change the design easily.
Every element maps 1:1 to a definition in our current code, so I believe this gives a better idea of what the webpage would look like. The width of the page is 1920px.

[Image: Screen Shot 2020-05-28 at 15 12 07]

I will put the prototype here if anyone needs a more detailed inspection. Let me know if you have any suggestions. @jnwang @jinglinpeng @dovahcrow @Waterpine @brandonlockhart
https://www.figma.com/file/txfQwkocxBOFOilPvaI9MC/Untitled?node-id=0%3A1

@jnwang

jnwang commented May 29, 2020

Looks very cool!!

A couple of comments:

  1. Can we create an outline so that the user can quickly go to the part of the report that she is interested in? See https://github.com/mstaniak/autoEDA-resources/blob/master/autoEDA-paper/plots/dlookr/dlookr_eda.pdf for example.
  2. Will it be a long report or will you create five tags (Overview, Variable, Interaction, Correlation, Missing)?
  3. Please check whether there is any useful information that can be seen in pandas-profiling but not in our report. If so, please consider adding it.
  4. For the "Variables" section, currently, we have 4 plots for numerical and 2 plots for categorical. What if we add more plots? Can the current design easily support it?
  5. We will have more plots for missing (Enhance eda.plot_missing: add heatmap and dendrogram #168)
  6. Will the report have a nice layout on mobile?

@brandonlockhart

Good job Ling! For the Overview section, if there are a lot of variables there will be a lot of empty space under the statistics. I think it would be better to put the stats on top of the plots and make the plots larger.

I'm concerned that if we put all of our content into one report, then it will take longer than pandas-profiling to produce. @jnwang if one of our selling points is speed, is it imperative that we don't take longer than pandas-profiling to produce the report? We could have one report that contains basically the same content as pandas-profiling that is fast to produce, and another with all of our visualizations.

@jnwang

jnwang commented May 29, 2020

@brandonlockhart this is a good point. In my opinion, we need to make the report easy to config and define a couple of default configuration files (e.g., "pandas-profiling", "full", "minimal") for the user to select from. We can set it to "pandas-profiling" by default.

@eutialia
Member Author

> Looks very cool!!
>
> A couple of comments:
>
>   1. Can we create an outline so that the user can quickly go to the part of the report that she is interested in? See https://github.com/mstaniak/autoEDA-resources/blob/master/autoEDA-paper/plots/dlookr/dlookr_eda.pdf for example.
>   2. Will it be a long report or will you create five tags (Overview, Variable, Interaction, Correlation, Missing)?
>   3. Please check whether there is any useful information that can be seen in pandas-profiling but not in our report. If so, please consider adding it.
>   4. For the "Variables" section, currently, we have 4 plots for numerical and 2 plots for categorical. What if we add more plots? Can the current design easily support it?
>   5. We will have more plots for missing (Enhance eda.plot_missing: add heatmap and dendrogram #168)
>   6. Will the report have a nice layout on mobile?

My answers:

  1. Yes, we can add a navigation bar at the top.
  2. The length of the Variable section depends on the number of variables in the dataset; the length of the rest of the sections should be fixed.
  3. Sure.
  4. Yes, it's easy to add more plots to the report. So far I have stuck to what the plot(df, x) method generates.
  5. This will be in our report once it's finished.
  6. I'm not too sure if it's going to be nice, but I think we can optimize it to look informative on mobile devices.

@eutialia
Member Author

The first version is now public to everyone: https://vigilant-nobel-679808.netlify.app/

I hope we can first settle whether the structure of the report is clear enough, and then move on to each section.

Known problem: plots in the Overview section won't show on mobile devices because the screen doesn't meet the minimum space required by bokeh.

Let me know if you have any suggestions. @jnwang @jinglinpeng @dovahcrow @Waterpine @brandonlockhart @Sanjana12111994 @dylanzxc

@brandonlockhart

I really like it! My comments/questions:

  1. In the Variables section, for each variable, I think we should show the overview stats on the left and either the histogram or bar chart on the right - like pandas-profiling. And a button that toggles the detailed statistics tables underneath (like you currently have for the plots). In my opinion, this section currently has too much text and I don't think users will always care about the detailed stats.
  2. How will the column layout be in the Interactions section if the number of columns is very large (like 100 columns)?
  3. I think the plot(df) output might not be needed in the Overview section since we show its contents in the variables section. Or we could have a button to toggle these plots in the Overview section, if people want to see the column distributions in a grid.
  4. I wonder if we should put the correlation matrices in tabs, so that it's easier to add more in the future.
  5. How long does it take for us and pandas-profiling to generate this report?

@jinglinpeng
Contributor

jinglinpeng commented Jun 16, 2020

Same as Brandon, I think the plots in the Overview section are not necessary. Maybe they could be replaced with auto-insight once the auto-insight feature is done. For the Variable section, it would be better to show the histogram plus basic statistics, and fold the detailed statistics together with the other plots by default.

Besides, could we add the functionality of plot_correlation(df, x) to the Correlation section? It is useful when the user wants to know which features are correlated with the label column. It could be interactive, and the output may look like:
[Image: Picture1]
(In this example the user wants to find the top 10 columns most correlated with the Age column)
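
The ranking idea behind that request can be sketched in plain Python. This is only an illustration of "top k columns most correlated with a target column", not dataprep's plot_correlation implementation; the helper names `pearson` and `top_correlated`, the dict-of-lists data layout, and the sample values are all assumptions made for this example.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant column; treat it as 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)


def top_correlated(columns, target, k=10):
    """Return up to k (column, r) pairs ranked by |r| against the target column.

    `columns` is a hypothetical stand-in for a DataFrame: a dict that maps
    column names to lists of numbers.
    """
    target_values = columns[target]
    scores = {
        name: pearson(values, target_values)
        for name, values in columns.items()
        if name != target
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)[:k]


# Made-up sample data, for illustration only.
data = {
    "Age": [22, 38, 26, 35, 54],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86],
    "Pclass": [3, 1, 3, 1, 1],
    "SibSp": [1, 1, 0, 1, 0],
}

for name, r in top_correlated(data, "Age", k=2):
    print(f"{name}: {r:+.3f}")
```

Ranking by absolute value keeps strongly negative correlations in the list, which is usually what a user looking for related features wants.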

@eutialia
Member Author

> I really like it! My comments/questions:
>
>   1. In the Variables section, for each variable, I think we should show the overview stats on the left and either the histogram or bar chart on the right - like pandas-profiling. And a button that toggles the detailed statistics tables underneath (like you currently have for the plots). In my opinion, this section currently has too much text and I don't think users will always care about the detailed stats.
>   2. How will the column layout be in the Interactions section if the number of columns is very large (like 100 columns)?
>   3. I think the plot(df) output might not be needed in the Overview section since we show its contents in the variables section. Or we could have a button to toggle these plots in the Overview section, if people want to see the column distributions in a grid.
>   4. I wonder if we should put the correlation matrices in tabs, so that it's easier to add more in the future.
>   5. How long does it take for us and pandas-profiling to generate this report?

My answers:

  1. To achieve this layout, we need to rewrite the render module for the create_report function, which will give us more flexibility.
  2. The buttons will wrap onto multiple lines.
  3. Maybe we should remove those plots.
  4. Sounds good.
  5. dataprep takes about 9 sec and pandas-profiling takes about 11 sec.
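
A timing comparison like the one in answer 5 could be reproduced with a small helper along these lines. This is a minimal sketch: `time_call` and `build_report_stub` are hypothetical names introduced here, and the stub only stands in for a real report call so that the snippet runs without either library installed.

```python
import time


def time_call(func, *args, repeats=3, **kwargs):
    """Return the best wall-clock time in seconds over `repeats` runs of func."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


def build_report_stub(n_rows):
    """Stand-in workload; replace with the real report-generation call."""
    return sum(i * i for i in range(n_rows))


elapsed = time_call(build_report_stub, 100_000)
print(f"stub report: {elapsed:.3f} s")
```

To time the real thing, one would pass something like `lambda: create_report(df)` as `func`, run the same measurement for the pandas-profiling equivalent, and use the same DataFrame and machine for both.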

@jnwang

jnwang commented Jun 17, 2020

This version looks much better! Great job! @eutialia

I just want to comment on the plot size. The size of each plot looks pretty big on my screen, so I have to scroll up and down a lot to see everything. It would be better to make them a bit smaller.

@brandonlockhart

@jinglinpeng I like this feature. However, I think we should add it to the "full" configuration from @jnwang's comment above. I think @eutialia is currently making the pandas-profiling config (which I was thinking would be the default for create_report(df)), for which, in my opinion, we should do effectively what pandas-profiling does. I think that being much faster, combined with our interactive visualizations, will convince people to use us. What do you think @jinglinpeng?

@jinglinpeng
Contributor

@brandonlockhart I see. Yes, I agree.

@brandonlockhart

As a plan to finish the report feature, I propose we

  1. Use Linghao's report as the default configuration
  2. Create a "pandas-profiling" configuration. I think this will just require removing the plot(df, x) plots from the Variables section. The purpose of this is to show that we are much faster than pandas-profiling at creating the same report.
  3. Create a "minimal" configuration for large datasets: I propose we remove the "toggle details" content from the Variables section and only calculate Pearson correlation.

Please let me know if you have any comments.
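
One way the three configurations in this plan could be represented is sketched below. Everything here is hypothetical and not dataprep's actual API: the `ReportConfig` class, its field names, and the `get_config` helper are invented for illustration. The correlation methods listed for the non-minimal presets, and the choice to drop the per-variable plots in "minimal", are also assumptions that the plan above does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportConfig:
    """Hypothetical settings for one report preset (not dataprep's API)."""

    variable_plots: bool   # include the plot(df, x) plots in the Variables section
    toggle_details: bool   # include the "toggle details" content per variable
    correlations: tuple    # names of the correlation methods to compute


# Assumption: the non-minimal presets compute three correlation methods.
ALL_CORRELATIONS = ("pearson", "spearman", "kendall")

PRESETS = {
    # 1. The current report as the default configuration.
    "default": ReportConfig(True, True, ALL_CORRELATIONS),
    # 2. Same report without the plot(df, x) plots in the Variables section.
    "pandas-profiling": ReportConfig(False, True, ALL_CORRELATIONS),
    # 3. For large datasets: no "toggle details" content, Pearson only.
    #    Dropping the per-variable plots as well is an assumption.
    "minimal": ReportConfig(False, False, ("pearson",)),
}


def get_config(name="default"):
    """Look up a preset by name and fail loudly on unknown names."""
    try:
        return PRESETS[name]
    except KeyError:
        raise ValueError(
            f"unknown report config {name!r}; choose from {sorted(PRESETS)}"
        ) from None


print(get_config("minimal"))
```

Keeping each preset as an immutable value makes it straightforward to add a "full" preset later without touching the rendering code.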

@brandonlockhart brandonlockhart added this to the v0.2.8 milestone Jul 3, 2020
devinllu pushed a commit to devinllu/dataprep that referenced this issue Nov 9, 2021
Usage:
  >> from dataprep.eda import create_report
  >> create_report(df)

Resolve sfu-db#171