Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(eda): add stat. in plot_missing #385

Merged
merged 1 commit into from Oct 22, 2020

Conversation

yuzhenmao
Copy link
Contributor

Description

Fixes #367 - EDA.plot_missing: enrich with stat. I only implemented it under the situation: plot_missing(df)

How Has This Been Tested?

manually

Snapshots:

image

Checklist:

  • My code follows the style guidelines of this project
  • I have already squashed the commits and make the commit message conform to the project standard.
  • I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

@jnwang
Copy link

jnwang commented Oct 10, 2020

Great job!! @yuzhenmao This is a good starting point. I have three suggestions for your reference.

  1. Try to make our description of stats consistent with other plot functions. For example, the first row in your table is called Missing Cells in plot(df).

  2. Try to think about what other stats can be added. You can take a look at stats in plot(df, x). For example, we can add the average number of missing cells per column or per row.

  3. The third row in your table is useful but I feel like it should be an insight rather than a stat. I suggest we add the top-k missing columns as an insight of the Bar chart.

@yuzhenmao
Copy link
Contributor Author

For sure! I will continue to improve it.

@yuzhenmao
Copy link
Contributor Author

Hi professor @jnwang , I have implemented your suggestions, here are the results:

@6HU ``_54ABU%FB)DV${T

5J{Z2OU{C)BCE(8J1M`$8O8

The stats information is now consistent with "plot(df)", but I also tried another version:

SGIB7_QF G5A)7@@8A~~Q$3

I don't know which version is better, the red one or the black one?

@yuzhenmao yuzhenmao force-pushed the missing_stats branch 2 times, most recently from fabeac8 to 5c55343 Compare October 11, 2020 11:39
@yuzhenmao
Copy link
Contributor Author

Additionally, I only implemented "top-k, k==1" situation. I am afraid if there is only one column in that dataframe, k > 1 will cause an error. I am wandering if there are other ideas about this.

@dovahcrow
Copy link
Member

Maybe we should give col and raw a good name in the insight?

@yuzhenmao
Copy link
Contributor Author

How about:
col[1] has the most missing cell
row[2,1] has the most missing cell

@jinglinpeng
Copy link
Contributor

jinglinpeng commented Oct 12, 2020

How about:
col[1] has the most missing cell
row[2,1] has the most missing cell

Good job @yuzhenmao ! How about we show the column name and row id? E.g., 'Age contains the most missing values' and 'row 10 contains the most missing attributes'?

Besides, could you also capitalize the first letters of the text as in plot(df)? E.g., 'missing cells' -> 'Missing Cells'.

For the layout choices, are the difference of red one and black one just color? I personally prefer the black one. The red color in plot(df) stat is used to highlight the information such that user is aware of some characteristics of the data, e.g., it contains missing values. But here all information is about missing values, it seems nothing need to be highlighted. Please let me know if you have a different option. @dovahcrow @jnwang .

@dovahcrow
Copy link
Member

row 10 contains the most missing attributes

Will it be too many if lots of rows have the same amount of missing values?

@jinglinpeng
Copy link
Contributor

row 10 contains the most missing attributes

Will it be too many if lots of rows have the same amount of missing values?

I think we should just show limited number of rows. Say 1 or 2 row. How about display information like row 10, 12, ... contains the most missing attributes (3).

@yuzhenmao
Copy link
Contributor Author

I have no idea now... Let's discuss this in this week's meeting!

@yuzhenmao
Copy link
Contributor Author

yuzhenmao commented Oct 15, 2020

I have some ideas:

  1. Since the tab_script.html will assign strings having "Missing" to red, so if we don't change tab_script.html, it can only be red+capitalized or black+lowercase
  2. I prefer using "..." if there are many top-1 missing columns/rows. But for showing columns' name, we don't have enough space and I think it is not necessary to show column name in plot_missing. Maybe other functions can do this more perfectly.

@jinglinpeng
Copy link
Contributor

I have some ideas:

  1. Since the tab_script.html will assign strings having "Missing" to red, so if we don't change tab_script.html, it can only be red+capitalized or black+lowercase
  2. I prefer using "..." if there are many top-1 missing columns/rows. But for showing columns' name, we don't have enough space and I think it is not necessary to show column name in plot_missing. Maybe other functions can do this more perfectly.

IC. For the first issue you do not need to change it in your side. Just keep red + capitalized. For the column name, I still think it is necessary since it is more intuitive than column id. If the name is long, we can truncate it as we did for the long label in plot(df). E.g., If 10 columns with long name both contain the most missing values, we can show 10 columns (abcde..., defg..., ...) contain the most missing values with missing rate 60%. Please let me know your opinions. @yuzhenmao @dovahcrow @jnwang

@yuzhenmao
Copy link
Contributor Author

@jinglinpeng I think your suggestions make sense, I will try to implement them. Thanks!

@yuzhenmao
Copy link
Contributor Author

I have fixed the issue and now it can show the column names. If the length of name is larger than 5, it will use '...', like this:
image

@yuzhenmao yuzhenmao force-pushed the missing_stats branch 2 times, most recently from 96777be to bcdbe07 Compare October 18, 2020 14:35
@yuzhenmao
Copy link
Contributor Author

Now it is like this:
O RCL)0Z0V4HLA 0FJYI~(6

@yuzhenmao yuzhenmao force-pushed the missing_stats branch 11 times, most recently from 67fc594 to 4d30bfc Compare October 20, 2020 02:59
@yuzhenmao yuzhenmao force-pushed the missing_stats branch 4 times, most recently from b8b4f04 to e0ce491 Compare October 20, 2020 06:20
@codecov
Copy link

codecov bot commented Oct 20, 2020

Codecov Report

Merging #385 into develop will increase coverage by 0.24%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #385      +/-   ##
===========================================
+ Coverage    69.68%   69.93%   +0.24%     
===========================================
  Files           64       64              
  Lines         4599     4640      +41     
===========================================
+ Hits          3205     3245      +40     
- Misses        1394     1395       +1     
Impacted Files Coverage Δ
dataprep/eda/missing/compute/nullivariate.py 100.00% <100.00%> (ø)
dataprep/eda/missing/render.py 96.24% <100.00%> (+0.01%) ⬆️
dataprep/eda/distribution/render.py 68.09% <0.00%> (-0.13%) ⬇️
dataprep/eda/distribution/compute/overview.py 93.29% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b59faa7...0f44f15. Read the comment docs.

@@ -14,42 +14,115 @@
from ...intermediate import Intermediate
from ...staged import staged

__most__ = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The parameter name is a little confusing. Could you come up with a more clear name? Besides, please try to avoid use global variables as parameters. I think these two parameters could be put into plot_missing function and then passed to other sub-functions. This part need to be handled by parameter management in the future.

+ "-col(s) "
+ str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + ")")
)
else:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part is a little redundant. There are some suggestions.

  1. abbr(df.columns[e]) for e in most_col[2] may be write as df.columns[most_col[2]].map(abbr), which looks simpler.
  2. you could use a variable to save the suffix. E.g., suffix = "" if most_col[0]<=__most__ else ", ...". Then the last str are str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + suffix + ")") for both conditions and top_miss_col only need to be write once.

The same comments also hold for processing of top_miss_row

return cnt, rate, rst


def missing_most_row(df: DataArray) -> Tuple[int, float, List[Any]]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add the meaning of input and output parameter in the docstring? E.g., is the input df a row or a dataframe; and the meaning of the three output values. The same comments also hold for missing_most_col function.

@jinglinpeng
Copy link
Contributor

@yuzhenmao Good job Yuzhen! I left some comments in the code. I think current PR could be merged after you fix them. But it requires further optimizations in terms of efficiency in the future. E.g., some computations may be reused; and current dask.compute is called inside the function like missing_most_col (not sure why it is faster but we can figure it out later).

@yuzhenmao
Copy link
Contributor Author

@jinglinpeng Thanks Jinglin! I will fix these ASAP

@@ -18,38 +18,103 @@
def _compute_missing_nullivariate(df: DataArray, bins: int) -> Generator[Any, Any, Intermediate]:
"""Calculate the data for visualizing the plot_missing(df).
This contains the missing spectrum, missing bar chart and missing heatmap."""
# pylint: disable=too-many-locals

most_show = 5
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you write a comment about the meaning of the two parameters here? Other parts looks good to me.

@jinglinpeng
Copy link
Contributor

LGTM. Could you do a rebase before the merge?

@yuzhenmao yuzhenmao force-pushed the missing_stats branch 4 times, most recently from 95ef502 to 7591a02 Compare October 22, 2020 04:18
@jinglinpeng jinglinpeng merged commit 2db8ee1 into sfu-db:develop Oct 22, 2020
devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021
feat(eda): add stat. in plot_missing
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EDA.plot_missing: enrich with stat.
4 participants