feat(eda): add stat. in plot_missing #385

yuzhenmao · 2020-10-10T09:49:30Z

Description

Fixes #367 - EDA.plot_missing: enrich with stat. I only implemented it under the situation: plot_missing(df)

How Has This Been Tested?

manually

Snapshots:

Checklist:

jnwang · 2020-10-10T15:58:15Z

Great job!! @yuzhenmao This is a good starting point. I have three suggestions for your reference.

Try to make our description of stats consistent with other plot functions. For example, the first row in your table is called Missing Cells in plot(df).
Try to think about what other stats can be added. You can take a look at stats in plot(df, x). For example, we can add the average number of missing cells per column or per row.
The third row in your table is useful but I feel like it should be an insight rather than a stat. I suggest we add the top-k missing columns as an insight of the Bar chart.

yuzhenmao · 2020-10-10T16:40:03Z

For sure! I will continue to improve it.

yuzhenmao · 2020-10-11T08:45:59Z

Hi professor @jnwang , I have implemented your suggestions, here are the results:

$@6HU ``_54ABU%FB)DV${T$

The stats information is now consistent with "plot(df)", but I also tried another version:

I don't know which version is better, the red one or the black one?

yuzhenmao · 2020-10-11T12:13:32Z

Additionally, I only implemented "top-k, k==1" situation. I am afraid if there is only one column in that dataframe, k > 1 will cause an error. I am wandering if there are other ideas about this.

dovahcrow · 2020-10-11T21:39:12Z

Maybe we should give col and raw a good name in the insight?

yuzhenmao · 2020-10-12T02:29:10Z

How about:
col[1] has the most missing cell
row[2,1] has the most missing cell

jinglinpeng · 2020-10-12T16:33:36Z

How about:
col[1] has the most missing cell
row[2,1] has the most missing cell

Good job @yuzhenmao ! How about we show the column name and row id? E.g., 'Age contains the most missing values' and 'row 10 contains the most missing attributes'?

Besides, could you also capitalize the first letters of the text as in plot(df)? E.g., 'missing cells' -> 'Missing Cells'.

For the layout choices, are the difference of red one and black one just color? I personally prefer the black one. The red color in plot(df) stat is used to highlight the information such that user is aware of some characteristics of the data, e.g., it contains missing values. But here all information is about missing values, it seems nothing need to be highlighted. Please let me know if you have a different option. @dovahcrow @jnwang .

dovahcrow · 2020-10-12T19:35:21Z

row 10 contains the most missing attributes

Will it be too many if lots of rows have the same amount of missing values?

jinglinpeng · 2020-10-12T21:57:19Z

row 10 contains the most missing attributes

Will it be too many if lots of rows have the same amount of missing values?

I think we should just show limited number of rows. Say 1 or 2 row. How about display information like row 10, 12, ... contains the most missing attributes (3).

yuzhenmao · 2020-10-13T15:20:50Z

I have no idea now... Let's discuss this in this week's meeting!

yuzhenmao · 2020-10-15T07:52:59Z

I have some ideas:

Since the tab_script.html will assign strings having "Missing" to red, so if we don't change tab_script.html, it can only be red+capitalized or black+lowercase
I prefer using "..." if there are many top-1 missing columns/rows. But for showing columns' name, we don't have enough space and I think it is not necessary to show column name in plot_missing. Maybe other functions can do this more perfectly.

jinglinpeng · 2020-10-15T22:40:49Z

I have some ideas:

Since the tab_script.html will assign strings having "Missing" to red, so if we don't change tab_script.html, it can only be red+capitalized or black+lowercase

I prefer using "..." if there are many top-1 missing columns/rows. But for showing columns' name, we don't have enough space and I think it is not necessary to show column name in plot_missing. Maybe other functions can do this more perfectly.

IC. For the first issue you do not need to change it in your side. Just keep red + capitalized. For the column name, I still think it is necessary since it is more intuitive than column id. If the name is long, we can truncate it as we did for the long label in plot(df). E.g., If 10 columns with long name both contain the most missing values, we can show 10 columns (abcde..., defg..., ...) contain the most missing values with missing rate 60%. Please let me know your opinions. @yuzhenmao @dovahcrow @jnwang

yuzhenmao · 2020-10-16T09:03:57Z

@jinglinpeng I think your suggestions make sense, I will try to implement them. Thanks!

yuzhenmao · 2020-10-18T06:39:01Z

I have fixed the issue and now it can show the column names. If the length of name is larger than 5, it will use '...', like this:

yuzhenmao · 2020-10-18T14:38:35Z

Now it is like this:

codecov · 2020-10-20T06:27:13Z

Codecov Report

Merging #385 into develop will increase coverage by 0.24%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##           develop     #385      +/-   ##
===========================================
+ Coverage    69.68%   69.93%   +0.24%     
===========================================
  Files           64       64              
  Lines         4599     4640      +41     
===========================================
+ Hits          3205     3245      +40     
- Misses        1394     1395       +1

Impacted Files	Coverage Δ
dataprep/eda/missing/compute/nullivariate.py	`100.00% <100.00%> (ø)`
dataprep/eda/missing/render.py	`96.24% <100.00%> (+0.01%)`	⬆️
dataprep/eda/distribution/render.py	`68.09% <0.00%> (-0.13%)`	⬇️
dataprep/eda/distribution/compute/overview.py	`93.29% <0.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b59faa7...0f44f15. Read the comment docs.

jinglinpeng · 2020-10-21T08:10:40Z

dataprep/eda/missing/compute/nullivariate.py

@@ -14,42 +14,115 @@
 from ...intermediate import Intermediate
 from ...staged import staged

+__most__ = 5


The parameter name is a little confusing. Could you come up with a more clear name? Besides, please try to avoid use global variables as parameters. I think these two parameters could be put into plot_missing function and then passed to other sub-functions. This part need to be handled by parameter management in the future.

jinglinpeng · 2020-10-21T08:34:17Z

dataprep/eda/missing/compute/nullivariate.py

+            + "-col(s) "
+            + str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + ")")
+        )
+    else:


This part is a little redundant. There are some suggestions.

abbr(df.columns[e]) for e in most_col[2] may be write as df.columns[most_col[2]].map(abbr), which looks simpler.

you could use a variable to save the suffix. E.g., suffix = "" if most_col[0]<=__most__ else ", ...". Then the last str are str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + suffix + ")") for both conditions and top_miss_col only need to be write once.

The same comments also hold for processing of top_miss_row

jinglinpeng · 2020-10-21T08:39:38Z

dataprep/eda/missing/compute/nullivariate.py

+    return cnt, rate, rst
+
+
+def missing_most_row(df: DataArray) -> Tuple[int, float, List[Any]]:


Could you add the meaning of input and output parameter in the docstring? E.g., is the input df a row or a dataframe; and the meaning of the three output values. The same comments also hold for missing_most_col function.

jinglinpeng · 2020-10-21T08:49:11Z

@yuzhenmao Good job Yuzhen! I left some comments in the code. I think current PR could be merged after you fix them. But it requires further optimizations in terms of efficiency in the future. E.g., some computations may be reused; and current dask.compute is called inside the function like missing_most_col (not sure why it is faster but we can figure it out later).

yuzhenmao · 2020-10-21T08:51:46Z

@jinglinpeng Thanks Jinglin! I will fix these ASAP

jinglinpeng · 2020-10-21T16:21:14Z

dataprep/eda/missing/compute/nullivariate.py

@@ -18,38 +18,103 @@
 def _compute_missing_nullivariate(df: DataArray, bins: int) -> Generator[Any, Any, Intermediate]:
    """Calculate the data for visualizing the plot_missing(df).
    This contains the missing spectrum, missing bar chart and missing heatmap."""
+    # pylint: disable=too-many-locals
+
+    most_show = 5


Could you write a comment about the meaning of the two parameters here? Other parts looks good to me.

jinglinpeng · 2020-10-21T18:10:44Z

LGTM. Could you do a rebase before the merge?

feat(eda): add stat. in plot_missing

yuzhenmao force-pushed the missing_stats branch 2 times, most recently from fabeac8 to 5c55343 Compare October 11, 2020 11:39

yuzhenmao force-pushed the missing_stats branch 2 times, most recently from 96777be to bcdbe07 Compare October 18, 2020 14:35

yuzhenmao force-pushed the missing_stats branch 11 times, most recently from 67fc594 to 4d30bfc Compare October 20, 2020 02:59

yuzhenmao force-pushed the missing_stats branch 4 times, most recently from b8b4f04 to e0ce491 Compare October 20, 2020 06:20

jinglinpeng reviewed Oct 21, 2020

View reviewed changes

yuzhenmao force-pushed the missing_stats branch from e0ce491 to 98eae8e Compare October 21, 2020 16:11

jinglinpeng reviewed Oct 21, 2020

View reviewed changes

yuzhenmao force-pushed the missing_stats branch from 98eae8e to 86dc64f Compare October 21, 2020 16:30

yuzhenmao force-pushed the missing_stats branch 4 times, most recently from 95ef502 to 7591a02 Compare October 22, 2020 04:18

feat(eda): add stat. in plot_missing

0f44f15

yuzhenmao force-pushed the missing_stats branch from 7591a02 to 0f44f15 Compare October 22, 2020 04:51

jinglinpeng approved these changes Oct 22, 2020

View reviewed changes

jinglinpeng merged commit 2db8ee1 into sfu-db:develop Oct 22, 2020

devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021

Merge pull request sfu-db#385 from yuzhenmao/missing_stats

96c8190

feat(eda): add stat. in plot_missing

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(eda): add stat. in plot_missing #385

feat(eda): add stat. in plot_missing #385

yuzhenmao commented Oct 10, 2020

jnwang commented Oct 10, 2020 •

edited

yuzhenmao commented Oct 10, 2020

yuzhenmao commented Oct 11, 2020

yuzhenmao commented Oct 11, 2020

dovahcrow commented Oct 11, 2020

yuzhenmao commented Oct 12, 2020

jinglinpeng commented Oct 12, 2020 •

edited

dovahcrow commented Oct 12, 2020

jinglinpeng commented Oct 12, 2020

yuzhenmao commented Oct 13, 2020

yuzhenmao commented Oct 15, 2020 •

edited

jinglinpeng commented Oct 15, 2020

yuzhenmao commented Oct 16, 2020

yuzhenmao commented Oct 18, 2020

yuzhenmao commented Oct 18, 2020

codecov bot commented Oct 20, 2020 •

edited

jinglinpeng Oct 21, 2020

jinglinpeng Oct 21, 2020

jinglinpeng Oct 21, 2020

jinglinpeng commented Oct 21, 2020

yuzhenmao commented Oct 21, 2020

jinglinpeng Oct 21, 2020

jinglinpeng commented Oct 21, 2020

		return cnt, rate, rst


		def missing_most_row(df: DataArray) -> Tuple[int, float, List[Any]]:

feat(eda): add stat. in plot_missing #385

feat(eda): add stat. in plot_missing #385

Conversation

yuzhenmao commented Oct 10, 2020

Description

How Has This Been Tested?

Snapshots:

Checklist:

jnwang commented Oct 10, 2020 • edited

yuzhenmao commented Oct 10, 2020

yuzhenmao commented Oct 11, 2020

yuzhenmao commented Oct 11, 2020

dovahcrow commented Oct 11, 2020

yuzhenmao commented Oct 12, 2020

jinglinpeng commented Oct 12, 2020 • edited

dovahcrow commented Oct 12, 2020

jinglinpeng commented Oct 12, 2020

yuzhenmao commented Oct 13, 2020

yuzhenmao commented Oct 15, 2020 • edited

jinglinpeng commented Oct 15, 2020

yuzhenmao commented Oct 16, 2020

yuzhenmao commented Oct 18, 2020

yuzhenmao commented Oct 18, 2020

codecov bot commented Oct 20, 2020 • edited

Codecov Report

jinglinpeng Oct 21, 2020

Choose a reason for hiding this comment

jinglinpeng Oct 21, 2020

Choose a reason for hiding this comment

jinglinpeng Oct 21, 2020

Choose a reason for hiding this comment

jinglinpeng commented Oct 21, 2020

yuzhenmao commented Oct 21, 2020

jinglinpeng Oct 21, 2020

Choose a reason for hiding this comment

jinglinpeng commented Oct 21, 2020

jnwang commented Oct 10, 2020 •

edited

jinglinpeng commented Oct 12, 2020 •

edited

yuzhenmao commented Oct 15, 2020 •

edited

codecov bot commented Oct 20, 2020 •

edited