feat(eda): add stat. in plot_missing #385
Great job!! @yuzhenmao This is a good starting point. I have three suggestions for your reference.
For sure! I will continue to improve it.
Hi Professor @jnwang, I have implemented your suggestions; here are the results: The stats information is now consistent with "plot(df)", but I also tried another version: I don't know which version is better, the red one or the black one?
Force-pushed from fabeac8 to 5c55343
Additionally, I only implemented the "top-k, k == 1" case. I am afraid that if the dataframe has only one column, k > 1 will cause an error. I am wondering if there are other ideas about this.
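One possible way around the k > 1 concern, as a rough sketch only: `heapq.nlargest` returns at most as many items as the input has, so asking for more columns than exist does not raise. The function name `top_k_missing` and the dict of per-column missing counts are assumptions made for illustration, not code from this PR.

```python
import heapq
from typing import Dict, List, Tuple


def top_k_missing(missing_by_col: Dict[str, int], k: int) -> List[Tuple[str, int]]:
    """Return up to k (column, missing count) pairs, largest count first.

    Hypothetical helper: heapq.nlargest never returns more items than the
    input contains, so k may safely exceed the number of columns.
    """
    return heapq.nlargest(k, missing_by_col.items(), key=lambda item: item[1])


# A single-column table with k = 3 simply yields that one column.
print(top_k_missing({"Age": 4}, 3))
```

With this shape, the k == 1 case and the k > 1 case go through the same code path.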
Maybe we should give col and row a good name in the insight?
How about:
Good job @yuzhenmao! How about we show the column name and row id? E.g., 'Age contains the most missing values' and 'row 10 contains the most missing attributes'? Besides, could you also capitalize the first letters of the text as in For the layout choices, is the difference between the red one and the black one just the color? I personally prefer the black one. The red color in
Will it be too many if lots of rows have the same number of missing values?
I think we should show only a limited number of rows, say 1 or 2 rows. How about displaying information like
I have no idea now... Let's discuss this in this week's meeting!
I have some ideas:
IC. For the first issue, you do not need to change it on your side. Just keep red + capitalized. For the column name, I still think it is necessary, since it is more intuitive than the column id. If the name is long, we can truncate it as we did for the long label in
@jinglinpeng I think your suggestions make sense; I will try to implement them. Thanks!
Force-pushed from 96777be to bcdbe07
Force-pushed from 67fc594 to 4d30bfc
Force-pushed from b8b4f04 to e0ce491
Codecov Report
@@            Coverage Diff             @@
##           develop     #385     +/-   ##
===========================================
+ Coverage    69.68%   69.93%   +0.24%
===========================================
  Files           64       64
  Lines         4599     4640     +41
===========================================
+ Hits          3205     3245     +40
- Misses        1394     1395      +1
@@ -14,42 +14,115 @@
from ...intermediate import Intermediate
from ...staged import staged

__most__ = 5
The parameter name is a little confusing. Could you come up with a clearer name? Besides, please try to avoid using global variables as parameters. I think these two parameters could be put into the plot_missing function and then passed to the other sub-functions. This part needs to be handled by parameter management in the future.
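To illustrate the suggestion, here is a minimal sketch of passing the limits down from the entry point instead of reading module-level globals. Everything in it is hypothetical: `plot_missing_sketch`, `_format_insight`, and the dict input are stand-ins, `most_show` borrows the name used later in this PR, and `label_len` is a made-up placeholder because the second parameter is not visible in this excerpt.

```python
from typing import Dict, List


def _format_insight(names: List[str], most_show: int, label_len: int) -> str:
    """Sub-function: receives its limits as arguments, not as globals."""
    shown = [name[:label_len] for name in names[:most_show]]
    return ", ".join(shown)


def plot_missing_sketch(
    missing_by_col: Dict[str, int],
    most_show: int = 5,   # max number of names listed in the insight
    label_len: int = 10,  # placeholder: max characters kept per name
) -> str:
    """Entry point: owns the parameters and passes them down explicitly."""
    top = max(missing_by_col.values(), default=0)
    tied = [name for name, cnt in missing_by_col.items() if cnt == top]
    return _format_insight(tied, most_show, label_len)


print(plot_missing_sketch({"Age": 3, "Cabin": 3, "Fare": 1}, most_show=1))
```

Keeping the defaults in one signature also gives a later parameter-management layer a single place to hook into.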
+ "-col(s) "
+ str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + ")")
)
else:
This part is a little redundant. Here are some suggestions:
- `abbr(df.columns[e]) for e in most_col[2]` may be written as `df.columns[most_col[2]].map(abbr)`, which looks simpler.
- You could use a variable to save the suffix, e.g., `suffix = "" if most_col[0] <= __most__ else ", ..."`. Then the last str is `str("(" + ", ".join(abbr(df.columns[e]) for e in most_col[2]) + suffix + ")")` for both conditions, and `top_miss_col` only needs to be written once.
The same comments also hold for the processing of `top_miss_row`.
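The suffix idea can be sketched as below. This is an illustration under assumptions, not the PR's code: it works on a plain list of names rather than `df.columns`, the `abbr` shown here is a hypothetical stand-in for the PR's helper (whose real behavior is not visible in this excerpt), and `format_top_missing` is an invented name.

```python
from typing import List, Sequence


def abbr(name: str, max_len: int = 10) -> str:
    """Hypothetical stand-in for the PR's abbr helper: shorten long labels."""
    return name if len(name) <= max_len else name[: max_len - 3] + "..."


def format_top_missing(names: Sequence[str], total: int, most_show: int) -> str:
    """Build "(a, b, ...)" once for both the short and the truncated case.

    names: candidate column (or row) names, already ordered
    total: how many names tied for the most missing values
    most_show: maximum number of names to list
    """
    shown: List[str] = [abbr(name) for name in names[:most_show]]
    # The suffix is the only thing that differs between the two branches.
    suffix = "" if total <= most_show else ", ..."
    return "(" + ", ".join(shown) + suffix + ")"


print(format_top_missing(["Age", "Cabin"], total=2, most_show=5))
```

The same function would serve the row case, since nothing in it depends on whether the names are columns or rows.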
return cnt, rate, rst


def missing_most_row(df: DataArray) -> Tuple[int, float, List[Any]]:
Could you add the meaning of the input and output parameters to the docstring? E.g., is the input df a row or a dataframe, and what do the three output values mean? The same comments also hold for the missing_most_col function.
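As an example of the kind of docstring being asked for, here is a self-contained sketch. The meanings given to the three return values are guesses made for illustration; the excerpt does not confirm what `cnt`, `rate`, and `rst` actually hold in the PR. The function name and the list-of-rows input are likewise hypothetical.

```python
from typing import Any, List, Optional, Sequence, Tuple


def missing_most_row_sketch(
    rows: Sequence[Sequence[Optional[Any]]], most_show: int = 5
) -> Tuple[int, float, List[int]]:
    """Find the rows with the most missing values.

    Parameters
    ----------
    rows
        The whole table as a sequence of rows; None marks a missing cell.
    most_show
        Maximum number of row indices to return.

    Returns
    -------
    cnt
        Number of rows tied for the highest missing count (assumed meaning).
    rate
        Fraction of cells missing in each of those rows (assumed meaning).
    rst
        Indices of at most most_show of those rows (assumed meaning).
    """
    if not rows:
        return 0, 0.0, []
    counts = [sum(value is None for value in row) for row in rows]
    top = max(counts)
    tied = [idx for idx, count in enumerate(counts) if count == top]
    n_cols = len(rows[0])
    rate = top / n_cols if n_cols else 0.0
    return len(tied), rate, tied[:most_show]
```

A docstring in this shape answers both review questions: whether the input is a row or a full table, and what each element of the returned tuple means.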
@yuzhenmao Good job, Yuzhen! I left some comments in the code. I think the current PR can be merged after you fix them. But it will require further optimization for efficiency in the future. E.g., some computations may be reused; and current
@jinglinpeng Thanks, Jinglin! I will fix these ASAP.
Force-pushed from e0ce491 to 98eae8e
@@ -18,38 +18,103 @@
def _compute_missing_nullivariate(df: DataArray, bins: int) -> Generator[Any, Any, Intermediate]:
    """Calculate the data for visualizing the plot_missing(df).
    This contains the missing spectrum, missing bar chart and missing heatmap."""
    # pylint: disable=too-many-locals

    most_show = 5
Could you write a comment about the meaning of the two parameters here? The other parts look good to me.
Force-pushed from 98eae8e to 86dc64f
LGTM. Could you do a rebase before the merge?
Force-pushed from 95ef502 to 7591a02
Force-pushed from 7591a02 to 0f44f15
feat(eda): add stat. in plot_missing
Description
Fixes #367 - EDA.plot_missing: enrich with stat. I implemented it only for the plot_missing(df) case.
How Has This Been Tested?
manually
Snapshots:
Checklist: