-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(eda.plot_prediction): Implement the calculation of intermediates #13
Conversation
dataprep/eda/eda_plot_pred.py
Outdated
""" Returns the type of the input data. | ||
Identified types are according to the DataType Enumeration. | ||
|
||
Parameter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parameter and return comments are no the same form as other functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code has been moved to utils.py, and I think we should study numpy's parameter and return comments.
dataprep/eda/eda_plot_pred.py
Outdated
if pd.api.types.is_bool_dtype(data): | ||
col_type = DataType.TYPE_CAT | ||
elif pd.api.types.is_numeric_dtype(data) and dask.compute( | ||
data.dropna().unique().size) == 2: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unique values=2 -> maybe smaller than a threshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is written by shubham, maybe you should discuss it with him.
dataprep/eda/eda_plot_pred.py
Outdated
for column_name in pd_data_frame.columns.values: | ||
if get_type(pd_data_frame[column_name]) != DataType.TYPE_NUM: | ||
drop_list.append(column_name) | ||
pd_data_frame.drop(columns=drop_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make sure this just return a copy and will not affect the input dataframe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a bug. It is true that it will not affect the input data frame. However, I have to change line 70 to pd_data_frame = pd_data_frame.drop(columns=drop_list). Otherwise, the data frame will not drop non-numerical columns.
dataprep/eda/eda_plot_pred.py
Outdated
def _calc_corr( | ||
data_a: np.ndarray, | ||
data_b: np.ndarray | ||
) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not use Any
dataprep/eda/eda_plot_pred.py
Outdated
def _calc_pred_corr( | ||
pd_data_frame: pd.DataFrame, | ||
target: str | ||
) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not use Any
dataprep/eda/eda_plot_pred.py
Outdated
target: str, | ||
x_name: str, | ||
target_type: DataType | ||
) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change Any with the specific data type.
dataprep/eda/eda_plot_pred.py
Outdated
pd_data_frame[target], | ||
np.arange(min_value, | ||
max_value + 1, | ||
(max_value - min_value) / 10) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change 10 to a parameter
dataprep/eda/eda_plot_pred.py
Outdated
|
||
def _calc_scatter( | ||
intermediate: Intermediate | ||
) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change Any to the specific data type
dataprep/eda/eda_plot_miss.py
Outdated
data: np.ndarray, | ||
length: int | ||
) -> Any: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
write the doc string in the comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the doc string.
dataprep/eda/eda_plot_miss.py
Outdated
from dataprep.utils import get_type, DataType | ||
|
||
|
||
def _calc_none_sum( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change the name to _calc_nonzero_rate. Besides, whether length is necessary (it could be computed as len(data))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it could be computed as len(data). However, I think we only need to calculate the length once. So, I pass the length to this function.
dataprep/eda/eda_plot_miss.py
Outdated
def _calc_none_count( | ||
pd_data_frame: pd.DataFrame | ||
) -> Intermediate: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the doc string.
dataprep/eda/eda_plot_miss.py
Outdated
return np.count_nonzero(data) / length | ||
|
||
|
||
def _calc_none_count( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function name is confusing (none means None or nonzero?), think of a good name that related to the functionality.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
none means None. I agree with you. I will modify the function name from none to nonzero so that our library is consistent with numpy.
dataprep/eda/eda_plot_miss.py
Outdated
x_name: str, | ||
num_bins: int = 10 | ||
) -> Intermediate: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add the doc string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added the doc string.
dataprep/eda/vis_plot_miss.py
Outdated
def _vis_drop_y( # pylint: disable=too-many-locals | ||
intermediate: Intermediate | ||
) -> Tabs: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc string
dataprep/eda/vis_plot_pred.py
Outdated
def _vis_pred_corr( | ||
intermediate: Intermediate | ||
) -> Figure: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc string
dataprep/eda/vis_plot_pred.py
Outdated
# pylint: disable=too-many-statements | ||
intermediate: Intermediate | ||
) -> Figure: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc string
dataprep/eda/vis_plot_pred.py
Outdated
# pylint: disable=too-many-statements | ||
intermediate: Intermediate | ||
) -> Figure: | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add doc string
dataprep/eda/vis_plot_pred.py
Outdated
('Count', '@Count') | ||
] | ||
hover = HoverTool(tooltips=tooltips) | ||
if target_type == DataType.TYPE_NUM: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The 'if' branch is too long. Is it possible to write the different branch of 'if' to different function?
.pylintrc
Outdated
ignore-comments=yes | ||
|
||
# Ignore docstrings when computing similarities. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this so we can ensure the docstrings are properly written?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add tests to test all the possible param combinations for each user API, and squash the commits.
Sure, I will add tests as soon as possible.
dataprep/eda/eda_plot_corr.py
Outdated
|
||
|
||
def _vis_correlation_pd_x_k( # pylint: disable=too-many-locals | ||
intermediate: Intermediate | ||
) -> Figure: | ||
) -> Tabs: | ||
""" | ||
:param intermediate: An object to encapsulate the | ||
intermediate results. | ||
:return: A figure object | ||
""" | ||
result = intermediate.result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we just have a single for loop for all three correlation methods? I see a lot of redundancy here.
dataprep/eda/eda_plot_corr.py
Outdated
method_list = ['pearson', 'spearman', 'kendall'] | ||
result = {} | ||
for method in method_list: | ||
if method == 'pearson': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this long if-else condition, can we wrap the content into three functions rather than make the branches verbose?
dataprep/eda/eda_plot_corr.py
Outdated
def _discard_unused_visual_elems( | ||
fig: Figure | ||
) -> None: | ||
""" | ||
:param fig: A figure object | ||
:return: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there's no return value I think there's no need for a :return:
block
dataprep/eda/eda_plot_corr.py
Outdated
'x_name': x_name, | ||
'k': k | ||
} | ||
if value_range is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have 3 functions handling these three correlation cases? The if block is way too long for easy reading
dataprep/eda/vis_plot_miss.py
Outdated
) | ||
fig_pdf = hv.render( | ||
(pdf_origin * pdf_drop).opts( | ||
height=375, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These same configurations are used multiple times. It would be better to store them in a dict and use **params
to expand them at the call site.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, add tests to test all the possible param combinations for each user API, and squash the commits.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW please complete the pull request message.
…and plot_correlation
feat(eda.plot_prediction): Implement the calculation of intermediates
feat(eda.plot_prediction): Implement the calculation of intermediates
Description
Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.
How Has This Been Tested?
Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration
Checklist: