Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(eda.plot_prediction): Implement the calculation of intermediates #13

Merged
merged 1 commit into from
Aug 24, 2019

Conversation

Waterpine
Copy link
Contributor

@Waterpine Waterpine commented Jul 18, 2019

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Checklist:

  • My code follows the style guidelines of this project
  • I have already squashed the commits and make the commit message conform to the project standard.
  • I have already marked the commit with "BREAKING CHANGE" or "Fixes #" if needed.
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged and published in downstream modules

""" Returns the type of the input data.
Identified types are according to the DataType Enumeration.

Parameter
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parameter and return comments are no the same form as other functions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code has been moved to utils.py, and I think we should study numpy's parameter and return comments.

if pd.api.types.is_bool_dtype(data):
col_type = DataType.TYPE_CAT
elif pd.api.types.is_numeric_dtype(data) and dask.compute(
data.dropna().unique().size) == 2:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unique values=2 -> maybe smaller than a threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code is written by shubham, maybe you should discuss it with him.

for column_name in pd_data_frame.columns.values:
if get_type(pd_data_frame[column_name]) != DataType.TYPE_NUM:
drop_list.append(column_name)
pd_data_frame.drop(columns=drop_list)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure this just return a copy and will not affect the input dataframe.

Copy link
Contributor Author

@Waterpine Waterpine Jul 20, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a bug. It is true that it will not affect the input data frame. However, I have to change line 70 to pd_data_frame = pd_data_frame.drop(columns=drop_list). Otherwise, the data frame will not drop non-numerical columns.

def _calc_corr(
data_a: np.ndarray,
data_b: np.ndarray
) -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not use Any

def _calc_pred_corr(
pd_data_frame: pd.DataFrame,
target: str
) -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not use Any

target: str,
x_name: str,
target_type: DataType
) -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change Any with the specific data type.

pd_data_frame[target],
np.arange(min_value,
max_value + 1,
(max_value - min_value) / 10)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change 10 to a parameter


def _calc_scatter(
intermediate: Intermediate
) -> Any:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change Any to the specific data type

@dovahcrow
Copy link
Member

Review plot_corr

data: np.ndarray,
length: int
) -> Any:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

write the doc string in the comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the doc string.

from dataprep.utils import get_type, DataType


def _calc_none_sum(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change the name to _calc_nonzero_rate. Besides, whether length is necessary (it could be computed as len(data))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it could be computed as len(data). However, I think we only need to calculate the length once. So, I pass the length to this function.

def _calc_none_count(
pd_data_frame: pd.DataFrame
) -> Intermediate:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the doc string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the doc string.

return np.count_nonzero(data) / length


def _calc_none_count(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function name is confusing (none means None or nonzero?), think of a good name that related to the functionality.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

none means None. I agree with you. I will modify the function name from none to nonzero so that our library is consistent with numpy.

x_name: str,
num_bins: int = 10
) -> Intermediate:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add the doc string

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added the doc string.

def _vis_drop_y( # pylint: disable=too-many-locals
intermediate: Intermediate
) -> Tabs:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc string

def _vis_pred_corr(
intermediate: Intermediate
) -> Figure:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc string

# pylint: disable=too-many-statements
intermediate: Intermediate
) -> Figure:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc string

# pylint: disable=too-many-statements
intermediate: Intermediate
) -> Figure:
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add doc string

('Count', '@Count')
]
hover = HoverTool(tooltips=tooltips)
if target_type == DataType.TYPE_NUM:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'if' branch is too long. Is it possible to write the different branch of 'if' to different function?

.pylintrc Outdated
ignore-comments=yes

# Ignore docstrings when computing similarities.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we remove this so we can ensure the docstrings are properly written?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, add tests to test all the possible param combinations for each user API, and squash the commits.

Sure, I will add tests as soon as possible.



def _vis_correlation_pd_x_k( # pylint: disable=too-many-locals
intermediate: Intermediate
) -> Figure:
) -> Tabs:
"""
:param intermediate: An object to encapsulate the
intermediate results.
:return: A figure object
"""
result = intermediate.result
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we just have a single for loop for all three correlation methods? I see a lot of redundancy here.

method_list = ['pearson', 'spearman', 'kendall']
result = {}
for method in method_list:
if method == 'pearson':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this long if-else condition, can we wrap the content into three functions rather than make the branches verbose?

def _discard_unused_visual_elems(
fig: Figure
) -> None:
"""
:param fig: A figure object
:return:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there's no return value I think there's no need for a :return: block

'x_name': x_name,
'k': k
}
if value_range is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we have 3 functions handling these three correlation cases? The if block is way too long for easy reading

)
fig_pdf = hv.render(
(pdf_origin * pdf_drop).opts(
height=375,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These same configurations are used multiple times. It would be better to store them in a dict and use **params to expand them at the call site.

Copy link
Member

@dovahcrow dovahcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, add tests to test all the possible param combinations for each user API, and squash the commits.

Copy link
Member

@dovahcrow dovahcrow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW please complete the pull request message.

dovahcrow
dovahcrow previously approved these changes Aug 24, 2019
jinglinpeng
jinglinpeng previously approved these changes Aug 24, 2019
dovahcrow
dovahcrow previously approved these changes Aug 24, 2019
@dovahcrow dovahcrow merged commit 64ad5dd into master Aug 24, 2019
dovahcrow added a commit that referenced this pull request May 29, 2020
feat(eda.plot_prediction): Implement the calculation of intermediates
devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021
feat(eda.plot_prediction): Implement the calculation of intermediates
fatbuddy added a commit to fatbuddy/dataprep that referenced this pull request Feb 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants