feat(eda.plot_prediction): Implement the calculation of intermediates #13

Waterpine · 2019-07-18T18:28:14Z

Description

Please include a summary of the change and which issue is fixed. Please also include relevant motivation and context. List any dependencies that are required for this change.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration

Checklist:

jinglinpeng · 2019-07-18T20:26:02Z

dataprep/eda/eda_plot_pred.py

+    """ Returns the type of the input data.
+        Identified types are according to the DataType Enumeration.
+
+    Parameter


The parameter and return comments are not in the same form as in other functions.

This code has been moved to utils.py. I think we should study how numpy writes its parameter and return comments.
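For reference, a docstring in the numpydoc layout might look like the sketch below. The enumeration values and the body of `get_type` are placeholders for illustration only, not the code that lives in utils.py.

```python
from enum import Enum
from typing import Any, Sequence


class DataType(Enum):
    """Stand-in for the project's DataType enumeration (values are assumed)."""
    TYPE_NUM = 1
    TYPE_CAT = 2


def get_type(data: Sequence[Any]) -> DataType:
    """Return the type of the input data.

    Parameters
    ----------
    data : Sequence[Any]
        The column values to inspect.

    Returns
    -------
    DataType
        The identified type, as a member of the DataType enumeration.
    """
    # Placeholder logic: treat all-numeric (non-bool) input as numerical.
    if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in data):
        return DataType.TYPE_NUM
    return DataType.TYPE_CAT
```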

jinglinpeng · 2019-07-18T20:28:02Z

dataprep/eda/eda_plot_pred.py

+        if pd.api.types.is_bool_dtype(data):
+            col_type = DataType.TYPE_CAT
+        elif pd.api.types.is_numeric_dtype(data) and dask.compute(
+                data.dropna().unique().size) == 2:


unique values == 2 -> maybe check for smaller than a threshold instead?

This code was written by shubham; maybe you should discuss it with him.
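A minimal sketch of the threshold idea, assuming plain pandas rather than dask. The helper name `is_low_cardinality` and the `max_unique` parameter are hypothetical; the default of 2 only mirrors the literal in the diff, with the equality check relaxed to "at most".

```python
import pandas as pd


def is_low_cardinality(data: pd.Series, max_unique: int = 2) -> bool:
    """Return True if a numeric column has at most `max_unique` distinct non-null values."""
    if not pd.api.types.is_numeric_dtype(data):
        return False
    # dropna() removes missing values before the distinct values are counted.
    return data.dropna().nunique() <= max_unique
```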

jinglinpeng · 2019-07-18T22:45:50Z

dataprep/eda/eda_plot_pred.py

+    for column_name in pd_data_frame.columns.values:
+        if get_type(pd_data_frame[column_name]) != DataType.TYPE_NUM:
+            drop_list.append(column_name)
+    pd_data_frame.drop(columns=drop_list)


Make sure this just returns a copy and will not affect the input dataframe.

It is a bug. It is true that it will not affect the input data frame. However, I have to change line 70 to pd_data_frame = pd_data_frame.drop(columns=drop_list). Otherwise, the data frame will not drop non-numerical columns.
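The behaviour discussed here can be reproduced on a toy frame. The column names and values below are made up for illustration.

```python
import pandas as pd

frame = pd.DataFrame({"age": [31, 45], "city": ["Burnaby", "Surrey"]})
drop_list = ["city"]

# drop() returns a new DataFrame; the result is discarded here, so nothing changes.
frame.drop(columns=drop_list)
unchanged_columns = list(frame.columns)

# Binding the result keeps the reduced frame and leaves the original object untouched.
numeric_only = frame.drop(columns=drop_list)
```

Inside a function, rebinding the parameter name (`pd_data_frame = pd_data_frame.drop(...)`) has the same effect: the caller's frame is not modified.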

jinglinpeng · 2019-07-18T22:46:21Z

dataprep/eda/eda_plot_pred.py

+def _calc_corr(
+        data_a: np.ndarray,
+        data_b: np.ndarray
+) -> Any:


Do not use Any.
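As an illustration of a concrete return annotation: the diff does not show what `_calc_corr` returns, so both the `float` return type and the Pearson computation below are assumptions.

```python
import numpy as np


def calc_corr(data_a: np.ndarray, data_b: np.ndarray) -> float:
    """Pearson correlation of two 1-D arrays, returned as a plain float."""
    # np.corrcoef returns a 2x2 matrix; the off-diagonal entry is the coefficient.
    return float(np.corrcoef(data_a, data_b)[0, 1])
```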

jinglinpeng · 2019-07-18T22:47:52Z

dataprep/eda/eda_plot_pred.py

+def _calc_pred_corr(
+        pd_data_frame: pd.DataFrame,
+        target: str
+) -> Any:


Do not use Any.

jinglinpeng · 2019-07-19T01:29:20Z

dataprep/eda/eda_plot_pred.py

+        target: str,
+        x_name: str,
+        target_type: DataType
+) -> Any:


Replace Any with the specific data type.

jinglinpeng · 2019-07-19T01:30:23Z

dataprep/eda/eda_plot_pred.py

+                pd_data_frame[target],
+                np.arange(min_value,
+                          max_value + 1,
+                          (max_value - min_value) / 10)


change 10 to a parameter
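One way to lift the literal into a parameter, keeping the expression from the diff unchanged. The function name `make_bin_edges` is hypothetical, and the case `max_value == min_value` (a zero step) is not handled in this sketch.

```python
import numpy as np


def make_bin_edges(min_value: float, max_value: float, num_bins: int = 10) -> np.ndarray:
    """Same expression as in the diff, with the literal 10 replaced by `num_bins`."""
    step = (max_value - min_value) / num_bins
    return np.arange(min_value, max_value + 1, step)
```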

jinglinpeng · 2019-07-19T01:32:23Z

dataprep/eda/eda_plot_pred.py

+
+def _calc_scatter(
+        intermediate: Intermediate
+) -> Any:


change Any to the specific data type

dovahcrow · 2019-08-16T02:26:13Z

Review plot_corr

jinglinpeng · 2019-08-19T03:41:33Z

dataprep/eda/eda_plot_miss.py

+        data: np.ndarray,
+        length: int
+) -> Any:
+    """


write the doc string in the comment.

I have added the doc string.

jinglinpeng · 2019-08-19T03:43:19Z

dataprep/eda/eda_plot_miss.py

+from dataprep.utils import get_type, DataType
+
+
+def _calc_none_sum(


Change the name to _calc_nonzero_rate. Also, is length necessary? It could be computed as len(data).

Yes, it could be computed as len(data). However, I think we only need to calculate the length once. So, I pass the length to this function.
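A possible middle ground, sketched under the assumption that `data` is a 1-D array: keep `length` as an optional argument so that callers who already know it can pass it in, and fall back to `len(data)` otherwise.

```python
from typing import Optional

import numpy as np


def calc_nonzero_rate(data: np.ndarray, length: Optional[int] = None) -> float:
    """Fraction of nonzero entries in `data`."""
    if length is None:
        # Computed only when the caller has not supplied a precomputed length.
        length = len(data)
    return float(np.count_nonzero(data) / length)
```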

jinglinpeng · 2019-08-19T03:44:18Z

dataprep/eda/eda_plot_miss.py

+def _calc_none_count(
+        pd_data_frame: pd.DataFrame
+) -> Intermediate:
+    """


add the doc string

I have added the doc string.

jinglinpeng · 2019-08-19T03:47:54Z

dataprep/eda/eda_plot_miss.py

+    return np.count_nonzero(data) / length
+
+
+def _calc_none_count(


The function name is confusing (does none mean None or nonzero?). Think of a good name that relates to the functionality.

none means None. I agree with you. I will modify the function name from none to nonzero so that our library is consistent with numpy.

jinglinpeng · 2019-08-19T03:48:34Z

dataprep/eda/eda_plot_miss.py

+        x_name: str,
+        num_bins: int = 10
+) -> Intermediate:
+    """


add the doc string

I have added the doc string.

jinglinpeng · 2019-08-19T04:14:52Z

dataprep/eda/vis_plot_miss.py

+def _vis_drop_y(  # pylint: disable=too-many-locals
+        intermediate: Intermediate
+) -> Tabs:
+    """


add doc string

jinglinpeng · 2019-08-19T04:15:22Z

dataprep/eda/vis_plot_pred.py

+def _vis_pred_corr(
+        intermediate: Intermediate
+) -> Figure:
+    """


add doc string

jinglinpeng · 2019-08-19T04:15:47Z

dataprep/eda/vis_plot_pred.py

+                     # pylint: disable=too-many-statements
+        intermediate: Intermediate
+) -> Figure:
+    """


add doc string

jinglinpeng · 2019-08-19T04:16:41Z

dataprep/eda/vis_plot_pred.py

+                         # pylint: disable=too-many-statements
+        intermediate: Intermediate
+) -> Figure:
+    """


add doc string

jinglinpeng · 2019-08-19T04:19:00Z

dataprep/eda/vis_plot_pred.py

+        ('Count', '@Count')
+    ]
+    hover = HoverTool(tooltips=tooltips)
+    if target_type == DataType.TYPE_NUM:


The 'if' branch is too long. Is it possible to move the different branches of the 'if' into separate functions?

dovahcrow · 2019-08-19T05:40:42Z

.pylintrc

+  ignore-comments=yes
+
+  # Ignore docstrings when computing similarities.


Should we remove this so we can ensure the docstrings are properly written?

Also, add tests to test all the possible param combinations for each user API, and squash the commits.

Sure, I will add tests as soon as possible.
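A rough sketch of enumerating parameter combinations with the standard library. `plot_stub` and the parameter grid are hypothetical stand-ins; the real user APIs and their arguments are not shown in this thread.

```python
from itertools import product
from typing import Dict, Iterator, List, Optional


def plot_stub(x_name: Optional[str] = None, num_bins: int = 10) -> str:
    """Hypothetical stand-in for a user-facing plot function."""
    return f"x_name={x_name}, num_bins={num_bins}"


def param_combinations(grid: Dict[str, List[object]]) -> Iterator[Dict[str, object]]:
    """Yield one keyword-argument dict per point in the Cartesian product."""
    names = list(grid)
    for values in product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


# Every combination of the listed values is passed to the function once.
grid = {"x_name": [None, "age"], "num_bins": [5, 10]}
results = [plot_stub(**kwargs) for kwargs in param_combinations(grid)]
```

In a test suite, the same grid could feed a parametrized test so that each combination is reported separately.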

dovahcrow · 2019-08-19T05:48:19Z

dataprep/eda/eda_plot_corr.py



 def _vis_correlation_pd_x_k(  # pylint: disable=too-many-locals
        intermediate: Intermediate
-) -> Figure:
+) -> Tabs:
    """
    :param intermediate: An object to encapsulate the
    intermediate results.
    :return: A figure object
    """
    result = intermediate.result


Can we just have a single for loop for all three correlation methods? I see a lot of redundancy here.

dovahcrow · 2019-08-19T05:51:04Z

dataprep/eda/eda_plot_corr.py

+    method_list = ['pearson', 'spearman', 'kendall']
+    result = {}
+    for method in method_list:
+        if method == 'pearson':


For this long if-else condition, can we wrap the content into three functions rather than make the branches verbose?

dovahcrow · 2019-08-19T05:54:58Z

dataprep/eda/eda_plot_corr.py

 def _discard_unused_visual_elems(
        fig: Figure
 ) -> None:
+    """
+    :param fig: A figure object
+    :return:


If there's no return value I think there's no need for a :return: block

dovahcrow · 2019-08-19T05:57:44Z

dataprep/eda/eda_plot_corr.py

-        'x_name': x_name,
-        'k': k
-    }
+    if value_range is not None:


Can we have 3 functions handling these three correlation cases? The if block is way too long for easy reading

dovahcrow · 2019-08-19T06:05:28Z

dataprep/eda/vis_plot_miss.py

+        )
+        fig_pdf = hv.render(
+            (pdf_origin * pdf_drop).opts(
+                height=375,


These same configurations are used multiple times. It would be better to store them in a dict and use **params to expand them at the call site.
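The pattern could look roughly like this. `describe_opts` is a hypothetical stand-in for any call that accepts keyword options; `height=375` comes from the diff, while the `width` value is an assumed placeholder.

```python
from typing import Any, Dict

# Shared options; height=375 is from the diff, width is an assumed placeholder.
COMMON_OPTS: Dict[str, Any] = {"height": 375, "width": 325}


def describe_opts(title: str, **opts: Any) -> str:
    """Hypothetical stand-in for a call that accepts keyword options."""
    settings = ", ".join(f"{key}={value}" for key, value in sorted(opts.items()))
    return f"{title}: {settings}"


# The shared dict is expanded into keyword arguments at each call site.
pdf_summary = describe_opts("pdf", **COMMON_OPTS)
# A call site can still override one option without touching the shared dict.
cdf_summary = describe_opts("cdf", **{**COMMON_OPTS, "height": 400})
```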

dovahcrow

BTW please complete the pull request message.

jinglinpeng reviewed Jul 19, 2019

dovahcrow assigned jinglinpeng Aug 16, 2019

jinglinpeng reviewed Aug 19, 2019

dovahcrow reviewed Aug 19, 2019

dovahcrow assigned Waterpine Aug 19, 2019

dovahcrow previously approved these changes Aug 24, 2019

dovahcrow dismissed their stale review via aa04b49 August 24, 2019 08:41

dovahcrow force-pushed the plot_corr2 branch from a60f054 to aa04b49 Compare August 24, 2019 08:41

jinglinpeng previously approved these changes Aug 24, 2019

dovahcrow previously approved these changes Aug 24, 2019

feat(eda.plot_missing, eda.plot_correlation): implement plot_missing …

a3c78a8

…and plot_correlation

dovahcrow dismissed stale reviews from jinglinpeng and themself via a3c78a8 August 24, 2019 08:45

dovahcrow force-pushed the plot_corr2 branch from aa04b49 to a3c78a8 Compare August 24, 2019 08:45

dovahcrow approved these changes Aug 24, 2019

jinglinpeng approved these changes Aug 24, 2019

dovahcrow merged commit 64ad5dd into master Aug 24, 2019

dovahcrow added a commit that referenced this pull request May 29, 2020

Merge pull request #13 from sfu-db/plot_corr2

a311488

feat(eda.plot_prediction): Implement the calculation of intermediates

dovahcrow added the module: EDA label Jun 15, 2020

devinllu pushed a commit to devinllu/dataprep that referenced this pull request Nov 9, 2021

Merge pull request sfu-db#13 from sfu-db/plot_corr2

0c2f011

feat(eda.plot_prediction): Implement the calculation of intermediates

fatbuddy added a commit to fatbuddy/dataprep that referenced this pull request Feb 10, 2024

Windows dependency fix sfu-db#13

d5a6f88
