# How-to Guide: Relationships

The class `cc_tk.relationship.RelationshipSummary` can be used to quickly evaluate the relationships between features and a target.

The relationships are evaluated through statistical tests. For now, the tests are:
- numeric feature, numeric target: Pearson correlation test
- numeric feature, categorical target: ANOVA if its assumptions are verified, Kruskal-Wallis otherwise
- categorical feature, numeric target: same as the previous case (ANOVA if its assumptions are verified, Kruskal-Wallis otherwise)
- categorical feature, categorical target: chi-squared test
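
The decision rule in the list above can be sketched with `scipy.stats`. This is only an illustration, not the `cc_tk` implementation: the helper name `sketch_pvalue` is hypothetical, and the choice of Shapiro-Wilk and Levene to check the ANOVA assumptions is an assumption, since the guide does not say which checks the library runs.

```python
import numpy as np
import pandas as pd
from scipy import stats


def sketch_pvalue(feature, target, feature_kind, target_kind, alpha=0.05):
    """Return (test_name, pvalue) for one pair. Hypothetical helper, not part of cc_tk."""
    if feature_kind == "numeric" and target_kind == "numeric":
        return "pearson", stats.pearsonr(feature, target)[1]
    if feature_kind == "categorical" and target_kind == "categorical":
        contingency = pd.crosstab(feature, target)
        return "chi2", stats.chi2_contingency(contingency)[1]
    # Mixed case: split the numeric variable by the levels of the categorical one.
    if feature_kind == "numeric":
        numeric, by = feature, target
    else:
        numeric, by = target, feature
    groups = [values.to_numpy() for _, values in numeric.groupby(by)]
    # Assumed checks: Shapiro-Wilk per group (normality), Levene (equal variances).
    normal = all(stats.shapiro(group)[1] > alpha for group in groups)
    equal_var = stats.levene(*groups)[1] > alpha
    if normal and equal_var:
        return "anova", stats.f_oneway(*groups)[1]
    return "kruskal", stats.kruskal(*groups)[1]


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y_num = 2 * x + pd.Series(rng.normal(size=200))
y_cat = pd.Series(np.where(x > 0, "high", "low"))
x_cat = pd.Series(np.where(x > 0.5, "a", "b"))

num_num = sketch_pvalue(x, y_num, "numeric", "numeric")
num_cat = sketch_pvalue(x, y_cat, "numeric", "categorical")
cat_cat = sketch_pvalue(x_cat, y_cat, "categorical", "categorical")
print(num_num, num_cat, cat_cat)
```

The mixed case is symmetric: whichever variable is numeric is split by the levels of the categorical one, which is why the two middle bullets share the same tests.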

The `cc_tk.relationship` submodule provides functions for studying the relationship between variables.
It is particularly useful for feature selection when you have many variables and want to know which ones are the most statistically significant for discriminating the target variable.
The target variable can be either numeric or categorical.

## Overall Summary

In [None]:
from cc_tk.relationship import RelationshipSummary
from sklearn.datasets import load_iris


In [2]:
X, y = load_iris(return_X_y=True, as_frame=True)

In [3]:
X = X.assign(
    sepal_type=(X["sepal length (cm)"] < 6).map({False: "big", True: "small"})
)

When using `RelationshipSummary`, you can build the summary with its `build_summary` method and/or save it to an Excel file with its `to_excel` method.

In [18]:
relationship_summary = RelationshipSummary(X, y.astype(object))
# relationship_summary.to_excel("../../data/output/test_relationship.xlsx")
relationship_summary.build_summary();

You can access the overall distributions of the numeric and categorical variables through the `numeric_distribution` and `categorical_distribution` attributes of `summary_output`.

In [6]:
relationship_summary.summary_output.numeric_distribution

,Variable,count,mean,std,min,25%,50%,75%,max
0,sepal length (cm),150.0,5.843333,0.828066,4.3,5.1,5.8,6.4,7.9
1,sepal width (cm),150.0,3.057333,0.435866,2.0,2.8,3.0,3.3,4.4
2,petal length (cm),150.0,3.758,1.765298,1.0,1.6,4.35,5.1,6.9
3,petal width (cm),150.0,1.199333,0.762238,0.1,0.3,1.3,1.8,2.5


You can access the relationship summaries through the `numeric_significance` and `categorical_significance` attributes of `summary_output`.

In the following output, the `influence` column shows that petal length values are significantly lower in group 0 (`--`) and significantly higher in group 2 (`++`). The per-group distribution confirms this: compare, for example, the min and max of each group.

In [11]:
relationship_summary.summary_output.numeric_significance.drop(columns=["pvalue", "statistic", "message"])

Variable,Target,influence,significance,count,mean,std,min,25%,50%,75%,max
petal length (cm),0,--,strong,50.0,1.462,0.173664,1.0,1.4,1.5,1.575,1.9
petal length (cm),1,,strong,50.0,4.26,0.469911,3.0,4.0,4.35,4.6,5.1
petal length (cm),2,++,strong,50.0,5.552,0.551895,4.5,5.1,5.55,5.875,6.9
petal width (cm),0,--,strong,50.0,0.246,0.105386,0.1,0.2,0.2,0.3,0.6
petal width (cm),1,,strong,50.0,1.326,0.197753,1.0,1.2,1.3,1.5,1.8
petal width (cm),2,++,strong,50.0,2.026,0.27465,1.4,1.8,2.0,2.3,2.5
sepal length (cm),0,--,strong,50.0,5.006,0.35249,4.3,4.8,5.0,5.2,5.8
sepal length (cm),1,,strong,50.0,5.936,0.516171,4.9,5.6,5.9,6.3,7.0
sepal length (cm),2,++,strong,50.0,6.588,0.63588,4.9,6.225,6.5,6.9,7.9
sepal width (cm),0,++,strong,50.0,3.428,0.379064,2.3,3.2,3.4,3.675,4.4
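
The distribution columns of this table can be cross-checked with plain pandas. The snippet below is a sketch that does not use `cc_tk`; it only recomputes the descriptive statistics of petal length per target group.

```python
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

# Descriptive statistics of one feature, split by target group.
by_group = X["petal length (cm)"].groupby(y).describe()
print(by_group[["count", "mean", "min", "max"]])
```

Group 0 tops out at 1.9 while group 2 starts at 4.5, so the two ranges do not overlap, which is consistent with the `--` and `++` flags.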


In the following output, we see that big sepals are over-represented in group 2 and under-represented in group 0.

In [12]:
relationship_summary.summary_output.categorical_significance.drop(columns=["pvalue", "statistic", "message"])

Variable,Target,Value,influence,significance,count,proportion
sepal_type,0,big,--,strong,,
sepal_type,0,small,+,strong,50.0,1.0
sepal_type,1,big,,strong,24.0,0.48
sepal_type,1,small,,strong,26.0,0.52
sepal_type,2,big,++,strong,43.0,0.86
sepal_type,2,small,-,strong,7.0,0.14
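
One common way to quantify over- and under-representation in a contingency table is to look at Pearson residuals of the chi-squared test. The sketch below applies that idea to the counts in the table above. It is an illustration only: the guide does not say how `cc_tk` derives the `influence` column, and the empty count for `big` in group 0 is assumed to be 0, since `small` accounts for all 50 rows.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the table above (group 0 / big assumed to be 0).
observed = pd.DataFrame(
    {"big": [0, 24, 43], "small": [50, 26, 7]},
    index=pd.Index([0, 1, 2], name="Target"),
)

chi2, pvalue, dof, expected = chi2_contingency(observed)

# Pearson residuals: positive means over-represented, negative under-represented.
residuals = (observed - expected) / np.sqrt(expected)
print(residuals.round(2))
```

The largest negative residual is for `big` in group 0 and the largest positive one is for `big` in group 2, matching the reading given above.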


## Single variable relationship

You may also want to use the underlying functions directly to study the relationship between a single variable and the target variable.

:::{warning}
When you use these functions, you must know the type (numeric or categorical) of both the feature variable and the target variable, and request the matching function.
:::
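
A quick dtype check can help here. The helper below, `infer_kind`, is hypothetical and not part of `cc_tk`; it is a minimal sketch, using only pandas, of how one might decide between the `"numeric"` and `"categorical"` labels. It also shows why an integer-coded target such as the iris labels reads as numeric until it is cast with `astype(object)`.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_numeric_dtype


def infer_kind(series: pd.Series) -> str:
    """Hypothetical helper: label a Series as "numeric" or "categorical"."""
    # Booleans count as numeric for pandas, but are treated as categories here.
    if is_numeric_dtype(series) and not is_bool_dtype(series):
        return "numeric"
    return "categorical"


labels = pd.Series([0, 1, 2, 1, 0])
print(infer_kind(labels))                 # numeric
print(infer_kind(labels.astype(object)))  # categorical
```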

In [None]:
from cc_tk.relationship import get_significance
significance_numeric_categorical = get_significance("numeric", "categorical", "statistical")
significance_categorical_categorical = get_significance("categorical", "categorical", "statistical")

# Create a dataframe
X, y = load_iris(return_X_y=True, as_frame=True)

# Artificially create a categorical variable
X["sepal_length_cat"] = (X["sepal length (cm)"] > 5.5).astype(str)

# Study the relationship between specific features and y
significance_sepal_length_num = significance_numeric_categorical(X["sepal length (cm)"], y.astype(object))
significance_sepal_length_cat = significance_categorical_categorical(X["sepal_length_cat"], y.astype(object))

significance_sepal_length_num

SignificanceOutput(pvalue=8.91873433246198e-22, influence=0    --
1      
2    ++
dtype: category
Categories (6, object): ['--' < '-' < '' < ' ' < '+' < '++'], statistic=96.93743600064833, message='sepal length (cm) grouped by target are not gaussians with equal variances. Computing Kruskal-Wallis p-value.')
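
Since the message reports that a Kruskal-Wallis p-value was computed, the result can be sanity-checked by running the same test directly with `scipy.stats.kruskal`. This is a sketch under the assumption that the library relies on an equivalent implementation; if so, the statistic and p-value should land close to the values shown above.

```python
from scipy.stats import kruskal
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

# One array of sepal lengths per target group.
groups = [values.to_numpy() for _, values in X["sepal length (cm)"].groupby(y)]

statistic, pvalue = kruskal(*groups)
print(f"H = {statistic:.3f}, p = {pvalue:.3g}")
```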

:::{admonition} Future work

I am planning to add more features to the `cc_tk.relationship` submodule.
Already planned features are:
- a scikit-learn transformer that will allow you to select the most significant features based on the relationship with the target variable
- a parametrization of significance tests to allow the user to choose the most appropriate test for their data
:::
