# Implementing the correlation ratio

In [None]:
# Don't change this cell; just run it.
import numpy as np  # The array library.

import pandas as pd

# Safe setting for Pandas.  Needs Pandas version >= 1.5.
pd.set_option('mode.copy_on_write', True)

# The OKpy testing system.
from client.api.notebook import Notebook
ok = Notebook('correlation_ratio.ok')

## About the correlation ratio

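The correlation ratio is a statistic for the situation where you have one continuous, numerical variable and another variable containing labels that identify the group to which each row belongs.  Its square is the proportion of the total variation in the numerical variable that is accounted for by differences between the group means.  See below for a worked example.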

Have a look at the [Wikipedia page on correlation ratio](https://en.wikipedia.org/wiki/Correlation_ratio) for background.
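
For reference, the statistic computed by hand later in this notebook can be written as below. The notation is chosen here to match the code, and is not copied from the Wikipedia page: $y_i$ are the values of the numerical variable, $\bar{y}$ is their overall mean, $g$ indexes the groups, $n_g$ is the number of rows in group $g$, and $\bar{y}_g$ is the mean of the values in group $g$.

$$
\eta = \sqrt{\frac{\sum_g n_g (\bar{y}_g - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}}
$$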

The correlation ratio is not widely used, and for that reason, there is no standard implementation of this statistic.

Here you will write your own implementation.

## The data

`wine.csv` contains a table in which each row corresponds to a particular wine, and each column corresponds to a measure for that wine.   The first column gives the class of the wine, where the class corresponds to a particular [cultivar](https://en.wikipedia.org/wiki/Cultivar) - a grape variety.  Thus all rows with wine class 1 come from one cultivar, those with class 2 from another, and those with class 3 from a third.

See [the dataset page](https://archive.ics.uci.edu/ml/datasets/Wine) for more detail.

In [None]:
wine = pd.read_csv('wine.csv')
wine.head()

## Correlation ratio the long way round

We want to calculate the correlation ratio for the numerical column `Alcohol` (the alcohol level of the wine), given the labels in `Class`.

This is how we would do that by hand, following the formula in the Wikipedia page.

In [None]:
numerical = wine['Alcohol']
labels = wine['Class']
# Total sum of squared deviations from the overall mean.
overall_mean = numerical.mean()
overall_sum_of_squares = ((numerical - overall_mean) ** 2).sum()
# The distinct group labels, ignoring any missing values.
levels = labels.dropna().unique()
n_levels = len(levels)
# One between-group sum of squares per group.
group_sums_of_squares = np.zeros(n_levels)
for group_no in np.arange(n_levels):
    level = levels[group_no]
    # Select the values belonging to this group.
    is_in_level = labels == level
    level_values = numerical[is_in_level]
    n_in_level = len(level_values)
    # Squared distance of group mean from overall mean, weighted by group size.
    group_sos = n_in_level * (level_values.mean() - overall_mean) ** 2
    group_sums_of_squares[group_no] = group_sos
# Between-group variation as a proportion of the total variation.
top_of_stat = np.sum(group_sums_of_squares)
bottom_of_stat = overall_sum_of_squares
eta = np.sqrt(top_of_stat / bottom_of_stat)
eta
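
One way to see why this statistic lies between 0 and 1 is that the total sum of squares splits exactly into a between-group part (the top of the statistic) and a within-group part. The following is a minimal sketch on made-up numbers; the arrays `values` and `groups` are invented for illustration and do not come from `wine.csv`.

```python
import numpy as np

# Made-up data for illustration only: six values in two groups.
values = np.array([1.0, 2.0, 3.0, 7.0, 8.0, 9.0])
groups = np.array(['a', 'a', 'a', 'b', 'b', 'b'])

overall_mean = values.mean()
total_ss = np.sum((values - overall_mean) ** 2)

between_ss = 0.0
within_ss = 0.0
for group in np.unique(groups):
    group_values = values[groups == group]
    group_mean = group_values.mean()
    # Group size times squared distance of group mean from overall mean.
    between_ss += len(group_values) * (group_mean - overall_mean) ** 2
    # Squared distances of the group's values from their own group mean.
    within_ss += np.sum((group_values - group_mean) ** 2)

# The two parts add up to the total, so the ratio cannot exceed 1.
print(np.isclose(between_ss + within_ss, total_ss))
toy_eta = np.sqrt(between_ss / total_ss)
print(toy_eta)
```

In this toy example the values sit close to their own group means, so nearly all of the variation is between groups and the result is close to 1.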

While we're here, let's get another dataset.  For this dataset, each row is a patient, and each column is a blood measure, urine measure, or clinical feature for that patient.  The column `Coronary Artery Disease` specifies whether the patient qualified for a diagnosis of heart vessel disease ('yes') or not ('no').  The column `Hemoglobin` gives the blood level of the protein that carries oxygen around the body.

In [None]:
ckd = pd.read_csv('ckd.csv')
ckd.head()

## Your job

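Write a function `correlation_ratio` that returns the correlation ratio for any data frame, given the name of a numerical column and the name of a label column.  As in the tests further down, the arguments are, in order: the numerical column name, the label column name, and the data frame.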

In [None]:
    # Your code here
    ...
    return ...
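
If wrapping steps in a function is unfamiliar, the general pattern can be sketched with a simpler statistic. The function `group_means` and the data frame `toy` below are hypothetical, made up for illustration. This is not the solution; it only shows a function that takes two column names and a data frame, in the same argument order that the tests use.

```python
import pandas as pd


def group_means(num_col, label_col, df):
    """ Return dict with mean of `num_col` for each label in `label_col`

    Hypothetical helper for illustration; not part of the assignment.
    """
    # Fetch the columns by name from whichever data frame was passed in.
    numerical = df[num_col]
    labels = df[label_col]
    means = {}
    for level in labels.dropna().unique():
        # Mean of the numerical values for rows with this label.
        means[level] = numerical[labels == level].mean()
    return means


# Made-up data frame for illustration only.
toy = pd.DataFrame({'score': [1.0, 3.0, 10.0, 20.0],
                    'team': ['x', 'x', 'y', 'y']})
toy_means = group_means('score', 'team', toy)
print(toy_means)
```

Because the column names arrive as arguments, the same function works unchanged on any data frame that has a suitable pair of columns.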

In [None]:
# Test your function by replicating the analysis above.
# When your function is correct, you should be able to run this cell without error
assert np.isclose(correlation_ratio('Alcohol', 'Class', wine), eta)

In [None]:
# Test it on another column.
assert np.isclose(correlation_ratio('Malic Acid', 'Class', wine), 0.544857081967286)

In [None]:
# Test it on another dataset.
assert np.isclose(correlation_ratio('Hemoglobin', 'Coronary Artery Disease', ckd), 0.3787772107398905)
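
The tests above compare with `np.isclose` rather than `==` because floating point calculations can differ in the last few digits, for example when the same sum is done in a different order. A small illustration, using numbers unrelated to the datasets:

```python
import numpy as np

# Mathematically equal, but not bit-for-bit equal in floating point.
a = 0.1 + 0.2
b = 0.3
exactly_equal = a == b
close_enough = np.isclose(a, b)
print(exactly_equal, close_enough)
```

So a correct implementation that orders its arithmetic differently from the by-hand version should still pass these tests.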

## Done.

Congratulations, you're done with the assignment!  Be sure to:

- **Run all the tests** (the next cell has a shortcut for that).
- **Save and Checkpoint** from the `File` menu.

In [None]:
# For your convenience, you can run this cell to run all the tests at once!
import os
_ = [ok.grade(q[:-3]) for q in os.listdir("tests") if q.startswith('q')]