In [2]:
from causality_simulation import *
import plotly.express as px
import pandas as pd
%matplotlib inline

Goal is to see whether kombucha is good or bad for growing fruits.
1. Simple bad experiment: kombucha in western half, water in eastern half (confounded with Longitude). Only show variables: Supplement, Number of Fruits. Ask why that may be a problem.
2. Introduce two hidden variables. Suppose you are god and can control them for every tree. Design experiment.
3. Introduce two more hidden variables. etc
4. You don't know if there are any more hidden variables, so you must randomise the assignment of supplement. Design an experiment to see if kombucha is good for growing fruits.

BONUS:
5. What if you still control for number of bees? Is kombucha still good for growing fruits?
6. Conditional causation / interaction of variables / representative sample population

# Recap

In a previous notebook on truffula trees, we looked at the causal relationship between the number of bees and the number of fruits. We concluded that "bees cause fruits": if more bees come near a tree, then more fruits will grow. Of course, bees are not the only cause of fruiting, so this time we would like to investigate whether watering the trees with particular supplements helps with fruiting as well.

## ADD STORY ABOUT HANS

With all the hype around kombucha in Berkeley, Hans decides to perform the following experiment. He has a 1000 metre by 1000 metre orchard, which he divides into two halves. The eastern half receives water, and the western half receives kombucha. After a summer, he compares the two experimental groups to see if the trees in one group have a higher number of fruits than those in the other group.

His experiment can be represented through our setup interface as follows. In the __East (Water)__ group, he selects 250 trees in a range of longitudes (east-west location) from 0 to 500 m measured from the easternmost edge of his orchard and fixes the supplement to be the regular __Water__. In the __West (Kombucha)__ group, he selects 250 trees in a range of longitudes from 500 to 1000 m and fixes the supplement to be __Kombucha__.

In [4]:
config_east = {
    'name': 'East (Water)',
    'N': 250,
    'intervene': {
        'Longitude': ['range', 0, 500],
        'Supplement': ['fixed', 'Water']
    }
}
config_west = {
    'name': 'West (Kombucha)',
    'N': 250,
    'intervene': {
        'Longitude': ['range', 500, 1000],
        'Supplement': ['fixed', 'Kombucha']
    }
}
config = [config_east, config_west]
east_west_experiment = Experiment(truffula)
east_west_experiment.fixedSetting(config=config, show=['Longitude', 'Supplement'])

Data from experiment collected!

In [8]:
# plotOrchard() requires a `name` argument; assumed here to be the group name from the config.
for group in ['East (Water)', 'West (Kombucha)']:
    east_west_experiment.plotOrchard(name=group, gradient='Number of Fruits', show=['Number of Fruits'])

In [10]:
east_west_experiment.newPlot(show=['Longitude', 'Supplement', 'Number of Fruits'])

Is there a correlation (how strong?) between adding kombucha and the number of fruits? What can you conclude about the causal relationship between adding kombucha and fruiting?

Is there any aspect of this experiment that should make you suspicious of this conclusion? If so, what might make the conclusion less believable?
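
One way to put a number on "how strong" is to compare the group means and compute the correlation between a 0/1 kombucha indicator and the fruit count. The following is a minimal sketch on made-up numbers, not on data from the truffula simulation. The dictionary `fruits` and all values in it are purely illustrative assumptions.

```python
from math import sqrt
from statistics import mean

# Made-up illustrative numbers, NOT output from the truffula simulation.
fruits = {
    'Water': [10, 12, 11, 13],
    'Kombucha': [15, 17, 16, 18],
}

# Difference between the group means: how many more fruits the kombucha trees carry.
difference = mean(fruits['Kombucha']) - mean(fruits['Water'])

# Pearson correlation between a 0/1 kombucha indicator and the fruit count.
x = [0] * len(fruits['Water']) + [1] * len(fruits['Kombucha'])
y = fruits['Water'] + fruits['Kombucha']
x_bar, y_bar = mean(x), mean(y)
covariance = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
correlation = covariance / sqrt(
    sum((a - x_bar) ** 2 for a in x) * sum((b - y_bar) ** 2 for b in y)
)

print(f"Difference in means: {difference:.2f} fruits")
print(f"Correlation: {correlation:.2f}")
```

A large difference or a correlation close to 1 only says that the two quantities move together in this sample. It says nothing yet about why.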

# Control

Causation of B by A is sometimes defined as "the existence of correlation between A and B under experimental intervention on A, all else kept constant". In Hans' experiment above, A is adding kombucha and B is the number of fruits. While he may have found a correlation between A and B, he did not keep all else constant: the "water" group is in the east, while the "kombucha" group is in the west. Anything else that differs between east and west could therefore explain the result, which casts doubt on his conclusion. For example, let us look at the number of bees that each tree receives.

In [None]:
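# plotOrchard() requires a `name` argument; assumed here to be the group name from the config.
for group in ['East (Water)', 'West (Kombucha)']:
    east_west_experiment.plotOrchard(name=group, gradient='Number of Bees', show=['Number of Bees', 'Number of Fruits'])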

Can you notice any trend in the number of bees depending on the location in his orchard?

It turns out there is a large beehive near his orchard. Where do you think it is?

Based on the experimental conclusion last time that bees cause fruits, does the proximity to the beehive make the correlation between adding kombucha and the number of fruits more or less prominent? Is this effect a source of statistical or systematic uncertainty?
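
To see how a location-dependent variable can distort an east-west comparison, here is a small toy simulation. It is a sketch under loud assumptions and is not the truffula model: the function `simulate_tree`, its coefficients, and the direction in which the bee count changes are all invented for illustration. In this toy world the supplement has no effect at all, yet Hans' design still produces a gap between the groups.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

def simulate_tree(longitude, supplement):
    """Toy tree (hypothetical model): bees vary with longitude, fruits depend on bees only."""
    # Assumption: the direction of the bee trend is chosen arbitrarily for this sketch.
    bees = 200 - 0.15 * longitude + random.gauss(0, 10)
    fruits = 0.1 * bees + random.gauss(0, 2)  # the supplement has NO effect here
    return {'Longitude': longitude, 'Supplement': supplement,
            'Number of Bees': bees, 'Number of Fruits': fruits}

# Hans' design: water in the east (0-500 m), kombucha in the west (500-1000 m).
east = [simulate_tree(random.uniform(0, 500), 'Water') for _ in range(250)]
west = [simulate_tree(random.uniform(500, 1000), 'Kombucha') for _ in range(250)]

def gap(variable):
    """Mean of `variable` in the kombucha group minus its mean in the water group."""
    return mean(t[variable] for t in west) - mean(t[variable] for t in east)

bee_gap = gap('Number of Bees')
fruit_gap = gap('Number of Fruits')

print(f"Kombucha minus water, bees:   {bee_gap:.1f}")
print(f"Kombucha minus water, fruits: {fruit_gap:.1f}")
```

The fruit gap here comes entirely from the bee gap. Because it would not shrink if Hans planted more trees, it behaves like a systematic effect rather than random scatter.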

To mitigate the problem that the number of bees varies across the orchard, we can control for the number of bees, so that it is equal for every tree in the orchard. Set up such an experiment below. Copy all the settings from Hans' experiment above, but now we fix the number of bees to __100__ in both experiment groups.
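
For reference, the setup described above could also be written as a config in the same dictionary format Hans' experiment used. This is only a sketch: `with_fixed` is a hypothetical helper, not part of `causality_simulation`, and it is an assumption that `'Number of Bees'` accepts a `['fixed', 100]` intervention in the same way `'Supplement'` does.

```python
import copy

# Hans' two groups, in the same format as the fixed-setting config above.
config = [
    {'name': 'East (Water)', 'N': 250,
     'intervene': {'Longitude': ['range', 0, 500], 'Supplement': ['fixed', 'Water']}},
    {'name': 'West (Kombucha)', 'N': 250,
     'intervene': {'Longitude': ['range', 500, 1000], 'Supplement': ['fixed', 'Kombucha']}},
]

def with_fixed(groups, variable, value):
    """Hypothetical helper: copy the group configs and fix `variable` to `value` in each."""
    controlled = copy.deepcopy(groups)  # leave the original config untouched
    for group in controlled:
        group['intervene'][variable] = ['fixed', value]
    return controlled

config_controlled = with_fixed(config, 'Number of Bees', 100)

# Assumption: fixedSetting accepts this config just as it did for Hans' experiment.
# east_west_controlled_experiment = Experiment(truffula)
# east_west_controlled_experiment.fixedSetting(
#     config=config_controlled, show=['Longitude', 'Supplement', 'Number of Bees'])
```

The helper copies the config instead of editing it in place, so Hans' original groups remain available for comparison. It can be applied again to fix further variables.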

In [12]:
east_west_controlled_experiment = Experiment(truffula)
#east_west_controlled_experiment.fixedSetting(config=config, show=['Longitude', 'Supplement', 'Number of Bees'])
east_west_controlled_experiment.setting(show=['Longitude', 'Supplement', 'Number of Bees'])

In [13]:
east_west_controlled_experiment.newPlot(show=['Longitude', 'Supplement', 'Number of Fruits'])

What can we conclude about the causal relationship...

Etc

However, the number of bees is just one of the "confounding variables" that need to be controlled for. It turns out that the wind speed also varies with location within the orchard. We don't know if wind speed has anything to do with the number of fruits on each tree, but it is safer to control for it anyway. Set up such an experiment below. Copy the settings from the previous part, but now also fix the wind speed to __20__ for both experimental groups.

In [14]:
east_west_controlled_experiment2 = Experiment(truffula)
#east_west_controlled_experiment2.fixedSetting(config=config, show=['Longitude', 'Supplement', 'Number of Bees', 'Wind Speed'])
east_west_controlled_experiment2.setting(show=['Longitude', 'Supplement', 'Number of Bees', 'Wind Speed'])

Same questions as before... etc

We can of course go on forever. There are practically countless variables that can vary within the orchard: temperature, humidity, density of grass, soil composition, or even the number of naked mole rats that live deep underground. It is virtually impossible to control for all of those variables by holding each of them constant. What we can do, however, is randomise which trees to water with kombucha and which to water with water. This way, any location-dependent effect from any other variable is unlikely to be concentrated in either the "water" group or the "kombucha" group of trees, where it would enhance or dilute the actual correlation between adding kombucha and the number of fruits. In the following experiment, check 
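
The effect of randomisation can be illustrated with a toy simulation. As before, this is a sketch under stated assumptions and not the truffula model: `bees_at` and `fruits_from` are invented functions in which the bee count depends on location and the supplement has no effect at all. The supplement is assigned by a coin flip, independently of where the tree stands.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

def bees_at(longitude):
    """Toy confounder (hypothetical): bee count changes with location."""
    return 200 - 0.15 * longitude + random.gauss(0, 10)

def fruits_from(bees):
    """Toy outcome that depends on bees only, so the true supplement effect is zero."""
    return 0.1 * bees + random.gauss(0, 2)

trees = []
for _ in range(500):
    longitude = random.uniform(0, 1000)                # anywhere in the orchard
    supplement = random.choice(['Water', 'Kombucha'])  # coin flip, independent of location
    bees = bees_at(longitude)
    trees.append({'Supplement': supplement, 'Number of Bees': bees,
                  'Number of Fruits': fruits_from(bees)})

def group_mean(variable, supplement):
    """Mean of `variable` over all trees that received `supplement`."""
    return mean(t[variable] for t in trees if t['Supplement'] == supplement)

bee_gap = group_mean('Number of Bees', 'Kombucha') - group_mean('Number of Bees', 'Water')
fruit_gap = group_mean('Number of Fruits', 'Kombucha') - group_mean('Number of Fruits', 'Water')

print(f"Kombucha minus water, bees:   {bee_gap:.1f}")
print(f"Kombucha minus water, fruits: {fruit_gap:.1f}")
```

Both groups now contain trees from all over the orchard, so their average bee counts should be similar and the fruit gap should stay close to the true effect, which is zero in this toy model. The same argument applies to confounders nobody has thought of, which is why randomisation does not require listing them.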

TODO:

* Fixed intervention for subset of variables, preset multiple experimental groups
* PlotOrchard combine groups into single plot, show/hide each group with button
* Feedback text when experiment is done