# Exploring the Water Samples

In [None]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
#from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
from IPython.display import display, HTML, Javascript
from string import Template
import json
import IPython.display


import warnings
warnings.filterwarnings('ignore')

In [None]:
water_df = pd.read_csv('/kaggle/input/water-potability/water_potability.csv')
water_df.head(4)

# Getting to know the data

## pH
pH is a measure of hydrogen ion activity in a substance. It indicates how acidic or alkaline (basic) that substance is. The pH scale ranges from 0 to 14.

The WHO guideline range for drinking water is 6.5 to 8.5.
Outside these limits many harmful metals may become soluble.

## Hardness
Hardness is the amount of dissolved calcium and magnesium in the water.
It is not of health concern at levels found in
drinking-water; however, it can remove other metals that may be harmful.

The taste threshold for the calcium ion is in the range of 100–300 mg/l. The taste threshold for magnesium is probably lower than that for calcium. In
some instances, consumers tolerate water hardness in excess of 500 mg/l.

## Solids
Total dissolved solids (TDS) comprise inorganic salts (principally calcium, magnesium, potassium, sodium, bicarbonates, chlorides and sulfates) and small amounts of organic matter that are dissolved in water.

The palatability of water with a total dissolved solids (TDS) level of less than about 600 mg/l is generally considered to be good; drinking-water becomes significantly and increasingly unpalatable at TDS levels greater than about 1000 mg/l.

The desirable limit for TDS is 500 mg/l and the maximum limit prescribed for drinking purposes is 1000 mg/l.

## Chloramines
Monochloramine, dichloramines and trichloramines are considered by-products of drinking-water chlorination. Chloramines, such as monochloramine, dichloramine and trichloramine (nitrogen trichloride), are generated from the reaction of chlorine with ammonia. Among chloramines, monochloramine is the only useful chlorine disinfectant, and chloramination systems are operated to minimize the formation of dichloramine and trichloramine. Higher chloramines, particularly trichloramine, are likely to give rise to taste and odour complaints, except at very low concentrations.

For monochloramine, no odour or taste was detected at concentrations between
0.5 and 1.5 mg/l. For dichloramine, the organoleptic effects between 0.1 and 0.5 mg/l were found to be “slight” and “acceptable”. 
Most individuals are able to taste chloramines at concentrations below 5
mg/l, and some at levels as low as 0.3 mg/l.

Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water.

## Sulfate
Sulfates occur naturally in numerous minerals and are used commercially, principally in the chemical industry. They are discharged into water in industrial wastes and through atmospheric deposition; however, the highest levels usually occur in groundwater and are from natural sources.
The presence of sulfate in drinking-water can cause noticeable taste, and very high levels might cause a laxative effect in unaccustomed consumers. Taste impairment varies with the nature of the associated cation; taste thresholds have been found to range from 250 mg/l for sodium sulfate to 1000 mg/l for calcium sulfate. It is generally considered that taste impairment is minimal at levels below 250 mg/l. No health-based guideline value has been derived for sulfate.
Sulfate is not of health concern at levels found in drinking-water.
The Larson ratio is the ratio of the chloride and sulfate concentrations to the bicarbonate concentration.

## Conductivity
Pure water is not a good conductor of electric current; rather, it is a good insulator. An increase in ion concentration enhances the electrical conductivity of water. Generally, the amount of dissolved solids in water determines the electrical conductivity. Electrical conductivity (EC) measures the ionic process of a solution that enables it to transmit current. According to WHO standards, the EC value should not exceed 400 μS/cm.

## Total Organic Carbon (TOC)
High colour from natural organic carbon (e.g. humics) could also indicate a high propensity to produce by-products from disinfection processes. No health-based guideline value is proposed for colour in drinking-water. According to the US EPA, TOC should be < 2 mg/L in treated drinking water and < 4 mg/L in source water that is used for treatment.

## Trihalomethanes
THMs are formed in drinking-water primarily as a result of chlorination of organic matter present naturally in raw water supplies. The rate and degree of THM formation increase as a function of the chlorine and humic acid concentration, temperature, pH and bromide ion concentration.
Bromide can be involved in the reaction between chlorine and naturally occurring organic matter in drinking-water, forming brominated and mixed chloro-bromo by-products, such as trihalomethanes (THMs) and halogenated acetic acids (HAAs), or it can react with ozone to form bromate. Trihalomethanes and haloacetic acids are the most common DBPs and occur at among the highest concentrations in drinking-water.
THM levels up to 80 ppb (0.080 mg/L) are considered safe in drinking water.

## Turbidity
The turbidity of water depends on the quantity of solid matter present in the suspended state. It is a measure of the light-scattering properties of water, and the test is used to indicate the quality of waste discharge with respect to colloidal matter.
High levels of turbidity can protect microorganisms from the effects of disinfection, stimulate the growth of bacteria and give rise to a
significant chlorine demand.
The mean turbidity value obtained for Wondo Genet Campus (0.98 NTU) is lower than the WHO recommended value of 5.00 NTU.

## Potability
Indicates if water is safe for human consumption where 1 means Potable and 0 means Not potable.

In [None]:
water_df.describe()
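
`describe()` only shows the non-null count per column, so missing values are easy to overlook. A minimal sketch of a more direct check follows; `sample_df` is a hypothetical stand-in for `water_df` with made-up values, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for water_df (values are made up for illustration)
sample_df = pd.DataFrame({
    "ph": [7.1, np.nan, 6.8, np.nan],
    "Sulfate": [330.0, 310.0, np.nan, 345.0],
    "Hardness": [204.9, 129.4, 224.2, 214.4],
})

# Number and percentage of missing values in each column
summary = pd.DataFrame({
    "missing": sample_df.isna().sum(),
    "percent": (sample_df.isna().mean() * 100).round(1),
})
print(summary)
```

Applied to the real data, the same two expressions would be called on `water_df` instead of `sample_df`.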

# Data cleaning
We have incomplete data for pH, Sulfate and Trihalomethanes.
Ideally we want to replace or remove the missing values. The question is what to replace them with. For now, each missing value will be replaced by the mean of its Potability class unless a relationship can be observed.
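
As a rough illustration of replacing missing values with the class mean, here is a sketch on a toy frame. The helper name `fill_with_class_mean` and the toy values are assumptions made for this example and are not part of the dataset.

```python
import numpy as np
import pandas as pd

def fill_with_class_mean(df, column, label="Potability"):
    """Return a copy of df in which NaNs in `column` are replaced by the
    mean of `column` within the row's `label` group."""
    out = df.copy()
    class_means = out.groupby(label)[column].transform("mean")
    out[column] = out[column].fillna(class_means)
    return out

# Toy data: one missing pH value in each Potability class
toy = pd.DataFrame({
    "ph": [6.0, np.nan, 8.0, 7.0, np.nan, 9.0],
    "Potability": [0, 0, 0, 1, 1, 1],
})
filled = fill_with_class_mean(toy, "ph")
print(filled)
```

Because the fill value depends on the label, this approach can only be used on samples whose Potability is already known.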

## pH
The WHO guideline range for drinking water is 6.5 to 8.5. Outside these limits many harmful metals may become soluble.

In [None]:
water_df.hist(column='ph', by='Potability')

There are values outside of the WHO guidelines that are classified as potable. We will reclassify these samples as not potable.

In [None]:
#Treat a pH of 0 as a missing reading, then replace missing values with the mean pH of the sample's Potability class
water_df['ph'] = water_df['ph'].replace(0, np.nan)
water_df['ph'] = water_df['ph'].fillna(water_df.groupby('Potability')['ph'].transform('mean'))
#Set any value that fails the guideline for pH not to be potable
water_df.loc[~water_df['ph'].between(6.5, 8.5), 'Potability'] = 0


In [None]:
water_df.hist(column='ph', by='Potability')

## Hardness
In some instances, consumers tolerate water hardness in excess of 500 mg/l. 

In [None]:
water_df.hist(column='Hardness', by='Potability')

All values are within the acceptable range.

## Solids (TDS)
The desirable limit for TDS is 500 mg/l and the maximum limit prescribed for drinking purposes is 1000 mg/l. However, these guidelines are based on taste. Over 1000 mg/l is considered unacceptable.

In [None]:
water_df.hist(column='Solids', by='Potability')

A large number of water samples are above the acceptable 1,000 mg/l TDS limit. However, reclassifying them would leave most of our water samples unacceptable; for this reason only, we will not reclassify the water samples. A strong correlation is expected between TDS and Conductivity.
<img src="https://www.researchgate.net/profile/Azm-Al-Homoud/publication/227328358/figure/fig10/AS:397293128830980@1471733471117/Relationship-between-electrical-conductivity-EC-and-total-dissolved-solids-TDS.png" width="400"> https://www.researchgate.net/figure/Relationship-between-electrical-conductivity-EC-and-total-dissolved-solids-TDS_fig10_227328358

In [None]:
sns.set_theme(style="ticks")

ax = sns.regplot(x="Conductivity", y="Solids", data=water_df)

The solids values are questionable.  EC will be a more reliable measure of the quality of water.
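
The strength of the relationship between Conductivity and Solids can also be put into a single number with the Pearson correlation coefficient. The sketch below uses synthetic data, since the real values are not reproduced here; the factor 0.65 and the noise level are assumptions chosen only to mimic a roughly proportional TDS to EC relationship.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic example: TDS assumed roughly proportional to EC, plus noise
ec = rng.uniform(200, 800, size=200)
tds = 0.65 * ec + rng.normal(0, 20, size=200)
demo = pd.DataFrame({"Conductivity": ec, "Solids": tds})

# Pearson correlation between the two columns
r = demo["Conductivity"].corr(demo["Solids"])
print(f"Pearson r = {r:.3f}")
```

On the notebook's data the equivalent call would be `water_df["Conductivity"].corr(water_df["Solids"])`; a value near zero would support the doubts about the Solids column.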

## Chloramines

Chlorine levels up to 4 milligrams per liter (mg/L), or 4 parts per million (ppm), are considered safe in drinking water. Acceptable chloramine levels are lower, based on taste and smell.

In [None]:
water_df.hist(column='Chloramines', by='Potability')

Again, a large number of samples are above the acceptable drinking limits.

## Sulfate
It is generally considered that taste impairment is minimal at levels below 250 mg/l. No health-based guideline value has been derived for sulfate.

In [None]:
water_df.hist(column='Sulfate', by='Potability')

There are a few values below what is considered good for drinking water.
We can also replace the missing values as we did with pH.

In [None]:
#first replace the Nan values with the mean of the classification
Sulfate_nan_1 = water_df.query('Potability == 1')['Sulfate'][water_df['Sulfate'].isna()].index
water_df.loc[Sulfate_nan_1,'Sulfate'] = water_df.query('Potability == 1')['Sulfate'][water_df['Sulfate'].notna()].mean()
Sulfate_nan_0 = water_df.query('Potability == 0')['Sulfate'][water_df['Sulfate'].isna()].index
water_df.loc[Sulfate_nan_0,'Sulfate'] = water_df.query('Potability == 0')['Sulfate'][water_df['Sulfate'].notna()].mean()


## Conductivity
The EC value should not exceed 400 μS/cm.

In [None]:
water_df.hist(column='Conductivity', by='Potability')

Again, some of the conductivity values of the samples classified as potable exceed the stated limit. Note again that we have no values that are close to seawater, which contradicts the Solids (TDS) values.
<img src="https://www.fondriest.com/environmental-measurements/wp-content/uploads/2014/02/conductivity_averages.jpg" width="250"> <br /> https://www.fondriest.com/environmental-measurements/parameters/water-quality/conductivity-salinity-tds/

## Total Organic Carbon (TOC)
According to the US EPA, TOC should be < 2 mg/L in treated drinking water.

In [None]:
water_df.hist(column='Organic_carbon', by='Potability')

Many samples classified as potable would not pass the US EPA guideline.

## Trihalomethanes
THM levels up to 80 ppb (0.080 mg/L) are considered safe in drinking water.

In [None]:
water_df.hist(column='Trihalomethanes', by='Potability')

As with pH, we will replace the missing values and reclassify the samples above the safe limit.

In [None]:
#first replace the Nan values with the mean of the classification
THM_nan_1 = water_df.query('Potability == 1')['Trihalomethanes'][water_df['Trihalomethanes'].isna()].index
water_df.loc[THM_nan_1,'Trihalomethanes'] = water_df.query('Potability == 1')['Trihalomethanes'][water_df['Trihalomethanes'].notna()].mean()
THM_nan_0 = water_df.query('Potability == 0')['Trihalomethanes'][water_df['Trihalomethanes'].isna()].index
water_df.loc[THM_nan_0,'Trihalomethanes'] = water_df.query('Potability == 0')['Trihalomethanes'][water_df['Trihalomethanes'].notna()].mean()
#Set any value that fails the guideline for Trihalomethanes not to be potable
water_df.loc[water_df.Trihalomethanes > 80, 'Potability'] = 0

In [None]:
water_df.hist(column='Trihalomethanes', by='Potability')

## Turbidity
The WHO recommended value is below 5.00 NTU, ideally below 1 NTU.

In [None]:
water_df.hist(column='Turbidity', by='Potability')

Again, some of the samples classified as potable are above the limits.

In [None]:
sns.set_theme(style="ticks")
sns.pairplot(water_df, hue="Potability")

Reclassifying all samples that don't pass a quality standard.

In [None]:
quality_water_df = water_df.copy()  # work on a copy so water_df keeps its current labels

In [None]:
#Conductivity 400
quality_water_df.loc[quality_water_df.Conductivity > 400, 'Potability'] = 0

#If we applied all these criteria... we would have no potable samples...
#Hardness 500 - taste
#quality_water_df.loc[quality_water_df.Hardness > 500, 'Potability'] = 0
#Solids 1000 - palatability
#quality_water_df.loc[quality_water_df.Solids > 1000, 'Potability'] = 0
#Chloramines 4
#quality_water_df.loc[quality_water_df.Chloramines > 4, 'Potability'] = 0
#sulfate - no health impact
#Organic_carbon 2
#quality_water_df.loc[quality_water_df.Organic_carbon > 2, 'Potability'] = 0
#Turbidity 5
#quality_water_df.loc[quality_water_df.Turbidity > 5, 'Potability'] = 0
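
Before deciding which criteria to apply, it can help to count how many samples labelled potable exceed each limit. The sketch below shows one way to do that; the helper `count_exceeding` and the toy frame are hypothetical, and the limits are simply the upper bounds quoted in the comments above.

```python
import pandas as pd

# Upper limits quoted in this notebook (pH is a range and is handled separately)
limits = {
    "Hardness": 500,
    "Solids": 1000,
    "Chloramines": 4,
    "Conductivity": 400,
    "Organic_carbon": 2,
    "Turbidity": 5,
}

def count_exceeding(df, limits, label="Potability"):
    """Count samples labelled potable that exceed each limit.
    Columns missing from df are skipped."""
    potable = df[df[label] == 1]
    return {col: int((potable[col] > limit).sum())
            for col, limit in limits.items() if col in df.columns}

# Toy data standing in for quality_water_df (made-up values)
toy = pd.DataFrame({
    "Conductivity": [350, 420, 390, 450],
    "Turbidity": [3.9, 4.1, 5.5, 2.8],
    "Potability": [1, 1, 0, 1],
})
exceeding = count_exceeding(toy, limits)
print(exceeding)
```

On the real data, `count_exceeding(quality_water_df, limits)` would show how many potable labels each criterion would overturn.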

In [None]:
quality_water_df.describe()

In [None]:
sns.pairplot(quality_water_df, hue="Potability")

# Exploring models



From: https://www.kaggle.com/startupsci/titanic-data-science-solutions

Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem. We want to identify the relationship between the output (potable or not) and the other variables or features (water chemistry / measurements). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria - Supervised Learning plus Classification and Regression - we can narrow down our choice of models to a few. These include:

- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine

Setting up the training and test sets

In [None]:
X = quality_water_df.drop(["Potability","Solids"], axis=1) #dropping Solids as the results don't appear reliable
Y = quality_water_df["Potability"]

# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.10, random_state = 0)

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. 

Note the confidence score generated by the model based on our training dataset.

In [None]:
# Logistic Regression

logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
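
The score above is computed on the training data. A possible complement is to score the held-out test split as well; the sketch below shows the pattern on synthetic data generated with `make_classification`, which is an assumption used here only to keep the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic binary classification data standing in for X and Y
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.10, random_state=0)

model = LogisticRegression(solver="liblinear").fit(X_tr, y_tr)

# Accuracy on the data the model was fitted on
train_acc = round(model.score(X_tr, y_tr) * 100, 2)
# Accuracy on samples the model has not seen
test_acc = round(accuracy_score(y_te, model.predict(X_te)) * 100, 2)
print(train_acc, test_acc)
```

In this notebook the equivalent test score would be `logreg.score(X_test, Y_test)`. A large gap between the two numbers suggests overfitting.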

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

In [None]:
# Label each coefficient with the feature it belongs to (same columns and order as X_train)
coeff_df = pd.DataFrame({'Feature': X_train.columns})
coeff_df["Coefficient"] = logreg.coef_[0]

coeff_df.sort_values(by='Coefficient', ascending=False)
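
To make the link between a coefficient, the log-odds and the probability concrete, here is a small worked sketch. The intercept and coefficient are hypothetical numbers, not values fitted on the water data.

```python
import math

def sigmoid(z):
    """Logistic function: maps log-odds z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical one-feature model: log-odds = intercept + coef * x
intercept, coef = -1.0, 0.8

probabilities = []
for x in (0.0, 1.0, 2.0):
    log_odds = intercept + coef * x
    probabilities.append(sigmoid(log_odds))
    print(f"x={x}: log-odds={log_odds:+.2f}, p={probabilities[-1]:.4f}")

# Each unit increase in x multiplies the odds p / (1 - p) by exp(coef)
odds_ratio = math.exp(coef)
print(f"odds ratio per unit of x: {odds_ratio:.4f}")
```

With a positive coefficient the probability rises as `x` increases; a negative coefficient would make it fall.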

Next we model using Support Vector Machines which are supervised learning models with associated learning algorithms that analyze data used for classification and regression analysis. Given a set of training samples, each marked as belonging to one or the other of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary linear classifier.


In [None]:
# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc

In pattern recognition, the k-Nearest Neighbors algorithm (or k-NN for short) is a non-parametric method used for classification and regression. A sample is classified by a majority vote of its neighbors, with the sample being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. 
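
The majority vote described above can be sketched in a few lines of plain Python. This is an illustrative toy, not how `KNeighborsClassifier` is implemented; the function name `knn_predict` and the points are made up for the example.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9)]
labels = [0, 0, 0, 1, 1]

near_first_cluster = knn_predict(points, labels, (1.1, 1.0), k=3)
near_second_cluster = knn_predict(points, labels, (5.1, 5.0), k=1)
print(near_first_cluster, near_second_cluster)
```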


In [None]:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem.

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian

The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not). It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector. The algorithm allows for online learning, in that it processes elements in the training set one at a time.
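
The online learning rule can be sketched as follows: each sample is visited in turn and the weights are adjusted only when the prediction is wrong. This is a simplified illustration on a tiny linearly separable problem (logical AND), not scikit-learn's `Perceptron` implementation; the function names are hypothetical.

```python
def predict(weights, bias, x):
    """Linear predictor followed by a threshold at zero."""
    activation = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if activation > 0 else 0

def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Perceptron learning rule, processing one sample at a time."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)  # 0 when the prediction is correct
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND: only the input (1, 1) belongs to class 1
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
print(predictions)
```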

In [None]:
# Perceptron

perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron

In [None]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

In [None]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

This model uses a decision tree as a predictive model which maps features (tree branches) to conclusions about the target value (tree leaves). Tree models where the target variable can take a finite set of values are called classification trees; in these tree structures, leaves represent class labels and branches represent conjunctions of features that lead to those class labels. Decision trees where the target variable can take continuous values (typically real numbers) are called regression trees. 

The model confidence score is the highest among models evaluated so far.

In [None]:
# Decision Tree

decision_tree = tree.DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

The next model Random Forests is one of the most popular. Random forests or random decision forests are an ensemble learning method for classification, regression and other tasks, that operate by constructing a multitude of decision trees (n_estimators=100) at training time and outputting the class that is the mode of the classes (classification) or mean prediction (regression) of the individual trees.

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest

## Model evaluation
We can now rank our evaluation of all the models to choose the best one for our problem. While both Decision Tree and Random Forest score the same, we choose to use Random Forest as they correct for decision trees' habit of overfitting to their training set.

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron', 
              'Stochastic Gradient Descent', 'Linear SVC', 
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron, 
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
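
Since the scores in this table are training accuracies, a fully grown decision tree can score perfectly simply by memorising the data. One hedged way to compare the two tree-based models more fairly is k-fold cross-validation; the sketch below uses synthetic data from `make_classification` as a stand-in for the water samples.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X and Y
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean accuracy over 5 folds; each fold is scored on data not used for fitting
cv_scores = {
    name: cross_val_score(model, X_demo, y_demo, cv=5).mean()
    for name, model in candidates.items()
}

# Training accuracy of an unrestricted tree, for comparison
tree_train_acc = DecisionTreeClassifier(random_state=0).fit(X_demo, y_demo).score(X_demo, y_demo)

for name, score in cv_scores.items():
    print(f"{name}: {score:.3f}")
print(f"Decision Tree (training accuracy): {tree_train_acc:.3f}")
```

On the notebook's data, `X` and `Y` would replace `X_demo` and `y_demo`.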

D3 Tree visualisation - from https://www.kaggle.com/bhavesh09/titanic-decision-tree-visual-with-d3-js

In [None]:
# rules defined in the tree object clf
def rules(clf, features, labels, node_index=0):
    """Structure of rules in a fit decision tree classifier

    Parameters
    ----------
    clf : DecisionTreeClassifier
        A tree that has already been fit.

    features, labels : lists of str
        The names of the features and labels, respectively.

    """
    node = {}
    if clf.tree_.children_left[node_index] == -1:  # indicates leaf
        #count_labels = zip(clf.tree_.value[node_index, 0], labels)
        #node['name'] = ', '.join(('{} of {}'.format(int(count), label)
        #                          for count, label in count_labels))
        node['type']='leaf'
        node['value'] = clf.tree_.value[node_index, 0].tolist()
        node['error'] = np.float64(clf.tree_.impurity[node_index]).item()
        node['samples'] = clf.tree_.n_node_samples[node_index]
    else:
        feature = features[clf.tree_.feature[node_index]]
        threshold = clf.tree_.threshold[node_index]
        node['type']='split'
        node['label'] = '{} > {}'.format(feature, threshold)
        node['error'] = np.float64(clf.tree_.impurity[node_index]).item()
        node['samples'] = clf.tree_.n_node_samples[node_index]
        node['value'] = clf.tree_.value[node_index, 0].tolist()
        left_index = clf.tree_.children_left[node_index]
        right_index = clf.tree_.children_right[node_index]
        node['children'] = [rules(clf, features, labels, right_index),
                            rules(clf, features, labels, left_index)]
        
    return node

class MyEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        elif isinstance(obj, np.floating):
            return float(obj)
        elif isinstance(obj, np.ndarray):
            return obj.tolist()
        else:
            return super(MyEncoder, self).default(obj)

In [None]:
cols = X_train.columns
d = rules(decision_tree, cols, None)
with open('output.json', 'w') as outfile:  
    json.dump(d, outfile,cls=MyEncoder)

j = json.dumps(d, cls=MyEncoder)
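
If the interactive D3 viewer is not available, scikit-learn's `export_text` offers a plain-text view of the same split rules. The sketch below fits a shallow tree on synthetic data; the feature names `f0` to `f3` and the `max_depth=2` setting are assumptions chosen to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the training data
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
feature_names = ["f0", "f1", "f2", "f3"]

# A shallow tree keeps the printed rules readable
small_tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)

rules_text = export_text(small_tree, feature_names=feature_names)
print(rules_text)
```

For the tree fitted in this notebook, the call would be `export_text(decision_tree, feature_names=list(X_train.columns))`, although an unrestricted tree may produce a very long listing.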

In [None]:
html_string = """
<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
    <script type="text/javascript" src="https://d3js.org/d3.v3.min.js"></script>
    <style type="text/css">
body {
  font-family: "Helvetica Neue", Helvetica;
}
.hint {
  font-size: 12px;
  color: #999;
}
.node rect {
  cursor: pointer;
  fill: #fff;
  stroke-width: 1.5px;
}
.node text {
  font-size: 11px;
}
path.link {
  fill: none;
  stroke: #ccc;
}
    </style>
  </head>
  <body>
    <div id="body">
      <div id="footer">
        Decision Tree viewer
        <div class="hint">click to expand or collapse</div>
        <div id="menu">
          <select id="datasets"></select>
        </div>

      </div>
    </div>    
"""

In [None]:
js_string="""
 var m = [20, 120, 20, 120],
    w = 1280 - m[1] - m[3],
    h = 800 - m[0] - m[2],
    i = 0,
    rect_width = 80,
    rect_height = 20,
    max_link_width = 20,
    min_link_width = 1.5,
    char_to_pxl = 6,
    root;
// Add datasets dropdown
d3.select("#datasets")
    .on("change", function() {
      if (this.value !== '-') {
        d3.json(this.value + ".json", load_dataset);
      }
    })
  .selectAll("option")
    .data([
      "-",
      "output"
    ])
  .enter().append("option")
    .attr("value", String)
    .text(String);
var tree = d3.layout.tree()
    .size([h, w]);
var diagonal = d3.svg.diagonal()
    .projection(function(d) { return [d.x, d.y]; });
var vis = d3.select("#body").append("svg:svg")
    .attr("width", w + m[1] + m[3])
    .attr("height", h + m[0] + m[2] + 1000)
  .append("svg:g")
    .attr("transform", "translate(" + m[3] + "," + m[0] + ")");
// global scale for link width
var link_stoke_scale = d3.scale.linear();
var color_map = d3.scale.category10();
// stroke style of link - either color or function
var stroke_callback = "#ccc";
function load_dataset(json) {
  root = json;
  root.x0 = 0;
  root.y0 = 0;
  var n_samples = root.samples;
  var n_labels = root.value.length;
  if (n_labels >= 2) {
    stroke_callback = mix_colors;
  } else if (n_labels === 1) {
    stroke_callback = mean_interpolation(root);
  }
  link_stoke_scale = d3.scale.linear()
                             .domain([0, n_samples])
                             .range([min_link_width, max_link_width]);
  function toggleAll(d) {
    if (d && d.children) {
      d.children.forEach(toggleAll);
      toggle(d);
    }
  }
  // Initialize the display to show a few nodes.
  root.children.forEach(toggleAll);
  update(root);
}
function update(source) {
  var duration = d3.event && d3.event.altKey ? 5000 : 500;
  // Compute the new tree layout.
  var nodes = tree.nodes(root).reverse();
  // Normalize for fixed-depth.
  nodes.forEach(function(d) { d.y = d.depth * 180; });
  // Update the nodes…
  var node = vis.selectAll("g.node")
      .data(nodes, function(d) { return d.id || (d.id = ++i); });
  // Enter any new nodes at the parent's previous position.
  var nodeEnter = node.enter().append("svg:g")
      .attr("class", "node")
      .attr("transform", function(d) { return "translate(" + source.x0 + "," + source.y0 + ")"; })
      .on("click", function(d) { toggle(d); update(d); });
  nodeEnter.append("svg:rect")
      .attr("x", function(d) {
        var label = node_label(d);
        var text_len = label.length * char_to_pxl;
        var width = d3.max([rect_width, text_len])
        return -width / 2;
      })
      .attr("width", 1e-6)
      .attr("height", 1e-6)
      .attr("rx", function(d) { return d.type === "split" ? 2 : 0;})
      .attr("ry", function(d) { return d.type === "split" ? 2 : 0;})
      .style("stroke", function(d) { return d.type === "split" ? "steelblue" : "olivedrab";})
      .style("fill", function(d) { return d._children ? "lightsteelblue" : "#fff"; });
  nodeEnter.append("svg:text")
      .attr("dy", "12px")
      .attr("text-anchor", "middle")
      .text(node_label)
      .style("fill-opacity", 1e-6);
  // Transition nodes to their new position.
  var nodeUpdate = node.transition()
      .duration(duration)
      .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });
  nodeUpdate.select("rect")
      .attr("width", function(d) {
        var label = node_label(d);
        var text_len = label.length * char_to_pxl;
        var width = d3.max([rect_width, text_len])
        return width;
      })
      .attr("height", rect_height)
      .style("fill", function(d) { return d._children ? "lightsteelblue" : "#fff"; });
  nodeUpdate.select("text")
      .style("fill-opacity", 1);
  // Transition exiting nodes to the parent's new position.
  var nodeExit = node.exit().transition()
      .duration(duration)
      .attr("transform", function(d) { return "translate(" + source.x + "," + source.y + ")"; })
      .remove();
  nodeExit.select("rect")
      .attr("width", 1e-6)
      .attr("height", 1e-6);
  nodeExit.select("text")
      .style("fill-opacity", 1e-6);
  // Update the links
  var link = vis.selectAll("path.link")
      .data(tree.links(nodes), function(d) { return d.target.id; });
  // Enter any new links at the parent's previous position.
  link.enter().insert("svg:path", "g")
      .attr("class", "link")
      .attr("d", function(d) {
        var o = {x: source.x0, y: source.y0};
        return diagonal({source: o, target: o});
      })
      .transition()
      .duration(duration)
      .attr("d", diagonal)
      .style("stroke-width", function(d) {return link_stoke_scale(d.target.samples);})
      .style("stroke", stroke_callback);
  // Transition links to their new position.
  link.transition()
      .duration(duration)
      .attr("d", diagonal)
      .style("stroke-width", function(d) {return link_stoke_scale(d.target.samples);})
      .style("stroke", stroke_callback);
  // Transition exiting nodes to the parent's new position.
  link.exit().transition()
      .duration(duration)
      .attr("d", function(d) {
        var o = {x: source.x, y: source.y};
        return diagonal({source: o, target: o});
      })
      .remove();
  // Stash the old positions for transition.
  nodes.forEach(function(d) {
    d.x0 = d.x;
    d.y0 = d.y;
  });
}
// Toggle children.
function toggle(d) {
  if (d.children) {
    d._children = d.children;
    d.children = null;
  } else {
    d.children = d._children;
    d._children = null;
  }
}
// Node labels
function node_label(d) {
  if (d.type === "leaf") {
    // leaf
    var formatter = d3.format(".2f");
    var vals = [];
    d.value.forEach(function(v) {
        vals.push(formatter(v));
    });
    return "[" + vals.join(", ") + "]";
  } else {
    // split node
    return d.label;
  }
}
/**
 * Mixes colors according to the relative frequency of classes.
 */
function mix_colors(d) {
  var value = d.target.value;
  var sum = d3.sum(value);
  var col = d3.rgb(0, 0, 0);
  value.forEach(function(val, i) {
    var label_color = d3.rgb(color_map(i));
    var mix_coef = val / sum;
    col.r += mix_coef * label_color.r;
    col.g += mix_coef * label_color.g;
    col.b += mix_coef * label_color.b;
  });
  return col;
}
/**
 * A linear interpolator for value[0].
 *
 * Useful for link coloring in regression trees.
 */
function mean_interpolation(root) {
  var max = 1e-9,
      min = 1e9;
  function recurse(node) {
    if (node.value[0] > max) {
      max = node.value[0];
    }
    if (node.value[0] < min) {
      min = node.value[0];
    }
    if (node.children) {
      node.children.forEach(recurse);
    }
  }
  recurse(root);
  var scale = d3.scale.linear().domain([min, max])
                               .range(["#2166AC","#B2182B"]);
  function interpolator(d) {
    return scale(d.target.value[0]);
  }
  return interpolator;
}
 """

In [None]:
# Render the HTML scaffold first, then run the D3 script against it
display(HTML(html_string))
display(Javascript(js_string))