# d3.js Introduction

* Author: [Ming-Yuan Jian](mailto:mjian@andrew.cmu.edu)
* Date: 2016-11-04

![D3 Logo](https://camo.githubusercontent.com/722a5cc12c7d40231ebeb8ca6facdc8547e2abf7/68747470733a2f2f64336a732e6f72672f6c6f676f2e737667)

## Introduction
In data science, we often deal with large and complex data sets. Even when professionals understand the analysis, we often want to summarize what we have done or show what the data look like, so visualization becomes very important. In Python we have matplotlib, and in JavaScript we have D3.

[D3](https://d3js.org/), which stands for Data-Driven Documents, is a JavaScript library for visualizing data using web standards. It has the following advantages:
* It runs in any modern browser.
* It can render (and even update) the graph on the client side.
* The text in the graph is selectable.

In this tutorial, we will demonstrate several examples of visualizing the properties of documents: the calendar view, the bubble chart, and the partition graph. We'll use d3.js version 4.2.7, the latest at the time of writing.

## The dataset

The documents I chose are the text of the [Federalist Papers](https://en.wikipedia.org/wiki/The_Federalist_Papers), a series of essays written in 1787 and 1788 by Alexander Hamilton, James Madison, and John Jay to promote the ratification of the U.S. Constitution. The following Python code splits the raw text downloaded from [Project Gutenberg](http://www.gutenberg.org/ebooks/18) into separate papers and parses the number, the authors, the publication date, and the length of each paper.

In [1]:
import calendar, re
month_names = calendar.month_name[:]  # [0] is empty string, [1]: "January"
YEARS = "(178[78])"
MONTH_PATTERN = "(" + "|".join(month_names[1:]) + ")"
DATE_PATTERN = MONTH_PATTERN + "\\s+(\\d+),\\s+" + YEARS

def load_federalist_corpus(filename):
    """ Load the federalist papers as a tokenized list of strings, one for each essay
    """
    with open(filename, "rt") as f:
        data = f.read()
    papers = data.split("FEDERALIST")
    
    # all start with "To the people of the State of New York:" (sometimes . instead of :)
    # all end with PUBLIUS (or no end at all)
    locations = [(p.rfind("of the State of New York"), p.rfind("PUBLIUS")) for p in papers]
    locations = [(-1 if loc[0] == -1 else loc[0] + 25, loc[1]) for loc in locations]
    papers_content = [papers[i][loc[0]:loc[1]] for i, loc in enumerate(locations)]
    
    # discard entries that are not actually a paper
    is_paper = [len(p) > 0 for p in papers_content]  # a list, so it can be indexed in Python 3 too
    papers = [p for i, p in enumerate(papers) if is_paper[i]]
    papers_content = [p for i, p in enumerate(papers_content) if is_paper[i]]
    # replace all whitespace with a single space
    papers_content = [re.sub(r"\s+", " ", p).lower() for p in papers_content]

    # add spaces before all punctuation, so they are separate tokens
    punctuation = set(re.findall(r"[^\w\s]+", " ".join(papers_content))) - {"-","'"}
    for c in punctuation:
        papers_content = [p.replace(c, " "+c+" ") for p in papers_content]
    papers_content = [re.sub(r"\s+", " ", p).lower().strip() for p in papers_content]
    
    authors = [tuple(re.findall("MADISON|JAY|HAMILTON", a)) for a in papers]
    
    numbers = [re.search(r"No\. \d+", p).group(0) for p in papers if re.search(r"No\. \d+", p)]
    dates = [re.search(YEARS, p) for p in papers]
    for i, date in enumerate(dates):
        if date:
            match = re.search(DATE_PATTERN, papers[i][:date.end()])
            month = month_names.index(match.group(1))
            dates[i] = "{}-{:02d}-{}".format(match.group(3), month, match.group(2).zfill(2))
        else:
            dates[i] = ''
            
    return papers_content, authors, numbers, dates
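
As a quick check of the date-parsing step, the sketch below rebuilds the same `DATE_PATTERN` and applies it to a single header line. The helper `to_iso_date` and the sample header string are illustrative assumptions, not lines taken from the corpus or from the function above.

```python
import calendar
import re

month_names = calendar.month_name[:]  # index 0 is the empty string
YEARS = "(178[78])"
MONTH_PATTERN = "(" + "|".join(month_names[1:]) + ")"
DATE_PATTERN = MONTH_PATTERN + r"\s+(\d+),\s+" + YEARS


def to_iso_date(header):
    """Return the first date found in `header` as YYYY-MM-DD, or '' if there is none."""
    match = re.search(DATE_PATTERN, header)
    if match is None:
        return ""
    month = month_names.index(match.group(1))
    return "{}-{:02d}-{}".format(match.group(3), month, match.group(2).zfill(2))


# Hypothetical header text, made up for illustration.
sample_header = "From the New York Packet. Tuesday, February 19, 1788."
print(to_iso_date(sample_header))
print(repr(to_iso_date("no date here")))
```

Returning an empty string for a missing date mirrors how the loader marks undated papers.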
    

After we get the information from the raw text, we organize it into a JSON file so we can load it easily in JavaScript. For example, each paper in our JSON file will look like:

    {"date": "1788-02-19", "length": 13146, "number": "No. 57", "authors": ["HAMILTON", "MADISON"]}

* `date`: publication date of the paper
* `number`: the paper number
* `length`: number of characters in the paper
* `authors`: list of the authors of the paper

In [2]:
import json

papers, authors, numbers, dates = load_federalist_corpus("pg18.txt")
# Build a list (not a lazy map object) so that json.dump also works in Python 3
papers_info = [{"number": n, "authors": list(a), "date": d, "length": len(p)}
               for n, a, d, p in zip(numbers, authors, dates, papers)]
# print(papers_info[56])  # uncomment to see the object
with open("papers_info.json", "w") as f:
    json.dump(papers_info, f)

## Load the data in JavaScript
In the following block, we load the *d3.js* library. If you are writing a web page, you can put the following tag in your HTML to load d3.js.

    <script type="text/javascript" charset="utf-8" src="https://d3js.org/d3.v4.min.js"></script>

In [13]:
%%javascript
require.config({
    paths: {
        d3: 'https://d3js.org/d3.v4.min'
    }
});

var css_rule = "body { shape-rendering: auto; }"; // Make visualization prettier
var styleElement = document.createElement("style");
styleElement.type = "text/css";
styleElement.appendChild(document.createTextNode(css_rule));
document.head.appendChild(styleElement);


After we load d3.js, we can use `d3.json` to load our data.

In [14]:
%%javascript
require(['d3'], function(d3){
    d3.json("papers_info.json", function(error, json) {
        if (error) throw error;  // surface loading problems instead of failing silently
        window.papers_info = json;
    });
});


## Calendar View

A calendar view is useful for spotting patterns over days, such as peak user counts or stock prices. Here we count the papers published on each day and show the counts on a calendar. The code is adapted from [Calendar View](http://bl.ocks.org/mbostock/4063318) by [Mike Bostock](http://bl.ocks.org/mbostock).

First, we add some CSS rules that define how each month and day should look. If you are writing an HTML file, just put the following `<style>` tag before `</head>`.

    <style type="text/css">
    .month {
        fill: none;
        stroke: #000;
        stroke-width: 2px;
    }
    /* ... (other styles) */
    </style>

In [15]:
%%javascript
var css_rule_calendar_view = ".month { fill: none; stroke: #000; stroke-width: 2px; }" +
        ".day { fill: #fff; stroke: #ccc; }" +
        ".year .q0 {fill: rgb(238, 238, 238)}" + // From gray to green (using GitHub colors)
        ".year .q1 {fill: rgb(214, 230, 133)}" + // Least green
        ".year .q2 {fill: rgb(140, 198, 101)}" +
        ".year .q3 {fill: rgb( 68, 163,  64)}" +
        ".year .q4 {fill: rgb( 30, 104,  35)}"; // Greenest
var styleElement = document.createElement("style");
styleElement.type = "text/css";
styleElement.appendChild(document.createTextNode(css_rule_calendar_view));
document.head.appendChild(styleElement);


Below is the main visualization code, which can be described in 6 parts:
* Create 2 `<svg>` elements of size 136 x 960, one per year from 1787 (inclusive) to 1789 (exclusive)
* Add a `<text>` element to each svg showing the year
* Create 365 or 366 (in a leap year) `<rect>` elements of size 17 x 17, one per day
* Add a `<title>` to each `<rect>` showing the date
* Create 12 `<path>` elements, each outlining one month of the year
* Convert the paper list to a d3 nest object that counts the papers per date, pick the CSS class from `q1` (one paper published on that day) to `q4` (4 or more papers published on that day), and set the title of that day to `"[Date]: [# of articles]"`. Days without papers keep the plain `day` style.
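
To make the last step concrete, here is a minimal Python sketch of what the nest/rollup and the quantize scale compute. It is not d3 code: `count_by_date` and `quantize_class` are hypothetical helpers, the sample records are made up, and the threshold arithmetic reflects my understanding of how `d3.scaleQuantize` cuts the `[0, 5]` domain into five equal bins, with out-of-range values falling into the first or last bin.

```python
from bisect import bisect_right
from collections import Counter


def count_by_date(papers):
    """Count papers per date string; undated papers share the '' key."""
    return Counter(p["date"] for p in papers)


def quantize_class(count, domain=(0, 5), classes=("q0", "q1", "q2", "q3", "q4")):
    """Map a count to a CSS class by cutting the domain into equal-width bins."""
    lo, hi = domain
    n = len(classes)
    # Interior thresholds; for (0, 5) and five classes these are 1, 2, 3, 4.
    thresholds = [lo + (hi - lo) * i / n for i in range(1, n)]
    return classes[bisect_right(thresholds, count)]


# Made-up records in the same shape as papers_info.json.
papers = [
    {"number": "No. A", "date": "1788-02-01"},
    {"number": "No. B", "date": "1788-02-01"},
    {"number": "No. C", "date": "1787-11-30"},
    {"number": "No. D", "date": ""},
]
counts = count_by_date(papers)
for day in sorted(d for d in counts if d):
    print(day, counts[day], quantize_class(counts[day]))
print("Papers without date:", counts[""])
```

The empty-string key plays the same role as the `"$"` entry that the JavaScript code reads to report the number of papers without a date.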

In [16]:
%%javascript
require(['d3'], function(d3) {
    $("#calendar_view").remove();
    element.append("<div id='calendar_view'></div>"); // append to jupyter output block
    $("#calendar_view").width("960px");
    $("#calendar_view").height("300px");   
    
    var width = 960, height = 136; // each year in 136 * 960
    var cellSize = 17, weeksPerYear = 53; // cell size
    var dateFormat = d3.timeFormat("%Y-%m-%d");
    
    var calendar_view = d3.select("#calendar_view");
    
    var year = calendar_view.selectAll("svg")
            .data(d3.range(1787, 1789))
            .enter().append("svg")
            .attr("width", width)
            .attr("height", height)
            .attr("class", "year")
            .append("g")
            .attr("transform",
                  "translate(" + ((width - cellSize * weeksPerYear) / 2) + "," + (height - cellSize * 7 - 1) + ")");

    year.append("text")
            .attr("transform", "translate(-6," + cellSize * 3.5 + ") rotate(-90)")
            .style("text-anchor", "middle")
            .text(function(d) { return d; });

    var rect = year.selectAll(".day")
            .data(function(d) { return d3.timeDays(new Date(d, 0, 1), new Date(d + 1, 0, 1)); })
            .enter().append("rect")
            .attr("class", "day")
            .attr("width", cellSize)
            .attr("height", cellSize)
            .attr("x", function(d) { return d3.timeWeek.count(d3.timeYear(d), d) * cellSize; })
            .attr("y", function(d) { return d.getDay() * cellSize; })
            .datum(dateFormat);

    rect.append("title").text(function(d) { return d; });

    year.selectAll(".month")
            .data(function(d) { return d3.timeMonths(new Date(d, 0, 1), new Date(d + 1, 0, 1)); })
            .enter().append("path")
            .attr("class", "month")
            .attr("d", monthPath);

    function monthPath(t0) {
        var t1 = new Date(t0.getFullYear(), t0.getMonth() + 1, 0),
                d0 = t0.getDay(), w0 = d3.timeWeek.count(d3.timeYear(t0), t0),
                d1 = t1.getDay(), w1 = d3.timeWeek.count(d3.timeYear(t1), t1);
        return "M" + (w0 + 1) * cellSize + "," + d0 * cellSize +
                "H" + w0 * cellSize + "V" + 7 * cellSize +
                "H" + w1 * cellSize + "V" + (d1 + 1) * cellSize +
                "H" + (w1 + 1) * cellSize + "V" + 0 +
                "H" + (w0 + 1) * cellSize + "Z";
    }

    var data = d3.nest()
            .key(function(d) { return d.date; })
            .rollup(function(ds) { return ds.length })
            .map(window.papers_info);

    calendar_view.data([data["$"]]);
    calendar_view.append('div')
            .style("width", width + "px")
            .style("text-align", "center")
            .text(function(d) { return "Papers without date: " + d; });

    var colorPicker = d3.scaleQuantize().domain([0, 5])
            .range(d3.range(5).map(function(number) { return "q" + number; }));

    rect.filter(function(d) { return "$"+d in data; })
            .attr("class", function(d) { return "day " + colorPicker(data["$"+d]); })
            .select("title")
            .text(function(d) { return d + ": " + data["$"+d]; });
});


You will get a graph like this:
![](calendar_view.png)
As the chart shows, most of the dated papers were published on a Tuesday or a Friday (rows run from Sunday at the top to Saturday at the bottom). You can hover your mouse over each cell to see how many papers were published on that day.
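
The position of each day cell follows from two numbers: the day of the week picks the row, and the number of week boundaries since January 1 picks the column. Below is a minimal Python sketch of the same arithmetic, assuming weeks start on Sunday as JavaScript's `getDay()` does. `cell_position` is a hypothetical helper, not part of d3.

```python
from datetime import date

CELL_SIZE = 17  # same cell size as in the calendar code


def cell_position(day, cell_size=CELL_SIZE):
    """Return (x, y) of the top-left corner of the cell for `day`."""
    jan1 = date(day.year, 1, 1)
    # Python's weekday() has Monday = 0; shift so that Sunday = 0 as in JavaScript.
    jan1_dow = (jan1.weekday() + 1) % 7
    dow = (day.weekday() + 1) % 7
    # The number of Sundays passed since January 1 gives the column.
    week = ((day - jan1).days + jan1_dow) // 7
    return week * cell_size, dow * cell_size


print(cell_position(date(1787, 1, 1)))   # first cell of the 1787 calendar
print(cell_position(date(1787, 1, 7)))   # first Sunday starts the second column
print(cell_position(date(1788, 2, 19)))
```

Because the row index is the weekday, a date's row in the chart can be checked against this function.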

## Bubble Chart

In this section, we create a bubble chart, which shows each paper as a circle whose size depends on the length of the paper. A bubble chart is handy for visualizing quantitative information such as budgets, the popularity of hashtags, or score distributions. The code is adapted from [Bubble Chart](http://bl.ocks.org/mbostock/4063269) by [Mike Bostock](http://bl.ocks.org/mbostock). Before we write the visualization code, we add a CSS rule to make the text shown in each circle prettier.

In [17]:
%%javascript
var css_rule_bubble_chart = ".bubble-label { font: 12px sans-serif; text-anchor: middle; }";
var styleElement = document.createElement("style");
styleElement.type = "text/css";
styleElement.appendChild(document.createTextNode(css_rule_bubble_chart));
document.head.appendChild(styleElement);


Below is the code for the bubble chart, which does the following:
* Insert an `<svg>` element of size 660 x 960, which will contain all the bubbles
* Convert the paper list to a d3 hierarchy object. At the same time, we store the concatenated author string, which is used for color selection and for the title text
* Calculate x, y, and r (radius) for each node using `pack(root)`, and exclude the root node (which sums all lengths) by taking `leaves()`
* Create one `<g class="node">` element per paper
* Create one `<circle>` element in each `<g class="node">`
* Create one `<text>` element in each `<g class="node">`, showing the paper number
* Create one `<title>` element in each `<g class="node">`, showing the authors and the length


In [18]:
%%javascript
require(['d3'], function(d3) {
    $("#bubble_chart").remove();
    element.append('<svg id="bubble_chart" width="960" height="660"></svg>');
    var bubble_chart = d3.select("#bubble_chart");
    var width = +bubble_chart.attr("width");   // attr() returns strings; coerce to numbers
    var height = +bubble_chart.attr("height");
    var numberFormat = d3.format(",d");
    var pickColor = d3.scaleOrdinal(d3.schemeCategory10);
    var pack = d3.pack().size([width, height]).padding(1.5);

    var root = d3.hierarchy({children: window.papers_info})
            .sum(function(d) { return d.length; })
            .each(function(d) {
                if (d.data.authors) {
                    d.authors = d.data.authors.join(", ")
                }
            });

    var node = bubble_chart.selectAll(".node")  // match the class assigned below
            .data(pack(root).leaves())
            .enter().append("g")
            .attr("class", "node")
            .attr("transform", function(d) { return "translate(" + d.x + "," + d.y + ")"; });

    node.append("circle")
            .attr("id", function(d) { return d.data.number; })
            .attr("r", function(d) { return d.r; })
            .style("fill", function(d) { return pickColor(d.authors); });

    node.append("text")
            .selectAll("tspan")
            .data(function(d) { return [d.data.number]; })
            .enter().append("tspan")
            .attr("class", "bubble-label")
            .attr("x", 0)
            .attr("y", function(d, i, nodes) { return 13 + (i - nodes.length / 2 - 0.5) * 10; })
            .text(function(d) { return d; });

    node.append("title")
            .text(function(d) { return d.authors + "\n" + numberFormat(d.data.length); });
});



You will get a graph like this:
![](bubble_chart.png)
In this chart, we can see that Hamilton (blue) wrote most of the papers, and that Jay's papers (orange) are fewer and shorter than the rest. Madison's papers are in green, and the papers attributed to Hamilton or Madison are in red. You can hover your mouse over each circle to see the authors and the length of a specific paper.
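
The color assignment can be understood with a small sketch. As far as I understand `d3.scaleOrdinal`, when no domain is given it hands out range colors in the order in which keys are first seen, which would explain why the author of the first paper gets the first color. The class below is a hypothetical Python imitation, and the author sequence is made up for illustration. The four hex codes are, to my knowledge, the first four entries of `d3.schemeCategory10`.

```python
class OrdinalScale:
    """Assign range values to keys in first-seen order, cycling when exhausted."""

    def __init__(self, values):
        self.values = list(values)
        self.index = {}  # key -> position in first-seen order

    def __call__(self, key):
        if key not in self.index:
            self.index[key] = len(self.index)
        return self.values[self.index[key] % len(self.values)]


# First four colors of the category-10 palette: blue, orange, green, red.
pick_color = OrdinalScale(["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"])

# Hypothetical order in which author strings are first encountered.
for authors in ["HAMILTON", "JAY", "HAMILTON", "MADISON", "HAMILTON, MADISON"]:
    print(authors, pick_color(authors))
```

Because the key is the joined author string, a paper with two listed authors gets its own color rather than sharing one with either author.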

## Partition
A partition is a graph that shows how a quantity is distributed across a hierarchy. For example, if you have a directory that contains sub-directories and files at different depths, you may want to show how much space each file and directory takes. In that case, you can draw a partition graph in which the file size is proportional to the area of a rectangle or to the arc length in a circle.


### Data preprocessing
Before we can draw a partition graph, we need to format the data so that `d3.hierarchy` can read it easily.
Here we group papers by author, then by year and month, and we put papers without a publication date as direct children of the author.
After the processing, part of the JSON looks like this:

    {
        "name": "MADISON",
        "children": [{
            "name": "1787-11",
            "children": [
                {"date": "1787-11-23", "length": 18203, "name": "No. 10"},
                {"date": "1787-11-30", "length": 12872, "name": "No. 14"}
            ]}, {
            "name": "1788-01",
            "children": [
                {"date": "1788-01-11", "length": 16894, "name": "No. 37"},
                {"date": "1788-01-15", "length": 19854, "name": "No. 38"},
                {"date": "1788-01-18", "length": 18577, "name": "No. 40"},
                {"date": "1788-01-22", "length": 17165, "name": "No. 42"},
                {"date": "1788-01-25", "length": 17731, "name": "No. 44"},
                {"date": "1788-01-29", "length": 16001, "name": "No. 46"}
            ]}, {
            "name": "1788-02",
            "children": [
                {"date": "1788-02-01", "length": 17217, "name": "No. 47"},
                {"date": "1788-02-01", "length": 11696, "name": "No. 48"}
            ]},
            {"date": "", "length": 15860, "name": "No. 39"},
            {"date": "", "length": 21397, "name": "No. 41"},
            {"date": "", "length": 21075, "name": "No. 43"},
            {"date": "", "length": 13092, "name": "No. 45"},
            {"date": "", "length": 12823, "name": "No. 58"}
        ]
    }


The formatting code is not included here because the main focus is on the visualization.
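
For readers who want to reproduce the file, one possible way to build this structure is sketched below. This is not the original formatting code: `build_tree` is a hypothetical helper, and it assumes that papers with several listed authors are grouped under the joined author string. The three sample records reuse the numbers, dates, and lengths from the Madison excerpt above.

```python
import json
from collections import OrderedDict


def build_tree(papers_info):
    """Group papers by author string, then by year-month; undated papers stay
    as direct children of the author node."""
    by_author = OrderedDict()
    for paper in papers_info:
        author = ", ".join(paper["authors"])  # assumption: joint authors form one group
        leaf = {"date": paper["date"], "length": paper["length"], "name": paper["number"]}
        months, undated = by_author.setdefault(author, (OrderedDict(), []))
        if paper["date"]:
            months.setdefault(paper["date"][:7], []).append(leaf)  # key is "YYYY-MM"
        else:
            undated.append(leaf)

    tree = []
    for author, (months, undated) in by_author.items():
        children = [{"name": month, "children": leaves}
                    for month, leaves in sorted(months.items())]
        tree.append({"name": author, "children": children + undated})
    return tree


sample = [
    {"number": "No. 10", "authors": ["MADISON"], "date": "1787-11-23", "length": 18203},
    {"number": "No. 14", "authors": ["MADISON"], "date": "1787-11-30", "length": 12872},
    {"number": "No. 39", "authors": ["MADISON"], "date": "", "length": 15860},
]
print(json.dumps(build_tree(sample), indent=2))
```

The result is a list of author nodes, which matches the way the visualization code wraps the loaded data in `{children: data}`. Dumping `build_tree(papers_info)` with `json.dump` would then produce a file such as `papers_info_tree.json`.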

### Rectangle Partition (aka Icicle)
Before the main visualization code, we add some CSS rules to define the rectangle color and fonts.

In [19]:
%%javascript
var css_rule_partition_rect = "#partition_rect .node rect { fill: #ddd; }" + 
"#partition_rect .node text { font: 11px sans-serif; }" +
"#partition_rect .node tspan:last-child { font-size: 10px; fill-opacity: 0.5; }" +
"#partition_rect .node.internal text { font-weight: bold; }" +
"#partition_rect .node.leaf rect { fill-opacity: 0.6; }";
var styleElement = document.createElement("style");
styleElement.type = "text/css";
styleElement.appendChild(document.createTextNode(css_rule_partition_rect));
document.head.appendChild(styleElement);


Below is the code that draws the partition in rectangles, which does the following:
* Insert an `<svg>` element of size 1800 x 960, in which the graph will be drawn.
* Define the number formatter that will be used in all nodes.
* Define the color selector that will be used in leaf nodes.
* Define the partition function, which takes `[width, height]` as its size and adds 1px padding around each rectangle. (**NOTE**: here we pass the size as `[height, width]` so the partition will be vertical. If you want the graph to be horizontal, use `[width, height]` and swap all `x0` and `y0`, and `x1` and `y1`.)
* Read the paper tree asynchronously (using `d3.json`) into a d3 hierarchy object. At the same time, we sum the total paper length for every node and initialize the `d.name` field.
* Calculate `x0`, `x1`, `y0`, and `y1` with `partition(root)`.
* Create one `<g class="node">` element for every node (`root.descendants()` returns all nodes, including leaf nodes and internal nodes).
    * Add class `"leaf"` if the node has no children, or `"internal"` otherwise.
    * Set the x and y offset (i.e. the top left corner of the rectangle) in the `transform` attribute.
* Create one `<rect>` element in each `<g class="node">`, whose width and height are determined by `d.y1 - d.y0` and `d.x1 - d.x0`.
    * Pick the fill color for leaf nodes, using the color of the author (the ancestor at depth 1).
* Create one `<text>` element in each `<g class="node">` with 2 `<tspan>` elements showing the name and the value respectively.
* Create one `<title>` element in each `<g class="node">`, showing the name, date and length.
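
The layout step in the list above can be illustrated with a few lines of Python. This is a simplified sketch, not the d3 implementation: it ignores padding and rounding, uses a fixed band width per depth level (an assumption), and the helper names and the tiny tree are made up. It only shows the core idea that each child receives a share of its parent's `x0`..`x1` span proportional to its summed length.

```python
def value(node):
    """Sum of `length` over all leaves under `node`."""
    if "children" in node:
        return sum(value(c) for c in node["children"])
    return node.get("length", 0)


def partition(node, x0, x1, depth=0, band=100):
    """Annotate `node` and its descendants with x0/x1 (share of the breadth axis)
    and y0/y1 (band for the node's depth)."""
    node["x0"], node["x1"] = x0, x1
    node["y0"], node["y1"] = depth * band, (depth + 1) * band
    children = node.get("children", [])
    total = sum(value(c) for c in children)
    start = x0
    for child in children:
        span = (x1 - x0) * value(child) / total if total else 0
        partition(child, start, start + span, depth + 1, band)
        start += span


# Made-up tree: one internal node with two leaves, plus one leaf at depth 1.
root = {"children": [
    {"name": "A", "children": [{"name": "a1", "length": 30}, {"name": "a2", "length": 10}]},
    {"name": "B", "length": 60},
]}
partition(root, 0, 1000)
a, b = root["children"]
print(a["name"], a["x0"], a["x1"], a["y0"], a["y1"])
print(b["name"], b["x0"], b["x1"], b["y0"], b["y1"])
```

Note that leaf `B` sits at depth 1 while `a1` and `a2` sit at depth 2, which is the same situation as the undated papers that hang directly under an author.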

In [21]:
%%javascript
require(['d3'], function(d3) {
    $("#partition_rect").remove();
    element.append('<svg id="partition_rect" width="960" height="1800"></svg>');
    var svg = d3.select("#partition_rect"),
        width = +svg.attr("width"),
        height = +svg.attr("height");

    var format = d3.format(",d");

    var color = d3.scaleOrdinal(d3.schemeCategory10);

    var partition = d3.partition()
        .size([height, width])
        .padding(1)
        .round(true);

    d3.json("papers_info_tree.json", function(error, data) {
      if (error) throw error;

      var root = d3.hierarchy({children: data})
          .sum(function(d) { return (d.length === undefined)? 0 : d.length; })
          .each(function(d) { d.name = d.data.name ? d.data.name : ""; });

      partition(root);

      var cell = svg
        .selectAll(".node")
        .data(root.descendants())
        .enter().append("g")
          .attr("class", function(d) { return "node" + (d.children ? " internal" : " leaf"); })
          .attr("transform", function(d) { return "translate(" + d.y0 + "," + d.x0 + ")"; });

      cell.append("rect")
          .attr("width", function(d) { return d.y1 - d.y0; })
          .attr("height", function(d) { return d.x1 - d.x0; })
        .filter(function(d) { return !d.children; })
          .style("fill", function(d) {
              while (d.depth > 1) d = d.parent;
              return color(d.name);
          });

      cell.append("text")
          .attr("x", 4)
        .selectAll("tspan")
          .data(function(d) { return [d.name, " " + format(d.value)]; })
        .enter().append("tspan")
          .attr("y", 13)
          .text(function(d) { return d; });

      cell.append("title")
          .text(function(d) {
              if (!d.children) {  // leaf nodes
                  if (d.data.date === undefined || d.data.date === "")
                      return d.name + "\ndate unknown\n" + format(d.value);
                  return d.name + "\n" + d.data.date + "\n" + format(d.value);
              } else if (d.name !== "") {  // intermediate nodes
                  return d.name + "\n" + format(d.value);
              } else {  // root node
                  return "Total\n" + format(d.value);
              }
          });
    });

});


Running the code above should generate a graph like this:
![](partition_rect.png)
The partition layout is flexible: the leaves do not need to sit at the same depth. For example, the papers without a date are shown in the third column of the graph. If you hover over a block in the graph, you can see the exact date when the paper was published, or "date unknown" if the publication date is unavailable.

### Circle Partition
The partition can also be rendered as a circle. See [Vinicius Gravina's example](http://bl.ocks.org/vgrocha/1580af34e56ee6224d33) of a bilevel partition, where you can zoom in or out on specific sections. The following is a screenshot of what it looks like.
![](bilevel-partition.png)

## Conclusion
To keep things simple, I did not cover how to update a graph dynamically. You can check out the Circle Partition example above, which demonstrates the smoothness and beauty of D3.

D3 is not tied to one specific use. Instead, we can find hundreds of examples in the [D3 repository's Gallery](https://github.com/d3/d3/wiki/Gallery). So I think the best way to learn D3 is:
* Pick a graph that presents your data the way you want
* Walk through the code and edit it to hold your data
* Change the colors and fonts to make it look great



## References

* [D3.js - Data-Driven Documents](https://d3js.org/)
* Michael Li. [Embedding D3 in an IPython Notebook](http://blog.thedataincubator.com/2015/08/embedding-d3-in-an-ipython-notebook/).
* Ben Blank. [How do you add CSS with Javascript?](http://stackoverflow.com/questions/707565/how-do-you-add-css-with-javascript)
* Mike Bostock. [Calendar View](http://bl.ocks.org/mbostock/4063318).
* Mike Bostock. [Bubble Chart](http://bl.ocks.org/mbostock/4063269).
* Mike Bostock. [Partition](http://bl.ocks.org/mbostock/2e73ec84221cb9773f4c).
* Vinicius Gravina. [Bilevel Partition](http://bl.ocks.org/vgrocha/1580af34e56ee6224d33).