
new socrata post

1 parent a21c6a5 commit 17d23982e6493b30fa7827ae286ee0f14b9f46f5 @tlevine committed Jul 19, 2013
View
439 !/feed.xml
@@ -2,14 +2,412 @@
<feed xmlns="http://www.w3.org/2005/Atom">
<id>http://www.thomaslevine.com/</id>
<title>Thomas Levine</title>
- <updated>2013-07-10T07:00:00Z</updated>
+ <updated>2013-07-19T07:00:00Z</updated>
<link rel="alternate" href="http://www.thomaslevine.com/"/>
<link rel="self" href="http://www.thomaslevine.com/!/feed.xml"/>
<author>
<name>Thomas Levine</name>
<uri>http://www.thomaslevine.com</uri>
</author>
<entry>
+ <id>tag:www.thomaslevine.com,2013-07-19:/!/socrata-genealogies/index.html</id>
+ <title type="html">Progeny of Ten Socrata Datasets</title>
+ <published>2013-07-19T07:00:00Z</published>
+ <updated>2013-07-19T07:00:00Z</updated>
+ <link rel="alternate" href="http://www.thomaslevine.com/!/socrata-genealogies/index.html"/>
+ <content type="html">&lt;p&gt;Governments and other organizations have recently been trying to open up their
+data. Socrata’s &lt;a href="http://www.socrata.com/open-data-portal/"&gt;Open Data Portal&lt;/a&gt; software
+is one tool that tries to help with this; an organization using Socrata is given a website
+(“portal”) hosted by Socrata where they can upload their datasets and where
+the public can download them.&lt;/p&gt;
+
+&lt;p&gt;I recently downloaded all of the metadata about all of the datasets from all
+of the Socrata portals and then posted this &lt;a href="/!/socrata-summary"&gt;summary&lt;/a&gt; of
+the data. Now on to some deeper analysis.&lt;/p&gt;
+
+&lt;h2 id="what-is-a-dataset"&gt;What is a dataset?&lt;/h2&gt;
+&lt;p&gt;As the Twitters have pointed out, the dataset counts that I presented in my
+initial summary are somewhat deceptive.&lt;/p&gt;
+
+&lt;p&gt;&lt;a href="https://twitter.com/tomschenkjr/status/354010005504147456"&gt;&lt;img src="tomschenkjr.png" alt="@tomschenkjr Tweets about Chicago's filters" /&gt;&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;a href="https://twitter.com/SR_spatial/status/354088265344749568"&gt;&lt;img src="SR_spatial.png" alt="@SR_spatial Tweets about patterns of derived datasets" /&gt;&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;Many of the things that I was calling a dataset can be seen as a
+copy or a derivative of another dataset. In this post, I’ll discuss&lt;/p&gt;
+
+&lt;ol&gt;
+ &lt;li&gt;Socrata concepts and terminology&lt;/li&gt;
+ &lt;li&gt;Ways that we can arrive at apparent duplicates in Socrata data&lt;/li&gt;
+ &lt;li&gt;The progeny of ten Socrata datasets&lt;/li&gt;
+&lt;/ol&gt;
+
+&lt;h2 id="socrata-terminology"&gt;Socrata terminology&lt;/h2&gt;
+&lt;p&gt;Most of my work on this for the past week has been figuring out
+Socrata’s terminology and schema. Let’s define some Socrata terms.&lt;/p&gt;
+
+&lt;h3 id="everything-is-a-view"&gt;Everything is a view&lt;/h3&gt;
+&lt;p&gt;When you go to the home page of a Socrata portal, you can
+“Search &amp;amp; Browse Datasets and Views”. This phrasing is sort
+of wrong. “&lt;strong&gt;view&lt;/strong&gt;” is just a generic concept that refers
+to any sort of file or data that is presented to a user.&lt;/p&gt;
+
+&lt;p&gt;Everything in that list on the home page is a view. I haven’t
+yet explained what a dataset is, but a dataset is a type of
+view. For example, the top two views in
+&lt;a href="https://explore.data.gov/"&gt;explore.data.gov&lt;/a&gt; are currently (July 17)
+&lt;a href="https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut"&gt;White House Visitor Records Requests&lt;/a&gt;
+and &lt;a href="https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut"&gt;U.S. Overseas Loans and Grants (Greenbook)&lt;/a&gt;.&lt;/p&gt;
+
+&lt;p&gt;&lt;a href="https://explore.data.gov/"&gt;&lt;img src="search-browse.png" alt="Search &amp;amp; Browse Datasets and Views" class="wide" /&gt;&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;You also get a list of “View Types”. Below, I define some
+of these view types.&lt;/p&gt;
+
+&lt;h3 id="datasets"&gt;Datasets&lt;/h3&gt;
+&lt;p&gt;Let’s start with the &lt;strong&gt;dataset&lt;/strong&gt;.
+A dataset is what you get when you upload data to Socrata in one of
+its supported tabular formats.&lt;/p&gt;
+
+&lt;h3 id="filtered-views"&gt;Filtered views&lt;/h3&gt;
+&lt;p&gt;Before I define “filtered views”, I want to explain why they exist.
+Socrata helps people publish their data by providing various APIs
+for importing from different data sources, and Socrata helps people
+consume data by providing a data analysis suite inside the web browser.
+This includes maps and graphs and whatnot that you can embed in
+websites rather than just in PDF documents.&lt;/p&gt;
+
+&lt;p&gt;Socrata also allows you to “Filter”
+datasets. For example, here I filter the list of
+&lt;a href="https://data.oaklandnet.com/Environmental/Public-Works-Volunteer-Opportunities/sduu-bfki"&gt;Public Works Volunteer Opportunities&lt;/a&gt;
+to include only opportunities on July 29.&lt;/p&gt;
+
+&lt;p&gt;&lt;img src="filter.png" alt="Filtering on date July 29" class="wide" /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;a href="https://data.oaklandnet.com/Environmental/Volunteer-Opportunities-on-July-29/vyhb-nqtw"&gt;Here&lt;/a&gt;’s the resulting filtered view.&lt;/p&gt;
+
+&lt;p&gt;&lt;strong&gt;Filtered views&lt;/strong&gt; are queries on a dataset. The queries are represented internally in the
+&lt;a href="http://dev.socrata.com/deprecated/querying-datasets"&gt;SODA filter query language&lt;/a&gt;.&lt;/p&gt;
+
+&lt;h3 id="charts-and-maps"&gt;Charts and maps&lt;/h3&gt;
+&lt;p&gt;&lt;strong&gt;Charts&lt;/strong&gt; and &lt;strong&gt;maps&lt;/strong&gt; are also queries on a dataset.
+The difference between filtered views, charts and maps is quite subtle.
+They are all queries on datasets; they just display a different
+visualization when you view them on the Socrata website.&lt;/p&gt;
+
+&lt;p&gt;There are other types of views, but we don’t need to know about them
+for now.&lt;/p&gt;
+
+&lt;h3 id="tables"&gt;Tables&lt;/h3&gt;
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/family.jpg" alt="A table family, containing a dataset and several filtered views, charts and maps" class="wide" /&gt;
+&lt;!-- Icons from https://explore.data.gov/stylesheets/images/icons/type_icons_30.png?1 --&gt;&lt;/p&gt;
+
+&lt;p&gt;There is also a concept of a &lt;strong&gt;table&lt;/strong&gt;, and
+it is somewhat abstract. Here are two ways of thinking of it.&lt;/p&gt;
+
+&lt;p&gt;First, a more conceptual explanation.
+After someone uploads a dataset, a variety of filtered views,
+charts and maps can emerge. I see this as a family of views,
+with the parent being the original dataset and the descendants
+being all of the filtered views, charts and maps that make
+SODA queries on the original dataset. In Socrata, this family
+is called a table.&lt;/p&gt;
+
+&lt;p&gt;Next, a more technical explanation.
+The data are stored in a table, and this table is not exposed directly to users.
+The rawest form of the table is exposed through a dataset, which is an empty
+query on the table (equivalent to &lt;code&gt;SELECT * FROM table_name;&lt;/code&gt;). Filtered views,
+charts and maps act on the table rather than on the source dataset; they’re just
+like datasets, except that they include a query.&lt;/p&gt;
+
+&lt;h3 id="federation"&gt;Federation&lt;/h3&gt;
+&lt;p&gt;Socrata doesn’t provide a particularly obvious means for searching multiple
+data portals at once. (This was part of my motivation for downloading all of
+the datasets.) But it is possible for one data portal to include all of
+another portal’s datasets.&lt;/p&gt;
+
+&lt;p&gt;Sometimes, you’ll see a view in the search &amp;amp; browse pane with a gray background,
+instead of white. Hawaii has a bunch of these.&lt;/p&gt;
+
+&lt;p&gt;&lt;a href="https://data.hawaii.gov/"&gt;&lt;img src="hawaii.png" alt="Hawaii data portal" class="wide" /&gt;&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;These views are “provided” by other portals through a process called
+“federation”. The destination portal (data.hawaii.gov in the above screenshot)
+makes a request to the source portal (explore.data.gov in the above screenshot)
+to federate the source portal’s data.&lt;/p&gt;
+
+&lt;p&gt;This request shows up in the administrator interface for the source portal.
+If the source portal accepts the request, all of the views from the source portal
+are provided to the destination portal as in the screenshot above. Here are
+&lt;a href="http://www.socrata.com/video/socrata-open-data-federation-demonstration/"&gt;two&lt;/a&gt;
+&lt;a href="http://www.socrata.com/datagov/open-data-federation-video/"&gt;videos&lt;/a&gt; about that.&lt;/p&gt;
+
+&lt;p&gt;If you look closely, you’ll notice that the federated views are actually just
+links to the source portal; the views show up in the search, but they aren’t
+otherwise copied to the destination portal.&lt;/p&gt;
+
+&lt;h2 id="types-of-duplicate-datasets"&gt;Types of duplicate datasets&lt;/h2&gt;
+
+&lt;p&gt;&lt;a href="https://twitter.com/richmanmax/status/353956877501087746"&gt;&lt;img src="deduuuuuupe.png" alt="@richmanmax Tweets &amp;quot;deduuuuuupe.&amp;quot;" /&gt;&lt;/a&gt;&lt;/p&gt;
+
+&lt;p&gt;Now that you know a bit more about how Socrata works, I can explain my three
+categories of datasets-that-I-counted-twice.&lt;/p&gt;
+
+&lt;h3 id="soda-queries-filtered-views-charts-maps"&gt;SODA queries: Filtered views, charts, maps&lt;/h3&gt;
+&lt;p&gt;After a dataset is uploaded, people can create many views that derive from it.
+In my previous analysis, I counted filtered views, charts and maps all as separate
+entities. I think it’s worth separating these because they can be derived from the
+source datasets.&lt;/p&gt;
+
+&lt;p&gt;If people are using Socrata as it is intended, there should be tons of filtered
+views, charts and maps, and they’ll give us an interesting picture of how the
+portal is being used.&lt;/p&gt;
+
+&lt;h3 id="federation-1"&gt;Federation&lt;/h3&gt;
+&lt;p&gt;When datasets are federated, &lt;em&gt;all&lt;/em&gt; of the datasets from the source portal are
+provided to the destination. (You can’t pick and choose.) That is, they show up
+in search as links to the source portal.&lt;/p&gt;
+
+&lt;p&gt;In my previous analysis, I counted federated datasets as belonging to the portal
+to which they’re provided. Also, I downloaded them in a way that made it hard for
+me to figure out what the source portal was. It’s easy to fix, so I might download
+them all again and graph the network of federation across Socrata portals.&lt;/p&gt;
+
+&lt;p&gt;(For those who are curious, the issue was that I followed HTTP redirects and
+didn’t record whether I was following a redirect or accessing the page directly.)&lt;/p&gt;
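+
+&lt;p&gt;A minimal sketch of that fix, assuming that a federated view redirects to its
+source portal: compare the host that was requested with the host of the final URL.
+The function name and the example URLs are hypothetical.&lt;/p&gt;
+
```python
from urllib.parse import urlsplit


def classify_view(requested_url, final_url):
    """Record where a view was listed and where it actually lives."""
    listed_on = urlsplit(requested_url).hostname
    source_portal = urlsplit(final_url).hostname
    return {
        "listed_on": listed_on,
        "source_portal": source_portal,
        # Assumption: a change of host after redirects means federation.
        "federated": listed_on != source_portal,
    }


# Hypothetical example: a view listed on one portal that resolves to another.
record = classify_view(
    "https://data.hawaii.gov/-/-/644b-gaut",
    "https://explore.data.gov/-/-/644b-gaut",
)
print(record["federated"], record["source_portal"])
```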
+
+&lt;h3 id="copied-rather-than-elegantly-linked"&gt;Copied rather than elegantly linked&lt;/h3&gt;
+&lt;p&gt;Some datasets have simply been uploaded to two different portals.
+Lombardia’s museums dataset is an example of that.&lt;/p&gt;
+
+&lt;!-- I don't know why the Kramdown table syntax isn't working here. --&gt;
+&lt;table&gt;
+ &lt;thead&gt;
+ &lt;tr&gt;
+ &lt;th&gt;Portal&lt;/th&gt;
+ &lt;th&gt;Identifier&lt;/th&gt;
+ &lt;th&gt;Rows&lt;/th&gt;
+ &lt;th&gt;Columns&lt;/th&gt;
+ &lt;th&gt;Downloads&lt;/th&gt;
+ &lt;/tr&gt;
+ &lt;/thead&gt;
+ &lt;tbody&gt;
+ &lt;tr&gt;
+ &lt;td&gt;dati.lombardia.it&lt;/td&gt;
+ &lt;td&gt;&lt;a href="https://dati.lombardia.it/Cultura/Musei/3syc-54zf?"&gt;3syc-54zf&lt;/a&gt;&lt;/td&gt;
+ &lt;td&gt;234&lt;/td&gt;
+ &lt;td&gt;56&lt;/td&gt;
+ &lt;td&gt;1675&lt;/td&gt;
+ &lt;/tr&gt;
+ &lt;tr&gt;
+ &lt;td&gt;opendata.socrata.com&lt;/td&gt;
+ &lt;td&gt;&lt;a href="https://opendata.socrata.com/Education/Musei-Lombardi/54y8-wyde?"&gt;54y8-wyde&lt;/a&gt;&lt;/td&gt;
+ &lt;td&gt;234&lt;/td&gt;
+ &lt;td&gt;56&lt;/td&gt;
+ &lt;td&gt;9&lt;/td&gt;
+ &lt;/tr&gt;
+ &lt;/tbody&gt;
+&lt;/table&gt;
+
+&lt;p&gt;I identified this group by looking for datasets with the same number of rows,
+the same number of columns, and similar names.
+I haven’t done it on a larger scale, but that would be fun to do later.&lt;/p&gt;
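+
+&lt;p&gt;Here is a rough sketch of how that search might look on a larger scale. It is
+not the code I ran; the field names, the similarity threshold and the third record
+are assumptions, and the names are guessed from the URLs above.&lt;/p&gt;
+
```python
from difflib import SequenceMatcher
from itertools import combinations

# The first two records use the figures from the table above;
# the third is a made-up record with the same shape but an unrelated name.
views = [
    {"portal": "dati.lombardia.it", "id": "3syc-54zf",
     "name": "Musei", "nrow": 234, "ncol": 56},
    {"portal": "opendata.socrata.com", "id": "54y8-wyde",
     "name": "Musei Lombardi", "nrow": 234, "ncol": 56},
    {"portal": "example.org", "id": "aaaa-0000",
     "name": "Potholes", "nrow": 234, "ncol": 56},
]


def name_similarity(a, b):
    """Similarity between two names, from 0 (unrelated) to 1 (identical)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def likely_copies(views, threshold=0.5):
    """Pairs of views on different portals with equal shape and similar names."""
    pairs = []
    for a, b in combinations(views, 2):
        same_shape = (a["nrow"], a["ncol"]) == (b["nrow"], b["ncol"])
        different_portal = a["portal"] != b["portal"]
        similar = name_similarity(a["name"], b["name"]) >= threshold
        if same_shape and different_portal and similar:
            pairs.append((a["id"], b["id"]))
    return pairs


print(likely_copies(views))
```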
+
+&lt;h2 id="ten-large-dataset-families"&gt;Ten large dataset families&lt;/h2&gt;
+&lt;p&gt;It took me quite a while to figure out everything that I explained above.
+(That’s a story in itself.) My goal all along was to start looking
+at how families of datasets are related, so now I’ll talk about what I
+did on that front.&lt;/p&gt;
+
+&lt;h3 id="methodology"&gt;Methodology&lt;/h3&gt;
+&lt;p&gt;I grouped all of the views that I had collected by table. (Recall that
+a table in Socrata is a dataset plus the family of views that derives
+from that particular dataset.)&lt;/p&gt;
+
+&lt;p&gt;Once I had grouped them, I found the ten largest families, by number of
+different views. To be clear, this is the number of Socrata entities
+called “views” rather than the number of times people viewed the dataset.
+(Confusingly, Socrata also provides the latter sort of view count, and
+I’ve included that figure in the present report.)&lt;/p&gt;
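+
+&lt;p&gt;The grouping can be sketched like this, assuming that each view’s metadata
+names its table in a field called &lt;code&gt;tableId&lt;/code&gt;. The function name and
+the sample records are made up for illustration.&lt;/p&gt;
+
```python
from collections import defaultdict


def largest_families(views, n=10):
    """Group views by table and return the n families with the most views."""
    families = defaultdict(list)
    for view in views:
        # Assumption: each metadata record names its table in "tableId".
        families[view["tableId"]].append(view)
    return sorted(families.values(), key=len, reverse=True)[:n]


# Made-up records, just to show the shape of the result.
views = [
    {"id": "aaaa-0001", "tableId": 1},
    {"id": "aaaa-0002", "tableId": 1},
    {"id": "aaaa-0003", "tableId": 1},
    {"id": "bbbb-0001", "tableId": 2},
    {"id": "cccc-0001", "tableId": 3},
    {"id": "cccc-0002", "tableId": 3},
]
top = largest_families(views, n=2)
print([len(family) for family in top])
```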
+
+&lt;p&gt;I took these ten families, and I
+show them in the fancy table at the end of this page. Select a dataset,
+and then you can see all of that dataset plus all of the filtered views,
+maps and charts of that dataset. You can also see which portals each of
+these datasets is federated to. You can sort by the different columns,
+and you can click on a row to see more detail.&lt;/p&gt;
+
+&lt;p&gt;And in case you’re reading this a year later, the data were collected from
+Socrata portals at the end of May 2013.&lt;/p&gt;
+
+&lt;h3 id="discussion"&gt;Discussion&lt;/h3&gt;
+&lt;p&gt;&lt;em&gt;This section might make more sense if you play with the fancy table first.&lt;/em&gt;&lt;/p&gt;
+
+&lt;h4 id="why-its-not-a-tree"&gt;Why it’s not a tree&lt;/h4&gt;
+&lt;p&gt;In Socrata, you can create a filtered view, chart or map based on a dataset,
+and the link to the source dataset will be preserved. This is represented
+in the table below.&lt;/p&gt;
+
+&lt;p&gt;Unfortunately, the genealogy is not recorded any deeper than this; if you
+create a new filtered view based on an existing filtered view, the SODA query
+is simply combined between the two views, and the new filtered view is
+represented as a child of the original dataset rather than a child of the old
+filtered view.&lt;/p&gt;
+
+&lt;p&gt;Thus, we don’t get the full family tree that you might have expected.&lt;/p&gt;
+
+&lt;h4 id="compare-family-statistics-with-view-statistics"&gt;Compare family statistics with view statistics&lt;/h4&gt;
+&lt;p&gt;In some cases, like the White House visitor records requests, most of the
+downloads and hits for the whole family come from the source dataset.
+In other cases, like the World Bank major contract awards, only a small
+minority come from the source dataset. The plots below illustrate this
+difference.&lt;/p&gt;
+
+&lt;p&gt;The first plot looks at hits, and the second at downloads. Within each plot,
+the left (red) dot is the number of hits/downloads that the source dataset
+received and the right (blue) dot is the total hits/downloads across the whole
+family.&lt;/p&gt;
+
+&lt;p&gt;If these are close to each other (that is, the black line is short),
+most of the hits/downloads came from the source dataset.
+If they are far apart, most
+hits/downloads came from filtered views, charts and maps.&lt;/p&gt;
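+
+&lt;p&gt;The same comparison can be written as a single number: the share of a family’s
+hits or downloads that belongs to the source dataset. This is only a sketch; the
+function name and the figures are made up, and the field names follow the
+&lt;code&gt;viewCount&lt;/code&gt; and &lt;code&gt;downloadCount&lt;/code&gt; fields used in the table
+at the end of this page.&lt;/p&gt;
+
```python
def source_share(family, source_id, field="viewCount"):
    """Fraction of the family's total for `field` that the source dataset received."""
    total = sum(view[field] for view in family)
    source = sum(view[field] for view in family if view["id"] == source_id)
    return source / total if total else 0.0


# Made-up family: one source dataset and one filtered view.
family = [
    {"id": "src", "viewCount": 900, "downloadCount": 50},
    {"id": "child", "viewCount": 100, "downloadCount": 150},
]
print(source_share(family, "src"))
print(source_share(family, "src", field="downloadCount"))
```
+
+&lt;p&gt;A share near 1 corresponds to a short line in the plots, and a share near 0 to a long one.&lt;/p&gt;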
+
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/hits.png" alt="Hits by dataset family" class="wide" /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/downloads.png" alt="Downloads by dataset family" class="wide" /&gt;&lt;/p&gt;
+
+&lt;p&gt;This information might tell us something about
+how people like to use the data. Perhaps people working with the World Bank
+contracts are interested in subsets for their particular region and time.
+And maybe people are just playing with the White House data because it’s the
+first one in the list.&lt;/p&gt;
+
+&lt;h4 id="view-size-and-shape"&gt;View size and shape&lt;/h4&gt;
+&lt;p&gt;The view size and shape give us an idea of what sort of queries people are running.
+Are people selecting certain variables, or are they aggregating or subsetting
+the records?&lt;/p&gt;
+
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/query-1.jpg" alt="A rectangle indicating the original dataset" /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/query-2.jpg" alt="The same rectangle, with a shorter one for a record subset" /&gt;&lt;/p&gt;
+
+&lt;p&gt;&lt;img src="/!/socrata-genealogies/query-3.jpg" alt="The same rectangles, with a tall, thin one for a selection of variables" /&gt;&lt;/p&gt;
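+
+&lt;p&gt;As a rough sketch, the pictures above amount to comparing a view’s row and column
+counts with those of its source dataset. The function name, the labels and the
+&lt;code&gt;nrow&lt;/code&gt; and &lt;code&gt;ncol&lt;/code&gt; fields are assumptions for this example.&lt;/p&gt;
+
```python
def describe_shape(source, view):
    """Guess what kind of query a view runs from its size relative to the source."""
    fewer_records = source["nrow"] > view["nrow"]
    fewer_variables = source["ncol"] > view["ncol"]
    if fewer_records and fewer_variables:
        return "fewer records and fewer variables"
    if fewer_records:
        return "fewer records (subset or aggregation)"
    if fewer_variables:
        return "fewer variables (selection)"
    return "same shape as the source"


# Made-up sizes for a source dataset and two views of it.
source = {"nrow": 1000, "ncol": 20}
print(describe_shape(source, {"nrow": 40, "ncol": 20}))
print(describe_shape(source, {"nrow": 1000, "ncol": 3}))
```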
+
+&lt;h4 id="federation-2"&gt;Federation&lt;/h4&gt;
+&lt;p&gt;As I discussed earlier, federation is all-or-nothing; you either include all
+of the source portal’s datasets or none of them. So you would expect that the
+“Federation” column would list the same number of copies for each dataset.
+In at least one instance (FEC contributions), this is not the case.
+I haven’t figured out what’s going on there.&lt;/p&gt;
+
+&lt;h3 id="relevance"&gt;Relevance&lt;/h3&gt;
+&lt;p&gt;Socrata exposes enough of the data analysis process that we can start to see
+what sorts of analyses different people are doing. We can see what sorts of
+datasets are interesting to people. We may even be able to develop new
+guidelines for publishing datasets through analysis of what makes datasets more
+likely to be viewed, downloaded and filtered on Socrata.&lt;/p&gt;
+
+&lt;h3 id="data-family-explorer"&gt;Data family explorer&lt;/h3&gt;
+&lt;p&gt;And now, the aforementioned fancy table. As I said above, this table contains
+the families/tables associated with the ten datasets with the largest families.
+Select a dataset, and then you can see all of that dataset plus all of the
+filtered views, charts and maps, with some information about each. And if you
+sort by “Created” date, the first one should be the source dataset.&lt;/p&gt;
+
+&lt;!-- Scripts after the introduction so you don't notice the table loading --&gt;
+&lt;script src="angular.min.js"&gt;&lt;/script&gt;
+
+&lt;script src="angular-table.js"&gt;&lt;/script&gt;
+
+&lt;script src="angular-strap.js"&gt;&lt;/script&gt;
+
+&lt;script src="script.js"&gt;&lt;/script&gt;
+
+&lt;link rel="stylesheet" href="style.css" /&gt;
+
+&lt;div ng-app="genealogy"&gt;
+ &lt;div ng-controller="GenealogyCtrl"&gt;
+ &lt;select ng-model="table" ng-options="t.source.name for t in tables"&gt;
+ &lt;option value=""&gt;Choose a dataset&lt;/option&gt;
+ &lt;/select&gt;
+ &lt;div ng-show="table"&gt;
+ &lt;h3&gt;The family/table&lt;/h3&gt;
+ &lt;ul&gt;
+ &lt;li&gt;&lt;strong&gt;Original source&lt;/strong&gt;: &lt;a href="https://{{table.source.portal}}/-/-/{{table.source.id}}"&gt;{{table.source.portal}}&lt;/a&gt;&lt;/li&gt;
+ &lt;li&gt;&lt;strong&gt;Number of children&lt;/strong&gt;: {{table.datasets.length}}&lt;/li&gt;
+ &lt;li&gt;&lt;strong&gt;Total downloads&lt;/strong&gt;: {{ table.totals.downloadCount }}&lt;/li&gt;
+ &lt;li&gt;&lt;strong&gt;Total views&lt;/strong&gt;: {{ table.totals.viewCount }}&lt;/li&gt;
+ &lt;li&gt;&lt;strong&gt;Description&lt;/strong&gt;: {{table.source.description}}&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;h3&gt;Its member views&lt;/h3&gt;
+ &lt;angular-table model="table.datasets" default-sort-column="createdAt"&gt;
+ &lt;header-row&gt;
+
+ &lt;header-column sortable="true" sort-field-name="name"&gt;
+ &lt;div style="display: inline-block;"&gt;Name&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;header-column class="skinny" sortable="true" sort-field-name="createdAt"&gt;
+ &lt;div style="display: inline-block;"&gt;Created&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;header-column class="skinny" sortable="true" sort-field-name="viewCount"&gt;
+ &lt;div style="display: inline-block;"&gt;Hits&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;header-column class="skinny" sortable="true" sort-field-name="downloadCount"&gt;
+ &lt;div style="display: inline-block;"&gt;Down-loads&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;header-column class="less-skinny" sortable="true" sort-field-name="ncell"&gt;
+ &lt;div style="display: inline-block;"&gt;Size&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;header-column sortable="true" sort-field-name="ncopies"&gt;
+ &lt;div style="display: inline-block;"&gt;Federation&lt;/div&gt;
+ &lt;sort-arrow-ascending&gt;&lt;/sort-arrow-ascending&gt;
+ &lt;sort-arrow-descending&gt;&lt;/sort-arrow-descending&gt;
+ &lt;/header-column&gt;
+
+ &lt;/header-row&gt;
+
+ &lt;row on-selected="emptyFunction" selected-color="#111"&gt;
+ &lt;column&gt;&lt;a href="https://{{row.source_portal_hack}}/-/-/{{row.id}}" title="{{row.name}}"&gt;{{row.shortName}}&lt;/a&gt;&lt;/column&gt;
+ &lt;column class="skinny"&gt;{{row.prettyDate}}&lt;/column&gt;
+ &lt;column class="skinny number"&gt;{{row.viewCountPretty}}&lt;/column&gt;
+ &lt;column class="skinny number"&gt;{{row.downloadCountPretty}}&lt;/column&gt;
+ &lt;column class="less-skinny"&gt;
+ {{row.ncellPretty}} cells
+ &lt;ul style="list-style: none;" class="snug" ng-show="row.rowSelected"&gt;
+ &lt;li class="snug"&gt;&lt;small&gt;{{row.ncolPretty}} variables&lt;/small&gt;&lt;/li&gt;
+ &lt;li class="snug"&gt;&lt;small&gt;{{row.nrowPretty}} records&lt;/small&gt;&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;/column&gt;
+ &lt;column&gt;
+ &lt;span ng-hide="row.rowSelected"&gt;{{row.ncopiesPretty}}&lt;/span&gt;
+ &lt;ul class="snug" style="list-style: none;" ng-show="row.rowSelected"&gt;
+ &lt;li class="snug" ng-repeat="portal in row.portals"&gt;&lt;small&gt;{{portal}}&lt;/small&gt;&lt;/li&gt;
+ &lt;/ul&gt;
+ &lt;/column&gt;
+ &lt;/row&gt;
+ &lt;/angular-table&gt;
+ &lt;/div&gt;
+ &lt;/div&gt;
+&lt;/div&gt;
+</content>
+ </entry>
+ <entry>
<id>tag:www.thomaslevine.com,2013-07-10:/!/r-spells-for-data-wizards/index.html</id>
<title type="html">R spells for data wizards</title>
<published>2013-07-10T07:00:00Z</published>
@@ -833,44 +1231,5 @@ the French one with &lt;code&gt;setxkbmap&lt;/code&gt; like so.&lt;/p&gt;
&lt;p&gt;I still haven’t figured out how to select the Swedish one with &lt;code&gt;setxkbmap&lt;/code&gt;.&lt;/p&gt;
</content>
</entry>
- <entry>
- <id>tag:www.thomaslevine.com,2013-06-26:/!/tmux-aliases/index.html</id>
- <title type="html">tmux aliases</title>
- <published>2013-06-26T07:00:00Z</published>
- <updated>2013-06-26T07:00:00Z</updated>
- <link rel="alternate" href="http://www.thomaslevine.com/!/tmux-aliases/index.html"/>
- <content type="html">&lt;p&gt;Before I implemented my &lt;code&gt;tmuxa&lt;/code&gt; and &lt;code&gt;tmuxl&lt;/code&gt; aliases, the three &lt;code&gt;tmux&lt;/code&gt; calls
-that I used most often were &lt;code&gt;tmux&lt;/code&gt;, &lt;code&gt;tmux list-sessions&lt;/code&gt; and &lt;code&gt;tmux attach&lt;/code&gt;.&lt;/p&gt;
-
-&lt;pre&gt;&lt;code&gt;$ grep ^tmux ~/.history/sh-201*|sed -e s/^.*:// -e 's/ *$//' |sort|uniq -c
- 127 tmux
- 149 tmuxa
- 2 tmuxa 0
- 1 tmuxa -t0
- 3 tmuxa -t 0
- 6 tmuxa -t23
- 1 tmuxa -t 23
- 4 tmuxa -t32
- 1 tmuxa -t 5
- 15 tmux attach
- 2 tmux attach -t0
- 1 tmux attach -t16
- 1 tmux attach -t18
- 2 tmux attach -t 23
- 1 tmux attach -t27
- 1 tmux attach -t28
- 1 tmux attach -t 32
- 18 tmuxl
- 1 tmux --list-sessions
- 7 tmux list-sessions
-&lt;/code&gt;&lt;/pre&gt;
-
-&lt;p&gt;Those commands are long, so I &lt;a href="https://github.com/tlevine/.prophyl-teh-awesum/blob/master/source/tmux"&gt;made them shorter&lt;/a&gt;.&lt;/p&gt;
-
-&lt;pre&gt;&lt;code&gt;alias tmuxl='tmux list-sessions'
-alias tmuxa='tmux attach'
-&lt;/code&gt;&lt;/pre&gt;
-</content>
- </entry>
</feed>
View
12 !/index.html
@@ -76,15 +76,23 @@
</nav>
<header class="title-card">
<h1>
- <a href="r-spells-for-data-wizards/">R spells for data wizards</a>
+ <a href="socrata-genealogies/">Progeny of Ten Socrata Datasets</a>
</h1>
<div class="date">
- July 10, 2013
+ July 19, 2013
</div>
</header>
<div class="clearfix" id="links">
<div class="link">
<strong>
+ <a href="r-spells-for-data-wizards/">R spells for data wizards</a>
+ </strong>
+ <footer>
+ Jul 10, 2013
+ </footer>
+ </div>
+ <div class="link">
+ <strong>
<a href="socrata-summary/">Analyze all the datasets</a>
</strong>
<footer>
View
BIN !/socrata-genealogies/downloads.png
View
BIN !/socrata-genealogies/family.jpg
View
BIN !/socrata-genealogies/hits.png
View
97 !/socrata-genealogies/index.html
@@ -7,18 +7,18 @@
<!--<![endif]-->
<head>
<meta charset='utf-8'>
- <title>Progenies of Ten Socrata Datasets</title>
+ <title>Progeny of Ten Socrata Datasets</title>
<meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='description'>
<meta content='Thomas Levine' name='author'>
<link href='http://domain/humans.txt' rel='author' type='text/plain'>
<meta content='nanoc 3.6.4' name='generator'>
<meta content='width=device-width' name='viewport'>
<meta content='summary' name='twitter:card'>
<meta content='@thomaslevine' name='twitter:site'>
- <meta content='Progenies of Ten Socrata Datasets' name='twitter:title'>
+ <meta content='Progeny of Ten Socrata Datasets' name='twitter:title'>
<meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='twitter:description'>
<meta content='@thomaslevine' name='twitter:creator'>
- <meta content='http://thomaslevine.com/!/socrata-genealogies/screenshot.png' name='twitter:image:src'>
+ <meta content='http://thomaslevine.com/!/socrata-genealogies/family.png' name='twitter:image:src'>
<meta content='thomaslevine.com' name='twitter:domain'>
<meta content='' name='twitter:app:name:iphone'>
<meta content='' name='twitter:app:name:ipad'>
@@ -31,9 +31,9 @@
<meta content='' name='twitter:app:id:googleplay'>
<meta content='http://thomaslevine.com/!/socrata-genealogies/' property='og:url'>
<meta content='thomaslevine.com' property='og:site_name'>
- <meta content="It's cool what you can do when data analysis is logged and exposed publically over the web." property='og:description'>
+ <meta content="It's cool what you can do when everyone's data analysis is logged and exposed publicly over the web." property='og:description'>
<meta content='How are datasets transformed in Socrata, and what can we learn from that?' property='og:title'>
- <meta content='http://thomaslevine.com/!/socrata-genealogies/screenshot.png' property='og:image'>
+ <meta content='http://thomaslevine.com/!/socrata-genealogies/family.png' property='og:image'>
<link href='/favicon.ico' rel='icon' type='image/x-icon'>
<link href='/!/feed.xml' rel='alternate' title='Thomas Levine' type='application/atom+xml'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
@@ -73,10 +73,10 @@
</nav>
<header class='title-card'>
<h1>
- Progenies of Ten Socrata Datasets
+ Progeny of Ten Socrata Datasets
</h1>
<div class='date'>
-
+ July 19, 2013
</div>
</header>
<div id='article-wrapper'>
@@ -89,7 +89,7 @@
<p>I recently downloaded all of the metadata about all of the datasets from all
of the Socrata portals and then posted this <a href="/!/socrata-summary">summary</a> of
- the data. Now on to some deeper further analysis.</p>
+ the data. Now on to some deeper analysis.</p>
<h2 id="what-is-a-dataset">What is a dataset?</h2>
<p>As the Twitters have pointed out, the dataset counts that I presented in my
@@ -105,7 +105,7 @@ <h2 id="what-is-a-dataset">What is a dataset?</h2>
<ol>
<li>Socrata concepts and terminology</li>
<li>Ways that we can arrive at apparent duplicates in Socrata data</li>
- <li>The progenies of ten Socrata datasets</li>
+ <li>The progeny of ten Socrata datasets</li>
</ol>
<h2 id="socrata-terminology">Socrata terminology</h2>
@@ -148,7 +148,7 @@ <h3 id="filtered-views">Filtered views</h3>
<a href="https://data.oaklandnet.com/Environmental/Public-Works-Volunteer-Opportunities/sduu-bfki">Public Works Volunteer Opportunities</a>
to include only opportunities on July 29.</p>
- <p><a href="filter.png"><img src="filter.png" alt="Filtering on date July 29" class="wide" /></a></p>
+ <p><img src="filter.png" alt="Filtering on date July 29" class="wide" /></p>
<p><a href="https://data.oaklandnet.com/Environmental/Volunteer-Opportunities-on-July-29/vyhb-nqtw">Here</a>’s the resulting filtered view.</p>
@@ -165,6 +165,9 @@ <h3 id="charts-and-maps">Charts and maps</h3>
for now.</p>
<h3 id="tables">Tables</h3>
+ <p><img src="/!/socrata-genealogies/family.jpg" alt="A table family, containing a dataset and several filtered views, charts and maps" class="wide" />
+ <!-- Icons from https://explore.data.gov/stylesheets/images/icons/type_icons_30.png?1 --></p>
+
<p>There is also a concept of a <strong>table</strong>, and
it is somewhat abstract. Here are two ways of thinking of it.</p>
@@ -189,7 +192,7 @@ <h3 id="federation">Federation</h3>
the datasets.) But it is possible for one data portal to include all of
another portal’s datasets.</p>
- <p>Sometimes, you’ll see a view in the search &amp; browse pane with a grey background,
+ <p>Sometimes, you’ll see a view in the search &amp; browse pane with a gray background,
instead of white. Hawaii has a bunch of these.</p>
<p><a href="https://data.hawaii.gov/"><img src="hawaii.png" alt="Hawaii data portal" class="wide" /></a></p>
@@ -201,7 +204,9 @@ <h3 id="federation">Federation</h3>
<p>This request shows up in the administrator interface for the source portal.
If the source portal accepts the request, all of the views from the source portal
- are provided to the destination portal as in the screenshot above.</p>
+ are provided to the destination portal as in the screenshot above. Here are
+ <a href="http://www.socrata.com/video/socrata-open-data-federation-demonstration/">two</a>
+ <a href="http://www.socrata.com/datagov/open-data-federation-video/">videos</a> about that.</p>
<p>If you look closely, you’ll notice that the federated views are actually just
links to the source portal; the views show up in the search, but they aren’t
@@ -216,10 +221,7 @@ <h2 id="types-of-duplicate-datasets">Types of duplicate datasets</h2>
<h3 id="soda-queries-filtered-views-charts-maps">SODA queries: Filtered views, charts, maps</h3>
<p>After a dataset is uploaded, people can create many views that derive from it.
- Depending on what you want to know, it might not make sense to treat these as
- separate entities.</p>
-
- <p>In my previous analysis, I did count filtered views, charts and maps all as separate
+ In my previous analysis, I counted filtered views, charts and maps all as separate
entities. I think it’s worth separating these because they can be derived from the
source datasets.</p>
@@ -278,11 +280,10 @@ <h3 id="copied-rather-than-elegantly-linked">Copied rather than elegantly linked
I haven’t done it on a larger scale, but that would be fun to do later.</p>
<h2 id="ten-large-dataset-families">Ten large dataset families</h2>
- <p>It took me quite a while to figure out how all of this works.
+ <p>It took me quite a while to figure out everything that I explained above.
(That’s a story in itself.) My goal all along was to start looking
- at how families of datasets are related. I figured I’d make something
- a bit less sloppy than ggplot plots tiny text and with legends
- hanging off of the page.</p>
+ at how families of datasets are related, so now I’ll talk about what I
+ did on that front.</p>
<h3 id="methodology">Methodology</h3>
<p>I grouped all of the views that I had collected by table. (Recall that
@@ -296,7 +297,7 @@ <h3 id="methodology">Methodology</h3>
I’ve included that figure in the present report.)</p>
<p>Out of these datasets, I took the top ten datasets, and I
- show their families in the table at the end of this page. Select a dataset,
+ show their families in the fancy table at the end of this page. Select a dataset,
and then you can see all of that dataset plus all of the filtered views,
maps and charts of that dataset. You can also see which portals each of
these datasets is federated to. You can sort by the different columns,
@@ -305,7 +306,10 @@ <h3 id="methodology">Methodology</h3>
<p>And in case you’re reading this a year later, the data were collected from
Socrata portals at the end of May 2013.</p>
- <h3 id="why-its-not-a-tree">Why it’s not a tree</h3>
+ <h3 id="discussion">Discussion</h3>
+ <p><em>This section might make more sense if you play with the fancy table first.</em></p>
+
+ <h4 id="why-its-not-a-tree">Why it’s not a tree</h4>
<p>In Socrata, you can create a filtered view, chart or map based on a dataset,
and the link to the source dataset will be preserved. This is represented
in the table below.</p>
@@ -316,45 +320,66 @@ <h3 id="why-its-not-a-tree">Why it’s not a tree</h3>
represented as a child of the original dataset rather than a child of the old
filtered view.</p>
- <h3 id="things-to-look-for">Things to look for</h3>
-
- <h4 id="the-source-dataset">The source dataset</h4>
- <p>If you sort by “Created” date, the first one should be the source dataset.</p>
+ <p>Thus, we don’t get the full family tree that you might have expected.</p>
<h4 id="compare-family-statistics-with-view-statistics">Compare family statistics with view statistics</h4>
<p>In some cases, like with the White House visitor records requests, most of the
downloads and hits for the whole family are from this source dataset.
In other cases, like the World Bank major contract awards, only a small
- minority comes from this source dataset. This might tell us something about
+ minority comes from this source dataset. The plots below illustrate this
+ difference.</p>
+
+ <p>The first plot looks at hits, and the second at downloads. Within each plot,
+ the left (red) dot is the number of hits/downloads that the source dataset
+ received and the right (blue) dot is the total hits/downloads across the whole
+ family.</p>
+
+ <p>If these are close to each other (that is, the black line is short),
+ most of the hits/downloads came from the source dataset.
+ If they are far apart, most
+ hits/downloads came from filtered views, charts and maps.</p>
+
+ <p><img src="/!/socrata-genealogies/hits.png" alt="Hits by dataset family" class="wide" /></p>
+
+ <p><img src="/!/socrata-genealogies/downloads.png" alt="Downloads by dataset family" class="wide" /></p>
+
+ <p>This information might tell us something about
how people like to use the data. Perhaps people working with the World Bank
contracts are interested in subsets for their particular region and time.
And maybe people are just playing with the White House data because it’s the
first one in the list.</p>
- <h4 id="view-size">View size</h4>
- <p>The view size gives us an idea of what sort of queries people are running.
+ <h4 id="view-size-and-shape">View size and shape</h4>
+ <p>The view size and shape give us an idea of what sort of queries people are running.
Are people selecting certain variables, or are they aggregating or subsetting
the records?</p>
+ <p><img src="/!/socrata-genealogies/query-1.jpg" alt="A rectangle indicating the original dataset" /></p>
+
+ <p><img src="/!/socrata-genealogies/query-2.jpg" alt="The same rectangle, with a shorter one for a record subset" /></p>
+
+ <p><img src="/!/socrata-genealogies/query-3.jpg" alt="The same rectangles, with a tall, thin one for a selection of variables" /></p>
+
<h4 id="federation-2">Federation</h4>
<p>As I discussed earlier, federation is all-or-nothing; you either include all
of the source portal’s datasets or none of them. So you would expect that the
“Federation” column would list the same number of copies for each dataset.
- In at least one instance (FEC contributions), this is not the case. what’s
- going on there?</p>
+ In at least one instance (FEC contributions), this is not the case.
+ I haven’t figured out what’s going on there.</p>
<h3 id="relevance">Relevance</h3>
- <p>Frankly, this table is a rather terrible way of exploring these broader trends,
- but it conveys the scale with which datasets are being adapted on Socrata and
- lets us drill down to the views on Socrata to see more detail.</p>
-
<p>Socrata exposes enough of the data analysis process that we can start to see
what sorts of analyses different people are doing. We can see what sorts of
datasets are interesting to people. We may even be able to develop new
guidelines for publishing datasets through analysis of what makes datasets more
likely to be viewed, downloaded and filtered on Socrata.</p>
- <p>And now, the dataset progeny explorer:</p>
+ <h3 id="data-family-explorer">Data family explorer</h3>
+ <p>And now, the aforementioned fancy table. As I said above, this table contains
+ the families/tables associated with the ten datasets with the largest families.
+ Select a dataset, and then you can see all of that dataset plus all of the
+ filtered views, charts and maps, with some information about each. And if you
+ sort by “Created” date, the first one should be the source dataset.</p>
<!-- Scripts after the introduction so you don't notice the table loading -->
<script src="angular.min.js"></script>
BIN !/socrata-genealogies/query-1.jpg
BIN !/socrata-genealogies/query-2.jpg
BIN !/socrata-genealogies/query-3.jpg