
edit socrata genealogies

1 parent 7eaff9d commit b8dd3fd090b707712ceb054024f2bba4d7aecd9a @tlevine committed Jul 18, 2013
Showing with 59 additions and 27 deletions.
  1. +1 −1 !/feed.xml
  2. +57 −25 !/socrata-genealogies/index.html
  3. +1 −1 !/socrata-summary/index.html
2 !/feed.xml
@@ -218,7 +218,7 @@ cars$speed[order(cars$dist)]
<updated>2013-07-07T07:00:00Z</updated>
<link rel="alternate" href="http://www.thomaslevine.com/!/socrata-summary/index.html"/>
<content type="html">
-&lt;p&gt;I downloaded the metadata files for all of the datasets across all of the Socrata data portals.
+&lt;p&gt;I downloaded the metadata files for all of the datasets across all of the &lt;a href="http://www.socrata.com/"&gt;Socrata&lt;/a&gt; data portals.
Here I explain how I did that and present a summary of the sorts of data that we find in the portals.&lt;/p&gt;
&lt;h2 id="acquiring-the-data"&gt;Acquiring the data&lt;/h2&gt;
82 !/socrata-genealogies/index.html
@@ -7,18 +7,18 @@
<!--<![endif]-->
<head>
<meta charset='utf-8'>
- <title>Progenies of Nine Socrata datasets</title>
- <meta content='' name='description'>
+ <title>Progenies of Ten Socrata datasets</title>
+ <meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='description'>
<meta content='Thomas Levine' name='author'>
<link href='http://domain/humans.txt' rel='author' type='text/plain'>
<meta content='nanoc 3.6.4' name='generator'>
<meta content='width=device-width' name='viewport'>
<meta content='summary' name='twitter:card'>
<meta content='@thomaslevine' name='twitter:site'>
- <meta content='Progenies of Nine Socrata datasets' name='twitter:title'>
- <meta content='' name='twitter:description'>
+ <meta content='Progenies of Ten Socrata datasets' name='twitter:title'>
+ <meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='twitter:description'>
<meta content='@thomaslevine' name='twitter:creator'>
- <meta content='http://thomaslevine.com/apple-touch-icon-144x144-precomposed.png' name='twitter:image:src'>
+ <meta content='http://thomaslevine.com/!/socrata-genealogies/screenshot.png' name='twitter:image:src'>
<meta content='thomaslevine.com' name='twitter:domain'>
<meta content='' name='twitter:app:name:iphone'>
<meta content='' name='twitter:app:name:ipad'>
@@ -31,9 +31,9 @@
<meta content='' name='twitter:app:id:googleplay'>
<meta content='http://thomaslevine.com/!/socrata-genealogies/' property='og:url'>
<meta content='thomaslevine.com' property='og:site_name'>
- <meta content='' property='og:description'>
- <meta content='Progenies of Nine Socrata datasets' property='og:title'>
- <meta content='http://thomaslevine.com/apple-touch-icon-144x144-precomposed.png' property='og:image'>
+ <meta content="It's cool what you can do when data analysis is logged and exposed publicly over the web." property='og:description'>
+ <meta content='How are datasets transformed in Socrata, and what can we learn from that?' property='og:title'>
+ <meta content='http://thomaslevine.com/!/socrata-genealogies/screenshot.png' property='og:image'>
<link href='/favicon.ico' rel='icon' type='image/x-icon'>
<link href='/!/feed.xml' rel='alternate' title='Thomas Levine' type='application/atom+xml'>
<link href='http://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
@@ -73,16 +73,26 @@
</nav>
<header class='title-card'>
<h1>
- Progenies of Nine Socrata datasets
+ Progenies of Ten Socrata datasets
</h1>
<div class='date'>
</div>
</header>
<div id='article-wrapper'>
<article>
- <p>As the Twitters have pointed out, the dataset counts that I presented
- in my initial <a href="/!/socrata-summary">summary</a> of Socrata portals is somewhat deceptive.</p>
+ <p>Governments and other organizations have recently been trying to open up their
+ data in order that the public may benefit from them. Socrata’s <a href="http://www.socrata.com/open-data-portal/">Open Data Portal</a> software
+ is one tool that tries to help with this; an organization using Socrata is given a website
+ (“portal”) hosted by Socrata where they can upload their datasets and where
+ the public can download them.</p>
+
+ <p>I recently downloaded all of the metadata about all of the datasets from all
+ of the Socrata portals and then posted this <a href="/!/socrata-summary">summary</a> of
+ the data. Now on to some deeper analysis.</p>
+
+ <p>As the Twitters have pointed out, the dataset counts that I presented in my
+ initial summary are somewhat deceptive.</p>
<p><a href="https://twitter.com/tomschenkjr/status/354010005504147456"><img src="tomschenkjr.png" alt="@tomschenkjr Tweets about Chicago's filters" /></a></p>
@@ -96,7 +106,7 @@
<ol>
<li>Socrata concepts and terminology</li>
<li>ways that we can arrive at apparent duplicates in Socrata data</li>
- <li>the progeny of nine Socrata datasets</li>
+ <li>the progenies of ten Socrata datasets</li>
<li>ideas for future study</li>
</ol>
@@ -266,7 +276,7 @@ <h3 id="copied-rather-than-elegantly-linked">Copied rather than elegantly linked
the same number of columns, and similar names.
I haven’t done it on a larger scale, but that would be fun to do later.</p>
- <h2 id="nine-large-dataset-families">Nine large dataset families</h2>
+ <h2 id="ten-large-dataset-families">Ten large dataset families</h2>
<p>It took me quite a while to figure out how all of this works.
(That’s a story in itself.) My goal all along was to start looking
at how families of datasets are related. I figured I’d make something
@@ -284,19 +294,13 @@ <h3 id="methodology">Methodology</h3>
(Confusingly, Socrata also provides the latter sort of view count, and
I’ve included that figure in the present report.)</p>
- <p>Out of these datasets, I took out nine of the top ten datasets, and I
+ <p>Out of these datasets, I took the top ten datasets, and I
show their families in the table at the end of this page. Select a dataset,
and then you can see all of that dataset plus all of the filtered views,
maps and charts of that dataset. You can also see which portals each of
these datasets is federated to. You can sort by the different columns,
and you can click on a row to see more detail.</p>
- <h3 id="caveats">Caveats</h3>
- <p>The one dataset that I skipped is
- <a href="https://explore.data.gov/Contributors/FEC-Contributions/4dkz-64bn?">FEC contributions</a>.
- I skipped it because some of the child views appeared to be in different portals.
- I’m not really sure what’s going on there; we can worry about that one some other time.</p>
-
<p>And in case you’re reading this a year later, the data were collected from
Socrata portals at the end of May 2013.</p>
@@ -311,8 +315,34 @@ <h3 id="why-its-not-a-tree">Why it’s not a tree</h3>
represented as a child of the original dataset rather than a child of the old
filtered view.</p>
+ <h3 id="things-to-play-with">Things to play with</h3>
+ <p>If you sort by “Created” date, the first one should be the source dataset.</p>
+
+ <p>In some cases, like with the White House visitor records requests, most of the
+ downloads and hits for the whole family are from this source dataset.
+ In other cases, like the World Bank major contract awards, only a small
+ minority comes from this source dataset. This might tell us something about
+ how people like to use the data. Perhaps people working with the World Bank
+ contracts are interested in subsets for their particular region and time.
+ And maybe people are just playing with the White House data because it’s the
+ first one in the list.</p>
+
+ <p>The dataset size gives us an idea of what sort of queries people are running.
+ Are people selecting certain variables, or are they aggregating or subsetting
+ the records?</p>
+
+ <p>As I discussed earlier, federation is all-or-nothing; you either include all
+ of the source portal’s datasets or none of them. So you would expect that the
+ “Federation” column would list the same number of copies for each dataset.
+ In at least one instance (FEC contributions), this is not the case. What’s
+ going on there?</p>
+
+ <p>Frankly, this table is a rather terrible way of exploring these broader trends,
+ but it conveys the scale at which datasets are being adapted on Socrata and
+ lets us drill down to the views on Socrata to see more detail.</p>
+
<h2 id="future-research">Future research</h2>
- <p>Before you scroll down to the table of dataset progeny, I’m going to comment
+ <p>Before you scroll down to the table of dataset progenies, I’m going to comment
on some ideas for future study that I’ve come up with. I’ve already alluded to
some future study above; below, I’m focusing on things that I haven’t really
discussed above.</p>
@@ -331,10 +361,12 @@ <h3 id="users">Users</h3>
sorts of views they’re creating. I’m particularly interested in ordinary
citizens who are creating lots of views.</p>
- <h3 id="socrata-features">Socrata features</h3>
- <p>Socrata sells a bunch of add-on integration features. I’m somewhat curious to
+ <!--
+ ### Socrata features
+ Socrata sells a bunch of add-on integration features. I'm somewhat curious to
see which cities are using which features, and we can determine this based on
- the sorts of data that are in each portal.</p>
+ the sorts of data that are in each portal.
+ -->
<h3 id="data-quality">Data quality</h3>
<p>A couple months ago, <a href="https://twitter.com/ag_dubs">Ashley Williams</a> and I
@@ -368,7 +400,7 @@ <h3 id="data-quality">Data quality</h3>
some accident, it wasn’t appearing in the dataset. If we could identify
datasets like these, we could fix geocoding problems before people complain about them.</p>
- <h2 id="the-aforementioned-table-of-dataset-progeny">The aforementioned table of dataset progeny</h2>
+ <h2 id="the-aforementioned-table-of-dataset-progenies">The aforementioned table of dataset progenies</h2>
<!-- Scripts after the introduction so you don't notice the table loading -->
<script src="angular.min.js"></script>
2 !/socrata-summary/index.html
@@ -82,7 +82,7 @@
<div id='article-wrapper'>
<article>
- <p>I downloaded the metadata files for all of the datasets across all of the Socrata data portals.
+ <p>I downloaded the metadata files for all of the datasets across all of the <a href="http://www.socrata.com/">Socrata</a> data portals.
Here I explain how I did that and present a summary of the sorts of data that we find in the portals.</p>
<h2 id="acquiring-the-data">Acquiring the data</h2>
