aoeu

commit a21c6a5affa864b86e4c0253f9f061eef274aa4a 1 parent b8dd3fd
Thomas Levine authored July 18, 2013
102  !/socrata-genealogies/index.html
@@ -7,7 +7,7 @@
   <!--<![endif]-->
   <head>
     <meta charset='utf-8'>
-    <title>Progenies of Ten Socrata datasets</title>
+    <title>Progenies of Ten Socrata Datasets</title>
     <meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='description'>
     <meta content='Thomas Levine' name='author'>
     <link href='http://domain/humans.txt' rel='author' type='text/plain'>
@@ -15,7 +15,7 @@
     <meta content='width=device-width' name='viewport'>
     <meta content='summary' name='twitter:card'>
     <meta content='@thomaslevine' name='twitter:site'>
-    <meta content='Progenies of Ten Socrata datasets' name='twitter:title'>
+    <meta content='Progenies of Ten Socrata Datasets' name='twitter:title'>
     <meta content='How are datasets transformed in Socrata, and what can we learn from that?' name='twitter:description'>
     <meta content='@thomaslevine' name='twitter:creator'>
     <meta content='http://thomaslevine.com/!/socrata-genealogies/screenshot.png' name='twitter:image:src'>
@@ -73,7 +73,7 @@
         </nav>
         <header class='title-card'>
           <h1>
-            Progenies of Ten Socrata datasets
+            Progenies of Ten Socrata Datasets
           </h1>
           <div class='date'>
             
@@ -82,7 +82,7 @@
         <div id='article-wrapper'>
           <article>
             <p>Governments and other organizations have recently been trying to open up their
-            data in order that the public may benefit from them. Socrata’s <a href="http://www.socrata.com/open-data-portal/">Open Data Portal</a> software
+            data. Socrata’s <a href="http://www.socrata.com/open-data-portal/">Open Data Portal</a> software
             is one tool that tries to help with this; an organization using Socrata is given a website
             (“portal”) hosted by Socrata where they can upload their datasets and where
             the public can download them.</p>
@@ -91,6 +91,7 @@
             of the Socrata portals and then posted this <a href="/!/socrata-summary">summary</a> of
             the data. Now on to some deeper further analysis.</p>
             
+            <h2 id="what-is-a-dataset">What is a dataset?</h2>
             <p>As the Twitters have pointed out, the dataset counts that I presented in my
             initial summary are somewhat deceptive.</p>
             
@@ -98,16 +99,13 @@
             
             <p><a href="https://twitter.com/SR_spatial/status/354088265344749568"><img src="SR_spatial.png" alt="@SR_spatial Tweets about patterns of derived datasets" /></a></p>
             
-            <p><a href="https://twitter.com/richmanmax/status/353956877501087746"><img src="deduuuuuupe.png" alt="@richmanmax Tweets &quot;deduuuuuupe.&quot;" /></a></p>
-            
             <p>Many of the things that I was calling a dataset can be seen as a
             copy or a derivative of another dataset. In this post, I’ll discuss</p>
             
             <ol>
               <li>Socrata concepts and terminology</li>
-              <li>ways that we can arrive at apparent duplicates in Socrata data</li>
-              <li>the progenies of ten Socrata datasets</li>
-              <li>ideas for future study</li>
+              <li>Ways that we can arrive at apparent duplicates in Socrata data</li>
+              <li>The progenies of ten Socrata datasets</li>
             </ol>
             
             <h2 id="socrata-terminology">Socrata terminology</h2>
@@ -127,7 +125,7 @@ <h3 id="everything-is-a-view">Everything is a view</h3>
             <a href="https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut">White House Visitor Records Requests</a>
             and <a href="https://explore.data.gov/dataset/White-House-Visitor-Records-Requests/644b-gaut">U.S. Overseas Loans and Grants (Greenbook)</a>.</p>
             
-            <p><a href="https://explore.data.gov/"><img src="search-browse.png" alt="Search &amp; Browse Datasets and Views" /></a></p>
+            <p><a href="https://explore.data.gov/"><img src="search-browse.png" alt="Search &amp; Browse Datasets and Views" class="wide" /></a></p>
             
             <p>You also get a list of “View Types”. Below, I define some
             of these view types.</p>
@@ -150,7 +148,7 @@ <h3 id="filtered-views">Filtered views</h3>
             <a href="https://data.oaklandnet.com/Environmental/Public-Works-Volunteer-Opportunities/sduu-bfki">Public Works Volunteer Opportunities</a>
             to include only opportunities on July 29.</p>
             
-            <p><a href="filter.png"><img src="filter.png" alt="Filtering on date July 29" /></a></p>
+            <p><a href="filter.png"><img src="filter.png" alt="Filtering on date July 29" class="wide" /></a></p>
             
             <p><a href="https://data.oaklandnet.com/Environmental/Volunteer-Opportunities-on-July-29/vyhb-nqtw">Here</a>’s the resulting filtered view.</p>
             
@@ -194,7 +192,7 @@ <h3 id="federation">Federation</h3>
             <p>Sometimes, you’ll see a view in the search &amp; browse pane with a grey background,
             instead of white. Hawaii has a bunch of these.</p>
             
-            <p><a href="https://data.hawaii.gov/"><img src="hawaii.png" alt="Hawaii data portal" /></a></p>
+            <p><a href="https://data.hawaii.gov/"><img src="hawaii.png" alt="Hawaii data portal" class="wide" /></a></p>
             
             <p>These views are “provided” by other portals through a process called
             “federation”. The destination portal (data.hawaii.gov in the above screenshot)
@@ -210,6 +208,9 @@ <h3 id="federation">Federation</h3>
             otherwise copied to the destination portal.</p>
             
             <h2 id="types-of-duplicate-datasets">Types of duplicate datasets</h2>
+            
+            <p><a href="https://twitter.com/richmanmax/status/353956877501087746"><img src="deduuuuuupe.png" alt="@richmanmax Tweets &quot;deduuuuuupe.&quot;" /></a></p>
+            
             <p>Now that you know a bit more about how Socrata works, I can explain my three
             categories of datasets-that-I-counted-twice.</p>
             
@@ -315,9 +316,12 @@ <h3 id="why-its-not-a-tree">Why it’s not a tree</h3>
             represented as a child of the original dataset rather than a child of the old
             filtered view.</p>
             
-            <h3 id="things-to-play-with">Things to play with</h3>
+            <h3 id="things-to-look-for">Things to look for</h3>
+            
+            <h4 id="the-source-dataset">The source dataset</h4>
             <p>If you sort by “Created” date, the first one should be the source dataset.</p>
             
+            <h4 id="compare-family-statistics-with-view-statistics">Compare family statistics with view statistics</h4>
             <p>In some cases, like with the White House visitor records requests, most of the
             downloads and hits for the whole family are from this source dataset.
             In other cases, like the World Bank major contract awards, only a small
@@ -327,80 +331,30 @@ <h3 id="things-to-play-with">Things to play with</h3>
             And maybe people are just playing with the White House data because it’s the
             first one in the list.</p>
             
-            <p>The dataset size gives us an idea of what sort of queries people are running.
+            <h4 id="view-size">View size</h4>
+            <p>The view size gives us an idea of what sort of queries people are running.
             Are people selecting certain variables, or are they aggregating or subsetting
             the records?</p>
             
+            <h4 id="federation-2">Federation</h4>
             <p>As I discussed earlier, federation is all-or-nothing; you either include all
             of the source portal’s datasets or none of them. So you would expect that the
             “Federation” column would list the same number of copies for each dataset.
             In at least one instance (FEC contributions), this is not the case. What’s
             going on there?</p>
             
+            <h3 id="relevance">Relevance</h3>
             <p>Frankly, this table is a rather terrible way of exploring these broader trends,
             but it conveys the scale with which datasets are being adapted on Socrata and
             lets us drill down to the views on Socrata to see more detail.</p>
             
-            <h2 id="future-research">Future research</h2>
-            <p>Before you scroll down to the table of dataset progenies, I’m going to comment
-            on some ideas for future study that I’ve come up with. I’ve already alluded to
-            some future study above; below, I’m focusing on things that I haven’t really
-            discussed above.</p>
-            
-            <p>A small note on grammar:
-            I talk about these studies as if I’m going to do them, but that’s just because
-            I normally find that easier than convincing other people to help; all of the code
-            and data is free/libre/open, so you can also help or do these yourself rather
-            than waiting for me.</p>
-            
-            <h3 id="users">Users</h3>
-            <p>As far as I could tell, Socrata’s API doesn’t make it particularly easy to
-            get a list of all of the users, so I started with views. But now I have
-            a list of all of the users who have created views, which is close enough to
-            the list of all of the users. I’d like to see who is creating views and what
-            sorts of views they’re creating. I’m particularly interested in ordinary
-            citizens who are creating lots of views.</p>
-            
-            <!--
-            ### Socrata features
-            Socrata sells a bunch of add-on integration features. I'm somewhat curious to
-            see which cities are using which features, and we can determine this based on
-            the sorts of data that are in each portal.
-            -->
-            
-            <h3 id="data-quality">Data quality</h3>
-            <p>A couple months ago, <a href="https://twitter.com/ag_dubs">Ashley Williams</a> and I
-            <a href="http://www.appgen.me/audit/report">prototyped</a> a tool for identifying
-            data quality issues in the data portal. We had a
-            <a href="http://www.appgen.me/audit">slew of best practices</a> that we had found to
-            be frequently violated in the New York data portal, but we didn’t know
-            enough about Socrata to evaluate them properly. Many of these were already
-            on my list for further study, but I got some more ideas on this front
-            through my conversation with <a href="https://twitter.com/nneditch">Nicole Neditch</a>,
-            who administrates Oakland’s data portal.</p>
-            
-            <p><strong>Codebooks</strong>: Socrata doesn’t really have a feature for including
-            explanations of what the different variables in a dataset mean. (I’d call
-            this a data dictionary or a codebook.) However, some datasets may already
-            include codebooks. I’m personally just a bit curious as to which datasets
-            have codebooks and whether that impacts their use. But this could also work
-            its way into our hypothetical tool. For example, we could look for datasets
-            with lots of views and without codebooks; those might be useful datasets
-            to write codebooks for.</p>
-            
-            <p><strong>Geocoding</strong>: Socrata is quite slow at geocoding. Nicole suspects that
-            this is because all of the geocoding for all of the portals runs on
-            one server. This is something that Socrata could improve, but there’s
-            a lot that cities can already do about this. This issue came up in relation
-            to Oakland’s <a href="https://data.oaklandnet.com/Public-Safety/CrimeWatch-Maps-Past-90-Days/ym6k-rx7a">CrimeWatch maps</a>.
-            The dataset has geospatial coordinates, is quite long, and is updated
-            frequently. Every time it is updated, all of the geocoded coordinates
-            get cleared, and the geocoding restarts, so the geocoding never finishes.
-            Oakland actually has the geospatial data in its database, but through
-            some accident, it wasn’t appearing in the dataset. If we could identify
-            datasets like these, we could fix geocoding problems before people complain about them.</p>
-            
-            <h2 id="the-aforementioned-table-of-dataset-progenies">The aforementioned table of dataset progenies</h2>
+            <p>Socrata exposes enough of the data analysis process that we can start to see
+            what sorts of analyses different people are doing. We can see what sorts of
+            datasets are interesting to people. We may even be able to develop new
+            guidelines for publishing datasets through analysis of what makes datasets more
+            likely to be viewed, downloaded and filtered on Socrata.</p>
+            
+            <p>And now, the dataset progeny explorer:</p>
             
             <!-- Scripts after the introduction so you don't notice the table loading -->
             <script src="angular.min.js"></script>
BIN  !/socrata-genealogies/screenshot.png
181  !/socrata-schema/index.html
@@ -0,0 +1,181 @@
+<!DOCTYPE html>
+<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
+<!--[if IE 7]>    <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
+<!--[if IE 8]>    <html class="no-js lt-ie9"> <![endif]-->
+<!--[if gt IE 8]><!-->
+<html class='no-js'>
+  <!--<![endif]-->
+  <head>
+    <meta charset='utf-8'>
+    <title>Thomas Levine</title>
+    <meta content='' name='description'>
+    <meta content='Thomas Levine' name='author'>
+    <link href='http://domain/humans.txt' rel='author' type='text/plain'>
+    <meta content='nanoc 3.6.4' name='generator'>
+    <meta content='width=device-width' name='viewport'>
+    <meta content='summary' name='twitter:card'>
+    <meta content='@thomaslevine' name='twitter:site'>
+    <meta content='Thomas Levine' name='twitter:title'>
+    <meta content='' name='twitter:description'>
+    <meta content='@thomaslevine' name='twitter:creator'>
+    <meta content='http://thomaslevine.com/apple-touch-icon-144x144-precomposed.png' name='twitter:image:src'>
+    <meta content='thomaslevine.com' name='twitter:domain'>
+    <meta content='' name='twitter:app:name:iphone'>
+    <meta content='' name='twitter:app:name:ipad'>
+    <meta content='' name='twitter:app:name:googleplay'>
+    <meta content='' name='twitter:app:url:iphone'>
+    <meta content='' name='twitter:app:url:ipad'>
+    <meta content='' name='twitter:app:url:googleplay'>
+    <meta content='' name='twitter:app:id:iphone'>
+    <meta content='' name='twitter:app:id:ipad'>
+    <meta content='' name='twitter:app:id:googleplay'>
+    <meta content='http://thomaslevine.com/!/socrata-schema/' property='og:url'>
+    <meta content='thomaslevine.com' property='og:site_name'>
+    <meta content='' property='og:description'>
+    <meta content='' property='og:title'>
+    <meta content='http://thomaslevine.com/apple-touch-icon-144x144-precomposed.png' property='og:image'>
+    <link href='/favicon.ico' rel='icon' type='image/x-icon'>
+    <link href='/!/feed.xml' rel='alternate' title='Thomas Levine' type='application/atom+xml'>
+    <link href='http://fonts.googleapis.com/css?family=Open+Sans:400,700' rel='stylesheet' type='text/css'>
+    <link href='/css/style-cb653401acb.css' rel='stylesheet'>
+    <script src='https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML' type='text/javascript'></script>
+    <script src='/js/modernizr-cb42306a279.js'></script>
+  </head>
+  <body>
+    <!--[if lt IE 7 ]>
+      <p class='chromeframe'>
+        You are using an <strong>outdated</strong> browser.
+        Please <a href="http://browsehappy.com/">upgrade your browser</a> or
+        <a href="http://www.google.com/chromeframe/?redirect=true">activate Google Chrome Frame</a>
+        to improve your experience.
+      </p>
+    <![endif]-->
+    <div id='wrapper'>
+      <div id='container'>
+        <nav>
+          <ul class='nobullet'>
+            <li class='link'>
+              <a href='/'>
+                <div>~</div>
+              </a>
+            </li>
+            <li class='link'>
+              <a href='/!/'>
+                <div>!</div>
+              </a>
+            </li>
+            <li class='link'>
+              <a href='/!/about/'>
+                <div>?</div>
+              </a>
+            </li>
+          </ul>
+        </nav>
+        <header class='title-card'>
+          <h1>
+            
+          </h1>
+          <div class='date'>
+            
+          </div>
+        </header>
+        <div id='article-wrapper'>
+          <article>
+            
+            <h2 id="future-research">Future research</h2>
+            <p>Before you scroll down to the table of dataset progenies, I’m going to comment
+            on some ideas for future study that I’ve come up with. I’ve already alluded to
+            some future study above; below, I’m focusing on things that I haven’t really
+            discussed above.</p>
+            
+            <p>A small note on grammar:
+            I talk about these studies as if I’m going to do them, but that’s just because
+            I normally find that easier than convincing other people to help; all of the code
+            and data is free/libre/open, so you can also help or do these yourself rather
+            than waiting for me.</p>
+            
+            <h3 id="users">Users</h3>
+            <p>As far as I could tell, Socrata’s API doesn’t make it particularly easy to
+            get a list of all of the users, so I started with views. But now I have
+            a list of all of the users who have created views, which is close enough to
+            the list of all of the users. I’d like to see who is creating views and what
+            sorts of views they’re creating. I’m particularly interested in ordinary
+            citizens who are creating lots of views.</p>
+            
+            <!--
+            ### Socrata features
+            Socrata sells a bunch of add-on integration features. I'm somewhat curious to
+            see which cities are using which features, and we can determine this based on
+            the sorts of data that are in each portal.
+            -->
+            
+            <h3 id="data-quality">Data quality</h3>
+            <p>A couple months ago, <a href="https://twitter.com/ag_dubs">Ashley Williams</a> and I
+            <a href="http://www.appgen.me/audit/report">prototyped</a> a tool for identifying
+            data quality issues in the data portal. We had a
+            <a href="http://www.appgen.me/audit">slew of best practices</a> that we had found to
+            be frequently violated in the New York data portal, but we didn’t know
+            enough about Socrata to evaluate them properly. Many of these were already
+            on my list for further study, but I got some more ideas on this front
+            through my conversation with <a href="https://twitter.com/nneditch">Nicole Neditch</a>,
+            who administrates Oakland’s data portal.</p>
+            
+            <p><strong>Codebooks</strong>: Socrata doesn’t really have a feature for including
+            explanations of what the different variables in a dataset mean. (I’d call
+            this a data dictionary or a codebook.) However, some datasets may already
+            include codebooks. I’m personally just a bit curious as to which datasets
+            have codebooks and whether that impacts their use. But this could also work
+            its way into our hypothetical tool. For example, we could look for datasets
+            with lots of views and without codebooks; those might be useful datasets
+            to write codebooks for.</p>
+            
+            <p><strong>Geocoding</strong>: Socrata is quite slow at geocoding. Nicole suspects that
+            this is because all of the geocoding for all of the portals runs on
+            one server. This is something that Socrata could improve, but there’s
+            a lot that cities can already do about this. This issue came up in relation
+            to Oakland’s <a href="https://data.oaklandnet.com/Public-Safety/CrimeWatch-Maps-Past-90-Days/ym6k-rx7a">CrimeWatch maps</a>.
+            The dataset has geospatial coordinates, is quite long, and is updated
+            frequently. Every time it is updated, all of the geocoded coordinates
+            get cleared, and the geocoding restarts, so the geocoding never finishes.
+            Oakland actually has the geospatial data in its database, but through
+            some accident, it wasn’t appearing in the dataset. If we could identify
+            datasets like these, we could fix geocoding problems before people complain about them.</p>
+          </article>
+        </div>
+        <div id='pagination'>
+          <div class='base-little-card'>
+            <a href="https://github.com/tlevine/www.thomaslevine.com/tree/master/content/!/socrata-schema/index.md">View source</a>
+            <a href="https://twitter.com/thomaslevine">Discuss</a>
+          </div>
+        </div>
+      </div>
+    </div>
+    <div id='feedback'>
+      <strong>
+        Tom requests your feedback.
+      </strong>
+      <p>
+        I can never decide what to write;
+        tell me what you like,
+        and my decisions will be easier.
+        (Contact information is <a href="/" target="_blank" >here</a>.)
+      </p>
+      <a class='close' href='javascript:$("#feedback").fadeOut()'>
+        Close
+      </a>
+    </div>
+    <script src='/js/application-cb286d6f677.js'></script>
+    <!-- Piwik -->
+    <script type="text/javascript">
+    var pkBaseURL = (("https:" == document.location.protocol) ? "https://piwik.thomaslevine.com/" : "http://piwik.thomaslevine.com/");
+    document.write(unescape("%3Cscript src='" + pkBaseURL + "piwik.js' type='text/javascript'%3E%3C/script%3E"));
+    </script><script type="text/javascript">
+    try {
+    var piwikTracker = Piwik.getTracker(pkBaseURL + "piwik.php", 2);
+    piwikTracker.trackPageView();
+    piwikTracker.enableLinkTracking();
+    } catch( err ) {}
+    </script><noscript><p><img src="http://piwik.thomaslevine.com/piwik.php?idsite=2" style="border:0" alt="Piwik tracking image" /></p></noscript>
+    <!-- End Piwik Tracking Code -->
+  </body>
+</html>
