In [53]:
import pandas as pd 
import requests
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sbn
from urllib.parse import urlencode
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.manifold import TSNE
from sklearn.decomposition import PCA
from IPython.display import HTML, Image


def iframe(url, width=800, height=400, https=True):
    if https:
        url2 = url.replace('http:','https:')
    else:
        url2 = url.replace('https:','http:')
    return HTML("<iframe width={width} height={height} src='{url}'></iframe>".format(width=width, height=height, url=url2))

# Beyond Visualisation: 5 Methods For Answering Questions With Your Data

Stuart Lynn: Head of Data And Research [@carto](https://twitter.com/CARTO)


[@stuart_lynn](https://twitter.com/Stuart_Lynn?ref_src=twsrc%5Egoogle%7Ctwcamp%5Eserp%7Ctwgr%5Eauthor)

Follow along here: [http://stuartlynn.github.com/NYSOD-talk](http://stuartlynn.github.com/NYSOD-talk)

Github Repo lives here: [http://github.com/stuartlynn/NYSOD-talk](http://github.com/stuartlynn/NYSOD-talk)

<iframe src='http://carto.com' style='width:90%; min-height:600px; height:90%'></iframe>

##  Spatial Open Gov Data Axioms

- Most Open Gov data has some kind of spatial aspect to it.
- Open Gov people love making maps (but who doesn't)!


<img src="imgs/lovemap.png" alt="Drawing" style="width: 300px;"/>


## Spatial Data 

- This has led to some really awesome maps that tell really interesting stories about the places we live.
- It has also led to some really interesting apps that give communities access to information they desperately need.


## This talk is about exploring other tools we can use to ask questions of our data and get answers back.

<img src='https://s-media-cache-ak0.pinimg.com/originals/65/d6/d9/65d6d9d3172dcc27daea6ecd1e9afc1b.gif'></img>

## ~~Messy~~ Real World Data. 

For this workshop we are going to use 311 data, which you can obtain from [here](https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9):

<iframe style='width:90%; min-height:600px; height:90%' src="https://nycopendata.socrata.com/Social-Services/311-Service-Requests-from-2010-to-Present/erm2-nwe9"></iframe>

## ~~Messy~~ Real World Data. 


...but actually it's a lot easier to get from here

<iframe style='width:90%; min-height:400px; height:90%' src="https://chriswhong.github.io/311plus/"></iframe>

_Thanks Chris!_

## Why 311 data?

- Want this talk to show not just polished examples where everything goes right (trust me, it won't).

- Want it to show some of the subtleties you will encounter when working with data like this.

- I didn't realize it was going to be quite this messy until I started using it.

## First Step - Load the data into CARTO and make a pretty map

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/3b000998-fddd-11e6-a748-0e3ff518bd15/embed'></iframe>

That's a lot of complaining... hard to see trends though... let's add some widgets

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/4244fa1a-6049-4bf6-8bb7-cced0387eb50/embed'></iframe>

This allows us to start visually interrogating the data and to get some idea of what's going on.

## Problems with relying only on visual exploration

1. The points on the map are all pretty near each other, so multiple records might sit on top of each other, masking the real pattern.
2. It's hard to get a handle on real numbers.
3. There are so many different complaint types that it's hard to really pull them all together.
4. It's hard to pick out any real trends.

## Method no 1: Aggregate the data! 

1. Really what we want are number counts of different complaints to different agencies in different areas
2. We really want to aggregate the data to some kind of boundaries
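
As a rough sketch of what that aggregation looks like in code, here is a small pandas example. The column names (`tract_id`, `agency`) and the sample rows are made up for illustration, and it assumes each record has already been matched to the boundary it falls in.

```python
import pandas as pd

# Hypothetical sample of 311 records. The column names are assumptions,
# and each record is assumed to already carry the ID of its census tract.
records = pd.DataFrame({
    "tract_id": ["A", "A", "A", "B", "B", "C"],
    "agency":   ["NYPD", "NYPD", "HPD", "NYPD", "DSNY", "HPD"],
})

# Count complaints per tract and agency:
# one row per tract, one column per agency, zeros where there were no calls.
counts = (
    records.groupby(["tract_id", "agency"])
    .size()
    .unstack(fill_value=0)
)
print(counts)
```

The resulting table of counts per area is what the later methods (normalizing, clustering) work from.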


## Introducing the Carto Data Observatory

- A repository of easy-to-access public open data, with an API/interface to consume it.
- It's open source! [https://github.com/cartodb/bigmetadata](https://github.com/cartodb/bigmetadata)

<iframe style='width:90%; min-height:400px; height:90%' src='https://carto.com/data-observatory/'></iframe>

## The Catalogue

<iframe style='width:90%; min-height:400px; height:90%' src='https://cartodb.github.io/bigmetadata/united_states/boundary.html#us-census-tracts'></iframe>

## Grabbing Census Tracts

```postgresql

INSERT INTO {new_table_name} (the_geom, {geo_id_column})
  SELECT *
  FROM OBS_GetBoundariesByGeometry(
    ST_MakeEnvelope(-74.259094,40.477398,-73.700165,40.91758, 4326),
    'us.census.tiger.census_tract'
  )
```
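
If you would rather run a query like this from Python, one option is to go through the CARTO SQL API. The sketch below only builds the request URL. The endpoint shape (`https://{username}.carto.com/api/v2/sql`), the `carto_sql_url` helper and the `your_username` placeholder are assumptions for illustration, and it uses just the `SELECT` part of the query above since writes would also need an API key.

```python
from urllib.parse import urlencode


def carto_sql_url(username, query):
    """Build a SQL API request URL (hypothetical helper, assumed endpoint shape)."""
    base = "https://{}.carto.com/api/v2/sql".format(username)
    return base + "?" + urlencode({"q": query})


# The SELECT half of the census tract query, covering the NYC bounding box.
query = """
SELECT *
FROM OBS_GetBoundariesByGeometry(
  ST_MakeEnvelope(-74.259094, 40.477398, -73.700165, 40.91758, 4326),
  'us.census.tiger.census_tract'
)
"""

# Collapse whitespace so the encoded URL stays compact.
url = carto_sql_url("your_username", " ".join(query.split()))
print(url)
```

The URL can then be fetched with something like `requests.get(url)`, assuming your account has access to the Data Observatory functions.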

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/37cd3ac4-009f-11e7-a398-0ee66e2c9693/embed'></iframe>

## Method no 2: Normalize that data! 

<img src='https://imgs.xkcd.com/comics/heatmap.png'></img>
[xkcd](https://imgs.xkcd.com/comics/heatmap.png)


<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/d287d6c0-005c-11e7-8512-0e3a376473ab/embed'></iframe>
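
A minimal sketch of the normalization step, assuming we already have a complaint count and a population figure per tract. The numbers and column names here are invented for illustration.

```python
import pandas as pd

# Hypothetical per-tract totals. The numbers and column names are made up.
tracts = pd.DataFrame({
    "tract_id": ["A", "B", "C"],
    "complaints": [500, 50, 0],
    "population": [10000, 500, 0],
})

# Mask tracts with no residents so they end up as NaN rather than infinity.
population = tracts["population"].where(tracts["population"] > 0)

# Complaints per 1,000 residents.
tracts["per_1000"] = 1000 * tracts["complaints"] / population
print(tracts)
```

Tract A has ten times the raw complaints of tract B, but once we divide by population tract B has the higher rate. That is the heatmap problem from the comic in a nutshell.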

## Method 3: Look for statistically significant clusters and outliers

After seeing the data, our eyes might be drawn to areas where there appears to be a high or low density of calls.
Is this really a pattern though, or are our eyes deceiving us?

There is a statistical test for finding spatial clusters of similar values called Moran's I, which we have built into CARTO.

<img src='imgs/Moran.png'></img>


The output of Moran's I is a set of regions that are spatial clusters or outliers. A region gets one of 4 categories:

- HH - The region is in a cluster of statistically high values
- LL - The region is in a cluster of statistically low values
- LH - The region is an outlier of statistically low activity compared to its neighbors
- HL - The region is an outlier of statistically high activity compared to its neighbors
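
To make the four categories concrete, here is a rough numpy sketch of how the quadrant labels can be derived. It is not CARTO's implementation: the `moran_quadrants` function, the toy values and the "regions in a row" neighbor structure are all assumptions, and it skips the significance test (usually done with random permutations) that decides which regions actually get flagged.

```python
import numpy as np


def moran_quadrants(values, weights):
    """Label each region HH, LL, LH or HL (hypothetical helper, no significance test)."""
    values = np.asarray(values, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Row-standardise so each region's neighbor weights sum to 1.
    w = weights / weights.sum(axis=1, keepdims=True)
    # Standardised value of each region, and the average of its neighbors.
    z = (values - values.mean()) / values.std()
    lag = w @ z
    labels = np.where(
        z > 0,
        np.where(lag > 0, "HH", "HL"),
        np.where(lag > 0, "LH", "LL"),
    )
    # z * lag is the local Moran statistic up to a constant factor.
    return labels, z * lag


# Toy example: 8 regions in a row, each one neighboring the regions next to it.
values = [9, 9, 4, 9, 9, 1, 1, 1]
n = len(values)
weights = np.eye(n, k=1) + np.eye(n, k=-1)

labels, local_i = moran_quadrants(values, weights)
print(list(zip(values, labels)))
```

Libraries such as PySAL provide tested implementations that include the significance step, which is what you would want for real analysis.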

Let's apply this to the 311 complaints to the NYPD.

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/383529fb-e262-4c04-b49d-1023a8a772e6/embed'></iframe>

## Another interesting example: Earthquake tweets 

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/mamataakella/builder/5023b856-3e5d-4d71-b0a5-80960575e90e/embed'></iframe>

## What next? 

- Aggregated and normalized data is awesome for finding patterns in data of a single variable
- What about the patterns between agencies though? How can we easily understand them?


## The problem: Maps tend to be univariate or bivariate at best and even then they can be confusing.



## Great examples of Bi-Variate mapping

<iframe style='width:800px; min-height:600px; height:90%' src='http://www.joshuastevens.net/cartography/make-a-bivariate-choropleth-map/'></iframe>

## Method 4: Clustering

The problem: The census has hundreds, if not thousands, of pieces of information for each location. How can we show all that data on a map at the same time?


The Solution: Try to find neighborhoods that are similar to each other in a number of variables and create categories of neighborhoods based on those similarities.


<iframe style='width:800px; min-height:600px; height:90%' src='https://observatory.carto.com/viz/141d46d6-1c85-11e6-9708-0ecfd53eb7d3/embed_map'></iframe>

<iframe style='width:800px; min-height:600px; height:90%' src='http://www.chpcny.org/making-neighborhoods-map/index.html'></iframe>

## Different clustering methods 

- [k-means](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)
- [Agglomerative clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html)
- [DBSCAN](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

This is a feature we are working to bring to CARTO pretty soon. In the meantime, the [notebook](New%20York%20311.ipynb) that accompanies this talk has examples of how to compute these clusters.
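
For a flavor of what this looks like, here is a small k-means sketch with scikit-learn. The data is synthetic (four made-up groups of areas, each dominated by calls to a different agency), and using the silhouette score to choose the number of clusters is one common heuristic rather than necessarily what the notebook does.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: rows are areas, columns are normalized call rates for 4 agencies.
# Each made-up group of 30 areas is dominated by a different agency.
centers = 5 * np.eye(4)
features = np.vstack([c + rng.normal(scale=0.5, size=(30, 4)) for c in centers])

# Try several cluster counts and keep the silhouette score for each.
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    scores[k] = silhouette_score(features, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```

On real 311 data the groups will be far less tidy, so expect lower silhouette scores and a less obvious choice of `k`.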

## Apply this to the top 4 agencies in our dataset 

<img src='imgs/Characteristics.png'></img>

## Give the data better labels 

0. Many calls NYPD and Housing 
1. All low complaints Moderate no Sanitation and Transportation
2. High call Rate: Sanitation Transport and moderate Police 
3. Low volume varied


<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/12447a72-00a6-11e7-8352-0ecd1babdde5/embed'></iframe>


## Method 5: How do relationships vary from place to place?

We all know that the relationship between things can vary from place to place.

We intuitively understand, for example, that the price of a house in an urban environment is strongly affected by its proximity to public transit.

While in rural areas this matters less.

How can we quantify these relationships?
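
One crude way to put numbers on this is to fit a separate regression for each kind of place and compare the slopes. The sketch below does that on synthetic house price data. All the numbers and the `fit_slope` helper are invented for illustration. Techniques such as geographically weighted regression go further and let the coefficients vary smoothly across the map.

```python
import numpy as np

rng = np.random.default_rng(42)


def fit_slope(distance_km, price):
    """Least-squares slope of price against distance (hypothetical helper)."""
    slope, _intercept = np.polyfit(distance_km, price, deg=1)
    return slope


# Synthetic data: price (in thousands) drops steeply with distance to transit
# in the made-up urban sample, and only slightly in the made-up rural one.
distance = rng.uniform(0, 5, size=200)
urban_price = 500 - 50 * distance + rng.normal(scale=10, size=200)
rural_price = 200 - 5 * distance + rng.normal(scale=10, size=200)

urban_slope = fit_slope(distance, urban_price)
rural_slope = fit_slope(distance, rural_price)
print(urban_slope, rural_slope)
```

The two fitted slopes differ by roughly a factor of ten, which is the "relationship varies from place to place" idea expressed as a number.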

## Method 5: Regression Analysis

We are going to try and understand how some census variables impact the total volume of 311 calls.

Or we were, before I ran out of time fighting 311 data... So instead let me present another interesting use case for this kind of analysis.


<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/e311d600-689d-11e6-8fa0-0e3ff518bd15/embed' ></iframe>

## UK petitions 

<iframe style='width:800px; min-height:600px; height:90%'  src='http://petitionmap.unboxedconsulting.com'></iframe>

<img src='imgs/Tweet.png'></img>


## Petitions that highly correlate with the petition to ban Trump from a state visit

<img src='imgs/TrumpCorrelations.png'></img>

## Petitions that highly anti-correlate with the Trump petition

<img src='imgs/anti_correlations.png'></img>
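
A rough sketch of how rankings like these can be computed: take the signature rate per constituency for each petition and correlate it against the petition of interest. The data below is synthetic, and the petition names and the single "leaning" factor driving them are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n_constituencies = 300

# Hypothetical latent factor per constituency that drives the signature rates.
leaning = rng.normal(size=n_constituencies)

# Synthetic signature rates for a target petition and three others.
petitions = {
    "target": leaning + rng.normal(scale=0.3, size=n_constituencies),
    "similar": leaning + rng.normal(scale=0.3, size=n_constituencies),
    "opposed": -leaning + rng.normal(scale=0.3, size=n_constituencies),
    "unrelated": rng.normal(size=n_constituencies),
}

# Pearson correlation of every other petition with the target.
corr = {
    name: np.corrcoef(petitions["target"], rates)[0, 1]
    for name, rates in petitions.items()
    if name != "target"
}

# Most correlated first, most anti-correlated last.
ranked = sorted(corr, key=corr.get, reverse=True)
print(ranked)
```

Correlation across areas says nothing about individual signers, so treat rankings like this as a prompt for questions rather than an answer.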

### Accept more asylum seekers and increase support for refugee migrants in the UK.

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/010f9b4c-005b-11e7-be55-0e3a376473ab/embed'></iframe>

### Stop all immigration and close the UK borders until ISIS is defeated.

<iframe style='width:800px; min-height:600px; height:90%' src='https://team.carto.com/u/stuartlynn/builder/ad373e3e-00b0-11e7-9256-0e233c30368f/embed'></iframe>

## Conclusions

- Visualizing data is a great way to gain insight into that data
- There are other, more statistical things you can do to get insight into your data
- Tools and libraries like scikit-learn and PySAL give you quick ways to apply these methods
- We are slowly making these methods available in CARTO in an easy-to-use, code-free manner.

## But remember .... 

<img src='https://cdn.meme.am/cache/instances/folder373/500x/43601373.jpg'></img>

We just scratched the surface here. This was more about whetting your appetite and getting you to think beyond visualizing data. You need to be cautious with some of these ideas and approaches.