This is an experiment in collecting Instagram data to gather insights about London. One inspiration was http://www.datascopeanalytics.com/blog/instagrams-blind-spot/, but I am also hoping to learn what people are taking photos of, do a bit of image analysis (a "colors of Instagram" viz comes to mind), and maybe eventually play with some sort of image recognition using neural networks (if I ever get that far 😅).

Anyway, what I've done is call the media_search API endpoint to scrape photos from London. Instagram only seems to return 20 photos per call, so I split Greater London into a grid of lat/lng pairs and hoover up photos for each grid point. New photos are added all the time, so to get the most out of it I set up an AWS Lambda function that runs every hour to pick up new ones.
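
The real scraper is in Node.js and isn't shown here, but the grid idea can be sketched in Python. Everything below is an assumption for illustration: the bounding box is only a rough outline of Greater London, the 2 km spacing is a guess rather than the value actually used, and `make_grid` is a made-up helper name.

```python
import math

# Rough bounding box for Greater London (approximate, for illustration only).
LONDON_BBOX = {"south": 51.28, "west": -0.51, "north": 51.69, "east": 0.33}
KM_PER_DEGREE_LAT = 111.32  # rough average length of one degree of latitude


def make_grid(bbox, spacing_km):
    """Return (lat, lng) points covering bbox, roughly spacing_km apart."""
    lat_step = spacing_km / KM_PER_DEGREE_LAT
    # A degree of longitude shrinks with latitude, so scale by cos(latitude).
    mid_lat = (bbox["south"] + bbox["north"]) / 2
    lng_step = spacing_km / (KM_PER_DEGREE_LAT * math.cos(math.radians(mid_lat)))

    n_rows = math.ceil((bbox["north"] - bbox["south"]) / lat_step)
    n_cols = math.ceil((bbox["east"] - bbox["west"]) / lng_step)

    points = []
    for row in range(n_rows + 1):
        lat = bbox["south"] + row * lat_step
        for col in range(n_cols + 1):
            lng = bbox["west"] + col * lng_step
            points.append((round(lat, 5), round(lng, 5)))
    return points


grid = make_grid(LONDON_BBOX, spacing_km=2.0)
print(len(grid), "grid points, first one:", grid[0])
```

Each point would then be one media_search call per hourly run, so the spacing is a trade-off between coverage and the number of API calls.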

There will obviously be duplicates, so I've imported all the files into MongoDB with a unique index on the photo id to keep only unique posts.

The Lambda and the parser are in Node.js since that's what I know best, but Python comes with some handy tools (including scikit-learn and notebooks...), so I'm handing over to Python for this bit...


In [29]:
import numpy as np
import pandas as pd
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [30]:
from pprint import pprint

In [31]:
from bson.son import SON  # ordered mapping, used for the $sort stages below

In [32]:
#for geo parsing
import json
from shapely.geometry import shape, Point

In [33]:
#db setup: the posts collection is the only one, and it stashes everything returned from that API endpoint
db = client.instagramLondon
coll = db.posts

In [34]:
#here's what one record looks like
pprint(coll.find_one())

{'_id': ObjectId('57c46f2e4f084b83079167ac'),
 'attribution': None,
 'caption': {'created_time': '1472338156',
             'from': {'full_name': 'Carlos Pinto @ London-Lisbon',
                      'id': '245152137',
                      'profile_picture': 'https://scontent.cdninstagram.com/t51.2885-19/10932399_451400081695046_252473917_a.jpg',
                      'username': 'carlosmpinto10'},
             'id': '17862561217003381',
             'text': 'Category is "west London realness"! Thank you for coming '
                     'and eating all my Portuguese saussages!!'},
 'comments': {'count': 0},
 'created_time': '1472338156',
 'filter': 'Normal',
 'id': '1326391051846276361_245152137',
 'images': {'low_resolution': {'height': 320,
                               'url': 'https://scontent.cdninstagram.com/t51.2885-15/s320x320/e35/14031729_1508980895794743_1117711764_n.jpg?ig_cache_key=MTMyNjM5MTA1MTg0NjI3NjM2MQ%3D%3D.2.l',
                               'width': 320},
      

Some fields of interest here:
* tags (array): what are people talking about?
* timestamp: stored as an ISODate; the parser derives it from 'created_time'
* images: the URLs of the image files, which would be really useful for color analysis
* location: for mapping, obviously 😄
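
The parser that creates `timestamp` is in Node.js and not part of this notebook. As a rough Python equivalent, assuming `created_time` is a string of Unix epoch seconds in UTC (which is what the sample record above looks like), the conversion could be sketched as follows. `parse_created_time` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def parse_created_time(created_time):
    """Convert an epoch-seconds string such as '1472338156' to an aware UTC datetime."""
    return datetime.fromtimestamp(int(created_time), tz=timezone.utc)


ts = parse_created_time("1472338156")  # created_time of the sample record
print(ts.isoformat())  # 2016-08-27T22:49:16+00:00
```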

In [35]:
#some basic stats about the collection so far
#(count_documents replaces coll.count(), which was removed in PyMongo 4)
print('number of records: {}'.format(coll.count_documents({})))

number of records: 49200


In [36]:
#what are the top tags?
pipeline = [ 
    {"$unwind": "$tags"}, 
    {"$group": {"_id": "$tags", "count": { "$sum": 1 }}}, 
    {"$sort" : SON({ "count" : -1}) }, 
    {"$limit": 20}]
aggCursor = db.posts.aggregate(pipeline)
toptags = list(aggCursor)
toptags

[{'_id': 'london', 'count': 12907},
 {'_id': 'uk', 'count': 1970},
 {'_id': 'travel', 'count': 1621},
 {'_id': 'love', 'count': 1455},
 {'_id': 'nottinghillcarnival', 'count': 1403},
 {'_id': 'londonlife', 'count': 1306},
 {'_id': 'carnival', 'count': 1262},
 {'_id': 'summer', 'count': 1253},
 {'_id': 'nottinghill', 'count': 1196},
 {'_id': 'england', 'count': 1163},
 {'_id': 'instagood', 'count': 1162},
 {'_id': 'bankholiday', 'count': 1124},
 {'_id': 'art', 'count': 991},
 {'_id': 'photooftheday', 'count': 894},
 {'_id': 'architecture', 'count': 816},
 {'_id': 'thisislondon', 'count': 798},
 {'_id': 'photography', 'count': 783},
 {'_id': 'fashion', 'count': 746},
 {'_id': 'picoftheday', 'count': 721},
 {'_id': 'instadaily', 'count': 708}]

Unsurprisingly, london comes top as a tag...
However, there are some that I expect to be more 'seasonal', e.g. nottinghillcarnival (28-29 Aug 2016) and bankholiday (29 Aug 2016).
It would also be interesting to see which locations have a higher density of particular tags, e.g. 'theatre' in the West End maybe? Or 'bigben' around Westminster?
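
One way to start on the tag-per-location question is another aggregation pipeline, this time filtered to a single tag and grouped by location. This is only a sketch: it assumes each post has `location.id` and `location.name` fields (`location.id` is used later in this notebook, `location.name` is my assumption), and `tag_by_location_pipeline` is a made-up helper.

```python
def tag_by_location_pipeline(tag, limit=10):
    """Build a pipeline counting posts with the given tag per location."""
    return [
        # Matching a string against an array field selects posts whose
        # tags array contains it; also skip posts without a location id.
        {"$match": {"tags": tag, "location.id": {"$ne": None}}},
        {"$group": {
            "_id": {"id": "$location.id", "name": "$location.name"},
            "count": {"$sum": 1},
        }},
        {"$sort": {"count": -1}},
        {"$limit": limit},
    ]


pipeline = tag_by_location_pipeline("theatre", limit=5)
# To run it against the collection used in this notebook:
# list(db.posts.aggregate(pipeline))
```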

In [37]:
#min date:
list(db.posts.find().sort("timestamp", pymongo.ASCENDING).limit(1))[0]['timestamp']

datetime.datetime(2016, 8, 19, 15, 42, 48)

In [38]:
#max date so far:
list(db.posts.find().sort("timestamp", pymongo.DESCENDING).limit(1))[0]['timestamp']

datetime.datetime(2016, 8, 29, 16, 54, 58)

In [39]:
#similar to tags, let's see who the top users are:
pipeline = [
    {"$group": {"_id": {"userid" :"$user.id", "name" : "$user.full_name"}, "count": { "$sum": 1 }}}, 
    {"$sort" : SON({ "count" : -1}) }, 
    {"$limit": 20}]
aggCursor = db.posts.aggregate(pipeline)
topUsers = list(aggCursor)
topUsers

[{'_id': {'name': 'Nav Sandhu', 'userid': '2284422920'}, 'count': 53},
 {'_id': {'name': '💕11/03/2016💕', 'userid': '1800822004'}, 'count': 39},
 {'_id': {'name': '', 'userid': '355544734'}, 'count': 33},
 {'_id': {'name': 'Ellz M', 'userid': '241533014'}, 'count': 30},
 {'_id': {'name': '', 'userid': '2234850097'}, 'count': 30},
 {'_id': {'name': '✯N⋆I⋆C⋆O⋆U⋆K✯', 'userid': '3683701425'}, 'count': 27},
 {'_id': {'name': 'вrian 😷💉', 'userid': '7115543'}, 'count': 27},
 {'_id': {'name': 'Gonzalo Espejo', 'userid': '292534422'}, 'count': 23},
 {'_id': {'name': 'Will Hammond', 'userid': '1562996357'}, 'count': 23},
 {'_id': {'name': '', 'userid': '3665940198'}, 'count': 23},
 {'_id': {'name': 'Matthew Wells', 'userid': '1203901802'}, 'count': 23},
 {'_id': {'name': 'Ekaterina', 'userid': '339658241'}, 'count': 23},
 {'_id': {'name': 'Yousef', 'userid': '272365160'}, 'count': 22},
 {'_id': {'name': 'Şirvan Uslu', 'userid': '1688297718'}, 'count': 22},
 {'_id': {'name': 'Геннадій Іванущенко',

In [40]:
#hmm, weirdly some users don't seem to have a name?? Let's check:
coll.find_one({"user.id": '3665940198'})

{'_id': ObjectId('57c46f434f084b830791ead5'),
 'attribution': None,
 'caption': {'created_time': '1472347035',
  'from': {'full_name': '',
   'id': '3665940198',
   'profile_picture': 'https://scontent.cdninstagram.com/t51.2885-19/s150x150/14052574_154788661623879_146262989_a.jpg',
   'username': '30_30_europe'},
  'id': '17851887115098564',
  'text': '.\n예약해 둔 스카이가든 가기 전에 들른 버로우마켓. 식료품 전문이라 들어가는 순간 맛있는 음식들이 천지삐까리! -\n\n#유럽여행 #런던여행 #버로우마켓 #boroughmarket'},
 'comments': {'count': 0},
 'created_time': '1472347035',
 'filter': 'Normal',
 'id': '1326465530261260760_3665940198',
 'images': {'low_resolution': {'height': 320,
   'url': 'https://scontent.cdninstagram.com/t51.2885-15/s320x320/e35/14031706_1254496294594574_1241856600_n.jpg?ig_cache_key=MTMyNjQ2NTUzMDI2MTI2MDc2MA%3D%3D.2.l',
   'width': 320},
  'standard_resolution': {'height': 640,
   'url': 'https://scontent.cdninstagram.com/t51.2885-15/s640x640/sh0.08/e35/14031706_1254496294594574_1241856600_n.jpg?ig_cache_key=MTMyNjQ2NTUzMDI2

In [41]:
#aha-- I should have used the 'username' field rather than the full_name-- try again
pipeline = [
    {"$group": {"_id": {"userid" :"$user.id", "name" : "$user.username", "displayname":  "$user.full_name"}, "count": { "$sum": 1 }}}, 
    {"$sort" : SON({ "count" : -1}) }, 
    {"$limit": 20}]
aggCursor = db.posts.aggregate(pipeline)
topUsers = list(aggCursor)
topUsers

[{'_id': {'displayname': 'Nav Sandhu',
   'name': 'nav_77_sandhu',
   'userid': '2284422920'},
  'count': 53},
 {'_id': {'displayname': '💕11/03/2016💕',
   'name': 'iwantsel2know',
   'userid': '1800822004'},
  'count': 39},
 {'_id': {'displayname': '', 'name': 'me_eunk', 'userid': '355544734'},
  'count': 33},
 {'_id': {'displayname': 'Ellz M', 'name': 'rawdoku', 'userid': '241533014'},
  'count': 30},
 {'_id': {'displayname': '', 'name': 'quackhistory', 'userid': '2234850097'},
  'count': 30},
 {'_id': {'displayname': '✯N⋆I⋆C⋆O⋆U⋆K✯',
   'name': 'nicophotocomukofficial',
   'userid': '3683701425'},
  'count': 27},
 {'_id': {'displayname': 'вrian 😷💉',
   'name': 'teethpuller',
   'userid': '7115543'},
  'count': 27},
 {'_id': {'displayname': 'Ekaterina',
   'name': 'misskatrin5',
   'userid': '339658241'},
  'count': 23},
 {'_id': {'displayname': 'Matthew Wells',
   'name': 'coastermadmatt',
   'userid': '1203901802'},
  'count': 23},
 {'_id': {'displayname': 'Will Hammond',
   'name':

Locations: now here comes the interesting part. If I use the census output area (OA) polygons, each location has to be mapped to a polygon... and with that many entries...?

Interestingly, each location has an id, so a more efficient way may be to look at the unique ids first and see what they are like.

In [42]:
len(db.posts.distinct('location.id'))

7761

Still a lot, but quite a bit smaller than the number of entries...
Tools-wise, off the top of my head: explore turf.js. OGR and Python have some geo tools too, I think...

The London Datastore has a stash of GIS files for the OAs in London, which I can use to get the right polygons.
Using a combination of https://gist.github.com/benbalter/5858851 and https://github.com/mapbox/geojson-merge, I managed to get a GeoJSON file with all the OA polygons in it. Now to figure out how to assign points to the polygons... shapely / basemap in Python may do the job...
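
To make the point-to-polygon step concrete, here is a dependency-free sketch using the classic ray-casting test. It is a stand-in for what shapely's `shape(...).contains(Point(...))` would do, not the approach this notebook ends up with. Assumptions: features are GeoJSON Polygons or MultiPolygons with `[lng, lat]` coordinates, the OA code lives in a property called `OA11CD` (a guess at the property name), and the toy squares and location ids are invented. Only unique locations are looked up, as suggested above.

```python
def point_in_ring(lng, lat, ring):
    """Ray casting: is (lng, lat) inside the closed ring of [lng, lat] pairs?"""
    inside = False
    j = len(ring) - 1
    for i in range(len(ring)):
        xi, yi = ring[i][0], ring[i][1]
        xj, yj = ring[j][0], ring[j][1]
        # Does the edge (j -> i) straddle the horizontal line through the point?
        if (yi > lat) != (yj > lat):
            x_cross = xi + (lat - yi) * (xj - xi) / (yj - yi)
            if lng < x_cross:
                inside = not inside
        j = i
    return inside


def point_in_geometry(lng, lat, geometry):
    """Handle GeoJSON Polygon and MultiPolygon, treating inner rings as holes."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        return False
    for rings in polygons:
        outer, holes = rings[0], rings[1:]
        if point_in_ring(lng, lat, outer) and not any(
            point_in_ring(lng, lat, hole) for hole in holes
        ):
            return True
    return False


def assign_locations(locations, features, code_property="OA11CD"):
    """Map each unique location id to the code of the first polygon containing it."""
    assigned = {}
    for loc_id, (lat, lng) in locations.items():
        assigned[loc_id] = None
        for feature in features:
            if point_in_geometry(lng, lat, feature["geometry"]):
                assigned[loc_id] = feature["properties"].get(code_property)
                break
    return assigned


def square(code, west):
    """Toy 1x1 square feature starting at longitude `west` (illustration only)."""
    ring = [[west, 0], [west + 1, 0], [west + 1, 1], [west, 1], [west, 0]]
    return {"type": "Feature", "properties": {"OA11CD": code},
            "geometry": {"type": "Polygon", "coordinates": [ring]}}


toy_features = [square("A", 0), square("B", 1)]
toy_locations = {"loc1": (0.5, 0.5), "loc2": (0.5, 1.5), "loc3": (5.0, 5.0)}  # (lat, lng)
result = assign_locations(toy_locations, toy_features)
print(result)  # {'loc1': 'A', 'loc2': 'B', 'loc3': None}
```

This brute-force loop checks every polygon for every location, so for the full set of OAs a spatial index (shapely ships an STRtree, for example) would probably be worth exploring.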

In [45]:
with open('OA_2011_BGC_london.json', 'r') as f:
    geojsonLondon = json.load(f)
