This is an experiment in collecting Instagram data to gather insights about London. One inspiration was http://www.datascopeanalytics.com/blog/instagrams-blind-spot/, but I am also hoping to learn what people are taking photos of, do a bit of image analysis (a "colors of Instagram" viz comes to mind), and maybe eventually play with some sort of image recognition using neural networks (if I ever get that far 😅).

Anyway, what I've done is call the media_search API endpoint to scrape photos from London. Instagram only seems to return 20 photos per call, so I split Greater London into a grid of lat/lng pairs and hoover up whatever comes back for each grid point. New photos are added all the time, so to get the most out of it I set up an AWS Lambda function that runs every hour to pick up new stuff.
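To make the grid idea concrete, here is a rough Python sketch of how the lat/lng pairs could be generated. The real scraper is in Node.js and isn't shown here, so the bounding box, the `step_km` spacing and the `make_grid` name are all my own assumptions, not the values actually used.

```python
import math

# Rough bounding box for Greater London (assumed values, not the scraper's)
LAT_MIN, LAT_MAX = 51.28, 51.70
LNG_MIN, LNG_MAX = -0.52, 0.34

def make_grid(lat_min, lat_max, lng_min, lng_max, step_km=1.0):
    """Return (lat, lng) cell centres spaced roughly step_km apart."""
    km_per_deg_lat = 111.0  # approximate length of one degree of latitude
    mid_lat = (lat_min + lat_max) / 2
    # A degree of longitude shrinks with the cosine of the latitude
    km_per_deg_lng = 111.0 * math.cos(math.radians(mid_lat))
    lat_step = step_km / km_per_deg_lat
    lng_step = step_km / km_per_deg_lng
    n_lat = int(math.ceil((lat_max - lat_min) / lat_step))
    n_lng = int(math.ceil((lng_max - lng_min) / lng_step))
    return [
        (lat_min + (i + 0.5) * lat_step, lng_min + (j + 0.5) * lng_step)
        for i in range(n_lat)
        for j in range(n_lng)
    ]

grid = make_grid(LAT_MIN, LAT_MAX, LNG_MIN, LNG_MAX, step_km=2.0)
print(len(grid), grid[0])
```

Each pair would then be passed as the centre of one media_search call. A smaller `step_km` means more calls per run but less chance of losing photos to the 20-result cap in busy areas.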

Obviously there will be duplicates, but I've imported all the files into MongoDB with a unique index on the photo id, so each post is stored only once.
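For reference, this is roughly what the de-duplicating import could look like from Python. The actual import was done by the Node.js parser, so treat this as a sketch under assumptions: `store_unique` is a made-up helper name, and I'm assuming the unique index sits on the Instagram `id` field.

```python
def store_unique(coll, posts):
    """Insert posts into coll, skipping any whose Instagram 'id' is already stored.

    coll is expected to behave like a pymongo Collection.
    Returns the number of newly inserted posts.
    """
    # Unique index on the Instagram post id; re-creating an existing index is a no-op
    coll.create_index("id", unique=True)
    inserted = 0
    for post in posts:
        # $setOnInsert only writes the document when no post with this id exists yet
        result = coll.update_one(
            {"id": post["id"]}, {"$setOnInsert": post}, upsert=True
        )
        if result.upserted_id is not None:
            inserted += 1
    return inserted

# Usage (assumes a local MongoDB and a list of post dicts called scraped_posts):
# from pymongo import MongoClient
# store_unique(MongoClient().instagramLondon.posts, scraped_posts)
```

The upsert means an hourly run can re-send photos it has already seen without raising duplicate key errors.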

The Lambda and the parser are in Node.js since that's what I know best, but Python comes with some handy tools (including scikit-learn and notebooks...), so I'm handing over to Python for this bit...


In [2]:
import numpy as np
import pandas as pd
import pymongo
from pymongo import MongoClient
client = MongoClient()

In [4]:
from pprint import pprint

In [7]:
from bson.son import SON #needs sorting so...

In [3]:
#db setup: everything returned from that API endpoint is stashed in the single posts collection
db = client.instagramLondon
coll = db.posts

In [13]:
#here's what one record looks like
pprint(coll.find_one())

{'_id': ObjectId('57c46f2e4f084b83079167ac'),
 'attribution': None,
 'caption': {'created_time': '1472338156',
             'from': {'full_name': 'Carlos Pinto @ London-Lisbon',
                      'id': '245152137',
                      'profile_picture': 'https://scontent.cdninstagram.com/t51.2885-19/10932399_451400081695046_252473917_a.jpg',
                      'username': 'carlosmpinto10'},
             'id': '17862561217003381',
             'text': 'Category is "west London realness"! Thank you for coming '
                     'and eating all my Portuguese saussages!!'},
 'comments': {'count': 0},
 'created_time': '1472338156',
 'filter': 'Normal',
 'id': '1326391051846276361_245152137',
 'images': {'low_resolution': {'height': 320,
                               'url': 'https://scontent.cdninstagram.com/t51.2885-15/s320x320/e35/14031729_1508980895794743_1117711764_n.jpg?ig_cache_key=MTMyNjM5MTA1MTg0NjI3NjM2MQ%3D%3D.2.l',
                               'width': 320},
      

Some fields of interest here:
* tags (array) -- what are people talking about?
* timestamp -- stored as an ISODate; the parser derives it from 'created_time'
* images -- the URLs of the image files, which would be really useful for color analysis
* location -- for mapping, obviously 😄
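As a quick illustration of the timestamp field, this is the Python equivalent of what the parser does with 'created_time'. The real conversion lives in the Node.js parser, so the helper name `created_time_to_datetime` is just mine, and I'm assuming the value is Unix seconds in UTC.

```python
from datetime import datetime, timezone

def created_time_to_datetime(created_time):
    """Convert Instagram's 'created_time' (Unix seconds as a string) to a UTC datetime."""
    return datetime.fromtimestamp(int(created_time), tz=timezone.utc)

# The value from the sample record above
ts = created_time_to_datetime('1472338156')
print(ts.isoformat())
```

Note that pymongo hands back naive datetimes by default, so the values read from the collection won't carry the UTC offset that this helper attaches.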

In [11]:
#some basic stats about the collection so far
# coll.count() was removed in PyMongo 4; count_documents({}) counts every document
print('number of records: {}'.format(coll.count_documents({})))

number of records: 49200


In [12]:
#what are the top tags?
pipeline = [ 
    {"$unwind": "$tags"}, 
    {"$group": {"_id": "$tags", "count": { "$sum": 1 }}}, 
    {"$sort" : SON({ "count" : -1}) }, 
    {"$limit": 20}]
aggCursor = db.posts.aggregate(pipeline)
toptags = list(aggCursor)
toptags

[{'_id': 'london', 'count': 12907},
 {'_id': 'uk', 'count': 1970},
 {'_id': 'travel', 'count': 1621},
 {'_id': 'love', 'count': 1455},
 {'_id': 'nottinghillcarnival', 'count': 1403},
 {'_id': 'londonlife', 'count': 1306},
 {'_id': 'carnival', 'count': 1262},
 {'_id': 'summer', 'count': 1253},
 {'_id': 'nottinghill', 'count': 1196},
 {'_id': 'england', 'count': 1163},
 {'_id': 'instagood', 'count': 1162},
 {'_id': 'bankholiday', 'count': 1124},
 {'_id': 'art', 'count': 991},
 {'_id': 'photooftheday', 'count': 894},
 {'_id': 'architecture', 'count': 816},
 {'_id': 'thisislondon', 'count': 798},
 {'_id': 'photography', 'count': 783},
 {'_id': 'fashion', 'count': 746},
 {'_id': 'picoftheday', 'count': 721},
 {'_id': 'instadaily', 'count': 708}]

Unsurprisingly, london comes top as a tag...
However, there are some that I expect to be more 'seasonal', e.g. nottinghillcarnival (28-29 Aug 2016) and bankholiday (29 Aug 2016).
It would also be interesting to see which locations have a higher density of particular tags, e.g. 'theatre' in the West End, maybe? Or 'bigben' around Westminster?
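One way to check the 'seasonal' hunch would be to count how often a tag shows up per day. Here's a minimal sketch in plain Python. `daily_tag_counts` is a hypothetical helper, and the sample posts are made up purely to show the input and output shape.

```python
from collections import Counter
from datetime import datetime

def daily_tag_counts(posts, tag):
    """Count posts per calendar day that carry the given tag.

    posts: iterable of dicts with a 'tags' list and a 'timestamp' datetime.
    """
    counts = Counter()
    for post in posts:
        if tag in post.get("tags", []):
            counts[post["timestamp"].date()] += 1
    # Sort by date so the result reads as a little time series
    return dict(sorted(counts.items()))

# Tiny made-up sample, only to show the shape of the data
sample = [
    {"tags": ["london", "carnival"], "timestamp": datetime(2016, 8, 28, 14, 0)},
    {"tags": ["carnival"], "timestamp": datetime(2016, 8, 28, 18, 30)},
    {"tags": ["london"], "timestamp": datetime(2016, 8, 29, 9, 15)},
]
print(daily_tag_counts(sample, "carnival"))
```

Against the real collection, something like `daily_tag_counts(coll.find({"tags": "nottinghillcarnival"}, {"tags": 1, "timestamp": 1}), "nottinghillcarnival")` should work, since the cursor yields dicts with those two fields.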

In [18]:
#min date:
db.posts.find_one(sort=[("timestamp", pymongo.ASCENDING)])['timestamp']

datetime.datetime(2016, 8, 19, 15, 42, 48)

In [19]:
#max date so far:
db.posts.find_one(sort=[("timestamp", pymongo.DESCENDING)])['timestamp']

datetime.datetime(2016, 8, 29, 16, 54, 58)