# Web Archive Seedlist

When you are trying to document an event that has been reflected in social media, like #Ferguson or #CharlestonShooting, how can you discover Web content that might be useful to collect? One way is to look at the URLs people mention in their posts, and then use those URLs to build a seed list for your web archiving software or Archive-It account.

The Twitter data we collected with Social Feed Manager is gzip-compressed, line-oriented JSON. Don't worry if this sounds complicated... it just means we have a file where every line is a complete JSON document that represents a tweet, and the file itself is compressed, like a zip file. Luckily, Python comes with modules that make it easy to parse JSON and decompress gzip files. Let's import them:

In [38]:
import json
import gzip
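
To make the format concrete, here is a minimal sketch that writes a tiny gzip-compressed, line-oriented JSON file and reads it back. The two records and the file name `sample.json.gz` are made up for illustration; real tweets carry many more fields than these.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for tweets (illustrative only).
sample_tweets = [
    {"id": 1, "text": "first example", "entities": {"urls": []}},
    {"id": 2, "text": "second example",
     "entities": {"urls": [{"expanded_url": "http://example.com/"}]}},
]

# Write one JSON document per line into a gzip-compressed file.
sample_path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")
with gzip.open(sample_path, "wt", encoding="utf-8") as f:
    for tweet in sample_tweets:
        f.write(json.dumps(tweet) + "\n")

# Read it back: gzip handles the decompression, json parses each line on its own.
with gzip.open(sample_path, "rt", encoding="utf-8") as f:
    parsed = [json.loads(line) for line in f]

print(len(parsed), "records read back")
```

Because each line stands alone, we never have to load the whole file into memory to process it.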

We need to open up our file, read in every line, parse the JSON for the tweet, and count how many times each URL appears. We're going to use a Python dictionary to keep track of the URLs and the number of times they were found.

In [50]:
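def count_urls(filename):
    """Return a dict mapping each expanded URL to the number of times it was tweeted."""
    urls = {}
    # 'rt' opens the gzip file in text mode, so each line is a str
    with gzip.open(filename, 'rt') as f:
        for line in f:
            tweet = json.loads(line)
            # ignore data that doesn't have entities, e.g. deletes
            if 'entities' not in tweet:
                continue
            for entity in tweet['entities']['urls']:
                url = entity['expanded_url']
                urls[url] = urls.get(url, 0) + 1
    return urls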
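
If you prefer, the same counting can be sketched with `collections.Counter` from the standard library. The function name `count_urls_counter` is hypothetical, and the extra guards assume that some records may lack a `urls` list or carry an empty `expanded_url`; that is an assumption about the data, not something observed in this dataset.

```python
import gzip
import json
from collections import Counter


def count_urls_counter(filename):
    """Hypothetical variant of count_urls that returns a collections.Counter."""
    urls = Counter()
    with gzip.open(filename, 'rt', encoding='utf-8') as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            tweet = json.loads(line)
            # fall back to empty containers when entities or urls are absent
            entities = tweet.get('entities') or {}
            for entity in entities.get('urls') or []:
                url = entity.get('expanded_url')
                if url:  # assumed guard: skip missing or null expanded URLs
                    urls[url] += 1
    return urls
```

A `Counter` behaves like a dictionary, and its `most_common(n)` method returns the `n` most frequent URLs with their counts.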

Let's run count_urls on some data. If you're working with a different dataset, feel free to change the path to the file here.

In [54]:
counts = count_urls('data/filters/fakePREMIS/author-mentions.json.gz')

In [56]:
print(counts)

{'http://gogetfunding.com/ms-society-tcep-skydive-event/': 1, 'http://boingboing.net/2015/07/08/u-s-patent-office-cancels-was.html': 28, 'https://youtu.be/DOInXG3VZAU': 1, 'http://imgur.com/kDTbkhS': 1, 'https://twitter.com/yony_themoony/status/614903337838886912': 3, 'https://twitter.com/VRFocus/status/613821447027322880': 1, 'https://twitter.com/caria_pridmore': 1, 'http://inhabitat.com/testicle-eating-fish-return-to-new-jersey/': 1, 'https://vimeo.com/ondemand/conman/132973956': 19, 'https://twitter.com/disinfo/status/619121390982156289': 1, 'https://twitter.com/henryfraser0/status/618523251929686016': 208, 'http://fnd.us/c/810jw4': 1, 'https://soundcloud.com/widerangehum01/anbar-rules?in=widerangehum01/sets/6th-and-callow&utm_source=soundcloud&utm_campaign=share&utm_medium=twitter': 4, 'http://bit.ly/vote-ada': 1, 'http://cnn.it/1G89Dhu': 93, 'http://www.bustle.com/articles/17219-harry-potter-fan-theory-suggests-potter-is-immortal-would-this-be-a-better-ending': 1, 'http://www.them

But that's not the easiest thing to read, right? How about we print the URLs in order from most to least tweeted, to see what people are talking about the most?

In [68]:
urls = sorted(counts, key=counts.get, reverse=True)

In [69]:
for url in urls:
    print(url, counts[url])

http://ow.ly/P8G6W 220
https://twitter.com/henryfraser0/status/618523251929686016 208
http://www.citylab.com/housing/2015/07/mapping-the-us-by-property-value-instead-of-land-area/397841/ 113
http://cnn.it/1G89Dhu 93
http://fus.in/1HMRnkU 67
http://ind.pn/1NOEi9e 57
http://www.comicbookresources.com/article/sdcc-butch-guice-teams-with-william-gibson-for-idws-archangel 41
http://qz.com/445330/japan-is-building-solar-energy-plants-on-abandoned-golf-courses-and-the-idea-is-spreading/ 41
https://twitter.com/SagarNeupane16/status/619539601506627584 38
http://boingboing.net/2015/07/08/u-s-patent-office-cancels-was.html 28
https://twitter.com/wewillquackyou/status/616133852428374016 27
http://nyti.ms/1HhEz2M 27
http://sfist.com/2015/07/08/video_what_apple_thought_the_future.php 24
http://youtu.be/tnWP2Emps1M 23
http://nyti.ms/1LWIuoZ 23
http://bbc.in/1MgM0Z9 23
http://engage.dyingwithdignity.ca/dying_canadians_need_clear_answers_now?recruiter_id=14055 21
http://bit.ly/1CpKLag 20
https://vimeo.
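
With the URLs ranked, one way to turn them into a seed list is to write the most-mentioned ones to a text file. The sketch below assumes your archiving tool accepts a plain text file with one URL per line; the helper name `write_seedlist` and the `min_count` threshold are hypothetical, and the cutoff of 2 is arbitrary.

```python
def write_seedlist(counts, path, min_count=2):
    """Hypothetical helper: write URLs seen at least min_count times, one per line."""
    # rank URLs from most to least tweeted, as in the sorted() call above
    ranked = sorted(counts, key=counts.get, reverse=True)
    seeds = [url for url in ranked if counts[url] >= min_count]
    with open(path, 'w', encoding='utf-8') as f:
        for url in seeds:
            f.write(url + '\n')
    return seeds
```

For example, `write_seedlist(counts, 'seeds.txt', min_count=20)` would keep only URLs tweeted at least 20 times; `seeds.txt` is a placeholder file name. Raising the threshold trims one-off links, at the cost of missing content that few people shared.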