# Web Archive Seedlist

When you are trying to document an event that has been reflected in social media, like #Ferguson or #CharlestonShooting, how can you discover Web content that might be useful to collect? One way is to look at what people are talking about in social media, gather the URLs they mention, and then use those URLs to create a seed list for your web archiving software or Archive-It account.

The Twitter data we collected with Social Feed Manager is gzip-compressed, line-oriented JSON. Don't worry if this sounds complicated: it just means that every line in the file is a complete JSON document representing a tweet, and the file itself is compressed, like a zip file. Luckily, Python comes with modules that make it easy to parse JSON and decompress gzip files. Let's import them:

In [12]:
import json
import gzip
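To make "gzip-compressed, line-oriented JSON" concrete, here is a minimal sketch that writes two made-up records to a small gzip file and reads them back. The records, the `sample.json.gz` file name, and the temporary directory are assumptions for illustration only; real tweets from Social Feed Manager carry many more fields.

```python
import gzip
import json
import os
import tempfile

# Two made-up records standing in for tweets (hypothetical data).
sample = [
    {"id": 1, "text": "first example"},
    {"id": 2, "text": "second example"},
]

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "sample.json.gz")

# Write one JSON document per line into a gzip-compressed text file.
with gzip.open(path, "wt", encoding="utf-8") as f:
    for record in sample:
        f.write(json.dumps(record) + "\n")

# Read it back: each line parses on its own as a complete JSON document.
with gzip.open(path, "rt", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(records[0]["text"])
```

Because each line stands alone, a file like this can be processed one tweet at a time without loading the whole dataset into memory.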

We need to open our file, read every line, parse the JSON for the tweet, and count how many times each URL appears. We're going to use a Python dictionary to keep track of the URLs and the number of times they were found.

In [13]:
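def count_urls(filename):
    """Return a dict mapping each expanded URL to the number of times it was tweeted."""
    urls = {}
    # 'rt' opens the gzip file in text mode so each line is a str
    with gzip.open(filename, 'rt') as f:
        for line in f:
            tweet = json.loads(line)
            # ignore data that doesn't have entities, e.g. deletes
            if 'entities' not in tweet:
                continue
            for entity in tweet['entities']['urls']:
                url = entity['expanded_url']
                # start at 0 the first time we see a URL, then add 1
                urls[url] = urls.get(url, 0) + 1
    return urls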
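As an aside, the same tally can be sketched with `collections.Counter` from the standard library. This is an alternative under stated assumptions, not the approach the notebook uses: the function name `count_urls_from_lines`, the made-up input lines, and the extra guards for missing `urls` or empty `expanded_url` values are all hypothetical additions.

```python
import json
from collections import Counter

def count_urls_from_lines(lines):
    """Tally expanded URLs from an iterable of JSON-encoded tweet lines."""
    counts = Counter()
    for line in lines:
        tweet = json.loads(line)
        # .get() with defaults skips records lacking entities or urls.
        for entity in tweet.get("entities", {}).get("urls", []):
            url = entity.get("expanded_url")
            if url:  # skip missing or null expanded URLs
                counts[url] += 1
    return counts

# Made-up lines shaped like the tweets described above (hypothetical data).
lines = [
    '{"entities": {"urls": [{"expanded_url": "http://example.com/a"}]}}',
    '{"entities": {"urls": [{"expanded_url": "http://example.com/a"},'
    ' {"expanded_url": "http://example.com/b"}]}}',
    '{"delete": {"status": {"id": 1}}}',
]

sample_counts = count_urls_from_lines(lines)
print(sample_counts.most_common())
```

Taking an iterable of lines rather than a file name makes the function easy to try on a handful of strings; passing it an open `gzip` file object in text mode should work the same way.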

Let's run our function on some data. If you've got a different dataset that you are working with, feel free to change the path to the file here.

In [14]:
counts = count_urls('data/assorted/ferguson-blacklivesmatter.json.gz')

In [15]:
print(counts)

{'https://secure.piryx.com/donate/mS25KFCe/MORE/mikebrown': 7, 'http://ow.ly/K81w0': 6, 'http://ln.is/tumblr.com/X04hs': 2, 'https://instagram.com/p/0B6xCQIlq8/': 1, 'http://wp.me/p5kq6O-Y': 1, 'http://bit.ly/1KoCuXe': 2, 'http://www.mrctv.org/blog/revenue-generation-rather-lawful-policing-led-constitutional-and-racial-abuses-says-doj?utm_campaign=naytev&utm_content=54f8db4be4b0cebf9280360c': 1, 'http://www.youtube.com/watch?v=Gw4nQd6lryw': 2, 'http://tinyurl.com/kxewvt8': 2, 'http://mic.com/articles/111772/the-doj-just-released-its-ferguson-police-investigation-and-it-s-worse-than-you-thought?utm_campaign=naytev&utm_content=54f76ee0e4b014856d104c6a': 1, 'http://youtu.be/zwtgQly2NhM': 2, 'http://fw.to/O3IPgvF': 1, 'http://fb.me/1TTOS9e4H': 1, 'http://www.rescuepost.com/files/ori-complaint_rev_1.pdf': 3, 'http://ow.ly/JBh5d': 8, 'http://nytimes.com?smid=nytcore-iphone-share&smprod=nytcore-iphone': 1, 'http://youtu.be/6ZzyVoPISg8': 1, 'http://wp.me/p3HucV-dsy': 2, 'http://en.m.wikipedia.

But that's not the easiest thing to read, right? How about we print them out in order from most to least tweeted, to see what people are talking about the most?

In [16]:
urls = sorted(counts, key=counts.get, reverse=True)
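To see what that `sorted` call is doing, here is a small sketch on a made-up dictionary. Iterating over a dictionary yields its keys, and `key=toy_counts.get` tells `sorted` to rank each key by its stored count. The `toy_counts` data and the `min_count` threshold are assumptions for illustration; the threshold simply shows one way a ranked list might be trimmed before it becomes a seed list.

```python
# Made-up counts (hypothetical data, not from the Ferguson dataset).
toy_counts = {
    "http://example.com/a": 2,
    "http://example.com/b": 5,
    "http://example.com/c": 1,
}

# Sort the keys by their counts, largest first.
ranked = sorted(toy_counts, key=toy_counts.get, reverse=True)
print(ranked)
# ['http://example.com/b', 'http://example.com/a', 'http://example.com/c']

# Hypothetical cutoff: keep only URLs mentioned at least min_count times.
min_count = 2
frequent = [url for url in ranked if toy_counts[url] >= min_count]
print(frequent)
# ['http://example.com/b', 'http://example.com/a']
```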

In [17]:
for url in urls:
    print(url, counts[url])

http://www.alternet.org/personal-health/ferguson-activists-are-struggling-mental-trauma-long-after-police-abuse-during 462
http://www.huffingtonpost.com/2015/03/09/ferguson-report_n_6833272.html?1425932483 265
http://www.usatoday.com/story/news/nation/2015/03/10/ferguson-city-manager-resigned/24734977/ 258
http://www.citylab.com/crime/2015/03/this-police-brutality-map-shows-ferguson-is-everywhere/386833/ 99
http://www.theatlantic.com/national/archive/2015/03/ferguson-as-a-criminal-conspiracy-against-its-black-@JonathanHoenigresidents/386887/?utm_source=btn-twitter-pckt 93
https://vimeo.com/user37413083/blacklivesmatter 83
http://nyti.ms/1wS7XWm 60
http://www.mappingpoliceviolence.org 55
http://bit.ly/1F0kL3o 53
https://www.aclu.org/blog/racial-justice-criminal-law-reform-free-speech/ferguson-black-blue 53
http://lat.ms/1CfI0qw 53
http://www.washingtonpost.com/world/national-security/justice-dept-review-finds-pattern-of-racial-bias-among-ferguson-police/2015/03/03/27535390-c1c7-11e4-927