Creating a Seed Document for Analysis
=====================================

This notebook outlines a tutorial for assembling descriptive data about the Web Archive Collections.  This includes:
- Descriptive data for the collection itself
- Selected descriptions of the most interesting seeds
- Time-series analysis of the main domains through Crawl-viz (https://github.com/web-archive-group/WALK-CrawlVis)
- Field notes to help others get a "feel" for what's in the collection
- Possibly, annotation software like Hypothes.is to take advantage of crowdsourcing

Observing the Archive
=====================

The first step is to view the archive at archive-it.org.  For this example, I am going to use the [Canadian Political Parties and Political Interest Groups Archive](https://archive-it.org/collections/227). Start by reading through the archive: which websites are archived, who owns them, and so on.  Jot down some notes on questions like these:

- What kinds of sites are being collected? (Political parties & interest groups)
- How long have they been collecting data? (for over 10 years)
- What qualitative evaluations can you make? (In general, marginal parties like The Cosmopolitan Party are given the same weight as established parties such as the Liberals, Conservatives & NDP.)

Observing the Crawl-Viz
=======================

Each crawl has been analysed as a time series on the [Crawl-Viz data site](https://web-archive-group.github.io/WALK-CrawlVis/crawl-sites). [CPP's crawl viz](https://web-archive-group.github.io/WALK-CrawlVis/crawl-sites/TORONTO_Canadian_Political_Parties-urls.html) is very interesting. For example, one might expect that established parties would have more to say than marginal parties. In fact, the busiest web site appears to be [equalvoice.ca](http://www.equalvoice.ca), an advocacy group committed to including more women in political life. Additional notes like this about the crawl-viz would be very helpful.  Jot them down.

# Enter Data into a NoSql Database

NoSQL is all the rage these days, not that that means much. SQL, or *Structured Query Language*, was the standard for nearly everything not very long ago. Computers used to have a hard time processing lots of data, so we spent a lot of time keeping everything organized in fixed tables so that it was easy to get at later. NoSQL databases do not use SQL. That sounds fancy, but one of the most common ways they store data is as "key-value pairs" which, if you know anything about programming, is exactly what a dictionary in Python or an object in JavaScript does.  It's also the way that JSON, a popular alternative to XML, works.  For instance:
<pre>
{"key" : "value" }
</pre>

There!  I just did NoSQL.  So exciting, eh?  Well, there's more to it than this of course. One advantage of this approach is its flexibility.  Instead of conforming to a specific schema or structure, you can just put the data in a store and make things up as you go along. The code below does this.
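Before the database cells, here is a minimal plain-Python sketch of that flexibility. It uses only the standard `json` module, and the record names and values are made up for illustration; it is not part of the WALK data.

```python
import json

# Two records in the same "table" that do not share a schema:
# the second one carries an extra key that the first one lacks.
records = [
    {"seed_name": "Example Party", "times_captured": 12},
    {"seed_name": "Example Advocacy Group", "times_captured": 3,
     "field_notes": "Site redesign in the middle of the crawl period."},
]

# Serialize to JSON text and read it back; the structure survives the round trip.
as_text = json.dumps(records, indent=2)
restored = json.loads(as_text)

# Look values up by key; .get() returns a default when a key is absent.
for record in restored:
    print(record["seed_name"], "|", record.get("field_notes", "(no notes)"))
```

Nothing had to be declared in advance for the second record to carry `field_notes`, which is the "make things up as you go along" part.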

In [1]:
# TinyDB is a very simple NoSQL database used for small projects.  It's not ideal for large projects
# because it is file-based, meaning only one person can change the database at a time.
#
# We are also importing the time library so that we can include some timestamps as part of the work.
import sys
sys.path.append('/opt/anaconda/anaconda3/lib/python3.6/site-packages')
import tinydb as tdb
import time
#import networkx

In [2]:
# set up database
db = tdb.TinyDB("./data/WALK.json")

# db is now what our database is called.  We can also create some tables

collections = db.table("collections")  # the table for describing an overall collection
collections_backup = db.table("collections_backup") # a table for backing things up whenever we mess up

seed_main = db.table("seed_main") # in each collection, there are a number of websites we might want to describe.
seed_backup = db.table("seed_backup") # again, a backup in case we lose information.

Seed = tdb.Query()  # Seed will be the query object we will use (instead of having to write Query() all the time).

In [5]:
# !!!!!!!!!!!!!  USE WITH CARE!  Will delete all tables from the database !!!!!!!!!!!!!!!!

################# seed_main.purge()
################# seed_backup.purge()
################# collections.purge()
################# collections_backup.purge()
################# db.purge_tables()
################# default = db.table("_default")
print (db.tables())

{'seed_backup', 'collections', 'collections_backup', '_default', 'seed_main'}


# Describe the collection and add to the Database

TinyDB accepts Python dictionaries (sets of key-value pairs) as data. Therefore, if we want to add information to the database, we must make it conform to Python dictionary format: { 'key' : 'value' }.  In Python, we can do this one of two ways.  

1. Write out the whole dictionary each time (next slide), OR
2. Add data to the dictionary after the fact.  For example, 
   <pre> description['key'] = value 
   </pre>
   stores value in the dictionary, where it can be looked up again with
   description['key'].
   
### Troubleshooting

#### Invalid Syntax:
Check that:
    1. there is a comma after each key-value pair;
    2. every quote is closed with the same kind of quote that opened it.
       (NOTE: in Python, single and double quotes behave identically, and
       a character like & is ordinary text in both.  Single quotes are convenient
       when the text itself contains double quotes, and vice versa.)
    3. keys are encased in quotes: 'key'
    4. values are encased in quotes unless they are numbers.
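
When a syntax error involves quotes, it can help to check how Python treats them in a scratch cell. The sketch below uses made-up strings and only built-in behaviour.

```python
# Single and double quotes build the same string in Python.
assert 'Liberals & NDP' == "Liberals & NDP"

# Pick the quote style that avoids escaping the quotes inside the text.
feel = 'a "feel" for the collection'
owner = "the archive's owner"

# Backslash escapes are processed in both styles;
# a raw string (r'...') keeps the backslash literal.
assert 'a\tb' == "a\tb"
assert len('a\tb') == 3 and len(r'a\tb') == 4

# Triple quotes allow multi-line text, as in the description cells.
notes = '''
line one
line two
'''
print(repr(notes))
```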
       


In [4]:
collection_title = input("Enter the collection title:")


Enter the collection title:hello


In [45]:

description = {
    ## collection_title and WALK_collection_folder are used to decide
    ## whether to insert a new item or to update an old one ...
    
    'collection_title' : 'Canadian Political Parties and Political Interest Groups',
    
    'WALK_collection_folder' : 'TORONTO_Canadian_Political_Parties',
    
    ## If you accidentally update the wrong item, you can retrieve the old value from the 
    ## collections_backup table.
    
    
    # How does the Library/Archives describe the archive?
    'institutional_description' : '''
    
    Canadian Political Parties and Political Interest Groups will archive the websites of all 
    the national Canadian political parties, and a number of special interest groups across 
    the political spectrum.
    
    ''',
    
    # In your own words, how do you describe the collection
    'WALK_description' : '''
    
    Contains the web archives for the main parties (Liberal, Conservatives, NDP, Bloc, Green) but 
    also a wide range marginal parties (Cosmopolitan Party, Canadian Action, Christian Heritage and
    so on).  The "special interest groups" include the David Suzuki Foundation (an environmental 
    advocacy group) and fairvote.ca (advocacy for changing the electoral system).
    
    ''',
    
    # What file did you use to view the viz-link?
    'crawl_viz_link_file' : 'TORONTO_Canadian_Political_Parties-urls.html',
    
    'crawl_viz_description' : '''
    
    - Between March 06 and January 07 and then again between July 09 & November 09, Policy Alternatives had the
    largest amount of activity.
    - A rise in activity for equalvoice.ca (advocacy for women in political leadership) between December 09 and 
    November 2011.
    - Of the major parties, the Liberal Party of Canada and the Green Party had the most activity.
    
    ''',
    
    ##  You can add additional items here using the format 
    ##  'META_DATA_TAG' : 'DATA_VALUES',
}



exists = collections.search(Seed.collection_title==description['collection_title']) and collections.search(Seed.WALK_collection_folder==description['WALK_collection_folder'])
description['TIMESTAMP'] = time.time() #create a timestamp
if exists:
    # print(collections.search(Seed.collection_title.exists()))
    el = collections.get(Seed.collection_title==description['collection_title'])
    il = collections.get(Seed.WALK_collection_folder==description['WALK_collection_folder'])
    if el.eid == il.eid:
        # Both lookups found the same record, so it is safe to update it.
        collections.update(description, eids=[el.eid])
        collections_backup.insert(description)
        print ("updated "  + description['collection_title'] + ".\n")
        print ("Previous insert added to backup log.\n")
        # Only report on the backup when something was actually written to it.
        backup = max([(x['collection_title'], x['TIMESTAMP'], x.eid) for x in collections_backup.search(Seed.TIMESTAMP > 0)])
        print ('title: "' + backup[0] + "\ntimestamp: " + str(backup[1]) + "\neid (aka id): " + str(backup[2]))
        
else:
    collections.insert(description)
    print ("inserted!")



inserted!
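
The insert-or-update idea in the cell above can be sketched without a database at all. This is a plain-Python illustration, not TinyDB code: the tables are ordinary lists, `upsert` is a hypothetical helper, and the record values are made up. Two choices here are my own assumptions rather than what the cell does: the sketch replaces the stored record wholesale (TinyDB's `update` merges fields into it), and it backs up the previous version rather than the new one.

```python
import time

def upsert(table, backup, record, key_fields):
    """Insert record, or replace the stored record whose key_fields all match.

    table and backup are plain lists of dictionaries standing in for
    database tables. Returns 'inserted' or 'updated'.
    """
    stamped = dict(record, TIMESTAMP=time.time())
    for position, stored in enumerate(table):
        if all(stored.get(field) == record.get(field) for field in key_fields):
            backup.append(stored)       # keep the previous version
            table[position] = stamped   # replace it with the new one
            return 'updated'
    table.append(stamped)
    return 'inserted'

collections_demo, backup_demo = [], []
keys = ('collection_title', 'WALK_collection_folder')

first = {'collection_title': 'Example Collection',
         'WALK_collection_folder': 'EXAMPLE_Folder',
         'WALK_description': 'first draft'}
second = dict(first, WALK_description='second draft')

print(upsert(collections_demo, backup_demo, first, keys))
print(upsert(collections_demo, backup_demo, second, keys))
print(len(collections_demo), len(backup_demo))
```

Matching on all the key fields in a single pass avoids comparing the results of separate lookups.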


# More detailed description of collection contents

Within each collection, there may be one or more seeds that are worth an additional look.  I propose using the seed_main table for this.  Again, you can include whatever additional information you think is relevant by providing an additional key: value pair.
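
As a small sketch of adding extra key: value pairs, here is a made-up seed record built in plain Python. The keys `language`, `field_notes` and `reviewed_by` are hypothetical examples, not fields the WALK project requires.

```python
# A hypothetical seed record; the names and values are made up for illustration.
example_seed = {
    'seed_name': 'Example Party',
    'url': 'http://party.example',
    'times_captured': 12,          # numbers are written without quotes
}

# Add extra key: value pairs after the fact; no schema change is needed.
example_seed['language'] = 'en/fr'
example_seed['field_notes'] = 'Mostly press releases; few images.'

# Several keys can be merged in at once with update().
example_seed.update({'videos': 0, 'reviewed_by': 'your initials here'})

for key, value in example_seed.items():
    print(key, ':', value)
```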



In [49]:
seed = {
        "collection_title" : "Canadian Political Parties and Political Interest Groups",
        "WALK_collection_folder" : "TORONTO_Canadian_Political_Parties",
        "seed_name" : "Cosmopolitan Party of Canada",
        "first_crawl" : "2005-10-04",
        "latest_crawl" : "2012-11-03",
        "times_captured": 59,
        "videos" : 0,
        "url" : "http://agoracosmopolite.com",
        "description" : '''
            Also called the "Progressive Nationalist Party", it is a "progressive and environment protection oriented
            political party" that seeks the "political, economic and cultural assimilation of Canada, into the
            United States, under the _Security and Prosperity Partnership_ (SPP)."
        ''', 
        "some new information" : ""
    }

seed_exists = ((seed_main.search(Seed.collection_title==seed['collection_title']) or seed_main.search(Seed.WALK_collection_folder==seed['WALK_collection_folder']))
                and seed_main.search(Seed.seed_name==seed['seed_name']))
print (seed_exists)
seed['TIMESTAMP'] = time.time()
if seed_exists:
    # print(collections.search(Seed.collection_title.exists()))
    sel = seed_main.get(Seed.collection_title==seed['collection_title'])
    sil = seed_main.get(Seed.WALK_collection_folder==seed['WALK_collection_folder'])
    ssl = seed_main.get(Seed.seed_name==seed['seed_name'])
    if sel.eid == sil.eid == ssl.eid:
        # Use the seed's own id (sel), not el from the collection cell above.
        seed_main.update(seed, eids=[sel.eid])
        seed_backup.insert(seed)
        print ("updated "  + seed['collection_title'] + ".\n")
        print ("Previous insert added to backup log.\n")
        # Only report on the backup when something was actually written to it.
        Seed_backup = max([(x['collection_title'], x['TIMESTAMP'], x.eid) for x in seed_backup.search(Seed.TIMESTAMP > 0)])
        print ('title: "' + Seed_backup[0] + "\ntimestamp: " + str(Seed_backup[1]) + "\neid (aka id): " + str(Seed_backup[2]))
        
else:
    seed_main.insert(seed)
    print ("inserted!")

[{'TIMESTAMP': 1488049265.426975, 'description': '\n            Also called the "Progressive Nationalist Party", it is a "progressive and environment protection oriented\n            political party" that seeks the "political, economic and cultural assimilation of Canada, into the\n            United States, under the _Security and Prosperity Partnership_ (SPP)."\n        ', 'latest_crawl': '2012-11-03', 'url': 'http://agoracosmopolite.com', 'videos': 0, 'times_captured': 59, 'seed_name': 'Cosmopolitan Party of Canada', 'WALK_collection_folder': 'TORONTO_Canadian_Political_Parties', 'collection_title': 'Canadian Political Parties and Political Interest Groups', 'first_crawl': '2005-10-04'}]
updated Canadian Political Parties and Political Interest Groups.

Previous insert added to backup log.

title: "Canadian Political Parties and Political Interest Groups
timestamp: 1488049284.002535
eid (aka id): 2
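
The TIMESTAMP values come from `time.time()`, so they are seconds since 1 January 1970 (UTC) and not very readable. A minimal sketch of turning one into a date with the standard `datetime` module, using the timestamp from the output above:

```python
from datetime import datetime, timezone

# The timestamp printed in the output above (seconds since 1970-01-01 UTC).
raw_timestamp = 1488049284.002535

# Interpret it as a UTC moment and format it for reading.
moment = datetime.fromtimestamp(raw_timestamp, tz=timezone.utc)
print(moment.strftime('%Y-%m-%d %H:%M:%S %Z'))
```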


In [63]:
import textwrap
RESULTS = collections.search(Seed.collection_title.exists())
for item in RESULTS:
    print (item['collection_title'])
    seeds = seed_main.search(Seed.collection_title==item['collection_title'])
    print ("has the following seeds:")
    for se in seeds:
        
        print (textwrap.indent(se['seed_name'], '   '))


Canadian Political Parties and Political Interest Groups
has the following seeds:
   Cosmopolitan Party of Canada


Preparing Gephi
===============

The Gephi-compatible data files are available at the [WALK dataverse page](https://dataverse.scholarsportal.info/dataset.xhtml?persistentId=hdl:10864/12040). Each file is in "gdf" format, which is compatible with a network visualization tool called [Gephi](https://gephi.org/). We are going to use Gephi to produce attractive web graphs.
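
If you want to peek inside a gdf file before opening Gephi, a rough reader can be sketched in plain Python. This assumes the usual GDF layout (a `nodedef>` header line followed by node rows, then an `edgedef>` header line followed by edge rows); the WALK files may carry different columns, and `read_gdf` and the sample data are hypothetical.

```python
import csv
import io

def read_gdf(handle):
    """Split a GDF stream into node rows and edge rows (lists of dicts)."""
    nodes, edges, current, columns = [], [], None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("nodedef>") or line.startswith("edgedef>"):
            kind, header = line.split(">", 1)
            current = nodes if kind == "nodedef" else edges
            # Each column is declared as "name TYPE"; keep only the name.
            columns = [col.strip().split(" ")[0] for col in header.split(",")]
            continue
        if current is None:
            continue  # ignore anything that appears before the first header
        values = next(csv.reader([line]))
        current.append(dict(zip(columns, values)))
    return nodes, edges

# Made-up sample standing in for a downloaded file.
sample = io.StringIO(
    "nodedef>name VARCHAR,label VARCHAR\n"
    "n1,party.example\n"
    "n2,news.example\n"
    "edgedef>node1 VARCHAR,node2 VARCHAR,weight DOUBLE\n"
    "n1,n2,3.0\n"
)
gdf_nodes, gdf_edges = read_gdf(sample)
print(len(gdf_nodes), "nodes,", len(gdf_edges), "edges")
```

For a real file, pass `open("yourfile.gdf", encoding="utf-8")` instead of the sample (the file name here is a placeholder).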

* Start by downloading the file and starting Gephi. 
        > File > Open (or ⌘o)
        > Change file format to GDF (Guess) *.gdf
        > Select the file and open

The raw data will probably just be a large set of nodes with no real order to them. Not very helpful to us.


<img src="img/firstGephi.png" alt="Not very helpful Gephi visualization" width="200" height="200" />


We need to filter them.  A good way is to focus on [strongly connected components](http://www.geeksforgeeks.org/strongly-connected-components/). A component is just a group of nodes (in this case websites) all connected together. *Strongly* connected components are a little stricter: every node in the component can reach every other node by following links in the direction they point. A good test is to run your finger from node to node, following the direction of each arrow; if you can get from any node to any other node that way, the group is strongly connected.
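
To make the idea concrete, here is a small sketch that finds strongly connected components in a toy link graph. It uses Kosaraju's algorithm, which is one standard way to do this and not necessarily what Gephi runs internally; the domain names are made up.

```python
from collections import defaultdict

def strongly_connected_components(edges):
    """Group nodes into strongly connected components (Kosaraju's algorithm)."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)
        nodes.update((source, target))

    # Pass 1: record nodes in order of DFS completion on the original graph.
    finished, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        seen.add(start)
        stack = [(start, iter(forward[start]))]
        while stack:
            node, neighbours = stack[-1]
            for nxt in neighbours:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(forward[nxt])))
                    break
            else:
                finished.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph in reverse completion order.
    components, assigned = [], set()
    for start in reversed(finished):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for nxt in backward[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    todo.append(nxt)
        components.append(component)
    return components

# Three sites link in a cycle; two others are only linked to, never back.
links = [
    ("party.example", "news.example"),
    ("news.example", "blog.example"),
    ("blog.example", "party.example"),
    ("blog.example", "video.example"),
    ("video.example", "shop.example"),
]
components = strongly_connected_components(links)
largest = max(components, key=len)
print(sorted(largest))
```

The three sites in the cycle form one component; `video.example` and `shop.example` each end up alone because nothing links back from them.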

If we use the Statistics menu, Gephi will calculate a number of things that are important to us. To create a strongly connected component graph, we need to run the Connected Components statistic. 

<img src="img/menu.png" alt="Gephi Statistics Menu" width="200" height="800" />

It will ask us if we want to do this in a directed or undirected fashion. Directed means that we care about the direction of the linkages, so choose that (it should be the default).

Now for the hard part.  Out of many thousands of strongly connected components, we need to find the largest ones.

Use the top bar to switch to the data laboratory.

<img src="img/topbar.png" alt="Gephi Top Bar" width="300" height="200" />


There's a light bulb icon at the far right of the spreadsheet.  Click that to limit the number of columns you see.

<img src="img/lightbulb.png" alt="Gephi Light Bulb" width="300" height="200" />

Uncheck every box except the one for strongly connected component.  Then click the label "Strongly Connected Component" to sort the list.  

Perhaps there is a better way to do this with a regular expression, but we need to find the Strongly Connected Component ID that has the largest number of nodes. Unfortunately, this requires scrolling through the list to find what ID that is.  Fortunately, the largest component is probably so large that as you scroll quickly down, it will be obvious because the IDs will be the same for a while.  In my case the number was 13047 (yes, pretty far down) but I found it in less than 2 minutes.
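
One possible shortcut, sketched here under assumptions: if you export the Data Laboratory table to a CSV file, a few lines of Python can count how many nodes share each component ID. The column names and rows below are made up (your export may label the column differently), so adjust `COMPONENT_COLUMN` to match.

```python
import csv
import io
from collections import Counter

# Stand-in for a table exported from the Data Laboratory; in practice,
# open the exported file instead. The column names are assumptions.
exported = io.StringIO(
    "Id,Strongly Connected Component\n"
    "party.example,13047\n"
    "news.example,13047\n"
    "blog.example,13047\n"
    "video.example,2\n"
    "shop.example,7\n"
)
COMPONENT_COLUMN = "Strongly Connected Component"

# Count the rows per component ID and take the most frequent one.
counts = Counter(row[COMPONENT_COLUMN] for row in csv.DictReader(exported))
component_id, size = counts.most_common(1)[0]
print(component_id, size)
```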

<img src="img/list.png" alt="Long List Changing" width="200" height="800" />

Now to filter the graph. Go back to "Overview" to see the graph.  Next to the statistics menu, there is a tab called "filter." Click that.

* Click Attributes > Equal 

and select Strongly Connected Component ID. 

At the bottom, you will see "value." Enter the ID of the component with the largest number of nodes (again, mine was 13047). Then click "filter."  The new graph will be much more manageable.

<img src="img/filtered.png" alt="Filtered Graph" width="200" height="200" />


Now on the left, we can use an algorithm to organize the nodes a little better.  Go to the "layout" menu and select the "Yifan Hu" algorithm. 

<img src="img/yifan.png" alt="Yifan Hu Menu" width="200" height="800" />

You'll find the layout much better organized.

<img src="img/yifangraph.png" alt="Yifan Hu Graph" width="200" height="200" />

It's possible that the Yifan Hu is not the best possible layout for your graph.  This is going to depend on what you notice in the graph and what you want to highlight.  Here are some other recommendations:

* Force Atlas - This algorithm is better if you have a smaller graph

<img src="img/fatlas.png" alt="Forced Atlas Graph" width="200" height="200" />


* Circular - This algorithm emphasizes the edges of the graph, as all the nodes are in a circle.

<img src="img/circular.png" alt="Circular Graph" width="200" height="200" />


* Fruchterman-Reingold - spaces all the nodes an equal distance apart.

<img src="img/fruchtermann.png" alt="Fruchterman-Reingold" width="200" height="200" />


Now that the layout is fixed, we can add some color to the graph.  Return to the statistics menu (hit the "statistics" tab where you did your filter) and run the "average degree," "average path length" and "modularity" statistics.

Now go to the top left "partition" menu.

<img src="img/partition.png" alt="Partition menu" width="200" height="400" />

Click the "refresh" icon to ensure all your new statistics appear.  Then use the select menu to choose "modularity class" as a partition.  Gephi will group the nodes using modularity, a rough community detection algorithm.  Then choose a "sizing" item to size the nodes based on their degree, betweenness or other value.

There are a variety of things you can do with Gephi that I will leave up to you now. But I do have some suggestions:

* Try eliminating some nodes from the analysis that are too obvious (e.g. Google, YouTube, Twitter, Facebook).
* Use "spline" to compensate for long-tail effects in a large network, for example when one node has so many more links than everyone else that it's hard to see what's going on.
* Try grouping sets of nodes together that have a lot in common (e.g. if there's an NPD.org (French) and an NDP.ca (English) website).
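
The first and third suggestions can also be tried on the edge list before it ever reaches Gephi. The sketch below is one way to do that in plain Python; the `simplify` helper, the alias table and the sample links are all hypothetical, with the domain names borrowed from the examples above.

```python
# Nodes that link to (or from) everything and tell us little.
TOO_OBVIOUS = {"google.com", "youtube.com", "twitter.com", "facebook.com"}

# Map alternate domains onto one canonical node (a hypothetical pairing).
ALIASES = {"npd.org": "ndp.ca"}

def simplify(edges):
    """Merge aliased nodes, drop the obvious ones, and remove self-links."""
    cleaned = set()
    for source, target in edges:
        source = ALIASES.get(source, source)
        target = ALIASES.get(target, target)
        if source in TOO_OBVIOUS or target in TOO_OBVIOUS:
            continue
        if source != target:
            cleaned.add((source, target))
    return sorted(cleaned)

sample_links = [
    ("ndp.ca", "facebook.com"),
    ("npd.org", "liberal.example"),
    ("ndp.ca", "liberal.example"),
    ("npd.org", "ndp.ca"),
    ("liberal.example", "youtube.com"),
    ("liberal.example", "ndp.ca"),
]
simplified = simplify(sample_links)
for source, target in simplified:
    print(source, "->", target)
```

Merging before filtering means a link between the French and English sites collapses into a self-link and is dropped, rather than surviving as a spurious edge.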

This is what I got after a little bit of fiddling with the results:

<img src="img/cpp.png" alt="Final Product" width="800" height="800" />