This project contains all the program files for my SNA course project. The course home page can be found at
The purpose of the project is to extract link relationships from the blogs and analyze the community structure of the statistics blogging community. In addition, NLP techniques will be used to find similar blogs based on content. The question of whether there is any relationship between community and content will then be explored.
Current status:
* Able to take a list of urls, extract the feed, extract links based on those feeds, and save the content along with the links to a json file. Also extracts links from the first page, approximately corresponding to a blogroll. Both links (stripped to domain) and links matched to the blog list are saved. (A sketch of the feed step appears after this list.)
* However, the automatic blogroll extraction didn't work well, so I'm building the blogroll links manually. It's a very slow process; see the caveats below.
* Able to construct a basic dot file of a directed graph based on those links.
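For concreteness, here is a minimal sketch of the feed-to-json part of that step. It is not the repository's actual script: it assumes the feedparser package (not in the dependency list below), Python 2.7 as used in the project, and a hypothetical fetch_feeds() helper; the scraping of the first page for blogroll links is not shown.

    # Minimal sketch, not the actual script: fetch each blog's feed and save
    # entry content plus links to one json file per blog in out/.
    # Assumes the feedparser package; fetch_feeds() is a hypothetical name.
    import json
    import os
    import feedparser

    def fetch_feeds(url_file, out_dir='out'):
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        with open(url_file) as f:
            urls = [line.strip() for line in f if line.strip()]
        for url in urls:
            feed = feedparser.parse(url)       # download and parse the RSS/Atom feed
            entries = [{'title': e.get('title', ''),
                        'link': e.get('link', ''),
                        'content': e.get('summary', '')}
                       for e in feed.entries]
            name = url.replace('http://', '').replace('https://', '').replace('/', '_')
            with open(os.path.join(out_dir, name + '.json'), 'w') as out:
                json.dump({'url': url, 'entries': entries}, out)

    fetch_feeds('feedlist.txt')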
manual_blogroll.txt - Text file with a list of blog urls. Format is a url, followed by a semicolon, followed by a comma-separated list of blog urls in the blog's blogroll.
- takes a list of urls, downloads the feed for each url, and saves the content and links to a json file
- extracts links from HTML; one function simply extracts the domain, and another matches it to a list passed to it (such as a list of blogs) so that outlinks are constrained to the original community (see the sketch after this file list)
- unit tests (could be much more robust)
- creates the term document matrix, stored in out\tdf.txt
feedlist.txt - list of blog urls, one to a line
out/ - directory holding the json files from get_feed and the gml file
- parses all json files in out/ and creates a digraph based on outlinks in the blogs (as saved by get_feed); creates a dot file in the out/ directory
NOTE: there is some error with the addition of titles as labels.
- reads the term document matrix and creates a similarity matrix, writing it out to out\similarity.txt
- among other things, performs k-means clustering on the blogs
README - this document
TODO - things that are remaining to do in the project
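The two link-extraction helpers described in the file list could look roughly like this. This is a sketch written for Python 2.7 (as used in the project), not the repository's code; the names extract_domain, match_to_blogs, and extract_links are illustrative.

    # Illustrative helpers, not the repository's code.
    import re
    from urlparse import urlparse   # urllib.parse on Python 3

    def extract_domain(link):
        """Strip a link down to its domain, e.g. http://example.com/post/1 -> example.com."""
        if '//' not in link:
            link = 'http://' + link          # tolerate scheme-less entries in the blog list
        netloc = urlparse(link).netloc.lower()
        return netloc[4:] if netloc.startswith('www.') else netloc

    def match_to_blogs(link, blog_list):
        """Return the blog in blog_list whose domain matches the link, or None.

        Matching against the blog list keeps outlinks constrained to the
        original community, as described above.
        """
        domain = extract_domain(link)
        for blog in blog_list:
            if extract_domain(blog) == domain:
                return blog
        return None

    def extract_links(html):
        """Pull href values out of raw HTML (a real HTML parser would be more robust)."""
        return re.findall(r'href="(https?://[^"]+)"', html)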
Caveats:
* Links to Andrew Gelman's blog are very diverse; he has several addresses, so I standardized them all to a single url.
* Same with Simply Stats, standardized to one url.
* And Flowing Data, all standardized to one url.
* There are many links from inside the statistics web to outside it, for example to econometrics, sociology, mathematics, and CS. I had to stop following them somewhere, and sometimes the cutoff may seem arbitrary; I had to balance time against the return on value to the project.
How to run the analysis:
1. Create manual_blogroll.txt in the format listed above.
2. python manual (this takes a while if you want the titles of the blogs as labels)
2.1 python manual manual_blogroll.txt out/<something>.pickle to create a pickled NetworkX graph that can be loaded later (a sketch of steps 2 and 2.1 appears after this list)
3. python feedlist to create feedlist.txt
4. python to create the json files in the out\ directory (this takes a while)
5. python to create the term document matrix file (this takes a while)
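A hedged sketch of steps 2 and 2.1 (not the project's script): parse manual_blogroll.txt, build a directed NetworkX graph, and write it out as a .dot file and a pickle. Fetching blog titles for node labels, which is what makes the real step slow, is omitted, and the output filenames are placeholders.

    # Sketch only: build the blogroll digraph and write .dot and pickle output.
    # Each line of manual_blogroll.txt looks like: blog_url;link1,link2,link3
    import os
    import pickle
    import networkx as nx

    G = nx.DiGraph()
    with open('manual_blogroll.txt') as f:
        for line in f:
            if ';' not in line:
                continue
            blog, roll = line.split(';', 1)
            blog = blog.strip()
            for target in roll.split(','):
                target = target.strip()
                if target:
                    G.add_edge(blog, target)   # blog links to target in its blogroll

    if not os.path.isdir('out'):
        os.makedirs('out')
    # write_dot needs pydot (or use nx.drawing.nx_agraph.write_dot with pygraphviz)
    nx.drawing.nx_pydot.write_dot(G, os.path.join('out', 'blogroll.dot'))
    with open(os.path.join('out', 'blogroll.pickle'), 'wb') as f:
        pickle.dump(G, f)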
For the SNA part of the analysis:
1. After the manual step you can load the resulting .dot file into Gephi or any other tool that understands .dot files.
2. After the json step you can load the resulting graph into Gephi.
3. You can use the pickle file to load the graph into Python for further analysis, say with networkx (see the sketch below).
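For example, assuming the pickle from step 2.1 was written to out/blogs.pickle (a placeholder name), the graph can be loaded and inspected like this; the centrality measures are only suggestions, not necessarily what the course analysis used.

    # Load the pickled digraph and look at a few basic SNA measures (Python 2.7).
    # 'out/blogs.pickle' is whatever name you chose in step 2.1.
    import pickle
    import networkx as nx

    with open('out/blogs.pickle', 'rb') as f:
        G = pickle.load(f)

    print nx.info(G)                      # node and edge counts
    in_deg = nx.in_degree_centrality(G)   # who gets linked to most
    pr = nx.pagerank(G)                   # influence including indirect links
    for blog in sorted(pr, key=pr.get, reverse=True)[:10]:
        print blog, round(pr[blog], 4), round(in_deg[blog], 4)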
For the NLP part of the analysis:
1. Run the similarity and clustering step after the term document matrix has been created (this takes a short while). A sketch of the similarity computation follows.
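This is a sketch of the similarity computation only (the real script also does k-means clustering, which is not shown). It assumes out/tdf.txt loads as a plain numeric term-document matrix with one column per blog; the actual file layout may differ.

    # Cosine similarity between blogs from the term document matrix (sketch).
    import os
    import numpy as np

    tdm = np.loadtxt(os.path.join('out', 'tdf.txt'))   # terms x blogs (assumed layout)
    docs = tdm.T                                       # one row per blog
    norms = np.sqrt((docs ** 2).sum(axis=1))
    norms[norms == 0] = 1.0                            # guard against empty documents
    unit = docs / norms[:, None]
    similarity = unit.dot(unit.T)                      # cosine similarity matrix
    np.savetxt(os.path.join('out', 'similarity.txt'), similarity)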
Dependencies that must be installed (also see references)
NumPy (so you can't use the community version of ActivePython)
- Mining the Social Web, Matthew Russell, O'Reilly, 2011.
- Programming Collective Intelligence, Toby Segaran, O'Reilly, 2007.
- Python (CPython version 2.7.3 used for this project)
- NetworkX (Python package)
- NLTK (Python package)
- ... and of course, Social Network Analysis, taught by Lada Adamic Fall 2012 on Coursera.
I realize a lot of this code is done inefficiently, but it works.