SNA project

This project contains all the program files for my SNA course project. The course home page can be found at

The purpose of the project is to extract link relationships from the blogs and analyze the community structure of the statistics blog community. In addition, NLP techniques will be used to identify similar blogs based on content. Whether there is any relationship between community and content will also be explored.

Current status:

* Able to take a list of urls, extract the feed, extract links based on those feeds, and save the content along with the links to a json file. Also extracts links from the first page, approximately corresponding to a blogroll. Both links (stripped to domain) and links matched to the blog list are saved.
* However, the blogroll extraction wasn't very reliable, so I'm building the links from the blogrolls manually. It's a very slow process; see caveats below.
* Able to construct a basic dot file of a directed graph based on those links.
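
The two link-handling steps above (strip a link to its domain, and match it against the blog list so outlinks stay inside the community) can be sketched roughly like this. Function names are my own, not the project's; the project ran on Python 2.7, where the same logic lives in the urlparse module.

```python
from urllib.parse import urlparse  # "urlparse" module in the Python 2.7 the project used

def strip_to_domain(url):
    """Reduce a full URL to its bare domain, dropping any leading 'www.'."""
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith("www.") else netloc

def match_to_blogs(url, blog_domains):
    """Return the link's domain if it belongs to the known blog list, else None."""
    domain = strip_to_domain(url)
    return domain if domain in blog_domains else None
```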


manual_blogroll.txt - text file with a list of blog urls. Format is a url, followed by a semi-colon, followed by a comma-separated list of blog urls in that blog's blogroll.
- takes a list of urls, downloads the feeds based on each url, and saves the content and links to a json file
- extracts links from HTML. One function simply extracts the domain, and another matches it against a list passed to it (such as a list of blogs) so that outlinks are constrained to the original community.
- unit tests. Could be much more robust.
- creates the term document matrix, stores it in out\tdf.txt
feedlist.txt - list of blog urls, one to a line
out/ - directory holding the json files from get_feed and the gml file
- parses all json files in out/ and creates a digraph based on outlinks in the blogs (as saved by get_feed). Creates a dot file in the out/ directory. NOTE: there is some error with the addition of titles as labels.
- reads the term document matrix and creates a similarity matrix, writes it out to out\similarity.txt
- among other things, performs k-means clustering on the blogs
README - this document
TODO - things remaining to do in the project
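
For reference, parsing manual_blogroll.txt in the format above into a directed graph and dumping it as a .dot file can be sketched as follows. (The project pickles a NetworkX DiGraph; a plain adjacency dict stands in here so the sketch has no third-party dependencies.)

```python
def parse_blogroll(lines):
    """Parse lines like 'http://a.example; http://b.example, http://c.example'
    into an adjacency dict {source url: [target urls]}."""
    graph = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        source, _, targets = line.partition(";")
        graph[source.strip()] = [t.strip() for t in targets.split(",") if t.strip()]
    return graph

def write_dot(graph, path):
    """Write the adjacency dict as a minimal Graphviz digraph, loadable by Gephi."""
    with open(path, "w") as f:
        f.write("digraph blogs {\n")
        for source, targets in graph.items():
            for target in targets:
                f.write('  "{}" -> "{}";\n'.format(source, target))
        f.write("}\n")
```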


* Links to Andrew Gelman's blog are very diverse. He has several addresses. I standardized them to
* Same with Simply Stats, standardized to
* And Flowing Data, all standardized to
* There are many links from inside to outside the statistics web, for example to econometrics, sociology, mathematics, and CS. I had to stop following them somewhere, and sometimes the break may seem arbitrary. I had to balance time and return on value to the project.
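
The standardization itself is just a lookup table collapsing each blog's alternate addresses to one canonical domain. A minimal sketch with made-up placeholder domains (the project's actual mappings are not reproduced here):

```python
# Hypothetical alias table: every alternate address maps to the one canonical
# domain used in the graph. These entries are placeholders, not the real ones.
ALIASES = {
    "blog-a.example.net": "blog-a.example.com",
    "blog-a.example.org": "blog-a.example.com",
}

def standardize(domain):
    """Collapse a known alternate address to its canonical form."""
    return ALIASES.get(domain, domain)
```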

How to run the analysis:

1. Create manual_blogroll.txt in the format listed above.
2. python manual (this takes a while if you want the titles of the blogs as labels)
2.1 python manual manual_blogroll.txt out/<something>.pickle to create a pickled NetworkX graph that can be loaded later
3. python feedlist to create feedlist.txt
4. python to create json files in out\ directory (this takes a while)
5. python to create the term document matrix file (this takes a while)

For the SNA part of the analysis:

1. After the manual step, you can load the resulting .dot file into Gephi or any other tool that understands .dot files.
2. After the json step, you can load the output into Gephi.
3. You can use the pickle file to load the graph into Python for further analysis, say with networkx.
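
Loading the pickle for further analysis might look like the sketch below. The project's pickle holds a NetworkX DiGraph; a plain {source: [targets]} adjacency dict is used here so the example is dependency-free, but the same pickle.load call applies either way.

```python
import pickle
from collections import Counter

def load_graph(path):
    """Load a previously pickled graph object."""
    with open(path, "rb") as f:
        return pickle.load(f)

def in_degrees(adjacency):
    """Count incoming links per blog -- a rough popularity measure."""
    counts = Counter()
    for source, targets in adjacency.items():
        counts.update(targets)
    return counts
```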

For the NLP part of the analysis:
1. Run after running (this takes a short while)
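
The similarity step amounts to cosine similarity between rows of the term document matrix (one row of term counts per blog); k-means clustering then runs on the blogs. A minimal NumPy sketch, assuming raw counts as input:

```python
import numpy as np  # NumPy is among the project's dependencies

def cosine_similarity_matrix(tdm):
    """Pairwise cosine similarity between rows of a term document matrix."""
    tdm = np.asarray(tdm, dtype=float)
    norms = np.linalg.norm(tdm, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # guard against all-zero rows (blogs with no terms)
    unit = tdm / norms
    return unit @ unit.T
```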

Dependencies that must be installed (also see references):

NumPy (so you can't use the community version of ActivePython)


- Mining the Social Web, Matthew Russell, O'Reilly, 2011.
- Programming Collective Intelligence, Toby Segaran, O'Reilly, 2007.
- Python (CPython version 2.7.3 used for this project)
- NetworkX (Python package)
- NLTK (Python package)
- ... and of course, Social Network Analysis, taught by Lada Adamic Fall 2012 on Coursera.


I realize a lot of this code is done inefficiently, but it works.