GitHub - randomjohn/project: SNA project

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
out		out
.RData		.RData
.Rhistory		.Rhistory
.gitattributes		.gitattributes
.gitignore		.gitignore
README		README
TODO		TODO
blogs_clus.pickle		blogs_clus.pickle
build_graph.py		build_graph.py
build_tdm.py		build_tdm.py
clusters.py		clusters.py
feedlist.txt		feedlist.txt
feedlist_manual.txt		feedlist_manual.txt
filename_munger.py		filename_munger.py
get_counts.py		get_counts.py
get_feed.py		get_feed.py
imp_and_analyze.R		imp_and_analyze.R
link_extractor.py		link_extractor.py
manual_blogroll.txt		manual_blogroll.txt
porterstemmer.py		porterstemmer.py
project.Rproj		project.Rproj
similarity.py		similarity.py
sna_project.MININT-JSFC1SN.John.pui		sna_project.MININT-JSFC1SN.John.pui
sna_project.prj		sna_project.prj
stat blogs graph 2012 10 31.gephi		stat blogs graph 2012 10 31.gephi
stat blogs graph 2012 10 31.gephi_temp		stat blogs graph 2012 10 31.gephi_temp
test_filename_munger.py		test_filename_munger.py
test_get_counts.py		test_get_counts.py
test_link_extractor.py		test_link_extractor.py

Repository files navigation

This project contains all the program files for my SNA course project. The course home page can be found at https://class.coursera.org/sna-2012-001/class/index.

The purpose of the project is to extract link relationships from the blogs and analyze the community aspects of the statistics blog community. In addition, NLP techniques will be used to analyze similar blogs based on content. The question of is there any relationship between community and content will be explored.

Current status:

* Able to take a list of urls, extract the feed, extract links based on those feeds, and save content along with links to a json file. Also extracts links from the first page, approximately corresponding to a blog roll. Both links (stripped to domain) and links matched to blog list are saved.
* However, the blogroll effort wasn't so great. So I'm building the links from the blogroll manually. It's a very slow process. See caveats below:
* Able to construct a basic dot file of a directed graph based on those links.

Files:

manual_blogroll.txt - Text file with a list of blog urls. Format is a url, followed by a semi-colon, followed by a comma-separated list of blog urls in the blog's blogroll.
get_feed.py - takes a list of urls, downloads the feeds based on url, and saves the content and links to a json file
link_extractor.py - extracts links from HTML. One function simply extracts the domain, and another will match it to a list passed to it (such as a list of blogs) so that outlinks will be constrained to the original community.
test_link_extractor.py - unit tests for link_extractor.py. Could be much more robust.
get_counts.py - creates term document matrix, stores in out\tdf.txt
feedlist.txt - list of blog urls, one to a line
out/ - directory holding json files from get_feed and gml file from build_graph.py.
build_graph.py - parses all json files in out/ and creates a digraph based on outlinks in the blogs (as saved by get_feed). Creates a dot file in the out/ directory
NOTE: there is some error in build_graph.py with the addition of titles as labels.
similarity.py - reads the term document matrix from get_counts.py and creates a similarity matrix, writes it out to out\similarity.txt
clusters.py - among other things, performs k-means clustering on the blogs
README - this document
TODO - things that are remaining to do in the project

Caveats

* Links to Andrew Gelman's blog are very diverse. He has several addresses. I standardized them to http://www.andrewgelman.com
* Same with Simply Stats, standardized to http://simplystatistics.org
* And Flowing Data, all standardized to http://www.flowingdata.com
* There are many links from inside to outside the statistics web, for example to econometrics, sociology, mathematics, and CS. I had to stop following them somewhere, and sometimes the break may seem arbitrary. I had to balance time and return on value to the project.

How to run the analysis:

1. Create manual_blogroll.txt in the format listed above.
2. python build_graph.py manual (this takes a while if you want the titles of the blog as labels)
2.1 python build_graph.py manual manual_blogroll.txt out/<something>.pickle to create a pickled NetworkX graph that can be loaded later
3. python build_graph.py feedlist to create feedlist.txt
4. python get_feed.py to create json files in out\ directory (this takes a while)
5. python get_counts.py to create the term document matrix file (this takes a while)

For the SNA part of the analysis

1. After build_graph.py manual you can load blogs_manual.dot into Gephi or other tool that understands .dot files
2. After build_graph.py json you can load blogs.dot into Gephi.
3. You can use the pickle file to load graph into python for further analysis, say with networkx.

For the NLP part of the analysis:
1. Run similarity.py after running get_counts.py (this takes a short while)

Dependencies that must be installed (also see references)

NLTK
NumPy (so you can't use the community version of ActivePython)
Feedparser
Networkx

References

- Mining the Social Web, Matthew Russell, O'Reilly, 2011.
- Programming Collective Intelligence, Toby Seagran, O'Reilly, O'Reilly, 2007.
- Python, http://www.python.org (CPython version 2.7.3 used for this project)
- NetworkX, http://networkx.lanl.gov (Python package)
- NLTK, http://www.nltk.org (Python package)
- ... and of course, Social Network Analysis, taught by Lada Adamic Fall 2012 on Coursera.

Coda

I realize a lot of this code is done inefficiently, but it works.