First commit

0 parents commit 32191a0c5a7bb50fa3f0514c973903575f66392a @randomjohn committed Nov 4, 2012
Showing 400 changed files with 9,129 additions and 0 deletions.
22 .gitattributes
@@ -0,0 +1,22 @@
+# Auto detect text files and perform LF normalization
+* text=auto
+
+# Custom for Visual Studio
+*.cs diff=csharp
+*.sln merge=union
+*.csproj merge=union
+*.vbproj merge=union
+*.fsproj merge=union
+*.dbproj merge=union
+
+# Standard to msysgit
+*.doc diff=astextplain
+*.DOC diff=astextplain
+*.docx diff=astextplain
+*.DOCX diff=astextplain
+*.dot diff=astextplain
+*.DOT diff=astextplain
+*.pdf diff=astextplain
+*.PDF diff=astextplain
+*.rtf diff=astextplain
+*.RTF diff=astextplain
163 .gitignore
@@ -0,0 +1,163 @@
+#################
+## Eclipse
+#################
+
+*.pydevproject
+.project
+.metadata
+bin/
+tmp/
+*.tmp
+*.bak
+*.swp
+*~.nib
+local.properties
+.classpath
+.settings/
+.loadpath
+
+# External tool builders
+.externalToolBuilders/
+
+# Locally stored "Eclipse launch configurations"
+*.launch
+
+# CDT-specific
+.cproject
+
+# PDT-specific
+.buildpath
+
+
+#################
+## Visual Studio
+#################
+
+## Ignore Visual Studio temporary files, build results, and
+## files generated by popular Visual Studio add-ons.
+
+# User-specific files
+*.suo
+*.user
+*.sln.docstates
+
+# Build results
+[Dd]ebug/
+[Rr]elease/
+*_i.c
+*_p.c
+*.ilk
+*.meta
+*.obj
+*.pch
+*.pdb
+*.pgc
+*.pgd
+*.rsp
+*.sbr
+*.tlb
+*.tli
+*.tlh
+*.tmp
+*.vspscc
+.builds
+*.dotCover
+
+## TODO: If you have NuGet Package Restore enabled, uncomment this
+#packages/
+
+# Visual C++ cache files
+ipch/
+*.aps
+*.ncb
+*.opensdf
+*.sdf
+
+# Visual Studio profiler
+*.psess
+*.vsp
+
+# ReSharper is a .NET coding add-in
+_ReSharper*
+
+# Installshield output folder
+[Ee]xpress
+
+# DocProject is a documentation generator add-in
+DocProject/buildhelp/
+DocProject/Help/*.HxT
+DocProject/Help/*.HxC
+DocProject/Help/*.hhc
+DocProject/Help/*.hhk
+DocProject/Help/*.hhp
+DocProject/Help/Html2
+DocProject/Help/html
+
+# Click-Once directory
+publish
+
+# Others
+[Bb]in
+[Oo]bj
+sql
+TestResults
+*.Cache
+ClientBin
+stylecop.*
+~$*
+*.dbmdl
+Generated_Code #added for RIA/Silverlight projects
+
+# Backup & report files from converting an old project file to a newer
+# Visual Studio version. Backup files are not needed, because we have git ;-)
+_UpgradeReport_Files/
+Backup*/
+UpgradeLog*.XML
+
+
+
+############
+## Windows
+############
+
+# Windows image file caches
+Thumbs.db
+
+# Folder config file
+Desktop.ini
+
+
+#############
+## Python
+#############
+
+*.py[co]
+
+# Packages
+*.egg
+*.egg-info
+dist
+build
+eggs
+parts
+bin
+var
+sdist
+develop-eggs
+.installed.cfg
+
+# Installer logs
+pip-log.txt
+
+# Unit test / coverage reports
+.coverage
+.tox
+
+#Translations
+*.mo
+
+#Mr Developer
+.mr.developer.cfg
+
+# Mac crap
+.DS_Store
41 README
@@ -0,0 +1,41 @@
+This project contains all the program files for my SNA course project. The course home page can be found at https://class.coursera.org/sna-2012-001/class/index.
+
+The purpose of the project is to extract link relationships from the blogs and analyze the community aspects of the statistics blog community. In addition, NLP techniques will be used to analyze similar blogs based on content. The question of whether there is any relationship between community and content will be explored.
+
+Current status:
+
+* Able to take a list of urls, extract each blog's feed, extract links from those feeds, and save the content along with the links to a json file (sketched just below this list). Also extracts links from the first page of each blog, approximately corresponding to a blogroll. Both raw links (stripped to the domain) and links matched to the blog list are saved.
+* However, the blogroll scraping didn't work very well, so I'm building the blogroll links manually. It's a very slow process; see the caveats below.
+* Able to construct a basic dot file of a directed graph based on those links.
+
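get_feed.py itself is not shown on this page. A minimal sketch of the step described above, assuming a feedparser-based approach; the helper name extract_links, the 'content' handling, and the file-naming scheme are illustrative, not the committed code:

    import json
    import feedparser            # assumed third-party dependency, not confirmed by this commit
    import link_extractor as le  # the repo's own module

    def save_feed(url, out_dir='out/'):
        feed = feedparser.parse(url)     # download and parse the RSS/Atom feed
        records = [{'blogurl': url,
                    'title': feed.feed.get('title', url),
                    'blogroll': []}]     # build_graph.py expects a 'blogroll' entry in the first record
        for entry in feed.entries:
            html = entry.get('summary', '')
            records.append({'title': entry.get('title', ''),
                            'bloglinks': le.extract_links(html)})  # hypothetical helper name
        name = url.replace('http://', '').replace('/', '_') + '.json'
        json.dump(records, open(out_dir + name, 'w'))
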
+Files:
+
+manual_blogroll.txt - Text file with a list of blog urls. The format is a url, followed by a semicolon, followed by a comma-separated list of the blog urls in that blog's blogroll (see the example after this file list).
+get_feed.py - takes a list of urls, downloads each blog's feed, and saves the content and links to a json file
+link_extractor.py - extracts links from HTML. One function simply extracts the domain, and another matches links against a list passed to it (such as the list of blogs) so that outlinks are constrained to the original community (a sketch of the domain-stripping idea follows this README)
+test_link_extractor.py - unit tests for link_extractor.py. Could be much more robust.
+get_counts.py - creates the term-document matrix and stores it in out\tdf.txt
+feedlist.txt - list of blog urls, one to a line
+out/ - directory holding json files from get_feed and gml file from build_graph.py.
+build_graph.py - parses all json files in out/ and creates a digraph based on outlinks in the blogs (as saved by get_feed). Creates a gml file in the out/ directory
+README - this document
+TODO - things that are remaining to do in the project
+
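As an illustration of the manual_blogroll.txt format (the left-hand url here is made up; the two blogroll urls are ones that appear in the caveats below):

    http://example-stats-blog.com;http://www.andrewgelman.com,http://simplystatistics.org
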
+Caveats
+
+* Links to Andrew Gelman's blog are very diverse. He has several addresses. I standardized them to http://www.andrewgelman.com
+* Same with Simply Stats, standardized to http://simplystatistics.org
+* There are many links from inside to outside the statistics web, for example to econometrics, sociology, mathematics, and CS. I had to stop following them somewhere, and sometimes the break may seem arbitrary. I had to balance time and return on value to the project.
+
+How to run the analysis:
+
+1. Create manual_blogroll.txt in the format listed above.
+2. python build_graph.py manual (this takes a while if you want the titles of the blogs as labels)
+3. python build_graph.py feedlist to create feedlist.txt
+4. python get_feed.py to create json files in out\ directory (this takes a while)
+5. python get_counts.py to create the term document matrix file (this takes a while)
+
+For the SNA part of the analysis
+
+1. After build_graph.py manual you can load out/blogs_manual.dot into Gephi or another tool that understands .dot files.
+2. After build_graph.py json you can load out/blogs.gml into Gephi.
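link_extractor.py is not shown on this page. The domain-stripping idea the README describes could be sketched as below; the function name and signature are illustrative (only extract_title_from_url is confirmed, by build_graph.py further down):

    from urlparse import urlparse   # Python 2 module (urllib.parse in Python 3)

    def extract_domain(url, blog_list=None):
        # reduce a full link to scheme://host, e.g. http://www.andrewgelman.com
        parts = urlparse(url.strip())
        domain = '%s://%s' % (parts.scheme, parts.netloc)
        # optionally constrain outlinks to the known blog community
        if blog_list is not None and domain not in blog_list:
            return None
        return domain
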
16 TODO
@@ -0,0 +1,16 @@
+ * Basic stuff
+ x add titles of blogs to the labels (tried adding attr to networkx graph, gml didn't like it) - completed
+ * try scraping the front page of each of the blogs for blogroll links - need to add replacements to title strings so they can be valid file names
+ x complete manual blogroll process (start with allendowney.blogspot.com)
+ * Implement the NLP stuff
+ * create term document matrix
+ * convert to tf-idf
+ * cluster blogs based on similarity (review Programming Collective Intelligence)
+ * fancier stuff (nice to have): named entity extraction
+ * Implement the SNA stuff
+ x build graph
+ x display it
+ * analyze it
+ * Compare the two
+ * find a way to overlap the two
+ * maybe add topic clusters as attributes? that way Gephi can color them
127 build_graph.py
@@ -0,0 +1,127 @@
+# build a graph from a bunch of json files created by get_feed.py
+
+import networkx as nx
+import sys
+import os
+import json
+import link_extractor as le
+
+def build_graphs_from_json(blog_file):
+ # get list of blogs based on feedlist in blog_file
+    # strip newlines so node names match the urls parsed from the json files
+    blog_list=[line.strip() for line in file(blog_file)]
+ # get json files
+ json_files = [fil for fil in os.listdir("out/") if fil.endswith(".json")]
+
+ # initialize graph
+ blog_gr = nx.DiGraph()
+ blog_gr.add_nodes_from(blog_list)
+ #print blog_gr.nodes()
+ for fil in json_files:
+ print "Processing: " + fil
+ blog_props = json.load(file("out/" + fil))
+ # get the node
+ blog_url = blog_props[0]['blogurl']
+ #blog_url = ''.join([c for c in blog_url if c not in '\n'])
+ print blog_props[0]['title'].encode('ascii','ignore')
+ #blog_gr[blog_url].setdefault('weight',blog_props[0]['blogtitle'])
+ # add up all outlinks
+ outlinks = {}
+ # note this is commented out because I am using blogroll
+ # for prop in blog_props[1:]:
+ # for outlink in prop['bloglinks']:
+ # outlinks.setdefault(outlink,0)
+ # outlinks[outlink]+=prop['bloglinks'][outlink]
+ outlinks = blog_props[0]['blogroll']
+ print outlinks
+ for outlink in outlinks:
+ try:
+ blog_gr.add_edge(blog_url,outlink)
+ except TypeError:
+ print 'Type Error in add_edge'
+ print blog_url
+ print outlink
+ raise TypeError
+ blog_gr[blog_url][outlink].setdefault('weight',0)
+ blog_gr[blog_url][outlink]['weight']+=1
+ nx.write_gml(blog_gr,'out/blogs.gml')
+ return
+
+def build_graph_from_manual( blog_file, add_labels=False ):
+ blog_list = [line for line in file(blog_file)]
+ blog_gr=nx.DiGraph()
+
+ for blog in blog_list:
+ blog=blog.strip()
+ # split on the semicolon, on the left is the node and the right are outlinks
+ bl_list = blog.split(';',2)
+ blog_gr.add_node(bl_list[0])
+ if len(bl_list)>1 and bl_list[1]!="":
+ if bl_list[0].find("visualcomplexity.com")>-1:
+ # some debugging
+ print blog
+ print bl_list
+
+ outlinks = bl_list[1].split(',')
+ for outlink in outlinks:
+            if (len(outlink)>1 and outlink[len(outlink)-1]=='/'):
+                # drop the trailing slash; outlink[-1] alone would keep only the '/'
+                outlink=outlink[:-1]
+ elif (len(outlink)==1 or outlink==''):
+ # skip some slop
+ continue
+ blog_gr.add_edge(bl_list[0],outlink)
+    # add labels to nodes, if we need to
+    if (add_labels):
+        for n in blog_gr:
+            # stash the title in the node's adjacency dict; the spurious "title"
+            # entries this creates are filtered out of the edge list below
+            blog_gr[n]['title'] = ''
+            try:
+                webpage_title = le.extract_title_from_url(n)
+                print >> sys.stderr, 'Note: web page at ' + n + ' has title ' + webpage_title
+                blog_gr[n]['title'] = webpage_title.strip().replace('\n','').replace(' ','')
+            except:
+                print >> sys.stderr, 'Note: Could not parse ' + n
+                blog_gr[n]['title'] = n
+    # write the dot file out by hand, falling back to the url when no title was stored
+    node_dot = ['"%s" [label="%s"]' % (n,blog_gr[n].get('title',n)) for n in blog_gr]
+    edge_dot = ['"%s" -> "%s"' % (n1, n2) for n1,n2 in blog_gr.edges() if n2!="title"]
+ OUT = "out/blogs_manual.dot"
+ f = open(OUT,'w')
+ f.write('strict digraph{\n%s\n%s\n}' % (";\n".join(node_dot).encode('ascii','ignore'),";\n".join(edge_dot).encode('ascii','ignore')))
+ f.close()
+ return
+
+def make_feedlist_from_file(blog_file,out_file="feedlist_manual.txt"):
+ edge_list = [blog for blog in file(blog_file)]
+ url_set = set()
+ # create set of urls
+ for edge in edge_list:
+ urls = edge.split(';')
+        url_set.add(urls[0].strip())   # strip the trailing newline before adding
+ if (len(urls)>1 and urls[1]!=''):
+ for url in urls[1].split(','):
+ url_set.add(url.strip())
+ # write out to file
+ f=open(out_file,'w')
+ f.write('\n'.join(url_set))
+ f.close()
+ return
+
+def main( ):
+ if len(sys.argv)==1 or len(sys.argv)>3:
+ print "Usage: python build_graph.py (json|manual|feedlist) <file>"
+ return
+ elif len(sys.argv)==2:
+ if sys.argv[1]=="json":
+ blog_file="feedlist.txt"
+ else:
+ blog_file="manual_blogroll.txt"
+ elif len(sys.argv)==3:
+ blog_file=sys.argv[2]
+ if sys.argv[1]=="json":
+ build_graphs_from_json(blog_file)
+ elif sys.argv[1]=="feedlist":
+ make_feedlist_from_file(blog_file)
+ else:
+ build_graph_from_manual(blog_file,add_labels=True)
+ return
+if __name__=='__main__':
+ main()
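
The TODO's "analyze it" step is not implemented in this commit. As a starting point, the gml that build_graphs_from_json writes can be read back into networkx; a minimal sketch, not the author's code, written against the 1.x-era networkx this repo appears to use:

    import networkx as nx

    gr = nx.read_gml('out/blogs.gml')    # graph written by build_graphs_from_json above
    # in-degree roughly measures how often a blog shows up in other blogs' blogrolls/outlinks
    in_deg = sorted(dict(gr.in_degree()).items(), key=lambda x: x[1], reverse=True)
    for node, deg in in_deg[:10]:
        print node, deg
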
3 get_counts.py
@@ -0,0 +1,3 @@
+# build the term document matrix from the json files
+
+import json
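
get_counts.py is only a three-line stub at this commit; per the README it will build the term-document matrix and write it to out\tdf.txt. A minimal sketch of that step, assuming each json record stores its post text under a 'content' key (that key and every name below are assumptions, not the committed code):

    import os
    import json
    from collections import Counter

    def build_term_document_matrix(json_dir='out/', out_file='out/tdf.txt'):
        counts = {}      # blog url -> Counter of term frequencies
        vocab = set()
        for fil in os.listdir(json_dir):
            if not fil.endswith('.json'):
                continue
            props = json.load(open(json_dir + fil))
            blog_url = props[0]['blogurl']          # same record layout build_graph.py reads
            text = ' '.join(p.get('content', '') for p in props[1:])  # 'content' is an assumed key
            terms = [w.lower() for w in text.split() if w.isalpha()]
            counts[blog_url] = Counter(terms)
            vocab.update(terms)
        vocab = sorted(vocab)
        f = open(out_file, 'w')
        f.write('\t'.join(['blog'] + vocab) + '\n')      # header row of terms
        for url, cnt in counts.items():
            f.write('\t'.join([url] + [str(cnt[t]) for t in vocab]) + '\n')
        f.close()

The TODO's tf-idf conversion would then rescale each raw count by the log of the number of blogs divided by the number of blogs containing that term.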