# blog_finder$^{\dagger}$: an application to find the best blogs, bloggers, and blog posts on any given topic
- Crawls technical internet communities looking for links to blog posts
- Returns the blogs that are most relevant to the user and most highly regarded by the community

$^{\dagger}$: application title subject to change :)

## Progress as of 5-3-2016

### 1. Data acquisition
    - I wrote Python wrappers for the reddit and StackOverflow APIs that allow full-history downloads of any sub-community on those sites (i.e., subreddits and Stack Exchange sites).
    - The corpus so far is ~1 GB of data from select subreddits and stackexchange sites; more can be downloaded at any time by calling the APIs with this code.
    - The communities were selected to span a wide range of similarity to one another.
        - e.g., /r/datascience and stackexchange.datascience are very similar, /r/datascience and /r/statistics are somewhat similar, and /r/datascience and /r/Cooking are not similar.
        
        - subreddits:
            1. /r/MachineLearning
                - 30,260 comments
                - 2,830 links
    
            2. /r/datascience
                - 17,362 comments
                - 2,193 links

            3. /r/compsci
                - 23,847 comments
                - 3,464 links
                
            4. /r/statistics
                - 24,232 comments
                - 1,609 links

            5. /r/physics
                - 35,744 comments
                - 3,131 links

            6. /r/Cooking
                - 52,923 comments
                - 2,863 links

            7. /r/food
                - 29,048 comments
                - 930 links
                
        - stackexchange sites:
            1. stackexchange.datascience
                - 139,619 comments
                
            2. stackoverflow
                - 271,492 comments
                - 28,076 links

### 2. Community exploratory data analysis
    
    - Before building the recommender engine, I wanted to explore individual and shared aspects of different communities.
    
    - First, I looked at how often blog posts are actually shared on a particular subreddit; there's no use in doing this if no one ever links to blogs! Fortunately, blogs are shared often enough that the recommender engine will have enough data to work with.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/blog_timeseries.png](./plots/blog_timeseries.png)
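
One way such a time series could be computed is sketched below. It assumes each submission is available as a hypothetical `(created_utc, url)` pair and treats any URL containing `blog` as a blog link; neither the record layout nor the function name comes from the project code.

```python
from collections import Counter
from datetime import datetime, timezone


def monthly_blog_share(posts):
    """Return {"YYYY-MM": fraction of that month's posts whose URL contains 'blog'}.

    `posts` is an iterable of (created_utc, url) pairs (an assumed layout).
    """
    totals = Counter()
    blogs = Counter()
    for created_utc, url in posts:
        month = datetime.fromtimestamp(created_utc, tz=timezone.utc).strftime("%Y-%m")
        totals[month] += 1
        if "blog" in url.lower():
            blogs[month] += 1
    return {month: blogs[month] / totals[month] for month in sorted(totals)}


# Made-up example data.
posts = [
    (1451606400, "http://example.com/blog/intro"),  # 2016-01-01 UTC
    (1451692800, "http://example.com/paper.pdf"),   # 2016-01-02 UTC
    (1454284800, "http://blog.example.org/post"),   # 2016-02-01 UTC
]
share = monthly_blog_share(posts)
```

Reporting a fraction rather than a raw count keeps busy and quiet months comparable, though raw counts would work just as well for a first look.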

    - Next, I wanted to see how often a community linked to the same blogger.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/links_to_blogs.png](./plots/links_to_blogs.png)
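
A rough sketch of this kind of count, under the assumption that a blogger can be identified by the hostname of the link (which will not hold for shared hosting platforms). The function name and example URLs are made up for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def links_per_domain(urls):
    """Count how many links point at each hostname, ignoring a leading 'www.'."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host.startswith("www."):
            host = host[4:]
        if host:  # skip strings that did not parse as absolute URLs
            counts[host] += 1
    return counts


# Made-up example data.
urls = [
    "http://www.example-blog.com/post-1",
    "http://example-blog.com/post-2",
    "https://blog.example.org/2016/05/nets",
]
domain_counts = links_per_domain(urls)
```

`domain_counts.most_common()` then lists the most frequently linked hosts first.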

    - Sometimes multiple communities form around similar topics. I wanted to see whether two such communities have similar post titles, so I took /r/datascience and /r/MachineLearning and plotted the twenty most common words in the post titles of each. As expected, there is a fair amount of overlap: 7 words appear in the top twenty of both communities.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/most_common_terms.png](./plots/most_common_terms.png)
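
The top-twenty comparison can be sketched roughly as follows. The tokenizer is a deliberately simple assumption (lowercased alphanumeric runs, no stop-word removal), and the function names are illustrative rather than taken from the project.

```python
import re
from collections import Counter


def top_words(titles, n=20):
    """Return the n most common words across a list of post titles."""
    counts = Counter()
    for title in titles:
        counts.update(re.findall(r"[a-z0-9']+", title.lower()))
    return [word for word, _ in counts.most_common(n)]


def shared_top_words(titles_a, titles_b, n=20):
    """Words that appear in the top-n lists of both communities."""
    return set(top_words(titles_a, n)) & set(top_words(titles_b, n))


# Made-up example titles.
titles_ds = ["Learning data science", "Data science career advice"]
titles_ml = ["Deep learning for vision", "Learning rate schedules"]
shared = shared_top_words(titles_ds, titles_ml)
```

On real data, common words such as "the" or "for" would dominate both lists, so some stop-word filtering is probably needed before the overlap says much about topic.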

    - To decide efficiently which communities to search for blogs, we need a well-defined measure of similarity between the search topic and a community's topic. To get one, I first built dictionaries of word counts (effectively, sparse vectors) from the search query and from all of the post titles submitted to a given community. Cosine similarity is just the normalized dot product of two such vectors, with values close to 1 indicating high similarity. Here are the similarity scores for a few searches against the communities that I expected to be most relevant.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/search_term_relevance.png](./plots/search_term_relevance.png)
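
For reference, here is a minimal sketch of cosine similarity over word-count dictionaries. The tokenization and the helper names are assumptions made for the example, not the project's actual code.

```python
import math
import re
from collections import Counter


def word_counts(text):
    """Turn text into a word -> count dictionary (lowercased, simple tokens)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a, b):
    """Normalized dot product of two word-count dictionaries."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Made-up example: a query against all titles of a community joined together.
query = word_counts("neural network")
titles = word_counts(" ".join(["Neural network tutorial", "Training a neural network"]))
score = cosine_similarity(query, titles)
```

Because word counts are never negative, the score always falls between 0 (no shared words) and 1 (identical word proportions).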

    - Finally, I wanted to investigate how similar the communities I chose are to one another. I used the same similarity metric as in the previous step and plotted the pairwise similarities as a matrix. Not surprisingly, the communities that we expect to be similar are similar, as shown in the following plot.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/community_similarities.png](./plots/community_similarities.png)
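
A pairwise matrix like this could be assembled as in the sketch below, assuming each community has already been reduced to a word-count dictionary. The nested-dictionary layout and all names are illustrative choices.

```python
import math
from collections import Counter
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two word-count dictionaries."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def similarity_matrix(counts_by_community):
    """Return matrix[a][b] = cosine similarity between communities a and b."""
    names = sorted(counts_by_community)
    # The diagonal is set to 1.0 by definition; off-diagonal entries are filled below.
    matrix = {a: {b: 1.0 if a == b else 0.0 for b in names} for a in names}
    for a, b in combinations(names, 2):
        matrix[a][b] = matrix[b][a] = cosine(counts_by_community[a], counts_by_community[b])
    return matrix


# Made-up word counts standing in for real post titles.
communities = {
    "/r/datascience": Counter("data science data learning".split()),
    "/r/MachineLearning": Counter("learning neural network learning".split()),
    "/r/Cooking": Counter("recipe chicken recipe".split()),
}
matrix = similarity_matrix(communities)
```

Each pair is computed only once and mirrored, since the metric is symmetric.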

### 3. Getting the blog recommendations

- Now that we have a clearly defined way of deciding which communities to search for a given query, let's give it a shot!

- For this example, I am searching for the topic 'neural network'. For the preliminary analysis, I manually picked three communities that I already know are relevant; once deployed, the application will determine the appropriate communities itself. Here are the three communities:
    - /r/MachineLearning
    - /r/datascience
    - stackexchange.datascience
    
- For each community, I extracted every single comment. I first kept only the comments that contained URLs, which I matched with regular expressions. Then I filtered those comments again, retaining only those with 'blog' in the URL. This last step probably discards a lot of great links, so the final application will need to be savvier about detecting which links point to blogs.
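
The two filtering steps might look something like the following sketch. The URL pattern is a simple approximation chosen for the example and is not necessarily the expression used in the project.

```python
import re

# Approximate URL matcher: stops at whitespace, quotes, and closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")


def extract_blog_links(comments):
    """Return every URL found in the comment bodies that contains 'blog'."""
    links = []
    for body in comments:
        for url in URL_PATTERN.findall(body):
            url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
            if "blog" in url.lower():
                links.append(url)
    return links


# Made-up example comments.
comments = [
    "Great write-up: http://example.com/blog/backprop.",
    "See [this post](https://blog.example.org/rnn) and https://example.net/paper.pdf",
    "No links here.",
]
blog_links = extract_blog_links(comments)
```

The substring test is the weak point: it misses blogs hosted on domains without 'blog' in the URL and accepts any unrelated page whose path happens to contain the word.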

- Once I had all the comments containing blog links, I assigned each link two scores:
    1. A 'utility' score based on how useful the community found that post (and hence the blog link) as measured by votes.
    2. A 'relevance' score, based on the same cosine-similarity metric described previously, with the two vectors being the search query and the post contents.
    
- Here is a scatter plot of the 'utility' and 'relevance' scores for the search query 'neural network'. I added a decision boundary to the plot, where points within the decision boundary are *not* recommended, and those outside *are* recommended. The blog links could then be sorted, say, by normal distance to the decision boundary.
![https://raw.githubusercontent.com/tphinkle/blog_finder/master/plots/neural_network_query.png](./plots/neural_network_query.png)
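
As an illustration of sorting by distance to a boundary, the sketch below assumes a circular boundary around the origin after rescaling both axes. The boundary shape, the scale parameters, and the function name are all assumptions made for the example; the boundary in the plot may be defined differently.

```python
import math


def rank_links(scored_links, utility_scale, relevance_scale, radius=1.0):
    """Keep links outside the boundary, ordered by distance from it (farthest first).

    `scored_links` holds (url, utility, relevance) tuples. Both scores are divided
    by their scale so the axes are comparable; the boundary is then a circle of
    the given radius, and a point's normal distance to it is its distance from
    the origin minus the radius.
    """
    ranked = []
    for url, utility, relevance in scored_links:
        u = max(utility, 0) / utility_scale  # downvoted links count as zero utility
        v = relevance / relevance_scale
        margin = math.hypot(u, v) - radius
        if margin > 0:
            ranked.append((margin, url))
    ranked.sort(reverse=True)
    return [url for _, url in ranked]


# Made-up scores.
scored = [
    ("http://example.com/blog/a", 40, 0.60),
    ("http://example.com/blog/b", 2, 0.05),
    ("http://example.com/blog/c", 10, 0.90),
]
recommended = rank_links(scored, utility_scale=20, relevance_scale=0.5)
```

With a boundary of this shape, a link can be recommended for being very relevant, very well received, or moderately both.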
    
- Finally, let's see if our little blog recommender works. If it does, the blogs that the links point to should be both well written and related to neural networks.

