# blog_finder$^{\dagger}$: an application to find the best blogs, bloggers, and blog posts on any given topic
- Crawls technical internet communities looking for links to blog posts
- Returns the blogs that are most relevant to the user and most highly regarded by the community

$^{\dagger}$: application title subject to change :)

# Summary and workflow for blog finder application

## Summary:
1. Raw data acquisition:
    - Store raw data from posts from various social communities
    - Sources:
        1. __Reddit__: Make calls to Reddit API via custom Python script
        2. __StackExchange__: Use StackExchange API with custom Python script
        3. __Quora__:
        4. __Twitter__:

2. Blog post URL detection:
    - First, find candidate links with a regex
    - Then determine whether each link points to a blog
        - Possibly a binary classification problem (is blog vs. is not blog); this requires collecting features from training data (pages known to be blogs or not)
     
3. Data storage
    - Each data source gets its own table, reflecting the different features each community has
    - For example, for Reddit the features might be:
        1. Post ID (the table's *primary key*)
        2. Parent title
        3. Date/time
        4. Link
        5. Upvotes
        6. Blog contents
  
4. Utility metrics
    1. __Relevance score__: A measure of how relevant the blog post is to the user's query
        - Determined by tf-idf distance between query and community, blog post content, linking post content, etc.
    2. __Epistemic score__: A measure of the educational utility of the post
        - Determined by upvotes, frequency of appearance, etc.
    
5. Serve (one, two, three...) such posts to the user (daily, weekly...)
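
Step 2 above (finding links with a regex, then collecting features for an is-blog classifier) could be sketched as follows. This is a minimal illustration, not the project's actual code: the URL pattern, the `extract_links` and `link_features` helpers, and the feature names are all assumptions made for the example.

```python
import re
from urllib.parse import urlparse

# Illustrative pattern: an http(s) scheme followed by non-space characters,
# stopping at common closing delimiters used in Markdown and HTML.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_links(text):
    """Return all URLs found in text, with trailing punctuation stripped."""
    return [url.rstrip(".,;:!?") for url in URL_PATTERN.findall(text)]

def link_features(url):
    """Hypothetical features for an is-blog / is-not-blog classifier."""
    parsed = urlparse(url)
    return {
        "domain": parsed.netloc.lower(),
        "has_blog_token": "blog" in url.lower(),
        # Many blog permalinks embed a /YYYY/MM/ date in the path.
        "has_date_path": bool(re.search(r"/\d{4}/\d{2}/", parsed.path)),
    }

comment = "See [this post](https://example.com/blog/2016/03/intro-to-svm), it helped."
links = extract_links(comment)
```

Features like these would still need labeled training pages before any classifier could be fit; the heuristics alone are not a blog detector.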

## 1. Raw data acquisition

### Corpus summary

- Data collected to date comes from Reddit and StackExchange
- Entire corpus is 700 MB in size and contains 20,000 links

### i. Reddit comments
- Use the Reddit API to download an entire subreddit's worth of comments at a time.
- Search a given subreddit by date, from t1 to t2, for all submissions created within those bounds.
- Search within those submissions and accumulate all comments.
- Save raw data to a plain text file, with one file per subreddit.
- Add the raw submission and comment data, with all relevant features, to a MySQL database.
- Before fetching a subreddit's submissions and comments, first check whether they are already in the database; this avoids another costly API call.

#### Features:
1. id (primary key)
2. date created utc
3. body of submission/comment
4. upvotes
5. url
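
A table with these features, plus the "check the database before calling the API" step, might look like the sketch below. Note the substitutions: `sqlite3` stands in for MySQL so the example runs without a server, and the table name `reddit_comments` and the helpers `is_stored` and `store` are hypothetical.

```python
import sqlite3

# sqlite3 stands in for MySQL here so the sketch runs without a server.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE reddit_comments (
        id          TEXT PRIMARY KEY,   -- Reddit's own submission/comment id
        created_utc INTEGER NOT NULL,   -- Unix timestamp, UTC
        body        TEXT,
        upvotes     INTEGER,
        url         TEXT
    )
    """
)

def is_stored(conn, post_id):
    """True if this id is already in the table, so the API call can be skipped."""
    row = conn.execute(
        "SELECT 1 FROM reddit_comments WHERE id = ?", (post_id,)
    ).fetchone()
    return row is not None

def store(conn, record):
    """Insert one record; an id that is already present is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO reddit_comments VALUES (?, ?, ?, ?, ?)",
        (record["id"], record["created_utc"], record["body"],
         record["upvotes"], record["url"]),
    )

store(conn, {"id": "c1", "created_utc": 1456790400, "body": "example body",
             "upvotes": 12, "url": "https://example.com/post"})
```

With MySQL the same idea would use `INSERT IGNORE` and whatever placeholder style the chosen driver expects.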

#### Subreddit info
1. /r/MachineLearning
    - 30,260 comments
    - 2,830 links
    
2. /r/datascience
    - 17,362 comments
    - 2,193 links

3. /r/compsci
    - 23,847 comments
    - 3,464 links
4. /r/statistics
    - 24,232 comments
    - 1,609 links
    
5. /r/physics
    - 35,744 comments
    - 3,131 links

6. /r/Cooking
    - 52,923 comments
    - 2,863 links
    
7. /r/food
    - 29,048 comments
    - 930 links    

### ii. StackExchange questions and answers
- Use the StackExchange API to download an entire community's questions and answers at a time.
- Save raw data to two different plain text files: '[community]_questions' and '[community]_answers'.
- Add data with relevant features to a MySQL database.
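
The download loop could be structured roughly as below. This is a sketch under assumptions: the query parameters (`site`, `page`, `pagesize`, `order`, `sort`, `filter`) and the `items` / `has_more` response fields follow my reading of the public StackExchange API and should be checked against its documentation. `build_url`, `download_all`, and the caller-supplied `fetch_json` are hypothetical names, and no network call is made here.

```python
from urllib.parse import urlencode

API_ROOT = "https://api.stackexchange.com/2.3"

def build_url(endpoint, site, page, pagesize=100):
    """Build a request URL for one page of questions or answers."""
    params = {
        "site": site,
        "page": page,
        "pagesize": pagesize,
        "order": "asc",
        "sort": "creation",
        "filter": "withbody",  # assumed built-in filter that includes the post body
    }
    return f"{API_ROOT}/{endpoint}?{urlencode(params)}"

def download_all(endpoint, site, fetch_json):
    """Collect every item by following pages until has_more is false.

    fetch_json is a caller-supplied (hypothetical) function that takes a URL
    and returns the decoded JSON response as a dict.
    """
    items, page = [], 1
    while True:
        response = fetch_json(build_url(endpoint, site, page))
        items.extend(response.get("items", []))
        if not response.get("has_more"):
            return items
        page += 1
```

Passing the fetcher in keeps the paging logic testable offline; a real fetcher would also need to respect the API's rate limits.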

#### Features:
- Questions:
    - id (primary key)
    - date created utc
    - body
    - upvotes
    - url
    
- Answers:
    - id (primary key)
    - date created utc
    - body
    - upvotes
    - url
    - parent
    
#### Community info:
- datascience
    - 6,373 questions
    - 13,3236 answers
- stackoverflow
    - 212,796 questions
    - 57,696 answers

## Match search topic to community
- A community is represented by a word frequency vector, which is a dict in Python
    - e.g. /r/MachineLearning might look like {'machine': 103230, 'learning': 10283, 'regression': 10280, ...}
- The search term is also represented by such a dict
    - e.g. for the search 'neural network', the dict is {'machine': 0, ..., 'neural': 1, ..., 'network': 1, ...}
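
Building these dicts could look like the sketch below. The tokenizer (lowercase, alphabetic tokens only) and the helper names `word_counts`, `community_vector`, and `query_vector` are assumptions for illustration; the notes above do not specify how text is split into words.

```python
import re
from collections import Counter

def word_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def community_vector(comments):
    """Sum the word counts of every comment in a community."""
    total = Counter()
    for comment in comments:
        total.update(word_counts(comment))
    return total

def query_vector(query, vocabulary):
    """Express a search term over the community's vocabulary; absent words get 0."""
    counts = word_counts(query)
    return {word: counts.get(word, 0) for word in vocabulary}

comments = ["Machine learning is fun", "A neural network is a machine learning model"]
community = community_vector(comments)
query = query_vector("neural network", community)
```

In this sketch, query words that never appear in the community are dropped, which is one possible choice rather than a requirement.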

## Finding links:
- Fitness score = (community relevance) $\times$ (post relevance) $\times$ (normalized upvotes)
- Community relevance: (normalized community topic vector) $\cdot$ (normalized search topic vector)
- Post relevance: (normalized post contents topic vector) $\cdot$ (normalized search topic vector)
        - Maybe add 
- Normalized upvotes: post upvotes / total submission upvotes
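
The fitness score above can be sketched as follows. It assumes "normalized" means L2-normalized, so each dot product is a cosine similarity; the notes do not say which norm is intended. The function names and the toy word counts are made up for the example.

```python
import math

def cosine_similarity(a, b):
    """Dot product of two word-count dicts after L2-normalizing each."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as unrelated to everything
    return dot / (norm_a * norm_b)

def fitness(community_vec, post_vec, query_vec, post_upvotes, submission_upvotes):
    """(community relevance) x (post relevance) x (normalized upvotes)."""
    community_relevance = cosine_similarity(community_vec, query_vec)
    post_relevance = cosine_similarity(post_vec, query_vec)
    normalized_upvotes = post_upvotes / submission_upvotes if submission_upvotes else 0.0
    return community_relevance * post_relevance * normalized_upvotes

# Toy word counts, made up for illustration; not taken from the corpus.
query = {"neural": 1, "network": 1}
community = {"machine": 3, "learning": 3, "neural": 1, "network": 1}
post = {"neural": 2, "network": 2}
score = fitness(community, post, query, post_upvotes=5, submission_upvotes=20)
```

Because the three factors are multiplied, a post with zero upvotes or no word overlap with the query scores zero regardless of the other terms.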

### Plot ideas
1. Most common words by subreddit
    - Show e.g. /r/MachineLearning and /r/datascience on the same plot, with the words themselves as x-axis labels, one subreddit on the upper axis and the other on the lower axis
2. Random relevant search terms for given topics, and the cosine similarity score for a given subreddit
    - Make this a histogram, where each term gets one bar per subreddit
    - For instance, could use /r/datascience, /r/MachineLearning, /r/statistics, etc.
3. Matrix of cosine similarity (dot product of normalized word vectors) for different subreddits
    - Need a spectrum of highly related, fairly related, and not related subreddits
    - e.g., /r/MachineLearning, /r/datascience, /r/statistics, /r/compsci, /r/Cooking, /r/food
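
The data behind plot idea 3 could be computed as in this sketch, which builds the pairwise matrix only and leaves plotting aside. The word counts are toy values invented for the example, and `similarity_matrix` is a hypothetical helper.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count dicts (0.0 if either is empty)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norms = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(sum(c * c for c in b.values()))
    return dot / norms if norms else 0.0

def similarity_matrix(vectors):
    """Return the subreddit names and a square matrix of pairwise similarities."""
    names = list(vectors)
    matrix = [[cosine_similarity(vectors[r], vectors[c]) for c in names] for r in names]
    return names, matrix

# Toy word counts, made up for illustration; not taken from the corpus.
vectors = {
    "MachineLearning": {"model": 5, "data": 3, "neural": 4},
    "datascience": {"model": 3, "data": 6, "pandas": 2},
    "Cooking": {"recipe": 7, "oven": 3, "data": 1},
}
names, matrix = similarity_matrix(vectors)
for name, row in zip(names, matrix):
    print(f"{name:16s}", "  ".join(f"{value:.2f}" for value in row))
```

With real subreddit vectors, very common words would likely dominate the raw counts, so some weighting such as tf-idf may be needed before the matrix separates related from unrelated communities.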