<a href="https://colab.research.google.com/github/marinathomas/SentimentAnalysisHN/blob/master/HN_SentimentAnalysis_v2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. Load the credentials needed to access BigQuery.
2. Read the story ids for 2017 from the 'full' table.
3. For each story, get the associated 'main' (top-level) comments. Replies to comments are not considered for now.
4. Analyze the comments and give each story a score based on the sentiment of its comments.

Step 1 - Load credentials

In [18]:
from google.cloud import bigquery
import pandas as pd
import os

from google.colab import drive
drive.mount('/content/gdrive')

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


In [0]:
credential_path = "/content/gdrive/Shared drives/HackerNews:SentimentAnalysis/BigData-HackerNews-77d9fa1b02c1.json"
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = credential_path

Step 2 - Install the sentiment analysis library
https://github.com/cjhutto/vaderSentiment

In [9]:
!pip install vaderSentiment

Collecting vaderSentiment
  Downloading https://files.pythonhosted.org/packages/86/9e/c53e1fc61aac5ee490a6ac5e21b1ac04e55a7c2aba647bb8411c9aadf24e/vaderSentiment-3.2.1-py2.py3-none-any.whl (125kB)

In [0]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

Step 3 - Load BigQuery client and HackerNews dataset
1. Load the BigQuery client
2. Get a reference to the HackerNews dataset
3. Load the dataset

In [0]:
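# The client picks up the key file set in GOOGLE_APPLICATION_CREDENTIALS
client = bigquery.Client()
# DatasetReference(project, dataset_id); client.dataset() is deprecated in newer google-cloud-bigquery releases
hn_dataset_ref = bigquery.DatasetReference('bigquery-public-data', 'hacker_news')
hn_dset = client.get_dataset(hn_dataset_ref)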

Step 4 - Find the 3 highest-scoring "Show HN" stories of 2017

In [0]:
def get_stories():
    # Top 3 "Show HN" items of 2017 by score, excluding deleted ones
    query = """
    SELECT *
    FROM `bigquery-public-data.hacker_news.full`
    WHERE REGEXP_CONTAINS(title, r"(S|s)how HN")
      AND (deleted IS NULL OR deleted IS FALSE)
      AND EXTRACT(YEAR FROM timestamp) = 2017
    ORDER BY score DESC
    LIMIT 3
    """

    query_job = client.query(query)
    rows = list(query_job.result(timeout=30))

    # Transform the rows into a pandas DataFrame (empty if nothing matched)
    stories = pd.DataFrame([dict(row.items()) for row in rows])

    return stories
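To see which titles the filter in the query lets through, the same pattern can be tried locally. This is only a rough sketch: it uses Python's `re` module rather than BigQuery's RE2 engine, on the assumption that the two agree for a pattern this simple. The helper name `is_show_hn` and all sample titles except the first are made up for illustration.

```python
import re

# Same pattern as in the SQL query. It is not anchored, so it matches
# "Show HN" or "show HN" anywhere in the title, not only at the start.
SHOW_HN = re.compile(r"(S|s)how HN")


def is_show_hn(title):
    """Return True if the title contains 'Show HN' or 'show HN'."""
    return bool(title) and SHOW_HN.search(title) is not None


sample_titles = [
    "Show HN: Sorting Two Metric Tons of Lego",  # title from the query results
    "show HN: a lowercase variant",              # hypothetical
    "SHOW HN: an all-caps variant",              # hypothetical, not matched
    "Ask HN: something else entirely",           # hypothetical, not matched
    None,                                        # rows without a title
]

for title in sample_titles:
    print(is_show_hn(title), title)
```

Only the first letter is case-insensitive, so an all-caps "SHOW HN" title would be skipped by the query as written.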

Let's check the data

In [23]:
stories = get_stories()
for index, row in stories.iterrows():
    title, descendants, story_id = row['title'], row['descendants'], row['id']
    print("----------------------------------------")
    print('Title: {} \t  Descendants: {} \t  ID: {}'.format(title, descendants, story_id))

----------------------------------------
Title: Show HN: Airmash – Multiplayer Missile Warfare HTML5 Game 	  Descendants: 304 	  ID: 15892066
----------------------------------------
Title: Show HN: Sorting Two Metric Tons of Lego 	  Descendants: 211 	  ID: 14226889
----------------------------------------
Title: Show HN: Privacy-focused, ad-free, non-tracking torrent search engine 	  Descendants: 346 	  ID: 13423629


Let's bring up the comments for the above stories

Step 5 - For each story, fetch the associated top-level comments

In [0]:
def get_comments(parent_id):
    # Direct (top-level) comments on the given item, excluding deleted ones
    query = """
    SELECT *
    FROM `bigquery-public-data.hacker_news.full`
    WHERE type = 'comment'
      AND (deleted IS NULL OR deleted IS FALSE)
      AND parent = @parent
    """

    # Pass the id as a query parameter instead of formatting it into the SQL.
    # int() makes sure a plain Python integer is sent, even if the id came out of a DataFrame.
    query_params = [
        bigquery.ScalarQueryParameter("parent", "INT64", int(parent_id))
    ]

    job_config = bigquery.QueryJobConfig()
    job_config.query_parameters = query_params
    query_job = client.query(query, location="US", job_config=job_config)

    rows = list(query_job.result(timeout=30))

    # Transform the rows into a pandas DataFrame (empty if there are no comments)
    comments = pd.DataFrame([dict(row.items()) for row in rows])

    return comments

Step 6 - Analyse comments

In [0]:
# Fetch the comments of the first story only, to try out the analysis
first_story_id = int(stories.iloc[0]['id'])
comments = get_comments(first_story_id)

In [0]:
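def analyse_comments(comments):
    # Score every comment with VADER; each score is a dict with the keys
    # 'neg', 'neu', 'pos' and 'compound'
    scores = []
    for index, row in comments.iterrows():
        sentence = row['text']
        # str() so that a missing (None) text is still passed on as a string
        score = analyser.polarity_scores(str(sentence))
        print("{}\t Comment: {} \t SCORE: {}".format(index, sentence, str(score)))
        print("==============================================================================")
        scores.append(score)
    return scores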


In [28]:
analyse_comments(comments)

0	 Comment: Very cool!  Can you tell us about the back end arch? 	 SCORE: {'neg': 0.0, 'neu': 0.776, 'pos': 0.224, 'compound': 0.4376}
1	 Comment: What a bloody delight! 	 SCORE: {'neg': 0.358, 'neu': 0.124, 'pos': 0.518, 'compound': 0.3164}
2	 Comment: Very nice game, but I couldn&#x27;t enjoy much because of my high ping. I&#x27;d like to see an South America and Africa servers. AFAIK AWS provides South America servers for Sao Paulo, Brazil.<p>Are there any plans on letting anyone host their own servers? 	 SCORE: {'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'compound': 0.8624}
3	 Comment: That was a lot of fun! One quick request - can the level-up popup not appear square in the middle of the screen? That can be a real problem if you happen to level while you&#x27;re in the middle of a fight.<p>One other quick request, now I think of it - can the key event handler pass F11 through? Being able to un-fullscreen would be handy.<p>Awesome game! I look forward to the writeup on the tech stack 
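The comment text in the output above still contains HTML entities (`&#x27;`) and tags (`<p>`). A minimal sketch of a cleaning step that could run before scoring is shown here. The helper name `clean_comment` is hypothetical and not part of the notebook, and whether cleaning changes the VADER scores noticeably has not been checked.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p> or <a href="...">
TAG_RE = re.compile(r"<[^>]+>")


def clean_comment(text):
    """Turn HN comment HTML into plain text."""
    if text is None:
        return ""
    # Replace tags first, so that escaped text such as "&lt;b&gt;" is not
    # mistaken for a real tag after decoding.
    without_tags = TAG_RE.sub(" ", text)
    decoded = html.unescape(without_tags)  # "&#x27;" becomes "'"
    return " ".join(decoded.split())       # collapse repeated whitespace


print(clean_comment("I couldn&#x27;t enjoy much.<p>Are there any plans?"))
```

With such a helper, `analyse_comments` could pass `clean_comment(row['text'])` to `polarity_scores` instead of `str(sentence)`.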

TODO: Analyze replies to comments (to be taken up later)
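One possible shape for that TODO is a breadth-first walk over the comment tree. The sketch below is an assumption about how it could be done, not the notebook's method: `collect_thread` and `fetch_children` are hypothetical names, and the toy tree uses made-up ids in place of BigQuery.

```python
from collections import deque


def collect_thread(story_id, fetch_children):
    """Walk a comment tree breadth-first.

    fetch_children(parent_id) must return a list of (comment_id, text)
    pairs for the direct replies to parent_id.
    Returns a list of (depth, comment_id, text); depth 1 is a top-level comment.
    """
    collected = []
    queue = deque([(story_id, 0)])
    while queue:
        parent_id, depth = queue.popleft()
        for comment_id, text in fetch_children(parent_id):
            collected.append((depth + 1, comment_id, text))
            queue.append((comment_id, depth + 1))
    return collected


# Toy in-memory tree with made-up ids, standing in for the BigQuery table
toy_tree = {
    1: [(2, "top-level A"), (3, "top-level B")],
    2: [(4, "reply to A")],
    4: [(5, "reply to the reply")],
}

thread = collect_thread(1, lambda parent_id: toy_tree.get(parent_id, []))
for depth, comment_id, text in thread:
    print(depth, comment_id, text)
```

In the notebook, `fetch_children` would have to wrap `get_comments`, which means one query per comment. That is likely to be slow and costly for large threads, so fetching all descendants in fewer queries may be preferable.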

In [0]:
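def score_story(scores):
  # Net sentiment: +1 for each positive comment, -1 for each negative one.
  # +/-0.05 on the compound score are the cut-offs suggested in the VADER README;
  # anything in between counts as neutral and leaves the score unchanged.
  story_score = 0
  for row in scores:
    compound_score = row['compound']
    if compound_score >= 0.05:
      story_score += 1
    elif compound_score <= -0.05:
      story_score -= 1
  return story_score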
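As a worked example of the thresholds, the sketch below applies the same cut-offs to the compound values of the first three comments in the output above and also keeps the per-class counts. The names `label` and `summarise` are hypothetical helpers, not part of the notebook.

```python
from collections import Counter

POSITIVE_THRESHOLD = 0.05    # same cut-offs as score_story
NEGATIVE_THRESHOLD = -0.05


def label(compound):
    """Map a VADER compound score to 'positive', 'negative' or 'neutral'."""
    if compound >= POSITIVE_THRESHOLD:
        return "positive"
    if compound <= NEGATIVE_THRESHOLD:
        return "negative"
    return "neutral"


def summarise(scores):
    """Count comments per class; 'net' equals what score_story returns."""
    counts = Counter(label(score["compound"]) for score in scores)
    return {
        "positive": counts["positive"],
        "neutral": counts["neutral"],
        "negative": counts["negative"],
        "net": counts["positive"] - counts["negative"],
    }


# Compound values copied from comments 0, 1 and 2 in the output above
sample_scores = [{"compound": 0.4376}, {"compound": 0.3164}, {"compound": 0.8624}]
print(summarise(sample_scores))
# {'positive': 3, 'neutral': 0, 'negative': 0, 'net': 3}
```

All three comments are above 0.05, so each adds 1 and the net score for this subset is 3.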


In [0]:
def analyse_hacker_news():
  stories = get_stories()
  scored_stories = []
  for index, row in stories.iterrows():
    story_id = row['id']
    comments = get_comments(story_id)
    scores = analyse_comments(comments)
    story_point = score_story(scores)
    print("Story {} with id {} scored {}".format(row['title'], story_id, story_point))
    scored_stories.append((row['title'], story_id, story_point))
    # Only the first story is scored for now; remove the break to score all of them
    break
  return scored_stories

In [39]:
analyse_hacker_news()

0	 Comment: Very cool!  Can you tell us about the back end arch? 	 SCORE: {'neg': 0.0, 'neu': 0.776, 'pos': 0.224, 'compound': 0.4376}
1	 Comment: What a bloody delight! 	 SCORE: {'neg': 0.358, 'neu': 0.124, 'pos': 0.518, 'compound': 0.3164}
2	 Comment: Very nice game, but I couldn&#x27;t enjoy much because of my high ping. I&#x27;d like to see an South America and Africa servers. AFAIK AWS provides South America servers for Sao Paulo, Brazil.<p>Are there any plans on letting anyone host their own servers? 	 SCORE: {'neg': 0.0, 'neu': 0.803, 'pos': 0.197, 'compound': 0.8624}
3	 Comment: That was a lot of fun! One quick request - can the level-up popup not appear square in the middle of the screen? That can be a real problem if you happen to level while you&#x27;re in the middle of a fight.<p>One other quick request, now I think of it - can the key event handler pass F11 through? Being able to un-fullscreen would be handy.<p>Awesome game! I look forward to the writeup on the tech stack 