<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# <h1 align="center" id="heading">Sentiment Analysis of Reddit Data using Reddit API</h1>

In this live coding session, we leverage the Python Reddit API Wrapper (`PRAW`) to retrieve data from subreddits on [Reddit](https://www.reddit.com), and perform sentiment analysis using [`pipelines`](https://huggingface.co/docs/transformers/main_classes/pipelines) from [HuggingFace ( 🤗 the GitHub of Machine Learning )](https://techcrunch.com/2022/05/09/hugging-face-reaches-2-billion-valuation-to-build-the-github-of-machine-learning/), powered by [transformer](https://arxiv.org/pdf/1706.03762.pdf).

## Objectives

By the end of the session, you will:

- know how to work with APIs
- feel more comfortable navigating through documentation, and even inspecting source code
- understand what a `pipeline` object is in HuggingFace
- perform sentiment analysis using `pipeline`
- run a Python script from the command line and get the results

## How to Submit

- At the end of each task, commit* the work to the repository you created before the assignment
- After completing all tasks, push the notebook containing all code blocks and output cells to that repository
- Submit the link to the notebook in Canvas

\***NEVER** commit a notebook displaying errors unless instructed otherwise. However, commit often; recall git ABC = **A**lways **B**e **C**ommitting.

## Tasks

### Task I: Instantiate a Reddit API Object

The first task is to instantiate a Reddit API object using [PRAW](https://praw.readthedocs.io/en/stable/), through which you will retrieve data. PRAW is a wrapper for the [Reddit API](https://www.reddit.com/dev/api) that makes interacting with it easier, unless you are already an expert in [`requests`](https://docs.python-requests.org/en/latest/).

#### 1. Install packages

Please ensure you have run all the cells in `imports.ipynb`, located [here](https://github.com/FourthBrain/MLE-8/blob/main/assignments/week-3-analyze-sentiment-subreddit/imports.ipynb), so that you have all the required packages for today's assignment.

####  2. Create a new app on Reddit 

Create a new app on Reddit and save the secret tokens; refer to [this post on Medium](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c) for more details.

- Create a Reddit account if you don't have one, then log in.
- To access the API, we need to create an app. The website has changed slightly since that post was written: navigate to `preference` > `app`, or click [this link](https://www.reddit.com/prefs/apps) and scroll all the way down.
- Click to create a new app, fill in the **name**, choose `script`, and fill in the **description** and **redirect uri**. The redirect URI is where the user is sent after they've granted OAuth access to your application (more info [here](https://github.com/reddit-archive/reddit/wiki/OAuth2)). For our purposes, you can enter any URL, e.g., www.google.com, as shown below.


    <img src="https://miro.medium.com/max/700/1*lRBvxpIe8J2nZYJ6ucMgHA.png" width="500"/>
- Jot down `client_id` (upper left corner) and `client_secret`.

    NOTE: CLIENT_ID refers to "personal use script" and CLIENT_SECRET to "secret".
    
    <div>
    <img src="https://miro.medium.com/max/700/1*7cGAKth1PMrEf2sHcQWPoA.png" width="300"/>
    </div>

- Create `secrets_reddit.py` in the same directory as this notebook and fill in the `client_id` and `secret_id` obtained in the last step. We will import these constants in the next step.
    ```
    REDDIT_API_CLIENT_ID = "client_id"
    REDDIT_API_CLIENT_SECRET = "secret_id"
    REDDIT_API_USER_AGENT = "any string except bot; ex. My User Agent"
    ```
- Add `secrets_reddit.py` to your `.gitignore` file if not already done. NEVER push credentials to a repo, private or public. 
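
As an alternative to a secrets file, the same three values can be read from environment variables so that they never live in the repository at all. The sketch below is one possible way to do that; the variable names mirror the constants above, and `load_reddit_credentials` is a hypothetical helper, not part of PRAW.

```python
import os
from typing import Dict, Mapping

# Names mirror the constants defined in secrets_reddit.py.
REQUIRED_KEYS = (
    "REDDIT_API_CLIENT_ID",
    "REDDIT_API_CLIENT_SECRET",
    "REDDIT_API_USER_AGENT",
)


def load_reddit_credentials(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Hypothetical helper: collect the three Reddit settings from `env`.

    Raises KeyError naming every missing variable, so a misconfigured
    shell fails early instead of causing an authentication error later.
    """
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {key: env[key] for key in REQUIRED_KEYS}


# Example with an in-memory mapping instead of the real environment.
example_env = {
    "REDDIT_API_CLIENT_ID": "client_id",
    "REDDIT_API_CLIENT_SECRET": "secret_id",
    "REDDIT_API_USER_AGENT": "My User Agent",
}
credentials = load_reddit_credentials(example_env)
```

The returned values could then be passed to `praw.Reddit` in place of the `secrets_reddit` constants.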

#### 3. Instantiate a `Reddit` object

Now you are ready to create a read-only `Reddit` instance. Refer to [documentation](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) when necessary.

In [150]:
import praw
import secrets_reddit

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets_reddit.REDDIT_API_CLIENT_ID,
    client_secret=secrets_reddit.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets_reddit.REDDIT_API_USER_AGENT,
)

In [151]:
print(reddit) 

<praw.reddit.Reddit object at 0x7f919b2307f0>


<details>
<summary>Expected output:</summary>   

```<praw.reddit.Reddit object at 0x10f8a0ac0>```
</details>

#### 4. Instantiate a `subreddit` object

Lastly, create a `subreddit` object for your favorite subreddit and inspect the object. The expected outputs shown below are from `r/machinelearning` unless otherwise specified.

In [152]:
subreddit = reddit.subreddit("machinelearning")

What is the display name of the subreddit?

In [153]:
subreddit.display_name

'machinelearning'

<details>
<summary>Expected output:</summary>   

    machinelearning
</details>

How about its title, is it different from the display name?

In [154]:
subreddit.title

'Machine Learning'

<details>
<summary>Expected output:</summary>   

    Machine Learning
</details>

Print out the description of the subreddit:

In [155]:
print(subreddit.description)

**[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
--------
+[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
--------
+[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
--------
+[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
--------
+[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ANews)
--------
***[@slashML on Twitter](https://twitter.com/slashML)***
--------
***[Chat with us on Slack](https://join.slack.com/t/rml-talk/shared_invite/enQtNjkyMzI3NjA2NTY2LWY0ZmRjZjNhYjI5NzYwM2Y0YzZhZWNiODQ3ZGFjYmI2NTU3YjE1ZDU5MzM2ZTQ4ZGJmOTFmNWVkMzFiMzVhYjg)***
--------
**Beginners:**
--------
Please have a look at [our FAQ and Link-Collection](http://www.reddit.com/r/MachineLearning/wiki/index)

[Metacademy](http://www.metacademy.org) is a great resource which compiles le

<details>
<summary>Expected output:</summary>

    **[Rules For Posts](https://www.reddit.com/r/MachineLearning/about/rules/)**
    --------
    +[Research](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AResearch)
    --------
    +[Discussion](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3ADiscussion)
    --------
    +[Project](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict_sr=on&q=flair%3AProject)
    --------
    +[News](https://www.reddit.com/r/MachineLearning/search?sort=new&restrict
</details>

### Task II: Parse comments

#### 1. Top Posts of All Time

Find the titles of the top 10 posts of **all time** from your favorite subreddit. Refer to the [Obtain Submission Instances from a Subreddit section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) if necessary. Verify that the titles match what you read on Reddit.

In [156]:
# Try running this line. What do you see? Press q once you are done.
?subreddit.top

Signature:
subreddit.top(
    *,
    time_filter: str = 'all',
    **generator_kwargs: Union[str, int, Dict[str, str]],
) -> Iterator[Any]
Docstring:
Return a :class:`.ListingGenerator` for top items.

:param time_filter: Can be one of: ``"all"``, ``"day"``, ``"hour"``,
    ``"month"``, ``"week"``, or ``"year"`` (default: ``"all"``).

:raises: :py:class:`ValueError` if ``time_filter`` is invalid.

Additional keyword arguments are passed in the initialization of
:class:`.ListingGenerator`.

This method can be used

In [157]:
for submission in subreddit.top(limit=10, time_filter="all"):
    print(submission.title)

[Project] From books to presentations in 10s with AR + ML
[D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
[R] First Order Motion Model applied to animate paintings
[N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
[D] This AI reveals how much time politicians stare at their phone at work
[D] Types of Machine Learning Papers
[D] The machine learning community has a toxicity problem
I made a robot that punishes me if it detects that if I am procrastinating on my assignments [P]
[Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
[P] Using oil portraits and First Order Model to bring the paintings back to life


<details> <summary>Expected output:</summary>

    [Project] From books to presentations in 10s with AR + ML
    [D] A Demo from 1993 of 32-year-old Yann LeCun showing off the World's first Convolutional Network for Text Recognition
    [R] First Order Motion Model applied to animate paintings
    [N] AI can turn old photos into moving Images / Link is given in the comments - You can also turn your old photo like this
    [D] This AI reveals how much time politicians stare at their phone at work
    [D] Types of Machine Learning Papers
    [D] The machine learning community has a toxicity problem
    [Project] NEW PYTHON PACKAGE: Sync GAN Art to Music with "Lucid Sonic Dreams"! (Link in Comments)
    [P] Using oil portraits and First Order Model to bring the paintings back to life
    [D] Convolution Neural Network Visualization - Made with Unity 3D and lots of Code / source - stefsietz (IG)    
</details>

The expected output looks dated: the last few titles in the all-time top 10 seem to have changed since that output was generated.

#### 2. Top 10 Posts of This Week

What are the titles of the top 10 posts of **this week** from your favorite subreddit?

In [158]:
for submission in subreddit.top(limit=10, time_filter="week"):
    print(submission.title)

[P] Finetuned Diffusion: multiple fine-tuned Stable Diffusion models, trained on different styles
[P] Transcribe any podcast episode in just 1 minute with optimized OpenAI/whisper
[D] DALL·E to be made available as API, OpenAI to give users full ownership rights to generated images
[P] Made a text generation model to extend stable diffusion prompts with suitable style cues
[R] APPLE research: GAUDI — a neural architect for immersive 3D scene generation
[P] Learn diffusion models with Hugging Face course 🧨
[R] Reincarnating Reinforcement Learning (NeurIPS 2022) - Google Brain
[N] Adversarial Policies Beat Professional-Level Go AIs
[P] Fine Tuning Stable Diffusion: Naruto Character Edition
[N] Meta AI | Evolutionary-scale prediction of atomic level protein structure with a language model


<details><summary>Expected output:</summary>

    [N] Ian Goodfellow, Apple’s director of machine learning, is leaving the company due to its return to work policy. In a note to staff, he said “I believe strongly that more flexibility would have been the best policy for my team.” He was likely the company’s most cited ML expert.
    [R][P] Thin-Plate Spline Motion Model for Image Animation + Gradio Web Demo
    [P] I’ve been trying to understand the limits of some of the available machine learning models out there. Built an app that lets you try a mix of CLIP from Open AI + Apple’s version of MobileNet, and more directly on your phone's camera roll.
    [R] Meta is releasing a 175B parameter language model
    [N] Hugging Face raised $100M at $2B to double down on community, open-source & ethics
    [P] T-SNE to view and order your Spotify tracks
    [D] : HELP Finding a Book - A book written for Google Engineers about foundational Math to support ML
    [R] Scaled up CLIP-like model (~2B) shows 86% Zero-shot on Imagenet
    [D] Do you use NLTK or Spacy for text preprocessing?
    [D] Democratizing Diffusion Models - LDMs: High-Resolution Image Synthesis with Latent Diffusion Models, a 5-minute paper summary by Casual GAN Papers
</details>

As expected, this week's top 10 posts are very different from the expected output, which was probably generated for an earlier cohort.

💽❓ Data Question:

Check out what other attributes the `praw.models.Submission` class has in the [docs](https://praw.readthedocs.io/en/stable/code_overview/models/submission.html). 

1. After having a chance to look through the docs, is there any other information that you might want to extract? How might this additional data help you?

Write a sample piece of code below extracting three additional pieces of information from the submission below.

In [159]:
# For each submission, extract the author, score, number of comments, and submission date
from datetime import datetime, timezone
import pandas as pd

submission_dict = {"title": [], "author": [], "score": [], "num_comments": [], "date": []}
for submission in subreddit.top(limit=10, time_filter="week"):
    submission_dict["title"].append(submission.title)
    submission_dict["author"].append(submission.author)
    submission_dict["score"].append(submission.score)
    submission_dict["num_comments"].append(submission.num_comments)
    # created_utc is a Unix timestamp; without a tz argument the result is in local time
    # (pass tz=timezone.utc to get UTC instead)
    submission_dict["date"].append(datetime.fromtimestamp(submission.created_utc))


        

In [160]:
submissions_df = pd.DataFrame(submission_dict)
submissions_df

| | title | author | score | num_comments | date |
|---|---|---|---|---|---|
| 0 | [P] Finetuned Diffusion: multiple fine-tuned S... | Illustrious_Row_9971 | 1105 | 60 | 2022-11-05 03:17:11 |
| 1 | [P] Transcribe any podcast episode in just 1 m... | thundergolfer | 430 | 23 | 2022-11-06 12:58:59 |
| 2 | [D] DALL·E to be made available as API, OpenAI... | TiredOldCrow | 411 | 59 | 2022-11-03 18:12:45 |
| 3 | [P] Made a text generation model to extend sta... | Neat-Delivery4741 | 395 | 55 | 2022-11-03 04:51:38 |
| 4 | [R] APPLE research: GAUDI — a neural architect... | SpatialComputing | 382 | 7 | 2022-11-05 13:12:14 |
| 5 | [P] Learn diffusion models with Hugging Face c... | lewtun | 313 | 14 | 2022-11-04 08:28:41 |
| 6 | [R] Reincarnating Reinforcement Learning (Neur... | smallest_meta_review | 247 | 31 | 2022-11-05 23:06:06 |
| 7 | [N] Adversarial Policies Beat Professional-Lev... | xutw21 | 169 | 49 | 2022-11-01 20:42:05 |
| 8 | [P] Fine Tuning Stable Diffusion: Naruto Chara... | mippie_moe | 156 | 8 | 2022-11-03 10:52:09 |
| 9 | [N] Meta AI \| Evolutionary-scale prediction of... | xutw21 | 110 | 20 | 2022-11-01 11:46:18 |
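
`created_utc` is a Unix timestamp, while `datetime.fromtimestamp` without a `tz` argument converts it to the machine's local timezone, so the dates above depend on where the notebook ran. A minimal sketch of a timezone-aware conversion follows; `to_utc_datetime` is a hypothetical helper name and the sample timestamp is made up for illustration.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc: float) -> datetime:
    """Convert a Unix timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Illustrative timestamp (not taken from the output above).
sample = to_utc_datetime(1_667_600_000)
formatted = sample.strftime("%Y-%m-%d %H:%M:%S %Z")
```

Storing timezone-aware values keeps the `date` column comparable across machines.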


💽❓ Data Question:

2. Is there any information available that might be a concern when it comes to Ethical Data?

In [166]:
for submission in subreddit.top(time_filter="year"):
    # over_18 is True for submissions marked as NSFW
    if submission.over_18:
        print(submission.title)

[P] DeepCreamPy - Decensoring Hentai with Deep Neural Networks


Retrieving submissions marked as over 18 is straightforward with the API. A Google SafeSearch query would probably not have listed this title. It is not clear to me how the API prevents such content from reaching underage users by other means, and I would be concerned about this.
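
One way to act on this concern is to drop submissions flagged `over_18` before any text reaches the dataset. The sketch below assumes only that each item exposes an `over_18` attribute, as the loop above relies on; `filter_sfw` is a hypothetical helper, and the stand-in objects replace real API results.

```python
from types import SimpleNamespace
from typing import Iterable, List


def filter_sfw(submissions: Iterable) -> List:
    """Keep only submissions that are not flagged as over 18.

    Items without the attribute are treated as safe; that default is an
    assumption and could be flipped for a stricter policy.
    """
    return [s for s in submissions if not getattr(s, "over_18", False)]


# Stand-in objects with made-up titles, used instead of real API results.
fake_submissions = [
    SimpleNamespace(title="Post A", over_18=False),
    SimpleNamespace(title="Post B", over_18=True),
    SimpleNamespace(title="Post C", over_18=False),
]
safe_titles = [s.title for s in filter_sfw(fake_submissions)]
```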

#### 3. Comment Code

Add comments to the code block below to describe what each line of the code does (Refer to [Obtain Comment Instances Section](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html) when necessary). The code is adapted from [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html)

The purpose is to:
1. understand what the code is doing
2. start commenting your code whenever it is not self-explanatory, if you have not already (others will thank you, and YOU will thank yourself later 😊)

In [167]:
%%time
from praw.models import MoreComments

# Initialise an empty list for top comments 
top_comments = []

# Iterate through the list of top 10 submissions
for submission in subreddit.top(limit=10):
    # Iterate through the comments for the current submission
    for top_level_comment in submission.comments:
        # Comments contain many "Load More Comments..." links, ignore them by skipping to end of loop
        if isinstance(top_level_comment, MoreComments):
            continue
        # Add comments to list for top comments
        top_comments.append(top_level_comment.body)

CPU times: user 405 ms, sys: 238 ms, total: 644 ms
Wall time: 45.7 s


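Check the total number of comments for the top 10 submissions. `num_comments` reports every comment on a submission, including nested replies and those hidden behind `MoreComments`, so count the top-level comments explicitly after removing `MoreComments` with `replace_more()`.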

In [168]:
total_comments = 0
total_comments_with_more = 0

for submission in subreddit.top(limit=10):
    # num_comments is the count Reddit reports for the whole submission
    total_comments_with_more += submission.num_comments
    # Drop all MoreComments placeholders without fetching what they hide
    submission.comments.replace_more(limit=0)
    # Count only the top-level comments that remain
    for top_level_comment in submission.comments:
        total_comments += 1

print(f"Double check the total number of comments extracted with given code and with replace_more() matches: {len(top_comments)}, {total_comments}, {total_comments_with_more}")

Double check the total number of comments extracted with given code and with replace_more() matches: 746, 746, 2114


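The two top-level counts match (746), while the `num_comments` total (2114) is much larger because it also counts nested replies and comments hidden behind `MoreComments`.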
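
The gap between the counts can be reproduced on a toy comment tree. The sketch below uses a hypothetical `Node` class rather than PRAW objects: iterating the top level counts only direct replies to the post, while a recursive walk counts every reply at every depth.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a comment with nested replies."""
    body: str
    replies: List["Node"] = field(default_factory=list)


def count_all(comments: List[Node]) -> int:
    """Count comments at every depth of the tree."""
    return sum(1 + count_all(c.replies) for c in comments)


# Two top-level comments; the first has a reply, which itself has a reply.
thread = [
    Node("first", [Node("reply", [Node("reply to reply")])]),
    Node("second"),
]
top_level_count = len(thread)
total_count = count_all(thread)
```

In PRAW, a flattened view of the whole forest is available through `submission.comments.list()` once `replace_more()` has been called.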

#### 4. Inspect Comments

How many comments did you extract from the last step? Examine a few comments. 

In [169]:
len(top_comments)  # the answer may vary 693 for r/machinelearning - My answer is 746

746

In [170]:
import random

[random.choice(top_comments) for i in range(10)]

['Which type do you like the most?',
 'Link?',
 'well this is next gen future. i will take look on my grandpapa lol.',
 'i hate the ones that begin with "towards..". why the tf would i want to read something that\'s incomplete?',
 "The first set of numbers was Yann LeCun's phone number at bell labs.",
 "And here I thought WinAmp visualizations really kicked the llama's ass... this explodes the llama into radioactive atoms",
 "> Thirdly, there is a worshiping problem.\n\ni agree about the godfathers portion.\n\nhowever the worship of publications from places like Google or DeepMind is unfortunately very well-founded.\n\nif you look at most university papers, they are training over 1/100th the amount of data industry papers use (for good reason).  as a practitioner it just isn't worth your time to look for other papers unless you're chasing the last few basis points.",
 "This is a good post and you're right, but there's one criticism I have:\n\n>**Sixthly**, moral and ethics are set *arb

<details> <summary>Some of the comments from `r/machinelearning` subreddit are:</summary>

    ['Awesome visualisation',
    'Similar to a stack or connected neurons.',
    'Will this Turing pass the Turing Test?']
</details>

💽❓ Data Question:

3. After having a chance to review a few samples of 5 comments from the subreddit, what can you say about the data? 

HINT: Think about the "cleanliness" of the data, the content of the data, think about what you're trying to do - how does this data line up with your goal?

In [171]:
import numpy as np

len_comments = np.asarray([len(i.split()) for i in top_comments])
print((len_comments > 500).sum())


6


In [173]:
for c in top_comments:
    if len(c.split()) > 500:
        print(c)

I totally agree with 99% of your stuff. All of them are great points.

Although I will contest one of these points:

> machine learning, and computer science in general, have a huge diversity problem

I will say, in my experience, I did not find it to be particularly exclusionary.                      
(I still agree on making the culture healthier and more welcoming for all people, but won't call it a huge diversity problem, that is any different from what plagues other fields)              
I also think it has very little to do with those in CS or intentional rejection of minorities/women by CS as a field.

Far fewer women and minorities enroll in  CS, so it is more of a highschool problem than anything. If anything, CS tries really really hard to hire and attract under represented groups into the fold. That it fails, does not necessarily mean it is exclusionary. Many other social factors tend to be at play behind cohort statistics. An ML person knows that better than anyone.        

I assume our goal is to analyse the sentiment of responses to a particular submission on Reddit. Here are some observations about the comments: 
* There are conversational abbreviations, so we need to ensure that the dataset our model is trained on includes abbreviations such as pls, sry, and imho. People also use slang and profanity when they are excited, e.g. 'Wtffff. Well that was incredible.' and 'gonna take shrooms then revisit this.' I wonder whether the sentiment analyser will capture the sentiment here! 
* Some comments are very long (as shown above) and contain both agreement and criticism, so it is difficult to classify them as positive or negative. I suspect they will go into a neutral category, but we would then lose all the different sentiments within that comment. There are 6 comments of more than 500 words! 
* There is a lot of markdown formatting in the text (\*, \\n, etc.). Some preprocessing will be needed to remove it. E.g. 'Make the robot \*later\*.','\\>Schmidhuber calls Hinton a thief,\n\nNo doubt Hinton is a thief, the whole Toronto communities are thieves and gangsta.Hinton community cross site every stupid articles they write.',  '[deleted]' 
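
The preprocessing mentioned in the last observation could look roughly like the sketch below. It is only a starting point under simple assumptions: `clean_comment` is a hypothetical helper, the regular expressions cover links, blockquote markers, backslash escapes, and emphasis markers, and real Reddit markdown has more cases than these.

```python
import re


def clean_comment(text: str) -> str:
    """Hypothetical cleaner: strip common Reddit markdown from a comment."""
    # Placeholder bodies carry no sentiment, so return an empty string.
    if text.strip() in {"[deleted]", "[removed]"}:
        return ""
    # [label](url) -> label
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    # Leading blockquote markers at the start of a line
    text = re.sub(r"^\s*>+\s?", "", text, flags=re.MULTILINE)
    # Backslash escapes such as \*
    text = text.replace("\\", "")
    # Emphasis, strikethrough, and inline-code markers
    text = re.sub(r"[*~`]+", "", text)
    # Collapse newlines and runs of spaces
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_comment("Make the robot \\*later\\*.")
```

Comments that come back empty can be dropped before they are passed to the model.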

#### 5. Extract Top Level Comment from Subreddit `TSLA`.

Write your code to extract top level comments from the top 10 topics of a time period, e.g., year, from subreddit `TSLA` and store them in a list `top_comments_tsla`.  

In [174]:
# YOUR CODE HERE
subreddit_tsla = reddit.subreddit("TSLA")
top_comments_tsla = []
for submission in subreddit_tsla.top(limit=10, time_filter="year"):
    submission.comments.replace_more(limit=0)
    for top_level_comment in submission.comments:
        top_comments_tsla.append(top_level_comment.body)
   

In [175]:
len(top_comments_tsla) # Expected: 174 for r/machinelearning

109

In [176]:
[random.choice(top_comments_tsla) for i in range(10)]

['I set my buy target…. At $420 per share.',
 'Nice I’m hoping to jump in after the split. Is there any recommendations how long after the split to jump in',
 'Since fractional shares can be bought, the price of of a stock is irrelevant. It will still earn or lose the same. It is purely psychological.',
 'When’s this stock split happening ?',
 'Nope I wussed out with the economy and Russia. And to top it off I bought a lot near the peak- $1052 and sold it off well under that. \n\nFortunately I lost $50 only. I bought/sold more as it made its descent into hell and back. \n\nUnfortunately I should have held when it was at below $800. \n\nI’ll buy it. But not at $1000.  I simply cannot imagine this going to $2000 with all that’s going outside of Tesla’s control. \n\nI’ll look at $900, maybe dip in at $800 and love it at $700.',
 'I gave up on Elon shit... but wish you luck',
 'TSLA to $3000',
 "Second time in as many weeks this moron has tweeted something to fuck the stock price. Im getti

<details>
<summary>Some of the comments from `r/TSLA` subreddit:</summary>

    ['I bought puts',
    '100%',
    'Yes. And I’m bag holding 1200 calls for Friday and am close to throwing myself out the window']
</details>

💽❓ Data Question:

4. Now that you've had a chance to review another subreddit's comments, do you see any differences in the kinds of comments either subreddit has - and how might this relate to bias?

Comments in r/TSLA are almost all about the price of Tesla shares, so the language is specific to stock trading. If the data on which the sentiment analyser was trained is representative of this kind of language, the pre-trained model will perform reasonably well. But this usage is niche, so I suspect the pre-trained model would need further training on online discussions about stock trading to classify sentiment well in this space. 

### Task III: Sentiment Analysis

Let us analyze the sentiment of comments scraped from `r/TSLA` using a pre-trained HuggingFace model to make the inference. Take a [Quick tour](https://huggingface.co/docs/transformers/quicktour). 

#### 1. Import `pipeline`

In [177]:
from transformers import pipeline

#### 2. Create a Pipeline to Perform Task "sentiment-analysis"

In [178]:
sentiment_model = pipeline("sentiment-analysis") # YOUR CODE HERE

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


#### 3. Get one comment from list `top_comments_tsla` from Task II - 5.

In [179]:
comment = random.choice(top_comments_tsla)

In [180]:
comment

'Sold my last position at $1k. Have been waiting to jump back in finally did it 8 shares at $640'

The example comment is: `'Bury Burry!!!!!'`. Print out what you get. For reproducibility, use the same comment in the next step; consider setting a seed.
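
Setting a seed can be sketched as follows. The example assumes a private `random.Random` generator so that the global random state used elsewhere in the notebook is untouched; `pick_comment` and the seed value 42 are hypothetical choices, and the sample comments are taken from the outputs above.

```python
import random
from typing import List


def pick_comment(comments: List[str], seed: int = 42) -> str:
    """Pick one comment reproducibly using a private, seeded generator."""
    rng = random.Random(seed)  # does not touch the global random state
    return rng.choice(comments)


sample_comments = ["I bought puts", "100%", "TSLA to $3000"]
first_pick = pick_comment(sample_comments)
second_pick = pick_comment(sample_comments)
```

Both calls return the same comment as long as the list and the seed are unchanged.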

#### 4. Make Inference!

In [181]:
example_comment = "Bury Burry!!!!!"
sentiment = sentiment_model([example_comment])  # YOUR CODE HERE
print(sentiment)

[{'label': 'NEGATIVE', 'score': 0.989326000213623}]


What is the type of the output `sentiment`? It is a `list` containing one `dict` per input text:

```
[{'label': 'NEGATIVE', 'score': 0.989326000213623}]
```

In [182]:
# Print the text that was actually scored, not the random `comment` chosen earlier
print(f'The comment: {example_comment}')
print(f'Predicted Label is {sentiment[0]["label"]} and the score is {sentiment[0]["score"]:.3f}')

The comment: Bury Burry!!!!!
Predicted Label is NEGATIVE and the score is 0.989


For the example comment, the output is:

    The comment: Bury Burry!!!!!
    Predicted Label is NEGATIVE and the score is 0.989

🖥️❓ Model Question:

1. What does the score represent?

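The score is the probability the model assigns to the predicted label (POSITIVE or NEGATIVE), a value between 0 and 1. It represents the model's confidence in the prediction, not the strength of the sentiment. 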
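
To make the score concrete, the sketch below shows how a probability for the winning label can be derived from raw model outputs. It assumes the classifier produces one logit per label and that a softmax turns them into probabilities; the logit values are made up, and `top_label` is a hypothetical helper rather than the pipeline's own code.

```python
import math
from typing import Dict, List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits: List[float], labels: List[str]) -> Dict[str, object]:
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Made-up logits for illustration, ordered as [NEGATIVE, POSITIVE].
result = top_label([2.5, -2.0], ["NEGATIVE", "POSITIVE"])
```

A score close to 0.5 therefore means the two logits were nearly equal, while a score near 1 means one label dominated.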

### Task IV: Put All Together

Let's pull all the pieces together and create a simple script that does the following:

- gets the subreddit
- gets comments from the top posts of the given subreddit
- runs sentiment analysis 

#### Complete the Script

Once you complete the code, running the following block writes the code into a new Python script and saves it as `top_tlsa_comment_sentiment.py` in the same directory as the notebook. 

In [183]:
%%writefile top_tlsa_comment_sentiment.py

import secrets_reddit as secrets
import random

from typing import Dict, List

from praw import Reddit
from praw.models.reddit.subreddit import Subreddit
from praw.models import MoreComments

from transformers import pipeline


def get_subreddit(display_name:str) -> Subreddit:
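    """Get a subreddit object from its display name.

    Args:
        display_name (str): Name of the subreddit without the "r/" prefix, e.g. "TSLA".

    Returns:
        Subreddit: PRAW subreddit object created from a read-only Reddit instance.
    """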
    reddit = Reddit(
        client_id=secrets.REDDIT_API_CLIENT_ID,        
        client_secret=secrets.REDDIT_API_CLIENT_SECRET,
        user_agent=secrets.REDDIT_API_USER_AGENT
        )
    
    subreddit = reddit.subreddit(display_name) # YOUR CODE HERE
    return subreddit

def get_comments(subreddit:Subreddit, limit:int=3) -> List[str]:
    """Get top-level comments from the top posts of a subreddit.

    Args:
        subreddit (Subreddit): Subreddit to read from.
        limit (int, optional): Number of top posts to read. Defaults to 3.

    Returns:
        List[str]: Bodies of the top-level comments.
    """
    top_comments = []
    for submission in subreddit.top(limit=limit):
        for top_level_comment in submission.comments:
            if isinstance(top_level_comment, MoreComments):
                continue
            top_comments.append(top_level_comment.body)
    return top_comments

def run_sentiment_analysis(comment:str) -> Dict:
    """Run sentiment analysis on a comment using the default DistilBERT model.

    Args:
        comment (str): Text of the comment to classify.

    Returns:
        Dict: Prediction with the keys "label" and "score".
    """
    sentiment_model = pipeline("sentiment-analysis") # YOUR CODE HERE
    sentiment = sentiment_model(comment)
    return sentiment[0]


if __name__ == '__main__':
    subreddit = get_subreddit("TSLA")
    comments = get_comments(subreddit)
    comment = random.choice(comments)# YOUR CODE HERE
    sentiment = run_sentiment_analysis(comment)
    
    print(f'The comment: {comment}')
    print(f'Predicted Label is {sentiment["label"]} and the score is {sentiment["score"]:.3f}')

Overwriting top_tlsa_comment_sentiment.py


Run the following block to see the output.

In [184]:
!python top_tlsa_comment_sentiment.py

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
The comment: Have you ever talked with Elon? Could say few words about him? Thanks
Predicted Label is POSITIVE and the score is 0.999


<details><summary> Expected output:</summary>

    No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)
    The comment: When is DOGE flying
    Predicted Label is POSITIVE and the score is 0.689
</details>
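
The script scores a single random comment. To summarise many comments, the per-comment results can be aggregated. The sketch below assumes each result is a dict with `label` and `score` keys, the shape printed in Task III; `summarize_sentiment` is a hypothetical helper and the sample results are illustrative values, not model output for real comments.

```python
from collections import Counter
from typing import Dict, List


def summarize_sentiment(results: List[Dict]) -> Dict[str, float]:
    """Return the share of each label among the results."""
    counts = Counter(r["label"] for r in results)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Illustrative results in the same shape as the pipeline output.
sample_results = [
    {"label": "NEGATIVE", "score": 0.989},
    {"label": "POSITIVE", "score": 0.999},
    {"label": "POSITIVE", "score": 0.689},
    {"label": "NEGATIVE", "score": 0.750},
]
shares = summarize_sentiment(sample_results)
```

Counting labels ignores the scores; a variant could skip low-confidence predictions before counting.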

💽❓ Data Question:

5. Is the subreddit active? About how many posts or threads per day? How could you find this information?

In [187]:
num_posts_ml = 0 
num_comments_ml = 0
for submission in subreddit.top(limit=100, time_filter="week"):
    #print(f"{submission.title}, {submission.author}, {submission.num_comments}")
    num_comments_ml += submission.num_comments
    num_posts_ml += 1


num_posts_tsla = 0 
num_comments_tsla = 0
for submission in subreddit_tsla.top(limit=100, time_filter="week"):
    #print(f"{submission.title}, {submission.author}, {submission.num_comments}")
    num_comments_tsla += submission.num_comments
    num_posts_tsla += 1

print(f"Posts: Machine Learning {num_posts_ml} vs TSLA {num_posts_tsla} last week")
print(f"Comments: Machine Learning {num_comments_ml} vs TSLA {num_comments_tsla} last week")


Posts: Machine Learning 89 vs TSLA 7 last week
Comments: Machine Learning 908 vs TSLA 48 last week


In [194]:
stats = subreddit.traffic()
stats

Redirect: Redirect to /r/MachineLearning/login/ (You may be trying to perform a non-read-only action via a read-only instance.)

In [196]:
stats = subreddit.public_traffic
stats

False

We can gauge how active a subreddit is from the number of posts and comments in the last week or the last day. From a quick count, r/MachineLearning is more active than r/TSLA: 89 posts (about 13 per day) and 908 comments last week, against 7 posts (1 per day) and 48 comments. Because the query uses `limit=100`, the count for a busier subreddit would be capped at 100 posts. 
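
The per-day rate follows directly from the weekly counts. A minimal sketch of the arithmetic, using the post counts reported above; `posts_per_day` is a hypothetical helper name.

```python
def posts_per_day(num_posts: int, window_days: int) -> float:
    """Average number of posts per day over a window of `window_days` days."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return num_posts / window_days


# Weekly post counts from the output above.
ml_rate = posts_per_day(89, 7)
tsla_rate = posts_per_day(7, 7)
```

The same function applies to the comment counts, and to a one-day window when `time_filter="day"` is used.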

💽❓ Data Question:

6. Does there seem to be a large distribution of posters or a smaller concentration of posters who are very active? What kind of impact might this have on the data?

In [226]:
posters = {}

for submission in subreddit.top(limit=100, time_filter="month"):
    # submission.author is None when the account has been deleted
    if submission.author is None:
        continue
    author_name = submission.author.name
    # Increment the post count for this author, starting from 0
    posters[author_name] = posters.get(author_name, 0) + 1

sorted_posters_by_posts = dict(sorted(posters.items(), key=lambda x:x[1], reverse=True))
print(sorted_posters_by_posts)
print(f"No. of posters in last month: {len(sorted_posters_by_posts)}")
print(f"No. of posts in last month: {np.sum(list(sorted_posters_by_posts.values()))}")
print(f"Mean no. of posts per author: {np.mean(list(sorted_posters_by_posts.values()))}")
print(f"Median no. of posts per author: {np.median(list(sorted_posters_by_posts.values()))}")



{'Illustrious_Row_9971': 7, 'Singularian2501': 6, 'xutw21': 3, 'SpatialComputing': 3, 'MysteryInc152': 2, 'hardmaru': 2, 'Effective_Tax_2096': 2, 'mippie_moe': 2, 'cloneofsimo': 2, 'lexfridman': 1, 'Mogady': 1, 'TiredOldCrow': 1, 'jsonathan': 1, 'Neat-Delivery4741': 1, 'highergraphic': 1, 'pommedeterresautee': 1, 'lewtun': 1, 'MLC_Money': 1, 'jkterry1': 1, 'dilmerv': 1, 'smallest_meta_review': 1, 'obsoletelearner': 1, 'Greedy_Childhood8732': 1, 'Lajamerr_Mittesdine': 1, 'That_Violinist_18': 1, '0xWTC': 1, 'thundergolfer': 1, 'ggerganov': 1, 'aviisu': 1, 'likeamanyfacedgod': 1, 'shahaff32': 1, 'st8ic': 1, 'SleekEagle': 1, '0x00groot': 1, 'fromnighttilldawn': 1, 'YaYaLeB': 1, 'CodingButStillAlive': 1, 'DisWastingMyTime': 1, 'MohamedRashad': 1, 'TallTahawus': 1, 'laprika0': 1, 'fedetask': 1, 'killver': 1, 'ChrisRackauckas': 1, 'Technical-Vast1314': 1, 'Batuhan_Y': 1, 'lifesthateasy': 1, 'vajraadhvan': 1, 'Wiskkey': 1, 'tuned-mec-is': 1, 'ZeronixSama': 1, 'aozorahime': 1, 'phraisely': 1, '

In [225]:
posters = {}

for submission in subreddit_tsla.top(limit=100, time_filter="month"):
    # submission.author is None when the account has been deleted
    if submission.author is None:
        continue
    author_name = submission.author.name
    # Increment the post count for this author, starting from 0
    posters[author_name] = posters.get(author_name, 0) + 1

sorted_posters_by_posts = dict(sorted(posters.items(), key=lambda x:x[1], reverse=True))
print(sorted_posters_by_posts)
print(f"No. of posters in last month: {len(sorted_posters_by_posts)}")
print(f"No. of posts in last month: {np.sum(list(sorted_posters_by_posts.values()))}")
print(f"Mean no. of posts per author: {np.mean(list(sorted_posters_by_posts.values()))}")
print(f"Median no. of posts per author: {np.median(list(sorted_posters_by_posts.values()))}")

{'wewewawa': 19, 'revanold': 8, 'droneauto': 2, 'TonyLiberty': 2, 'PrimaryMysterious': 1, 'LoganLee43': 1, 'ecoshares': 1, 'Local-Rip9621': 1, 'MoneyAdx': 1, 'Plus_Seesaw2023': 1, 'PolarBearPolo': 1, 'trowawayfarawaytoday': 1, 'Super_Stickman13': 1, 'EqualFlower': 1, 'JamesJimmyJim': 1, 'trillamanillla': 1, 'SouthSink1232': 1, 'Relative-Addendum534': 1, 'mbj7000': 1}
No. of posters in last month: 19
No. of posts in last month: 46
Mean no. of posts per author: 2.4210526315789473
Median no. of posts per author: 1.0


As the results above show, r/MachineLearning has a more even distribution of posters: the mean of 1.25 posts per author is close to the median of 1. In r/TSLA the median is also 1 post per author, but the mean is 2.42, which indicates a more skewed distribution. For instance, the top poster in r/MachineLearning wrote 7 of the 100 posts, whereas the top poster in r/TSLA (wewewawa) wrote 19 of the 46, about 41%. We should worry about the r/TSLA data being biased by this one author, who contributes a large share of the posts. 
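
The concentration of posters can be summarised in a single number. The sketch below computes the share of posts written by the top `k` authors, using the r/TSLA counts printed above; `top_k_share` is a hypothetical helper, and this is only one of several possible concentration measures.

```python
from typing import Dict


def top_k_share(posts_by_author: Dict[str, int], k: int = 1) -> float:
    """Fraction of all posts written by the k most active authors."""
    total = sum(posts_by_author.values())
    if total == 0:
        return 0.0
    top_counts = sorted(posts_by_author.values(), reverse=True)[:k]
    return sum(top_counts) / total


# Counts copied from the r/TSLA output above.
tsla_posters = {
    "wewewawa": 19, "revanold": 8, "droneauto": 2, "TonyLiberty": 2,
    "PrimaryMysterious": 1, "LoganLee43": 1, "ecoshares": 1,
    "Local-Rip9621": 1, "MoneyAdx": 1, "Plus_Seesaw2023": 1,
    "PolarBearPolo": 1, "trowawayfarawaytoday": 1, "Super_Stickman13": 1,
    "EqualFlower": 1, "JamesJimmyJim": 1, "trillamanillla": 1,
    "SouthSink1232": 1, "Relative-Addendum534": 1, "mbj7000": 1,
}
top1 = top_k_share(tsla_posters, k=1)
top2 = top_k_share(tsla_posters, k=2)
```

A value near `1 / len(posts_by_author)` for `k=1` would indicate an even spread, while a value approaching 1 means a single author dominates.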