# Reddit and HuggingFace Starter Kit

## ============= PART 1 =============
The first part of this exercise is to figure out how to instantiate a Reddit API object using the [PRAW library](https://praw.readthedocs.io/en/stable/). PRAW is a Python library that provides an easy-to-understand interface to the Reddit API.

Your first task is to look through the [documentation here](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) and figure out how to instantiate a Reddit instance. (Hint: you only need `client_id`, `client_secret`, and `user_agent`.)

Make sure everyone in the group does this part! Follow the guide below on how to get your `client_id` and `client_secret`.

### Steps for part 1:
1. Pull the `FourthBrain/ML03` repo locally so you can start development.
2. Open `reddit_and_huggingface.ipynb` and install the necessary packages for this lesson by running:

    ```
    cd code_student/Week_2
    conda activate {your_virtual_environment_name}
    pip install transformers praw torch torchvision torchaudio
    ```

3. Get your `client_id` and `client_secret` from Reddit:

* Make a Reddit account.
* Follow the steps in this screenshot, which are the first steps from this [guide](https://towardsdatascience.com/how-to-use-the-reddit-api-in-python-5e05ddfd1e5c).

![instructions to set up reddit api](../../images/reddit_get_access.JPG)

* Create a `secrets.py` file and include the following:

    ```
    REDDIT_API_CLIENT_ID = ""
    REDDIT_API_CLIENT_SECRET = ""
    REDDIT_API_USER_AGENT = ""  # can be any string, for example "marksbot"
    ```

* Put `secrets.py` in `Week_2` so you can easily import it

4. Read the [documentation here](https://praw.readthedocs.io/en/stable/code_overview/reddit_instance.html) and try to figure out how to instantiate a Reddit instance object.
5. Lastly, once you have your Reddit instance object, figure out how to make a `subreddit` object for your favorite subreddit. (Hint: look for `subreddit` in the documentation.)
6. If time permits, continue to browse the documentation to prepare for the next part of this exercise, which is to get submissions from the subreddit of your choosing.
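
Blank or missing credentials are an easy mistake to make at this step, so it can help to check them before creating the Reddit instance. The sketch below is one possible check, not part of PRAW: `find_missing_settings` and `REQUIRED_SETTINGS` are hypothetical names, and a `SimpleNamespace` stands in for the imported `secrets.py` module so the example runs on its own.

```python
from types import SimpleNamespace

# The three settings that the exercise stores in secrets.py
REQUIRED_SETTINGS = (
    "REDDIT_API_CLIENT_ID",
    "REDDIT_API_CLIENT_SECRET",
    "REDDIT_API_USER_AGENT",
)


def find_missing_settings(settings) -> list:
    """Return the names of required settings that are absent or blank.

    `settings` can be any object with attributes, such as an imported module.
    This helper is hypothetical and is not part of PRAW.
    """
    missing = []
    for name in REQUIRED_SETTINGS:
        value = getattr(settings, name, None)
        if not isinstance(value, str) or not value.strip():
            missing.append(name)
    return missing


# Stand-in for `import secrets`, with the client secret left blank on purpose
example = SimpleNamespace(
    REDDIT_API_CLIENT_ID="abc123",
    REDDIT_API_CLIENT_SECRET="",
    REDDIT_API_USER_AGENT="marksbot",
)
print(find_missing_settings(example))
```

In the notebook, the same function could be called with the imported module itself, for example `find_missing_settings(secrets)`.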

In [2]:
import praw
import secrets

# Create a Reddit object which allows us to interact with the Reddit API
reddit = praw.Reddit(
    client_id=secrets.REDDIT_API_CLIENT_ID,
    client_secret=secrets.REDDIT_API_CLIENT_SECRET,
    user_agent=secrets.REDDIT_API_USER_AGENT
)
subreddit = reddit.subreddit("wallstreetbets")

## ============= PART 2 =============

In this next part, you will figure out how to parse comments using your `subreddit` instance object.

### Todos for part 2:
1. How do you find the top 10 posts of all time from your favorite subreddit(s)? (Hint: look at ["Obtain Submission Instances from a Subreddit"](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html).)
2. How do you parse comments from a post? (Hint: look at ["Obtain Comment Instances"](https://praw.readthedocs.io/en/stable/getting_started/quick_start.html).)
3. And finally, how do you parse replies from a comment? (Hint: look at [this tutorial](https://praw.readthedocs.io/en/stable/tutorials/comments.html).)

Put your scraped data into a list so you can use it in the next part. A good structure is a list of lists, where each inner list holds one top-level comment followed by its replies. (Hint: you might need a triple-nested for loop.) The scrape can take a minute or two to run.

In [3]:
from praw.models import MoreComments

# Get the top 10 posts of all time and collect each top-level comment with its
# direct replies into a list of conversations (a list of lists)
top_comments_and_replies = []

for submission in subreddit.top(limit=10):
    for top_level_comment in submission.comments:
        # Skip "load more comments" placeholders, which have no body
        if isinstance(top_level_comment, MoreComments):
            continue
        convo = [top_level_comment.body]
        for reply in top_level_comment.replies:
            if isinstance(reply, MoreComments):
                continue
            convo.append(reply.body)
        top_comments_and_replies.append(convo)
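
The comments tutorial linked above describes another way to deal with `MoreComments` placeholders: calling `replace_more(limit=0)` on a submission's comment forest removes them up front, so the loops need no `isinstance` checks. The sketch below wraps that idea in a function. The name `collect_conversations` and its `post_limit` parameter are hypothetical, and the function assumes it is handed a PRAW `subreddit` object like the one created in part 1.

```python
def collect_conversations(subreddit, post_limit=10):
    """Return a list of conversations, each a list of comment bodies.

    Assumes `subreddit` behaves like a PRAW Subreddit object.
    """
    conversations = []
    for submission in subreddit.top(limit=post_limit):
        # Drop every "load more comments" placeholder instead of fetching it
        submission.comments.replace_more(limit=0)
        for top_level_comment in submission.comments:
            # One conversation = a top-level comment followed by its direct replies
            convo = [top_level_comment.body]
            convo.extend(reply.body for reply in top_level_comment.replies)
            conversations.append(convo)
    return conversations
```

With a live Reddit instance, `collect_conversations(subreddit)` should build the same list-of-lists structure as the loop above.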

## ============= PART 3 =============

The last part is to use the data you scraped from Reddit to run inference with a HuggingFace model.

### Todos for part 3:

1. Look at each of these HuggingFace models and choose one to implement:
    * [News Classification](https://huggingface.co/mrm8488/bert-mini-finetuned-age_news-classification) [HARD]
    * [Sentiment Analysis](https://huggingface.co/docs/transformers/quicktour) [EASY]
    * [Text Generation](https://huggingface.co/tasks/text-generation) [EASY]
2. Once you have chosen a model, go ahead and implement it!
3. If time permits, try another!

In [1]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline



### Fill in the blanks for News Classification to get started

In [28]:
news_tokenizer = AutoTokenizer.from_pretrained("mrm8488/bert-mini-finetuned-age_news-classification") 
news_model = AutoModelForSequenceClassification.from_pretrained("mrm8488/bert-mini-finetuned-age_news-classification")

sentiment_model = pipeline("sentiment-analysis")

text_generation_model = pipeline("text-generation", model="gpt2")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english)


### Helper Functions for News Classification

In [33]:
import torch

NEWS_CLASSES = ["world", "sports", "business", "sci/tech"]

# Map the raw scores (logits) from our news classifier to a class name
def map_news_output_to_class(inference_output: torch.Tensor) -> str:
    # The index of the largest logit is the predicted class
    max_index = int(torch.argmax(inference_output).item())
    return NEWS_CLASSES[max_index]

# Run the news classifier model on the input
def run_subject_analysis(query: str) -> str:
    # Truncate long conversations so they fit the model's maximum input length
    inputs = news_tokenizer(query, return_tensors="pt", truncation=True)
    with torch.no_grad():  # Inference only, so no gradients are needed
        outputs = news_model(**inputs)  # Unpack key-value pairs into keyword args in function call
    news_subject = map_news_output_to_class(outputs.logits[0])  # Logits for the single input in the batch
    return news_subject
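
To see what mapping the classifier output to a class involves, the sketch below repeats the idea in plain Python, without `torch`. The model returns one raw score (logit) per class. Softmax turns those scores into probabilities, and the largest one gives the predicted class. The `softmax` and `classify` helpers are hypothetical, and the logits are made up for illustration.

```python
import math

NEWS_CLASSES = ["world", "sports", "business", "sci/tech"]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    # Subtracting the largest score avoids overflow in exp()
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def classify(logits):
    """Return the most likely class name and its probability."""
    probabilities = softmax(logits)
    best_index = max(range(len(probabilities)), key=probabilities.__getitem__)
    return NEWS_CLASSES[best_index], probabilities[best_index]


# Made-up logits for illustration, not real model output
label, probability = classify([-1.2, 0.3, 2.5, 0.1])
print(label, round(probability, 3))
```

Softmax preserves the ordering of the scores, so taking the largest logit directly gives the same class. The probabilities are only needed if you also want a confidence value.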


In [25]:
import random

def get_random_conversation(conversations):
    text = ""
    conversation = random.choice(conversations)
    for comment in conversation:
        text = text + " " + comment
    
    return text

def get_random_comment(conversations):
    convo = random.choice(conversations)
    comment = random.choice(convo)
    
    return comment

In [16]:
# Run sentiment analysis
sentiment_query_sentence = get_random_comment(top_comments_and_replies)
sentiment = sentiment_model(sentiment_query_sentence)
print(f"Sentiment test: {sentiment_query_sentence} === {sentiment}")

Sentiment test: reddit recap says i was one of the first to upvote this === [{'label': 'POSITIVE', 'score': 0.958552360534668}]
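
A single comment says little about a subreddit's overall mood, so a natural next step is to summarize many results. The sketch below counts labels in a list of results shaped like the printed output above, with one dictionary per input holding `label` and `score` keys. The `tally_labels` helper is hypothetical, and the example results are hand-written rather than real model output.

```python
from collections import Counter


def tally_labels(results):
    """Count how often each label appears in a list of pipeline results."""
    return Counter(result["label"] for result in results)


# Hand-written results in the same shape as the sentiment pipeline's output
example_results = [
    {"label": "POSITIVE", "score": 0.96},
    {"label": "NEGATIVE", "score": 0.88},
    {"label": "POSITIVE", "score": 0.71},
]
print(tally_labels(example_results))
```

In the notebook, passing a list of comment strings to `sentiment_model` should return one such dictionary per comment, which can then be handed to `tally_labels`.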


In [31]:
# Run text generation
text_generation_query = get_random_conversation(top_comments_and_replies)
text_generation = text_generation_model(text_generation_query)
print(f"Text generation: {text_generation_query} === {text_generation}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Text generation:  If DFV doesn't win man of the year 2021, I no longer believe in the award. === [{'generated_text': " If DFV doesn't win man of the year 2021, I no longer believe in the award. I'm a fan of the current and former winners of the year awards. Even if I don't win it, I still hope to play in the"}]


In [40]:
# Run news categorization analysis
category_query = get_random_conversation(top_comments_and_replies)
category = run_subject_analysis(category_query)
print(f"News Category: {category_query} === {category}")

News Category:  If gme hits $1000 EOW I’m naming my son DFV. You heard it here. 

200 shares @$291 === business
