# Introduction 
`V1.0.0` `2021-12-04`
### Who am I
Just a fellow Kaggle learner. I created this Notebook as practice and thought it could be useful to others.
### Who is this for
This Notebook is for people who learn from examples. Forget the boring lectures and follow along for a fun, instructive time :)
### What can I learn here
You will learn the basics needed to create a rudimentary sentiment analyzer using the NLTK library and the Reddit API. I go over each step with explanations. Hopefully, with these building blocks, you can go ahead and build much more complex models.

### Things to remember
+ Please Upvote/Like the Notebook so other people can learn from it
+ Feel free to give any recommendations/changes. 
+ I will keep updating the notebook, so expect more changes in the future.

In [None]:
!pip install praw

# Imports
First, let us import the libraries we need.

In [None]:
import praw                  # Reddit API 
import nltk                  # Natural Language Tool Kit
import pandas as pd          # DataFrames
from pprint import pprint    # Printing
from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA   # Natural Language Tool Kit - Sentiment Analyzer Class

# Reddit API Instance
You need your own OAuth credentials to create an instance of the `Reddit` class. The [PRAW authentication guide](https://praw.readthedocs.io/en/stable/getting_started/authentication.html) explains how to obtain the `client_id` and `client_secret`.

In [None]:
# User agent string; replace 'your_user_here' with your Reddit account name
user = "Web Scraper v1.0 by /u/your_user_here"

# Create the Reddit instance; you have to fill in client_id and client_secret to be able to use the Reddit API
reddit = praw.Reddit(client_id="",       # Replace with the client_id issued by Reddit
                     client_secret="",   # Replace with the client_secret issued by Reddit
                     user_agent=user)    # User agent string defined above
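
If you prefer not to paste your credentials into the notebook, one option is to read them from environment variables. The following is only a sketch: the helper `reddit_kwargs_from_env` and the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are made up for illustration and are not part of PRAW.

```python
import os


def reddit_kwargs_from_env(env=os.environ):
    """Build keyword arguments for praw.Reddit from a mapping of variables.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are hypothetical names;
    pick whatever names suit your own setup.
    """
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": "Web Scraper v1.0 by /u/your_user_here",
    }


# Demonstration with a stand-in mapping instead of the real environment
example_env = {"REDDIT_CLIENT_ID": "abc", "REDDIT_CLIENT_SECRET": "xyz"}
kwargs = reddit_kwargs_from_env(example_env)
print(sorted(kwargs))
```

With the real variables set, the instance could then be created with `praw.Reddit(**reddit_kwargs_from_env())`.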

# Parsing Reddit Posts
In this section, we iterate over the hot posts that the API returns for the CryptoCurrency subreddit. We only keep posts with a score of at least 500.

In [None]:
# Initialize a set to store unique headlines
headlines = set()

# Iterate through every HOT post of the CryptoCurrency subreddit
for submission in reddit.subreddit("CryptoCurrency").hot(limit=None):
    # If the score is at least 500, store the headline of the post
    if submission.score >= 500:
        headlines.add(submission.title)

# Print headlines 
print(*headlines, sep='\n')

# Create DataFrame from the set of headlines (converted to a list, since a set has no order)
headlines_df = pd.DataFrame(list(headlines), columns=['headline'])
print(headlines_df.head(5))
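
To try the filtering logic without calling the Reddit API, here is a small offline sketch. `FakeSubmission` and `collect_headlines` are hypothetical names, and the posts are made-up data that only mimic the two attributes used above (`title` and `score`).

```python
from collections import namedtuple

# Stand-in for a PRAW submission; it only carries the two attributes we use
FakeSubmission = namedtuple("FakeSubmission", ["title", "score"])


def collect_headlines(submissions, min_score=500):
    """Return the set of unique titles whose score is at least min_score."""
    return {s.title for s in submissions if s.score >= min_score}


# Made-up posts for illustration only
posts = [
    FakeSubmission("Made-up headline A", 1200),
    FakeSubmission("Made-up headline B", 499),   # below the cutoff
    FakeSubmission("Made-up headline A", 800),   # duplicate title
    FakeSubmission("Made-up headline C", 500),   # exactly at the cutoff
]

selected = collect_headlines(posts)
print(sorted(selected))
```

Because the result is a set, the duplicate title is stored only once, and a post with a score of exactly 500 is kept.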

# Sentiment Analyzer Instance
Here, we create an instance of the `SIA` class (`SentimentIntensityAnalyzer`) from the Natural Language Toolkit (`nltk`) library. This allows us to analyze each post and get a sentiment rating.

In [None]:
# Download the vader_lexicon, which the SIA() class needs to score sentiment
nltk.download("vader_lexicon")
sia = SIA()

# Init results list to store headline and its score as a dict
results = []

# Analyze sentiment
We analyze the sentiment of each headline using the `nltk` library and the `SIA()` class. We store each result in a list called `results` so that we can create a DataFrame from it afterwards.

In [None]:
# Iterate through the set of headlines and obtain the sentiment scores with the nltk library
for headline in headlines:
    polarity_score = sia.polarity_scores(headline)
    polarity_score['headline'] = headline
    results.append(polarity_score)

# Use pretty print library to print the list of dicts
pprint(results[:], width=100)
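
The same loop can be wrapped in a function that takes the scorer as an argument, which makes it easy to test without downloading the lexicon. This is only a sketch: `score_headlines` and `fake_scorer` are hypothetical names, and the stand-in scorer returns fixed values with the same four keys that `sia.polarity_scores` produces (`neg`, `neu`, `pos`, `compound`).

```python
def score_headlines(headlines, scorer):
    """Return one dict per headline: the scorer's output plus the headline."""
    records = []
    for headline in headlines:
        scores = scorer(headline)
        # Copy the scores so the scorer's own dict is left untouched
        records.append({**scores, "headline": headline})
    return records


def fake_scorer(text):
    """Hypothetical stand-in that always returns a neutral score."""
    return {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0}


records = score_headlines(
    ["first made-up headline", "second made-up headline"], fake_scorer
)
print(records[0])
```

In the notebook, you would pass `sia.polarity_scores` as the `scorer` argument instead of the stand-in.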

# Create DataFrame from Records
We have recorded each headline and its scores as a dictionary in a list named `results`, so the list contains one dict per headline/post. We use the `from_records()` method to create a DataFrame from this list of dicts.

In [None]:
# Create DataFrame from list of dicts
results_df = pd.DataFrame.from_records(results)
print(results_df.head())
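
To see what `from_records()` does with a list of dicts, here is a tiny example. The values are made up for illustration and are not real VADER output.

```python
import pandas as pd

# Two made-up records shaped like the dicts stored in results
example_results = [
    {"neg": 0.0, "neu": 0.5, "pos": 0.5, "compound": 0.6,
     "headline": "made-up positive headline"},
    {"neg": 0.4, "neu": 0.6, "pos": 0.0, "compound": -0.5,
     "headline": "made-up negative headline"},
]

# Each dict becomes one row; each key becomes one column
example_df = pd.DataFrame.from_records(example_results)
print(example_df.shape)
print(example_df.columns.tolist())
```

Each dict turns into a row and each key into a column, so two records with five keys give a DataFrame with two rows and five columns.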

# Post-Process Results
Here we create a new column `label` that contains 1 if the sentiment is positive and -1 if it is negative; headlines that are neither keep the default value 0. We do not need all the information that the `nltk` library produces, since we only want to know whether a post is positive or negative.

In [None]:
# Create new column label and fill all values with 0
results_df['label'] = 0

# Label 1 would be positive
results_df.loc[results_df['compound'] > 0.2, 'label'] = 1

# Label -1 would be negative (compound below -0.2, mirroring the positive threshold)
results_df.loc[results_df['compound'] < -0.2, 'label'] = -1

# Display DataFrame head for quick visualization
results_df.head()
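
A symmetric labeling rule can also be written as a plain function, which makes the thresholds easy to check by hand. This is a sketch that assumes a cutoff of 0.2 on either side of zero; `label_from_compound` is a hypothetical helper and the scores are made up.

```python
def label_from_compound(compound, threshold=0.2):
    """Map a compound score to 1 (positive), -1 (negative) or 0 (neutral)."""
    if compound > threshold:
        return 1
    if compound < -threshold:
        return -1
    return 0


# Made-up compound scores for illustration
made_up_scores = [0.65, 0.05, -0.10, -0.48]
labels = [label_from_compound(c) for c in made_up_scores]
print(labels)  # [1, 0, 0, -1]
```

On a DataFrame, the same function could be applied with `results_df['compound'].apply(label_from_compound)`.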

Here we filter the DataFrame to only contain the headline and the label. We then save the results in a csv file that I can use in the future. For example, I might use these results as an additional feature in a time-series prediction algorithm for CryptoCurrency.

**Needless to say, if I succeed in making it... This will be the last time you see me on Kaggle :)**

In [None]:
# Create new filtered DataFrame which only contains the columns headline and label
results_filtered_df = results_df[['headline', 'label']]

# Save to a csv file
results_filtered_df.to_csv('reddit_sentiment_analysis.csv', encoding='utf-8', index=False)
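
To check that a saved file can be loaded again later, you can do a quick round trip. The sketch below writes to an in-memory buffer instead of a real file, and the row is made up for illustration.

```python
import io

import pandas as pd

# Made-up row; the comma in the headline checks that quoting works
example_df = pd.DataFrame(
    {"headline": ["made-up headline, with a comma"], "label": [1]}
)

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
example_df.to_csv(buffer, index=False)

# Rewind the buffer and read the data back
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```

With a real file, you would pass the file name `'reddit_sentiment_analysis.csv'` to `pd.read_csv()` instead of the buffer.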

# Overview of Results
This gives us a quick overview of the percentage of posts with each label. We use the `pandas` `value_counts()` method, which counts how often each value occurs in the `label` column. Passing `normalize=True` returns fractions instead of counts, and multiplying the output by 100 converts them to percentages.

In [None]:
# Obtain the percentage of headlines/posts that were positive and negative
results_filtered_df.label.value_counts(normalize=True) * 100
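
Here is the same calculation on a tiny made-up column, so the numbers are easy to verify by hand: two of the four labels are 1, one is -1, and one is 0.

```python
import pandas as pd

# Made-up labels for illustration
example_labels = pd.Series([1, 1, -1, 0], name="label")

# normalize=True gives fractions; multiplying by 100 gives percentages
percentages = example_labels.value_counts(normalize=True) * 100
print(percentages)
```

The label 1 makes up 50% of the values, while -1 and 0 make up 25% each.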