# Sentiment Analysis - Reddit

## Workflow

**SET UP**

1. Set up notebook.
2. Identify the flairs or search queries of interest. (The functions below accept only a search query; they cannot be used as-is if you need additional parameters, e.g., a time stamp.)

**THREADS**

3. Use Function A to scrape thread titles for search queries of interest.
4. Use Function B to perform sentiment analysis on thread titles.
5. Use Function B-i to sort by positive, negative, and neutral posts.

**COMMENTS**

<strike>6. Use Function C to get URL, etc. for threads based on sentiment label. (Optional: Extract to .CSV)</strike>

<strike>7. Go through the titles manually to see which ones pique your interest.</strike>

8. Use Function D to get comments from thread URL. 
9. Use Function B to perform sentiment analysis on comments.

Suggestions - flair (discussion, official post), word cloud, competitor analysis

## Set Up

**Instructions**
1. Log on to Reddit.
2. Visit https://reddit.com/prefs/apps (or https://old.reddit.com/prefs/apps/).
3. Create a new app and choose the "script" type (for personal use).
4. Replace `client_id` and `client_secret` below with the values from your app.
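To keep the client ID and secret out of the notebook itself, one option is to read them from environment variables. The following is a minimal sketch; the helper `load_reddit_credentials` and the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` are assumptions made for illustration and are not required by Reddit or PRAW.

```python
import os


def load_reddit_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are hypothetical names chosen
    for this sketch; any variable names would work.
    """
    env = os.environ if env is None else env
    try:
        return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
    except KeyError as missing:
        # Fail early with a readable message instead of a bare KeyError.
        raise RuntimeError(f"Set the environment variable {missing} first") from None
```

The returned pair can then be passed as `client_id` and `client_secret` to `praw.Reddit(...)`.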

In [1]:
from pprint import pprint
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
import praw

user_agent = "practice by u/okcoolbeans12"
reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # from your app page; keep real values out of shared notebooks
    client_secret="YOUR_CLIENT_SECRET",
    user_agent=user_agent,
)

In [4]:
import nltk
nltk.download('vader_lexicon')

from nltk.sentiment.vader import SentimentIntensityAnalyzer as SIA

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     C:\Users\kirby\AppData\Roaming\nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


## Functions

### A. Getting Thread Titles from Search Queries

Only set up for search keywords & flairs.

In [4]:
def get_thread_info(subreddit, param):
    # Search `subreddit` for `param` (a keyword or flair query) and
    # return one DataFrame row per matching submission.
    data = []
    
    for submission in reddit.subreddit(subreddit).search(param):
        # build the whole row, then append it once (lists have no .add)
        data.append([
            str(submission.subreddit),
            submission.author_flair_text,  # author's flair; the post's flair is link_flair_text
            submission.title,
            submission.shortlink,
            submission.id,
            str(submission.author),
            submission.num_comments,
            submission.upvote_ratio,       # share of upvotes among all votes
            submission.created_utc,        # Unix timestamp in seconds
        ])
    
    df = pd.DataFrame(data, columns=['subreddit', 'flair', 'title', 'shortlink',
                                     'id', 'author', 'num_comments', 'upvote_ratio',
                                     'created_utc'])
    
    return df
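Function A passes only the query string to the search, so a date range has to be applied after scraping. The sketch below shows one way to do that using the `created_utc` value (a Unix timestamp in seconds). The helper `filter_by_date` and the sample rows are made up for illustration.

```python
from datetime import datetime, timezone


def filter_by_date(rows, start, end):
    """Keep rows whose 'created_utc' (Unix seconds) falls in [start, end)."""
    start_ts = start.timestamp()
    end_ts = end.timestamp()
    return [row for row in rows if start_ts <= row["created_utc"] < end_ts]


# Made-up rows standing in for scraped submissions.
rows = [
    {"title": "older thread",
     "created_utc": datetime(2023, 6, 1, tzinfo=timezone.utc).timestamp()},
    {"title": "newer thread",
     "created_utc": datetime(2023, 7, 10, tzinfo=timezone.utc).timestamp()},
]

july = filter_by_date(
    rows,
    start=datetime(2023, 7, 1, tzinfo=timezone.utc),
    end=datetime(2023, 8, 1, tzinfo=timezone.utc),
)
print([row["title"] for row in july])  # → ['newer thread']
```

On a DataFrame, the same comparison can be written as a boolean mask on the `created_utc` column.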

### B. Sentiment Analysis

In [5]:
def sentiment_analysis(dataframe):
    # Accepts a DataFrame with a 'title' column, or any collection of strings
    # (e.g. the comments from Function D), which is wrapped in a 'title' column.
    if not isinstance(dataframe, pd.DataFrame):
        dataframe = pd.DataFrame({'title': list(dataframe)})
    
    sia = SIA()
    labels = []

    for line in dataframe['title']:
        n = sia.polarity_scores(line)['compound'] #compound score in [-1, 1]
        
        #label each row individually instead of overwriting the whole column
        if n < -0.5:
            labels.append(-2)
        elif -0.5 <= n < 0:
            labels.append(-1)
        elif 0 < n <= 0.5:
            labels.append(1)
        elif n > 0.5:
            labels.append(2)
        else:
            labels.append(0)
    
    dataframe['label'] = labels
            
    return dataframe
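The thresholds used for the `label` column can be read as a single mapping from a VADER compound score to one of five labels. The sketch below isolates that mapping; the helper name `label_from_compound` is an assumption for illustration and is not part of the notebook.

```python
def label_from_compound(compound):
    """Map a VADER compound score in [-1, 1] to a label from -2 to 2."""
    if compound < -0.5:
        return -2   # strongly negative
    if compound < 0:
        return -1   # mildly negative
    if compound == 0:
        return 0    # neutral
    if compound <= 0.5:
        return 1    # mildly positive
    return 2        # strongly positive


scores = [-0.8, -0.5, 0.0, 0.5, 0.9]
print([label_from_compound(s) for s in scores])  # → [-2, -1, 0, 1, 2]
```

Applied element-wise to a list or pandas Series of compound scores, it yields one label per row.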

**B-i. Check Headlines by Label**

In [6]:
def check_headline(df, value):
    # value: 1 = positive, -1 = negative, 0 = neutral.
    # Labels run from -2 to 2, so positive and negative are matched by sign.
    
    if value == 1:
        print('Positive headlines:\n')
        pprint(list(df[df['label'] > 0]['title']), width=200)
        
    elif value == -1:
        print('Negative headlines:\n')
        pprint(list(df[df['label'] < 0]['title']), width=200)
        
    elif value == 0:
        print('Neutral headlines:\n')
        pprint(list(df[df['label'] == 0]['title']), width=200)
        
    return

### C. Get Info for Threads
Note: not used in the current workflow (step 6 is struck out).

In [7]:
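# named differently from Function A's get_thread_info so it does not overwrite it
def get_thread_details(subreddit, flair):
    
    final_data = []
    header_list = ['id', 'title', 'author', 'url', 'num_upvotes', 'upvote_ratio', 'num_comments', 'date']

    for submission in reddit.subreddit(subreddit).search(flair):
        val = [submission.id, submission.title, str(submission.author), submission.url, submission.score, submission.upvote_ratio, submission.num_comments, submission.created_utc] 
        final_data.append(val) #lists use append, not add

    #return a DataFrame so header_list names the columns and the result can be exported to .CSV
    return pd.DataFrame(final_data, columns=header_list)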

### D. Get Comments from Thread URL

**References**
* https://praw.readthedocs.io/en/latest/code_overview/models/comment.html

In [8]:
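def get_comments(url):
    submission = reddit.submission(url=url)
    
    final_comments = set() #a set, so identical comment texts are kept only once
    submission.comments.replace_more(limit=0) #drop the "load more comments" placeholders
    
    for comment in submission.comments: #top-level comments only
        final_comments.add(comment.body)
    
    return final_comments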
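Function D iterates over `submission.comments`, which yields top-level comments only. To include replies as well, the comment tree can be walked recursively. The sketch below uses stand-in objects that carry just the two attributes the walk relies on (`body` and `replies`); the helper `collect_bodies` is hypothetical. PRAW also offers `submission.comments.list()`, which returns a flattened list of all comments.

```python
from types import SimpleNamespace


def collect_bodies(comments):
    """Depth-first walk returning every comment body, including replies."""
    bodies = []
    for comment in comments:
        bodies.append(comment.body)
        bodies.extend(collect_bodies(comment.replies))
    return bodies


# Stand-in objects, not real PRAW comments.
reply = SimpleNamespace(body="a reply", replies=[])
thread = [
    SimpleNamespace(body="first top-level comment", replies=[reply]),
    SimpleNamespace(body="second top-level comment", replies=[]),
]

print(collect_bodies(thread))
# → ['first top-level comment', 'a reply', 'second top-level comment']
```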

## Genshin Impact

In [9]:
data = get_thread_info("Genshin_Impact", "flair:Discussion")

RequestException: error with request HTTPSConnectionPool(host='www.reddit.com', port=443): Max retries exceeded with url: /api/v1/access_token (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self signed certificate in certificate chain (_ssl.c:997)')))

In [None]:
df = sentiment_analysis(data)

In [None]:
#check number of positive, negative, neutral headlines

df.label.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

#sort_index orders the labels from most negative (-2) to most positive (2)
counts = df.label.value_counts(normalize=True).sort_index() * 100

sns.barplot(x=counts.index, y=counts, ax=ax)

ax.set_xlabel('Sentiment label (-2 to 2)')
ax.set_ylabel('Percentage')

plt.show()
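Because `value_counts` orders its result by frequency and leaves out labels that never occur, the bars are easiest to read when each label is tied to an explicit name. The sketch below computes the percentage per label in a fixed order. The display names in `LABEL_NAMES` and the helper `label_percentages` are assumptions for illustration.

```python
from collections import Counter

# Hypothetical display names for the five labels produced by Function B.
LABEL_NAMES = {
    -2: "Very negative",
    -1: "Negative",
    0: "Neutral",
    1: "Positive",
    2: "Very positive",
}


def label_percentages(labels):
    """Return {display name: percentage} in fixed order from -2 to 2."""
    counts = Counter(labels)
    total = len(labels)
    return {
        name: (100 * counts[value] / total if total else 0.0)
        for value, name in LABEL_NAMES.items()
    }


shares = label_percentages([1, 1, 0, -2])
print(shares)
# → {'Very negative': 25.0, 'Negative': 0.0, 'Neutral': 25.0, 'Positive': 50.0, 'Very positive': 0.0}
```

The keys and values of the returned dict can be passed directly as the x and y values of a bar plot.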

In [None]:
check_headline(df, -1)

In [None]:
check_headline(df, 1)

In [None]:
comments = get_comments("https://www.reddit.com/r/Genshin_Impact/comments/14zgvjg/why_klee_and_eulas_revenue_is_so_low_an_analysis/")

In [None]:
dff = sentiment_analysis(comments)

In [None]:
#check number of positive, negative, neutral comments

dff.label.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

#sort_index orders the labels from most negative (-2) to most positive (2)
counts = dff.label.value_counts(normalize=True).sort_index() * 100

sns.barplot(x=counts.index, y=counts, ax=ax)

ax.set_xlabel('Sentiment label (-2 to 2)')
ax.set_ylabel('Percentage')

plt.show()

In [None]:
dff.to_csv(r'/Users/kirbypark/Desktop/JupyterNotebook/genshin_14zgvjg.csv', index=False)

In [None]:
data2 = get_thread_info("Genshin_Impact", "flair_name:\":hoyo1::hoyo2: Official Post\"")
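Flair names that contain spaces or emoji codes need to be wrapped in double quotes inside the query, which is why the call above escapes them by hand. A small helper can build the string instead. This is a sketch only; `flair_query` is a hypothetical name, and it simply reproduces the `flair_name:"..."` pattern used in this notebook.

```python
def flair_query(flair_name):
    """Build a search string of the form flair_name:"<flair>"."""
    return f'flair_name:"{flair_name}"'


print(flair_query(":hoyo1::hoyo2: Official Post"))
# → flair_name:":hoyo1::hoyo2: Official Post"
```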

In [None]:
df2 = sentiment_analysis(data2)

In [None]:
#check number of positive, negative, neutral headlines

df2.label.value_counts()

In [None]:
check_headline(df2, 1)

In [None]:
comments2 = get_comments("https://www.reddit.com/r/Genshin_Impact/comments/14p8g2w/overture_teaser_the_final_feast_genshin_impact/")

In [None]:
dff2 = sentiment_analysis(comments2)

In [None]:
dff2.label.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

#sort_index orders the labels from most negative (-2) to most positive (2)
counts = dff2.label.value_counts(normalize=True).sort_index() * 100

sns.barplot(x=counts.index, y=counts, ax=ax)

ax.set_xlabel('Sentiment label (-2 to 2)')
ax.set_ylabel('Percentage')

plt.show()

In [None]:
#for nexon game bc im curious

c = get_comments("https://www.reddit.com/r/Maplestory/comments/14y59pu/v243_savior_shangrila_kaling_update_preview/")

In [None]:
d = sentiment_analysis(c)

In [None]:
d.label.value_counts()

In [None]:
fig, ax = plt.subplots(figsize=(8, 8))

#sort_index orders the labels from most negative (-2) to most positive (2)
counts = d.label.value_counts(normalize=True).sort_index() * 100

sns.barplot(x=counts.index, y=counts, ax=ax)

ax.set_xlabel('Sentiment label (-2 to 2)')
ax.set_ylabel('Percentage')

plt.show()

In [None]:
check_headline(d, 1)

In [None]:
check_headline(dff2, 1)

In [None]:
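#avoid naming the variable `list`, which would shadow the built-in
samples = ["hello!!!", "Is this best event ever territory? These rewards are crazy."]
sia = SIA() #create the analyzer once and reuse it
for n in samples:
    print(n, sia.polarity_scores(n))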