# IPA vs Lager: Analyzing Reddit Discussions 
### Web Scraping & Sentiment Analysis

CSIS 4260 / 300394050 / Rachel Kim

Static websites are often scraped with BeautifulSoup, while dynamic websites typically require Selenium. However, platforms like Reddit impose restrictions such as login requirements, API rate limits, and bot detection, which make traditional scraping methods (BeautifulSoup, Selenium, etc.) difficult to apply.

Therefore, to collect data from Reddit in a stable and efficient manner, I used PRAW and Reddit's JSON API to gather posts and analyze user opinions on IPA and Lager.

## Part 1: Web scraping: PRAW vs Requests + JSON API

### 1.1. PRAW

In [76]:
# Import Library for Praw
import praw
import pandas as pd
import time

In [77]:
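# Read the Reddit API credentials from environment variables instead of hard-coding them.
# The variable names below are placeholders; set them to match your own environment.
import os

reddit = praw.Reddit(
    client_id=os.environ["REDDIT_CLIENT_ID"],
    client_secret=os.environ["REDDIT_CLIENT_SECRET"],
    password=os.environ["REDDIT_PASSWORD"],
    user_agent=os.environ["REDDIT_USERNAME"],  # the account name doubles as the user agent
    username=os.environ["REDDIT_USERNAME"],
)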

In [78]:

# keywords, subreddit list
beer_keywords = ["IPA", "lager"]
subreddits = ["beer", "craftbeer", "alcohol"]

# List to store the collected posts
posts = []

# Start measuring execution time
start_time = time.time()

# Scraping
for subreddit in subreddits:
    sub = reddit.subreddit(subreddit)
    for keyword in beer_keywords:
        for submission in sub.search(keyword, limit=50): 
            posts.append([subreddit, keyword, submission.title, submission.selftext, submission.score])
            
# End measuring execution time
end_time = time.time()
praw_time = end_time - start_time

# Convert the collected posts to a DataFrame and save it as CSV
df = pd.DataFrame(posts, columns=["Subreddit", "Keyword", "Title", "Text", "Score"])
df.to_csv("beer_reviews_praw.csv", index=False)
print(f"DataFrame is saved. \nPraw Time: {praw_time:.2f}s")

DataFrame is saved. 
Praw Time: 7.10s


In [79]:
# Check the data
df.info

<bound method DataFrame.info of     Subreddit Keyword                                              Title  \
0        beer     IPA  If your IPA does not have the date printed on ...   
1        beer     IPA  Non-IPA drinkers - are there any IPAs that you...   
2        beer     IPA  "IPA has gone too far.", says former Stone bre...   
3        beer     IPA  Potentially unpopular opinion: a “variety pack...   
4        beer     IPA  Who here remembers Ranger IPA before it became...   
..        ...     ...                                                ...   
295   alcohol   lager               Found behind my old hot water heater   
296   alcohol   lager        "heh you don't drink it for the taste, kid"   
297   alcohol   lager           Graph of alcohols and my future mini-bar   
298   alcohol   lager               Why in the hell do people like ipas?   
299   alcohol   lager         best alcohol to drink on night out on diet   

                                                  Text 

#### PRAW Result
- Number of posts collected: 300  
- PRAW scraping time: 7.10s 

### 1.2. Requests + JSON API

In [80]:
# import library for Requests + Json
import requests
import pandas as pd
import time

In [81]:
# keywords & subreddits
beer_keywords = ["IPA", "lager"]
subreddits = ["beer", "craftbeer", "alcohol"]

# User-Agent & Headers
headers = {"User-Agent": "Mozilla/5.0"}

# Start measuring execution time
start_time = time.time()

# List to store data
posts_json = []

# Scraping from Reddit JSON API
for subreddit in subreddits:
    for keyword in beer_keywords:
        url = f"https://www.reddit.com/r/{subreddit}/search.json?q={keyword}&restrict_sr=1&limit=50"
        response = requests.get(url, headers=headers)

        if response.status_code == 200:
            data = response.json()
            posts = data["data"]["children"]  # children: An array containing the actual post data from the Reddit JSON API response
            
            for post in posts:
                title = post["data"]["title"]
                text = post["data"]["selftext"]
                score = post["data"]["score"]
                posts_json.append([subreddit, keyword, title, text, score])
        else:
            print(f"Error fetching data from {subreddit}: {response.status_code}")

# End measuring execution time
end_time = time.time()
json_api_time = end_time - start_time

# Save data to CSV
df_json = pd.DataFrame(posts_json, columns=["Subreddit", "Keyword", "Title", "Text", "Score"])
df_json.to_csv("beer_reviews_json.csv", index=False)

print(f"DataFrame is saved. \nRequests+Json api Time: {json_api_time:.2f} seconds")

DataFrame is saved. 
Requests+Json api Time: 4.81 seconds
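
The cell above builds the search URL with an f-string and then walks the `data` → `children` → `data` nesting of the response. A minimal sketch of the same two steps follows, assuming the query string is encoded with the standard library so that multi-word keywords are also safe. The helper names `build_search_url` and `extract_posts` and the sample payload are hypothetical illustrations, not part of the notebook or of Reddit's API.

```python
from urllib.parse import urlencode


def build_search_url(subreddit, keyword, limit=50):
    """Build a subreddit search URL with a safely encoded query string."""
    query = urlencode({"q": keyword, "restrict_sr": 1, "limit": limit})
    return f"https://www.reddit.com/r/{subreddit}/search.json?{query}"


def extract_posts(payload, subreddit, keyword):
    """Flatten a listing payload into [subreddit, keyword, title, text, score] rows."""
    rows = []
    for child in payload.get("data", {}).get("children", []):
        post = child.get("data", {})
        rows.append([
            subreddit,
            keyword,
            post.get("title", ""),
            post.get("selftext", ""),
            post.get("score", 0),
        ])
    return rows


# Made-up payload that only mimics the nesting used in the cell above
sample_payload = {
    "data": {
        "children": [
            {"data": {"title": "Example title", "selftext": "Example body", "score": 12}}
        ]
    }
}

print(build_search_url("craftbeer", "hazy IPA"))
print(extract_posts(sample_payload, "craftbeer", "IPA"))
```

Using `.get()` with defaults means a missing field yields an empty value instead of raising a `KeyError`, which may or may not be the behavior you want when a response is malformed.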


In [82]:
df_json.info

<bound method DataFrame.info of     Subreddit Keyword                                              Title  \
0        beer     IPA  If your IPA does not have the date printed on ...   
1        beer     IPA  Non-IPA drinkers - are there any IPAs that you...   
2        beer     IPA  "IPA has gone too far.", says former Stone bre...   
3        beer     IPA  Potentially unpopular opinion: a “variety pack...   
4        beer     IPA  Who here remembers Ranger IPA before it became...   
..        ...     ...                                                ...   
295   alcohol   lager               Found behind my old hot water heater   
296   alcohol   lager        "heh you don't drink it for the taste, kid"   
297   alcohol   lager           Graph of alcohols and my future mini-bar   
298   alcohol   lager               Why in the hell do people like ipas?   
299   alcohol   lager         best alcohol to drink on night out on diet   

                                                  Text 

#### JSON Result
- Number of posts collected: 300  
- Requests + JSON API scraping time: 4.81s 

### 1.3. Comparison: PRAW vs Requests + JSON API

- PRAW: A Python library that simplifies the use of Reddit's official API, automatically handling authentication and request limits.  
- Reddit JSON API+Requests: Involves sending HTTP requests directly and processing the JSON responses.

In this project, the Reddit JSON API took 4.81 seconds and PRAW took 7.10 seconds to collect the same 300 posts, so the JSON API was faster in this run. A likely reason is that the direct approach only sends HTTP requests and parses the JSON responses, whereas PRAW also authenticates, wraps each result in Python objects, and manages rate limits, all of which add overhead. Because each method was timed only once, part of the gap may also reflect network variation.
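
Since the comparison rests on a single timed run per method, one way to make it more robust is to repeat each measurement and compare medians. The sketch below is only an illustration under that assumption: `time_runs`, `summarize`, and `dummy_scrape` are hypothetical names, and `dummy_scrape` is a stand-in that sleeps instead of contacting Reddit.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Call func several times and return the duration of each call in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Summarize a list of durations with median, minimum, and maximum."""
    return {
        "median": statistics.median(durations),
        "min": min(durations),
        "max": max(durations),
    }


def dummy_scrape():
    # Stand-in for a real scraping function; it only sleeps briefly.
    time.sleep(0.01)


durations = time_runs(dummy_scrape, repeats=3)
summary = summarize(durations)
print(f"median={summary['median']:.3f}s min={summary['min']:.3f}s max={summary['max']:.3f}s")
```

In practice the two scraping loops would be wrapped in functions and passed to `time_runs` in place of `dummy_scrape`, keeping the number of repeats small to stay within Reddit's rate limits.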

However, from a user perspective, PRAW's more concise and intuitive code made collecting the 300 posts easier and more convenient.

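## Part 2: Text analysis


### 2.1. VADER
VADER is a rule-based sentiment analyzer tuned for short, informal text such as social media posts and reviews. It returns the positive, negative, and neutral proportions of a text, plus a normalized compound score ranging from -1 (most negative) to +1 (most positive), which makes it useful for measuring the intensity of sentiment. The compound score is what the code below uses to label each post.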

In [83]:
# Load the library
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

In [84]:
# Load the data
df = pd.read_csv("beer_reviews_praw.csv")

# Initialize VADER analyzer
vader_analyzer = SentimentIntensityAnalyzer()

# Function to analyze sentiment
def analyze_vader_sentiment(text):
    vader_score = vader_analyzer.polarity_scores(text)["compound"]
    
    if vader_score >= 0.05:
        sentiment = "Positive"
    elif vader_score <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    return vader_score, sentiment

# Apply VADER sentiment analysis
df["VADER_Score"] = df["Text"].apply(lambda x: analyze_vader_sentiment(str(x))[0])
df["VADER_Sentiment_Label"] = df["Text"].apply(lambda x: analyze_vader_sentiment(str(x))[1])

# Filter IPA & Lager separately
ipa_vader_df = df[df["Keyword"] == "IPA"]
lager_vader_df = df[df["Keyword"] == "lager"]

# Count sentiment labels for IPA and Lager
ipa_vader = ipa_vader_df["VADER_Sentiment_Label"].value_counts(normalize=True) * 100
lager_vader = lager_vader_df["VADER_Sentiment_Label"].value_counts(normalize=True) * 100

In [85]:

# Print results
print("\nVADER Sentiment Distribution for IPA (%)")
print(ipa_vader)

print("\nVADER Sentiment Distribution for Lager (%)")
print(lager_vader)


VADER Sentiment Distribution for IPA (%)
VADER_Sentiment_Label
Neutral     49.333333
Positive    37.333333
Negative    13.333333
Name: proportion, dtype: float64

VADER Sentiment Distribution for Lager (%)
VADER_Sentiment_Label
Neutral     50.666667
Positive    42.000000
Negative     7.333333
Name: proportion, dtype: float64


### 2.2. TextBlob
TextBlob is a simple and intuitive library for natural language processing (NLP) in Python. It is primarily used for text analysis and sentiment analysis, along with various other NLP tasks.   

**Polarity** indicates the degree of positivity or negativity in the text, ranging from -1 (negative) to 1 (positive).  
**Subjectivity** reflects the degree of subjectivity in the text, ranging from 0 (objective) to 1 (subjective).

In [86]:
# Load the library
from textblob import TextBlob

In [87]:

# Function to analyze sentiment
def analyze_textblob_sentiment(text):
    textblob_score = TextBlob(text).sentiment.polarity
    
    if textblob_score >= 0.05:
        sentiment = "Positive"
    elif textblob_score <= -0.05:
        sentiment = "Negative"
    else:
        sentiment = "Neutral"
    
    return textblob_score, sentiment

# Apply TextBlob sentiment analysis separately
df["TextBlob_Score"] = df["Text"].apply(lambda x: analyze_textblob_sentiment(str(x))[0])
df["TextBlob_Sentiment_Label"] = df["Text"].apply(lambda x: analyze_textblob_sentiment(str(x))[1])

# Filter IPA & Lager separately
ipa_textblob_df = df[df["Keyword"] == "IPA"]
lager_textblob_df = df[df["Keyword"] == "lager"]

# Count sentiment labels for IPA and Lager
ipa_textblob = ipa_textblob_df["TextBlob_Sentiment_Label"].value_counts(normalize=True) * 100
lager_textblob = lager_textblob_df["TextBlob_Sentiment_Label"].value_counts(normalize=True) * 100


In [88]:
# Print results
print("\nTextBlob Sentiment Distribution for IPA (%)")
print(ipa_textblob)

print("\nTextBlob Sentiment Distribution for Lager (%)")
print(lager_textblob)


TextBlob Sentiment Distribution for IPA (%)
TextBlob_Sentiment_Label
Neutral     57.333333
Positive    33.333333
Negative     9.333333
Name: proportion, dtype: float64

TextBlob Sentiment Distribution for Lager (%)
TextBlob_Sentiment_Label
Neutral     54.666667
Positive    38.000000
Negative     7.333333
Name: proportion, dtype: float64


### 2.3. Importance Score

The importance score estimates how much weight each post's sentiment should carry.  
Because not all posts are equally influential, the average of the VADER and TextBlob scores is scaled by the post's Reddit score: importance = (0.5 × VADER + 0.5 × TextBlob) × (1 + |score| / 1000).
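
To make the weighting concrete, here is a small worked sketch of the same arithmetic used in this section: the two sentiment scores are averaged, and the result is scaled by `1 + |score| / 1000`. The input values are made up for illustration and do not come from the dataset.

```python
def importance_score(vader_score, textblob_score, reddit_score):
    """Average the two sentiment scores, then scale by the post's Reddit score."""
    sentiment = 0.5 * vader_score + 0.5 * textblob_score
    return sentiment * (1 + abs(reddit_score) / 1000)


# (vader_score, textblob_score, reddit_score): illustrative values only
examples = [
    (0.5, 0.3, 0),      # no upvotes: the weight is 1.0
    (0.5, 0.3, 1000),   # 1000 upvotes: the weight is 2.0
    (-0.5, -0.3, 250),  # negative sentiment keeps its sign
]

for vader, textblob, score in examples:
    result = importance_score(vader, textblob, score)
    print(f"vader={vader}, textblob={textblob}, score={score} -> {result:.3f}")
```

Because the weight is never below 1, the Reddit score can only amplify a post's sentiment, and a post whose averaged sentiment is exactly zero stays at zero regardless of its score.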

In [89]:
# Function to calculate Importance Score based on VADER, TextBlob scores and Reddit score
def calculate_importance_score(vader_score, textblob_score, score):
    importance_score = vader_score * 0.5 + textblob_score * 0.5
    # Weight Reddit score
    weighted_importance = importance_score * (1 + abs(score) / 1000)
    return weighted_importance

# Apply Importance Score calculation with score
df["Importance_Score"] = df.apply(lambda x: calculate_importance_score(x["VADER_Score"], x["TextBlob_Score"], x["Score"]), axis=1)

# Assign Importance Direction (Positive, Negative, Neutral)
df["Importance_Direction"] = df["Importance_Score"].apply(lambda x: "Positive" if x > 0 else "Negative" if x < 0 else "Neutral")

# Combine results
importance_df = df[["Text", "Importance_Score", "Importance_Direction", "VADER_Sentiment_Label", "TextBlob_Sentiment_Label"]]

# Print
print("Importance Score Table:")
importance_df

Importance Score Table:


|  | Text | Importance_Score | Importance_Direction | VADER_Sentiment_Label | TextBlob_Sentiment_Label |
|---|---|---|---|---|---|
| 0 | I’m tired of seeing no dates on cans, nothing ... | -0.876338 | Negative | Negative | Negative |
| 1 | Hey everyone. I turned 21 in November, and my ... | 0.596254 | Positive | Positive | Positive |
| 2 |  | 0.000000 | Neutral | Neutral | Neutral |
| 3 | I have spoken | 0.000000 | Neutral | Neutral | Neutral |
| 4 | To me voodoo ranger sucks ( especially the jui... | -0.576103 | Negative | Negative | Negative |
| ... | ... | ... | ... | ... | ... |
| 295 | Anyone know what decade this is from? | 0.000000 | Neutral | Neutral | Neutral |
| 296 | How often that line comes up in films or TV. Y... | 0.653527 | Positive | Positive | Positive |
| 297 |  | 0.000000 | Neutral | Neutral | Neutral |
| 298 | I got 4 different highly rated IPA's from my l... | -0.419911 | Negative | Negative | Negative |


In [90]:
# Print value counts for categorical columns
print("\nImportance Direction Value Counts:")
print(df["Importance_Direction"].value_counts())

print("\nVADER Sentiment Label Value Counts:")
print(df["VADER_Sentiment_Label"].value_counts())

print("\nTextBlob Sentiment Label Value Counts:")
print(df["TextBlob_Sentiment_Label"].value_counts())


Importance Direction Value Counts:
Importance_Direction
Neutral     140
Positive    128
Negative     32
Name: count, dtype: int64

VADER Sentiment Label Value Counts:
VADER_Sentiment_Label
Neutral     150
Positive    119
Negative     31
Name: count, dtype: int64

TextBlob Sentiment Label Value Counts:
TextBlob_Sentiment_Label
Neutral     168
Positive    107
Negative     25
Name: count, dtype: int64


In [91]:
# Save the dataframe with sentiment results to CSV
df.to_csv("beer_reviews_with_sentiment.csv", index=False)

### 2.4. Results of Sentiment Analysis

In [92]:
# VADER results
ipa_vader_df = ipa_vader.rename("IPA_VADER (%)")
lager_vader_df = lager_vader.rename("Lager_VADER (%)")

# TextBlob results
ipa_textblob_df = ipa_textblob.rename("IPA_TextBlob (%)")
lager_textblob_df = lager_textblob.rename("Lager_TextBlob (%)")

# Importance score result
important_d_count = df["Importance_Direction"].value_counts().rename("Importance Direction counts")

# Combine Dataframes
result_df = pd.concat([ipa_vader_df, lager_vader_df, ipa_textblob_df, lager_textblob_df,important_d_count], axis=1).reset_index()
result_df.rename(columns={"index": "Sentiment"}, inplace=True)

# Print
result_df

|  | Sentiment | IPA_VADER (%) | Lager_VADER (%) | IPA_TextBlob (%) | Lager_TextBlob (%) | Importance Direction counts |
|---|---|---|---|---|---|---|
| 0 | Neutral | 49.333333 | 50.666667 | 57.333333 | 54.666667 | 140 |
| 1 | Positive | 37.333333 | 42.0 | 33.333333 | 38.0 | 128 |
| 2 | Negative | 13.333333 | 7.333333 | 9.333333 | 7.333333 | 32 |


**VADER**: Both IPA and Lager had the highest proportion of Neutral sentiment, accounting for approximately 50%. In terms of Positive sentiment, Lager (42%) received slightly more positive reactions than IPA (37.33%). On the other hand, Negative sentiment was higher for IPA (13.33%) compared to Lager (7.33%).  

**TextBlob**: The Neutral share was even higher than with VADER, at 57.33% for IPA and 54.67% for Lager. As with VADER, Positive sentiment was higher for Lager (38%) than for IPA (33.33%), and Negative sentiment was slightly higher for IPA (9.33%) than for Lager (7.33%).  
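
The percentages above can be converted back to post counts to show how large the differences are in absolute terms. The sketch below assumes each keyword has 150 posts (3 subreddits × 50 posts per search, consistent with the 300 rows collected); `to_counts` is a hypothetical helper name, and the percentages are copied from the results table.

```python
POSTS_PER_KEYWORD = 3 * 50  # 3 subreddits x 50 posts per search

# Percentages copied from the results table
vader_pct = {
    "IPA": {"Neutral": 49.333333, "Positive": 37.333333, "Negative": 13.333333},
    "Lager": {"Neutral": 50.666667, "Positive": 42.0, "Negative": 7.333333},
}
textblob_pct = {
    "IPA": {"Neutral": 57.333333, "Positive": 33.333333, "Negative": 9.333333},
    "Lager": {"Neutral": 54.666667, "Positive": 38.0, "Negative": 7.333333},
}


def to_counts(pct_by_label, total=POSTS_PER_KEYWORD):
    """Convert percentage shares back to whole post counts."""
    return {label: round(pct * total / 100) for label, pct in pct_by_label.items()}


for model_name, table in [("VADER", vader_pct), ("TextBlob", textblob_pct)]:
    for keyword, shares in table.items():
        print(model_name, keyword, to_counts(shares))
```

Under this assumption, the VADER gap in Negative sentiment corresponds to 20 IPA posts versus 11 Lager posts, and the TextBlob gap to 14 versus 11, so the differences rest on a fairly small number of posts.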


In conclusion, posts matching IPA were more often negative than posts matching Lager in both models. One plausible explanation, which this analysis does not test, is that IPA's strong aroma and bitter taste tend to be polarizing, whereas Lager's lighter and smoother taste appeals to a wider audience and draws fewer negative reactions.

However, both models labeled roughly half or more of the posts as Neutral (49% to 57%), which suggests that Reddit beer posts focus more on sharing experiences or comparing beers within the same category (IPA or Lager) than on expressing strong emotions. For future related projects, incorporating sentiment analysis of well-known brands within each category could yield more meaningful insights.