<a href="https://colab.research.google.com/github/sarmadchandio/WebScrapper/blob/main/reddit-scrapper.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# @title
!pip install praw
!pip install wordcloud matplotlib
!pip install gensim
!pip install bertopic

In [None]:
# @title
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from tqdm.notebook import tqdm
import praw
from wordcloud import WordCloud
import seaborn as sns
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from 'punkt_tab'


import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning)

# Big Data Pipeline

![img](https://drive.google.com/uc?id=1I8hfMsmjKTIl76DxsA9F9tzngkhUo6dJ)


# Part 1: Let's scrape some Reddit data!


A few annoying things that we need to set up before starting:
1. Create a Reddit account.
2. Go to [this link](https://www.reddit.com/prefs/apps)
3. Use the following image to set things up
<div>
<img src="https://drive.google.com/uc?id=1V62iD3KVlrPoyLRqbqtaRGcQRvJGh3i8"
width="700"/>
</div>

- Enter http://localhost:8080 as the redirect uri
- Copy personal_use_script and paste it into the `personal_use_script` variable below. </br>
- Copy secret and paste it into the `client_secret` variable below

</br>

---



The deal with *APIs* is simple. It is annoying to set up ONCE, but it is easy to use over and over and over and over ... </br>
Think of an API as a waiter who takes your order and brings you food from the kitchen.

---


In [None]:
# @title
personal_use_script = ''
client_secret = ''
user_agent = 'Dont mind me'


reddit = praw.Reddit(client_id=personal_use_script, client_secret=client_secret, user_agent=user_agent, check_for_async=False)
print("We are done setting up the api!")

In [None]:
# @title
def get_reddit_posts(subreddit_name, limit=1000):
    subreddit = reddit.subreddit(subreddit_name)
    posts = []

    for submission in subreddit.hot(limit=limit):  # change to .new, .top, or .controversial if needed
      post_data = {
          "title": submission.title,
          "score": submission.score,
          "selftext": submission.selftext,
          # "id": submission.id,
          # "url": submission.url,
          # "created": submission.created
          # ... any other attributes you are interested in
      }
      posts.append(post_data)

    return posts


### Try changing 'politics' to the subreddit of your liking!
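
The comment inside `get_reddit_posts` mentions switching from `.hot` to `.new`, `.top`, or `.controversial`. Here is a minimal sketch of picking the listing by name instead of editing the function each time. `fetch_posts` and `FakeSubreddit` are made-up names for illustration; the fake class only exists so the sketch runs without Reddit credentials, and with PRAW you would pass `reddit.subreddit('politics')` instead.

```python
from types import SimpleNamespace

ALLOWED_SORTS = {"hot", "new", "top", "controversial"}

def fetch_posts(subreddit, sort="hot", limit=10):
    """Collect title, score and selftext from one of a subreddit's listings."""
    if sort not in ALLOWED_SORTS:
        raise ValueError(f"sort must be one of {sorted(ALLOWED_SORTS)}")
    listing = getattr(subreddit, sort)  # e.g. subreddit.new
    return [
        {"title": s.title, "score": s.score, "selftext": s.selftext}
        for s in listing(limit=limit)
    ]

class FakeSubreddit:
    """Hypothetical offline stand-in that mimics two PRAW listing methods."""

    def _make(self, label, limit):
        return [
            SimpleNamespace(title=f"{label} post {i}", score=i, selftext="")
            for i in range(limit)
        ]

    def hot(self, limit=10):
        return self._make("hot", limit)

    def new(self, limit=10):
        return self._make("new", limit)

demo_posts = fetch_posts(FakeSubreddit(), sort="new", limit=3)
print(demo_posts)
```

Checking `sort` against a fixed set keeps a typo from turning into a confusing attribute error.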



In [None]:
# Get up to 1000 of the current "hot" posts from r/politics
posts = get_reddit_posts('politics', limit=1000)
print("Posts collected successfully!")

In [None]:
# @title
df_posts = pd.DataFrame(posts)
df_posts

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

file_path = '/content/gdrive/MyDrive/sample_data.csv'
df_posts.to_csv(file_path, index=False)

# Part 2: But what do I do with so much text?

Before working with a big pile of text, let's try something small with just two sentences. </br>
Don't tell my peers that I showed you how simple it is!

In [None]:
# @title
sentence1 = 'This is a Sentence. This sentence will be lowerCASED, LEMMAtized, removed OF stopwords, and tokenized'
sentence2 = 'I am another SENTENCE! I will be used to demonstrate how lemmatization, stopword removal, lowercasing, and tokenization work.'

print(sentence1)
print(sentence2)

Quick question: Is 'Toxic' the same as 'toxic' or 'toXic' or 'TOXIC'?
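
To a computer these are four different strings. A small sketch with a standard-library counter shows what that does to word counts (the word list is just the one from the question):

```python
from collections import Counter

words = ["Toxic", "toxic", "toXic", "TOXIC"]

# Without lowercasing, every spelling is counted as its own word
raw_counts = Counter(words)

# After lowercasing, all four collapse into a single entry
folded_counts = Counter(w.lower() for w in words)

print(raw_counts)
print(folded_counts)
```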

In [None]:
# @title
sentence1 = sentence1.lower()
sentence2 = sentence2.lower()

print(sentence1)
print(sentence2)

Quick question: Is 'stripped' the same as 'strip', or 'used' the same as 'use'?
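
Mapping each word to its base form is called lemmatization. Here is a toy sketch that does it with a lookup table applied word by word. The `LEMMAS` table and `lemmatize_tokens` are made up for illustration; real lemmatizers rely on a full dictionary of the language rather than a handful of entries.

```python
# Hypothetical hand-written lookup table, just big enough for the demo
LEMMAS = {"stripped": "strip", "used": "use"}

def lemmatize_tokens(tokens):
    """Replace each token with its base form when the table knows it."""
    return [LEMMAS.get(token, token) for token in tokens]

example_tokens = "i used the stripped text but left unused words alone".split()
lemmatized_tokens = lemmatize_tokens(example_tokens)
print(lemmatized_tokens)
```

Working on whole words matters: a plain string replacement of 'used' with 'use' would also turn 'unused' into 'unuse', while the lookup leaves 'unused' untouched.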

In [None]:
# @title
sentence1 = sentence1.replace('cased', 'case').replace('lemmatized', 'lemma').replace('removed', 'remove').replace('tokenized', 'token').replace('stopwords', 'stopword').replace('lowercased', 'lowercase')
sentence2 = sentence2.replace('used', 'use').replace('lemmatization', 'lemma').replace('removal', 'remove').replace('tokenization', 'token').replace('lowercasing', 'lowercase')

print(sentence1)
print(sentence2)

A computer can't really make sense of sentences (at least not before I tell you it does). So let's help it by breaking the sentences into words, called tokens.
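
To get a feel for what a tokenizer does, here is a rough standard-library sketch. It is not what `nltk.word_tokenize` does internally: `simple_tokenize` is a made-up helper that just grabs runs of letters, so it throws away punctuation and digits and would split a contraction like "don't" into two pieces.

```python
import re

def simple_tokenize(text):
    """Lowercase the text and return each run of letters as a token."""
    return re.findall(r"[a-z]+", text.lower())

demo_tokens = simple_tokenize("This is a Sentence. It has 2 parts!")
print(demo_tokens)
# ['this', 'is', 'a', 'sentence', 'it', 'has', 'parts']
```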

In [None]:
# @title
tokens1 = nltk.word_tokenize(sentence1)
tokens2 = nltk.word_tokenize(sentence2)

print(tokens1)
print(tokens2)

In [None]:
tokens1 = [t for t in tokens1 if t.isalpha()]
tokens2 = [t for t in tokens2 if t.isalpha()]

print(tokens1)
print(tokens2)

Another quick question: Are there words that are not as important, or that wouldn't make useful topics?

In [None]:
# @title
# remove all the stop words
stop_words = ['i', 'is', 'am', 'are', 'will', 'and', 'be', 'a', 'to', 'of', 'how', 'this']
tokens1 = [t for t in tokens1 if t not in stop_words]
tokens2 = [t for t in tokens2 if t not in stop_words]

print(tokens1)
print(tokens2)

Many techniques measure how similar two documents are by comparing token lists like these. Let's apply them to our collected data!
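
One of the simplest such comparisons is Jaccard similarity: the number of shared words divided by the number of distinct words across both documents. The sketch below is only an illustration (the rest of this notebook uses topic models instead), and the two token lists are made up.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Shared distinct tokens divided by all distinct tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 0.0  # two empty documents: define the similarity as 0
    return len(set_a & set_b) / len(set_a | set_b)

doc_a = ["sentence", "lowercase", "lemma", "token"]
doc_b = ["sentence", "lemma", "token", "work"]

# 3 shared words out of 5 distinct words
similarity = jaccard_similarity(doc_a, doc_b)
print(similarity)
```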

### Cleaning our dataset

In [None]:
# @title
import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
import string
import re
from gensim import corpora
from gensim.models import LdaModel


def preprocess(text):
    text = text.lower()  # Lowercase
    text = re.sub(r'\s+', ' ', text)  # Remove extra spaces
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = ' '.join([word for word in text.split() if word not in ENGLISH_STOP_WORDS])  # Remove stopwords
    return text

# Assuming df_posts is your DataFrame; each document is the post title followed by its body text
texts = df_posts['title'] + '. ' + df_posts['selftext']

# texts = df_posts['title']
# texts = df_posts['selftext']

# Apply preprocessing to each document
processed_texts = [preprocess(text) for text in texts]

# Tokenize the documents
tokenized_texts = [text.split() for text in processed_texts]

# Create a Gensim dictionary and corpus
dictionary = corpora.Dictionary(tokenized_texts)
corpus = [dictionary.doc2bow(text) for text in tokenized_texts]
print("Our data is cleaned")

In [None]:
# This is what the entries look like after cleaning.
for doc_token in tokenized_texts[:10]:
  print(doc_token)


# Part 3: Can I see some graphs?
Extracting topics!
Try changing the number of topics and passes to see how the results change.
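
The LDA model does not read sentences. It reads `corpus`, where every document is a list of (word id, count) pairs, also called a bag of words. Here is a standard-library sketch of that idea. `build_vocab` and `to_bow` are made-up helpers, the toy documents are invented, and Gensim may assign different ids than this sketch does.

```python
from collections import Counter

def build_vocab(tokenized_docs):
    """Give every distinct token an integer id, in order of first appearance."""
    vocab = {}
    for doc in tokenized_docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Turn one tokenized document into sorted (token id, count) pairs."""
    counts = Counter(vocab[token] for token in doc if token in vocab)
    return sorted(counts.items())

toy_docs = [["vote", "tax", "vote"], ["tax", "court"]]
toy_vocab = build_vocab(toy_docs)
toy_corpus = [to_bow(doc, toy_vocab) for doc in toy_docs]

print(toy_vocab)
print(toy_corpus)
```

Notice that word order is gone after this step: only the counts survive.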

In [None]:
# Train LDA model
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=3, passes=20)

# Print the topics
topics = lda_model.print_topics(num_words=5)
for topic in topics:
    print(topic)

In [None]:
# @title
def get_top_n_words(lda_model, dictionary, n_top_words=10):
    all_top_words = set()

    for topic in lda_model.get_topics():
        top_feature_ids = topic.argsort()[-n_top_words:][::-1]
        top_words = [dictionary[id] for id in top_feature_ids]
        all_top_words.update(top_words)

    return all_top_words

def plot_top_words(lda_model, dictionary, n_top_words, n_top_topics):

    # Never ask for more subplots than the model has topics
    num_topics = len(lda_model.get_topics())
    n_top_topics = min(n_top_topics, num_topics)

    top_n_words = get_top_n_words(lda_model, dictionary, n_top_words=n_top_words)
    # Define a color palette with 30 unique colors
    palette = [
      "#1f77b4", "#aec7e8", "#ff7f0e", "#ffbb78", "#2ca02c",
      "#98df8a", "#d62728", "#ff9896", "#9467bd", "#c5b0d5",
      "#8c564b", "#c49c94", "#e377c2", "#f7b6d2", "#7f7f7f",
      "#c7c7c7", "#bcbd22", "#dbdb8d", "#17becf", "#9edae5",
      "#7fc97f", "#beaed4", "#fdc086", "#ffff99", "#386cb0",
      "#f0027f", "#bf5b17", "#666666", "#1b9e77", "#d95f02"
    ]

    # Set the palette
    word_colors = dict(zip(top_n_words, sns.color_palette(palette, len(top_n_words))))

    # Create a single figure and multiple subplots (axes) arranged in one row
    fig, axes = plt.subplots(1, n_top_topics, figsize=(n_top_topics * 6, 8))
    axes = np.atleast_1d(axes)  # plt.subplots returns a bare Axes when there is only one subplot

    for topic_idx, topic in enumerate(lda_model.get_topics()[:n_top_topics]):
        top_feature_ids = topic.argsort()[-n_top_words:][::-1]
        top_words = [dictionary[id] for id in top_feature_ids]
        weights = topic[top_feature_ids]

        # Get consistent colors for words from the color map
        current_word_colors = [word_colors[word] for word in top_words]

        ax = axes[topic_idx]
        ax.barh(top_words, weights, color=current_word_colors)
        ax.invert_yaxis()
        ax.set_title(f'Topic {topic_idx + 1}', fontsize=24, fontweight='bold', pad=20)
        ax.set_xlabel('Word Probability', fontsize=18)
        ax.set_ylabel('Words', fontsize=18)
        ax.tick_params(axis='both', which='major', labelsize=14)
        ax.grid(True, which="both", ls="--", c='0.7')

    plt.tight_layout()
    plt.show()




### We have a list of extracted topics from our own collected data!
Wait... why do they all look the same, and why is 'trump' in most of them? Maybe my model was dumb! Or maybe it wasn't context-aware (try to recall the paper you read last week). Language context matters!

In [None]:
plot_top_words(lda_model, dictionary, n_top_words=7, n_top_topics=5)

## Let's try context-aware models to do the same thing and see what topics we get.
Ahmmm, wait. But what does context even mean? </br>

The following sentences have the same tokens, \[lets, eat, grandpa\], but their meanings are different. </br>

 - 'let's eat grandpa'
 - 'grandpa let's eat'

</br>

Newer machine learning models can capture this difference! Let's put them to the test.
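
A quick sketch of why a word-count model cannot tell the two sentences apart: once you only count words, both sentences produce the same bag, even though their order differs. The token lists are the ones from the example above.

```python
from collections import Counter

sentence_a = ["lets", "eat", "grandpa"]
sentence_b = ["grandpa", "lets", "eat"]

# Counting words ignores order, so the two bags are identical
same_bag = Counter(sentence_a) == Counter(sentence_b)

# Comparing the sequences directly shows that the order is different
same_order = sentence_a == sentence_b

print(same_bag, same_order)
```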

In [None]:
from bertopic import BERTopic

# Create an instance of BERTopic
topic_model = BERTopic(min_topic_size=4)

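# BERTopic expects a list of strings; the raw `texts` are passed in here
# (swap in `processed_texts` to compare the results)
docs = texts.tolist()

# Fit the model on your data to retrieve topics
topics, probs = topic_model.fit_transform(docs)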

Show me the top 4 topics that you have found and the 10 words that contribute the most to each topic.

In [None]:
topic_model.visualize_barchart(top_n_topics=4, n_words=10, width=450, height=400)

How far apart are the topics from each other?

In [None]:
topic_model.visualize_topics(top_n_topics=6)

## Everybody loves word clouds!

Let's see which words occur the most in our collected data.
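
Behind the picture, a word cloud is mostly a word count: the more often a word appears, the bigger it is drawn. Here is a standard-library sketch of that counting step. `top_words` is a made-up helper and the sample text is invented; the `WordCloud` class does its own tokenizing and stopword filtering on top of this.

```python
from collections import Counter

def top_words(text, n=3):
    """Return the n most frequent whitespace-separated words with their counts."""
    return Counter(text.lower().split()).most_common(n)

sample_text = "vote tax vote court vote tax"
most_common = top_words(sample_text)
print(most_common)
# [('vote', 3), ('tax', 2), ('court', 1)]
```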

In [None]:
# join all the posts to make one large paragraph
text_data = ' '.join(texts)
processed_text = ' '.join(processed_texts)

def generate_word_cloud(text):

    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(text)

    # Display the word cloud using matplotlib
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')  # Hide the axes
    plt.show()

In [None]:
generate_word_cloud(text_data)

In [None]:
generate_word_cloud(processed_text)

## What about sentiment analysis?
Let's calculate the sentiments and see some example posts 🙂

In [None]:
# @title
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')

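# Build the analyzer once instead of once per post
sia = SentimentIntensityAnalyzer()

def calculate_sentiment(text):
    # 'compound' is VADER's overall score, from -1 (most negative) to 1 (most positive)
    return sia.polarity_scores(text)['compound']

# Calculate sentiment; the '. ' separator keeps the last word of the title
# from fusing with the first word of the body
df_posts['sentiment'] = (df_posts['title'] + '. ' + df_posts['selftext']).apply(calculate_sentiment)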

In [None]:
# @title
df_posts['sentiment_label'] = pd.cut(
    df_posts['sentiment'],
    bins=[-1, -0.1, 0.1, 1],
    labels=['negative', 'neutral', 'positive'],
    include_lowest=True  # keep a score of exactly -1 in the 'negative' bin
)
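
The binning above can be read as a simple rule on the compound score. Here is a plain-Python sketch of the same thresholds. `label_sentiment` is a made-up helper and the scores are invented; like `pd.cut` with its default right-closed bins, a score sitting exactly on an edge goes to the lower bin.

```python
def label_sentiment(score, threshold=0.1):
    """Map a compound score in [-1, 1] to a sentiment label."""
    if score <= -threshold:
        return "negative"
    if score <= threshold:
        return "neutral"
    return "positive"

example_scores = [-0.8, -0.1, 0.0, 0.1, 0.35]
example_labels = [label_sentiment(s) for s in example_scores]
print(example_labels)
# ['negative', 'negative', 'neutral', 'neutral', 'positive']
```

The 0.1 cut-off is a judgment call; widening it makes more posts count as neutral.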


In [None]:
sns.countplot(x='sentiment_label', data=df_posts, palette='viridis')
plt.title('Sentiment Distribution')
plt.show()

In [None]:
df_posts

In [None]:
# Calculate the count of each sentiment label
sentiment_counts = df_posts['sentiment_label'].value_counts()

# Calculate the total count of sentiments
total_sentiments = len(df_posts)

# Calculate the percentage of each sentiment label
sentiment_percentage = (sentiment_counts / total_sentiments) * 100

# Display the percentage of each sentiment label
for sentiment, percentage in sentiment_percentage.items():
    print(f"The percentage of {sentiment} sentiments is {percentage:.2f}%")

# Set up the seaborn style and color palette
sns.set_style("whitegrid")
palette = sns.color_palette("viridis", n_colors=sentiment_counts.shape[0])

# Create a bar plot
plt.figure(figsize=(10, 6))
ax = sns.barplot(x=sentiment_percentage.index, y=sentiment_percentage.values, palette=palette)

# Title and labels
plt.title('Percentage Distribution of Sentiments', fontsize=20, fontweight='bold', pad=20)
plt.xlabel('Sentiment', fontsize=16, labelpad=10)
plt.ylabel('Percentage (%)', fontsize=16, labelpad=10)

# Beautify the axes and grid
plt.xticks(fontsize=12, rotation=45)
plt.yticks(fontsize=12)
sns.despine(left=True, bottom=True)  # Remove left and bottom spines

# Show the plot
plt.tight_layout()
plt.show()
