# Web Scraping Subreddits of Interest

## Introduction

In this project, we'll collect posts and comments from a subreddit of interest using PRAW, the Python Reddit API Wrapper. For each item we'll record the title, score, ID, URL, comment count, creation time, and body text.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Authenticate with the Reddit API
3. Retrieve new posts and recent comments from the subreddit
4. Convert the collected data to a pandas DataFrame
5. Save the data to a CSV file
6. Reload the CSV file and inspect the result

## Prerequisites

Before starting this project, you should have some basic knowledge of Python programming and HTML structure. In addition, you may want to use the following packages in your Python environment:

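- requests
- BeautifulSoup (installed as `beautifulsoup4`, imported as `bs4`)
- csv (part of the Python standard library)
- datetime (part of the Python standard library)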

These packages should already be installed in Coursera's Jupyter Notebook environment. If you are working off platform, or need a package that the environment does not include, you can install it with `!pip install packagename` in a notebook cell, for example:

- `!pip install requests`
- `!pip install beautifulsoup4`
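
Before installing anything, it can help to check which modules are already importable. The following is a minimal sketch using only the standard library; the helper name `find_missing` is hypothetical, and the module list combines the packages named above with the ones this notebook imports later (`praw`, `pandas`, `tqdm`).

```python
import importlib.util

# Import names, not pip names: BeautifulSoup is imported as `bs4`.
MODULES = ["requests", "bs4", "csv", "datetime", "praw", "pandas", "tqdm"]


def find_missing(modules):
    """Return the module names that cannot be found in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


missing = find_missing(MODULES)
print("Missing modules:", missing or "none")
```

Any name reported as missing can then be installed with `!pip install`, keeping in mind that the pip name may differ from the import name.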

## Step 1: Importing Required Libraries

In [1]:
# your code here

In [5]:
pip install praw

Note: you may need to restart the kernel to use updated packages.




In [None]:
import os
import praw
import pandas as pd
import datetime as dt
from tqdm import tqdm
import time

In [4]:
def get_date(created):
    return dt.datetime.fromtimestamp(created)
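
`get_date` returns a naive datetime in the local time zone of the machine running the notebook, so the same timestamp can render differently on different machines. As a hedged alternative, the sketch below returns a timezone-aware UTC datetime instead; the name `get_date_utc` is hypothetical and is not used elsewhere in this notebook.

```python
import datetime as dt


def get_date_utc(created):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return dt.datetime.fromtimestamp(created, tz=dt.timezone.utc)


print(get_date_utc(1718712000).isoformat())  # prints 2024-06-18T12:00:00+00:00
```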

In [12]:
# Fill in the authentication details for your own Reddit app and account.
# Never share real credentials in a notebook that other people can read.
def reddit_connection():
    """Return a praw.Reddit instance authenticated with the details below."""
    reddit = praw.Reddit(
        client_id="YOUR_CLIENT_ID",
        client_secret="YOUR_CLIENT_SECRET",
        password="YOUR_PASSWORD",
        user_agent="testscript by u/YOUR_USERNAME",
        username="YOUR_USERNAME",
    )
    return reddit
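
One way to keep secrets out of the notebook is to read them from environment variables. The sketch below is an assumption-laden illustration rather than part of the original project: the variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical, and the function only builds the keyword arguments that `praw.Reddit` is called with above.

```python
import os

# Hypothetical variable names chosen for this sketch, mapped to the
# keyword arguments used in reddit_connection().
ENV_TO_KWARG = {
    "REDDIT_CLIENT_ID": "client_id",
    "REDDIT_CLIENT_SECRET": "client_secret",
    "REDDIT_PASSWORD": "password",
    "REDDIT_USER_AGENT": "user_agent",
    "REDDIT_USERNAME": "username",
}


def load_reddit_credentials(env=None):
    """Build praw.Reddit keyword arguments from a mapping of environment variables."""
    env = os.environ if env is None else env
    missing = [name for name in ENV_TO_KWARG if not env.get(name)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {kwarg: env[name] for name, kwarg in ENV_TO_KWARG.items()}


# Example with a stand-in mapping instead of the real environment.
example_env = {name: f"value-for-{name.lower()}" for name in ENV_TO_KWARG}
print(sorted(load_reddit_credentials(example_env)))
```

With real variables set, the connection could then be created as `praw.Reddit(**load_reddit_credentials())`.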

In [15]:
reddit = reddit_connection()
print(reddit.user.me())

TheBadAimGuy


In [16]:
def build_dataset(reddit, search_words='gameofthrones', items_limit=None):
    """Collect new posts and recent comments from a subreddit into a DataFrame.

    search_words is the subreddit name. items_limit=None asks PRAW for as
    many items as Reddit's listing returns.
    """
    subreddit = reddit.subreddit(search_words)
    new_subreddit = subreddit.new(limit=items_limit)
    topics_dict = {
        "title": [],
        "score": [],
        "id": [],
        "url": [],
        "comms_num": [],
        "created": [],
        "body": [],
    }

    print("retrieve new reddit posts ...")
    for submission in tqdm(new_subreddit):
        topics_dict["title"].append(submission.title)
        topics_dict["score"].append(submission.score)
        topics_dict["id"].append(submission.id)
        topics_dict["url"].append(submission.url)
        topics_dict["comms_num"].append(submission.num_comments)
        # created_utc is the creation time as a Unix timestamp in seconds
        topics_dict["created"].append(submission.created_utc)
        topics_dict["body"].append(submission.selftext)

    # Comments share the same columns; the literal title "Comment" marks them.
    # Use items_limit here too, so the limit applies to both listings.
    for comment in tqdm(subreddit.comments(limit=items_limit)):
        topics_dict["title"].append("Comment")
        topics_dict["score"].append(comment.score)
        topics_dict["id"].append(comment.id)
        topics_dict["url"].append("")
        topics_dict["comms_num"].append(0)
        topics_dict["created"].append(comment.created_utc)
        topics_dict["body"].append(comment.body)

    topics_df = pd.DataFrame(topics_dict)
    # This count includes both posts and comments.
    print(f"new reddit posts retrieved: {len(topics_df)}")
    topics_df['timestamp'] = topics_df['created'].apply(get_date)

    return topics_df
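
Posts and comments end up in the same columns, with comments padded by placeholder values. The sketch below isolates that row-shaping logic so it can be tried without a Reddit connection. The helpers `submission_to_row` and `comment_to_row` are hypothetical, and `SimpleNamespace` objects stand in for PRAW submissions and comments, carrying only the attributes read by `build_dataset` (with `created_utc` as the creation-time attribute).

```python
from types import SimpleNamespace


def submission_to_row(submission):
    """Map a submission-like object to one row of the dataset."""
    return {
        "title": submission.title,
        "score": submission.score,
        "id": submission.id,
        "url": submission.url,
        "comms_num": submission.num_comments,
        "created": submission.created_utc,
        "body": submission.selftext,
    }


def comment_to_row(comment):
    """Map a comment-like object to a row with the same columns as a post."""
    return {
        "title": "Comment",  # marker that distinguishes comments from posts
        "score": comment.score,
        "id": comment.id,
        "url": "",           # comments have no link of their own here
        "comms_num": 0,
        "created": comment.created_utc,
        "body": comment.body,
    }


# Stand-in objects with made-up values, for illustration only.
fake_post = SimpleNamespace(title="Example post", score=3, id="p1",
                            url="https://example.com", num_comments=1,
                            created_utc=1718712000.0, selftext="")
fake_comment = SimpleNamespace(score=1, id="c1", created_utc=1718712060.0,
                               body="Example comment")

rows = [submission_to_row(fake_post), comment_to_row(fake_comment)]
print(list(rows[0]) == list(rows[1]))  # both rows share the same columns
```

Keeping the shaping in small functions like these makes the column layout easy to test before spending API calls on a full run.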

In [17]:
def update_and_save_dataset(topics_df):
    """Append topics_df to reddit_GoT.csv, keeping the newest row per id."""
    file_path = "reddit_GoT.csv"
    if os.path.exists(file_path):
        # Merge with earlier runs so repeated executions grow the dataset.
        topics_old_df = pd.read_csv(file_path)
        print(f"past reddit posts: {topics_old_df.shape}")
        topics_all_df = pd.concat([topics_old_df, topics_df], axis=0)
        print(f"new reddit posts: {topics_df.shape[0]} past posts: {topics_old_df.shape[0]} all posts: {topics_all_df.shape[0]}")
        # keep='last' prefers the freshly retrieved copy of a duplicated id.
        topics_new_df = topics_all_df.drop_duplicates(subset=["id"], keep='last', inplace=False)
        print(f"all reddit posts: {topics_new_df.shape}")
        topics_new_df.to_csv(file_path, index=False)
    else:
        # First run: nothing to merge with, so write the data as-is.
        print(f"reddit posts: {topics_df.shape}")
        topics_df.to_csv(file_path, index=False)
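
The key step in `update_and_save_dataset` is deduplicating by `id` while keeping the most recent copy of each row. To make that behaviour easy to see in isolation, here is a small sketch of the same idea written with the standard-library `csv` module instead of pandas. It is a swapped-in technique for illustration, not the notebook's method, and the helper names and sample rows are hypothetical.

```python
import csv
import io


def merge_rows_by_id(old_rows, new_rows):
    """Merge row dicts, keeping only the last row seen for each id."""
    merged = {}
    for row in [*old_rows, *new_rows]:
        # Drop any earlier row with this id first, so the survivor sits at
        # the position of its last occurrence (as keep='last' does in pandas).
        merged.pop(row["id"], None)
        merged[row["id"]] = row
    return list(merged.values())


def rows_to_csv_text(rows, fieldnames):
    """Serialize row dicts to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up rows: id "a1" appears in both runs with a changed score.
old_rows = [{"id": "a1", "score": "1"}, {"id": "b2", "score": "2"}]
new_rows = [{"id": "a1", "score": "5"}]

merged_rows = merge_rows_by_id(old_rows, new_rows)
print(rows_to_csv_text(merged_rows, ["id", "score"]))
```

Here the row for `a1` survives once, with the newer score, which is why rerunning the scraper updates scores for items that were already saved.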

In [18]:
if __name__ == "__main__":
    reddit = reddit_connection()
    topics_data_df = build_dataset(reddit)
    update_and_save_dataset(topics_data_df)

0it [00:00, ?it/s]

retrieve new reddit posts ...


977it [00:11, 85.71it/s]
972it [00:06, 140.93it/s]

new reddit posts retrieved: 1949
reddit posts: (1949, 8)





In [19]:
df = pd.read_csv('reddit_GoT.csv')

In [20]:
df

| | title | score | id | url | comms_num | created | body | timestamp |
|---|---|---|---|---|---|---|---|---|
| 0 | Little step | 2 | 1diosnm | https://i.redd.it/uxvyfd0xjb7d1.jpeg | 1 | 1.718712e+09 | | 2024-06-18 11:53:22 |
| 1 | The Jon Snow vibes were so beautiful. Jace has... | 1 | 1dios0k | https://i.redd.it/dsawtrbtjb7d1.jpeg | 2 | 1.718712e+09 | | 2024-06-18 11:52:19 |
| 2 | Audio quality is terrible in a feast for crows... | 2 | 1dinck2 | https://www.reddit.com/r/gameofthrones/comment... | 2 | 1.718706e+09 | What happened to the audio? I’ve seen posts a... | 2024-06-18 10:26:23 |
| 3 | Lego idea | 1 | 1dimosd | https://www.reddit.com/r/gameofthrones/comment... | 1 | 1.718704e+09 | Does Leogs have a piece set for KingsLanding?I... | 2024-06-18 09:42:49 |
| 4 | My thoughts 💭 | 2 | 1dimfl3 | https://i.redd.it/zytmqonjta7d1.jpeg | 4 | 1.718703e+09 | This is who Sam Tarly would be if Samwell was ... | 2024-06-18 09:25:02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1944 | Comment | 1 | l922cap | | 0 | 1.718657e+09 | Yes. Everyone was perfectly cast. They all gam... | 2024-06-17 20:50:59 |
| 1945 | Comment | 2 | l922ba2 | | 0 | 1.718657e+09 | Here here! :) | 2024-06-17 20:50:49 |
| 1946 | Comment | 1 | l9227mz | | 0 | 1.718657e+09 | id add in theon or swap rhaenyra out for him | 2024-06-17 20:50:15 |
| 1947 | Comment | 2 | l9222p2 | | 0 | 1.718657e+09 | Because he wasn’t born a nobleman, he doesn’t ... | 2024-06-17 20:49:31 |
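
Because comments are stored with the literal title "Comment", the saved file can be split into posts and comments by that column. The sketch below counts each kind using the standard-library `csv` module on three sample rows copied from the display above. The helper `count_kinds` is hypothetical, and a post whose title is exactly "Comment" would be miscounted, so treat this as a rough check only.

```python
import csv
import io
from collections import Counter

# Three rows copied from the displayed data, in the saved CSV layout.
SAMPLE_CSV = """title,score,id,url,comms_num,created,body,timestamp
Little step,2,1diosnm,https://i.redd.it/uxvyfd0xjb7d1.jpeg,1,1.718712e+09,,2024-06-18 11:53:22
Comment,2,l922ba2,,0,1.718657e+09,Here here! :),2024-06-17 20:50:49
Comment,1,l9227mz,,0,1.718657e+09,id add in theon or swap rhaenyra out for him,2024-06-17 20:50:15
"""


def count_kinds(csv_text):
    """Count rows as 'comment' or 'post' based on the title marker."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(
        "comment" if row["title"] == "Comment" else "post" for row in reader
    )


kinds = count_kinds(SAMPLE_CSV)
print(kinds["post"], kinds["comment"])  # prints 1 2
```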
