# Scrape Travel subreddits using PRAW 

Code implemented from:
- https://artificialcorner.com/how-to-easily-scrape-data-from-social-media-the-example-of-reddit-138d619edfa5
- https://towardsdatascience.com/scraping-reddit-data-1c0af3040768

Reddit API Documentation https://www.reddit.com/dev/api/#GET_hot

List of travel related subreddits https://www.reddit.com/r/travel/comments/1100hca/the_definitive_list_of_travel_subreddits_to_help/

### Rate limits

Rate limits with PRAW https://praw.readthedocs.io/en/latest/getting_started/ratelimits.html
- Usually several hundred requests per minute depending on endpoint

Rate limits with requests 
https://www.reddit.com/r/redditdev/comments/14nbw6g/updated_rate_limits_going_into_effect_over_the/
- 100 queries per minute per OAuth client id if you are using OAuth authentication 
- 10 queries per minute if you are not using OAuth authentication

https://www.reddit.com/r/redditdev/comments/151vty4/reddit_api/
- You get to make 100 API requests per minute if you're making an app. For scripts, the limit is still 60 per minute. One search API request can give you up to 100 results.

https://www.reddit.com/r/redditdev/comments/145liwv/api_changes_and_personal_oauth/
- Effective July 1, 2023, the rate for apps that require higher usage limits is $0.24 per 1K API calls (less than $1.00 per user per month for a typical Reddit third-party app).
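
The per-minute limits quoted above translate directly into a minimum pause between requests and, for the paid tier, an approximate cost. Below is a minimal sketch of that arithmetic. The function names are illustrative, and the figures are only the ones quoted in the linked posts, which may have changed since.

```python
def min_seconds_between_requests(requests_per_minute):
    """Smallest pause that keeps a steady stream under a per-minute limit."""
    return 60 / requests_per_minute


def estimated_cost_usd(api_calls, usd_per_1k_calls=0.24):
    """Cost at the quoted rate of $0.24 per 1,000 API calls."""
    return api_calls / 1000 * usd_per_1k_calls


# Limits quoted in the r/redditdev posts linked above
for label, limit in [("OAuth client", 100), ("script", 60), ("no OAuth", 10)]:
    pause = min_seconds_between_requests(limit)
    print(f"{label}: at least {pause:.1f} s between requests")

print(f"1,000,000 calls: about ${estimated_cost_usd(1_000_000):.2f}")
```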

Reddit API Rules (archived) https://github.com/reddit-archive/reddit/wiki/API#rules

# Libraries

In [1]:
import requests
import praw  # needed for the praw.exceptions reference further down
from praw import Reddit
import os
import pandas as pd
import datetime as dt
import time
from tqdm import tqdm
import joblib

## Sonia's Machine

In [2]:
import tensorflow as tf
print(tf.config.list_physical_devices('GPU'))

2023-11-12 18:59:12.476588: E tensorflow/compiler/xla/stream_executor/cuda/cuda_dnn.cc:9342] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-11-12 18:59:12.476612: E tensorflow/compiler/xla/stream_executor/cuda/cuda_fft.cc:609] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-11-12 18:59:12.476636: E tensorflow/compiler/xla/stream_executor/cuda/cuda_blas.cc:1518] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-11-12 18:59:12.482872: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.


[PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]


2023-11-12 18:59:15.480344: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-12 18:59:15.480579: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysfs-bus-pci#L344-L355
2023-11-12 18:59:15.532159: I tensorflow/compiler/xla/stream_executor/cuda/cuda_gpu_executor.cc:894] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero. See more at https://github.com/torvalds/linux/blob/v6.0/Documentation/ABI/testing/sysf

# Reddit Authentication

In [3]:
# public key identifier
my_client_ID = ''

# secret key (do not show)
my_secret = ''
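
Rather than pasting the keys into the notebook, they could be read from environment variables. This is only a sketch: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical and not part of the original setup.

```python
import os


def load_reddit_credentials(env=None):
    """Return (client_id, client_secret) from an environment-style mapping.

    REDDIT_CLIENT_ID and REDDIT_CLIENT_SECRET are hypothetical variable names.
    """
    env = os.environ if env is None else env
    names = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    # Collect every missing or empty variable so the error names all of them
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return env["REDDIT_CLIENT_ID"], env["REDDIT_CLIENT_SECRET"]
```

With both variables exported in the shell that starts Jupyter, `my_client_ID, my_secret = load_reddit_credentials()` would fill in the two values used by the cells below.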

In [4]:
# Authenticate Reddit App
auth = requests.auth.HTTPBasicAuth(my_client_ID, my_secret)

In [5]:
# read in pw from text file (strip the trailing newline most editors add)
with open('pw.txt', 'r') as f:
    pw = f.read().strip()

In [6]:
# Login - initialize a dict to specify we log in with a password
data = {
    'grant_type': 'password',
    # pass in username and pw
    'username': '',
    'password': pw
}

In [7]:
# User-Agent header identifying our client and its version
my_user_agent = {'User-Agent': 'TravelAPI/0.0.1'}
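
The `auth`, `data`, and User-Agent header prepared above are the ingredients of Reddit's password-grant OAuth flow when using `requests` directly instead of PRAW. A hedged sketch of how they might be combined is shown below. `fetch_access_token` and `bearer_headers` are hypothetical helpers, and the `post` parameter exists only so the function can be exercised without a network call.

```python
def fetch_access_token(client_id, client_secret, username, password,
                       user_agent, post=None):
    """Request an OAuth token using the password grant (sketch)."""
    if post is None:
        import requests  # imported lazily so a stand-in can be passed for testing
        post = requests.post
    response = post(
        "https://www.reddit.com/api/v1/access_token",
        auth=(client_id, client_secret),  # HTTP basic auth with the app keys
        data={"grant_type": "password",
              "username": username,
              "password": password},
        headers={"User-Agent": user_agent},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["access_token"]


def bearer_headers(token, user_agent):
    """Headers for later calls to https://oauth.reddit.com endpoints."""
    return {"Authorization": f"bearer {token}", "User-Agent": user_agent}
```

The rest of this notebook goes through PRAW, which handles the token exchange itself, so this path is only needed when calling the API with `requests`.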

In [8]:
# Authorized instance
# PRAW expects user_agent as a string, so pass the header value rather than the dict
reddit_authorized = Reddit(client_id=my_client_ID,
                           client_secret=my_secret,
                           user_agent=my_user_agent['User-Agent'],
                           username="",
                           password=pw)

### Dev - identifying subreddit attributes

In [9]:
# .hot() returns a lazy generator; use next() to access one post at a time
# limit to 100 for now
travel_subreddit = reddit_authorized.subreddit('travel').hot(limit=100)
print(travel_subreddit)

<praw.models.listing.generator.ListingGenerator object at 0x7f7022bea530>


In [9]:
next_reddit = next(travel_subreddit)
print(type(next_reddit))

<class 'praw.models.reddit.submission.Submission'>


In [10]:
all_attributes = dir(next_reddit) 

# Helper function to print all the attributes
def print_attributes_in_table(data, columns):
    for i in range(0, len(data), columns):
        print(',\t'.join(data[i:i+columns]))

# Run the function
print_attributes_in_table(all_attributes, 5)

STR_FIELD,	__class__,	__delattr__,	__dict__,	__dir__
__doc__,	__eq__,	__format__,	__ge__,	__getattr__
__getattribute__,	__gt__,	__hash__,	__init__,	__init_subclass__
__le__,	__lt__,	__module__,	__ne__,	__new__
__reduce__,	__reduce_ex__,	__repr__,	__setattr__,	__sizeof__
__str__,	__subclasshook__,	__weakref__,	_additional_fetch_params,	_chunk
_comments_by_id,	_edit_experimental,	_fetch,	_fetch_data,	_fetch_info
_fetched,	_kind,	_reddit,	_replace_richtext_links,	_reset_attributes
_safely_add_arguments,	_url_parts,	_vote,	add_fetch_param,	all_awardings
allow_live_comments,	approved_at_utc,	approved_by,	archived,	author
author_flair_background_color,	author_flair_css_class,	author_flair_richtext,	author_flair_template_id,	author_flair_text
author_flair_text_color,	author_flair_type,	author_fullname,	author_is_blocked,	author_patreon_flair
author_premium,	award,	awarders,	banned_at_utc,	banned_by
can_gild,	can_mod_post,	category,	clear_vote,	clicked
comment_limit,	comment_sort,	comments,	co

## Define Functions

In [22]:
def extract_comments_from_forest(submission):

    all_comments = []

    # Drop the unexpanded "load more comments" placeholders without fetching them
    submission.comments.replace_more(limit=0)
    # Flatten the comment tree into a single list of comments
    comments = submission.comments.list()

    for comment in comments:
        all_comments.append(comment.body)

    return all_comments
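
To make "flattening the comment forest" concrete, here is a small sketch that walks a nested reply tree level by level and collects the bodies. Plain dicts stand in for PRAW comment objects, and `flatten_comment_tree` is a hypothetical helper. It illustrates the idea only and is not PRAW's own implementation.

```python
from collections import deque


def flatten_comment_tree(top_level_comments):
    """Return every comment body, visiting the tree level by level."""
    bodies = []
    queue = deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        bodies.append(comment["body"])
        # Replies are queued behind comments already waiting at this level
        queue.extend(comment.get("replies", []))
    return bodies


# Toy thread: two top-level comments, one with a nested reply chain
thread = [
    {"body": "A", "replies": [{"body": "A1", "replies": [{"body": "A1a"}]}]},
    {"body": "B", "replies": [{"body": "B1"}]},
]
print(flatten_comment_tree(thread))
```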

In [10]:
import prawcore  # installed alongside PRAW; its exceptions expose the HTTP response

def extract_top_N_posts(topic_of_interest, N = 100, sleep_time = 60):

    topic_of_interest = topic_of_interest.replace(' ', '')
    final_list_of_dict = []

    # .top() only builds a lazy ListingGenerator; the HTTP requests are sent
    # while iterating below, so that is where API errors can be raised
    submissions = reddit_authorized.subreddit(topic_of_interest).top(time_filter='year', limit=N)
    # Earlier notes:
    # submissions = reddit_authorized.subreddit(topic_of_interest).top(time_filter ='day', limit=N)
    # https://www.reddit.com/dev/api/#GET_hot
    # limit for top is 100, top 100 per day, removing duplicates
    # decided to do hot daily, limit 1000

    try:
        for submission in submissions:
            # 11 columns per post
            dict_result = {
                "title": submission.title,
                "selftext": submission.selftext,
                "creation_date": dt.datetime.fromtimestamp(submission.created),
                "id": submission.id,
                "url": submission.url,
                "upvote_ratio": submission.upvote_ratio,
                "ups": submission.ups,
                "downs": submission.downs,
                "score": submission.score,
                "link_flair_css_class": submission.link_flair_css_class,
                "comments": extract_comments_from_forest(submission),
            }
            final_list_of_dict.append(dict_result)
    except prawcore.exceptions.ResponseException as e:
        if e.response.status_code == 429:
            # Rate limit exceeded: wait before handing the error to the caller
            retry_after = float(e.response.headers.get('Retry-After', sleep_time))
            print(f"Rate limit exceeded. Waiting for {retry_after} seconds...")
            time.sleep(retry_after)
        # Re-raise so the calling loop can log the failure (403, 404, 429, ...)
        raise

    # Pause between subreddits to stay under the rate limit
    time.sleep(sleep_time)

    # Create the dataframe
    df = pd.DataFrame(final_list_of_dict)

    return df
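
Transient failures such as a rate-limit response can be handled by retrying the whole call after a pause. The sketch below is one generic way to do that. `call_with_retries` is a hypothetical helper, and the `sleep` parameter is there only so the waiting can be replaced in tests.

```python
import time


def call_with_retries(func, max_attempts=3, wait_seconds=60, sleep=time.sleep):
    """Call func(); on failure wait and try again, re-raising after the last attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception as error:
            if attempt == max_attempts:
                raise
            print(f"Attempt {attempt} failed ({error}); waiting {wait_seconds} s")
            sleep(wait_seconds)
```

For example, `call_with_retries(lambda: extract_top_N_posts('travel', 5, 2))` would retry the extraction up to three times. Retrying does not help with permanent errors such as 403 or 404, so in practice the `except` clause would likely be narrowed to the errors worth retrying.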

### Dev - test function

In [51]:
# Took a few minutes to extract posts from travel subreddit
# ~10 mins to call less than 1000 posts
travel_reddits_df = extract_top_N_posts('travel', 5)

In [None]:
# shape is (rows, columns), e.g. (100, 11) means 100 rows and 11 columns
print(travel_reddits_df.shape)

In [53]:
display(travel_reddits_df.head())

Unnamed: 0,title,selftext,creation_date,id,url,upvote_ratio,ups,downs,score,link_flair_css_class,comments
0,Passport Questions & Issues Megathread (2023),NOTE: October 2023 **If the US Government has ...,2023-01-01 12:56:19,100t75r,https://www.reddit.com/r/travel/comments/100t7...,0.99,542,0,542,question,[SPRING BREAK RUSH HAS STARTED. AS OF TODAY PR...
1,"U.S. Department of State - ""Worldwide Caution""",U.S. Department of State issued a new travel a...,2023-10-19 10:41:36,17bouw5,https://www.reddit.com/r/travel/comments/17bou...,0.94,737,0,737,advice,[Yes. They are routinely issued when there is ...
2,We need to be more supportive of each other IR...,I’ve been traveling solo for a little bit and ...,2023-11-12 07:05:56,17tm34k,https://www.reddit.com/r/travel/comments/17tm3...,0.88,177,0,177,advice,"[I get tired of the ""I'm a traveller not a tou..."
3,What is a place people overlook because it isn...,"For me, it is the Akshardham [Temple](https://...",2023-11-11 19:11:22,17tb91s,https://www.reddit.com/r/travel/comments/17tb9...,0.94,492,0,492,advice,"[Not sure it's truly overlooked, but Herculane..."
4,"I haven't flew since before 9/11, have some qu...","I'm going to have to take a plane soon, and it...",2023-11-12 04:14:01,17tj0yw,https://www.reddit.com/r/travel/comments/17tj0...,0.65,80,0,80,question,[The volume of liquid is limited because in 20...


In [16]:
display(travel_reddits_df.comments)

0     [SPRING BREAK RUSH HAS STARTED. AS OF TODAY PR...
1     [Yes. They are routinely issued when there is ...
2     [I get tired of the "I'm a traveller not a tou...
3     [Not sure it's truly overlooked, but Herculane...
4     [Waiters in Paris weren’t rude at all. People ...
                            ...                        
95    [it's basically a translation sheet with a sta...
96    [You’re right, they are certainly exaggerating...
97    [> Also, I purchased an insurance while in the...
98    [**Notice:** Are you asking about a layover or...
99    [No.  There's nothing to see just outside the ...
Name: comments, Length: 100, dtype: object

In [17]:
len(travel_reddits_df.comments)

100

# Output Directory & Set Parameters

In [11]:
source = 'reddit'
output_dir = os.path.join(os.getcwd(), f'{source}_output')
output_dir

'/media/joeymeyer/970-evo-plus/Sonia/bertproj/reddit/reddit_output'

In [12]:
# Create the output directory if it does not exist yet
os.makedirs(output_dir, exist_ok=True)

In [37]:
subreddits = pd.read_csv('travel_subreddits.csv')['travel_subreddits'].to_list()

In [38]:
sleep_time = 2 # seconds; 100 requests per minute would allow 0.6 seconds between requests
days_to_run = 1

sub_N = 1000
country_N = 50
city_N = 20

# Rows 0-25 are general travel subreddits, rows 27-55 countries, rows 57 onward cities
# (rows 26 and 56 are not assigned to any group)
countries = subreddits[27:56]
cities = subreddits[57:]
subreddits = subreddits[0:26]

total_requests = len(countries)*country_N + len(cities)*city_N + len(subreddits)*sub_N
print(round(total_requests*sleep_time/60/60,1)) #approx time
print(total_requests)

16.9
30330
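
The estimate above can be reproduced step by step. The sketch below uses the group sizes implied by the slices and progress bars in this notebook (26 general subreddits, 29 countries, 144 cities). `estimate_run` is a hypothetical helper. Counting one request per post is a rough model, since the real number of HTTP requests depends on how listings are paged and how comments are fetched.

```python
def estimate_run(group_sizes, sleep_time):
    """group_sizes: list of (number of subreddits, posts per subreddit)."""
    total_requests = sum(count * posts for count, posts in group_sizes)
    hours = total_requests * sleep_time / 60 / 60
    return total_requests, round(hours, 1)


# 26 * 1000 + 29 * 50 + 144 * 20 posts, with 2 seconds budgeted per post
total, hours = estimate_run([(26, 1000), (29, 50), (144, 20)], sleep_time=2)
print(total, hours)
```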


In [24]:
# # wait until the other one finished, so as to not overload api request limit
# wait_time = int(3.4*60*60)

# # Create a tqdm progress bar
# with tqdm(total=wait_time, desc="Processing") as pbar:
#     for i in range(wait_time):
#         time.sleep(1) 
#         pbar.update(1)

# Loop through Subreddits

In [40]:
day_count = 1
list_of_df = []

# (label, subreddit names, posts per subreddit) for each group
groups = [
    ('subreddits', subreddits, sub_N),
    ('countries', countries, country_N),
    ('cities', cities, city_N),
]

while day_count <= days_to_run:

    today = dt.date.today()

    for label, names, n_posts in groups:
        for subreddit in tqdm(names, desc=f"Processing day {day_count}, {label}", ncols=100):
            try:
                df = extract_top_N_posts(subreddit, n_posts, sleep_time)
                df.insert(0, 'source', source)
                df.insert(1, 'filename', f'{subreddit}_top_{n_posts}_{today}')
                list_of_df.append(df)
                # Checkpoint after every subreddit so a crash does not lose progress
                joblib.dump(list_of_df, f'{output_dir}/safety_net.pkl')
                # list_of_df = joblib.load(f'{output_dir}/safety_net.pkl')
            except Exception as e:
                print(f"An error occurred for {subreddit}: {e}")

    day_count += 1

    # Concatenate DataFrames along rows (stack vertically)
    df = pd.concat(list_of_df, axis=0)
    # Reset the index of the concatenated DataFrame
    df.reset_index(drop=True, inplace=True)
    df.to_csv(f"{output_dir}/all_travel_top_{today}.csv")

    # time.sleep(24*60*60 - N*sleep_time)

Processing day 1, countries:  10%|███▌                               | 3/29 [02:52<19:29, 44.99s/it]

An error occurred for unitedstates: received 403 HTTP response


Processing day 1, countries:  52%|█████████████████▌                | 15/29 [19:20<13:57, 59.81s/it]

An error occurred for russia: received 403 HTTP response


Processing day 1, countries: 100%|██████████████████████████████████| 29/29 [37:27<00:00, 77.51s/it]
Processing day 1, cities:  16%|█████▊                              | 23/144 [07:22<19:26,  9.64s/it]

An error occurred for medina: received 404 HTTP response


Processing day 1, cities:  24%|████████▊                           | 35/144 [11:11<32:16, 17.77s/it]

An error occurred for milan: received 403 HTTP response


Processing day 1, cities:  28%|██████████                          | 40/144 [12:08<19:19, 11.15s/it]

An error occurred for cancún: received 404 HTTP response


Processing day 1, cities:  33%|███████████▊                        | 47/144 [14:16<19:09, 11.85s/it]

An error occurred for halong: received 403 HTTP response


Processing day 1, cities:  44%|███████████████▊                    | 63/144 [19:12<15:03, 11.15s/it]

An error occurred for lisbon: received 403 HTTP response


Processing day 1, cities:  44%|████████████████                    | 64/144 [19:14<11:16,  8.46s/it]

An error occurred for dammam: received 403 HTTP response


Processing day 1, cities:  45%|████████████████▎                   | 65/144 [19:17<08:38,  6.57s/it]

An error occurred for penangisland: received 404 HTTP response


Processing day 1, cities:  47%|█████████████████                   | 68/144 [19:33<07:17,  5.76s/it]

An error occurred for zhuhai: received 404 HTTP response


Processing day 1, cities:  57%|████████████████████▌               | 82/144 [24:34<11:40, 11.30s/it]

An error occurred for hurghada: received 404 HTTP response


Processing day 1, cities:  69%|████████████████████████▊           | 99/144 [28:02<10:42, 14.27s/it]

An error occurred for krabi: received 403 HTTP response


Processing day 1, cities:  75%|██████████████████████████▎        | 108/144 [31:35<12:34, 20.96s/it]

An error occurred for düsseldorf: received 404 HTTP response


Processing day 1, cities:  83%|█████████████████████████████▏     | 120/144 [35:55<04:31, 11.32s/it]

An error occurred for beirut: received 404 HTTP response


Processing day 1, cities:  89%|███████████████████████████████    | 128/144 [37:07<02:16,  8.52s/it]

An error occurred for montevideo: received 403 HTTP response


Processing day 1, cities:  93%|████████████████████████████████▌  | 134/144 [38:58<03:03, 18.34s/it]

An error occurred for accra: received 404 HTTP response


Processing day 1, cities:  98%|██████████████████████████████████▎| 141/144 [39:54<00:24,  8.24s/it]

An error occurred for palmademallorca: received 404 HTTP response


Processing day 1, cities: 100%|███████████████████████████████████| 144/144 [40:36<00:00, 16.92s/it]

An error occurred for laspalms: Redirect to /subreddits/search
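
Several of the 404 responses above are for names containing accented characters (`cancún`, `düsseldorf`). One possible pre-processing step is to strip the accents before querying. This is a sketch only: both helpers are hypothetical, the letters, digits, and underscores rule is an assumption about valid subreddit names, and there is no guarantee that a subreddit exists under the converted name.

```python
import re
import unicodedata


def to_ascii_subreddit_name(name):
    """Strip accents and spaces, for example 'cancún' becomes 'cancun'."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Dropping non-ASCII bytes removes the combining accent marks
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    return ascii_only.replace(" ", "")


def looks_like_subreddit_name(name):
    """Assumed rule: only ASCII letters, digits, and underscores."""
    return re.fullmatch(r"[A-Za-z0-9_]+", name) is not None


for raw in ["cancún", "düsseldorf", "lisbon"]:
    print(raw, "->", to_ascii_subreddit_name(raw), looks_like_subreddit_name(raw))
```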





In [41]:
df.shape

(3593, 13)

In [None]:
# list_of_df = joblib.load(f'{output_dir}/safety_net.pkl')
# print(len(list_of_df))
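
After reloading the checkpoint, an interrupted run could be resumed by skipping subreddits that already have results. The sketch below assumes the filenames follow the `{subreddit}_top_{N}_{date}` pattern used in the loop and that they have been collected from the `filename` column of each saved DataFrame. `pending_subreddits` is a hypothetical helper.

```python
def pending_subreddits(all_names, finished_filenames):
    """Return the names that do not yet appear in any finished filename."""
    # 'travel_top_1000_2023-11-12' -> 'travel'
    done = {filename.rsplit("_top_", 1)[0] for filename in finished_filenames}
    return [name for name in all_names if name not in done]


finished = ["travel_top_1000_2023-11-12", "japan_top_50_2023-11-12"]
print(pending_subreddits(["travel", "solotravel", "japan", "paris"], finished))
```

Subreddits that returned no posts leave no rows in the checkpoint, so they would be tried again on resume.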