# Step 1: Importing Necessary Libraries

In this initial step, we import the essential Python libraries required for our project. Each library serves a specific purpose, enabling us to work with Reddit's data and perform data manipulation tasks.

- **Pandas (pd)**: We use the `pandas` library to manipulate and analyze data. It's particularly well-suited for handling structured data in a tabular format, which aligns with our data analysis needs.

- **PRAW (Python Reddit API Wrapper)**: For accessing Reddit's API, we utilize the `praw` library. It provides a convenient interface for interacting with Reddit, allowing us to retrieve posts, comments, and perform various actions on the platform.

- **Time**: The `time` module is included to enable us to introduce controlled time delays within our code. These delays, implemented using `time.sleep()`, are essential for managing the rate at which we make requests to Reddit's API. This helps avoid overloading the API and encountering rate-limiting issues.

- **PRAW Exceptions**: The `praw.exceptions` module is imported so that we can handle errors raised by PRAW itself, such as API errors reported by Reddit. Note that HTTP-level failures, including the `429` responses seen in Step 6 and connection problems, are typically raised from the separate `prawcore.exceptions` module, so code that needs to catch them specifically should import from there as well.

In summary, this initial step equips us with the necessary tools to interact with Reddit's API, fetch data, and effectively process it for analysis. We're now prepared to use these libraries in subsequent steps to retrieve and analyze data from Reddit.


In [9]:
import pandas as pd
import praw
import time
import praw.exceptions
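
As a rough illustration of how `time.sleep()` spaces out requests, the sketch below calls a function once per item and pauses between calls. The helper name `paced_calls` and the stand-in `fake_fetch` are hypothetical and not part of the notebook; real code would call the Reddit API where `fake_fetch` is used.

```python
import time


def paced_calls(func, items, delay_seconds=2.0):
    """Call func(item) for each item, sleeping delay_seconds between calls."""
    results = []
    for index, item in enumerate(items):
        if index > 0:
            # Pause before every call except the first one.
            time.sleep(delay_seconds)
        results.append(func(item))
    return results


def fake_fetch(post_id):
    # Hypothetical stand-in for a network request.
    return f"fetched {post_id}"


# A short delay keeps the demonstration quick.
print(paced_calls(fake_fetch, ["a1", "b2", "c3"], delay_seconds=0.1))
```

Passing the delay as a parameter makes it easy to lengthen the pause if the API starts rejecting requests.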

# Step 2: Authenticating with the Reddit API

To access Reddit's data programmatically, we need to authenticate our Python application using the `praw` library. Here's what each part of the code does:

- `import praw`: The `praw` library, already imported in Step 1, is a Python wrapper for the Reddit API. It simplifies the process of interacting with Reddit's data.

- `reddit = praw.Reddit(...)`: We create a Reddit API client object named `reddit`. This object is used for making authenticated requests to the Reddit API. The constructor takes the following parameters:

   - `client_id`: This should be replaced with the unique identifier of your Reddit Developer Application, which you obtained when you created the application on the Reddit Developer Portal. It's used to identify your application when making API requests.

   - `client_secret`: Replace this with the secret generated when you created your Reddit Developer Application. Combined with the client ID, it lets your application authenticate with the Reddit API, so keep it private and avoid committing it to shared notebooks or repositories.

   - `user_agent`: A string that identifies your application and its purpose. Reddit's API guidelines recommend a unique, descriptive value in the form `<platform>:<app ID>:<version string> (by /u/<reddit username>)`. For personal projects, you can include your Reddit username or any other descriptive information.

With this authenticated `reddit` object, we can now access various Reddit data and perform operations like fetching posts, comments, and more, which will be an essential part of our project.

In [2]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',          # replace with your application's client ID
    client_secret='YOUR_CLIENT_SECRET',  # replace with your application's secret
    user_agent='Dry_Try8800',
)
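
Hard-coding credentials in a notebook makes them easy to leak when the notebook is shared. One common alternative, sketched here under assumptions, is to read them from environment variables. The variable names (`REDDIT_CLIENT_ID` and so on) and the helper `load_reddit_credentials` are hypothetical choices for illustration, not a PRAW convention.

```python
import os


def load_reddit_credentials(env=None):
    """Collect the three praw.Reddit arguments from environment variables.

    The variable names below are illustrative assumptions.
    """
    env = os.environ if env is None else env
    names = {
        "client_id": "REDDIT_CLIENT_ID",
        "client_secret": "REDDIT_CLIENT_SECRET",
        "user_agent": "REDDIT_USER_AGENT",
    }
    missing = [var for var in names.values() if not env.get(var)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {arg: env[var] for arg, var in names.items()}


# Usage (requires praw and real credentials, so it is left commented out):
# credentials = load_reddit_credentials()
# reddit = praw.Reddit(**credentials)
```

Failing early with a list of the missing variables tends to be easier to debug than an authentication error from the first API request.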

# Step 3: Data Retrieval and Data Processing

In this step, we utilize the Python Reddit API Wrapper (PRAW) library to collect data from Reddit and subsequently process it to create a structured DataFrame for in-depth analysis. Before executing the code, it's essential to ensure that the necessary libraries are installed and Reddit API credentials are correctly configured.

### Parameters

- **Subreddit**: By default, the code extracts posts from the 'all' subreddit. However, this setting can be adjusted to target a specific subreddit of your choice.

- **Total Posts to Retrieve**: The variable `total_posts_to_retrieve` is initially set to 10,000 in this example. You can customize it to suit your specific data collection requirements.

- **Time Filter**: The code restricts the search to posts from a given time frame through the `time_filter` variable, set to `'year'` here. PRAW's search accepts `'all'`, `'day'`, `'hour'`, `'month'`, `'week'`, and `'year'`, so you can change the value to focus on a different period.

- **Batch Size**: The `batch_size` variable (250 here) sets the maximum number of posts requested in each pass of the loop. Collecting the data in batches keeps individual requests small and makes it easier to stay within Reddit's API rate limits.

### Data Retrieval Loop

The data retrieval process unfolds in a sequence of steps:

1. Calculate the number of remaining posts to retrieve in the current batch.

2. Wait 2 seconds between API requests to reduce the risk of hitting rate limits.

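3. Issue a search request for `#ukrainewar`, passing a `before` value through `params` in an attempt to paginate. Reddit's listing endpoints expect `before` and `after` to be item fullnames (for example, `t3_` followed by a post ID), so the strings used here (`'0y'`, `'1y'`, and so on) may not advance through the results. This would explain why the same posts recur in the output and why Step 4 deduplicates them.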

4. Perform a check to determine if there are additional posts to retrieve. If the query returns no more results, the loop concludes.

5. Append the retrieved posts to the `all_posts` list, continuously building the dataset.

6. Keep track of counts to monitor the number of posts retrieved and the current batch being processed.
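
The batch arithmetic from the first item of the list above can be checked in isolation. This sketch only computes the sequence of request sizes and makes no API calls. The function name `plan_batches` is hypothetical, and the sketch assumes every request returns as many posts as were asked for.

```python
def plan_batches(total_posts_to_retrieve, batch_size):
    """Return the size of each request needed to reach the target count."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    sizes = []
    retrieved_posts = 0
    while retrieved_posts < total_posts_to_retrieve:
        remaining_posts = total_posts_to_retrieve - retrieved_posts
        # Never ask for more than one batch, or more than what is left.
        posts_to_retrieve = min(remaining_posts, batch_size)
        sizes.append(posts_to_retrieve)
        retrieved_posts += posts_to_retrieve
    return sizes


print(plan_batches(600, 250))
print(len(plan_batches(10000, 250)))
```

With the notebook's settings (10,000 posts in batches of 250), this works out to 40 requests if every batch comes back full.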

### Data Processing

After obtaining the data, we engage in a data processing phase to structure it into a Pandas DataFrame for subsequent analysis:

1. Convert each submission in `all_posts` to a dictionary of its attributes with `vars(post)`, and pass those dictionaries to `pd.DataFrame`.

2. Keep only the columns of interest:
   - `subreddit`, `title`, `selftext`, and `id` describe the post itself.
   - `author` and `author_fullname` identify the poster; these may be missing for deleted accounts.
   - `ups`, `upvote_ratio`, and `num_comments` measure engagement.
   - `created` and `created_utc` hold the creation time as a Unix timestamp.

3. Save the resulting DataFrame (`df`) to `ukrainewar_full.csv`. Within this DataFrame, each row corresponds to a Reddit post, and each column represents a distinct attribute.
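
A minimal sketch of per-post extraction with a guard for a missing author is shown below. It uses made-up stand-in objects rather than real PRAW submissions, and the helper name `post_to_row` and the sample values are hypothetical.

```python
from types import SimpleNamespace


def post_to_row(post):
    """Turn one post-like object into a flat dictionary of selected fields."""
    # getattr with a default avoids an AttributeError when author is absent.
    author = getattr(post, "author", None)
    return {
        "id": post.id,
        "title": post.title,
        "author": str(author) if author is not None else None,
        "ups": post.ups,
        "num_comments": post.num_comments,
    }


# Stand-in objects; real code would iterate over PRAW submissions instead.
sample_posts = [
    SimpleNamespace(id="abc123", title="First", author="someone", ups=10, num_comments=3),
    SimpleNamespace(id="def456", title="Second", author=None, ups=4, num_comments=0),
]

post_data = [post_to_row(post) for post in sample_posts]
print(post_data)
# A DataFrame could then be built with pd.DataFrame(post_data).
```

Selecting fields explicitly keeps the DataFrame small and avoids storing internal attributes of the submission objects.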

### Data Analysis

With the DataFrame (`df`) prepared, a multitude of analyses and insights can be derived from the Reddit data:

- Explore the distribution of upvotes (`ups`) and upvote ratios (`upvote_ratio`).
- Investigate trends related to the timing of post creation.
- Analyze the prevalence of specific words or hashtags within post titles or content.
- Unearth relationships between variables, such as the interplay between the number of comments and upvotes.

To visualize your findings, you can use Python libraries such as Matplotlib or Seaborn for charts, or NetworkX for network graphs, depending on the goals of your analysis.
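
As one example of a timing analysis, the sketch below counts posts per calendar month from Unix timestamps such as those in the `created_utc` column. The function name `posts_per_month` is hypothetical; the three sample values are rounded timestamps taken from the output table shown later in this step.

```python
from collections import Counter
from datetime import datetime, timezone


def posts_per_month(created_utc_values):
    """Count posts per calendar month (UTC) from Unix timestamps."""
    months = Counter()
    for timestamp in created_utc_values:
        moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        months[moment.strftime("%Y-%m")] += 1
    return dict(months)


# Rounded created_utc values from the first three rows of the output table.
print(posts_per_month([1.666899e9, 1.680237e9, 1.668659e9]))
```

Using an explicit UTC timezone keeps the month boundaries independent of the machine the analysis runs on.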


In [3]:
subreddit = reddit.subreddit('all')
total_posts_to_retrieve = 10000
time_filter = 'year'
batch_size = 250
all_posts = []
retrieved_posts = 0
current_batch = 0

while retrieved_posts < total_posts_to_retrieve:
    # Request at most batch_size posts, or fewer when close to the target.
    remaining_posts = total_posts_to_retrieve - retrieved_posts
    posts_to_retrieve = min(remaining_posts, batch_size)

    # Pause between requests, as described in the retrieval steps.
    time.sleep(2)

    ukraine_posts = list(subreddit.search(
        '#ukrainewar',
        limit=posts_to_retrieve,
        time_filter=time_filter,
        sort='top',
        params={'before': f'{current_batch}y'},
    ))

    # Stop when the search returns nothing more.
    if not ukraine_posts:
        break

    all_posts.extend(ukraine_posts)
    retrieved_posts += len(ukraine_posts)
    current_batch += 1

# Build a DataFrame from each submission's attributes and keep the columns of interest.
df = pd.DataFrame(vars(post) for post in all_posts)
df = df[['subreddit','selftext','author_fullname','title','upvote_ratio','ups','created','created_utc','num_comments','author','id']]
df.to_csv('ukrainewar_full.csv', index=True)

In [8]:
df

Unnamed: 0,subreddit,selftext,author_fullname,title,upvote_ratio,ups,created,created_utc,num_comments,author,id
0,NonCredibleDefense,,t2_4dfoopu9,Damn...we blinked and missed the T-34 stage of...,0.99,10398,1.666899e+09,1.666899e+09,738,doooompatrol,yf13ra
1,ukraine,,t2_3so708z2,Finnish🇫🇮 volunteer sends greetings home from ...,1.00,2100,1.680237e+09,1.680237e+09,57,Vivarevo,127a6ew
2,UkrainianConflict,,t2_p5qo3hyf,Massive #HIMARS strikes 60km from the front in...,0.99,834,1.668659e+09,1.668659e+09,41,Orcasystems99,yxg7ub
3,europe,\nThis megathread is meant for discussion of t...,t2_ojkhp,War in Ukraine Megathread LIII,0.95,578,1.680552e+09,1.680552e+09,8232,ModeratorsOfEurope,12aw2q2
4,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread L,0.96,424,1.673995e+09,1.673995e+09,9524,ModeratorsOfEurope,10eps9y
...,...,...,...,...,...,...,...,...,...,...,...
9995,UkrainWarMonitor,,t2_7q9ji3dc,Gunfire erupted after Russian soldiers didn't ...,0.72,3,1.667099e+09,1.667099e+09,0,mazstocks,yh2ynz
9996,caps,Keep on losing and try to move up the draft lo...,t2_701pa,Lucky Guess - Game 80: vs NYI - Blunder for Be...,0.64,3,1.681129e+09,1.681129e+09,29,mdkss12,12hh318
9997,UkrainWarMonitor,,t2_7q9ji3dc,🔥 Drone armed with a powerful bomb destroys Ru...,1.00,4,1.678322e+09,1.678322e+09,0,mazstocks,11mdcf7
9998,u_liberty_ukraine,\n\nLiberty Ukraine in Action! Thank you for ...,t2_a2tmiv6l,Liberty Ukraine in Action!,1.00,4,1.677530e+09,1.677530e+09,0,liberty_ukraine,11dlwx9


# Step 4: Data Processing and Deduplication

After retrieving a substantial number of posts from Reddit, the next crucial step involves processing and deduplicating the data. This process ensures that your dataset remains structured and free from redundant entries. The code for this step is summarized as follows:

1. **DataFrame Copy**: The DataFrame `unique_posts` is created as a copy of the original DataFrame (`df`). This copy is used for subsequent operations to preserve the integrity of the original data.

2. **Sorting by Number of Comments**: The `unique_posts` DataFrame is sorted in descending order based on the number of comments each post has received. This operation allows you to prioritize posts with the highest engagement.

3. **Deduplication**: To eliminate duplicate posts, the `drop_duplicates()` method is applied to the `unique_posts` DataFrame with `subset='id'`. By default it keeps the first row for each `id`, which after the sort is the copy with the highest comment count, and removes the rest.

The resulting `unique_posts` DataFrame is now optimized for further analysis, containing distinct posts ordered by their level of engagement (number of comments).
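
The combined effect of the sort and the deduplication can be spelled out in plain Python. The sketch below is not the pandas implementation; it mirrors the same keep-first logic on a list of dictionaries, with a hypothetical function name and made-up sample rows.

```python
def keep_most_commented(rows):
    """Sort rows by num_comments (descending), then keep the first row per id."""
    ordered = sorted(rows, key=lambda row: row["num_comments"], reverse=True)
    seen_ids = set()
    unique_rows = []
    for row in ordered:
        if row["id"] not in seen_ids:
            seen_ids.add(row["id"])
            unique_rows.append(row)
    return unique_rows


rows = [
    {"id": "p1", "num_comments": 5},
    {"id": "p2", "num_comments": 9},
    {"id": "p1", "num_comments": 7},  # later snapshot of the same post
]
print(keep_most_commented(rows))
```

Because the rows are sorted first, the copy of `p1` that survives is the one with 7 comments rather than the one with 5.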


In [4]:
unique_posts = df.copy()
unique_posts.sort_values(by='num_comments',ascending=False,inplace=True)
unique_posts.drop_duplicates(subset='id',inplace=True)
unique_posts

Unnamed: 0,subreddit,selftext,author_fullname,title,upvote_ratio,ups,created,created_utc,num_comments,author,id
3860,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread L,0.96,426,1.673995e+09,1.673995e+09,9524,ModeratorsOfEurope,10eps9y
3141,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread XLIX,0.97,342,1.670887e+09,1.670887e+09,8921,ModeratorsOfEurope,zkf1p4
2661,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread XLVII,0.94,269,1.667165e+09,1.667165e+09,8439,ModeratorsOfEurope,yhqh49
246,europe,"This is a special megathread. **One year ago, ...",t2_ojkhp,War in Ukraine Megathread LII,0.97,406,1.677156e+09,1.677156e+09,8276,ModeratorsOfEurope,119wltg
1208,europe,\nThis megathread is meant for discussion of t...,t2_ojkhp,War in Ukraine Megathread LIII,0.95,583,1.680552e+09,1.680552e+09,8232,ModeratorsOfEurope,12aw2q2
...,...,...,...,...,...,...,...,...,...,...,...
7027,UkrainWarMonitor,,t2_7q9ji3dc,Ukraine war footage: Intense combat take over ...,0.88,11,1.678719e+09,1.678719e+09,0,mazstocks,11qbqqt
7024,UkraineConflict,,t2_9dwibqf3,Gen. Ben Hodges Ukrainians 🇺🇦 are shockingly g...,1.00,15,1.678891e+09,1.678891e+09,0,Aminokef,11ryhx9
7042,UkrainWarMonitor,,t2_7q9ji3dc,Close combat footage from Ukraine - battle wit...,0.86,9,1.678209e+09,1.678209e+09,0,mazstocks,11l4y9a
7016,Wallstreetsilver,,t2_79blzkek,#wef#bank#russia#ukraine#ukrainewar#petrol#bar...,0.84,29,1.670960e+09,1.670960e+09,0,Ok_Entertainer_6860,zl4s68


# Step 5: Extracting Comments for Reddit Posts

This code snippet outlines the process of extracting comments related to Reddit posts. It utilizes the Python Reddit API Wrapper (PRAW) library and the information obtained from the previous steps. The explanation of the code is as follows:

1. **Empty List Creation**: An empty list named `comments_list` is initialized. This list will be used to store extracted comment data.

2. **Comment Extraction Function**: The code defines a function called `extract_comments(post_id)`. This function takes a post's unique identifier (`post_id`) as a parameter.

3. **Reddit Submission Retrieval**: Within the function, a Reddit post submission is retrieved using the provided `post_id`. This submission object represents the specific post on Reddit.

4. **Replacing More Comments**: Reddit hides parts of large threads behind "load more comments" placeholders, which PRAW represents as `MoreComments` objects. Calling `submission.comments.replace_more(limit=None)` replaces every placeholder with the comments it stands for, so the full comment tree becomes available. Each replacement costs an additional API request, so this call can be slow on very large threads.

5. **Comment Data Extraction**: The code then iterates through the comments using a list comprehension. For each comment, it extracts three pieces of information: the `comment_id` (unique comment identifier), `comment_body` (the text content of the comment), and `post_id` (the identifier of the post to which the comment is related). These details are stored in a dictionary.

6. **Appending to `comments_list`**: The extracted comment data is appended to the `comments_list` as a dictionary. This process ensures that comments from multiple posts are gathered in a single list for later analysis.

By using this code, you can collect comments related to Reddit posts, making them available for further processing and analysis. Each comment is associated with its parent post, which can be valuable for understanding user engagement and sentiment.


In [5]:
comments_list = []

def extract_comments(post_id):
    submission = reddit.submission(id=post_id)
    submission.comments.replace_more(limit=None)
    comments_list.extend([
        {'comment_id': comment.id, 'comment_body': comment.body, 'post_id': post_id}
        for comment in submission.comments.list()
    ])
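
`submission.comments.list()` returns the comment tree as a flat list. To make that idea concrete, the sketch below flattens a nested structure breadth-first. The dictionaries are made-up stand-ins for PRAW comment objects, the function name `flatten_comments` is hypothetical, and the traversal order is an assumption for illustration rather than a statement about PRAW's internals.

```python
from collections import deque


def flatten_comments(top_level_comments):
    """Return every comment in a nested tree as one flat list (breadth-first)."""
    flat = []
    queue = deque(top_level_comments)
    while queue:
        comment = queue.popleft()
        flat.append(comment)
        # Replies are visited after all comments already waiting in the queue.
        queue.extend(comment.get("replies", []))
    return flat


tree = [
    {"id": "c1", "body": "top-level", "replies": [
        {"id": "c2", "body": "reply", "replies": [
            {"id": "c4", "body": "nested reply", "replies": []},
        ]},
    ]},
    {"id": "c3", "body": "another top-level", "replies": []},
]
print([comment["id"] for comment in flatten_comments(tree)])
```

Once the tree is flat, each comment can be turned into a row with its `comment_id`, `comment_body`, and `post_id`, as the function above does.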


# Step 6: Extracting Comments from Reddit Posts

In this code block, we see the process of extracting comments from Reddit posts. The code is designed to handle potential rate limiting by Reddit's API while collecting the comments. Here's a step-by-step explanation of the code:

1. **Iteration Through Posts**: The code initializes a counter `i` to zero, then iterates through the post identifiers in the `id` column of `unique_posts`. The current value of `i` is printed at the start of each iteration.

2. **Failure Limit**: `i` counts failed attempts, not posts. As long as fewer than four attempts have failed (`i < 4`), the loop keeps processing posts; once `i` reaches 4, it stops.

3. **Try-Except Block**: Inside the loop, a try-except block calls `extract_comments` for the current post. Any exception, including an HTTP 429 rate-limit response, sends execution to the except block.

4. **Failure Handling**: In the except block, `i` is incremented by 1, a message reporting the error is printed, and the code pauses for 7 seconds with `time.sleep(7)` to ease the load on the Reddit API. Because every exception type is caught, the "Rate limit exceeded" message also appears for errors unrelated to rate limiting.

5. **Continuing the Loop**: After the pause, the loop moves on to the next post. The failed post is skipped rather than retried, so its comments are missing from the result.

6. **Creating a DataFrame**: After the loop completes, a DataFrame named `comments_df` is created. This DataFrame is built from the `comments_list`, which accumulates comments from multiple posts. Each row in the DataFrame represents a comment, including the `comment_id`, `comment_body`, and `post_id`.

7. **Saving to CSV**: Finally, the DataFrame is saved to a CSV file named 'comments_data.csv' without an index column.

By using this code, you can extract comments from multiple Reddit posts while pausing briefly after each failure. Because collection stops after four failures, the resulting file may cover only a small share of the posts in `unique_posts`.


In [6]:
i = 0
for post_id in unique_posts['id']:
    print(i)

    if i < 4: 
        try:
            extract_comments(post_id)
        except Exception as error:
            i += 1
            print(f"Rate limit exceeded. Waiting for a moment. Error: {error}")
            time.sleep(7)
            continue
    else:
        break

comments_df = pd.DataFrame(comments_list)

# Save the DataFrame to a CSV file
comments_df.to_csv('comments_data.csv', index=False)


0
0
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
1
1
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
2
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
3
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
4
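
The loop above skips a post once it fails. A possible variation, sketched here under assumptions, retries the same call with a growing delay before giving up. The names `call_with_retries` and `flaky_extract` are hypothetical, and the flaky function only imitates the 429 errors seen in the output; nothing here talks to Reddit.

```python
import time


def call_with_retries(func, argument, max_attempts=4, base_delay=7.0, sleep=time.sleep):
    """Call func(argument), retrying after a growing delay when it raises."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(argument)
        except Exception as error:
            if attempt == max_attempts:
                # Out of attempts: let the caller see the last error.
                raise
            delay = base_delay * attempt
            print(f"Attempt {attempt} failed ({error}); retrying in {delay} s")
            sleep(delay)


# Hypothetical flaky function: fails twice, then succeeds.
attempts = {"count": 0}


def flaky_extract(post_id):
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("received 429 HTTP response")
    return f"comments for {post_id}"


# sleep is replaced with a no-op so the demonstration runs instantly.
print(call_with_retries(flaky_extract, "10eps9y", sleep=lambda seconds: None))
```

Catching `Exception` mirrors the notebook's loop; narrowing it to the specific rate-limit exception would avoid retrying on unrelated errors.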


In [7]:
comments_df

Unnamed: 0,comment_id,comment_body,post_id
0,j4spbvz,"Back in December, we asked for some feedback o...",10eps9y
1,j5t53d7,It is funny to read the pro-Russian far-right ...,10eps9y
2,j509t84,> [Denmark donates artillery to Ukraine](https...,10eps9y
3,j60rk9u,[Auschwitz museum: Russia not invited to event...,10eps9y
4,j4sbv6e,...hoping the L will be the one Russia takes,10eps9y
...,...,...,...
18178,iuwtttw,You moved the definition of your own words to ...,yhqh49
18179,iuwv2om,"No, you misinterpreted my definition to mean s...",yhqh49
18180,iuwvosc,"Okay, you're clearly doing this intentionally.",yhqh49
18181,iuwwidt,I think you do.,yhqh49
