# Step 1: Importing Necessary Libraries

In this initial step, we import the essential Python libraries required for our project. Each library serves a specific purpose, enabling us to work with Reddit's data and perform data manipulation tasks.

- **Pandas (pd)**: We use the `pandas` library to manipulate and analyze data. It's particularly well-suited for handling structured data in a tabular format, which aligns with our data analysis needs.

- **PRAW (Python Reddit API Wrapper)**: For accessing Reddit's API, we utilize the `praw` library. It provides a convenient interface for interacting with Reddit, allowing us to retrieve posts, comments, and perform various actions on the platform.

- **Time**: The `time` module provides `time.sleep()`, which we use to pause between requests. Pausing after a failed request gives Reddit's API time to recover and lowers the chance of running into its rate limits again.

- **PRAW Exceptions**: The `praw.exceptions` module defines PRAW's own exception classes, such as the errors Reddit's API returns for a rejected request. The comment-collection loop later in this notebook catches the generic `Exception`, so this import mainly makes the more specific classes available if you want narrower error handling.

In summary, this initial step equips us with the necessary tools to interact with Reddit's API, fetch data, and effectively process it for analysis. We're now prepared to use these libraries in subsequent steps to retrieve and analyze data from Reddit.


In [1]:
import pandas as pd
import praw
import time
import praw.exceptions

# Step 2: Create a Reddit Application and Obtain API Credentials

1. **Go to Reddit's App Preferences**:  
   [Click here to open Reddit App Preferences](https://www.reddit.com/prefs/apps).

2. **Create a New Application**:
   - Ensure you’re **logged in** to your Reddit account.
   - Scroll down to the section titled **"Developed Applications"**.
   - Click on the **"Create App" or "Create Another App"** button.

3. **Fill in Application Details**:
   - **Name**: Enter a name for your application, such as "Reddit Data Downloader".
   - **App Type**: Select **"script"** (for personal, non-distributed use).
   - **Description**: Write a short description (e.g., "An app to download Reddit data").
   - **About URL**: Leave this blank unless you have a website for your app.
   - **Redirect URI**: Enter a redirect URI (e.g., `http://localhost:8080`).
   - **Permissions**: Leave as default.

4. **Create the Application**:
   - After filling in all fields, click **"Create app"**.

5. **Retrieve Your Credentials**:
   - **client_id**: This alphanumeric string is located directly under the application name.
   - **client_secret**: Found in the application details and labeled as "secret".

   Save these credentials securely, as they are needed to authenticate your API requests.
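
One common way to keep these credentials out of the notebook itself is to read them from environment variables. The sketch below is only an illustration: the variable names `REDDIT_CLIENT_ID` and `REDDIT_CLIENT_SECRET` and the helper `load_reddit_credentials` are hypothetical choices, not something Reddit or PRAW requires.

```python
import os


def load_reddit_credentials(env=None):
    """Return client_id/client_secret read from environment variables.

    The variable names are an assumption made for this sketch.
    """
    env = os.environ if env is None else env
    required = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
    }
```

The returned dictionary can then be unpacked into the client constructor, for example `praw.Reddit(**load_reddit_credentials(), user_agent=...)`.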


# Step 3: Authenticating with the Reddit API

To access Reddit's data programmatically, we need to authenticate our Python application using the `praw` library. Here's what each part of the code does:

- `import praw`: The `praw` library, already imported in Step 1, is a Python wrapper for the Reddit API that simplifies working with Reddit's data.

- `reddit = praw.Reddit(...)`: We create a Reddit API client object named `reddit`. This object is used for making authenticated requests to the Reddit API. The constructor takes the following parameters:

   - `client_id`: The unique identifier of your Reddit application, obtained in Step 2. It identifies your application when making API requests.

   - `client_secret`: The secret obtained in Step 2. Combined with the client ID, it lets your application authenticate with the Reddit API. Use your own value and keep it private.

   - `user_agent`: A string that identifies your application and its purpose. Reddit's guidelines ask for a descriptive value, typically including the application name, a version number, and your Reddit username.

With this authenticated `reddit` object, we can now access various Reddit data and perform operations like fetching posts, comments, and more, which will be an essential part of our project.
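
As an illustration of a descriptive user agent, Reddit's API guidelines suggest the pattern `<platform>:<app ID>:<version string> (by /u/<reddit username>)`. The helper below, `build_user_agent`, is a hypothetical convenience for this sketch, and the values passed to it are placeholders.

```python
def build_user_agent(platform, app_id, version, username):
    """Format a descriptive user agent string (helper name is hypothetical)."""
    return f"{platform}:{app_id}:v{version} (by /u/{username})"


# Placeholder values; substitute your own application name and username.
user_agent = build_user_agent("python", "reddit-data-downloader", "0.1", "your_username")
print(user_agent)
```

The resulting string can be passed as the `user_agent` argument of `praw.Reddit(...)`.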

In [2]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',          # from Step 2; do not publish real credentials
    client_secret='YOUR_CLIENT_SECRET',  # from Step 2; keep this private
    user_agent='Dry_Try8800',
)

Version 7.7.1 of praw is outdated. Version 7.8.1 was released 6 days ago.


# Step 4: Downloading Reddit Posts with a Keyword and Saving to CSV

This code retrieves a set number of posts from Reddit that match a specific keyword (in this case, `#ukrainewar`). We save the data in a CSV file with detailed information about each post, like its title, author, and upvotes.

1. **Define the Search Settings**
   - **Subreddit**: We use `'all'`, which searches all of Reddit. You could specify a subreddit, like `'news'` or `'politics'`, to narrow down the search.
   - **Keyword**: We set `keyword = '#ukrainewar'`, meaning we’re looking for posts that contain this hashtag.
   - **Total Posts**: We set `total_posts_to_retrieve = 1000`. The loop stops once it has requested 1,000 posts in total; the search may return fewer.
   - **Time Filter**: We specify `time_filter = 'year'`, meaning we want posts only from the past year.
   - **Sort Order**: We set `sort = 'top'`, which means we want the most popular posts related to our keyword.

2. **Initialize an Empty List for Storing Data**
   - We create `all_posts = []`, an empty list where we will store the data of each post. 
   - Each post’s details will be stored as a dictionary in this list. This way, we can easily convert it into a DataFrame later on.

3. **Download Posts in Chunks**
   - **Why Use Chunks?**: Reddit returns at most 100 items per listing request, so the code asks for posts in batches (or "chunks") of up to 100.
   - **Setting up a Loop**: We use `remaining_posts` to track how many posts we still need to request. Each time we download a batch, `remaining_posts` decreases by the chunk size.
   - **Chunk Size**: We define `chunk_size = min(100, remaining_posts)` so that no batch exceeds 100 posts. Note that every pass issues the same search with the same parameters, so later batches return the same top results again; the deduplication step after the download removes these repeats.

4. **Retrieve and Store Each Post’s Data**
   - For each batch, the code loops through each post and collects important details, like:
     - **`subreddit`**: The subreddit where the post was published (e.g., `'politics'` or `'news'`).
     - **`selftext`**: The main text content of the post.
     - **`author_fullname`**: Reddit's internal identifier for the author, such as `t2_ojkhp` (if available).
     - **`title`**: The post's title.
     - **`upvote_ratio`** and **`ups`**: The share of votes that are upvotes and the post's upvote count.
     - **`created`** and **`created_utc`**: The creation time of the post as Unix timestamps (seconds since 1970-01-01).
     - **`num_comments`**: Number of comments the post has.
     - **`author`**: The username of the author (if available).
     - **`id`**: The unique ID for the post.

   - **Store in List**: We store each post’s data as a dictionary inside `all_posts`. This makes it easy to manage and access all the post data in one place.

5. **Save the Data to a CSV File**
   - **Convert to DataFrame**: We use `pd.DataFrame(all_posts)` to convert `all_posts` into a pandas DataFrame. DataFrames are easy to work with and make it simple to save our data to a file.
   - **Save to CSV**: The DataFrame is then saved as a CSV file using `df.to_csv('ukrainewar_full.csv', index=True)`. Because `index=True`, the row index is written as the first column. This file, `ukrainewar_full.csv`, now contains all the data on the posts we downloaded.

6. **Print Completion Message**: Finally, the code prints `"Data saved to 'ukrainewar_full.csv'"` to let us know the data has been successfully saved.
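
Since `created_utc` is stored as a Unix timestamp, it is not very readable in the CSV. As a small sketch using only the standard library, the hypothetical helper `to_utc_datetime` below converts such a value into a timezone-aware datetime; the sample timestamp is just an illustrative value.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


print(to_utc_datetime(1677420800).isoformat())  # 2023-02-26T14:13:20+00:00
```

For a whole column, `pd.to_datetime(df['created_utc'], unit='s', utc=True)` should give the equivalent result in pandas.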

---

#### Example of the Code in Action

Let’s walk through an example to show how this code works in the first couple of loops.

1. **Loop 1**: 
   - **Remaining Posts**: We start with `remaining_posts = 1000`.
   - **Chunk Size**: Since we want 1,000 posts and request at most 100 at a time, `chunk_size = min(100, remaining_posts)` evaluates to 100.
   - The code then searches for the top 100 posts matching `#ukrainewar` from the past year and retrieves details for each post in that batch.
   - **Retrieve Post Data**: For each of these posts, it gathers information like `subreddit`, `title`, `upvote_ratio`, and `author`.
   - **Store in List**: After gathering details on each of these posts, it adds them to `all_posts`.
   - **Update Remaining Posts**: We then update `remaining_posts` to `1000 - 100 = 900`, meaning there are still 900 more posts to request.

2. **Loop 2**:
   - **Remaining Posts**: Now, `remaining_posts = 900`.
   - **Chunk Size**: `chunk_size` is again 100.
   - The code runs the same search with the same parameters, so this batch largely repeats the posts from Loop 1. Each post is still collected and added to `all_posts`.
   - **Update Remaining Posts**: After this second batch, `remaining_posts` becomes `900 - 100 = 800`.

3. **Continue the Process**:
   - This continues in batches of up to 100 until `remaining_posts` reaches 0. At that point `all_posts` holds up to 1,000 entries, many of them repeats of the same posts, which the deduplication step that follows removes.

In each loop, the code:
- Retrieves a batch of posts matching the keyword and criteria.
- Collects important information about each post.
- Adds that information to the main `all_posts` list.
- Updates `remaining_posts` until we reach the target number of posts.
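
To see how this bookkeeping behaves without calling Reddit, the sketch below swaps the search for a stand-in function. `fake_search` and `collect` are hypothetical names, and the stand-in assumes that a search repeated with identical parameters returns the same top results each time.

```python
def fake_search(limit):
    """Stand-in for a search call: always yields the same top `limit` IDs."""
    return [f"post{n}" for n in range(limit)]


def collect(total_posts_to_retrieve, max_chunk=100):
    """Mirror the remaining_posts / chunk_size loop and return collected IDs."""
    collected = []
    remaining_posts = total_posts_to_retrieve
    while remaining_posts > 0:
        chunk_size = min(max_chunk, remaining_posts)
        collected.extend(fake_search(chunk_size))
        remaining_posts -= chunk_size
    return collected


ids = collect(1000)
print(len(ids), len(set(ids)))  # 1000 100
```

Under that assumption the loop gathers 1,000 rows but only 100 distinct posts. PRAW's listing generators page through results on their own when `limit` exceeds 100, so a single call such as `subreddit.search(keyword, limit=1000, ...)` is one way to ask for distinct results, although Reddit may still return fewer than requested.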

#### Visualizing One Post

Here’s what one entry in `all_posts` might look like after one post is added:

```python
{
    'subreddit': 'politics',
    'selftext': 'Discussion on recent events...',
    'author_fullname': 't2_abc123',
    'title': 'Impact of Ukraine War on Global Politics',
    'upvote_ratio': 0.95,
    'ups': 1500,
    'created': 1677420800,
    'created_utc': 1677420800,
    'num_comments': 400,
    'author': 'user_abc',
    'id': 'abc123'
}
```


In [3]:
subreddit = reddit.subreddit('all')  # 'all' searches across every subreddit
keyword = '#ukrainewar'
total_posts_to_retrieve = 1000
time_filter = 'year'
sort = 'top'
all_posts = []

remaining_posts = total_posts_to_retrieve
while remaining_posts > 0:
    chunk_size = min(100, remaining_posts)
    # Note: each pass repeats the same query, so batches overlap;
    # duplicates are removed by post ID in the next step.
    for post in subreddit.search(keyword, limit=chunk_size, time_filter=time_filter, sort=sort):
        post_data = {
            'subreddit': post.subreddit.display_name,
            'selftext': post.selftext,
            # post.author is None for deleted accounts, so guard the author fields
            'author_fullname': post.author_fullname if post.author else 'N/A',
            'title': post.title,
            'upvote_ratio': post.upvote_ratio,
            'ups': post.ups,
            'created': post.created,
            'created_utc': post.created_utc,
            'num_comments': post.num_comments,
            'author': str(post.author) if post.author else 'N/A',
            'id': post.id
        }
        all_posts.append(post_data)

    remaining_posts -= chunk_size

# Build a DataFrame and write it out; index=True also writes the row index
df = pd.DataFrame(all_posts)
df.to_csv('ukrainewar_full.csv', index=True)
print("Data saved to 'ukrainewar_full.csv'")


Data saved to 'ukrainewar_full.csv'


In [4]:
df

Unnamed: 0,subreddit,selftext,author_fullname,title,upvote_ratio,ups,created,created_utc,num_comments,author,id
0,NonCredibleDiplomacy,,t2_yjaan75u8,TIL Ukraine wants…. Chechnya??? 🤣,0.98,1452,1.714230e+09,1.714230e+09,93,em1011081,1cegm8c
1,WhitePeopleTwitter,,t2_5dshp,Never forget what the GOP's beloved Putin is d...,0.97,1373,1.703426e+09,1.703426e+09,36,rhino910,18pvpv6
2,facepalm,,t2_kmlo28dhm,Russian Tools are not intelligent,0.96,655,1.712592e+09,1.712592e+09,11,Lord_Answer_me_Why,1bz1p3q
3,europe,\nThis megathread is meant for discussion of t...,t2_ojkhp,War in Ukraine Megathread LVI (57),0.97,525,1.711113e+09,1.711113e+09,2657,ModeratorsOfEurope,1bkysju
4,NAFO,,t2_tb342,Russians in Kursk are evacuating to..... Ukraine.,1.00,323,1.723545e+09,1.723545e+09,14,macktruck6666,1er4b5b
...,...,...,...,...,...,...,...,...,...,...,...
965,weapons,,t2_v4ecdy4d,Russian Armed Forces Reconnaissance Units Laun...,0.50,0,1.713319e+09,1.713319e+09,0,KyivMilitary,1c5y01u
966,RAW_NEWS,,t2_6nhz2unq,"Charlie on Instagram: ""AMERICA ROBBED OF ITS F...",0.50,0,1.718502e+09,1.718502e+09,0,mrpersistence2020,1dgx0cj
967,Substack,"Hello, this is my Substack in which I attempt ...",t2_12lhavbzn2,Land Land Land,0.33,0,1.718530e+09,1.718530e+09,0,badopinionsub,1dh3v4b
968,AndroidGaming,Drone game!!! #fpvdrone #ukrainewar #foryou #f...,t2_13reac5gsr,Drone incoming,0.38,0,1.721465e+09,1.721465e+09,1,monke_games_1,1e7r6vh


# Step 5: Data Processing and Deduplication

1. **Create a Copy of the DataFrame**:  
   - A copy of the original DataFrame `df` is created to work with, ensuring that the original data remains unchanged.

2. **Sort the Data by Number of Comments**:  
   - The DataFrame is sorted in descending order based on the number of comments, so posts with more comments appear first. This helps keep the most engaged posts when removing duplicates.

3. **Remove Duplicate Posts by Post ID**:  
   - Duplicate posts are removed based on the unique `id` of each post. Only the first occurrence of each `id` (with the most comments due to the sorting) is retained.

4. **Reset the DataFrame Index**:  
   - The DataFrame index is reset to ensure it is sequential, starting from 0. Dropping the old index keeps the DataFrame clean.

5. **Display the Cleaned DataFrame**:  
   - The final result, `unique_posts`, is displayed, containing only unique posts with the highest engagement (by comments).
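
The effect of sorting before dropping duplicates is easiest to see on a small table. The sketch below uses made-up toy data (not taken from the Reddit results) in which post `a` appears twice with different comment counts.

```python
import pandas as pd

# Toy data: post 'a' appears twice with different comment counts.
toy = pd.DataFrame({
    "id": ["a", "b", "a", "c"],
    "num_comments": [5, 2, 9, 0],
})

deduped = (
    toy.sort_values(by="num_comments", ascending=False)
       .drop_duplicates(subset="id")  # keep="first" is the default
       .reset_index(drop=True)
)
print(deduped)
```

Because the rows are sorted first, the copy of `a` that survives is the one with 9 comments.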



In [5]:
unique_posts = df.copy()
unique_posts.sort_values(by='num_comments', ascending=False, inplace=True)
unique_posts.drop_duplicates(subset='id', inplace=True)
unique_posts.reset_index(inplace=True, drop=True)
unique_posts

Unnamed: 0,subreddit,selftext,author_fullname,title,upvote_ratio,ups,created,created_utc,num_comments,author,id
0,europe,\nThis megathread is meant for discussion of t...,t2_ojkhp,War in Ukraine Megathread LVI (57),0.97,529,1.711113e+09,1.711113e+09,2657,ModeratorsOfEurope,1bkysju
1,europe,\nThis megathread is meant for discussion of t...,t2_7b6qg,War in Ukraine Megathread LVIII (58),0.94,88,1.726737e+09,1.726737e+09,454,BkkGrl,1fkglfj
2,NonCredibleDiplomacy,,t2_yjaan75u8,TIL Ukraine wants…. Chechnya??? 🤣,0.98,1448,1.714230e+09,1.714230e+09,93,em1011081,1cegm8c
3,600euro,,t2_il63lzpcx,Fängt die Ampel wohl bald einen Krieg an?,0.99,180,1.709498e+09,1.709498e+09,80,Meaglo,1b5rdjk
4,indonesia,,t2_bywstnsf,What? Tribun is evolving! English video + robo...,0.95,180,1.711219e+09,1.711219e+09,45,Affectionate_Cat293,1blzh0r
...,...,...,...,...,...,...,...,...,...,...,...
92,Standbyukraine,https://preview.redd.it/f5rhpblge4td1.jpg?widt...,t2_fneqj99c,Will Western Leaders Choose Appeasement or Res...,1.00,1,1.728214e+09,1.728214e+09,0,ChristianEnglev,1fxe9sk
93,yUkraine,,t2_u42z2,Beautiful architecture in Ukraine Ivano-Franki...,1.00,1,1.718478e+09,1.718478e+09,0,Lodhur,1dgp32s
94,yUkraine,,t2_u42z2,the graveyard #ukraine #memorialday #holiday #...,1.00,1,1.717249e+09,1.717249e+09,0,Lodhur,1d5n9eq
95,u_Mriya_Production,🇺🇦 Immerse yourself in the resilience of war-t...,t2_qpqn0lxe3,"Murals, a documentary featuring Banksy's works...",1.00,1,1.704753e+09,1.704753e+09,0,Mriya_Production,191xz1a


# Step 6: Defining a Comment Extraction Function

This code snippet outlines the process of extracting comments related to Reddit posts. It utilizes the Python Reddit API Wrapper (PRAW) library and the information obtained from the previous steps. The explanation of the code is as follows:

1. **Empty List Creation**: An empty list named `comments_list` is initialized. This list will be used to store extracted comment data.

2. **Comment Extraction Function**: The code defines a function called `extract_comments(post_id)`. This function takes a post's unique identifier (`post_id`) as a parameter.

3. **Reddit Submission Retrieval**: Within the function, a Reddit post submission is retrieved using the provided `post_id`. This submission object represents the specific post on Reddit.

4. **Replacing More Comments**: To retrieve every comment, the code calls `submission.comments.replace_more(limit=None)`. Reddit initially returns only part of a large comment tree and leaves "load more comments" placeholders for the rest. With `limit=None`, PRAW keeps requesting until all placeholders are replaced by actual comments, which can take many requests on posts with thousands of comments.

5. **Comment Data Extraction**: The code then calls `submission.comments.list()`, which flattens the comment tree into a single list, and builds one dictionary per comment in a list comprehension. Each dictionary holds three pieces of information: the `comment_id` (unique comment identifier), `comment_body` (the text content of the comment), and `post_id` (the identifier of the post to which the comment belongs).

6. **Appending to `comments_list`**: The resulting dictionaries are added to `comments_list` with `extend`. This ensures that comments from multiple posts are gathered in a single list for later analysis.

By using this code, you can collect comments related to Reddit posts, making them available for further processing and analysis. Each comment is associated with its parent post, which can be valuable for understanding user engagement and sentiment.
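
The dictionary-building step can be tried without network access. In the sketch below, `comments_to_rows` is a hypothetical variant that returns the rows instead of extending a global list, and the `SimpleNamespace` objects are stand-ins that carry only the `id` and `body` attributes read from each comment.

```python
from types import SimpleNamespace


def comments_to_rows(comments, post_id):
    """Build one dictionary per comment, tagged with its parent post ID."""
    return [
        {"comment_id": c.id, "comment_body": c.body, "post_id": post_id}
        for c in comments
    ]


# Stand-in objects with the two attributes the code reads from each comment.
fake_comments = [
    SimpleNamespace(id="c1", body="First comment"),
    SimpleNamespace(id="c2", body="A reply"),
]
rows = comments_to_rows(fake_comments, "abc123")
print(rows)
```

Returning the rows makes the step easy to check in isolation; the caller can still collect them with `comments_list.extend(rows)`.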


In [6]:
comments_list = []

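def extract_comments(post_id):
    """Fetch all comments of one post and add them to comments_list."""
    submission = reddit.submission(id=post_id)
    # Replace every "load more comments" placeholder with the actual comments
    submission.comments.replace_more(limit=None)
    # .list() flattens the comment tree into a single list
    comments_list.extend([
        {'comment_id': comment.id, 'comment_body': comment.body, 'post_id': post_id}
        for comment in submission.comments.list()
    ])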


# Step 7: Collecting Comments for Each Post

In this code block, we see the process of extracting comments from Reddit posts. The code is designed to handle potential rate limiting by Reddit's API while collecting the comments. Here's a step-by-step explanation of the code:

1. **Iteration Through Posts**: The code initializes a failure counter `i` to zero, then iterates over the post identifiers in the `id` column of the DataFrame `unique_posts`.

2. **Failure Limit**: Each pass first checks whether `i` is less than 5. Because `i` only increases when a request fails, the loop works through every post unless five requests have failed, at which point it stops with `break`.

3. **Try-Except Block**: Inside the loop, a try-except block attempts to extract comments from a Reddit post using the `extract_comments` function. If any exception is raised (for example because of rate limiting), execution moves to the except block.

4. **Handling a Failed Request**: In the except block, `i` is incremented by 1 to record the failure. The code prints a message that includes the error, then pauses for 7 seconds with `time.sleep(7)` so it does not overwhelm the Reddit API. Because the handler catches every `Exception`, the "rate limit exceeded" message is printed for any error, not only for rate limiting.

5. **Continuing the Loop**: After the pause, `continue` moves on to the next post; the failed post is skipped rather than retried.

6. **Creating a DataFrame**: After the loop completes, a DataFrame named `comments_df` is created. This DataFrame is built from the `comments_list`, which accumulates comments from multiple posts. Each row in the DataFrame represents a comment, including the `comment_id`, `comment_body`, and `post_id`.

7. **Saving to CSV**: Finally, the DataFrame is saved to a CSV file named 'comments_data.csv' without an index column.

By using this code, you can systematically extract comments from multiple Reddit posts while pausing after failed requests, which suits data collection tasks that involve Reddit's API.
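
The loop described above moves on to the next post after a failure. If you would rather retry the same post, one possible alternative is a small retry helper. This is a sketch only: `call_with_retries` and its parameters are hypothetical, and the 7-second default simply mirrors the delay used in this step.

```python
import time


def call_with_retries(func, *args, max_attempts=3, delay_seconds=7, sleep=time.sleep):
    """Call func(*args); on an exception, wait and try again up to max_attempts times."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except Exception as error:  # broad on purpose, like the loop in this step
            if attempt == max_attempts:
                raise  # give up and let the caller see the last error
            print(f"Attempt {attempt} failed ({error}); retrying in {delay_seconds}s")
            sleep(delay_seconds)
```

It could be used as `call_with_retries(extract_comments, post_id)` inside the loop. The `sleep` parameter exists so the waiting can be replaced with a no-op when testing.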


In [7]:
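i = 0  # counts failed requests, not processed posts
for post_id in unique_posts['id']:
    print(i)

    if i < 5:
        try:
            extract_comments(post_id)
        except Exception as error:
            # Any error lands here; record it, pause, then skip to the next post
            i += 1
            print(f"Rate limit exceeded. Waiting for a moment. Error: {error}")
            time.sleep(7)
            continue
    else:
        # Stop once five requests have failed
        break

comments_df = pd.DataFrame(comments_list)

# Save one row per comment, without the index column
comments_df.to_csv('comments_data.csv', index=False)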


0
0
...
0


In [8]:
comments_df = pd.read_csv('comments_data.csv')  # read_csv already returns a DataFrame
comments_df

Unnamed: 0,comment_id,comment_body,post_id
0,kw230q6,The latest round of missile strikes on Ukraine...,1bkysju
1,kz1n390,I'm so fucking tired of Europe being afraid of...,1bkysju
2,kw6apo1,It's really an extraordinary moment: we have a...,1bkysju
3,kwt3o1a,Russian propaganda in Serbia's government-spon...,1bkysju
4,lh6von7,"Apparently, one of the units participating in ...",1bkysju
...,...,...,...
3271,li7m0a9,Thanks for posting on r/YouTubePromoter!\n\n*I...,1esqde2
3272,kho66ig,"Bard's response to this inquiry: ""What does sc...",195p8ey
3273,lpgmbkt,PTSD angry birds,1e7r726
3274,le239io,What are those hashtags bruh,1e7r6vh
