### Step 1: Importing Necessary Libraries

We start by importing the libraries that are essential for our project. These libraries will help us work with data, create visualizations, and interact with the Reddit API.

- `import numpy as np`: NumPy is a library for numerical computing in Python. We use it for efficient array operations.
- `import pandas as pd`: Pandas is a library for data manipulation and analysis. It provides data structures like DataFrames.
- `import matplotlib.pyplot as plt`: Matplotlib is a popular plotting library, and `plt` is the conventional alias for its `pyplot` module.
- `import praw`: PRAW stands for "Python Reddit API Wrapper," and it enables us to interact with Reddit's API programmatically.
- `import praw.exceptions` and `import prawcore.exceptions`: These modules define the exceptions raised by PRAW and by `prawcore`, its low-level request layer. We catch them when a request fails.
- `import re`: The `re` library provides support for regular expressions, which we'll use for text pattern matching and extraction.
- `import time`: The `time` module provides `sleep`, which we use to pause after a failed request.

Now that we have our libraries loaded, we can proceed with the rest of our project.



In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import praw
import prawcore.exceptions
import re
import time
import praw.exceptions

### Step 2: Authenticating with the Reddit API

To access Reddit's data programmatically, we need to authenticate our Python application using the `praw` library. Here's what each part of the code does:

- `praw`: The library imported in Step 1. It is a Python wrapper for the Reddit API and simplifies the process of interacting with Reddit's data.

- `reddit = praw.Reddit(...)`: We create a Reddit API client object named `reddit`. This object is used for making authenticated requests to the Reddit API. The constructor takes the following parameters:

   - `client_id`: This should be replaced with the unique identifier of your Reddit Developer Application, which you obtained when you created the application on the Reddit Developer Portal. It's used to identify your application when making API requests.

   - `client_secret`: Replace this with the secret generated when you created your Reddit Developer Application. Combined with the client ID, it lets your application authenticate securely with the Reddit API, so keep it out of shared notebooks and version control.

   - `user_agent`: The user agent is a string that identifies your application and its purpose. It's important to provide a user agent that follows Reddit's guidelines, typically including the name of your application and a version number. For personal projects, you can include your Reddit username or any other descriptive information.

With this authenticated `reddit` object, we can now access various Reddit data and perform operations like fetching posts, comments, and more, which will be an essential part of our project.

In [2]:
reddit = praw.Reddit(
    client_id='YOUR_CLIENT_ID',          # placeholder: use your own application's ID
    client_secret='YOUR_CLIENT_SECRET',  # placeholder: never publish a real secret
    user_agent='Dry_Try8800',
)
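
Hardcoding credentials in a notebook makes them easy to leak. One common alternative is to read them from environment variables. The sketch below assumes three variable names (`REDDIT_CLIENT_ID`, `REDDIT_CLIENT_SECRET`, `REDDIT_USER_AGENT`) and a helper called `load_reddit_credentials`; both are illustrative choices, not something PRAW requires.

```python
import os

# Hypothetical variable names chosen for this example.
REQUIRED_KEYS = ("REDDIT_CLIENT_ID", "REDDIT_CLIENT_SECRET", "REDDIT_USER_AGENT")


def load_reddit_credentials(env=None):
    """Build the keyword arguments for praw.Reddit from environment variables."""
    env = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not env.get(key)]
    if missing:
        raise KeyError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "client_id": env["REDDIT_CLIENT_ID"],
        "client_secret": env["REDDIT_CLIENT_SECRET"],
        "user_agent": env["REDDIT_USER_AGENT"],
    }


# Demonstration with a fake environment, so no real secret is needed here.
demo_env = {
    "REDDIT_CLIENT_ID": "example-id",
    "REDDIT_CLIENT_SECRET": "example-secret",
    "REDDIT_USER_AGENT": "script:ukrainewar-demo:v0.1 (by u/example_user)",
}
print(sorted(load_reddit_credentials(demo_env)))
```

With the variables set in your shell, the client could then be created with `reddit = praw.Reddit(**load_reddit_credentials())`.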

### Step 3: Data Retrieval and Data Processing

In this step, we retrieve data from Reddit using the Python Reddit API Wrapper (PRAW) library and process it to create a DataFrame for analysis. Before running the code, ensure that you have installed the required libraries and configured your Reddit API credentials.

### Parameters

- **Subreddit**: The code searches r/all (`reddit.subreddit('all')`), which spans all subreddits. You can change this to a specific subreddit of your choice.

- **Search Query**: The search term is `'#ukrainewar'`, and results are sorted by `'top'`.

- **Total Posts to Retrieve**: The variable `total_posts_to_retrieve` is set to 10,000 in this example, but you can adjust it to the desired number of posts.

- **Time Filter**: The `time_filter` variable is set to `'year'`, which restricts the search to posts from the past year. PRAW also accepts `'all'`, `'day'`, `'hour'`, `'month'`, and `'week'`.

- **Batch Size**: The `batch_size` variable (250 here) caps the number of posts requested in each search call, which spreads the retrieval over several requests.

### Data Retrieval Loop

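The data retrieval process follows these steps:

1. Calculate how many posts are still needed and cap that number at `batch_size`.

2. Search for `'#ukrainewar'`, passing the current batch number in the `before` parameter.

3. If the search returns no posts, exit the loop.

4. Extend the `all_posts` list with the retrieved posts.

5. Update the counters for the number of retrieved posts and the current batch.

Two caveats apply to the loop as written. It does not pause between requests, so adding a short delay such as `time.sleep(2)` inside the loop is a simple way to reduce the risk of rate limiting. Reddit's listing pagination also expects a post fullname (for example, `t3_10eps9y`) in `before`, so a value like `0y` probably does not page through the results, and consecutive batches can contain the same posts. Duplicates are removed by `id` in Step 4.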
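
The batching logic can be sketched independently of Reddit. In this minimal example, `fetch_batch` is a hypothetical stand-in for a call such as `subreddit.search(...)`, and the loop adds two things: a pause between requests and de-duplication by `id`, so that a source returning the same posts twice ends the loop instead of inflating the count.

```python
import time


def collect_in_batches(fetch_batch, total, batch_size, delay_seconds=2.0, sleep=time.sleep):
    """Collect up to `total` unique items by calling fetch_batch(limit) repeatedly."""
    collected = {}
    while len(collected) < total:
        limit = min(total - len(collected), batch_size)
        batch = fetch_batch(limit)
        new_items = [item for item in batch if item["id"] not in collected]
        if not new_items:
            break  # nothing new came back, so stop instead of looping forever
        for item in new_items:
            collected[item["id"]] = item
        if len(collected) < total:
            sleep(delay_seconds)  # pause before the next request
    return list(collected.values())[:total]


# Fake data source with five items, standing in for the Reddit API.
source = [{"id": n} for n in range(5)]
cursor = {"pos": 0}


def fake_fetch(limit):
    start = cursor["pos"]
    cursor["pos"] += limit
    return source[start:start + limit]


pauses = []  # record the requested delays instead of actually sleeping
result = collect_in_batches(fake_fetch, total=4, batch_size=2, sleep=pauses.append)
print([item["id"] for item in result], pauses)
```

Passing `sleep` as a parameter keeps the example fast to run; in a notebook you would leave it at its default, `time.sleep`.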

### Data Processing

After retrieval, the posts are converted into a Pandas DataFrame for further analysis:

1. Convert each post object into a dictionary of its attributes with `vars(post)`.

2. Create a DataFrame (`df`) from those dictionaries. In this DataFrame, each row represents a Reddit post, and each column represents a specific attribute.

3. Keep only the columns needed for the analysis, such as the author, title, number of upvotes, and number of comments.

Author information might not be available for every post (for example, when the account has been deleted), so expect missing values in the author columns.
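
Rows can also be built by picking fields explicitly, which makes the missing-author case easy to handle. The sketch below uses `SimpleNamespace` objects as hypothetical stand-ins for PRAW posts, and `post_to_row` is an illustrative helper name. It assumes that a post without an author exposes `author` as `None`, so reading `.name` raises `AttributeError`.

```python
from types import SimpleNamespace


def post_to_row(post):
    """Flatten one post object into a plain dictionary."""
    try:
        author = post.author.name
    except AttributeError:
        author = None  # no author object available for this post
    return {
        "id": post.id,
        "title": post.title,
        "author": author,
        "ups": post.ups,
        "num_comments": post.num_comments,
    }


# Stand-in posts; real code would iterate over the retrieved submissions.
posts = [
    SimpleNamespace(id="abc123", title="Example post", ups=10, num_comments=2,
                    author=SimpleNamespace(name="some_user")),
    SimpleNamespace(id="def456", title="Post without author", ups=3, num_comments=0,
                    author=None),
]

post_data = [post_to_row(post) for post in posts]
print(post_data)
```

A list like `post_data` can be passed straight to `pd.DataFrame(post_data)` to get one row per post.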

### Data Analysis

With the DataFrame (`df`) ready, you can conduct various analyses and draw insights from the Reddit data:

- Analyze the distribution of upvotes (`ups`) and upvote ratios (`upvote_ratio`).
- Investigate trends in post creation times.
- Examine the most common words or hashtags used in post titles or text.
- Explore relationships between variables, such as the number of comments and upvotes.

To visualize your findings, you can use Python libraries such as Matplotlib or Seaborn for charts, or NetworkX for network graphs, depending on your analysis objectives.
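
As an illustration of the hashtag analysis mentioned above, the following sketch counts hashtags in a list of titles with `re` and `collections.Counter`. The titles are invented sample data, and `count_hashtags` is a hypothetical helper name.

```python
import re
from collections import Counter

# A hashtag is a '#' followed by one or more word characters.
HASHTAG_PATTERN = re.compile(r"#(\w+)")


def count_hashtags(titles):
    """Count hashtags across titles, ignoring case."""
    counts = Counter()
    for title in titles:
        counts.update(tag.lower() for tag in HASHTAG_PATTERN.findall(title))
    return counts


# Invented sample titles, not taken from the dataset.
sample_titles = [
    "Daily update #UkraineWar #news",
    "Map of the front line #ukrainewar",
    "No tags here",
]

hashtag_counts = count_hashtags(sample_titles)
print(hashtag_counts.most_common(2))
```

On the real data, the same function could be called as `count_hashtags(df['title'])`.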


In [3]:
subreddit = reddit.subreddit('all')
total_posts_to_retrieve = 10000
time_filter = 'year'
batch_size = 250
all_posts = []
retrieved_posts = 0
current_batch = 0

while retrieved_posts < total_posts_to_retrieve:
    # Request only as many posts as are still needed, up to one batch.
    remaining_posts = total_posts_to_retrieve - retrieved_posts
    posts_to_retrieve = min(remaining_posts, batch_size)
    ukraine_posts = list(subreddit.search(
        '#ukrainewar',
        limit=posts_to_retrieve,
        time_filter=time_filter,
        sort='top',
        params={'before': f'{current_batch}y'},
    ))

    # Stop when the search returns nothing.
    if not ukraine_posts:
        break

    all_posts.extend(ukraine_posts)
    retrieved_posts += len(ukraine_posts)
    current_batch += 1

df = pd.DataFrame(vars(post) for post in all_posts)
df = df[['subreddit','selftext','author_fullname','title','upvote_ratio','ups','created','created_utc','num_comments','author','id']]
df.to_csv('ukrainewar_full.csv', index=True)

### Step 4: Data Selection and Export

This step explains the column selection and CSV export performed at the end of the cell above, then removes duplicate posts before the data is used further.

### Data Selection

The analysis focuses on the following columns:

- `subreddit`: The name of the subreddit where the post is located.
- `selftext`: The body text of the post. It is empty for posts without text, such as link posts.
- `author_fullname`: The unique identifier of the author's account (for example, `t2_ojkhp`).
- `title`: The title of the post.
- `upvote_ratio`: The ratio of upvotes to total votes.
- `ups`: The number of upvotes received by the post.
- `created`: The time the post was created, as a Unix timestamp in seconds.
- `created_utc`: The time the post was created, as a Unix timestamp in seconds (UTC).
- `num_comments`: The number of comments on the post.
- `author`: The username of the post's author.
- `id`: A unique identifier for the post.

These selected columns make up the final dataset.
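
The `created_utc` values are Unix timestamps, which are hard to read as raw numbers. A minimal sketch of converting one to a timezone-aware datetime with the standard library follows; the sample value is illustrative, chosen to be of the same magnitude as those in the dataset, and `to_utc_datetime` is a hypothetical helper name.

```python
from datetime import datetime, timezone


def to_utc_datetime(created_utc):
    """Convert a Unix timestamp in seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(created_utc, tz=timezone.utc)


# Illustrative timestamp, not an exact value from the dataset.
example = to_utc_datetime(1673995000.0)
print(example.isoformat())
```

For a whole column, pandas offers the equivalent `pd.to_datetime(df['created_utc'], unit='s', utc=True)`.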

### Data Export

The selected data is exported to a CSV file named 'ukrainewar_full.csv' in the current working directory. The `index=True` parameter writes the DataFrame's index as an extra first column in the CSV file. When loading the file again, pass `index_col=0` to `pd.read_csv` to restore that column as the index.

### Displaying the DataFrame

Finally, the cell below copies the DataFrame, sorts it by number of comments in descending order, drops duplicate posts by `id`, and displays the result.

You can now use 'ukrainewar_full.csv' for further analysis or share it as needed.


In [11]:
unique_posts = df.copy()
unique_posts.sort_values(by='num_comments',ascending=False,inplace=True)
unique_posts.drop_duplicates(subset='id',inplace=True)
unique_posts

Unnamed: 0,subreddit,selftext,author_fullname,title,upvote_ratio,ups,created,created_utc,num_comments,author,id
3137,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread L,0.96,428,1.673995e+09,1.673995e+09,9524,ModeratorsOfEurope,10eps9y
4346,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread XLIX,0.97,346,1.670887e+09,1.670887e+09,8921,ModeratorsOfEurope,zkf1p4
9168,europe,This megathread is meant for discussion of the...,t2_ojkhp,War in Ukraine Megathread XLVII,0.94,272,1.667165e+09,1.667165e+09,8445,ModeratorsOfEurope,yhqh49
3138,europe,"This is a special megathread. **One year ago, ...",t2_ojkhp,War in Ukraine Megathread LII,0.97,410,1.677156e+09,1.677156e+09,8276,ModeratorsOfEurope,119wltg
8679,europe,\nThis megathread is meant for discussion of t...,t2_ojkhp,War in Ukraine Megathread LIII,0.95,577,1.680552e+09,1.680552e+09,8232,ModeratorsOfEurope,12aw2q2
...,...,...,...,...,...,...,...,...,...,...,...
8982,InternationalLeft,,t2_39jzgdc5,CLARE DALY IS RIGHT! #shorts #claredaly #ukrai...,0.99,10,1.687868e+09,1.687868e+09,0,karmagheden,14kbzc1
7085,bing,Wanted to share these stunning images I've cre...,t2_e6xuffxu,Amazing AI Art I created using the chat mode o...,0.71,6,1.691855e+09,1.691855e+09,0,Activistjayden,15p7x16
8987,WayOfTheBern,,t2_39jzgdc5,CLARE DALY IS RIGHT! #shorts #claredaly #ukrai...,0.75,8,1.685546e+09,1.685546e+09,0,karmagheden,13worbn
9020,UkrainWarMonitor,,t2_7q9ji3dc,Javelin armed four wheeler used in Ukraine war...,1.00,6,1.676849e+09,1.676849e+09,0,mazstocks,116rf5u


In [12]:
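# Accumulates one dictionary per comment across all posts.
comments_list = []

def extract_comments(post_id):
    """Fetch every comment of a post and append it to comments_list."""
    submission = reddit.submission(id=post_id)
    # Resolve all "load more comments" placeholders; this can take many requests.
    submission.comments.replace_more(limit=None)
    comments_list.extend(
        {'comment_id': comment.id, 'comment_body': comment.body}
        for comment in submission.comments.list()
    )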
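
`submission.comments.list()` turns the nested comment tree into a flat list. The sketch below imitates that idea on plain dictionaries, which are hypothetical stand-ins rather than PRAW objects, using a breadth-first walk. The helper name `flatten_comments` and the `replies` key are assumptions made for this example.

```python
from collections import deque


def flatten_comments(top_level):
    """Return every comment in the tree as a flat list, breadth first."""
    flat = []
    queue = deque(top_level)
    while queue:
        comment = queue.popleft()
        flat.append({"comment_id": comment["id"], "comment_body": comment["body"]})
        # Replies are visited after all comments already waiting in the queue.
        queue.extend(comment.get("replies", []))
    return flat


# Invented comment tree: two top-level comments, one with nested replies.
tree = [
    {"id": "a", "body": "top-level comment", "replies": [
        {"id": "a1", "body": "reply", "replies": [
            {"id": "a1x", "body": "nested reply"},
        ]},
    ]},
    {"id": "b", "body": "another top-level comment"},
]

rows = flatten_comments(tree)
print([row["comment_id"] for row in rows])
```

Each flattened row has the same two keys that `extract_comments` stores, so the result maps directly onto the columns of `comments_df`.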


In [None]:
def comments_extractor(post_id, max_retries=3):
    """Call extract_comments, retrying a few times if the request fails."""
    for attempt in range(max_retries):
        try:
            extract_comments(post_id)
            return
        except (prawcore.exceptions.RequestException,
                prawcore.exceptions.ResponseException) as e:
            # HTTP errors such as 429 (too many requests) are ResponseExceptions.
            print(f"Request failed. Waiting for a moment. Error: {e}")
            time.sleep(5)
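
A fixed 5-second wait may not be enough when the API keeps rejecting requests. A common alternative is a bounded retry with exponential backoff. The sketch below is generic: `call_with_retries` and its parameters are hypothetical names, and the flaky function only simulates a failing request.

```python
import time


def call_with_retries(func, *args, max_attempts=4, base_delay=5.0,
                      sleep=time.sleep, retry_on=(Exception,)):
    """Call func(*args), doubling the wait after each failed attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(*args)
        except retry_on as error:
            if attempt == max_attempts:
                raise  # give up and surface the last error
            delay = base_delay * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay:.0f}s")
            sleep(delay)


# Simulated request that fails twice before succeeding.
calls = {"count": 0}


def flaky(value):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("simulated 429 response")
    return value * 2


waits = []  # record the requested delays instead of actually sleeping
outcome = call_with_retries(flaky, 21, sleep=waits.append)
print(outcome, waits)
```

In the notebook, this could wrap the existing function, for example `call_with_retries(extract_comments, post_id, retry_on=(prawcore.exceptions.ResponseException,))`.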
                     

In [13]:
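i = 0  # counts failed requests, not processed posts
for post_id in unique_posts['id']:
    print(i)

    # Stop once three requests have failed.
    if i >= 3:
        break

    try:
        extract_comments(post_id)
    except Exception as error:
        # A post whose request fails is skipped, not retried.
        i += 1
        print(f"Rate limit exceeded. Waiting for a moment. Error: {error}")
        time.sleep(5)
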
comments_df = pd.DataFrame(comments_list)

# Save the DataFrame to a CSV file
comments_df.to_csv('comments_data.csv', index=False)

0
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
1
1
1
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
2
Rate limit exceeded. Waiting for a moment. Error: received 429 HTTP response
3
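
The output above shows the loop ending after three failed requests, with the failed posts skipped. Recording which posts failed makes it possible to retry them later. The sketch below shows one way to do that; `process_posts` and `fake_handler` are hypothetical names, and the handler only simulates failures.

```python
def process_posts(post_ids, handler, max_failures=3):
    """Run handler on each post ID, stopping after max_failures errors."""
    done, failed = [], []
    for post_id in post_ids:
        if len(failed) >= max_failures:
            break
        try:
            handler(post_id)
        except Exception as error:
            failed.append((post_id, str(error)))  # keep the ID for a later retry
        else:
            done.append(post_id)
    return done, failed


def fake_handler(post_id):
    # Simulated failure for any ID that starts with "x".
    if post_id.startswith("x"):
        raise RuntimeError("simulated 429 response")


done, failed = process_posts(["a", "x1", "b", "x2", "x3", "c"], fake_handler)
print(done, [post_id for post_id, _ in failed])
```

The IDs in `failed` can be passed through the same function again once the rate limit has reset.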


In [14]:
comments_df

Unnamed: 0,comment_id,comment_body
0,j33bk6u,[Joe Biden about the ceasefire proposed by Rus...
1,j16lsjt,"""We do not complain. We do not judge and compa..."
2,j1icupp,"[Pink Floyd raised £450,000 for Ukraine \(and ..."
3,j42o3lq,[Russia's Ministry of Defense spokesman Igor K...
4,j0tjkkp,"Let's recap: 1) In 1994, Russia signed the Bud..."
...,...,...
17558,iuwtttw,You moved the definition of your own words to ...
17559,iuwv2om,"No, you misinterpreted my definition to mean s..."
17560,iuwvosc,"Okay, you're clearly doing this intentionally."
17561,iuwwidt,I think you do.
