# Title: Data Collection and Sentiment Analysis

## Working Procedure
The program defines a separate function for each task: data collection, saving to CSV, and sentiment analysis. The main block calls these functions in sequence inside a `while` loop, so new posts are collected and scored in near real time.

### Data Collection Function
This function uses the Reddit API, through the `praw` library, to search a subreddit for posts that match a query. It collects the following fields for each post:
- **ID**: Unique identifier of the post.
- **Title**: The title of the post.
- **SelfText**: The body content of the post.
- **Date**: The creation time of the post in UTC, formatted as `YYYY-MM-DD HH:MM:SS`.

The function keeps only posts whose `SelfText` is non-empty and whose `ID` has not already been collected, which avoids empty entries and repeated work.
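
The filtering rule can be tried out without calling the Reddit API. The sketch below is an illustration only: `filter_submissions` is a hypothetical helper, the `SimpleNamespace` objects are stand-ins for `praw` submissions, and the timezone-aware timestamp conversion is one possible way to produce the `Date` string. Unlike the program's function, it also skips repeated IDs within a single batch.

```python
from datetime import datetime, timezone
from types import SimpleNamespace


def filter_submissions(submissions, collected_ids=None):
    """Keep submissions with non-empty selftext whose ID has not been seen."""
    seen = set(collected_ids or ())
    posts = []
    for sub in submissions:
        # Skip already-collected IDs and posts with a blank body
        if sub.id in seen or not sub.selftext.strip():
            continue
        seen.add(sub.id)
        created = datetime.fromtimestamp(sub.created_utc, tz=timezone.utc)
        posts.append({
            'ID': sub.id,
            'Title': sub.title,
            'SelfText': sub.selftext,
            'Date': created.strftime('%Y-%m-%d %H:%M:%S'),
        })
    return posts


# Stand-in objects exposing the same attribute names the program reads
fake = [
    SimpleNamespace(id='a1', title='Kept', selftext='Some body text', created_utc=0),
    SimpleNamespace(id='a2', title='Link post', selftext='   ', created_utc=0),
    SimpleNamespace(id='a3', title='Seen before', selftext='Body', created_utc=0),
]
kept = filter_submissions(fake, collected_ids={'a3'})
print([p['ID'] for p in kept])  # ['a1']
```

Only `a1` survives: `a2` has a blank body and `a3` was already collected.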

### Data Saving Function
This function stores the collected posts. It:
1. Combines newly collected data with the data already saved in the CSV file.
2. Drops rows with a duplicate `ID` so that no post is stored twice.
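
The two steps above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the program's own function: `merge_into_csv` is a hypothetical name, the sample rows are made up, and reading `ID` as a string is an added precaution so that IDs compare consistently across runs.

```python
import os
import tempfile

import pandas as pd


def merge_into_csv(df_new, filename):
    """Merge df_new into filename, keeping one row per ID. Returns total rows."""
    if os.path.exists(filename):
        # Read IDs as strings so they compare equal to freshly collected IDs
        df_existing = pd.read_csv(filename, dtype={'ID': str})
        df_all = pd.concat([df_existing, df_new], ignore_index=True)
    else:
        # First run: there is nothing to merge with yet
        df_all = df_new.copy()
    df_all = df_all.drop_duplicates(subset=['ID']).reset_index(drop=True)
    df_all.to_csv(filename, index=False)
    return len(df_all)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'posts.csv')
    first = pd.DataFrame({'ID': ['a1', 'a2'], 'Title': ['one', 'two']})
    second = pd.DataFrame({'ID': ['a2', 'a3'], 'Title': ['two', 'three']})
    total_after_first = merge_into_csv(first, path)
    total_after_second = merge_into_csv(second, path)
    saved_ids = pd.read_csv(path)['ID'].tolist()

print(total_after_first, total_after_second, saved_ids)
```

The second call adds only `a3`, because `a2` is already in the file.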

### Sentiment Analysis Function
This function scores text with the **TextBlob** sentiment model. It returns the polarity of the text, a value from -1 (negative) to 1 (positive), and returns 0.0 when the text is empty. Scores are computed for both the title and the body of each post and saved for further analysis.
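
One case worth guarding against is text read back from a CSV, where a missing value is `NaN` (a float) rather than an empty string. The sketch below shows one way to handle it. `score_text` and `toy_scorer` are hypothetical names introduced for illustration; the optional `scorer` argument is an assumption that lets the example run without TextBlob, and the toy scorer is not a real sentiment model.

```python
def score_text(text, scorer=None):
    """Return a polarity score, or 0.0 for missing or blank text."""
    # Values read back from a CSV may be NaN (a float), not a string
    if not isinstance(text, str) or not text.strip():
        return 0.0
    if scorer is None:
        # Imported lazily so the helper can be exercised without TextBlob installed
        from textblob import TextBlob
        return float(TextBlob(text).sentiment.polarity)
    return float(scorer(text))


# Stand-in scorer used only for demonstration
def toy_scorer(text):
    return 1.0 if 'good' in text.lower() else -1.0


print(score_text('', scorer=toy_scorer))            # 0.0
print(score_text(float('nan'), scorer=toy_scorer))  # 0.0
print(score_text('Good news', scorer=toy_scorer))   # 1.0
```

When `scorer` is omitted, the helper falls back to `TextBlob(text).sentiment.polarity`, the same call the program uses.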

### Main Function
The main block (`if __name__ == "__main__":`) coordinates the workflow by invoking the individual functions. It contains a `while` loop that:
1. Collects new data at regular intervals (every 5 seconds in the code below).
2. Saves the collected posts to the first CSV file (`climate_change_posts_streaming.csv`) as a backup.
3. Applies sentiment analysis to the newly collected posts.
4. Appends the scored posts to the second CSV file (`climate_change_sentiment_analysis_streaming.csv`), enabling near real-time analysis.
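
The shape of this loop can be sketched independently of Reddit, pandas, and TextBlob. In the illustration below, `run_cycles` and its parameters are hypothetical, the lambdas are stubs standing in for the real functions, and `max_cycles` is an added assumption so that the loop can terminate in a demonstration.

```python
import time


def run_cycles(fetch, save, score, interval=5.0, max_cycles=None):
    """Run fetch -> save -> score -> save repeatedly; return cycles completed."""
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        posts = fetch()
        if posts:
            save(posts, 'raw')  # backup of the unprocessed posts
            scored = [dict(p, Score=score(p['SelfText'])) for p in posts]
            save(scored, 'scored')  # posts with sentiment scores attached
        cycles += 1
        time.sleep(interval)
    return cycles


# Stubs standing in for the Reddit search, the CSV writer, and the scorer
batches = [[{'ID': 'a1', 'SelfText': 'hello'}], []]
saved = []
cycles_run = run_cycles(
    fetch=lambda: batches.pop(0) if batches else [],
    save=lambda rows, target: saved.append((target, len(rows))),
    score=lambda text: 0.0,
    interval=0,
    max_cycles=2,
)
print(cycles_run, saved)  # 2 [('raw', 1), ('scored', 1)]
```

The second cycle finds no posts, so nothing is saved and the loop simply waits for the next interval.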


In [1]:
import praw
import pandas as pd
import time
import os
from datetime import datetime
from textblob import TextBlob  # Import TextBlob

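# Reddit API credentials, read from environment variables so secrets stay out of the source.
# The variable names below are illustrative; use whatever names your environment defines.
client_id = os.environ['REDDIT_CLIENT_ID']
client_secret = os.environ['REDDIT_CLIENT_SECRET']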
user_agent = 'climate_sentiment_analyzer:v1.0 (by /u/Ok_Beginning1171)'

# Authenticate with Reddit API using praw
reddit = praw.Reddit(
    client_id=client_id,
    client_secret=client_secret,
    user_agent=user_agent
)

# Function to search Reddit by subreddit
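def search_reddit_by_subreddit(query, subreddit, limit=1000, collected_ids=None):
    # Avoid a mutable default argument; None means nothing has been collected yet
    if collected_ids is None:
        collected_ids = set()
    posts = []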
    try:
        subreddit_instance = reddit.subreddit(subreddit)
        for submission in subreddit_instance.search(query, sort='new', limit=limit):
            if submission.id not in collected_ids and submission.selftext.strip():
                posts.append({
                    'ID': submission.id,
                    'Title': submission.title,
                    'SelfText': submission.selftext,
                    'Date': datetime.utcfromtimestamp(submission.created_utc).strftime('%Y-%m-%d %H:%M:%S')
                })
    except Exception as e:
        print(f"Error fetching data from subreddit r/{subreddit}: {e}")
    return posts

# Function to save posts to CSV
def save_to_csv(df_new_posts, filename):
    # Start from the new posts so the first run (no file yet) still writes them
    df_updated = df_new_posts
    if os.path.exists(filename):
        df_existing = pd.read_csv(filename)
        df_updated = pd.concat([df_existing, df_new_posts]).drop_duplicates(subset=['ID']).reset_index(drop=True)
    df_updated.to_csv(filename, index=False)
    return len(df_new_posts), len(df_updated)

# Define a function for sentiment scoring using TextBlob
def analyze_sentiment(text):
    if not text:
        return 0.0  # Neutral score if no text
    blob = TextBlob(text)
    return blob.sentiment.polarity  # Polarity score: -1 (negative) to 1 (positive)

if __name__ == "__main__":
    queries = ["climate change", "global warming", "greenhouse gases", "carbon emissions", "renewable energy",
                "deforestation", "sea level rise", "extreme weather", "climate action", "fossil fuels",
                "carbon footprint", "united nations"]

    subreddits = ["unitednations", "climatechange", "change", "heat", "weather", "environment", "sustainability", "renewableenergy", "conservation"]
    
    filename = 'climate_change_posts_streaming.csv'
    filename_sa = "climate_change_sentiment_analysis_streaming.csv"
    collected_ids = set()

    while True:
        try:
            print("Fetching data from Reddit...")
            new_posts = []

            # Load the already-saved IDs once per cycle rather than once per query
            if os.path.exists(filename):
                df_existing = pd.read_csv(filename)
                if 'ID' in df_existing.columns:
                    collected_ids = set(df_existing['ID'].values)

            for subreddit in subreddits:
                print(f"Searching in subreddit: r/{subreddit}...")
                for query in queries:
                    subreddit_posts = search_reddit_by_subreddit(query, subreddit, collected_ids=collected_ids)
                    new_posts.extend(subreddit_posts)
                print(f"r/{subreddit}: Done")

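            if not new_posts:
                print("No new posts found.")

            else:
                # Build the DataFrame only when there are posts: drop_duplicates
                # raises KeyError on an empty frame that has no 'ID' column
                df_new_posts = pd.DataFrame(new_posts).drop_duplicates(subset=['ID'])
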
                # Save new posts to CSV
                new_count, total_count = save_to_csv(df_new_posts, filename=filename)

                # Perform sentiment analysis
                df_new_posts['TitleSentimentScore'] = df_new_posts['Title'].apply(analyze_sentiment)
                df_new_posts['SelfTextSentimentScore'] = df_new_posts['SelfText'].apply(analyze_sentiment)
                
                # Save sentiment analysis to a new CSV
                header = not os.path.exists(filename_sa)
                df_new_posts.to_csv(filename_sa, mode='a', index=False, header=header)

                print(f"Appended {new_count} new posts. Total posts: {total_count}")

            print("Waiting for the next interval...")
            time.sleep(5)
            
        except KeyboardInterrupt:
            print("\nScript interrupted. Exiting...")
            break
        except Exception as e:
            print(f"Unexpected error: {e}")
            break


Fetching data from Reddit...
Searching in subreddit: r/unitednations...
r/unitednations: Done
Searching in subreddit: r/climatechange...
r/climatechange: Done
Searching in subreddit: r/change...
r/change: Done
Searching in subreddit: r/heat...
r/heat: Done
Searching in subreddit: r/weather...
r/weather: Done
Searching in subreddit: r/environment...
r/environment: Done
Searching in subreddit: r/sustainability...
r/sustainability: Done
Searching in subreddit: r/renewableenergy...
r/renewableenergy: Done
Searching in subreddit: r/conservation...
r/conservation: Done
Appended 2 new posts. Total posts: 3973
Waiting for the next interval...
Fetching data from Reddit...
Searching in subreddit: r/unitednations...
r/unitednations: Done
Searching in subreddit: r/climatechange...
r/climatechange: Done
Searching in subreddit: r/change...
r/change: Done
Searching in subreddit: r/heat...
r/heat: Done
Searching in subreddit: r/weather...
r/weather: Done
Searching in subreddit: r/environment...
r/envi