<hr>

# README

<style>
h1 {
    text-align: center;
    color: orange;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<hr>

# ⭐ **Reddit Scraper for Posts**

A Python-based scraper that extracts high-engagement Reddit posts related to **music**. It collects each post's title, description, votes, comments, visuals (images/videos), and metadata, which makes it suitable for content analysis, research, or curating trending music posts. Search results can be sorted by **Relevance, Hot, Top, New, or Comment count** and restricted to a time period such as **past hour**, **past week**, **past month**, or **all time**.

## Features
- Extract post metadata: title, link, ID, votes, comments, post date & time.
- Download visuals (image, video, carousel) with automatic file naming.
- Filter posts by engagement, relevance, and time period.
- Focused on **music-related content**, including theory, instruments, production, memes, and historical highlights.

Perfect for researchers, content creators, or music enthusiasts looking to track trends and high-performing posts on Reddit.


## 🎵 **Reddit/Music Threads Guidelines**

**Conditions:**  
- Description: 100–300 characters  
- Posts must be HOT & relevant to music  
- Downloadable visuals (images/videos)  
- Only include posts with ≥100 shares  
- High engagement / top-performing posts  

**Essentials:**  
- Keyword: `music`  
- Focus: short text or image, simplicity, relevance  

**Relevance Categories:**  
- Music theory, instruments, production, teaching  
- Historical highlights, facts, memes, communities  
- Target: anyone learning, playing, or producing music  

**Threads Post Rules:**  
- Video under 15 seconds  
- If both image & video exist, use only video  
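
The conditions and Threads rules above could be applied to scraped records roughly as follows. This is a minimal sketch rather than part of the script: `passes_conditions` and `pick_threads_visual` are hypothetical helper names, the field names come from the data fields table in this README, and because no share count is scraped, `num_votes` stands in for the share threshold (an assumption).

```python
from typing import Optional, Sequence

# Hints for spotting a video URL, mirroring the ones the script checks for
VIDEO_MARKERS = (".mp4", "v.redd.it")


def passes_conditions(text_length: int, num_votes: int, visual_count: int,
                      min_votes: int = 100) -> bool:
    """Check the description-length, engagement, and visual conditions."""
    has_valid_description = 100 <= text_length <= 300
    has_engagement = num_votes >= min_votes  # assumption: votes stand in for shares
    has_visual = visual_count > 0
    return has_valid_description and has_engagement and has_visual


def pick_threads_visual(visual_urls: Sequence[str]) -> Optional[str]:
    """Prefer a video when a post has both image and video, else the first image."""
    for url in visual_urls:
        if any(marker in url.lower() for marker in VIDEO_MARKERS):
            return url
    return visual_urls[0] if visual_urls else None
```

The 15-second video limit is not checked here because it needs the clip duration, which is not among the listed data fields.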



## ⭐ **Reddit Post Data Fields**

The main data fields extracted from each Reddit post:

| Field Name         | Python Data Type | Description |
|--------------------|-----------------|-------------|
| <span style="color:green">**post_title**</span>     | `str`            | Title of the post. |
| **post_link**      | `str`            | Direct URL to the post. |
| **post_id**        | `str`            | Unique identifier for the post. |
| <span style="color:red">**num_votes**</span>      | `int`            | Total number of upvotes the post received. |
| <span style="color:red">**num_comments**</span>   | `int`            | Total number of comments on the post. |
| **text_length**    | `int`            | Character count of the post's text description. |
| <span style="color:green">**post_description**</span> | `str`          | Full text description or caption of the post. |
| **post_date**      | `str`            | Calendar date when the post was published, formatted as text (e.g., `"Monday, January 01, 2024"`). |
| **post_time**      | `str`            | Time (with timezone) when the post was published (e.g., `"03:45:10 PM UTC"`). |
| <span style="color:green">**post_visual**</span>    | `list[str]`      | Direct URLs to visual content (images or videos). |
| **visual_type**    | `str`            | Type of visual content: `"IMAGE"`, `"VIDEO"`, `"CAROUSEL"`, or `"NONE"`. |
| **visual_count**   | `int`            | Number of visual items in the post. |
| <span style="color:blue">**filter**</span>         | `str`            | Reddit search filter used. Must be one of: `"Relevance"`, `"Hot"`, `"Top"`, `"New"`, `"Comment count"`. |
| <span style="color:blue">**keyword**</span>        | `str`            | Search keyword or query. |
| <span style="color:blue">**limit**</span>          | `int`            | Number of posts requested. |
| <span style="color:blue">**period**</span>         | `str`            | Time filter used when searching posts. Must be one of: `"All time"`, `"Past year"`, `"Past month"`, `"Past week"`, `"Today"`, `"Past hour"`. |
| <span style="color:red">**time_ago**</span>       | `str`            | Relative time since the post was published (e.g., `"3 hours ago"`). |
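
One way to hold a row of this table in code is a small dataclass. The sketch below is illustrative only: `RedditPostRecord` is a hypothetical name, dates are kept as plain text for simplicity, and the two count fields are derived from the description and the visual list instead of being stored.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RedditPostRecord:
    """One scraped post; field names follow the table above."""
    post_title: str
    post_link: str
    post_id: str
    num_votes: int = 0
    num_comments: int = 0
    post_description: str = ""
    post_date: str = "N/A"   # kept as text in this sketch
    post_time: str = "N/A"
    time_ago: str = "N/A"
    post_visual: List[str] = field(default_factory=list)
    visual_type: str = "NONE"

    @property
    def text_length(self) -> int:
        """Character count of the description."""
        return len(self.post_description)

    @property
    def visual_count(self) -> int:
        """Number of visual URLs attached to the post."""
        return len(self.post_visual)
```

The search-parameter fields (`filter`, `keyword`, `limit`, `period`) describe the query rather than a single post, so this sketch leaves them out.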



## ⭐ **POST LINK - EXPLORING**

| Link             | Keyword       | Filter       |
|------------------------|-----------------------|-----------------------|
| https://www.reddit.com/search/?q=music             |  ```music```  | ```Relevance``` by default |
| https://www.reddit.com/search/?q=***keyword***             |  ```keyword```  | ```Relevance``` by default |
| https://www.reddit.com/search/?q=music&type=posts&sort=hot             |  ```music``` |  ```Hot```  |
| https://www.reddit.com/search/?q=music&type=posts&sort=top             |  ```music``` |  ```Top```  |
| https://www.reddit.com/search/?q=***keyword***&type=posts&sort=***filter***             |  ```keyword``` |  ```filter```  |
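
The URL pattern in the table can be assembled programmatically. The following is a minimal sketch: `build_search_url` is a hypothetical helper, and only the `hot` and `top` sort values are taken from the table, while `relevance`, `new`, and `comments` are assumptions about what Reddit accepts.

```python
from typing import Optional
from urllib.parse import urlencode

SEARCH_BASE = "https://www.reddit.com/search/"

# "hot" and "top" appear in the table above; the other values are assumptions.
SORT_VALUES = {
    "Relevance": "relevance",
    "Hot": "hot",
    "Top": "top",
    "New": "new",
    "Comment count": "comments",
}


def build_search_url(keyword: str, sort_label: Optional[str] = None) -> str:
    """Build a search URL; with no sort label the site defaults to Relevance."""
    params = {"q": keyword}
    if sort_label is not None:
        params["type"] = "posts"
        params["sort"] = SORT_VALUES[sort_label]
    # urlencode escapes spaces and special characters in the keyword
    return f"{SEARCH_BASE}?{urlencode(params)}"
```

For example, `build_search_url("music", "Hot")` reproduces the third row of the table.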


## ⭐ **FILTER**

| Filter       | What it does                                             | Use it when                                      |
|--------------|----------------------------------------------------------|-------------------------------------------------|
| ```Relevance```    | Shows posts most related to your search terms           | You want posts that best match your search query |
| ```Hot```          | Shows currently trending posts based on upvotes, recency, and engagement | You want popular and active posts right now    |
| ```Top```          | Shows posts with the highest score for a given time period | You want the most upvoted or "best" content for a topic |
| ```New```          | Shows posts in chronological order, newest first       | You want fresh content or to track recent activity |
| ```Comment count```     | Sorts posts by number of comments (descending)          | You want posts with lots of community discussion |

## ⭐ **PERIOD**

| Period       | What it does                                             | Use it when                                      |
|--------------|----------------------------------------------------------|-------------------------------------------------|
| ```All time```    | Applies no time limit (`t=all`)           | You want the best-performing posts regardless of age |
| ```Past year```          | Limits results to posts from the past year (`t=year`) | You want long-running favorites that are still fairly recent |
| ```Past month```          | Limits results to posts from the past month (`t=month`) | You want posts that performed well over recent weeks |
| ```Past week```          | Limits results to posts from the past week (`t=week`) | You want this week's popular posts |
| ```Today```          | Limits results to posts from the past day (`t=day`) | You want posts published within the last day |
| ```Past hour```     | Limits results to posts from the past hour (`t=hour`) | You want posts that are only just gaining traction |
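
Each period label corresponds to a `t=` query value. The mapping below mirrors `get_period_param` in the V4 script; the strict lookup around it is only a sketch. `period_to_param` is a hypothetical helper that raises on an unknown label, whereas the script falls back to `month`.

```python
PERIOD_PARAMS = {
    "All time": "all",
    "Past year": "year",
    "Past month": "month",
    "Past week": "week",
    "Today": "day",
    "Past hour": "hour",
}


def period_to_param(label: str) -> str:
    """Return the `t=` value for a period label, ignoring case and extra spaces."""
    lookup = {name.lower(): value for name, value in PERIOD_PARAMS.items()}
    key = " ".join(label.split()).lower()
    if key not in lookup:
        valid = ", ".join(PERIOD_PARAMS)
        raise ValueError(f"Unknown period {label!r}; expected one of: {valid}")
    return lookup[key]
```

Failing loudly on a typo such as `"Past weak"` avoids silently scraping a different time window than the one requested.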


## ⭐ **DOWNLOAD AUTOMATICALLY**

| Visual Type | File Naming Formula | Example File Names |
|-------------|-------------------|------------------|
| ```CAROUSEL```    | **keyword_filter_period_**`<post_number>_<type>_<sequence>` | **keyword_filter_period_**`1_img_01`<br>**keyword_filter_period_**`1_img_02`<br>**keyword_filter_period_**`1_img_03` |
| ```IMAGE```       | **keyword_filter_period_**`<post_number>_<type>` | **keyword_filter_period_**`2_img`<br>**keyword_filter_period_**`3_img`<br>**keyword_filter_period_**`4_img` |
| ```VIDEO```       | **keyword_filter_period_**`<post_number>_<type>` | **keyword_filter_period_**`5_vid`<br>**keyword_filter_period_**`6_vid`<br>**keyword_filter_period_**`7_vid` |
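
The naming formula can be expressed as a small function. This is a sketch under assumptions: `build_visual_name` is a hypothetical helper, the file extension is left out because the script picks it from the response's content type, and the keyword and period are normalized the same way the script does (spaces to underscores, period lowercased).

```python
from typing import Optional

# Short tags used in the file names for each media kind
TYPE_TAGS = {"IMAGE": "img", "VIDEO": "vid"}


def build_visual_name(keyword: str, filter_name: str, period: str,
                      post_number: int, media: str,
                      sequence: Optional[int] = None) -> str:
    """Compose keyword_filter_period_<post_number>_<type>[_<sequence>]."""
    keyword_part = keyword.strip().replace(" ", "_")
    period_part = period.strip().replace(" ", "_").lower()
    name = f"{keyword_part}_{filter_name}_{period_part}_{post_number}_{TYPE_TAGS[media]}"
    if sequence is not None:  # carousel items get a two-digit sequence suffix
        name += f"_{sequence:02d}"
    return name
```

Calling `build_visual_name("music", "hot", "Past week", 1, "IMAGE", 1)` gives the name of the first item in a carousel, and omitting `sequence` gives the single-image or single-video form.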


## ⭐ **HOW TO USE - STEPS**

| Step | Instruction |
|------|-------------|
| 01  | Run the code |
| 02  | Input what you want to search (example: `"music"`) |
| 03  | Input the Reddit search filter: `"Relevance"`, `"Hot"`, `"Top"`, `"New"`, `"Comment count"` |
| 04  | Input the time filter: `"All time"`, `"Past year"`, `"Past month"`, `"Past week"`, `"Today"`, `"Past hour"` |
| 05  | Input the limit of how many posts you want to extract (example: `"17"`) |
| 06  | Use the CSV file for data analysis and access the post visual content files in `../data/visuals/` |
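
For step 06, a first pass over the CSV might rank posts by votes. The sketch below uses only the standard library (pandas would work equally well); `top_posts` is a hypothetical helper, and it assumes the CSV has a `num_votes` column as listed in the data fields table.

```python
import csv
from typing import Dict, List


def top_posts(csv_path: str, n: int = 5) -> List[Dict[str, str]]:
    """Return the n rows with the highest num_votes from a scraper CSV."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    def votes(row: Dict[str, str]) -> int:
        try:
            return int(float(row.get("num_votes", 0)))
        except (TypeError, ValueError):
            return 0  # rows with "N/A" or blank votes sort last

    return sorted(rows, key=votes, reverse=True)[:n]
```

With the script's output naming, a call could look like `top_posts("../data/reddit/music_hot_past_week.csv", 10)`; the exact file name depends on the keyword, filter, and period entered.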


## ⭐ **SEARCH FILTER 5 CHOICES**

| No. | Filter |
|------|-------------|
| **01**   | `"Hot"` |
| **02**   | `"Top"` |
| **03**   | `"New"` |
| **04**   | `"Comment count"` |
| **05**   | `"Relevance"` |


## ⭐ **PERIOD FILTER 6 CHOICES**

| No. | Period |
|------|-------------|
| **01**   | `"All time"` |
| **02**   | `"Past year"` |
| **03**   | `"Past month"` |
| **04**   | `"Past week"` |
| **05**   | `"Today"` |
| **06**   | `"Past hour"` |

<hr>

# ⚙️ INSTALL BEFORE RUNNING CODE

<hr>


## 1️⃣ INSTALL using **GIT BASH**


<style>
h1 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: black;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>

In [None]:
# Core scraping and data packages
pip install requests beautifulsoup4 lxml pandas fake-useragent

# Selenium and WebDriver manager
pip install selenium webdriver-manager

# Playwright and the Chromium browser it drives
pip install playwright
playwright install chromium

# yt-dlp (the V4 script calls it for audio downloads; audio extraction also needs ffmpeg)
pip install yt-dlp

<hr>

## 2️⃣ INSTALL using **NOTEBOOK**


<hr>

In [None]:
# Install core packages
%pip install requests beautifulsoup4 lxml pandas fake-useragent

# Install Selenium and WebDriver manager
%pip install selenium webdriver-manager

# Install Playwright and Chromium browser
%pip install playwright
!playwright install chromium

# Install yt-dlp (the V4 script calls it for audio downloads)
%pip install yt-dlp

<hr>

# 🧰 V4 - STABLE SCRIPT


<style>
h1 {
    text-align: center;
    color: orange;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import os
import re
import urllib.parse
import json
from datetime import datetime, timedelta
from urllib.parse import urlparse
import subprocess  # 🔥 for yt-dlp audio

def create_sample_main_data(keyword, limit):
    """Generate sample Reddit main data if input file missing"""
    keyword_clean = keyword.replace(' ', '_')
    sample_data = {
        'post_title': [f'{keyword.title()} post {i+1}' for i in range(limit)],
        'post_link': [f'https://www.reddit.com/r/{keyword}/{i+1}/title{i+1}/' for i in range(limit)],
        'post_id': [f'{i+1}' for i in range(limit)]
    }
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
    pd.DataFrame(sample_data).to_csv(INPUT_FILE, index=False)
    print(f"✅ Created sample MAIN data ({limit} posts): {INPUT_FILE}")
    return INPUT_FILE

def get_period_param(period_filter):
    """🔥 EXACT HTML MATCH: Convert display text → Reddit API 't=' parameter"""
    period_map = {
        'All time': 'all',
        'Past year': 'year',
        'Past month': 'month',
        'Past week': 'week',
        'Today': 'day',
        'Past hour': 'hour'
    }
    return period_map.get(period_filter, 'month')

def fetch_reddit_posts_search(search_keyword, filter='hot', limit=50, period_filter=None):
    """🔥 UPDATED: Uses EXACT HTML period_filter values"""
    print(f"🔍 Fetching UP TO {limit} {filter} posts for '{search_keyword}'...")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")
   
    encoded_keyword = urllib.parse.quote(search_keyword)
    period_param = get_period_param(period_filter) if period_filter else 'month'
    search_url = f"https://www.reddit.com/search.json?q={encoded_keyword}&sort={filter}&limit=100&t={period_param}&type=link"
   
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'application/json'
    }
   
    try:
        print(f"   📡 API: q={encoded_keyword}&sort={filter}&t={period_param}...")
        response = requests.get(search_url, headers=headers, timeout=15)
        if response.status_code != 200:
            print(f"❌ Search failed: HTTP {response.status_code}")
            return None
         
        data = response.json()
        posts = []
     
        if 'data' in data and 'children' in data['data']:
            available = len(data['data']['children'])
            print(f"   ✅ API returned {available} posts available")
         
            for i, post in enumerate(data['data']['children'][:limit]):
                post_data = post['data']
                posts.append({
                    'post_title': post_data.get('title', 'N/A'),
                    'post_link': f"https://www.reddit.com{post_data.get('permalink', '')}",
                    'post_id': post_data.get('id', 'N/A'),
                    'num_votes': post_data.get('score', 0),
                    'num_comments': post_data.get('num_comments', 0),
                    'filter': filter,
                    'period_filter': period_filter or 'N/A'
                })
         
        actual_posts = len(posts)
        print(f"✅ SUCCESS: {actual_posts}/{limit} {filter} posts loaded!")
        return pd.DataFrame(posts)

    except Exception as e:
        print(f"❌ Search error: {str(e)}")
        return None

def get_viewable_image_url(url):
    """🔥 ONLY i.redd.it/xxx.png - NO preview/external-preview EVER"""
    # Pass through empty values and anything not hosted on a redd.it media domain
    if not url or 'redd.it' not in url.lower():
        return url
   
    url_lower = url.lower()
   
    if 'i.redd.it' in url_lower:
        parsed = urllib.parse.urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
   
    match = re.search(r'preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'external-preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'/([a-z0-9]{13})\.', url_lower)
   
    if match:
        media_id = match.group(1)
        return f"https://i.redd.it/{media_id}.png"
   
    return url

def format_post_date(created_utc):
    """Convert Reddit UTC timestamp to readable date/time"""
    from datetime import timezone  # imported here so the UTC conversion is self-contained

    if not created_utc or created_utc == 'N/A':
        return 'N/A', 'N/A'

    try:
        timestamp = float(created_utc)
        # Convert explicitly in UTC so the "UTC" label on post_time is accurate
        dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        post_date = dt.strftime("%A, %B %d, %Y")
        post_time = dt.strftime("%I:%M:%S %p UTC")
        return post_date, post_time
    except (TypeError, ValueError, OverflowError, OSError):
        return 'ERROR', 'ERROR'

def calculate_time_ago(post_date_str, post_time_str):
    """🔥 Calculate '1 year 2 months 3 weeks 4 days ago' format (hours only when under a day)"""
    from datetime import timezone

    if post_date_str == 'N/A' or post_time_str == 'N/A':
        return 'N/A'

    try:
        datetime_str = f"{post_date_str} {post_time_str.replace(' UTC', '')}"
        post_dt = datetime.strptime(datetime_str, "%A, %B %d, %Y %I:%M:%S %p")

        # post_dt is naive UTC, so compare it with the current UTC time
        now = datetime.now(timezone.utc).replace(tzinfo=None)
        delta = now - post_dt

        # Peel off each unit from the remainder so the units never overlap
        years, remainder = divmod(delta.days, 365)
        months, remainder = divmod(remainder, 30)
        weeks, days = divmod(remainder, 7)
        hours = delta.seconds // 3600

        parts = []
        if years > 0:
            parts.append(f"{years} year" + ("s" if years > 1 else ""))
        if months > 0:
            parts.append(f"{months} month" + ("s" if months > 1 else ""))
        if weeks > 0:
            parts.append(f"{weeks} week" + ("s" if weeks > 1 else ""))
        if days > 0:
            parts.append(f"{days} day" + ("s" if days > 1 else ""))
        if hours > 0 and len(parts) == 0:
            parts.append(f"{hours} hour" + ("s" if hours > 1 else ""))

        if not parts:
            return "just now"

        return " ".join(parts) + " ago"

    except (AttributeError, TypeError, ValueError):
        return 'ERROR'

def get_enhanced_media_candidates(media_id):
    """🔥 Generate ALL possible media URLs for a media_id (videos/images/gifs/audio)"""
    return [
        # VIDEOS (highest quality first)
        f"https://v.redd.it/{media_id}/DASH_1080.mp4",
        f"https://v.redd.it/{media_id}/DASH_720.mp4",
        f"https://v.redd.it/{media_id}/DASH_480.mp4",
        f"https://v.redd.it/{media_id}/DASH_360.mp4",
        f"https://v.redd.it/{media_id}/DASH_1080",
        f"https://v.redd.it/{media_id}/DASH_720",
        f"https://v.redd.it/{media_id}/DASH_480",
        f"https://v.redd.it/{media_id}/DASH_audio.mp4",
        f"https://v.redd.it/{media_id}/audio.m4a",
        f"https://v.redd.it/{media_id}/DASH_audio",
     
        # DIRECT VIDEOS
        f"https://i.redd.it/{media_id}.mp4",
        f"https://i.redd.it/{media_id}.webm",
     
        # IMAGES/GIFS
        f"https://i.redd.it/{media_id}.gif",
        f"https://i.redd.it/{media_id}.png",
        f"https://i.redd.it/{media_id}.jpg",
        f"https://i.redd.it/{media_id}.jpeg"
    ]

def _ext_from_content_type(content_type):
    """Map a Content-Type value to a file extension ('.bin' for audio or unknown types)"""
    if 'video' in content_type:
        return '.mp4' if 'mp4' in content_type else '.webm'
    if 'image' in content_type:
        if 'gif' in content_type:
            return '.gif'
        if 'png' in content_type:
            return '.png'
        return '.jpg'  # jpeg and any other image type
    return '.bin'

def test_url_working(url, headers_browser, timeout=10):
    """🔥 Returns (working_url, content_type, file_ext) or None if broken"""
    media_types = ('video', 'image', 'audio')
    try:
        # Cheap check first: HEAD gives the type and size without the body
        resp = requests.head(url, headers=headers_browser, timeout=timeout, allow_redirects=True)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            size = int(resp.headers.get('content-length', 0) or 0)
            if size > 1000 and any(t in content_type for t in media_types):
                return url, content_type, _ext_from_content_type(content_type)

        # Fallback to GET if HEAD fails or lacks usable headers
        resp = requests.get(url, headers=headers_browser, timeout=timeout, stream=True)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            size = len(resp.content)
            if size > 1000 and any(t in content_type for t in media_types):
                return url, content_type, _ext_from_content_type(content_type)

    except (requests.RequestException, ValueError):
        pass
    return None

# 🔥 AUDIO NAMING FORMAT
def download_reddit_audio_only(post_url, keyword_clean, filter_param, period_str, post_number, audio_folder):
    """🚀 PERFECT AUDIO ONLY with EXACT naming: keyword_filter_period_postnumber_audio.m4a"""
    # Extract post ID for validation
    post_id = re.search(r'/comments/([a-zA-Z0-9]+)', post_url)
    if not post_id:
        return None
    post_id = post_id.group(1)

    # 🔥 NAMING FORMAT
    audio_filename = f"{keyword_clean}_{filter_param}_{period_str}_{post_number}_audio.m4a"
    audio_path = os.path.join(audio_folder, audio_filename)

    # Skip if already exists
    if os.path.exists(audio_path):
        print(f"   📁 Audio exists: {audio_filename}")
        return audio_path

    # 🔥 yt-dlp AUDIO command
    cmd = [
        'yt-dlp',
        '--extract-audio',       # Audio only (no video)
        '--audio-format', 'm4a', # Convert the audio to M4A
        '--audio-quality', '0',  # Best available quality (0 = best; not lossless)
        '--embed-metadata',      # Title, uploader info
        '-o', audio_path,        # EXACT filename required
        post_url
    ]

    try:
        print(f"   🎵 Running yt-dlp → {audio_filename}")
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
        if result.returncode == 0 and os.path.exists(audio_path):
            file_size = os.path.getsize(audio_path) / 1024
            print(f"   ✅ Audio saved: {audio_filename} ({file_size:.1f}KB)")
            return audio_path
        else:
            print(f"   ❌ yt-dlp failed: {result.stderr[:100]}")
    except subprocess.TimeoutExpired:
        print("   ⏰ yt-dlp timeout")
    except Exception as e:
        print(f"   ⚠️  yt-dlp error: {str(e)[:50]}")
    
    return None

def download_visual_auto(post_number, visual_type, visual_urls, base_filename, visual_folder):
    """🔥 DOWNLOADS visuals → ONLY WORKING LINKS → PROPER EXTENSIONS → ../data/visuals/"""
    downloaded_files = []
   
    headers_browser = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'video/*,image/*,audio/*,*/*;q=0.8',
        'Referer': 'https://www.reddit.com/'
    }
   
    working_urls = []
   
    if visual_type == 'CAROUSEL':
        # CAROUSEL: Test each URL for each sequence number
        for seq_idx, url in enumerate(visual_urls, 1):
            seq_str = f"{seq_idx:02d}"
            result = test_url_working(url, headers_browser)
         
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
             
                # Determine file type prefix
                if 'video' in content_type:
                    file_prefix = 'vid'
                else:
                    file_prefix = 'img'
             
                filename = f"{base_filename}_{file_prefix}_{seq_str}{file_ext}"
                filepath = os.path.join(visual_folder, filename)
             
                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    continue
             
                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        size = len(resp.content)
                        with open(filepath, 'wb') as f:
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                     
                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
            else:
                print(f"   ❌ Broken URL: {url[:60]}...")
   
    else:
        # SINGLE IMAGE/VIDEO: Test each URL once
        for url in visual_urls:
            result = test_url_working(url, headers_browser)
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
             
                # Determine file type prefix
                if 'video' in content_type:
                    file_prefix = 'vid'
                    filename = f"{base_filename}_vid{file_ext}"
                else:
                    file_prefix = 'img'
                    filename = f"{base_filename}_img{file_ext}"
             
                filepath = os.path.join(visual_folder, filename)
             
                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    break
             
                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        size = len(resp.content)
                        with open(filepath, 'wb') as f:
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                     
                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                        break  # Success! Done with this post
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
                break
   
    return downloaded_files, working_urls

def extract_post_details_complete(keyword, filter='hot', limit=50, period_filter=None):
    """🔥 MAIN FUNCTION - WORKING LINKS ONLY + PROPER EXTENSIONS + AUDIO NAMING"""
    keyword_clean = keyword.replace(' ', '_')
    period_str = period_filter.replace(' ', '_').lower() if period_filter else 'all_time'
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    OUTPUT_FILE = f"../data/reddit/{keyword_clean}_{filter}_{period_str}.csv"
    VISUALS_FOLDER = f"../data/visuals/{keyword_clean}_{filter}_{period_str}"
    AUDIO_FOLDER = f"../data/audio/{keyword_clean}_{filter}_{period_str}"
   
    print(f"📡 Fetching EXACTLY {limit} {filter} posts (Period: {period_filter or 'All time'})...")
    df = fetch_reddit_posts_search(keyword, filter, limit, period_filter)

    if df is None or df.empty:
        print("⚠️  Search failed → Using sample data")
        create_sample_main_data(keyword_clean, limit)
        df = pd.read_csv(INPUT_FILE).head(limit)
    else:
        os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
        df.to_csv(INPUT_FILE, index=False)
        print(f"✅ Saved {len(df)} REAL posts → {INPUT_FILE}")

    total_posts = len(df)
    print(f"\n🚀 PROCESSING {total_posts} posts → {OUTPUT_FILE}")
    print(f"💾 DOWNLOADING visuals to → {VISUALS_FOLDER}")
    print(f"🎵 DOWNLOADING audio to → {AUDIO_FOLDER}")
    print("=" * 100)
   
    os.makedirs(VISUALS_FOLDER, exist_ok=True)
    os.makedirs(AUDIO_FOLDER, exist_ok=True)
   
    new_data = []
    session = requests.Session()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.reddit.com/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin'
    }
    session.headers.update(headers)
   
    def extract_post_id(url):
        if pd.isna(url): return None
        url = str(url).strip()
        match = re.search(r'/comments/([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        match = re.search(r't3_([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        return None
   
    def get_visual_type_count(visual):
        if visual in ['N/A', 'MEDIA_ERROR', 'ERROR']:
            return 'NONE', 0
        visual_str = str(visual).lower()
        if any(x in visual_str for x in ['.mp4', 'v.redd.it', 'youtube.com', 'youtu.be']):
            return 'VIDEO', 1
        if '\n' in visual_str:
            return 'CAROUSEL', len(visual_str.splitlines())
        if 'i.redd.it' in visual_str or any(ext in visual_str for ext in ['.jpg', '.png', '.gif']):
            return 'IMAGE', 1
        return 'OTHER', 1
   
    def calculate_text_length(description):
        if not description or description in ['N/A', 'ERROR', 'INVALID_LINK']:
            return 0
        text = re.sub(r'http[s]?://\S+', '', str(description))
        text = re.sub(r'\s+', ' ', text).strip()
        return len(text)
   
    # 🔥 ENHANCED: Comprehensive visual extraction with carousel + video support
    def extract_visual_urls(post_info):
        visual_urls = []
        try:
            # 1. REDDIT VIDEO (highest priority)
            if post_info.get('is_video') and post_info.get('media', {}).get('reddit_video'):
                fallback_url = post_info['media']['reddit_video'].get('fallback_url', '')
                if fallback_url:
                    visual_urls.append(fallback_url)
                    return visual_urls
     
            # 2. YOUTUBE/EXTERNAL VIDEO
            if any(domain in post_info.get('url', '').lower() for domain in ['youtube.com', 'youtu.be', 'v.redd.it']):
                visual_urls.append(post_info['url'])
                return visual_urls
     
            # 🔥 3. CAROUSEL - NEW ENHANCED EXTRACTION
            gallery_data = post_info.get('gallery_data')
            if gallery_data and gallery_data.get('items'):
                for item in gallery_data['items']:
                    if isinstance(item, dict) and 'media_id' in item:
                        media_id = item['media_id']
                        # 🔥 Try ALL possible formats for this media_id
                        media_candidates = get_enhanced_media_candidates(media_id)
                        for candidate_url in media_candidates:
                            visual_urls.append(candidate_url)
                if visual_urls:
                    return visual_urls
     
            # 4. SINGLE IMAGE
            post_url = post_info.get('url', '')
            viewable_url = get_viewable_image_url(post_url)
            if viewable_url and 'i.redd.it' in viewable_url:
                return [viewable_url]
     
            # 5. PREVIEW IMAGES
            if post_info.get('preview', {}).get('images'):
                for img in post_info['preview']['images']:
                    source_url = img.get('source', {}).get('url', '')
                    if source_url:
                        viewable_url = get_viewable_image_url(source_url)
                        if 'i.redd.it' in viewable_url:
                            return [viewable_url]
     
            # 6. THUMBNAIL FALLBACK
            if post_info.get('thumbnail') and 'i.redd.it' in post_info['thumbnail']:
                return [post_info['thumbnail']]
         
        except:
            pass
        return visual_urls
   
    # 🔥 FULL PROGRESS TRACKING + AUTO DOWNLOAD + WORKING LINKS ONLY + AUDIO NAMING
    for idx, row in df.iterrows():
        progress = f"{idx+1:2d}/{total_posts}"
        post_title = str(row['post_title'])[:60]
        post_number = idx + 1
        print(f"🔍 [{progress}] {post_title}...")
     
        post_data = {
            'post_title': row.get('post_title', 'N/A'),
            'post_link': row.get('post_link', 'N/A'),
            'post_id': 'N/A',
            'num_votes': row.get('num_votes', 'N/A'),
            'num_comments': row.get('num_comments', 'N/A'),
            'filter': filter,
            'period_filter': period_filter or 'N/A',
            'post_date': 'N/A',
            'post_time': 'N/A',
            'time_ago': 'N/A',
            'text_length': 0,
            'post_description': 'N/A',
            'post_visual': 'N/A',
            'visual_type': 'NONE',
            'visual_count': 0,
            'downloaded_files': 'N/A',
            'audio_file': 'N/A'
        }
     
        post_id = extract_post_id(row['post_link'])
        if not post_id:
            print(f"   ‚ùå [{progress}] Invalid link - SKIPPED")
            new_data.append(post_data)
            time.sleep(0.5)
            continue
     
        print(f"   üîó [{progress}] Post ID: {post_id}")
     
        try:
            response = session.get(f"https://www.reddit.com/comments/{post_id}.json", timeout=10)
            if response.status_code == 200:
                data = response.json()
                if len(data) > 0 and 'data' in data[0]:
                    post_info = data[0]['data']['children'][0]['data']
                 
                    post_data.update({
                        'post_id': post_id,
                        'num_votes': str(post_info.get('score', 'N/A')),
                    })
                 
                    # üî• DATE/TIME + TIME_AGO
                    created_utc = post_info.get('created_utc')
                    post_date, post_time = format_post_date(created_utc)
                    post_data['post_date'] = post_date
                    post_data['post_time'] = post_time
                    post_data['time_ago'] = calculate_time_ago(post_date, post_time)
                 
                    print(f"   üìÖ [{progress}] {post_date} | üïê {post_time[:12]} | ‚è∞ {post_data['time_ago']}")
                 
                    selftext = post_info.get('selftext', '')[:2000]
                    if selftext.strip():
                        post_data['post_description'] = selftext
                        post_data['text_length'] = calculate_text_length(selftext)
                        print(f"   üìù [{progress}] {post_data['text_length']} chars")
                 
                    # üî• ENHANCED VISUAL EXTRACTION + TEST WORKING LINKS + AUTO DOWNLOAD
                    all_candidate_urls = extract_visual_urls(post_info)
                    base_filename = f"{keyword_clean}_{filter}_{period_str}_{post_number}"
                 
                    if all_candidate_urls:
                        print(f"   üñºÔ∏è [{progress}] Testing {len(all_candidate_urls)} candidate URLs...")
                     
                        # 🔥 TEST + DOWNLOAD with EXACT naming + PROPER EXTENSIONS!
                        # More than one candidate URL means the post is treated as a carousel.
                        downloaded_files, working_urls = download_visual_auto(
                            post_number, 'CAROUSEL' if len(all_candidate_urls) > 1 else 'IMAGE',
                            all_candidate_urls, base_filename, VISUALS_FOLDER
                        )
                     
                        # üî• ONLY WORKING LINKS go to post_visual
                        if working_urls:
                            post_data['post_visual'] = '\n'.join(working_urls)
                            vtype, vcount = get_visual_type_count(post_data['post_visual'])
                            post_data.update({'visual_type': vtype, 'visual_count': vcount})
                            post_data['downloaded_files'] = '; '.join(downloaded_files) if downloaded_files else 'ERROR'
                         
                            print(f"   ‚úÖ [{progress}] {vtype} ({vcount}) - {len(working_urls)} WORKING URLs!")
                            print(f"   üíæ {len(downloaded_files)} files saved!")
                        else:
                            print(f"   ‚ùå [{progress}] No working URLs found")
                    else:
                        print(f"   ‚ûñ [{progress}] No visuals")
                 
                    # 🔥 PERFECT AUDIO DOWNLOAD with EXACT NAMING
                    # post_id is already known to be valid here (invalid links were skipped above).
                    if post_data['visual_type'] == 'VIDEO':
                        print(f"   üéµ [{progress}] Extracting audio...")
                        # üî• PASS ALL PARAMETERS for PERFECT NAMING
                        audio_path = download_reddit_audio_only(
                            post_data['post_link'], 
                            keyword_clean, 
                            filter, 
                            period_str, 
                            post_number, 
                            AUDIO_FOLDER
                        )
                        if audio_path:
                            audio_filename = os.path.basename(audio_path)
                            post_data['audio_file'] = audio_filename  # e.g. oban_star_racers_hot_all_time_2_audio.m4a
                            print(f"   ‚úÖ [{progress}] Audio: {audio_filename}")
                        else:
                            print(f"   ‚ûñ [{progress}] No audio extracted")
                 
                    print(f"   üéâ [{progress}] COMPLETE ‚úì")
                else:
                    print(f"   ‚ùå [{progress}] No post data")
            else:
                print(f"   ‚ùå [{progress}] HTTP {response.status_code}")
         
        except Exception as e:
            print(f"   ‚ö†Ô∏è  [{progress}] Error: {str(e)[:40]}")
     
        new_data.append(post_data)
        time.sleep(2.5)  # Rate limiting
        print()  # Empty line
   
    os.makedirs(os.path.dirname(OUTPUT_FILE), exist_ok=True)
    new_df = pd.DataFrame(new_data, columns=[
        'post_title', 'post_link', 'post_id', 'num_votes', 'num_comments',
        'filter', 'period_filter', 'post_date', 'post_time', 'time_ago', 'text_length',
        'post_description', 'post_visual', 'visual_type', 'visual_count', 'downloaded_files',
        'audio_file'  # üî• PERFECT NAMING
    ])
    new_df.to_csv(OUTPUT_FILE, index=False)
   
    print(f"\n🎉 SAVED {len(new_df)}/{limit} posts → {OUTPUT_FILE}")
    print(f"💾 ALL VISUALS → {VISUALS_FOLDER}/")
    print(f"🎵 ALL AUDIO → {AUDIO_FOLDER}/ (keyword_filter_period_postnumber_audio.m4a)")
    print("✅ post_visual = WORKING LINKS ONLY!")
    return new_df

# 🔥 COMPLETE INTERACTIVE RUN
if __name__ == "__main__":
    print("🚀 REDDIT EXTRACTOR + VISUALS + AUDIO NAMING!")
    print("=" * 60)

    # 1. KEYWORD FIRST
    keyword = input("Enter keyword: ").strip() or 'music'

    # 2. FILTER NEXT
    print("\n🔥 Filters: 1=hot, 2=top, 3=new, 4=comments, 5=relevance")
    choice = input("Choose filter [1]: ").strip() or '1'
    filter_map = {'1': 'hot', '2': 'top', '3': 'new', '4': 'comments', '5': 'relevance'}
    filter = filter_map.get(choice, 'hot')

    # 3. PERIOD NEXT (for relevance/top/comments)
    period_filter = None
    if filter in ['relevance', 'top', 'comments']:
        print("\n⏰ PERIOD FILTER (HTML dropdown match):")
        print("1=All time, 2=Past year, 3=Past month, 4=Past week, 5=Today, 6=Past hour")
        period_choice = input("Choose period [2=Past year]: ").strip() or '2'
        period_map = {
            '1': 'All time', '2': 'Past year', '3': 'Past month',
            '4': 'Past week', '5': 'Today', '6': 'Past hour'
        }
        period_filter = period_map.get(period_choice, 'Past year')
        print(f"   ✅ Using period: {period_filter} → API t={get_period_param(period_filter)}")

    # 4. LIMIT LAST
    limit_input = input("\nHow many posts? (1-100) [20]: ").strip()
    limit = int(limit_input) if limit_input.isdigit() else 20
    limit = min(max(limit, 1), 100)

    print(f"\n🔥 Scraping {limit} '{keyword}' {filter.upper()} posts...")
    print("✅ WORKING LINKS + PROPER .png/.mp4 + AUDIO: keyword_filter_period_postnumber_audio.m4a")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")
    result = extract_post_details_complete(keyword, filter, limit, period_filter)

    period_filename = period_filter.replace(' ', '_').lower() if period_filter else 'all_time'
    print(f"\n✅ DONE! {len(result)} posts + visuals + audio → ../data/")
    print(f"🎵 Audio files → ../data/audio/{keyword.replace(' ','_')}_{filter}_{period_filename}/")
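
The naming scheme printed by the script above (`keyword_filter_period_postnumber_audio.m4a`) can be sketched as one small helper. This is only an illustration: `build_media_filename` and its `suffix`/`ext` parameters are hypothetical names that do not exist in the script, which instead builds the same string inline from `keyword_clean`, `filter`, `period_str`, and `post_number`.

```python
def build_media_filename(keyword, sort_filter, period_filter, post_number,
                         suffix="audio", ext=".m4a"):
    """Hypothetical helper mirroring the script's file-naming scheme."""
    # Spaces in the keyword become underscores, as the script does for keyword_clean.
    keyword_clean = keyword.replace(' ', '_')
    # A missing period falls back to 'all_time', as the script does for period_str.
    period_str = period_filter.replace(' ', '_').lower() if period_filter else 'all_time'
    return f"{keyword_clean}_{sort_filter}_{period_str}_{post_number}_{suffix}{ext}"


print(build_media_filename("oban star racers", "hot", None, 2))
# prints: oban_star_racers_hot_all_time_2_audio.m4a
```

Keeping the scheme in one place like this would make it harder for the visual and audio file names to drift apart, though the script itself does not do so.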


<hr>

# 🧰 V5 - STABLE SCRIPT


<style>
h1 {
    text-align: center;
    color: hotpink;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import os
import re
import urllib.parse
import json
from datetime import datetime, timedelta
from urllib.parse import urlparse
import subprocess  # üî• for yt-dlp audio


def create_sample_main_data(keyword, limit):
    """Generate sample Reddit main data if input file missing"""
    keyword_clean = keyword.replace(' ', '_')
    sample_data = {
        'post_title': [f'{keyword.title()} post {i+1}' for i in range(limit)],
        'post_link': [f'https://www.reddit.com/r/{keyword}/{i+1}/title{i+1}/' for i in range(limit)],
        'post_id': [f'{i+1}' for i in range(limit)]
    }
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
    pd.DataFrame(sample_data).to_csv(INPUT_FILE, index=False)
    print(f"✅ Created sample MAIN data ({limit} posts): {INPUT_FILE}")
    return INPUT_FILE


def get_period_param(period_filter):
    """🔥 EXACT HTML MATCH: Convert display text → Reddit API 't=' parameter"""
    period_map = {
        'All time': 'all',
        'Past year': 'year',
        'Past month': 'month',
        'Past week': 'week',
        'Today': 'day',
        'Past hour': 'hour'
    }
    return period_map.get(period_filter, 'month')


def fetch_reddit_posts_search(search_keyword, filter='hot', limit=50, period_filter=None):
    """🔥 UPDATED: Uses EXACT HTML period_filter values"""
    print(f"🔍 Fetching UP TO {limit} {filter} posts for '{search_keyword}'...")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")

    encoded_keyword = urllib.parse.quote(search_keyword)
    period_param = get_period_param(period_filter) if period_filter else 'month'
    search_url = f"https://www.reddit.com/search.json?q={encoded_keyword}&sort={filter}&limit=100&t={period_param}&type=link"

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'application/json'
    }

    try:
        print(f"   📡 API: q={encoded_keyword}&sort={filter}&t={period_param}...")
        response = requests.get(search_url, headers=headers, timeout=15)
        if response.status_code != 200:
            print(f"❌ Search failed: HTTP {response.status_code}")
            return None

        data = response.json()
        posts = []

        if 'data' in data and 'children' in data['data']:
            available = len(data['data']['children'])
            print(f"   ✅ API returned {available} posts")

            # Keep only the first `limit` results of the page returned by the API.
            for post in data['data']['children'][:limit]:
                post_data = post['data']
                posts.append({
                    'post_title': post_data.get('title', 'N/A'),
                    'post_link': f"https://www.reddit.com{post_data.get('permalink', '')}",
                    'post_id': post_data.get('id', 'N/A'),
                    'num_votes': post_data.get('score', 0),
                    'num_comments': post_data.get('num_comments', 0),
                    'filter': filter,
                    'period_filter': period_filter or 'N/A'
                })

        actual_posts = len(posts)
        print(f"✅ SUCCESS: {actual_posts}/{limit} {filter} posts loaded!")
        return pd.DataFrame(posts)

    except Exception as e:
        print(f"❌ Search error: {str(e)}")
        return None


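def get_viewable_image_url(url):
    """🔥 ONLY i.redd.it/xxx.png - NO preview/external-preview EVER"""
    # Reddit media is served from redd.it hosts (i.redd.it, preview.redd.it), which do not
    # contain 'reddit.com', so both domains must pass this guard.
    if not url or not any(host in url.lower() for host in ('redd.it', 'reddit.com')):
        return url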
   
    url_lower = url.lower()
   
    if 'i.redd.it' in url_lower:
        parsed = urllib.parse.urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
   
    match = re.search(r'preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'external-preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'/([a-z0-9]{13})\.', url_lower)
   
    if match:
        media_id = match.group(1)
        return f"https://i.redd.it/{media_id}.png"
   
    return url


# Naive UTC epoch; adding a timedelta avoids depending on the machine's local timezone.
_EPOCH_UTC = datetime(1970, 1, 1)


def format_post_date(created_utc):
    """Convert Reddit's 'created_utc' epoch timestamp to readable UTC date/time strings"""
    if not created_utc or created_utc == 'N/A':
        return 'N/A', 'N/A'

    try:
        # datetime.fromtimestamp() would return local time, contradicting the "UTC" label.
        dt = _EPOCH_UTC + timedelta(seconds=float(created_utc))
        post_date = dt.strftime("%A, %B %d, %Y")
        post_time = dt.strftime("%I:%M:%S %p UTC")
        return post_date, post_time
    except (TypeError, ValueError, OverflowError):
        return 'ERROR', 'ERROR'


def calculate_time_ago(post_date_str, post_time_str):
    """🔥 Calculate '1 year 2 months 3 weeks 4 days ago' format (hours only when under a day)"""
    if post_date_str == 'N/A' or post_time_str == 'N/A':
        return 'N/A'

    try:
        datetime_str = f"{post_date_str} {post_time_str.replace(' UTC', '')}"
        post_dt = datetime.strptime(datetime_str, "%A, %B %d, %Y %I:%M:%S %p")

        # Compare against the current time in UTC, matching format_post_date().
        now = _EPOCH_UTC + timedelta(seconds=time.time())
        delta = now - post_dt
        if delta.total_seconds() < 0:
            return "just now"  # Clock skew: the post appears to be in the future

        # Peel each unit off the remainder of the previous one
        # (approximating a year as 365 days and a month as 30 days).
        years, remainder = divmod(delta.days, 365)
        months, remainder = divmod(remainder, 30)
        weeks, days = divmod(remainder, 7)
        hours = delta.seconds // 3600

        parts = []
        if years > 0:
            parts.append(f"{years} year" + ("s" if years > 1 else ""))
        if months > 0:
            parts.append(f"{months} month" + ("s" if months > 1 else ""))
        if weeks > 0:
            parts.append(f"{weeks} week" + ("s" if weeks > 1 else ""))
        if days > 0:
            parts.append(f"{days} day" + ("s" if days > 1 else ""))
        if hours > 0 and len(parts) == 0:
            parts.append(f"{hours} hour" + ("s" if hours > 1 else ""))

        if not parts:
            return "just now"

        time_ago = " ".join(parts) + " ago"
        return time_ago

    except Exception:
        return 'ERROR'


def get_enhanced_media_candidates(media_id):
    """🔥 Generate ALL possible media URLs for a media_id (videos/images/gifs/audio)"""
    return [
        # VIDEOS (highest quality first)
        f"https://v.redd.it/{media_id}/DASH_1080.mp4",
        f"https://v.redd.it/{media_id}/DASH_720.mp4",
        f"https://v.redd.it/{media_id}/DASH_480.mp4",
        f"https://v.redd.it/{media_id}/DASH_360.mp4",
        f"https://v.redd.it/{media_id}/DASH_1080",
        f"https://v.redd.it/{media_id}/DASH_720",
        f"https://v.redd.it/{media_id}/DASH_480",
        f"https://v.redd.it/{media_id}/DASH_audio.mp4",
        f"https://v.redd.it/{media_id}/audio.m4a",
        f"https://v.redd.it/{media_id}/DASH_audio",
    
        # DIRECT VIDEOS
        f"https://i.redd.it/{media_id}.mp4",
        f"https://i.redd.it/{media_id}.webm",
    
        # IMAGES/GIFS
        f"https://i.redd.it/{media_id}.gif",
        f"https://i.redd.it/{media_id}.png",
        f"https://i.redd.it/{media_id}.jpg",
        f"https://i.redd.it/{media_id}.jpeg"
    ]


def _ext_from_content_type(content_type):
    """Map a media Content-Type to a file extension ('.bin' for audio or unknown types)."""
    if 'video' in content_type:
        return '.mp4' if 'mp4' in content_type else '.webm'
    if 'image' in content_type:
        if 'gif' in content_type:
            return '.gif'
        if 'png' in content_type:
            return '.png'
        return '.jpg'  # jpeg and any other image type
    return '.bin'


def test_url_working(url, headers_browser, timeout=10):
    """🔥 Returns (working_url, content_type, file_ext) or None if broken"""
    def is_media(content_type, size):
        return size > 1000 and any(t in content_type for t in ('video', 'image', 'audio'))

    try:
        # Cheap check first: HEAD exposes the type and size without downloading the body.
        resp = requests.head(url, headers=headers_browser, timeout=timeout, allow_redirects=True)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            size = int(resp.headers.get('content-length', 0) or 0)
            if is_media(content_type, size):
                return url, content_type, _ext_from_content_type(content_type)

        # Fall back to GET when HEAD fails or reports no usable Content-Length.
        resp = requests.get(url, headers=headers_browser, timeout=timeout)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            if is_media(content_type, len(resp.content)):
                return url, content_type, _ext_from_content_type(content_type)

    except Exception:
        pass
    return None


# 🔥 FIXED: PERFECT AUDIO NAMING FORMAT - SAVES TO ../data/audio/
def download_reddit_audio_only(post_url, keyword_clean, filter_param, period_str, post_number, audio_folder):
    """🚀 PERFECT AUDIO ONLY with EXACT naming: keyword_filter_period_postnumber_audio.m4a"""
    # Validate that the URL points at a Reddit post before invoking yt-dlp
    if not re.search(r'/comments/([a-zA-Z0-9]+)', post_url):
        return None

    # 🔥 EXACT REQUIRED NAMING FORMAT
    audio_filename = f"{keyword_clean}_{filter_param}_{period_str}_{post_number}_audio.m4a"
    audio_path = os.path.join(audio_folder, audio_filename)

    # Skip if already exists
    if os.path.exists(audio_path):
        print(f"   📁 Audio exists: {audio_filename}")
        return audio_path

    # 🔥 yt-dlp audio-only command
    cmd = [
        'yt-dlp',
        '--extract-audio',        # Audio only (no video)
        '--audio-format', 'm4a',  # Convert to M4A
        '--audio-quality', '0',   # Best quality setting (0 = best); the encode is still lossy
        '--embed-metadata',       # Title, uploader info
        '-o', audio_path,         # EXACT filename required
        post_url
    ]

    try:
        print(f"   🎵 Running yt-dlp → {audio_filename}")
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
        if result.returncode == 0 and os.path.exists(audio_path):
            file_size = os.path.getsize(audio_path) / 1024
            print(f"   ✅ Audio saved: {audio_filename} ({file_size:.1f}KB)")
            return audio_path
        else:
            print(f"   ❌ yt-dlp failed: {result.stderr[:100]}")
    except subprocess.TimeoutExpired:
        print("   ⏰ yt-dlp timeout")
    except Exception as e:
        print(f"   ⚠️  yt-dlp error: {str(e)[:50]}")

    return None


def download_visual_auto(post_number, visual_type, visual_urls, base_filename, visuals_folder, videos_folder):
    """🔥 DOWNLOADS visuals → ONLY WORKING LINKS → PROPER EXTENSIONS → CORRECT FOLDERS BY TYPE"""
    downloaded_files = []
   
    headers_browser = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'video/*,image/*,audio/*,*/*;q=0.8',
        'Referer': 'https://www.reddit.com/'
    }
   
    working_urls = []
   
    if visual_type == 'CAROUSEL':
        # CAROUSEL: Test each URL for each sequence number
        for seq_idx, url in enumerate(visual_urls, 1):
            seq_str = f"{seq_idx:02d}"
            result = test_url_working(url, headers_browser)
        
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
                
                # üî• ROUTE TO CORRECT FOLDER BY CONTENT TYPE
                if 'video' in content_type:
                    target_folder = videos_folder
                    file_prefix = 'vid'
                else:  # images, gifs
                    target_folder = visuals_folder
                    file_prefix = 'img'
                
                filename = f"{base_filename}_{file_prefix}_{seq_str}{file_ext}"
                filepath = os.path.join(target_folder, filename)
            
                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    continue

                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        # Count bytes while streaming; reading resp.content first would
                        # load the whole file into memory and defeat stream=True.
                        size = 0
                        with open(filepath, 'wb') as f:
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                                size += len(chunk)

                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
            else:
                print(f"   ❌ Broken URL: {url[:60]}...")
    
    else:
        # SINGLE IMAGE/VIDEO: Test each URL once
        for url in visual_urls:
            result = test_url_working(url, headers_browser)
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
                
                # üî• ROUTE TO CORRECT FOLDER BY CONTENT TYPE
                if 'video' in content_type:
                    target_folder = videos_folder
                    file_prefix = 'vid'
                    filename = f"{base_filename}_vid{file_ext}"
                else:  # images, gifs
                    target_folder = visuals_folder
                    file_prefix = 'img'
                    filename = f"{base_filename}_img{file_ext}"
                
                filepath = os.path.join(target_folder, filename)
            
                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    break

                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        # Count bytes while streaming; reading resp.content first would
                        # load the whole file into memory and defeat stream=True.
                        size = 0
                        with open(filepath, 'wb') as f:
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                                size += len(chunk)

                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                        break  # Success! Done with this post
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
                break
   
    return downloaded_files, working_urls


def extract_post_details_complete(keyword, filter='hot', limit=50, period_filter=None):
    """üî• MAIN FUNCTION - WORKING LINKS ONLY + PROPER EXTENSIONS + PERFECT AUDIO + CORRECT FOLDERS"""
    keyword_clean = keyword.replace(' ', '_')
    period_str = period_filter.replace(' ', '_').lower() if period_filter else 'all_time'
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    OUTPUT_FILE = f"../data/reddit/{keyword_clean}_{filter}_{period_str}.csv"
    
    # üî• NEW FOLDER STRUCTURE PER SPECS
    VIDEOS_FOLDER = "../data/videos"           # Videos (mute)
    VISUALS_FOLDER = "../data/visuals"         # Images, GIFs, Videos+Audio  
    AUDIO_FOLDER = "../data/audio"             # Audio files
    
    print(f"📡 Fetching EXACTLY {limit} {filter} posts (Period: {period_filter or 'All time'})...")
    df = fetch_reddit_posts_search(keyword, filter, limit, period_filter)

    if df is None or df.empty:
        print("⚠️  Search failed → Using sample data")
        create_sample_main_data(keyword_clean, limit)
        df = pd.read_csv(INPUT_FILE).head(limit)
    else:
        os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
        df.to_csv(INPUT_FILE, index=False)
        print(f"✅ Saved {len(df)} REAL posts → {INPUT_FILE}")

    total_posts = len(df)
    print(f"\n🚀 PROCESSING {total_posts} posts → {OUTPUT_FILE}")
    print(f"💾 VIDEOS(mute) → {VIDEOS_FOLDER}/")
    print(f"🖼️  IMAGES/GIFs → {VISUALS_FOLDER}/")
    print(f"🎵 AUDIO → {AUDIO_FOLDER}/")
    print("=" * 100)
   
    os.makedirs(VIDEOS_FOLDER, exist_ok=True)
    os.makedirs(VISUALS_FOLDER, exist_ok=True)
    os.makedirs(AUDIO_FOLDER, exist_ok=True)
   
    new_data = []
    session = requests.Session()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.reddit.com/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin'
    }
    session.headers.update(headers)
   
    def extract_post_id(url):
        if pd.isna(url): return None
        url = str(url).strip()
        match = re.search(r'/comments/([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        match = re.search(r't3_([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        return None
   
    def get_visual_type_count(visual):
        if visual in ['N/A', 'MEDIA_ERROR', 'ERROR']:
            return 'NONE', 0
        visual_str = str(visual).lower()
        if any(x in visual_str for x in ['.mp4', 'v.redd.it', 'youtube.com', 'youtu.be']):
            return 'VIDEO', 1
        if '\n' in visual_str:
            return 'CAROUSEL', len(visual_str.splitlines())
        if 'i.redd.it' in visual_str or any(ext in visual_str for ext in ['.jpg', '.png', '.gif']):
            return 'IMAGE', 1
        return 'OTHER', 1
   
    def calculate_text_length(description):
        if not description or description in ['N/A', 'ERROR', 'INVALID_LINK']:
            return 0
        text = re.sub(r'http[s]?://\S+', '', str(description))
        text = re.sub(r'\s+', ' ', text).strip()
        return len(text)
   
    # üî• ENHANCED: Comprehensive visual extraction with carousel + video support
    def extract_visual_urls(post_info):
        visual_urls = []
        try:
            # 1. REDDIT VIDEO (highest priority)
            # 'media' can be present but null in the post JSON, so guard with `or {}`.
            if post_info.get('is_video') and (post_info.get('media') or {}).get('reddit_video'):
                fallback_url = post_info['media']['reddit_video'].get('fallback_url', '')
                if fallback_url:
                    visual_urls.append(fallback_url)
                    return visual_urls
    
            # 2. YOUTUBE/EXTERNAL VIDEO
            if any(domain in post_info.get('url', '').lower() for domain in ['youtube.com', 'youtu.be', 'v.redd.it']):
                visual_urls.append(post_info['url'])
                return visual_urls
    
            # üî• 3. CAROUSEL - NEW ENHANCED EXTRACTION
            gallery_data = post_info.get('gallery_data')
            if gallery_data and gallery_data.get('items'):
                for item in gallery_data['items']:
                    if isinstance(item, dict) and 'media_id' in item:
                        media_id = item['media_id']
                        # üî• Try ALL possible formats for this media_id
                        media_candidates = get_enhanced_media_candidates(media_id)
                        for candidate_url in media_candidates:
                            visual_urls.append(candidate_url)
                if visual_urls:
                    return visual_urls
    
            # 4. SINGLE IMAGE
            post_url = post_info.get('url', '')
            viewable_url = get_viewable_image_url(post_url)
            if viewable_url and 'i.redd.it' in viewable_url:
                return [viewable_url]
    
            # 5. PREVIEW IMAGES
            if post_info.get('preview', {}).get('images'):
                for img in post_info['preview']['images']:
                    source_url = img.get('source', {}).get('url', '')
                    if source_url:
                        viewable_url = get_viewable_image_url(source_url)
                        if 'i.redd.it' in viewable_url:
                            return [viewable_url]
    
            # 6. THUMBNAIL FALLBACK
            if post_info.get('thumbnail') and 'i.redd.it' in post_info['thumbnail']:
                return [post_info['thumbnail']]
        
        except Exception:
            # Best effort: on malformed post data, fall through with whatever was collected.
            pass
        return visual_urls
   
    # üî• FULL PROGRESS TRACKING + AUTO DOWNLOAD + WORKING LINKS ONLY + PERFECT AUDIO + CORRECT FOLDERS!
    for idx, row in df.iterrows():
        progress = f"{idx+1:2d}/{total_posts}"
        post_title = str(row['post_title'])[:60]
        post_number = idx + 1
        print(f"üîç [{progress}] {post_title}...")
    
        post_data = {
            'post_title': row.get('post_title', 'N/A'),
            'post_link': row.get('post_link', 'N/A'),
            'post_id': 'N/A',
            'num_votes': row.get('num_votes', 'N/A'),
            'num_comments': row.get('num_comments', 'N/A'),
            'filter': filter,
            'period_filter': period_filter or 'N/A',
            'post_date': 'N/A',
            'post_time': 'N/A',
            'time_ago': 'N/A',
            'text_length': 0,
            'post_description': 'N/A',
            'post_visual': 'N/A',
            'visual_type': 'NONE',
            'visual_count': 0,
            'downloaded_files': 'N/A',
            'audio_file': 'N/A'  # üî• PERFECT NAMING COLUMN
        }
    
        post_id = extract_post_id(row['post_link'])
        if not post_id:
            print(f"   ‚ùå [{progress}] Invalid link - SKIPPED")
            new_data.append(post_data)
            time.sleep(0.5)
            continue
    
        print(f"   üîó [{progress}] Post ID: {post_id}")
    
        try:
            response = session.get(f"https://www.reddit.com/comments/{post_id}.json", timeout=10)
            if response.status_code == 200:
                data = response.json()
                if len(data) > 0 and 'data' in data[0]:
                    post_info = data[0]['data']['children'][0]['data']
                
                    post_data.update({
                        'post_id': post_id,
                        'num_votes': str(post_info.get('score', 'N/A')),
                    })
                
                    # 🔥 DATE/TIME + TIME_AGO
                    created_utc = post_info.get('created_utc')
                    post_date, post_time = format_post_date(created_utc)
                    post_data['post_date'] = post_date
                    post_data['post_time'] = post_time
                    post_data['time_ago'] = calculate_time_ago(post_date, post_time)

                    print(f"   📅 [{progress}] {post_date} | 🕐 {post_time[:12]} | ⏰ {post_data['time_ago']}")

                    selftext = post_info.get('selftext', '')[:2000]
                    if selftext.strip():
                        post_data['post_description'] = selftext
                        post_data['text_length'] = calculate_text_length(selftext)
                        print(f"   📝 [{progress}] {post_data['text_length']} chars")

                    # 🔥 ENHANCED VISUAL EXTRACTION + TEST WORKING LINKS + AUTO DOWNLOAD TO CORRECT FOLDERS
                    all_candidate_urls = extract_visual_urls(post_info)
                    base_filename = f"{keyword_clean}_{filter}_{period_str}_{post_number}"

                    if all_candidate_urls:
                        print(f"   🖼️ [{progress}] Testing {len(all_candidate_urls)} candidate URLs...")

                        # 🔥 TEST + DOWNLOAD with EXACT naming + PROPER EXTENSIONS + CORRECT FOLDERS!
                        # More than one candidate URL is treated as a carousel
                        downloaded_files, working_urls = download_visual_auto(
                            post_number, 'CAROUSEL' if len(all_candidate_urls) > 1 else 'IMAGE',
                            all_candidate_urls, base_filename, VISUALS_FOLDER, VIDEOS_FOLDER  # 🔥 PASS BOTH FOLDERS
                        )

                        # 🔥 ONLY WORKING LINKS go to post_visual!
                        if working_urls:
                            post_data['post_visual'] = '\n'.join(working_urls)
                            vtype, vcount = get_visual_type_count(post_data['post_visual'])
                            post_data.update({'visual_type': vtype, 'visual_count': vcount})
                            post_data['downloaded_files'] = '; '.join(downloaded_files) if downloaded_files else 'ERROR'

                            print(f"   ✅ [{progress}] {vtype} ({vcount}) - {len(working_urls)} WORKING URLs!")
                            print(f"   💾 {len(downloaded_files)} files saved!")
                        else:
                            print(f"   ❌ [{progress}] No working URLs found")
                    else:
                        print(f"   ➖ [{progress}] No visuals")

                    # 🔥 PERFECT AUDIO DOWNLOAD with EXACT NAMING TO ../data/audio/
                    if post_data['visual_type'] == 'VIDEO' and post_id:
                        print(f"   🎵 [{progress}] Extracting audio...")
                        # 🔥 PASS ALL PARAMETERS for PERFECT NAMING
                        audio_path = download_reddit_audio_only(
                            post_data['post_link'],
                            keyword_clean,
                            filter,
                            period_str,
                            post_number,
                            AUDIO_FOLDER  # 🔥 FIXED: ../data/audio/
                        )
                        if audio_path:
                            audio_filename = os.path.basename(audio_path)
                            post_data['audio_file'] = audio_filename  # e.g. oban_star_racers_hot_all_time_2_audio.m4a
                            print(f"   ✅ [{progress}] Audio: {audio_filename}")
                        else:
                            print(f"   ➖ [{progress}] No audio extracted")

                    print(f"   🎉 [{progress}] COMPLETE ✓")
                else:
                    print(f"   ❌ [{progress}] No post data")
            else:
                print(f"   ❌ [{progress}] HTTP {response.status_code}")

        except Exception as e:
            print(f"   ⚠️  [{progress}] Error: {str(e)[:40]}")
    
        new_data.append(post_data)
        time.sleep(2.5)  # Rate limiting
        print()  # Empty line
   
    os.makedirs(os.path.dirname(OUTPUT_FILE), exist_ok=True)
    new_df = pd.DataFrame(new_data, columns=[
        'post_title', 'post_link', 'post_id', 'num_votes', 'num_comments',
        'filter', 'period_filter', 'post_date', 'post_time', 'time_ago', 'text_length',
        'post_description', 'post_visual', 'visual_type', 'visual_count', 'downloaded_files',
        'audio_file'  # 🔥 PERFECT NAMING
    ])
    new_df.to_csv(OUTPUT_FILE, index=False)

    print(f"\n🎉 SAVED {len(new_df)}/{limit} posts → {OUTPUT_FILE}")
    print(f"💾 VIDEOS(mute) → {VIDEOS_FOLDER}/")
    print(f"🖼️  VISUALS → {VISUALS_FOLDER}/")
    print(f"🎵 AUDIO → {AUDIO_FOLDER}/ (keyword_filter_period_postnumber_audio.m4a)")
    print("✅ post_visual = WORKING LINKS ONLY!")
    return new_df


# 🔥 COMPLETE INTERACTIVE RUN
if __name__ == "__main__":
    print("🚀 REDDIT EXTRACTOR + VISUALS + PERFECT AUDIO NAMING!")
    print("=" * 60)

    # 1. KEYWORD FIRST
    keyword = input("Enter keyword: ").strip() or 'music'

    # 2. FILTER NEXT
    print("\n🔥 Filters: 1=hot, 2=top, 3=new, 4=comments, 5=relevance")
    choice = input("Choose filter [1]: ").strip() or '1'
    filter_map = {'1': 'hot', '2': 'top', '3': 'new', '4': 'comments', '5': 'relevance'}
    filter = filter_map.get(choice, 'hot')

    # 3. PERIOD NEXT (for relevance/top/comments)
    period_filter = None
    if filter in ['relevance', 'top', 'comments']:
        print("\n⏰ PERIOD FILTER (HTML dropdown match):")
        print("1=All time, 2=Past year, 3=Past month, 4=Past week, 5=Today, 6=Past hour")
        period_choice = input("Choose period [2=Past year]: ").strip() or '2'
        period_map = {
            '1': 'All time', '2': 'Past year', '3': 'Past month',
            '4': 'Past week', '5': 'Today', '6': 'Past hour'
        }
        period_filter = period_map.get(period_choice, 'Past year')
        print(f"   ✅ Using period: {period_filter} → API t={get_period_param(period_filter)}")

    # 4. LIMIT LAST
    limit_input = input("\nHow many posts? (1-100) [20]: ").strip()
    limit = int(limit_input) if limit_input.isdigit() else 20
    limit = min(max(limit, 1), 100)

    print(f"\n🔥 Scraping {limit} '{keyword}' {filter.upper()} posts...")
    print("✅ VIDEOS→../data/videos/ | IMAGES→../data/visuals/ | AUDIO→../data/audio/")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")
    result = extract_post_details_complete(keyword, filter, limit, period_filter)

    print(f"\n✅ DONE! {len(result)} posts + media → ../data/")
    print("📁 Videos(mute): ../data/videos/")
    print("📁 Visuals: ../data/visuals/")
    print("🎵 Audio: ../data/audio/")


🚀 REDDIT EXTRACTOR + VISUALS + PERFECT AUDIO NAMING!

🔥 Filters: 1=hot, 2=top, 3=new, 4=comments, 5=relevance

🔥 Scraping 10 'oban star racers' HOT posts...
✅ VIDEOS→../data/videos/ | IMAGES→../data/visuals/ | AUDIO→../data/audio/
📡 Fetching EXACTLY 10 hot posts (Period: All time)...
🔍 Fetching UP TO 10 hot posts for 'oban star racers'...
   📡 API: q=oban%20star%20racers&sort=hot&t=month...
   ✅ API returned 100 posts available
✅ SUCCESS: 10/10 hot posts loaded!
✅ Saved 10 REAL posts → ../data/reddit/oban_star_racers_main.csv

🚀 PROCESSING 10 posts → ../data/reddit/oban_star_racers_hot_all_time.csv
💾 VIDEOS(mute) → ../data/videos/
🖼️  IMAGES/GIFs → ../data/visuals/
🎵 AUDIO → ../data/audio/
🔍 [ 1/10] NEW OFICIAL OBAN STAR RACERS COMIC...
   🔗 [ 1/10] Post ID: 1pk06te
   📅 [ 1/10] Thursday, December 11, 2025 | 🕐 04:23:24 PM  | ⏰ 6 hours ago
   📝 [ 1/10] 160 chars
   🖼️ [ 1/10] Testing 1 candidate URLs...
   ...

<hr>

# 🪜 STEPS EXTRACT+DOWNLOAD


<style>
h1 {
    text-align: left;
    color: black;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>



| Type                     | Directory              | Description                          |
|--------------------------|----------------------|--------------------------------------|
| **Videos (mute)**            | `../data/videos/`       | Video files with no audio            |
| **Audio files**              | `../data/audio/`        | Audio-only files                     |
| **Images, GIFs, Videos** | `../data/visuals/`   | Visual files including images, GIFs, and videos with audio |
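
The folder layout in the table can be pictured as a small routing helper. The sketch below is illustrative only: `route_media` and its prefix scheme are hypothetical names, and it assumes the HTTP `Content-Type` of a downloaded file decides its destination, which is how the download code in this notebook routes videos, GIFs, and images. The audio branch is an assumption added for completeness, since audio in the notebook is produced separately by yt-dlp.

```python
# Folder layout taken from the table above.
VIDEOS_FOLDER = "../data/videos"
VISUALS_FOLDER = "../data/visuals"
AUDIO_FOLDER = "../data/audio"


def route_media(content_type, base_filename, file_ext):
    """Return the target path for a file, chosen from its Content-Type.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    content_type = content_type.lower()
    if content_type.startswith("video/"):
        folder, prefix = VIDEOS_FOLDER, "vid"
    elif content_type.startswith("audio/"):
        folder, prefix = AUDIO_FOLDER, "audio"
    elif "gif" in content_type:
        folder, prefix = VISUALS_FOLDER, "gif"
    else:
        # Any other image type (PNG, JPEG, ...) is stored with the visuals
        folder, prefix = VISUALS_FOLDER, "img"
    return f"{folder}/{base_filename}_{prefix}{file_ext}"


if __name__ == "__main__":
    print(route_media("video/mp4", "music_hot_all_time_1", ".mp4"))
    print(route_media("image/gif", "music_hot_all_time_2", ".gif"))
```

Keeping the routing in one function means the folder rules live in a single place instead of being repeated in each download branch.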

<hr>

## 🗃️ 1 - FIX EXTRACTION OF GIF CAROUSEL

<style>
h1 {
    text-align: center;
    color: purple;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>
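
As a rough sketch of the carousel fix, the snippet below classifies a gallery by how many of its URLs are GIFs, mirroring the `CAROUSEL_GIF` / `CAROUSEL_IMAGE` / `CAROUSEL_MIXED` labels used by the notebook code in this section. `classify_carousel` is a hypothetical standalone name and the example URLs are placeholders, not real posts.

```python
def classify_carousel(urls):
    """Classify a list of gallery URLs by the share of GIFs it contains.

    Returns a (visual_type, count) tuple. Hypothetical helper for illustration.
    """
    if not urls:
        return "NONE", 0
    gif_count = sum(1 for url in urls if ".gif" in url.lower())
    if gif_count == len(urls):
        return "CAROUSEL_GIF", len(urls)
    if gif_count > 0:
        return "CAROUSEL_MIXED", len(urls)
    return "CAROUSEL_IMAGE", len(urls)


if __name__ == "__main__":
    # Placeholder URLs, not real Reddit media
    gallery = ["https://i.redd.it/aaa111.gif", "https://i.redd.it/bbb222.png"]
    print(classify_carousel(gallery))
```

Working on a list of URLs rather than a newline-joined string avoids re-splitting the text and makes the empty case explicit.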


In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
import time
import os
import re
import urllib.parse
import json
from datetime import datetime, timedelta
from urllib.parse import urlparse
import subprocess  # for yt-dlp audio


def create_sample_main_data(keyword, limit):
    """Generate sample Reddit main data if input file missing"""
    keyword_clean = keyword.replace(' ', '_')
    sample_data = {
        'post_title': [f'{keyword.title()} post {i+1}' for i in range(limit)],
        'post_link': [f'https://www.reddit.com/r/{keyword}/{i+1}/title{i+1}/' for i in range(limit)],
        'post_id': [f'{i+1}' for i in range(limit)]
    }
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
    pd.DataFrame(sample_data).to_csv(INPUT_FILE, index=False)
    print(f"✅ Created sample MAIN data ({limit} posts): {INPUT_FILE}")
    return INPUT_FILE


def get_period_param(period_filter):
    """Convert display text to Reddit API 't=' parameter"""
    period_map = {
        'All time': 'all',
        'Past year': 'year',
        'Past month': 'month',
        'Past week': 'week',
        'Today': 'day',
        'Past hour': 'hour'
    }
    return period_map.get(period_filter, 'month')


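def fetch_reddit_posts_search(search_keyword, filter='hot', limit=50, period_filter=None):
    """Search Reddit for `search_keyword` and return up to `limit` posts as a DataFrame.

    `period_filter` is one of 'All time', 'Past year', 'Past month', 'Past week',
    'Today', or 'Past hour'. When it is omitted, the request uses t=month.
    Returns None if the request fails.
    """
    print(f"🔍 Fetching UP TO {limit} {filter} posts for '{search_keyword}'...")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")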
   
    encoded_keyword = urllib.parse.quote(search_keyword)
    period_param = get_period_param(period_filter) if period_filter else 'month'
    search_url = f"https://www.reddit.com/search.json?q={encoded_keyword}&sort={filter}&limit=100&t={period_param}&type=link"
   
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'application/json'
    }

    try:
        print(f"   📡 API: q={encoded_keyword}&sort={filter}&t={period_param}...")
        response = requests.get(search_url, headers=headers, timeout=15)
        if response.status_code != 200:
            print(f"❌ Search failed: HTTP {response.status_code}")
            return None
        
        data = response.json()
        posts = []
    
        if 'data' in data and 'children' in data['data']:
            available = len(data['data']['children'])
            print(f"   ✅ API returned {available} posts available")
        
            for i, post in enumerate(data['data']['children'][:limit]):
                post_data = post['data']
                posts.append({
                    'post_title': post_data.get('title', 'N/A'),
                    'post_link': f"https://www.reddit.com{post_data.get('permalink', '')}",
                    'post_id': post_data.get('id', 'N/A'),
                    'num_votes': post_data.get('score', 0),
                    'num_comments': post_data.get('num_comments', 0),
                    'filter': filter,
                    'period_filter': period_filter or 'N/A'
                })
        
        actual_posts = len(posts)
        print(f"✅ SUCCESS: {actual_posts}/{limit} {filter} posts loaded!")
        return pd.DataFrame(posts)

    except Exception as e:
        print(f"❌ Search error: {str(e)}")
        return None


def get_viewable_image_url(url):
    """ONLY i.redd.it/xxx.png - NO preview/external-preview EVER"""
    # Reddit media hosts live under redd.it (i.redd.it, preview.redd.it), not reddit.com
    if not url or 'redd.it' not in url.lower():
        return url
   
    url_lower = url.lower()
   
    if 'i.redd.it' in url_lower:
        parsed = urllib.parse.urlparse(url)
        return f"{parsed.scheme}://{parsed.netloc}{parsed.path}"
   
    match = re.search(r'preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'external-preview\.redd\.it/([a-z0-9]+)', url_lower)
    if not match:
        match = re.search(r'/([a-z0-9]{13})\.', url_lower)
   
    if match:
        media_id = match.group(1)
        return f"https://i.redd.it/{media_id}.png"
   
    return url


from datetime import timezone  # used by the UTC conversions below


def format_post_date(created_utc):
    """Convert Reddit UTC timestamp to readable date/time (in UTC)"""
    if not created_utc or created_utc == 'N/A':
        return 'N/A', 'N/A'

    try:
        timestamp = float(created_utc)
        # Convert in UTC so the value matches the "UTC" label below
        dt = datetime.fromtimestamp(timestamp, tz=timezone.utc)
        post_date = dt.strftime("%A, %B %d, %Y")
        post_time = dt.strftime("%I:%M:%S %p UTC")
        return post_date, post_time
    except Exception:
        return 'ERROR', 'ERROR'


def calculate_time_ago(post_date_str, post_time_str):
    """Return elapsed time such as '1 year 2 months 3 weeks 4 days ago' (hours only for posts under a day old)"""
    if post_date_str == 'N/A' or post_time_str == 'N/A':
        return 'N/A'

    try:
        datetime_str = f"{post_date_str} {post_time_str.replace(' UTC', '')}"
        post_dt = datetime.strptime(datetime_str, "%A, %B %d, %Y %I:%M:%S %p")
        post_dt = post_dt.replace(tzinfo=timezone.utc)  # the input strings are in UTC

        now = datetime.now(timezone.utc)
        delta = now - post_dt

        # Break the day count down unit by unit, each using the remainder of the previous one
        years, remaining = divmod(delta.days, 365)
        months, remaining = divmod(remaining, 30)
        weeks, days = divmod(remaining, 7)
        hours = delta.seconds // 3600

        parts = []
        if years > 0:
            parts.append(f"{years} year" + ("s" if years > 1 else ""))
        if months > 0:
            parts.append(f"{months} month" + ("s" if months > 1 else ""))
        if weeks > 0:
            parts.append(f"{weeks} week" + ("s" if weeks > 1 else ""))
        if days > 0:
            parts.append(f"{days} day" + ("s" if days > 1 else ""))
        if hours > 0 and len(parts) == 0:
            parts.append(f"{hours} hour" + ("s" if hours > 1 else ""))

        if not parts:
            return "just now"

        time_ago = " ".join(parts) + " ago"
        return time_ago

    except Exception:
        return 'ERROR'


def get_enhanced_media_candidates(media_id):
    """Generate ALL possible media URLs for a media_id (videos/images/gifs/audio)"""
    return [
        # VIDEOS (highest quality first)
        f"https://v.redd.it/{media_id}/DASH_1080.mp4",
        f"https://v.redd.it/{media_id}/DASH_720.mp4",
        f"https://v.redd.it/{media_id}/DASH_480.mp4",
        f"https://v.redd.it/{media_id}/DASH_360.mp4",
        f"https://v.redd.it/{media_id}/DASH_1080",
        f"https://v.redd.it/{media_id}/DASH_720",
        f"https://v.redd.it/{media_id}/DASH_480",
        f"https://v.redd.it/{media_id}/DASH_audio.mp4",
        f"https://v.redd.it/{media_id}/audio.m4a",
        f"https://v.redd.it/{media_id}/DASH_audio",
    
        # DIRECT VIDEOS
        f"https://i.redd.it/{media_id}.mp4",
        f"https://i.redd.it/{media_id}.webm",
    
        # IMAGES/GIFS (GIFs prioritized)
        f"https://i.redd.it/{media_id}.gif",
        f"https://i.redd.it/{media_id}.png",
        f"https://i.redd.it/{media_id}.jpg",
        f"https://i.redd.it/{media_id}.jpeg"
    ]


def test_url_working(url, headers_browser, timeout=10):
    """Returns (working_url, content_type, file_ext) or None if broken"""

    def classify(content_type, size):
        """Return the file extension for a usable media response, else None"""
        if size <= 1000 or not any(t in content_type for t in ('video', 'image', 'audio')):
            return None
        if 'video' in content_type:
            return '.mp4' if 'mp4' in content_type else '.webm'
        if 'image' in content_type:
            if 'gif' in content_type:
                return '.gif'
            if 'png' in content_type:
                return '.png'
            return '.jpg'  # JPEG and any other image type
        return '.bin'  # audio

    try:
        # Cheap check first: HEAD returns headers only
        resp = requests.head(url, headers=headers_browser, timeout=timeout, allow_redirects=True)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            size = int(resp.headers.get('content-length', 0) or 0)
            file_ext = classify(content_type, size)
            if file_ext:
                return url, content_type, file_ext

        # Fallback to GET if HEAD fails or reports no usable size
        resp = requests.get(url, headers=headers_browser, timeout=timeout)
        if resp.status_code == 200:
            content_type = resp.headers.get('content-type', '').lower()
            size = len(resp.content)
            file_ext = classify(content_type, size)
            if file_ext:
                return url, content_type, file_ext

    except Exception:
        pass
    return None


# AUDIO NAMING FORMAT - SAVES TO ../data/audio/
def download_reddit_audio_only(post_url, keyword_clean, filter_param, period_str, post_number, audio_folder):
    """🚀 PERFECT AUDIO ONLY with EXACT naming: keyword_filter_period_postnumber_audio.m4a"""
    # Validate that the URL is a Reddit post link
    if not re.search(r'/comments/([a-zA-Z0-9]+)', post_url):
        return None

    # NAMING FORMAT
    audio_filename = f"{keyword_clean}_{filter_param}_{period_str}_{post_number}_audio.m4a"
    audio_path = os.path.join(audio_folder, audio_filename)

    # Skip if already exists
    if os.path.exists(audio_path):
        print(f"   📁 Audio exists: {audio_filename}")
        return audio_path

    # 🔥 yt-dlp audio command
    cmd = [
        'yt-dlp',
        '--extract-audio',        # Audio only (no video)
        '--audio-format', 'm4a',  # Convert to M4A
        '--audio-quality', '0',   # 0 = best quality setting (M4A/AAC is still lossy)
        '--embed-metadata',       # Title, uploader info
        '-o', audio_path,         # EXACT filename required
        post_url
    ]

    try:
        print(f"   🎵 Running yt-dlp → {audio_filename}")
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=90)
        if result.returncode == 0 and os.path.exists(audio_path):
            file_size = os.path.getsize(audio_path) / 1024
            print(f"   ✅ Audio saved: {audio_filename} ({file_size:.1f}KB)")
            return audio_path
        else:
            print(f"   ❌ yt-dlp failed: {result.stderr[:100]}")
    except subprocess.TimeoutExpired:
        print("   ⏰ yt-dlp timeout")
    except Exception as e:
        print(f"   ⚠️  yt-dlp error: {str(e)[:50]}")
    
    return None


def download_visual_auto(post_number, visual_type, visual_urls, base_filename, visuals_folder, videos_folder):
    """🔥 DOWNLOADS visuals → ONLY WORKING LINKS → PROPER EXTENSIONS → CORRECT FOLDERS BY TYPE"""
    downloaded_files = []
   
    headers_browser = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'video/*,image/*,audio/*,*/*;q=0.8',
        'Referer': 'https://www.reddit.com/'
    }
   
    working_urls = []
   
    if visual_type == 'CAROUSEL':
        # CAROUSEL: Test each URL for each sequence number - PRIORITIZE GIFs!
        for seq_idx, url in enumerate(visual_urls, 1):
            seq_str = f"{seq_idx:02d}"
            result = test_url_working(url, headers_browser)
        
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
                
                # ROUTE TO CORRECT FOLDER BY CONTENT TYPE
                if 'video' in content_type:
                    target_folder = videos_folder
                    file_prefix = 'vid'
                else:  # images, gifs → visuals/
                    target_folder = visuals_folder
                    if 'gif' in content_type:
                        file_prefix = 'gif'
                    else:
                        file_prefix = 'img'

                filename = f"{base_filename}_{file_prefix}_{seq_str}{file_ext}"
                filepath = os.path.join(target_folder, filename)

                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    continue

                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        size = 0
                        with open(filepath, 'wb') as f:
                            # Stream to disk in chunks and count bytes as they are written
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                                size += len(chunk)

                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
            else:
                print(f"   ❌ Broken URL: {url[:60]}...")
    
    else:
        # SINGLE IMAGE/VIDEO/GIF: Test each URL once
        for url in visual_urls:
            result = test_url_working(url, headers_browser)
            if result:
                working_url, content_type, file_ext = result
                working_urls.append(working_url)
                
                # 🔥 ROUTE TO CORRECT FOLDER BY CONTENT TYPE
                if 'video' in content_type:
                    target_folder = videos_folder
                    file_prefix = 'vid'
                    filename = f"{base_filename}_vid{file_ext}"
                else:  # images, gifs
                    target_folder = visuals_folder
                    if 'gif' in content_type:
                        file_prefix = 'gif'
                    else:
                        file_prefix = 'img'
                    filename = f"{base_filename}_{file_prefix}{file_ext}"

                filepath = os.path.join(target_folder, filename)

                if os.path.exists(filepath):
                    print(f"   📁 SKIP {filename}")
                    downloaded_files.append(filename)
                    break

                # Download with proper extension
                try:
                    resp = requests.get(working_url, headers=headers_browser, timeout=15, stream=True)
                    if resp.status_code == 200:
                        size = 0
                        with open(filepath, 'wb') as f:
                            # Stream to disk in chunks and count bytes as they are written
                            for chunk in resp.iter_content(8192):
                                f.write(chunk)
                                size += len(chunk)

                        media_type = content_type.split('/')[0].upper()
                        print(f"   💾 [{media_type}]{file_ext} {filename} ({size/1024:.1f}KB)")
                        downloaded_files.append(filename)
                        time.sleep(0.5)
                        break  # Success! Done with this post
                except Exception as e:
                    print(f"   ⚠️  Download error: {str(e)[:30]}")
                break
   
    return downloaded_files, working_urls


def extract_post_details_complete(keyword, filter='hot', limit=50, period_filter=None):
    """MAIN FUNCTION - WORKING LINKS ONLY + PROPER EXTENSIONS + AUDIO + CORRECT FOLDERS + CAROUSEL_IMAGE/GIF"""
    keyword_clean = keyword.replace(' ', '_')
    period_str = period_filter.replace(' ', '_').lower() if period_filter else 'all_time'
    INPUT_FILE = f"../data/reddit/{keyword_clean}_main.csv"
    OUTPUT_FILE = f"../data/reddit/{keyword_clean}_{filter}_{period_str}.csv"
    
    # FOLDER STRUCTURE
    VIDEOS_FOLDER = "../data/videos"           # Videos (mute)
    VISUALS_FOLDER = "../data/visuals"         # Images, GIFs, Videos+Audio  
    AUDIO_FOLDER = "../data/audio"             # Audio files
    
    print(f"📡 Fetching EXACTLY {limit} {filter} posts (Period: {period_filter or 'All time'})...")
    df = fetch_reddit_posts_search(keyword, filter, limit, period_filter)

    if df is None or df.empty:
        print("⚠️  Search failed → Using sample data")
        create_sample_main_data(keyword_clean, limit)
        df = pd.read_csv(INPUT_FILE).head(limit)
    else:
        os.makedirs(os.path.dirname(INPUT_FILE), exist_ok=True)
        df.to_csv(INPUT_FILE, index=False)
        print(f"✅ Saved {len(df)} REAL posts → {INPUT_FILE}")

    total_posts = len(df)
    print(f"\n🚀 PROCESSING {total_posts} posts → {OUTPUT_FILE}")
    print(f"💾 VIDEOS(mute) → {VIDEOS_FOLDER}/")
    print(f"🖼️  IMAGES/GIFs → {VISUALS_FOLDER}/")
    print(f"🎵 AUDIO → {AUDIO_FOLDER}/")
    print("=" * 100)
   
    os.makedirs(VIDEOS_FOLDER, exist_ok=True)
    os.makedirs(VISUALS_FOLDER, exist_ok=True)
    os.makedirs(AUDIO_FOLDER, exist_ok=True)
   
    new_data = []
    session = requests.Session()
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
        'Accept': 'application/json, text/plain, */*',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Referer': 'https://www.reddit.com/',
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin'
    }
    session.headers.update(headers)
   
    def extract_post_id(url):
        if pd.isna(url): return None
        url = str(url).strip()
        match = re.search(r'/comments/([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        match = re.search(r't3_([a-zA-Z0-9]+)', url)
        if match: return match.group(1)
        return None
   
    # CAROUSEL_IMAGE, CAROUSEL_GIF DETECTION
    def get_visual_type_count(visual):
        if visual in ['N/A', 'MEDIA_ERROR', 'ERROR']:
            return 'NONE', 0
        visual_str = str(visual).lower()
        
        # VIDEO DETECTION
        if any(x in visual_str for x in ['.mp4', 'v.redd.it', 'youtube.com', 'youtu.be']):
            return 'VIDEO', 1
        
        # CAROUSEL DETECTION WITH GIF/IMAGE CLASSIFICATION
        if '\n' in visual_str:
            lines = visual_str.splitlines()
            gif_count = sum(1 for line in lines if '.gif' in line.lower())
            total_count = len(lines)
            
            if gif_count > 0:
                if gif_count == total_count:
                    return 'CAROUSEL_GIF', total_count
                else:
                    return 'CAROUSEL_MIXED', total_count
            else:
                return 'CAROUSEL_IMAGE', total_count
        
        # SINGLE GIF/IMAGE
        if '.gif' in visual_str or 'gif' in visual_str:
            return 'GIF', 1
        if 'i.redd.it' in visual_str or any(ext in visual_str for ext in ['.jpg', '.png']):
            return 'IMAGE', 1
        
        return 'OTHER', 1
   
    def calculate_text_length(description):
        if not description or description in ['N/A', 'ERROR', 'INVALID_LINK']:
            return 0
        text = re.sub(r'http[s]?://\S+', '', str(description))
        text = re.sub(r'\s+', ' ', text).strip()
        return len(text)
   
    # ENHANCED: Comprehensive visual extraction with carousel + video + GIF support
    def extract_visual_urls(post_info):
        visual_urls = []
        try:
            # 1. REDDIT VIDEO (highest priority)
            if post_info.get('is_video') and post_info.get('media', {}).get('reddit_video'):
                fallback_url = post_info['media']['reddit_video'].get('fallback_url', '')
                if fallback_url:
                    visual_urls.append(fallback_url)
                    return visual_urls
    
            # 2. YOUTUBE/EXTERNAL VIDEO
            if any(domain in post_info.get('url', '').lower() for domain in ['youtube.com', 'youtu.be', 'v.redd.it']):
                visual_urls.append(post_info['url'])
                return visual_urls
    
            # 3. CAROUSEL - ENHANCED GIF/IMAGE EXTRACTION
            gallery_data = post_info.get('gallery_data')
            if gallery_data and gallery_data.get('items'):
                for item in gallery_data['items']:
                    if isinstance(item, dict) and 'media_id' in item:
                        media_id = item['media_id']
                        # Try ALL possible formats for this media_id (GIFs prioritized)
                        media_candidates = get_enhanced_media_candidates(media_id)
                        for candidate_url in media_candidates:
                            visual_urls.append(candidate_url)
                if visual_urls:
                    return visual_urls
    
            # 4. SINGLE IMAGE/GIF
            post_url = post_info.get('url', '')
            viewable_url = get_viewable_image_url(post_url)
            if viewable_url and 'i.redd.it' in viewable_url:
                return [viewable_url]
    
            # 5. PREVIEW IMAGES/GIFs
            if post_info.get('preview', {}).get('images'):
                for img in post_info['preview']['images']:
                    source_url = img.get('source', {}).get('url', '')
                    if source_url:
                        viewable_url = get_viewable_image_url(source_url)
                        if viewable_url and 'i.redd.it' in viewable_url:
                            return [viewable_url]
    
            # 6. THUMBNAIL FALLBACK
            if post_info.get('thumbnail') and 'i.redd.it' in post_info['thumbnail']:
                return [post_info['thumbnail']]
        
        except Exception:
            # Malformed post data: fall through and return whatever was collected
            pass
        return visual_urls
   
    # FULL PROGRESS TRACKING + AUTO DOWNLOAD + WORKING LINKS ONLY + AUDIO + CORRECT FOLDERS!
    for idx, row in df.iterrows():
        progress = f"{idx+1:2d}/{total_posts}"
        post_title = str(row['post_title'])[:60]
        post_number = idx + 1
        print(f"🔍 [{progress}] {post_title}...")
    
        post_data = {
            'post_title': row.get('post_title', 'N/A'),
            'post_link': row.get('post_link', 'N/A'),
            'post_id': 'N/A',
            'num_votes': row.get('num_votes', 'N/A'),
            'num_comments': row.get('num_comments', 'N/A'),
            'filter': filter,
            'period_filter': period_filter or 'N/A',
            'post_date': 'N/A',
            'post_time': 'N/A',
            'time_ago': 'N/A',
            'text_length': 0,
            'post_description': 'N/A',
            'post_visual': 'N/A',
            'visual_type': 'NONE',
            'visual_count': 0,
            'downloaded_files': 'N/A',
            'audio_file': 'N/A' 
        }
    
        post_id = extract_post_id(row['post_link'])
        if not post_id:
            print(f"   ❌ [{progress}] Invalid link - SKIPPED")
            new_data.append(post_data)
            time.sleep(0.5)
            continue
    
        print(f"   🔗 [{progress}] Post ID: {post_id}")
    
        try:
            response = session.get(f"https://www.reddit.com/comments/{post_id}.json", timeout=10)
            if response.status_code == 200:
                data = response.json()
                if len(data) > 0 and 'data' in data[0]:
                    post_info = data[0]['data']['children'][0]['data']
                
                    post_data.update({
                        'post_id': post_id,
                        'num_votes': str(post_info.get('score', 'N/A')),
                    })
                
                    # DATE/TIME + TIME_AGO
                    created_utc = post_info.get('created_utc')
                    post_date, post_time = format_post_date(created_utc)
                    post_data['post_date'] = post_date
                    post_data['post_time'] = post_time
                    post_data['time_ago'] = calculate_time_ago(post_date, post_time)
                
                    print(f"   📅 [{progress}] {post_date} | 🕐 {post_time[:12]} | ⏰ {post_data['time_ago']}")
                
                    selftext = post_info.get('selftext', '')[:2000]
                    if selftext.strip():
                        post_data['post_description'] = selftext
                        post_data['text_length'] = calculate_text_length(selftext)
                        print(f"   📝 [{progress}] {post_data['text_length']} chars")
                
                    # ENHANCED VISUAL EXTRACTION + TEST WORKING LINKS + AUTO DOWNLOAD TO CORRECT FOLDERS
                    all_candidate_urls = extract_visual_urls(post_info)
                    base_filename = f"{keyword_clean}_{filter}_{period_str}_{post_number}"
                
                    if all_candidate_urls:
                        print(f"   🖼️ [{progress}] Testing {len(all_candidate_urls)} candidate URLs...")
                    
                        # TEST + DOWNLOAD with EXACT naming + PROPER EXTENSIONS + CORRECT FOLDERS!
                        downloaded_files, working_urls = download_visual_auto(
                            post_number, 'CAROUSEL' if len(all_candidate_urls) > 1 else 'IMAGE',
                            all_candidate_urls, base_filename, VISUALS_FOLDER, VIDEOS_FOLDER
                        )
                    
                        # ONLY WORKING LINKS go to post_visual! NEW CAROUSEL TYPES!
                        if working_urls:
                            post_data['post_visual'] = '\n'.join(working_urls)
                            vtype, vcount = get_visual_type_count(post_data['post_visual'])
                            post_data.update({'visual_type': vtype, 'visual_count': vcount})
                            post_data['downloaded_files'] = '; '.join(downloaded_files) if downloaded_files else 'ERROR'
                        
                            print(f"   ✅ [{progress}] {vtype} ({vcount}) - {len(working_urls)} WORKING URLs!")
                            print(f"   💾 {len(downloaded_files)} files saved!")
                        else:
                            print(f"   ❌ [{progress}] No working URLs found")
                    else:
                        print(f"   ➖ [{progress}] No visuals")
                
                    # 🔥 AUDIO DOWNLOAD with EXACT NAMING TO ../data/audio/
                    if post_data['visual_type'] in ['VIDEO'] and post_id:
                        print(f"   🎵 [{progress}] Extracting audio...")
                        audio_path = download_reddit_audio_only(
                            post_data['post_link'], 
                            keyword_clean, 
                            filter, 
                            period_str, 
                            post_number, 
                            AUDIO_FOLDER
                        )
                        if audio_path:
                            audio_filename = os.path.basename(audio_path)
                            post_data['audio_file'] = audio_filename
                            print(f"   ✅ [{progress}] Audio: {audio_filename}")
                        else:
                            print(f"   ➖ [{progress}] No audio extracted")
                
                    print(f"   🎉 [{progress}] COMPLETE ✓")
                else:
                    print(f"   ❌ [{progress}] No post data")
            else:
                print(f"   ❌ [{progress}] HTTP {response.status_code}")
        
        except Exception as e:
            print(f"   ⚠️  [{progress}] Error: {str(e)[:40]}")
    
        new_data.append(post_data)
        time.sleep(2.5)  # Rate limiting
        print()  # Empty line
   
    os.makedirs(os.path.dirname(OUTPUT_FILE), exist_ok=True)
    new_df = pd.DataFrame(new_data, columns=[
        'post_title', 'post_link', 'post_id', 'num_votes', 'num_comments',
        'filter', 'period_filter', 'post_date', 'post_time', 'time_ago', 'text_length',
        'post_description', 'post_visual', 'visual_type', 'visual_count', 'downloaded_files',
        'audio_file'
    ])
    new_df.to_csv(OUTPUT_FILE, index=False)
   
    print(f"\n🎉 SAVED {len(new_df)}/{limit} posts → {OUTPUT_FILE}")
    print(f"💾 VIDEOS(mute) → {VIDEOS_FOLDER}/")
    print(f"🖼️  VISUALS → {VISUALS_FOLDER}/ (incl. CAROUSEL_IMAGE, CAROUSEL_GIF)")
    print(f"🎵 AUDIO → {AUDIO_FOLDER}/ (keyword_filter_period_postnumber_audio.m4a)")
    print(f"✅ NEW visual_types: CAROUSEL_IMAGE, CAROUSEL_GIF, CAROUSEL_MIXED!")
    return new_df


# INTERACTIVE
if __name__ == "__main__":
    print("🚀 REDDIT EXTRACTOR + VISUALS + PERFECT AUDIO + CAROUSEL_GIF!")
    print("=" * 60)
   
    # 1. KEYWORD FIRST
    keyword = input("Enter keyword: ").strip() or 'music'
   
    # 2. FILTER NEXT
    print("\n🔥 Filters: 1=hot, 2=top, 3=new, 4=comments, 5=relevance")
    choice = input("Choose filter [1]: ").strip() or '1'
    filter_map = {'1': 'hot', '2': 'top', '3': 'new', '4': 'comments', '5': 'relevance'}
    filter = filter_map.get(choice, 'hot')
   
    # 3. PERIOD NEXT (for relevance/top/comments)
    period_filter = None
    if filter in ['relevance', 'top', 'comments']:
        print(f"\n⏰ PERIOD FILTER (HTML dropdown match):")
        print("1=All time, 2=Past year, 3=Past month, 4=Past week, 5=Today, 6=Past hour")
        period_choice = input("Choose period [2=Past year]: ").strip() or '2'
        period_map = {
            '1': 'All time', '2': 'Past year', '3': 'Past month',
            '4': 'Past week', '5': 'Today', '6': 'Past hour'
        }
        period_filter = period_map.get(period_choice, 'Past year')
        print(f"   ✅ Using period: {period_filter} → API t={get_period_param(period_filter)}")
   
    # 4. LIMIT LAST
    limit_input = input("\nHow many posts? (1-100) [20]: ").strip()
    limit = int(limit_input) if limit_input.isdigit() else 20
    limit = min(max(limit, 1), 100)
   
    print(f"\n🔥 Scraping {limit} '{keyword}' {filter.upper()} posts...")
    print(f"✅ VIDEOS→../data/videos/ | IMAGES/GIFs→../data/visuals/ | AUDIO→../data/audio/")
    print(f"✅ NEW: CAROUSEL_IMAGE, CAROUSEL_GIF detection!")
    if period_filter:
        print(f"   ⏰ Time filter: {period_filter}")
    result = extract_post_details_complete(keyword, filter, limit, period_filter)
   
    print(f"\n✅ DONE! {len(result)} posts + media → ../data/")
    print(f"📁 Videos(mute): ../data/videos/")
    print(f"📁 Visuals(GIFs/Images): ../data/visuals/")
    print(f"🎵 Audio: ../data/audio/")


🚀 REDDIT EXTRACTOR + VISUALS + PERFECT AUDIO + CAROUSEL_GIF!

🔥 Filters: 1=hot, 2=top, 3=new, 4=comments, 5=relevance

⏰ PERIOD FILTER (HTML dropdown match):
1=All time, 2=Past year, 3=Past month, 4=Past week, 5=Today, 6=Past hour
   ✅ Using period: All time → API t=all

🔥 Scraping 10 'oban star racers' RELEVANCE posts...
✅ VIDEOS→../data/videos/ | IMAGES/GIFs→../data/visuals/ | AUDIO→../data/audio/
✅ NEW: CAROUSEL_IMAGE, CAROUSEL_GIF detection!
   ⏰ Time filter: All time
📡 Fetching EXACTLY 10 relevance posts (Period: All time)...
🔍 Fetching UP TO 10 relevance posts for 'oban star racers'...
   ⏰ Time filter: All time
   📡 API: q=oban%20star%20racers&sort=relevance&t=all...
   ✅ API returned 100 posts available
✅ SUCCESS: 10/10 relevance posts loaded!
✅ Saved 10 REAL posts → ../data/reddit/oban_star_racers_main.csv

🚀 PROCESSING 10 posts → ../data/reddit/oban_star_racers_relevance_all_time.csv
💾 VIDEOS(mute) → ../data/videos/
🖼
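
The `API: q=oban%20star%20racers&sort=relevance&t=all` line in the output shows the dropdown label being translated into a query string. Below is a minimal sketch of that translation, assuming a lookup table similar to what `get_period_param` (not shown in this section) might use. `PERIOD_TO_T`, `period_to_t` and `build_query` are hypothetical names, and only the `All time` → `all` pair is confirmed by the output above. The remaining values are Reddit's documented `t` options.

```python
from urllib.parse import quote, urlencode

# Hypothetical lookup table: dropdown label -> Reddit `t` query value.
# Only "All time" -> "all" is confirmed by the notebook output.
PERIOD_TO_T = {
    "All time": "all",
    "Past year": "year",
    "Past month": "month",
    "Past week": "week",
    "Today": "day",
    "Past hour": "hour",
}


def period_to_t(label, default="year"):
    """Return the `t` value for a dropdown label, or `default` if unknown."""
    return PERIOD_TO_T.get(label, default)


def build_query(keyword, sort, period_label=None):
    """Build a search query string; spaces are encoded as %20."""
    params = {"q": keyword, "sort": sort}
    if period_label:
        params["t"] = period_to_t(period_label)
    return urlencode(params, quote_via=quote)


print(build_query("oban star racers", "relevance", "All time"))
# q=oban%20star%20racers&sort=relevance&t=all
```

Falling back to `year` for unknown labels mirrors the `'Past year'` default used in the interactive prompt.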

<hr>

## 🗃️ 2 - MERGE MUTE VIDEOS with AUDIOS
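
Merging here means muxing: the video stream is copied unchanged and the audio track is encoded to AAC. Below is a minimal sketch of how such an ffmpeg argument list might be assembled, assuming ffmpeg may be discoverable on `PATH`. The helpers `find_ffmpeg` and `build_merge_cmd` are hypothetical and not part of this notebook. The flags and the fallback path are the ones used in the merge cell of this section.

```python
import shutil
from pathlib import Path


def find_ffmpeg(fallback=r"C:\ffmpeg\bin\ffmpeg.exe"):
    """Prefer an ffmpeg found on PATH; otherwise return the fallback path."""
    return shutil.which("ffmpeg") or fallback


def build_merge_cmd(ffmpeg, video_path, audio_path, output_path):
    """Build the argument list that muxes one video stream with one audio stream."""
    return [
        ffmpeg,
        "-y",                    # overwrite the output file if it exists
        "-i", str(video_path),   # input 0: mute video
        "-i", str(audio_path),   # input 1: audio track
        "-map", "0:v:0",         # first video stream of input 0
        "-map", "1:a:0",         # first audio stream of input 1
        "-c:v", "copy",          # copy video without re-encoding
        "-c:a", "aac",           # encode audio as AAC
        "-shortest",             # stop at the end of the shorter stream
        str(output_path),
    ]


cmd = build_merge_cmd(
    "ffmpeg", Path("clip_vid.mp4"), Path("clip_audio.m4a"), Path("clip_video.mp4")
)
print(" ".join(cmd))
```

Keeping the command builder separate from `subprocess.run` makes the argument list easy to inspect or log before anything is executed.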


<style>
h1 {
    text-align: center;
    color: purple;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>


import subprocess
from pathlib import Path
import re


def auto_video_merge_all():
    """🚀 Automatically merge ALL video+audio pairs - outputs *_video.mp4 to ../data/visuals/"""
    
    base_dir = Path(r"C:/Users/sboub/Documents/GitHub/reddit-scraper")
    
    #  DIRECTORIES
    videos_dir = base_dir / "data" / "videos"      # Videos (mute) - SOURCE *_vid*.mp4
    audio_dir = base_dir / "data" / "audio"        # Audio files
    visuals_dir = base_dir / "data" / "visuals"    # Videos+Audio destination
    
    print("🔍 Scanning for video/audio pairs...")
    
    # SEARCH VIDEOS IN ../data/videos/ (mute videos)
    video_files = list(videos_dir.rglob("*_vid*.mp4")) + list(videos_dir.rglob("*_vid*.webm"))
    print(f"📊 Found {len(video_files)} video files in {videos_dir}")
    
    success_count = 0
    fail_count = 0
    
    for video_path in video_files:
        # Extract keyword_filter_period_postnumber from *_vid.mp4 OR *_vid_01.mp4
        video_name = video_path.stem  # filename without its extension (.mp4 or .webm)
        
        # FLEXIBLE PATTERN: keyword_filter_period_postnumber_vid OR keyword_filter_period_postnumber_vid_01
        match = re.match(r'(.+?)_vid(?:_\d+)?$', video_name)
        if not match:
            print(f"⚠️  Skipping non-matching video: {video_name}")
            continue
            
        keyword_filter_period_postnumber = match.group(1)
        
        # Construct matching audio path: same name + _audio.m4a
        audio_filename = f"{keyword_filter_period_postnumber}_audio.m4a"
        audio_path = audio_dir / audio_filename
        
        # OUTPUT *_video.mp4 TO ../data/visuals/ (Videos+Audio)
        output_filename = f"{keyword_filter_period_postnumber}_video.mp4"
        output_path = visuals_dir / output_filename
        
        print(f"\n🔄 [{success_count + fail_count + 1}/{len(video_files)}] {keyword_filter_period_postnumber}")
        print(f"   📁 Video:  {video_path.name} (from {videos_dir.name})")
        print(f"   📁 Audio:  {audio_filename}")
        print(f"   📁 Output: {output_filename} (to {visuals_dir.name})")
        
        # Skip if already merged
        if output_path.exists():
            print("   ✅ Already exists - SKIPPING")
            success_count += 1
            continue
            
        if not audio_path.exists():
            print("   ❌ Audio missing - SKIPPING")
            fail_count += 1
            continue
        
        # Merge command
        cmd = [
            r"C:\ffmpeg\bin\ffmpeg.exe",
            "-y",
            "-i", str(video_path),
            "-i", str(audio_path),
            "-map", "0:v:0",
            "-map", "1:a:0",
            "-c:v", "copy",
            "-c:a", "aac",
            "-shortest",
            str(output_path)
        ]
        
        # Execute
        try:
            result = subprocess.run(cmd, capture_output=True, text=True, timeout=300)
            
            if result.returncode == 0 and output_path.exists():
                size_mb = output_path.stat().st_size / 1e6
                print(f"   ✅ SUCCESS! ({size_mb:.1f} MB)")
                success_count += 1
            else:
                # ffmpeg writes its version banner first, so show the tail of stderr
                print(f"   ❌ FAILED: {result.stderr[-200:]}")
                fail_count += 1
                
        except subprocess.TimeoutExpired:
            print("   ⏰ TIMEOUT - Killed")
            fail_count += 1
    
    # Final summary
    print(f"\n🎉 SUMMARY:")
    print(f"   ✅ Success: {success_count}")
    print(f"   ❌ Failed:  {fail_count}")
    print(f"   📁 Total:   {len(video_files)}")
    print(f"   📂 Source:  {videos_dir}/(*_vid*.mp4)")
    print(f"   📂 Audio:   {audio_dir}/(*_audio.m4a)")
    print(f"   📂 Output:  {visuals_dir}/(*_video.mp4)")  # Videos+Audio


# Auto-run
if __name__ == "__main__":
    auto_video_merge_all()


🔍 Scanning for video/audio pairs...
📊 Found 2 video files in C:\Users\sboub\Documents\GitHub\reddit-scraper\data\videos

🔄 [1/2] oban_star_racers_relevance_all_time_2
   📁 Video:  oban_star_racers_relevance_all_time_2_vid.mp4 (from videos)
   📁 Audio:  oban_star_racers_relevance_all_time_2_audio.m4a
   📁 Output: oban_star_racers_relevance_all_time_2_video.mp4 (to visuals)
   ✅ SUCCESS! (16.4 MB)

🔄 [2/2] oban_star_racers_relevance_all_time_4
   📁 Video:  oban_star_racers_relevance_all_time_4_vid.mp4 (from videos)
   📁 Audio:  oban_star_racers_relevance_all_time_4_audio.m4a
   📁 Output: oban_star_racers_relevance_all_time_4_video.mp4 (to visuals)
   ✅ SUCCESS! (12.6 MB)

🎉 SUMMARY:
   ✅ Success: 2
   ❌ Failed:  0
   📁 Total:   2
   📂 Source:  C:\Users\sboub\Documents\GitHub\reddit-scraper\data\videos/(*_vid*.mp4)
   📂 Audio:   C:\Users\sboub\Documents\GitHub\reddit-scraper\data\audio/(*_audio.m4a)
   📂 Output:  C:\Users\sboub\Document
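
The pairing in this run is done purely by file name: the `_vid` (or `_vid_NN`) suffix is stripped from the video stem, and the remaining base name is reused for the audio and output files. The sketch below isolates that step using the same regular expression as the merge cell. `pair_names` is a hypothetical helper, not part of the notebook.

```python
import re

# Same pattern as the merge cell: base name, then "_vid" with an optional "_NN" index.
VID_PATTERN = re.compile(r"(.+?)_vid(?:_\d+)?$")


def pair_names(video_stem):
    """Return (audio_filename, output_filename) for a video stem, or None if it does not match."""
    match = VID_PATTERN.match(video_stem)
    if match is None:
        return None
    base = match.group(1)
    return f"{base}_audio.m4a", f"{base}_video.mp4"


for stem in (
    "oban_star_racers_relevance_all_time_2_vid",
    "oban_star_racers_relevance_all_time_4_vid_01",
    "holiday_video",
):
    print(stem, "->", pair_names(stem))
```

Because the `_NN` index is discarded, several numbered videos from the same post would all map to the same audio and output names. The existing "Already exists - SKIPPING" check then means only the first of them is merged.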

<hr>

# 🧪 TESTING


<style>
h1 {
    text-align: center;
    color: red;
    font-weight: bold;
}
</style>

<style>
h2 {
    text-align: left;
    color: darkblue;
    font-weight: bold;
}
</style>

<style>
h3 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>

<style>
h4 {
    text-align: center;
    color: black;
    font-weight: bold;
}
</style>
<hr>