# First Part - Scraping the Data (do NOT run this)

This script is designed to **scrape data from YouTube videos** and **analyze transcripts** to gain more insights.

### **Libraries**
- The script begins by importing the necessary libraries. These include:
  - **Selenium**: Used for automating web interactions (like clicking buttons and scrolling).
  - **BeautifulSoup**: Used for parsing HTML content and extracting specific parts of a webpage.
  - **NLTK**: Used for text processing, including cleaning and analyzing the text.
- A **Chrome WebDriver** is set up using Selenium to control the browser. This driver can open a YouTube video and interact with it, like clicking buttons or scrolling.

### **Web Scraping YouTube Video Data**
- The script visits a YouTube video page and extracts information such as:
  - **Likes**: The number of likes on the video.
  - **Views**: The total number of views.
  - **People Mentioned**: Names shown in the people section that appears after expanding the video description.
- For example, if the video has **10,000 likes** and **500,000 views**, `scrape_youtube_video_data` records those figures as the text displayed on the page. If the description lists people like "Joe Rogan" and "Elon Musk," those names are also captured.
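
The scraper stores likes and views as displayed text, so the values usually need converting before any numeric analysis. Below is a minimal sketch of such a conversion, assuming English-locale formats like "10K" or "500,000 views". `parse_count` is a hypothetical helper and is not part of the script.

```python
import re

# Multipliers for abbreviated suffixes (assumption: English-locale YouTube page).
_SUFFIXES = {"K": 1_000, "M": 1_000_000, "B": 1_000_000_000}


def parse_count(text):
    """Convert a displayed count such as '10K' or '500,000 views' to an int.

    Returns None when no number is found (for example 'Not Found').
    """
    match = re.search(r"(\d[\d,]*(?:\.\d+)?)([KMB])?\b", text.strip(), re.IGNORECASE)
    if not match:
        return None
    number = float(match.group(1).replace(",", ""))
    suffix = (match.group(2) or "").upper()
    return int(round(number * _SUFFIXES.get(suffix, 1)))


print(parse_count("10K"))            # 10000
print(parse_count("500,000 views"))  # 500000
```

Abbreviated values such as "10K" are rounded by YouTube itself, so the converted numbers are approximations.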

### **Fetching and Processing the Transcript**
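- **`fetch_transcript(episode_url)`**: This function fetches the transcript of an episode from https://ogjre.com/transcripts, passes it to `preprocess_transcript`, and returns the string "Transcript not available." if the page or the transcript element cannot be retrieved.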

### **Text Processing**
- **`preprocess_transcript(transcript)`**: This function prepares the transcript for further analysis:
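  - **Text Cleaning**: Removes every character that is not a letter or whitespace (punctuation and digits) and converts the text to lowercase.
  - **Tokenization and Lemmatization**: Breaks the transcript into individual words and reduces them to their dictionary form. For example, *"episodes"* becomes *"episode"*. The script calls `WordNetLemmatizer.lemmatize` without a part-of-speech tag, so words are treated as nouns and verb forms such as *"running"* are left unchanged.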
  - **Stopword Removal**: Common words like "the", "and", "is" are removed since they don’t add much meaning.
- **Sentiment Analysis** is then performed using **VADER** (the `vaderSentiment` package). The script scores the punctuation-stripped transcript, before lowercasing and stopword removal. This analysis provides four scores:
  - **Negative (neg)**: Measures the negative tone in the text.
  - **Neutral (neu)**: Measures neutral content.
  - **Positive (pos)**: Measures the positive tone.
  - **Compound**: A combined score that shows the overall sentiment (ranging from -1 to 1, where -1 means very negative and 1 means very positive).
- For example, if the transcript talks about positive topics like *"exciting future"*, the sentiment analysis might return a positive score.
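
The cleaning step can be illustrated on a single sentence. This is a simplified sketch using only the standard library: it reuses the script's regular expression, but the stopword list is a tiny stand-in for NLTK's English list, whitespace splitting stands in for `word_tokenize`, and lemmatization is left out. `clean_text` and `STOP_WORDS` are hypothetical names.

```python
import re

# Tiny illustrative stopword list (assumption); the script uses NLTK's full English list.
STOP_WORDS = {"the", "and", "is", "we", "about", "of", "in"}


def clean_text(text):
    """Strip non-letters, lowercase, and drop stopwords (no lemmatization)."""
    letters_only = re.sub(r"[^a-zA-Z\s]", "", text)  # same pattern as the script
    tokens = letters_only.lower().split()            # stand-in for word_tokenize
    return " ".join(token for token in tokens if token not in STOP_WORDS)


print(clean_text("Hello everyone! Today we talk about the exciting future of space in 2050."))
# hello everyone today talk exciting future space
```

Note that the year "2050" disappears along with the punctuation, because the pattern keeps letters and whitespace only.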


### **Main Processing Function**
- **`process_youtube_data(input_file, output_file, json_output_file)`**: This is the main function that:
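  - Loads YouTube video URLs from an Excel file (columns `url`, `generated_url`, and `name`), or resumes from the output file if one already exists.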
  - Uses **`scrape_youtube_video_data`** to gather information about each video.
  - Uses **`fetch_transcript`** to retrieve and analyze the transcript of each video.
  - Updates the Excel file and JSON file with the scraped data after each video, so progress is not lost if the run is interrupted.
- For example, if the Excel file contains a list of 20 YouTube URLs, this function will visit each link, collect information, analyze it, and save the results.

### **Output**
- The script produces two main outputs:
  1. **Excel File** (`jre_episodes_data.xlsx`): Contains updated information about each video, such as likes, views, and people mentioned.
  2. **JSON File** (`updated_jre_episodes_data.json`): Contains more detailed data, including the cleaned transcript and sentiment analysis.


### Example of Output Data
- **Transcript and Sentiment**: The output includes a cleaned version of the transcript, along with sentiment scores like:
  ```json
  {
      "Episode Title": "Joe Rogan Experience #2159 - Sal Vulcano",
      "Cleaned_Transcript": "hello everyone today we talk about exciting future space",
      "people_mentioned": ["Sal Vulcano"],
      "views": 500000,
      "likes": 10000,
      "guest": ["Sal Vulcano"],
      "sentiment": {
          "neg": 0.093,
          "neu": 0.719,
          "pos": 0.188,
          "compound": 1.0
      }
  }
  ```

### Sentiment Explanation

- **Negative (neg)**: 0.093 indicates 9.3% of the content is negative.
- **Neutral (neu)**: 0.719 indicates 71.9% of the content is neutral.
- **Positive (pos)**: 0.188 indicates 18.8% of the content is positive.
- **Compound**: 1.0 means the overall sentiment is very positive.
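
To turn the compound score into a label, a common convention suggested in the VADER documentation is a threshold of ±0.05. The sketch below applies that convention. `label_sentiment` is a hypothetical helper, and the threshold is a parameter you may want to adjust.

```python
def label_sentiment(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


print(label_sentiment(1.0))    # positive
print(label_sentiment(0.01))   # neutral
print(label_sentiment(-0.4))   # negative
```

For very long texts such as full transcripts, the compound score tends to sit close to -1 or 1, so the `neg`, `neu`, and `pos` proportions may be more useful for comparing episodes.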


In [None]:
import time
import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import json
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
import os

nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')



# Setup WebDriver with the same configuration
def setup_driver():
    options = webdriver.ChromeOptions()
    # Uncomment the following line to run in headless mode
    # options.add_argument("--headless")
    options.add_argument("--disable-gpu")
    options.add_argument("--no-sandbox")
    options.add_argument("--start-maximized")  # Start with a maximized browser window
    return webdriver.Chrome(options=options)


def wait_for_element(driver, by, value, timeout=15):
    """Wait for an element to be present on the page."""
    return WebDriverWait(driver, timeout).until(EC.presence_of_element_located((by, value)))


def scrape_youtube_video_data(driver, url):
    """Scrape YouTube video data including likes, views, and people mentioned."""
    metrics = {"Likes": "Not Found", "Views": "Not Found", "People_Mentioned": []}
    try:
        print(f"Processing URL: {url}")
        driver.get(url)
        time.sleep(5)  # Wait for the page to load

        # Accept cookies if prompted
        try:
            cookie_button = wait_for_element(driver, By.XPATH, '//*[@id="content"]/div[2]/div[6]/div[1]/ytd-button-renderer[1]/yt-button-shape/button/yt-touch-feedback-shape/div/div[2]', timeout=10)
            cookie_button.click()
            print("Cookies accepted.")
            time.sleep(2)
        except Exception:
            print("No cookies prompt or already accepted.")

        # Get likes
        try:
            likes_element = wait_for_element(driver, By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartimation/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div[2]', 15)
            metrics["Likes"] = likes_element.text
            print(f"Likes: {metrics['Likes']}")
        except Exception as e:
            print(f"Could not fetch likes: {e}")

        # Get views
        try:
            views_element = wait_for_element(driver, By.CSS_SELECTOR, '.style-scope.yt-formatted-string.bold[style-target="bold"]', 15)
            metrics["Views"] = views_element.text
            print(f"Views: {metrics['Views']}")
        except Exception as e:
            print(f"Could not fetch views: {e}")

        # Expand and retrieve names of people mentioned
        try:
            expand_button = wait_for_element(driver, By.XPATH, '//*[@id="expand"]', timeout=10)
            expand_button.click()
            print("'Expand' button clicked. Scrolling down to reveal names.")
            time.sleep(2)  # Allow time for content to expand

            # Scroll down to ensure all names are loaded
            driver.execute_script("window.scrollBy(0, 500);")
            time.sleep(2)

            # Retrieve names of people mentioned
            index = 1
            while True:
                try:
                    person_xpath = f'//*[@id="items"]/yt-video-attributes-section-view-model/div/div[2]/div/yt-video-attribute-view-model[{index}]/div/a/div[2]/h1'
                    person_element = wait_for_element(driver, By.XPATH, person_xpath, timeout=5)
                    person_name = person_element.text.strip()
                    metrics["People_Mentioned"].append(person_name)
                    print(f"Person {index}: {person_name}")
                    index += 1
                except Exception:
                    print(f"Finished retrieving people mentioned. Total: {len(metrics['People_Mentioned'])}")
                    break
        except Exception as e:
            print(f"Error retrieving people mentioned: {e}")

    except Exception as e:
        print(f"Error processing URL {url}: {e}")

    return metrics


def fetch_transcript(episode_url):
    """Fetch transcript from the given URL."""
    print(f"Fetching transcript for {episode_url}...")
    try:
        response = requests.get(episode_url, timeout=30)  # Avoid hanging forever on a stalled connection
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        transcript_tag = soup.find("p", class_="chakra-text ssc-transcript css-0")
        transcript = transcript_tag.get_text(strip=True) if transcript_tag else "Transcript not available."

        # Preprocess the transcript
        return preprocess_transcript(transcript)
    except Exception as e:
        print(f"Error fetching transcript for {episode_url}: {e}")
        return "Transcript not available."


def preprocess_transcript(transcript):
    """Clean, tokenize, and lemmatize a transcript, then score its sentiment."""
    if transcript == "Transcript not available.":
        return transcript

    # Text cleaning: keep letters only, lowercase, drop stopwords, and lemmatize
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))
    transcript = re.sub(r'[^a-zA-Z\s]', '', transcript)  # Drops punctuation and digits
    words = word_tokenize(transcript.lower())
    # lemmatize() defaults to noun mode, so verb forms are left unchanged
    cleaned_words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    cleaned_transcript = ' '.join(cleaned_words)

    # Sentiment analysis on the punctuation-stripped text (original casing, stopwords kept)
    analyzer = SentimentIntensityAnalyzer()
    sentiment = analyzer.polarity_scores(transcript)

    return {
        "cleaned_transcript": cleaned_transcript,
        "sentiment": sentiment
    }


def save_intermediate_results(df, processed_data, output_file, json_output_file):
    """Save the current state of the DataFrame and JSON data to avoid data loss."""
    df.to_excel(output_file, index=False)
    print(f"Intermediate data saved to {output_file}")
    with open(json_output_file, 'w') as json_file:
        json.dump(processed_data, json_file, indent=4)
    print(f"Intermediate JSON data saved to {json_output_file}")


def process_youtube_data(input_file, output_file, json_output_file):
    """Process YouTube video URLs from an Excel file, save scraped data, and generate JSON."""
    # Load the existing data if available
    if os.path.exists(output_file):
        df = pd.read_excel(output_file)
        print(f"Loaded existing data from {output_file}")
    else:
        df = pd.read_excel(input_file)

    if os.path.exists(json_output_file):
        with open(json_output_file, 'r') as json_file:
            processed_data = json.load(json_file)
        print(f"Loaded existing JSON data from {json_output_file}")
    else:
        processed_data = []

    driver = setup_driver()

    try:
        for index, row in df.iterrows():
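            # Skip rows that have already been processed. People_Mentioned is not
            # checked: an empty value is read back from Excel as NaN, which would
            # cause episodes with no listed people to be scraped again on resume.
            if pd.notna(row.get('Likes')) and pd.notna(row.get('Views')):
                continue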

            video_url = row['url']
            generated_url = row['generated_url']
            print(f"Processing video: {row['name']}")
            try:
                metrics = scrape_youtube_video_data(driver, video_url)
                transcript_data = fetch_transcript(generated_url)

                # Save the scraped data to the DataFrame
                df.at[index, 'Likes'] = metrics["Likes"]
                df.at[index, 'Views'] = metrics["Views"]
                df.at[index, 'People_Mentioned'] = ", ".join(metrics["People_Mentioned"])

                # Append data for JSON file
                processed_data.append({
                    "Episode Title": row['name'],
                    "Cleaned_Transcript": transcript_data["cleaned_transcript"] if isinstance(transcript_data, dict) else transcript_data,
                    "people_mentioned": metrics["People_Mentioned"],
                    "views": metrics["Views"],
                    "likes": metrics["Likes"],
                    "guest": row.get('guest', 'Unknown'),
                    "sentiment": transcript_data.get("sentiment") if isinstance(transcript_data, dict) else {}
                })
                print(f"Successfully processed video: {row['name']}")

                # Save intermediate results after processing each video
                save_intermediate_results(df, processed_data, output_file, json_output_file)

            except Exception as e:
                print(f"Error processing video {row['name']}: {e}")

    except Exception as e:
        print(f"Error during processing: {e}")
    finally:
        driver.quit()
        print("WebDriver closed.")


if __name__ == "__main__":
    input_file = "jre_episodes.xlsx"
    output_file = "jre_episodes_with_metrics.xlsx"
    json_output_file = "jre_episodes_data.json"
    process_youtube_data(input_file, output_file, json_output_file)


Just look at the JSON file, where the transcripts are included. There is also an Excel file that gives a clearer view of the output.


# Next Steps 


## How to Use the Outputs of the Script

The script produces two main outputs: an Excel file and a JSON file. Here is how you can use them for the next steps:

### **Excel File (`jre_episodes_data.xlsx`)**:
   - **Guest Standardization**: Use the guest column to standardize guest names. Correct any inconsistencies to make sure that the same guest is represented uniformly.
   - **Engagement Metrics**: Analyze likes and views to understand which episodes were more popular. This can help in the engagement vs. sentiment analysis.
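
Guest standardization could look like the following sketch. It collapses whitespace and maps known spelling variants to one canonical name. `standardize_guest` and the alias table are hypothetical, and the alias entries are illustrative rather than taken from the data.

```python
import re

# Illustrative alias table (assumption): lowercase variant -> canonical name.
ALIASES = {
    "sal vulcano": "Sal Vulcano",
    "salvatore vulcano": "Sal Vulcano",
}


def standardize_guest(name, aliases):
    """Collapse whitespace and replace known variants with a canonical name."""
    cleaned = re.sub(r"\s+", " ", name).strip()
    # Names without an alias entry are returned with whitespace cleaned only.
    return aliases.get(cleaned.casefold(), cleaned)


print(standardize_guest("  sal   vulcano ", ALIASES))   # Sal Vulcano
print(standardize_guest("Salvatore Vulcano", ALIASES))  # Sal Vulcano
```

With pandas, the same function can be applied to the guest column via `df["guest"].map(lambda g: standardize_guest(g, ALIASES))`, assuming the column contains only strings.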

### **JSON File (`updated_jre_episodes_data.json`)**:
   - **Sentiment Analysis and Transcript**:
     - Extract and use the cleaned transcript and sentiment scores for each episode. This data is essential for topic modeling and understanding audience sentiment.
   - **Network Construction**:
     - Use the guest names and episode information from the JSON file to create a network of interconnected episodes.
   - **Correlation Analysis**:
     - Use the engagement metrics (likes, views) alongside the sentiment scores to perform correlation analysis.
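
As a rough illustration of the correlation step, the sketch below computes a Pearson coefficient between view counts and compound sentiment scores. It assumes the counts have already been converted to numbers. `pearson` is a hypothetical helper, and the sample values are invented purely to show the call.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Illustrative numbers only, not taken from the dataset.
views = [120_000, 450_000, 300_000, 900_000]
compound = [0.20, 0.90, 0.50, 0.95]
print(round(pearson(views, compound), 3))
```

If the data is already in a pandas DataFrame, `DataFrame.corr()` gives the same coefficient for every pair of numeric columns.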

### **Further Analysis and Insights**:
   - Use the cleaned transcripts to perform **topic modeling** and **thematic analysis**.
   - Analyze sentiment trends across different themes and correlate them with the guest appearances and audience engagement data.
   - Construct **network graphs** based on shared guests and topics to visualize relationships between different episodes.
   - Generate **visual insights** such as word clouds, scatter plots, and sentiment trends to help in interpreting the data.
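
One possible way to derive network edges is to link every pair of episodes that share a person. The sketch below builds such an edge list from records shaped like the JSON output, using its `Episode Title` and `people_mentioned` fields. `shared_guest_edges` is a hypothetical helper, and the sample records are placeholders.

```python
from collections import defaultdict
from itertools import combinations


def shared_guest_edges(episodes):
    """Return {(title_a, title_b): [shared names]} for episodes sharing a person."""
    titles_by_person = defaultdict(set)
    for episode in episodes:
        for person in episode.get("people_mentioned", []):
            titles_by_person[person].add(episode["Episode Title"])

    edges = defaultdict(set)
    for person, titles in titles_by_person.items():
        # Every pair of episodes mentioning the same person gets an edge.
        for title_a, title_b in combinations(sorted(titles), 2):
            edges[(title_a, title_b)].add(person)
    return {pair: sorted(people) for pair, people in edges.items()}


# Placeholder records, not real episodes.
sample = [
    {"Episode Title": "Episode A", "people_mentioned": ["Person X", "Person Y"]},
    {"Episode Title": "Episode B", "people_mentioned": ["Person Y"]},
    {"Episode Title": "Episode C", "people_mentioned": ["Person Z"]},
]
print(shared_guest_edges(sample))
# {('Episode A', 'Episode B'): ['Person Y']}
```

The resulting dictionary can be passed to a graph library of your choice, with the number of shared people serving as an edge weight.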

### **Visualizations**:
   - Visualize the relationships between **engagement metrics, guests, and sentiment** using charts.
   - Create a **network graph** to see which guests often appear together or how episodes are connected by shared topics.
   - **Topic Clouds** can be generated to see which words are most often associated with specific guests or themes.
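
A word or topic cloud starts from word frequencies. As a minimal sketch, the counts can be taken directly from the `Cleaned_Transcript` field, since stopwords have already been removed. `top_words` is a hypothetical helper.

```python
from collections import Counter


def top_words(cleaned_transcript, n=10):
    """Return the n most frequent words of a cleaned transcript with their counts."""
    return Counter(cleaned_transcript.split()).most_common(n)


print(top_words("space future space talk space future", n=2))
# [('space', 3), ('future', 2)]
```

These (word, count) pairs are the usual input for a word cloud or bar chart.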

By following these steps and using the outputs effectively, you will be able to conduct a thorough analysis of the Joe Rogan Experience podcast, extracting insights into its content, audience engagement, and thematic relationships.
