# **YouTube Video Analysis With Transcripts**

# Overview

This project analyzes YouTube videos on a given topic by processing their transcripts. The analysis includes:

* Video metadata collection (views, likes, comments)
* Transcript analysis and sentiment analysis
* Channel performance metrics
* Engagement analysis
* Interactive visualizations



# Key Features

* Filters videos based on transcript availability
* Processes English transcripts only
* Sentiment analysis of transcripts
* Custom visualization dashboard
* Channel and video performance metrics

## Required Libraries

In [1]:
!pip install youtube-transcript-api

Collecting youtube-transcript-api
  Downloading youtube_transcript_api-0.6.3-py3-none-any.whl.metadata (17 kB)
Downloading youtube_transcript_api-0.6.3-py3-none-any.whl (622 kB)
Installing collected packages: youtube-transcript-api
Successfully installed youtube-transcript-api-0.6.3


In [2]:
!pip install gradio google-api-python-client transformers torch textblob nltk plotly pandas numpy scikit-learn faiss-cpu huggingface_hub

Collecting gradio
  Downloading gradio-5.9.0-py3-none-any.whl.metadata (16 kB)
Collecting faiss-cpu
  Downloading faiss_cpu-1.9.0.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.4 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl.metadata (9.7 kB)
Collecting fastapi<1.0,>=0.115.2 (from gradio)
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting ffmpy (from gradio)
  Downloading ffmpy-0.4.0-py3-none-any.whl.metadata (2.9 kB)
Collecting gradio-client==1.5.2 (from gradio)
  Downloading gradio_client-1.5.2-py3-none-any.whl.metadata (7.1 kB)
Collecting markupsafe~=2.0 (from gradio)
  Downloading MarkupSafe-2.1.5-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (3.0 kB)
Collecting pydub (from gradio)
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting python-multipart>=0.0.18 (from gradio)
  Downloading python_multipart-0.0.19-py3-none-any.whl.metadat

In [3]:
!pip install -U langchain-community

Collecting langchain-community
  Downloading langchain_community-0.3.12-py3-none-any.whl.metadata (2.9 kB)
Collecting dataclasses-json<0.7,>=0.5.7 (from langchain-community)
  Downloading dataclasses_json-0.6.7-py3-none-any.whl.metadata (25 kB)
Collecting httpx-sse<0.5.0,>=0.4.0 (from langchain-community)
  Downloading httpx_sse-0.4.0-py3-none-any.whl.metadata (9.0 kB)
Collecting langchain<0.4.0,>=0.3.12 (from langchain-community)
  Downloading langchain-0.3.12-py3-none-any.whl.metadata (7.1 kB)
Collecting langchain-core<0.4.0,>=0.3.25 (from langchain-community)
  Downloading langchain_core-0.3.25-py3-none-any.whl.metadata (6.3 kB)
Collecting pydantic-settings<3.0.0,>=2.4.0 (from langchain-community)
  Downloading pydantic_settings-2.7.0-py3-none-any.whl.metadata (3.5 kB)
Collecting marshmallow<4.0.0,>=3.18.0 (from dataclasses-json<0.7,>=0.5.7->langchain-community)
  Downloading marshmallow-3.23.1-py3-none-any.whl.metadata (7.5 kB)
Collecting typing-inspect<1,>=0.4.0 (from dataclasses-

## Import Required Libraries

In [4]:
import os
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import plotly.express as px
import plotly.graph_objects as go
from googleapiclient.discovery import build
from transformers import pipeline, AutoModelForSequenceClassification, AutoTokenizer
#import concurrent.futures
#from multiprocessing import Pool, cpu_count
import torch
from tqdm import tqdm
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from textblob import TextBlob
import gradio as gr
from sentence_transformers import SentenceTransformer
from youtube_transcript_api import YouTubeTranscriptApi
import faiss
from langchain.text_splitter import RecursiveCharacterTextSplitter
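# Integrations live in langchain_community (installed above); the langchain.* paths are deprecated
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline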
import gc
import re
from typing import List, Dict
import warnings
warnings.filterwarnings('ignore')

## 📚 Download NLTK Resources

This cell downloads required NLTK data for text processing:
- punkt: for tokenization
- stopwords: for filtering common words
- averaged_perceptron_tagger: for part-of-speech tagging

Run this cell once when setting up the notebook.

In [5]:
# Download NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


True

## YouTube Analyzer Class Implementation
The `YouTubeAnalyzer` class handles video data collection and processing.

This includes:

- YouTube API interaction
- Transcript collection and cleaning
- Video metadata gathering
- Basic sentiment analysis
- Data validation and error handling
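
The YouTube API interaction relies on page tokens: each search response may carry a `nextPageToken` that is passed back to request the next page. The loop can be sketched in isolation as follows. This is a minimal illustration, not the class itself: `collect_ids` and `fake_search_page` are hypothetical names, and `fake_search_page` is a stand-in for the real `search().list(...).execute()` call so the sketch runs without an API key.

```python
from typing import Callable, Dict, List, Optional


def collect_ids(search_page: Callable[[Optional[str], int], Dict],
                max_results: int,
                page_size: int = 50) -> List[str]:
    """Follow nextPageToken until max_results IDs are collected or pages run out."""
    ids: List[str] = []
    token: Optional[str] = None
    while len(ids) < max_results:
        # Never ask for more than is still needed, and never more than one page
        response = search_page(token, min(page_size, max_results - len(ids)))
        ids.extend(item["id"]["videoId"] for item in response.get("items", []))
        token = response.get("nextPageToken")
        if not token:  # no further pages available
            break
    return ids[:max_results]


def fake_search_page(token: Optional[str], limit: int) -> Dict:
    """Hypothetical stand-in for a paginated search endpoint with 120 results."""
    catalogue = [f"vid{i}" for i in range(120)]
    start = int(token) if token else 0
    chunk = catalogue[start:start + limit]
    response = {"items": [{"id": {"videoId": v}} for v in chunk]}
    if start + limit < len(catalogue):
        response["nextPageToken"] = str(start + limit)
    return response


ids = collect_ids(fake_search_page, max_results=75)
print(len(ids), ids[0], ids[-1])
```

Requesting more results than the stand-in holds (for example `max_results=500`) simply returns everything available, which mirrors how a real search can return fewer videos than requested.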

In [6]:
class YouTubeAnalyzer:
    def __init__(self, api_key):
        """Initialize YouTube API client and models"""
        self.youtube = build('youtube', 'v3', developerKey=api_key)
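        self.device = 0 if torch.cuda.is_available() else -1
        # Pin the model explicitly; this is the checkpoint the pipeline otherwise defaults to
        self.sentiment_model = pipeline(
            "sentiment-analysis",
            model="distilbert/distilbert-base-uncased-finetuned-sst-2-english",
            device=self.device
        )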

    def _clean_text(self, text):
        """Clean transcript text by removing special characters and formatting"""
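        text = re.sub(r'[\[\]\{\}]', '', text)   # drop bracket characters, e.g. around [Music]
        text = re.sub(r'\d{2}:\d{2}', '', text)  # drop mm:ss timestamps
        text = re.sub(r'\s+', ' ', text)         # collapse whitespace last so no gaps remain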
        return text.strip()

    def _get_transcript(self, video_id):
        """Get and clean transcript for a video"""
        try:
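            # Request English explicitly; videos without an English transcript raise and are skipped
            transcript_list = YouTubeTranscriptApi.get_transcript(video_id, languages=['en'])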
            if not transcript_list:
                return None

            full_transcript = ' '.join([item['text'] for item in transcript_list])
            cleaned_transcript = self._clean_text(full_transcript)
            return cleaned_transcript

        except Exception:
            # Transcript disabled, unavailable, or not in English: skip this video
            return None

    def _get_video_details(self, video_id):
        """Get details for a single video with transcript"""
        try:
            transcript = self._get_transcript(video_id)
            if not transcript:
                return None

            video_response = self.youtube.videos().list(
                part='snippet,statistics',
                id=video_id
            ).execute()

            if not video_response['items']:
                return None

            video = video_response['items'][0]
            stats = video['statistics']
            snippet = video['snippet']

            required_fields = {
                'title': snippet.get('title'),
                'channel_title': snippet.get('channelTitle'),
                'published_at': snippet.get('publishedAt'),
                'views': stats.get('viewCount'),
                'likes': stats.get('likeCount'),
                'comments': stats.get('commentCount')
            }

            if any(v is None or v == '' for v in required_fields.values()):
                return None

            sentiment_result = self.sentiment_model(transcript[:512])[0]

            return {
                'video_id': video['id'],
                'title': required_fields['title'],
                'transcript': transcript,
                'channel_title': required_fields['channel_title'],
                'published_at': required_fields['published_at'],
                'views': int(required_fields['views']),
                'likes': int(required_fields['likes']),
                'comments': int(required_fields['comments']),
                'sentiment': sentiment_result['label'],
                'transcript_length': len(transcript.split())
            }
        except Exception as e:
            return None

    def fetch_videos(self, topic, max_results=5000):
        """Fetch videos with transcripts"""
        # Use UTC so the trailing 'Z' is accurate; the API expects an RFC 3339 timestamp
        published_after = (datetime.utcnow() - timedelta(days=180)).strftime('%Y-%m-%dT%H:%M:%SZ')
        video_ids = []
        next_page_token = None

        print(f"Fetching videos for: {topic}")
        while len(video_ids) < max_results:
            try:
                search_response = self.youtube.search().list(
                    q=topic,
                    part='id',
                    type='video',
                    maxResults=min(50, max_results - len(video_ids)),
                    publishedAfter=published_after,
                    pageToken=next_page_token,
                    order='relevance'
                ).execute()

                batch_video_ids = [item['id']['videoId']
                                 for item in search_response.get('items', [])]
                video_ids.extend(batch_video_ids)

                next_page_token = search_response.get('nextPageToken')
                if not next_page_token:
                    break

            except Exception as e:
                print(f"Error fetching video IDs: {str(e)}")
                break

        print(f"Found {len(video_ids)} videos, checking for transcripts...")
        videos_data = []

        for video_id in tqdm(video_ids, desc="Processing videos"):
            result = self._get_video_details(video_id)
            if result:
                videos_data.append(result)
                if len(videos_data) == max_results:
                    break

        df = pd.DataFrame(videos_data)

        if df.empty:
            print("No videos found with available transcripts!")
            return None

        print("\nProcessing Summary:")
        print(f"- Total videos with transcripts: {len(df)}")
        print(f"- Average transcript length: {df['transcript_length'].mean():.0f} words")
        print(f"- Total channels: {df['channel_title'].nunique()}")

        return df
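
The class above scores only the first 512 characters of each transcript (`transcript[:512]`), so the sentiment label reflects the opening of the video. One possible way to cover the whole transcript is to split it into word windows, classify each window, and take a majority vote. The sketch below is an assumption-laden illustration rather than part of the notebook: `chunk_words`, `majority_sentiment`, and `toy_classifier` are hypothetical names, and `toy_classifier` is a trivial stub standing in for a real sentiment model.

```python
from collections import Counter
from typing import Callable, List


def chunk_words(text: str, size: int = 200) -> List[str]:
    """Split text into consecutive windows of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def majority_sentiment(text: str, classify: Callable[[str], str], size: int = 200) -> str:
    """Classify every chunk and return the most common label."""
    chunks = chunk_words(text, size)
    if not chunks:
        return "UNKNOWN"  # nothing to classify
    votes = Counter(classify(chunk) for chunk in chunks)
    return votes.most_common(1)[0][0]


def toy_classifier(chunk: str) -> str:
    """Hypothetical stub: a real model call would go here."""
    return "POSITIVE" if "great" in chunk else "NEGATIVE"


# 450 words: two windows contain "great", the last one does not
transcript = "great " * 300 + "awful " * 150
print(majority_sentiment(transcript, toy_classifier))
```

With a real pipeline, `classify` could wrap the model call and return its `label` field; the window size would need to be chosen so that each chunk stays within the model's token limit.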

The `ContentAnalyzer` class scores downloaded YouTube transcripts for toxicity and hate speech, and stores the result as a per-video `safety_score`.

In [7]:
class ContentAnalyzer:
    def __init__(self):
        self.device = 0 if torch.cuda.is_available() else -1
        self.toxicity_model = pipeline("text-classification",
                                     model="unitary/toxic-bert",
                                     device=self.device)
        self.zero_shot_classifier = pipeline("zero-shot-classification",
                                           model="facebook/bart-large-mnli",
                                           device=self.device)

    def analyze_content(self, df: pd.DataFrame) -> pd.DataFrame:
        """Process content sequentially with proper text truncation"""
        print("Starting content analysis...")
        safety_scores = []

        for idx in tqdm(range(len(df)), desc="Analyzing content"):
            try:
                # Truncate text to 512 characters for model processing
                text = df.iloc[idx]['transcript'][:512] if df.iloc[idx]['transcript'] else ""

                if not text.strip():  # If text is empty or only whitespace
                    safety_scores.append(0.5)  # Neutral score for empty text
                    continue

                # Process toxicity with truncated text
                try:
                    toxicity_result = self.toxicity_model(text)[0]
                except Exception as e:
                    print(f"Toxicity analysis error: {e}")
                    toxicity_result = {'label': 'unknown', 'score': 0.5}

                # Process zero-shot classification with truncated text
                try:
                    hate_speech_result = self.zero_shot_classifier(
                        text,
                        candidate_labels=["hate speech", "neutral", "positive"],
                        truncation=True
                    )
                except Exception as e:
                    print(f"Zero-shot classification error: {e}")
                    hate_speech_result = {
                        'labels': ['neutral'],
                        'scores': [0.5]
                    }

                # Calculate safety score
                safety_score = (
                    (1 - float(toxicity_result['score']) * (toxicity_result['label'] == 'toxic')) +
                    (1 - float(hate_speech_result['scores'][0]) * (hate_speech_result['labels'][0] == 'hate speech'))
                ) / 2

                safety_scores.append(safety_score)

            except Exception as e:
                print(f"Error processing content at index {idx}: {e}")
                safety_scores.append(0.5)  # Default neutral score on error

        df['safety_score'] = safety_scores
        return df
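
The safety score formula above averages two components. Each component is 1.0 unless the corresponding model's top label is the flagged one (`toxic` or `hate speech`), in which case the model's confidence is subtracted. A standalone worked example, using a hypothetical helper name `safety_score` and made-up model outputs, is sketched here:

```python
def safety_score(tox_label: str, tox_score: float,
                 top_label: str, top_score: float) -> float:
    """Average of two 'safe' components, mirroring the formula in analyze_content."""
    # The boolean acts as 0 or 1, so a non-flagged label leaves the component at 1.0
    toxicity_part = 1 - tox_score * (tox_label == "toxic")
    hate_part = 1 - top_score * (top_label == "hate speech")
    return (toxicity_part + hate_part) / 2


# Both models flag the text: (0.2 + 0.4) / 2, roughly 0.3
print(safety_score("toxic", 0.8, "hate speech", 0.6))
# Only the toxicity model flags it: (0.2 + 1.0) / 2, roughly 0.6
print(safety_score("toxic", 0.8, "neutral", 0.9))
# Neither top label is flagged: the score stays at 1.0
print(safety_score("insult", 0.7, "positive", 0.9))
```

Note that a toxicity label other than `toxic` (such as `insult` in the last call) does not lower the score under this formula, however confident the model is.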

## 📊 Visualization Functions

These functions create interactive visualizations using Plotly:
1. Channel Performance Analysis
2. Engagement Rate Analysis
3. Transcript Length vs Views
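4. Sentiment Distribution (listed for completeness; `create_visualizations` in the cell below returns only the first three figures)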
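
A sentiment distribution chart could be built from the `sentiment` column along the following lines. This is a sketch under assumptions, not code from the notebook: `sentiment_distribution` and `sentiment_figure` are hypothetical helpers, and Plotly is imported inside the plotting function so the counting step can be run on its own.

```python
from collections import Counter
from typing import Dict, Iterable


def sentiment_distribution(labels: Iterable[str]) -> Dict[str, float]:
    """Return the percentage share of each sentiment label."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: round(100 * n / total, 1) for label, n in counts.items()}


def sentiment_figure(labels: Iterable[str]):
    """Build a bar chart of label shares; Plotly is imported only when called."""
    import plotly.graph_objects as go

    shares = sentiment_distribution(labels)
    fig = go.Figure(go.Bar(x=list(shares.keys()), y=list(shares.values())))
    fig.update_layout(title="Sentiment Distribution",
                      xaxis_title="Sentiment",
                      yaxis_title="Share of Videos (%)",
                      template="plotly_dark")
    return fig


example = ["POSITIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
print(sentiment_distribution(example))
```

In the notebook, `sentiment_figure(df['sentiment'])` would be the natural call, and the resulting figure could be shown in an additional dashboard tab.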


## 📈 Metrics Generation

Functions to calculate and display key metrics:
- Total video count
- View statistics
- Engagement rates
- Transcript analysis
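
Engagement can be summarized in two ways that give different numbers. The metrics card in the cell below pools all videos (total likes plus comments over total views), while the scatter plot uses a separate rate for each video. The following sketch contrasts the two on made-up figures; `pooled_engagement` and `mean_engagement` are hypothetical helper names.

```python
from typing import List, Tuple

Video = Tuple[int, int, int]  # (views, likes, comments)


def pooled_engagement(videos: List[Video]) -> float:
    """Total interactions divided by total views, as a percentage."""
    views = sum(v for v, _, _ in videos)
    interactions = sum(l + c for _, l, c in videos)
    return 100 * interactions / views if views else 0.0


def mean_engagement(videos: List[Video]) -> float:
    """Mean of per-video engagement rates, skipping videos with zero views."""
    rates = [100 * (l + c) / v for v, l, c in videos if v]
    return sum(rates) / len(rates) if rates else 0.0


videos = [(1000, 50, 10), (100, 20, 5)]
print(round(pooled_engagement(videos), 2))  # 85 / 1100 * 100
print(round(mean_engagement(videos), 2))    # (6.0 + 25.0) / 2
```

The pooled rate is dominated by high-view videos, whereas the mean of per-video rates weights every video equally, so a small video with unusually high engagement moves it much more.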

In [8]:
class AnalysisState:
    def __init__(self):
        self.df = None
        self.content_analyzer = None
        #self.rag_system = None

def create_visualizations(df):
    """Create detailed visualizations with clear metrics"""

    # 1. Channel Performance
    channel_fig = go.Figure()
    channel_stats = df.groupby('channel_title').agg({
        'views': 'sum',
        'video_id': 'count'
    }).reset_index().sort_values('views', ascending=False).head(10)

    channel_fig.add_trace(go.Bar(
        x=channel_stats['channel_title'],
        y=channel_stats['views'],
        marker_color='rgb(55, 83, 109)',
        hovertemplate=(
            "<b>Channel:</b> %{x}<br>" +
            "<b>Total Views:</b> %{y:,.0f}<br>" +
            "<b>Videos:</b> %{customdata}<br>"
        ),
        customdata=channel_stats['video_id']
    ))
    channel_fig.update_layout(
        title='Top 10 Channels by Total View Count',
        xaxis_title='Channel Name',
        yaxis_title='Total Views',
        template='plotly_dark',
        showlegend=False,
        hoverlabel=dict(bgcolor="white", font_size=12)
    )

    # 2. Engagement Analysis
    # Treat zero-view videos as missing so the division cannot produce inf
    df['engagement_rate'] = ((df['likes'] + df['comments']) / df['views'].replace(0, np.nan) * 100).round(2)
    engagement_fig = go.Figure()

    engagement_fig.add_trace(go.Scatter(
        x=df['views'],
        y=df['engagement_rate'],
        mode='markers',
        marker=dict(
            size=10,
            color=df['likes'],
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title="Likes Count")
        ),
        hovertemplate=(
            "<b>Title:</b> %{customdata[0]}<br>" +
            "<b>Views:</b> %{x:,.0f}<br>" +
            "<b>Engagement Rate:</b> %{y:.2f}%<br>" +
            "<b>Likes:</b> %{customdata[1]:,.0f}<br>" +
            "<b>Comments:</b> %{customdata[2]:,.0f}<br>"
        ),
        customdata=np.column_stack((df['title'], df['likes'], df['comments']))
    ))
    engagement_fig.update_layout(
        title=(
            'Video Engagement Analysis<br>' +
            '<sup>Engagement Rate = (Likes + Comments) / Views × 100</sup>'
        ),
        xaxis_title='Views',
        yaxis_title='Engagement Rate (%)',
        template='plotly_dark',
        hoverlabel=dict(bgcolor="white", font_size=12)
    )

    # 3. Transcript Length vs Views
    transcript_fig = go.Figure()
    transcript_fig.add_trace(go.Scatter(
        x=df['transcript_length'],
        y=df['views'],
        mode='markers',
        marker=dict(
            size=10,
            color=df['engagement_rate'],
            colorscale='Viridis',
            showscale=True,
            colorbar=dict(title="Engagement Rate (%)")
        ),
        hovertemplate=(
            "<b>Title:</b> %{customdata[0]}<br>" +
            "<b>Word Count:</b> %{x:,.0f}<br>" +
            "<b>Views:</b> %{y:,.0f}<br>" +
            "<b>Engagement:</b> %{customdata[1]:.2f}%<br>"
        ),
        customdata=np.column_stack((df['title'], df['engagement_rate']))
    ))
    transcript_fig.update_layout(
        title='Transcript Length vs. Views',
        xaxis_title='Transcript Word Count',
        yaxis_title='Views',
        template='plotly_dark',
        hoverlabel=dict(bgcolor="white", font_size=12)
    )

    return channel_fig, engagement_fig, transcript_fig

def create_metrics_html(df):
    """Create detailed metrics summary with explanations"""
    metrics = {
        'Total Videos': {
            'value': format(len(df), ','),
            'desc': 'Total number of videos analyzed with available transcripts'
        },
        'Total Views': {
            'value': format(int(df['views'].sum()), ','),
            'desc': 'Combined view count across all videos'
        },
        'Average Views per Video': {
            'value': format(int(df['views'].mean()), ','),
            'desc': 'Mean number of views per video'
        },
        'Average Engagement Rate': {
            'value': f"{((df['likes'].sum() + df['comments'].sum()) / df['views'].sum() * 100):.2f}%",
            'desc': 'Total (Likes + Comments) / Total Views × 100, pooled across all videos'
        },
        'Average Transcript Length': {
            'value': f"{int(df['transcript_length'].mean()):,} words",
            'desc': 'Average word count of video transcripts'
        }
    }

    html = """
    <div style='display: flex; justify-content: space-around; flex-wrap: wrap; gap: 20px; padding: 20px;'>
    """

    for metric, data in metrics.items():
        html += f"""
        <div style='background: rgba(55, 83, 109, 0.1); padding: 20px; border-radius: 10px; min-width: 200px;
                    box-shadow: 0 4px 6px rgba(0, 0, 0, 0.1); text-align: center;'>
            <h3 style='margin: 0; color: #718096;'>{metric}</h3>
            <p style='margin: 10px 0 0 0; font-size: 24px; color: #2D3748;'>{data['value']}</p>
            <p style='margin: 5px 0 0 0; font-size: 12px; color: #718096;'>{data['desc']}</p>
        </div>
        """

    html += "</div>"
    return html

def perform_analysis(api_key: str, topic: str, max_results: int, state: AnalysisState):
    """Perform complete analysis and update state"""
    try:
        # 1. Fetch videos (the Gradio slider may pass a float, so cast to int first)
        analyzer = YouTubeAnalyzer(api_key)
        df = analyzer.fetch_videos(topic, int(max_results))

        if df is None:
            return "No videos found", None, None, None, None

        # Update state
        state.df = df

        # Create visualizations
        channel_fig, engagement_fig, transcript_fig = create_visualizations(df)
        metrics_html = create_metrics_html(df)
        top_videos_html = df.nlargest(10, 'views')[['title', 'channel_title', 'views', 'likes']].to_html(
            index=False, classes='table table-striped'
        )

        return metrics_html, channel_fig, engagement_fig, transcript_fig, top_videos_html

    except Exception as e:
        print(f"Error in analysis: {str(e)}")
        return f"Error: {str(e)}", None, None, None, None

# RAG question answering is currently disabled.
# def process_question(question: str, state: AnalysisState) -> str:
#     """Process RAG question"""
#     if state.rag_system is None:
#         return "Please analyze videos first"
#
#     try:
#         results = state.rag_system.query(question)
#         return "\n\n".join(["Relevant content:"] + results)
#     except Exception as e:
#         return f"Error processing question: {str(e)}"

def create_gradio_interface():
    state = AnalysisState()

    with gr.Blocks(theme=gr.themes.Soft()) as demo:
        gr.Markdown("""
        # Advanced YouTube Topic Analysis Dashboard
        ## Analyze YouTube content with AI-powered insights
        """)

        with gr.Row():
            api_key = gr.Textbox(label="YouTube API Key", type="password")
            topic = gr.Textbox(label="Topic to Analyze", value="technology")
            max_results = gr.Slider(
                minimum=100,
                maximum=5000,
                value=500,
                step=100,
                label="Number of Videos"
            )

        analyze_btn = gr.Button("Analyze Topic", variant="primary")

        # Results section
        metrics_html = gr.HTML(label="Key Metrics")

        with gr.Tabs():
            with gr.TabItem("Channel Analysis"):
                channel_plot = gr.Plot()
            with gr.TabItem("Engagement Analysis"):
                engagement_plot = gr.Plot()
            with gr.TabItem("Transcript Analysis"):
                transcript_plot = gr.Plot()

        with gr.Row():
            gr.Markdown("### Top Performing Videos")
            top_videos_html = gr.HTML()

        # Set up event handlers
        analyze_btn.click(
            fn=lambda api_key, topic, max_results: perform_analysis(api_key, topic, max_results, state),
            inputs=[api_key, topic, max_results],
            outputs=[metrics_html, channel_plot, engagement_plot, transcript_plot, top_videos_html]
        )

    return demo

## 🎨 Dashboard Interface

Launch the interactive Gradio dashboard:
1. Enter your YouTube API key
2. Choose your topic
3. Set number of videos
4. View the analysis results

Note: The dashboard URL will appear below after running the cell.

## 📝 Usage Instructions

1. Enter your YouTube API key in the dashboard
2. Input your desired topic (e.g., "cooking", "travel")
3. Choose number of videos to analyze
4. Click "Analyze Topic"
5. Explore the visualizations and metrics

## ⚠️ Important Notes
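- Only videos with English transcripts are processed; all others are skipped
- The number of videos you can analyze is limited by your YouTube Data API quota
- Processing time grows with the number of videos (the run below took about 5 minutes for 500 videos)
- Keep your API key private; `share=True` in the launch cell publishes the dashboard at a public URL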
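
To get a feel for the quota constraint, the cost of one run can be estimated from the number of API calls. The sketch below is a rough estimate under stated assumptions: it assumes a search request costs 100 quota units and a video details request costs 1 unit (check the current YouTube Data API quota documentation before relying on these), and `estimate_quota` is a hypothetical helper. The details call is counted once per video that has a transcript, since the notebook only requests metadata for those.

```python
import math


def estimate_quota(videos_requested: int,
                   videos_with_transcripts: int,
                   search_cost: int = 100,   # assumed cost of one search request
                   details_cost: int = 1,    # assumed cost of one video details request
                   page_size: int = 50) -> int:
    """Estimate quota units for one analysis run."""
    search_calls = math.ceil(videos_requested / page_size)
    return search_calls * search_cost + videos_with_transcripts * details_cost


# Figures from the sample run: 500 videos found, 77 kept with transcripts
print(estimate_quota(500, 77))
```

Under these assumptions the search pages dominate the cost, so the number of videos requested matters far more than how many of them turn out to have transcripts.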

In [9]:
# Create and launch the interface
if __name__ == "__main__":
    demo = create_gradio_interface()
    demo.launch(debug=True, share=True)

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
* Running on public URL: https://87f000ebe02d299450.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from the terminal in the working directory to deploy to Hugging Face Spaces (https://huggingface.co/spaces)


No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.


Fetching videos for: trekking
Found 500 videos, checking for transcripts...


Processing videos:   4%|▍         | 21/500 [00:14<04:35,  1.74it/s]
You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Processing videos: 100%|██████████| 500/500 [04:59<00:00,  1.67it/s]



Processing Summary:
- Total videos with transcripts: 77
- Average transcript length: 1128 words
- Total channels: 59
Keyboard interruption in main thread... closing server.
Killing tunnel 127.0.0.1:7860 <> https://87f000ebe02d299450.gradio.live


In [11]:
!pip freeze > requirements.txt