# NLP Document Similarity Analysis
## Comparing YouTube Transcript with 2083 Document

This notebook compares a YouTube video transcript with a large document (1000+ pages) using lexical similarity scores, chunk-level comparison, vocabulary overlap, keyword counts, and exact phrase matching.

In [None]:
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import re
from collections import Counter

plt.style.use('seaborn-v0_8-darkgrid')
%matplotlib inline

## 1. Data Loading

In [None]:
# Load documents
with open('2083. EUROPEAN DECLARATION OF INDEPENDENCE.txt', 'r', encoding='utf-8', errors='ignore') as f:
    doc_text = f.read()

with open('youtube_transcript_clean.txt', 'r', encoding='utf-8', errors='ignore') as f:
    yt_text = f.read()

print(f"Document size: {len(doc_text):,} characters")
print(f"YouTube transcript size: {len(yt_text):,} characters")
print(f"\nDocument words: {len(doc_text.split()):,}")
print(f"YouTube words: {len(yt_text.split()):,}")

## 2. Similarity Scores Overview

We scored the two texts with six lexical similarity methods to understand how they relate. The scores are hard-coded in the cell below for tabulation and plotting.

In [None]:
# Summary of all similarity scores
similarity_results = {
    'Method': [
        'Bag-of-Words (with stop words)',
        'Bag-of-Words (no stop words)',
        'Bigrams',
        'Trigrams',
        'TF-IDF (sklearn)',
        'TF-IDF with N-grams'
    ],
    'Similarity (%)': [91.05, 28.16, 31.14, 0.93, 16.01, 15.31]
}

df_similarity = pd.DataFrame(similarity_results)
print(df_similarity.to_string(index=False))

# Visualize
fig, ax = plt.subplots(figsize=(12, 6))
bars = ax.barh(df_similarity['Method'], df_similarity['Similarity (%)'], color='steelblue')
ax.set_xlabel('Similarity Score (%)', fontsize=12)
ax.set_title('Document Similarity Scores by Method', fontsize=14, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

# Add value labels just past the end of each bar
for bar in bars:
    width = bar.get_width()
    ax.text(width + 1, bar.get_y() + bar.get_height() / 2,
            f'{width:.2f}%', ha='left', va='center', fontsize=10)

plt.tight_layout()
plt.show()

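### Key Insight:
- **Bag-of-words with stop words: 91%** - Inflated by common function words that dominate both texts
- **TF-IDF: 16%** - The most informative score here because it down-weights common terms; it measures weighted word overlap, not semantic similarity
- **Trigrams: <1%** - Very few exact 3-word phrase matches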
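
The scores above are entered by hand, so the notebook does not show how they were produced. The following is a minimal sketch of one way to compute comparable scores with scikit-learn. The vectorizer settings (stop-word list, n-gram ranges), the helper names `similarity_pct` and `compare_methods`, and the toy sentences are assumptions for illustration and are not taken from the original analysis.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def similarity_pct(text_a, text_b, vectorizer):
    """Cosine similarity (as a percentage) between two texts under one vectorizer."""
    matrix = vectorizer.fit_transform([text_a, text_b])
    return float(cosine_similarity(matrix[0], matrix[1])[0, 0]) * 100


def compare_methods(text_a, text_b):
    """Score one pair of texts under several lexical representations.

    The settings below are assumed; the original configuration is not shown.
    """
    methods = {
        'Bag-of-Words (with stop words)': CountVectorizer(),
        'Bag-of-Words (no stop words)': CountVectorizer(stop_words='english'),
        'Bigrams': CountVectorizer(ngram_range=(2, 2)),
        'Trigrams': CountVectorizer(ngram_range=(3, 3)),
        'TF-IDF (sklearn)': TfidfVectorizer(stop_words='english'),
        'TF-IDF with N-grams': TfidfVectorizer(stop_words='english', ngram_range=(1, 3)),
    }
    return {name: similarity_pct(text_a, text_b, vec) for name, vec in methods.items()}


# Toy inputs (hypothetical, not taken from either text)
toy_a = "the history of the university and the teaching of literature in the west"
toy_b = "the decline of the teaching of history and literature in the modern university"
toy_scores = compare_methods(toy_a, toy_b)
for name, score in toy_scores.items():
    print(f"{name}: {score:.2f}%")
```

Calling `compare_methods(doc_text, yt_text)` would apply the same comparison to the two loaded texts. Even on the toy pair, keeping stop words raises the bag-of-words score, which is the inflation effect described above.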

## 3. Chunk Analysis

We split the 1000+ page document into 4 chunks (~250 pages each) and compared each chunk to the YouTube transcript.

In [None]:
# Chunk similarity results
chunk_results = {
    'Chunk': ['Chunk 1', 'Chunk 2', 'Chunk 3', 'Chunk 4', 'Whole Doc'],
    'TF-IDF (%)': [10.94, 13.18, 10.37, 8.52, 16.01],
    'N-grams (%)': [7.32, 7.92, 6.87, 5.19, 15.31]
}

df_chunks = pd.DataFrame(chunk_results)
print(df_chunks.to_string(index=False))

# Visualize
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# TF-IDF scores
ax1.bar(df_chunks['Chunk'], df_chunks['TF-IDF (%)'], color='coral', alpha=0.7)
ax1.set_ylabel('Similarity (%)', fontsize=11)
ax1.set_title('TF-IDF Similarity by Chunk', fontsize=12, fontweight='bold')
ax1.grid(axis='y', alpha=0.3)
ax1.tick_params(axis='x', rotation=45)

# N-gram scores
ax2.bar(df_chunks['Chunk'], df_chunks['N-grams (%)'], color='teal', alpha=0.7)
ax2.set_ylabel('Similarity (%)', fontsize=11)
ax2.set_title('N-gram Similarity by Chunk', fontsize=12, fontweight='bold')
ax2.grid(axis='y', alpha=0.3)
ax2.tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

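### Key Finding:
**The whole document (16%) scores higher than any individual chunk (8.5-13.2%).**

This is consistent with the YouTube transcript sharing vocabulary with several parts of the document rather than one section. Cosine similarity is not additive across chunks, so the pattern indicates overlap spread over the document; it does not by itself show that the video draws on it.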
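
A minimal sketch of the chunk comparison, assuming the document is split into contiguous pieces of equal word count and each piece is scored against the transcript with its own TF-IDF fit. The helper names `split_into_chunks` and `chunk_similarities`, the vectorizer settings, and the toy texts are hypothetical; the original chunking method is not shown in the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def split_into_chunks(text, n_chunks=4):
    """Split text into n_chunks contiguous pieces of roughly equal word count."""
    words = text.split()
    bounds = [round(i * len(words) / n_chunks) for i in range(n_chunks + 1)]
    return [' '.join(words[start:end]) for start, end in zip(bounds, bounds[1:])]


def chunk_similarities(doc, query, n_chunks=4):
    """TF-IDF cosine similarity (%) of query against each chunk and the whole doc.

    Each comparison fits a fresh vectorizer on the pair being compared
    (an assumption about how such scores might be produced).
    """
    pieces = split_into_chunks(doc, n_chunks) + [doc]
    labels = [f'Chunk {i + 1}' for i in range(n_chunks)] + ['Whole Doc']
    scores = {}
    for label, piece in zip(labels, pieces):
        matrix = TfidfVectorizer(stop_words='english').fit_transform([piece, query])
        scores[label] = float(cosine_similarity(matrix[0], matrix[1])[0, 0]) * 100
    return scores


# Toy data (hypothetical): the query shares exactly one term with each chunk
toy_doc = ("schools teach history and literature "
           "universities debate theory and criticism "
           "markets trade goods and services "
           "cities build roads and bridges")
toy_query = "history theory goods roads"
toy_chunk_scores = chunk_similarities(toy_doc, toy_query)
for label, score in toy_chunk_scores.items():
    print(f"{label}: {score:.2f}%")
```

In the toy case the query matches one term per chunk but all four terms in the whole document, so the whole-document score exceeds every chunk score. This is the same pattern as in the table above, produced by overlap that is spread evenly across sections.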

## 4. Vocabulary Analysis

In [None]:
def preprocess_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)
    return text.split()

doc_tokens = preprocess_text(doc_text)
yt_tokens = preprocess_text(yt_text)

doc_vocab = set(doc_tokens)
yt_vocab = set(yt_tokens)

common_vocab = doc_vocab & yt_vocab
yt_only = yt_vocab - doc_vocab
doc_only = doc_vocab - yt_vocab

vocab_data = {
    'Category': ['Common', 'YT Only', 'Doc Only'],
    'Count': [len(common_vocab), len(yt_only), len(doc_only)]
}

df_vocab = pd.DataFrame(vocab_data)

# Pie chart
fig, ax = plt.subplots(figsize=(10, 6))
colors = ['#66b3ff', '#ff9999', '#99ff99']
explode = (0.1, 0, 0)

ax.pie(df_vocab['Count'], labels=df_vocab['Category'], autopct='%1.1f%%',
       colors=colors, explode=explode, startangle=90)
ax.set_title('Vocabulary Distribution\n(YT Transcript: {} unique terms)'.format(len(yt_vocab)), 
             fontsize=14, fontweight='bold')

plt.show()

print(f"\nVocabulary overlap: {len(common_vocab) / len(yt_vocab) * 100:.1f}%")
print(f"\nTerms unique to YT transcript: {len(yt_only)}")
print("Sample:")
print(sorted(yt_only)[:15])  # sorted so the sample is the same on every run

## 5. Named Entity Analysis

In [None]:
# Key names and concepts
names_data = {
    'Name/Concept': ['Marx', 'Shakespeare', 'Frankfurt School', 'Political Correctness', 
                     'Cultural Marxism', 'Islam', 'Western Civilization'],
    'In YT': [1, 1, 0, 0, 0, 0, 1],
    'In Document': [14, 11, 63, 58, 28, 186, 0]
}

df_names = pd.DataFrame(names_data)

fig, ax = plt.subplots(figsize=(12, 6))
x = np.arange(len(df_names['Name/Concept']))
width = 0.35

bars1 = ax.bar(x - width/2, df_names['In YT'], width, label='YouTube', color='orange', alpha=0.8)
bars2 = ax.bar(x + width/2, df_names['In Document'], width, label='Document', color='blue', alpha=0.8)

ax.set_xlabel('Entity', fontsize=11)
ax.set_ylabel('Frequency', fontsize=11)
ax.set_title('Named Entity Frequency Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(df_names['Name/Concept'], rotation=45, ha='right')
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nKey observation: Document heavily focuses on Islam (186 mentions),")
print("while YT transcript does not mention Islam at all.")
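
The frequencies above are entered by hand. A minimal sketch of how such counts could be obtained, assuming case-insensitive, whole-word matching that tolerates line breaks inside a phrase. The helper names `count_phrase` and `count_phrases` and the toy text are hypothetical, and the matching rules actually used for the table are not shown in the notebook.

```python
import re


def count_phrase(text, phrase):
    """Count case-insensitive, whole-word occurrences of phrase in text."""
    words = [re.escape(w) for w in phrase.split()]
    # \s+ between words lets a phrase match across line breaks or double spaces
    pattern = r'\b' + r'\s+'.join(words) + r'\b'
    return len(re.findall(pattern, text, flags=re.IGNORECASE))


def count_phrases(text, phrases):
    """Map each phrase to its number of occurrences in text."""
    return {phrase: count_phrase(text, phrase) for phrase in phrases}


# Toy text (hypothetical, not taken from either source)
toy_text = ("Marx wrote about class. Critics of Marxism cite the Frankfurt "
            "School; the Frankfurt\nSchool is mentioned twice here.")
toy_counts = count_phrases(toy_text, ['Marx', 'Frankfurt School', 'Islam'])
print(toy_counts)
```

Whole-word matching means "Marxism" is not counted as "Marx". Different matching rules (substring vs. whole word, case handling) give different totals, so counts are only comparable when produced with the same rules.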

## 6. Concept Search Results

In [None]:
concepts_data = {
    'Concept': ['Political Correctness', 'Cultural Marxism', 'Frankfurt School', 
                'Post Modernist', 'Western Civilization', 'Freedom of Speech'],
    'Document': [78, 28, 72, 0, 0, 2],
    'YouTube': [0, 0, 0, 3, 1, 1],
    'Status': ['Doc Only', 'Doc Only', 'Doc Only', 'YT Only', 'YT Only', 'Both']
}

df_concepts = pd.DataFrame(concepts_data)
print(df_concepts.to_string(index=False))

# Bubble chart: one column per text, one row per concept
fig, ax = plt.subplots(figsize=(10, 6))

# Bubble area grows with frequency. The YouTube column uses a 10x larger scale
# factor so its small counts stay visible; compare sizes within a column only.
for i, concept in enumerate(df_concepts['Concept']):
    doc_val = df_concepts.loc[i, 'Document']
    yt_val = df_concepts.loc[i, 'YouTube']

    if doc_val > 0:
        ax.scatter(0, i, s=doc_val*50, c='blue', alpha=0.6)
    if yt_val > 0:
        ax.scatter(1, i, s=yt_val*500, c='orange', alpha=0.6)

ax.set_yticks(range(len(df_concepts)))
ax.set_yticklabels(df_concepts['Concept'])
ax.set_xticks([0, 1])
ax.set_xticklabels(['Document', 'YouTube'])
ax.set_xlim(-0.5, 1.5)  # leave room so large bubbles are not clipped
ax.set_title('Concept Distribution (bubble size = frequency, scaled per column)', fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 7. Exact Phrase Matches

In [None]:
phrase_matches = {
    'Phrase Length': ['4+ words', '3 words', '2 words'],
    'Matches Found': [2, 1, 193],
    'Overlap %': [0.28, 0.14, 28.4]
}

df_phrases = pd.DataFrame(phrase_matches)
print(df_phrases.to_string(index=False))

print("\n=== Notable Exact Matches ===")
print("5-word match: 'for the first time in'")
print("3-word match: 'throughout the west'")
print("3-word match: 'dead white males'")
print("\nThe phrase 'dead white males' appears in both documents")
print("in the context of Shakespeare and curriculum changes.")
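
The match counts above are entered by hand. A minimal sketch of how shared phrases could be found, assuming tokenization like `preprocess_text` above and an overlap percentage defined as the share of the first text's distinct n-grams that also occur in the second. That definition, the helper names, and the toy sentences are assumptions; the notebook does not state how "Overlap %" was computed.

```python
import re


def tokenize(text):
    """Lowercase, strip punctuation, split on whitespace (mirrors preprocess_text)."""
    return re.sub(r'[^a-z0-9\s]', '', text.lower()).split()


def ngrams(tokens, n):
    """Return the set of distinct n-grams (as tuples) in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(text_a, text_b, n):
    """Distinct n-grams that occur in both texts."""
    return ngrams(tokenize(text_a), n) & ngrams(tokenize(text_b), n)


def overlap_pct(text_a, text_b, n):
    """Share (%) of text_a's distinct n-grams that also occur in text_b."""
    grams_a = ngrams(tokenize(text_a), n)
    if not grams_a:
        return 0.0
    return len(grams_a & ngrams(tokenize(text_b), n)) / len(grams_a) * 100


# Toy sentences (hypothetical, not taken from either text)
toy_a = "For the first time in years, the course changed."
toy_b = "It happened for the first time in the city."
shared_5 = shared_ngrams(toy_a, toy_b, 5)
shared_3 = shared_ngrams(toy_a, toy_b, 3)
print([' '.join(g) for g in sorted(shared_5)])
print(f"3-gram overlap: {overlap_pct(toy_a, toy_b, 3):.1f}%")
```

Note that one shared 5-word phrase also contributes three shared 3-word phrases, so counts at different lengths overlap unless shorter matches contained in longer ones are removed.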

## 8. Conclusions

### Summary of Findings:

1. **TF-IDF Similarity: ~16%**
   - Initial bag-of-words score of 91% was inflated by common stop words
   - TF-IDF, which down-weights common words, shows low lexical similarity; it measures weighted word overlap rather than meaning

2. **Minimal Direct Quotation**
   - Only 2 exact matches of 4+ words found
   - Less than 1% trigram overlap
   - The texts are thematically related, but the transcript does not quote the document at length

3. **Different Terminology**
   - Document uses: "Political Correctness", "Cultural Marxism"
   - YouTube uses: "Post Modernist"
   - Possibly related concepts under different labels; term counts alone cannot confirm this

4. **Different Focus**
   - Document: Heavy focus on Islam (186 mentions)
   - YouTube: Focus on education, post-modernism (no Islam mentions)

5. **Distributed Themes**
   - Whole document scores higher (16%) than any chunk (8-13%)
   - Shared vocabulary is spread across the document rather than concentrated in one section

### Interpretation:
The YouTube transcript and document share **ideological themes and vocabulary** but represent **distinct texts** with different emphases. The video appears to discuss related cultural and educational topics without directly quoting the document.