<a href="https://colab.research.google.com/github/tanvircr7/meh/blob/master/DL_App_Prototype.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [10]:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize import sent_tokenize
import nltk

def segment_script(text, min_segment_size=3):
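    """
    Segments a script into coherent sections by cutting where the TF-IDF
    similarity between adjacent sentences is low.

    Args:
        text (str): The input script text
        min_segment_size (int): Minimum number of sentences a segment must
            contain before a boundary may close it. The final segment is
            not checked and may be shorter.

    Returns:
        list: List of text segments (strings of space-joined sentences)
    """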
    # Download required NLTK data
    nltk.download('punkt')
    nltk.download('punkt_tab')

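    # Split into sentences
    sentences = sent_tokenize(text)

    # With fewer than two sentences there are no adjacent pairs to compare
    if len(sentences) < 2:
        return sentences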

    # Create TF-IDF vectors for each sentence
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(sentences)

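    # Calculate similarity between adjacent sentences. TfidfVectorizer
    # L2-normalizes each row by default, so the dot product of two rows
    # is their cosine similarity.
    similarity_scores = []
    for i in range(len(sentences)-1):
        similarity = (tfidf_matrix[i] @ tfidf_matrix[i+1].T).toarray()[0][0]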
        similarity_scores.append(similarity)

    # Find segment boundaries where similarity falls below a threshold set
    # just under the mean score (mean minus 0.01 standard deviations)
    threshold = np.mean(similarity_scores) - 0.01 * np.std(similarity_scores)
    boundaries = [0]
    current_size = 0

    for i, score in enumerate(similarity_scores):
        current_size += 1
        if score < threshold and current_size >= min_segment_size:
            boundaries.append(i + 1)
            current_size = 0

    boundaries.append(len(sentences))

    # Create segments
    segments = []
    for i in range(len(boundaries)-1):
        segment = ' '.join(sentences[boundaries[i]:boundaries[i+1]])
        segments.append(segment)

    return segments

# Example usage
sample_text = """
Artificial intelligence has revolutionized many fields. From healthcare to transportation,
AI systems are making our lives easier. Machine learning algorithms can now recognize patterns
that humans might miss. This has led to breakthrough discoveries in medicine.

The ethics of AI is another important consideration. We must ensure AI systems are fair
and unbiased. Privacy concerns have also emerged as AI systems collect more data.
These challenges require careful thought and regulation.

Looking to the future, AI will continue to evolve. New architectures and algorithms
are being developed. The potential applications seem limitless. However, we must
proceed with caution and wisdom.
"""

segments = segment_script(sample_text)

for i, segment in enumerate(segments):
    print(f"\nSegment {i+1}:")
    print(segment)


Segment 1:

Artificial intelligence has revolutionized many fields. From healthcare to transportation, 
AI systems are making our lives easier. Machine learning algorithms can now recognize patterns
that humans might miss.

Segment 2:
This has led to breakthrough discoveries in medicine. The ethics of AI is another important consideration. We must ensure AI systems are fair
and unbiased. Privacy concerns have also emerged as AI systems collect more data.

Segment 3:
These challenges require careful thought and regulation. Looking to the future, AI will continue to evolve. New architectures and algorithms
are being developed.

Segment 4:
The potential applications seem limitless. However, we must
proceed with caution and wisdom.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
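
The similarity step in `segment_script` takes a plain dot product of adjacent TF-IDF rows. That works as cosine similarity because `TfidfVectorizer` L2-normalizes each row by default (`norm='l2'`). The sketch below illustrates the same identity in pure Python using raw word counts; it does not reproduce scikit-learn's IDF weighting or stop-word removal, and `unit_vector` and `dot` are hypothetical helpers, not part of the notebook. It assumes non-empty sentences.

```python
import math
from collections import Counter


def unit_vector(sentence):
    """Return an L2-normalized bag-of-words vector as a dict (illustrative only)."""
    counts = Counter(sentence.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {word: c / norm for word, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(weight * v.get(word, 0.0) for word, weight in u.items())


a = unit_vector("AI systems are fair")
b = unit_vector("AI systems collect data")
c = unit_vector("proceed with caution")

# Unit-length vectors: the dot product is already the cosine similarity.
print(round(dot(a, b), 3))  # two of four words shared -> 0.5
print(dot(a, c))            # no shared words -> 0.0
```

Because the vectors are unit length, no division by norms is needed at comparison time, which is why the notebook can use a bare matrix product.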


In [5]:
len(segments)

4
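
In the printed output, Segment 4 contains only two sentences although `min_segment_size` is 3. The boundary loop only checks the size of the segment it is closing, so the trailing segment can end up short. The following is a minimal sketch of one way to handle that, assuming it is acceptable to merge a short final segment into the previous one. `find_boundaries` is a hypothetical helper, not part of the notebook, and the scores are made up to exercise the logic.

```python
from statistics import mean, pstdev


def find_boundaries(scores, min_segment_size=3):
    """Return sentence indices that delimit segments.

    scores[i] is the similarity between sentence i and sentence i + 1,
    so there are len(scores) + 1 sentences in total.
    """
    n_sentences = len(scores) + 1
    if not scores:
        return [0, n_sentences]

    # Same rule as the notebook: cut where similarity drops below
    # mean - 0.01 * population standard deviation (np.std's default).
    threshold = mean(scores) - 0.01 * pstdev(scores)

    boundaries = [0]
    for i, score in enumerate(scores):
        # (i + 1) - boundaries[-1] is the size of the segment being closed.
        if score < threshold and (i + 1) - boundaries[-1] >= min_segment_size:
            boundaries.append(i + 1)

    # Assumption: a too-short trailing segment is merged into the previous one.
    if len(boundaries) > 1 and n_sentences - boundaries[-1] < min_segment_size:
        boundaries.pop()
    boundaries.append(n_sentences)
    return boundaries


# Made-up scores for 8 sentences, with dips after sentences 3 and 6.
scores = [0.4, 0.5, 0.0, 0.6, 0.5, 0.0, 0.4]
print(find_boundaries(scores))  # [0, 3, 8]
```

Here the second dip would leave a two-sentence tail, so that boundary is dropped and the result is two segments of three and five sentences. Merging backward is only one option; dropping the size guarantee, as the notebook does, is simpler when short closing segments are acceptable.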