<a href="https://colab.research.google.com/github/sammainahkinya1404/Machine-Learning-Projects/blob/main/LLM_Applications.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 1. Text Extraction from Document

In [None]:
# Install the missing package
!pip install langchain-text-splitters

import re
from typing import List

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document


In [None]:
text= """
# PART 1: CORE CGT FRAMEWORK

## 1.1 What is Capital Gains Tax (CGT)?

**Summary:** CGT is a tax on the profit (capital gain) you make when you dispose of a CGT asset.

**Key Points:**
- CGT started on 20 September 1985
- Assets acquired before this date are "pre-CGT" (generally exempt)
- Assets acquired on or after this date are "post-CGT" (subject to CGT)
- CGT is not a separate tax - capital gains are added to your assessable income
- You pay tax at your marginal rate (with potential 50% discount)

**Legislative References:**
- ITAA 1997 Division 104 (CGT events)
- ITAA 1997 Division 110 (Cost base)
- ITAA 1997 Division 115 (CGT discount)

---

## 1.2 Pre-CGT vs Post-CGT Assets

### 1.2.1 Pre-CGT Assets (Acquired Before 20 Sep 1985)

**GENERAL RULE:** Assets acquired before 20 September 1985 are exempt from CGT.

**Legislative Reference:** ITAA 1997 s149-10

**APPLIES WHEN:**
- Asset acquired before 20 Sep 1985
- No CGT events have occurred that break pre-CGT status
- No major changes to the asset structure

**DOES NOT APPLY WHEN (Critical Exceptions):**
1. **Asset has been subdivided** → TD 2003/11 applies (subdivided lots become post-CGT)
2. **Post-CGT improvements made** → Improvement is separate post-CGT asset
3. **Asset transferred to associate** → May trigger CGT event
4. **Testamentary disposition** → Beneficiary gets pre-CGT status only if conditions met
5. **Strata title created** → New units may be post-CGT

**Common Mistake:**
❌ "My property was bought in 1979, so any sale is exempt"
✓ "My property was bought in 1979, so the subdivided lots RETAIN pre-CGT status and are fully exempt from CGT under TD 7 and s149-10"

**See Also:**
- Section 2.2 (Pre-CGT Land Subdivision - TD 7, s149-10)
- Section 1.2.2 (Post-CGT Improvements on Pre-CGT Land)

---

### 1.2.2 Post-CGT Improvements on Pre-CGT Land

**RULE:** When you make improvements to pre-CGT land after 19 September 1985, the improvement is a separate post-CGT asset.

**Legislative Reference:** TD 2017/13, s108-70 ITAA 1997

**How It Works:**
1. **Land component:** Remains pre-CGT (exempt)
2. **Building/improvement component:** Post-CGT (taxable)
3. **Sale proceeds must be apportioned** between land and building

**Apportionment Method:**
Use relative market values at the time of sale or construction (whichever is appropriate).

**EXAMPLE:**
```
Property purchased: 15 June 1980 (pre-CGT)
House built: 1 March 2010 (post-CGT improvement)
Construction cost: $350,000
Property sold: 1 July 2025 for $1,200,000

Valuation at sale:
- Land component: $400,000 (33.3%)
- Building component: $800,000 (66.7%)

CGT calculation:
Land: $400,000 sale proceeds → PRE-CGT EXEMPT
Building: $800,000 sale proceeds
  Less cost base: $350,000
  Capital gain: $450,000
  CGT discount (50%): $225,000
  Net capital gain: $225,000
```

**APPLIES WHEN:**
- Original land was pre-CGT
- Improvement (building, structure) was made after 19 Sep 1985
- Selling the property with the improvement

**DOES NOT APPLY WHEN:**
- Both land and improvement are pre-CGT
- Property was subdivided (different rules apply - see TD 2003/11)
- Improvement is minor (repairs/maintenance, not capital)

**Common Mistake:**
❌ "The whole property is exempt because I bought the land pre-CGT"
✓ "The land is exempt, but the house I built in 2010 is taxable"

---
"""

In [None]:
import re
def extract_title_from_page(text: str) -> str:
    """Extract the section title from ATO CGT Guide page text."""
    lines = text.strip().split('\n')

    # Priority patterns for ATO guide headings
    section_patterns = [
        r'^(Chapter\s+\d+)',                    # Chapter 1, Chapter 2
        r'^(Part\s+[A-Z])',                     # Part A, Part B
        r'^(\d+\.\d+(?:\.\d+)?)\s+(.+)',        # 1.1 Title, 2.3.4 Title
        r'^(Section\s+\d+)',                    # Section 118
        r'^(s\d+[\.\d]*)',                      # s118.110
        r'^(ITAA\d+)',                          # ITAA97
        r'^(Main residence)',                   # Main residence exemption
        r'^(Cost base)',                        # Cost base elements
        r'^(Capital gain)',                     # Capital gains
        r'^(CGT)',                              # CGT events
    ]

    for line in lines[:15]:
        line = line.strip()
        if not line or len(line) < 5:
            continue

        # Skip page numbers, QC codes, dates
        if re.match(r'^[\d\s\-/]+$', line) or line.startswith('QC '):
            continue

        # Check for section patterns
        for pattern in section_patterns:
            match = re.match(pattern, line, re.IGNORECASE)
            if match:
                return line[:100]


        # Look for ALL CAPS headings
        if line.isupper() and 10 < len(line) < 80:
            return line

    # Fallback: find first substantive line
    for line in lines[:10]:
        line = line.strip()
        if 15 < len(line) < 100:
            if not re.match(r'^[\d\s\-/\$,\.]+$', line):
                if not line.startswith('QC '):
                    if not re.match(r'^[A-Z][a-z]+\s+(bought|sold|purchased|received)', line):
                        return line

    return "CGT Guide Reference"

In [None]:
extract_title_from_page(text)

'# PART 1: CORE CGT FRAMEWORK'
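The returned title keeps its Markdown `# ` prefix: every entry in `section_patterns` is anchored at the start of the line, so none of them matches a line beginning with `#`, and the ALL CAPS check is what finally accepts it. Below is a minimal sketch of one way to normalize heading lines before matching, assuming the input is Markdown-style text like the sample. `strip_markdown_heading` is a hypothetical helper, not part of the notebook.

```python
import re

def strip_markdown_heading(line: str) -> str:
    """Remove leading '#' markers and unwrap a line that is entirely bold."""
    # Drop '#', '##', ... prefixes together with surrounding whitespace.
    line = re.sub(r'^\s*#{1,6}\s*', '', line).strip()
    # Unwrap '**...**' only when the emphasis spans the whole line.
    line = re.sub(r'^\*\*(.+?)\*\*$', r'\1', line)
    return line.strip()

print(strip_markdown_heading('# PART 1: CORE CGT FRAMEWORK'))
print(strip_markdown_heading('## 1.1 What is Capital Gains Tax (CGT)?'))
```

Applying such a helper to each line inside `extract_title_from_page` would let the numbered-section pattern (`1.1 Title`) match Markdown subheadings as well.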

In [None]:
def detect_cgt_topics(text: str) -> List[str]:
    """Detect CGT topics present in the text."""
    topic_patterns = {
        # Original topics
        'main_residence_exemption': r'main residence|dwelling|home|lived in',
        'cost_base': r'cost base|first element|second element|third element',
        'capital_gain_calculation': r'capital gain|capital loss|net capital',
        'cgt_discount': r'cgt discount|50%.*discount|discount method',
        'indexation': r'indexation|indexed cost base|cpi',
        'cgt_events': r'cgt event|event [a-z]\d|cgt event a1|cgt event c1|cgt event d1|cgt event k',
        'exemptions': r'exempt|exemption|disregard',
        'rental_property': r'rent|rental|income.producing|tenant',
        'partial_exemption': r'partial|apportion|percentage|fraction',
        'six_year_rule': r'six.year|6.year|absence|temporary',
        'rollover': r'rollover|roll.over|defer|relationship breakdown',
        'small_business': r'small business|active asset|division 152',
        # New topics for supplementary rules
        'pre_cgt': r'pre.cgt|before.*1985|20 september 1985|pre-cgt',
        'foreign_resident': r'foreign resident|non.resident|withholding|clearance certificate|15%.*withhold',
        'joint_ownership': r'joint tenant|tenants in common|co.own|joint ownership',
        'smsf': r'smsf|self.managed|super fund|superannuation fund|pension phase|accumulation phase',
        'trust': r'family trust|discretionary trust|trust.*capital|streaming|beneficiar',
        'capital_losses': r'capital loss|carry forward|offset.*loss|net capital loss',
        'record_keeping': r'record keep|keep.*record|5 year|document.*retain',
        'market_valuation': r'market value|valuation|valuer|arm.s length',
        'granny_flat': r'granny flat|occupancy right|elderly|pension age',
        'subdivision': r'subdivid|subdivision|vacant land|strata title',
        'deceased_estate': r'deceased|death|inherit|beneficiary|estate|passed away',
        'compulsory_acquisition': r'compulsory acqui|involuntary disposal|government acqui',
    }

    detected = []
    text_lower = text.lower()

    for topic, pattern in topic_patterns.items():
        if re.search(pattern, text_lower):
            detected.append(topic)

    return detected

In [None]:
detect_cgt_topics(text)

['cost_base',
 'capital_gain_calculation',
 'cgt_discount',
 'cgt_events',
 'exemptions',
 'rental_property',
 'partial_exemption',
 'pre_cgt',
 'trust',
 'market_valuation',
 'subdivision',
 'deceased_estate']
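`rental_property` appears in this result even though the sample text never mentions renting. The bare alternative `rent` is matched as a substring, so it fires inside words such as "different". The sketch below compares the loose pattern with a word-bounded variant, assuming whole-word matching is the intent. The `strict` pattern is illustrative only and is not the notebook's pattern.

```python
import re

sample = "property was subdivided (different rules apply - see td 2003/11)"

# Pattern as used in the notebook: 'rent' matches anywhere, even inside other words.
loose = r'rent|rental|income.producing|tenant'
# Illustrative alternative: \b anchors require whole-word matches.
strict = r'\b(?:rent(?:al|ed|s)?|income.producing|tenants?)\b'

print(bool(re.search(loose, sample)))    # True  ('rent' inside 'different')
print(bool(re.search(strict, sample)))   # False
print(bool(re.search(strict, "the rental income is assessable")))  # True
```

The same consideration applies to other short alternatives in `topic_patterns`, such as `home`, `cpi`, or `estate`.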

In [None]:
def classify_chunk_type(chunk_text: str) -> str:
    """Classify a chunk as rule, example, or reference."""
    if re.search(r'^Example\s+\d+', chunk_text, re.MULTILINE):
        return "example"
    elif "For more information" in chunk_text:
        return "reference"
    return "rule"


In [None]:
classify_chunk_type(text)

'rule'
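The whole sample is classified as `rule` even though it contains an `**EXAMPLE:**` block, because the pattern only recognizes numbered headings such as `Example 1` at the start of a line. Below is a sketch of a more permissive check, assuming headings may be wrapped in Markdown emphasis and may lack a number. `looks_like_example` is a hypothetical name.

```python
import re

# Optional non-word characters (e.g. '**', '- ') before the word 'example'
# at the start of a line; newlines are excluded so the match stays on one line.
EXAMPLE_HEADING = re.compile(r'^[^\w\n]*example\b', re.IGNORECASE | re.MULTILINE)

def looks_like_example(chunk_text: str) -> bool:
    """Return True when any line of the chunk starts with an 'Example' heading."""
    return bool(EXAMPLE_HEADING.search(chunk_text))

print(looks_like_example("**EXAMPLE:**\nProperty purchased: 15 June 1980"))
print(looks_like_example("For example, the land is exempt."))
```

A sentence that merely begins with "For example" is not flagged, since a word character precedes "example" on that line.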

In [None]:
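# Assumed defaults: the notebook never defines these two constants, so the
# values below are placeholders to tune for the embedding model in use.
CHUNK_SIZE = 1000     # maximum characters per chunk
CHUNK_OVERLAP = 200   # characters shared between consecutive chunks

def create_document_aware_chunks(
    documents: List[Document],
    chunk_size: int = CHUNK_SIZE,
    chunk_overlap: int = CHUNK_OVERLAP
) -> List[Document]: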
    """Create chunks using document-aware separators that respect ATO guide structure."""

    # Custom separators that respect ATO CGT Guide structure
    separators = [
        "\n\nExample ",           # Example sections - keep examples together
        "\nFor more information",  # Reference sections
        "\n\n",                   # Paragraph breaks
        "\n",                     # Line breaks
        ". ",                     # Sentences
        " ",                      # Words (last resort)
    ]

    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        separators=separators,
        keep_separator=True,
    )

    all_chunks = []
    chunk_index = 0

    for doc in documents:
        page_num = doc.metadata.get('page', 0) + 1
        page_content = doc.page_content

        text_chunks = splitter.split_text(page_content)

        for chunk_text in text_chunks:
            chunk_title = extract_title_from_page(chunk_text)
            chunk_topics = detect_cgt_topics(chunk_text)
            chunk_type = classify_chunk_type(chunk_text)

            metadata = {
                'source': 'ATO CGT Guide 2025',
                'page': page_num,
                'title': chunk_title,
                'topics': chunk_topics,
                'chunk_type': chunk_type,
                'chunk_index': chunk_index,
            }

            header = f"[Page {page_num} | {chunk_title}]"
            contextualized_text = f"{header}\n\n{chunk_text}"

            all_chunks.append(Document(
                page_content=contextualized_text,
                metadata=metadata
            ))
            chunk_index += 1

    print(f"  Created {len(all_chunks)} document-aware chunks")
    return all_chunks
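The ordering of `separators` matters: coarse boundaries are tried first, and finer ones are used only for pieces that are still too large. The pure-Python sketch below illustrates that priority idea only. It is not LangChain's implementation, it omits the merging of small pieces and the overlap that `RecursiveCharacterTextSplitter` applies, and `split_by_priority` is a hypothetical name.

```python
from typing import List

def split_by_priority(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the first separator, recursing into finer ones only for oversized pieces."""
    # Small enough (or nothing left to split on): keep the piece as it is.
    if len(text) <= chunk_size or not separators:
        return [text] if text.strip() else []

    first, finer = separators[0], separators[1:]
    if first not in text:
        # This separator does not occur; fall through to the next, finer one.
        return split_by_priority(text, finer, chunk_size)

    pieces: List[str] = []
    for part in text.split(first):
        pieces.extend(split_by_priority(part, finer, chunk_size))
    return pieces

sample = "Rule paragraph one.\n\nRule paragraph two is quite a bit longer than the first."
print(split_by_priority(sample, ["\n\n", ". ", " "], chunk_size=40))
```

Because this sketch never merges neighbouring pieces, an oversized paragraph breaks down into very small fragments. The real splitter avoids that by recombining fragments up to `chunk_size`.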

# 2. Note-Taking App

In [1]:
!pip install -U youtube-transcript-api transformers accelerate sentencepiece


In [2]:
from youtube_transcript_api import YouTubeTranscriptApi, TranscriptsDisabled, NoTranscriptFound
import re

def extract_video_id(url):
    """Extracts video ID from different YouTube URL formats."""
    # We use Regex to hunt for the 11-character ID after 'v=' or 'youtu.be/'
    match = re.search(r"(?:v=|youtu\.be/)([a-zA-Z0-9_-]{11})", url)
    return match.group(1) if match else None

def get_transcript(video_id):
    """Fetch transcript using the NEW API format."""
    try:
        api = YouTubeTranscriptApi()
        # The .fetch method grabs the subtitle object list
        transcript = api.fetch(video_id)
        # We join the list into a single long string of text
        return " ".join([t.text for t in transcript])

    except TranscriptsDisabled:
        return "Error: Transcripts are disabled for this video."
    except NoTranscriptFound:
        return "Error: No transcript found for this video."
    except Exception as e:
        return f"Error: {str(e)}"
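The regex in `extract_video_id` covers `watch?v=` and `youtu.be/` links but not paths such as `/shorts/` or `/embed/`, and it does not check the host. Below is a sketch of a stricter parser built on `urllib.parse`, assuming the URL includes a scheme (`https://...`). `parse_video_id` is a hypothetical alternative, not part of the notebook.

```python
import re
from typing import Optional
from urllib.parse import urlparse, parse_qs

_VIDEO_ID = re.compile(r'^[A-Za-z0-9_-]{11}$')

def parse_video_id(url: str) -> Optional[str]:
    """Return the 11-character video ID for common YouTube URL shapes, else None."""
    parsed = urlparse(url)
    host = (parsed.hostname or '').lower()
    candidate = ''

    if host == 'youtu.be':
        # Short links carry the ID as the first path segment.
        candidate = parsed.path.lstrip('/').split('/')[0]
    elif host == 'youtube.com' or host.endswith('.youtube.com'):
        if parsed.path == '/watch':
            candidate = parse_qs(parsed.query).get('v', [''])[0]
        else:
            parts = parsed.path.strip('/').split('/')
            if len(parts) >= 2 and parts[0] in {'shorts', 'embed', 'live'}:
                candidate = parts[1]

    return candidate if _VIDEO_ID.match(candidate) else None

print(parse_video_id('https://youtu.be/oUP96WnpOsI'))
print(parse_video_id('https://www.youtube.com/shorts/oUP96WnpOsI'))
```

Unlike the regex version, this rejects look-alike URLs on other domains, at the cost of also rejecting inputs pasted without a scheme.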

In [3]:
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Check if we have a GPU (CUDA) available to speed things up
device = "cuda" if torch.cuda.is_available() else "cpu"

model_name = "google/flan-t5-base"

# Load the tokenizer (translates text to numbers)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load the model (the neural network) and move it to the GPU/CPU
model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)

In [7]:
def summarize_chunk(text_chunk):
    """Summarize one transcript chunk with the FLAN-T5 model loaded above."""
    prompt = f"""Summarize the following text in detail.
Cover all the main points discussed in the video.
Write the summary in short, clear paragraphs — not just single sentences.
Expand on key ideas with brief explanations.
Avoid repetition and keep the language simple and easy to follow.

Text:
{text_chunk}"""

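    # Tokenize the prompt; anything beyond 1024 tokens is cut off by
    # truncation=True, so the end of an oversized chunk is never seen by the model.
    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        truncation=True,
        max_length=1024
    ).to(device)

    summary_ids = model.generate(
        **inputs,
        max_new_tokens=4096,    # Upper bound only; generation normally ends earlier at the end-of-sequence token
        num_beams=5,            # Beam search width; wider is slower but can improve quality
        length_penalty=1.2,     # Values above 1.0 favor longer beam-search outputs
        no_repeat_ngram_size=3, # Prevents any 3-token sequence from repeating
        early_stopping=True
    )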

    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

In [8]:
def chunk_text(text, chunk_size=3600):
    """Group sentences into chunks of roughly `chunk_size` characters."""
    sentences = text.split(". ")
    chunks, current_chunk = [], ""

    for sentence in sentences:
        # Check if adding the next sentence exceeds our limit
        if len(current_chunk) + len(sentence) < chunk_size:
            current_chunk += sentence + ". "
        else:
            # If full, seal the chunk and start a new one. The guard avoids
            # emitting an empty chunk when the very first sentence is oversized.
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + ". "

    if current_chunk:
        chunks.append(current_chunk.strip())

    return chunks
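`chunk_text` relies on `". "` to find sentence boundaries. A transcript with little or no punctuation would come back as one oversized chunk, which the tokenizer in `summarize_chunk` then truncates. Below is a sketch of a word-count fallback, assuming whitespace-separated words are an acceptable unit. `chunk_by_words` and its `max_words` default are hypothetical.

```python
from typing import List

def chunk_by_words(text: str, max_words: int = 600) -> List[str]:
    """Split text into consecutive chunks of at most `max_words` words."""
    words = text.split()
    # Step through the word list in windows of max_words.
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]

print(chunk_by_words("one two three four five", max_words=2))
# ['one two', 'three four', 'five']
```

One possible arrangement is to call `chunk_text` first and pass any chunk that is still longer than the character budget through `chunk_by_words`.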

In [9]:
def generate_video_notes(video_url):
    print(f"\n🎬 Processing video: {video_url}")

    video_id = extract_video_id(video_url)
    if not video_id:
        print("Invalid YouTube URL.")
        return

    print("🎧 Fetching transcript...")
    transcript = get_transcript(video_id)

    if transcript.startswith("Error"):
        print(transcript)
        return

    print("🔪 Chunking transcript...")
    chunks = chunk_text(transcript)
    print(f"   -> {len(chunks)} chunks created.")

    print("🧠 Generating AI notes...")
    notes = []

    # Loop through chunks and summarize each one
    for i, chunk in enumerate(chunks):
        print(f"   Summarizing chunk {i+1}/{len(chunks)}...")
        summary = summarize_chunk(chunk)
        notes.append(f"- {summary}")

    print("\n" + "="*50)
    print("📝 AI GENERATED NOTES")
    print("="*50)
    print("\n".join(notes))


if __name__ == "__main__":
    url = input("Paste YouTube URL: ")
    generate_video_notes(url)

Paste YouTube URL: https://youtu.be/oUP96WnpOsI

🎬 Processing video: https://youtu.be/oUP96WnpOsI
🎧 Fetching transcript...
🔪 Chunking transcript...
   -> 5 chunks created.
🧠 Generating AI notes...
   Summarizing chunk 1/5...
   Summarizing chunk 2/5...
   Summarizing chunk 3/5...
   Summarizing chunk 4/5...
   Summarizing chunk 5/5...

📝 AI GENERATED NOTES
- Most people think productivity in coding is about hours. It's not. Productivity is how much usable software you get per hour. That means features that actually work, bugs that stay dead, systems that don't collapse under load, and...
- I'm a big fan of learning like a detective instead of a spectator. I've learned a lot from this video.
- It's the best way to learn coding in 2026.
- Be ruthless with your learning.
- If you want to stop wasting time, three fixes change everything.
