# NLP Processing and Analysis with Universal Dependencies (UD)

**Project Overview:**
This notebook demonstrates comprehensive NLP analysis using Universal Dependencies (UD) treebanks.

**Steps covered:**
1. Corpus Selection and Data Loading
2. Data Preprocessing and Extraction
3. Corpus Statistics
4. PoS Tag Distribution and Visualization
5. Custom Sentence Processing
6. TF-IDF Vectorization
7. Similarity Analysis
8. Most Similar Sentence Pairs

## 1. Import Required Libraries

In [None]:
# System
import sys
import warnings
from pathlib import Path

warnings.filterwarnings('ignore')

# Put src at the front of sys.path so the project modules are importable.
# NOTE: the project's `statistics` module shares its name with the
# standard-library `statistics` module. With sys.path.append the stdlib
# module is found first and the import below fails. Inserting at position 0
# lets the project module win, assuming the stdlib module has not already
# been imported in this session; renaming the module would be more robust.
sys.path.insert(0, str(Path.cwd().parent / 'src'))

# Core libraries
import pandas as pd
import numpy as np
from tqdm import tqdm

# Custom modules
from ud_loader import load_conllu_file, extract_sentence_data, get_conllu_sample
from statistics import (compute_corpus_statistics, compute_pos_distribution,
                       create_statistics_summary, get_top_frequent_words,
                       get_top_frequent_lemmas)
from preprocessor import TextPreprocessor
from similarity import SimilarityAnalyzer
from visualizer import (plot_pos_distribution, plot_similarity_distribution,
                       plot_sentence_length_distribution, plot_top_frequent_words)

print('✓ All libraries imported successfully!')

## 2. Corpus Selection and Data Loading

**Selected Language**: Albanian (Shqip)

**Dataset**: UD_Albanian-TSA (Treebank of Standard Albanian)

**About Albanian**:
- Indo-European language that forms its own branch of the family
- Spoken by ~7-8 million people in Albania, Kosovo, and surrounding regions
- Rich morphology with cases, genders, and definite/indefinite forms
- File: `sq_tsa-ud-test.conllu`

**Understanding .conllu Format**:
Each sentence consists of:
- Metadata comment lines such as `# sent_id` and `# text`
- One line per token with 10 tab-separated columns: ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC
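
As a rough illustration of that layout, the sketch below splits one token line into the ten named fields using only the standard library. The sample line is invented for demonstration and is not taken from the treebank, and `parse_token_line` is a hypothetical helper, not part of `ud_loader`.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_token_line(line):
    """Split one tab-separated CoNLL-U token line into a field dict.

    Hypothetical helper for illustration; "_" marks an unspecified value.
    """
    values = line.rstrip("\n").split("\t")
    if len(values) != len(CONLLU_FIELDS):
        raise ValueError(f"expected 10 columns, got {len(values)}")
    return dict(zip(CONLLU_FIELDS, values))


# Invented example line (not from sq_tsa-ud-test.conllu).
sample_line = "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_"
token = parse_token_line(sample_line)
print(token["form"], token["lemma"], token["upos"])
```

All values stay as strings here; a real loader would typically convert ID and HEAD to integers and expand FEATS into a mapping.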

In [None]:
# Configuration
CONLLU_FILE = "../data/sq_tsa-ud-test.conllu"
LANGUAGE = "Albanian"

print(f"Language: {LANGUAGE}")
print(f"File: {CONLLU_FILE}")

In [None]:
# Load the corpus
print("Loading Albanian corpus...")
sentences = load_conllu_file(CONLLU_FILE)
print(f"✓ Loaded {len(sentences)} sentences from {LANGUAGE} corpus")

In [None]:
# Display sample sentence structure
print("\n=== Sample Albanian Sentence Structure ===")
print(f"\nSentence: {sentences[0].metadata.get('text', 'N/A')}\n")
print("Token Annotations:")
print("-" * 100)

sample_tokens = []
for token in sentences[0]:
    if isinstance(token['id'], int):
        sample_tokens.append({
            'ID': token['id'],
            'Form': token['form'],
            'Lemma': token['lemma'],
            'UPOS': token['upos'],
            'Features': str(token['feats'])[:30] if token['feats'] else None
        })

df_sample = pd.DataFrame(sample_tokens)
print(df_sample.to_string(index=False))
print("\n" + "-" * 100)

## 3. Data Extraction

Extract tokens, lemmas, and PoS tags from the Albanian corpus.
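
The extraction step can be pictured roughly as follows. This is a minimal sketch under stated assumptions: sentences are represented as plain lists of token dicts, and `flatten_sentences` is a hypothetical stand-in whose output keys mirror some of those this notebook reads from `corpus_data`. The actual `extract_sentence_data` in `ud_loader` may behave differently.

```python
def flatten_sentences(sentences):
    """Collect forms, lemmas and UPOS tags across all sentences.

    Tokens whose id is not an int (for example multiword ranges, which
    parsers often expose as tuples) are skipped, matching the
    isinstance check used elsewhere in this notebook.
    """
    data = {"all_tokens": [], "all_lemmas": [],
            "all_pos_tags": [], "sentence_tokens": []}
    for sent in sentences:
        forms = []
        for tok in sent:
            if not isinstance(tok["id"], int):
                continue
            forms.append(tok["form"])
            data["all_tokens"].append(tok["form"])
            data["all_lemmas"].append(tok["lemma"])
            data["all_pos_tags"].append(tok["upos"])
        data["sentence_tokens"].append(forms)
    return data


# Invented toy input; the second sentence starts with a multiword range.
toy = [
    [{"id": 1, "form": "Dogs", "lemma": "dog", "upos": "NOUN"},
     {"id": 2, "form": "bark", "lemma": "bark", "upos": "VERB"}],
    [{"id": (1, "-", 2), "form": "don't", "lemma": "_", "upos": "_"},
     {"id": 1, "form": "do", "lemma": "do", "upos": "AUX"},
     {"id": 2, "form": "n't", "lemma": "not", "upos": "PART"}],
]
toy_data = flatten_sentences(toy)
print(len(toy_data["all_tokens"]), toy_data["sentence_tokens"])
```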

In [None]:
# Extract all corpus data
print("Extracting data from corpus...")
corpus_data = extract_sentence_data(sentences)
print("✓ Data extraction complete!")
print(f"\nExtracted:")
print(f"  - {len(corpus_data['all_tokens'])} tokens")
print(f"  - {len(corpus_data['all_lemmas'])} lemmas")
print(f"  - {len(corpus_data['sentence_texts'])} sentences")

## 4. Corpus Statistics

Calculate key statistics about the Albanian corpus.
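
For orientation, the core quantities can be computed in a few lines. The sketch below is an assumption about what such statistics usually mean, not the implementation behind `compute_corpus_statistics`; `toy_corpus_statistics` is a hypothetical name, and the vocabulary is counted case-sensitively, which is one design choice among several.

```python
def toy_corpus_statistics(sentence_tokens):
    """Compute basic corpus statistics from a list of token lists."""
    tokens = [tok for sent in sentence_tokens for tok in sent]
    vocabulary = set(tokens)  # distinct word forms (types)
    num_sentences = len(sentence_tokens)
    num_tokens = len(tokens)
    return {
        "num_sentences": num_sentences,
        "num_tokens": num_tokens,
        "vocabulary_size": len(vocabulary),
        # Mean number of tokens per sentence.
        "avg_sent_length": num_tokens / num_sentences if num_sentences else 0.0,
        # Types divided by tokens; higher values mean less repetition.
        "type_token_ratio": len(vocabulary) / num_tokens if num_tokens else 0.0,
    }


toy_stats = toy_corpus_statistics([["a", "b", "a"], ["c"]])
print(toy_stats)
```

The type-token ratio depends strongly on corpus size, so it is best compared only between samples of similar length.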

In [None]:
# Compute statistics
stats = compute_corpus_statistics(corpus_data)
stats_df = create_statistics_summary(stats)

print("\n" + "="*80)
print(f"         CORPUS STATISTICS - {LANGUAGE}")
print("="*80)
print(stats_df.to_string(index=False))
print("="*80)

# Save statistics
stats_df.to_csv('../reports/corpus_statistics.csv', index=False)
print("\n✓ Statistics saved to reports/corpus_statistics.csv")

## 5. Part-of-Speech Tag Distribution

Calculate and visualize PoS tag frequencies in Albanian.
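
A frequency table of this kind can be sketched with `collections.Counter`. The function name `toy_pos_distribution` is hypothetical, and the column names are taken from how `pos_df` is read later in this notebook; how `compute_pos_distribution` builds its DataFrame is not shown here, so treat this as an approximation.

```python
from collections import Counter


def toy_pos_distribution(pos_tags):
    """Return one row per tag with its count and share, most frequent first."""
    counts = Counter(pos_tags)
    total = sum(counts.values())
    return [
        {"PoS Tag": tag, "Frequency": freq, "Percentage": 100 * freq / total}
        for tag, freq in counts.most_common()
    ]


# Invented tag sequence for illustration.
rows = toy_pos_distribution(["NOUN", "VERB", "NOUN", "DET"])
for row in rows:
    print(f"{row['PoS Tag']:6s} {row['Frequency']:3d} {row['Percentage']:5.1f}%")
```

Passing `rows` to `pd.DataFrame` would give a table with the same three columns.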

In [None]:
# Calculate PoS distribution
pos_df = compute_pos_distribution(corpus_data['all_pos_tags'])

print("\n=== Albanian Part-of-Speech Tag Distribution ===\n")
print(pos_df.to_string(index=False))

# Save PoS distribution
pos_df.to_csv('../reports/pos_distribution.csv', index=False)
print("\n✓ PoS distribution saved to reports/pos_distribution.csv")

In [None]:
# Visualize PoS distribution
plot_pos_distribution(pos_df, language=LANGUAGE, save_path='../outputs/pos_distribution.png')
print("✓ PoS distribution visualization saved to outputs/pos_distribution.png")

## 6. Sentence Length Distribution

In [None]:
# Visualize sentence length distribution
sent_lengths = [len(sent) for sent in corpus_data['sentence_tokens']]
plot_sentence_length_distribution(sent_lengths, language=LANGUAGE,
                                  save_path='../outputs/sentence_length_distribution.png')
print("✓ Sentence length distribution saved to outputs/sentence_length_distribution.png")

## 7. Top Frequent Words and Lemmas

In [None]:
# Top frequent words
top_words = get_top_frequent_words(corpus_data['all_tokens'], top_n=20)
print("\n=== Top 10 Most Frequent Words in Albanian ===\n")
print(top_words.head(10).to_string(index=False))

# Top frequent lemmas
top_lemmas = get_top_frequent_lemmas(corpus_data['all_lemmas'], top_n=20)
print("\n\n=== Top 10 Most Frequent Lemmas in Albanian ===\n")
print(top_lemmas.head(10).to_string(index=False))

# Save results
top_words.to_csv('../reports/top_frequent_words.csv', index=False)
top_lemmas.to_csv('../reports/top_frequent_lemmas.csv', index=False)
print("\n✓ Frequency data saved to reports/")

## 8. Process New Sentences

Tokenize and lemmatize new sentences with `TextPreprocessor`. The examples are in English because, as noted in the code, the NLTK lemmatizer works best with English.
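
For Albanian input, one possible alternative (an assumption, not what `TextPreprocessor` does) is to tokenize with a regular expression and look up lemmas in a form-to-lemma table built from the treebank's own annotations. The helper names below are hypothetical and the toy data is invented; in this notebook the table could be built from `corpus_data['all_tokens']` and `corpus_data['all_lemmas']`.

```python
import re
from collections import Counter, defaultdict


def build_lemma_lookup(forms, lemmas):
    """Map each lowercased form to the lemma it is most often annotated with."""
    votes = defaultdict(Counter)
    for form, lemma in zip(forms, lemmas):
        votes[form.lower()][lemma] += 1
    return {form: c.most_common(1)[0][0] for form, c in votes.items()}


def tokenize(text):
    """Split text into word and punctuation tokens (Unicode-aware)."""
    return re.findall(r"\w+|[^\w\s]", text)


def lemmatize(text, lookup):
    """Return (token, lemma) pairs; unseen forms fall back to themselves."""
    return [(tok, lookup.get(tok.lower(), tok)) for tok in tokenize(text)]


# Placeholder annotations standing in for treebank forms and lemmas.
lookup = build_lemma_lookup(["Dogs", "dogs", "bark"], ["dog", "dog", "bark"])
pairs = lemmatize("Dogs bark loudly.", lookup)
print(pairs)
```

A lookup like this only covers forms seen in the treebank, so coverage on new text is limited by the size of the annotated corpus.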

In [None]:
# Initialize preprocessor
preprocessor = TextPreprocessor()

# Test with sample sentences (using English for demonstration since NLTK lemmatizer works best with English)
test_sentences = [
    "Natural language processing is a fascinating field of study.",
    "The students are learning about computational linguistics today.",
    "Machine learning algorithms can process large amounts of text data."
]

print("\n=== Processing New Sentences ===\n")
print("Demonstrating tokenization and lemmatization:\n")

for i, sent in enumerate(test_sentences, 1):
    result = preprocessor.process_sentence(sent)
    print(f"Example {i}:")
    print(f"  Original: {result['original']}")
    print(f"  Tokens:   {result['tokens'][:8]}...")
    print(f"  Lemmas:   {result['processed'][:8]}...")
    print(f"  Method:   {result['method']}")
    print()

In [None]:
# Compare Lemmatization vs Stemming
comparison_sentence = "The runners are running faster than they were running yesterday."

lemma_result = preprocessor.process_sentence(comparison_sentence, use_stemming=False)
stem_result = preprocessor.process_sentence(comparison_sentence, use_stemming=True)

comparison_df = pd.DataFrame({
    'Original': lemma_result['tokens'],
    'Lemmatized': lemma_result['processed'],
    'Stemmed': stem_result['processed']
})

print("\n=== Lemmatization vs Stemming Comparison ===")
print(f"\nSentence: {comparison_sentence}\n")
print(comparison_df.to_string(index=False))

## 9. TF-IDF Vectorization

Convert Albanian sentences to TF-IDF vectors.
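
To make the weighting concrete, here is a from-scratch sketch of TF-IDF with L2 normalisation. It uses raw term counts and the smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, which is the default formulation in scikit-learn's `TfidfVectorizer`. Whether `SimilarityAnalyzer` wraps that class is an assumption. The sketch also simplifies things: it tokenizes on whitespace and uses unigrams only, unlike the `ngram_range=(1, 2)` setting used in this notebook.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """Return (vocabulary, rows) where rows hold L2-normalised TF-IDF weights."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for doc in tokenized for term in doc})
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in tokenized for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocabulary]
        norm = math.sqrt(sum(w * w for w in weights))
        rows.append([w / norm if norm else 0.0 for w in weights])
    return vocabulary, rows


# Invented toy documents.
docs = ["the cat sat", "the dog sat", "a bird flew"]
vocab, matrix = tfidf_vectors(docs)
print(vocab)
print([round(w, 3) for w in matrix[0]])
```

Terms that occur in fewer documents receive a higher weight, so in the first toy document "cat" outweighs "the".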

In [None]:
# Use all available sentences for TF-IDF
subset_sentences = corpus_data['sentence_texts']
SUBSET_SIZE = len(subset_sentences)

print(f"Using {SUBSET_SIZE} Albanian sentences for TF-IDF analysis...")

# Create TF-IDF analyzer
analyzer = SimilarityAnalyzer(
    max_features=1000,  # Adjusted for smaller corpus
    ngram_range=(1, 2),
    stop_words=None,  # No Albanian stop words in sklearn
    min_df=1  # Lower threshold for smaller corpus
)

# Fit and transform
tfidf_matrix = analyzer.fit_transform(subset_sentences)

print(f"\n✓ TF-IDF matrix shape: {tfidf_matrix.shape}")
print(f"  - {tfidf_matrix.shape[0]} sentences")
print(f"  - {tfidf_matrix.shape[1]} features")
print(f"  - Sparsity: {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")

In [None]:
# Display TF-IDF scores for a sample sentence
sample_idx = 5
sample_tfidf = analyzer.get_sentence_tfidf(sample_idx, top_n=10)

print(f"\n=== TF-IDF Analysis for Albanian Sentence {sample_idx} ===")
print(f"\nSentence: {subset_sentences[sample_idx][:100]}...")
print(f"\nTop 10 TF-IDF weighted terms:")
print(sample_tfidf.to_string(index=False))

## 10. Similarity Analysis

Calculate cosine similarity and Euclidean distance between Albanian sentences.
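
Both measures are short to write out by hand. The sketch below uses plain Python lists with invented vectors; the function names are illustrative and unrelated to the methods on `SimilarityAnalyzer`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Two invented unit-length vectors.
u = [0.6, 0.8, 0.0]
v = [0.0, 0.6, 0.8]
cos_uv = cosine_similarity(u, v)
dist_uv = euclidean_distance(u, v)
print(round(cos_uv, 4), round(dist_uv, 4))
```

For unit-length vectors the two are linked by `distance² = 2 − 2 · cosine`, so if the TF-IDF rows are L2-normalised (an assumption about the analyzer's settings), ranking pairs by highest cosine similarity or by lowest Euclidean distance gives the same order.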

In [None]:
# Compute similarity matrices
print("Computing similarity matrices...")
cosine_sim = analyzer.compute_cosine_similarity()
euclidean_dist = analyzer.compute_euclidean_distance()

print("✓ Similarity matrices computed")
print(f"  Matrix shape: {cosine_sim.shape}")

In [None]:
# Example comparisons
print("\n=== Example Sentence Comparisons ===\n")

example_pairs = [(5, 10), (15, 20), (25, 30)]

for idx1, idx2 in example_pairs:
    if idx1 < len(subset_sentences) and idx2 < len(subset_sentences):
        comparison = analyzer.compare_sentences(idx1, idx2, cosine_sim, euclidean_dist)
        print(f"Sentence {idx1} vs Sentence {idx2}:")
        print(f"  [{idx1}] {subset_sentences[idx1][:60]}...")
        print(f"  [{idx2}] {subset_sentences[idx2][:60]}...")
        print(f"  Cosine Similarity:    {comparison['cosine_similarity']}")
        print(f"  Euclidean Distance:   {comparison['euclidean_distance']}")
        print()

In [None]:
# Get similarity statistics
sim_stats = analyzer.get_similarity_statistics(cosine_sim)

print("\n=== Similarity Statistics ===\n")
print(f"Mean Cosine Similarity:    {sim_stats['mean']:.4f}")
print(f"Std Deviation:             {sim_stats['std']:.4f}")
print(f"Min Similarity:            {sim_stats['min']:.4f}")
print(f"Max Similarity:            {sim_stats['max']:.4f}")
print(f"Median:                    {sim_stats['median']:.4f}")
print(f"Q1 (25th percentile):      {sim_stats['q1']:.4f}")
print(f"Q3 (75th percentile):      {sim_stats['q3']:.4f}")

## 11. Most Similar Sentence Pairs

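Find the Albanian sentence pairs with the highest TF-IDF cosine similarity. Because TF-IDF compares shared words and n-grams, a high score indicates lexical overlap rather than similarity of meaning.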
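
One way to pick the top pairs from a similarity matrix is to scan the upper triangle, so each pair is considered once and self-comparisons are skipped. The sketch below is a hypothetical stand-in for `find_most_similar_pairs`; it returns tuples in the `(idx1, idx2, similarity, sent1, sent2)` shape that the notebook unpacks, and the toy matrix is invented.

```python
import heapq


def top_similar_pairs(sentences, sim_matrix, top_n=10):
    """Return the top_n (i, j, similarity, sent_i, sent_j) tuples with i < j."""
    n = len(sentences)
    candidates = (
        (i, j, sim_matrix[i][j], sentences[i], sentences[j])
        for i in range(n)
        for j in range(i + 1, n)  # upper triangle only
    )
    return heapq.nlargest(top_n, candidates, key=lambda pair: pair[2])


# Invented toy data: a symmetric matrix with ones on the diagonal.
toy_sentences = ["s0", "s1", "s2"]
toy_sim = [
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.5],
    [0.9, 0.5, 1.0],
]
best = top_similar_pairs(toy_sentences, toy_sim, top_n=2)
print(best)
```

Pairs with a similarity of exactly 1.0 are often verbatim duplicates in the corpus, so it can be worth inspecting them separately.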

In [None]:
# Find top 10 most similar pairs
print("Finding most similar Albanian sentence pairs...\n")
most_similar = analyzer.find_most_similar_pairs(subset_sentences, cosine_sim, top_n=10)

print("="*100)
print("TOP 10 MOST SIMILAR ALBANIAN SENTENCE PAIRS")
print("="*100)

for rank, (idx1, idx2, similarity, sent1, sent2) in enumerate(most_similar, 1):
    print(f"\nRank {rank} - Similarity: {similarity:.4f}")
    print(f"  [{idx1}] {sent1[:80]}...")
    print(f"  [{idx2}] {sent2[:80]}...")
    print("-" * 100)

In [None]:
# Save similarity results
similarity_results = []
for rank, (idx1, idx2, similarity, sent1, sent2) in enumerate(most_similar, 1):
    similarity_results.append({
        'Rank': rank,
        'Sentence_1_Index': idx1,
        'Sentence_2_Index': idx2,
        'Cosine_Similarity': round(similarity, 4),
        'Sentence_1': sent1[:100],
        'Sentence_2': sent2[:100]
    })

similarity_df = pd.DataFrame(similarity_results)
similarity_df.to_csv('../reports/most_similar_pairs.csv', index=False)
print("\n✓ Similarity results saved to reports/most_similar_pairs.csv")

## 12. Visualize Similarity Distribution

In [None]:
# Get upper triangle of similarity matrix (excluding diagonal)
upper_triangle = np.triu_indices(cosine_sim.shape[0], k=1)
similarity_values = cosine_sim[upper_triangle]

# Visualize
plot_similarity_distribution(similarity_values, save_path='../outputs/similarity_distribution.png')
print("✓ Similarity distribution visualization saved to outputs/similarity_distribution.png")

## 13. Final Summary

Summary of the complete analysis of the Albanian Universal Dependencies corpus.

In [None]:
print("\n" + "="*100)
print("                    FINAL ANALYSIS SUMMARY - ALBANIAN CORPUS")
print("="*100)

print(f"\n1. CORPUS INFORMATION")
print(f"   Language:              {LANGUAGE}")
print(f"   Treebank:              UD_Albanian-TSA")
print(f"   File:                  {CONLLU_FILE}")

print(f"\n2. CORPUS STATISTICS")
print(f"   Total Sentences:       {stats['num_sentences']:,}")
print(f"   Total Tokens:          {stats['num_tokens']:,}")
print(f"   Vocabulary Size:       {stats['vocabulary_size']:,}")
print(f"   Unique PoS Tags:       {stats['unique_pos_tags']}")
print(f"   Avg Sentence Length:   {stats['avg_sent_length']:.2f} tokens")
print(f"   Type-Token Ratio:      {stats['type_token_ratio']:.4f}")

print(f"\n3. POS TAG DISTRIBUTION")
print(f"   Most Common PoS Tags:")
for _, row in pos_df.head(5).iterrows():
    print(f"   - {row['PoS Tag']:10s}: {row['Frequency']:5,} ({row['Percentage']:5.1f}%)")

print(f"\n4. TF-IDF ANALYSIS")
print(f"   Sentences Analyzed:    {tfidf_matrix.shape[0]}")
print(f"   Feature Dimensions:    {tfidf_matrix.shape[1]}")
print(f"   Matrix Sparsity:       {(1 - tfidf_matrix.nnz / (tfidf_matrix.shape[0] * tfidf_matrix.shape[1])) * 100:.2f}%")

print(f"\n5. SIMILARITY ANALYSIS")
print(f"   Mean Similarity:       {sim_stats['mean']:.4f}")
print(f"   Std Deviation:         {sim_stats['std']:.4f}")
print(f"   Similarity Range:      [{sim_stats['min']:.4f}, {sim_stats['max']:.4f}]")

print(f"\n6. GENERATED FILES")
print(f"   Reports (CSV):")
print(f"   ✓ reports/corpus_statistics.csv")
print(f"   ✓ reports/pos_distribution.csv")
print(f"   ✓ reports/top_frequent_words.csv")
print(f"   ✓ reports/top_frequent_lemmas.csv")
print(f"   ✓ reports/most_similar_pairs.csv")
print(f"\n   Visualizations (PNG):")
print(f"   ✓ outputs/pos_distribution.png")
print(f"   ✓ outputs/sentence_length_distribution.png")
print(f"   ✓ outputs/similarity_distribution.png")

print("\n" + "="*100)
print("✓ ANALYSIS COMPLETE! All results saved.")
print("="*100)
print("\nUse these outputs for your final report:")
print("1. Language description: Albanian (Shqip) - Indo-European")
print("2. Processing steps: Documented in this notebook")
print("3. Statistics: Available in CSV reports")
print("4. Code: This reproducible Jupyter notebook")
print("5. Screenshots: Capture visualizations and console output")