# Wikilangs Tutorial

This notebook shows how to use the `wikilangs` package to load and work with language models pre-trained on Wikipedia data.

## Features Covered

1. Tokenizers (BPE)
2. N-gram Models
3. Markov Chains
4. Embedding Models
5. Vocabularies


In [None]:
# Install the package (if not already installed)
# !pip install wikilangs

# Import the modules
from wikilangs import tokenizer, ngram, markov, vocabulary
import warnings

# Silence all warnings to keep the notebook output readable
warnings.filterwarnings('ignore')

## 1. Tokenizers

BPE tokenizers trained on Wikipedia data for different languages.
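
To make the idea behind BPE concrete, here is a minimal, self-contained sketch of how merges can be learned from a tiny word list and then applied to a new word. It is a toy illustration only, not how wikilangs trains or stores its tokenizers; the function names `learn_bpe_merges`, `apply_merge`, and `bpe_tokenize` are hypothetical and exist only for this example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_bpe_merges(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy version)."""
    # Each word starts as a sequence of single characters.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pair_counts[(a, b)] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic.
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        # Rewrite the vocabulary with the chosen pair merged.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(apply_merge(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def bpe_tokenize(word, merges):
    """Split `word` into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


merges = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)
print(bpe_tokenize("slower", merges))
```

A real tokenizer with a vocabulary of 16,000 entries follows the same principle at a much larger scale: frequent character sequences become single tokens, and rare words are split into smaller pieces.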

In [None]:
# Create a tokenizer for English
tok = tokenizer(date='20251201', lang='en', vocab_size=16000)

# Tokenize some text
text = "This is a sample sentence for tokenization."
tokens = tok.tokenize(text)
token_ids = tok.encode(text)

print(f"Original text: {text}")
print(f"Tokens: {tokens}")
print(f"Token IDs: {token_ids}")
print(f"Decoded text: {tok.decode(token_ids)}")

## 2. N-gram Models

N-gram language models for text scoring and next token prediction.
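
As a rough intuition for what scoring and next-token prediction mean, the sketch below builds a tiny trigram model over whitespace-separated words with add-one smoothing. It is an assumption-laden toy: the class `ToyNgram`, its smoothing scheme, and its log-probability score are illustrative choices, and the wikilangs models may tokenize, smooth, and score differently.

```python
import math
from collections import Counter, defaultdict


class ToyNgram:
    """A tiny word-level n-gram model (n >= 2) with add-one smoothing."""

    def __init__(self, sentences, n=3):
        self.n = n
        self.context_counts = defaultdict(Counter)
        self.vocab = set()
        for sentence in sentences:
            tokens = self._pad(sentence.split())
            self.vocab.update(tokens)
            # Count which token follows each (n - 1)-token context.
            for i in range(n - 1, len(tokens)):
                context = tuple(tokens[i - n + 1:i])
                self.context_counts[context][tokens[i]] += 1

    def _pad(self, tokens):
        # Start and end markers let the model learn sentence boundaries.
        return ["<s>"] * (self.n - 1) + tokens + ["</s>"]

    def prob(self, context, token):
        counts = self.context_counts.get(tuple(context), Counter())
        return (counts[token] + 1) / (sum(counts.values()) + len(self.vocab))

    def score(self, text):
        """Return the log-probability of `text` (closer to 0 means more likely)."""
        tokens = self._pad(text.split())
        return sum(
            math.log(self.prob(tokens[i - self.n + 1:i], tokens[i]))
            for i in range(self.n - 1, len(tokens))
        )

    def predict_next(self, context, top_k=5):
        """Return the `top_k` most frequent continuations of `context` with counts."""
        padded = ["<s>"] * (self.n - 1) + context.split()
        key = tuple(padded[-(self.n - 1):])
        return self.context_counts.get(key, Counter()).most_common(top_k)


model = ToyNgram(["this is a test", "this is a sample", "this is a test"])
print(model.score("this is a test"))
print(model.score("test a is this"))
print(model.predict_next("this is a", top_k=2))
```

In this toy model, word orders seen during training receive a higher (less negative) score than scrambled ones.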

In [None]:
# Create a 3-gram model for English
ng = ngram(date='20251201', lang='en', gram_size=3)

# Score a text
text = "This is a sample sentence."
score = ng.score(text)

print(f"Text: {text}")
print(f"Score: {score}")

# Predict next token
context = "This is a"
predictions = ng.predict_next(context, top_k=5)

print(f"Context: {context}")
print(f"Predictions: {predictions}")

## 3. Markov Chains

Markov chain models for text generation with configurable depth.
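
The `depth` parameter can be read as the number of previous tokens that determine the next one. The sketch below illustrates this with a small word-level chain; it is a simplified toy under that assumption, with hypothetical helper names `build_chain` and `generate`, and does not reflect how wikilangs stores or samples from its models.

```python
import random
from collections import defaultdict


def build_chain(text, depth=2):
    """Map each run of `depth` consecutive words to the words that followed it."""
    tokens = text.split()
    chain = defaultdict(list)
    for i in range(len(tokens) - depth):
        state = tuple(tokens[i:i + depth])
        chain[state].append(tokens[i + depth])
    return dict(chain)


def generate(chain, length=50, seed=None):
    """Generate up to `length` words by repeatedly sampling a follower of the current state."""
    if not chain:
        return ""
    rng = random.Random(seed)
    # Start from a randomly chosen state; sorting keeps seeded runs reproducible.
    state = rng.choice(sorted(chain))
    depth = len(state)
    output = list(state)
    while len(output) < length:
        followers = chain.get(tuple(output[-depth:]))
        if not followers:
            break  # Dead end: this state was never followed by anything.
        output.append(rng.choice(followers))
    return " ".join(output[:length])


corpus = "the cat sat on the mat and the cat ran to the mat"
chain = build_chain(corpus, depth=2)
print(generate(chain, length=8, seed=0))
```

A larger depth makes the output follow the training text more closely, while a smaller depth produces more varied but less coherent text.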

In [None]:
# Create a Markov chain for English
mc = markov(date='20251201', lang='en', depth=2)

# Generate text
generated_text = mc.generate(length=50)

print(f"Generated text: {generated_text}")

## 4. Embedding Models

Position-aware word embedding models, available in 32, 64, and 128 dimensions.

In [None]:
from wikilangs import tokenizer, embeddings

# Date defaults to 'latest'
tok = tokenizer(lang='ary')
emb = embeddings(lang='ary')

print(tok.tokenize("مرحبا"))
print(emb.embed_sentence("مرحبا بالعالم", method='rope'))

# dimension defaults to 32; pass dimension=64 or dimension=128 for larger vectors
emb64 = embeddings(lang='ary', dimension=64)
print(emb64.embed_sentence("مرحبا بالعالم", method='rope'))

emb128 = embeddings(lang='ary', dimension=128)
print(emb128.embed_sentence("مرحبا بالعالم", method='rope'))

## 5. Vocabularies

Comprehensive word dictionaries with frequency information using [vocabulous](https://github.com/omarkamali/vocabulous).
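
To show what frequency and prefix lookups involve, here is a small sketch that counts words in a text and finds prefix matches with binary search over a sorted word list. The class `ToyVocabulary` is hypothetical; its method names mirror the calls used in this notebook for readability, but the implementation is an independent toy and says nothing about how wikilangs or vocabulous work internally.

```python
import bisect
from collections import Counter


class ToyVocabulary:
    """A minimal word-frequency table with prefix lookup."""

    def __init__(self, text):
        # Lowercase and split on whitespace; real vocabularies need proper tokenization.
        self.freq = Counter(text.lower().split())
        self.sorted_words = sorted(self.freq)

    def get_frequency(self, word):
        return self.freq.get(word, 0)

    def get_words_with_prefix(self, prefix, top_k=5):
        # In a sorted list, all words sharing a prefix sit next to each other,
        # so binary search finds the first candidate directly.
        start = bisect.bisect_left(self.sorted_words, prefix)
        matches = []
        for word in self.sorted_words[start:]:
            if not word.startswith(prefix):
                break
            matches.append(word)
        # Most frequent first; alphabetical order breaks ties.
        matches.sort(key=lambda w: (-self.freq[w], w))
        return matches[:top_k]


toy_vocab = ToyVocabulary("example text with extra example words and an exit")
print(toy_vocab.get_frequency("example"))
print(toy_vocab.get_words_with_prefix("ex", top_k=2))
```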

In [None]:
# Create a vocabulary for English
vocab = vocabulary(date='20251201', lang='en')

# Look up a word
word = "example"
word_info = vocab.lookup(word)
frequency = vocab.get_frequency(word)

print(f"Word: {word}")
print(f"Information: {word_info}")
print(f"Frequency: {frequency}")

# Get similar words
similar = vocab.get_similar_words(word, top_k=5)
print(f"Similar words: {similar}")

# Get words with prefix
prefixed = vocab.get_words_with_prefix("ex", top_k=5)
print(f"Words with prefix 'ex': {prefixed}")

## Working with Different Languages

The wikilangs package supports more than 100 Wikipedia languages. The example below loads the French models and prints an error message if any of them fails to load.

In [None]:
# Create models for French
try:
    fr_tok = tokenizer(date='20251201', lang='fr', vocab_size=16000)
    fr_ng = ngram(date='20251201', lang='fr', gram_size=3)
    fr_mc = markov(date='20251201', lang='fr', depth=2)
    fr_vocab = vocabulary(date='20251201', lang='fr')
    
    print("French models loaded successfully!")
    
    # Example with French tokenizer
    fr_text = "Ceci est une phrase d'exemple."
    fr_tokens = fr_tok.tokenize(fr_text)
    print(f"French text: {fr_text}")
    print(f"Tokens: {fr_tokens}")
except Exception as e:
    print(f"Failed to load French models: {e}")

## Conclusion

The wikilangs package provides easy access to language models pre-trained on Wikipedia data.
You can use these models for NLP tasks such as tokenization, text scoring, text generation, sentence embedding, and vocabulary lookup.

For more information, check out the [documentation](https://github.com/omarkamali/wikilangs).