## More examples of using spaCy with Hindi text

These examples cover a variety of NLP tasks like tokenization, entity recognition (if supported), custom tokenization, and handling different attributes.

### Explanation and Integration

Basic Tokenization shows how spaCy tokenizes Hindi text into individual words and punctuation marks.

Token Attributes demonstrates how to access different properties of tokens, such as is_alpha (whether the token is alphabetic), like_num (whether the token resembles a number), and is_currency (whether the token is a currency symbol).

Custom Tokenization is particularly useful for handling idiomatic expressions or compound words in Hindi.

Currency Detection examples illustrate how to extract and work with financial data in Hindi text.

Sentence Segmentation splits Hindi text into sentences, which can be important for further processing like summarization or dialogue systems.

The Stop Words example shows how to remove common words that might not be useful for text analysis.

POS Tagging and Entity Recognition are powerful features for understanding the grammatical structure and identifying key entities in the text, though they may require trained models.

Pattern Matching is useful for extracting specific phrases or patterns from Hindi text.

Together, these examples give an overview of what spaCy can do with Hindi text using a blank pipeline: tokenization, custom token handling, and pattern matching.

### 1. Basic Tokenization in Hindi


In [45]:
import spacy

# Create a blank Hindi language object
nlp = spacy.blank("hi")

# Tokenize a Hindi sentence
doc = nlp("यहां गर्मी बहुत हो रही है, लेकिन बारिश की कोई उम्मीद नहीं है।")

# Print tokens
print("Tokens in the sentence:")
for token in doc:
    print(token)


Tokens in the sentence:
यहां
गर्मी
बहुत
हो
रही
है
,
लेकिन
बारिश
की
कोई
उम्मीद
नहीं
है
।


### 2. Token Attributes in Hindi


In [46]:
doc = nlp("राम ने 5000 ₹ दिए, जबकि श्याम ने 3000 ₹ उधार लिए।")

print("\nToken attributes:")
for token in doc:
    print(f"Token: {token.text}, is_alpha: {token.is_alpha}, like_num: {token.like_num}, is_currency: {token.is_currency}")



Token attributes:
Token: राम, is_alpha: False, like_num: False, is_currency: False
Token: ने, is_alpha: False, like_num: False, is_currency: False
Token: 5000, is_alpha: False, like_num: True, is_currency: False
Token: ₹, is_alpha: False, like_num: False, is_currency: True
Token: दिए, is_alpha: False, like_num: False, is_currency: False
Token: ,, is_alpha: False, like_num: False, is_currency: False
Token: जबकि, is_alpha: False, like_num: False, is_currency: False
Token: श्याम, is_alpha: False, like_num: False, is_currency: False
Token: ने, is_alpha: False, like_num: False, is_currency: False
Token: 3000, is_alpha: False, like_num: True, is_currency: False
Token: ₹, is_alpha: False, like_num: False, is_currency: True
Token: उधार, is_alpha: False, like_num: False, is_currency: False
Token: लिए, is_alpha: False, like_num: False, is_currency: False
Token: ।, is_alpha: False, like_num: False, is_currency: False
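In the output above, is_alpha is False even for ordinary words such as राम. spaCy's is_alpha follows Python's str.isalpha(), and Devanagari vowel signs (matras) and the virama are combining marks rather than letters. A minimal sketch of a more permissive check is shown below; the helper name is_word_like is hypothetical and not part of spaCy.

```python
import unicodedata


def is_word_like(text: str) -> bool:
    """Return True if text has at least one letter and only letters or combining marks."""
    categories = [unicodedata.category(ch) for ch in text]
    has_letter = any(cat.startswith("L") for cat in categories)
    only_letters_and_marks = all(cat[0] in ("L", "M") for cat in categories)
    return has_letter and only_letters_and_marks


# Compare Python's built-in check with the mark-aware helper
for word in ["राम", "श्याम", "5000", "₹", "।"]:
    print(word, word.isalpha(), is_word_like(word))
```

The helper accepts letters from any script, so it is a rough filter rather than a Devanagari-specific test.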


### 3. Custom Tokenization in Hindi


1. Pre-Processing the Text

Before passing the text to the tokenizer, you can replace the joined form "भैयाजी" with "भैया जी", so that the two words are separated by a space.

This way, the tokenizer will naturally split them on whitespace without needing a special case.

In [47]:
import spacy

# Create a blank Hindi language object
nlp = spacy.blank("hi")

# Pre-process the text to separate the joined form "भैयाजी" into "भैया जी"
text = "भैयाजी नमस्ते!"
text = text.replace("भैयाजी", "भैया जी")

doc = nlp(text)
tokens = [token.text for token in doc]
print("Tokens after pre-processing:", tokens)


Tokens after pre-processing: ['भैया', 'जी', 'नमस्ते', '!']


2. Custom Tokenizer with Infixes

Modify the tokenizer's infix rules so that the joined form "भैयाजी" is split inside the token, while keeping the default prefix and suffix rules.

In [48]:
import spacy
from spacy.util import compile_infix_regex

# Create a blank Hindi language object
nlp = spacy.blank("hi")

# Add a zero-width infix rule that splits between "भैया" and "जी"
infixes = list(nlp.Defaults.infixes) + [r"(?<=भैया)(?=जी)"]
infix_re = compile_infix_regex(infixes)
# Replace only the infix rule; building a new Tokenizer with nothing but
# infix_finditer would drop the prefix and suffix rules and leave "नमस्ते!" unsplit
nlp.tokenizer.infix_finditer = infix_re.finditer

doc = nlp("भैयाजी नमस्ते!")
tokens = [token.text for token in doc]
print("Tokens after custom tokenizer:", tokens)


Tokens after custom tokenizer: ['भैया', 'जी', 'नमस्ते', '!']


3. Manual Tokenization for Specific Cases

Manually split tokens for specific joined forms after tokenization.

In [49]:
import spacy

# Create a blank Hindi language object
nlp = spacy.blank("hi")

doc = nlp("भैयाजी नमस्ते!")
tokens = []

for token in doc:
    # A token never contains whitespace, so test for the joined form "भैयाजी"
    if token.text == "भैयाजी":
        tokens.extend(["भैया", "जी"])
    else:
        tokens.append(token.text)

print("Tokens after manual processing:", tokens)


Tokens after manual processing: ['भैया', 'जी', 'नमस्ते', '!']


The code below does not work. It raises the error

[E997] Tokenizer special cases are not allowed to modify the text

because the special case would modify the original text, which spaCy's tokenizer does not allow.

#### Problem

In the special case being added:

nlp.tokenizer.add_special_case("भैया जी", [{ORTH: "भैया"}, {ORTH: "जी"}])

spaCy requires the ORTH values of a special case to concatenate to exactly the original string.

Here "भैया" + "जी" gives "भैयाजी", which drops the space in "भैया जी", so the special case would change the text and the error is raised.


In [50]:
#from spacy.symbols import ORTH

# Customizing the tokenizer to split "भैया जी" into "भैया" and "जी"
#nlp.tokenizer.add_special_case("भैया जी", [{ORTH: "भैया"}, {ORTH: "जी"}])

#doc = nlp("भैया जी नमस्ते!")
#tokens = [token.text for token in doc]
#print("\nCustom tokenizer tokens:", tokens)
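A special case is accepted when its ORTH values concatenate to exactly the string being registered. The sketch below assumes the input uses the joined spelling "भैयाजी" (no space), which is an assumption made for illustration and not the string used in the failing call above.

```python
import spacy
from spacy.symbols import ORTH

# Create a blank Hindi language object
nlp = spacy.blank("hi")

# Assumption: the text contains the joined form "भैयाजी".
# "भैया" + "जी" == "भैयाजी", so the special case does not modify the text.
nlp.tokenizer.add_special_case("भैयाजी", [{ORTH: "भैया"}, {ORTH: "जी"}])

doc = nlp("भैयाजी नमस्ते!")
tokens = [token.text for token in doc]
print("Custom tokenizer tokens:", tokens)
```

Because the tokenizer splits on whitespace before it looks up special cases, a special case is only useful for strings that contain no spaces.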


### 4. Detecting Currency in Hindi Sentences


In [51]:
doc = nlp("मोहन ने 5000 ₹ दिए, जबकि सोहन ने 100 डॉलर दिए।")

print("\nCurrency detection in the sentence:")
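for token in doc:
    # Skip a currency symbol at position 0, where doc[-1] would wrap around to the last token
    if token.is_currency and token.i > 0:
        print(f"Currency: {token.text}, Previous Token: {doc[token.i - 1].text}")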



Currency detection in the sentence:
Currency: ₹, Previous Token: 5000


### 5. Extracting Numeric Values and Associated Currency


In [52]:
doc = nlp("राजू ने 2000 ₹ उधार लिए, जबकि सुमन ने 50 डॉलर का भुगतान किया।")

print("\nExtracting numeric values and associated currency:")
for i, token in enumerate(doc):
    # Check that a next token exists and that it is a currency symbol
    if token.like_num and i + 1 < len(doc) and doc[i + 1].is_currency:
        print(f"Amount: {token.text} {doc[i + 1].text}")



Extracting numeric values and associated currency:
Amount: 2000 ₹
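Both outputs above miss the amount written with डॉलर, because is_currency only flags currency symbols such as ₹, not currency words. One possible workaround is to combine is_currency with a small word list. In the sketch below, CURRENCY_WORDS and extract_amounts are hypothetical names, and the word list is illustrative rather than exhaustive.

```python
import spacy

# Hypothetical, non-exhaustive list of Hindi currency words; extend as needed
CURRENCY_WORDS = {"रुपये", "रुपए", "रुपया", "डॉलर", "यूरो"}


def extract_amounts(doc):
    """Return (number, currency) pairs where a number is followed by a currency symbol or word."""
    amounts = []
    # Stop one token early so that a next token always exists
    for token in doc[:-1]:
        next_token = doc[token.i + 1]
        if token.like_num and (next_token.is_currency or next_token.text in CURRENCY_WORDS):
            amounts.append((token.text, next_token.text))
    return amounts


nlp = spacy.blank("hi")
doc = nlp("राजू ने 2000 ₹ उधार लिए, जबकि सुमन ने 50 डॉलर का भुगतान किया।")
amounts = extract_amounts(doc)
print(amounts)
```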


### 6. Sentence Segmentation (Sentence Tokenization) in Hindi


In [53]:
nlp.add_pipe('sentencizer')

doc = nlp("मोहन मुझसे मिलने आए थे। उन्होंने कहा कि वह कल वापस जाएंगे।")
print("\nSentences in the text:")
for sentence in doc.sents:
    print(sentence)



Sentences in the text:
मोहन मुझसे मिलने आए थे।
उन्होंने कहा कि वह कल वापस जाएंगे।


### 7. Using Stop Words in Hindi


In [54]:
# spaCy ships a built-in stop word list for Hindi
from spacy.lang.hi.stop_words import STOP_WORDS

print("\nStop words in Hindi (built-in list):")
print(STOP_WORDS)

doc = nlp("राम और श्याम दोनों अच्छे दोस्त हैं।")

print("\nFiltering out stop words from the text:")
filtered_tokens = [token.text for token in doc if not token.is_stop]
print(filtered_tokens)



Stop words in Hindi (built-in list):
{'इसका', 'उसि', 'मैं', 'वहिं', 'उंहें', 'कोई', 'कि', 'साबुत', 'इन्हीं', 'उन', 'उसके', 'भीतर', 'अगर', 'जेसा', 'वगेरह', 'साभ', 'के', 'वहां', 'करें', 'अदि', 'वाले', 'तिंहें', 'अंदर', 'कहते', 'या', 'वहीं', 'जा', 'ना', 'जिस', 'करते', 'ऐसे', 'करता', 'जैसा', 'थी', 'जितना', 'रखें', 'जहाँ', 'वुह', 'संग', 'वग़ैरह', 'भी', 'एसे', 'किया', 'में', 'इसकि', 'जिन्हें', 'इंहिं', 'इनका', 'व', 'निहायत', 'मेरा', 'जेसे', 'कोनसा', 'इन्हें', 'मगर', 'यहाँ', 'उसे', 'तिन्हों', 'होने', 'अभि', 'होते', 'उन्हीं', 'यह', 'सारा', 'यिह', 'रवासा', 'कर', 'उस', 'बनी', 'तिंहों', 'थि', 'अपनी', 'हुई', 'बनि', 'किन्हों', 'किसी', 'घर', 'दूसरे', 'द्वारा', 'नहिं', 'वरग', 'सकते', 'होती', 'ऱ्वासा', 'वर्ग', 'हुआ', 'का', 'तिन्हें', 'तिन', 'इंहें', 'आदि', 'दुसरा', 'काफ़ी', 'नहीं', 'हुए', 'लिये', 'निचे', 'कितना', 'कई', 'दिया', 'किंहें', 'सभी', 'एवं', 'तिसे', 'जब', 'अप', 'हो', 'किसि', 'वे', 'जैसे', 'नीचे', 'जिधर', 'उसी', 'इतयादि', 'अत', 'पे', 'बहुत', 'लेकिन', 'सकता', 'उन्हें', 'जिसे', 'हे', 'कुल', 'मुझको'
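If the built-in list does not cover a word you want to ignore, you can flag it yourself on the vocabulary. The following is a minimal sketch; the choice of "दोस्त" as an extra stop word is purely illustrative.

```python
import spacy

nlp = spacy.blank("hi")

# Hypothetical choice: treat "दोस्त" as a stop word for this analysis
custom_stop_words = ["दोस्त"]
for word in custom_stop_words:
    nlp.vocab[word].is_stop = True

doc = nlp("राम और श्याम दोनों अच्छे दोस्त हैं।")

# Drop stop words (built-in and custom) and punctuation
filtered_tokens = [t.text for t in doc if not t.is_stop and not t.is_punct]
print(filtered_tokens)
```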

### 8. Part-of-Speech (POS) Tagging in Hindi (if available)


In [55]:
# Note: a blank Hindi pipeline has no POS tagger, so token.pos_ is an empty string for every token.

# A trained pipeline with a tagger is needed to get real tags
doc = nlp("मोहन ने पुस्तक पढ़ी।")
print("\nPart-of-Speech tags:")
for token in doc:
    print(f"Token: {token.text}, POS: {token.pos_}")



Part-of-Speech tags:
Token: मोहन, POS: 
Token: ने, POS: 
Token: पुस्तक, POS: 
Token: पढ़ी, POS: 
Token: ।, POS: 
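The empty POS column above is what a pipeline without a tagger produces. Before relying on token.pos_, it can help to check whether the annotation exists at all. A minimal sketch using nlp.pipe_names and Doc.has_annotation:

```python
import spacy

nlp = spacy.blank("hi")
doc = nlp("मोहन ने पुस्तक पढ़ी।")

# A blank pipeline has no components, so nothing sets POS tags
print("Pipeline components:", nlp.pipe_names)

has_pos = doc.has_annotation("POS")
print("POS annotation present:", has_pos)

if not has_pos:
    print("No tagger in the pipeline; token.pos_ will be empty.")
```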


### 9. Entity Recognition in Hindi (if available)


In [56]:
# Note: a blank Hindi pipeline has no entity recognizer, so doc.ents is empty unless a trained or rule-based component is added

doc = nlp("मुंबई भारत की आर्थिक राजधानी है।")

print("\nNamed Entities in the text:")
for ent in doc.ents:
    print(f"Entity: {ent.text}, Label: {ent.label_}")



Named Entities in the text:
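The entity list above is empty because the blank pipeline has no entity recognizer. Without a trained model, one option is rule-based matching with spaCy's entity_ruler component. This is a swapped-in technique, not statistical NER, and the labels and patterns below are illustrative assumptions.

```python
import spacy

nlp = spacy.blank("hi")

# Rule-based entity matching; the patterns are hand-written examples
ruler = nlp.add_pipe("entity_ruler")
ruler.add_patterns([
    {"label": "GPE", "pattern": "मुंबई"},
    {"label": "GPE", "pattern": "भारत"},
])

doc = nlp("मुंबई भारत की आर्थिक राजधानी है।")
entities = [(ent.text, ent.label_) for ent in doc.ents]
print(entities)
```

Only strings listed in the patterns are found, so this approach suits small, known vocabularies such as place names or product names.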


### 10. Matching Patterns in Hindi Text


The Matcher in spaCy needs part-of-speech (POS) tags only when a pattern uses POS attributes, such as {"POS": "VERB"}.

The blank Hindi pipeline (nlp = spacy.blank("hi")) doesn't include a POS tagger, which is why a POS-based pattern may raise an error.

#### Solutions

1. Add a Tagger to the Pipeline:

You need to add a morphologizer or a tagger along with an attribute_ruler to the pipeline.
However, keep in mind that a blank Hindi pipeline has no trained weights for these components, so they must be trained before they can produce tags.

2. Use Patterns Without POS Tags:

If you do not require POS tags, you can modify the pattern to use simpler matching criteria, such as text patterns.

#### Example Without POS Tags

Here’s how you can modify the example to avoid using POS tags:

#### Explanation

Pattern: This pattern searches for "राम ने" followed by any non-punctuation token (IS_PUNCT: False), which avoids the need for POS tagging. IS_ALPHA is not used because it is False for most Devanagari tokens: vowel signs (matras) are combining marks, not letters, as the is_alpha values in section 2 show.

Matcher: The matcher object is used to find the pattern in the text.

In [57]:
import spacy
from spacy.matcher import Matcher

# Initialize a blank Hindi language model
nlp = spacy.blank("hi")

# Initialize matcher with the Hindi vocabulary
matcher = Matcher(nlp.vocab)

# Define a pattern to match "राम ने" followed by any non-punctuation token
pattern = [{"TEXT": "राम"}, {"TEXT": "ने"}, {"IS_PUNCT": False}]

# Add the pattern to the matcher
matcher.add("RAM_ACTION", [pattern])

doc = nlp("राम ने खाना खाया और फिर सो गया।")

matches = matcher(doc)

print("\nPattern matching results:")
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Matched Span: {span.text}")



Pattern matching results:
Matched Span: राम ने खाना


### Adding a Tagger

If you do want to use POS tags, you would need to train or load a Hindi model that includes a POS tagger. 

However, this requires access to a pre-trained Hindi model that supports POS tagging, which spaCy might not provide out of the box for Hindi.

Here's how you would do it if you had such a model:

In [58]:
""" import spacy
from spacy.matcher import Matcher

# Load a trained Hindi pipeline that includes a tagger.
# The path is a placeholder: substitute a pipeline you have trained yourself.
nlp = spacy.load("path/to/trained_hindi_pipeline")
# Alternatively, nlp.add_pipe("tagger") and nlp.add_pipe("attribute_ruler") add
# components by name to a blank pipeline, but they must be trained before use.

# Initialize matcher with the Hindi vocabulary
matcher = Matcher(nlp.vocab)

# Define a pattern to match "राम ने" followed by a verb
pattern = [{"TEXT": "राम"}, {"TEXT": "ने"}, {"POS": "VERB"}]

# Add the pattern to the matcher
matcher.add("RAM_ACTION", [pattern])

doc = nlp("राम ने खाना खाया और फिर सो गया।")

matches = matcher(doc)

print("\nPattern matching results:")
for match_id, start, end in matches:
    span = doc[start:end]
    print(f"Matched Span: {span.text}")
 """

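Without a trained tagger, a POS-based pattern can still be tried out by assigning tags with hand-written rules. The sketch below uses the attribute_ruler component to mark one token as a verb. The rule is an illustrative stand-in for a tagger, not a substitute for a trained model, and the pattern allows one arbitrary token between "राम ने" and the verb.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("hi")

# Hand-written rule standing in for a trained tagger (illustrative only):
# mark the token "खाया" as a verb
ruler = nlp.add_pipe("attribute_ruler")
ruler.add(patterns=[[{"ORTH": "खाया"}]], attrs={"POS": "VERB"})

matcher = Matcher(nlp.vocab)

# "राम ने", then any one token, then a token tagged as a verb
pattern = [{"TEXT": "राम"}, {"TEXT": "ने"}, {}, {"POS": "VERB"}]
matcher.add("RAM_ACTION", [pattern])

doc = nlp("राम ने खाना खाया और फिर सो गया।")
matched_spans = [doc[start:end].text for _, start, end in matcher(doc)]
print(matched_spans)
```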