A Python package that implements a tokenizer and a stemmer for the Hindi language.
HindiTokenizer.py package name changed Apr 24, 2015
README.md
stopwords.txt


Tokenizer for Hindi

This package implements a tokenizer and a stemmer for the Hindi language.

To import the package,

from HindiTokenizer import Tokenizer

This package implements various functions, which are listed below.

The Tokenizer can be created in two ways

t=Tokenizer("यह वाक्य हिन्दी में है।")

Or

t=Tokenizer()
t.read_from_file('filename_here')

A brief description of each function:

read_from_file

This function takes the name of a file in the current directory and reads its contents.

t.read_from_file('hindi_file.txt')
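
As a rough illustration of what reading a Hindi text file involves, the sketch below opens a file with an explicit UTF-8 encoding. The helper name `read_text` and the temporary sample file are hypothetical and are not part of HindiTokenizer; the code is written for Python 3.

```python
import os
import tempfile

def read_text(path):
    # Open with an explicit encoding so Devanagari text decodes
    # the same way regardless of the platform default.
    with open(path, encoding="utf-8") as handle:
        return handle.read()

# Write a small sample file, then read it back.
sample = "यह वाक्य हिन्दी में है।"
path = os.path.join(tempfile.mkdtemp(), "hindi_file.txt")
with open(path, "w", encoding="utf-8") as handle:
    handle.write(sample)

text = read_text(path)
```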

generate_sentences

Given a text, this will generate a list of sentences.

t.generate_sentences()
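
To give an idea of what sentence generation means for Hindi, here is a minimal Python 3 sketch that splits text on the danda (।), the question mark, and the exclamation mark. The function `split_sentences` is a hypothetical helper and the choice of delimiters is an assumption; the package's own rules may differ.

```python
import re

def split_sentences(text):
    # Split on the danda (।), "?" or "!" and drop empty pieces.
    parts = re.split(r"[।?!]", text)
    return [part.strip() for part in parts if part.strip()]

sentences = split_sentences("यह वाक्य हिन्दी में है। यह दूसरा वाक्य है।")
```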

print_sentences

This will print the sentences generated by generate_sentences.

t.generate_sentences()
t.print_sentences()

tokenize

This will generate a list of tokens from the given text.

t.tokenize()
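
The following Python 3 sketch shows one simple way to turn Hindi text into tokens: replace punctuation with spaces, then split on whitespace. The name `simple_tokenize` and the punctuation set are assumptions made for illustration, not the package's implementation.

```python
import re

def simple_tokenize(text):
    # Replace the danda and common punctuation with spaces,
    # then split on any run of whitespace.
    cleaned = re.sub(r"[।,.?!;:()]", " ", text)
    return cleaned.split()

tokens = simple_tokenize("यह वाक्य हिन्दी में है।")
```

Splitting on whitespace is used here instead of a `\w+` pattern because Python's `re` module does not count Devanagari vowel signs as word characters, so `\w+` would break words apart.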

print_tokens

This will print the tokens generated by tokenize.

t.tokenize()
t.print_tokens()

generate_freq_dict

This will generate a dictionary of frequency of words and return it.

freq_dict=t.generate_freq_dict()
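
A frequency dictionary maps each word to the number of times it occurs. The sketch below builds one with `collections.Counter` from the standard library in Python 3; the whitespace-split token list is a stand-in for the package's tokens, and the structure of the package's returned dictionary is assumed to be a plain word-to-count mapping.

```python
from collections import Counter

# Stand-in token list; in practice these would come from the tokenizer.
tokens = "यह वाक्य है और यह भी वाक्य है".split()

# Counter tallies occurrences; dict() converts it to a plain dictionary.
word_counts = dict(Counter(tokens))
```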

print_freq_dict

This will print the dictionary of frequency of words generated by generate_freq_dict.

freq_dict=t.generate_freq_dict()
t.print_freq_dict(freq_dict)

generate_stem_word

Given a word, this will generate its stem word.

word=t.generate_stem_word("भारतीय")
print word
भारत
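
The example above is consistent with suffix stripping, a common approach to stemming. The Python 3 sketch below removes the longest matching suffix from a word. The three-entry suffix list and the function `strip_suffix` are hypothetical and chosen only to reproduce this one example; they are not the package's suffix table, and a usable stemmer would need a far larger list.

```python
# Hypothetical suffix list, written as escapes so the combining
# vowel signs are unambiguous.
SUFFIXES = [
    "\u0940\u092f",  # ीय, as in भारतीय
    "\u094b\u0902",  # ों
    "\u0947\u0902",  # ें
]

def strip_suffix(word, min_stem=2):
    # Try longer suffixes first and remove the first one that matches,
    # provided at least min_stem characters remain.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[:-len(suffix)]
    return word

stem = strip_suffix("भारतीय")
```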

generate_stem_dict

This will return the dictionary of stemmed words.

stem_dict=t.generate_stem_dict()

print_stem_dict

This will print the dictionary of stemmed words generated by generate_stem_dict.

stem_dict=t.generate_stem_dict()
t.print_stem_dict(stem_dict)

remove_stopwords

This will remove all the stopwords occurring in the given text.

t.remove_stopwords()
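
Stopword removal amounts to filtering tokens against a known set of very common words. In the Python 3 sketch below, the three-word `STOPWORDS` set and the helper `drop_stopwords` are illustrative assumptions; the repository lists a stopwords.txt file, which presumably holds the real list.

```python
# Tiny illustrative stopword set, not the contents of stopwords.txt.
STOPWORDS = {"यह", "में", "है"}

def drop_stopwords(tokens, stopwords=STOPWORDS):
    # Keep only the tokens that are not in the stopword set.
    return [token for token in tokens if token not in stopwords]

filtered = drop_stopwords(["यह", "वाक्य", "हिन्दी", "में", "है"])
```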

clean_text

This will remove all the punctuation symbols occurring in the given text.

t.clean_text()
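
One way to strip punctuation from Hindi text in Python 3 is `str.translate` with a table that deletes each punctuation character. The sketch assumes the punctuation set is ASCII punctuation plus the danda (।); which symbols the package actually removes is not stated here, and `strip_punctuation` is a hypothetical name.

```python
import string

# ASCII punctuation plus the Devanagari danda.
PUNCTUATION = string.punctuation + "\u0964"

def strip_punctuation(text):
    # Mapping a code point to None makes translate() delete it.
    table = {ord(ch): None for ch in PUNCTUATION}
    return text.translate(table)

cleaned = strip_punctuation("यह वाक्य, हिन्दी में है।")
```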

len_text

Given a text, this will return the length of it.

print t.len_text()

sentence_count

Given a text, this will return the number of sentences in it.

print t.sentence_count()

tokens_count

Given a text, this will return the number of tokens in it.

print t.tokens_count()

concordance

Given a text and a word, this will return all the sentences in which that word occurs.

sentences=t.concordance("हिन्दी")
t.print_sentences(sentences)
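
Conceptually, a concordance lookup filters the sentence list down to those containing the word. The Python 3 sketch below does this with a plain substring test; `find_sentences` is a hypothetical helper, and substring matching is an assumption (it would also match the word inside a longer word).

```python
def find_sentences(sentences, word):
    # Keep every sentence that contains the word as a substring.
    return [sentence for sentence in sentences if word in sentence]

sentences = ["यह वाक्य हिन्दी में है", "यह दूसरा वाक्य है"]
matches = find_sentences(sentences, "हिन्दी")
```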