Machine/Deep Learning with Tamil (one of the oldest living languages)
The aim is to create ML-specific applications and enhancements for the Tamil language.
Extracts the bz2-compressed article dump to an XML file. It was created mainly to avoid large transfers to the cloud: upload the compressed file and extract it in the cloud. The article dump can be downloaded from the Wikimedia Tamil data.
Extracts each article's "Title" and cleaned "Content" from the XML tree and exports them as tabular data for easy use.
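The extraction step can be sketched with the Python standard library alone. This is a minimal sketch, not the repo's actual script: the function name `extract_bz2` and the file names in the usage note are assumptions made for illustration.

```python
import bz2
import shutil
from pathlib import Path


def extract_bz2(src: str, dst: str, chunk_size: int = 1024 * 1024) -> Path:
    """Stream-decompress a .bz2 file to dst without loading it all into memory."""
    dst_path = Path(dst)
    with bz2.open(src, "rb") as compressed, open(dst_path, "wb") as out:
        # Copy in fixed-size chunks so large dumps do not exhaust RAM.
        shutil.copyfileobj(compressed, out, chunk_size)
    return dst_path
```

Assuming the uploaded dump is named `articles.xml.bz2` (a hypothetical name), `extract_bz2("articles.xml.bz2", "articles.xml")` would write the decompressed XML next to it.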
Uses many regex rules to clean the data.
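The title and content extraction described above might look roughly like the following. It is a sketch under assumptions: the dump is taken to use `<page>`, `<title>` and `<text>` elements, and the handful of regex rules are illustrative stand-ins, not the rules this project actually uses.

```python
import csv
import re
import xml.etree.ElementTree as ET

# Illustrative cleaning rules only; a real pipeline needs many more.
CLEAN_RULES = [
    (re.compile(r"<[^>]+>"), " "),                               # HTML-like tags
    (re.compile(r"\{\{[^{}]*\}\}"), " "),                        # simple, non-nested templates
    (re.compile(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]"), r"\1"),  # [[target|label]] -> label
    (re.compile(r"'{2,}"), ""),                                  # bold/italic quote markup
    (re.compile(r"\s+"), " "),                                   # collapse whitespace
]


def local_name(tag: str) -> str:
    """Drop the '{namespace}' prefix that ElementTree adds to tag names."""
    return tag.rsplit("}", 1)[-1]


def clean_text(text: str) -> str:
    for pattern, replacement in CLEAN_RULES:
        text = pattern.sub(replacement, text)
    return text.strip()


def iter_articles(xml_path):
    """Yield (title, cleaned content) per <page>, streaming through the file."""
    title, content = "", ""
    for _, elem in ET.iterparse(xml_path, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text or ""
        elif name == "text":
            content = elem.text or ""
        elif name == "page":
            yield title, clean_text(content)
            title, content = "", ""
            elem.clear()  # free memory held by the finished page


def export_csv(xml_path, csv_path):
    """Write one row per article with 'title' and 'content' columns."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "content"])
        writer.writerows(iter_articles(xml_path))
```

Streaming with `iterparse` keeps memory use flat on large dumps; the CSV output is one possible tabular format and could be swapped for any other.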
Word embeddings for Tamil words, built with the gensim library.
The trained model provides similarity scores between words.
The training parameters can be adjusted to extend its usage.
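A minimal sketch of how such embeddings could be trained, assuming gensim 4.x and its `Word2Vec` model (the source does not say which gensim model is used). The helper names `tokenize` and `train_embeddings` and the default parameter values are assumptions for illustration.

```python
import re

# Runs of characters from the Unicode Tamil block (U+0B80 to U+0BFF).
TAMIL_TOKEN = re.compile(r"[\u0B80-\u0BFF]+")


def tokenize(text: str) -> list:
    """Return the Tamil-script words in text, dropping everything else."""
    return TAMIL_TOKEN.findall(text)


def train_embeddings(texts, vector_size=100, window=5, min_count=1, epochs=10):
    """Train a Word2Vec model on an iterable of raw text strings."""
    # Imported here so tokenize() stays usable without gensim installed.
    from gensim.models import Word2Vec

    sentences = [tokenize(t) for t in texts]
    return Word2Vec(
        sentences=sentences,
        vector_size=vector_size,  # embedding dimensionality
        window=window,            # context window size
        min_count=min_count,      # ignore words rarer than this
        epochs=epochs,
        workers=1,
    )
```

With a trained model, `model.wv.most_similar("தமிழ்", topn=5)` lists the nearest words and `model.wv.similarity(a, b)` gives the cosine similarity of two in-vocabulary words. Raising `min_count` or changing `vector_size` and `window` are the usual first parameters to tune.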
- DL based Language Modelling
- Document Similarity Model
- NER
- Sentiment detection
- etc.
The idea and the character set are copied from this repo: https://github.com/wickkiey/open-tamil/
This approach converts Tamil to English and vice versa (phonetic transliteration).
The data folder has a pickle file, which can be used to continue further.
Note: Work in progress
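To illustrate the Tamil-to-Latin direction, here is a minimal sketch of one common way to transliterate an abugida script: emit each consonant, then either a vowel sign, nothing (after a pulli), or the inherent "a". The tables below are hypothetical and deliberately tiny; they are not the character set from the referenced repo, and the romanization choices are assumptions.

```python
# Hypothetical, deliberately tiny tables; not the mapping from the referenced repo.
VOWELS = {"அ": "a", "ஆ": "aa", "இ": "i", "ஈ": "ii", "உ": "u", "ஊ": "uu"}
CONSONANTS = {"க": "k", "த": "th", "ம": "m", "ழ": "zh", "ல": "l", "ப": "p"}
VOWEL_SIGNS = {
    "\u0BBE": "aa",  # sign AA
    "\u0BBF": "i",   # sign I
    "\u0BC0": "ii",  # sign II
    "\u0BC1": "u",   # sign U
    "\u0BC2": "uu",  # sign UU
}
PULLI = "\u0BCD"  # virama: removes the consonant's inherent "a"


def tamil_to_latin(word: str) -> str:
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt == PULLI:
                i += 1                      # bare consonant, no vowel
            elif nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])
                i += 1                      # consume the vowel sign
            else:
                out.append("a")             # inherent vowel
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        else:
            out.append(ch)                  # pass through anything unmapped
        i += 1
    return "".join(out)
```

With these tables, `tamil_to_latin("தமிழ்")` gives `"thamizh"`. The reverse direction (Latin to Tamil) is ambiguous without longest-match rules and is not covered by this sketch.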
The idea is to create a mapping from each word to its:
- root word
- synonyms
- antonyms
- related words
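One possible shape for such a mapping is sketched below. Since this part is still in progress, the class names `WordEntry` and `WordMap` and their methods are purely hypothetical, and the Tamil words in the usage note are only illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class WordEntry:
    """Everything recorded about a single word."""
    word: str
    root: str = ""
    synonyms: set = field(default_factory=set)
    antonyms: set = field(default_factory=set)
    related: set = field(default_factory=set)


class WordMap:
    """In-memory word mapping keyed by the word itself."""

    def __init__(self):
        self._entries = {}

    def entry(self, word: str) -> WordEntry:
        # Create the entry on first access so callers never see a KeyError.
        return self._entries.setdefault(word, WordEntry(word))

    def add_synonyms(self, a: str, b: str) -> None:
        # Synonymy is symmetric, so record both directions.
        self.entry(a).synonyms.add(b)
        self.entry(b).synonyms.add(a)

    def add_antonyms(self, a: str, b: str) -> None:
        # Antonymy is symmetric, so record both directions.
        self.entry(a).antonyms.add(b)
        self.entry(b).antonyms.add(a)
```

For example, after `m = WordMap()` and `m.add_antonyms("பெரிய", "சிறிய")`, both entries list each other as antonyms, and a root can be set with `m.entry("படித்தான்").root = "படி"`.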