<img src="https://i.imgur.com/RFR6UZX.jpg" width="100%"/>

# Quick and dirty Transliteration Tables

This notebook provides a short, easy to *copy and paste* set of three functions (16 lines of code) that transliterate `Hindi` and `Tamil` text to the `Latin` alphabet in a very quick-and-dirty fashion.

# What is transliteration?  `अक्तूबर` -> `aktūbr` (October)

Transliteration maps the characters of one alphabet to those of another, roughly preserving pronunciation. It lets readers who don't know the source script sound the text out and, sometimes, understand it too.

See this example:

This is how you write `police` in Russian: `полиция`.

And this is how it looks when you transliterate Cyrillic to Latin: `politsiya`

It's still Russian, but much more familiar, isn't it?
The transliteration is a simple phonetic mapping from one alphabet to another. Here, the mapping was:
```python
{'п': 'p', 'о': 'o', 'л': 'l', 'и': 'i', 'ц': 'ts', 'я': 'ya'}
```
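
As a minimal sketch, a mapping like this can be applied one character at a time. The names `CYRILLIC_MAP` and `transliterate_word` are made up for this illustration, and the table only covers the letters of this one word:

```python
# Illustrative table for the single word above, not a full Cyrillic mapping.
CYRILLIC_MAP = {'п': 'p', 'о': 'o', 'л': 'l', 'и': 'i', 'ц': 'ts', 'я': 'ya'}

def transliterate_word(word, mapping):
    # Replace each character by its Latin counterpart; keep unknown characters as-is.
    return ''.join(mapping.get(ch, ch) for ch in word)

print(transliterate_word('полиция', CYRILLIC_MAP))  # politsiya
```

Note that one source character can map to several Latin letters (`ц` -> `ts`, `я` -> `ya`), which is why the result is built by joining strings rather than swapping characters in place.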


# Origin of the tables

I couldn't find a well-established Python package for this, at least not quickly. But I did find the following tables:

For Hindi:
* https://pandey.github.io/posts/transliterate-devanagari-to-latin.html

For Tamil:
* https://www.loc.gov/catdir/cpso/romanization/tamil.pdf


Note that a few characters are dropped (this really is quick and dirty).
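
To see which characters a table does not cover, something like the following could help. It is only a sketch: `dropped_characters` and `SAMPLE_TAMIL_MAP` are hypothetical names, and the three-entry table stands in for the real one. In this notebook the Tamil function silently drops characters it has no entry for, while the Hindi function passes them through unchanged.

```python
import unicodedata
from collections import Counter

def dropped_characters(text, mapping):
    # Count every character of `text` that has no entry in `mapping`.
    return Counter(ch for ch in text if ch not in mapping)

# Hypothetical three-consonant table, just for this illustration.
SAMPLE_TAMIL_MAP = {'த': 'ta', 'ம': 'ma', 'ழ': 'la'}

# The word "Tamil": three consonants, one vowel sign and one virama.
dropped = dropped_characters('தமிழ்', SAMPLE_TAMIL_MAP)
for ch, count in dropped.most_common():
    # Print the code point and Unicode name, since combining marks are hard to read alone.
    print(f'U+{ord(ch):04X} {unicodedata.name(ch, "UNKNOWN")} x{count}')
```

Here the vowel sign and the virama are the characters without an entry, so with a consonant-only table every consonant keeps its inherent `a`.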


# Usage

The usage is quite straightforward. See examples below for some good surprises!
```python
df_trans = transliterate(df_train)
```

In [None]:
import string

def transliterate_hindi(st):
    HINDI_MAP = { 'ॐ' : 'oṁ', 'ऀ' : 'ṁ', 'ँ' : 'ṃ', 'ं' : 'ṃ', 'ः' : 'ḥ', 'अ' : 'a', 'आ' : 'ā', 'इ' : 'i', 'ई' : 'ī', 'उ' : 'u', 'ऊ' : 'ū', 'ऋ' : 'r̥', 'ॠ' : ' r̥̄', 'ऌ' : 'l̥', 'ॡ' : ' l̥̄', 'ऍ' : 'ê', 'ऎ' : 'e', 'ए' : 'e', 'ऐ' : 'ai', 'ऑ' : 'ô', 'ऒ' : 'o', 'ओ' : 'o', 'औ' : 'au', 'ा' : 'ā', 'ि' : 'i', 'ी' : 'ī', 'ु' : 'u', 'ू' : 'ū', 'ृ' : 'r̥', 'ॄ' : ' r̥̄', 'ॢ' : 'l̥', 'ॣ' : ' l̥̄', 'ॅ' : 'ê', 'े' : 'e', 'ै' : 'ai', 'ॉ' : 'ô', 'ो' : 'o', 'ौ' : 'au', 'क़' : 'q', 'क' : 'k', 'ख़' : 'x', 'ख' : 'kh', 'ग़' : 'ġ', 'ग' : 'g', 'ॻ' : 'g', 'घ' : 'gh', 'ङ' : 'ṅ', 'च' : 'c', 'छ' : 'ch', 'ज़' : 'z', 'ज' : 'j', 'ॼ' : 'j', 'झ' : 'jh', 'ञ' : 'ñ', 'ट' : 'ṭ', 'ठ' : 'ṭh', 'ड़' : 'ṛ', 'ड' : 'ḍ', 'ॸ' : 'ḍ', 'ॾ' : 'd', 'ढ़' : 'ṛh', 'ढ' : 'ḍh', 'ण' : 'ṇ', 'त' : 't', 'थ' : 'th', 'द' : 'd', 'ध' : 'dh', 'न' : 'n', 'प' : 'p', 'फ़' : 'f', 'फ' : 'ph', 'ब' : 'b', 'ॿ' : 'b', 'भ' : 'bh', 'म' : 'm', 'य' : 'y', 'र' : 'r', 'ल' : 'l', 'ळ' : 'ḷ', 'व' : 'v', 'श' : 'ś', 'ष' : 'ṣ', 'स' : 's', 'ह' : 'h', 'ऽ' : '\'', '्' : '', '़' : '', '०' : '0', '१' : '1', '२' : '2', '३' : '3', '४' : '4', '५' : '5', '६' : '6', '७' : '7', '८' : '8', '९' : '9', 'ꣳ' : 'ṁ', '।' : '.', '॥' : '..', ' ' : ' '}
    return ''.join(HINDI_MAP.get(c, c)  for c in st)

def transliterate_tamil(st):
    text = """அ a எ e ஆ ā ஏ ē இ i ஐ ai ஈ ī ஒ o உ u ஓ ō ஊ ū ஔ au ஃ ka ம ma க ka ய ya ங ṅa ர ra ச ca ல la ஞ ña வ va ட ṭa ழ la ண ṇa ள ḷa த ta ற ra ந na ன na ப pa ஜ ja ஸ sa ஶ śa ஹ ha ஷ ṣa""".split()
    TAMIL_MAP = dict(zip(text[0::2], text[1::2]))
    TAMIL_MAP.update({t: t for t in ' ?.1234567890'+string.ascii_lowercase})
    return ''.join(TAMIL_MAP.get(c.lower(), '') for c in st)

def transliterate(df_in, columns=['question', 'context', 'answer_text']):
    df = df_in.copy()
    for c in columns:
        df.loc[df['language'] == 'hindi', c] = df.loc[df['language'] == 'hindi', c].apply(transliterate_hindi)
        df.loc[df['language'] == 'tamil', c] = df.loc[df['language'] == 'tamil', c].apply(transliterate_tamil)        
    return df
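
One caveat about the per-character lookup: a loop like `for c in st` yields single code points, so table keys made of two code points (a consonant followed by a nukta) can never match. The consonant and the nukta are looked up separately instead. If that matters, a longest-match-first replacement is one possible alternative. The sketch below is not part of the notebook's functions; `SAMPLE_MAP` and `transliterate_longest_match` are hypothetical names, and the table has only four entries.

```python
import re

# Tiny hypothetical table: one two-code-point key (ka + nukta) plus single code points.
SAMPLE_MAP = {'\u0915\u093c': 'q', '\u0915': 'k', '\u093c': '', '\u093e': 'ā'}

def transliterate_longest_match(text, mapping):
    # Try longer keys first so that "ka + nukta" wins over plain "ka".
    keys = sorted(mapping, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Characters that match no key are left untouched by re.sub.
    return pattern.sub(lambda m: mapping[m.group(0)], text)

word = '\u0915\u093c\u093e'  # ka, nukta, vowel sign aa
print(''.join(SAMPLE_MAP.get(c, c) for c in word))    # kā (per-character lookup)
print(transliterate_longest_match(word, SAMPLE_MAP))  # qā
```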

# Example usage:

In [None]:
import pandas as pd

df_train = pd.read_csv("../input/chaii-hindi-and-tamil-question-answering/train.csv")
df_trans = transliterate(df_train)

## Tamil

In [None]:
df_train.head()

In [None]:
df_trans.head()

## Hindi

In [None]:
df_train[df_train['language'] == 'hindi'].head(5)

In [None]:
df_trans[df_trans['language'] == 'hindi'].head(5)

It improves readability a little. For example:

In [None]:
# This is a name. Adolph Meyr or something
df_trans[df_trans['language'] == 'hindi']['answer_text'].iloc[0]

In [None]:
df_train[df_train['language'] == 'hindi']['answer_text'].iloc[0]

And this is a date (October 27, 1605):

In [None]:
df_train.iloc[1112]['answer_text']

In [None]:
df_trans.iloc[1112]['answer_text']

In [None]:
df_train[df_train['language'] == 'hindi'].iloc[0]['context']

In [None]:
df_trans[df_trans['language'] == 'hindi'].iloc[0]['context']

### New to Question Answering? 
Join me in my `Quick overview for QA noobs` series, where I'm jumping into this competition from zero. Current notebooks are:

1. [The competition [QA for QA noobs]](https://www.kaggle.com/julian3833/1-the-competition-qa-for-qa-noobs)
2. [The dataset [QA for QA noobs]](https://www.kaggle.com/julian3833/2-the-dataset-qa-for-qa-noobs)
3. [The metric (Jaccard) [QA for QA noobs]](https://www.kaggle.com/julian3833/3-the-metric-jaccard-qa-for-qa-noobs)
4. [Exploring Public Models [QA for QA noobs]](https://www.kaggle.com/julian3833/4-exploring-public-models-qa-for-qa-noobs/)
5. [XLM-Roberta + Torch's extra data [LB: 0.749]](https://www.kaggle.com/julian3833/5-xlm-roberta-torch-s-extra-data-lb-0-749)


## Remember to upvote the notebook if you found it useful! 🤗