# NLP Improvements: Peculiarities of Russian Inflection Diminish TF-IDF Quality
### Maximizing TF-IDF scores in Russian NLP applications


Another competitor [brilliantly shared](https://www.kaggle.com/iggisv9t/handling-russian-language-inflectional-structure) that the Russian language has a peculiar inflectional structure, so the same word can be spelled differently in different contexts.  This can fool our TF-IDF, which treats every distinct spelling as a separate term.

**For example:**
* Dog -> Собак**а**
* No dog -> нет собак**и**
* Give a dog a bone -> Дай собак**е** кость.

As you can see, the last letter of the word "dog" changes with the context.  It can get even more complicated.  We need to account for this in our TF-IDF.
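
To make the problem concrete, here is a minimal sketch that counts the distinct terms in the three example phrases, before and after merging the inflected forms. `TOY_LEMMAS` is a hypothetical hand-written lookup table standing in for a real morphological analyzer, and it only covers the forms of "собака"; the other words are left as they are.

```python
import re
from collections import Counter

# Token pattern: runs of word characters, apostrophes, and hyphens.
token_re = re.compile(r"[\'\w\-]+")

docs = ["Собака", "нет собаки", "Дай собаке кость."]

# Hypothetical lookup table for illustration only (not a real analyzer).
TOY_LEMMAS = {"собаки": "собака", "собаке": "собака"}

def tokens(text):
    """Lowercase the text and split it into tokens."""
    return token_re.findall(text.lower())

# Each surface form is counted as its own term.
raw_counts = Counter(t for d in docs for t in tokens(d))

# Inflected forms of "собака" are merged into one term.
lemma_counts = Counter(TOY_LEMMAS.get(t, t) for d in docs for t in tokens(d))

print(len(raw_counts), sorted(raw_counts))
print(len(lemma_counts), sorted(lemma_counts))
print(lemma_counts["собака"])
```

Without merging, the three forms of "dog" occupy three separate vocabulary entries with one occurrence each; after merging, they share a single entry with three occurrences.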

With the `pymorphy2` package, it is simple to *normalize* the text before computing TF-IDF.  First, we define a function called `normalize`.  It depends on `retoken`, a regular expression compiled with `re`, and on `morph`, a `pymorphy2` analyzer, so we import both modules; `pymorphy2` requires a special installation from the kernel settings.

In [None]:
import os
import pymorphy2
import re
import pandas as pd

train = pd.read_csv('../input/train.csv')
test = pd.read_csv('../input/test.csv')

In [None]:
morph = pymorphy2.MorphAnalyzer()
retoken = re.compile(r'[\'\w\-]+') # tokens: runs of word characters, apostrophes, and hyphens
def normalize(text):
    text = retoken.findall(text.lower()) # lowercase, then split into tokens
    text = [morph.parse(x)[0].normal_form for x in text] # normal form of the top-ranked parse
    return ' '.join(text)

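Here we normalize all of the text.  First we cast the text columns to strings, so that any non-string values (such as missing descriptions) do not break `normalize`; then we apply the function, which takes a while: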

In [None]:
%%time
train['title'] = train['title'].astype(str)
train['description'] = train['description'].astype(str)
test['title'] = test['title'].astype(str)
test['description'] = test['description'].astype(str)

In [None]:
%%time
train['title'] = train['title'].apply(normalize)
train['description'] = train['description'].apply(normalize)
train.to_csv("updnlp-train.csv")
del train # drop the reference so the memory can be reclaimed
test['title'] = test['title'].apply(normalize)
test['description'] = test['description'].apply(normalize)
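
Since the same tokens recur many times across titles and descriptions, one way this step might be sped up is to cache the per-token lookup. The following is a sketch under that assumption, not a measured optimization. `make_cached_normalizer` and `toy_lemmatize` are hypothetical names introduced here; the toy lemmatizer replaces the real analyzer only so that the example runs on its own.

```python
import re
from functools import lru_cache

# Same token pattern as `retoken` in the notebook.
token_re = re.compile(r"[\'\w\-]+")

def make_cached_normalizer(lemmatize, maxsize=None):
    """Build a normalize(text) function that looks up each distinct token once.

    `lemmatize` is any callable mapping one lowercase token to its normal form.
    With maxsize=None the cache is unbounded and grows with the vocabulary.
    """
    cached = lru_cache(maxsize=maxsize)(lemmatize)

    def normalize_text(text):
        return " ".join(cached(tok) for tok in token_re.findall(text.lower()))

    return normalize_text

# Toy lemmatizer for demonstration only; it records every call it receives.
calls = []
TOY_LEMMAS = {"собаки": "собака", "собаке": "собака"}

def toy_lemmatize(token):
    calls.append(token)
    return TOY_LEMMAS.get(token, token)

normalize_cached = make_cached_normalizer(toy_lemmatize)
print(normalize_cached("нет собаки"))
print(normalize_cached("Нет собаки, нет собаке"))
print(len(calls))  # 3: "нет" and "собаки" were looked up only once
```

In the notebook itself, the callable passed in would be something like `lambda tok: morph.parse(tok)[0].normal_form`, which is the same lookup that `normalize` performs for every token.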

You can see the results of this function (it does work!):

In [None]:
print(normalize('собака'))
print(normalize('нет собаки'))
print(normalize('Дай собаке кость.'))

Now, we write the normalized test data to a `csv` file as well (the training data was already written above).  You can use both files to train your TF-IDF.  If you're running your model on the Kaggle platform, I suggest importing this kernel's output into your own kernel to perform TF-IDF.  Otherwise, you can download the output.  **This merges inflected forms into a single term, which shrinks the vocabulary and should improve the quality of your TF-IDF!**

In [None]:
test.to_csv("updnlp-test.csv")
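
To illustrate what the normalization changes in the weights themselves, here is a hand-rolled sketch of one common TF-IDF variant, `tf * ln(N / df)`, applied to the example phrases. It is not the formula of any particular library (library vectorizers typically add smoothing and length normalization), and the `normalized` list is written by hand with only the forms of "собака" merged, so it is an assumption about the output rather than actual `pymorphy2` results.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, using tf * ln(N / df)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        weights.append(
            {t: (c / len(toks)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return weights

# Lowercased phrases before and after (hand-written) normalization.
raw = ["собака", "нет собаки", "дай собаке кость"]
normalized = ["собака", "нет собака", "дай собака кость"]

raw_w = tfidf(raw)
norm_w = tfidf(normalized)

print(raw_w[0]["собака"])   # ln(3): the form looks unique to the first document
print(norm_w[0]["собака"])  # 0.0: the term now appears in every document
```

Before normalization, each form of "dog" looks like a rare term that is specific to one document. After normalization, the model sees that all three documents mention the same word, so the term no longer separates them.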