Detecting Gender of VKontakte Users by Full Names in Russian/English

Data

We selected the top-ranked VK group¹ according to the Medialogia rating² and loaded all users with a specified gender via the VKontakte API³. We collected 13,126,794 VKontakte profiles, where each profile contains the user's gender and their first and last names written in both the Cyrillic and Latin alphabets. We did not consider users with unknown gender.

Among these profiles, we found 6,521,854 unique full names in the Cyrillic alphabet and 6,263,813 unique full names in the Latin alphabet. The counts differ because transliteration is not one-to-one: different Cyrillic names can share the same Latin transliteration, and the same or similar names in one alphabet can be transliterated into the other in different ways. Based on the data obtained, we formed the final dataset for training models using the following logic: if a user's name differs between its Latin and Cyrillic forms, we added both forms to the dataset; if the two forms are identical, we added the name only once.
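The selection logic above can be sketched as follows. This is a minimal illustration rather than the code used in the repository: the function name `build_name_records`, the tuple layout, and the sample profiles are all assumptions.

```python
def build_name_records(profiles):
    """Turn (cyrillic_name, latin_name, gender) profiles into (name, gender) records.

    Hypothetical helper: both spellings are kept when they differ,
    and the name is kept only once when they are identical.
    """
    records = []
    for cyrillic_name, latin_name, gender in profiles:
        records.append((cyrillic_name, gender))
        if latin_name != cyrillic_name:
            records.append((latin_name, gender))
    return records


# Made-up sample profiles, for illustration only.
profiles = [
    ("Иван Петров", "Ivan Petrov", "male"),
    ("Anna Smith", "Anna Smith", "female"),
]
records = build_name_records(profiles)
```

Here the first profile contributes two records (one per alphabet), while the second contributes a single record because both spellings match.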

The final dataset contains 25,101,673 names (46% male and 54% female).

Model

Following the approach by Panchenko and Teterin⁴, we used L2-regularized Logistic Regression with character n-grams to classify gender. To identify the best hyper-parameters (e.g. character n-gram type, n-gram range, use of IDF, TF-IDF normalisation type), we first ran a grid search with 10-fold cross-validation (80% training subset, 20% test subset) on a random sample of 100,000 full names. The model with character n-grams inside word boundaries, an n-gram range of (2, 7), IDF weighting, L2 TF-IDF normalisation, and ignoring terms that appear in more than 50% of the documents showed the best $F_1$ score of 0.9771, so we used these hyper-parameters to train the final model on the whole dataset. The final model trained on the full dataset demonstrated $F_1=0.9835$ on the test subset (20% of full names).
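To make "character n-grams inside word boundaries" concrete, here is a simplified pure-Python sketch of the feature extraction step only. The function name `char_wb_ngrams`, the lowercasing, and the single-space padding of each word are assumptions made for illustration; the TF-IDF weighting and the classifier are not reproduced here, and the tokenisation used for the released model may differ in details.

```python
def char_wb_ngrams(text, n_min=2, n_max=7):
    """Extract character n-grams that never cross a word boundary.

    Each whitespace-separated word is lowercased (an assumption) and padded
    with one space on each side, so that n-grams at the start and end of a
    word are distinguishable from n-grams in the middle.
    """
    ngrams = []
    for word in text.lower().split():
        padded = f" {word} "
        for n in range(n_min, n_max + 1):
            if n > len(padded):
                break  # the padded word is too short for longer n-grams
            for start in range(len(padded) - n + 1):
                ngrams.append(padded[start:start + n])
    return ngrams


short_example = char_wb_ngrams("Ann", n_min=2, n_max=3)
# short_example == [" a", "an", "nn", "n ", " an", "ann", "nn "]

full_name_example = char_wb_ngrams("Ann Li", n_min=2, n_max=2)
# full_name_example == [" a", "an", "nn", "n ", " l", "li", "i "]
```

Because each word is processed separately, no n-gram mixes characters from the first name and the last name, which is what distinguishes this setting from plain character n-grams over the whole string.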


Folders and files

Latest commit

History

Repository files navigation

Detecting Gender of VKontakte Users by Full Names in Russian/English

Data

Model

Footnotes

About

Resources

Stars

Watchers

Forks

Releases 1

Packages 0

Languages

Packages