This Python library provides detailed insights about names, including:
- Popularity (ranking by country)
- Gender prediction
- Country-specific statistics (105 countries supported)
- Fuzzy search (e.g., search for "ISABLE" returns "ISABEL")
- Autocomplete search (e.g., search for names starting with "ISA")
It can answer questions such as:
- Who is Zoe? Likely a Female, United Kingdom.
- Knows Philippe? Likely a Male, France. And with the spelling Philipp? Male, Germany.
- How about Nikki? Likely a Female, United States.
To download the raw CSV data for your analysis, browse here.
The dataset contains 730K first names and 983K last names, extracted from the massive Facebook dump (533M users).
Available on PyPI:
pip install names-dataset
Note: The dataset is loaded into memory, so make sure enough RAM is available to avoid a MemoryError.
Once installed, you can run the following commands to get familiar with the library:
from names_dataset import NameDataset, NameWrapper
# The library takes time to initialize because the database is massive. A tip is to include its initialization in your app's startup process.
nd = NameDataset()
print(NameWrapper(nd.search('Philippe')).describe)
# Male, France
print(NameWrapper(nd.search('Zoe')).describe)
# Female, United Kingdom
print(nd.search('Walter'))
# {'first_name': {'country': {'Argentina': 0.062, 'Austria': 0.037, 'Bolivia, Plurinational State of': 0.042, 'Colombia': 0.096, 'Germany': 0.044, 'Italy': 0.295, 'Peru': 0.185, 'United States': 0.159, 'Uruguay': 0.036, 'South Africa': 0.043}, 'gender': {'Female': 0.007, 'Male': 0.993}, 'rank': {'Argentina': 37, 'Austria': 34, 'Bolivia, Plurinational State of': 67, 'Colombia': 250, 'Germany': 214, 'Italy': 193, 'Peru': 27, 'United States': 317, 'Uruguay': 44, 'South Africa': 388}}, 'last_name': {'country': {'Austria': 0.036, 'Brazil': 0.039, 'Switzerland': 0.032, 'Germany': 0.299, 'France': 0.121, 'United Kingdom': 0.048, 'Italy': 0.09, 'Nigeria': 0.078, 'United States': 0.172, 'South Africa': 0.085}, 'gender': {}, 'rank': {'Austria': 106, 'Brazil': 805, 'Switzerland': 140, 'Germany': 39, 'France': 625, 'United Kingdom': 1823, 'Italy': 3564, 'Nigeria': 926, 'United States': 1210, 'South Africa': 1169}}}
print(nd.search('White'))
# {'first_name': {'country': {'United Arab Emirates': 0.044, 'Egypt': 0.294, 'France': 0.061, 'Hong Kong': 0.05, 'Iraq': 0.094, 'Italy': 0.117, 'Malaysia': 0.133, 'Saudi Arabia': 0.089, 'Taiwan, Province of China': 0.044, 'United States': 0.072}, 'gender': {'Female': 0.519, 'Male': 0.481}, 'rank': {'Taiwan, Province of China': 6940, 'United Arab Emirates': None, 'Egypt': None, 'France': None, 'Hong Kong': None, 'Iraq': None, 'Italy': None, 'Malaysia': None, 'Saudi Arabia': None, 'United States': None}}, 'last_name': {'country': {'Canada': 0.035, 'France': 0.016, 'United Kingdom': 0.296, 'Ireland': 0.028, 'Iraq': 0.016, 'Italy': 0.02, 'Jamaica': 0.017, 'Nigeria': 0.031, 'United States': 0.5, 'South Africa': 0.04}, 'gender': {}, 'rank': {'Canada': 46, 'France': 1041, 'United Kingdom': 18, 'Ireland': 66, 'Iraq': 1307, 'Italy': 2778, 'Jamaica': 35, 'Nigeria': 425, 'United States': 47, 'South Africa': 416}}}
print(nd.search('محمد'))
# {'first_name': {'country': {'Algeria': 0.018, 'Egypt': 0.441, 'Iraq': 0.12, 'Jordan': 0.027, 'Libya': 0.035, 'Saudi Arabia': 0.154, 'Sudan': 0.07, 'Syrian Arab Republic': 0.062, 'Turkey': 0.022, 'Yemen': 0.051}, 'gender': {'Female': 0.035, 'Male': 0.965}, 'rank': {'Algeria': 4, 'Egypt': 1, 'Iraq': 2, 'Jordan': 1, 'Libya': 1, 'Saudi Arabia': 1, 'Sudan': 1, 'Syrian Arab Republic': 1, 'Turkey': 18, 'Yemen': 1}}, 'last_name': {'country': {'Egypt': 0.453, 'Iraq': 0.096, 'Jordan': 0.015, 'Libya': 0.043, 'Palestine, State of': 0.016, 'Saudi Arabia': 0.118, 'Sudan': 0.146, 'Syrian Arab Republic': 0.058, 'Turkey': 0.017, 'Yemen': 0.037}, 'gender': {}, 'rank': {'Egypt': 2, 'Iraq': 3, 'Jordan': 1, 'Libya': 1, 'Palestine, State of': 1, 'Saudi Arabia': 3, 'Sudan': 1, 'Syrian Arab Republic': 2, 'Turkey': 44, 'Yemen': 1}}}
print(nd.get_top_names(n=10, gender='Male', country_alpha2='US'))
# {'US': {'M': ['Jose', 'David', 'Michael', 'John', 'Juan', 'Carlos', 'Luis', 'Chris', 'Alex', 'Daniel']}}
print(nd.get_top_names(n=5, country_alpha2='ES'))
# {'ES': {'M': ['Jose', 'Antonio', 'Juan', 'Manuel', 'David'], 'F': ['Maria', 'Ana', 'Carmen', 'Laura', 'Isabel']}}
print(nd.get_country_codes(alpha_2=True))
# ['AE', 'AF', 'AL', 'AO', 'AR', 'AT', 'AZ', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BN', 'BO', 'BR', 'BW', 'CA', 'CH', 'CL', 'CM', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'ET', 'FI', 'FJ', 'FR', 'GB', 'GE', 'GH', 'GR', 'GT', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JM', 'JO', 'JP', 'KH', 'KR', 'KW', 'KZ', 'LB', 'LT', 'LU', 'LY', 'MA', 'MD', 'MO', 'MT', 'MU', 'MV', 'MX', 'MY', 'NA', 'NG', 'NL', 'NO', 'OM', 'PA', 'PE', 'PH', 'PL', 'PR', 'PS', 'PT', 'QA', 'RS', 'RU', 'SA', 'SD', 'SE', 'SG', 'SI', 'SV', 'SY', 'TM', 'TN', 'TR', 'TW', 'US', 'UY', 'YE', 'ZA']
print(nd.auto_complete('isa', n=3)) # very fast, can be used in a loop in realtime.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isaac', 'rank': 266}, {'name': 'Isa', 'rank': 450}]
print(nd.fuzzy_search('isablel', n=3)) # slow to compute.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isabela', 'rank': 1228}, {'name': 'Isabele', 'rank': 2386}]
nd.first_names
# Dictionary of all the first names with their attributes.
nd.last_names
# Dictionary of all the last names with their attributes.
search(name: str)
Searches for a name and returns metadata for both first and last names (if available).
The result contains:
- country: The probability that the name belongs to a given country. Only the top 10 matching countries are returned.
- gender: The probability of the person being male or female.
- rank: The popularity rank of the name in its country (1 = most popular).
Note: Gender data only applies to first names.
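As a small illustration of how the result can be consumed, the sketch below picks the most likely country and gender from a dictionary shaped like the `search` output. The helper `most_likely` is hypothetical and not part of the library; the sample values are a trimmed copy of the `Walter` output shown in the usage section.

```python
from typing import Optional


def most_likely(probabilities: dict) -> Optional[str]:
    """Return the key with the highest probability, or None if the dict is empty."""
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


# Trimmed sample shaped like the output of nd.search('Walter').
result = {
    'first_name': {
        'country': {'Italy': 0.295, 'Peru': 0.185, 'United States': 0.159},
        'gender': {'Female': 0.007, 'Male': 0.993},
        'rank': {'Italy': 193, 'Peru': 27, 'United States': 317},
    },
    'last_name': {
        'country': {'Germany': 0.299, 'United States': 0.172},
        'gender': {},  # gender is only populated for first names
        'rank': {'Germany': 39, 'United States': 1210},
    },
}

first = result['first_name']
print(most_likely(first['country']))               # Italy
print(most_likely(first['gender']))                # Male
print(most_likely(result['last_name']['gender']))  # None
```

Returning `None` for an empty dictionary keeps the helper safe to call on the `gender` entry of a last name.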
get_top_names(...)
Retrieves the most popular names across supported countries.
Parameters:
- n: Number of names to return (per group).
- gender: 'Male' or 'Female' (only valid for first names).
- use_first_names: Choose between first names and last names.
- country_alpha2: 2-letter ISO country code (e.g., 'US', 'JP').
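The return value is nested by country code and then by gender key, as the sample outputs in the usage section show (note that the keys in the output are 'M' and 'F', while the `gender` argument takes 'Male' or 'Female'). A minimal sketch of walking that structure follows; `flatten_top_names` is a hypothetical helper, and the data is copied from the `ES` example.

```python
# Sample copied from the output of nd.get_top_names(n=5, country_alpha2='ES').
top_names = {
    'ES': {
        'M': ['Jose', 'Antonio', 'Juan', 'Manuel', 'David'],
        'F': ['Maria', 'Ana', 'Carmen', 'Laura', 'Isabel'],
    }
}


def flatten_top_names(top: dict) -> list:
    """Turn {country: {gender: [names]}} into (country, gender, position, name) rows."""
    rows = []
    for country, by_gender in top.items():
        for gender, names in by_gender.items():
            # Names are listed from most to least popular, so position 1 is the top name.
            for position, name in enumerate(names, start=1):
                rows.append((country, gender, position, name))
    return rows


rows = flatten_top_names(top_names)
print(rows[0])    # ('ES', 'M', 1, 'Jose')
print(len(rows))  # 10
```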
get_country_codes(alpha_2: bool = False)
Returns the list of countries supported by the dataset. Country codes follow the ISO 3166-1 alpha-2 format (e.g., US, FR, JP).
Parameters:
- alpha_2: If True, returns 2-letter ISO codes only.
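One possible use of this list is to validate a user-supplied country code before passing it as `country_alpha2`. The sketch below assumes a list of 2-letter codes like the one printed in the usage section; `is_supported` is a hypothetical helper, not a library function.

```python
def is_supported(code: str, supported_codes) -> bool:
    """Case-insensitive membership test for a 2-letter country code."""
    return code.strip().upper() in set(supported_codes)


# A few of the codes from the sample output of nd.get_country_codes(alpha_2=True).
supported = ['DE', 'ES', 'FR', 'GB', 'JP', 'US']

print(is_supported('fr', supported))  # True
print(is_supported('XX', supported))  # False
```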
auto_complete(...)
Returns top name suggestions that begin with the specified prefix.
Parameters:
- name: Prefix string (e.g., 'Al').
- n: Max number of results.
- use_first_names: Use first names if True, else last names.
- country_alpha2: Filter by country.
- gender: 'Male' or 'Female' (first names only).
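To make the behavior concrete, here is a toy version of prefix completion over a small ranked list. It is only a sketch of the idea (filter by prefix, order by rank, keep the first `n`) and is not the library's implementation; `auto_complete_sketch` is a hypothetical name, and the names and ranks are taken from the sample outputs in the usage section.

```python
def auto_complete_sketch(prefix: str, ranked_names: list, n: int = 5) -> list:
    """Return up to n entries whose name starts with prefix, best rank first."""
    prefix = prefix.lower()
    matches = [item for item in ranked_names
               if item['name'].lower().startswith(prefix)]
    # A lower rank number means a more popular name.
    matches.sort(key=lambda item: item['rank'])
    return matches[:n]


names = [
    {'name': 'Isabele', 'rank': 2386},
    {'name': 'Isaac', 'rank': 266},
    {'name': 'Isabel', 'rank': 144},
    {'name': 'Isa', 'rank': 450},
    {'name': 'Isabela', 'rank': 1228},
]

print(auto_complete_sketch('isab', names, n=2))
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isabela', 'rank': 1228}]
```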
fuzzy_search(...)
Performs fuzzy matching to suggest similar names.
Parameters:
- name: Search term (e.g., 'Jonh').
- n: Number of close matches to return.
- use_first_names: Use first names if True, else last names.
- country_alpha2: Filter by country.
- gender: 'Male' or 'Female' (first names only).
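The general idea of fuzzy matching can be illustrated with the standard library's `difflib.get_close_matches`. This is a stand-in technique chosen for the example; the README does not say which algorithm `fuzzy_search` uses, so the sketch should not be read as the library's method. `fuzzy_search_sketch` is a hypothetical name.

```python
import difflib


def fuzzy_search_sketch(term: str, names: list, n: int = 3, cutoff: float = 0.6) -> list:
    """Return up to n names that are similar to term, ignoring case."""
    by_lower = {name.lower(): name for name in names}
    close = difflib.get_close_matches(term.lower(), list(by_lower), n=n, cutoff=cutoff)
    # Map the lowercase matches back to their original spelling.
    return [by_lower[key] for key in close]


names = ['Isabel', 'Isabela', 'Isabele', 'Isaac', 'Isa']

print(fuzzy_search_sketch('isablel', names, n=1))  # ['Isabel']
```

Comparing the term against every candidate is expensive on a large name list, which is consistent with the usage section describing `fuzzy_search` as slow to compute.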
The dataset is available here: name_dataset.zip (3.3GB).
- The data contains 491,655,925 records from 106 countries.
- The uncompressed version takes around 10GB on the disk.
- Each country is in a separate CSV file.
- A CSV file contains rows of this format: first_name,last_name,gender,country_code.
- Each record is a real person.
- For Ruby see names_dataset.
- This version was generated from the massive Facebook Leak (533M accounts).
- Lists of names are not copyrightable, generally speaking, but if you want to be completely sure you should talk to a lawyer.
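Given the row format above, a per-country file can be processed with the standard `csv` module. The sketch below counts first names in a few invented rows. It assumes the files have no header line and that gender is encoded as 'M' or 'F'; neither detail is stated here, so adjust if the real files differ.

```python
import csv
import io
from collections import Counter

FIELDS = ['first_name', 'last_name', 'gender', 'country_code']

# Invented rows for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    "Zoe,White,F,GB\n"
    "Philippe,Walter,M,FR\n"
    "Zoe,Walter,F,GB\n"
)


def count_first_names(handle) -> Counter:
    """Count first-name occurrences in a header-less CSV stream."""
    reader = csv.DictReader(handle, fieldnames=FIELDS)
    # Skip rows where the first name is empty.
    return Counter(row['first_name'] for row in reader if row['first_name'])


counts = count_first_names(sample)
print(counts.most_common(1))  # [('Zoe', 2)]
```

For a real file, the same function can be given a handle opened with `open(path, newline='', encoding='utf-8')`, where `path` points to one of the extracted country files.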
Afghanistan, Albania, Algeria, Angola, Argentina, Austria, Azerbaijan, Bahrain, Bangladesh, Belgium, Bolivia, Plurinational State of, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czechia, Denmark, Djibouti, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Fiji, Finland, France, Georgia, Germany, Ghana, Greece, Guatemala, Haiti, Honduras, Hong Kong, Hungary, Iceland, India, Indonesia, Iran, Islamic Republic of, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Korea, Republic of, Kuwait, Lebanon, Libya, Lithuania, Luxembourg, Macao, Malaysia, Maldives, Malta, Mauritius, Mexico, Moldova, Republic of, Morocco, Namibia, Netherlands, Nigeria, Norway, Oman, Palestine, State of, Panama, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Russian Federation, Saudi Arabia, Serbia, Singapore, Slovenia, South Africa, Spain, Sudan, Sweden, Switzerland, Syrian Arab Republic, Taiwan, Province of China, Tunisia, Turkey, Turkmenistan, United Arab Emirates, United Kingdom, United States, Uruguay, Yemen.
NOTE: It is unfortunately not possible to support more countries because the missing ones were not included in the original dataset.
@misc{NameDataset2021,
author = {Philippe Remy},
title = {Name Dataset},
year = {2021},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/philipperemy/name-dataset}},
}