First and Last Name Database

This Python library provides detailed insights about names, including:

  • Popularity (ranking by country)
  • Gender prediction
  • Country-specific statistics (105 countries supported)
  • Fuzzy search (e.g., search for "ISABLE" returns "ISABEL")
  • Autocomplete search (e.g., search for names starting with "ISA")

It can help answer questions such as:

  • Who is Zoe? Likely female, United Kingdom.
  • Who is Philippe? Likely male, France. And with the spelling Philipp? Male, Germany.
  • How about Nikki? Likely female, United States.

📥 To download the raw CSV data for your analysis, browse here.

Composition

730K first names and 983K last names, extracted from the massive Facebook data dump (533M users).

Installation

Available on PyPI:

pip install names-dataset

Usage

⚠️ Note: This library requires approximately 3.2 GB of RAM to load the full dataset into memory. Make sure your system has enough available memory to avoid a MemoryError.

Once installed, you can run the following commands to get familiar with the library:

from names_dataset import NameDataset, NameWrapper

# Initialization is slow because the database is massive.
# Tip: create the NameDataset once, during your app's startup process.
nd = NameDataset()

print(NameWrapper(nd.search('Philippe')).describe)
# Male, France

print(NameWrapper(nd.search('Zoe')).describe)
# Female, United Kingdom

print(nd.search('Walter'))
# {'first_name': {'country': {'Argentina': 0.062, 'Austria': 0.037, 'Bolivia, Plurinational State of': 0.042, 'Colombia': 0.096, 'Germany': 0.044, 'Italy': 0.295, 'Peru': 0.185, 'United States': 0.159, 'Uruguay': 0.036, 'South Africa': 0.043}, 'gender': {'Female': 0.007, 'Male': 0.993}, 'rank': {'Argentina': 37, 'Austria': 34, 'Bolivia, Plurinational State of': 67, 'Colombia': 250, 'Germany': 214, 'Italy': 193, 'Peru': 27, 'United States': 317, 'Uruguay': 44, 'South Africa': 388}}, 'last_name': {'country': {'Austria': 0.036, 'Brazil': 0.039, 'Switzerland': 0.032, 'Germany': 0.299, 'France': 0.121, 'United Kingdom': 0.048, 'Italy': 0.09, 'Nigeria': 0.078, 'United States': 0.172, 'South Africa': 0.085}, 'gender': {}, 'rank': {'Austria': 106, 'Brazil': 805, 'Switzerland': 140, 'Germany': 39, 'France': 625, 'United Kingdom': 1823, 'Italy': 3564, 'Nigeria': 926, 'United States': 1210, 'South Africa': 1169}}}

print(nd.search('White'))
# {'first_name': {'country': {'United Arab Emirates': 0.044, 'Egypt': 0.294, 'France': 0.061, 'Hong Kong': 0.05, 'Iraq': 0.094, 'Italy': 0.117, 'Malaysia': 0.133, 'Saudi Arabia': 0.089, 'Taiwan, Province of China': 0.044, 'United States': 0.072}, 'gender': {'Female': 0.519, 'Male': 0.481}, 'rank': {'Taiwan, Province of China': 6940, 'United Arab Emirates': None, 'Egypt': None, 'France': None, 'Hong Kong': None, 'Iraq': None, 'Italy': None, 'Malaysia': None, 'Saudi Arabia': None, 'United States': None}}, 'last_name': {'country': {'Canada': 0.035, 'France': 0.016, 'United Kingdom': 0.296, 'Ireland': 0.028, 'Iraq': 0.016, 'Italy': 0.02, 'Jamaica': 0.017, 'Nigeria': 0.031, 'United States': 0.5, 'South Africa': 0.04}, 'gender': {}, 'rank': {'Canada': 46, 'France': 1041, 'United Kingdom': 18, 'Ireland': 66, 'Iraq': 1307, 'Italy': 2778, 'Jamaica': 35, 'Nigeria': 425, 'United States': 47, 'South Africa': 416}}}

print(nd.search('محمد'))
# {'first_name': {'country': {'Algeria': 0.018, 'Egypt': 0.441, 'Iraq': 0.12, 'Jordan': 0.027, 'Libya': 0.035, 'Saudi Arabia': 0.154, 'Sudan': 0.07, 'Syrian Arab Republic': 0.062, 'Turkey': 0.022, 'Yemen': 0.051}, 'gender': {'Female': 0.035, 'Male': 0.965}, 'rank': {'Algeria': 4, 'Egypt': 1, 'Iraq': 2, 'Jordan': 1, 'Libya': 1, 'Saudi Arabia': 1, 'Sudan': 1, 'Syrian Arab Republic': 1, 'Turkey': 18, 'Yemen': 1}}, 'last_name': {'country': {'Egypt': 0.453, 'Iraq': 0.096, 'Jordan': 0.015, 'Libya': 0.043, 'Palestine, State of': 0.016, 'Saudi Arabia': 0.118, 'Sudan': 0.146, 'Syrian Arab Republic': 0.058, 'Turkey': 0.017, 'Yemen': 0.037}, 'gender': {}, 'rank': {'Egypt': 2, 'Iraq': 3, 'Jordan': 1, 'Libya': 1, 'Palestine, State of': 1, 'Saudi Arabia': 3, 'Sudan': 1, 'Syrian Arab Republic': 2, 'Turkey': 44, 'Yemen': 1}}}

print(nd.get_top_names(n=10, gender='Male', country_alpha2='US'))
# {'US': {'M': ['Jose', 'David', 'Michael', 'John', 'Juan', 'Carlos', 'Luis', 'Chris', 'Alex', 'Daniel']}}

print(nd.get_top_names(n=5, country_alpha2='ES'))
# {'ES': {'M': ['Jose', 'Antonio', 'Juan', 'Manuel', 'David'], 'F': ['Maria', 'Ana', 'Carmen', 'Laura', 'Isabel']}}

print(nd.get_country_codes(alpha_2=True))
# ['AE', 'AF', 'AL', 'AO', 'AR', 'AT', 'AZ', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BN', 'BO', 'BR', 'BW', 'CA', 'CH', 'CL', 'CM', 'CN', 'CO', 'CR', 'CY', 'CZ', 'DE', 'DJ', 'DK', 'DZ', 'EC', 'EE', 'EG', 'ES', 'ET', 'FI', 'FJ', 'FR', 'GB', 'GE', 'GH', 'GR', 'GT', 'HK', 'HN', 'HR', 'HT', 'HU', 'ID', 'IE', 'IL', 'IN', 'IQ', 'IR', 'IS', 'IT', 'JM', 'JO', 'JP', 'KH', 'KR', 'KW', 'KZ', 'LB', 'LT', 'LU', 'LY', 'MA', 'MD', 'MO', 'MT', 'MU', 'MV', 'MX', 'MY', 'NA', 'NG', 'NL', 'NO', 'OM', 'PA', 'PE', 'PH', 'PL', 'PR', 'PS', 'PT', 'QA', 'RS', 'RU', 'SA', 'SD', 'SE', 'SG', 'SI', 'SV', 'SY', 'TM', 'TN', 'TR', 'TW', 'US', 'UY', 'YE', 'ZA']

print(nd.auto_complete('isa', n=3)) # very fast, can be used in a loop in realtime.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isaac', 'rank': 266}, {'name': 'Isa', 'rank': 450}]

print(nd.fuzzy_search('isablel', n=3)) # slow to compute.
# [{'name': 'Isabel', 'rank': 144}, {'name': 'Isabela', 'rank': 1228}, {'name': 'Isabele', 'rank': 2386}]

nd.first_names
# Dictionary of all the first names with their attributes.

nd.last_names
# Dictionary of all the last names with their attributes.
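
Because loading is slow and memory-hungry, it can help to guarantee that the dataset is built only once per process. The sketch below is one possible way to do that with the standard library; the function name get_name_dataset is hypothetical and not part of the library.

```python
from functools import lru_cache


@lru_cache(maxsize=1)
def get_name_dataset():
    """Build the NameDataset on first use and reuse the same instance afterwards.

    Hypothetical helper, not part of names-dataset.
    """
    # Imported lazily so that importing this module stays cheap.
    from names_dataset import NameDataset

    return NameDataset()
```

Calling get_name_dataset() once during application startup pays the loading cost up front; later calls return the cached instance.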

API

🔍 search(name: str)

Searches for a name and returns metadata for both first and last names (if available).

For each of first_name and last_name, the result contains:

  • country: The probability that the name belongs to a given country. Only the top 10 matching countries are returned.
  • gender: The probability of the person being male or female.
  • rank: The popularity rank of the name in its country (1 = most popular).

📌 Note: Gender data only applies to first names.
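
To illustrate how such a result can be consumed, the sketch below picks the most probable country and gender from a dictionary shaped like the output of search. The helpers most_likely and summarize are hypothetical (they are not part of the library), and the sample dictionary is an abridged copy of the 'Walter' output shown in the Usage section. The code assumes that a name with no data yields a missing or None entry.

```python
from typing import Optional


def most_likely(probabilities: dict) -> Optional[str]:
    """Return the key with the highest probability, or None if the dict is empty."""
    if not probabilities:
        return None
    return max(probabilities, key=probabilities.get)


def summarize(result: dict, kind: str = "first_name") -> dict:
    """Hypothetical helper: reduce one half of a search() result to its top entries."""
    entry = result.get(kind) or {}  # tolerate a missing or None entry (assumption)
    country = most_likely(entry.get("country", {}))
    return {
        "country": country,
        "gender": most_likely(entry.get("gender", {})),
        "rank": entry.get("rank", {}).get(country),
    }


# Abridged copy of the nd.search('Walter') output shown in the Usage section.
walter = {
    "first_name": {
        "country": {"Italy": 0.295, "Peru": 0.185, "United States": 0.159},
        "gender": {"Female": 0.007, "Male": 0.993},
        "rank": {"Italy": 193, "Peru": 27, "United States": 317},
    },
    "last_name": {
        "country": {"Germany": 0.299, "France": 0.121, "United States": 0.172},
        "gender": {},
        "rank": {"Germany": 39, "France": 625, "United States": 1210},
    },
}

print(summarize(walter, "first_name"))
# {'country': 'Italy', 'gender': 'Male', 'rank': 193}
print(summarize(walter, "last_name"))
# {'country': 'Germany', 'gender': None, 'rank': 39}
```

The last-name summary has no gender because the gender dictionary is empty for last names.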

🏆 get_top_names(...)

Retrieves the most popular names across supported countries.

Parameters:

  • n - Number of names to return (per group).
  • gender - 'Male' or 'Female' (only valid for first names).
  • use_first_names - Use first names if True, else last names.
  • country_alpha2 - 2-letter ISO country code (e.g., 'US', 'JP').
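
In the Usage output, the result is keyed by country code and then by 'M' / 'F', even though the gender parameter takes 'Male' or 'Female'. As a sketch of how to work with that nesting, the hypothetical helper below (not part of the library) flattens it into rows; the sample dictionary is copied from the get_top_names(n=5, country_alpha2='ES') output shown earlier.

```python
def flatten_top_names(top_names: dict) -> list:
    """Turn {'ES': {'M': [...], 'F': [...]}} into (country, gender, position, name) rows."""
    rows = []
    for country, by_gender in top_names.items():
        for gender, names in by_gender.items():
            # Names are listed from most to least popular, so position 1 is the top name.
            for position, name in enumerate(names, start=1):
                rows.append((country, gender, position, name))
    return rows


# Copy of the nd.get_top_names(n=5, country_alpha2='ES') output from the Usage section.
spain = {
    "ES": {
        "M": ["Jose", "Antonio", "Juan", "Manuel", "David"],
        "F": ["Maria", "Ana", "Carmen", "Laura", "Isabel"],
    }
}

rows = flatten_top_names(spain)
print(rows[0])
# ('ES', 'M', 1, 'Jose')
```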

🌍 get_country_codes(alpha_2: bool = False)

Returns a list of the supported countries found in the dataset. Country codes are in ISO 3166-1 alpha-2 format (e.g., US, FR, JP).

Parameters:

  • alpha_2 - If True, returns 2-letter ISO codes only.

✨ auto_complete(...)

Returns top name suggestions that begin with the specified prefix.

Parameters:

  • name - Prefix string (e.g., 'Al').
  • n - Max number of results.
  • use_first_names - Use first names if True, else last names.
  • country_alpha2 - Filter by country.
  • gender - 'Male' or 'Female' (first names only).
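
To give an idea of why prefix lookups can be fast, here is a minimal sketch of prefix matching over a sorted list using binary search. It is not the library's implementation (which is not described here), and unlike auto_complete it returns matches in alphabetical order rather than by rank. The function name prefix_matches and the sample names are made up for illustration.

```python
from bisect import bisect_left


def prefix_matches(sorted_names: list, prefix: str, n: int = 5) -> list:
    """Return up to n names from a sorted, lowercase list that start with prefix."""
    prefix = prefix.lower()
    # Binary search for the first name that is >= prefix.
    index = bisect_left(sorted_names, prefix)
    matches = []
    while index < len(sorted_names) and len(matches) < n:
        name = sorted_names[index]
        if not name.startswith(prefix):
            break  # sorted input: no later name can match either
        matches.append(name)
        index += 1
    return matches


names = sorted(["isabel", "isaac", "isa", "ivan", "irene", "isabela"])
print(prefix_matches(names, "isa", n=3))
# ['isa', 'isaac', 'isabel']
```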

🧠 fuzzy_search(...)

Performs fuzzy matching to suggest similar names.

Parameters:

  • name - Search term (e.g., 'Jonh').
  • n - Number of close matches to return.
  • use_first_names - Use first names if True, else last names.
  • country_alpha2 - Filter by country.
  • gender - 'Male' or 'Female' (first names only).
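
For readers unfamiliar with fuzzy matching, the sketch below shows the general idea using Python's standard difflib module on a tiny hand-written list. This is only an illustration of the concept; it is an assumption-free stand-in and not how fuzzy_search is implemented. The helper name closest_names and the candidate list are hypothetical.

```python
from difflib import get_close_matches


def closest_names(query: str, candidates: list, n: int = 3) -> list:
    """Return up to n candidates that are similar to query, ignoring case."""
    # Map lowercase forms back to the original spelling.
    by_lower = {name.lower(): name for name in candidates}
    hits = get_close_matches(query.lower(), list(by_lower), n=n, cutoff=0.6)
    return [by_lower[hit] for hit in hits]


candidates = ["Isabel", "Isaac", "Ivan", "Irene"]
print(closest_names("isablel", candidates))
# ['Isabel']
```

Comparing the query against every candidate is what makes this kind of search slow on a large list, which is consistent with the "slow to compute" remark in the Usage section.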

Full dataset

The full dataset is available for download as name_dataset.zip (3.3 GB).

  • The data contains 491,655,925 records from 106 countries.
  • The uncompressed version takes around 10 GB on disk.
  • Each country is in a separate CSV file.
  • A CSV file contains rows of this format: first_name,last_name,gender,country_code.
  • Each record is a real person.
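
As a sketch of how one of these per-country files could be processed, the example below counts first names with the standard csv module. It assumes the files have no header row and follow the column order given above; the rows used here are made up for illustration and do not come from the dataset, and the function name count_first_names is hypothetical.

```python
import csv
import io
from collections import Counter

# Column order as described above (assumes no header row in the file).
FIELDS = ["first_name", "last_name", "gender", "country_code"]


def count_first_names(csv_file) -> Counter:
    """Count non-empty first names in an open CSV file with the four columns above."""
    reader = csv.DictReader(csv_file, fieldnames=FIELDS)
    return Counter(row["first_name"] for row in reader if row["first_name"])


# Made-up rows for illustration only.
sample = io.StringIO("Ana,Lopez,F,ES\nJuan,Garcia,M,ES\nAna,Ruiz,F,ES\n")
print(count_first_names(sample).most_common(1))
# [('Ana', 2)]
```

For a real file, pass a handle opened with open(path, newline="", encoding="utf-8"), where path points to one of the extracted country files. Reading row by row keeps memory usage low even for the largest countries.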

License

  • This version was generated from the massive Facebook leak (533M accounts).
  • Lists of names are generally not copyrightable, but if you want to be completely sure, you should talk to a lawyer.

Countries

Afghanistan, Albania, Algeria, Angola, Argentina, Austria, Azerbaijan, Bahrain, Bangladesh, Belgium, Bolivia, Plurinational State of, Botswana, Brazil, Brunei Darussalam, Bulgaria, Burkina Faso, Burundi, Cambodia, Cameroon, Canada, Chile, China, Colombia, Costa Rica, Croatia, Cyprus, Czechia, Denmark, Djibouti, Ecuador, Egypt, El Salvador, Estonia, Ethiopia, Fiji, Finland, France, Georgia, Germany, Ghana, Greece, Guatemala, Haiti, Honduras, Hong Kong, Hungary, Iceland, India, Indonesia, Iran, Islamic Republic of, Iraq, Ireland, Israel, Italy, Jamaica, Japan, Jordan, Kazakhstan, Korea, Republic of, Kuwait, Lebanon, Libya, Lithuania, Luxembourg, Macao, Malaysia, Maldives, Malta, Mauritius, Mexico, Moldova, Republic of, Morocco, Namibia, Netherlands, Nigeria, Norway, Oman, Palestine, State of, Panama, Peru, Philippines, Poland, Portugal, Puerto Rico, Qatar, Russian Federation, Saudi Arabia, Serbia, Singapore, Slovenia, South Africa, Spain, Sudan, Sweden, Switzerland, Syrian Arab Republic, Taiwan, Province of China, Tunisia, Turkey, Turkmenistan, United Arab Emirates, United Kingdom, United States, Uruguay, Yemen.

NOTE: It is unfortunately not possible to support more countries because the missing ones were not included in the original dataset.

Citation

@misc{NameDataset2021,
  author = {Philippe Remy},
  title = {Name Dataset},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/philipperemy/name-dataset}},
}
