# Synthetic Data

Synthetic data is artificially generated data, and creating it is a technique used across several academic fields, including computer science, statistics, and artificial intelligence.

In computer science, synthetic data can be used to train machine learning models or to test software systems. In statistics, it can be used to generate datasets with chosen statistical properties or data that follows a particular model. In artificial intelligence, it can make the datasets used to train machine learning models more diverse and representative.

In each of these fields, the goal is the same: data that can be used to test or train algorithms and systems.
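
As a minimal sketch of the statistics use case, the snippet below draws values from a normal distribution with a chosen mean and standard deviation, then checks that the sample roughly matches those targets. It uses only Python's standard library; the target values, sample size, and seed are arbitrary choices for illustration.

```python
import random
import statistics

# Fixed seed so the same sample is drawn on every run
rng = random.Random(42)

# Target properties for the synthetic dataset (arbitrary illustration values)
target_mean, target_sd = 50.0, 10.0

# Draw 10,000 values from a normal distribution with those properties
sample = [rng.gauss(target_mean, target_sd) for _ in range(10_000)]

# The sample statistics should land close to the targets
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)
print(f"mean={sample_mean:.2f}, sd={sample_sd:.2f}")
```

The rest of this notebook takes a different route: instead of numeric samples, it generates realistic-looking records such as names and addresses.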

## Generate synthetic data with Python Faker

## Example 1: fake.name(), fake.address(), fake.email(), fake.phone_number()

In [1]:
#%pip install faker
from faker import Faker
import pandas as pd # for data manipulation

In [2]:
# Instantiate Faker() instance
fake = Faker()

# Create a dataset including names, addresses, emails, and phone numbers
data = []
for _ in range(10):
    data.append([fake.name(), fake.address(), fake.email(), fake.phone_number()])

# Convert to DataFrame
df = pd.DataFrame(data, columns=['Name', 'Address', 'Email', 'Phone'])
df.head()

| | Name | Address | Email | Phone |
|---|---|---|---|---|
| 0 | Christopher Watts | 2119 Wells Knolls\nAprilshire, NE 06431 | craigtroy@example.net | 523.103.2299x837 |
| 1 | Bradley Chavez | 73904 Carson Crossroad Suite 443\nLloydhaven, PR 88046 | randybarry@example.com | 001-450-444-3342x070 |
| 2 | Gerald Jones | 51575 Thompson Vista\nSnyderborough, IN 61648 | coffeybecky@example.org | 001-857-588-7856 |
| 3 | Leonard Lewis | 02997 Jacqueline Spurs Suite 004\nJoelfurt, GA 39331 | williamlewis@example.net | 220.476.2025x772 |
| 4 | James Ramirez | 70742 David Courts Apt. 964\nLake Omar, NM 87916 | maria57@example.net | 8391542101 |


Faker organizes its generators into providers, each of which supplies a set of methods that return random data. In the code above, I used methods from several of the standard providers to generate names, physical mailing addresses, email addresses, and phone numbers.

## Example 2: fake.profile()
Let's use another provider, profile, and see what data it generates.

In [3]:
# Create a list of fake profiles
profiles = []

for _ in range(10):
    profiles.append(fake.profile())

# Save as a DataFrame
df2 = pd.DataFrame(profiles, columns = profiles[0].keys())
df2.head()

| | job | company | ssn | residence | current_location | blood_group | website | username | name | sex | address | mail | birthdate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Museum/gallery exhibitions officer | Turner Group | 124-87-6734 | 62699 Mills Fall\nAnnaton, TN 84042 | (55.230484, -114.636975) | A+ | [https://peterson.info/] | melody63 | Amy Steele | F | USCGC Pierce\nFPO AE 57068 | millerkaitlin@gmail.com | 2010-10-22 |
| 1 | Water quality scientist | Carlson PLC | 648-81-4984 | 3429 Sean Radial Suite 998\nDerrickland, MD 72327 | (-7.1733015, 113.979221) | A- | [http://www.giles-johnson.biz/, http://www.reed.com/] | rosesmith | Ian Wallace | M | 19179 Peterson Ways\nWilliamberg, OK 49780 | smahoney@hotmail.com | 1913-05-30 |
| 2 | Analytical chemist | Forbes, Hughes and Hernandez | 445-95-9416 | 55576 Koch Ports Suite 458\nGarzachester, IA 71277 | (-29.9591525, -44.279075) | B+ | [https://smith-mack.com/, https://www.mason-black.org/] | karenmorales | Francisco Johnson | M | 477 Denise Wall\nNew Joan, UT 06723 | jason04@yahoo.com | 1918-03-26 |
| 3 | Licensed conveyancer | Freeman Inc | 899-24-6488 | USNS Taylor\nFPO AP 76388 | (33.826478, -47.390440) | AB- | [https://www.reyes.com/, http://owens.info/, http://hansen.com/] | harrisonjonathan | Ms. Nicole Hall | F | 8691 Vang Points Suite 144\nNew Robert, IL 15041 | blairjose@gmail.com | 1976-02-18 |
| 4 | Accounting technician | Craig-Lopez | 229-75-2306 | 0982 William Lake Apt. 431\nSouth Gregory, NJ 95151 | (83.0906295, 104.992377) | A+ | [https://www.foster.com/, http://taylor-fox.info/, http://carter.com/] | emily21 | Amy Williamson | F | 66492 Collins Track\nWest Racheltown, TN 39226 | qkelly@hotmail.com | 1917-11-28 |


As you can see from the output, there's a lot of information. Let's take a look at an individual profile:

In [4]:
fake.profile()

{'job': 'Sound technician, broadcasting/film/video',
 'company': 'Meyers, Smith and Stein',
 'ssn': '285-56-4494',
 'residence': '7669 Edwards Land Apt. 651\nSouth Samuelchester, NE 32379',
 'current_location': (Decimal('27.583831'), Decimal('176.573252')),
 'blood_group': 'O-',
 'website': ['http://moore.com/',
  'http://www.roberson.com/',
  'http://www.murphy.com/'],
 'username': 'mcardenas',
 'name': 'Kathryn Wolfe',
 'sex': 'F',
 'address': '4114 Lawrence Grove\nCatherinetown, FM 15398',
 'mail': 'jberry@gmail.com',
 'birthdate': datetime.date(2012, 3, 23)}

## Example 3: customize fake.profile(fields = [])
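If you only need some of the profile attributes, pass their names to the `fields` argument. Names that `profile()` does not recognize are silently dropped: `"occupation"` in the cell below is not a profile key (the job title is stored under `job`), which is why the output has only four columns.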

In [5]:
# Create fake profiles using specific columns
profiles2 = []

for _ in range(10):
    profiles2.append(fake.profile(fields = ["name", "sex", "occupation", "blood_group", "birthdate"]))

df3 = pd.DataFrame(profiles2, columns = profiles2[0].keys())
df3.head()

| | blood_group | name | sex | birthdate |
|---|---|---|---|---|
| 0 | AB- | Shirley Wolfe | F | 1911-09-13 |
| 1 | O+ | Vickie Perkins | F | 2013-07-22 |
| 2 | AB- | Danielle Nelson | F | 1961-07-11 |
| 3 | A- | Christina Sanders | F | 1954-02-09 |
| 4 | B- | Patrick Lee | M | 2010-11-04 |


## DynamicProvider: customizable provider

In [6]:
from faker.providers import DynamicProvider

In [7]:
df_museums = pd.read_csv('../Data/museums.csv')

In [8]:
# Get unique list of museum names from existing dataset
museum_list = set(df_museums["Museum Name"])

# Create museum_provider
museum_provider = DynamicProvider(
     provider_name = "museum_provider",
     elements = museum_list,
)

# Instantiate new Faker() instance
fake_more = Faker()

# Add new provider
fake_more.add_provider(museum_provider)

# Use new provider
fake_more.museum_provider()

'HORICAN HISTORICAL SOCIETY'

In this toy example, I took an existing [dataset on museums](https://www.kaggle.com/datasets/imls/museum-directory?resource=download), extracted just the names, and in two statements created a new provider that returns a random museum name from the data I supplied. The same approach works with any other existing dataset that you have.

## Python Faker providers: standard vs. community
To see which providers are available, inspect the `providers` attribute of a `Faker()` instance (here, `fake`). The methods of every provider in this list can be called directly on the instance, as we did above, without any additional import statements.

In [9]:
# Get full list of built-in providers
fake.providers

[<faker.providers.user_agent.Provider at 0x172ff2b50>,
 <faker.providers.ssn.en_US.Provider at 0x172ff2ac0>,
 <faker.providers.python.Provider at 0x172ff2b80>,
 <faker.providers.profile.Provider at 0x172ff2a60>,
 <faker.providers.phone_number.en_US.Provider at 0x172ff2940>,
 <faker.providers.person.en_US.Provider at 0x172ff27c0>,
 <faker.providers.misc.en_US.Provider at 0x172ff26d0>,
 <faker.providers.lorem.en_US.Provider at 0x172ff2520>,
 <faker.providers.job.en_US.Provider at 0x172ff2370>,
 <faker.providers.isbn.Provider at 0x172ff2340>,
 <faker.providers.internet.en_US.Provider at 0x172ff22e0>,
 <faker.providers.geo.en_US.Provider at 0x172ff2190>,
 <faker.providers.file.Provider at 0x172cb7190>,
 <faker.providers.emoji.Provider at 0x172cb7e20>,
 <faker.providers.date_time.en_US.Provider at 0x172cb7b80>,
 <faker.providers.currency.en_US.Provider at 0x172cb70a0>,
 <faker.providers.credit_card.en_US.Provider at 0x172cb7e80>,
 <faker.providers.company.en_US.Provider at 0x172fcaaf0>,
 <f

Beyond the basic providers, there are also community-developed providers, such as:

- faker_airtravel: airport and flight information
- faker_music: music genres, subgenres, and instrument information
- faker_vehicle: year, make, model, and other vehicle information

But you will have to install and import community providers separately:

In [10]:
#%pip install faker_airtravel

In [11]:
from faker import Faker
from faker_airtravel import AirTravelProvider
fake.add_provider(AirTravelProvider)

See the Python Faker [GitHub repository](https://github.com/joke2k/faker) and [documentation](https://faker.readthedocs.io/en/master/) for more.

In [12]:
#%pip install faker_vehicle
#%pip install faker_music

In [13]:
import faker_music
from faker_music import MusicProvider
fake.add_provider(MusicProvider)

In [14]:
import faker_vehicle
from faker_vehicle import VehicleProvider
fake.add_provider(VehicleProvider)

In [15]:
fake.providers

[<faker_vehicle.VehicleProvider at 0x172fd29d0>,
 <faker_music.music.MusicProvider at 0x172fd25e0>,
 <faker_airtravel.airports.AirTravelProvider at 0x172fd2a30>,
 <faker.providers.user_agent.Provider at 0x172ff2b50>,
 <faker.providers.ssn.en_US.Provider at 0x172ff2ac0>,
 <faker.providers.python.Provider at 0x172ff2b80>,
 <faker.providers.profile.Provider at 0x172ff2a60>,
 <faker.providers.phone_number.en_US.Provider at 0x172ff2940>,
 <faker.providers.person.en_US.Provider at 0x172ff27c0>,
 <faker.providers.misc.en_US.Provider at 0x172ff26d0>,
 <faker.providers.lorem.en_US.Provider at 0x172ff2520>,
 <faker.providers.job.en_US.Provider at 0x172ff2370>,
 <faker.providers.isbn.Provider at 0x172ff2340>,
 <faker.providers.internet.en_US.Provider at 0x172ff22e0>,
 <faker.providers.geo.en_US.Provider at 0x172ff2190>,
 <faker.providers.file.Provider at 0x172cb7190>,
 <faker.providers.emoji.Provider at 0x172cb7e20>,
 <faker.providers.date_time.en_US.Provider at 0x172cb7b80>,
 <faker.providers.cu

In [19]:
fake.address()

'02630 Craig Point\nDavidhaven, MS 85021'