# Step-by-Step Data Generation


# 1. Install and Import Faker


In [1]:
!pip install Faker

Collecting Faker
  Downloading faker-37.4.2-py3-none-any.whl.metadata (15 kB)
Downloading faker-37.4.2-py3-none-any.whl (1.9 MB)
Installing collected packages: Faker
Successfully installed Faker-37.4.2


In [2]:
from faker import Faker
import pandas as pd
import random

# 2. Create Your Very Own “Faker”
The first step is to create an instance of the Faker class, which can generate many different types of “fake” data. I personally prefer the term synthetic over fake, so I will use it from here on. We also set a fixed seed for Faker's random number generator, a cornerstone of synthetic data generation: a fixed seed makes the code reproducible and easier to debug. Note that Faker.seed() only controls Faker's own generator; calls to Python's random module, which we use later, are not affected unless random.seed() is set as well.

In [3]:
fake = Faker()
Faker.seed(42)

# 3. Write a Data-Generating Function
Next comes the most critical part of the code: the function that will generate synthetic, real-world-like instances of data. Concretely, we will generate bank customer records containing basic personal and socio-demographic attributes.

In [4]:
def generate_user_for_learning():
    """This function generates real-world-like bank customer data."""
    return {
        "id": fake.uuid4(),
        "name": fake.name(),
        "email": fake.email() if random.random() > 0.1 else None,
        "phone": fake.phone_number(),
        "birthdate": fake.date_of_birth(minimum_age=16, maximum_age=85),
        "country": random.choice(["US", "UK", "India", "Germany", None]),
        "income": round(fake.pyfloat(left_digits=5, right_digits=2, positive=True), 2)
                  if random.random() > 0.05 else -1000.00
    }

That’s probably a lot to digest, so let’s analyze the code further, line by line:

* The function generates and returns a Python dictionary representing a bank customer: dictionary keys hold the attribute names, and dictionary values hold the corresponding values.
* The "id" attribute contains a universally unique identifier (UUID) generated with the uuid4() function.
* The "name" attribute contains a randomly generated customer name, produced by the name() function.
* Similarly, the "email" attribute contains a randomly generated email address, but email() is skipped with a probability of about 10%, in which case the value is None. This simulates a dataset in which roughly 10% of records are missing this attribute, a simple way to mimic real-world data imperfections. Note that names and email addresses are generated independently, so they will not match each other; if you want them to be related, you would need to derive the email address from the customer name instead of generating it at random.
* The remaining attributes are also generated with dedicated Faker methods, which cover many data types and allow some customization, such as the age range passed to date_of_birth(). The choice() function from Python's random module picks a categorical value from a fixed list of options; here the list includes None, so about one in five records has a missing country.
* The "income" attribute is a positive floating-point value with up to five integer digits and two decimal places. In addition, there is a 5% chance that it is set to -1000.00, a sentinel indicating an invalid or missing value: again, a way to simulate real-world data imperfections or errors.
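
As noted in the list above, names and emails are generated independently. One possible way to relate them is to build the address deterministically from the name. This is only a sketch: `email_from_name` and the default `example.com` domain are assumptions made for illustration, not something Faker or the original notebook provides.

```python
import re
import unicodedata

def email_from_name(name, domain="example.com"):
    """Build a simple address such as first.last@domain from a full name.

    Hypothetical helper: accents are stripped and any character that is
    not a letter or digit is treated as a separator.
    """
    ascii_name = (
        unicodedata.normalize("NFKD", name)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    parts = re.findall(r"[a-z0-9]+", ascii_name.lower())
    local_part = ".".join(parts) if parts else "user"
    return f"{local_part}@{domain}"

print(email_from_name("Daniel Doyle"))  # daniel.doyle@example.com
```

Inside the generator, one could call fake.name() once, store the result, and pass it to this helper in place of fake.email(). Two customers with the same name would then share an address, so a numeric suffix may be needed if uniqueness matters.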

With a single line of code, we can now call this function repeatedly to create any number of customer instances and store them in a Pandas DataFrame.

# 4. Call the Function to Create Data
Let’s do so for 100 such customers:

In [5]:
users_df = pd.DataFrame([generate_user_for_learning() for _ in range(100)])
users_df.head()

| | id | name | email | phone | birthdate | country | income |
|---|---|---|---|---|---|---|---|
| 0 | bdd640fb-0667-4ad1-9c80-317fa3b1799d | Daniel Doyle | garzaanthony@example.org | 538.990.8386 | 1980-10-21 | | 851.97 |
| 1 | 6c307511-b2b9-437a-a8df-6ec4ce4a2bbd | Christopher Bernard | curtis61@example.com | (794)507-8161x849 | 1988-11-16 | UK | 6006.84 |
| 2 | fc377a4c-4a15-444d-85e7-ce8a3a578a8e | David Garcia | shawn52@example.com | (534)719-2832x764 | 1998-07-23 | India | 7331.29 |
| 3 | 50c187fc-ce17-4b4e-8837-b8a3d261a7ab | Austin Gentry | jason76@example.net | 724.523.8849x696 | 2009-05-15 | | 18131.65 |
| 4 | 0c0fd195-c17a-408a-9745-d6d87e570ddf | Brittany Moore | ycarlson@example.com | +1-878-448-0184x514 | 1971-04-26 | US | 94646.92 |
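
The imperfections injected by the generator (missing emails and countries, and the -1000.00 income sentinel) are the kind of thing a cleaning step would need to detect. The sketch below shows one way to do that with pandas. It uses a small hand-written frame with illustrative values rather than the generated data, so the numbers say nothing about the actual `users_df`.

```python
import pandas as pd

# Illustrative values only; this is not output from the generator above.
sample = pd.DataFrame({
    "email": ["a@example.com", None, "c@example.com", "d@example.com"],
    "country": ["US", None, "UK", "India"],
    "income": [851.97, -1000.00, 6006.84, 18131.65],
})

# Turn the -1000.00 sentinel into a proper missing value (NaN)
cleaned = sample.assign(
    income=sample["income"].mask(sample["income"] == -1000.00)
)

# Share of missing values per column
missing_share = cleaned.isna().mean()
print(missing_share)
```

Applied to the generated DataFrame, the same two steps should report shares near 10% for email, 20% for country, and 5% for income, with some variation at only 100 rows.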


# Use Case: ETL Pipeline Testing
Consider another scenario: testing an ETL pipeline that ingests bank transaction data. The following code generates simplified customer instances with fewer attributes than in the previous example, plus a new dataset of bank transactions in which each customer has between one and five transactions. Roughly 2% of transactions are given the placeholder ID "DUPLICATE_ID", so the pipeline's handling of repeated identifiers can be exercised.

In [6]:
def generate_user_for_testing():
    return {
        "id": fake.uuid4(),
        "name": fake.name(),
    }

def generate_transaction_for_user(user_id):
    return {
        "transaction_id": fake.uuid4() if random.random() > 0.02 else "DUPLICATE_ID",
        "user_id": user_id,
        "amount": round(random.uniform(-50, 5000), 2),
        "currency": random.choice(["USD", "EUR", "GBP", "BTC"]),
        "timestamp": fake.date_time_this_year().isoformat()
    }

users_test = [generate_user_for_testing() for _ in range(50)]
# Each user gets between 1 and 5 transactions
transactions_test = [
    generate_transaction_for_user(user["id"])
    for user in users_test
    for _ in range(random.randint(1, 5))
]

df_users = pd.DataFrame(users_test)
df_transactions = pd.DataFrame(transactions_test)

print("Sample Transactions:")
df_transactions.head()

Sample Transactions:


| | transaction_id | user_id | amount | currency | timestamp |
|---|---|---|---|---|---|
| 0 | 71a0449d-d703-462f-b5cb-e3f56a66c184 | 24e9bcfd-9647-4f9f-b508-507af622d842 | 1191.27 | EUR | 2025-04-10T13:02:04.482025 |
| 1 | cbbed980-07e6-4cb0-90bd-e9001529e6f5 | 24e9bcfd-9647-4f9f-b508-507af622d842 | 4250.13 | BTC | 2025-06-05T00:28:02.457401 |
| 2 | 398f569b-465c-42cb-885e-b4894ac66780 | 24e9bcfd-9647-4f9f-b508-507af622d842 | 1942.86 | EUR | 2025-04-20T12:07:46.012454 |
| 3 | 90ce77e9-7513-43b9-87de-96ea2b0fbe44 | 24e9bcfd-9647-4f9f-b508-507af622d842 | 3970.06 | EUR | 2025-06-01T05:50:29.469544 |
| 4 | 82c42e48-8cba-4ae5-becc-daf4c4134dc2 | 24e9bcfd-9647-4f9f-b508-507af622d842 | 406.79 | GBP | 2025-01-24T01:26:17.550734 |
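
The generated transactions contain cases an ETL pipeline might be expected to catch: repeated transaction IDs, negative amounts, and a currency code (BTC) that a fiat-only pipeline may not accept. The following is a rough sketch of such checks. The `validate_transactions` helper, the list of supported currencies, and the hand-written sample rows are all assumptions made for illustration; the original notebook does not define which values count as invalid.

```python
import pandas as pd

# Assumption: this hypothetical pipeline accepts only these currencies
SUPPORTED_CURRENCIES = ["USD", "EUR", "GBP"]

def validate_transactions(df):
    """Count rows that violate a few simple data-quality rules."""
    return {
        # every row whose transaction_id appears more than once
        "duplicate_ids": int(df["transaction_id"].duplicated(keep=False).sum()),
        "negative_amounts": int((df["amount"] < 0).sum()),
        "unsupported_currency": int(
            (~df["currency"].isin(SUPPORTED_CURRENCIES)).sum()
        ),
    }

# Illustrative rows only; not taken from the generated data
sample = pd.DataFrame({
    "transaction_id": ["t1", "DUPLICATE_ID", "t3", "DUPLICATE_ID"],
    "amount": [1191.27, -12.50, 4250.13, 406.79],
    "currency": ["EUR", "USD", "BTC", "GBP"],
})

report = validate_transactions(sample)
print(report)
```

The same function could be called on `df_transactions`. Depending on the pipeline, a referential check that every `user_id` also appears in `df_users["id"]` may be worth adding.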
