                            
                            Part 1: Generating Dataset                                                        

---

        Installing Required Modules...                                                                                                                      

Let's install the required modules inside the notebook kernel:

1. `Pandas`; 2. `NumPy`; 3. `Faker`; 4. `Matplotlib` and 5. `Seaborn`

In [2]:
#!pip install numpy pandas faker matplotlib seaborn
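
Before moving on, it can help to confirm that the kernel actually sees these packages. The following is a minimal sketch and not part of the original notebook: the helper name `missing_modules` is hypothetical, and it relies on the standard library's `importlib.util.find_spec` to look each package up without importing it.

```python
import importlib.util

# Import names of the packages listed above
REQUIRED = ["numpy", "pandas", "faker", "matplotlib", "seaborn"]

def missing_modules(names):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_modules(REQUIRED)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

If anything is reported as missing, uncommenting the `pip install` line above and re-running the cell should resolve it.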

---

        Step-1: Creating the Dataset...                                                                                                                     

Since we have no predefined dataset, I'll create a dummy dataset as the first step. So, let's import the required modules (`faker`, `pandas`, `numpy`, `random` and `datetime`) to create the dataset.

In [3]:
from faker import Faker
import pandas as pd
import numpy as np
import random
from datetime import datetime

---

        Step-2: Setting Random Seeds for Reproducibility...                                                                                                 

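We set a fixed `seed` for `Faker`, `NumPy` and `random` to make our random data `reproducible`. This means we'll get the same fake names, prices, and values every time we run the code, which is super helpful for consistency, debugging, and sharing work with others. One caveat: anything computed relative to the current date (such as dates generated relative to `today`) can still shift from run to run.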

In [4]:
Faker.seed(42)
np.random.seed(42)
random.seed(42)
# 42 is just a fun, common choice — feel free to use any number!
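
To see what seeding buys us, here is a small sketch using only the standard `random` module. The helper `draw` is a hypothetical name introduced for illustration; `np.random.seed` and `Faker.seed` follow the same idea for their own generators.

```python
import random

def draw(seed, n=5):
    """Re-seed the generator, then draw n quantities between 1 and 24."""
    random.seed(seed)
    return [random.randint(1, 24) for _ in range(n)]

first = draw(42)
second = draw(42)   # same seed, so the same sequence is expected
other = draw(7)     # a different seed starts a different sequence

print("seed 42:      ", first)
print("seed 42 again:", second)
print("seed 7:       ", other)
print("identical for the same seed:", first == second)
```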

---

        Step-3: Initializing Faker and Preparing Options...                                                                                                 

We'll initialize the Faker object as `fake`, link some dummy product names to their categories and prices, and provide some realistic payment options and cities.

In [5]:
fake = Faker('bn_BD')  # Initializing Faker object with Bangla locale

# Product names used in the dataset
products = ["Laptop", "Smartphone", "T-shirt", "Jeans", "Detergent", 
            "Toothpaste", "Electric Kettle", "Rice Cooker", "Notebook", "Pen"]

categories = {
    "Laptop": ["Electronics", np.random.randint(50000, 150000)],
    "Smartphone": ["Electronics", np.random.randint(10000, 100000)],
    "T-shirt": ["Clothing", np.random.randint(500, 3000)],
    "Jeans": ["Clothing", np.random.randint(1000, 5000)],
    "Detergent": ["Household", random.choice([50, 80, 150, 700])],
    "Toothpaste": ["Household", random.choice([30, 50, 100, 200])],
    "Electric Kettle": ["Appliances", np.random.randint(1000, 5000)],
    "Rice Cooker": ["Appliances", np.random.randint(2000, 10000)],
    "Notebook": ["Stationery", np.random.randint(50, 500)],
    "Pen": ["Stationery", np.random.randint(10, 100)],
}   # maps each product to [category, unit price in BDT]; each price is drawn once here, so it stays fixed for every entry

payment_methods = ["Credit Card", "Debit Card", "Mobile Payment", "Cash on Delivery"]
cities = ["Dhaka", "Chittagong", "Khulna", "Rajshahi", "Sylhet", "Mymensing", "Rangpur", "Barishal"]

# print(categories)  # Displaying the categories dictionary

Here are the lists in a human-readable form:

Products We'll Be Using
To make our fake sales dataset feel realistic, we're including a variety of common items people actually buy. Here's our selection of 10 products:

    Laptop, Smartphone              --> Electronics

    T-shirt, Jeans                  --> Clothing

    Detergent, Toothpaste           --> Household

    Rice Cooker, Electric Kettle    --> Appliances

    Notebook, Pen                   --> Stationery

Payment methods available: `Credit Card, Debit Card, Mobile Payment and Cash on Delivery`.

Cities covered: all eight `divisional cities` in Bangladesh.
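
The product-to-category listing above can also be derived in code instead of being typed by hand. This is only a sketch: `product_category` is a simplified stand-in for the notebook's `categories` dictionary (prices are left out), and `by_category` is a hypothetical name.

```python
from collections import defaultdict

# Simplified stand-in for `categories`: product -> category, without prices
product_category = {
    "Laptop": "Electronics",
    "Smartphone": "Electronics",
    "T-shirt": "Clothing",
    "Jeans": "Clothing",
    "Detergent": "Household",
    "Toothpaste": "Household",
    "Electric Kettle": "Appliances",
    "Rice Cooker": "Appliances",
    "Notebook": "Stationery",
    "Pen": "Stationery",
}

# Invert the mapping: category -> list of products, keeping insertion order
by_category = defaultdict(list)
for product, category in product_category.items():
    by_category[category].append(product)

for category, items in by_category.items():
    print(f"{', '.join(items):<32}--> {category}")
```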

---

        Step-4: Creating 1000 Entries...

Now we'll create 1000 entries of random customer data using a `for` loop and store them in a list named `data`.

In [6]:
data = [] # Initializing variable to hold the generated data

for _ in range(1000): # Generating 1000 records
    product = random.choice(products)
    category, price = categories[product]

    entry = {
        'Invoice ID': fake.uuid4(),
        'Date': fake.date_between(start_date='-1y', end_date='today'),
        'Customer Name': fake.name(),
        'Customer Email': fake.email(),
        'Product' : product,
        'Category': category,
        'Quantity': random.randint(1, 24),
        'Price Per Unit (BDT)': price,
        'Payment Method': random.choice(payment_methods),
        'City': random.choice(cities),
    }
    
    # Adding the entry to the data list
    data.append(entry)
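
One consequence of this loop is worth spelling out: the unit price is looked up from the dictionary, so every entry for a given product carries the same price, and only the quantity varies. The sketch below illustrates that with a scaled-down stand-in. It assumes a hypothetical three-product `catalogue` (prices taken from the sample output further down) and swaps `fake.uuid4()` for the standard library's `uuid.uuid4()` so it runs without Faker.

```python
import random
import uuid

random.seed(42)  # same seed value as the notebook

# Stand-in catalogue: product -> (category, fixed unit price in BDT)
catalogue = {
    "Pen": ("Stationery", 92),
    "Jeans": ("Clothing", 2130),
    "Electric Kettle": ("Appliances", 2095),
}

data = []
for _ in range(100):  # 100 records instead of 1000, to keep the demo small
    product = random.choice(list(catalogue))
    category, price = catalogue[product]
    data.append({
        "Invoice ID": str(uuid.uuid4()),      # stdlib UUID instead of fake.uuid4()
        "Product": product,
        "Category": category,
        "Quantity": random.randint(1, 24),    # both endpoints are included
        "Price Per Unit (BDT)": price,
    })

# Collect the distinct prices observed for each product
prices_seen = {}
for entry in data:
    prices_seen.setdefault(entry["Product"], set()).add(entry["Price Per Unit (BDT)"])

for product, prices in prices_seen.items():
    print(product, sorted(prices))
```

If per-entry price variation were wanted, the price would have to be drawn inside the loop instead of when the dictionary is built.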

---

        Step-5: Creating `.csv` File with the Data of 1000 Entries...                                                                                       

Now we'll use the `pandas` module to turn the `data` list created earlier into a dataframe called `df`. We'll then compute a `Total Price` column by multiplying the `Quantity` and `Price Per Unit (BDT)` values of each entry. Next, we'll apply a few cleaning steps: dropping rows with missing values, dropping duplicate rows, and converting `Date` to a datetime type. A dataset generated with `Faker` shouldn't need much cleaning, but it's good practice for real-life datasets. Finally, we'll write the dataframe to a CSV file called `sales_data.csv` and save it for further use.

In [24]:
df = pd.DataFrame(data)  # Creating a DataFrame from the generated data
df['Total Price'] = df['Quantity'] * df['Price Per Unit (BDT)'] # Calculating total price

# Applying cleaning operations
df.dropna(inplace=True)  # Dropping any rows with missing values
df.drop_duplicates(inplace=True)  # Dropping duplicate rows
df['Date'] = pd.to_datetime(df['Date'])  # Converting 'Date' column to datetime format

df.to_csv('sales_data.csv', index=False)                        # Saving the DataFrame to a CSV file

# Displaying the first few rows of the DataFrame
print("First 20 rows of the DataFrame:")
df.head(20)

First 20 rows of the DataFrame:


,Invoice ID,Date,Customer Name,Customer Email,Product,Category,Quantity,Price Per Unit (BDT),Payment Method,City,Total Price
0,bdd640fb-0667-4ad1-9c80-317fa3b1799d,2024-09-27,আশীষ চন্দ্র,debaaphiphaa@example.com,Detergent,Household,8,50,Debit Card,Khulna,400
1,37f8a88b-17fc-495a-87a0-ca6e0822e8f3,2024-09-11,অদৃতা সরকার,shaarminkhaanm@example.org,Smartphone,Electronics,22,10860,Credit Card,Rangpur,238920
2,cf36d58b-4737-4190-96da-1dac72ff5d2a,2025-05-16,চঞ্চল মোড়ল,ekraamul23@example.com,Laptop,Electronics,1,65795,Credit Card,Rajshahi,65795
3,18c26797-6142-4a7d-97be-31111a2a73ed,2024-10-30,মুনতাকিম হক,tnmy78@example.org,Jeans,Clothing,17,2130,Credit Card,Rajshahi,36210
4,142c3fe8-60e7-4113-ac1b-8ca1f91e1d4c,2025-01-13,মোস্তাফিজ সিনহা,priymkumaar@example.net,Notebook,Stationery,14,70,Debit Card,Barishal,980
5,fc377a4c-4a15-444d-85e7-ce8a3a578a8e,2024-07-13,কাফি জাহান,daacaaryy@example.net,Pen,Stationery,9,92,Credit Card,Khulna,828
6,5ec42e08-29a3-42e9-9d65-a441d58842de,2024-10-28,মনোজ পাণ্ডে,raayaashaaltaa@example.net,Electric Kettle,Appliances,11,2095,Mobile Payment,Khulna,23045
7,6123fdf7-7656-4f72-a9d4-beef3eabedcb,2024-09-25,হৈমন্তী দাশগুপ্তা,nyn30@example.com,Jeans,Clothing,11,2130,Credit Card,Chittagong,23430
8,3602f8ac-10f1-4c81-848a-aa9e66b2bc5b,2025-06-02,প্রিয়াঙ্কা দে,psaahaa@example.com,Electric Kettle,Appliances,4,2095,Mobile Payment,Mymensing,8380
9,3f22faf8-23be-401d-83cf-2fde24933b83,2025-03-29,হৈমন্তী মৃধা,xsin@example.org,Pen,Stationery,9,92,Credit Card,Barishal,828
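
As a quick sanity check on the `Total Price` column, the arithmetic can be replayed by hand. The sketch below copies the quantity, unit price and total from the ten rows shown above and confirms that each total equals quantity times unit price; the variable names are illustrative only.

```python
# (quantity, price per unit, total price) copied from the ten rows shown above
rows = [
    (8, 50, 400),
    (22, 10860, 238920),
    (1, 65795, 65795),
    (17, 2130, 36210),
    (14, 70, 980),
    (9, 92, 828),
    (11, 2095, 23045),
    (11, 2130, 23430),
    (4, 2095, 8380),
    (9, 92, 828),
]

# Keep only the rows where quantity * price does not match the stored total
mismatches = [(qty, price, total) for qty, price, total in rows if qty * price != total]
print("rows checked:", len(rows))
print("mismatches:", mismatches)
```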


Here are some summary statistics for the dataset.

In [22]:
df.describe()  # Displaying summary statistics of the DataFrame

,Date,Quantity,Price Per Unit (BDT),Total Price
count,1000,1000.0,1000.0,1000.0
mean,2024-12-17 18:50:24,12.652,8943.027,105073.4
min,2024-06-13 00:00:00,1.0,30.0,30.0
25%,2024-09-12 00:00:00,7.0,70.0,920.0
50%,2024-12-20 00:00:00,13.0,2095.0,17316.0
75%,2025-03-24 00:00:00,19.0,5772.0,63492.0
max,2025-06-13 00:00:00,24.0,65795.0,1579080.0
std,,6.773535,19061.654886,254963.4
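
These figures can be cross-checked against what the generator should produce. `Quantity` comes from `random.randint(1, 24)`, which is uniform over the integers 1 to 24, so its theoretical mean and standard deviation are easy to compute. The sketch below does that with the standard `statistics` module; the observed values are copied from the table above, and note that pandas reports the sample standard deviation while `pstdev` is the population one.

```python
import math
import statistics

values = range(1, 25)  # random.randint(1, 24) includes both endpoints

expected_mean = statistics.mean(values)    # (1 + 24) / 2
expected_std = statistics.pstdev(values)   # sqrt((24**2 - 1) / 12)

# Observed values copied from the summary table above
observed_mean = 12.652
observed_std = 6.773535

print(f"mean: expected {expected_mean:.3f}, observed {observed_mean:.3f}")
print(f"std:  expected {expected_std:.3f}, observed {observed_std:.3f}")
```

The observed numbers sit close to the theoretical ones, which is what we would hope for with 1000 draws.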


Hurray! The dataset is ready. `sales_data.csv` should now appear in the same directory as this notebook. Check it out...
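
One way to check a saved file without loading it into pandas is to read it back with the standard `csv` module and look at the header and row count. The sketch below is an assumption-laden illustration: `inspect_csv` is a hypothetical helper, and it writes a small stand-in file (two rows from the sample output) to a temporary directory so that it runs on its own. Pointing `inspect_csv` at `sales_data.csv` would perform the same check on the real file.

```python
import csv
import tempfile
from pathlib import Path

EXPECTED_COLUMNS = [
    "Invoice ID", "Date", "Customer Name", "Customer Email", "Product",
    "Category", "Quantity", "Price Per Unit (BDT)", "Payment Method",
    "City", "Total Price",
]

def inspect_csv(path):
    """Return (header, row_count) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        row_count = sum(1 for _ in reader)
    return header, row_count

# Stand-in file with two rows from the sample output
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "sales_data_demo.csv"
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(EXPECTED_COLUMNS)
        writer.writerow(["bdd640fb-0667-4ad1-9c80-317fa3b1799d", "2024-09-27",
                         "আশীষ চন্দ্র", "debaaphiphaa@example.com", "Detergent",
                         "Household", 8, 50, "Debit Card", "Khulna", 400])
        writer.writerow(["fc377a4c-4a15-444d-85e7-ce8a3a578a8e", "2024-07-13",
                         "কাফি জাহান", "daacaaryy@example.net", "Pen",
                         "Stationery", 9, 92, "Credit Card", "Khulna", 828])
    header, row_count = inspect_csv(demo_path)

print("header matches:", header == EXPECTED_COLUMNS)
print("data rows:", row_count)
```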

---

Now let's move on to Part 2 to analyze the data.