# Personal Budget Data Preparation

## How to Generate Mock Data

Visit https://www.mockaroo.com to generate the data. Delete the fields that already exist, then add the following fields for each table with the settings listed below.

**Note**: Set the blank value of every field to 0%, except the `notes` field, where you can vary the blank value from about 60% to 80%, more or less.

### income_log

After generating the data as **CSV**, create a second table for user 2 and vary the content a bit so the two users differ.

|Field     | Type | Options               |
|:--------:|:----:|:---------------------|
|user_id|Number|min:**1**,max:**1**, switch both numbers to **2** when creating next user's table|
|date|Datetime|**1/1/2024** to **now**, format: **yyyy-mm-dd**|
|source|Custom List|Content: **Event Organization, Delivery Job, Other**, switch **random** to **weighted**, click the button after weighted and give each item the following weight: **Event Organization: 7**, **Delivery Job: 10**, **Other: 2**|
|amount|Number|min:**50**, max:**120**|
|notes|Sentences|At least **1** but no more than **1**, blank: **80%**|

After downloading the CSV files, name them `raw_income_log_01.csv` and `raw_income_log_02.csv`, where 01 is for user 1 and 02 is for user 2.

### savings_log

After generating the data as **CSV**, create a second table for user 2 and vary the content a bit so the two users differ.


|Field     | Type | Options               |
|:--------:|:----:|:---------------------|
|user_id|Number|min:**1**,max:**1**, switch both numbers to **2** when creating next user's table|
|date|Datetime|**1/1/2024** to **now**, format: **yyyy-mm-dd**|
|change|Number|min:**-100**, max:**120**|
|notes|Sentences|At least **1** but no more than **1**, blank: **80%**|


After downloading the CSV files, name them `raw_savings_log_01.csv` and `raw_savings_log_02.csv`, where 01 is for user 1 and 02 is for user 2.

**Note**: I had to generate the data multiple times to get it right. After generating the data, sum the `change` column and make sure the total is neither negative nor very high.

### users

Just write two users manually with `name`, `username`, and `password`.

### investments_log

For this table, set the **Number of Rows** to around **600**. After generating the data as **CSV**, create a second table for user 2 and vary the content a bit so the two users differ.


|Field     | Type | Options               |
|:--------:|:----:|:---------------------|
|user_id|Number|min:**1**,max:**1**, switch both numbers to **2** when creating next user's table|
|date|Datetime|**1/8/2024** (1 August 2024) to **now**, format: **yyyy-mm-dd**|
|change|Number|min:**-70**, max:**70**|
|notes|Sentences|At least **1** but no more than **1**, blank: **50%**|


After downloading the CSV files, name them `raw_investments_log_01.csv` and `raw_investments_log_02.csv`, where 01 is for user 1 and 02 is for user 2.

**Note**: I had to generate the data multiple times to get it right. After generating the data, sum the `change` column and make sure the total is neither negative nor very high.

### debts_log

For this table, we'll create **two tables per user**: one with the `type` column filled with "Credited" and the other filled with "Recieved". After generating the data as **CSV**, create another two tables for user 2 and vary the content a bit to make them different. Also, set the **number of rows per table** to around **50**.

|Field     | Type | Options               |
|:--------:|:----:|:---------------------|
|user_id|Number|min:**1**,max:**1**, switch both numbers to **2** when creating next user's table|
|date|Datetime|**1/1/2024** to **now**, format: **yyyy-mm-dd**|
|party|Custom List|Content: **Ahmad, Issa, Sarah, Mila**, keep this one random. After generating a table, change the names before generating the other tables|
|amount|Number|min:**-60**, max:**80**, click on the formula (last button) and add this code: `val = ((this + 2) / 5).floor * 5; if val == 0 then 5 else val end` to make the numbers multiples of 5 and never 0|
|notes|Sentences|At least **1** but no more than **1**, blank: **75%**|
|type|Custom List|Content: **Credited**, after generating this table, change the value to **Recieved** for the second table|

After downloading the CSV files, name them `raw_debts_log_01.csv` and `raw_debts_log_02.csv` for user 1, and `raw_debts_log_03.csv` and `raw_debts_log_04.csv` for user 2. Make sure tables 1 and 3 have the `type` column set to `Credited`, and tables 2 and 4 have the `type` column set to `Recieved`.

**Note**: I had to generate the data multiple times to get it right. After generating the data, sum the `amount` column per party and make sure no total is negative or very high.
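
To see what the `amount` formula in the table above does, here is a minimal Python sketch of the same arithmetic. It assumes the input is an integer and that the division floors the result, as Python's `//` does; the function name is hypothetical and not part of Mockaroo.

```python
def round_to_nonzero_multiple_of_5(value: int) -> int:
    """Python equivalent of the `amount` formula (illustrative helper)."""
    # Shift by 2, floor-divide by 5, then scale back up to a multiple of 5.
    rounded = ((value + 2) // 5) * 5
    # A result of 0 is replaced with 5 so that no row has a zero amount.
    return 5 if rounded == 0 else rounded


samples = [-60, -7, -2, 0, 2, 3, 13, 80]
results = [round_to_nonzero_multiple_of_5(v) for v in samples]
print(dict(zip(samples, results)))
```

Every result is a non-zero multiple of 5, and the configured minimum and maximum (-60 and 80) map to themselves.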

### subscriptions

I'll create this table manually with Python, since it will contain only around 3-6 rows and the data has to be specific and realistic.

## Data Preparation Plan

### income_log table

1. Import mock income_log data (two)
2. Combine dataframes to one dataframe
3. Sort resulting dataframe by date
4. View total income per month per user
5. If monthly totals are not realistic, change mock data
6. Remove ";" from values in `notes` column
7. Export data as sql file with prefix `final_`

### savings_log table

1. Import mock savings_log data (two)
2. Sort each dataframe by date
3. Calculate sum of `change` column per dataframe
4. If sum is negative, change mock data
5. Add new cumulative column called `balance` for each dataframe using `cumsum`
6. Drop rows where `balance` is negative
7. Combine dataframes to one dataframe
8. Try: Move `balance` column to be after `change` column
9. Sort resulting dataframe by date
10. Remove ";" from values in `notes` column
11. Export dataframe to sql file with prefix `final_`

### users table

Write sql file containing two users each with `name`, `username`, and `password`

### investments_log

1. Import mock investments_log data (two)
2. Sort each dataframe by date
3. Calculate sum of `change` column per dataframe
4. If sum is negative, change mock data
5. Add new cumulative column called `balance` for each dataframe using `cumsum`
6. Drop rows where `balance` is negative
7. Combine dataframes to one dataframe
8. Try: Move `balance` column to be after `change` column
9. Sort resulting dataframe by date
10. Remove ";" from values in `notes` column
11. Export dataframe to sql file with prefix `final_`

### debts_log

1. Import mock debts_log data (four)
2. Calculate the sum of amount per party. Make sure it's positive and realistic
3. Play with data to make it more realistic
4. Combine all four tables to one dataframe
5. Sort resulting dataframe by date
6. Remove ";" from values in `notes` column
7. Export dataframe to sql file with prefix `final_`

### subscriptions

Create a dataframe with 5 columns: `user_id`, `subscription`, `amount`, `expected_day`, and `notes`. Add the values manually.
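
A minimal sketch of that dataframe is shown here. Only the five column names come from the plan above; every value is a placeholder for illustration, and treating `expected_day` as a day of the month is an assumption.

```python
import pandas as pd

# Placeholder rows for illustration only; replace them with real subscriptions.
subscriptions = pd.DataFrame(
    {
        "user_id": [1, 1, 2],
        "subscription": ["Example Streaming", "Example Gym", "Example Storage"],
        "amount": [10, 25, 3],
        # Assumption: expected_day is the day of the month the charge is due.
        "expected_day": [1, 15, 28],
        "notes": [None, "Hypothetical note", None],
    }
)
print(subscriptions)
```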

## Implementation

In [268]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### income_log table

#### 1. Import mock income_log data (two)

In [269]:
income_log_1 = pd.read_csv("raw_income_log_01.csv")
income_log_2 = pd.read_csv("raw_income_log_02.csv")

In [270]:
income_log_1

Unnamed: 0,user_id,date,source,amount,notes
0,1,2025-04-25,Delivery Job,24,
1,1,2025-01-06,Event Organizing,91,
2,1,2024-04-29,Event Organizing,86,
3,1,2025-06-21,Delivery Job,25,
4,1,2024-02-03,Delivery Job,82,
...,...,...,...,...,...
995,1,2025-06-15,Delivery Job,51,
996,1,2025-05-10,Event Organizing,64,
997,1,2024-06-09,Delivery Job,96,
998,1,2024-12-03,Delivery Job,16,


In [271]:
income_log_2

Unnamed: 0,user_id,date,source,amount,notes
0,2,2024-05-14,University,105,
1,2,2025-03-22,Other,55,
2,2,2024-01-16,University,109,
3,2,2024-01-03,University,112,
4,2,2025-03-27,University,64,
...,...,...,...,...,...
995,2,2025-06-11,University,66,
996,2,2024-09-05,Shop,105,
997,2,2024-10-19,University,71,
998,2,2024-10-29,Shop,120,Integer tincidunt ante vel ipsum.


#### 2. Combine dataframes to one dataframe

In [272]:
income_log = pd.concat([income_log_1, income_log_2], ignore_index=True)

In [273]:
income_log

Unnamed: 0,user_id,date,source,amount,notes
0,1,2025-04-25,Delivery Job,24,
1,1,2025-01-06,Event Organizing,91,
2,1,2024-04-29,Event Organizing,86,
3,1,2025-06-21,Delivery Job,25,
4,1,2024-02-03,Delivery Job,82,
...,...,...,...,...,...
1995,2,2025-06-11,University,66,
1996,2,2024-09-05,Shop,105,
1997,2,2024-10-19,University,71,
1998,2,2024-10-29,Shop,120,Integer tincidunt ante vel ipsum.


#### 3. Sort resulting dataframe by date

In [274]:
# Convert date column to datetime
income_log["date"] = pd.to_datetime(income_log["date"])

In [275]:
# Sort
income_log = income_log.sort_values("date", ascending=True)

In [276]:
# Reset dataframe index
income_log = income_log.reset_index(drop=True)

In [277]:
income_log

Unnamed: 0,user_id,date,source,amount,notes
0,2,2024-01-01,Other,90,
1,2,2024-01-01,Shop,70,
2,1,2024-01-01,Event Organizing,96,
3,2,2024-01-01,University,89,
4,1,2024-01-02,Delivery Job,83,
...,...,...,...,...,...
1995,2,2025-08-23,University,59,Nam tristique tortor eu pede.
1996,2,2025-08-24,University,111,
1997,2,2025-08-25,University,74,
1998,2,2025-08-25,Other,76,


#### 4. View total income per month per user

In [278]:
# Split dataframes by user
income_log_1 = income_log[income_log["user_id"] == 1]
income_log_2 = income_log[income_log["user_id"] == 2]

In [279]:
# Group sum of amount by month
monthly_totals_1 = income_log_1.groupby(income_log_1["date"].dt.to_period("M"))["amount"].sum()
monthly_totals_2 = income_log_2.groupby(income_log_2["date"].dt.to_period("M"))["amount"].sum()

In [280]:
monthly_totals_1

date
2024-01    2197
2024-02    2370
2024-03    2573
2024-04    2872
2024-05    3134
2024-06    3126
2024-07    3023
2024-08    3471
2024-09    2711
2024-10    1942
2024-11    3423
2024-12    2722
2025-01    3471
2025-02    2801
2025-03    2192
2025-04    2967
2025-05    2864
2025-06    2512
2025-07    3510
2025-08    2327
Freq: M, Name: amount, dtype: int64

In [281]:
monthly_totals_2

date
2024-01    4292
2024-02    4996
2024-03    5988
2024-04    4258
2024-05    4094
2024-06    4275
2024-07    4282
2024-08    4484
2024-09    3897
2024-10    4474
2024-11    3285
2024-12    4027
2025-01    4312
2025-02    4774
2025-03    4255
2025-04    4044
2025-05    3798
2025-06    3577
2025-07    4362
2025-08    2801
Freq: M, Name: amount, dtype: int64
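
As an alternative to splitting the dataframe per user, the same monthly totals can be produced in one pass by grouping on both `user_id` and month. This is a sketch on a toy frame with made-up values, not the notebook's data.

```python
import pandas as pd

# Toy stand-in for the combined income_log dataframe (values are made up).
income_log = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 1],
        "date": pd.to_datetime(
            ["2024-01-05", "2024-01-20", "2024-01-07", "2024-02-01", "2024-02-10"]
        ),
        "amount": [50, 70, 100, 90, 60],
    }
)

# Group by user and calendar month at once, then pivot users into columns.
month = income_log["date"].dt.to_period("M").rename("month")
monthly_totals = (
    income_log.groupby(["user_id", month])["amount"].sum().unstack("user_id")
)
print(monthly_totals)
```

Each row is a month and each column a user, which makes the two users easy to compare side by side.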

#### 5. If monthly totals are not realistic, change mock data

Data looks realistic enough.

#### 6. Remove ";" from values in `notes` column

The main app's SchemaLoader splits SQL statements on `;`, so a `;` inside a `notes` value would cut the statement in two and cause an error.

In [282]:
income_log.loc[income_log["notes"].notna(), "notes"] = (
    income_log.loc[income_log["notes"].notna(), "notes"]
    .str.replace(";", "", regex=False)
)
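
The sketch below shows why a `;` inside a note is a problem. It assumes a loader that simply calls `str.split(";")` on the script; the real SchemaLoader is not shown in this notebook, so this is only a hypothetical stand-in.

```python
# Hypothetical stand-in for a loader that splits a SQL script on ";".
script = (
    "INSERT INTO income_log (notes) VALUES ('Paid; thanks');\n"
    "INSERT INTO income_log (notes) VALUES ('Bonus');\n"
)

statements = [part.strip() for part in script.split(";") if part.strip()]
for statement in statements:
    print(repr(statement))

# The first INSERT is cut in half at the ";" inside its string literal,
# so the two statements in the script become three fragments.
```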

#### 7. Export data as sql file with prefix `final_`

In [283]:
# Convert income_log date column back to string
income_log["date"] = income_log["date"].dt.strftime("%Y-%m-%d")

In [284]:
# Write to sql file
with open("final_income_log.sql", "w", encoding="UTF-8") as file:
    for _, row in income_log.iterrows():
        notes = 'NULL' if str(row['notes']) == 'nan' else f"'{row['notes']}'"
        sql = f"INSERT INTO income_log (user_id, date, source, amount, notes) VALUES ({row['user_id']}, '{row['date']}', '{row['source']}', {row['amount']}, {notes});\n"
        file.write(sql)
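
The export above wraps each note in single quotes as-is, which works for the generated placeholder text. If a note ever contained an apostrophe, the statement would break. A small helper along these lines could guard against that; the name `sql_text` is hypothetical, and doubling the quote is the standard SQL way to escape it.

```python
import pandas as pd


def sql_text(value) -> str:
    """Render a value as a SQL string literal, or NULL when it is missing."""
    if pd.isna(value):
        return "NULL"
    # Standard SQL escapes a single quote inside a literal by doubling it.
    escaped = str(value).replace("'", "''")
    return f"'{escaped}'"


print(sql_text(float("nan")))
print(sql_text("Landlord's rent"))
```

The helper could replace the inline `'NULL' if ... else ...` expression, for example `notes = sql_text(row['notes'])`.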

### savings_log table

#### 1. Import mock savings_log data (two)

In [285]:
savings_log_1 = pd.read_csv("raw_savings_log_01.csv")
savings_log_2 = pd.read_csv("raw_savings_log_02.csv")

In [286]:
savings_log_1

Unnamed: 0,user_id,date,change,notes
0,1,2025-05-25,-57,Aliquam sit amet diam in magna bibendum imperd...
1,1,2025-01-30,47,Phasellus sit amet erat.
2,1,2025-07-09,-38,
3,1,2024-12-23,28,
4,1,2025-06-17,91,
...,...,...,...,...
995,1,2024-06-03,-31,Curabitur convallis.
996,1,2025-06-11,78,Morbi non quam nec dui luctus rutrum.
997,1,2024-10-22,54,
998,1,2025-07-14,83,


In [287]:
savings_log_2

Unnamed: 0,user_id,date,change,notes
0,2,2025-02-24,-100,
1,2,2025-02-14,-82,Proin interdum mauris non ligula pellentesque ...
2,2,2024-07-25,-17,"Lorem ipsum dolor sit amet, consectetuer adipi..."
3,2,2024-04-10,72,Curabitur in libero ut massa volutpat convallis.
4,2,2024-11-01,-43,Vestibulum ac est lacinia nisi venenatis trist...
...,...,...,...,...
995,2,2024-01-01,-51,
996,2,2025-02-26,31,Sed accumsan felis.
997,2,2024-06-14,102,
998,2,2024-04-10,-34,


#### 2. Sort each dataframe by date

In [288]:
# Convert date column to datetime for both
savings_log_1["date"] = pd.to_datetime(savings_log_1["date"])
savings_log_2["date"] = pd.to_datetime(savings_log_2["date"])

In [289]:
# Sort both by date
savings_log_1 = savings_log_1.sort_values("date")
savings_log_2 = savings_log_2.sort_values("date")

In [290]:
# Reset both indexes
savings_log_1 = savings_log_1.reset_index(drop=True)
savings_log_2 = savings_log_2.reset_index(drop=True)

In [291]:
savings_log_1

Unnamed: 0,user_id,date,change,notes
0,1,2024-01-01,15,Pellentesque viverra pede ac diam.
1,1,2024-01-01,-20,
2,1,2024-01-01,39,"Lorem ipsum dolor sit amet, consectetuer adipi..."
3,1,2024-01-03,-52,Nullam varius.
4,1,2024-01-03,20,
...,...,...,...,...
995,1,2025-08-21,-22,
996,1,2025-08-22,-72,
997,1,2025-08-23,91,Sed accumsan felis.
998,1,2025-08-24,8,"Maecenas leo odio, condimentum id, luctus nec,..."


In [292]:
savings_log_2

Unnamed: 0,user_id,date,change,notes
0,2,2024-01-01,25,Morbi porttitor lorem id ligula.
1,2,2024-01-01,-51,
2,2,2024-01-01,29,In sagittis dui vel nisl.
3,2,2024-01-03,-54,
4,2,2024-01-03,106,"Nulla neque libero, convallis eget, eleifend l..."
...,...,...,...,...
995,2,2025-08-23,-59,"Maecenas leo odio, condimentum id, luctus nec,..."
996,2,2025-08-23,21,
997,2,2025-08-23,55,
998,2,2025-08-23,-15,


#### 3. Calculate sum of `change` column per dataframe

In [293]:
savings_log_1["change"].sum()

np.int64(1616)

In [294]:
savings_log_2["change"].sum()

np.int64(1990)

#### 4. If sum is negative, change mock data

I've changed the mock data a few times until I got the numbers right.

#### 5. Add new cumulative column called `balance` for each dataframe using `cumsum`

In [295]:
savings_log_1["balance"] = savings_log_1["change"].cumsum()
savings_log_2["balance"] = savings_log_2["change"].cumsum()

In [296]:
savings_log_1

Unnamed: 0,user_id,date,change,notes,balance
0,1,2024-01-01,15,Pellentesque viverra pede ac diam.,15
1,1,2024-01-01,-20,,-5
2,1,2024-01-01,39,"Lorem ipsum dolor sit amet, consectetuer adipi...",34
3,1,2024-01-03,-52,Nullam varius.,-18
4,1,2024-01-03,20,,2
...,...,...,...,...,...
995,1,2025-08-21,-22,,1507
996,1,2025-08-22,-72,,1435
997,1,2025-08-23,91,Sed accumsan felis.,1526
998,1,2025-08-24,8,"Maecenas leo odio, condimentum id, luctus nec,...",1534


In [297]:
savings_log_2

Unnamed: 0,user_id,date,change,notes,balance
0,2,2024-01-01,25,Morbi porttitor lorem id ligula.,25
1,2,2024-01-01,-51,,-26
2,2,2024-01-01,29,In sagittis dui vel nisl.,3
3,2,2024-01-03,-54,,-51
4,2,2024-01-03,106,"Nulla neque libero, convallis eget, eleifend l...",55
...,...,...,...,...,...
995,2,2025-08-23,-59,"Maecenas leo odio, condimentum id, luctus nec,...",1966
996,2,2025-08-23,21,,1987
997,2,2025-08-23,55,,2042
998,2,2025-08-23,-15,,2027


#### 6. Drop rows where `balance` is negative

Checking how many rows have balance < 0

In [298]:
(savings_log_1["balance"] < 0).sum()

np.int64(238)

In [299]:
(savings_log_2["balance"] < 0).sum()

np.int64(4)

Dropping the rows

In [300]:
savings_log_1 = savings_log_1[savings_log_1["balance"] >= 0]
savings_log_2 = savings_log_2[savings_log_2["balance"] >= 0]

In [301]:
savings_log_1

Unnamed: 0,user_id,date,change,notes,balance
0,1,2024-01-01,15,Pellentesque viverra pede ac diam.,15
2,1,2024-01-01,39,"Lorem ipsum dolor sit amet, consectetuer adipi...",34
4,1,2024-01-03,20,,2
15,1,2024-01-10,95,Aenean sit amet justo.,3
136,1,2024-03-08,86,Pellentesque eget nunc.,15
...,...,...,...,...,...
995,1,2025-08-21,-22,,1507
996,1,2025-08-22,-72,,1435
997,1,2025-08-23,91,Sed accumsan felis.,1526
998,1,2025-08-24,8,"Maecenas leo odio, condimentum id, luctus nec,...",1534


In [302]:
savings_log_2

Unnamed: 0,user_id,date,change,notes,balance
0,2,2024-01-01,25,Morbi porttitor lorem id ligula.,25
2,2,2024-01-01,29,In sagittis dui vel nisl.,3
4,2,2024-01-03,106,"Nulla neque libero, convallis eget, eleifend l...",55
5,2,2024-01-04,52,Nullam molestie nibh in lectus.,107
6,2,2024-01-05,-54,Cras pellentesque volutpat dui.,53
...,...,...,...,...,...
995,2,2025-08-23,-59,"Maecenas leo odio, condimentum id, luctus nec,...",1966
996,2,2025-08-23,21,,1987
997,2,2025-08-23,55,,2042
998,2,2025-08-23,-15,,2027
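
Steps 5 and 6 can be reproduced on a small frame to see their combined effect. The `change` values mirror the first five rows of `savings_log_1` shown earlier; the `recomputed` column is added here only for comparison and is not part of the notebook.

```python
import pandas as pd

# First five `change` values of savings_log_1, as shown in the preview above.
log = pd.DataFrame({"change": [15, -20, 39, -52, 20]})

# Running balance, as in step 5.
log["balance"] = log["change"].cumsum()

# Keep only rows whose running balance is non-negative, as in step 6.
kept = log[log["balance"] >= 0].copy()

# Running total over the surviving rows only, for comparison.
kept["recomputed"] = kept["change"].cumsum()
print(kept)
```

The stored `balance` of the surviving rows still includes the dropped changes, so it no longer equals the running sum of the remaining `change` values. Whether that matters depends on how the app uses `balance`.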


#### 7. Combine dataframes to one dataframe

In [303]:
savings_log = pd.concat([savings_log_1, savings_log_2], ignore_index=True)

In [304]:
savings_log

Unnamed: 0,user_id,date,change,notes,balance
0,1,2024-01-01,15,Pellentesque viverra pede ac diam.,15
1,1,2024-01-01,39,"Lorem ipsum dolor sit amet, consectetuer adipi...",34
2,1,2024-01-03,20,,2
3,1,2024-01-10,95,Aenean sit amet justo.,3
4,1,2024-03-08,86,Pellentesque eget nunc.,15
...,...,...,...,...,...
1753,2,2025-08-23,-59,"Maecenas leo odio, condimentum id, luctus nec,...",1966
1754,2,2025-08-23,21,,1987
1755,2,2025-08-23,55,,2042
1756,2,2025-08-23,-15,,2027


#### 8. Try: Move `balance` column to be after `change` column

In [305]:
columns = list(savings_log.columns) # Create list of columns
columns.insert(columns.index("change")+1, columns.pop(columns.index("balance"))) # Move balance after change
savings_log = savings_log[columns] # Return dataframe with new column order

In [306]:
savings_log

Unnamed: 0,user_id,date,change,balance,notes
0,1,2024-01-01,15,15,Pellentesque viverra pede ac diam.
1,1,2024-01-01,39,34,"Lorem ipsum dolor sit amet, consectetuer adipi..."
2,1,2024-01-03,20,2,
3,1,2024-01-10,95,3,Aenean sit amet justo.
4,1,2024-03-08,86,15,Pellentesque eget nunc.
...,...,...,...,...,...
1753,2,2025-08-23,-59,1966,"Maecenas leo odio, condimentum id, luctus nec,..."
1754,2,2025-08-23,21,1987,
1755,2,2025-08-23,55,2042,
1756,2,2025-08-23,-15,2027,
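
The same reordering can also be done in place with `DataFrame.pop` and `DataFrame.insert`. This is a sketch on a one-row toy frame with the same column order as `savings_log` before the move.

```python
import pandas as pd

# Toy frame with the same column order as savings_log before the move.
df = pd.DataFrame(
    {
        "user_id": [1],
        "date": ["2024-01-01"],
        "change": [15],
        "notes": [None],
        "balance": [15],
    }
)

# Remove `balance`, then re-insert it right after `change`.
balance = df.pop("balance")
df.insert(df.columns.get_loc("change") + 1, "balance", balance)
print(list(df.columns))
```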


#### 9. Sort resulting dataframe by date

In [307]:
# Sort by date
savings_log = savings_log.sort_values("date")

In [308]:
# Reset indexes
savings_log = savings_log.reset_index(drop=True)

In [309]:
savings_log

Unnamed: 0,user_id,date,change,balance,notes
0,1,2024-01-01,15,15,Pellentesque viverra pede ac diam.
1,2,2024-01-01,29,3,In sagittis dui vel nisl.
2,2,2024-01-01,25,25,Morbi porttitor lorem id ligula.
3,1,2024-01-01,39,34,"Lorem ipsum dolor sit amet, consectetuer adipi..."
4,1,2024-01-03,20,2,
...,...,...,...,...,...
1753,1,2025-08-23,91,1526,Sed accumsan felis.
1754,2,2025-08-23,-87,2025,Maecenas rhoncus aliquam lacus.
1755,1,2025-08-24,8,1534,"Maecenas leo odio, condimentum id, luctus nec,..."
1756,1,2025-08-24,82,1616,Morbi vel lectus in quam fringilla rhoncus.


#### 10. Remove ";" from values in `notes` column

The main app's SchemaLoader splits SQL statements on `;`, so a `;` inside a `notes` value would cut the statement in two and cause an error.

In [310]:
savings_log.loc[savings_log["notes"].notna(), "notes"] = (
    savings_log.loc[savings_log["notes"].notna(), "notes"]
    .str.replace(";", "", regex=False)
)

#### 11. Export dataframe to sql file with prefix `final_`

In [311]:
# Convert date column back to string
savings_log["date"] = savings_log["date"].dt.strftime("%Y-%m-%d")

In [312]:
# Write to sql file
with open("final_savings_log.sql", "w", encoding="UTF-8") as file:
    for _, row in savings_log.iterrows():
        # Single quotes inside the f-strings keep this valid before Python 3.12
        notes = 'NULL' if str(row['notes']) == 'nan' else f"'{row['notes']}'"
        sql = f"INSERT INTO savings_log (user_id, date, change, balance, notes) VALUES ({row['user_id']}, '{row['date']}', {row['change']}, {row['balance']}, {notes});\n"
        file.write(sql)

### users table

Write sql file containing two users each with `name`, `username`, and `password`

In [313]:
# Create users dictionary
users_dictionary = {
    "name": ["Yazeed", "Bara"],
    "username": ["admin", "bara"],
    "password": ["123456", "bara123"]
}

In [314]:
# Convert dictionary to dataframe
users = pd.DataFrame(users_dictionary)

In [315]:
users

Unnamed: 0,name,username,password
0,Yazeed,admin,123456
1,Bara,bara,bara123


In [316]:
# Write to sql file
with open("final_users.sql", "w", encoding="UTF-8") as file:
    for _, row in users.iterrows():
        # Single quotes inside the f-string keep this valid before Python 3.12
        sql = f"INSERT INTO users (name, username, password) VALUES ('{row['name']}', '{row['username']}', '{row['password']}');\n"
        file.write(sql)

### investments_log

#### 1. Import mock investments_log data (two)

In [317]:
investments_log_1 = pd.read_csv("raw_investments_log_01.csv")
investments_log_2 = pd.read_csv("raw_investments_log_02.csv")

In [318]:
investments_log_1

Unnamed: 0,user_id,date,change,notes
0,1,2024-11-30,69,"Integer aliquet, massa id lobortis convallis, ..."
1,1,2024-12-19,-52,"Integer pede justo, lacinia eget, tincidunt eg..."
2,1,2024-12-21,-20,
3,1,2025-05-22,-22,Aliquam erat volutpat.
4,1,2024-11-05,-53,Cras pellentesque volutpat dui.
...,...,...,...,...
595,1,2025-08-18,70,
596,1,2025-08-24,-2,Maecenas ut massa quis augue luctus tincidunt.
597,1,2025-07-05,6,
598,1,2024-12-04,44,Ut at dolor quis odio consequat varius.


In [319]:
investments_log_2

Unnamed: 0,user_id,date,change,notes
0,2,2024-08-29,-10,
1,2,2025-04-14,65,
2,2,2025-08-25,-12,Vestibulum ac est lacinia nisi venenatis trist...
3,2,2025-01-15,40,"Proin leo odio, porttitor id, consequat in, co..."
4,2,2024-11-13,41,
...,...,...,...,...
595,2,2025-08-08,0,
596,2,2025-06-25,8,
597,2,2025-05-02,-7,Praesent lectus.
598,2,2024-12-06,55,Vivamus tortor.


#### 2. Sort each dataframe by date

In [320]:
investments_log_1 = investments_log_1.sort_values("date")

In [321]:
investments_log_1 = investments_log_1.reset_index(drop=True)

In [322]:
investments_log_1

Unnamed: 0,user_id,date,change,notes
0,1,2024-08-01,-45,In quis justo.
1,1,2024-08-03,47,Etiam vel augue.
2,1,2024-08-04,54,"Nam ultrices, libero non mattis pulvinar, null..."
3,1,2024-08-04,2,Etiam justo.
4,1,2024-08-05,14,
...,...,...,...,...
595,1,2025-08-23,20,
596,1,2025-08-23,-40,Maecenas pulvinar lobortis est.
597,1,2025-08-24,-23,
598,1,2025-08-24,-2,Maecenas ut massa quis augue luctus tincidunt.


In [323]:
investments_log_2 = investments_log_2.sort_values("date")

In [324]:
investments_log_2 = investments_log_2.reset_index(drop=True)

In [325]:
investments_log_2

Unnamed: 0,user_id,date,change,notes
0,2,2024-08-01,54,
1,2,2024-08-02,-8,Vestibulum ante ipsum primis in faucibus orci ...
2,2,2024-08-02,19,
3,2,2024-08-03,10,
4,2,2024-08-04,34,
...,...,...,...,...
595,2,2025-08-24,-62,Aenean fermentum.
596,2,2025-08-25,-5,Nulla facilisi.
597,2,2025-08-25,-63,
598,2,2025-08-25,-36,Curabitur convallis.
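
Unlike the earlier sections, these cells sort `date` without converting it to datetime first. That still gives chronological order, because zero-padded `yyyy-mm-dd` strings sort the same way as the dates they represent. A quick check on a few dates taken from the previews above:

```python
import pandas as pd

dates = pd.Series(["2025-01-15", "2024-08-29", "2024-12-06", "2024-08-01"])

# Lexicographic order of the raw strings.
as_text = dates.sort_values().tolist()

# Chronological order after parsing, formatted back to strings.
as_datetime = (
    pd.to_datetime(dates).sort_values().dt.strftime("%Y-%m-%d").tolist()
)

print(as_text == as_datetime)
```

This only holds for the zero-padded ISO format; a format such as `1/8/2024` would not sort correctly as text.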


#### 3. Calculate sum of `change` column per dataframe

In [326]:
investments_log_1["change"].sum()

np.int64(2495)

In [327]:
investments_log_2["change"].sum()

np.int64(1317)

#### 4. If sum is negative, change mock data

I changed the data multiple times until I got it right. Now it looks realistic enough.

#### 5. Add new cumulative column called `balance` for each dataframe using `cumsum`

In [328]:
investments_log_1["balance"] = investments_log_1["change"].cumsum()

In [329]:
investments_log_1

Unnamed: 0,user_id,date,change,notes,balance
0,1,2024-08-01,-45,In quis justo.,-45
1,1,2024-08-03,47,Etiam vel augue.,2
2,1,2024-08-04,54,"Nam ultrices, libero non mattis pulvinar, null...",56
3,1,2024-08-04,2,Etiam justo.,58
4,1,2024-08-05,14,,72
...,...,...,...,...,...
595,1,2025-08-23,20,,2605
596,1,2025-08-23,-40,Maecenas pulvinar lobortis est.,2565
597,1,2025-08-24,-23,,2542
598,1,2025-08-24,-2,Maecenas ut massa quis augue luctus tincidunt.,2540


In [330]:
investments_log_2["balance"] = investments_log_2["change"].cumsum()

In [331]:
investments_log_2

Unnamed: 0,user_id,date,change,notes,balance
0,2,2024-08-01,54,,54
1,2,2024-08-02,-8,Vestibulum ante ipsum primis in faucibus orci ...,46
2,2,2024-08-02,19,,65
3,2,2024-08-03,10,,75
4,2,2024-08-04,34,,109
...,...,...,...,...,...
595,2,2025-08-24,-62,Aenean fermentum.,1433
596,2,2025-08-25,-5,Nulla facilisi.,1428
597,2,2025-08-25,-63,,1365
598,2,2025-08-25,-36,Curabitur convallis.,1329


#### 6. Drop rows where `balance` is negative

In [332]:
investments_log_1 = investments_log_1[investments_log_1["balance"] >= 0]

In [333]:
investments_log_1

Unnamed: 0,user_id,date,change,notes,balance
1,1,2024-08-03,47,Etiam vel augue.,2
2,1,2024-08-04,54,"Nam ultrices, libero non mattis pulvinar, null...",56
3,1,2024-08-04,2,Etiam justo.,58
4,1,2024-08-05,14,,72
5,1,2024-08-06,66,Vestibulum ante ipsum primis in faucibus orci ...,138
...,...,...,...,...,...
595,1,2025-08-23,20,,2605
596,1,2025-08-23,-40,Maecenas pulvinar lobortis est.,2565
597,1,2025-08-24,-23,,2542
598,1,2025-08-24,-2,Maecenas ut massa quis augue luctus tincidunt.,2540


In [334]:
investments_log_2 = investments_log_2[investments_log_2["balance"] >= 0]

In [335]:
investments_log_2

Unnamed: 0,user_id,date,change,notes,balance
0,2,2024-08-01,54,,54
1,2,2024-08-02,-8,Vestibulum ante ipsum primis in faucibus orci ...,46
2,2,2024-08-02,19,,65
3,2,2024-08-03,10,,75
4,2,2024-08-04,34,,109
...,...,...,...,...,...
595,2,2025-08-24,-62,Aenean fermentum.,1433
596,2,2025-08-25,-5,Nulla facilisi.,1428
597,2,2025-08-25,-63,,1365
598,2,2025-08-25,-36,Curabitur convallis.,1329


#### 7. Combine dataframes to one dataframe

In [336]:
investments_log = pd.concat([investments_log_1, investments_log_2], ignore_index=True)

In [337]:
investments_log

Unnamed: 0,user_id,date,change,notes,balance
0,1,2024-08-03,47,Etiam vel augue.,2
1,1,2024-08-04,54,"Nam ultrices, libero non mattis pulvinar, null...",56
2,1,2024-08-04,2,Etiam justo.,58
3,1,2024-08-05,14,,72
4,1,2024-08-06,66,Vestibulum ante ipsum primis in faucibus orci ...,138
...,...,...,...,...,...
1166,2,2025-08-24,-62,Aenean fermentum.,1433
1167,2,2025-08-25,-5,Nulla facilisi.,1428
1168,2,2025-08-25,-63,,1365
1169,2,2025-08-25,-36,Curabitur convallis.,1329


#### 8. Move `balance` column to be after `change` column

In [338]:
investments_log = investments_log[["user_id", "date", "change", "balance", "notes"]]

In [339]:
investments_log

Unnamed: 0,user_id,date,change,balance,notes
0,1,2024-08-03,47,2,Etiam vel augue.
1,1,2024-08-04,54,56,"Nam ultrices, libero non mattis pulvinar, null..."
2,1,2024-08-04,2,58,Etiam justo.
3,1,2024-08-05,14,72,
4,1,2024-08-06,66,138,Vestibulum ante ipsum primis in faucibus orci ...
...,...,...,...,...,...
1166,2,2025-08-24,-62,1433,Aenean fermentum.
1167,2,2025-08-25,-5,1428,Nulla facilisi.
1168,2,2025-08-25,-63,1365,
1169,2,2025-08-25,-36,1329,Curabitur convallis.


#### 9. Sort resulting dataframe by date

In [340]:
investments_log = investments_log.sort_values("date")

In [341]:
investments_log = investments_log.reset_index(drop=True)

In [342]:
investments_log

Unnamed: 0,user_id,date,change,balance,notes
0,2,2024-08-01,54,54,
1,2,2024-08-02,19,65,
2,2,2024-08-02,-8,46,Vestibulum ante ipsum primis in faucibus orci ...
3,2,2024-08-03,10,75,
4,1,2024-08-03,47,2,Etiam vel augue.
...,...,...,...,...,...
1166,1,2025-08-25,-45,2495,Suspendisse accumsan tortor quis turpis.
1167,2,2025-08-25,-5,1428,Nulla facilisi.
1168,2,2025-08-25,-63,1365,
1169,2,2025-08-25,-36,1329,Curabitur convallis.


#### 10. Remove ";" from values in `notes` column

In [343]:
investments_log["notes"] = investments_log["notes"].str.replace(";", "", regex=False)

In [344]:
investments_log

Unnamed: 0,user_id,date,change,balance,notes
0,2,2024-08-01,54,54,
1,2,2024-08-02,19,65,
2,2,2024-08-02,-8,46,Vestibulum ante ipsum primis in faucibus orci ...
3,2,2024-08-03,10,75,
4,1,2024-08-03,47,2,Etiam vel augue.
...,...,...,...,...,...
1166,1,2025-08-25,-45,2495,Suspendisse accumsan tortor quis turpis.
1167,2,2025-08-25,-5,1428,Nulla facilisi.
1168,2,2025-08-25,-63,1365,
1169,2,2025-08-25,-36,1329,Curabitur convallis.


#### 11. Export dataframe to sql file with prefix `final_`

In [345]:
with open("final_investments_log.sql", "w", encoding="UTF-8") as file:
    for _, row in investments_log.iterrows():
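        # Single quotes inside the f-strings keep this valid before Python 3.12
        notes = "NULL" if str(row['notes']) == "nan" else f"'{row['notes']}'"
        # The dataframe's `balance` column is written to the SQL column named `amount`
        sql = f"INSERT INTO investments_log (user_id, date, change, amount, notes) VALUES ({row['user_id']}, '{row['date']}', {row['change']}, {row['balance']}, {notes});\n"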
        file.write(sql)

### debts_log

1. Import mock debts_log data (four)
2. Calculate the sum of amount per party. Make sure it's positive and realistic
3. Play with data to make it more realistic
4. Combine all four tables to one dataframe
5. Sort resulting dataframe by date
6. Remove ";" from values in `notes` column
7. Export dataframe to sql file with prefix `final_`

#### 1. Import mock debts_log data (four)

In [346]:
debts_log_1 = pd.read_csv("raw_debts_log_01.csv")
debts_log_2 = pd.read_csv("raw_debts_log_02.csv")
debts_log_3 = pd.read_csv("raw_debts_log_03.csv")
debts_log_4 = pd.read_csv("raw_debts_log_04.csv")

In [347]:
debts_log_1.head()

Unnamed: 0,user_id,date,party,amount,notes,type
0,1,2025-08-25,Sara,20,Fusce consequat.,Credited
1,1,2024-04-19,Issa,-10,"Quisque erat eros, viverra eget, congue eget, ...",Credited
2,1,2025-06-05,Ahmad,-50,Vivamus vel nulla eget eros elementum pellente...,Credited
3,1,2025-04-25,Issa,-10,Maecenas rhoncus aliquam lacus.,Credited
4,1,2024-05-21,Mila,65,,Credited


In [348]:
debts_log_2.head()

Unnamed: 0,user_id,date,party,amount,notes,type
0,1,2025-07-10,Uncle Issa,40,Suspendisse potenti.,Recieved
1,1,2025-07-30,Landlord,50,Fusce posuere felis sed lacus.,Recieved
2,1,2025-05-15,Landlord,-45,,Recieved
3,1,2024-10-12,Sara,55,,Recieved
4,1,2025-03-07,Lina,80,,Recieved


In [349]:
debts_log_3.head()

Unnamed: 0,user_id,date,party,amount,notes,type
0,2,2024-06-25,Ahmad Gym,55,Quisque porta volutpat erat.,Credited
1,2,2025-01-13,Ahmad Gym,5,Sed vel enim sit amet nunc viverra dapibus.,Credited
2,2,2024-01-21,Young Brother,-15,,Credited
3,2,2025-01-21,Young Brother,-55,,Credited
4,2,2024-09-15,Young Brother,-40,Aenean auctor gravida sem.,Credited


In [350]:
debts_log_4.head()

Unnamed: 0,user_id,date,party,amount,notes,type
0,2,2024-08-12,Father,55,,Received
1,2,2024-10-08,Abd,-30,Etiam pretium iaculis justo.,Received
2,2,2025-03-30,Abd,65,,Received
3,2,2024-05-17,Abd,-40,,Received
4,2,2024-05-08,Job,-20,,Received


#### 2. Calculate the sum of amount per party. Make sure it's positive and realistic

In [351]:
debts_log_1.groupby("party")["amount"].sum()

party
Ahmad   -100
Issa     245
Mila     115
Sara     260
Name: amount, dtype: int64

In [352]:
debts_log_2.groupby("party")["amount"].sum()

party
Landlord      -20
Lina          195
Sara          445
Uncle Issa     15
Name: amount, dtype: int64

In [353]:
debts_log_3.groupby("party")["amount"].sum()

party
Ahmad Gym         65
Mahmoud          100
Samya             80
Young Brother   -115
Name: amount, dtype: int64

In [354]:
debts_log_4.groupby("party")["amount"].sum()

party
Abd         -45
Aunt Mina    15
Father      -45
Job         -75
Name: amount, dtype: int64

#### 3. Play with data to make it more realistic

Looking at the data, several per-party totals are negative and need to be made positive. Also, in each table, one party's amounts should sum to zero so that the debt looks paid off.

In [358]:
debts_log_1.loc[41, "amount"] = 105
debts_log_1.loc[48, "amount"] = 50
debts_log_1.loc[47, "amount"] = -170
debts_log_1.loc[49, "amount"] = -40

In [359]:
debts_log_1.groupby("party")["amount"].sum()

party
Ahmad     45
Issa       0
Mila     115
Sara     260
Name: amount, dtype: int64

In [361]:
debts_log_2.loc[43, "amount"] = -25

In [362]:
debts_log_2.groupby("party")["amount"].sum()

party
Landlord        0
Lina          195
Sara          445
Uncle Issa     15
Name: amount, dtype: int64

In [369]:
debts_log_3.loc[45, "amount"] = 125

In [370]:
debts_log_3.groupby("party")["amount"].sum()

party
Ahmad Gym         65
Mahmoud          100
Samya             80
Young Brother      0
Name: amount, dtype: int64

In [372]:
debts_log_4.loc[45, "amount"] = 50
debts_log_4.loc[36, "amount"] = 50
debts_log_4.loc[29, "amount"] = 55
debts_log_4.loc[41, "amount"] = 75

In [373]:
debts_log_4.groupby("party")["amount"].sum()

party
Abd          160
Aunt Mina     15
Father         0
Job           25
Name: amount, dtype: int64
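
The per-party check repeated in this step can be wrapped in a small helper. This is a minimal sketch, not part of the original notebook: `negative_parties` and the `demo` frame are hypothetical names, and the only assumption is a dataframe with `party` and `amount` columns like the ones above.

```python
import pandas as pd


def negative_parties(df: pd.DataFrame) -> list:
    """Return the parties whose summed amount is below zero."""
    totals = df.groupby("party")["amount"].sum()
    return totals[totals < 0].index.tolist()


# Hypothetical sample data, not taken from the generated tables.
demo = pd.DataFrame(
    {
        "party": ["Issa", "Issa", "Sara", "Ahmad"],
        "amount": [-10, 10, 20, -50],
    }
)

print(negative_parties(demo))
```

An empty list means every party's total is zero or positive, which is the state the manual edits above aim for.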

#### 4. Combine all four tables into one dataframe

In [374]:
debts_log = pd.concat([debts_log_1, debts_log_2, debts_log_3, debts_log_4], ignore_index=True)

In [375]:
debts_log

Unnamed: 0,user_id,date,party,amount,notes,type
0,1,2025-08-25,Sara,20,Fusce consequat.,Credited
1,1,2024-04-19,Issa,-10,"Quisque erat eros, viverra eget, congue eget, ...",Credited
2,1,2025-06-05,Ahmad,-50,Vivamus vel nulla eget eros elementum pellente...,Credited
3,1,2025-04-25,Issa,-10,Maecenas rhoncus aliquam lacus.,Credited
4,1,2024-05-21,Mila,65,,Credited
...,...,...,...,...,...,...
195,2,2025-04-27,Father,50,,Received
196,2,2025-01-11,Father,-30,Nulla ut erat id mauris vulputate elementum.,Received
197,2,2025-05-30,Aunt Mina,-55,Sed accumsan felis.,Received
198,2,2024-06-21,Aunt Mina,30,,Received


#### 5. Sort resulting dataframe by date

In [376]:
debts_log = debts_log.sort_values("date")

In [378]:
debts_log = debts_log.reset_index(drop=True)

In [379]:
debts_log

Unnamed: 0,user_id,date,party,amount,notes,type
0,2,2024-01-04,Samya,-30,,Credited
1,1,2024-01-12,Sara,-50,,Credited
2,2,2024-01-21,Young Brother,-15,,Credited
3,1,2024-01-21,Uncle Issa,-40,"Duis bibendum, felis sed interdum venenatis, t...",Recieved
4,2,2024-01-22,Father,-55,Nam nulla.,Received
...,...,...,...,...,...,...
195,1,2025-08-14,Sara,50,,Recieved
196,2,2025-08-22,Ahmad Gym,50,Sed ante.,Credited
197,2,2025-08-24,Father,45,,Received
198,2,2025-08-25,Abd,55,,Received
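
Combining, sorting, and resetting the index can also be written as a single chain. The sketch below uses hypothetical sample frames (`part_a`, `part_b`). It relies on the dates being `yyyy-mm-dd` strings, which sort chronologically as plain text, and it passes `kind="stable"` so that rows sharing a date keep their concatenation order.

```python
import pandas as pd

# Hypothetical sample frames standing in for the per-user tables.
part_a = pd.DataFrame(
    {"user_id": [1, 1], "date": ["2024-03-01", "2024-01-21"], "amount": [20, -10]}
)
part_b = pd.DataFrame(
    {"user_id": [2, 2], "date": ["2024-01-21", "2024-01-04"], "amount": [55, -30]}
)

combined = (
    pd.concat([part_a, part_b], ignore_index=True)
    # ignore_index=True renumbers the rows, replacing a separate reset_index call.
    .sort_values("date", kind="stable", ignore_index=True)
)

print(combined)
```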


#### 6. Remove ";" from values in `notes` column

In [380]:
debts_log["notes"] = debts_log["notes"].str.replace(";", "", regex=False)

In [381]:
debts_log

Unnamed: 0,user_id,date,party,amount,notes,type
0,2,2024-01-04,Samya,-30,,Credited
1,1,2024-01-12,Sara,-50,,Credited
2,2,2024-01-21,Young Brother,-15,,Credited
3,1,2024-01-21,Uncle Issa,-40,"Duis bibendum, felis sed interdum venenatis, t...",Recieved
4,2,2024-01-22,Father,-55,Nam nulla.,Received
...,...,...,...,...,...,...
195,1,2025-08-14,Sara,50,,Recieved
196,2,2025-08-22,Ahmad Gym,50,Sed ante.,Credited
197,2,2025-08-24,Father,45,,Received
198,2,2025-08-25,Abd,55,,Received
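
Because the dataframe preview truncates long notes, the result of this step is hard to confirm by eye. A minimal sketch of a check follows, using a hypothetical `notes` series rather than the real column; it also shows that missing notes stay missing after the replacement.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values, including one missing note.
notes = pd.Series(["Duis bibendum; felis sed.", np.nan, "Nam nulla."])

cleaned = notes.str.replace(";", "", regex=False)

# Count the notes that still contain a semicolon; missing values count as "no".
remaining = cleaned.str.contains(";", regex=False, na=False).sum()

print(cleaned.tolist())
print(remaining)
```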


#### 7. Export dataframe to an SQL file with prefix `final_`

In [382]:
with open("final_debts_log.sql", "w", encoding="UTF-8") as file:
    for _, row in debts_log.iterrows():
        # Missing notes become SQL NULL; other text values are wrapped in single quotes.
        # Single quotes inside the f-strings keep this valid on Python versions before 3.12.
        notes = "NULL" if pd.isna(row["notes"]) else f"'{row['notes']}'"
        user_id = row["user_id"]
        date = f"'{row['date']}'"
        party = f"'{row['party']}'"
        amount = row["amount"]
        ttype = f"'{row['type']}'"
        sql = f"INSERT INTO debts_log(user_id, date, party, amount, notes, type) VALUES({user_id}, {date}, {party}, {amount}, {notes}, {ttype});\n"
        file.write(sql)
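
The export wraps text values in single quotes directly, so a value that itself contains a single quote would produce a broken statement. One possible guard, sketched here under the assumption that the target database follows the standard SQL rule of doubling single quotes, is a small helper. `sql_literal` and the sample `row` are hypothetical and not part of the original notebook.

```python
import pandas as pd


def sql_literal(value) -> str:
    """Render a scalar as a SQL literal (hypothetical helper)."""
    if pd.isna(value):
        return "NULL"
    if isinstance(value, str):
        # Double any single quote so the string literal cannot end early.
        return "'" + value.replace("'", "''") + "'"
    return str(value)


# Hypothetical row with an apostrophe in the party name and a missing note.
row = {
    "user_id": 1,
    "date": "2024-01-12",
    "party": "Sara's Shop",
    "amount": -50,
    "notes": float("nan"),
    "type": "Credited",
}
columns = ["user_id", "date", "party", "amount", "notes", "type"]

values = ", ".join(sql_literal(row[c]) for c in columns)
statement = f"INSERT INTO debts_log({', '.join(columns)}) VALUES({values});"

print(statement)
```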

### subscriptions

#### Prepare data

In [392]:
subscriptions_dictionary_1 = {
    "subscription": ["Spotify", "Adobe", "iCloud", "Gym"],
    "amount": [3.53, 16.30, 0.70, 45],
    "expected_day": [3, 19, 20, 4]
}

In [388]:
subscriptions_dictionary_2 = {
    "subscription": ["Spotify", "ChatGPT", "Boxing"],
    "amount": [3.53, 14.17, 60],
    "expected_day": [7, 27, 11]
}

#### Create dataframes

In [389]:
subscriptions_1 = pd.DataFrame(subscriptions_dictionary_1)
subscriptions_2 = pd.DataFrame(subscriptions_dictionary_2)

In [390]:
subscriptions_1

Unnamed: 0,subscription,amount,expected_day
0,Spotify,3.53,3
1,Adobe,16.3,19
2,iCloud,0.7,20
3,Gym,45.0,4


In [391]:
subscriptions_2

Unnamed: 0,subscription,amount,expected_day
0,Spotify,3.53,7
1,ChatGPT,14.17,27
2,Boxing,60.0,11


#### Add notes column

In [393]:
subscriptions_1["notes"] = np.nan

In [395]:
subscriptions_2["notes"] = np.nan

In [396]:
subscriptions_1

Unnamed: 0,subscription,amount,expected_day,notes
0,Spotify,3.53,3,
1,Adobe,16.3,19,
2,iCloud,0.7,20,
3,Gym,45.0,4,


In [397]:
subscriptions_2

Unnamed: 0,subscription,amount,expected_day,notes
0,Spotify,3.53,7,
1,ChatGPT,14.17,27,
2,Boxing,60.0,11,


#### Add user_id column

In [398]:
subscriptions_1["user_id"] = 1

In [399]:
subscriptions_2["user_id"] = 2

In [400]:
subscriptions_1

Unnamed: 0,subscription,amount,expected_day,notes,user_id
0,Spotify,3.53,3,,1
1,Adobe,16.3,19,,1
2,iCloud,0.7,20,,1
3,Gym,45.0,4,,1


In [401]:
subscriptions_2

Unnamed: 0,subscription,amount,expected_day,notes,user_id
0,Spotify,3.53,7,,2
1,ChatGPT,14.17,27,,2
2,Boxing,60.0,11,,2


#### Reorder columns

In [402]:
subscriptions_1 = subscriptions_1[["user_id", "subscription", "amount", "expected_day", "notes"]]

In [404]:
subscriptions_2 = subscriptions_2[["user_id", "subscription", "amount", "expected_day", "notes"]]

In [405]:
subscriptions_1

Unnamed: 0,user_id,subscription,amount,expected_day,notes
0,1,Spotify,3.53,3,
1,1,Adobe,16.3,19,
2,1,iCloud,0.7,20,
3,1,Gym,45.0,4,


In [406]:
subscriptions_2

Unnamed: 0,user_id,subscription,amount,expected_day,notes
0,2,Spotify,3.53,7,
1,2,ChatGPT,14.17,27,
2,2,Boxing,60.0,11,


#### Combine dataframes

In [407]:
subscriptions = pd.concat([subscriptions_1, subscriptions_2], ignore_index=True)

In [408]:
subscriptions

Unnamed: 0,user_id,subscription,amount,expected_day,notes
0,1,Spotify,3.53,3,
1,1,Adobe,16.3,19,
2,1,iCloud,0.7,20,
3,1,Gym,45.0,4,
4,2,Spotify,3.53,7,
5,2,ChatGPT,14.17,27,
6,2,Boxing,60.0,11,
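
The steps above (add `notes`, add `user_id`, reorder, combine) can be condensed into one function. This is a sketch rather than the notebook's own code: `build_subscriptions`, `COLUMNS`, and `subscriptions_demo` are hypothetical names, and the sample uses only a few of the values shown above.

```python
import numpy as np
import pandas as pd

# Target column order, matching the reordered tables above.
COLUMNS = ["user_id", "subscription", "amount", "expected_day", "notes"]


def build_subscriptions(data: dict, user_id: int) -> pd.DataFrame:
    """Create one user's table with user_id and empty notes, in the target order."""
    frame = pd.DataFrame(data).assign(user_id=user_id, notes=np.nan)
    return frame[COLUMNS]


subscriptions_demo = pd.concat(
    [
        build_subscriptions(
            {"subscription": ["Spotify", "Gym"], "amount": [3.53, 45], "expected_day": [3, 4]},
            user_id=1,
        ),
        build_subscriptions(
            {"subscription": ["Boxing"], "amount": [60], "expected_day": [11]},
            user_id=2,
        ),
    ],
    ignore_index=True,
)

print(subscriptions_demo)
```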


#### Export data with `final_` prefix

In [409]:
with open("final_subscriptions.sql", "w", encoding="UTF-8") as file:
    for _, row in subscriptions.iterrows():
        user_id = row["user_id"]
        # Single quotes inside the f-string keep this valid on Python versions before 3.12.
        subscription = f"'{row['subscription']}'"
        amount = row["amount"]
        expected_day = row["expected_day"]
        # The notes column is empty for every subscription, so always write NULL.
        notes = "NULL"
        sql = f"INSERT INTO subscriptions(user_id, subscription, amount, expected_day, notes) VALUES({user_id}, {subscription}, {amount}, {expected_day}, {notes});\n"
        file.write(sql)