
# Task
Read the data from the SQLite database file `4- Categorized_DB_Enhanced.db` into a pandas DataFrame, then prepare and save a metadata YAML file for GraphRAG indexing.

## Connect to the database

### Subtask:
Establish a connection to the SQLite database file.


**Reasoning**:
Import the sqlite3 library and establish a connection to the database file.



In [1]:
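import sqlite3

# Colab path to the uploaded database. sqlite3.connect() creates an empty file
# if the path does not exist, so a typo here only surfaces later as "no tables".
conn = sqlite3.connect('/content/4- Categorized_DB_Enhanced.db')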
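
Because `sqlite3.connect()` silently creates a new, empty database when the path is wrong, it can help to fail early instead. The following is a minimal sketch, not part of the original notebook: `connect_existing` is a hypothetical helper name, and it assumes the file should only be read, so it opens it through a SQLite URI with `mode=ro`.

```python
import sqlite3
from pathlib import Path


def connect_existing(db_path: str) -> sqlite3.Connection:
    """Open an existing SQLite file read-only; raise if the file is missing."""
    path = Path(db_path)
    if not path.is_file():
        raise FileNotFoundError(f"SQLite file not found: {path}")
    # as_uri() percent-encodes spaces in the file name; mode=ro stops SQLite
    # from creating or modifying the file.
    return sqlite3.connect(f"{path.resolve().as_uri()}?mode=ro", uri=True)


# Hypothetical usage in this notebook:
# conn = connect_existing('/content/4- Categorized_DB_Enhanced.db')
```

With this variant a mistyped path raises `FileNotFoundError` at connection time rather than producing an empty table list in the next cell.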

## Read data

### Subtask:
Read the data from the database into a pandas DataFrame.


**Reasoning**:
Query the database to find the table name and then read the data from the identified table into a pandas DataFrame.



In [2]:
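import pandas as pd

# List user tables, skipping SQLite's internal sqlite_* tables.
cursor = conn.cursor()
cursor.execute("SELECT name FROM sqlite_master WHERE type='table' AND name NOT LIKE 'sqlite_%';")
tables = cursor.fetchall()

if tables:
    # Assumes the first table holds the order data; the double quotes protect
    # table names that contain spaces.
    table_name = tables[0][0]
    sql_query = f'SELECT * FROM "{table_name}"'
    df_data = pd.read_sql_query(sql_query, conn)
    display(df_data.head())
else:
    print("No tables found in the database.")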

| | eventDate | orderDate | marketplace | productName | website | orderId | currencyCode | quantity | productCondition | totalOwed | Broad Category | Specific Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-07-17 14:36:06+00:00 | 2025-07-17 14:36:06+00:00 | DE | MELLERUD Schimmel Vernichter Aktivchlor \| 1 x ... | Amazon.de | 028-1498008-9740330 | EUR | 1 | New | 7.99 | Home & Kitchen | Cleaning Supplies |
| 1 | 2025-07-08 05:06:46+00:00 | 2025-07-08 05:06:46+00:00 | DE | Carson MicroFlip 100x-250x Taschenmikroskop mi... | Amazon.de | 028-4603470-0628322 | EUR | 1 | New | 24.12 | Electronics | Optics & Accessories |
| 2 | 2025-07-04 05:36:24+00:00 | 2025-07-04 05:36:24+00:00 | DE | Gosknor 2stk Octopus-Kopfhautmassagegerät, Har... | Amazon.de | 028-5613038-3064361 | EUR | 1 | New | 9.59 | Beauty & Personal Care | Hair & Skin Care |
| 3 | 2025-07-02 19:54:58+00:00 | 2025-07-02 19:54:58+00:00 | DE | AJINOMOTO - Monosodium Glutamat, (1 X 200 GR) | Amazon.de | 028-1018659-6139555 | EUR | 1 | New | 3.69 | Grocery & Gourmet Food | Cooking Ingredients |
| 4 | 2025-06-17 17:33:15+00:00 | 2025-06-17 17:33:15+00:00 | DE | Rainbow Designs Harry Potter Plüsch-Babyrassel... | Amazon.de | 028-4961403-1482742 | EUR | 1 | New | 14.05 | Toys & Games | Collectibles & Baby Toys |
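
Taking the first entry of `sqlite_master` works when the file holds a single table, but it is fragile otherwise. As a sketch under that concern, the helper below (a hypothetical name, `list_user_tables`, not from the notebook) reports every user table with its row count so the table to load can be chosen explicitly.

```python
import sqlite3


def list_user_tables(conn: sqlite3.Connection) -> dict:
    """Map each user table name to its row count, skipping sqlite_* tables."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
        )
    ]
    counts = {}
    for name in names:
        # Quote the identifier and double any embedded double quotes.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts
```

Calling `list_user_tables(conn)` before the `SELECT *` makes the choice of `table_name` a visible decision instead of an accident of table order.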


## Inspect the data

### Subtask:
Display the first few rows and the columns and their data types to understand the structure of the data.


**Reasoning**:
Previewing the rows next to `df_data.info()` shows each column's dtype and non-null count, which determine the data types recorded in the metadata.



In [3]:
display(df_data.head())
df_data.info()  # info() prints its report and returns None, so it needs no display() wrapper

| | eventDate | orderDate | marketplace | productName | website | orderId | currencyCode | quantity | productCondition | totalOwed | Broad Category | Specific Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2025-07-17 14:36:06+00:00 | 2025-07-17 14:36:06+00:00 | DE | MELLERUD Schimmel Vernichter Aktivchlor \| 1 x ... | Amazon.de | 028-1498008-9740330 | EUR | 1 | New | 7.99 | Home & Kitchen | Cleaning Supplies |
| 1 | 2025-07-08 05:06:46+00:00 | 2025-07-08 05:06:46+00:00 | DE | Carson MicroFlip 100x-250x Taschenmikroskop mi... | Amazon.de | 028-4603470-0628322 | EUR | 1 | New | 24.12 | Electronics | Optics & Accessories |
| 2 | 2025-07-04 05:36:24+00:00 | 2025-07-04 05:36:24+00:00 | DE | Gosknor 2stk Octopus-Kopfhautmassagegerät, Har... | Amazon.de | 028-5613038-3064361 | EUR | 1 | New | 9.59 | Beauty & Personal Care | Hair & Skin Care |
| 3 | 2025-07-02 19:54:58+00:00 | 2025-07-02 19:54:58+00:00 | DE | AJINOMOTO - Monosodium Glutamat, (1 X 200 GR) | Amazon.de | 028-1018659-6139555 | EUR | 1 | New | 3.69 | Grocery & Gourmet Food | Cooking Ingredients |
| 4 | 2025-06-17 17:33:15+00:00 | 2025-06-17 17:33:15+00:00 | DE | Rainbow Designs Harry Potter Plüsch-Babyrassel... | Amazon.de | 028-4961403-1482742 | EUR | 1 | New | 14.05 | Toys & Games | Collectibles & Baby Toys |

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 266 entries, 0 to 265
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   eventDate          266 non-null    object 
 1   orderDate          266 non-null    object 
 2   marketplace        266 non-null    object 
 3   productName        266 non-null    object 
 4   website            266 non-null    object 
 5   orderId            266 non-null    object 
 6   currencyCode       266 non-null    object 
 7   quantity           266 non-null    int64  
 8   productCondition   266 non-null    object 
 9   totalOwed          266 non-null    float64
 10  Broad Category     266 non-null    object 
 11  Specific Category  266 non-null    object 
dtypes: float64(1), int64(1), object(10)
memory usage: 25.1+ KB
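
The report lists `eventDate` and `orderDate` as `object`, meaning SQLite returned them as plain strings. If real timestamps are wanted before any date-based work, one possible approach is sketched below. `parse_timestamps` is a hypothetical helper, and the sample frame reuses two values from the preview in place of `df_data`.

```python
import pandas as pd

# Two timestamps copied from the preview; in the notebook this would be df_data.
sample = pd.DataFrame({
    "eventDate": ["2025-07-17 14:36:06+00:00", "2025-07-08 05:06:46+00:00"],
    "orderDate": ["2025-07-17 14:36:06+00:00", "2025-07-08 05:06:46+00:00"],
})


def parse_timestamps(df, columns=("eventDate", "orderDate")):
    """Return a copy of df with the given string columns parsed as UTC datetimes."""
    out = df.copy()
    for col in columns:
        # utc=True yields a timezone-aware column regardless of the offset in the text.
        out[col] = pd.to_datetime(out[col], utc=True)
    return out


parsed = parse_timestamps(sample)
print(parsed.dtypes)
```

The original frame is left untouched, so the string form stays available if the text itself is what gets indexed.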

## Prepare metadata

### Subtask:
Describe the dataset as a Python dictionary (dataset name, description, and one entry per column) to accompany GraphRAG indexing.


**Reasoning**:
Build the dictionary from the column names and dtypes reported by `df_data.info()`. The two timestamp columns are labeled `datetime` even though the DataFrame currently holds them as strings.



In [4]:
metadata = {
    "dataset_name": "sales_data",
    "dataset_description": "Sales data from an e-commerce platform, including order and product details.",
    "columns": [
        {"name": "eventDate", "data_type": "datetime", "description": "Date and time of the event."},
        {"name": "orderDate", "data_type": "datetime", "description": "Date and time of the order."},
        {"name": "marketplace", "data_type": "string", "description": "Marketplace where the sale occurred."},
        {"name": "productName", "data_type": "string", "description": "Name of the product sold."},
        {"name": "website", "data_type": "string", "description": "Website where the sale occurred."},
        {"name": "orderId", "data_type": "string", "description": "Unique identifier for the order."},
        {"name": "currencyCode", "data_type": "string", "description": "Currency code for the transaction."},
        {"name": "quantity", "data_type": "integer", "description": "Quantity of the product sold."},
        {"name": "productCondition", "data_type": "string", "description": "Condition of the product (e.g., New)."},
        {"name": "totalOwed", "data_type": "number", "description": "Total amount owed for the order."},
        {"name": "Broad Category", "data_type": "string", "description": "Broad category of the product."},
        {"name": "Specific Category", "data_type": "string", "description": "Specific category of the product."}
    ]
}
display(metadata)

{'dataset_name': 'sales_data',
 'dataset_description': 'Sales data from an e-commerce platform, including order and product details.',
 'columns': [{'name': 'eventDate',
   'data_type': 'datetime',
   'description': 'Date and time of the event.'},
  {'name': 'orderDate',
   'data_type': 'datetime',
   'description': 'Date and time of the order.'},
  {'name': 'marketplace',
   'data_type': 'string',
   'description': 'Marketplace where the sale occurred.'},
  {'name': 'productName',
   'data_type': 'string',
   'description': 'Name of the product sold.'},
  {'name': 'website',
   'data_type': 'string',
   'description': 'Website where the sale occurred.'},
  {'name': 'orderId',
   'data_type': 'string',
   'description': 'Unique identifier for the order.'},
  {'name': 'currencyCode',
   'data_type': 'string',
   'description': 'Currency code for the transaction.'},
  {'name': 'quantity',
   'data_type': 'integer',
   'description': 'Quantity of the product sold.'},
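  {'name': 'productCondition',
   'data_type': 'string',
   'description': 'Condition of the product (e.g., New).'},
  {'name': 'totalOwed',
   'data_type': 'number',
   'description': 'Total amount owed for the order.'},
  {'name': 'Broad Category',
   'data_type': 'string',
   'description': 'Broad category of the product.'},
  {'name': 'Specific Category',
   'data_type': 'string',
   'description': 'Specific category of the product.'}]}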
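
The dictionary above is typed by hand, so it can drift if the table changes. As an alternative sketch, the column list could be derived from the DataFrame's dtypes. The helper names `infer_data_type` and `build_columns` are hypothetical; the type labels (`integer`, `number`, `datetime`, `string`) are the ones the notebook already uses, and descriptions still have to be supplied manually.

```python
import pandas as pd


def infer_data_type(series: pd.Series) -> str:
    """Map a pandas dtype to the type labels used in the metadata dictionary."""
    if pd.api.types.is_integer_dtype(series):
        return "integer"
    if pd.api.types.is_float_dtype(series):
        return "number"
    if pd.api.types.is_datetime64_any_dtype(series):
        return "datetime"
    return "string"  # fallback for object/string columns


def build_columns(df: pd.DataFrame, descriptions: dict) -> list:
    """Build one metadata entry per column, in DataFrame column order."""
    return [
        {
            "name": col,
            "data_type": infer_data_type(df[col]),
            "description": descriptions.get(col, ""),
        }
        for col in df.columns
    ]


# Small stand-in for df_data, using values from the preview.
sample = pd.DataFrame({
    "orderDate": pd.to_datetime(["2025-07-17 14:36:06+00:00"], utc=True),
    "productName": ["AJINOMOTO - Monosodium Glutamat, (1 X 200 GR)"],
    "quantity": [1],
    "totalOwed": [3.69],
})
columns = build_columns(sample, {"quantity": "Quantity of the product sold."})
```

Note that unparsed timestamp columns would come out as `string` here, since the inference only sees the dtype pandas actually holds.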

## Save metadata

### Subtask:
Save the metadata to a YAML file.


**Reasoning**:
Save the metadata dictionary to a YAML file as requested by the subtask.



In [5]:
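import yaml

# sort_keys=False keeps keys in insertion order (dataset_name first, name before
# data_type); by default keys are sorted alphabetically. allow_unicode writes
# non-ASCII text as-is instead of escaping it.
with open('metadata.yaml', 'w', encoding='utf-8') as f:
    yaml.safe_dump(metadata, f, sort_keys=False, allow_unicode=True)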
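
A quick round trip is one way to confirm that what was written can be read back unchanged. The sketch below uses a shortened stand-in for `metadata` and PyYAML's `safe_dump` and `safe_load`; it also shows that the default `yaml.dump` sorts keys alphabetically, which moves `columns` ahead of `dataset_name`.

```python
import yaml

# Shortened stand-in for the notebook's metadata dictionary.
metadata_sample = {
    "dataset_name": "sales_data",
    "columns": [
        {
            "name": "totalOwed",
            "data_type": "number",
            "description": "Total amount owed for the order.",
        }
    ],
}

# Preserve insertion order so the file reads top-down like the dictionary.
ordered_text = yaml.safe_dump(metadata_sample, sort_keys=False, allow_unicode=True)

# Default behaviour: keys sorted alphabetically.
sorted_text = yaml.dump(metadata_sample)

# Both variants load back to the same dictionary; only the layout differs.
round_trip = yaml.safe_load(ordered_text)
print(ordered_text)
```

Key order does not change the parsed content, so this is a readability choice rather than a correctness fix.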

## Summary

### Data Analysis Key Findings

*   The data was read from the 'sales' table of the SQLite database `4- Categorized_DB_Enhanced.db` into a pandas DataFrame.
*   The DataFrame `df_data` contains 266 rows and 12 columns: `eventDate`, `orderDate`, `marketplace`, `productName`, `website`, `orderId`, `currencyCode`, `quantity`, `productCondition`, `totalOwed`, `Broad Category`, and `Specific Category`.
*   Ten columns have dtype `object` (strings, including the two timestamp columns); `quantity` is `int64` and `totalOwed` is `float64`.
*   No column has missing values: each reports 266 non-null entries.
*   A Python dictionary named `metadata` was created, containing a dataset name, a description, and one entry per column with its name, data type, and description, intended for GraphRAG indexing.
*   The `metadata` dictionary was saved to a YAML file named `metadata.yaml`.

### Insights or Next Steps

*   The created `metadata.yaml` file is ready to be used for GraphRAG indexing of the sales data.
*   Further analysis can be performed on the `df_data` DataFrame, such as calculating sales trends, analyzing product performance by category or marketplace, or investigating order details.
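
As a starting point for the follow-up analysis mentioned above, the sketch below totals `totalOwed` per `Broad Category` and per calendar month. The function names are hypothetical, the sample frame holds only the five preview rows rather than the full `df_data`, and it assumes all amounts share one currency, as the `EUR` rows in the preview do.

```python
import pandas as pd

# The five preview rows; in the notebook this would be df_data.
sample = pd.DataFrame({
    "orderDate": [
        "2025-07-17 14:36:06+00:00",
        "2025-07-08 05:06:46+00:00",
        "2025-07-04 05:36:24+00:00",
        "2025-07-02 19:54:58+00:00",
        "2025-06-17 17:33:15+00:00",
    ],
    "Broad Category": [
        "Home & Kitchen",
        "Electronics",
        "Beauty & Personal Care",
        "Grocery & Gourmet Food",
        "Toys & Games",
    ],
    "totalOwed": [7.99, 24.12, 9.59, 3.69, 14.05],
})


def spend_by_category(df: pd.DataFrame) -> pd.DataFrame:
    """Total amount and number of rows per broad category, largest total first."""
    return (
        df.groupby("Broad Category")["totalOwed"]
        .agg(total="sum", orders="count")
        .sort_values("total", ascending=False)
    )


def spend_by_month(df: pd.DataFrame) -> pd.Series:
    """Total amount per calendar month (YYYY-MM, in UTC)."""
    month = pd.to_datetime(df["orderDate"], utc=True).dt.strftime("%Y-%m")
    return df.groupby(month)["totalOwed"].sum()


by_category = spend_by_category(sample)
by_month = spend_by_month(sample)
```

If the full table mixes currencies, grouping by `currencyCode` as well would be needed before comparing totals.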
