# Example 1 - Introduction to MatGraphDB

## Introduction to MatGraphDB

### What is MatGraphDB?

MatGraphDB is a comprehensive toolkit and framework designed to streamline graph-based research in materials and molecular science. It simplifies the complexities of working with diverse datasets and advanced computational methods by providing:

- Methods to incorporate various data sources into a single, consistent materials database.
- Tools for running lightweight computations directly on the database.
- An interface to manage intensive computations requiring HPC resources.
- Seamless integration with graph analysis tools like Graph-Tool, Neo4j, NetworkX, PyTorch Geometric, and Deep Graph Library.

### Purpose and Scope

MatGraphDB addresses the entire research workflow in graph-based materials and molecular science, including:

- Generating graph structures from raw data.
- Modeling relationships between entities like atoms, molecules, or materials.
- Applying advanced graph analysis techniques, including graph neural networks (GNNs).

---

## Getting Started with MatGraphDB

### Installation

*Assuming MatGraphDB is available on PyPI or through a GitHub repository.*

```bash
# If available via pip
pip install matgraphdb

# If installing from source (replace the placeholder URL with the actual repository)
git clone https://github.com/yourusername/matgraphdb.git
cd matgraphdb
pip install -e .
```
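
To confirm that the installation succeeded, one option is to ask Python whether the package can be located, without actually importing it. The following is a minimal sketch using only the standard library; the helper name `is_installed` is illustrative and is not part of MatGraphDB.

```python
from importlib.util import find_spec


def is_installed(package_name: str) -> bool:
    """Return True if a top-level package can be located on the Python path."""
    # find_spec returns None when a top-level package cannot be found.
    return find_spec(package_name) is not None


if __name__ == "__main__":
    status = "found" if is_installed("matgraphdb") else "not found"
    print(f"matgraphdb: {status}")
```

If the package is reported as not found, check that `pip` and the notebook kernel point at the same Python environment.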

### Importing MatGraphDB

```python
import os
from matgraphdb import MatGraphDB
```

### Initializing MatGraphDB

Set up the main directory and specify optional parameters:

```python
# Define the main directory where all data will be stored
main_dir = os.path.join('data', 'MatGraphDB_Example')

# Optional parameters
calculation_dirname = 'calculations'
graph_dirname = 'graph_database'
db_file = 'materials.db'
n_cores = 4  # Number of CPU cores to use for parallel processing

# Initialize MatGraphDB
matgraphdb = MatGraphDB(
    main_dir=main_dir,
    calculation_dirname=calculation_dirname,
    graph_dirname=graph_dirname,
    db_file=db_file,
    n_cores=n_cores
)

print(matgraphdb.__class__)
```

```
<class 'matgraphdb.core.MatGraphDB'>
```


### Directory Structure

Upon initialization, MatGraphDB sets up the following directory structure:

- **Main Directory (`main_dir`)**: The root directory for all data.
  - **Calculations Directory (`calculation_dirname`)**: Stores calculation files and results.
  - **Graph Database Directory (`graph_dirname`)**: Contains graph database files.
  - **Database File (`db_file`)**: SQLite database file storing material data.
  - **Parquet Schema File**: Stores the schema for Parquet files.
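
The layout described above can be sketched as a small path-building helper. This is an illustration only: `expected_layout` is a hypothetical function, its default values simply mirror the parameters used in this example (they are not necessarily the library's defaults), and the Parquet schema file is left out because its file name is not given here.

```python
from pathlib import Path


def expected_layout(main_dir, calculation_dirname="calculations",
                    graph_dirname="graph_database", db_file="materials.db"):
    """Map each component listed above to the path it would occupy under main_dir."""
    root = Path(main_dir)
    return {
        "main_dir": root,
        "calculations": root / calculation_dirname,
        "graph_database": root / graph_dirname,
        "db_file": root / db_file,
    }


layout = expected_layout(Path("data") / "MatGraphDB_Example")
for name, path in layout.items():
    # exists() is only True once MatGraphDB has been initialized at this location.
    marker = "ok" if path.exists() else "missing"
    print(f"{name:15s} {path} [{marker}]")
```

A check like this can be a quick way to spot a mistyped `main_dir` before running longer jobs.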

You can visualize the directory structure using the `os` module:

```python
for root, dirs, files in os.walk(main_dir):
    # Depth of the current directory relative to main_dir
    level = root.replace(main_dir, '').count(os.sep)
    indent = ' ' * 4 * level
    print(f'{indent}{os.path.basename(root)}/')
    subindent = ' ' * 4 * (level + 1)
    for f in files:
        print(f'{subindent}{f}')
```

```
MatGraphDB_Example/
    materials.db
    calculations/
        MaterialsData/
    graph_database/
        nodes/
        relationships/
```


---

## Core Components of MatGraphDB

### MaterialDatabaseManager

Handles interactions with the SQLite database, including reading, writing, and updating material data.

```python
# Access the database manager
db_manager = matgraphdb.db_manager

# Example: Read data from the database
rows = db_manager.read()
for row in rows:
    print("ASE Row:", row)
    print("id:", row.id)
    print("data:", row.data)
    print("-" * 50)
```

```
ASE Row: <AtomsRow: formula=Fe, keys=has_structure>
id: 1
data: {'density': 9.0}
--------------------------------------------------
ASE Row: <AtomsRow: formula=TiO2, keys=has_structure>
id: 2
data: {}
--------------------------------------------------
ASE Row: <AtomsRow: formula=Fe, keys=has_structure>
id: 3
data: {'density': 7.87, 'color': 'silver'}
--------------------------------------------------
ASE Row: <AtomsRow: formula=Fe, keys=has_structure>
id: 5
data: {}
--------------------------------------------------
```
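
Since each row exposes an `id` and a `data` dictionary, a common follow-up is to pull one property out of every row that has it. The sketch below uses `SimpleNamespace` stand-ins carrying the same values as the output above, so it runs without a database; with real rows, the same helper should work as long as `row.data` supports `in` and key lookup, which is an assumption here. `collect_property` is a hypothetical helper, not a MatGraphDB method.

```python
from types import SimpleNamespace

# Stand-ins for the rows printed above; real rows would come from db_manager.read().
rows = [
    SimpleNamespace(id=1, formula="Fe", data={"density": 9.0}),
    SimpleNamespace(id=2, formula="TiO2", data={}),
    SimpleNamespace(id=3, formula="Fe", data={"density": 7.87, "color": "silver"}),
    SimpleNamespace(id=5, formula="Fe", data={}),
]


def collect_property(rows, key):
    """Return {row id: value} for every row whose data dict contains `key`."""
    return {row.id: row.data[key] for row in rows if key in row.data}


densities = collect_property(rows, "density")
print(densities)  # {1: 9.0, 3: 7.87}

# Rows that still lack the property, e.g. candidates for a later calculation
missing = [row.id for row in rows if "density" not in row.data]
print(missing)  # [2, 5]
```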


### CalculationManager

Manages computational tasks, especially those requiring parallel processing.

```python
# Access the calculation manager
calc_manager = matgraphdb.calc_manager

# Example: Define a simple calculation function
def example_calculation(data):
    # Perform some computation
    result = data.get('property', 0) * 2
    return {'calculated_property': result}

# Run the calculation across the database
results = matgraphdb.run_inmemory_calculation(example_calculation, save_results=True)
```
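
To make the idea of running one function across many records concrete, here is a standalone sketch that maps the same toy calculation over a few in-memory records with a worker pool. It does not show how `run_inmemory_calculation` is implemented; `run_over_records` is a hypothetical helper, the `property` values are made up, and threads are used only to keep the example portable (CPU-bound work would more likely use processes).

```python
from concurrent.futures import ThreadPoolExecutor


def example_calculation(data):
    # Same toy calculation as above: double the value stored under 'property'.
    return {"calculated_property": data.get("property", 0) * 2}


def run_over_records(func, records, n_workers=4):
    """Apply `func` to each record's data dict using a pool of worker threads.

    Returns {record id: result}. Hypothetical helper, not a MatGraphDB function.
    """
    ids = list(records)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves input order, so the outputs line up with ids.
        outputs = list(pool.map(func, (records[i] for i in ids)))
    return dict(zip(ids, outputs))


# Toy records keyed by id; the 'property' values are invented for illustration.
records = {1: {"property": 1.5}, 2: {}, 3: {"property": 4}}
results = run_over_records(example_calculation, records, n_workers=2)
print(results)
# {1: {'calculated_property': 3.0}, 2: {'calculated_property': 0}, 3: {'calculated_property': 8}}
```

Returning a dictionary from the calculation function, as in the original example, keeps each result easy to merge back into a record's `data` under a named key.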

### GraphManager

Handles interactions with the graph database, allowing for advanced graph-based analyses.

```python
# Access the graph manager
graph_manager = matgraphdb.graph_manager

# Example: Create a simple graph (assuming appropriate methods are implemented)
graph_manager.create_graph_from_data(db_manager.read())
```
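
As a rough picture of what building a graph from material rows can involve, the sketch below derives material nodes, element nodes, and the relationships between them from chemical formulas. Everything here is an assumption made for illustration: `parse_formula`, `build_graph`, and the `HAS_ELEMENT` relationship name are hypothetical and are not taken from MatGraphDB, and the formula parser handles only flat formulas such as `Fe` or `TiO2`.

```python
import re
from types import SimpleNamespace

# One element symbol (capital letter, optional lowercase letter) plus an optional count
ELEMENT_RE = re.compile(r"([A-Z][a-z]?)(\d*)")


def parse_formula(formula):
    """Split a flat formula such as 'TiO2' into {'Ti': 1, 'O': 2}.

    Parentheses and fractional counts are not supported.
    """
    counts = {}
    for symbol, number in ELEMENT_RE.findall(formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def build_graph(rows):
    """Return (nodes, relationships) linking each material to its elements."""
    nodes = {}
    relationships = []
    for row in rows:
        material_key = ("material", row.id)
        nodes[material_key] = {"formula": row.formula}
        for symbol, count in parse_formula(row.formula).items():
            element_key = ("element", symbol)
            # Element nodes are shared between materials, so create each only once.
            nodes.setdefault(element_key, {"symbol": symbol})
            relationships.append(
                (material_key, "HAS_ELEMENT", element_key, {"count": count})
            )
    return nodes, relationships


# Stand-in rows with the same ids and formulas as two of the rows shown earlier
rows = [SimpleNamespace(id=1, formula="Fe"), SimpleNamespace(id=2, formula="TiO2")]
nodes, relationships = build_graph(rows)
print(len(nodes), "nodes,", len(relationships), "relationships")
# 5 nodes, 3 relationships
```

Keeping nodes and relationships as separate collections loosely mirrors the `nodes/` and `relationships/` folders in the graph database directory.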

## Conclusion

In this notebook, we've:

- Introduced **MatGraphDB** and its purpose in materials and molecular science research.
- Demonstrated how to initialize MatGraphDB and explained the directory structure it sets up.
- Explored its core components: `MaterialDatabaseManager`, `CalculationManager`, and `GraphManager`.
- Shown how to read rows from the database, run a simple calculation across it, and hand the rows to the graph manager.

MatGraphDB simplifies the workflow for researchers, allowing them to focus on scientific questions without the technical overhead of data processing and model implementation.

---

## Next Steps

- **Database Management**: Learn how to interact with the materials database using `MatGraphDB` and `MaterialDatabaseManager`.
- **Importing Data**: Follow an example of importing a large dataset into MatGraphDB.
- **Calculation Management**: Learn how to manage calculations using `MatGraphDB` and `CalculationManager`.
- **Graph Management**: Learn how to create nodes and relationships and interact with them using `MatGraphDB` and `GraphManager`.
- **Feature Propagation**: Learn how to perform feature propagation using `MatGraphDB`.

---