# DataAgent Basic Usage Example

This notebook demonstrates how to use both sklearn and statsmodels tools from the DataAgent package.

## Overview

DataAgent provides a unified interface for:
- **Scikit-learn tools**: Machine learning estimators with automated parameter validation
- **Statsmodels tools**: Statistical analysis including linear models, GLM, nonparametric methods, and more

## Installation

```bash
pip install datagent
```

In [1]:
# Import required libraries
import datagent
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris
import warnings
warnings.filterwarnings('ignore')

print(f"DataAgent version: {datagent.__version__}")

DataAgent version: 1.0.0


## 1. Scikit-learn Example

Let's start with a machine learning example using the iris dataset.

In [2]:
# Load iris dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target)

print(f"Dataset shape: {X.shape}")
print(f"Target classes: {np.unique(y)}")
print(f"Feature names: {list(X.columns)}")

# Display first few rows
X.head()

Dataset shape: (150, 4)
Target classes: [0 1 2]
Feature names: ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |


In [3]:
# Create DataFrame with target column for sklearn tools
df = X.copy()
df['target'] = y

# Use universal sklearn estimator
result = datagent.universal_sklearn_estimator(
    estimator_name="random_forest_classifier",
    data=df,
    target_column="target",
    test_size=0.2,
    random_state=42,
    n_estimators=100
)

# The result dictionary has no 'model_info' key; the estimator name is
# stored under 'estimator_name' (see the result.keys() output in the next cell).
print(f"Model: {result['estimator_name']}")
print(f"Accuracy: {result['metrics']['accuracy']:.4f}")
print(f"Precision: {result['metrics']['precision']:.4f}")
print(f"Recall: {result['metrics']['recall']:.4f}")
print(f"F1 Score: {result['metrics']['f1']:.4f}")

INFO:universal_sklearn_tool.universal_estimator:Successfully trained random_forest_classifier with metrics: {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0}

In [6]:
result.keys()

dict_keys(['success', 'estimator_name', 'estimator_type', 'model', 'predictions', 'metrics', 'feature_importance', 'data_shape'])
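
For orientation, the keys above map onto a familiar scikit-learn workflow. The following is a minimal sketch in plain scikit-learn, not DataAgent's actual implementation. It assumes the wrapper performs a seeded train/test split and reports weighted-average precision, recall, and F1 for multiclass targets. Both are assumptions, since the notebook does not show the wrapper's internals.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Load the same iris data used above.
X, y = load_iris(return_X_y=True, as_frame=True)

# Assumption: the wrapper holds out `test_size` of the rows using `random_state`.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Assumption: multiclass precision/recall/F1 are weighted averages.
metrics = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision_score(y_test, y_pred, average="weighted"),
    "recall": recall_score(y_test, y_pred, average="weighted"),
    "f1": f1_score(y_test, y_pred, average="weighted"),
}

# Counterpart of the 'feature_importance' key: one score per input column.
feature_importance = dict(zip(X.columns, model.feature_importances_))

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Here `metrics`, `y_pred`, and `feature_importance` play the roles of the `metrics`, `predictions`, and `feature_importance` keys in the result dictionary.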

## 2. Statsmodels Example

Now let's demonstrate statistical analysis using statsmodels tools.

In [7]:
# Create sample data for linear regression
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
y = 2 * X[:, 0] + 1.5 * X[:, 1] + np.random.randn(n) * 0.5

df = pd.DataFrame({
    'y': y,
    'x1': X[:, 0],
    'x2': X[:, 1]
})

print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# Display first few rows
df.head()

Dataset shape: (100, 3)
Columns: ['y', 'x1', 'x2']


|   | y | x1 | x2 |
|---|---|---|---|
| 0 | 0.964926 | 0.496714 | -0.138264 |
| 1 | 3.860314 | 0.647689 | 1.52303 |
| 2 | -0.277987 | -0.234153 | -0.234137 |
| 3 | 4.836479 | 1.579213 | 0.767435 |
| 4 | -0.813943 | -0.469474 | 0.54256 |


In [8]:
# Use universal linear model
result = datagent.universal_linear_models(
    model_name="ols",
    data=df,
    formula="y ~ x1 + x2"
)

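# .get() returns the fallback value when a key is missing from the result
# dictionary, so 'N/A' in the output means that key was not found.
print(f"Model: {result.get('model_name', 'OLS')}")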
print(f"R-squared: {result.get('r_squared', 'N/A')}")
print(f"Adjusted R-squared: {result.get('adj_r_squared', 'N/A')}")
print(f"AIC: {result.get('aic', 'N/A')}")
print(f"BIC: {result.get('bic', 'N/A')}")

# Print coefficients if available
if 'params' in result:
    print("\nCoefficients:")
    for param, value in result['params'].items():
        print(f"  {param}: {value:.4f}")

Model: ols
R-squared: N/A
Adjusted R-squared: N/A
AIC: N/A
BIC: N/A
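
The `N/A` values indicate that the result dictionary does not expose these statistics under the top-level keys used in the cell; `result.keys()` would show which keys it does provide. As a cross-check, the same model can be fitted directly with statsmodels. This is a minimal sketch that bypasses DataAgent entirely. It uses the standard `statsmodels.formula.api.ols` interface and recreates the synthetic data so that it runs on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Recreate the synthetic regression data from the earlier cell.
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
y = 2 * X[:, 0] + 1.5 * X[:, 1] + np.random.randn(n) * 0.5
df = pd.DataFrame({"y": y, "x1": X[:, 0], "x2": X[:, 1]})

# Fit ordinary least squares with the same formula.
ols_fit = smf.ols("y ~ x1 + x2", data=df).fit()

print(f"R-squared: {ols_fit.rsquared:.4f}")
print(f"Adjusted R-squared: {ols_fit.rsquared_adj:.4f}")
print(f"AIC: {ols_fit.aic:.2f}")
print(f"BIC: {ols_fit.bic:.2f}")

print("\nCoefficients:")
for param, value in ols_fit.params.items():
    print(f"  {param}: {value:.4f}")
```

Because the data were generated as `y = 2*x1 + 1.5*x2 + noise`, the fitted coefficients for `x1` and `x2` should land close to 2 and 1.5, with an intercept near zero.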


## 3. Available Models

Let's explore what models are available in DataAgent.

In [9]:
# Get available sklearn models (the returned dictionary is keyed by category)
sklearn_models = datagent.get_available_sklearn_models()
print(f"Available sklearn model categories: {len(sklearn_models)}")
print("\nSklearn model categories:")
for category in list(sklearn_models.keys())[:10]:
    print(f"  - {category}")

Available sklearn model categories: 4

Sklearn model categories:
  - regressor
  - classifier
  - clustering
  - preprocessor


In [10]:
# Get available statsmodels linear models (the returned dictionary is keyed by category)
linear_models = datagent.get_linear_available_models()
print(f"\nAvailable linear model categories: {len(linear_models)}")
print("\nLinear model categories:")
for category in list(linear_models.keys())[:5]:
    print(f"  - {category}")


Available linear model categories: 1

Linear model categories:
  - regression


## 4. Additional Examples

Let's try a few more examples to demonstrate the versatility of DataAgent.

In [11]:
# Example: Logistic Regression
print("=== Logistic Regression Example ===")

# Create binary classification data
np.random.seed(42)
X_binary = np.random.randn(200, 2)
y_binary = (X_binary[:, 0] + X_binary[:, 1] > 0).astype(int)

df_binary = pd.DataFrame(X_binary, columns=['feature1', 'feature2'])
df_binary['target'] = y_binary

result_logistic = datagent.universal_sklearn_estimator(
    estimator_name="logistic_regression",
    data=df_binary,
    target_column="target",
    test_size=0.2,
    random_state=42
)

print(f"Logistic Regression Accuracy: {result_logistic['metrics']['accuracy']:.4f}")

INFO:universal_sklearn_tool.universal_estimator:Successfully trained logistic_regression with metrics: {'accuracy': 1.0, 'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'roc_auc': 1.0}


=== Logistic Regression Example ===
Logistic Regression Accuracy: 1.0000
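
The perfect scores are expected here: the label is a deterministic linear function of the two features, so the classes are linearly separable. The log line also reports `roc_auc`, which appears only for this binary task. The sketch below shows in plain scikit-learn how accuracy and ROC AUC are typically computed for a binary classifier. Whether DataAgent computes them exactly this way, including the split, is an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Recreate the binary classification data from the cell above.
np.random.seed(42)
X_binary = np.random.randn(200, 2)
y_binary = (X_binary[:, 0] + X_binary[:, 1] > 0).astype(int)

# Assumption: a seeded 80/20 split, mirroring test_size and random_state.
X_train, X_test, y_train, y_test = train_test_split(
    X_binary, y_binary, test_size=0.2, random_state=42
)

clf = LogisticRegression().fit(X_train, y_train)

# Accuracy is computed from hard class labels.
accuracy = accuracy_score(y_test, clf.predict(X_test))

# ROC AUC is computed from the predicted probability of the positive class.
roc_auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

print(f"Accuracy: {accuracy:.4f}")
print(f"ROC AUC: {roc_auc:.4f}")
```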


In [12]:
# Example: Ridge Regression
print("=== Ridge Regression Example ===")

result_ridge = datagent.universal_sklearn_estimator(
    estimator_name="ridge_regression",
    data=df,  # Using the linear regression data from earlier
    target_column="y",
    test_size=0.2,
    random_state=42,
    alpha=1.0
)

print(f"Ridge Regression R²: {result_ridge['metrics']['r2']:.4f}")
print(f"Ridge Regression MSE: {result_ridge['metrics']['mse']:.4f}")

INFO:universal_sklearn_tool.universal_estimator:Successfully trained ridge_regression with metrics: {'mse': 0.1500796106945576, 'mae': 0.322601557674962, 'r2': 0.9759015203461984}


=== Ridge Regression Example ===
Ridge Regression R²: 0.9759
Ridge Regression MSE: 0.1501
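
For comparison, the three regression metrics in the log line (`mse`, `mae`, `r2`) can be computed with plain scikit-learn. This is a minimal sketch that assumes the wrapper fits `Ridge(alpha=1.0)` on a seeded 80/20 split. The exact numbers may differ from the output above if DataAgent preprocesses or splits the data differently.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Recreate the synthetic regression data used earlier in the notebook.
np.random.seed(42)
n = 100
X = np.random.randn(n, 2)
y = 2 * X[:, 0] + 1.5 * X[:, 1] + np.random.randn(n) * 0.5
df = pd.DataFrame({"y": y, "x1": X[:, 0], "x2": X[:, 1]})

# Assumption: every column except the target is used as a feature.
features = df.drop(columns="y")
X_train, X_test, y_train, y_test = train_test_split(
    features, df["y"], test_size=0.2, random_state=42
)

ridge = Ridge(alpha=1.0).fit(X_train, y_train)
y_pred = ridge.predict(X_test)

metrics = {
    "mse": mean_squared_error(y_test, y_pred),
    "mae": mean_absolute_error(y_test, y_pred),
    "r2": r2_score(y_test, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```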


## 5. Summary

DataAgent provides a unified interface for both machine learning and statistical analysis:

- **Easy to use**: A single function call trains a model and returns its metrics
- **Comprehensive**: Covers both sklearn and statsmodels capabilities
- **Flexible**: Estimator parameters such as `n_estimators` and `alpha` are passed as keyword arguments
- **Inspectable**: Results come back as dictionaries, so `result.keys()` shows exactly which fields are available

### Next Steps

1. Explore more models in the available model lists
2. Try different parameters and configurations
3. Check out the LangGraph integration example
4. Use DataAgent in your own data analysis workflows