# Predict the invariants of number fields

In this notebook, we will build decision tree models that predict

1) the rank of the unit group $\mathcal{O}_K^\times$ of a quadratic field $K$ (equivalently, the signature of $K$) from the coefficients of its defining polynomial

2) the Galois group of quartic Galois extensions from Dedekind zeta coefficients.
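
The equivalence between rank and signature in item 1 comes from Dirichlet's unit theorem: a field with $r_1$ real embeddings and $r_2$ pairs of complex embeddings has unit rank $r_1 + r_2 - 1$. A minimal sketch in plain Python (the helper name `unit_rank` is ours, introduced only for illustration):

```python
def unit_rank(degree: int, r2: int) -> int:
    """Rank of the unit group via Dirichlet's unit theorem: r1 + r2 - 1."""
    r1 = degree - 2 * r2  # number of real embeddings
    if r1 < 0 or r2 < 0:
        raise ValueError("need 0 <= 2 * r2 <= degree")
    return r1 + r2 - 1


# A quadratic field has signature (2, 0) (real) or (0, 1) (imaginary).
quadratic_ranks = {(2 - 2 * r2, r2): unit_rank(2, r2) for r2 in (0, 1)}
print(quadratic_ranks)  # {(2, 0): 1, (0, 1): 0}
```

For quadratic fields the rank is therefore 1 or 0, and knowing it is the same as knowing the signature.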

In [1]:
import pathlib

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, plot_tree, _tree
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
import polars as pl
from tqdm import tqdm

from lmf import db

In [2]:
nf_db = db.nf_fields

Q1. Using Sage, you can compute several invariants of number fields.

Consider the degree 8 number field
$$
K = \mathbb{Q}[x] / (x^8 - 6x^6 - 4x^5 + 43x^4 + 104x^3 + 86x^2 + 24x + 36)
$$

1. Find the signature, discriminant, ramified primes, Galois group, and class number of $K$.

2. Is the field Galois over $\mathbb{Q}$?

3. Find the intermediate fields between $K$ and $\mathbb{Q}$.

4. Find the smallest rational prime $p$ that splits completely in $K$.

5. Find the LMFDB label of this number field.

In [1]:
# Your code here

Now, download the number field data. For degree 2, we will only use the first 100000 number fields (there are 1370659 quadratic fields in total on LMFDB).

Note that you can save the dataframe as a CSV file with `df.write_csv(...)`.

In [7]:
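def zc(poly, N):
    """Compute the first N zeta coefficients of the number field defined by poly.

    poly is a list of integer coefficients, constant term first.
    """
    R = PolynomialRing(ZZ, "x")
    # Explicit generator name instead of the preparser-only `F.<a> = ...` syntax.
    F = NumberField(R(poly), "a")
    return F.zeta_coefficients(N)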

def get_data(
    degree=None,
    galois_only=False,
    max_N=1000,
    limit=None,
):
    """Make a polars dataframe of number fields of the given degree.
    Columns are `label`, `rank`, `galois_label`, `c_0`, ..., `c_{degree-1}`,
    `a_0001`, ..., `a_{max_N}` (zeta coefficient indices are zero-padded to four digits).
    """
    # The schema and the rank computation below both need a concrete degree.
    if degree is None:
        raise ValueError("degree must be specified")
    query = {"degree": degree}
    if galois_only:
        # Only Galois extensions
        query["is_galois"] = True

    cols = ["label", "coeffs", "r2", "galois_label"]

    qfs = nf_db.search(query, cols, limit=limit)
    qfs = list(qfs)
    
    columns = [("rank", pl.Int8), ("galois_label", pl.String)]
    for i in range(degree):
        columns.append((f"c_{i}", pl.Int64))
    for i in range(1, max_N+1):
        columns.append((f"a_{i:04d}", pl.Int64))
    columns = pl.Schema(columns)

    df = None
    df_label = None
    
    chunk_size = 10000
    for i in tqdm(range(0, len(qfs), chunk_size), desc="loading data"):
        labels = []
        data = []
        for F in qfs[i:i+chunk_size]:
            label = F["label"]
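            r2 = F["r2"]  # number of pairs of complex embeddings
            r1 = degree - 2 * r2  # number of real embeddings
            r = r1 + r2 - 1  # unit rank, by Dirichlet's unit theorem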
            galois_label = F["galois_label"]
            labels.append(label)
            F_data = [r, galois_label]
            F_data += list(ZZ(x) for x in F["coeffs"][:-1]) + list(zc(F["coeffs"], max_N))
            data.append(F_data)
        if df is None:
            df_label = pl.DataFrame(labels, schema=[("label", pl.String)])
            df = pl.DataFrame(data, schema=columns)
        else:
            df_label.extend(pl.DataFrame(labels, schema=[("label", pl.String)]))
            df.extend(pl.DataFrame(data, schema=columns))

    df = pl.concat([df_label, df], how="horizontal")

    print(f"Total number of fields: {len(df)}")
    return df

In [8]:
df_quad = get_data(
    degree=2,
    limit=100000,
)

loading data: 100%|██████████| 10/10 [01:00<00:00,  6.02s/it]


Total number of fields: 100000


In [9]:
print(df_quad.head())

shape: (5, 1_005)
┌─────────┬──────┬──────────────┬─────┬───┬────────┬────────┬────────┬────────┐
│ label   ┆ rank ┆ galois_label ┆ c_0 ┆ … ┆ a_0997 ┆ a_0998 ┆ a_0999 ┆ a_1000 │
│ ---     ┆ ---  ┆ ---          ┆ --- ┆   ┆ ---    ┆ ---    ┆ ---    ┆ ---    │
│ str     ┆ i8   ┆ str          ┆ i64 ┆   ┆ i64    ┆ i64    ┆ i64    ┆ i64    │
╞═════════╪══════╪══════════════╪═════╪═══╪════════╪════════╪════════╪════════╡
│ 2.0.3.1 ┆ 0    ┆ 2T1          ┆ 1   ┆ … ┆ 2      ┆ 0      ┆ 2      ┆ 0      │
│ 2.0.4.1 ┆ 0    ┆ 2T1          ┆ 1   ┆ … ┆ 2      ┆ 0      ┆ 0      ┆ 4      │
│ 2.2.5.1 ┆ 1    ┆ 2T1          ┆ -1  ┆ … ┆ 0      ┆ 0      ┆ 0      ┆ 0      │
│ 2.0.7.1 ┆ 0    ┆ 2T1          ┆ 2   ┆ … ┆ 0      ┆ 4      ┆ 0      ┆ 0      │
│ 2.0.8.1 ┆ 0    ┆ 2T1          ┆ 2   ┆ … ┆ 0      ┆ 2      ┆ 0      ┆ 0      │
└─────────┴──────┴──────────────┴─────┴───┴────────┴────────┴────────┴────────┘


Q2. Train a decision tree model that predicts the rank of the unit group of the ring of integers.

- Randomly split the data into train (70\%) and test (30\%) sets.
- Use the polynomial coefficients as features.
- The resulting model may achieve an accuracy $>$ 99\%.
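
The workflow in the bullets above can be sketched on synthetic data before touching the LMFDB dataframe. The example below is only an illustration under stated assumptions: the coefficients are random integers (not LMFDB polynomials, and not necessarily irreducible), and the label is simply whether the discriminant $c_1^2 - 4c_0$ of $x^2 + c_1 x + c_0$ is positive. All names ending in `_demo` are ours.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in for the features (c_0, c_1) of x^2 + c_1*x + c_0.
c0 = rng.integers(-200, 201, size=5000)
c1 = rng.integers(-20, 21, size=5000)
X_demo = np.column_stack([c0, c1])

# Label 1 when the discriminant is positive (two real roots), else 0.
y_demo = (c1**2 - 4 * c0 > 0).astype(int)

# 70% train / 30% test split, as in the exercise.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

clf_demo = DecisionTreeClassifier(random_state=0)
clf_demo.fit(X_train, y_train)

demo_accuracy = accuracy_score(y_test, clf_demo.predict(X_test))
print(f"test accuracy: {demo_accuracy:.3f}")
```

On the real data, the same steps apply with `X, y` produced by the helper `X_y` defined in the next cell.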

In [10]:
def X_y(df, label, feature, degree=None, max_N=None):
    """
    Given a polars dataframe df, return the feature matrix X and target vector y.
    The features are either the coefficients of the defining polynomial or the first max_N Dedekind zeta coefficients.
    The target is the column specified by label.
    """
    if feature == "poly":
        assert degree is not None, "degree must be specified when feature is 'poly'"
        columns_ = [f"c_{i}" for i in range(degree)]
    elif feature == "zeta":
        assert max_N is not None, "max_N must be specified when feature is 'zeta'"
        columns_ = [f"a_{i:04d}" for i in range(1, max_N+1)]
    else:
        raise ValueError("feature must be 'poly' or 'zeta'")
    X = df.select(columns_)
    y = df.select(label)
    return X, y

In [2]:
# Your code here

Q3. Use `plot_tree` to plot the decision tree. (Set `feature_names` and `class_names` as `["c_0", "c_1"]` and `["0", "1"]`, respectively.)

Can you explain your observation? (Hint: this depends heavily on how LMFDB chooses the defining polynomial for each number field.)
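
If a plotting backend is not available, scikit-learn's `export_text` gives a plain-text view of the same tree structure that `plot_tree` draws. A small sketch on toy data (the six rows below are made up for illustration and are not LMFDB data):

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data (hypothetical): two integer features and a binary label.
X_toy = [[-3, 0], [-2, -1], [-1, 0], [1, 0], [2, -1], [3, 0]]
y_toy = [1, 1, 1, 0, 0, 0]

tree_toy = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)

# One line per node: the split condition for inner nodes, the class for leaves.
tree_text = export_text(tree_toy, feature_names=["c_0", "c_1"])
print(tree_text)
```

Reading off which features appear in the splits, and at which thresholds, is usually the first step in interpreting a fitted tree.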

In [3]:
# Your code here

Similarly, download the data for quartic Galois extensions. This may take a few minutes.

In [14]:
df_quartic_galois = get_data(
    degree=4,
    galois_only=True,
)

loading data: 100%|██████████| 19/19 [03:36<00:00, 11.41s/it]

Total number of fields: 182860





Q4. How many quartic Galois extensions are there in LMFDB?

In [4]:
# Your code here

Q5. Train a decision tree model that predicts the Galois group of quartic Galois extensions.

- Randomly split the data into train (70\%) and test (30\%) sets.
- Use the Dedekind zeta coefficients (up to $a_{1000}$) as features.
- The resulting model may achieve an accuracy $>$ 99\%.

In [5]:
# Your code here

Q6. Plot the tree and interpret it. Use `feature_names = [f"a_{i:04d}" for i in range(1, 1001)]` and `class_names=["4T1", "4T2"]`. Can you explain your observation?

A detailed answer can be found in the paper by Lee-Lee (Figure 1; your tree may look slightly different).

In [6]:
# Your code here

Q7. Do your own experiments (different model, different data, different problem, etc.).