# Predict the rank of elliptic curves

In this notebook, we build a logistic regression model that predicts the rank of an elliptic curve $E$ over $\mathbb{Q}$, using its Frobenius traces $a_p(E)$ as features.
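
For two classes (say rank 0 versus rank 1), the model can be summarized as follows. The symbols $x$, $w$, $b$, and $n$ are notation introduced here for illustration and do not appear in the notebook's code. With the feature vector $x = (a_{p_1}(E), \dots, a_{p_n}(E))$ built from the first $n$ primes, logistic regression models
$$
\Pr(\operatorname{rank} E = 1 \mid x) = \sigma(w \cdot x + b), \qquad \sigma(t) = \frac{1}{1 + e^{-t}},
$$
where the weights $w \in \mathbb{R}^n$ and the bias $b \in \mathbb{R}$ are fitted by minimizing the cross-entropy loss on the training set (scikit-learn's `LogisticRegression` adds an $\ell_2$ penalty by default).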

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

import polars as pl
from tqdm import tqdm

from lmf import db

In [None]:
# Elliptic curve database
ec_db = db.ec_curvedata

In [None]:
def query_data(N1, N2, r=None):
    # Elliptic curves over Q of conductor N1 <= N <= N2
    if r is None:
        return list(ec_db.search({"conductor": {"$gte": N1, "$lte": N2}}))
    return list(ec_db.search({"conductor": {"$gte": N1, "$lte": N2}, "rank": r}))

def get_df(max_N, num_ap, r=None, save_csv=False):
    """
    Get a polars dataframe of isogeny classes of elliptic curves over Q
    of conductor up to max_N, including the first num_ap coefficients a_p
    of the L-series.
    If r is specified, then we assume that it is a list of ranks to filter.
    If save_csv is True, save the dataframe to a CSV file.
    """
    data = query_data(1, max_N)
    pmax = Primes()[num_ap - 1]
    isog_labels = set()

    columns = ["isog_label", "rank"]
    for p in Primes()[:num_ap]:
        columns.append(f"a_{p:04d}")
    df = None

    for ec in tqdm(data):
        if ec['lmfdb_iso'] in isog_labels:  # one per isogeny class
            continue
        isog_labels.add(ec['lmfdb_iso'])
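        rank = ec['rank']
        if r is not None and rank not in r:  # filter before the costly a_p computation
            continue
        ec_sage = EllipticCurve(QQ, ec['ainvs'])
        # Convert Sage integers to plain Python ints before handing them to polars
        aps = [int(ap) for ap in ec_sage.aplist(pmax)]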
        row = [ec['lmfdb_iso'], rank] + aps
        if df is None:
            df = pl.DataFrame([row], schema=columns)
        else:
            df.extend(pl.DataFrame([row], schema=columns))

    if save_csv and df is not None:  # df stays None when no curve matches
        df.write_csv(f"ec_data_N{max_N}_ap{num_ap}.csv")
    return df

def X_y(df, label, num_ap):
    """
    Given a polars dataframe df, return the feature matrix X and target vector y.
    The features are the first num_ap coefficients a_p of the L-series.
    The target is the column specified by label.
    """
    columns_ = [f"a_{p:04d}" for p in Primes()[:num_ap]]
    X = df.select(columns_)
    y = df.select(label)
    return X, y

Q1. How many elliptic curves of rank 0 or 1 and conductor $\le$ 10000 are there in the LMFDB?

Q2. How about the number of isogeny classes?
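
The difference between counting curves and counting isogeny classes can be illustrated on a tiny hand-written sample. The records below mimic rows of `ec_curvedata`, keeping only the `lmfdb_label`, `lmfdb_iso`, and `rank` fields. The helper name `count_curves_and_classes` is hypothetical, and the sample covers just a few well-known classes (11.a, 37.a, 37.b, 389.a), so this is a sketch of the deduplication step rather than an answer to Q1 or Q2.

```python
# Toy rows shaped like ec_curvedata records (only the fields used here).
sample = [
    {"lmfdb_label": "11.a1", "lmfdb_iso": "11.a", "rank": 0},
    {"lmfdb_label": "11.a2", "lmfdb_iso": "11.a", "rank": 0},
    {"lmfdb_label": "11.a3", "lmfdb_iso": "11.a", "rank": 0},
    {"lmfdb_label": "37.a1", "lmfdb_iso": "37.a", "rank": 1},
    {"lmfdb_label": "37.b1", "lmfdb_iso": "37.b", "rank": 0},
    {"lmfdb_label": "37.b2", "lmfdb_iso": "37.b", "rank": 0},
    {"lmfdb_label": "37.b3", "lmfdb_iso": "37.b", "rank": 0},
    {"lmfdb_label": "389.a1", "lmfdb_iso": "389.a", "rank": 2},
]


def count_curves_and_classes(records, ranks):
    """Return (#curves, #isogeny classes) among records whose rank is in `ranks`."""
    kept = [rec for rec in records if rec["rank"] in ranks]
    # The rank is an isogeny invariant, so deduplicating on the class label is safe.
    classes = {rec["lmfdb_iso"] for rec in kept}
    return len(kept), len(classes)


n_curves, n_classes = count_curves_and_classes(sample, ranks={0, 1})
print(n_curves, n_classes)  # 7 curves in 3 isogeny classes
```

On the real data, the same idea applies to the list returned by `query_data`.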

In [None]:
# Your code here

Q3. Let $E$ be the following elliptic curve
$$
y^2+xy+y=x^3-208x-1122
$$

1. What are the conductor, the analytic and algebraic ranks, and the torsion subgroup of the curve?
2. What is $a_{1987}(E)$?
3. Find the LMFDB label of the curve. (You can search on [this webpage](https://www.lmfdb.org/EllipticCurve/Q/).)

Hint: To find useful Sage functions, consult the [reference manual](https://doc.sagemath.org/html/en/reference/index.html). LMFDB pages also show the corresponding Sage commands.

In [None]:
# Your code here

Q4. Train a logistic regression model that distinguishes (isogeny classes of) elliptic curves of rank 0 from those of rank 1, with conductor $\le$ 10000.

- Randomly split the data into train (70\%) and test (30\%) sets.
- Use the first 300 $a_{p}(E)$'s as features.
- The resulting model may achieve an accuracy > 99\%.

In [None]:
# Your code here

Q5. By changing the function `get_df` slightly, make a dataset of elliptic curves of conductor $\le 10000$ whose torsion subgroup is either $\mathbb{Z}/3$ or $(\mathbb{Z}/2)^2$. Find the number of such elliptic curves. Build a model distinguishing the two classes from the Frobenius traces $a_{p_n}$ with $n \le 300$.

Note that we need to consider actual curves instead of isogeny classes, since otherwise some of the torsion subgroups may not appear. Hence the problem might be "ill-defined" in the sense that $(a_p(E))_{p} \mapsto E(\mathbb{Q})_{\mathrm{tors}}$ is not a well-defined function: isogenous curves share all $a_p$ but can have different torsion subgroups (although $\mathbb{Z}/3$ vs $(\mathbb{Z}/2)^2$ would be fine). One can try grouping several torsion subgroups under a single label to set up an isogeny-invariant problem.
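
To see concretely why the traces alone cannot determine the torsion subgroup, consider two curves in the isogeny class of conductor 11: $[0,-1,1,-10,-20]$ has torsion $\mathbb{Z}/5$, while $[0,-1,1,-7820,-263580]$ has trivial torsion. The sketch below checks in plain Python that their $a_p$ agree at a few primes of good reduction. It uses naive point counting, which is only practical for small $p$ and is not how Sage's `aplist` works; the function names `affine_point_count` and `frobenius_trace` are hypothetical.

```python
def affine_point_count(ainvs, p):
    """Count affine solutions of the Weierstrass equation modulo a prime p."""
    a1, a2, a3, a4, a6 = ainvs
    count = 0
    for x in range(p):
        b = (a1 * x + a3) % p                        # equation reads y^2 + b*y = f
        f = (x**3 + a2 * x * x + a4 * x + a6) % p
        if p == 2:
            # Completing the square is not possible mod 2, so try both y.
            count += sum((y * y + b * y - f) % 2 == 0 for y in range(2))
            continue
        disc = (b * b + 4 * f) % p                   # discriminant of the quadratic in y
        if disc == 0:
            count += 1                               # one double root
        elif pow(disc, (p - 1) // 2, p) == 1:        # Euler's criterion: disc is a square
            count += 2
    return count


def frobenius_trace(ainvs, p):
    """a_p = p + 1 - #E(F_p), where #E(F_p) includes the point at infinity."""
    return p + 1 - (affine_point_count(ainvs, p) + 1)


PRIMES = [2, 3, 5, 7, 13, 17, 19, 23, 29, 31]        # good primes only (11 is omitted)

curve_a = (0, -1, 1, -10, -20)                       # torsion Z/5
curve_b = (0, -1, 1, -7820, -263580)                 # trivial torsion

traces_a = [frobenius_trace(curve_a, p) for p in PRIMES]
traces_b = [frobenius_trace(curve_b, p) for p in PRIMES]
print(traces_a == traces_b)  # True: isogenous curves have the same a_p
```

A classifier that sees only $(a_p)_p$ therefore receives identical inputs for these two curves, whatever label is attached to each.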

The numbers of elliptic curves with each of these two torsion subgroups have the same asymptotic growth; see [this paper](https://arxiv.org/abs/1311.4920).

In [None]:
# Your code here

Q6. You may have obtained an accuracy that is not as good as you expected. Try other models and see if you can achieve better accuracy.

In [None]:
# Your code here

Q7. Do your own experiments (different model, different data, different problem, etc.).
You can modify `get_df` to generate new data with other information.
Check out the `Underlying data` section of LMFDB.