Description
I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for statsmodels
. I included a simplified version below.
import numpy as np
import numpy.linalg as la
import pandas as pd
import patsy
index_vals = tuple("abc")
level_names = list("ABD")
n_samples = 2
def describe_design_matrix(design_matrix):
print("Shape:", design_matrix.shape)
print("Rank: ", la.matrix_rank(design_matrix))
print(
"Approximate condition number: {0:.2g}".format(
np.divide(*la.svd(design_matrix)[1][[0, -1]])
)
)
ds_simple = pd.DataFrame(
index=pd.MultiIndex.from_product(
[index_vals] * len(level_names) + [range(n_samples)],
names=level_names + ["sample"],
),
columns=["y"],
data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
).reset_index()
print("All sampled")
simple_X = patsy.dmatrices("y ~ (A + B + D) ** 3", ds_simple)[1]
describe_design_matrix(simple_X)
print("Only some sampled")
simple_X = patsy.dmatrices(
"y ~ (A + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Reduced X")
simple_X = patsy.dmatrices(
"y ~ (A + B + D) ** 3",
ds_simple.query("A != 'a' or B == 'a'"),
return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
:, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]
]
describe_design_matrix(reduced_X)
print("Only some sampled: alternate method")
simple_X = patsy.dmatrices(
"y ~ (C(A, Treatment('b')) + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Number of nonzero elements:", (simple_X != 0).sum(axis=0))
print("Number of all-zero columns:", np.count_nonzero((simple_X != 0).sum(axis=0) == 0))
print("Reduced X: alternate method")
simple_X = patsy.dmatrices(
"y ~ (C(A, Treatment('b')) + B + D) ** 3",
ds_simple.query("A != 'a' or B == 'a'"),
return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
:,
[
col
for col in simple_X.columns
if not col.startswith("C(A, Treatment('b'))[T.a]:B")
],
]
describe_design_matrix(reduced_X)
produces as output
All sampled
Shape: (54, 27)
Rank: 27
Approximate condition number: 52
Only some sampled
Shape: (42, 27)
Rank: 21
Approximate condition number: 3.8e+16
Reduced X
Shape: (42, 21)
Rank: 21
Approximate condition number: 37
Only some sampled: alternate method
Shape: (42, 27)
Rank: 21
Approximate condition number: 3.4e+16
Number of nonzero elements: [42 6 18 12 12 14 14 0 6 0 6 2 6 2 6 4 4 4 4 0 2 0 2 0
2 0 2]
Number of all-zero columns: 6
Reduced X: alternate method
Shape: (42, 21)
Rank: 21
Approximate condition number: 39
I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linear dependent on others". Since some statsmodels
functions require the formula information from patsy.DesignInfo
objects, I wanted to see what could be done within patsy
.
matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.