# <center>Crash course 4: Generalized linear models</center>

### <center>Alfred Galichon (NYU & Sciences Po)</center>
## <center>'math+econ+code' masterclass on optimal transport and economic applications</center>
#### <center>With python code examples</center>
© 2018-2022 by Alfred Galichon. Past and present support from NSF grant DMS-1716489, ERC grant CoG-866274 are acknowledged, as well as inputs from contributors listed [here](http://www.math-econ-code.org/theteam).

**If you reuse material from this masterclass, please cite as:**<br>
Alfred Galichon, 'math+econ+code' masterclass on optimal transport and economic applications, January 2022. https://github.com/math-econ-code/mec_optim

## References
* McCullagh and Nelder (1989). Generalized Linear Models (2nd ed.). Chapman and Hall/CRC.
* Friedman, Tibshirani, and Hastie (2001). The Elements of Statistical Learning. Springer.
* The Scikit-learn library www.scikit-learn.org.

    

# Generalized linear models
## Setting

* In many settings, an economic model yields predictions about the
conditional mean of a dependent random variable $\mu_a$ given an explanatory random
vector $\phi_a$, for observations $a\in\mathcal{A}$.

* In the case of linear regression, we have<br>
$E\left[  \mu_a | \phi_a \right]  =\phi_a^{\top}\beta$<br>
however, we shall encounter situations where it will be useful to be more general.

* This leads us to *generalized linear models* (GLM), which are specified as<br>
$E\left[  \mu_a|\phi_a\right]  =g^{-1}\left(  \phi_a^{\top}\beta\right)$<br>
where $g:\mathbb{R}\rightarrow\mathbb{R}$ is an increasing and continuous function called the *link function*, and $\beta \in \mathbb{R}^K$.

* Often we shall specify in addition $Var\left(  \mu_a |\phi_a \right)  =V\left(
g^{-1}\left(  \phi_a^{\top}\beta\right)  \right)  $.

We shall use `linear_model` from the scikit-learn library `sklearn`.

In [None]:
from sklearn import linear_model

## Example 1: ordinary least squares (OLS)




* In least squares (OLS), we have<br>
$\mu_a=\phi_a^{\top}\beta+\epsilon_a$<br>
with $E\left[  \epsilon_a|\phi_a\right]  =0$, in which case $g\left(  z\right)  =z$.

* Additionally, assuming $E\left[  \epsilon_a^{2}|\phi_a \right]  =\sigma^{2}$, we
have<br>
$Var\left(  \mu_a|\phi_a\right)  =\sigma^{2}.$

### OLS regression in scikit-learn



In [None]:
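import numpy as np
from sklearn import linear_model
nba,nbk = 100,10
np.random.seed(7)
# simulated data: nba observations of nbk regressors, and an independent outcome
Φ_a_k = np.random.randn(nba,nbk)
μ_a = np.random.randn(nba)
# LinearRegression fits an intercept by default; coef_ holds the nbk slopes
ols = linear_model.LinearRegression()
ols.fit(Φ_a_k, μ_a)
ols.coef_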
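
As a sanity check, the following minimal sketch compares the scikit-learn fit with a plain least-squares solve. It assumes an intercept column is appended by hand, since `LinearRegression` fits an intercept by default. The names `Phi`, `mu`, `X1`, `beta_ls` and `max_gap` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn import linear_model

nba, nbk = 100, 10
rng = np.random.default_rng(7)
Phi = rng.standard_normal((nba, nbk))   # regressors
mu = rng.standard_normal(nba)           # outcome

ols = linear_model.LinearRegression().fit(Phi, mu)

# Least squares on the design matrix with an explicit column of ones
X1 = np.column_stack([np.ones(nba), Phi])
beta_ls, *_ = np.linalg.lstsq(X1, mu, rcond=None)

# Largest discrepancy between the two sets of coefficients
max_gap = max(abs(beta_ls[0] - ols.intercept_),
              np.max(np.abs(beta_ls[1:] - ols.coef_)))
print(max_gap)
```

Both routes minimize the same sum of squared residuals, so the gap should be at the level of floating-point error.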


## Example 2: Poisson regression



* Recall a Poisson distribution with parameter $\theta\in(0,+\infty)$ has
probability mass<br>
$\Pr(\mu|\theta)=\frac{e^{-\theta}\theta^{\mu}}{\mu!}$<br>
over $\mu \in\left\{  0,1,2,...\right\}  $. It has expectation and variance
$\theta$.

* Assume that conditional on $\phi_a$, $\mu_a$ has a Poisson distribution of
parameter $\mu^\beta_a=\exp\left(  \phi_a^{\top}\beta\right)  $. Then<br>
$ E\left[  \mu_a|\phi_a\right]  =\exp\left(  \phi_a^{\top}\beta\right)$<br>
so in this case $g=\ln$.

* Note that we get<br>
$Var\left(  \mu_a|\phi_a\right)  =\exp\left(  \phi_a^{\top}\beta\right)$<br>
which may be overly restrictive (more on this later).
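
The equality between conditional mean and conditional variance can be illustrated by simulation. The sketch below draws Poisson counts for a single, arbitrarily chosen covariate vector `phi` and coefficient vector `beta` (both hypothetical values) and compares the sample mean and variance with $\exp(\phi^{\top}\beta)$.

```python
import numpy as np

rng = np.random.default_rng(0)
phi = np.array([1.0, 0.5])      # hypothetical covariates
beta = np.array([0.8, 0.6])     # hypothetical coefficients
theta = np.exp(phi @ beta)      # Poisson parameter exp(phi' beta)

draws = rng.poisson(theta, size=200_000)
sample_mean, sample_var = draws.mean(), draws.var()
print(theta, sample_mean, sample_var)
```

Up to sampling noise, the three printed numbers should coincide.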


## Poisson regression



* Sample log-likelihood:<br>
$
\sum_{a}-\exp\left(  \phi_{a}^{\top}\beta\right)  + \phi_{a}^{\top}\beta \hat{\mu}_a -\ln\left(  \hat{\mu}_a!\right)
$<br>
and therefore, maximum likelihood yields the Poisson regression<br>
$
\max_{\beta}\left\{ \sum_{a} \hat{\mu}_a \phi_{a}^{\top}\beta  -\sum_{a}\exp\left(  \phi_{a}^{\top}\beta\right)\right\}
$<br>

* First-order conditions give, for each $k$,<br>
$
\sum_{a}\left(  \hat{\mu}_a-\exp\left(  \phi_{a}^{\top}\beta\right)  \right)  \phi_{ak}=0
$<br>
and therefore $\beta$ is obtained by matching the predicted moments of $\phi_k$ (weighted by $\exp(\phi_a^{\top}\beta)$) with the
observed ones (weighted by $\hat{\mu}_a$)<br>
$
\mathbb{E}_\beta\left[ \phi_k \right]  =\hat{\mathbb{E}}\left[ \phi_k \right]  .
$
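
The moment-matching property can be checked numerically. The sketch below maximizes the Poisson log-likelihood with a plain Newton iteration (one possible solver, chosen here for transparency and not necessarily the one used by scikit-learn) on simulated data, and then evaluates the first-order conditions. The function name `poisson_newton` and the simulated design are assumptions made for illustration.

```python
import numpy as np

def poisson_newton(Phi, mu_hat, n_iter=50):
    """Maximize sum_a mu_hat_a * phi_a'beta - exp(phi_a'beta) by Newton's method."""
    beta = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        pred = np.exp(Phi @ beta)                  # predicted means
        grad = Phi.T @ (mu_hat - pred)             # score
        hess = -(Phi * pred[:, None]).T @ Phi      # Hessian (negative definite)
        beta = beta - np.linalg.solve(hess, grad)  # Newton step
    return beta

rng = np.random.default_rng(0)
n = 500
Phi = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
mu_hat = rng.poisson(np.exp(Phi @ np.array([0.5, 0.3, -0.2])))

beta_hat = poisson_newton(Phi, mu_hat)
# First-order conditions: observed and predicted moments of each regressor
residual = Phi.T @ (mu_hat - np.exp(Phi @ beta_hat))
print(beta_hat, residual)
```

At the optimum, `residual` is zero up to numerical error, which is exactly the moment-matching condition.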



### Poisson regression in scikit-learn

The following example is taken from the `scikit-learn` documentation.

In [None]:
# from https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.PoissonRegressor.html
# Note: PoissonRegressor applies an L2 penalty by default (alpha=1.0) and fits an intercept.
Φ_a_k = [[1, 2], [2, 3], [3, 4], [4, 3]]   # explanatory variables, one row per observation
μ_a = [12, 17, 22, 21]                     # observed counts
poisson = linear_model.PoissonRegressor()
poisson.fit(Φ_a_k, μ_a)
print('Score       = ', poisson.score(Φ_a_k, μ_a))   # D^2, the fraction of deviance explained
print('Coef        = ', poisson.coef_)
print('Intercept   = ', poisson.intercept_)
print('Predictions = ', poisson.predict([[1, 1], [3, 4]]))


## ML inference in Poisson regression (ctd)


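* Recall that $E_{P_{n}}\left[\log P \left(  \beta,\mu\right)\right]  $ is the
log-likelihood of the sample (normalized by the sample size). Setting $l\left(  \beta,\mu\right)  =\log
P\left(  \beta,\mu\right)  $, the first-order conditions satisfied by the estimator $\beta_{n}$ in the sample and by the true parameter $\beta$ in the population are, respectively,

$$
E_{P_{n}}\left[  \partial_{\beta}l\left(  \beta_{n},\mu\right)  \right]
=0
$$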

$$
E_{P}\left[  \partial_{\beta}l\left(  \beta,\mu\right)  \right]     =0
$$

thus
$E_{P}\left[  \partial_{\beta}l\left(  \beta_{n},\mu\right)  \right]
-E_{P}\left[  \partial_{\beta}l\left(  \beta,\mu\right)  \right]  =E_{P}\left[
\partial_{\beta}l\left(  \beta_{n},\mu\right)  \right]  -E_{P_{n}}\left[
\partial_{\beta}l\left(  \beta_{n},\mu\right)  \right]
$
therefore, after a first-order expansion of the left-hand side around $\beta$,

$$
\left(  \beta_{n}-\beta\right)  E_{P}\left[  \partial_{\beta}^{2}l\left(
\beta_{n},\mu\right)  \right]  =-\frac{1}{\sqrt{n}}g_{n}\left(  \partial_{\beta
}l\left(  \beta,\mu\right)  \right)
$$
where $g_{n}f=\sqrt{n}\left(  E_{P_{n}}f-E_{P}f\right)$.

* Thus
$$
\beta_{n}-\beta=-\frac{1}{\sqrt{n}}\left(  E_{P}\left[  \partial_{\beta}%
^{2}l\left(  \beta,\mu\right)  \right]  \right)  ^{-1}g_{n}\left(
\partial_{\beta}l\left(  \beta,\mu\right)  \right)
$$



* Hence
$$
V \left(\beta_{n}-\beta\right)   =\frac{1}{n}\left(  E_{P}\left[
\partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]  \right)  ^{-1}  \times E_{P}\left(  \partial_{\beta}l\left(  \beta,\mu\right)  \left(
\partial_{\beta}l\left(  \beta,\mu\right)  \right)  ^{\top}\right)  \times\left(  E_{P}\left[  \partial_{\beta}^{2}l\left(  \beta,\mu\right)
\right]  \right)  ^{-1}
$$


* And because at the ML parameter the information matrix equality holds,
$$
E_{P}\left(  \partial_{\beta}l\left(  \beta,\mu\right)  \left(  \partial_{\beta
}l\left(  \beta,\mu\right)  \right)  ^{\top}\right)  =-E_{P}\left[
\partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]  ,
$$
we thus have
$$
V\left(  \beta_{n}-\beta\right)  =-\frac{1}{n}\left(  E_{P}\left[
\partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]  \right)  ^{-1}.
$$




## Estimation of GLM



* Actually, we don't need to assume that $\mu_a | \phi_a \sim Poisson\left(
\exp(\phi_a^{\top}\beta)\right)  $ to estimate $\beta$.

* Consider $\Phi$, the $\mathcal{A} \times K$ matrix obtained by stacking the rows $\phi_{a}^{\top}$ on
top of each other. Compute<br>
$
\max_{\beta}\left\{  \hat{\mu}^{\top}\Phi \beta-1_\mathcal{A}^{\top}\exp\left(  \Phi \beta\right)
\right\}
$<br>
and define $\mu^\beta=\exp\left(  \Phi \beta\right)  $ as the predictor of $\mu$.
The first-order conditions yield<br>
$
\sum_{a}\hat{\mu}_{a}\Phi_{ak}=\sum_{a}\mu^\beta_{a}\Phi_{ak}~\forall k
$<br>
and therefore $\beta$ is obtained by the same moment-matching procedure as before.


## Inference in GLM


* While the point estimate is unchanged with respect to the Poisson regression, the
inference changes as soon as one departs from the assumption that
$Var\left(  \mu_a | \phi_a \right)  =\exp(\phi_a^{\top}\beta)$. Denote the conditional variance of $\mu$ given $\phi$ by $V\left(  \mu | \phi \right) $.

* The estimation of $\beta$ is now seen as what is called an *M-estimation* procedure<br>
$
\max_{\beta} \sum_{a\in\mathcal{A} }F\left(  \mu_a,\beta \right)  .
$


* The derivation done for MLE still applies, replacing $\partial_{\beta}l\left(
\beta,\mu_a\right)  =\partial_{\beta}\log P\left(  \beta,\mu_a\right)  $ by
$\partial_{\beta}l\left(  \beta,\mu_a\right)  =\left(  \mu_a-\exp\left(
\phi_a^{\top}\beta\right)  \right)  \phi_a$, with the provision that the information matrix equality relating
$E_{P}\left[  \partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]$ and
$E_{P}\left[  \partial_{\beta} l\left(  \beta,\mu\right)  \partial_{\beta} l\left(
\beta,\mu\right)  ^{\top}\right]  $ no longer holds in general. Hence the two matrices do not simplify, and

$$
V\left(  \beta_{n}-\beta\right)   =\frac{1}{n}\left(  E_{P}\left[
\partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]  \right)  ^{-1} \times E_{P}\left(  \partial_{\beta}l\left(  \beta,\mu\right)  \left(
\partial_{\beta}l\left(  \beta,\mu\right)  \right)  ^{\top}\right) \times\left(  E_{P}\left[  \partial_{\beta}^{2}l\left(  \beta,\mu\right)
\right]  \right)  ^{-1}%
$$

* We have<br>
$
E_{P}\left[  \partial_{\beta}^{2}l\left(  \beta,\mu\right)  \right]  =-E\left[
\exp\left(  \phi^{\top}\beta\right)  \phi \phi^{\top}\right]
$<br>
and, with $V\left(  \mu | \phi \right)$ the conditional variance of $\mu$ given $\phi$,

$$
\begin{aligned}
E_{P}\left[  \partial_{\beta}l\left(  \beta,\mu\right)  \left(  \partial_{\beta
}l\left(  \beta,\mu\right)  \right)  ^{\top}\right]  &=E\left[  \left(
\mu-\exp\left(  \phi^{\top}\beta\right)  \right)  ^{2}\phi \phi^{\top}\right] \\
&=E\left[  V\left(  \mu | \phi \right)  \phi \phi^{\top}\right]  .
\end{aligned}
$$
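
A minimal numerical sketch of this sandwich formula follows. It simulates overdispersed counts (a Poisson-Gamma mixture, which is an assumption made here purely for illustration), estimates $\beta$ by Poisson pseudo-maximum likelihood with a Newton iteration, and replaces population expectations by sample averages. The matrix `A` below is the sample analogue of $E\left[\exp(\phi^{\top}\beta)\phi\phi^{\top}\right]$, which is minus the expected Hessian; the sign is immaterial because the Hessian enters the sandwich twice.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta_true = 2000, np.array([0.5, 0.3, -0.2])
Phi = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
mean = np.exp(Phi @ beta_true)
# Overdispersed counts: conditional mean is `mean`, conditional variance exceeds it
mu = rng.poisson(mean * rng.gamma(shape=2.0, scale=0.5, size=n))

# Poisson pseudo-ML estimate by Newton's method
beta = np.zeros(3)
for _ in range(50):
    pred = np.exp(Phi @ beta)
    beta += np.linalg.solve((Phi * pred[:, None]).T @ Phi, Phi.T @ (mu - pred))

pred = np.exp(Phi @ beta)
A = (Phi * pred[:, None]).T @ Phi / n                 # minus the average Hessian
B = (Phi * ((mu - pred) ** 2)[:, None]).T @ Phi / n   # average outer product of scores
A_inv = np.linalg.inv(A)

V_sandwich = A_inv @ B @ A_inv / n   # robust variance
V_ml = A_inv / n                     # variance under the Poisson assumption
se_sandwich = np.sqrt(np.diag(V_sandwich))
se_ml = np.sqrt(np.diag(V_ml))
print(se_sandwich, se_ml)
```

With overdispersed data the robust standard errors are typically larger than those computed under the Poisson variance assumption.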


## Poisson regression and duality


Consider $\hat{\mu} \in\mathbb{R}_{+}^{\mathcal{A} }$, $\beta\in \mathbb{R}^{k}$ and $\Phi$ an $\mathcal{A}\times k$ matrix.


> <span style="color:yellow"> **Theorem (Poisson duality)**. The primal problem
$$
\max_{\beta}\left\{ \hat{ \mu }^{\top}\Phi \beta-1_\mathcal{A}^{\top}\exp\left(  \Phi\beta\right)
\right\}
$$
has dual
$$
\min_{\bar{\mu}\in\mathbb{R}_{+}^{\mathcal{A}}}  \bar{\mu}^{\top}\left(  \ln\bar
{\mu}-1\right) \\
s.t.   \Phi^{\top}\left(  \hat{\mu} -\bar{\mu}\right)  =0.
$$</span>


**Proof**. Start from the dual problem and write its Lagrangian, with $\beta$ the multiplier associated with the constraint. Exchanging the min and the max, we get

$$
\min_{\bar{\mu}\geq0}\max_{\beta}\left\{\bar{\mu}^{\top}\left(  \ln\bar{\mu}-1\right)
-\left(  \bar{\mu}-\hat{\mu} \right)  ^{\top}\Phi\beta\right\} =\max_{\beta} \left\{\hat{\mu}^{\top}\Phi\beta+\min_{\bar{\mu}\geq0}\left\{  \bar{\mu}^{\top
}\left(  \ln\bar{\mu}-1\right)  -\bar{\mu}^{\top}\Phi\beta\right\}\right\}.
$$

The first-order condition of the inner minimization is $\ln\bar{\mu}=\Phi\beta$, at which $\bar{\mu}^{\top}\left(  \ln\bar{\mu}-1\right)
-\bar{\mu}^{\top}\Phi\beta=-\bar{\mu}^{\top}1=-1^{\top}\exp\left(  \Phi\beta\right)  $,
and hence the value of the problem is

$$
\max_{\beta} \hat{\mu}^{\top}\Phi\beta-1_\mathcal{A}^{\top}\exp\left(  \Phi\beta\right)  .
$$
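
The duality can be verified numerically on simulated data. The sketch below solves the primal by a Newton iteration (an illustrative choice of solver), sets $\bar{\mu}=\exp(\Phi\beta)$, and checks that $\bar{\mu}$ is feasible for the dual and that the two objective values coincide. All variable names are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
Phi = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
mu_hat = rng.poisson(np.exp(Phi @ np.array([0.2, 0.4, -0.3]))).astype(float)

# Solve the primal: max_beta mu_hat' Phi beta - 1' exp(Phi beta)
beta = np.zeros(3)
for _ in range(50):
    pred = np.exp(Phi @ beta)
    beta += np.linalg.solve((Phi * pred[:, None]).T @ Phi, Phi.T @ (mu_hat - pred))

mu_bar = np.exp(Phi @ beta)                  # candidate dual solution
primal = mu_hat @ Phi @ beta - mu_bar.sum()  # primal objective at the optimum
dual = mu_bar @ (np.log(mu_bar) - 1.0)       # dual objective at mu_bar
feasibility = Phi.T @ (mu_hat - mu_bar)      # dual constraint, should vanish
print(primal, dual, feasibility)
```

The dual constraint is nothing but the first-order condition of the primal, which is why `feasibility` vanishes at the optimum.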

# Discrete choice models

## Multinomial logit model


* Now assume that the observations are indexed by $\mathcal{A} = \mathcal{I} \times \mathcal{Y}$, where $\mathcal{I}$ is the set of decision-makers and $\mathcal{Y}$ is the set of alternatives. Consider the logit model where the utility that $i$ assigns to choice $y$ is
$$
\sum_{k}\Phi_{iy}^{k}\beta_{k}+\varepsilon_{iy}$$
where the $\varepsilon_{iy}$ are i.i.d. with a Gumbel distribution, i.e. with c.d.f.
$\exp\left(  -\exp\left(  -x\right)  \right)  $.

* The conditional probability that $i$ chooses $y$ is
$$
\mu^\beta_{iy}=\frac{\exp\left(  \sum_{k}\Phi_{iy}^{k}\beta_{k}\right)  }{\sum
_{y'}\exp\left(  \sum_{k}\Phi_{iy'}^{k}\beta_{k}\right)  }
$$
and therefore the conditional log-likelihood associated with $y$ is<br>
$
l_{iy}\left(  \beta\right)  =\log \mu^\beta_{iy}=\sum_{k}\Phi_{iy}^{k}\beta
_{k}-\log\sum_{y'}\exp\left(  \sum_{k}\Phi_{iy'}^{k}\beta_{k}\right)
$

* As a result, if $y\left(  i\right)  $ is the actual choice of $i$, and
$\hat{\mu}_{iy}=1\left\{  y=y\left(  i\right)  \right\}  $, the log-likelihood of the logistic
regression can be expressed as

$$
l\left(  \beta\right)  =\hat{\mu}^{\top}\Phi\beta-\sum_{i}\log\sum_{y}
\exp\left(  \left(  \Phi\beta\right)  _{iy}\right)
$$


* This is *almost*, but *not quite*, the form of a GLM (notice
the $\log$). To make the precise connection with GLM/Poisson regression, we
need to introduce *individual fixed effects*.



## Logistic regression as a Poisson regression + individual fixed effects


* Introduce a fixed effect $u_{i}$ and let $\theta=\left(  \beta^{\top
},u^{\top}\right)  ^{\top}$. We rewrite $\left(  \beta,u\right)
\rightarrow\left(  \left(  \Phi\beta\right)  _{iy}-u_{i}\right)  _{iy}$ in a
matrix form by defining

$$
X=
\begin{pmatrix}
\Phi & - M_\mathcal{I}^\top
\end{pmatrix}
$$
where $M_\mathcal{I} = I_{\mathcal{I} }\otimes 1^\top_{\mathcal{Y}}$ is the margining-out matrix (MOM) on the first dimension, and we have
$$
(X\theta)_{iy}=  \left(  \Phi\beta\right)  _{iy}-u_{i}.
$$


* The Poisson regression of $\hat{\mu}_{iy}$ on $X$ yields
$$
\max_{\beta,u}\left\{  -\sum_{iy}\exp\left(  \left(  \Phi\beta\right)
_{iy}-u_{i}\right)  +\sum_{iy}\hat{\mu}_{iy}\left(  \left(  \Phi
\beta\right)  _{iy}-u_{i}\right)  \right\}
$$
and therefore, since $\sum_{y}\hat{\mu}_{iy}=1$ for each $i$, the problem is
$$
\max_{\beta_k,u_i}\left\{  -\sum_{iy}\exp\left(  \left(  \Phi\beta\right)
_{iy}-u_{i}\right)  +\sum_{iy}\hat{\mu}_{iy}\left(  \Phi\beta\right)
_{iy}-\sum_{i}u_{i}\right\}. 
$$

* Taking first order conditions in $u_{i}$ we get
$$
\sum_{y}\exp\left(  \left(  \Phi\beta\right)  _{iy}-u_{i}\right)  =1
$$


* Therefore, $u_{i}=\log\sum_{y}\exp\left(  \left(  \Phi\beta\right)
_{iy}\right)  $ and, up to the additive constant $-|\mathcal{I}|$, the problem becomes the **logistic regression**
model
$$
\max_{\beta_k}\left\{  \sum_{iy}\hat{\mu}_{iy}\left(  \Phi\beta\right)
_{iy}-\sum_{i}\log\sum_{y}\exp\left(  \left(  \Phi\beta\right)
_{iy}\right)  \right\}  .
$$


* To summarize: 
> <span style="color:yellow">**Logistic regression = GLM + fixed effect**.</span>
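
The algebra above can be checked on a small random instance. The sketch below builds $X=(\Phi, -M_\mathcal{I}^\top)$ with `np.kron`, plugs in the concentrated fixed effects $u_i=\log\sum_y\exp((\Phi\beta)_{iy})$ for an arbitrary $\beta$, and compares the Poisson objective with the logit log-likelihood. It assumes that rows are ordered with $i$ varying slowest and $y$ fastest; sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nbi, nby, nbk = 5, 3, 2
Phi = rng.standard_normal((nbi * nby, nbk))   # rows ordered (i, y), y varying fastest
beta = rng.standard_normal(nbk)               # arbitrary parameter value

# One-hot encoding of hypothetical observed choices
choice = rng.integers(0, nby, size=nbi)
mu_hat = np.zeros((nbi, nby))
mu_hat[np.arange(nbi), choice] = 1.0
mu_hat = mu_hat.ravel()

M_I = np.kron(np.eye(nbi), np.ones((1, nby)))  # margining-out matrix, nbi x (nbi*nby)
X = np.hstack([Phi, -M_I.T])

U = (Phi @ beta).reshape(nbi, nby)
u = np.log(np.exp(U).sum(axis=1))              # concentrated fixed effects
theta = np.concatenate([beta, u])

index = X @ theta                              # equals (Phi beta)_iy - u_i
poisson_obj = mu_hat @ index - np.exp(index).sum()
logit_loglik = mu_hat @ (Phi @ beta) - u.sum()
print(poisson_obj, logit_loglik - nbi)
```

The two printed numbers agree: the Poisson objective with concentrated fixed effects equals the logit log-likelihood minus the constant $|\mathcal{I}|$.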

### Multinomial logit model in scikit-learn


In [None]:
import scipy.sparse as spr
import numpy as np

In [None]:
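nbi,nby,nbk = 100,4,6
nba = nbi*nby
np.random.seed(7)
# observed choice y(i) of each individual, and its one-hot encoding μ_i_y
y_i = np.random.randint(0,nby,nbi)
μ_i_y = spr.csr_matrix( (np.ones(nbi), (np.arange(nbi), y_i )), shape=(nbi,nby)).toarray()
μ_a = μ_i_y.flatten()   # row-major: i varies slowest, y fastest
Φ_a_k = np.random.randn(nba,nbk)
# X = (Φ, -M_I^T): the nbk columns of Φ come first, then the nbi fixed-effect columns
X_a_l = spr.hstack([spr.csr_matrix(Φ_a_k), -spr.kron(spr.identity(nbi), np.ones((nby,1)))], format='csr')
# alpha=0 removes the default L2 penalty, so that the fit is the unpenalized Poisson MLE
logistic_as_poisson = linear_model.PoissonRegressor(alpha=0, fit_intercept=False, max_iter=1000)
logistic_as_poisson.fit(X_a_l, μ_a)
print('β_k = ', logistic_as_poisson.coef_[:nbk])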

# Trade models

## Gravity equation


* The gravity model seeks to explain the trade flows $\hat{\mu}_{xy}$
from country $x$ to country $y$ by using various measures of proximity between
these countries. (We assume $\hat{\mu}_{xx}=0$.)

* We denote
$$
\left\{
\begin{array}{l}
n_x=\sum_{y}\hat{\mu}_{xy}\\
m_y=\sum_{x}\hat{\mu}_{xy}
\end{array}
\right.
$$

the total volume of the exports of country $x$ and of the imports of country
$y$, respectively.

* We have the accounting equation
$$
\sum_{x}n_x=\sum_{xy}\hat{\mu}_{xy}=\sum_{y}m_y%
$$
and (by simply rescaling) we can without loss of generality assume that these
quantities sum to one.

* The *gravity model* assumes
$$
E\left[  \hat{\mu}_{xy}|\Phi\right]  =\exp\left(  \left(  \Phi\beta\right)
_{xy}-u_{x}-v_{y}\right)
$$

where $u_{x}$ and $v_{y}$ are resistance terms, or country-specific fixed
effects. This is a GLM with two-way fixed effects. We need to rewrite the map $\left(
\beta,u,v\right)  \rightarrow\left(  \left(  \Phi\beta\right)  _{xy}
-u_{x}-v_{y}\right)  _{xy}$ in matrix form, again using vectorization and
Kronecker products.

* Hence:
> <span style="color:yellow">**Gravity equation = GLM + two-way fixed effects** </span>



## Fixed effects and Kronecker products


* Set up
$$
X=%
\begin{pmatrix}
\Phi & - M_\mathcal{X}^\top & -M_\mathcal{Y}^\top
\end{pmatrix}
$$

where $M_\mathcal{X} = I_{\mathcal{X} }\otimes 1^\top_{\mathcal{Y}}$ and  $M_\mathcal{Y} = 1^\top_{\mathcal{X}} \otimes I_{\mathcal{Y}} $ are the MOM matrices on the first and the second margins respectively.

* Taking parameter $\theta=\left(  \beta^{\top},u^{\top},v^{\top}\right)
^{\top}$, we have
$$
X\theta=vec\left(  \left(  \left(  \Phi\beta\right)  _{xy}-u_{x}%
-v_{y}\right)  _{xy}\right)  .
$$


* We can therefore write our regression of the dependent variable $\hat{\mu}_{xy}$ on $X$ as
the Poisson regression

$$
\max_{\theta}\left\{  \hat{\mu}^{\top}X\theta-1^{\top}\exp\left(  X\theta\right)
\right\}
$$
which becomes
$$
\max_{\beta,u,v}\left\{  \sum_{xy}\hat{\mu}_{xy}\left(  \left(  \Phi
\beta\right)  _{xy}-u_{x}-v_{y}\right)  -\sum_{xy}\exp\left(  \left(
\Phi\beta\right)  _{xy}-u_{x}-v_{y}\right)  \right\}
$$
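
The construction of $X$ can be sketched with `np.kron`. The example below assumes a row-major vectorization (index $x$ varying slowest, $y$ fastest), for which the margining-out matrices are $I_{\mathcal{X}}\otimes 1_{\mathcal{Y}}^{\top}$ and $1_{\mathcal{X}}^{\top}\otimes I_{\mathcal{Y}}$. The sizes are deliberately different so that a wrong ordering of the Kronecker factors would show up as a shape error; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
nbx, nby, nbk = 4, 3, 2
Phi = rng.standard_normal((nbx * nby, nbk))   # rows ordered (x, y), y varying fastest
beta = rng.standard_normal(nbk)
u = rng.standard_normal(nbx)
v = rng.standard_normal(nby)

M_X = np.kron(np.eye(nbx), np.ones((1, nby)))   # sums over y: nbx x (nbx*nby)
M_Y = np.kron(np.ones((1, nbx)), np.eye(nby))   # sums over x: nby x (nbx*nby)
X = np.hstack([Phi, -M_X.T, -M_Y.T])
theta = np.concatenate([beta, u, v])

lhs = (X @ theta).reshape(nbx, nby)
rhs = (Phi @ beta).reshape(nbx, nby) - u[:, None] - v[None, :]

# The same matrices compute the margins of any flow matrix
mu = rng.random((nbx, nby))
row_margins = M_X @ mu.ravel()
col_margins = M_Y @ mu.ravel()
print(np.abs(lhs - rhs).max())
```

In practice these matrices are sparse, so `scipy.sparse.kron` would be a natural substitute for `np.kron` on larger problems.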


## Gravity as max-entropy


* By the Poisson duality theorem above, applied with $X$ in place of $\Phi$, the dual to this program is
$$ \min_{\mu_{xy}\geq0}\sum_{xy}\mu_{xy}\ln\mu_{xy}-\sum_{xy}\mu_{xy}\\
s.t.~ \sum_y\mu_{xy}=n_x,~\sum_x\mu_{xy}=m_y\\
 \sum_{xy}\mu_{xy}\Phi_{xy}^{k}=\sum_{xy}\hat{\mu}_{xy}\Phi_{xy}^{k}
$$

* Since the margin constraints imply $\sum_{xy}\mu_{xy}=\sum_{x}n_x=1$, the term $-\sum_{xy}\mu_{xy}$ is constant on the feasible set. The previous program therefore looks,
among the $\mu_{xy}$ that have the same margins and moments as
$\hat{\mu}$, for the one that maximizes the entropy $-\sum_{xy}\mu_{xy}\ln\mu_{xy}$.
Rewrite as

$$
\max_{\mu_{xy}\geq0}\left\{  -\sum_{xy}\mu_{xy}\ln\mu_{xy}\right\} \\
s.t.~  \sum_y\mu_{xy}=n_x,~\sum_x\mu_{xy}=m_y\\
\sum_{xy}\mu_{xy}\Phi_{xy}^{k}=\sum_{xy}\hat{\mu}_{xy}\Phi_{xy}^{k}%
$$


# Matching models


* Becker (1973) describes the following model of the labor market, the
marriage market, and other matching markets. Consider a population with a
share $n_x$ of men of type $x$ and a share $m_y$ of women of type $y$,
assuming that men and women come in equal numbers. Assume that if $x$ and $y$
match, this generates a joint surplus (sum of their utilities) $\Phi_{xy}$.


* Let $\mu_{xy}$ be the fraction of couples $xy$ that are formed at
equilibrium. Becker shows that the equilibrium maximizes the total surplus
$\sum_{xy}\mu_{xy}\Phi_{xy}$ out of all the feasible matchings, which are
those with
$$
\sum_{y}\mu_{xy}=n_x\text{ and }\sum_{x}\mu_{xy}=m_y.
$$

* Therefore, the equilibrium matching $\mu_{xy}$ should solve
$$
\max_{\mu_{xy}\geq0} \sum_{xy}\mu_{xy}\Phi_{xy}\\
s.t.~  \sum_{y}\mu_{xy}=n_x\text{ and }\sum_{x}\mu_{xy}=m_y.
$$

* Choo and Siow (2006) and Dupuy and Galichon (2015) consider a variant
of this model with entropic regularization

$$
\max_{\mu_{xy}\geq0} \sum_{xy}\mu_{xy}\Phi_{xy}-\sigma\sum_{xy}\mu_{xy}%
\ln\mu_{xy}\\
s.t.~  \sum_{y}\mu_{xy}=n_x\text{ and }\sum_{x}\mu_{xy}=m_y.
$$


* We shall see that we can parametrically estimate $\Phi$ in this model by
the same tools as for the gravity equation.

## Back to gravity equation


* Consider the previous program
$$ \max_{\mu_{xy}\geq0}\left\{  -\sum_{xy}\mu_{xy}\ln\mu_{xy}\right\} \\
s.t.~ \sum_y\mu_{xy}=n_x,~\sum_x\mu_{xy}=m_y\\
 \sum_{xy}\mu_{xy}\Phi_{xy}^{k}=\sum_{xy}\hat{\mu}_{xy}\Phi_{xy}^{k}%
$$

and rewrite as
$$
\max_{\mu_{xy}\geq0}\left\{  -\sum_{xy}\mu_{xy}\ln\mu_{xy}+\min_{\left(
\lambda_{k}\right)  }\left\{  \sum_{xyk}\left(  \mu_{xy}-\hat{\mu}_{xy}\right)  \Phi_{xy}^{k}\lambda_{k}\right\}  \right\} \\
s.t.~ \sum_y\mu_{xy}=n_x,~\sum_x\mu_{xy}=m_y
$$

* By the strong duality theorem, this is
$$
\min_{\left(  \lambda_{k}\right)  }\left\{  W\left(  \lambda\right)  -\sum
_{xyk}\hat{\mu}_{xy}\Phi_{xy}^{k}\lambda_{k}\right\}
$$
where we recover
$$
W\left(  \lambda\right)  =\max_{\mu_{xy}\geq0} \left\{  \sum_{xyk}\mu
_{xy}\Phi_{xy}^{k}\lambda_{k}-\sum_{xy}\mu_{xy}\ln\mu_{xy}\right\} \\
s.t.~ \sum_y\mu_{xy}=n_x,~\sum_x\mu_{xy}=m_y
$$
which is the matching surplus with entropic regularization, for $\sigma=1$ and joint surplus $\Phi_{xy}=\sum_{k}\Phi_{xy}^{k}\lambda_{k}$.
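
For a fixed $\lambda$, the inner problem defining $W(\lambda)$ can be solved by alternately rescaling rows and columns (the iterative proportional fitting procedure, also known as Sinkhorn's algorithm). This solver is not named in the text above; it is used here as one standard way to find $u_x$ and $v_y$ such that $\mu_{xy}=\exp\left((\Phi\lambda)_{xy}-u_x-v_y\right)$ has the prescribed margins. The matrix `K`, standing for $\exp((\Phi\lambda)_{xy})$, is drawn at random for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nbx, nby = 4, 3
n = np.full(nbx, 1.0 / nbx)                   # row margins n_x (sum to one)
m = np.full(nby, 1.0 / nby)                   # column margins m_y (sum to one)
K = np.exp(rng.standard_normal((nbx, nby)))   # stands for exp((Phi lambda)_xy)

# Iterative proportional fitting: a_x = exp(-u_x), b_y = exp(-v_y)
a, b = np.ones(nbx), np.ones(nby)
for _ in range(500):
    a = n / (K @ b)      # enforce row margins
    b = m / (K.T @ a)    # enforce column margins

mu = a[:, None] * K * b[None, :]
u, v = -np.log(a), -np.log(b)

# Value of the problem, computed from the primal and from the fixed effects
W_primal = np.sum(mu * np.log(K)) - np.sum(mu * np.log(mu))
W_dual = n @ u + m @ v
print(W_primal, W_dual)
```

At the fixed point, $\ln\mu_{xy}=(\Phi\lambda)_{xy}-u_x-v_y$, so the value reduces to $\sum_x n_x u_x+\sum_y m_y v_y$, which is what the comparison of `W_primal` and `W_dual` illustrates.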

