Structural Equation Modelling (SEM) allows us to explore the relationships between variables and to confirm the structure of a proposed model.

<img src="images/1.png" alt="Drawing"  align='left' style="width:400px;"/>

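There are two types of variables in SEM: manifest (directly observed) variables and latent (unobserved) variables, which are measured indirectly through manifest variables: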


<img src="images/2.png" alt="Drawing"  align='left' style="width:400px;"/>

In this example, we will use Block Design, Digit Span and Matrix Reason as __manifest variables__ to indirectly measure the latent variable __Intelligence__.
<img src="images/3.png" alt="Drawing" align='left' style="width:400px;"/>

In [3]:
# Load the lavaan library
library(lavaan)

# Look at the dataset
data(HolzingerSwineford1939)
head(HolzingerSwineford1939[ , 7:15])

This is lavaan 0.6-3
lavaan is BETA software! Please report any bugs.


x1,x2,x3,x4,x5,x6,x7,x8,x9
3.333333,7.75,0.375,2.333333,5.75,1.2857143,3.391304,5.75,6.361111
5.333333,5.25,2.125,1.666667,3.0,1.2857143,3.782609,6.25,7.916667
4.5,5.25,1.875,1.0,1.75,0.4285714,3.26087,3.9,4.416667
5.333333,7.75,3.0,2.666667,4.5,2.4285714,3.0,5.3,4.861111
4.833333,4.75,0.875,2.666667,4.0,2.5714286,3.695652,6.3,5.916667
5.333333,5.0,2.25,1.0,3.0,0.8571429,4.347826,6.65,7.5


The data consist of scores on 9 mental ability tests taken by seventh- and eighth-grade children from two different schools.
A Confirmatory Factor Analysis (CFA) model that is often proposed for these 9 variables consists of three latent variables (or factors), each with three indicators:
* a visual factor measured by 3 manifest variables: x1, x2 and x3
* a textual factor measured by 3 manifest variables: x4, x5 and x6
* a speed factor measured by 3 manifest variables: x7, x8 and x9

In the R environment, a regression formula has the following form:
$$y \sim x1 + x2 + x3 + x4$$

In this formula, the tilde ($ \sim $) is the regression operator. On the left-hand side of the operator, we have the
dependent variable (y), and on the right-hand side, we have the independent variables, separated by the $+$ operator. In `lavaan`, a typical model is simply a set (or system) of regression formulas, where some variables (starting with an ‘f’ below) may be latent. For example:

$ y \sim f1 + f2 + x1 + x2$

$f1 \sim f2 + f3$

$f2 \sim f3 + x1 + x2$

If we have __latent__ variables in any of the regression formulas, we must ‘define’ them by listing their (__manifest__ or latent) indicators. We do this by using the special operator $=\sim$, which can be read as *is measured by*. For example, to define the three latent variables $f1$, $f2$ and $f3$, we can use something like:

$ f1 =\sim y1 + y2 + y3$

$ f2 =\sim y4 + y5 + y6$

$ f3 =\sim y7 + y8 + y9 + y10$

Furthermore, variances and covariances are specified using a ‘double tilde’ operator, for example:

$y1 \sim \sim y1 $ # variance 

$y1 \sim \sim y2 $ # covariance

$f1 \sim \sim f2 $ # covariance

And finally, intercepts for observed and latent variables are specified as simple regression formulas whose only predictor is the number ‘1’, which explicitly denotes the intercept:

$ y1 \sim 1$

$ f1 \sim 1$

Using these four formula types, a large variety of latent variable models can be described. This is summarized in the table below.

| formula type | operator | mnemonic|
|-|-|-|
|latent variable definition| =~ | is measured by|
|regression | ~| is regressed on|
|(residual) (co)variance |~~ | is correlated with|
|intercept| ~ 1| intercept|

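As an illustration of the syntax above, the three-factor CFA model described earlier for the Holzinger and Swineford data can be written with three `=~` definitions. This is a minimal sketch: the object names `hs.model` and `hs.fit` are illustrative choices, not names used elsewhere in this notebook.

```r
library(lavaan)

# Three latent variables, each measured by three manifest variables
hs.model <- '
  visual  =~ x1 + x2 + x3
  textual =~ x4 + x5 + x6
  speed   =~ x7 + x8 + x9
'

# Fit the model as a confirmatory factor analysis
hs.fit <- cfa(model = hs.model, data = HolzingerSwineford1939)

# Degrees of freedom of the fitted model
hs.df <- as.numeric(fitMeasures(hs.fit, "df"))
print(hs.df)
```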
### Model terms
#### Degrees of freedom:
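> Determined by the number of manifest variables and the number of estimated values:
>
> $df = Possible \ values - Estimated \ values$
>
> $Possible \ values = \frac{Manifest \ variables \times (Manifest \ variables + 1)}{2}$
>
> Here, the possible values are the unique variances and covariances among the manifest variables.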

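The degrees-of-freedom formula can be checked by hand with a few lines of base R. The helper `sem_df()` below is a hypothetical function written for this illustration (it is not part of lavaan). The two calls use the counts reported in the `summary()` output of the models fitted in this notebook: 6 manifest variables with 12 free parameters, and 4 manifest variables with 8 free parameters.

```r
# Hypothetical helper: degrees of freedom from the number of manifest
# variables and the number of estimated (free) parameters
sem_df <- function(n_manifest, n_estimated) {
  # Unique variances and covariances among the manifest variables
  possible <- n_manifest * (n_manifest + 1) / 2
  possible - n_estimated
}

df_text <- sem_df(n_manifest = 6, n_estimated = 12)     # 21 - 12 = 9
df_politics <- sem_df(n_manifest = 4, n_estimated = 8)  # 10 - 8 = 2
print(c(df_text, df_politics))
```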
#### Model identification
* Include at least three manifest variables per latent variable
* Create models with df > 0
* Use scaling (of the model, not the variables: by default, lavaan fixes the loading of the first manifest variable of each latent variable to 1) and constraints to control df

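To see why three manifest variables is only the minimum, consider a single latent variable with exactly three indicators: it has 6 possible values and 6 estimated values, so df = 0. The sketch below fits such a model under two scalings, lavaan's default (first loading fixed to 1) and `std.lv = TRUE` (latent variance fixed to 1). The names `small.model`, `fit.marker` and `fit.stdlv` are illustrative.

```r
library(lavaan)

# One latent variable with exactly three indicators
small.model <- 'textual =~ x4 + x5 + x6'

# Default scaling: the loading of the first indicator (x4) is fixed to 1
fit.marker <- cfa(model = small.model, data = HolzingerSwineford1939)

# Alternative scaling: latent variance fixed to 1, all loadings estimated
fit.stdlv <- cfa(model = small.model, data = HolzingerSwineford1939,
                 std.lv = TRUE)

# Both scalings estimate 6 values, so both leave df = 6 - 6 = 0
df.marker <- as.numeric(fitMeasures(fit.marker, "df"))
df.stdlv <- as.numeric(fitMeasures(fit.stdlv, "df"))
print(c(df.marker, df.stdlv))
```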
Let's create a new model of textual speed. The variables x4, x5, and x6 represent reading comprehension and understanding word meaning; x7, x8, and x9 represent speed counting and addition. The model will have one latent variable that predicts scores on these six manifest variables.

In [8]:
head(HolzingerSwineford1939)

id,sex,ageyr,agemo,school,grade,x1,x2,x3,x4,x5,x6,x7,x8,x9
1,1,13,1,Pasteur,7,3.333333,7.75,0.375,2.333333,5.75,1.2857143,3.391304,5.75,6.361111
2,2,13,7,Pasteur,7,5.333333,5.25,2.125,1.666667,3.0,1.2857143,3.782609,6.25,7.916667
3,2,13,1,Pasteur,7,4.5,5.25,1.875,1.0,1.75,0.4285714,3.26087,3.9,4.416667
4,1,13,2,Pasteur,7,5.333333,7.75,3.0,2.666667,4.5,2.4285714,3.0,5.3,4.861111
5,2,12,2,Pasteur,7,4.833333,4.75,0.875,2.666667,4.0,2.5714286,3.695652,6.3,5.916667
6,2,14,1,Pasteur,7,5.333333,5.0,2.25,1.0,3.0,0.8571429,4.347826,6.65,7.5


In [5]:
# Define your model specification
text.model <- 'textspeed =~ x4 + x5 + x6 + x7 + x8 + x9'

In [7]:
# Analyze the model with cfa()
text.fit <- cfa(model = text.model, data = HolzingerSwineford1939)

# Summarize the model
summary(text.fit)

lavaan 0.6-3 ended normally after 20 iterations

  Optimization method                           NLMINB
  Number of free parameters                         12

  Number of observations                           301

  Estimator                                         ML
  Model Fit Test Statistic                     149.786
  Degrees of freedom                                 9
  P-value (Chi-square)                           0.000

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  textspeed =~                                        
    x4                1.000                           
    x5                1.130    0.067   16.946    0.000
    x6                0.925    0.056   16.424    0.000
    x7                0.196    0.067    2.918    0.004
    x8                0.186

In [6]:
# Look at the dataset
data(PoliticalDemocracy)
head(PoliticalDemocracy)

# Define your model specification
politics.model <- 'poldemo60 =~ y1 + y2 + y3 + y4'

y1,y2,y3,y4,y5,y6,y7,y8,x1,x2,x3
2.5,0.0,3.333333,0.0,1.25,0.0,3.72636,3.333333,4.442651,3.637586,2.557615
1.25,0.0,3.333333,0.0,6.25,1.1,6.666666,0.736999,5.384495,5.062595,3.568079
7.5,8.8,9.999998,9.199991,8.75,8.094061,9.999998,8.211809,5.961005,6.25575,5.224433
8.9,8.8,9.999998,9.199991,8.907948,8.127979,9.999998,4.615086,6.285998,7.567863,6.267495
10.0,3.333333,9.999998,6.666666,7.5,3.333333,9.999998,6.666666,5.863631,6.818924,4.573679
7.5,3.333333,6.666666,6.666666,6.25,1.1,6.666666,0.3685,5.533389,5.135798,3.89227


In [9]:
politics.fit <- cfa(model = politics.model, data = PoliticalDemocracy)
summary(politics.fit)

lavaan 0.6-3 ended normally after 26 iterations

  Optimization method                           NLMINB
  Number of free parameters                          8

  Number of observations                            75

  Estimator                                         ML
  Model Fit Test Statistic                      10.006
  Degrees of freedom                                 2
  P-value (Chi-square)                           0.007

Parameter Estimates:

  Information                                 Expected
  Information saturated (h1) model          Structured
  Standard Errors                             Standard

Latent Variables:
                   Estimate  Std.Err  z-value  P(>|z|)
  poldemo60 =~                                        
    y1                1.000                           
    y2                1.404    0.197    7.119    0.000
    y3                1.089    0.167    6.529    0.000
    y4                1.370    0.167    8.228    0.000

Variances:
               

### Model assessment

<img src="images/4.png" alt="Drawing" align='left' style="width:600px;"/>

<img src="images/5.png" alt="Drawing" align='left' style="width:500px;"/>



Source: <http://davidakenny.net/cm/fit.htm>

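Individual fit indices can also be extracted as numbers instead of being read off printed output. The sketch below uses lavaan's `fitMeasures()` on the text-speed model, which is refit here only so that the block is self-contained. The selection of indices and the name `fit.indices` are illustrative choices.

```r
library(lavaan)

# Refit the one-factor text-speed model
text.model <- 'textspeed =~ x4 + x5 + x6 + x7 + x8 + x9'
text.fit <- cfa(model = text.model, data = HolzingerSwineford1939)

# Request a named subset of fit measures
fit.indices <- fitMeasures(text.fit,
                           c("chisq", "df", "pvalue",
                             "cfi", "tli", "rmsea", "srmr"))

# Round for display
print(round(fit.indices, 3))
```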
In [10]:
summary(text.fit, fit.measures=TRUE, standardized=TRUE)

lavaan 0.6-3 ended normally after 20 iterations

  Optimization method                           NLMINB
  Number of free parameters                         12

  Number of observations                           301

  Estimator                                         ML
  Model Fit Test Statistic                     149.786
  Degrees of freedom                                 9
  P-value (Chi-square)                           0.000

Model test baseline model:

  Minimum Function Test Statistic              681.336
  Degrees of freedom                                15
  P-value                                        0.000

User model versus baseline model:

  Comparative Fit Index (CFI)                    0.789
  Tucker-Lewis Index (TLI)                       0.648

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)              -2476.130
  Loglikelihood unrestricted model (H1)      -2401.237

  Number of free parameters                         12
  Akaike (AIC)  