------

<b>The Data Simulation Process of Boston Housing Dataset</b> _ Computational Statistics Project _ [Synthetic Data](https://github.com/solmazahmadi/Computational-Statistics-Project/tree/master/Synthetic_Data)

------



<b>Description</b>  We generate synthetic data from a dataset with different variable types, including character, numeric, ordered factor, and unordered factor variables. The simulate_dataset function randomly samples numeric variables and ordered factors from a multivariate normal distribution.

-------

<b>Function</b>

simulate_dataset (dataset, digits=2, n=NA, use.levels=TRUE, use.miss=TRUE, mvt.method="eigen", het.ML=FALSE, het.suppress=TRUE, stealth.level=1, level3.noise=FALSE, ignore=NA)

-----

<b>Arguments</b>

<b>dataset</b>         the data frame from which to generate a randomized version.

<b>digits </b>         the number of digits after the decimal point to include in the new values.

<b>n </b>              number of rows in the new data frame. Equal to the number of rows in the original if set to NA, the default.

<b>use.levels</b>     when set to TRUE, gives the simulated factor variables the same number of levels as the original.

<b>use.miss</b>        when set to TRUE, inserts missing data as present in the original (i.e. based on the distribution of missingness in the original data).

<b>mvt.method</b>      specifies the matrix decomposition to be used in sampling from the multivariate normal.

<b>het.ML</b>          as per the hetcor function, if TRUE, compute maximum-likelihood estimates; if FALSE, compute quick two-step estimates in computing the heterogeneous correlation matrix.

<b>het.suppress</b>    when set to TRUE, suppresses stops from the hetcor function.

<b>stealth.level</b>   when set to 1 (default), takes into account the covariances between all the unordered factors and the covariances between the numeric and ordered factors. When set to 2, simulates each variable independently. When set to 3, does not take into account any covariances and instead randomly samples from a uniform distribution ranging from the min to the max of the data for each variable.

<b>level3.noise</b>    when set to TRUE, adds Gaussian noise to the min and max parameters for the uniform distribution in stealth.level 3. The noise term has a variance of one fourth of the range of the data for any particular variable.

<b>ignore</b>          specifies which columns to ignore (i.e. to leave as is instead of simulating). Takes in a list of column names as input.
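
As a rough illustration of what the stealth.level 3 and level3.noise descriptions amount to, the sketch below draws one numeric column from a uniform distribution between the observed min and max, optionally perturbing both bounds with Gaussian noise whose variance is one fourth of the range. The helper name `sample_uniform_level3` and the toy vector are hypothetical and are not part of the documented function; this only mirrors the argument descriptions above.

```r
# Hypothetical helper illustrating the stealth.level = 3 description.
sample_uniform_level3 <- function(x, n = length(x), noise = FALSE) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  if (noise) {
    # Noise variance is one fourth of the range, so sd = sqrt(range / 4).
    noise_sd <- sqrt((hi - lo) / 4)
    lo <- lo + rnorm(1, mean = 0, sd = noise_sd)
    hi <- hi + rnorm(1, mean = 0, sd = noise_sd)
  }
  # Guard against the perturbed bounds crossing each other.
  runif(n, min = min(lo, hi), max = max(lo, hi))
}

set.seed(1)
x <- c(2.1, 3.4, 5.0, 7.8, 9.3)   # toy data, assumed for illustration
sim <- sample_uniform_level3(x, n = 100)
range(sim)  # stays within the observed min and max when noise = FALSE
```

Because each column is drawn on its own, a sample produced this way carries no information about correlations between columns.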

---------



<b>Details</b>

This function randomly samples each character and factor variable from the population distribution given in the original dataset. It simulates numeric variables and ordered factors from a multivariate normal distribution. When both numeric variables and ordered factors are included, a heterogeneous correlation matrix is used, coercing the means of the ordered factor variables to be 0. The function only accounts for between-column correlations for numeric and ordered factor variables. Each unordered factor and character column is treated as independent.

The order of the columns in the simulated dataset may differ from the order of the original dataset, since the function puts the numeric and ordered factor data in front and the character and unordered factor data afterwards. The column names stay consistent, however.
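
The multivariate normal step can be sketched in base R. The example below scales a correlation matrix into a covariance matrix using standard deviations, then draws correlated samples with a Cholesky factor. The numbers are toy values assumed for illustration, and the Cholesky approach is a stand-in chosen to avoid package dependencies; it is not necessarily the decomposition the function uses (see mvt.method).

```r
# Toy inputs (assumed): a 2 x 2 correlation matrix, standard deviations, means.
# The second variable plays the role of an ordered factor: mean 0, sd 1.
R   <- matrix(c(1.0, 0.6,
                0.6, 1.0), nrow = 2, byrow = TRUE)
sds <- c(2, 1)
mu  <- c(10, 0)

# Covariance = D %*% R %*% D, where D is the diagonal matrix of sds.
D     <- diag(sds)
Sigma <- D %*% R %*% D

# z has independent N(0, 1) columns; z %*% chol(Sigma) has covariance Sigma,
# because chol() returns the upper triangular U with t(U) %*% U == Sigma.
set.seed(1)
n <- 5000
z <- matrix(rnorm(n * 2), nrow = n)
x <- sweep(z %*% chol(Sigma), 2, mu, "+")

round(cov(x), 2)       # should be close to Sigma
round(colMeans(x), 2)  # should be close to mu
```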

--------------

In [55]:
# Required packages
library(MASS)
library(mvtnorm)
library(polycor)

In [56]:
# Read the original data from GitHub
df = read.csv(file="https://raw.githubusercontent.com/solmazahmadi/Computational-Statistics-Project/master/Data/Data_Original.csv")

In [57]:
# Convert categorical columns to factors and the remaining columns to numeric

#Factors
df$CHAS = as.factor(df$CHAS)
df$ZN = as.factor(df$ZN)
df$RAD = as.factor(df$RAD)

#Numeric
df$CRIM = as.numeric(as.character(df$CRIM))
df$INDUS = as.numeric(as.character(df$INDUS))
df$NOX = as.numeric(as.character(df$NOX))
df$RM = as.numeric(as.character(df$RM))
df$AGE = as.numeric(as.character(df$AGE))
df$DIS = as.numeric(as.character(df$DIS))
df$TAX = as.numeric(as.character(df$TAX))
df$PTRATIO = as.numeric(as.character(df$PTRATIO))
df$B = as.numeric(as.character(df$B))
df$LSTAT = as.numeric(as.character(df$LSTAT))
df$MEDV = as.numeric(as.character(df$MEDV))
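
`as.factor` creates unordered factors, which is what the `is.ordered` check in the function below tests for. As a small sketch with a toy vector (illustrative values, not the real column), this is how an unordered and an ordered factor differ. Whether a given variable should be treated as ordered is a modelling decision that this sketch does not settle.

```r
# Toy column standing in for a coded categorical variable (illustrative values).
rad <- c(1, 4, 24, 4, 8, 1)

unordered <- as.factor(rad)
ordered_f <- factor(rad, levels = sort(unique(rad)), ordered = TRUE)

is.ordered(unordered)  # FALSE
is.ordered(ordered_f)  # TRUE
levels(ordered_f)      # "1" "4" "8" "24"
```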

In [58]:
set.seed(13)
SimulateDataset = function(dataset, digits=2, n=NA, 
  use.names=TRUE, use.levels=TRUE, use.miss=TRUE,
  mvt.method="eigen", het.ML=FALSE, het.suppress=TRUE){

  # This function takes as argument an existing dataset, which 
  # must be either a matrix or a data frame. Each column of the 
  # dataset must consist either of numeric variables or ordered
  # factors. When one or more ordered factors are included,
  # then a heterogeneous correlation matrix is computed using 
  # John Fox's polycor package. Pairwise complete observations 
  # are used for all covariances, and the exact pattern of 
  # missing data present in the input is placed in the output,
  # provided a new sample size is not requested. Warnings from
  # the hetcor function are suppressed.

  require(mvtnorm)
  require(polycor)

  # requires data frame or matrix
  if(!(is.data.frame(dataset) || is.matrix(dataset))){
    stop("Data must be a data frame or matrix")
  }

  # organization
  row <- dim(dataset)[1] # number of rows
  if(is.na(n))(n <- row) # sets unspecified sample size to num rows
  col <- dim(dataset)[2] # number of columns
  del <- is.na(dataset)  # records position of NAs in dataset
  if(n!=row){
    select <- round(runif(n, 0.5, row+.49),0)
    del <- del[select,]
  }
  num <- rep(NA, col)    # see what's not a factor
  ord <- rep(NA, col)    # see what's an ordered factor

  # which columns are numeric (the others are factors)?
  for (i in 1:col){
    num[i] <- is.numeric(dataset[,i])
    ord[i] <- is.ordered(dataset[,i])
  }

  # check for unordered factors
  location <- !(num|ord)
  unorder <- sum(location)

  if(unorder>0)warning(
    paste("Unordered factor detected in variable(s):", 
      names(dataset)[location]
    )
  )


  # there are factors
  # if there are factors, we start here

  # find the variable means (constrain to zero for factors)
  mixedMeans <- rep(0, col)
  mixedMeans[num] <- apply(dataset[,num], 2, mean, na.rm=TRUE)

  # estimate a heterogeneous correlation matrix
  if (het.suppress==TRUE){
    suppressWarnings(het <- hetcor(dataset, ML=het.ML))
  } else (het <- hetcor(dataset, ML=het.ML))
  mixedCov <- het$correlations

  # make a diagonal matrix of standard deviations to turn the 
  # correlation matrix into a covariance matrix
  stand <- matrix(0, col, col)
  diag(stand) <- rep(1, col)
  diag(stand)[num] <- apply(dataset[,num], 2, sd, na.rm=TRUE)
  # pre and post multiply hetero cor matrix by diagonal sd matrix
  mixedCov <- stand %*% mixedCov %*% stand

  # generate the data (n rows, so that a requested sample size is honored)
  fake <- as.data.frame(rmvnorm(n, mixedMeans, mixedCov, method=mvt.method))

  # insert the missing data, if so requested
  if(use.miss==TRUE)(fake[del] <- NA)

  # turn the required continuous variables into factors
  for (i in (1:col)[!num]){
    # the original data for this column
    old <- dataset[,i]
   
    # the new data for this column, omitting NAs
    new <- fake[!is.na(fake[,i]),i]

    # what are the levels of the original factor?
    lev <- levels(old)

    # establish cutpoints in new variable from cdf of old factor
    cut <- cumsum(table(old))/(sum(!is.na(old)))

    # put continuous variable into a matrix, repeating value across columns
    wide <- matrix(new, length(new), length(lev))

    # put the cutpoints in a matrix, repeating the cut point values across rows
    crit <- matrix(quantile(new, cut), length(new), length(lev), byrow=TRUE)

    # for each value (row of the wide matrix), 
    # how many cutpoints is the value greater than?
    # number of cutpoints surpassed=category
    fake[!is.na(fake[,i]),i] <- apply(wide>crit, 1, sum)

    # make it a factor
    fake[,i] <- factor(fake[,i], ordered=TRUE)

    # give the new factor the same levels as the old variable
    if(length(levels(fake[,i]))!=length(lev))message(
      paste("Fewer categories in simulated variable", 
      names(fake)[i], "than in input variable", names(dataset)[i]))
    if(use.levels==TRUE&(length(levels(fake[,i]))==length(lev))){
      levels(fake[,i]) <- lev} else (levels(fake[,i]) <- 1:length(lev))
  }

  # round the data to the requested digits
  fake[,num] <- round(fake[,num], digits)

  # give the variables names, if so requested
  if(use.names==TRUE)(names(fake) <- names(dataset))
  
  # return the new data
  return(fake)
}
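
The cutpoint step inside the loop above can be looked at in isolation. The sketch below uses toy data (assumed for illustration) and swaps the wide/crit matrix comparison for base R's `findInterval`, which counts the same thing: how many cutpoints each value strictly exceeds. It is meant as an equivalent illustration of the idea, not as a replacement for the function's code.

```r
set.seed(1)
# Toy ordered factor standing in for a column of the original data.
old <- factor(c("low", "low", "mid", "mid", "mid", "high"),
              levels = c("low", "mid", "high"), ordered = TRUE)
new <- rnorm(1000)   # continuous draws to be discretized

# Cumulative proportion of each level in the original factor.
cum_prop <- cumsum(table(old)) / sum(!is.na(old))

# Matching quantiles of the continuous variable serve as cutpoints;
# the last one (100%) is dropped because it is just the maximum.
cuts <- quantile(new, cum_prop[-length(cum_prop)])

# Number of cutpoints each value strictly exceeds, plus 1, gives the level index.
# left.open = TRUE keeps a value equal to a cutpoint in the lower category.
idx <- findInterval(new, cuts, left.open = TRUE) + 1
sim <- factor(levels(old)[idx], levels = levels(old), ordered = TRUE)

round(prop.table(table(sim)), 2)  # close to the proportions in `old`
```

Cutting at quantiles of the simulated variable itself, rather than at theoretical normal quantiles, makes the simulated category proportions track the original ones regardless of the sample drawn.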

In [59]:
simulated_data = SimulateDataset(df)
simulated_data

"Unordered factor detected in variable(s): ZNUnordered factor detected in variable(s): CHASUnordered factor detected in variable(s): RAD"

CRIM,ZN,INDUS,CHAS,NOX,RM,AGE,DIS,RAD,TAX,PTRATIO,B,LSTAT,MEDV
12.27,0,26.26,0,0.78,5.74,114.03,1.20,7,605.53,18.79,359.51,17.25,5.96
-4.23,0,12.32,0,0.49,6.26,61.87,6.85,3,240.21,19.79,394.89,7.10,13.17
-4.18,0,10.17,0,0.58,6.63,60.29,3.01,4,376.86,15.75,294.69,8.89,35.41
12.15,0,12.65,0,0.50,5.16,84.29,4.85,24,551.65,18.72,287.37,18.49,14.73
-7.90,20,12.95,0,0.51,6.16,56.44,3.30,4,231.47,16.54,479.85,14.11,24.19
-0.41,0,14.81,0,0.68,5.14,114.38,1.96,5,493.24,20.12,365.26,15.85,6.70
6.83,0,6.79,0,0.53,7.39,61.41,3.77,3,194.88,14.56,368.55,3.57,31.22
1.36,40,7.47,0,0.47,7.21,57.19,3.58,24,422.62,17.37,496.41,2.43,39.40
-1.08,0,13.38,0,0.69,6.75,80.83,2.89,8,482.94,18.80,340.10,8.20,22.70
1.96,22,6.56,0,0.58,5.76,74.39,5.55,6,366.61,18.38,420.60,18.83,12.60
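
The rows above include negative CRIM values, which a normal distribution can produce even though a crime rate cannot be negative. One optional sanity check, sketched here with a hypothetical helper `count_out_of_range`, counts simulated values that fall outside the range observed in the original column. The `original_crim` vector is made up for illustration; `simulated_crim` reuses the first five CRIM values printed above.

```r
# Hypothetical helper: count simulated values outside the original range.
count_out_of_range <- function(original, simulated) {
  lo <- min(original, na.rm = TRUE)
  hi <- max(original, na.rm = TRUE)
  sum(simulated < lo | simulated > hi, na.rm = TRUE)
}

original_crim  <- c(0.01, 0.25, 3.70, 12.00, 88.98)    # made-up toy values
simulated_crim <- c(12.27, -4.23, -4.18, 12.15, -7.90) # from the output above

count_out_of_range(original_crim, simulated_crim)  # 3
```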


In [61]:
write.csv(simulated_data, 'Simulated_Boston_Housing.csv') 