In [1]:
%defaultDatasource jdbc:h2:mem:db

# Valores de referência NHANES para survey de 2005-2006
## Importando faixas normais de valores indicadas em documentação do NHANES

* Para cada variável é indicado
  - gênero a que se aplica
  - faixa de idade (ageStart até ageEnd)


* A faixa é indicada na forma de valor mínimo e máximo considerado normais

In [2]:
DROP TABLE IF EXISTS ReferenceRanges;
CREATE TABLE ReferenceRanges (
  variable VARCHAR(8),
  gender VARCHAR(1),
  ageStart SMALLINT,
  ageEnd SMALLINT,
  min DECIMAL(7,1),
  max DECIMAL(7,1),
  PRIMARY KEY(variable,gender,ageStart,ageEnd)
) AS SELECT
  variable,gender,ageStart,ageEnd,min,max
FROM CSVREAD('../data/nhanes2005-2006/reference-ranges.csv');

SELECT DISTINCT variable FROM ReferenceRanges;
SELECT * FROM ReferenceRanges;

# Survey NHANES 2005-2006
## Importando dados de survey NHANES 2005-2006

* Estão sendo considerados apenas os campos relacionados a anemia, para aqueles indivíduos que têm valores para todos os campos

In [3]:
DROP TABLE IF EXISTS Survey;

CREATE TABLE Survey (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
  LBXWBCSI DECIMAL(7,1),
  LBXHGB DECIMAL(7,1),
  LBXHCT DECIMAL(7,1),
  LBXPLTSI DECIMAL(7,1),
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,LBXHGB,LBXHCT,LBXPLTSI
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

SELECT COUNT(*) FROM Survey;
SELECT * FROM Survey;

# Códigos e descrição das variáveis NHANES

In [4]:
DROP TABLE IF EXISTS VariableDescription;
CREATE TABLE VariableDescription (
  variable VARCHAR(8),
  acronym VARCHAR(8),
  name VARCHAR(50),
  unit VARCHAR(30),
  file VARCHAR(20),
  ranges VARCHAR(100),
  PRIMARY KEY(variable)
) AS SELECT
  variable,acronym,name,unit,file,ranges
FROM CSVREAD('../data/nhanes2005-2006/reference-ranges-variables.csv');

SELECT * FROM VariableDescription;

# Preparando matriz binária para definir perfil de pessoas

* Para cada variável essa tabela define uma coluna extra binária _b que é inicializada com 0 e receberá 1 se aquela variável estiver fora da faixa NHANES.

## Geração da tabela inicializada com 0

In [5]:
DROP TABLE IF EXISTS SurveyB;
CREATE TABLE SurveyB (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
  LBXWBCSI DECIMAL(7,1),
  LBXWBCSI_b SMALLINT DEFAULT 0,
  LBXHGB DECIMAL(7,1),
  LBXHGB_b SMALLINT DEFAULT 0,
  LBXHCT DECIMAL(7,1),
  LBXHCT_b SMALLINT DEFAULT 0,
  LBXPLTSI DECIMAL(7,1),
  LBXPLTSI_b SMALLINT DEFAULT 0,
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,0,LBXHGB,0,LBXHCT,0,LBXPLTSI,0
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

SELECT COUNT(*) FROM SurveyB;
SELECT * FROM SurveyB;

## Ensaio de verificação

* Ensaio de associação da variável Iron (LBXIRN) com os limites estabelecidos pela NHANES.

In [6]:
SELECT SB.LBXWBCSI, SB.LBXWBCSI_b, RR.gender, RR.ageStart, RR.ageEnd, RR.min, RR.max
FROM SurveyB SB, ReferenceRanges RR
WHERE RR.variable='LBXWBCSI' AND SB.RIAGENDR=RR.gender AND SB.RIDAGEYR>=RR.ageStart AND SB.RIDAGEYR<=RR.ageEnd;

## Construção da matriz

* Cada variável é comparada com os limites da NHANES e as colunas binárias _b são atualizadas.

In [7]:
-- Computing LBXWBCSI
UPDATE SurveyB SB
SET SB.LBXWBCSI_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXWBCSI<RRb.min);
UPDATE SurveyB SB
SET SB.LBXWBCSI = 1
WHERE SB.LBXWBCSI_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXWBCSI>RRb.max);

-- Computing LBXHGB
UPDATE SurveyB SB
SET SB.LBXHGB_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHGB<RRb.min);
UPDATE SurveyB SB
SET SB.LBXHGB = 1
WHERE SB.LBXHGB_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHGB>RRb.max);

-- Computing LBXHCT
UPDATE SurveyB SB
SET SB.LBXHCT_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHCT<RRb.min);
UPDATE SurveyB SB
SET SB.LBXHCT = 1
WHERE SB.LBXHCT_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHCT>RRb.max);

-- Computing LBXPLTSI
UPDATE SurveyB SB
SET SB.LBXPLTSI_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXPLTSI<RRb.min);
UPDATE SurveyB SB
SET SB.LBXPLTSI = 1
WHERE SB.LBXPLTSI_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXPLTSI>RRb.max);

## Matriz final (CorrelationMatrix)

* Construção da visão da matriz final que tem a indentificação da pessoa, a matriz binária _b e um perfil produzida pela concatenação da linha da matriz binária.
* O perfil representa de forma binária o que está anormal (fora dos limites) na pessoa.
* Só vão para a matriz final as pessoas que possuem algum indicador anormal.

In [8]:
DROP VIEW IF EXISTS DeviationProfiles;
DROP VIEW IF EXISTS CorrelationMatrix;

CREATE VIEW CorrelationMatrix AS
SELECT DISTINCT SB.SEQN, 
  CONCAT(SB.LBXWBCSI_b, SB.LBXHGB_b, SB.LBXHCT_b, SB.LBXPLTSI_b) AS profile,
  SB.LBXWBCSI_b, SB.LBXHGB_b, SB.LBXHCT_b, SB.LBXPLTSI_b
FROM SurveyB SB, ReferenceRanges RR
WHERE SB.RIAGENDR=RR.gender AND SB.RIDAGEYR>=RR.ageStart AND SB.RIDAGEYR<=RR.ageEnd AND
(LBXWBCSI_b>0 OR LBXHGB_b>0 OR LBXHCT_b>0 OR LBXPLTSI_b>0);

SELECT COUNT(*) FROM CorrelationMatrix;
SELECT * FROM CorrelationMatrix;

## Gravação da matriz binária

* Gravação da matriz binária em arquivo CSV.
* É possível fazer download do arquivo.

In [9]:
CALL CSVWRITE('../data/nhanes2005-2006/correlation-matrix-fb.csv', 'SELECT * FROM CorrelationMatrix');

204

# Rede de perfis

* As pessoas serão aqui associadas a partir de seus perfis binários produzindo uma rede de perfis e suas correlações.

## Agrupamento de perfis

* Os perfis são agrupados conforme o padrão binário e é registrado o número de pessoas com aquele perfil.

In [10]:
DROP VIEW IF EXISTS DeviationProfiles;

CREATE VIEW DeviationProfiles AS
SELECT CM.profile, COUNT(*) AS individuals
FROM CorrelationMatrix CM
GROUP BY CM.profile;

SELECT SUM(individuals) FROM DeviationProfiles;
SELECT * FROM DeviationProfiles;

## Gravação de perfis

* Os perfis e respectivo número de pessoas associadas é gravado em CSV.

In [11]:
CALL CSVWRITE('../data/nhanes2005-2006/deviation-profiles-fb.csv', 'SELECT DP.profile AS id, DP.individuals AS weight FROM DeviationProfiles DP');

9

# Matriz com intensidade de desvio

* Esta segunda matriz registra não somente que variáveis da pessoa ultrapassam os limites, mas quanto elas ultrapassam.

## Geração de nova matriz de base inicializada com 0

In [12]:
DROP TABLE IF EXISTS SurveyD;
CREATE TABLE SurveyD (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
  LBXWBCSI DECIMAL(7,1),
  LBXWBCSI_d SMALLINT DEFAULT 0,
  LBXHGB DECIMAL(7,1),
  LBXHGB_d SMALLINT DEFAULT 0,
  LBXHCT DECIMAL(7,1),
  LBXHCT_d SMALLINT DEFAULT 0,
  LBXPLTSI DECIMAL(7,1),
  LBXPLTSI_d SMALLINT DEFAULT 0,
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,0,LBXHGB,0,LBXHCT,0,LBXPLTSI,0
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

SELECT * FROM SurveyD;

## Cálculo do desvio do limite por pessoa e variável

In [13]:
-- Computing LBXWBCSI
UPDATE SurveyD SD
SET SD.LBXWBCSI_d =
(SELECT RRa.min-SD.LBXWBCSI
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXWBCSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXWBCSI<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXWBCSI<RRb.min);
UPDATE SurveyD SD
SET SD.LBXWBCSI_d =
(SELECT SD.LBXWBCSI-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXWBCSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXWBCSI>RRa.max)
WHERE SD.LBXWBCSI_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXWBCSI>RRb.max);

-- Computing LBXHGB
UPDATE SurveyD SD
SET SD.LBXHGB_d =
(SELECT RRa.min-SD.LBXHGB
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHGB' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHGB<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHGB<RRb.min);
UPDATE SurveyD SD
SET SD.LBXHGB_d =
(SELECT SD.LBXHGB-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHGB' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHGB>RRa.max)
WHERE SD.LBXHGB_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHGB>RRb.max);

-- Computing LBXHCT
UPDATE SurveyD SD
SET SD.LBXHCT_d =
(SELECT RRa.min-SD.LBXHCT
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHCT' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHCT<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHCT<RRb.min);
UPDATE SurveyD SD
SET SD.LBXHCT_d =
(SELECT SD.LBXHCT-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHCT' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHCT>RRa.max)
WHERE SD.LBXHCT_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHCT>RRb.max);

-- Computing LBXPLTSI
UPDATE SurveyD SD
SET SD.LBXPLTSI_d =
(SELECT RRa.min-SD.LBXPLTSI
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXPLTSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXPLTSI<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXPLTSI<RRb.min);
UPDATE SurveyD SD
SET SD.LBXPLTSI_d =
(SELECT SD.LBXPLTSI-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXPLTSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXPLTSI>RRa.max)
WHERE SD.LBXPLTSI_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXPLTSI>RRb.max);

## Matriz com  desvio final

* Matriz final com identificação das pessoas e desvios.

In [14]:
DROP VIEW IF EXISTS CorrelationMatrixWeighted;

CREATE VIEW CorrelationMatrixWeighted AS
SELECT DISTINCT SD.SEQN, 
  SD.LBXWBCSI_d, SD.LBXHGB_d, SD.LBXHCT_d, SD.LBXPLTSI_d
FROM SurveyD SD, ReferenceRanges RR
WHERE SD.RIAGENDR=RR.gender AND SD.RIDAGEYR>=RR.ageStart AND SD.RIDAGEYR<=RR.ageEnd AND
(LBXWBCSI_d>0 OR LBXHGB_d>0 OR LBXHCT_d>0 OR LBXPLTSI_d>0);

SELECT COUNT(*) FROM CorrelationMatrixWeighted;
SELECT * FROM CorrelationMatrixWeighted;

## Gravação da matriz final com desvios

In [15]:
CALL CSVWRITE('../data/nhanes2005-2006/correlation-matrix-weighted-fb.csv', 'SELECT * FROM CorrelationMatrixWeighted');

387

# Rede de variáveis

* Nesta rede cada nó será uma variável e cada aresta indica que duas variáveis se correlacionam com uma certa identidade.

## Lista de pares de variáveis

* Esta view prepara a lista de correlação aos pares inicializada com 0.

In [16]:
DROP VIEW IF EXISTS VariablesCorrelation;
DROP VIEW IF EXISTS Variables;

CREATE VIEW Variables AS
SELECT DISTINCT variable AS var1 FROM ReferenceRanges;

CREATE VIEW VariablesCorrelation AS
SELECT DISTINCT Variables.var1, ReferenceRanges.variable AS var2, 0 AS correlation
FROM Variables, ReferenceRanges
WHERE Variables.var1 < ReferenceRanges.variable;

SELECT COUNT(*) FROM VariablesCorrelation;
SELECT * FROM VariablesCorrelation;

## Verticalização do survey

* As pessoas e variáveis que se apresentam originalmente em uma matriz são transformadas em uma lista: pessoa, variável e valor. Essa lista facilitará as análises subsequentes.

In [17]:
DROP VIEW IF EXISTS VerticalSurvey;

CREATE VIEW VerticalSurvey AS
  SELECT SU.SEQN, RR.variable, SU.LBXWBCSI AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXWBCSI'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXHGB AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXHGB'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXHCT AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXHCT'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXPLTSI AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXPLTSI'
;
  
SELECT * FROM VerticalSurvey;

## Gravação do survey vertical em CSV

In [18]:
CALL CSVWRITE('../data/nhanes2005-2006/vertical-survey-fb.csv', 'SELECT SEQN,variable,value FROM VerticalSurvey');

10660

## Transformação da VIEW em tabela para permitir updates

In [19]:
DROP TABLE IF EXISTS VerticalSurveyD;
CREATE TABLE VerticalSurveyD (
  SEQN VARCHAR(8),
  variable VARCHAR(8),
  value DECIMAL(7,1),
  deviation DECIMAL(7,1),
  PRIMARY KEY(SEQN, variable)
) AS SELECT * FROM VerticalSurvey;

## Cáculo do desvio de variáveis que ultrapassam o limite

In [20]:
UPDATE VerticalSurveyD VS
SET VS.deviation =
(SELECT RRa.min-VS.value
 FROM Survey SUa, ReferenceRanges RRa
 WHERE RRa.variable=VS.variable AND SUa.SEQN=VS.SEQN AND SUa.RIAGENDR=RRa.gender AND SUa.RIDAGEYR>=RRa.ageStart AND SUa.RIDAGEYR<=RRa.ageEnd AND VS.value<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM Survey SUb, ReferenceRanges RRb
 WHERE RRb.variable=VS.variable AND SUb.SEQN=VS.SEQN AND SUb.RIAGENDR=RRb.gender AND SUb.RIDAGEYR>=RRb.ageStart AND SUb.RIDAGEYR<=RRb.ageEnd AND VS.value<RRb.min);

UPDATE VerticalSurveyD VS
SET VS.deviation =
(SELECT VS.value-RRa.max
 FROM Survey SUa, ReferenceRanges RRa
 WHERE RRa.variable=VS.variable AND SUa.SEQN=VS.SEQN AND SUa.RIAGENDR=RRa.gender AND SUa.RIDAGEYR>=RRa.ageStart AND SUa.RIDAGEYR<=RRa.ageEnd AND VS.value>RRa.max)
WHERE EXISTS
(SELECT RRb.max
 FROM Survey SUb, ReferenceRanges RRb
 WHERE RRb.variable=VS.variable AND SUb.SEQN=VS.SEQN AND SUb.RIAGENDR=RRb.gender AND SUb.RIDAGEYR>=RRb.ageStart AND SUb.RIDAGEYR<=RRb.ageEnd AND VS.value>RRb.max);
 
SELECT * FROM VerticalSurveyD WHERE deviation > 0;

## Cálculo da média dos desvios

* Tentativa de normalização dos valores, mas as médias estão estranahas.

In [21]:
SELECT variable, AVG(deviation) FROM VerticalSurveyD VS GROUP BY variable;

## Correlação de variáveis por pessoa

* Análise de pares de variáveis de pessoas que se correlacionam

In [22]:
DROP VIEW IF EXISTS VariablePairCorrelation;
DROP VIEW IF EXISTS IndividualVariablesCorrelation;

CREATE VIEW IndividualVariablesCorrelation AS
SELECT VS1.SEQN, CM.profile, VC.var1, VC.var2
FROM VariablesCorrelation VC, VerticalSurveyD VS1, VerticalSurveyD VS2, CorrelationMatrix CM
WHERE VS1.SEQN = VS2.SEQN AND VS1.variable = VC.var1 AND VS2.variable = VC.var2 AND 
      VS1.deviation > 0 AND VS2.deviation > 0 AND
      VS1.SEQN = CM.SEQN;

SELECT * FROM IndividualVariablesCorrelation
ORDER BY var1, var2;

## Correlação de pares de variáveis

* Agrupamento das correlações por pares de variáveis.
* Preparação para a montagem de rede com variáveis nos nós e arestas ligando variáveis que saíram dos limites juntas para a mesma pessoa.

In [23]:
DROP VIEW IF EXISTS VariablePairCorrelation;
CREATE VIEW VariablePairCorrelation AS
SELECT var1 AS source, var2 as TARGET, COUNT(*) AS weight
FROM IndividualVariablesCorrelation
GROUP BY var1, var2;

SELECT * FROM VariablePairCorrelation;

## Gravação de CSV de correlações para Gephi

In [24]:
CALL CSVWRITE('../data/nhanes2005-2006/variable-pair-correlation-fb.csv', 'SELECT * FROM VariablePairCorrelation');

6

# Variable Network

* Rede de variáveis produzida no Gephi a partir do arquivo acima.

![variable network](variable-network.svg "Variable Network")

# Rede de perfis

* Retomada da rede de perfis.

## Análise de correlação entre pares de perfis

* Cada vez que duas pessoas compartilham uma variável fora dos limites, é definida uma aresta entre elas.
* As arestas são agrupadas por pares de perfil. Para cada par é contado o número de indivíduos/variáveis que coocorrem.

In [25]:
DROP VIEW IF EXISTS ProfileCorrelation;

CREATE VIEW ProfileCorrelation AS
  SELECT CM1.SEQN AS SEQN1, CM1.profile AS profile1, CM2.SEQN AS SEQN2, CM2.profile AS profile2
  FROM VerticalSurveyD VS1, VerticalSurveyD VS2, CorrelationMatrix CM1, CorrelationMatrix CM2
  WHERE VS1.SEQN < VS2.SEQN AND VS1.variable = VS2.variable AND
        VS1.deviation > 0 AND VS2.deviation > 0 AND
        VS1.SEQN = CM1.SEQN AND VS2.SEQN = CM2.SEQN;
        
-- Gravação de pares de perfis com similaridade para rede
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-fb.csv', 'SELECT * FROM ProfileCorrelation');

11797

# Profile Network

![profile network](profile-network.png "Profile Network")

In [26]:
DROP VIEW IF EXISTS ProfileCorrelationNWeight;
DROP VIEW IF EXISTS ProfileCorrelationUnique;

CREATE VIEW ProfileCorrelationUnique AS
  SELECT DISTINCT * FROM ProfileCorrelation;

CREATE VIEW ProfileCorrelationNWeight AS
  SELECT PC.profile1 AS source, PC.profile2 as target, COUNT(*) as weight
  FROM ProfileCorrelationUnique PC
  GROUP BY PC.profile1, PC.profile2;
  
SELECT COUNT(*), SUM(weight) FROM ProfileCorrelationNWeight;
SELECT * FROM ProfileCorrelationNWeight;

-- Gravação de pares de perfis com similaridade para rede
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-number-fb.csv', 'SELECT * FROM ProfileCorrelationNWeight');

In [27]:
CREATE VIEW ProfileCorrelationSWeight AS
  SELECT PC.profile1 AS source, PC.profile2 as target, COUNT(*) as weight
  FROM ProfileCorrelation PC
  GROUP BY PC.profile1, PC.profile2;
  
SELECT COUNT(*), SUM(weight) FROM ProfileCorrelationSWeight;
SELECT * FROM ProfileCorrelationSWeight;

-- Gravação de pares de perfis com similaridade para rede
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-similarity-fb.csv', 'SELECT * FROM ProfileCorrelationSWeight');