In [1]:
%defaultDatasource jdbc:h2:mem:db

# Reference values for NHANES for the 2005-2006 survey

* Extracted from data of the NHANES Web site (https://wwwn.cdc.gov/nchs/nhanes/).

## Importing normal ranges of values indicated in the NHANES documentation

* For each variable, the following are indicated:
  - applicable gender
  - age range (ageStart until ageEnd)

* The range is given as the minimum and maximum values considered normal.
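
As a minimal sketch of how such a range can be looked up, the cell below builds a small toy table and selects the limits that apply to one person. The table `ToyRanges`, the columns `lowerLimit` and `upperLimit`, the variable name `VARX`, and all values are hypothetical and are not NHANES data.

```sql
-- Hypothetical toy table; names and values are illustrative only.
DROP TABLE IF EXISTS ToyRanges;
CREATE TABLE ToyRanges (
  variable VARCHAR(8),
  gender VARCHAR(1),
  ageStart SMALLINT,
  ageEnd SMALLINT,
  lowerLimit DECIMAL(7,1),
  upperLimit DECIMAL(7,1)
);
INSERT INTO ToyRanges VALUES
  ('VARX', '1', 0, 17, 10.0, 15.0),
  ('VARX', '1', 18, 150, 12.0, 18.0),
  ('VARX', '2', 18, 150, 11.0, 16.0);

-- Limits that apply to a person with gender '1' and age 30.
-- Only the second row matches: lowerLimit 12.0, upperLimit 18.0.
SELECT lowerLimit, upperLimit
FROM ToyRanges
WHERE variable = 'VARX' AND gender = '1' AND 30 >= ageStart AND 30 <= ageEnd;
```

The age bounds are treated as inclusive on both ends, which matches the `>=` and `<=` comparisons used in the later cells of this notebook.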

In [2]:
DROP TABLE IF EXISTS ReferenceRanges;
CREATE TABLE ReferenceRanges (
  variable VARCHAR(8),
  gender VARCHAR(1),
  ageStart SMALLINT,
  ageEnd SMALLINT,
  min DECIMAL(7,1),
  max DECIMAL(7,1),
  PRIMARY KEY(variable,gender,ageStart,ageEnd)
) AS SELECT
  variable,gender,ageStart,ageEnd,min,max
FROM CSVREAD('../data/nhanes2005-2006/reference-ranges.csv');

SELECT DISTINCT variable FROM ReferenceRanges;
SELECT * FROM ReferenceRanges;

# NHANES 2005-2006 Survey

* Extracted from data of the NHANES Web site (https://wwwn.cdc.gov/nchs/nhanes/).

## Importing data from the NHANES 2005-2006 survey

* We selected four commonly used blood test variables, as shown in the following figure.

![btc fishbones](Hematology_Fishbone_Schematic.png "BTC Fishbones")
By Major Small - Own work, [CC BY 3.0](https://creativecommons.org/licenses/by/3.0), [Link](https://commons.wikimedia.org/w/index.php?curid=27274895)

In [3]:
DROP TABLE IF EXISTS Survey;

CREATE TABLE Survey (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
  LBXWBCSI DECIMAL(7,1),
  LBXHGB DECIMAL(7,1),
  LBXHCT DECIMAL(7,1),
  LBXPLTSI DECIMAL(7,1),
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,LBXHGB,LBXHCT,LBXPLTSI
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

SELECT COUNT(*) FROM Survey;
SELECT * FROM Survey;

# Codes and description of NHANES variables

* The codes and description of variables are stored in a table.

In [4]:
DROP TABLE IF EXISTS VariableDescription;
CREATE TABLE VariableDescription (
  variable VARCHAR(8),
  acronym VARCHAR(8),
  name VARCHAR(50),
  unit VARCHAR(30),
  file VARCHAR(20),
  ranges VARCHAR(100),
  PRIMARY KEY(variable)
) AS SELECT
  variable,acronym,name,unit,file,ranges
FROM CSVREAD('../data/nhanes2005-2006/reference-ranges-variables.csv');

SELECT * FROM VariableDescription;

# Binary evaluation of individuals out of the normal ranges

* For each variable, this table defines an extra binary column _b, which is initialized with 0 and receives 1 if the variable is out of the NHANES range.

## Generation of the starting matrix initialized with 0

In [5]:
DROP TABLE IF EXISTS SurveyB;
CREATE TABLE SurveyB (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
  LBXWBCSI DECIMAL(7,1),
  LBXWBCSI_b SMALLINT DEFAULT 0,
  LBXHGB DECIMAL(7,1),
  LBXHGB_b SMALLINT DEFAULT 0,
  LBXHCT DECIMAL(7,1),
  LBXHCT_b SMALLINT DEFAULT 0,
  LBXPLTSI DECIMAL(7,1),
  LBXPLTSI_b SMALLINT DEFAULT 0,
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,0,LBXHGB,0,LBXHCT,0,LBXPLTSI,0
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

## Matrix building

* Each variable is compared with the limits of the NHANES ranges, and the binary _b columns are updated.
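
The comparison can be illustrated on toy data. The sketch below is not the method used in this notebook, which runs two UPDATE statements per variable; it derives the same kind of flag in a single SELECT with a CASE expression. The tables `ToyLimits` and `ToyPersons`, their columns, and all values are hypothetical.

```sql
-- Hypothetical toy tables; names and values are illustrative only.
DROP TABLE IF EXISTS ToyPersons;
DROP TABLE IF EXISTS ToyLimits;
CREATE TABLE ToyLimits (
  gender VARCHAR(1),
  ageStart SMALLINT,
  ageEnd SMALLINT,
  lowerLimit DECIMAL(7,1),
  upperLimit DECIMAL(7,1)
);
INSERT INTO ToyLimits VALUES
  ('1', 18, 150, 12.0, 18.0),
  ('2', 18, 150, 11.0, 16.0);
CREATE TABLE ToyPersons (
  id VARCHAR(8),
  gender VARCHAR(1),
  age SMALLINT,
  measured DECIMAL(7,1)
);
INSERT INTO ToyPersons VALUES
  ('A', '1', 30, 11.5),
  ('B', '1', 40, 15.0),
  ('C', '2', 25, 16.4);

-- flag = 1 when the measurement is below the lower or above the upper limit.
-- A: 11.5 < 12.0 gives 1; B: inside the range gives 0; C: 16.4 > 16.0 gives 1.
SELECT P.id, P.measured,
  CASE WHEN P.measured < L.lowerLimit OR P.measured > L.upperLimit
       THEN 1 ELSE 0 END AS flag
FROM ToyPersons P
JOIN ToyLimits L
  ON P.gender = L.gender AND P.age >= L.ageStart AND P.age <= L.ageEnd
ORDER BY P.id;
```

One difference is worth noting: the inner join drops persons for whom no range applies, whereas the UPDATE approach leaves their flag at 0.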

In [6]:
-- Computing LBXWBCSI
UPDATE SurveyB SB
SET SB.LBXWBCSI_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXWBCSI<RRb.min);
UPDATE SurveyB SB
SET SB.LBXWBCSI_b = 1
WHERE SB.LBXWBCSI_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXWBCSI>RRb.max);

-- Computing LBXHGB
UPDATE SurveyB SB
SET SB.LBXHGB_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHGB<RRb.min);
UPDATE SurveyB SB
SET SB.LBXHGB_b = 1
WHERE SB.LBXHGB_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHGB>RRb.max);

-- Computing LBXHCT
UPDATE SurveyB SB
SET SB.LBXHCT_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHCT<RRb.min);
UPDATE SurveyB SB
SET SB.LBXHCT_b = 1
WHERE SB.LBXHCT_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXHCT>RRb.max);

-- Computing LBXPLTSI
UPDATE SurveyB SB
SET SB.LBXPLTSI_b = 1
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXPLTSI<RRb.min);
UPDATE SurveyB SB
SET SB.LBXPLTSI_b = 1
WHERE SB.LBXPLTSI_b = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SB.RIAGENDR=RRb.gender AND SB.RIDAGEYR>=RRb.ageStart AND SB.RIDAGEYR<=RRb.ageEnd AND SB.LBXPLTSI>RRb.max);

## Final Matrix

* Builds the final matrix, which holds the person's identifier, the binary _b columns, and a profile built by concatenating the binary values of each row.
* The profile represents, in binary form, which variables are out of range for each person.
* Only persons with at least one variable out of range are kept.

* The resulting matrix is written to a CSV file.
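
As a small illustration of how a profile is formed and read, the sketch below concatenates four hypothetical flag values. The values are made up, and the positions are assumed to follow the column order used in the CONCAT call of the next cell (LBXWBCSI, LBXHGB, LBXHCT, LBXPLTSI).

```sql
-- Hypothetical flags for one person: only the 2nd and 3rd variables are out of range.
SELECT CONCAT(0, 1, 1, 0) AS profile;

-- Reading a profile back: the 2nd character tells whether the 2nd variable deviates.
SELECT SUBSTRING('0110', 2, 1) AS secondFlag;
```

Under that assumption, a profile of `0110` would mean that hemoglobin and hematocrit are out of range while white blood cells and platelets are not.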

In [7]:
DROP VIEW IF EXISTS DeviationProfiles;
DROP VIEW IF EXISTS CorrelationMatrix;

CREATE VIEW CorrelationMatrix AS
SELECT DISTINCT SB.SEQN, 
  CONCAT(SB.LBXWBCSI_b, SB.LBXHGB_b, SB.LBXHCT_b, SB.LBXPLTSI_b) AS profile,
  SB.LBXWBCSI_b, SB.LBXHGB_b, SB.LBXHCT_b, SB.LBXPLTSI_b
FROM SurveyB SB, ReferenceRanges RR
WHERE SB.RIAGENDR=RR.gender AND SB.RIDAGEYR>=RR.ageStart AND SB.RIDAGEYR<=RR.ageEnd AND
(LBXWBCSI_b>0 OR LBXHGB_b>0 OR LBXHCT_b>0 OR LBXPLTSI_b>0);

SELECT COUNT(*) FROM CorrelationMatrix;
SELECT * FROM CorrelationMatrix;

CALL CSVWRITE('../data/nhanes2005-2006/correlation-matrix-fb.csv', 'SELECT * FROM CorrelationMatrix');

# Profiles network

* Persons are related here through their binary profiles, producing a profile network.

## Grouping profiles

* Profiles are grouped according to their binary pattern, and people with the same profile are aggregated.

In [8]:
DROP VIEW IF EXISTS DeviationProfiles;

CREATE VIEW DeviationProfiles AS
SELECT CM.profile, COUNT(*) AS individuals
FROM CorrelationMatrix CM
GROUP BY CM.profile;

SELECT SUM(individuals) FROM DeviationProfiles;
SELECT * FROM DeviationProfiles;

CALL CSVWRITE('../data/nhanes2005-2006/profile-deviation-fb.csv', 'SELECT DP.profile AS id, DP.individuals AS weight FROM DeviationProfiles DP');

# Matrix with deviation intensity

* This second matrix records, for each variable that exceeds the limits, by how much it exceeds them.

## Generation of a new starting matrix initialized with 0

In [9]:
DROP TABLE IF EXISTS SurveyD;
CREATE TABLE SurveyD (
  SEQN VARCHAR(8),
  RIAGENDR VARCHAR(1),
  RIDAGEYR SMALLINT,
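  -- _d columns are DECIMAL so that fractional deviations (e.g. 0.3) are not rounded away
  LBXWBCSI DECIMAL(7,1),
  LBXWBCSI_d DECIMAL(7,1) DEFAULT 0,
  LBXHGB DECIMAL(7,1),
  LBXHGB_d DECIMAL(7,1) DEFAULT 0,
  LBXHCT DECIMAL(7,1),
  LBXHCT_d DECIMAL(7,1) DEFAULT 0,
  LBXPLTSI DECIMAL(7,1),
  LBXPLTSI_d DECIMAL(7,1) DEFAULT 0,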
  PRIMARY KEY(SEQN)
) AS SELECT
  SEQN,RIAGENDR,RIDAGEYR,LBXWBCSI,0,LBXHGB,0,LBXHCT,0,LBXPLTSI,0
FROM CSVREAD('../data/nhanes2005-2006/combined-selected-variables.csv');

SELECT * FROM SurveyD;

## Matrix building

* Each variable is compared with the limits of the NHANES ranges, and the deviation _d columns receive the difference.
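
The deviation can be sketched on toy data as the distance to the nearest violated limit. The cell below is an illustration with a single CASE expression rather than the UPDATE statements used in this notebook; the table `ToyMeasures`, its columns, and all values are hypothetical.

```sql
-- Hypothetical toy table; each row already carries the limits that apply to the person.
DROP TABLE IF EXISTS ToyMeasures;
CREATE TABLE ToyMeasures (
  id VARCHAR(8),
  measured DECIMAL(7,1),
  lowerLimit DECIMAL(7,1),
  upperLimit DECIMAL(7,1)
);
INSERT INTO ToyMeasures VALUES
  ('A', 11.5, 12.0, 18.0),
  ('B', 15.0, 12.0, 18.0),
  ('C', 18.3, 12.0, 18.0);

-- Below the range: lowerLimit - measured; above the range: measured - upperLimit.
-- A: 12.0 - 11.5 = 0.5; B: inside the range, 0; C: 18.3 - 18.0 = 0.3.
SELECT id,
  CASE
    WHEN measured < lowerLimit THEN lowerLimit - measured
    WHEN measured > upperLimit THEN measured - upperLimit
    ELSE 0
  END AS deviation
FROM ToyMeasures
ORDER BY id;
```

The deviation is never negative, so the sign does not tell whether the value is too low or too high. Deviations can also be fractional, so the column that stores them needs a decimal type to preserve them.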

In [10]:
-- Computing LBXWBCSI
UPDATE SurveyD SD
SET SD.LBXWBCSI_d =
(SELECT RRa.min-SD.LBXWBCSI
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXWBCSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXWBCSI<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXWBCSI<RRb.min);
UPDATE SurveyD SD
SET SD.LBXWBCSI_d =
(SELECT SD.LBXWBCSI-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXWBCSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXWBCSI>RRa.max)
WHERE SD.LBXWBCSI_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXWBCSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXWBCSI>RRb.max);

-- Computing LBXHGB
UPDATE SurveyD SD
SET SD.LBXHGB_d =
(SELECT RRa.min-SD.LBXHGB
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHGB' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHGB<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHGB<RRb.min);
UPDATE SurveyD SD
SET SD.LBXHGB_d =
(SELECT SD.LBXHGB-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHGB' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHGB>RRa.max)
WHERE SD.LBXHGB_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHGB' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHGB>RRb.max);

-- Computing LBXHCT
UPDATE SurveyD SD
SET SD.LBXHCT_d =
(SELECT RRa.min-SD.LBXHCT
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHCT' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHCT<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHCT<RRb.min);
UPDATE SurveyD SD
SET SD.LBXHCT_d =
(SELECT SD.LBXHCT-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXHCT' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXHCT>RRa.max)
WHERE SD.LBXHCT_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXHCT' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXHCT>RRb.max);

-- Computing LBXPLTSI
UPDATE SurveyD SD
SET SD.LBXPLTSI_d =
(SELECT RRa.min-SD.LBXPLTSI
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXPLTSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXPLTSI<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXPLTSI<RRb.min);
UPDATE SurveyD SD
SET SD.LBXPLTSI_d =
(SELECT SD.LBXPLTSI-RRa.max
 FROM ReferenceRanges RRa
 WHERE RRa.variable='LBXPLTSI' AND SD.RIAGENDR=RRa.gender AND SD.RIDAGEYR>=RRa.ageStart AND SD.RIDAGEYR<=RRa.ageEnd AND SD.LBXPLTSI>RRa.max)
WHERE SD.LBXPLTSI_d = 0 AND
EXISTS (SELECT RRb.max
 FROM ReferenceRanges RRb
 WHERE RRb.variable='LBXPLTSI' AND SD.RIAGENDR=RRb.gender AND SD.RIDAGEYR>=RRb.ageStart AND SD.RIDAGEYR<=RRb.ageEnd AND SD.LBXPLTSI>RRb.max);

## Final Matrix

* Builds the final matrix, which holds the person's identifier and the deviation _d columns.
* Only persons with at least one variable out of range are kept.

In [11]:
DROP VIEW IF EXISTS CorrelationMatrixWeighted;

CREATE VIEW CorrelationMatrixWeighted AS
SELECT DISTINCT SD.SEQN, 
  SD.LBXWBCSI_d, SD.LBXHGB_d, SD.LBXHCT_d, SD.LBXPLTSI_d
FROM SurveyD SD, ReferenceRanges RR
WHERE SD.RIAGENDR=RR.gender AND SD.RIDAGEYR>=RR.ageStart AND SD.RIDAGEYR<=RR.ageEnd AND
(LBXWBCSI_d>0 OR LBXHGB_d>0 OR LBXHCT_d>0 OR LBXPLTSI_d>0);

SELECT COUNT(*) FROM CorrelationMatrixWeighted;
SELECT * FROM CorrelationMatrixWeighted;

CALL CSVWRITE('../data/nhanes2005-2006/correlation-matrix-weighted-fb.csv', 'SELECT * FROM CorrelationMatrixWeighted');

# Variables Network

* In this network, each node is a variable and each edge indicates that two variables are correlated with a certain intensity.

## List of the variable pairs

* This view prepares the list of correlation pairs initialized with 0.
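
The pairing relies on a strict `<` comparison between names, which keeps each unordered pair exactly once and excludes a variable paired with itself. A minimal sketch, using a hypothetical table `ToyVars` filled with the four variable codes used in this notebook:

```sql
-- Hypothetical toy table holding the four variable codes.
DROP TABLE IF EXISTS ToyVars;
CREATE TABLE ToyVars (name VARCHAR(8));
INSERT INTO ToyVars VALUES ('LBXHCT'), ('LBXHGB'), ('LBXPLTSI'), ('LBXWBCSI');

-- Self-join with A.name < B.name: 4 variables give 4*3/2 = 6 pairs.
SELECT A.name AS var1, B.name AS var2
FROM ToyVars A, ToyVars B
WHERE A.name < B.name
ORDER BY var1, var2;
```

Using `<>` instead of `<` would return every pair twice, once in each direction.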

In [12]:
DROP VIEW IF EXISTS VariablesCorrelation;
DROP VIEW IF EXISTS Variables;

CREATE VIEW Variables AS
SELECT DISTINCT variable AS var1 FROM ReferenceRanges;

CREATE VIEW VariablesCorrelation AS
SELECT DISTINCT Variables.var1, ReferenceRanges.variable AS var2, 0 AS correlation
FROM Variables, ReferenceRanges
WHERE Variables.var1 < ReferenceRanges.variable;

## Survey verticalization

* Persons and variables, originally presented as a matrix, are transformed into a list of (person, variable, value) rows. This list facilitates the subsequent analyses.
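
The wide-to-long transformation can be sketched on a two-variable toy table. This sketch labels each row with a string literal, whereas the view in the next cell takes the label from ReferenceRanges; the table `ToyWide`, the columns `varA` and `varB`, and the alias `measured` (used in place of the notebook's `value`) are hypothetical.

```sql
-- Hypothetical wide table: one row per person, one column per variable.
DROP TABLE IF EXISTS ToyWide;
CREATE TABLE ToyWide (id VARCHAR(8), varA DECIMAL(7,1), varB DECIMAL(7,1));
INSERT INTO ToyWide VALUES ('P1', 5.0, 14.2), ('P2', 7.1, 13.0);

-- Long form: one row per (person, variable); 2 persons x 2 variables = 4 rows.
SELECT id, 'varA' AS variable, varA AS measured FROM ToyWide
UNION ALL
SELECT id, 'varB' AS variable, varB AS measured FROM ToyWide
ORDER BY id, variable;
```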

In [13]:
DROP VIEW IF EXISTS VerticalSurvey;

CREATE VIEW VerticalSurvey AS
  SELECT SU.SEQN, RR.variable, SU.LBXWBCSI AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXWBCSI'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXHGB AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXHGB'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXHCT AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXHCT'
UNION
  SELECT SU.SEQN, RR.variable, SU.LBXPLTSI AS value, 0 AS deviation
  FROM Survey SU, ReferenceRanges RR
  WHERE RR.variable='LBXPLTSI'
;

-- materialize the view as a table to enable updates
DROP TABLE IF EXISTS VerticalSurveyD;
CREATE TABLE VerticalSurveyD (
  SEQN VARCHAR(8),
  variable VARCHAR(8),
  value DECIMAL(7,1),
  deviation DECIMAL(7,1),
  PRIMARY KEY(SEQN, variable)
) AS SELECT * FROM VerticalSurvey;
  
CALL CSVWRITE('../data/nhanes2005-2006/vertical-survey-fb.csv', 'SELECT SEQN,variable,value FROM VerticalSurvey');

10660

## Computation of the deviation value for the variables that are out of the limits

In [14]:
UPDATE VerticalSurveyD VS
SET VS.deviation =
(SELECT RRa.min-VS.value
 FROM Survey SUa, ReferenceRanges RRa
 WHERE RRa.variable=VS.variable AND SUa.SEQN=VS.SEQN AND SUa.RIAGENDR=RRa.gender AND SUa.RIDAGEYR>=RRa.ageStart AND SUa.RIDAGEYR<=RRa.ageEnd AND VS.value<RRa.min)
WHERE EXISTS
(SELECT RRb.min
 FROM Survey SUb, ReferenceRanges RRb
 WHERE RRb.variable=VS.variable AND SUb.SEQN=VS.SEQN AND SUb.RIAGENDR=RRb.gender AND SUb.RIDAGEYR>=RRb.ageStart AND SUb.RIDAGEYR<=RRb.ageEnd AND VS.value<RRb.min);

UPDATE VerticalSurveyD VS
SET VS.deviation =
(SELECT VS.value-RRa.max
 FROM Survey SUa, ReferenceRanges RRa
 WHERE RRa.variable=VS.variable AND SUa.SEQN=VS.SEQN AND SUa.RIAGENDR=RRa.gender AND SUa.RIDAGEYR>=RRa.ageStart AND SUa.RIDAGEYR<=RRa.ageEnd AND VS.value>RRa.max)
WHERE EXISTS
(SELECT RRb.max
 FROM Survey SUb, ReferenceRanges RRb
 WHERE RRb.variable=VS.variable AND SUb.SEQN=VS.SEQN AND SUb.RIAGENDR=RRb.gender AND SUb.RIDAGEYR>=RRb.ageStart AND SUb.RIDAGEYR<=RRb.ageEnd AND VS.value>RRb.max);
 
SELECT * FROM VerticalSurveyD WHERE deviation > 0;

## Number of deviations per variable

In [15]:
SELECT variable as id, COUNT(*) as weight FROM VerticalSurveyD VS WHERE deviation>0 GROUP BY variable;

CALL CSVWRITE('../data/nhanes2005-2006/variable-number-deviation-fb.csv', 'SELECT variable as id, COUNT(*) as weight FROM VerticalSurveyD VS WHERE deviation>0 GROUP BY variable');

## Variable correlation by person

* Pairwise correlation of variables that co-occur in the same person.

In [16]:
DROP VIEW IF EXISTS VariablePairCorrelation;
DROP VIEW IF EXISTS IndividualVariablesCorrelation;

CREATE VIEW IndividualVariablesCorrelation AS
SELECT VS1.SEQN, CM.profile, VC.var1, VC.var2
FROM VariablesCorrelation VC, VerticalSurveyD VS1, VerticalSurveyD VS2, CorrelationMatrix CM
WHERE VS1.SEQN = VS2.SEQN AND VS1.variable = VC.var1 AND VS2.variable = VC.var2 AND 
      VS1.deviation > 0 AND VS2.deviation > 0 AND
      VS1.SEQN = CM.SEQN;

SELECT * FROM IndividualVariablesCorrelation
ORDER BY var1, var2;

## Correlation of variable pairs

* Aggregation of correlations of variable pairs.
* Preparation to build a network where variables are vertices and edges connect variables that surpassed the limits together for the same person.
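
The counting of co-occurring deviations can be sketched with a self-join on a long-format toy table. The table `ToyLong`, the variable names, and all values below are hypothetical; the sketch only mirrors the idea of counting, per variable pair, the persons in whom both variables deviate.

```sql
-- Hypothetical long table: one row per (person, variable) with its deviation.
DROP TABLE IF EXISTS ToyLong;
CREATE TABLE ToyLong (id VARCHAR(8), variable VARCHAR(8), deviation DECIMAL(7,1));
INSERT INTO ToyLong VALUES
  ('P1', 'varA', 0.5), ('P1', 'varB', 1.2), ('P1', 'varC', 0.0),
  ('P2', 'varA', 0.3), ('P2', 'varB', 0.0), ('P2', 'varC', 2.0),
  ('P3', 'varA', 0.1), ('P3', 'varB', 0.4), ('P3', 'varC', 0.0);

-- Edge weight = number of persons in whom both variables are out of range.
-- varA-varB: 2 (P1 and P3); varA-varC: 1 (P2); varB-varC never co-occur.
SELECT T1.variable AS source, T2.variable AS target, COUNT(*) AS weight
FROM ToyLong T1, ToyLong T2
WHERE T1.id = T2.id AND T1.variable < T2.variable
  AND T1.deviation > 0 AND T2.deviation > 0
GROUP BY T1.variable, T2.variable
ORDER BY source, target;
```

Pairs that never deviate together produce no row at all, so they appear in the network as missing edges rather than as edges with weight 0.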

In [17]:
DROP VIEW IF EXISTS VariablePairCorrelation;
CREATE VIEW VariablePairCorrelation AS
SELECT var1 AS source, var2 as TARGET, COUNT(*) AS weight
FROM IndividualVariablesCorrelation
GROUP BY var1, var2;

SELECT * FROM VariablePairCorrelation;

CALL CSVWRITE('../data/nhanes2005-2006/variable-pair-correlation-fb.csv', 'SELECT * FROM VariablePairCorrelation');

# Variable Network

* Variable network produced in Gephi from the CSV created in the previous step.

![variable network](variable-network-fb.png "Variable Network")

# Profile Network

* Returning to the profile network.

## Correlation analysis of profile pairs

* Each time two persons share a variable out of range, an edge is created between them.
* The edges are grouped by profile pairs. For each pair, the number of co-occurring individuals/variables is computed.
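
A toy version of this grouping is sketched below, counting one edge per shared out-of-range variable. The table `ToyDeviant`, the three-character profiles, and the variable names are hypothetical.

```sql
-- Hypothetical table: one row per (person, out-of-range variable), with the person's profile.
DROP TABLE IF EXISTS ToyDeviant;
CREATE TABLE ToyDeviant (id VARCHAR(8), profile VARCHAR(4), variable VARCHAR(8));
INSERT INTO ToyDeviant VALUES
  ('P1', '110', 'varA'), ('P1', '110', 'varB'),
  ('P2', '100', 'varA'),
  ('P3', '110', 'varA'), ('P3', '110', 'varB');

-- D1.id < D2.id keeps each pair of persons once per shared variable.
-- P1-P2 share varA, P1-P3 share varA and varB, P2-P3 share varA, giving:
-- ('100','110',1), ('110','100',1), ('110','110',2)
SELECT D1.profile AS source, D2.profile AS target, COUNT(*) AS weight
FROM ToyDeviant D1, ToyDeviant D2
WHERE D1.id < D2.id AND D1.variable = D2.variable
GROUP BY D1.profile, D2.profile
ORDER BY source, target;
```

The same two profiles can appear in both directions (`100`-`110` and `110`-`100`), because the ordering is imposed on person identifiers and not on profiles. An undirected network therefore needs a further step that merges the two directions.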

In [18]:
DROP VIEW IF EXISTS ProfileCorrelation;

CREATE VIEW ProfileCorrelation AS
  SELECT CM1.SEQN AS SEQN1, CM1.profile AS profile1, CM2.SEQN AS SEQN2, CM2.profile AS profile2
  FROM VerticalSurveyD VS1, VerticalSurveyD VS2, CorrelationMatrix CM1, CorrelationMatrix CM2
  WHERE VS1.SEQN < VS2.SEQN AND VS1.variable = VS2.variable AND
        VS1.deviation > 0 AND VS2.deviation > 0 AND
        VS1.SEQN = CM1.SEQN AND VS2.SEQN = CM2.SEQN;
        
-- Writing profile pairs with similarity for the network
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-fb.csv', 'SELECT * FROM ProfileCorrelation');

67421

In [19]:
DROP VIEW IF EXISTS ProfileCorrelationNWeight;
DROP VIEW IF EXISTS ProfileCorrelationUnique;

CREATE VIEW ProfileCorrelationUnique AS
  SELECT DISTINCT * FROM ProfileCorrelation;

CREATE VIEW ProfileCorrelationNWeight AS
  SELECT PC.profile1 AS source, PC.profile2 as target, COUNT(*) as weight
  FROM ProfileCorrelationUnique PC
  GROUP BY PC.profile1, PC.profile2;
  
SELECT COUNT(*), SUM(weight) FROM ProfileCorrelationNWeight;
SELECT * FROM ProfileCorrelationNWeight;

In [20]:
DROP VIEW IF EXISTS ProfileCorrFinalNWeight;
DROP VIEW IF EXISTS ProfileCorrNWeight;

CREATE VIEW ProfileCorrNWeight AS
SELECT source, target, weight w FROM ProfileCorrelationNWeight WHERE source < target
UNION
SELECT target, source, weight w FROM ProfileCorrelationNWeight WHERE source > target;

CREATE VIEW ProfileCorrFinalNWeight AS
SELECT source, target, SUM(w) AS weight
FROM ProfileCorrNWeight
GROUP BY source, target;

SELECT * FROM ProfileCorrFinalNWeight;

-- Writing profile pairs with similarity for the network
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-number-fb.csv', 'SELECT * FROM ProfileCorrFinalNWeight');

# Profile Network

![profile network](profile-network-fb.png "Profile Network")

In [21]:
DROP VIEW IF EXISTS ProfileCorrelationSWeight;

CREATE VIEW ProfileCorrelationSWeight AS
  SELECT PC.profile1 AS source, PC.profile2 as target, COUNT(*) as weight
  FROM ProfileCorrelation PC
  GROUP BY PC.profile1, PC.profile2;
  
SELECT COUNT(*), SUM(weight) FROM ProfileCorrelationSWeight;
SELECT * FROM ProfileCorrelationSWeight;

-- Writing profile pairs with similarity for the network
CALL CSVWRITE('../data/nhanes2005-2006/profile-pair-correlation-similarity-fb.csv', 'SELECT * FROM ProfileCorrelationSWeight');