# Face PCA

This notebook computes the face PCA and exports the resulting database.
The workflow is as follows:

- Open the necessary files, scripts, and data
- Remove the IDs that lack covariates or 3D face data
- PCA workflow:
    - Run an initial PCA on a random sample with equal numbers of males and females, detect outliers using the Mahalanobis distance from the origin, and remove them
    - Run a new PCA on the non-outliers, again on a random sample with equal numbers of males and females
    - To detect differences due to sampling error, repeat the sampling *n* times and measure the correlation of the PC scores across repetitions
    - If the correlation is high enough, keep the result of the first PCA
- Export final dataset

# Preliminaries

Set the folder paths and load the face data, the reference scan, and the covariates

In [1]:
%Folders
folder.codePath = 'C:\Users\tzarzar\Box Sync\Research\GeneralCode\Matlab Code';
addpath(genpath(folder.codePath));
folder.dataPath16 = 'R:\ShriverLab\Facial features input files and databases\PSU_KU_WFS2016';
folder.databases  = 'C:\Users\tzarzar\Box Sync\Research\FacialSD\DataBases';
folder.results    = 'C:\Users\tzarzar\Box Sync\Research\FacialSD\Results\FacePCA';
cd(folder.dataPath16);
load PENNDATA_AccumulatedShapeData; %loading faces (Data)
load RefScan; %loading AM (RefScan)
load NewMaskIndex; %loading AM trimming index (MaskIndex)
crop(RefScan, 'VertexIndex', MaskIndex); %Trimming the RefScan
cd(folder.databases);
Covariates = readtable('Covariates.csv');





Get the intersection between Covariates and face data

In [2]:
[keep1, keep2] = GetIntersection(Data, Covariates);
Data           = reduceData(Data, keep1);
Covariates     = Covariates(keep2, :);
nFaces         = length(Data.Names)
nCovariates    = length(Covariates.ID)


nFaces =

        5939


nCovariates =

        5939
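
GetIntersection and reduceData are project functions whose source is not shown in this notebook. As a rough sketch of the idea only, two ID lists can be matched and kept in the same order with the built-in `intersect`. The toy IDs and variable names below are made up for illustration.

```matlab
% Toy stand-ins for Data.Names and Covariates.ID (hypothetical values)
faceNames = {'a'; 'b'; 'c'};
covIDs    = {'c'; 'a'; 'd'};

% 'stable' keeps the shared IDs in the order they appear in faceNames
[common, keepFaces, keepCov] = intersect(faceNames, covIDs, 'stable');

% keepFaces indexes faceNames and keepCov indexes covIDs, so both
% selections list the shared IDs in the same order
alignedFaces = faceNames(keepFaces);
alignedCov   = covIDs(keepCov);
```

Keeping both index vectors matters because the PC scores are later joined to the covariates row by row.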





Get the number of males and females in the dataset

In [3]:
sum(strcmp(Covariates.Sex, 'Female'))
sum(strcmp(Covariates.Sex, 'Male'))


ans =

        3617


ans =

        2284
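
The two counts sum to 5901, which is 38 fewer than the 5939 individuals retained above, so some rows presumably carry a missing or different Sex label. One way to tabulate every label at once is sketched below on made-up data. The variable names are hypothetical.

```matlab
% Toy labels, one of them empty (hypothetical stand-in for Covariates.Sex)
sex = {'Female'; 'Male'; 'Female'; ''; 'Male'; 'Female'};

c        = categorical(sex);
labels   = categories(c);         % distinct non-missing labels
counts   = countcats(c);          % count per label, in the same order as labels
nMissing = sum(isundefined(c));   % empty strings become <undefined>
```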





## Getting centroid size

In [8]:
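CS = zeros(1, size(Data.Shape, 2)); %One centroid size per face
for i = 1:size(CS, 2)
    obj = meshObj();
    obj.Faces    = RefScan.Faces;
    obj.Vertices = reshape(Data.Shape(:,i), [3 (size(Data.Shape, 1) / 3)]); %3 x nVertices coordinates
    CS(1,i)      = centroidSize(obj);
end
CS = log(CS); %Work with log centroid size
CS(1,CS > 4.6) = NaN; %Set log centroid sizes above 4.6 to missing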
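
For reference, centroid size is conventionally defined as the square root of the summed squared distances of all vertices from their centroid. The sketch below computes that quantity directly for a toy 3 x N vertex matrix. It illustrates the standard definition and may not match the toolbox's centroidSize implementation in every detail; all variable names are hypothetical.

```matlab
% Toy configuration: 4 vertices stored as a 3 x N matrix (x; y; z),
% the same layout the notebook assigns to obj.Vertices
V = [1 -1 0  0;
     0  0 1 -1;
     0  0 0  0];

centroid = mean(V, 2);            % 3 x 1 centroid of the vertices
D        = V - centroid;          % offsets from the centroid (implicit expansion, R2016b+)
cs       = sqrt(sum(D(:) .^ 2));  % centroid size
logCS    = log(cs);               % the notebook stores the log of this value
```

For this toy configuration each vertex lies at distance 1 from the centroid, so the centroid size is 2.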





## GPA

Generalized Procrustes Analysis

In [9]:
TotalShape = [Data.Shape, Data.NormShape]; %Concatenation of original faces (Data.Shape) and their reflections (Data.NormShape)
model      = shapePCA; %Creating an empty shape space object
model.RefScan = clone(RefScan); %Defining the AM that was used to create the shape space
AlignedData   = LSGenProcrustes(model, TotalShape, true, 3, RefScan);

Starting parallel pool (parpool) using the 'local' profile ...
connected to 4 workers.




Decomposing faces into components of symmetry and asymmetry

In [10]:
OrigHead = AlignedData(:, 1:nFaces);
ReflHead = AlignedData(:, nFaces+1:end);
SymHead  = (OrigHead + ReflHead)/2; %facial symmetry component
AsymHead = (OrigHead - ReflHead) + mean(SymHead, 2); %facial asymmetry component
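
In symbols, the cell above computes the following, where $O_i$ and $R_i$ denote the aligned original and reflected configurations of face $i$ (the columns of OrigHead and ReflHead) and $n$ is the number of faces. The symbols are shorthand introduced for this note only.

$$
S_i = \frac{O_i + R_i}{2}, \qquad
A_i = (O_i - R_i) + \bar{S}, \qquad
\bar{S} = \frac{1}{n} \sum_{j=1}^{n} S_j
$$

Adding $\bar{S}$ places the asymmetry component around the mean symmetric face rather than around zero.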





In [11]:
size(SymHead)


ans =

       20370        5939





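Running the Face PCA. The plan is to use a random sample of 2000 males and 2000 females. The sampling cells below are currently commented out, so the PCA in the next section is fitted on all 5939 faces.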

In [12]:
%[sample_fem, ind1]  = datasample(SymHead(:,strcmp(Covariates.Sex, 'Female')), 2000, 2);

%[sample_male, ind2] = datasample(SymHead(:,strcmp(Covariates.Sex, 'Male')), 2000, 2);
%total_sample = [sample_fem sample_male];





In [13]:
%vect = [1:size(SymHead(:,strcmp(Covariates.Sex, 'Female')), 2)];
%sum(~ismember(vect, ind1))





In [14]:
%SymHead(:,strcmp(Covariates.Sex, 'Female'))
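
Two details of the commented-out cells above are worth keeping in mind if they are re-enabled: `datasample` samples with replacement unless 'Replace' is set to false, and the indices it returns refer to the subset passed in, not to the full dataset. A minimal sketch of a balanced draw without replacement that returns indices into the full dataset is shown below. It uses toy labels, and the seed and variable names are hypothetical.

```matlab
rng(1);                                   % hypothetical seed, for reproducibility
% Toy labels standing in for Covariates.Sex
sex     = [repmat({'Female'}, 6, 1); repmat({'Male'}, 4, 1)];
nPerSex = 3;                              % the notebook proposes 2000

femIdx  = find(strcmp(sex, 'Female'));    % positions in the full dataset
maleIdx = find(strcmp(sex, 'Male'));

% randperm(n, k) returns k distinct integers, so no face is drawn twice
femPick  = femIdx(randperm(numel(femIdx), nPerSex));
malePick = maleIdx(randperm(numel(maleIdx), nPerSex));

sampleIdx = [femPick; malePick];          % column indices usable as SymHead(:, sampleIdx)
```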






## PCA

In [12]:
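getAverage(model, SymHead); %Compute the average head
getModel(model, SymHead); %Fit the PCA on the symmetric component
means = mean(SymHead, 2); %Per-coordinate mean, exported later
stripPercVar(model, 98); %Trim the model by percentage of variance (98)
model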


model = 

  shapePCA with properties:

    AvgVertices: [3×6790 double]
         AvgVec: [20370×1 double]
        RefScan: [1×1 meshObj]
        Average: [1×1 meshObj]
         EigVal: [87×1 double]
         EigVec: [20370×87 double]
         Tcoeff: [5939×87 double]
       AvgCoeff: [87×1 double]
      Centering: 1
           nrEV: 87
         EigStd: [87×1 double]
      Explained: [87×1 double]
              n: 5939
              U: [5939×87 double]
              S: [87×1 double]
              V: [20370×87 double]
           Type: 'shapePCA'
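
The shapePCA class is part of the project toolbox and its internals are not shown here. As a generic illustration of what such a model contains, the sketch below runs a PCA through the singular value decomposition on toy data and keeps the components needed to reach 98% of the variance. The normalization by n - 1 and all variable names are assumptions made for this example, not a description of shapePCA.

```matlab
rng(1);                                  % hypothetical seed
% Toy data: 4 variables (rows) by 6 observations (columns),
% the same variables-by-observations layout as SymHead
X  = randn(4, 6);
Xc = X - mean(X, 2);                     % center each variable (implicit expansion, R2016b+)

[V, S, ~] = svd(Xc, 'econ');             % columns of V are the principal axes
eigVal    = diag(S) .^ 2 / (size(X, 2) - 1);   % variance along each axis (assumes n - 1)
explained = 100 * eigVal / sum(eigVal);        % percentage of variance per component

nKeep  = find(cumsum(explained) >= 98, 1);     % smallest number of PCs reaching 98%
scores = (V(:, 1:nKeep)' * Xc)';               % observations-by-PCs score matrix
```

The score matrix has one row per observation, which matches the 5939 x 87 shape of Tcoeff in the output above.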





In [13]:
model.Tcoeff(1:10, 1:10)


ans =

   -2.7500   -0.0476    1.4329    1.9755   -0.1145   -0.0344    0.4069   -0.3473    0.3787   -0.0347
   -3.0011    0.1057    0.6759    2.1056    1.1979    0.2540   -0.4840   -0.2451   -0.1560    0.6118
   -0.9400   -1.4434    1.2838    0.5728   -1.3678    0.4241   -0.2904   -0.1218    0.2807    0.5717
   -2.5950    1.3977    0.8939    0.1309    0.5854    1.4701   -0.4995    0.8075   -0.1866   -0.1333
    2.5805   -0.0855    0.0949   -0.1114    1.3244    0.6152    0.3337    0.1918    0.0475   -0.3890
   -1.5390   -1.3281    2.0045   -0.8951    0.0862    0.0572   -0.8213    0.7036    0.0923   -0.2445
    1.2813   -0.2222    1.3759   -2.5794    0.9949    1.3452    0.0238    1.9273   -0.1870   -0.4813
    2.0618   -0.1109   -1.0123    0.6022   -0.4662   -0.3614    0.6623   -1.0766   -0.0339    0.7550
    0.5504    1.3370   -1.6557   -0.0498    1.3299    0.4508   -0.1899    0.2561   -0.9428   -0.6860
    0.9299    0.3762    2.2317   -1.2411    0.2449   -0.1337   -0.7296   -0.0513   



# Removing outliers

We use the Mahalanobis distance from the origin of the PC space as a measure of how atypical each face is

In [14]:
origin    = zeros(1, size(model.Tcoeff, 2)); 
mahaldist = sqrt(sum(( (model.Tcoeff ./ model.EigStd') - origin ) .^ 2, 2 ));
size(mahaldist)


ans =

        5939           1
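
Written out, the distance computed above for face $i$ is the Euclidean norm of its PC scores after each score is divided by the standard deviation of its component. Here $t_{ik}$ stands for model.Tcoeff(i,k), $\sigma_k$ for model.EigStd(k), and $K$ for the number of retained components (87 here). The symbols are shorthand introduced for this note.

$$
d_i = \sqrt{\sum_{k=1}^{K} \left( \frac{t_{ik}}{\sigma_k} \right)^{2}}
$$

Because PC scores are uncorrelated and centered, this equals the Mahalanobis distance from the origin in the retained PC space, assuming EigStd holds the per-component standard deviations of the scores.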





We define as outliers the individuals whose distance is more than three scaled median absolute deviations (MAD) from the median of the mahaldist distribution, which is the default rule of isoutlier.
The cell below reports the number of outliers and their individual IDs

In [15]:
sum(isoutlier(mahaldist))
Covariates.ID(isoutlier(mahaldist))


ans =

   213


ans =

  213×1 cell array

    '131203'
    '131239'
    '132046'
    '132047'
    '132067'
    '140056'
    '140103'
    '140219'
    '140258'
    '140478'
    '140490'
    '140518'
    '140697'
    '140713'
    '140739'
    '140909'
    '141181'
    '141183'
    '141188'
    '141204'
    '141211'
    '141248'
    '141309'
    '141358'
    '141378'
    '141399'
    '141469'
    '141502'
    '141551'
    '141574'
    '141956'
    '142007'
    '143026'
    '143293'
    '143470'
    '143534'
    '143551'
    '143552'
    '143561'
    '50238'
    '50239'
    '50243'
    '50248'
    '50250'
    '50259'
    '50275'
    '50286'
    '50310'
    '50313'
    '50324'
    '50326'
    '50347'
    '50380'
    '50606'
    '50630'
    '50656'
    '50657'
    '50670'
    '50692'
    '50759'
    '50791'
    '50838'
    '50841'
    '50910'
    '50920'
    '50942'
    '60032'
    '60068'
    '60081'
    '60141'
    '60177'
    '60178'
    '60190'
    '60198'
    '60251'
    '60252'
    '
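
The outlier rule used above can be reproduced by hand. The sketch below applies it to a toy vector, using the scaling constant given in the MATLAB documentation for the scaled MAD. The toy values and variable names are hypothetical.

```matlab
x = [1 2 3 4 5 100];                       % toy distances, one obviously extreme

c         = -1 / (sqrt(2) * erfcinv(3/2)); % scaling constant, about 1.4826
scaledMAD = c * median(abs(x - median(x)));

% Flag values more than three scaled MADs away from the median
isOut = abs(x - median(x)) > 3 * scaledMAD;
```

Only the last element is flagged here. On releases that provide isoutlier, `isoutlier(x)` should give the same result for this vector.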



# Creating database for export

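Export the database with the identified outliers removed.
We also compute BMI as weight divided by squared height, with Height divided by 100 first (which suggests it is recorded in centimeters).
Note that only the coefficient table is filtered for outliers; the eigenvalues, eigenvectors, and means are those of the PCA fitted on all faces.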

In [16]:
PCnames        = strseq('PC', 1:model.nrEV); %PC1, PC2, ..., one name per retained PC
Covariates.BMI = Covariates.Weight ./ ( (Covariates.Height ./ 100) .^2); %Height is divided by 100 before squaring
Covariates.CS  = CS'; %Log centroid size
coeffs         = [Covariates, array2table(model.Tcoeff, 'VariableNames', PCnames)]; %Covariates followed by PC scores
cd(folder.results)
csvwrite('eigenvalues.csv', model.EigVal);
csvwrite('eigenvectors.csv', model.EigVec);
csvwrite('means.csv', means);
csvwrite('facets.csv', model.Average.Faces');
writetable(coeffs(~isoutlier(mahaldist), :), 'coeffs.csv') %Only non-outlier rows are written
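
One caveat on the export: csvwrite writes at most five significant digits, which may be coarse for eigenvectors. A minimal sketch of a higher-precision alternative is shown below, assuming `dlmwrite` is available (newer releases also offer `writematrix`). The file name and toy matrix are hypothetical.

```matlab
M    = [pi; exp(1); 1/3];          % toy values that need more than five digits
file = [tempname '.csv'];          % hypothetical temporary file

% Write with 15 significant digits instead of the csvwrite default
dlmwrite(file, M, 'delimiter', ',', 'precision', '%.15g');

back   = dlmread(file, ',');       % read the values back
maxErr = max(abs(back - M));       % round-trip error
delete(file);                      % clean up the temporary file
```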



