In [1]:
using CSV, DataFrames, Distributions, Gadfly, GLM

## 1. Loading the data

Make sure you have downloaded the data into this notebook's directory.

In [2]:
# Recent CSV.jl versions require the sink type (here DataFrame) as the second argument
data = CSV.read("train.csv", DataFrame)
first(data, 5)


| Row | id | radius | texture | perimeter | area | smoothness | compactness | concavity |
|---|---|---|---|---|---|---|---|---|
| | Int64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 | Float64 |
| 1 | 1 | 16.641 | 21.3452 | 110.798 | 901.347 | 0.112388 | 0.145148 | 0.153247 |
| 2 | 2 | 13.6884 | 21.1035 | 90.5878 | 578.09 | 0.0941204 | 0.101193 | 0.035445 |
| 3 | 3 | 12.9131 | 14.3055 | 85.3309 | 511.539 | 0.0863069 | 0.136213 | 0.13536 |
| 4 | 4 | 12.9474 | 15.1198 | 86.2821 | 539.445 | 0.0925851 | 0.0849802 | 0.0937507 |
| 5 | 5 | 19.4972 | 24.8959 | 127.491 | 1200.37 | 0.101691 | 0.0989018 | 0.166237 |


## VIF analysis
We run a variance inflation factor (VIF) analysis to identify the variables that are collinear.
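
As a brief reminder of the quantity computed in the cells below, written to match the notebook's code (the symbols $R_j^2$, $\mathrm{SSE}_j$ and $\mathrm{SST}_j$ are notation chosen here and may differ from the course notes): each explanatory variable $x_j$ is regressed on the intercept and all the other explanatory variables, and then

$$
R_j^2 = 1 - \frac{\mathrm{SSE}_j}{\mathrm{SST}_j}, \qquad \mathrm{VIF}_j = \frac{1}{1 - R_j^2}.
$$

The closer $R_j^2$ is to 1, that is, the better $x_j$ is explained by the other variables, the larger $\mathrm{VIF}_j$ becomes.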

In [3]:
# Gather the data into a matrix so that the VIF formula seen in class can be applied
y = data.diagnosis
n = length(y)

x1 = data.radius
x2 = data.texture

x3 = data.perimeter
x4 = data.area
x5 = data.smoothness
x6 = data.compactness
x7 = data.concavity
x8 = data.concave_points
x9 = data.symmetry
x10 = data.fractal_dimension


X = hcat(ones(n), x1, x2, x3, x4, x5, x6, x7, x8, x9, x10)

455×11 Array{Float64,2}:
 1.0  16.641    21.3452  110.798   …  0.0920613   0.198671  0.0623664
 1.0  13.6884   21.1035   90.5878     0.0235371   0.159577  0.0666234
 1.0  12.9131   14.3055   85.3309     0.0403623   0.161607  0.0643816
 1.0  12.9474   15.1198   86.2821     0.0347676   0.180346  0.0620803
 1.0  19.4972   24.8959  127.491      0.0910854   0.163297  0.0525529
 1.0  17.9685   23.9732  113.391   …  0.0484731   0.151199  0.0557704
 1.0  23.3328   26.9409  158.255      0.162356    0.219854  0.0638086
 1.0  14.2786   19.7755   97.6468     0.077536    0.168895  0.0776988
 1.0  15.1174   17.6523   97.2838     0.086759    0.195411  0.0621996
 1.0  11.6321   17.1527   72.8543     0.0119356   0.164791  0.0580381
 1.0  10.2032   15.277    62.1731  …  0.00219165  0.173398  0.0601456
 1.0  17.9246   23.1838  119.639      0.100181    0.168613  0.0733048
 1.0  12.353    13.3258   76.958      0.0303991   0.189011  0.0640515
 ⋮                                 ⋱                        ⋮    

In [4]:
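# Compute the VIF of each explanatory variable.
# The first column of structureMatrix is expected to be the intercept (a column of ones).
function compute_VIF(structureMatrix::Array{T,2} where T<:Real)

    n, m = size(structureMatrix)

    p = m-1  # number of explanatory variables

    VIF = Float64[]

    for j in 2:m

        # Regress column j on all the other columns (intercept included)
        y = structureMatrix[:, j]
        X = structureMatrix[:, setdiff(1:m, j)]

        β̂ = X\y        # least-squares estimate

        e = y - X*β̂    # residuals

        SST = sum((y .- mean(y)).^2)   # total sum of squares
        SSE = e'e                      # residual sum of squares

        R² = 1 - SSE/SST

        push!(VIF, 1/(1-R²))

    end

    return VIF

end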

VIF = compute_VIF(X)

10-element Array{Float64,1}:
 266.1297590763434   
   1.1706510066905178
 301.44740092368613  
  53.67515909901567  
   3.0052608367491307
  12.706350011376493 
  10.931677551238156 
  21.945126992683015 
   1.7760514674474783
   6.19575964017116  
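
To make this output easier to read, the sketch below pairs each VIF with its variable name and lists those above 10, the threshold used later in this notebook. The names `names_x`, `vif_rounded`, `threshold` and `flagged` are introduced here for illustration only, and the values are rounded copies of the output above rather than a new computation.

```julia
# Variable names in the column order used to build X (intercept excluded)
names_x = ["radius", "texture", "perimeter", "area", "smoothness",
           "compactness", "concavity", "concave_points", "symmetry",
           "fractal_dimension"]

# VIF values copied from the notebook output above, rounded to two decimals
vif_rounded = [266.13, 1.17, 301.45, 53.68, 3.01, 12.71, 10.93, 21.95, 1.78, 6.20]

threshold = 10.0

# Keep the names whose VIF exceeds the threshold
flagged = [name for (name, v) in zip(names_x, vif_rounded) if v > threshold]

println(flagged)
```

Six of the ten variables exceed the threshold, which motivates removing some of them further down.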

### We obtain the following Pearson correlation matrix:
<img src="matrice_pearson.png">

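 We first observe that radius, perimeter and area are strongly correlated with one another (pairwise correlations above 0.98), as are concavity and compactness (0.88). The variable concave_points is in turn correlated with radius, perimeter, area, compactness and concavity (correlations between 0.82 and 0.92).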
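
Reading such pairs off the matrix can also be done programmatically. The sketch below is a minimal illustration on a 4×4 excerpt of the notebook's Pearson matrix (values rounded to three decimals); the cutoff of 0.8 and the names `vars`, `R`, `cutoff` and `strong_pairs` are assumptions made for this example, not part of the original analysis.

```julia
# Excerpt of the Pearson matrix for four variables, rounded from the notebook's values
vars = ["radius", "texture", "perimeter", "area"]
R = [1.000 0.317 0.996 0.988;
     0.317 1.000 0.324 0.318;
     0.996 0.324 1.000 0.986;
     0.988 0.318 0.986 1.000]

cutoff = 0.8  # illustrative threshold (assumption)

# Scan the upper triangle and keep the pairs whose |correlation| exceeds the cutoff
strong_pairs = [(vars[i], vars[j], R[i, j])
                for i in 1:length(vars) for j in (i + 1):length(vars)
                if abs(R[i, j]) > cutoff]

for (a, b, r) in strong_pairs
    println(a, " - ", b, ": ", r)
end
```

On this excerpt only the three pairs within the radius, perimeter and area group pass the cutoff; texture is not strongly correlated with any of them.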

In [8]:
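# Column 1 of X is the constant intercept, so its correlations are undefined (NaN)
pearson_matrix = cor(X)
print(pearson_matrix)
# Recent DataFrames.jl versions need :auto to generate column names from a matrix
CSV.write("pearson.csv", DataFrame(pearson_matrix, :auto), header=false)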

[1.0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN;
 NaN 1.0 0.316611 0.99641 0.987744 0.127805 0.495466 0.666441 0.819577 0.112859 -0.319216;
 NaN 0.316611 1.0 0.32387 0.317515 -0.0292291 0.227757 0.295231 0.290478 0.0600477 -0.0765374;
 NaN 0.99641 0.32387 1.0 0.986401 0.161333 0.544293 0.705321 0.846982 0.149781 -0.272874;
 NaN 0.987744 0.317515 0.986401 1.0 0.138941 0.488267 0.674318 0.820254 0.115079 -0.291538;
 NaN 0.127805 -0.0292291 0.161333 0.138941 1.0 0.633129 0.485734 0.522616 0.563241 0.609808;
 NaN 0.495466 0.227757 0.544293 0.488267 0.633129 1.0 0.8802 0.828568 0.588045 0.561278;
 NaN 0.666441 0.295231 0.705321 0.674318 0.485734 0.8802 1.0 0.916493 0.476153 0.336019;
 NaN 0.819577 0.290478 0.846982 0.820254 0.522616 0.828568 0.916493 1.0 0.441229 0.164148;
 NaN 0.112859 0.0600477 0.149781 0.115079 0.563241 0.588045 0.476153 0.441229 1.0 0.490386;
 NaN -0.319216 -0.0765374 -0.272874 -0.291538 0.609808 0.561278 0.336019 0.164148 0.490386 1.0]

"pearson.csv"

From this analysis, we identify some variables that are already largely explained by others. This is the case for the radius - perimeter - area group, as well as for concave_points. To obtain a matrix without too much collinearity, we can choose to remove the variables perimeter, area and concave_points, and then check that all VIFs fall below 10.

In [9]:
new_X = hcat(ones(n), x1, x2, x5, x6, x7, x9, x10)
compute_VIF(new_X)

7-element Array{Float64,1}:
 5.5327097389763775
 1.1669474886241644
 2.2058772479566606
 8.8653931948315   
 6.715173681197087 
 1.7536338017101563
 5.3889528405995994