
# MTH3302: Probabilistic and Statistical Methods for AI

Jonathan Jalbert<br/>
Assistant Professor, Department of Mathematics and Industrial Engineering<br/>
Polytechnique Montréal<br/>

The project was developed with the help of Alice Breton, a master's student in computer engineering, who took the course in the Winter 2019 term.



# Project: Sewer Overflows

The project description is available at the following address:
https://www.kaggle.com/t/a238b752c33a41d9803c2cdde6bfc929

This base Jupyter notebook loads and cleans the provided data. The last section details how to generate the prediction file so that it can be submitted to Kaggle in the correct format.

First, retrieve the archive *data.zip* from Moodle. It contains the following files:
- surverses.csv
- precipitations.csv
- ouvrages-surverses.csv
- test.csv

Unzip it in the directory of this notebook.

The file *surverses.csv* records whether an overflow occurred (1) or not (0) during the day, for the 170 overflow structures, from 2013 to 2018 and for the months of May to October (inclusive). Additional information on the data is available at the following address:

http://donnees.ville.montreal.qc.ca/dataset/debordement


The file *precipitations.csv* contains the hourly precipitation, in tenths of a *mm*, recorded at 5 rain gauge stations from 2013 to 2019:
- McTavish (7024745)
- Ste-Anne-de-Bellevue (702FHL8)
- Montreal/Pierre Elliott Trudeau Intl (702S006)
- Montreal/St-Hubert (7027329)
- L’Assomption (7014160)
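
Because the readings are stored in tenths of a millimetre, a value of 154 corresponds to 15.4 mm. The conversion can be sketched as follows; the function name `tenths_to_mm` and the sample readings are hypothetical and are not part of the project files.

```julia
# Convert a reading in tenths of a millimetre to millimetres.
# `missing / 10` is `missing`, so gaps in the record propagate unchanged.
tenths_to_mm(x) = x / 10

readings = [154, 0, missing, 12]   # hypothetical hourly values, in 0.1 mm
mm = tenths_to_mm.(readings)       # broadcast over the whole vector
```

The rest of the notebook keeps the original unit, so this conversion is only needed when reporting amounts in millimetres.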

More information on the precipitation data is available at the following address:

https://climat.meteo.gc.ca/climate_data/hourly_data_f.html?hlyRange=2008-01-08%7C2019-11-12&dlyRange=2002-12-23%7C2019-11-12&mlyRange=%7C&StationID=30165&Prov=QC&urlExtension=_f.html&searchType=stnName&optLimit=yearRange&StartYear=1840&EndYear=2019&selRowPerPage=25&Line=17&searchMethod=contains&Month=11&Day=12&txtStationName=montreal&timeframe=1&Year=2019

The file *ouvrages-surverses.csv* contains various characteristics of the overflow structures.

http://donnees.ville.montreal.qc.ca/dataset/ouvrage-surverse

The file *test.csv* contains the structures and the days for which you must predict whether an overflow occurred (true) or not (false). Note that the focus here is on 5 overflow structures located around the Island of Montreal:
- 3260-01D in Rivière-des-Prairies
- 3350-07D in Ahuntsic
- 4240-01D in Pointe-aux-Trembles
- 4350-01D in Old Montreal
- 4380-01D in Verdun

#### Remark

This project considers only overflows caused by precipitation. Overflows caused by the following are ignored:
- snowmelt (F)
- planned work and maintenance (TPL)
- emergency (U)
- other (AUT)

When no reason is recorded for an overflow, it is assumed to have been caused by precipitation. Since only overflows caused by liquid precipitation are of interest, only the months of May to October (inclusive) are considered.

In [31]:
using CSV, DataFrames, Statistics, Dates, Gadfly, Missings

# Data loading and preliminary cleaning

## Loading the overflow data

In [32]:
data = CSV.read("data/surverses.csv",missingstring="-99999")
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String⍰
1,0642-01D,2013-05-01,0,missing
2,0642-01D,2013-05-02,0,missing
3,0642-01D,2013-05-03,0,missing
4,0642-01D,2013-05-04,0,missing
5,0642-01D,2013-05-05,0,missing


## Cleaning the overflow data

#### Extracting overflows for the months of May to October inclusive

In [33]:
data = filter(row -> month(row.DATE) > 4, data) 
data = filter(row -> month(row.DATE) < 11, data) 
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String⍰
1,0642-01D,2013-05-01,0,missing
2,0642-01D,2013-05-02,0,missing
3,0642-01D,2013-05-03,0,missing
4,0642-01D,2013-05-04,0,missing
5,0642-01D,2013-05-05,0,missing
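
The two successive filters above amount to a single range test on the month. A minimal sketch on plain dates, where the predicate name `in_season` and the sample dates are hypothetical:

```julia
using Dates

# True for any date from May 1 to October 31, whatever the year.
in_season(d::Date) = 5 <= month(d) <= 10

dates = [Date(2013, 4, 30), Date(2013, 5, 1), Date(2013, 10, 31), Date(2013, 11, 1)]
kept = filter(in_season, dates)   # keeps only the two middle dates
```

The same predicate could be applied to a row, for example `row -> in_season(row.DATE)`, assuming the column holds `Date` values as in the table above.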


#### Replacing *missing* values in the :RAISON column with "Inconnue"

In [34]:
raison = coalesce.(data[:,:RAISON],"Inconnue")
data[!,:RAISON] = raison
first(data,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE,RAISON
Unnamed: 0_level_1,String,Date,Int64⍰,String
1,0642-01D,2013-05-01,0,Inconnue
2,0642-01D,2013-05-02,0,Inconnue
3,0642-01D,2013-05-03,0,Inconnue
4,0642-01D,2013-05-04,0,Inconnue
5,0642-01D,2013-05-05,0,Inconnue


#### Excluding overflows caused by factors other than liquid precipitation

These factors are:
- snowmelt (F)
- planned work and maintenance (TPL)
- emergency (U)
- other (AUT)

In [35]:
data = filter(row -> row.RAISON ∈ ["P","Inconnue","TS"], data) 
select!(data, [:NO_OUVRAGE, :DATE, :SURVERSE])
first(data,5)
data

Unnamed: 0_level_0,NO_OUVRAGE,DATE,SURVERSE
Unnamed: 0_level_1,String,Date,Int64⍰
1,0642-01D,2013-05-01,0
2,0642-01D,2013-05-02,0
3,0642-01D,2013-05-03,0
4,0642-01D,2013-05-04,0
5,0642-01D,2013-05-05,0
6,0642-01D,2013-05-06,0
7,0642-01D,2013-05-07,0
8,0642-01D,2013-05-08,0
9,0642-01D,2013-05-09,0
10,0642-01D,2013-05-10,0


#### Excluding rows where :SURVERSE is missing

In [36]:
surverse_df = dropmissing(data, disallowmissing=true);

In [37]:
n₁ = sum(x->x==1, surverse_df[!, :SURVERSE], dims=1)
n₀ = sum(x->x==0, surverse_df[!, :SURVERSE], dims=1)
n = n₀ + n₁


1-element Array{Int64,1}:
 161098

In [38]:
filtervals = ["3260-01D"; "3350-07D"; "4240-01D"; "4350-01D"; "4380-01D"]
surverse_df1 = filter(row-> row.NO_OUVRAGE == filtervals[1], surverse_df)
surverse_df2 = filter(row-> row.NO_OUVRAGE == filtervals[2], surverse_df)
surverse_df3 = filter(row-> row.NO_OUVRAGE == filtervals[3], surverse_df)
surverse_df4 = filter(row-> row.NO_OUVRAGE == filtervals[4], surverse_df)
surverse_df5 = filter(row-> row.NO_OUVRAGE == filtervals[5], surverse_df);

In [39]:
# For each structure, store the number of days with and without an overflow
n₁ = Int64[]
n₀  = Int64[]
n  = Int64[]

0-element Array{Int64,1}

In [40]:
n1₁ = sum(x->x==1, surverse_df1[!, :SURVERSE], dims=1)
push!(n₁, n1₁[1])
n1₀ = sum(x->x==0, surverse_df1[!, :SURVERSE], dims=1)
push!(n₀, n1₀[1])
n1= n1₁[1] + n1₀[1]
push!(n, n1)


1-element Array{Int64,1}:
 1097

In [41]:
n1₁ = sum(x->x==1, surverse_df2[!, :SURVERSE], dims=1)
push!(n₁, n1₁[1])
n1₀ = sum(x->x==0, surverse_df2[!, :SURVERSE], dims=1)
push!(n₀, n1₀[1])
n1= n1₁[1] + n1₀[1]
push!(n, n1)


2-element Array{Int64,1}:
 1097
  729

In [42]:
n1₁ = sum(x->x==1, surverse_df3[!, :SURVERSE], dims=1)
push!(n₁, n1₁[1])
n1₀ = sum(x->x==0, surverse_df3[!, :SURVERSE], dims=1)
push!(n₀, n1₀[1])
n1= n1₁[1] + n1₀[1]
push!(n, n1)


3-element Array{Int64,1}:
 1097
  729
 1100

In [43]:
n1₁ = sum(x->x==1, surverse_df4[!, :SURVERSE], dims=1)
push!(n₁, n1₁[1])
n1₀ = sum(x->x==0, surverse_df4[!, :SURVERSE], dims=1)
push!(n₀, n1₀[1])
n1= n1₁[1] + n1₀[1]
push!(n, n1)


4-element Array{Int64,1}:
 1097
  729
 1100
 1100

In [44]:
n1₁ = sum(x->x==1, surverse_df5[!, :SURVERSE], dims=1)
push!(n₁, n1₁[1])
n1₀ = sum(x->x==0, surverse_df5[!, :SURVERSE], dims=1)
push!(n₀, n1₀[1])
n1= n1₁[1] + n1₀[1]
push!(n, n1)


5-element Array{Int64,1}:
 1097
  729
 1100
 1100
 1103

In [203]:
n

5-element Array{Int64,1}:
 1097
  729
 1100
 1100
 1103
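
The per-structure counts built cell by cell above can also be obtained in one pass. The sketch below works on a hypothetical toy list of `(structure, overflow)` pairs rather than on `surverse_df`, and the names `records`, `structures`, `n_overflow`, `n_dry`, `n_total` and `prior` are illustrative only.

```julia
# Hypothetical toy records: (structure ID, overflow indicator).
records = [("3260-01D", 1), ("3260-01D", 0), ("3260-01D", 0),
           ("3350-07D", 1), ("3350-07D", 1)]
structures = ["3260-01D", "3350-07D"]

# Count the days with (1) and without (0) an overflow for each structure.
n_overflow = [count(r -> r[1] == s && r[2] == 1, records) for s in structures]
n_dry      = [count(r -> r[1] == s && r[2] == 0, records) for s in structures]
n_total    = n_overflow .+ n_dry

# Empirical overflow frequency per structure, usable as a prior probability.
prior = n_overflow ./ n_total
```

Keeping the counts in vectors indexed like `structures` mirrors how `n₁`, `n₀` and `n` are indexed by position in `filtervals`.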

## Loading the precipitation data

In [204]:
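# Separate copy of the precipitation data; it is reduced to April further down
# so that the day before May 1 is available
test = CSV.read("data/precipitations.csv",missingstring="-99999")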
rename!(test, Symbol("St-Hubert")=>:StHubert)
first(test,5)

Unnamed: 0_level_0,date,heure,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-01-01,0,0,0,0,0,missing
2,2013-01-01,1,0,0,0,0,missing
3,2013-01-01,2,0,0,0,0,missing
4,2013-01-01,3,0,0,0,0,missing
5,2013-01-01,4,0,0,0,0,missing


In [205]:
data = CSV.read("data/precipitations.csv",missingstring="-99999")
rename!(data, Symbol("St-Hubert")=>:StHubert)
first(data,5)

Unnamed: 0_level_0,date,heure,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-01-01,0,0,0,0,0,missing
2,2013-01-01,1,0,0,0,0,missing
3,2013-01-01,2,0,0,0,0,missing
4,2013-01-01,3,0,0,0,0,missing
5,2013-01-01,4,0,0,0,0,missing


## Cleaning the precipitation data

#### Extracting precipitation for the months of May to October inclusive

In [206]:
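# Keep May to October for the analysis
data = filter(row -> month(row.date) > 4, data)
data = filter(row -> month(row.date) < 11, data)
# Keep April separately: it supplies the previous-day precipitation for May 1
test = filter(row -> month(row.date) == 4, test)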

Unnamed: 0_level_0,date,heure,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-04-01,0,15,4,17,10,missing
2,2013-04-01,1,0,3,10,3,missing
3,2013-04-01,2,0,0,0,0,missing
4,2013-04-01,3,0,0,0,0,missing
5,2013-04-01,4,0,0,0,0,missing
6,2013-04-01,5,0,0,0,0,missing
7,2013-04-01,6,0,0,0,0,missing
8,2013-04-01,7,0,0,0,0,missing
9,2013-04-01,8,0,4,0,3,missing
10,2013-04-01,9,0,0,0,0,missing


# Exploratory analysis

This section is a cursory exploratory analysis to see whether a relationship exists between precipitation and overflows.

Arbitrarily, take the overflow structure near Bota-Bota (4350-01D). The nearest weather station is McTavish. Consider two simple explanatory variables:
- the daily total precipitation
- the daily maximum hourly precipitation rate
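
As a minimal sketch of these two variables for a single station and a single day, the hourly readings can be reduced as follows. The sample vector is hypothetical, and skipping missing hours is an assumption made here: the `sum` and `maximum` calls used in the notebook return `missing` as soon as one hour is missing.

```julia
# Hypothetical hourly readings for one day at one station (0.1 mm); one hour is missing.
hourly = [0, 0, 15, missing, 4, 0]

# Keep only the observed hours.
observed = collect(skipmissing(hourly))

# Daily total and peak hourly rate; a day with no observation stays missing.
daily_sum = isempty(observed) ? missing : sum(observed)
daily_max = isempty(observed) ? missing : maximum(observed)
```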

#### Computing the daily amount of precipitation for each weather station

In [207]:
pcp_sum = by(data, :date,  McTavish = :McTavish=>sum, Bellevue = :Bellevue=>sum, 
   Assomption = :Assomption=>sum, Trudeau = :Trudeau=>sum, StHubert = :StHubert=>sum)
first(pcp_sum ,5)

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-05-01,0,0,0,0,missing
2,2013-05-02,0,0,0,0,missing
3,2013-05-03,0,0,0,0,missing
4,2013-05-04,0,0,0,0,missing
5,2013-05-05,0,0,0,0,missing


In [208]:
pcpBefore_sum = by(test, :date,  McTavish = :McTavish=>sum, Bellevue = :Bellevue=>sum, 
   Assomption = :Assomption=>sum, Trudeau = :Trudeau=>sum, StHubert = :StHubert=>sum)



Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-04-01,39,19,37,24,missing
2,2013-04-02,0,0,0,0,missing
3,2013-04-03,0,0,0,0,missing
4,2013-04-04,0,0,0,0,missing
5,2013-04-05,12,0,0,3,missing
6,2013-04-06,0,0,0,0,missing
7,2013-04-07,0,0,30,0,missing
8,2013-04-08,0,0,0,0,missing
9,2013-04-09,154,132,120,143,missing
10,2013-04-10,0,0,0,0,missing


In [209]:
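# Replace each missing station value by the floored mean of the stations observed that day
# (0 when no station reported a value)
for j=1:size(pcpBefore_sum,1)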
    means = 0
    sum = 0
    alo = names(pcpBefore_sum)
    for col in alo
        if col != alo[1]
            if !ismissing(pcpBefore_sum[j, col]) 
                sum = sum +1
                means = means + pcpBefore_sum[j, col]
            end
        end
    end
    if sum != 0
        means = means / sum
    end
    for col in alo
        if ismissing(pcpBefore_sum[j, col]) && col != alo[1]
            tests = floor(means)
            pcpBefore_sum[j, col] = tests
        end
    end
end


In [210]:

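# Replace each missing station value by the floored mean of the stations observed that day
# (0 when no station reported a value)
for j=1:size(pcp_sum,1)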
    means = 0
    sum = 0
    alo = names(pcp_sum)
    for col in alo
        if col != alo[1]
            if !ismissing(pcp_sum[j, col]) 
                sum = sum +1
                means = means + pcp_sum[j, col]
            end
        end
    end
    if sum != 0
        means = means / sum
    end
    for col in alo
        if ismissing(pcp_sum[j, col]) && col != alo[1]
            tests = floor(means)
            pcp_sum[j, col] = tests
        end
    end
end
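
The imputation rule used in these loops can be isolated in a small function that operates on one row of station values. This is a sketch under the same rule (floored mean of the observed stations, 0 when none is observed); the name `impute_row` is hypothetical.

```julia
# Replace each `missing` in a row by the floored mean of the observed values.
# When nothing is observed, the fill value is 0, as in the loops above.
function impute_row(row)
    observed = collect(skipmissing(row))
    fill_value = isempty(observed) ? 0 : floor(Int, sum(observed) / length(observed))
    return [ismissing(x) ? fill_value : x for x in row]
end

# Values of 2013-04-01 from the daily totals table: the mean of the four
# observed stations is 29.75, so the missing station receives 29.
imputed = impute_row([39, 19, 37, 24, missing])
```

Working row by row keeps the rule easy to test before applying it to every date of a table.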


In [211]:

pcpBefore_max = by(test, :date,  McTavish = :McTavish=>maximum, Bellevue = :Bellevue=>maximum, 
   Assomption = :Assomption=>maximum, Trudeau = :Trudeau=>maximum, StHubert = :StHubert=>maximum)


Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-04-01,15,5,17,10,missing
2,2013-04-02,0,0,0,0,missing
3,2013-04-03,0,0,0,0,missing
4,2013-04-04,0,0,0,0,missing
5,2013-04-05,12,0,0,3,missing
6,2013-04-06,0,0,0,0,missing
7,2013-04-07,0,0,20,0,missing
8,2013-04-08,0,0,0,0,missing
9,2013-04-09,53,42,40,44,missing
10,2013-04-10,0,0,0,0,missing


In [212]:
pcpBefore_max

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-04-01,15,5,17,10,missing
2,2013-04-02,0,0,0,0,missing
3,2013-04-03,0,0,0,0,missing
4,2013-04-04,0,0,0,0,missing
5,2013-04-05,12,0,0,3,missing
6,2013-04-06,0,0,0,0,missing
7,2013-04-07,0,0,20,0,missing
8,2013-04-08,0,0,0,0,missing
9,2013-04-09,53,42,40,44,missing
10,2013-04-10,0,0,0,0,missing


#### Extracting the daily maximum hourly precipitation rate for each weather station

In [213]:
pcp_max = by(data, :date,  McTavish = :McTavish=>maximum, Bellevue = :Bellevue=>maximum, 
   Assomption = :Assomption=>maximum, Trudeau = :Trudeau=>maximum, StHubert = :StHubert=>maximum)
first(pcp_max,5)

Unnamed: 0_level_0,date,McTavish,Bellevue,Assomption,Trudeau,StHubert
Unnamed: 0_level_1,Date,Int64⍰,Int64⍰,Int64⍰,Int64⍰,Int64⍰
1,2013-05-01,0,0,0,0,missing
2,2013-05-02,0,0,0,0,missing
3,2013-05-03,0,0,0,0,missing
4,2013-05-04,0,0,0,0,missing
5,2013-05-05,0,0,0,0,missing


In [214]:
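# Replace each missing station value by the floored mean of the stations observed that day
# (0 when no station reported a value)
for j=1:size(pcp_max,1)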
    means = 0
    sum = 0
    alo = names(pcp_max)
    for col in alo
        if col != alo[1]
            if !ismissing(pcp_max[j, col]) 
                sum = sum +1
                means = means + pcp_max[j, col]
            end
        end
    end
    if sum != 0
        means = means / sum
    end
    for col in alo
        if ismissing(pcp_max[j, col]) && col != alo[1]
            tests = floor(means)
            pcp_max[j, col] = tests
        end
    end
end

In [215]:
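# Replace each missing station value by the floored mean of the stations observed that day
# (0 when no station reported a value)
for j=1:size(pcpBefore_max,1)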
    means = 0
    sum = 0
    alo = names(pcpBefore_max)
    for col in alo
        if col != alo[1]
            if !ismissing(pcpBefore_max[j, col]) 
                sum = sum +1
                means = means + pcpBefore_max[j, col]
            end
        end
    end
    if sum != 0
        means = means / sum
    end
    for col in alo
        if ismissing(pcpBefore_max[j, col]) && col != alo[1]
            tests = floor(means)
            pcpBefore_max[j, col] = tests
        end
    end
end

In [216]:
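# Collapse the station columns into a single one: the value stored under :McTavish
# is the sum over the stations (`means` is never divided by the count `sum`)
p = DataFrame(McTavish =Int64[])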
for j=1:size(pcpBefore_max,1)
    alo = names(pcpBefore_max)
    means = 0
    sum = 0
    for col in alo
        if col != alo[1]
            if !ismissing(pcpBefore_max[j, col]) 
                sum = sum +1
                means = means + pcpBefore_max[j, col]
            end
        end
    end
    push!(p, means)
end
pcpBefore_max = DataFrame(date = pcpBefore_max[!, :date], McTavish = p[!, :McTavish]);


In [217]:
# Same aggregation for the April daily totals: sum over the stations, stored under :McTavish
p = DataFrame(McTavish =Int64[])
for j=1:size(pcpBefore_sum,1)
    alo = names(pcpBefore_sum)
    means = 0
    sum = 0
    for col in alo
        if col != alo[1]
            if !ismissing(pcpBefore_sum[j, col])
                sum = sum +1
                means = means + pcpBefore_sum[j, col]
            end
        end
    end
    push!(p, means)
end
pcpBefore_sum = DataFrame(date = pcpBefore_sum[!, :date], McTavish = p[!, :McTavish]);

In [218]:
p = DataFrame(McTavish =Int64[])
for j=1:size(pcp_max,1)
    alo = names(pcp_max)
    means = 0
    sum = 0
    for col in alo
        if col != alo[1]
            if !ismissing(pcp_max[j, col]) 
                sum = sum +1
                means = means + pcp_max[j, col]
            end
        end
    end
    push!(p, means)
end
pcp_max = DataFrame(date = pcp_max[!, :date], McTavish = p[!, :McTavish])


Unnamed: 0_level_0,date,McTavish
Unnamed: 0_level_1,Date,Int64
1,2013-05-01,0
2,2013-05-02,0
3,2013-05-03,0
4,2013-05-04,0
5,2013-05-05,0
6,2013-05-06,0
7,2013-05-07,0
8,2013-05-08,0
9,2013-05-09,36
10,2013-05-10,30


In [219]:
p = DataFrame(McTavish =Int64[])
for j=1:size(pcp_sum,1)
    alo = names(pcp_sum)
    means = 0
    sum = 0
    for col in alo
        if col != alo[1]
            if !ismissing(pcp_sum[j, col]) 
                sum = sum +1
                means = means + pcp_sum[j, col]
            end
        end
    end
    push!(p, means)
end
pcp_sum = DataFrame(date = pcp_sum[!, :date], McTavish = p[!, :McTavish])


Unnamed: 0_level_0,date,McTavish
Unnamed: 0_level_1,Date,Int64
1,2013-05-01,0
2,2013-05-02,0
3,2013-05-03,0
4,2013-05-04,0
5,2013-05-05,0
6,2013-05-06,0
7,2013-05-07,0
8,2013-05-08,0
9,2013-05-09,36
10,2013-05-10,30


#### Estimating the mean and variance of each explanatory variable, with and without overflow

In [220]:
ouvrage = "4350-01D"
moyenneSumSurverses = Float64[]
moyenneMaxSurverses = Float64[]
moyenneSumBeforeSurverses = Float64[]
moyenneMaxBeforeSurverses = Float64[]
varianceSumSurverses = Float64[];
varianceMaxSurverses = Float64[];
varianceSumBeforeSurverses = Float64[]
varianceMaxBeforeSurverses = Float64[]

moyenneSumNonSurverses = Float64[]
moyenneMaxNonSurverses = Float64[]
moyenneSumBeforeNonSurverses = Float64[]
moyenneMaxBeforeNonSurverses = Float64[]
varianceSumNonSurverses = Float64[]
varianceMaxNonSurverses = Float64[]
varianceMaxBeforeNonSurverses = Float64[]
varianceSumBeforeNonSurverses = Float64[]
for j=1:size(filtervals,1)
    dfSurverse = filter(row -> row.NO_OUVRAGE == filtervals[j]  && row.SURVERSE ==1, surverse_df)
    dfNonSurverse = filter(row -> row.NO_OUVRAGE == filtervals[j]  && row.SURVERSE ==0, surverse_df)


    moyenneSumSurverse = 0;
    moyenneMaxSurverse = 0;
    moyenneSumBeforeSurverse = 0;
    moyenneMaxBeforeSurverse = 0;
    for i=1:size(dfSurverse,1)

        ind = findfirst(pcp_sum[:,:date] .== dfSurverse[i,:DATE])
        moyenneSumSurverse = moyenneSumSurverse +  pcp_sum[ind,:McTavish]
        
        indmax = findfirst(pcp_max[:,:date] .== dfSurverse[i,:DATE])
        moyenneMaxSurverse = moyenneMaxSurverse +  pcp_max[indmax,:McTavish]
        
        indx = findfirst(pcp_sum[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_sum[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_sum[indx,:McTavish]
        else
            sumBeforeToadd = pcp_sum[indx,:McTavish]
        end
        moyenneSumBeforeSurverse = moyenneSumBeforeSurverse +  sumBeforeToadd
        
        indx = findfirst(pcp_max[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_max[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_max[indx,:McTavish]
        else
            sumBeforeToadd = pcp_max[indx,:McTavish]
        end
        moyenneMaxBeforeSurverse = moyenneMaxBeforeSurverse +  sumBeforeToadd
    end
    moyenneSumSurverse = moyenneSumSurverse / size(dfSurverse,1)
    push!(moyenneSumSurverses, moyenneSumSurverse)
    moyenneMaxSurverse = moyenneMaxSurverse / size(dfSurverse,1)
    push!(moyenneMaxSurverses, moyenneMaxSurverse)
    moyenneSumBeforeSurverse = moyenneSumBeforeSurverse / size(dfSurverse,1)
    push!(moyenneSumBeforeSurverses, moyenneSumBeforeSurverse)
    moyenneMaxBeforeSurverse = moyenneMaxBeforeSurverse / size(dfSurverse,1)
    push!(moyenneMaxBeforeSurverses, moyenneMaxBeforeSurverse)
    
    moyenneSumNonSurverse = 0;
    moyenneMaxNonSurverse = 0;
    moyenneSumBeforeNonSurverse = 0;
    moyenneMaxBeforeNonSurverse = 0;
    for i=1:size(dfNonSurverse,1)

        ind = findfirst(pcp_sum[:,:date] .== dfNonSurverse[i,:DATE])
        moyenneSumNonSurverse = moyenneSumNonSurverse +  pcp_sum[ind,:McTavish]
        
        ind = findfirst(pcp_max[:,:date] .== dfNonSurverse[i,:DATE])
        moyenneMaxNonSurverse = moyenneMaxNonSurverse +  pcp_max[ind,:McTavish]
        
        indx = findfirst(pcp_sum[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_sum[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_sum[indx,:McTavish]
        else
            sumBeforeToadd = pcp_sum[indx,:McTavish]
        end
        moyenneSumBeforeNonSurverse = moyenneSumBeforeNonSurverse +  sumBeforeToadd
        
        indx = findfirst(pcp_max[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_max[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_max[indx,:McTavish]
        else
            sumBeforeToadd = pcp_max[indx,:McTavish]
        end
        moyenneMaxBeforeNonSurverse = moyenneMaxBeforeNonSurverse +  sumBeforeToadd
    end
    moyenneSumNonSurverse = moyenneSumNonSurverse / size(dfNonSurverse,1)
    push!(moyenneSumNonSurverses, moyenneSumNonSurverse)
    moyenneMaxNonSurverse = moyenneMaxNonSurverse / size(dfNonSurverse,1)
    push!(moyenneMaxNonSurverses, moyenneMaxNonSurverse)
    
    moyenneSumBeforeNonSurverse = moyenneSumBeforeNonSurverse / size(dfNonSurverse,1)
    push!(moyenneSumBeforeNonSurverses, moyenneSumBeforeNonSurverse)
    moyenneMaxBeforeNonSurverse = moyenneMaxBeforeNonSurverse / size(dfNonSurverse,1)
    push!(moyenneMaxBeforeNonSurverses, moyenneMaxBeforeNonSurverse)
    
    
    # Now the variance
    varianceSumSurverse = 0;
    varianceMaxSurverse = 0;
    varianceSumBeforeSurverse = 0;
    varianceMaxBeforeSurverse = 0;
    for i=1:size(dfSurverse,1)

        ind = findfirst(pcp_sum[:,:date] .== dfSurverse[i,:DATE])
        varianceSumSurverse = varianceSumSurverse +  (pcp_sum[ind,:McTavish]-moyenneSumSurverse)^2

        ind = findfirst(pcp_max[:,:date] .== dfSurverse[i,:DATE])
        varianceMaxSurverse = varianceMaxSurverse +  (pcp_max[ind,:McTavish]-moyenneMaxSurverse)^2
        
         indx = findfirst(pcp_sum[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_sum[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_sum[indx,:McTavish]
        else
            sumBeforeToadd = pcp_sum[indx,:McTavish] 
        end
        varianceSumBeforeSurverse = varianceSumBeforeSurverse +  (sumBeforeToadd - moyenneSumBeforeSurverse)^2
        
        indx = findfirst(pcp_max[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_max[:,:date] .== (dfSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_max[indx,:McTavish]
        else
            sumBeforeToadd = pcp_max[indx,:McTavish]
        end
        varianceMaxBeforeSurverse = varianceMaxBeforeSurverse +  (sumBeforeToadd-moyenneMaxBeforeSurverse)^2
    end
    varianceSumSurverse = varianceSumSurverse / size(dfSurverse,1)
    push!(varianceSumSurverses, varianceSumSurverse)
    varianceMaxSurverse = varianceMaxSurverse / size(dfSurverse,1)
    push!(varianceMaxSurverses, varianceMaxSurverse)

    varianceSumBeforeSurverse = varianceSumBeforeSurverse / size(dfSurverse,1)
    push!(varianceSumBeforeSurverses, varianceSumBeforeSurverse)
    varianceMaxBeforeSurverse = varianceMaxBeforeSurverse / size(dfSurverse,1)
    push!(varianceMaxBeforeSurverses,varianceMaxBeforeSurverse)
    
    varianceSumNonSurverse = 0;
    varianceMaxNonSurverse = 0;
    varianceSumBeforeNonSurverse = 0;
    varianceMaxBeforeNonSurverse = 0;
    for i=1:size(dfNonSurverse,1)

        ind = findfirst(pcp_sum[:,:date] .== dfNonSurverse[i,:DATE])
        varianceSumNonSurverse = varianceSumNonSurverse +  (pcp_sum[ind,:McTavish]-moyenneSumNonSurverse)^2

        ind = findfirst(pcp_max[:,:date] .== dfNonSurverse[i,:DATE])
        varianceMaxNonSurverse = varianceMaxNonSurverse +  (pcp_max[ind,:McTavish]-moyenneMaxNonSurverse)^2
        
        indx = findfirst(pcp_sum[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_sum[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_sum[indx,:McTavish]
        else
            sumBeforeToadd = pcp_sum[indx,:McTavish] 
        end
        varianceSumBeforeNonSurverse = varianceSumBeforeNonSurverse +  (sumBeforeToadd - moyenneSumBeforeNonSurverse)^2
        
        indx = findfirst(pcp_max[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
        sumBeforeToadd = 0;
        if indx == nothing
            indx = findfirst(pcpBefore_max[:,:date] .== (dfNonSurverse[i,:DATE]-Dates.Day(1)))
            sumBeforeToadd = pcpBefore_max[indx,:McTavish]
        else
            sumBeforeToadd = pcp_max[indx,:McTavish]
        end
        varianceMaxBeforeNonSurverse = varianceMaxBeforeNonSurverse +  (sumBeforeToadd-moyenneMaxBeforeNonSurverse)^2
    end
    varianceSumNonSurverse = varianceSumNonSurverse / size(dfNonSurverse,1)
    push!(varianceSumNonSurverses, varianceSumNonSurverse)
    varianceMaxNonSurverse = varianceMaxNonSurverse / size(dfNonSurverse,1)
    push!(varianceMaxNonSurverses, varianceMaxNonSurverse) 
    
    varianceSumBeforeNonSurverse = varianceSumBeforeNonSurverse / size(dfNonSurverse,1)
    push!(varianceSumBeforeNonSurverses, varianceSumBeforeNonSurverse)
    varianceMaxBeforeNonSurverse = varianceMaxBeforeNonSurverse / size(dfNonSurverse,1)
    push!(varianceMaxBeforeNonSurverses,varianceMaxBeforeNonSurverse)
    
   
end
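
The quantities accumulated by the loops above are class-conditional means and variances: one pair for the overflow days and one for the other days. For a single variable and a single structure they can be sketched with the `Statistics` standard library. The data below are hypothetical and the names `precip`, `overflow`, `mu1`, `var1`, `mu0` and `var0` are illustrative.

```julia
using Statistics

# Hypothetical daily totals (0.1 mm) and overflow indicators for one structure.
precip   = [0.0, 154.0, 12.0, 0.0, 80.0, 3.0]
overflow = [0,   1,     0,    0,   1,    0]

x1 = precip[overflow .== 1]   # days with an overflow
x0 = precip[overflow .== 0]   # days without an overflow

# corrected=false divides by n rather than n - 1, as the explicit loops do.
mu1, var1 = mean(x1), var(x1; corrected=false)
mu0, var0 = mean(x0), var(x0; corrected=false)
```

A variance of zero, which occurs when a class contains a single distinct value, would make a normal density undefined, so it is worth checking before using these estimates.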

#### Plotting the distributions of total precipitation with and without overflow

The two distributions are very different. This suggests that the total precipitation at the McTavish station has an effect on overflows at Bota-Bota.

UndefVarError: UndefVarError: df not defined

#### Plotting the distributions of the daily maximum precipitation with and without overflow

The two distributions are very different. This suggests that the daily maximum precipitation at the McTavish station has an effect on overflows at Bota-Bota.

UndefVarError: UndefVarError: df not defined

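# Creating the prediction file to submit to Kaggle

Here, the prediction comes from a Gaussian naive Bayes classifier. For each overflow structure, the daily total and the daily maximum of precipitation, on the day itself and on the previous day, are assumed to be independent and normally distributed given the overflow status. The prediction is derived from the resulting posterior probability of an overflow.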
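
A minimal sketch of such a posterior computation is given below, assuming normal class-conditional densities and working with logarithms to avoid numerical underflow when several densities are multiplied. The function names `lognormpdf` and `posterior_overflow` and all parameter values are hypothetical.

```julia
# Log-density of a normal distribution with mean m and variance v.
lognormpdf(x, m, v) = -0.5 * log(2 * π * v) - (x - m)^2 / (2 * v)

# Posterior probability of an overflow given the feature vector x, under the
# naive assumption that features are independent within each class.
function posterior_overflow(x, m1, v1, m0, v0, prior1)
    l1 = log(prior1)     + sum(lognormpdf.(x, m1, v1))   # overflow class
    l0 = log(1 - prior1) + sum(lognormpdf.(x, m0, v0))   # no-overflow class
    top = max(l1, l0)                                    # shift before exponentiating
    return exp(l1 - top) / (exp(l1 - top) + exp(l0 - top))
end

# Hypothetical parameters for two features (daily total, daily maximum).
p = posterior_overflow([150.0, 50.0], [120.0, 40.0], [900.0, 100.0],
                       [5.0, 2.0], [100.0, 16.0], 0.1)
```

By construction the result lies between 0 and 1, which is a quick sanity check for any implementation of the same rule.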

In [223]:

testfile = CSV.read("data/test.csv")
first(testfile,5)

Unnamed: 0_level_0,NO_OUVRAGE,DATE
Unnamed: 0_level_1,String,Date
1,3260-01D,2019-05-02
2,3260-01D,2019-05-09
3,3260-01D,2019-05-10
4,3260-01D,2019-05-15
5,3260-01D,2019-05-20


In [225]:
surverse = Int[]

# Density of a normal distribution with mean m and variance v
normpdf(x, m, v) = exp(-(x - m)^2 / (2 * v)) / sqrt(2 * π * v)

# Value of :McTavish on date d; April dates are absent from df and are read from dfBefore
function lookup(df, dfBefore, d)
    ind = findfirst(df[:,:date] .== d)
    ind === nothing || return df[ind,:McTavish]
    return dfBefore[findfirst(dfBefore[:,:date] .== d),:McTavish]
end

for i=1:size(testfile,1)
    indproba = findfirst(filtervals .== testfile[i,:NO_OUVRAGE])
    d = testfile[i,:DATE]
    dprev = d - Dates.Day(1)

    # Explanatory variables: daily total and daily maximum, same day and previous day
    xSum = lookup(pcp_sum, pcpBefore_sum, d)
    xMax = lookup(pcp_max, pcpBefore_max, d)
    xSumBefore = lookup(pcp_sum, pcpBefore_sum, dprev)
    xMaxBefore = lookup(pcp_max, pcpBefore_max, dprev)

    # Prior probabilities estimated from the observed frequencies
    Psurverse = n₁[indproba]/n[indproba]
    Pnonsurverse = n₀[indproba]/n[indproba]

    # Likelihood of the observations given an overflow (features assumed independent)
    pxSsurverse = normpdf(xSum, moyenneSumSurverses[indproba], varianceSumSurverses[indproba]) *
        normpdf(xMax, moyenneMaxSurverses[indproba], varianceMaxSurverses[indproba]) *
        normpdf(xSumBefore, moyenneSumBeforeSurverses[indproba], varianceSumBeforeSurverses[indproba]) *
        normpdf(xMaxBefore, moyenneMaxBeforeSurverses[indproba], varianceMaxBeforeSurverses[indproba])

    # Likelihood of the observations given no overflow
    pxSnonsurverse = normpdf(xSum, moyenneSumNonSurverses[indproba], varianceSumNonSurverses[indproba]) *
        normpdf(xMax, moyenneMaxNonSurverses[indproba], varianceMaxNonSurverses[indproba]) *
        normpdf(xSumBefore, moyenneSumBeforeNonSurverses[indproba], varianceSumBeforeNonSurverses[indproba]) *
        normpdf(xMaxBefore, moyenneMaxBeforeNonSurverses[indproba], varianceMaxBeforeNonSurverses[indproba])

    # Posterior probability of an overflow (Bayes' rule)
    psurverse = (pxSsurverse * Psurverse)/(pxSsurverse * Psurverse + pxSnonsurverse*Pnonsurverse)
    push!(surverse, psurverse > 0.5)
end
# Each row of the test file, made of a structure and a date, requires a prediction.
# An overflow is predicted when its posterior probability exceeds 1/2.

# Creation of the result3.csv file to submit to Kaggle
ID = testfile[:,:NO_OUVRAGE].*"_".*string.(testfile[:,:DATE])
sampleSubmission = DataFrame(ID = ID, Surverse=surverse)
CSV.write("result3.csv",sampleSubmission)

# You can then upload the file result3.csv to Kaggle.
ID

283-element Array{String,1}:
 "3260-01D_2019-05-02"
 "3260-01D_2019-05-09"
 "3260-01D_2019-05-10"
 "3260-01D_2019-05-15"
 "3260-01D_2019-05-20"
 "3260-01D_2019-05-23"
 "3260-01D_2019-05-24"
 "3260-01D_2019-05-26"
 "3260-01D_2019-05-30"
 "3350-07D_2019-05-01"
 "3350-07D_2019-05-02"
 "3350-07D_2019-05-08"
 "3350-07D_2019-05-09"
 ⋮                    
 "4380-01D_2019-09-01"
 "4380-01D_2019-09-02"
 "4380-01D_2019-09-04"
 "4380-01D_2019-09-05"
 "4380-01D_2019-09-12"
 "4380-01D_2019-09-13"
 "4380-01D_2019-09-16"
 "4380-01D_2019-09-22"
 "4380-01D_2019-09-26"
 "4380-01D_2019-09-28"
 "4380-01D_2019-09-29"
 "4380-01D_2019-09-30"