## Notebook 4
This notebook looks at proportions. The dataset contains several count response variables, so we can study the ratios between them: the ratio of male to female mosquitoes, and the ratio of coluzzii females to gambiae females (when species information is available).

In the previous notebooks we used Poisson GLMs to model counts. For proportions, where a pair of counts forms the response, we use binomial GLMs instead. One aim is to see whether the proportion remains constant over time and space.

The analysis is restricted to wet season data (May to October).
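As a minimal sketch of what a binomial GLM on paired counts looks like, the following uses simulated data rather than the study data. The names `toy`, `f`, `m` and `group`, and the assumed proportions, are hypothetical. The response is a two-column matrix of successes and failures built with `cbind()`, the same form used by the models in this notebook.

```r
# Simulated paired counts; NOT the study data
set.seed(1)
n_visits <- 200
toy <- data.frame(group = factor(rep(c("A", "B"), each = n_visits / 2)))
total <- rpois(n_visits, lambda = 20) + 1        # mosquitoes per visit (at least 1)
p_true <- ifelse(toy$group == "A", 0.85, 0.75)   # assumed female proportions
toy$f <- rbinom(n_visits, size = total, prob = p_true)
toy$m <- total - toy$f

# Binomial GLM: the response is cbind(successes, failures)
fit <- glm(cbind(f, m) ~ group, data = toy, family = binomial)

# Back-transform from the logit scale to fitted proportions per group
new_groups <- data.frame(group = factor(c("A", "B"), levels = levels(toy$group)))
p_hat <- predict(fit, newdata = new_groups, type = "response")
print(round(p_hat, 3))
```

With a single categorical predictor and no random effects, the fitted proportion for each group equals that group's pooled proportion (total females over total mosquitoes), which is a handy sanity check.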

In [21]:
# LOAD R PACKAGES
require(compiler)
enableJIT(3)
setMKLthreads(22)
require(lme4)
require(lmerTest)
require(MASS)

# LOAD DATASET dat1
#setwd('variance/Florian')
load('Per house data PSC 2012 to 2019 polish2.RData')
load('BF_weather.RData')
ls()


Number of threads at maximum: no change has been made.



In [22]:
# SUBSET
dat1<-dat1[dat1$month.assigned %in% 5:10,]
# TRANSFORM num.persons
dat1$persons.status<-dat1$num.persons
dat1$persons.status[dat1$num.persons>3]<-'Hi'
dat1$persons.status[dat1$num.persons<=3]<-'Low'
dat1$persons.status[dat1$num.persons==0]<-'None'
dat1$persons.status<-factor(dat1$persons.status, levels=c('None', 'Low', 'Hi'))
dim(dat1)

### male vs female
We can use the whole dataset to analyse the ratio of males to females collected during a house visit. Let us start with some descriptive statistics on the proportion of females:

In [23]:
# DESCRIPTIVE STATS ON FEMALE PROPORTION
female.proportion<-dat1$count.f/(dat1$count.f+dat1$count.m)
cat('female proportion by village: ')
by(female.proportion, dat1$village, mean, na.rm=TRUE)
cat('female proportion by month: ')
by(female.proportion, dat1$month.assigned, mean, na.rm=TRUE)
cat(paste(c('Total female mosquitoes caught: ', sum(dat1$count.f, na.rm=TRUE), '\n')))
cat(paste(c('Total male mosquitoes caught: ', sum(dat1$count.m, na.rm=TRUE), '\n')))

female proportion by village: 

dat1$village: Bana market
[1] 0.8616874
------------------------------------------------------------ 
dat1$village: Bana village
[1] 0.8347233
------------------------------------------------------------ 
dat1$village: Pala
[1] 0.7714175
------------------------------------------------------------ 
dat1$village: Souroukoudingan
[1] 0.7992017

female proportion by month: 

dat1$month.assigned: 1
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 2
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 3
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 4
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 5
[1] 0.8288299
------------------------------------------------------------ 
dat1$month.assigned: 6
[1] 0.803896
------------------------------------------------------------ 
dat1$month.assigned: 7
[1] 0.8376873
------------------------------------------------------------ 
dat1$month.assigned: 8
[1] 0.8215945
------------------------------------------------------------ 
dat1$month.assigned: 9
[1] 0.8577249
------------------------------------------------------------ 
dat1$month.assigned: 10
[1] 0.7764326
------------------------------------------------------------ 
dat1$month.assigned: 11
[1] NA
-------

Total female mosquitoes caught:  25524 
Total male mosquitoes caught:  6483 


As expected, we collected more females than males. The overall female proportion is about 80% (25524 of 32007 mosquitoes). The proportion is higher in some villages and in some months, so we need to build models to test whether these differences are statistically significant.
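The pooled female proportion can be checked directly from the two totals printed above. The sketch below also uses a small set of hypothetical counts (`f`, `m`) to show why the pooled proportion generally differs from the unweighted mean of per-visit proportions that `by(..., mean)` reports: visits with few mosquitoes carry the same weight as visits with many.

```r
# Totals copied from the printed output above
total_f <- 25524
total_m <- 6483
pooled <- total_f / (total_f + total_m)
print(round(pooled, 4))

# Hypothetical counts for two visits, to contrast the two summaries
f <- c(1, 90)
m <- c(1, 10)
mean_of_props <- mean(f / (f + m))    # each visit weighted equally
pooled_toy <- sum(f) / sum(f + m)     # each mosquito weighted equally
print(c(mean_of_props = mean_of_props, pooled = pooled_toy))
```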

In [24]:
# MODEL 1. VILLAGE PLUS RANDOM EFFECTS
overdispersion<-1:nrow(dat1)
m_1<-glmer(cbind(count.f, count.m)~(1|site.id)+(1|year.assigned)+(1|overdispersion)+village, 
            data=dat1, family='binomial', 
            control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_1)

boundary (singular) fit: see ?isSingular


Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(count.f, count.m) ~ (1 | site.id) + (1 | year.assigned) +  
    (1 | overdispersion) + village
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  5627.6   5665.3  -2806.8   5613.6     1607 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.06234 -0.21984  0.03681  0.50223  1.32506 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 1.2971   1.1389  
 site.id        (Intercept) 0.4881   0.6987  
 year.assigned  (Intercept) 0.0000   0.0000  
Number of obs: 1614, groups:  
overdispersion, 1614; site.id, 284; year.assigned, 6

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)              2.3183     0.2358   9.834  < 2e-16 ***
villageBana village     -0.2600     0.2612 

There is a singularity warning. It is caused by the zero variance estimate for the yearly effect (year.assigned). Let us remove that term and refit the model:

In [16]:
# MODEL 1, REFIT, REMOVE YEARLY EFFECT
overdispersion<-1:nrow(dat1)
m_1<-glmer(cbind(count.f, count.m)~(1|site.id)+(1|overdispersion)+village, 
            data=dat1, family='binomial', 
            control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_1)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(count.f, count.m) ~ (1 | site.id) + (1 | overdispersion) +  
    village
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  5625.6   5657.9  -2806.8   5613.6     1608 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.06234 -0.21984  0.03681  0.50223  1.32506 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 1.2971   1.1389  
 site.id        (Intercept) 0.4881   0.6987  
Number of obs: 1614, groups:  overdispersion, 1614; site.id, 284

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)              2.3183     0.2358   9.834  < 2e-16 ***
villageBana village     -0.2600     0.2612  -0.996  0.31942    
villagePala             -0.7535     0.2679  -2.813  0.00492 ** 
vi

The warning message is now gone, and all parameter estimates and the maximised log-likelihood are the same as before. This is expected: a random effect with an estimated variance of zero contributes nothing to the fit. It will be interesting to understand why the yearly effect gets "cancelled out" when working on ratios.
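The two summaries can be cross-checked with a little arithmetic. The sketch below assumes the usual definition AIC = 2k - 2 logLik and takes the number of estimated parameters k as the number of observations minus the residual degrees of freedom; all numbers are copied from the printed outputs, and the variable names are only illustrative.

```r
# Values copied from the two summary() outputs above
n_obs <- 1614
loglik <- -2806.8               # identical in both fits
df_resid_with_year <- 1607
df_resid_without_year <- 1608

# Number of estimated parameters in each model
k_with <- n_obs - df_resid_with_year
k_without <- n_obs - df_resid_without_year

# AIC = 2k - 2 logLik
aic_with <- 2 * k_with - 2 * loglik
aic_without <- 2 * k_without - 2 * loglik
print(c(with_year = aic_with, without_year = aic_without))
```

The log-likelihood is unchanged, so the AIC difference of 2 comes entirely from dropping one parameter, the yearly variance.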

Model 2 includes month.assigned as well: 

In [17]:
# MODEL 2. VILLAGE AND MONTH
m_2<-glmer(cbind(count.f, count.m)~(1|site.id)+(1|overdispersion)+village+month.assigned, 
            data=dat1, family='binomial', 
            control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_2, correlation=FALSE)

Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(count.f, count.m) ~ (1 | site.id) + (1 | overdispersion) +  
    village + month.assigned
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  5609.0   5668.2  -2793.5   5587.0     1603 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.16846 -0.21526  0.04689  0.50296  1.32398 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 1.2598   1.1224  
 site.id        (Intercept) 0.4924   0.7017  
Number of obs: 1614, groups:  overdispersion, 1614; site.id, 284

Fixed effects:
                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)              2.1694     0.2624   8.269  < 2e-16 ***
villageBana village     -0.2741     0.2616  -1.048  0.29474    
villagePala             -0.7160     0.2688  -2.66
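Because Model 2 nests the refitted Model 1 and both use the same 1614 observations, the month effect can be assessed with a likelihood ratio test. The sketch below works only from the deviances and residual degrees of freedom printed in the two summaries; the variable names are illustrative. With the fitted objects at hand, `anova()` on the two models would be the more direct route.

```r
# Values copied from the summaries of Model 1 (refit) and Model 2
dev_m1 <- 5613.6
dev_m2 <- 5587.0
df_resid_m1 <- 1608
df_resid_m2 <- 1603

lrt_stat <- dev_m1 - dev_m2           # drop in deviance from adding month
lrt_df <- df_resid_m1 - df_resid_m2   # extra parameters used by month.assigned

# Upper-tail chi-square probability for the test statistic
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
print(c(statistic = lrt_stat, df = lrt_df, p = p_value))
```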

In [25]:
# MODEL 3. THROW EVERYTHING IN.
m_3a<-glmer(cbind(count.f, count.m)~(1|site.id)+(1|overdispersion)+village*month.assigned
            +persons.status+mosquito.net, 
            data=dat1, family='binomial', 
            control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_3a, correlation=FALSE)


Correlation matrix not shown by default, as p = 27 > 12.
Use print(obj, correlation=TRUE)  or
    vcov(obj)        if you need it



Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(count.f, count.m) ~ (1 | site.id) + (1 | overdispersion) +  
    village * month.assigned + persons.status + mosquito.net
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  5462.6   5618.2  -2702.3   5404.6     1552 

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.34421 -0.21388  0.05654  0.51737  1.37152 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 1.1425   1.0689  
 site.id        (Intercept) 0.4963   0.7045  
Number of obs: 1581, groups:  overdispersion, 1581; site.id, 278

Fixed effects:
                                        Estimate Std. Error z value Pr(>|z|)   
(Intercept)                              1.57746    0.74735   2.111   0.0348 * 
villageBana village                     -0.23448 
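Model 3 is fitted to 1581 observations, whereas Model 2 uses 1614, presumably because rows with missing covariate values are dropped. AIC values are only comparable between models fitted to the same rows. A minimal sketch of one way to enforce this, using simulated data and plain `glm()` (the names `toy`, `x` and `z` are hypothetical):

```r
# Simulated data with a partly missing covariate; NOT the study data
set.seed(2)
toy <- data.frame(
  f = rbinom(50, size = 20, prob = 0.8),
  x = rnorm(50),
  z = rnorm(50)
)
toy$m <- 20 - toy$f
toy$z[1:5] <- NA   # covariate with missing values

# Keep only rows that are complete for every variable used by ANY model
vars <- c("f", "m", "x", "z")
toy_cc <- toy[complete.cases(toy[, vars]), ]

# Both models now see exactly the same rows
fit_small <- glm(cbind(f, m) ~ x, data = toy_cc, family = binomial)
fit_large <- glm(cbind(f, m) ~ x + z, data = toy_cc, family = binomial)
print(AIC(fit_small, fit_large))
```

The same idea would apply here by refitting Model 2 on the rows that are complete for `persons.status` and `mosquito.net` before comparing it with Model 3.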

### coluzzii vs gambiae

In [7]:
# SUBSET (2017-2019)
# NOTE: the cells below still use dat1, not dat2
dat2<-dat1[dat1$year.assigned %in% 2017:2019,]
dim(dat2)

In [8]:
# DESCRIPTIVE STATS ON THE PROPORTIONS (VERY ROUGH)
col.f.proportion<-dat1$col.f/(dat1$col.f+dat1$gam.f)
by(col.f.proportion, dat1$village, mean, na.rm=TRUE)
by(col.f.proportion, dat1$month.assigned, mean, na.rm=TRUE)

dat1$village: Bana market
[1] 0.9157716
------------------------------------------------------------ 
dat1$village: Bana village
[1] 0.9172278
------------------------------------------------------------ 
dat1$village: Pala
[1] 0.2774607
------------------------------------------------------------ 
dat1$village: Souroukoudingan
[1] 0.6235339

dat1$month.assigned: 1
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 2
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 3
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 4
[1] NA
------------------------------------------------------------ 
dat1$month.assigned: 5
[1] 0.9792023
------------------------------------------------------------ 
dat1$month.assigned: 6
[1] 0.6047952
------------------------------------------------------------ 
dat1$month.assigned: 7
[1] 0.8558036
------------------------------------------------------------ 
dat1$month.assigned: 8
[1] 0.6601493
------------------------------------------------------------ 
dat1$month.assigned: 9
[1] 0.9415679
------------------------------------------------------------ 
dat1$month.assigned: 10
[1] 0.717817
------------------------------------------------------------ 
dat1$month.assigned: 11
[1] NA
-------

In [9]:
overdispersion<-1:nrow(dat1)
m_0<-glmer(cbind(col.f, gam.f)~(1|overdispersion)+village*month.assigned, 
         data=dat1, family='binomial', 
         control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_0)

fixed-effect model matrix is rank deficient so dropping 6 columns / coefficients

Correlation matrix not shown by default, as p = 18 > 12.
Use print(obj, correlation=TRUE)  or
    vcov(obj)        if you need it



Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: cbind(col.f, gam.f) ~ (1 | overdispersion) + village * month.assigned
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  1252.0   1337.6   -607.0   1214.0      648 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.7353 -0.3886  0.2219  0.5028  2.3862 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 0.5168   0.7189  
Number of obs: 667, groups:  overdispersion, 667

Fixed effects:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                             3.99488    0.42990   9.293  < 2e-16 ***
villageBana village                     1.07057    0.84117   1.273  0.20312    
villagePala                            -3.34558    0.40588  -8.243  < 2e-16 ***
villageSouroukoudin

In [10]:
m_1<-glmer(cbind(col.f, gam.f)~(1|overdispersion)+village*month.assigned+mosquito.net, 
         data=dat1, family='binomial', 
         control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_1)

fixed-effect model matrix is rank deficient so dropping 6 columns / coefficients

Correlation matrix not shown by default, as p = 19 > 12.
Use print(obj, correlation=TRUE)  or
    vcov(obj)        if you need it



Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cbind(col.f, gam.f) ~ (1 | overdispersion) + village * month.assigned +  
    mosquito.net
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  1231.6   1321.4   -595.8   1191.6      639 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.6993 -0.3741  0.2246  0.4806  2.3666 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 0.5422   0.7363  
Number of obs: 659, groups:  overdispersion, 659

Fixed effects:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                             3.98691    0.48806   8.169 3.11e-16 ***
villageBana village                     1.07029    0.84318   1.269  0.20432    
villagePala                            -3.49189    0.43127  -8.097 5.64e-16 *

In [26]:
m_2<-glmer(cbind(col.f, gam.f)~(1|overdispersion)+village*month.assigned+persons.status, 
         data=dat1, family='binomial', 
         control=glmerControl(optimizer="bobyqa",optCtrl=list(maxfun=2e5)))
summary(m_2)

fixed-effect model matrix is rank deficient so dropping 6 columns / coefficients

Correlation matrix not shown by default, as p = 20 > 12.
Use print(obj, correlation=TRUE)  or
    vcov(obj)        if you need it



Generalized linear mixed model fit by maximum likelihood (Laplace
  Approximation) [glmerMod]
 Family: binomial  ( logit )
Formula: 
cbind(col.f, gam.f) ~ (1 | overdispersion) + village * month.assigned +  
    persons.status
   Data: dat1
Control: glmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 2e+05))

     AIC      BIC   logLik deviance df.resid 
  1228.3   1322.4   -593.2   1186.3      632 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-5.7369 -0.3861  0.2182  0.4784  2.3890 

Random effects:
 Groups         Name        Variance Std.Dev.
 overdispersion (Intercept) 0.53     0.728   
Number of obs: 653, groups:  overdispersion, 653

Fixed effects:
                                       Estimate Std. Error z value Pr(>|z|)    
(Intercept)                              3.5790     0.8641   4.142 3.44e-05 ***
villageBana village                      1.0743     0.8424   1.275 0.202239    
villagePala                             -3.4854     0.4277  -8.148 3.69e-16

Neither the number of persons nor mosquito net use appears to affect the ratio of coluzzii to gambiae. The ratio does differ between month x village combinations.

Note that we did not genotype all the mosquitoes we collected. It is worth thinking about how to combine the two models (a Poisson GLM for the combined counts and a binomial GLM for the proportion) for power analysis.

One simple solution is to use the implied coluzzii and gambiae counts (count.f * proportion). This approximation works well when genotyped.f is close to count.f.
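A minimal sketch of the implied-count calculation, using hypothetical values. The column names follow the notebook (`count.f`, `genotyped.f`, `col.f`, `gam.f`), but the numbers are made up, and the sketch assumes every genotyped female is either coluzzii or gambiae.

```r
# Hypothetical per-visit data; NOT the study data
toy <- data.frame(
  count.f     = c(40, 12, 7),   # females collected
  genotyped.f = c(10, 12, 5),   # females genotyped
  col.f       = c(8, 9, 1)      # genotyped females identified as coluzzii
)
toy$gam.f <- toy$genotyped.f - toy$col.f   # assumption: the rest are gambiae

# Observed coluzzii proportion among genotyped females
prop_col <- toy$col.f / (toy$col.f + toy$gam.f)

# Implied counts: scale the proportion up to all collected females
toy$implied.col <- toy$count.f * prop_col
toy$implied.gam <- toy$count.f * (1 - prop_col)

# Fraction genotyped: the closer to 1, the less extrapolation is involved
toy$frac.genotyped <- toy$genotyped.f / toy$count.f
print(toy)
```

Implied counts are generally not integers, so they would need rounding or a different likelihood before being fed into a Poisson model.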

Another method is to build a more complex hierarchical model. The observed female counts follow a Poisson distribution, and the observed female coluzzii count follows a hypergeometric distribution (sampling without replacement from the collected females). We can then build a probabilistic model for how many coluzzii there are in the sampled pool.
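The data-generating process behind such a hierarchical model can be sketched as a simulation. Everything numeric here is an assumption for illustration (`lambda`, `p_col`, and a cap `max_genotyped` on how many females are genotyped per visit), as is the binomial split of the collected females into species, which the text above does not specify.

```r
# Simulation sketch with assumed parameter values; NOT fitted to the study data
set.seed(3)
n_visits <- 1000
lambda <- 15          # assumed mean female count per visit
p_col <- 0.7          # assumed true coluzzii proportion
max_genotyped <- 10   # assumed cap on genotyped females per visit

# Level 1: females collected per visit
count_f <- rpois(n_visits, lambda)

# Level 2 (assumption): species split among the collected females
true_col <- rbinom(n_visits, size = count_f, prob = p_col)

# Level 3: genotype a subset without replacement
genotyped <- pmin(count_f, max_genotyped)
# rhyper(nn, m, n, k): m = coluzzii in pool, n = others in pool, k = number drawn
obs_col <- rhyper(n_visits, m = true_col, n = count_f - true_col, k = genotyped)
obs_gam <- genotyped - obs_col

# Pooled estimate of the coluzzii proportion from the genotyped subset
est <- sum(obs_col) / sum(genotyped)
print(round(est, 3))
```

Running the simulation repeatedly under different parameter values is one way to approach the power analysis mentioned above.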

### male vs female