# Case study 2: How many scallop dredges do we need to sample to make sure we can detect a biologically important reduction in flatfish bycatch?


## Background
One feature that consistently emerges is that real world data are often less well-behaved than those in the studies that make it to publication. In the previous example of finfish and Nephrops separation we would seem to have a fair chance of detecting effects, if they were present.  

On to a real-world example. We will be conducting trials this year (2022) to test whether Pisces can reduce the numbers of flatfish - primarily Yellowtail flounder - caught in scallop dredges in the Northwest Atlantic. In the trial design section I illustrated some of the issues and considerations we evaluate when designing a trial. An important piece of the trial design puzzle is simulating data to ascertain how much *sampling effort* we will need to detect an effect - if it is indeed present. When preliminary data are available, there is no excuse not to use them to make informed decisions about the trial conduct.  


````{margin}
```{attention}
I cannot stress enough the importance of either getting pilot data from some source, or simulating data using realistic published parameter estimates.  
Trials are expensive. We don't want to get to the end of a study with an equivocal result simply because we planned the 'ideal' study from an armchair, without any reality checks. 
```
````

The data set available for this trial comes from the Northeast Fisheries Observer Program (NEFOP). Fortuitously, one of their data sets is of flatfish bycatch on scallop dredges in the same closed (rather paradoxically called 'Access') area of Georges Bank that we will be working in.

````{list-table}
*  - ```{figure} scallop_dredge.jpg
       :height: 300px
       :align: center
     ```
* - ```{figure} dupaul_team.jpg
       :height: 300px
       :align: center
     ```
````


## Initial exploration
The data were provided in a refreshingly nice tidy flat CSV file - checked, no data errors. The data are the weights of flatfish in each dredge, with a measure of dredge duration. We will need to read these in and convert to Catch Per Unit Effort (CPUE) to standardize the weight to a common unit. In this case, pounds (yes - pounds!) of fish per hour. Once we do this housekeeping, it's onto the important stuff.  
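As a minimal sketch of that housekeeping, the conversion is a single division. The column names (`weight_lbs`, `duration_min`) and the assumption that durations are recorded in minutes are hypothetical and may not match the real file; the toy data frame stands in for the CSV.

```R
# Toy stand-in for the observer CSV; column names and units are assumptions
dredge <- data.frame(
  weight_lbs   = c(0, 2.5, 12, 0.8),   # flatfish weight per dredge
  duration_min = c(30, 60, 180, 45)    # dredge duration in minutes
)

# Convert duration to hours, then standardize weight to pounds per hour
dredge$hours <- dredge$duration_min / 60
dredge$cpue  <- dredge$weight_lbs / dredge$hours

dredge
```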

First things first - plot the data.  

### Are there any methodological issues we should be aware of?
I mentioned above that we need to express the catches per unit of effort - in this case, hours. This *standardizes* the catch weights to a common measure. However... it is wise to check that this standardization actually does what we think it's doing. Let's plot the CPUE versus effort. There should be *no relationship*. I use a log scale here to zoom in and stretch out the scales a bit. 
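A check along these lines might look like the sketch below. It uses simulated stand-in data in which CPUE is deliberately made to decline with duration (an assumption built in for illustration, not a property of the NEFOP data), and adds a Spearman rank correlation as a rough numerical companion to the plot.

```R
set.seed(1)

# Simulated stand-in: CPUE declines with dredge duration by construction
n     <- 500
hours <- runif(n, min = 0.2, max = 5)
cpue  <- rlnorm(n, meanlog = 1.5 - 0.5 * log(hours), sdlog = 1)

# If standardization had removed the effort effect, this would be near zero
rho_all <- cor(hours, cpue, method = "spearman")

# The same check on a narrower window of durations (1-3 hours)
keep       <- hours >= 1 & hours <= 3
rho_window <- cor(hours[keep], cpue[keep], method = "spearman")

c(all = rho_all, window = rho_window)

# Log-log scatter; the small offset keeps any zero catches on the plot
plot(hours, cpue + 0.1, log = "xy",
     xlab = "Dredge duration (hours)", ylab = "CPUE + 0.1 (lbs/hour)")
```

A rank correlation is used here because it is insensitive to the heavy right tail; it is a convenience, not a substitute for looking at the plot.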

```{figure} yellowtaileffort1.png
:height: 400px
:align: center
```

Hang on, according to my statistics text there should be no relationship because, well, isn't standardization supposed to remove this relationship? 

Statistics, meet field ecology! We have higher CPUE in the shorter dredges (<30 minutes) and lower CPUE in the longer dredges (>3 hours). There are a couple of possible explanations. Short dredges might have been exploratory drops on hills or suchlike, which might have had higher abundances of Yellowtail. With longer dredges, we might tow over less optimal habitats (shingle, maybe), which could lower the overall average catch. Without knowing exactly what happened on the boat, these are just speculations. However, it would seem wise to limit our exploration to a narrower range of dredge durations. Say... 1-3 hours:  

```{figure} yellowtaileffort2.png
:height: 400px
:align: center
```

That's better. It just so happens that the decision had been made already to stick to 1 hour dredges for the trial. This would seem to be a wise decision.  


### What do the catch distributions look like?
A histogram of the catches is the first port of call. As above I've converted to a log scale, simply because we can't see any detail otherwise:  

```{figure} yellowtailcatch.png
:height: 400px
:align: center
```

````{margin}
```{note}
The more geeky types out there may be thinking to themselves "Hang on! The log of zero is undefined. How did you get it on the graph?"  
Well, my pedantic little friends, I actually plotted log10(x+0.1) for illustrative purposes. But well spotted...
```
````
First things first: There are a whole pile of zeroes. 922 out of 2616 data points to be exact.  
This is (arguably) referred to as zero-inflation - we have more zeroes than might be expected from the theoretical statistical distributions that form the cornerstone of a lot of our analytical methods.  
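To put a number on that, the counts quoted above work out to just over a third of all dredges. The sketch below does the arithmetic and shows the same summary, plus the log10(x + 0.1) offset used for plotting, on a toy vector of catches (made-up values, not the real data).

```R
# Counts quoted in the text
n_zero    <- 922
n_total   <- 2616
prop_zero <- n_zero / n_total   # proportion of dredges with no flatfish
round(prop_zero, 3)

# The same summary from a vector of catches (toy values only)
cpue <- c(0, 0, 1.2, 3.4, 0, 150)
mean(cpue == 0)      # share of zeroes in the toy vector
log10(cpue + 0.1)    # offset so that zeroes can be drawn on a log scale
```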

Second point: Although most of the catches are around the 3 lbs/hour mark, there are a not-inconsiderable number of values up to ~200 lbs/hour. 

Both of these in combination present analytical problems. Zeroes contain less information than a continuous scale of values, and when there are non-zero values, they are close enough to log-normally distributed.  

### Next challenge: come up with a model we can use to simulate some data
This sample distribution suggests a few alternate models. The one I chose to play with was the Tweedie distribution, which can accommodate both the zeroes and the extreme values, and appropriately scale the variance with the mean (very important!).
````{margin}
```{note}
Typically with zero-inflation we have a few models we can apply. They behave similarly so any one approach should put us in the ballpark. Really any one of these could be used:
- **Negative binomial**. Technically the negative binomial is a discrete distribution (counts), however a lot of routines to calculate a negbin model will still run. Although they'll whine and complain about it.
- **Zero-inflated and hurdle models**. These are two-stage models where the zeroes are modeled as a binary process first (is there anything in the sample?), then the >0 samples modeled separately. In this data set, the non-zero values are pretty damned close to log-normal, so we would possibly look at a hurdle log-normal model. 
   - As an aside... Zero inflated models differ from hurdle models in that the *statistical process* generating the non zero values can also conceivably generate zero values. The difference is not as discrete as for hurdles, which is why hurdle models tend to converge better than ZI (zero inflated) models.
- **Tweedie models** are a relatively new addition to the statistical arsenal. They fall within the group of exponential dispersion models, and as the power parameter increases they cover the spectrum from Poisson (power of 1) through compound Poisson-gamma (between 1 and 2) and gamma (2) to inverse Gaussian (3).
```
````
So how do we simulate a sample distribution? Let's consider a normal or bell-shaped distribution. To generate a normal distribution I can write a simple line of code in R and tell it to output 1000 values with (in this case) a mean of 5 and standard deviation of 1:

```R
# 1000 draws from a normal distribution with mean 5 and standard deviation 1
y <- rnorm(1000, mean = 5, sd = 1)
```

The normal distribution is described by two parameters - mean, and standard deviation. To generate a dummy data set, we need some reasonable estimates of these values.  
```{figure} normal.png
:height: 300px
:align: center 
```
Our observed distribution is nowhere near normal, and I have already decided to use a Tweedie distribution. So what do I need to specify for that one? First, I need the Tweedie power parameter (called `xi` in the R `tweedie` package). We can estimate this directly from the data:

```R
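library(tweedie)  # provides tweedie.profile()
# Profile likelihood for the Tweedie power parameter (intercept-only model)
tweedie.profile(access$YTCPUE ~ 1, do.plot = TRUE, verbose = FALSE)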
```

```{figure} tweedie.png
:height: 300px
:align: center
```

The value is ~1.5, so now we can simulate a Tweedie distribution. From the data, the mean catch of Yellowtail was 3.31 lbs/hour. We will use this as our reference or control level. To generate a Tweedie distribution, we need to specify a mean value, a Tweedie power parameter and a dispersion parameter (`phi` in the code below). Why not a standard deviation, I hear you ask? Because the variance/standard deviation changes with the mean: the variance is the dispersion multiplied by the mean raised to the Tweedie power. The Tweedie parameter specifies the relationship of this change.  
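For a power parameter between 1 and 2, a Tweedie variable can be built as a Poisson number of gamma-distributed 'clumps', which is how it produces exact zeroes alongside a long right tail. The base-R sketch below illustrates this construction; `rtweedie_cpg` is a hypothetical helper written for this illustration (not a function from the `tweedie` package), and the dispersion of 7.8 is simply the value used in the simulation code in this chapter.

```R
# Hypothetical helper: Tweedie draws via the compound Poisson-gamma form,
# valid only for 1 < p < 2
rtweedie_cpg <- function(n, mu, phi, p) {
  stopifnot(p > 1, p < 2)
  lambda <- mu^(2 - p) / (phi * (2 - p))   # Poisson rate (number of clumps)
  shape  <- (2 - p) / (p - 1)              # gamma shape for a single clump
  scale  <- phi * (p - 1) * mu^(p - 1)     # gamma scale for a single clump
  N <- rpois(n, lambda)
  # A sum of N gammas with a common scale is Gamma(N * shape, scale);
  # N = 0 means no clumps, hence an exact zero
  ifelse(N > 0, rgamma(n, shape = pmax(N, 1) * shape, scale = scale), 0)
}

set.seed(1)
mu <- 3.3; phi <- 7.8; p <- 1.5
y  <- rtweedie_cpg(1e5, mu = mu, phi = phi, p = p)

# Compare simulated summaries with their theoretical counterparts
lambda <- mu^(2 - p) / (phi * (2 - p))
rbind(simulated = c(mean = mean(y), var = var(y), zeroes = mean(y == 0)),
      theory    = c(mean = mu, var = phi * mu^p, zeroes = exp(-lambda)))
```

Halving `mu` while holding `phi` and `p` fixed changes the variance and the share of zeroes along with the mean, which is why no separate standard deviation is supplied.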

Let's be adventurous. How many replicates do we need to sample to detect a 50% reduction in Yellowtail CPUE? We need to get our coding geek on for this step... 

```R
library(tweedie)  # rtweedie() for simulating Tweedie data
library(statmod)  # tweedie() family for glm()

# Function to make our data with defined parameters:
# n dredges per group, with control and treatment means in lbs/hour
simdat <- function(n, contmean, trtmean) {
    control <- rtweedie(n, xi = 1.5, mu = contmean, phi = 7.8)
    trt <- rtweedie(n, xi = 1.5, mu = trtmean, phi = 7.8)
    data.frame(Light = rep(c("Control", "Treatment"), each = n),
               CPUE = c(control, trt))
}

nreps <- 1000
pvalues <- numeric(nreps)

# Run the 'experiment' nreps times, storing the p-value for the Light effect
for (i in seq_len(nreps)) {
    test <- summary(glm(CPUE ~ Light, family = tweedie(var.power = 1.5, link.power = 0),
          data = simdat(150, 3.3, 1.65)))
    pvalues[i] <- test$coefficients[2, 4]
}

hist(pvalues, main = "150 replicates, 50% change")
abline(v = 0.05)
```
So what have I done here? I made a control distribution with a mean of 3.3, and a treatment distribution in which *Pisces* (bless its heart) generated a 50% reduction in Yellowtail catches. I specified 150 replicate dredges for each treatment, and ran the 'experiment' 1000 times. What we want to know now is how many times we got a 'statistically significant' result. Remember, I have already set a 'real' difference in the means. The probability of *not* finding a significant effect when there is indeed a difference is the Type II error (β). The **power** of the test - for a defined effect size, and at a specified level of replication - is 1-β.  

By convention, most studies aim to achieve a power of 0.8. In other words, *if* there is an effect, we should detect it 80% of the time.  
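In a simulation, the power estimate is simply the proportion of simulated p-values that fall below the chosen significance level. The sketch below shows this with a deliberately simple stand-in: normally distributed data and a t-test rather than the Tweedie GLM, with group sizes and effect sizes picked purely for illustration. Running it once with no true difference and once with a real one shows the false-positive rate and the power side by side.

```R
set.seed(123)

# Stand-in experiment: two groups of n normal observations whose means
# differ by delta; returns the t-test p-value (not the Tweedie model)
sim_p <- function(delta, n = 30) {
  t.test(rnorm(n), rnorm(n, mean = delta))$p.value
}

alpha  <- 0.05
p_null <- replicate(2000, sim_p(delta = 0))  # no true effect
p_alt  <- replicate(2000, sim_p(delta = 1))  # a real effect is present

type1_rate <- mean(p_null < alpha)  # false positives: should sit near alpha
power_alt  <- mean(p_alt < alpha)   # power: how often the real effect is detected

c(type1 = type1_rate, power = power_alt)
```

Applied to the `pvalues` vector produced by the loop above, the same one-liner, `mean(pvalues < 0.05)`, gives the estimated power for that scenario.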

```{admonition} Statistical significance (α), Type II error (β), power (1-β), effect size and *n*
I really don't want to get into this in detail, because the whole classical statistics thing is just annoying and I'm kind of sick of it. But let's get some basic terms and concepts out of the way. 
In the classical approach we *test* hypotheses, not prove them. Our aim is to attempt to reject hypotheses. Indeed to paraphrase Karl Popper "Science proceeds by bold conjectures, and ingenious attempts to refute them". Well, we are a bit less ambitious. In practice we set up a bit of a straw man **null hypothesis** and try to reject that. The Null hypothesis is simply *there is no difference between the groups*.   
But of course there is variability in systems, and no samples are exactly the same. So what we need to do is evaluate the observed difference between sample and control values (say) given natural variability described by some statistical distribution. The normal distribution, for example.  
Most of you should have a passing familiarity with *statistical significance*. The significance level, α, is also called the Type I error rate. It is the probability that *if you ran the sample program a gazillion times*, and *if the Null Hypothesis was TRUE*, you would reject the Null Hypothesis. In other words, a **false positive**. By convention, we use a value of 0.05, and call a result 'significant' when its p-value falls below it. So, we accept that 5% of the time we would interpret the result as there being a 'significant' effect, when in fact there wasn't.  
The flip side of the Type I error is Type II error, or β. This is the probability that *if you ran the sample program a gazillion times*, and *if the Null Hypothesis was FALSE* that you would *fail* to reject the Null Hypothesis. In other words, a **false negative**. Changing the α level will automatically change the β level.  
The *power* is 1-β. In words, the power is the probability that *if you ran the sample program a gazillion times*, and *if the Null Hypothesis was FALSE*, that you would correctly reject the null hypothesis.
```
````{margin}
```{warning}
So does the p-value say anything about the truth or otherwise of the hypothesis? Sorry to burst the bubble, but no it doesn't. Although many people, including researchers who obviously weren't paying attention in Stats 101, interpret it as such. You will have noted my wording: *if you ran the sample program a gazillion times*... The probability refers to how often you would get *data* like these **if the hypothesis was true**. In probability terms, this is p(data|hypothesis). In words, the probability of the data given the hypothesis. A more useful measure would be: p(hypothesis|data), or the posterior probability of the hypothesis given the data. However, to calculate this we need Bayes' Theorem, which requires a value: p(hypothesis). This is the *prior probability* of the hypothesis, and needs to be specified beforehand. This causes great wailing and gnashing of teeth amongst classical statisticians.  
For our purposes, does it matter too much? Aside from the philosophical angst of Classical vs Bayesian statistics, probably not. Our interpretation is largely based on the weight of the evidence, but don't be tempted to interpret results as measures of proof of hypotheses.
```
````
Enough of the distraction. So what is the power of a test to detect a 50% reduction in bycatch with 150 replicates *each* of the light vs control dredges? If we ran the experiment 1000 times, we would correctly identify an effect about 73% of the time. This is not too far off our preferred power of 0.8 (80%). But that's a lot of replicates. What would our power to detect a 50% change be if we only(!) had 100 replicates? We would only correctly reject the null hypothesis around 58% of the time. This is not too crash hot...  

What if *Pisces* was really rocking, and reduced bycatch to 1/3 of our baseline value? We start to get acceptable power (0.812) with around 75 replicates.

````{margin}
```{note}
You will note I seem a bit vague with my wording. *Around* 75 replicates, for example. This is not Australian understatement. This is because power calculations are based on estimates. I am assuming our observed mean value of 3.3 is the 'true' value. It is of course an estimate with a standard error. Similarly, the Tweedie parameter I used is an estimate from the data. It too has a standard error. I also ran 1000 simulations, so even if the mean and the Tweedie parameter were measured without error, there would still be a standard error on the power estimate.  
These values simply give us a ballpark, but it's better than nothing!
```
````
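The simulation part of that uncertainty is easy to put a number on: each simulated experiment either rejects the null or doesn't, so the power estimate behaves like a binomial proportion. The sketch below applies the usual binomial standard error to the three power values quoted above; `mc_se` is a hypothetical helper name. It captures Monte Carlo error only, not the uncertainty in the mean or the Tweedie parameters.

```R
# Monte Carlo standard error of a simulated power estimate (binomial proportion)
mc_se <- function(power, nreps) sqrt(power * (1 - power) / nreps)

# Power values quoted in the text, each based on 1000 simulations
power_est <- c(n150_50pct = 0.73, n100_50pct = 0.58, n75_67pct = 0.812)
se <- mc_se(power_est, nreps = 1000)

round(se, 3)

# Rough 95% interval for each estimate
round(cbind(lower = power_est - 1.96 * se, upper = power_est + 1.96 * se), 3)
```

Because the standard error shrinks with the square root of the number of simulations, quadrupling `nreps` halves it.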
### We have our ballpark figure
Detecting a 50% reduction in Yellowtail flounder bycatch in this fishery, all else being equal, will be a stretch. Realistically we should be able to detect a reduction of somewhere between 50% and 67% with 100 replicates, but for a 67% change, anywhere over 75 replicates should suffice.  Of course the big assumption here is that Pisces is actually working to generate this very large effect. Scallop dredges are towed quickly (~5 kts), and flatfish are reluctant to move until objects are on top of them.  

```{note}
I like not this news! Bring me other news!
```

There are a few cards in our hand.

#### Trial conduct
This analysis assumes we are running a trial in which each haul is either light or control. The variability in hauls we are basing the power estimates on is reflective of the *between haul* variance. In the upcoming trial we will have a twin rig setup.  **If** the dredges within a haul fish similarly - and there is a high likelihood they will - a paired design will be able to account for and remove much of this variability. The way this works is that the mean value of each haul is estimated and 'removed' from the model, and the response variable effectively becomes the *difference* between the treatment and control. This can and should considerably increase the power of the test.  
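To see why pairing helps, consider a toy simulation in which both dredges in a haul share a common haul effect. Everything here is an assumption made for illustration: the effects are placed on a log-CPUE scale, drawn from normal distributions, and the haul-to-haul spread is set to twice the within-haul spread. Real dredge data, zeroes and all, will be messier.

```R
set.seed(42)
n_hauls <- 200

# Assumed structure: a haul effect shared by both dredges, plus independent
# dredge-level noise, all on the log-CPUE scale (illustrative values only)
haul_effect <- rnorm(n_hauls, mean = 0, sd = 1)
control     <- haul_effect + rnorm(n_hauls, sd = 0.5)
treatment   <- haul_effect + log(0.5) + rnorm(n_hauls, sd = 0.5)  # 50% reduction

# Standard error of the estimated treatment effect,
# ignoring the pairing versus using it
unpaired_se <- sqrt(var(treatment) / n_hauls + var(control) / n_hauls)
paired_se   <- sd(treatment - control) / sqrt(n_hauls)

c(unpaired = unpaired_se, paired = paired_se)
```

Taking within-haul differences cancels the shared haul effect, so the paired standard error reflects only the dredge-level noise. How much is gained in practice depends entirely on how similarly the two dredges fish.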

The second element to consider is that Pisces needs to be given the best chance of working. We will need to exert a large effect to be detectable. We can try to prevent flatfish entering the dredge, but given what we know about their behaviour this might not work. However there is also a Plan B. Pisces might work better to illuminate the escape mesh. These are all things to consider on pilot trials.

#### Biological vs statistical importance
Our focus here has been on detecting the statistical difference between the two means. The nature of the data is that there are a lot of zeroes, and a pile of big values. Note the mean value itself is rather modest - around 3.3 pounds of fish per hour. What if we think outside the box and consider how catches with 50% and 67% reductions in bycatch scale up to the level of a fishing trip? In real terms, what does this look like?  

Let's look at the catches of each of 100 dredges over a single trip under three scenarios: Control, 50% and 67% reductions: 

```{figure} FishPoundsTrip.png
:height: 400px
:align: center
```
There are some inescapable features. There are lots of zeroes and - even with the change in mean - even if our putative effect is operating, we are still going to get some dredges with big catches. This is important to keep in mind. If the fisher sees the big catches, they may be tempted to think the gear is not working. We need to maintain perspective.

````{margin}
```{warning}
Fish are not a statistical distribution. However, we seem to have modeled the clumped behavior OK with the Tweedie distribution. And from experience with fish and fish data, I think this is probably a reasonable approximation of what might happen. Reducing the mean value did not remove the aggregations, so this simulation seems to retain that element of fish behavior.  
The unknown element is how Yellowtail flounder respond to Pisces. If we hit these big aggregations, does this behaviour interact with their response to light? Will a group of fish flee, or hunker down and get caught? This is where video on the dredge will come into play.
```
````

There is a lot of spread of values under the different Pisces effect scenarios for any single trip. What if we sum up their effects over, say, 100 trips?  

```{figure} FishPounds100Trips.png
:height: 400px
:align: center
```

Now we're talking. Each of these points is a total catch for a 100 tow trip. Over a few seasons, the total catch does indeed follow our expectation, given I manually set the effect size. It should also become apparent that even if Pisces has a more modest effect that might not tip the *statistically significant* scale, at the level of removal of Yellowtail from the population this effect will be important at the fishery management scale. A more suitable measure of effect might be the total catch over the cruise, with a suitable randomization error estimate (bootstrap, Monte Carlo or similar).  
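A bootstrap error estimate for a trip total could be sketched as follows. The catches are a made-up stand-in (a zero-heavy, right-skewed mixture chosen for illustration, not the NEFOP data or the Tweedie simulation), and the percentile interval is just one of several reasonable choices.

```R
set.seed(2022)

# Stand-in for one trip of 100 dredges: many zeroes plus a skewed positive part
catch <- rbinom(100, size = 1, prob = 0.65) * rlnorm(100, meanlog = 1, sdlog = 1)
observed_total <- sum(catch)

# Resample dredges with replacement and recompute the trip total each time
boot_totals <- replicate(2000, sum(sample(catch, replace = TRUE)))

# Percentile bootstrap interval for the total catch
ci <- quantile(boot_totals, probs = c(0.025, 0.975))

c(total = observed_total, ci)
```

With a twin-rig setup, the same resampling could be applied to hauls rather than individual dredges, so that each resample keeps the treatment and control dredges from a haul together.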

### Final comments
A ballpark figure of 100 replicates per light treatment in a paired design should give ~0.8 power to detect a reduction of somewhere between 50% and 67% in Yellowtail bycatch. Even if the statistical difference is a bit marginal, the total catch with a suitable error estimate should provide a fishery-relevant estimate of a bycatch reduction effect. We are asking quite a lot of Pisces, so we need to find the configuration to give the fish the best chance of responding and avoiding capture. The response will need to be sizeable to be detectable. A paired design will almost certainly improve our chances to detect effects, and I encourage you to check out the chapters on trial design.

This exploration is conditional on this data set being representative of what we will encounter. In this case, the data set is sizeable and robust. If it were smaller and less extensive, we might be more cautious. However this foray has demonstrated a few things:
- Time spent exploring pilot data of some sort is rarely wasted
- In contrast with our first case study, the real world is often a lot messier than data from a custom trial that made it into a journal might suggest
- There is no need to get too wrapped up in theoretical power calculations. In the age of modern computing, we can just throw computer grunt at the problem to get a robust answer