# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Study Design

Week 2 | Day 3

### LEARNING OBJECTIVES
*After this lesson, you will be able to:*
- Design an experiment
- Demonstrate good and bad examples of study design

## Experimental Design



Experimental design is what puts the **_science_** in data science.

Data scientists need to know how to think about, propose, construct, and analyze experiments. Proper design of experiments is crucial to the success of companies.

This lecture is about the process that comes before the statistical analysis, and why it is just as important, if not more so!

## But first a video
[Power Poses](https://www.youtube.com/embed/TdU2l0i2Wh0)

## Just one problem.

## It's total BS

## Or it appears to be, because no one can reproduce her results.

![](http://i.imgur.com/qzArkcy.png)

![](http://i.imgur.com/muiQnbF.png)

## Carney's comments

> “We ran subjects in [chunks] and checked the effect along the way. It was something like 25 subjects run, then 10, then 7, then 5. Back then this did not seem like p-hacking. It seemed like saving money (assuming your effect size was big enough and p-value was the only issue).” Elsewhere, she notes that “The self-report [dependent variable about feelings of power] was p-hacked in that many different power questions were asked and those chosen were the ones that ‘worked.’”

> Carney also highlights other problems with the way the experiments were run: For one thing, she writes, too many of the people involved were aware of the hypothesis being tested. Generally speaking, this is a bad idea, since this awareness can influence how an experiment is conducted.

[Source](http://nymag.com/scienceofus/2016/09/power-poses-co-author-i-dont-think-power-poses-are-real.html)
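To see why checking the effect after every chunk counts as p-hacking, here is a minimal simulation sketch. It assumes there is no true effect at all, that observations are normal with a known standard deviation of 1, and that we stop as soon as a two-sided z-test crosses the 0.05 threshold. The chunk sizes come from the quote above; the function name `false_positive_rate` and everything else are illustrative assumptions, not a reconstruction of the original study.

```python
import math
import random


def false_positive_rate(chunks, n_sims=4000, z_crit=1.96, seed=0):
    """Share of no-effect experiments declared 'significant'
    when we test after every chunk and stop at the first hit."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        total, n = 0.0, 0
        for size in chunks:
            # The null is true: every observation has mean 0 and sd 1
            total += sum(rng.gauss(0.0, 1.0) for _ in range(size))
            n += size
            # z-statistic for the sample mean with known sd = 1
            z = total / math.sqrt(n)
            if abs(z) > z_crit:
                hits += 1
                break
    return hits / n_sims


peeking = false_positive_rate([25, 10, 7, 5])  # test after each chunk
fixed = false_positive_rate([47])              # test once, at the end
print(f"peeking after each chunk: {peeking:.3f}")
print(f"single test at n = 47:    {fixed:.3f}")
```

With a single planned test the false positive rate should sit near the nominal 5%. With repeated peeking it comes out noticeably higher, even though nothing real is going on.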

## Do not end up like this.

## But she isn't alone


In fact, it has been said there is a **"reproducibility crisis"** taking place right now.

A project spearheaded by the Open Science Collaboration **aimed to replicate 100 studies** published in three high-profile psychology journals during 2008.

>The idea was to see whether there was a reproducibility problem, and if so, to stimulate efforts to address it --Brian Nosek

Researchers who conducted the replication studies also asked the original authors to scrutinize the replication plan and provide feedback, and they registered their protocols in advance, publicly sharing their study designs and analysis strategies. “Most of the original authors were open and receptive,” said project coordinator Mallory Kidwell.

## What were the results?

Despite this careful planning, less than half of the replication studies reproduced the original results. **While 97 percent of the original studies produced results with a “statistically significant” p-value of 0.05 or less, only 36 percent of the replication studies did the same**. The mean effect sizes in the replicated results were less than half those of the original results, and 83 percent of the replicated effects were smaller than the original estimates.

## Exercise

So clearly, reproducibility is important!

Pair up and spend 5 minutes discussing what you, as a data scientist, would need to report to make your results replicable by another researcher.

## Other study design problems

- Coding errors
- P-hacking
- Under-powered studies

![](http://i.imgur.com/B4bVTMn.png)

## “I clicked on cell L51, and saw that they had only averaged rows 30 through 44, instead of rows 30 through 49.”

## But wait, there's more

![](http://i.imgur.com/pm3YXwY.png)

## So what happened?

By default, Excel and other popular spreadsheet applications convert some gene symbols to dates and numbers. For example, instead of writing out “Membrane-Associated Ring Finger (C3HC4) 1, E3 Ubiquitin Protein Ligase,” researchers have dubbed the gene MARCH1. Excel converts this into a date—03/01/2016, say—because that’s probably what the majority of spreadsheet users mean when they type it into a cell. Similarly, gene identifiers like “2310009E13” are converted to exponential numbers (2.31E+19). In both cases, the conversions strip out valuable information about the genes in question.
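pandas applies a similar kind of type guessing to number-like strings when it reads a CSV. The sketch below uses a made-up two-row file (the second identifier is only illustrative) to show the problem and one possible guard: passing `dtype=str` so identifiers stay as text.

```python
import io

import pandas as pd

# Hypothetical two-row file: both identifiers look like scientific notation
csv_text = "gene_id\n2310009E13\n1110004E09\n"

# Default behaviour: pandas infers the column type from the values
guessed = pd.read_csv(io.StringIO(csv_text))

# Guard: read every column as text so nothing is reinterpreted
as_text = pd.read_csv(io.StringIO(csv_text), dtype=str)

print(guessed["gene_id"].dtype)     # a float dtype: the IDs became numbers
print(as_text["gene_id"].tolist())  # the original strings, untouched
```

Reading identifier columns as text and converting only the columns you know are numeric is a conservative default for this kind of data.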

## P-hacking

<img src="http://www.statisticsdonewrong.com/_images/xkcd-significant.png" width=800>

[P-hacking Politics](http://fivethirtyeight.com/features/science-isnt-broken/#part1)
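One form of p-hacking is running many tests and reporting only the ones that "worked". A rough sketch of the arithmetic, assuming the tests are independent and each uses the same significance threshold (the function name `familywise_error_rate` is just a label for this example):

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n_tests
    independent tests when no real effect exists."""
    return 1.0 - (1.0 - alpha) ** n_tests


for n_tests in (1, 5, 20):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:>2} tests -> {rate:.1%} chance of a false positive")
```

Under these assumptions, 20 tests give roughly a 64% chance of at least one "significant" result by luck alone.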

## Under-powered studies

What is statistical power?

- The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. A study might easily detect a huge benefit from a medication, but detecting a subtle difference is much less likely. [Source](http://www.statisticsdonewrong.com/power.html)

> In one sample of studies published between 1975 and 1990 in prestigious medical journals, 27% of randomized controlled trials gave negative results, but 64% of these didn’t collect enough data to detect a 50% difference in primary outcome between treatment groups. Fifty percent! Even if one medication decreases symptoms by 50% more than the other medication, there’s insufficient data to conclude it’s more effective. And 84% of the negative trials didn’t have the power to detect a 25% difference.

## How likely are we to detect a biased coin?

<img src="http://i.imgur.com/lvtgEZn.png" width=500>
<br>
<img src="http://i.imgur.com/rScaiHV.png" width=500>
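The question in the heading can also be answered numerically. The sketch below assumes a two-sided exact binomial test of "the coin is fair" at the 0.05 level, with a rejection region built symmetrically from the most extreme head counts. The function names and the choice of a coin that lands heads 60% of the time are illustrative assumptions.

```python
import math


def binom_pmf(k, n, p):
    """P(exactly k heads in n flips), computed in log space for stability."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log(1 - p))


def rejection_region(n, alpha=0.05):
    """Head counts that reject 'the coin is fair'.
    Adds outcomes in symmetric pairs from the extremes inward while
    their total probability under a fair coin stays at or below alpha."""
    region, size = [], 0.0
    for low in range(n // 2 + 1):
        high = n - low
        step = binom_pmf(low, n, 0.5)
        if high != low:
            step += binom_pmf(high, n, 0.5)
        if size + step > alpha:
            break
        size += step
        region.extend({low, high})
    return region


def power(n, p_true, alpha=0.05):
    """Probability that the test rejects fairness when P(heads) = p_true."""
    return sum(binom_pmf(k, n, p_true) for k in rejection_region(n, alpha))


for n_flips in (10, 100, 1000):
    print(f"{n_flips:>4} flips: power = {power(n_flips, 0.6):.3f}")
```

With only 10 flips the test almost never catches a 60/40 coin. The same bias becomes close to certain to be detected once the number of flips is large enough, which is the point about under-powered studies.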

## So we've seen the bad. Now let's talk about the right way to design experiments

<a title="By ArchonMagnus (Own work) [CC BY-SA 4.0 (http://creativecommons.org/licenses/by-sa/4.0)], via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File%3AThe_Scientific_Method_as_an_Ongoing_Process.svg"><img width="512" alt="The Scientific Method as an Ongoing Process" src="https://upload.wikimedia.org/wikipedia/commons/thumb/5/5c/The_Scientific_Method_as_an_Ongoing_Process.svg/512px-The_Scientific_Method_as_an_Ongoing_Process.svg.png"/></a>

## What is a hypothesis?

> A hypothesis is **a suggested solution for an unexplained occurrence that does not fit into current accepted scientific theory**. The basic idea of a hypothesis is that there is no pre-determined outcome. For a hypothesis to be termed a scientific hypothesis, **it has to be something that can be supported or refuted through carefully crafted experimentation or observation**. This is called falsifiability and testability, according to the Encyclopedia Britannica. [Source](http://www.livescience.com/21490-what-is-a-scientific-hypothesis-definition-of-hypothesis.html)

## Characteristics of a good test

- **S**pecific
- **M**easurable
- **A**ttainable
- **R**eproducible
- **T**ime-bound

- Specific: The question and key success metrics are clearly defined
- Measurable: The question is one that can be measured with your data
- Attainable: The question you are asking is possible to answer with the data you can collect
- Reproducible: Another person (or you in 6 months!) can replicate your analysis
- Time-bound: The question can be answered given real-world constraints of time and money
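One way to act on the "Reproducible" criterion is to fix the random seed and record the environment next to every result. A minimal sketch using only the standard library; `run_analysis` and `run_metadata` are hypothetical names, and the "analysis" here is a stand-in for real work.

```python
import json
import platform
import random
import sys


def run_metadata(seed):
    """Details another person would need to rerun the analysis."""
    return {
        "seed": seed,
        "python_version": sys.version.split()[0],
        "platform": platform.platform(),
    }


def run_analysis(seed):
    """Stand-in analysis: the mean of 100 seeded random draws."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
    return sum(sample) / len(sample)


seed = 42
result = run_analysis(seed)
report = {"result": result, "metadata": run_metadata(seed)}
print(json.dumps(report, indent=2))
```

Rerunning with the same seed on the same setup gives the same number, so a reader can check the result instead of taking it on trust.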

## Exercise

Spend the next few minutes coming up with a hypothesis and a study design for the second project.
Be prepared to discuss your question and your strategy.

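```python
import pandas as pd

# Load the Billboard data for project 2 (this path is specific to the author's machine)
df = pd.read_csv('/Users/timothyernst/GA-DSI/projects/projects-weekly/project-02/assets/billboard.csv')

# Preview the first five rows
df.head()
```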

| Unnamed: 0 | year | artist.inverted | track | time | genre | date.entered | date.peaked | x1st.week | x2nd.week | x3rd.week | ... | x67th.week | x68th.week | x69th.week | x70th.week | x71st.week | x72nd.week | x73rd.week | x74th.week | x75th.week | x76th.week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2000 | Destiny's Child | Independent Women Part I | 3,38,00 AM | Rock | September 23, 2000 | November 18, 2000 | 78 | 63 | 49 | ... | * | * | * | * | * | * | * | * | * | * |
| 1 | 2000 | Santana | Maria, Maria | 4,18,00 AM | Rock | February 12, 2000 | April 8, 2000 | 15 | 8 | 6 | ... | * | * | * | * | * | * | * | * | * | * |
| 2 | 2000 | Savage Garden | I Knew I Loved You | 4,07,00 AM | Rock | October 23, 1999 | January 29, 2000 | 71 | 48 | 43 | ... | * | * | * | * | * | * | * | * | * | * |
| 3 | 2000 | Madonna | Music | 3,45,00 AM | Rock | August 12, 2000 | September 16, 2000 | 41 | 23 | 18 | ... | * | * | * | * | * | * | * | * | * | * |
| 4 | 2000 | Aguilera, Christina | Come On Over Baby (All I Want Is You) | 3,38,00 AM | Rock | August 5, 2000 | October 14, 2000 | 57 | 47 | 45 | ... | * | * | * | * | * | * | * | * | * | * |
