<center><h1>7SSG2059 Geocomputation 2016/17</h1></center>

<center><h1>Practical 10a: Analysis of Relationships in Weather and Air Quality Data</h1></center>

<p><center><i>James Millington, 27 November 2016</i></center></p>


## Overview

Practical 10 is split into two notebooks - one examining relationships in Heathrow Weather and Air Quality data (10a) and one examining relationships in NS-SeC and house prices data (10b). The two notebooks are self-contained and can be used independently. You should decide which sets of data you most likely want to use for your final report, and work through the corresponding notebook during supervised practical time. This will give you the basics of analyses that you can then build on for your final report. Of course, you are welcome to work through both notebooks, although you are unlikely to be able to complete both during class time. 

## Helper Functions

Before getting to the data and code in this notebook, you should first run the code in the next three code blocks. These code blocks:

1. import packages required for functionality in the remainder of the notebook and set `matplotlib` font parameters 
2. define a function to help interpret OLS regression output
3. define a function to plot a histogram to file

Take a quick look at the code when running these blocks, but don't spend too much time as we will return to look at the function definitions more closely later in the notebook. 

In [None]:
#import packages required for functionality below and set matplotlib font parameters 
import os
import pandas as pd
import seaborn as sb      
import numpy as np
import matplotlib.pyplot as plt    #see http://matplotlib.org/users/pyplot_tutorial.html
import statsmodels.api as sm       #see http://statsmodels.sourceforge.net/stable/  

#set matplotlib font params
plt.rcParams['axes.titlesize'] = 20
plt.rcParams['axes.labelsize'] = 20
plt.rcParams['xtick.labelsize'] = 14
plt.rcParams['ytick.labelsize'] = 14

%matplotlib inline

In [None]:
#define function to help interpret OLS regression output
def mod_diagnostics(model, data):
    
    """
    Output to file model diagnostics for an OLS model
    
    Input:
        model - statsmodels.regression.linear_model.OLS object
        data  - pandas.DataFrame containing data for model
        
    Output:
        XX-XX_OLS_SampleXX_Summary.txt contains the model summary output
        XX-XX_OLS_SampleXX_ResidHist.png is histogram of the residuals
        XX-XX_OLS_SampleXX_StdResid.png is a plot of standardised residuals against fitted values
        
        if model is univariate: XX-XX_OLS_SampleXX_Regression.png is a scatter plot with regression line
        
    Requires:
        statsmodels.api
        pandas
        numpy
        matplotlib.pyplot
    """
    
    fitted = model.fit()
    dep = model.endog_names
    indep_names = ""
    
    #create a string containing list of indep names for output files
    for name in model.exog_names[1:]:            #we don't want 0 element as that is the intercept
        indep_names += "{0}_".format(name)


    #Want to include name of DataFrame in the output filename but currently DataFrame does not have a name attribute
    #So for now use nobs from fitted  (Dan potential solution: pass data in a dictionary and access the label)
    samplesize = str(int(fitted.nobs))
    
    #write the model summary to a text file (the with statement closes the file for us)
    with open("{0}-{1}OLS_Sample{2}_Summary.txt".format(dep, indep_names, samplesize), "w") as f1:
        f1.write(fitted.summary().as_text())

    #calculate standardized residuals ourselves
    fitted_sr = (fitted.resid / np.std(fitted.resid)) 

    #Histogram of residuals
    ax = plt.hist(fitted.resid)
    plt.xlabel('Residuals')
    plt.savefig('{0}-{1}OLS_Sample{2}_ResidHist.png'.format(dep, indep_names, samplesize), bbox_inches='tight')
    plt.close()

    #standardized residuals vs fitted values
    ax = plt.plot(fitted.fittedvalues, fitted_sr, 'bo')
    plt.axhline(linestyle = 'dashed', c = 'black')
    plt.xlabel('Fitted Values')
    plt.ylabel('Standardized Residuals')                
    plt.savefig('{0}-{1}OLS_Sample{2}_StdResid.png'.format(dep, indep_names, samplesize), bbox_inches='tight')
    plt.close()
  
    
    if(len(model.exog_names) == 2):  #univariate model (with intercept)
            
        indep = model.exog_names[1]
        
        #scatter plot with regression line 
        ax = plt.plot(data[indep], data[dep], 'bo')
        x = np.arange(data[indep].min(), data[indep].max(), 0.1)    #list of values to plot the regression line using
        plt.plot(x, fitted.params[1]*x + fitted.params[0], '-', c = 'black')  #plot a line using the standard equation with parms from the model
        
        plt.xlabel(indep)
        plt.ylabel(dep)                
        plt.savefig('{0}-{1}_OLS_Sample{2}_Regression.png'.format(dep, indep, samplesize), bbox_inches='tight')
        plt.close()


In [None]:
#define function to plot histogram to file
def plot_hist(series):
    
    """
    Output to file a simple histogram
    
    Input:
        series - pandas.Series containing data (may also be able to take a numpy array)
        
    Output:
        XX-SampleXX-Hist.png - the histogram image
        
    Requires:
        pandas
        matplotlib.pyplot
    """

    out_name = "{0}-Sample{1}-Hist.png".format(series.name, len(series))
    plt.hist(series.dropna())
    plt.xlabel(series.name)
    plt.ylabel('Count')
    plt.savefig(out_name, bbox_inches='tight')           #save the figure
    plt.close()

## Heathrow Weather and Air Quality Data

The additional data we'll use with the Heathrow Weather data are air quality data downloaded from the Air Quality England [website](http://www.airqualityengland.co.uk/) (AQE 2016) for the [Hounslow Hatton Cross site](http://www.airqualityengland.co.uk/site/latest?site_id=HS7) (site HS7). This site was chosen as it is near Heathrow Airport. 

Air pollution is an important aspect of the ongoing argument about the construction of the third runway at Heathrow (e.g. GLA 2012). In particular, although Nitrogen Dioxide (NO2) concentrations around Heathrow are lower than in the centre of London, they are still often above recommended levels (e.g. Heathrow 2012). By looking at relationships between weather and air quality we may begin to better understand the drivers of pollution.

In Practical 9 we used code to clean the air quality data and join it to the weather data to create a single time-series of data. This code is copied in the next code block - you will need to run this code if you have not already done so to create the `HeathrowAQWeather2016.pkl` file. However, if you have already created `HeathrowAQWeather2016.pkl` you can skip that code and simply load the data into memory (the subsequent code block).

In [None]:
##ONLY run this code block if you have NOT already created HeathrowAQWeather2016.pkl in Practical 9

#read weather data
metDF = pd.read_pickle("CleanedHeathrowData2016.pkl")

#read the aq data
aqDF = pd.read_csv('AirQuality_HattonCross_2016.csv', header=0, skiprows=3, usecols=range(0,10), skipfooter = 1, engine = 'python')

#rename columns and drop those not needed
aqCN = ['Date', 'Time', 'PM10', 'PM10_su', 'NO', 'NO_su', 'NO2', 'NO2_su', 'NOx', 'NOx_su']
aqDF.columns = aqCN
del aqDF['PM10_su'] 
del aqDF['NO_su']   
del aqDF['NO2_su'] 
del aqDF['NOx_su'] 

#fix issue with AQ data time formatting and set DatetimeIndex
import datetime as dt

aqDF['Time'] = aqDF['Time'].replace(to_replace = '24:00:00', value = '00:00:00')   #24:00:00 is not a valid time so relabel it as 00:00:00
aqDF["DT"] = pd.to_datetime(aqDF.Date + aqDF.Time, format='%d/%m/%Y%H:%M:%S')  #create a new series containing a datetime object

oneday = dt.timedelta(days=1)                           #create a timedelta object of days = 1
aqDF.loc[aqDF['Time'] == '00:00:00','DT'] += oneday     #update the DT cell for rows with Time == 00:00 by adding oneday

aqDF.index = aqDF["DT"]                                 #set the index only once DT has been corrected
del aqDF['DT'] 
aqDF = aqDF.drop_duplicates()

aqfirstDate = aqDF.index[0]
aqlastDate = aqDF.index[-1]
aqDF = aqDF.reindex(index=pd.date_range(start = aqfirstDate, end = aqlastDate, freq = '1H'), fill_value = None)   #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.reindex.html

#select only column we want for the final joined dataframe
metDF = metDF.select_dtypes(include=['float64'])
del metDF['WindGust'] 
del metDF['LocID']

del aqDF['Date'] 
del aqDF['Time']

#join!
aqmetDF = aqDF.join(metDF)

#reset records where Pressure is 0, which doesn't make sense (note this sets the whole row to missing, not just Pressure)
aqmetDF.loc[aqmetDF.Pressure == 0] = None

#write the data to file
aqmetDF.to_pickle("HeathrowAQWeather2016.pkl")

In [None]:
## run this code if you HAVE already created HeathrowAQWeather2016.pkl
#import pandas as pd                                   #already imported above but would be needed otherwise
aqmetDF = pd.read_pickle("HeathrowAQWeather2016.pkl")  #assumes file is saved in the same folder as this notebook file

Note: the previous code block assumes `HeathrowAQWeather2016.pkl` is saved in the same folder as this notebook file, but it is possible to read (and write) data to other folders by specifying the 'path' we want to use. The following code block shows one way to do this (as James discussed in the Week 9 lecture). **ONLY** run the next code block if you want to read data from a location other than the folder in which this notebook file is saved - it is more for your information for future use. 

In [None]:
##ONLY run this code if you want to read data from a location other than the folder in which this notebook file is saved
#import os             #already imported above but would be needed otherwise

#set the path to the directory where we want to read and save from
path = os.path.join(os.path.expanduser("~"),"Google Drive","Teaching","2016-17","Undergrad","Geocomp","Week10","Practical")
os.chdir(path)

#the following line would now read the pkl file from the folder specified above
#aqmetDF = pd.read_pickle("HeathrowAQWeather2016.pkl")

### Exploratory Analysis and Plotting



The first step in our analysis should be looking at the shape of the distributions of variables and any possible relationships between variables. Let's use a pairplot to do this:

In [None]:
sb.pairplot(aqmetDF.dropna(axis = 0))

#### Task

Describe the distributions and relationships of the **air quality** variables (edit this text block):

**A: **

We should also check for missing data in our time series. We'll use a loop to automate the plotting of a time series for each of the air quality variables: 

In [None]:
for name in ['PM10', 'NO', 'NO2', 'NOx']:
    
    fig = plt.figure(name)      #create a new figure labelled with the variable name (or switch to it if it already exists)
    ax = aqmetDF[name].plot()   #plot the time series on that figure
    plt.xlabel('Date')
    plt.ylabel(name)
    plt.legend(loc = 0)  

From the time series plots you can see that there is a fair chunk of missing data for June (and there are also other smaller chunks of missing data in other places). For your final report you may want to think about what time period you want to analyse, and you may also want to do some interpolation. For the remainder of this practical we will use data for September only (without any interpolation). 
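As a minimal sketch of what counting gaps and interpolating might look like, the example below uses a small made-up hourly series rather than the Heathrow data. The name `toy_no2` and its values are assumptions for illustration only, and linear interpolation in time is just one of several possible approaches.

```python
import numpy as np
import pandas as pd

# Made-up hourly series with one short gap and one long gap (not real data)
idx = pd.date_range("2016-06-01", periods=12, freq=pd.Timedelta(hours=1))
values = [10.0, 12.0, np.nan, 16.0, 18.0,
          np.nan, np.nan, np.nan, np.nan, np.nan, 30.0, 32.0]
toy_no2 = pd.Series(values, index=idx, name="NO2")

# Count the missing hours
n_missing = int(toy_no2.isnull().sum())

# Fill every gap by linear interpolation in time
filled = toy_no2.interpolate(method="time")

# Fill at most one consecutive missing value per gap, leaving the rest missing
capped = toy_no2.interpolate(method="time", limit=1)

print(n_missing)
print(filled)
print(capped)
```

Whether filling a gap several hours (or days) long is defensible is a judgement you would need to justify in your report.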

To subset for September:

In [None]:
aqmetDF_sept = aqmetDF.loc['2016-09']     #use .loc with a partial date string to select all rows in September 2016

#### Task
Repeat the exploratory plots we just did for the entire data set (i.e. pairplot and time series plots) for the `aqmetDF_sept` dataframe just created:

#### Task

For the September time series you just plotted, what differences or similarities can you see between the plots? (edit this text block to answer)

**A: **

### Correlation and Relationships

Now that we have all the weather and air quality data together in one DataFrame we can start to look at correlations and relationships. Let’s look directly at some correlation matrices for the variables:

In [None]:
corrmat = aqmetDF_sept.corr()
print("Pearson correlation coefficient matrix:\n{0}\n".format(corrmat))     #single-argument print works in Python 2 and 3

corspmat = aqmetDF_sept.corr(method = "spearman")
print("Spearman rank correlation coefficient matrix:\n{0}\n".format(corspmat))

If you recall, the degree of linearity in the relationships between variables (e.g. look back to your pairplot) should determine which type of correlation we should use. We’ll focus here on NO2 as that has been highlighted as a particular issue for air quality around Heathrow (but feel free to examine other variables for your final report). 
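To illustrate why linearity matters for the choice of coefficient, here is a small sketch on made-up data (the series `x` and `y` are assumptions, not Heathrow variables). The relationship is perfectly monotonic but strongly curved. Spearman's coefficient is computed here as the Pearson correlation of the ranks, which is how it is defined.

```python
import numpy as np
import pandas as pd

# Made-up data: y always decreases as x increases, but along a curve
x = pd.Series(np.arange(1.0, 21.0))
y = np.exp(-0.5 * x)

# Pearson measures the strength of the *linear* relationship
pearson_r = x.corr(y)

# Spearman is the Pearson correlation of the ranked values
spearman_rho = x.rank().corr(y.rank())

print("Pearson r:    {0:.3f}".format(pearson_r))
print("Spearman rho: {0:.3f}".format(spearman_rho))
```

Spearman reports a perfect negative monotonic relationship, while Pearson is noticeably weaker because a straight line fits these points poorly.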

#### Task

For the weather variable with the strongest correlation to NO2, create a jointplot to examine the relationship (replace `???` with the name of the appropriate variable):

In [None]:
from scipy.stats import spearmanr
sb.jointplot(???, "NO2", data=aqmetDF_sept, stat_func=spearmanr)


#### Task

Can you tell if the relationship is linear? Is it a Positive or Negative relationship? How would you describe the distribution of the variables? (edit this text block to answer)

**A: **

As all these variables have been measured over time we can also look at their relationships through time. For now, we'll focus on the weather variable with the strongest relationship to NO2. Check you understand how the code below produces a two-axis time series of NO2 with Wind Speed (consult the comments):

In [None]:
a = sb.axes_style()                                                #save default sb style settings
sb.set_style("darkgrid", {'axes.grid': False})                     #turn off the grid

fig, ax1 = plt.subplots()                                          #set up the figure and left axis
ax1.set_ylim(0,25)                                                 #modify y-axis limits   (to create room for legends later) 

name1 = "WindSpeed"
ax1 = aqmetDF_sept[name1].plot(style='-r', label = name1)          #plot 
ax1.set_ylabel(name1, color='r')                                   #set first y-axis label

name2 = "NO2"
ax2 = ax1.twinx()                                                  #create a twin of the left axis for right
ax2.set_ylim(0,100)    
ax2 = aqmetDF_sept[name2].plot(style='-b', label = name2)          #plot
ax2.set_ylabel(name2, color='b')                                   #change second axis label

ax2.legend(loc = 1)                                                #add legend 
ax1.legend(loc = 2)                                                #add legend 
ax1.set_xlabel("Date", color = 'black')

#uncomment next two lines to save the figure to an image file (e.g. for use in reports)
#plt.savefig('{0}{1}.png'.format(name1, name2), bbox_inches='tight') #save the figure
#plt.close()                                                        #close plot
                
sb.set_style(a)                                                    #revert style back to defaults


From the plot just produced you should be able to see how NO2 is generally greater when Wind Speed is lower (indicating the negative relationship found through the correlation above). However, the pattern could be clearer – there’s a lot of variation from hour to hour which makes the relationship more difficult to see. Maybe the use of running means would help here… 

#### Task

Create (and save to file) a two-axis plot with the 12-hour running mean of WindSpeed on the left axis and the 12-hour running mean of NO2 on the right axis. Copy the code from the last code block and edit appropriately to do this.

The 12-hour running mean plot has made it much easier to see a relationship between the two variables. Consequently, it's likely that using such a running mean will help to strengthen the correlation between the variables. To check this, we first need to calculate the running mean for all our variables:

In [None]:
aqmetDF_sept_12mw = aqmetDF_sept.rolling(window=12, center = True).mean() 

Check you understand what the dataframe just created contains (e.g. by using code to find out about it). Now we can check the correlation between the 12-hour running means of NO2 and Wind Speed. 
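One way to build intuition for what a centred 12-hour rolling mean returns is to apply it to a toy series where the answer is easy to check by hand. The series `toy` below (the numbers 0 to 47 at hourly intervals) is an assumption for illustration only.

```python
import numpy as np
import pandas as pd

# Toy hourly series: the values 0, 1, 2, ... 47 over two days
idx = pd.date_range("2016-09-01", periods=48, freq=pd.Timedelta(hours=1))
toy = pd.Series(np.arange(48, dtype=float), index=idx, name="toy")

# Same call as used for the real data
toy_12mw = toy.rolling(window=12, center=True).mean()

# Hours near each end have no complete 12-hour window, so they are NaN
n_nan = int(toy_12mw.isnull().sum())
first_valid = toy_12mw.first_valid_index()

print(toy_12mw.head(8))
print("Number of missing values:", n_nan)
print("First hour with a value:", first_valid)
```

The result has the same index as the input, but each value is an average of 12 neighbouring hours, and the hours at either end of the series are missing.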

#### Task

Create a `jointplot` to visualise the relationship between the 12-hour running means of Wind Speed and NO2 (use the `aqmetDF_sept_12mw` dataframe just created):

#### Task

Create correlation matrices (pearson and spearman) for the 12-hour running means of all the variables (use the `aqmetDF_sept_12mw` dataframe just created):

#### Task

Compare the correlations between NO2 and Wind Speed for the raw data and the 12-hour running mean. How have the correlations changed? Are they stronger or weaker? What is your physical interpretation for any differences? (edit this text block to answer)

**A: **

### Simple Linear Regression

We seem to have identified a reasonably good relationship between Wind Speed and NO2. Regression would allow us to use this relationship to predict the level of NO2 we would expect at Heathrow over a 12-hour period from the mean wind speed over that period.
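Before turning to the real data, the idea of fitting a line and predicting from it can be sketched with `numpy` alone. The wind speed and NO2 values below are made up for illustration (they are not measurements), and `np.polyfit` is used here only as a quick stand-in for a full regression package.

```python
import numpy as np

# Made-up 12-hour mean values (illustration only)
wind = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])        # m/s
no2 = np.array([60.0, 52.0, 47.0, 38.0, 33.0, 25.0])   # ug/m3

# Fit a straight line: no2 = intercept + slope * wind
slope, intercept = np.polyfit(wind, no2, deg=1)

# Predict NO2 for a mean wind speed of 3.5 m/s
predicted = intercept + slope * 3.5

# Proportion of variance explained by the line (r-squared)
residuals = no2 - (intercept + slope * wind)
r_squared = 1.0 - residuals.var() / no2.var()

print("slope:", slope)
print("intercept:", intercept)
print("predicted NO2 at 3.5 m/s:", predicted)
print("r-squared:", r_squared)
```

This gives the fitted line and a prediction, but none of the diagnostics (significance tests, residual checks) that a proper regression summary provides.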

To fit a regression we can use the OLS method in the statsmodels package. The `statsmodels.api` was imported with alias `sm` at the top of this notebook, so we can use it to fit the regression between our variables as follows:

In [None]:
#create OLS object
NO2_WS_RM_mod = sm.OLS.from_formula("NO2 ~ WindSpeed", aqmetDF_sept_12mw, missing = 'drop')  #use the missing argument with value drop to tell python to ignore missing data
#fit the regression
fitted_NO2_WS_RM_mod = NO2_WS_RM_mod.fit()

Note how there are two steps to fitting the regression model. First, we create an OLS model object by specifying the 'formula' and the data to use - see that 'formula' does not use the `=` symbol and instead relates variables using `~`. 

The second line above then actually fits the regression model (using the `fit` method) and puts this in a 'fitted model' object. We can then get a summary of the regression model using the `summary` method with the fitted model object:

In [None]:
print("Summary of NO2_WS_RM_mod\n{0}".format(fitted_NO2_WS_RM_mod.summary()))           #output

#### Task
Using your model output, answer the following questions. You will find it useful to refer to your lecture notes (including Week 9), and [Johnson (2014)](http://connor-johnson.com/2014/02/18/linear-regression-with-python/) provides a nice overview of the model output and what it all means:

Qa)	What is the model r2? Does this make sense given the value of Pearson’s r you got earlier? 

**A: **

Qb)	What is the value of the model intercept? 

**A: **

Qc)	What is the parameter value for the slope of the regression? Is it statistically significantly different from zero (at 95% confidence)? 

**A: **

Qd)	Replace ???: For every 1 m/s of Wind Speed increase, we would expect ??? µg/m3 of NO2 ??? [more/less]

**A: **

Qe)	According to the output are the residuals normally distributed (look at `Prob(Omnibus)` and `Prob(JB)` and compare to lecture slides)? 

**A: **

Qf)	According to the Durbin-Watson test is there structure (auto-correlation) in the residuals? 

**A: **

Qg)	Given your answers to Qe) and Qf) do you think the assumptions of linear regression have been violated? 

**A: **
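To help interpret the Durbin-Watson value asked about in Qf, the statistic can be computed by hand: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below applies it to two made-up residual series (random numbers, not model output). The helper name `durbin_watson_stat` is hypothetical; `statsmodels` provides its own version in `statsmodels.stats.stattools.durbin_watson`.

```python
import numpy as np

def durbin_watson_stat(resid):
    """Durbin-Watson statistic: near 2 for no autocorrelation, near 0 for strong positive autocorrelation."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.RandomState(42)

# Independent 'residuals' (white noise)
white = rng.normal(size=2000)

# Strongly autocorrelated 'residuals': a 12-point running mean of the noise
smooth = np.convolve(white, np.ones(12) / 12.0, mode="valid")

dw_white = durbin_watson_stat(white)
dw_smooth = durbin_watson_stat(smooth)

print("Durbin-Watson, independent residuals:   ", dw_white)
print("Durbin-Watson, autocorrelated residuals:", dw_smooth)
```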

To check further if the assumptions of linear regression have been violated, we can look at plots of the residuals. The `mod_diagnostics` function defined above provides these for us...

#### More Model Diagnostics

Go and look at the `mod_diagnostics` function at the top of this notebook now to check you understand it. Identify where it does the following:
- Fits the model passed to it 
- Writes the fit model summary to file on disk
- Saves the histogram of the residuals to file using a string format
- Creates a scatter plot of standardized residuals
- Only creates a scatter plot with a regression line when the number of independent variables is equal to 1 (and think about why we don’t try to do this when we have more than one independent variable).

Now we can use the `mod_diagnostics` function to create diagnostic plots for the `NO2_WS_RM_mod` model. Remember two things about the `mod_diagnostics` function:
1. it takes an OLS model object as an argument, NOT a fitted model object
2. it writes its output to file (so you'll need to check your hard disk for output) 

In [None]:
mod_diagnostics(NO2_WS_RM_mod, aqmetDF_sept_12mw)

#### Task

Go and look at the output just created by `mod_diagnostics` - if you can't find it on file check above if you have set the working directory somewhere else. 

From the histogram of the residuals you should be able to see that residuals are positively skewed. But does this matter? Not really, as we have a large sample size here. 

What is more problematic is the structure in the residuals, which you should be able to see in the standardised residuals plot. So we need to do a little more work to get a valid regression. 

There are a couple of things we might do:
1.	We should think about how we may have introduced structure into the input variables ourselves (which can be the root of structure in the residuals)
2.	We should think about whether we can transform the distribution of the input variables to make them more normal (which often helps to make residuals more normal)

#### Sampling from the Running Mean

The structure we can see in the residuals of the last model is also present in the regression plot itself (as hopefully you can see for yourself). We may have introduced this structure ourselves by taking the running mean. Think about it: the value of every hour in our running mean data is weighted by nearby hours. These values are no longer independent. 

One way around this might be to sample the running mean data to only use independent values. Given we used a 12-hour running mean, using data for every 12th hour should do this... (think about that). 
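The effect of a running mean on independence can be sketched with random numbers. In the example below, `noise` is made-up independent data (an assumption for illustration, not weather data). Adjacent values of its 12-point running mean share 11 of their 12 inputs, whereas values 12 steps apart share none.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)

# Independent random values standing in for hourly observations
noise = pd.Series(rng.normal(size=5000))

# 12-point centred running mean, as used for the real data
running = noise.rolling(window=12, center=True).mean()

# Correlation of the running mean with itself one step and twelve steps later
lag1 = running.autocorr(lag=1)
lag12 = running.autocorr(lag=12)

# Keep only every 12th value, then check neighbouring sampled values
every12 = running.iloc[::12]
lag1_sampled = every12.autocorr(lag=1)

print("lag-1 autocorrelation of running mean: ", lag1)
print("lag-12 autocorrelation of running mean:", lag12)
print("lag-1 autocorrelation after sampling:  ", lag1_sampled)
```

Real weather and pollution data are autocorrelated even before any smoothing, so sampling every 12th hour removes the dependence we introduced but not necessarily all dependence.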

The code below creates a new dataframe containing only values for 6am and 6pm (by building on [this](http://stackoverflow.com/a/10567298) SO answer). Check you generally understand what is going on here and what the new dataframe contains:

In [None]:
hour = aqmetDF_sept_12mw.index.hour
selector = ((hour == 6) | (hour == 18))   
sixes = aqmetDF_sept_12mw[selector].copy()    #take a copy so new columns can be added to sixes later without warnings
sixes.info()
print(sixes.head())

corrmat_sixes = sixes.corr()
print("\nPearson correlation coefficient matrix:\n{0}\n".format(corrmat_sixes))

#### Task

Answer the following questions (edit this text block):

What has happened to the correlation between NO2 and Wind Speed for the sampled data?

**A: **

What has happened to the number of values in this dataframe compared to the previous one?

**A: **

#### Task

Create a joint plot from the sampled dataframe just created to check that the relationship between the NO2 and Wind Speed is still roughly linear:

Okay, looks good, let's try fitting a regression to these data to see what we get:

In [None]:
NO2_WS_six_mod = sm.OLS.from_formula("NO2 ~ WindSpeed", sixes, missing = 'drop')  #use the missing argument with value drop to tell python to ignore missing data
mod_diagnostics(NO2_WS_six_mod, sixes)
print(NO2_WS_six_mod.fit().summary())

#### Task

Compare the results for this regression on the sampled data to the previous model:

Q: (How) have model parameters changed? 

**A: **

Q: How does the model fit compare to the previous model? _Hint: compare the confidence intervals_ 

**A: **

Q: Have we successfully removed structure from the residuals? 

**A: **

Q: Are residuals normally distributed? 

**A: **


#### Transform input variables

There is some ambiguity about whether the residuals in our new regression are normal. As our sample size is much smaller than previously we should maybe take this quite seriously and try our second option from above (transformation) to further improve the situation. 

You can see from the last jointplot we made that the distribution of the NO2 data is heavily positively skewed - if we take a log transform of these data and use them in our model it may help with the issue of the normality of residuals:

In [None]:
sixes['logNO2'] = np.log(sixes['NO2'])     #transform the data by taking natural log

#output some plots to quickly compare 
fig1 = sixes.logNO2.hist()
plt.title('Log NO2')

fig2 = plt.figure()
fig2 = sixes.NO2.hist()
plt.title('NO2')

#plot_hist(sixes['logNO2'])      #note that the plot_hist function defined above plots a histogram in an image file on disk


The log of NO2 seems to be less skewed than the original NO2 data, so maybe that will help us to meet the assumptions of linear regression (given a low sample size for the sampled data).
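If you want a number to go with the visual comparison of histograms, the sample skewness can be calculated before and after the transform. The sketch below does this for made-up, positively skewed data (`toy_no2` is drawn from a log-normal distribution purely for illustration; it is not the Heathrow data).

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Made-up positively skewed data standing in for NO2 concentrations
toy_no2 = pd.Series(rng.lognormal(mean=3.0, sigma=0.8, size=2000), name="NO2")

# Skewness near 0 indicates a roughly symmetric distribution
skew_raw = toy_no2.skew()
skew_log = np.log(toy_no2).skew()

print("Skewness of raw values:", skew_raw)
print("Skewness of log values:", skew_log)
```

The same two `skew()` calls could be applied to `sixes['NO2']` and `sixes['logNO2']`.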

#### Task
Fit a regression model for the logNO2 data for the sample running mean data, using Wind Speed as a predictor. Print the output in this notebook and to file (e.g. using the `mod_diagnostics` function):

#### Task

Check the output of the model summary and plots of residuals 

Qa)	Looking at the plots of residuals do you think this model meets the assumptions of linear regression better?

**A: **

Qb)	How much difference is there between the explanatory power (r2) of this new model and the previous one that used all the running mean data points?

**A: **

Qc)	Using r2 how much of the variation in mean 12-hourly NO2 can we explain by mean 12-hourly wind speed? 

**A: **

Qd) Replace ??? to answer: For every 1 m/s of Wind Speed increase, we would expect a ???% NO2 ??? [increase/decrease]

_Hint: remember the dependent variable is log(NO2) and read [this](http://stats.stackexchange.com/a/18639) CV answer and see Table 2 of Lin et al. (2013)_
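The conversion described in the hint can be sketched as follows. When the dependent variable is a natural log, a slope of b corresponds to a percentage change of (exp(b) - 1) x 100 in the original variable per unit increase in the predictor. The function name `percent_change_per_unit` and the slope value used are hypothetical; substitute the slope from your own model summary.

```python
import math

def percent_change_per_unit(slope):
    """Percentage change in y per unit increase in x when the model is log(y) ~ x."""
    return (math.exp(slope) - 1.0) * 100.0

# Hypothetical slope for illustration only (NOT the answer to Qd)
example_slope = -0.10

pct = percent_change_per_unit(example_slope)
print("{0:.2f}% change per unit increase".format(pct))
```

For small slopes the result is close to slope x 100, but the two diverge as the slope gets larger in magnitude.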

### Multiple Linear Regression

Finally, how about if we wanted to include multiple variables in our model to predict NO2? For example, what if we wanted to include the second-most correlated weather variable with NO2 in our model? 

First we need to check for collinearity between the predictor variables.

#### Task

In the last correlation matrix created here, check if the second-most correlated variable with NO2 meets the general assumption about collinearity (with Wind Speed) highlighted in previous lectures.

Q: If we were to include Wind Speed and the second-most correlated variable with NO2 in a multiple linear regression model to predict log(NO2), would we be violating the assumption of independence between the predictors? Why?

**A: **

Assuming there's no collinearity, we fit and summarise the multiple regression model in a very similar way to our simple linear regression. We just change the equation to include an extra variable:

In [None]:
logNO2_WS_Pr_sixes_mod = sm.OLS.from_formula("logNO2 ~ WindSpeed + Pressure", data = sixes, missing = 'drop')  #use the missing argument with value drop to tell python to ignore missing data
mod_diagnostics(logNO2_WS_Pr_sixes_mod, sixes)
print(logNO2_WS_Pr_sixes_mod.fit().summary())

#### Task

Qa) Do you think the model meets the assumptions of linear regression?

**A: **

Qb) What percentage of variance in logNO2 is explained by this model?

**A: **

Qc) What percentage of variance in logNO2 is explained by Pressure?

**A: **

Qd) Replace ???: If Wind Speed is ???, for every 1 mb of Pressure increase we would expect a ??? % NO2 ??? [increase/decrease]

*Hint: see [this page](http://sites.stat.psu.edu/~ajw13/stat200/mos/12_multregr/12_multregr_print.html) for help*

## Final Project

Think about what the results above tell us about the key weather variables that influence air quality.

Start thinking about your final project and what data you might analyse for it. 

#### References
- AQE (2016) _Air Quality England_ [Online] Available at: http://www.airqualityengland.co.uk/ 
- GLA (2012) Air and noise pollution around a growing Heathrow Airport [Online] Available at: http://www.london.gov.uk/mayor-assembly/london-assembly/publications/tackling-air-and-noise-pollution-around-heathrow 
- Heathrow (2012) Heathrow Air Quality [Online] Available at: http://www.heathrow.com/file_source/Company/Static/PDF/Communityandenvironment/air-quality-strategy_LHR.pdf
- Johnson (2014) Linear Regression with Python [Online] Available at: http://connor-johnson.com/2014/02/18/linear-regression-with-python/
- Lin et al. (2013) Too Big to Fail: Large Samples and the p-Value Problem _Information Systems Research_ 24 906–917 DOI: [10.1287/isre.2013.0480](http://dx.doi.org/10.1287/isre.2013.0480)
- Lumley et al. (2002) _Annu. Rev. Public Health_ 23:151–69 DOI: [10.1146/annurev.publhealth.23.100901.140546](http://dx.doi.org/10.1146/annurev.publhealth.23.100901.140546)