Skip to content

PyBCSession08

Katy Huff edited this page Jan 29, 2012 · 9 revisions

Other Wiki Pages: PyBc, Session01, Session02, Session03, Session04, Session05, Session06, Session07, Session08, Session09, f2py, swig, Cpython, Cython, PyTables, PyTaps, PythonBots, Django, AdvancedPython, WxPython, standardlib,

Statistics in Python with R

Python Boot Camp 2010 - Session 8 - January 13

Presented By: Nico Preston

In this session we will demonstrate how to interact with the R-statistical framework from Python. It is possible to do statistics within Python; however, R has become THE main language for statistics among scientists from numerous disciplines. The ascension of R was catalyzed by widespread participation from the statistics community and adoption by prominent scientists.

If you need convincing check out the New York Times article [[ http://www.nytimes.com/2009/01/07/technology/business-computing/07program.html | Data Analysts Captivated by R’s Power ]]. Cool, eh?

Here you will learn the basic terminology and techniques for using R seamlessly (well almost) from Python. We'll discuss data types and how to move variables form one environment to the other. We'll discuss various ways to call R and how to build custom functions. Lastly we'll demonstrate basic statistics and graphical outputs.

Acknowledgments

This tutorial is a mashup of the excellent documentation found on the [http://rpy.sourceforge.net/rpy2/doc/html/index.html rpy2 project site]

And, Introductroy Statistics with R (Peter Dalgaard - 2008), 2nd ed, Springer

Installation

This tutorial requires Python 2.4+ and R-2.7.0+.

We will be using rpy2 as an interface to R in Python. This can be installed form binaries or source code by followed instructions on the rpy2 sourceforge site: http://rpy.sourceforge.net/rpy2/doc/html/overview.html

Getting help for R (one letter Google blues)

help.search("text") or ?function_name

R-Seek: http://www.rseek.org/

R-Search: http://www.dangoldstein.com/search_r.html

Mailing list archives: http://tolstoy.newcastle.edu.au/R/

Getting started

Once you've installed rpy2, test the installation by loading rpy2:

#!CodeExample
#!python
import rpy2

If you get another prompt without error messages, you're good to go.

Next we will load the library and assign it to the variable 'r' to cut down on typing.

#!CodeExample
#!python
import rpy2.robjects as robjects
r = robjects.r

Note, many of the commands we will be using are now embedded as Python objects within the robjects module:

#!CodeExample
#!python
robjects.r
print(robjects.r)

R: the calculator

In its simplest form R is a tricked-out calculator. Try these simple commands to get started. Don't forget you need to have import robjects.r and assigned it to the variable 'r'.

#!CodeExample
#!python
print r(2+2)
print r['exp'](-2)
print r.rnorm(15)

Syntax

R syntax is somewhat different from Python. One major difference is that everything is stored as a vector. So when you print the variable pi, you see how it can be referenced using python indices (where 0 is position 1).

#!CodeExample
#!python
print r('pi')
print r('pi') + 2
print r('pi')[0] + 2

When you added '2', it was stored in the second position of the vector.

Remember 'r' stands in the place of robjects.r, if you're not convinced try this:

#!CodeExample
#!python
print robjects.r('pi')

Convinced?

Calling R

R can be called in one of three ways depending on the situation. For the previous example you could access 'pi' with either of the three following methods:

#!CodeExample
#!python
print r('pi') #method 1
print r['pi'] #method 2
print r.pi    #method 3

Why? Isn't this redundant?

Each of these calls if different:

  • Method 1: sends a string to R. So the word 'pi' is sent to the R command line for evaluation. This is often the easiest way to translate your code into Python; however, for big calculations and frequent interactivity with Python it can become inefficient.
  • Method 2: uses __get.item__() is equivalent to calling the variable from the R command line. This if often the preferred method.
  • Method 3: retrieves the 'pi' object.

As you work more with Rpy2 you will choose the method suited to the circumstance. In this tutorial I'll use the "( )" notation because we're focused on learning Python and just sending some strings to R.

Basic plot

NOTE: Avoid closing the graphing window during these examples. It can cause R to hang. If you do by mistake, simply restart your python session.

Here is a basic plot built with random normal (rnorm) numbers:

#!CodeExample
#!python
r('x <- rnorm(1000)') # generate x in R
r('plot(x)')

Troubleshoot: If you have trouble seeing the plot (wait a minute), then activate X11 first with:

#!CodeExample
#!python
r.X11()

Example: Body Mass Index (BMI)

Here we will calculate some summary statistics for Body Mass Index then test the hypothesis that the sample group are typical pear-shaped programmers.

Load variables

We need to load the variables from python using the convenience classes from robjects. Here we use !FloatVector, the other options are !IntVector, !BoolVector, and !StrVector. Alternatively, we could load them with an rstring (e.g. r('c(60, 72, 57, 90, 95, 72')); however, they would then be exclusively in the R environment. In this example we use !FloatVector then load them into the R environment with the globalEnv command.

#!CodeExample
#!python
weight = robjects.FloatVector([60, 72, 57, 90, 95, 72])
height = robjects.FloatVector([1.75, 1.8, 1.65, 1.9, 1.74, 1.91])
robjects.globalEnv["weight"] = weight
robjects.globalEnv["height"] = height

Understanding the Global Environment

globalEnv = robjects.globalEnv
print robjects.r.ls(globalEnv)

Calculate body mass index (bmi)

Now we will calculate bmi using a string to be evaluated in R. Note the variables were loaded into the R environment in the previous step.

#!CodeExample
#!python
bmi = r('weight/height^2')
print bmi

Calculate sample mean weight

Now, some summary statistics, lets calculate the mean weight.

#!CodeExample
#!python
print r('sum(weight)')
print r('length(weight)')
xbar = r('sum(weight)/length(weight)')
robjects.globalEnv["xbar"] = xbar

Next, in the series, we'll characterize the dataset by calculating the standard deviation

#!CodeExample
#!python
step_0 = r('weight-xbar')
step_1 = r('(weight-xbar)^2')
step_2 = r('sum((weight-xbar)^2)')
stdev = r('sqrt(sum((weight-xbar)^2)/(length(weight)-1))')
print stdev

Of course, the beauty of R is that all these simple calculations can be done in one step in R.

#!CodeExample
#!python
xbar2 = r('mean(weight)')
stdev2 = r('sd(weight)')
print stdev2

Hands-on Example

OK big deal, eh? We can do the same thing in Numpy. Take a minute to perform this calculation in Numpy using your [http://hackerwithin.org/cgi-bin/hackerwithin.fcgi/wiki/[[Session05#UsingNumPy notes from this morning].|PyBCSession05#UsingNumPy notes from this morning].]]

How do the answers compare?

... pause for statistically nerdy reflection. ''Hint'': you have a ''sample'' of 6 individuals.

Conclusion (I hope): both languages are awesome. R is a language for statistics and it does that really really well. Python is general scripting language that does that really really well. Combined they're unstoppable, imaging the New York Times header now ... ah ... Python allows you to adopt "best of breed" programming practices. You can do almost everything with it, then bring in specialized tools when you need them.

Indeed these calculations are simple built in functions. The functions available to you is seemingly limitless. You can download, open, and customize the functions you need for your research. This openness has been one of the major draws of R for scientific statistics. Check out the list of [http://cran.r-project.org/web/packages/ contributed packages] to find the R package for your favorite statistic.

Hypothesis testing

OK, so we've got some nice summary statistics. Now lets do something more scientific and test a hypothesis. In this case, well test whether this sample group falls within the range of doctor recommended BMIs (20 - 25). So the mean recommended BMI is 22.5, in other words you should weigh approximately 22.5 times your height.

Let's test whether our sample group fall within that range.

  • The null hypothesis (HO) is that they do.
  • The alternative hypothesis (HA) is that they do not.

We'll use a simple t-test. This is a metric for determining whether a sample is significantly different from the mean. The default is a mean of 0, in this case we test whether they differ from from mean, or mu, 22.5:

#!CodeExample
#!python
robjects.globalEnv["bmi"] = bmi
t=test = r('t.test(bmi, mu = 22.5)')
print t=test

Ok, so they do not. Looks like Python programmers are not "pear-shaped". Must be because their scripting language is so efficient that they have time left over to get outside and do other stuff.

Plot the data

No analysis is complete without a picture. Lets make a simple graphic.

#!CodeExample
#!python
#from Python-space variables
r.plot(height,weight)
#OR from R-space variables r('plot(height,weight)')

R is capable of generating publication-ready graphics. You can customize every facet of the layout and the format of the output. Here we'll simply change the shape of the points for demonstration purposes:

#!CodeExample
#!python
#customize the plot
#change the point shape
r.plot(height,weight, pch=2)

Now we'll add a curve of expected values, i.e. the doctor-recommended height weight relationship. Although are values are not significantly different from expected values, they do have some variability so we'll start a new vector. Also, the relationship between height and weight is quadratic (squared) so we need to generalize with a simple linear relationship. If you look closely the lines form a slight curve.

#!CodeExample
#!python
hh = robjects.FloatVector([1.65, 1.70, 1.75, 1.80, 1.85, 1.90])
robjects.globalEnv["hh"] = hh
r('lines(hh,22.5*hh^2)')

Some other plotting examples

Nobody suspects the statistics inquisition. So here are a bunch in no particular order. These ought'a show you what RPy2 can do.

Linear model

#!CodeExample
#!python
import rpy2.robjects as robjects
r = robjects.r
x = robjects.IntVector(range(10))
y = r.rnorm(10)
r.X11()
r.layout(r.matrix(robjects.IntVector([1,2,3,2]),
nrow= 2, ncol = 2))
r.plot(r.runif(10), y, xlab= "runif", ylab = "foo/bar", col="red")
args = [x, y]
kwargs = {'ylab':"foo/bar", 'type':"b", 'col':"blue", 'log':"x"}
r.plot(\*args, \**kwargs)

Riddle-me that error message.

Analysis of variance

#!CodeExample
#!python
import rpy2.robjects as robjects

r = robjects.r

ctl = robjects.FloatVector([4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14])
trt = robjects.FloatVector([4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69])
group = r.gl(2, 10, 20, labels = ["Ctl","Trt"])
weight = ctl + trt

robjects.globalEnv["weight"] = weight
robjects.globalEnv["group"] = group
lm_D9 = r.lm("weight ~ group")
print(r.anova(lm_D9))

lm_D90 = r.lm("weight ~ group - 1")
print(r.summary(lm_D90))

print(lm_D9.names)
print(lm_D9.r['coefficients'])



Now try to extract the slope only.

Build a function

#!CodeExample
#!python
r(**
        f <- function(r) { 2 * pi * r }
        f(3)
        \**)
robjects.globalEnv['f']

print r['f']

Create a dataframe

#!CodeExample
#!python
d = {'value': robjects.IntVector((1,2,3)),
         'letter': robjects.StrVector(('x', 'y', 'z'))}
dataf = r['data.frame'](\**d)
print dataf

Killer Rabbit of Caerbannog

"A terrible monster with nasty, big, pointy teeth!"

Install Onion Package

Launch R

Linux: install.packages("onion")

Mac: Open Packages & Data > Package Installer type "onion" click Get list click Install Selected

Win: Packages -> Install package Select 'onion'

Return to iPython

#!CodeExample
#!python
r.library('onion')
r.data('bunny')
r('p3d(bunny,theta= 3,phi = 104,box=FALSE)')

P.S. ignore the error messages

"Run away!"

Import to Numpy

#!CodeExample
#!python
import numpy
weight = robjects.FloatVector([60, 72, 57, 90, 95, 72])
w_numpy = numpy.array(weight)
print w_numpy
print w_numpy.dtype
w_numpy.std()