FAQ

Installation

To install Pingouin, open a command prompt (or Terminal or Anaconda Prompt) and type:

pip install pingouin --upgrade

You should now be able to use Pingouin. To try it, you need to open an interactive Python console (either IPython or Jupyter). For example, type the following command in a command prompt:

ipython

Now, let's do a simple paired T-test using Pingouin:

import pingouin as pg
# Create two variables
x = [4, 6, 5, 7, 6]
y = [2, 2, 3, 1, 2]
# Run a T-test
pg.ttest(x, y, paired=True)

There are two ways to import Pingouin functions:

# 1) Import the full package
# --> Best if you are planning to use several Pingouin functions.
import pingouin as pg
pg.ttest(x, y)

# 2) Import specific functions
# --> Best if you are planning to use only this specific function.
from pingouin import ttest
ttest(x, y)

Statsmodels is a great statistical Python package that provides several advanced functions (regression, GLM, time-series analysis) as well as an R-like syntax for fitting models. However, statsmodels can be hard to grasp for Python beginners and for users who just want to perform simple statistical tests. The goal of Pingouin is not to replace statsmodels but rather to provide easy-to-use functions for the most widely used statistical tests. In addition, Pingouin provides some novel functions (to name but a few: effect sizes, pairwise T-tests and correlations, ICC, repeated measures correlation, circular statistics...).

The scipy.stats module provides several low-level statistical functions. However, most of these functions do not return a very detailed output (e.g. only the T- and p-values for a T-test). Most Pingouin functions build on these low-level SciPy functions to provide a richer, more exhaustive output. See for yourself:

import pingouin as pg
from scipy.stats import ttest_ind

x = [4, 6, 5, 7, 6]
y = [2, 2, 3, 1, 2]

print(pg.ttest(x, y))   # Pingouin: returns a DataFrame with T-value, p-value, degrees of freedom, tail, Cohen d, power and Bayes Factor
print(ttest_ind(x, y))  # SciPy: returns only the T- and p-values
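
To make concrete what the extra values in a richer output look like, here is a minimal sketch that recomputes three of them by hand for the same data, using only the Python standard library. It assumes an independent two-sample T-test with equal variances and the pooled-standard-deviation definition of Cohen's d. It only illustrates the quantities and is not Pingouin's actual implementation.

```python
import math
import statistics

x = [4, 6, 5, 7, 6]
y = [2, 2, 3, 1, 2]

nx, ny = len(x), len(y)
mean_diff = statistics.mean(x) - statistics.mean(y)

# Pooled variance of the two samples (assumption: equal variances)
pooled_var = (
    (nx - 1) * statistics.variance(x) + (ny - 1) * statistics.variance(y)
) / (nx + ny - 2)
pooled_sd = math.sqrt(pooled_var)

# T-value and degrees of freedom of the independent two-sample T-test
t_value = mean_diff / (pooled_sd * math.sqrt(1 / nx + 1 / ny))
dof = nx + ny - 2

# Cohen's d: standardized mean difference based on the pooled SD
cohen_d = mean_diff / pooled_sd

print(f"T = {t_value:.3f}, dof = {dof}, Cohen d = {cohen_d:.3f}")
```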

Data

You need to use the :py:func:`pandas.read_csv` or :py:func:`pandas.read_excel` functions:

import pandas as pd
pd.read_csv('myfile.csv')     # Load a .csv file
pd.read_excel('myfile.xlsx')  # Load an Excel file

Pingouin hates missing values as much as you do!

Most Pingouin functions automatically remove missing values. In the case of paired measurements (e.g. paired T-test, correlation, or repeated measures ANOVA), a listwise deletion of missing values is performed, meaning that the entire row is removed. This is generally the best strategy if you have a large sample size and only a few missing values. However, it can be quite drastic if your data contain a lot of missing values. In that case, it might be useful to look at imputation methods (see the Pandas documentation), or to use a linear mixed-effects model instead, which natively supports missing values. Note that the latter is not implemented in Pingouin.
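
As an illustration of what listwise deletion means, the following sketch applies it manually with pandas to a small, made-up paired dataset (the column names and values are hypothetical). It shows the general idea only, not the code Pingouin runs internally.

```python
import numpy as np
import pandas as pd

# Hypothetical paired data: one row per subject, NaN marks a missing value
df = pd.DataFrame({
    "Pre": [2.5, 4.2, np.nan, 3.3],
    "Post": [3.1, np.nan, 2.9, 3.8],
})

# Listwise deletion: drop every row that has at least one missing value
complete = df.dropna()

print(f"{len(df)} rows before, {len(complete)} rows after listwise deletion")
```

Here half of the subjects are lost even though only two individual values are missing, which is why listwise deletion can be drastic on small or sparse datasets.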

In wide format, each row represents a subject, and each column a measurement (e.g. "Pre", "Post"). This is the most convenient way for humans to look at repeated measurements. It typically results in a spreadsheet with more columns than rows. An example of a wide-format dataframe is shown below:

=======  ===  ====  ======  ===
Subject  Pre  Post  Gender  Age
=======  ===  ====  ======  ===
1        2.5  3.1   M       24
2        4.2  4.8   F       32
3        2.5  2.9   F       38
=======  ===  ====  ======  ===

In long format, each row is one time point per subject and each column is a variable (e.g. one column with the "Subject" identifier, another with the "Scores" and another with the "Time" grouping factor). In long format, there are usually many more rows than columns. While this is harder for humans to read, it is much easier for computers to process. For this reason, all the repeated measures functions in Pingouin work only with long-format dataframes. In the example below, the wide-format dataframe from above was converted into a long-format dataframe:

=======  ======  ===  ====  ======
Subject  Gender  Age  Time  Scores
=======  ======  ===  ====  ======
1        M       24   Pre   2.5
1        M       24   Post  3.1
2        F       32   Pre   4.2
2        F       32   Post  4.8
3        F       38   Pre   2.5
3        F       38   Post  2.9
=======  ======  ===  ====  ======

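The Pandas package provides some convenient functions to convert from one format to the other, such as :py:func:`pandas.melt` (wide to long) and :py:meth:`pandas.DataFrame.pivot` (long to wide).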
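
As a minimal sketch of such a conversion, the example below rebuilds the wide-format table from above and reshapes it with pandas. The variable names are illustrative, and the choice of `melt` and `pivot` is one reasonable option among several that pandas offers.

```python
import pandas as pd

# Wide-format dataframe from the example above
wide = pd.DataFrame({
    "Subject": [1, 2, 3],
    "Pre": [2.5, 4.2, 2.5],
    "Post": [3.1, 4.8, 2.9],
    "Gender": ["M", "F", "F"],
    "Age": [24, 32, 38],
})

# Wide -> long: stack the "Pre" and "Post" columns into "Time" / "Scores"
long_df = pd.melt(
    wide,
    id_vars=["Subject", "Gender", "Age"],
    value_vars=["Pre", "Post"],
    var_name="Time",
    value_name="Scores",
)
# Group the rows by subject, keeping "Pre" before "Post" (stable sort)
long_df = long_df.sort_values("Subject", kind="stable").reset_index(drop=True)
print(long_df)

# Long -> wide: one row per subject, one column per time point
wide_again = long_df.pivot(index="Subject", columns="Time", values="Scores")
print(wide_again)
```

Note that `pivot` keeps only the columns it is given, so "Gender" and "Age" would have to be merged back in if they are needed in the wide table.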

No, the central idea behind Pingouin is that all data manipulations and descriptive statistics should be first performed in Pandas (or NumPy). For example, to compute the mean, standard deviation, and quartiles of all the numeric columns of a pandas DataFrame, one can easily use the :py:meth:`pandas.DataFrame.describe` method:

data.describe()
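
For a slightly fuller picture, here is a short sketch of descriptive statistics done entirely in pandas, assuming `data` is the long-format dataframe shown earlier in this FAQ. The per-group summary via `groupby` is an added illustration, not something prescribed by Pingouin.

```python
import pandas as pd

# Long-format dataframe from the example above
data = pd.DataFrame({
    "Subject": [1, 1, 2, 2, 3, 3],
    "Gender": ["M", "M", "F", "F", "F", "F"],
    "Age": [24, 24, 32, 32, 38, 38],
    "Time": ["Pre", "Post", "Pre", "Post", "Pre", "Post"],
    "Scores": [2.5, 3.1, 4.2, 4.8, 2.5, 2.9],
})

# Count, mean, standard deviation and quartiles of all numeric columns
summary = data.describe()
print(summary)

# Mean and standard deviation of the scores at each time point
by_time = data.groupby("Time")["Scores"].agg(["mean", "std"])
print(by_time)
```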

Others

Pingouin is licensed under the GNU General Public License v3.0 (GPL-3), which is less permissive than the BSD or MIT licenses. The reason for this is that Pingouin borrows extensively from R packages, which are all licensed under the GPL-3. To read more about what you can and cannot do with a GPL-3 license, please visit tldrlegal.com or choosealicense.com.

Pingouin uses outdated, a Python package that automatically checks upon loading whether a newer version of Pingouin is available. Alternatively, you can click "Watch" on Pingouin's GitHub repository.


Whenever a new release is available, you can upgrade your version by typing the following line in a terminal window:

pip install --upgrade pingouin
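
To confirm which version is installed after upgrading, a small sketch using only the Python standard library is given below. The helper name `installed_version` is hypothetical and not part of Pingouin.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package="pingouin"):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the installed Pingouin version, or None if it is not installed
print(installed_version())
```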

There are many ways to contribute to Pingouin, even if you are not a programmer: for example, reporting bugs or results that are inconsistent with other statistical software, improving the documentation and examples, or even buying the developers a coffee!

To cite Pingouin, please use the publication in JOSS:

BibTeX:

@ARTICLE{Vallat2018,
  title    = "Pingouin: statistics in Python",
  author   = "Vallat, Raphael",
  journal  = "The Journal of Open Source Software",
  volume   =  3,
  number   =  31,
  pages    = "1026",
  month    =  nov,
  year     =  2018
}