Working on documentation for new methods, pushing since I'm taking lunch
Corey-Bryant committed Mar 10, 2022
1 parent 3f616b4 commit 3ee235b
Showing 6 changed files with 187 additions and 17 deletions.
4 changes: 2 additions & 2 deletions source/anova_documentation.rst
@@ -32,7 +32,7 @@ anova methods

* **return_type** : The type of data structure the results should be returned as. Supported options are 'Dataframe' which will return a Pandas DataFrame or 'Dictionary' which will return a dictionary.
* **decimals** : The number of decimal places the data should be rounded to.
* **pretty_format ** : If pretty formatting should be applied. This adds extra empty spaces in the returned data structure for visualization of the results.
* **pretty_format** : If pretty formatting should be applied. This adds extra empty spaces in the returned data structure for visualization of the results.

* **regression_table(return_type = "Dataframe", decimals = 4, conf_level = 0.95)**

@@ -93,7 +93,7 @@ called 'systolic'.

.. code:: python
import researchpy as rp
import researchpy as rp
import pandas as pd
# Used to load example data #
import statsmodels.datasets
15 changes: 7 additions & 8 deletions source/difference_test_documentation.rst
@@ -201,12 +201,11 @@ that N is the total number of observations.
Rank-Biserial correlation coefficient r (between or within subjects design)
""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""
The following formula is used to calculate the Rank-Biserial
correlation coefficient r using the W-value and N. This formula
is used to calculate the r coefficient for the Wilcoxon ranked-sign test.
correlation coefficient r :cite:`Kerby2012` for the Wilcoxon signed-rank test.

.. math::
\text{Rank-Biserial r} = \frac{W}{\sum{\text{rank}}}
\text{Rank-Biserial r = } \frac{\sum{Ranks}_{+} - \sum{Ranks}_{-}}{\sum{Ranks}_{total}}
@@ -274,7 +273,7 @@ Now the data is in the correct structure.
# If you don't store the 2 returned DataFrames, it outputs as a tuple and
# is displayed
difference_test("StressReactivity ~ C(Exercise)",
rp.difference_test("StressReactivity ~ C(Exercise)",
data = df2,
equal_variances = True,
independent_samples = True).conduct(effect_size = "all")
@@ -302,7 +301,7 @@ Now the data is in the correct structure.
.. code:: python
# Otherwise you can store them as objects
summary, results = difference_test("StressReactivity ~ C(Exercise)",
summary, results = rp.difference_test("StressReactivity ~ C(Exercise)",
data = df2,
equal_variances = True,
independent_samples = True).conduct(effect_size = "all")
@@ -343,7 +342,7 @@ Now the data is in the correct structure.
.. code:: python
# Paired samples t-test
summary, results = difference_test("StressReactivity ~ C(Exercise)",
summary, results = rp.difference_test("StressReactivity ~ C(Exercise)",
data = df2,
equal_variances = True,
independent_samples = False).conduct(effect_size = "all")
@@ -383,7 +382,7 @@ Now the data is in the correct structure.
.. code:: python
# Welch's t-test
summary, results = difference_test("StressReactivity ~ C(Exercise)",
summary, results = rp.difference_test("StressReactivity ~ C(Exercise)",
data = df2,
equal_variances = False,
independent_samples = True).conduct(effect_size = "all")
@@ -424,7 +423,7 @@ Now the data is in the correct structure.
.. code:: python
# Wilcoxon signed-rank test
summary, results = difference_test("StressReactivity ~ C(Exercise)",
summary, results = rp.difference_test("StressReactivity ~ C(Exercise)",
data = df2,
equal_variances = False,
independent_samples = False).conduct(effect_size = "r")
26 changes: 25 additions & 1 deletion source/refs.bib
@@ -1,5 +1,29 @@
% Encoding: UTF-8
@article{Kerby2012,
author = {Dave S. Kerby},
title = {The Simple Difference Formula: An Approach to Teaching Nonparametric Correlation},
journal = {Innovative Teaching},
year = {2014},
volume = {3},
number = {1},
note = {10.2466/11.IT.3.1}
}

@article{Fritz_Morris_Richler2012,
author = {Catherine O. Fritz and Peter E. Morris and Jennifer J. Richler},
title = {Effect Size Estimates: Current Use, Calculations, and Interpretation},
journal = {Journal of Experimental Psychology: General},
year = {2012},
volume = {141},
number = {1},
pages = {2-18},
note = {10.1037/a0024338}
}



@book{grissomkim2012,
title = {Effect Sizes for Research: Univariate and Multivariate Applications},
publisher = {Routledge},
@@ -63,7 +87,7 @@ @inbook{hedges1985
note = {10.2307/1164953}
}

@book{ kline2004,
@book{kline2004,
title = {Beyond significance testing: Reforming data analysis methods in behavioral research},
publisher = {American Psychological Association},
year = {2004},
147 changes: 147 additions & 0 deletions source/signrank_documentation.rst
@@ -0,0 +1,147 @@
*************
signrank()
*************

Description
===========
Conducts the Wilcoxon signed-rank test for paired-sample data. Data can be entered using the formula_like structure,
or by passing two array-like structures. Both approaches are demonstrated in the examples.
The results can be returned as a Pandas DataFrame object (default) or as a Python
dictionary object.

The data model is passed to *signrank*, and the *conduct* method must then be called. This
method returns three data objects within a tuple.



Parameters
==========

Input
-----
**signrank(formula_like = None, data = {}, group1 = None, group2 = None, zero_method = "pratt", correction = False, mode = "auto")**

* **formula_like** : A valid formula which will parse the data into a design matrix.
* **data** : The dataframe which contains the data to be analyzed; required if using *formula_like*.
* **group1** : An array-like object which contains data for the paired sample.
* **group2** : An array-like object which contains data for the paired sample.
* **zero_method** : How to handle the zero-differences in the ranking process. Available options are (from :cite:`scipy_wilcoxon`):

  * *"pratt"* : Includes zero-differences in the ranking process, but drops the ranks of the zeros (default).
  * *"wilcox"* : Discards all zero-differences.

* **correction** : Boolean value indicating if the continuity correction should be applied; see :cite:`scipy_wilcoxon` for more information.
* **mode** : Method used to calculate the p-value; see :cite:`scipy_wilcoxon` for more information. Options are:

  * *"auto"* : Use the exact distribution if there are no more than 25 observations and no ties; otherwise a normal approximation will be used (default).
  * *"exact"* : Use the exact distribution; can be used if there are no more than 25 observations and no ties.
  * *"approx"* : Use a normal approximation.
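
To make the two ``zero_method`` options concrete, the following is a minimal sketch in plain Python, not researchpy's or SciPy's implementation. The helper name ``rank_sums`` is hypothetical, and the input is limited to the first five rows of the 'fuel' data displayed in the Examples section. It ranks the absolute paired differences (average ranks for ties) and sums the ranks by sign.

```python
def rank_sums(x, y, zero_method="pratt"):
    """Return (positive, negative, zero) rank sums of the paired differences x - y.

    Hypothetical helper for illustration only.
    """
    diffs = [a - b for a, b in zip(x, y)]
    if zero_method == "wilcox":
        # "wilcox": discard zero-differences before ranking.
        diffs = [d for d in diffs if d != 0]
    elif zero_method != "pratt":
        raise ValueError("zero_method must be 'pratt' or 'wilcox'")

    # Rank the absolute differences; tied values share their average rank.
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and abs(diffs[order[end + 1]]) == abs(diffs[order[start]])):
            end += 1
        average_rank = (start + end) / 2 + 1  # ranks are 1-based
        for position in range(start, end + 1):
            ranks[order[position]] = average_rank
        start = end + 1

    # "pratt": zeros were ranked above, but their ranks are kept out of
    # the positive and negative sums.
    positive = sum(r for d, r in zip(diffs, ranks) if d > 0)
    negative = sum(r for d, r in zip(diffs, ranks) if d < 0)
    zero = sum(r for d, r in zip(diffs, ranks) if d == 0)
    return positive, negative, zero


# First five pairs of the 'fuel' data as displayed in the Examples section.
mpg1 = [20, 23, 21, 25, 18]
mpg2 = [24, 25, 21, 22, 23]

pratt = rank_sums(mpg1, mpg2, zero_method="pratt")
wilcox = rank_sums(mpg1, mpg2, zero_method="wilcox")
print("pratt :", pratt)
print("wilcox:", wilcox)
```

With *"pratt"* the single zero-difference takes the lowest rank, which shifts every other rank up by one compared with *"wilcox"*, where that pair is removed before ranking.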


Returns
-------
Returns an object with class "signrank"; this object has an accessible method which is described below.

signrank methods
^^^^^^^^^^^^^^^^

* **conduct(return_type = "Dataframe", effect_size = [])**

  * **return_type** : The type of data structure the results should be returned as. Supported options are 'Dataframe' which will return a Pandas DataFrame or 'Dictionary' which will return a dictionary.
  * **effect_size** : A list object which indicates which effect size measures should be calculated. Available options are:

    * *pd* : Calculates the Rank-Biserial r coefficient.
    * *pearson* : Calculates the Pearson r coefficient.

After using the conduct method three objects will be returned within a tuple. The first object
provides descriptive information regarding the ranks, the second object contains the adjustment information,
and the third object contains the test results.



Effect size measures formulas
=============================
By default no effect size measures are calculated.

Rank-Biserial r :cite:`Kerby2012`
------------------------------------------------

.. math::

    \text{Rank-Biserial r = } \frac{\sum{Ranks}_{+} - \sum{Ranks}_{-}}{\sum{Ranks}_{total}}

Pearson r :cite:`Fritz_Morris_Richler2012`
------------------------------------------

.. math::

    \text{Pearson r = } \frac{Z}{\sqrt{N}}

Where N is the total number of observations included in the model.
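
As a hedged illustration of the two formulas, the sketch below plugs in the figures reported in the wide-structure example of this document (rank sums of 13.5, 63.5 and 78.0, z = -1.9726, 12 observations). The function names are hypothetical. Using the "all" row as the total rank sum and 12 as N are assumptions about how the inputs map onto the formulas, not confirmed researchpy behaviour.

```python
import math


def rank_biserial_r(sum_positive, sum_negative, sum_total):
    """Rank-Biserial r: difference of the signed rank sums over the total rank sum."""
    return (sum_positive - sum_negative) / sum_total


def pearson_r(z, n):
    """Pearson r: the z statistic divided by the square root of N."""
    return z / math.sqrt(n)


# Values copied from the example output tables in this document.
# Assumption: the "all" row (78.0) is the total rank sum and N = 12.
rb = rank_biserial_r(sum_positive=13.5, sum_negative=63.5, sum_total=78.0)
pr = pearson_r(z=-1.9726, n=12)

print(f"Rank-Biserial r: {rb:.4f}")
print(f"Pearson r:       {pr:.4f}")
```

Both coefficients come out negative here because the negative ranks dominate, which matches the sign of the reported z statistic.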



Examples
========

Loading Packages and Data
-------------------------
First, load the libraries required for this example. Below, an example data set is loaded
using statsmodels.datasets; it is the data set called 'fuel', which is available
through Stata.

.. code:: python

    import researchpy as rp
    import pandas as pd

    # Used to load example data #
    import statsmodels.datasets

    fuel = statsmodels.datasets.webuse('fuel')
    fuel["id"] = range(1, fuel.shape[0] + 1)
    fuel.head()

.. raw:: html

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>mpg1</th> <th>mpg2</th> <th>id</th> </tr> </thead> <tbody> <tr> <td>20.0000</td> <td>24.0000</td> <td>1</td> </tr> <tr> <td>23.0000</td> <td>25.0000</td> <td>2</td> </tr> <tr> <td>21.0000</td> <td>21.0000</td> <td>3</td> </tr> <tr> <td>25.0000</td> <td>22.0000</td> <td>4</td> </tr> <tr> <td>18.0000</td> <td>23.0000</td> <td>5</td> </tr> </tbody></table>


The data is currently in a wide structure, where the columns mpg1 and mpg2 each hold a value for the same ID. This format
is supported by signrank. The long format structure is also supported through the *formula_like* approach; to have
the data ready for this demonstration section, the transformation is conducted here.

.. code:: python

    # pandas was imported as pd above
    fuel2 = pd.melt(fuel, id_vars = "id",
                    value_vars = ["mpg1", "mpg2"],
                    var_name = "mpg")
    fuel2.head()

.. raw:: html

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>id</th> <th>mpg</th> <th>value</th> </tr> </thead> <tbody> <tr> <td>1</td> <td>mpg1</td> <td>20.0000</td> </tr> <tr> <td>2</td> <td>mpg1</td> <td>23.0000</td> </tr> <tr> <td>3</td> <td>mpg1</td> <td>21.0000</td> </tr> <tr> <td>4</td> <td>mpg1</td> <td>25.0000</td> </tr> <tr> <td>5</td> <td>mpg1</td> <td>18.0000</td> </tr> </tbody></table>



Signrank using Wide Structure Datasets
---------------------------------------
Since the test returns three data objects, this demonstration assigns each data object to a variable. This is not required, but
it makes the output look cleaner.

.. code:: python

    desc, var_adj, res = rp.signrank(group1 = fuel.mpg1, group2 = fuel.mpg2).conduct()
    print(desc, var_adj, res, sep = "\n"*2)

.. raw:: html

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>sign</th> <th>obs</th> <th>sum ranks</th> <th>expected</th> </tr> </thead> <tbody> <tr> <td>positive</td> <td>3</td> <td>13.5000</td> <td>38.5000</td> </tr> <tr> <td>negative</td> <td>8</td> <td>63.5000</td> <td>38.5000</td> </tr> <tr> <td>zero</td> <td>1</td> <td>1.0000</td> <td>1.0000</td> </tr> <tr> <td>all</td> <td>12</td> <td>78.0000</td> <td>78.0000</td> </tr> </tbody></table>

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>unadjusted variance</th> <th>adjustment for ties</th> <th>adjustment for zeros</th> <th>adjusted variance</th> </tr> </thead> <tbody> <tr> <td>162.5000</td> <td>-1.6250</td> <td>-0.2500</td> <td>160.6250</td> </tr> </tbody></table>

<table border="1" class="dataframe"> <thead> <tr style="text-align: right;"> <th>z</th> <th>w</th> <th>pval</th> </tr> </thead> <tbody> <tr> <td>-1.9726</td> <td>13.5000</td> <td>0.0485</td> </tr> </tbody></table>
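
The three tables fit together arithmetically. The sketch below is a plausibility check using the standard normal-approximation formulas for the signed-rank statistic. It is an assumption that researchpy computes the values this way. The tie adjustment is taken as reported, because the full data needed to derive it is not displayed here.

```python
import math

# Values read from the tables above.
n = 12                     # "obs" in the "all" row
n_zero = 1                 # "obs" in the "zero" row
sum_positive = 13.5        # "sum ranks" in the "positive" row
expected_positive = 38.5   # "expected" in the "positive" row
adjustment_ties = -1.6250  # taken as reported, not derived here

# Variance of the positive rank sum under the null hypothesis.
unadjusted_variance = n * (n + 1) * (2 * n + 1) / 24

# Zero-differences contribute no sign information, so their share is removed.
adjustment_zeros = -n_zero * (n_zero + 1) * (2 * n_zero + 1) / 24

adjusted_variance = unadjusted_variance + adjustment_ties + adjustment_zeros

# Standardise the positive rank sum against its expected value.
z = (sum_positive - expected_positive) / math.sqrt(adjusted_variance)

print(f"unadjusted variance:  {unadjusted_variance:.4f}")
print(f"adjustment for zeros: {adjustment_zeros:.4f}")
print(f"adjusted variance:    {adjusted_variance:.4f}")
print(f"z:                    {z:.4f}")
```

Under these assumptions the computed values agree with the reported unadjusted variance, zero adjustment, adjusted variance and z statistic to four decimal places.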


If one does not assign each object to a variable, the output is still readable.

.. code:: python

    rp.signrank(group1 = fuel.mpg1, group2 = fuel.mpg2).conduct()

.. raw:: literal
8 changes: 4 additions & 4 deletions source/summarize_documentation.rst
@@ -94,7 +94,7 @@ First demonstration will show how to get descriptive statistics for a single var

.. code:: python
summarize(auto.price)
rp.summarize(auto.price)
.. raw:: html

@@ -108,7 +108,7 @@ Now let's get information from 2 variables at the same time.

.. code:: python
summarize(auto[["price", "mpg"]])
rp.summarize(auto[["price", "mpg"]])
.. raw:: html

@@ -126,7 +126,7 @@ Pandas Series Groupby Object

.. code:: python
summarize(auto.groupby("foreign")["price"])
rp.summarize(auto.groupby("foreign")["price"])
.. raw:: html
@@ -139,7 +139,7 @@ Pandas Dataframe Groupby Object

.. code:: python
summarize(auto.groupby(["foreign"])[["price", "mpg"]])
rp.summarize(auto.groupby(["foreign"])[["price", "mpg"]])
.. raw:: html

4 changes: 2 additions & 2 deletions source/ttest_documentation.rst
@@ -180,12 +180,12 @@ that N is the total number of observations.
Rank-Biserial correlation coefficient r (between or within subjects design)
---------------------------------------------------------------------------
The Rank-Biserial r is also provided for the Wilcoxon signed-rank test as is
The Rank-Biserial r :cite:`Kerby2012` is also provided for the Wilcoxon signed-rank test and is
calculated as:

.. math::
\text{Rank-Biserial r} = \frac{W}{\sum{\text{rank}}}
\text{Rank-Biserial r = } \frac{\sum{Ranks}_{+} - \sum{Ranks}_{-}}{\sum{Ranks}_{total}}