# Compare proportions

The compare proportions test evaluates whether the frequency of an event, behavior, intention, etc. differs across groups. Under the null hypothesis, the difference in proportions between groups in the population is zero.
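
For two groups, one common large-sample way to write this test is sketched below. The symbols $x_i$ (number of occurrences of the event in group $i$) and $n_i$ (group size) are introduced here for illustration and do not appear elsewhere in this notebook.

$$
H_0: p_1 - p_2 = 0, \qquad \hat{p}_i = \frac{x_i}{n_i}, \qquad \hat{p} = \frac{x_1 + x_2}{n_1 + n_2}
$$

$$
z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}\,(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}, \qquad z^2 = \chi^2 \quad (\text{df} = 1)
$$

The squared statistic $z^2$ equals the Pearson chi-square statistic (without continuity correction) for the 2x2 table of counts, so a two-sided test of two proportions can be run as a chi-square test on that table.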

In [1]:
import matplotlib as mpl
import pyrsm as rsm

# increase plot resolution
mpl.rcParams["figure.dpi"] = 150

In [2]:
rsm.load_data(pkg="data", name="titanic", dct=globals())

In [3]:
rsm.describe(titanic)

## Titanic

This dataset describes the survival status of individual passengers on the Titanic. The titanic data frame does not contain information from the crew, but it does contain actual ages of (some of) the passengers. The principal source for data about Titanic passengers is the Encyclopedia Titanica. One of the original sources is Eaton & Haas (1994) Titanic: Triumph and Tragedy, Patrick Stephens Ltd, which includes a passenger list created by many researchers and edited by Michael A. Findlay.

## Variables

* survived - Survival (Yes, No)
* pclass - Passenger Class (1st, 2nd, 3rd)
* sex - Sex (female, male)
* age - Age in years
* sibsp - Number of Siblings/Spouses Aboard
* parch - Number of Parents/Children Aboard
* fare - Passenger Fare
* name - Name
* cabin - Cabin
* embarked - Port of Embarkation (Cherbourg, Queenstown, Southampton)

## Notes

`pclass` is a proxy for socio-economic status (SES): 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.

Age is in years; it is fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

With respect to the family relation variables (i.e., `sibsp` and `parch`), some relations were ignored. The following definitions are used for `sibsp` and `parch`:

* Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
* Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
* Parent: Mother or Father of Passenger Aboard Titanic
* Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic

Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children travelled only with a nanny, so `parch` = 0 for them. Others travelled with very close friends or neighbors from their village, but the definitions do not cover such relations.

Note: Missing values and the `ticket` variable were removed from the data.

## Related reading

<a href="http://phys.org/news/2012-07-shipwrecks-men-survive.html" target="_blank">In shipwrecks, men more likely to survive</a>

In [4]:
titanic

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,name,cabin,embarked
0,1st,Yes,female,29.0000,0,0,211.337494,"Allen, Miss. Elisabeth Walton",B5,Southampton
1,1st,Yes,male,0.9167,1,2,151.550003,"Allison, Master. Hudson Trevor",C22 C26,Southampton
2,1st,No,female,2.0000,1,2,151.550003,"Allison, Miss. Helen Loraine",C22 C26,Southampton
3,1st,No,male,30.0000,1,2,151.550003,"Allison, Mr. Hudson Joshua Crei",C22 C26,Southampton
4,1st,No,female,25.0000,1,2,151.550003,"Allison, Mrs. Hudson J C (Bessi",C22 C26,Southampton
...,...,...,...,...,...,...,...,...,...,...
1038,3rd,No,male,45.5000,0,0,7.225000,"Youseff, Mr. Gerious",,Cherbourg
1039,3rd,No,female,14.5000,1,0,14.454200,"Zabour, Miss. Hileni",,Cherbourg
1040,3rd,No,male,26.5000,0,0,7.225000,"Zakarian, Mr. Mapriededer",,Cherbourg
1041,3rd,No,male,27.0000,0,0,7.225000,"Zakarian, Mr. Ortin",,Cherbourg


In [5]:
from typing import List, Tuple
import pandas as pd
from math import sqrt
from scipy import stats
import numpy as np

In [8]:
class compare_props:
    """Pairwise comparison of the proportion of `level` in `var` across groups of `grouping_var`."""

    def __init__(
        self,
        data: pd.DataFrame,
        grouping_var: str,
        var: str,
        level: str,
        combinations: List[Tuple[str, str]],
        alt_hypo: str,
        conf: float,
        multiple_comp_adjustment: str = "none",
    ) -> None:
        self.data = data
        self.grouping_var = grouping_var
        self.var = var
        self.level = level
        self.combinations = combinations
        self.alt_hypo = alt_hypo
        self.conf = conf
        self.multiple_comp_adjustment = multiple_comp_adjustment

        self.p_val = None

        print("Pairwise proportion comparisons")

    def calculate(self) -> None:
        """Build the per-group summary (table1) and the pairwise test results (table2)."""
        combinations_elements = set()
        for combination in self.combinations:
            combinations_elements.add(combination[0])
            combinations_elements.add(combination[1])
        combinations_elements = list(combinations_elements)

        rows1 = []
        for element in combinations_elements:
            # all observations of `var` for this group, missing values included
            group = self.data[self.data[self.grouping_var] == element][self.var]
            n_missing = int(group.isna().sum())
            n = len(group) - n_missing
            # number of non-missing observations equal to the level of interest
            ns = int((group == self.level).sum())
            p = ns / n
            print(f"ns: {ns}, n: {n}, p: {p}, nmissing: {n_missing}")
            sd = sqrt(p * (1 - p))
            se = sd / sqrt(n)
            z_score = stats.norm.ppf((1 + self.conf) / 2)
            # np.real guards against an imaginary part that was printed in some cases
            me = np.real(z_score * sd / sqrt(n))
            row = [element, ns, p, n, n_missing, sd, se, me]
            rows1.append(row)

        self.table1 = pd.DataFrame(
            rows1,
            columns=[
                self.grouping_var,
                self.level,
                "p",
                "n",
                "n_missing",
                "sd",
                "se",
                "me",
            ],
        )

        alt_hypo_sign = " > "
        if self.alt_hypo == "less":
            alt_hypo_sign = " < "
        elif self.alt_hypo == "two-sided":
            alt_hypo_sign = " != "

        rows2 = []
        for v1, v2 in self.combinations:
            null_hypo = v1 + " = " + v2
            alt_hypo = v1 + alt_hypo_sign + v2

            # non-missing count and proportion of `level` in the first group
            group1 = self.data[self.data[self.grouping_var] == v1][self.var]
            n1 = len(group1) - int(group1.isna().sum())
            ns1 = int((group1 == self.level).sum())
            p1 = ns1 / n1

            # same quantities for the second group
            group2 = self.data[self.data[self.grouping_var] == v2][self.var]
            n2 = len(group2) - int(group2.isna().sum())
            ns2 = int((group2 == self.level).sum())
            p2 = ns2 / n2

            diff = p1 - p2

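            # 2x2 table of counts: rows are the two groups, columns are level / not level
            observed = [[ns1, n1 - ns1], [ns2, n2 - ns2]]
            # correction=False (no Yates continuity correction) is an assumption;
            # the resulting p-value is for the two-sided alternative only
            chisq, self.p_val, df, _ = stats.chi2_contingency(observed, correction=False)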

            row = [
                null_hypo,
                alt_hypo,
                diff,
                self.p_val,
                chisq,
                df,
                # zero_percent,
                # x_percent,
            ]
            rows2.append(row)

        self.table2 = pd.DataFrame(
            rows2,
            columns=[
                "Null hyp.",
                "Alt. hyp.",
                "diff",
                "p.value",
                "chisq.value",
                "df",
                # "0%",
                # str(self.conf * 100) + "%",
            ],
        )

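    def summary(self, dec: int = 3) -> None:
        """Print the settings and both result tables, rounded to `dec` decimals."""
        # run the calculations first if they have not been run yet
        if self.p_val is None:
            self.calculate()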
        data_name = ""
        if hasattr(self.data, "description"):
            data_name = self.data.description.split("\n")[0].split()[1].lower()
        if len(data_name) > 0:
            print(f"Data: {data_name}")

        print(f"Variables: {self.grouping_var}, {self.var}")
        print(f"Level: {self.level} in {self.var}")
        print(f"Confidence: {self.conf}")
        print(f"Adjustment: {self.multiple_comp_adjustment}")

        print()

        print(self.table1.round(dec).to_string(index=False))
        print(self.table2.round(dec).to_string(index=False))

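The per-group columns `p`, `sd`, `se`, and `me` computed in `calculate` can be checked by hand. The sketch below repeats that arithmetic for a single group. The helper name `proportion_summary` is hypothetical and not part of `pyrsm`; the counts for 1st class (179 survivors out of 282 passengers) come from the output shown further down.

```python
from math import sqrt

from scipy import stats


def proportion_summary(ns: int, n: int, conf: float = 0.95) -> dict:
    """Proportion, sd, se and margin of error for ns occurrences out of n.

    Hypothetical helper for illustration only.
    """
    p = ns / n
    sd = sqrt(p * (1 - p))  # standard deviation of a 0/1 variable
    se = sd / sqrt(n)  # standard error of the proportion
    z = stats.norm.ppf((1 + conf) / 2)  # two-sided critical value
    return {"p": p, "sd": sd, "se": se, "me": float(z * se)}


# 1st class: 179 survivors out of 282 passengers
first = proportion_summary(179, 282)
print({name: round(value, 3) for name, value in first.items()})
```

Rounded to three decimals, these values should agree with the `1st` row of the summary table printed by `summary()`.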

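The class stores `multiple_comp_adjustment` but does not apply it. As a minimal sketch of how the pairwise tests and an adjustment could fit together, the example below runs a chi-square test on the 2x2 table for each pair of classes and applies a Bonferroni adjustment. Both the choice of Bonferroni and `correction=False` are assumptions made for illustration, not a statement of what the class is meant to do. The counts come from the observed table in this notebook.

```python
from itertools import combinations

from scipy import stats

# (survivors, passengers) per class, taken from the observed counts in this notebook
counts = {"1st": (179, 282), "2nd": (115, 261), "3rd": (131, 500)}

pairs = list(combinations(counts, 2))  # all three pairwise comparisons
results = {}
for g1, g2 in pairs:
    (ns1, n1), (ns2, n2) = counts[g1], counts[g2]
    # 2x2 table: rows are the two groups, columns are survived / did not survive
    observed = [[ns1, n1 - ns1], [ns2, n2 - ns2]]
    # correction=False skips Yates' continuity correction (an assumption)
    chisq, p_val, df, _ = stats.chi2_contingency(observed, correction=False)
    # Bonferroni: multiply each p-value by the number of comparisons, cap at 1
    p_adj = min(1.0, p_val * len(pairs))
    results[(g1, g2)] = {
        "diff": ns1 / n1 - ns2 / n2,
        "chisq": chisq,
        "df": df,
        "p": p_val,
        "p_adj": p_adj,
    }
    print(f"{g1} vs {g2}: diff={ns1 / n1 - ns2 / n2:.3f}, chisq={chisq:.3f}, df={df}, p={p_val:.3g}, p_adj={p_adj:.3g}")
```

The chi-square p-value corresponds to the two-sided alternative; one-sided alternatives (`"less"`, `"greater"`) would need a different calculation, for example one based on the signed z statistic.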
In [9]:
compare_proportion = compare_props(data=titanic, grouping_var="pclass", var="survived", level="Yes", combinations=[("1st", "2nd"), ("1st", "3rd"), ("2nd", "3rd")], alt_hypo="two-sided", conf=0.95)
compare_proportion.calculate()
compare_proportion.summary()

Pairwise proportion comparisons
ns: 131, n: 500, p: 0.262, nmissing: 0
ns: 179, n: 282, p: 0.6347517730496454, nmissing: 0
ns: 115, n: 261, p: 0.44061302681992337, nmissing: 0
observed: survived  Yes   No  Total
pclass                   
1st       179  103    282
2nd       115  146    261
3rd       131  369    500
Total     425  618   1043
observed: survived  Yes   No  Total
pclass                   
1st       179  103    282
2nd       115  146    261
3rd       131  369    500
Total     425  618   1043
observed: survived  Yes   No  Total
pclass                   
1st       179  103    282
2nd       115  146    261
3rd       131  369    500
Total     425  618   1043
Data: titanic
Variables: pclass, survived
Level: Yes in survived
Confidence: 0.95
Adjustment: none

pclass  Yes     p   n  n_missing    sd    se    me
   3rd  131 0.262 500          0 0.440 0.020 0.039
   1st  179 0.635 282          0 0.481 0.029 0.056
   2nd  115 0.441 261          0 0.496 0.031 0.060
Null hyp.  Alt. hyp.  