**Problem 5c (Chi-square independence test).** 
You are given the results of the IPSOS exit polls for the 2015 parliamentary elections in Poland in the table **data**. Decide whether we can assume that gender and voting preferences are independent. To this end:
 * Compute row totals $r_i$, column totals $c_j$, and overall total $N$.
 * If the variables are independent, we expect to see $f_{ij} = r_i c_j / N$ in the $i$-th row and $j$-th column.
 * Compute the test statistic as before, i.e. $$ S = \sum_{ij} \frac{\left(f_{ij}-X_{ij}\right)^2}{f_{ij}}.$$
 * Again compare $S$ against the $\chi^2$ CDF. However, if the variables are independent, we only have $(r-1)(c-1)$ degrees of freedom here, where $r$ and $c$ denote the number of rows and columns (the expected counts are fully determined by the row and column totals).
 * The KORWiN party looks like an obvious outlier. Note that when we work with categorical variables we should not simply remove a category; it is better to aggregate. Introduce an aggregated category by summing the votes for the parties with less than 5% of the total votes and repeat the experiment.
 
**Note:** This kind of data is (to the best of our knowledge) not available online. It has been recreated from
online infographics and other pieces of information available online. It is definitely not completely accurate, but hopefully it is not very far off. Moreover, exit polls do not necessarily reflect the actual distribution of the population.

In [22]:
import numpy as np
# Rows: women, men
# Columns:          PiS, PO, Kukiz, Nowoczesna, Lewica, PSL, Razem, KORWiN
data = np.array([[ 17508, 11642,  3308,  3131,  2911,  2205,  1852, 1235],
                 [ 17672,  9318,  4865,  3259,  3029,  2479,  1606, 3259]])

Hypothesis $H_0$: voting preferences are independent of gender.

p-value: the probability, assuming $H_0$ is true, of observing a test statistic at least as extreme as the one computed from the sample. A small p-value is evidence against $H_0$.

### Test statistic

In [94]:
import scipy.stats  # import the submodule explicitly; a bare `import scipy` may not expose scipy.stats

def calculate_p_value(data):
    row_sums = data.sum(axis=1)
    column_sums = data.sum(axis=0)
    total = data.sum()

    # calculate dot product of row sums and column sums
    expected = np.matmul(row_sums[:, np.newaxis], column_sums[np.newaxis, :]) / total

    S = 0
    for i in range(len(row_sums)):
        for j in range(len(column_sums)):
            S += (data[i, j] - expected[i, j]) ** 2 / expected[i, j]

    # Compute the p-value: upper tail of chi-square with (rows - 1) * (columns - 1) degrees of freedom
    p_value = scipy.stats.chi2.sf(S, (len(row_sums) - 1) * (len(column_sums) - 1))
    return p_value
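
As a minimal sketch, the double loop above can also be written with array operations. The function name `chi2_independence` is hypothetical, and using `chi2.sf` (the survival function) in place of `1 - cdf` is a swapped-in choice that tends to behave better for very small p-values. Returning the statistic and the degrees of freedom alongside the p-value is also an assumption made here, since a p-value printed with two decimals says little about how strong the evidence is.

```python
import numpy as np
from scipy import stats


def chi2_independence(table):
    """Chi-square independence test on a 2-D table of counts (illustrative helper)."""
    table = np.asarray(table, dtype=float)
    row_totals = table.sum(axis=1)
    col_totals = table.sum(axis=0)
    # expected count f_ij = r_i * c_j / N under independence
    expected = np.outer(row_totals, col_totals) / table.sum()
    # S = sum_ij (f_ij - X_ij)^2 / f_ij
    statistic = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    # upper-tail probability of the chi-square distribution
    p_value = stats.chi2.sf(statistic, dof)
    return statistic, dof, p_value


# Rows: women, men; columns as in the problem statement
data = np.array([[17508, 11642, 3308, 3131, 2911, 2205, 1852, 1235],
                 [17672,  9318, 4865, 3259, 3029, 2479, 1606, 3259]])

statistic, dof, p_value = chi2_independence(data)
print(f"S = {statistic:.1f}, dof = {dof}, p-value = {p_value:.3g}")
```

Reporting `S` next to the degrees of freedom makes the size of the deviation visible even when the p-value itself is too small to display meaningfully.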


In [95]:
p_value = calculate_p_value(data)
print("p-value = {:.2f}".format(p_value))

p-value = 0.00


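### Conclusion
We reject the hypothesis that people vote the same way regardless of gender: the p-value is below 0.05.

However, let us take the small parties (< 6% of the votes), which hold more radical views, and merge them into a single category. Note that this uses a 6% threshold rather than the 5% from the problem statement; in this data KORWiN has about 5.03% of the votes, so a 5% cut-off would leave it unmerged.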
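
To see which parties drive this result, one option is to look at how much each column contributes to $S$. The sketch below is an illustration only: the `parties` list simply repeats the column labels from the data comment, and the names `contributions` and `per_party` are hypothetical.

```python
import numpy as np

parties = ["PiS", "PO", "Kukiz", "Nowoczesna", "Lewica", "PSL", "Razem", "KORWiN"]
# Rows: women, men
data = np.array([[17508, 11642, 3308, 3131, 2911, 2205, 1852, 1235],
                 [17672,  9318, 4865, 3259, 3029, 2479, 1606, 3259]])

# expected counts under independence: f_ij = r_i * c_j / N
expected = np.outer(data.sum(axis=1), data.sum(axis=0)) / data.sum()

# per-cell terms of the test statistic; their sum is S
contributions = (data - expected) ** 2 / expected

# total contribution of each party (summed over both rows)
per_party = contributions.sum(axis=0)

# print parties from the largest to the smallest contribution
for i in np.argsort(per_party)[::-1]:
    print(f"{parties[i]:<11s} {per_party[i]:10.1f}")
```

A party with a large contribution is one whose split between women and men differs most from what independence would predict.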

In [102]:
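THRESHOLD = 0.06
# share of all votes received by each party
odsetek = data.sum(axis=0) / data.sum()
print(odsetek)
# column indices of the parties below / at or above the threshold
to_agregate = np.nonzero(odsetek < THRESHOLD)[0]
rest = np.nonzero(odsetek >= THRESHOLD)[0]
# sum the small parties row-wise into one extra column
agregated_party = data[:, to_agregate].sum(axis=1)[:, np.newaxis]

# large parties first, aggregated category last
new_data = np.concatenate((data[:, rest], agregated_party), axis=1)
p_value = calculate_p_value(new_data)
print("p-value = {:.2f}".format(p_value))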

[0.39404563 0.23476965 0.09154448 0.07157338 0.066533   0.05246475
 0.03873251 0.05033659]
p-value = 0.00


So after merging the small parties the p-value is still very small, and we must again reject the hypothesis.

Let us check our result against the ready-made function from the scipy library, applied to our data:

In [101]:
p_value = scipy.stats.chi2_contingency(new_data)[1]
print("p-value = {:.2f}".format(p_value))

p-value = 0.00
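
Since `chi2_contingency` returns more than the p-value, a slightly fuller check is possible. The sketch below rebuilds the aggregated table on its own (the names `share`, `small`, and `merged` are hypothetical, and the 6% threshold is taken from the cell above) and unpacks the statistic, degrees of freedom, and expected counts as well.

```python
import numpy as np
from scipy import stats

# Rows: women, men
data = np.array([[17508, 11642, 3308, 3131, 2911, 2205, 1852, 1235],
                 [17672,  9318, 4865, 3259, 3029, 2479, 1606, 3259]])

# merge every party with less than 6% of all votes into one column
share = data.sum(axis=0) / data.sum()
small = share < 0.06
merged = np.column_stack([data[:, ~small], data[:, small].sum(axis=1)])

# chi2_contingency returns (statistic, p-value, degrees of freedom, expected counts)
statistic, p_value, dof, expected = stats.chi2_contingency(merged)

print(f"S = {statistic:.1f}, dof = {dof}, p-value = {p_value:.3g}")
print(np.round(expected, 1))
```

The degrees of freedom reported by scipy should agree with $(r-1)(c-1)$ for the aggregated table, which is a quick sanity check on the manual computation.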
