Project Implementation for IZV 2020/2021  
Script: stat.ipynb 
Date: 09.12.2020  
Author: Mikhail Abramov  
xabram00@stud.fit.vutbr.cz  

---

<h3>Task 2: Hypothesis test</h3>

Hypothesis: `If the culprit of the accident was under the strong influence of alcohol, serious health consequences were more common`

Method: `Chi-squared test`, conditions:
*  the culprit of the accident was under the strong influence of alcohol if `p11 >= 7`
*  serious health consequences occurred if there was at least one death or serious injury, i.e. `p13a + p13b > 0`
*  accidents where the culprit was under the influence of drugs (`p11 == 4 or p11 == 5`) are excluded
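
The three conditions above can be sketched as boolean masks on a small made-up frame. The rows below are illustrative values, not records from `accidents.pkl.gz`, and the names `toy`, `strong_alcohol` and `serious` are assumptions introduced only for this example.

```python
import pandas as pd

# Toy rows (hypothetical values) using the same column names as the dataset.
toy = pd.DataFrame({
    "p11":  [2, 4, 5, 7, 9, 1],
    "p13a": [0, 1, 0, 0, 1, 0],
    "p13b": [0, 0, 2, 1, 0, 0],
})

# Exclude accidents where the culprit was under the influence of drugs.
toy = toy[~toy["p11"].isin([4, 5])]

# Strong influence of alcohol: p11 >= 7.
strong_alcohol = (toy["p11"] >= 7).astype(int)

# Serious health consequences: at least one death or serious injury.
serious = ((toy["p13a"] + toy["p13b"]) > 0).astype(int)

print(pd.DataFrame({"p11": strong_alcohol, "p13": serious}))
```

Building the masks before overwriting any column keeps the original codes available for checking, which is one possible alternative to in-place recoding.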

---
<h3>1. Prepare appropriate dataframe:</h3>

Needed:
*  take the needed columns
*  sum the p13a and p13b columns
*  exclude rows based on the p11 value and drop the p13a and p13b columns
*  convert p11 and p13 to binary values


In [1]:
import pandas as pd
import numpy as np
H1 = 'If the culprit of the accident was under the strong influence of alcohol, serious health consequences were more common'
# read the input file
df = pd.read_pickle("accidents.pkl.gz")
# take the needed columns (copy so later assignments do not trigger SettingWithCopyWarning)
df = df[['p11', 'p13a', 'p13b']].copy()
# sum the p13a and p13b columns
df['p13'] = df[['p13a', 'p13b']].sum(axis=1)
# drop the p13a and p13b columns
df = df.drop(columns=['p13a','p13b'])
# exclude p11 == 4 and p11 == 5 (influence of drugs)
df = df.loc[(df['p11'] != 4) & (df['p11'] != 5)].copy()
# not under the strong influence of alcohol -> 0
df.loc[df['p11'] < 7, 'p11'] = 0
# under the strong influence of alcohol -> 1 (the zeros set above are not affected)
df.loc[df['p11'] >= 7, 'p11'] = 1
# serious health consequences -> 1
df.loc[df['p13'] > 0, 'p13'] = 1
# no serious health consequences -> 0 (values are already 0, kept for symmetry)
df.loc[df['p13'] == 0, 'p13'] = 0
print(df)



        p11  p13
0         0    0
1         0    0
2         0    0
3         1    0
4         0    0
...     ...  ...
487156    0    0
487157    0    0
487158    0    0
487159    0    0
487160    0    0

[485683 rows x 2 columns]


---  
<h3>2. Prepare crosstab:</h3>

Needed:
*  crosstab

In [2]:
crosstab = pd.crosstab(index=df['p11'],columns=df['p13'])
print(crosstab)

p13       0      1
p11               
0    457528  10777
1     16492    886


---  
<h3>3. Calculate chi2 contingency:</h3>

In [3]:
from scipy.stats import chi2_contingency
stat, p, dof, expected = chi2_contingency(crosstab)
print("Achieved results:")
print(f"test Statistics: {stat}\ndegrees of freedom: {dof}\np-value: {p}\n\nExpected frequencies:\n {expected}")

Achieved results:
test Statistics: 558.1749514234125
degrees of freedom: 1
p-value: 2.0971505700338304e-123

Expected frequencies:
 [[4.57059308e+05 1.12456916e+04]
 [1.69606916e+04 4.17308438e+02]]
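
As a cross-check, the expected frequencies and the test statistic can be recomputed by hand with NumPy. This is a minimal sketch, assuming the counts from the crosstab above; it relies on the fact that `chi2_contingency` applies Yates' continuity correction by default when there is one degree of freedom, which is why 0.5 is subtracted from each absolute difference.

```python
import numpy as np

# Observed counts copied from the crosstab (rows: p11, columns: p13).
observed = np.array([[457528, 10777], [16492, 886]], dtype=float)

row_totals = observed.sum(axis=1)
col_totals = observed.sum(axis=0)
n = observed.sum()

# Expected count under independence: row total * column total / grand total.
expected = np.outer(row_totals, col_totals) / n

# Chi-squared statistic with Yates' continuity correction (2x2 table, dof = 1).
stat = (((np.abs(observed - expected) - 0.5) ** 2) / expected).sum()

print(expected)
print(stat)
```

Without the correction, that is, without subtracting 0.5, the statistic would come out slightly larger; with counts this large the difference does not change the conclusion.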


---
<h3>4. Interpret statistics (1):</h3>
First, we need to state and test our null hypothesis:

* H0: `there is no difference between an ordinary culprit and a culprit who was under the strong influence of alcohol`

In [4]:
from scipy.stats import chi2
prob = 0.95
critical = chi2.ppf(prob, dof)
print(f'probability={prob:.3f}, critical={critical:.3f}, stat={stat:.3f}')
if abs(stat) >= critical:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
alpha = 1.0 - prob
print(f'\nsignificance={alpha:.3f}, p={p:.3f}')
if p <= alpha:
    print('Dependent (reject H0)')
else:
    print('Independent (fail to reject H0)')
print(f'\nExpected frequencies:\n {expected}')

probability=0.950, critical=3.841, stat=558.175
Dependent (reject H0)

significance=0.050, p=0.000
Dependent (reject H0)

Expected frequencies:
 [[4.57059308e+05 1.12456916e+04]
 [1.69606916e+04 4.17308438e+02]]
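
The printed `p=0.000` is only a rounding effect of the `:.3f` format; the p-value is tiny but not zero. The sketch below shows the link between the statistic, the critical value and the p-value using only the standard library. It uses the identity that, for one degree of freedom, the upper-tail probability of the chi-squared distribution equals `erfc(sqrt(x / 2))`. The helper name `chi2_sf_dof1` is an assumption introduced for the example and is not part of SciPy.

```python
import math

stat = 558.1749514234125  # test statistic reported above
critical = 3.841          # critical value for prob = 0.95, dof = 1

def chi2_sf_dof1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_sf_dof1(stat)

# Scientific notation keeps the magnitude visible instead of rounding to 0.000.
print(f"p = {p_value:.3e}")
# The tail beyond the critical value is the significance level (about 0.05).
print(f"tail beyond critical = {chi2_sf_dof1(critical):.3f}")
```

Comparing `stat` with `critical` and comparing `p` with `alpha` are two views of the same decision, so both checks in the cell above necessarily agree.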


---
<h3>5. Interpret statistics (2):</h3>
Now that we know the two variables in the crosstab are dependent, we can compare the conditional probabilities to decide on our hypothesis

* Hypothesis: `If the culprit of the accident was under the strong influence of alcohol, serious health consequences were more common`

In [6]:
PYY = crosstab.loc[1].loc[1]/(crosstab.loc[1].loc[1]+crosstab.loc[1].loc[0])
PNY = crosstab.loc[0].loc[1]/(crosstab.loc[0].loc[1]+crosstab.loc[0].loc[0])
print(f'Probability: If condition - yes => result yes: {PYY:.3%}')
print(f'Probability: If condition - no  => result yes: {PNY:.3%}')
if (PYY > PNY):
    print(f'\nTrue - {H1}')
    print(f'{PYY:.3%} > {PNY:.3%}')
else:
    print(f'\nFalse - {H1}')
    print(f'{PYY:.3%} <= {PNY:.3%}')

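Probability: If condition - yes => result yes: 5.098%
Probability: If condition - no  => result yes: 2.301%

True - If the culprit of the accident was under the strong influence of alcohol, serious health consequences were more common
5.098% > 2.301%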
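
The two probabilities compared in this step can also be condensed into a single ratio, often called the relative risk. This is an optional sketch built from the crosstab counts; the names `serious_share` and `relative_risk` are assumptions introduced for the example.

```python
# Counts copied from the crosstab: (no serious consequences, serious consequences).
strong_alcohol = (16492, 886)   # p11 == 1
other = (457528, 10777)         # p11 == 0

def serious_share(not_serious, serious):
    """Share of accidents in a group that had serious health consequences."""
    return serious / (not_serious + serious)

p_alcohol = serious_share(*strong_alcohol)
p_other = serious_share(*other)

# Ratio above 1 means serious consequences were more common with strong alcohol influence.
relative_risk = p_alcohol / p_other

print(f"{p_alcohol:.3%} vs {p_other:.3%}, relative risk = {relative_risk:.2f}")
```

A ratio of roughly 2.2 means serious consequences were a little more than twice as frequent in the strong-alcohol group, which points in the same direction as the hypothesis.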

---
<h3>6. Result:</h3>

* Based on the result of the chi-squared test, we rejected the independence of the two variables; comparing the conditional probabilities then let us decide the direction of the effect.
* True - `If the culprit of the accident was under the strong influence of alcohol, serious health consequences were more common`