In [4]:
import pandas as pd

# Data Preparation
df = pd.read_pickle("accidents.pkl.gz")

## Hypothesis 1

On first-class roads, accidents were fatal with the same probability as on third-class roads.

## Hypothesis

**Null Hypothesis (H0)**: The proportion of fatal accidents is the same on first-class and third-class roads.

**Alternative Hypothesis (H1)**: The proportion of fatal accidents differs between first-class and third-class roads.

## Conclusion

The p-value is compared to the significance level (0.05). If the p-value is less than the significance level, the null hypothesis is rejected, indicating a significant association between road class and fatal crashes.

The test below produces the following output:

- Chi-square statistic: 167.2443757129343
- P-value: 2.95835646229767e-38
- Significance level: 0.05
- Reject: There is a significant difference in fatal crashes between first-class and third-class roads
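
To make the statistic above less of a black box, here is a minimal sketch of how a chi-square statistic and its p-value can be computed by hand for a 2x2 table. The counts are hypothetical and the helper name `chi_square_2x2` is invented for illustration. The continuity correction mirrors what `chi2_contingency` applies by default to 2x2 tables, but this is not the notebook's actual computation.

```python
import math


def chi_square_2x2(table, yates=True):
    """Return (statistic, p_value) for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            diff = abs(table[i][j] - expected)
            if yates:
                # Continuity correction: shrink each deviation by 0.5, not below zero.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: rows = road class 1 and 3, columns = non-fatal and fatal.
hypothetical_table = [[9900, 100],
                      [9950, 50]]

statistic, p_value = chi_square_2x2(hypothetical_table)
reject_h0 = p_value < 0.05
print(f"Chi-square statistic: {statistic:.4f}, reject H0: {reject_h0}")
```

A large statistic means the observed counts sit far from what independence between road class and fatality would predict, which is why the p-value shrinks as the statistic grows.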

In [5]:
from scipy.stats import chi2_contingency

ALPHA = 0.05

# Keep only accidents on first-class (p36 == 1) and third-class (p36 == 3) roads
roads = df[df["p36"].isin([1, 3])]

# Contingency table: road class (rows) vs. whether the accident was fatal (columns)
contingency_table = pd.crosstab(roads["p36"], roads["p13a"] > 0)

# print(contingency_table)

# Run chi-square test
chi2, p, _, _ = chi2_contingency(contingency_table.values)

# Output results
print(f"Chi-square statistic: {chi2}")
print(f"P-value: {p}")
print(f"Significance level: {ALPHA}")

if p < ALPHA:
    print("Reject: There is a significant difference in fatal crashes between first-class and third-class roads")
else:
    print("Fail to reject: There is no significant difference in fatal crashes between first-class and third-class roads")

Chi-square statistic: 167.2443757129343
P-value: 2.95835646229767e-38
Significance level: 0.05
Reject: There is a significant difference in fatal crashes between first-class and third-class roads
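
The chi-square test only says that the two road classes differ, not which one has the higher share of fatal accidents. The sketch below shows how row-normalized shares could reveal the direction. It uses a hypothetical toy frame that reuses the notebook's column names (`p36`, `p13a`); the values are invented and say nothing about the real data.

```python
import pandas as pd

# Hypothetical toy data with the notebook's column names:
# p36 = road class, p13a = number of people killed in the accident.
toy = pd.DataFrame({
    "p36":  [1, 1, 1, 1, 3, 3, 3, 3],
    "p13a": [0, 1, 0, 2, 0, 0, 0, 1],
})

subset = toy[toy["p36"].isin([1, 3])]

# Label each accident as fatal or non-fatal.
outcome = (subset["p13a"] > 0).map({True: "fatal", False: "non-fatal"}).rename("outcome")

# Absolute counts and per-road-class shares (each row sums to 1).
counts = pd.crosstab(subset["p36"], outcome)
rates = pd.crosstab(subset["p36"], outcome, normalize="index")
print(rates)
```

Applied to the real `df`, the same two `crosstab` calls would give the share of fatal accidents per road class next to the raw counts.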


## Hypothesis 2

In accidents involving Škoda vehicles, the damage to the vehicle is lower than in accidents involving Audi vehicles.

## Hypothesis

**Null Hypothesis (H0)**: There is no significant difference in vehicle damage between Škoda and Audi vehicles.

**Alternative Hypothesis (H1)**: Škoda vehicle damage is significantly lower than Audi vehicle damage.

## Conclusion

The test produces a U statistic and a p-value. The Mann-Whitney U test was chosen over the t-test because the damage values are not [normally distributed](https://www.youtube.com/watch?v=LcxB56PzylA). The p-value is the probability of observing a U statistic at least as extreme as the one calculated, assuming the null hypothesis is true.

The p-value is compared to the significance level (0.05). If the p-value is less than the significance level, the null hypothesis is rejected, suggesting a significant difference in vehicle damage between Skoda and Audi vehicles.

Output:

- U-statistic: 893517999.0
- P-value: 1.8082422042771395e-165
- Significance level: 0.05
- Reject: Skoda vehicle damage is significantly lower than Audi vehicle damage.
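
The choice of the U test rests on the damage values not being normally distributed. One way such a check could look is sketched below on synthetic, right-skewed data. The sample is generated, not taken from `accidents.pkl.gz`, and `normaltest` (D'Agostino-Pearson) is only one of several possible normality tests; the notebook does not state which check was used.

```python
import numpy as np
from scipy.stats import normaltest, skew

rng = np.random.default_rng(42)

# Synthetic, right-skewed "damage" values; purely illustrative.
synthetic_damage = rng.lognormal(mean=5.0, sigma=1.0, size=1000)

# H0 of normaltest: the sample comes from a normal distribution.
result = normaltest(synthetic_damage)
sample_skewness = skew(synthetic_damage)

looks_normal = result.pvalue >= 0.05
print(f"Skewness: {sample_skewness:.2f}, looks normal: {looks_normal}")
```

A small p-value together with strong positive skewness is the kind of evidence that argues for a rank-based test over a t-test.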

In [6]:
from scipy.stats import mannwhitneyu

ALPHA = 0.05

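# p45a holds the vehicle make code (39 = Škoda, 2 = Audi); p14 holds the damage
skoda_damage = df[df['p45a'] == 39]['p14']
audi_damage = df[df['p45a'] == 2]['p14']

# Perform the Mann-Whitney U test for independent samples.
# Note: in current SciPy versions mannwhitneyu is two-sided by default;
# pass alternative="less" to test the one-sided H1 (Škoda lower than Audi) directly.
U_statistic, p = mannwhitneyu(skoda_damage, audi_damage)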

# Output results
print(f"U-statistic: {U_statistic}")
print(f"P-value: {p}")
print(f"Significance level: {ALPHA}")

if p < ALPHA:
    print("Reject: Skoda vehicle damage is significantly lower than Audi vehicle damage.")
else:
    print("Fail to reject: There is no significant difference in vehicle damage between Skoda and Audi vehicles.")

U-statistic: 893517999.0
P-value: 1.8082422042771395e-165
Significance level: 0.05
Reject: Skoda vehicle damage is significantly lower than Audi vehicle damage.
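
The alternative hypothesis here is directional, whereas `mannwhitneyu` in current SciPy versions tests a two-sided alternative unless `alternative` is set. The sketch below compares the two variants on synthetic data. The sample names `cheaper` and `pricier` and their values are invented and unrelated to the accident data.

```python
import numpy as np
from scipy.stats import mannwhitneyu

ALPHA = 0.05
rng = np.random.default_rng(0)

# Synthetic, right-skewed damage amounts; the first group is shifted lower.
cheaper = rng.lognormal(mean=5.0, sigma=1.0, size=500)
pricier = rng.lognormal(mean=5.5, sigma=1.0, size=500)

# Two-sided: are the two distributions different in either direction?
two_sided = mannwhitneyu(cheaper, pricier, alternative="two-sided")

# One-sided: does the first sample tend to take lower values than the second?
one_sided = mannwhitneyu(cheaper, pricier, alternative="less")

print(f"Two-sided p-value: {two_sided.pvalue}")
print(f"One-sided p-value: {one_sided.pvalue}")
print(f"Reject H0 in favour of 'lower': {one_sided.pvalue < ALPHA}")
```

Only the one-sided variant supports a conclusion of the form "lower than". A two-sided rejection shows that the groups differ, and the direction then has to be read from the data, for example by comparing medians.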
