In [1]:
import pandas as pd
from scipy.stats import chi2_contingency, ttest_ind

Load data from pickle archive

In [2]:
df = pd.read_pickle("accidents.pkl.gz")

# Two hypotheses will be investigated

### Hypothesis 1: The probability that an accident is fatal is the same on 1st class roads as on 3rd class roads.

Create a new boolean column `fatal_accident` that is True if the accident was fatal (`p13a > 0`) and False otherwise

In [3]:
# An accident counts as fatal when p13a is greater than 0
df["fatal_accident"] = df["p13a"] > 0

Keep only accidents on 1st and 3rd class roads

In [4]:
df_filtered = df[df["p36"].isin([1, 3])]
df[["p1", "fatal_accident"]].head(5)

,p1,fatal_accident
0,2100160001,False
1,2100160002,False
2,2100160003,False
3,2100160004,False
4,2100160005,False


Create a contingency table with the frequencies of fatal and non-fatal accidents on 1st and 3rd class roads

In [5]:
contingency_table = pd.crosstab(index=df_filtered["fatal_accident"], columns=df_filtered["p36"])
contingency_table

fatal_accident,p36=1,p36=3
False,78618,73352
True,911,448


Perform the chi-square test of independence on the contingency table computed above. For a 2×2 table, `chi2_contingency` applies Yates' continuity correction by default.

In [6]:
res = chi2_contingency(contingency_table)
g, p, dof, expected = res
(g,p,dof,expected)

(125.72070150000258,
 3.5395243450138555e-29,
 1,
 array([[78824.11109444, 73145.88890556],
        [  704.88890556,   654.11109444]]))
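
The expected frequencies and the test statistic above can be reproduced by hand from the observed counts. The following is a minimal sketch in plain Python; the variable names are illustrative, and it assumes that `chi2_contingency` applied Yates' continuity correction, which is its default for a 2×2 table.

```python
# Observed counts copied from the contingency table above
# rows: fatal_accident False / True, columns: p36 == 1 / p36 == 3
observed = [[78618, 73352],
            [911, 448]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / grand total
expected_by_hand = [[r * c / n for c in col_totals] for r in row_totals]

# Chi-square statistic with Yates' continuity correction (|O - E| reduced by 0.5)
chi2_by_hand = sum(
    (abs(o - e) - 0.5) ** 2 / e
    for obs_row, exp_row in zip(observed, expected_by_hand)
    for o, e in zip(obs_row, exp_row)
)

print(expected_by_hand)
print(chi2_by_hand)
```

Both printed values should agree with the `expected` array and the statistic `g` returned by `chi2_contingency`, up to floating-point rounding.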

Subtract the expected frequencies from the observed frequencies to see the direction of the deviation

In [7]:
contingency_table - expected

fatal_accident,p36=1,p36=3
False,-206.111094,206.111094
True,206.111094,-206.111094
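
The direction of these differences can also be read as the share of fatal accidents per road class. A minimal sketch using the counts from the contingency table (the dictionary layout and names are illustrative):

```python
# Counts copied from the contingency table: road class -> (non-fatal, fatal)
counts = {1: (78618, 911), 3: (73352, 448)}

# Share of fatal accidents among all accidents on each road class
fatal_share = {
    road_class: fatal / (non_fatal + fatal)
    for road_class, (non_fatal, fatal) in counts.items()
}

for road_class, share in fatal_share.items():
    print(f"road class {road_class}: {share:.2%} of accidents were fatal")
```

A higher share for road class 1 matches the positive observed-minus-expected difference in the `True` row for that class.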


## Hypothesis 1 conclusion:
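#### Since the p-value is much smaller than α = 0.05, we reject the H0 hypothesis that the fatality of an accident is independent of the road class. The observed number of fatal accidents on 1st class roads exceeds the expected number by about 206, so there is sufficient evidence to support the claim that accidents on 1st class roads are more often fatal than accidents on 3rd class roads.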

### Hypothesis 2: The economic cost of damage to the investigated vehicle is lower in accidents caused by Skoda vehicles than in those caused by Audi vehicles.

Select the damage values (`p53`) of accidents caused by Audi (`p45a == 2`) and Skoda (`p45a == 39`) vehicles

In [8]:
audi = df[df["p45a"] == 2]["p53"]
skoda = df[df["p45a"] == 39]["p53"]
(audi, skoda)

(0         4000
 64         400
 121        500
 122          0
 123        200
           ... 
 572799     100
 572810      10
 572818     300
 572842     700
 572931      50
 Name: p53, Length: 11445, dtype: int64,
 6          200
 9          300
 11          50
 16        1200
 17         500
           ... 
 572903    1200
 572911      10
 572916     130
 572921     100
 572923       0
 Name: p53, Length: 118379, dtype: int64)

Perform a one-sided Welch's t-test (`equal_var=False`, `alternative='less'`) on the means of the two independent samples (Skoda and Audi)

In [9]:
ttest_ind(skoda, audi, equal_var=False, alternative='less')

Ttest_indResult(statistic=-23.622116776600297, pvalue=6.1078288453876684e-121)
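
For intuition about the sign of the statistic, Welch's t statistic can be computed by hand. The sketch below uses plain Python and small hypothetical samples that are not taken from the dataset; `welch_t`, `sample_a`, and `sample_b` are illustrative names.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for mean(a) - mean(b), without assuming equal variances."""
    # Standard error built from each sample's own (unbiased) variance
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical damage values, NOT taken from the accident data
sample_a = [100, 200, 150, 300, 250]
sample_b = [400, 500, 350, 600, 450]

t_stat = welch_t(sample_a, sample_b)
print(t_stat)
```

A negative statistic means the first sample has the lower mean, which is the direction tested by `alternative='less'` when the first argument to `ttest_ind` is `skoda`.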

## Hypothesis 2 conclusion:
#### The H0 hypothesis of the t-test was that the mean damage is the same for both car brands; the alternative was that the mean damage is lower for Skoda. Since the p-value is much smaller than α = 0.05, we reject the H0 hypothesis. There is sufficient evidence to support the claim that the economic cost of damage to the vehicle is lower in accidents caused by Skoda vehicles than in those caused by Audi vehicles.