## Project recap
- Topic: Road accidents in the USA (FARS), year 2020.
- Team: Bc. Jan Kovář (kovj19), Bc. Dávid Vikor (vikd00).
- Goal: Identify the factors that influence accident severity and formulate recommendations for prevention.
- Example questions: severity vs. environment (urban/rural), weather, light conditions, and time of day; influence of alcohol/drugs; weekend vs. weekday differences; regional differences.
- Data: FARS tables ACCIDENT (location/time/environment), PERSON (participants, injuries, alcohol/drugs), VEHICLE (vehicles); joined via the key CASENUM.
- Hypotheses: accumulated risk factors increase severity; severity is higher on high-speed roads; darkness/rain/snow worsen outcomes; outcomes differ between regions and between urban and rural areas.


In [37]:
# === LIBRARY IMPORTS ===
import pandas as pd
import numpy as np
from pathlib import Path
from cleverminer import cleverminer
from pandas.api.types import CategoricalDtype

In [38]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 120)

In [39]:
# === DATA PATHS ===
DATA_DIR = Path.cwd() / "data" if (Path.cwd() / "data").exists() else Path.cwd().parent / "data"

PATH_ACC = DATA_DIR / "acc_20.csv"
PATH_PERS = DATA_DIR / "pers_20.csv"
PATH_VEH = DATA_DIR / "veh_20.csv"
PATH_HOL = DATA_DIR / "US Holiday Dates (2004-2021).csv"

# Verify that the files exist
for path in [PATH_ACC, PATH_PERS, PATH_VEH, PATH_HOL]:
    print(f"{path.name}: {'OK' if path.exists() else '❌ NOT FOUND, EXTRACT data.zip FIRST!'}")

acc_20.csv: OK
pers_20.csv: OK
veh_20.csv: OK
US Holiday Dates (2004-2021).csv: OK


In [40]:
# Export directory
DATA_EXPORT_DIR = DATA_DIR / "derived"
DATA_EXPORT_DIR.mkdir(parents=True, exist_ok=True)
print('Export dir:', DATA_EXPORT_DIR)


Export dir: c:\Users\janci\vscodeprojects\4IZ460-SP\data\derived


In [41]:
# === LOAD CSV FILES ===
acc = pd.read_csv(PATH_ACC, encoding='utf-8')
pers = pd.read_csv(PATH_PERS, encoding='latin1', low_memory=False)
veh = pd.read_csv(PATH_VEH, encoding='latin1', low_memory=False)
hol = pd.read_csv(PATH_HOL, encoding='utf-8')

# Check table shapes
print("ACCIDENTS:", acc.shape)
print("PERSONS:", pers.shape)
print("VEHICLES:", veh.shape)
print("HOLIDAYS:", hol.shape)


ACCIDENTS: (54745, 80)
PERSONS: (131962, 104)
VEHICLES: (94718, 167)
HOLIDAYS: (342, 6)
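
The load cell above reads two files as utf-8 and two as latin1. A minimal sketch of a loader that tries several encodings in order is shown below; the helper name `read_csv_with_fallback` is hypothetical and not part of the notebook.

```python
import pandas as pd

def read_csv_with_fallback(path, encodings=("utf-8", "latin1"), **kwargs):
    """Hypothetical helper: return the first read that decodes without error."""
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc, **kwargs)
        except UnicodeDecodeError as err:
            # Remember the failure and try the next encoding
            last_error = err
    raise last_error
```

Because latin1 accepts every byte sequence, it should stay last in the list; otherwise it would mask files that are really utf-8.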


In [42]:
# === DATA CLEANING ===

# Uniqueness of accidents in ACC
print("Row count:", len(acc))
print("Unique CASENUM count:", acc['CASENUM'].nunique())

if len(acc) == acc['CASENUM'].nunique():
    print("✅ Each row in ACC corresponds to one accident.")
else:
    print("⚠️ Warning: duplicated accidents!")
    display(acc[acc['CASENUM'].duplicated(keep=False)].head())

# Integrity check (CASENUM links)
acc_ids = set(acc['CASENUM'])
pers_ids = set(pers['CASENUM'])
veh_ids = set(veh['CASENUM'])

missing_in_acc_from_pers = pers_ids - acc_ids
missing_in_acc_from_veh  = veh_ids  - acc_ids

print("PERS without ACC:", len(missing_in_acc_from_pers))
print("VEH without ACC:", len(missing_in_acc_from_veh))

# NaN check in the important ACC columns
cols_important = ['YEAR', 'MONTH', 'DAY_WEEK', 'HOUR', 'MAXSEV_IMNAME', 'WEATHR_IMNAME']
missing_summary = acc[cols_important].isna().sum()
print("NaN values in important columns:")
print(missing_summary[missing_summary > 0])

# Valid value ranges for the MONTH and HOUR columns
def invalid_values(df, col, valid_range):
    invalid = df[~df[col].between(valid_range[0], valid_range[1])]
    return invalid[[col]] if not invalid.empty else None

invalid_months = invalid_values(acc, 'MONTH', (1, 12))
invalid_hours  = invalid_values(acc, 'HOUR', (0, 23))

if invalid_months is not None:
    print("Invalid months:")
    display(invalid_months.head())

if invalid_hours is not None:
    print("Invalid hours:")
    display(invalid_hours.head())

# Data types
print("\nData types in ACC:")
display(acc.dtypes.head(15))

# If CASENUM was not loaded as a string (e.g. as int or float), convert it to str in every table
if acc['CASENUM'].dtype != 'object':
    acc['CASENUM'] = acc['CASENUM'].astype(str)
if pers['CASENUM'].dtype != 'object':
    pers['CASENUM'] = pers['CASENUM'].astype(str)
if veh['CASENUM'].dtype != 'object':
    veh['CASENUM'] = veh['CASENUM'].astype(str)

# Duplicate records in the detail tables
pers_dups = pers.duplicated(subset=['CASENUM', 'VEH_NO', 'PER_NO']).sum()
veh_dups  = veh.duplicated(subset=['CASENUM', 'VEH_NO']).sum()

print(f"Duplicates in PERS: {pers_dups}")
print(f"Duplicates in VEH: {veh_dups}")

# Mini report
print(f"""
===== DATA QUALITY SUMMARY =====
ACC: {acc.shape[0]} accidents ({acc['CASENUM'].nunique()} unique)
PERS: {pers.shape[0]} persons
VEH:  {veh.shape[0]} vehicles
Missing in ACC: {int(acc[cols_important].isna().sum().sum())} cells
Duplicates (PERS): {pers_dups}, (VEH): {veh_dups}
================================
""")


Row count: 54745
Unique CASENUM count: 54745
✅ Each row in ACC corresponds to one accident.
PERS without ACC: 0
VEH without ACC: 0
NaN values in important columns:
Series([], dtype: int64)
Invalid hours:


Unnamed: 0,HOUR
436,99
538,99
965,99
1293,99
2183,99



Data types in ACC:


CASENUM            int64
STRATUM            int64
STRATUMNAME       object
REGION             int64
REGIONNAME        object
PSU                int64
PJ                 int64
PSU_VAR            int64
URBANICITY         int64
URBANICITYNAME    object
VE_TOTAL           int64
VE_FORMS           int64
PVH_INVL           int64
PEDS               int64
PERMVIT            int64
dtype: object

Duplicates in PERS: 0
Duplicates in VEH: 0

===== DATA QUALITY SUMMARY =====
ACC: 54745 accidents (54745 unique)
PERS: 131962 persons
VEH:  94718 vehicles
Missing in ACC: 0 cells
Duplicates (PERS): 0, (VEH): 0
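
The range check above flags HOUR = 99, which the following cell treats as "unknown" rather than as a data error. A minimal sketch that reports such sentinel codes separately from genuinely out-of-range values; the helper name `split_invalid` and the default sentinel set are illustrative assumptions.

```python
import pandas as pd

def split_invalid(series, lo, hi, sentinels=(99,)):
    """Hypothetical helper: count sentinel codes and other out-of-range values separately."""
    is_sentinel = series.isin(sentinels)
    # Out of range means outside [lo, hi] and not a known sentinel code
    out_of_range = ~series.between(lo, hi) & ~is_sentinel
    return {"sentinel": int(is_sentinel.sum()), "out_of_range": int(out_of_range.sum())}

hours = pd.Series([8, 1, 13, 99, 23, 99, 24])
print(split_invalid(hours, 0, 23))
```

Keeping the two counts apart makes it easier to tell an expected "unknown" code from a value that points to a loading or coding problem.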



In [43]:
# %% 4) COLUMN SELECTION AND TIME ATTRIBUTES

cols_acc = [
    "CASENUM", "YEAR", "MONTH", "DAY_WEEK", "HOUR",
    "URBANICITYNAME", "REGIONNAME",
    "LGTCON_IMNAME", "WEATHR_IMNAME",
    "MAXSEV_IMNAME", "ALCHL_IMNAME",
    "VE_TOTAL"
]
acc_sub = acc[cols_acc].copy()
print("Selected columns:", len(acc_sub.columns))

# Synthetic monthly date (only for visualizations/aggregations by month)
acc_sub["DATE_SYNTH"] = pd.to_datetime(
    acc_sub["YEAR"].astype(str) + "-" + acc_sub["MONTH"].astype(str) + "-01",
    errors="coerce"
)

# Part of day: HOUR=99 => Unknown
acc_sub["HOUR_CLEAN"] = np.where(acc_sub["HOUR"] == 99, np.nan, acc_sub["HOUR"])

def hour_to_daypart(hour):
    if pd.isna(hour): return "Unknown"
    if 0 <= hour < 6: return "Night"
    if 6 <= hour < 12: return "Morning"
    if 12 <= hour < 18: return "Afternoon"
    if 18 <= hour <= 23: return "Evening"
    return "Unknown"

acc_sub["DAYPART"] = acc_sub["HOUR_CLEAN"].apply(hour_to_daypart)

# Weekend / Weekday (in the dataset 1=Sun, 7=Sat)
acc_sub["WEEKEND"] = acc_sub["DAY_WEEK"].apply(lambda x: "Weekend" if x in [1, 7] else "Weekday")

# Holidays cannot be determined reliably yet
acc_sub["IS_HOLIDAY"] = "Unknown"

# Season
def month_to_season(m):
    if m in [12, 1, 2]:  return "Winter"
    if m in [3, 4, 5]:   return "Spring"
    if m in [6, 7, 8]:   return "Summer"
    if m in [9, 10, 11]: return "Autumn"
    return "Unknown"

acc_sub["SEASON"] = acc_sub["MONTH"].apply(month_to_season)

# Drop the helper column
acc_sub.drop(columns=["HOUR_CLEAN"], inplace=True)

display(acc_sub.head(3))
print(acc_sub[["DAYPART","WEEKEND","SEASON","IS_HOLIDAY"]].describe(include="all"))


Selected columns: 12


Unnamed: 0,CASENUM,YEAR,MONTH,DAY_WEEK,HOUR,URBANICITYNAME,REGIONNAME,LGTCON_IMNAME,WEATHR_IMNAME,MAXSEV_IMNAME,ALCHL_IMNAME,VE_TOTAL,DATE_SYNTH,DAYPART,WEEKEND,IS_HOLIDAY,SEASON
0,202002121240,2020,1,4,8,Rural Area,"West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, ...",Daylight,Cloudy,No Apparent Injury (O),No Alcohol Involved,2,2020-01-01,Morning,Weekday,Unknown,Winter
1,202002121829,2020,1,4,1,Urban Area,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Dark - Not Lighted,Clear,Suspected Minor Injury (B),No Alcohol Involved,1,2020-01-01,Night,Weekday,Unknown,Winter
2,202002121849,2020,1,4,13,Urban Area,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Daylight,Clear,No Apparent Injury (O),No Alcohol Involved,2,2020-01-01,Afternoon,Weekday,Unknown,Winter


          DAYPART  WEEKEND  SEASON IS_HOLIDAY
count       54745    54745   54745      54745
unique          5        2       4          1
top     Afternoon  Weekday  Autumn    Unknown
freq        23275    40786   17232      54745
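
The part-of-day mapping above is applied row by row. As an alternative, a minimal vectorized sketch with `pd.cut` that uses the same boundaries is shown below; the function name `daypart_from_hour` is an illustrative assumption, not part of the notebook.

```python
import pandas as pd

def daypart_from_hour(hour):
    """Bin hours 0-23 into four dayparts; anything else (e.g. 99) becomes 'Unknown'."""
    parts = pd.cut(
        hour,
        bins=[0, 6, 12, 18, 24],
        right=False,  # intervals are [0, 6), [6, 12), [12, 18), [18, 24)
        labels=["Night", "Morning", "Afternoon", "Evening"],
    )
    # Values outside the bins come back as NaN; give them an explicit category
    return parts.cat.add_categories("Unknown").fillna("Unknown")

hours = pd.Series([8, 1, 13, 17, 23, 99])
print(daypart_from_hour(hours).tolist())
```

The result is already a categorical column, which is the form the CleverMiner cells further down expect.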


In [44]:
# %% 5) AGGREGATE PERS TO ACCIDENT LEVEL

# Defensively add missing columns (in case they are absent from PERS)
for col in ["INJ_SEV","DRINKING","DRUGS","HOSPITAL","PER_NO","VEH_NO"]:
    if col not in pers.columns:
        pers[col] = np.nan

to_num = lambda s: pd.to_numeric(s, errors="coerce").fillna(0)

pers_agg = (
    pers.groupby("CASENUM")
        .agg(
            persons        = ("PER_NO", "count"),
            # Note: every nonzero INJ_SEV code counts as injured and enters the max,
            # including codes such as 9 that likely mean "unknown" rather than a severity level
            injured        = ("INJ_SEV", lambda s: (to_num(s) > 0).sum()),
            max_inj_sev    = ("INJ_SEV", lambda s: to_num(s).max()),
            any_drinking   = ("DRINKING", lambda s: bool((to_num(s) == 1).any())),
            any_drugs      = ("DRUGS",    lambda s: bool((to_num(s) == 1).any())),
            any_hospitalized = ("HOSPITAL", lambda s: bool((to_num(s) == 1).any())),
        )
        .reset_index()
)
display(pers_agg.head())


Unnamed: 0,CASENUM,persons,injured,max_inj_sev,any_drinking,any_drugs,any_hospitalized
0,202002121240,2,0,0,False,False,False
1,202002121829,1,1,2,False,False,False
2,202002121849,2,1,9,False,False,False
3,202002123484,1,0,0,False,False,False
4,202002123576,2,0,0,False,False,False
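
In the preview above, max_inj_sev is 9 for an accident whose MAXSEV_IMNAME is "No Apparent Injury (O)", which suggests that some INJ_SEV codes are not severity levels. A minimal sketch that restricts the aggregation to an ordinal scale follows. It assumes, without confirmation from the notebook, that codes 0 to 4 form that scale and that every other code means unknown or other; the function name is hypothetical.

```python
import pandas as pd

ORDINAL_CODES = [0, 1, 2, 3, 4]  # assumption: 0 = no apparent injury ... 4 = fatal

def aggregate_injuries(pers):
    """Aggregate person-level INJ_SEV per CASENUM, ignoring non-ordinal codes."""
    sev = pd.to_numeric(pers["INJ_SEV"], errors="coerce")
    sev = sev.where(sev.isin(ORDINAL_CODES))  # every other code becomes NaN
    tmp = pd.DataFrame({"CASENUM": pers["CASENUM"], "sev": sev})
    return tmp.groupby("CASENUM").agg(
        injured=("sev", lambda s: int((s > 0).sum())),
        max_inj_sev=("sev", "max"),
        unknown_sev=("sev", lambda s: int(s.isna().sum())),
    )

# Toy data: accident "a" has one uninjured person and one with code 9
demo = pd.DataFrame({"CASENUM": ["a", "a", "b", "b"], "INJ_SEV": [0, 9, 2, 3]})
print(aggregate_injuries(demo))
```

Counting the ignored codes in a separate column keeps the information that severity was unknown for some persons instead of silently discarding it.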


In [45]:
# %% 6) AGGREGATE VEH TO ACCIDENT LEVEL

# Defensively add missing columns (in case they are absent from VEH)
for col in ["ROLLOVER","FIRE_EXP","TOWED","VEH_NO"]:
    if col not in veh.columns:
        veh[col] = np.nan

to_num = lambda s: pd.to_numeric(s, errors="coerce").fillna(0)

veh_agg = (
    veh.groupby("CASENUM")
        .agg(
            vehicles    = ("VEH_NO", "nunique"),
            any_rollover= ("ROLLOVER", lambda s: bool((to_num(s) == 1).any())),
            any_fire    = ("FIRE_EXP", lambda s: bool((to_num(s) == 1).any())),
            any_towed   = ("TOWED",    lambda s: bool((to_num(s) == 1).any())),
        )
        .reset_index()
)
display(veh_agg.head())


Unnamed: 0,CASENUM,vehicles,any_rollover,any_fire,any_towed
0,202002121240,2,False,False,False
1,202002121829,1,False,False,False
2,202002121849,2,False,False,False
3,202002123484,1,False,False,False
4,202002123576,1,False,False,False


In [46]:
# %% 7) MERGE AT ACCIDENT LEVEL

final_df = (
    acc_sub
      .merge(pers_agg, on="CASENUM", how="left")
      .merge(veh_agg,  on="CASENUM", how="left")
)

assert final_df["CASENUM"].is_unique
print("Shape final_df:", final_df.shape)

# Overview of NA values
cand_cols = [
    "persons","injured","max_inj_sev","vehicles",
    "any_drinking","any_drugs","any_hospitalized",
    "any_rollover","any_fire","any_towed",
]
present = [c for c in cand_cols if c in final_df.columns]
sub = final_df[present]

total = len(final_df)
rows_any_na  = sub.isna().any(axis=1).sum()
rows_all_na  = sub.isna().all(axis=1).sum()
na_by_col    = sub.isna().sum().sort_values(ascending=False)

print(f"Rows with ANY NA in selected features: {rows_any_na} / {total} ({rows_any_na/total:.1%})")
print(f"Rows with ALL NA in selected features: {rows_all_na} / {total} ({rows_all_na/total:.1%})")
print("NA by column:")
print(na_by_col.to_string())

# Defaults after the merge (for accidents with no person/vehicle detail records)
for c in ["persons","injured","max_inj_sev","vehicles"]:
    if c in final_df.columns:
        final_df[c] = pd.to_numeric(final_df[c], errors="coerce").fillna(0).astype(int)

# Keep the flag columns boolean: filling with the string "No" would mix types with True/False
for c in ["any_drinking","any_drugs","any_hospitalized","any_rollover","any_fire","any_towed"]:
    if c in final_df.columns:
        final_df[c] = final_df[c].fillna(False).astype(bool)

display(final_df.head())

Shape final_df: (54745, 27)
Rows with ANY NA in selected features: 19 / 54745 (0.0%)
Rows with ALL NA in selected features: 0 / 54745 (0.0%)
NA by column:
persons             19
injured             19
max_inj_sev         19
any_drinking        19
any_drugs           19
any_hospitalized    19
vehicles             0
any_rollover         0
any_fire             0
any_towed            0


Unnamed: 0,CASENUM,YEAR,MONTH,DAY_WEEK,HOUR,URBANICITYNAME,REGIONNAME,LGTCON_IMNAME,WEATHR_IMNAME,MAXSEV_IMNAME,ALCHL_IMNAME,VE_TOTAL,DATE_SYNTH,DAYPART,WEEKEND,IS_HOLIDAY,SEASON,persons,injured,max_inj_sev,any_drinking,any_drugs,any_hospitalized,vehicles,any_rollover,any_fire,any_towed
0,202002121240,2020,1,4,8,Rural Area,"West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, ...",Daylight,Cloudy,No Apparent Injury (O),No Alcohol Involved,2,2020-01-01,Morning,Weekday,Unknown,Winter,2,0,0,False,False,False,2,False,False,False
1,202002121829,2020,1,4,1,Urban Area,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Dark - Not Lighted,Clear,Suspected Minor Injury (B),No Alcohol Involved,1,2020-01-01,Night,Weekday,Unknown,Winter,1,1,2,False,False,False,1,False,False,False
2,202002121849,2020,1,4,13,Urban Area,"South (MD, DE, DC, WV, VA, KY, TN, NC, SC, GA,...",Daylight,Clear,No Apparent Injury (O),No Alcohol Involved,2,2020-01-01,Afternoon,Weekday,Unknown,Winter,2,1,9,False,False,False,2,False,False,False
3,202002123484,2020,1,4,14,Rural Area,"West (MT, ID, WA, OR, CA, NV, NM, AZ, UT, CO, ...",Daylight,Snow,No Apparent Injury (O),No Alcohol Involved,1,2020-01-01,Afternoon,Weekday,Unknown,Winter,1,0,0,False,False,False,1,False,False,False
4,202002123576,2020,1,4,17,Rural Area,"Northeast (PA, NJ, NY, NH, VT, RI, MA, ME, CT)",Dark - Not Lighted,Snow,No Apparent Injury (O),No Alcohol Involved,1,2020-01-01,Afternoon,Weekday,Unknown,Winter,2,0,0,False,False,False,1,False,False,False
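
The NA overview above reports 19 accidents without person-level aggregates. A minimal sketch of how such rows could be located with the `indicator` and `validate` options of `DataFrame.merge`; the toy frames below are made up for illustration.

```python
import pandas as pd

acc_demo = pd.DataFrame({"CASENUM": ["1", "2", "3"]})
pers_demo = pd.DataFrame({"CASENUM": ["1", "3"], "persons": [2, 1]})

merged = acc_demo.merge(
    pers_demo,
    on="CASENUM",
    how="left",
    validate="one_to_one",  # raises MergeError if CASENUM repeats on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Accidents that found no partner row in the person-level aggregate
unmatched = merged.loc[merged["_merge"] == "left_only", "CASENUM"].tolist()
print(unmatched)
```

With `validate` set, an accidental row multiplication in the join fails loudly at merge time rather than surfacing later as a uniqueness assertion error.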


In [47]:
# %% 8) CATEGORIZATION FOR CLEVERMINER (ante/succ, low cardinality)

# Vehicle count according to ACC (VE_TOTAL)
final_df["VEH_COUNT_CAT"] = pd.cut(
    pd.to_numeric(final_df["VE_TOTAL"], errors="coerce").fillna(0),
    bins=[-1,1,2,99], labels=["1","2","3+"]
)

# Coarse accident severity for the succedent: Serious/Fatal, None/Minor, Other/Unknown
def map_severity(name: str) -> str:
    if not isinstance(name, str):
        return "Other/Unknown"
    s = name.lower()
    if "fatal" in s or "(k)" in s or "serious injury (a)" in s or "suspected serious" in s:
        return "Serious/Fatal"
    if "no apparent" in s or "possible" in s or "minor" in s:
        return "None/Minor"
    return "Other/Unknown"

final_df["SEVERITY_BIN"] = final_df["MAXSEV_IMNAME"].apply(map_severity)

# Sanity preview
display(final_df[["MAXSEV_IMNAME","SEVERITY_BIN"]].head(10))


Unnamed: 0,MAXSEV_IMNAME,SEVERITY_BIN
0,No Apparent Injury (O),None/Minor
1,Suspected Minor Injury (B),None/Minor
2,No Apparent Injury (O),None/Minor
3,No Apparent Injury (O),None/Minor
4,No Apparent Injury (O),None/Minor
5,Possible Injury (C),None/Minor
6,No Apparent Injury (O),None/Minor
7,No Apparent Injury (O),None/Minor
8,No Apparent Injury (O),None/Minor
9,No Apparent Injury (O),None/Minor


In [48]:
# %% 9) EXPORT CSV

# make sure the export directory exists
DATA_EXPORT_DIR.mkdir(parents=True, exist_ok=True)

out_cols = [
    # key & time
    'CASENUM','YEAR','MONTH','DAY_WEEK','HOUR','DATE_SYNTH',
    # accident context
    'URBANICITYNAME','REGIONNAME','LGTCON_IMNAME','WEATHR_IMNAME',
    'MAXSEV_IMNAME','ALCHL_IMNAME','VE_TOTAL',
    # our new attributes
    'DAYPART','WEEKEND','SEASON','IS_HOLIDAY',
    # aggregates from PERS
    'persons','any_drinking','any_drugs','any_hospitalized',
    # aggregates from VEH
    'vehicles','any_rollover','any_fire','any_towed',
    # categories for CleverMiner
    'VEH_COUNT_CAT','SEVERITY_BIN'
]
final_ready = final_df[out_cols].copy()

export_path = DATA_EXPORT_DIR / 'accident_level.csv'
final_ready.to_csv(export_path, index=False, encoding='utf-8')
print('Exported:', export_path)


Exported: c:\Users\janci\vscodeprojects\4IZ460-SP\data\derived\accident_level.csv


In [49]:
# %% 10) MINI SANITY CHECK (quick distributions)

for col in ["SEVERITY_BIN","DAYPART","WEEKEND","URBANICITYNAME","VEH_COUNT_CAT"]:
    print(f"\n{col}")
    print(final_ready[col].value_counts(dropna=False).head(10))

# small cross view (example): severity × part of day
ct = pd.crosstab(final_ready["SEVERITY_BIN"], final_ready["DAYPART"])
print("\nCrosstab: SEVERITY_BIN × DAYPART")
print(ct)



SEVERITY_BIN
SEVERITY_BIN
None/Minor       46663
Serious/Fatal     7881
Other/Unknown      201
Name: count, dtype: int64

DAYPART
DAYPART
Afternoon    23275
Evening      13863
Morning      12705
Night         4569
Unknown        333
Name: count, dtype: int64

WEEKEND
WEEKEND
Weekday    40786
Weekend    13959
Name: count, dtype: int64

URBANICITYNAME
URBANICITYNAME
Urban Area    40922
Rural Area    13823
Name: count, dtype: int64

VEH_COUNT_CAT
VEH_COUNT_CAT
2     33399
1     17016
3+     4330
Name: count, dtype: int64

Crosstab: SEVERITY_BIN × DAYPART
DAYPART        Afternoon  Evening  Morning  Night  Unknown
SEVERITY_BIN                                              
None/Minor         20583    11200    11152   3456      272
Other/Unknown         77       52       52     19        1
Serious/Fatal       2615     2611     1501   1094       60
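
Raw counts in the crosstab are hard to compare because the dayparts differ greatly in size. A minimal sketch that converts the counts printed above into column-wise shares; on the real data, `pd.crosstab(..., normalize="columns")` should give the same table.

```python
import pandas as pd

# Counts copied from the crosstab above (rows: SEVERITY_BIN, columns: DAYPART)
ct = pd.DataFrame(
    {
        "Afternoon": [20583, 77, 2615],
        "Evening": [11200, 52, 2611],
        "Morning": [11152, 52, 1501],
        "Night": [3456, 19, 1094],
        "Unknown": [272, 1, 60],
    },
    index=["None/Minor", "Other/Unknown", "Serious/Fatal"],
)

# Column-wise shares: each daypart sums to 1
shares = ct / ct.sum(axis=0)
print(shares.loc["Serious/Fatal"].round(3))
```

Read this way, roughly 24 % of night accidents fall into Serious/Fatal compared with about 11 % in the afternoon, even though the afternoon has more serious accidents in absolute terms.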


In [50]:
# %% Dataset for CleverMiner built from acc_sub (no aggregations)
def map_severity_ord(x):
    if pd.isna(x):
        return "Non-Serious"  # conservative default
    s = str(x)
    if "Fatal" in s:
        return "Fatal"
    if "Serious" in s:
        return "Serious"
    return "Non-Serious"

df_cm = acc_sub.copy()

# keep only the columns we need
keep_cols = [
    "URBANICITYNAME","WEATHR_IMNAME","LGTCON_IMNAME",
    "DAYPART","WEEKEND","ALCHL_IMNAME","MAXSEV_IMNAME"
]
df_cm = df_cm[keep_cols].copy()

# ordinal target (succedent)
df_cm["SEVERITY_ORD"] = df_cm["MAXSEV_IMNAME"].apply(map_severity_ord)
sev_order = CategoricalDtype(categories=["Fatal","Serious","Non-Serious"], ordered=True)
df_cm["SEVERITY_ORD"] = df_cm["SEVERITY_ORD"].astype(sev_order)

# fill missing values with Unknown and convert to categories
for c in ["URBANICITYNAME","WEATHR_IMNAME","LGTCON_IMNAME","DAYPART","WEEKEND","ALCHL_IMNAME"]:
    df_cm[c] = df_cm[c].astype("string").fillna("Unknown").astype("category")

print({c: df_cm[c].nunique() for c in df_cm.columns})
baseline = (df_cm["SEVERITY_ORD"].isin(["Fatal","Serious"])).mean()
print(f"Baseline P(Serious/Fatal): {baseline:.3f}")


{'URBANICITYNAME': 2, 'WEATHR_IMNAME': 11, 'LGTCON_IMNAME': 7, 'DAYPART': 5, 'WEEKEND': 2, 'ALCHL_IMNAME': 2, 'MAXSEV_IMNAME': 8, 'SEVERITY_ORD': 3}
Baseline P(Serious/Fatal): 0.144


Analysis A: 4ft-Miner

Question: "Under which circumstances (location/weather/lighting/time of day/weekend) does the probability of a serious accident increase?"

In [51]:
# %% Analysis A: 4ft-Miner (ante = environment, succ = severity)
clm_A = cleverminer(
    df=df_cm,
    proc='4ftMiner',
    quantifiers={
        'Base': 300,     # min. number of cases A∧S (tune as needed)
        'conf': 0.28,    # min. P(S|A) (tune relative to the baseline)
        # optionally: 'aad': 0.05  # Above-Average Difference
    },
    ante={
        'attributes':[
            {'name':'URBANICITYNAME','type':'subset','minlen':1,'maxlen':1},
            {'name':'WEATHR_IMNAME','type':'subset','minlen':1,'maxlen':2},
            {'name':'LGTCON_IMNAME','type':'subset','minlen':1,'maxlen':2},
            {'name':'DAYPART','type':'subset','minlen':1,'maxlen':1},
            {'name':'WEEKEND','type':'subset','minlen':1,'maxlen':1},
        ],
        'minlen':1,'maxlen':3,'type':'con'
    },
    succ={
        'attributes':[
            {'name':'SEVERITY_ORD','type':'lcut','minlen':1,'maxlen':2}
            # lcut over the ordinal ["Fatal","Serious","Non-Serious"] => tries {Fatal} and {Fatal,Serious}
        ],
        'minlen':1,'maxlen':1,'type':'con'
    }
)

clm_A.print_summary()
clm_A.print_rulelist()
if clm_A.get_rulecount() >= 1: clm_A.print_rule(1)
if clm_A.get_rulecount() >= 2: clm_A.print_rule(2)


Cleverminer version 1.2.4.
Starting data preparation ...
Automatically reordering numeric categories ...
Automatically reordering numeric categories ...done
Encoding columns into bit-form...
Encoding columns into bit-form...done
Data preparation finished.
Will go for  4ftMiner
Starting to mine rules.
Done. Total verifications : 3036, rules 9, times: prep 0.0
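
The quantifiers in the task above can be read off a fourfold table with a = A∧S, b = A∧¬S, c = ¬A∧S, d = ¬A∧¬S. The sketch below assumes the usual definitions Base = a, conf = a/(a+b), and aad = conf/P(S) − 1; how CleverMiner computes them internally is not shown in this notebook, so treat the formulas as an assumption. The counts come from the SEVERITY_BIN × DAYPART crosstab printed earlier.

```python
def fourfold_quantifiers(a, b, c, d):
    """Return Base, conf and aad for a rule A => S from its fourfold table."""
    n = a + b + c + d
    conf = a / (a + b)         # P(S | A)
    baseline = (a + c) / n     # P(S)
    aad = conf / baseline - 1  # relative lift of conf over the baseline
    return {"base": a, "conf": conf, "aad": aad}

# Rule DAYPART(Night) => Serious/Fatal
a = 1094               # Night and Serious/Fatal
b = 4569 - 1094        # Night, not Serious/Fatal
c = 7881 - 1094        # other dayparts, Serious/Fatal
d = 54745 - 4569 - c   # everything else
print(fourfold_quantifiers(a, b, c, d))
```

Under these definitions, Night alone reaches a conf of about 0.24, which is well above the 0.144 baseline but still below the 0.28 threshold of the task, so rules that pass need additional conditions in the antecedent.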

Analysis B: SD4ft-Miner

Question: "Does the effect of alcohol on accident severity differ between weekends and weekdays?"

In [52]:
# %% Analysis B: SD4ft-Miner (Weekend vs Weekday comparison)
clm_B = cleverminer(
    df=df_cm,
    proc='SD4ftMiner',
    quantifiers={
        'FrstBase': 200,   # min |A∧S| in the first set (e.g. Weekend)
        'ScndBase': 200,   # min |A∧S| in the second set (e.g. Weekday)
        'Ratioconf': 1.25  # required ratio of P(S|A), first / second (>= 1.25 means +25 %)
        # alternative: 'Deltaconf': 0.05
    },
    ante={
        'attributes':[
            {'name':'ALCHL_IMNAME','type':'subset','minlen':1,'maxlen':1}
        ],
        'minlen':1,'maxlen':1,'type':'con'
    },
    succ={
        'attributes':[
            {'name':'SEVERITY_ORD','type':'lcut','minlen':1,'maxlen':2}
        ],
        'minlen':1,'maxlen':1,'type':'con'
    },
    frst={
        'attributes':[
            {'name':'WEEKEND','type':'subset','minlen':1,'maxlen':1}
            # the miner enumerates the WEEKEND categories for the first set (e.g. "Weekend")
        ],
        'minlen':1,'maxlen':1,'type':'con'
    },
    scnd={
        'attributes':[
            {'name':'WEEKEND','type':'subset','minlen':1,'maxlen':1}
            # ...and for the contrasting second set (e.g. "Weekday")
        ],
        'minlen':1,'maxlen':1,'type':'con'
    }
)

clm_B.print_summary()
clm_B.print_rulelist()
if clm_B.get_rulecount() >= 1: clm_B.print_rule(1)


Cleverminer version 1.2.4.
Starting data preparation ...
Automatically reordering numeric categories ...
Automatically reordering numeric categories ...done
Encoding columns into bit-form...
Encoding columns into bit-form...done
Data preparation finished.
Will go for  SD4ftMiner
Starting to mine rules.
Done. Total verifications : 16, rules 2, times: prep 0.03sec, processing 0.01sec

CleverMiner task processing summary:

Task type : SD4ftMiner
Number of verifications : 16
Number of rules : 2
Total time needed : 00h 00m 00s
Time of data preparation : 00h 00m 00s
Time of rule mining : 00h 00m 00s


List of rules:
RULEID BASE1 BASE2 RatioConf DeltaConf Rule
     1   312   735    1.314    +0.006  ALCHL_IMNAME(No Alcohol Involved) => SEVERITY_ORD(Fatal) | --- : WEEKEND(Weekend) x WEEKEND(Weekday)
     2  1966  4532    1.343    +0.041 
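
The rule list reports both a ratio and a difference of confidences between the two sets. A minimal sketch of the two measures, assuming RatioConf = conf1/conf2 and DeltaConf = conf1 − conf2; the helper name is hypothetical and the counts are illustrative only, not taken from the dataset.

```python
def sd4ft_contrast(a1, n1, a2, n2):
    """Compare conf = a/n of one rule A => S evaluated on two subsets.

    a1, a2: rows satisfying A and S in the first/second set
    n1, n2: rows satisfying A in the first/second set
    """
    conf1, conf2 = a1 / n1, a2 / n2
    return {"ratioconf": conf1 / conf2, "deltaconf": conf1 - conf2}

# Illustrative counts only (NOT from the dataset)
res = sd4ft_contrast(a1=30, n1=1000, a2=20, n2=1000)
print(res)
```

This explains how rule 1 can show a RatioConf of 1.314 alongside a DeltaConf of only +0.006: when both confidences are small, a large relative difference corresponds to a tiny absolute one, so the two measures are best read together.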

Analysis C: 4ftMiner

"Which safety features reduce injury severity?"

In [53]:
# 1) Robust column lookup (datasets differ in naming)
def first_present(df, names):
    for n in names:
        if n in df.columns: return n
    return None

COLS = {
    'rest':  first_present(pers, ['REST_USE_IMNAME','REST_USE']),
    'helm':  first_present(pers, ['HELM_USE_IMNAME','HELM_USE']),
    'airb':  first_present(pers, ['AIR_BAG_IMNAME','AIR_BAG']),
    'eject': first_present(pers, ['EJECT_IMNAME','EJECTION','EJECT']),
    'inj_name': first_present(pers, ['INJ_SEV_IMNAME','INJ_SEVNAME']),
    'inj_num': 'INJ_SEV' if 'INJ_SEV' in pers.columns else None,
}

# 2) Target (succ): PERSON_SEV_BIN = None/Minor vs Serious/Fatal
def map_person_sev_from_name(x:str) -> str:
    if not isinstance(x, str): return "None/Minor"
    s = x.lower()
    if 'fatal' in s or 'suspected serious' in s or '(a)' in s or 'serious injury' in s:
        return 'Serious/Fatal'
    return 'None/Minor'

if COLS['inj_name']:
    sev_bin = pers[COLS['inj_name']].astype('string').map(map_person_sev_from_name)
elif COLS['inj_num']:
    # Numeric fallback: only codes 3 and 4 count as Serious/Fatal (assumed to be the (A) and (K)
    # levels); higher codes such as 9 are not treated as severe
    sev_bin = pd.to_numeric(pers[COLS['inj_num']], errors='coerce').map(lambda v: 'Serious/Fatal' if v in (3, 4) else 'None/Minor')
else:
    raise ValueError("No person-level severity column found (INJ_SEV).")

# 3) Build the PERS dataset for CleverMiner
df_pers = pd.DataFrame({'PERSON_SEV_BIN': sev_bin.fillna('None/Minor').astype('category')})

# add the safety features that exist
feat_map = {}
if COLS['rest']:  feat_map['RESTraint'] = pers[COLS['rest']]
if COLS['helm']:  feat_map['HELMET']    = pers[COLS['helm']]
if COLS['airb']:  feat_map['AIRBAG']    = pers[COLS['airb']]
if COLS['eject']: feat_map['EJECTION']  = pers[COLS['eject']]

for k, s in feat_map.items():
    df_pers[k] = s.astype('string').fillna('Unknown').astype('category')

print("Safety columns used:", [c for c in df_pers.columns if c!='PERSON_SEV_BIN'])
print({c: df_pers[c].nunique() for c in df_pers.columns})

# 4) 4ftMiner: look for A => PERSON_SEV_BIN = "None/Minor" (i.e. protective features associated with lower severity)
clm_safe = cleverminer(
    df=df_pers,
    proc='4ftMiner',
    quantifiers={
        'Base': 400,     # minimum size of A∧S; tune to the number of persons
        'conf': 0.85     # high probability of "None/Minor" (lower severity)
        # optionally add 'aad': +0.05
    },
    ante={
        'attributes':[
            *( [{'name':'RESTraint','type':'subset','minlen':1,'maxlen':1}] if 'RESTraint' in df_pers.columns else [] ),
            *( [{'name':'HELMET','type':'subset','minlen':1,'maxlen':1}]    if 'HELMET'    in df_pers.columns else [] ),
            *( [{'name':'AIRBAG','type':'subset','minlen':1,'maxlen':1}]    if 'AIRBAG'    in df_pers.columns else [] ),
            *( [{'name':'EJECTION','type':'subset','minlen':1,'maxlen':1}]  if 'EJECTION'  in df_pers.columns else [] ),
        ],
        'minlen':1,'maxlen':2,'type':'con'
    },
    succ={
        'attributes':[
            {'name':'PERSON_SEV_BIN','type':'subset','minlen':1,'maxlen':1}
        ],
        'minlen':1,'maxlen':1,'type':'con'
    }
)

# Safe printing of the top rules
def cm_rulecount(cm):
    rc = cm.get_rulecount()
    return int(rc) if isinstance(rc, int) else 0

def cm_print_top(cm, k:int=3):
    rc = cm_rulecount(cm)
    if rc==0:
        print("No rules."); return
    cm.print_rulelist()
    for i in range(1, min(k, rc)+1):
        cm.print_rule(i)

print("\n=== Safety features → lower severity ===")
clm_safe.print_summary()
cm_print_top(clm_safe, k=5)

Safety columns used: ['RESTraint', 'HELMET', 'AIRBAG', 'EJECTION']
{'PERSON_SEV_BIN': 2, 'RESTraint': 14, 'HELMET': 8, 'AIRBAG': 10, 'EJECTION': 5}
Cleverminer version 1.2.4.
Starting data preparation ...
Automatically reordering numeric categories ...
Automatically reordering numeric categories ...done
Encoding columns into bit-form...
Encoding columns into bit-form...done
Data preparation finished.
Will go for  4ftMiner
Starting to mine rules.
Done. Total verifications : 146, rules 59, times: prep 0.06sec, processing 0.03sec

=== Safety features → lower severity ===

CleverMiner task processing summary:

Task type : 4ftMiner
Number of verifications : 146
Number of rules : 59
Total time needed : 00h 00m 00s
Time of data preparation : 00h 00m 00s
Time of rule mining : 00h 00m 00s


List of rules:
RULEID BASE  CONF  AAD    Rule

Analysis D: "How does a person's age affect injury severity?"

In [54]:
# %% PERS-level: "How does a person's age affect injury severity?"  (fixed)

import pandas as pd
import numpy as np
from cleverminer import cleverminer

# === 1) Age into 7 intervals ===
if "AGE" in pers.columns:
    age_num = pd.to_numeric(pers["AGE"], errors="coerce")
elif "AGE_IMNAME" in pers.columns:
    # extract the first number from strings such as "25 years" or "35-44"
    s = pers["AGE_IMNAME"].astype(str)
    age_num = pd.to_numeric(s.str.extract(r"(\d+)")[0], errors="coerce")
else:
    raise ValueError("No age column found in the dataset (AGE/AGE_IMNAME).")

# binning: at most 7 categories; values outside (-1, 120] become NaN and are dropped in step 3
bins = [-1, 15, 24, 34, 44, 54, 64, 120]
labels = ["0–15","16–24","25–34","35–44","45–54","55–64","65+"]
age_band = pd.cut(age_num, bins=bins, labels=labels)
age_band = age_band.cat.remove_unused_categories()

# === 2) Target: injury severity (None/Minor vs Serious/Fatal) ===
inj_col = None
for c in ["INJ_SEV_IMNAME","INJ_SEVNAME","INJ_SEV"]:
    if c in pers.columns:
        inj_col = c
        break
if inj_col is None:
    raise ValueError("No injury severity column found (INJ_SEV*).")

if "NAME" in inj_col.upper():
    def map_person_sev_from_name(x:str) -> str:
        if not isinstance(x,str): return "None/Minor"
        s = x.lower()
        if "fatal" in s or "serious" in s:
            return "Serious/Fatal"
        return "None/Minor"
    sev_bin = pers[inj_col].astype(str).map(map_person_sev_from_name)
else:
    # Numeric fallback: only codes 3 and 4 count as Serious/Fatal (assumed (A) and (K) levels)
    sev_bin = pd.to_numeric(pers[inj_col], errors="coerce").map(
        lambda v: "Serious/Fatal" if v in (3, 4) else "None/Minor"
    )

# === 3) Dataset for the miner ===
df_age = pd.DataFrame({
    "AGE_BAND": age_band,
    "PERSON_SEV_BIN": sev_bin
}).dropna()

print(df_age["AGE_BAND"].value_counts())

# === 4) 4ftMiner: which age bands deviate from the average severity ===
clm_age = cleverminer(
    df=df_age,
    proc="4ftMiner",
    quantifiers={"Base":400, "aad":0.05},
    ante={
        "attributes":[{"name":"AGE_BAND","type":"seq","minlen":1,"maxlen":3}],
        "minlen":1,"maxlen":1,"type":"con"
    },
    succ={
        "attributes":[{"name":"PERSON_SEV_BIN","type":"subset","minlen":1,"maxlen":1}],
        "minlen":1,"maxlen":1,"type":"con"
    }
)

print("\n=== Age → severity ===")
clm_age.print_summary()
clm_age.print_rulelist()
rc_age = int(clm_age.get_rulecount() or 0)
if rc_age > 0:
    clm_age.print_rule(1)


AGE_BAND
16–24    26467
25–34    25574
35–44    18533
45–54    15410
55–64    13502
65+      11662
0–15     10105
Name: count, dtype: int64
Cleverminer version 1.2.4.
Starting data preparation ...
Automatically reordering numeric categories ...
Automatically reordering numeric categories ...done
Encoding columns into bit-form...
Encoding columns into bit-form...done
Data preparation finished.
Will go for  4ftMiner
Starting to mine rules.
Done. Total verifications : 36, rules 7, times: prep 0.03sec, processing 0.02sec

=== Age → severity ===

CleverMiner task processing summary:

Task type : 4ftMiner
Number of verifications : 36
Number of rules : 7
Total time needed : 00h 00m 00s
Time of data preparation : 00h 00m 00s
Time of rule mining : 00h 00m 00s


List of rules:
RULEID BASE  CONF  AAD    Rule
     1  2019 0.079 +0.080 AGE_B