## The Bus Failure Example

We will practice joint, marginal, and conditional probability.

Imagine we have a dataset recording the number of buses that failed last year, broken down by season, in three U.S. metropolitan areas (Seattle, Chicago, and Miami).

In [1]:
%%writefile bus-failures.txt
city	winter	spring	summer	fall
Seattle	18	9	3	8
Chicago	36	13	14	5
Miami	3	6	11	8

Overwriting bus-failures.txt


In [2]:
import pandas as pd
df = pd.read_csv('bus-failures.txt', sep='\t', index_col='city')
df

city,winter,spring,summer,fall
Seattle,18,9,3,8
Chicago,36,13,14,5
Miami,3,6,11,8


In other words, the table tells us that 18 buses failed in Seattle in the winter of last year, while 11 buses failed in Miami in the summer.

The total number of buses that failed last year is:

In [3]:
Nfailures = df.sum().sum()
Nfailures

134

## Joint probability of a bus failure

Imagine you're the CEO of the (imaginary) "National Bus Transportation Company Inc" ("NBTC Inc") that operates the bus systems in these three cities.

If a bus breaks down sometime next year in one of the three cities (you don't know when or where), what is the probability that it fails in each (season, city) pair? Compute $p({\rm season}, {\rm city})$:

In [4]:
p = df / Nfailures
p

city,winter,spring,summer,fall
Seattle,0.134328,0.067164,0.022388,0.059701
Chicago,0.268657,0.097015,0.104478,0.037313
Miami,0.022388,0.044776,0.08209,0.059701


Note: in the above, we assume a bus has failed and ask about the probability that it failed in some specific (season, city) pair. We're not asking how probable it is for a bus to fail in the first place (for that, we'd also need to know the total number of buses in these cities).

Therefore, the probabilities across all (season, city) pairs must sum up to 1.

In [5]:
p.sum().sum()

0.9999999999999999

(within numerical precision).
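Because the sum is only equal to 1 up to floating-point rounding, an exact `== 1.0` comparison can fail. A minimal sketch of a tolerance-based check, assuming NumPy is available and rebuilding the table inline so the snippet runs on its own:

```python
import numpy as np
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
p = df / df.sum().sum()

total = p.sum().sum()
# Compare with a tolerance rather than testing total == 1.0
is_normalized = bool(np.isclose(total, 1.0))
print(is_normalized)
```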

How much more likely is it that the bus has failed in Chicago in winter than in Miami in the spring?

In [6]:
p.loc["Chicago", "winter"] / p.loc["Miami", "spring"]

6.0
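The common factor `Nfailures` cancels in this ratio, so the same number can be read directly off the raw counts. A short sketch of that check, with the table rebuilt inline so it is self-contained:

```python
import numpy as np
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
p = df / df.sum().sum()

# Ratio of joint probabilities ...
ratio_from_p = p.loc["Chicago", "winter"] / p.loc["Miami", "spring"]
# ... and the same ratio taken straight from the counts (36 / 6)
ratio_from_counts = df.loc["Chicago", "winter"] / df.loc["Miami", "spring"]

print(np.isclose(ratio_from_p, ratio_from_counts))
```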

## Probability the bus failed in some particular season

You're NBTC's Chief Mechanic, and would like to know when you'll need more or fewer mechanics to service the buses nationwide.

If a bus fails next year, what is the probability it will fail in some particular season -- $p({\rm season})$?

In [7]:
p_season = p.sum()
p_season

winter    0.425373
spring    0.208955
summer    0.208955
fall      0.156716
dtype: float64

This is the marginal probability, with 'city' (the uninteresting variable, for this purpose) marginalized out.
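To make the marginalization explicit: `p.sum()` with no arguments sums down the rows (`axis=0`), that is, over cities, leaving one value per season. A small sketch spelling out the axis and checking the result against the raw counts:

```python
import numpy as np
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
n_failures = df.sum().sum()
p = df / n_failures

# Sum over the row axis (cities): one probability per season
p_season = p.sum(axis=0)

# Equivalent route: total the counts per season first, then normalize
p_season_from_counts = df.sum(axis=0) / n_failures

print(np.allclose(p_season, p_season_from_counts))
```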

Based on the above, when do you need more mechanics, in the summer or in winter? Care to speculate why?

## Probability a bus will fail in any given city

You also want to understand the failures as a function of city, $p({\rm city})$, to know where to send more spare parts.

Compute the marginal distribution $p({\rm city})$:

In [8]:
p_city = p.sum(axis=1)
p_city

city
Seattle    0.283582
Chicago    0.507463
Miami      0.208955
dtype: float64

Which city requires the most maintenance?
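With only three cities the answer can be read off by eye, but for a longer table it may help to rank the marginal. A minimal sketch using `Series.sort_values` and `Series.idxmax`, with the table rebuilt inline:

```python
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
p = df / df.sum().sum()

# Sum over the column axis (seasons): one probability per city
p_city = p.sum(axis=1)

# Rank cities from most to least likely, and pick the top one
ranked = p_city.sort_values(ascending=False)
top_city = p_city.idxmax()
print(ranked)
print(top_city)
```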

## Probability a bus will fail in Seattle, as a function of season

Finally, imagine you're the Chief Regional Mechanic for Seattle. You're primarily interested in what will happen in Seattle next year as a function of season -- $p({\rm season} \,|\, {\rm city=Seattle})$.

Compute this distribution, and assess when you'll expect to have the most work:

In [15]:
p_season_given_seattle = p.loc["Seattle"] / p_city.loc["Seattle"]
p_season_given_seattle

winter    0.473684
spring    0.236842
summer    0.078947
fall      0.210526
Name: Seattle, dtype: float64

Where does this come from? Remember the definition of conditional probability:

$$ p({\rm season}, {\rm city}) = p({\rm season} \,|\, {\rm city}) \, p({\rm city}) $$

and therefore:

$$ p({\rm season} \,|\, {\rm city}) = \frac{p({\rm season}, {\rm city})}{p({\rm city})} $$

and because I've asked you specifically about Seattle, this is:

$$ p({\rm season} \,|\, {\rm city=Seattle}) = \frac{p({\rm season}, {\rm city=Seattle})}{p({\rm city=Seattle})} $$

which translates to the line above when spelled out with Pandas.
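The same formula can be applied to every city at once by dividing each row of the joint table by that city's marginal. A sketch using `DataFrame.div` with `axis=0` so that the city marginal is aligned on the row index (the name `p_season_given_city` is introduced here for illustration):

```python
import numpy as np
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
p = df / df.sum().sum()
p_city = p.sum(axis=1)

# p(season | city) = p(season, city) / p(city), one row per city
p_season_given_city = p.div(p_city, axis=0)

# Each row is a distribution over seasons, so each row sums to 1
row_sums = p_season_given_city.sum(axis=1)
print(p_season_given_city)
print(np.allclose(row_sums, 1.0))
```

The `Seattle` row of this table matches `p_season_given_seattle` computed in the cell above.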

Note that if you did everything correctly, the conditional probabilities should sum to 1:

In [16]:
p_season_given_seattle.sum()

1.0

## A final note: an unwritten condition

Go back and look at how the joint probability we started with was defined: I said "If a bus fails ... what is the probability it failed in a given season and city". If you look at that formulation, it's clear it's really a *conditional probability* (clauses like "If a bus fails...", or "Given X..." are the usual giveaways!).

Therefore, a more accurate way to write it down would have been:

$$p({\rm season}, {\rm city} \,|\, {\rm bus\, has\, failed})$$

and then carry the "bus has failed" tag throughout -- for example:

$$p({\rm season} \,|\, {\rm city=Seattle},\, {\rm bus\, has\, failed})$$

This is arduous (lots of writing!). Typically, such conditions that are implied throughout a problem are omitted and not written explicitly. As long as all probabilities are subject to the same condition, the math stays the same.
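To see the implied condition at work, here is a sketch that makes it explicit. The fleet size `n_buses` is a made-up number (the dataset does not contain it), and the sketch assumes each bus fails at most once per year so that `n_failures / n_buses` can be read as the probability that a bus fails. Dividing the unconditional joint by that probability recovers the table `p` used throughout:

```python
import numpy as np
import pandas as pd

# Same counts as in bus-failures.txt, rebuilt inline
df = pd.DataFrame(
    {"winter": [18, 36, 3], "spring": [9, 13, 6],
     "summer": [3, 14, 11], "fall": [8, 5, 8]},
    index=pd.Index(["Seattle", "Chicago", "Miami"], name="city"),
)
n_failures = df.sum().sum()
p = df / n_failures  # the table used throughout: p(season, city | failed)

n_buses = 2000  # HYPOTHETICAL fleet size, not part of the dataset

# Assumes each bus fails at most once per year
p_failed = n_failures / n_buses   # p(bus has failed)
p_joint_and_failed = df / n_buses  # p(season, city, bus has failed)

# Definition of conditional probability, with the condition written out
p_given_failed = p_joint_and_failed / p_failed

print(np.allclose(p_given_failed, p))
```

The hypothetical `n_buses` cancels, which is why the condition can be dropped from the notation without changing any of the numbers.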

This is very common: for example, any die-rolling example should technically say "if the die is fair" and "if the roller doesn't cheat", plus any other conditions describing the experimental setup. We usually write those out in the introduction of the problem, and omit them from the notation.