In [2]:
import thinkplot
import thinkstats2
import pandas as pd
import numpy as np
from sklearn.metrics import mean_squared_error
import scipy.stats as ss
from fractions import Fraction

##Seaborn for fancy plots. 
import matplotlib.pyplot as plt
import seaborn as sns
plt.rcParams["figure.figsize"] = (15,5)

# Probability and Bayes

There are a couple of different ways to understand the ideas we've looked at in statistics:
<ul>
<li> Tabulating the data - we've looked at everything in terms of totals, X value occurs Y times in our dataset. 
<li> Probability - the likelihood that X value occurs. 
</ul>

These two are different ways of thinking about the same thing. If we are counting we may say that 10 out of 100 people in our sample have red hair. If we are thinking probabilistically we may say that the probability that someone has red hair is 10%. 

In general, the approach we are taking is more directly transferable to data science, as we generally deal with datasets and counts of what happens in those datasets. There is another branch of statistics, Bayesian Statistics, that focuses on probability and solves the same challenges we are solving through that lens. We don't need to be experts in Bayesian Stats, but having some exposure to the probabilistic view will help our understanding. It will also likely help with our ability to deal with probability and stats in real life, as we often will see and use odds, probability, and summation when talking about real world events. As evidenced by the last few years, much of society really struggles with understanding the likelihood that something will happen. 

#### What Are The Odds my Car Gets Stolen?

We have a dataset here on cars in a (very sketchy) part of a city. Let's look at which cars were stolen, which were not, and calculate the probabilities. 

We generally write the probability that something will happen as P(thing), where the value is on a 0 to 1 scale. 

<ul>
<li>P(stolen)
<li>P(S)
</ul>

In [3]:
df = pd.read_csv("data/vehicle_stolen_dataset.csv", names=["ID", "Make", "Color", "Time", "Stolen"])
df.head(20)

Unnamed: 0,ID,Make,Color,Time,Stolen
0,N001,BMW,black,night,yes
1,N002,Audi,black,night,no
2,N003,NISSAN,black,night,yes
3,N004,VEGA,red,day,yes
4,N005,BMW,blue,day,no
5,N006,Audi,black,day,yes
6,N007,VEGA,red,night,no
7,N008,Audi,blue,day,yes
8,N009,VEGA,black,day,yes
9,N010,NISSAN,blue,day,no


In [4]:
#Calculate how many cars were stolen divided by how many cars
pStolen = len(df[df["Stolen"]=="yes"])/len(df)
pStolen

0.65

There are 20 total cars
13 of those cars were stolen. 

The probability (fraction of the whole) of cars being stolen is 13/20 = 65% = ...
<ul>
<li>P(Stolen) = .65, or
<li>P(S) = .65
</ul>
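
The count-then-divide step can be wrapped in a small helper. This is a minimal sketch in plain Python, using a made-up list of labels rather than the notebook's CSV (the 13 "yes" out of 20 simply mirror the counts above); `probability` and `stolen_labels` are hypothetical names introduced here for illustration.

```python
from fractions import Fraction

def probability(outcomes, event):
    """Fraction of outcomes for which event(outcome) is true."""
    hits = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(hits, len(outcomes))

# Hypothetical stand-in for the "Stolen" column: 13 "yes" and 7 "no".
stolen_labels = ["yes"] * 13 + ["no"] * 7

p_stolen = probability(stolen_labels, lambda label: label == "yes")
print(p_stolen)         # 13/20
print(float(p_stolen))  # 0.65
```

Keeping the result as a `Fraction` preserves the "13 out of 20" reading; converting to `float` gives the familiar decimal.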

#### What are the chances my car is a BMW?

We can calculate this probability for anything that we can count...

In [5]:
#BMWs divided by number of cars
pBMW = len(df[df["Make"]=="BMW"])/len(df)
pBMW

0.3

<ul>
<li>P(BMW) = .3
<li>P(B) = .3
</ul>

#### Chances of Both?

Where things start getting interesting is when we look at multiple things at a time. What are the chances that both of the above are true? A car is a BMW and stolen.

In [6]:
#How many rows are both true, divided by total
pBoth = len(df[(df["Make"]=="BMW") & (df["Stolen"]=="yes")])/len(df)
pBoth

0.2

<ul>
<li>P(B and S) = .2
<li>P(S and B) = .2
</ul>

## Conditional Probability

#### What if I have a BMW? Then how likely is it my car gets stolen?

This requires another concept - conditional probability. In order to compute this I need to calculate the probability that my car gets stolen, GIVEN the assumption that my car is a BMW. Expressed in probability notation:
<ul>
<li>P(S|BMW) = ?
<li>P(S|B) = ?
</ul>

When expressing conditional probability we use a vertical bar, sometimes called a pipe, inside the probability notation. The first value is the thing we are getting the probability of; the second value is the condition. So we can think of it as "if the second part is true, what is the probability of the first part?"

So P(S|B) is the probability that S is true, given the assertion that B is true. To calculate:
<ul>
<li>Select all the results where the make is BMW.
<li>Of those, how many are stolen?
</ul>

In [7]:
#If I have a BMW, whats the prob of it being stolen?
df_BMW = df[df["Make"]=="BMW"]
pSB = len(df_BMW[df_BMW["Stolen"]=="yes"])/len(df_BMW)
pSB

0.6666666666666666

The probability of my BMW being stolen is a little higher! In other words, once we know that the car we are looking at is a BMW, we can update our prediction to reflect the totality of our new knowledge. 
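
The two-step recipe (restrict to the rows where the condition holds, then count the event among them) can be sketched as a reusable function. The rows below are a hypothetical toy sample, not the notebook's dataset, and the helper names are made up for illustration.

```python
from fractions import Fraction

# Hypothetical toy rows of (make, stolen); not the notebook's dataset.
cars = [
    ("BMW", "yes"), ("BMW", "yes"), ("BMW", "no"),
    ("Audi", "yes"), ("Audi", "yes"), ("VEGA", "no"),
]

def conditional_probability(rows, event, given):
    """P(event | given): keep rows where `given` holds, then count `event`."""
    subset = [row for row in rows if given(row)]
    if not subset:
        raise ValueError("the condition never occurs, so the probability is undefined")
    hits = sum(1 for row in subset if event(row))
    return Fraction(hits, len(subset))

def is_bmw(row):
    return row[0] == "BMW"

def is_stolen(row):
    return row[1] == "yes"

# Of the 3 toy BMWs, 2 were stolen.
print(conditional_probability(cars, is_stolen, is_bmw))  # 2/3
```

The explicit check for an empty subset matters: if no row satisfies the condition, there is nothing to divide by.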

### Bayes(ish) Theorem 1

We can calculate these probabilities slightly more efficiently than all that dataframe manipulation.
Probability of my car being stolen if it is a BMW:

pSB = P(Stolen and BMW)/P(BMW) = pBoth/pBMW

This is theorem #1:

In general form: $$P(A|B) = \frac{P(A~\mathrm{and}~B)}{P(B)}$$

in code...

In [8]:
#Alternate calculation
pSB2 = pBoth/pBMW
pSB2 == pSB

False

#WHAT?????????????????????????????????????

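Probabilities are naturally fractions, but Python stores them as floating point numbers (decimals), which are rounded approximations. Here 0.2/0.3 and 4/6 round differently in the last digit, so == reports them as unequal. We can use exact fractions to make it easier!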

In [10]:
#Redo with fractions
pBothF = Fraction(len(df[(df["Make"]=="BMW") & (df["Stolen"]=="yes")]), len(df))
pBMWF = Fraction(len(df[df["Make"]=="BMW"]), len(df))
pSB2F = pBothF/pBMWF
print(pSB2F) #print() displays a Fraction as numerator/denominator
print(Fraction(4,5)) #Another example of how a Fraction prints

2/3
4/5
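
When exact fractions are not needed, another common option is to compare floats with a tolerance instead of `==`. The sketch below uses the standard library's `math.isclose` and the counts from above (4 stolen BMWs, 6 BMWs, 20 cars); the variable names are chosen here for illustration.

```python
import math
from fractions import Fraction

p_both = 4 / 20   # P(BMW and Stolen) as a float
p_bmw = 6 / 20    # P(BMW) as a float
direct = 4 / 6    # stolen BMWs divided by all BMWs

print(p_both / p_bmw == direct)              # False: the floats differ in the last digit
print(math.isclose(p_both / p_bmw, direct))  # True: equal within a relative tolerance
print(Fraction(4, 20) / Fraction(6, 20) == Fraction(4, 6))  # True: exact arithmetic
```

`Fraction` is the better fit when the inputs are counts, as they are here; `math.isclose` is handy once the values are already floats.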


What about the other way around? If my car is stolen, what are the chances that it is a BMW?

P(BMW | Stolen) = P(BMW and STOLEN)/P(STOLEN)

In [13]:
df_Stol = df[df["Stolen"] == "yes"]
pBS = Fraction(len(df_Stol[df_Stol["Make"] == "BMW"]), len(df_Stol))
#pBS = len(df_Stol[df_Stol["Make"] == "BMW"])/ len(df_Stol)
print(pBS)

4/13


### Bayes(ish) Theorem 2

Note - these two conditional probabilities are different. The probability that a car is STOLEN and a BMW is the same as the probability that it is a BMW and STOLEN. But the probability that a car is stolen GIVEN it is a BMW (2/3) is not the same as the probability that a car is a BMW GIVEN that it is stolen (4/13). 

We can manipulate Theorem 1 by multiplying both sides by P(BMW). We get:

P(Stolen and BMW) = P(BMW)*P(S|B)

This is Theorem 2:

In nice print, generally: $$P(A~\mathrm{and}~B) = P(B) ~ P(A|B)$$

In [14]:
#Check
print(pBothF)
print(pBMWF*pSB2F)

1/5
1/5


### Bayes Theorem

We know, from intuition and from above, that conjunction (and) works in either direction: P(Stolen and BMW) = P(BMW and STOLEN)

Written in general form:
$$P(A~\mathrm{and}~B) = P(B~\mathrm{and}~A)$$

If we apply Theorem 2 from above to both sides, we have: P(BMW)*P(STOLEN | BMW) = P(STOLEN)*P(BMW | STOLEN)

Written in general form:
$$P(B) P(A|B) = P(A) P(B|A)$$

Each side is a different way to compute the same conjunction, so if we know three of the four terms we can solve for the one we don't know - like either conditional:

1. You can check $B$ first, then $A$ conditioned on $B$, or

2. You can check $A$ first, then $B$ conditioned on $A$.

If we divide through by $P(B)$, we get Theorem 3:

$$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$$

That is Bayes's Theorem.

In [15]:
#What is the prob of a car being red, given that it is a nissan. 
#What is the prob that a nissan is red. 

#Probability of stolen given that you have a BMW
#First, get pStolen in fraction form
pSF = Fraction(len(df[df["Stolen"]=="yes"]), len(df))

#Calculate numerator
numerator = pSF*pBS

print(numerator/pBMWF)

2/3
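
The same calculation can be wrapped in a small function so it is reusable for any pair of events. This is a sketch, with `bayes` and its parameter names invented here for illustration; the inputs are the fractions computed above (13 of 20 cars stolen, 4 of the 13 stolen cars are BMWs, 6 of 20 cars are BMWs).

```python
from fractions import Fraction

def bayes(prior_a, likelihood_b_given_a, evidence_b):
    """Return P(A|B) = P(A) * P(B|A) / P(B)."""
    if evidence_b == 0:
        raise ValueError("P(B) must be non-zero")
    return prior_a * likelihood_b_given_a / evidence_b

p_stolen = Fraction(13, 20)           # P(S)
p_bmw_given_stolen = Fraction(4, 13)  # P(B|S)
p_bmw = Fraction(6, 20)               # P(B)

print(bayes(p_stolen, p_bmw_given_stolen, p_bmw))  # 2/3
```

The result matches the 2/3 obtained earlier by filtering the BMW rows directly.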


### The Law of Total Probability

In order to make all this stuff work, we need one other thing - the law of total probability.
Here's one form of the law, expressed in mathematical notation:

$$P(A) = P(B_1~\mathrm{and}~A) + P(B_2~\mathrm{and}~A)$$

This holds as long as $B_1$ and $B_2$ satisfy two conditions:

- They are mutually exclusive: only one can be true at a time.

- They are collectively exhaustive: one of them must be true, so their probabilities sum to 1. 
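
Combining this law with Theorem 2 gives each term as $P(B_i) P(A|B_i)$. The sketch below checks that on the car data by splitting the cars into BMW and not BMW. The function name is made up for illustration, and the "not BMW" counts are derived from the totals above (20 - 6 = 14 other cars, 13 - 4 = 9 of them stolen) rather than read from the dataset.

```python
from fractions import Fraction

def total_probability(priors, likelihoods):
    """P(A) = sum of P(B_i) * P(A | B_i) over a partition B_1, ..., B_n."""
    if sum(priors) != 1:
        raise ValueError("the B_i must be exhaustive: their probabilities must sum to 1")
    return sum(p_b * p_a_given_b for p_b, p_a_given_b in zip(priors, likelihoods))

# B_1 = BMW, B_2 = not BMW; A = stolen.
priors = [Fraction(6, 20), Fraction(14, 20)]     # P(BMW), P(not BMW)
likelihoods = [Fraction(4, 6), Fraction(9, 14)]  # P(S | BMW), P(S | not BMW)

print(total_probability(priors, likelihoods))  # 13/20
```

The two pieces, 4/20 and 9/20, add back up to the overall P(Stolen) of 13/20.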

Bayes is useful because whichever condition we don't know can be calculated - it becomes much more useful when we don't have all the data, like we do here. If we have all of the data, we can just count. More on that later. 

Another example...

On the Titanic, what is the probability that someone who died is from third class? Let's set up our equation

p(3rd | Dead) = p(3rd)*p(Dead | 3rd)/p(Dead)

In [16]:
#The titanic is built into Seaborn, to make it easy.
titanic = sns.load_dataset('titanic')
titanic.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


In [17]:
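#Calculate some fractions
#Sum each boolean mask to count matching rows (len() of a mask is just the total row count)
pD = Fraction(int((titanic["survived"] == 0).sum()), len(titanic))
#Use the numeric pclass column; the "class" column holds labels like "Third"
p3 = Fraction(int((titanic["pclass"] == 3).sum()), len(titanic))
tmp = titanic[titanic["pclass"] == 3]
pD3 = Fraction(int((tmp["survived"] == 0).sum()), len(tmp))


In [18]:
print((pD3*p3)/pD)

124/183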


Damn, that's tuff. 

So, if you're going on a boat, be rich! This key fact will be on the final. 

### Exercise - Will My German Car Get Stolen?

What is the probability of your car being stolen if it is a German car?

Note: In our data the German cars are BMWs and Audis. 

P(Stolen | German) = P(Stolen) * P(German | Stolen) / P(German)

In [20]:
#What is the probability of your car being stolen if it is German (i.e. BMW or Audi)?
#P(Stolen) = pSF
#P(German) = P(Audi or BMW) = P(Audi) + P(BMW), since a car has only one make
pAudi = Fraction(len(df[df["Make"]=="Audi"]), len(df))
pGerman = pAudi + pBMWF
pGerman

Fraction(3, 5)

In [27]:
#P(German | Stolen): among the stolen cars, the fraction that are BMWs or Audis
tmp = df[df["Stolen"]=="yes"]
tmpBMW = Fraction(len(tmp[tmp["Make"]=="BMW"]), len(tmp))
tmpAudi = Fraction(len(tmp[tmp["Make"]=="Audi"]), len(tmp))
pGermanStolen = tmpBMW + tmpAudi
print(pGermanStolen)

8/13


In [28]:
#Bayes: P(Stolen | German) = P(Stolen) * P(German | Stolen) / P(German)
print((pSF*pGermanStolen)/pGerman)

2/3


In [None]:
#What is the probability of a theft happening at night?

As we can see above, when we have all the data, we can calculate all the probabilities directly by counting. Bayes is more useful when we don't have all of it. That's next time. In the meantime, something to consider...


## The Monty Hall Problem

The Monty Hall problem is based on a game show called *Let's Make a Deal*. If you are a contestant on the show, here's how the game works:

* The host, Monty Hall, shows you three closed doors -- numbered 1, 2, and 3 -- and tells you that there is a prize behind each door.

* One prize is valuable (traditionally a car), the other two are less valuable (traditionally goats).

* The object of the game is to guess which door has the car. If you guess right, you get to keep the car.

The key - after you pick a door, Monty will open another, revealing a goat. Then Monty offers you the option to stick with your original choice or switch to the remaining unopened door.

Do you switch?
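
One way to test your intuition before working through the probabilities is to simulate the game many times. The sketch below is a minimal simulation under two stated assumptions: the car is placed uniformly at random, and when Monty has two goat doors to choose from he picks one at random. The function names, trial count, and seed are arbitrary choices made for illustration.

```python
import random

def play(switch, rng):
    """Play one round; return True if the contestant ends up with the car."""
    doors = [1, 2, 3]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # Monty opens a door that is neither the contestant's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the one door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

def win_rate(switch, trials=100_000, seed=0):
    """Fraction of simulated rounds won with the given strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials

print("stay:  ", win_rate(switch=False))
print("switch:", win_rate(switch=True))
```

Make your guess first, then run it and compare the two win rates.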