<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Linda-the-Banker" data-toc-modified-id="Linda-the-Banker-0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span>Linda the Banker</a></span></li><li><span><a href="#Probability" data-toc-modified-id="Probability-0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span>Probability</a></span></li><li><span><a href="#Fraction-of-Bankers" data-toc-modified-id="Fraction-of-Bankers-0.3"><span class="toc-item-num">0.3&nbsp;&nbsp;</span>Fraction of Bankers</a></span></li></ul></li><li><span><a href="#The-Probability-Function" data-toc-modified-id="The-Probability-Function-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>The Probability Function</a></span></li><li><span><a href="#Political-Views-and-Parties" data-toc-modified-id="Political-Views-and-Parties-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Political Views and Parties</a></span></li></ul></div>

## Linda the Banker

<blockquote>
    So there's a woman named Linda. Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.
    Which is more probable?
    
    1. Linda is a bank teller.
    2. Linda is a bank teller and is active in the feminist movement.
</blockquote>

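At first glance, the second statement seems more probable because it fits the description of Linda better. It cannot be, though: every bank teller who is active in the feminist movement is also a bank teller, so the first statement is at least as probable as the second.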

## Probability

So, what is probability? That turns out to be a surprisingly difficult question, as the Wikipedia article on [probability interpretations](https://en.wikipedia.org/wiki/Probability_interpretations) shows.
To avoid getting stuck before we start, we will use a simple definition for now and refine it later: a probability is a fraction of a finite set.
So, if we survey 1000 people and 20 of them are bank tellers, the fraction that work as bank tellers is 0.02, or 2%. If we choose a person from this population at random, the probability that they are a bank teller is 2%. Here "at random" means that every person in the dataset has the same chance of being chosen. (This might not always be the case, though.)
With this definition and an appropriate dataset, we can compute probabilities by counting. For a demonstration, we'll use data from the [General Social Survey](http://gss.norc.org/) (GSS).
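
Before turning to the real data, the counting definition can be illustrated on a made-up population. This is only a sketch: the list `population` and its labels are hypothetical and simply mirror the 1000-person example from the text.

```python
# Hypothetical survey: 1000 respondents, 20 of whom are bank tellers.
population = ["bank teller"] * 20 + ["other"] * 980

# Probability as a fraction of a finite set:
# count the matching members and divide by the size of the set.
n_tellers = sum(1 for job in population if job == "bank teller")
fraction = n_tellers / len(population)

print(n_tellers, fraction)
```

If every person is equally likely to be chosen, this fraction is the probability of picking a bank teller.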

In [1]:
import pandas as pd
gss = pd.read_csv('gss_bayes.csv', index_col=0)
gss.head()

caseid,year,age,sex,polviews,partyid,indus10
1,1974,21.0,1,4.0,2.0,4970.0
2,1974,41.0,1,5.0,0.0,9160.0
5,1974,58.0,2,6.0,1.0,2670.0
6,1974,30.0,1,5.0,4.0,6870.0
7,1974,48.0,1,5.0,4.0,7860.0


The `DataFrame` has one row for each person surveyed and one column for each variable selected.

The columns are

- `caseid`: Respondent id (which is the index of the table).
- `year`: Year when the respondent was surveyed.
- `age`: Respondent's age when surveyed.
- `sex`: Male or female.
- `polviews`: Political views on a range from liberal to conservative.
- `partyid`: Political party affiliation, Democrat, Independent, or Republican.
- `indus10`: [Code](https://gssdataexplorer.norc.org/variables/17/vshow) for the industry the respondent works in.

Let's look at these variables in more detail, starting with `indus10`.

## Fraction of Bankers

The code for "Banking and related activities" is 6870, so we can select bankers like this:

![gss_banking](/home/scratch/ThinkBayes2/Screenshot_20221018_141110.png)



In [3]:
banker = (gss['indus10'] == 6870)
banker.head()

caseid
1    False
2    False
5    False
6     True
7    False
Name: indus10, dtype: bool

The result is a Pandas `Series` that contains the Boolean values `True` or `False`.

If we use the `sum` function on this `Series`, it treats `True` as 1 and `False` as 0, so the total is the number of bankers.

In [4]:
banker.sum()

728

To compute the *fraction* of bankers, we can use `mean`, which for a Boolean `Series` gives the fraction of values that are `True`.

In [5]:
banker.mean()

0.014769730168391155

About 1.5% of the respondents work in the banking industry, so if we choose a random person from the dataset, the probability that they are a banker is about 1.5%.
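
The relationship between `sum` and `mean` on a Boolean `Series` can be checked on a small example. This is a minimal sketch with a hypothetical five-element `Series` named `flags`, not the GSS data.

```python
import pandas as pd

# Hypothetical Boolean Series: True marks a banker.
flags = pd.Series([False, False, False, True, False])

count = flags.sum()      # True counts as 1, False as 0
fraction = flags.mean()  # fraction of values that are True

# The mean is the count divided by the number of elements.
print(count, fraction, count / len(flags))
```

For a Boolean `Series` the mean is the sum divided by the length, which is why `mean` gives the fraction directly.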

# The Probability Function

Now we'll put the code from the previous section in a function that takes a Boolean series and returns a probability:

In [6]:
def prob(A):
    """Computes the probability of a proposition, A"""
    return A.mean()

So we can compute the fraction of bankers like this:


In [7]:
prob(banker)

0.014769730168391155

Now let's look at another variable in this dataset. The values of the column `sex` are encoded as `1` for male and `2` for female.

So we can make a Boolean series that is `True` for female respondents and `False` otherwise.

In [8]:
female = (gss['sex'] == 2)

In [9]:
female.head()

caseid
1    False
2    False
5     True
6    False
7    False
Name: sex, dtype: bool

We can use this Boolean series to compute the fraction of respondents who are women.

In [10]:
prob(female)

0.5378575776019476
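
With Boolean series for both bankers and women, the question about Linda can be revisited by counting. The following is a sketch on hypothetical toy data (the six-row `toy` table is invented for illustration), and it assumes the conjunction "banker and female" is expressed with the element-wise `&` operator on Boolean `Series`.

```python
import pandas as pd

def prob(A):
    """Computes the probability of a proposition, A."""
    return A.mean()

# Hypothetical toy data using the same column names and codes as the GSS table.
toy = pd.DataFrame({
    'indus10': [6870, 4970, 6870, 9160, 6870, 2670],
    'sex':     [2,    1,    1,    2,    2,    1],
})

toy_banker = (toy['indus10'] == 6870)
toy_female = (toy['sex'] == 2)

p_banker = prob(toy_banker)             # 3 of 6 rows
p_both = prob(toy_banker & toy_female)  # 2 of 6 rows

print(p_banker, p_both)
```

Every row that satisfies both conditions also satisfies the first, so the fraction for the conjunction can never exceed the fraction of bankers alone.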

# Political Views and Parties

The other variables considered are `polviews`, which describes the political views of the respondents, and `partyid`, which describes their affiliation with a political party.

The values of `polviews` are on a seven-point scale: