## Demo Week 3: Lecture 2

In [64]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

plt.style.use('fivethirtyeight')
sns.set_context("notebook")

#### Exercise

Suppose you're trying to meet your friend in either Central Park (denoted C) or Riverside Park (denoted R). 

Right now, you know that the chances of your friend being at $x$ are

| &nbsp; | Chance |
| --- | --- | 
| **x = C** | 0.4 |
| **x = R** | 0.6 | 

We want to determine the chance of your friend being at $y$ in 1 hour

| &nbsp; | Chance |
| --- | --- | 
| **y = C** | ? |
| **y = R** | ? | 

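a. Use ```numpy``` to create a one-dimensional array of length 2 called ```v``` holding the chances of being in Central Park and Riverside Park, in that order. NumPy treats it as a column vector when it is multiplied on the right of a matrix. Try testing the dimensions.

In [66]:
v = np.array([0.4, 0.6])  # [P(x = C), P(x = R)], in the same order as the table above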
v.shape

(2,)

b. Since your friend likes to wander around at random, you need a table of joint probabilities: each entry is the chance that your friend is at $x$ now and at $y$ in 1 hour. The last column and last row hold the row and column totals.

| &nbsp; | x = C | x = R | &nbsp; |
| --- | --- | --- | --- |
| **y = C** | 0.3 | 0.2 | 0.5 |
| **y = R** | 0.1 | 0.4 | 0.5 |
| &nbsp; | 0.4 | 0.6 | &nbsp; |

Use ```numpy``` to create a $2 \times 2$ matrix ```M``` holding the four joint probabilities. Try testing the dimensions.

In [67]:
M = np.array([
    [0.3,0.2],
    [0.1,0.4]])

M.shape

(2, 2)

Use the ```numpy``` sum method to sum the entries over columns and rows. What do you notice about these numbers?

In [68]:
print('Sum over Columns')
print(np.sum(M,axis=1))

print('Sum over Rows')
print(np.sum(M,axis=0))

Sum over Columns
[0.5 0.5]
Sum over Rows
[0.4 0.6]


In [69]:
print('P(y=C) + P(y=R) = ')
print(np.sum(M,axis=1).sum())

print('P(x=C) + P(x=R) = ')
print(np.sum(M,axis=0).sum())

P(y=C) + P(y=R) = 
1.0
P(x=C) + P(x=R) = 
1.0


Why should this be the case?

c. Using the definition of conditional probability, adjust the numbers in the tables to show the probability of $y$ given $x$

| &nbsp; | x = C | x=  R| &nbsp; |
| --- | --- | --- | --- |
| **y = C** | 0.3/0.4 | 0.2/0.6 | 
| **y = R** | 0.1/0.4 | 0.4/0.6 |


Adjust the probabilities in `M` to form `N` containing these four numbers 

In [23]:
N = np.multiply(M, np.array([[1/0.4, 1/0.6], [1/0.4, 1/0.6] ]))
print(N)

[[0.75       0.33333333]
 [0.25       0.66666667]]
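
The same conditional table can be obtained without typing the reciprocals by hand. The sketch below is one possible alternative, not the method used in the cell above: it divides `M` by its own column sums and relies on NumPy broadcasting. The name `col_sums` is introduced here only for illustration.

```python
import numpy as np

# Joint probabilities P(y, x): rows are y = C, R; columns are x = C, R.
M = np.array([[0.3, 0.2],
              [0.1, 0.4]])

# Column sums are the marginals P(x = C) and P(x = R).
col_sums = M.sum(axis=0)

# Broadcasting divides every entry by the sum of its own column,
# which is the definition of the conditional probability P(y | x).
N = M / col_sums

print(N)
print(N.sum(axis=0))  # each column of a conditional table should sum to 1
```

Checking that every column sums to 1 is a quick way to catch a division along the wrong axis.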


d. Note that $$P(y = C) = P(y = C \text{ and } x = C) + P(y = C \text{ and } x = R)$$ This can be rewritten as $$P(y = C) = P(y = C | x = C) P(x =C) + P(y = C | x = R) P(x = R)$$ Compute these numbers using the values in `N` and `v`.

In [21]:
v = np.array([0.4, 0.6])  # order must match the columns of N: [P(x = C), P(x = R)]
print(np.dot(N, v))

[0.5 0.5]


Therefore we have the probabilities for the location in 1 hour. They match the row totals (0.5 and 0.5) of the joint table in part b, as they should.

In [25]:
print('Probability of y = C is')
print(round(np.dot(N, v)[0], 4))

print('Probability of y = R is')
print(round(np.dot(N, v)[1], 4))

Probability of y = C is
0.5
Probability of y = R is
0.5


## Reading in DataFrames from Files

Pandas has a number of very useful file reading tools. You can see them enumerated by typing "pd.re" and pressing tab. We'll be using read_csv today. 

In [50]:
elections = pd.read_csv("~/shared/elections.csv")
elections # if we end a cell with an expression or variable name, the result will print

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
7,Clinton,Democratic,43.0,1992,win
8,Bush,Republican,37.4,1992,loss
9,Perot,Independent,18.9,1992,loss


We can use the head command to return only a few rows of a dataframe.

In [51]:
elections.head(7)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss


There is also a tail command.

In [52]:
elections.tail(7)

Unnamed: 0,Candidate,Party,%,Year,Result
16,Bush,Republican,50.7,2004,win
17,Obama,Democratic,52.9,2008,win
18,McCain,Republican,45.7,2008,loss
19,Obama,Democratic,51.1,2012,win
20,Romney,Republican,47.2,2012,loss
21,Clinton,Democratic,48.2,2016,loss
22,Trump,Republican,46.1,2016,win


The read_csv command lets us specify a column to use as an index. For example, we could have used Year as the index.

In [53]:
elections_year_index = pd.read_csv("elections.csv", index_col = "Year")
elections_year_index.head(5)

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss


Alternatively, we could have used the set_index command.

In [54]:
elections_party_index = elections.set_index("Party")
elections_party_index.head(5)

Unnamed: 0_level_0,Candidate,%,Year,Result
Party,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
Republican,Reagan,50.7,1980,win
Democratic,Carter,41.0,1980,loss
Independent,Anderson,6.6,1980,loss
Republican,Reagan,58.8,1984,win
Democratic,Mondale,37.6,1984,loss


The set_index command (along with most other DataFrame methods) does not modify the DataFrame it is called on; it returns a new one. That is, the original "elections" is untouched. Note: There is a flag called "inplace" which, when set to True, modifies the calling DataFrame instead.

In [55]:
elections.head() #the index remains unchanged

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss
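
To make the difference concrete, here is a small sketch on a three-row frame built from the rows shown above. The names `df`, `by_party`, and `df_copy` are introduced only for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson"],
    "Party": ["Republican", "Democratic", "Independent"],
    "Year": [1980, 1980, 1980],
})

by_party = df.set_index("Party")   # returns a new DataFrame
print(list(df.columns))            # "Party" is still an ordinary column of df
print(by_party.index.name)         # the new frame is indexed by Party

df_copy = df.copy()
df_copy.set_index("Party", inplace=True)  # modifies df_copy itself and returns None
print(df_copy.index.name)
```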


Index labels do not have to be unique: in `elections_year_index`, each year appears several times. By contrast, column names are ideally unique. For example, if we try to read in a file for which column names are not unique, Pandas will automatically rename any duplicates.
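
The renaming can be seen on a tiny in-memory file. This is a minimal sketch; the CSV text and the name `dup` are made up for the illustration.

```python
import io
import pandas as pd

# A CSV whose header repeats the column name "score".
csv_text = "name,score,score\nA,1,2\nB,3,4\n"

dup = pd.read_csv(io.StringIO(csv_text))
print(list(dup.columns))  # the second "score" gets a numeric suffix
```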

## The [] Operator

The DataFrame class has an indexing operator [] that lets you do a variety of different things. If you provide a string to the [] operator, you get back a Series corresponding to the requested column label.

In [56]:
elections_year_index.head(6)

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win


In [10]:
elections_year_index["Candidate"].head(6)

Year
1980      Reagan
1980      Carter
1980    Anderson
1984      Reagan
1984     Mondale
1988        Bush
Name: Candidate, dtype: object

The [] operator also accepts a list of strings. In this case, you get back a DataFrame corresponding to the requested strings.

In [11]:
elections_year_index[["Candidate", "Party"]].head(6)

Unnamed: 0_level_0,Candidate,Party
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1980,Reagan,Republican
1980,Carter,Democratic
1980,Anderson,Independent
1984,Reagan,Republican
1984,Mondale,Democratic
1988,Bush,Republican


A list of one label also returns a DataFrame. This can be handy if you want your results as a DataFrame, not a Series.

In [12]:
elections_year_index[["Candidate"]].head(6)

Unnamed: 0_level_0,Candidate
Year,Unnamed: 1_level_1
1980,Reagan
1980,Carter
1980,Anderson
1984,Reagan
1984,Mondale
1988,Bush


Note that we can also use the to_frame method to turn a Series into a DataFrame.

In [13]:
elections_year_index["Candidate"].to_frame().head(5)

Unnamed: 0_level_0,Candidate
Year,Unnamed: 1_level_1
1980,Reagan
1980,Carter
1980,Anderson
1984,Reagan
1984,Mondale


The [] operator also accepts numerical slices as arguments. In this case, we are indexing by row, not column!

In [14]:
elections_year_index[0:3]

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss


If you provide a single argument to the [] operator, it tries to use it as a name. This is true even if the argument passed to [] is an integer. 

In [15]:
#elections_year_index[0] # this does not work: no column is named 0. Try uncommenting it to see the KeyError.
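
The failure can be observed safely by catching the exception. This is a sketch on a two-row stand-in frame; `df` and `first_row` are names chosen for the illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {"Candidate": ["Reagan", "Carter"], "Party": ["Republican", "Democratic"]},
    index=[1980, 1980],
)

try:
    df[0]            # treated as a column label, and no column is named 0
except KeyError as err:
    print("KeyError:", err)

first_row = df[0:1]  # a slice, by contrast, selects rows by position
print(first_row)
```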

The following cells allow you to test your understanding.

In [57]:
weird = pd.DataFrame({
    1:["topdog","botdog"], 
    "1":["topcat","botcat"]
})
weird

Unnamed: 0,1,1
0,topdog,topcat
1,botdog,botcat


In [58]:
weird[1] #try to predict the output

0    topdog
1    botdog
Name: 1, dtype: object

In [59]:
weird["1"] #try to predict the output

0    topcat
1    botcat
Name: 1, dtype: object

In [60]:
weird[1:] #try to predict the output

Unnamed: 0,1,1
1,botdog,botcat


## Boolean Array Selection

The `[]` operator also supports an array of booleans as an input. In this case, the array must be exactly as long as the number of rows. The result is a filtered version of the data frame, where only rows corresponding to True appear.

In [16]:
elections_year_index[[False, False, False, False, False, 
          False, False, True, False, False,
          True, False, False, False, True,
          False, False, False, False, False,
          False, False, True]]

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1992,Clinton,Democratic,43.0,win
1996,Clinton,Democratic,49.2,win
2000,Bush,Republican,47.9,win
2016,Trump,Republican,46.1,win


One very common task in Data Science is filtering. Boolean Array Selection is one way to achieve this in Pandas. We start by observing that comparison operators such as `==` can be applied to a Pandas Series to generate a boolean array. For example, we can compare the 'Result' column to the string 'win':

In [17]:
elections_year_index.head(5)

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss


In [19]:
iswin = elections_year_index['Result'] == 'win'
iswin#.head(5)

Year
1980     True
1980    False
1980    False
1984     True
1984    False
1988     True
1988    False
1992     True
1992    False
1992    False
1996     True
1996    False
1996    False
2000    False
2000     True
2004    False
2004     True
2008     True
2008    False
2012     True
2012    False
2016    False
2016     True
Name: Result, dtype: bool

The output of the comparison applied to the Series is another Series with the same name and index, but of datatype boolean. The entry at row #i is the result of applying the comparison to the entry of the original Series at row #i.

Such a boolean Series can be used as an argument to the [] operator. For example, the following code creates a DataFrame of all election winners since 1980.

In [20]:
elections_year_index[iswin]

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1984,Reagan,Republican,58.8,win
1988,Bush,Republican,53.4,win
1992,Clinton,Democratic,43.0,win
1996,Clinton,Democratic,49.2,win
2000,Bush,Republican,47.9,win
2004,Bush,Republican,50.7,win
2008,Obama,Democratic,52.9,win
2012,Obama,Democratic,51.1,win
2016,Trump,Republican,46.1,win


Above, we've assigned the result of the logical operator to a new variable called `iswin`. This is uncommon. Usually, the series is created and used on the same line. Such code is a little tricky to read at first, but you'll get used to it quickly.

In [21]:
elections_year_index[elections_year_index['Result'] == 'win']

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1984,Reagan,Republican,58.8,win
1988,Bush,Republican,53.4,win
1992,Clinton,Democratic,43.0,win
1996,Clinton,Democratic,49.2,win
2000,Bush,Republican,47.9,win
2004,Bush,Republican,50.7,win
2008,Obama,Democratic,52.9,win
2012,Obama,Democratic,51.1,win
2016,Trump,Republican,46.1,win


We can select multiple criteria by creating multiple boolean Series and combining them using the `&` operator.

In [61]:
win50plus = (elections_year_index['Result'] == 'win') & (elections_year_index['%'] >= 50)

In [62]:
win50plus.head(5)

Year
1980     True
1980    False
1980    False
1984     True
1984    False
dtype: bool

In [63]:
elections_year_index[(elections_year_index['Result'] == 'win')
          & ~(elections_year_index['%'] < 50)]

# Note for Python experts: The reason we use the & symbol and not the word "and" is because the Python __and__ 
# method overrides the "&" operator, not the "and" operator.

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1984,Reagan,Republican,58.8,win
1988,Bush,Republican,53.4,win
2004,Bush,Republican,50.7,win
2008,Obama,Democratic,52.9,win
2012,Obama,Democratic,51.1,win


The | operator is the symbol for or.

In [26]:
elections_year_index[(elections_year_index['Party'] == 'Republican')
          | (elections_year_index['Party'] == "Democratic")]

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win
1988,Dukakis,Democratic,45.6,loss
1992,Clinton,Democratic,43.0,win
1992,Bush,Republican,37.4,loss
1996,Clinton,Democratic,49.2,win
1996,Dole,Republican,40.7,loss


If we want to match any of several values (say Republican or Democratic), we can use the isin method to simplify our code.

In [27]:
elections_year_index['Party'].isin(["Republican", "Democratic"])

Year
1980     True
1980     True
1980    False
1984     True
1984     True
1988     True
1988     True
1992     True
1992     True
1992    False
1996     True
1996     True
1996    False
2000     True
2000     True
2004     True
2004     True
2008     True
2008     True
2012     True
2012     True
2016     True
2016     True
Name: Party, dtype: bool

In [312]:
elections_year_index[elections_year_index['Party'].isin(["Republican", "Democratic"])]

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss
1988,Bush,Republican,53.4,win
1988,Dukakis,Democratic,45.6,loss
1992,Clinton,Democratic,43.0,win
1992,Bush,Republican,37.4,loss
1996,Clinton,Democratic,49.2,win
1996,Dole,Republican,40.7,loss
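
That the two approaches produce the same boolean Series can be verified directly. A minimal sketch, with `party` standing in for the Party column:

```python
import pandas as pd

party = pd.Series(["Republican", "Democratic", "Independent", "Republican"])

with_or = (party == "Republican") | (party == "Democratic")
with_isin = party.isin(["Republican", "Democratic"])

print(with_isin.tolist())
print(with_or.equals(with_isin))  # same mask either way
```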


A simpler alternative for getting back a specific set of rows is the `query` method.

In [28]:
elections_year_index.query("Result == 'win' and Year < 2000")

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1984,Reagan,Republican,58.8,win
1988,Bush,Republican,53.4,win
1992,Clinton,Democratic,43.0,win
1996,Clinton,Democratic,49.2,win
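
A query string is shorthand for the boolean-mask filtering shown earlier. The sketch below compares the two on a five-row stand-in frame in which Year is an ordinary column; the frame and the names `by_query` and `by_mask` are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Bush", "Clinton", "Bush"],
    "Year": [1980, 1980, 1988, 1992, 2004],
    "Result": ["win", "loss", "win", "win", "win"],
})

by_query = df.query("Result == 'win' and Year < 2000")
by_mask = df[(df["Result"] == "win") & (df["Year"] < 2000)]

print(by_query)
print(by_query.equals(by_mask))  # both select the same rows
```

Inside the query string the word `and` is allowed, because pandas parses the expression itself instead of handing it to Python's `and`.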


## loc and iloc

In [30]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [31]:
elections.loc[[0, 1, 2, 3, 4], ['Candidate','Party', 'Year']]

Unnamed: 0,Candidate,Party,Year
0,Reagan,Republican,1980
1,Carter,Democratic,1980
2,Anderson,Independent,1980
3,Reagan,Republican,1984
4,Mondale,Democratic,1984


Note: `loc` selects by label, so the row arguments 0 through 4 won't work on the elections DataFrame that is indexed by year, because none of them is a label in that index.

In [32]:
elections_year_index.head(5)

Unnamed: 0_level_0,Candidate,Party,%,Result
Year,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1980,Reagan,Republican,50.7,win
1980,Carter,Democratic,41.0,loss
1980,Anderson,Independent,6.6,loss
1984,Reagan,Republican,58.8,win
1984,Mondale,Democratic,37.6,loss


In [33]:
#causes error
#elections_year_index.loc[[0, 1, 2, 3, 4], ['Candidate','Party']]#

In [34]:
elections_year_index.loc[[1980, 1984], ['Candidate','Party']]

Unnamed: 0_level_0,Candidate,Party
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1980,Reagan,Republican
1980,Carter,Democratic
1980,Anderson,Independent
1984,Reagan,Republican
1984,Mondale,Democratic


Loc also supports slicing (for all types, including numeric and string labels!). Note that the slicing for loc is **inclusive**, even for numeric slices.

In [35]:
elections.loc[0:4, 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980
1,Carter,Democratic,41.0,1980
2,Anderson,Independent,6.6,1980
3,Reagan,Republican,58.8,1984
4,Mondale,Democratic,37.6,1984


In [36]:
elections_year_index.loc[1980:1984, 'Candidate':'Party']

Unnamed: 0_level_0,Candidate,Party
Year,Unnamed: 1_level_1,Unnamed: 2_level_1
1980,Reagan,Republican
1980,Carter,Democratic
1980,Anderson,Independent
1984,Reagan,Republican
1984,Mondale,Democratic


If we provide only a single label for the column argument, we get back a Series.

In [37]:
elections.loc[0:4, 'Candidate']

0      Reagan
1      Carter
2    Anderson
3      Reagan
4     Mondale
Name: Candidate, dtype: object

If we want a data frame instead and don't want to use to_frame, we can provide a list containing the column name.

In [38]:
elections.loc[0:4, ['Candidate']]

Unnamed: 0,Candidate
0,Reagan
1,Carter
2,Anderson
3,Reagan
4,Mondale


If we give only one row but many column labels, we'll get back a Series corresponding to a row of the table. This new Series has a neat index, where each entry is the name of the column that the data came from.

In [39]:
elections.head(1)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win


In [40]:
elections.loc[0, 'Candidate':'Year']

Candidate        Reagan
Party        Republican
%                  50.7
Year               1980
Name: 0, dtype: object

In [41]:
elections.loc[[0], 'Candidate':'Year']

Unnamed: 0,Candidate,Party,%,Year
0,Reagan,Republican,50.7,1980


If we omit the column argument altogether, the default behavior is to retrieve all columns. 

In [42]:
elections.loc[[2, 4, 5]]

Unnamed: 0,Candidate,Party,%,Year,Result
2,Anderson,Independent,6.6,1980,loss
4,Mondale,Democratic,37.6,1984,loss
5,Bush,Republican,53.4,1988,win


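Loc also supports boolean array inputs instead of labels. In the pandas version used for this demo, arrays that are too short are padded with False. Recent pandas versions instead raise an IndexError when the length does not match, so it is safest to pass exactly one boolean per row (or column).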

In [43]:
elections.loc[[True, False, False, True], [True, False, False, True]]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984


In [44]:
elections.loc[[0, 3], ['Candidate', 'Year']]

Unnamed: 0,Candidate,Year
0,Reagan,1980
3,Reagan,1984


We can use boolean array arguments for one axis of the data, and labels for the other.

In [45]:
elections.loc[[True, False, False, True], 'Candidate':'%']

Unnamed: 0,Candidate,Party,%
0,Reagan,Republican,50.7
3,Reagan,Republican,58.8
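
One way to avoid any dependence on array length is to build masks that have exactly one boolean per row and per column. This is a sketch, assuming a five-row stand-in frame; `row_mask` and `col_mask` are names introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
    "Year": [1980, 1980, 1980, 1984, 1984],
})

row_mask = df.index.isin([0, 3])                    # one boolean per row
col_mask = df.columns.isin(["Candidate", "Year"])   # one boolean per column

subset = df.loc[row_mask, col_mask]
print(subset)
```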


A student asks what happens if you give scalar arguments for the requested rows AND columns. The answer is that you get back just a single value.

In [46]:
elections.loc[0, 'Candidate']

'Reagan'

### iloc

loc's cousin iloc is very similar, but selects by numerical position instead of label. For example, to access the first 3 rows and first 3 columns of a table, we can use [0:3, 0:3]. iloc slicing is **exclusive**, just like standard Python slicing of numerical values.

In [47]:
elections.head(5)

Unnamed: 0,Candidate,Party,%,Year,Result
0,Reagan,Republican,50.7,1980,win
1,Carter,Democratic,41.0,1980,loss
2,Anderson,Independent,6.6,1980,loss
3,Reagan,Republican,58.8,1984,win
4,Mondale,Democratic,37.6,1984,loss


In [48]:
elections.iloc[0:3, 0:3]

Unnamed: 0,Candidate,Party,%
0,Reagan,Republican,50.7
1,Carter,Democratic,41.0
2,Anderson,Independent,6.6
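
The inclusive versus exclusive slicing rules are easy to mix up, so here is a short side-by-side sketch on a five-row stand-in frame (`by_label` and `by_position` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Reagan", "Carter", "Anderson", "Reagan", "Mondale"],
    "Party": ["Republican", "Democratic", "Independent", "Republican", "Democratic"],
    "%": [50.7, 41.0, 6.6, 58.8, 37.6],
})

by_label = df.loc[0:2, "Candidate":"Party"]   # labels 0, 1, 2 and both named columns
by_position = df.iloc[0:2, 0:1]               # positions 0, 1 and the first column only

print(by_label.shape)
print(by_position.shape)
```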


We will use both loc and iloc in the course. Loc is generally preferred for a number of reasons, for example: 

1. It is harder to make mistakes since you have to literally write out what you want to get.
2. Code is easier to read, because the reader doesn't have to know e.g. what column #31 represents.
3. It is robust against permutations of the data, e.g. the Social Security Administration switching the order of two columns.

However, iloc is sometimes more convenient. We'll provide examples of when iloc is the superior choice.
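
The third point in the list above can be illustrated with a small sketch in which the columns of a stand-in frame are reordered (`df` and `shuffled` are names chosen for the example).

```python
import pandas as pd

df = pd.DataFrame({
    "Candidate": ["Reagan", "Carter"],
    "Party": ["Republican", "Democratic"],
    "Year": [1980, 1980],
})
shuffled = df[["Year", "Candidate", "Party"]]  # same data, columns reordered

# Label-based lookup is unaffected by the reordering.
print(df.loc[:, "Party"].equals(shuffled.loc[:, "Party"]))

# Position 1 now refers to a different column.
print(df.iloc[:, 1].name, shuffled.iloc[:, 1].name)
```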

## Quick Challenge

Which of the following expressions return a DataFrame containing the Candidate and Year of the first 3 candidates that won with more than 50% of the vote?

In [None]:
elections.iloc[[0, 3, 5], [0, 3]]

In [None]:
elections.loc[[0, 3, 5], "Candidate":"Year"]

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].head(3)

In [None]:
elections.loc[elections["%"] > 50, ["Candidate", "Year"]].iloc[0:2, :]

## Sampling

Pandas dataframes also make it easy to get a sample. We simply use the `sample` method and provide the number of samples that we'd like as the argument. Sampling is done without replacement by default. Set `replace=True` if you want replacement.

In [343]:
elections.sample(10)

Unnamed: 0,Candidate,Party,%,Year,Result
15,Kerry,Democratic,48.3,2004,loss
19,Obama,Democratic,51.1,2012,win
8,Bush,Republican,37.4,1992,loss
17,Obama,Democratic,52.9,2008,win
3,Reagan,Republican,58.8,1984,win
13,Gore,Democratic,48.4,2000,loss
2,Anderson,Independent,6.6,1980,loss
1,Carter,Democratic,41.0,1980,loss
22,Trump,Republican,46.1,2016,win
20,Romney,Republican,47.2,2012,loss


In [348]:
elections.query("Year < 1992").sample(50, replace=True)

Unnamed: 0,Candidate,Party,%,Year,Result
5,Bush,Republican,53.4,1988,win
4,Mondale,Democratic,37.6,1984,loss
6,Dukakis,Democratic,45.6,1988,loss
1,Carter,Democratic,41.0,1980,loss
5,Bush,Republican,53.4,1988,win
5,Bush,Republican,53.4,1988,win
5,Bush,Republican,53.4,1988,win
6,Dukakis,Democratic,45.6,1988,loss
1,Carter,Democratic,41.0,1980,loss
6,Dukakis,Democratic,45.6,1988,loss
