# Data analysis with Python — PART 2

Okay, so we've successfully downloaded a dataset and read it into our code ... now what?

There are a number of things you might want to do with data — a number of questions you might want to ask it. Think of data like a human source: You're interviewing it to find answers to your questions. 

Also remember that **data is not all-knowing**: There are plenty of questions it can't/won't answer. Understanding your data's limitations is a good way to start down the right path. 

Let's take a closer look at the data we're working with: Maricopa County's precinct-level election results file. Run the code below to read in the data you saved to your machine and view it.

In [3]:
# load your libraries 
import pandas as pd

# read in your csv 
df = pd.read_csv("elections_03_02_23.csv",
                dtype={'i_contest_id': str, 'candidate_id': str})

# this line of code tells our program that we want to be able to see all the columns in our df
pd.set_option('display.max_columns', None)

# view the first five rows
df.head()


Unnamed: 0.1,Unnamed: 0,i_contest_id,contest_ext_id,contest_name,contest_party_affiliation,contest_vote_for,contest_type,contest_order,precinct_id,precinct_ext_id,precinct_name,precinct_order,precinct_status,precinct_registered,precinct_turnout,precinct_turnout_perc,candidate_id,candidate_ext_id,candidate_name,candidate_type,candidate_affiliation,candidate_order,registered,turnout,turnout_perc,overvotes,undervotes,votes,turnout_early_vote,overvotes_early_vote,undervotes_early_vote,votes_early_vote,turnout_election_day,overvotes_election_day,undervotes_election_day,votes_election_day,turnout_provisional,overvotes_provisional,undervotes_provisional,votes_provisional
0,0,\n1,,US Senate,,1.0,Candidacy,1.0,1.0,1.0,0001 ACACIA,1.0,0.0,1790.0,960.0,0.54,1,50482.0,"MASTERS, BLAKE",R,REP,1.0,1790.0,960.0,0.54,0.0,4.0,435.0,805.0,0.0,4.0,333.0,154.0,0.0,0.0,102.0,1.0,0.0,0.0,0.0
1,1,\n1,,US Senate,,1.0,Candidacy,1.0,1.0,1.0,0001 ACACIA,1.0,0.0,1790.0,960.0,0.54,2,50483.0,"KELLY, MARK",R,DEM,2.0,1790.0,960.0,0.54,0.0,4.0,494.0,805.0,0.0,4.0,447.0,154.0,0.0,0.0,46.0,1.0,0.0,0.0,1.0
2,2,\n1,,US Senate,,1.0,Candidacy,1.0,1.0,1.0,0001 ACACIA,1.0,0.0,1790.0,960.0,0.54,3,50478.0,"VICTOR, MARC J.",R,LBT,3.0,1790.0,960.0,0.54,0.0,4.0,26.0,805.0,0.0,4.0,21.0,154.0,0.0,0.0,5.0,1.0,0.0,0.0,0.0
3,3,\n1,,US Senate,,1.0,Candidacy,1.0,1.0,1.0,0001 ACACIA,1.0,0.0,1790.0,960.0,0.54,571,,Write-in,W,,4.0,1790.0,960.0,0.54,0.0,4.0,0.0,805.0,0.0,4.0,0.0,154.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
4,4,\n1,,US Senate,,1.0,Candidacy,1.0,1.0,1.0,0001 ACACIA,1.0,0.0,1790.0,960.0,0.54,704,,"BORDES, SHERRISE",Q,,100001.0,1790.0,960.0,0.54,0.0,4.0,0.0,805.0,0.0,4.0,0.0,154.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


# Sorting to find "the most" and "the least"

You may have used the sorting functionality in Excel before. Sorting is a way of rearranging data based on its values. You can sort from least to greatest or from greatest to least. Sorting works on numbers, but you can also sort alphabetically on character columns to see your data arranged a-z or z-a.

Here's the syntax for sorting:

    name of dataframe.sort_values('name of column')

If you want to see the largest value at the top of your df, you can add the parameter **ascending=False**. So your syntax would look like: 

    name of dataframe.sort_values('name of column', ascending=False)
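To see both versions of that syntax in action, here's a minimal sketch on a tiny made-up dataframe. The name `toy` and its values are invented for illustration; they are not from the election file.

```python
import pandas as pd

# Toy data, invented for illustration (not from the election file)
toy = pd.DataFrame({
    "precinct": ["A", "B", "C"],
    "turnout_perc": [0.54, 0.92, 0.31],
})

# Default sort: smallest value at the top
lowest_first = toy.sort_values("turnout_perc")

# ascending=False flips it: largest value at the top
highest_first = toy.sort_values("turnout_perc", ascending=False)

print(lowest_first)
print(highest_first)
```

Note that sort_values hands back a new, sorted dataframe. The original `toy` keeps its row order unless you save the result back to it.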

Our data has one row per candidate per precinct, and a wealth of information in its columns. We can see the number of people registered in each precinct, the precinct turnout, the number of early votes and election day votes cast per candidate, a candidate's affiliation, and more.

It could be interesting to see which precincts have the highest and the lowest turnouts. So let's start by sorting for that.

Write a sort statement that will show us the precinct with the highest turnout percentage in the code chunk below, then run the code cell. 

Copy and paste the name of the column you'll be sorting on from here: 'precinct_turnout_perc'

In [4]:
# Remember: The syntax to sort largest-smallest looks like the below
# [name of dataframe].sort_values('[name of column]', ascending=False)
# write your sort statement below:


Unnamed: 0.1,Unnamed: 0,i_contest_id,contest_ext_id,contest_name,contest_party_affiliation,contest_vote_for,contest_type,contest_order,precinct_id,precinct_ext_id,precinct_name,precinct_order,precinct_status,precinct_registered,precinct_turnout,precinct_turnout_perc,candidate_id,candidate_ext_id,candidate_name,candidate_type,candidate_affiliation,candidate_order,registered,turnout,turnout_perc,overvotes,undervotes,votes,turnout_early_vote,overvotes_early_vote,undervotes_early_vote,votes_early_vote,turnout_election_day,overvotes_election_day,undervotes_election_day,votes_election_day,turnout_provisional,overvotes_provisional,undervotes_provisional,votes_provisional
150100,150100,\n211,,"Superior Court HOPKINS, STEPHEN MATTHEW",,1.0,Measure,215.0,131.0,131.0,0131 CINCO,131.0,0.0,2.0,2.0,1.0,457,1.0,NO,M,NON,628.0,2.0,2.0,1.0,0.0,0.0,2.0,2.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
208132,208132,\n241,,PROPOSITION 132,,1.0,Measure,246.0,131.0,131.0,0131 CINCO,131.0,0.0,2.0,2.0,1.0,527,1.0,NO,M,NON,690.0,2.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
146356,146356,\n209,,"Superior Court GREEN, JENNIFER E.",,1.0,Measure,213.0,131.0,131.0,0131 CINCO,131.0,0.0,2.0,2.0,1.0,449,1.0,NO,M,NON,624.0,2.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41868,41868,\n57,,State Treasurer,,1.0,Candidacy,57.0,131.0,131.0,0131 CINCO,131.0,0.0,2.0,2.0,1.0,126,50428.0,"QUEZADA, MARTÍN",R,DEM,182.0,2.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
41869,41869,\n57,,State Treasurer,,1.0,Candidacy,57.0,131.0,131.0,0131 CINCO,131.0,0.0,2.0,2.0,1.0,658,,Write-in,W,,183.0,2.0,2.0,1.0,0.0,0.0,0.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
180910,180910,\n228,,"Superior Court SCHWARTZ, ARYEH D.",,1.0,Measure,231.0,560.0,560.0,0560 NVP 14,560.0,0.0,0.0,0.0,0.0,499,1.0,NO,M,NON,660.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
180909,180909,\n228,,"Superior Court SCHWARTZ, ARYEH D.",,1.0,Measure,231.0,560.0,560.0,0560 NVP 14,560.0,0.0,0.0,0.0,0.0,498,1.0,YES,M,NON,659.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
180908,180908,\n228,,"Superior Court SCHWARTZ, ARYEH D.",,1.0,Measure,231.0,559.0,559.0,0559 NVP 13,559.0,0.0,0.0,0.0,0.0,499,1.0,NO,M,NON,660.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
219102,219102,\n247,,PROPOSITION 310,,1.0,Measure,252.0,,,,1000000.0,1.0,0.0,0.0,0.0,539,1.0,NO,M,NON,700.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


So ... this doesn't actually end up telling us much, because the total number of people registered in the highest-turnout precinct is so low! 100 percent of the 2 folks registered showed up. That's not the kind of information that we're looking for.

And it doesn't look like sorting from least to greatest would have given us any more telling results. The above sorted df shows us its bottom rows, too — and we can see that there are precincts where there's nobody registered to vote, and some strange rows with NaNs all the way across. This is where filtering comes in ... 

# Filtering to isolate or exclude variables

... Which brings us to our next section! You've probably used filtering in Excel before, but here's a quick recap on it: Filtering is a way of subsetting your data, so you can look at slices of it that matter to you. You can use filtering for so many different reasons: It can help you exclude errors, or focus on data that meets certain criteria. In the context of our data: If you wanted to only look at a certain precinct, you'd filter to that precinct. If you wanted to only look at the gubernatorial race, you'd filter for that too. And if you wanted to exclude rows with NaNs or zeroes (like we do!) you can use filtering to do that, too. 

So let's do it! 

You can filter with comparison operators like >, < and ==. So to only see rows for precincts where there were more than 200 registered voters, for example, you'd write something like this:

    greater_than_200 = df[(df.precinct_registered > 200)]

*greater_than_200* is the name of a new dataframe: the filtered version of your original df. Preserving your original data can help you retrace your steps and double-check your work, to make sure the math is math-ing after you perform a series of operations. The "df.precinct_registered > 200" part calls on your original dataframe, *df*, and then on the column we're interested in, *precinct_registered*. It sets the condition that if a row has a value greater than 200 in the precinct_registered column, it will be included in the new dataframe, *greater_than_200*. Otherwise, it's out!

Important sidenote: if you wanted to filter so only rows where 'precinct_registered' equalled 200 showed up, you'd use two equal signs. So, the statement in the parens would look like (df.precinct_registered == 200)
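Here's a small sketch of the difference between filtering with > and filtering with ==, run on a made-up dataframe. The name `toy` and its numbers are assumptions for illustration only.

```python
import pandas as pd

# Toy data, invented for illustration (not from the election file)
toy = pd.DataFrame({
    "precinct": ["A", "B", "C", "D"],
    "registered": [150, 200, 1790, 0],
})

# Keep rows where registered is strictly greater than 200
more_than_200 = toy[toy.registered > 200]

# Keep rows where registered is exactly 200 (note the two equal signs)
exactly_200 = toy[toy.registered == 200]

print(more_than_200)
print(exactly_200)
```

Precinct B, with exactly 200 registered, lands in the second dataframe but not the first, because > means strictly greater than.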

The syntax to pull a single precinct or contest from the data looks pretty similar. To pull just the Acacia precinct, for example, you'd write something like: 

    just_acacia = df[(df.precinct_name  == "0001 ACACIA")]

And to get rid of NA values, you can use the .notnull() method in your filter. For example:

    no_na_precincts = df[df.precinct_id.notnull()]

The above line of code would filter out any rows with an NA value in the 'precinct_id' column.
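To see what .notnull() is doing under the hood, here's a minimal sketch on made-up data (the `toy` dataframe is an assumption for illustration). The method returns a True/False value for every row, and the filter keeps only the True rows.

```python
import pandas as pd

# Toy data, invented for illustration; None becomes NaN in a numeric column
toy = pd.DataFrame({
    "precinct_id": [1.0, None, 3.0],
    "votes": [10, 20, 30],
})

# One True/False per row: True where precinct_id has a value
mask = toy.precinct_id.notnull()

# Keep only the rows where the mask is True
has_id = toy[mask]

print(mask)
print(has_id)
```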

In the below code block, there's an incomplete line of code. Write a statement on the right side of the equals sign to filter out any rows where there's a zero value in the precinct_registered field. 


In [24]:

no_zeroes = #put your code here! remember, we're starting with our original dataframe, df

(219104, 40)
(211505, 40)


Let's see if it worked. There are tons of ways to do this, but an easy one is using ***.shape***, a pandas dataframe attribute that shows you the total number of rows and columns in a dataframe.
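As a quick sketch of how .shape behaves, here it is on a made-up dataframe before and after a filter. The names `toy` and `filtered`, and the cutoff of 50, are assumptions for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not from the election file)
toy = pd.DataFrame({
    "registered": [0, 150, 0, 1790],
    "votes": [0, 80, 0, 960],
})

# An arbitrary filter, just to have something to compare against
filtered = toy[toy.votes > 50]

# .shape is an attribute, not a function, so there are no parentheses after it
print(toy.shape)
print(filtered.shape)
```

The first number is the row count and the second is the column count. A filter only drops rows, so the second number should stay the same.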

In the code chunk below, uncomment the first line and run the cell to see the shape of our original dataframe, ***df***. Uncomment the second and re-run the cell to see the shape of our new dataframe, ***no_zeroes***. 

In [None]:
#print(df.shape)
#print(no_zeroes.shape)

We see the new dataframe has fewer rows, so we know our filter did something, but did it do the right thing? There's a quick way to find out and perform a gut-check: sorting the new dataframe, no_zeroes, on the precinct_registered column from the lowest to the highest value. If you see zeroes at the top of the output, that means something went wrong with your filter.
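The same gut-check pattern can be sketched on made-up data: filter, sort smallest-first, then look at the top value. Everything here (the `toy` dataframe, the cutoff of 50) is an assumption for illustration.

```python
import pandas as pd

# Toy data, invented for illustration (not from the election file)
toy = pd.DataFrame({
    "precinct": ["A", "B", "C", "D"],
    "registered": [40, 300, 75, 1790],
})

# Keep only rows with more than 50 registered voters
over_50 = toy[toy.registered > 50]

# Sort smallest-first, so the top row holds the smallest surviving value
check = over_50.sort_values("registered")
smallest_left = check.registered.iloc[0]

print(smallest_left)
```

If the smallest value left over is above the cutoff, the filter did its job. If it isn't, something went wrong.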

Run the codeblock below to sort the precinct_registered column in the no_zeroes dataframe from lowest to highest.

In [None]:
no_zeroes.sort_values('precinct_registered')

# Putting it alllll together

Now that we've gotten rid of those pesky zero-registered-voters rows, we can get down to real business. Let's start by picking a precinct to look at — the precinct with the highest registration.

In the code chunk below, write a statement to sort the ***no_zeroes*** dataframe's 'precinct_registered' column from highest to lowest. Then run the cell. 

In [26]:
# remember the sorting syntax -- when you're looking for the largest values first, it's
# [name of dataframe].sort_values('[name of column]', ascending=False)
# Write your code below!


Unnamed: 0.1,Unnamed: 0,i_contest_id,contest_ext_id,contest_name,contest_party_affiliation,contest_vote_for,contest_type,contest_order,precinct_id,precinct_ext_id,precinct_name,precinct_order,precinct_status,precinct_registered,precinct_turnout,precinct_turnout_perc,candidate_id,candidate_ext_id,candidate_name,candidate_type,candidate_affiliation,candidate_order,registered,turnout,turnout_perc,overvotes,undervotes,votes,turnout_early_vote,overvotes_early_vote,undervotes_early_vote,votes_early_vote,turnout_election_day,overvotes_election_day,undervotes_election_day,votes_election_day,turnout_provisional,overvotes_provisional,undervotes_provisional,votes_provisional
80956,80956,\n98,,CAWCD Board of Directors,,5.0,Candidacy,98.0,626.0,626.0,0626 PEBBLE CREEK,626.0,0.0,6996.0,6431.0,0.92,177,51298.0,"DUPLESSIS, SHELBY",R,NON,274.0,6974.0,6418.0,0.92,12.0,13430.0,665.0,6098.0,10.0,12702.0,640.0,316.0,2.0,712.0,25.0,4.0,0.0,16.0,0.0
108033,108033,\n187,,"Appeals Court BAILEY, CYNTHIA",,1.0,Measure,190.0,626.0,626.0,0626 PEBBLE CREEK,626.0,0.0,6996.0,6431.0,0.92,418,1.0,YES,M,NON,581.0,6974.0,6418.0,0.92,9.0,1797.0,3344.0,6098.0,9.0,1699.0,3154.0,316.0,0.0,95.0,189.0,4.0,0.0,3.0,1.0
33723,33723,\n51,,State Senator Dist-29,,1.0,Candidacy,51.0,626.0,626.0,0626 PEBBLE CREEK,626.0,0.0,6996.0,6431.0,0.92,113,50423.0,"SHAMP, JANAE",R,REP,163.0,6974.0,6418.0,0.92,0.0,159.0,3908.0,6098.0,0.0,149.0,3631.0,316.0,0.0,10.0,274.0,4.0,0.0,0.0,3.0
33724,33724,\n51,,State Senator Dist-29,,1.0,Candidacy,51.0,626.0,626.0,0626 PEBBLE CREEK,626.0,0.0,6996.0,6431.0,0.92,114,51570.0,"RAYMER, DAVID",R,DEM,164.0,6974.0,6418.0,0.92,0.0,159.0,2346.0,6098.0,0.0,149.0,2313.0,316.0,0.0,10.0,32.0,4.0,0.0,0.0,1.0
33725,33725,\n51,,State Senator Dist-29,,1.0,Candidacy,51.0,626.0,626.0,0626 PEBBLE CREEK,626.0,0.0,6996.0,6431.0,0.92,574,,Write-in,W,,165.0,6974.0,6418.0,0.92,0.0,159.0,0.0,6098.0,0.0,149.0,0.0,316.0,0.0,10.0,0.0,4.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11012,11012,\n1,,US Senate,,1.0,Candidacy,1.0,848.0,848.0,0848 TRES,848.0,0.0,1.0,0.0,0.00,2,50483.0,"KELLY, MARK",R,DEM,2.0,1.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11013,11013,\n1,,US Senate,,1.0,Candidacy,1.0,848.0,848.0,0848 TRES,848.0,0.0,1.0,0.0,0.00,3,50478.0,"VICTOR, MARC J.",R,LBT,3.0,1.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11014,11014,\n1,,US Senate,,1.0,Candidacy,1.0,848.0,848.0,0848 TRES,848.0,0.0,1.0,0.0,0.00,571,,Write-in,W,,4.0,1.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
11015,11015,\n1,,US Senate,,1.0,Candidacy,1.0,848.0,848.0,0848 TRES,848.0,0.0,1.0,0.0,0.00,704,,"BORDES, SHERRISE",Q,,100001.0,1.0,0.0,0.00,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


Okay, awesome – so the precinct with the most registered voters is Pebble Creek. That's the precinct we're going to narrow down to. 

Write a filter statement on the right side of the equals sign in the code chunk below. Filter so that the new dataframe, ***pebble_creek***, contains only rows for the Pebble Creek precinct. 

In [None]:
# remember filtering syntax! when you're filtering on a character column, like precinct_name, 
# your statement will look like this: just_acacia = df[(df.precinct_name  == "0001 ACACIA")]

pebble_creek = # write your code here! 

Aaaaaaannnndddd we've got Pebble Creek! Just a couple more rounds of sorting and filtering left folks, I promise (well ... until the next section). Let's go ahead and choose a contest. 

In the code chunk below, write a statement to create a new dataframe that only contains rows for our chosen contest. *Hint: 'contest_name' is the column you'll be filtering on*.

In [None]:
# write your statement below!
 

Our final step has arrived: We're gonna sort to see the candidate who received the most votes in Pebble Creek.

In the code chunk below, write a statement to sort your candidate dataframe's 'votes' column from largest to smallest. Run the chunk, then write a comment below your code with the name of the candidate who received the most votes. 

In [None]:
# Write your code under here! 

There you have it! Sorting and filtering are bread-and-butter blocks of data exploration/analysis. They're powerful on their own, but they're even more useful when they're paired with groupbys and joins (also known as merges). Which brings us to part 3!