In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here are several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # matplotlib

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output
print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

The first step is to read in the data from the .csv files and then explore their contents.

We will read in both county_facts.csv and county_facts_dictionary.csv; we may assume that the latter describes the variables contained in the former.

In [None]:
# Read in both the county_facts.csv and county_facts_dictionary.csv using pd.read_csv()
county_facts = pd.read_csv("../input/county_facts.csv")
county_facts_dictionary = pd.read_csv("../input/county_facts_dictionary.csv")
# Check that these files have been read in properly by calling len() to see the number of rows.
len(county_facts), len(county_facts_dictionary)

Since *county_facts_dictionary* is relatively short, at only 51 rows, it is feasible simply to display it and look at its contents.

In [None]:
county_facts_dictionary

So the county_facts_dictionary DataFrame consists of two columns, **column_name** and **description**. The descriptions decode the labels used as column names in *county_facts*.

Since we may often need to check what these labels refer to, the following code block shows us how to extract the row containing the description for a particular label.

In [None]:
# The general syntax is DataFrame[DataFrame.ColumnName == VALUE]
county_facts_dictionary[county_facts_dictionary.column_name == 'POP060210']
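Because this lookup will be repeated often, it could be wrapped in a small helper. The sketch below is only an illustration: `describe_label` is a hypothetical name, and the two-row `toy_dictionary` stands in for the real *county_facts_dictionary*, using the two labels discussed in this notebook.

```python
import pandas as pd

# Toy stand-in for county_facts_dictionary (assumption: same two column names).
toy_dictionary = pd.DataFrame({
    "column_name": ["PST045214", "POP060210"],
    "description": ["Population, 2014 estimate",
                    "Population per square mile, 2010"],
})

def describe_label(dictionary, label):
    """Return the description for `label`, or None if the label is unknown."""
    match = dictionary.loc[dictionary["column_name"] == label, "description"]
    return match.iloc[0] if not match.empty else None

density_description = describe_label(toy_dictionary, "POP060210")
print(density_description)
```

Returning `None` for an unknown label, rather than an empty DataFrame, makes a mistyped label easier to notice.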

Now, we should probably find out a bit more about what is given in **county_facts**.

In [None]:
county_facts.head()

We can see that county_facts has three means of identification:
1. a *fips*;
2. an *area_name*;
3. a *state_abbreviation*.

After those three columns come 51 columns, one for each of the 51 rows in **county_facts_dictionary**.
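One way to confirm that the coded columns and the dictionary really line up is to compare the two sets of labels. This is a minimal sketch on made-up toy frames; `toy_facts`, `toy_dictionary`, and all values in them are placeholders, not the real data.

```python
import pandas as pd

# Toy stand-ins: three identifier columns plus two coded columns.
toy_facts = pd.DataFrame({
    "fips": [1, 2],
    "area_name": ["County A", "County B"],
    "state_abbreviation": ["XX", "YY"],
    "PST045214": [1000, 2000],   # made-up values
    "POP060210": [10.0, 20.0],   # made-up values
})
toy_dictionary = pd.DataFrame({
    "column_name": ["PST045214", "POP060210"],
    "description": ["Population, 2014 estimate",
                    "Population per square mile, 2010"],
})

id_columns = ["fips", "area_name", "state_abbreviation"]
coded_columns = [c for c in toy_facts.columns if c not in id_columns]

# Labels present in one place but not the other; both sets are empty
# when the columns and the dictionary are consistent.
undocumented = set(coded_columns) - set(toy_dictionary["column_name"])
unused = set(toy_dictionary["column_name"]) - set(coded_columns)
print(undocumented, unused)
```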

At this point, we can move to the primary results themselves. 


In [None]:
primary_results = pd.read_csv("../input/primary_results.csv")
primary_results.head()

Calling *.head()* on **primary_results** shows that this DataFrame contains 8 columns. 

Let's begin interrogating this data in an exploratory fashion by finding the counties where Ted Cruz (random choice) won the largest proportion of the vote.

In [None]:
Cruz = primary_results[primary_results.candidate == 'Ted Cruz']
# The .sort_values() method with ascending=False puts Cruz's largest fractions of the vote at the top of the table.
Cruz_Max = Cruz.sort_values(by='fraction_votes', ascending=False)
Cruz_Max.head()

In this primary campaign, winning over forty percent of the vote in the crowded Republican field is a substantially positive result for a candidate. Counties like Hancock in Iowa or Elko in Nevada probably show certain characteristics indicative of potential Cruz success.

To what extent are counties like Hancock or Elko standouts in terms of Cruz' performance? Let's get some basic descriptive statistics on the board.

Contrast Cruz' best performances with his bottom five (which are all in New Hampshire):

In [None]:
Cruz_Max.tail()
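If only the extremes are needed, pandas also offers `nlargest` and `nsmallest`, which avoid sorting the whole table first. A small sketch on made-up values (the `toy_results` frame is a placeholder, not the real primary results):

```python
import pandas as pd

toy_results = pd.DataFrame({
    "county": ["A", "B", "C", "D"],
    "fraction_votes": [0.12, 0.45, 0.30, 0.08],  # made-up values
})

# Two highest and two lowest vote fractions.
top_two = toy_results.nlargest(2, "fraction_votes")
bottom_two = toy_results.nsmallest(2, "fraction_votes")
print(top_two)
print(bottom_two)
```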

In [None]:
# Mean proportion of the vote for Cruz, by county.
Cruz_mean = Cruz_Max.fraction_votes.mean()
Cruz_mean

In [None]:
# Median
np.median(Cruz_Max.fraction_votes)
# So there's really not too much difference between the mean and median, though the latter is slightly greater.

In [None]:
# Range
Cruz_range = max(Cruz_Max.fraction_votes) - min(Cruz_Max.fraction_votes)
Cruz_range

In [None]:
# Standard Deviation
# Note: np.std defaults to ddof=0, i.e. the population standard deviation.
Cruz_std = np.std(Cruz_Max.fraction_votes)
Cruz_std

In [None]:
# Number of standard deviations maximum differs from the mean
(max(Cruz_Max.fraction_votes) - Cruz_mean) / Cruz_std

In [None]:
# Number of standard deviations minimum differs from the mean
(Cruz_mean - min(Cruz_Max.fraction_votes)) / Cruz_std

The basic measures so far show that Cruz' best performance is a bit more atypical than his worst performance.
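The distances computed above are z-scores, z = (x - mean) / std, and they can be computed for every county at once. The sketch below uses made-up fractions rather than the real data, and passes `ddof=0` explicitly so that the pandas result matches the `np.std` default.

```python
import numpy as np
import pandas as pd

fractions = pd.Series([0.10, 0.20, 0.25, 0.30, 0.65])  # made-up values

mean = fractions.mean()
std = fractions.std(ddof=0)  # Series.std defaults to ddof=1, np.std to ddof=0

# z-score of every observation: distance from the mean in standard deviations.
z_scores = (fractions - mean) / std

z_max = z_scores.max()
z_min = z_scores.min()
print(z_max, abs(z_min))
```

In this toy series the maximum lies further from the mean than the minimum does, which is the same kind of asymmetry described above.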

Now for the interesting part: what characterizes the counties where Cruz has performed best?

We will join the *Cruz_Max* data with the *county_facts* data on the fips column. Then, we will test the correlation of fraction of the vote with 2014 estimated population.

In [None]:
Cruz_New = pd.merge(Cruz_Max, county_facts, on='fips', how='inner')
np.corrcoef(Cruz_New.fraction_votes, Cruz_New.PST045214)
# There's a slight negative correlation, which would suggest that Cruz tends to win more of the vote in less populous counties.
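An inner join silently drops any row whose fips has no match on the other side, so it can be worth checking how many rows would be lost. This is a hedged sketch on toy frames with made-up values; it uses the `indicator` parameter of `pd.merge`, which adds a `_merge` column recording where each row came from.

```python
import pandas as pd

toy_votes = pd.DataFrame({
    "fips": [1, 2, 3],
    "fraction_votes": [0.4, 0.3, 0.2],   # made-up values
})
toy_facts = pd.DataFrame({
    "fips": [1, 2, 4],
    "PST045214": [1000, 2000, 4000],     # made-up values
})

# how="left" keeps every vote row; indicator=True adds the "_merge" column.
checked = pd.merge(toy_votes, toy_facts, on="fips", how="left", indicator=True)

# Rows marked "left_only" have no matching county facts and would be
# dropped by how="inner".
unmatched = checked[checked["_merge"] == "left_only"]
print(unmatched["fips"].tolist())
```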

In [None]:
# Cruz votes by population
plt.scatter(Cruz_New.PST045214, Cruz_New.fraction_votes)
plt.xlabel("2014 estimated population (PST045214)")
plt.ylabel("Fraction of votes for Cruz")
plt.show()

This visualization is really helpful because it reveals one clear outlier in population, which should be removed before recalculating the correlation.

In [None]:
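Cruz_byPopulation = Cruz_New.sort_values(by='PST045214', ascending=False)
# Drop the first row (the most populous county) and keep all the others,
# without hard-coding the number of rows.
Cruz_PopOutlierRemoved = Cruz_byPopulation.iloc[1:]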

In [None]:
# Recalculate the correlation coefficient between fraction of the vote and population
np.corrcoef(Cruz_PopOutlierRemoved.fraction_votes, Cruz_PopOutlierRemoved.PST045214)
# So indeed, that outlier was substantially diminishing the correlation!
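To see why a single extreme point can mask a relationship, here is a small sketch on synthetic data (all values made up). It drops the row with the largest population using `idxmax`, which is one alternative to sorting and slicing.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "population": [10, 20, 30, 40, 50, 1000],                # last value is the outlier
    "fraction_votes": [0.50, 0.45, 0.40, 0.35, 0.30, 0.40],  # made-up values
})

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_all = np.corrcoef(toy["population"], toy["fraction_votes"])[0, 1]

# Drop the single row with the largest population, then recompute.
trimmed = toy.drop(index=toy["population"].idxmax())
r_trimmed = np.corrcoef(trimmed["population"], trimmed["fraction_votes"])[0, 1]
print(r_all, r_trimmed)
```

In this toy example the five remaining points lie exactly on a line, so the correlation goes from close to zero to -1 once the outlier is removed.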

In [None]:
# We can use population density as a stand-in for urban vs. rural performance. 
# POP060210 is population per square mile in 2010
np.corrcoef(Cruz_New.fraction_votes, Cruz_New.POP060210)
# There's a stronger correlation here.

In [None]:
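# Cruz by population density
plt.scatter(Cruz_New.POP060210, Cruz_New.fraction_votes)
plt.xlabel("Population per square mile, 2010 (POP060210)")
plt.ylabel("Fraction of votes for Cruz")
plt.show()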

Now, can we build any predictive models?

Let's start to interrogate the profile of a Sanders voter. 
The media profile of such a voter is young (< 30 years old), white, and male.
We will need to pull out the data on Sanders, and then find the best county