# Case Study -- US Election Data

This is a case study notebook. It covers processing/cleaning the data and visualisation. Mostly, it is a case study of approaching data analytics from an information-theoretic point of view. We will build an ID3 decision tree in the later part of the notebook.

## 4/Sep/2020 <span style="color:red">Caveat</span>

The notebook is under construction. The algorithm-check (prediction on data and compare with target labels) is ONLY A SANITY CHECK, NOT a standard evaluation procedure.

The notebook has an accompanying video series:
https://www.youtube.com/playlist?list=PLuXKrCpJ4KeZ1jB3_8EjtC9r4pSUs1rvB


In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))


In [None]:
usa_2016_presidential_election_by_county = pd.read_csv('/kaggle/input/us-elections-dataset/usa-2016-presidential-election-by-county.csv', sep=';')

In [None]:
print(f"Totally, there are {len(usa_2016_presidential_election_by_county)} records")

usa_2016_presidential_election_by_county.head()

In [None]:
for k in usa_2016_presidential_election_by_county.keys():
    print(k)
# State
# County

In [None]:
df = usa_2016_presidential_election_by_county.dropna(subset=[
    "Votes16 Clintonh", "Votes16 Trumpd", 
    "Republicans 2016", "Democrats 2016",
    "Republicans 2012", "Republicans 2008", 
    "Democrats 2012", "Democrats 2008", "Votes"])

# Entropy of the vote distributions

Let's start by asking which candidate a randomly chosen American voter would have voted for in 2016. To simplify the question, we consider only Trump, D vs Clinton, H.

If there is no further information about "the American" of interest, the chance is approximately half/half. In terms of entropy, the decision of an average American needs about 1 bit to transmit.

In [None]:
n_dem = df["Votes16 Clintonh"].sum() 
n_rep = df["Votes16 Trumpd"].sum()
p_dem = n_dem / (n_dem + n_rep)
p_rep = n_rep / (n_dem + n_rep)
print(f"Votes for DEM {n_dem}, probability {p_dem:.4f}")
print(f"Votes for REP {n_rep}, probability {p_rep:.4f}")


Let us recall the entropy (measured in bits, hence the base-2 logarithm)

$H = - (p_1 \log_2 p_1 + p_2 \log_2 p_2 + \dots)$

In this example, we only have two classes, with probabilities $p_1$ and $p_2$.
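
To get a feeling for the formula before applying it to the ballots, here is a minimal sketch. The helper name `entropy_bits` and the example distributions are assumptions made for illustration only; they are not part of the dataset or of the cells in this notebook.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution (hypothetical helper)."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # by convention 0 * log(0) = 0, so zero-probability classes are dropped
    # sum of p * log2(1/p) is the same quantity as -sum of p * log2(p)
    return float(np.sum(p * np.log2(1.0 / p)))

print(f"fair coin      : {entropy_bits([0.5, 0.5]):.4f} bits")
print(f"skewed 90/10   : {entropy_bits([0.9, 0.1]):.4f} bits")
print(f"certain outcome: {entropy_bits([1.0, 0.0]):.4f} bits")
```

The closer the two probabilities are to half/half, the closer the entropy gets to 1 bit; a certain outcome needs no bits at all.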

In [None]:
ent = - (p_dem * np.log2(p_dem) + p_rep * np.log2(p_rep)).sum()
print(f"Entropy: {ent:.4f}")
print(f"""This means if you store all the election ballots in 2016, the MINIMUM file size
cannot be less than {ent * (n_dem+n_rep):.2f} bits""")

Let us look at some places where people had a clearer preference.

In [None]:
# check out all California 
df[df["State"] == "California"]

In [None]:
# Let us summarise the entropy computation and report in a function
def exam_votes(df_i):
    n_dem = df_i["Votes16 Clintonh"].sum() 
    n_rep = df_i["Votes16 Trumpd"].sum()
    p_dem = n_dem / (n_dem + n_rep)
    p_rep = n_rep / (n_dem + n_rep)
    print(f"2016 Vote Statistics {n_dem + n_rep} votes in {len(df_i)} counties")
    print(f"Votes for DEM {n_dem}, probability {p_dem:.4f}")
    print(f"Votes for REP {n_rep}, probability {p_rep:.4f}")
    ent = - (p_dem * np.log2(p_dem) + p_rep * np.log2(p_rep)).sum()
    print(f"Entropy: {ent:.4f}")
    print(f"""This means if you store all the election ballots in 2016, the MINIMUM file size
cannot be less than {ent * (n_dem+n_rep):.2f} bits""")
    return ent, p_dem, p_rep, n_dem, n_rep


In [None]:
ent, p_dem, p_rep, n_dem, n_rep = exam_votes(df[df["State"] == "California"])

Before we move forward, find a really partisan place to check.

In [None]:
import plotly.express as px
fig = px.scatter_geo(df, lat="lat", lon="lon", color="Republicans 2016", hover_name="County", size="Votes")
# you can try to remove "size" to get quicker rendering and make smaller counties more visible
fig.show()

In [None]:
px.scatter(df, x="Republicans 2016", y="Democrats 2016", hover_name="County")

In [None]:
# Get the record of the county "District of Columbia, District of Columbia"
df[df["County"] == "District of Columbia, District of Columbia"] # only one record

In [None]:
# Let's do the entropy computation
_ = exam_votes(df[df["County"] == "District of Columbia, District of Columbia"])

# Predictive task -- setting up and primitive attempt

Now we have a measure of how uncertain / predictable the voters in a place are. The next question is how to apply this analysis to perform some analytics. Say, try to say something about a future election (or an election whose result is not given to the data model).

First, let us change our view point from individual ballots to counties (**Why?**). We introduce the target of interest -- which party won the election in a county -- and split the data in two parts. (We will use "Democrats 2016" and "Republicans 2016" instead of Clinton's and Trump's vote counts; they represent roughly the same kind of information, with the former normalised.)

In [None]:
df["Republicans Won 2016"] = df["Democrats 2016"] < df["Republicans 2016"]
df["Republicans Won 2012"] = df["Democrats 2012"] < df["Republicans 2012"]
df["Republicans Won 2008"] = df["Democrats 2008"] < df["Republicans 2008"]

In [None]:
# Check the 2016 results
df["Republicans Won 2016"].value_counts(normalize=True)

In [None]:
prob = df["Republicans Won 2016"].value_counts(normalize=True)
prob = np.array(prob)
print(f"Distribution of *repub won* w.r.t. county is [True (Rep Won), False (Dem Won)]={prob}")
ent = - (prob * np.log2(prob)).sum()
print(f"Entropy is {ent:.4f}")

Q: Why did the prob/ent change from the previous investigation?

### State

Following the idea of the previous attempt, we want to consider sub-groups of the population, so that the vote results are more certain. 



In [None]:
# summarise the previous analysis into a county based function

def exam_counties(df, verbose=True):
    prob = df["Republicans Won 2016"].value_counts(normalize=True)
    prob = np.array(prob)
    ent = - (prob * np.log2(np.maximum(prob, 1e-6))).sum()
    if verbose:
        print(f"Distribution of *repub won* w.r.t. county is [True (Rep Won), False (Dem Won)]={prob}")
        print(f"Entropy is {ent:.4f}")
    return ent

In [None]:
# Let's try states ...
states = df["State"].value_counts()
states

In [None]:

exam_counties(df[df["State"]=="Georgia"], verbose=False)

In [None]:
total_ent = 0
num_counties = 0
for k, v in states.items():  # Series.iteritems() was removed in pandas 2.0
    ent = exam_counties(df[df["State"]==k], verbose=False) # in this particular state
    print(f"State {k} has {v} counties, result entropy {ent:.3f}")
    total_ent += v * ent
    num_counties += v
    
print(f"Weighted sum of entropies {total_ent/num_counties :.3f}")

Knowing the states provides information about the outcome of the counties. Overall, the county-wise results become more predictable given the knowledge of the states. 

👉 The statement holds **OVERALL**; it does not apply to every individual sub-population. See the example below.

In [None]:
ent = exam_counties(df[df["ST"]=="CA"]) # in this particular state the result is more unpredictable (in terms of counties)

👉 Note also that the "population" now consists of individual counties, and narrowing down to a single county won't be of much use.

The difference between the overall entropy and the weighted entropy of the sub-populations is the _information gain_ of knowing the "State". Let us see another example of using some other information.
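
The definition of information gain can be sketched on a tiny made-up table. Everything in the block below (the `toy` frame, its `region` and `won` columns, and the helpers `entropy_bits` and `info_gain`) is hypothetical and only illustrates the computation; it does not touch the election data.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Entropy in bits of the empirical label distribution (hypothetical helper)."""
    freq = labels.value_counts(normalize=True).to_numpy()
    freq = freq[freq > 0]
    return float(np.sum(freq * np.log2(1.0 / freq)))

def info_gain(frame, attr, target):
    """Entropy of the target minus the size-weighted entropy of each attr group."""
    h_before = entropy_bits(frame[target])
    h_after = 0.0
    for _, group in frame.groupby(attr):
        h_after += len(group) / len(frame) * entropy_bits(group[target])
    return h_before - h_after

# Made-up data: region A always has the same winner, region B is mixed
toy = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "won":    [True, True, True, True, False, False, True, False],
})
print(f"Info gain of knowing 'region': {info_gain(toy, 'region', 'won'):.4f} bits")
```

Region A contributes zero entropy because its outcome is certain, so most of the gain comes from separating it from the mixed region B.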

### Education

In [None]:
# Examine the education information.
df[["Less Than High School Diploma", "At Least High School Diploma",
    "At Least Bachelors's Degree","Graduate Degree"]]

In [None]:
fig = px.scatter(df, x="At Least Bachelors's Degree", y="Democrats 2016", 
                 color="Republicans Won 2016", color_discrete_sequence=['red','blue'])
fig.show()
# fig = px.scatter(df, x="At Least Bachelors's Degree", y="Democrats 2016", color="Republicans Won 2016",
#                  color_discrete_sequence=['red','blue'], size="Votes")
# fig.show()
# You can check the size="Votes" to see how significant the individual counties are


So it seems that whether more than a certain percentage of the population has "at least bachelors's degree" can be an indicator of the vote outcome. Let us pick a threshold, split the data and do some statistics.
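
One way to choose the cut-off less arbitrarily is to try several candidate thresholds and keep the one with the largest information gain. The sketch below runs on made-up numbers; the column names `bachelors_pct` and `rep_won`, the candidate list and the helper `gain_at_threshold` are all assumptions for illustration, not values taken from the dataset.

```python
import numpy as np
import pandas as pd

def entropy_bits(labels):
    """Entropy in bits of the empirical label distribution (hypothetical helper)."""
    freq = labels.value_counts(normalize=True).to_numpy()
    freq = freq[freq > 0]
    return float(np.sum(freq * np.log2(1.0 / freq)))

def gain_at_threshold(values, labels, threshold):
    """Information gain of the binary split `values > threshold`."""
    above = values > threshold
    h_split = 0.0
    for mask in (above, ~above):
        if mask.any():  # skip an empty side of the split
            h_split += mask.mean() * entropy_bits(labels[mask])
    return entropy_bits(labels) - h_split

# Made-up counties: share of bachelor's degrees and a boolean outcome
toy = pd.DataFrame({
    "bachelors_pct": [10, 14, 18, 22, 26, 31, 35, 40, 45, 50],
    "rep_won": [True, True, True, True, False, True, False, False, False, False],
})
candidates = [15, 20, 25, 30, 35, 40]
gains = {t: gain_at_threshold(toy["bachelors_pct"], toy["rep_won"], t) for t in candidates}
best = max(gains, key=gains.get)
for t, g in gains.items():
    print(f"threshold {t}: info gain {g:.4f}")
print("best threshold among the candidates:", best)
```

The scan only ranks the candidates it is given, so the result still depends on which thresholds are proposed in the first place.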

In [None]:
df["More Than 30p Bachelors"] = df["At Least Bachelors's Degree"] > 30

In [None]:
# We do the same calculation as above
total_ent = 0
num_counties = 0
attr = "More Than 30p Bachelors"
for k, v in df[attr].value_counts().items():  # Series.iteritems() was removed in pandas 2.0
    ent = exam_counties(df[df[attr]==k], verbose=False) # in this particular sub-population
    print(f"there are {v} counties where {attr} is {k}, result entropy {ent:.3f}")
    total_ent += v * ent
    num_counties += v
    
print(f"Weighted sum of entropies {total_ent/num_counties :.3f}")

# recall that the original entropy is ... (copied from above)
prob = df["Republicans Won 2016"].value_counts(normalize=True)
prob = np.array(prob)
print(f"Distribution of *repub won* w.r.t. county is [True (Rep Won), False (Dem Won)]={prob}")
ent0 = - (prob * np.log2(prob)).sum()
print(f"Entropy is {ent0:.4f}")
print(f"Info Gain: {ent0 - total_ent/num_counties:.4f}")

### Population

Let's check the following attributes

```
White (Not Latino) Population
African American Population
Native American Population
Asian American Population
Other Race or Races
Latino Population
Children Under 6 Living in Poverty
Adults 65 and Older Living in Poverty
Total Population
```

In [None]:
fig = px.scatter(df, x="White (Not Latino) Population", y="Democrats 2016", color="Republicans Won 2016",
                color_discrete_sequence=['red','blue'])
fig.show()

In [None]:
df["White (Not Latino) Population Is Greater Than 60p"] = df["White (Not Latino) Population"] > 60

In [None]:
# By now we have a pattern of measuring the information gain
# 1. get the unique values:
#    True/False, 
#    Texas, Georgia, Virginia ...
#    and get the sub-populations (of counties)
# 2. compute the entropy of the sub-populations
# 3. get the weighted sum of entropy
# 4. compare
# 

def compute_weighted_sub_entropy(df, attr, verbose=True):
    total_ent = 0
    num_counties = 0
    for k, v in df[attr].value_counts().items():  # Series.iteritems() was removed in pandas 2.0
        ent = exam_counties(df[df[attr]==k], verbose=False) # in this particular sub-population
        if verbose:
            print(f"there are {v} counties where {attr} is {k}, result entropy {ent:.3f}")
        total_ent += v * ent
        num_counties += v
    
    weighted_ent = total_ent/num_counties
    if verbose:
        print(f"Weighted sum of entropies {weighted_ent:.3f}")
    return weighted_ent

weighted_ent = compute_weighted_sub_entropy(df, "White (Not Latino) Population Is Greater Than 60p")
print(f"Info Gain: {ent0 - weighted_ent:.4f}")

## Select which attribute to examine

The natural next question is how to find the most efficient attribute to examine, i.e. the one that yields the most information.

👉 Of course, an immediate question would be "how to figure out a candidate set of promising attributes": how do you know to check the education level, and how do you know to cut at 30p? The answer is that in most cases we don't have a definite strategy, and rely on experience and playing with the data (EDA). Machine learning research is heading in the direction where less human experience is needed. For our study of decision trees, let us use the following ones.
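
One generic way to avoid hand-picking a threshold per attribute is to bin each numeric attribute into quantiles with `pd.qcut`. A minimal sketch on made-up numbers (the series `values` and its name `toy_attribute` are hypothetical):

```python
import pandas as pd

# Eight made-up measurements of some numeric attribute
values = pd.Series([3, 8, 12, 15, 21, 27, 33, 48], name="toy_attribute")

# Cut at the 25%, 50% and 75% quantiles so each bin holds about a quarter of the rows
quartile = pd.qcut(values, q=4, labels=["q1", "q2", "q3", "q4"])

print(pd.DataFrame({"value": values, "quartile": quartile}))
print(quartile.value_counts().sort_index())
```

The bins are defined by rank rather than by absolute value, so the same recipe applies to attributes with very different units and ranges. Note that `pd.qcut` raises an error when the quantile edges are not unique, for instance when many rows share the same value.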

In [None]:
attributes = ["White (Not Latino) Population", 
    "African American Population",
    "Native American Population",
    "Asian American Population", 
    "Latino Population",
    "Less Than High School Diploma",
    "At Least High School Diploma",
    "At Least Bachelors's Degree",
    "Graduate Degree",
    "School Enrollment",
    "Median Earnings 2010",
    "Children Under 6 Living in Poverty",
    "Adults 65 and Older Living in Poverty",
    "Preschool.Enrollment.Ratio.enrolled.ages.3.and.4",
    "Poverty.Rate.below.federal.poverty.threshold",
    "Gini.Coefficient",
    "Child.Poverty.living.in.families.below.the.poverty.line",
    "Management.professional.and.related.occupations",
    "Service.occupations",
    "Sales.and.office.occupations",
    "Farming.fishing.and.forestry.occupations",
    "Construction.extraction.maintenance.and.repair.occupations",
    "Production.transportation.and.material.moving.occupations",
    "Median Age",
    "Poor.physical.health.days",
    "Poor.mental.health.days",
    "Low.birthweight",
    "Teen.births",
    "Children.in.single.parent.households",
    "Adult.smoking",
    "Adult.obesity",
    "Diabetes",
    "Sexually.transmitted.infections",
    "HIV.prevalence.rate",
    "Uninsured",
    "Unemployment",
    "Violent.crime",
    "Homicide.rate",
    "Injury.deaths",
    "Infant.mortality"]
new_attributes = []
for a in attributes:
    new_a = "Quant4." + a
    df[new_a] = pd.qcut(df[a], q=4, labels=["q1", "q2", "q3", "q4"])
    new_attributes.append(new_a)


In [None]:
vote_info = [
    "Votes16 Trumpd",
    "Votes16 Clintonh",
    "State",
    "ST",
    "Fips",
    "County",
    "Precincts",
    "Votes",
    "Democrats 08 (Votes)",
    "Democrats 12 (Votes)",
    "Republicans 08 (Votes)",
    "Republicans 12 (Votes)",
    "Republicans 2016",
    "Democrats 2016",
    "Green 2016",
    "Libertarians 2016",
    "Republicans 2012",
    "Republicans 2008",
    "Democrats 2012",
    "Democrats 2008"]
df_new = df[new_attributes]
# dropna is not in-place: pass keyword arguments and keep the returned frame
df_new = df_new.dropna(axis='columns', how='any')
df_new

In [None]:
# The compute entropy and info_gain are copied from our exercise notebook,
# the procedure is as explained in the analysis steps above. 
def compute_entropy(y):
    """
    :param y: The data samples of a discrete distribution
    """
    if len(y) < 2: #  a trivial case
        return 0
    freq = np.array( y.value_counts(normalize=True) )
    return -(freq * np.log2(freq + 1e-6)).sum() # the small eps guards
    # against log2(0) for safe numerical computation 
    
def compute_info_gain(samples, attr, target):
    values = samples[attr].value_counts(normalize=True)
    split_ent = 0
    for v, fr in values.items():  # Series.iteritems() was removed in pandas 2.0
        index = samples[attr]==v
        sub_ent = compute_entropy(target[index])
        split_ent += fr * sub_ent
    
    ent = compute_entropy(target)
    return ent - split_ent

class TreeNode:
    """
    A recursively defined data structure to store a tree.
    Each node can contain other nodes as its children
    """
    def __init__(self, node_name="", min_sample_num=10, default_decision=None):
        self.children = {} # Sub nodes --
        # recursive, those elements of the same type (TreeNode)
        self.decision = None # Undecided
        self.split_feat_name = None # Splitting feature
        self.name = node_name
        self.default_decision = default_decision
        self.min_sample_num = min_sample_num

    def pretty_print(self, prefix=''):
        if self.split_feat_name is not None:
            for k, v in self.children.items():
                v.pretty_print(f"{prefix}:When {self.split_feat_name} is {k}")
                #v.pretty_print(f"{prefix}:{k}:")
        else:
            print(f"{prefix}:{self.decision}")

    def predict(self, sample):
        if self.decision is not None:
            # log information of code execution (comment out the print to silence it)
            print("Decision:", self.decision)
            return self.decision
        else: 
            # this node is an internal one, further queries about an attribute 
            # of the data is needed.
            attr_val = sample[self.split_feat_name]
            child = self.children[attr_val]
            # log information of code execution (comment out the print to silence it)
            print("Testing ", self.split_feat_name, "->", attr_val)

            # Delegate the decision to the child node matching the attribute value
            return child.predict(sample)

    def fit(self, X, y):
        """
        The function accepts a training dataset, from which it builds the tree 
        structure to make decisions or to make children nodes (tree branches) 
        to do further inquiries
        :param X: [n * p] n observed data samples of p attributes
        :param y: [n] target values
        """
        if self.default_decision is None:
            self.default_decision = y.mode()[0]
            
            
        print(self.name, "received", len(X), "samples")
        if len(X) < self.min_sample_num:
            # Too few samples to split further: use the default decision if 
            # no data arrived at this node, otherwise take the majority label
            if len(X) == 0:
                self.decision = self.default_decision
                print("DECISION", self.decision)
            else:
                self.decision = y.mode()[0]
                print("DECISION", self.decision)
            return
        else: 
            unique_values = y.unique()
            if len(unique_values) == 1:
                self.decision = unique_values[0]
                print("DECISION", self.decision)
                return
            else:
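                info_gain_max = 0
                for a in X.keys(): # Examine each attribute
                    aig = compute_info_gain(X, a, y)
                    if aig > info_gain_max:
                        # Keep the attribute with the largest information gain so far
                        info_gain_max = aig
                        self.split_feat_name = a
                if self.split_feat_name is None:
                    # No attribute reduces the entropy: stop and take the majority label
                    self.decision = y.mode()[0]
                    print("DECISION", self.decision)
                    return
                print(f"Split by {self.split_feat_name}, IG: {info_gain_max:.2f}")
                self.children = {}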
                for v in X[self.split_feat_name].unique():
                    index = X[self.split_feat_name] == v
                    self.children[v] = TreeNode(
                        node_name=self.name + ":" + self.split_feat_name + "==" + str(v),
                        min_sample_num=self.min_sample_num,
                        default_decision=self.default_decision)
                    self.children[v].fit(X[index], y[index])

# Test tree building
data = df[new_attributes].dropna(axis='columns', how='any')  # positional axis/how are rejected by pandas 2.x
target = df["Republicans Won 2016"]

t = TreeNode(min_sample_num=50)
t.fit(data, target)

In [None]:
corr = 0
err_fp = 0
err_fn = 0
for (i, ct), tgt in zip(data.iterrows(), target):
    a = t.predict(ct)
    if a and not tgt:
        err_fp += 1
    elif not a and tgt:
        err_fn += 1
    else:
        corr += 1
        


In [None]:
corr, err_fp, err_fn
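
The three counters can be condensed into rates that are easier to read. The sketch below assumes that "positive" means "Republicans won", as in the loop above; `summarise_counts` is a hypothetical helper and the counts passed to it are made up, not results from this notebook. Because the predictions are made on the same data the tree was fitted on, such numbers remain a sanity check rather than an evaluation.

```python
def summarise_counts(corr, err_fp, err_fn):
    """Turn raw counts into shares of all predictions (hypothetical helper)."""
    total = corr + err_fp + err_fn
    if total == 0:
        raise ValueError("no predictions to summarise")
    return {
        "accuracy": corr / total,
        "false_positive_share": err_fp / total,  # predicted Rep win, Dem actually won
        "false_negative_share": err_fn / total,  # predicted Dem win, Rep actually won
    }

# Made-up counts, for illustration only
summary = summarise_counts(corr=90, err_fp=6, err_fn=4)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```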