## PCA and Senate Voting Data


We return to the Senate voting data examined in HW1, with $X$ the $n \times m$ data matrix, where each row corresponds to a senator and each column to a bill. In the written solutions, we derived that our objective, maximizing the variance of the scores $f(x) = a^Tx + b$, is equivalent to

$$\max_{a \::\: a^Ta =1} \: a^T\left(\frac{1}{n}X^TX - \mu_x \mu_x^T\right)a$$

where $\mu_x$ is the mean of the rows of $X$. We will proceed to compute and analyze the senator data accordingly.

In [None]:
# Import the necessary packages for data manipulation, computation and PCA 
import pandas as pd
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
%matplotlib inline

In [None]:
senator_df = pd.read_csv('senator_data_pca/data_matrix.csv')
# Keep the second space-separated token of each label line as the affiliation
with open("senator_data_pca/politician_labels.txt", "r") as affiliation_file:
    affiliations = [line.split('\n')[0].split(' ')[1] for line in affiliation_file]
X = np.array(senator_df.values[:, 3:].T, dtype='float64')  # transpose to get senators as rows

The matrix we are interested in is the covariance matrix of the votes,
$$\overline{X} = \frac{1}{n}X^TX - \mu_x \mu_x^T$$
So we will compute that, and use scikit-learn's `PCA` to find the first principal axis.

In [None]:
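n = len(X)  # number of senators (rows of X)
mu_x = X.mean(axis=0, dtype='float64').reshape(X.shape[1], 1)
X_bar = 1/n*(X.T @ X) - mu_x @ mu_x.T
pca = PCA()
# PCA centers X internally, so its components are the eigenvectors of X_bar
pca.fit(X)
a_1 = pca.components_[0] # This is the first principal vector/axis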
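
As a sanity check on the relationship between $\overline{X}$ and the principal axes, the following is a minimal sketch on synthetic data (the matrix `X_demo` and its size are made up for illustration and are not the Senate data). It compares the top eigenvector of $\frac{1}{n}X^TX - \mu_x\mu_x^T$ with the first right singular vector of the centered data, which should agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the vote matrix: 50 "senators" x 8 "bills"
X_demo = rng.choice([-1.0, 0.0, 1.0], size=(50, 8))
n_demo = X_demo.shape[0]

mu = X_demo.mean(axis=0).reshape(-1, 1)
C = (X_demo.T @ X_demo) / n_demo - mu @ mu.T  # same formula as X_bar

# eigh handles symmetric matrices and returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(C)
a_top = eigvecs[:, -1]

# First right singular vector of the centered data matrix
_, s, Vt = np.linalg.svd(X_demo - mu.T, full_matrices=False)
v_top = Vt[0]

# The two directions agree up to sign, so |a_top . v_top| should be 1
alignment = abs(a_top @ v_top)
# Squared singular values divided by n recover the eigenvalues of C
top_eigval_from_svd = s[0] ** 2 / n_demo
print(alignment, eigvals[-1], top_eigval_from_svd)
```

The sign ambiguity means the scores along the first axis may be flipped between methods; the variance along the axis is unaffected.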

In [None]:
def f(a, b, senators):
    # Affine score f(x) = a^T x + b for each senator (each row of `senators`)
    data = []
    for senator in senators:
        data.append(a.T @ senator + b)
    return np.array(data)
# Retrieve the scores for each senator along a_1. The offset b only shifts the scores, so the variance is the same for all values of b; we use b = 0
a_1_scores = f(a_1, 0, X)
a_1_scores.var()
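
The claim that the variance does not depend on $b$, and that it equals the quadratic form $a^T(\frac{1}{n}X^TX - \mu_x\mu_x^T)a$, can be checked numerically. This is a small sketch on synthetic data; `X_demo`, the direction `a`, and the offset `5.0` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stand-in for the vote matrix: 40 "senators" x 6 "bills"
X_demo = rng.choice([-1.0, 0.0, 1.0], size=(40, 6))
n_demo = X_demo.shape[0]

mu = X_demo.mean(axis=0)
C = (X_demo.T @ X_demo) / n_demo - np.outer(mu, mu)

# Arbitrary unit-norm direction
a = rng.normal(size=6)
a /= np.linalg.norm(a)

# Scores f(x) = a^T x + b for two different offsets
var_b0 = (X_demo @ a + 0.0).var()
var_b5 = (X_demo @ a + 5.0).var()

# np.var uses the 1/n convention by default, matching the formula for C
quad_form = a @ C @ a
print(var_b0, var_b5, quad_form)
```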

#### Comparing variances

In the written solutions we found that if $a = \mu_x$, then $b = -\mu_x^T\mu_x$, and the variance is


$$ \mu_x^T (\frac{1}{n}X^TX - \mu_x \mu_x^T)\mu_x$$


In [None]:
b = -mu_x.T @ mu_x
a = mu_x
scores = f(a, b, X)
scores.var()/np.linalg.norm(mu_x)

In [None]:
scores.mean()
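
Because $\mu_x$ is not a unit vector, the variance along it is not directly comparable with the variance along a unit-norm principal axis. Scaling $a$ by a constant $c$ scales the variance by $c^2$, so one way to normalize is to divide the raw variance by $\|\mu_x\|^2$. The sketch below illustrates this on synthetic data (`X_demo` and the vote probabilities are invented for the example) and also checks that the normalized variance cannot exceed the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical vote matrix with a nonzero mean so that mu is not tiny
X_demo = rng.choice([-1.0, 0.0, 1.0], size=(40, 6), p=[0.3, 0.1, 0.6])
n_demo = X_demo.shape[0]

mu = X_demo.mean(axis=0)
C = (X_demo.T @ X_demo) / n_demo - np.outer(mu, mu)

# Scores with a = mu and b = -mu^T mu, which have zero mean
raw_scores = X_demo @ mu - mu @ mu

# Variance along the unit vector mu / ||mu||, computed two ways
unit_scores = X_demo @ (mu / np.linalg.norm(mu))
normalized_var = raw_scores.var() / np.linalg.norm(mu) ** 2

# Largest eigenvalue of C is the variance along the first principal axis
lam_max = np.linalg.eigvalsh(C)[-1]
print(raw_scores.mean(), normalized_var, unit_scores.var(), lam_max)
```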

We can see that the mean of the scores is extremely close to 0, and that the variance is 134.55 (which has to be normalized to be compared to the first principal component, since $a_1$ has unit norm and $\mu_x$ does not). Let us visualize the score of each senator along the first principal axis $a_1$, colored by party affiliation.

In [None]:
plt.title('Score for a_1')
plt.scatter(a_1_scores, np.zeros_like(a_1_scores), c=affiliations)

We can see that the majority of the blue points lie toward one side of the axis and the majority of the red points toward the other. This also shows that senators tend to vote close to the mean of their political party.

#### Total Variance

We have shown that the total variance explained by the first two principal components is $\lambda_1 +\lambda_2$, where $\lambda_i$ is the eigenvalue of $\overline{X}$ corresponding to the eigenvector $a_i$:
$$\overline{X}a_i = \lambda_i a_i$$

Hence we will find the two largest eigenvalues of $\overline{X}$.

In [None]:
# X_bar is symmetric, so eigvalsh applies; it returns eigenvalues in ascending order
eigenvals = np.linalg.eigvalsh(X_bar)[::-1][:2]
eigenvals

In [None]:
eigenvals.sum()

Above are the two largest eigenvalues, and their sum is the total variance explained by the first two principal components.
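
To illustrate why the sum of the two largest eigenvalues is the explained variance, here is a short sketch on synthetic data (`X_demo` is a made-up matrix, not the Senate data). It projects the centered data onto the top two eigenvectors and compares the summed variance of the projections with $\lambda_1 + \lambda_2$; dividing by the trace gives the fraction of the total variance explained.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical stand-in for the vote matrix: 60 "senators" x 7 "bills"
X_demo = rng.choice([-1.0, 0.0, 1.0], size=(60, 7))
n_demo = X_demo.shape[0]

mu = X_demo.mean(axis=0)
C = (X_demo.T @ X_demo) / n_demo - np.outer(mu, mu)

# eigh returns ascending eigenvalues; reverse to get descending order
eigvals, eigvecs = np.linalg.eigh(C)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]

top_two_sum = eigvals[:2].sum()

# Variance of the centered data projected onto the first two eigenvectors
proj = (X_demo - mu) @ eigvecs[:, :2]
proj_var_sum = proj.var(axis=0).sum()

# The trace of C is the total variance across all columns
explained_fraction = top_two_sum / np.trace(C)
print(top_two_sum, proj_var_sum, explained_fraction)
```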

In [None]:
pca = PCA(n_components=2)
projected = pca.fit_transform(X)

In [None]:
plt.scatter(projected[:, 0], projected[:, 1], c=affiliations)

## 4. Most and Least Partisan Bills and Senators


In [None]:
bills = senator_df['bill_type bill_name bill_ID'].values
#a_1 sorted by absolute value. The most partisan bills will have the highest absolute value, while the more non-partisan will have a lower value
a_sorted = np.argsort(abs(a_1))
#most partisan
for i in range(1, 11): 
    print(bills[a_sorted[-i]])

In [None]:
#least partisan
for i in range(10): 
    print(bills[a_sorted[i]])
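
The ranking logic used above can be seen on a toy example. The loadings and bill names below are invented purely for illustration: entries of the principal axis with large absolute value mark bills that separate the two ends of the axis most strongly, and `np.argsort` on the absolute values orders them from least to most influential.

```python
import numpy as np

# Hypothetical loadings of a first principal axis, one per bill
loadings = np.array([0.05, -0.60, 0.30, -0.02, 0.74])
names = np.array(["bill_A", "bill_B", "bill_C", "bill_D", "bill_E"])

# Indices sorted by ascending absolute loading
order = np.argsort(np.abs(loadings))

least_partisan = names[order[:2]]        # smallest |loading| first
most_partisan = names[order[::-1][:2]]   # largest |loading| first
print(most_partisan, least_partisan)
```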

In [None]:
# We will do the same with the senators
senators = senator_df.columns.values[3:]
scores_sorted = np.argsort(abs(f(a_1, 0, X)))
# Most extreme
for i in range(1, 11): 
    print(senators[scores_sorted[-i]], affiliations[scores_sorted[-i]])

In [None]:
# Least extreme (start at index 0 so the smallest score is included)
for i in range(10): 
    print(senators[scores_sorted[i]], affiliations[scores_sorted[i]])