# Congressional Voting Records Data Set


This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).
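The nine-to-three simplification described above can be pictured as a small lookup table. The sketch below is illustrative only: the names `VOTE_TYPE_TO_SYMBOL` and `simplify_vote` are hypothetical, and the symbols `y`, `n`, and `?` are the encodings this notebook works with further down.

```python
# Hypothetical lookup table for the simplification described above.
VOTE_TYPE_TO_SYMBOL = {
    "voted for": "y",
    "paired for": "y",
    "announced for": "y",
    "voted against": "n",
    "paired against": "n",
    "announced against": "n",
    "voted present": "?",
    "voted present to avoid conflict of interest": "?",
    "did not vote or otherwise make a position known": "?",
}

def simplify_vote(vote_type):
    """Map one of the nine CQA vote types to 'y', 'n', or '?'."""
    return VOTE_TYPE_TO_SYMBOL[vote_type]

simplified = [simplify_vote(v) for v in ("paired for", "announced against", "voted present")]
print(simplified)
```

Under this reading, a `?` records that no yea or nay position was taken, which is worth keeping in mind when it is later treated as a missing value.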

Attribute Information:
Class Name: 2 (democrat, republican) 

handicapped-infants: 2 (y,n) 

water-project-cost-sharing: 2 (y,n) 

adoption-of-the-budget-resolution: 2 (y,n) 

physician-fee-freeze: 2 (y,n) 

el-salvador-aid: 2 (y,n) 

religious-groups-in-schools: 2 (y,n) 

anti-satellite-test-ban: 2 (y,n) 

aid-to-nicaraguan-contras: 2 (y,n) 

mx-missile: 2 (y,n) 

immigration: 2 (y,n) 

synfuels-corporation-cutback: 2 (y,n) 

education-spending: 2 (y,n) 

superfund-right-to-sue: 2 (y,n) 

crime: 2 (y,n) 

duty-free-exports: 2 (y,n) 

export-administration-act-south-africa: 2 (y,n)

URL: https://archive.ics.uci.edu/ml/datasets/congressional+voting+records



# Section 2: Loading the dataset and preprocessing of data

In [1]:
import pandas as pd
import numpy as np
# Raw string so that the backslashes in the Windows path are not read as escape sequences.
dataframe1 = pd.read_csv(r"E:\Tranparent machine learning\CongressDataset.csv")

Replacing the missing-value marker "?" with NaN and then filling every NaN with the mode of its column. (We could also drop the rows that contain NaN values, but that would discard data.) The remaining string values are converted to numbers in the next step, because the scikit-learn classifier used below works only with numeric or boolean values.

In [26]:
dataframe2 = dataframe1.replace('?', np.nan)
for column in dataframe2.columns:
    # Assign the result back instead of calling fillna(inplace=True) on a column selection.
    dataframe2[column] = dataframe2[column].fillna(dataframe2[column].mode()[0])
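To make the imputation step concrete, here is a minimal sketch on a toy table. The column names `vote_a` and `vote_b` and the values are made up for illustration and are not part of the data set.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the voting table; the column names are illustrative only.
toy = pd.DataFrame({
    "vote_a": ["y", "?", "y", "n"],
    "vote_b": ["n", "n", "?", "y"],
})

# Step 1: mark "?" as missing.
toy = toy.replace("?", np.nan)

# Step 2: fill each column with its own most frequent value.
for column in toy.columns:
    toy[column] = toy[column].fillna(toy[column].mode()[0])

print(toy)
```

One design point to be aware of: computing the mode on the full table before splitting lets the test rows influence the fill values. Computing the modes on the training rows only would avoid that, at the cost of a slightly longer pipeline.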

Converting all the features into binary values (democrat and y become 1; republican and n become 0)

In [27]:
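# Encode the class with the same symbols as the votes: democrat -> "y", republican -> "n".
dataframe2['Class'] = dataframe2['Class'].map(lambda x : "y" if x == "democrat" else "n")
# Map "y" to 1 and everything else to 0 (avoids the deprecated DataFrame.applymap).
dataframe2 = (dataframe2 == "y").astype(int)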

Making a copy of the dataset before splitting it into training and test data

In [28]:
dataframe3 = dataframe2.copy(deep = True)

Removing the column that is to be predicted from the dataset

In [29]:
del dataframe3['Class']

Splitting the dataset into training and test data

In [51]:
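# sklearn.cross_validation was removed in scikit-learn 0.20; train_test_split now lives in model_selection.
from sklearn.model_selection import train_test_split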
X_train, X_test, y_train, y_test = train_test_split( dataframe3, dataframe2['Class'], test_size = 0.33)
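The split above is random, so the row indices and the numbers in the recorded outputs change from run to run unless the random seed is fixed. The NumPy-only sketch below shows what a seeded 67/33 shuffle split amounts to. The helper `shuffle_split` is hypothetical and written for illustration; with scikit-learn itself, passing a `random_state` argument to `train_test_split` serves the same purpose.

```python
import numpy as np

def shuffle_split(n_rows, test_size=0.33, seed=0):
    """Return (train_idx, test_idx) for a seeded random split (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)          # shuffled row positions
    n_test = int(round(test_size * n_rows))  # number of rows held out for testing
    return order[n_test:], order[:n_test]

train_idx, test_idx = shuffle_split(100, test_size=0.33, seed=42)
print(len(train_idx), len(test_idx))
```

Calling the helper twice with the same seed returns the same partition, which is what makes results such as "the most positive object" repeatable.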

In [52]:
from sklearn.naive_bayes import BernoulliNB
clf = BernoulliNB()
clf.fit(X_train,y_train)


BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

Now, using scikit-learn, we compute the predicted class probabilities of every instance in the test set

In [53]:
prob_per_instance = clf.predict_proba(X_test)
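`predict_proba` returns ordinary probabilities, which come from normalising per-class log scores. The sketch below shows that normalisation with the log-sum-exp trick on made-up joint log-likelihoods. It illustrates the general technique and is not a copy of scikit-learn's internals.

```python
import numpy as np

# Made-up joint log-likelihoods log P(x, C=c) for three instances and two classes.
joint_log_likelihood = np.array([
    [-30.0, -5.0],
    [-4.0, -24.0],
    [-10.0, -10.3],
])

# Log-sum-exp: subtract the row maximum before exponentiating to avoid underflow.
row_max = joint_log_likelihood.max(axis=1, keepdims=True)
log_norm = row_max + np.log(np.exp(joint_log_likelihood - row_max).sum(axis=1, keepdims=True))

log_posterior = joint_log_likelihood - log_norm
posterior = np.exp(log_posterior)  # each row now sums to 1
print(posterior)
```

Working in log space matters here because products of sixteen small probabilities can get very close to zero, as the extreme values in the outputs further down suggest.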

Reading the log probability of each feature from feature_log_prob_. The first row holds log P(X1=1|C=0), log P(X2=1|C=0), ... and the second row holds log P(X1=1|C=1), log P(X2=1|C=1), ...

In [54]:
clf.feature_log_prob_

array([[-1.7227666 , -0.5441116 , -1.77405989, -0.04567004, -0.06453852,
        -0.10337835, -1.31730149, -1.54044504, -1.88528553, -0.55961579,
        -1.94591015, -0.24116206, -0.16462198, -0.02715099, -2.3206036 ,
        -0.40101076],
       [-0.5555258 , -0.62451867, -0.09749836, -3.263576  , -1.35933855,
        -0.70967648, -0.23275241, -0.18560563, -0.2536591 , -0.74357803,
        -0.7668349 , -2.1184437 , -1.16643489, -0.98997845, -0.49995595,
        -0.04470018]])
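Each entry of `feature_log_prob_` is the log of a smoothed frequency. With the `alpha=1.0` shown in the model printout above, the estimate of P(Xi=1|C=c) is (number of class-c training rows with Xi=1, plus alpha) divided by (number of class-c training rows, plus 2·alpha). The sketch below reproduces that calculation by hand on made-up data, so the arrays `X` and `y` are assumptions for illustration only.

```python
import numpy as np

# Made-up binary training data: 5 rows, 2 features.
X = np.array([[1, 0],
              [1, 1],
              [0, 1],
              [0, 0],
              [1, 1]])
y = np.array([0, 0, 1, 1, 1])
alpha = 1.0  # same smoothing value as in the fitted model above

rows = []
for c in (0, 1):
    X_c = X[y == c]
    # Smoothed estimate of P(X_i = 1 | C = c) for every feature i.
    prob_one = (X_c.sum(axis=0) + alpha) / (len(X_c) + 2 * alpha)
    rows.append(np.log(prob_one))
manual_feature_log_prob = np.array(rows)

print(np.exp(manual_feature_log_prob))
```

The smoothing keeps every probability strictly between 0 and 1, so the logarithms and the ratios computed later are always finite.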

In [55]:
class_0_X_1, class_1_X_1 = clf.feature_log_prob_

In [56]:
class_0_X_1 = np.exp(class_0_X_1)
class_1_X_1 = np.exp(class_1_X_1)

Now taking the complementary probabilities P(X1=0|C=0), P(X2=0|C=0), ... and P(X1=0|C=1), P(X2=0|C=1), ...

In [57]:
class_0_X_0 = 1 - class_0_X_1
class_1_X_0 = 1- class_1_X_1 

Calculating the log ratio of the class priors, log P(C=1) - log P(C=0)

In [69]:
a,b = (clf.class_log_prior_)
class_prior = b-a
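The value `class_prior` is the log prior odds of class 1 against class 0. Under the naive Bayes independence assumption it combines with per-feature terms as follows (a standard derivation, written in the notation of this notebook, with 16 features):

```latex
\log\frac{P(C=1 \mid x)}{P(C=0 \mid x)}
  = \underbrace{\log\frac{P(C=1)}{P(C=0)}}_{\texttt{class\_prior}}
  + \sum_{i=1}^{16} \log\frac{P(X_i = x_i \mid C=1)}{P(X_i = x_i \mid C=0)}
```

Each term of the sum is the log evidence contributed by one feature: a positive term favours class 1 (democrat) and a negative term favours class 0 (republican).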

Calculating the log-likelihood ratio of each feature value, log P(Xi=v|C=1) - log P(Xi=v|C=0), for v = 0 and v = 1

In [59]:
log_prob_X_0_C_0_C_1 = np.log(class_1_X_0/class_0_X_0)
log_prob_X_1_C_0_C_1 = np.log(class_1_X_1/class_0_X_1)
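As a sanity check on this decomposition, the sketch below uses a made-up three-feature model (all numbers are assumptions) and confirms that the log prior ratio plus the summed per-feature log ratios gives the same posterior as applying Bayes' rule directly.

```python
import numpy as np

# Made-up model: P(X_i = 1 | C = c) for three features, and class priors.
p_x1_given_c0 = np.array([0.9, 0.2, 0.5])
p_x1_given_c1 = np.array([0.3, 0.7, 0.6])
prior_c0, prior_c1 = 0.4, 0.6

x = np.array([1, 0, 1])  # one instance

# Log-likelihood ratio contributed by each feature, given its observed value.
ratio_if_one = np.log(p_x1_given_c1 / p_x1_given_c0)
ratio_if_zero = np.log((1 - p_x1_given_c1) / (1 - p_x1_given_c0))
evidence = np.where(x == 1, ratio_if_one, ratio_if_zero)

# Posterior from the log odds (prior term plus summed evidence).
log_odds = np.log(prior_c1 / prior_c0) + evidence.sum()
posterior_c1 = 1.0 / (1.0 + np.exp(-log_odds))

# Direct Bayes computation for comparison.
lik_c0 = np.prod(np.where(x == 1, p_x1_given_c0, 1 - p_x1_given_c0))
lik_c1 = np.prod(np.where(x == 1, p_x1_given_c1, 1 - p_x1_given_c1))
direct = prior_c1 * lik_c1 / (prior_c0 * lik_c0 + prior_c1 * lik_c1)

print(posterior_c1, direct)
```

In this toy instance the first two features count against class 1 and the third counts for it, which is the same positive/negative split the notebook applies to the voting features.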

In [60]:
def log_evidence_calulator(Object):
    """Split the per-feature log-likelihood ratios of one instance into
    positive (towards class 1) and negative (towards class 0) evidence."""
    positive_log_evidence = []
    negative_log_evidence = []
    for i, value in enumerate(Object):
        # Pick the ratio that matches the observed feature value.
        evidence = log_prob_X_1_C_0_C_1[i] if value > 0 else log_prob_X_0_C_0_C_1[i]
        if evidence > 0:
            positive_log_evidence.append(evidence)
        else:
            negative_log_evidence.append(evidence)
    return (positive_log_evidence, negative_log_evidence)

def most_positive_feature(Object, positive_evidence, negative_evidence):
    """Return the names of up to three features with the strongest positive
    evidence and up to three with the strongest negative evidence."""
    feature_names = dataframe3.columns
    # Log-likelihood ratio contributed by each feature of this instance.
    log_evidence_per_row = np.array([
        log_prob_X_1_C_0_C_1[i] if value > 0 else log_prob_X_0_C_0_C_1[i]
        for i, value in enumerate(Object)
    ])
    # Feature names sorted by evidence, most negative first.
    # (The original indexed with last_row.argsort without calling it.)
    ascending_sort = feature_names[np.argsort(log_evidence_per_row)]
    n_negative = min(3, len(negative_evidence))
    n_positive = min(3, len(positive_evidence))
    negative_feature_list = list(ascending_sort[:n_negative])
    # Guard against n_positive == 0, where [-0:] would select every feature.
    positive_feature_list = list(ascending_sort[-n_positive:]) if n_positive > 0 else []
    return (positive_feature_list, negative_feature_list)
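The feature ranking rests on sorting the per-feature evidence and taking names from both ends of the sorted order. A minimal sketch of that idea follows, using hypothetical feature names and made-up evidence values.

```python
import numpy as np

feature_names = np.array(["f1", "f2", "f3", "f4", "f5"])  # hypothetical names
evidence = np.array([0.8, -2.1, 1.5, -0.3, 0.1])          # made-up log evidence

order = np.argsort(evidence)  # indices from most negative to most positive
k = 2
most_negative = feature_names[order[:k]].tolist()
most_positive = feature_names[order[-k:]].tolist()

print(most_negative, most_positive)
```

Because the order is ascending, the strongest positive feature comes last in `most_positive`. Note also that a slice such as `order[-k:]` returns every element when `k` is 0, so a zero count needs its own check.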

    
    
    
    
        
    

        
    

# Most positive object with respect to probabilities

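The most positive object is the test instance with the highest predicted probability of class 1 (democrat, treated as positive); class 0 (republican) is treated as negative.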

In [70]:
# Index of the test instance with the highest predicted probability of class 1.
index = np.argmax(prob_per_instance[:,1])
most_positive_object = X_test.iloc[index]
print ("The most positive object:\n",most_positive_object)
pos_log_evidence, neg_log_evidence = log_evidence_calulator(most_positive_object)
total_pos_log_evidence = sum(pos_log_evidence)
total_neg_log_evidence = sum(neg_log_evidence)

# The log prior ratio is reported together with the positive evidence.
print("\ntotal Positive log evidence:", total_pos_log_evidence + class_prior)
print("\n\ntotal Negative log evidence:", total_neg_log_evidence)
print("\n\nProbability distribution:",prob_per_instance[index] )
pos_feature,neg_feature = most_positive_feature(most_positive_object,pos_log_evidence,neg_log_evidence)
print("\n\nMost Positive feature",pos_feature)
print("\n\nMost Negative feature",neg_feature)

The most positive object:
 handicapped-infants                       1
water-project-cost-sharing                0
adoption-of-the-budget-resolution         1
physician-fee-freeze                      0
el-salvador-aid                           0
religious-groups-in-schools               0
anti-satellite-test-ban                   1
aid-to-nicaraguan-contras                 1
mx-missile                                1
immigration                               0
synfuels-corporation-cutback              1
education-spending                        0
superfund-right-to-sue                    0
crime                                     0
duty-free-exports                         1
export-administration-act-south-africa    1
Name: 116, dtype: int64

total Positive log evidence: 24.3418539467


total Negative log evidence: 0


Probability distribution: [  2.68205174e-11   1.00000000e+00]


Most Positive feature ['el-salvador-aid', 'physician-fee-freeze', 'crime']


Most Negative feature []


# Most negative object with respect to probabilities

In [71]:
# Index of the test instance with the highest predicted probability of class 0.
index = np.argmax(prob_per_instance[:,0])
most_negative_object = X_test.iloc[index]
print ("The most negative object:\n",most_negative_object)
pos_log_evidence, neg_log_evidence = log_evidence_calulator(most_negative_object)
total_pos_log_evidence = sum(pos_log_evidence)
total_neg_log_evidence = sum(neg_log_evidence)

# The log prior ratio is reported together with the positive evidence.
print("\n\ntotal Positive log evidence:", total_pos_log_evidence + class_prior)
print("\n\ntotal Negative log evidence:", total_neg_log_evidence)
print("\n\nProbability distribution:",prob_per_instance[index] )
pos_feature,neg_feature = most_positive_feature(most_negative_object,pos_log_evidence,neg_log_evidence)
print("\n\nMost Positive feature",pos_feature)
print("\n\nMost Negative feature",neg_feature)

The most negative object:
 handicapped-infants                       0
water-project-cost-sharing                1
adoption-of-the-budget-resolution         0
physician-fee-freeze                      1
el-salvador-aid                           1
religious-groups-in-schools               1
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 0
mx-missile                                0
immigration                               1
synfuels-corporation-cutback              0
education-spending                        1
superfund-right-to-sue                    1
crime                                     1
duty-free-exports                         0
export-administration-act-south-africa    0
Name: 330, dtype: int64


total Positive log evidence: 0.498016665473


total Negative log evidence: -19.5197586064


Probability distribution: [  9.99999995e-01   5.48229545e-09]


Most Positive feature []


Most Negative feature ['physician-fee-freeze', 'adoption-of-t


# The Object having the Highest Positive evidence


In [72]:
positive_log_evidence_list = []
negative_log_evidence_list = []
for row in range(0,len(X_test)):
    # Sum the positive and the negative log evidence of each test instance.
    pos_log_evidence, neg_log_evidence = log_evidence_calulator(X_test.iloc[row])
    positive_log_evidence_list.append(sum(pos_log_evidence))
    negative_log_evidence_list.append(sum(neg_log_evidence))

positive_log_evidence = np.array(positive_log_evidence_list)
negative_log_evidence = np.array(negative_log_evidence_list)
# Index of the test instance with the largest total positive evidence.
index = np.argmax(positive_log_evidence)
print("\n\nThe object having the highest positive evidence:\n", X_test.iloc[index])
print("\n\nThe total log positive evidence:",positive_log_evidence[index] + class_prior)
print("\n\nThe total log negative evidence:", negative_log_evidence[index])
print("\n\nThe probability distribution",prob_per_instance[index])

pos_log_evidence, neg_log_evidence = log_evidence_calulator(X_test.iloc[index])
pos_feature,neg_feature = most_positive_feature(X_test.iloc[index],pos_log_evidence,neg_log_evidence)
print("\n\nMost Positive feature",pos_feature)
print("\n\nMost Negative feature",neg_feature)



The object having the highest positive evidence:
 handicapped-infants                       1
water-project-cost-sharing                0
adoption-of-the-budget-resolution         1
physician-fee-freeze                      0
el-salvador-aid                           0
religious-groups-in-schools               0
anti-satellite-test-ban                   1
aid-to-nicaraguan-contras                 1
mx-missile                                1
immigration                               0
synfuels-corporation-cutback              1
education-spending                        0
superfund-right-to-sue                    0
crime                                     0
duty-free-exports                         1
export-administration-act-south-africa    1
Name: 116, dtype: int64


The total log positive evidence: 24.3418539467


The total log negative evidence: 0.0


The probability distribution [  2.68205174e-11   1.00000000e+00]


Most Positive feature ['el-salvador-aid', 'physician-fee-freeze', 'crime']
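In the recorded run, the object with the highest positive evidence is also the one with the highest class-1 probability (row 116), but the two criteria need not agree in general, because strong negative evidence can offset strong positive evidence. The sketch below illustrates this with made-up evidence values for two hypothetical instances.

```python
import numpy as np

# Made-up per-feature log evidence for two instances (rows) and three features.
evidence = np.array([
    [3.0, 2.5, -5.0],  # instance 0: much positive evidence, but also strong negative evidence
    [1.5, 1.0, 0.2],   # instance 1: modest but uniformly positive evidence
])

positive_total = np.where(evidence > 0, evidence, 0.0).sum(axis=1)
negative_total = np.where(evidence < 0, evidence, 0.0).sum(axis=1)
net_log_odds = positive_total + negative_total  # prior term omitted for simplicity

print(np.argmax(positive_total), np.argmax(net_log_odds))
```

Here instance 0 has the largest positive total, while instance 1 has the larger net log odds and therefore the higher class-1 probability.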




# The object that has the largest (in magnitude) negative evidence.


In [73]:
positive_log_evidence_list = []
negative_log_evidence_list = []
for row in range(0,len(X_test)):
    # Sum the positive and the negative log evidence of each test instance.
    pos_log_evidence, neg_log_evidence = log_evidence_calulator(X_test.iloc[row])
    positive_log_evidence_list.append(sum(pos_log_evidence))
    negative_log_evidence_list.append(sum(neg_log_evidence))

positive_log_evidence = np.array(positive_log_evidence_list)
negative_log_evidence = np.array(negative_log_evidence_list)
# Index of the test instance whose negative evidence is largest in magnitude.
index = np.argmax(np.absolute(negative_log_evidence))
print("\n\nThe object having the largest negative evidence\n", X_test.iloc[index])
print("\n\nThe total log positive evidence:",positive_log_evidence[index]+class_prior)
print("\n\nThe total log negative evidence (magnitude):", np.absolute(negative_log_evidence[index]) )
print("\n\nThe probability distribution",prob_per_instance[index])

pos_log_evidence, neg_log_evidence = log_evidence_calulator(X_test.iloc[index])

pos_feature,neg_feature = most_positive_feature(X_test.iloc[index],pos_log_evidence,neg_log_evidence)
print("\n\nMost Positive feature",pos_feature)
print("\n\nMost Negative feature",neg_feature)




The object having the largest negative evidence
 handicapped-infants                       0
water-project-cost-sharing                1
adoption-of-the-budget-resolution         0
physician-fee-freeze                      1
el-salvador-aid                           1
religious-groups-in-schools               1
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 0
mx-missile                                0
immigration                               1
synfuels-corporation-cutback              0
education-spending                        1
superfund-right-to-sue                    1
crime                                     1
duty-free-exports                         0
export-administration-act-south-africa    0
Name: 330, dtype: int64


The total log positive evidence: 0.498016665473


The total log negative evidence (magnitude): 19.5197586064


The probability distribution [  9.99999995e-01   5.48229545e-09]


Most Positive feature []


Most Negative feature ['physician-fee-f

# The most uncertain object (the probabilities are closest to 0.5)


In [74]:
most_uncertain_prob = prob_per_instance[:,0] - prob_per_instance[:,1]
# The instance whose two class probabilities are closest to each other, and so closest to 0.5.
index = np.argmin(np.absolute(most_uncertain_prob))
most_uncertain_object = X_test.iloc[index]
print("The most uncertain object \n", most_uncertain_object)

pos_log_evidence, neg_log_evidence = log_evidence_calulator(most_uncertain_object)
total_pos_log_evidence = sum(pos_log_evidence)
total_neg_log_evidence = sum(neg_log_evidence)

# The log prior ratio is reported together with the positive evidence.
print("\n\ntotal Positive log evidence:", total_pos_log_evidence + class_prior)
print("\n\ntotal Negative log evidence:", total_neg_log_evidence)
print("\n\nProbability distribution:",prob_per_instance[index] )
pos_feature,neg_feature = most_positive_feature(most_uncertain_object,pos_log_evidence,neg_log_evidence)
print("\n\nMost Positive feature",pos_feature)
print("\n\nMost Negative feature",neg_feature)


The most uncertain object 
 handicapped-infants                       0
water-project-cost-sharing                1
adoption-of-the-budget-resolution         1
physician-fee-freeze                      0
el-salvador-aid                           1
religious-groups-in-schools               1
anti-satellite-test-ban                   0
aid-to-nicaraguan-contras                 1
mx-missile                                0
immigration                               1
synfuels-corporation-cutback              0
education-spending                        0
superfund-right-to-sue                    1
crime                                     1
duty-free-exports                         0
export-administration-act-south-africa    1
Name: 373, dtype: int64


total Positive log evidence: 8.36815023506


total Negative log evidence: -8.67739702213


Probability distribution: [ 0.5767014  0.4232986]


Most Positive feature ['education-spending', 'adoption-of-the-budget-resolution', 'physician-fee-fre
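The three probability-based selections used in this notebook (most positive, most negative, most uncertain) all reduce to an `argmax` or `argmin` over the columns of the probability matrix. A compact sketch follows, with made-up probabilities standing in for `prob_per_instance`.

```python
import numpy as np

# Made-up predicted probabilities: columns are P(C=0 | x) and P(C=1 | x).
proba = np.array([
    [0.02, 0.98],
    [0.58, 0.42],
    [0.91, 0.09],
    [0.30, 0.70],
])

most_positive = int(np.argmax(proba[:, 1]))
most_negative = int(np.argmax(proba[:, 0]))
# For two classes, |p0 - p1| is smallest exactly where both are closest to 0.5.
most_uncertain = int(np.argmin(np.abs(proba[:, 0] - proba[:, 1])))

print(most_positive, most_negative, most_uncertain)
```

The returned values are positions within the test set, so they are used with `iloc`; they are not the original row labels such as 116, 330, or 373 shown in the outputs.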