# Congressional Voting Classification

#### Objective
The main objective is to predict whether congressmen is Democrat or Republican based on voting patterns by using the decision tree with the adaboost.

#### Adaboost
AdaBoost is an ensemble learning method (also known as “meta-learning”) which was initially created to increase the efficiency of binary classifiers. AdaBoost uses an iterative approach to learn from the mistakes of weak classifiers, and turn them into strong ones.


#### Data Set
This data set includes votes for each of the U.S. House of Representatives Congressmen on the 16 key votes identified by the CQA. The CQA lists nine different types of votes: voted for, paired for, and announced for (these three simplified to yea), voted against, paired against, and announced against (these three simplified to nay), voted present, voted present to avoid conflict of interest, and did not vote or otherwise make a position known (these three simplified to an unknown disposition).


## Attribute Information:
1. Class Name: 2 (democrat, republican)
2. handicapped-infants: 2 (y,n)
3. water-project-cost-sharing: 2 (y,n)
4. adoption-of-the-budget-resolution: 2 (y,n)
5. physician-fee-freeze: 2 (y,n)
6. el-salvador-aid: 2 (y,n)
7. religious-groups-in-schools: 2 (y,n)
8. anti-satellite-test-ban: 2 (y,n)
9. aid-to-nicaraguan-contras: 2 (y,n)
10. mx-missile: 2 (y,n)
11. immigration: 2 (y,n)
12. synfuels-corporation-cutback: 2 (y,n)
13. education-spending: 2 (y,n)
14. superfund-right-to-sue: 2 (y,n)
15. crime: 2 (y,n)
16. duty-free-exports: 2 (y,n)
17. export-administration-act-south-africa: 2 (y,n)



#### Source
The dataset can be obtained from the:
https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records

#### Tasks:
1.	Obtained the dataset
2.	Apply pre-processing operations
3.	Train Adaboost model from scratch and test the model
4.	Train Adaboost model using sklearn
6.	Compare the performance of Adaboost, Random Forest and Decision Trees


## Part 1: Adaboost from Scratch

In [96]:
# Load the libraries
import pandas as pd
import numpy as np
import random
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support
from sklearn.ensemble import RandomForestClassifier

In [37]:
# Load the dataset 
data = pd.read_csv('house-votes.data')
data.head(20)

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,?
1,democrat,?,y,y,?,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,?,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,?,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
5,democrat,n,y,n,y,y,y,n,n,n,n,n,n,?,y,y,y
6,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,?,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
8,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,?,?
9,republican,n,y,n,y,y,n,n,n,n,n,?,?,y,y,n,n


In [38]:
# Preprocessing
# Encoding categorical variables (if any)
# Feature Scaling
# Filling missing values (if any)
ran = ['y','n']
data = data.replace(to_replace ="?",value = ran[random.randint(0,1)])

In [39]:
data.head(20)

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,n
1,democrat,n,y,y,n,y,y,n,n,n,n,y,n,y,y,n,n
2,democrat,n,y,y,n,n,y,n,n,n,n,y,n,y,n,n,y
3,democrat,y,y,y,n,y,y,n,n,n,n,y,n,y,y,y,y
4,democrat,n,y,y,n,y,y,n,n,n,n,n,n,y,y,y,y
5,democrat,n,y,n,y,y,y,n,n,n,n,n,n,n,y,y,y
6,republican,n,y,n,y,y,y,n,n,n,n,n,n,y,y,n,y
7,republican,n,y,n,y,y,y,n,n,n,n,n,y,y,y,n,y
8,democrat,y,y,y,n,n,n,y,y,y,n,n,n,n,n,n,n
9,republican,n,y,n,y,y,n,n,n,n,n,n,n,y,y,n,n


In [40]:
data[data.columns[:1]] = pd.DataFrame(np.where(data[data.columns[:1]].values=='republican',0 ,1), data.index)
data[data.columns[1:]] = pd.DataFrame(np.where(data[data.columns[1:]].values=='n',0 ,1), data.index)

In [41]:
data.head()

Unnamed: 0,republican,n,y,n.1,y.1,y.2,y.3,n.2,n.3,n.4,y.4,?,y.5,y.6,y.7,n.5,y.8
0,0,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,0
1,1,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
2,1,0,1,1,0,0,1,0,0,0,0,1,0,1,0,0,1
3,1,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1
4,1,0,1,1,0,1,1,0,0,0,0,0,0,1,1,1,1


In [42]:
# Divide the dataset to training and testing set
X = data.iloc[:,1:].values
y = data.iloc[:,:1].values
print(X.shape, y.shape)

(434, 16) (434, 1)


In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(347, 16) (347, 1)
(87, 16) (87, 1)


In [126]:
# Implement Adaboost model from scratch
# Adaboost consist of stumps which can be created using builtin decision trees in sklearn
# Stump can be trained by keeping the max_depth as 1
def Flag(i):
    if(i):
        return 1
    else:
        return 0

def Sign(x):
    if(x!=0):
        return abs(x)/x
    else:
        return 1      

class AdaBoostClassifier:
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def __init__(self, n_estimators = 50):
        self.n_estimators = n_estimators
        self.models = [None]*n_estimators
        
    def fit(self, X_train, y_train):
        X_train = np.float64(X_train)
        y_len = len(y_train)
        li = 1/y_len
        l = [li for i in range(y_len)]
        n = np.array(l)
        e = self.n_estimators
        for j in range(e):
            DT = DecisionTreeClassifier(max_depth = 1)\
                        .fit(X_train, y_train, sample_weight = n).predict
            k = sum([n[i]*Flag(y_train[i]!=DT(X_train[i].reshape(1,-1)))\
                        for i in range(y_len)])/sum(n)
            A = np.log((1-k)/k)
            n = [n[i]*np.exp(A*Flag(y_train[i]!=DT(X_train[i].reshape(1,-1))))\
                        for i in range(y_len)] 
            self.models[j] = (A, DT)

    def predict(self, X_train):
        y = 0
        for j in range(e):
            A, DT = self.models[j]
            y = y + A*DT(X_train)
        s = np.vectorize(Sign)
        y = np.where(s(y)==-1,-1,1)
        return y


In [127]:
# Train the model and test the model
model = AdaBoost(n_estimators=150)
model.fit(X_train,y_train)

In [128]:
# Evaluate the results using accuracy, precision, recall and f-measure
pred = model.predict(X_test)
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, pred)
# Precision, Recall, F-Score
print("Precision: ",precision)
print("Recall: ",recall)
print("F-Score: ",fscore)

Precision:  [0.         0.63218391]
Recall:  [0. 1.]
F-Score:  [0.         0.77464789]


  'precision', 'predicted', average, warn_for)


In [129]:
# Accuracy
print("Accuracy: ",(np.sum(y_test==pred)/len(y_test)))

Accuracy:  55.0


## Part 2: Adaboost using Sklearn

In [49]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

  from numpy.core.umath_tests import inner1d


In [74]:
# Use the preprocessed dataset here
X = data.iloc[:,1:].values
y = data.iloc[:,:1].values
print(X.shape, y.shape)

(434, 16) (434, 1)


In [83]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

(347, 16) (347, 1)
(87, 16) (87, 1)


In [84]:
# Train the Adaboost Model using builtin Sklearn Dataset
model = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=150)
model.fit(X_train, y_train)

  y = column_or_1d(y, warn=True)


AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
          learning_rate=1.0, n_estimators=150, random_state=None)

In [85]:
model.score(X_test, y_test)

0.9310344827586207

In [86]:
from sklearn.metrics import precision_recall_fscore_support

In [87]:
# Test the model with testing set and print the accuracy, precision, recall and f-measure
pred = model.predict(X_test)
precision, recall, fscore, _ = precision_recall_fscore_support(y_test, pred)

In [88]:
print("Precision: ",precision)
print("Recall: ",recall)
print("F Score: ",fscore)

Precision:  [0.88235294 0.96226415]
Recall:  [0.9375     0.92727273]
F Score:  [0.90909091 0.94444444]


In [89]:
# Play with parameters such as
# number of decision trees
# Criterion for splitting
# Max depth
# Minimum samples per split and leaf
model1 = AdaBoostClassifier(DecisionTreeClassifier(),n_estimators=200)
model1.fit(X_train, y_train)
model1.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.9425287356321839

In [90]:
model2 = AdaBoostClassifier(DecisionTreeClassifier(criterion='entropy'),n_estimators=40)
model2.fit(X_train, y_train)
model2.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.9310344827586207

In [91]:
model3 = AdaBoostClassifier(DecisionTreeClassifier(criterion='gini'),n_estimators=150)
model3.fit(X_train, y_train)
model3.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.9425287356321839

In [92]:
model4 = AdaBoostClassifier(DecisionTreeClassifier(criterion='gini',max_depth=2),n_estimators=150)
model4.fit(X_train, y_train)
model4.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.9195402298850575

In [93]:
model5 = AdaBoostClassifier(DecisionTreeClassifier(min_samples_split=5,min_samples_leaf=2),n_estimators=150)
model5.fit(X_train, y_train)
model5.score(X_test, y_test)

  y = column_or_1d(y, warn=True)


0.9080459770114943

## Part 3: Compare the models

In [97]:
# Train Adaboost, Random Forest and Decision tree models from sklearn
clf1 = AdaBoostClassifier(DecisionTreeClassifier(criterion='gini'),n_estimators=150)
clf2 = RandomForestClassifier(criterion='gini')
clf3 = DecisionTreeClassifier()

In [98]:
clf1.fit(X_train,y_train)
clf2.fit(X_train,y_train)
clf3.fit(X_train,y_train)

  y = column_or_1d(y, warn=True)
  


DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [101]:
# Run the model on testing set
acc1 = clf1.score(X_test,y_test)
acc2 = clf2.score(X_test,y_test)
acc3 = clf3.score(X_test,y_test)
print("Adaboost Accuracy: ",acc1)
print("Random Forest Accuracy: ",acc2)
print("Decision Tree Accuracy: ",acc3)

Adaboost Accuracy:  0.9310344827586207
Random Forest Accuracy:  0.9310344827586207
Decision Tree Accuracy:  0.9195402298850575


In [103]:
# Compare their accuracy, precision, recall and f-measure
pred1 = clf1.predict(X_test)
pred2 = clf2.predict(X_test)
pred3 = clf3.predict(X_test)
precision1, recall1, fscore1, _ = precision_recall_fscore_support(y_test, pred1)
precision2, recall2, fscore2, _ = precision_recall_fscore_support(y_test, pred2)
precision3, recall3, fscore3, _ = precision_recall_fscore_support(y_test, pred3)

In [112]:
Model = ['Adaboost','Random Forest','Decision Tree']
for i in range(3):
    print(Model[i]+" Precision: "+str(vars()['precision' + str(i+1)]))
    print(Model[i]+" Recall: "+str(vars()['recall' + str(i+1)]))
    print(Model[i]+" F-Score: "+str(vars()['fscore' + str(i+1)]))
    print()

Adaboost Precision: [0.88235294 0.96226415]
Adaboost Recall: [0.9375     0.92727273]
Adaboost F-Score: [0.90909091 0.94444444]

Random Forest Precision: [0.88235294 0.96226415]
Random Forest Recall: [0.9375     0.92727273]
Random Forest F-Score: [0.90909091 0.94444444]

Decision Tree Precision: [0.85714286 0.96153846]
Decision Tree Recall: [0.9375     0.90909091]
Decision Tree F-Score: [0.89552239 0.93457944]

