<font color='blue' size=6> 2004 Lok Sabha Candidate Winner Prediction</font>
<hr/>
<font size=2> - Durgesh Samariya | The ML PhD Student </font>



<a id="top"></a>

<div class="list-group" id="list-tab" role="tablist">
<h1 class="list-group-item list-group-item-action active" data-toggle="list" style='background:MediumSeaGreen; border:0' role="tab" aria-controls="home"><center>Table of Contents</center></h1>

- [1. Introduction](#introduction)
- [2. Import Required Libraries](#library)

<a id='introduction'></a>
# Introduction
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

<a id='library'></a>
# Import Required Libraries
<a href="#top" class="btn btn-primary btn-sm" role="button" aria-pressed="true" style="color:white" data-toggle="popover">Go to TOC</a>

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# wordcloud
from wordcloud import WordCloud, STOPWORDS

# machine learning
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report
from sklearn import preprocessing

%matplotlib inline

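# 'seaborn-dark-palette' was renamed to 'seaborn-v0_8-dark-palette' in matplotlib 3.6;
# use whichever name this matplotlib version provides.
style_name = 'seaborn-v0_8-dark-palette' if 'seaborn-v0_8-dark-palette' in plt.style.available else 'seaborn-dark-palette'
plt.style.use(style_name)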

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Load Dataset

Let's load the `LokSabha2004.csv` file.

In [None]:
%time data = pd.read_csv('../input/lok-sabha-election-candidate-list-2004-to-2019/LokSabha2004.csv')

In [None]:
missing = (data.isnull().sum())

In [None]:
type(missing[missing>0])

# Data exploration

## Check the data

Let's take a quick look at the dataset.

In [None]:
data

## Data contains:
- **Candidate** - Name of the candidate.
- **Party** - Political party.
- **Criminal Cases** - Number of criminal cases against the candidate.
- **Education** - Education level of the candidate.
- **Age** - Age of the candidate.
- **Total Assets** - Total assets of the candidate.
- **Constituency** - Name of the constituency in which the candidate stood for election.
- **Liabilities** - Total liabilities of the candidate.
- **Winner** - Did the candidate win the election that year? (0 - No, 1 - Yes)
- **Gender** - Gender of the candidate. (M - Male, F - Female)

Let's check the summary statistics of the data.

In [None]:
data.describe()

We can make a few observations here:
- The minimum of the `Age` feature is `0`, which cannot be a real age, so some entries are probably wrong. I will decide how to handle these candidates in the data analysis part.
- The number of criminal cases ranges from 0 to 36. This feature will probably help in prediction.

Let's check the column types and non-null counts of the dataset.
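One way to quantify the zero-age observation is to count the affected rows and treat them as missing. The sketch below runs on a small made-up frame rather than the real data; the column name `Age` follows the feature list above, and the toy values are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (values are illustrative only).
toy = pd.DataFrame({"Age": [45, 0, 62, 0, 38]})

# Count candidates whose recorded age is zero (an impossible value).
zero_age_count = int((toy["Age"] == 0).sum())

# Treat zero as missing so it cannot distort summary statistics.
age_clean = toy["Age"].replace(0, np.nan)

print(zero_age_count)
print(age_clean.mean())
```

On the real dataset the same two lines could be applied to `data['Age']`; whether to drop, impute, or keep those rows remains a modelling decision.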

In [None]:
data.info()

Let's check the shape of the dataset.

In [None]:
data.shape

The dataset has 3642 samples and 10 columns, including the `Winner` target.

Let's check whether the data has any missing values.

In [None]:
def missing_data(data):
    """Summarise missing values per column: count, percentage and dtype (one column per feature)."""
    total = data.isnull().sum()
    percent = total / len(data) * 100
    summary = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
    summary['Types'] = [str(data[col].dtype) for col in data.columns]
    return summary.T

In [None]:
%%time
missing_data(data)

There are 260 missing/NaN values in `Total Assets`. I will decide how to handle them after the data analysis.

# Exploring important features

<font color='slateblue' size=+2.5> The Class Variable: Winner
    </font>

In [None]:
data['Winner'].value_counts()

In [None]:

plt.figure(figsize=(10,6))
# Pass the column as a keyword: recent seaborn versions no longer accept it positionally.
sns.countplot(x='Winner', data=data, palette='Set3')
plt.xticks(rotation=90)
plt.title('Winner Count',fontsize=20)
plt.ylabel('Count',fontsize=16)
plt.xlabel('Winner?',fontsize=16)

In 2004, a total of 3642 candidates contested the election; 388 of them won and 3254 lost.
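These counts imply a strongly imbalanced target, which matters when reading accuracy scores later on. As a small worked calculation using only the counts quoted above, the snippet below derives the share of winners and the accuracy of a trivial model that always predicts "did not win".

```python
# Counts taken from the class distribution quoted above.
n_winners = 388
n_losers = 3254
n_total = n_winners + n_losers

# Share of candidates who won.
winner_share = n_winners / n_total

# Accuracy of a trivial model that always predicts "did not win".
baseline_accuracy = n_losers / n_total

print(f"Winner share: {winner_share:.1%}")
print(f"Majority-class baseline accuracy: {baseline_accuracy:.1%}")
```

Any classifier trained on this data should be compared against that baseline rather than against 50%.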

## Name wordcloud

In [None]:
def build_wordcloud(df, title):
    wordcloud = WordCloud(
        background_color='black', 
        stopwords=set(STOPWORDS), 
        max_words=100, 
        max_font_size=40, 
        random_state=666
    ).generate(' '.join(df.astype(str)))  # join every value; str(df) would only use the truncated Series repr

    fig = plt.figure(1, figsize=(15,15))
    plt.axis('off')
    fig.suptitle(title, fontsize=16)
    fig.subplots_adjust(top=2.3)

    plt.imshow(wordcloud)
    plt.show()

In [None]:
build_wordcloud(data['Candidate'], 'Prevalent words in candidate names (full dataset)')

## Party Analysis

In [None]:
temp = data['Party'].value_counts().head(20)

In [None]:
plt.figure(figsize=(10,6))
sns.barplot(x=temp.index, y=temp.values, palette='Set3')
plt.xticks(rotation=90)
plt.title('Number of Seats Contested by PARTIES (TOP 20)',fontsize=20)
plt.ylabel('Number of Seats',fontsize=16)
plt.xlabel('Political Parties',fontsize=16)

In [None]:
data = data.fillna(0)

In [None]:
data.drop(['Candidate'], axis=1, inplace=True)

In [None]:
data

# Feature Mapping

In [None]:
data["Party"] = data["Party"].astype("category")
data = pd.get_dummies(data, columns = ["Party"],prefix="Party")

In [None]:
data["Education"] = data["Education"].astype("category")
data = pd.get_dummies(data, columns = ["Education"],prefix="Education")

In [None]:
data["Constituency"] = data["Constituency"].astype("category")
data = pd.get_dummies(data, columns = ["Constituency"],prefix="Constituency")

In [None]:
data['Gender'] = data['Gender'].map({'M':1, 'F':0})
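The three one-hot encoding cells above could also be written as a single `pd.get_dummies` call, since `columns` accepts a list and each column name is used as the prefix by default. The sketch below shows this on a tiny made-up frame; the row values are illustrative and not taken from the dataset.

```python
import pandas as pd

# Toy frame with the same categorical columns (values are made up).
toy = pd.DataFrame({
    "Party": ["INC", "BJP", "INC"],
    "Education": ["Graduate", "12th Pass", "Graduate"],
    "Constituency": ["A", "B", "C"],
    "Gender": ["M", "F", "M"],
})

# One-hot encode all three categorical columns in one call;
# each column name becomes the prefix of its dummy columns.
encoded = pd.get_dummies(toy, columns=["Party", "Education", "Constituency"])

# Binary-encode gender exactly as in the cell above.
encoded["Gender"] = encoded["Gender"].map({"M": 1, "F": 0})

print(encoded.columns.tolist())
```

Note that encoding `Constituency` this way creates one column per constituency, so the feature matrix becomes very wide relative to the number of rows.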

# Model Prediction

In [None]:
X = data.copy().drop('Winner', axis=1)
y = data['Winner']

In [None]:
X_scaled = preprocessing.scale(X)

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, y, test_size=0.3)
X_train.shape, Y_train.shape, X_test.shape
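The cells above scale the full feature matrix before splitting and split without a fixed seed. A common alternative, sketched here on synthetic data, is to split first with `stratify` and `random_state`, then fit the scaler on the training part only so that test-set statistics do not leak into training. The toy arrays and the seed value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced toy data: 10 positives out of 100 samples.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 10 + [0] * 90)

# Stratify so both parts keep the same class ratio; fix the seed for reproducibility.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=42
)

# Fit the scaler on the training part only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.shape, X_te_scaled.shape, int(y_te.sum()))
```

With only about 10% winners in the real data, stratification helps ensure the test set contains a representative number of them.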

## KNN Classifier

In [None]:
# k-nearest neighbor
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
knn_Y_pred = knn.predict(X_test)
knn_accuracy = knn.score(X_test, Y_test)
knn_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, knn_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, knn_Y_pred))
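The confusion-matrix annotation code is repeated for every model in this notebook. As a sketch of how it could be factored out, the helper below (a hypothetical `confusion_labels` function, not part of any library) returns the matrix together with the 2x2 label array; the toy labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix


def confusion_labels(y_true, y_pred):
    """Return the 2x2 confusion matrix and matching 'name + count' annotations."""
    # Fix the label order so the layout is always [[TN, FP], [FN, TP]].
    conf_mat = confusion_matrix(y_true, y_pred, labels=[0, 1])
    names = ["True Neg", "False Pos", "False Neg", "True Pos"]
    labels = [f"{name}\n{count}" for name, count in zip(names, conf_mat.flatten())]
    return conf_mat, np.asarray(labels).reshape(2, 2)


# Toy example: 4 true losers, 2 true winners.
y_true_toy = [0, 0, 0, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 0, 0]
conf_mat_toy, labels_toy = confusion_labels(y_true_toy, y_pred_toy)
print(conf_mat_toy)
```

The two return values can be passed to `sns.heatmap(conf_mat, annot=labels, fmt='')` in the same way as in the plotting cells.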

## Decision Tree Classifier

In [None]:
# Decision Tree

decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
decision_tree_Y_pred = decision_tree.predict(X_test)
decision_tree_accuracy = decision_tree.score(X_test, Y_test)
decision_tree_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, decision_tree_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, decision_tree_Y_pred))

## SVM Classifier

In [None]:
# Support Vector Machine
svc = SVC()
svc.fit(X_train, Y_train)
svm_Y_pred = svc.predict(X_test)
svc_accuracy = svc.score(X_test, Y_test)
svc_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, svm_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, svm_Y_pred))

## Random Forest Classifier

In [None]:
# Random Forest

random_forest = RandomForestClassifier(n_estimators=1000)
random_forest.fit(X_train, Y_train)
random_forest_Y_pred = random_forest.predict(X_test)
random_forest_train_accuracy = random_forest.score(X_train, Y_train)  # training accuracy, kept for comparison with the test score
random_forest_accuracy = random_forest.score(X_test, Y_test)
random_forest_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, random_forest_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:

# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, random_forest_Y_pred))

## Gaussian Naive Bayes Classifier

In [None]:
# Gaussian Naive Bayes

gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
gaussian_Y_pred = gaussian.predict(X_test)
gaussian_accuracy = gaussian.score(X_test, Y_test)
gaussian_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, gaussian_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, gaussian_Y_pred))

## Stochastic Gradient Descent Classifier

In [None]:
# Stochastic Gradient Descent

sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
sgd_Y_pred = sgd.predict(X_test)
sgd_accuracy = sgd.score(X_test, Y_test)
sgd_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, sgd_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, sgd_Y_pred))

## Linear SVM Classifier

In [None]:
# Linear SVC

linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
linear_svc_Y_pred = linear_svc.predict(X_test)
linear_svc_accuracy = linear_svc.score(X_test, Y_test)
linear_svc_accuracy

In [None]:
# creating confusion matrix heatmap

conf_mat = confusion_matrix(Y_test, linear_svc_Y_pred)
fig = plt.figure(figsize=(10,7))
group_names = ['True Neg','False Pos','False Neg','True Pos']
group_counts = ["{0:0.0f}".format(value) for value in
                conf_mat.flatten()]
labels = [f"{v1}\n{v2}" for v1, v2 in zip(group_names,group_counts)]
labels = np.asarray(labels).reshape(2,2)
sns.heatmap(conf_mat, annot=labels, annot_kws={"size": 16}, fmt='')
plt.title("Confusion Matrix")
plt.xlabel("Predicted Label")
plt.ylabel("True Label")
plt.show()

In [None]:
# getting precision, recall and f1-score via classification report

print(classification_report(Y_test, linear_svc_Y_pred))

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Linear SVC', 'Decision Tree','Random Forest', 'Stochastic Gradient Descent', 'Gaussian Naive Bayes'],
    'Score': [svc_accuracy, knn_accuracy, linear_svc_accuracy, decision_tree_accuracy, random_forest_accuracy, sgd_accuracy, gaussian_accuracy]})
models.sort_values(by='Score', ascending=False)
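Because the target is imbalanced, ranking models by accuracy alone can be misleading. The sketch below shows one possible way to build the comparison table in a loop and include the F1 score of the positive (winner) class next to accuracy. It runs on synthetic data from `make_classification` and uses only two of the models, so the numbers it prints say nothing about the election dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 10% positives), standing in for the real features.
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, stratify=y_syn, random_state=0
)

# Any of the classifiers used above could be added to this dict.
candidates = {
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
}

rows = []
for name, model in candidates.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rows.append({
        "Model": name,
        "Accuracy": accuracy_score(y_te, pred),
        # F1 of the positive class; 0 if the model never predicts a winner.
        "F1 (winner)": f1_score(y_te, pred, pos_label=1, zero_division=0),
    })

comparison = pd.DataFrame(rows).sort_values(by="F1 (winner)", ascending=False)
print(comparison)
```

Sorting by the winner-class F1 puts models that actually identify winners ahead of models that score well only by predicting the majority class.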