# Introduction

In this notebook, we look for anomalous results at polling booths in the 2020 (aka 2019c) elections for the Israeli Parliament (Knesset).

Such anomalies can result from human error (most likely) or fraud, or they can reflect real, drastic changes in voting patterns (least likely).

The data is as downloaded from the Central Elections Committee. It was downloaded on 05/03/2020 at 08:10 and, according to the committee's website, is up to date as of 04/03/2020 21:38. It is not final, both because counting is still in progress (71% of votes counted) and because checking the results for correctness takes time (one might hope or expect that some of the anomalies found here will be corrected by the due date, 10/03/2020).

## Method
We'll look for anomalies using the methods in [Dan Ofer's notebook](https://www.kaggle.com/danofer/israel-election-anomalies-starter) from the anomaly detection competition on the April 2019 (aka 2019a) elections.

## Input data and minor pre-processing
The data is uploaded as given by the Central Elections Committee.
1. Column names are not translated from Hebrew, because their alphabetical order matters (for example, a common human error is to type a number in the wrong column).
2. Columns that represent additional oversight measures (i.e. ברזל, סמל ועדה, ריכוז, שופט) are dropped, as I don't really know how to account for them.
3. Index columns (i.e. שם ישוב, סמל ישוב, קלפי) are just identifiers, and can be dropped (or used as the index).
4. The next columns are בזב (eligible voters), מצביעים (votes cast), פסולים (invalid votes) and כשרים (valid votes). These are metadata columns that can be combined into features (e.g. dividing מצביעים by בזב yields the voter turnout).
5. The columns after כשרים are the number of votes per party.
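
The column layout described above can be illustrated on a tiny made-up table. This is only a sketch: the party columns `A` and `B` and all numbers are invented, and the check that votes cast split into valid plus invalid votes is an assumption based on the column definitions, not something taken from the committee's file.

```python
import pandas as pd

# Toy booth table with invented numbers; 'A' and 'B' stand in for party columns.
toy = pd.DataFrame({
    'בזב': [500, 400],       # eligible voters
    'מצביעים': [300, 200],   # votes cast
    'פסולים': [5, 2],        # invalid votes
    'כשרים': [295, 198],     # valid votes
    'A': [200, 98],
    'B': [95, 100],
})

# Assumed consistency check: votes cast = valid votes + invalid votes.
assert (toy['מצביעים'] == toy['כשרים'] + toy['פסולים']).all()
# Valid votes should equal the sum over the party columns.
assert (toy['כשרים'] == toy[['A', 'B']].sum(axis=1)).all()

# Voter turnout per booth: votes cast divided by eligible voters.
toy_turnout = toy['מצביעים'] / toy['בזב']
print(toy_turnout.tolist())
```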

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest

filename = '/kaggle/input/israeli-elections-2015-2013/votes per booth 2020.csv'
encoding = 'iso_8859_8'

df_raw = pd.read_csv(filename, encoding=encoding)
df_raw.dropna(axis=1, how='all', inplace=True)
df_raw.head()

In [None]:
df = df_raw.copy()

oversight_columns = ['סמל ועדה', 'ברזל', 'שופט', 'ריכוז']
df.drop(oversight_columns, axis=1, inplace=True)

index_columns = ['שם ישוב', 'סמל ישוב', 'קלפי']
metadata_columns = ['בזב', 'מצביעים', 'פסולים', 'כשרים']

party_columns = df.columns.difference(pd.Index(index_columns+metadata_columns))
pd.testing.assert_series_equal(df['כשרים'], df[party_columns].sum(axis=1).rename('כשרים'))
#df.set_index(index_columns, inplace=True) - will set index after feature engineering
df.head()

## Feature engineering

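The features are pretty self-explanatory: voter turnout, the share of invalid votes, the number of booths in the settlement (and booths per eligible voter), the largest party's vote count and share, and each party's share of the valid votes.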
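
One detail of ratio features is worth spelling out: in pandas, dividing by a column that contains zeros yields `inf` for a nonzero numerator and `NaN` for 0/0, and replacing only `np.inf` handles the first case but not the second. A minimal sketch on invented numbers (the names `votes`, `eligible`, `only_inf` and `both` are illustrative):

```python
import numpy as np
import pandas as pd

# Invented numbers: the first two rows have zero eligible voters.
votes = pd.Series([10, 0, 30])
eligible = pd.Series([0, 0, 60])

ratio = votes / eligible  # 10/0 -> inf, 0/0 -> NaN, 30/60 -> 0.5

# Replacing only inf leaves the 0/0 row as NaN.
only_inf = ratio.replace(np.inf, -1)

# Replacing both inf and NaN flags every undefined ratio with -1.
both = ratio.replace([np.inf, -np.inf], np.nan).fillna(-1)

print(only_inf.isna().sum(), both.tolist())
```

Whether -1 is a good sentinel depends on the model; for a tree-based detector it simply places such booths apart from all valid ratios.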

In [None]:
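# Turnout; booths with zero eligible voters divide to inf and are flagged with -1
df['turnout'] = (df['מצביעים'] / df['בזב']).replace(np.inf, -1)
df['percent_invalid'] = df['פסולים'] / df['מצביעים']

# Number of booths in the booth's settlement, and booths per eligible voter
df["total_voting_booths"] = df.groupby(["סמל ישוב"])["קלפי"].transform("size")
df["booth_per_capita"] = df["total_voting_booths"].div(df["בזב"]).replace(np.inf, -1)

# Largest single-party vote count and its share of the valid votes
df["max_party_vote"] = df[party_columns].max(axis=1)
df["max_party_ratio"] = df['max_party_vote'].div(df['כשרים'], axis=0)

# Convert party vote counts to shares of the valid votes
df[party_columns] = df[party_columns].div(df["כשרים"], axis=0)

df.set_index(index_columns, inplace=True)

# Sanity check: no NaN or inf values remain
assert not df.replace([np.inf, -np.inf], np.nan).isna().any().any()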
df.head()

In [None]:
# Overall vote share of each party: booth shares weighted by the booth's valid votes
party_votes = df[party_columns].mul(df['כשרים'], axis=0).sum() / df['כשרים'].sum()
party_votes = party_votes.sort_values(ascending=False)
threshold = 0.001
major_parties = party_votes[party_votes >= threshold].index
minor_parties = party_votes[party_votes < threshold].index
print('the major parties (those with at least {:.1%} of the votes) are'.format(threshold))
print(major_parties.to_list())
# Drop parties below the threshold from the feature set
df.drop(minor_parties, inplace=True, axis=1)

## Model

After engineering our features and making sure there are no NaN or inf values in our dataset, we can fit our Isolation Forest model for anomaly detection (a lower score means a more anomalous booth).

To learn more about [Isolation Forest](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.IsolationForest.html) for anomaly detection, visit [Dan Ofer's Notebook](https://www.kaggle.com/danofer/anomaly-detection-for-feature-engineering-v2) on the [Credit Card Fraud Detection dataset](https://www.kaggle.com/mlg-ulb/creditcardfraud).
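
To make the "lower score = more anomalous" convention concrete, here is a minimal sketch on synthetic data (not the election data): 200 invented points clustered near the origin plus one far-away point. The names `X`, `toy_model`, `scores` and the fixed `random_state` are assumptions made for this illustration only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# 200 invented "ordinary" points near the origin, plus one far-away point (row 200).
X = np.vstack([rng.normal(0, 1, size=(200, 2)), [[8.0, 8.0]]])

toy_model = IsolationForest(n_estimators=100, random_state=0).fit(X)
# decision_function returns lower (more negative) values for more anomalous samples.
scores = toy_model.decision_function(X)

# Index of the sample with the lowest score, i.e. the most anomalous one.
most_anomalous = int(np.argmin(scores))
print(most_anomalous, scores[most_anomalous])
```

The far-away point is isolated after very few random splits, which is why it ends up with the lowest score.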



In [None]:
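model = IsolationForest(n_estimators=140, max_samples=600).fit(df)
# Lower decision_function scores are more anomalous; sorting in descending order puts the most anomalous booths last
predictions = pd.Series(index=df.index, data=model.decision_function(df), name='anomaly score').sort_values(ascending=False)
predictions.to_frame()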

In [None]:
# The 10 most anomalous booths (lowest scores)
predictions.tail(10)

# Conclusions
This notebook lets us flag suspicious booths, which then have to be checked manually. For example, the second most anomalous booth is in the [Arab al-Na'im (ערב אל נעים)](https://en.wikipedia.org/wiki/Arab_al-Na%27im) village in the Galilee, an unrecognized Bedouin village, with surprisingly strong support for the right-wing [Likud](https://en.wikipedia.org/wiki/Likud) party, currently in power. Again, this might reflect a genuine shift in support, or something else (see [this article](https://www.haaretz.co.il/news/elections/.premium-1.2595935) on the Haaretz newspaper's website).
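
As a possible starting point for such a manual check, one could list the features on which a flagged booth differs most from the typical booth. The sketch below uses a hypothetical helper, `top_deviations`, and an invented toy table; it is not part of the analysis above, and comparing against column medians is just one simple choice.

```python
import pandas as pd

def top_deviations(features: pd.DataFrame, booth, n: int = 3) -> pd.Series:
    """Hypothetical helper: the n features of `booth` farthest from the column medians."""
    diff = features.loc[booth] - features.median()
    order = diff.abs().sort_values(ascending=False).index[:n]
    return diff[order]

# Invented toy features; 'booth 3' has an unusually high turnout.
toy_features = pd.DataFrame(
    {'turnout': [0.60, 0.62, 0.99],
     'share_A': [0.30, 0.32, 0.31],
     'share_B': [0.70, 0.68, 0.69]},
    index=['booth 1', 'booth 2', 'booth 3'],
)

result = top_deviations(toy_features, 'booth 3', n=1)
print(result)
```

With the notebook's multi-level index, `booth` would be the tuple of index values identifying the booth.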

In [None]:
# Saving the results for further use

df_raw.set_index(index_columns).join(predictions).to_csv('anomalous_booths_2020_2.csv', encoding=encoding)