Real-time Election Results: Portugal 2019 Data Set

https://archive.ics.uci.edu/ml/datasets/Real-time+Election+Results%3A+Portugal+2019

Data Set Information:
A data set describing the evolution of results in the Portuguese Parliamentary Elections of October 6th 2019.
The data spans a time interval of 4 hours and 25 minutes, in intervals of 5 minutes, concerning the results of the 27 parties involved in the electoral event.
The data set is tailored for predictive modelling tasks, mostly focused on numerical forecasting tasks.
Regardless, it allows for other tasks such as ordinal regression or learn-to-rank.

Additional (and updated) information may be found in [Web Link] :
- Raw data sets
- R code to build the final data set
- Basic operations to build predictive modelling tasks using this data set


In [20]:
import warnings
warnings.filterwarnings("ignore")

In [50]:
import numpy as np

In [21]:
import matplotlib.pyplot as plt

In [22]:
import seaborn as sns

In [23]:
import pandas as pd

In [24]:
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [25]:
df=pd.read_csv("ELECTION DATA.txt",sep=",")

In [26]:
df['time'] = pd.to_datetime(df['time'])
df.head()

Unnamed: 0,TimeElapsed,time,territoryName,totalMandates,availableMandates,numParishes,numParishesApproved,blankVotes,blankVotesPercentage,nullVotes,nullVotesPercentage,votersPercentage,subscribedVoters,totalVoters,pre.blankVotes,pre.blankVotesPercentage,pre.nullVotes,pre.nullVotesPercentage,pre.votersPercentage,pre.subscribedVoters,pre.totalVoters,Party,Mandates,Percentage,validVotesPercentage,Votes,Hondt,FinalMandates
0,0,2019-10-06 20:10:02,Território Nacional,0,226,3092,1081,9652,2.5,8874,2.3,51.36,752529,386497,8317,1.94,8171,1.91,52.66,813743,428546,PS,0.0,38.29,40.22,147993.0,94.0,106.0
1,0,2019-10-06 20:10:02,Território Nacional,0,226,3092,1081,9652,2.5,8874,2.3,51.36,752529,386497,8317,1.94,8171,1.91,52.66,813743,428546,PPD/PSD,0.0,33.28,34.95,128624.0,81.0,77.0
2,0,2019-10-06 20:10:02,Território Nacional,0,226,3092,1081,9652,2.5,8874,2.3,51.36,752529,386497,8317,1.94,8171,1.91,52.66,813743,428546,B.E.,0.0,6.81,7.15,26307.0,16.0,19.0
3,0,2019-10-06 20:10:02,Território Nacional,0,226,3092,1081,9652,2.5,8874,2.3,51.36,752529,386497,8317,1.94,8171,1.91,52.66,813743,428546,CDS-PP,0.0,4.9,5.14,18923.0,12.0,5.0
4,0,2019-10-06 20:10:02,Território Nacional,0,226,3092,1081,9652,2.5,8874,2.3,51.36,752529,386497,8317,1.94,8171,1.91,52.66,813743,428546,PCP-PEV,0.0,4.59,4.83,17757.0,11.0,12.0
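
The Hondt column presumably holds a seat count computed with the D'Hondt highest-averages method. That reading is an assumption based on the column name; the data set description above does not define it. A minimal sketch of the method on made-up vote counts, where `dhondt` is a hypothetical helper and not part of the data set's own code:

```python
def dhondt(votes, seats):
    """Allocate seats with the D'Hondt highest-averages rule.

    votes: dict mapping party name -> vote count (illustrative values only)
    seats: total number of seats to distribute
    """
    alloc = {party: 0 for party in votes}
    for _ in range(seats):
        # The next seat goes to the party with the largest quotient
        # votes / (seats already won + 1).
        winner = max(votes, key=lambda p: votes[p] / (alloc[p] + 1))
        alloc[winner] += 1
    return alloc


toy_votes = {"A": 100, "B": 80, "C": 30}  # made-up numbers, not election data
toy_seats = dhondt(toy_votes, seats=8)
print(toy_seats)
```

Ties between equal quotients are resolved here by dictionary order, which is a simplification; real electoral rules specify their own tie-breaks.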


In [27]:
def missing_values_table(df):
    """Summarize zero counts, missing counts, cardinality and dtype per column."""
    zero_val = (df == 0.00).astype(int).sum(axis=0)
    mis_val = df.isnull().sum()
    mis_val_percent = 100 * df.isnull().sum() / len(df)

    mz_table = pd.concat([zero_val, mis_val, mis_val_percent], axis=1)
    mz_table = mz_table.rename(columns={0: 'Zero Values', 1: 'Missing Values', 2: '% of Total Missing Values'})
    mz_table['Attributes'] = df.nunique()  # number of distinct values per column
    mz_table['Data Type'] = df.dtypes

    mz_table = mz_table.sort_values('Data Type', ascending=False).round(1)

    # Note: mz_table.shape[0] is the total number of columns in the table,
    # not only the columns that actually contain missing values.
    print("Your selected dataframe has " + str(df.shape[1]) + " columns and " + str(df.shape[0]) + " Rows.\n"
          "There are " + str(mz_table.shape[0]) + " columns that have missing values.")

    return mz_table

missing_values_table(df)

Your selected dataframe has 28 columns and 11016 Rows.
There are 28 columns that have missing values.


Unnamed: 0,Zero Values,Missing Values,% of Total Missing Values,Attributes,Data Type
territoryName,0,0,0.0,21,object
Party,0,0,0.0,21,object
FinalMandates,9063,1,0.0,17,float64
pre.nullVotesPercentage,0,0,0.0,88,float64
Hondt,9072,1,0.0,37,float64
Votes,0,1,0.0,2972,float64
validVotesPercentage,0,1,0.0,1208,float64
Percentage,0,1,0.0,1186,float64
Mandates,10397,1,0.0,33,float64
blankVotesPercentage,0,0,0.0,134,float64
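
The summary message above reports the number of rows in the table, which is every column of the DataFrame. As a minimal sketch, assuming the goal is to count only the columns that contain at least one missing value, the check could look like this on a small made-up frame (`toy` and `n_cols_with_missing` are illustrative names):

```python
import numpy as np
import pandas as pd

# Made-up frame: columns "a" and "c" each contain one missing value, "b" has none.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [0.0, 0.0, 5.0],
    "c": ["x", "y", None],
})

missing_per_column = toy.isnull().sum()
# Count the columns whose missing-value count is greater than zero.
n_cols_with_missing = int((missing_per_column > 0).sum())
print(n_cols_with_missing, "of", toy.shape[1], "columns have missing values")
```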


In [28]:
df = df.dropna()

In [29]:
df.isnull().sum()

TimeElapsed                 0
time                        0
territoryName               0
totalMandates               0
availableMandates           0
numParishes                 0
numParishesApproved         0
blankVotes                  0
blankVotesPercentage        0
nullVotes                   0
nullVotesPercentage         0
votersPercentage            0
subscribedVoters            0
totalVoters                 0
pre.blankVotes              0
pre.blankVotesPercentage    0
pre.nullVotes               0
pre.nullVotesPercentage     0
pre.votersPercentage        0
pre.subscribedVoters        0
pre.totalVoters             0
Party                       0
Mandates                    0
Percentage                  0
validVotesPercentage        0
Votes                       0
Hondt                       0
FinalMandates               0
dtype: int64

In [30]:
df.shape

(11015, 28)

In [15]:
df.columns

Index(['TimeElapsed', 'time', 'territoryName', 'totalMandates',
       'availableMandates', 'numParishes', 'numParishesApproved', 'blankVotes',
       'blankVotesPercentage', 'nullVotes', 'nullVotesPercentage',
       'votersPercentage', 'subscribedVoters', 'totalVoters', 'pre.blankVotes',
       'pre.blankVotesPercentage', 'pre.nullVotes', 'pre.nullVotesPercentage',
       'pre.votersPercentage', 'pre.subscribedVoters', 'pre.totalVoters',
       'Party', 'Mandates', 'Percentage', 'validVotesPercentage', 'Votes',
       'Hondt', 'FinalMandates'],
      dtype='object')

# Categorical Attributes

# 1. Label Encoding

In [39]:
from sklearn.preprocessing import LabelEncoder

In [40]:
le = LabelEncoder()

In [41]:
df['territoryName'] = le.fit_transform(df['territoryName'])
df['Party'] = le.fit_transform(df['Party'])
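
Reusing one `LabelEncoder` for both columns works, but the second `fit_transform` overwrites the mapping learned for the first, so the territory codes can no longer be decoded. A minimal sketch, on made-up rows, that keeps one encoder per column (the `encoders` dictionary is an illustrative name, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows that only mimic the two categorical columns.
toy = pd.DataFrame({
    "territoryName": ["Lisboa", "Porto", "Lisboa"],
    "Party": ["PS", "B.E.", "PS"],
})

encoders = {}
for col in ["territoryName", "Party"]:
    encoders[col] = LabelEncoder()          # one fitted encoder per column
    toy[col] = encoders[col].fit_transform(toy[col])

# The stored encoder maps integer codes back to the original labels.
decoded = encoders["Party"].inverse_transform(toy["Party"])
print(list(decoded))
```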

# Defining the Variables

In [43]:
x=df.drop(['FinalMandates','time'],axis=1)

In [44]:
y=df['FinalMandates']

# Evaluation

In [45]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

def maxr2_score(regr,x,y):
    # Track the best test-set R2 over the seeds; start below any attainable score
    # so that final_r_state is always assigned.
    max_r_score=float("-inf")
    final_r_state=None
    for r_state in range(42,101):
        # New 80/20 split for every seed
        X_train, X_test, y_train, y_test = train_test_split(x, y,random_state= r_state,test_size=0.20)

        regr.fit(X_train,y_train)

        y_pred = regr.predict(X_test)
        r2_scr=r2_score(y_test,y_pred)
        print("r2_score corresponding to random state:",r_state," is:",r2_scr)
        if r2_scr>max_r_score:
            max_r_score=r2_scr
            final_r_state=r_state
    print()
    print()
    print("max r2 score corresponding to ", final_r_state," is ", max_r_score)
    return final_r_state

In [46]:
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor()
r_state = maxr2_score(dt,x,y)

r2_score corresponding to random state: 42  is: 0.999805008598716
r2_score corresponding to random state: 43  is: 0.9996815491212715
r2_score corresponding to random state: 44  is: 0.9999408940632822
r2_score corresponding to random state: 45  is: 0.9999470108919517
r2_score corresponding to random state: 46  is: 0.9999465872072548
r2_score corresponding to random state: 47  is: 0.9998749671930225
r2_score corresponding to random state: 48  is: 0.9999909150539897
r2_score corresponding to random state: 49  is: 0.9999279513179843
r2_score corresponding to random state: 50  is: 0.9999741906118872
r2_score corresponding to random state: 51  is: 0.9999905779803046
r2_score corresponding to random state: 52  is: 0.999931558983037
r2_score corresponding to random state: 53  is: 0.9999775884316184
r2_score corresponding to random state: 54  is: 0.9998702410653008
r2_score corresponding to random state: 55  is: 0.9993497543553935
r2_score corresponding to random state: 56  is: 0.99994657618282
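
Picking the seed with the highest test score gives an optimistic estimate, because the test split itself is being selected. A hedged alternative, sketched here on synthetic data from `make_regression` rather than the election data, is to report the mean and spread of the scores across seeds:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression problem; stands in for x and y only for illustration.
X_toy, y_toy = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

scores = []
for seed in range(42, 47):
    X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.20, random_state=seed)
    model = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)
    scores.append(r2_score(y_te, model.predict(X_te)))

# Summarize across seeds instead of keeping only the best one.
print("mean R2:", np.mean(scores), "std:", np.std(scores))
```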

In [47]:
# Cross-validate the decision tree with 5 folds, scored by R2
from sklearn.model_selection import cross_val_score
cross_val_score(DecisionTreeRegressor(),x,y,cv=5,scoring="r2")

array([0.99998233, 0.99976855, 0.99939225, 0.99992014, 0.99593827])
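
Each territory appears to be recorded at many timestamps, so plain 5-fold splits can put rows from the same territory in both the training and the test fold, which may inflate the scores. A minimal sketch of a grouped split with `GroupKFold`, run on synthetic data; grouping by territory is an assumption about what a stricter evaluation might use, not something the notebook does:

```python
import numpy as np
from sklearn.model_selection import GroupKFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
n_groups, per_group = 10, 20

# groups plays the role of the encoded territoryName column (illustrative only).
groups = np.repeat(np.arange(n_groups), per_group)
X_toy = rng.normal(size=(n_groups * per_group, 3))
y_toy = 2.0 * X_toy[:, 0] + rng.normal(scale=0.1, size=n_groups * per_group)

# GroupKFold keeps all rows of a group in the same fold.
cv = GroupKFold(n_splits=5)
group_scores = cross_val_score(
    DecisionTreeRegressor(random_state=0),
    X_toy, y_toy, groups=groups, cv=cv, scoring="r2",
)
print(group_scores)
```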

# Prediction

In [48]:
x_train, x_test, y_train, y_test = train_test_split(x, y,random_state = 47,test_size=0.20)
dt.fit(x_train,y_train)
y_pred=dt.predict(x_test)

In [51]:
print("RMSE is: ",np.sqrt(mean_squared_error(y_test,y_pred)))
print("r2_score is: ",r2_score(y_test,y_pred))

RMSE is:  0.06737406503342708
r2_score is:  0.9999038209177097
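
For reference, both metrics can be computed by hand: RMSE is the square root of the mean squared residual, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. A small sketch on made-up values that checks the manual formulas against scikit-learn:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up targets and predictions, not taken from the election data.
y_true = np.array([3.0, 0.0, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# R2: 1 - SS_res / SS_tot.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```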


In [53]:
# Put the predictions in a DataFrame
y_pred=pd.DataFrame(y_pred,columns=["Prediction"])
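
A DataFrame built from the raw prediction array gets a fresh 0..n-1 index, so it no longer lines up with the index of the test targets. A minimal sketch, on made-up values, of placing actual and predicted values side by side while keeping the original row index (`comparison` is an illustrative name):

```python
import numpy as np
import pandas as pd

# Made-up test targets with a shuffled index, as train_test_split would produce.
y_test_toy = pd.Series([106.0, 77.0, 19.0], index=[10, 42, 7], name="FinalMandates")
y_pred_toy = np.array([106.0, 77.0, 18.0])

# Reuse the test index so each prediction stays attached to its source row.
comparison = pd.DataFrame(
    {"Actual": y_test_toy.to_numpy(), "Prediction": y_pred_toy},
    index=y_test_toy.index,
)
comparison["Error"] = comparison["Prediction"] - comparison["Actual"]
print(comparison)
```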