There were 11 candidates in the first round of the 2017 French presidential election and 2 in the second: Macron and Le Pen. In the second round, supporters of the eliminated candidates therefore had to redistribute their votes among Macron, Le Pen, NOTA (none of the above), a null ballot, or abstention.

In some cases the vote transfer is rather obvious: most people who voted for Macron in the first round would probably vote for him again in the second. In others it is non-trivial: for example, Mélenchon, who came 4th in the first round, didn't endorse any of the leading candidates. We can only try to estimate how his supporters actually voted in the second round.

What can we say about this vote transfer using the data alone, without assuming any prior knowledge about the political affiliations of the candidates? I discuss the method in some detail in a [blog post here][1] and give a shorter but more technical description below.

Suppose a person who voted for candidate A in the first round has a probability p<sub>A,B</sub> of voting for candidate B in the second. Since we know the results for both rounds for each polling station, it is not difficult to estimate p<sub>A,B</sub>. 

Indeed, suppose  x<sub>A</sub> and  y<sub>A</sub> are results of candidate A in the first and second round respectively. <br>
<p>Then y<sub>B</sub>= &sum;<sub>A</sub> p<sub>A,B</sub> x<sub>A</sub>.</p>
<br>
We know x<sub>A</sub> and y<sub>A</sub> for every polling station, so one can simply run a linear regression to recover p<sub>A,B</sub>. However, the p<sub>A,B</sub> found in this way may not satisfy the basic properties of probabilities: each should lie between 0 and 1, and for every A they should sum to 1 over B.
It is therefore better to reformulate the problem as a quadratic optimisation with constraints:

Find p<sub>A,B</sub> such that the squared deviation of &sum;<sub>A</sub> p<sub>A,B</sub> x<sub>A</sub> from y<sub>B</sub> is minimal under the condition that
<br>0&le; p<sub>A,B</sub> &le;1,
<br> &sum;<sub>B </sub>p<sub>A,B</sub>=1 for all A
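
As a minimal sketch of this formulation, the example below builds synthetic polling-station data from a known transfer matrix and tries to recover it with SciPy's SLSQP solver. The matrix `P_true`, the sizes, the noise level and the random seed are all invented for illustration. For simplicity, this sketch keeps every entry as a free parameter and imposes the row sums as equality constraints.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Hypothetical ground truth: 3 first-round options, 2 second-round options.
P_true = np.array([[0.9, 0.1],
                   [0.2, 0.8],
                   [0.5, 0.5]])
n_a, n_b = P_true.shape

# Synthetic first-round shares for 200 stations (each row sums to 1).
X = rng.dirichlet(np.ones(n_a), size=200)
# Second-round shares implied by the model, plus a little noise.
Y = X @ P_true + rng.normal(scale=0.005, size=(200, n_b))

def loss(p_flat):
    # Squared deviation of the predicted second-round shares from the observed ones.
    P = p_flat.reshape(n_a, n_b)
    return np.sum((X @ P - Y) ** 2)

# Each row of P must sum to 1; each entry must lie in [0, 1].
constraints = {"type": "eq",
               "fun": lambda p: p.reshape(n_a, n_b).sum(axis=1) - 1}
bounds = [(0, 1)] * (n_a * n_b)
x0 = np.full(n_a * n_b, 1 / n_b)

result = minimize(loss, x0, method="SLSQP", bounds=bounds, constraints=constraints)
P_hat = result.x.reshape(n_a, n_b)
print(P_hat.round(2))
```

On data generated this way the estimate should land close to `P_true`; real election data are noisier, so the recovered table is only an estimate.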

I used the standard SciPy optimiser to solve this problem and got the following result:

![Vote transfer in the second round][2]

Below is the code that produces this result. I would be happy to hear your questions and comments!

  [1]: https://grishaoryol.wordpress.com/2017/05/17/vote-transfers-in-french-election/
  [2]: https://grishaoryol.files.wordpress.com/2017/05/sankeymatic_2400x2400-v2.png?w=1400 

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from matplotlib import colors as mcolors,cm,colorbar
import scipy.optimize  # import the submodule explicitly; optimize() calls scipy.optimize.minimize
import re
from IPython.display import HTML
from sklearn import model_selection

## Load the data

In [None]:
def cleanup(df_bur):
    # Clean the dataset: drop some columns we won't need, drop leading zeros from the
    # codes, transform from long to wide format
    df_bur=df_bur.drop(['First name','Sex'],axis=1)
    df_bur['Polling station']=df_bur['Polling station'].apply(lambda x:str(x)[2:] if (str(x)[:2]=='BV') else x).apply(lambda x: re.sub("^[0]+","",str(x)))
    df_bur['Commune code']=df_bur['Commune code'].apply(lambda x: re.sub("^[0]+","",str(x)))
    df_bur['Department code']=df_bur['Department code'].apply(lambda x: re.sub("^[0]+","",str(x)))
    df_bur['Constituency code']=df_bur['Constituency code'].apply(lambda x: re.sub("^[0]+","",str(x)))
    e1=df_bur[[u'Department',
       u'Constituency', u'Commune','Polling station']+[u'Registered',u'Abstentions',                  u'% Abs/Reg',
                          u'Voters',                  u'% Vot/Reg',
                           u'None of the above(NOTA)',               u'% NOTA/Reg',
                     u'% NOTA/Vot',                       u'Nulls',
                       u'% Nulls/Reg',                 u'% Nulls/Vot',
                         u'Expressed']].set_index([u'Department',
       u'Constituency', u'Commune', 'Polling station']).drop_duplicates()
    
    df_b1=df_bur.pivot_table(index=[u'Department',
       u'Constituency', u'Commune', 'Polling station'],columns = u'Surname',values='Voted')
    
    tab=pd.merge(df_b1,e1,left_index=True,right_index=True)
    
    if ('MÉLENCHON' in tab.columns):
        tab=tab.rename(columns={'MÉLENCHON':'MELENCHON'})
    
    return tab

#### Load the data

In [None]:
df_bur = pd.read_csv("../input/French_Presidential_Election_2017_First_Round.csv",sep=',',
                    dtype={'Department code':'object','Polling station':'object',29:'object'})
df_bur2 = pd.read_csv("../input/French_Presidential_Election_2017_Second_Round.csv",sep=',',
                     dtype={'Polling station':'object'})

#### and clean it up a bit

In [None]:
tab=cleanup(df_bur)

In [None]:
tab2=cleanup(df_bur2)

#### Check that the data is consistent

In [None]:
print(np.all(tab['Voters']+tab['Abstentions']-tab['Registered']==0))
print((tab[[u'ARTHAUD', u'ASSELINEAU', u'CHEMINADE', u'DUPONT-AIGNAN', u'FILLON',
       u'HAMON', u'LASSALLE', u'LE PEN', u'MACRON', u'POUTOU',u'MELENCHON']].sum(axis=1)-tab[u'Expressed']!=0).sum())
print((tab[[u'ARTHAUD', u'ASSELINEAU', u'CHEMINADE', u'DUPONT-AIGNAN', u'FILLON',
       u'HAMON', u'LASSALLE', u'LE PEN', u'MACRON', u'POUTOU',u'MELENCHON', u'None of the above(NOTA)',u'Nulls']].sum(axis=1)-tab[u'Voters']!=0).sum())

#### Merge the results of two rounds into one table

In [None]:
merged=pd.merge(tab,tab2,left_index=True,right_index=True,how='inner')

Below we merge voters who voted NOTA, voters who cast a null ballot, and those who abstained into a single category. We also merge the five least successful candidates into one category.
This step is not strictly necessary, but there are two reasons for doing it. First, with fewer categories
there are fewer parameters to fit, and the model turns out to be more robust. Second, the vote shares of
these minor categories are so small that they are comparable to the precision of the model itself, so including them
separately would not give any reliable new information.

In [None]:
merged['Abstentions, NOTA, null_y']=merged[['None of the above(NOTA)_y','Nulls_y','Abstentions_y']].sum(axis=1)
merged['Abstentions, NOTA, null_x']=merged[['None of the above(NOTA)_x','Nulls_x','Abstentions_x']].sum(axis=1)
merged['Other candidates']=merged[['ARTHAUD','ASSELINEAU','CHEMINADE','LASSALLE','POUTOU']].sum(axis=1)

In [None]:
nms_compressed=['Other candidates',
 u'DUPONT-AIGNAN',
 u'FILLON',
 u'HAMON',
 u'LE PEN_x',
 u'MELENCHON',
 'MACRON_x','Abstentions, NOTA, null_x']

In [None]:
options_2iem_compressed=['LE PEN_y', 'MACRON_y', 'Abstentions, NOTA, null_y']

## Optimization procedure

In [None]:
def optimize(merged):
    print('optimizing, input table has '+str(merged.shape[0])+" rows")
    
    y1=(merged[options_2iem_compressed].T/merged['Registered_y']).T[merged['Registered_x']!=0]
    X1=(merged[nms_compressed].T/merged['Registered_x']).T[merged['Registered_x']!=0]
    n_2iem=len(options_2iem_compressed)
    
    # Probabilities are naturally organized as a table, but scipy.optimize.minimize works with 
    # a flat list of parameters, so this function reshapes the list into a table
    def rshp(prob):
        tmp1=np.reshape(prob,(len(options_2iem_compressed)-1,X1.shape[1])).T
        tmp2=np.concatenate((tmp1,np.array([1-tmp1.sum(axis=1)]).T),axis=1)
        return(tmp2)

    # This is the loss function, which takes the probabilities as input and outputs the quadratic
    # deviation of the computed second-round results from the actual ones
    def loss_func(prob):
        # y1 and X1 are taken from the enclosing scope, so they are not recomputed on every call
        tmp2=rshp(prob)
        ret=np.sum((np.dot(X1,tmp2)-y1)**2).sum()
        return ret
    
    
    # Constraint for the probabilities table: the sum of values in each row should be equal to 1.
    # The last column is computed as 1 minus the others, so it is enough to require that the
    # remaining columns sum to at most 1 in each row.
    def fun_constr(prob):
        return 1-np.reshape(prob,(n_2iem-1,X1.shape[1])).sum(axis=0)
    
    bs=[(0,1)]*(X1.shape[1])*(n_2iem-1) # bounds for probabilities: between 0 and 1
    x0=np.array(X1.shape[1]*(n_2iem-1)*[1/float(n_2iem)]) # starting point for the optimization procedure
    constr={'type':'ineq','fun':fun_constr} # impose the constraint: the sum of each row, omitting the last element, should be at most 1
    print(x0)
    print(bs)
    # run the quadratic optimization
    opt=scipy.optimize.minimize(loss_func,
                   x0,#jac=jac,
                   method = 'SLSQP',
                   bounds = bs,
                   constraints = constr

        )
    print(opt)
    print('error: '+str(np.sqrt(opt.fun/len(y1))))
    res=pd.DataFrame(rshp(opt.x),columns=options_2iem_compressed,index=nms_compressed).round(3)
    return(res)

In [None]:
# run the optimization procedure on the whole dataset
res_c=optimize(merged)
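
The parametrization used by `rshp` can be seen in isolation in the sketch below. Only the first `n_2iem-1` columns are free parameters; the last column is filled in as one minus the row sum, so every row sums to 1 by construction. The sizes here are made up for illustration, and `to_table` is a hypothetical stand-in for `rshp`.

```python
import numpy as np

n_first, n_second = 4, 3  # hypothetical numbers of first- and second-round options

def to_table(prob):
    # The flat vector holds the first (n_second - 1) columns, stored column by column.
    free = np.reshape(prob, (n_second - 1, n_first)).T
    # The last column is whatever probability mass is left in each row.
    last = 1 - free.sum(axis=1, keepdims=True)
    return np.concatenate((free, last), axis=1)

# Same kind of starting point as in optimize(): every option equally likely.
x0 = np.full(n_first * (n_second - 1), 1 / n_second)
table = to_table(x0)
print(table)
```

With this layout the inequality constraint only has to keep the free part of each row at or below 1, which keeps the implied last column non-negative.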

### Precision of the model

#### Let us see how the results change if we take two different subsets of the original data

In [None]:
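# take a random subset of the original table, dropping 1/10 of the data
# (keyword arguments are used because recent scikit-learn versions require test_size as a keyword)
splitter = model_selection.ShuffleSplit(n_splits=1,test_size=0.1)
train = merged.iloc[[x for x in splitter.split(merged)][0][0]]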

In [None]:
#run the optimization procedure on the first random subset
res_train=optimize(train)

In [None]:
# take another random subset, again dropping 1/10 of the data
splitter = model_selection.ShuffleSplit(n_splits=1,test_size=0.1)
train = merged.iloc[[x for x in splitter.split(merged)][0][0]]

In [None]:
#run the optimization procedure on the second random subset
res_train1=optimize(train)

Are the results different?

In [None]:
res_train1-res_train

The difference is small, so the model is fairly robust to the choice of subset.
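
Comparing two subsets gives a single difference; repeating the resampling many times gives a rough spread for each entry of the table. The sketch below illustrates the idea on synthetic data. Everything in it is an assumption made for illustration: the data, the matrix `P_true`, and `fit`, which uses plain least squares as a stand-in for the constrained optimization.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data: 500 stations, 3 first-round and 2 second-round options.
P_true = np.array([[0.9, 0.1], [0.3, 0.7], [0.5, 0.5]])
X = rng.dirichlet(np.ones(3), size=500)
Y = X @ P_true + rng.normal(scale=0.01, size=(500, 2))

def fit(X_sub, Y_sub):
    # Stand-in estimator: unconstrained least squares instead of the constrained fit.
    return np.linalg.lstsq(X_sub, Y_sub, rcond=None)[0]

# Refit on 20 random subsets, each keeping 90% of the stations.
estimates = []
for _ in range(20):
    keep = rng.choice(len(X), size=int(0.9 * len(X)), replace=False)
    estimates.append(fit(X[keep], Y[keep]))

# Per-entry standard deviation of the estimates across the subsets.
spread = np.std(estimates, axis=0)
print(spread.round(4))
```

In the notebook, the same loop could call `optimize` on each subset instead of `fit`, at the cost of running the solver once per subset.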

## Visualization of the results

In [None]:
# Votes for particular candidates as a share of all registered voters (first round)
res_premier=merged[nms_compressed].sum()/merged['Registered_x'].sum()

Share of all registered voters who went from candidate A in the first round to candidate B in the second:

In [None]:
res_new=((res_premier*res_c.T).T).round(3)
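
To make the arithmetic concrete, here is a small sketch with invented numbers: each row of the transfer table is scaled by the first-round share of that row's option, so the entries become shares of all registered voters. The labels and values below are hypothetical and are not taken from the election data.

```python
import pandas as pd

# Invented first-round shares of registered voters.
first_round = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})

# Invented transfer probabilities; each row sums to 1.
transfer = pd.DataFrame(
    {"X": [0.8, 0.1, 0.5], "Y": [0.2, 0.9, 0.5]},
    index=["A", "B", "C"],
)

# Scale every row by the first-round share of that row's option.
flows = transfer.mul(first_round, axis=0)
print(flows)
```

`transfer.mul(first_round, axis=0)` produces the same table as the double transpose `(first_round*transfer.T).T`; each row of `flows` sums back to the corresponding first-round share.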

Draw the bar plot

In [None]:
def form(x): 
    # strip the '_x' / '_y' suffix that the merge added to duplicated column names
    return x[:-2] if x.endswith(('_x','_y')) else x
chart1=res_new.copy()
chart1['total']=chart1.sum(axis=1)
chart1=chart1.sort_values(['total'])*100
bar_h=10
x_init=0.14*100
fig=plt.figure(facecolor='white')
ax=fig.add_subplot(111)
ax.set_facecolor('white')
ax.text(0,bar_h*len(chart1.index)+bar_h*0.2,'First round',fontsize=13)
for i in range(len(chart1.index)):
    # note: the position argument of ax.bar is called `x` (it was `left` in old matplotlib versions)
    ax.bar(x=x_init+chart1.iloc[i,0],height=bar_h,width=chart1.iloc[i,1],
           bottom=bar_h*i,color='red',align='edge',edgecolor='black')
    ax.bar(x=x_init,height=bar_h,width=chart1.iloc[i,0],
           bottom=bar_h*i,color='blue',align='edge',edgecolor='black')
    ax.bar(x=x_init+chart1.iloc[i,0]+chart1.iloc[i,1],height=bar_h,width=chart1.iloc[i,2],
           bottom=bar_h*i,color='grey',align='edge',edgecolor='black')
    ax.text(0,bar_h*i+bar_h*0.5,form(chart1.index[i]),fontweight='bold',fontsize=13)
lg=ax.legend(['Macron','Le Pen','Abstentions, NOTA, null'],loc=(0.65,0.12),title='Second round',fontsize=13)
plt.setp(lg.get_title(),fontsize=14)
ax.set_yticks([])
ticks=np.arange(0,40,5)
ax.set_xticks(ticks+x_init)
ax.set_xticklabels([str(x) for x in ticks],fontsize=13)
ax.text(0,bar_h*(len(chart1.index)+1),'Transfer of votes between the two rounds',
             fontweight='bold',
             fontsize=17
             )
ax.set_xlabel('% of registered voters',fontsize=13)
plt.show()

#### Table of vote transfer percentages

In [None]:
table_html=(pd.DataFrame(np.array(res_c),columns=['Le Pen','Macron','Abstentions, NOTA, null'],
             index=['Other candidates','Dupont-Aignan','Fillon','Hamon',
                    'Le Pen','Melenchon','Macron','Abstentions, NOTA, null']).sort_values('Macron',ascending=False)*100).round(1)
table_html

In [None]:
res_new.columns=['Le Pen','Macron','Abstentions, NOTA, null']
res_new.index=['Other candidates','Dupont-Aignan','Fillon','Hamon',
                    'Le Pen','Melenchon','Macron','Abstentions, NOTA, null']
res_new

This is the final result. The flow chart at the beginning of the notebook was generated from this table using the online tool sankeymatic.com. Feel free to comment and ask questions!
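
For readers who want to reproduce the chart, the sketch below turns a table of flows into text lines of the form `Source [amount] Target`, which is the input syntax the tool accepts as far as I know. The function `sankey_lines`, the "(1st)" and "(2nd)" label suffixes and the example values are all hypothetical, and this is not necessarily how the original chart was prepared.

```python
def sankey_lines(flows, min_share=0.0):
    """Format flows[source][target] = share as 'source [amount] target' lines."""
    lines = []
    for source, targets in flows.items():
        for target, share in targets.items():
            if share > min_share:  # skip empty or negligible flows
                # Distinct labels per round keep first- and second-round nodes apart.
                lines.append(f"{source} (1st) [{share * 100:.1f}] {target} (2nd)")
    return lines

# Invented example values, not the notebook's results.
example = {
    "A": {"X": 0.40, "Y": 0.10},
    "B": {"X": 0.03, "Y": 0.27},
}
lines = sankey_lines(example)
print("\n".join(lines))
```

With the notebook's table, something like `sankey_lines(res_new.to_dict(orient='index'))` should give one line per flow.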