## K-Fold Cross Validation - SVC model

We build a model to predict the political regime of a country based on our parameters.

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn import svm

In [3]:
DF = pd.read_csv('DATA/Democracy-Index.csv', usecols=range(1,10))
# Keep a full copy of the data frame so the 'Democracy' target stays available,
# then drop the 'Democracy' and 'Country' columns from DF
# so that only the feature columns remain for building the model
df_target = DF.copy()
del DF['Democracy']
del DF['Country']

A single train/test split made with the train_test_split function from sklearn.model_selection:

In [19]:
# Split the data into train/test data sets with 30% reserved for testing
X_train, X_test, y_train, y_test = train_test_split(DF, df_target['Democracy'], test_size=0.3, random_state=0)

# Build an SVC model for predicting the political regime using training data
clf = svm.SVC(kernel='linear', C=1).fit(X_train, y_train)

# Now measure its performance with the test data
clf.score(X_test, y_test)

0.8

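That is quite good! A single split can be lucky or unlucky, though, so let's check the model with K-fold cross-validation, which averages the accuracy over several different train/test splits.

Indeed, K-fold cross-validation with K = 5 gives an even higher mean accuracy: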

In [20]:
# We give cross_val_score a model, the entire data set and its "real" values, and the number of folds:
scores = cross_val_score(clf, DF, df_target['Democracy'], cv=5)

# Print the accuracy for each fold:
print(scores)

# And the mean accuracy of all 5 folds:
print(scores.mean())

[0.87878788 0.87878788 0.84848485 0.81818182 0.84375   ]
0.8535984848484848
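
To make the fold mechanics concrete, here is a minimal pure-Python sketch of how a data set can be cut into K folds, each used once as the test set. `kfold_indices` is a hypothetical helper written for illustration, not a scikit-learn function. The sample count of 164 is an assumption inferred from the fold accuracies above (for example 0.84375 = 27/32 and 0.87878788 = 29/33); it is not stated in the data description. Note also that for a classifier with an integer `cv`, cross_val_score uses stratified folds, so the real fold membership differs from this plain contiguous version.

```python
def kfold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        # Everything outside the test fold is used for training
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Assumed sample count (inferred from the printed scores, see above)
fold_sizes = [len(test_idx) for _, test_idx in kfold_indices(164, 5)]
print(fold_sizes)  # [33, 33, 33, 33, 32]
```

Every sample lands in exactly one test fold, so the mean of the five scores uses each row for testing once, which is why it is a steadier estimate than one 70/30 split.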


Now we can use our model to predict the political regime of a 'fake' country, by choosing arbitrary parameters.

Play with it!

In [35]:
# Parameters of our test country:
expectancy = 80
popD = 150
gini = 29
ageMed = 35
skyscraper = 260
children = 2
pressF = 20

# The values must be listed in the same column order as DF
if clf.predict([[expectancy, popD, gini, ageMed, skyscraper, children, pressF]])[0] == 1:
    print('This country is a democracy.')
else:
    print('This country is not a democracy.')

This country is a democracy.


We could also try a polynomial kernel, but here it performs much worse (possibly because it overfits), with a mean accuracy of only about 0.54:

In [6]:
clf = svm.SVC(kernel='poly', C=1)
scores = cross_val_score(clf, DF, df_target['Democracy'], cv=5)
print(scores)
print(scores.mean())

[0.48484848 0.57575758 0.54545455 0.54545455 0.53125   ]
0.5365530303030303
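
Low cross-validation scores alone do not tell us whether a model overfits or simply fits badly. One way to check is to compare training accuracy with validation accuracy on each fold: a large gap points to overfitting, while two low values point to underfitting. The sketch below does this with `cross_validate` and `return_train_score=True`. It runs on synthetic data from `make_classification`, an assumption made only so the example is self-contained; `X`, `y` and `clf_poly` are illustrative names, and the numbers it prints say nothing about the Democracy Index data.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_validate

# Synthetic stand-in for the real features and target (assumption):
# 164 rows and 7 feature columns, mirroring the shape used above
X, y = make_classification(n_samples=164, n_features=7,
                           n_informative=4, random_state=0)

clf_poly = svm.SVC(kernel='poly', C=1)

# return_train_score=True also reports the accuracy on each training fold
results = cross_validate(clf_poly, X, y, cv=5, return_train_score=True)

train_mean = results['train_score'].mean()
test_mean = results['test_score'].mean()

print('mean training accuracy:  ', train_mean)
print('mean validation accuracy:', test_mean)
print('gap:', train_mean - test_mean)
```

To apply it to the real data, replace `X` and `y` with `DF` and `df_target['Democracy']`.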
