Let's use a *toy example* from the Eurobarometer to study citizens' left-right position based on various background items.


In [1]:
import csv

data = csv.DictReader( open('eurobaro.csv') )

## we need labels (what we try to explain) and
## features (what we explain it with)

labels = []
features = []

for row in data:
    
    leftright = row['d1'] ## the raw answer is a string
    leftright = leftright.replace('Box', '').replace('- left', '').replace('- right', '').replace('DK', '-1').replace('Refusal', '-1')
    leftright = leftright.strip()
    leftright = int( leftright )
    
    if 1 <= leftright <= 3 or 8 <= leftright <= 10: ## we're interested in the two extremes only
        
        labels.append( leftright > 5 ) ## True = right (8-10), False = left (1-3)
        
        f = []
        
        sex = int( row['d10'] == 'Woman' ) ## 1 = woman, 0 = otherwise
        age = int( row['d11'] )
        
        rural = 0
        town = 0
        city = 0
        
        if row['d25'] == 'Rural area or village':
            rural = 1
            
        if row['d25'] == 'Small or middle sized town':
            town = 1
            
        if row['d25'] == 'Large town':
            city = 1
            
        bill_paying = 0
        
        if row['d60'] == 'Most of the time':
            bill_paying = 1
            
        _temp = {
            'Very satisfied' : 1,
            'Fairly satisfied' : 2,
            'Not very satisfied' : 3,
            'Not at all satisfied' : 4,
            'DK' : 2.5 ## midpoint of the 1-4 scale, used as a neutral fill-in
        }
        
        life_satisfaction = _temp[ row['d70'] ]

        features.append( [ sex, age, rural, city, town, bill_paying, life_satisfaction ] )
        
print( 'Have %d labels and %d features.' % ( len( labels ), len( features ) ) )

Have 7606 labels and 7606 features.
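
The string cleaning and the dummy coding in the loop above can also be pulled into small helpers, which makes them easy to check on their own. This is a minimal sketch, not part of the original notebook: `parse_leftright`, `one_hot` and `AREAS` are hypothetical names, and the sample answers such as `'Box 1 - left'` are assumed from the `replace` calls rather than taken from the data file.

```python
## hypothetical helpers; the names are illustrative and not in the notebook

AREAS = ['Rural area or village', 'Large town', 'Small or middle sized town']

def parse_leftright(answer):
    ## map a raw 'd1' answer to an integer, using -1 for DK and Refusal
    if answer in ('DK', 'Refusal'):
        return -1
    cleaned = answer.replace('Box', '').replace('- left', '').replace('- right', '')
    return int(cleaned.strip())

def one_hot(value, categories):
    ## one 0/1 indicator per category, in the order given
    return [int(value == c) for c in categories]

## assumed example answers, for illustration only
for raw in ['Box 1 - left', 'Box 5', 'Box 10 - right', 'DK']:
    print('%s -> %d' % (raw, parse_leftright(raw)))

print(one_hot('Large town', AREAS))
```

With `AREAS` in this order, `one_hot( row['d25'], AREAS )` yields the same `rural, city, town` triple that the loop builds by hand.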


In [2]:
import numpy
from sklearn import svm
## GridSearchCV and train_test_split live in sklearn.model_selection since scikit-learn 0.18
from sklearn.model_selection import GridSearchCV, train_test_split


estimator = svm.SVC()
grid = [
    {'C': numpy.arange( 0.5 , 3, 1 ), 'gamma': numpy.arange( .01, .1, .01 ) , 'kernel': ['rbf', 'sigmoid'] },
]

model = GridSearchCV( estimator , grid, cv = 3, verbose = 3) ## small grid, demo only!

features = numpy.array( features )
labels = numpy.array( labels )

## hold out a quarter of the rows for testing
features_train, features_test, labels_train, labels_test = train_test_split( features, labels, test_size = .25 )

model.fit( features_train, labels_train )

print( 'Training accuracy: %f' % model.score( features_train, labels_train ) )
print( 'Testing accuracy: %f' % model.score( features_test, labels_test ) )

Fitting 3 folds for each of 54 candidates, totalling 162 fits
[CV] kernel=rbf, C=0.5, gamma=0.01 ...................................
[CV] .......... kernel=rbf, C=0.5, gamma=0.01, score=0.560463 -   0.7s
[CV] kernel=rbf, C=0.5, gamma=0.01 ...................................
[CV] .......... kernel=rbf, C=0.5, gamma=0.01, score=0.561284 -   0.7s
[CV] kernel=rbf, C=0.5, gamma=0.01 ...................................
[CV] .......... kernel=rbf, C=0.5, gamma=0.01, score=0.561284 -   0.7s
[CV] kernel=sigmoid, C=0.5, gamma=0.01 ...............................
[CV] ...... kernel=sigmoid, C=0.5, gamma=0.01, score=0.558360 -   0.8s
[CV] kernel=sigmoid, C=0.5, gamma=0.01 ...............................
[CV] ...... kernel=sigmoid, C=0.5, gamma=0.01, score=0.558653 -   0.7s
[CV] kernel=sigmoid, C=0.5, gamma=0.01 ...............................
[CV] ...... kernel=sigmoid, C=0.5, gamma=0.01, score=0.558653 -   1.0s
[CV] kernel=rbf, C=0.5, gamma=0.02 ...................................
[CV] ..........

[Parallel(n_jobs=1)]: Done  31 tasks       | elapsed:   18.2s
[Parallel(n_jobs=1)]: Done 127 tasks       | elapsed:  1.4min



[CV] kernel=rbf, C=2.5, gamma=0.04 ...................................
[CV] .......... kernel=rbf, C=2.5, gamma=0.04, score=0.546554 -   0.8s
[CV] kernel=rbf, C=2.5, gamma=0.04 ...................................
[CV] .......... kernel=rbf, C=2.5, gamma=0.04, score=0.564966 -   0.8s
[CV] kernel=sigmoid, C=2.5, gamma=0.04 ...............................
[CV] ...... kernel=sigmoid, C=2.5, gamma=0.04, score=0.558360 -   0.4s
[CV] kernel=sigmoid, C=2.5, gamma=0.04 ...............................
[CV] ...... kernel=sigmoid, C=2.5, gamma=0.04, score=0.558653 -   0.4s
[CV] kernel=sigmoid, C=2.5, gamma=0.04 ...............................
[CV] ...... kernel=sigmoid, C=2.5, gamma=0.04, score=0.558653 -   0.4s
[CV] kernel=rbf, C=2.5, gamma=0.05 ...................................
[CV] .......... kernel=rbf, C=2.5, gamma=0.05, score=0.550473 -   0.8s
[CV] kernel=rbf, C=2.5, gamma=0.05 ...................................
[CV] .......... kernel=rbf, C=2.5, gamma=0.05, score=0.546554 -   0.8s
[CV] 

[Parallel(n_jobs=1)]: Done 162 out of 162 | elapsed:  1.7min finished
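
The first line of the log follows from the size of the grid: 3 values of `C`, 9 values of `gamma` and 2 kernels give 54 candidates, and each is fitted once per fold. The sketch below spells that out with plain lists; the values are written by hand to mirror what the `numpy.arange` calls are expected to produce, which is an assumption worth checking against the real arrays.

```python
import itertools

## hand-written stand-ins for numpy.arange( 0.5, 3, 1 ) and numpy.arange( .01, .1, .01 )
C_values = [0.5, 1.5, 2.5]
gamma_values = [round(0.01 * i, 2) for i in range(1, 10)]
kernels = ['rbf', 'sigmoid']
folds = 3

## every combination of kernel, C and gamma is one candidate
candidates = list(itertools.product(kernels, C_values, gamma_values))

print('%d candidates, %d fits' % (len(candidates), len(candidates) * folds))
## prints: 54 candidates, 162 fits
```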

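The fold scores in the log sit close to 0.56. A quick way to put such a number in context is to compare it with the accuracy of always predicting the most common label. The helper below is a sketch; `majority_baseline` is a hypothetical name and the example labels are made up, so its output says nothing about the Eurobarometer data.

```python
from collections import Counter

def majority_baseline(labels):
    ## share of the most common label, i.e. the accuracy of always guessing it
    counts = Counter(labels)
    return max(counts.values()) / float(len(labels))

## made-up labels for illustration: 3 x True, 2 x False
example = [True, False, True, True, False]
print('Baseline accuracy: %.2f' % majority_baseline(example))
```

Calling it on `labels_train` would give the figure to compare the cross-validation scores against.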