## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we will simulate active learning. We set aside a small sample of observations for testing and measure how the quality of the model improves as active learning chooses which observations get labeled.

In [12]:
# Load the Data into variable df

In [13]:
import pandas as pd

df = pd.read_csv("data/titanic_train.csv")

In [14]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df.sample(n=100, random_state=42)
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df[~df.passenger_id.isin(test_df.passenger_id.tolist())]

In [15]:
# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
start_df = df.sample(n=100, random_state=42)
# DROP OBS FROM START_DF FROM DF
df = df[~df.passenger_id.isin(start_df.passenger_id.tolist())]
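The two cells above split the data into three parts: a test set, a starting training set, and the remaining pool. As a quick sanity check, the sketch below repeats the same split on a toy frame and verifies that no passenger ends up in more than one part. The toy data and the set names `test_ids`, `start_ids`, and `pool_ids` are assumptions made for illustration; the check also assumes `passenger_id` is unique.

```python
import pandas as pd

# Toy frame standing in for the Titanic data (assumption: passenger_id is unique).
df = pd.DataFrame({
    "passenger_id": range(1, 501),
    "survived": [i % 2 for i in range(500)],
})

# Same split as in the notebook: 100 test rows, then 100 starting rows.
test_df = df.sample(n=100, random_state=42)
df = df[~df.passenger_id.isin(test_df.passenger_id)]

start_df = df.sample(n=100, random_state=42)
df = df[~df.passenger_id.isin(start_df.passenger_id)]

test_ids = set(test_df.passenger_id)
start_ids = set(start_df.passenger_id)
pool_ids = set(df.passenger_id)

# No passenger may appear in more than one of the three sets.
assert test_ids.isdisjoint(start_ids)
assert test_ids.isdisjoint(pool_ids)
assert start_ids.isdisjoint(pool_ids)

print(len(test_df), len(start_df), len(df))  # prints: 100 100 300
```

The same three `isdisjoint` checks can be run on the real `test_df`, `start_df`, and `df` right after the split.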

In [16]:
start_df

Unnamed: 0,passenger_id,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,survived
569,1246,3,"Thorneycroft, Mr. Percival",male,,1,0,376564,16.1000,,S,,,,0
402,1087,3,"Olsson, Mr. Nils Johan Goransson",male,28.0000,0,0,347464,7.8542,,S,,,,0
151,763,3,"Dean, Miss. Elizabeth Gladys ""Millvina""",female,0.1667,1,2,C.A. 2315,20.5750,,S,10,,"Devon, England Wichita, KS",1
282,379,2,"Collyer, Mrs. Harvey (Charlotte Annie Tate)",female,31.0000,1,1,C.A. 31921,26.2500,,S,14,,"Bishopstoke, Hants / Fayette Valley, ID",1
337,805,3,"Foo, Mr. Choong",male,,0,0,1601,56.4958,,S,13,,"Hong Kong New York, NY",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
377,216,1,"Newsom, Miss. Helen Monypeny",female,19.0000,0,2,11752,26.2833,D47,S,5,,"New York, NY",1
556,1173,3,"Sage, Miss. Constance Gladys",female,,8,2,CA. 2343,69.5500,,S,,,,0
323,799,3,"Fischer, Mr. Eberhard Thelander",male,18.0000,0,0,350036,7.7958,,S,,,,0
134,960,3,"Lemberopolous, Mr. Peter L",male,34.5000,0,0,2683,6.4375,,C,,196.0,,0


### Tasks

1. Fit the first model on **start_df** only, using an **SVM**, and evaluate accuracy, precision, and recall on **test_df**.
2. In each iteration, add 10 observations from **df** to the training set, chosen with an active learning approach:
    - Score all observations in **df** and take the 10 the model is least sure about, i.e. those whose predicted probability of surviving is closest to 50%.
3. Refit the model and evaluate it on **test_df** again.
4. The goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration.
5. Plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis.
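The tasks above amount to a pool-based uncertainty sampling loop. The following is a minimal sketch of that loop, not a solution on the Titanic data: it assumes a synthetic dataset from `make_classification` in place of the preprocessed features, and the names `N_ITERATIONS`, `history`, `evaluate`, and `plot_history` are hypothetical. It uses `SVC(probability=True)` so that `predict_proba` is available; ranking by the absolute value of `decision_function` would be a reasonable alternative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.svm import SVC

# Synthetic stand-in for the Titanic features (assumption: numeric, no NaNs).
X, y = make_classification(n_samples=600, n_features=8, random_state=0)
X_test, y_test = X[:100], y[:100]            # plays the role of test_df
X_train, y_train = X[100:200], y[100:200]    # plays the role of start_df
X_pool, y_pool = X[200:], y[200:]            # plays the role of df

N_ITERATIONS = 5   # hypothetical; the exercise does not fix a number
BATCH_SIZE = 10    # from task 2

history = {"accuracy": [], "precision": [], "recall": []}

def evaluate(model):
    """Score the model on the fixed test set and record the metrics."""
    pred = model.predict(X_test)
    history["accuracy"].append(accuracy_score(y_test, pred))
    history["precision"].append(precision_score(y_test, pred, zero_division=0))
    history["recall"].append(recall_score(y_test, pred, zero_division=0))

model = SVC(probability=True, random_state=0).fit(X_train, y_train)
evaluate(model)  # iteration 0: start set only

for _ in range(N_ITERATIONS):
    # Probability of the positive class for every unlabeled pool row.
    proba = model.predict_proba(X_pool)[:, 1]
    # Indices of the BATCH_SIZE rows whose probability is closest to 0.5.
    query = np.argsort(np.abs(proba - 0.5))[:BATCH_SIZE]
    # "Label" them by revealing y_pool, then move them into the training set.
    X_train = np.vstack([X_train, X_pool[query]])
    y_train = np.concatenate([y_train, y_pool[query]])
    X_pool = np.delete(X_pool, query, axis=0)
    y_pool = np.delete(y_pool, query)
    # Refit from scratch on the enlarged training set and re-evaluate.
    model = SVC(probability=True, random_state=0).fit(X_train, y_train)
    evaluate(model)

def plot_history(history):
    """Draw one panel per metric: iteration on x, metric value on y."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting
    fig, axes = plt.subplots(1, len(history), figsize=(12, 3))
    for ax, (name, values) in zip(axes, history.items()):
        ax.plot(range(len(values)), values, marker="o")
        ax.set_xlabel("iteration")
        ax.set_ylabel(name)
    fig.tight_layout()
    return fig
```

Calling `plot_history(history)` after the loop produces the per-metric graphs asked for in the last task. To judge whether the selection helps, the same loop can be rerun with `query` drawn at random from the pool and the two sets of curves compared.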

In [45]:
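from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, precision_score, recall_score

def prepare(frame):
    """Return the numeric feature matrix and the target for one split."""
    frame = frame.set_index('passenger_id')
    # Drop the target, the free-text columns, and 'boat'/'body', which reveal the outcome.
    X = frame.drop(['survived', 'name', 'cabin', 'ticket', 'boat', 'body', 'home.dest'], axis=1)
    X = X.drop(columns=['Unnamed: 0'], errors='ignore')  # leftover CSV index, if present
    X = pd.get_dummies(data=X, columns=['sex', 'embarked', 'pclass'], dtype=float)
    return X, frame['survived']

# The test set must go through the same preprocessing as the training set;
# scoring on the raw test_df is what raised the feature-name and string-to-float errors.
X, y = prepare(start_df)
X_test, y_test = prepare(test_df)

# Fill missing values (e.g. age) with means computed on the training rows only.
means = X.mean()
X = X.fillna(means)
# Give the test set exactly the training columns, in the same order.
X_test = X_test.reindex(columns=X.columns, fill_value=0).fillna(means)

svm = SVC()
svm.fit(X=X, y=y)
y_pred = svm.predict(X_test)

print("accuracy: ", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall:   ", recall_score(y_test, y_pred))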

In [None]:
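# Read the separate test file with pandas; a plain file handle cannot be indexed by column name.
titanic_test = pd.read_csv('data/titanic_test.csv')
# y_test = titanic_test['']  # the column name was left blank in the original cell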