## Active Learning

Download the Titanic dataset here: https://drive.google.com/file/d/0Bz9_0VdXvv9bbVhpOEMwUDJ2elU/view?usp=sharing

In this exercise, we simulate active learning. We hold out a small sample of observations for testing, then measure how the quality of the model improves as active learning chooses which observations to label and add to the training set.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# import warnings
# warnings.filterwarnings('ignore')

In [2]:
# Load the Data into variable df

In [3]:
df=pd.read_csv('./titanic/train.csv')
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [4]:
# Remove extra cols
df = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)

In [5]:
# TEST SAMPLE
# USE THIS SAMPLE ONLY FOR TESTING
test_df = df.sample(n=100, random_state=42)
X_test=test_df.drop('Survived', axis=1)
y_test=test_df['Survived']
# KEEP ONLY THOSE WHO ARE NOT IN THE TEST SET
df = df[~df.PassengerId.isin(test_df.PassengerId.tolist())]

In [6]:
# FIT THE FIRST MODEL ONLY ON THE DATAFRAME START_DF
start_df = df.sample(n=100, random_state=42)
# DROP THE OBSERVATIONS IN START_DF FROM DF
df = df[~df.PassengerId.isin(start_df.PassengerId.tolist())]

In [7]:
df.shape

(691, 9)

### Tasks

1. fit the first model on **start_df** only, using **SVM**, and evaluate accuracy, precision and recall on **test_df**
2. in each iteration, add 10 observations from **df** to your training set (choose the observations using an active learning approach)
    - score all observations in df and take the 10 the model is least sure about, i.e. those whose predicted probability of surviving is closest to 50%
3. refit the model and evaluate on **test_df** again
4. the goal is to converge to the optimal solution as fast as possible by choosing the **right** observations in each iteration
5. plot a graph for each evaluation metric, with the iteration number on the x-axis and the metric value for that model on the y-axis
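The selection rule in task 2 can be sketched as a small helper. This is a minimal illustration, not the notebook's own code: the function name `select_most_uncertain` and the toy probabilities are assumptions made up for the example. It ranks observations by how far their predicted survival probability is from 0.5 and returns the positions of the closest ones.

```python
import numpy as np


def select_most_uncertain(proba_positive, k=10):
    """Return the positions of the k probabilities closest to 0.5.

    `proba_positive` is assumed to be the predicted probability of the
    positive class (e.g. `predict_proba(...)[:, 1]`).
    """
    proba_positive = np.asarray(proba_positive, dtype=float)
    # Distance from 0.5: a gap of 0 means the model is maximally unsure.
    uncertainty_gap = np.abs(proba_positive - 0.5)
    # A stable sort keeps the original row order among ties.
    order = np.argsort(uncertainty_gap, kind="stable")
    return order[:k]


# Toy probabilities, invented for illustration only
toy_proba = [0.95, 0.52, 0.10, 0.45, 0.70, 0.49]
picked = select_most_uncertain(toy_proba, k=3)
print(picked)  # positions of 0.49, 0.52 and 0.45
```

Ranking by distance from 0.5 always returns the `k` most ambiguous rows, whereas a fixed probability band can return fewer than `k` rows or rows that are not the most ambiguous ones.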

In [8]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.impute import SimpleImputer

from sklearn.svm import SVC

from sklearn.metrics import accuracy_score, precision_score

In [9]:
# Split the data into features (X) and target (y)
X = start_df.drop('Survived', axis=1)
y = start_df['Survived']



# Define the preprocessing pipelines for the numeric and categorical columns
numeric_transform = Pipeline([('impute_mean', SimpleImputer(strategy='mean')),
                              ('scaling', MinMaxScaler())])

# sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
categorical_transform = Pipeline([('impute_mode', SimpleImputer(strategy='most_frequent')),
                                  ('one-hot-encode', OneHotEncoder(sparse_output=False))])


# (name, transformer, list of column names)
preprocessing_tips = ColumnTransformer([('numeric', numeric_transform, ['Pclass', 'Age','SibSp', 'Parch','Fare']), 
                                        ('categorical', categorical_transform, 
                                         ['Sex', 'Embarked'])])

#classifier
classifier = SVC(probability = True)


pipeline = Pipeline(steps=[('preprocessing', preprocessing_tips),                           
                           ('classifier', classifier)])

pipeline.fit(X,y)
y_pred = pipeline.predict(X_test)
y_pred_proba = pipeline.predict_proba(X_test) 

accuracy = accuracy_score(y_test, y_pred)
precision=precision_score(y_test,y_pred)


print(f"Accuracy: {accuracy:.4f}")
print(f"Precision: {precision:.4f}")


# Metrics per iteration, stored as [accuracy, precision]
result = {}

result['Before_Active_Learning'] = [round(accuracy, 3), round(precision, 3)]

Accuracy: 0.7800
Precision: 0.7250
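
Task 1 also asks for recall, which the cell above does not compute. The following is a minimal sketch of one way to collect all three metrics with `sklearn.metrics`; the helper name `evaluate` and the toy labels are assumptions for illustration, not part of the original notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score


def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        # zero_division=0 avoids a warning when no positives are predicted
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }


# Toy labels, invented for illustration: 3 TP, 1 FP, 1 FN, 3 TN
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 0, 0, 1, 0, 1, 1, 0]
scores = evaluate(toy_true, toy_pred)
print(scores)
```

In the notebook, the same helper could be called as `evaluate(y_test, y_pred)` after fitting the pipeline.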


In [10]:
# df1=df.drop(['Survived'], axis=1)
# df1.shape

In [11]:
def accuracy_precision(train_df):
    """Fit a fresh pipeline on train_df and return [accuracy, precision] on the test set.

    Uses the global X_test and y_test defined above.
    """
    X = train_df.drop('Survived', axis=1)
    y = train_df['Survived']

    numeric_transform = Pipeline([('impute_mean', SimpleImputer(strategy='mean')),
                                  ('scaling', MinMaxScaler())])

    # sparse_output requires scikit-learn >= 1.2; older versions use sparse=False
    categorical_transform = Pipeline([('impute_mode', SimpleImputer(strategy='most_frequent')),
                                      ('one-hot-encode', OneHotEncoder(sparse_output=False))])
    preprocessing_tips = ColumnTransformer([('numeric', numeric_transform, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare']),
                                            ('categorical', categorical_transform,
                                             ['Sex', 'Embarked'])])

    classifier = SVC(probability=True)
    pipeline = Pipeline(steps=[('preprocessing', preprocessing_tips),
                               ('classifier', classifier)])
    pipeline.fit(X, y)
    y_pred = pipeline.predict(X_test)

    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)

    # Return accuracy and precision, rounded to 3 decimals
    return [round(accuracy, 3), round(precision, 3)]

In [12]:
# a = pipeline.predict_proba(df1)[:,1]
# b = df1.iloc[np.argwhere((a >=0.3) & (a<=0.7) ) [:10, 0] ]
# df1 = df1[~df1.PassengerId.isin(b.PassengerId.tolist())]
# start_df= pd.concat([start_df, b])

In [13]:
# df1=df.drop(['Survived'], axis=1)
df1=df.copy()

In [14]:
for i in range(20):
    # Score the remaining pool. Note: `pipeline` here is the global model fitted on the
    # initial start_df; it is not refit inside this loop, so these probabilities
    # do not change between iterations.
    a = pipeline.predict_proba(df1)[:, 1]
    # Take the first 10 rows whose survival probability lies in the uncertain band [0.3, 0.7]
    b = df1.iloc[np.argwhere((a >= 0.3) & (a <= 0.7))[:10, 0]]

    # Stop once no uncertain rows remain
    if b.shape[0] == 0:
        break

    # Remove the selected rows from the pool for the next iteration
    df1 = df1[~df1.PassengerId.isin(b.PassengerId.tolist())]

    # Add the selected rows to the training set
    start_df = pd.concat([start_df, b]).copy()

    # Refit a fresh model on the enlarged training set and evaluate it once
    scores = accuracy_precision(start_df)
    print(scores)

    result[f'Itter_{i+1}'] = scores

    print('------------')

[0.79, 0.744]
------------
[0.8, 0.763]
------------
[0.8, 0.763]
------------
[0.8, 0.763]
------------
[0.81, 0.784]
------------
[0.81, 0.784]
------------
[0.81, 0.784]
------------
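
The loop above scores the pool with the global `pipeline`, which was fitted once on the initial `start_df`, so the selection never reflects the newly added labels. The sketch below shows a variant that refits after every batch and picks the rows closest to 0.5. It is an illustration under stated assumptions, not the notebook's method: the function name `run_active_learning`, the synthetic data from `make_classification`, and the split sizes are all hypothetical stand-ins for the Titanic frames.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC


def run_active_learning(X_train, y_train, X_pool, y_pool, X_test, y_test,
                        n_rounds=5, batch_size=10):
    """Uncertainty sampling with a refit after every batch (illustrative)."""
    model = SVC(probability=True, random_state=0).fit(X_train, y_train)
    history = [accuracy_score(y_test, model.predict(X_test))]
    for _ in range(n_rounds):
        if len(X_pool) == 0:
            break
        # Score the pool with the *current* model, not the initial one.
        proba = model.predict_proba(X_pool)[:, 1]
        picked = np.argsort(np.abs(proba - 0.5))[:batch_size]
        # Move the picked rows from the pool to the training set.
        X_train = np.vstack([X_train, X_pool[picked]])
        y_train = np.concatenate([y_train, y_pool[picked]])
        keep = np.ones(len(X_pool), dtype=bool)
        keep[picked] = False
        X_pool, y_pool = X_pool[keep], y_pool[keep]
        # Refit so the next round's uncertainty reflects the new labels.
        model = SVC(probability=True, random_state=0).fit(X_train, y_train)
        history.append(accuracy_score(y_test, model.predict(X_test)))
    return history, X_train, X_pool


# Synthetic stand-in data (an assumption; not the Titanic set)
X_all, y_all = make_classification(n_samples=300, n_features=5, random_state=0)
history, final_train, final_pool = run_active_learning(
    X_all[:30], y_all[:30],        # initial labeled set
    X_all[30:200], y_all[30:200],  # unlabeled pool (labels revealed on selection)
    X_all[200:], y_all[200:],      # held-out test set
)
print(history)  # one accuracy value per fitted model
```

With the notebook's data, the same idea would mean refitting the pipeline on `start_df` inside the loop before calling `predict_proba` on `df1`.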


In [16]:
# Pick the entry with the highest accuracy + precision (the first one wins on ties)
best_key = max(result, key=lambda k: sum(result[k]))
best_scores = result[best_key]

print(best_key, best_scores, '\n')
print(result)

Itter_5 [0.81, 0.784] 

{'Before_Active_Learning': [0.78, 0.725], 'Itter_1': [0.79, 0.744], 'Itter_2': [0.8, 0.763], 'Itter_3': [0.8, 0.763], 'Itter_4': [0.8, 0.763], 'Itter_5': [0.81, 0.784], 'Itter_6': [0.81, 0.784], 'Itter_7': [0.81, 0.784]}
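
The last task asks for a plot of each metric against the iteration number, which the notebook does not include. The following is a minimal sketch of how that could look with matplotlib. The helper name `plot_metrics` and the variable `recorded_result` are assumptions; the values are copied from the `result` dictionary printed above.

```python
import matplotlib.pyplot as plt


def plot_metrics(result, metric_names=("accuracy", "precision")):
    """Plot each metric stored in `result` against the iteration number."""
    iterations = list(range(len(result)))
    fig, ax = plt.subplots(figsize=(7, 4))
    for position, name in enumerate(metric_names):
        # Each dict value is a list like [accuracy, precision]
        values = [scores[position] for scores in result.values()]
        ax.plot(iterations, values, marker="o", label=name)
    ax.set_xlabel("iteration (0 = before active learning)")
    ax.set_ylabel("metric value on test_df")
    ax.set_xticks(iterations)
    ax.legend()
    return fig, ax


# Values copied from the output printed above
recorded_result = {
    'Before_Active_Learning': [0.78, 0.725],
    'Itter_1': [0.79, 0.744], 'Itter_2': [0.8, 0.763],
    'Itter_3': [0.8, 0.763], 'Itter_4': [0.8, 0.763],
    'Itter_5': [0.81, 0.784], 'Itter_6': [0.81, 0.784],
    'Itter_7': [0.81, 0.784],
}
fig, ax = plot_metrics(recorded_result)
```

In a notebook the figure renders at the end of the cell; in a script, call `plt.show()` afterwards. If recall is also recorded, add it to each list and extend `metric_names` accordingly.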
