# Part D: Train and Evaluate various models

### Approach:
    1) Train the following models on the extracted features (choices based on intuition)
        - Logistic Regression
        - SVM
        - XGBoost
            
    2) Experiment on the constrained dataset
        - Only the generated "Hello World" images, without any image augmentation
    
    3) Evaluate the models from step 2 on the augmented dataset
        - Why? To check whether the models have generalised to the characteristics of each font
    
    4) Experiment with the augmented dataset
        - Why? To check the performance of models trained on the augmented dataset
        
        
### Conclusion:
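    1) LR reached 94% weighted-average precision (94% accuracy) on the limited distribution of "Hello World" text
    2) SVM reached 90% accuracy (91% weighted-average precision) but was also able to capture the font from other phrases
    3) When trained on the full dataset (original + augmented), LR reached 83% weighted-average precision (82% accuracy)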
    
    
### Other experiments that can be performed:
    1) Hyperparameter optimisation for XGBoost and SVM
    2) Currently, I max-pool the ResNet features from the last layer to flatten them,
       but we could use the 3D features with CNNs
    3) We could train a reasonably high-capacity fully connected NN (1-3 layers) with softmax
       to capture complex patterns
    4) Add more phrases to build more training data
    5) Maybe try resizing the images with padding on the top and bottom
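
Item 2 above refers to max-pooling the last-layer ResNet features into a flat vector. A minimal sketch of that pooling step, assuming a channels-last feature map of shape (7, 7, 2048). The 7x7 spatial size and the random stand-in array are assumptions for illustration; only the 2048 channels match the feature width used later in this notebook.

```python
import numpy as np

# Stand-in for one image's last-layer feature map (height, width, channels).
# The 7x7 spatial size is an assumption; only the 2048 channels match the notebook.
rng = np.random.default_rng(0)
feature_map = rng.normal(size=(7, 7, 2048))

# Global max pooling: keep the strongest activation of each channel,
# collapsing the two spatial axes into a single 2048-d vector.
pooled = feature_map.max(axis=(0, 1))

print(feature_map.shape, "->", pooled.shape)
```

Pooling discards where in the image each activation occurred, which is why keeping the 3D features for a CNN is listed as a possible follow-up.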

In [1]:
import os
import pickle
import numpy as np
import pandas as pd

from PIL import Image

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
df_train = pd.read_csv("train_file.csv")
df_test = pd.read_csv("test_file.csv")

In [3]:
df_train['label'].value_counts()

Roboto           2016
OpenSans         1680
Arimo            1680
Ubuntu           1344
Oswald           1176
DancingScript     840
PTSerif           672
NotoSans          672
FredokaOne        168
PatuaOne          168
Name: label, dtype: int64

In [4]:
df_test['label'].value_counts()

Roboto           864
Arimo            720
OpenSans         720
Ubuntu           576
Oswald           504
DancingScript    360
NotoSans         288
PTSerif          288
PatuaOne          72
FredokaOne        72
Name: label, dtype: int64
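
The counts above are imbalanced: Roboto has 12 times as many training samples as FredokaOne or PatuaOne. The logistic regression models below are created with `class_weight='balanced'`, which scikit-learn documents as weighting each class by `n_samples / (n_classes * np.bincount(y))`. A rough sketch of what those weights would be for the full training set; the Step 1 model is fit on a filtered subset, so its actual weights would differ.

```python
# Training label counts copied from the value_counts() output above.
train_counts = {
    "Roboto": 2016, "OpenSans": 1680, "Arimo": 1680, "Ubuntu": 1344,
    "Oswald": 1176, "DancingScript": 840, "PTSerif": 672, "NotoSans": 672,
    "FredokaOne": 168, "PatuaOne": 168,
}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' heuristic: n_samples / (n_classes * count_of_class)
balanced_weights = {
    label: n_samples / (n_classes * count)
    for label, count in train_counts.items()
}

for label, weight in balanced_weights.items():
    print(f"{label:<14}{weight:.2f}")
```

Rare fonts get weights above 1 and common fonts get weights below 1, so each class contributes roughly equally to the loss.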

In [5]:
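def load_extracted_features(filepaths):
    """
    Load the extracted-features file for each of the given image filepaths.
    """

    image_vectors = []
    for filepath in filepaths:

        # NOTE: '.png' is replaced with 'pkl' (no dot), so 'img.png' maps to 'imgpkl';
        # keep this in sync with the naming used when the features were saved.
        load_filepath = 'data/extracted_features/{}'.format(filepath.split('/')[-1].replace('.png', 'pkl'))

        # Use a separate name for the file handle so the path variable is not shadowed
        with open(load_filepath, 'rb') as feature_file:
            image_vectors.append(pickle.load(feature_file))

    return image_vectors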
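
The loader above hard-codes the feature directory and builds the file name with a string replace. As a sketch only, a variant with a configurable directory could look like the following. The function name `load_features_from_dir`, the `feature_dir` parameter, and the `<name>.pkl` naming are all hypothetical and may not match how the features were actually saved.

```python
import os
import pickle
import tempfile


def load_features_from_dir(filepaths, feature_dir):
    """Hypothetical variant of the loader with a configurable feature directory."""
    vectors = []
    for filepath in filepaths:
        # Swap the image extension for '.pkl' while keeping the base name.
        stem = os.path.splitext(os.path.basename(filepath))[0]
        with open(os.path.join(feature_dir, stem + ".pkl"), "rb") as feature_file:
            vectors.append(pickle.load(feature_file))
    return vectors


# Round trip with a throwaway directory and a made-up feature vector.
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "sample.pkl"), "wb") as out_file:
        pickle.dump([0.1, 0.2, 0.3], out_file)
    loaded = load_features_from_dir(["images/sample.png"], tmp_dir)

print(loaded)
```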

# Step 1 - Experiment on constrained dataset (as in the problem statement)
- Only "Hello World" text images
- Images with no augmentation

In [6]:
# Filter the original train and test files down to unaugmented "Hello World" images
df_train_contrained = df_train[
    (df_train['text'] == 'Hello-World')
    & (df_train['agumentation_type'].isnull())
].reset_index()

df_test_contrained = df_test[
    (df_test['text'] == 'Hello-World')
    & (df_test['agumentation_type'].isnull())
].reset_index()

In [7]:
# Make training and testing objects
X_train = np.array(load_extracted_features(list(df_train_contrained['file_path_preprocessed'])))
X_test = np.array(load_extracted_features(list(df_test_contrained['file_path_preprocessed'])))

y_train = list(df_train_contrained['label'])
y_test = list(df_test_contrained['label'])

X_train.shape, X_test.shape, len(y_train), len(y_test)

((1320, 2048), (540, 2048), 1320, 540)

### Train Logistic Regression

In [8]:
%%time

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=2, class_weight='balanced')
clf_lr = clf_lr.fit(X_train, y_train)
clf_lr



CPU times: user 2.98 s, sys: 3.19 ms, total: 2.98 s
Wall time: 2.98 s


In [9]:
scores = clf_lr.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       1.00      0.99      0.99        82
DancingScript       1.00      1.00      1.00        47
   FredokaOne       1.00      1.00      1.00         6
     NotoSans       0.54      0.64      0.58        33
     OpenSans       0.85      0.77      0.81        92
       Oswald       1.00      1.00      1.00        73
      PTSerif       1.00      1.00      1.00        32
     PatuaOne       1.00      1.00      1.00        14
       Roboto       0.97      1.00      0.99       111
       Ubuntu       1.00      1.00      1.00        50

     accuracy                           0.94       540
    macro avg       0.94      0.94      0.94       540
 weighted avg       0.94      0.94      0.94       540
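
In the report above, NotoSans and OpenSans are the only classes well below 0.9 F1, which suggests (but does not prove) that they are being confused with each other. One way to check would be to count the (true, predicted) pairs among the misclassified samples. A sketch on made-up labels; in the notebook, `y_test` and `y_pred` from the cell above would take their place.

```python
from collections import Counter

# Toy labels for illustration only; not results from this notebook.
y_true_toy = ["NotoSans", "NotoSans", "OpenSans", "OpenSans", "Roboto", "OpenSans"]
y_pred_toy = ["OpenSans", "NotoSans", "NotoSans", "OpenSans", "Roboto", "NotoSans"]

# Count each (true label, predicted label) pair where the prediction was wrong.
confusions = Counter(
    (true, pred) for true, pred in zip(y_true_toy, y_pred_toy) if true != pred
)

for (true, pred), count in confusions.most_common():
    print(f"{true} predicted as {pred}: {count}")
```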



### Train SVM

In [10]:
%%time

from sklearn.svm import SVC
clf_svm = SVC(C=0.1, probability=True)
clf_svm = clf_svm.fit(X_train, y_train)
clf_svm



CPU times: user 29.9 s, sys: 313 ms, total: 30.2 s
Wall time: 29.5 s


In [11]:
scores = clf_svm.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       1.00      0.98      0.99        82
DancingScript       1.00      1.00      1.00        47
   FredokaOne       1.00      0.83      0.91         6
     NotoSans       0.43      0.45      0.44        33
     OpenSans       0.70      0.78      0.74        92
       Oswald       1.00      1.00      1.00        73
      PTSerif       0.97      0.97      0.97        32
     PatuaOne       1.00      0.86      0.92        14
       Roboto       0.99      0.92      0.95       111
       Ubuntu       1.00      1.00      1.00        50

     accuracy                           0.90       540
    macro avg       0.91      0.88      0.89       540
 weighted avg       0.91      0.90      0.90       540
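
The SVM above uses a fixed `C=0.1` with the default kernel and trails logistic regression (0.90 vs 0.94 accuracy). The approach notes list hyperparameter optimisation as a follow-up; a minimal sketch of how that could be set up with scikit-learn's `GridSearchCV`. The synthetic data and the candidate grid are assumptions for illustration, not tuned values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Small synthetic stand-in for the extracted features (not the notebook's data).
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=10, n_classes=3, random_state=2
)

# Candidate values are illustrative assumptions, not tuned recommendations.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

# 3-fold cross-validated search over every combination in the grid.
search = GridSearchCV(SVC(), param_grid, cv=3, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_, round(search.best_score_, 3))
```

On the real features, `X_train` and `y_train` would replace the toy arrays. Given the SVM training times reported in this notebook, a small grid is probably the practical starting point.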



### Train XGBoost Model

In [12]:
%%time

from xgboost import XGBClassifier
clf_xgb = XGBClassifier(n_jobs= -1, random_state=2, max_depth=4, objective='multi:softmax')
clf_xgb = clf_xgb.fit(X_train, y_train)
clf_xgb

CPU times: user 3min 59s, sys: 119 ms, total: 3min 59s
Wall time: 31.8 s


In [13]:
scores = clf_xgb.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.99      0.99      0.99        82
DancingScript       1.00      1.00      1.00        47
   FredokaOne       1.00      1.00      1.00         6
     NotoSans       0.52      0.48      0.50        33
     OpenSans       0.79      0.79      0.79        92
       Oswald       1.00      0.97      0.99        73
      PTSerif       0.91      1.00      0.96        32
     PatuaOne       1.00      0.79      0.88        14
       Roboto       0.96      0.99      0.97       111
       Ubuntu       1.00      1.00      1.00        50

     accuracy                           0.92       540
    macro avg       0.92      0.90      0.91       540
 weighted avg       0.92      0.92      0.92       540
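
The XGBoost cell above fits directly on string labels. Depending on the installed xgboost version, `XGBClassifier.fit` may instead expect integer class ids starting at 0; this is an assumption about newer releases, not something this notebook shows. If that applies, the labels can be encoded first with scikit-learn's `LabelEncoder`, sketched here on a few font names:

```python
from sklearn.preprocessing import LabelEncoder

# A few font labels from this notebook, used as a stand-in for y_train.
y_labels = ["Roboto", "Arimo", "OpenSans", "Roboto", "Ubuntu"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y_labels)  # integer ids, classes sorted alphabetically

# After predicting integer ids, map them back to font names for the report.
y_decoded = encoder.inverse_transform(y_encoded)

print(list(encoder.classes_))
print(list(y_encoded))
```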



# Step 2 - Evaluate on augmented dataset

In [14]:
# Keep only the augmented rows from both the train and test files
df_augumented = pd.concat([
    df_train[~df_train['agumentation_type'].isnull()].reset_index(),
    df_test[~df_test['agumentation_type'].isnull()].reset_index()
]).reset_index(drop=True)
    
X_test_augumented = np.array(load_extracted_features(list(df_augumented['file_path_preprocessed'])))
y_test_augumented = list(df_augumented['label'])

X_test_augumented.shape, len(y_test_augumented)

((11160, 2048), 11160)

In [15]:
scores = clf_lr.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.48      0.19      0.27      1800
DancingScript       0.62      0.38      0.48       900
   FredokaOne       0.50      0.21      0.29       180
     NotoSans       0.08      0.11      0.10       720
     OpenSans       0.20      0.30      0.24      1800
       Oswald       0.56      0.26      0.36      1260
      PTSerif       0.61      0.43      0.51       720
     PatuaOne       0.71      0.09      0.17       180
       Roboto       0.42      0.40      0.41      2160
       Ubuntu       0.24      0.49      0.32      1440

     accuracy                           0.32     11160
    macro avg       0.44      0.29      0.31     11160
 weighted avg       0.40      0.32      0.33     11160



In [16]:
scores = clf_svm.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.96      0.03      0.06      1800
DancingScript       1.00      0.02      0.04       900
   FredokaOne       1.00      0.01      0.02       180
     NotoSans       1.00      0.00      0.01       720
     OpenSans       0.16      0.99      0.28      1800
       Oswald       1.00      0.01      0.03      1260
      PTSerif       1.00      0.06      0.12       720
     PatuaOne       0.00      0.00      0.00       180
       Roboto       0.86      0.03      0.05      2160
       Ubuntu       0.77      0.07      0.12      1440

     accuracy                           0.19     11160
    macro avg       0.78      0.12      0.07     11160
 weighted avg       0.79      0.19      0.10     11160
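
The SVM report combines very high precision with near-zero recall for most classes, while OpenSans shows the reverse (0.16 precision, 0.99 recall). That pattern is what one would expect if the model assigns almost every sample to OpenSans. A rough check using the rounded figures from the report; the results are approximate because the report only gives two decimals.

```python
# Rounded values read from the SVM report above; results are approximate.
total_samples = 11160

# OpenSans: nearly all true OpenSans samples are recovered...
opensans_support, opensans_precision, opensans_recall = 1800, 0.16, 0.99
opensans_true_positives = opensans_recall * opensans_support
# ...but precision = TP / predicted, so predicted = TP / precision.
opensans_predicted = opensans_true_positives / opensans_precision

# Arimo: high precision but only a handful of predictions.
arimo_support, arimo_precision, arimo_recall = 1800, 0.96, 0.03
arimo_true_positives = arimo_recall * arimo_support
arimo_predicted = arimo_true_positives / arimo_precision

print(f"OpenSans predicted ~{opensans_predicted:.0f} of {total_samples} samples")
print(f"Arimo predicted ~{arimo_predicted:.0f} times, ~{arimo_true_positives:.0f} correct")
```

So the high per-class precision here does not indicate a good model: the few non-OpenSans predictions are mostly right, but almost nothing is predicted as those classes.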





In [17]:
scores = clf_xgb.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.36      0.19      0.25      1800
DancingScript       0.61      0.35      0.44       900
   FredokaOne       0.54      0.08      0.14       180
     NotoSans       0.23      0.01      0.02       720
     OpenSans       0.21      0.22      0.22      1800
       Oswald       0.22      0.70      0.34      1260
      PTSerif       0.57      0.42      0.48       720
     PatuaOne       0.06      0.06      0.06       180
       Roboto       0.32      0.33      0.32      2160
       Ubuntu       0.37      0.21      0.27      1440

     accuracy                           0.29     11160
    macro avg       0.35      0.26      0.25     11160
 weighted avg       0.34      0.29      0.28     11160



# Step 3 - Train model on whole dataset (original + augmented)

In [18]:
# Make training and testing objects
X_train = np.array(load_extracted_features(list(df_train['file_path_preprocessed'])))
X_test = np.array(load_extracted_features(list(df_test['file_path_preprocessed'])))

y_train = list(df_train['label'])
y_test = list(df_test['label'])

X_train.shape, X_test.shape, len(y_train), len(y_test)

((10416, 2048), (4464, 2048), 10416, 4464)

### Train Logistic Regression

In [19]:
%%time

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=2, class_weight='balanced')
clf_lr = clf_lr.fit(X_train, y_train)
clf_lr



CPU times: user 2min 5s, sys: 87 ms, total: 2min 5s
Wall time: 2min 5s


In [20]:
scores = clf_lr.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.87      0.88      0.87       720
DancingScript       0.98      0.99      0.99       360
   FredokaOne       0.94      0.69      0.80        72
     NotoSans       0.43      0.47      0.45       288
     OpenSans       0.65      0.67      0.66       720
       Oswald       0.99      0.99      0.99       504
      PTSerif       0.97      0.94      0.96       288
     PatuaOne       0.92      0.93      0.92        72
       Roboto       0.80      0.81      0.81       864
       Ubuntu       0.88      0.84      0.86       576

     accuracy                           0.82      4464
    macro avg       0.84      0.82      0.83      4464
 weighted avg       0.83      0.82      0.82      4464



### Train SVM

In [21]:
%%time

from sklearn.svm import SVC
clf_svm = SVC(probability=True)
clf_svm = clf_svm.fit(X_train, y_train)
clf_svm



CPU times: user 38min 21s, sys: 101 ms, total: 38min 21s
Wall time: 38min 21s


In [22]:
scores = clf_svm.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.81      0.84      0.83       720
DancingScript       0.99      1.00      0.99       360
   FredokaOne       0.73      0.64      0.68        72
     NotoSans       0.52      0.30      0.38       288
     OpenSans       0.61      0.74      0.67       720
       Oswald       0.97      0.97      0.97       504
      PTSerif       0.95      0.90      0.92       288
     PatuaOne       0.76      0.78      0.77        72
       Roboto       0.80      0.80      0.80       864
       Ubuntu       0.85      0.81      0.83       576

     accuracy                           0.80      4464
    macro avg       0.80      0.78      0.78      4464
 weighted avg       0.80      0.80      0.80      4464



### Train XGBoost Model

In [23]:
%%time

from xgboost import XGBClassifier
clf_xgb = XGBClassifier(n_jobs= -1, random_state=2, max_depth=4, objective='multi:softmax')
clf_xgb = clf_xgb.fit(X_train, y_train)
clf_xgb

CPU times: user 44min 55s, sys: 2.06 s, total: 44min 57s
Wall time: 5min 51s


In [24]:
scores = clf_xgb.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.69      0.84      0.76       720
DancingScript       0.98      1.00      0.99       360
   FredokaOne       1.00      0.51      0.68        72
     NotoSans       0.62      0.17      0.26       288
     OpenSans       0.63      0.67      0.65       720
       Oswald       0.96      0.94      0.95       504
      PTSerif       0.93      0.88      0.90       288
     PatuaOne       1.00      0.54      0.70        72
       Roboto       0.70      0.79      0.74       864
       Ubuntu       0.77      0.74      0.76       576

     accuracy                           0.76      4464
    macro avg       0.83      0.71      0.74      4464
 weighted avg       0.77      0.76      0.75      4464
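
To compare the three experiments side by side, the accuracy rows of the reports in this notebook can be collected in one place. The numbers below are copied from those reports; the setting names are just descriptive labels.

```python
# Accuracy values copied from the classification reports in this notebook.
accuracy = {
    "Step 1: train + test on constrained data": {"LR": 0.94, "SVM": 0.90, "XGBoost": 0.92},
    "Step 2: Step 1 models on augmented data": {"LR": 0.32, "SVM": 0.19, "XGBoost": 0.29},
    "Step 3: train + test on full data": {"LR": 0.82, "SVM": 0.80, "XGBoost": 0.76},
}

for setting, scores in accuracy.items():
    best_model = max(scores, key=scores.get)
    row = ", ".join(f"{model}={score:.2f}" for model, score in scores.items())
    print(f"{setting}: {row} (best: {best_model})")
```

Logistic regression has the highest accuracy in all three settings, and every model drops sharply in Step 2, where models trained only on unaugmented "Hello World" images are scored on augmented data.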

