# Part D: Train and Evaluate various models

### Approach:
    1) Train the following models on the extracted features (choices based on intuition)
        - Logistic Regression
        - SVM
        - XGBoost
            
    2) Experiment on the constrained dataset
        - Only the generated "Hello world" images, without any image augmentation
    
    3) Evaluate the models from step 2 on the augmented dataset
        - Why? To check whether the models have generalised to the characteristics of each font
    
    4) Experiment with the augmented dataset
        - Why? To check the performance of models trained on the augmented dataset
        
        
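### Conclusion:
    1) LR reached 96% weighted precision on the constrained "Hello, World!" test set
    2) SVM reached 90% weighted precision there and was also able to recognise fonts from
       other phrases (on the augmented evaluation set it kept 0.71 weighted precision,
       though only 0.24 accuracy)
    3) Trained on the full dataset (original + augmented), LR reached 81% weighted precision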
    
    
### Other experiments that can be performed:
    1) Hyperparameter optimisation for XGBoost and SVM
    2) Currently, the last-layer ResNet features are max-pooled into a flat vector;
       the 3D feature maps could instead be fed to a CNN
    3) Train a fully connected NN of reasonable capacity (1-3 layers) with a softmax
       output to capture more complex patterns
    4) Generate images for more phrases to enlarge the training data
    5) Try resizing images with padding on the top and bottom
    6) Experiment with other feature extractors trained on ImageNet data
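
As a rough illustration of item 1, the sketch below runs a small cross-validated grid search for the SVM. It is only a sketch under stated assumptions: the data is synthetic (a stand-in for the 2048-dimensional ResNet features), and the grid values and the `f1_macro` scoring choice are hypothetical, not settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the (n_samples, 2048) feature matrix used in this notebook.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=10,
    n_classes=3, random_state=2,
)

# Hypothetical search grid; the values are illustrative, not tuned.
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

search = GridSearchCV(
    SVC(class_weight="balanced"),
    param_grid,
    scoring="f1_macro",  # macro F1 gives rare classes the same weight as common ones
    cv=3,
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

On the real features the same pattern would apply to `X_train` and `y_train`, although the SVM training times reported below suggest a full grid search could be slow.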

In [1]:
import os
import pickle
import numpy as np
import pandas as pd

from PIL import Image

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

In [2]:
df_train = pd.read_csv("train_file.csv")
df_test = pd.read_csv("test_file.csv")

os.makedirs('models',  exist_ok=True)

In [3]:
df_train['label'].value_counts()

Roboto           3024
Arimo            2520
OpenSans         2520
Ubuntu           2016
Oswald           1764
DancingScript    1260
PTSerif          1008
NotoSans         1008
PatuaOne          252
FredokaOne        252
Name: label, dtype: int64

In [4]:
df_test['label'].value_counts()

Roboto           1296
OpenSans         1080
Arimo            1080
Ubuntu            864
Oswald            756
DancingScript     540
NotoSans          432
PTSerif           432
PatuaOne          108
FredokaOne        108
Name: label, dtype: int64

In [5]:
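def load_extracted_features(filepaths):
    """
    Load the pickled feature vector for each of the given image file paths.
    """
    image_vectors = []
    for filepath in filepaths:
        # Feature files live in data/extracted_features and share the image's base name.
        # NOTE: '.png' is replaced by 'pkl' (no dot), so 'a.png' maps to 'apkl'. This is kept
        # as-is on the assumption that it matches the names written by the extraction step.
        feature_filename = os.path.basename(filepath).replace('.png', 'pkl')
        load_filepath = os.path.join('data/extracted_features', feature_filename)

        # Use a separate name for the file handle instead of shadowing load_filepath.
        with open(load_filepath, 'rb') as feature_file:
            image_vectors.append(pickle.load(feature_file))

    return image_vectors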

# Step 1 - Experiment on constrained dataset (as in the problem statement)
- Only the "Hello, World!" and "HELLO, WORLD!" text images
- Images without any augmentation

In [6]:
# Filter original train-test file
df_train_contrained = df_train[
    (df_train['text'].isin(['Hello,-World!', 'HELLO,-WORLD!']))
    & (df_train['agumentation_type'].isnull())
].reset_index()

df_test_contrained = df_test[
    (df_test['text'].isin(['Hello,-World!', 'HELLO,-WORLD!']))
    & (df_test['agumentation_type'].isnull())
].reset_index()

In [7]:
# Make training and testing objects
X_train = np.array(load_extracted_features(list(df_train_contrained['file_path_preprocessed'])))
X_test = np.array(load_extracted_features(list(df_test_contrained['file_path_preprocessed'])))

y_train = list(df_train_contrained['label'])
y_test = list(df_test_contrained['label'])

X_train.shape, X_test.shape, len(y_train), len(y_test)

((2627, 2048), (1093, 2048), 2627, 1093)

### Train Logistic Regression

In [8]:
%%time

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=2, class_weight='balanced')
clf_lr = clf_lr.fit(X_train, y_train)
clf_lr



CPU times: user 7.1 s, sys: 15.2 ms, total: 7.12 s
Wall time: 7.12 s


In [9]:
scores = clf_lr.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.99      0.99      0.99       183
DancingScript       1.00      1.00      1.00        82
   FredokaOne       1.00      1.00      1.00        19
     NotoSans       0.70      0.80      0.75        75
     OpenSans       0.90      0.85      0.88       177
       Oswald       1.00      1.00      1.00       121
      PTSerif       1.00      1.00      1.00        65
     PatuaOne       1.00      1.00      1.00        17
       Roboto       0.99      0.98      0.99       221
       Ubuntu       1.00      1.00      1.00       133

     accuracy                           0.96      1093
    macro avg       0.96      0.96      0.96      1093
 weighted avg       0.96      0.96      0.96      1093
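
NotoSans and OpenSans have the lowest scores in the report above. A confusion matrix is one way to check which classes the errors go to. The sketch below uses toy labels as a stand-in for `y_test` and `y_pred`, so the counts it prints are illustrative and say nothing about the real model.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_test / y_pred from the cell above.
y_true_demo = ["NotoSans", "NotoSans", "OpenSans", "OpenSans", "Roboto", "Roboto"]
y_pred_demo = ["NotoSans", "OpenSans", "NotoSans", "OpenSans", "Roboto", "Roboto"]

labels = sorted(set(y_true_demo))

# Rows are the true fonts, columns are the predicted fonts.
cm = pd.DataFrame(
    confusion_matrix(y_true_demo, y_pred_demo, labels=labels),
    index=labels,
    columns=labels,
)
print(cm)
```

With the real predictions, large off-diagonal counts between two fonts would indicate that the model confuses that pair.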



In [10]:
with open('models/lr_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_lr, save_filepath, protocol=4)

### Train SVM

In [11]:
%%time

from sklearn.svm import SVC
clf_svm = SVC(C=0.1, probability=True)
clf_svm = clf_svm.fit(X_train, y_train)
clf_svm



CPU times: user 2min 11s, sys: 17 ms, total: 2min 11s
Wall time: 2min 11s


In [12]:
scores = clf_svm.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.92      0.90      0.91       183
DancingScript       1.00      1.00      1.00        82
   FredokaOne       1.00      0.89      0.94        19
     NotoSans       0.73      0.61      0.67        75
     OpenSans       0.72      0.86      0.79       177
       Oswald       1.00      1.00      1.00       121
      PTSerif       1.00      0.97      0.98        65
     PatuaOne       1.00      0.88      0.94        17
       Roboto       0.90      0.86      0.88       221
       Ubuntu       0.95      0.93      0.94       133

     accuracy                           0.89      1093
    macro avg       0.92      0.89      0.90      1093
 weighted avg       0.90      0.89      0.89      1093
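
One caveat on the cell above: the predictions are taken as the argmax of `predict_proba`. scikit-learn documents that, for `SVC(probability=True)`, the calibrated probabilities can be inconsistent with `predict`. The sketch below, on synthetic data, counts how often the two disagree; the count depends on the data, so no particular value is claimed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for the extracted features.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, n_informative=8,
    n_classes=3, random_state=2,
)

clf = SVC(C=0.1, probability=True, random_state=2).fit(X_demo, y_demo)

# Labels from the decision function directly.
pred_direct = clf.predict(X_demo)
# Labels from the argmax of the calibrated probabilities, as done in this notebook.
pred_argmax = clf.classes_[np.argmax(clf.predict_proba(X_demo), axis=1)]

n_disagree = int((pred_direct != pred_argmax).sum())
print(f"{n_disagree} of {len(X_demo)} predictions differ")
```

If the count is non-zero on the real data, reporting metrics from `clf_svm.predict(X_test)` as well would make the comparison with the other models easier to interpret.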



In [13]:
with open('models/svm_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_svm, save_filepath, protocol=4)

### Train XGBoost Model

In [14]:
%%time

from xgboost import XGBClassifier
clf_xgb = XGBClassifier(n_jobs= -1, random_state=2, max_depth=4, objective='multi:softmax')
clf_xgb = clf_xgb.fit(X_train, y_train)
clf_xgb

CPU times: user 10min 30s, sys: 331 ms, total: 10min 31s
Wall time: 1min 24s


In [15]:
scores = clf_xgb.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.99      0.98      0.99       183
DancingScript       1.00      1.00      1.00        82
   FredokaOne       1.00      0.79      0.88        19
     NotoSans       0.83      0.60      0.70        75
     OpenSans       0.83      0.94      0.88       177
       Oswald       1.00      1.00      1.00       121
      PTSerif       0.97      1.00      0.98        65
     PatuaOne       1.00      0.88      0.94        17
       Roboto       0.97      0.99      0.98       221
       Ubuntu       0.97      0.98      0.98       133

     accuracy                           0.95      1093
    macro avg       0.96      0.92      0.93      1093
 weighted avg       0.95      0.95      0.95      1093



In [16]:
with open('models/xgb_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_xgb, save_filepath, protocol=4)

# Step 2 - Evaluate on augmented dataset

In [17]:
# Collect the augmented rows from both the train and test files
df_augumented = pd.concat([
    df_train[~df_train['agumentation_type'].isnull()].reset_index(),
    df_test[~df_test['agumentation_type'].isnull()].reset_index()
]).reset_index(drop=True)
    
X_test_augumented = np.array(load_extracted_features(list(df_augumented['file_path_preprocessed'])))
y_test_augumented = list(df_augumented['label'])

X_test_augumented.shape, len(y_test_augumented)

((16740, 2048), 16740)

In [18]:
scores = clf_lr.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.35      0.38      0.36      2700
DancingScript       0.86      0.44      0.58      1350
   FredokaOne       0.10      0.46      0.17       270
     NotoSans       0.13      0.21      0.16      1080
     OpenSans       0.40      0.08      0.13      2700
       Oswald       0.86      0.46      0.60      1890
      PTSerif       0.48      0.41      0.44      1080
     PatuaOne       0.31      0.30      0.30       270
       Roboto       0.46      0.24      0.31      3240
       Ubuntu       0.21      0.55      0.30      2160

     accuracy                           0.33     16740
    macro avg       0.42      0.35      0.34     16740
 weighted avg       0.45      0.33      0.34     16740



In [19]:
scores = clf_svm.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.80      0.07      0.13      2700
DancingScript       1.00      0.11      0.20      1350
   FredokaOne       1.00      0.06      0.11       270
     NotoSans       0.64      0.03      0.06      1080
     OpenSans       0.18      0.97      0.30      2700
       Oswald       1.00      0.08      0.15      1890
      PTSerif       0.83      0.17      0.28      1080
     PatuaOne       1.00      0.04      0.07       270
       Roboto       0.73      0.15      0.25      3240
       Ubuntu       0.69      0.10      0.18      2160

     accuracy                           0.24     16740
    macro avg       0.79      0.18      0.17     16740
 weighted avg       0.71      0.24      0.20     16740



In [20]:
scores = clf_xgb.predict_proba(X_test_augumented)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test_augumented, y_pred))

               precision    recall  f1-score   support

        Arimo       0.32      0.37      0.34      2700
DancingScript       0.84      0.30      0.44      1350
   FredokaOne       0.34      0.19      0.25       270
     NotoSans       0.23      0.03      0.06      1080
     OpenSans       0.35      0.13      0.19      2700
       Oswald       0.60      0.25      0.35      1890
      PTSerif       0.28      0.41      0.33      1080
     PatuaOne       0.30      0.10      0.15       270
       Roboto       0.31      0.45      0.37      3240
       Ubuntu       0.21      0.44      0.28      2160

     accuracy                           0.31     16740
    macro avg       0.38      0.27      0.28     16740
 weighted avg       0.37      0.31      0.30     16740
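
All three models drop sharply on the augmented images. To see which augmentations hurt most, accuracy can be broken down by the `agumentation_type` column. The sketch below uses a toy frame; the augmentation names `blur` and `rotate` and the `pred` column are hypothetical, since the notebook does not list the actual augmentation types.

```python
import pandas as pd

# Toy frame standing in for df_augumented with model predictions attached.
demo = pd.DataFrame({
    "agumentation_type": ["blur", "blur", "rotate", "rotate", "rotate"],
    "label": ["Roboto", "Arimo", "Roboto", "Arimo", "Ubuntu"],
    "pred": ["Roboto", "Ubuntu", "Roboto", "Arimo", "Ubuntu"],
})

# Mark each row as correct or not, then average per augmentation type.
demo["correct"] = demo["label"] == demo["pred"]
accuracy_by_type = demo.groupby("agumentation_type")["correct"].mean()
print(accuracy_by_type)
```

On the real data, `pred` would be the `y_pred` list from one of the cells above, assigned as a new column of `df_augumented`.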



# Step 3 - Train model on whole dataset (original + augmented)

In [21]:
# Make training and testing objects
X_train = np.array(load_extracted_features(list(df_train['file_path_preprocessed'])))
X_test = np.array(load_extracted_features(list(df_test['file_path_preprocessed'])))

y_train = list(df_train['label'])
y_test = list(df_test['label'])

X_train.shape, X_test.shape, len(y_train), len(y_test)

((15624, 2048), (6696, 2048), 15624, 6696)

### Train Logistic Regression

In [22]:
%%time

from sklearn.linear_model import LogisticRegression
clf_lr = LogisticRegression(random_state=2, class_weight='balanced')
clf_lr = clf_lr.fit(X_train, y_train)
clf_lr



CPU times: user 4min 47s, sys: 69.6 ms, total: 4min 47s
Wall time: 4min 47s


In [23]:
scores = clf_lr.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_lr.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.83      0.85      0.84      1080
DancingScript       0.99      0.99      0.99       540
   FredokaOne       0.92      0.81      0.86       108
     NotoSans       0.41      0.46      0.43       432
     OpenSans       0.66      0.66      0.66      1080
       Oswald       0.99      0.98      0.99       756
      PTSerif       0.97      0.95      0.96       432
     PatuaOne       0.96      0.86      0.91       108
       Roboto       0.80      0.79      0.80      1296
       Ubuntu       0.84      0.81      0.82       864

     accuracy                           0.81      6696
    macro avg       0.84      0.82      0.82      6696
 weighted avg       0.81      0.81      0.81      6696
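
Logistic regression took close to five minutes to fit on the full feature matrix. One option that may be worth trying is to standardise the features and raise `max_iter` inside a pipeline. This is an untested suggestion for this dataset, shown on synthetic data; the `max_iter=1000` value is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted features and font labels.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=30, n_informative=10,
    n_classes=4, random_state=2,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=2,
)

# The scaler is fitted on the training split only, then reused at prediction time.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(class_weight="balanced", max_iter=1000, random_state=2),
)
pipe.fit(X_tr, y_tr)

test_accuracy = pipe.score(X_te, y_te)
n_classes = len(pipe[-1].classes_)
print(round(test_accuracy, 3), n_classes)
```

Because the scaler is part of the pipeline, pickling `pipe` would save the preprocessing together with the classifier.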



In [28]:
with open('models/lr_augumented_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_lr, save_filepath, protocol=4)

### Train SVM

In [24]:
%%time

from sklearn.svm import SVC
clf_svm = SVC(probability=True)
clf_svm = clf_svm.fit(X_train, y_train)
clf_svm



CPU times: user 1h 28min 22s, sys: 277 ms, total: 1h 28min 23s
Wall time: 1h 28min 23s


In [25]:
scores = clf_svm.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_svm.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.83      0.85      0.84      1080
DancingScript       1.00      1.00      1.00       540
   FredokaOne       0.84      0.73      0.78       108
     NotoSans       0.58      0.38      0.46       432
     OpenSans       0.67      0.73      0.70      1080
       Oswald       0.98      0.97      0.97       756
      PTSerif       0.97      0.93      0.95       432
     PatuaOne       0.76      0.83      0.80       108
       Roboto       0.75      0.82      0.79      1296
       Ubuntu       0.87      0.78      0.82       864

     accuracy                           0.81      6696
    macro avg       0.82      0.80      0.81      6696
 weighted avg       0.81      0.81      0.81      6696



In [29]:
with open('models/svm_augumented_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_svm, save_filepath, protocol=4)

### Train XGBoost Model

In [26]:
%%time

from xgboost import XGBClassifier
clf_xgb = XGBClassifier(n_jobs= -1, random_state=2, max_depth=4, objective='multi:softmax')
clf_xgb = clf_xgb.fit(X_train, y_train)
clf_xgb

CPU times: user 1h 6min 17s, sys: 368 ms, total: 1h 6min 17s
Wall time: 8min 18s


In [27]:
scores = clf_xgb.predict_proba(X_test)
y_pred = list(pd.DataFrame(scores, columns=clf_xgb.classes_).idxmax(axis=1))
print (classification_report(y_test, y_pred))

               precision    recall  f1-score   support

        Arimo       0.67      0.83      0.75      1080
DancingScript       0.98      0.99      0.98       540
   FredokaOne       1.00      0.63      0.77       108
     NotoSans       0.73      0.18      0.29       432
     OpenSans       0.64      0.67      0.65      1080
       Oswald       0.96      0.96      0.96       756
      PTSerif       0.93      0.89      0.91       432
     PatuaOne       0.97      0.57      0.72       108
       Roboto       0.68      0.78      0.73      1296
       Ubuntu       0.76      0.71      0.73       864

     accuracy                           0.76      6696
    macro avg       0.83      0.72      0.75      6696
 weighted avg       0.77      0.76      0.75      6696



In [30]:
with open('models/xgb_augumented_weights.pkl', 'wb') as save_filepath:
    pickle.dump(clf_xgb, save_filepath, protocol=4)
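
The saved models can be loaded back with `pickle.load` for later inference. The sketch below shows the round trip on a small stand-in classifier written to a temporary directory; the file name `demo_weights.pkl` and the random features are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression

# Small stand-in model trained on random features with three font labels.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(60, 8))
y_demo = np.array(["Roboto", "Arimo", "Ubuntu"] * 20)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_weights.pkl")

    # Save with the same protocol as the cells above.
    with open(path, "wb") as f:
        pickle.dump(clf, f, protocol=4)

    # Load the model back from disk.
    with open(path, "rb") as f:
        clf_loaded = pickle.load(f)

# The reloaded model should reproduce the original predictions.
same_predictions = bool((clf_loaded.predict(X_demo) == clf.predict(X_demo)).all())
print(same_predictions)
```

Pickled scikit-learn and XGBoost models are generally only safe to load with the same library versions that saved them, and only from trusted sources.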