# Model Validation

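A single random train/test split may not give a reliable estimate of model performance, because the score depends on which samples happen to land in the test set. To validate the approach further, we use 5-fold cross-validation: the data is split into 5 mutually exclusive subsets, the model is trained on 4 of them, and predictions are made on the 5th. This is repeated 5 times so that each subset serves as the test set exactly once. Comparing the per-fold reports shows how stable the model's performance is across different splits.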
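To make the splitting scheme concrete, here is a minimal sketch on toy data. The `samples` array is a hypothetical stand-in for the real documents; only the `KFold` settings mirror the ones used in this notebook.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten toy samples standing in for the real documents (hypothetical data).
samples = np.arange(10)

kfold = KFold(n_splits=5, shuffle=True, random_state=1)
test_folds = []
for fold, (train_idx, test_idx) in enumerate(kfold.split(samples), start=1):
    # Within a fold, train and test indices never overlap.
    assert set(train_idx).isdisjoint(test_idx)
    test_folds.append(test_idx)
    print(f"fold {fold}: train={len(train_idx)} test={len(test_idx)}")

# Across the five folds, every sample is held out exactly once.
all_test = np.sort(np.concatenate(test_folds))
assert np.array_equal(all_test, samples)
```

Each fold holds out a different fifth of the data, so the five test sets together cover the whole dataset without overlap.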

In [1]:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import KFold

In [2]:
df = pd.read_csv('data/dataset.csv')
df = df.drop_duplicates(subset='Text')
df = df.reset_index(drop=True)

In [3]:
documents = df['Text'].values
labels = df['language'].values

In [4]:
kfold = KFold(5, shuffle=True, random_state=1)
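A splitter like this can also drive a compact summary of fold-to-fold stability. The sketch below is illustrative only: `toy_docs`, `toy_labels`, the two-language corpus, the use of accuracy as the single score, and `n_estimators=50` are all assumptions chosen to keep it fast, not part of the original experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

# Hypothetical toy corpus: 10 short texts per language.
toy_docs = np.array(
    [f"the cat sat on mat number {n}" for n in range(10)]
    + [f"le chat est sur le tapis {n}" for n in range(10)]
)
toy_labels = np.array(["English"] * 10 + ["French"] * 10)

toy_kfold = KFold(n_splits=5, shuffle=True, random_state=1)
scores = []
for train_idx, test_idx in toy_kfold.split(toy_docs):
    # Fit the character n-gram vocabulary on the training fold only.
    vec = CountVectorizer(ngram_range=(1, 4), analyzer="char")
    x_train = vec.fit_transform(toy_docs[train_idx])
    x_test = vec.transform(toy_docs[test_idx])

    # A small forest keeps the sketch quick; the notebook uses 1000 trees.
    model = RandomForestClassifier(n_estimators=50, random_state=0)
    model.fit(x_train, toy_labels[train_idx])
    scores.append(accuracy_score(toy_labels[test_idx], model.predict(x_test)))

# Mean and spread across folds summarize how stable the model is.
print(f"accuracy: mean={np.mean(scores):.3f}, std={np.std(scores):.3f}")
```

A small standard deviation across folds suggests the score does not depend much on which samples were held out.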

In [5]:
for i, (train, test) in enumerate(kfold.split(documents, labels), start=1):
    train_doc, test_doc = documents[train], documents[test]
    train_labels, test_labels = labels[train], labels[test]

    # Fit the character n-gram vocabulary on the training fold only,
    # so nothing from the held-out fold leaks into the features.
    vectorizer = CountVectorizer(ngram_range=(1, 4), analyzer='char', max_features=25000)
    train_vectors = vectorizer.fit_transform(train_doc)

    # The classifier accepts the sparse matrix directly, so no dense copy is needed.
    clf = RandomForestClassifier(n_estimators=1000)
    clf.fit(train_vectors, train_labels)

    # Vectorize the held-out fold with the vocabulary learned above.
    test_vectors = vectorizer.transform(test_doc)
    y_pred = clf.predict(test_vectors)

    print(f"The model performance on fold {i}:\n")
    print(classification_report(test_labels, y_pred))

The model performance on fold 1:

              precision    recall  f1-score   support

      Arabic       0.99      1.00      0.99       188
     Chinese       0.99      0.99      0.99       201
       Dutch       0.99      1.00      1.00       193
     English       0.83      1.00      0.90       210
    Estonian       0.97      0.97      0.97       191
      French       0.97      0.99      0.98       225
       Hindi       1.00      0.98      0.99       174
  Indonesian       1.00      0.99      1.00       199
    Japanese       1.00      0.98      0.99       198
      Korean       1.00      0.99      0.99       197
       Latin       0.97      0.92      0.95       210
     Persian       1.00      1.00      1.00       193
   Portugese       0.99      0.97      0.98       191
      Pushto       1.00      0.96      0.98       227
    Romanian       1.00      1.00      1.00       157
     Russian       0.98      0.99      0.99       193
     Spanish       0.99      0.98      0.98    