### A Production-Ready Multi-Class Text Classifier
https://towardsdatascience.com/a-production-ready-multi-class-text-classifier-96490408757

In [12]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline

In [126]:
#load data
df = pd.read_csv("data/fintech/train.csv",encoding='latin1')

In [127]:
df.head()

Unnamed: 0,name,mi_key,industry,ciq_id,description,fintech
0,"@Global, Inc.",5068011,Insurance Technology,,"@Global, Inc. develops insurance technology, c...",1
1,1212 Development Corp.,7286650,"Hotels, Resorts and Cruise Lines",,1212 Development Corp. owns and operates hotel...,0
2,12ve Degrees Corp.,7574292,Footwear Producers,,12ve Degrees Corporation manufactures footwear...,0
3,1-800-HealthPlan.com,5105655,Insurance Technology,,1-800-HealthPlan.com operates an online insura...,1
4,"1800Pay, Inc.",7587088,Money Transfer and Remittance,,"1800Pay, Inc. provides money transfer and paym...",1


In [None]:
from collections import Counter
Counter(df["industry"])

In [130]:
#pre-processing
import re 
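def clean_str(string):
    """
    Clean a raw text string: drop newlines, carriage returns and quote
    characters, replace every digit with the token "digit", then strip
    surrounding whitespace and lowercase the result.
    """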
    string = re.sub(r"\n", "", string)    
    string = re.sub(r"\r", "", string) 
    string = re.sub(r"[0-9]", "digit", string)
    string = re.sub(r"\'", "", string)    
    string = re.sub(r"\"", "", string)    
    return string.strip().lower()
# clean every company description (selected by column name rather than position)
X = [clean_str(text) for text in df["description"]]
y = np.array(df["industry"])

In [131]:
#train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=5)

Another challenge here is multi-class classification. With a support vector machine, we can handle it using the OneVsRest strategy. The OneVsRest (or one-vs.-all, OvA or OvR, one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample.
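To make the role of the confidence scores concrete, here is a minimal sketch on made-up 2-D data (the arrays `X_toy` and `y_toy` are illustrative assumptions, not part of the fintech data set). It fits one `LinearSVC` per class through `OneVsRestClassifier` and picks, for each sample, the class whose binary classifier returns the highest score.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy 2-D data with three well-separated classes (made up for illustration).
X_toy = np.array([[0.0, 0.0], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9],
                  [0.0, 5.0], [0.1, 5.2]])
y_toy = np.array(["a", "a", "b", "b", "c", "c"])

ovr = OneVsRestClassifier(LinearSVC()).fit(X_toy, y_toy)

# One binary classifier is trained per class.
n_binary = len(ovr.estimators_)

# decision_function returns one real-valued score per class for each sample;
# taking the class with the highest score resolves any ambiguity.
scores = ovr.decision_function(X_toy)
manual_pred = ovr.classes_[scores.argmax(axis=1)]

print(n_binary, scores.shape)
print(manual_pred)
```

Picking the highest score by hand should agree with `ovr.predict(X_toy)`, which applies the same rule internally.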

In [132]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

In [133]:
#pipeline of feature engineering and model
model = Pipeline([('vectorizer', CountVectorizer()),
 ('tfidf', TfidfTransformer()),
 ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])
#the class_weight="balanced" option weights classes inversely to their frequency, reducing the model's bias towards the majority classes
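As a rough illustration of what "balanced" means, the sketch below computes the weights for a made-up label array (`y_toy` is an assumption for illustration) using scikit-learn's `compute_class_weight` and compares them with the documented formula `n_samples / (n_classes * count_per_class)`. Note that inside `OneVsRestClassifier` each `LinearSVC` sees a binary "this class vs. the rest" problem, so the weights are derived from those binary labels rather than from the full set of industries.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels: 6 samples of one class, 2 of another.
y_toy = np.array(["majority"] * 6 + ["minority"] * 2)
classes = np.unique(y_toy)

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Same weights computed by hand: n_samples / (n_classes * count_per_class).
counts = np.array([(y_toy == c).sum() for c in classes])
manual = len(y_toy) / (len(classes) * counts)

print(dict(zip(classes, weights)))
```

The rarer class receives the larger weight, so its misclassifications cost more during training.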

For every machine learning algorithm, parameter tuning plays an important role. With well-chosen parameter values, a model's performance can improve noticeably. We can find suitable parameters in our case using grid search, as shown below.

In [134]:
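#parameter selection
#GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed in scikit-learn 0.20)
from sklearn.model_selection import GridSearchCV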
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2),(2,2)],
               'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(model, parameters, n_jobs=-1)
gs_clf_svm = gs_clf_svm.fit(X, y)
print(gs_clf_svm.best_score_)
print(gs_clf_svm.best_params_)



0.6358173076923077
{'tfidf__use_idf': True, 'vectorizer__ngram_range': (1, 1)}
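`best_score_` is the mean cross-validated score of the best parameter combination, and it can help to look at the other candidates as well. The sketch below runs a small grid search on a made-up corpus (the `docs`, `labels`, `toy_pipe` and `toy_grid` names are illustrative assumptions, and `cv=2` is chosen only because the toy data is tiny) and tabulates `cv_results_`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up miniature corpus, four documents per class.
docs = [
    "online payment processing platform", "mobile payment app for merchants",
    "card payment gateway services", "payment processing for online stores",
    "luxury hotel and resort operator", "owns and operates resort hotels",
    "boutique hotel chain operator", "hotel and cruise line operator",
]
labels = ["payments"] * 4 + ["hotels"] * 4

toy_pipe = Pipeline([("vectorizer", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", LinearSVC())])
toy_grid = {"vectorizer__ngram_range": [(1, 1), (1, 2)],
            "tfidf__use_idf": [True, False]}

toy_search = GridSearchCV(toy_pipe, toy_grid, cv=2).fit(docs, labels)

# One row per parameter combination, best-ranked first.
results = (pd.DataFrame(toy_search.cv_results_)
           [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
           .sort_values("rank_test_score"))
print(results)
```

When several combinations score within a standard deviation of each other, the simpler one (for example unigrams only) is often a reasonable choice.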


In [135]:
#preparing the final pipeline using the selected parameters
model = Pipeline([('vectorizer', CountVectorizer(ngram_range=(1,1))),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

In [38]:
#fit model with training data
model.fit(X_train, y_train)
#evaluation on test data
pred = model.predict(X_test)
from sklearn.metrics import confusion_matrix, accuracy_score
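#arguments are passed as (pred, y_test), so rows are predicted classes and columns are true classes
confusion_matrix(pred, y_test)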

array([[ 73,   6,   3,   0,   3,   1,   3,   4,   4,   4,   6],
       [  1,  36,   2,   2,   2,   0,   0,   0,   0,   0,   2],
       [  2,   2,  66,   3,  18,   1,   6,   2,   0,   0,   2],
       [  1,   1,   2, 123,   2,   2,   3,   2,   0,   2,   3],
       [  4,   5,  23,   1,  82,   3,   1,   1,   2,   1,   2],
       [  2,   0,   0,   1,   0,   4,   0,   0,   0,   0,   1],
       [  1,   1,   2,   0,   0,   1,  36,   6,   1,   0,   3],
       [  6,   3,   2,   2,   1,   4,  13, 128,  35,   5,   6],
       [  1,   0,   0,   0,   0,   1,   0,   2,   2,   2,   0],
       [  0,   0,   0,   0,   0,   0,   0,   0,   1,   1,   2],
       [  0,   1,   0,   1,   0,   1,   2,   0,   1,   2,   3]],
      dtype=int64)

In [39]:
print(accuracy_score(y_test, pred))

0.6864931846344485
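With imbalanced classes, a single accuracy number can hide weak performance on rare classes. As a hedged sketch on made-up labels (`y_true` and `y_hat` are assumptions for illustration, not the notebook's predictions), scikit-learn's `classification_report` and macro-averaged F1 give a per-class view.

```python
from sklearn.metrics import accuracy_score, classification_report, f1_score

# Made-up labels: class "B" is rare and never predicted correctly.
y_true = ["A", "A", "A", "A", "B", "C"]
y_hat = ["A", "A", "A", "A", "A", "C"]

acc = accuracy_score(y_true, y_hat)

# Macro averaging gives every class the same weight, regardless of its size.
macro_f1 = f1_score(y_true, y_hat, average="macro", zero_division=0)

print(classification_report(y_true, y_hat, zero_division=0))
print(acc, macro_f1)
```

Here accuracy looks high because the majority class dominates, while the macro F1 is pulled down by the class that is never predicted. The same calls could be applied to `y_test` and `pred` from the cell above.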


In [136]:
model.fit(X, y)

Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
       ...lti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1))])

In [137]:
test_data = pd.read_csv('data/test.csv',encoding='latin1')

In [138]:
#apply the same cleaning that was used on the training text
X_test = [clean_str(text) for text in test_data.description.fillna(' ')]

In [139]:
len(X_test)

627

In [140]:
test_y_pred = model.predict(X_test)

In [141]:
new_df = pd.DataFrame(test_data['description'])

In [142]:
new_df['trans_id'] = test_data['trans_id']

In [143]:
new_df['fintech'] = test_y_pred

In [144]:
new_df.head()

Unnamed: 0,description,trans_id,fintech
0,Commodity Blenders Inc. offers commodity blend...,IQTR608951353,Agricultural Products
1,"SOLID Surface Care, Inc. provides surface care...",IQTR610187145,Healthcare Facilities
2,Lightspeed Systems Inc. develops network secur...,IQTR608184884,Payment Processors
3,"Toast, Inc. develops an Android point of sale ...",IQTR608292171,Restaurants
4,"Live Up Top, Inc. owns and operates an end to ...",IQTR608311552,Investment and Capital Markets Technology


In [145]:
new_df.to_csv('multiclass.csv')

Alternatively, if we are satisfied with the accuracy score obtained from the grid search cross-validation, we can fit the model directly on the whole data set.

Now we need to save the prepared model in a pickle file so that it can be deployed in production. The joblib library makes this easy.

In [42]:
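#sklearn.externals.joblib was removed in scikit-learn 0.23; use the standalone joblib package
import joblib
#dump returns the list of written file names, so do not assign it back to model
joblib.dump(model, 'model_fintech_category.pkl')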
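On the production side, the saved pipeline can be restored with `joblib.load` and used for prediction right away, since the vectorizer, TF-IDF transformer and classifier are stored together. The following is a minimal, self-contained sketch: the tiny corpus, the `toy_model` name and the temporary file path are assumptions for illustration and stand in for the fitted pipeline and the `.pkl` file above.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Stand-in for the fitted pipeline, trained on made-up text.
docs = [
    "online payment processing platform", "mobile payment app for merchants",
    "luxury hotel and resort operator", "owns and operates resort hotels",
    "manufactures leather footwear", "produces running shoes and footwear",
]
labels = ["payments", "payments", "hotels", "hotels", "footwear", "footwear"]

toy_model = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", OneVsRestClassifier(LinearSVC(class_weight="balanced"))),
]).fit(docs, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)      # writes the whole pipeline to disk
    restored = joblib.load(path)      # what a production service would do

# The restored pipeline accepts raw text, exactly like the original.
new_docs = ["payment app for online merchants"]
same = list(restored.predict(new_docs)) == list(toy_model.predict(new_docs))
print(restored.predict(new_docs), same)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that was used to save them.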