# Modeling

---
# Train/Test Split Modeling

In this section I will be using estimators to create prediction models to predict which text relates to which indicators. 

To measure model accuracy, I need to split the train data into train and test sets. Holding out labelled data gives the predictions true values to compare against, so I can calculate the accuracy and Hamming loss of each model and find the best one.

In [102]:
import pandas as pd
import numpy as np
import sklearn

from sklearn.model_selection import cross_val_score,cross_val_predict, train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import RidgeClassifierCV
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline

In [103]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cvec = CountVectorizer(stop_words='english')

tvec = TfidfVectorizer(stop_words='english')

### Splitting the Train Data into Train and Test Sets

To use the accuracy and Hamming loss functions, we need to split the train data into train and test sets.

In [104]:
train_clean = pd.read_csv('../data/clean_train.csv')

In [105]:
# Get the number of training texts from the dataframe size.
text = train_clean.shape[0]
print(f'There are {text} Train Text.')

There are 2995 Train Text.


In [106]:
X = train_clean[['clean_text']]
y = train_clean[['3.1.1', '3.1.2', '3.2.1', '3.2.2','3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.1','3.5.2', '3.6.1', '3.7.1', '3.7.2', '3.8.1', '3.8.2', '3.9.1', '3.9.2','3.9.3', '3.a.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [107]:
# Convert the text columns to plain Python lists for the vectorizers.
clean_text_Xtrain = X_train['clean_text'].tolist()
clean_text_Xtest = X_test['clean_text'].tolist()

In [108]:
Xtrain_cvec_data_features = cvec.fit_transform(clean_text_Xtrain)
Xtest_cvec_data_features = cvec.transform(clean_text_Xtest)
print('XTrain cvec Data Shape',Xtrain_cvec_data_features.shape)
print('XTest cvec Data Shape',Xtest_cvec_data_features.shape)
        
Xtrain_tvec_data_features = tvec.fit_transform(clean_text_Xtrain)
Xtest_tvec_data_features = tvec.transform(clean_text_Xtest)
print('XTrain tvec Data Shape',Xtrain_tvec_data_features.shape)
print('XTest tvec Data Shape',Xtest_tvec_data_features.shape)

XTrain cvec Data Shape (2396, 23241)
XTest cvec Data Shape (599, 23241)
XTrain tvec Data Shape (2396, 23241)
XTest tvec Data Shape (599, 23241)


In [109]:
print('yTrain Shape',y_train.shape)
print('yTest Shape',y_test.shape)

yTrain Shape (2396, 27)
yTest Shape (599, 27)


### The Hamming Loss Function

Hamming loss is the fraction of labels that are incorrectly predicted, i.e., the number of wrong labels divided by the total number of labels.

It reports how often, on average, the relevance of an example to a class label is incorrectly predicted. Hamming loss therefore takes into account both the prediction error (an incorrect label is predicted) and the missing error (a relevant label is not predicted), normalized over the total number of classes and the total number of examples.

Ideally the Hamming loss would be 0, which would imply no error. In practice, the smaller the Hamming loss, the better the performance of the learning algorithm.

In [110]:
#hamming loss function
def hamming_loss(y_true, y_pred):
    temp=0
    for i in range(y_true.shape[0]):
        temp += np.size(y_true[i] == y_pred[i]) - np.count_nonzero(y_true[i] == y_pred[i])
    return temp/(y_true.shape[0] * y_true.shape[1])
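
As a cross-check, the loop above can be compared with a vectorized NumPy expression and with scikit-learn's own `sklearn.metrics.hamming_loss`. This is a minimal sketch on a tiny made-up label matrix: the arrays `y_true_demo` and `y_pred_demo` are assumptions for illustration, not project data, and the function is named `hamming_loss_loop` here only to avoid clashing with the scikit-learn import.

```python
import numpy as np
from sklearn.metrics import hamming_loss as sk_hamming_loss

def hamming_loss_loop(y_true, y_pred):
    """Same logic as the notebook's function: count mismatched labels row by row."""
    mismatches = 0
    for i in range(y_true.shape[0]):
        matches = y_true[i] == y_pred[i]
        mismatches += np.size(matches) - np.count_nonzero(matches)
    return mismatches / (y_true.shape[0] * y_true.shape[1])

# Hypothetical 3 samples x 4 labels, for illustration only.
y_true_demo = np.array([[1, 0, 0, 1],
                        [0, 1, 0, 0],
                        [1, 1, 0, 0]])
y_pred_demo = np.array([[1, 0, 1, 1],
                        [0, 1, 0, 0],
                        [0, 1, 0, 1]])

loop_value = hamming_loss_loop(y_true_demo, y_pred_demo)
vectorized_value = np.mean(y_true_demo != y_pred_demo)
sklearn_value = sk_hamming_loss(y_true_demo, y_pred_demo)

# 3 wrong labels out of 12 -> 0.25 from each approach
print(loop_value, vectorized_value, sklearn_value)
```

Because the metric only counts positions where the two arrays differ, swapping the argument order does not change the result.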

### Exact Match Ratio or Subset Accuracy
The Exact Match Ratio, or Subset Accuracy, is the strictest metric: it is the percentage of samples that have all of their labels classified correctly.

The disadvantage of this measure is that multi-label predictions can be partially correct, and here those partially correct matches are counted as wrong.

scikit-learn implements subset accuracy in its accuracy_score function.
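
To illustrate how strict subset accuracy is, the sketch below applies `accuracy_score` to a small made-up multi-label example and compares it with a manual row-by-row check. The arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical 3 samples x 3 labels, for illustration only.
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0]])
y_pred_demo = np.array([[1, 0, 1],   # all labels correct -> counts as a match
                        [0, 1, 1],   # one label wrong    -> counts as a miss
                        [1, 1, 0]])  # all labels correct -> counts as a match

# For multi-label indicator arrays, accuracy_score is the exact match ratio.
subset_accuracy = accuracy_score(y_true_demo, y_pred_demo)

# Manual equivalent: a row scores 1 only if every label in it matches.
manual = np.mean(np.all(y_true_demo == y_pred_demo, axis=1))

print(subset_accuracy, manual)  # 2 of 3 rows match exactly
```

The second row has two of its three labels right, yet it contributes nothing to the score, which is why subset accuracy is usually much lower than one minus the Hamming loss.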

Now we will start modeling and get the accuracy for each model.

### Linear Regression (Base Model)

The base model produced a Hamming loss of 1.0 and an accuracy of 0.072.

**Due to the long run time, the cells below have been commented out.**

In [111]:
#cvec
#lr = LinearRegression()
#lr.fit(Xtrain_cvec_data_features,y_train)

#prediction = lr.predict(Xtest_cvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

In [112]:
#tvec
#lr = LinearRegression()
#lr.fit(Xtrain_tvec_data_features,y_train)

#prediction = lr.predict(Xtest_tvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

### KNN

In [113]:
#cvec
knn = KNeighborsClassifier()
knn.fit(Xtrain_cvec_data_features,y_train)

prediction = knn.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.23205342237061768
Prediction Hamming Loss Accuracy: 0.060718481419650035


In [114]:
#tvec
knn = KNeighborsClassifier()
knn.fit(Xtrain_tvec_data_features,y_train)

prediction = knn.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.2921535893155259
Prediction Hamming Loss Accuracy: 0.055339145489395905


### Logistic Regression

In [115]:
#Some errors might occur when running, so this code block is commented out
#The accuracy and Hamming loss of the code block are shown below
#cvec using MultiOutputClassifier
#log = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
#log.fit(Xtrain_cvec_data_features,y_train)

#prediction = log.predict(Xtest_cvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

#Prediction Accuracy: 0.3071786310517529
#Prediction Hamming Loss Accuracy: 0.05237123601063501

In [116]:
#tvec using MultiOutputClassifier
log = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
log.fit(Xtrain_tvec_data_features,y_train)

prediction = log.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.24874791318864775
Prediction Hamming Loss Accuracy: 0.05571013417424102


### Decision Tree

In [117]:
#cvec
tree = DecisionTreeClassifier()
tree.fit(Xtrain_cvec_data_features,y_train)

prediction = tree.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.3088480801335559
Prediction Hamming Loss Accuracy: 0.06585049156000743


In [118]:
#tvec
tree = DecisionTreeClassifier()
tree.fit(Xtrain_tvec_data_features,y_train)

prediction = tree.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.29716193656093487
Prediction Hamming Loss Accuracy: 0.0676436035367588


### MLP Classifier

In [119]:
from sklearn.neural_network import MLPClassifier

#cvec
mlp = MLPClassifier()
mlp.fit(Xtrain_cvec_data_features,y_train)

prediction = mlp.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.328881469115192
Prediction Hamming Loss Accuracy: 0.051320101403573855


In [120]:
#tvec
mlp = MLPClassifier()
mlp.fit(Xtrain_tvec_data_features,y_train)

prediction = mlp.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.3071786310517529
Prediction Hamming Loss Accuracy: 0.05063995548135782




### Random Forest Classifier

In [121]:
#cvec
rfc = RandomForestClassifier()
rfc.fit(Xtrain_cvec_data_features,y_train)

prediction = rfc.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.24040066777963273
Prediction Hamming Loss Accuracy: 0.056452111543931247


In [122]:
#tvec
rfc = RandomForestClassifier()
rfc.fit(Xtrain_tvec_data_features,y_train)

prediction = rfc.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.24707846410684475
Prediction Hamming Loss Accuracy: 0.056637605886353797


### Ridge Classifier CV

In [123]:
#cvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(Xtrain_cvec_data_features,y_train)

prediction = ridge.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.1636060100166945
Prediction Hamming Loss Accuracy: 0.08056637605886353


In [124]:
#tvec
ridge_best = MultiOutputClassifier(RidgeClassifierCV())
ridge_best.fit(Xtrain_tvec_data_features,y_train)

best_prediction = ridge_best.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),best_prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(best_prediction,y_test.to_numpy()))

Prediction Accuracy: 0.327212020033389
Prediction Hamming Loss Accuracy: 0.047239225870277624


---
# Evaluate Best Train/Test Split Model

The Ridge Classifier CV with TfidfVectorizer features was the best model, with the lowest Hamming loss of 0.0472.

In [125]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt

In [126]:
print(classification_report(y_test, best_prediction)) #classification report from sklearn

              precision    recall  f1-score   support

           0       1.00      0.43      0.60        30
           1       0.00      0.00      0.00        18
           2       0.86      0.40      0.55        45
           3       0.67      0.11      0.18        19
           4       0.96      0.63      0.76        78
           5       0.96      0.49      0.65        47
           6       0.96      0.53      0.69        45
           7       1.00      0.24      0.38        17
           8       0.72      0.38      0.50        34
           9       0.93      0.56      0.70        93
          10       0.75      0.27      0.40        11
          11       0.78      0.44      0.56        16
          12       0.83      0.33      0.48        15
          13       1.00      0.67      0.80         3
          14       1.00      0.51      0.68        35
          15       1.00      0.50      0.67        32
          16       0.75      0.32      0.45       102
          17       1.00    



The model works best for some of the indicators, for example:
- 3.3.1
- 3.6.1

Indicators that might need a different model:
- 3.1.2
- 3.9.3

Indicators for which the model might need more tuning:
- 3.4.1
- 3.b.2
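
The report rows above are numbered by column position, while the lists refer to indicator codes. One way to make the two line up is the `target_names` parameter of `classification_report`. The sketch below is a hedged illustration on made-up arrays with three of the indicator codes; `indicator_names`, `y_true_demo` and `y_pred_demo` are assumptions, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import classification_report

# Hypothetical labels and predictions for three indicators, for illustration only.
indicator_names = ['3.1.1', '3.1.2', '3.2.1']
y_true_demo = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 0, 0],
                        [0, 1, 1]])
y_pred_demo = np.array([[1, 0, 1],
                        [0, 0, 0],
                        [1, 0, 0],
                        [0, 0, 1]])

# target_names labels each row with its indicator code instead of 0, 1, 2.
# zero_division=0 reports 0.0 (without a warning) when a label is never predicted.
print(classification_report(y_true_demo, y_pred_demo,
                            target_names=indicator_names, zero_division=0))

# The same report as a dictionary, keyed by indicator code.
report = classification_report(y_true_demo, y_pred_demo,
                               target_names=indicator_names,
                               zero_division=0, output_dict=True)
print(report['3.1.2']['recall'])
```

In the notebook itself, the column names of the label dataframe could be passed as `target_names` so that each row of the report is labelled with its indicator.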

---
# Modeling on Actual Train and Test Data

In this section I will be using estimators to create prediction models to predict which text relates to which indicators. The models are fitted on the full train data and then used to predict indicators for the unseen test data.

In [127]:
train_clean = pd.read_csv('../data/clean_train.csv')
test_clean = pd.read_csv('../data/clean_test.csv')

In [128]:
# Get the number of training texts from the dataframe size.
text = train_clean.shape[0]
print(f'There are {text} Train Text.')

There are 2995 Train Text.


In [129]:
# Get the number of test texts from the dataframe size.
text = test_clean.shape[0]
print(f'There are {text} Test Text.')

There are 998 Test Text.


In [130]:
# Convert the text columns to plain Python lists for the vectorizers.
clean_text_train = train_clean['clean_text'].tolist()
clean_text_test = test_clean['clean_text'].tolist()

In [131]:
train_cvec_data_features = cvec.fit_transform(clean_text_train)
test_cvec_data_features = cvec.transform(clean_text_test)
print('Train cvec Data Shape',train_cvec_data_features.shape)
print('Test cvec Data Shape',test_cvec_data_features.shape)
        
train_tvec_data_features = tvec.fit_transform(clean_text_train)
test_tvec_data_features = tvec.transform(clean_text_test)
print('Train tvec Data Shape',train_tvec_data_features.shape)
print('Test tvec Data Shape',test_tvec_data_features.shape)

Train cvec Data Shape (2995, 26077)
Test cvec Data Shape (998, 26077)
Train tvec Data Shape (2995, 26077)
Test tvec Data Shape (998, 26077)


In [132]:
y = train_clean[['3.1.1', '3.1.2', '3.2.1', '3.2.2','3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.1','3.5.2', '3.6.1', '3.7.1', '3.7.2', '3.8.1', '3.8.2', '3.9.1', '3.9.2','3.9.3', '3.a.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1']]
y

Unnamed: 0,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,3.4.1,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2991,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2992,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,1,0,0,0
2993,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [133]:
print(y.shape)

(2995, 27)


### MLP Classifier

In [134]:
#cvec
mlp = MLPClassifier()
mlp.fit(train_cvec_data_features,y)

prediction = mlp.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 1, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [135]:
predict_mlp = pd.DataFrame(prediction)

#renaming the columns
predict_mlp.columns = y.columns[:]

submission_mlp_cvec = pd.concat([test_clean['Unique ID'],predict_mlp],axis=1)
submission_mlp_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0


In [136]:
submission_mlp_cvec.to_csv('../predictions/submission_mlp_cvec.csv',index=False)
#Hamming Loss Score of 0.0488

In [137]:
#tvec
mlp = MLPClassifier()
mlp.fit(train_tvec_data_features,y)

prediction = mlp.predict(test_tvec_data_features)

prediction



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [138]:
predict_mlp = pd.DataFrame(prediction)

#renaming the columns
predict_mlp.columns = y.columns[:]

submission_mlp_tvec = pd.concat([test_clean['Unique ID'],predict_mlp],axis=1)
submission_mlp_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [139]:
submission_mlp_tvec.to_csv('../predictions/submission_mlp_tvec.csv',index=False)
#Hamming Loss Score of 0.0464

### KNN

In [140]:
#cvec
knn = KNeighborsClassifier()
knn.fit(train_cvec_data_features,y)

prediction = knn.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [141]:
predict_knn = pd.DataFrame(prediction)

#renaming the columns
predict_knn.columns = y.columns[:]

submission_knn_cvec = pd.concat([test_clean['Unique ID'],predict_knn],axis=1)
submission_knn_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [142]:
submission_knn_cvec.to_csv('../predictions/submission_knn_cvec.csv',index=False)
#Hamming Loss Score of 0.0594

In [143]:
#tvec
knn = KNeighborsClassifier()
knn.fit(train_tvec_data_features,y)

prediction = knn.predict(test_tvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [144]:
predict_knn = pd.DataFrame(prediction)

#renaming the columns
predict_knn.columns = y.columns[:]

submission_knn_tvec = pd.concat([test_clean['Unique ID'],predict_knn],axis=1)
submission_knn_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [145]:
submission_knn_tvec.to_csv('../predictions/submission_knn_tvec.csv',index=False)
#Hamming Loss Score of 0.0513

### Ridge Classifier CV

In [147]:
#cvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(train_cvec_data_features,y)

prediction = ridge.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [148]:
predict_ridge = pd.DataFrame(prediction)

#renaming the columns
predict_ridge.columns = y.columns[:]

submission_ridge_cvec = pd.concat([test_clean['Unique ID'],predict_ridge],axis=1)
submission_ridge_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,52348,0,0,0,0,1,1,0,0,0,...,0,0,0,0,0,0,0,1,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,36296,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [149]:
submission_ridge_cvec.to_csv('../predictions/submission_ridge_cvec.csv',index=False)
#Hamming Loss Score of 0.0818

In [150]:
#tvec
best_ridge = MultiOutputClassifier(RidgeClassifierCV())
best_ridge.fit(train_tvec_data_features,y)

prediction_best = best_ridge.predict(test_tvec_data_features)

prediction_best

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [151]:
predict_ridge = pd.DataFrame(prediction_best)

#renaming the columns
predict_ridge.columns = y.columns[:]

submission_ridge_tvec = pd.concat([test_clean['Unique ID'],predict_ridge],axis=1)
submission_ridge_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [152]:
submission_ridge_tvec.to_csv('../predictions/submission_ridge_tvec.csv',index=False)
#Hamming Loss Score of 0.0447

---
# Evaluation and Best Model
 
After comparing the Hamming loss of each model, I identified the best model for this text classification task as the Ridge Classifier CV model with the TfidfVectorizer and its default parameters. The model received the lowest Hamming loss, 0.0447.
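
One possible way to package this configuration is a scikit-learn `Pipeline` that chains the vectorizer and the classifier. This is only a sketch under stated assumptions, not the notebook's own code: the toy corpus, the two made-up labels, and the step names `tfidf` and `ridge` are all hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifierCV
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two made-up labels, for illustration only.
demo_texts = [
    'malaria vaccine children',
    'road traffic accident deaths',
    'malaria bed nets',
    'traffic injuries helmets',
    'vaccine coverage children',
    'road safety accident',
]
demo_labels = np.array([[1, 0], [0, 1], [1, 0], [0, 1], [1, 0], [0, 1]])

# Vectorizer and classifier in one estimator, both with default parameters
# apart from the English stop-word list used earlier in the notebook.
ridge_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english')),
    ('ridge', MultiOutputClassifier(RidgeClassifierCV())),
])
ridge_pipeline.fit(demo_texts, demo_labels)

demo_pred = ridge_pipeline.predict(['malaria vaccine', 'traffic accident'])
print(demo_pred.shape)  # one row per text, one column per label
```

A pipeline accepts raw text directly, and when it is passed to a cross-validation helper the vectorizer is refitted inside each fold, so the vocabulary and IDF weights are learned from the training folds only.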

In [153]:
from sklearn.model_selection import cross_validate

In [155]:
print(best_ridge.score(train_tvec_data_features, y))

0.698831385642738


In [159]:
cross_validate(best_ridge, train_tvec_data_features, y, cv=5, scoring=['f1_weighted'])

{'fit_time': array([121.19, 126.13, 126.89, 126.12, 125.75]),
 'score_time': array([0.02, 0.01, 0.02, 0.01, 0.02]),
 'test_f1_weighted': array([0.53, 0.55, 0.51, 0.52, 0.53])}
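
The fold scores can be condensed into a mean and standard deviation. The short sketch below copies the five `test_f1_weighted` values from the output above rather than rerunning the cross-validation.

```python
import numpy as np

# Fold scores copied from the cross_validate output above.
test_f1_weighted = np.array([0.53, 0.55, 0.51, 0.52, 0.53])

mean_f1 = test_f1_weighted.mean()
std_f1 = test_f1_weighted.std()
print(f'weighted F1: {mean_f1:.3f} +/- {std_f1:.3f}')
```

Note that this weighted F1 is not directly comparable with the 0.699 printed by `best_ridge.score`, which is subset accuracy measured on the same data the model was trained on.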