# Modeling

---
# Train/Test Split Modeling

In this section I use several estimators to build models that predict which indicators each text relates to.

To measure model accuracy, I first need to split the training data into train and test sets. Accuracy and Hamming loss can only be calculated when the predictions have true labels to compare against, so the held-out test set is used to find the best model.

In [72]:
import numpy as np
import pandas as pd
import sklearn

from sklearn.model_selection import cross_val_score, cross_val_predict, train_test_split, KFold
from sklearn.preprocessing import StandardScaler
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge, RidgeCV
from sklearn.linear_model import Lasso, LassoCV
from sklearn.linear_model import ElasticNet, ElasticNetCV
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from sklearn.metrics import accuracy_score
from sklearn.multiclass import OneVsRestClassifier
from sklearn.multioutput import MultiOutputClassifier

from sklearn.model_selection import GridSearchCV 
from sklearn.pipeline import Pipeline

### Splitting the Train Data into Train and Test Sets

To use the accuracy and Hamming loss functions, we need to split the training data into train and test sets.

In [63]:
train_clean = pd.read_csv('data/clean_train.csv')

In [64]:
# Get the number of training texts from the dataframe size.
text = train_clean.shape[0]
print(f'There are {text} Train Text.')

There are 2995 Train Text.


In [65]:
X = train_clean[['clean_text']]
y = train_clean[['3.1.1', '3.1.2', '3.2.1', '3.2.2','3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.1','3.5.2', '3.6.1', '3.7.1', '3.7.2', '3.8.1', '3.8.2', '3.9.1', '3.9.2','3.9.3', '3.a.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1']]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=42)

In [66]:
# Convert the text columns to plain lists for the vectorizers.
clean_text_Xtrain = X_train['clean_text'].tolist()
clean_text_Xtest = X_test['clean_text'].tolist()

In [67]:
Xtrain_cvec_data_features = cvec.fit_transform(clean_text_Xtrain)
Xtest_cvec_data_features = cvec.transform(clean_text_Xtest)
print('XTrain cvec Data Shape',Xtrain_cvec_data_features.shape)
print('XTest cvec Data Shape',Xtest_cvec_data_features.shape)
        
Xtrain_tvec_data_features = tvec.fit_transform(clean_text_Xtrain)
Xtest_tvec_data_features = tvec.transform(clean_text_Xtest)
print('XTrain tvec Data Shape',Xtrain_tvec_data_features.shape)
print('XTest tvec Data Shape',Xtest_tvec_data_features.shape)

XTrain cvec Data Shape (2396, 23243)
XTest cvec Data Shape (599, 23243)
XTrain tvec Data Shape (2396, 23243)
XTest tvec Data Shape (599, 23243)
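
The `cvec` and `tvec` objects used above are created earlier in the notebook, and their settings are not shown in this section. As a minimal sketch, assuming they are scikit-learn's `CountVectorizer` and `TfidfVectorizer` with default settings (an assumption), the fit-on-train, transform-on-test pattern looks like this on a made-up toy corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Toy documents standing in for clean_text_Xtrain / clean_text_Xtest (hypothetical data).
toy_train = ["maternal mortality ratio", "road traffic deaths", "tobacco use among adults"]
toy_test = ["maternal deaths on the road"]

# Default settings are an assumption; the notebook's real cvec/tvec may differ.
toy_cvec = CountVectorizer()
toy_tvec = TfidfVectorizer()

# Learn the vocabulary from the training texts only ...
train_counts = toy_cvec.fit_transform(toy_train)
train_tfidf = toy_tvec.fit_transform(toy_train)

# ... then reuse that vocabulary for the test texts, so both matrices
# share the same columns. Words not seen during fitting are ignored.
test_counts = toy_cvec.transform(toy_test)
test_tfidf = toy_tvec.transform(toy_test)

print(train_counts.shape, test_counts.shape)
print(train_tfidf.shape, test_tfidf.shape)
```

Fitting only on the training texts is why the train and test matrices above have the same number of columns: the test set never adds new vocabulary.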


In [68]:
print('yTrain Shape',y_train.shape)
print('yTest Shape',y_test.shape)

yTrain Shape (2396, 27)
yTest Shape (599, 27)


### The Hamming Loss Function

Hamming loss is the fraction of labels that are incorrectly predicted, i.e., the ratio of wrong labels to the total number of labels.

It reports how often, on average, the relevance of an example to a class label is incorrectly predicted. Hamming loss therefore takes into account both the prediction error (an incorrect label is predicted) and the missing error (a relevant label is not predicted), normalized over the total number of classes and the total number of examples.

Ideally the Hamming loss is 0, which would imply no error. In practice, the smaller the Hamming loss, the better the performance of the learning algorithm.

In [70]:
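# Hamming loss: fraction of label slots where the prediction differs from the truth.
# Expects NumPy arrays of shape (n_samples, n_labels).
def hamming_loss(y_true, y_pred):
    mismatches = 0
    for i in range(y_true.shape[0]):
        # Labels in row i minus the labels that match = labels that differ.
        mismatches += np.size(y_true[i] == y_pred[i]) - np.count_nonzero(y_true[i] == y_pred[i])
    return mismatches / (y_true.shape[0] * y_true.shape[1])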
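
As a quick sanity check, the same calculation can be run on a small hand-made example and compared with scikit-learn's own `sklearn.metrics.hamming_loss`. The arrays below are invented for illustration, and the calculation is repeated in vectorized form under the hypothetical name `hamming_loss_np` so that the snippet runs on its own.

```python
import numpy as np
from sklearn.metrics import hamming_loss as sk_hamming_loss

def hamming_loss_np(y_true, y_pred):
    # Fraction of label slots where prediction and truth disagree.
    return np.mean(np.asarray(y_true) != np.asarray(y_pred))

# 3 samples x 4 labels, invented for illustration.
y_true_toy = np.array([[1, 0, 0, 1],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]])
y_pred_toy = np.array([[1, 0, 0, 0],   # one relevant label missed
                       [0, 1, 0, 0],   # all labels correct
                       [0, 1, 1, 0]])  # one extra label predicted

# 2 wrong slots out of 3 * 4 = 12 slots.
toy_loss = hamming_loss_np(y_true_toy, y_pred_toy)
print(toy_loss)
print(sk_hamming_loss(y_true_toy, y_pred_toy))
```

Both calls should agree, since each counts mismatched label slots and divides by the number of samples times the number of labels.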

### Exact Match Ratio or Subset Accuracy
The Exact Match Ratio, or Subset Accuracy, is the strictest metric: it gives the percentage of samples that have all of their labels classified correctly.

The disadvantage of this measure is that multi-label predictions can be partially correct, but this metric ignores those partial matches.

scikit-learn implements subset accuracy in the `accuracy_score` function.
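
To make the strictness concrete, here is a minimal sketch on invented data. When `accuracy_score` receives 2-D binary label matrices, it counts a sample as correct only if every label in the row matches.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# 3 samples x 4 labels, invented for illustration.
y_true_toy = np.array([[1, 0, 0, 1],
                       [0, 1, 0, 0],
                       [0, 0, 1, 0]])
y_pred_toy = np.array([[1, 0, 0, 0],   # 3 of 4 labels right, still counted as wrong
                       [0, 1, 0, 0],   # exact match
                       [0, 1, 1, 0]])  # 3 of 4 labels right, still counted as wrong

# Only one of the three rows matches exactly.
subset_acc = accuracy_score(y_true_toy, y_pred_toy)
print(subset_acc)
```

Ten of the twelve individual labels are correct here, yet only one row in three counts as a match, which is why subset accuracy can look low even when the Hamming loss is small.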

Now we will start modeling and get the accuracy for each model.

### Linear Regression (Base Model)

The base model produced a Hamming loss of 1.0 and an accuracy of 0.072.

**Due to the long run time, the cvec cell has been commented out.**

In [78]:
#cvec
#lr = LinearRegression()
#lr.fit(Xtrain_cvec_data_features,y_train)

#prediction = lr.predict(Xtest_cvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

In [79]:
#tvec
#lr = LinearRegression()
#lr.fit(Xtrain_tvec_data_features,y_train)

#prediction = lr.predict(Xtest_tvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))
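
One likely reason for a Hamming loss of 1.0 is that `LinearRegression` returns continuous values, which almost never equal the 0/1 labels exactly. The sketch below illustrates this on synthetic data (not the notebook's features); the 0.5 cut-off used to turn the output into labels is an assumption, not something the original notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 5))
# Two binary labels derived from the first two features (purely synthetic).
Y_toy = (X_toy[:, :2] > 0).astype(int)

reg = LinearRegression().fit(X_toy, Y_toy)
raw = reg.predict(X_toy)            # continuous values, not 0/1
binary = (raw >= 0.5).astype(int)   # assumed 0.5 threshold

# Fraction of label slots that differ from the truth.
raw_loss = np.mean(raw != Y_toy)
binary_loss = np.mean(binary != Y_toy)
print(raw_loss, binary_loss)
```

Comparing the raw regression output against binary labels penalizes nearly every slot, so a regression baseline is only comparable to the classifiers below after its output is converted to 0/1.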

### KNN

In [80]:
#cvec
knn = KNeighborsClassifier()
knn.fit(Xtrain_cvec_data_features,y_train)

prediction = knn.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.24540901502504173
Prediction Hamming Loss Accuracy: 0.059234526680269586


In [81]:
#tvec
knn = KNeighborsClassifier()
knn.fit(Xtrain_tvec_data_features,y_train)

prediction = knn.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.2938230383973289
Prediction Hamming Loss Accuracy: 0.05373152785506709


### Logistic Regression

In [None]:
# Errors might occur when running this code block, so it is commented out.
# The accuracy and Hamming loss it produced are shown at the bottom of the block.
# cvec using MultiOutputClassifier
#log = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
#log.fit(Xtrain_cvec_data_features,y_train)

#prediction = log.predict(Xtest_cvec_data_features)

#print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
#print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

#Prediction Accuracy: 0.3071786310517529
#Prediction Hamming Loss Accuracy: 0.05237123601063501

In [83]:
#tvec using MultiOutputClassifier
log = MultiOutputClassifier(LogisticRegression(solver='lbfgs'))
log.fit(Xtrain_tvec_data_features,y_train)

prediction = log.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.2604340567612688
Prediction Hamming Loss Accuracy: 0.05447350522475731


### Decision Tree

In [84]:
#cvec
tree = DecisionTreeClassifier()
tree.fit(Xtrain_cvec_data_features,y_train)

prediction = tree.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.3121869782971619
Prediction Hamming Loss Accuracy: 0.06609781734990416


In [85]:
#tvec
tree = DecisionTreeClassifier()
tree.fit(Xtrain_tvec_data_features,y_train)

prediction = tree.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.28213689482470783
Prediction Hamming Loss Accuracy: 0.0675817720892846


### MLP Classifier

In [86]:
from sklearn.neural_network import MLPClassifier

#cvec
mlp = MLPClassifier()
mlp.fit(Xtrain_cvec_data_features,y_train)

prediction = mlp.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.34557595993322204
Prediction Hamming Loss Accuracy: 0.04965065232177085


In [87]:
#tvec
mlp = MLPClassifier()
mlp.fit(Xtrain_tvec_data_features,y_train)

prediction = mlp.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.31552587646076796
Prediction Hamming Loss Accuracy: 0.050516292586409446




### Random Forest Classifier

In [88]:
#cvec
rfc = RandomForestClassifier()
rfc.fit(Xtrain_cvec_data_features,y_train)

prediction = rfc.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.25208681135225375
Prediction Hamming Loss Accuracy: 0.05639028009645706


In [89]:
#tvec
rfc = RandomForestClassifier()
rfc.fit(Xtrain_tvec_data_features,y_train)

prediction = rfc.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.24540901502504173
Prediction Hamming Loss Accuracy: 0.057317751808569836


### Ridge Classifier CV

In [96]:
from sklearn.linear_model import RidgeClassifierCV

#cvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(Xtrain_cvec_data_features,y_train)

prediction = ridge.predict(Xtest_cvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.15859766277128548
Prediction Hamming Loss Accuracy: 0.08056637605886353


In [120]:
#tvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(Xtrain_tvec_data_features,y_train)

prediction = ridge.predict(Xtest_tvec_data_features)

print('Prediction Accuracy:',accuracy_score(y_test.to_numpy(),prediction))
print('Prediction Hamming Loss Accuracy:',hamming_loss(prediction,y_test.to_numpy()))

Prediction Accuracy: 0.328881469115192
Prediction Hamming Loss Accuracy: 0.04748655166017436
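
Each model above repeats the same fit, predict, and score steps. A minimal sketch of a helper that wraps them is shown here; the name `evaluate` and the synthetic data are hypothetical and not part of the original notebook, and scikit-learn's built-in `hamming_loss` is used in place of the hand-written function.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.metrics import accuracy_score, hamming_loss as sk_hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def evaluate(model, X_tr, X_te, y_tr, y_te):
    """Fit the model and return (subset accuracy, Hamming loss) on the test split."""
    model.fit(X_tr, y_tr)
    pred = model.predict(X_te)
    return accuracy_score(y_te, pred), sk_hamming_loss(y_te, pred)

# Synthetic multi-label data standing in for the vectorized text features.
X_syn, y_syn = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

models = {
    "knn": KNeighborsClassifier(),
    "tree": DecisionTreeClassifier(random_state=42),
}

# Collect both metrics per model so they can be compared side by side.
results = {}
for name, model in models.items():
    results[name] = evaluate(model, X_tr, X_te, y_tr, y_te)
    print(name, results[name])
```

With the real data, the same loop could be run once with the cvec features and once with the tvec features to reproduce the comparison above in fewer cells.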


---
# Modeling on Actual Train and Test Data

In this section I fit the estimators on the full training data and predict the indicators for each text in the actual test data.


In [97]:
train_clean = pd.read_csv('data/clean_train.csv')
test_clean = pd.read_csv('data/clean_test.csv')

In [98]:
# Get the number of training texts from the dataframe size.
text = train_clean.shape[0]
print(f'There are {text} Train Text.')

There are 2995 Train Text.


In [99]:
# Get the number of test texts from the dataframe size.
text = test_clean.shape[0]
print(f'There are {text} Test Text.')

There are 998 Test Text.


In [100]:
# Convert the text columns to plain lists for the vectorizers.
clean_text_train = train_clean['clean_text'].tolist()
clean_text_test = test_clean['clean_text'].tolist()

In [101]:
train_cvec_data_features = cvec.fit_transform(clean_text_train)
test_cvec_data_features = cvec.transform(clean_text_test)
print('Train cvec Data Shape',train_cvec_data_features.shape)
print('Test cvec Data Shape',test_cvec_data_features.shape)
        
train_tvec_data_features = tvec.fit_transform(clean_text_train)
test_tvec_data_features = tvec.transform(clean_text_test)
print('Train tvec Data Shape',train_tvec_data_features.shape)
print('Test tvec Data Shape',test_tvec_data_features.shape)

Train cvec Data Shape (2995, 26079)
Test cvec Data Shape (998, 26079)
Train tvec Data Shape (2995, 26079)
Test tvec Data Shape (998, 26079)


In [102]:
y = train_clean[['3.1.1', '3.1.2', '3.2.1', '3.2.2','3.3.1', '3.3.2', '3.3.3', '3.3.4', '3.3.5', '3.4.1', '3.4.2', '3.5.1','3.5.2', '3.6.1', '3.7.1', '3.7.2', '3.8.1', '3.8.2', '3.9.1', '3.9.2','3.9.3', '3.a.1', '3.b.1', '3.b.2', '3.b.3', '3.c.1', '3.d.1']]
y

Unnamed: 0,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,3.4.1,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,1,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,1,0,1
3,0,0,0,0,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,1,1,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2990,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2991,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2992,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,1,1,0,0,0
2993,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [103]:
print(y.shape)

(2995, 27)


### MLP Classifier

In [107]:
#cvec
mlp = MLPClassifier()
mlp.fit(train_cvec_data_features,y)

prediction = mlp.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]])

In [108]:
predict_mlp = pd.DataFrame(prediction)

#renaming the columns
predict_mlp.columns = y.columns[:]

submission_mlp_cvec = pd.concat([test_clean['Unique ID'],predict_mlp],axis=1)
submission_mlp_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,1,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [109]:
submission_mlp_cvec.to_csv('predictions/submission_mlp_cvec.csv',index=False)
#Hamming Loss Score of 0.0488

In [110]:
#tvec
mlp = MLPClassifier()
mlp.fit(train_tvec_data_features,y)

prediction = mlp.predict(test_tvec_data_features)

prediction



array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [111]:
predict_mlp = pd.DataFrame(prediction)

#renaming the columns
predict_mlp.columns = y.columns[:]

submission_mlp_tvec = pd.concat([test_clean['Unique ID'],predict_mlp],axis=1)
submission_mlp_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [112]:
submission_mlp_tvec.to_csv('predictions/submission_mlp_tvec.csv',index=False)
#Hamming Loss Score of 0.0464

### KNN

In [114]:
#cvec
knn = KNeighborsClassifier()
knn.fit(train_cvec_data_features,y)

prediction = knn.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 1, ..., 0, 0, 0]], dtype=int64)

In [115]:
predict_knn = pd.DataFrame(prediction)

#renaming the columns
predict_knn.columns = y.columns[:]

submission_knn_cvec = pd.concat([test_clean['Unique ID'],predict_knn],axis=1)
submission_knn_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [116]:
submission_knn_cvec.to_csv('predictions/submission_knn_cvec.csv',index=False)
#Hamming Loss Score of 0.0594

In [117]:
#tvec
knn = KNeighborsClassifier()
knn.fit(train_tvec_data_features,y)

prediction = knn.predict(test_tvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [118]:
predict_knn = pd.DataFrame(prediction)

#renaming the columns
predict_knn.columns = y.columns[:]

submission_knn_tvec = pd.concat([test_clean['Unique ID'],predict_knn],axis=1)
submission_knn_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,1,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [119]:
submission_knn_tvec.to_csv('predictions/submission_knn_tvec.csv',index=False)
#Hamming Loss Score of 0.0513

### Ridge Classifier CV

In [124]:
#cvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(train_cvec_data_features,y)

prediction = ridge.predict(test_cvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 1, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 1, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [125]:
predict_ridge = pd.DataFrame(prediction)

#renaming the columns
predict_ridge.columns = y.columns[:]

submission_ridge_cvec = pd.concat([test_clean['Unique ID'],predict_ridge],axis=1)
submission_ridge_cvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,1,0,0,0,0,...,1,0,0,0,0,0,0,0,0,1
1,52348,0,0,0,0,1,1,0,0,0,...,1,0,0,0,0,0,0,1,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,36296,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [126]:
submission_ridge_cvec.to_csv('predictions/submission_ridge_cvec.csv',index=False)
#Hamming Loss Score of 0.0818

In [127]:
#tvec
ridge = MultiOutputClassifier(RidgeClassifierCV())
ridge.fit(train_tvec_data_features,y)

prediction = ridge.predict(test_tvec_data_features)

prediction

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 1],
       [0, 0, 0, ..., 0, 0, 0],
       [1, 0, 0, ..., 0, 0, 0]], dtype=int64)

In [128]:
predict_ridge = pd.DataFrame(prediction)

#renaming the columns
predict_ridge.columns = y.columns[:]

submission_ridge_tvec = pd.concat([test_clean['Unique ID'],predict_ridge],axis=1)
submission_ridge_tvec

Unnamed: 0,Unique ID,3.1.1,3.1.2,3.2.1,3.2.2,3.3.1,3.3.2,3.3.3,3.3.4,3.3.5,...,3.8.2,3.9.1,3.9.2,3.9.3,3.a.1,3.b.1,3.b.2,3.b.3,3.c.1,3.d.1
0,49848,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,52348,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,103541,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,52382,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,47212,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
993,38108,0,0,0,0,0,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
994,30360,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
995,33883,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
996,36296,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [129]:
submission_ridge_tvec.to_csv('predictions/submission_ridge_tvec.csv',index=False)
#Hamming Loss Score of 0.0447
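
The submission cells above all follow the same pattern: wrap the prediction array in a DataFrame, name the columns after the indicators, and attach the `Unique ID` column. A minimal sketch of a helper for this is shown below; the name `build_submission` is hypothetical, and the toy predictions are invented.

```python
import numpy as np
import pandas as pd

def build_submission(ids, prediction, label_names):
    """Combine an ID column with a 0/1 prediction matrix into one DataFrame."""
    pred_df = pd.DataFrame(prediction, columns=list(label_names))
    # Reset the index so rows are aligned by position, not by index label.
    id_df = pd.Series(ids, name="Unique ID").reset_index(drop=True)
    return pd.concat([id_df, pred_df], axis=1)

# Toy inputs: three IDs and predictions for two indicators.
toy_ids = pd.Series([49848, 52348, 103541])
toy_pred = np.array([[0, 1],
                     [0, 0],
                     [1, 0]])

toy_submission = build_submission(toy_ids, toy_pred, ["3.1.1", "3.1.2"])
print(toy_submission)
```

In the notebook this would be called as `build_submission(test_clean['Unique ID'], prediction, y.columns)` before writing the result with `to_csv`.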