### Introduction

In this notebook we use an SVM model for the **detection of human values** in a given argument. This is a **multilabel text classification** problem: a piece of text must be assigned to one or more categories from a given list. Here, a given argument can be categorized into one or more human values.

#### Flow of the notebook

The notebook is divided into separate sections to provide an organized walk-through of the process. The sections are:

1. [Importing Python Libraries and preparing the environment](#section01)
2. [Importing and Pre-Processing the domain data](#section02)
3. [Text preprocessing](#section03)
4. [Tf-idf vectorization](#section04)
5. [Building Pipeline](#section05)
6. [Training](#section06)
7. [Evaluation](#section07)
8. [Results](#section08)

#### Technical Details

 - Data:
	 - We are using the data available on [Zenodo](https://zenodo.org/record/7550385#.Y8wMquzMK3I)
     - [Human Value Detection 2023](https://touche.webis.de/semeval23/touche23-web/index.html) is the competition that provides the source dataset
	 - We refer only to the following files: `arguments-training.tsv`, `arguments-validation.tsv`, `labels-training.tsv`, `labels-validation.tsv`

     - `arguments-<split>.tsv`: Each row corresponds to one argument
        - `Argument ID`: The unique identifier for the argument
        - `Conclusion`: Conclusion text of the argument
        - `Stance`: Stance of the Premise towards the Conclusion; one of "in favor of", "against"
        - `Premise`: Premise text of the argument

     - `labels-<split>.tsv`: Each row corresponds to one argument
        - `Argument ID`: The unique identifier for the argument
        - `Other`: Every other column corresponds to one value category; a 1 means that the argument resorts to the value category, a 0 that it does not
---
***NOTE***
- *Since the test data are provided without labels, we did not consider them in our analysis. Consequently, **the performance of our models has been tested only on validation data**.*
---

 - Script Objective:
	 - Building an SVM model such that, given a textual argument, it classifies whether or not the argument draws on each of the following categories:
        * Self-direction: thought      
        * Self-direction: action       
        * Stimulation     
        * Hedonism       
        * Achievement      
        * Power: dominance       
        * Power: resources       
        * Face       
        * Security: personal      
        * Security: societal      
        * Tradition       
        * Conformity: rules       
        * Conformity: interpersonal       
        * Humility       
        * Benevolence: caring       
        * Benevolence: dependability      
        * Universalism: concern       
        * Universalism: nature       
        * Universalism: tolerance      
        * Universalism: objectivity       

---
***NOTE***
- *Note that the overall mechanisms for multiclass and multilabel problems are similar, except for a few differences:*
	- *The loss function evaluates the probability of each category individually rather than relative to the other categories, hence the use of `BCE` rather than `Cross Entropy` when defining the loss.*
	- *A sigmoid is applied to the outputs rather than a softmax, for the reason given in the previous point.*
	- *The [accuracy metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) and [F1 scores](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score) from the sklearn package are used instead of a direct comparison of expected vs. predicted labels.*
---
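
To make the last point concrete, here is a minimal sketch of how the sklearn metrics behave on multilabel indicator matrices. The toy arrays are hypothetical and not taken from the dataset; the sketch only illustrates that `accuracy_score` counts a sample as correct when *all* of its labels match, while `f1_score` needs an explicit `average` mode.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical toy data: 3 samples, 3 labels (one column per label)
y_true = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 1, 0]])
y_pred = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [1, 1, 1]])

# For multilabel input, accuracy_score is the exact-match ratio:
# only the second sample has every label predicted correctly.
subset_accuracy = accuracy_score(y_true, y_pred)

# Micro averaging pools all label decisions; macro averages per-label F1.
micro_f1 = f1_score(y_true, y_pred, average="micro")
macro_f1 = f1_score(y_true, y_pred, average="macro")

print(f"subset accuracy: {subset_accuracy:.3f}")
print(f"micro F1: {micro_f1:.3f}, macro F1: {macro_f1:.3f}")
```

Because exact-match accuracy is very strict when there are many labels, the F1 scores are usually the more informative numbers for this task.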

<a id='section01'></a>
### Importing Python Libraries and preparing the environment

At this step we will be importing the libraries and modules needed to run our script. Libraries are:
* Numpy
* Pandas
* Seaborn
* Matplotlib
* scikit-learn

The random seed is set to 42 to make the experiments reproducible.

In [None]:
!pip install mlcm

In [1]:
import math
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sn  # alias used by the heatmap cell in the error analysis
from mlcm import mlcm

from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score

import os


In [2]:
# Random seed to repeat experiments.
np.random.seed(42)

<a id='section02'></a>
### Loading data

We load the data and prepare it for training, *assuming that the `sentences.tsv` and `labels.tsv` files of the training and validation splits are already downloaded and saved locally (the paths in the cells below must be adapted to your machine)*

* Import the files into separate dataframes
* Merge the arguments and labels dataframes of each split
* Take the values of all the categories and convert them into a list
* Use only the sentence text (the `Text` column) as input for our classifier
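
The cells below pair sentences and labels by row position. The previews printed further down start with different `Text-ID` values (`EN_001` for the sentences, `BG_002` for the labels), so the two files may not share the same row order. A minimal sketch of a key-based merge follows; the toy frames are hypothetical, and it assumes that `Text-ID` plus `Sentence-ID` uniquely identify a sentence in both files.

```python
import pandas as pd

# Hypothetical toy frames mimicking the columns of sentences.tsv / labels.tsv
sentences = pd.DataFrame({
    "Text-ID": ["EN_001", "EN_001", "BG_002"],
    "Sentence-ID": [1, 2, 1],
    "Text": ["first sentence", "second sentence", "third sentence"],
})
labels = pd.DataFrame({
    "Text-ID": ["BG_002", "EN_001", "EN_001"],   # different row order
    "Sentence-ID": [1, 1, 2],
    "Hedonism attained": [1.0, 0.0, 0.5],
    "Stimulation attained": [0.0, 1.0, 0.0],
})

keys = ["Text-ID", "Sentence-ID"]
# validate="one_to_one" raises if a key is duplicated on either side
merged = sentences.merge(labels, on=keys, how="inner", validate="one_to_one")

label_columns = [c for c in labels.columns if c not in keys]
# Same convention as the notebook: a 0.5 annotation is treated as 0
merged[label_columns] = merged[label_columns].replace(0.5, 0)
merged["list"] = merged[label_columns].values.tolist()
print(merged[["Text-ID", "Sentence-ID", "list"]])
```

Merging on the keys keeps each sentence attached to its own label row regardless of how the two files are sorted.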

In [3]:
args_train = pd.read_csv(r"C:\Users\Praveen\Desktop\SEM-6\DL\dataset-1\training-english\sentences.tsv", delimiter='\t')
labels_train = pd.read_csv(r"C:\Users\Praveen\Desktop\SEM-6\DL\dataset-1\training-english\labels.tsv", delimiter='\t')
labels_train.replace(0.5, 0, inplace=True)

In [4]:
labels_train.head()

Unnamed: 0,Text-ID,Sentence-ID,Self-direction: thought attained,Self-direction: thought constrained,Self-direction: action attained,Self-direction: action constrained,Stimulation attained,Stimulation constrained,Hedonism attained,Hedonism constrained,...,Benevolence: caring attained,Benevolence: caring constrained,Benevolence: dependability attained,Benevolence: dependability constrained,Universalism: concern attained,Universalism: concern constrained,Universalism: nature attained,Universalism: nature constrained,Universalism: tolerance attained,Universalism: tolerance constrained
0,BG_002,1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,BG_002,2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BG_002,3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,BG_002,4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,BG_002,5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [5]:
args_test = pd.read_csv(r"C:\Users\Praveen\Desktop\SEM-6\DL\dataset-1\validation-english\sentences.tsv", delimiter='\t')
labels_test = pd.read_csv(r"C:\Users\Praveen\Desktop\SEM-6\DL\dataset-1\validation-english\labels.tsv", delimiter='\t')
labels_test.replace(0.5, 0, inplace=True)

In [6]:
args_test.head()

Unnamed: 0,Text-ID,Sentence-ID,Text
0,EN_003,1,PM
1,EN_003,2,Orban: Gradual reopening may begin after Easter
2,EN_003,3,"The faster we vaccinate, the sooner the number..."
3,EN_003,4,This is why it is important not to be deceived...
4,EN_003,5,The prime minister said he expects that the nu...


In [7]:
print(labels_test.shape)
print(labels_train.shape)

(14904, 40)
(44758, 40)


# SVM

<a id='section03'></a>
# Text Preprocessing with NLTK

In [8]:
import re
import nltk
import string
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
from nltk.corpus import stopwords

# Lemmatizer, punctuation set, and stop-word list used by preprocess()
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
#REPLACE_BY_SPACE_RE = re.compile('[/(){}\[\]\|@,;]')
PUNCTUATION = string.punctuation
BAD_SYMBOLS_RE = re.compile('[^a-z ]')
STOPWORDS = set(stopwords.words('english'))

def preprocess(text):
    """
        text: a string

        return: lowercased, cleaned, stop-word-free, lemmatized string
    """
    text = text.lower() # lowercase text
    #text = REPLACE_BY_SPACE_RE.sub(' ', text) # replace REPLACE_BY_SPACE_RE symbols by space in text
    text = BAD_SYMBOLS_RE.sub('', text) # delete every character that is not a-z or a space
    text = ' '.join(word for word in text.split() if word not in PUNCTUATION) # drop leftover punctuation tokens
    text = ' '.join(word for word in text.split() if word not in STOPWORDS) # delete stopwords from text
    text = " ".join(lemmatizer.lemmatize(word) for word in nltk.word_tokenize(text)) # lemmatize each token
    return text

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Praveen\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Praveen\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\Praveen\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Praveen\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


`Conclusion` is not taken into account, since it adds no information beyond what is already stated in `Premise`. Using a smaller input also simplifies the model, which may help limit overfitting.

In [9]:
train = args_train.copy()
train['preprocessed'] = args_train['Text'].apply(preprocess)

In [10]:
train['preprocessed'][0]

'hispanic voter losing faith democratic party poll'

In [11]:
test = args_test.copy()
test['preprocessed'] = args_test['Text'].apply(preprocess)

In [12]:
train

Unnamed: 0,Text-ID,Sentence-ID,Text,preprocessed
0,EN_001,1,Hispanic Voters Are Losing Faith In The Democr...,hispanic voter losing faith democratic party poll
1,EN_001,2,The support of Hispanic voters at the midterms...,support hispanic voter midterm later year coul...
2,EN_001,3,U.S. President Joe Biden speaks to employees a...,u president joe biden speaks employee lockheed...
3,EN_001,4,(Julie Bennett/Getty Images) According to a Qu...,julie bennettgetty image according quinnipiac ...
4,EN_001,5,This marks the lowest approval rating of any d...,mark lowest approval rating demographic group
...,...,...,...,...
44753,TR_M_022,12,"The rent economy, which provides easy profits ...",rent economy provides easy profit provide empl...
44754,TR_M_022,13,"The tax base will be broadened, the tax burden...",tax base broadened tax burden poor eased trans...
44755,TR_M_022,14,Supportive arrangements will be made to increa...,supportive arrangement made increase social we...
44756,TR_M_022,15,In order to lift out of poverty those who are ...,order lift poverty affected poverty low level ...
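
The token `bennettgetty` in row 3 above is a side effect of deleting disallowed characters outright: the `/` between the two words disappears and the words fuse. The sketch below compares that behaviour with replacing such characters by a space. The sample string is a hypothetical sentence modelled on that row, and the space-replacing variant is only an alternative to consider, not what the notebook uses.

```python
import re

BAD_SYMBOLS_RE = re.compile('[^a-z ]')  # same pattern as in preprocess()

# Hypothetical sample modelled on row 3 of the training data
sample = "(Julie Bennett/Getty Images) According to a poll"

# Deleting the symbols fuses words that were separated only by punctuation
deleted = BAD_SYMBOLS_RE.sub('', sample.lower())

# Replacing by a space keeps the words apart; split/join collapses extra spaces
spaced = ' '.join(BAD_SYMBOLS_RE.sub(' ', sample.lower()).split())

print(deleted)
print(spaced)
```

Replacing by a space would also split contractions and hyphenated words, so which variant works better for the classifier is an empirical question.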


In [13]:
train = train.drop(columns= train.columns[1:3])
train

Unnamed: 0,Text-ID,preprocessed
0,EN_001,hispanic voter losing faith democratic party poll
1,EN_001,support hispanic voter midterm later year coul...
2,EN_001,u president joe biden speaks employee lockheed...
3,EN_001,julie bennettgetty image according quinnipiac ...
4,EN_001,mark lowest approval rating demographic group
...,...,...
44753,TR_M_022,rent economy provides easy profit provide empl...
44754,TR_M_022,tax base broadened tax burden poor eased trans...
44755,TR_M_022,supportive arrangement made increase social we...
44756,TR_M_022,order lift poverty affected poverty low level ...


In [14]:
test = test.drop(columns= test.columns[1:3])
test

Unnamed: 0,Text-ID,preprocessed
0,EN_003,pm
1,EN_003,orban gradual reopening may begin easter
2,EN_003,faster vaccinate sooner number infection case ...
3,EN_003,important deceived promote antivaccination vie...
4,EN_003,prime minister said expects number vaccinated ...
...,...,...
14899,TR_M_024,quality lowcost energy supply realized
14900,TR_M_024,public energy investment realized planned stab...
14901,TR_M_024,public enterprise engaged exploration extracti...
14902,TR_M_024,energy planning international agreement securi...


In [15]:
#train.drop(columns=['Sentence-ID'], inplace=True)
train

Unnamed: 0,Text-ID,preprocessed
0,EN_001,hispanic voter losing faith democratic party poll
1,EN_001,support hispanic voter midterm later year coul...
2,EN_001,u president joe biden speaks employee lockheed...
3,EN_001,julie bennettgetty image according quinnipiac ...
4,EN_001,mark lowest approval rating demographic group
...,...,...
44753,TR_M_022,rent economy provides easy profit provide empl...
44754,TR_M_022,tax base broadened tax burden poor eased trans...
44755,TR_M_022,supportive arrangement made increase social we...
44756,TR_M_022,order lift poverty affected poverty low level ...


In [16]:
labels_train.drop(columns=['Sentence-ID'], inplace=True)
labels_train

Unnamed: 0,Text-ID,Self-direction: thought attained,Self-direction: thought constrained,Self-direction: action attained,Self-direction: action constrained,Stimulation attained,Stimulation constrained,Hedonism attained,Hedonism constrained,Achievement attained,...,Benevolence: caring attained,Benevolence: caring constrained,Benevolence: dependability attained,Benevolence: dependability constrained,Universalism: concern attained,Universalism: concern constrained,Universalism: nature attained,Universalism: nature constrained,Universalism: tolerance attained,Universalism: tolerance constrained
0,BG_002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,BG_002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,BG_002,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,BG_002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,BG_002,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
44753,TR_M_022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44754,TR_M_022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44755,TR_M_022,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
44756,TR_M_022,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [17]:
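# Collect the 38 label columns (everything after Text-ID) into one list per row
df = labels_train.copy()
df['list'] = df[df.columns[1:]].values.tolist()
df_train = train[["preprocessed"]].copy()
feature_name = "preprocessed_text"
df_train.columns = [feature_name]
# NOTE: this assignment matches rows by index position, so `train` and
# `labels_train` must list the sentences in the same order
df_train['list'] = df['list']
df_train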

Unnamed: 0,preprocessed_text,list
0,hispanic voter losing faith democratic party poll,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,support hispanic voter midterm later year coul...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,u president joe biden speaks employee lockheed...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ..."
3,julie bennettgetty image according quinnipiac ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,mark lowest approval rating demographic group,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...
44753,rent economy provides easy profit provide empl...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
44754,tax base broadened tax burden poor eased trans...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
44755,supportive arrangement made increase social we...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ..."
44756,order lift poverty affected poverty low level ...,"[0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


In [18]:
print(len(df_train.iloc[0].list))

38


In [19]:
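# Collect the 38 label columns (everything after Text-ID and Sentence-ID) into one list per row
df = labels_test.copy()
df['list'] = df[df.columns[2:]].values.tolist()
df_test = test[["preprocessed"]].copy()
feature_name = "preprocessed_text"
df_test.columns = [feature_name]
# NOTE: this assignment matches rows by index position, so `test` and
# `labels_test` must list the sentences in the same order
df_test['list'] = df['list']
df_test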

Unnamed: 0,preprocessed_text,list
0,pm,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
1,orban gradual reopening may begin easter,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
2,faster vaccinate sooner number infection case ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
3,important deceived promote antivaccination vie...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
4,prime minister said expects number vaccinated ...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
...,...,...
14899,quality lowcost energy supply realized,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
14900,public energy investment realized planned stab...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
14901,public enterprise engaged exploration extracti...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."
14902,energy planning international agreement securi...,"[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ..."


# TF-IDF VECTORIZATION

In [20]:
print(len(df_test.iloc[0].list))

38


In [21]:
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC, SVC
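
As a quick illustration of what the TF-IDF step produces, the sketch below vectorizes three preprocessed sentences. The first two are taken from the training preview above; the third is a shortened, hypothetical variant of another row. It uses the same `max_features` value as the pipeline defined later, which has no effect on such a small vocabulary.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "hispanic voter losing faith democratic party poll",
    "mark lowest approval rating demographic group",
    "tax base broadened tax burden poor eased",  # shortened, hypothetical
]

vectorizer = TfidfVectorizer(max_features=10000)
X = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(X.shape)

# Each row is L2-normalised by default; a word repeated inside a document
# ("tax" in the third one) receives the largest weight in that row.
row = X.toarray()[2]
tax_col = vectorizer.vocabulary_["tax"]
print(round(float(np.linalg.norm(row)), 3), row.argmax() == tax_col)
```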

In [22]:
categories = labels_train.columns[1:]
len(categories)

38

In [47]:
X_train, Y_train = train.preprocessed, df_train.list
X_test, Y_test = test.preprocessed, df_test.list

In [48]:
X_train, Y_train = np.stack(X_train), np.stack(Y_train)
X_test, Y_test = np.stack(X_test), np.stack(Y_test)

In [49]:
# import numpy as np

# # Assuming X_train, Y_train, X_test, Y_test are your original datasets

# # Calculate the number of samples to select for the 10% subset
# subset_size_train = int(0.1 * len(X_train))
# subset_size_test = int(0.1 * len(X_test))

# # Randomly select 10% of the data
# subset_indices_train = np.random.choice(len(X_train), subset_size_train, replace=False)
# subset_indices_test = np.random.choice(len(X_test), subset_size_test, replace=False)

# # Create subsets
# X_train , Y_train = X_train[subset_indices_train], Y_train[subset_indices_train]
# X_test , Y_test = X_test[subset_indices_test], Y_test[subset_indices_test]

# X_train, Y_train = np.stack(X_train), np.stack(Y_train)
# X_test, Y_test = np.stack(X_test), np.stack(Y_test)


In [50]:
print("X train:", X_train.shape, "\nY train:", Y_train.shape, "\n")
print("X test:", X_test.shape, "\nY test:", Y_test.shape, "\n")

X train: (44758,) 
Y train: (44758, 38) 

X test: (14904,) 
Y test: (14904, 38) 



Definition of useful variables and functions for Error Analysis

In [51]:
# test_res = args_test.copy()
# test_res = test_res.merge(labels_test)
# test_res['preprocessed_premise'] = args_test['Premise'].apply(lambda text: preprocess(text))
# test_res['preprocessed_conclusion'] = args_test['Conclusion'].apply(lambda text: preprocess(text))
# test_res = test_res.drop(columns= test_res.columns[1:4])
# test_res.columns

# labels_res = list(labels_test.columns[1:])


`TfidfComparison` computes a dataframe of size `labels x frequency` that shows, for each label, the TF-IDF scores of the most frequent words of a selected label.
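
Read off from the function body below, the score assigned to a word $w$ for a class $c$ can be written as follows (the symbols are notation introduced here, not names used in the code):

$$
\mathrm{score}(w, c) = \left| \frac{n_c(w)}{\sum_{w'} n_c(w')} \cdot \log_{10} \frac{|L|}{\mathrm{df}(w)} \right|
$$

Here $n_c(w)$ is the number of occurrences of $w$ in the texts of class $c$ (for classes other than the selected one, only texts that do not also carry the selected label are counted), $|L|$ is the number of labels, and $\mathrm{df}(w)$ is the number of labels whose texts contain $w$. The score is set to 0 when $w$ does not occur in class $c$.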


In [52]:
def TfidfComparison(dataset, label, labels, frequency=15):
  """Class-wise TF-IDF of the `frequency` most frequent words of `label`, for every label in `labels`."""

  def texts_where(mask):
    # conclusions and premises have already been preprocessed
    rows = dataset[mask]
    return list(rows['preprocessed_conclusion'] + ' ' + rows['preprocessed_premise'])

  def tokenize(texts):
    return re.sub(r'\W', ' ', ' '.join(texts).lower()).split(' ')

  # selection of the most frequent words for class 'label'
  words_freq_dict = dict(Counter(tokenize(texts_where(dataset[label] == 1))))
  sorted_words_freq_dict = sorted(words_freq_dict.items(), key=lambda x: x[1], reverse=True)[:frequency]
  most_frequent_words = [w for w, _ in sorted_words_freq_dict]

  # Document Frequency of each word, computed class-wise (one "document" per label)
  # because the number of examples is unbalanced among the classes
  label_vocab = {l: set(tokenize(texts_where(dataset[l] == 1))) for l in labels}
  df = {word: sum(word in label_vocab[l] for l in labels) for word in most_frequent_words}

  # computation of TF-IDF for each class over the previously selected words
  tfidf_rows = []
  for l in labels:
    if l == label:
      counts = words_freq_dict
    else:
      # texts labelled with `l` but not with `label`
      counts = dict(Counter(tokenize(texts_where((dataset[label] == 0) & (dataset[l] == 1)))))
    total = sum(counts.values())
    dict_tmp = {'ColumnName': l}
    for word in most_frequent_words:
      try:
        dict_tmp[word] = abs(counts[word] / total * math.log(len(labels) / df[word], 10))
      except (KeyError, ZeroDivisionError):
        dict_tmp[word] = 0  # word absent from this class
    tfidf_rows.append(dict_tmp)

  dataframe = pd.DataFrame(tfidf_rows)
  return dataframe.style.background_gradient(cmap='Reds', subset=dataframe.columns[1:])



# Model Definition and Training

In [53]:
from sklearn.metrics import accuracy_score, f1_score, classification_report
from sklearn.linear_model import SGDClassifier
from sklearn.decomposition import PCA, TruncatedSVD
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import SVC

SVC_pipeline = Pipeline([
                ('vector', TfidfVectorizer(max_features=10000)),
                ('svd', TruncatedSVD(n_components=600)),
                ('clf', OneVsRestClassifier(SVC(C=18, kernel='rbf', gamma = 0.01,class_weight='balanced', max_iter=1000000, random_state=42), n_jobs=1))
                #('clf', OneVsRestClassifier(SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, class_weight = 'balanced', random_state=42, max_iter=10000), n_jobs=1)),
            ])

SVC_pipeline.fit(X_train, Y_train)

Pipeline(steps=[('vector', TfidfVectorizer(max_features=10000)),
                ('svd', TruncatedSVD(n_components=600)),
                ('clf',
                 OneVsRestClassifier(estimator=SVC(C=18,
                                                   class_weight='balanced',
                                                   gamma=0.01, max_iter=1000000,
                                                   random_state=42),
                                     n_jobs=1))])
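
To see how the three stages interact without waiting for the full training run, the sketch below fits the same kind of pipeline on a tiny corpus. The texts and labels are hypothetical, and `n_components` is reduced to 3 because the toy vocabulary is small. It is only meant to show the data flow: raw strings in, one binary decision per label out.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus with two labels (columns): "nature", "security"
texts = [
    "forest river wildlife protection",
    "protect forest wildlife habitat",
    "police safety crime prevention",
    "crime safety street police",
    "wildlife forest crime police",
    "river habitat protection nature",
]
Y = np.array([[1, 0], [1, 0], [0, 1], [0, 1], [1, 1], [1, 0]])

toy_pipeline = Pipeline([
    ('vector', TfidfVectorizer()),                          # text -> sparse TF-IDF
    ('svd', TruncatedSVD(n_components=3, random_state=42)), # sparse -> dense, low-dim
    ('clf', OneVsRestClassifier(                            # one binary SVC per label
        SVC(C=18, kernel='rbf', gamma=0.01,
            class_weight='balanced', random_state=42))),
])

toy_pipeline.fit(texts, Y)
pred = toy_pipeline.predict(texts)
print(pred.shape)  # one row per text, one column per label
```

`OneVsRestClassifier` trains an independent classifier per label column, which is what makes the setup multilabel: a text can receive any combination of labels, including none.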

# Evaluation


In [54]:
# TRAIN
prediction_train = SVC_pipeline.predict(X_train)
print(classification_report(Y_train, prediction_train, target_names = categories, zero_division=1))

                                        precision    recall  f1-score   support

      Self-direction: thought attained       0.03      0.85      0.06       447
   Self-direction: thought constrained       0.03      0.96      0.06        80
       Self-direction: action attained       0.06      0.68      0.11      1256
    Self-direction: action constrained       0.02      0.93      0.05       261
                  Stimulation attained       0.05      0.73      0.10       971
               Stimulation constrained       0.03      0.96      0.05       126
                     Hedonism attained       0.03      0.95      0.06       264
                  Hedonism constrained       0.03      0.93      0.05       107
                  Achievement attained       0.08      0.66      0.15      1925
               Achievement constrained       0.05      0.76      0.09       797
             Power: dominance attained       0.07      0.68      0.13      1487
          Power: dominance constrained 

In [55]:
# TEST
prediction = SVC_pipeline.predict(X_test)
print(classification_report(Y_test, prediction, target_names = categories, zero_division=1))

                                        precision    recall  f1-score   support

      Self-direction: thought attained       0.01      0.36      0.02       133
   Self-direction: thought constrained       0.00      0.03      0.00        29
       Self-direction: action attained       0.03      0.43      0.06       370
    Self-direction: action constrained       0.01      0.28      0.02       100
                  Stimulation attained       0.04      0.47      0.07       366
               Stimulation constrained       0.00      0.11      0.00        36
                     Hedonism attained       0.01      0.31      0.02        72
                  Hedonism constrained       0.00      0.19      0.01        27
                  Achievement attained       0.06      0.44      0.10       676
               Achievement constrained       0.02      0.42      0.04       227
             Power: dominance attained       0.05      0.53      0.09       440
          Power: dominance constrained 

# Error Analysis

Here, we want to know which classes have been confused with which other classes, and how frequently. To do this, we use a multi-label confusion matrix, rendered with seaborn's heatmap. For example, we can see that `Universalism: nature` has low values, since the model manages to distinguish it from the others fairly well (as also confirmed by the TF-IDF analysis).

In [None]:
# mlcm.cm expects the true labels first, then the predictions
conf_mat, normal_conf_mat = mlcm.cm(Y_test, prediction)
axis_labels = list(categories) + ['noclass']  # extra row/column for "no class"
df_cm = pd.DataFrame(conf_mat, index=axis_labels, columns=axis_labels)
plt.figure(figsize = (20,14))
sn.heatmap(df_cm, annot=True)
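
As a complementary view, scikit-learn offers `multilabel_confusion_matrix`, which returns one 2x2 matrix per label instead of a single label-versus-label matrix. It does not show which label was confused with which, but it is a quick way to check false positives and false negatives per class. The sketch below uses hypothetical toy arrays rather than the notebook's predictions.

```python
import numpy as np
from sklearn.metrics import multilabel_confusion_matrix

# Hypothetical toy data: 3 samples, 3 labels
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

# Shape (n_labels, 2, 2); each matrix is laid out as [[TN, FP], [FN, TP]]
per_label_cm = multilabel_confusion_matrix(y_true, y_pred)

for name, cm in zip(["label 0", "label 1", "label 2"], per_label_cm):
    tn, fp, fn, tp = cm.ravel()
    print(f"{name}: TN={tn} FP={fp} FN={fn} TP={tp}")
```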

We found it useful to analyse the TF-IDF scores that words obtain across all the labels, since the SVM is trained on TF-IDF features. In general we observed that low support usually coincides with a low F1-score, but some anomalies caught our attention.
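
One way to look at support and F1 side by side is `precision_recall_fscore_support` with `average=None`, which returns one value per label. The sketch below runs on hypothetical toy arrays and borrows three label names from the list in the introduction purely for readability; the numbers are not results from this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import precision_recall_fscore_support

# Hypothetical toy data: 3 samples, 3 labels
label_names = ["Security: personal", "Universalism: nature", "Conformity: interpersonal"]
y_true = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 1]])

# average=None keeps the per-label values instead of aggregating them
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, average=None, zero_division=0)

per_label = pd.DataFrame(
    {"f1": f1, "support": support}, index=label_names
).sort_values("support")
print(per_label)
```

Sorting by support makes it easy to spot labels whose F1 is unusually high or low for their number of examples.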


`Conformity: interpersonal` obtains the lowest F1-score (0.1) and also has the lowest support (60). The matrix clearly shows that the most frequent words for this label are often shared with many other labels (they obtain a good TF-IDF score for other labels as well).

In [None]:
TfidfComparison(test_res, 'Conformity: interpersonal', labels_res, frequency=15)

`Universalism: nature` has low support (127) but achieves one of the better F1 scores (0.46). As can be seen in the matrix, its words have a strong TF-IDF score only for `Universalism: nature`, making them good identifiers for this class.

In [None]:
TfidfComparison(test_res, 'Universalism: nature', labels_res, frequency=15)

`Universalism: tolerance`, despite having good support (223), achieves mediocre results (0.26). This is because its most frequent words are also important for other labels, making it hard to distinguish from them.

In [None]:
TfidfComparison(test_res, 'Universalism: tolerance', labels_res, frequency=15)

`Security: personal`: high support and high F1 score. Here we can see that its frequent words are good identifiers, since their TF-IDF scores are high and differ from those of the other classes.

In [None]:
TfidfComparison(test_res, 'Security: personal', labels_res, frequency=15)

`Benevolence: dependability`: we can see that `Security: personal` also has high TF-IDF scores for its frequent words. The multi-label confusion matrix shows that `Benevolence: dependability` is often misclassified as `Security: personal`.

In [None]:
TfidfComparison(test_res, 'Benevolence: dependability', labels_res, frequency=15)