## Language Analysis of Alexithymic Discourse

<hr>

Alexithymic Language Project / raul@psicobotica.com / V2 release (sept 2020)

<hr>

### Lexicosemantic Analysis

Here we review the words most frequently used by participants, taking into account the part of speech (PoS) and the semantics associated with each term.

- Three corpora considered: all, non-alexithymic, alexithymic. 
- Most frequent nouns. 
- Most frequent adjectives. 
- Most frequent verbs. 

<hr>

[Explanation of Lexical Semantics](https://en.wikipedia.org/wiki/Lexical_semantics)

[List of PoS tags (Spanish)](https://universaldependencies.org/docs/es/pos/)

## Load features dataset
- Data is already pre-processed (1-Preprocessing). 
- Basic NLP features are already calculated (2-Features). 
- Some additional BoW features have been added (3-BoW).
- Some additional TF/IDF features have been added (3-TFIDF).
- N-Gram models have been generated (3-N-Grams). 

In [1]:
import pandas as pd 
import numpy as np
import ast
import heapq
import nltk

In [2]:
feats_dataset_path = "https://raw.githubusercontent.com/raul-arrabales/alexithymic-lang/master/data/Prolexitim_v2_features_3.csv"
alex_df = pd.read_csv(feats_dataset_path, header=0, delimiter=";")

In [3]:
alex_df.columns

Index(['Code', 'TAS20', 'F1', 'F2', 'F3', 'Gender', 'Age', 'Card',
       'T_Metaphors', 'T_ToM', 'T_FP', 'T_Interpret', 'T_Desc', 'T_Confussion',
       'Text', 'Alex_A', 'Alex_B', 'Words', 'Sentences', 'Tokens',
       'Tokens_Stop', 'Tokens_Stem_P', 'Tokens_Stem_S', 'POS', 'NER', 'DEP',
       'Lemmas_CNLP', 'Lemmas_Spacy', 'Chars', 'avgWL', 'avgSL', 'Pun_Count',
       'Stop_Count', 'RawTokens', 'Title_Count', 'Upper_Count', 'PRON_Count',
       'DET_Count', 'ADV_Count', 'VERB_Count', 'PROPN_Count', 'NOUN_Count',
       'NUM_Count', 'PUNCT_Count', 'SYM_Count', 'SCONJ_Count', 'CCONJ_Count',
       'INTJ_Count', 'AUX_Count', 'ADP_Count', 'ADJ_Count', 'PRON_Ratio',
       'DET_Ratio', 'ADV_Ratio', 'VERB_Ratio', 'PROPN_Ratio', 'NOUN_Ratio',
       'NUM_Ratio', 'PUNCT_Ratio', 'SYM_Ratio', 'SCONJ_Ratio', 'CCONJ_Ratio',
       'INTJ_Ratio', 'AUX_Ratio', 'ADP_Ratio', 'ADJ_Ratio', 'TTR', 'HTR',
       'BoW_PCA_1', 'BoW_PCA_2', 'BoW_PCA_3', 'TFIDF_PCA_1', 'TFIDF_PCA_2',
       'TFIDF_PCA_3']

In [4]:
alex_df.head()

| | Code | TAS20 | F1 | F2 | F3 | Gender | Age | Card | T_Metaphors | T_ToM | ... | ADP_Ratio | ADJ_Ratio | TTR | HTR | BoW_PCA_1 | BoW_PCA_2 | BoW_PCA_3 | TFIDF_PCA_1 | TFIDF_PCA_2 | TFIDF_PCA_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 1 | 0 | 1 | ... | 0.125 | 0.0 | 0.5625 | 0.875 | 0.429786 | -0.056197 | -0.360772 | -0.11487 | 0.168706 | 0.031455 |
| 1 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 13HM | 0 | 1 | ... | 0.0 | 0.0 | 0.857143 | 1.0 | -0.535592 | 0.971355 | -0.133005 | 0.867802 | 0.301337 | 0.165452 |
| 2 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 1 | 0 | 1 | ... | 0.103448 | 0.172414 | 0.344828 | 0.793103 | 0.713317 | -0.012597 | -0.255988 | -0.089725 | 0.143005 | 0.031664 |
| 3 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 9VH | 0 | 1 | ... | 0.208333 | 0.083333 | 0.458333 | 0.875 | -0.28032 | -0.445467 | 0.372081 | -0.019208 | -0.07631 | -0.093545 |
| 4 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 13HM | 0 | 1 | ... | 0.1 | 0.2 | 0.9 | 1.0 | -0.539096 | 0.998465 | -0.135003 | 0.393093 | 0.108074 | 0.043623 |


## Preparing the corpora
Let's build three corpora: one global, one with "alexithymic language", and one with "non-alexithymic language". We only need the PoS tagging of each document.
- AllDocs will contain the documents from all participants. 
- AlexDocs will contain the documents from TAS-20 positive participants. 
- NoAlexDocs will contain the documents from TAS-20 negative participants. 

In [13]:
# The POS feature contains the PoS tagging for each document: 
print("Doc: " + alex_df['Text'][0])
print()
print("PoS Tagged: " + alex_df['POS'][0])

Doc: es un niño pensando en cual es la respuesta de sus deberes porque no la sabe.

PoS Tagged: [('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB'), ('en', 'ADP'), ('cual', 'PRON'), ('es', 'AUX'), ('la', 'DET'), ('respuesta', 'NOUN'), ('de', 'ADP'), ('sus', 'DET'), ('deberes', 'NOUN'), ('porque', 'SCONJ'), ('no', 'ADV'), ('la', 'PRON'), ('sabe', 'VERB'), ('.', 'PUNCT')]
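
Because the dataset is loaded from a CSV file, each entry of the `POS` column is a string (which is why it can be concatenated with `+` in the cell above), not a list. A minimal sketch of how such a string can be turned back into a list of `(word, tag)` tuples with `ast.literal_eval`; the variable names are illustrative and the sample is a shortened copy of the document printed above:

```python
import ast

# String form of a PoS-tagged document, shortened from the sample printed above
pos_as_text = "[('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB')]"

# literal_eval only parses Python literals, so it does not execute arbitrary code
pos_tuples = ast.literal_eval(pos_as_text)

# Keep only the words tagged as nouns
nouns = [word for word, tag in pos_tuples if tag == 'NOUN']

print(type(pos_tuples).__name__, len(pos_tuples), nouns)
```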


In [10]:
AllDocs = alex_df['POS']
AlexDocs = alex_df[alex_df.Alex_A == 1]['POS']
NoAlexDocs = alex_df[alex_df.Alex_A == 0]['POS']

## Compute most frequent words per grammatical function
- Verbs. 
- Auxiliary verbs. 
- Nouns. 
- Proper nouns. 
- Adjectives. 
- Adverbs. 
- Subordinating conjunctions. 

In [21]:
# Extracts the words with a given PoS tag from a collection of PoS-tagged documents
def get_PoS_Dict(corpus, PoS_tag):
    """
    Parameters
    ----------
    corpus : series of str
        Documents to be analyzed. Each item is the string representation
        of a list of (word, POS_tag) tuples.
    PoS_tag : str
        The specific POS tag that we want to extract.
    
    Returns
    -------
    words_sorted : list of (word, frequency) tuples
        Sorted by frequency in the corpus (highest first).
        
    """
    
    words_dict = {}
    
    for PoSList in corpus:  # For each PoS-tagged doc
        tag_list = ast.literal_eval(PoSList)    # Parse the string into a list of tuples
        for word, tag in tag_list: 
            if tag == PoS_tag: 
                words_dict[word] = words_dict.get(word, 0) + 1
    
    # Sort by frequency (higher first)
    words_sorted = sorted(words_dict.items(), key=lambda item: item[1], reverse=True)
        
    return words_sorted
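
As a quick check of the counting logic, the same idea can be sketched with `collections.Counter` and run on a tiny corpus. Both `count_pos_words` and the two toy documents are made up for illustration and are not part of the project code or data:

```python
import ast
from collections import Counter

def count_pos_words(corpus, pos_tag):
    """Hypothetical helper: (word, frequency) pairs for one PoS tag, most frequent first."""
    counts = Counter()
    for pos_text in corpus:
        # Each document is stored as the string form of a list of (word, tag) tuples
        for word, tag in ast.literal_eval(pos_text):
            if tag == pos_tag:
                counts[word] += 1
    return counts.most_common()

# Toy documents (invented), in the same string format as the POS column
toy_corpus = [
    "[('el', 'DET'), ('niño', 'NOUN'), ('mira', 'VERB'), ('el', 'DET'), ('violín', 'NOUN')]",
    "[('un', 'DET'), ('niño', 'NOUN'), ('piensa', 'VERB')]",
]

toy_nouns = count_pos_words(toy_corpus, 'NOUN')
print(toy_nouns)  # [('niño', 2), ('violín', 1)]
```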

In [32]:
# Get ordered lists of interest

# Nouns
all_nouns = get_PoS_Dict(AllDocs, 'NOUN')
alex_nouns = get_PoS_Dict(AlexDocs, 'NOUN')
noalex_nouns = get_PoS_Dict(NoAlexDocs, 'NOUN')

# Verbs
all_verbs = get_PoS_Dict(AllDocs, 'VERB')
alex_verbs = get_PoS_Dict(AlexDocs, 'VERB')
noalex_verbs = get_PoS_Dict(NoAlexDocs, 'VERB')

# Adjectives
all_adjectives = get_PoS_Dict(AllDocs, 'ADJ')
alex_adjectives = get_PoS_Dict(AlexDocs, 'ADJ')
noalex_adjectives = get_PoS_Dict(NoAlexDocs, 'ADJ')

# Subordinating conjunctions
all_sconj = get_PoS_Dict(AllDocs, 'SCONJ')
alex_sconj = get_PoS_Dict(AlexDocs, 'SCONJ')
noalex_sconj = get_PoS_Dict(NoAlexDocs, 'SCONJ')

# Adverbs
all_adverbs = get_PoS_Dict(AllDocs, 'ADV')
alex_adverbs = get_PoS_Dict(AlexDocs, 'ADV')
noalex_adverbs = get_PoS_Dict(NoAlexDocs, 'ADV')

# Auxiliary verbs
all_aux = get_PoS_Dict(AllDocs, 'AUX')
alex_aux = get_PoS_Dict(AlexDocs, 'AUX')
noalex_aux = get_PoS_Dict(NoAlexDocs, 'AUX')

# Proper nouns
all_proper = get_PoS_Dict(AllDocs, 'PROPN')
alex_proper = get_PoS_Dict(AlexDocs, 'PROPN')
noalex_proper = get_PoS_Dict(NoAlexDocs, 'PROPN')
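
The calls above parse every document once per tag. As an alternative sketch, all tags of interest can be counted in a single pass per corpus and stored in a nested dictionary. The helper `count_by_tag`, the constant `TAGS_OF_INTEREST`, and the toy corpora below are assumptions made for illustration, not part of the original notebook:

```python
import ast
from collections import Counter, defaultdict

# PoS tags extracted in the cell above
TAGS_OF_INTEREST = {'NOUN', 'VERB', 'ADJ', 'SCONJ', 'ADV', 'AUX', 'PROPN'}

def count_by_tag(corpus, tags=TAGS_OF_INTEREST):
    """Hypothetical helper: one pass over the corpus, one Counter per PoS tag."""
    by_tag = defaultdict(Counter)
    for pos_text in corpus:
        for word, tag in ast.literal_eval(pos_text):
            if tag in tags:
                by_tag[tag][word] += 1
    return by_tag

# Toy corpora (invented), in the same string format as the POS column
toy_corpora = {
    'alex': ["[('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN')]"],
    'noalex': [
        "[('niño', 'NOUN'), ('pensando', 'VERB'), ('porque', 'SCONJ')]",
        "[('niño', 'NOUN'), ('triste', 'ADJ')]",
    ],
}

# freqs[corpus_name][tag] is a Counter of words for that corpus and tag
freqs = {name: count_by_tag(docs) for name, docs in toy_corpora.items()}
print(freqs['noalex']['NOUN'].most_common())
```

With this layout, a lookup such as `freqs['alex']['SCONJ'].most_common()` would play the role of `alex_sconj`, at the cost of keeping everything in one structure instead of separate variables.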

## Are there significant differences between the alex and noalex groups?

In [37]:
noalex_sconj

[('que', 181),
 ('porque', 57),
 ('como', 42),
 ('cuando', 23),
 ('mientras', 20),
 ('si', 17),
 ('pues', 11),
 ('aunque', 7),
 ('!!!', 1),
 ('según', 1)]

In [38]:
alex_sconj

[('que', 33),
 ('porque', 7),
 ('si', 6),
 ('como', 5),
 ('cuando', 3),
 ('aunque', 2),
 ('mientras', 1)]
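
The two lists have very different sizes, so raw counts are not directly comparable. The sketch below, using only the counts shown above, converts them to relative frequencies and runs a Pearson chi-square test on a 2x2 table ("que" vs. any other SCONJ token, by group). The helper names are hypothetical, and the test treats tokens as independent observations, which is only an approximation here because each participant contributes several documents. Note also that the `'!!!'` entry is most likely a tagging error.

```python
import math

# Counts copied from the two outputs above
noalex_sconj = [('que', 181), ('porque', 57), ('como', 42), ('cuando', 23),
                ('mientras', 20), ('si', 17), ('pues', 11), ('aunque', 7),
                ('!!!', 1), ('según', 1)]
alex_sconj = [('que', 33), ('porque', 7), ('si', 6), ('como', 5),
              ('cuando', 3), ('aunque', 2), ('mientras', 1)]

def relative_freqs(pairs):
    """Convert (word, count) pairs into {word: share of all tokens in the list}."""
    total = sum(count for _, count in pairs)
    return {word: count / total for word, count in pairs}

def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

alex_rel = relative_freqs(alex_sconj)
noalex_rel = relative_freqs(noalex_sconj)
for word in ('que', 'porque', 'como'):
    print(f"{word}: alex={alex_rel[word]:.3f} noalex={noalex_rel[word]:.3f}")

# 2x2 table: rows = group, columns = ('que', any other SCONJ token)
alex_total = sum(count for _, count in alex_sconj)
noalex_total = sum(count for _, count in noalex_sconj)
alex_que = dict(alex_sconj)['que']
noalex_que = dict(noalex_sconj)['que']
chi2, p = chi2_2x2(alex_que, alex_total - alex_que,
                   noalex_que, noalex_total - noalex_que)
print(f"'que' vs. other SCONJ: chi2={chi2:.2f}, p={p:.3f}")
```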