## Language Analysis of Alexithymic Discourse

<hr>

Alexithymic Language Project / raul@psicobotica.com / V2 release (sept 2020)

<hr>

### Lexicosemantics Analysis

Here we review the words most frequently used by participants, taking into account the part of speech (PoS) and the semantics associated with each term.

- Three corpora considered: all, non-alexithymic, alexithymic. 
- Most frequent nouns. 
- Most frequent adjectives. 
- Most frequent verbs. 

<hr>

[Explanation of Lexical Semantics](https://en.wikipedia.org/wiki/Lexical_semantics)

[List of PoS tags (Spanish)](https://universaldependencies.org/docs/es/pos/)

## Load features dataset
- Data is already pre-processed (1-Preprocessing). 
- Basic NLP features are already calculated (2-Features). 
- Some additional BoW features have been added (3-BoW).
- Some additional TF/IDF features have been added (3-TFIDF).
- N-Gram models have been generated (3-N-Grams). 

In [1]:
import pandas as pd 
import numpy as np
import ast
import heapq
import nltk

In [2]:
feats_dataset_path = "https://raw.githubusercontent.com/raul-arrabales/alexithymic-lang/master/data/Prolexitim_v2_features_3.csv"
alex_df = pd.read_csv(feats_dataset_path, header=0, delimiter=";")

In [3]:
alex_df.columns

Index(['Code', 'TAS20', 'F1', 'F2', 'F3', 'Gender', 'Age', 'Card',
       'T_Metaphors', 'T_ToM', 'T_FP', 'T_Interpret', 'T_Desc', 'T_Confussion',
       'Text', 'Alex_A', 'Alex_B', 'Words', 'Sentences', 'Tokens',
       'Tokens_Stop', 'Tokens_Stem_P', 'Tokens_Stem_S', 'POS', 'NER', 'DEP',
       'Lemmas_CNLP', 'Lemmas_Spacy', 'Chars', 'avgWL', 'avgSL', 'Pun_Count',
       'Stop_Count', 'RawTokens', 'Title_Count', 'Upper_Count', 'PRON_Count',
       'DET_Count', 'ADV_Count', 'VERB_Count', 'PROPN_Count', 'NOUN_Count',
       'NUM_Count', 'PUNCT_Count', 'SYM_Count', 'SCONJ_Count', 'CCONJ_Count',
       'INTJ_Count', 'AUX_Count', 'ADP_Count', 'ADJ_Count', 'PRON_Ratio',
       'DET_Ratio', 'ADV_Ratio', 'VERB_Ratio', 'PROPN_Ratio', 'NOUN_Ratio',
       'NUM_Ratio', 'PUNCT_Ratio', 'SYM_Ratio', 'SCONJ_Ratio', 'CCONJ_Ratio',
       'INTJ_Ratio', 'AUX_Ratio', 'ADP_Ratio', 'ADJ_Ratio', 'TTR', 'HTR',
       'BoW_PCA_1', 'BoW_PCA_2', 'BoW_PCA_3', 'TFIDF_PCA_1', 'TFIDF_PCA_2',
       'TFIDF_PCA_3']

In [4]:
alex_df.head()

|   | Code | TAS20 | F1 | F2 | F3 | Gender | Age | Card | T_Metaphors | T_ToM | ... | ADP_Ratio | ADJ_Ratio | TTR | HTR | BoW_PCA_1 | BoW_PCA_2 | BoW_PCA_3 | TFIDF_PCA_1 | TFIDF_PCA_2 | TFIDF_PCA_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 1 | 0 | 1 | ... | 0.125 | 0.0 | 0.5625 | 0.875 | 0.429786 | -0.056197 | -0.360772 | -0.11487 | 0.168706 | 0.031455 |
| 1 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 13HM | 0 | 1 | ... | 0.0 | 0.0 | 0.857143 | 1.0 | -0.535592 | 0.971355 | -0.133005 | 0.867802 | 0.301337 | 0.165452 |
| 2 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 1 | 0 | 1 | ... | 0.103448 | 0.172414 | 0.344828 | 0.793103 | 0.713317 | -0.012597 | -0.255988 | -0.089725 | 0.143005 | 0.031664 |
| 3 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 9VH | 0 | 1 | ... | 0.208333 | 0.083333 | 0.458333 | 0.875 | -0.28032 | -0.445467 | 0.372081 | -0.019208 | -0.07631 | -0.093545 |
| 4 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 13HM | 0 | 1 | ... | 0.1 | 0.2 | 0.9 | 1.0 | -0.539096 | 0.998465 | -0.135003 | 0.393093 | 0.108074 | 0.043623 |


## Preparing the corpora
We build three corpora: a global one, one with "alexithymic language", and one with "non-alexithymic language". Only the PoS tagging of each document is needed:
- `AllDocs` will contain the PoS-tagged documents from all participants. 
- `AlexDocs` will contain the PoS-tagged documents from TAS-20 positive participants. 
- `NoAlexDocs` will contain the PoS-tagged documents from TAS-20 negative participants. 

In [5]:
# The POS feature contains the PoS tagging for each document: 
print("Doc: " + alex_df['Text'][0])
print()
print("PoS Tagged: " + alex_df['POS'][0])

Doc: es un niño pensando en cual es la respuesta de sus deberes porque no la sabe.

PoS Tagged: [('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB'), ('en', 'ADP'), ('cual', 'PRON'), ('es', 'AUX'), ('la', 'DET'), ('respuesta', 'NOUN'), ('de', 'ADP'), ('sus', 'DET'), ('deberes', 'NOUN'), ('porque', 'SCONJ'), ('no', 'ADV'), ('la', 'PRON'), ('sabe', 'VERB'), ('.', 'PUNCT')]
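
Because the cell above concatenates `alex_df['POS'][0]` directly with a string, the `POS` column holds text rather than Python lists. The following is a minimal sketch of how one such value can be turned back into a list of `(word, tag)` tuples with `ast.literal_eval`. The variable names are illustrative, and the string is a shortened copy of the output shown above.

```python
import ast

# Shortened copy of the PoS output printed above (first four tokens only).
pos_as_text = "[('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB')]"

# literal_eval safely parses Python literals (lists, tuples, strings) from text.
tagged_tokens = ast.literal_eval(pos_as_text)

# Once parsed, tokens can be filtered by their PoS tag.
nouns = [word for word, tag in tagged_tokens if tag == "NOUN"]

print(type(pos_as_text).__name__, "->", type(tagged_tokens).__name__)
print(nouns)
```

The helper functions defined later in this notebook rely on the same parsing step for every document.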


In [6]:
AllDocs = alex_df['POS']
AlexDocs = alex_df[alex_df.Alex_A == 1]['POS']
NoAlexDocs = alex_df[alex_df.Alex_A == 0]['POS']

## Compute most frequent words per grammatical function
- Verbs. 
- Auxiliary verbs. 
- Nouns. 
- Proper nouns. 
- Adjectives. 
- Adverbs. 
- Subordinating conjunctions. 

In [7]:
# Extracts the list of specific PoS tokens in a list of POS tagged documents
def get_PoS_Dict(corpus, PoS_tag):
    """
    Parameters
    ----------
    corpus : iterable of str
        PoS-tagged documents, each stored as the string representation
        of a list of (word, PoS_tag) tuples.
    PoS_tag : str
        The specific PoS tag that we want to extract.
    
    Returns
    -------
    words_sorted : list of (word, frequency) tuples
        Sorted by frequency in the corpus, highest first.
        
    """
    
    words_dict = {}
    
    for PoSList in corpus:  # For each PoS Tagged doc
        tag_list = ast.literal_eval(PoSList)    # Get the list of tuples
        for PoStuple in tag_list: 
            word = PoStuple[0]
            tag = PoStuple[1]
            if tag == PoS_tag:
                # Increment the count, starting from 0 for unseen words
                words_dict[word] = words_dict.get(word, 0) + 1
    
    # Sort by frequency (higher first)
    words_sorted = []
    for w in sorted(words_dict, key=words_dict.get, reverse=True):
        words_sorted.append((w, words_dict[w]))
        
    return words_sorted
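
As a point of comparison, the same word count can be sketched with `collections.Counter` from the standard library. This is an alternative formulation, not the code used in this notebook. The function name `count_pos_words` and the two toy documents are made up for illustration.

```python
import ast
from collections import Counter

def count_pos_words(corpus, pos_tag):
    """Count words carrying pos_tag across PoS-tagged documents stored as strings."""
    counts = Counter()
    for pos_text in corpus:
        tagged_tokens = ast.literal_eval(pos_text)  # string -> list of (word, tag)
        counts.update(word for word, tag in tagged_tokens if tag == pos_tag)
    # most_common() returns (word, frequency) pairs, highest frequency first
    return counts.most_common()

# Two hypothetical documents in the same string format as the POS column.
toy_corpus = [
    "[('un', 'DET'), ('niño', 'NOUN'), ('toca', 'VERB'), ('el', 'DET'), ('violín', 'NOUN')]",
    "[('el', 'DET'), ('niño', 'NOUN'), ('duerme', 'VERB')]",
]

noun_counts = count_pos_words(toy_corpus, "NOUN")
print(noun_counts)  # [('niño', 2), ('violín', 1)]
```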

In [8]:
# Get ordered lists of interest

# Nouns
all_nouns = get_PoS_Dict(AllDocs, 'NOUN')
alex_nouns = get_PoS_Dict(AlexDocs, 'NOUN')
noalex_nouns = get_PoS_Dict(NoAlexDocs, 'NOUN')

# Verbs
all_verbs = get_PoS_Dict(AllDocs, 'VERB')
alex_verbs = get_PoS_Dict(AlexDocs, 'VERB')
noalex_verbs = get_PoS_Dict(NoAlexDocs, 'VERB')

# Adjectives
all_adjectives = get_PoS_Dict(AllDocs, 'ADJ')
alex_adjectives = get_PoS_Dict(AlexDocs, 'ADJ')
noalex_adjectives = get_PoS_Dict(NoAlexDocs, 'ADJ')

# Subordinating conjunctions
all_sconj = get_PoS_Dict(AllDocs, 'SCONJ')
alex_sconj = get_PoS_Dict(AlexDocs, 'SCONJ')
noalex_sconj = get_PoS_Dict(NoAlexDocs, 'SCONJ')

# Adverbs
all_adverbs = get_PoS_Dict(AllDocs, 'ADV')
alex_adverbs = get_PoS_Dict(AlexDocs, 'ADV')
noalex_adverbs = get_PoS_Dict(NoAlexDocs, 'ADV')

# Auxiliary verbs
all_aux = get_PoS_Dict(AllDocs, 'AUX')
alex_aux = get_PoS_Dict(AlexDocs, 'AUX')
noalex_aux = get_PoS_Dict(NoAlexDocs, 'AUX')

# Proper nouns
all_proper = get_PoS_Dict(AllDocs, 'PROPN')
alex_proper = get_PoS_Dict(AlexDocs, 'PROPN')
noalex_proper = get_PoS_Dict(NoAlexDocs, 'PROPN')

## Save the list of nouns, verbs and adjectives
- In order of appearance (for possible further analysis)

In [11]:
# Extracts the list of specific PoS tokens keeping the order of appearance in the doc. 
def get_PoS_List(doc, PoS_tag):
    """
    Parameters
    ----------
    doc : str
        One PoS-tagged document, stored as the string representation
        of a list of (word, PoS_tag) tuples.
    PoS_tag : str
        The specific PoS tag that we want to extract.
    
    Returns
    -------
    words : list of str
        Words with the requested tag, in order of appearance.
        
    """
    
    words = []
    
    tag_list = ast.literal_eval(doc)    # Get the list of tuples representing the doc. 
    for PoStuple in tag_list: 
        word = PoStuple[0]
        tag = PoStuple[1]
        if ( tag == PoS_tag ): 
            words.append(word)    # Add to the list only the grammatical function we want
          
    return words

In [16]:
# Quick check on the first document
print("Doc: " + alex_df['Text'][0])
print()
print("Verbs: " + str(get_PoS_List(alex_df['POS'][0],'VERB')))
print()
print("Nouns: " + str(get_PoS_List(alex_df['POS'][0],'NOUN')))
print()
print("Adjectives: " + str(get_PoS_List(alex_df['POS'][0],'ADJ')))
print()
print("Sub. Conj.: " + str(get_PoS_List(alex_df['POS'][0],'SCONJ')))

Doc: es un niño pensando en cual es la respuesta de sus deberes porque no la sabe.

Verbs: ['pensando', 'sabe']

Nouns: ['niño', 'respuesta', 'deberes']

Adjectives: []

Sub. Conj.: ['porque']


In [20]:
# Persist the list of specific parts of speech
alex_df['Verb_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'VERB'))
alex_df['Noun_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'NOUN'))
alex_df['Adjective_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'ADJ'))
alex_df['Subord_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'SCONJ'))
alex_df['Adverb_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'ADV'))
alex_df['Aux_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'AUX'))

In [21]:
alex_df[['POS','Verb_List']].sample(4)

|   | POS | Verb_List |
|---|---|---|
| 152 | [('descansando', 'VERB'), ('en', 'ADP'), ('la'... | [descansando] |
| 156 | [('tras', 'ADP'), ('una', 'DET'), ('gran', 'AD... | [relajamos] |
| 91 | [('dejó', 'VERB'), ('el', 'DET'), ('violín', '... | [dejó, posó, sentía] |
| 102 | [('dos', 'NUM'), ('mujeres', 'NOUN'), (',', 'P... | [cayéndose, desmayándose] |


In [23]:
alex_df[['POS','Aux_List']].sample(4)

|   | POS | Aux_List |
|---|---|---|
| 249 | [('una', 'DET'), ('selva', 'NOUN'), ('llena', ... | [] |
| 95 | [('un', 'DET'), ('niño', 'NOUN'), ('que', 'PRO... | [] |
| 170 | [('estragos', 'NOUN'), ('de', 'ADP'), ('la', '... | [había, podría, ser] |
| 356 | [('un', 'DET'), ('niño', 'NOUN'), ('al', 'ADP'... | [eran, es] |


In [24]:
alex_df.columns

Index(['Code', 'TAS20', 'F1', 'F2', 'F3', 'Gender', 'Age', 'Card',
       'T_Metaphors', 'T_ToM', 'T_FP', 'T_Interpret', 'T_Desc', 'T_Confussion',
       'Text', 'Alex_A', 'Alex_B', 'Words', 'Sentences', 'Tokens',
       'Tokens_Stop', 'Tokens_Stem_P', 'Tokens_Stem_S', 'POS', 'NER', 'DEP',
       'Lemmas_CNLP', 'Lemmas_Spacy', 'Chars', 'avgWL', 'avgSL', 'Pun_Count',
       'Stop_Count', 'RawTokens', 'Title_Count', 'Upper_Count', 'PRON_Count',
       'DET_Count', 'ADV_Count', 'VERB_Count', 'PROPN_Count', 'NOUN_Count',
       'NUM_Count', 'PUNCT_Count', 'SYM_Count', 'SCONJ_Count', 'CCONJ_Count',
       'INTJ_Count', 'AUX_Count', 'ADP_Count', 'ADJ_Count', 'PRON_Ratio',
       'DET_Ratio', 'ADV_Ratio', 'VERB_Ratio', 'PROPN_Ratio', 'NOUN_Ratio',
       'NUM_Ratio', 'PUNCT_Ratio', 'SYM_Ratio', 'SCONJ_Ratio', 'CCONJ_Ratio',
       'INTJ_Ratio', 'AUX_Ratio', 'ADP_Ratio', 'ADJ_Ratio', 'TTR', 'HTR',
       'BoW_PCA_1', 'BoW_PCA_2', 'BoW_PCA_3', 'TFIDF_PCA_1', 'TFIDF_PCA_2',
       'TFIDF_PCA_3', 'Verb_List', 'Noun_List', 'Adjective_List', 'Subord_List',
       'Adverb_List', 'Aux_List']

In [25]:
# Save Updated features dataset
Feats_4_path = "D:\\Dropbox-Array2001\\Dropbox\\DataSets\\Prolexitim-Dataset\\Prolexitim_v2_features_4.csv"
alex_df.to_csv(Feats_4_path, sep=';', encoding='utf-8', index=False)

## Are there significant differences between alex and noalex groups?

In [35]:
from operator import itemgetter

In [39]:
# Number of most frequent words to analyze 
Top_N = 10

### Differences in noun usage

In [47]:
# View as dataframe: 
nouns_df = pd.DataFrame(list(zip(
    list(map(itemgetter(0), alex_nouns[0:Top_N])),
    list(map(itemgetter(0), noalex_nouns[0:Top_N])))), 
    columns=['AlexNouns','NoAlexNouns'])

In [48]:
nouns_df

|   | AlexNouns | NoAlexNouns |
|---|---|---|
| 0 | niño | violín |
| 1 | hombre | niño |
| 2 | violín | día |
| 3 | día | hombre |
| 4 | mujer | mujer |
| 5 | violin | padres |
| 6 | casa | casa |
| 7 | grupo | vida |
| 8 | trabajo | trabajo |
| 9 | esposa | grupo |
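
In the table above, 'violín' and 'violin' occupy separate ranks in the alexithymic top 10, so spelling variants of one word are counted as different words. The sketch below shows one possible way to merge such variants before ranking. It is not part of the original pipeline: the helper names are hypothetical and the counts are made up. Note that stripping diacritics also folds 'ñ' into 'n', which is why the stripped form is used only as a grouping key and the most frequent original spelling is kept.

```python
import unicodedata
from collections import defaultdict

def strip_accents(word):
    """Remove combining marks (accents, tilde) after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def merge_accent_variants(word_counts):
    """Merge (word, frequency) pairs whose words differ only in diacritics."""
    groups = defaultdict(list)
    for word, count in word_counts:
        groups[strip_accents(word)].append((word, count))
    merged = []
    for variants in groups.values():
        # Keep the most frequent spelling and add up the counts of all variants
        best_spelling = max(variants, key=lambda wc: wc[1])[0]
        merged.append((best_spelling, sum(count for _, count in variants)))
    return sorted(merged, key=lambda wc: wc[1], reverse=True)

# Made-up counts, for illustration only.
toy_counts = [("niño", 5), ("violín", 4), ("violin", 2), ("casa", 1)]
merged_counts = merge_accent_variants(toy_counts)
print(merged_counts)
```

Whether merging is appropriate depends on the analysis, since some Spanish word pairs differ in meaning only by an accent.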


In [85]:
# Considering the top-N sets
def print_Set_Stats(alex_list, noalex_list, top_n):
    """
    Parameters
    ----------
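    alex_list : list of (word, frequency) tuples
        Most frequent words in the alexithymia group, highest frequency first.
    noalex_list : list of (word, frequency) tuples
        Most frequent words in the non-alexithymia group, highest frequency first.
    top_n : int
        Number of most frequent words to analyze.
    
    Returns
    -------
    None
        The set statistics are printed to standard output. Ratios are
        relative to top_n, so the union and symmetric difference ratios
        can exceed 1.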
        
    """

    # Keep only the word (first tuple element) of the top-N entries
    alex_set = {word for word, _ in alex_list[:top_n]}
    noalex_set = {word for word, _ in noalex_list[:top_n]}

    union = alex_set | noalex_set
    intersection = alex_set & noalex_set
    difference1 = alex_set - noalex_set
    difference2 = noalex_set - alex_set
    notincommon = alex_set ^ noalex_set

    print("WORDS ANALYSIS")
    print("--------------")
    print("Alex Set:")
    print(alex_set)
    print()
    print("NoAlex Set:")
    print(noalex_set)
    print()
    print("Union:")
    print(union)
    print("--> Union size: %d (ratio %.2f)" % (len(union),len(union)/top_n))
    print()
    print("Intersection:")
    print(intersection)
    print("--> Intersection size: %d (ratio %.2f)" % (len(intersection),len(intersection)/top_n))
    print()
    print("Alex - NoAlex Difference:")
    print(difference1)
    print("--> Difference1 size: %d (ratio %.2f)" % (len(difference1),len(difference1)/top_n))
    print()
    print("NoAlex - Alex Difference:")
    print(difference2)
    print("--> Difference2 size: %d (ratio %.2f)" % (len(difference2),len(difference2)/top_n))
    print()
    print("Not in common:")
    print(notincommon)
    print("--> Symmetric diff. size: %d (ratio %.2f)" % (len(notincommon),len(notincommon)/top_n))
    print()
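
For further processing it can be handy to get the same sets back as data instead of printed text. The following is a minimal sketch of such a variant. The function name `top_n_set_stats`, the dictionary keys, and the toy word lists are assumptions made for illustration.

```python
from operator import itemgetter

def top_n_set_stats(alex_list, noalex_list, top_n):
    """Return the set comparisons of the two top-N word lists as a dict."""
    alex_set = set(map(itemgetter(0), alex_list[:top_n]))
    noalex_set = set(map(itemgetter(0), noalex_list[:top_n]))
    return {
        "union": alex_set | noalex_set,
        "intersection": alex_set & noalex_set,
        "alex_only": alex_set - noalex_set,
        "noalex_only": noalex_set - alex_set,
        "symmetric_difference": alex_set ^ noalex_set,
    }

# Made-up (word, frequency) lists, for illustration only.
toy_alex = [("niño", 5), ("violín", 3), ("esposa", 1)]
toy_noalex = [("violín", 6), ("niño", 4), ("vida", 2)]

stats = top_n_set_stats(toy_alex, toy_noalex, top_n=3)
for name, words in stats.items():
    print(name, len(words), sorted(words))
```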


In [87]:
print_Set_Stats(alex_nouns, noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'grupo', 'casa', 'esposa', 'día', 'violín', 'hombre', 'niño', 'mujer', 'violin', 'trabajo'}

NoAlex Set:
{'grupo', 'casa', 'día', 'violín', 'padres', 'hombre', 'niño', 'mujer', 'trabajo', 'vida'}

Union:
{'día', 'violín', 'padres', 'mujer', 'violin', 'trabajo', 'vida', 'grupo', 'casa', 'esposa', 'hombre', 'niño'}
--> Union size: 12 (ratio 1.20)

Intersection:
{'grupo', 'casa', 'día', 'violín', 'hombre', 'niño', 'mujer', 'trabajo'}
--> Intersection size: 8 (ratio 0.80)

Alex - NoAlex Difference:
{'esposa', 'violin'}
--> Difference1 size: 2 (ratio 0.20)

NoAlex - Alex Difference:
{'padres', 'vida'}
--> Difference2 size: 2 (ratio 0.20)

Not in common:
{'esposa', 'padres', 'violin', 'vida'}
--> Symmetric diff. size: 4 (ratio 0.40)
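
The ratios printed above are relative to `Top_N`, which is why the union ratio exceeds 1. One common way to summarize the overlap in a single number between 0 and 1 is the Jaccard index, the intersection size divided by the union size. This measure is not used in the original analysis. The sketch below applies it to the two noun sets printed above, and `jaccard_index` is a hypothetical helper name.

```python
# Top-10 noun sets copied from the output above.
alex_top = {'grupo', 'casa', 'esposa', 'día', 'violín', 'hombre', 'niño',
            'mujer', 'violin', 'trabajo'}
noalex_top = {'grupo', 'casa', 'día', 'violín', 'padres', 'hombre', 'niño',
              'mujer', 'trabajo', 'vida'}

def jaccard_index(a, b):
    """Size of the intersection divided by size of the union (1.0 if both empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

nouns_jaccard = jaccard_index(alex_top, noalex_top)
print(f"{nouns_jaccard:.2f}")  # 8 / 12 -> 0.67
```

Applying the same formula to the sizes reported in the following sections gives 3/17 (about 0.18) for verbs and 4/16 (0.25) for adjectives.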



### Differences in verb usage

In [88]:
print_Set_Stats(alex_verbs, noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'durmiendo', 'encontrar', 'ver', 'pensando', 'aprender', 'sabe', 'descansando', 'hacer', 'tenía', 'tocar'}

NoAlex Set:
{'gustaba', 'quería', 'trabajar', 'tenían', 'tener', 'tiene', 'hacer', 'descansar', 'tenía', 'tocar'}

Union:
{'gustaba', 'aprender', 'sabe', 'descansando', 'descansar', 'tenía', 'tener', 'durmiendo', 'encontrar', 'ver', 'quería', 'pensando', 'trabajar', 'tenían', 'tiene', 'hacer', 'tocar'}
--> Union size: 17 (ratio 1.70)

Intersection:
{'hacer', 'tenía', 'tocar'}
--> Intersection size: 3 (ratio 0.30)

Alex - NoAlex Difference:
{'durmiendo', 'encontrar', 'ver', 'pensando', 'aprender', 'sabe', 'descansando'}
--> Difference1 size: 7 (ratio 0.70)

NoAlex - Alex Difference:
{'gustaba', 'quería', 'tenían', 'trabajar', 'tiene', 'descansar', 'tener'}
--> Difference2 size: 7 (ratio 0.70)

Not in common:
{'durmiendo', 'encontrar', 'ver', 'gustaba', 'quería', 'pensando', 'aprender', 'sabe', 'trabajar', 'tenían', 'descansando', 'tener', 'tiene', 'descansar'}
--> Symmetric diff. size: 14 (ratio 1.40)

### Differences in adjective usage

In [89]:
print_Set_Stats(alex_adjectives, noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'bella', 'plena', 'cansado', 'muerta', 'único', 'juntos', 'gran', 'hermosa', 'aburrido', 'solitario'}

NoAlex Set:
{'siguiente', 'cansado', 'triste', 'mejor', 'muerta', 'juntos', 'junto', 'solo', 'gran', 'nuevo'}

Union:
{'plena', 'cansado', 'triste', 'muerta', 'junto', 'gran', 'hermosa', 'solitario', 'nuevo', 'siguiente', 'bella', 'mejor', 'único', 'juntos', 'solo', 'aburrido'}
--> Union size: 16 (ratio 1.60)

Intersection:
{'gran', 'cansado', 'muerta', 'juntos'}
--> Intersection size: 4 (ratio 0.40)

Alex - NoAlex Difference:
{'bella', 'plena', 'único', 'hermosa', 'aburrido', 'solitario'}
--> Difference1 size: 6 (ratio 0.60)

NoAlex - Alex Difference:
{'siguiente', 'triste', 'mejor', 'junto', 'solo', 'nuevo'}
--> Difference2 size: 6 (ratio 0.60)

Not in common:
{'siguiente', 'bella', 'plena', 'triste', 'mejor', 'junto', 'solo', 'único', 'hermosa', 'aburrido', 'solitario', 'nuevo'}
--> Symmetric diff. size: 12 (ratio 1.20)

