## Language Analysis of Alexithymic Discourse

<hr>

Alexithymic Language Project / raul@psicobotica.com / V2 release (sept 2020)

<hr>

### Lexicosemantics Analysis

We review the words most frequently used by participants, taking into account the part of speech (PoS) and the semantics associated with each term.

- Three corpora considered: all, non-alexithymic, alexithymic. 
- Most frequent nouns. 
- Most frequent adjectives. 
- Most frequent verbs. 

<hr>

[Explanation of Lexical Semantics](https://en.wikipedia.org/wiki/Lexical_semantics)

[List of PoS tags (Spanish)](https://universaldependencies.org/docs/es/pos/)

## Load features dataset
- Data is already pre-processed (1-Preprocessing). 
- Basic NLP features are already calculated (2-Features). 
- Some additional BoW features have been added (3-BoW).
- Some additional TF/IDF features have been added (3-TFIDF).
- N-Gram models have been generated (3-N-Grams). 

In [1]:
import pandas as pd 
import numpy as np
import ast
import heapq
import nltk

In [2]:
feats_dataset_path = "https://raw.githubusercontent.com/raul-arrabales/alexithymic-lang/master/data/Prolexitim_v2_features_3.csv"
alex_df = pd.read_csv(feats_dataset_path, header=0, delimiter=";")

In [3]:
alex_df.columns

Index(['Code', 'TAS20', 'F1', 'F2', 'F3', 'Gender', 'Age', 'Card',
       'T_Metaphors', 'T_ToM', 'T_FP', 'T_Interpret', 'T_Desc', 'T_Confussion',
       'Text', 'Alex_A', 'Alex_B', 'Words', 'Sentences', 'Tokens',
       'Tokens_Stop', 'Tokens_Stem_P', 'Tokens_Stem_S', 'POS', 'NER', 'DEP',
       'Lemmas_CNLP', 'Lemmas_Spacy', 'Chars', 'avgWL', 'avgSL', 'Pun_Count',
       'Stop_Count', 'RawTokens', 'Title_Count', 'Upper_Count', 'PRON_Count',
       'DET_Count', 'ADV_Count', 'VERB_Count', 'PROPN_Count', 'NOUN_Count',
       'NUM_Count', 'PUNCT_Count', 'SYM_Count', 'SCONJ_Count', 'CCONJ_Count',
       'INTJ_Count', 'AUX_Count', 'ADP_Count', 'ADJ_Count', 'PRON_Ratio',
       'DET_Ratio', 'ADV_Ratio', 'VERB_Ratio', 'PROPN_Ratio', 'NOUN_Ratio',
       'NUM_Ratio', 'PUNCT_Ratio', 'SYM_Ratio', 'SCONJ_Ratio', 'CCONJ_Ratio',
       'INTJ_Ratio', 'AUX_Ratio', 'ADP_Ratio', 'ADJ_Ratio', 'TTR', 'HTR',
       'BoW_PCA_1', 'BoW_PCA_2', 'BoW_PCA_3', 'TFIDF_PCA_1', 'TFIDF_PCA_2',
       'TFIDF_PCA_3']

In [4]:
alex_df.head()

| | Code | TAS20 | F1 | F2 | F3 | Gender | Age | Card | T_Metaphors | T_ToM | ... | ADP_Ratio | ADJ_Ratio | TTR | HTR | BoW_PCA_1 | BoW_PCA_2 | BoW_PCA_3 | TFIDF_PCA_1 | TFIDF_PCA_2 | TFIDF_PCA_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 1 | 0 | 1 | ... | 0.125 | 0.0 | 0.5625 | 0.875 | 0.429786 | -0.056197 | -0.360772 | -0.11487 | 0.168706 | 0.031455 |
| 1 | bc39e22ca5dba59fbd97c27987878f56 | 40 | 16 | 9 | 15 | 2 | 22 | 13HM | 0 | 1 | ... | 0.0 | 0.0 | 0.857143 | 1.0 | -0.535592 | 0.971355 | -0.133005 | 0.867802 | 0.301337 | 0.165452 |
| 2 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 1 | 0 | 1 | ... | 0.103448 | 0.172414 | 0.344828 | 0.793103 | 0.713317 | -0.012597 | -0.255988 | -0.089725 | 0.143005 | 0.031664 |
| 3 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 9VH | 0 | 1 | ... | 0.208333 | 0.083333 | 0.458333 | 0.875 | -0.28032 | -0.445467 | 0.372081 | -0.019208 | -0.07631 | -0.093545 |
| 4 | 20cd825cadb95a71763bad06e142c148 | 40 | 12 | 10 | 18 | 2 | 22 | 13HM | 0 | 1 | ... | 0.1 | 0.2 | 0.9 | 1.0 | -0.539096 | 0.998465 | -0.135003 | 0.393093 | 0.108074 | 0.043623 |


## Preparing the corpora
Let's build three corpora: one global, one with "alexithymic language" and one with "non-alexithymic language". We only need the PoS tagging for each.

We also have to take into account the card presented to each participant, as it greatly influences the lexicon.

- `AllDocs` will contain the documents from all participants. 
- `AlexDocs` will contain the documents from TAS-20 positive participants (`Alex_A == 1`). 
- `NoAlexDocs` will contain the documents from TAS-20 negative participants (`Alex_A == 0`). 

In [5]:
# The POS feature contains the PoS tagging for each document: 
print("Doc: " + alex_df['Text'][0])
print()
print("PoS Tagged: " + alex_df['POS'][0])

Doc: es un niño pensando en cual es la respuesta de sus deberes porque no la sabe.

PoS Tagged: [('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB'), ('en', 'ADP'), ('cual', 'PRON'), ('es', 'AUX'), ('la', 'DET'), ('respuesta', 'NOUN'), ('de', 'ADP'), ('sus', 'DET'), ('deberes', 'NOUN'), ('porque', 'SCONJ'), ('no', 'ADV'), ('la', 'PRON'), ('sabe', 'VERB'), ('.', 'PUNCT')]
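As the output above shows, the `POS` column holds the string representation of a list of tuples, so it has to be parsed before the tags can be used. The following is a minimal sketch of that step, assuming the same string format; the `parse_pos` helper name is hypothetical and not part of the notebook.

```python
import ast

def parse_pos(pos_string):
    """Turn the stored string into a list of (word, tag) tuples.

    Hypothetical helper: it assumes the column stores a Python literal,
    as in the sample document printed above.
    """
    return ast.literal_eval(pos_string)

# Shortened copy of the first document's PoS string
sample = "[('es', 'AUX'), ('un', 'DET'), ('niño', 'NOUN'), ('pensando', 'VERB')]"

tagged = parse_pos(sample)
print(type(tagged))   # a real list, no longer a string
print(tagged[2])      # third (word, tag) pair
```

`ast.literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for data read from a file.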


In [5]:
AllDocs = alex_df[['POS','Card']]
AlexDocs = alex_df[alex_df.Alex_A == 1][['POS','Card']]
NoAlexDocs = alex_df[alex_df.Alex_A == 0][['POS','Card']]

In [7]:
AlexDocs.Card.value_counts()

9VH     20
1       20
11      18
13HM    17
15HM     1
Name: Card, dtype: int64

In [8]:
NoAlexDocs.Card.value_counts()

9VH     69
1       65
11      65
13HM    64
12VN     8
3VH      6
7VH      6
13V      5
13N      5
10       4
18NM     3
13VH     2
!·HM     1
9BM      1
10N      1
Name: Card, dtype: int64

For the per-card analysis we need to stick to cards 9VH, 1, 11 and 13HM, the only ones well represented in both classes.
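One possible way to apply that restriction is sketched below. It uses a toy DataFrame as a stand-in for `alex_df`, and the `MAIN_CARDS` constant is an assumption introduced for illustration. Card identifiers are kept as strings, matching how they are compared later in the notebook.

```python
import pandas as pd

# Cards that are well represented in both groups (hypothetical constant)
MAIN_CARDS = ['1', '9VH', '11', '13HM']

# Toy stand-in for alex_df; the values are illustrative only
toy_df = pd.DataFrame({
    'Card':   ['1', '9VH', '15HM', '11', '13HM', '10'],
    'Alex_A': [1,   0,     1,      0,    1,      0],
})

# Keep only the documents written for the main cards
main_df = toy_df[toy_df['Card'].isin(MAIN_CARDS)]

# Number of documents per group and card
counts = main_df.groupby(['Alex_A', 'Card']).size()
print(counts)
```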

## Compute most frequent words per grammatical function
- Verbs. 
- Auxiliary verbs. 
- Nouns. 
- Proper nouns. 
- Adjectives. 
- Adverbs. 
- Subordinating conjunctions. 

In [12]:
# Extracts the list of specific PoS tokens in a list of POS tagged documents
def get_PoS_SortedList(corpus, PoS_tag):
    """
    Parameters
    ----------
    corpus : pandas Series of str
        PoS-tagged documents, each stored as the string representation
        of a list of (word, POS_tag) tuples.
    PoS_tag : str
        The specific PoS tag to extract.
    
    Returns
    -------
    words_sorted : list of (word, frequency) tuples
        Sorted by descending frequency in the corpus.
        
    """
    
    words_dict = {}
    
    for PoSList in corpus:  # For each PoS Tagged doc
        tag_list = ast.literal_eval(PoSList)    # Get the list of tuples
        for PoStuple in tag_list: 
            word = PoStuple[0]
            tag = PoStuple[1]
            if ( tag == PoS_tag ): 
                if word not in words_dict.keys():
                    words_dict[word] = 1
                else:
                    words_dict[word] += 1
    
    # Sort by frequency (higher first)
    words_sorted = []
    for w in sorted(words_dict, key=words_dict.get, reverse=True):
        words_sorted.append((w, words_dict[w]))
        
    return words_sorted
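The same counting can be expressed more compactly with `collections.Counter`. This is only a sketch of an alternative, not the notebook's own implementation. The `pos_frequencies` name and the two toy documents are assumptions made for the example.

```python
import ast
from collections import Counter

def pos_frequencies(corpus, pos_tag):
    """Return (word, count) pairs for one PoS tag, most frequent first."""
    counts = Counter()
    for pos_string in corpus:                          # one string per document
        for word, tag in ast.literal_eval(pos_string):
            if tag == pos_tag:
                counts[word] += 1
    return counts.most_common()

# Two toy documents in the same string format as the POS column
docs = [
    "[('un', 'DET'), ('niño', 'NOUN'), ('toca', 'VERB'), ('violín', 'NOUN')]",
    "[('el', 'DET'), ('niño', 'NOUN'), ('duerme', 'VERB')]",
]

print(pos_frequencies(docs, 'NOUN'))   # [('niño', 2), ('violín', 1)]
```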

In [15]:
# Get ordered lists of interest

# Nouns
all_nouns = get_PoS_SortedList(AllDocs.POS, 'NOUN')
alex_nouns = get_PoS_SortedList(AlexDocs.POS, 'NOUN')
noalex_nouns = get_PoS_SortedList(NoAlexDocs.POS, 'NOUN')

# Verbs
all_verbs = get_PoS_SortedList(AllDocs.POS, 'VERB')
alex_verbs = get_PoS_SortedList(AlexDocs.POS, 'VERB')
noalex_verbs = get_PoS_SortedList(NoAlexDocs.POS, 'VERB')

# Adjectives
all_adjectives = get_PoS_SortedList(AllDocs.POS, 'ADJ')
alex_adjectives = get_PoS_SortedList(AlexDocs.POS, 'ADJ')
noalex_adjectives = get_PoS_SortedList(NoAlexDocs.POS, 'ADJ')

# Subordinating conjunctions
all_sconj = get_PoS_SortedList(AllDocs.POS, 'SCONJ')
alex_sconj = get_PoS_SortedList(AlexDocs.POS, 'SCONJ')
noalex_sconj = get_PoS_SortedList(NoAlexDocs.POS, 'SCONJ')

# Adverbs
all_adverbs = get_PoS_SortedList(AllDocs.POS, 'ADV')
alex_adverbs = get_PoS_SortedList(AlexDocs.POS, 'ADV')
noalex_adverbs = get_PoS_SortedList(NoAlexDocs.POS, 'ADV')

# Auxiliary verbs
all_aux = get_PoS_SortedList(AllDocs.POS, 'AUX')
alex_aux = get_PoS_SortedList(AlexDocs.POS, 'AUX')
noalex_aux = get_PoS_SortedList(NoAlexDocs.POS, 'AUX')

# Proper nouns
all_proper = get_PoS_SortedList(AllDocs.POS, 'PROPN')
alex_proper = get_PoS_SortedList(AlexDocs.POS, 'PROPN')
noalex_proper = get_PoS_SortedList(NoAlexDocs.POS, 'PROPN')

In [17]:
# Proper noun tagging for Spanish just didn't work (see Preprocessing notebook): 
all_proper

[('the', 1), ('explorers', 1)]

In [19]:
alex_verbs[0:10]

[('tocar', 12),
 ('hacer', 7),
 ('tenía', 6),
 ('encontrar', 5),
 ('sabe', 5),
 ('aprender', 4),
 ('durmiendo', 4),
 ('descansando', 4),
 ('pensando', 3),
 ('ver', 3)]

In [20]:
noalex_verbs[0:10]

[('tocar', 41),
 ('quería', 21),
 ('tenía', 21),
 ('tiene', 20),
 ('hacer', 18),
 ('trabajar', 15),
 ('gustaba', 14),
 ('tener', 12),
 ('descansar', 12),
 ('tenían', 11)]
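The two groups differ a lot in size: the `Card.value_counts()` outputs above add up to 76 alexithymic and 305 non-alexithymic documents. Raw counts are therefore not directly comparable. A minimal sketch of one way to normalize them follows, assuming those totals cover every document. The `per_document_rate` helper is hypothetical.

```python
def per_document_rate(sorted_list, n_docs):
    """Convert (word, count) pairs into (word, occurrences per document) pairs."""
    return [(word, count / n_docs) for word, count in sorted_list]

# Group sizes summed from the Card.value_counts() outputs above
N_ALEX_DOCS = 76
N_NOALEX_DOCS = 305

# Top verb counts copied from the outputs above
alex_top = [('tocar', 12), ('hacer', 7), ('tenía', 6)]
noalex_top = [('tocar', 41), ('quería', 21), ('tenía', 21)]

alex_rates = per_document_rate(alex_top, N_ALEX_DOCS)
noalex_rates = per_document_rate(noalex_top, N_NOALEX_DOCS)

for word, rate in alex_rates:
    print(f"alex   {word:10s} {rate:.3f}")
for word, rate in noalex_rates:
    print(f"noalex {word:10s} {rate:.3f}")
```

Dividing by the number of documents is only one option; dividing by the total number of verbs in each group would answer a slightly different question.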

## Are there significant differences between alex and noalex groups?
- Independently of the card being shown to the participant. 

In [21]:
from operator import itemgetter

In [22]:
# Number of most frequent words to analyze 
Top_N = 10

### Differences in noun usage

In [23]:
# View as dataframe: 
nouns_df = pd.DataFrame(list(zip(
    list(map(itemgetter(0), alex_nouns[0:Top_N])),
    list(map(itemgetter(0), noalex_nouns[0:Top_N])))), 
    columns=['AlexNouns','NoAlexNouns'])

In [24]:
nouns_df

| | AlexNouns | NoAlexNouns |
|---|---|---|
| 0 | niño | violín |
| 1 | hombre | niño |
| 2 | violín | día |
| 3 | día | hombre |
| 4 | mujer | mujer |
| 5 | violin | padres |
| 6 | casa | casa |
| 7 | grupo | vida |
| 8 | trabajo | trabajo |
| 9 | esposa | grupo |
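Note that `violín` and `violin` appear as separate entries in the alexithymic list, so spelling variants split the counts. One possible remedy is to fold acute accents before counting. The sketch below is an assumption about how that could be done, with hypothetical helper names and illustrative counts. It removes only the acute accent, so `ñ` is preserved, but it would also merge genuinely different words such as `esta` and `está`.

```python
import unicodedata
from collections import Counter

ACUTE = '\u0301'   # combining acute accent

def strip_acute(word):
    """Remove acute accents only: 'violín' -> 'violin', 'niño' stays 'niño'."""
    decomposed = unicodedata.normalize('NFD', word)
    return unicodedata.normalize('NFC', decomposed.replace(ACUTE, ''))

def merge_variants(sorted_list):
    """Re-aggregate (word, count) pairs after accent folding."""
    merged = Counter()
    for word, count in sorted_list:
        merged[strip_acute(word)] += count
    return merged.most_common()

# Illustrative counts, not taken from the dataset
toy = [('niño', 5), ('violín', 3), ('violin', 2)]
merged = merge_variants(toy)
print(merged)
```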


In [25]:
# Considering the top-N sets
def print_Set_Stats(alex_list, noalex_list, top_n):
    """
    Parameters
    ----------
    alex_list : list 
        List of most frequent words in alexithymia group.
    noalex_list : list 
        List of most frequent words in non-alexithymia group.
    top_n : int
        Number of most frequent words to analyze.
    
    Returns
    -------
    None
        The set statistics are printed to stdout.
        
    """

    alex_set = set(list(map(itemgetter(0), alex_list[0:top_n])))
    noalex_set = set(list(map(itemgetter(0), noalex_list[0:top_n])))

    union = alex_set | noalex_set
    intersection = alex_set & noalex_set
    difference1 = alex_set - noalex_set
    difference2 = noalex_set - alex_set
    notincommon = alex_set ^ noalex_set

    print("WORDS ANALYSIS")
    print("--------------")
    print("Alex Set:")
    print(alex_set)
    print()
    print("NoAlex Set:")
    print(noalex_set)
    print()
    print("Union:")
    print(union)
    print("--> Union size: %d (ratio %.2f)" % (len(union),len(union)/top_n))
    print()
    print("Intersection:")
    print(intersection)
    print("--> Intersection size: %d (ratio %.2f)" % (len(intersection),len(intersection)/top_n))
    print()
    print("Alex - NoAlex Difference:")
    print(difference1)
    print("--> Difference1 size: %d (ratio %.2f)" % (len(difference1),len(difference1)/top_n))
    print()
    print("NoAlex - Alex Difference:")
    print(difference2)
    print("--> Difference2 size: %d (ratio %.2f)" % (len(difference2),len(difference2)/top_n))
    print()
    print("Not in common:")
    print(notincommon)
    print("--> Symmetric diff. size: %d (ratio %.2f)" % (len(notincommon),len(notincommon)/top_n))
    print()


In [26]:
print_Set_Stats(alex_nouns, noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'niño', 'violín', 'día', 'trabajo', 'mujer', 'violin', 'grupo', 'esposa', 'casa', 'hombre'}

NoAlex Set:
{'niño', 'vida', 'violín', 'día', 'trabajo', 'mujer', 'grupo', 'casa', 'hombre', 'padres'}

Union:
{'vida', 'violín', 'mujer', 'grupo', 'esposa', 'padres', 'niño', 'día', 'trabajo', 'violin', 'casa', 'hombre'}
--> Union size: 12 (ratio 1.20)

Intersection:
{'niño', 'violín', 'día', 'trabajo', 'mujer', 'grupo', 'casa', 'hombre'}
--> Intersection size: 8 (ratio 0.80)

Alex - NoAlex Difference:
{'violin', 'esposa'}
--> Difference1 size: 2 (ratio 0.20)

NoAlex - Alex Difference:
{'vida', 'padres'}
--> Difference2 size: 2 (ratio 0.20)

Not in common:
{'vida', 'violin', 'esposa', 'padres'}
--> Symmetric diff. size: 4 (ratio 0.40)
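The intersection and union sizes printed above can be condensed into a single overlap score, the Jaccard index (intersection size divided by union size). The sketch below is an addition for illustration, not part of the original analysis. It uses the two top-10 noun sets copied from the output above.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets count as identical
    return len(a & b) / len(union) if union else 1.0

# Top-10 noun sets copied from the output above
alex_top_nouns = {'niño', 'violín', 'día', 'trabajo', 'mujer',
                  'violin', 'grupo', 'esposa', 'casa', 'hombre'}
noalex_top_nouns = {'niño', 'vida', 'violín', 'día', 'trabajo',
                    'mujer', 'grupo', 'casa', 'hombre', 'padres'}

score = jaccard(alex_top_nouns, noalex_top_nouns)
print(round(score, 2))   # 8 shared words out of 12 distinct ones -> 0.67
```

A value close to 1 means both groups share most of their top words; a value close to 0 means the lists barely overlap.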



### Differences in verb usage

In [27]:
print_Set_Stats(alex_verbs, noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'durmiendo', 'descansando', 'aprender', 'encontrar', 'ver', 'hacer', 'tenía', 'sabe', 'tocar', 'pensando'}

NoAlex Set:
{'gustaba', 'tenían', 'tiene', 'quería', 'descansar', 'tener', 'hacer', 'tenía', 'tocar', 'trabajar'}

Union:
{'descansando', 'quería', 'tener', 'tenía', 'trabajar', 'durmiendo', 'gustaba', 'tenían', 'aprender', 'encontrar', 'tiene', 'descansar', 'ver', 'hacer', 'sabe', 'tocar', 'pensando'}
--> Union size: 17 (ratio 1.70)

Intersection:
{'tenía', 'tocar', 'hacer'}
--> Intersection size: 3 (ratio 0.30)

Alex - NoAlex Difference:
{'durmiendo', 'descansando', 'aprender', 'encontrar', 'ver', 'sabe', 'pensando'}
--> Difference1 size: 7 (ratio 0.70)

NoAlex - Alex Difference:
{'gustaba', 'tenían', 'quería', 'descansar', 'tiene', 'tener', 'trabajar'}
--> Difference2 size: 7 (ratio 0.70)

Not in common:
{'durmiendo', 'gustaba', 'descansando', 'tenían', 'tiene', 'quería', 'descansar', 'tener', 'aprender', 'encontrar', 'ver', 'sabe', 'trabajar', 'pensando'}
--> Symmetric diff. size: 14 (ratio 1.40)

### Differences in adjective usage

In [28]:
print_Set_Stats(alex_adjectives, noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'hermosa', 'cansado', 'aburrido', 'gran', 'único', 'solitario', 'bella', 'muerta', 'juntos', 'plena'}

NoAlex Set:
{'cansado', 'solo', 'gran', 'mejor', 'nuevo', 'siguiente', 'junto', 'muerta', 'juntos', 'triste'}

Union:
{'hermosa', 'cansado', 'único', 'siguiente', 'junto', 'muerta', 'juntos', 'triste', 'solo', 'aburrido', 'gran', 'nuevo', 'mejor', 'solitario', 'bella', 'plena'}
--> Union size: 16 (ratio 1.60)

Intersection:
{'muerta', 'juntos', 'cansado', 'gran'}
--> Intersection size: 4 (ratio 0.40)

Alex - NoAlex Difference:
{'hermosa', 'aburrido', 'único', 'solitario', 'bella', 'plena'}
--> Difference1 size: 6 (ratio 0.60)

NoAlex - Alex Difference:
{'solo', 'mejor', 'nuevo', 'siguiente', 'junto', 'triste'}
--> Difference2 size: 6 (ratio 0.60)

Not in common:
{'plena', 'hermosa', 'solo', 'aburrido', 'bella', 'único', 'mejor', 'solitario', 'nuevo', 'siguiente', 'junto', 'triste'}
--> Symmetric diff. size: 12 (ratio 1.20)



## Are there significant differences between alex and noalex groups?
- Now, considering specific cards. 
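The per-card analysis repeats the same extraction for each card and each PoS tag. The repeated cells could be collapsed into a loop, as sketched below under stated assumptions: `count_tag` and `per_card_frequencies` are hypothetical helpers, and the toy DataFrame only mimics the `Card` and `POS` columns.

```python
import ast
from collections import Counter

import pandas as pd

def count_tag(corpus, pos_tag):
    """(word, count) pairs for one PoS tag, most frequent first."""
    counts = Counter(
        word
        for pos_string in corpus
        for word, tag in ast.literal_eval(pos_string)
        if tag == pos_tag
    )
    return counts.most_common()

def per_card_frequencies(df, cards, tags):
    """Map each (card, tag) pair to its sorted frequency list."""
    result = {}
    for card in cards:
        card_docs = df.loc[df['Card'] == card, 'POS']
        for tag in tags:
            result[(card, tag)] = count_tag(card_docs, tag)
    return result

# Toy stand-in for AlexDocs or NoAlexDocs (illustrative values only)
toy = pd.DataFrame({
    'Card': ['1', '1', '11'],
    'POS': [
        "[('niño', 'NOUN'), ('toca', 'VERB')]",
        "[('niño', 'NOUN'), ('violín', 'NOUN')]",
        "[('dragón', 'NOUN'), ('vio', 'VERB')]",
    ],
})

freqs = per_card_frequencies(toy, ['1', '11'], ['NOUN', 'VERB'])
print(freqs[('1', 'NOUN')])    # [('niño', 2), ('violín', 1)]
print(freqs[('11', 'VERB')])   # [('vio', 1)]
```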

### Analysis for Card 1:

<img src="https://psicobotica.com/prolexitim/nlp/stimuli/TAT-1.jpg" align="left" width=320> 

In [41]:
# Nouns
card1_alex_nouns = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '1'].POS, 'NOUN')
card1_noalex_nouns = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '1'].POS, 'NOUN')

# Verbs
card1_alex_verbs = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '1'].POS, 'VERB')
card1_noalex_verbs = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '1'].POS, 'VERB')

# Adjectives
card1_alex_adjectives = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '1'].POS, 'ADJ')
card1_noalex_adjectives = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '1'].POS, 'ADJ')

In [42]:
# Card 1 - Nouns - Alex Vs. NoAlex
print_Set_Stats(card1_alex_nouns, card1_noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'niño', 'abuelo', 'violín', 'violinista', 'día', 'clase', 'violin', 'esfuerzo', 'música', 'padres'}

NoAlex Set:
{'padre', 'niño', 'instrumento', 'violín', 'profesor', 'día', 'clases', 'años', 'música', 'padres'}

Union:
{'abuelo', 'violín', 'instrumento', 'profesor', 'años', 'padres', 'padre', 'niño', 'violinista', 'día', 'clase', 'violin', 'clases', 'música', 'esfuerzo'}
--> Union size: 15 (ratio 1.50)

Intersection:
{'niño', 'violín', 'día', 'música', 'padres'}
--> Intersection size: 5 (ratio 0.50)

Alex - NoAlex Difference:
{'abuelo', 'violinista', 'clase', 'violin', 'esfuerzo'}
--> Difference1 size: 5 (ratio 0.50)

NoAlex - Alex Difference:
{'padre', 'instrumento', 'profesor', 'años', 'clases'}
--> Difference2 size: 5 (ratio 0.50)

Not in common:
{'padre', 'instrumento', 'violinista', 'profesor', 'abuelo', 'clase', 'violin', 'clases', 'años', 'esfuerzo'}
--> Symmetric diff. size: 10 (ratio 1.00)



In [43]:
# Card 1 - Verbs - Alex Vs. NoAlex
print_Set_Stats(card1_alex_verbs, card1_noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'roto', 'tiene', 'aprender', 'obligaban', 'tenía', 'odiaba', 'quedó', 'tocar', 'pensando', 'dijeron'}

NoAlex Set:
{'gustaba', 'aprender', 'quería', 'estudiar', 'tiene', 'jugar', 'tenía', 'tocar', 'sabía', 'obligaban'}

Union:
{'roto', 'quería', 'tenía', 'odiaba', 'sabía', 'obligaban', 'gustaba', 'tiene', 'aprender', 'estudiar', 'jugar', 'quedó', 'tocar', 'pensando', 'dijeron'}
--> Union size: 15 (ratio 1.50)

Intersection:
{'tiene', 'aprender', 'tenía', 'tocar', 'obligaban'}
--> Intersection size: 5 (ratio 0.50)

Alex - NoAlex Difference:
{'roto', 'odiaba', 'quedó', 'pensando', 'dijeron'}
--> Difference1 size: 5 (ratio 0.50)

NoAlex - Alex Difference:
{'gustaba', 'quería', 'estudiar', 'jugar', 'sabía'}
--> Difference2 size: 5 (ratio 0.50)

Not in common:
{'gustaba', 'roto', 'quería', 'estudiar', 'jugar', 'odiaba', 'quedó', 'sabía', 'pensando', 'dijeron'}
--> Symmetric diff. size: 10 (ratio 1.00)



In [44]:
# Card 1 - Adjectives - Alex Vs. NoAlex
print_Set_Stats(card1_alex_adjectives, card1_noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'frustrado', 'intentanto', 'gran', 'aburrido', 'geniales', 'estresado', 'acomodada', 'mejor', 'solitario', 'distraído'}

NoAlex Set:
{'dormido', 'frustrado', 'cansado', 'aburrido', 'gran', 'buen', 'mejor', 'harto', 'llamado', 'triste'}

Union:
{'frustrado', 'cansado', 'estresado', 'harto', 'triste', 'distraído', 'dormido', 'intentanto', 'gran', 'aburrido', 'geniales', 'acomodada', 'buen', 'mejor', 'solitario', 'llamado'}
--> Union size: 16 (ratio 1.60)

Intersection:
{'frustrado', 'gran', 'mejor', 'aburrido'}
--> Intersection size: 4 (ratio 0.40)

Alex - NoAlex Difference:
{'intentanto', 'estresado', 'geniales', 'acomodada', 'solitario', 'distraído'}
--> Difference1 size: 6 (ratio 0.60)

NoAlex - Alex Difference:
{'dormido', 'cansado', 'buen', 'harto', 'llamado', 'triste'}
--> Difference2 size: 6 (ratio 0.60)

Not in common:
{'dormido', 'cansado', 'intentanto', 'estresado', 'geniales', 'buen', 'acomodada', 'solitario', 'harto', 'llamado', 'triste', 'distraído'}
--> Symmetric diff. size: 12 (ratio 1.20)

### Analysis for Card 9VH:

<img src="https://psicobotica.com/prolexitim/nlp/stimuli/TAT-9VH.jpg" align="left" width=320> 

In [45]:
# Nouns
card9VH_alex_nouns = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '9VH'].POS, 'NOUN')
card9VH_noalex_nouns = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '9VH'].POS, 'NOUN')

# Verbs
card9VH_alex_verbs = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '9VH'].POS, 'VERB')
card9VH_noalex_verbs = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '9VH'].POS, 'VERB')

# Adjectives
card9VH_alex_adjectives = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '9VH'].POS, 'ADJ')
card9VH_noalex_adjectives = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '9VH'].POS, 'ADJ')

In [46]:
# Card 9VH - Nouns - Alex Vs. NoAlex
print_Set_Stats(card9VH_alex_nouns, card9VH_noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'amigos', 'sol', 'día', 'trabajo', 'siesta', 'grupo', 'jornada', 'trabajadores', 'largo', 'campo'}

NoAlex Set:
{'amigos', 'sol', 'día', 'trabajo', 'siesta', 'grupo', 'hombres', 'jornada', 'campo', 'descanso'}

Union:
{'amigos', 'sol', 'siesta', 'grupo', 'trabajadores', 'largo', 'hombres', 'campo', 'día', 'trabajo', 'jornada', 'descanso'}
--> Union size: 12 (ratio 1.20)

Intersection:
{'amigos', 'sol', 'día', 'trabajo', 'siesta', 'grupo', 'jornada', 'campo'}
--> Intersection size: 8 (ratio 0.80)

Alex - NoAlex Difference:
{'trabajadores', 'largo'}
--> Difference1 size: 2 (ratio 0.20)

NoAlex - Alex Difference:
{'descanso', 'hombres'}
--> Difference2 size: 2 (ratio 0.20)

Not in common:
{'hombres', 'trabajadores', 'largo', 'descanso'}
--> Symmetric diff. size: 4 (ratio 0.40)



In [52]:
# Card 9VH - Verbs - Alex Vs. NoAlex
print_Set_Stats(card9VH_alex_verbs, card9VH_noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'durmiendo', 'descansando', 'encuentran', 'levantando', 'descansar', 'tomando', 'ver', 'tomar', 'realizar', 'estan'}

NoAlex Set:
{'durmiendo', 'descansando', 'descansaban', 'descansar', 'trabajando', 'dormir', 'decidieron', 'segando', 'trabajar', 'comer'}

Union:
{'descansando', 'descansaban', 'tomando', 'realizar', 'trabajar', 'comer', 'durmiendo', 'encuentran', 'levantando', 'descansar', 'trabajando', 'ver', 'dormir', 'decidieron', 'tomar', 'segando', 'estan'}
--> Union size: 17 (ratio 1.70)

Intersection:
{'durmiendo', 'descansando', 'descansar'}
--> Intersection size: 3 (ratio 0.30)

Alex - NoAlex Difference:
{'encuentran', 'levantando', 'tomando', 'ver', 'tomar', 'realizar', 'estan'}
--> Difference1 size: 7 (ratio 0.70)

NoAlex - Alex Difference:
{'descansaban', 'trabajando', 'dormir', 'decidieron', 'segando', 'trabajar', 'comer'}
--> Difference2 size: 7 (ratio 0.70)

Not in common:
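{'encuentran', 'levantando', 'descansaban', 'tomando', 'trabajando', 'ver', 'dormir', 'decidieron', 'tomar', 'segando', 'realizar', 'estan', 'trabajar', 'comer'}
--> Symmetric diff. size: 14 (ratio 1.40)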

In [48]:
# Card 9VH - Adjectives - Alex Vs. NoAlex
print_Set_Stats(card9VH_alex_adjectives, card9VH_noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'dura', 'tumbado', 'renovadas', 'consciente', 'gran', 'buen', 'único', 'juntos', 'plena', 'desconocido'}

NoAlex Set:
{'dura', 'siguiente', 'larga', 'gran', 'mayor', 'largo', 'cansados', 'primeros', 'juntos', 'duro'}

Union:
{'dura', 'tumbado', 'único', 'largo', 'siguiente', 'cansados', 'juntos', 'duro', 'renovadas', 'larga', 'consciente', 'gran', 'mayor', 'buen', 'primeros', 'plena', 'desconocido'}
--> Union size: 17 (ratio 1.70)

Intersection:
{'dura', 'juntos', 'gran'}
--> Intersection size: 3 (ratio 0.30)

Alex - NoAlex Difference:
{'tumbado', 'renovadas', 'consciente', 'buen', 'único', 'plena', 'desconocido'}
--> Difference1 size: 7 (ratio 0.70)

NoAlex - Alex Difference:
{'larga', 'cansados', 'mayor', 'largo', 'siguiente', 'primeros', 'duro'}
--> Difference2 size: 7 (ratio 0.70)

Not in common:
{'tumbado', 'renovadas', 'siguiente', 'larga', 'consciente', 'mayor', 'buen', 'único', 'largo', 'cansados', 'primeros', 'duro', 'plena', 'desconocido'}
--> Symmetric diff. size: 14 (ratio 1.40)

### Analysis for Card 11:

<img src="https://psicobotica.com/prolexitim/nlp/stimuli/TAT-11.jpg" align="left" width=320> 

In [53]:
# Nouns
card11_alex_nouns = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '11'].POS, 'NOUN')
card11_noalex_nouns = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '11'].POS, 'NOUN')

# Verbs
card11_alex_verbs = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '11'].POS, 'VERB')
card11_noalex_verbs = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '11'].POS, 'VERB')

# Adjectives
card11_alex_adjectives = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '11'].POS, 'ADJ')
card11_noalex_adjectives = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '11'].POS, 'ADJ')

In [54]:
# Card 11 - Nouns - Alex Vs. NoAlex
print_Set_Stats(card11_alex_nouns, card11_noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'cascada', 'montaña', 'hombre', 'camino', 'bosque', 'lugar', 'animales', 'final', 'catarata', 'viaje'}

NoAlex Set:
{'dragón', 'cascada', 'montaña', 'día', 'camino', 'agua', 'bosque', 'tesoro', 'lugar', 'piedras'}

Union:
{'dragón', 'tesoro', 'animales', 'catarata', 'cascada', 'montaña', 'piedras', 'día', 'camino', 'bosque', 'agua', 'lugar', 'hombre', 'final', 'viaje'}
--> Union size: 15 (ratio 1.50)

Intersection:
{'montaña', 'cascada', 'camino', 'bosque', 'lugar'}
--> Intersection size: 5 (ratio 0.50)

Alex - NoAlex Difference:
{'viaje', 'animales', 'hombre', 'final', 'catarata'}
--> Difference1 size: 5 (ratio 0.50)

NoAlex - Alex Difference:
{'dragón', 'día', 'agua', 'tesoro', 'piedras'}
--> Difference2 size: 5 (ratio 0.50)

Not in common:
{'dragón', 'final', 'día', 'agua', 'tesoro', 'viaje', 'animales', 'hombre', 'piedras', 'catarata'}
--> Symmetric diff. size: 10 (ratio 1.00)



In [55]:
# Card 11 - Verbs - Alex Vs. NoAlex
print_Set_Stats(card11_alex_verbs, card11_noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'vio', 'hizo', 'saben', 'encontrar', 'encontraba', 'llegado', 'caminando', 'tenía', 'cruzar', 'habia'}

NoAlex Set:
{'veo', 'lleva', 'tenían', 'disfrutar', 'cayó', 'llegado', 'llegar', 'hacer', 'pasar', 'tenía'}

Union:
{'vio', 'hizo', 'saben', 'disfrutar', 'cayó', 'caminando', 'llegar', 'tenía', 'pasar', 'habia', 'veo', 'lleva', 'tenían', 'encontrar', 'encontraba', 'llegado', 'hacer', 'cruzar'}
--> Union size: 18 (ratio 1.80)

Intersection:
{'tenía', 'llegado'}
--> Intersection size: 2 (ratio 0.20)

Alex - NoAlex Difference:
{'vio', 'hizo', 'saben', 'encontraba', 'encontrar', 'caminando', 'cruzar', 'habia'}
--> Difference1 size: 8 (ratio 0.80)

NoAlex - Alex Difference:
{'veo', 'lleva', 'cayó', 'tenían', 'disfrutar', 'llegar', 'hacer', 'pasar'}
--> Difference2 size: 8 (ratio 0.80)

Not in common:
{'veo', 'vio', 'lleva', 'hizo', 'saben', 'tenían', 'disfrutar', 'cayó', 'encontrar', 'encontraba', 'caminando', 'llegar', 'hacer', 'pasar', 'cruzar', 'habia'}
--> Symmetric diff. size: 16 (ratio 1.60)

In [57]:
# Card 11 - Adjectives - Alex Vs. NoAlex
print_Set_Stats(card11_alex_adjectives, card11_noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'maravilloso', 'hermosa', 'fantásticos', 'frondoso', 'gran', 'mágico', 'originales', 'lleno', 'turbulentos', 'vistas'}

NoAlex Set:
{'alto', 'dormido', 'escondida', 'impresionante', 'cansado', 'gran', 'debido', 'altos', 'escondido', 'precioso'}

Union:
{'maravilloso', 'hermosa', 'impresionante', 'cansado', 'frondoso', 'originales', 'debido', 'turbulentos', 'alto', 'dormido', 'fantásticos', 'gran', 'mágico', 'altos', 'lleno', 'vistas', 'precioso', 'escondido', 'escondida'}
--> Union size: 19 (ratio 1.90)

Intersection:
{'gran'}
--> Intersection size: 1 (ratio 0.10)

Alex - NoAlex Difference:
{'maravilloso', 'hermosa', 'fantásticos', 'frondoso', 'mágico', 'originales', 'lleno', 'turbulentos', 'vistas'}
--> Difference1 size: 9 (ratio 0.90)

NoAlex - Alex Difference:
{'alto', 'dormido', 'impresionante', 'cansado', 'debido', 'altos', 'precioso', 'escondido', 'escondida'}
--> Difference2 size: 9 (ratio 0.90)

Not in common:
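{'maravilloso', 'impresionante', 'hermosa', 'cansado', 'frondoso', 'originales', 'debido', 'turbulentos', 'alto', 'dormido', 'fantásticos', 'mágico', 'altos', 'lleno', 'vistas', 'precioso', 'escondido', 'escondida'}
--> Symmetric diff. size: 18 (ratio 1.80)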

### Analysis for Card 13HM:

<img src="https://psicobotica.com/prolexitim/nlp/stimuli/TAT-13HM.jpg" align="left" width=320> 

In [58]:
# Nouns
card13HM_alex_nouns = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '13HM'].POS, 'NOUN')
card13HM_noalex_nouns = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '13HM'].POS, 'NOUN')

# Verbs
card13HM_alex_verbs = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '13HM'].POS, 'VERB')
card13HM_noalex_verbs = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '13HM'].POS, 'VERB')

# Adjectives
card13HM_alex_adjectives = get_PoS_SortedList(AlexDocs[AlexDocs.Card == '13HM'].POS, 'ADJ')
card13HM_noalex_adjectives = get_PoS_SortedList(NoAlexDocs[NoAlexDocs.Card == '13HM'].POS, 'ADJ')

In [59]:
# Card 13HM - Nouns - Alex Vs. NoAlex
print_Set_Stats(card13HM_alex_nouns, card13HM_noalex_nouns, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'cama', 'despues', 'trabajo', 'mujer', 'amante', 'esposa', 'pérdida', 'casa', 'hombre', 'noche'}

NoAlex Set:
{'vida', 'ojos', 'día', 'casa', 'trabajo', 'mujer', 'pareja', 'cama', 'hombre', 'noche'}

Union:
{'vida', 'ojos', 'casa', 'amante', 'mujer', 'pareja', 'esposa', 'noche', 'día', 'despues', 'trabajo', 'pérdida', 'cama', 'hombre'}
--> Union size: 14 (ratio 1.40)

Intersection:
{'cama', 'trabajo', 'mujer', 'casa', 'hombre', 'noche'}
--> Intersection size: 6 (ratio 0.60)

Alex - NoAlex Difference:
{'pérdida', 'esposa', 'despues', 'amante'}
--> Difference1 size: 4 (ratio 0.40)

NoAlex - Alex Difference:
{'día', 'vida', 'ojos', 'pareja'}
--> Difference2 size: 4 (ratio 0.40)

Not in common:
{'vida', 'ojos', 'día', 'amante', 'despues', 'pareja', 'esposa', 'pérdida'}
--> Symmetric diff. size: 8 (ratio 0.80)



In [60]:
# Card 13HM - Verbs - Alex Vs. NoAlex
print_Set_Stats(card13HM_alex_verbs, card13HM_noalex_verbs, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'juan', 'mata', 'encontrar', 'hacer', 'tenía', 'amenazó', 'sabe', 'buscar', 'asi', 'va'}

NoAlex Set:
{'pasado', 'quería', 'tiene', 'llorando', 'ver', 'hacer', 'tenía', 'pasar', 'encontró', 'trabajar'}

Union:
{'quería', 'llorando', 'tenía', 'amenazó', 'pasar', 'trabajar', 'asi', 'va', 'juan', 'mata', 'pasado', 'tiene', 'encontrar', 'ver', 'hacer', 'sabe', 'buscar', 'encontró'}
--> Union size: 18 (ratio 1.80)

Intersection:
{'tenía', 'hacer'}
--> Intersection size: 2 (ratio 0.20)

Alex - NoAlex Difference:
{'juan', 'mata', 'encontrar', 'amenazó', 'sabe', 'buscar', 'asi', 'va'}
--> Difference1 size: 8 (ratio 0.80)

NoAlex - Alex Difference:
{'pasado', 'tiene', 'quería', 'llorando', 'ver', 'pasar', 'encontró', 'trabajar'}
--> Difference2 size: 8 (ratio 0.80)

Not in common:
{'pasado', 'juan', 'mata', 'quería', 'tiene', 'encontrar', 'llorando', 'ver', 'pasar', 'amenazó', 'sabe', 'encontró', 'buscar', 'trabajar', 'asi', 'va'}
--> Symmetric diff. size: 16 (ratio 1.60)

In [61]:
# Card 13HM - Adjectives - Alex Vs. NoAlex
print_Set_Stats(card13HM_alex_adjectives, card13HM_noalex_adjectives, Top_N)

WORDS ANALYSIS
--------------
Alex Set:
{'pequeña', 'cansado', 'solo', 'incomodo', 'buen', 'bella', 'muerta', 'juntos', 'plena', 'humilde'}

NoAlex Set:
{'pobre', 'normal', 'solo', 'desconsolado', 'agotado', 'nuevo', 'dormida', 'fallecida', 'muerta', 'triste'}

Union:
{'pequeña', 'cansado', 'desconsolado', 'incomodo', 'agotado', 'fallecida', 'muerta', 'juntos', 'normal', 'triste', 'pobre', 'solo', 'buen', 'dormida', 'nuevo', 'bella', 'plena', 'humilde'}
--> Union size: 18 (ratio 1.80)

Intersection:
{'solo', 'muerta'}
--> Intersection size: 2 (ratio 0.20)

Alex - NoAlex Difference:
{'pequeña', 'cansado', 'incomodo', 'buen', 'bella', 'juntos', 'plena', 'humilde'}
--> Difference1 size: 8 (ratio 0.80)

NoAlex - Alex Difference:
{'pobre', 'desconsolado', 'agotado', 'dormida', 'nuevo', 'normal', 'triste', 'fallecida'}
--> Difference2 size: 8 (ratio 0.80)

Not in common:
{'pequeña', 'plena', 'pobre', 'normal', 'cansado', 'desconsolado', 'incomodo', 'agotado', 'nuevo', 'buen', 'dormida', 'fallecida', 'bella', 'juntos', 'triste', 'humilde'}
--> Symmetric diff. size: 16 (ratio 1.60)

## Save the list of nouns, verbs and adjectives
- In order of appearance (for possible further analysis)

In [62]:
# Extracts the list of specific PoS tokens keeping the order of appearance in the doc. 
def get_PoS_List(doc, PoS_tag):
    """
    Parameters
    ----------
    doc : str
        String representation of a list of (word, POS_tag) tuples
        for one PoS-tagged document.
    PoS_tag : str
        The specific PoS tag to extract.
    
    Returns
    -------
    words : list of str
        Words with the requested tag, in order of appearance.
        
    """
    
    words = []
    
    tag_list = ast.literal_eval(doc)    # Get the list of tuples representing the doc. 
    for PoStuple in tag_list: 
        word = PoStuple[0]
        tag = PoStuple[1]
        if ( tag == PoS_tag ): 
            words.append(word)    # Add to the list only the grammatical function we want
          
    return words

In [63]:
# testing
print("Doc: " + alex_df['Text'][0])
print()
print("Verbs: " + str(get_PoS_List(alex_df['POS'][0],'VERB')))
print()
print("Nouns: " + str(get_PoS_List(alex_df['POS'][0],'NOUN')))
print()
print("Adjectives: " + str(get_PoS_List(alex_df['POS'][0],'ADJ')))
print()
print("Sub. Conj.: " + str(get_PoS_List(alex_df['POS'][0],'SCONJ')))

Doc: es un niño pensando en cual es la respuesta de sus deberes porque no la sabe.

Verbs: ['pensando', 'sabe']

Nouns: ['niño', 'respuesta', 'deberes']

Adjectives: []

Sub. Conj.: ['porque']


In [64]:
# Persist the list of specific parts of speech
alex_df['Verb_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'VERB'))
alex_df['Noun_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'NOUN'))
alex_df['Adjective_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'ADJ'))
alex_df['Subord_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'SCONJ'))
alex_df['Adverb_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'ADV'))
alex_df['Aux_List'] = alex_df.POS.apply(lambda x: get_PoS_List(x,'AUX'))

In [65]:
alex_df[['POS','Verb_List']].sample(4)

| | POS | Verb_List |
|---|---|---|
| 256 | [('un', 'DET'), ('hombre', 'NOUN'), ('que', 'P... | [encontró] |
| 82 | [('parece', 'AUX'), ('como', 'SCONJ'), ('una',... | [encontraron, enamoraron] |
| 228 | [('después', 'ADV'), ('de', 'ADP'), ('un', 'DE... | [llegó, mereció, aguantar, saborear, prepara] |
| 327 | [('erase', 'VERB'), ('una', 'DET'), ('vez', 'N... | [erase, encontrar] |


In [69]:
alex_df[['POS','Aux_List']].sample(4)

| | POS | Aux_List |
|---|---|---|
| 117 | [('es', 'AUX'), ('una', 'DET'), ('mujer', 'NOU... | [es, está, está, es] |
| 192 | [('un', 'DET'), ('grupo', 'NOUN'), ('de', 'ADP... | [] |
| 228 | [('después', 'ADV'), ('de', 'ADP'), ('un', 'DE... | [] |
| 54 | [('un', 'DET'), ('niño', 'NOUN'), ('que', 'PRO... | [iba, era] |


In [67]:
alex_df.columns

Index(['Code', 'TAS20', 'F1', 'F2', 'F3', 'Gender', 'Age', 'Card',
       'T_Metaphors', 'T_ToM', 'T_FP', 'T_Interpret', 'T_Desc', 'T_Confussion',
       'Text', 'Alex_A', 'Alex_B', 'Words', 'Sentences', 'Tokens',
       'Tokens_Stop', 'Tokens_Stem_P', 'Tokens_Stem_S', 'POS', 'NER', 'DEP',
       'Lemmas_CNLP', 'Lemmas_Spacy', 'Chars', 'avgWL', 'avgSL', 'Pun_Count',
       'Stop_Count', 'RawTokens', 'Title_Count', 'Upper_Count', 'PRON_Count',
       'DET_Count', 'ADV_Count', 'VERB_Count', 'PROPN_Count', 'NOUN_Count',
       'NUM_Count', 'PUNCT_Count', 'SYM_Count', 'SCONJ_Count', 'CCONJ_Count',
       'INTJ_Count', 'AUX_Count', 'ADP_Count', 'ADJ_Count', 'PRON_Ratio',
       'DET_Ratio', 'ADV_Ratio', 'VERB_Ratio', 'PROPN_Ratio', 'NOUN_Ratio',
       'NUM_Ratio', 'PUNCT_Ratio', 'SYM_Ratio', 'SCONJ_Ratio', 'CCONJ_Ratio',
       'INTJ_Ratio', 'AUX_Ratio', 'ADP_Ratio', 'ADJ_Ratio', 'TTR', 'HTR',
       'BoW_PCA_1', 'BoW_PCA_2', 'BoW_PCA_3', 'TFIDF_PCA_1', 'TFIDF_PCA_2',
       'TFIDF_PCA_3', 'Verb_List', 'Noun_List', 'Adjective_List',
       'Subord_List', 'Adverb_List', 'Aux_List']

In [68]:
# Save Updated features dataset
Feats_4_path = "D:\\Dropbox-Array2001\\Dropbox\\DataSets\\Prolexitim-Dataset\\Prolexitim_v2_features_4.csv"
alex_df.to_csv(Feats_4_path, sep=';', encoding='utf-8', index=False)
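When the saved file is loaded again, the new list columns come back as strings, just like the `POS` column, because CSV stores only text. A minimal sketch of the round trip follows, using an in-memory buffer instead of the real file; the toy DataFrame is an assumption made for the example.

```python
import ast
import io

import pandas as pd

# Toy frame with one list-valued column, like Verb_List
df = pd.DataFrame({'Verb_List': [['pensando', 'sabe'], []]})

# Write to an in-memory buffer with the same separator used above
buffer = io.StringIO()
df.to_csv(buffer, sep=';', index=False)
buffer.seek(0)

reloaded = pd.read_csv(buffer, sep=';')
print(type(reloaded['Verb_List'][0]))   # the list is now a plain string

# Parse the strings back into Python lists
reloaded['Verb_List'] = reloaded['Verb_List'].apply(ast.literal_eval)
print(reloaded['Verb_List'][0])
```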