## Language Analysis of Alexithymic Discourse

<hr>

Alexithymic Language Project / raul@psicobotica.com / V2 release (sept 2020)

<hr>

### Dataset preprocessing

- Dataset load / cleansing. 
- Apply participant inclusion/exclusion criteria. 
- Descriptive analysis. 

### Dataset load

In [110]:
import pandas as pd 
import numpy as np
from scipy import stats

from sklearn.metrics import cohen_kappa_score

%matplotlib inline
import matplotlib.pyplot as plt 

In [111]:
prolexitim_dataset_path = "https://raw.githubusercontent.com/raul-arrabales/alexithymic-lang/master/data/Prolexitim_v2_tagged.csv"

alex_df = pd.read_csv(prolexitim_dataset_path, header=0, delimiter=";")

In [112]:
# Keep only participants aged 18 to 65 (inclusive)
alex_df = alex_df[alex_df.Age.between(18, 65)]

In [113]:
alex_df.head()

Unnamed: 0,Code,TAS20,F1,F2,F3,Gender,Age,Card,T_Metaphors,T_ToM,T_FP,T_Interpret,T_Desc,T_Confussion,Text
0,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,1,0,1,0,1,0,0,es un niño pensando en cual es la respuesta de...
1,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,9VH,0,0,0,1,0,0,soldados descansando.
2,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,11,0,0,0,0,1,0,Una cascada.
3,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,13HM,0,1,0,0,0,0,hombre llorando porque su mujer ha muerto.
4,20cd825cadb95a71763bad06e142c148,40,12,10,18,2,22,1,0,1,0,0,0,0,un Niño cansado de estudiar y presionado por s...


In [114]:
alex_df.count()

Code            396
TAS20           396
F1              396
F2              396
F3              396
Gender          396
Age             396
Card            396
T_Metaphors     396
T_ToM           396
T_FP            396
T_Interpret     396
T_Desc          396
T_Confussion    396
Text            396
dtype: int64

In [115]:
alex_df.isnull().values.any()

False

In [119]:
# Rows with any missing value (isna and isnull are aliases, so both selections are identical)
df_na = alex_df[alex_df.isna().any(axis=1)]
df_null = alex_df[alex_df.isnull().any(axis=1)]
df_na

Unnamed: 0,Code,TAS20,F1,F2,F3,Gender,Age,Card,T_Metaphors,T_ToM,T_FP,T_Interpret,T_Desc,T_Confussion,Text


### Gender and Age stats

In [120]:
alex_df.Age.describe()

count    396.000000
mean      34.664141
std       12.310367
min       18.000000
25%       24.000000
50%       35.500000
75%       43.000000
max       61.000000
Name: Age, dtype: float64

In [121]:
# alex_df.Age.plot.hist(by=None, bins=20)

In [122]:
alex_df.Gender.value_counts()

2    254
1    142
Name: Gender, dtype: int64

### Build categorical TAS-20 variables

**Cutoff scoring - Criterion A** (column Alex_A):

Bagby, R. M., Parker, J. D. A. & Taylor, G. J. (1994). The twenty-item Toronto Alexithymia Scale-I. Item selection and cross-validation of the factor structure. Journal of Psychosomatic Research, 38, 23-32.

- Equal to or less than 51 = non-alexithymia.
- Scores of 52 to 60 = possible alexithymia.
- Equal to or greater than 61 = alexithymia.

Expressed as a dichotomous variable:
- Positive (TAS-20 >= 61)
- Negative (TAS-20 < 61)


**Cutoff scoring - Criterion B** (column Alex_B):

Loas, G., Otmani, O., Fremaux, D., Lecercle, C., Duflot, M., & Delahousse, J. (1996). External validity, reliability and basic score determination of the Toronto Alexithymia Scales (TAS and TAS-20) in a group of alcoholic patients. L'Encephale, 22(1), 35-40.

- Less than 44 = non-alexithymia (this lower cutoff has also been considered).
- Scores of 45 to 55 = possible alexithymia.
- Equal to or greater than 56 = alexithymia.

Expressed as a dichotomous variable:
- Positive (TAS-20 >= 56)
- Negative (TAS-20 < 56)
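
The two cutoff schemes above can also be kept as three-level categories before they are collapsed into a binary flag. The following is a minimal sketch in plain Python; the function name `tas20_category` and the label strings are illustrative and not part of the original notebook. For Criterion B the lower band is assumed to be 44 or below, so that it meets the 45 to 55 band without a gap.

```python
def tas20_category(score, possible_from, alexithymia_from):
    """Map a TAS-20 total score to one of three illustrative labels."""
    if score >= alexithymia_from:
        return "alexithymia"
    if score >= possible_from:
        return "possible alexithymia"
    return "non-alexithymia"

# Criterion A (Bagby et al., 1994): <=51, 52-60, >=61
CRITERION_A = {"possible_from": 52, "alexithymia_from": 61}
# Criterion B (Loas et al., 1996): assumed <=44, 45-55, >=56
CRITERION_B = {"possible_from": 45, "alexithymia_from": 56}

example_scores = [40, 52, 58, 61]
labels_a = [tas20_category(s, **CRITERION_A) for s in example_scores]
labels_b = [tas20_category(s, **CRITERION_B) for s in example_scores]
```

A score of 58 shows where the criteria diverge: it is "possible alexithymia" under Criterion A but "alexithymia" under Criterion B.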

In [123]:
alex_df['Alex_A'] = np.where(alex_df['TAS20']>=61, 1, 0)
alex_df['Alex_B'] = np.where(alex_df['TAS20']>=56, 1, 0)

In [124]:
# How much do the two criteria differ?

# Point-biserial correlation (equivalent to the phi coefficient here, since both variables are binary)
stats.pointbiserialr( alex_df.Alex_A, alex_df.Alex_B )

PointbiserialrResult(correlation=0.8714893406611904, pvalue=5.561988739894539e-124)

In [125]:
# alex_df[['Alex_A', 'Alex_B']].corr()

In [126]:
# Agreement between the two criteria (Cohen's kappa, treating each criterion as a rater)
cohen_kappa_score( alex_df.Alex_A, alex_df.Alex_B )

0.8633093525179856

In [127]:
alex_df.Alex_A.value_counts()

0    316
1     80
Name: Alex_A, dtype: int64

In [128]:
alex_df.Alex_B.value_counts()

0    297
1     99
Name: Alex_B, dtype: int64
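
Because every score of 61 or more is also 56 or more, no row can be positive under Criterion A and negative under Criterion B. Together with the counts above (80 and 99 positives out of 396 rows), that fixes the full 2x2 agreement table, so the correlation and kappa values can be re-derived by hand as a sanity check. This sketch uses only the standard library; the variable names are illustrative.

```python
import math

n = 396            # rows in the dataset
both_pos = 80      # Alex_A = 1 and Alex_B = 1
only_b = 99 - 80   # Alex_A = 0 and Alex_B = 1
only_a = 0         # impossible: TAS20 >= 61 implies TAS20 >= 56
both_neg = n - both_pos - only_b - only_a

# Cohen's kappa: observed agreement corrected for chance agreement
p_observed = (both_pos + both_neg) / n
pos_a, pos_b = both_pos + only_a, both_pos + only_b
p_expected = (pos_a * pos_b + (n - pos_a) * (n - pos_b)) / n**2
kappa = (p_observed - p_expected) / (1 - p_expected)

# Phi coefficient: Pearson correlation between two binary variables,
# which is what the point-biserial correlation reduces to here
phi = (both_pos * both_neg - only_a * only_b) / math.sqrt(
    pos_a * (n - pos_a) * pos_b * (n - pos_b)
)

print(round(kappa, 4), round(phi, 4))
```

The disagreement is confined to the 19 rows with TAS-20 scores from 56 to 60, which Criterion B flags and Criterion A does not.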

### Document features

In [129]:
# Add text length in words
alex_df['Words'] = alex_df.apply(lambda row: len(row.Text.split()), axis=1)

In [130]:
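# Add text length in sentences (period-delimited segments; the empty segment
# after a final period is also counted, so "Una cascada." yields 2)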
alex_df['Sentences'] = alex_df.apply(lambda row: len(row.Text.split('.')), axis=1)
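
Splitting on `'.'` counts the empty string that follows a final period, so a one-sentence text is reported as two. A possible alternative, sketched here with the standard `re` module, counts only non-empty segments; the helper name `count_sentences` is hypothetical, and the notebook's stored `Sentences` values come from the simpler split.

```python
import re

def count_sentences(text):
    """Count non-empty segments delimited by '.', '!', '?' or '…'."""
    segments = re.split(r"[.!?…]+", text)
    return sum(1 for segment in segments if segment.strip())

naive_count = len("soldados descansando.".split("."))   # trailing "" is counted
regex_count = count_sentences("soldados descansando.")
```

This is still a heuristic: abbreviations and decimal numbers containing periods would be split as well.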

### Text Processing

In [133]:
# Using nltk for text processing

from nltk.tokenize import RegexpTokenizer 
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

In [134]:
# Reg Exp Tokenizer
tokenizer = RegexpTokenizer(r'\w+')

In [135]:
# Stop words in Spanish 
es_stop = set(stopwords.words('spanish'))

In [137]:
# Porter stemmer (NLTK's PorterStemmer implements the English Porter algorithm)
p_stemmer = PorterStemmer() 
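
`PorterStemmer` applies suffix rules designed for English, so on Spanish tokens it only strips endings that happen to look like English ones. NLTK also ships a Snowball stemmer with Spanish rules. The sketch below shows how it could be swapped in as an alternative that this notebook does not use; `es_stemmer` and the sample tokens are illustrative names. Switching stemmers would change the `Tokens_Stem` values stored in the dataset.

```python
from nltk.stem.snowball import SnowballStemmer

# Snowball stemmer configured for Spanish (no corpus download required)
es_stemmer = SnowballStemmer("spanish")

sample_tokens = ["hombres", "cansado", "cansada", "diversión"]
sample_stems = [es_stemmer.stem(token) for token in sample_tokens]
print(sample_stems)
```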

In [142]:
# Add column with list of tokens for each document
alex_df['Tokens'] = alex_df.apply(lambda row: tokenizer.tokenize(row.Text.lower()), axis=1) 

In [143]:
# Add column with list of stopped tokens for each document
alex_df['Tokens_Stop'] = alex_df.apply(lambda row: [i for i in row.Tokens if i not in es_stop], axis=1) 

In [146]:
# Add column with list of stemmed tokens for each document
alex_df['Tokens_Stem'] = alex_df.apply(lambda row: [p_stemmer.stem(i) for i in row.Tokens_Stop], axis=1) 

In [147]:
alex_df

Unnamed: 0,Code,TAS20,F1,F2,F3,Gender,Age,Card,T_Metaphors,T_ToM,...,T_Desc,T_Confussion,Text,Alex_A,Alex_B,Words,Sentences,Tokens,Tokens_Stop,Tokens_Stem
0,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,1,0,1,...,0,0,es un niño pensando en cual es la respuesta de...,0,0,16,2,"[es, un, niño, pensando, en, cual, es, la, res...","[niño, pensando, respuesta, deberes, sabe]","[niño, pensando, respuesta, deber, sabe]"
1,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,9VH,0,0,...,0,0,soldados descansando.,0,0,2,2,"[soldados, descansando]","[soldados, descansando]","[soldado, descansando]"
2,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,11,0,0,...,1,0,Una cascada.,0,0,2,2,"[una, cascada]",[cascada],[cascada]
3,bc39e22ca5dba59fbd97c27987878f56,40,16,9,15,2,22,13HM,0,1,...,0,0,hombre llorando porque su mujer ha muerto.,0,0,7,2,"[hombre, llorando, porque, su, mujer, ha, muerto]","[hombre, llorando, mujer, muerto]","[hombr, llorando, mujer, muerto]"
4,20cd825cadb95a71763bad06e142c148,40,12,10,18,2,22,1,0,1,...,0,0,un Niño cansado de estudiar y presionado por s...,0,0,29,4,"[un, niño, cansado, de, estudiar, y, presionad...","[niño, cansado, estudiar, presionado, padres, ...","[niño, cansado, estudiar, presionado, padr, tr..."
5,20cd825cadb95a71763bad06e142c148,40,12,10,18,2,22,9VH,0,1,...,0,0,Grupo de amigos después de una noche de divers...,0,0,24,3,"[grupo, de, amigos, después, de, una, noche, d...","[grupo, amigos, después, noche, diversión, can...","[grupo, amigo, despué, noch, diversión, cansad..."
6,20cd825cadb95a71763bad06e142c148,40,12,10,18,2,22,11,0,0,...,1,0,acantilado.,0,0,1,2,[acantilado],[acantilado],[acantilado]
7,20cd825cadb95a71763bad06e142c148,40,12,10,18,2,22,13HM,0,1,...,0,0,Hombre desolado porque se ha encontrado a su m...,0,0,10,2,"[hombre, desolado, porque, se, ha, encontrado,...","[hombre, desolado, encontrado, mujer, fallecida]","[hombr, desolado, encontrado, mujer, fallecida]"
8,107b920c3318629af25cd9fe09c2bd25,37,14,10,13,2,23,1,0,1,...,0,0,"Es un niño que está aburrido, está cansado de ...",0,0,12,2,"[es, un, niño, que, está, aburrido, está, cans...","[niño, aburrido, cansado, tocar, violín]","[niño, aburrido, cansado, tocar, violín]"
9,107b920c3318629af25cd9fe09c2bd25,37,14,10,13,2,23,9VH,0,0,...,1,0,Esta imagen refleja a varios hombres en el des...,0,0,24,2,"[esta, imagen, refleja, a, varios, hombres, en...","[imagen, refleja, varios, hombres, descanso, t...","[imagen, refleja, vario, hombr, descanso, trab..."


### Save processed dataset

In [148]:
# Dataset with classification, tags, text features, etc. 

alex_class_dataset_path = "D:\\Dropbox-Array2001\\Dropbox\\DataSets\\Prolexitim-Dataset\\Prolexitim_v2_classes.csv"
alex_df.to_csv(alex_class_dataset_path, sep=';', encoding='utf-8', index=False)