# Generic Content Analysis


## ⚠️ PLEASE READ BEFORE DOING ANYTHING ⚠️

Welcome to this online coding environment! 
You are currently running a *Jupyter notebook* that we hope will be useful for the content analysis of questionnaires. 

At the end of the execution, you can save all your results in an HTML file by selecting `"File" → "Save and Export Notebook as" → "HTML"`.

In [None]:
######### IMPORTS #########
from IPython.display import display
from analysis_gui.forms import *

import io
import asyncio

import warnings
warnings.filterwarnings('ignore')
############################

#Required package to download the image
# !wget https://github.com/plotly/orca/releases/download/v1.2.1/orca-1.2.1-x86_64.AppImage -O /usr/local/bin/orca
# !chmod +x /usr/local/bin/orca
# !apt-get install xvfb libgtk2.0-0 libgconf-2-4

---
# Data import

By default, this repository is configured to make you compute your own results (`experiment = "Custom"`), but we also configured it to reproduce some previous analyses:

* option `experiment = "Gauld2023_OSAS_content_analysis"` : Gauld C, Baillieul S, Martin VP, Richaud A, Pelou M, Abi-Saab P, Coelho J, Philip P, Pépin JL, Micoulaud-Franchi JA. 
What evaluate obstructive sleep apnea patient-based screening questionnaires? A systematic and quantified item content analysis. *Under review* 

* option `experiment = "Gauld2023_sleep_content_analysis"`: 
 Gauld C, Martin VP, Richaud A, Bailleul S, Lucie V, Perromat JL, Zreik I, Taillard J, Geoffroy PA, Lopez R, Micoulaud-Franchi JA. Systematic Item Content and Overlap Analysis of Self-reported Multiple Sleep Disorders Screening Questionnaires in Adults. *Journal of Clinical Medicine*. [https://doi.org/10.3390/jcm12030852](https://doi.org/10.3390/jcm12030852) 

* option `experiment = "Fried2017"` : Fried EI. The 52 symptoms of major depression: Lack of content overlap among seven common depression scales. *Journal of Affective Disorders*. 2017 Jan;208:191–7. 


## Correct symptom file formatting

For this notebook to work correctly, your symptom file should be formatted as follows: 
* the first four columns must be the category (named "Category" in our example), the subcategory (named "Subcategory" in our example), the abbreviation of the symptom (as shown in the Figure, named "Ab" in our example) and the name of the symptom (named "Symptom" in our example);
* the other columns are the different questionnaires, while the rows are the different symptoms.
* For each questionnaire, the symptoms are coded as follows: 
    * 0: The symptom is absent from this questionnaire
    * 1: The symptom is specific in this questionnaire (i.e. the symptom has been identified in an item mentioning only one symptom)
    * 2: The symptom is compound in this questionnaire (i.e. the symptom has been identified in an item mentioning at least two symptoms)

⚠️ If you do not have categories or subcategories, simply leave the first or second column empty ⚠️
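
The layout described above can be checked before uploading a file. The following is only a minimal sketch in plain Python: the table content, the questionnaire names `Q1` and `Q2`, and the helper `check_symptom_table` are hypothetical and are not part of this notebook's code.

```python
# Hypothetical miniature symptom table in the layout described above.
# "Q1" and "Q2" are placeholder questionnaire names, not real instruments.
HEADER = ["Category", "Subcategory", "Ab", "Symptom", "Q1", "Q2"]
ROWS = [
    ["Sleep", "Insomnia", "DIS", "Difficulty initiating sleep", 1, 0],
    ["Sleep", "Insomnia", "EMA", "Early morning awakening", 2, 1],
    ["", "", "FAT", "Fatigue", 0, 2],  # empty category/subcategory is allowed
]

VALID_CODES = {0, 1, 2}  # 0 = absent, 1 = specific, 2 = compound


def check_symptom_table(header, rows):
    """Return a list of readable problems (empty if the table looks fine)."""
    problems = []
    if len(header) < 5:
        problems.append("expected 4 descriptive columns plus at least one questionnaire")
    for line_no, row in enumerate(rows, start=1):
        if len(row) != len(header):
            problems.append(f"row {line_no}: expected {len(header)} cells, got {len(row)}")
            continue
        # Every column after the first four is a questionnaire column.
        for name, value in zip(header[4:], row[4:]):
            if value not in VALID_CODES:
                problems.append(f"row {line_no}, column {name!r}: invalid code {value!r}")
    return problems


print(check_symptom_table(HEADER, ROWS))  # an empty list means no problem was found
```

A check of this kind catches the most common formatting mistakes (a missing column, or a code other than 0, 1 or 2) before the analysis cells are run.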

In [None]:
cs = ChartStudioForm()
display(cs)

## Importing data by uploading an excel file 

To import your own excel spreadsheet: 
>* Select "Custom" from the dropdown menu below 
>* Click on the "Upload" button ⭱
>* Click the "Confirm" button ✓

This program only accepts Excel documents in the `.xls` and `.xlsx` file formats.

In [None]:
file_selection = FileSelectionForm()
cs.chain(file_selection)
display(file_selection)

## Reference classifications
In the `Gauld2023_sleep_content_analysis` paper, we compare the symptoms of the questionnaires with two reference classifications (ICSD and DSM). If you have reference columns that you want to compare against but do not want to compute metrics on, please add them to this table. Otherwise, just leave this list empty [ ].

⚠️ The names of the references must match the names of their columns EXACTLY (including upper and lower case, and spaces) ⚠️
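
To illustrate what an exact match means here, the sketch below compares a list of reference names with a list of column names. The column names and the helper `missing_references` are assumptions made for this example only.

```python
def missing_references(references, columns):
    """Return the reference names that match no column name exactly."""
    return [name for name in references if name not in columns]


# Hypothetical column names of a symptom file with two reference columns.
columns = ["Category", "Subcategory", "Ab", "Symptom", "ICSD", "DSM", "Q1"]

# "dsm " differs from "DSM" by its case and by a trailing space.
print(missing_references(["ICSD", "dsm ", "DSM"], columns))  # prints ['dsm ']
```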

# Ordering questionnaires and symptoms

First, the questionnaires are sorted from the highest number of symptoms to the lowest.
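
This ordering can be sketched in a few lines of plain Python. The example assumes that a symptom counts as present whenever its code is not 0 (specific or compound); the data and the helper `count_symptoms` are hypothetical and do not reproduce the notebook's internal code.

```python
# Hypothetical codes (0 = absent, 1 = specific, 2 = compound), one list per questionnaire.
table = {
    "Q1": [1, 0, 2, 0],
    "Q2": [1, 2, 2, 1],
    "Q3": [0, 0, 1, 0],
}


def count_symptoms(codes):
    """Number of symptoms present, whether specific or compound."""
    return sum(1 for code in codes if code != 0)


# Sort questionnaire names from the highest symptom count to the lowest.
ordered = sorted(table, key=lambda name: count_symptoms(table[name]), reverse=True)
print(ordered)  # prints ['Q2', 'Q1', 'Q3']
```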

In [None]:
ref_selection = ReferenceSelectionForm()
file_selection.chain(ref_selection)
display(ref_selection)

---
# 1. Analysis of the number and frequency of symptoms

As a first step, we analyse the frequency of the symptoms.

## Histogram of number of symptoms

### Sorted by number of occurrences or category

In [None]:
histo = HistogramUI()
ref_selection.chain(histo)
display(histo)

## Number of symptoms by questionnaire

In [None]:
num_symptoms = NumSymptomsUI()
ref_selection.chain(num_symptoms)
display(num_symptoms)

The table has been saved in the online folder (📁 symbol on the left) under the name [table1_symptoms_per_questionnaire.xlsx](./table1_symptoms_per_questionnaire.xlsx). <br>You can change the name and the format of the file by changing the name in the `sympt_per_questionnaire.to_excel()` function. 
<br>⚠️ If you need it, save the Excel file on your local computer: these online files will be deleted as soon as you quit this page!

## Symptoms that are in classifications but not in questionnaires

In [None]:
ref_dis = RefDisplayUI()
ref_selection.chain(ref_dis)
display(ref_dis)

## Number of symptoms in each category for each questionnaire

In [None]:
numSymCatForm = NumSymCat()
ref_selection.chain(numSymCatForm)
display(numSymCatForm)

The table has been saved in the online folder (📁 symbol on the left) under the name [table2_categorie_per_questionnaire.xlsx](./table2_categorie_per_questionnaire.xlsx). <br>You can change the name and the format of the file by changing the name in the `cat_per_questionnaire.T.to_excel()` function. 
<br>⚠️ If you need it, save the Excel file on your local computer: these online files will be deleted as soon as you quit this page!

## Distribution across the categories of the symptoms measured by each questionnaire
(i.e. the same counts as before, but normalized by questionnaire, so that the values of each questionnaire sum to 1).
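
The normalization can be sketched as follows. The counts, the category names and the helper `normalize` are hypothetical; returning zeros for a questionnaire without any symptom is a convention chosen for this example, not necessarily the notebook's behaviour.

```python
# Hypothetical number of symptoms per category for two questionnaires.
counts = {
    "Q1": {"Sleep": 3, "Mood": 1},
    "Q2": {"Sleep": 2, "Mood": 2},
}


def normalize(per_category):
    """Turn category counts into proportions that sum to 1."""
    total = sum(per_category.values())
    if total == 0:
        # Convention for this sketch: no symptom at all gives all-zero proportions.
        return {category: 0.0 for category in per_category}
    return {category: n / total for category, n in per_category.items()}


distribution = {name: normalize(c) for name, c in counts.items()}
print(distribution["Q1"])  # prints {'Sleep': 0.75, 'Mood': 0.25}
```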

In [None]:
distribution_form = DistributionUI()
numSymCatForm.chain(distribution_form)
display(distribution_form)

---
# 2. Analysis and data visualization of content analysis Figure

## Changing shape of data

In [None]:
# changing the shape of data
display(common.melt_output)

## Content Analysis Figure

If you want to analyse custom data, you will have to set the variable `max_radius` so that the figure has the desired look!

In [None]:
circle = CircleForm()
ref_selection.chain(circle)
display(circle)

## Overlap between questionnaires - Jaccard Index

In order to estimate the overlap between the symptoms measured by the questionnaires, we calculate the Jaccard index, which is defined as the number of symptoms that are measured by both questionnaires, divided by the number of unique symptoms measured by at least one of the two questionnaires.
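
As an illustration of this definition, here is a minimal sketch in plain Python. It assumes that a symptom is measured whenever its code is not 0, and it returns 0 when neither questionnaire measures any symptom; both choices, as well as the helper names, are assumptions of this example and not the notebook's actual code.

```python
def measured(codes):
    """Positions of the symptoms that are present (code 1 or 2)."""
    return {index for index, code in enumerate(codes) if code != 0}


def jaccard(codes_a, codes_b):
    """Size of the intersection divided by the size of the union."""
    set_a, set_b = measured(codes_a), measured(codes_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention for this sketch: no symptom in either questionnaire
    return len(set_a & set_b) / len(union)


# Hypothetical codes for two questionnaires over the same four symptoms.
q1 = [1, 0, 2, 0]
q2 = [1, 2, 2, 1]
print(jaccard(q1, q2))  # prints 0.5 (2 shared symptoms out of 4 unique ones)
```

The index ranges from 0 (no symptom in common) to 1 (both questionnaires measure exactly the same symptoms).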

### Jaccard index of symptoms for each pair of questionnaires
First, we compute the Jaccard index for each pair of questionnaires and plot it using a heatmap.

In [None]:
jaccard_table = JaccardTable()
num_symptoms.chain(jaccard_table)
display(jaccard_table)

Table 3 has been saved in the online folder (📁 symbol on the left) under the name [table3_jaccard_pairs.xlsx](./table3_jaccard_pairs.xlsx). <br>You can change the name and the format of the file by changing the name in the `jaccard_table.to_excel()` function. 
<br>⚠️ If you need it, save the Excel file on your local computer: these online files will be deleted as soon as you quit this page!

In [None]:
jaccard_heatmap = JaccardHeatmap()
jaccard_table.chain(jaccard_heatmap)
display(jaccard_heatmap)

The figure has been saved in the online folder (📁 symbol on the left) under the name [figure5_heatmap_jaccard.pdf](figure5_heatmap_jaccard.pdf). <br>You can change the name and the format of the file by changing the name in the `#fig.write_image()` function. 
<br>⚠️ If you want it, save the figure on your local computer: these online files will be deleted as soon as you quit this page!

### Average Jaccard index
Then, we compute, for each questionnaire, the average of its Jaccard indices with the other questionnaires (excluding the references). 
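
The averaging step can be sketched as follows. The symptom sets, the reference name and the helper functions are hypothetical, and the sketch assumes that references are left out both as rows and as comparison partners, which the notebook may handle differently.

```python
def jaccard(set_a, set_b):
    """Jaccard index of two sets of symptom abbreviations."""
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def average_jaccard(symptoms, references):
    """Mean Jaccard index of each questionnaire with all other questionnaires."""
    names = [name for name in symptoms if name not in references]
    averages = {}
    for name in names:
        others = [other for other in names if other != name]
        if not others:
            averages[name] = 0.0  # a single questionnaire has nothing to compare with
            continue
        total = sum(jaccard(symptoms[name], symptoms[other]) for other in others)
        averages[name] = total / len(others)
    return averages


# Hypothetical symptom sets; "ICSD" plays the role of a reference column.
symptoms = {
    "Q1": {"DIS", "EMA"},
    "Q2": {"DIS", "EMA", "FAT", "SNO"},
    "Q3": {"FAT"},
    "ICSD": {"DIS", "EMA", "FAT", "SNO"},
}
print(average_jaccard(symptoms, references={"ICSD"}))
# prints {'Q1': 0.25, 'Q2': 0.375, 'Q3': 0.125}
```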

In [None]:
jaccard_idx = AverageJaccardIndex()
jaccard_table.chain(jaccard_idx)
display(jaccard_idx)

Table 4 has been saved in the online folder (📁 symbol on the left) under the name [table4_jaccard_average_questionnaires.xlsx](./table4_jaccard_average_questionnaires.xlsx). <br>You can change the name and the format of the file by changing the name in the `jaccard.to_excel()` function. 
<br>⚠️ If you need it, save the Excel file on your local computer: these online files will be deleted as soon as you quit this page!

### Correlation between the number of symptoms and the average Jaccard index for each questionnaire 

In [None]:
correlation = Correlation()
jaccard_idx.chain(correlation)
display(correlation)

### Jaccard index of symptoms for each pair of questionnaires for each category

We compute the same metric (average of averages) for each category.

In [None]:
jaccardPairIndex = JaccardPairIndex('Category')
ref_selection.chain(jaccardPairIndex)
display(jaccardPairIndex)

Table 5 has been saved in the online folder (📁 symbol on the left) under the name [table5_jaccard_categories.xlsx](./table5_jaccard_categories.xlsx). <br>You can change the name and the format of the file by changing the name in the `res.to_excel()` function. 
<br>⚠️ If you need it, save the Excel file on your local computer: these online files will be deleted as soon as you quit this page!

We compute the same metric (average of averages) for each subcategory.

In [None]:
jaccardPairIndexSub = JaccardPairIndex('Subcategory')
ref_selection.chain(jaccardPairIndexSub)
display(jaccardPairIndexSub)

# Sunburst Plot

In [None]:
sunburst = SunburstForm()
ref_selection.chain(sunburst)
display(sunburst)

---
# Export to html
You have reached the end of this notebook. 
If you want to save the whole page, you can export it as HTML with dynamic figures:
>* "File" → "Save and Export Notebook as" → "HTML" 