# Semi-Supervised Psychometric Scoring of Document Collections

In this notebook, you can find ...

In [1]:
import libs.ssnmf_train as train
import libs.ssnmf_test as test

First, we set the parameters required to train the SS-NMF model. The parameters are as follows:

* **filepath**: The name of the input file.
* **n_topics**: The number of subtopics learned for each basic human value.
* **betaloss**: The loss function of the NMF model, for example *'kullback-leibler'*.
* **bckg_brown**: A boolean that controls whether the Brown corpus is used as a background corpus. If it is *True*, the model uses the Brown corpus as the background corpus.
* **n_top_words**: The number of most important words to report for each topic.
* **word_count**: The number of words to display.
* **train_context**: The components returned by the training step. It is a dictionary with four entries: *nmf_list*, *W_list*, *tfidf*, and *tfidf_vectorizer*.
* **output_file**: The name of the file to which a result is exported.

In [2]:
input_file = 'pruned_schwartz.json'
beta_loss = 'kullback-leibler'
bckgrnd_brown = False
n_of_topics = 3
n_of_top_words = 7
n_of_words = 10

# Training Part

We train our SS-NMF model based on the **Schwartz Value Theory** with the specified parameters. This theory organizes ten basic human values (BHVs) into five higher-order groups:

* **Openness to change:** Self-Direction and Stimulation.
* **Self-enhancement:** Achievement and Power.
* **Hedonism:** Hedonism (considered to be shared between Openness to change and Self-enhancement).
* **Conservation:** Security, Conformity, and Tradition.
* **Self-transcendence:** Benevolence and Universalism.
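
The grouping above can be written down as a small lookup table. The sketch below is only an illustration: the names `HIGHER_ORDER_GROUPS` and `group_scores`, the example numbers, and the choice to average the values inside a group are assumptions and are not part of the `libs` package.

```python
# Higher-order groups from the list above, mapped to their basic human values.
HIGHER_ORDER_GROUPS = {
    "openness to change": ["self-direction", "stimulation"],
    "self-enhancement": ["achievement", "power"],
    "hedonism": ["hedonism"],
    "conservation": ["security", "conformity", "tradition"],
    "self-transcendence": ["benevolence", "universalism"],
}


def group_scores(value_scores):
    """Average the per-value scores inside each higher-order group (assumed rule)."""
    result = {}
    for group, values in HIGHER_ORDER_GROUPS.items():
        present = [value_scores[v] for v in values if v in value_scores]
        result[group] = sum(present) / len(present) if present else 0.0
    return result


# Hypothetical per-value scores for one document (illustrative numbers only).
example = {
    "self-direction": 4.0, "stimulation": 2.0,
    "achievement": 1.0, "power": 3.0,
    "hedonism": 5.0,
    "security": 6.0, "conformity": 3.0, "tradition": 0.0,
    "benevolence": 8.0, "universalism": 2.0,
}
scores = group_scores(example)
print(scores)
```

Averaging is one possible aggregation; summing or taking the maximum would be equally easy to swap in.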

In [3]:
train_context = train.train_model(input_file, n_of_topics, beta_loss, bckgrnd_brown)

Reading data...
Cleaning data...
Extracting tf features for NMF...
Fitting NMF for 'universalism', 'hedonism', 'achievement', 'power', 'self-direction', 'benevolence', 'conformity', 'tradition', 'stimulation', 'security'
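
To give an idea of what fitting an NMF with the Kullback-Leibler loss involves, here is a minimal pure-Python sketch of the standard multiplicative update rules. It is not the code in `libs.ssnmf_train`: the function names, the random initialization, the toy count matrix, and the orientation (documents in rows, words in columns, so `W` is documents x topics and `H` is topics x words) are all assumptions made for illustration.

```python
import math
import random

EPS = 1e-12  # avoids division by zero and log(0)


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    columns = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in A]


def kl_divergence(V, WH):
    """Generalized Kullback-Leibler divergence between V and its approximation WH."""
    total = 0.0
    for v_row, wh_row in zip(V, WH):
        for v, wh in zip(v_row, wh_row):
            if v > 0:
                total += v * math.log(v / (wh + EPS))
            total += wh - v
    return total


def ratio(V, WH):
    """Element-wise V / WH."""
    return [[v / (wh + EPS) for v, wh in zip(v_row, wh_row)]
            for v_row, wh_row in zip(V, WH)]


def nmf_kl(V, n_topics, n_iter=100, seed=0):
    """Factorize V (docs x words) into W (docs x topics) and H (topics x words)."""
    rng = random.Random(seed)
    W = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in V]
    H = [[rng.random() + 0.1 for _ in V[0]] for _ in range(n_topics)]
    history = [kl_divergence(V, matmul(W, H))]
    for _ in range(n_iter):
        # H <- H * (W^T (V / WH)) / (column sums of W)
        Wt = [list(col) for col in zip(*W)]
        numerator = matmul(Wt, ratio(V, matmul(W, H)))
        H = [[h * n / (sum(Wt[k]) + EPS) for h, n in zip(H[k], numerator[k])]
             for k in range(n_topics)]
        # W <- W * ((V / WH) H^T) / (row sums of H)
        Ht = [list(col) for col in zip(*H)]
        numerator = matmul(ratio(V, matmul(W, H)), Ht)
        W = [[w * n / (sum(H[k]) + EPS) for k, (w, n) in enumerate(zip(w_row, n_row))]
             for w_row, n_row in zip(W, numerator)]
        history.append(kl_divergence(V, matmul(W, H)))
    return W, H, history


# Toy term counts: two documents about one theme, two about another.
V = [[3, 2, 0, 0, 1], [4, 3, 0, 1, 0], [0, 0, 5, 4, 1], [0, 1, 4, 5, 0]]
W, H, history = nmf_kl(V, n_topics=2)
print("KL divergence: %.4f -> %.4f" % (history[0], history[-1]))
```

In practice a library implementation would normally be used instead; the sketch only shows that every factor stays non-negative and that the loss goes down as the updates are repeated.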


In the training step, we can show the most important words for each *BHV* by using the **report_training_topics** function. We can change the number of topics (**n_topics**) and the number of most important words (**n_top_words**) shown for each subtopic.
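
Conceptually, listing the most important words of a topic amounts to sorting the vocabulary by the topic's word weights. The following sketch shows that idea under stated assumptions: `top_words`, the five-word vocabulary, and the weights in `H` are hypothetical and do not come from the trained model.

```python
def top_words(topic_weights, vocabulary, n_top_words):
    """Return the n_top_words highest-weighted words of one topic."""
    ranked = sorted(zip(vocabulary, topic_weights), key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in ranked[:n_top_words]]


# Hypothetical vocabulary and topic-word weights (one row per topic).
vocabulary = ["peace", "war", "energy", "ecology", "law"]
H = [
    [0.9, 0.7, 0.0, 0.1, 0.5],
    [0.0, 0.1, 0.8, 0.6, 0.2],
]

for k, weights in enumerate(H):
    print("Topic #%d: %s" % (k, " - ".join(top_words(weights, vocabulary, 3))))
```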

In [4]:
train.report_training_topics(train_context, n_of_top_words, n_of_topics)


Topics in NMF model:
universalism
Topic #0: environmental - state - movement - social - marriage - samesex - party
Topic #1: right - peace - social - war - state - law - equality
Topic #2: energy - specie - ecology - human - use - resource - natural

hedonism
Topic #0: love - pain - orgasm - one - empathy - people - may
Topic #1: one - happiness - pleasure - social - desire - life - anxiety
Topic #2: may - one - experience - also - shame - emotion - pleasure

achievement
Topic #0: social - capital - class - society - labour - work - inequality
Topic #1: work - hour - social - individual - goal - high - management
Topic #2: capital - status - social - human - need - individual - people

power
Topic #0: power - use - experiment - milgram - make - control - process
Topic #1: time - state - wealth - power - collapse - class - also
Topic #2: au

We can show the normalized excitation matrix, ordered by word score for each latent topic, as a dataframe by using the **report_excitation_matrix** function. Scores are multiplied by 1000 to make them more readable.

In [5]:
train.report_excitation_matrix(train_context, n_of_words)

| | universalism (0) - word | universalism (0) - score | universalism (1) - word | universalism (1) - score | universalism (2) - word | universalism (2) - score | benevolence (0) - word | benevolence (0) - score | benevolence (1) - word | benevolence (1) - score | ... | stimulation (1) - word | stimulation (1) - score | stimulation (2) - word | stimulation (2) - score | self-direction (0) - word | self-direction (0) - score | self-direction (1) - word | self-direction (1) - score | self-direction (2) - word | self-direction (2) - score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | environmental | 10.32 | right | 8.15 | energy | 5.63 | law | 8.72 | good | 6.39 | ... | tourism | 35.7 | sport | 29.0 | creativity | 19.22 | innovation | 9.02 | yes | 11.44 |
| 1 | state | 4.86 | peace | 6.9 | specie | 5.55 | truth | 7.72 | evil | 6.27 | ... | travel | 14.48 | travel | 6.92 | play | 8.88 | idea | 6.9 | independence | 9.26 |
| 2 | movement | 4.76 | social | 5.59 | ecology | 5.34 | good | 6.96 | one | 5.64 | ... | million | 8.08 | adventure | 6.65 | creative | 8.49 | unite | 6.09 | invention | 5.55 |
| 3 | social | 4.02 | war | 5.47 | human | 5.17 | ethic | 6.87 | justice | 5.06 | ... | tourist | 7.74 | exploration | 6.6 | intelligence | 4.94 | territory | 5.39 | state | 5.31 |
| 4 | marriage | 3.92 | state | 4.61 | use | 5.02 | theory | 6.58 | pardon | 5.02 | ... | international | 7.65 | use | 5.68 | process | 4.18 | intelligence | 5.29 | bully | 4.98 |
| 5 | samesex | 3.68 | law | 4.07 | resource | 4.57 | forgiveness | 6.58 | lie | 4.86 | ... | country | 7.6 | include | 5.4 | theory | 4.12 | state | 4.93 | positive | 4.71 |
| 6 | party | 3.67 | equality | 3.85 | natural | 4.08 | one | 5.73 | trust | 4.56 | ... | billion | 6.31 | game | 5.21 | new | 4.06 | new | 4.82 | task | 4.65 |
| 7 | green | 3.52 | one | 3.85 | system | 3.84 | natural | 4.98 | individual | 4.38 | ... | world | 6.14 | may | 5.09 | work | 3.95 | group | 4.78 | individual | 4.51 |
| 8 | environment | 3.03 | world | 3.47 | development | 3.61 | may | 3.96 | social | 4.26 | ... | destination | 5.27 | also | 4.95 | also | 3.59 | curiosity | 4.43 | emotion | 4.33 |
| 9 | right | 2.84 | international | 3.3 | study | 3.57 | natural law | 3.64 | moral | 4.13 | ... | unite | 5.2 | explorer | 4.92 | study | 3.5 | music | 4.16 | new | 4.3 |
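
One plausible way to obtain scores like those in the table is to normalize each topic's word weights so that they sum to one, multiply by 1000, and sort in descending order. The sketch below illustrates that reading only. How `report_excitation_matrix` actually normalizes is not stated here, so the sum-to-one rule, the function name `excitation_column`, and the toy weights are assumptions.

```python
def excitation_column(topic_weights, vocabulary, word_count, scale=1000):
    """Return (word, score) pairs for one topic, highest score first.

    Scores are the weights divided by their sum (assumed normalization),
    multiplied by `scale`, and rounded to two decimals.
    """
    total = sum(topic_weights)
    if total == 0:
        return []
    rows = [(word, round(scale * weight / total, 2))
            for word, weight in zip(vocabulary, topic_weights)]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows[:word_count]


# Hypothetical raw weights for a single topic.
vocabulary = ["world", "tourism", "travel"]
weights = [1.0, 6.0, 3.0]
column = excitation_column(weights, vocabulary, word_count=2)
print(column)
```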


By using the **export_excitation_matrix** function, we can export the trained data as an *Excel* file. If the **word_count** parameter is *-1*, the function exports the whole table.

In [17]:
output_file = "train_result_%d_%s.xlsx" % (n_of_topics, bckgrnd_brown)
train.export_excitation_matrix(train_context, output_file, word_count=-1)
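
Two details of the cell above are easy to check in isolation: the file name produced by the `%` formatting, and the convention that `word_count=-1` means "all rows". The helper `select_rows` below is a hypothetical stand-in for that convention, not the library's implementation.

```python
def select_rows(rows, word_count=-1):
    """Return every row when word_count is -1, otherwise the first word_count rows."""
    return list(rows) if word_count == -1 else list(rows[:word_count])


n_of_topics = 3
bckgrnd_brown = False

# %d formats the integer, %s formats the boolean as its string form.
output_file = "train_result_%d_%s.xlsx" % (n_of_topics, bckgrnd_brown)
print(output_file)

rows = ["environmental", "state", "movement", "social"]
print(select_rows(rows, word_count=2))
print(select_rows(rows, word_count=-1))
```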

To prepare for the testing part, we can save the trained data, built with the parameters specified in the training step, to a file.

In [6]:
pretrained_doc_name = "pretrained.p"
train.create_trained_data(train_context, output_file = pretrained_doc_name)
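
The `.p` extension suggests that the trained data is stored with Python's `pickle` module, although the file format is not stated here. Assuming that is the case, saving and restoring a context dictionary could look like the sketch below. The helpers `save_context` and `load_context` and the toy dictionary are hypothetical.

```python
import os
import pickle
import tempfile


def save_context(context, path):
    """Serialize a context dictionary to disk with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(context, handle)


def load_context(path):
    """Load a context dictionary that was written by save_context."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy stand-in with the four keys named in the parameter list above.
toy_context = {
    "nmf_list": [],
    "W_list": [[0.2, 0.8]],
    "tfidf": [[0.0, 1.0]],
    "tfidf_vectorizer": None,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pretrained.p")
    save_context(toy_context, path)
    restored = load_context(path)

print(restored == toy_context)
```

Pickle files can execute arbitrary code when loaded, so only files from a trusted source should be opened this way.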

# Testing Part

## Evaluating Different Documents

Let's apply the trained model to different documents. To do so, we only need to add a file name or a web URL to the **test_doc_names** list.

In [7]:
# Pope Francis TED talk, https://www.ted.com/speakers/pope_francis
test_doc_names = ["pope.txt", "dod.txt", "https://www.nationalgeographic.com/science/space/solar-system/earth/"]
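
Because the list mixes local files and web pages, a loader has to tell the two apart. A simple way to do this, shown here as an assumption about how it could be done rather than how `libs.ssnmf_test` does it, is to check for an HTTP(S) scheme with `urllib.parse`. The helper name `is_url` is hypothetical.

```python
from urllib.parse import urlparse


def is_url(name):
    """Return True when name looks like an http(s) URL rather than a file path."""
    parsed = urlparse(name)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


test_doc_names = [
    "pope.txt",
    "dod.txt",
    "https://www.nationalgeographic.com/science/space/solar-system/earth/",
]

kinds = ["url" if is_url(name) else "file" for name in test_doc_names]
print(kinds)
```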

Then, we can fit the proposed SS-NMF model to the test corpus by using the **prepare_test_docs** function.

In [8]:
test_corpusPP, test_context = test.prepare_test_docs(test_doc_names, pretrained_doc_name, betaloss=beta_loss)

Reading data...
Cleaning data...
Fitting NMF for 'universalism', 'hedonism', 'achievement', 'power', 'self-direction', 'benevolence', 'conformity', 'tradition', 'stimulation', 'security'
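
A common way to reuse a trained NMF on new documents is to keep the learned topic-word matrix fixed and estimate only the topic weights of each new document. Whether `prepare_test_docs` works exactly like this is not stated here, so the sketch below is an illustration under that assumption. The function `infer_topic_weights`, the matrix `H`, and the document vectors are hypothetical.

```python
EPS = 1e-12  # avoids division by zero


def infer_topic_weights(doc_vector, H, n_iter=200):
    """Estimate non-negative topic weights w so that w @ H approximates doc_vector.

    H (topics x words) stays fixed; only w is updated with the multiplicative
    rule for the Kullback-Leibler loss.
    """
    n_topics = len(H)
    w = [1.0] * n_topics
    row_sums = [sum(row) for row in H]
    for _ in range(n_iter):
        approx = [sum(w[k] * H[k][j] for k in range(n_topics))
                  for j in range(len(doc_vector))]
        ratio = [v / (a + EPS) for v, a in zip(doc_vector, approx)]
        w = [w[k] * sum(h * r for h, r in zip(H[k], ratio)) / (row_sums[k] + EPS)
             for k in range(n_topics)]
    return w


# Hypothetical fixed topics: topic 0 uses words 0-1, topic 1 uses words 2-3.
H = [
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
]

pure_doc = infer_topic_weights([2.0, 2.0, 0.0, 0.0], H)
mixed_doc = infer_topic_weights([1.0, 1.0, 3.0, 3.0], H)
print(pure_doc, mixed_doc)
```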


We can show interactive results for the test corpus, including the *topic distribution*, a *radar chart*, and a *table of the most important words and their scores* for each test document, by using the **report_interactive_result** function.

In [9]:
test.report_interactive_result(test_context, test_doc_names, pretrained_doc_name, purity_score = False, word_count = n_of_words, only_doc_words=True)

When **word_count** is *-1*, all words are exported.
When **only_doc_words** is *True*, only the words that occur in the documents are exported.

In [12]:
# If you want proper document names in the output file, change the 'doc_names' list.
test.export_word_scores_excel(test_context, test_doc_names, pretrained_doc_name, filepath = 'ssnmf_words.xlsx', purity_score=False, word_count=-1, only_doc_words=True)
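
The effect of `only_doc_words` can be pictured as a filter over a word-score table. In the sketch below, the function name `words_in_document`, the whitespace tokenization, and the sample sentence are assumptions. The three scores are taken from the universalism columns of the excitation matrix shown earlier.

```python
def words_in_document(word_scores, document_text, only_doc_words=True):
    """Keep only the scored words that occur in the document when the flag is set."""
    if not only_doc_words:
        return dict(word_scores)
    doc_words = set(document_text.lower().split())
    return {word: score for word, score in word_scores.items() if word in doc_words}


word_scores = {"peace": 6.9, "war": 5.47, "energy": 5.63}
document_text = "Peace is more than the absence of war"  # hypothetical sentence

filtered = words_in_document(word_scores, document_text, only_doc_words=True)
unfiltered = words_in_document(word_scores, document_text, only_doc_words=False)
print(filtered)
print(unfiltered)
```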

By using the **export_excel** function, we can export the result as an *Excel* file.

In [18]:
test.export_excel(test_context, test_corpusPP, test_doc_names, output_file = "test_result.xlsx")

| | name | universalism | benevolence | conformity | tradition | security | power | achievement | hedonism | stimulation | self-direction | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pope.txt | 4.040715 | 63.909306 | 40.332618 | 22.782479 | 6.293177 | 8.727138 | 10.311021 | 34.270452 | 23.190005 | 4.368536 | good even good morning sure time regardless ho... |
| 1 | dod.txt | 74.979626 | 8.658796 | 30.110368 | 4.364227 | 86.232317 | 42.547565 | 38.193093 | 2.203972 | 28.723649 | 36.564244 | behalf secretary defense deputy secretary defe... |
| 2 | https://www.nationalgeographic.com/science/spa... | 83.351298 | 6.7519 | 0.029905 | 0.400209 | 44.193277 | 37.288825 | 7.295849 | 5.778209 | 61.33599 | 0.003121 | earth home planet planet solar system know har... |


By using the **export_csv** function, we can also export the result as a *CSV* file.

In [19]:
test.export_csv(test_context, test_corpusPP, test_doc_names, output_file = "test_result.csv")

| | name | universalism | benevolence | conformity | tradition | security | power | achievement | hedonism | stimulation | self-direction | Text |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pope.txt | 4.040715 | 63.909306 | 40.332618 | 22.782479 | 6.293177 | 8.727138 | 10.311021 | 34.270452 | 23.190005 | 4.368536 | good even good morning sure time regardless ho... |
| 1 | dod.txt | 74.979626 | 8.658796 | 30.110368 | 4.364227 | 86.232317 | 42.547565 | 38.193093 | 2.203972 | 28.723649 | 36.564244 | behalf secretary defense deputy secretary defe... |
| 2 | https://www.nationalgeographic.com/science/spa... | 83.351298 | 6.7519 | 0.029905 | 0.400209 | 44.193277 | 37.288825 | 7.295849 | 5.778209 | 61.33599 | 0.003121 | earth home planet planet solar system know har... |
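
Once the scores are available as CSV, they can be post-processed with the standard library alone. The sketch below finds the highest-scoring value for each document. It reads the numbers from the table above out of an in-memory string instead of `test_result.csv`, because the exact column layout of the exported file is an assumption here. The `Text` column is omitted and the function name `dominant_values` is hypothetical.

```python
import csv
import io

# Scores copied from the result table above (Text column left out).
CSV_TEXT = """name,universalism,benevolence,conformity,tradition,security,power,achievement,hedonism,stimulation,self-direction
pope.txt,4.040715,63.909306,40.332618,22.782479,6.293177,8.727138,10.311021,34.270452,23.190005,4.368536
dod.txt,74.979626,8.658796,30.110368,4.364227,86.232317,42.547565,38.193093,2.203972,28.723649,36.564244
https://www.nationalgeographic.com/science/spa...,83.351298,6.7519,0.029905,0.400209,44.193277,37.288825,7.295849,5.778209,61.33599,0.003121
"""


def dominant_values(csv_text):
    """Map each document name to the basic human value with the highest score."""
    result = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row.pop("name")
        scores = {value: float(score) for value, score in row.items()}
        result[name] = max(scores, key=scores.get)
    return result


dominant = dominant_values(CSV_TEXT)
for name, value in dominant.items():
    print("%s -> %s" % (name, value))
```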
