# Project Management Program Based on Topic Model

Projects today typically generate a large volume of documents. This study designed a topic-model-based program for project management. The program applies the Latent Dirichlet Allocation (LDA) algorithm to build the topic model, and uses “prototypical-text based interpretation” (PTBI) together with the pyLDAvis visualisation to select the salient topics, the prototypical paragraphs and the minimum number of texts needed for topic interpretation.

# 1. Prerequisites

Install the libraries below. To launch the Dashboard created in the final step, download the two CSS files from https://github.com/suhao3123/CSS, create a folder named assets in the root of your app directory and place the two files in that folder.
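
The folder layout described above can be checked before launching the dashboard. The sketch below is a minimal illustration and not part of the original notebook: the helper name `ensure_assets`, the file name `example.css` and the temporary directory standing in for the app root are all assumptions.

```python
import glob
import os
import tempfile


def ensure_assets(app_root):
    """Create <app_root>/assets if missing and return the CSS files found in it."""
    assets_dir = os.path.join(app_root, "assets")
    os.makedirs(assets_dir, exist_ok=True)   # no error if the folder already exists
    return sorted(glob.glob(os.path.join(assets_dir, "*.css")))


# Demonstration in a throwaway directory standing in for the app root.
demo_root = tempfile.mkdtemp()
css_before = ensure_assets(demo_root)        # folder is created, no CSS files yet
open(os.path.join(demo_root, "assets", "example.css"), "w").close()  # hypothetical file name
css_after = ensure_assets(demo_root)         # now lists the one CSS file
```

In a real app, `app_root` would be the directory that contains the dashboard script.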

In [1]:
# pip install numpy                      # (install numpy)
# pip install pandas                     # (install pandas)
# pip install PyMuPDF                    # (install PyMuPDF for extracting info from PDF files)
# pip install tika                       # (install tika for extracting paragraphs from PDF files)
# pip install spacy==2.2.0               # (install spacy for lemmatization)
# conda install gensim                   # (install gensim for topic modelling)
# pip install pyLDAvis                   # (install pyLDAvis for topic modelling visualisation)
# conda install -c conda-forge pyldavis  # (if you use Anaconda to install pyLDAvis)
# pip install plotly                     # (install plotly for visualisation)

In [2]:
import pandas as pd
import numpy as np
import re

# glob for extracting the directories of metadata
import glob

# PyMuPDF
import fitz

# tika
import tika               
from tika import parser   

# Gensim
import gensim
import gensim.corpora as corpora
from gensim.utils import simple_preprocess
from gensim.models import CoherenceModel

# spacy for lemmatization
import spacy

# Visualisation
import plotly.express as px
import pyLDAvis
import pyLDAvis.gensim_models
import matplotlib.pyplot as plt
%matplotlib inline

# Enable logging for gensim - optional
import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.ERROR)

import warnings
warnings.filterwarnings("ignore",category=DeprecationWarning)
import os

# 2. Import pdf files, data wrangling and overview

Input the directory of the PDF files you want to analyse. The chunks below extract the text and metadata of the files.

In [3]:
# Extract the paths of the PDF files; make sure the folder name does not contain numbers
# (raw string, so the backslashes in the Windows path are not treated as escape sequences)
pdf_dir = r"D:\LEON\Business Analytics\Study\9. Business Project\Data set\Olympics"
pdf_files = glob.glob("%s/*.pdf" % pdf_dir)
pdf_files[:1]

['D:\\LEON\\Business Analytics\\Study\\9. Business Project\\Data set\\Olympics\\Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf']

In [4]:
# Use PyMuPDF to extract all info of the PDF files (text, title, date, etc)
list_metadata = []
for i in pdf_files:
    with fitz.open(i) as doc:
        info = doc.metadata
        info['file_name'] = os.path.basename(i)
        text = ''
        for page in doc:
            text += page.get_text()   # named getText() in older PyMuPDF releases
        info['Content'] = text       
    list_metadata.append(info)

In [5]:
df = pd.DataFrame(list_metadata)
df['document_id'] = df.index
df = df.drop_duplicates(subset = ['Content'])             # drop duplicate rows
df = df.dropna(subset = ['Content'], how = 'any')         # drop rows whose text content is NaN
df['Word_count'] = df['Content'].str.count(' ') + 1       # approximate word count: number of spaces + 1
df.head(3)

Unnamed: 0,format,title,author,subject,keywords,creator,producer,creationDate,modDate,trapped,encryption,file_name,Content,document_id,Word_count
0,PDF 1.7,,B Lewis,,,Microsoft Word,,D:20210822083603+00'00',D:20210822083603+00'00',,,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,Examination of Witnesses (1-19) \n16 SEPTEMBER...,0,6115
1,PDF 1.7,,B Lewis,,,Microsoft Word,,D:20210822083606+00'00',D:20210822083606+00'00',,,Examination_of_Witnesses_Sept_2003_-_Q20-39.pdf,Examination of Witnesses (20-39) \n16 SEPTEMBE...,1,4002
2,PDF 1.7,,B Lewis,,,Microsoft Word,,D:20210822083609+00'00',D:20210822083609+00'00',,,Examination_of_Witnesses_Sept_2003_-_Q40-44.pdf,Examination of Witnesses (40-44) \n16 SEPTEMBE...,2,1007


In [6]:
# check if there are documents with few words
min_word_count= 10                                               # set the threshold of the minimum word count of each document 
min_word_count_filter = df['Word_count'] <= min_word_count
df_few_words = df[min_word_count_filter][['file_name', 'Content']]
df_few_words

Unnamed: 0,file_name,Content


In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 169 entries, 0 to 168
Data columns (total 15 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   format        169 non-null    object
 1   title         169 non-null    object
 2   author        169 non-null    object
 3   subject       169 non-null    object
 4   keywords      169 non-null    object
 5   creator       169 non-null    object
 6   producer      169 non-null    object
 7   creationDate  169 non-null    object
 8   modDate       169 non-null    object
 9   trapped       169 non-null    object
 10  encryption    3 non-null      object
 11  file_name     169 non-null    object
 12  Content       169 non-null    object
 13  document_id   169 non-null    int64 
 14  Word_count    169 non-null    int64 
dtypes: int64(2), object(13)
memory usage: 21.1+ KB


In [8]:
# Word count
df['Word_count'].sum()

1054090

# 3. Natural language processing

### 3.1. Tokenisation
The texts extracted above will be split into individual words.

In [9]:
data = df.Content.values.tolist()

In [10]:
def sent_to_words(sentences):
    for sentence in sentences:
        yield(gensim.utils.simple_preprocess(str(sentence).encode('utf-8'), deacc=True))  # deacc=True strips accent marks; punctuation is dropped by the tokenisation itself

data_words= list(sent_to_words(data))
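
To make clear what the tokenisation step produces, the sketch below approximates it with the standard library only. It is not gensim's implementation: the function name `rough_preprocess` is hypothetical, and the length limits (2 to 15 characters) are assumed to match the defaults of `simple_preprocess`.

```python
import re
import unicodedata


def rough_preprocess(text, min_len=2, max_len=15):
    """Lowercase, strip accents, keep alphabetic tokens within the length limits."""
    text = unicodedata.normalize("NFD", text.lower())
    # Drop the combining marks (accents) left over after decomposition.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    tokens = re.findall(r"[^\W\d]+", text)   # runs of letters/underscores, no digits
    return [t for t in tokens if min_len <= len(t) <= max_len]


tokens = rough_preprocess("Café visits, 2012: the U.K. bid!")
```

Digits, punctuation and single-letter fragments such as "u" and "k" disappear, which is why short abbreviations do not survive this step.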

### 3.2. Processing words
First, the bigrams (phrases containing two words) and trigrams (phrases containing three words) are formed, then the words are lemmatised (different forms of a word are reduced to a single form). Next, a threshold allows users to remove short words. Finally, the stop words are removed; users can add more stop words manually.

In [11]:
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100) # higher threshold fewer phrases.
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
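
The effect of `min_count` and `threshold` can be illustrated with a small pure-Python sketch. It assumes the default scoring formula documented for gensim's `Phrases`, score = (count(a, b) - min_count) * vocabulary size / (count(a) * count(b)), and that a pair is merged when its score exceeds the threshold; the function name and all counts below are made up for illustration.

```python
def phrase_score(count_a, count_b, count_ab, vocab_size, min_count):
    """Score a candidate bigram; higher means the words co-occur more than chance."""
    return (count_ab - min_count) * vocab_size / (count_a * count_b)


# Hypothetical counts: "olympic" 400 times, "park" 300 times, together 250 times,
# in a vocabulary of 50,000 distinct tokens.
score = phrase_score(count_a=400, count_b=300, count_ab=250,
                     vocab_size=50_000, min_count=5)
is_phrase = score > 100   # threshold=100, as in the cell above
```

Raising the threshold demands a stronger association between the two words, so fewer pairs are merged into phrases.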

In [12]:
# import the stop_words from gensim
from gensim.parsing.preprocessing import remove_stopwords, STOPWORDS
stop_words = [i for i in STOPWORDS]

# add more stop words after analysing the key words of each topic from pyLDAvis in section 5.2. Topic visualisation 
new_stop_words = ['go', 'would', 'make', 'think', 'take', 'say', 'need', 'want', 'thing', 'have', 'lot', 'people', 'year','good','great','able','come','look','right',
                   'sure', 'day', 'moment', 'work','time', 'know', 'use', 'try', 'happen', 'ask', 'new', 'way', 'jonathan_stephen', 'david_higgin', 'dame_helen_ghosh','end']              
stop_words.extend(new_stop_words)

In [13]:
# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stop_words(texts):
    return [[word for word in doc if word not in stop_words] for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    """https://spacy.io/api/annotation"""
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent)) 
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

In [14]:
# Form Trigrams
data_words_trigrams = make_trigrams(data_words)

# Initialize the spaCy English model, disabling the parser and NER components (for efficiency)
# python3 -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])

# increase the maximum length of text that spaCy will process
nlp.max_length = 13000000

# Do lemmatization keeping only noun, adj, verb
data_lemmatized1 = lemmatization(data_words_trigrams, allowed_postags=['NOUN', 'ADJ', 'VERB'])

# Set a threshold for removing the words with length less than the threshold
minimum_len = 3 
data_lemmatized2 = []
for i in data_lemmatized1:
    new_element = [x for x in i if len(x) >= minimum_len]
    data_lemmatized2.append(new_element)

# remove stop words
data_lemmatized = remove_stop_words(data_lemmatized2)

### 3.3. Dictionary and Corpus
The processed words are used to generate the Dictionary and the Corpus on which the topic model is built. The Dictionary assigns an ID (0, 1, 2, etc.) to each word; the Corpus lists, for each document, the (word ID, word frequency) pairs of the words it contains. Two parameters, set in the cell below, filter out very rare and very common tokens.
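
The two structures can be pictured with a pure-Python sketch. It only mirrors the idea described above; the helper names are hypothetical, and gensim's `Dictionary` may assign IDs in a different order.

```python
from collections import Counter


def build_dictionary(docs):
    """Assign an integer ID to each distinct token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def doc_to_bow(doc, token2id):
    """Return sorted (token ID, count) pairs for the tokens known to the dictionary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())


toy_docs = [["venue", "budget", "venue"], ["budget", "legacy"]]
toy_dictionary = build_dictionary(toy_docs)
toy_corpus = [doc_to_bow(doc, toy_dictionary) for doc in toy_docs]
```

Here the first toy document becomes `[(0, 2), (1, 1)]`: token 0 ("venue") occurs twice and token 1 ("budget") once.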

In [15]:
# Create Dictionary, set the parameters to filter out tokens in the dictionary by their document frequency
no_below = 5             # remove the tokens that appear in fewer than no_below documents (absolute number)
no_above = 0.85          # remove the tokens that appear in more than no_above of the documents (fraction of the total corpus size)
id2word = corpora.Dictionary(data_lemmatized)
id2word.filter_extremes(no_below = no_below, no_above = no_above)

# print the number of reserved unique tokens and word count after removal of high and low frequency words
print('After removal of high and low frequency words - Number of unique tokens: %d, %d' % (len(id2word),id2word.num_pos))

After removal of high and low frequency words - Number of unique tokens: 3396, 332393
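
The frequency filter can be sketched in plain Python as follows. This is an illustration under stated assumptions, not gensim's code: it assumes a token is kept when its document frequency lies between `no_below` and `no_above` times the number of documents, and it ignores gensim's additional `keep_n` cap; the function name and the frequencies are hypothetical.

```python
def keep_tokens(doc_freq, num_docs, no_below, no_above):
    """Keep tokens found in at least no_below and at most no_above * num_docs documents."""
    return {
        token for token, n_docs_with_token in doc_freq.items()
        if no_below <= n_docs_with_token <= no_above * num_docs
    }


# Hypothetical document frequencies in a corpus of 100 documents.
doc_freq = {"olympic": 97, "legacy": 40, "velodrome": 3}
kept = keep_tokens(doc_freq, num_docs=100, no_below=5, no_above=0.85)
```

With these values "olympic" is dropped as too common (97 > 85) and "velodrome" as too rare (3 < 5).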


In [16]:
# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]

# 4. LDA Model
Now we can input the Dictionary and Corpus to build the LDA model, a basic and widely used topic model. We might need to tune the parameters and hyperparameters to get a higher coherence score, a measure of the interpretability of the extracted topics.

### 4.1. Building LDA Model, Parameter/Hyperparameter tuning
First we set the training parameters and hyperparameters.

In [17]:
# set training parameters and hyperparameters
k = 20                  # number of topics
passes = 20             # number of training iterations through the corpus
iterations = 100        # maximum number of iterations through the corpus, limiting this parameter might cause some documents not to converge in time
alpha = 50.0/k          # document-topic density, a high α tends to return more salient topics in each document
eta = 0.01              # prior probabilities assigned to each term
random_state = 12345    # random seed for reproducibility
minimum_probability = 0 # topics with a probability lower than this threshold will be filtered out
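
The role of alpha as a document-topic density can be illustrated by sampling from a symmetric Dirichlet distribution with the standard library. This is a sketch of the prior only, not of LDA training; the function names, the sample size and the comparison value 0.1 are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha) over k topics."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_top_share(alpha, k=10, n=2000, seed=12345):
    """Average probability of the single largest topic across n sampled documents."""
    rng = random.Random(seed)
    return sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n)) / n


top_share_sparse = mean_top_share(alpha=0.1)   # small alpha: one topic tends to dominate
top_share_dense = mean_top_share(alpha=2.5)    # alpha = 50.0/20, as set above
```

With a large alpha the largest topic takes a much smaller share on average, meaning that each document is spread over more topics.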

Now we need to plot the coherence score against k to identify the optimal k, where the coherence score reaches its highest point. Because running this is quite time-consuming, I commented out some chunks below and simply set k to 10 based on an analysis of the result. Users who want to fit the model to another corpus can remove the hashes to reactivate the chunks and analyse the coherence scores against k.
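
Once the coherence scores are available, choosing k is a simple arg-max. The sketch below uses placeholder scores that are not results from this corpus; the function name is hypothetical.

```python
def best_num_topics(scores):
    """Return the k with the highest coherence; ties go to the smaller k."""
    return max(sorted(scores), key=lambda num_topics: scores[num_topics])


# Placeholder scores for illustration only, not results from this corpus.
example_scores = {5: 0.30, 10: 0.50, 15: 0.45, 20: 0.40}
chosen_k = best_num_topics(example_scores)
```

In practice the plot is still worth inspecting, because a slightly lower score at a smaller k can give topics that are easier to interpret.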

In [18]:
#start=1; limit=21; step=1 # set the parameters to generate a sequence of k values starting with "start" and ending before "limit" in steps of "step"
#coherence_values = []
#model_list = []
#for i in range(start,limit,step):
    #model = gensim.models.LdaModel(corpus = corpus,id2word = id2word,alpha = alpha,eta = eta,iterations = iterations,num_topics = i,passes = passes,random_state = 12345,minimum_probability = minimum_probability)
    #model_list.append(model)
    #coherencemodel = CoherenceModel(model=model, texts=texts, dictionary=id2word, coherence='c_v')
    #coherence_values.append(coherencemodel.get_coherence())

In [19]:
#list_num_topics = [i for i in range(start, limit, step)]
#df_coherence1 = pd.DataFrame({'Number_of_Topics': list_num_topics, 'Coherence_Score': coherence_values})
#df_coherence1.to_pickle('./df_coherence1.pkl') #save the result to disk
#df_coherence = pd.read_pickle('./df_coherence1.pkl') #load the result from disk

In [20]:
#fig1 = px.line(df_coherence, x = 'Number_of_Topics', y = "Coherence_Score", title = 'Coherence scores against number of topics')
#fig1.update_layout(autosize=False, width=1000, height=400)
#fig1.update_traces(mode = "lines + markers")
#fig1.show()

In [35]:
# set the number of topics that gives the highest coherence score
# (alpha keeps the value computed in In [17], i.e. 50.0/20 = 2.5)
k = 10
lda_model = gensim.models.LdaModel(
    corpus = corpus,
    id2word = id2word,
    alpha = alpha,
    eta = eta,
    iterations = iterations,
    num_topics = k,
    passes = passes,
    random_state = 12345,
    minimum_probability = minimum_probability)

In [36]:
# print the coherence of the LDA model
coherencemodel2 = CoherenceModel(model=lda_model, texts=texts, dictionary=id2word, coherence='c_v')
coherence_score = coherencemodel2.get_coherence()
coherence_score

0.4364017884744861

### 4.2. Topic distribution of documents

In [37]:
# create the function for converting a list of tuples into a dictionary
def Convert(tup, di):
    di = dict(tup)
    return di

In [38]:
# topic distribution of documents
list_topic = []
dictionary_topic = {}
for d in texts:
    bow = id2word.doc2bow(d)
    belong = lda_model[bow]                        # generate a list of tuples of topic distribution of a document
    belong_dic = Convert(belong, dictionary_topic) # convert the list of tuples into a dictionary
    list_topic.append(belong_dic)           
                      
df_topic_distribution = pd.DataFrame(list_topic)   # convert the list of dictionaries into a dataframe

# rename the topic IDs so that they match the topic IDs in pyLDAvis
original_topic_id = list(df_topic_distribution.columns)
new_topic_id = [x + 1 for x in original_topic_id]
df_topic_distribution = df_topic_distribution.rename(columns = dict(zip(original_topic_id, new_topic_id)))
df_topic = pd.merge(df, df_topic_distribution, how = 'left', left_index=True, right_index=True) # merge with info of documents
df_topic.drop(['title','format','creator', 'producer', 'keywords', 'trapped', 'encryption','subject', 'modDate'], axis = 1)

Unnamed: 0,author,creationDate,file_name,Content,document_id,Word_count,1,2,3,4,5,6,7,8,9,10
0,B Lewis,D:20210822083603+00'00',Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,Examination of Witnesses (1-19) \n16 SEPTEMBER...,0,6115,0.022669,0.012241,0.008059,0.027981,0.009974,0.008838,0.021381,0.346660,0.515521,0.026676
1,B Lewis,D:20210822083606+00'00',Examination_of_Witnesses_Sept_2003_-_Q20-39.pdf,Examination of Witnesses (20-39) \n16 SEPTEMBE...,1,4002,0.054689,0.031731,0.008668,0.012626,0.032229,0.007006,0.012481,0.306131,0.488593,0.045848
2,B Lewis,D:20210822083609+00'00',Examination_of_Witnesses_Sept_2003_-_Q40-44.pdf,Examination of Witnesses (40-44) \n16 SEPTEMBE...,2,1007,0.058003,0.024969,0.021409,0.042886,0.040421,0.022138,0.030234,0.301232,0.424574,0.034133
3,Bronwen Lewis,D:20210822084116+00'00',Further_supplementary_memorandum_submitted_by_...,Further supplementary memorandum submitted by ...,3,431,0.040255,0.023550,0.608594,0.029913,0.025182,0.093307,0.076155,0.033556,0.037462,0.032028
4,Bronwen Lewis,D:20210822083921+00'00',Further_Supplementary_Memorandum_submitted_by_...,Further supplementary memorandum submitted by ...,4,288,0.240306,0.092834,0.056631,0.042153,0.054051,0.064404,0.096172,0.192395,0.106136,0.054918
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
164,Bronwen Lewis,D:20210822084528+00'00',Written_evidence_submitted_by_UK_Sport_-_Jan_2...,Written evidence submitted by UK Sport \n \n ...,164,3089,0.161374,0.746055,0.008360,0.015019,0.013721,0.006636,0.007021,0.007741,0.023814,0.010259
165,Bronwen Lewis,D:20210822084531+00'00',Written_evidence_submitted_by_Vision_2020_UK_-...,Written evidence submitted by Vision 2020 UK ...,165,2284,0.010344,0.904470,0.010520,0.008543,0.008640,0.011762,0.010805,0.012390,0.012407,0.010119
166,Bronwen Lewis,D:20210822084535+00'00',Written_evidence_submitted_by_VisitBritain_-_J...,Written evidence submitted by VisitBritain \n...,166,2372,0.008408,0.065365,0.009910,0.013444,0.185897,0.009476,0.018929,0.013965,0.014569,0.660036
167,Bronwen Lewis,D:20210822084543+00'00',Written_evidence_submitted_by_Womens_Sport_and...,Written evidence submitted by the Women's Spor...,167,1966,0.039542,0.632806,0.014172,0.011664,0.121685,0.009411,0.039502,0.044871,0.022595,0.063752


# 5. Topic interpretation tools

I first identify the salient topics defined by PTBI proposed by Marchetti and Puranam (2020), then combine both the topic visualisation of pyLDAvis and the prototypical texts defined by PTBI to facilitate the topic interpretation.

## 5.1. Salient topics for interpretation
PTBI assumes that topics with little salience are not worthy of interpretation. To extract the most salient topics, we compute, for each topic, the fraction of documents whose probability of belonging to that topic is greater than 1/K (Marchetti and Puranam, 2020, p. 14); I define this fraction as the “salience” of the topic.

The scree plot below shows that, when the topics are sorted by salience in descending order, the salience tends to level off at topic 6. As a result, we can select the topics ahead of topic 6 as the salient topics for interpretation.
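
The salience definition can be worked through on a toy example in plain Python. The function name and the toy probabilities are illustrative assumptions; the notebook itself computes the same quantity with pandas in the next cell.

```python
def topic_salience(doc_topic_probs):
    """Fraction of documents whose probability for each topic exceeds 1/K."""
    num_topics = len(doc_topic_probs[0])
    cutoff = 1 / num_topics
    num_docs = len(doc_topic_probs)
    return [
        sum(1 for row in doc_topic_probs if row[t] > cutoff) / num_docs
        for t in range(num_topics)
    ]


# Toy distribution: 4 documents over K = 4 topics (each row sums to 1).
toy_probs = [
    [0.70, 0.10, 0.10, 0.10],
    [0.40, 0.40, 0.10, 0.10],
    [0.30, 0.30, 0.30, 0.10],
    [0.55, 0.15, 0.05, 0.25],
]
salience = topic_salience(toy_probs)
```

The first topic exceeds the 1/K = 0.25 cutoff in every document, so its salience is 1.0; the last topic never strictly exceeds it, so its salience is 0.0.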

In [39]:
# compute salience: for each topic, the fraction of documents whose probability of belonging to the topic is greater than 1/K
list_percent_above = []
for i in df_topic_distribution:
    num_above = df_topic_distribution[i][df_topic_distribution[i] > 1/k].count()
    percent_above =  num_above/len(df_topic_distribution)
    list_percent_above.append(percent_above)
    
df_salient_topic = pd.DataFrame({'topic_ID':  [str(i) for i in new_topic_id], 'salience': list_percent_above}).sort_values(
    by = 'salience', ascending = False)

In [40]:
fig_L1 = px.line(df_salient_topic, x = 'topic_ID', y = 'salience', title="Scree plot of salience of topics")
fig_L1.update_layout(autosize=False, width=800, height=400)
fig_L1.update_traces(mode = "lines + markers")
fig_L1.show()

## 5.2. Topic visualisation

Check the words of each topic. If there are common words with high overall frequency such as "think", "want" or "make", return to the cell that begins with the comment "import the stop_words from gensim" in section 3.2, add these words to the list of stop words and rerun the cells that follow to remove them.

In [41]:
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, corpus, id2word, sort_topics = False )
pyLDAvis.save_html(vis, './assets/lda.html') # save the result to disk
vis

## 5.3. Prototypical paragraphs
The prototypical paragraphs, i.e. the paragraphs with a high probability of belonging to a topic, can be used to assist topic interpretation. This section classifies the paragraphs into topics and provides users with 4 types of filters to select the prototypical paragraphs: the N most prototypical paragraphs overall, the N most prototypical paragraphs for which the belong() function is greater than the threshold L, the N most prototypical paragraphs of each topic, and the N most prototypical paragraphs of a specific topic.
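
The four filters can be sketched on a toy list of paragraphs in plain Python. The function names and the toy data are hypothetical, and the probability of a paragraph for a topic stands in for the belong() function.

```python
def top_overall(paragraphs, n):
    """N paragraphs with the highest single-topic probability."""
    return sorted(paragraphs, key=lambda p: max(p["probs"].values()), reverse=True)[:n]


def top_above_threshold(paragraphs, n, threshold):
    """As top_overall, restricted to paragraphs whose best probability exceeds threshold."""
    eligible = [p for p in paragraphs if max(p["probs"].values()) > threshold]
    return top_overall(eligible, n)


def top_for_topic(paragraphs, n, topic):
    """N paragraphs with the highest probability for one given topic."""
    return sorted(paragraphs, key=lambda p: p["probs"][topic], reverse=True)[:n]


def top_per_topic(paragraphs, n):
    """Apply top_for_topic to every topic and collect the results in a dict."""
    topics = sorted(paragraphs[0]["probs"])
    return {t: top_for_topic(paragraphs, n, t) for t in topics}


toy = [
    {"id": "a", "probs": {1: 0.7, 2: 0.3}},
    {"id": "b", "probs": {1: 0.4, 2: 0.6}},
    {"id": "c", "probs": {1: 0.5, 2: 0.5}},
]
ids_overall = [p["id"] for p in top_overall(toy, 2)]
ids_above = [p["id"] for p in top_above_threshold(toy, 2, threshold=0.65)]
ids_topic2 = [p["id"] for p in top_for_topic(toy, 1, topic=2)]
ids_per_topic = {t: [p["id"] for p in ps] for t, ps in top_per_topic(toy, 1).items()}
```

Paragraph "c" is never selected here because it is split evenly between the two topics, which is exactly the kind of text that is not prototypical.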

### 5.3.1.  Classify the paragraphs based on the trained model

##### Extract paragraphs from documents

In [42]:
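# define the function for splitting texts into paragraphs: split after sentence-ending punctuation or a hyphen
# followed by a line break, at a blank line preceded by two spaces, or at a blank line followed by a digit
def para_split(i):
    j = parser.from_file(i)
    m = j['content']
    return re.split('[?.!-]\n|[?.!-] \n|  \n\n|\n\n[0-9]', m)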
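
The behaviour of the delimiter pattern can be checked on a small string without tika. The sample text below is made up; the pattern is the one used in para_split above, and the clean-up mirrors the stripping of empty elements done in the next cell.

```python
import re

# Same pattern as in para_split above.
PARA_DELIMITER = '[?.!-]\n|[?.!-] \n|  \n\n|\n\n[0-9]'

sample = ("First paragraph ends here.\n"
          "Second one asks why? \n"
          "Third line\n\n"
          "4 Fourth numbered item")
pieces = [p.strip() for p in re.split(PARA_DELIMITER, sample) if p.strip()]
```

Note that the matched delimiter is consumed: the closing punctuation of each paragraph and the leading digit of a numbered item ("4" here) do not appear in the resulting pieces.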

In [43]:
list_paragraphs = []
list_para_id = []
for i in pdf_files:
    para = para_split(i)
    para = [w.replace('\n', '') for w in para]
    para = [x.strip() for x in para if x.strip()] # remove empty elements
    para_id = [x for x in range(len(para))] 
    list_paragraphs.append(para)
    list_para_id.append(para_id)

In [44]:
df_para1 = df.copy()
df_para1['paragraphs'] = list_paragraphs
df_para1['para_id'] = list_para_id
df_para2 = df_para1.apply(pd.Series.explode)
df_para3 = df_para2.reset_index()
df_para4 = df_para3[['creationDate', 'document_id', 'file_name', 'para_id', 'paragraphs']]
len(df_para4) # number of paragraphs extracted

21640

In [45]:
# set a filter to filter out the paragraphs with few words
n_word_count = 10                                                        # set the threshold of word count
para_word_count = df_para4['paragraphs'].str.split().str.len()           # word count of each paragraph
df_para = df_para4[(para_word_count>=n_word_count)].reset_index()        # select the paragraphs with word count not less than the threshold
df_para

Unnamed: 0,index,creationDate,document_id,file_name,para_id,paragraphs
0,2,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,2,"MS BARBARA CASSANI Q1 Chairman: Good morning, ..."
1,3,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,3,Ms Cassani: Thank you very much. Thank you ver...
2,4,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,4,8 months I shall be able to meet frequently wi...
3,5,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,5,The first thing I should like to say is that I...
4,6,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,6,Really the backdrop is that I believe in the G...
...,...,...,...,...,...,...
17709,21631,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,37,7.3 When the impact of Olympics and Paralympi...
17710,21633,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,39,11 2007-08 School Sport Survey. 12 As ...
17711,21634,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,40,13 Gold Young Ambassadors work across School...
17712,21635,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,41,14 From national data supplied by Department...


##### Process the paragraphs

In [46]:
# tokenization
data2 = df_para.paragraphs.values.tolist()
data_words2 = list(sent_to_words(data2))

In [47]:
# Form Trigrams
data_words_trigrams2 = make_trigrams(data_words2)

# Do lemmatization keeping only noun, adj, vb
data_lemmatized2 = lemmatization(data_words_trigrams2, allowed_postags=['NOUN', 'ADJ', 'VERB'])

# apply the same minimum word length as before to remove the words shorter than the threshold
data_lemmatized2_2 = []
for i in data_lemmatized2:
    new_element = [x for x in i if len(x) >= minimum_len]
    data_lemmatized2_2.append(new_element)
    
# Remove Stop Words
data_lemmatized2_1 = remove_stop_words(data_lemmatized2_2)

##### Classify the paragraphs based on the extracted topics

In [48]:
# belong function: classify topics of paragraphs, it might take a long time on a large corpus
list_topic_para = []
dictionary_topic_para = {}
for d in data_lemmatized2_1:
    bow = id2word.doc2bow(d)
    belong = lda_model[bow]
    doc_dic = Convert(belong, dictionary_topic_para)
    list_topic_para.append(doc_dic)

# build the dataframe once, after the loop, instead of rebuilding it at every iteration
df_topic_para = pd.DataFrame(list_topic_para)

In [49]:
# rename the topic IDs so that they match the topic IDs in pyLDAvis
df_topic_para = df_topic_para.rename(columns = dict(zip(original_topic_id, new_topic_id)))

# topic distribution of paragraphs
df_topic_para1_1 = pd.merge(df_para, df_topic_para, how = 'left', left_index=True, right_index=True)
df_topic_para1_1

# save the result to disk
df_topic_para1_1.to_pickle('./df_topic_para_Olympics.pkl')
# load the result from disk
df_topic_para1 = pd.read_pickle('./df_topic_para_Olympics.pkl') 

In [78]:
# drop the paragraphs with high frequency but meaningless for interpretation, based on the extraction of prototypical paragraphs below
list_remove_para = [7622, 12966]                                # input the index of the paragraphs you want to drop
df_topic_para2 = df_topic_para1.copy().drop(list_remove_para)
df_topic_para2.to_pickle('./df_topic_para_Olympics2.pkl') # save the result to disk

### 5.3.2. N most prototypical paragraphs overall

In [79]:
df_topic_para2

Unnamed: 0,index,creationDate,document_id,file_name,para_id,paragraphs,1,2,3,4,5,6,7,8,9,10
0,2,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,2,"MS BARBARA CASSANI Q1 Chairman: Good morning, ...",0.073004,0.055838,0.059690,0.053839,0.070040,0.055271,0.107584,0.187902,0.265389,0.071442
1,3,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,3,Ms Cassani: Thank you very much. Thank you ver...,0.088119,0.080434,0.079274,0.086185,0.086952,0.087036,0.094380,0.166091,0.143715,0.087814
2,4,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,4,8 months I shall be able to meet frequently wi...,0.091401,0.090485,0.108595,0.103616,0.084920,0.107042,0.094514,0.127285,0.099105,0.093036
3,5,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,5,The first thing I should like to say is that I...,0.051126,0.069764,0.030671,0.064049,0.045791,0.033653,0.045747,0.236413,0.367553,0.055234
4,6,D:20210822083603+00'00',0,Examination_of_Witnesses_Sept_2003_-_Q1-19.pdf,6,Really the backdrop is that I believe in the G...,0.103006,0.104074,0.060459,0.075865,0.061256,0.052142,0.069883,0.095555,0.273615,0.104144
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
17709,21631,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,37,7.3 When the impact of Olympics and Paralympi...,0.080292,0.341630,0.067202,0.065383,0.097171,0.081064,0.063351,0.058232,0.056719,0.088957
17710,21633,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,39,11 2007-08 School Sport Survey. 12 As ...,0.108606,0.201046,0.084207,0.090331,0.086395,0.067951,0.077790,0.094008,0.087868,0.101798
17711,21634,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,40,13 Gold Young Ambassadors work across School...,0.170898,0.174138,0.078430,0.083826,0.083533,0.076720,0.081847,0.083701,0.078578,0.088329
17712,21635,D:20210822084546+00'00',168,Written_evidence_submitted_by_Youth_Sport_Trus...,41,14 From national data supplied by Department...,0.095260,0.140983,0.105393,0.088466,0.100969,0.098432,0.099462,0.095923,0.080132,0.094981


In [80]:
# N most prototypical paragraphs overall
df_topic_para2_n = df_topic_para2.copy()
topic_columns = df_topic_para2_n.columns[6:]                                          # topic columns only, selected before the helper columns are added
df_topic_para2_n['highest_p'] = df_topic_para2_n[topic_columns].max(axis = 1)         # get the highest probability among the topic distribution of each paragraph
df_topic_para2_n['salient_topic'] = df_topic_para2_n[topic_columns].idxmax(axis = 1)  # get the corresponding topic id
df_topic_para2_n = df_topic_para2_n[['index','file_name','salient_topic','paragraphs','highest_p']]
df_topic_para2_n.columns = ['Index','file','topic', 'paragraph','probability']

In [81]:
N1 = 5   # Set N to get the N most prototypical paragraphs overall
df_topic_para2_n.nlargest(N1,['probability']).style.set_properties(subset = ['paragraph'], **{'width':'1000px', 'length': '50px'})

Unnamed: 0,Index,file,topic,paragraph,probability
17552,21447,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,2,"4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level",0.676578
17551,21446,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,2,"3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced ""being a winner"" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools",0.627952
3329,3935,NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf,6,"6 PREPARATIONS FOR THE LONDON 2012 OLyMPIC AND PARALyMPIC GAMES: PROGRESS REPORT JuNE 2008 10 The start and completion dates for the construction of the main venue and infrastructure projects delivered by the Olympic Delivery Authority at the end of March 2008 compared with the milestones in the November 2007 Programme Baseline ReportProjectEnabling Works (site preparation) Power Lines under Grounding (switchover only) Structures, Bridges and Highways utilities Main Stadium Aquatics Centre VeloparkHandball/Indoor Sports ArenaBasketballInternational Broadcast Centre/ Main Press CentreOlympic Village Eton Manor (training facilities and Paralympic events) Broxbourne (white water canoeing) Eton Dorney (rowing) Weymouth and Portland (sailing)construction start date November 2007 March 2008 Change in programme Forecast start date baseline (months)October 2006 October 2006 0 July 2008 July 2008 0 April 2008 April 2008 0 January 2008 January 2008 0 July 2008 May 2008 –21 September 2008 September 2008 0 March 2009 March 2009 0August 2009 June 2009 –2July 2009 November 2009 4May 2009 March 2009 –2 June 2008 May 2008 –1 March 2010 January 2010 –2 August 2008 May 2009 9 March 2009 January 2009 –2 May 2008 January 2008 –4construction end date November 2007 March 2008 Change in programme Forecast end date baseline (months)September 2009 September 2009 0 September 2008 November 2008 2 December 2011 December 2011 0 December 2011 August 2011 –4 Construction Construction end date end dateFebruary 2011 April 2011 2Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJune 2011 June 2011 0Construction Construction end date end dateApril 2011 August 2011 4Completion date Completion date for construction for construction and initial overlay and initial overlay for test events for test eventsJuly 2011 August 2011 1April 2011 
February 2011 –2April 2011 March 2011 –1April 2011 April 2011 0June 2011 July 2011 1 December 2011 December 2011 0 February 2012 April 2011 –10 June 2010 October 2010 4 April 2010 July 2009 –9 February 2009 January 2009 –1Source: National Audit Office examination of actual and forecast progress against the November 2007 Programme BaselineNOTE",0.623241
454,520,Jan_2003_-_Qs_200-220.pdf,9,"The Committee suspended from 4.09 pm to 4.23 pm for a division in the House Alan Keen 202. I did not get to the end of the question at the beginning but the point I am making is that because we have to have a village and all the events have to be in that area it adds costs to hosting the Olympics, I reckon at least half a billion and probably a billion. If we could spread them round the country—and I went to Japan for the World Cup and the atmosphere was brilliant. We went to different places—more people could get to see it. If we could do that with the Olympics, the point I am really asking you is that it is difficult for the Government. The Minister and the Secretary of State are going to see the President of the IOC on Friday. It will not do our bid any good if they go there telling them how they should organise the Olympic Games in the future. I am really asking you as the main channellers of funding in sport in this country, will you make these representations that the Olympics, just for the sake of having 18,000 athletes in one village, which is very nice, although it is not so nice for those whose event comes on the last day and they want a party—we could save somewhere between half a billion and a billion pounds by using facilities we have got around the country now. The athletics could be at Wembley as they were supposed to be. The football could be at the main stadium and spread around the country as it is going to be in fact. What I am saying is that instead of having the athletes all together in one village for the three weeks of the Olympics, we could put a party on for them and they could stay for a week after the Olympics when they could all get drunk if that is what they do. I think somebody needs to go to the IOC and put this point to them. We have been taking evidence from people in the last couple of days and there are tremendous difficulties. 
There would hardly be a difficulty if we could use stadia around the country and we did not have to have the village. It is the village that causes all the problems that we are facing now",0.621515
3335,3942,NAO_Preparations_for_the_Olympics_-_Progress_report_-_June_2008.pdf,6,"Construction of the Main Stadium and its readiness for test events is critical to the Authority’s delivery programme. At the time of the November 2007 Programme Baseline Report, which was before contract signature, construction of the Main Stadium was scheduled to be completed in February 2011, to be followed by LOCOG overlay works for test events which were to be completed by June 2011. To date good progress has been made on site preparation, which has allowed construction to start in May 2008, two months earlier than planned.1 As a result of contract negotiations the Authority has agreed a longer construction period than anticipated (35 months rather than 32). The Authority has, however, built early access dates for LOCOG overlay works into the contract so overlay can be carried out in parallel with the construction work. The forecast date for readiness for test events remains at June 2011",0.619893


### 5.3.3.  N most prototypical paragraphs where the belong() function is greater than the threshold L
I followed the method for extracting prototypical texts suggested by PTBI (Marchetti and Puranam, 2020, p. 14). PTBI aims not only to extract prototypical documents to improve interpretability, but also to find the minimum number of prototypical documents needed for topic interpretation. The algorithm is as follows:
1. Define a threshold L (L ∈ [0,1]). For instance, set L to 0.5.
2. For each topic, select the documents whose probability of belonging to the topic is not less than L (0.5).
3. For each topic, check whether the number of selected documents is not less than 1/L. For instance, if L = 0.5, each topic needs at least 2 documents for interpretation. This check reduces the risk that a few documents show a high proportion of a topic merely by chance.
4. Compute the percentage of interpretable topics, i.e. the topics that pass the check in step 3.
5. Change L, repeat the steps above, and find the optimal L at which the percentage of interpretable topics is highest.
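
The steps above can be sketched on a small toy table. This is only an illustrative sketch, not the notebook's own implementation: the data frame `toy`, the helper `share_interpretable` and all probabilities are made up for the example, and the columns are assumed to hold one topic each.

```python
import pandas as pd

# Hypothetical toy data: rows are paragraphs, columns are topics,
# values are the probability that a paragraph belongs to a topic.
toy = pd.DataFrame({
    1: [0.70, 0.60, 0.45, 0.10, 0.10, 0.20, 0.15],
    2: [0.20, 0.30, 0.45, 0.55, 0.40, 0.20, 0.15],
    3: [0.10, 0.10, 0.10, 0.35, 0.50, 0.60, 0.70],
})

def share_interpretable(df, min_docs):
    """Share of topics with at least min_docs paragraphs at or above L = 1/min_docs."""
    threshold = 1 / min_docs                              # step 1: L = 1/min_docs
    docs_per_topic = (df >= threshold).sum()              # step 2: paragraphs per topic at or above L
    return float((docs_per_topic >= min_docs).mean())     # steps 3 and 4

# Step 5: repeat for several values of L = 1/n and compare the shares.
results = {n: share_interpretable(toy, n) for n in (1, 2, 3)}
for n, share in results.items():
    print(f"L = 1/{n}: {share:.0%} of topics are interpretable")
```

In this toy example only L = 1/3 makes every topic interpretable, at the cost of reading three paragraphs per topic.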

##### Identification of the optimal L and minimum number of paragraphs for topic interpretation

In [82]:
List_num_doc = [x for x in range(1, 20, 1)] # generate a list of 1/L (minimum number of documents to interpret a topic)
list_L = [1/x for x in List_num_doc]        # generate a list of L

In [83]:
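# define the function for computing the percentage of potentially interpretable topics against parameter L
def perc(i, df):
    list_num_topics = []
    for j in df:                                   # iterate over the topic columns
        topic_filter = df[j] >= i                  # paragraphs whose probability is at least L
        m = df[j][topic_filter].count()            # number of such paragraphs for this topic
        list_num_topics.append(m)
    # count the topics with at least 1/L qualifying paragraphs (computed once, after the loop)
    count1 = sum(map(lambda x : x >= 1/i, list_num_topics))
    perc1 = count1 / k                             # k is the global variable used throughout the notebook
    return perc1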

The plot produced by the next cell shows that when L = 0.333, the percentage of interpretable topics is 100%, so I set L to 0.333, i.e. each topic needs at least 3 (1/0.333) paragraphs whose probability of belonging to the topic is no less than 0.333. Note that the minimum number of paragraphs per topic (1/L) is inversely proportional to L: the lower the threshold L, the more paragraphs users need to read to interpret the topics. Although the percentage of interpretable topics is also 100% when L = 0.1, the minimum number of paragraphs per topic then rises to 10 (1/0.1), which increases the interpretation workload significantly.

In [84]:
list_perc2 = []
for i in list_L:
    num = perc(i, df_topic_para.drop(list_remove_para))
    list_perc2.append(num)

df_L2 = pd.DataFrame({'Threshold_L': list_L, 'Percentage of interpretable topics': list_perc2})
fig_L2 = px.line(df_L2, x = 'Threshold_L', y="Percentage of interpretable topics", title = 'Percentage of interpretable topics')
fig_L2.update_layout(autosize=False, width=800, height=400)
fig_L2.update_traces(mode = "lines + markers")
fig_L2.show()
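
The choice of L can also be read off the table instead of the plot. The sketch below assumes a data frame shaped like `df_L2`; the frame `demo`, its percentages and the helper `pick_optimal_L` are hypothetical and are not the notebook's results. It picks the largest L that reaches the highest percentage, which keeps the number of paragraphs to read (1/L) as small as possible.

```python
import pandas as pd

# Hypothetical values for illustration only; in the notebook they would come from df_L2.
demo = pd.DataFrame({
    "Threshold_L": [1 / n for n in range(1, 6)],
    "Percentage of interpretable topics": [0.0, 0.75, 1.0, 1.0, 1.0],
})

def pick_optimal_L(df):
    """Return the largest threshold L that attains the highest percentage."""
    col = "Percentage of interpretable topics"
    best = df[col].max()
    candidates = df[df[col] == best]
    return candidates["Threshold_L"].max()

optimal_L = pick_optimal_L(demo)
min_paragraphs = round(1 / optimal_L)   # minimum number of paragraphs per topic
print(optimal_L, min_paragraphs)
```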

In [85]:
# define the function for extracting the highest N ranked paragraphs from each topic
def top_n_filter(df, top_n):
    list_topic_id = [x+1 for x in range(0,k)]
    list_n_para = []
    list_n_p = []
    list_n_index = []
    list_n_file_name = []
    for x in range(1, k + 1): 
        top = df.nlargest(top_n, [x])                    # top_n rows with the highest probability for topic x
        list_n_para.append(list(top['paragraphs']))      # paragraph texts
        list_n_p.append(list(top[x]))                    # probabilities of belonging to topic x
        list_n_file_name.append(list(top['file_name']))  # source files
        list_n_index.append(list(top.index))             # row labels in df (the passed frame, not a global)
    pd_n_para = pd.DataFrame({'Index':list_n_index, 'topic_id': list_topic_id, 'file': list_n_file_name, 'paragraph': list_n_para, 'probability': list_n_p})
    # expand the per-topic lists into one row per paragraph
    return pd_n_para.apply(pd.Series.explode).reset_index().drop('index', axis = 1)
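
To make the ranking step concrete, here is a minimal sketch of the same idea on made-up data. The frame `toy` and the helper `top_n_long` are hypothetical and only mirror the column names used above; the sketch builds the long table directly with `pd.concat` rather than with `explode`, which is a different route to the same layout.

```python
import pandas as pd

# Hypothetical toy data: two topic columns (1 and 2) plus the text and file columns.
toy = pd.DataFrame({
    "paragraphs": ["para A", "para B", "para C", "para D"],
    "file_name": ["a.pdf", "a.pdf", "b.pdf", "c.pdf"],
    1: [0.70, 0.20, 0.60, 0.10],
    2: [0.30, 0.80, 0.40, 0.90],
})

def top_n_long(df, topic_ids, top_n):
    """Return one row per (topic, paragraph) for the top_n paragraphs of each topic."""
    frames = []
    for t in topic_ids:
        top = df.nlargest(top_n, [t])   # rows with the highest probability for topic t
        frames.append(pd.DataFrame({
            "Index": top.index.to_numpy(),
            "topic_id": t,
            "file": top["file_name"].to_numpy(),
            "paragraph": top["paragraphs"].to_numpy(),
            "probability": top[t].to_numpy(),
        }))
    return pd.concat(frames, ignore_index=True)

top2 = top_n_long(toy, [1, 2], 2)
print(top2)
```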

Below are up to 3 of the most prototypical paragraphs of each topic, keeping only those whose probability is at least the optimal L of 0.333.

In [86]:
L = 1/3 # set the optimal L based on the analysis above
df_top_L = top_n_filter(df_topic_para2, int(1/L))   # compute the top 1/L paragraphs per topic once
df_top_L[df_top_L['probability'] >= L].style.set_properties(subset = ['paragraph'], **{'width':'500px', 'length': '50px'})

Unnamed: 0,Index,topic_id,file,paragraph,probability
0,9932,1,Qs_202_-_238.pdf,"Lord Moynihan: Adrian, I am looking to the future and in this new found position of being apolitical and outside criticism of what may or may not have happened in the past I can tell you what is required moving forward is significant improvements rather than dwelling on what went wrong in the past. The first thing that is required is consistent funding. It is absolutely essential that we have consistent funding to the Olympic governing bodies. The second thing that is required is hiring the best coaches. I mentioned earlier that coaches are centres of excellence, and they really are. The coach inspires, the coach motivates, the coach can get the best out of our young Olympians of the future. We will deliver far better than we have in the past. The importance given to coaching needs to be on a step change level from where it has been in the past. I am very pleased to say that UK Sport and the Government are at one with us on the importance of reinforcing more emphasis on coaching. Then I mentioned in a highly competitive market, which it will be, to achieve fourth place we need to make sure not just the coaching but the management, the administration, the medical, the support services to our sports men and women are there and that we have the back-up facilities, the high performance centres properly resourced and that the governing bodies are given the support they need to make the decisions on behalf of their staff, their performance directors, their coaches. Ultimately we are all the servants of the sports men and women who will be up there winning medals in 2012. I am absolutely convinced that with the strength of the governing bodies and the talent we have in this country that is an achievable and deliverable target. I am absolutely convinced that unless Clearing the Bar is accepted we will not achieve that stretch target. 
That is why we are spending so many hours, the days and nights and little time we have available getting that document right and working with UK Sport to make sure we get that right. We have to do it now, we cannot afford to wait until the end of next year. We need to make sure that the funding is secure so that governing bodies can take the steps which are necessary next year. The time invested in the beginning of next year will be worth a huge amount when we get to 2012. It is time invested upfront to make sure that those plans can deliver the success. That means we will look at Beijing in a different light from how we would normally look at Beijing. We will look at Beijing as a stepping stone towards success in 2012 which may not best be judged by medal tallies, for example, but must be judged by how much progress the squads, the teams and the coaches have made en route from here to 2012; which will be our principal goal",0.615441
1,3639,1,NAO_Preparing_for_sporting_success_-_March_2008.pdf,"4 UK Sport’s ‘ultimate goals’ for medal success at the London 2012 Games will require a step change in performance amongst elite athletes. The achievements of athletes at recent elite international events in a number of sports, including sailing, cycling, rowing, boxing, disability equestrian and disability shooting, suggest that performance levels in some sports are already improving significantly. Following increased spending on elite sport, host nations can typically expect to win an extra six or seven gold medals at an Olympic Games and to win medals across a wider range of sports. This ‘host nation effect’ would not in itself be enough to deliver UK Sport’s Olympic goal, which is likely to require an improvement of eight or nine gold medals over the Great Britain team’s performance at the Athens Games in 2004 if the relative performance of other nations remained the same. Changes in the performance of other nations since 2004, especially in the context of a general trend of increased spending on elite sport, sometimes referred to as a ‘global sporting arms race’, may also have implications for UK Sport in delivering its medal aspirations",0.556173
2,9912,1,Qs_202_-_238.pdf,"To reach the figure that we will need to conclude our work in the next four weeks and, we will continue to work very closely indeed with UK Sport. UK Sport has been very active in working with the British Olympic Association, going round to see the governing bodies, using a performance-based model which has been a very constructive and critical model in the context of working out the funding requirements. I think it is best summarised as work in progress. We are about three-quarters of the way through the discussions with the governing bodies; we have more to complete. Clearly within each governing body, Olympic governing body, there is an elite performance cell. That elite performance cell has a performance director associated with it and the performance director is critical to this process, as are the coaches. It is essential as far as the British Olympic Association is concerned that when Clearing the Bar is presented to the Olympic Board and subsequently presented to Government that it is agreed by all the summer Olympic governing bodies. There will be no point in coming to a figure that says collectively we can achieve fourth place in the Olympic medal table in 2012, if suddenly hockey or athletics, for the sake of argument, woke up the following morning and said ""Hang on a second, we are not going to be able to contribute in the way you would like us to do on that budget"". It would need to be robust, it would need to be capable of detailed analysis by this Committee and other committees in Parliament. It requires a significant amount of work which is underway at the moment. I emphasise that it is being undertaken in partnership with UK Sport, that is right and proper as they have significant expertise which has been very helpful to us in this process. We are on target to complete that work, Chairman, and we intend to make sure that it is presented on time. 
It is a budget, as I say, that must be robust and bought into by the Olympic governing bodies which ultimately will be responsible for performance on the day. The final point I would make in answer to your question, Mike, is that consistent funding is essential. We cannot have the situation whereby a governing body receives funding one year and maybe gets what they are expecting in year two but then loses out in funding in year three. If we are going to compete to come fourth, and we believe that is a realistic target to achieve, it is a tough stretch but it is realistic—it should be a tough stretch but it must be realistic—then we need consistent funding over the next six years. That is absolutely essential. If we are going to contract with the best coaches in the world, if we are going to provide the best sport facilities, that base line budget must be agreed and the governing bodies must be confident that there will not be a move away from that base line budget in recruiting the staff necessary to move from tenth to fourth in the medal table. That is the current position. We are working hard both within the BOA as well as with UK Sport and with outside experts to make sure that model is robust and that Clearing the Bar will achieve not only what it says but be widely accepted by your Committee and the sporting world in this country",0.540828
3,17552,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level",0.676578
4,17551,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced ""being a winner"" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools",0.627952
5,17553,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"5. Integration of disabled people into mainstream sporting provision is a concept rather than a reality. Grants to organisations has largely been based on physical access, rather than actual provision of activities, coaching etc. For some it has been a tick box action, rather than an attempt to include and integrate their provision. In order to create a lasting legacy from 2012, this offer of inclusion has to be both genuine and meaningful, ie accessible facilities with no manmade barriers re attitudes, lack of coaching or energy to find solutions regarding sporting opportunities. These are all major threats to establishing a broad base of participation pyramid with hundreds of disabled people participating in sport at the base level, filtering through at representative level for club and nationally, leading on to the tip of the pyramid—international competition at Paralympic level. This pathway does not currently exist for those with a disability and therefore, a different solution needs to be found that provides a separate provision (where appropriate) and uses the mainstream provision (where appropriate)",0.594271
6,2929,3,NAO_Preparations_for_the_Olympics_-_Progress_report_-_Feb_2010.pdf,"Figure 2The Delivery Authority’s November 2007 baseline for the Village and Media Centre and its December 2009 forecastall figures in £ million, inclusive of Vat november 2007 baseline Revised budget approved by funders in 2009latest forecast of potential costs december 2009Village and Stratford City land and infrastructure Land and infrastructure – public sector funding 522 553 560Estimated share of profits from post Games development (250) (250) (100)1Net Olympic Delivery Authority budget 272 303 460Village development Village construction and sales (gross) costs 0 1,095 1,1262Agreed sale of 1,379 units to Triathlon Homes 0 (268) (268)Estimated receipts from private sales – to fund construction 0 (177) (177)Village development – sub total 0 650 681Estimated receipts from private sales – to repay funders 0 (324) (324)Net Olympic Delivery Authority budget 0 326 357Media Centre3Media Centre 220 355 334Net Olympic Delivery Authority budget 220 355 334Total net cost to the Olympic Delivery Authority 492 984 1,151Source: National Audit Offi ce noteS1 See paragraph 4.10",0.602559
7,7679,3,PAC_-_Risk_assessment_and_management_-_July_2007_report.pdf,"6. At the time of our hearing, a number of significant areas of uncertainty remained before the budget could be finalised.18 Tax – The Department explained that tax had been excluded from the cost estimates at the time of the bid on the basis that the tax treatment could not be resolved until the delivery structures for the Games were in place.19 Contingency provision – The cost estimates at the time of the bid had included a contingency provision in respect of individual projects, but the Department now considered that an overall programme contingency margin was required to reflect the interdependencies between projects and the risks associated with the knock-on effect that problems on one project could have on the rest of the programme.20 Security – It had not been possible at the time of the bid to provide a reliable estimate of the costs of policing and wider security, and the Department had notified Parliament of a contingent liability in this respect. The Olympic Security Co-ordinator was now working up plans and budgets in association with the Home Office and the Metropolitan Police.21 Private sector funding – At the time of the bid, £738 million of private funding had been expected to help meet costs associated with the Olympic Park. In the light of further work and advice, the Department now considered there was insufficient time to negotiate contracts with the private sector within the overall timescale for the Games, so there was now little prospect of securing significant private sector funding to deliver the Olympic Park. However, most of the funding for the Olympic Village was still expected to come from the private sector.22",0.596071
8,2652,3,NAO_Budget_for_2012_Olympics_-_July_2007.pdf,"7 The Secretary of State announced a funding package of £9.325 billion to cover the costs and provisions set out in Figure 5, an increase of £5.906 billion on the public funding of £3.419 billion17 previously committed. (Figure 8) The contributions from the National Lottery and the Greater London Authority have increased, but the bulk of the additional funding is to come from the Exchequer, in line with the Government’s commitment to underwrite the cost of the Games. The costs to be covered by this increase in funding include £1.173 billion of tax18 which will ultimately flow back to the Exchequer. The Department has confirmed to us that the tax liabilities associated with the Games will be met entirely from Exchequer funding19, which means that ultimately the net increase in public sector funding is £4.733 billion. The funding increase of £5.906 billion includes contingency of £2.747 billion which the Department has made clear to us may not be used in full",0.549523
9,2002,4,Memorandum_submitted_by_LOCOG_-_Nov_2007.pdf,"Chief Medical Officer Creative Director Director of Strategy and Programme Management Head of Procurement Head of Client Services Head of Education Head of Venues Technology Head of Programme Solutions Head of Workforce Planning Head of Accommodation Head of Sport Policy Head of Culture Head of Telecommunications Head of Ceremonies Head of Administrative IT Head of Live Site Head of Venue Management Head of Ticketing — We now have in place the core components required to undertake the detailed planning for the Games. The team however remains small, at just under 200 people and our recruitment is planned carefully on the basis of the core work that needs to be undertaken now.",0.55365


### 5.3.4. N most prototypical paragraphs of each topic

In [87]:
# 2 most prototypical paragraphs of each topic
N2 = 2
top_n_filter(df_topic_para2, N2).style.set_properties(subset = ['paragraph'], **{'width':'500px', 'length': '50px'})

Unnamed: 0,Index,topic_id,file,paragraph,probability
0,9932,1,Qs_202_-_238.pdf,"Lord Moynihan: Adrian, I am looking to the future and in this new found position of being apolitical and outside criticism of what may or may not have happened in the past I can tell you what is required moving forward is significant improvements rather than dwelling on what went wrong in the past. The first thing that is required is consistent funding. It is absolutely essential that we have consistent funding to the Olympic governing bodies. The second thing that is required is hiring the best coaches. I mentioned earlier that coaches are centres of excellence, and they really are. The coach inspires, the coach motivates, the coach can get the best out of our young Olympians of the future. We will deliver far better than we have in the past. The importance given to coaching needs to be on a step change level from where it has been in the past. I am very pleased to say that UK Sport and the Government are at one with us on the importance of reinforcing more emphasis on coaching. Then I mentioned in a highly competitive market, which it will be, to achieve fourth place we need to make sure not just the coaching but the management, the administration, the medical, the support services to our sports men and women are there and that we have the back-up facilities, the high performance centres properly resourced and that the governing bodies are given the support they need to make the decisions on behalf of their staff, their performance directors, their coaches. Ultimately we are all the servants of the sports men and women who will be up there winning medals in 2012. I am absolutely convinced that with the strength of the governing bodies and the talent we have in this country that is an achievable and deliverable target. I am absolutely convinced that unless Clearing the Bar is accepted we will not achieve that stretch target. 
That is why we are spending so many hours, the days and nights and little time we have available getting that document right and working with UK Sport to make sure we get that right. We have to do it now, we cannot afford to wait until the end of next year. We need to make sure that the funding is secure so that governing bodies can take the steps which are necessary next year. The time invested in the beginning of next year will be worth a huge amount when we get to 2012. It is time invested upfront to make sure that those plans can deliver the success. That means we will look at Beijing in a different light from how we would normally look at Beijing. We will look at Beijing as a stepping stone towards success in 2012 which may not best be judged by medal tallies, for example, but must be judged by how much progress the squads, the teams and the coaches have made en route from here to 2012; which will be our principal goal",0.615441
1,3639,1,NAO_Preparing_for_sporting_success_-_March_2008.pdf,"4 UK Sport’s ‘ultimate goals’ for medal success at the London 2012 Games will require a step change in performance amongst elite athletes. The achievements of athletes at recent elite international events in a number of sports, including sailing, cycling, rowing, boxing, disability equestrian and disability shooting, suggest that performance levels in some sports are already improving significantly. Following increased spending on elite sport, host nations can typically expect to win an extra six or seven gold medals at an Olympic Games and to win medals across a wider range of sports. This ‘host nation effect’ would not in itself be enough to deliver UK Sport’s Olympic goal, which is likely to require an improvement of eight or nine gold medals over the Great Britain team’s performance at the Athens Games in 2004 if the relative performance of other nations remained the same. Changes in the performance of other nations since 2004, especially in the context of a general trend of increased spending on elite sport, sometimes referred to as a ‘global sporting arms race’, may also have implications for UK Sport in delivering its medal aspirations",0.556173
2,17552,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level",0.676578
3,17551,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced ""being a winner"" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools",0.627952
4,2929,3,NAO_Preparations_for_the_Olympics_-_Progress_report_-_Feb_2010.pdf,"Figure 2The Delivery Authority’s November 2007 baseline for the Village and Media Centre and its December 2009 forecastall figures in £ million, inclusive of Vat november 2007 baseline Revised budget approved by funders in 2009latest forecast of potential costs december 2009Village and Stratford City land and infrastructure Land and infrastructure – public sector funding 522 553 560Estimated share of profits from post Games development (250) (250) (100)1Net Olympic Delivery Authority budget 272 303 460Village development Village construction and sales (gross) costs 0 1,095 1,1262Agreed sale of 1,379 units to Triathlon Homes 0 (268) (268)Estimated receipts from private sales – to fund construction 0 (177) (177)Village development – sub total 0 650 681Estimated receipts from private sales – to repay funders 0 (324) (324)Net Olympic Delivery Authority budget 0 326 357Media Centre3Media Centre 220 355 334Net Olympic Delivery Authority budget 220 355 334Total net cost to the Olympic Delivery Authority 492 984 1,151Source: National Audit Offi ce noteS1 See paragraph 4.10",0.602559
5,7679,3,PAC_-_Risk_assessment_and_management_-_July_2007_report.pdf,"6. At the time of our hearing, a number of significant areas of uncertainty remained before the budget could be finalised.18 Tax – The Department explained that tax had been excluded from the cost estimates at the time of the bid on the basis that the tax treatment could not be resolved until the delivery structures for the Games were in place.19 Contingency provision – The cost estimates at the time of the bid had included a contingency provision in respect of individual projects, but the Department now considered that an overall programme contingency margin was required to reflect the interdependencies between projects and the risks associated with the knock-on effect that problems on one project could have on the rest of the programme.20 Security – It had not been possible at the time of the bid to provide a reliable estimate of the costs of policing and wider security, and the Department had notified Parliament of a contingent liability in this respect. The Olympic Security Co-ordinator was now working up plans and budgets in association with the Home Office and the Metropolitan Police.21 Private sector funding – At the time of the bid, £738 million of private funding had been expected to help meet costs associated with the Olympic Park. In the light of further work and advice, the Department now considered there was insufficient time to negotiate contracts with the private sector within the overall timescale for the Games, so there was now little prospect of securing significant private sector funding to deliver the Olympic Park. However, most of the funding for the Olympic Village was still expected to come from the private sector.22",0.596071
6,2002,4,Memorandum_submitted_by_LOCOG_-_Nov_2007.pdf,"Chief Medical Officer Creative Director Director of Strategy and Programme Management Head of Procurement Head of Client Services Head of Education Head of Venues Technology Head of Programme Solutions Head of Workforce Planning Head of Accommodation Head of Sport Policy Head of Culture Head of Telecommunications Head of Ceremonies Head of Administrative IT Head of Live Site Head of Venue Management Head of Ticketing — We now have in place the core components required to undertake the detailed planning for the Games. The team however remains small, at just under 200 people and our recruitment is planned carefully on the basis of the core work that needs to be undertaken now.",0.55365
7,17005,4,Written_evidence_submitted_by_Olympic_Park_Legacy_Company_-_Feb_2010.pdf,"10. The Olympic Park Legacy Company's strategic objectives, as set by its Founders cover: — assisting the Government and the Mayor of London in fulfilling some of the legacy promises made in the bid to host the London 2012 Olympic and Paralympic Games; — securing the timely development of the Olympic Park site as a high quality and sustainable mixed community; — promoting social, economic and environmental benefits for local communities; — securing the long term development and management of the Olympic Park site and venues in ways which provide lasting national and local sporting, cultural, education and leisure benefits and which preserve the site's Olympic heritage; — working with partners to contribute to long-term economic growth and prosperity in the wider area; — levering in private investment to maximise and provide best value for the public purse; and — promoting sustainable development, community involvement and equality of opportunity",0.490745
8,16483,5,Written_evidence_submitted_by_Host_Boroughs_Unit_-_Feb_2010.pdf,"4.1 The aim is that in the next 20 years, residents in the host boroughs will equal the London average in a range of the life indicators which you would expect to find in a successful community: — employment rates will increase to the London average; — average incomes in the bottom two fifths of earners in the host borough area will be increased to the London average; — young people in the host borough area will have improved GCSE results to at least the London average; — host borough 11 year olds will have at least the same educational attainment as the London average; — the number of families in receipt of benefits in the host boroughs area will fall to no more than the London average; — the rate of violent crime will continue to fall and reflect the London average; and — residents in the host boroughs area, particularly men, will have increased life expectancy to the London average",0.566613
9,4947,5,Oral_evidence_-_17_March_2010_Qs_100-143_-_Boroughs.pdf,"Sir Robin Wales: It is worth making a comment here. It has been very interesting watching this because Jules has led very much with a vision for this place, which the five boroughs have supported but Jules has driven that vision. It is as we have gone on and people have begun to connect with the SRF and understand what we are trying to do that people have begun to realise we are now with the OPLC, which I think is people understanding that legacy is important and beginning to line up behind the vision that Jules has pushed extensively because he understands the nature of his community and how that might work and how it will relate. I think it is a really good example of something being pushed by a local borough, backed by the rest of us, looking to have a vision that will make a difference there and will link in with the community he has got. It comes back to heroic economic assumptions. What comes out at the end will come out, but at least we are trying to do something that will deliver, something that will work for the local area, based on the vision that we have had locally and people are now beginning to line up to. So the question now is: do we get people lined up to support us on public policy and then the jobs that will come out will be the jobs that come out and they will begin to make a difference, particularly to Hackney but also to some of the people in Newham who will be able to access that, and other boroughs. It is a really good example of how the vision has been led by boroughs and people are now getting it",0.536632


### 5.3.5.  N most prototypical paragraphs of a specific topic

In [88]:
topic_id_chosen = 2                                    # choose the topic ID
num_para = 2                                            # set N to extract the N most prototypical paragraphs of a specific topic
df_n_topic_k = top_n_filter(df_topic_para2, num_para)
topic_id_filter = df_n_topic_k['topic_id'] == topic_id_chosen
df_n_topic_k[topic_id_filter].style.set_properties(subset = ['paragraph'], **{'width':'500px', 'length': '50px'})

Unnamed: 0,Index,topic_id,file,paragraph,probability
2,17552,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"4. Most clubs seem ill-prepared for enquiries from, and inclusion of, people with disabilities who wish to participate in the sport offered by the club. There is little or no support for specialist clubs who provide opportunities for sport that cannot be integrated ie wheelchair basketball or blind cricket! There is often nothing locally to support the child or their parent in accessing the specialist provision and this often involves their having to undertake extensive travel to specialist facilities or organisations catering for this group. There is poor information regarding availability etc and http://www.parasport.org.uk was established as a portal to provide pathway and provision information. The then Mayor of London published a strategy in 2007, which highlighted all these issues and to date there has been little, if any, action to redress these anomalies in London or in the rest of the country. DCSF have, in my view, shown no leadership regarding the legacy of 2012 and its impact on PE in schools and the inclusion of those with disabilities in core curriculum activities or sporting opportunities within or after school. DCMS held a legacy event in 2008 and again in April 2009 focusing on the legacy of the Games for those with disabilities. One outcome was to seek greater links between DWP, DCSF and themselves to ensure that joint strategies were developed and pathways established that enabled children to enjoy and participate in PE and sport within schools/after school clubs/integrated and specialist provision in the community, with good national talent forums and pathways established for those who wish to participate in sport at a higher level and finally, with governing bodies having clear inclusive programmes for sports men and women with disabilities active at a national and international level",0.676578
3,17551,2,Written_evidence_submitted_by_Vision_2020_UK_-_Jan_2010.pdf,"3. After school activities often exclude this group of children ie no possibility of inclusion in team sports such as football, cricket, rugby, basket ball etc., limited access to swimming baths, athletic fields etc. All of these sports are undertaken by people with a disability, but not now normally at school. Specialist schools did provide a massive range of sporting opportunities and sport played a major part of my own adjustment to disability. I learned about teamwork, was able to set individual goals, have competition to extend my abilities, occasionally experienced ""being a winner"" and had that thrill of competition. One example of efforts to redress this issue within a school setting is in Leeds, which has a programme of monthly sporting and physical activities arranged in school time for vision impaired children within the schools. One solution easily achieved would be for groups of schools to come together monthly to provide sporting and recreational activities for disabled children and young people within their schools",0.627952
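
The selection above amounts to taking, for one topic column, the N rows with the highest probability. A minimal sketch of that idea on a toy table is given below; the table values and the helper name `top_n_for_topic` are hypothetical and not part of the program.

```python
import pandas as pd

# Toy topic distribution: one row per paragraph, one column per topic ID.
# All values are made up for illustration.
toy = pd.DataFrame({
    'paragraphs': ['para A', 'para B', 'para C', 'para D'],
    1: [0.70, 0.10, 0.20, 0.05],
    2: [0.20, 0.65, 0.30, 0.60],
})

def top_n_for_topic(df, topic_id, n):
    """Return the n paragraphs with the highest probability for one topic (hypothetical helper)."""
    top = df.nlargest(n, [topic_id])
    return pd.DataFrame({
        'topic_id': topic_id,
        'paragraph': top['paragraphs'].tolist(),
        'probability': top[topic_id].tolist(),
    })

result = top_n_for_topic(toy, topic_id=2, n=2)
print(result)
```

Unlike `top_n_filter`, this sketch ranks a single topic only, so it avoids ranking every topic when just one is of interest.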


## 5.4 Dashboard for topic interpretation

The dashboard below combines the pyLDAvis visualisation with the prototypical paragraphs so that the topics can be interpreted more easily; click the link generated at the end of the chunk to open it. Before launching the dashboard, download the two css files from https://github.com/suhao3123/CSS, create a folder named assets in the root of your app directory and place both files in that folder.
After the first run of the whole program, the chunks below can be run independently.
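
Before launching the dashboard it can help to confirm that the assets folder is complete. The sketch below is one possible check, not part of the program: the helper name `missing_assets` is hypothetical, and the list contains only the two file names that the dashboard code refers to (`bootstrap.min.css` and `lda.html`). The name of the second css file is not stated here, so it is left out.

```python
from pathlib import Path

def missing_assets(app_root, required):
    """Return the required file names that are absent from <app_root>/assets (hypothetical helper)."""
    assets_dir = Path(app_root) / "assets"
    return [name for name in required if not (assets_dir / name).is_file()]

# File names referenced by the dashboard code; extend the list with the second css file
required = ["bootstrap.min.css", "lda.html"]
missing = missing_assets(".", required)
if missing:
    print("Missing from assets:", missing)
else:
    print("All listed assets found")
```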

In [89]:
import plotly.express as px 
import plotly.graph_objects as go

from jupyter_dash import JupyterDash

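import dash
# Dash 1.x package names; from Dash 2.0 onwards the same components can also be imported with
# `from dash import dcc, html, dash_table` and `from dash.dash_table.Format import Format, Scheme, Trim`
import dash_table
import dash_core_components as dcc
import dash_html_components as html
import dash_bootstrap_components as dbc
from dash_table.Format import Format, Scheme, Trim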
from dash.dependencies import Input, Output, State
from dash.exceptions import PreventUpdate

In [90]:
# load the topic distribution of paragraphs from disk
df_topic_para3 = pd.read_pickle('./df_topic_para_Olympics2.pkl')
df_topic_para3_n = df_topic_para3.copy()
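# columns from position 6 onwards hold the topic probabilities; select them once so that
# the newly added 'highest_p' column is not included when looking up the salient topic
topic_cols = df_topic_para3_n.columns[6:]
df_topic_para3_n['highest_p'] = df_topic_para3_n[topic_cols].max(axis = 1)
df_topic_para3_n['salient_topic'] = df_topic_para3_n[topic_cols].idxmax(axis = 1)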
df_topic_para3_n = df_topic_para3_n[['index','file_name','salient_topic','paragraphs','highest_p',]]
df_topic_para3_n.columns = ['Index','file','topic', 'paragraph','probability']

# define the function for extracting the highest N ranked paragraphs from each topic
def top_n_filter(df, top_n):
    list_topic_id = [x+1 for x in range(0,k)]
    list_n_para = []
    list_n_p = []
    list_n_index = []
    list_n_file_name = []
    for x in range(1, k + 1): 
        top = df.nlargest(top_n, [x])                 # the top_n rows with the highest probability for topic x
        n_para = top['paragraphs'].tolist()
        n_p = top[x].tolist()
        n_index = top.index.tolist()                  # use the dataframe passed in, not the global one
        n_file_name = top['file_name'].tolist()
        list_n_para.append(n_para)
        list_n_p.append(n_p)
        list_n_file_name.append(n_file_name)
        list_n_index.append(n_index)
    pd_n_para = pd.DataFrame({'Index':list_n_index, 'topic_id': list_topic_id, 'file': list_n_file_name, 'paragraph': list_n_para, 'probability': list_n_p})
    return(pd_n_para.apply(pd.Series.explode).reset_index().drop('index', axis = 1))

list_mark = list(np.arange(0,1.050,0.050))
list_mark_round = [round(i, 2) for i in list_mark]
marks= {x: str(x) for x in list_mark_round}

# Set up the app
external_stylesheets = [dbc.themes.BOOTSTRAP, "assets/bootstrap.min.css"]
app = JupyterDash(__name__, external_stylesheets=external_stylesheets)

# Bootstrap's cards provide a flexible content container with multiple variants and options.
pyLDAcard = dbc.Card(
    [
            dbc.CardHeader(html.H4("Topic visualisation")),                # title
            dbc.CardBody(
            [
                dbc.Row(
                        dbc.Col(
                            [
                                html.Embed(src = "assets/lda.html" ,style={ 'position': 'relative', 'left': '-250px', 'top': '-100px',
                                                                            'width':'1400px', 'height':'860px', 'transform': 'scale(0.70)'}), 
                            ]
                        )
                )
            ]
        ),
    ]
)


table_card = dbc.Card(
    [
        dbc.CardHeader(
            dbc.Row([
                  dbc.Col(html.H4("Prototypical paragraphs"))
            ])            
        ),
        
        
        dbc.CardHeader(
                         dbc.Row(
                            [
                            dbc.Col(
                            [
                                html.H6("Threshold of probability "),
                                dcc.Slider(
                                            id='slider',
                                            min=0,
                                            max=1,
                                            step=0.01,
                                marks=marks,
                                        value=0.1,
                                        ),html.Div(style={'width': '1000px'})
                            ]
                        ),
                        dbc.Col(
                        [
                                html.H6("Topic Selection"),
                                dcc.Input(id="topic_selection", type="number",min=1, max=11, step=1, value=1),
                                html.Div(style={'width': '100px'})
                        ]
                        ),
                          dbc.Col(
                        [
                                html.H6("N Selection"),
                                dcc.Input(id="rank_selection", type="number",min=1, max=20, step=1,value=5),
                                html.Div(style={'width': '100px'})
                        ]
                        ),
                                dbc.Col(
                                    [
                                        html.H6("Mode"),
                                        dcc.Dropdown(
                                                            id='dropdown',
                                                            options=[
                                                                {'label': 'N most prototypical paragraphs for topic K', 'value': 'c1'},
                                                                {'label': 'N most prototypical paragraphs overall', 'value': 'c2'},
                                                                {'label': 'N most prototypical paragraphs for each topic', 'value': 'c3'}
                                                            ],
                                               #             value = 'c1',
                                                            searchable=False,
                                                            clearable=False,
                                                            placeholder="Select a mode",
                                                        ),html.Div(style={'width': '380px'})
                                    ]
                                ),                                
                ]
            )                    
                ),
        
        dbc.CardBody(
                dbc.Col([
                    dash_table.DataTable(),html.Div(id="data_table")           
                ])    
                ),
        
        dbc.CardFooter(
            dbc.Row([
                dbc.Col(
                                    [
                                        html.H6('Please click the "Submit" button after setting the parameters above'),html.Div(style={'width': '500px'})

                                    ]
                                ),
                                
                dbc.Col(
                                    [
                                        dbc.Button("Submit", id='submit', color="success"),
                                        html.Div(id='button')
                                    ]
                                )
                ])
                )       
    ]
)
        
app.layout = html.Div(
    [
        dbc.Container(
            [dbc.Row(
                [
                dbc.Col(pyLDAcard,md=7), 
                dbc.Col(table_card,md=5)
            ]             
            )
            ],
            fluid=True,
        ),
    ]
)

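# Only the "Submit" button triggers the callback; the other controls are read as State,
# so changing them does not clear the table before the button is clicked
@app.callback(
    Output('data_table','children'),
    Input('submit', 'n_clicks'),
    State('dropdown', 'value'), State('slider', 'value'), State('topic_selection','value'), State('rank_selection','value')
)
def update_datatable(n_clicks, dropdown_value, slider_value,topic_value,top_n):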
  
    ctx = dash.callback_context
    if not ctx.triggered:
        button_id = 'No clicks'
    else:
        button_id = ctx.triggered[0]['prop_id'].split('.')[0]
    
#    print(button_id)
                 
    if button_id=="submit":
        topic = topic_value                        # topic ID used in the mode "N most prototypical paragraphs for topic K"
        Top_N = top_n                              # number of prototypical paragraphs to extract (N)
        minimum_probability = slider_value         # paragraphs with a probability below this threshold are discarded

        if dropdown_value=='c1':
            c_df = top_n_filter(df_topic_para3, Top_N)
            c_df = c_df[c_df['topic_id'] == topic]
        elif dropdown_value=='c2':
            c_df = df_topic_para3_n.nlargest(Top_N,['probability'])
        elif dropdown_value=='c3':
            c_df = top_n_filter(df_topic_para3, Top_N)
        else:
            return None

        # apply the probability threshold set with the slider
        c_df = c_df[c_df['probability'] >= minimum_probability]
        
        table = dash_table.DataTable(
                                    id="table-line-1",
                                    columns=[
                                                dict(id=c_df.columns[0], name=c_df.columns[0]),
                                                dict(id=c_df.columns[1], name=c_df.columns[1]),
                                                dict(id=c_df.columns[2], name=c_df.columns[2]),
                                                dict(id=c_df.columns[3], name=c_df.columns[3]),
                                                dict(id=c_df.columns[4], name=c_df.columns[4], type='numeric', format=Format(precision=2, scheme=Scheme.fixed)),             
                                            ],
                                    data=c_df.to_dict("records"),
                       #             page_action='none',
                                    page_size=5,
                                    style_table={'height': '1000px', 'overflowY': 'auto'},
                                    fixed_rows={'headers': True},
                                    style_header={ 'border': '1px solid black', 'fontWeight': 'bold','textAlign': 'center', 'fontSize':'1px'},
                                    style_cell={  'fontSize':'10px','border': '1px solid grey','minWidth': 10, 'width': 30,'whiteSpace': 'normal',
                                                'height': 'auto', 'lineHeight': '15px','textAlign': 'center','textOverflow': 'ellipsis', 'maxWidth': 0},
                                    css=[{
                                            'selector': '.dash-spreadsheet td div',
                                            'rule': '''
                                                line-height: 15px;
                                                max-height: 300px; min-height: 50px; height: 300px;
                                                display: block;
                                                overflow-y: hidden;
                                            '''
                                        }],
                                     style_cell_conditional=[
                                                                    {'if': {'column_id': 'Index'},'width': '5%'},
                                                                    {'if': {'column_id': 'file'},'width': '10%' },
                                                                    {'if': {'column_id': 'topic_id'},'width': '5%' },
                                                                    {'if': {'column_id': 'paragraph'},'width': '75%','textAlign': 'left'},
                                                                    {'if': {'column_id': 'probability'},'width': '5%'},
                                                                    
    
                                                                ],
    
                                    style_as_list_view=True,
                 )
#        print('end')
        return table
          
app.run_server(mode = 'external')

Dash app running on http://127.0.0.1:8050/
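
The function `top_n_filter` above first collects one list of paragraphs, probabilities and file names per topic and then flattens these lists into one row per paragraph. The sketch below illustrates that flattening step on a toy table with made-up values. It uses the multi-column form of `DataFrame.explode` (available from pandas 1.3) rather than the `apply(pd.Series.explode)` call in the program, so it is an alternative formulation rather than the program's own code.

```python
import pandas as pd

# One row per topic, each cell holding the list of its top-N entries (toy values)
nested = pd.DataFrame({
    'topic_id': [1, 2],
    'paragraph': [['p1', 'p2'], ['p3', 'p4']],
    'probability': [[0.9, 0.8], [0.7, 0.6]],
})

# Explode both list columns in parallel: one row per (topic, paragraph) pair
flat = nested.explode(['paragraph', 'probability']).reset_index(drop=True)
print(flat)
```

The exploded columns have the object dtype; convert `probability` with `astype(float)` if numeric operations are needed afterwards.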


In [91]:
# remove the hash below and run the chunk to terminate the Dash
# app._terminate_server_for_port('localhost', 8050)