In [2]:
import pandas as pd
import re
from top2vec import Top2Vec

In [3]:
df = pd.read_json("data/vol7.json")
df

Unnamed: 0,names,descriptions
0,"AARON, Thabo Simon",An ANCYL member who was shot and severely inju...
1,"ABBOTT, Montaigne",A member of the SADF who was severely injured ...
2,"ABDUL WAHAB, Zakier",A member of QIBLA who disappeared in September...
3,"ABRAHAM, Nzaliseko Christopher",A COSAS supporter who was kicked and beaten wi...
4,"ABRAHAMS, Achmat Fardiel",Was shot and blinded in one eye by members of ...
...,...,...
21742,"ZWENI, Ernest",One of two South African Police (SAP) members ...
21743,"ZWENI, Lebuti",An ANC supporter who was shot dead by a named ...
21744,"ZWENI, Louis","Was shot dead in Tokoza, Transvaal, on 22 May ..."
21745,"ZWENI, Mpantesa William",His home was lost in an arson attack by Witdoe...


In [4]:
docs = df.descriptions.tolist()

In [5]:
docs[0]

"An ANCYL member who was shot and severely injured by SAP members at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991. Police opened fire on a gathering at an ANC supporter's house following a dispute between two neighbours, one of whom was linked to the ANC and the other to the SAP and a councillor."

In [6]:
docs[100]

'Was interrogated, tortured and killed by AZAPO members along with five other scholars in Soweto, Johannesburg, on 1 August 1986. The incident was sparked off by the burning of the house of an AZAPO leader for which the youths were believed to have been responsible. Three perpetrators were refused amnesty, and one was granted amnesty (AC/2000/179 and AC/1999/230).'

In [7]:
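# Drop the literal "See " that introduces cross-references, then strip
# parenthesised asides such as amnesty case numbers, e.g. "(AC/2000/179 and AC/1999/230)".
docs = [d.replace("See ", "") for d in docs]
# Removing a parenthesis can leave " ." behind, so close that gap as well.
docs = [re.sub(r"\([^()]*\)", "", d).replace(" .", ".") for d in docs]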
docs[100]

'Was interrogated, tortured and killed by AZAPO members along with five other scholars in Soweto, Johannesburg, on 1 August 1986. The incident was sparked off by the burning of the house of an AZAPO leader for which the youths were believed to have been responsible. Three perpetrators were refused amnesty, and one was granted amnesty.'
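
The cleaning step above leaves a double space wherever a parenthesis sat mid-sentence (for example "Orange Free State (OFS) on" in docs[0]). A minimal sketch of a stricter cleaner follows. The name clean_description and the extra whitespace rules are assumptions added for illustration; they are not part of the original notebook.

```python
import re

PAREN_RE = re.compile(r"\([^()]*\)")       # a parenthesised span with no nested parentheses
SPACE_RE = re.compile(r"\s{2,}")           # two or more whitespace characters in a row
PUNCT_RE = re.compile(r"\s+([.,;:])")      # whitespace directly before punctuation


def clean_description(text: str) -> str:
    """Strip "See " markers and parenthesised asides, then tidy the spacing."""
    text = text.replace("See ", "")        # same rule as the notebook cell
    text = PAREN_RE.sub("", text)          # same regex as the notebook cell
    text = SPACE_RE.sub(" ", text)         # collapse the gap a removed parenthesis leaves
    text = PUNCT_RE.sub(r"\1", text)       # " ." -> "."
    return text.strip()


# Fragment taken from docs[0] above.
print(clean_description("at Lephoi, Bethulie, Orange Free State (OFS) on 17 April 1991."))
```

Applied to the corpus, this would be docs = [clean_description(d) for d in docs]. The outputs shown in the rest of the notebook were produced with the original two-line version.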

In [8]:
print(Top2Vec.__doc__)


    Top2Vec

    Creates jointly embedded topic, document and word vectors.


    Parameters
    ----------
    documents: List of str
        Input corpus, should be a list of strings.

    min_count: int (Optional, default 50)
        Ignores all words with total frequency lower than this. For smaller
        corpora a smaller min_count will be necessary.

    ngram_vocab: bool (Optional, default False)
        Add phrases to topic descriptions.

        Uses gensim phrases to find common phrases in the corpus and adds them
        to the vocabulary.

        For more information visit:
        https://radimrehurek.com/gensim/models/phrases.html

    ngram_vocab_args: dict (Optional, default None)
        Pass custom arguments to gensim phrases.

        For more information visit:
        https://radimrehurek.com/gensim/models/phrases.html

    embedding_model: string or callable
        This will determine which model is used to generate the document and
        word embeddings. T

In [9]:
model = Top2Vec(docs, speed="fast-learn")

2022-06-30 12:34:12,098 - top2vec - INFO - Pre-processing documents for training
2022-06-30 12:34:13,638 - top2vec - INFO - Creating joint document/word embedding
2022-06-30 12:34:44,662 - top2vec - INFO - Creating lower dimension embedding of documents
2022-06-30 12:35:02,526 - top2vec - INFO - Finding dense areas of documents
2022-06-30 12:35:04,992 - top2vec - INFO - Finding topics


In [10]:
model_learn = Top2Vec(docs, speed="learn")

2022-06-30 12:41:06,927 - top2vec - INFO - Pre-processing documents for training
2022-06-30 12:41:08,692 - top2vec - INFO - Creating joint document/word embedding
2022-06-30 12:41:40,728 - top2vec - INFO - Creating lower dimension embedding of documents
2022-06-30 12:41:47,252 - top2vec - INFO - Finding dense areas of documents
2022-06-30 12:41:49,584 - top2vec - INFO - Finding topics


In [11]:
model_dlearn = Top2Vec(docs, speed="deep-learn")

2022-06-30 12:47:53,227 - top2vec - INFO - Pre-processing documents for training
2022-06-30 12:47:54,797 - top2vec - INFO - Creating joint document/word embedding
2022-06-30 12:53:13,652 - top2vec - INFO - Creating lower dimension embedding of documents
2022-06-30 12:53:21,815 - top2vec - INFO - Finding dense areas of documents
2022-06-30 12:53:24,393 - top2vec - INFO - Finding topics


In [12]:
# Retrain with 14 worker threads. This overwrites the model_dlearn from the previous cell.
model_dlearn = Top2Vec(docs, speed="deep-learn", workers=14)

2022-06-30 12:54:10,969 - top2vec - INFO - Pre-processing documents for training
2022-06-30 12:54:12,481 - top2vec - INFO - Creating joint document/word embedding
2022-06-30 13:03:14,862 - top2vec - INFO - Creating lower dimension embedding of documents
2022-06-30 13:03:22,278 - top2vec - INFO - Finding dense areas of documents
2022-06-30 13:03:24,728 - top2vec - INFO - Finding topics
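
The log timestamps above show how long each run spent on the joint document/word embedding stage. The sketch below computes those durations from the logged start of that stage and the start of the next one. The helper name elapsed_seconds is an assumption for illustration, and a single run per setting is not a benchmark.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S,%f"   # format of the top2vec log timestamps


def elapsed_seconds(start: str, end: str) -> float:
    """Seconds between two log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# (start of "Creating joint document/word embedding", start of the following stage),
# copied from the logs above.
embedding_stage = {
    "fast-learn": ("2022-06-30 12:34:13,638", "2022-06-30 12:34:44,662"),
    "learn": ("2022-06-30 12:41:08,692", "2022-06-30 12:41:40,728"),
    "deep-learn": ("2022-06-30 12:47:54,797", "2022-06-30 12:53:13,652"),
    "deep-learn, workers=14": ("2022-06-30 12:54:12,481", "2022-06-30 13:03:14,862"),
}

for label, (start, end) in embedding_stage.items():
    print(f"{label:<24}{elapsed_seconds(start, end):8.1f} s")
```

In these logs the workers=14 run spent longer in the embedding stage than the default deep-learn run, so the extra workers did not speed up this particular run.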


In [13]:
topic_sizes, topic_nums = model.get_topic_sizes()
print(topic_sizes)
print(topic_nums)

[8667 1520  893  778  745  636  625  579  577  492  416  390  355  354
  339  322  318  271  260  251  198  180  179  175  168  167  135  118
  113  110  108  100   99   99   98   95   85   80   74   73   69   69
   61   60   53   45   39   35   30   22   22]
[ 0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
 48 49 50]


In [14]:
topic_sizes, topic_nums = model_learn.get_topic_sizes()
print(topic_sizes)
print(topic_nums)

[774 336 262 231 220 220 211 209 207 206 193 187 186 185 184 181 177 176
 168 165 156 155 149 144 142 138 137 136 133 133 131 130 126 126 121 121
 120 117 117 116 114 114 114 111 109 107 106 104 103 103 102 100  99  98
  98  98  98  98  96  95  95  95  95  92  92  90  90  90  90  90  90  88
  88  86  86  86  85  84  83  83  82  82  82  82  81  81  81  81  81  80
  79  78  78  78  77  76  75  74  74  74  73  73  72  71  70  70  70  69
  69  69  68  68  68  67  67  67  67  67  67  66  66  65  65  64  64  63
  63  63  63  63  62  62  62  62  62  61  61  61  61  61  61  61  60  60
  60  59  59  58  58  57  57  57  57  57  56  56  56  56  55  55  55  55
  55  55  55  55  55  54  54  54  53  53  53  53  53  52  52  52  52  52
  51  51  51  50  50  50  50  49  49  49  49  49  49  49  48  48  48  48
  48  48  47  47  47  47  47  45  45  45  45  45  45  44  44  44  44  44
  44  44  44  43  42  42  42  42  42  42  42  41  41  41  41  40  40  40
  40  40  40  39  39  39  39  39  38  38  38  38  3

In [15]:
topic_sizes, topic_nums = model_dlearn.get_topic_sizes()
print(topic_sizes)
print(topic_nums)

[554 334 329 273 247 228 224 192 192 192 190 190 185 167 156 155 153 141
 140 139 135 129 129 129 128 127 127 127 127 127 127 126 124 123 120 119
 117 117 117 115 115 115 115 115 113 112 112 111 111 109 108 106 106 104
 103 103 102 101 100  99  98  98  97  97  97  96  96  96  95  94  94  94
  93  92  92  92  92  91  91  90  90  89  89  88  87  87  87  87  87  87
  87  86  86  85  85  85  85  85  84  84  84  83  82  81  81  81  80  80
  80  78  78  77  77  77  76  76  76  76  76  75  74  74  74  74  73  73
  73  73  73  72  72  71  71  71  70  69  69  69  69  68  68  67  66  66
  66  66  65  65  65  65  64  64  63  63  63  63  63  62  62  62  62  61
  61  60  60  60  60  60  60  59  59  59  58  58  57  57  56  56  56  55
  55  55  55  55  54  54  54  53  53  52  52  52  52  52  52  51  51  51
  51  51  49  49  49  49  48  47  47  47  46  46  46  45  45  45  45  45
  44  44  44  43  42  42  42  42  41  40  40  40  40  39  39  39  38  38
  38  37  37  37  36  35  35  35  34  34  33  33  3
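
The three size arrays are hard to compare by eye. The fast-learn model puts 8667 documents into topic 0, while the learn and deep-learn models split the corpus into several hundred much smaller topics. A small sketch for summarising any of the arrays follows. The function summarize_topic_sizes is an assumption added for illustration; it works on the array returned by get_topic_sizes() or on a plain list.

```python
def summarize_topic_sizes(sizes):
    """Return the topic count, document count, and share held by the largest topic."""
    sizes = [int(s) for s in sizes]        # accepts a list or a NumPy array
    total = sum(sizes)
    return {
        "num_topics": len(sizes),
        "num_docs": total,
        "largest_share": max(sizes) / total,
    }


# Sizes printed for the fast-learn model above.
fast_learn_sizes = [
    8667, 1520, 893, 778, 745, 636, 625, 579, 577, 492, 416, 390, 355, 354,
    339, 322, 318, 271, 260, 251, 198, 180, 179, 175, 168, 167, 135, 118,
    113, 110, 108, 100, 99, 99, 98, 95, 85, 80, 74, 73, 69, 69,
    61, 60, 53, 45, 39, 35, 30, 22, 22,
]
print(summarize_topic_sizes(fast_learn_sizes))
```

The same call can be made directly on the other models, for example summarize_topic_sizes(model_learn.get_topic_sizes()[0]).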

In [17]:
documents, document_scores, document_ids = model.search_documents_by_topic(topic_num=50, num_docs=10)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 7847, Score: 0.994350790977478
-----------
Was stabbed in the face by IFP supporters in Swanieville, near Krugersdorp, Transvaal, on 12 May 1991. IFP-supporting hostel-dwellers were retaliating against a previous attack by ANC-supporting squatters and approximately one hundred and fifteen shacks were set alight, twenty seven people were killed and twenty five vehicles were burnt. Twelve people were charged with crimes ranging from murder to arson but were acquitted due to lack of evidence.
-----------

Document: 9509, Score: 0.9940825700759888
-----------
She had her home burnt down by IFP supporters in Swanieville, Krugersdorp, Transvaal, on 12 May 1991. IFP-supporting hostel-dwellers were retaliating against the explusion of IFP supporters from the area. About one hundred and fifteen shacks were set alight, twenty seven people were killed and twenty five vehicles were burnt. Twelve people were charged with crimes ranging from murder to arson but were acquitted due to lack o
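
The same print loop is repeated for each model in the cells that follow. A minimal sketch of a reusable helper is given here. The names show_topic_docs and last_topic are assumptions for illustration; search_documents_by_topic is called exactly as in the notebook, and get_num_topics is the Top2Vec method that returns the number of topics found.

```python
def show_topic_docs(model, topic_num, num_docs=10):
    """Print the documents closest to one topic and return their ids."""
    docs, scores, ids = model.search_documents_by_topic(
        topic_num=topic_num, num_docs=num_docs
    )
    for doc, score, doc_id in zip(docs, scores, ids):
        print(f"Document: {doc_id}, Score: {score}")
        print("-----------")
        print(doc)
        print("-----------")
        print()
    return list(ids)


def last_topic(model):
    """Index of the smallest topic; topics are numbered from 0 in order of size."""
    return model.get_num_topics() - 1


# Usage with the models trained above (not run here):
# show_topic_docs(model, last_topic(model))
# show_topic_docs(model_learn, topic_num=286)
```

Looking up the last topic with last_topic avoids hard-coding an index such as 50, which only fits one particular training run.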

In [18]:
documents, document_scores, document_ids = model_learn.search_documents_by_topic(topic_num=286, num_docs=10)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 6059, Score: 0.9398622512817383
-----------
An ANC supporter who was shot and injured by Inkatha supporters, one of whom is named, at Maliwanqa, near Port Shepstone, Natal, on 19 April 1990.
-----------

Document: 6048, Score: 0.9345434904098511
-----------
An ANC supporter who was shot and injured by Inkatha supporters, one of whom is named, at Maliwanqa, near Port Shepstone, Natal, on 19 April 1990.
-----------

Document: 16291, Score: 0.886009156703949
-----------
Was shot dead by ANC supporters, one of whom is named, during political conflict at Umlazi, Durban, on 16 December 1993. Ms Ntombela was asleep in her house when she was shot.
-----------

Document: 19918, Score: 0.8855718374252319
-----------
An ANC supporter who had her home at Inanda, near KwaMashu, Durban, destroyed in an arson attack by Inkatha supporters, one of whom is named, on 18 December 1988. 
-----------

Document: 16504, Score: 0.8843755722045898
-----------
An ANC supporter who was assaulted with a 

In [21]:
documents, document_scores, document_ids = model_dlearn.search_documents_by_topic(topic_num=271, num_docs=10)
for doc, score, doc_id in zip(documents, document_scores, document_ids):
    print(f"Document: {doc_id}, Score: {score}")
    print("-----------")
    print(doc)
    print("-----------")
    print()

Document: 3034, Score: 0.9714802503585815
-----------
His house was looted by Inkatha supporters during intense political violence in Woodyglen, Mpumalanga, KwaZulu, near Durban, on 11 February 1990, the same day Nelson Mandela was released from prison. Ten people were killed in the fighting which lasted for a week. A former IFP member was granted amnesty. Mpumalanga attacks.
-----------

Document: 6803, Score: 0.9631528854370117
-----------
She had her house looted and set alight by Inkatha supporters during intense political violence in Woodyglen, Mpumalanga, KwaZulu, near Durban, on 11 February 1990, the same day Nelson Mandela was released from prison. Ten people were killed in the fighting which lasted for a week. A former IFP member was granted amnesty. Mpumalanga attacks. 
-----------

Document: 6795, Score: 0.9625179767608643
-----------
He had his house looted and set alight by Inkatha supporters during intense political violence in Woodyglen, Mpumalanga, KwaZulu, near Durban,