# Semantic Analysis

This notebook documents an attempt to identify the semantic categories present in the training and test data of each language pair. I treat every dataset as its own corpus. My first attempt involves topic modelling, the process of extracting the major themes from a given corpus of text data.

**Background**
An early technique for topic modelling was described in 1998 by Papadimitriou, Raghavan, Tamaki and Vempala.
Then came PLSA (Probabilistic Latent Semantic Analysis), created by Thomas Hofmann in 1999.
The most commonly used technique, LDA (Latent Dirichlet Allocation), was developed in 2002 by David Blei, Andrew Ng and Michael Jordan. Pachinko Allocation extends LDA by modelling correlations between topics in addition to the word correlations that constitute topics.
An alternative to LDA is HLTA (Hierarchical Latent Tree Analysis), which models word co-occurrence using a tree of latent variables; the states of the latent variables correspond to soft clusters of documents and are interpreted as topics.

### Importing the required libraries

In [54]:
import pandas as pd
import gensim #the library for Topic modelling
from gensim.models.ldamulticore import LdaMulticore
from gensim import corpora, models
import pyLDAvis.gensim #LDA visualization library
from gensim.models import HdpModel
from nltk.corpus import stopwords
import string
from nltk.stem.wordnet import WordNetLemmatizer
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
import warnings
warnings.simplefilter('ignore')
from itertools import chain

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\tonyr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\tonyr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


In [37]:
# Clean the data and lemmatize it
stop = set(stopwords.words('english'))
exclude = set(string.punctuation)
lemma = WordNetLemmatizer()

def clean(text):
    # Lowercase and drop stop words. Tokens that still carry punctuation
    # (e.g. "however,") do not match the stop list and are kept.
    stop_free = ' '.join([word for word in text.lower().split() if word not in stop])
    # Strip punctuation characters
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    # Lemmatize each remaining token
    normalized = ' '.join([lemma.lemmatize(word) for word in punc_free.split()])
    return normalized.split()
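
Because `clean` removes stop words before it strips punctuation, a stop word followed by a comma or full stop survives, which may explain why words such as "however" and "him" show up in the topics further down. The sketch below compares the two orderings. It is only an illustration: the tiny `STOP` set and the example sentence are made up, and lemmatization is left out so that the block runs without NLTK data.

```python
import string

# Hypothetical miniature stop list; the notebook uses NLTK's English list.
STOP = {"however", "him", "the", "a", "in"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_stop_first(text, stop=STOP):
    """Same order as clean() above: drop stop words, then strip punctuation."""
    kept = " ".join(w for w in text.lower().split() if w not in stop)
    return kept.translate(PUNCT_TABLE).split()

def clean_punct_first(text, stop=STOP):
    """Alternative order: strip punctuation first, then drop stop words."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return [tok for tok in tokens if tok not in stop]

sentence = "However, nobody told him."  # made-up example sentence
print(clean_stop_first(sentence))   # "however," and "him." slip past the stop list
print(clean_punct_first(sentence))  # both stop words are removed
```

Switching the order in `clean` would change the dictionaries and therefore every topic listed below, so the outputs in this notebook correspond to the stop-words-first version.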


# EN-DE
## TRAINING DATASET

In [38]:
#importing the en-de training dataset
en_de_train_df =  pd.read_csv("../data/en-de/train.ende.df.short.tsv",sep="\t").iloc[:,0:2]

In [41]:
en_de_train_df["original_cleaned"] = en_de_train_df['original'].apply(clean)

In [40]:
en_de_train_df

| | index | original | original_cleaned |
|---|---|---|---|
| 0 | 0 | José Ortega y Gasset visited Husserl at Freibu... | [josé, ortega, gasset, visited, husserl, freib... |
| 1 | 1 | However, a disappointing ninth in China meant ... | [however, disappointing, ninth, china, meant, ... |
| 2 | 2 | In his diary, Chase wrote that the release of ... | [diary, chase, wrote, release, mason, slidell,... |
| 3 | 3 | Heavy arquebuses mounted on wagons were called... | [heavy, arquebus, mounted, wagon, called, arqu... |
| 4 | 4 | Once North Pacific salmon die off after spawni... | [north, pacific, salmon, die, spawning, usuall... |
| ... | ... | ... | ... |
| 6995 | 6995 | Some may also discourage or disallow unsanitar... | [may, also, discourage, disallow, unsanitary, ... |
| 6996 | 6996 | In the late 1860s, the crinolines disappeared ... | [late, 1860s, crinoline, disappeared, bustle, ... |
| 6997 | 6997 | Disco was criticized as mindless, consumerist,... | [disco, criticized, mindless, consumerist, ove... |
| 6998 | 6998 | Planters would then fill large hogsheads with ... | [planter, would, fill, large, hogshead, tobacc... |


In [48]:
#create dictionary
dictionary = corpora.Dictionary(en_de_train_df["original_cleaned"])
#Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
print(dictionary.num_nnz)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in en_de_train_df["original_cleaned"]]
lda = gensim.models.ldamodel.LdaModel

62567
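
To make the bag-of-words step and the `num_nnz` figure concrete, here is a minimal pure-Python sketch of the idea on a made-up three-document corpus. It is not gensim's implementation, and the integer ids it assigns are an assumption that may differ from those produced by `corpora.Dictionary`.

```python
from collections import Counter

# Hypothetical toy corpus of already-tokenised documents.
docs = [
    ["salmon", "die", "spawning", "salmon"],
    ["disco", "criticized", "mindless"],
    ["salmon", "disco"],
]

# Give each distinct token an integer id in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

def to_bow(doc):
    """Return sorted (token_id, count) pairs for one document."""
    counts = Counter(token2id[token] for token in doc)
    return sorted(counts.items())

bow_corpus = [to_bow(doc) for doc in docs]

# Non-zero entries of the document-term matrix:
# the number of distinct tokens per document, summed over all documents.
num_nnz = sum(len(bow) for bow in bow_corpus)

print(bow_corpus)
print(num_nnz)
```

Each document becomes a sparse row of `(token_id, count)` pairs, and the count of those pairs across the corpus is what the `num_nnz` comment above describes.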


## LDA
With this algorithm, the number of topics has to be specified in advance.
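
The topic count is a required input because LDA models every document as a mixture over a fixed set of topics. The sketch below walks through that generative story with the standard library only. It illustrates the model's assumptions and is not how gensim fits the model; the vocabulary, hyperparameters and document length are made-up values.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma samples."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topic_word, alpha, length, vocab, rng):
    """Generate one document following LDA's generative story."""
    theta = sample_dirichlet(alpha, rng)  # per-document topic mixture
    words = []
    for _ in range(length):
        # Pick a topic for this word position, then a word from that topic.
        k = rng.choices(range(len(topic_word)), weights=theta)[0]
        words.append(rng.choices(vocab, weights=topic_word[k])[0])
    return theta, words

rng = random.Random(0)
vocab = ["wicket", "inning", "tornado", "hurricane"]  # toy vocabulary
num_topics = 2  # fixed up front, just like num_topics in the cell below

# One word distribution per topic.
topic_word = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(num_topics)]
theta, words = generate_document(topic_word, [0.5] * num_topics, 8, vocab, rng)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic-word distributions and the per-document mixtures for the chosen number of topics.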

In [59]:
num_topics=10
%time ldamodel = lda(doc_term_matrix,num_topics=num_topics,id2word=dictionary,passes=50,minimum_probability=0)
ldamodel.print_topics(num_topics=num_topics)

Wall time: 1min 6s


[(0,
  '0.006*"second" + 0.005*"two" + 0.005*"american" + 0.005*"inning" + 0.004*"four" + 0.004*"one" + 0.004*"wicket" + 0.004*"five" + 0.003*"first" + 0.003*"eviction"'),
 (1,
  '0.007*"state" + 0.005*"many" + 0.005*"united" + 0.004*"however" + 0.004*"john" + 0.004*"also" + 0.004*"east" + 0.003*"female" + 0.003*"without" + 0.003*"different"'),
 (2,
  '0.004*"peter" + 0.004*"american" + 0.003*"much" + 0.003*"could" + 0.003*"later" + 0.003*"quarterfinal" + 0.003*"hurricane" + 0.003*"even" + 0.003*"however" + 0.003*"like"'),
 (3,
  '0.007*"also" + 0.003*"attack" + 0.003*"includes" + 0.003*"area" + 0.003*"bridge" + 0.003*"former" + 0.002*"became" + 0.002*"resigned" + 0.002*"like" + 0.002*"cruiser"'),
 (4,
  '0.006*"also" + 0.003*"jack" + 0.003*"air" + 0.003*"small" + 0.003*"new" + 0.002*"party" + 0.002*"water" + 0.002*"mi" + 0.002*"force" + 0.002*"well"'),
 (5,
  '0.005*"may" + 0.004*"m" + 0.004*"also" + 0.004*"tornado" + 0.003*"encyclopedia" + 0.003*"him" + 0.003*"general" + 0.003*"new" 

In [62]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)
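
The topic strings returned by `print_topics` are awkward to compare by eye. As a small convenience, the sketch below parses one such string into `(word, weight)` pairs using a regular expression. The helper name `parse_topic` is hypothetical, and the pattern assumes the `weight*"word"` layout seen in the output above, with the quotes treated as optional.

```python
import re

# One term looks like 0.006*"second"; the quotes are treated as optional.
TERM = re.compile(r'(\d*\.?\d+)\*"?([^"+\s]+)"?')

def parse_topic(topic_string):
    """Turn a printed topic string into a list of (word, weight) pairs."""
    return [(word, float(weight)) for weight, word in TERM.findall(topic_string)]

example = '0.006*"second" + 0.005*"two" + 0.005*"american"'
for word, weight in parse_topic(example):
    print(f"{word:10s} {weight:.3f}")
```

With the terms in list form, it becomes straightforward to spot generic words such as "also" or "however" that recur across several topics.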

## Hierarchical Dirichlet Process
This algorithm attempts to infer the number of topics from the data instead of taking it as a parameter.
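
HDP builds on the Dirichlet process, in which the number of clusters is not fixed but grows with the data. One common way to picture this is the Chinese restaurant process, sketched below with the standard library. It is an intuition aid only, not gensim's `HdpModel` inference, and the customer count and concentration values are arbitrary.

```python
import random

def chinese_restaurant_process(num_customers, concentration, rng):
    """Seat customers one at a time and return the size of each table."""
    table_sizes = []
    for _ in range(num_customers):
        # An existing table is chosen in proportion to its size,
        # a new table in proportion to the concentration parameter.
        weights = table_sizes + [concentration]
        choice = rng.choices(range(len(weights)), weights=weights)[0]
        if choice == len(table_sizes):
            table_sizes.append(1)  # open a new table (a new cluster)
        else:
            table_sizes[choice] += 1
    return table_sizes

rng = random.Random(0)
for concentration in (0.5, 5.0):
    sizes = chinese_restaurant_process(1000, concentration, rng)
    print(concentration, len(sizes))
```

The number of occupied tables is an outcome of the process rather than an input, which mirrors how HDP lets the data determine how many topics are in use.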

In [77]:
hdpmodel = HdpModel(corpus=doc_term_matrix, id2word=dictionary)
hdpmodel.show_topics()

[(0,
  '0.001*loudly + 0.001*reduced + 0.001*saving + 0.001*falsifying + 0.001*giant + 0.001*entitled + 0.001*stump + 0.001*practice + 0.001*initially + 0.001*alexandria + 0.001*coma + 0.001*robin + 0.001*treebeard + 0.001*jenny + 0.001*character + 0.001*enzyme + 0.001*cape + 0.001*southern + 0.001*kennedy + 0.001*perspective'),
 (1,
  '0.001*garrisoned + 0.001*nationally + 0.001*kepi + 0.001*optioned + 0.001*shear + 0.001*nitrogen + 0.001*glabella + 0.001*verbal + 0.001*jeff + 0.001*parkway + 0.001*furka + 0.001*appoints + 0.001*appointed + 0.001*enzyme + 0.001*forage + 0.001*appearing + 0.001*77 + 0.001*hamilton + 0.001*sadler + 0.001*punishment'),
 (2,
  '0.001*2001 + 0.001*timur + 0.001*publication + 0.001*benjamin + 0.001*fellow + 0.001*prince + 0.001*5th + 0.001*nobleman + 0.001*vine + 0.001*rubble + 0.001*61 + 0.001*hypha + 0.001*confederate + 0.001*carriage + 0.001*blithe + 0.001*crab + 0.001*besieger + 0.001*sell + 0.001*english + 0.001*directly'),
 (3,
  '0.002*ferrero + 0.00

In [78]:
lda_display = pyLDAvis.gensim.prepare(hdpmodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

# EN-DE
## TEST DATASET

In [68]:
#importing the en-de test dataset
en_de_test_df =  pd.read_csv("../data/en-de-test/test20.ende.df.short.tsv",sep="\t").iloc[:,0:2]

In [70]:
en_de_test_df["original_cleaned"] = en_de_test_df['original'].apply(clean)

en_de_test_df

| | index | original | original_cleaned |
|---|---|---|---|
| 0 | 0 | The Sultan appoints judges, and can grant pard... | [sultan, appoints, judge, grant, pardon, commu... |
| 1 | 1 | Antisemitism in modern Ukraine Antisemitism an... | [antisemitism, modern, ukraine, antisemitism, ... |
| 2 | 2 | Morales continued his feud with Buddy Rose, de... | [morale, continued, feud, buddy, rose, defeati... |
| 3 | 3 | American Maury Tripp attended the Jamboree fro... | [american, maury, tripp, attended, jamboree, s... |
| 4 | 4 | He bowled a series of bouncers at Viv Richards... | [bowled, series, bouncer, viv, richards, brisb... |
| ... | ... | ... | ... |
| 995 | 995 | Cleo chooses not to tell Joel straight away an... | [cleo, chooses, tell, joel, straight, away, co... |
| 996 | 996 | The circular forbid the passage of indigents a... | [circular, forbid, passage, indigents, needy, ... |
| 997 | 997 | The Dodgers, as the top seeded team in the Nat... | [dodger, top, seeded, team, national, league, ... |
| 998 | 998 | List of schools in Victoria Australian Army Ca... | [list, school, victoria, australian, army, cad... |


In [72]:
#create dictionary
dictionary = corpora.Dictionary(en_de_test_df["original_cleaned"])
#Total number of non-zeroes in the BOW matrix (sum of the number of unique words per document over the entire corpus).
print(dictionary.num_nnz)
doc_term_matrix = [dictionary.doc2bow(doc) for doc in en_de_test_df["original_cleaned"]]
lda = gensim.models.ldamodel.LdaModel

8932


## LDA
With this algorithm, the number of topics has to be specified in advance.

In [73]:
num_topics=10
%time ldamodel = lda(doc_term_matrix,num_topics=num_topics,id2word=dictionary,passes=50,minimum_probability=0)
ldamodel.print_topics(num_topics=num_topics)

Wall time: 15.7 s


[(0,
  '0.003*"one" + 0.002*"would" + 0.002*"many" + 0.002*"1" + 0.002*"specie" + 0.002*"it" + 0.002*"partial" + 0.002*"district" + 0.002*"several" + 0.002*"ten"'),
 (1,
  '0.004*"new" + 0.004*"division" + 0.003*"american" + 0.003*"england" + 0.003*"north" + 0.002*"later" + 0.002*"brigade" + 0.002*"along" + 0.002*"across" + 0.002*"two"'),
 (2,
  '0.004*"two" + 0.003*"first" + 0.003*"back" + 0.003*"4" + 0.002*"one" + 0.002*"10" + 0.002*"june" + 0.002*"gun" + 0.002*"served" + 0.002*"ladder"'),
 (3,
  '0.003*"january" + 0.003*"american" + 0.003*"attack" + 0.003*"two" + 0.003*"may" + 0.003*"also" + 0.002*"10" + 0.002*"world" + 0.002*"july" + 0.002*"open"'),
 (4,
  '0.004*"american" + 0.004*"war" + 0.004*"also" + 0.003*"new" + 0.003*"wicket" + 0.002*"four" + 0.002*"many" + 0.002*"him" + 0.002*"u" + 0.002*"november"'),
 (5,
  '0.004*"however" + 0.004*"also" + 0.003*"east" + 0.002*"near" + 0.002*"northern" + 0.002*"victory" + 0.002*"may" + 0.002*"toad" + 0.002*"former" + 0.002*"buzzard"'),
 (

In [74]:
lda_display = pyLDAvis.gensim.prepare(ldamodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)

## Hierarchical Dirichlet Process

This algorithm attempts to infer the number of topics from the data instead of taking it as a parameter.

In [80]:
hdpmodel = HdpModel(corpus=doc_term_matrix, id2word=dictionary)
hdpmodel.show_topics()

[(0,
  '0.002*mühlentor + 0.001*childhood + 0.001*61 + 0.001*echoed + 0.001*trierarch + 0.001*hearns + 0.001*prof + 0.001*debut + 0.001*controlling + 0.001*confidential + 0.001*breakout + 0.001*fish + 0.001*rok + 0.001*oscillator + 0.001*koala + 0.001*francoeur + 0.001*rotting + 0.001*unguarded + 0.001*morale + 0.001*vector'),
 (1,
  '0.001*gamete + 0.001*apologized + 0.001*firefights + 0.001*full + 0.001*enveloped + 0.001*ranking + 0.001*km2 + 0.001*troop + 0.001*disembarked + 0.001*secondly + 0.001*texas + 0.001*sydney + 0.001*powhatan + 0.001*beater + 0.001*fu + 0.001*characteristic + 0.001*giddiness + 0.001*prisoner + 0.001*shiva + 0.001*pilaster'),
 (2,
  '0.001*jtwc + 0.001*grateful + 0.001*integral + 0.001*witnessing + 0.001*élite + 0.001*doll + 0.001*kootenai + 0.001*older + 0.001*serpentine + 0.001*locating + 0.001*rare + 0.001*defends + 0.001*example + 0.001*sentence + 0.001*rape + 0.001*outnumbered + 0.001*kali + 0.001*demyan + 0.001*radio + 0.001*tenpin'),
 (3,
  '0.002*ans

In [81]:
lda_display = pyLDAvis.gensim.prepare(hdpmodel, doc_term_matrix, dictionary, sort_topics=False, mds='mmds')
pyLDAvis.display(lda_display)