<div style="background-color: lightgreen; border-radius: 5px; padding: 10px;">
    <h4>Word Embedding Categorisation</h4>
    <p>...</p>
</div>

In [2]:
from preprocessing_functions import *
from word_embedding_functions import *

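import os  # used below for os.listdir / os.path.join

import pandas as pd
import numpy as np
from pprint import pprint
from tqdm import tqdm  # progress bars for the classification and aggregation loops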

import warnings
warnings.filterwarnings('ignore')

- Clean and preprocess (lemmatise) a list of documents (e.g. paragraphs)
- Find your topics through unsupervised clustering (LDA topic modeling)
- Use the topic words as input for the word embedding algorithm

### 1. Set Topic Words
<br>
<div style="text-align: justify;">Finding the topics and their words through unsupervised clustering (LDA topic modeling) generally gives better classification results than hand-picking the words. Inspect the pyLDAvis visualisation of the LDA topic model and adjust the model parameters if you are not happy with the cluster distribution or if clusters overlap. </div>
<br>
<div style="text-align: justify;">
A good topic model will have relatively big, similarly sized and non-overlapping bubbles scattered throughout the chart. Greater distances between the clusters represent larger semantic differences, similarly sized bubbles show that the topics are equally represented, and large bubbles mean that the topics are well represented in the documents. By paying attention to these three characteristics we can decide whether the clusters represent good, meaningful topics and get an accurate picture of the dominant topics in our documents.
</div>

#### 1.1 Option 1 (less optimal!): Select individual words for each category/topic

In [50]:
# Removing words with double meaning helps (e.g. bear, degree)

choice = input('Are you sure you want to set your own topic words instead of using the recommended lda topic words? (y/n)')

if choice == 'y':
    print('Running cell...')
    
    #fashion_words = ['fashion', 'model', 'vogue', 'store', 'designer', 'couture', 'catwalk', 'runway', 'modeling', 'clothing', 'cosmetic', 'brand', 'retail', 'advertising', 'perfume']
    diplomacy_words = ['War', 'Embassy', 'Army', 'Diplomatic', 'Ambassador', 'Treaty', 'Protest', 'Force', 'Mission', 'Arrest', 'Government', 'Police', 'Attack', 'Party', 'Minister']
    entertainment_words = ['Opera', 'Festival', 'Perform', 'Orchestra', 'Symphony', 'Concert', 'Music', 'Film', 'Sing', 'Theatre', 'Performance', 'Role', 'Premiere', 'Tour', 'Band']
    art_words = ['Exhibition', 'Art', 'Museum', 'Gallery', 'Exhibit', 'Painting', 'Collection', 'Paint', 'Portrait', 'Artist', 'Sculpture', 'Fashion', 'Design', 'Contemporary', 'Painter']
    education_words = ['Study', 'School', 'Professor', 'University', 'Graduate', 'Educate', 'Lecture', 'Research', 'College', 'Teach', 'Science', 'Education', 'Philosophy', 'Doctorate', 'Faculty'] # replaced bear with educate and degree with Lecture
    transportation_words = ['Railway', 'Route', 'Line', 'Operate', 'Flight', 'Station', 'Service', 'Airline', 'Airport', 'Train', 'Passenger', 'Speed', 'Aircraft', 'Rail', 'Network']
    sport_words = ['Final', 'Win', 'Team', 'Match', 'Game', 'Goal', 'Club', 'League', 'Champion', 'Championship', 'Season', 'Score', 'Round', 'Tournament', 'Football']

    lda_topic_words = { #'fashion': fashion_words,
                        'diplomacy': diplomacy_words,
                       'entertainment': entertainment_words,
                       'art': art_words,
                       'education': education_words,
                       'transportation': transportation_words,
                       'sport': sport_words}
else:
    print('Canceling cell execution...')

Are you sure you want to set your own topic words instead of using the recommended lda topic words? (y/n)n
Canceling cell execution...


#### 1.2 Option 2 (recommended!): Use Topic Model's topic words (found through LDA)

In [29]:
lda_topic_words_df_path = '../../../../data_clean/lda_models/lda_model_2million/relevant_words_per_topic.csv'
lda_topic_words_df = pd.read_csv(lda_topic_words_df_path)

In [43]:
display(lda_topic_words_df[:20])
len(lda_topic_words_df)

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6
0,perform,route,store,season,study,attack
1,opera,station,building,win,exhibition,protest
2,concert,service,office,club,professor,army
3,film,railway,design,goal,work,force
4,tour,line,population,match,painting,arrest
5,sing,train,firm,team,bear,troop
6,music,passenger,century,score,exhibit,police
7,premiere,airline,shopping,game,school,command
8,festival,airport,large,final,graduate,war
9,band,operate,urban,league,art,embassy


50

In [207]:
nr_of_words = 10
lda_topic_words = lda_topic_words_df[:nr_of_words].to_dict('list')

for key in lda_topic_words:
    print(f"{key}:", lda_topic_words[key], '\n')

Topic 1: ['perform', 'opera', 'concert', 'film', 'tour', 'sing', 'music', 'premiere', 'festival', 'band'] 

Topic 2: ['route', 'station', 'service', 'railway', 'line', 'train', 'passenger', 'airline', 'airport', 'operate'] 

Topic 3: ['store', 'building', 'office', 'design', 'population', 'firm', 'century', 'shopping', 'large', 'urban'] 

Topic 4: ['season', 'win', 'club', 'goal', 'match', 'team', 'score', 'game', 'final', 'league'] 

Topic 5: ['study', 'exhibition', 'professor', 'work', 'painting', 'bear', 'exhibit', 'school', 'graduate', 'art'] 

Topic 6: ['attack', 'protest', 'army', 'force', 'arrest', 'troop', 'police', 'command', 'war', 'embassy'] 
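
As a small illustration of the conversion used above: `DataFrame.to_dict('list')` maps each column name to the list of its values, so slicing the first rows beforehand keeps only the top-ranked words per topic. The toy frame below is made up for illustration and is not the real `relevant_words_per_topic.csv`.

```python
import pandas as pd

# Toy stand-in for the topic words table (values are illustrative only)
toy_df = pd.DataFrame({
    "Topic 1": ["perform", "opera", "concert"],
    "Topic 2": ["route", "station", "service"],
})

# Keep the top 2 rows, then turn each column into a list of words
toy_topic_words = toy_df[:2].to_dict("list")
print(toy_topic_words)
```

Because the rows are assumed to be ordered by relevance, changing the slice length is all it takes to use more or fewer words per topic.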



### 2. Loading GloVe Word Embedding
<br>
<div style="text-align: justify;"> The word embedding model can be replaced by another one fairly easily, but for now we use the GloVe vectors trained on Common Crawl (840B tokens) (source: https://nlp.stanford.edu/projects/glove/).</div>

In [64]:
%%time

embeddings_dict, discarded_dict = load_glove_word_embeddings(GLOVE_PATH="../../../../glove.840B.300d.txt")

This will take approximately ~ 4 minutes...


  0%|          | 0/2196017 [00:00<?, ?it/s]

CPU times: total: 3min 38s
Wall time: 3min 48s
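
The implementation of `load_glove_word_embeddings` is not shown in this notebook. The sketch below is only an assumption of what such a loader might do: it parses lines in the GloVe text format (a token followed by its vector values, separated by spaces) and collects lines that cannot be parsed in a second dictionary. The function name `parse_glove_lines` and the three-dimensional toy vectors are hypothetical.

```python
import io
import numpy as np

def parse_glove_lines(lines, dim):
    """Parse GloVe-style text lines into {token: vector}; collect unparsable lines."""
    embeddings, discarded = {}, {}
    for line_no, line in enumerate(lines):
        parts = line.rstrip("\n").split(" ")
        # The last `dim` fields are the vector, everything before them is the token
        token, values = " ".join(parts[:-dim]), parts[-dim:]
        try:
            vector = np.asarray(values, dtype="float32")
        except ValueError:
            discarded[line_no] = line  # a field that is not a number
            continue
        if not token or vector.shape[0] != dim:
            discarded[line_no] = line  # too few fields on this line
            continue
        embeddings[token] = vector
    return embeddings, discarded

# Toy input: two valid lines and one with a non-numeric value
toy_file = io.StringIO("war 0.1 0.2 0.3\nopera 0.4 0.5 0.6\nbroken 0.7 x 0.9\n")
toy_embeddings, toy_discarded = parse_glove_lines(toy_file, dim=3)
print(sorted(toy_embeddings), list(toy_discarded))
```

Taking the vector from the end of the line rather than the start keeps the parser working even if a token itself contains a space.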


### 3. Get Mean of the words from each Topic
<br>
<div style="text-align: justify;">By taking the mean of the vectors of a topic's most relevant words (the top 10 LDA words in this run, or the 15 hand-picked words per topic in Option 1) we get a fairly accurate vector representation of that topic. We do this for each topic and store the resulting vectors in a dictionary for later retrieval.</div>

In [208]:
mean_vectors_dict = {}

for topic in lda_topic_words:
    # Lower-case first so that the vocabulary check and the lookup use the same key
    words = [word.lower() for word in lda_topic_words[topic]]
    words = [word for word in words if word in embeddings_dict]  # keep only words that are in the GloVe vocabulary
    mean_embedding = np.mean([embeddings_dict[word] for word in words], axis=0)
    mean_vectors_dict[topic] = mean_embedding
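
To make the averaging step concrete, here is a minimal sketch on two-dimensional toy vectors. The helper `topic_mean_vector` and the toy embeddings are hypothetical; the point is that out-of-vocabulary words are dropped before the element-wise mean is taken.

```python
import numpy as np

# Toy embeddings (real GloVe vectors have 300 dimensions)
toy_embeddings = {
    "opera": np.array([1.0, 0.0]),
    "concert": np.array([0.0, 1.0]),
}

def topic_mean_vector(words, embeddings):
    """Average the vectors of the words that exist in the vocabulary."""
    known = [w.lower() for w in words if w.lower() in embeddings]
    if not known:
        return None  # no usable word: the caller decides how to handle this
    return np.mean([embeddings[w] for w in known], axis=0)

toy_mean = topic_mean_vector(["Opera", "concert", "zzz-unknown"], toy_embeddings)
print(toy_mean)
```

Returning `None` for a topic without any known word avoids silently storing a NaN vector, which `np.mean` would otherwise produce for an empty list.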

### 4. Load Documents

#### 4.1 Load and merge chunked .csv files

In [99]:
paragraphs_with_lda_prediction_path = '../../../../data_clean/lda_models/lda_model_2million/paragraphs_lda_topic_distribution/'
file_path =  os.listdir(paragraphs_with_lda_prediction_path)[0]
sample_path = os.path.join(paragraphs_with_lda_prediction_path, file_path)

In [80]:
# os.listdir('../../../../data_clean/lda_models/lda_model_2million/paragraphs_lda_topic_distribution/')[0]
lemmatised_paragraphs_path = "../../../../data_clean/paragraphs_lemmatised/english_words_merged_NOUNVERBADJ/"
file_path = os.listdir(lemmatised_paragraphs_path)[0]
sample_path = os.path.join(lemmatised_paragraphs_path, file_path)

paragraphs_10_934384_1038204_merged_POS.csv


True

In [101]:
print(sample_path, file_path)
os.path.exists(sample_path)

../../../../data_clean/lda_models/lda_model_2million/paragraphs_lda_topic_distribution/paragraphs_10_934385_1038205_lda_topics.csv paragraphs_10_934385_1038205_lda_topics.csv


True

In [102]:
df =  pd.read_csv(sample_path)
df

Unnamed: 0,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,topic_4,topic_5,topic_6,lda_dominant,lda_dominant_score
0,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861.0,"Authon, Alpes-de-Haute-Provence","['honey', 'label', 'indication', 'honey', 'flo...",0.002901,0.072049,0.002811,0.002407,0.917025,0.002807,topic_5,0.917025
1,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.002901,0.001859,0.002811,0.002407,0.987215,0.002807,topic_5,0.987215
2,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.018396,0.901826,0.017823,0.015259,0.028899,0.017797,topic_2,0.901826
3,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.018396,0.011786,0.017823,0.905299,0.028899,0.017797,topic_4,0.905299
4,Dusseldorf,Barcelona,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.005012,0.973250,0.004856,0.004158,0.007874,0.004849,topic_2,0.973250
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103815,Rome,Bari,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005704,0.003654,0.971431,0.004731,0.008961,0.005518,topic_3,0.971431
103816,Rome,Padua,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005704,0.003654,0.971431,0.004731,0.008961,0.005518,topic_3,0.971431
103817,Rome,Brescia,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005704,0.003654,0.833445,0.004731,0.146947,0.005518,topic_3,0.833445
103818,Rome,Verona,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005704,0.003654,0.971431,0.004731,0.008961,0.005518,topic_3,0.971431


#### 4.2 Select number of paragraphs to process

In [216]:
sample_size = 5000
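sample = df[:sample_size].copy()  # copy, so that adding or changing columns does not touch df (avoids SettingWithCopyWarning)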

#### 4.3 Turn stringed lists (in csv) into list objects

In [217]:
%%time

from ast import literal_eval

    
sample['merged_POS'] = sample['merged_POS'].apply(literal_eval)

CPU times: total: 250 ms
Wall time: 395 ms
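
A short illustration of why this step is needed: a list column written to CSV comes back as plain strings, and `ast.literal_eval` turns each string back into a Python list. It only evaluates literals, so it is a safer choice than `eval`. The toy column below is made up.

```python
from ast import literal_eval
import pandas as pd

# After pd.read_csv, a list column holds strings like "['travel', 'crash']"
toy = pd.DataFrame({"merged_POS": ["['travel', 'crash']", "['honey', 'label']"]})

parsed = toy["merged_POS"].apply(literal_eval)
print(type(parsed.iloc[0]), parsed.iloc[0])
```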


In [105]:
sample.columns

Index(['city_1', 'city_2', 'paragraph', 'article_id', 'title', 'merged_POS',
       'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'topic_6',
       'lda_dominant', 'lda_dominant_score'],
      dtype='object')

### 5. Classify Paragraphs (by Word Embedding Algorithm)

#### 5.1 Select right parameters
<br>
<div style="text-align: justify;">bottom_threshold: if the similarity between a word and its closest topic is below bottom_threshold, the word is discarded from the classification process.
<br>
verbose1 and verbose2: enabling these prints the internal steps of the algorithm.</div>

In [136]:
bottom_threshold = 0.20
verbose1 = False
verbose2 = False
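
`categorize_text` comes from `word_embedding_functions` and its implementation is not shown here. The sketch below is an assumption about how a bottom threshold of this kind could work, not the actual function: each word is compared with every topic vector by cosine similarity, and it adds its similarity to its closest topic only if that similarity reaches the threshold. All names and toy vectors are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def categorize_sketch(words, topic_vectors, embeddings, bottom_threshold=0.2):
    """Hypothetical scoring: each word votes for its closest topic with its similarity."""
    scores = {topic: 0.0 for topic in topic_vectors}
    for word in words:
        if word not in embeddings:
            continue  # out-of-vocabulary words are skipped
        sims = {t: cosine(embeddings[word], v) for t, v in topic_vectors.items()}
        best = max(sims, key=sims.get)
        if sims[best] >= bottom_threshold:  # weak matches are discarded
            scores[best] += sims[best]
    prediction = max(scores, key=scores.get) if any(scores.values()) else None
    return scores, prediction

toy_topics = {"transport": np.array([1.0, 0.0]), "conflict": np.array([0.0, 1.0])}
toy_vectors = {
    "travel": np.array([0.9, 0.1]),
    "crash": np.array([0.2, 0.8]),
    "noise": np.array([-1.0, -1.0]),
}
scores, prediction = categorize_sketch(["travel", "crash", "noise"], toy_topics, toy_vectors)
print(scores, prediction)
```

In this toy run "noise" is dissimilar to both topics and contributes nothing. Raising the threshold makes the classifier rely on fewer, more topic-specific words.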

In [218]:
%%time
# ~40 mins for 311k paragraphs

topics = sorted(mean_vectors_dict)
nested_l = [['index'] + topics + ['embedding_dominant']]

# Series.items() replaces Series.iteritems(), which was removed in pandas 2.0
for idx, row in tqdm(sample['merged_POS'].items(), total=len(sample['merged_POS'])):
    output = categorize_text(lemmatized_wordlist=row, mean_vectors_dict=mean_vectors_dict, keywords=topics, embeddings_dict=embeddings_dict, bottom_threshold=bottom_threshold, verbose1=verbose1, verbose2=verbose2)
    
    # One row per paragraph: index, similarity score per topic, predicted topic
    temp_l = [idx] + [result[1] for result in output['category_similarities']] + [output['prediction']]
    
    if len(nested_l[0]) != len(temp_l):
        raise ValueError('Row length does not match the header length!')
        
    nested_l.append(temp_l)


prediction_df = pd.DataFrame(nested_l[1:], columns=nested_l[0]).set_index('index')

  0%|          | 0/5000 [00:00<?, ?it/s]

CPU times: total: 30.2 s
Wall time: 31.2 s


In [224]:
prediction_df

index,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,embedding_dominant
0,1.177238,3.370153,7.421871,1.268505,1.032050,1.251641,Topic 3
1,0.000000,1.535294,1.509655,0.000000,3.934449,0.763062,Topic 5
2,0.000000,1.535294,1.509655,0.000000,3.934449,0.763062,Topic 5
3,0.000000,0.578503,0.000000,0.000000,0.000000,0.417536,Topic 2
4,0.000000,0.578503,0.000000,0.000000,0.000000,0.417536,Topic 2
...,...,...,...,...,...,...,...
4995,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3
4996,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3
4997,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3
4998,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3


In [225]:
prediction_df['embedding_dominant'].value_counts()

Topic 3    1426
Topic 5    1106
Topic 4     968
Topic 1     754
Topic 2     408
Topic 6     338
Name: embedding_dominant, dtype: int64

### 6. Insert classification into dataset

In [226]:
updated_df = sample.join(prediction_df)
# Both labels end in the topic number ('topic_5' vs 'Topic 5'), so compare the last character.
# Note: this only works for single-digit topic numbers.
updated_df['same_categorisation'] = updated_df.apply(lambda x: x.lda_dominant.endswith(x.embedding_dominant[-1]), axis=1)

#### Saving "Classified Paragraphs" Dataframe

In [227]:
updated_df

Unnamed: 0,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,topic_4,...,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,Topic 6,embedding_dominant,same_categorisation
0,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861.0,"Authon, Alpes-de-Haute-Provence","[honey, label, indication, honey, flower, hone...",0.002901,0.072049,0.002811,0.002407,...,topic_5,0.917025,1.177238,3.370153,7.421871,1.268505,1.032050,1.251641,Topic 3,False
1,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"[education, professor, architect, practice, tr...",0.002901,0.001859,0.002811,0.002407,...,topic_5,0.987215,0.000000,1.535294,1.509655,0.000000,3.934449,0.763062,Topic 5,True
2,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"[education, professor, architect, practice, tr...",0.018396,0.901826,0.017823,0.015259,...,topic_2,0.901826,0.000000,1.535294,1.509655,0.000000,3.934449,0.763062,Topic 5,False
3,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"[travel, crash]",0.018396,0.011786,0.017823,0.905299,...,topic_4,0.905299,0.000000,0.578503,0.000000,0.000000,0.000000,0.417536,Topic 2,False
4,Dusseldorf,Barcelona,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"[travel, crash]",0.005012,0.973250,0.004856,0.004158,...,topic_2,0.973250,0.000000,0.578503,0.000000,0.000000,0.000000,0.417536,Topic 2,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,Rome,Amsterdam,He was sent to the Cape of Good Hope with lett...,15638655.0,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000791,0.000507,0.421843,0.000656,...,topic_5,0.575438,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3,False
4996,Rome,Nuremberg,He was sent to the Cape of Good Hope with lett...,15638655.0,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000791,0.000507,0.192165,0.000656,...,topic_5,0.805115,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3,False
4997,Amsterdam,Rome,He was sent to the Cape of Good Hope with lett...,15638655.0,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000791,0.000507,0.268724,0.000656,...,topic_5,0.728556,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3,False
4998,Amsterdam,Nuremberg,He was sent to the Cape of Good Hope with lett...,15638655.0,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000791,0.000507,0.440982,0.000656,...,topic_5,0.460599,0.398827,0.411773,6.762273,0.375313,5.861372,1.856807,Topic 3,False


In [174]:
# updated_df.to_csv('../../../../../data/clean/lda_classified_30cities_435citypairs_311k_paragraphs_both_methods_with_education.csv', index=False)

### 7. Informatics

#### Similarity between lda topic model and word embedding algorithm output

In [228]:
same_categorisation = updated_df['same_categorisation'].sum()  # True counts as 1
total_documents = updated_df['same_categorisation'].count()

print(f"{same_categorisation} out of {total_documents} were classified the same by the LDA topic model and word embedding classification model.")

3092 out of 5000 were classified the same by the LDA topic model and word embedding classification model.
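
The agreement count above is the sum of a boolean column, and the mean of that same column gives the share directly. A minimal sketch on made-up labels, which also compares the full topic number instead of only the last character (an alternative to the comparison used in this notebook):

```python
import pandas as pd

toy = pd.DataFrame({
    "lda_dominant": ["topic_5", "topic_5", "topic_2", "topic_4"],
    "embedding_dominant": ["Topic 3", "Topic 5", "Topic 2", "Topic 2"],
})

# Compare the topic numbers, i.e. the part after the last "_" or " "
lda_id = toy["lda_dominant"].str.split("_").str[-1]
emb_id = toy["embedding_dominant"].str.split(" ").str[-1]
toy["same_categorisation"] = lda_id == emb_id

overall_share = toy["same_categorisation"].mean()
share_per_topic = toy.groupby("lda_dominant")["same_categorisation"].mean()
print(overall_share)
print(share_per_topic)
```

Splitting on the separator keeps the comparison correct if a model ever has ten or more topics.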


In [229]:
# Only confident LDA predictions (score > 0.9) on paragraphs with more than 5 lemmatised words
confident_df = updated_df[(updated_df['lda_dominant_score'] > 0.9) & (updated_df['merged_POS'].str.len() > 5)]
print(confident_df.groupby('lda_dominant')['same_categorisation'].value_counts())
print('----------------------------------------------------')
print(confident_df.groupby('lda_dominant')['same_categorisation'].value_counts(normalize=True))

lda_dominant  same_categorisation
topic_1       True                   376
              False                  168
topic_2       True                   240
              False                  111
topic_3       True                   178
              False                   51
topic_4       True                   531
              False                   92
topic_5       True                   481
              False                  187
topic_6       False                  202
              True                   156
Name: same_categorisation, dtype: int64
----------------------------------------------------
lda_dominant  same_categorisation
topic_1       True                   0.691176
              False                  0.308824
topic_2       True                   0.683761
              False                  0.316239
topic_3       True                   0.777293
              False                  0.222707
topic_4       True                   0.852327
              False      

#### binned LDA scores of all & differently classified paragraphs

In [161]:
updated_df['lda_dominant_score'].value_counts(bins=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]).sort_index(ascending=False)

(0.9, 1.0]       11873
(0.8, 0.9]        1721
(0.7, 0.8]        1371
(0.6, 0.7]        1598
(0.5, 0.6]        1354
(0.4, 0.5]        2009
(0.3, 0.4]          73
(0.2, 0.3]           1
(0.1, 0.2]           0
(-0.001, 0.1]        0
Name: lda_dominant_score, dtype: int64

In [162]:
updated_df[updated_df['same_categorisation'] != True]['lda_dominant_score'].value_counts(bins=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]).sort_index(ascending=False)

(0.9, 1.0]       3054
(0.8, 0.9]        565
(0.7, 0.8]        555
(0.6, 0.7]        793
(0.5, 0.6]        755
(0.4, 0.5]       1393
(0.3, 0.4]         42
(0.2, 0.3]          1
(0.1, 0.2]          0
(-0.001, 0.1]       0
Name: lda_dominant_score, dtype: int64
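
The two binned counts above can also be combined into one disagreement share per score bin. The sketch below shows one way to do that with `pd.cut` on a made-up frame; the column names mirror the notebook, the values do not.

```python
import pandas as pd

toy = pd.DataFrame({
    "lda_dominant_score": [0.95, 0.92, 0.85, 0.45, 0.42, 0.41],
    "same_categorisation": [True, True, False, False, True, False],
})

# Same bin edges as above: 0.0, 0.1, ..., 1.0
bins = [i / 10 for i in range(11)]
toy["score_bin"] = pd.cut(toy["lda_dominant_score"], bins=bins)

# Share of paragraphs per bin on which the two methods disagree
agreement = toy.groupby("score_bin", observed=True)["same_categorisation"].mean()
disagreement_share = 1 - agreement
print(disagreement_share)
```

Reading the shares per bin makes it easier to see whether disagreement is concentrated among paragraphs with a low LDA score.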

#### Percentage of similarly classified documents between LDA topic modeling and word embedding classification

In [146]:
# Assumes a run with named topics, i.e. lda_dominant values that end with the topic name.
# With generic 'topic_N' labels no row matches and every share is NaN.
topics = sorted(['sport', 'art', 'diplomacy', 'education', 'entertainment', 'transportation'])
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
topic_values = []
for topic in topics:
    values = []
    for threshold in thresholds:
        subset = updated_df[(updated_df['lda_dominant'].str.endswith(topic)) & (updated_df['lda_dominant_score'] > threshold) & (updated_df['merged_POS'].str.len() > 10)]
        # The mean of a boolean column is the share of True values; it is NaN (instead of an IndexError) when the filter matches no rows
        values.append(subset['same_categorisation'].mean())
    topic_values.append(values)

# Transpose: one list per threshold, holding one value per topic
new_topic_values = [list(column) for column in zip(*topic_values)]
print(new_topic_values)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 10), facecolor="w")
# set width of bars
barWidth = 0.25

# One group of bars per topic, with the groups spaced 1.5 apart
r1 = np.arange(len(new_topic_values[0])) * 1.5
# One colour per threshold, from light (low threshold) to dark (high threshold)
colors = [ "#badbdb","#98d1d1", "#76c8c8","#54bebe", "#63a4da", "#296ead", "#2a4693"]

for idx, topic in enumerate(new_topic_values):
    plt.bar(r1, topic, color=colors[idx], width=barWidth, edgecolor='white', label=thresholds[idx], zorder=3)
    r1= [x + barWidth for x in r1]

# Axis labels; the xticks sit in the middle of each group of bars
plt.ylabel('Share of identically classified paragraphs (0-1)', fontweight='normal')
plt.xlabel('Topic')
plt.xticks([0.5, 2, 3.5,  5, 6.5, 8], ['Art', 'Diplomacy', 'Education', 'Entertainment', 'Sport', 'Transportation'])
plt.ylim(0, 1)
plt.xlim(-0.2, 9)

# Create legend & Show graphic
plt.title('Percentage of similarly classified documents by LDA topic modeling and word embeddings')
plt.legend(title="threshold value", loc="lower right")
plt.grid(zorder=0, color='lightgray', linestyle='--')
plt.show()

### 8. Aggregate paragraphs Classification into City Pair Classification

#### 8.1 Select right parameters

In [None]:
bottom_lda_threshold = 0.7
minimal_paragraph_length = 5

In [None]:
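# 'city_pair' is not a column of updated_df, so build it from city_1/city_2 first.
# Assumption: a pair is order-independent, i.e. (Nice, Toulon) and (Toulon, Nice) are the same pair.
updated_df['city_pair'] = [' - '.join(sorted(pair)) for pair in zip(updated_df['city_1'], updated_df['city_2'])]
temp_df = updated_df[(updated_df['lda_dominant_score'] > bottom_lda_threshold) & (updated_df['merged_POS'].str.len() > minimal_paragraph_length)]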

In [None]:
temp_df

In [None]:
grouped_df = temp_df.groupby('city_pair')

In [None]:
nested_list2 = [['city_pair', 'paragraphs', 'lemmatised_paragraph_length', 'same_categorisation_raw', 'same_categorisation_percentage',
                'lda_dominant_category', 'embedding_dominant_category', 'lda_art', 'embedding_art', 'lda_diplomacy',
                'embedding_diplomacy', 'lda_education', 'embedding_education', 'lda_entertainment', 'embedding_entertainment', 
                'lda_sport', 'embedding_sport', 'lda_transportation', 'embedding_transportation']]

In [None]:
topics

In [None]:
print(list(temp_df['embedding_dominant'].unique()))
print(list(temp_df['lda_dominant'].unique()))

#### 8.2 Aggregation algorithm

In [None]:
from IPython.display import display

lda_categories = list(temp_df['lda_dominant'].unique())
embedding_categories = list(temp_df['embedding_dominant'].unique())

for city_pair, sub_df in tqdm(grouped_df):
    paragraph_count = sub_df['paragraph'].count()
    same_categorisation_raw = sub_df['same_categorisation'].sum()
    same_categorisation_percentage = sub_df['same_categorisation'].sum()/sub_df['same_categorisation'].count()
    lemmatised_paragraph_len = sub_df['merged_POS'].str.len().mean()
    
    lda_prediction = sub_df['lda_dominant'].value_counts()
    embedding_prediction = sub_df['embedding_dominant'].value_counts()
        
    lda_dominant_category = lda_prediction.idxmax()
    embedding_dominant_category = embedding_prediction.idxmax()
    
    lda_prediction = lda_prediction.to_dict()
    embedding_prediction = embedding_prediction.to_dict()
    
    if (len(lda_prediction) != len(lda_categories)):
        for category in lda_categories:
            if category not in lda_prediction.keys():
                lda_prediction[category] = 0
                
    if (len(embedding_prediction) != len(embedding_categories)):
        for category in embedding_categories:
            if category not in embedding_prediction.keys():
                embedding_prediction[category] = 0
                
    temp_l2 = [city_pair, paragraph_count, lemmatised_paragraph_len, same_categorisation_raw, same_categorisation_percentage,
                lda_dominant_category, embedding_dominant_category, lda_prediction['lda_art'], embedding_prediction['art'], lda_prediction['lda_diplomacy'],
                embedding_prediction['diplomacy'], lda_prediction['lda_education'], embedding_prediction['education'],
                lda_prediction['lda_entertainment'], embedding_prediction['entertainment'], lda_prediction['lda_sport'],
                embedding_prediction['sport'], lda_prediction['lda_transportation'], embedding_prediction['transportation']]
    if len(nested_list2[0]) != len(temp_l2):
        raise Exception('Not the same size!')
    
    nested_list2.append(temp_l2)
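
As a compact alternative to the counting loop above, `pd.crosstab` can produce the per-pair category counts in one step, and `reindex` fills in categories that never occur. This is only a sketch on made-up data; the pair labels and the category list are assumptions.

```python
import pandas as pd

toy = pd.DataFrame({
    "city_pair": ["Nice - Toulon", "Nice - Toulon", "Oslo - Rome"],
    "lda_dominant": ["lda_art", "lda_sport", "lda_art"],
})
categories = ["lda_art", "lda_diplomacy", "lda_sport"]

# Rows: city pairs, columns: categories, values: number of paragraphs
counts = (
    pd.crosstab(toy["city_pair"], toy["lda_dominant"])
    .reindex(columns=categories, fill_value=0)  # add categories without any paragraph
)
dominant = counts.idxmax(axis=1)  # most frequent category per city pair
print(counts)
print(dominant)
```

The `reindex` call plays the role of the two `if` blocks in the loop: it guarantees that every expected category column exists, with 0 where a pair has no paragraph of that category.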

In [None]:
final_df = pd.DataFrame(nested_list2[1:],columns=nested_list2[0])

final_df.head(2)

#### 8.4 Save Aggregated Dataframe

In [None]:
# final_df.to_csv('../../../../../data/clean/deliverable_435city_pairs_both_methods_with_education_final.csv', index=False)

#### 8.5 Normalise classification (by number of paragraphs)

In [None]:
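# Normalise the category counts by the number of paragraphs per city pair
final_df_normalised = final_df.copy()  # copy, so that final_df keeps the raw counts
count_columns = list(final_df_normalised.columns)[7:]  # the lda_* / embedding_* count columns
final_df_normalised[count_columns] = final_df_normalised[count_columns].div(final_df_normalised['paragraphs'], axis=0)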

final_df_normalised
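
A toy example of the row-wise normalisation: `DataFrame.div(..., axis=0)` divides every count column by the paragraph count of the same row. Working on a copy keeps the raw counts available. Column names and values below are illustrative only.

```python
import pandas as pd

toy = pd.DataFrame({
    "city_pair": ["A - B", "C - D"],
    "paragraphs": [4, 10],
    "lda_art": [1, 5],
    "lda_sport": [3, 5],
})
count_columns = ["lda_art", "lda_sport"]

normalised = toy.copy()  # copy so the raw counts stay available in `toy`
normalised[count_columns] = normalised[count_columns].div(normalised["paragraphs"], axis=0)
print(normalised)
```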

#### 8.6 Save Normalised Aggregated Dataframe

In [None]:
# final_df_normalised.to_csv('../../../../../data/clean/deliverable_435city_pairs_both_methods_with_education_final_normalised.csv', index=False)

# EXTRAS

#### Show closest words to topic vector

In [None]:
from scipy import spatial
from preprocessing_functions import *

def find_closest_embeddings(embedding, cutoff=25):
    """Return the `cutoff` tokens whose vectors are closest (Euclidean distance) to `embedding`."""
    return sorted(embeddings_dict.keys(), key=lambda token: spatial.distance.euclidean(embeddings_dict[token], embedding))[:cutoff]

In [None]:
topic = 'art'  # must be a key of mean_vectors_dict

# A single word vector works as well, e.g. embeddings_dict['diplomacy']
words = find_closest_embeddings(embedding=mean_vectors_dict[topic], cutoff=2000)

In [None]:
english_words = get_english_words(path='../../../input/english_words_alpha_370k.txt')
print(remove_non_existing_words_from_wordlist(words, english_words))
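
Sorting the whole vocabulary with a Python-level distance function is slow for a vocabulary of this size. A possible alternative, sketched here on toy vectors, is to stack the vectors into one matrix and let NumPy compute all distances at once. The function name `closest_tokens` is hypothetical, and for the full GloVe vocabulary the stacked matrix needs a few gigabytes of memory.

```python
import numpy as np

def closest_tokens(query, embeddings, cutoff=5):
    """Return the `cutoff` tokens with the smallest Euclidean distance to `query`."""
    tokens = list(embeddings)
    matrix = np.vstack([embeddings[t] for t in tokens])  # shape: (vocabulary, dimensions)
    distances = np.linalg.norm(matrix - query, axis=1)   # one distance per token
    order = np.argsort(distances)[:cutoff]
    return [tokens[i] for i in order]

toy_embeddings = {
    "painting": np.array([1.0, 0.0]),
    "museum": np.array([0.9, 0.2]),
    "railway": np.array([-1.0, 0.5]),
}
nearest = closest_tokens(np.array([1.0, 0.1]), toy_embeddings, cutoff=2)
print(nearest)
```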