<div style="background-color: lightgreen; border-radius: 5px; padding: 10px;">
    <h4>Word Embedding Categorisation</h4>
    <p>...</p>
</div>

In [1]:
# Project-local helper modules
from preprocessing_functions import *
from word_embedding_functions import *

import os
import warnings
from pprint import pprint

import numpy as np
import pandas as pd
from tqdm.auto import tqdm

warnings.filterwarnings('ignore')

- Clean and preprocess (lemmatise) a list of documents (e.g. paragraphs).
- Derive the topics through unsupervised clustering (LDA topic modeling).
- Use the resulting topic words as input for the word embedding algorithm.

### 1. Set Topic Words
<br>
<div style="text-align: justify;">Deriving the topics and their words with LDA topic modeling (unsupervised clustering) generally gives better classification results than hand-picking words. Inspect the pyLDAvis visualisation of the LDA topic model and adjust its parameters if the cluster distribution is unsatisfactory or if clusters overlap.</div>
<br>
<div style="text-align: justify;">
A good topic model has relatively large, similarly sized, non-overlapping bubbles scattered throughout the chart. Greater distances between clusters represent larger semantic differences, similarly sized bubbles show that the topics are equally represented, and large bubbles mean that the topics are well represented in the documents. Paying attention to these three characteristics gives an accurate picture of the dominant topics in the documents and helps decide whether the clusters represent good, meaningful topics.
</div>
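
Overlap between two topics can also be checked numerically instead of by eye. The sketch below computes the Jensen-Shannon divergence between toy topic-word distributions; the probabilities are made up for illustration and are not taken from the model in this notebook. With base-2 logarithms the value lies between 0 (identical topics) and 1 (topics that share no words).

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Symmetric Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Toy word distributions over a shared four-word vocabulary (hypothetical values)
topic_a = [0.5, 0.4, 0.1, 0.0]
topic_b = [0.0, 0.1, 0.4, 0.5]
topic_a_variant = [0.45, 0.45, 0.1, 0.0]

far = js_divergence(topic_a, topic_b)
near = js_divergence(topic_a, topic_a_variant)
print(f"distinct topics: {far:.3f}, overlapping topics: {near:.3f}")
```

A pair of topics with a value close to 0 is a candidate for merging, or a sign that the number of topics should be lowered.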

#### 1.1 Option 1 (less optimal!): Select individual words for each category/topic

In [50]:
# Removing words with double meaning helps (e.g. bear, degree)

choice = input('Are you sure you want to set your own topic words instead of using the recommended lda topic words? (y/n)')

if choice == 'y':
    print('Running cell...')
    
    #fashion_words = ['fashion', 'model', 'vogue', 'store', 'designer', 'couture', 'catwalk', 'runway', 'modeling', 'clothing', 'cosmetic', 'brand', 'retail', 'advertising', 'perfume']
    diplomacy_words = ['War', 'Embassy', 'Army', 'Diplomatic', 'Ambassador', 'Treaty', 'Protest', 'Force', 'Mission', 'Arrest', 'Government', 'Police', 'Attack', 'Party', 'Minister']
    entertainment_words = ['Opera', 'Festival', 'Perform', 'Orchestra', 'Symphony', 'Concert', 'Music', 'Film', 'Sing', 'Theatre', 'Performance', 'Role', 'Premiere', 'Tour', 'Band']
    art_words = ['Exhibition', 'Art', 'Museum', 'Gallery', 'Exhibit', 'Painting', 'Collection', 'Paint', 'Portrait', 'Artist', 'Sculpture', 'Fashion', 'Design', 'Contemporary', 'Painter']
    education_words = ['Study', 'School', 'Professor', 'University', 'Graduate', 'Educate', 'Lecture', 'Research', 'College', 'Teach', 'Science', 'Education', 'Philosophy', 'Doctorate', 'Faculty'] # replaced bear with educate and degree with Lecture
    transportation_words = ['Railway', 'Route', 'Line', 'Operate', 'Flight', 'Station', 'Service', 'Airline', 'Airport', 'Train', 'Passenger', 'Speed', 'Aircraft', 'Rail', 'Network']
    sport_words = ['Final', 'Win', 'Team', 'Match', 'Game', 'Goal', 'Club', 'League', 'Champion', 'Championship', 'Season', 'Score', 'Round', 'Tournament', 'Football']

    lda_topic_words = { #'fashion': fashion_words,
                        'diplomacy': diplomacy_words,
                       'entertainment': entertainment_words,
                       'art': art_words,
                       'education': education_words,
                       'transportation': transportation_words,
                       'sport': sport_words}
else:
    print('Canceling cell execution...')

Are you sure you want to set your own topic words instead of using the recommended lda topic words? (y/n)n
Canceling cell execution...


#### 1.2 Option 2 (recommended!): Use Topic Model's topic words (found through LDA)

In [10]:
lda_models_dir = "../../../../data_clean/lda_models/"
model_dir = "lda_model_2millionDocs_5topics/"
file_name = "relevant_words_per_topic.csv"

lda_topic_words_df_path = os.path.join(lda_models_dir, model_dir, file_name)
lda_topic_words_df = pd.read_csv(lda_topic_words_df_path)

In [3]:
display(lda_topic_words_df[:20])
len(lda_topic_words_df)

Unnamed: 0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5
0,service,perform,protest,season,study
1,station,opera,force,club,exhibition
2,route,concert,attack,win,work
3,railway,film,army,goal,art
4,line,tour,arrest,match,painting
5,operate,music,troop,team,exhibit
6,train,sing,police,score,professor
7,passenger,festival,war,game,school
8,airport,premiere,command,final,university
9,airline,band,embassy,league,degree


50

In [126]:
nr_of_words = 15
lda_topic_words = lda_topic_words_df[:nr_of_words].to_dict('list')

for key in lda_topic_words:
    print(f"{key}:", lda_topic_words[key], '\n')

Topic 1: ['service', 'station', 'route', 'railway', 'line', 'operate', 'train', 'passenger', 'airport', 'airline', 'speed', 'flight', 'connect', 'rail', 'network'] 

Topic 2: ['perform', 'opera', 'concert', 'film', 'tour', 'music', 'sing', 'festival', 'premiere', 'band', 'orchestra', 'role', 'performance', 'album', 'song'] 

Topic 3: ['protest', 'force', 'attack', 'army', 'arrest', 'troop', 'police', 'war', 'command', 'embassy', 'government', 'mission', 'party', 'diplomatic', 'arrive'] 

Topic 4: ['season', 'club', 'win', 'goal', 'match', 'team', 'score', 'game', 'final', 'league', 'player', 'defeat', 'beat', 'football', 'finish'] 

Topic 5: ['study', 'exhibition', 'work', 'art', 'painting', 'exhibit', 'professor', 'school', 'university', 'degree', 'bear', 'collection', 'graduate', 'painter', 'research'] 



### 2. Loading GloVe Word Embedding
<br>
<div style="text-align: justify;">The word embedding model can be swapped for another one fairly easily, but for now we use the GloVe vectors trained on Common Crawl (840B tokens, 300 dimensions; source: https://nlp.stanford.edu/projects/glove/).</div>

In [5]:
%%time

embeddings_dict, discarded_dict = load_glove_word_embeddings(GLOVE_PATH="../../../../glove.840B.300d.txt")

This will take approximately ~ 4 minutes...


  0%|          | 0/2196017 [00:00<?, ?it/s]

CPU times: total: 2min 58s
Wall time: 3min 3s


### 3. Get Mean of the words from each Topic
<br>
<div style="text-align: justify;">By taking the mean of the vectors that belong to the ~15 most relevant words of a topic we get a fairly accurate vector representation of a topic. We do this for each topic and save their vector value to a dictionary for later retrieval.</div>

In [127]:
mean_vectors_dict = {}

for topic, topic_words in lda_topic_words.items():
    # Lowercase first so that the vocabulary check and the lookup use the same key
    words = [word.lower() for word in topic_words]
    # Keep only words that are in the embedding vocabulary
    words = [word for word in words if word in embeddings_dict]
    # Average the word vectors to get a single vector per topic
    mean_vectors_dict[topic] = np.mean([embeddings_dict[word] for word in words], axis=0)
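
As a small illustration of the averaging step, the sketch below uses a made-up three-dimensional embedding table (`toy_embeddings` and `topic_mean_vectors` are hypothetical names, not part of this project) and skips out-of-vocabulary words before taking the mean.

```python
import numpy as np

# Toy stand-in for the GloVe dictionary: word -> vector
toy_embeddings = {
    "train": np.array([1.0, 0.0, 0.0]),
    "station": np.array([0.0, 1.0, 0.0]),
    "goal": np.array([0.0, 0.0, 1.0]),
}
toy_topic_words = {
    "Topic A": ["Train", "station", "tramway"],  # 'tramway' is not in the toy vocabulary
    "Topic B": ["goal"],
}

def topic_mean_vectors(topic_words, embeddings):
    """Return one mean vector per topic, ignoring words without an embedding."""
    means = {}
    for topic, words in topic_words.items():
        vectors = [embeddings[w.lower()] for w in words if w.lower() in embeddings]
        if vectors:  # a topic without any known word gets no vector
            means[topic] = np.mean(vectors, axis=0)
    return means

toy_means = topic_mean_vectors(toy_topic_words, toy_embeddings)
print(toy_means["Topic A"])
```

Guarding against an empty word list matters because `np.mean` of an empty list returns `nan` instead of raising an error.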

### 4. Load Documents

#### 4.1 Load and merge chunked .csv files

In [265]:
paragraphs_lda_topic_distribution_dir = "paragraphs_lda_topic_distribution2"

paragraphs_with_lda_prediction_path = os.path.join(lda_models_dir, model_dir, paragraphs_lda_topic_distribution_dir)

# paragraphs_with_lda_prediction_path = '../../../../data_clean/lda_models/lda_model_2million/paragraphs_lda_topic_distribution/'
file_path =  os.listdir(paragraphs_with_lda_prediction_path)[0]
sample_path = os.path.join(paragraphs_with_lda_prediction_path, file_path)

In [80]:
# os.listdir('../../../../data_clean/lda_models/lda_model_2million/paragraphs_lda_topic_distribution/')[0]
lemmatised_paragraphs_path = "../../../../data_clean/paragraphs_lemmatised/english_words_merged_NOUNVERBADJ/"
file_path = os.listdir(lemmatised_paragraphs_path)[0]
sample_path = os.path.join(lemmatised_paragraphs_path, file_path)

paragraphs_10_934384_1038204_merged_POS.csv


True

In [8]:
print(sample_path, file_path)
os.path.exists(sample_path)

../../../../data_clean/lda_models/lda_model_2millionDocs_5topics\paragraphs_lda_topic_distribution\paragraphs_10_934385_1038205_lda_topics.csv paragraphs_10_934385_1038205_lda_topics.csv


True

In [9]:
df =  pd.read_csv(sample_path)
df

Unnamed: 0,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,topic_4,topic_5,lda_dominant,lda_dominant_score
0,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861.0,"Authon, Alpes-de-Haute-Provence","['honey', 'label', 'indication', 'honey', 'flo...",0.002758,0.002761,0.002960,0.002211,0.989310,topic_5,0.989310
1,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.002758,0.002761,0.002960,0.002211,0.989310,topic_5,0.989310
2,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.017678,0.017697,0.018973,0.915734,0.029918,topic_4,0.915734
3,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.468460,0.017697,0.018973,0.014171,0.480699,topic_5,0.480699
4,Dusseldorf,Barcelona,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.978201,0.004777,0.005121,0.003825,0.008076,topic_1,0.978201
...,...,...,...,...,...,...,...,...,...,...,...,...,...
103815,Rome,Bari,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.282504,0.005439,0.005831,0.004355,0.701872,topic_5,0.701872
103816,Rome,Padua,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.144366,0.004355,0.840407,topic_5,0.840407
103817,Rome,Brescia,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.421437,0.004355,0.563336,topic_5,0.563336
103818,Rome,Verona,It publishes 15 local editions for the cities ...,19593000.0,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.005831,0.004355,0.978943,topic_5,0.978943


#### 4.2 Select number of paragraphs to process

In [138]:
sample_size = 5000
sample = df[:sample_size].copy()  # copy, so that later column assignments do not write into a view of df

#### 4.3 Turn stringed lists (in csv) into list objects

In [139]:
%%time

from ast import literal_eval

if isinstance(sample['merged_POS'][0], str):
    sample['merged_POS'] = sample['merged_POS'].apply(literal_eval)

CPU times: total: 938 ms
Wall time: 962 ms
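
To see why this conversion is needed: a list written to csv comes back as its string representation. A minimal sketch with a made-up cell value:

```python
from ast import literal_eval

raw_cell = "['honey', 'label', 'indication']"  # how a list looks after a csv round trip
words = literal_eval(raw_cell)                  # parses Python literals only, it does not run arbitrary code

# len() of the raw string counts characters, len() of the parsed list counts words
print(type(raw_cell).__name__, len(raw_cell))
print(type(words).__name__, len(words))
```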


In [11]:
sample.columns

Index(['city_1', 'city_2', 'paragraph', 'article_id', 'title', 'merged_POS',
       'topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5', 'lda_dominant',
       'lda_dominant_score'],
      dtype='object')

### 5. Classify Paragraphs (by Word Embedding Algorithm)

#### 5.1 Select right parameters
<br>
<div style="text-align: justify;"><code>bottom_threshold</code>: if the similarity between a word and its closest topic is below <code>bottom_threshold</code>, the word is excluded from the classification.<br>
<code>verbose1</code> and <code>verbose2</code>: when enabled, the algorithm prints its intermediate steps.</div>

In [59]:
bottom_threshold = 0.20
verbose1 = False
verbose2 = False
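
The internals of `categorize_text` are not shown in this notebook, so the following is only a sketch of how a bottom threshold could work. It assumes that each word is compared with every topic vector by cosine similarity and that a word only contributes when its best similarity reaches the threshold. The function `score_words` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def score_words(words, topic_vectors, embeddings, bottom_threshold=0.20):
    """Add each word's best similarity to its closest topic, skipping words below the threshold."""
    scores = {topic: 0.0 for topic in topic_vectors}
    for word in words:
        if word not in embeddings:
            continue  # out-of-vocabulary words are ignored
        sims = {t: cosine_similarity(embeddings[word], v) for t, v in topic_vectors.items()}
        best_topic = max(sims, key=sims.get)
        if sims[best_topic] < bottom_threshold:
            continue  # too far from every topic: discard the word
        scores[best_topic] += sims[best_topic]
    return scores

# Toy two-dimensional topic and word vectors (made up for illustration)
toy_topics = {"Topic 1": np.array([1.0, 0.0]), "Topic 2": np.array([0.0, 1.0])}
toy_embeddings = {
    "train": np.array([0.9, 0.1]),
    "station": np.array([0.8, 0.2]),
    "goal": np.array([0.1, 0.9]),
    "noise": np.array([-1.0, -1.0]),
}

toy_scores = score_words(["train", "station", "goal", "noise", "unknown"], toy_topics, toy_embeddings)
prediction = max(toy_scores, key=toy_scores.get)
print(toy_scores, prediction)
```

Raising the threshold makes the classification stricter: fewer words vote, which removes noise but can leave short paragraphs with all-zero scores.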

In [182]:
os.listdir(paragraphs_with_lda_prediction_path)[0].split('_lda_')


['paragraphs_10_934385_1038205', 'topics.csv']

In [245]:
import math

In [262]:
test = pd.read_csv("../../../../data_clean/lda_models/lda_model_2millionDocs_5topics/paragraphs_lda_topic_distribution/paragraphs_10_934385_1038205_lda_topics.csv")  # forward slashes avoid invalid escape sequences such as "\p"

In [263]:
test['merged_POS'].isnull().values.any()

False

In [264]:
test.iloc[0:20]

Unnamed: 0,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,topic_4,topic_5,lda_dominant,lda_dominant_score
0,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861.0,"Authon, Alpes-de-Haute-Provence","['honey', 'label', 'indication', 'honey', 'flo...",0.002758,0.002761,0.00296,0.002211,0.98931,topic_5,0.98931
1,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.002758,0.002761,0.00296,0.002211,0.98931,topic_5,0.98931
2,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040.0,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.017678,0.017697,0.018973,0.915734,0.029918,topic_4,0.915734
3,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.46846,0.017697,0.018973,0.014171,0.480699,topic_5,0.480699
4,Dusseldorf,Barcelona,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113.0,Meolans-Revel,"['travel', 'crash']",0.978201,0.004777,0.005121,0.003825,0.008076,topic_1,0.978201
5,Barcelona,Dusseldorf,"On 24 March 2015, Germanwings Flight 9525, an ...",15416167.0,Prads-Haute-Bleone,"['mountain', 'territory', 'commune', 'passenge...",0.978201,0.004777,0.005121,0.003825,0.008076,topic_1,0.978201
6,Dusseldorf,Barcelona,"On 24 March 2015, Germanwings Flight 9525, an ...",15416167.0,Prads-Haute-Bleone,"['mountain', 'territory', 'commune', 'passenge...",0.002418,0.002421,0.064254,0.926816,0.004092,topic_4,0.926816
7,Frankfurt,Duisburg,Schamer played his entire professional career ...,15416514.0,Lothar Schamer,"['career', 'season', 'squad', 'goal', 'penalty...",0.002418,0.002421,0.002595,0.988474,0.004092,topic_4,0.988474
8,Duisburg,Frankfurt,Schamer played his entire professional career ...,15416514.0,Lothar Schamer,"['career', 'season', 'squad', 'goal', 'penalty...",0.001026,0.681328,0.001101,0.000823,0.315722,topic_2,0.681328
9,London,Rome,Bildner grew up in Rio de Janeiro where she wa...,15416521.0,Mossa Bildner,"['variety', 'music', 'teen', 'ceremony', 'love...",0.001026,0.49817,0.001101,0.000823,0.49888,topic_5,0.49888


In [274]:
%%time
# ~40 mins for 311k paragraphs

from ast import literal_eval

saving_path = os.path.join(lda_models_dir, model_dir, "paragraphs_word_embedding_categorisation2/")
os.makedirs(saving_path, exist_ok=True)

topics = sorted(mean_vectors_dict)
columns = ['index'] + topics + ['embedding_dominant']

for file in tqdm(os.listdir(paragraphs_with_lda_prediction_path)):
    abs_saving_path = os.path.join(os.path.abspath(saving_path), f"{file.split('_lda_')[0]}_embedding_categorisation.csv")

    # Skip chunks that were already processed in an earlier run
    if os.path.exists(abs_saving_path):
        print('exists.')
        continue

    sample = pd.read_csv(os.path.join(paragraphs_with_lda_prediction_path, file))

    # Lists are stored as strings in the csv; turn them back into list objects
    if isinstance(sample['merged_POS'].iloc[0], str):
        sample['merged_POS'] = sample['merged_POS'].apply(literal_eval)

    rows = []
    # Series.items() replaces iteritems(), which was removed in pandas 2.0
    for idx, row in tqdm(sample['merged_POS'].items(), total=len(sample)):
        output = categorize_text(lemmatized_wordlist=row, mean_vectors_dict=mean_vectors_dict, keywords=topics, embeddings_dict=embeddings_dict, bottom_threshold=bottom_threshold, verbose1=verbose1, verbose2=verbose2)
        temp_l = [idx] + [result[1] for result in output['category_similarities']] + [output['prediction']]

        if len(columns) != len(temp_l):
            raise ValueError('Not the same size!')

        rows.append(temp_l)

    prediction_df = pd.DataFrame(rows, columns=columns).set_index('index')
    updated_df = sample.join(prediction_df)

    # Both methods agree when the topic numbers match ('topic_5' vs 'Topic 5').
    # Comparing only the last character works while there are fewer than 10 topics.
    updated_df['same_categorisation'] = updated_df.apply(lambda x: x.lda_dominant.endswith(x.embedding_dominant[-1]), axis=1)

    updated_df.to_csv(abs_saving_path, index=False)

  0%|          | 0/20 [00:00<?, ?it/s]

  0%|          | 0/103820 [00:00<?, ?it/s]

CPU times: total: 3h 9min 2s
Wall time: 3h 43min 15s


In [276]:
os.listdir(saving_path)

['paragraphs_10_934384_1038204_embedding_categorisation.csv',
 'paragraphs_11_1038204_1142024_embedding_categorisation.csv',
 'paragraphs_12_1142024_1245844_embedding_categorisation.csv',
 'paragraphs_13_1245844_1349664_embedding_categorisation.csv',
 'paragraphs_14_1349664_1453484_embedding_categorisation.csv',
 'paragraphs_15_1453484_1557304_embedding_categorisation.csv',
 'paragraphs_16_1557304_1661124_embedding_categorisation.csv',
 'paragraphs_17_1661124_1764944_embedding_categorisation.csv',
 'paragraphs_18_1764944_1868764_embedding_categorisation.csv',
 'paragraphs_19_1868764_1972584_embedding_categorisation.csv',
 'paragraphs_1_0_103821_embedding_categorisation.csv',
 'paragraphs_20_1972584_2076404_embedding_categorisation.csv',
 'paragraphs_2_103821_207642_embedding_categorisation.csv',
 'paragraphs_3_207642_311463_embedding_categorisation.csv',
 'paragraphs_4_311463_415284_embedding_categorisation.csv',
 'paragraphs_5_415284_519104_embedding_categorisation.csv',
 'paragraphs_

In [204]:
file.split('_lda_')[0]

'paragraphs_10_934385_1038205'

In [234]:

os.path.join(os.path.abspath(saving_path), c)

'C:\\Users\\diede\\Personal Files [Local]\\Applied Data Science\\Thesis - CITYNET\\data_clean\\lda_models\\lda_model_2millionDocs_5topics\\paragraphs_word_embedding_categorisation\\paragraphs_11_1038205_1142025_embedding.csv'

In [194]:
os.path.exists(saving_path)

True

In [267]:
prediction_df

Unnamed: 0_level_0,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
0,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1
1,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1
2,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5
3,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5
4,1.018640,0.000000,0.000000,0.000000,0.000000,Topic 1
...,...,...,...,...,...,...
4995,0.781117,0.790643,0.392753,1.041798,7.756727,Topic 5
4996,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5
4997,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5
4998,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5


In [268]:
prediction_df['embedding_dominant'].value_counts()

Topic 5    1516
Topic 4    1056
Topic 3     898
Topic 1     774
Topic 2     756
Name: embedding_dominant, dtype: int64

### 6. Insert classification into dataset

In [None]:
updated_df = sample.join(prediction_df)
updated_df.head(2)

In [270]:
# True when both methods pick the same topic number ('topic_5' vs 'Topic 5')
updated_df['same_categorisation'] = updated_df.apply(lambda x: x.lda_dominant.endswith(x.embedding_dominant[-1]), axis=1)
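
Comparing only the last character of the two labels works while there are fewer than ten topics. A slightly more robust variant, sketched below with the hypothetical helpers `topic_number` and `same_topic`, extracts the full topic number from both label styles before comparing.

```python
import re

def topic_number(label):
    """Return the trailing integer of labels such as 'topic_5' or 'Topic 12'."""
    match = re.search(r"(\d+)$", label)
    if match is None:
        raise ValueError(f"no topic number in {label!r}")
    return int(match.group(1))

def same_topic(lda_label, embedding_label):
    """True when both labels refer to the same topic number."""
    return topic_number(lda_label) == topic_number(embedding_label)

print(same_topic("topic_5", "Topic 5"))
print(same_topic("topic_12", "Topic 2"))
```

In the notebook this could be applied row-wise, for example with `updated_df.apply(lambda x: same_topic(x.lda_dominant, x.embedding_dominant), axis=1)`.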

## Saving "Classified Paragraphs" Dataframe

In [271]:
updated_df

Unnamed: 0,paragraph_id,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,...,topic_5,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant,same_categorisation
0,934385,Toulon,Marseille,Honey from Provence is protected by a red labe...,15415861,"Authon, Alpes-de-Haute-Provence","[honey, label, indication, honey, flower, hone...",0.995948,0.000888,0.000952,...,0.001501,topic_1,0.995948,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1,True
1,934386,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861,"Authon, Alpes-de-Haute-Provence","[honey, label, indication, honey, flower, hone...",0.950718,0.000888,0.023567,...,0.024116,topic_1,0.950718,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1,True
2,934387,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040,Erik Glosimodt,"[education, professor, architect, practice, tr...",0.002758,0.002761,0.002960,...,0.989310,topic_5,0.989310,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5,True
3,934388,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040,Erik Glosimodt,"[education, professor, architect, practice, tr...",0.002758,0.002761,0.002960,...,0.989310,topic_5,0.989310,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5,True
4,934389,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113,Meolans-Revel,"[travel, crash]",0.017678,0.017697,0.018973,...,0.029918,topic_4,0.915734,1.018640,0.000000,0.000000,0.000000,0.000000,Topic 1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,939380,Granada,London,"Whilst working on his doctoral research, he to...",15638645,Simon Thurley,"[research, post, director, history, broadcaste...",0.042910,0.290658,0.001738,...,0.663396,topic_5,0.663396,0.781117,0.790643,0.392753,1.041798,7.756727,Topic 5,True
4996,939381,Rome,Amsterdam,He was sent to the Cape of Good Hope with lett...,15638655,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000751,0.000752,0.058257,...,0.939638,topic_5,0.939638,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5,True
4997,939382,Rome,Nuremberg,He was sent to the Cape of Good Hope with lett...,15638655,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.000751,0.000752,0.000806,...,0.997089,topic_5,0.997089,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5,True
4998,939383,Amsterdam,Rome,He was sent to the Cape of Good Hope with lett...,15638655,Peter Kolbe,"[letter, introduction, mayor, mandate, descrip...",0.019901,0.000752,0.000806,...,0.977939,topic_5,0.977939,0.682769,0.781159,3.412686,1.367645,10.386795,Topic 5,True


In [145]:
# updated_df.to_csv('..\..\..\..\..\data\clean\lda_classified_30cities_435citypairs_311k_paragraphs_both_methods_with_education.csv', index=False)

In [12]:
saving_path = os.path.join(lda_models_dir, model_dir, "paragraphs_word_embedding_categorisation2/")

dataframes = []

for file in tqdm(os.listdir(saving_path)):
    fp = os.path.abspath(os.path.join(saving_path, file))
    print(fp)
    temp_df = pd.read_csv(fp)
    dataframes.append(temp_df)
    
len(dataframes)

  0%|          | 0/20 [00:00<?, ?it/s]

C:\Users\diede\Personal Files [Local]\Applied Data Science\Thesis - CITYNET\data_clean\lda_models\lda_model_2millionDocs_5topics\paragraphs_word_embedding_categorisation2\paragraphs_10_934384_1038204_embedding_categorisation.csv
C:\Users\diede\Personal Files [Local]\Applied Data Science\Thesis - CITYNET\data_clean\lda_models\lda_model_2millionDocs_5topics\paragraphs_word_embedding_categorisation2\paragraphs_11_1038204_1142024_embedding_categorisation.csv
C:\Users\diede\Personal Files [Local]\Applied Data Science\Thesis - CITYNET\data_clean\lda_models\lda_model_2millionDocs_5topics\paragraphs_word_embedding_categorisation2\paragraphs_12_1142024_1245844_embedding_categorisation.csv
C:\Users\diede\Personal Files [Local]\Applied Data Science\Thesis - CITYNET\data_clean\lda_models\lda_model_2millionDocs_5topics\paragraphs_word_embedding_categorisation2\paragraphs_13_1245844_1349664_embedding_categorisation.csv
C:\Users\diede\Personal Files [Local]\Applied Data Science\Thesis - CITYNET\data_

20

In [13]:
updated_df = pd.concat(dataframes) #, keys=citypairs)
updated_df.set_index('paragraph_id', inplace=True)
updated_df.sort_index(inplace=True)
updated_df.reset_index(inplace=True)
updated_df

Unnamed: 0,paragraph_id,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,...,topic_5,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant,same_categorisation
0,1,Birmingham,Florence,The first community of adherents of the Baha'i...,303,Alabama,"['community', 'adherent', 'center', 'found', '...",0.006307,0.006313,0.971652,...,0.010673,topic_3,0.971652,0.457759,0.000000,0.404818,0.00000,0.931113,Topic 5,False
1,2,Florence,Birmingham,The first community of adherents of the Baha'i...,303,Alabama,"['community', 'adherent', 'center', 'found', '...",0.006307,0.006313,0.971652,...,0.010673,topic_3,0.971652,0.457759,0.000000,0.404818,0.00000,0.931113,Topic 5,False
2,3,Paris,London,A major revision of the work by composer and a...,309,An American in Paris,"['revision', 'work', 'composer', 'arranger', '...",0.136577,0.815029,0.000952,...,0.046731,topic_2,0.815029,1.066222,8.625452,2.015810,0.00000,4.556324,Topic 2,True
3,4,London,Paris,A major revision of the work by composer and a...,309,An American in Paris,"['revision', 'work', 'composer', 'arranger', '...",0.000887,0.679339,0.204487,...,0.114576,topic_2,0.679339,1.066222,8.625452,2.015810,0.00000,4.556324,Topic 2,True
4,5,Madrid,Rome,Access to biocapacity in Algeria is lower than...,358,Algeria,"['access', 'world', 'hectare', 'person', 'terr...",0.672310,0.000976,0.324282,...,0.001650,topic_1,0.672310,3.473920,0.000000,3.651286,4.04859,0.624395,Topic 4,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2076399,2076400,Paris,Stockholm,Mundhir participated in Arab and international...,70585176,Mundhir Masri,"['festival', 'seminar', 'guest', 'participate'...",0.004772,0.734849,0.005121,...,0.251433,topic_2,0.734849,0.000000,1.043150,0.569592,0.00000,1.370652,Topic 5,False
2076400,2076401,London,Paris,Mundhir participated in Arab and international...,70585176,Mundhir Masri,"['festival', 'seminar', 'guest', 'participate'...",0.004772,0.978206,0.005121,...,0.008076,topic_2,0.978206,0.000000,1.043150,0.569592,0.00000,1.370652,Topic 5,False
2076401,2076402,London,Stockholm,Mundhir participated in Arab and international...,70585176,Mundhir Masri,"['festival', 'seminar', 'guest', 'participate'...",0.004772,0.978206,0.005121,...,0.008076,topic_2,0.978206,0.000000,1.043150,0.569592,0.00000,1.370652,Topic 5,False
2076402,2076403,Stockholm,Paris,Mundhir participated in Arab and international...,70585176,Mundhir Masri,"['festival', 'seminar', 'guest', 'participate'...",0.004772,0.978206,0.005121,...,0.008076,topic_2,0.978206,0.000000,1.043150,0.569592,0.00000,1.370652,Topic 5,False


### 7. Informatics

#### Similarity between LDA topic model and word embedding algorithm output

In [283]:
same_categorisation = updated_df['same_categorisation'].sum()  # each True counts as 1
total_documents = updated_df['same_categorisation'].count()

print(f"{same_categorisation} out of {total_documents} ({same_categorisation/total_documents*100:.2f}%) were classified the same by the LDA topic model and word embedding classification model.")

1719035 out of 2076404 (82.79%) were classified the same by the LDA topic model and word embedding classification model.


In [294]:
threshold_val = 0.7
nr_of_words = 0

# Note: after the csv round trip merged_POS appears to hold strings, so .str.len() counts characters, not words
mask = (updated_df['lda_dominant_score'] > threshold_val) & (updated_df['merged_POS'].str.len() > nr_of_words)
filtered = updated_df[mask]

print(filtered['same_categorisation'].value_counts())
print(filtered['same_categorisation'].value_counts(normalize=True))
print('----------------------------------------------------')
print(filtered.groupby('lda_dominant')['same_categorisation'].value_counts())
print('----------------------------------------------------')
print(filtered.groupby('lda_dominant')['same_categorisation'].value_counts(normalize=True))

True     1526853
False     207496
Name: same_categorisation, dtype: int64
True     0.880361
False    0.119639
Name: same_categorisation, dtype: float64
----------------------------------------------------
lda_dominant  same_categorisation
topic_1       True                   264811
              False                   43909
topic_2       True                   247725
              False                   73619
topic_3       True                   235704
              False                   40944
topic_4       True                   301220
              False                    4384
topic_5       True                   477393
              False                   44640
Name: same_categorisation, dtype: int64
----------------------------------------------------
lda_dominant  same_categorisation
topic_1       True                   0.857771
              False                  0.142229
topic_2       True                   0.770903
              False                  0.229097
topic_3   

#### Binned LDA scores of all paragraphs and of differently classified paragraphs

In [295]:
updated_df['lda_dominant_score'].value_counts(bins=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]).sort_index(ascending=False)

(0.9, 1.0]       1407702
(0.8, 0.9]        179868
(0.7, 0.8]        146779
(0.6, 0.7]        141634
(0.5, 0.6]        129276
(0.4, 0.5]         64579
(0.3, 0.4]          6462
(0.2, 0.3]           104
(0.1, 0.2]             0
(-0.001, 0.1]          0
Name: lda_dominant_score, dtype: int64

In [296]:
updated_df[updated_df['same_categorisation'] != True]['lda_dominant_score'].value_counts(bins=[0,0.1,0.2,0.3,0.4,0.5,0.6,0.7,0.8,0.9,1]).sort_index(ascending=False)

(0.9, 1.0]       142956
(0.8, 0.9]        29402
(0.7, 0.8]        35138
(0.6, 0.7]        48773
(0.5, 0.6]        58790
(0.4, 0.5]        38202
(0.3, 0.4]         4030
(0.2, 0.3]           78
(0.1, 0.2]            0
(-0.001, 0.1]         0
Name: lda_dominant_score, dtype: int64

#### Percentage of similarly classified documents between LDA topic modeling and word embedding classification

In [146]:
# The original version of this cell raised "IndexError: index 1 is out of bounds for axis 0 with size 0":
# the Option 1 topic names ('sport', 'art', ...) never match the 'topic_N' labels in lda_dominant,
# so every subset was empty. The labels are now taken from the data itself.
topics = sorted(updated_df['lda_dominant'].unique())
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]

topic_values = []
for topic in topics:
    values = []
    for threshold in thresholds:
        subset = updated_df[(updated_df['lda_dominant'] == topic) & (updated_df['lda_dominant_score'] > threshold) & (updated_df['merged_POS'].str.len() > 10)]
        # Share of paragraphs on which both methods agree (NaN if the subset is empty)
        values.append(subset['same_categorisation'].mean())
    topic_values.append(values)

# Transpose: one list per threshold, holding one value per topic
new_topic_values = [list(values) for values in zip(*topic_values)]
print(new_topic_values)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

plt.figure(figsize=(10, 10), facecolor="w")
# set width of bars
barWidth = 0.25

# One group of bars per topic, one bar per threshold within each group
group_start = np.arange(len(topics)) * 1.5
colors = ["#badbdb", "#98d1d1", "#76c8c8", "#54bebe", "#63a4da", "#296ead", "#2a4693"]

for idx, values in enumerate(new_topic_values):
    plt.bar(group_start + idx * barWidth, values, color=colors[idx], width=barWidth, edgecolor='white', label=thresholds[idx], zorder=3)

# Add xticks in the middle of each group of bars
plt.xticks(group_start + barWidth * (len(thresholds) - 1) / 2, topics)
plt.ylabel('Share of documents (0-1)', fontweight='normal')
plt.xlabel('Topic')
plt.ylim(0, 1)
plt.xlim(-0.2, group_start[-1] + 1.5)

# Create legend & show graphic
plt.title('Share of documents classified identically by LDA topic modeling and word embeddings')
plt.legend(title="threshold value", loc="lower right")
plt.grid(zorder=0, color='lightgray', linestyle='--')
plt.show()

### 8. Aggregate Paragraph Classifications into City Pair Classifications

#### 8.1 Select the right parameters

In [None]:
bottom_lda_threshold = 0.7
minimal_paragraph_length = 5

In [305]:
len(dataframes[0].city_1.unique())

151

In [311]:
city1_list = list(dataframes[0]['city_1'].unique())
city2_list = list(dataframes[0]['city_2'].unique())
len(city1_list), len(city2_list)

(151, 151)

In [372]:
updated_df['city_1'].value_counts()

London                                      186465
Paris                                       133309
Berlin                                       86900
Rome                                         59845
Manchester                                   58871
                                             ...  
Caserta                                        720
A Coruna                                       684
Santa Cruz de Tenerife                         577
Castellammare di Stabia-Torre Annunziata       204
Busto Arsizio                                   89
Name: city_1, Length: 151, dtype: int64

In [384]:
d1 = updated_df['city_1'].value_counts().to_dict()
d1

{'London': 186465,
 'Paris': 133309,
 'Berlin': 86900,
 'Rome': 59845,
 'Manchester': 58871,
 'Vienna': 54418,
 'Munich': 44475,
 'Liverpool': 42271,
 'Milan': 40714,
 'Madrid': 40176,
 'Edinburgh': 34151,
 'Birmingham': 33450,
 'Barcelona': 32263,
 'Amsterdam': 31842,
 'Hamburg': 29835,
 'Venice': 29708,
 'Brussels': 29139,
 'Glasgow': 28756,
 'Leeds': 28029,
 'Frankfurt': 27836,
 'Prague': 24201,
 'Florence': 23616,
 'Bristol': 23125,
 'Sheffield': 22717,
 'Warsaw': 22160,
 'Newcastle': 22094,
 'Dublin': 22010,
 'Cologne': 21035,
 'Zurich': 20259,
 'Naples': 20235,
 'Leipzig': 18723,
 'Stockholm': 18534,
 'Nottingham': 16682,
 'Copenhagen': 16485,
 'Cardiff': 16452,
 'Geneva': 16347,
 'Dresden': 16124,
 'Stuttgart': 15630,
 'Lyon': 15562,
 'Budapest': 15507,
 'Leicester': 15347,
 'Southampton': 15106,
 'Athens': 14858,
 'Brighton': 13249,
 'Turin': 13031,
 'Bologna': 12986,
 'Antwerp': 12612,
 'Lisbon': 12522,
 'Dusseldorf': 11995,
 'Bradford': 11876,
 'Basel': 11784,
 'Belfast': 116

In [385]:
d2 = updated_df['city_2'].value_counts().to_dict()
d2

{'London': 186465,
 'Paris': 133309,
 'Berlin': 86900,
 'Rome': 59845,
 'Manchester': 58871,
 'Vienna': 54418,
 'Munich': 44475,
 'Liverpool': 42271,
 'Milan': 40714,
 'Madrid': 40176,
 'Edinburgh': 34151,
 'Birmingham': 33450,
 'Barcelona': 32263,
 'Amsterdam': 31842,
 'Hamburg': 29835,
 'Venice': 29708,
 'Brussels': 29139,
 'Glasgow': 28756,
 'Leeds': 28029,
 'Frankfurt': 27836,
 'Prague': 24201,
 'Florence': 23616,
 'Bristol': 23125,
 'Sheffield': 22717,
 'Warsaw': 22160,
 'Newcastle': 22094,
 'Dublin': 22010,
 'Cologne': 21035,
 'Zurich': 20259,
 'Naples': 20235,
 'Leipzig': 18723,
 'Stockholm': 18534,
 'Nottingham': 16682,
 'Copenhagen': 16485,
 'Cardiff': 16452,
 'Geneva': 16347,
 'Dresden': 16124,
 'Stuttgart': 15630,
 'Lyon': 15562,
 'Budapest': 15507,
 'Leicester': 15347,
 'Southampton': 15106,
 'Athens': 14858,
 'Brighton': 13249,
 'Turin': 13031,
 'Bologna': 12986,
 'Antwerp': 12612,
 'Lisbon': 12522,
 'Dusseldorf': 11995,
 'Bradford': 11876,
 'Basel': 11784,
 'Belfast': 116

In [383]:
# City pairs are reported in both directions, so a paragraph that mentions Paris and London
# yields two rows (Paris-London and London-Paris).
# Check that every city occurs equally often as city_1 and as city_2; prints nothing if so.
for key in d1:
    if d1[key] != d2.get(key):
        print(key, d1[key], d2.get(key))
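The same symmetry check can be written without an explicit loop. The sketch below is an illustration on toy data; the helper name `direction_mismatches` is hypothetical.

```python
import pandas as pd

def direction_mismatches(df):
    """Cities whose number of rows as city_1 differs from their number of rows as city_2."""
    counts_1 = df["city_1"].value_counts()
    counts_2 = df["city_2"].value_counts()
    # Align both count series on the union of city names; a missing city counts as 0
    diff = counts_1.sub(counts_2, fill_value=0)
    return diff[diff != 0]

# Toy data (hypothetical): every pair is present in both directions
toy = pd.DataFrame({
    "city_1": ["Paris", "London", "Paris", "Rome"],
    "city_2": ["London", "Paris", "Rome", "Paris"],
})

mismatches = direction_mismatches(toy)
```

An empty result means every city appears equally often in both columns. This is a necessary condition for the pairs being mirrored, not a proof that each individual pair has its counterpart.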

In [388]:
big_df_grouped_city1 = updated_df.groupby(['city_1'])

In [413]:
big_df_grouped_city1.groups.keys()

dict_keys(['A Coruna', 'Alicante', 'Amsterdam', 'Antwerp', 'Athens', 'Augsburg', 'Barcelona', 'Bari', 'Basel', 'Belfast', 'Bergamo', 'Berlin', 'Bielefeld', 'Bilbao', 'Birmingham', 'Bochum-Herne', 'Bologna', 'Bonn', 'Bordeaux', 'Bournemouth', 'Bradford', 'Brasov', 'Bratislava', 'Bremen', 'Brescia', 'Brighton', 'Bristol', 'Brno', 'Brussels', 'Bucharest', 'Budapest', 'Busto Arsizio', 'Bydgoszcz', 'Cardiff', 'Caserta', 'Castellammare di Stabia-Torre Annunziata', 'Catania', 'Charleroi', 'Cluj-Napoca', 'Cologne', 'Constanta', 'Copenhagen', 'Cordoba', 'Coventry', 'Craiova', 'Darmstadt', 'Dortmund', 'Dresden', 'Dublin', 'Duisburg', 'Dusseldorf', 'Edinburgh', 'Eindhoven', 'Essen-Oberhausen', 'Florence', 'Frankfurt', 'Galati', 'Gdansk', 'Gdynia', 'Gelsenkirchen-Bottrop', 'Geneva', 'Genoa', 'Ghent', 'Glasgow', 'Gothenburg', 'Granada', 'Grenoble', 'Hamburg', 'Hanover', 'Helsinki', 'Iasi', 'Karlsruhe', 'Katowice', 'Kaunas', 'Krakow', 'Las Palmas', 'Leeds', 'Leicester', 'Leipzig', 'Lens', 'Liege', '

In [443]:
os.listdir('../../../../data_clean/lda_models/lda_model_2millionDocs_5topics/')

['0.6448632752762254.txt',
 'lda_model_2millionDocs_5topics_dictionary.dict',
 'lda_model_2millionDocs_5topics_model.model',
 'lda_model_2millionDocs_5topics_texts.pickle',
 'lda_model_2millionDocs_5topics_visualisation.html',
 'paragraphs_lda_topic_distribution',
 'paragraphs_lda_topic_distribution2',
 'paragraphs_word_embedding_categorisation',
 'paragraphs_word_embedding_categorisation2',
 'relevant_words_per_topic.csv',
 'topic_distributions_per_paragraph.csv']

In [2]:
city_pair_output_dir = os.path.abspath("../../../../data_clean/lda_models/lda_model_2millionDocs_5topics/")
dir_name = 'city_pair_classification'
directory = os.path.join(city_pair_output_dir, dir_name)
os.makedirs(directory, exist_ok=True)

In [454]:
print(sorted(updated_df['lda_dominant'].unique()))
sorted(updated_df['embedding_dominant'].unique())

['topic_1', 'topic_2', 'topic_3', 'topic_4', 'topic_5']


['Topic 1', 'Topic 2', 'Topic 3', 'Topic 4', 'Topic 5']

In [457]:
nested_list2 = [['city_1', 'city_2', 'paragraphs', 'lemmatised_paragraph_length',
                'same_categorisation_raw', 'same_categorisation_percentage',
                'lda_dominant_category', 'embedding_dominant_category',
                'lda_topic_1', 'embedding_topic_1',
                'lda_topic_2', 'embedding_topic_2',
                'lda_topic_3', 'embedding_topic_3',
                'lda_topic_4', 'embedding_topic_4', 
                'lda_topic_5', 'embedding_topic_5']
               ]

In [478]:
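# merged_POS is stored as a string here, so parse the single cell directly.
# (Calling .apply on a plain str raised: AttributeError: 'str' object has no attribute 'apply')
literal_eval(city_pair_df['merged_POS'].iloc[0])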

In [480]:
%%time

updated_df['merged_POS'] = updated_df['merged_POS'].apply(literal_eval)

CPU times: total: 3min 4s
Wall time: 8min 29s


In [482]:
updated_df['merged_POS'][:1000].str.len().mean()

40.834
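When a dataframe is written to CSV and read back, list-valued columns such as `merged_POS` come back as strings, so `.str.len()` would count characters rather than tokens. The sketch below shows the conversion on a small, made-up frame.

```python
from ast import literal_eval

import pandas as pd

# Toy data (hypothetical): lists stored as their string representation
toy = pd.DataFrame({
    "merged_POS": ["['honey', 'label']", "[]", "['travel', 'crash', 'flight']"]
})

# Parse each stringified list back into a real Python list
toy["merged_POS"] = toy["merged_POS"].apply(literal_eval)

# .str.len() now returns the number of tokens per paragraph
token_counts = toy["merged_POS"].str.len().tolist()
```

`literal_eval` only evaluates Python literals, which makes it a safer choice than `eval` for this kind of round trip. As the timing above suggests, it is slow on millions of rows.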

In [14]:
big_df_grouped_city1 = updated_df.groupby(['city_1'])

In [15]:
%%time

from ast import literal_eval

lda_categories = sorted(updated_df['lda_dominant'].unique())
embedding_categories = sorted(updated_df['embedding_dominant'].unique())


for city1 in tqdm(list(big_df_grouped_city1.groups.keys())):
    
    nested_list2 = [['city_1', 'city_2', 'paragraphs', 'lemmatised_paragraph_length',
                'same_categorisation_raw', 'same_categorisation_percentage',
                'lda_dominant_category', 'embedding_dominant_category',
                'lda_topic_1', 'embedding_topic_1',
                'lda_topic_2', 'embedding_topic_2',
                'lda_topic_3', 'embedding_topic_3',
                'lda_topic_4', 'embedding_topic_4', 
                'lda_topic_5', 'embedding_topic_5']
               ]

    city1_df = big_df_grouped_city1.get_group(city1).sort_values(['city_2'])
    
    p = os.path.join(directory, f"{city1}_classified_aggregated.csv")
    
    linked_cities = sorted(big_df_grouped_city1.get_group(city1)['city_2'].unique())
    for city2 in tqdm(linked_cities, leave=False):
        city_pair_df = city1_df[city1_df['city_2'] == city2]

        paragraph_count = city_pair_df['paragraph'].count()
        same_categorisation_raw = city_pair_df['same_categorisation'].sum()
        same_categorisation_percentage = city_pair_df['same_categorisation'].sum()/city_pair_df['same_categorisation'].count()
        lemmatised_paragraph_len = city_pair_df['merged_POS'].str.len().mean()

        lda_prediction = city_pair_df['lda_dominant'].value_counts()
        embedding_prediction = city_pair_df['embedding_dominant'].value_counts()

        lda_dominant_category = lda_prediction.idxmax()
        embedding_dominant_category = embedding_prediction.idxmax()

        lda_prediction = lda_prediction.to_dict()
        embedding_prediction = embedding_prediction.to_dict()

        # Make sure every category has a count, even if it never occurs for this city pair
        for category in lda_categories:
            lda_prediction.setdefault(category, 0)

        for category in embedding_categories:
            embedding_prediction.setdefault(category, 0)

        temp_l2 = [city1, city2, paragraph_count, lemmatised_paragraph_len,
                    same_categorisation_raw, same_categorisation_percentage,
                    lda_dominant_category, embedding_dominant_category,
                    lda_prediction['topic_1'],  embedding_prediction['Topic 1'],
                    lda_prediction['topic_2'],  embedding_prediction['Topic 2'],
                    lda_prediction['topic_3'],  embedding_prediction['Topic 3'],
                    lda_prediction['topic_4'],  embedding_prediction['Topic 4'],
                    lda_prediction['topic_5'],  embedding_prediction['Topic 5']
                  ]


        if len(nested_list2[0]) != len(temp_l2):
            raise Exception('Not the same size!')

        nested_list2.append(temp_l2)

    final_df = pd.DataFrame(nested_list2[1:],columns=nested_list2[0])
    final_df.to_csv(p, index=False)


    #city1_df
    
    #display(city1_df)
    # city1_df_grouped = city1_df.groupby(['city_2'])
    # city1_df_grouped
    
    #for city2 in linked_cities:
    #    city_pair_df = city1_df[city1_df['city_2'] == city2]
        #display(city1_df[city1_df['city_2'] == city2])

  0%|          | 0/151 [00:00<?, ?it/s]

CPU times: total: 1min 28s
Wall time: 1min 30s


In [16]:
dataframes2 = []

for file in tqdm(os.listdir(directory)):
    if 'aggregated' in file:
        fp = os.path.abspath(os.path.join(directory, file))
        # print(fp)
        temp_df = pd.read_csv(fp)
        dataframes2.append(temp_df)
    
len(dataframes2)

  0%|          | 0/302 [00:00<?, ?it/s]

151

In [None]:
updated_df2['city_1'].unique()

In [7]:
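# Count the aggregated rows (city pairs) per file and in total.
# `total` avoids shadowing the built-in `sum`.
total = 0
for x in dataframes2:
    print(len(x))
    total += len(x)
print(total)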

79
125
272
418
566
696
845
975
1122
1258
1377
1525
1638
1779
1923
2050
2193
2334
2475
2584
2701
2794
2930
3069
3190
3318
3457
3586
3735
3878
4026
4045
4142
4276
4339
4366
4487
4600
4720
4863
4963
5109
5231
5353
5452
5582
5721
5865
6011
6130
6274
6416
6550
6685
6833
6979
7047
7180
7276
7365
7512
7653
7786
7928
8065
8194
8329
8476
8601
8745
8839
8971
9091
9215
9359
9479
9616
9746
9890
10009
10147
10286
10432
10576
10703
10852
10961
11103
11251
11383
11529
11661
11806
11920
12070
12210
12358
12468
12599
12744
12883
13025
13158
13296
13441
13549
13678
13821
13932
14081
14194
14342
14471
14608
14756
14896
15046
15189
15309
15429
15481
15617
15751
15898
16038
16185
16291
16435
16581
16680
16818
16954
17090
17206
17323
17463
17608
17745
17891
18002
18110
18232
18382
18518
18668
18802
18948
19087
19187
19319
19465
1456220


In [19]:
# Combine the per-city aggregated files into a single city-pair dataframe
updated_df2 = pd.concat(dataframes2, ignore_index=True)
updated_df2['city_1'].value_counts()[:50]

Venice        150
Vienna        150
Milan         150
Rome          150
London        149
Paris         149
Brussels      149
Barcelona     149
Madrid        148
Florence      148
Porto         148
Budapest      148
Prague        148
Berlin        148
Athens        148
Munich        148
Sofia         147
Stockholm     147
Basel         147
Geneva        147
Amsterdam     147
Hamburg       147
Manchester    146
Lisbon        146
Stuttgart     146
Frankfurt     146
Dublin        146
Copenhagen    146
Valencia      146
Zurich        146
Warsaw        146
Antwerp       146
Oslo          145
Naples        145
Turin         145
Marseille     145
Strasbourg    144
Helsinki      144
Dusseldorf    144
Krakow        144
Leipzig       144
Dresden       144
Liverpool     144
Birmingham    144
Bologna       143
Cologne       143
Palermo       143
Rotterdam     143
Bucharest     143
Lyon          142
Name: city_1, dtype: int64

In [35]:
# Build an order-independent key so that (Paris, London) and (London, Paris) collapse into one row.
# drop_duplicates keeps the first direction it encounters.
updated_df2['city_pair'] = updated_df2[['city_1', 'city_2']].apply(lambda x: ', '.join(sorted(set(x))), axis=1)
updated_df2.drop_duplicates(['city_pair'], inplace=True)
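To illustrate the de-duplication step, here is a small sketch on made-up rows. The column names follow the notebook; the values are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical): Paris-London is reported in both directions
toy = pd.DataFrame({
    "city_1": ["Paris", "London", "Rome"],
    "city_2": ["London", "Paris", "Paris"],
    "paragraphs": [10, 10, 4],
})

# Sorting the two names makes the key independent of the reporting direction
toy["city_pair"] = toy[["city_1", "city_2"]].apply(
    lambda row: ", ".join(sorted(row)), axis=1
)

# keep="first" is the default: the first direction encountered survives
deduplicated = toy.drop_duplicates("city_pair")
pairs = deduplicated["city_pair"].tolist()
```

Because only the first direction is kept, this is only lossless when both directions carry the same aggregated values. It may be worth verifying that before discarding rows.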

In [43]:
# updated_df2[updated_df2['city_2'] == 'Paris']
updated_df2.to_csv(os.path.join(city_pair_output_dir, 'classified_city_pairs_removed_duplicates.csv'), index=False)

In [51]:
city = "Barcelona"
city_pairs = ["Paris", "London"]

updated_df2[(updated_df2['city_1'] == city) | (updated_df2['city_2'] == city)]['lda_dominant_category'].value_counts()
# updated_df2[(updated_df2['city_1'].isin(city_pairs)) & (updated_df2['city_2'].isin(city_pairs))]


topic_4    74
topic_2    36
topic_5    20
topic_3    11
topic_1     8
Name: lda_dominant_category, dtype: int64

In [31]:
updated_df2[updated_df2['city_1'] == 'Warsaw']['embedding_dominant_category'].value_counts()

Topic 5    68
Topic 3    32
Topic 2    21
Topic 1    15
Topic 4    10
Name: embedding_dominant_category, dtype: int64

In [394]:
big_df_grouped_city1.get_group(('London'))['city_2'].value_counts()

Paris                                       21874
Manchester                                  10368
Edinburgh                                    9493
Berlin                                       7668
Birmingham                                   7544
                                            ...  
Craiova                                        13
Caserta                                        12
Galati                                         12
Brasov                                         11
Castellammare di Stabia-Torre Annunziata        3
Name: city_2, Length: 149, dtype: int64

In [411]:
paris_barcelona = updated_df[(updated_df['city_1'] == 'Paris') & (updated_df['city_2'] == 'Barcelona')]


In [412]:
paris_barcelona[(paris_barcelona['same_categorisation'] == True)]['embedding_dominant'].value_counts()

Topic 5    801
Topic 2    384
Topic 4    338
Topic 1    233
Topic 3    224
Name: embedding_dominant, dtype: int64

In [404]:
updated_df[(updated_df['same_categorisation'] == True)]['embedding_dominant'].value_counts()

Topic 5    556883
Topic 4    326479
Topic 1    287789
Topic 3    276797
Topic 2    271087
Name: embedding_dominant, dtype: int64

In [357]:
dataframes[0]

Unnamed: 0,paragraph_id,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,...,topic_5,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant,same_categorisation
0,934385,Toulon,Marseille,Honey from Provence is protected by a red labe...,15415861,"Authon, Alpes-de-Haute-Provence","['honey', 'label', 'indication', 'honey', 'flo...",0.995948,0.000888,0.000952,...,0.001501,topic_1,0.995948,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1,True
1,934386,Toulon,Nice,Honey from Provence is protected by a red labe...,15415861,"Authon, Alpes-de-Haute-Provence","['honey', 'label', 'indication', 'honey', 'flo...",0.950718,0.000888,0.023567,...,0.024116,topic_1,0.950718,6.492346,1.263454,1.981242,1.216145,2.069049,Topic 1,True
2,934387,Copenhagen,Oslo,"Glosimodt was born in Oslo, Norway. He studied...",15416040,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.002758,0.002761,0.002960,...,0.989310,topic_5,0.989310,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5,True
3,934388,Oslo,Copenhagen,"Glosimodt was born in Oslo, Norway. He studied...",15416040,Erik Glosimodt,"['education', 'professor', 'architect', 'pract...",0.002758,0.002761,0.002960,...,0.989310,topic_5,0.989310,1.500070,0.000000,1.286127,0.000000,4.980572,Topic 5,True
4,934389,Barcelona,Dusseldorf,"On the 24 March 2015, Germanwings Flight 9525 ...",15416113,Meolans-Revel,"['travel', 'crash']",0.017678,0.017697,0.018973,...,0.029918,topic_4,0.915734,1.018640,0.000000,0.000000,0.000000,0.000000,Topic 1,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
103815,1038200,Rome,Bologna,It publishes 15 local editions for the cities ...,19593000,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.144366,...,0.840407,topic_5,0.840407,1.272796,0.000000,0.000000,0.369280,0.961729,Topic 1,False
103816,1038201,Rome,Bari,It publishes 15 local editions for the cities ...,19593000,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.005831,...,0.978943,topic_5,0.978943,1.272796,0.000000,0.000000,0.369280,0.961729,Topic 1,False
103817,1038202,Rome,Padua,It publishes 15 local editions for the cities ...,19593000,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.282504,0.005439,0.005831,...,0.701872,topic_5,0.701872,1.272796,0.000000,0.000000,0.369280,0.961729,Topic 1,False
103818,1038203,Rome,Brescia,It publishes 15 local editions for the cities ...,19593000,Leggo,"['edition', 'city', 'circulation', 'copy', 'pu...",0.005433,0.005439,0.144366,...,0.840407,topic_5,0.840407,1.272796,0.000000,0.000000,0.369280,0.961729,Topic 1,False


In [352]:
small_df_grouped.groups
small_df_grouped_city1.get_group(('Paris')).head(5)

# ('Birmingham', 'London')

Unnamed: 0,paragraph_id,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,...,topic_5,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant,same_categorisation
86,934471,Paris,London,"1-2 Paris Theatre, London (February 1, 1973)",15420067,BBC Sessions (Rory Gallagher album),[],0.017678,0.017697,0.469755,...,0.480699,topic_5,0.480699,0.0,0.0,0.0,0.0,0.0,Topic 1,False
88,934473,Paris,London,"2-10 &amp; 2-12 Paris Theatre, London (July 13...",15420067,BBC Sessions (Rory Gallagher album),['amp'],0.017678,0.017697,0.469755,...,0.480699,topic_5,0.480699,0.0,0.0,0.0,0.0,0.0,Topic 1,False
96,934481,Paris,London,In 1869 Sharpe joined the Architectural Associ...,15420536,Edmund Sharpe,"['group', 'architect', 'self', 'education', 't...",0.000766,0.000766,0.449879,...,0.52845,topic_5,0.52845,3.002717,0.0,6.228647,4.409327,8.026495,Topic 5,True
110,934495,Paris,London,"Films shown on the main screen were ""Triangle""...",15421310,London FrightFest Film Festival,"['film', 'screen', 'trick', 'night', 'case', '...",0.002758,0.987403,0.00296,...,0.004668,topic_2,0.987403,0.50719,1.722045,1.157752,1.385758,0.405645,Topic 2,True
182,934567,Paris,London,"Live At The Paris Theatre, London, UK 1973-Feb...",15421914,The G-Man Bootleg Series Vol.1,['live'],0.46846,0.468478,0.018973,...,0.029918,topic_2,0.468478,0.0,0.583172,0.0,0.0,0.0,Topic 2,True


In [353]:
small_df_grouped_city2.get_group(('Paris')).head(5)

Unnamed: 0,paragraph_id,city_1,city_2,paragraph,article_id,title,merged_POS,topic_1,topic_2,topic_3,...,topic_5,lda_dominant,lda_dominant_score,Topic 1,Topic 2,Topic 3,Topic 4,Topic 5,embedding_dominant,same_categorisation
87,934472,London,Paris,"1-2 Paris Theatre, London (February 1, 1973)",15420067,BBC Sessions (Rory Gallagher album),[],0.46846,0.017697,0.018973,...,0.029918,topic_1,0.46846,0.0,0.0,0.0,0.0,0.0,Topic 1,True
89,934474,London,Paris,"2-10 &amp; 2-12 Paris Theatre, London (July 13...",15420067,BBC Sessions (Rory Gallagher album),['amp'],0.46846,0.468478,0.018973,...,0.029918,topic_2,0.468478,0.0,0.0,0.0,0.0,0.0,Topic 1,False
97,934482,London,Paris,In 1869 Sharpe joined the Architectural Associ...,15420536,Edmund Sharpe,"['group', 'architect', 'self', 'education', 't...",0.000766,0.000766,0.293685,...,0.704169,topic_5,0.704169,3.002717,0.0,6.228647,4.409327,8.026495,Topic 5,True
111,934496,London,Paris,"Films shown on the main screen were ""Triangle""...",15421310,London FrightFest Film Festival,"['film', 'screen', 'trick', 'night', 'case', '...",0.002758,0.987403,0.00296,...,0.004668,topic_2,0.987403,0.50719,1.722045,1.157752,1.385758,0.405645,Topic 2,True
183,934568,London,Paris,"Live At The Paris Theatre, London, UK 1973-Feb...",15421914,The G-Man Bootleg Series Vol.1,['live'],0.017678,0.017697,0.469755,...,0.480699,topic_5,0.480699,0.0,0.583172,0.0,0.0,0.0,Topic 2,False


In [344]:
# from itertools import product

# list(product(dataframes[0]['city_1'], dataframes[0]['city_2']))
# dataframes[0].groupby('city_1').head()
small_df = dataframes[0].iloc[:1000]
small_df_grouped_city1 = small_df.groupby(['city_1'])
small_df_grouped_city2 = small_df.groupby(['city_2'])
# small_df_grouped
# for subdf in small_df_grouped:
#     print(subdf)
    

In [None]:
# Requires city1/city2 update!
# Keep only paragraphs with a confident LDA prediction and enough lemmatised tokens
temp_df = updated_df[
    (updated_df['city_pair'].isin(updated_df['city_pair'].unique()[:]))
    & (updated_df['lda_dominant_score'] > bottom_lda_threshold)
    & (updated_df['merged_POS'].str.len() > minimal_paragraph_length)
]

In [None]:
temp_df

In [None]:
grouped_df = temp_df.groupby('city_pair')

In [None]:
nested_list2 = [['city_pair', 'paragraphs', 'lemmatised_paragraph_length', 'same_categorisation_raw', 'same_categorisation_percentage',
                'lda_dominant_category', 'embedding_dominant_category', 'lda_art', 'embedding_art', 'lda_diplomacy',
                'embedding_diplomacy', 'lda_education', 'embedding_education', 'lda_entertainment', 'embedding_entertainment', 
                'lda_sport', 'embedding_sport', 'lda_transportation', 'embedding_transportation']]

In [None]:
topics

In [None]:
print(list(temp_df['embedding_dominant'].unique()))
print(list(temp_df['lda_dominant'].unique()))

#### 8.2 Aggregation algorithm

In [None]:
from IPython.display import display

lda_categories = list(temp_df['lda_dominant'].unique())
embedding_categories = list(temp_df['embedding_dominant'].unique())

for city_pair, sub_df in tqdm(grouped_df):
    paragraph_count = sub_df['paragraph'].count()
    same_categorisation_raw = sub_df['same_categorisation'].sum()
    same_categorisation_percentage = sub_df['same_categorisation'].sum()/sub_df['same_categorisation'].count()
    lemmatised_paragraph_len = sub_df['merged_POS'].str.len().mean()
    
    lda_prediction = sub_df['lda_dominant'].value_counts()
    embedding_prediction = sub_df['embedding_dominant'].value_counts()
        
    lda_dominant_category = lda_prediction.idxmax()
    embedding_dominant_category = embedding_prediction.idxmax()
    
    lda_prediction = lda_prediction.to_dict()
    embedding_prediction = embedding_prediction.to_dict()
    
    if (len(lda_prediction) != len(lda_categories)):
        for category in lda_categories:
            if category not in lda_prediction.keys():
                lda_prediction[category] = 0
                
    if (len(embedding_prediction) != len(embedding_categories)):
        for category in embedding_categories:
            if category not in embedding_prediction.keys():
                embedding_prediction[category] = 0
                
    temp_l2 = [city_pair, paragraph_count, lemmatised_paragraph_len, same_categorisation_raw, same_categorisation_percentage,
                lda_dominant_category, embedding_dominant_category, lda_prediction['lda_art'], embedding_prediction['art'], lda_prediction['lda_diplomacy'],
                embedding_prediction['diplomacy'], lda_prediction['lda_education'], embedding_prediction['education'],
                lda_prediction['lda_entertainment'], embedding_prediction['entertainment'], lda_prediction['lda_sport'],
                embedding_prediction['sport'], lda_prediction['lda_transportation'], embedding_prediction['transportation']]
    if len(nested_list2[0]) != len(temp_l2):
        raise Exception('Not the same size!')
    
    nested_list2.append(temp_l2)
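The per-pair counting done in the loop can also be expressed with pandas' own aggregation tools. The following is a sketch under assumptions, not the notebook's method: it uses `pd.crosstab` on toy data, and the `categories` list and the `lda_` prefix are chosen for illustration.

```python
import pandas as pd

# Toy data (hypothetical)
toy = pd.DataFrame({
    "city_pair": ["London, Paris", "London, Paris", "London, Paris", "Paris, Rome"],
    "lda_dominant": ["topic_1", "topic_2", "topic_1", "topic_2"],
    "same_categorisation": [True, False, True, True],
})

categories = ["topic_1", "topic_2", "topic_3"]  # assumed full list of labels

# Count dominant topics per city pair; reindex adds categories that never occur as 0
counts = pd.crosstab(toy["city_pair"], toy["lda_dominant"]).reindex(
    columns=categories, fill_value=0
)

# Paragraph count and agreement statistics per city pair
summary = toy.groupby("city_pair")["same_categorisation"].agg(
    paragraphs="count",
    same_categorisation_raw="sum",
    same_categorisation_percentage="mean",
)

aggregated = summary.join(counts.add_prefix("lda_"))
aggregated["lda_dominant_category"] = counts.idxmax(axis=1)
```

The embedding counts could be added the same way with a second `crosstab`. Note that `idxmax` resolves ties by column order, which may differ from the tie-breaking of `value_counts().idxmax()` in the loop.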

In [None]:
final_df = pd.DataFrame(nested_list2[1:],columns=nested_list2[0])

final_df.head(2)

#### 8.4 Save Aggregated Dataframe

In [None]:
# final_df.to_csv('..\..\..\..\..\data\clean\deliverable_435city_pairs_both_methods_with_education_final.csv', index=False)

#### 8.5 Normalise classification (by number of paragraphs)

In [None]:
# Normalise the per-category counts by the number of paragraphs per city pair.
# Work on a copy so that final_df keeps the raw counts.
final_df_normalised = final_df.copy()
count_columns = list(final_df_normalised.columns)[7:]
final_df_normalised[count_columns] = final_df_normalised[count_columns].div(final_df_normalised['paragraphs'], axis=0)

final_df_normalised
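As a small worked example of this normalisation, the sketch below divides each count column by the paragraph count of its row. The frame and its values are made up.

```python
import pandas as pd

# Toy data (hypothetical): raw topic counts per city pair
toy = pd.DataFrame({
    "city_pair": ["London, Paris", "Paris, Rome"],
    "paragraphs": [4, 2],
    "lda_topic_1": [3, 0],
    "lda_topic_2": [1, 2],
})

count_columns = ["lda_topic_1", "lda_topic_2"]

normalised = toy.copy()  # leave the raw counts untouched
# axis=0 aligns the divisor with the rows, so each row is divided by its own paragraph count
normalised[count_columns] = normalised[count_columns].div(normalised["paragraphs"], axis=0)
```

After this step the topic columns of each row sum to 1 for a given method, which makes city pairs with very different paragraph counts comparable.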

#### 8.6 Save Normalised Aggregated Dataframe

In [None]:
# final_df_normalised.to_csv('..\..\..\..\..\data\clean\deliverable_435city_pairs_both_methods_with_education_final_normalised.csv', index=False)

# EXTRAS

#### Show closest words to topic vector

In [None]:
from scipy import spatial
from preprocessing_functions import *

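def find_closest_embeddings(embedding, cutoff=None):
    # Rank all known tokens by Euclidean distance to the given vector (closest first);
    # cutoff limits the number of returned tokens, None returns the full ranking
    return sorted(embeddings_dict.keys(), key=lambda token: spatial.distance.euclidean(embeddings_dict[token], embedding))[:cutoff]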
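To make the ranking concrete, here is a self-contained sketch with a tiny, invented embedding table. It uses `np.linalg.norm` for the Euclidean distance; `toy_embeddings` and `closest_words` are hypothetical names that are not defined in the notebook.

```python
import numpy as np

# Toy 2-dimensional "embeddings" (hypothetical values)
toy_embeddings = {
    "paint": np.array([1.0, 0.0]),
    "museum": np.array([0.9, 0.2]),
    "goal": np.array([0.0, 1.0]),
}

def closest_words(embeddings, query, cutoff=2):
    """Return the `cutoff` tokens whose vectors lie closest to `query`."""
    ranked = sorted(
        embeddings,
        key=lambda token: np.linalg.norm(embeddings[token] - query),
    )
    return ranked[:cutoff]

# A topic vector built as the mean of its seed words
topic_vector = np.mean([toy_embeddings["paint"], toy_embeddings["museum"]], axis=0)
nearest = closest_words(toy_embeddings, topic_vector)
```

Sorting the whole vocabulary costs O(n log n) per query. For a large embedding table, a partial selection (for example with `heapq.nsmallest`) is one possible way to reduce that cost.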

In [None]:
topic = 'art'

words = find_closest_embeddings(embedding=
     mean_vectors_dict[topic]    # embeddings_dict['diplomacy'] # embeddings_dict['fashion']
)[:2000]

In [None]:
english_words = get_english_words(path='../../../input/english_words_alpha_370k.txt')
print(remove_non_existing_words_from_wordlist(words, english_words))