# Methodology for computing address-similarity weights with word2vec

Objective(s)

- During the POC for the US [US 06 Union et Intersection](https://coda.io/d/CreditAgricole_dCtnoqIftTn/US-06-Union-et-Intersection_sucns), we need a table containing weights that indicate the similarity between two words. It is therefore essential to create a notebook documenting how these weights are built and the technique used.

  - A table will be created in Athena with one column for each of the two words and one column for the weight. The three new variables will be named:

    - Mot_A
    - Mot_B
    - Index_relation

- To compute the weights, use the following table XX with the variables:

  -  `adresse_distance_inpi`
  -  `adresse_distance_insee`

## Metadata

- Metadata parameters are available here: [Ressources_suDYJ#_luZqd](http://Ressources_suDYJ#_luZqd)

  - Task type:

- Jupyter Notebook

- Users:

    - [Thomas Pernet](mailto:t.pernetcoudrier@gmail.com)

- Watchers:

  - [Thomas Pernet](mailto:t.pernetcoudrier@gmail.com)

- Estimated Log points:

  - One being a simple task, 15 a very difficult one
    -  7

- Task tag

  - \#machine-learning,#word2vec,#documentation,#similarite

- Toggl Tag

  - \#variable-computation
 
  
## Input Cloud Storage [AWS/GCP]

If link from the internet, save it to the cloud first

### Tables [AWS/BigQuery]

1. Batch 1:
  * Select Provider: Athena
  * Select table(s): ets_insee_inpi
    * Select only tables created from the same notebook, else copy/paste selection to add new input tables
    * If table(s) does not exist, add them: Add New Table
    * Information:
      * Region: 
        * Name: Europe (Paris)
        * Code: eu-west-3
      * Database: inpi
      * Notebook construction file: https://github.com/thomaspernet/InseeInpi_matching/blob/master/Notebooks_matching/Data_preprocessed/programme_matching/01_preparation/03_ETS_add_variables.md
    
## Destination Output/Delivery

1. Athena: 
      * Region: Europe (Paris)
      * Database: machine_learning
      * Tables (Add name new table):
          - list_mots_insee_inpi
          - list_mots_insee_inpi_word2vec_weights

2. S3(Add new filename to Database: Ressources)
      * Origin: Jupyter notebook
      * Bucket: calfdata
      * Key: MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS
      * Filename(s): word2vec_weights_100

  
## Things to know (Steps, Attention points or new flow of information)

### Sources of information  (meeting notes, Documentation, Query, URL)

- Query [Athena/BigQuery]

  1. Link 1: [Liste ngrams](https://eu-west-3.console.aws.amazon.com/athena/home?region=eu-west-3#query/history/79a481c2-df9c-4785-993b-4c6813947770)

    - Description: Query previously used to create the list of INSEE-INPI combinations
    
1. GitHub
  * Repo: https://github.com/thomaspernet/InseeInpi_matching
  * Folder name: Notebooks_matching/Data_preprocessed/programme_matching/02_siretisation
  * Source code:  Test_word2Vec.md
2. Python Module [Module name](link)
  * Library 1: gensim

## Server connection

In [11]:
from awsPy.aws_authorization import aws_connector
from awsPy.aws_athena import service_athena
from awsPy.aws_s3 import service_s3
from pathlib import Path
import pandas as pd
import numpy as np
import os, shutil
bucket = 'calfdata'
path = os.getcwd()
parent_path = str(Path(path).parent)
path_cred = r"{}/credential_AWS.json".format(parent_path)
con = aws_connector.aws_instantiate(credential = path_cred,
                                       region = 'eu-west-3')
client= con.client_boto()
s3 = service_s3.connect_S3(client = client,
                      bucket = 'calfdata', verbose = False) 
athena = service_athena.connect_athena(client = client,
                      bucket = 'calfdata') 

# Database creation

In [None]:
query = """CREATE DATABASE IF NOT EXISTS machine_learning
  COMMENT 'DB for machine learning tests'
  LOCATION 's3://calfdata/MACHINE_LEARNING/NLP/'
  """

# Table creation

To compute the similarity between the INSEE address and the INPI address, we need to build a list of unique combinations of the two addresses. To do so, we take the variables `adresse_distance_inpi` and `adresse_distance_insee`, which have been cleaned beforehand, and concatenate them while removing redundant words. In other words, if a word appears in both addresses, we keep it only once.

The table contains about 2,836,384 possible combinations.
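
The following is a minimal Python sketch of what the Athena query computes for a single row, assuming cleaned, space-separated addresses. The helper name `unique_combination` and the sample addresses are hypothetical; in the notebook the work is done in SQL with `split`, `concat` and `array_distinct`. The sketch keeps words in order of first appearance.

```python
def unique_combination(adresse_inpi, adresse_insee):
    """Concatenate the words of both addresses and keep each word once."""
    combined = []
    for word in adresse_inpi.split(' ') + adresse_insee.split(' '):
        # A word present in both addresses is only kept the first time
        if word not in combined:
            combined.append(word)
    return combined


# Hypothetical pair of addresses describing the same place
print(unique_combination("12 BD VICTOR HUGO", "12 BOULEVARD VICTOR HUGO"))
# ['12', 'BD', 'VICTOR', 'HUGO', 'BOULEVARD']
```

Each such list then acts as one "sentence" for the training step, so `BD` and `BOULEVARD` end up sharing the same neighbouring words.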

In [None]:
create_table = False
query_combination = """
/*Combinaison mots insee inpi*/
CREATE TABLE machine_learning.list_mots_insee_inpi
WITH (
  format='PARQUET'
) AS
SELECT unique_combinaition,
COUNT(*) AS CNT
FROM (SELECT
array_distinct(
    concat(
    array_distinct(
      split(adresse_distance_inpi, ' ')
      ),
    array_distinct(
      split(adresse_distance_insee, ' ')
    )    )
  ) unique_combinaition
FROM inpi.ets_insee_inpi 
      )
      GROUP BY unique_combinaition
"""

In [None]:
if create_table:
    output = athena.run_query(
            query=query_combination,
            database='inpi',
            s3_output='INPI/sql_output'
        )

# Model computation

Once the training table is ready, we compute a vector of weights for each word. By default, we compute 100 weights per word. The weights are computed with the Word2Vec technique.

## [What Are Word Embeddings for Text?](https://machinelearningmastery.com/what-are-word-embeddings/) 

- by Jason Brownlee

## What Are Word Embeddings?

- A word embedding is a learned representation for text where words that have the same meaning have a similar representation
- It is this approach to representing words and documents that may be considered one of the key breakthroughs of deep learning on challenging natural language processing problems.

- One of the benefits of using dense and low-dimensional vectors is computational
    
- The main benefit of the dense representations is generalization power
- if we believe some features may provide similar clues, it is worthwhile to provide a representation that is able to capture these similarities
- Word embeddings are in fact a class of techniques where individual words are represented as real-valued vectors in a predefined vector space
- Each word is mapped to one vector and the vector values are learned in a way that resembles a neural network, and hence the technique is often lumped into the field of deep learning
- Key to the approach is the idea of using a dense distributed representation for each word
- Each word is represented by a real-valued vector, often tens or hundreds of dimensions. This is contrasted to the thousands or millions of dimensions required for sparse word representations, such as a one-hot encoding
- The distributed representation is learned based on the usage of words
- This allows words that are used in similar ways to result in having similar representations, naturally capturing their meaning
- This can be contrasted with the crisp but fragile representation in a bag of words model where, unless explicitly managed, different words have different representations, regardless of how they are used
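
To make the contrast concrete, here is a small sketch comparing a one-hot encoding with a dense representation. The vocabulary and the three-dimensional dense vectors are made-up values for illustration only; they do not come from the trained model.

```python
vocabulary = ["RUE", "AVENUE", "BOULEVARD", "BD"]


def one_hot(word, vocabulary):
    """Sparse representation: one dimension per word in the vocabulary."""
    return [1 if word == entry else 0 for entry in vocabulary]


def dot(u, v):
    """Dot product of two vectors of equal length."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical dense vectors (illustrative values, not model output)
dense = {
    "BOULEVARD": [0.9, 0.1, 0.3],
    "BD": [0.8, 0.2, 0.3],
    "RUE": [-0.5, 0.7, 0.1],
}

# Two different words never overlap in a one-hot encoding
print(dot(one_hot("BOULEVARD", vocabulary), one_hot("BD", vocabulary)))  # 0

# Dense vectors can place related words closer than unrelated ones
print(dot(dense["BOULEVARD"], dense["BD"]) > dot(dense["BOULEVARD"], dense["RUE"]))  # True
```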

## Word Embedding Algorithms
    
- Word embedding methods learn a real-valued vector representation for a predefined fixed sized vocabulary from a corpus of text
- The learning process is either joint with the neural network model on some task, such as document classification, or is an unsupervised process, using document statistics.


### Word2Vec

- Word2Vec is a statistical method for efficiently learning a standalone word embedding from a text corpus
- It was developed by Tomas Mikolov, et al. at Google in 2013 to make the neural-network-based training of the embedding more efficient, and it has since become the de facto standard for developing pre-trained word embeddings
- Two different learning models were introduced that can be used as part of the word2vec approach to learn the word embedding; they are:
    - Continuous Bag-of-Words, or CBOW model
    - Continuous Skip-Gram Model
- The CBOW model learns the embedding by predicting the current word based on its context
- The continuous skip-gram model learns by predicting the surrounding words given a current word
- Both models are focused on learning about words given their local usage context, where the context is defined by a window of neighboring words
- The key benefit of the approach is that high-quality word embeddings can be learned efficiently (low space and time complexity), allowing larger embeddings to be learned (more dimensions) from much larger corpora of text (billions of words).
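
The difference between the two models can be sketched by listing the training examples each one derives from a single address. This is a simplified illustration with hypothetical helper names: it uses a fixed window and leaves out the sampling and optimisation details of a real word2vec implementation.

```python
def context_window(tokens, position, window):
    """Words within `window` positions of the word at `position`."""
    start = max(0, position - window)
    return tokens[start:position] + tokens[position + 1:position + 1 + window]


def cbow_examples(tokens, window):
    """CBOW: predict the current word from its context words."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window):
    """Skip-gram: predict each surrounding word from the current word."""
    return [(tokens[i], neighbour)
            for i in range(len(tokens))
            for neighbour in context_window(tokens, i, window)]


tokens = ["12", "BD", "VICTOR", "HUGO"]
print(cbow_examples(tokens, window=1))
# [(['BD'], '12'), (['12', 'VICTOR'], 'BD'), (['BD', 'HUGO'], 'VICTOR'), (['VICTOR'], 'HUGO')]
print(skipgram_examples(tokens, window=1))
# [('12', 'BD'), ('BD', '12'), ('BD', 'VICTOR'), ('VICTOR', 'BD'), ('VICTOR', 'HUGO'), ('HUGO', 'VICTOR')]
```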


In [None]:
query = """
SELECT *
FROM machine_learning.list_mots_insee_inpi
"""

### run query
output = athena.run_query(
        query=query,
        database='machine_learning',
        s3_output='INPI/sql_output'
    )

results = False
filename = 'combinaison_adresses_insee_inpi.csv'
    
while results != True:
    source_key = "{}/{}.csv".format(
                            'INPI/sql_output',
                            output['QueryExecutionId']
                                   )
    destination_key = "{}/{}".format(
                                'MACHINE_LEARNING/NLP/LISTE_INSEE_INPI',
                                filename
                            )
        
    results = s3.copy_object_s3(
                                source_key = source_key,
                                destination_key = destination_key,
                                remove = True
                            )

Load dataframe

In [None]:
list_insee_inpi = (s3.read_df_from_s3(
            key = 'MACHINE_LEARNING/NLP/LISTE_INSEE_INPI/{}'.format(filename), sep = ',')
             )

## Train model

There are many parameters on this constructor; a few noteworthy arguments you may wish to configure are:

- size: (default 100) The number of dimensions of the embedding, e.g. the length of the dense vector to represent each token (word). This argument is named `vector_size` in gensim 4.0 and later.
- window: (default 5) The maximum distance between a target word and words around the target word.
- min_count: (default 5) The minimum count of words to consider when training the model; words with an occurrence less than this count will be ignored.
- workers: (default 3) The number of threads to use while training.
- sg: (default 0 or CBOW) The training algorithm, either CBOW (0) or skip gram (1).
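
As an illustration of the `min_count` argument, the sketch below counts tokens across a few made-up addresses and keeps only those that reach the threshold. The helper `prune_vocabulary` is hypothetical and only mirrors the idea; it is not how gensim implements it internally.

```python
from collections import Counter


def prune_vocabulary(sentences, min_count):
    """Return the set of tokens that occur at least `min_count` times."""
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token for token, count in counts.items() if count >= min_count}


# Made-up tokenised addresses
sentences = [
    ["12", "RUE", "HUGO"],
    ["3", "RUE", "PASTEUR"],
    ["RUE", "HUGO"],
]
print(sorted(prune_vocabulary(sentences, min_count=2)))  # ['HUGO', 'RUE']
```

With the default of 5, rare street names and house numbers that appear fewer than five times in the corpus get no vector at all.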

In [None]:
from gensim.models import Word2Vec
import re

In [None]:
def basic_clean(text):
    return re.sub(r'[^\w\s]|[|]', '', text).split()

In [None]:
df_text = list_insee_inpi['unique_combinaition'].apply(lambda x: basic_clean(x))

In [None]:
df_text.head()

For the POC, we use the default parameters.

In [None]:
%%time 
model = Word2Vec(df_text.tolist(),
                 size = 100,
                 window = 5,
                 min_count=5,
                 sg = 0)

We need to compute the similarity between the words found in the INPI and INSEE addresses, so we can take the model's weights and then compute the similarity with the cosine method.

The `gensim` library can export the weights to a `.txt` file. However, computing the similarities between all pairs of words (about 90,000 words) up front is not feasible, so during processing in Athena we will compute the cosine on demand.
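
A quick back-of-the-envelope check shows why precomputing every similarity is impractical. The sketch assumes the rounded vocabulary size of 90,000 quoted above and counts unordered pairs of distinct words.

```python
def pair_count(vocabulary_size):
    """Number of unordered pairs of distinct words: n * (n - 1) / 2."""
    return vocabulary_size * (vocabulary_size - 1) // 2


print(f"{pair_count(90_000):,}")  # 4,049,955,000
```

That is roughly four billion rows for a precomputed table, whereas computing the cosine on demand only needs one 100-value vector per word.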

Version 1 [DEPRECATED]

To do so, we create a csv with two columns, `words` and `list_weights`. Note that the latter is not a list in the csv, but it will be one in Athena. Athena can import a set of values into an array. If we write a list in the csv, Athena will create a list of lists. It is therefore simpler to write only two columns in the csv, the words and the weights, using `|` as the separator. The csv looks like this:

```
Words | list_of_weights
RUE | .1, .4 ......
AVENUE | .2, .9 ......
```

Version 2:

Save all the weights in columns
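
For reference, the plain-text word2vec format starts with a line holding the vocabulary size and the vector dimension, followed by one line per word with space-separated values. The sketch below parses a tiny made-up sample with the standard library only; `parse_word2vec_text` is a hypothetical helper, and the notebook itself relies on `pd.read_csv` for this step.

```python
def parse_word2vec_text(text):
    """Parse the plain-text word2vec format into (header, rows)."""
    lines = text.strip().split("\n")
    # First line: "<vocabulary size> <vector dimension>"
    vocab_size, dim = (int(value) for value in lines[0].split())
    header = ["Words"] + list(range(1, dim + 1))
    rows = []
    for line in lines[1:]:
        word, *values = line.split(" ")
        rows.append([word] + [float(value) for value in values])
    if len(rows) != vocab_size:
        raise ValueError("row count does not match the header line")
    return header, rows


# Made-up sample with 2 words and 3 dimensions
sample = "2 3\nRUE 0.1 -0.2 0.3\nAVENUE 0.4 0.5 -0.6\n"
header, rows = parse_word2vec_text(sample)
print(header)   # ['Words', 1, 2, 3]
print(rows[0])  # ['RUE', 0.1, -0.2, 0.3]
```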

In [1]:
import pandas as pd

In [None]:
model.wv.save_word2vec_format('word2vec_weights_100.txt', binary=False)

In [3]:
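# list.extend() returns None, so build the header by concatenation
list_header = ['Words'] + list(range(1, 101))

In [4]:
# The first line of the file holds "vocabulary_size dimension" and is skipped;
# `names` supplies the column labels because the file has no header row
model_wieghts = pd.read_csv('word2vec_weights_100.txt',
                            sep = ' ', skiprows= 1,
                            header= None, names= list_header)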

In [5]:
model_wieghts

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,91,92,93,94,95,96,97,98,99,100
0,RUE,-0.637436,-0.860552,-1.090971,0.349628,-1.104881,-0.183897,0.758681,-0.348887,0.739862,...,0.596379,0.346608,0.210783,0.943285,-0.937898,0.786427,2.142715,-0.118549,-0.211833,1.362251
1,AVENUE,-0.737605,-0.276738,0.049705,-0.087973,-1.504733,0.482779,2.072946,1.858653,-0.150254,...,0.928697,-0.357324,0.556154,0.747054,-0.820126,1.575219,3.133704,0.924360,-1.321395,2.658577
2,ROUTE,-0.099740,1.810359,-1.529623,1.874255,-1.145338,-0.093540,0.157584,0.773121,-0.499695,...,1.783117,-0.816026,-0.036190,0.665525,-2.628345,-2.074172,0.740013,1.058702,1.097924,-2.042516
3,CHEMIN,0.436964,0.965870,-2.191946,0.546848,-1.283836,1.744163,0.339656,-0.150173,-1.982336,...,0.376425,-0.693189,-0.456226,-0.556797,-2.124781,0.063995,1.581101,0.347200,-0.611052,-1.175260
4,D,0.579891,-1.214994,0.598045,-0.126123,0.056561,-1.535403,1.749857,-2.412028,0.862033,...,1.043439,-1.224786,-0.218579,2.960337,-0.562912,3.300653,3.017200,-0.641307,2.912423,0.462413
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
97702,PVC,0.020943,0.048794,0.004120,0.022559,-0.087283,-0.075291,0.066714,0.048444,0.000131,...,-0.038375,0.014817,-0.015409,0.056657,-0.027957,-0.009567,-0.053609,0.028431,0.101723,-0.011203
97703,REMAUDIERE,0.011948,-0.001168,-0.000731,-0.022410,-0.008706,-0.018876,-0.037489,0.027611,0.008421,...,0.111240,-0.018480,0.038136,0.035323,-0.075554,0.023810,0.028565,0.018971,-0.005017,-0.089626
97704,QUERANT,0.017093,-0.014134,-0.030588,-0.013255,-0.026493,-0.024442,0.046053,0.014300,0.064410,...,0.090521,0.027626,0.029488,0.024999,-0.024425,-0.000887,0.010660,0.041978,0.074392,-0.081879
97705,GRAFOUILLERE,0.040421,-0.073920,-0.038152,-0.050909,-0.015952,-0.026144,-0.065605,-0.035652,-0.006728,...,0.069758,-0.019827,0.045236,-0.007118,-0.008090,0.003411,0.018797,0.041937,0.011629,-0.040715


In [6]:
#zipped_weight = list(
#    zip(
#    model_wieghts.set_index(0).index.values.tolist(),
#    model_wieghts.set_index(0).values.tolist()
#)
#    )

In [7]:
#(pd.DataFrame(zipped_weight)
#  .rename(columns= {0:'words', 1: 'list_weights'})
#  .assign(list_weights = lambda x:x['list_weights'].apply(lambda x: ','.join(map(str, x))))
#  .to_csv('word2vec_weights_100.csv', index = False, sep = "|")       
# )

In [9]:
model_wieghts.to_csv('word2vec_weights_100.csv', index = False)  

The model is stored at the following location [MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS](https://s3.console.aws.amazon.com/s3/buckets/calfdata/MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS/?region=eu-west-3&tab=overview)

In [12]:
s3.upload_file('word2vec_weights_100.csv',
               'MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS')

## Create tables weights

Athena cannot create float arrays from a csv file, so we use the `concat` function. This is a workaround for the POC.

We create a temporary table, `list_mots_insee_inpi_word2vec_weights_temp`, that holds all the weights as columns, then concatenate the columns into the table `list_mots_insee_inpi_word2vec_weights`.
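
To see the shape of the concatenation query without reading 100 columns, here is a compact sketch that builds the same kind of statement with `str.join`. The function `build_concat_query` is a hypothetical helper, shown with three dimensions so the generated SQL stays readable; the column names follow the `vec_{i}` pattern of the temporary table.

```python
def build_concat_query(n_dims, source, target):
    """Build a CTAS statement that folds vec_0..vec_{n-1} into one array."""
    # One single-element array per column, merged by CONCAT
    arrays = ",\n  ".join(f"ARRAY[vec_{i}]" for i in range(n_dims))
    return (
        f"CREATE TABLE {target}\n"
        "WITH (format='PARQUET') AS\n"
        "SELECT words,\n"
        f"CONCAT(\n  {arrays}\n) AS list_weights\n"
        f"FROM {source}"
    )


query_sketch = build_concat_query(
    n_dims=3,
    source="machine_learning.list_mots_insee_inpi_word2vec_weights_temp",
    target="machine_learning.list_mots_insee_inpi_word2vec_weights",
)
print(query_sketch)
```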

In [29]:
top = """
CREATE EXTERNAL TABLE IF NOT EXISTS machine_learning.list_mots_insee_inpi_word2vec_weights_temp (

`Words` string,
"""
middle = ""

for i in range(0,100):
    if i == 99:
        middle += "vec_{} float )".format(i)
    else:
        middle += "vec_{} float,".format(i)
bottom = """
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
    WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '"'
   ) 
     LOCATION 's3://calfdata/MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS'
     TBLPROPERTIES ('has_encrypted_data'='false',
              'skip.header.line.count'='1');
""" 
query = top + middle +bottom
output = athena.run_query(
        query=query,
        database='machine_learning',
        s3_output='INPI/sql_output'
    )

Execution ID: f606a67a-fe70-4c70-9b1b-8fd59a448631


In [30]:
query = """

CREATE TABLE machine_learning.list_mots_insee_inpi_word2vec_weights
WITH (
  format='PARQUET'
) AS
SELECT words,
CONCAT(

"""
middle = ""
for i in range(0, 100):
    if i ==99:
        middle  = "ARRAY[vec_{}]) as list_weights".format(i)
    else:
        middle  = "ARRAY[vec_{}],".format(i)
    query += middle
bottom = """
FROM "machine_learning"."list_mots_insee_inpi_word2vec_weights_temp"
"""
query += bottom
output = athena.run_query(
        query=query,
        database='machine_learning',
        s3_output='INPI/sql_output'
    )

Execution ID: fff81fc0-6f95-4d37-8f88-77f531fa6bb1


In [31]:
output = athena.run_query(
        query="DROP TABLE `list_mots_insee_inpi_word2vec_weights_temp`;",
        database='machine_learning',
        s3_output='INPI/sql_output'
    )

Execution ID: 88b07a18-0872-4bda-9ab5-e1d4c1b7e50a


In [None]:
#query = """
#CREATE EXTERNAL TABLE IF NOT EXISTS machine_learning.list_mots_insee_inpi_word2vec_weights (

#`Words` string,
#`list_weights` array<string>
#  )

#ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
#     WITH SERDEPROPERTIES (
#      'serialization.format' = ',',
#      'field.delim' = '|') 
#     LOCATION 's3://calfdata/MACHINE_LEARNING/NLP/WORD2VEC_WEIGHTS'
#     TBLPROPERTIES ('has_encrypted_data'='false', 
#     'skip.header.line.count'='1')
#     
#"""
#output = athena.run_query(
#        query=query,
#        database='machine_learning',
#        s3_output='INPI/sql_output'
#    )

# Model analysis

The `gensim` library has a built-in function to compute similarity. However, we have to compute the cosine manually in Athena because there is no SQL function for it.

## Similarity - cosine

The cosine similarity is computed as follows:

$$\frac{u \cdot v}{\|u\|_{2}\|v\|_{2}}$$

- Source 1: Calcul cosine:
  - [Cosine similarity](https://en.wikipedia.org/wiki/Cosine_similarity)
  - [Magnitude (mathematics)](https://en.wikipedia.org/wiki/Magnitude_(mathematics)#Euclidean_vector_space)
  - [Scipy Cosine](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html)
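
The formula can be written out with the standard library alone, which is close to what has to be reproduced in SQL: a dot product divided by the product of the two Euclidean norms. The vectors below are toy values chosen so the result is easy to verify by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v: (u . v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0, same direction
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0, orthogonal
print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))  # 0.96, i.e. 24 / (5 * 5)
```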

In [None]:
from scipy.spatial import distance
from sklearn.manifold import TSNE
import numpy as np
import matplotlib.pyplot as plt

In [None]:
import warnings
warnings.filterwarnings('ignore')

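Below is an example of the first 10 weights of the words `BOULEVARD` and `BD`.

In [None]:
# Word vectors live on `model.wv`; indexing the model directly is deprecated in gensim
model.wv['BOULEVARD'][:10]

In [None]:
model.wv['BD'][:10]

In [None]:
# scipy returns the cosine distance, so similarity = 1 - distance
1 - distance.cosine(model.wv['BOULEVARD'], model.wv['BD'])

In [None]:
model.wv.similarity('BOULEVARD', 'BD')

Manual computation

In [None]:
# Dot product divided by the product of the two Euclidean norms
u, v = model.wv['BD'], model.wv['BOULEVARD']
np.dot(u, v) / (np.sqrt(np.sum(np.square(u))) * np.sqrt(np.sum(np.square(v))))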

t-SNE plot of the similarities between words

In [None]:
def display_closestwords_tsnescatterplot(model, word, size):
    """Plot `word` and its closest neighbours in 2D using t-SNE.

    `size` must equal the dimension of the word vectors (100 here).
    """
    fig = plt.figure(figsize=(10,10))

    arr = np.empty((0,size), dtype='f')
    word_labels = [word]
    # Most similar words as (word, score) pairs, 10 by default
    close_words = model.wv.similar_by_word(word)
    arr = np.append(arr, np.array([model.wv[word]]), axis=0)
    for wrd_score in close_words:
        wrd_vector = model.wv[wrd_score[0]]
        word_labels.append(wrd_score[0])
        arr = np.append(arr, np.array([wrd_vector]), axis=0)

    # Project the vectors onto two dimensions
    tsne = TSNE(n_components=2, random_state=0)
    np.set_printoptions(suppress=True)
    Y = tsne.fit_transform(arr)
    x_coords = Y[:, 0]
    y_coords = Y[:, 1]
    plt.scatter(x_coords, y_coords)
    for label, x, y in zip(word_labels, x_coords, y_coords):
        plt.annotate(label, xy=(x, y), xytext=(0, 0), textcoords='offset points')
    # Small margin on both sides so that the extreme points stay visible
    plt.xlim(x_coords.min()-0.00005, x_coords.max()+0.00005)
    plt.ylim(y_coords.min()-0.00005, y_coords.max()+0.00005)
    plt.show()

In [None]:
display_closestwords_tsnescatterplot(model = model,
                                     word = 'BOULEVARD',
                                     size = 100)

In [None]:
display_closestwords_tsnescatterplot(model = model,
                                     word = 'AVENUE',
                                     size = 100)

In [None]:
display_closestwords_tsnescatterplot(model = model,
                                     word = 'ZI',
                                     size = 100)

In [None]:
display_closestwords_tsnescatterplot(model = model,
                                     word = 'APPART',
                                     size = 100)

In [None]:
display_closestwords_tsnescatterplot(model = model,
                                     word = 'CDT',
                                     size = 100)

# Report generation

In [None]:
import os, time, shutil, urllib, ipykernel, json
from pathlib import Path
from notebook import notebookapp

In [None]:
def create_report(extension = "html"):
    """
    Create a report from the current notebook and save it in the 
    Report folder (Parent-> child directory)
    
    1. Extract the current notebook name
    2. Convert the Notebook 
    3. Move the newly created report
    
    Args:
    extension: string. Can be "html", "pdf", "md"
    
    
    """
    
    ### Get notebook name
    connection_file = os.path.basename(ipykernel.get_connection_file())
    kernel_id = connection_file.split('-', 1)[0].split('.')[0]

    for srv in notebookapp.list_running_servers():
        try:
            if srv['token']=='' and not srv['password']:  
                req = urllib.request.urlopen(srv['url']+'api/sessions')
            else:
                req = urllib.request.urlopen(srv['url']+ \
                                             'api/sessions?token=' + \
                                             srv['token'])
            sessions = json.load(req)
            notebookname = sessions[0]['name']
        except:
            pass  
    
    sep = '.'
    path = os.getcwd()
    #parent_path = str(Path(path).parent)
    
    ### Path report
    #path_report = "{}/Reports".format(parent_path)
    #path_report = "{}/Reports".format(path)
    
    ### Path destination
    name_no_extension = notebookname.split(sep, 1)[0]
    source_to_move = name_no_extension +'.{}'.format(extension)
    dest = os.path.join(path,'Reports', source_to_move)
    
    ### Generate notebook
    os.system('jupyter nbconvert --no-input --to {} {}'.format(
    extension,notebookname))
    
    ### Move notebook to report folder
    #time.sleep(5)
    shutil.move(source_to_move, dest)
    print("Report available at this address:\n {}".format(dest))

In [None]:
create_report(extension = "html")