### Similarity analysis

Distance measures play an important role in machine learning.

They provide the foundation for many popular and effective machine learning algorithms, such as k-nearest neighbors (k-NN) for supervised learning and k-means clustering for unsupervised learning.

The distance measure must be chosen according to the type of data being compared.


#### Hamming distance


The Hamming distance measures the distance between two binary vectors, also called binary strings or bit strings for short. As computed here, it is the fraction of positions at which the two vectors differ.

$$\text{HammingDistance} = \frac{1}{N} \sum_{i=1}^{N} \lvert v_1[i] - v_2[i] \rvert$$

In [1]:
from scipy.spatial.distance import hamming
# Data
row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# compute the distance
dist = hamming(row1, row2)
print(dist)

0.3333333333333333
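
As a cross-check on the formula, here is a minimal pure-Python sketch of the Hamming distance. The helper name `hamming_distance` is an illustration and not part of SciPy; on the two rows above it should agree with `scipy.spatial.distance.hamming`.

```python
def hamming_distance(v1, v2):
    """Fraction of positions at which two equal-length bit vectors differ."""
    if len(v1) != len(v2):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(v1, v2)) / len(v1)

row1 = [0, 0, 0, 0, 0, 1]
row2 = [0, 0, 0, 0, 1, 0]
# The rows differ in the last two positions, so 2 of 6 positions differ.
print(hamming_distance(row1, row2))
```

Dividing by `N` keeps the result between 0 and 1, whatever the vector length.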


#### Euclidean distance


The Euclidean distance measures the distance between two real-valued vectors.


$$\text{EuclideanDistance} = \sqrt{\sum_{i=1}^{N} (v_1[i] - v_2[i])^2}$$

You are most likely to use the Euclidean distance when computing the distance between two data vectors that hold real values.

In [2]:
from scipy.spatial.distance import euclidean
# data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# compute the distance
dist = euclidean(row1, row2)
print(dist)

6.082762530298219
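
The same value can be reproduced directly from the formula. The sketch below uses only the standard library; `euclidean_distance` is a name chosen for illustration.

```python
import math

def euclidean_distance(v1, v2):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Squared differences: 4 + 16 + 9 + 4 + 4 = 37, so the distance is sqrt(37).
print(euclidean_distance(row1, row2))
```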


#### Manhattan distance (taxicab or city block distance)

The Manhattan distance, also called the taxicab distance or city block distance, measures the distance between two real-valued vectors.

$$\text{ManhattanDistance} = \sum_{i=1}^{N} \lvert v_1[i] - v_2[i] \rvert$$

The taxicab name refers to the intuition behind the measure: the shortest path a taxi would take between city blocks (coordinates on a grid, like a chessboard).

In [3]:
from scipy.spatial.distance import cityblock
# data
row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# compute the distance
dist = cityblock(row1, row2)
print(dist)

13
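
A short standard-library sketch of the same computation, with `manhattan_distance` as an illustrative name. It also compares the result with the straight-line (Euclidean) distance on the same rows, since moving along grid lines can never be shorter than the direct path.

```python
import math

def manhattan_distance(v1, v2):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(v1, v2))

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
# Absolute differences: 2 + 4 + 3 + 2 + 2 = 13
taxicab = manhattan_distance(row1, row2)
straight_line = math.sqrt(sum((a - b) ** 2 for a, b in zip(row1, row2)))
print(taxicab, straight_line)
```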


When computing the distance between two observations, the columns may hold different data types (real, boolean, categorical, and ordinal values). A different distance measure may be needed for each column, and the per-column results are then summed into a single distance score.
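
One possible way to combine per-column distances is sketched below. The column types, value ranges, and observations are invented for the example, and dividing numeric and ordinal differences by the column range is an assumption made so that every column contributes at most 1.

```python
def mixed_distance(obs1, obs2, column_types, ranges):
    """Sum one distance per column, choosing the measure from the column type."""
    total = 0.0
    for a, b, kind, span in zip(obs1, obs2, column_types, ranges):
        if kind in ("numeric", "ordinal"):
            # Absolute difference, rescaled by the column range (assumed known)
            total += abs(a - b) / span
        else:
            # Boolean or categorical: 0 if equal, 1 otherwise
            total += 0.0 if a == b else 1.0
    return total

# Hypothetical columns: age, remote work allowed, city, seniority level (0 to 4)
column_types = ["numeric", "boolean", "categorical", "ordinal"]
ranges = [50.0, None, None, 4.0]
obs1 = [30.0, True, "Paris", 1]
obs2 = [40.0, False, "Paris", 3]
# 10/50 + 1 + 0 + 2/4 = 1.7
print(mixed_distance(obs1, obs2, column_types, ranges))
```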

Numeric values may have different scales. This can have a large impact on the distance computation, so it is often recommended to normalize numeric values before computing the distance.
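
The effect of scale can be seen on a small, made-up example. The sketch below assumes min-max scaling as the normalization (other choices, such as standardization, are equally valid) and compares Euclidean distances before and after rescaling.

```python
import math

def euclidean_distance(v1, v2):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def min_max_scale(rows):
    """Rescale every column of a list of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

# Hypothetical observations: (age in years, income in euros)
rows = [[25, 30000], [60, 30500], [26, 45000]]
raw_01 = euclidean_distance(rows[0], rows[1])
raw_02 = euclidean_distance(rows[0], rows[2])

scaled = min_max_scale(rows)
scaled_01 = euclidean_distance(scaled[0], scaled[1])
scaled_02 = euclidean_distance(scaled[0], scaled[2])

print(raw_01, raw_02)        # income dominates the raw distances
print(scaled_01, scaled_02)  # after scaling, both gaps weigh about the same
```

On the raw values, the 35-year age gap is almost invisible next to the income column. After scaling, it counts as much as the 15,000-euro income gap.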

A numeric error in a regression problem can also be seen as a distance. For example, the error between the expected value and the predicted value is a one-dimensional distance that can be summed or averaged over all observations in a batch to obtain a total distance between the expected and predicted results.

Error measures such as the mean squared error or the mean absolute error therefore resemble standard distance measures.
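
To make the link concrete, the sketch below computes the mean absolute error and the mean squared error from the Manhattan and Euclidean distances between two vectors. The expected and predicted values are invented for the example.

```python
import math

# Hypothetical regression targets and predictions
expected = [3.0, 5.0, 2.5, 7.0]
predicted = [2.5, 5.0, 4.0, 8.0]
n = len(expected)

manhattan = sum(abs(e - p) for e, p in zip(expected, predicted))
euclidean = math.sqrt(sum((e - p) ** 2 for e, p in zip(expected, predicted)))

mae = manhattan / n       # mean absolute error = Manhattan distance / n
mse = euclidean ** 2 / n  # mean squared error = squared Euclidean distance / n
print(mae, mse)
```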

#### Notes on cosine similarity


- Dot product and the notion of collinearity


The collinearity of two vectors can be measured with the cosine of the angle between them: the dot product of the two vectors divided by the product of their norms.

In [4]:
from numpy import dot, linalg

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]

print(dot(row1, row2))

print(dot(row1, row2)/(linalg.norm(row1)*linalg.norm(row2)))

985
0.9932534635884738
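
The same computation can be written without NumPy. The sketch below uses an illustrative helper, `cosine_similarity_vec`, and also checks that the measure depends only on direction: rescaling a vector leaves the cosine unchanged.

```python
import math

def cosine_similarity_vec(v1, v2):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

row1 = [10, 20, 15, 10, 5]
row2 = [12, 24, 18, 8, 7]
print(cosine_similarity_vec(row1, row2))

# A vector and a scaled copy of itself are collinear: cosine close to 1
doubled = [2 * x for x in row1]
print(cosine_similarity_vec(row1, doubled))
```

This is why cosine similarity suits text comparison: a long document and a short one with the same word proportions point in the same direction.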


#### Applications
 
- Collect job postings from Indeed
- Collect profiles or CVs from www.lesbonsfreelances.com/ or another site
- Filter and analyze the CV and job-posting datasets
- Clean these documents intelligently
- Featurize your text:
    - Word count
    - TF-IDF
- Evaluate the similarity between a CV and several job postings (or vice versa) using different metrics:
    - Cosine
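
The last two steps can be sketched on toy data before running them on the real corpus. The texts below are invented, the featurization is a plain word count, and the helper names are illustrative only.

```python
import math
from collections import Counter

def word_counts(text):
    """Bag-of-words featurization: token -> number of occurrences."""
    return Counter(text.lower().split())

def cosine_counts(c1, c2):
    """Cosine similarity between two word-count dictionaries."""
    dot = sum(c1[w] * c2[w] for w in c1.keys() & c2.keys())
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Hypothetical cleaned CV and job postings
cv = "graphiste web freelance logo site web"
postes = {
    "Webdesigner": "graphiste web site logo",
    "Comptable": "comptable tresorerie facturation",
}

cv_counts = word_counts(cv)
scores = {titre: cosine_counts(cv_counts, word_counts(texte))
          for titre, texte in postes.items()}
best = max(scores, key=scores.get)
print(best, scores)
```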




In [5]:
import os
import pandas as pd 

import re
from unidecode import unidecode
from nlp_tools import *

path="Postes/"
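Poste = {}
# Read every job posting; the files are assumed to be UTF-8 encoded.
# Files that cannot be opened or decoded are skipped.
for nom in os.listdir(path):
    try:
        with open(os.path.join(path, nom), 'r', encoding='utf-8') as f:
            Poste[nom] = f.read()
    except (OSError, UnicodeDecodeError):
        continue

Postes = pd.DataFrame({"Titre": list(Poste.keys()), "Content": list(Poste.values())}).dropna()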
Postes.head()




Unnamed: 0,Titre,Content
0,(H_F)_DELEGUE(E)_COMMERCIAL(E)_COSMETIQUES_Nor...,"Rattaché(e) au Manager commercial France, vou..."
1,1_Educateur_de_Jeunes_Enfants_(H_F),"L'ASSOCIATION JEAN-COTXETrecherche,au sein du ..."
2,1_Travailleur_social_(H_F),"L'ASSOCIATION JEAN-COTXET recherche, au sein d..."
3,2e_Fonde_de_pouvoir_H_F,La Caf de Seine-Saint-Denis compte 384 000 al...
4,2_Educateurs_(H_F),"L'ASSOCIATION JEAN-COTXET recherche, au sein d..."


In [6]:
Candidats = pd.read_excel('Candidats.xlsx').dropna()
Candidats.head()

Unnamed: 0,Titre,Nom,CV
0,Graphiste Print et Web,sandrat3,Références GRAPHISTE et WEBDESIGNER FREELANCE ...
1,Assistante administrative et juridique,camillel9,Références - Assistante juridique en cabinet c...
2,Comptable Clients - optimisation relances - re...,celinef5,Références De 16/11/2020 à Aujourd’hui Gérant...
3,Graphiste & Chargée de communication,violainel,Références Compétences en Graphisme & en Commu...
4,Community manager,carlav,Références C à Vous : Community manager (Live ...


## Remove Duplicates

In [7]:
Postes.drop_duplicates('Content', inplace=True)
Candidats.drop_duplicates('CV', inplace=True)

In [8]:
print(f"Postes dimension : {Postes.shape}")
print(f"Candidates dimension : {Candidats.shape}")

Postes dimension : (763, 2)
Candidates dimension : (151, 3)


## Cleaning

In [9]:
Candidats['CV'].iloc[0]

"Références GRAPHISTE et WEBDESIGNER FREELANCE depuis 2012.  Pour voir mes derniers travaux, merci de consulter mon site www.start-portfolio.com   2015 : ASSISTANTE MARKETING  Au sein d’INNOVZEN, start-up spécialisée dans le secteur du bien-être, seconder la Directrice marketing dans la préparation du Salon Zen à Paris et du CES à Las Vegas. Réaliser les visuels du kit salon, concevoir la plaquette commerciale, les cartes de visite, les flyers. Créer un blog, alimenter les réseaux sociaux et le site web.   2013 : GRAPHISTE  Être force de proposition pour seconder une agence de communication concernant la création du catalogue de la société URBAN-NT, fabrication et pose de mobilier urbain.  Conception de la couverture, mise en page, détourage et retouches avancées de toutes les photos, création des pictos.   Refonte du logo du groupe suivant un cahier des charges précis.   2008-2011 : CHARGÉE DE COMMUNICATION (Lycée DHUODA, GRETA NÎMES, CFA du GARD, CBEN)  Réaliser des maquettes en vue 

In [10]:
cleaner(Candidats['CV'].iloc[0])

'reference graphiste webdesigner freelance voir travail consulter site www start portfolio com assistant marketing innovzen start up specialise secteur bien seconder directeur marketing preparation salon zen pari vegas realiser visuel kit salon concevoir plaquette commercial carte visiter flyers creer blog alimenter reseau social site web graphiste forcer proposition seconder agencer communication concerner creation cataloguer societe urban nt fabrication poser mobilier urbain conception couverture miser page detourage retoucher avancee photo creation pictos refonte logo grouper cahier charger precis charge communication lycee dhuoda greta nimes cfa gard cben realiser maquette vue impression afficher flyers plaquette commercial brochure plv creer logos site internet greta nimes camargue priser vue retoucher photo journal lycee etudes formation webmaster administrateur reseau cnam montpellier obtention certificat professionnel formation graphiste maquettiste pao centrer imager nimes obt
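
The `cleaner` function comes from the `nlp_tools` module, whose code is not shown here. As a rough idea of what such a step might involve, the sketch below lowercases the text, strips accents, drops non-letter characters, and removes a few stop words. It is a hypothetical stand-in, not the actual implementation: the stop-word list is a tiny sample, and there is no lemmatization, whereas the output above suggests the real function lemmatizes (for example "travaux" becomes "travail").

```python
import re
import unicodedata

# Tiny, illustrative stop-word list (an assumption, not the real one)
STOPWORDS = {"et", "de", "la", "le", "les", "des", "du", "en", "pour",
             "mes", "mon", "depuis"}

def simple_cleaner(text):
    """Lowercase, strip accents, keep letters only, drop stop words."""
    text = unicodedata.normalize("NFKD", text.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z]+", " ", text)
    tokens = [t for t in text.split() if t not in STOPWORDS and len(t) > 1]
    return " ".join(tokens)

print(simple_cleaner("Références GRAPHISTE et WEBDESIGNER FREELANCE depuis 2012."))
```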

In [11]:
Postes['Content_net']=Postes['Content'].apply(cleaner)
Candidats['CV_net']=Candidats['CV'].apply(cleaner)
Candidats.head()

Unnamed: 0,Titre,Nom,CV,CV_net
0,Graphiste Print et Web,sandrat3,Références GRAPHISTE et WEBDESIGNER FREELANCE ...,reference graphiste webdesigner freelance voir...
1,Assistante administrative et juridique,camillel9,Références - Assistante juridique en cabinet c...,reference assistant juridique cabinet comptabl...
2,Comptable Clients - optimisation relances - re...,celinef5,Références De 16/11/2020 à Aujourd’hui Gérant...,reference gerant sarlu recouvrement positif ge...
3,Graphiste & Chargée de communication,violainel,Références Compétences en Graphisme & en Commu...,reference competence graphisme communication d...
4,Community manager,carlav,Références C à Vous : Community manager (Live ...,reference community manager live tweets graphi...


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

vectorizer_tfidf = TfidfVectorizer(min_df=3)
vectorizer_tfidf.fit(list(Candidats['CV_net'])+list(Postes['Content_net']))


X_CV = vectorizer_tfidf.transform(Candidats['CV_net'])
X_Postes=vectorizer_tfidf.transform(Postes['Content_net'])
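
To see what `min_df` and the TF-IDF weighting do, here is a small pure-Python sketch on an invented corpus. It mirrors what is understood to be scikit-learn's default weighting (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row), but it tokenizes on whitespace only, so it is a simplification rather than a replacement for `TfidfVectorizer`.

```python
import math
from collections import Counter

def tfidf_vectors(docs, min_df=1):
    """Return (vocabulary, L2-normalized TF-IDF rows) for a list of texts."""
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # min_df drops terms that appear in fewer than min_df documents
    vocab = sorted(t for t, count in df.items() if count >= min_df)
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1 for t in vocab}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        vectors.append([x / norm if norm else 0.0 for x in row])
    return vocab, vectors

# Hypothetical cleaned documents
docs = ["graphiste web logo", "graphiste print logo",
        "comptable logo paie", "comptable paie"]
vocab, vectors = tfidf_vectors(docs, min_df=2)
print(vocab)       # "web" and "print" occur in one document only and are dropped
print(vectors[0])  # "graphiste" outweighs the more common "logo"
```

Fitting on CVs and job postings together, as in the cell above, gives both collections the same vocabulary and the same IDF weights, which is what makes their vectors comparable.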

In [13]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_CV_feat=pd.DataFrame(X_CV.toarray(), columns=vectorizer_tfidf.get_feature_names_out(),index=Candidats['Titre'])
X_CV_feat.head()

Titre,abilit,able,abonnement,aborder,aboutir,aboutissement,absence,acad,academy,acc,...,york,you,youa,your,youtube,yu,yvelines,zcspumdhsw,zen,zone
Graphiste Print et Web,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.089351,0.0
Assistante administrative et juridique,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Comptable Clients - optimisation relances - recouvrement,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Graphiste & Chargée de communication,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Community manager,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [14]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
X_Postes_feat=pd.DataFrame(X_Postes.toarray(), columns=vectorizer_tfidf.get_feature_names_out(),index=Postes['Titre'])
X_Postes_feat.head()

Titre,abilit,able,abonnement,aborder,aboutir,aboutissement,absence,acad,academy,acc,...,york,you,youa,your,youtube,yu,yvelines,zcspumdhsw,zen,zone
(H_F)_DELEGUE(E)_COMMERCIAL(E)_COSMETIQUES_Nord-Nord-Ouest,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_Educateur_de_Jeunes_Enfants_(H_F),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1_Travailleur_social_(H_F),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2e_Fonde_de_pouvoir_H_F,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2_Educateurs_(H_F),0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [15]:
import numpy as np

row1 = X_Postes_feat.loc['Comptable_tresorier_H_F']
row2 = X_CV_feat.loc["Comptable Clients - optimisation relances - recouvrement"]

np.dot(row1,row2)/(np.linalg.norm(row1)*np.linalg.norm(row2))

0.11138293677795506

In [16]:
from sklearn.metrics.pairwise import cosine_similarity

Resultats=pd.DataFrame(cosine_similarity(X_Postes_feat,X_CV_feat),index=Postes['Titre'],columns=Candidats['Titre'])
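
The orientation of the result matters for the next cells: one row per job posting and one column per CV. The sketch below reproduces that layout on tiny invented vectors with a pure-Python pairwise cosine. `cosine_matrix` is an illustrative helper, not the scikit-learn function.

```python
import math

def cosine_matrix(rows_a, rows_b):
    """Pairwise cosine similarities: result[i][j] compares rows_a[i] with rows_b[j]."""
    def unit(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v] if norm else list(v)
    units_a = [unit(v) for v in rows_a]
    units_b = [unit(v) for v in rows_b]
    return [[sum(x * y for x, y in zip(a, b)) for b in units_b] for a in units_a]

# Hypothetical feature vectors: 2 job postings and 3 CVs over 3 terms
postes_vecs = [[1, 0, 1], [0, 2, 0]]
cv_vecs = [[2, 0, 2], [0, 1, 0], [1, 1, 0]]

sims = cosine_matrix(postes_vecs, cv_vecs)
print(len(sims), len(sims[0]))  # number of postings, number of CVs
print(sims)
```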

In [17]:
Resultats.T

Titre,(H_F)_DELEGUE(E)_COMMERCIAL(E)_COSMETIQUES_Nord-Nord-Ouest,1_Educateur_de_Jeunes_Enfants_(H_F),1_Travailleur_social_(H_F),2e_Fonde_de_pouvoir_H_F,2_Educateurs_(H_F),2_TISF_(H_F),2_Travailleurs_Sociaux_(H_F),3_ASSISTANTS_SOCIAUX_(H_F)_AU_SEIN_DU_POLE_MEDECINE,3_Ingenieurs_d'affaires_(H_F),575270_-_Commercial_-_Paris,...,Verificateur_d'appareils_Extincteurs_H_F,Verificateur_Systemes_de_desenfumage_H_F,vip_assistant_concierge_luxe_luxury_specialist_paris,Visiteur_terrain_H_F,"Vous_etes_ouvrier_en_demolition,_Wanty_a_besoin_de_vous_pour_ses_chantiers_en_region_parisienne_!",Webdesigner___Graphiste_H_F,WEBMASTER_(H_F),Webmaster_H_F,Welcome_Manager_Barista,Zootechnicien(ne)s_H_F_Rongeurs
Graphiste Print et Web,0.025176,0.012973,0.023642,0.048460,0.020745,0.022329,0.026828,0.073224,0.035169,0.041016,...,0.026841,0.043110,0.022177,0.024026,0.020778,0.093549,0.085333,0.129141,0.030498,0.013715
Assistante administrative et juridique,0.039550,0.034302,0.028508,0.072611,0.013896,0.039310,0.027745,0.113637,0.060656,0.032792,...,0.041963,0.047351,0.045658,0.016157,0.026736,0.035464,0.032687,0.054872,0.035314,0.030900
Comptable Clients - optimisation relances - recouvrement,0.085272,0.018188,0.024648,0.219032,0.023337,0.017465,0.022275,0.040759,0.052187,0.058948,...,0.038362,0.036625,0.037780,0.017637,0.019209,0.031580,0.039494,0.060269,0.050094,0.033965
Graphiste & Chargée de communication,0.058370,0.042950,0.044539,0.051612,0.051039,0.043410,0.044553,0.040159,0.047616,0.044852,...,0.023030,0.029929,0.035995,0.029547,0.009648,0.079216,0.031195,0.077125,0.051938,0.020899
Community manager,0.042007,0.030587,0.044632,0.049777,0.027464,0.030632,0.041049,0.025380,0.033478,0.023389,...,0.011562,0.013146,0.020962,0.025775,0.012635,0.091431,0.202660,0.150139,0.052677,0.011590
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Consultante Administrative Gestion Finance Val de Marne (94),0.063436,0.025573,0.045014,0.242394,0.043192,0.037425,0.043688,0.066096,0.065503,0.038478,...,0.039249,0.035375,0.044227,0.014660,0.022493,0.043261,0.047815,0.065080,0.049813,0.028751
Directrice artistique,0.078973,0.012627,0.025959,0.038182,0.063301,0.013527,0.028623,0.003800,0.024371,0.039116,...,0.018152,0.022028,0.083819,0.030531,0.025583,0.065341,0.065691,0.076637,0.038744,0.009724
Formatrice certifiée Qualiopi/digital & restauration,0.063767,0.033406,0.052973,0.027362,0.047601,0.055760,0.055431,0.010193,0.068616,0.047537,...,0.025884,0.030917,0.035424,0.021254,0.012176,0.043449,0.062246,0.063854,0.027945,0.021938
Graphiste digital 2D et 3D,0.054463,0.011503,0.025431,0.034289,0.022997,0.033845,0.025847,0.020594,0.063736,0.049574,...,0.012612,0.023826,0.101626,0.027658,0.020616,0.140472,0.056972,0.040870,0.009835,0.017799


In [None]:
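n_max = 5
# For each candidate, print the n_max job postings with the highest cosine similarity.
# iterrows() yields one Series per row, even when several candidates share a title,
# and nlargest() avoids overwriting values in Resultats.
for candidat, scores in Resultats.T.iterrows():
    for poste, score in scores.nlargest(n_max).items():
        print(candidat, " VS ", poste, score)
    print('#################################################')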

In [18]:
from sklearn.decomposition import PCA
pca = PCA(n_components=100)
pca.fit(X_CV_feat)
pca_features = pca.transform(X_CV_feat)
print(pca_features.shape)

(151, 100)


In [19]:
Candidats['CleanCV']=Candidats['CV'].apply(cleaner)
Postes['CleanContent']=Postes['Content'].apply(cleaner)