
<a href="http://www.inokufu.com"><img src = "http://www.inokufu.com/wp-content/uploads/elementor/thumbs/logo_inokufu_vector_full-black-om2hmu9ob1jytetxemkj1ij8g7tt3hzrtssivh2fl2.png" width = 400> </a>


<h1 align=center><font size = 5>Exploratory Data Analysis : Title</font></h1>

## Introduction

In this notebook, we conduct an Exploratory Data Analysis (EDA) through data visualization. The goal is to better understand how the different models represent words: do the models reflect real-world usage? Do they correctly identify Bloom verbs?

Our EDA approach follows the **Data Science Methodology CRISP-DM**. For more information about this approach, see this [Wikipedia page](https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining).

## Table of Contents

<div class="alert alert-block alert-info" style="margin-top: 20px">

<font size = 3>

1. <a href="#item1">Data Import</a>

2. <a href="#item2">Functions Declarations</a>

3. <a href="#item3">Data Visualization</a>    
    
4. <a href="#item4">Next Steps</a>    

</font>
</div>
<a id='the_destination'></a>

In [None]:
import numpy as np 
np.set_printoptions(threshold=10000,suppress=True) 
import warnings
warnings.filterwarnings('ignore')

import pandas as pd

import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.colors as colors
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
import matplotlib.image as img
from matplotlib import rcParams

import seaborn as sns

import gensim

%matplotlib inline

sns.set_style("darkgrid")

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

import import_ipynb
from GL_20200327_Fonctions_Preprocessing import preprocess 



print('Libraries imported.')

## 1. Data Import <a id='item1'></a>

In [None]:
# Load the four models trained in part 2 (Word2Vec), plus the model to send

model_100_5 = {
    'model' : gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_100_5000_Train25_Iter10.model"),
    'model_size' : 100, 'vocab_size' : 5000, 'train' : 25, 'iter' : 10
}

model_100_10 = {
    'model' : gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_100_10000_Train25_Iter10.model"),
    'model_size' : 100, 'vocab_size' : 10000, 'train' : 25, 'iter' : 10
}

model_300_5 = {
    'model' : gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_300_5000_Train25_Iter10.model"),
    'model_size' : 300, 'vocab_size' : 5000, 'train' : 25, 'iter' : 10
}

model_300_10 = {
    'model' : gensim.models.Word2Vec.load("./models/20200330_UdemyDesc_Desc_Obj_300_10000_Train25_Iter10.model"),
    'model_size' : 300, 'vocab_size' : 10000, 'train' : 25, 'iter' : 10
}

model_to_send = {
    'model' : gensim.models.Word2Vec.load("./models/Word2Vec_model_to_send.model"),
    'model_size' : 300, 'vocab_size' : 5000, 'train' : 100, 'iter' : 0
}

## 2. Functions declarations <a id='item2'></a>

In [None]:
# Plots the query word with its closest words and a list of other words (e.g. the farthest ones)

def tsnescatterplot(model, word, list_names, model_size):

    arrays = np.empty((0, model_size), dtype='f')
    word_labels = [word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)
    
    # gets list of most similar words
    close_words = model.wv.most_similar([word])
    
    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)
    
    # adds the vector for each of the words from list_names to the array
    for wrd in list_names:
        wrd_vector = model.wv.__getitem__([wrd])
        word_labels.append(wrd)
        color_list.append('green')
        arrays = np.append(arrays, wrd_vector, axis=0)
        
    # Reduces the dimensionality from model_size to 19 dimensions with PCA
    reduc = PCA(n_components=19).fit_transform(arrays)
    
    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)
    
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    
    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]], 'y': [y for y in Y[:, 1]], 
                       'words': word_labels, 'color': color_list})
    
    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)
    
    # Basic plot
    p1 = sns.regplot(data=df,x="x",y="y",fit_reg=False,marker="o",scatter_kws={'s': 40,'facecolors': df['color']})
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line], df['y'][line], '  ' + df["words"][line].title(), horizontalalignment='left',
                 verticalalignment='bottom', size='medium',color=df['color'][line],weight='normal').set_size(15)

    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization for {}'.format(word.title()))

In [None]:
# Prints nearest words only 

def tsnescatterplot_near_only(model, word, model_size):
    
    arrays = np.empty((0, model_size), dtype='f')
    word_labels = [word]
    color_list  = ['red']

    # adds the vector of the query word
    arrays = np.append(arrays, model.wv.__getitem__([word]), axis=0)
    
    # gets list of most similar words
    close_words = model.wv.most_similar([word])
    
    # adds the vector for each of the closest words to the array
    for wrd_score in close_words:
        wrd_vector = model.wv.__getitem__([wrd_score[0]])
        word_labels.append(wrd_score[0])
        color_list.append('blue')
        arrays = np.append(arrays, wrd_vector, axis=0)
        
    reduc = PCA(n_components=9).fit_transform(arrays)
    
    # Finds t-SNE coordinates for 2 dimensions
    np.set_printoptions(suppress=True)
    
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    
    # Sets everything up to plot
    df = pd.DataFrame({'x': [x for x in Y[:, 0]], 'y': [y for y in Y[:, 1]],
                       'words': word_labels, 'color': color_list})
    
    fig, _ = plt.subplots()
    fig.set_size_inches(6, 6)
    
    # Basic plot
    p1 = sns.regplot(data=df,x="x",y="y",fit_reg=False,marker="o",scatter_kws={'s': 40,'facecolors': df['color']})
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line], df['y'][line], '  ' + df["words"][line].title(), horizontalalignment='left',
                 verticalalignment='bottom', size='medium', color=df['color'][line], weight='normal').set_size(15)
    
    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization for {}'.format(word.title()))
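
Both plotting helpers hard-code the PCA target (19 or 9) and the t-SNE perplexity (15). With the 10 neighbours returned by default by `most_similar`, `tsnescatterplot_near_only` passes only 11 points, and recent scikit-learn releases reject a perplexity that is not smaller than the number of samples. The sketch below shows one possible way to clamp both values to the data size; `clamp_reduction_params` is a hypothetical helper, not part of the original notebook.

```python
def clamp_reduction_params(n_samples, n_features, pca_target=19, perplexity=15):
    """Return (n_components, perplexity) values that fit the data size.

    Hypothetical helper: PCA cannot keep more components than
    min(n_samples, n_features), and the t-SNE perplexity is kept
    strictly below the number of samples.
    """
    if n_samples < 2:
        raise ValueError("at least two points are needed")
    n_components = min(pca_target, n_samples, n_features)
    safe_perplexity = min(perplexity, n_samples - 1)
    return n_components, safe_perplexity


# 1 query word + 10 neighbours, 300-dimensional vectors
print(clamp_reduction_params(11, 300, pca_target=9))   # (9, 10)
# 1 query word + 10 neighbours + 10 farthest words
print(clamp_reduction_params(21, 300, pca_target=19))  # (19, 15)
```

The returned values could then be passed to `PCA(n_components=...)` and `TSNE(perplexity=...)` instead of the fixed constants.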

In [None]:
# Plots the `top` most frequent words of the model vocabulary

def tsnescatterplot_vocab(model, model_size, top):
    
    word_labels = []
    color_list  = []
    
    vocab = model.wv.vocab.items()
    array = []
    arrays = np.empty((0, model_size), dtype='f')
    words = np.empty((0, model_size), dtype='f')
    
    for wrd_score in vocab:
        word = wrd_score[0]
        wrd_count = model.wv.vocab[word].count
        wrd_vector = model.wv.__getitem__([word])
        
        word_labels.append(wrd_score[0])
        color_list.append('green')
        array.append([word,wrd_count,wrd_vector])
        
    df_test_array = pd.DataFrame(array)
    df_test_array.columns = ['word','count','vector']
    df_test_array.sort_values(by=['count'], inplace=True, ascending=False)
    
    for index,row in df_test_array.iterrows():
        arrays = np.append(arrays, row['vector'], axis=0)
        words = np.append(words, row['word'])
    
    reduc = PCA(n_components=19).fit_transform(arrays[:top])
    np.set_printoptions(suppress=True)
    Y = TSNE(n_components=2, random_state=0, perplexity=15).fit_transform(reduc)
    df = pd.DataFrame({'x': [x for x in Y[:, 0]], 'y': [y for y in Y[:, 1]], 'words': words[:top]})
    
    fig, _ = plt.subplots()
    fig.set_size_inches(9, 9)
    
    # Basic plot
    p1 = sns.regplot(data=df, x="x", y="y", fit_reg=False, marker="o", scatter_kws={'s': 40,'facecolors': 'green'})
    
    # Adds annotations one by one with a loop
    for line in range(0, df.shape[0]):
         p1.text(df["x"][line],df['y'][line],'  ' + df["words"][line].title(),horizontalalignment='left',
                 verticalalignment='bottom', size='medium',color='green',weight='normal').set_size(15)
    
    plt.xlim(Y[:, 0].min()-50, Y[:, 0].max()+50)
    plt.ylim(Y[:, 1].min()-50, Y[:, 1].max()+50)
            
    plt.title('t-SNE visualization')

## 3. Data visualization <a id='item3'></a>

In [None]:
model = model_300_5

### Plot the 10 closest and the 10 farthest words for the tested word

In [None]:
word_tested = 'decouvr'

tsnescatterplot(model['model'], 
                word_tested, 
                [i[0] for i in model['model'].wv.most_similar(negative=[word_tested])], 
                model['model_size'])

### Plot the top N most frequent words of the model

In [None]:
N = 25
tsnescatterplot_vocab(model['model'],model['model_size'],N)

### Check the nearest words to Bloom verbs

In [None]:
words = 'decouvrir expliquer construire analyser synthetiser evaluer'
words = preprocess(words)
words = words.split(" ")

for word_tested in words:
    tsnescatterplot_near_only(model['model'], word_tested, model['model_size'])

### Bloom verbs analysis only

Original Bloom list:

[['Connaitre','Définir','Citer','Nommer','Décrire','Dire','Relater','Réciter','Désigner','Montrer','Énoncer','Énumérer','Identifier','Lister','Insérer','Ordonner','Arranger','Localiser','Placer','Situer','Souligner','Dupliquer','Rappeler','Reconnaitre'],
    ['Comprendre','Reformuler','Paraphraser','Démontrer','Différencier','Discriminer','Expliquer','Retracer','Formuler','Intégrer','Interpréter','Résumer','Rapporter','Associer','Relier','Regrouper','Classifier','Estimer','Réviser','Traduire','Discuter'],
    ['Appliquer','Faire','Utiliser','Exercer','Employer','Administrer','Adapter','Planifier','Opérer','Calculer','Experimenter','Simuler','Préparer','Pratiquer','Produire','Construire','Interviewer'],
    ['Analyser','Etudier','Examiner','Décomposer','Disséquer','Extraire','Rechercher','Comparer','Critiquer','Catégoriser','Contraster','Corréler','Subdiviser','Prioriser','Schématiser','Concentrer','Signaler','Regrouper'],
    ['Synthétiser','Construire','Modifier','Assembler','Compiler','Composer','Edifier','Façonner','Intégrer','Recombiner','Réorganiser','Réordonner','Reconstruire','Structurer','Systématiser','Créer','Concevoir','Développer','Adapter','Généraliser','Imaginer','Inventer'],
    ['Evaluer','Prédire','Estimer','Expertiser','Juger','Sélectionner','Choisir','Prédire','Justifier','Recommander','Noter','Décider','Défendre','Persuader']
]

In [None]:
# Definition of the bloom verbs list without duplicated verbs

bloom_array = [
    ['Connaitre','Définir','Citer','Nommer','Décrire','Dire','Relater','Réciter','Désigner','Montrer','Énoncer','Énumérer','Identifier','Lister','Insérer','Ordonner','Arranger','Localiser','Placer','Situer','Souligner','Dupliquer','Rappeler','Reconnaitre'],
    ['Comprendre','Reformuler','Paraphraser','Démontrer','Différencier','Discriminer','Expliquer','Retracer','Formuler','Interpréter','Résumer','Rapporter','Associer','Relier','Classifier','Réviser','Traduire','Discuter'],
    ['Appliquer','Faire','Utiliser','Exercer','Employer','Administrer','Planifier','Opérer','Calculer','Experimenter','Simuler','Préparer','Pratiquer','Produire','Interviewer'],
    ['Analyser','Etudier','Examiner','Décomposer','Disséquer','Extraire','Rechercher','Comparer','Critiquer','Catégoriser','Contraster','Corréler','Subdiviser','Prioriser','Schématiser','Concentrer','Signaler'],
    ['Synthétiser','Modifier','Assembler','Compiler','Composer','Edifier','Façonner','Recombiner','Réorganiser','Réordonner','Reconstruire','Structurer','Systématiser','Créer','Concevoir','Développer','Généraliser','Imaginer','Inventer'],
    ['Evaluer','Prédire','Expertiser','Juger','Sélectionner','Choisir','Justifier','Recommander','Noter','Décider','Défendre','Persuader']
]

In [None]:
# Concat verbs by category

bloom_list = []

for i in range(len(bloom_array)):
    bloom_sentence = ' '.join(bloom_array[i])
    bloom_list.append(bloom_sentence)

In [None]:
# Print verbs of the first categ
bloom_list[0]

In [None]:
# Process verbs by category: we keep the processed verbs both as one string per category and as a list of verbs per category

bloom_list_processed = []
bloom_array_processed = []

for i in range(len(bloom_list)):
    processed_bloom = preprocess(bloom_list[i])
    bloom_list_processed.append(processed_bloom)
    bloom_array_processed.append(processed_bloom.split(" "))

In [None]:
# Print processed sentences composed of verbs from the first categ
bloom_list_processed[0]

In [None]:
# Print processed words from the first category
print(bloom_array_processed[0])

In [None]:
# Compute the cosine similarity (stored in `distance`) between verbs of the same category

bloom_categ = []
bloom_verb1 = []
bloom_verb2 = []
bloom_distance = []
bloom_not_vocab = []

# For each category 
for i in range(len(bloom_array_processed)):
    
    # For each verb of each category
    for j in range(len(bloom_array_processed[i])):
        
        # For each verb after the j verb
        for k in range(j+1,len(bloom_array_processed[i])):
            
            # Get verbs we have to calculate the distance between
            verb1 = bloom_array_processed[i][j]
            verb2 = bloom_array_processed[i][k]
            
            # Check if both verbs are in the word2vec vocabulary
            if verb1 in model['model'].wv.vocab and verb2 in model['model'].wv.vocab:
                
                # If yes, compute the cosine similarity between them
                distance = model['model'].wv.similarity(verb1,verb2)
            else :
                # If not, use -1 as a marker for "not comparable"
                distance = -1
                
                # Record every verb of the pair that is missing from the vocabulary
                for verb in (verb1, verb2):
                    if verb not in model['model'].wv.vocab:
                        bloom_not_vocab.append(verb)
            
            # We add categories, verbs and distances into arrays
            bloom_categ.append(i)
            bloom_verb1.append(verb1)
            bloom_verb2.append(verb2)
            bloom_distance.append(distance)
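
Despite the variable name `bloom_distance`, the value returned by gensim's `similarity` is a cosine similarity: 1 means the two vectors point in the same direction, 0 means they are orthogonal, and negative values mean they point in opposite directions. This is also why -1, used above as the out-of-vocabulary marker, could in principle collide with a genuine score. A minimal pure-Python sketch, with made-up toy vectors that are not taken from the model:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 2-D vectors, purely illustrative
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))   # 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 3.0]))   # 0.0
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0
```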

In [None]:
# Concat of all the arrays we just filled into a Pandas DataFrame
df = pd.DataFrame([bloom_categ, bloom_verb1, bloom_verb2, bloom_distance]).transpose()

# Add the columns of the dataframe 
df.columns = ["categ","verb_1","verb_2","distance"]

# Sort the dataframe by similarity, descending, and show the first 10 rows (most similar pairs)
df.sort_values('distance',ascending = False).head(10)

In [None]:
# Verbs that are not in the model vocabulary

df_not_vocab = pd.DataFrame([bloom_not_vocab]).transpose()
list_verb_not_vocab = df_not_vocab[0].unique()
list_verb_not_vocab

In [None]:
# Check which categories are badly represented
df_negativ = df.loc[df['distance'] < 0]
df_positiv = df.loc[df['distance'] >= 0]

list_positiv_by_categ = df_positiv['categ'].value_counts().sort_index()
list_negativ_by_categ = df_negativ['categ'].value_counts().sort_index()

In [None]:
print("Positiv","\t","Negativ","\t","Rate")
for i in range(6):
    rate = round(list_positiv_by_categ[i]/(list_positiv_by_categ[i]+list_negativ_by_categ[i])*100,2)
    print(list_positiv_by_categ[i],"\t\t",list_negativ_by_categ[i],"\t\t",rate,"%")

This small analysis suggests that:
- categories 1, 2 and 3 are well represented by the model: more than half of their verb pairs are detected as close
- categories 4, 5 and 6 are poorly represented by the model: less than half of their verb pairs are detected as close

This could be explained by the fact that:
- the categories do not contain the same number of Bloom verbs
- some verbs are also used in contexts unrelated to education

--> Example: "Ca va vous faire du bien de voir cette compétence" ("It will do you good to look at this skill"): here, "Faire" is the only Bloom verb in the sentence, but it says nothing about learning or teaching.
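
The per-category rate discussed above is the share of verb pairs whose score is non-negative. As a sketch, the same figure can be computed without assuming that every category has both positive and negative pairs, which the `range(6)` loop implicitly relies on. The pair data below are invented for illustration, and `positive_rate_by_category` is a hypothetical helper.

```python
from collections import defaultdict

def positive_rate_by_category(pairs):
    """Map each category to the percentage of pairs with score >= 0.

    `pairs` is an iterable of (category, score) tuples; a score of -1
    marks a pair with an out-of-vocabulary verb, as in the notebook.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for category, score in pairs:
        totals[category] += 1
        if score >= 0:
            positives[category] += 1
    return {c: round(100 * positives[c] / totals[c], 2) for c in sorted(totals)}

# Invented scores, for illustration only
toy_pairs = [(0, 0.42), (0, -1), (0, 0.10), (1, -1), (1, -0.05)]
print(positive_rate_by_category(toy_pairs))  # {0: 66.67, 1: 0.0}
```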

In [None]:
# Build an array with all verbs and their categories

bloom_array_processed_all = []

for i in range(len(bloom_array_processed)):
    for j in range(len(bloom_array_processed[i])):
        bloom_array_processed_all.append([i,bloom_array_processed[i][j]])

In [None]:
print(bloom_array_processed_all[:5])

In [None]:
# Distances between verbs from all categories 
# --> Goal : see if there are verbs from different categories that are close in distance

bloom_categ1 = []
bloom_verb1 = []
bloom_categ2 = []
bloom_verb2 = []
bloom_distance = []

for i in range(len(bloom_array_processed_all)):
    for j in range(i+1,len(bloom_array_processed_all)):
        verb1 = bloom_array_processed_all[i][1]
        verb2 = bloom_array_processed_all[j][1]
        
        if verb1 in model['model'].wv.vocab and verb2 in model['model'].wv.vocab:
            distance = model['model'].wv.similarity(verb1,verb2)
        else :
            distance = -1
            
        bloom_categ1.append(bloom_array_processed_all[i][0])
        bloom_verb1.append(verb1)
        bloom_categ2.append(bloom_array_processed_all[j][0])
        bloom_verb2.append(verb2)
        bloom_distance.append(distance)
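
The nested loops above visit every unordered pair of verbs exactly once. `itertools.combinations` from the standard library expresses the same enumeration more compactly. The sketch below uses a short made-up list of (category, verb) entries and checks that the number of pairs is n(n-1)/2.

```python
from itertools import combinations

# Made-up (category, verb) entries, mirroring bloom_array_processed_all
entries = [(0, "definir"), (0, "citer"), (1, "expliquer"), (3, "analyser")]

# Every unordered pair appears once, in the same order as the nested loops
pairs = list(combinations(entries, 2))

for (categ1, verb1), (categ2, verb2) in pairs:
    print(categ1, verb1, categ2, verb2)

n = len(entries)
print(len(pairs) == n * (n - 1) // 2)  # True
```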

In [None]:
# Concat of all the arrays we just filled into a Pandas DataFrame
df_all = pd.DataFrame([bloom_categ1, bloom_verb1, bloom_categ2, bloom_verb2, bloom_distance]).transpose()

# Add the columns of the dataframe 
df_all.columns = ["categ_1", "verb_1", "categ_2", "verb_2", "distance"]

# Sort the dataframe by similarity, descending, and show the first 10 rows (most similar pairs)
df_all.sort_values('distance',ascending = False).head(10)

In [None]:
# Remove duplicated verbs : distance between same words = 1
df_all = df_all.loc[df_all['verb_1'] != df_all['verb_2']]

# Remove not in vocab words
df_all = df_all.loc[df_all['distance'] != -1]

print(len(df_all))

#### Close Verbs Quick Analysis

In [None]:
# Keep only the close words
df_positiv = df_all[df_all['distance'] >= 0]

In [None]:
df_positiv.head(3)

In [None]:
# Number of verbs that are close and in the same category 
well = len(df_positiv[df_positiv['categ_1'] == df_positiv['categ_2']])

# Number of verbs that are close and not in the same category
bad = len(df_positiv[df_positiv['categ_1'] != df_positiv['categ_2']])

# Rate of close words of the same category on close words
print("Rate of verbs well interpreted as close :")
print(round(well/len(df_positiv)*100,2),"%")

#### Far Verbs Quick Analysis

In [None]:
# Keep only the far words 
df_negativ = df_all.loc[df_all['distance'] < 0]

In [None]:
df_negativ.head(3)

In [None]:
# Number of verbs that are far and in the same category 
bad = len(df_negativ[df_negativ['categ_1'] == df_negativ['categ_2']])

# Number of verbs that are far and not in the same category
well = len(df_negativ[df_negativ['categ_1'] != df_negativ['categ_2']])

# Rate of far words of the same category on far words
print("Rate of verbs well interpreted as far :")
print(round(well/len(df_negativ)*100,2),"%")

In [None]:
df_positiv[df_positiv['categ_1'] != df_positiv['categ_2']].sort_values('distance',ascending=False).head(10)

## 4. Next Steps <a id='item4'></a>

This analysis of Bloom verbs shows that verbs from the same category are not always detected as close.
Conversely, verbs from different categories are detected as "far" most of the time.

This could be explained by the fact that:
- the categories contain many verbs,
- some verbs might not occur often enough in the corpus for the model to learn them,
- some verbs are also used in contexts unrelated to education, which might mislead the model.

Based on this analysis, the model may not be learning enough from the descriptions and objectives.
This raises a follow-up question: would a classifier be more accurate if we added TF-IDF features for the Bloom verbs?
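
To make the TF-IDF idea concrete, here is a minimal sketch of TF-IDF scores restricted to a fixed list of verbs. It uses the textbook definition tf × log(N / df), which differs from the smoothed variant used by scikit-learn, and the tokenised documents and verb list are invented for illustration. `tfidf_for_terms` is a hypothetical helper.

```python
import math

def tfidf_for_terms(documents, terms):
    """Return one {term: tf-idf} dict per document, limited to `terms`.

    tf is the relative frequency of the term in the document and
    idf is log(N / df); terms absent from every document get 0.0.
    """
    n_docs = len(documents)
    doc_freq = {t: sum(1 for doc in documents if t in doc) for t in terms}
    scores = []
    for doc in documents:
        row = {}
        for t in terms:
            if doc_freq[t] == 0 or not doc:
                row[t] = 0.0
            else:
                tf = doc.count(t) / len(doc)
                row[t] = tf * math.log(n_docs / doc_freq[t])
        scores.append(row)
    return scores

# Invented, already-tokenised course descriptions
docs = [
    ["analyser", "donnees", "analyser", "resultat"],
    ["expliquer", "donnees"],
    ["creer", "site", "expliquer"],
]
bloom_terms = ["analyser", "expliquer", "evaluer"]

for row in tfidf_for_terms(docs, bloom_terms):
    print({t: round(v, 3) for t, v in row.items()})
```

Each row could then be appended to the feature vector of the corresponding document before training a classifier.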


<hr>

Author [Guillaume Lefebvre](https://www.linkedin.com/in/guillaume-lefebvre-22117610b/) - For more information, contact us at contact@inokufu.com - Copyright &copy; 2020 [Inokufu](http://www.inokufu.com)



