
# Serious examples for project 6

In [0]:
import random
import string

import numpy as np
import pandas as pd

from sklearn.preprocessing import MultiLabelBinarizer


## How to predict tags from questions/tags and a topics_by_questions matrix obtained with an algorithm such as LDA

The goal of this document is to give a mini demo of tag prediction from the topics_by_questions matrix obtained by applying the Latent Dirichlet Allocation (LDA) algorithm to a term-frequency matrix. Below, we simulate the LDA result by generating the topics_by_questions matrix randomly.
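For context, here is a minimal sketch of how a real topics_by_questions matrix could be obtained with scikit-learn instead of being simulated. The toy corpus, the choice of two topics, and the name `topics_by_questions_real` are assumptions made for illustration only; they are not part of the original project.

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for real question texts.
docs = pd.Series(
    [
        "python pandas dataframe merge index",
        "pandas dataframe groupby column",
        "javascript async promise callback",
        "javascript promise await fetch",
    ],
    index=["Q1", "Q2", "Q3", "Q4"],
)

# Term-frequency matrix: one row per question, one column per term.
vectorizer = CountVectorizer()
term_freq = vectorizer.fit_transform(docs)

# fit_transform returns one topic distribution per question.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(term_freq)

topics_by_questions_real = pd.DataFrame(
    doc_topics,
    columns=[f"T{i+1}" for i in range(doc_topics.shape[1])],
    index=docs.index,
)
print(topics_by_questions_real.round(3))
```

Each row of the result is a distribution over topics, so its entries sum to 1.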

### Working with training data

Consider a set of questions with tags. Here, I assigned each question between one and six tags, each a single letter (a-z or A-Z):

In [0]:
def generate_questions_with_tags(tags, ntags, nquestions):
    return pd.DataFrame(
        data = {
            'tags': [random.sample(list(tags), k=random.randint(*ntags)) for _ in range(nquestions)]
        },
        index = [
            f'Q{i+1}' for i in range(nquestions)
        ]
    )

questions = generate_questions_with_tags(string.ascii_letters, (1,6), 800)
questions.head()

Unnamed: 0,tags
Q1,[z]
Q2,"[i, W, f]"
Q3,"[Y, h, O, u]"
Q4,[V]
Q5,"[e, F, m, C, a]"


To simulate the LDA output, I built a fictitious topics_by_questions matrix:

In [0]:
def generate_topics_by_questions(nquestions, ntopics):
    return pd.DataFrame(
        data = np.random.rand(nquestions, ntopics),
        columns = [
            f'T{i+1}' for i in range(ntopics)
        ],
        index = [
            f'Q{i+1}' for i in range(nquestions)
        ]
    )

topics_by_questions = generate_topics_by_questions(800, 20)
topics_by_questions.head()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20
Q1,0.667876,0.670368,0.444114,0.018575,0.560182,0.197638,0.647024,0.404895,0.18388,0.6815,0.979525,0.909215,0.024436,0.993133,0.166233,0.815001,0.477521,0.811851,0.629245,0.89831
Q2,0.065695,0.644907,0.918673,0.666118,0.836947,0.534431,0.059772,0.666901,0.029709,0.24775,0.97284,0.726686,0.676263,0.13362,0.447225,0.981446,0.182255,0.291534,0.929467,0.033266
Q3,0.275418,0.934313,0.185457,0.743456,0.801518,0.488699,0.736354,0.233799,0.468267,0.959607,0.777867,0.97022,0.212191,0.948029,0.387114,0.099808,0.597002,0.863744,0.667015,0.857473
Q4,0.087557,0.119631,0.663539,0.837311,0.149689,0.651939,0.173716,0.925988,0.771073,0.20199,0.584244,0.093279,0.959223,0.396441,0.59503,0.733781,0.049759,0.182687,0.169875,0.467184
Q5,0.577984,3.1e-05,0.490789,0.400353,0.854179,0.636335,0.078237,0.561971,0.785664,0.119017,0.23605,0.161847,0.307323,0.412203,0.958341,0.41736,0.076678,0.849239,0.037074,0.978608
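The uniform random values above do not sum to 1 per question, whereas an LDA topic distribution does. If a closer simulation is wanted, one option is to draw each row from a Dirichlet distribution. This is only a sketch: the function name `generate_topic_distributions` and the flat concentration parameter are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd


def generate_topic_distributions(nquestions, ntopics, seed=0):
    """Simulate LDA-like output: each row is a distribution over topics."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame(
        # A flat Dirichlet (all concentrations equal to 1) is an arbitrary choice.
        data=rng.dirichlet(np.ones(ntopics), size=nquestions),
        columns=[f"T{i+1}" for i in range(ntopics)],
        index=[f"Q{i+1}" for i in range(nquestions)],
    )


toy_distributions = generate_topic_distributions(5, 4)
print(toy_distributions.sum(axis=1))  # every row sums to 1 (up to rounding)
```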


The following function efficiently builds a tags_by_topics matrix, in which each entry is the sum of a topic's weights over all training questions that carry a given tag:

In [0]:
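def build_tags_by_topics_matrix(questions, topics_by_questions, mlb=None):
    # Create a fresh binarizer per call rather than sharing one default instance.
    if mlb is None:
        mlb = MultiLabelBinarizer()
    # (topics x questions) @ (questions x tags): sum of topic weights per tag.
    tag_indicators = mlb.fit_transform(questions.tags)
    return pd.DataFrame(
        data=topics_by_questions.values.T @ tag_indicators,
        columns=mlb.classes_,
        index=topics_by_questions.columns
    )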

tags_by_topics = build_tags_by_topics_matrix(questions, topics_by_questions)
tags_by_topics

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,q,r,s,t,u,v,w,x,y,z
T1,26.217814,22.845119,33.88166,29.26491,28.091819,33.633273,28.006491,24.959778,24.893803,33.217908,...,30.81408,26.256998,29.538213,26.328816,29.299001,26.015234,28.770893,23.101286,26.707807,23.184984
T2,27.398544,22.245984,31.418607,30.782092,27.639908,25.909498,29.310981,23.91098,20.796499,27.116915,...,25.29141,27.752655,29.796706,23.830605,32.809886,24.804149,27.807162,21.53707,31.674307,24.141625
T3,29.305627,24.847327,33.286021,28.99346,30.081317,31.403271,28.462185,28.733896,23.294147,31.705305,...,29.9973,29.76504,33.148226,25.000528,29.949691,25.173367,30.506,26.091129,29.168712,25.069274
T4,30.192881,25.855375,31.46352,28.761301,26.112993,28.057582,24.086219,28.69163,24.240002,35.676873,...,30.914372,29.476275,29.539422,25.355565,29.108536,25.759776,29.349403,23.948983,27.030907,22.28716
T5,30.050506,27.508284,34.971774,25.112433,26.052044,29.618621,25.961633,27.76969,21.136382,29.481688,...,32.366888,28.532785,33.122039,27.048553,28.203561,24.120477,30.397936,25.395052,31.568235,22.984074
T6,28.788059,22.878565,38.084207,29.490785,22.807711,32.105265,25.998492,30.579496,20.430362,35.812252,...,31.910604,31.395001,30.160983,27.991375,31.603096,22.420402,29.202068,26.251119,27.318529,20.884976
T7,28.313136,24.61492,32.74139,28.675605,24.954294,30.643193,24.856231,28.305836,26.817229,28.317564,...,27.955276,30.165964,29.794452,26.808259,35.344037,26.336384,31.194753,23.123618,31.063511,20.178655
T8,24.856963,20.208128,35.955571,29.555097,25.932375,32.538948,25.241335,25.582309,23.315529,29.102234,...,28.688723,25.434678,32.141852,24.280575,32.136042,20.767329,29.337351,26.567982,26.870935,23.479628
T9,29.611813,20.76048,37.751739,25.950301,25.473631,26.968541,21.238241,29.070978,23.379839,34.793153,...,29.195753,31.98573,29.170287,24.818909,31.412107,28.710393,31.371684,27.823822,24.887785,23.194984
T10,32.105224,23.461937,36.276082,27.433641,22.54145,30.495765,23.316505,22.168231,23.90272,29.418454,...,32.589251,27.807985,28.279446,26.828417,32.235961,26.242172,28.692,24.984644,27.802508,19.934671
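To make the matrix product concrete, here is a small worked example on hypothetical data (three questions, two topics, three tags). The indicator matrix is built by hand so the sketch depends only on NumPy and pandas; all `toy_*` names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 3 questions, 2 topics, tags drawn from {a, b, c}.
toy_tags = pd.Series([["a", "b"], ["b"], ["a", "c"]], index=["Q1", "Q2", "Q3"])
toy_topics = pd.DataFrame(
    [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]],
    columns=["T1", "T2"],
    index=toy_tags.index,
)

# Binary indicator matrix (questions x tags), with tags in sorted order.
classes = sorted({tag for tags in toy_tags for tag in tags})
indicator = np.array([[int(c in tags) for c in classes] for tags in toy_tags])

# (topics x questions) @ (questions x tags) -> (topics x tags)
toy_tags_by_topics = pd.DataFrame(
    toy_topics.values.T @ indicator,
    columns=classes,
    index=toy_topics.columns,
)
print(toy_tags_by_topics)
```

For instance, the entry for topic T1 and tag a is 0.9 + 0.5 = 1.4, because Q1 and Q3 are the two questions tagged a.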


### Working with test data

Here are some test questions with their associated tags:

In [0]:
questions_test = generate_questions_with_tags(string.ascii_letters, (1, 6), 200)
questions_test.head()

Unnamed: 0,tags
Q1,"[b, E]"
Q2,"[Y, a, v, J, t, f]"
Q3,"[f, N, u]"
Q4,"[z, i]"
Q5,"[l, K, z]"


Here is the simulated LDA output for these test questions:

In [0]:
topics_by_questions_test = generate_topics_by_questions(200, 20)
topics_by_questions_test.head()

Unnamed: 0,T1,T2,T3,T4,T5,T6,T7,T8,T9,T10,T11,T12,T13,T14,T15,T16,T17,T18,T19,T20
Q1,0.253796,0.171801,0.145494,0.53294,0.305175,0.431407,0.953907,0.446849,0.627544,0.393286,0.395315,0.331589,0.707471,0.600622,0.306091,0.267409,0.647409,0.593122,0.312242,0.949138
Q2,0.601526,0.141298,0.322237,0.262389,0.5345,0.814243,0.321405,0.425903,0.118078,0.69225,0.786927,0.537276,0.252164,0.490084,0.535044,0.307987,0.828565,0.890954,0.114104,0.370811
Q3,0.700615,0.561099,0.728698,0.674855,0.357066,0.75503,0.869582,0.261768,0.053314,0.245443,0.627869,0.272354,0.144681,0.486586,0.266628,0.815861,0.51192,0.439295,0.163615,0.406623
Q4,0.769334,0.111478,0.962971,0.019716,0.484218,0.357224,0.968131,0.805431,0.595204,0.861751,0.806556,0.911857,0.193109,0.979692,0.838619,0.947094,0.823085,0.37776,0.191476,0.620655
Q5,0.047476,0.393989,0.921731,0.083921,0.135051,0.189011,0.862918,0.485936,0.053174,0.541607,0.33634,0.670915,0.014482,0.340964,0.330718,0.149002,0.339882,0.832367,0.628174,0.192494


In [0]:
def build_tags_by_questions_matrix(topics_by_questions, tags_by_topics):
    return pd.DataFrame(
        data=topics_by_questions.values @ tags_by_topics.values,
        columns=tags_by_topics.columns,
        index=topics_by_questions.index
    )

tags_by_test_questions = build_tags_by_questions_matrix(topics_by_questions_test, tags_by_topics)
tags_by_test_questions.head()

Unnamed: 0,A,B,C,D,E,F,G,H,I,J,...,q,r,s,t,u,v,w,x,y,z
Q1,270.293162,222.273607,333.733265,262.83474,247.814412,275.912402,244.517496,259.831985,216.98457,295.470222,...,272.894269,277.448241,279.104467,237.518658,286.449145,238.002697,271.70572,234.393419,262.95415,200.672497
Q2,270.646132,221.837345,333.51963,258.868449,249.319241,278.140233,245.269718,258.605353,213.898791,294.240965,...,275.968005,274.131523,275.712153,241.685988,284.032328,233.619098,269.320329,233.328779,262.475577,200.809138
Q3,267.853316,222.955762,326.146429,263.528212,249.427096,277.733545,248.205158,257.868787,217.166828,293.608411,...,270.540244,273.189605,278.474381,238.816211,287.826545,232.186739,272.51448,228.753432,266.605207,202.77709
Q4,361.338103,298.511064,451.760823,352.376098,336.405941,374.103042,329.894106,346.708861,293.074628,394.485404,...,366.319403,370.125853,380.175305,321.307363,388.434246,316.728368,372.24356,315.594834,355.148819,274.827078
Q5,219.420305,176.104781,266.352025,212.663982,204.403835,226.300186,198.540281,207.42277,175.855881,232.15862,...,222.318614,221.483996,224.561525,195.148718,234.746529,192.357474,218.698433,190.587123,215.2223,164.791282
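The table above holds one score per (test question, tag) pair, obtained by multiplying each question's topic weights by the tags_by_topics matrix. The sketch below illustrates that step on hypothetical values and then keeps the two best-scoring tags per question; the `toy_*` names and the cut-off of 2 are assumptions chosen to keep the example small.

```python
import pandas as pd

# Hypothetical (topics x tags) matrix learned from training data.
toy_tags_by_topics = pd.DataFrame(
    [[1.4, 1.1, 0.5], [0.6, 0.9, 0.5]],
    columns=["a", "b", "c"],
    index=["T1", "T2"],
)

# Hypothetical test questions: Q1 is pure topic T1, Q2 is pure topic T2.
toy_test_topics = pd.DataFrame(
    [[1.0, 0.0], [0.0, 1.0]],
    columns=["T1", "T2"],
    index=["Q1", "Q2"],
)

# (questions x topics) @ (topics x tags) -> one score per question and tag.
toy_scores = toy_test_topics @ toy_tags_by_topics

# Keep the two highest-scoring tags for each question, best first.
top_tags = {q: list(row.nlargest(2).index) for q, row in toy_scores.iterrows()}
print(top_tags)
```

A question dominated by one topic inherits that topic's row of tag scores, so Q1 is ranked a then b, while Q2 is ranked b then a.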


In [0]:
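def f(line):
    # All columns except the last hold tag scores; keep the 5 best-scoring tags.
    recommended = set(line.iloc[:-1].sort_values(ascending=False).iloc[:5].index)
    # The last column holds the question's true tags.
    origin = set(line.iloc[-1])
    # Share of true tags found among the recommendations (recall at 5).
    score = len(recommended & origin) / len(origin)
    return pd.Series([score, recommended, origin], index=['score', 'recommended', 'origin'])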


final_df = pd.concat([tags_by_test_questions, questions_test], axis=1).apply(f, axis=1)
final_df.head()

Unnamed: 0,score,recommended,origin
Q1,0.0,"{p, C, J, K, Z}","{b, E}"
Q2,0.166667,"{p, C, J, K, Z}","{f, a, J, t, v, Y}"
Q3,0.0,"{p, C, J, K, Z}","{N, f, u}"
Q4,0.0,"{p, C, J, K, Z}","{i, z}"
Q5,0.333333,"{p, C, K, u, Z}","{K, l, z}"
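The score computed above is the share of a question's true tags that appear among its five recommended tags, a quantity often called recall at k. A standalone sketch of the same computation, using a plain dictionary of scores, could look like this; the function name `recall_at_k` and the toy scores are hypothetical.

```python
def recall_at_k(scores, true_tags, k=5):
    """Fraction of true_tags found among the k highest-scoring tags."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return len(set(top_k) & set(true_tags)) / len(true_tags)


# Hypothetical scores for four tags; the two best are "a" and "b".
toy_tag_scores = {"a": 3.0, "b": 2.5, "c": 0.4, "d": 0.1}
toy_recall = recall_at_k(toy_tag_scores, ["a", "d"], k=2)
print(toy_recall)  # 0.5: only "a" is recovered out of the two true tags
```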


In [0]:
final_df.describe()

Unnamed: 0,score
count,200.0
mean,0.120833
std,0.240148
min,0.0
25%,0.0
50%,0.0
75%,0.166667
max,1.0
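Because the simulated topic weights are random and unrelated to the tags, the mean score of about 0.12 is best read against a chance baseline. If five of the 52 tags are recommended uniformly at random, each true tag is recovered with probability 5/52, so the expected score is 5/52, roughly 0.096. The sketch below checks this by simulation; the number of trials and the seed are arbitrary assumptions.

```python
import random
import string

random.seed(0)

all_tags = list(string.ascii_letters)  # the 52 possible tags
k = 5
n_trials = 20000

total = 0.0
for _ in range(n_trials):
    # True tags drawn as in the notebook: between 1 and 6 distinct letters.
    true_tags = set(random.sample(all_tags, random.randint(1, 6)))
    # Recommendation made blindly: k tags chosen uniformly at random.
    guess = set(random.sample(all_tags, k))
    total += len(guess & true_tags) / len(true_tags)

baseline = total / n_trials
expected = k / len(all_tags)
print(round(baseline, 3), round(expected, 3))
```

The simulated mean should land close to 5/52, which puts the 0.12 observed above only slightly above chance, as one would expect from topics that carry no information about the tags.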


In [0]:
essai = final_df.reset_index()

In [0]:
# rename returns a new DataFrame for display; essai itself is left unchanged.
essai.rename(columns={'index': 'essai'})

Unnamed: 0,essai,score,recommended,origin
0,Q1,0.000000,"{p, C, J, K, Z}","{b, E}"
1,Q2,0.166667,"{p, C, J, K, Z}","{f, a, J, t, v, Y}"
2,Q3,0.000000,"{p, C, J, K, Z}","{N, f, u}"
3,Q4,0.000000,"{p, C, J, K, Z}","{i, z}"
4,Q5,0.333333,"{p, C, K, u, Z}","{K, l, z}"
5,Q6,0.000000,"{p, C, J, K, Z}","{V, a}"
6,Q7,0.250000,"{p, C, J, K, Z}","{W, u, Z, d}"
7,Q8,0.000000,"{p, C, J, K, Z}","{W, s}"
8,Q9,0.000000,"{p, C, J, K, Z}","{A, x, i}"
9,Q10,0.000000,"{p, C, J, K, Z}","{f, t, u, k, c, N}"
