# <font color="#114b98">Automatically categorize questions</font>

## <font color="#114b98">Final code to deploy</font>

**Stack Overflow** is a well-known question-and-answer site for software development.

The goal of this project is to build a tag suggestion system for the site, in the form of a machine learning algorithm that automatically assigns several relevant tags to a question.

**Deliverable**: The final code to deploy, presented in a repository and developed incrementally with version control software.

### Final code:

In [2]:
import re
import spacy
import joblib
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.stem import WordNetLemmatizer
from gensim.parsing.preprocessing import STOPWORDS
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
# One-time NLTK resource downloads (uncomment on first run):
# import nltk
# nltk.download('omw-1.4')
# nltk.download('stopwords')
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
spacy.load('en_core_web_sm')

<spacy.lang.en.English at 0x2620b437b50>

In [3]:
# Strip HTML tags (and the contents of <style> and <script> elements)
def clean_html(text):
    soup = BeautifulSoup(text, "html5lib")
    for sent in soup(['style', 'script']):
        sent.decompose()
    return ' '.join(soup.stripped_strings)


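# Text cleaning: replace non-word characters, digits and underscores
# with spaces, then drop words shorter than 3 characters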
def clean_text(text):
    pattern = re.compile(r'[^\w]|[\d_]')
    res = re.sub(pattern, " ", text)
    res = " ".join(word for word in res.split() if len(word) >= 3)
    return res


# Tokenization and stopword removal
nltk_stopwords = set(stopwords.words('english'))
gensim_stopwords = set(STOPWORDS)
nlp = spacy.load("en_core_web_sm")
spacy_stopwords = set(nlp.Defaults.stop_words)
stop_words = nltk_stopwords.union(gensim_stopwords, spacy_stopwords)


def tokenize(text):
    tokens = word_tokenize(text, language='english')
    return [token for token in tokens if token not in stop_words]


# POS tagging: keep only tokens tagged 'NN' (singular common nouns)
def filtering_nouns(tokens):
    res = [token.lower() for token, tag in pos_tag(tokens) if tag == 'NN']
    return res


# Lemmatization (returns a one-element list, ready for the vectorizer)
def lemmatization(tokens):
    lemmatizer = WordNetLemmatizer()
    txt = [lemmatizer.lemmatize(token) for token in tokens]
    return [" ".join(txt)]

In [4]:
post = 'Tensorflow 2.0 - AttributeError: module tensorflow has no attribute Session\
        When I am executing the command sess = tf.Session() \
        in Tensorflow 2.0 environment, I am getting an error message as below:\
        Traceback (most recent call last):\
        File "<stdin>", line 1, in <module>\
        AttributeError: module tensorflow has no attribute Session\
        System Information:\
        OS Platform and Distribution: Windows 10\
        Python Version: 3.7.1\
        Tensorflow Version: 2.0.0-alpha0 (installed with pip)\
        Steps to reproduce:\
        Installation:\
        pip install --upgrade pip\
        pip install tensorflow==2.0.0-alpha0\
        pip install keras\
        pip install numpy==1.16.2\
        Execution:\
        Execute command: import tensorflow as tf\
        Execute command: sess = tf.Session()'

In [5]:
# Preprocess the post
txt = lemmatization(filtering_nouns(tokenize(clean_text(clean_html(post)))))

In [6]:
txt

['module attribute session command sess environment message line module attribute alpha reproduce pip install pip pip install alpha pip install command import command sess session']
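
Short tokens such as `tf` and version numbers such as `2.0` are missing from the output above. The sketch below reruns only the cleaning step on a fragment of the post to show where they go. It repeats the logic of `clean_text` so that it runs on its own with the standard library; the sample string is just an excerpt chosen for illustration.

```python
import re

# Same pattern as clean_text: any non-word character, digit or underscore.
PATTERN = re.compile(r"[^\w]|[\d_]")


def clean_text(text: str) -> str:
    """Replace punctuation, digits and underscores with spaces,
    then drop words shorter than 3 characters."""
    res = PATTERN.sub(" ", text)
    return " ".join(word for word in res.split() if len(word) >= 3)


print(clean_text("sess = tf.Session() in Tensorflow 2.0"))
# prints: sess Session Tensorflow
```

The length filter removes `tf` and `in`, and the digit rule removes `2.0`, before tokenization even starts.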

In [7]:
# load the saved CountVectorizer
vectorizer_loaded = joblib.load('countvectorizer.joblib')

# load the saved classifier
clf_loaded = joblib.load('sgdc_classifier.pkl')

# load the saved MultiLabelBinarizer
mlb_loaded = joblib.load('multilabelbinarizer.joblib')

In [8]:
# generate tags
txt_vect = vectorizer_loaded.transform(txt)
tags_mlb = clf_loaded.predict(txt_vect)
tags = mlb_loaded.inverse_transform(tags_mlb)

In [9]:
tags

[('python', 'python-3.x', 'pip', 'conda', 'session')]

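POS tagging filtered out the keyword `tensorflow`, even though it is important here: `filtering_nouns` keeps only tokens tagged `NN`, so every occurrence of `tensorflow` must have received a different tag.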
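
One possible way to keep such keywords is to accept every Penn Treebank noun tag rather than `NN` alone. The sketch below is an assumption, not the filter used to train the saved models: `filter_nouns` and `NOUN_TAGS` are hypothetical names, and the `(token, tag)` pairs are written by hand for illustration, so the tags that `nltk.pos_tag` actually assigns may differ.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Penn Treebank noun tags: singular, plural, proper singular, proper plural.
NOUN_TAGS: FrozenSet[str] = frozenset({"NN", "NNS", "NNP", "NNPS"})


def filter_nouns(tagged_tokens: Iterable[Tuple[str, str]],
                 keep_tags: FrozenSet[str] = NOUN_TAGS) -> List[str]:
    """Keep the lowercased tokens whose POS tag is in `keep_tags`."""
    return [token.lower() for token, tag in tagged_tokens if tag in keep_tags]


# Hand-written tags, for illustration only.
sample = [("Tensorflow", "NNP"), ("has", "VBZ"),
          ("attribute", "NN"), ("sessions", "NNS")]

print(filter_nouns(sample))                     # ['tensorflow', 'attribute', 'sessions']
print(filter_nouns(sample, frozenset({"NN"})))  # ['attribute']
```

The saved `CountVectorizer` and classifier were presumably fitted on text produced by the `NN`-only filter, so switching filters would likely call for refitting both.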

In [10]:
def transform_with_vectorizer(x):
    return vectorizer_loaded.transform(x)

In [11]:
# Define the pipeline
text_clf = Pipeline([
    ('html_cleaner', FunctionTransformer(clean_html)),
    ('text_cleaner', FunctionTransformer(clean_text)),
    ('tokenizer', FunctionTransformer(tokenize)),
    ('noun_filter', FunctionTransformer(filtering_nouns)),
    ('lemmatizer', FunctionTransformer(lemmatization)),
    ('vectorizer', FunctionTransformer(transform_with_vectorizer)),
    ('clf', clf_loaded)
])

In [12]:
# Run the full pipeline on the raw post and decode the predicted labels into tags
tags = mlb_loaded.inverse_transform(text_clf.predict(post))
print(tags)

[('python', 'python-3.x', 'pip', 'conda', 'session')]
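
For deployment, the last cell could be wrapped in a single entry point. The sketch below is only one possible shape: `suggest_tags` is a hypothetical name, and it assumes that `pipeline` behaves like `text_clf` (a `predict` method that accepts the raw post) and that `binarizer` behaves like `mlb_loaded` (an `inverse_transform` method that returns one tuple of tags per sample).

```python
from typing import Any, List


def suggest_tags(post: str, pipeline: Any, binarizer: Any) -> List[str]:
    """Return the predicted tags for one raw post.

    Assumes `pipeline.predict(post)` returns a label matrix and
    `binarizer.inverse_transform(...)` returns one tuple of tags per row.
    """
    label_matrix = pipeline.predict(post)
    tag_tuples = binarizer.inverse_transform(label_matrix)
    # A single post yields a single tuple; flatten it into a plain list.
    return list(tag_tuples[0]) if tag_tuples else []
```

With the objects defined above, the call would be `suggest_tags(post, text_clf, mlb_loaded)`.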
