## FAQ task

An FAQ is a list of questions with their answers, for example:
1. What is preparatory course?
 - Preparatory course is a special educational program lasting 1 academic year (7-10 months), where students learn Russian and special disciplines (mathematics and physics).
2. What is invitation letter?
 - The invitation is official document which is prepared by Ministry of Internal Affairs of Russian Federation. It confirms that the student is admitted to this university.
3. ...


Users then ask their own questions, and the task is to return the most suitable answer from the FAQ. For example:

:: Could I work while studying?
> It allows the student to find well paid work and to start climbing up on a career ladder right after completing university course. Students of the Russian universities are obliged to attend all lectures as only the knowledge gained during classroom occupations allows students to become the effective and knowing professionals. 


In this tutorial we'll describe how to build an FAQ model based on the config deeppavlov/configs/faq/tfidf_logreg_en_faq.json.
<br>First, we need an FAQ training dataset. As an example, let's use the MIPT FAQ for entrants: https://mipt.ru/english/edu/faqs/

**Note:** Install all required dependencies with the following command:

>\>\> python -m deeppavlov install deeppavlov/configs/faq/tfidf_logreg_en_faq.json

Let's look at the FAQ dataset:

In [29]:
import pandas as pd
FAQ_DATASET_URL = 'http://files.deeppavlov.ai/faq/mipt/faq.csv'
faq_dataset = pd.read_csv(FAQ_DATASET_URL)
faq_dataset

| Unnamed: 0 | Question | Answer |
|---|---|---|
| 0 | What is preparatory course? | Preparatory course is a special educational pr... |
| 1 | What is invitation letter? | The invitation is official document which is p... |
| 2 | What is registration? | Registration grants to the foreign citizen the... |
| 3 | Is it possible to study and work at the same t... | Russian education is one of the most qualitati... |
| 4 | How long does the academic year last? | Academic year proceeds 10 months (from Septemb... |
| 5 | What documents are demanded for admission? | Passport, documents of your previous education... |
| 6 | What is the price for one year of study? | Russian taught programs cost 250'000 rubles pe... |
| 7 | Should I insure my life? | Life insurance and health is obligatory for an... |
| 8 | In what cases student can be deducted from Uni... | At own will, for health reasons, for the acade... |
| 9 | I have problems. Who can help me? | If you have any problems you can address to De... |


In [34]:
import deeppavlov
from deeppavlov.models.tokenizers.spacy_tokenizer import StreamSpacyTokenizer
from deeppavlov.models.sklearn import SklearnComponent
from deeppavlov.dataset_readers.faq_reader import FaqDatasetReader
from deeppavlov.core.data.data_learning_iterator import DataLearningIterator
from deeppavlov.core.data.utils import download_decompress

In [35]:
# Read FAQ data
reader = FaqDatasetReader()
faq_data = reader.read(data_url=FAQ_DATASET_URL, x_col_name='Question', y_col_name='Answer')
iterator = DataLearningIterator(data=faq_data)

x,y = iterator.get_instances()
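
For readers without DeepPavlov at hand, the following is a minimal sketch of the shape of the data the cell above yields: a list of questions `x` and a parallel list of answers `y`. It uses only the standard library and a two-row inline sample in place of the downloaded file. The helper name `read_faq_pairs` and the placeholder answer texts are assumptions made for illustration and are not part of DeepPavlov or the dataset.

```python
import csv
import io

# Two-row inline sample standing in for faq.csv (same column names as the dataset).
# The answer texts are placeholders, not the real dataset answers.
SAMPLE_CSV = '''Question,Answer
What is registration?,"Placeholder answer, with a comma"
Should I insure my life?,Another placeholder answer
'''


def read_faq_pairs(csv_text, x_col_name="Question", y_col_name="Answer"):
    """Return parallel lists of questions and answers parsed from CSV text."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    questions = [row[x_col_name] for row in rows]
    answers = [row[y_col_name] for row in rows]
    return questions, answers


x_sample, y_sample = read_faq_pairs(SAMPLE_CSV)
print(x_sample)
print(y_sample)
```

Quoted fields keep their embedded commas, which matters for answers such as the one in row 5 of the dataset.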

## Train FAQ

Let's consider a simple FAQ model (more components for building complex pipelines are listed at the end):
1. TF-IDF vectorizer on lemmatized questions
2. Logistic regression classifier
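
To make the first step concrete, here is a minimal pure-Python sketch of TF-IDF weighting. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the default behaviour of scikit-learn's `TfidfVectorizer`. Unlike scikit-learn, it splits on whitespace and keeps single-character tokens. The function name `tfidf_vectors` and the hand-lemmatised sample questions are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Compute L2-normalised TF-IDF vectors for whitespace-tokenised documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}

    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vec = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        vectors.append([v / norm for v in vec])
    return vocab, vectors


# Lemmatised forms of two dataset questions, written by hand for illustration.
questions = ["what be preparatory course", "what be invitation letter"]
vocab, vectors = tfidf_vectors(questions)
for term, weight in zip(vocab, vectors[0]):
    print(f"{term:12s} {weight:.3f}")
```

Terms that occur in both questions ("what", "be") receive a lower weight than terms unique to one question ("preparatory", "course"), which is what lets the classifier focus on the distinguishing words.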

In [46]:
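# Vocabulary learned by the TF-IDF vectorizer. This cell was executed after
# cell In [45] below, where `vectorizer` is created and fitted.
# In scikit-learn 1.0 and later, use get_feature_names_out() instead.
vectorizer.model.get_feature_names()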

['academic',
 'admission',
 'and',
 'at',
 'be',
 'can',
 'case',
 'course',
 'deduct',
 'demand',
 'do',
 'document',
 'for',
 'from',
 'have',
 'help',
 'how',
 'in',
 'insure',
 'invitation',
 'last',
 'letter',
 'life',
 'long',
 'of',
 'one',
 'possible',
 'preparatory',
 'price',
 'problem',
 'registration',
 'same',
 'should',
 'student',
 'study',
 'the',
 'time',
 'to',
 'university',
 'what',
 'who',
 'work',
 'year']

In [45]:
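# create a lemmatizing tokenizer
tokenizer = StreamSpacyTokenizer(lemmas=True)
# a batch of strings is split into lemmatized tokens
x_tokenized = tokenizer(x)

# a batch of token lists is joined back into strings, since TfidfVectorizer expects raw text
x_tokens_joined = tokenizer(x_tokenized)
# fit the TF-IDF vectorizer on the FAQ training questions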
vectorizer = SklearnComponent(model_class="sklearn.feature_extraction.text:TfidfVectorizer",
                              save_path='faq/tfidf.pkl',
                              infer_method='transform')
vectorizer.fit(x_tokens_joined)

# Collect (x, y) pairs: x_train holds the vectorized questions, y_train the FAQ answers
x_train = vectorizer(x_tokens_joined)
y_train = y

# Request the top 2 answers for each incoming question (top_n param)
clf = SklearnComponent(model_class="sklearn.linear_model:LogisticRegression",
                       top_n=2,
                       C=1000,  # scikit-learn spells the inverse regularization strength with a capital C
                       penalty='l2', 
                       save_path='faq/tfidf_logreg_classifier_en_mipt_faq.pkl',
                       infer_method='predict')
clf.fit(x_train, y_train)


2019-02-12 12:30:09.281 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.feature_extraction.text:TfidfVectorizer from scratch
2019-02-12 12:30:09.281 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.feature_extraction.text:TfidfVectorizer
2019-02-12 12:30:09.292 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 165: Initializing model sklearn.linear_model:LogisticRegression from scratch
2019-02-12 12:30:09.294 INFO in 'deeppavlov.models.sklearn.sklearn_component'['sklearn_component'] at line 108: Fitting model sklearn.linear_model:LogisticRegression
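
The `top_n` comment above refers to keeping the best few answers per question. The tutorial does not show how `SklearnComponent` handles this parameter, so the following is only a generic sketch of top-n selection: a multinomial logistic regression turns per-class scores into probabilities with softmax, and the n most probable answers are kept. The function names, candidate answers, and scores are all hypothetical.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_n_answers(scores, answers, n=2):
    """Return the n (answer, probability) pairs with the highest probability."""
    probs = softmax(scores)
    ranked = sorted(zip(answers, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:n]


# Hypothetical scores for three candidate answers.
candidate_answers = ["answer A", "answer B", "answer C"]
raw_scores = [0.2, 2.5, 1.1]
for answer, prob in top_n_answers(raw_scores, candidate_answers, n=2):
    print(f"{answer}: {prob:.2f}")
```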


## Test FAQ

In [50]:
test_questions = ['Could you help me??', 'Could I work while studying?']
tokenized_test_questions = tokenizer(test_questions)
joined_test_q_tokens = tokenizer(tokenized_test_questions)
test_q_vectorized = vectorizer(joined_test_q_tokens)
answers = clf(test_q_vectorized)
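
The four inference steps above can be wrapped in one helper so that new questions are answered with a single call. This is a sketch, not part of DeepPavlov: `answer_questions` is a hypothetical name, and the `toy_*` callables are stand-ins that only mimic the call pattern of the tutorial's `tokenizer`, `vectorizer`, and `clf` so the example runs on its own.

```python
def answer_questions(questions, tokenizer, vectorizer, clf):
    """Run the tutorial's inference steps in order and return the answers."""
    tokens = tokenizer(questions)    # split each question into tokens
    joined = tokenizer(tokens)       # join the tokens back into strings
    features = vectorizer(joined)    # turn the strings into feature vectors
    return clf(features)             # predict one answer per question


# Stand-in callables so the sketch runs without DeepPavlov installed.
def toy_tokenizer(batch):
    if batch and isinstance(batch[0], str):
        return [text.lower().split() for text in batch]
    return [" ".join(tokens) for tokens in batch]


def toy_vectorizer(batch):
    # A single toy feature: the number of tokens in each question.
    return [len(text.split()) for text in batch]


def toy_clf(features):
    return ["short" if n < 5 else "long" for n in features]


print(answer_questions(["Could you help me??", "Could I work while studying?"],
                       toy_tokenizer, toy_vectorizer, toy_clf))
```

With the real components from this tutorial, the call would be `answer_questions(test_questions, tokenizer, vectorizer, clf)`.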

With `infer_method='predict'`, the FAQ model returns one answer for each test question.
<br>
Answers:

In [51]:
for i, answer in enumerate(answers):
    print('Answers {}:\n{}\n'.format(i, answer))

Answers 0:
If you have any problems you can address to Department of Foreign Students: +7 (495) 408-70-43 (Auditorium building, room 315).

Answers 1:
Russian education is one of the most qualitative and fundamental in the world. It allows the student to find well paid work and to start climbing up on a career ladder right after completing university course. Students of the Russian universities are obliged to attend all lectures as only the knowledge gained during classroom occupations allows students to become the effective and knowing professionals. Thus, there is an opportunity to work only after classes or during vacation on the weekend.



## More models

The model described above is defined in the config deeppavlov/configs/faq/tfidf_logreg_en_faq.json.

You can also combine different components to construct other pipelines for the FAQ task:

Vectorizers:
 - deeppavlov.core.models.vectorizers.TfIdfVectorizer
 - deeppavlov.core.models.vectorizers.SentenceAvgW2vVectorizer
 - deeppavlov.core.models.vectorizers.SentenceW2vVectorizerTfidfWeights

Classifiers:
 - deeppavlov.models.classifiers.logreg_classifier.LogregClassifier
 - deeppavlov.models.classifiers.cos_sim_classifier.CosineSimilarityClassifier
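
As an illustration of the idea behind a cosine-similarity classifier, the sketch below picks the FAQ question whose bag-of-words vector is closest to the user's query. It is a generic standard-library example and makes no claim about how DeepPavlov's `CosineSimilarityClassifier` is implemented. The function names and the lowercase sample questions are assumptions.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_question(query, faq_questions):
    """Return the FAQ question with the highest cosine similarity to the query."""
    query_vec = Counter(query.lower().split())
    scored = [(cosine_similarity(query_vec, Counter(q.lower().split())), q)
              for q in faq_questions]
    return max(scored)[1]


faq_questions = ["what is preparatory course", "what is invitation letter",
                 "should i insure my life"]
print(most_similar_question("do i need to insure my life", faq_questions))
```

The answer stored for the selected question is then returned to the user. Unlike logistic regression, this approach needs no training step beyond vectorizing the FAQ questions.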

