<a href="https://colab.research.google.com/github/velamalaappu/icc-automatic-question-answer-system-analysis./blob/main/Automatic_Question_Answer_System.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Project Introduction**

The Indian education landscape has been undergoing rapid changes for the past 10 years owing to
the advancement of web-based learning services, specifically, eLearning platforms.

Most online education institutes today treat all students in the same class/batch/cohort (referred to as
“group” hereinafter) the same. Institutional uniformity works quite well for machines but humans are
made up of different and often contradictory components!

We believe that requiring all students to learn at the same pace, and assuming the same level of
interest from each of them, is quite detrimental to the quality of education. This model tests the
capabilities of students by forcing them to work together in groups at the same pace, regardless of
each individual's ability and interest in the work.

This assumption of uniformity could not be farther from the truth, and it is the reason we see such
wide variation in performance within a group.

In a system based on Competency Based Learning (referred to as “CBL” going forward), the group is
treated as a set of diverse individuals who may share some common traits, and the system also
recognizes their different educational backgrounds, life and work experiences, and learning styles.

Competency-based learning is an approach to education that focuses on the student’s demonstration
of desired learning outcomes as central to the learning process. A key characteristic of
competency-based learning is its focus on mastery. In other learning models, students are exposed to
content–whether skills or concepts–over time, and success is measured on a summative basis. In a
competency-based learning system, students are not allowed to continue until they have
demonstrated mastery of the identified competencies (i.e., the desired learning outcomes to be
demonstrated).

In the traditional model of education, it is quite common to see students get stuck, fall behind, give
up and, in extreme cases, drop out of the program. A standard experience for a student in this model is
“failing” and then having to repeat the class/subject/year until they achieve a passing grade.
This causes the student to lose interest and to suffer social embarrassment. The result is a student
who tries to “pass” through rote memorization, doing the bare minimum needed for a passing grade.

Assessments based on rubrics are a cornerstone of the CBL system as well, but there they serve to
reinforce the concepts and to show the student and the instructor(s) how well each concept and its
application is understood, rather than to deliver only a pass/fail judgment.

In this light, to serve a large pool of students and enable seamless learning, we envision an
automated doubt resolution system. With advancing AI technology, many doubts can be
resolved using deep learning techniques. Such an AI system can understand the context of a
question and provide relevant answers to it.

# **Problem Statement**

We will solve the above-mentioned challenge by applying deep learning algorithms to textual data.
The solution to this problem can be obtained through Extractive Question Answering wherein we can
extract an answer from a text given the question.

# Topic Modelling
This is a theme extraction task on a collection of Data Science specific documents, which can be done
via Latent Dirichlet Allocation (LDA). The topic model should identify the important themes of a
document and list the top-N constituent words of each theme.

# Extractive Question Answering
Extractive Question Answering is the task of extracting an answer from a text given a question. The
text would essentially be the group of documents that have the highest concentration of the topic
closest to the asked question.
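
To make the idea of extracting an answer from a text concrete, here is a deliberately simple, non-neural stand-in: it returns the sentence of a context that shares the most words with the question. A real extractive QA model would predict an answer span with a trained network instead. The function names and the toy context are assumptions for illustration only.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def extract_answer_sentence(question, context):
    """Return the context sentence with the largest word overlap with the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", context) if s.strip()]
    q_tokens = tokenize(question)
    return max(sentences, key=lambda s: len(q_tokens & tokenize(s)))

# Hypothetical context assembled from facts that appear in the data set preview.
context = (
    "Lincoln began his political career in 1832. "
    "His formal education lasted 18 months. "
    "Grace Bedell suggested that Lincoln grow a beard."
)
best = extract_answer_sentence("When did Lincoln begin his political career?", context)
print(best)
```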


In [None]:
import pandas as pd
import numpy as np
import string
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.corpus import wordnet
from nltk import pos_tag
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# **Data set**

There are three question files, one for each year of students: S08, S09, and S10, as well as 690,000 words worth of cleaned text from Wikipedia that was used to generate the questions.

The "questionanswerpairs.txt" files contain both the questions and answers. The columns in this file are as follows:

ArticleTitle is the name of the Wikipedia article from which questions and answers initially came.

Question is the question.

Answer is the answer.

Difficulty From Questioner is the prescribed difficulty rating for the question as given to the question-writer.

Difficulty From Answerer is a difficulty rating assigned by the individual who evaluated and answered the question, which may differ from the difficulty in field 4.

ArticleFile is the name of the file with the relevant article.

Questions that were judged to be poor were discarded from this data set.

There are frequently multiple lines with the same question, which appear if those questions were answered by multiple individuals.
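
Because the same question can appear with several answers, it can help to check whether those answers actually agree. The sketch below uses a small hand-made sample that mirrors rows from the S08 file. The `normalize_answer` helper and its rules (trim, drop a trailing period, lowercase) are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample mirroring duplicated rows from the question file.
sample = pd.DataFrame({
    "Question": [
        "Did his mother die of pneumonia?",
        "Did his mother die of pneumonia?",
        "When did Lincoln begin his political career?",
        "When did Lincoln begin his political career?",
    ],
    "Answer": ["no", "No.", "1832", "1832."],
})

def normalize_answer(answer):
    """Trim whitespace, drop a trailing period and lowercase the answer."""
    return answer.strip().rstrip(".").lower()

# Count the distinct normalized answers given for each question.
distinct_answers = (
    sample.assign(Answer=sample["Answer"].map(normalize_answer))
          .groupby("Question")["Answer"]
          .nunique()
)
print(distinct_answers)
```

A count of 1 means all answerers agree once formatting differences are removed; a larger count flags questions whose answers conflict.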

In [None]:
df1 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/automated question answer system/S08_question_answer_pairs.txt', sep='\t')
df2 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/automated question answer system/S09_question_answer_pairs.txt', sep='\t')
df3 = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/automated question answer system/S10_question_answer_pairs.txt', sep='\t', encoding = 'ISO-8859-1')

# **Preprocessing data**

In [None]:
df1.head(20)

Unnamed: 0,ArticleTitle,Question,Answer,DifficultyFromQuestioner,DifficultyFromAnswerer,ArticleFile
0,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,yes,easy,easy,S08_set3_a4
1,Abraham_Lincoln,Was Abraham Lincoln the sixteenth President of...,Yes.,easy,easy,S08_set3_a4
2,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,yes,easy,medium,S08_set3_a4
3,Abraham_Lincoln,Did Lincoln sign the National Banking Act of 1...,Yes.,easy,easy,S08_set3_a4
4,Abraham_Lincoln,Did his mother die of pneumonia?,no,easy,medium,S08_set3_a4
5,Abraham_Lincoln,Did his mother die of pneumonia?,No.,easy,easy,S08_set3_a4
6,Abraham_Lincoln,How many long was Lincoln's formal education?,18 months,medium,easy,S08_set3_a4
7,Abraham_Lincoln,How many long was Lincoln's formal education?,18 months.,medium,medium,S08_set3_a4
8,Abraham_Lincoln,When did Lincoln begin his political career?,1832,medium,easy,S08_set3_a4
9,Abraham_Lincoln,When did Lincoln begin his political career?,1832.,medium,medium,S08_set3_a4


In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
all_data = pd.concat([df1, df2, df3])
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3998 entries, 0 to 1457
Data columns (total 6 columns):
 #   Column                    Non-Null Count  Dtype 
---  ------                    --------------  ----- 
 0   ArticleTitle              3998 non-null   object
 1   Question                  3961 non-null   object
 2   Answer                    3422 non-null   object
 3   DifficultyFromQuestioner  3043 non-null   object
 4   DifficultyFromAnswerer    3418 non-null   object
 5   ArticleFile               3996 non-null   object
dtypes: object(6)
memory usage: 218.6+ KB


In [None]:
all_data['Question'] = all_data['ArticleTitle'].str.replace('_', ' ') + ' ' + all_data['Question']
all_data = all_data[['Question', 'Answer']]
all_data.shape

(3998, 2)

In [None]:
all_data.head(10)

Unnamed: 0,Question,Answer
0,Abraham Lincoln Was Abraham Lincoln the sixtee...,yes
1,Abraham Lincoln Was Abraham Lincoln the sixtee...,Yes.
2,Abraham Lincoln Did Lincoln sign the National ...,yes
3,Abraham Lincoln Did Lincoln sign the National ...,Yes.
4,Abraham Lincoln Did his mother die of pneumonia?,no
5,Abraham Lincoln Did his mother die of pneumonia?,No.
6,Abraham Lincoln How many long was Lincoln's fo...,18 months
7,Abraham Lincoln How many long was Lincoln's fo...,18 months.
8,Abraham Lincoln When did Lincoln begin his pol...,1832
9,Abraham Lincoln When did Lincoln begin his pol...,1832.


In [None]:
all_data = all_data.drop_duplicates(subset='Question')
all_data.head(10)

Unnamed: 0,Question,Answer
0,Abraham Lincoln Was Abraham Lincoln the sixtee...,yes
2,Abraham Lincoln Did Lincoln sign the National ...,yes
4,Abraham Lincoln Did his mother die of pneumonia?,no
6,Abraham Lincoln How many long was Lincoln's fo...,18 months
8,Abraham Lincoln When did Lincoln begin his pol...,1832
10,Abraham Lincoln What did The Legal Tender Act ...,"the United States Note, the first paper curren..."
12,Abraham Lincoln Who suggested Lincoln grow a b...,11-year-old Grace Bedell
14,Abraham Lincoln When did the Gettysburg addres...,1776
16,Abraham Lincoln Did Lincoln beat John C. Breck...,yes
18,Abraham Lincoln Was Abraham Lincoln the first ...,No


In [None]:
all_data.shape

(2461, 2)

In [None]:
all_data = all_data.dropna()
all_data.shape

(2188, 2)

In [None]:
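import nltk
nltk.download("stopwords")
# my_tokenizer below also relies on NLTK's tokenizer, POS-tagger and WordNet
# data; download those resources with nltk.download(...) if they are missing.
stopwords_list = stopwords.words('english')

lemmatizer = WordNetLemmatizer()

def my_tokenizer(doc):
    """Tokenize a document, drop stopwords and punctuation, and lemmatize."""
    words = word_tokenize(doc)

    # Tag each token so the lemmatizer can be given the right part of speech
    pos_tags = pos_tag(words)

    non_stopwords = [w for w in pos_tags if w[0].lower() not in stopwords_list]

    non_punctuation = [w for w in non_stopwords if w[0] not in string.punctuation]

    lemmas = []
    for w in non_punctuation:
        # Map the Penn Treebank tag prefix to a WordNet POS constant
        if w[1].startswith('J'):
            pos = wordnet.ADJ
        elif w[1].startswith('V'):
            pos = wordnet.VERB
        elif w[1].startswith('N'):
            pos = wordnet.NOUN
        elif w[1].startswith('R'):
            pos = wordnet.ADV
        else:
            pos = wordnet.NOUN  # default to noun for all other tags

        lemmas.append(lemmatizer.lemmatize(w[0], pos))

    return lemmas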

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_vectorizer = TfidfVectorizer(max_df=0.8, max_features=20000,
                                   min_df=1, use_idf=True, ngram_range=(1,1))
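# note: max_df=0.8 drops terms that appear in more than 80% of the questions;
# min_df=1 keeps every term that appears in at least one question.
# my_tokenizer is not passed here, so the vectorizer uses its default tokenization.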

tfidf_matrix = tfidf_vectorizer.fit_transform(tuple(all_data['Question']))

print(tfidf_matrix.shape)

(2188, 4081)


In [None]:
def ask_question(question):
    query_vect = tfidf_vectorizer.transform([question])
    similarity = cosine_similarity(query_vect, tfidf_matrix)
    max_similarity = np.argmax(similarity, axis=None)
    
    print('Your question:', question)
    print('Closest question found:', all_data.iloc[max_similarity]['Question'])
    print('Similarity: {:.2%}'.format(similarity[0, max_similarity]))
    print('Answer:', all_data.iloc[max_similarity]['Answer'])

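After running the code, we can ask a question. The function finds the stored question that is closest to ours by cosine similarity and prints its answer. If our question is very similar to the closest question found, the result counts as a match (**YES**) and the answer is likely to be relevant. Otherwise the result is **NO** and the printed answer should not be trusted.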

In [None]:
ask_question('When Abraham Lincoln started his political career')

Your question: When Abraham Lincoln started his political career
Closest question found: Abraham Lincoln When did Lincoln begin his political career?
Similarity: 87.84%
Answer: 1832


In [None]:
ask_question(' Was Abraham Lincoln the sixteenth President of the United States?')

Your question:  Was Abraham Lincoln the sixteenth President of the United States?
Closest question found: Abraham Lincoln Was Abraham Lincoln the sixteenth President of the United States?
Similarity: 94.43%
Answer: yes


In [None]:
ask_question('Can whales fly')

Your question: Can whales fly
Closest question found: Otter Do sea otters have a layer of fat like whales?
Similarity: 29.17%
Answer: No


In [None]:
ask_question('Who was the third president of the United States')

Your question: Who was the third president of the United States
Closest question found: Calvin Coolidge Was Coolidge the thirteenth President of the United States?
Similarity: 49.61%
Answer: No
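
The last two queries still print an answer even though the similarity is low (29.17% and 49.61%). One possible guard, sketched below, is to return no answer when the best similarity falls under a cutoff. The three-question corpus, the function name `answer_or_none`, and the `threshold=0.5` value are assumptions for illustration; a suitable cutoff would have to be tuned on the real data.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical mini-corpus taken from questions shown in this notebook.
questions = [
    "Abraham Lincoln When did Lincoln begin his political career?",
    "Abraham Lincoln Did his mother die of pneumonia?",
    "Otter Do sea otters have a layer of fat like whales?",
]
answers = ["1832", "no", "No"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(questions)

def answer_or_none(question, threshold=0.5):
    """Return (answer, similarity); answer is None when the best match is too weak."""
    scores = cosine_similarity(vectorizer.transform([question]), matrix)[0]
    best = int(np.argmax(scores))
    score = float(scores[best])
    if score < threshold:  # assumed cutoff, not a tuned value
        return None, score
    return answers[best], score

print(answer_or_none("When did Lincoln begin his political career"))
print(answer_or_none("Can whales fly"))
```

Returning `None` lets the caller show a "no confident answer" message instead of an unrelated yes/no.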


# **Scope of project**

A. The topic model should be able to identify/extract important topics.

B. The topic model would be built on the corpus of Data Science documents.

C. The topic model should yield the most relevant and stable topics measured through the
perplexity score.

D. Once the relevant documents have been retrieved, the extractive question answering
model would generate the answer for the question.

E. The entire dual-model pipeline would be deployed on AWS/GCP/Azure.

F. The dual-model pipeline must be accessible via a web application (Streamlit) for demo
purposes.
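
For scope item C, one way to compare candidate topic counts by perplexity is sketched below using scikit-learn's `LatentDirichletAllocation.perplexity`. The toy corpus, the helper name `perplexity_by_topic_count`, and the candidate values are assumptions for illustration. In practice the score would be computed on held-out documents rather than on the training data, as done here for brevity.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for the Data Science documents.
docs = [
    "regression models predict a numeric target from features",
    "linear regression fits a line to the features and target",
    "neural networks learn layers of weights by gradient descent",
    "deep neural networks stack many layers trained with gradient descent",
    "clustering groups similar points without labels",
    "k-means clustering assigns points to the nearest centroid",
]
counts = CountVectorizer(stop_words="english").fit_transform(docs)

def perplexity_by_topic_count(counts, candidates=(2, 3, 4), seed=0):
    """Fit one LDA model per candidate topic count and record its perplexity."""
    scores = {}
    for k in candidates:
        lda = LatentDirichletAllocation(n_components=k, random_state=seed)
        lda.fit(counts)
        # Lower perplexity means the model explains the counts better.
        scores[k] = float(lda.perplexity(counts))
    return scores

scores = perplexity_by_topic_count(counts)
for k, value in scores.items():
    print(f"{k} topics: perplexity {value:.1f}")
```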

## **CONCLUSION**

In this project I concluded that automated question answering aims to deliver concise information that answers user questions. We approach the challenge by applying deep learning algorithms to textual data. This approach can be applied to real business cases as part of a question-answering or user-support system. If you are building a question-answering system on an NLP engine such as Rasa NLU, Dialogflow, or LUIS, the engine can answer predefined questions. However, if there is no predefined intent, you can call this automatic Q&A system to search the documents and return the answer. Continuous learning can be organized by collecting feedback from users. For now, the quality of this model cannot outperform people, but it comes reasonably close (70–80%, depending on the dataset). The quality of such models keeps improving, so in the near future they may answer questions with the same accuracy as humans, but much faster.