# Classifying Question Types
Given a question, the aim is to identify the category it belongs to. The four categories to handle for this assignment are: Who, What, When, and Affirmation (yes/no).
Any sentence that does not fall into one of these four categories is labelled as the "Unknown" type.

### Importing the libraries

In [3]:
import numpy as np
import pandas as pd
import re
import nltk

### Importing the dataset

In [4]:
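from sklearn.preprocessing import LabelEncoder
# Each line holds a question and its label, separated by ',,,'.
# A multi-character separator needs the Python parsing engine.
dataset = pd.read_csv('C:/Users/avich/Desktop/NIKI/LabelledData.txt', delimiter = ',,,', quoting = 3, header=None, engine='python')
# Column 1 holds the labels: strip whitespace, then encode them as integers.
y = dataset.iloc[:,1].str.strip()
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)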

  


### Cleaning the texts
Each question is stripped of non-letter characters, lowercased, and stemmed with NLTK's PorterStemmer.

In [5]:
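from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()  # create the stemmer once and reuse it
corpus = []
for i in range(len(dataset)):  # 1483 questions in this dataset
    # Keep letters only, lowercase, and split into words
    question = re.sub('[^a-zA-Z]', ' ', dataset[0][i])
    question = question.lower()
    question = question.split()
    # Stem each word and rejoin into a single string
    question = [ps.stem(word) for word in question]
    question = ' '.join(question)
    corpus.append(question)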
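
The same cleaning steps can be sketched as a small helper for a single sentence. This is a minimal sketch, not part of the original notebook: the name `clean_question` and its optional `stem` parameter are assumptions introduced for illustration, and stemming is skipped unless a stemming function is passed in.

```python
import re

def clean_question(text, stem=None):
    """Keep letters only, lowercase, and optionally stem each token."""
    # Replace every non-letter character with a space.
    letters_only = re.sub(r'[^a-zA-Z]', ' ', text)
    tokens = letters_only.lower().split()
    if stem is not None:
        # Hypothetical hook: e.g. stem=PorterStemmer().stem from nltk.stem.porter
        tokens = [stem(token) for token in tokens]
    return ' '.join(tokens)

print(clean_question("What time does the train leave ?"))
# prints: what time does the train leave
```

Passing the stemmer in as a function keeps the helper usable without NLTK installed, while `stem=PorterStemmer().stem` would apply the stemming used in this notebook.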

### Bag of Words Model
Each cleaned question is converted to a vector of word counts using a bag-of-words model with CountVectorizer.

In [6]:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer()
X = cv.fit_transform(corpus).toarray()
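
To illustrate what a bag-of-words matrix contains, here is a simplified pure-Python sketch. It is not how `CountVectorizer` is implemented, and the function name `bag_of_words` and the three toy sentences are assumptions made for the example. One difference worth noting: with its default settings `CountVectorizer` ignores single-character tokens, which this sketch does not.

```python
from collections import Counter

def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for whitespace-tokenized documents."""
    # A sorted vocabulary gives every word a stable column index.
    vocabulary = sorted({word for doc in documents for word in doc.split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.split())
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix

vocab, counts = bag_of_words(["who is he", "what is it", "is it is"])
print(vocab)   # ['he', 'is', 'it', 'what', 'who']
print(counts)  # [[1, 1, 0, 0, 1], [0, 1, 1, 1, 0], [0, 2, 1, 0, 0]]
```

Each row is one question and each column one vocabulary word, which is the shape of the `X` matrix passed to the classifier.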

### Splitting the dataset into the Training set and Test set
The data is split into training and test sets in an 80:20 ratio.

In [7]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
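
As a rough check on what the 80:20 split means in numbers, the sketch below computes the expected set sizes. It assumes 1483 labelled questions, the count used in this notebook, and assumes that a fractional `test_size` is rounded up to a whole number of samples, with the remainder going to the training set.

```python
import math

n_samples = 1483   # assumed dataset size
test_size = 0.20

# Assumption: the test share is rounded up, the rest is used for training.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)  # 1186 297
```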



### Using Logistic Regression
Any other classifier could be used instead. Logistic regression is used here because it gave the highest accuracy among the classifiers tried for this problem. The model is fitted on the training set.

In [8]:
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(random_state = 0)
classifier.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=0, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

### Predicting the Test set results

In [9]:
y_pred = classifier.predict(X_test)

### Making the Confusion Matrix & calculating Accuracy

In [10]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
accuracy = np.trace(cm)/np.sum(cm)
print("Accuracy : ", accuracy*100)

Accuracy :  96.632996633
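
The accuracy is the sum of the confusion matrix diagonal (correct predictions) divided by the sum of all entries. A minimal sketch on a hypothetical 3-class matrix, using plain lists instead of NumPy; the function name and the numbers are made up for illustration:

```python
def accuracy_from_confusion(cm):
    """Fraction of correct predictions: diagonal sum over total sum."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total

# Hypothetical confusion matrix (rows: true labels, columns: predictions).
toy_cm = [
    [9, 1, 0],
    [0, 9, 1],
    [1, 0, 9],
]
print(accuracy_from_confusion(toy_cm))  # 0.9
```

The reported 96.63% is consistent with 287 correct predictions out of 297, assuming a test set of 297 questions (20% of 1483, rounded up).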


### For Testing New Sentences

In [11]:
##############################################################
check_question = "What time does the train leave ?"
check_review = re.sub('[^a-zA-Z]', ' ', check_question)
check_review = check_review.lower()
check_review = check_review.split()
ps = PorterStemmer()
check_review = [ps.stem(word) for word in check_review]
check_review = ' '.join(check_review)
local_corpus = [check_review]

Check_X = cv.transform(local_corpus)
Check_X = Check_X.toarray()

Check_y = classifier.predict(Check_X)
print("Check Question Type : ",labelencoder_y.inverse_transform(Check_y))
##############################################################


Check Question Type :  ['what']
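
The steps in the cell above could be wrapped in a reusable function. This is only a sketch: `predict_question_type` and its parameter names are assumptions, and the function expects a fitted vectorizer, classifier, and label encoder (such as `cv`, `classifier`, and `labelencoder_y` from this notebook) to be passed in.

```python
import re

def predict_question_type(question, vectorizer, classifier, label_encoder, stem=None):
    """Clean one question and return its predicted label."""
    # Same cleaning as for the training corpus: letters only, lowercase, split.
    tokens = re.sub(r'[^a-zA-Z]', ' ', question).lower().split()
    if stem is not None:
        tokens = [stem(token) for token in tokens]
    # Vectorize with the already fitted vocabulary, then predict and decode.
    features = vectorizer.transform([' '.join(tokens)])
    encoded = classifier.predict(features)
    return label_encoder.inverse_transform(encoded)[0]

# Hypothetical usage with the objects fitted earlier in this notebook:
# predict_question_type("What time does the train leave ?",
#                       cv, classifier, labelencoder_y, stem=PorterStemmer().stem)
```

Passing `stem=PorterStemmer().stem` matters here, since the new question must be cleaned in the same way as the training corpus for the vocabulary to match.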
