## Sentence Classification

### Why not deep learning:
We have only a small amount of labeled data, and LSTMs tend to perform poorly on so little data, so we use a linear classifier instead.

### Steps:
1. Preprocessing
    * Create one CSV file that stores every row from every file in the **labeled_articles** folder (__preprocessedCSV.csv__ is generated by executing cell [1] and contains all rows).
    * To remove the __####Abstract__ and __####Introduction__ header lines that come from the original text files, load the CSV file into a pandas DataFrame and drop those rows.
    * Strip whitespace from the **label** column: some rows contain 'OWNX' and others 'OWNX ', which would otherwise be treated as two different classes.
    * Clean the sentences with regular expressions.
    * Collect the stopwords from the stopwords.txt file shipped with the data and store them in the **stopword list** (used later as an input parameter to CountVectorizer).
    
2. TF-IDF features followed by SVM multiclass classification (one-vs-rest)
    * Store the sentences from the pandas DataFrame in __X__ and the labels in __y__, and convert both to NumPy arrays.
    * Split the data into training and test sets with an 80:20 ratio, fixing the random seed so that the split can be reproduced.
    * Create a model as a pipeline of CountVectorizer, TfidfTransformer, and LinearSVC.
    * Run a grid search to find the best parameters.
    * Train the model with the best parameters.
    * Compute the confusion matrix and accuracy score.
    * Save the model so that it can be reused directly on this dataset.
    * Test the model by entering any sentence from any file in the **unlabeled_articles** folder.
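
The label clean-up in step 1 matters because a trailing space silently creates an extra class. A minimal sketch of the effect, using made-up labels rather than the real corpus and only the standard library:

```python
from collections import Counter

# Hypothetical raw labels; the trailing space in 'OWNX ' mimics the issue described above
raw_labels = ["OWNX", "OWNX ", "MISC", "AIMX ", "AIMX"]

before = Counter(raw_labels)                            # 'OWNX' and 'OWNX ' are counted separately
after = Counter(label.strip() for label in raw_labels)  # stripping merges them

print(len(before), "distinct labels before stripping")
print(len(after), "distinct labels after stripping")
```

The notebook applies the same idea to the whole column with `df['label'].str.strip()`.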


In [1]:
import csv
import glob
import pandas as pd

# create one CSV file that will store all sentences with their labels
csv_file = "E:/STUDY/Placements/dishQ/SentenceCorpus/labeled_articles/preprocessedCSV.csv"
path = "E:/STUDY/Placements/dishQ/SentenceCorpus/labeled_articles/*.txt"    # pattern matching all txt files
files = glob.glob(path)

# copy every tab-separated row of every txt file into the single CSV file;
# mode "w" avoids duplicated rows when the cell is re-run, newline="" avoids blank rows on Windows
with open(csv_file, "w", newline="") as out_csv:
    out_writer = csv.writer(out_csv)
    for f in files:
        with open(f, "r") as in_text:
            for row in csv.reader(in_text, delimiter='\t'):
                out_writer.writerow(row)
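
As an alternative to dropping the section-header rows later with pandas, they could be skipped while reading. This is only a sketch on an in-memory sample: the header text and sentences below are invented, and it assumes header lines contain no tab character.

```python
import csv
import io

# Hypothetical file content: a section header followed by label<TAB>sentence rows
sample = "### abstract ###\nMISC\tThis is background.\nOWNX \tWe propose a method.\n"

rows = []
for row in csv.reader(io.StringIO(sample), delimiter="\t"):
    if len(row) != 2:        # a header line has no tab, so it parses as a single field
        continue
    label, sentence = row
    rows.append((label.strip(), sentence))

print(rows)
```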

In [2]:
# give column titles to the dataframe
title = ['label', 'sentence']
df = pd.read_csv(csv_file, sep=',', header=None, names=title)   # dataframe
df = df.rename(columns=lambda x: x.strip())   # remove whitespace from column names (rename returns a new dataframe)
df['label'] = df['label'].str.strip()     # 'OWNX' and 'OWNX ' would otherwise be two different classes
df = df.dropna()                          # drop rows whose sentence is NaN (the ##intro and ##abstract header lines from the txt files)
with pd.option_context('display.max_rows', None, 'display.max_columns', 5):
    print(df)

#print(df)
#df.to_csv('E:/STUDY/Placements/dishQ/SentenceCorpus/labeled_articles/newFile.csv',sep=',')
#df.columns

     label                                           sentence
1     MISC  The Minimum Description Length principle for o...
2     MISC  If the underlying model class is discrete, the...
3     MISC  For MDL, in general one can only have loss bou...
4     AIMX  We show that this is even the case if the mode...
5     OWNX  We derive a new upper bound on the prediction ...
6     OWNX  This implies a small bound (comparable to the ...
7     OWNX  We discuss the application to Machine Learning...
9     MISC  ``Bayes mixture", ``Solomonoff induction", ``m...
10    CONT  In many cases however, the Bayes mixture is co...
11    MISC  The MDL or MAP (maximum a posteriori) estimato...
12    MISC  In practice, the MDL estimator is usually bein...
13    MISC  How good are the predictions by Bayes mixtures...
14    MISC         This question has attracted much attention
15    MISC  In many cases, an important quality measure is...
16    MISC  In particular the square loss is often considered
17    MI

In [3]:
import re                         # regex for preprocessing text
import numpy as np
from collections import Counter
print(Counter(df["label"]))       # the number of sentences for each label

# preprocessing the data
def clean_data(text):
    text = re.sub(r"\n", "", text)    
    text = re.sub(r"\r", "", text) 
    text = re.sub(r"\'", "", text)    
    text = re.sub(r"\"", "", text)
    return text.strip().lower()

Counter({'MISC': 852, 'OWNX': 427, 'CONT': 104, 'AIMX': 94, 'BASE': 33})
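
To see what `clean_data` does to a sentence, here is a small self-contained sketch. It restates the function with the four substitutions folded into one character class (an equivalent rewrite, not the notebook's exact code) and applies it to an invented input string.

```python
import re

def clean_data(text):
    # drop newlines, carriage returns, single quotes and double quotes in one pass
    text = re.sub(r"[\n\r'\"]", "", text)
    return text.strip().lower()

# Hypothetical input with LaTeX-style quotes, an apostrophe and a Windows line ending
sample = "  ``Bayes mixture\", it's called\r\n"
cleaned = clean_data(sample)
print(cleaned)
```

Note that backticks, commas, and other punctuation are left in place; only quotes and line breaks are removed before lowercasing.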


In [4]:
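# load stopwords from stopwords.txt (one word per line)
# raw string so the backslashes in the Windows path are not treated as escape sequences
with open(r"E:\STUDY\Placements\dishQ\SentenceCorpus\word_lists/stopwords.txt", "r") as file_stopwords:
    stopwords = [w.strip() for w in file_stopwords if w.strip()]   # also keeps a last line that has no trailing newline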

In [5]:
df.iloc[:8]

Unnamed: 0,label,sentence
1,MISC,The Minimum Description Length principle for o...
2,MISC,"If the underlying model class is discrete, the..."
3,MISC,"For MDL, in general one can only have loss bou..."
4,AIMX,We show that this is even the case if the mode...
5,OWNX,We derive a new upper bound on the prediction ...
6,OWNX,This implies a small bound (comparable to the ...
7,OWNX,We discuss the application to Machine Learning...
9,MISC,"``Bayes mixture"", ``Solomonoff induction"", ``m..."


In [6]:
# train test split
from sklearn.model_selection import train_test_split
# X - text data (Sentences)
# y - labels
X = np.array([clean_data(s) for s in df["sentence"]])   # select the column by name rather than by position
y = np.array(df["label"])

# set any specific random seed to get reproducible results
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
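
Because the classes are heavily imbalanced (BASE has only 33 sentences), a plain random split can leave very few examples of a rare class in the test set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. A minimal sketch on toy data with hypothetical names `X_toy` and `y_toy`:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced data (hypothetical): 40 'MISC' and 10 'BASE' sentences
X_toy = [f"sentence {i}" for i in range(50)]
y_toy = ["MISC"] * 40 + ["BASE"] * 10

# stratify=y_toy keeps the 4:1 class ratio in both the training and the test part
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

print(Counter(y_tr))
print(Counter(y_te))
```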

In [7]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline

model = Pipeline([('vectorizer', CountVectorizer()),('tfidf', TfidfTransformer()),
 ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

In [8]:
# parameter selection
from sklearn.model_selection import GridSearchCV   # sklearn.grid_search was removed in scikit-learn 0.20
parameters = {'vectorizer__ngram_range': [(1, 1), (1, 2), (2, 2)],
              'vectorizer__stop_words': (None, 'english', stopwords),
              'tfidf__use_idf': (True, False)}
gs_clf_svm = GridSearchCV(model, parameters)
gs_clf_svm = gs_clf_svm.fit(X, y)   # note: the search uses the full dataset, including rows later held out for testing
print(gs_clf_svm.best_params_)


{'vectorizer__ngram_range': (1, 2), 'vectorizer__stop_words': ['of', 'a', 'and', 'the', 'in', 'to', 'for', 'that', 'is', 'on', 'are', 'with', 'as', 'by', 'be', 'an', 'which', 'it', 'from', 'or', 'can', 'have', 'these', 'has', 'such'], 'tfidf__use_idf': False}
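
The selected `'tfidf__use_idf': False` means the TF-IDF step applies no inverse-document-frequency weighting. With the transformer's other defaults left unchanged, it reduces to L2-normalizing each row of term counts. A small sketch on an invented 2 x 3 count matrix that checks this against a manual normalization:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfTransformer

# Toy term-count matrix (hypothetical): 2 documents x 3 terms
counts = np.array([[3, 0, 4],
                   [1, 1, 0]])

# use_idf=False: no IDF weighting, only the default L2 row normalization remains
tf_only = TfidfTransformer(use_idf=False).fit_transform(counts).toarray()

# Manual L2 normalization of each row for comparison
manual = counts / np.linalg.norm(counts, axis=1, keepdims=True)

print(tf_only)
print(np.allclose(tf_only, manual))
```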



In [9]:
#the final pipeline using the best parameters
model = Pipeline([('vectorizer', CountVectorizer(stop_words=stopwords,ngram_range=(1,2))),
    ('tfidf', TfidfTransformer(use_idf=False)),
    ('clf', OneVsRestClassifier(LinearSVC(class_weight="balanced")))])

In [10]:
#fit model with training data
model.fit(X_train, y_train)


Pipeline(memory=None,
     steps=[('vectorizer', CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 2), preprocessor=None,
        stop_words=['of'...lti_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0),
          n_jobs=1))])

In [11]:
# vocab of the model
vocab = model.named_steps['vectorizer'].vocabulary_

#test data evaluation
pred = model.predict(X_test)

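# compare predicted labels with the actual test labels
# note: predictions are passed first, so rows are predicted classes and columns are true classes
from sklearn.metrics import confusion_matrix, accuracy_score
confusion_matrix(pred, y_test)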


array([[  9,   0,   0,   0,   5],
       [  1,   1,   0,   2,   0],
       [  0,   2,   7,   1,   2],
       [  2,   0,   9, 147,  14],
       [  7,   3,   3,  15,  72]], dtype=int64)

In [12]:
model.classes_

array(['AIMX', 'BASE', 'CONT', 'MISC', 'OWNX'], dtype='<U4')

In [13]:
accuracy_score(y_test, pred)

0.7814569536423841
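
The accuracy can be recovered from the confusion matrix printed earlier, and the same matrix shows how unevenly the classes are handled. The sketch below copies the matrix and class order from the outputs above and uses plain Python. Since the notebook calls `confusion_matrix(pred, y_test)`, rows are read as predicted labels and columns as true labels.

```python
# Confusion matrix as printed above: rows = predicted labels, columns = true labels
classes = ["AIMX", "BASE", "CONT", "MISC", "OWNX"]
cm = [
    [9, 0, 0, 0, 5],
    [1, 1, 0, 2, 0],
    [0, 2, 7, 1, 2],
    [2, 0, 9, 147, 14],
    [7, 3, 3, 15, 72],
]

total = sum(sum(row) for row in cm)                    # number of test sentences
correct = sum(cm[i][i] for i in range(len(classes)))   # diagonal = correct predictions
accuracy = correct / total

# Recall per class = correct predictions / number of true examples (column sum)
recall = {
    c: cm[i][i] / sum(row[i] for row in cm)
    for i, c in enumerate(classes)
}

print(accuracy)
for c, r in recall.items():
    print(c, round(r, 3))
```

Overall accuracy is dominated by MISC and OWNX, so per-class recall is a useful complement when judging the rarer classes.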

In [14]:
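# save the model
import joblib   # sklearn.externals.joblib was removed in scikit-learn 0.23
# raw string so the backslashes in the Windows path are not treated as escape sequences
joblib.dump(model, r'E:\STUDY\Placements\dishQ\SentenceCorpus/SentClassificationModel.pkl')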

['E:\\STUDY\\Placements\\dishQ\\SentenceCorpus/SentClassificationModel.pkl']

In [15]:
import joblib   # standalone package; sklearn.externals.joblib was removed in scikit-learn 0.23
model = joblib.load('SentClassificationModel.pkl')

In [None]:
# read a sentence taken from one of the unlabeled_articles files
sentence = input()

In [19]:
model.predict([sentence])[0]



'OWNX'