Name: Kathiriya Ranjit (R00183586)
1. Load Dataset
2. Cleaning
3. Train test split
4. Apply 3 Classification Models

    4.1 Naive Bayes Classifier
    
    4.2 K-nearest Neighbor
    
    4.3 Support Vector Machine (SVM)

## 1. Load Dataset

    1. Load Dataset:

        a. Load dataset:
            i. subtaskA_data_all.csv contains 10,000 entries with the columns id, sent0, and sent1.
            ii. subtaskA_answers_all.csv contains 10,000 entries with the columns id and label0, the label that marks which sentence is true.
        b. Process data:
            i. Create a new column, label1, as the opposite of label0, so that both sentences of a pair get a label.
            ii. Join the labels to the sentence table.
            iii. To improve accuracy, split the table into two column pairs, (sent0, label0) and (sent1, label1), rename both to (sent, label), and stack them vertically so that each sentence becomes its own example.
            iv. This gives 20,000 entries in total.
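
The reshaping described above can be sketched in plain Python. This is a minimal illustration, not the notebook's pandas code: the second row, the `answers` values, and the helper name `to_long_format` are hypothetical. It looks labels up by id rather than relying on both files sharing the same row order.

```python
# Hypothetical miniature of the two CSV files; the real files hold 10,000 rows each.
pairs = [
    {"id": 0, "sent0": "He poured orange juice on his cereal.",
     "sent1": "He poured milk on his cereal."},
    {"id": 1, "sent0": "The cat barked at the postman.",
     "sent1": "The dog barked at the postman."},
]
answers = {0: 0, 1: 0}  # id -> label0 (illustrative values)


def to_long_format(pairs, answers):
    """Return one (sent, label) record per sentence: 2 * len(pairs) records."""
    first, second = [], []
    for row in pairs:
        label0 = answers[row["id"]]  # join on id, not on row position
        label1 = 1 - label0          # label1 is the opposite of label0
        first.append({"sent": row["sent0"], "label": label0})
        second.append({"sent": row["sent1"], "label": label1})
    # Stack vertically: all sent0 records first, then all sent1 records.
    return first + second


records = to_long_format(pairs, answers)
```

With 10,000 input pairs the same function would return 20,000 records, matching the row count reported by `df.info()`.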


In [3]:
import numpy as np
import pandas as pd

In [4]:
df_all = pd.read_csv('./TrainingData/subtaskA_data_all.csv')

In [5]:
df_all.head(1)

Unnamed: 0,id,sent0,sent1
0,0,He poured orange juice on his cereal.,He poured milk on his cereal.


In [6]:
df_lables = pd.read_csv('./TrainingData/subtaskA_answers_all.csv',header='infer',names=['id_lbl','sen0_lbl'])

In [7]:
df_lables.head(1)

Unnamed: 0,id_lbl,sen0_lbl
0,0,0


In [8]:
df_join = pd.concat([df_all,df_lables],axis=1)

In [9]:
df_join.head(1)

Unnamed: 0,id,sent0,sent1,id_lbl,sen0_lbl
0,0,He poured orange juice on his cereal.,He poured milk on his cereal.,0,0


In [10]:
df_join.drop(['id_lbl'],axis=1,inplace=True)

In [11]:
df_join.head(1)

Unnamed: 0,id,sent0,sent1,sen0_lbl
0,0,He poured orange juice on his cereal.,He poured milk on his cereal.,0


In [12]:
df_join['sen1_lbl'] =  df_join['sen0_lbl'].apply(lambda x: int(1) if x==0 else int(0))

In [13]:
df_join.head(1)

Unnamed: 0,id,sent0,sent1,sen0_lbl,sen1_lbl
0,0,He poured orange juice on his cereal.,He poured milk on his cereal.,0,1


In [14]:
df1 = df_join[['sent0','sen0_lbl']]

In [15]:
df1 = df1.rename(columns = {"sent0": "sent", 
                                  "sen0_lbl":"label"})

In [16]:
df1.head(1)

Unnamed: 0,sent,label
0,He poured orange juice on his cereal.,0


In [17]:
df2 = df_join[['sent1','sen1_lbl']]

df2 = df2.rename(columns = {"sent1": "sent", 
                                  "sen1_lbl":"label"})

In [18]:
df2.head(1)

Unnamed: 0,sent,label
0,He poured milk on his cereal.,1


In [19]:
df = pd.concat([df1, df2], ignore_index=True,axis=0,sort=True)

In [20]:
df.info(1)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
label    20000 non-null int64
sent     20000 non-null object
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


## 2. Cleaning

    2. Cleaning

        a. Lower case: convert all 20,000 entries to lowercase.
        b. Removing extra characters: punctuation and symbols such as ? . ' ; are replaced with a space.
        c. Removing stop words: stop words such as "a", "an", and "the" are removed from the corpus.
        d. Lemmatization: convert each word to its base (dictionary) form, e.g. "mice" becomes "mouse".
        e. Eliminating repetitions: a character repeated three or more times in a row is reduced to two occurrences, e.g. "goood" becomes "good".
        f. Spell checking: correct a word if it is misspelled, e.g. "fihs" --> "fish".
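
The steps that need no external library (lower-casing, symbol removal, stop-word removal, and collapsing repeated characters) can be sketched as a single function. This is only an illustration: `STOP_WORDS` is a tiny hand-written stand-in for NLTK's English list, `clean` is a hypothetical name, and lemmatization and spell checking are left out because they depend on TextBlob and pyspellchecker.

```python
import re

# Assumption: a small hand-written stop list; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "he", "she", "his", "her", "on", "in", "and"}

NON_WORD = re.compile(r"[^\w\s]")   # anything that is not a word character or whitespace
REPEATS = re.compile(r"(.)\1{2,}")  # the same character three or more times in a row


def clean(sentence):
    """Lower-case, strip symbols, drop stop words, collapse repeated characters."""
    text = sentence.lower()
    text = NON_WORD.sub(" ", text)
    tokens = [tok for tok in text.split() if tok not in STOP_WORDS]
    tokens = [REPEATS.sub(r"\1\1", tok) for tok in tokens]
    return " ".join(tokens)


cleaned = clean("He poured orange juice on his cereal.")
```

Keeping the steps in one function also means the training data and any later evaluation data can be cleaned by the same code path.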


In [21]:
# lower case
df['sent'] = df['sent'].apply(lambda x: " ".join(x.lower() for x in x.split()))

In [22]:
df.head(1)

Unnamed: 0,label,sent
0,0,he poured orange juice on his cereal.


In [23]:
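# Replace every character that is not a word character or whitespace with a space.
# regex=True is passed explicitly because newer pandas versions default to a literal match.
df['sent'] = df['sent'].str.replace(r'[^\w\s]', ' ', regex=True)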

In [24]:
df.head(1)

Unnamed: 0,label,sent
0,0,he poured orange juice on his cereal


In [25]:
import nltk
from nltk.corpus import stopwords
from textblob import Word
import re

In [26]:
# removing stopwords like the, a, an
nltk.download('stopwords')
stop = stopwords.words('english')
df['sent'] = df['sent'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df.head(1)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/ranjitsmac/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Unnamed: 0,label,sent
0,0,poured orange juice cereal


In [27]:
# Lemmatization: reduce each word to its base form, e.g. mice becomes mouse.
df['sent'] = df['sent'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
df.head(1)

Unnamed: 0,label,sent
0,0,poured orange juice cereal


In [28]:
# Collapse any character repeated three or more times in a row to two, e.g. goood -> good.
def de_repeat(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1",text)


df['sent'] = df['sent'].apply(lambda x: " ".join(de_repeat(x) for x in x.split())) 

In [29]:
df.head(1)

Unnamed: 0,label,sent
0,0,poured orange juice cereal


In [2]:
from spellchecker import SpellChecker

In [3]:
# spell check for eg. fihs --> fish
spell = SpellChecker()

In [None]:
# Correct each word separately; keep the original word when no correction is found.
df['sent'] = df['sent'].apply(lambda x: " ".join(spell.correction(w) or w for w in x.split()))

## 3. Train test split

    3. Train test split

        a. train-test-split: the train_test_split function divides the dataset into two parts, a training set with 70% of the data and a test set with the remaining 30% (test_size=0.3). The three classification models are trained on the first part and evaluated on the second.
        b. For further evaluation, the models are also evaluated on the separate dataset stored in the trial-data folder (./TrialData).
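
What a shuffled 70/30 split does can be sketched in plain Python. The helper `shuffle_split` is hypothetical and is not scikit-learn's implementation, so it will not reproduce the exact rows that `train_test_split(..., random_state=42)` selects. It only shows the idea of shuffling once with a fixed seed and cutting the result in two.

```python
import random


def shuffle_split(items, test_size=0.3, seed=42):
    """Shuffle a copy of items and cut it into (train, test) parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed makes the split repeatable
    n_test = round(len(shuffled) * test_size)
    return shuffled[n_test:], shuffled[:n_test]


# Row indices of the 20,000-entry table stand in for the sentences.
train, test = shuffle_split(range(20000))
```

For 20,000 rows this yields 14,000 training rows and 6,000 test rows, the same test-set size as the `support` totals in the classification reports.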


In [36]:
from sklearn.model_selection import train_test_split

In [37]:
X = df['sent']
y = df['label']

In [38]:
print(X.shape)
print(y.shape)

(20000,)
(20000,)


In [39]:
# Hold out 30% of the rows for testing; random_state fixes the shuffle so the split is repeatable.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## 4. Model Selection

    4. Feature Extraction

        • Machine learning algorithms cannot take raw text as input; numeric features must first be extracted from the text before it is passed to an algorithm.

        a. Count Vectorization: converts the raw text into a DTM (document-term matrix) in which each cell holds the number of times a word occurs in a document.
        b. TFIDF-Transformation: reweights the counts produced by count vectorization with each term's inverse document frequency. Instead of applying 4.a and 4.b separately, TFIDF-Vectorization performs both steps in one function.
        c. TFIDF-Vectorization: instead of filling the DTM with raw word counts, it fills it with TF-IDF weights, the term frequency multiplied by the inverse document frequency of the word. Inverse document frequency measures how common or rare a word is in the given corpus: the rarer the word, the higher the value.
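
To make the weighting concrete, here is a minimal hand-rolled TF-IDF sketch in plain Python. It assumes the smoothed formula idf = ln((1 + n) / (1 + df)) + 1 followed by L2 normalisation, which is my understanding of the `TfidfVectorizer` defaults. The function name `tfidf` and the three example documents are illustrative, and n-grams and `max_features` are not modelled.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # Smoothed inverse document frequency: rarer terms get larger values.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}
    vectors = []
    for tokens in tokenised:
        counts = Counter(tokens)                              # count vectorization
        weights = {t: c * idf[t] for t, c in counts.items()}  # tf * idf
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["poured orange juice cereal", "poured milk cereal", "drinks milk"]
vectors = tfidf(docs)
```

In the first document, "orange" appears in only one document and so receives a larger weight than "poured", which appears in two.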

    5. Apply 3 Classification Models
        a. Naive Bayes Classifier
            i. Naive Bayes is a classification algorithm based on Bayes' theorem; it assumes the features are conditionally independent given the class.
            • Works with any number of classes
            • Simple to implement

        b. K-nearest Neighbour
            i. It is a classification algorithm that operates on a basic principle: a sample is assigned the majority class among its k nearest training samples.
            • Simple to implement
            • Works with any number of classes
            • Easy to add more data

        c. Support Vector Machine (SVM)
            i. It analyses data and recognizes patterns for classification or regression by finding the boundary that separates the classes with the largest margin.
            • SVM can handle non-linear data using the kernel trick.
            • SVM can be used to solve both classification and regression problems.

In [42]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier

from sklearn.metrics import classification_report,confusion_matrix,accuracy_score

In [43]:
# 4.1 Naive Bayes Classifier
text_nb = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,2))),
                   ('clf',MultinomialNB())])

text_nb.fit(X_train,y_train)

prediction = text_nb.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.49      0.50      0.50      2954
           1       0.51      0.50      0.50      3046

    accuracy                           0.50      6000
   macro avg       0.50      0.50      0.50      6000
weighted avg       0.50      0.50      0.50      6000

Confusion Matrix = 
 [[1479 1475]
 [1528 1518]]
accuracy =  0.4995
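
The figures in the report can be re-derived from the confusion matrix, where rows are true labels and columns are predicted labels. The short sketch below does that arithmetic for the Naive Bayes result above; the helper names `precision` and `recall` are illustrative.

```python
# Confusion matrix printed above: rows are true labels, columns are predicted labels.
cm = [[1479, 1475],
      [1528, 1518]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions / all predictions


def precision(cm, k):
    """Of everything predicted as class k, the fraction that truly is class k."""
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    """Of everything that truly is class k, the fraction predicted as class k."""
    return cm[k][k] / sum(cm[k])


report = {k: (round(precision(cm, k), 2), round(recall(cm, k), 2)) for k in (0, 1)}
```

The accuracy works out to 2997 / 6000 = 0.4995, and the rounded precision and recall values agree with the report.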


In [44]:
# 4.2 K-nearest Neighbor
text_knn = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,3))),
                   ('clf',KNeighborsClassifier(n_neighbors=101))])

text_knn.fit(X_train,y_train)

prediction = text_knn.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.50      0.23      0.31      2954
           1       0.51      0.78      0.62      3046

    accuracy                           0.51      6000
   macro avg       0.51      0.50      0.47      6000
weighted avg       0.51      0.51      0.47      6000

Confusion Matrix = 
 [[ 673 2281]
 [ 664 2382]]
accuracy =  0.5091666666666667


In [47]:
# 4.3 Support Vector Machine (SVM)
text_cf = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,3))),
                   ('clf',SVC(gamma='auto'))])

text_cf.fit(X_train,y_train)

prediction = text_cf.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.49      1.00      0.66      2954
           1       0.00      0.00      0.00      3046

    accuracy                           0.49      6000
   macro avg       0.25      0.50      0.33      6000
weighted avg       0.24      0.49      0.32      6000

Confusion Matrix = 
 [[2954    0]
 [3046    0]]
accuracy =  0.49233333333333335
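
One possible contributor to the single-class prediction above, offered as a hypothesis rather than something verified on this notebook, is the kernel width. In scikit-learn, `gamma='auto'` resolves to 1 / n_features, and TF-IDF rows are L2-normalised by default, so the squared distance between any two non-empty rows lies between 0 and 2. The sketch below assumes the vocabulary reaches the `max_features=1000` cap and shows how little the RBF kernel value can then vary.

```python
import math

n_features = 1000         # assumption: the vocabulary reaches max_features
gamma = 1.0 / n_features  # what gamma='auto' resolves to in scikit-learn


def rbf(squared_distance, gamma):
    """RBF kernel value exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * squared_distance)


# For unit-length, non-negative vectors the squared distance is 2 - 2 * cosine,
# so it ranges from 0 (identical rows) to 2 (no shared terms).
k_identical = rbf(0.0, gamma)
k_disjoint = rbf(2.0, gamma)
spread = k_identical - k_disjoint
```

Under these assumptions every pair of sentences looks almost equally similar to the model, so trying `gamma='scale'` or a linear kernel might be worth an experiment.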


# END PART 1

# EVALUATIONS

In [None]:
# Further evaluation is done using the trial dataset from GitHub.

In [48]:
df_test_all = pd.read_csv('./TrialData/taskA_trial_data.csv')

In [49]:
df_test_lables = pd.read_csv('./TrialData/taskA_trial_answer.csv',header='infer',names=['id_lbl','sen0_lbl'])
df_test_join = pd.concat([df_test_all,df_test_lables],axis=1)
df_test_join.drop(['id_lbl'],axis=1,inplace=True)
df_test_join['sen1_lbl'] =  df_test_join['sen0_lbl'].apply(lambda x: int(1) if x==0 else int(0))
df1_test = df_test_join[['sent0','sen0_lbl']]
df1_test = df1_test.rename(columns = {"sent0": "sent", 
                                  "sen0_lbl":"label"})

df2_test = df_test_join[['sent1','sen1_lbl']]

df2_test = df2_test.rename(columns = {"sent1": "sent", 
                                  "sen1_lbl":"label"})


df_test = pd.concat([df1_test, df2_test], ignore_index=True,axis=0,sort=True)

In [50]:
df_test['sent'] = df_test['sent'].apply(lambda x: " ".join(x.lower() for x in x.split()))
df_test['sent'] = df_test['sent'].str.replace(r'[^\w\s]', ' ', regex=True)
df_test['sent'] = df_test['sent'].apply(lambda x: " ".join(x for x in x.split() if x not in stop))
df_test['sent'] = df_test['sent'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

def de_repeat(text):
    pattern = re.compile(r"(.)\1{2,}")
    return pattern.sub(r"\1\1",text)


df_test['sent'] = df_test['sent'].apply(lambda x: " ".join(de_repeat(x) for x in x.split())) 

In [51]:
X_train = df['sent']
y_train = df['label']
X_test = df_test['sent']
y_test = df_test['label']

In [52]:
# 4.1 Naive Bayes Classifier
text_nb = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,2))),
                   ('clf',MultinomialNB())])

text_nb.fit(X_train,y_train)

prediction = text_nb.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.55      0.49      0.52      2021
           1       0.54      0.60      0.57      2021

    accuracy                           0.54      4042
   macro avg       0.54      0.54      0.54      4042
weighted avg       0.54      0.54      0.54      4042

Confusion Matrix = 
 [[ 981 1040]
 [ 803 1218]]
accuracy =  0.5440376051459673


In [53]:
# 4.2 K-nearest Neighbor
text_knn = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,3))),
                   ('clf',KNeighborsClassifier(n_neighbors=101))])

text_knn.fit(X_train,y_train)

prediction = text_knn.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.53      0.10      0.17      2021
           1       0.50      0.91      0.65      2021

    accuracy                           0.51      4042
   macro avg       0.52      0.51      0.41      4042
weighted avg       0.52      0.51      0.41      4042

Confusion Matrix = 
 [[ 211 1810]
 [ 188 1833]]
accuracy =  0.5056902523503216


In [54]:
# 4.3 Support Vector Machine (SVM)
text_cf = Pipeline([('tfidf',TfidfVectorizer(max_features=1000,analyzer='word',ngram_range=(1,3))),
                   ('clf',SVC(gamma='auto'))])

text_cf.fit(X_train,y_train)

prediction = text_cf.predict(X_test)

print(classification_report(y_test,prediction))

print("Confusion Matrix = \n",confusion_matrix(y_test,prediction))

print("accuracy = ",accuracy_score(y_test,prediction))

              precision    recall  f1-score   support

           0       0.65      0.02      0.03      2021
           1       0.50      0.99      0.67      2021

    accuracy                           0.50      4042
   macro avg       0.58      0.50      0.35      4042
weighted avg       0.58      0.50      0.35      4042

Confusion Matrix = 
 [[  32 1989]
 [  17 2004]]
accuracy =  0.5037110341415141


# THE END OF EVALUATION